
CICC v13.0 — Reverse Engineering Reference

CICC is NVIDIA's CUDA C-to-PTX compiler — the binary that transforms CUDA C++ source code (or LLVM bitcode) into PTX assembly for GPU execution. At 60 MB, it is one of the largest single compiler binaries in production use. This wiki documents its internal architecture, recovered from static analysis of the stripped x86-64 ELF binary using IDA Pro 8.x and Hex-Rays decompilation.

Binary: cicc v13.0, 60,108,328 bytes, x86-64, stripped
Build: cuda_13.0.r13.0/compiler.36424714_0
Decompilation: 80,562 functions, 80,281 recovered (99.65%), IDA Pro 8.x + Hex-Rays
Strings: 188,141 extracted
LLVM base: LLVM 20.0.0 (internal), bitcode producer ID "LLVM7.0.1" (NVVM compat)
LLVM pass classes: ~402 standard + 35 NVIDIA custom
CLI options: ~1,689 registered via cl::opt + 222 NVVMPassOptions slots
NVVM builtins: 770 (IDs 1–770, wyhash open-addressing table)
Default target: sm_75 (Turing)
Supported SMs: sm_75 through sm_121f (Turing through Blackwell, sm_120)
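The builtin-name lookup can be pictured as a fixed-capacity open-addressing table. The sketch below is illustrative only: it uses an FNV-1a stand-in rather than cicc's actual wyhash, and the names and IDs are invented.

```python
# Illustrative open-addressing string table in the style attributed to
# the 770-entry builtin table. NOTE: the hash is an FNV-1a stand-in,
# not the actual wyhash; names and IDs below are invented examples.

def str_hash(name: str) -> int:
    h = 0xCBF29CE484222325                       # FNV-1a 64-bit offset basis
    for ch in name.encode():
        h = ((h ^ ch) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h

class BuiltinTable:
    def __init__(self, capacity: int = 2048):
        self.slots = [None] * capacity           # (name, builtin_id) or None
        self.capacity = capacity

    def insert(self, name: str, builtin_id: int) -> None:
        i = str_hash(name) % self.capacity
        while self.slots[i] is not None:         # linear probing on collision
            i = (i + 1) % self.capacity
        self.slots[i] = (name, builtin_id)

    def lookup(self, name: str):
        i = str_hash(name) % self.capacity
        while self.slots[i] is not None:
            if self.slots[i][0] == name:
                return self.slots[i][1]
            i = (i + 1) % self.capacity
        return None                              # empty slot reached: not a builtin

table = BuiltinTable()
table.insert("__nv_sinf", 101)                   # IDs invented for illustration
table.insert("__nv_fast_log2f", 102)
```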

Three Subsystems

CICC is not a monolithic compiler. It is composed of three largely independent subsystems, each with its own lineage, coding conventions, and internal data structures:

1. EDG 6.6 C++ Frontend (3.2 MB, 0x5D0000–0x8F0000) — A licensed commercial frontend from Edison Design Group that parses CUDA C++ source code and emits transformed C code. It operates as a source-to-source translator: CUDA kernel launch syntax (<<<>>>) is lowered to CUDA runtime API calls, memory space qualifiers (__shared__, __constant__) are resolved to address space annotations, and C++ templates/constexpr are fully evaluated. The output is not LLVM IR — it is C code that feeds into a second compilation phase. See EDG 6.6 Frontend.

2. NVVM Bridge (~4 MB, 0x8F0000–0x12CFFFF) — The glue layer between EDG and LLVM. It handles CLI parsing, architecture detection (23 SM variants with 3-column flag fan-out), the dual-path compilation dispatch (Path A via LibNVVM API, Path B standalone), the NVVMPassOptions knob system (221 per-pass configuration slots), and the 770-entry builtin resolution table. This layer is entirely NVIDIA-proprietary. See Entry Point & CLI and LLVM Optimizer.

3. LLVM 20.0.0 Backend (~45 MB, 0x12D0000–0x3BFFFFF) — A heavily modified LLVM fork that performs IR optimization and PTX code generation. NVIDIA has added 35 custom passes (MemorySpaceOpt, Rematerialization, BranchDist, LoopIndexSplit, Sinking2, etc.), a proprietary two-phase compilation model with per-function thread parallelism, and extensive modifications to the NVPTX backend for tensor core code generation across 5 GPU architecture generations. See Code Generation and PTX Emission.

Additionally, jemalloc 5.3.x (~400 functions at 0x12FC000) is statically linked, replacing the system allocator for improved memory allocation performance during compilation.
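To make the frontend's source-to-source role concrete, here is a toy textual rewrite of the launch syntax. This is not EDG's actual output (EDG rewrites its IL, not text, and the real lowering goes through the device-stub machinery documented later); the cudaConfigureCall/__device_stub_ output shape is an illustrative assumption.

```python
import re

# Toy rewrite of kernel-launch syntax into plain C calls, illustrating
# the *kind* of transformation the frontend performs. The output shape
# (cudaConfigureCall + __device_stub_) is an assumption for illustration,
# not EDG's verbatim output.
def lower_launch(line: str) -> str:
    m = re.match(r"(\w+)<<<(.+?),\s*(.+?)>>>\((.*)\);", line.strip())
    if not m:
        return line                              # not a launch: pass through
    kernel, grid, block, args = m.groups()
    return (f"cudaConfigureCall({grid}, {block}); "
            f"__device_stub_{kernel}({args});")
```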

Dual-Path Architecture

A distinctive feature of cicc is its dual-path design — two complete copies of the compilation backend exist within the same binary, selected at runtime:

 | Path A (0x90xxxx) | Path B (0x126xxxx)
Purpose | LibNVVM API mode | Standalone mode
Simple compile | sub_902D10 | sub_1262860
Multi-stage | sub_905EE0 (43KB) | sub_1265970 (48KB)
CLI parsing | sub_900130 | sub_125FB30
Builtin table | sub_90AEE0 (109KB) | sub_126A910 (123KB)
Libdevice | unk_3EA0080 (455KB) | unk_420FD80 (455KB)
Version string | -nvvm-version=nvvm-latest | -nvvm-version=nvvm70

Runtime selection is controlled by v253 in sub_8F9C90 (the real main function). The default value (2) triggers an environment variable lookup through an obfuscated string comparison to determine which path to take. This design allows a single binary to serve both the nvcc driver toolchain and the LibNVVM runtime compilation API.
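A minimal sketch of that dispatch, under stated assumptions: the selector values that force one path or the other are guesses, and the environment variable name is a placeholder for the obfuscated lookup.

```python
import os

# Sketch of the dual-path dispatch. Per the text, the selector defaults
# to 2, which triggers an environment lookup; which concrete values force
# Path A vs. Path B is a guess here, and CICC_PATH_HINT is a placeholder
# name for the obfuscated variable.
PATH_A = "libnvvm"      # Path A: LibNVVM API mode
PATH_B = "standalone"   # Path B: standalone mode

def select_path(selector: int = 2) -> str:
    if selector == 0:
        return PATH_A
    if selector == 1:
        return PATH_B
    # default (2): consult the environment
    return PATH_A if os.environ.get("CICC_PATH_HINT") == "A" else PATH_B
```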

Compilation Pipeline

Both paths converge on the same 5-stage pipeline:

CUDA C++ Source (.cu / .ci / .i)
  │
  ├─ EDG 6.6 Frontend (sub_5D2A80)
  │   ├─ lgenfe_main (sub_617BD0): 282-case CLI, 737 #defines
  │   ├─ Parser: recursive-descent + declaration specifier state machine
  │   ├─ Constexpr evaluator: 317KB tree-walking interpreter
  │   └─ Backend: "Generating NVVM IR" → .int.c / .device.c / .stub.c
  │
  └─ NVVM/LLVM Pipeline
      │
      ├─ IRGEN:  EDG IL → LLVM IR translation (cicc's equivalent of Clang CodeGen)
      │            Type translation (fixed-point iteration, address space mapping)
      │            Expression/statement/function codegen (recursive AST walk)
      │            CUDA semantic lowering (threadIdx→intrinsics, printf→vprintf, etc.)
      │            Kernel metadata emission (nvvm.annotations)
      │            Two copies: Path A (0x90xxxx) and Path B (0x126xxxx)
      │
      ├─ LNK:     Module linking + libdevice (455KB embedded bitcode)
      │            Triple validation (must be nvptx64-)
      │            IR version check (nvvmir.version metadata)
      │
      ├─ OPT:     Two-phase compilation (Phase I: whole-module, Phase II: per-function)
      │            ~150 pass insertions via sub_12E54A0
      │            Three language paths: "ptx" / "mid" / default
      │            35 NVIDIA custom passes interleaved with standard LLVM
      │            Optional: concurrent per-function compilation (thread pool + jobserver)
      │
      ├─ OPTIXIR:  OptiX IR generation (optional, --emit-optix-ir)
      │
      └─ LLC:     NVPTX backend code generation
                   SelectionDAG lowering (2.3 MB NVPTXTargetLowering)
                   19 MMA shapes × 11 data types for tensor core codegen
                   9 PTX register classes
                   StructurizeCFG (mandatory for PTX structured control flow)
                   → .ptx output
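The staging above can be summarized as an ordered driver with one optional stage. The sketch below models stage ordering only; every stage body is a stub.

```python
# Stage-ordering sketch of the pipeline above. OPTIXIR is the only
# optional stage (--emit-optix-ir); stage bodies are stubs standing in
# for the real frontend/optimizer/codegen work.

def run_pipeline(source: str, emit_optix_ir: bool = False) -> list:
    stages = ["EDG", "IRGEN", "LNK", "OPT"]
    if emit_optix_ir:
        stages.append("OPTIXIR")
    stages.append("LLC")
    artifact, trace = source, []
    for stage in stages:
        artifact = f"{stage}({artifact})"        # stand-in for the stage's output
        trace.append(stage)
    return trace
```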

Subsystem Address Map

Subsystem | Address Range | Size | Key Entry Point
jemalloc stats | 0x40D000–0x41FFFF | ~80KB | sub_40D5CA (vsnprintf)
Global constructors | 0x430000–0x5CFFFF | ~1.6 MB | cl::opt registration (~1,689 options)
EDG 6.6 Frontend | 0x5D0000–0x8EFFFF | 3.2 MB | sub_5D2A80 (orchestrator)
CLI / Real Main | 0x8F0000–0x96FFFF | 520 KB | sub_8F9C90 (real main)
Bitcode reader | 0x9F0000–0xAFFFFF | ~1 MB | sub_9F2A40 (parseFunctionBody)
LLVM verifier | 0xBF0000–0xC6FFFF | 500 KB | sub_BFC6A0 (visitCallInst)
LLVM passes | 0xC00000–0x12CFFFF | ~7 MB | InstCombine, GVN, DSE, LICM, etc.
PassManager / NVVM bridge | 0x12D0000–0x16FFFFF | 4.2 MB | sub_12E54A0 (pipeline assembly)
Backend / machine passes | 0x1700000–0x1EFFFFF | 8 MB | MRPA, Block Remat, Mem2Reg
SelectionDAG | 0x1F00000–0x20FFFFF | 2 MB | sub_20019C0 (LegalizeTypes, 348KB)
NVPTX emission | 0x2100000–0x21FFFFF | 1 MB | sub_215A3C0 (function headers)
New PM / pass registration | 0x2340000–0x23FFFFF | 768 KB | sub_2342890 (2,816-line registrar)
Loop passes | 0x2A00000–0x2DFFFFF | 4 MB | LoopVectorize, SLP, Unroll, etc.
NVPTX ISel + lowering | 0x3000000–0x36FFFFF | 7 MB | sub_33B0210 (intrinsic switch, 343KB)
Embedded libdevice | 0x3EA0080 / 0x420FD80 | 456 KB × 2 | LLVM bitcode (~400 math functions)

Reading This Wiki

The wiki is organized around the compilation pipeline. Every page is written at reimplementation-grade depth for an audience of senior C++ developers with LLVM backend experience.

Section Index

  • Pipeline Overview — End-to-end compilation flow diagram with links to every stage.
  • Entry Point & CLI — CLI parsing, dual-path dispatch, architecture detection.
  • EDG 6.6 Frontend — CUDA C++ to transformed C source-to-source translation.
  • NVVM IR Generation — EDG IL tree to LLVM Module: types, expressions, statements, functions.
  • LLVM Optimizer — Two-phase compilation, pipeline assembly, NVVMPassOptions.
  • Code Generation — SelectionDAG, ISel, register allocation, scheduling.
  • PTX Emission — AsmPrinter, directive emission, PTX body output.
  • NVIDIA Custom Passes — 35 proprietary passes not in upstream LLVM.
  • LLVM Pass Pipeline & Ordering — Complete pass registration, execution order per O-level, tier system.
  • NVVM Builtins — 770-entry builtin table: hash structure, ID inventory, category breakdown.
  • GPU Targets — SM feature gates, architecture detection, sm_75 through sm_121f.
  • Data Structures — IR node layout, pattern database, DAG node, symbol table, NVVM container.
  • Infrastructure — Alias analysis, MemorySSA, AsmPrinter, debug verification, NVPTX target.
  • LTO & Module Optimization — Cross-TU inlining, devirtualization, GlobalOpt, ThinLTO import.
  • Configuration — Three knob systems: ~1,689 cl::opt flags, 222 NVVMPassOptions slots, ~70 codegen knobs.
  • Reference — Address spaces, register classes, NVPTX opcodes, GPU execution model.
  • Function Map — Address-to-identity lookup for ~350 key functions with confidence levels.
  • Binary Layout — Subsystem address map at pass granularity.
  • Methodology — How this analysis was performed and how to assess confidence.

Reading Path 1: End-to-End Pipeline Understanding

Goal: understand how CUDA source becomes PTX, what each stage does, and how control flows between subsystems.

Read in this order:

  1. Pipeline Overview — The complete flow diagram. Establishes the 10 stages and their address ranges. Read this first to build the mental model that all other pages assume.
  2. Entry Point & CLI — How cicc is invoked, the 1,689-flag CLI, dual-path dispatch (Path A LibNVVM vs. Path B standalone), and the sub_8F9C90 real-main function.
  3. nvcc-to-cicc Interface — The flag translation layer between nvcc and cicc. The 40+ flag mappings and 3-column architecture fan-out. Necessary context for understanding why certain flags exist.
  4. EDG 6.6 Frontend — The commercial C++ frontend. How CUDA syntax is lowered to C, the 737 configuration #defines, and the .int.c / .device.c / .stub.c output split.
  5. NVVM IR Generation — The EDG-to-LLVM bridge. Then follow the four sub-pages: Type Translation → Expressions → Statements → Functions.
  6. Libdevice Linking — The embedded 455KB bitcode library with 352 __nv_* math functions. Triple validation, version checking.
  7. LLVM Optimizer — The two-phase compilation model, the 49.8KB pipeline assembler (sub_12E54A0), pass ordering, and the NVVMPassOptions knob system. This is the longest and densest stage.
  8. Pipeline & Pass Ordering — The exact pass execution order at each O-level, the tier system, and the 526 registered passes.
  9. Code Generation — SelectionDAG lowering, instruction selection, register allocation, instruction scheduling. Hub page with links to deep dives.
  10. PTX Emission — AsmPrinter, directive headers, PTX body output, metadata emission.

Optional extensions after the core path can be picked from the Section Index above in any order.

Reading Path 2: Reimplementing a Specific Pass

Goal: reproduce the exact behavior of one NVIDIA custom pass or understand an LLVM pass modification deeply enough to write a compatible replacement.

For an NVIDIA custom pass (e.g., MemorySpaceOpt, Rematerialization, BranchDist):

  1. NVIDIA Custom Passes — Overview — Locate the pass in the inventory table. Note its category (module/function/loop/machine), its pipeline position, and its controlling knobs.
  2. The pass's dedicated page (e.g., MemorySpaceOpt, Rematerialization, Branch Distribution). Every dedicated page contains the function address, decompiled algorithm, data flow description, controlling knobs, and diagnostic strings.
  3. NVVMPassOptions — The 222-slot struct that controls per-pass enable/disable toggles and parametric thresholds. Find which slots your target pass reads.
  4. Pipeline & Pass Ordering — Determine exactly where the pass runs in the pipeline. Identify what analyses it depends on (must run before it) and what passes consume its results (run after it).
  5. Optimization Levels — Determine at which O-levels the pass is enabled, disabled, or parameterized differently.
  6. Function Map — Cross-reference the pass's internal function addresses with the master function map for confidence levels.

For a modified LLVM pass (e.g., InstCombine, GVN, DSE, LICM, LoopVectorize):

  1. The pass's dedicated page (e.g., InstCombine, GVN, DSE, LICM). These pages document NVIDIA's modifications relative to upstream LLVM 20.0.0.
  2. Alias Analysis & NVVM AA — The custom alias analysis chain. Nearly every optimization pass depends on AA, and NVIDIA's GPU-aware AA behaves differently from upstream (address-space-aware NoAlias for disjoint spaces, __restrict__ propagation).
  3. MemorySSA — The memory dependence representation used by DSE, LICM, and other memory-sensitive passes.

For a machine-level pass (e.g., Block Remat, MRPA, Machine Mem2Reg):

  1. Machine-Level Passes — The complete machine pass pipeline with per-pass algorithm descriptions.
  2. Register Allocation — The greedy RA algorithm with NVIDIA's occupancy-driven spill heuristics.
  3. Register Classes — The 9 PTX register classes and their constraints.
  4. NVPTX Machine Opcodes — The MachineInstr opcode reference.

Supporting references for any pass reimplementation:

  • IR Node Layout — The internal IR data structures that passes operate on.
  • Address Spaces — GPU address space semantics that many passes must respect.
  • NVPTX Target Infrastructure — TargetMachine, TTI hooks, and target feature queries.
  • Diagnostics — The three diagnostic systems (EDG, LLVM remarks, profuse framework) for reproducing pass-level reporting.

Reading Path 3: Debugging Correctness

Goal: diagnose a miscompilation, a crash, or incorrect PTX output by tracing the problem to a specific pass or pipeline stage.

Start with instrumentation and observability:

  1. Diagnostics & Optimization Remarks — The three independent diagnostic layers: EDG frontend errors, LLVM optimization remarks (-opt-bisect-limit, -Rpass=, -Rpass-missed=), and NVIDIA's profuse framework (profuseinline, profusegvn). This page tells you how to make cicc talk about what it is doing.
  2. Debug Info Verification — The three verification modes (verify-each, debugify-each, and JSON delta reporting). Use verify-each to detect the first pass that corrupts debug metadata.
  3. CLI Flags — Locate the flags for dumping IR at specific pipeline points: --print-after-all, --print-before-all, --filter-print-funcs=, --opt-bisect-limit=. Also the --passes= interface for running individual passes in isolation.
  4. Optimization Levels — Compare the pass pipeline at different O-levels. If a bug appears at -O2 but not -O1, the diff between their pipelines identifies the suspect passes.
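The -opt-bisect-limit flag lends itself to a standard bisection driver. A sketch, where compiles_correctly(n) is a hypothetical callback that reruns the compiler with -opt-bisect-limit=n and validates the output:

```python
# Bisection over -opt-bisect-limit to find the first pass application
# that breaks the output. `compiles_correctly(n)` is a hypothetical
# callback: run the compiler with -opt-bisect-limit=n, check the PTX.

def find_first_bad_pass(compiles_correctly, max_passes: int) -> int:
    lo, hi = 0, max_passes                   # invariant: lo is good, hi is bad
    assert compiles_correctly(lo) and not compiles_correctly(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if compiles_correctly(mid):
            lo = mid
        else:
            hi = mid
    return hi                                # limit at which the bug first appears
```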

Then isolate the pipeline stage:

  1. Pipeline Overview — Determine which stage produces the incorrect output. The pipeline is linear: EDG → IR Generation → Libdevice Linking → Optimizer → Codegen → Emission. The stage boundary where output first goes wrong narrows the search.
  2. NVVM IR Verifier — The 230KB three-layer verifier (module + function + intrinsic). It validates triples, address spaces, atomic restrictions, pointer cast rules, and architecture-gated intrinsic availability. A verification failure after a specific pass is a strong signal.
  3. Bitcode I/O — If the problem is in bitcode reading/writing (corrupted input, version mismatch), this page documents the reader at sub_9F2A40 and the writer.

Then investigate the suspect pass:

  1. NVIDIA Custom Passes or the relevant LLVM pass page — Read the algorithm description for the suspect pass. Look for documented edge cases, known limitations, and diagnostic strings that would appear in verbose output.
  2. NVVMPassOptions — Check whether the suspect pass has enable/disable knobs or threshold parameters that could be adjusted to confirm or rule it out.
  3. Environment Variables — Some passes are gated by environment variables (including obfuscated ones). Check whether any are influencing behavior.

For correctness issues specific to GPU semantics:

  • Address Spaces — Incorrect address space resolution is a common source of silent miscompilation. Global vs. shared vs. local aliasing rules differ from CPU memory models.
  • MemorySpaceOpt — This pass resolves generic pointers to specific address spaces. If it infers the wrong space, downstream code will access the wrong memory.
  • Alias Analysis — If the alias analysis returns NoAlias for pointers that do alias, DSE/LICM/GVN will misoptimize. The __restrict__ propagation is a known source of aggressive alias assumptions.
  • StructurizeCFG — PTX requires structured control flow. If structurization produces incorrect flow blocks, the kernel will execute the wrong path.
  • Dead Barrier Elimination and Dead Synchronization Elimination — Incorrect elimination of barriers or synchronization can cause race conditions that only manifest under specific warp configurations.

Reading Path 4: Tuning Performance

Goal: understand what cicc does at each optimization level, which passes are the performance-critical ones, and what knobs control their aggressiveness.

Start with the tuning infrastructure:

  1. Optimization Levels — The four standard levels (O0–O3) and three fast-compile tiers (Ofcmin/Ofcmid/Ofcmax). This page shows the exact pass pipeline diff between levels, including which passes are added, removed, or reparameterized at each step.
  2. NVVMPassOptions — The 222-slot per-pass configuration system. This is the primary tuning mechanism. The page documents every slot's type (boolean/integer/string), its default value, and which pass reads it.
  3. CLI Flags — The flag-to-pipeline routing tables. Locate flags that control pass thresholds (--inline-threshold=, --unroll-count=, etc.) and pass enable/disable toggles.
  4. LLVM Knobs — The ~1,689 cl::opt flags with their defaults, types, and controlling constructors.
  5. Environment Variables — Runtime environment overrides, including the obfuscated variables.
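The between-level pipeline diff mentioned above can be mechanized. A sketch with invented pass lists (not the actual -O1/-O2 pipelines):

```python
# Diff two O-level pass pipelines to get suspect passes for a bug that
# appears only at the higher level. Handles repeated pass runs by
# counting occurrences; the pass lists below are invented examples.

def pipeline_diff(lower: list, higher: list) -> list:
    remaining = {}
    for p in lower:
        remaining[p] = remaining.get(p, 0) + 1
    suspects = []
    for p in higher:                         # preserve pipeline order
        if remaining.get(p, 0) > 0:
            remaining[p] -= 1                # present at both levels
        else:
            suspects.append(p)               # only runs at the higher level
    return suspects

o1_passes = ["SROA", "EarlyCSE", "SimplifyCFG"]
o2_passes = ["SROA", "EarlyCSE", "GVN", "SimplifyCFG", "LICM"]
```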

Then study the high-impact optimization passes:

  1. LLVM Optimizer — Understand the two-phase model. Phase I (whole-module) determines inlining decisions, inter-procedural memory space propagation, and global optimization. Phase II (per-function, potentially concurrent) does register-pressure-driven rematerialization and instruction scheduling. Tuning decisions in Phase I cascade into Phase II.
  2. Inliner Cost Model — Inlining is typically the single highest-impact optimization decision. This page documents the cost model thresholds, the caller/callee size heuristics, and NVIDIA's kernel-specific adjustments.
  3. LoopVectorize & VPlan — Loop vectorization for GPU SIMT. The VPlan infrastructure, cost model, and the NVIDIA TTI hooks that influence vectorization width decisions.
  4. Loop Unrolling — Unrolling thresholds, the NVIDIA-specific unroll heuristics, and the interaction with register pressure.
  5. Rematerialization — NVIDIA's IR-level rematerialization pass (67KB). Trades recomputation for register pressure reduction, which directly affects occupancy on GPU.
  6. Register Allocation — The greedy RA with occupancy-driven spill heuristics. Register count directly determines maximum occupancy.
  7. Instruction Scheduling — The scheduler subsystems and their interaction with hardware latency models.
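Why the rematerialization and register-allocation items matter can be seen from a simplified occupancy model. The per-SM limits below are typical published values, assumed for illustration rather than recovered from cicc:

```python
# Simplified register-limited occupancy model, illustrating why the
# spill/remat heuristics above target register count. Per-SM limits
# are assumed typical values, not recovered from cicc.

REGS_PER_SM = 65536          # 32-bit registers per SM (assumed)
MAX_WARPS_PER_SM = 48        # scheduler warp limit (assumed)
THREADS_PER_WARP = 32

def occupancy(regs_per_thread: int) -> float:
    regs_per_warp = regs_per_thread * THREADS_PER_WARP
    warps_by_regs = REGS_PER_SM // regs_per_warp
    return min(MAX_WARPS_PER_SM, warps_by_regs) / MAX_WARPS_PER_SM
```

Under these assumed limits, halving registers per thread from 128 to 64 doubles the register-limited warp count, which is exactly the trade Rematerialization makes when it recomputes values instead of keeping them live.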

For tensor core workloads specifically:

  • Tensor / MMA Codegen — 19 MMA shapes across 11 data types. The instruction selection patterns, register allocation constraints, and WGMMA code generation for Hopper and Blackwell.
  • Tensor / MMA Builtins — The builtin-to-intrinsic lowering for wmma, mma, and wgmma operations.
  • SM 90 — Hopper — Hopper-specific features: TMA, WGMMA, asynchronous barriers, cluster launch.
  • SM 100 — Blackwell — Blackwell-specific features: new MMA shapes, FP4/FP6 support, sparsity.

For understanding performance at the target level:

  • GPU Targets — The SM feature gate matrix. Which features are enabled at each architecture level, and how architecture detection routes to different codegen paths.
  • NVPTX Target Infrastructure — The TTI hooks that passes query for target-specific costs (memory latency, instruction throughput, register file size).
  • Concurrent Compilation — If compile time itself is the bottleneck, understand the Phase II thread pool and GNU Jobserver integration to maximize parallelism.

Function Map

Address-to-identity lookup table. Confidence: VERY HIGH = string evidence, HIGH = strong structural evidence, MEDIUM = inferred from context/callgraph.

Top Functions by Size

Function | Address | Size | Confidence
X86 AutoUpgrade (intrinsic rename, leftover from LLVM x86 target) | 0xA939D0 | 457KB | VERY HIGH
InstCombine::visitCallInst / visitIntrinsic | 0x10EE7A0 | 396KB | HIGH
SelectionDAG LegalizeTypes workhorse (ExpandOp/PromoteOp) | 0x20019C0 | 341KB | HIGH
New PassManager pipeline parser (function-level, 268 pass names) | 0x2368220 | 326KB | VERY HIGH
EDG constexpr expression evaluator core (124 operator opcodes, 9,075 lines) | 0x786210 | 317KB | VERY HIGH
SelectionDAG LegalizeOp main switch | 0x20ACAE0 | 295KB | HIGH
SelectionDAGBuilder::visit (IR → DAG) | 0x2081F00 | 261KB | HIGH
LLVM IR Verifier (visitCallInst), 298 verification messages | 0xBFC6A0 | 207KB | VERY HIGH
X86 Intrinsic Upgrade Helper (broadcastf32x4, compress, etc.) | 0xA8A170 | 195KB | HIGH
EDG IL tree walker #1 (297 self-recursive, 87 node types, 305 cases) | 0x7506E0 | 190KB | HIGH
EDG declaration specifier parser (393 LABEL_ gotos, NOT switch/case) | 0x7C0F00 | 184KB | HIGH
Bitcode Reader parseFunctionBody, 174 error strings | 0x9F2A40 | 182KB | VERY HIGH
EDG constexpr top-level dispatch (80 expression types + 62 intrinsics) | 0x77FCB0 | 150KB | HIGH
EDG IL tree copier/transformer (callback params a3/a4, template instantiation) | 0x766570 | 148KB | HIGH
SelectionDAG LegalizeTypes dispatch (967 case labels) | 0x1FFB890 | 137KB | HIGH
EDG declaration specifier state machine (80 token cases, 4,371 lines) | 0x672A20 | 132KB | VERY HIGH
je_malloc_conf_init (199 config strings) | 0x12FCDB0 | 129KB | VERY HIGH
computeKnownBits / SimplifyDemandedBits | 0x11A7600 | 125KB | VERY HIGH
EDG lgenfe_main (282-case CLI switch, 737 config macros, EDG 6.6) | 0x617BD0 | 123KB | VERY HIGH
NVVM Builtin Resolution table (post-opt, 770 entries) | 0x126A910 | 123KB | VERY HIGH
NVVMPassOptions init (4,786 lines, 221 slots in 4,512-byte struct) | 0x12D6300 | 125KB | VERY HIGH
PassOptionRegistry::lookupOption (hash table at registry+120) | 0x12D6170 | – | HIGH
PassOptionRegistry::getBoolOption (triple: '1'/true, 't'/true) | 0x12D6240 | – | HIGH
writeStringOption (24-byte entry to output struct) | 0x12D6090 | – | HIGH
writeBoolOption (16-byte entry to output struct) | 0x12D6100 | – | HIGH
4-stage pipeline orchestrator (LNK/OPT/OPTIXIR/LLC), nvopt+nvllc objects | 0x12C35D0 | 41KB | VERY HIGH
Bitcode linker: triple validation, IR version check, symbol size matching | 0x12C06E0 | 63KB | VERY HIGH
NVVM IR version checker (nvvmir.version metadata, NVVM_IR_VER_CHK env) | 0x12BFF60 | 9KB | VERY HIGH
NVVM container format parser (arch, FTZ, IEEE, opt level extraction) | 0x12642A0 | – | HIGH
Concurrent worker entry (dispatches Phase I/II) | 0x12E7B90 | 3KB | HIGH
Concurrent compilation entry (jobserver, thread pool, split-module) | 0x12E1EF0 | 51KB | VERY HIGH
Function sorting by priority (insertion sort / introsort) | 0x12E0CA0 | – | HIGH
Per-function compilation callback (completion handler) | 0x12E8D50 | – | HIGH
Phase II per-function optimizer (sets qword_4FBB3B0=2) | 0x12E86C0 | – | HIGH
Concurrency eligibility check (counts defined functions) | 0x12D4250 | – | HIGH
GNU Jobserver init (parse MAKEFLAGS, create pipe, spawn pthread) | 0x16832F0 | – | HIGH
Bitcode Metadata Reader (parseMetadata) | 0xA09F80 | 121KB | VERY HIGH
EDG IL function body processor (14 params, scope stack management) | 0x627530 | 114KB | HIGH
EDG IL tree walker #2 (427 self-recursive, parallel traversal) | 0x760BD0 | 109KB | HIGH
EDG IL codegen (node type dispatch on byte+80, 2,589 lines) | 0x8BA620 | 108KB | HIGH
NVVM Builtin Resolution table (pre-opt, 770 entries) | 0x90AEE0 | 107KB | VERY HIGH
NVVM Builtin lowering engine (pre-opt, wgmma/tex/surf, 3571 lines) | 0x955A70 | 103KB | HIGH
New PassManager pipeline parser (CGSCC-level) | 0x2377300 | 103KB | HIGH

Pipeline Functions

Function | Address | Size | Confidence
main() thunk → sub_8F9C90 | 0x4396A0 | tiny | KNOWN
Real main: CLI parsing, wizard check, dispatch | 0x8F9C90 | 10KB | VERY HIGH
Simple compile entry (Path A) | 0x902D10 | – | HIGH
Simple compile entry (Path B) | 0x1262860 | – | HIGH
LibNVVM pipeline driver (Path A): 14-phase flow, libdevice linking, API dispatch | 0x905EE0 | 43KB | VERY HIGH
LibNVVM compilation entry (Path B): 4-stage pipeline, embedded builtins | 0x1265970 | 48KB | VERY HIGH
CUDA C++ Front-End stage (lgenfe): timer "CUDA C++ Front-End" | 0x905880 | 6KB | HIGH
NVVM IR Container → Module opt setup | 0x9047E0 | 10KB | HIGH
Backend SM config + EDG binding, triple construction | 0x908850 | 10KB | HIGH
LNK stage verbose callback | 0x903BA0 | 5KB | HIGH
LLC stage verbose callback | 0x903730 | 5KB | HIGH
CLI processing (Path A): -arch, -maxreg, -split-compile, -gen-lto | 0x900130 | – | HIGH
CLI processing (Path B) | 0x125FB30 | – | HIGH
EDG master orchestrator (setjmp recovery, timer callbacks) | 0x5D2A80 | 2KB | VERY HIGH
Backend entry: "Generating NVVM IR", file output (.int.c/.device.c/.stub.c), TileIR dlopen | 0x5E3AD0 | 11KB | VERY HIGH
Multi-stage orchestrator: .lnk.bc → .opt.bc → .ptx | 0x9685E0 | – | HIGH
Architecture detection: -arch → triple fan-out | 0x95EB40 | 15KB | VERY HIGH
NVVM option parsing (all -opt-, -llc-, -gen-*, -Xopt) | 0x9624D0 | – | HIGH
Flag mapping table (O0-O3, nvcc flag translation) | 0x8FE280 | – | HIGH
LLVM cl::opt bulk registration (~1500 options) | 0xB6EEA0 | – | HIGH
Timer/context creation ("CUDA C++ Front-End", "LibNVVM") | 0xC996C0 | – | HIGH

EDG 6.6 Frontend

Core Orchestration

Function | Address | Size | Confidence
EDG master orchestrator (setjmp recovery, timer callbacks) | 0x5D2A80 | 2KB | VERY HIGH
EDG lgenfe_main (282-case CLI switch, 737 config macros, EDG 6.6) | 0x617BD0 | 123KB | VERY HIGH
CLI option registration table (~300 options via sub_6101D0) | 0x610260 | 22KB | HIGH
Option fetcher (called in main loop of sub_617BD0) | 0x6140E0 | 6KB | HIGH
Backend entry: "Generating NVVM IR", file output (.int.c/.device.c/.stub.c), TileIR dlopen | 0x5E3AD0 | 11KB | VERY HIGH
Translation unit init (416-byte TU object, keyword init, parser entry) | 0x8D0BC0 | – | VERY HIGH
Semantic analysis init (zeroes 6 globals) | 0x8D0F00 | tiny | HIGH
Keyword table init (~350 keywords via sub_885C00) | 0x706250 | 30KB | VERY HIGH
TU finalization ("Generating Needed Template Instantiations") | 0x709330 | 5KB | HIGH
Register single keyword: (token_id, "keyword_string") | 0x885C00 | tiny | HIGH

AST-to-Source Printer Cluster

Function | Address | Size | Confidence
Main expression/statement emitter (61 self-references, recursive) | 0x5DBFC0 | 41KB | HIGH
Function declaration printer (__sti__, #pragma section, nv_linkonce_odr) | 0x5E13C0 | 44KB | HIGH
Statement printer (if/else/for/while/switch/case/return) | 0x5DFD00 | 26KB | HIGH
Declaration printer (linkage/storage, __builtin_va_alist) | 0x5D9330 | 12KB | HIGH
Scope/block printer (bit-fields, array dimensions) | 0x5DA0F0 | 13KB | HIGH
Struct/union/enum printer (#pragma pack) | 0x5DAD30 | 9KB | HIGH
Variable initializer printer (memcpy, aggregate init) | 0x5D80F0 | 17KB | HIGH
Inline asm printer (volatile, constraints, format specifiers) | 0x5DF1B0 | 11KB | HIGH
Identifier printer (keyword mangling: auto→__xauto) | 0x5D5A80 | 7KB | HIGH
Top-level declaration dispatcher | 0x5DB980 | 7KB | HIGH
Function parameter list printer (__text__/__surf__ annotations) | 0x5D7860 | 6KB | HIGH

Parser & Declaration Processing

Function | Address | Size | Confidence
Declaration specifier state machine (while/switch, 80 token cases) | 0x672A20 | 132KB | VERY HIGH
Declaration specifier parser (393 LABEL_ gotos, NOT switch/case) | 0x7C0F00 | 184KB | HIGH
Top-level declaration/declarator parser | 0x662DE0 | 61KB | HIGH
Overloaded function resolution (__builtin_ detection, OMP variants) | 0x6523A0 | 64KB | HIGH
Struct/union/class specifier processing | 0x66AC40 | 49KB | HIGH
Enum specifier processing | 0x66F9E0 | 39KB | HIGH
Block-level declaration/statement processor (largest in 0x630000 zone) | 0x63CAE0 | 67KB | HIGH
Declaration statement parsing (35 token refs, 14 diagnostics) | 0x661400 | 28KB | HIGH
Function declarator processing (parameter lists, return types) | 0x66DF40 | 24KB | HIGH
Declaration specifier combination validator | 0x668EE0 | 26KB | HIGH
Storage class specifier processor (_Thread_local validation) | 0x668230 | 9KB | HIGH
Primary declarator-to-IL conversion (type kind dispatch) | 0x6333F0 | 26KB | HIGH
Name/identifier processing | 0x64BAA0 | 46KB | HIGH
Builtin/intrinsic recognition (53 string refs, C++20/23 reflection) | 0x64A920 | 25KB | HIGH
IL function body processor (14 params, scope stack management) | 0x627530 | 114KB | HIGH
IL statement processing (16 params, IL walker/transformer) | 0x62C0A0 | 63KB | HIGH

Type System

Function | Address | Size | Confidence
Type conversion checker (recursive, vector type handling) | 0x713ED0 | 36KB | HIGH
Binary operation type checker (11 callers, very central) | 0x7115B0 | 17KB | HIGH
Usual arithmetic conversions (10 params) | 0x712770 | 12KB | HIGH
Type node comparator (parallel tree walk, canonicalization) | 0x7386E0 | 23KB | HIGH
Declaration-level type comparison | 0x739430 | 20KB | HIGH
Type-to-string emitter (19 callers, backbone of diagnostics) | 0x74A390 | 29KB | VERY HIGH
Constant expression emitter (alignof, sizeof, nullptr, zero-init) | 0x748000 | 45KB | HIGH
Declarator emitter (19 callers, paired with sub_74A390) | 0x74D110 | 10KB | HIGH
Type node deep-copy | 0x73A9D0 | 19KB | HIGH
Declaration node deep-copy (192 bytes = 12 x __m128i) | 0x73F780 | 6KB | HIGH
Operator overloadability checker | 0x73CC20 | 9KB | HIGH

IL Tree Infrastructure

Function | Address | Size | Confidence
IL tree walker #1 (297 self-recursive, 87 node types, 305 cases) | 0x7506E0 | 190KB | HIGH
IL tree walker #2 (427 self-recursive, parallel traversal) | 0x760BD0 | 109KB | HIGH
IL tree walker #3 (316 self-recursive) | 0x75C0C0 | 87KB | HIGH
IL tree copier/transformer (callback params a3/a4, template instantiation) | 0x766570 | 148KB | HIGH
Walker driver/setup (5 callbacks + flags) | 0x759B50 | 31KB | HIGH
Copier driver (parallel to sub_759B50) | 0x75B260 | 16KB | HIGH
Master walker driver (sets all 6 global callback pointers) | 0x75AFC0 | – | HIGH

Constexpr Evaluator

Function | Address | Size | Confidence
EDG constexpr expression evaluator core (124 operator opcodes, 9,075 lines) | 0x786210 | 317KB | VERY HIGH
Statement executor (declarations, loops, switch, compound blocks) | 0x795660 | 77KB | HIGH
Object member accessor (base classes, virtual bases, union tracking) | 0x79CCD0 | 67KB | HIGH
Aggregate initializer evaluator (arrays, structs, designated init) | 0x799B70 | 33KB | HIGH
Function call evaluator (argument binding, recursion limits) | 0x79B7D0 | 29KB | HIGH
EDG constexpr top-level dispatch (80 expression types + 62 intrinsics) | 0x77FCB0 | 150KB | HIGH
Type size calculator (Robin Hood hash memoization, 64MB cap) | 0x7764B0 | 18KB | HIGH
Loop/range-for evaluator | 0x7987E0 | 11KB | HIGH
Builtin call evaluator (dispatched from case 0x3D) | 0x77C870 | 18KB | HIGH
Aggregate initializer evaluator (struct/array/union at compile time) | 0x77D750 | 34KB | HIGH

Preprocessor

Function | Address | Size | Confidence
Main preprocessor token scanner (all C/C++ token kinds) | 0x7B8B50 | 59KB | HIGH
Macro expansion engine (99-entry predefined table, __VA_OPT__) | 0x81B8F0 | 77KB | HIGH
Numeric literal tokenizer (hex float, binary, digit separators) | 0x7B40D0 | 42KB | HIGH
Character classification / next-token dispatch (trigraphs, line splices) | 0x7BC390 | 29KB | HIGH
String literal scanner (escape processing, raw strings) | 0x7B6B00 | 13KB | HIGH
Macro body substitution (__VA_ARGS__, __VA_OPT__) | 0x8200E0 | 22KB | HIGH
Source character reader / tokenizer bootstrap | 0x7B2B10 | 16KB | HIGH
Preprocessing directive dispatcher | 0x7B8270 | 8KB | HIGH

Template Engine

Function | Address | Size | Confidence
Complete template instantiation engine (parameter lists, member iteration) | 0x7A9440 | 40KB | HIGH
Template argument type resolution/matching | 0x7410C0 | 42KB | HIGH
Template type instantiation handler | 0x743600 | 19KB | HIGH
Template instantiation engine (word_4F06418 SM-arch checks) | 0x5EBF70 | 30KB | HIGH
Template argument deduction engine (pattern matching, pack expansion) | 0x5FBCD0 | 38KB | HIGH

Semantic Analysis

Function | Address | Size | Confidence
Deep semantic analysis (29 SM-arch refs, 27 sub_8D* calls) | 0x6040F0 | 64KB | HIGH
Overload resolution main (43 SM-arch refs, the highest count) | 0x607B60 | 32KB | HIGH
Expression parsing/semantic ("Parsing Lambda", __nv_parent) | 0x609F00 | 58KB | HIGH
Declaration processing (9 SM version refs) | 0x5FE9C0 | 28KB | HIGH
Class hierarchy analysis (vtable layout, diamond inheritance) | 0x5F94C0 | 24KB | HIGH
Conversion function lookup (33 sub_8D* calls) | 0x5F4F20 | 21KB | HIGH
Operator overload resolution | 0x5F2920 | 23KB | HIGH
Declaration elaboration (type-spec strings "A;P", "O;F", "I", "B") | 0x84EC30 | 71KB | HIGH
Declaration semantic analysis (148 global refs, highest density) | 0x8708D0 | 63KB | HIGH

CUDA-Specific Frontend

Function | Address | Size | Confidence
Memory space attribute processing (__shared__, __constant__, __managed__) | 0x6582F0 | 22KB | HIGH
Declaration with memory space annotation (15 diagnostic calls) | 0x65F400 | 24KB | HIGH
Atomic builtin name generator (__nv_atomic_fetch_*) | 0x6BBC40 | 34KB | HIGH
CUDA device code generation master | 0x804B20 | 28KB | HIGH
CUDA registration stub (__cudaRegisterAll, __cudaRegisterEntry) | 0x806F60 | 8KB | VERY HIGH
Device stub generator ("__device_stub_%s", __cudaLaunch) | 0x808590 | 11KB | HIGH
CUDA kernel launch lowering (cudaGetParameterBufferV2) | 0x7F2B50 | 16KB | HIGH
Static init with CUDA memory space (__sti__, __constant__) | 0x801880 | 7KB | HIGH
Optimization flag configurator (109 flags from O-level) | 0x60D650 | 6KB | HIGH
SM-arch feature gate (56 qword_4F077A8 comparisons) | 0x60E7C0 | 12KB | HIGH

Name Mangling (Itanium ABI)

| Function | Address | Size | Confidence |
|---|---|---|---|
| Primary mangling entry | 0x8E74B0 | 29KB | HIGH |
| Type mangling | 0x8E9FF0 | 26KB | HIGH |
| Type component mangling (__real__, __imag__) | 0x816460 | 24KB | HIGH |
| Builtin type mangling (DF16_, Cu6__bf16, u6__mfp8) | 0x80E340 | 23KB | HIGH |
| NVIDIA extension mangling (Unvdl, Unvdtl, Unvhdl) | 0x80FE00 | 8KB | HIGH |
| Special type mangling (basic_ostream, allocator substitution) | 0x80C5A0 | 11KB | HIGH |
| Expression mangling | 0x813790 | 13KB | HIGH |

Diagnostics & Support

| Function | Address | Size | Confidence |
|---|---|---|---|
| Diagnostic emitter (severity labels, ANSI color, word-wrap) | 0x681D20 | 37KB | VERY HIGH |
| SARIF JSON diagnostic output (ruleId, level, locations) | 0x6837D0 | 20KB | HIGH |
| Type name formatter (quoted type names for error messages) | 0x67FCF0 | 40KB | HIGH |
| EDG abort / __builtin_unreachable (478 callers!) | 0x721090 | tiny | VERY HIGH |
| Exit with status ("Compilation aborted/terminated") | 0x720FF0 | | HIGH |
| IR node alloc with context (204 callers) | 0x724DC0 | | HIGH |
| IR node free (196 callers) | 0x724E30 | | HIGH |
| Get/create void type singleton at qword_4F07BA8 (145 callers) | 0x72C930 | | HIGH |
| Arena allocator (63 callers) | 0x7247C0 | | HIGH |
| IR node hash (polynomial: v10 += ch + 32*v10, 9 callers) | 0x72DB90 | 8KB | HIGH |
| Tracked heap allocation (linked list at qword_4F195F8) | 0x822B10 | | HIGH |
| Hash table bucket chain finalizer | 0x823310 | | HIGH |
| EDG heap pool allocator (152-byte, 416-byte, etc. entries) | 0x823970 | | HIGH |
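The IR node hash at 0x72DB90 updates its accumulator as `v10 += ch + 32*v10`, which is algebraically `h = 33*h + ch`, a djb2-style polynomial string hash. A minimal sketch of that recurrence (the initial seed of 0 and the 64-bit wraparound are assumptions, not confirmed from the decompilation):

```python
def edg_node_hash(name: bytes, h: int = 0) -> int:
    """Sketch of the EDG IR node hash at 0x72DB90.

    The decompiled update `v10 += ch + 32*v10` is equivalent to
    h = 33*h + ch. Seed 0 and 64-bit truncation are assumed here.
    """
    for ch in name:
        h = (33 * h + ch) & 0xFFFFFFFFFFFFFFFF  # assumed 64-bit register wrap
    return h
```

For short identifiers the two formulations are identical: hashing `b"ab"` gives `33*ord('a') + ord('b')`.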

Class Layout & Vtable

| Function | Address | Size | Confidence |
|---|---|---|---|
| Class layout emitter (__vptr, __v_, __b_ prefixes) | 0x7E3EE0 | 7KB | HIGH |
| Virtual base offset calculator | 0x7E57B0 | 9KB | HIGH |
| Virtual call lowering (node_kind==103) | 0x7E88E0 | 11KB | HIGH |
| Class definition emitter (vtable, nested types, friends) | 0x7E9AF0 | 13KB | HIGH |
| Statement emission mega-function (largest in class layout zone) | 0x7EE560 | 45KB | HIGH |
| Class member emission (__cxa_atexit, __cxa_vec_cctor) | 0x7FEC50 | 48KB | HIGH |
| Function definition emission (ctor initializers, default args) | 0x7FCF80 | 17KB | HIGH |

LLVM cl::opt Registration Infrastructure

| Function | Address | Size | Confidence |
|---|---|---|---|
| Global option counter (atomic increment) | 0xC523C0 | | HIGH |
| cl::Option::setArgStr(name, len) — Legacy PM | 0xC53080 | | HIGH |
| cl::Option::addArgument() — Legacy PM | 0xC53130 | | HIGH |
| cl::OptionCategory getter | 0xC57470 | | HIGH |
| cl::opt name setter — New PM | 0x16B8280 | | HIGH |
| cl::opt finalization — New PM | 0x16B88A0 | | HIGH |
| SmallVector::grow() | 0xC8D5F0 | | HIGH |

Key Constructors (cl::opt registration)

| Function | Address | Size | Confidence |
|---|---|---|---|
| ctor_010_0: TargetLibraryInfo VecFuncs table (9 vector math libs, 960 string xrefs, NOT decompiled) | 0x4397F0 | ~102KB | VERY HIGH |
| ctor_027: DOES NOT EXIST (phantom, no decompiled file) | 0x456120 | | DISPROVED |
| ctor_036: LLVM version = "20.0.0" (via LLVM_OVERRIDE_PRODUCER fallback) | 0x48CC90 | 2KB | VERY HIGH |
| ctor_043_0: NVIDIA CICC-specific options (19 opts, XOR cipher hidden flag) | 0x48D7F0 | 30KB | VERY HIGH |
| MASTER pass/analysis registration (~172 init calls) | 0x4A5950 | 7KB | VERY HIGH |
| ctor_107_0: MC/Target options (131 opts, getenv("bar") backdoor) | 0x4A64D0 | 59KB | VERY HIGH |
| ctor_133_0: Known library function table (422 C/POSIX functions) | 0x4B0180 | 29KB | VERY HIGH |
| ctor_145: MISSING from decompilation (too large for Hex-Rays) | 0x4B4360 | ~99KB | HIGH |
| ctor_147_0: PassManager debug/print options | 0x4CC760 | 20KB | HIGH |
| ctor_156_0: CLI infrastructure (help, version, print-options) | 0x4CEB50 | 9KB | HIGH |
| ctor_186_0: Inliner heuristics (NVIDIA: profuseinline, inline-budget) | 0x4DBEC0 | 14KB | HIGH |
| ctor_201: GVN options (NVIDIA: profusegvn, gvn-dom-cache) | 0x4E0990 | 9KB | HIGH |
| ctor_214_0: LSR options (NVIDIA: disable-lsr-for-sharedmem32-ptr) | 0x4E4B00 | 8KB | HIGH |
| ctor_216_0: Loop Unrolling options (largest unroll ctor) | 0x4E5C30 | 21KB | HIGH |
| ctor_259_0: CICC core compiler options (debug-compile, maxreg) | 0x4F0FB0 | 17KB | HIGH |
| ctor_262_0: BranchDist pass options | 0x4F2830 | 10KB | HIGH |
| ctor_263_0: SCEV-CGP pass options (44 strings!) | 0x4F36F0 | 10KB | HIGH |
| ctor_264: IP-MSP knobs | 0x4F45B0 | | HIGH |
| ctor_267_0: MemorySpaceOpt options (18 strings) | 0x4F54D0 | 10KB | HIGH |
| ctor_277_0: Rematerialization options (39 strings, remat-for-occ) | 0x4F7BE0 | 7KB | HIGH |
| ctor_335_0: MASTER codegen pass configuration (88 strings) | 0x507310 | 29KB | VERY HIGH |
| ctor_356_0: NVPTX SM enum + PTX version table (45 entries, sm_20–sm_121f) | 0x50C890 | 16KB | VERY HIGH |
| ctor_358_0: NVPTX pass enable/disable (43 strings, usedessa) | 0x50E8D0 | 21KB | HIGH |
| ctor_361_0: NV Remat Machine Block options (30 strings, nv-remat-*) | 0x5108E0 | 8KB | HIGH |
| ctor_376_0: LTO/bitcode/plugin options | 0x512DF0 | 39KB | HIGH |
| ctor_377_0: PassBuilder pipeline configuration (77 strings) | 0x516190 | 44KB | HIGH |
| ctor_388_0: Optimizer pipeline enables (enable-ml-inliner, etc.) | 0x51B710 | 15KB | HIGH |
| ctor_600_0: CodeGen/TargetMachine mega-options (118 strings) | 0x57F210 | 59KB | HIGH |
| ctor_605: SM processor table (45 entries, sm_20–sm_121f, PTX version map) | 0x584510 | 3KB | VERY HIGH |
| ctor_609_0: NVPTX backend options (25+ opts, usedessa, enable-nvvm-peephole) | 0x585D30 | 37KB | HIGH |
| ctor_637_0: disable-*Pass flag registration (48 flags) | 0x593380 | | HIGH |
| ctor_701: MISSING data blob (likely instruction encoding tables) | 0x5A8850 | ~70KB | MEDIUM |

NVIDIA Custom Pass Implementations

| Function | Address | Size | Confidence |
|---|---|---|---|
| MemorySpaceOptPass registration | 0x2CDD6D0 | reg | HIGH |
| MemorySpaceOptPass factory | 0x2CDFF20 | factory | HIGH |
| MemorySpaceOpt core analysis | 0x2CDA660 | 10KB | HIGH |
| MemorySpaceOpt address space inference | 0x2CD7710 | 9KB | HIGH |
| IPMSPPass (interprocedural memory space) registration | 0x1C6FBC0 | reg | HIGH |
| RematerializationPass (IR-level) implementation | 0x1CE7DD0 | 13KB | HIGH |
| Machine Block Rematerialization | 0x2186D90 | 9KB | HIGH |
| BranchDistPass registration | 0x1C4B520 | reg | HIGH |
| LoopIndexSplitPass implementation | 0x1C7B2C0 | 11KB | HIGH |
| NVVMPeepholeOptimizerPass registration | 0x2CAF0F0 | reg | HIGH |
| ByValMem2RegPass | 0x2CD6510 | 350B | HIGH |
| BasicDeadBarrierEliminationPass | 0x2CD2690 | 366B | HIGH |
| CNPLaunchCheckPass (Dynamic Parallelism validation) | 0x1CEBC30 | reg | HIGH |
| PrintfLoweringPass | 0x1CB0B80 | name | HIGH |
| Pass registration master function (all 402+20 passes) | 0x2342890 | 32KB | VERY HIGH |
| Pass name listing (pipeline names for all passes) | 0x233C410 | | HIGH |

MMA / Tensor Core Emission

| Function | Address | Size | Confidence |
|---|---|---|---|
| MMA instruction operand builder (shapes, types, rounding modes) | 0x21E74C0 | 17KB | VERY HIGH |
| tcgen05 Blackwell scaled MMA operands (scaleD, negA, negB, transA) | 0x21E8CD0 | 2KB | VERY HIGH |
| HMMA store-C (hmmastc), SM ≥ 70 | 0x21DFBF0 | 5KB | HIGH |
| HMMA load-A/B (hmmaldab), SM ≥ 70 | 0x21E0360 | 3KB | HIGH |
| HMMA load-C (hmmaldc), SM ≥ 70 | 0x21E0630 | 3KB | HIGH |
| HMMA MMA (hmmamma), SM ≥ 70 | 0x21E0870 | 4KB | HIGH |
| IMMA load-A/B (immaldab), SM ≥ 72 | 0x21E1280 | 4KB | HIGH |
| IMMA load-C (immaldc), SM ≥ 72 | 0x21E15D0 | 3KB | HIGH |
| IMMA store-C, SM ≥ 72 | 0x21E1830 | 5KB | HIGH |
| IMMA MMA w/ saturation (immamma), SM ≥ 72 | 0x21E1D20 | 6KB | HIGH |
| Binary MMA (bmmamma, b1 .and.popc/.xor.popc), SM ≥ 75 | 0x21E2280 | 6KB | HIGH |
| MMA address-space resolver (opcode → addrspace enum) | 0x21DEF90 | | HIGH |
| tcgen05 scaled MMA operands (NVPTX backend copy) | 0x35F3E90 | | HIGH |
| tcgen05.mma full instruction lowering (10 shape variants) | 0x36E9630 | | HIGH |
| tcgen05.mma SelectionDAG lowering | 0x304E6C0 | | HIGH |
| tcgen05 infrastructure ops (fence/wait/alloc/dealloc/cp/commit) | 0x30462A0 | | HIGH |

PTX Emission

| Function | Address | Size | Confidence |
|---|---|---|---|
| Function header orchestrator (.entry/.func, params, attrs, pragmas) | 0x215A3C0 | | VERY HIGH |
| Kernel attribute emission (.reqntid, .maxntid, cluster, .maxnreg) | 0x214DA90 | | VERY HIGH |
| Stack frame emission (__local_depot, %SP, %SPL, register decls) | 0x2158E80 | 17KB | VERY HIGH |
| Register class → encoded ID (9 classes, 0x10000000–0x90000000) | 0x21583D0 | | HIGH |
| Register class → PTX type suffix (.pred, .b16, .b32, .b64, .f32, .f64, .b128) | 0x2163730 | | HIGH |
| Register class → PTX prefix (%p, %rs, %r, %rd, %f, %fd, %h, %hh, %rq) | 0x21638D0 | | HIGH |
| GenericToNVVM pass registration ("generic-to-nvvm") | 0x215DC20 | | VERY HIGH |
| GenericToNVVM pass body (addrspace 0→1 rewriting) | 0x215E100 | 36KB | HIGH |
| Module emission entry (global ctor rejection, DWARF init) | 0x215ACD0 | | HIGH |
| Global variable emission (texref/surfref/samplerref/data) | 0x2156420 | | HIGH |
| Atomic opcode emission (13 ops, scope prefix) | 0x21E5E70 | | VERY HIGH |
| L2 cache-hinted atomic emission (Ampere+) | 0x21E6420 | | HIGH |
| Memory barrier emission (membar.cta/gpu/sys, fence.sc.cluster) | 0x21E94F0 | | HIGH |
| Cluster barrier emission (arrive/wait + relaxed) | 0x21E8EA0 | | HIGH |
| Special register emission (%tid, %ctaid, %ntid, %nctaid) | 0x21E86B0 | | VERY HIGH |
| Cluster special register emission (15 regs, SM 90+) | 0x21E9060 | | HIGH |
| Address space conversion + MMA helpers (cvta, rowcol, abtype) | 0x21E7FE0 | | HIGH |
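The two register-class lookup helpers (0x2163730 and 0x21638D0) together map each of the 9 register classes to a PTX type suffix and a virtual-register prefix. A sketch of the combined table and how PTX register declarations are formed from it; the prefix/type pairings follow standard NVPTX conventions, and the exact class names and ordering inside cicc are assumptions (note the table lists 7 distinct suffixes for 9 classes, so some classes share a suffix):

```python
# Sketch of the register-class tables behind 0x2163730 / 0x21638D0.
# Class names are illustrative; pairings follow NVPTX conventions.
PTX_REG_CLASSES = {
    "pred":  ("%p",  ".pred"),
    "int16": ("%rs", ".b16"),
    "int32": ("%r",  ".b32"),
    "int64": ("%rd", ".b64"),
    "f32":   ("%f",  ".f32"),
    "f64":   ("%fd", ".f64"),
    "f16":   ("%h",  ".b16"),    # f16 shares the .b16 suffix (assumed)
    "f16x2": ("%hh", ".b32"),    # packed half pair (assumed)
    "b128":  ("%rq", ".b128"),
}

def reg_decl(cls: str, count: int) -> str:
    """Emit a PTX virtual-register declaration, e.g. `.reg .b32 %r<10>;`."""
    prefix, suffix = PTX_REG_CLASSES[cls]
    return f".reg {suffix} {prefix}<{count}>;"
```

This is the shape of the declarations the stack-frame emitter (0x2158E80) prints at the top of each function body.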

Hash Infrastructure

| Function | Address | Size | Confidence |
|---|---|---|---|
| wyhash v4 hash function (multi-length dispatch) | 0xCBF760 | | VERY HIGH |
| Thin wrapper → sub_CBF760 (hash for builtin names) | 0xC92610 | | HIGH |
| Hash table insert-or-find (quadratic probing, triangular numbers) | 0xC92740 | | VERY HIGH |
| Hash table find-only (same probing) | 0xC92860 | | HIGH |
| Rehash at 75% load factor (double or tombstone cleanup) | 0xC929D0 | | HIGH |
| String entry allocator (length+17, 8-byte aligned) | 0xC7D670 | | HIGH |
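The probing scheme named above — open addressing with triangular-number quadratic probing (offsets 0, 1, 3, 6, 10, …) and a rehash when load passes 75% — can be sketched as follows. Python's built-in `hash()` stands in for wyhash, and the entry layout is simplified; only the probe sequence and the load-factor policy mirror what was recovered at 0xC92740/0xC929D0:

```python
class OpenTable:
    """Open-addressing table sketch: triangular-number quadratic probing,
    rehash (capacity doubling) at 75% load. Assumes power-of-two capacity,
    which guarantees the triangular probe sequence visits every slot."""

    def __init__(self, cap=16):
        self.cap, self.n = cap, 0
        self.slots = [None] * cap

    def _probe(self, key):
        h = hash(key) % self.cap
        i = 0
        while True:
            # triangular numbers: offsets 0, 1, 3, 6, 10, ...
            idx = (h + i * (i + 1) // 2) % self.cap
            if self.slots[idx] is None or self.slots[idx][0] == key:
                return idx
            i += 1

    def insert(self, key, val):
        if (self.n + 1) * 4 > self.cap * 3:        # rehash at 75% load
            old = [s for s in self.slots if s]
            self.cap, self.n = self.cap * 2, 0
            self.slots = [None] * self.cap
            for k, v in old:
                self.insert(k, v)
        idx = self._probe(key)
        if self.slots[idx] is None:
            self.n += 1
        self.slots[idx] = (key, val)

    def find(self, key):
        s = self.slots[self._probe(key)]
        return s[1] if s else None
```

The power-of-two capacity is what makes triangular probing attractive: unlike naive quadratic probing, it is guaranteed to hit every slot before repeating.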

NVVM Builtin Infrastructure

| Function | Address | Size | Confidence |
|---|---|---|---|
| Hash table insertion helper (pre-opt) | 0x90ADD0 | 56 lines | VERY HIGH |
| Builtin dispatcher (pre-opt): name → ID | 0x913450 | 27 lines | VERY HIGH |
| Builtin dispatcher (post-opt): name → ID | 0x12731E0 | 25 lines | VERY HIGH |
| Builtin lowering engine (pre-opt, wgmma/tex/surf, 3571 lines) | 0x955A70 | 103KB | HIGH |
| Builtin lowering engine (post-opt, 3408 lines) | 0x12B3FD0 | 101KB | HIGH |

Register Allocation

| Function | Address | Size | Confidence |
|---|---|---|---|
| Instruction constraint emission (180+ case opcode switch) | 0xB612D0 | 102KB | HIGH |
| SimplifyAndColor phase | 0x1081400 | 13KB | HIGH |
| SelectNodeForRemoval / Briggs criterion (K=15 at 3 locations) | 0x1090BD0 | 10KB | VERY HIGH |
| AssignColorsAndOptimize (address unverified, was erroneously listed as 0x12E1EF0) | 0x10841C0 | 11KB | MEDIUM |
| Operand constraint spec creator (type 14=GPR, 40=FP, 78=vec) | 0xA778C0 | | HIGH |
| Final instruction emitter with allocated registers | 0xA78010 | | HIGH |
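The Briggs criterion referenced in SelectNodeForRemoval (0x1090BD0, with the constant K=15 at three sites) is the standard conservative test from graph-coloring register allocation: a node can be safely removed from the interference graph (it is trivially colorable) if fewer than K of its neighbors have significant degree (degree ≥ K). A one-function sketch:

```python
def briggs_allocable(neighbor_degrees, K=15):
    """Briggs criterion: a node is trivially colorable if fewer than K
    of its interference-graph neighbors have degree >= K. K=15 matches
    the constant recovered at three sites in 0x1090BD0."""
    return sum(1 for d in neighbor_degrees if d >= K) < K
```

Intuitively, low-degree neighbors can always find a color among K, so only high-degree neighbors can "use up" all K colors; if fewer than K of those exist, one color is guaranteed to remain free.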

jemalloc (Statically Linked, v5.3.x)

| Function | Address | Size | Confidence |
|---|---|---|---|
| je_stats_print_arena (per-arena stats, HPA shards) | 0x4134A7 | 83KB | HIGH |
| je_stats_print_bins (18 stat columns per bin) | 0x40F894 | 37KB | HIGH |
| je_stats_general (version, build config, runtime opts) | 0x411419 | 32KB | HIGH |
| je_stats_print (top-level: allocated, active, resident, mapped) | 0x417CBD | 14KB | HIGH |
| je_stats_print_large (large extent class stats) | 0x40EF06 | 13KB | HIGH |
| je_malloc_vsnprintf (custom format printer, avoids reentrancy) | 0x40D5CA | 21KB | HIGH |
| je_mutex_stats_read (mutex profiling counters) | 0x40E5B5 | 7KB | HIGH |
| je_malloc_conf_init (199 config strings) | 0x12FCDB0 | 129KB | VERY HIGH |

Optimizer Pipeline Assembly

Functions discovered during wiki writing (W101--W241). These assemble the LLVM optimization pipeline from NVVMPassOptions slots.

Pipeline Builders

| Function | Address | Size | Confidence |
|---|---|---|---|
| Master pipeline assembler (reads opts struct, ~150 pass-insertion decisions) | 0x12E54A0 | 50KB | VERY HIGH |
| Tier 0 full optimization sub-pipeline (~40 passes, base for O1/O2/O3) | 0x12DE330 | | VERY HIGH |
| Tier 1/2/3 phase-specific sub-pipeline (phase-conditional pass insertion) | 0x12DE8F0 | | VERY HIGH |
| Codegen pass dispatch (reads opts[200] optimization threshold) | 0x12DFE00 | 20.7KB | HIGH |
| OPT stage two-phase orchestrator (sets qword_4FBB3B0 to 1 or 2) | 0x12E7E70 | | VERY HIGH |
| New-PM driver: pipeline name selector (O0/O1/O2/O3/Ofcmin/Ofcmid/Ofcmax) | 0x226C400 | | HIGH |
| NVPTXTargetMachine creation (NVIDIA options, standalone path) | 0x12F4060 | 16KB | HIGH |
| OptiX IR generation core function | 0x12F9270 | ~6KB | HIGH |
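The control flow of the master assembler (0x12E54A0) reduces to a long sequence of knob-gated appends: read a slot out of the NVVMPassOptions struct, and if the gate passes, call the matching factory and push the result onto the pipeline. A minimal sketch of that pattern; the slot indices, thresholds, and pass pairings below are illustrative stand-ins, not the recovered layout (only the dual gate on opts[2240]/opts[2280] for NVVMLowerAlloca is taken from the factory table later in this page):

```python
def build_pipeline(opts: dict) -> list:
    """Illustrative sketch of knob-gated pass insertion (cf. 0x12E54A0).
    `opts` stands in for the 222-slot NVVMPassOptions struct."""
    pipeline = ["nvvm-reflect", "sroa", "early-cse"]   # unconditional base (assumed)
    if opts.get(2240) and opts.get(2280):              # dual-gated insertion
        pipeline.append("nvvm-lower-alloca")
    if opts.get(200, 0) >= 2:                          # threshold-style knob (assumed)
        pipeline.append("memory-space-opt")
    return pipeline
```

With ~150 such decisions in the real function, small changes to the options struct fan out into substantially different pipelines, which is why the tier builders (0x12DE330, 0x12DE8F0) exist as pre-baked sub-pipelines.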

Pass Factories (Pipeline Insertion Order)

Each factory creates a pass instance; referenced from sub_12E54A0, sub_12DE330, and sub_12DE8F0.

| Function | Address | Size | Confidence |
|---|---|---|---|
| NVVMReflect factory (~8 pipeline insertions) | 0x1857160 | | HIGH |
| SCCP factory | 0x1842BC0 | | HIGH |
| NVVMVerifier wrapper (creates context, invokes module verifier) | 0x12D4560 | | HIGH |
| NVVMPredicateOpt factory (AggressiveInstCombine variant) | 0x18A3430 | | HIGH |
| NVVMPredicateOpt variant / LoopRotate factory | 0x18A3090 | | HIGH |
| ConstantMerge / GlobalDCE / LICM factory | 0x184CD60 | | HIGH |
| FunctionAttrs factory (infers readonly, nounwind, etc.) | 0x1841180 | | HIGH |
| LICM factory (parameter 0 = standard mode) | 0x195E880 | | HIGH |
| LoopVectorize/SLP factory (7 params: width, thresholds) | 0x19B73C0 | | HIGH |
| CGSCC standard pipeline factory (InlinerWrapper, 1--5 iterations) | 0x1A62BF0 | | HIGH |
| PrintModulePass factory (debug dump, params: level, verbose) | 0x17060B0 | | HIGH |
| JumpThreading / CVP factory (parameter: threshold) | 0x198DF00 | | HIGH |
| EarlyCSE factory | 0x196A2B0 | | HIGH |
| SROA factory | 0x1968390 | | HIGH |
| DCE (DeadCodeElimination) factory | 0x18DEFF0 | | HIGH |
| Sink/MemSSA factory (3 params: mode, flags) | 0x1869C50 | | HIGH |
| NVVMLoopOpt/BarrierOpt / IV Demotion factory | 0x18B1DE0 | | HIGH |
| NVVMIntrinsicLowering factory (level 0 = basic, level 1 = barrier) | 0x1CB4E40 | | HIGH |
| MemCpyOpt factory | 0x1B26330 | | HIGH |
| LoopUnroll / SpeculativeExecution factory (2 params) | 0x19C1680 | | HIGH |
| ADCE (AggressiveDeadCodeElimination) factory | 0x1C76260 | | HIGH |
| ADCE variant factory (separate pipeline position) | 0x1C6FCA0 | | HIGH |
| SimplifyCFG factory (2 params: mode, flags) | 0x190BB10 | | HIGH |
| InstructionSimplify factory | 0x1A7A9F0 | | HIGH |
| NVVMRematerialization factory (IR-level) | 0x1A13320 | | HIGH |
| Reassociate factory (parameter: tier) | 0x1B7FDF0 | | HIGH |
| LoopStrengthReduce factory | 0x19CE990 | | HIGH |
| NVVMBranchDist factory (two pipeline positions) | 0x1CB73C0 | | HIGH |
| NVVMSinking2 factory (SM-specific late sinking) | 0x1CC60B0 | | HIGH |
| NVVMGenericAddrOpt factory (generic address optimization) | 0x1CC71E0 | | HIGH |
| NVVMReduction factory (SM-specific) | 0x1CC5E00 | | HIGH |
| NVVMUnreachableBlockElim factory | 0x1CC3990 | | HIGH |
| NVVMLateOpt factory (Tier 3 only) | 0x1C46000 | | HIGH |
| NVVMLowerAlloca factory (dual gate: opts[2240] + opts[2280]) | 0x1CBC480 | | HIGH |
| NVVMLowerBarriers factory (runs between LICM invocations) | 0x1C98160 | | HIGH |
| Sinking2Pass fast-mode factory (flag=1, Ofcmin pipeline) | 0x18B3080 | | HIGH |
| VerifierPass factory (late CFG cleanup guard at opts[4464]) | 0x1654860 | | HIGH |
| NVIDIA loop pass factory (opts[3080] guard) | 0x1922F90 | | MEDIUM |
| EarlyCSE MemorySSA variant / NVVMBarrierAnalysis factory | 0x18E4A00 | | HIGH |
| EarlyCSE variant (v=1 if opts[3704]) | 0x1C8A4D0 | | HIGH |
| NVVMAnnotationsProcessor factory | 0x215D9D0 | | HIGH |
| NVIDIA Custom Inliner (CGSCC, 20,000-unit per-caller budget) | 0x1864060 | 75KB | VERY HIGH |

NVPTX Backend (SelectionDAG & ISel)

| Function | Address | Size | Confidence |
|---|---|---|---|
| NVPTXTargetLowering::LowerIntrinsicCall (largest function in binary) | 0x33B0210 | 343KB | VERY HIGH |
| NVPTXDAGToDAGISel::Select (ISel entry, hash-based cost table) | 0x3090F90 | 91KB | VERY HIGH |
| computeKnownBitsForTargetNode (112 opcodes, 399x sub_969240 calls) | 0x33D4EF0 | 114KB | HIGH |
| NVPTXTargetLowering::LowerCall (PTX .param calling convention) | 0x3040BF0 | 88KB | HIGH |
| LLVM standard InlineCostAnalysis (library function) | 0x30DC7E0 | 51KB | HIGH |
| Vector legalization type-split record mapping | 0x3302A00 | | HIGH |
| Operand type classifier (reads byte_444C4A0) | 0x34961A0 | 26.6KB | HIGH |

NVVM Verifier Subsystem

| Function | Address | Size | Confidence |
|---|---|---|---|
| NVVMModuleVerifier (data layout, address space, triple validation) | 0x2C80C90 | 51KB | HIGH |
| NVVMIntrinsicVerifier (SM gates, types, MMA, atomics, tex/surf) | 0x2C7B6A0 | 143KB | VERY HIGH |
| Frontend verifier (convergent intrinsic SM-version gating) | 0x1C36530 | | HIGH |
| NVVMIntrinsicLowering core engine (2,460 lines) | 0x2C63FB0 | 140KB | HIGH |

LTO Subsystem

| Function | Address | Size | Confidence |
|---|---|---|---|
| NVModuleSummary builder (ThinLTO, two-phase declaration merge) | 0xD7D4E0 | 74KB | HIGH |
| New PM CGSCC inliner (inside LazyCallGraph framework) | 0x2613930 | 69KB | HIGH |
| IP-MSP module-pass variant (LIBNVVM path, DenseMap-based) | 0x1C6A6C0 | 54KB | HIGH |
| LinkUserModules (wrapper around LLVM Linker::linkModules) | 0x12F5610 | ~4KB | HIGH |

LLVM IR Utility Functions

Common LLVM IR manipulation functions referenced across many passes.

| Function | Address | Size | Confidence |
|---|---|---|---|
| operator new / BumpPtrAllocator (SDNode, BasicBlock, pass objects) | 0x22077B0 | | HIGH |
| Value::replaceAllUsesWith / salvageDebugInfo | 0xBD84D0 | | HIGH |
| Instruction::eraseFromParent / SDUse remove from use list | 0xB43D60 | | HIGH |
| getCalledFunction / BranchInst::getCondition | 0xB43CB0 | | HIGH |
| Function::hasAttribute(N) (noimplicitfloat, optnone, convergent) | 0xB2D610 | | HIGH |
| Function::getName / IR node name getter | 0xBD5D20 | | HIGH |
| PHINode::Create / SDNode alloc variant (80 bytes) | 0xBD2DA0 | | HIGH |
| hasAttribute(26) (convergent/varargs marker check) | 0xB91C10 | | HIGH |
| TTI::getInstructionCost (IR-level) / MDString::getString | 0xB91420 | | HIGH |
| Ref-count decrement on metadata/debug-info | 0xB91220 | | HIGH |
| Ref-count increment on metadata/debug-info | 0xB96E90 | | HIGH |
| Value::setName / SetValueName (assigns %name to IR value) | 0x164B780 | | HIGH |
| IRBuilder::CreateBinOp / SCEV type extension (349x callers) | 0x1623A60 | | HIGH |
| ReleaseDebugLoc / debug location list removal | 0x161E7C0 | | HIGH |
| Fatal error emitter ("Broken module found, compilation aborted!") | 0x16BD130 | | HIGH |
| Create binary OR instruction (opcode 27) | 0x15FB440 | | HIGH |
| DataLayout::getPointerSizeInBits(addressSpace) | 0x15A9520 | | HIGH |
| DataLayout::getStructLayout (struct size computation) | 0x15A9930 | | HIGH |
| SCEV fold/normalize / NVVM AA address-space NoAlias query | 0x146F1B0 | | HIGH |
| CombineTo / ReplaceAllUsesWith (DAG use-chain + worklist push) | 0xF162A0 | | HIGH |
| Function cloner (coroutine resume/destroy) | 0xD2E510 | | HIGH |
| Create runtime library call instruction (OpenMP, MMA, barriers) | 0x921880 | | HIGH |
| Builtin function call emitter (pre-opt path, EDG builtins) | 0x1285290 | | HIGH |
| Kernel metadata emitter (cluster_dim, blocksareclusters) | 0x93AE30 | ~5.6KB | HIGH |
| ExpandIntegerResult (type legalization, 632 case labels) | 0x201BB90 | 75KB | HIGH |

Machine-Level Infrastructure

FunctionAddressSizeConfidence
InstrEmitter DenseMap grow / rehash (hash: key*37)0x2E29BA0HIGH
TwoAddressInstruction DenseMap (SrcEqClassMap)0x1F4E3A0HIGH

Binary Layout

This page is a visual guide to navigating the cicc v13.0 binary in IDA Pro. It covers the ELF structure, section layout, subsystem address ranges, embedded data payloads, and the statically linked jemalloc allocator. If you are opening this binary for the first time, start here to orient yourself before diving into individual subsystems.

ELF Overview

CICC is a statically linked, stripped x86-64 ELF binary. There is no dynamic symbol table (.dynsym), no DWARF debug info, and no export table. Every function name was removed at build time. IDA Pro recovers 80,562 functions; Hex-Rays successfully decompiles 80,281 of them (99.65%).

| Property | Value |
|---|---|
| File size | 60,108,328 bytes (57.3 MB) |
| Architecture | x86-64, little-endian |
| Linking | Fully static (no .interp, no PLT/GOT) |
| Stripped | Yes, all symbol tables removed |
| Build ID | cuda_13.0.r13.0/compiler.36424714_0 |
| Compiler | Built with GCC (inferred from CRT stubs and .init_array layout) |
| Allocator | jemalloc 5.3.x, statically linked (~400 functions) |

Because the binary is statically linked, libc, libpthread, and libm are all embedded. This inflates the raw function count but also means every call target resolves to a concrete address within the binary itself -- there are no external dependencies at runtime beyond the kernel syscall interface.

Address Space Map

The binary's .text section spans roughly 0x400000 to 0x3C00000. Within that 56 MB range, subsystems occupy contiguous, non-overlapping regions. The map below is the primary orientation tool for IDA Pro navigation.

0x400000 ┌─────────────────────────────────────────┐
         │  CRT startup + libc stubs               │  ~52 KB
0x40D000 ├─────────────────────────────────────────┤
         │  jemalloc stats / vsnprintf              │  ~80 KB
0x420000 ├─────────────────────────────────────────┤
         │  (gap: misc libc, math, string ops)      │  ~64 KB
0x430000 ├─────────────────────────────────────────┤
         │  Global constructors (cl::opt reg)        │  ~1.6 MB
         │  ~1,689 LLVM command-line option objects  │
0x5D0000 ├─────────────────────────────────────────┤
         │  EDG 6.6 C++ Frontend                    │  3.2 MB
         │  Parser, constexpr evaluator, IL walker   │
0x8F0000 ├─────────────────────────────────────────┤
         │  CLI / Real Main / NVVM Bridge            │  520 KB
         │  sub_8F9C90 (real main), dual-path dispatch│
0x960000 ├─────────────────────────────────────────┤
         │  Architecture detection, NVVM options     │  576 KB
0x9F0000 ├─────────────────────────────────────────┤
         │  Bitcode reader (parseFunctionBody)       │  ~1 MB
0xAF0000 ├─────────────────────────────────────────┤
         │  X86 AutoUpgrade (legacy, 457KB fn)       │  ~1 MB
0xBF0000 ├─────────────────────────────────────────┤
         │  LLVM IR Verifier                        │  500 KB
0xC00000 ├─────────────────────────────────────────┤
         │  LLVM Support / ADT library              │  ~3.2 MB
         │  (see detailed sub-map below)             │
0x12D0000├─────────────────────────────────────────┤
         │  PassManager / NVVM bridge                │  4.2 MB
         │  Pipeline assembly (sub_12E54A0)          │
0x12FC000├ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┤
         │  jemalloc core (~400 functions)           │  ~256 KB
0x1700000├─────────────────────────────────────────┤
         │  Backend / machine passes                 │  8 MB
         │  RegAlloc, Block Remat, Mem2Reg           │
0x1F00000├─────────────────────────────────────────┤
         │  SelectionDAG                            │  2 MB
         │  LegalizeTypes (348KB), LegalizeOp        │
0x2100000├─────────────────────────────────────────┤
         │  NVPTX PTX emission                      │  1 MB
0x2340000├─────────────────────────────────────────┤
         │  New PM / pass registration               │  768 KB
         │  2,816-line registrar at sub_2342890      │
0x2A00000├─────────────────────────────────────────┤
         │  Loop passes                             │  4 MB
         │  LoopVectorize, SLP, Unroll               │
0x3000000├─────────────────────────────────────────┤
         │  NVPTX ISel + lowering                    │  7 MB
         │  343KB intrinsic switch (sub_33B0210)     │
0x3700000├─────────────────────────────────────────┤
         │  Machine-level passes (tail)              │  ~3 MB
         │  BlockPlacement, Outliner, StructurizeCFG │
0x3A00000├─────────────────────────────────────────┤
         │  (trailing code, CRT finalization)        │
         └─────────────────────────────────────────┘

DATA SECTIONS:
0x3EA0080   Embedded libdevice bitcode (Path A)    456 KB
0x420FD80   Embedded libdevice bitcode (Path B)    456 KB
0x4F00000+  Global BSS (cl::opt storage, hash tables, state)

Detailed Subsystem Map at Pass Granularity

The coarse map above partitions the binary into ~18 zones. The following map refines every zone to individual-pass resolution, giving the factory address of each identified pass or subsystem entry point. Addresses prefixed with sub_ are IDA function names. Sizes in parentheses are decompiled C output; actual machine code is typically 2-3x smaller.

Zone 1: CRT, libc, jemalloc stats (0x400000 - 0x42FFFF)

0x400000   _start / CRT entry (ELF entry point)
0x40D5CA   sub_40D5CA   vsnprintf (jemalloc stats formatting)
0x420000   libc math/string helpers (memcpy, memset, strlen, etc.)

No LLVM or NVIDIA code lives here. Pure runtime support.

Zone 2: Global constructors (0x430000 - 0x5CFFFF)

~1,689 cl::opt registration constructors execute before main(). Each registers one command-line option — name string, description, default value, and storage pointer — into the global option registry. The .init_array section holds the function pointers to these constructors.
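The registration pattern these constructors implement can be sketched as follows. The names mirror LLVM's cl::opt machinery, but the Python here (module import time standing in for .init_array execution before main) is only an analogy, and the two option names shown are invented examples:

```python
# Global option registry, filled before "main" runs -- analogous to the
# cl::opt registry that the .init_array constructors populate.
OPTION_REGISTRY = {}

class Opt:
    """Stand-in for cl::opt: name, description, default, storage."""
    def __init__(self, name, desc, default):
        self.name, self.desc, self.value = name, desc, default
        OPTION_REGISTRY[name] = self   # what cl::Option::addArgument() does

# These module-level "constructors" run at import time, the way each
# .init_array entry runs before main(). Option names are hypothetical.
Opt("unroll-threshold", "Loop unroll cost threshold", 150)
Opt("nvptx-sched", "NVPTX scheduling mode", "default")
```

With ~1,689 such constructors, the registry is fully populated by the time the CLI parser in Zone 4 sees argv, which is why every `-mllvm`-style flag resolves without any lazy initialization.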

Zone 3: EDG 6.6 C++ Frontend (0x5D0000 - 0x8EFFFF)

The complete Edison Design Group C++ frontend, version 6.6. Contains the lexer, parser, constexpr evaluator, template instantiator, overload resolver, IL walker/copier, diagnostic engine, SARIF output, and CUDA-specific extensions (kernel launch grammar, __shared__/__device__ memory space parsing, atomic builtin stubs).

| Function | Address | Size |
|---|---|---|
| EDG main entry (called from real main) | sub_5D2A80 | |
| Expression parser core | sub_610000-sub_62FFFF | 128 KB |
| Declaration processing | sub_750000-sub_76FFFF | 128 KB |
| Template / constexpr | sub_840000-sub_87FFFF | 256 KB |
| SARIF, diagnostics, keywords | sub_880000-sub_8EFFFF | 448 KB |

Zone 4: CLI / Real Main / Dual-Path Entry (0x8F0000 - 0x9EFFFF)

| Function | Address | Size |
|---|---|---|
| Real main (after CRT/jemalloc init) | sub_8F9C90 | |
| Path A CLI parsing (LibNVVM API mode) | sub_900130 | |
| Path A simple compile entry | sub_902D10 | |
| Path A multi-stage pipeline | sub_905EE0 | 43 KB |
| Path A builtin resolution table | sub_90AEE0 | 109 KB |
| Architecture detection, NVVM option parsing | sub_960000-sub_9EFFFF | 576 KB |

Zone 5: Bitcode Reader / X86 AutoUpgrade / Verifier (0x9F0000 - 0xBFFFFF)

| Sub-range | Contents |
|---|---|
| 0x9F0000-0xAEFFFF | Bitcode reader (sub_A24000 parseFunctionBody ~166KB) |
| 0xAF0000-0xBEFFFF | X86 AutoUpgrade (sub_A939D0 457KB -- legacy intrinsic upgrader) |
| 0xBF0000-0xBFFFFF | LLVM IR Verifier entry points |

Zone 6: LLVM Support Library (0xC00000 - 0xCAFFFF)

1,653 functions. Pure LLVM infrastructure -- no NVIDIA-specific modifications except a single !Flat address space annotation in the sample profile reader at sub_C29E70.

| Sub-range | Functions | Contents |
|---|---|---|
| 0xC00000-0xC0F000 | 65 | IR Verifier (sub_C05FA0 visitInstruction 75KB, sub_C0A940 verify 12KB) |
| 0xC0D4F0 | 1 | sub_C0D4F0 TargetRegistry::lookupTarget (8KB) |
| 0xC0F6D0 | 1 | sub_C0F6D0 IR module linker (48KB) |
| 0xC10000-0xC2FFFF | ~400 | InstrProf reader, Sample Profile reader/writer, hashing |
| 0xC30000-0xC3FFFF | 214 | ImmutableMap/Set, APInt printing |
| 0xC40000-0xC4FFFF | 197 | APInt core arithmetic (div, mul, shift) |
| 0xC50000-0xC5FFFF | 141 | CommandLine parser (cl::opt infrastructure) |
| 0xC60000-0xC6FFFF | 135 | JSON parser, debug counters, error handling |
| 0xC70000-0xC7FFFF | 114 | ConstantRange arithmetic |
| 0xC80000-0xC8FFFF | 194 | SHA-1 hash, regex, SmallVector, sorting |
| 0xC90000-0xC9FFFF | 139 | Timer/profiling, TimeTrace (Chrome trace) |
| 0xCA0000-0xCAFFFF | 186 | YAML lexer/parser, TypeSize, VFS |

Zone 7: NVVM Container, SCEV, DWARF, MC Layer (0xCB0000 - 0x10CFFFF)

This 4 MB zone contains LLVM mid-level infrastructure and the NVVM container format.

| Sub-range | Contents | Key functions |
|---|---|---|
| 0xCB0000-0xCBFA60 | YAML parser/emitter (libyaml) | sub_CB9640 main parser (26KB) |
| 0xCC0130-0xCCABA0 | LLVM Triple parsing | sub_CC0130 Triple_normalize (35KB) |
| 0xCCBB10-0xCDCA30 | NVVM container format | sub_CDD2D0 serialize, sub_CD1D80 deserialize, sub_CCD5F0 version validator (9KB) |
| 0xCD9990 | NVVM options parser (calls 60+ parse helpers) | |
| 0xD60000-0xD82000 | NV Module Summary / LTO | sub_D7D4E0 buildModuleSummary (74KB), sub_D81040 runOnModule (56KB) |
| 0xD83000-0xDFD000 | ScalarEvolution (SCEV) | SCEV framework, AddRecExpr, backedge analysis |
| 0xE00000-0xE0FFFF | DWARF debug info string/enum tables | |
| 0xE10000-0xE2FFFF | Itanium C++ name demangler | sub_E18BB0 parseExpr (47KB) |
| 0xE30000-0xEBFFFF | MC assembler layer | ELF/COFF/MachO section parsers, expression evaluator |
| 0xEC0000-0xED0000 | MC assembler directives | sub_ECB300 ELF section parser (40KB) |
| 0xED0000-0xEF8000 | InstrProf / MemProf reader | Profiling data infrastructure |
| 0xEF8000-0xF05000 | Bitstream remark serialization | |
| 0xF05000-0xF6FFFF | SelectionDAG infrastructure | DAG node creation, SDValue, EVT/MVT helpers |
| 0xF70000-0xF8FFFF | Loop vectorization runtime checks | sub_F77B70 vectorizeLoop (37KB), sub_F72730 canVectorizeMemory (29KB) |
| 0xF90000-0xFCFFFF | SimplifyCFG + code sinking | sub_FB0000 switch table gen, sub_FA0000 speculative exec |
| 0xFD0000-0xFEFFFF | AliasSet, register pressure tracking, CFG graphviz | |
| 0xFF0000-0x101FFFF | Block scheduling, RPO traversal, constant folding | |
| 0x1020000-0x103FFFF | Inline ASM + scheduling model | sub_1035170 CUTLASS kernel detection (41KB) |
| 0x1040000-0x106FFFF | Divergence analysis, DAG utilities, IR linker | |
| 0x1070000-0x10AFFFF | MC object emission, InstructionSimplify | sub_10ACA40 visitAdd (94KB) |

Zone 8: InstCombine Mega-Region (0x10D0000 - 0x122FFFF)

The single largest contiguous pass region in the binary. NVIDIA's modified InstCombine spans 1.4 MB of code, with three NVIDIA-custom opcodes (0x254D, 0x2551, 0x255F) used for proprietary intrinsic folding.

| Sub-range | Contents | Key functions |
|---|---|---|
| 0x10D0000-0x10EFFFF | InstCombine visitors (casts, shifts, memory) | Various visitXxx functions |
| 0x10EE7A0 | InstCombine main visitor | sub_10EE7A0 (405KB / 9,258 lines -- largest function in binary) |
| 0x10F0000-0x1100000 | Sub-visitors for specific opcodes | |
| 0x1100000-0x1170000 | Intrinsic folding, demanded bits | sub_1169C30 intrinsic folder (87KB), sub_11A7600 computeKnownBits (127KB) |
| 0x1180000-0x119FFFF | InstCombine core worklist | sub_1190310 main dispatch (88KB) |
| 0x11A0000-0x11AFFFF | ValueTracking / KnownBits | sub_11AE870 SimplifyDemandedBits |
| 0x11B0000-0x11BFFFF | InstCombine tail (vector, extract/insert) | |
| 0x11D0000-0x11FFFFF | SimplifyLibCalls | Math function optimization |
| 0x11FF000-0x122FFFF | LLVM textual IR parser (LLParser) | |

Zone 9: NVVM Bridge / Builtin System / IR Codegen (0x1230000 - 0x12CFFFF)

This zone is the core NVIDIA bridge between the EDG frontend AST and the LLVM IR optimizer.

| Sub-range | Contents | Key functions |
|---|---|---|
| 0x1230000-0x125FFFF | LLVM IR codegen from AST | Expression, statement, type codegen |
| 0x125FB30 | Path B CLI parsing | sub_125FB30 (standalone/nvcc mode) |
| 0x1262860 | Path B simple compile | sub_1262860 |
| 0x1265970 | Path B multi-stage pipeline | sub_1265970 (48KB) |
| 0x126A7B0 | Builtin lookup helper | sub_126A7B0 |
| 0x126A910 | Builtin registration table | sub_126A910 (126KB) -- registers 717 builtins (IDs 1-770) |
| 0x12B3FD0 | Builtin resolution dispatch | sub_12B3FD0 (103KB) -- giant switch on builtin ID |
| 0x12C06E0 | Bitcode linker | sub_12C06E0 (libdevice linking) |

Zone 10: Pipeline Builder / Pass Options (0x12D0000 - 0x12FFFFF)

The pipeline assembler constructs the complete LLVM pass pipeline, inserting passes by calling factory functions whose addresses are scattered across the entire binary.

| Function | Address | Size |
|---|---|---|
| Module split-range helper | sub_12D3E60 | |
| Pass factory: creates NVIDIA custom pass | sub_12D4560 | 325 B |
| NVVMPassOptions initializer -- populates 222 pass option slots into 4,480-byte struct | sub_12D6300 | 125 KB |
| AddPass -- hash-table-based pass insertion into pipeline | sub_12DE0B0 | 3.5 KB |
| Tier 0 sub-pipeline builder (full optimization, 40 passes) | sub_12DE330 | 4.8 KB |
| Tier 1/2/3 sub-pipeline builder (85-pass superset, tier-gated) | sub_12DE8F0 | |
| Codegen dispatch -- routes to backend machine pass pipeline | sub_12DFE00 | |
| Master pipeline assembler -- 1,553 lines, two major pipelines (normal + fast) | sub_12E54A0 | 49.8 KB |
| Machine pass assembly (Pipeline B fast path) | sub_12EB010 | |
| Machine codegen execution | sub_12EC4F0 | |
| jemalloc core (~400 functions) | sub_12FC000+ | ~256 KB |
| malloc_conf_init (parses 199 config strings from MALLOC_CONF) | sub_12FCDB0 | 129 KB |

Zone 11: IR Infrastructure / PassManager (0x1300000 - 0x16FFFFF)

Dense LLVM infrastructure: IR types, constants, instructions, metadata, use-lists, PassManager execution engine, IR linker, bitcode reader, regex, and DataLayout.

| Sub-range | Contents | Key functions |
|---|---|---|
| 0x1300000-0x135FFFF | IR constants, types, APInt, APFloat | |
| 0x1360000-0x13FFFFF | IR instructions, basic blocks, functions | sub_1361950 AssumptionCacheTracker |
| 0x1400000-0x14FFFFF | TargetLibraryInfo, pass scheduling | sub_149CCE0 TLI wrapper, sub_14A04B0 TLI creation, sub_14A3CD0 NVPTX TargetPassConfig |
| 0x1500000-0x15FFFFF | IR builder, GEP, PHI, branch creation | sub_15F83E0 conditional branch, sub_15F9210 load, sub_15F9650 store |
| 0x1600000-0x160FFFF | PassManager execution engine | sub_160FB70 PassManager::run, sub_1611EE0 PassManagerBuilder init |
| 0x1610000-0x162FFFF | Pass scheduling, metadata RAUW | sub_1619140 register target passes, sub_1619BD0 PassManager::finalize |
| 0x1630000-0x16FFFFF | IR Linker, bitcode reader, regex | sub_16786A0 IRLinker::run (61KB), sub_166A310 parseFunctionBody (60KB) |

Zone 12: InstCombine (NewPM) + Sanitizers + PGO (0x1700000 - 0x17FFFFF)

946 functions. Dominated by the New Pass Manager version of InstCombine (~600 functions, ~3.5 MB decompiled), with sanitizer instrumentation (MSan, TSan, coverage) and PGO/GCov infrastructure.

| Sub-range | Contents | Key functions |
|---|---|---|
| 0x1700000-0x17B0000 | InstCombine (NewPM) | sub_1743DA0 main visitor (168KB), sub_17A9010 liveness (111KB) |
| 0x17B0000-0x17BFFFF | GCov instrumentation | sub_17BF860 coverage notes (53KB) |
| 0x17C0000-0x17CFFFF | PGO indirect-call promotion | sub_17C2DB0 (39KB) |
| 0x17D0000-0x17DFFFF | MemorySanitizer | sub_17DDCE0 shadow propagation (58KB) |
| 0x17E0000-0x17EFFFF | PGO instrumentation | sub_17EEF60 InstrProfiling reader (81KB) |
| 0x17F0000-0x17FFFFF | ThreadSanitizer, SanitizerCoverage | sub_17FF260 TSan entry (51KB), sub_17F91F0 SanCov (44KB) |
| sub_17060B0 | PrintModulePass (debug dump, inserted ~30x in pipeline) | |

Zone 13: GVN + Scalar Passes + NVIDIA Custom IR Passes (0x1800000 - 0x1CFFFFF)

This 5 MB zone contains the bulk of LLVM's scalar optimization passes and all of NVIDIA's custom IR-level passes.

GVN family (0x1900000 - 0x193FFFF):

Function | Address | Size
GVN::runOnFunction (core fixed-point iteration) | sub_1900BB0 | 83 KB
GVN PRE (Partial Redundancy Elimination) | sub_1906720 | 26 KB
NewGVN expression printing | sub_1930810 | 3 KB
NewGVN core value numbering | sub_1933B40 | 43 KB

Standard scalar passes (0x1830000 - 0x1AFFFFF):

Function (pipeline factory call) | Address
InstructionCombining (Old PM wrapper) | sub_1832270
TailCallElim / JumpThreading | sub_1833EB0
FunctionAttrs | sub_1841180
SCCP (Sparse Conditional Constant Propagation) | sub_1842BC0
ConstantMerge / GlobalDCE | sub_184CD60
NVVMReflect | sub_1857160
IPConstantPropagation / ArgumentPromotion | sub_185D600
Sink / MemorySSA | sub_1869C50
NVVMPredicateOpt / SelectionOpt | sub_18A3430
LoopPass (barrier optimization) | sub_18B1DE0
DCE (Dead Code Elimination) | sub_18DEFF0
CorrelatedValuePropagation | sub_18EEA90
DSE (Dead Store Elimination) | sub_18F5480
DeadArgumentElimination | sub_18FD350
SimplifyCFG | sub_190BB10
LICM / LoopRotate | sub_195E880
LoopIndexSplit | sub_1952F90
LoopUnroll / LoopVectorize | sub_197E720
LoopSimplify / IndVarSimplify | sub_198DF00
SROA (Scalar Replacement of Aggregates) | sub_198E2A0
InstCombine variant | sub_19401A0
SROA variant / LoopUnswitch | sub_19B73C0
NVIDIA pass (unknown) | sub_19CE990
NVVMRematerialization (IR-level remat) | sub_1A13320
NVVMIRVerification | sub_1A223D0
LLVM standard pass pipeline (parameterized, called ~8x with different configs) | sub_1A62BF0
LoopIdiomRecognize / IndVarSimplify | sub_1A68E70
InstructionSimplify / ValueTracking | sub_1A7A9F0

Loop unrolling + switch lowering (0x1B00000 - 0x1B7FFFF):

Function | Address | Size
LoopUnroll main driver | sub_1B01A40 | 68 KB
Unroll-and-Jam | sub_1B07290 | 55 KB
Loop peeling | sub_1B0BF10 | 39 KB
Unroll prologue/epilogue generation | sub_1B12B90 | 65 KB
Code sinking (".sink.split") | sub_1B51110 | 51 KB
SimplifyCFG condition combining | sub_1B5C580 | 30 KB
Switch-to-lookup-table transformation | sub_1B60700 | 83 KB

Loop/SLP vectorizer (0x1B80000 - 0x1BFFFFF):

Function | Address | Size
LoopVectorize main driver ("loop-vectorize") | sub_1BB6740 | 43 KB
VPlan builder | sub_1BAB460 | 32 KB
SLP horizontal reduction ("slp-vectorizer") | sub_1BDDB00 | 47 KB
SLP shuffle/reorder engine | sub_1BD0660 | 62 KB

NVVM module validation + configuration (0x1C00000 - 0x1C3FFFF):

Function | Address | Size
NVVM codegen config parser (70+ knobs: AdvancedRemat, CSSACoalescing, DoMMACoalescing, PGO, OCGKnobs) | sub_1C20170 | 33 KB
NVVM compile mode parser (WHOLE_PROGRAM_NOABI/ABI, SEPARATE_ABI, opt level, debug info) | sub_1C21CE0 | 28 KB
Kernel attribute validator (cluster launch, parameter size, Hopper constraints) | sub_1C32740 | 30 KB
NVVM intrinsic lowering (tex/surf/syncwarp/ISBE/MAP/ATTR validation) | sub_1C36530 | 112 KB
NVVM module validator (data layout, target triple, UnifiedNVVMIR) | sub_1C3BC10 | 48 KB

NVIDIA custom IR passes (0x1C40000 - 0x1CFFFFF):

This 1 MB block contains the majority of NVIDIA's proprietary IR-level optimization passes. Most of the passes listed here have no upstream LLVM equivalent; a handful (ADCE, EarlyCSE, GVN variants) are modified copies of standard passes.

Function | Address | Size | Role
Dead Synchronization Elimination -- removes redundant __syncthreads() barriers via fixed-point R/W dataflow | sub_1C47810 | 63 KB | dead-sync-elim
Alloca cloning / PHI insertion (mem2reg extension) | sub_1C4D210 | 69 KB |
NVIDIA pass helper (dead-sync / common-base infrastructure) | sub_1C585C0 | 39 KB |
Common Base Elimination -- removes redundant base address computations | sub_1C5DFC0 | 39 KB | common-base-elim
Block-level analysis infrastructure ("Processing", "Block") | sub_1C5FDC0 | 26 KB |
Base address bitcast helper ("baseValue", "bitCastEnd") | sub_1C637F0 | 28 KB |
Base Address Strength Reduction ("BaseAddressStrengthReduce") | sub_1C67780 | 59 KB | base-addr-sr
MemorySpaceOpt loop index analysis ("phi maxLoopInd") | sub_1C6A6C0 | 54 KB |
GVN or LICM variant | sub_1C6E800 | |
ADCE (Aggressive DCE) | sub_1C6FCA0 | |
MemorySpaceOpt function cloning -- specializes generic pointers to global/shared/local | sub_1C70910 | 75 KB | memspace-opt (core)
LoopIndexSplit -- splits loops on index conditions (three modes: all-but-one, single-iter, range-split) | sub_1C7B2C0 | 84 KB | loop-index-split
Memmove Unrolling -- forward/reverse element copy loops | sub_1C82A50 | 40 KB | lower-aggr-copies
Struct/Aggregate Splitting -- element-wise memcpy decomposition | sub_1C86CA0 | 73 KB | lower-aggr-copies
EarlyCSE / GVN variant | sub_1C8A4D0 | |
FP128/I128 Emulation -- replaces 128-bit ops with __nv_* library calls | sub_1C8C170 | 26 KB | lower-ops
MemorySpaceOpt entry (pipeline factory address) | sub_1C8E680 | | nvvm-memspace-opt
NVVMLowerBarriers / BarrierLowering | sub_1C98160 | |
MemorySpaceOpt address space resolution (warnings for illegal atomics on const/local) | sub_1CA2920 | 32 KB |
MemorySpaceOpt secondary resolver | sub_1CA9E90 | 28 KB |
Printf Lowering -- lowers printf to vprintf + local buffer packing | sub_1CB1E60 | 31 KB | printf-lowering
NVVMIntrinsicLowering (most frequently inserted pass, ~10 occurrences in pipeline) | sub_1CB4E40 | | nvvm-intrinsic-lower
NVVMBranchDist | sub_1CB73C0 | | branch-dist
RLMCAST transformation (register-level multicast) | sub_1CBFA40 | 75 KB |
NVVMSinking2 (NVIDIA enhanced code sinking) | sub_1CC60B0 | | sinking2
IV Demotion -- narrows 64-bit induction variables to 32-bit ("demoteIV", "newBaseIV") | sub_1CD74B0 | 75 KB | iv-demotion
NLO (NVIDIA Live Output) helper ("nloNewAdd", "nloNewBit") | sub_1CDC1F0 | 35 KB |
Instruction classification / cost model (NLO/remat) | sub_1CDE4D0 | 80 KB |
Simplify Live Output (NLO pass -- "nloNewBit") | sub_1CE10B0 | 48 KB |
Rematerialization pull-in cost analysis ("Total pull-in cost") | sub_1CE3AF0 | 56 KB |
Rematerialization block executor ("remat_", "uclone_" prefixes) | sub_1CE67D0 | 32 KB |
NVVMRematerialization main driver -- live-in/live-out pressure analysis per block | sub_1CE7DD0 | 67 KB | remat
Final NVVM lowering / intrinsic cleanup | sub_1CEBD10 | |
Formal parameter space overflow checker | sub_1CEE970 | 27 KB |
NVVMPeephole | sub_1CEF8F0 | | nvvm-peephole
Instruction scheduling helper (physical register constraints) | sub_1CFDD60 | 49 KB |

Zone 14: SelectionDAG ISel / CodeGenPrepare / Backend (0x1D00000 - 0x1EFFFFF)

Sub-range | Contents | Key functions
0x1D00000-0x1D60000 | SelectionDAG ISel core | sub_1D4BB00 bytecode interpreter (97KB, 131-case switch), sub_1D54C20 runOnMachineFunction (72KB, "sdagisel")
0x1D1B0D0 | computeKnownBits | sub_1D1B0D0 (87KB, 62-case ISD switch)
0x1D210A0 | SimplifyDemandedBits | sub_1D210A0 (46KB, 118-case switch, calls NVPTX hooks at sub_1F58D40)
0x1D70000-0x1D7FFFF | CodeGenPrepare | sub_1D73760 address sinking (65KB, "sunkaddr")
0x1D07BB0 | Pre-RA instruction scheduling (57 KB) |
0x1D80000-0x1DFFFFF | Deque worklist, block splitting | sub_1D7AA30 (74KB, ".unlikely", ".cond.split")
0x1E00000-0x1EFFFFF | Register allocation infrastructure | Greedy RA, live intervals, spill cost

Zone 15: Backend CodeGen Infrastructure (0x1F00000 - 0x20FFFFF)

Sub-range | Contents | Key functions
0x1F00000-0x1F0C000 | ScheduleDAG infrastructure | sub_1F0A020 DAG builder/emitter (41KB)
0x1F0BF50-0x1F0EBC0 | Shrink Wrapping | sub_1F0DCB0 core analysis (27KB, "shrink-wrap")
0x1F10000-0x1F15000 | SlotIndexes + SpillPlacement | sub_1F10320 "slotindexes", sub_1F12110 "spill-code-placement"
0x1F15000-0x1F1F000 | LiveInterval utilities | sub_1F19E60 "Impossible to implement partial COPY"
0x1F20000-0x1F5FFFF | Register coalescer, VirtRegRewriter |
0x1F58D40 | NVPTX target hook for SimplifyDemandedBits |
0x1F60000-0x1FFFFFF | TwoAddressInstruction, stack protection |
0x2000000-0x20FFFFF | LegalizeTypes | sub_20019C0 (341KB -- third largest function in binary)

Zone 16: NVPTX Target Backend (0x2100000 - 0x21FFFFF)

Sub-range | Contents | Key functions
0x2100000-0x210FFFF | Register allocation support | sub_210BC20 seedLiveRegs ("regalloc"), sub_210BE60 "ran out of registers"
0x2110000-0x212FFFF | DAG type legalization/promotion |
0x2130000-0x213FFFF | DAG combiners, ISel patterns |
0x2140000-0x214FFFF | NVPTXAsmPrinter | PTX header/kernel emission
0x2150000-0x215FFFF | PTX function/param emission | sub_215D9D0 NVVMAnnotationsProcessor / GenericToNVVM
0x2160000-0x216FFFF | NVPTXTargetMachine | Pass pipeline, SubtargetInfo
0x2170000-0x218AFFF | Atomics lowering, rematerialization (machine-level) |
0x21BC000-0x21BFFFF | Alloca hoisting, image opt |
0x21C0000-0x21CFFFF | MemorySpace lowering (machine-level) |
0x21D0000-0x21DFFFF | DAG lowering mega-function, peephole, prolog/epilog |
0x21E0000-0x21EFFFF | MMA/tensor codegen, atomics, special regs, cluster ops |
0x21F0000-0x21FFFFF | Ldg transform, vec split, mem2reg, register pressure |

Zone 17: New PM Pass Registration (0x2340000 - 0x23FFFFF)

Function | Address | Size
Master pass registration -- registers all 526 passes (121 module + 174 function + 23 loop + 48 MF + analyses) into StringMap | sub_2342890 | ~2,816 lines
Print available passes (--print-pipeline-passes) | sub_233C410 |
Function pass pipeline text parser | sub_233F860 |
Module pipeline text parser | sub_2377300 |
Inner function/loop pipeline parser | sub_2368220 |
Alias analysis name resolver (globals-aa, basic-aa, scev-aa, tbaa) | sub_233BD40 |
Hash table insertion (pass_name -> constructor) | sub_E41FB0 |

Zone 18: IPO / Attributor / OpenMP Optimization (0x2400000 - 0x29FFFFF)

Sub-range | Contents | Key functions
0x2400000-0x25FFFFF | Attributor framework | sub_251CD10 runTillFixpoint (53KB)
0x2590000-0x265FFFF | Sanitizer instrumentation (ASan, HWASan) |
0x266E000-0x269FFFF | OpenMP target offloading | sub_2686D90 runtime table (215KB, ~160 __kmpc_* entries), sub_26968A0 Generic-to-SPMD transform (61KB, "OMP120")
0x2678420 | OpenMP state machine for generic kernels (41 KB) |
0x2680940 | Parallel region merging (52 KB) |
0x26A0000-0x29FFFFF | Coroutine support, LTO infrastructure, PGO lowering |

Zone 19: Loop Transforms (0x2A00000 - 0x2CFFFFF)

Function | Address | Size
LoopPeeling ("llvm.loop.peeled.count") | sub_2A07DE0 | 76 KB
LoopRotation (".lr.ph", "h.rot") | sub_2A0CFD0 | 65 KB
UnrollLoop main ("loop-unroll", "UnrollCount") | sub_2A15A20 | 85 KB
UnrollAndJamLoop ("loop-unroll-and-jam") | sub_2A1CF00 | 58 KB
Runtime unrolling (".epil.preheader", ".prol.preheader") | sub_2A25260 | 91 KB
IndVarSimplify IV widening ("iv.rem", ".sext", ".zext") | sub_2A76A40 | 67 KB
WidenIV / IV transformation | sub_2A79EE0 | 82 KB
Dead Synchronization Elimination (island -- the larger copy; see also sub_1C47810) | sub_2C84BA0 | 94 KB

Note: sub_2C84BA0 is a second copy of the dead synchronization elimination pass located outside the main NVIDIA custom pass zone. This is the 94KB variant analyzed in depth (p2b.6-01), with the four-category fixed-point R/W dataflow algorithm and red-black tree maps.

Zone 20: Codegen Target Options / SelectionDAG Lowering (0x2D00000 - 0x2FFFFFF)

5,217 functions. Contains LLVM TargetMachine option registration and the core SelectionDAG infrastructure used by the NVPTX backend.

Sub-range | Contents | Key functions
0x2D00000-0x2D8FFFF | SelectionDAG core | DAG combine, node creation, legalization helpers
0x2D97F20 | TargetOptions registration (112 KB) | all cl::opt for -march/-mcpu/-mattr/relocation/code model
0x2E00000-0x2FFFFFF | SelectionDAG continued | Type legalization, custom lowering, pattern matching

Zone 21: NVPTX ISel + SelectionDAG Lowering (0x3000000 - 0x36FFFFF)

7 MB. The NVPTX instruction selection and target-specific DAG lowering.

Sub-range | Contents | Key functions
0x3000000-0x328FFFF | DAG node construction, EVT/MVT helpers |
0x3290000-0x32FFFFF | NVPTXTargetLowering | sub_32E3060 LowerOperation dispatcher (111KB), sub_32A1EF0 type legalization (109KB), sub_32D2680 load/store lowering (81KB)
0x3300000-0x33AFFFF | Intrinsic lowering (DAG level) | sub_33B0210 intrinsic switch (343KB)
0x33B0000-0x36FFFFF | ISel pattern helpers, register info |

Zone 22: NVPTX Instruction Selector / Machine Tail (0x3700000 - 0x3BFFFFF)

Sub-range | Contents | Key functions
0x3700000-0x37AFFFF | Table-driven instruction selector | sub_376DE90 main pattern matcher (138KB -- per-SM opcode legality gating via compressed table at offset 521536)
0x372FEE0 | DAG operand tree copier (recursive, 104 KB) |
0x374DD20 | NVPTX custom lowering entry (67 KB) |
0x3900000-0x396FFFF | NVIDIA register pressure / remat (machine-level) | sub_396A6C0 RP reporting ("Register Pressure: N"), sub_3964ED0 ".remat" naming
0x3937240 | ABI Preserve directive emission (14 KB) |
0x395CFD0 | GEP Splitting pass (11 KB) |
sub_395DD20 | DAG pattern computation (66 KB) |
0x3970000-0x397FFFF | AsmPrinter / PTX emission | sub_3979400 emitFunctionBody (62KB), sub_397DF10 emitInlineAsm (30KB)
sub_3970E40 | BB print + .pragma "nounroll" (18 KB) |
0x3980000-0x3BFFFFF | MC layer, DWARF, ELF emission | Object file writers, section management

Pass Factory Address Summary

The pipeline assembler (sub_12E54A0) calls pass factory functions to construct the pipeline. Each factory address below is called directly from the pipeline builder and uniquely identifies a pass in the binary.

Factory address | Pass identity | Type
sub_1654860 | BreakCriticalEdges | F
sub_17060B0 | PrintModulePass (debug dump) | M
sub_1832270 | InstructionCombining | F
sub_1833EB0 | TailCallElim / JumpThreading | F
sub_1841180 | FunctionAttrs | M
sub_1842BC0 | SCCP | F
sub_184CD60 | ConstantMerge / GlobalDCE | M
sub_1857160 | NVVMReflect | F
sub_185D600 | IPConstantPropagation | M
sub_1869C50 | Sink / MemorySSA | F
sub_18A3430 | NVVMPredicateOpt | F
sub_18B1DE0 | LoopPass (barrier opt) | F
sub_18DEFF0 | DCE | F
sub_18EEA90 | CorrelatedValuePropagation | F
sub_18F5480 | DSE | F
sub_18FD350 | DeadArgumentElimination | M
sub_190BB10 | SimplifyCFG | F
sub_195E880 | LICM / LoopRotate | F
sub_1952F90 | LoopIndexSplit | L
sub_197E720 | LoopUnroll / LoopVectorize | F
sub_198DF00 | LoopSimplify / IndVarSimplify | F
sub_198E2A0 | SROA | F
sub_19401A0 | InstCombine variant | F
sub_19B73C0 | SROA variant / LoopUnswitch | F
sub_19CE990 | NVIDIA pass (unknown) | F
sub_1A13320 | NVVMRematerialization (IR-level) | F
sub_1A223D0 | NVVMIRVerification | M
sub_1A62BF0 | LLVM standard pass pipeline (parameterized) | M
sub_1A68E70 | LoopIdiomRecognize | F
sub_1A7A9F0 | InstructionSimplify | F
sub_1B26330 | MemCpyOpt | F
sub_1B7FDF0 | Reassociate / Sinking | F
sub_1C4B6F0 | AlwaysInliner | M
sub_1C6FCA0 | ADCE | F
sub_1C8A4D0 | EarlyCSE | F
sub_1C8E680 | NVVMMemorySpaceOpt | M
sub_1C98160 | NVVMLowerBarriers | F
sub_1CB4E40 | NVVMIntrinsicLowering (~10 insertions) | F
sub_1CB73C0 | NVVMBranchDist | F
sub_1CC60B0 | NVVMSinking2 | F
sub_1CE7DD0 | NVVMRematerialization (main) | F
sub_1CEBD10 | Final NVVM lowering | F
sub_1CEF8F0 | NVVMPeephole | F
sub_1CB0F50 | ProfileSummaryInfoWrapper / NVVMModulePass | F
sub_12D4560 | NVVMVerifier / ModuleVerifier | M
sub_215D9D0 | NVVMAnnotationsProcessor | M
sub_149CCE0 | TargetLibraryInfoWrapperPass | M
sub_1BFB520 | TargetTransformInfoWrapperPass | F
sub_14A7550 | createVerifierPass / BasicAliasAnalysis | M
sub_1361950 | AssumptionCacheTracker | M

Type: M = ModulePass, F = FunctionPass, L = LoopPass.

Embedded Data Payloads

Libdevice Bitcode

Two identical copies of NVIDIA's libdevice are embedded directly in the .rodata section as raw LLVM bitcode. Each copy is approximately 456 KB and contains around 400 math intrinsic implementations (__nv_sinf, __nv_expf, __nv_sqrtf, etc.). The duplication supports the dual-path architecture: Path A (LibNVVM API mode) references one copy at 0x3EA0080; Path B (standalone mode) references the other at 0x420FD80. The bitcode is linked into the user's module during the LNK phase via the bitcode linker at sub_12C06E0.
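Locating embedded bitcode like this is mechanical: raw LLVM bitcode always begins with the 4-byte magic "BC" 0xC0 0xDE. A minimal Python sketch that scans any binary blob for candidate payloads (generic technique, not specific to cicc's layout):

```python
def find_bitcode_offsets(blob: bytes) -> list[int]:
    """Return file offsets of candidate raw LLVM bitcode payloads.

    Raw bitcode starts with the magic bytes 'BC' 0xC0 0xDE."""
    magic = b"BC\xc0\xde"
    offsets = []
    start = 0
    while (pos := blob.find(magic, start)) != -1:
        offsets.append(pos)
        start = pos + 1  # keep scanning past this hit
    return offsets


if __name__ == "__main__":
    # Synthetic example: two embedded payloads, as in cicc's dual-path layout.
    blob = b"\x00" * 16 + b"BC\xc0\xde" + b"x" * 8 + b"BC\xc0\xde"
    print(find_bitcode_offsets(blob))  # [16, 28]
```

Running this over the cicc file should report hits at the two .rodata offsets given above (plus any false positives, which a bitcode parser can weed out).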

String Tables

IDA Pro extracts 188,141 strings from the binary. These fall into several categories:

Category | Approximate count | Example
LLVM cl::opt descriptions | ~1,689 | "Enable aggressive reassociation"
LLVM error/diagnostic messages | ~5,000 | "Invalid bitcode signature"
EDG error messages | ~2,500 | "expected a declaration"
LLVM pass names | ~440 | "instcombine", "gvn", "nvvm-memspace-opt"
PTX instruction templates | ~800 | "mov.b32 %0, %1;"
NVVM builtin names | ~770 | "__nvvm_atom_cas_gen_i"
jemalloc config strings | ~200 | "background_thread", "dirty_decay_ms"
NVVM container field names | ~144 | "SmMajor", "FastMath.Ftz"
Miscellaneous (format strings, assertions) | ~170,000+ | "%s:%d: assertion failed"

String cross-referencing is the single most productive technique for identifying functions in a stripped binary. The LLVM pass registration pattern is especially reliable: a string like "nvvm-memspace-opt" appears exactly once, in the constructor of that pass, which IDA locates via xref.
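The string harvest itself does not require IDA. A rough sketch of the extraction step in plain Python (the dash-joined lowercase pattern is only a heuristic for LLVM-style pass names, not an exhaustive filter):

```python
import re


def extract_strings(blob: bytes, min_len: int = 5) -> list[str]:
    """Pull printable-ASCII runs out of a binary, like the Unix `strings` tool."""
    return [m.group().decode()
            for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, blob)]


def pass_name_candidates(blob: bytes) -> list[str]:
    """Heuristic filter for LLVM-style pass names: lowercase words joined by dashes."""
    pat = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)+$")
    return [s for s in extract_strings(blob) if pat.match(s)]
```

Feeding the cicc binary through pass_name_candidates surfaces strings like "nvvm-memspace-opt" and "loop-unroll"; the xref work to tie each one to its registration site still happens in IDA.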

NVVM Container Format

The binary includes a proprietary container format for wrapping LLVM bitcode with compilation metadata. The container uses a 24-byte binary header with magic 0x7F4E5C7D, followed by delta-encoded tag/value pairs (only fields that differ from defaults are serialized). There are 144 distinct tag IDs spanning core options (tags 1-39), compression metadata (tag 99), extended target options (tags 101-173), blob data (tags 201-218), and structured hardware descriptors (tags 401-402 for TMA/TCGen05 configurations). Serialization and deserialization are handled by sub_CDD2D0 and sub_CD1D80 respectively.
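To make the delta-encoded tag/value idea concrete, here is an illustrative parser. Only the magic constant and the 24-byte header size come from the analysis above; the header field layout and the (tag, length, value) record encoding shown here are assumptions for demonstration, not the recovered wire format:

```python
import struct

NVVM_CONTAINER_MAGIC = 0x7F4E5C7D  # magic recovered from the binary


def parse_container(buf: bytes) -> dict[int, bytes]:
    """Illustrative parser: 24-byte header, then (tag:u16, len:u16, value) records.

    Only fields that differ from defaults appear, so absent tags mean
    'use the default'. The record encoding here is a guess for demonstration."""
    (magic,) = struct.unpack_from("<I", buf, 0)
    if magic != NVVM_CONTAINER_MAGIC:
        raise ValueError("not an NVVM container")
    fields = {}
    off = 24  # skip the fixed-size header
    while off + 4 <= len(buf):
        tag, length = struct.unpack_from("<HH", buf, off)
        off += 4
        fields[tag] = buf[off:off + length]
        off += length
    return fields
```

The real deserializer is sub_CD1D80; this sketch only conveys the shape of a delta-encoded tag stream.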

jemalloc Integration

NVIDIA statically links jemalloc 5.3.x as the process-wide memory allocator. The jemalloc functions cluster around 0x12FC000 (approximately 400 functions). The configuration initialization function sub_12FCDB0 (129 KB, one of the largest functions in the binary) parses 199 configuration strings from the MALLOC_CONF environment variable.

Key jemalloc entry points visible in the binary:

Function | Address
malloc_conf_init (199 config strings) | 0x12FCDB0
vsnprintf (jemalloc stats formatting) | 0x40D5CA
Core arena management, tcache, extent allocator | 0x12FC000 range

The jemalloc integration is significant for reverse engineering because it means malloc/free calls throughout the binary resolve to jemalloc's arena-based allocator rather than glibc's ptmalloc2. When tracing memory allocation patterns in IDA, look for calls into the 0x12FC000 range.
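jemalloc's MALLOC_CONF syntax is a comma-separated list of key:value pairs, which is what the 199-string parser consumes. A small sketch of that option-string format (the splitting logic here is ours; the real parser also validates each key against its option table):

```python
def parse_malloc_conf(conf: str) -> dict[str, str]:
    """Parse a jemalloc MALLOC_CONF-style string: comma-separated key:value pairs."""
    out = {}
    for item in filter(None, conf.split(",")):
        key, _, value = item.partition(":")
        out[key] = value
    return out


if __name__ == "__main__":
    print(parse_malloc_conf("background_thread:true,dirty_decay_ms:5000"))
```

Setting MALLOC_CONF before launching cicc is also a cheap way to confirm the embedded allocator is live (e.g. stats_print:true dumps jemalloc statistics at exit).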

Global Constructors

The region from 0x430000 to 0x5CFFFF (~1.6 MB) is dominated by global constructors that execute before main(). The primary purpose of these constructors is LLVM cl::opt registration: approximately 1,689 command-line option objects are initialized, each registering a string name, description, default value, and storage location into LLVM's global option registry.

The .init_array section contains function pointers to these constructors. They execute in linker-determined order and populate a global hash table that sub_8F9C90 (the real main) later queries during CLI parsing. In IDA Pro, navigating to any cl::opt constructor reveals the option name string and its associated global variable, which is invaluable for understanding what flag controls what behavior.

Additional global constructors handle:

  • LLVM pass registration (RegisterPass<T> and PassInfo objects)
  • LLVM target initialization (NVPTX target machine factory)
  • jemalloc allocator bootstrapping
  • EDG frontend static initialization tables

Dual-Path Code Duplication

A distinctive structural feature of the binary is the presence of two near-complete copies of the NVVM bridge and backend entry points. Path A (LibNVVM API mode) lives around 0x90xxxx; Path B (standalone/nvcc mode) lives around 0x126xxxx. Each path has its own:

Component | Path A | Path B
Simple compile entry | sub_902D10 | sub_1262860
Multi-stage pipeline | sub_905EE0 (43 KB) | sub_1265970 (48 KB)
CLI parsing | sub_900130 | sub_125FB30
Builtin resolution table | sub_90AEE0 (109 KB) | sub_126A910 (123 KB)
Embedded libdevice ref | unk_3EA0080 | unk_420FD80
Version string | nvvm-latest | nvvm70

In IDA, if you have identified a function in one path, search for a structurally similar function at the corresponding offset in the other path. The code is not byte-identical -- Path B is generally slightly larger due to additional standalone-mode logic -- but the control flow graphs are nearly congruent.

IDA Pro Navigation Tips

When opening cicc in IDA Pro for the first time, the auto-analysis will take several minutes due to the 60 MB size. The following workflow accelerates orientation:

  1. Start with strings. Open the Strings window (Shift+F12), filter for known LLVM pass names ("instcombine", "gvn", "nvvm-"). Each xref leads directly to a pass constructor or registration site.

  2. Use the address map above. If you are looking at an address in the 0xC00000-0x12CFFFF range, you are in LLVM optimization passes. The 0x3000000-0x36FFFFF range is NVPTX instruction selection. The 0x5D0000-0x8EFFFF range is EDG. Context narrows the search space immediately.

  3. Watch for vtable patterns. LLVM passes are C++ classes with virtual methods. IDA's vtable reconstruction reveals inheritance hierarchies. Every FunctionPass, ModulePass, and LoopPass subclass has a vtable with runOnFunction/runOnModule at a consistent slot offset.

  4. Anchor on mega-functions. The largest functions are the easiest to locate and serve as landmarks: sub_A939D0 (457 KB, X86 AutoUpgrade), sub_10EE7A0 (396 KB, InstCombine), sub_20019C0 (341 KB, LegalizeTypes). These anchors partition the address space.

  5. Follow the pipeline. Entry at sub_8F9C90 calls into EDG at sub_5D2A80, pipeline assembly at sub_12E54A0, and PTX emission starting at 0x2100000. Tracing callgraph edges from these known entry points maps out the entire compilation flow.

  6. Mark jemalloc early. Identifying and labeling the jemalloc cluster at 0x12FC000 prevents wasted time reverse-engineering well-known allocator internals. The 199-string malloc_conf_init function is an unmistakable fingerprint.

  7. Locate NVIDIA passes via factory addresses. The Pass Factory Address Summary table above maps every pipeline-inserted pass to its constructor address. In IDA, setting a breakpoint at sub_12DE0B0 (AddPass) and logging the second argument reveals the exact pass insertion order at runtime.

Master Address-Range Map

The definitive quick-reference for "what lives at address X?" Every major address range in the cicc v13.0 binary, sorted by start address, consolidated from all subsystem pages in this wiki.

.text Section (0x400000 - 0x3BFFFFF)

Start | End | Size | Subsystem | Zone
0x400000 | 0x40CFFF | 52 KB | CRT startup (_start, libc stubs) | 1
0x40D000 | 0x41FFFF | 80 KB | jemalloc stats (vsnprintf at sub_40D5CA) | 1
0x420000 | 0x42FFFF | 64 KB | libc helpers (memcpy, memset, strlen, math) | 1
0x430000 | 0x5CFFFF | 1.6 MB | Global constructors (~1,689 cl::opt registrations, pass/target init) | 2
0x5D0000 | 0x8EFFFF | 3.2 MB | EDG 6.6 C++ Frontend (parser, constexpr, templates, IL walkers, SARIF, preprocessor) | 3
0x8F0000 | 0x8FFFFF | 64 KB | Real main / CLI (sub_8F9C90 entry, flag mapping, XOR deobfuscator) | 4
0x900000 | 0x92FFFF | 192 KB | Path A entry (LibNVVM API: CLI parse, pipeline driver, builtin tables) | 4
0x930000 | 0x95FFFF | 192 KB | Path A builtins (pre-opt builtin lowering, 770-entry resolution) | 4
0x960000 | 0x9EFFFF | 576 KB | Architecture detection (-arch fan-out, NVVM option parsing) | 4
0x9F0000 | 0xAEFFFF | 1 MB | Bitcode reader (parseFunctionBody 166KB, metadata reader 121KB) | 5
0xAF0000 | 0xBEFFFF | 1 MB | X86 AutoUpgrade (sub_A939D0 457KB -- legacy intrinsic upgrader) | 5
0xBF0000 | 0xBFFFFF | 64 KB | LLVM IR Verifier (entry points, visitCallInst 207KB) | 5
0xC00000 | 0xCAFFFF | 704 KB | LLVM Support/ADT (APInt, CommandLine, ConstantRange, JSON, Timer, YAML, VFS) | 6
0xCB0000 | 0xCBFFFF | 64 KB | YAML parser/emitter (libyaml) | 7
0xCC0000 | 0xCCFFFF | 64 KB | LLVM Triple parsing (Triple_normalize 35KB) | 7
0xCCD000 | 0xCDFFFF | 76 KB | NVVM container format (serialize sub_CDD2D0, deserialize sub_CD1D80, 144 tags) | 7
0xCE0000 | 0xD5FFFF | 512 KB | NVVM options (container validators, option parsers) | 7
0xD60000 | 0xD82FFF | 140 KB | NV Module Summary / LTO (buildModuleSummary 74KB, runOnModule 56KB) | 7
0xD83000 | 0xDFFFFF | 500 KB | ScalarEvolution (SCEV) (AddRecExpr, backedge analysis, trip counts) | 7
0xE00000 | 0xE0FFFF | 64 KB | DWARF debug info (string/enum tables) | 7
0xE10000 | 0xE2FFFF | 128 KB | Itanium name demangler (parseExpr 47KB) | 7
0xE30000 | 0xEBFFFF | 576 KB | MC assembler layer (ELF/COFF/MachO section parsers, expression evaluator) | 7
0xEC0000 | 0xED0000 | 64 KB | MC directives (sub_ECB300 ELF section parser 40KB) | 7
0xED0000 | 0xEF8000 | 160 KB | InstrProf / MemProf reader (profiling data infrastructure) | 7
0xEF8000 | 0xF05000 | 52 KB | Bitstream remark serialization | 7
0xF05000 | 0xF6FFFF | 428 KB | SelectionDAG infrastructure (DAG node creation, SDValue, EVT/MVT helpers) | 7
0xF70000 | 0xF8FFFF | 128 KB | Loop vectorization runtime checks (vectorizeLoop 37KB, canVectorizeMemory 29KB) | 7
0xF90000 | 0xFCFFFF | 256 KB | SimplifyCFG + code sinking (switch table gen, speculative exec) | 7
0xFD0000 | 0xFEFFFF | 128 KB | AliasSet / register pressure (CFG graphviz) | 7
0xFF0000 | 0x101FFFF | 192 KB | Block scheduling (RPO traversal, constant folding) | 7
0x1020000 | 0x103FFFF | 128 KB | Inline ASM + scheduling model (CUTLASS kernel detection 41KB) | 7
0x1040000 | 0x106FFFF | 192 KB | Divergence analysis (DAG utilities, IR linker) | 7
0x1070000 | 0x10CFFFF | 384 KB | MC object emission + InstructionSimplify (visitAdd 94KB) | 7
0x10D0000 | 0x122FFFF | 1.4 MB | InstCombine mega-region (main visitor 396KB, KnownBits 125KB, SimplifyLibCalls, LLParser) | 8
0x1230000 | 0x12CFFFF | 640 KB | NVVM Bridge / IR codegen (AST-to-IR, Path B entry, builtin tables, bitcode linker) | 9
0x12D0000 | 0x12FBFFF | 176 KB | Pipeline builder (NVVMPassOptions 125KB, AddPass, tier builders, master assembler 50KB) | 10
0x12FC000 | 0x133FFFF | 256 KB | jemalloc core (~400 functions, malloc_conf_init 129KB) | 10
0x1340000 | 0x16FFFFF | 3.8 MB | IR infrastructure / PassManager (IR types, constants, instructions, metadata, execution engine, IR linker) | 11
0x1700000 | 0x17FFFFF | 1 MB | InstCombine (NewPM) + Sanitizers + PGO (MSan, TSan, coverage, GCov) | 12
0x1800000 | 0x18DFFFF | 896 KB | Standard scalar passes (InstructionCombining, TailCallElim, FunctionAttrs, SCCP, Sink, MemorySSA) | 13
0x18E0000 | 0x18FFFFF | 128 KB | DCE / CVP / DSE (Dead Code Elimination, CorrelatedValuePropagation, Dead Store Elimination) | 13
0x1900000 | 0x193FFFF | 256 KB | GVN family (runOnFunction 83KB, PRE 26KB, NewGVN 43KB) | 13
0x1940000 | 0x19FFFFF | 768 KB | Scalar passes continued (LICM, LoopRotate, LoopIndexSplit, LoopUnroll, SROA) | 13
0x1A00000 | 0x1AFFFFF | 1 MB | NVVMRematerialization / LLVM standard pipeline / InstructionSimplify | 13
0x1B00000 | 0x1B7FFFF | 512 KB | Loop unrolling + switch lowering (main driver 68KB, Unroll-and-Jam 55KB, peeling 39KB) | 13
0x1B80000 | 0x1BFFFFF | 512 KB | Loop/SLP vectorizer (LoopVectorize 43KB, VPlan 32KB, SLP 47KB+62KB) | 13
0x1C00000 | 0x1C3FFFF | 256 KB | NVVM module validation + config (codegen config 33KB, compile mode 28KB, intrinsic lowering 112KB, module validator 48KB) | 13
0x1C40000 | 0x1CFFFFF | 768 KB | NVIDIA custom IR passes (dead-sync-elim, common-base-elim, base-addr-sr, memspace-opt, loop-index-split, printf-lowering, iv-demotion, remat, peephole, sinking2, NLO) | 13
0x1D00000 | 0x1DFFFFF | 1 MB | SelectionDAG ISel / CodeGenPrepare (bytecode interpreter 97KB, address sinking 65KB) | 14
0x1E00000 | 0x1EFFFFF | 1 MB | Register allocation infrastructure (Greedy RA, live intervals, spill cost) | 14
0x1F00000 | 0x1FFFFFF | 1 MB | Backend codegen infrastructure (ScheduleDAG, ShrinkWrapping, SpillPlacement, register coalescer, TwoAddressInstruction) | 15
0x2000000 | 0x20FFFFF | 1 MB | LegalizeTypes (sub_20019C0 341KB -- third largest function) | 15
0x2100000 | 0x21FFFFF | 1 MB | NVPTX target backend (AsmPrinter, PTX emission, MMA/tensor codegen, atomics, TargetMachine) | 16
0x2200000 | 0x233FFFF | 1.25 MB | (gap: misc codegen, late passes) | --
0x2340000 | 0x23FFFFF | 768 KB | New PM pass registration (master registrar 2,816 lines, 526 passes, pipeline text parser) | 17
0x2400000 | 0x258FFFF | 1.6 MB | Attributor framework (runTillFixpoint 53KB) | 18
0x2590000 | 0x265FFFF | 832 KB | Sanitizer instrumentation (ASan, HWASan) | 18
0x2660000 | 0x269FFFF | 256 KB | OpenMP target offloading (194-entry __kmpc_* table, Generic-to-SPMD 61KB, state machine 41KB) | 18
0x26A0000 | 0x29FFFFF | 3.5 MB | Coroutines / LTO infrastructure / PGO lowering / EarlyCSE / SROA (NewPM) | 18
0x2A00000 | 0x2CFFFFF | 3 MB | Loop transforms (LoopPeeling, LoopRotation, UnrollLoop, IndVarSimplify, dead-sync-elim island) | 19
0x2D00000 | 0x2FFFFFF | 3 MB | Codegen target options / SelectionDAG lowering (TargetOptions 112KB, DAG combine, type legalization) | 20
0x3000000 | 0x36FFFFF | 7 MB | NVPTX ISel + DAG lowering (NVPTXTargetLowering 111KB, intrinsic switch 343KB, register info) | 21
0x3700000 | 0x37AFFFF | 704 KB | Table-driven instruction selector (main matcher 138KB, per-SM opcode gating) | 22
0x37B0000 | 0x38FFFFF | 1.3 MB | Late machine passes (inliner cost model at 0x38576C0, pipeline helpers) | 22
0x3900000 | 0x397FFFF | 512 KB | NVIDIA machine-level passes (register pressure, remat, ABI preserve, GEP split, AsmPrinter/PTX emission) | 22
0x3980000 | 0x399FFFF | 128 KB | MC layer / DWARF emission (object file writers, DWARF sections at 0x3990000-0x39DF000) | 22
0x39A0000 | 0x3BFFFFF | 2.4 MB | Trailing codegen (section management, CRT finalization) | 22

.rodata / .data Sections (0x3C00000+)

Start | End | Size | Contents
0x3C00000 | 0x3EAFFFF | ~2.7 MB | Read-only data (strings, jump tables, XOR-encrypted env vars at 0x3C23A7B)
0x3EA0080 | 0x3F1FFFF | 456 KB | Embedded libdevice bitcode (Path A)
0x3F252E0 | 0x3F3E6C0+ | varies | NVPTX tables (constraint type table, constraint word table, MVT tables)
0x420FD80 | 0x428FFFF | 456 KB | Embedded libdevice bitcode (Path B)
0x42812C0 | -- | varies | Obfuscated version strings (XOR+ROT13 ciphertext)
0x444C4A0 | 0x4456580+ | varies | MVT tables (operand type, vector element count, scalarized MVT)
0x4F00000+ | -- | large | BSS (cl::opt storage, hash tables, global state)

Usage

Given an IDA address, find the row whose Start <= address <= End (the End column is the last byte of the range). The Subsystem column tells you which component of cicc you are looking at. For pass-level detail within a zone, jump to the corresponding Zone section above.
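This lookup is trivial to script. A sketch using a few rows from the master map (the full table has ~67 entries; these four are enough to illustrate the binary search):

```python
import bisect

# A small subset of the master address-range map: (start, end, subsystem).
RANGES = [
    (0x5D0000, 0x8EFFFF, "EDG 6.6 C++ Frontend"),
    (0x12D0000, 0x12FBFFF, "Pipeline builder"),
    (0x1C40000, 0x1CFFFFF, "NVIDIA custom IR passes"),
    (0x3000000, 0x36FFFFF, "NVPTX ISel + DAG lowering"),
]
STARTS = [r[0] for r in RANGES]


def subsystem_at(addr: int):
    """Return the subsystem whose start <= addr <= end, or None if unmapped."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i >= 0 and RANGES[i][0] <= addr <= RANGES[i][1]:
        return RANGES[i][2]
    return None


if __name__ == "__main__":
    print(subsystem_at(0x1C47810))  # dead-sync-elim lives in the custom-pass zone
```

Pasting the full table into RANGES gives an instant "where am I?" helper for any address IDA shows.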

Methodology

This page documents how the reverse engineering of cicc v13.0 was performed. It serves as both a transparency record -- so readers can assess the confidence of any claim in this wiki -- and as a practical guide for anyone who wants to reproduce or extend the analysis.

Scope and Scale

CICC is a 60 MB stripped x86-64 ELF binary with no debug symbols, no export table, and no DWARF information. The scale of the analysis:

Metric | Value
Total functions detected | 80,562
Functions decompiled | 80,281 (99.65%)
Strings extracted | 188,141
LLVM base version | 20.0.0 (internal fork)
LLVM pass classes identified | ~402 standard + 35 NVIDIA custom
CLI options registered | ~1,689 cl::opt + 222 NVVMPassOptions
NVVM builtins catalogued | 770 (IDs 1-770)

The 281 functions that Hex-Rays could not decompile are predominantly very small thunks, computed-jump trampolines, or hand-written assembly stubs in the CRT startup and jemalloc fast paths. None are in critical compiler logic.

Toolchain

All analysis was performed with IDA Pro 8.x and the Hex-Rays x86-64 decompiler. No dynamic analysis (debugging, tracing, instrumentation) was used -- the entire effort is static analysis of the binary at rest. Supplementary tools:

Tool | Purpose
IDA Pro 8.x | Disassembly, auto-analysis, cross-referencing, type reconstruction
Hex-Rays decompiler | Pseudocode generation for all 80,281 recovered functions
IDA Python scripting | Bulk string extraction, function size enumeration, xref graph walking
Custom Python scripts | Callgraph analysis, module taxonomy, evidence indexing, pipeline tracing

No runtime instrumentation, no strace/ltrace, no gdb breakpoints. Every finding derives from static analysis of the binary's code and data sections.

Function Identification Strategies

Identifying functions in a stripped binary of this size requires multiple complementary strategies. They are listed below in order of reliability.

String Cross-References (Highest Confidence)

LLVM is a string-rich codebase. Error messages, pass names, option descriptions, and assertion text are compiled into the binary. A string like "Running pass 'NVVMMemorySpaceOpt'" appears at exactly one address in .rodata, and IDA's xref from that string leads directly to the function that prints it. This is the most reliable identification technique and produces VERY HIGH confidence identifications.

Specific high-value string patterns:

  • LLVM pass registration: "instcombine", "gvn", "nvvm-memspace-opt" -- each appears in exactly one RegisterPass constructor or PassInfo initializer.
  • cl::opt names: "-nvvm-enable-remat", "-nvvm-branch-dist-threshold" -- each names a global variable and its registration constructor.
  • Error messages with context: "parseFunctionBody: ..." (174 unique error strings in the bitcode reader), "visitCallInst: ..." (298 verification messages in the verifier).
  • Timer names: "CUDA C++ Front-End", "LibNVVM", "Optimizer" -- appear in timer-creation calls that bracket pipeline stages.
  • EDG error templates: "expected a %s", "declaration not allowed here" -- 2,500+ diagnostic strings anchoring the frontend parser.

LLVM Pass Registration Patterns (Very High Confidence)

Every LLVM pass follows a predictable structural pattern. A pass class has a vtable with virtual methods at fixed offsets (runOnFunction at slot N, getAnalysisUsage at slot M). The pass registers itself via a global constructor that stores a PassInfo object containing the pass name string, the pass ID address, and a factory function pointer. By enumerating all .init_array entries that write a PassInfo-shaped structure, all ~437 passes were catalogued systematically.

The New Pass Manager (at sub_2342890, a 2,816-line registrar function) contains a massive string-to-pass-factory dispatch table with ~268 pass name entries. Decompiling this single function yields the name-to-address mapping for every New PM pass in the binary.
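The shape of that dispatch table is worth internalizing, since it is what makes the registrar so productive to decompile. A toy Python model of a name-to-factory registry and pipeline-text parser (the two pass names are real NewPM names; the classes and parsing details are stand-ins, not the recovered implementation):

```python
# Stand-in pass classes for illustration only.
class GVN:
    pass


class SimplifyCFG:
    pass


# The registrar builds a StringMap like this: pass name -> factory.
PASS_FACTORIES = {
    "gvn": GVN,
    "simplifycfg": SimplifyCFG,
}


def parse_pipeline(text: str) -> list:
    """Parse a comma-separated pipeline string (the format
    --print-pipeline-passes emits) into constructed pass objects."""
    return [PASS_FACTORIES[name]() for name in text.split(",")]
```

Decompiling sub_2342890 amounts to reading off the keys and factory addresses of exactly such a table, one insertion call at a time.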

Vtable Analysis (High Confidence)

LLVM's class hierarchy is deep and regular. Pass -> FunctionPass -> LoopPass, Pass -> ModulePass, etc. Each level adds virtual methods at predictable vtable slots. By reconstructing vtable layouts (finding pointers to __cxa_pure_virtual for abstract methods, then tracing concrete overrides), the class hierarchy was reconstructed without debug symbols.

For the NVPTX backend specifically, vtable analysis identified NVPTXTargetLowering (2.3 MB of lowering logic), NVPTXInstrInfo, NVPTXRegisterInfo, and NVPTXFrameLowering as distinct classes with their own method tables.

Callgraph Propagation (High Confidence)

Once a function is identified with high confidence, its callees and callers gain contextual identity. If sub_12E54A0 is the pipeline assembly function (confirmed by string refs to pass names it registers), then the functions it calls to create individual passes are the pass factory functions. This propagation is transitive: identifying a factory function identifies its return type's vtable, which identifies the pass's runOnFunction method.

The pipeline orchestrator at sub_12C35D0 (41 KB) is a particularly productive anchor: it calls into the LNK, OPT, OPTIXIR, and LLC stages in sequence, and each stage's entry point was identified by following its callgraph edges.

Size and Structural Fingerprinting (Medium Confidence)

Some functions are identifiable by their size and structural characteristics alone. LLVM's InstCombine::visitCallInst is famously enormous (396 KB in this binary) because it handles every LLVM intrinsic. SelectionDAG::LegalizeTypes (348 KB) contains a switch with 967 case labels. These mega-functions have no structural equivalents and can be identified by size alone with reasonable confidence.

Similarly, the EDG frontend's constexpr evaluator (sub_786210, 317 KB) is identifiable by its 124 case labels corresponding to C++ operator opcodes -- a characteristic that matches the known EDG evaluator design.

Known Library Fingerprinting (Medium Confidence)

jemalloc was identified by its 199 configuration string names ("background_thread", "dirty_decay_ms", "narenas", etc.), which are unique to jemalloc's malloc_conf_init function. Once the allocator library was identified, its ~400 functions were bulk-labeled, removing them from the analysis scope.

The X86 AutoUpgrade function (sub_A939D0, 457 KB) is an LLVM artifact -- leftover x86 intrinsic renaming code that ships in every LLVM-based binary regardless of target. It was identified by its intrinsic name strings ("llvm.x86.sse2.*", "llvm.x86.avx.*") and excluded from NVPTX-specific analysis.

Confidence Levels

Every function identification in this wiki carries one of four confidence levels:

Level     | Meaning                 | Basis
KNOWN     | Identity is certain     | Direct string evidence naming the function, or the function is a trivial thunk to a known target
VERY HIGH | Effectively certain     | Multiple corroborating string references, structural match to known LLVM code, consistent callgraph position
HIGH      | Strong identification   | Single strong indicator (vtable match, size fingerprint, callgraph position) corroborated by context
MEDIUM    | Probable identification | Inferred from callgraph context, parameter patterns, or structural similarity without direct string evidence

Approximately 60% of identified functions are VERY HIGH or KNOWN confidence. The remaining 40% are HIGH or MEDIUM, concentrated in areas with fewer string anchors (machine-level passes, register allocation internals, EDG IL tree walkers).

Analysis Pipeline and Scripts

The manual IDA Pro work was augmented by a systematic scripted pipeline that processed the exported IDA databases into structured evidence. The pipeline operates in two phases: L0 (foundation) builds indices and classifies all 80,562 functions automatically, and L1 (module analysis) organizes functions into per-module directories with metadata for human review.

All scripts live in cicc/scripts/. The pipeline requires four JSON databases exported from IDA: cicc_functions.json (80,562 function records), cicc_strings.json (188,141 string records), cicc_xrefs.json (cross-reference records), and cicc_callgraph.json (call edge records). These exports are stored in cicc/databases/.

L0 Foundation Pipeline

The L0 pipeline runs as a single sequential batch via scripts/run_foundation_analysis.sh. Each step depends on the output of the previous step.

Step 0: Extract Wiki Knowledge (foundation/00_extract_wiki_knowledge.py)

Scans all existing wiki markdown files for hex addresses (regex \b0x[0-9a-fA-F]{6,}\b) and builds a ground-truth mapping of address-to-module from prior manual analysis. This seed data provides the highest-confidence module assignments (100% confidence) used to bootstrap the automated classifier.

Output: foundation/taxonomy/modules/wiki_known_functions.json, wiki_module_addresses.json.
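
A minimal sketch of the Step 0 harvest, using the same regex quoted above; the sample wiki text is invented:

```python
# Harvest 6+-digit hex addresses from wiki markdown, mirroring the
# \b0x[0-9a-fA-F]{6,}\b pattern described in Step 0.
import re

ADDR_RE = re.compile(r"\b0x[0-9a-fA-F]{6,}\b")

def harvest_addresses(markdown_text):
    """Return the set of hex addresses mentioned in a wiki page."""
    return {int(m, 16) for m in ADDR_RE.findall(markdown_text)}

sample = "The orchestrator lives at 0x12C35D0; 0x1F is too short to match."
print(sorted(hex(a) for a in harvest_addresses(sample)))  # ['0x12c35d0']
```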

Step 1: Build Fast Lookup Indices (foundation/01_build_indices.py)

Loads the three IDA JSON databases (functions, strings, xrefs) and builds four pickle-serialized indices for O(1) lookup in subsequent steps:

  • addr_to_func.pkl -- address to function metadata (name, size, instruction count, library/thunk flags).
  • string_to_xrefs.pkl -- string address to string value and xref list.
  • func_to_callers.pkl -- function name to list of caller names.
  • func_to_callees.pkl -- function name to list of callee names.

Output: foundation/indices/.

Step 2: Classify Strings (foundation/02_classify_strings.py)

Applies four regex-based pattern sets to all 188,141 strings, classifying each into one or more semantic categories:

  • Error messages: strings matching error, failed, invalid, unsupported, expected, etc.
  • Optimization passes: strings matching pass, optimize, transform, inline, unroll, gvn, licm, etc.
  • Architecture features: strings matching sm_\d+, tensor, warp, FP4, blackwell, hopper, etc.
  • Debug messages: strings matching debug, trace, dump, verbose.

Each classified string retains its address and xref list, so the classifier output doubles as a "which functions reference optimization-related strings" index.

Output: foundation/taxonomy/strings/error_messages.json, optimization_passes.json, architecture_features.json, debug_messages.json, extracted_pass_names.json.
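
The classifier logic can be sketched as follows. The patterns are abbreviated stand-ins for the real Step 2 regex sets, and, as in the real script, a string may land in more than one category:

```python
# Multi-label string classification with one compiled regex per category.
import re

CATEGORIES = {
    "error_messages": re.compile(r"error|failed|invalid|unsupported|expected", re.I),
    "optimization_passes": re.compile(r"pass|optimize|inline|unroll|gvn|licm", re.I),
    "architecture_features": re.compile(r"sm_\d+|tensor|warp|FP4|blackwell|hopper", re.I),
    "debug_messages": re.compile(r"debug|trace|dump|verbose", re.I),
}

def classify(string_value):
    """Return every category whose pattern matches the string."""
    return [name for name, pat in CATEGORIES.items() if pat.search(string_value)]

print(classify("Running pass 'NVVMMemorySpaceOpt'"))  # ['optimization_passes']
print(classify("sm_90 tensor core path"))             # ['architecture_features']
```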

Step 3: Build Module Taxonomy (foundation/03_build_module_taxonomy.py)

The core classification engine. Assigns each of the 80,562 functions to one of eight compiler subsystem modules (or unknown) using four strategies applied in decreasing confidence order:

  1. Wiki ground truth (100% confidence) -- addresses found in wiki pages in Step 0.
  2. String content analysis (80% confidence) -- functions whose string xrefs match module-specific keyword patterns (e.g., a function referencing "tensor", "mma", or "tcgen" strings is classified as tensor_core_codegen).
  3. Call proximity propagation (30-60% confidence, 3 iterations) -- unclassified functions are assigned to the module voted by their callers (weighted 2x) and callees. A minimum of 2 votes is required. Each iteration propagates classifications outward from already-classified functions.
  4. Code location heuristics (40% confidence) -- address range rules for known code regions (e.g., 0x2F00000-0x3000000 maps to register_allocation).

The eight modules are: optimization_framework, register_allocation, compilation_pipeline, ptx_emission, instruction_selection, error_handling, tensor_core_codegen, architecture_detection.

Output: foundation/taxonomy/modules/function_to_module_map.json, module_list.json.
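
Strategy 3's voting rule (callers weighted 2x, callees 1x, minimum of 2 votes) can be sketched like this; the function names and module labels are invented:

```python
# One voting round of call-proximity propagation for a single
# unclassified function, as described in Step 3 strategy 3.
from collections import Counter

def vote_module(callers, callees, classified):
    """Propose a module from already-classified neighbors, or None."""
    votes = Counter()
    for name in callers:
        if name in classified:
            votes[classified[name]] += 2   # caller votes count double
    for name in callees:
        if name in classified:
            votes[classified[name]] += 1
    if not votes:
        return None
    module, count = votes.most_common(1)[0]
    return module if count >= 2 else None  # minimum-2-votes bar

classified = {"sub_A": "register_allocation", "sub_B": "ptx_emission"}
print(vote_module(["sub_A"], ["sub_B"], classified))  # register_allocation
```

The real pipeline repeats this round three times, so each iteration can build on classifications produced by the previous one.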

Step 4: Analyze Call Graph (foundation/04_analyze_callgraph.py)

Computes three structural properties of the call graph:

  • Entry points -- functions with zero callers and nonzero callees (top 100 by callee count). These are pipeline entry points, API functions, or global constructors.
  • Leaf functions -- functions with zero callees and nonzero callers (top 1,000 by caller count). These are utility functions, allocators, and assertion handlers.
  • Hot paths -- functions ranked by caller count (top 1,000). The highest-traffic functions in the binary.

Output: foundation/callgraph/entry_points.json, leaf_functions.json, hot_paths.json.
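
On a toy call graph, the three properties reduce to simple degree checks; the edge data below is invented:

```python
# Entry points (no callers), leaf functions (no callees), and a
# caller-count ranking, as computed in Step 4.
from collections import defaultdict

def callgraph_roles(edges):
    callers, callees = defaultdict(set), defaultdict(set)
    funcs = set()
    for src, dst in edges:
        funcs.update((src, dst))
        callees[src].add(dst)
        callers[dst].add(src)
    entry = {f for f in funcs if not callers[f] and callees[f]}
    leaf = {f for f in funcs if not callees[f] and callers[f]}
    hot = sorted(funcs, key=lambda f: len(callers[f]), reverse=True)
    return entry, leaf, hot

edges = [("main", "opt"), ("main", "emit"), ("opt", "emit")]
entry, leaf, hot = callgraph_roles(edges)
print(entry, leaf, hot[0])  # {'main'} {'emit'} emit
```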

Step 5: Assign Priorities (foundation/05_assign_priorities.py)

Computes a composite priority score for each function to guide analysis effort allocation. The scoring formula:

  • Size component: 1000 points for functions over 10 KB, 700 for 5-10 KB, 400 for 2-5 KB, 200 for 1-2 KB, 100 for 500 B-1 KB.
  • Call frequency component: 500 points for 1000+ callers, 300 for 500+, 150 for 100+, 75 for 50+.
  • Named function bonus: 200 points if the function has a recovered name (not sub_).
  • Critical module bonus: 300 points if the function belongs to a critical module (compilation_pipeline, tensor_core_codegen, architecture_detection, register_allocation, instruction_selection, ptx_emission).

Functions scoring 1000+ are tier CRITICAL, 500+ are HIGH, 200+ are MEDIUM, below 200 are LOW.

Output: foundation/priorities/scoring_report.json, critical.json, high.json, medium.json, low.json.
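
The scoring formula transcribes directly into code. This sketch assumes binary KB boundaries (1 KB = 1,024 bytes), which the script may or may not use; the sample function record is invented:

```python
# Composite priority score and tier assignment, per the Step 5 formula.
CRITICAL_MODULES = {
    "compilation_pipeline", "tensor_core_codegen", "architecture_detection",
    "register_allocation", "instruction_selection", "ptx_emission",
}

def priority_score(size, callers, named, module):
    score = 0
    if size > 10_240: score += 1000      # size component
    elif size > 5_120: score += 700
    elif size > 2_048: score += 400
    elif size > 1_024: score += 200
    elif size > 512: score += 100
    if callers >= 1000: score += 500     # call frequency component
    elif callers >= 500: score += 300
    elif callers >= 100: score += 150
    elif callers >= 50: score += 75
    if named: score += 200               # recovered name (not sub_)
    if module in CRITICAL_MODULES: score += 300
    return score

def tier(score):
    if score >= 1000: return "CRITICAL"
    if score >= 500: return "HIGH"
    if score >= 200: return "MEDIUM"
    return "LOW"

s = priority_score(size=12_000, callers=600, named=False, module="ptx_emission")
print(s, tier(s))  # 1600 CRITICAL
```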

Step 6: Generate Coverage Tracker (foundation/06_generate_coverage_tracker.py)

Aggregates all prior outputs into a master JSON tracker that records, per module and per function, the analysis status (pending/in-progress/complete), the assigned analyst, and the evidence quality score. This tracker serves as the coordination database for the L1 phase.

Output: foundation/coverage/tracker.json.

L1 Module Analysis Pipeline

The L1 pipeline runs via scripts/run_l1_programmatic.sh and requires L0 completion. It organizes CRITICAL and HIGH priority functions into per-module directories for systematic human review.

Step 1: Create Module Structure (modules/01_create_module_structure.py)

Creates the directory tree modules/{module}/functions/{critical,high}/ for each of the eight modules. MEDIUM and LOW tiers are intentionally excluded from L1 to focus effort on the most important functions.

Step 2: Extract Function Metadata (modules/02_extract_function_metadata.py)

For each CRITICAL and HIGH function, creates a directory modules/{module}/functions/{tier}/{address}/ containing a metadata.json file with: address, name, module, priority score, size, call frequency, scoring reasons, top 50 callers, top 50 callees, and paths to decompiled/disassembly/CFG files if they exist on disk.

Step 3: Generate Module READMEs (modules/03_generate_module_readmes.py)

Generates a skeleton README.md for each module with function counts, analysis progress tracking fields, and section headings for purpose, key functions, integration points, and data structures. These serve as the starting point for human-written module documentation.

Standalone Analysis Scripts

Six additional scripts perform targeted analyses independent of the L0/L1 pipeline:

analyze_nvvm_pipeline.py -- Loads the NVVM call graph (nvvm_callgraph.json, exported from the LibNVVM shared object analysis) and traces the compilation flow from nvvmCompileProgram. Identifies NVVM API entry points, finds LLVM optimization pass function symbols, traces call paths to depth 10, identifies hub functions (nodes with in-degree or out-degree above 10), and extracts the optimization pass ordering reachable from the compile entry point.

deep_pipeline_trace.py -- Performs deep BFS traversal (up to depth 15, width 100 per level) from nvvmCompileProgram through the NVVM call graph. Annotates each function with structural characteristics (LEAF, HUB, FANOUT, FANIN) and groups results by call depth to reveal the pipeline's stage boundaries. Also traces from secondary API entry points (nvvmVerifyProgram, nvvmAddModuleToProgram, nvvmCreateProgram).

extract_pipeline_structure.py -- Parses the 188,141 strings database for disable-*Pass patterns and Disable * description strings to extract the complete list of optimization passes by name. Categorizes passes into groups (Dead Code Elimination, Loop Optimizations, Inlining, Memory, NVVM-Specific, Lowering, etc.) and reconstructs the 13-stage compilation pipeline from NVVM module loading through PTX code generation. Also extracts compilation mode information (fast-compile, split-compile, partial-link).

analyze_performance_hotspots.py -- Loads the full function database (cicc_functions.json) and computes: global hotspot ranking (top 100 most-called functions), hot path chains (BFS from top 50 hotspots through callees, tracking weighted call frequency), size-efficiency analysis (bytes per call for each function), loop depth estimation (regex-based nesting analysis of decompiled C files), bottleneck identification (functions with 500+ callers), and module-level hotspot distribution.

catalog_optimization_framework.py -- Specialized script for the optimization_framework module. Reads per-function metadata from the L1 module directories, builds a critical function registry sorted by size, extracts HIGH-tier statistics (size tier distribution, top 20 most-called), scans decompiled code for optimization-related string patterns (pass references, iteration patterns, technique keywords), and identifies entry points (functions with 2 or fewer callers).

validate_callgraph.py -- Comprehensive validation system that cross-checks the call graph data against module classifications. Performs six verification analyses: cross-module call matrix verification (counting inter-module edges and sampling for spot-checks), entry point validation (confirming claimed entry points have zero callers), reachability analysis (BFS from main to find dead code), module dependency cycle detection (DFS on the module dependency graph), integration hotspot verification (functions called by all 8 modules), and bridge function identification (functions that both call into and are called from 2+ other modules).

Evidence Index Builders

Two versions of the evidence aggregation engine synthesize all data sources into per-function quality scores:

build_evidence_index.py (v1) -- Loads all input data sources (functions, callgraph, strings, xrefs, names, comments, and the module map) into memory. For each of the 80,562 functions, counts eight evidence types (metadata, callers, callees, strings, xrefs, name pattern, size, module consistency) and computes a weighted confidence score (string evidence weighted highest at 20 points, callers and callees at 15 each, xrefs at 15, metadata and name at 10 each, module at 10, size at 5). Produces nine output files including quality tier assignments (GOLD >= 80%, SILVER >= 50%, BRONZE < 50%), citation density analysis, cross-reference statistics, and prioritized recommendations for further analysis.

build_evidence_index_v2.py (v2, optimized) -- Memory-efficient reimplementation that avoids loading the full xref list into memory. Instead of building complete xref lookup tables, it streams the xref file line-by-line and counts only. The callgraph is preprocessed into a caller/callee count map rather than a full edge list. Produces the same nine analysis files as v1 with identical quality tier logic. Recommended for systems with less than 32 GB RAM.
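
The weighting scheme reduces to a lookup table; the evidence-type keys below paraphrase the eight types listed above, and the weights sum to 100 so the score doubles as a percentage:

```python
# Evidence score and GOLD/SILVER/BRONZE tiering, per the v1/v2 logic.
WEIGHTS = {
    "strings": 20, "callers": 15, "callees": 15, "xrefs": 15,
    "metadata": 10, "name": 10, "module": 10, "size": 5,
}

def evidence_score(present):
    """`present` is the set of evidence types found for a function."""
    return sum(w for k, w in WEIGHTS.items() if k in present)

def quality_tier(score):
    if score >= 80: return "GOLD"
    if score >= 50: return "SILVER"
    return "BRONZE"

score = evidence_score({"strings", "callers", "callees", "metadata"})
print(score, quality_tier(score))  # 60 SILVER
```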

Cross-Module Dependency Analysis

07_analyze_cross_module_dependencies.py -- The most complex standalone analysis. Streams the full call graph (using ijson for memory-efficient parsing) in multiple passes to compute six analyses:

  1. Inter-module call matrix -- for each pair of the 8 modules, the number of call edges crossing the boundary.
  2. Module dependency depth -- per-module statistics on how many other modules each function depends on, identifying isolated functions and hub functions.
  3. Critical bridges -- functions that call into 3 or more other modules (top 100 by bridge count).
  4. Integration hotspots -- functions called by 3 or more other modules (top 100 by fan-in).
  5. Module dependency graph -- a JSON graph structure with weighted edges suitable for visualization.
  6. Integration patterns -- entry point modules (highest out-degree), utility hub modules (highest in-degree), and linear dependency chains.
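
Analysis 1, the inter-module call matrix, reduces to a counting pass over the edge list; the toy graph and module labels below are invented:

```python
# Count call edges whose endpoints fall in different modules.
from collections import Counter

def call_matrix(edges, module_of):
    matrix = Counter()
    for src, dst in edges:
        m_src, m_dst = module_of.get(src), module_of.get(dst)
        if m_src and m_dst and m_src != m_dst:
            matrix[(m_src, m_dst)] += 1  # directed boundary crossing
    return matrix

module_of = {"a": "optimization_framework", "b": "ptx_emission", "c": "ptx_emission"}
edges = [("a", "b"), ("a", "c"), ("b", "c")]
print(call_matrix(edges, module_of))
```

Intra-module edges (b to c above) are excluded, so the matrix measures only integration traffic between subsystems.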

Data Flow and Directory Structure

The complete analysis data is organized as follows:

cicc/
  databases/                    # IDA exports (input data)
    cicc_functions.json         #   80,562 function records
    cicc_strings.json           #   188,141 string records
    cicc_xrefs.json             #   cross-reference records
    cicc_callgraph.json         #   call edge records
    cicc_names.json             #   recovered names
    cicc_comments.json          #   IDA comments
  foundation/                   # L0 pipeline output
    indices/                    #   pickle indices for fast lookup
    taxonomy/
      modules/                  #   function-to-module map, module list
      strings/                  #   classified string databases
    callgraph/                  #   entry points, leaf functions, hot paths
    priorities/                 #   priority scoring and tier assignments
    coverage/                   #   master progress tracker
    analyses/                   #   evidence index, quality tiers, cross-module data
  modules/                      # L1 pipeline output
    {module}/
      functions/
        critical/{addr}/        #   metadata.json per critical function
        high/{addr}/            #   metadata.json per high function
      analysis/                 #   module-level analysis files
      README.md                 #   module documentation skeleton
  decompiled/                   # Hex-Rays output (per-function C files)
  disasm/                       # IDA disassembly output (per-function ASM files)
  graphs/                       # Control flow graphs (JSON and DOT)
  scripts/                      # All analysis scripts
    foundation/                 #   L0 pipeline scripts (00-07)
    modules/                    #   L1 pipeline scripts (01-03)
    run_foundation_analysis.sh  #   L0 batch runner
    run_l1_programmatic.sh      #   L1 batch runner

Verification Approaches

To verify any specific finding in this wiki:

  1. Open IDA at the stated address. Every function identification includes an address. Navigate to it, press F5 to decompile, and check whether the decompiled code matches the described behavior.

  2. Check string xrefs. For VERY HIGH and KNOWN identifications, search for the quoted string in IDA's Strings window. The xref should lead to the stated function address or a function that directly calls it.

  3. Compare with upstream LLVM. CICC is based on LLVM 20.0.0. The LLVM source tree at the corresponding git tag contains the original implementations of all standard passes. Structural comparison (switch case counts, parameter counts, error message text) between the decompiled code and the LLVM source is the gold standard for verification.

  4. Cross-reference the dual paths. Path A and Path B contain near-duplicate code. If a function is identified in Path A, the corresponding Path B function should exhibit the same structure. Agreement between the two paths increases confidence.

  5. Trace from known entry points. Start at sub_8F9C90 (real main, KNOWN confidence) and follow the call chain. Every function reachable from main through a chain of identified functions has a verified callgraph path.

  6. Run the validation script. Execute scripts/validate_callgraph.py to cross-check the call graph against module classifications. The script produces a CALLGRAPH_VALIDATION_REPORT.json with quantitative metrics: entry point accuracy, cross-module call counts, reachability percentage, bridge function inventory, and module dependency cycles. A healthy analysis should show entry point confidence above 90% and reachability above 80%.

  7. Re-run the evidence index. Execute scripts/foundation/build_evidence_index_v2.py to regenerate quality tier assignments. Compare the GOLD/SILVER/BRONZE percentages against the expected distribution (majority SILVER or GOLD for classified functions). Functions that drop to BRONZE after a wiki edit indicate a regression in evidence consistency.

Reproducing the Full Analysis

To reproduce this analysis from scratch:

  1. Obtain the binary. Install CUDA Toolkit 13.0. The binary is at <cuda>/nvvm/bin/cicc. Verify that the embedded build string cuda_13.0.r13.0/compiler.36424714_0 matches (and compare the SHA-256 against a reference hash if one is available).

  2. Run IDA auto-analysis. Open cicc in IDA Pro 8.x with default x86-64 analysis settings. Allow auto-analysis to complete (5-10 minutes for a binary of this size). Accept the detected compiler (GCC).

  3. Batch decompile. Run the following IDA Python script to decompile all functions and export per-function C files:

    import os
    import idautils, ida_hexrays, idc

    os.makedirs("decompiled", exist_ok=True)
    for func_ea in idautils.Functions():
        name = idc.get_func_name(func_ea)
        try:
            cfunc = ida_hexrays.decompile(func_ea)
        except ida_hexrays.DecompilationFailure:
            continue  # skip functions Hex-Rays cannot decompile
        with open(f"decompiled/{name}_0x{func_ea:X}.c", "w") as f:
            f.write(str(cfunc))
    
  4. Export databases. Use IDA Python to export the five JSON databases (functions, strings, xrefs, callgraph, names) to cicc/databases/. The function export should iterate Functions() and record address, name, size, instruction count, is_library, is_thunk, callers, and callees for each. The string export should iterate IDA's string list and record address, value, and xrefs.

  5. Run L0 foundation pipeline.

    cd cicc/scripts
    bash run_foundation_analysis.sh
    

    This executes Steps 0-6 in sequence, producing all indices, classifications, and the coverage tracker. Expected runtime: 2-5 minutes on a modern machine.

  6. Run L1 module setup.

    bash run_l1_programmatic.sh
    

    This creates the per-module directory structure, extracts metadata for CRITICAL and HIGH functions, and generates module README skeletons. Expected runtime: under 1 minute.

  7. Run standalone analyses (optional, for deeper investigation):

    python3 analyze_nvvm_pipeline.py       # NVVM pipeline trace
    python3 deep_pipeline_trace.py          # Deep BFS from nvvmCompileProgram
    python3 extract_pipeline_structure.py   # Pass extraction from strings
    python3 analyze_performance_hotspots.py # Hotspot ranking
    python3 validate_callgraph.py           # Validation report
    
  8. Run evidence indexing (optional, for quality assessment):

    cd foundation
    python3 build_evidence_index_v2.py
    
  9. Begin manual analysis. With the foundation data in place, start from the CRITICAL priority list and the string anchors described in the Function Identification Strategies section above. The Function Map page is the primary lookup table.

Dependencies

The analysis scripts require only the Python 3.8+ standard library with one exception: 07_analyze_cross_module_dependencies.py uses ijson for streaming JSON parsing of the large callgraph file. Install with pip install ijson. All other scripts use only json, pickle, re, collections, pathlib, statistics, dataclasses, and typing.

Binary Address Sweep Reports

In addition to the automated scripts, the analysis produced 90+ raw binary sweep reports stored in cicc/raw/. Each report covers a contiguous address range (typically 128 KB to 512 KB) and contains per-function identification notes, string evidence citations, structural observations, and confidence assessments. The reports are named by address range (e.g., p1.3-01-sweep-0x8F0000-0x90FFFF.txt covers the compilation pipeline entry region) and organized into 10 sweep phases corresponding to the binary's major sections. A second sweep phase (p2-* and p2a-p2g) provides focused analyses of specific subsystems (EDG frontend, IR generation, optimization passes, SelectionDAG, register allocation, scheduling, configuration).

These raw reports are the primary source material from which the wiki pages were written. They are not cleaned or edited for presentation -- they contain working notes, false starts, and corrections made during the analysis process.

Limitations and Known Gaps

This analysis has several inherent limitations:

  • No dynamic validation. All findings are from static analysis. Runtime behavior under specific inputs (unusual SM targets, edge-case CUDA constructs) has not been verified.
  • EDG internals are partially opaque. The EDG frontend is a licensed third-party component. Its internal data structures are less well-documented in the LLVM literature, making identification harder. The IL tree format and scope management structures are identified at MEDIUM confidence.
  • Inlined functions are invisible. If the compiler inlined a function during the build of cicc itself, that function has no standalone address and cannot be independently identified. Some small LLVM utility functions (SmallVector operations, StringRef comparisons) are likely inlined throughout.
  • Proprietary NVIDIA code has no public reference. The 35 custom NVIDIA passes, the NVVM bridge layer, and the NVVMPassOptions system have no upstream source to compare against. These are identified purely from string evidence and structural analysis.
  • Version-specific. All findings apply to cicc v13.0 (build cuda_13.0.r13.0/compiler.36424714_0). Addresses, function sizes, and pass counts will differ in other CUDA toolkit versions.
  • Module classification accuracy degrades at the boundary. The automated taxonomy assigns ~60% of functions with high confidence (wiki ground truth or strong string evidence). The remaining functions are classified by call proximity propagation or address range heuristics at 30-60% confidence. Functions at module boundaries may be misclassified; the validate_callgraph.py script quantifies this.
  • Callgraph completeness depends on IDA's xref analysis. Indirect calls through function pointers (vtable dispatch, callback registrations) are not fully captured by IDA's static analysis. The call graph is therefore a lower bound on the true call relationships. This primarily affects LLVM's pass manager dispatch and the EDG frontend's visitor pattern implementations.

Version Tracking

This page documents the exact version identifiers embedded in the cicc v13.0 binary and the version relationships between its components. Every version listed here was recovered from string constants, constructor initializations, or binary header fields in the stripped ELF binary. This is the single source of truth for version-related questions across the wiki.

Version Summary

Component                   | Version         | Evidence
cicc binary                 | v13.0           | Build string cuda_13.0.r13.0/compiler.36424714_0
CUDA Toolkit                | 13.0            | Toolkit release that ships cicc v13.0 (per the build string)
LLVM base (internal)        | 20.0.0          | ctor_036 at 0x48CC90 falls back to "20.0.0"; string "llvm-mc (based on LLVM 20.0.0)" at sub_E7A190
Bitcode producer (emitted)  | "LLVM7.0.1"     | ctor_154 at 0x4CE640 writes "7.0.1" to producer global
EDG frontend                | 6.6             | String "Based on Edison Design Group C/C++ Front End, version 6.6"
NVVM IR version (user code) | 3.2             | Metadata gate at sub_157E370: major == 3, minor <= 2
NVVM IR version (libdevice) | 2.0             | !nvvmir.version = !{i32 2, i32 0} -- always-compatible sentinel
NVVM container format       | 1.x             | Header field version_major = 1, version_minor <= 0x41
NVVM debug info version     | 3.2             | Container header nvvm_debug_major = 3, nvvm_debug_minor <= 2
Embedded libdevice          | libdevice.10.bc | 455,876 bytes, 352 functions, triple nvptx64-nvidia-gpulibs
GCC emulation (EDG)         | 8.1             | DEFAULT_GNU_VERSION = 80100
Clang emulation (EDG)       | 9.1             | DEFAULT_CLANG_VERSION = 90100
jemalloc                    | 5.3.x           | ~400 statically linked functions at 0x12FC000
Default PTX ISA (sm_90)     | 8.5             | .version 8.5 computed from PTXVersion / 10 and PTXVersion % 10
Default SM target           | sm_75           | Hardcoded strcpy("compute_75") in sub_900130 and sub_125FB30

LLVM Version: The Dual Identity

CICC has two LLVM version identities. Internally, it is an LLVM 20.0.0 fork -- all modern instruction opcodes, metadata formats, type encodings, and pass infrastructure from LLVM 20 are present. Externally, the bitcode it emits identifies itself as "LLVM7.0.1" in the producer field.

The reason is historical: NVVM IR 2.0 was defined against LLVM 7.0.1. The entire NVVM toolchain ecosystem (libNVVM, nvcc's device pipeline, nvdisasm, third-party NVVM IR consumers) standardized on "LLVM7.0.1" as the format identifier. Changing the producer string would require a coordinated update across the entire CUDA toolkit and all downstream consumers.

Binary evidence:

  • ctor_036 at 0x48CC90: reads LLVM_OVERRIDE_PRODUCER environment variable, falls back to "20.0.0" (the true version).
  • ctor_154 at 0x4CE640: reads LLVM_OVERRIDE_PRODUCER, falls back to "7.0.1" (the compatibility marker). This is the constructor that runs for the bitcode writer path.
  • sub_E7A190: contains the string "llvm-mc (based on LLVM 20.0.0)".
  • sub_1538EC0 (writeModule): emits "LLVM" + "7.0.1" = "LLVM7.0.1" as the IDENTIFICATION_BLOCK producer.

Both constructors accept the LLVM_OVERRIDE_PRODUCER environment variable to override the default. Setting it changes the embedded producer string in output bitcode.

See Bitcode Reader/Writer for the full dual-producer mechanism.

EDG 6.6 Frontend

The EDG (Edison Design Group) frontend is a licensed commercial C/C++ frontend. Version 6.6 occupies 3.2 MB of code at 0x5D0000--0x8F0000. The version string is embedded literally as "Based on Edison Design Group C/C++ Front End, version 6.6" and is accessible via the --version flag.

EDG 6.6 in cicc is configured to emulate GCC 8.1 (DEFAULT_GNU_VERSION = 80100) and Clang 9.1 (DEFAULT_CLANG_VERSION = 90100). It supports C++23 as the newest C++ standard and C23 as the newest C standard, with C++17 as the default mode.

See EDG 6.6 Frontend for the full frontend documentation.

NVVM IR Version

The NVVM IR version is a metadata tuple (major, minor) embedded in every NVVM bitcode module via the !nvvmir.version named metadata node. CICC v13.0 has two distinct version contexts:

User code: the IR generation phase (sub_9151E0) emits !nvvmir.version with the current version tuple. The version checker at sub_157E370 enforces major == 3 and minor <= 2, making 3.2 the current maximum accepted version. Modules with major != 3 or minor > 2 are rejected with "Broken module found, compilation aborted!".

Libdevice: the embedded libdevice.10.bc carries !nvvmir.version = !{i32 2, i32 0}. The version (2, 0) is hard-coded in the version checker (sub_12BDA30) as an always-compatible sentinel -- it passes the check regardless of the current NVVM IR version. This ensures the embedded math library is compatible with any user module.

Container format: the NVVM container binary header stores version fields separately at offsets 0x06--0x07 (nvvm_ir_major, nvvm_ir_minor). These track the container-level IR spec version and may differ from the bitcode-level metadata tuple.

Bypass: setting NVVM_IR_VER_CHK=0 in the environment disables version validation entirely, allowing any version tuple to pass.
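
The gate's semantics as described can be condensed into a few lines. This is a reconstruction of the documented behavior, not decompiled code:

```python
# NVVM IR version gate: accept major == 3, minor <= 2; the libdevice
# tuple (2, 0) is an always-pass sentinel; NVVM_IR_VER_CHK=0 bypasses.
import os

def nvvm_version_ok(major, minor):
    if os.environ.get("NVVM_IR_VER_CHK") == "0":
        return True                      # validation disabled
    if (major, minor) == (2, 0):
        return True                      # libdevice sentinel
    return major == 3 and minor <= 2

print(nvvm_version_ok(3, 2))   # accepted: current maximum
print(nvvm_version_ok(3, 3))   # rejected: minor > 2
print(nvvm_version_ok(2, 0))   # accepted: sentinel
```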

See Bitcode Reader/Writer for the version gate implementation and NVVM Container for the container-level version fields.

Embedded Libdevice

The embedded libdevice is libdevice.10.bc, a 455,876-byte LLVM bitcode library containing 352 GPU-optimized math functions. Two identical copies are statically embedded in the binary:

Copy | Address     | Pipeline
A    | unk_3EA0080 | LibNVVM mode (Path A)
B    | unk_420FD80 | Standalone mode (Path B)

Key properties:

  • Target triple: nvptx64-nvidia-gpulibs
  • Function count: 352 (all alwaysinline nounwind)
  • NVVM IR version: (2, 0) -- always-compatible sentinel
  • Producer: "clang version 3.8.0 (tags/RELEASE_380/final)" -- the Clang version that originally compiled libdevice (not indicative of cicc's own compiler version)
  • NVVMReflect calls: uses __nvvm_reflect("__CUDA_FTZ"), __nvvm_reflect("__CUDA_ARCH"), and __nvvm_reflect("__CUDA_PREC_SQRT") for runtime specialization

The libdevice.10.bc naming convention carries forward from the CUDA 5.0 era. The 10 in the filename originally indicated "compute capability 1.0 and above" (i.e., universal), not a version number.

See Libdevice Linking for the linking algorithm, version validation, and NVVMReflect interaction.

NVVM Container Format Version

The NVVM container binary envelope uses its own versioning scheme, independent of the NVVM IR version:

| Field | Offset | Value | Meaning |
|-------|--------|-------|---------|
| version_major | 0x04 | 1 | Container format major |
| version_minor | 0x05 | <= 0x41 | Container format minor |
| nvvm_ir_major | 0x06 | 2 | NVVM IR spec major (container-level) |
| nvvm_ir_minor | 0x07 | <= 0x62 | NVVM IR spec minor (container-level) |
| nvvm_debug_major | 0x08 | 3 | Debug info format major |
| nvvm_debug_minor | 0x09 | <= 2 | Debug info format minor |
| llvm_major | 0x0A | encoded | LLVM version (combined: major * 100 + minor = 2000) |
| llvm_minor | 0x0B | encoded | |

The container's LLVM version encoding stores the combined value 20 * 100 + 0 = 2000, confirming the internal LLVM 20.0.0 base.
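The version fields above can be modeled with a small struct plus the combined-encoding decode. Field names are descriptive labels for the raw byte offsets, not recovered symbols:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the container header's version fields (offsets 0x04..0x0B).
 * In the real binary these are raw bytes; the struct is illustrative. */
typedef struct {
    uint8_t version_major;    /* 0x04 */
    uint8_t version_minor;    /* 0x05 */
    uint8_t nvvm_ir_major;    /* 0x06 */
    uint8_t nvvm_ir_minor;    /* 0x07 */
    uint8_t nvvm_debug_major; /* 0x08 */
    uint8_t nvvm_debug_minor; /* 0x09 */
    uint8_t llvm_major;       /* 0x0A: combined encoding spans both bytes */
    uint8_t llvm_minor;       /* 0x0B */
} nvvm_container_versions;

/* The LLVM version is stored combined as major * 100 + minor. */
static unsigned decode_llvm_major(unsigned encoded, unsigned *minor_out) {
    *minor_out = encoded % 100;
    return encoded / 100;
}
```

Decoding the stored value 2000 yields major 20, minor 0, i.e. LLVM 20.0.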

See NVVM Container for the full binary format specification.

Version Cross-Reference Matrix

How versions flow through the pipeline:

                    EDG 6.6 Frontend
                         |
                         v
                    NVVM IR Generation
                    (emits nvvmir.version = {3, 2})
                         |
                    +----+----+
                    |         |
              libdevice    user IR
           (version 2,0) (version 3,2)
                    |         |
                    +----+----+
                         |
                    NVVM IR Version Check
                    (gate: major==3, minor<=2)
                    (sentinel: 2,0 always passes)
                         |
                    LLVM 20.0.0 Optimizer
                         |
                    Bitcode Writer
                    (producer: "LLVM7.0.1")
                         |
                    NVVM Container Serializer
                    (container version 1.x, LLVM encoded as 2000)
                         |
                         v
                    .ptx / .optixir output

Future Updates

This wiki documents cicc v13.0 from the CUDA 13.0 toolkit. When a new CUDA toolkit release ships a newer cicc binary, the following version fields are the most likely to change:

  • LLVM base version: NVIDIA periodically rebases on newer LLVM releases. A jump from 20.0.0 to a later version would change the internal string, the container LLVM encoding, and potentially add new passes, opcodes, and metadata formats.
  • EDG version: EDG releases track independently of LLVM. A bump from 6.6 to a later version would affect C++ standard support, keyword handling, and the frontend error catalog.
  • NVVM IR version minor: the minor field (currently 2 in the major == 3 series) may increment to accommodate new metadata kinds or intrinsic conventions without breaking the major version.
  • PTX ISA version: new SM targets require new PTX versions. sm_100 Blackwell already uses a higher PTX version than sm_90 Hopper.
  • SM target range: new GPU architectures add new SM numbers. The sm_75--sm_121 range in v13.0 will expand in future releases.

The bitcode producer string ("LLVM7.0.1") is unlikely to change in the near term -- doing so would break backward compatibility with the entire NVVM IR ecosystem. The libdevice version sentinel (2, 0) is similarly stable because the version checker special-cases it.

To update this wiki for a new cicc version:

  1. Extract the build string (search for cuda_XX.Y.rXX.Y/compiler.).
  2. Check ctor_036 for the LLVM version fallback string.
  3. Check the EDG version string at sub_617BD0.
  4. Check the NVVM IR version gate constants at the version checker function.
  5. Measure the embedded libdevice size and function count.
  6. Verify the NVVM container header version fields.

Cross-References

Compilation Pipeline Overview

This page maps the complete end-to-end flow of a CUDA compilation through cicc v13.0, from the initial CLI invocation to the final PTX text output. Each stage is a self-contained subsystem with its own address range, data structures, and failure modes. The links below lead to dedicated pages with reimplementation-grade detail for every stage.

Pipeline Diagram

nvcc
  |
  v
+===========================================================+
| cicc (60 MB, 80,562 functions)                            |
|                                                           |
|  1. CLI Parsing & Dispatch -----> [entry.md]              |
|     |  argv/envp, flag translation, arch detection        |
|     |  dual-path select: Path A (LibNVVM) / Path B        |
|     v                                                     |
|  2. nvcc-to-cicc Interface -----> [nvcc-interface.md]     |
|     |  flag tree (40+ mappings), 3-column arch fan-out    |
|     |  mode cookies: 0xABBA=CUDA, 0xDEED=OpenCL          |
|     v                                                     |
|  3. EDG 6.6 Frontend -----------> [edg.md]                |
|     |  CUDA C++ --> transformed C (.int.c/.device.c)      |
|     |  737 config #defines, GCC 8.1 / Clang 9.1 emu      |
|     v                                                     |
|  4. NVVM IR Generation ---------> [ir-generation.md]      |
|     |  EDG IL tree --> LLVM Module (NVVM IR)              |
|     |  address spaces, kernel metadata, builtins          |
|     v                                                     |
|  5. Libdevice Linking ----------> [../infra/libdevice-linking.md]
|     |  embedded 455KB bitcode, 352 __nv_* math fns        |
|     |  target triple validation, NVVM version check       |
|     v                                                     |
|  6. LLVM Optimizer -------------> [optimizer.md]          |
|     |  two-phase model (analysis -> codegen-oriented)     |
|     |  49.8KB pipeline assembler, ~150 pass insertions    |
|     |  concurrent per-function Phase II                   |
|     v                                                     |
|  7. LTO Pipeline ---------------> [../lto/index.md]       |
|     |  cross-TU inlining, devirt, GlobalOpt               |
|     |  closed-world GPU model: no dlopen, no .so          |
|     v                                                     |
|  8. Code Generation ------------> [codegen.md]            |
|     |  SelectionDAG, ISel, RegAlloc, MachineIR passes     |
|     |  37 MB of code, largest subsystem                   |
|     v                                                     |
|  9. PTX Emission ---------------> [emission.md]           |
|     |  .entry/.func headers, register decls, .loc/.file   |
|     |  AsmPrinter, GenericToNVVM addrspace rewrite        |
|     v                                                     |
|  OUTPUT: .ptx file (or NVVM bitcode, or OptiX IR)        |
+===========================================================+

Side paths:
  * OptiX IR (--emit-optix-ir) ----> [optix-ir.md]
  * Debug info (all stages) -------> [debug-info-pipeline.md]

Stage Descriptions

1. Entry Point & CLI Parsing

The real main (sub_8F9C90, 10KB) parses argv, detects wizard mode via NVVMCCWIZ=553282, selects the target architecture (default sm_75), and dispatches into one of two compilation paths. Path A serves the LibNVVM API; Path B serves standalone nvcc invocations. Both paths are functionally identical but duplicated in the binary at different address ranges. See Entry Point & CLI.

2. nvcc-to-cicc Interface

The flag translation layer (sub_8FE280) rewrites nvcc-facing flags into cicc-facing flags through a std::map red-black tree, then a second stage (sub_95EB40) fans each flag out into three columns targeting EDG, OPT, and LLC separately. Mode cookies (0xABBA for CUDA, 0xDEED for OpenCL) select language-specific behavior. See nvcc-to-cicc Interface.
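The two-stage translation (map lookup, then three-column fan-out to EDG/OPT/LLC) can be sketched as a lookup table. The mapping shown is illustrative, not the recovered table; only -use_fast_math and -nvptx-f32ftz appear elsewhere in this wiki, and the OPT-column spelling is an assumption:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of nvcc-to-cicc flag translation with per-consumer fan-out.
 * Row contents are illustrative placeholders, not recovered mappings. */
typedef struct {
    const char *nvcc; /* nvcc-facing spelling */
    const char *edg;  /* column 1: EDG frontend */
    const char *opt;  /* column 2: optimizer */
    const char *llc;  /* column 3: code generator */
} flag_row;

static const flag_row flag_table[] = {
    { "-use_fast_math", "--use_fast_math", "-nvvm-fast-math", "-nvptx-f32ftz" },
};

static const flag_row *translate_flag(const char *nvcc_flag) {
    /* The binary uses a std::map red-black tree; a linear scan over a
     * sorted table is behaviorally equivalent for illustration. */
    for (size_t i = 0; i < sizeof flag_table / sizeof flag_table[0]; i++)
        if (strcmp(flag_table[i].nvcc, nvcc_flag) == 0)
            return &flag_table[i];
    return NULL; /* unrecognized: passed through unchanged */
}
```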

3. EDG 6.6 Frontend

A licensed commercial frontend (3.2 MB, 0x5D0000--0x8F0000) parses CUDA C++ source and emits transformed C code into .int.c, .device.c, and .stub.c files. CUDA syntax (<<<>>>, __shared__, __device__) is fully resolved in this stage. The output is C source, not LLVM IR. See EDG 6.6 Frontend.

4. NVVM IR Generation

Translates the EDG intermediate language (IL) tree into an LLVM Module with proper NVPTX address space annotations, nvvm.annotations kernel metadata, and lowered builtins. This is cicc's equivalent of Clang's lib/CodeGen, but operates on EDG's proprietary IL node format. See NVVM IR Generation and its sub-pages for expressions, statements, functions, and types.

5. Libdevice Linking

A 455,876-byte LLVM bitcode library containing 352 GPU-optimized math functions (__nv_sinf, __nv_expf, etc.) is embedded directly in the cicc binary. The linker validates the nvptx64- target triple, checks NVVM IR version metadata, and merges the library into the compilation module. No filesystem access is required. See Libdevice Linking.

6. LLVM Optimizer

A proprietary two-phase pipeline (sub_12E54A0, 49.8KB) runs ~150 passes: Phase I performs module-wide analysis, Phase II performs codegen-oriented transforms with optional per-function parallelism using a jobserver or thread pool. All behavior is controlled by the 222-slot NVVMPassOptions system. See LLVM Optimizer and Pipeline & Ordering.

7. LTO Pipeline

Exploits the GPU's closed-world compilation model (no dlopen, no shared libraries, no symbol interposition) for aggressive cross-TU inlining, whole-program devirtualization, and global variable promotion. Activated in separate compilation mode (nvcc -dc), but GlobalOpt and the inliner run even in single-TU mode. See LTO & Module Optimization.

8. Code Generation

The largest subsystem (37 MB, 0x1700000--0x35EFFFF) lowers optimized LLVM IR to NVPTX MachineInstr through SelectionDAG construction, type legalization, instruction selection via a three-level pattern match engine (900KB), pressure-driven greedy register allocation, and ~30 machine-level passes including tensor core codegen for HMMA/IMMA/WGMMA/tcgen05. See Code Generation.

9. PTX Emission

The AsmPrinter (sub_31EC4F0, 72KB) walks the final MachineFunction and emits PTX text: .entry/.func headers with kernel attributes, register declarations for 9 register classes, .loc/.file debug directives, and instruction mnemonics. A GenericToNVVM pass rewrites any remaining generic address space references before emission. See PTX Emission.

Side Paths

OptiX IR -- When --emit-optix-ir is passed, the pipeline replaces LLC with an OPTIXIR stage that serializes the optimized LLVM module for the OptiX ray tracing runtime's continuation-based execution model. See OptiX IR Generation.

Debug Info -- Debug metadata flows through all stages: generated in IR-gen, preserved or stripped in the optimizer (5 stripping passes), verified after each pass, and emitted as .loc/.file PTX directives. See Debug Info Pipeline.

Internal Pipeline Encoding

Internally, cicc represents the active pipeline stages as a bitmask:

| Stage | Internal Name | Bit | Description |
|-------|---------------|-----|-------------|
| LNK | Libdevice link | 0x01 | Merge embedded math library |
| OPT | Optimizer | 0x02 | LLVM IR optimization (Phase I + II) |
| OPTIXIR | OptiX IR | 0x40 | OptiX serialization (mutually exclusive with LLC) |
| LLC | Code generation | 0x04 | SelectionDAG through PTX emission |

The standard CUDA compilation bitmask is LNK | OPT | LLC = 0x07. OptiX mode uses 0x43.
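The bitmask arithmetic can be written out directly; the enum names mirror the stage table above and are otherwise illustrative:

```c
#include <assert.h>

/* Sketch of the pipeline-stage bitmask. Bit values are from the stage
 * table; the enum and helper are illustrative. */
enum {
    STAGE_LNK     = 0x01, /* libdevice link   */
    STAGE_OPT     = 0x02, /* LLVM optimizer   */
    STAGE_LLC     = 0x04, /* code generation  */
    STAGE_OPTIXIR = 0x40, /* OptiX serializer */
};

static int stage_mask_valid(int mask) {
    /* OPTIXIR replaces LLC; both set at once is invalid. */
    return (mask & (STAGE_OPTIXIR | STAGE_LLC)) != (STAGE_OPTIXIR | STAGE_LLC);
}
```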

Cross-References

Entry Point & CLI

The cicc binary has a surprisingly complex entry point. Rather than a straightforward main → compile → exit flow, it implements a dual-path architecture where the same binary can operate as either a LibNVVM-based compiler (Path A) or a standalone compiler (Path B), selected at runtime through environment variables and obfuscated string comparisons. This design allows NVIDIA to ship a single binary that serves both the nvcc toolchain and the LibNVVM API.

The entry point region (0x8F00000x96FFFF, ~520 KB) handles CLI parsing, architecture detection with a 3-column flag fan-out system, and dispatch into one of several compilation pipelines. A hidden "wizard mode" gated behind an environment variable with a magic number enables developer diagnostics that are otherwise completely inaccessible.

| Field | Value |
|-------|-------|
| main() thunk | 0x4396A0 (16 bytes) — return sub_8F9C90(argc, argv, envp) |
| Real main | sub_8F9C90 (10,066 bytes, 1,990 lines) |
| Wizard mode | getenv("NVVMCCWIZ") == 553282 → byte_4F6D280 = 1 |
| Default arch | compute_75 / sm_75 (Turing) |
| Flag catalog | sub_9624D0 (75KB, 2,626 lines, 4 output vectors) |
| Architecture map | sub_95EB40 (38KB, 23 architectures, 3-column fan-out) |
| Flag translation | sub_8FE280 (red-black tree at qword_4F6D2A0, 40+ nvcc→cicc mappings) |
| Pipeline stages | LNK → OPT → [OPTIXIR] → LLC |
| Dual path | Path A (sub_905EE0) / Path B (sub_1265970) |
| Libdevice | Path A: unk_3EA0080 / Path B: unk_420FD80 (455,876 bytes each) |
| Arch bitmask | 0x60081200F821 (validates SM 75–121) |

Architecture

main (0x4396A0, 16B thunk)
  │
  └─ sub_8F9C90 (10KB, REAL MAIN)
       │
       ├─ getenv("NVVMCCWIZ") == 553282 → wizard mode
       ├─ sub_16C5290: extract program name from argv[0]
       │
       ├─ ARGUMENT LOOP (v15 = 1..argc)
       │    ├─ -o <file>              → v257 (output)
       │    ├─ -nvvmir-library <path> → v256 (libdevice)
       │    ├─ -lgenfe/-libnvvm/-lnk/-opt/-llc → v263 (mode)
       │    ├─ -arch/-mcpu/--nv_arch  → v242 (SM number)
       │    ├─ --emit-optix-ir        → v243=1, v258=1
       │    ├─ -nvc                   → v258=1
       │    ├─ -irversion             → print IR version, exit
       │    ├─ .bc/.ci/.i/.ii/.cup/.optixir → s (input file)
       │    └─ obfuscated option      → v253 (0 or 1)
       │
       ├─ v253 RESOLUTION (if still == 2)
       │    └─ getenv(obfuscated) → compare → set v253 = 0 or 1
       │
       ├─ DISPATCH (v263 × v253)
       │    ├─ v263==0, v253==1 → sub_902D10  (simple Path A)
       │    ├─ v263==0, v253==0 → sub_1262860 (simple Path B)
       │    ├─ v263==1          → sub_905E50 / sub_12658E0 (lgenfe)
       │    ├─ v263≥2, v253==1  → sub_905EE0  (multi-stage Path A)
       │    └─ v263≥2, v253==0  → sub_1265970 (multi-stage Path B)
       │
       └─ CLEANUP: free all vectors, strings, argv copy

Real Main — sub_8F9C90

The exported main() at 0x4396A0 is a 16-byte thunk that immediately tail-calls sub_8F9C90 — the actual entry point. This function is a monolithic CLI parser and dispatcher: it copies argv into a local buffer, checks for wizard mode, iterates over all arguments accumulating state in ~12 local variables, resolves the compilation path, and finally dispatches to the appropriate pipeline function. The entire function is a single 10KB basic-block-heavy control flow graph with ~80 branch targets.

| Field | Value |
|-------|-------|
| Address | 0x8F9C90–0x8FC3E2 |
| Size | 10,066 bytes |
| Stack frame | 0x978 (2,424) bytes |
| Local buffers | v284[2096] for argv copy (stack if argc ≤ 256, else heap) |

Argument Handling and Argv Copy

The function begins with a defensive copy of argv into a local buffer. When 8 * argc fits within 0x800 bytes (argc ≤ 256), the copy lives in v284[2096] on the stack. For larger argument lists -- which can occur during complex nvcc invocations with many pass-through flags -- it allocates heap memory via sub_16CD150. This copy is necessary because the argument loop modifies pointers (advancing i to skip flag values), and the caller's argv must not be disturbed.

if (8 * argc > 0x800)
    v284 = sub_16CD150(8 * argc);   // heap alloc for large argc
// else use stack buffer v284[2096]
memcpy(v284, argv, 8 * argc);       // copy all pointers

After copying, sub_16C5290 extracts the base program name from argv[0] -- stripping directory prefixes -- and stores it in dest. This name appears in error messages and verbose output throughout the pipeline.

Key Local Variables

The function's behavior is controlled by two critical dispatch variables: v253 (which compilation backend to use) and v263 (which phase of the pipeline to invoke). These are accumulated during the argument loop and combined after parsing to select one of ~10 possible code paths. The interaction between them creates a matrix of behaviors that covers everything from simple single-file compilation to multi-stage LibNVVM pipeline processing.

| Variable | Init | Purpose |
|----------|------|---------|
| v253 | 2 | Dispatch mode: 0=Path B, 1=Path A, 2=default (needs env resolution) |
| v263 | 0 | Invocation mode: 0=default, 1=lgenfe, 2=libnvvm, 3=lnk, 4=opt, 6=llc |
| v242 | 0 | Target architecture (SM number) |
| v258 | 0 | NVC flag |
| v243 | 0 | OptiX IR flag |
| v259 | 0 | Verbose (only effective in wizard mode) |
| v261 | 0 | Dryrun |
| v262 | 0 | Keep intermediates (only effective in wizard mode) |
| s | NULL | Input file path |
| v257 | NULL | Output file path |
| v256 | NULL | NVVM IR library path |
| v266 | vector | Pass-through options vector |

Wizard Mode

v10 = getenv("NVVMCCWIZ");                    // 0x8F9D36
if (v10 && strtol(v10, NULL, 10) == 553282)   // 0x8F9D92
    byte_4F6D280 = 1;

Global byte_4F6D280 gates the effectiveness of -v, -keep, -dryrun. Without wizard mode, these flags are silently ignored — v259 and v262 stay 0. This is a deliberate anti-reverse-engineering measure: even if someone discovers the -v flag, it does nothing without the magic environment variable. The magic number 553282 (0x87142) appears to be arbitrary.

Invocation Modes (v263)

The v263 variable determines which stage of the compilation pipeline cicc enters. When nvcc invokes cicc directly, v263 stays at 0 (default). But cicc can also be invoked in sub-pipeline mode — for example, -lnk runs only the linking phase, -opt runs only the optimizer, and -llc runs only code generation. This is how the multi-stage pipeline works: the outer driver calls cicc multiple times with different -lXXX flags, or a single invocation with -libnvvm runs all stages internally.

Each mode has its own format for the -discard-value-names flag, which tells the LLVM backend whether to strip IR value names (reducing memory usage). The different formats exist because each sub-pipeline stage has its own option namespace:

| v263 | Flag | Mode | discard-value-names format |
|------|------|------|----------------------------|
| 0 | (none) | Default (nvcc invocation) | -discard-value-names |
| 1 | -lgenfe | EDG frontend linkage | --discard_value_names=1 (underscores) |
| 2 | -libnvvm | LibNVVM API | -discard-value-names=1 (dashes) |
| 3 | -lnk | Linker | -lnk-discard-value-names=1 |
| 4 | -opt | Optimizer | -opt-discard-value-names=1 |
| 5 | (internal) | Undocumented (sets v278 high byte) | |
| 6 | -llc | Standalone LLVM codegen | |

Input File Extensions

Input files are identified by extension during the argument loop. The last matching file wins (s is overwritten each time). Unrecognized arguments are added to the v266 pass-through vector and forwarded to sub-pipelines. The .cup extension has a special restriction — it's only accepted when the preceding argument is --orig_src_path_name or --orig_src_file_name, which are metadata flags inserted by nvcc to track the original source file.

| Extension | Format | Condition |
|-----------|--------|-----------|
| .bc | LLVM bitcode | Always accepted |
| .ci | CUDA intermediate (preprocessed) | Always accepted |
| .i | Preprocessed C/C++ | Always accepted |
| .ii | Preprocessed C++ | Always accepted |
| .cup | CUDA source | Only after --orig_src_path_name or --orig_src_file_name |
| .optixir | OptiX IR | Always accepted |
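The extension matching, including the .cup restriction, can be sketched as follows. Helper names are ours:

```c
#include <assert.h>
#include <string.h>

/* Sketch of input-file recognition in the argument loop. The .cup branch
 * models the rule that it is only accepted immediately after an
 * --orig_src_path_name / --orig_src_file_name flag. Illustrative code. */
static int has_ext(const char *path, const char *ext) {
    size_t lp = strlen(path), le = strlen(ext);
    return lp > le && strcmp(path + lp - le, ext) == 0;
}

static int is_input_file(const char *arg, const char *prev_arg) {
    if (has_ext(arg, ".bc") || has_ext(arg, ".ci") || has_ext(arg, ".i") ||
        has_ext(arg, ".ii") || has_ext(arg, ".optixir"))
        return 1;
    if (has_ext(arg, ".cup") && prev_arg &&
        (strcmp(prev_arg, "--orig_src_path_name") == 0 ||
         strcmp(prev_arg, "--orig_src_file_name") == 0))
        return 1;
    return 0;
}
```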

Obfuscated Strings

At 0x8F98A0, sub_8F98A0 decrypts strings using an XOR + ROT13-like cipher:

v40 = v37 ^ (-109 * ((offset + 97) ^ 0xC5));
// then ROT13 on alphabetic characters

This hides an environment variable name and option prefix from static analysis. The decrypted strings control the v253 (Path A vs Path B) resolution when no explicit mode is specified.
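A minimal sketch of the two-stage decryption, assuming `offset` in the decompiled expression is the byte index within the string (that correspondence is an assumption; the per-byte constants are taken verbatim from the decompilation):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the sub_8F98A0 cipher: a position-dependent XOR mask from the
 * decompiled expression, followed by ROT13 on alphabetic characters.
 * The mapping of "offset" to the byte index is an assumption. */
static char rot13(char c) {
    if (c >= 'a' && c <= 'z') return (char)('a' + (c - 'a' + 13) % 26);
    if (c >= 'A' && c <= 'Z') return (char)('A' + (c - 'A' + 13) % 26);
    return c;
}

static void decrypt(const unsigned char *in, char *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* Stage 1: XOR with -109 * ((offset + 97) ^ 0xC5), per the
         * decompiled expression. */
        unsigned char mask = (unsigned char)(-109 * (((int)i + 97) ^ 0xC5));
        /* Stage 2: ROT13 the result. */
        out[i] = rot13((char)(in[i] ^ mask));
    }
    out[n] = '\0';
}
```

Since ROT13 is self-inverse, encryption is the same pipeline run in reverse order (ROT13, then XOR).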

Error Messages

| Message | Condition | Address |
|---------|-----------|---------|
| "Missing output file\n" | -o with no next argument | 0x8FA365 |
| "Missing NVVM IR library file\n" | -nvvmir-library with no next arg | 0x8FAB34 |
| "Unparseable architecture: " + value | Invalid arch string | Multiple |
| "Missing input file\n" | No recognized input file | 0x8FBEAD |
| "Recognized input file extensions are: .bc .ci .i .cup .optixir" | After missing input | 0x8FBE97 |
| "Error: Output file was not specified (See -o option).\n" | Multi-stage without -o | 0x8FB655 |

The v253 Dispatch Variable

The v253 variable is the single most important dispatch control in the entire entry point. It determines whether the compilation uses Path A (the EDG/PTX-producing pipeline) or Path B (the standalone LLVM-based pipeline). Understanding its resolution logic is essential to reproducing cicc's behavior.

Initialization and Explicit Setting

v253 begins at 2 (unresolved default). During the argument loop, obfuscated string matching can set it directly:

| Source | Value | Meaning |
|--------|-------|---------|
| Initial default | 2 | Needs environment variable resolution |
| Obfuscated option suffix matches byte_3C23AC3 | 1 | Path A explicitly requested |
| Obfuscated option suffix matches byte_3C23AB4 | 0 | Path B explicitly requested |

Environment Variable Resolution

When v253 remains at 2 after argument parsing (the common case), cicc resolves it through the obfuscated environment variable NV_NVVM_VERSION (decrypted from byte_3C23A9F). The resolution has two sub-cases depending on the target architecture:

if (v253 == 2) {
    env = getenv(decrypt(byte_3C23A9F));   // NV_NVVM_VERSION
    if (env matches decrypt(byte_3C23A82))       // "nvvm-latest"
        v253 = 1;  // Path A
    else if (env matches decrypt(byte_3C23A7B))  // "nvvm70"
        v253 = 0;  // Path B
    else if (v242 > 99 && !v258)                 // SM >= 100, not -nvc
        v253 = 0;  // Path B (new architectures default to standalone)
    else
        v253 = 1;  // Path A (legacy default)
}

The architectural threshold at SM 100 (Blackwell) is notable: for SM < 100, the default is Path A (the EDG frontend path). For SM >= 100, unless the -nvc flag is present, the default switches to Path B. This suggests NVIDIA is migrating newer architectures toward the standalone LLVM pipeline, possibly as a precursor to eventually deprecating the EDG-based path.

Version Strings Injected per Path

After v253 is resolved and for multi-stage modes (v263 >= 3), the entry point injects a version string into the pass-through options:

| v253 | Injected string | Semantics |
|------|-----------------|-----------|
| 1 (Path A) | "-nvvm-version=nvvm-latest" (25 bytes from xmmword_3C23BC0) | Targets the latest NVVM IR specification |
| 0 (Path B) | "-nvvm-version=nvvm70" (20 bytes) | Targets NVVM 7.0 IR (frozen at LLVM 7.0.1 bitcode format) |

This version string propagates through the entire pipeline, controlling bitcode compatibility, intrinsic name resolution, and metadata format expectations.

Post-Parse Dispatch Logic

After the argument loop terminates, the dispatch logic combines v253 and v263 to select the target function. The combined keep-and-verbose flag v260 = v262 & v259 is also computed -- both wizard-mode flags must be active for intermediate file retention and verbose logging to function simultaneously.

Simple Dispatch (v263 == 0)

When cicc is invoked without any -lXXX mode flag (the standard nvcc invocation path):

if (v253 == 1)
    v8 = sub_902D10(dest, 0, &v266, s, v257, v256, v260, v262, v261);
    // Path A: CLI → lgenfe → LibNVVM pipeline
else
    v8 = sub_1262860(dest, 0, &v266, s, v257, v256, v260, v262, v261);
    // Path B: CLI → standalone LLVM pipeline

Both functions receive identical parameter signatures: program name, zero (unused), pass-through options, input file, output file, libdevice path, verbose+keep, keep, and dryrun. The return value becomes the process exit code.

lgenfe Dispatch (v263 == 1)

The -lgenfe mode builds a full argv-style array with the program name as the first entry, followed by all v266 pass-through options. This argv is then passed to one of two function pairs:

| v253 | Init function | Pipeline function |
|------|---------------|-------------------|
| 1 (Path A) | sub_B6EEA0 (LLVMContext + metadata kind registration) | sub_905880 (EDG lgenfe) |
| 0 (Path B) | sub_1602D10 (standalone context initialization) | sub_1265340 (standalone lgenfe) |

The init functions create the LLVM context and register the 42+ metadata kinds used throughout the pipeline (dbg, tbaa, prof, noalias, etc.). These must be registered before any IR construction begins.

Multi-Stage Dispatch (v263 >= 2)

For -libnvvm, -lnk, -opt, and -llc modes, the dispatch constructs a CompilationState structure with input/output strings, extra arguments, and the v278 mode byte, then calls:

| v253 | Function | Size | Role |
|------|----------|------|------|
| 1 | sub_905EE0 | 43 KB | Path A multi-stage pipeline driver |
| 0 | sub_1265970 | 48 KB | Path B multi-stage pipeline driver |

For -libnvvm (v263 == 2), the extra args are taken directly from v266 without prepending the program name. For -lnk/-opt/-llc (v263 >= 3), the appropriate version string (nvvm-latest or nvvm70) is appended to the pass-through options before dispatch.

Cleanup

After the pipeline function returns, sub_8F9C90 performs deterministic cleanup in reverse allocation order: the v281 extra-argument char** array and each entry, the v275 output string, the s2 input string, each element of the v266 pass-through vector, the vector's backing buffer, the dest program name, and the v282 argv copy buffer (if heap-allocated). The return value v8 is 0 on success, 1 on argument errors, or the pipeline function's return code (stored in v264).

Path A — EDG → LibNVVM Pipeline

Path A is the full CUDA C++ compilation path. It starts with the EDG 6.6 C++ frontend parsing CUDA source code into an IL tree, then converts that IL into LLVM IR via the lgenfe (LLVM Generation Front End) stage, and finally runs the LibNVVM pipeline to optimize and lower the IR to PTX. This is the path taken when cicc is invoked by nvcc for .cu file compilation, and it represents the standard CUDA compilation flow that most users encounter.

Path A Orchestrator — sub_902D10

The orchestrator is a 9 KB function that sequences the three major stages of Path A compilation. It acts as the conductor between the CLI processing layer, the EDG frontend, and the LibNVVM optimizer/codegen.

| Field | Value |
|-------|-------|
| Address | 0x902D10 |
| Size | ~9 KB |
| Timer | Creates 8-byte timer via sub_22077B0 |
| Init | sub_B6EEA0 |

Execution flow:

  1. Timer creation. Allocates and initializes an 8-byte timing context. The sub_B6EEA0 init function also registers the 42+ LLVM metadata kinds (dbg=1, tbaa=2, prof=3, ... noalias.addrspace=42) that all subsequent IR construction depends on. This is why the timer creation happens first: the metadata registration is a side effect of context initialization.

  2. CLI processing. Calls sub_900130 (39 KB) to parse the accumulated CLI flags into structured forms: command buffer v58, emit-llvm-bc flag v52, architecture compute/SM numbers v55/v56, and file paths. On failure: "Error processing command line: <cmd>\n".

  3. Include path setup. If an input file is present (v64), calls sub_C98ED0 to configure system and user include paths for the EDG frontend.

  4. EDG frontend (lgenfe). Calls sub_905880 with timer name "CUDA C++ Front-End". This stage:

    • Allocates an 880-byte module object via sub_BA8740
    • Processes lgenfe CLI options from the options struct
    • In dryrun mode: skips execution, frees the module, returns null
    • On success: returns a module pointer and sets the output path
  5. LibNVVM pipeline. If lgenfe succeeds (module pointer is non-null), calls sub_905EE0 with the module for the full optimization and codegen pipeline.

  6. Time profiler output. After pipeline completion, checks sub_C96F30() for active profiling. If profiling is enabled, writes timing data to the output file via sub_C9C600. Failure emits: "Error: Failed to write time profiler data.\n".

  7. Cleanup. Frees the timer (sub_B6E710), option strings, and option arrays.

EDG Frontend Stage — sub_905880

The lgenfe stage bridges the EDG 6.6 C++ frontend to LLVM IR generation. This is where CUDA C++ source code becomes NVVM IR.

| Field | Value |
|-------|-------|
| Address | 0x905880 |
| Size | ~6 KB |
| Timer label | "CUDA C++ Front-End" |
| Module size | 880 bytes (allocated by sub_BA8740) |

The function reconstructs a verbose command line for diagnostic output (quoting paths for --orig_src_file_name, --orig_src_path_name, --compiler_bindir, --sdk_dir), builds an argument array, and calls sub_908750(numArgs, argArray, opt_level) to create the LLVM module. On success, it copies the output path into the module at offset 21*8 and, if the keep flag is set via a3->byte[66], calls sub_905860 to write intermediate files.

The actual EDG parsing and IL-to-IR conversion happens inside sub_908750, which eventually calls sub_617BD0 — the lgenfe_main function documented in the EDG Frontend page.

EDG Module Binding — sub_908850

After the EDG frontend produces its IL tree, sub_908850 (10 KB) bridges the output to the LLVM backend. This function performs the critical step of configuring the LLVM module's data layout and target triple based on the target architecture.

Data layout strings are selected based on unk_4F06A68 (address space width):

| Width | p3 flag | Data layout string |
|-------|---------|--------------------|
| 8 (64-bit) | unk_4D0461C set | "e-p:64:64:64-p3:32:32:32-i1:8:8-..." (167 chars) |
| 8 (64-bit) | Not set | "e-p:64:64:64-i1:8:8-..." (155 chars) |
| 4 (32-bit) | | "e-p:32:32:32-i1:8:8-..." (155 chars) |

The p3:32:32:32 component enables 32-bit pointers in address space 3 (shared memory), which is critical for SM architectures where shared memory accesses use 32-bit addressing even in 64-bit compilation mode.

Target triple is set to "nvptx64-nvidia-cuda" for 64-bit or "nvptx-nvidia-cuda" for 32-bit. The function also:

  • Creates a 496-byte target info structure via sub_AE3F70
  • Iterates global function declarations, marking device functions for compilation via sub_91CA00
  • Iterates global variables, processing initializers for device-side storage via sub_9172F0
  • Runs LLVM module verification via sub_B89FE0 -- on failure: "there was an error in verifying the lgenfe output!"
  • Stores the module globally at unk_4F6D2F8
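The data layout and triple selection described above can be sketched as a pair of helpers. The parameters stand in for unk_4F06A68 (pointer width) and unk_4D0461C (32-bit shared-memory pointers), and the layout strings are abbreviated:

```c
#include <assert.h>
#include <string.h>

/* Sketch of sub_908850's layout/triple selection. Strings abbreviated;
 * parameter names are ours, standing in for the recovered globals. */
static const char *select_data_layout(int ptr_width_bytes, int p3_32bit) {
    if (ptr_width_bytes == 8)
        return p3_32bit
            ? "e-p:64:64:64-p3:32:32:32-i1:8:8" /* 32-bit AS3 (shared mem) */
            : "e-p:64:64:64-i1:8:8";
    return "e-p:32:32:32-i1:8:8";
}

static const char *select_triple(int ptr_width_bytes) {
    return ptr_width_bytes == 8 ? "nvptx64-nvidia-cuda" : "nvptx-nvidia-cuda";
}
```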

LibNVVM Pipeline Driver — sub_905EE0

This 43 KB function is the core of Path A. It orchestrates the full compilation through 14 sequential phases, using an interesting indirection mechanism: rather than calling LibNVVM API functions directly, it resolves them at runtime through sub_12BC0F0(id) — a dispatch function that takes a numeric ID and returns a function pointer.

| Field | Value |
|-------|-------|
| Address | 0x905EE0 |
| Size | 43 KB (1,268 lines) |
| Timer | "LibNVVM" |
| Orchestrator | sub_902D10 (simple mode) |

14-Phase Compilation Flow

The compilation proceeds through these phases sequentially. Phases 2.1–2.14 are the core compilation unit lifecycle: create, populate, configure, compile, extract results, destroy. The -keep flag (when wizard mode is active) causes intermediate .lnk.bc and .opt.bc files to be written to disk, which is invaluable for debugging the pipeline.

| Phase | Action |
|-------|--------|
| 0 | Verbose command-line reconstruction |
| 1 | Input file handling — NVVM container (sub_9047E0) or regular bitcode |
| 2.1 | Pipeline infrastructure: verbose stream, address space query, module prep |
| 2.2 | nvvmCreateCU (dispatch ID 2151) |
| 2.3 | nvvmCUAddModule (dispatch ID 4660) |
| 2.4 | Check for -opt/-llc direct stage invocation |
| 2.5 | Parse and add additional modules from extra args |
| 2.6 | Libdevice linking: external file or embedded unk_3EA0080 (455,876 bytes) |
| 2.7 | Intermediate file path construction (.lnk.bc, .opt.bc) |
| 2.8 | Option parsing via sub_9685E0 → 4 stage vectors (lnk/opt/lto/llc) |
| 2.9 | Verbose stage logging + callback registration |
| 2.10 | Load 37 LLVM options from off_4B90FE0 via nvvmSetOptionStrings (ID 0xFEED) |
| 2.11 | nvvmCUCompile (ID 0xBEAD) with phase code 57069 (0xDEED) |
| 2.12 | Get compilation log/warnings |
| 2.13 | Get PTX output |
| 2.14 | nvvmDestroyCU (ID 21257 = 0x5309) |
| 3 | Error/warning output to caller |
| 4 | Write output file (text or binary detection via ELF magic) |
| 5 | Timer stop |

Input File Handling — Phase 1

Phase 1 has two sub-paths based on the a3->byte[65] container flag:

Path 1A: NVVM IR Container. When the input is an NVVM container (a binary format wrapping IR plus compilation options), sub_9047E0 (10 KB) parses it. The container format encodes the target SM version, FTZ mode, precision settings, and IEEE mode. The parser extracts these and converts them to LLVM CLI flags:

// Pseudo-code for container option extraction
push("-march=nvptx");
push("-mcpu=sm_" + str(container->sm_version / 10));
if (container->flags[200] & 0x20) push("-nvptx-f32ftz");
if (container->flags[200] & 0x80) push("-nvptx-prec-sqrtf32=1");
else                               push("-nvptx-prec-sqrtf32=0");
push(container->flags[204] ? "-nvvm-ieee-mode=S" : "-nvvm-ieee-mode=T");
if (container->mode == 2) push("--device-c");  // relocatable compilation

If parsing fails, the error message is "Invalid NVVM IR Container" (error code 259).

Path 1B: Regular LLVM bitcode. For raw .bc files, the function creates a timer object, configures the SM architecture via sub_B6F950, opens the file via sub_C7EAD0, and parses it into an LLVM module via sub_A01950.

LibNVVM API Dispatch IDs

Internal function sub_12BC0F0(id) returns API function pointers by numeric ID. This indirection exists because the LibNVVM API is implemented within the same binary — these aren't dynamically-linked external functions but rather internal call points resolved through a dispatch table. The hex IDs double as a form of internal documentation:

ID    | Hex    | Function
2151  | 0x0867 | nvvmCreateCU
4111  | 0x100F | nvvmGetCompiledResult
4660  | 0x1234 | nvvmCUAddModule
17185 | 0x4321 | nvvmCUSetExtraArgs
21257 | 0x5309 | nvvmDestroyCU
41856 | 0xA380 | nvvmGetCompiledResultSize (returns log size)
46903 | 0xB737 | nvvmGetCompiledResultLog
46967 | 0xB777 | nvvmGetErrorString
48813 | 0xBEAD | nvvmCUCompile
48879 | 0xBEEF | nvvmCURegisterCallback
61451 | 0xF00B | nvvmGetCompiledResultPTXSize
62298 | 0xF37A | nvvmCUAddModuleFromBuffer
65261 | 0xFEED | nvvmSetOptionStrings

The complete dispatch table in sub_12BC0F0 contains 25 entries implemented as a binary search tree on the ID value:

ID    | Hex    | Target      | Semantic Name
2151  | 0x0867 | sub_12BB090 | nvvmCreateCU
2167  | 0x0877 | sub_12BB090 | (alias)
3911  | 0x0F47 | sub_12BBF40 | nvvmCUSetProgressCallback
4111  | 0x100F | sub_12BA8F0 | nvvmGetCompiledResult
4606  | 0x11FE | sub_12BA330 | nvvmCULinkModule
4660  | 0x1234 | sub_12BC650 | nvvmCUAddModule
8320  | 0x2080 | sub_12BB400 | nvvmCUSetOption
11245 | 0x2BED | sub_12BB290 | nvvmCUGetLog
17185 | 0x4321 | sub_12BBD80 | nvvmCUSetExtraArgs
21257 | 0x5309 | sub_12B9C40 | nvvmDestroyCU
23294 | 0x5AFE | sub_12BAF10 | nvvmVerify
41856 | 0xA380 | sub_12BA220 | nvvmGetCompiledResultSize
45242 | 0xB0BA | sub_12BAB40 | nvvmCUGetWarnings
46903 | 0xB737 | sub_12BA7C0 | nvvmGetCompiledResultLog
46967 | 0xB777 | sub_12B9980 | nvvmGetErrorString
48813 | 0xBEAD | sub_12BA110 | nvvmCUCompile
48879 | 0xBEEF | sub_12BACF0 | nvvmCURegisterCallback
49522 | 0xC172 | sub_12BA470 | nvvmCUGetIR
51966 | 0xCAFE | sub_12B9A50 | nvvmGetVersion
56495 | 0xDCEF | sub_12B9A40 | (unknown)
57005 | 0xDEAD | sub_12B9C00 | nvvmInit
61451 | 0xF00B | sub_12BA560 | nvvmGetCompiledResultPTXSize
61453 | 0xF00D | sub_12BA6A0 | nvvmCURegisterLNKCallback
61806 | 0xF16E | sub_12BAA30 | nvvmCUGetOptIR
62298 | 0xF37A | sub_12BC8B0 | nvvmCUAddModuleFromBuffer
65261 | 0xFEED | sub_12B9AB0 | nvvmSetOptionStrings

Public LibNVVM API vs Internal CU API

The dispatch table above reveals a critical architectural detail: cicc's internal API uses compilation unit semantics (nvvmCreateCU, nvvmCUAddModule, nvvmCUCompile), while the public LibNVVM shared library (libnvvm.so) exports a different API surface using program semantics (nvvmCreateProgram, nvvmAddModuleToProgram, nvvmCompileProgram). The public API is documented in NVIDIA's nvvm.h header; the internal API exists only within cicc and is never exported.

Evidence for this mapping comes from nvlink's -dlto code path, which dynamically loads libnvvm.so via dlsym() and resolves symbols by their public names:

// nvlink sub_4BC290 — loads libnvvm.so for device LTO
dlsym(handle, "nvvmCreateProgram");    // → internally nvvmCreateCU
dlsym(handle, "nvvmCompileProgram");   // → internally nvvmCUCompile
dlsym(handle, "nvvmGetCompiledResultSize");
dlsym(handle, "nvvmGetCompiledResult");
dlsym(handle, "nvvmDestroyProgram");   // → internally nvvmDestroyCU

The complete mapping between the public libnvvm.so API (as used by external callers like nvlink and user programs) and cicc's internal CU dispatch IDs:

Public API (libnvvm.so)    | Internal Name                | Dispatch ID | Hex    | Target
nvvmCreateProgram          | nvvmCreateCU                 | 2151        | 0x0867 | sub_12BB090
nvvmAddModuleToProgram     | nvvmCUAddModule              | 4660        | 0x1234 | sub_12BC650
nvvmLazyAddModuleToProgram | nvvmCUAddModuleFromBuffer    | 62298       | 0xF37A | sub_12BC8B0
nvvmCompileProgram         | nvvmCUCompile                | 48813       | 0xBEAD | sub_12BA110
nvvmVerifyProgram          | nvvmVerify                   | 23294       | 0x5AFE | sub_12BAF10
nvvmGetCompiledResultSize  | nvvmGetCompiledResultPTXSize | 61451       | 0xF00B | sub_12BA560
nvvmGetCompiledResult      | nvvmGetCompiledResult        | 4111        | 0x100F | sub_12BA8F0
nvvmGetProgramLogSize      | nvvmGetCompiledResultSize    | 41856       | 0xA380 | sub_12BA220
nvvmGetProgramLog          | nvvmGetCompiledResultLog     | 46903       | 0xB737 | sub_12BA7C0
nvvmDestroyProgram         | nvvmDestroyCU                | 21257       | 0x5309 | sub_12B9C40

Note the naming confusion in the internal API: nvvmGetCompiledResultSize (ID 0xA380) returns the log size, while nvvmGetCompiledResultPTXSize (ID 0xF00B) returns the actual PTX output size. The public API resolves this with clearer names (nvvmGetProgramLogSize vs nvvmGetCompiledResultSize).

The internal-only API entries have no public equivalents:

Internal Name             | Dispatch ID | Hex    | Target      | Purpose
nvvmInit                  | 57005       | 0xDEAD | sub_12B9C00 | One-time initialization of LLVM infrastructure
nvvmGetVersion            | 51966       | 0xCAFE | sub_12B9A50 | Returns internal NVVM version tuple
nvvmGetErrorString        | 46967       | 0xB777 | sub_12B9980 | Maps nvvmResult code to human-readable string
nvvmSetOptionStrings      | 65261       | 0xFEED | sub_12B9AB0 | Bulk-loads LLVM CLI option table (37 entries)
nvvmCUSetExtraArgs        | 17185       | 0x4321 | sub_12BBD80 | Passes additional argc/argv to compilation
nvvmCUSetOption           | 8320        | 0x2080 | sub_12BB400 | Sets a single compilation option
nvvmCUSetProgressCallback | 3911        | 0x0F47 | sub_12BBF40 | Registers progress/cancellation callback
nvvmCURegisterCallback    | 48879       | 0xBEEF | sub_12BACF0 | Registers stage-boundary callback (verbose output)
nvvmCURegisterLNKCallback | 61453       | 0xF00D | sub_12BA6A0 | Registers LNK-stage-specific callback
nvvmCUGetLog              | 11245       | 0x2BED | sub_12BB290 | Alternative log retrieval interface
nvvmCUGetWarnings         | 45242       | 0xB0BA | sub_12BAB40 | Retrieves warning-only messages
nvvmCUGetIR               | 49522       | 0xC172 | sub_12BA470 | Retrieves intermediate LLVM IR after linking
nvvmCUGetOptIR            | 61806       | 0xF16E | sub_12BAA30 | Retrieves optimized IR (post-OPT stage); also used by -irversion
nvvmCULinkModule          | 4606        | 0x11FE | sub_12BA330 | Explicit module linking (separate from add-then-compile)
(unknown)                 | 56495       | 0xDCEF | sub_12B9A40 | Unknown (one byte smaller than nvvmGetVersion)
(alias)                   | 2167        | 0x0877 | sub_12BB090 | Alias for nvvmCreateCU (same target, different ID)

The nvvmCUGetOptIR function at sub_12BAA30 serves double duty: it is both the post-optimization IR retrieval API and the target of sub_12BC0E0 (a thunk called from sub_8F9C90 for the -irversion flag). When the user passes -irversion, the real main calls sub_12BC0E0 which dispatches to sub_12BAA30, which returns the IR version tuple as major * 100 + minor. This value is printed to stdout and the process exits immediately.

The sub_12BC0F0 Dispatch Mechanism

sub_12BC0F0 is a ~3 KB function at 0x12BC0F0 that implements a binary search tree over the 25 dispatch IDs. The function takes a single unsigned int argument (the ID) and returns a function pointer (void*). The tree is hardcoded as a series of comparison-and-branch instructions, not as a data-driven lookup table.

// Pseudocode for sub_12BC0F0(unsigned int id)
void* nvvm_dispatch(unsigned int id) {
    // Binary search over 25 IDs
    if (id < 17185) {
        if (id < 4660) {
            if (id == 2151 || id == 2167) return sub_12BB090;
            if (id == 3911) return sub_12BBF40;
            if (id == 4111) return sub_12BA8F0;
            if (id == 4606) return sub_12BA330;
        } else {
            if (id == 4660)  return sub_12BC650;
            if (id == 8320)  return sub_12BB400;
            if (id == 11245) return sub_12BB290;
        }
    } else {
        // ... upper half of the tree
        if (id == 48813) return sub_12BA110;   // 0xBEAD
        if (id == 65261) return sub_12B9AB0;   // 0xFEED
        // etc.
    }
    return NULL;  // unknown ID
}

The hex IDs are deliberately memorable patterns used as a form of internal documentation: 0xDEAD = init, 0xBEAD = compile, 0xBEEF = callback, 0xCAFE = version, 0xFEED = options, 0xF00D = LNK callback, 0xF00B = result size. The secondary ID 0x0877 (2167) is an alias for 0x0867 (2151) and dispatches to the same sub_12BB090 target, suggesting an internal API version migration where both old and new IDs must remain functional.

Dual-Path Initialization

The two compilation paths (Path A and Path B) use independent initialization sequences, creating a dual-path initialization architecture where the same underlying LLVM infrastructure is bootstrapped through different entry points. This is why two copies of libdevice, two LLVM options tables, and two sets of verbose callbacks exist.

Path A initialization (EDG → LibNVVM):
  sub_B6EEA0  — Creates LLVMContext + registers 42+ metadata kinds
                 (dbg=1, tbaa=2, prof=3, ... noalias.addrspace=42)
  sub_900130  — 39 KB CLI parser for Path A flags
  sub_905880  — EDG frontend produces LLVM module (880-byte object)
  sub_908850  — Binds module to target: data layout, triple, verification
  → sub_905EE0 enters LibNVVM pipeline with module

Path B initialization (Standalone):
  sub_1602D10 — Creates standalone LLVMContext (no EDG metadata assumptions)
  sub_125FB30 — 8 KB CLI parser for Path B flags
  sub_1265340 — Pre-compilation setup (configure output path, timer)
  → sub_1265970 enters LibNVVM pipeline with bitcode input

The version resolver sub_12B9F70 at 0x12B9F70 is shared between both paths and determines which NVVM IR compatibility mode to use. It reads two obfuscated environment variables in sequence:

// Pseudocode for sub_12B9F70(unsigned int sm_version)
int nvvm_version_resolve(unsigned int sm_version) {
    // Try NV_NVVM_VERSION first (decrypted from 0x3C23A90)
    char *env = getenv(decrypt("NV_NVVM_VERSION"));
    if (!env) {
        // Fallback: try LIBNVVM_NVVM_VERSION (decrypted from 0x42812F0)
        env = getenv(decrypt("LIBNVVM_NVVM_VERSION"));
    }
    if (env) {
        if (strcmp(env, "nvvm70") == 0)      return 0;  // Path B mode
        if (strcmp(env, "nvvm-latest") == 0)  return 1;  // Path A mode
    }
    // Default: SM >= 100 uses Path B, SM < 100 uses Path A
    return (sm_version > 99) ? 0 : 1;
}

This function is called from both sub_8F9C90 (the real main, for v253 resolution) and sub_12BB580 (inside the LibNVVM compilation unit initialization). The dual call-site ensures that the version mode is consistent regardless of whether the compiler was invoked via CLI or via the LibNVVM API.

The nvvmInit function (ID 0xDEAD, sub_12B9C00) performs one-time LLVM infrastructure initialization. It is called implicitly during nvvmCreateCU (sub_12BB090) via a pthread_once guard at dword_4F92D9C. The initialization includes:

  1. Registering LLVM target triples (nvptx64-nvidia-cuda, nvptx-nvidia-cuda)
  2. Initializing the NVPTX target machine factory
  3. Setting up the LLVM pass registry
  4. Configuring thread-safety based on LIBNVVM_DISABLE_CONCURRENT_API (byte_4F92D70)

When byte_4F92D70 == 1 (concurrent API disabled), the pipeline operates in single-threaded mode — no pthread_mutex locks are acquired around compilation unit operations, and Phase II concurrent optimization is disabled regardless of the module's function count.

Internal API Usage Sequence

The complete sequence of dispatch table calls during a standard Path A compilation (from sub_905EE0):

1.  sub_12BC0F0(2151)   → nvvmCreateCU(&handle)
    Creates compilation unit. Calls nvvmInit via pthread_once on first use.

2.  sub_12BC0F0(46967)  → nvvmGetErrorString
    Saved for later error message formatting.

3.  sub_12BC0F0(4660)   → nvvmCUAddModule(handle, IR_data, IR_size, NULL)
    Adds the user's LLVM bitcode module.

4.  sub_12BC0F0(21257)  → nvvmDestroyCU
    Saved as cleanup function pointer (not called yet).

5.  sub_12BCB00 [thunk]  → nvvmCUAddModuleFromBuffer(handle, buf, size, NULL)
    Called N times: once per additional module from extra args,
    once for libdevice (embedded or external).

6.  sub_12BC0F0(48879)  → nvvmCURegisterCallback
    Registers verbose stage callbacks:
      sub_903BA0 with ID 61453 (LNK stage)
      sub_903730 with ID 47710 (LLC stage)
    When -keep mode active, also registers:
      sub_9085A0 with ID 64222 (OPT output → .opt.bc file)
      sub_908220 with ID 56993 (LLC output → final file)

7.  sub_12BC0F0(65261)  → nvvmSetOptionStrings(opts_table, 37)
    Loads 37 LLVM backend configuration strings from off_4B90FE0.
    Calls sub_1C31130() internally to register/reset LLVM options.

8.  sub_12BC0F0(48813)  → nvvmCUCompile(handle, 57069)
    Main compilation. Phase code 57069 (0xDEED) triggers full
    LNK → OPT → [OPTIXIR] → LLC pipeline in sub_12C35D0.

9.  sub_12BC0F0(17185)  → nvvmCUSetExtraArgs(handle, argc, argv)
    Passes additional arguments collected from the CLI.

10. sub_12BC0F0(41856)  → nvvmGetCompiledResultSize(handle, &log_size)
    Queries the compilation log size.

11. sub_12BC0F0(46903)  → nvvmGetCompiledResultLog(handle, log_buf)
    Retrieves the compilation log (warnings/errors).

12. sub_12BC0F0(61451)  → nvvmGetCompiledResultPTXSize(handle, &ptx_size)
    Queries the PTX output size.

13. sub_12BC0F0(4111)   → nvvmGetCompiledResult(handle, ptx_buf)
    Copies the generated PTX into the caller's buffer.

14. sub_12BC0F0(21257)  → nvvmDestroyCU(&handle)
    Destroys the compilation unit, frees all internal resources.

Path B (sub_1265970) follows the identical sequence but uses off_4C6EEE0 for the options table (step 7), unk_420FD80 for the embedded libdevice (step 5), and appends "-nvvm-version=nvvm70" instead of "-nvvm-version=nvvm-latest" to the pipeline arguments.

nvvmResult Error Codes

The nvvmGetErrorString function (ID 0xB777, sub_12B9980) maps integer result codes from all API functions to descriptive strings:

Code | Constant                            | Description
0    | NVVM_SUCCESS                        | Operation completed successfully
1    | NVVM_ERROR_OUT_OF_MEMORY            | Memory allocation failed
2    | NVVM_ERROR_PROGRAM_CREATION_FAILURE | Failed to create compilation unit
3    | NVVM_ERROR_IR_VERSION_MISMATCH      | Incompatible NVVM IR version detected
4    | NVVM_ERROR_INVALID_INPUT            | Malformed input (bad bitcode, wrong magic)
5    | NVVM_ERROR_INVALID_PROGRAM          | Null or invalid compilation unit handle
6    | NVVM_ERROR_INVALID_IR               | IR failed verification
7    | NVVM_ERROR_INVALID_OPTION           | Unrecognized compilation option
8    | NVVM_ERROR_NO_MODULE_IN_PROGRAM     | Compilation unit has no modules added
9    | NVVM_ERROR_COMPILATION              | Compilation failed (linker, optimizer, or codegen error)
10   | NVVM_ERROR_CANCELLED                | Compilation cancelled by user callback

The pipeline orchestrator sub_12C35D0 maps its internal return codes to these: 0 → NVVM_SUCCESS, 7 → NVVM_ERROR_INVALID_OPTION, 9 → NVVM_ERROR_COMPILATION, 10 → NVVM_ERROR_CANCELLED, 100 → NVVM_ERROR_COMPILATION (post-pipeline verification failure).

37 LLVM Options from off_4B90FE0

Phase 2.10 loads a hardcoded table of 37 LLVM option strings from off_4B90FE0 (296 bytes = 37 pointers). These are static, compiled-in LLVM backend configuration flags that are injected into every compilation unit via nvvmSetOptionStrings (ID 0xFEED). The options include target architecture flags (-march=nvptx64, -mcpu=sm_XX), math precision controls (-nvptx-f32ftz, -nvptx-prec-sqrtf32=), optimization levels, debug info flags, and NVPTX-specific feature knobs. The sub_12B9AB0 target function calls sub_1C31130() — the LLVM option registration/reset function — to apply them.

Embedded Libdevice

A key design decision: two identical copies of the libdevice bitcode are statically embedded in the binary. Each is 455,876 bytes (~445 KB) of LLVM bitcode containing over 400 math functions (__nv_sin, __nv_cos, __nv_exp, __nv_log, __nv_sqrt, etc.) plus atomic operation helpers and FP16/BF16 conversion routines. The duplication exists because Path A and Path B have separate initialization sequences and the linker did not deduplicate the two .rodata copies.

When the user provides -nvvmir-library <path>, the external file is used instead. This allows overriding the built-in math library — useful for testing custom libdevice builds.

Path   | Address     | Size          | Purpose
Path A | unk_3EA0080 | 455,876 bytes | Default libdevice for LibNVVM mode
Path B | unk_420FD80 | 455,876 bytes | Default libdevice for standalone mode

Verbose Callbacks and Intermediate Files

Phase 2.9 registers callback functions that fire at pipeline stage boundaries. When verbose mode is active, these callbacks produce reconstructed command-line output for each stage:

[ "<src>" -lnk -nvvmir-library "<path>" "<input>" -o "<file>.lnk.bc" <opts> -nvvm-version=nvvm-latest ]
[ "<src>" -llc "<llc_path>" -o "<output>" <opts> -nvvm-version=nvvm-latest ]

The callback registration uses sub_12BC0F0(48879) (ID 0xBEEF = nvvmCURegisterCallback) with stage-specific callback IDs:

Callback   | ID    | Stage
sub_903BA0 | 61453 | LNK stage output
sub_903730 | 47710 | LLC stage output
sub_9085A0 | 64222 | OPT output (keep mode)
sub_908220 | 56993 | LLC output (keep mode)

Intermediate file paths (.lnk.bc for linked-but-unoptimized, .opt.bc for optimized-but-not-yet-codegen'd) are always constructed as strings, but the actual files are only written to disk when the -keep flag is active in wizard mode.

Path A Error Messages

All errors from sub_905EE0 are written to stderr via sub_223E0D0. Error categories:

Category    | Prefix             | Example
File I/O    | "<src>: "          | "error in open <file>", "input file <f> read error"
LibNVVM API | "libnvvm: error: " | "failed to create the libnvvm compilation unit"
Output      | "<src>: "          | "IO error: <system_error_msg>"
Fatal       | (none)             | "basic_string::append" (std::string overflow at 0x3FFFFFFFFFFFFFFF)

The error code from LibNVVM API calls maps to nvvmResult: 0 = success, 1 = out of memory, 4 = invalid input, 5 = invalid compilation unit (null handle).

Path B — Standalone cicc Pipeline (sub_1265970)

Path B is the standalone compilation path used when cicc is invoked with LLVM bitcode input (.bc files), by the LibNVVM API directly, or as the default for SM >= 100 architectures. Despite the different entry point, it shares the same underlying LLVM infrastructure as Path A — the difference is in how modules are loaded and how the pipeline stages are orchestrated. Path B appends -nvvm-version=nvvm70 to the optimizer arguments, indicating it targets the NVVM 7.0 IR specification (corresponding to LLVM 7.0.1 bitcode format, the version NVIDIA froze their IR compatibility at).

The 4-stage pipeline (LNK → OPT → OPTIXIR → LLC) runs in-memory: each stage takes an LLVM Module, transforms it, and passes it to the next stage. The OPTIXIR stage is optional and only active when --emit-optix-ir is specified. A user-provided cancellation callback can abort compilation between stages (return code 10).

Field          | Value
Address        | 0x1265970
Size           | ~48 KB (1,371 lines)
Timer          | "LibNVVM" (same name as Path A)
Version string | -nvvm-version=nvvm70

Path B Entry — sub_1262860

sub_1262860 (418 lines) is the command-line entry point for Path B, analogous to sub_902D10 for Path A. It parses CLI flags, initializes the compilation context, and calls sub_1265970 for the actual compilation.

Field      | Value
Address    | 0x1262860
Timer init | sub_1602D10 (standalone context, contrasted with Path A's sub_B6EEA0)
CLI parser | sub_125FB30 (Path B's equivalent of Path A's sub_900130)

The flow is: allocate timer handle → parse CLI via sub_125FB30 → configure output path → call sub_1265340 for pre-compilation setup → call sub_1265970 for compilation → write output. Output can go to stdout if the output path is "-", handled by sub_125C500. On failure: "\n Error processing command line: <details>".

Path B Compilation Orchestrator — sub_1265970

This 48 KB function mirrors sub_905EE0's role but with Path B's initialization and context. It handles both LibNVVM API invocations (when a11 = 1) and CLI invocations (when a11 = 0), with the same 14-phase structure as Path A but using Path B's context objects and the nvvm70 version string.

Key behavioral differences from Path A:

  1. Context initialization. Path B uses sub_1602D10 for context init (rather than sub_B6EEA0), which creates a standalone LLVM context without the EDG frontend's metadata registration assumptions.

  2. NVVM IR container handling. Container parsing is performed by sub_12642A0 (Path B's container parser) rather than sub_9047E0.

  3. Embedded libdevice address. Uses unk_420FD80 (the second copy) rather than unk_3EA0080.

  4. LLVM options table. Loads 37 options from off_4C6EEE0 (Path B's copy) rather than off_4B90FE0.

  5. Verbose callbacks. Registers sub_1263280 (ID 61453) and sub_12636E0 (ID 47710) for LNK and OPT stage output respectively, and sub_1268040/sub_1267CC0 for keep-mode output.

  6. Version string. Always appends "-nvvm-version=nvvm70" rather than "-nvvm-version=nvvm-latest".

4-Stage Pipeline Orchestrator — sub_12C35D0

The orchestrator creates two backend objects — nvopt (512 bytes, the optimizer) and nvllc (480 bytes, the code generator) — and wires them together with the stage dispatch structure. Each stage is controlled by a bit in a stage bitmask derived from sub_12D2AA0, which parses architecture and options into per-stage configuration.

Field           | Value
Address         | 0x12C35D0
Size            | 41 KB (1,446 lines)
Backend objects | nvopt (512 bytes) + nvllc (480 bytes)

Stage   | Bit  | Timer String                            | Core Function
LNK     | 0x01 | "LNK" / "LibNVVM module linking step."  | sub_12C06E0 (63 KB, module linker)
OPT     | 0x80 | "OPT" / "LibNVVM optimization step."    | sub_12E7E70 (full LLVM pipeline)
OPTIXIR | 0x40 | "OPTIXIR" / "LibNVVM Optix IR step."    | sub_12F9270 (OptiX IR gen)
LLC     | 0x04 | "LLC" / "LibNVVM code-generation step." | sub_12F5100 (SelectionDAG codegen)

Pipeline stage bitmask (from sub_12D2AA0): bit 0=LNK, bit 2=LLC, bit 5=verify, bit 6=OPTIXIR, bit 7=OPT.

Return codes: 0=success, 7=parse failure, 9=link/layout/verification error, 10=cancelled, 100=post-pipeline verification failure.

Backend Object Initialization

The orchestrator allocates and initializes two backend objects with distinct vtables:

// nvllc — code generator backend (480 bytes)
v8 = sub_22077B0(480);
sub_12EC960(v8, "nvllc", 5);
v8->vtable = &unk_49E7FF0;

// nvopt — optimizer backend (512 bytes)
v10 = sub_22077B0(512);
sub_12EC960(v10, "nvopt", 5);
v10->vtable = &unk_49E6A58;
v10->sub_vtable = &unk_49E6B20;    // at offset +60*8
v10->plugin_slots[0..2] = 0;       // offsets 61-63 cleared

A stage dispatch structure (vtable &unk_49E6B38) links the OPT output to the LLC input and stores the cancellation callback pointer.

Cancellation Callback

Between every pipeline stage, the orchestrator checks an optional user-provided cancellation callback stored at state[26]:

cancellation_fn = state[26];
if (cancellation_fn && cancellation_fn(state[27], 0))
    return 10;   // CANCELLED

This mechanism allows the LibNVVM API caller to abort a long-running compilation. Return code 10 propagates up through the entire call chain, causing sub_8F9C90 to return 10 as the process exit code.

Two-Phase Optimization (OPT Stage)

The OPT stage calls sub_12E7E70, which implements a two-phase optimization protocol. Both phases call the same underlying pipeline function sub_12E54A0, but a TLS variable qword_4FBB3B0 is set to 1 or 2 to indicate which phase is active (and to 3 once the module is done):

Phase    | TLS value | Purpose
Phase I  | 1         | Analysis + early IR optimization (module-level, CGSCC, function passes)
Phase II | 2         | Backend optimization + codegen preparation (lowering, legalization)
Complete | 3         | Compilation finished for this module

Between phases, sub_12D4250 checks concurrency eligibility: if the module contains more than one defined function (non-declaration), and the options permit it, Phase II can run with multiple threads. Thread count is determined from opts[1026] or falls back to get_nprocs(). When concurrency is enabled, sub_12E7B90 is the concurrent worker entry point.

For single-function modules, the optimizer skips the two-phase protocol entirely and runs a single un-phased call to sub_12E54A0 — no phase counter is set, and the optimizer executes both analysis and backend passes in one invocation.

Data Layout Validation

After the LLC stage but before returning, the orchestrator validates the module's data layout string. If the module has no data layout:

"DataLayoutError: Data Layout string is empty"
→ return 9

On layout mismatch, it produces a detailed diagnostic:

"<error details>\nExample valid data layout:\n64-bit: <reference_layout>"

The reference layout string is loaded from off_4CD4948[0].

Module Linker — sub_12C06E0

The LNK stage's core function (63KB) links multiple LLVM bitcode modules into a single module. This is where user code gets linked with the libdevice math library and any additional modules. The linker performs several validation steps to catch incompatible IR early — before the expensive optimization and codegen stages:

  • Bitcode magic validation: checks for 0x42,0x43,0xC0,0xDE ('BC' 0xC0 0xDE, the raw LLVM bitcode magic) or 0xDE,0xC0,0x17,0x0B (the little-endian byte order of the 0x0B17C0DE bitcode wrapper magic). Anything else → error code 9.
  • Triple validation: every module's target triple must start with "nvptx64-". Modules without a triple get a clear error: "Module does not contain a triple, should be 'nvptx64-'".
  • IR version compatibility: sub_12BFF60 reads "nvvmir.version" metadata (2 or 4 element tuples: major.minor or major.minor.debug_major.debug_minor). The NVVM_IR_VER_CHK environment variable can disable this check entirely (set to "0"), useful when mixing IR from different CUDA toolkit versions.
  • Symbol size matching: for multi-module linking, compares the byte sizes of identically-named globals across modules. Size computation uses type codes (1=half(16b), 2=float(32b), 3=double(64b), 7=ptr, 0xB=integer, 0xD=struct, 0xE=array). A mismatch produces: "Size does not match for <sym> in <mod> with size X specified in <other> with size Y."

Single-module fast path: When only one module is present (after adding user code and libdevice), the linker returns it directly via sub_1C3DFC0 without invoking the full linking machinery.

Multi-module linking: For N > 1 modules, the linker copies the primary module's target triple to all secondary modules, then calls sub_12F5610 to perform the LLVM link. After user modules are linked, builtin modules (from a1[3..4]) are linked via sub_1CCEBE0, followed by target feature configuration via sub_1CB9110 and sub_1619140.

NVVM IR Version Checker — sub_12BFF60

The version checker reads "nvvmir.version" named metadata and validates it against the compiler's expected version range.

Field          | Value
Address        | 0x12BFF60
Size           | ~9 KB (362 lines)
Metadata key   | "nvvmir.version"
Debug metadata | "llvm.dbg.cu"

Version tuples come in two forms:

  • 2-element: (major, minor) — IR version only. Special case: (2, 0) always passes.
  • 4-element: (major, minor, debug_major, debug_minor) — IR version plus debug info version. Special case: debug_major == 3, debug_minor <= 2 always passes.

The NVVM_IR_VER_CHK environment variable is checked multiple times throughout the validation. When set to "0", all version checks are bypassed, returning 0 (compatible). This is a critical escape hatch for mixing bitcode from different CUDA toolkit versions.

Memory Management

jemalloc — The Global Allocator

cicc statically links a jemalloc 5.x allocator in the address range 0x12FC0000x131FFFF (~400 functions). This replaces the system malloc/free entirely. The jemalloc configuration parser (sub_12FCDB0, 131,600 bytes — the largest single function in this range) handles the MALLOC_CONF environment variable and /etc/malloc.conf symlink, supporting dozens of tuning options: abort, cache_oblivious, metadata_thp, trust_madvise, retain, dss, tcache, narenas, percpu_arena, background_thread, san_guard_small, san_guard_large, and more.

The choice of jemalloc over glibc's allocator is significant for compiler workloads. jemalloc's thread-local caching (tcache) and arena-per-CPU design (percpu_arena) reduce contention during the concurrent Phase II optimization, where multiple threads may be simultaneously allocating and freeing IR nodes, instruction objects, and analysis results.

The jemalloc stats subsystem (functions at 0x4000000x42FFFF) provides comprehensive per-arena statistics including allocation counts, active/dirty/muzzy page tracking, mutex contention metrics, and HPA hugify counts. These can be triggered via MALLOC_CONF="stats_print:true".

EDG Memory Regions — sub_822260

The EDG 6.6 frontend uses a custom memory region system configured with USE_MMAP_FOR_MEMORY_REGIONS = 1. During post-parse validation in sub_617BD0 (lgenfe_main), sub_822260() is called 11 times to initialize memory regions 1 through 11. These regions serve as arena-style allocators for different categories of EDG internal data:

  • Token buffers (preprocessor token storage)
  • IL node pools (intermediate language tree nodes)
  • Symbol tables (name→declaration mappings)
  • Type representations (structural type information)

The mmap-backed regions grow by mapping additional pages on demand, avoiding the fragmentation problems that would occur with individual malloc calls for the millions of small, short-lived objects the frontend creates during parsing. Region cleanup happens in bulk when the frontend completes — all pages for a region are unmapped at once rather than individually freed.

The EDG heap allocator cluster at 0x8210000x823FFF includes tracked allocation (sub_822B10/sub_822B90) with a 1024-entry inline tracking array (unk_4F19620, 1024 * 24 bytes) that overflows to heap when exceeded. The tracking count is maintained in dword_4F19600. The finalization function sub_823310 walks bucket chains to free all tracked allocations.

Large Argument Lists

The argv copy in sub_8F9C90 uses a threshold-based allocation strategy:

if (8 * argc <= 0x800)   // argc <= 256
    v284 = stack_buffer;  // 2096 bytes on stack
else
    v284 = sub_16CD150(8 * argc);  // heap allocation

This avoids heap allocation for the common case (most cicc invocations have fewer than 256 arguments) while handling the worst case gracefully. The heap path uses sub_16CD150 (a realloc-like wrapper), and the buffer is freed during cleanup if it was heap-allocated.

Signal Handling and Crash Recovery

EDG Signal Handler

The EDG frontend registers a signal handler at 0x723610 during initialization:

// signal handler (0x723610)
void handler(int sig) {
    write(STDERR_FILENO, "\n", 1);
    dword_4F0790C = 1;    // set "interrupted" flag
    sub_7235F0(9);         // initiate orderly shutdown
}

This handler is registered for SIGINT, allowing the compiler to be interrupted gracefully during long frontend operations (template instantiation, constexpr evaluation). The global dword_4F0790C flag is checked periodically by the parser loop, enabling cooperative cancellation.

LLVM Crash Recovery

The LLVM infrastructure provides its own crash handling via the print-on-crash and print-on-crash-path CLI options (registered in the 0x4F00000x51FFFF range). When enabled, the LLVM pass manager dumps the current IR to a specified path on any unhandled signal (SIGSEGV, SIGABRT, etc.). This is separate from the EDG handler and covers the optimization and codegen phases.

Concurrent API Protection

The global constructor at 0x4A5810 checks LIBNVVM_DISABLE_CONCURRENT_API. When set (to any value), byte_4F92D70 = 1 disables thread-safe LibNVVM API usage. The pipeline orchestrator (sub_12C35D0) uses pthread_once(&dword_4F92D9C, init_routine) for one-time setup, and TLS at __readfsqword(0)-24 stores exception handling stack frames while __readfsqword(0)-32 stores the cleanup function sub_12BCC20. These TLS slots ensure that concurrent compilations in the same process do not corrupt each other's state.

Timer Infrastructure

Compilation timing is implemented through a hierarchical timer system. Timer creation (sub_C996C0) takes a label and context string; timer stop (sub_C9AF60) records the elapsed time. The timer hierarchy is:

"CUDA C++ Front-End"     ← EDG parsing + IL-to-IR conversion (Path A only)
  └─ "LibNVVM"           ← Full optimization + codegen pipeline
       ├─ "LNK"          ← Module linking (sub_12C06E0)
       ├─ "OPT"          ← LLVM optimization (sub_12E7E70)
       │    ├─ "Phase I"  ← Analysis + early optimization
       │    └─ "Phase II" ← Backend optimization + codegen prep
       ├─ "OPTIXIR"      ← OptiX IR generation (optional)
       └─ "LLC"          ← SelectionDAG codegen (sub_12F5100)

The profiler is controlled by sub_C96F30() (returns nonzero when active). Timer data is written to the output file after compilation via sub_C9C600 (Path A) or sub_16DD960 (Path B). The -time flag or environment variable controls activation. The timer names appear in the profiler output, making them essential for identifying compilation bottlenecks.

Architecture Detection — sub_95EB40

One of the most important functions in cicc: the architecture detection system translates a single user-facing flag like -arch=compute_90a into three independent flag strings, one for each pipeline stage. This 3-column fan-out is necessary because the EDG frontend, the LLVM optimizer, and the LLVM backend each use different flag formats to specify the target architecture. The mapping is stored in a std::map<string, ArchTriple> (a red-black tree) rooted at a1+248.

| Column | Target | Example |
|---|---|---|
| Column 1 | EDG frontend | -R __CUDA_ARCH=750 |
| Column 2 | Optimizer | -opt-arch=sm_75 |
| Column 3 | LLC backend | -mcpu=sm_75 |

Architecture Validation Bitmask

Before the 3-column mapping is consulted, the architecture number is validated against a hardcoded 64-bit bitmask. This is a fast rejection filter: the SM number minus 75 gives a bit index, and if that bit isn't set in the constant 0x60081200F821, the architecture is rejected. This means cicc v13.0 has a fixed, compile-time-determined set of supported architectures — you cannot add new SM targets without rebuilding the binary.

offset = arch_number - 75;
if (offset > 0x2E || !_bittest64(&0x60081200F821, offset))
    → ERROR: "is an unsupported option"

Valid architectures (bit positions in 0x60081200F821). Note the gaps — SM 76–79, 81–85, 91–99, 101–102, 104–109, and 111–119 are all absent:

| Bit | SM | Generation |
|---|---|---|
| 0 | 75 | Turing |
| 5 | 80 | Ampere |
| 11 | 86 | Ampere |
| 12 | 87 | Ampere (Jetson Orin) |
| 13 | 88 | Ada (undocumented) |
| 14 | 89 | Ada Lovelace |
| 15 | 90 | Hopper |
| 25 | 100 | Blackwell |
| 28 | 103 | Blackwell |
| 35 | 110 | Jetson Thor |
| 45 | 120 | Blackwell (sm120) — RTX 50xx / Pro |
| 46 | 121 | Blackwell (sm120) — DGX Spark |

Suffix handling: a and f variants share the base SM number for validation but get distinct -mcpu=sm_XXa/-mcpu=sm_XXf strings.

Architecture Parsing in the EDG Frontend

The EDG frontend (sub_617BD0, option ID 0x52 = --nv_arch) performs its own independent architecture parsing that produces three global variables:

| Global | Address | Purpose |
|---|---|---|
| unk_4D045E8 | 0x4D045E8 | SM compute version (integer: 75, 80, ..., 121) |
| unk_4D045E4 | 0x4D045E4 | Accelerated flag (1 if suffix a) |
| unk_4D045E0 | 0x4D045E0 | Fast flag (1 if suffix f; also sets accelerated=1) |

The f suffix (fast-mode) is new to SM >= 100 architectures. When present, it implies a forward-compatible feature set that may not exactly match the base SM version's capabilities.

Flag Catalog — sub_9624D0

The flag catalog is the second-largest function in the entry point range at 75KB. It takes the raw CLI arguments and sorts them into four output vectors — one per pipeline stage (lnk, opt, lto, llc). This is the translation layer between user-facing flags and the internal per-stage options that each pipeline component understands.

A clever detail: the function takes a "mode cookie" parameter (a4) that distinguishes CUDA compilation (0xABBA) from OpenCL compilation (0xDEED). Several flags behave differently depending on this cookie — for example, -prec-div=0 maps to -nvptx-prec-divf32=1 in CUDA mode but -nvptx-prec-divf32=0 in OpenCL mode, reflecting the different default precision expectations of the two languages.

| Field | Value |
|---|---|
| Address | 0x9624D0 |
| Size | 75KB (2,626 lines) |
| Mode cookie | a4: 0xABBA=CUDA, 0xDEED=OpenCL |
| Output vectors | lnk, opt, lto, llc (32-byte std::string elements with SSO) |

-Ofast-compile Levels

NVIDIA's -Ofast-compile is a compile-time vs runtime-performance tradeoff. At "max" level, it disables memory space optimization and LSA optimization entirely — these are expensive analysis passes that improve runtime performance but slow compilation significantly. The "mid" and "min" levels provide intermediate points. This feature is targeted at iterative development workflows where compile speed matters more than code quality.

| Level String | Internal Value | Effect |
|---|---|---|
| "max" | 2 | Most optimizations skipped, forces -lsa-opt=0 -memory-space-opt=0 |
| "mid" | 3 | Medium speedup |
| "min" | 4 | Minimal speedup |
| "0" | 1 → reset to 0 | Disabled |

Error: "libnvvm : error: -Ofast-compile specified more than once". Only one -Ofast-compile per compilation is allowed.

Flag-to-Pipeline Routing (Selected)

This table shows how a single user-facing flag gets split into per-stage options. The pattern reveals NVIDIA's compilation architecture: the LNK stage communicates via -R macro definitions (these become #defines visible to the linker), the OPT stage uses NVIDIA-specific optimizer flags (-opt-use-*), and the LLC stage uses LLVM backend flags (-nvptx-*). Some flags like -ftz=1 propagate to all three stages, while others like -aggressive-inline only affect the optimizer.

| User Flag | LNK Forward | OPT Forward | LLC Forward |
|---|---|---|---|
| -ftz=1 | -R __CUDA_FTZ=1 | -nvptx-f32ftz | -nvptx-f32ftz |
| -prec-div=1 (CUDA) | -R __CUDA_PREC_DIV=1 | -opt-use-prec-div=true | -nvptx-prec-divf32=2 |
| -prec-div=0 (CUDA) | (none) | -opt-use-prec-div=false | -nvptx-prec-divf32=1 |
| -prec-sqrt=1 | -R __CUDA_PREC_SQRT=1 | (none) | -nvptx-prec-sqrtf32=1 |
| -fma=1 | (none) | (none) | -nvptx-fma-level=1 |
| -fast-math (CUDA) | -R __CUDA_USE_FAST_MATH=1 | -opt-use-fast-math | (none) |
| -unsafe-math | -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 | -opt-use-fast-math -nvptx-f32ftz | -nvptx-fma-level=1 -nvptx-f32ftz |
| -aggressive-inline | (none) | -inline-budget=40000 | (none) |
| -new-nvvm-remat | (none) | -enable-new-nvvm-remat=true -nv-disable-remat=true -rp-aware-mcse=true | (none) |

nvcc→cicc Flag Translation — sub_8FE280

When cicc is invoked by nvcc (the CUDA compiler driver), the flags arrive in nvcc's format and need to be translated to cicc's internal format. This translation happens through a red-black tree at qword_4F6D2A0, populated once on first use (guarded by qword_4F6D2C8). Each entry maps an nvcc flag to a pair: an EDG passthrough string and a cicc internal string. Some flags only affect one side — for example, -fmad=1 has no EDG equivalent (FMA is a backend concern) but maps to cicc's -fma=1. Others are dual-mapped: -O0 becomes both --device-O=0 for EDG and -opt=0 for cicc.

| nvcc Flag | EDG Passthrough | cicc Internal |
|---|---|---|
| -O0..-O3 | --device-O=N | -opt=N |
| -fmad=1 | (none) | -fma=1 |
| -prec_sqrt=1 | (none) | -prec-sqrt=1 |
| -Ofast-compile=max | (none) | -Ofast-compile=max |
| -Ofc=max | (none) | -Ofast-compile=max (alias) |
| --emit-optix-ir | --emit-lifetime-intrinsics | --emit-optix-ir |
| -discard-value-names | --discard_value_names=1 | -discard-value-names=1 |

Environment Variables

cicc checks 20 distinct environment variables across its subsystems. The six NVIDIA-specific variables are the most important for understanding and reimplementing the entry point behavior:

| Variable | Function | Effect |
|---|---|---|
| NVVMCCWIZ | sub_8F9C90 | Set to 553282 → enables wizard mode (byte_4F6D280 = 1) |
| NVVM_IR_VER_CHK | sub_12BFF60 | Set to "0" → disables NVVM IR version checking |
| LIBNVVM_DISABLE_CONCURRENT_API | ctor at 0x4A5810 | Any value → disables thread-safe API (byte_4F92D70 = 1) |
| NV_NVVM_VERSION | sub_8F9C90, sub_12B9F70 | "nvvm70" or "nvvm-latest" → controls Path A/B default and IR compat mode |
| LIBNVVM_NVVM_VERSION | sub_12B9F70 | Same as NV_NVVM_VERSION (checked as fallback) |
| LLVM_OVERRIDE_PRODUCER | ctors at 0x48CC90, 0x4CE640 | Overrides the producer string in output bitcode metadata |

The NV_NVVM_VERSION and LIBNVVM_NVVM_VERSION variables are obfuscated in the binary using the same XOR+ROT13 cipher as the CLI option strings. They are decrypted from 0x3C23A90 and 0x42812F0 respectively.

Key Global Variables

These globals persist across the entire compilation and are accessed from multiple subsystems. The wizard mode flag and flag mapping tree are set during CLI parsing and read throughout the pipeline. The embedded libdevice addresses are compile-time constants (.rodata), while the data model width is set during architecture configuration.

| Variable | Purpose |
|---|---|
| byte_4F6D280 | Wizard mode flag (gates -v, -keep) |
| qword_4F6D2A0 | Flag mapping red-black tree root |
| qword_4F6D2C8 | Tree initialization guard |
| byte_4F6D2D0 | --partial-link active flag |
| byte_4F6D2DC | --force-llp64 active flag |
| unk_3EA0080 | Embedded libdevice bitcode (Path A, 455,876 bytes) |
| unk_420FD80 | Embedded libdevice bitcode (Path B, 455,876 bytes) |
| off_4B90FE0 | LLVM options table (Path A, 37 entries) |
| off_4C6EEE0 | LLVM options table (Path B, 37 entries) |
| unk_4F06A68 | Data model width (8=64-bit, 4=32-bit) |
| unk_4D0461C | Enable p3:32:32:32 in data layout (shared mem 32-bit ptrs) |
| byte_4F92D70 | Concurrent API disabled flag |
| dword_4F92D9C | pthread_once guard for one-time pipeline setup |
| qword_4FBB3B0 | TLS: optimization phase counter (1=Phase I, 2=Phase II, 3=done) |
| unk_4F6D2F8 | Global module pointer (set by sub_908850 after EDG binding) |

Function Map — Entry Point Cluster

| Function | Address | Size |
|---|---|---|
| main() thunk → sub_8F9C90 | 0x4396A0 | 16 B |
| String deobfuscation (XOR + ROT13) | 0x8F98A0 | ~512 B |
| Push string to std::vector&lt;std::string&gt; | 0x8F9C20 | ~128 B |
| Real main — CLI parser + dispatcher | 0x8F9C90 | 10,066 B |
| nvcc→cicc flag translation (red-black tree) | 0x8FE280 | ~4 KB |
| Path A CLI processing | 0x900130 | 39 KB |
| Path A orchestrator (simple mode) | 0x902D10 | ~9 KB |
| LLC stage verbose callback | 0x903730 | ~5 KB |
| LNK stage verbose callback | 0x903BA0 | ~5 KB |
| NVVM IR container parser (Path A) | 0x9047E0 | 10 KB |
| CUDA C++ Front-End (lgenfe stage) | 0x905880 | ~6 KB |
| lgenfe single-stage wrapper (Path A) | 0x905E50 | ~256 B |
| LibNVVM pipeline driver (Path A) | 0x905EE0 | 43 KB |
| Backend SM config + EDG module binding | 0x908850 | 10 KB |
| Architecture detection (3-column fan-out) | 0x95EB40 | 38 KB |
| Flag catalog (4 output vectors) | 0x9624D0 | 75 KB |
| Pipeline option parser (4 stage vectors) | 0x9685E0 | ~8 KB |
| Path B CLI processing | 0x125FB30 | ~8 KB |
| Path B entry (simple mode) | 0x1262860 | ~4 KB |
| Path B LNK verbose callback | 0x1263280 | ~1 KB |
| Path B OPT verbose callback | 0x12636E0 | ~1 KB |
| NVVM container parser (Path B) | 0x12642A0 | ~3 KB |
| Path B pre-compilation setup | 0x1265340 | ~4 KB |
| lgenfe single-stage wrapper (Path B) | 0x12658E0 | ~256 B |
| LibNVVM compilation entry (Path B) | 0x1265970 | 48 KB |
| LibNVVM API dispatch table (25 entries) | 0x12BC0F0 | ~3 KB |
| Thunk → sub_12BC8B0 (nvvmCUAddModuleFromBuffer) | 0x12BCB00 | ~64 B |
| NVVM IR version checker | 0x12BFF60 | ~9 KB |
| Module linker (LNK stage core) | 0x12C06E0 | 63 KB |
| 4-stage pipeline orchestrator | 0x12C35D0 | 41 KB |
| Stage bitmask parser | 0x12D2AA0 | ~4 KB |
| Concurrency eligibility check | 0x12D4250 | ~2 KB |
| Two-phase optimizer entry | 0x12E7E70 | ~8 KB |
| Concurrent worker entry point | 0x12E7B90 | ~4 KB |
| LLC core (SelectionDAG codegen) | 0x12F5100 | ~12 KB |
| OptiX IR generator | 0x12F9270 | ~6 KB |
| Path B context initialization | 0x1602D10 | ~2 KB |

Cross-References

  • EDG Frontendsub_617BD0 (lgenfe_main), the 282-case CLI dispatch inside the EDG 6.6 frontend
  • NVVM Container Format — Container parsing by sub_9047E0 (Path A) and sub_12642A0 (Path B)
  • Optimizer Pipeline — The OPT stage driven by sub_12E7E70 (two-phase optimization)
  • IR Generation — Module creation via sub_908850 (EDG module binding)
  • PTX Emission — The LLC stage's PTX output via sub_12F5100

nvcc-to-cicc Interface Contract

When nvcc compiles device code, it invokes cicc as an external process, passing the preprocessed CUDA source (or LLVM bitcode) along with a carefully translated set of flags. cicc never sees the raw -fmad=1 or -prec_sqrt=0 flags that the user typed on the nvcc command line -- those are rewritten through a flag translation table: a global std::map (red-black tree) rooted at qword_4F6D2A0 and populated by sub_8FE280. This page documents the complete interface contract: how nvcc invokes cicc, how flags are translated, how the mode cookie selects CUDA vs. OpenCL behavior, what input formats are accepted, and what output modes are available.

The flag translation is split into two stages. Stage 1 (sub_8FE280) translates nvcc-facing flags into cicc-facing flags, producing a dual-slot result with an EDG front-end flag and an internal cicc flag. Stage 2 (the flag catalog, sub_9624D0) then routes each cicc-facing flag into per-stage option vectors for the LNK, OPT, LTO, and LLC pipeline phases. The composition of these two stages means a single nvcc flag like -fmad=1 silently becomes -fma=1 internally and ultimately reaches only the LLC stage, as -nvptx-fma-level=1; EDG and OPT receive nothing for it (the --emit-llvm-bc injection into the EDG arguments is unconditional and independent of any user flag).

| Field | Value |
|---|---|
| Flag translation tree | sub_8FE280 -- global std::map at qword_4F6D2A0, 40+ entries |
| Tree guard | qword_4F6D2C8 (set to 1 after first initialization) |
| Tree node size | 72+ bytes: key at +32, length at +40, FlagPair* at +64 |
| CLI parser (Path A) | sub_900130 (39 KB, 12 parameters) |
| Flag catalog (Path A/B) | sub_9624D0 (75 KB, 2,626 lines, 4 output vectors) |
| 3-column arch table | sub_95EB40 (38 KB, 23 architectures, 3-column fan-out) |
| Mode cookies | 0xABBA = CUDA, 0xDEED = OpenCL |
| Default architecture | compute_75 / sm_75 (Turing) |
| Input extensions | .bc, .ci, .i, .cup, .optixir, .ii |
| Default opt level | -opt=3 (O3) |

Invocation Contract

nvcc invokes cicc as a subprocess with a single input file and a set of translated flags. The general invocation form is:

cicc [mode-flags] [translated-flags] [pass-through-flags] -o <output> <input>

For the standard CUDA compilation path (no explicit -lXXX mode flag), cicc enters sub_8F9C90 (real main, 10,066 bytes at 0x8F9C90), parses all arguments into ~12 local variables, resolves the Path A / Path B dispatch variable v253, and calls one of:

  • Path A (EDG pipeline): sub_902D10 -- invokes sub_900130 for CLI parsing, then the EDG frontend via sub_905880, then the LibNVVM pipeline via sub_905EE0.
  • Path B (standalone LLVM pipeline): sub_1262860 -- similar flow but through standalone LLVM infrastructure at 0x1262860.

Path selection is controlled by v253, which defaults to 2 (unresolved) and is resolved through the obfuscated environment variable NV_NVVM_VERSION. For SM >= 100 (Blackwell and later), the default is Path B unless the -nvc flag is present. For SM < 100, the default is Path A. See Entry Point for the full dispatch matrix.

When cicc is invoked in multi-stage mode (-lnk, -opt, -llc, -libnvvm), the entry point dispatches to sub_905EE0 (Path A, 43 KB) or sub_1265970 (Path B, 48 KB), which orchestrate the LNK, OPT, and LLC sub-pipelines internally.

Parameter Passing to sub_900130

The Path A CLI parser sub_900130 receives 12 parameters and performs a two-pass argument scan:

unsigned int sub_900130(
    const char *input_file,    // a1: input filename
    const char *opencl_src,    // a2: OpenCL source path (NULL for CUDA)
    const char *output_file,   // a3: output filename
    __int64    *arg_vector,    // a4: pointer to std::vector<std::string>
    char        mode_flag,     // a5: mode flag (0=normal, 1=special)
    __int64     job_desc,      // a6: output compilation job struct
    __int64     error_out,     // a7: error string output
    _BYTE      *m64_flag,     // a8: output - set to 1 if -m64 seen
    _BYTE      *discard_names, // a9: output - set to 1 if -discard-value-names
    __int64     trace_path,    // a10: device time trace path
    __int64     trace_pid,     // a11: trace PID
    __int64     trace_env      // a12: trace env value
);
// Returns: 0 = success, 1 = error

Pass 1: Scans for -arch flag via sub_8FD0D0, extracts architecture string.

Pass 2: Iterates all arguments, looking each up in the red-black tree at qword_4F6D2A0. For tree hits, the EDG slot is pushed to the EDG argument vector (v145) and the cicc slot is pushed to the backend argument vector (v148). For tree misses, sequential string comparisons handle extended flags (-maxreg=N, -split-compile=N, --Xlgenfe, --Xlibnvvm, etc.).

Before any user flags, sub_900130 unconditionally injects:

  • --emit-llvm-bc into the EDG argument vector
  • --emit-nvvm-latest into the backend argument vector

After all arguments are processed, architecture strings are appended:

  • --nv_arch + sm_XX to EDG arguments
  • -arch=compute_XX to backend arguments

Mode Cookies

The sub_9624D0 flag catalog function takes a fourth parameter a4 that selects the language mode. This is not a user-visible flag -- it is passed internally by the pipeline orchestrator.

| Cookie (hex) | Decimal | Language |
|---|---|---|
| 0xABBA | 43,962 | CUDA compilation |
| 0xDEED | 57,069 | OpenCL compilation |

The cookie affects multiple behaviors:

Precision division routing. In CUDA mode (0xABBA), -prec-div=0 maps to -nvptx-prec-divf32=1 (not 0) at LLC, while -prec-div=1 maps to -nvptx-prec-divf32=2. In OpenCL mode (0xDEED), the mapping is straightforward: -prec-div=0 maps to -nvptx-prec-divf32=0, -prec-div=1 to -nvptx-prec-divf32=1, and OpenCL additionally supports -prec-div=2 mapping to -nvptx-prec-divf32=3.

Fast-math routing. In CUDA mode, -fast-math maps to -R __CUDA_USE_FAST_MATH=1 for EDG and -opt-use-fast-math for OPT, with no LLC flag. In OpenCL mode, -fast-math maps to -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 for EDG and -opt-use-fast-math -nvptx-f32ftz for OPT.

Default precision. -prec-sqrt defaults to 1 (precise) in CUDA mode, 0 (imprecise) in OpenCL mode.

Discard value names. In CUDA mode (0xABBA), without explicit override, value names are discarded by default (a1+232 = 1), generating -lnk-discard-value-names=1, -opt-discard-value-names=1, and -lto-discard-value-names=1. In OpenCL mode (0xDEED), this only applies when (a13 & 0x20) is set (LTO generation active).

OptiX IR emission. The --emit-optix-ir flag is only valid when the cookie is 0xABBA or 0xDEED.

Internal compile call. The LibNVVM compile function nvvmCUCompile (dispatch ID 0xBEAD) is called with phase code 57,069 (0xDEED) regardless of the outer cookie -- this is the internal LibNVVM compile phase code, not a language selector.

Flag Translation Table

sub_8FE280 populates a global std::map<std::string, FlagPair*> in the red-black tree at qword_4F6D2A0. Each FlagPair is a 16-byte struct with two slots: slot 0 for the EDG frontend passthrough, slot 1 for the internal cicc flag. The function is called exactly once, guarded by qword_4F6D2C8.

Red-Black Tree Structure

qword_4F6D2A0  -- tree root pointer (std::_Rb_tree)
dword_4F6D2A8  -- sentinel node (tree.end())
qword_4F6D2B0  -- root node pointer
qword_4F6D2B8  -- begin iterator (leftmost node)
qword_4F6D2C8  -- initialization guard (1 = already built)

Each node is 72+ bytes:

| Offset | Field |
|---|---|
| +0 | Color (0=red, 1=black) |
| +8 | Parent pointer |
| +16 | Left child pointer |
| +24 | Right child pointer |
| +32 | Key data pointer (std::string internals) |
| +40 | Key length |
| +48 | Key capacity |
| +64 | Value pointer (FlagPair*) |

Lookup is via sub_8FE150 (lower_bound + insert-if-not-found). Insert is via sub_8FDFD0 (allocate node + rebalance). Comparison uses standard std::string::compare.

Complete nvcc-to-cicc Mapping

The table below shows every entry in the sub_8FE280 red-black tree. Slot 0 is forwarded to the EDG frontend; slot 1 is forwarded to the cicc backend pipeline. <null> means no flag is generated for that slot.

| nvcc flag | EDG passthrough (slot 0) | cicc internal (slot 1) | Notes |
|---|---|---|---|
| -m32 | --m32 | &lt;null&gt; | |
| -m64 | --m64 | &lt;null&gt; | Also sets *a8 = 1 |
| -fast-math | &lt;null&gt; | -fast-math | |
| -ftz=1 | &lt;null&gt; | -ftz=1 | |
| -ftz=0 | &lt;null&gt; | -ftz=0 | |
| -prec_sqrt=1 | &lt;null&gt; | -prec-sqrt=1 | Underscore to hyphen |
| -prec_sqrt=0 | &lt;null&gt; | -prec-sqrt=0 | Underscore to hyphen |
| -prec_div=1 | &lt;null&gt; | -prec-div=1 | Underscore to hyphen |
| -prec_div=0 | &lt;null&gt; | -prec-div=0 | Underscore to hyphen |
| -fmad=1 | &lt;null&gt; | -fma=1 | fmad renamed to fma |
| -fmad=0 | &lt;null&gt; | -fma=0 | fmad renamed to fma |
| -O0 | --device-O=0 | -opt=0 | Dual-mapped |
| -O1 | --device-O=1 | -opt=1 | Dual-mapped |
| -O2 | --device-O=2 | -opt=2 | Dual-mapped |
| -O3 | --device-O=3 | -opt=3 | Dual-mapped |
| -Osize | &lt;null&gt; | -Osize | |
| -Om | &lt;null&gt; | -Om | |
| -Ofast-compile=max | &lt;null&gt; | -Ofast-compile=max | |
| -Ofc=max | &lt;null&gt; | -Ofast-compile=max | Alias |
| -Ofast-compile=mid | &lt;null&gt; | -Ofast-compile=mid | |
| -Ofc=mid | &lt;null&gt; | -Ofast-compile=mid | Alias |
| -Ofast-compile=min | &lt;null&gt; | -Ofast-compile=min | |
| -Ofc=min | &lt;null&gt; | -Ofast-compile=min | Alias |
| -Ofast-compile=0 | &lt;null&gt; | &lt;null&gt; | No-op |
| -Ofc=0 | &lt;null&gt; | &lt;null&gt; | No-op alias |
| -g | --device-debug | -g | Dual-mapped |
| -show-src | &lt;null&gt; | -show-src | |
| -disable-allopts | &lt;null&gt; | -disable-allopts | |
| -disable-llc-opts | &lt;null&gt; | disable-llc-opts | |
| -w | -w | -w | Dual-mapped |
| -Wno-memory-space | &lt;null&gt; | -Wno-memory-space | |
| -disable-inlining | &lt;null&gt; | -disable-inlining | |
| -aggressive-inline | &lt;null&gt; | -aggressive-inline | |
| --kernel-params-are-restrict | --kernel-params-are-restrict | -restrict | Dual-mapped, renamed |
| -allow-restrict-in-struct | &lt;null&gt; | -allow-restrict-in-struct | |
| --device-c | --device-c | --device-c | Dual-mapped |
| --generate-line-info | --generate-line-info | -generate-line-info | Dual-mapped |
| --enable-opt-byval | --enable-opt-byval | -enable-opt-byval | Dual-mapped |
| --no-lineinfo-inlined-at | &lt;null&gt; | -no-lineinfo-inlined-at | |
| --keep-device-functions | --keep-device-functions | &lt;null&gt; | EDG only |
| --emit-optix-ir | --emit-lifetime-intrinsics | --emit-optix-ir | Triggers lifetime intrinsics in EDG |
| -opt-fdiv=0 | &lt;null&gt; | -opt-fdiv=0 | |
| -opt-fdiv=1 | &lt;null&gt; | -opt-fdiv=1 | |
| -new-nvvm-remat | &lt;null&gt; | -new-nvvm-remat | |
| -disable-new-nvvm-remat | &lt;null&gt; | -disable-new-nvvm-remat | |
| -disable-nvvm-remat | &lt;null&gt; | -disable-nvvm-remat | |
| -discard-value-names | --discard_value_names=1 | -discard-value-names=1 | Also sets *a9 = 1 |
| -gen-opt-lto | &lt;null&gt; | -gen-opt-lto | |

Key translation patterns:

  • Underscore to hyphen: nvcc uses underscores (-prec_sqrt), cicc uses hyphens (-prec-sqrt).
  • Rename: -fmad becomes -fma internally.
  • Dual-mapping: -O0 through -O3 emit both an EDG flag (--device-O=N) and a cicc flag (-opt=N).
  • Alias expansion: -Ofc=X is silently rewritten to -Ofast-compile=X.
  • Implicit dependency: --emit-optix-ir adds --emit-lifetime-intrinsics to the EDG frontend, enabling lifetime intrinsic generation that the OptiX IR output path requires.

Extended Flags (Not in Tree)

The following flags are handled by sequential string comparison in sub_900130 when a tree lookup misses:

| nvcc flag | Expansion | Notes |
|---|---|---|
| -maxreg=N | -maxreg=&lt;N&gt; to backend | |
| -split-compile=N | -split-compile=&lt;N&gt; to OPT | Error if specified twice |
| -split-compile-extended=N | -split-compile-extended=&lt;N&gt; to OPT | Mutually exclusive with -split-compile |
| --Xlgenfe &lt;arg&gt; | &lt;arg&gt; to EDG | |
| --Xlibnvvm &lt;arg&gt; | &lt;arg&gt; to backend | |
| --Xlnk &lt;arg&gt; / -Xlnk &lt;arg&gt; | -Xlnk + &lt;arg&gt; to backend | |
| --Xopt &lt;arg&gt; / -Xopt &lt;arg&gt; | -Xopt + &lt;arg&gt; to backend | |
| --Xllc &lt;arg&gt; / -Xllc &lt;arg&gt; | -Xllc + &lt;arg&gt; to backend | |
| -Xlto &lt;arg&gt; | &lt;arg&gt; to LTO vector | |
| -covinfo &lt;file&gt; | -Xopt -coverage=true -Xopt -covinfofile=&lt;file&gt; | |
| -profinfo &lt;file&gt; | -Xopt -profgen=true -Xopt -profinfofile=&lt;file&gt; | |
| -profile-instr-use &lt;file&gt; | -Xopt -profuse=true -Xopt -proffile=&lt;file&gt; | |
| -lto | -gen-lto to backend; enables LTO | |
| -olto &lt;file&gt; | -gen-lto-and-llc + flag + next arg | |
| --promote_warnings | -Werror to backend; flag to EDG | |
| -inline-info | -Xopt -pass-remarks=inline + missed + analysis | |
| -jump-table-density=N | -jump-table-density=&lt;N&gt; to backend | |
| -opt-passes=&lt;val&gt; | -opt-passes=&lt;val&gt; to backend | |
| --orig_src_file_name &lt;val&gt; | --orig_src_file_name + &lt;val&gt; to EDG | |
| --force-llp64 | Pass to EDG; sets byte_4F6D2DC = 1 | |
| --partial-link | Complex: may add -memdep-cache-byval-loads=false to OPT and LLC | Sets byte_4F6D2D0 = 1 |
| --tile-only | Pass to EDG + --tile_bc_file_name + output path | |
| --device-time-trace | Pass to EDG; next arg becomes trace path | |
| -jobserver | -jobserver to backend or pass to EDG | |

Input Extensions

Input files are identified by extension during the argument loop in sub_8F9C90. The last matching file wins (the input variable s is overwritten each time). Extension matching proceeds by checking trailing characters: last 3 for .bc/.ci, last 2 for .i, last 3 for .ii, last 4 for .cup, last 8 for .optixir.

| Extension | Format | Condition | Address |
|---|---|---|---|
| .bc | LLVM bitcode | Always accepted | 0x8FAA0A |
| .ci | CUDA intermediate (preprocessed) | Always accepted | 0x8FAA29 |
| .i | Preprocessed C/C++ | Always accepted | 0x8FA9xx |
| .ii | Preprocessed C++ | Always accepted | 0x8FBF7E |
| .cup | CUDA source | Only after --orig_src_path_name or --orig_src_file_name | 0x8FBFC4 |
| .optixir | OptiX IR | Always accepted | 0x8FC001 |

Unrecognized arguments (those failing both tree lookup and sequential matching, and lacking a recognized extension) are silently appended to the v266 pass-through vector, which is forwarded to sub-pipelines.

If no input file is found after parsing all arguments:

Missing input file
Recognized input file extensions are: .bc .ci .i .cup .optixir

Note that .ii is not mentioned in the error message despite being accepted -- this appears to be a minor oversight in the error string.

Output Modes

cicc can produce several output formats, controlled by the combination of flags in the a13 compilation mode bitmask. The bitmask is accumulated during flag parsing in sub_9624D0:

| a13 Value | Mode | Output Format |
|---|---|---|
| 0x07 | Default (all phases) | PTX text assembly |
| 0x10 | Debug/line-info | PTX with debug metadata |
| 0x21 | -gen-lto | LTO bitcode (.lto.bc) |
| 0x23 | -lto (full LTO) | LTO bitcode + link |
| 0x26 | -link-lto | Linked LTO output |
| 0x43 | --emit-optix-ir | OptiX IR (.optixir) |
| 0x80 | -gen-opt-lto | Optimized LTO bitcode |
| 0x100 | --nvvm-64 | 64-bit NVVM mode modifier |
| 0x200 | --nvvm-32 | 32-bit NVVM mode modifier |

The default output is PTX text, written through the LLC backend's PTX printer. The output file path is specified by -o <file> (fatal if missing in multi-stage modes). When no output path is provided in simple mode, sub_900130 constructs a .ptx filename from the input.

PTX Text Output (Default)

The standard path runs all four internal phases: LNK (IR linking), OPT (NVVM optimizer), optionally OptiX IR emission, then LLC (code generation). The LLC backend writes PTX assembly text to the output file. In sub_905EE0, the output writing (Phase 4) checks the first bytes of the result for ELF magic (0x7F, 0xED) to detect accidentally binary output; if the mode is text mode (0) and ELF headers are present, it indicates an internal error.

LTO Bitcode Output

When -lto or -gen-lto is active, cicc produces LLVM bitcode instead of PTX. The -gen-lto flag sets a13 = (a13 & 0x300) | 0x21 and adds -gen-lto to the LTO argument vector. The -gen-lto-and-llc variant additionally runs LLC after producing the LTO bitcode, generating both outputs. The -olto flag takes a next argument (the LTO optimization level) and combines LTO bitcode generation with LLC execution.

OptiX IR Output

The --emit-optix-ir flag sets a13 = (a13 & 0x300) | 0x43. In the flag translation tree, it also injects --emit-lifetime-intrinsics into the EDG frontend, enabling lifetime intrinsic emission that is required for the OptiX IR format. In the flag catalog (sub_9624D0), it additionally routes -do-ip-msp=0 and -do-licm=0 to the optimizer, disabling interprocedural memory space promotion and LICM for OptiX compatibility.

Split Compile

The -split-compile=N flag (or -split-compile-extended=N) routes to the optimizer as -split-compile=<N> (or -split-compile-extended=<N>). These are mutually exclusive and error if specified more than once ("split compilation defined more than once"). When -split-compile-extended is used, it also sets the flag at a1+1644 to 1. The split compile mechanism divides the compilation unit into N partitions for parallel processing.

Exit Codes

The process exit code is the return value of sub_8F9C90 (real main), stored in v8:

| Code | Meaning | Source |
|---|---|---|
| 0 | Success | Normal compilation; -irversion query |
| 1 | Argument error | Missing input file, missing output file, CLI parse failure |
| v264 | Pipeline error | Return code from sub_905EE0 / sub_1265970 / sub_905880 |

Within the pipeline, error codes from sub_905EE0 are set via *a8:

| *a8 Value | Meaning |
|---|---|
| 0 | Success (NVVM_SUCCESS) |
| -1 | File open/read error |
| 1 | NVVM_ERROR_OUT_OF_MEMORY |
| 4 | NVVM_ERROR_INVALID_INPUT |
| 5 | NVVM_ERROR_INVALID_CU (null compilation unit) |

Error messages are written to qword_4FD4BE0 (stderr stream) via sub_223E0D0. All LibNVVM-originated errors are prefixed with "libnvvm : error: ". Representative errors:

  • "Error processing command line: <cmd>" (from sub_900130 failure)
  • "Missing input file" / "Missing output file"
  • "<src>: error in open <file>" (file I/O)
  • "libnvvm: error: failed to create the libnvvm compilation unit"
  • "libnvvm: error: failed to add the module to the libnvvm compilation unit"
  • "libnvvm: error: failed to get the PTX output"
  • "Invalid NVVM IR Container" (error code 259, from sub_C63EB0)
  • "Error opening '<file>': file exists!" / "Use -f command line argument to force output"
  • "Error: Failed to write time profiler data."
  • "Unparseable architecture: <val>"
  • "libnvvm : error: <flag> is an unsupported option"
  • "libnvvm : error: <flag> defined more than once" (duplicate -maxreg, etc.)

Special Behaviors

.cup Extension Gate

The .cup extension (CUDA preprocessed source) is only accepted as an input file when the preceding argument is --orig_src_path_name or --orig_src_file_name. These are metadata flags inserted by nvcc to track the original source file path for diagnostic messages. The check is:

// At 0x8FBFC4 and 0x8FBFDE:
if (strcmp(argv[i-1], "--orig_src_path_name") == 0 ||
    strcmp(argv[i-1], "--orig_src_file_name") == 0) {
    s = argv[i];  // accept .cup as input
}

This means cicc will silently ignore a .cup file that appears without a preceding metadata flag. When accepted, the .cup extension triggers --orig_src_path_name / --orig_src_file_name handling in sub_900130, which forwards the original source path to the EDG frontend for accurate error location reporting.

-Ofc Alias Handling

The -Ofc=X form is a shorthand alias for -Ofast-compile=X, handled entirely within the sub_8FE280 flag translation tree. The tree contains eight entries for fast-compile control:

| Tree Key | cicc Internal | Effect |
|---|---|---|
| -Ofast-compile=max | -Ofast-compile=max | Identity |
| -Ofc=max | -Ofast-compile=max | Alias |
| -Ofast-compile=mid | -Ofast-compile=mid | Identity |
| -Ofc=mid | -Ofast-compile=mid | Alias |
| -Ofast-compile=min | -Ofast-compile=min | Identity |
| -Ofc=min | -Ofast-compile=min | Alias |
| -Ofast-compile=0 | &lt;null&gt; | No-op |
| -Ofc=0 | &lt;null&gt; | No-op alias |

The aliasing happens at the tree level, before sub_9624D0 ever sees the flag. By the time the flag catalog processes the argument, -Ofc=max and -Ofast-compile=max are indistinguishable. See Optimization Levels for what each fast-compile tier actually does.

In sub_9624D0, -Ofast-compile is stored at offset a1+1640 as an integer:

| Level string | Integer value | Behavior |
|---|---|---|
| "0" | 1 | Disabled (then reset to 0) |
| "max" | 2 | Most optimizations skipped; forces -lsa-opt=0, -memory-space-opt=0 |
| "mid" | 3 | Medium pipeline |
| "min" | 4 | Close to full optimization |

Any other value produces: "libnvvm : error: -Ofast-compile called with unsupported level, only supports 0, min, mid, or max".

Only one -Ofast-compile is permitted per invocation. A second occurrence triggers: "libnvvm : error: -Ofast-compile specified more than once".

Discard Value Names

The -discard-value-names flag has complex interaction semantics. In the tree, it dual-maps to --discard_value_names=1 (EDG, note underscores) and -discard-value-names=1 (cicc, note hyphens). Additionally, per-phase overrides are possible via -Xopt -opt-discard-value-names=0, -Xlnk -lnk-discard-value-names=0, or -Xlto -lto-discard-value-names=0.

In CUDA mode, without explicit flags, value names are discarded by default. In OpenCL mode, the default only applies when LTO generation is active (a13 & 0x20). This reflects the fact that value names are useful for debugging but waste memory in production builds.

Wizard Mode Interaction

The -v (verbose), -keep (keep intermediates), and -dryrun flags are parsed in sub_8F9C90 but are only effective when wizard mode is active. Wizard mode is gated on the NVVMCCWIZ environment variable holding the value 553282, which sets byte_4F6D280 = 1. Without wizard mode, these flags are silently accepted but have no effect -- v259 (verbose) and v262 (keep) remain 0. This is a deliberate anti-reverse-engineering measure.

Default Values When Flags Are Absent

When a flag is not explicitly provided, sub_9624D0 applies these defaults (checking stored-value sentinels):

| Flag | Default Value | Sentinel Offset |
|---|---|---|
| -opt= | -opt=3 (O3) | a1+400 |
| -arch=compute_ | -arch=compute_75 (Turing) | a1+560 |
| -ftz= | -ftz=0 (no flush-to-zero) | a1+592 |
| -prec-sqrt= | -prec-sqrt=1 (CUDA) / -prec-sqrt=0 (OpenCL) | a1+624 |
| -prec-div= | -prec-div=1 (precise) | a1+656 |
| -fma= | -fma=1 (enabled) | a1+688 |
| -opt-fdiv= | -opt-fdiv=0 | a1+464 |

Configuration

Four Output Vectors

sub_9624D0 builds four independent std::vector<std::string> that are serialized into char** arrays at function exit:

| Vector | Seed | Output | Pipeline Phase |
|---|---|---|---|
| v324 (LNK) | "lnk" | a5/a6 | Phase 1: IR linker |
| v327 (OPT) | "opt" | a7/a8 | Phase 2: NVVM optimizer |
| v330 (LTO) | (none) | a9/a10 | Phase 3: LTO passes |
| v333 (LLC) | "llc" | a11/a12 | Phase 4: Code generation |

Each vector element is a 32-byte std::string with SSO. At exit, elements are serialized via malloc(8 * count) for the pointer array and malloc(len+1) + memcpy for each string.

Architecture Bitmask Validation

Architecture validation in sub_9624D0 uses a 64-bit bitmask 0x60081200F821:

offset = arch_number - 75;
if (offset > 0x2E || !_bittest64(&0x60081200F821, offset))
    // error: "is an unsupported option"

Valid architectures (bit positions): SM 75, 80, 86, 87, 88, 89, 90, 100, 103, 110, 120, 121. The a/f sub-variants share the base SM number for bitmask validation but receive distinct routing in sub_95EB40.

Compilation Mode Flags Bitmask (a13)

The a13 parameter in sub_9624D0 is an IN/OUT bitmask tracking compilation mode:

| Bit/Mask | Source Flag | Meaning |
|----------|-------------|---------|
| 0x07 | (default) | Phase control: all phases active |
| 0x10 | -g, --generate-line-info | Debug/line-info enabled |
| 0x20 | -gen-lto, -gen-lto-and-llc | LTO generation enabled |
| 0x21 | -gen-lto | Gen-LTO mode |
| 0x23 | -lto | Full LTO mode |
| 0x26 | -link-lto | Link-LTO mode |
| 0x43 | --emit-optix-ir | OptiX IR emission mode |
| 0x80 | -gen-opt-lto | Optimized LTO lowering |
| 0x100 | --nvvm-64 | 64-bit NVVM mode |
| 0x200 | --nvvm-32 | 32-bit NVVM mode |
| 0x300 | (mask) | 64/32-bit mode bits mask |
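Assuming the table above, the mode bits can be probed like this (helper name and decoding are hypothetical, illustrating only how the 0x300 mask separates the two NVVM width flags):

```python
# Hypothetical decoding of the a13 compilation-mode bitmask.
NVVM_MODE_MASK = 0x300   # bits 8-9: --nvvm-64 / --nvvm-32

def nvvm_width(a13: int):
    mode = a13 & NVVM_MODE_MASK
    if mode == 0x100:
        return 64            # --nvvm-64
    if mode == 0x200:
        return 32            # --nvvm-32
    return None              # neither flag given

a13 = 0x07 | 0x100           # all phases active, --nvvm-64
assert nvvm_width(a13) == 64
assert a13 & 0x10 == 0       # no -g / line info
```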

Function Map

| Function | Address | Size | Role |
|----------|---------|------|------|
| sub_8F9C90 | 0x8F9C90 | 10,066 B | Real main entry point |
| sub_8FE280 | 0x8FE280 | ~35 KB | Flag translation tree builder (nvcc -> cicc) |
| sub_8FE150 | 0x8FE150 | -- | Tree lookup (lower_bound + insert) |
| sub_8FDFD0 | 0x8FDFD0 | -- | Tree insert + rebalance |
| sub_8FD0D0 | 0x8FD0D0 | -- | Architecture flag scanner (first pass) |
| sub_900130 | 0x900130 | 39 KB | CLI processing Path A (12 params) |
| sub_902D10 | 0x902D10 | ~9 KB | Path A orchestrator |
| sub_904450 | 0x904450 | -- | Push flag to argument vector |
| sub_905880 | 0x905880 | ~6 KB | EDG frontend stage |
| sub_905EE0 | 0x905EE0 | 43 KB | Path A multi-stage pipeline driver |
| sub_908220 | 0x908220 | -- | LLC output callback (ID 56993) |
| sub_908850 | 0x908850 | -- | Triple construction (nvptx64-nvidia-cuda) |
| sub_9085A0 | 0x9085A0 | -- | OPT output callback (ID 64222) |
| sub_95EB40 | 0x95EB40 | 38 KB | 3-column architecture mapping table builder |
| sub_9624D0 | 0x9624D0 | 75 KB | Flag catalog (4 output vectors, ~111 flags) |
| sub_1262860 | 0x1262860 | -- | Path B simple dispatch |
| sub_1265970 | 0x1265970 | 48 KB | Path B multi-stage pipeline driver |

Global Variables

| Address | Variable | Purpose |
|---------|----------|---------|
| qword_4F6D2A0 | Flag tree root | std::map root for sub_8FE280 |
| dword_4F6D2A8 | Flag tree sentinel | tree.end() |
| qword_4F6D2B0 | Flag tree root node | Root node pointer |
| qword_4F6D2B8 | Flag tree begin | Leftmost node (begin iterator) |
| qword_4F6D2C8 | Init guard | Set to 1 after sub_8FE280 first call |
| byte_4F6D2D0 | Partial-link flag | Set by --partial-link |
| byte_4F6D2DC | LLP64 flag | Set by --force-llp64 |
| unk_4F06A68 | Data model width | 8 = 64-bit, 4 = 32-bit |
| unk_4D0461C | Address space 3 flag | Enables p3:32:32:32 in datalayout |
| byte_4F6D280 | Wizard mode | Set by NVVMCCWIZ=553282 |

Cross-References

EDG 6.6 Frontend

NVIDIA licenses the Edison Design Group (EDG) C/C++ front end — a commercial compiler frontend used by several major compilers including Intel ICC. In cicc v13.0, EDG version 6.6 occupies 3.2 MB of code (0x5D0000–0x8F0000), making it the largest single subsystem in the binary. Unlike most modern compilers that parse directly to an SSA-based IR, EDG operates as a source-to-source translator: it parses CUDA C++ source code and emits transformed C code containing CUDA runtime API calls. This output is then fed into a second compilation phase that produces NVVM IR (LLVM bitcode). This two-stage design means the CUDA language extensions (kernel launch syntax, memory space qualifiers, device/host function annotations) are resolved entirely within EDG, and the LLVM-based backend never sees raw CUDA syntax.

The EDG frontend is configured at compile time through 737 #define macros, including GCC 8.1 emulation mode and Clang 9.1 emulation mode. Exceptions are disabled by default — CUDA device code cannot use C++ exceptions — while RTTI remains enabled for dynamic_cast support in host-side code that interacts with device objects.

| Property | Value |
|----------|-------|
| EDG version | 6.6 (string: "Based on Edison Design Group C/C++ Front End, version 6.6") |
| Entry symbol | lgenfe_main (string at sub_617BD0) |
| GCC emulation | 8.1 (DEFAULT_GNU_VERSION = 80100) |
| Clang emulation | 9.1 (DEFAULT_CLANG_VERSION = 90100) |
| C++ standards | C++98, C++11, C++14, C++17, C++20, C++23 (unk_4F07778 = year code) |
| C standards | C99, C11, C18, C23 |
| Exceptions | Disabled by default (DEFAULT_EXCEPTIONS_ENABLED = 0) |
| RTTI | Enabled by default (DEFAULT_RTTI_ENABLED = 1) |
| Target model | LP64 (TARG_SIZEOF_POINTER = 8, TARG_SIZEOF_LONG = 8) |
| Backend | C-codegen (BACK_END_IS_C_GEN_BE = 1) — emits C source, not LLVM IR directly |
| Functions | ~5,000 in range, 300+ above 5KB |

Architecture

The compilation flow through EDG has four major phases: CLI parsing (282-case switch), translation unit initialization (keyword tables, parser bootstrapping), parsing and semantic analysis (the bulk of the 3.2 MB), and backend code emission (generating three output files: .int.c for internal declarations, .device.c for device code, and .stub.c for host-side launch stubs). Error recovery uses setjmp/longjmp — any of the 478 call sites that invoke the abort handler (sub_721090) will unwind back to the orchestrator rather than crashing the process.

sub_5D2A80 (orchestrator, setjmp error recovery)
  │
  ├─ sub_617BD0 (lgenfe_main: 282-case CLI switch, 737 config #defines)
  │    ├─ sub_610260 (register 300+ CLI options)
  │    └─ sub_6140E0 (option fetcher loop)
  │
  ├─ sub_8D0BC0 (translation unit init)
  │    ├─ sub_706250 (keyword table: ~350 keywords via sub_885C00)
  │    ├─ sub_858C60 (parser entry)
  │    └─ sub_709290 (finalize)
  │
  ├─ sub_709330 ("Generating Needed Template Instantiations", "Wrapping up translation unit")
  │
  └─ sub_5E3AD0 (backend entry: "Generating NVVM IR")
       ├─ Opens .int.c / .device.c / .stub.c output files
       ├─ sub_5DB980 (top-level declaration dispatcher)
       │    ├─ sub_5E13C0 (function declaration printer, 44KB)
       │    ├─ sub_5DBFC0 (expression printer, 41KB, 61 self-references)
       │    ├─ sub_5DFD00 (statement printer, 26KB)
       │    ├─ sub_5D80F0 (initializer printer)
       │    ├─ sub_5DAD30 (struct/union/enum printer)
       │    └─ sub_5DF1B0 (inline asm printer)
       │
       └─ dlopen("libTileIRCompiler_shared.so") [optional, gated by dword_4D045A0]
            └─ dlsym("cudacc_back_end") — 17-entry function pointer table

Timer callbacks record "Front end time", "Back end time", and "Total compilation time" via sub_7211D0.

Orchestrator — sub_5D2A80

The master entry point for the entire frontend. Uses setjmp for non-local error recovery — when any of the ~5,000 EDG functions detects an unrecoverable error (type system inconsistency, parser corruption, internal assertion failure), it calls sub_721090, which longjmps back to this function. The 478 call sites that reference the abort handler demonstrate just how pervasive error checking is throughout the frontend — roughly 10% of all functions in the EDG range can trigger a fatal abort.

| Global | Purpose |
|--------|---------|
| unk_4D045D8 | Phase callback (prints "Generating NVVM IR" etc.) |
| unk_4D04744 | Timer enable flag |
| unk_4F074B0 | Error flag (frontend errors occurred) |
| unk_4F074A8 | Warning count |
| qword_4F076F0 | Input source filename |

Frontend Entry — sub_617BD0 (lgenfe_main)

At 123KB and 3,113 decompiled lines, lgenfe_main is the largest function in the EDG range. The name "lgenfe" stands for "LLVM-generating front end" — a hint that this function was originally designed for a different backend before NVIDIA adopted the EDG+LLVM architecture. The function is divided into three distinct regions: a massive 282-case switch for CLI option parsing (2,000 lines), a post-parse validation phase that checks for conflicting options and enforces CUDA-specific constraints, and a file I/O setup phase that installs 11 signal handlers and returns a pointer to the configured compilation context.

Signature: (int argc, __int64 argv).

Structure

| Region | Lines | Content |
|--------|-------|---------|
| A | 164–2157 | 282-case switch on option ID (v6) |
| B | 2157–2700 | Post-parse validation and cross-option consistency |
| C | 2700–3113 | File I/O setup, 11 signal handlers, return &qword_4D046F0 |

Architecture Parsing (case 0x52)

compute_75, compute_80, compute_86, compute_87, compute_88, compute_89
compute_90, compute_90a
compute_100, compute_100a, compute_100f
compute_103, compute_103a, compute_103f
compute_110, compute_110a, compute_110f
compute_120, compute_120a, compute_120f
compute_121, compute_121a, compute_121f

Storage: unk_4D045E8 = SM number, unk_4D045E4 = a suffix flag, unk_4D045E0 = f suffix flag.

Configuration Emission (case 0xE1)

Emits 737 #define macros to configure the EDG compiler. Key defines:

| Define | Value | Meaning |
|--------|-------|---------|
| VERSION_NUMBER | "6.6" | EDG frontend version |
| EDG_MAIN | "lgenfe_main" | Entry point symbol |
| DEFAULT_GNU_VERSION | 80100 | Emulate GCC 8.1 |
| DEFAULT_CLANG_VERSION | 90100 | Emulate Clang 9.1 |
| DEFAULT_EXCEPTIONS_ENABLED | 0 | CUDA: no exceptions |
| TARG_SIZEOF_POINTER | 8 | 64-bit pointers |
| TARG_SIZEOF_LONG_DOUBLE | 16 | 128-bit long double |
| TARG_LITTLE_ENDIAN | 1 | x86-64 host |
| USE_SOFTFLOAT | 1 | Software FP for constexpr |
| ABI_COMPATIBILITY_VERSION | 9999 | Maximum ABI compat |
| MODULE_MAX_LINE_NUMBER | 250000 | Max lines per module |

CLI Option Registration — sub_610260

Registers ~300 options via sub_6101D0(id, name, flag, ...). CUDA-specific options include:

| ID | Name | Purpose |
|----|------|---------|
| 51 | no-device-int128 | Disable __int128 on device |
| 59 | emit-llvm-bc | Emit LLVM bitcode directly |
| 60 | device-debug | Device-side debug info |
| 68 | force-volatile | Force volatile on memory space (global/shared/constant/local/generic/all) |
| 73 | kernel-params-are-restrict | All kernel pointer params are __restrict__ |
| 82 | nv_arch | compute_XX architecture selection |
| 93 | device-c | Separate compilation mode |
| 105 | tile-only | TileIR-only compilation |
| 124 | extended-lambda | Extended lambda support (--expt-extended-lambda) |
| 132 | emit-lifetime-intrinsics | LLVM lifetime intrinsics |

Translation Unit Processing

Translation unit processing is where EDG transitions from CLI configuration to actual compilation. The init function sets up the lexer, allocates the translation unit data structure (416 bytes), populates the keyword table with ~350 entries, and enters the recursive-descent parser. EDG uses a keyword-registration model where each keyword is individually registered with its token ID — this allows NVIDIA to add CUDA-specific keywords (like __shared__ or __nv_fp8_e4m3) without modifying the core parser grammar.

Init — sub_8D0BC0

  1. Reset token state (dword_4F063F8 = 0)
  2. Call sub_727950 (lexer init)
  3. Allocate 416-byte TU object via sub_823970
  4. Call sub_706250 — keyword table init (~350 keywords)
  5. Call parser entry (sub_858C60 or PCH path sub_852E40)
  6. Call sub_709290 — finalize

Keyword Registration — sub_706250

30KB. Calls sub_885C00(token_id, "keyword_string") ~350 times. Initializes 30+ subsystems before keyword registration. Categories:

  • C89 keywords: auto, break, case, const, continue, default, do, double, else, ...
  • C99 additions: _Bool, _Complex, _Generic, _Atomic, restrict, inline
  • C11/C23: _Static_assert, _Thread_local, _Alignas, _Alignof, constexpr, typeof
  • C++ keywords: class, template, virtual, namespace, using, try, catch, throw, ...
  • C++20: co_yield, co_return, co_await, requires, concept
  • Type traits (~80): __is_pod, __is_abstract, __is_trivially_copyable, __has_virtual_destructor, ...
  • NVIDIA extensions: __nv_is_extended_device_lambda_closure_type, __nv_is_extended_host_device_lambda_closure_type, __nv_is_extended_device_lambda_with_preserved_return_type
  • EDG internal: __edg_type__, __edg_vector_type__, __edg_neon_vector_type__, __edg_scalable_vector_type__

Version-gated by dword_4F077C4 (language mode), unk_4F07778 (standard year), qword_4F077B4 (feature flags).

Finalization — sub_709330

Strings: "Generating Needed Template Instantiations", "Wrapping up translation unit". Calls sub_8B18F0 for C++ template instantiation when dword_4F077C4 == 2.

Preprocessor

EDG includes its own preprocessor rather than relying on an external cpp. This is standard for EDG-based compilers — the preprocessor is tightly integrated with the parser to handle complex interactions between macros and C++ syntax (e.g., __VA_OPT__ in C++20, which requires the preprocessor to understand syntactic context). The preprocessor occupies ~250KB across four major functions and maintains a 99-entry predefined macro table plus a 25-entry feature-test macro table.

Token Scanner — sub_7B8B50 (59KB)

The main preprocessor tokenizer. Handles all C/C++ token kinds: identifiers, numbers (delegates to sub_7B40D0), string literals, operators, punctuators, UCN sequences. Detects C++20 module/import keywords via string comparison.

Numeric Literal Parser — sub_7B40D0 (42KB)

Second-largest preprocessor function. Handles: integer suffixes (u/U/l/L/ll/LL), float suffixes (f/F/l/L), hex floats (0x...p...), binary literals (0b...), C++14 digit separators (').

Macro Expander — sub_81B8F0 (77KB)

The central macro expansion engine. Features:

  • __VA_ARGS__ (C99) and __VA_OPT__ (C++20) support
  • 99-entry predefined macro table at off_4B7C440 (stride 40 bytes)
  • 25-entry feature-test macro table at off_4B7C360
  • Recursion limit: 300 expansions (error 0xE3)
  • Intrinsic type-trait macros: __type_pack_element, __is_signed, __make_integer_seq, __is_pointer
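The recursion guard on that list can be illustrated with a toy expander. This is a sketch of the pattern only (the table layout, hideset handling, and rescanning rules of the real expander are far richer); the limit and error code are from the list above, everything else is hypothetical:

```python
# Sketch of the macro expander's recursion guard (limit 300, error 0xE3).
MAX_EXPANSIONS = 300

class ExpansionLimitError(Exception):
    code = 0xE3

def expand(token, macros, depth=0):
    if depth >= MAX_EXPANSIONS:
        raise ExpansionLimitError(f"macro recursion exceeds {MAX_EXPANSIONS}")
    body = macros.get(token)
    if body is None:
        return [token]                       # not a macro: emit as-is
    out = []
    for t in body:
        out.extend(expand(t, macros, depth + 1))
    return out

macros = {"A": ["B", "1"], "B": ["2"]}
print(expand("A", macros))   # ['2', '1']
```

A self-referential macro such as `{"X": ["X"]}` trips the depth guard, which is the situation the 0xE3 diagnostic exists for.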

Character Scanner — sub_7BC390 (29KB)

Giant switch on character value. Handles trigraph sequences, line splices, multi-byte characters, comment detection (// and /*).

Parser & Declaration Processing

The parser subsystem is the largest part of the EDG frontend — over 1 MB of code spread across dozens of functions. EDG uses a recursive-descent parser augmented with a declaration-specifier state machine. The state machine design is necessary because C/C++ declaration specifiers can appear in any order (const unsigned long long int and int long unsigned long const are identical), requiring the parser to accumulate specifiers into bitmasks and resolve the final type only after all specifiers have been consumed.

NVIDIA's major contribution to the parser is the CUDA type extension infrastructure: 19 new FP8/FP6/FP4/MX-format type tokens (339–354) for Blackwell's tensor core operations, 9 address-space qualifier tokens (272–280) for GPU memory spaces, and 4 memory-space declaration specifiers (133–136) that piggyback on the existing width-modifier field. These extensions are grafted onto EDG's type system in a way that minimizes changes to the core parser logic — CUDA qualifiers reuse existing state variables with previously-unused value ranges.

Declaration Specifier State Machine — sub_672A20 (132KB, 4,371 lines)

The central parser function and one of the most complex functions in the binary. A while(2)/switch dispatcher on token codes from word_4F06418[0] with ~80 case labels. It accumulates type specifiers, qualifiers, storage-class specifiers, and CUDA address-space qualifiers from the token stream into a set of bitmask variables, then constructs the final type node from the accumulated state.

State Variables

| Variable | Stack | Bits | Role |
|----------|-------|------|------|
| v325 | [rsp+B8h] | uint | Type specifier kind (see table below) |
| v327 | [rsp+C0h] | uint64 | Specifier category bitmask |
| v307 | [rsp+90h] | int | CV-qualifier accumulation bits |
| v302 | [rsp+78h] | uint | Long count (0=none, 1=long, 2=long long) |
| v305 | [rsp+84h] | uint | Signedness/width — reused for CUDA (4–7) |
| v299 | [rsp+68h] | int | _Complex (1) / _Imaginary (2) tracking |

Type Specifier Kind (v325)

| Value | Meaning | Token Case |
|-------|---------|------------|
| 0 | None yet | |
| 2 | char | 80 |
| 3 | wchar_t | 165 |
| 4 | bool / _Bool | 128 / 120 |
| 5 | float | 126 |
| 6 | double | 127 |
| 7 | void | 180 |
| 8 | signed / __int8 | 93 / 239 |
| 9 | __float128 | 331 |
| 12 | int (explicit) | 89 |
| 14 | __float16 | 332 |
| 15 | short / half | 85 |
| 16 | _Float16 | 333 |
| 17 | __bf16 | 334 |
| 19 | bfloat16 | 335 |
| 20 | Resolved typedef/CUDA type name | scope lookup |
| 21 | struct/union/enum tag | 101/104/151 |
| 23 | decltype() | 183 |
| 24 | auto (deduced) | 186 |
| 25 | Resolved identifier type | C++ lookup |
| 26 | Error recovery type | diagnostic |

Specifier Bitmask (v327)

| Bit | Mask | Meaning |
|-----|------|---------|
| 0 | 0x1 | Storage class (extern/static/etc.) |
| 1 | 0x2 | CV-qualifier seen |
| 2 | 0x4 | Type specifier seen |
| 3 | 0x8 | friend specifier |
| 4 | 0x10 | __declspec / attribute seen |
| 5 | 0x20 | explicit specifier |
| 6 | 0x40 | inline specifier |
| 7 | 0x80 | _Thread_local / thread_local |
| 10 | 0x400 | typeof / decltype |
| 12 | 0x1000 | __declspec() already processed |
| 13 | 0x2000 | explicit(bool) already processed |
| 14 | 0x4000 | _Noreturn / [[noreturn]] |
| 15 | 0x8000 | _Atomic |

CV-Qualifier Bits (v307)

| Bit | Mask | Qualifier |
|-----|------|-----------|
| 0 | 0x01 | const (case 81) |
| 1 | 0x02 | volatile (case 107) |
| 2 | 0x04 | restrict / __restrict (cases 118/119) |
| 3 | 0x08 | __unaligned (case 263 with parens) |
| 4 | 0x10 | __ptr32 (case 264) |
| 5 | 0x20 | __ptr64 (case 265) |
| 6 | 0x40 | __sptr / __uptr (case 266) |

Duplicate CV qualifiers trigger diagnostic 83.

CUDA Memory Space Tokens (133–136)

These piggyback on the signedness/width field v305 with values 4–7:

| Token | Keyword | v305 | v325 | Formula |
|-------|---------|------|------|---------|
| 133 | __shared__ | 4 | 2 | Special case |
| 134 | __device__ | 5 | 8 | token - 129 |
| 135 | __constant__ | 6 | 8 | token - 129 |
| 136 | __managed__ | 7 | 8 | token - 129 |

Clean separation: values 0–3 = standard C width modifiers, 4–7 = CUDA address-space qualifiers. The type-construction switch handles both ranges.
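A sketch of the mapping, assuming the table above. Note that `token - 129` actually holds for __shared__'s v305 as well (133 - 129 = 4); what is special-cased is its v325 value:

```python
# CUDA memory-space tokens 133-136: v305 = token - 129 in all four cases;
# only v325 differs (__shared__ gets 2, the others get 8).
def classify(token: int):
    assert 133 <= token <= 136
    v305 = token - 129
    v325 = 2 if token == 133 else 8
    return v305, v325

assert classify(133) == (4, 2)   # __shared__
assert classify(136) == (7, 8)   # __managed__
```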

CUDA Extended Type Tokens (339–354)

| Token | Type | Format |
|-------|------|--------|
| 236 | __nv_fp8_e4m3 | FP8 |
| 339 | __nv_fp8_e5m2 | FP8 |
| 340–343 | __nv_fp8x{2,4}_e{4m3,5m2} | FP8 vector |
| 344–345 | __nv_fp6_e{2m3,3m2} | FP6 |
| 346–347 | __nv_fp6x2_e{2m3,3m2} | FP6 vector |
| 348–349 | __nv_mxfp8_e{4m3,5m2} | MX-format FP8 |
| 350–351 | __nv_mxfp6_e{2m3,3m2} | MX-format FP6 |
| 352 | __nv_mxfp4_e2m1 | MX-format FP4 |
| 353 | __nv_satfinite | Saturation type |
| 354 | __nv_e8m0 | Exponent-only E8M0 |

All resolve via sub_6911B0() → type node, then set v325=20, v327|=4.

CUDA Address Space Qualifier Tokens (272–280)

| Token | Keyword | Space ID | Handler |
|-------|---------|----------|---------|
| 272 | __attribute__((address_space(N))) | parsed int | sub_6210B0 |
| 273 | __global__ | 0 | sub_667B60(0,...) |
| 274 | __shared__ (addr space) | 2 | sub_667B60(2,...) |
| 275 | __constant__ (addr space) | 3 | sub_667B60(3,...) |
| 276 | __generic__ | | sub_72B620(type, cv) |
| 277 | __nv_tex_surf_handle_t | | sub_72BA30(unk_4F06A51) |
| 278 | __nv_buffer_handle_t | | sub_72BA30(unk_4F06A60) |
| 279 | __nv_grid_constant | | sub_72C390() |
| 280 | __nv_is_extended_device_lambda | | sub_72C270() |

Type Construction Functions

| Function | Purpose | Trigger |
|----------|---------|---------|
| sub_72BA30(code) | Fundamental signed integer type | int, short, long, long long |
| sub_72BC30(code) | CUDA extended-width integer | CUDA mode + v305 > 3 |
| sub_72BCF0(code) | Unsigned fundamental type | unsigned combos |
| sub_72BDB0(code) | CUDA unsigned extended type | CUDA mode + unsigned |
| sub_72BF70() | float type | v325 == 5 |
| sub_72C030() | double type | v325 == 6 |
| sub_72C0F0() | long double type | long + double |
| sub_72C1B0() | __float128 type | v325 == 9 |
| sub_72C610(kind) | Float-by-kind (mapped from v325) | FP8/FP6/BF16/etc. |
| sub_72C6F0(kind) | _Complex float variant | v299 == 1 |
| sub_72C7D0(kind) | _Imaginary float variant | v299 == 2 |
| sub_72C930(code) | Error/placeholder type | diagnostic issued |
| sub_72CBA0() | Dependent type | v325 == 25 |
| sub_72CBE0(...) | __int128 type | v325 == 1 |
| sub_73C570(type, cv, flags) | Apply CV-qualifiers to type | post-construction |

Accumulation Flow

  1. Initialize: all state variables to 0
  2. Loop: read word_4F06418[0], dispatch through switch — set bitmask bits, update kind/cv/width
  3. Exit: unrecognized token → LABEL_8 (default exit)
  4. Type construction: switch on v325 × v302 × v305 → call appropriate sub_72B*/sub_72C*
  5. CV application: sub_73C570 wraps the type with const/volatile/restrict
  6. Return: type stored at ds->field_272, CV bits at ds->field_120
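The order-independence that motivates this design can be sketched with a toy accumulator. The state-variable names follow the decompilation; the token set and dispatch are simplified stand-ins for the ~80-case switch:

```python
# Sketch of order-independent specifier accumulation (v325 = kind,
# v302 = long count, v305 = signedness, v307 = cv bits).
def accumulate(tokens):
    v325, v302, v305, v307 = 0, 0, 0, 0
    for t in tokens:
        if t == "int":        v325 = 12       # explicit int
        elif t == "long":     v302 += 1       # long / long long
        elif t == "unsigned": v305 = 2        # signedness
        elif t == "const":    v307 |= 0x01    # cv bit 0
        elif t == "volatile": v307 |= 0x02    # cv bit 1
    return (v325, v302, v305, v307)

a = accumulate("const unsigned long long int".split())
b = accumulate("int long unsigned long const".split())
assert a == b == (12, 2, 2, 0x01)   # same type, any specifier order
```

Because each specifier only sets a bit or bumps a counter, the final type is a function of the accumulated state alone, not of token order.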

Declaration Specifier Parser — sub_7C0F00 (184KB, 3,953 lines)

Uses goto-driven dispatch (393 LABEL_ references) — NOT a switch/case. This is a massive state machine for declaration specifier resolution. Self-recursive at line 2407 with flags=20 for nested declarator parsing.

Top-Level Declaration Parser — sub_662DE0 (61KB)

Declarator parsing — handles pointer (*), reference (&/&&), array ([]), and function (()) declarators. Uses SSE __m128i for bulk struct copying of 64-byte EDG type nodes.

Overload Resolution — sub_6523A0 (64KB)

The master overload resolution function. Given a declaration being introduced and a set of existing candidates from name lookup, it decides whether the declaration is a new overload, a redeclaration, or an error. At 2,448 decompiled lines with 39 diagnostic call sites, it is one of the heaviest diagnostic emitters in the frontend.

Candidate collection uses a 72-byte ranking context (v320 on stack) and dispatches to one of three collectors: sub_644100 for non-member/ADL candidates, sub_648CF0 for member + using-declaration candidates (chosen when C++ mode, prior declaration exists, and the class has base classes or is a template), or sub_6418E0 for C-linkage functions. The best candidate is selected by sub_641B60.

__builtin_ prefix forwarding (lines 2060-2162): after resolution, if the resolved symbol is a bodyless non-member function, the resolver checks if a compiler builtin equivalent exists. It hardcodes three function names by length: "abs" (3), "ceil" (4), "strlen" (6). For each, it constructs "__builtin_" + name in a scratch buffer at qword_4F06C50, looks it up via sub_878540, then compares parameter types via sub_8DED30(type1, type2, 0x100004) (exact match + qualification conversion). On match, the builtin's scope entry is linked into the user function's auxiliary data at offset +256 field 8.
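The length-keyed hardcoding lends itself to a two-line sketch (table and helper name are illustrative; the real lookup goes through sub_878540 and the type comparison through sub_8DED30):

```python
# Sketch of the __builtin_ forwarding check: only three names qualify,
# keyed by strlen(name) before the string compare.
FORWARDABLE = {3: "abs", 4: "ceil", 6: "strlen"}

def builtin_candidate(name: str):
    if FORWARDABLE.get(len(name)) != name:
        return None
    return "__builtin_" + name   # then looked up and type-checked in the binary

assert builtin_candidate("abs") == "__builtin_abs"
assert builtin_candidate("cos") is None   # length 3, but not "abs"
```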

OpenMP variant dispatch (lines 727-752): when unk_4D03A10 is set, the resolver renames the declaration to "<name>$$OMP_VARIANT%06d" using a monotonic counter unk_4D03A0C. This creates unique internal names for each device/host specialization.
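The renaming scheme is a straightforward printf format plus counter; a sketch using the "%06d" pattern from the recovered string:

```python
# Sketch of the $$OMP_VARIANT renaming (monotonic counter, cf. unk_4D03A0C).
import itertools
_counter = itertools.count()

def omp_variant_name(name: str) -> str:
    return "%s$$OMP_VARIANT%06d" % (name, next(_counter))

assert omp_variant_name("foo") == "foo$$OMP_VARIANT000000"
assert omp_variant_name("foo") == "foo$$OMP_VARIANT000001"
```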

Constexpr/consteval propagation (lines 2288-2301): gated by unk_4F07778 (C++ standard year). For C++11 and later, byte +204 of the scope entry is bit-packed with three globals: bits 5-6 = unk_4F06C58 (constexpr disposition), bits 1-2 = unk_4F06C5A (consteval disposition), bits 3-4 = unk_4F06C59 (immediate-function flag). Diagnostic 2383 fires on constexpr mismatch between declaration and definition.

Device/host overload sets: CUDA allows the same function name to have both __host__ and __device__ overloads. EDG does not treat execution space as part of the function signature for overload resolution purposes -- the standard C++ overload rules apply first, and execution space filtering happens later during code generation. The $$OMP_VARIANT renaming mechanism is used for OpenMP dispatch variants that need distinct host/device specializations, but regular CUDA __host__/__device__ overloads rely on the backend's execution space filtering rather than frontend overload resolution. This means that if two functions have identical C++ signatures but differ only in __host__ vs __device__, they are treated as redeclarations (not overloads) at the EDG level, and the execution space annotation at scope entry offset +198 determines which version survives into device or host code.

CUDA Memory Space Processing — sub_6582F0 (22KB)

Validates __shared__, __constant__, __managed__ attributes on declarations. Emits diagnostic for automatic variables in inappropriate memory spaces.

Type System

Type Node Layout (192 bytes = 12 x __m128i)

| Offset | Size | Field |
|--------|------|-------|
| +8 | 8 | Next pointer (linked lists) |
| +40 | 8 | Name pointer |
| +48 | 1 | Declaration kind byte |
| +80 | 1 | Entity kind byte |
| +140 | 1 | TYPE KIND DISCRIMINATOR (the central dispatch key) |
| +160 | 8 | Inner/child type pointer (typedef chains, pointer bases) |
| +168 | 8 | Member list / parameter chain |
| +173 | 1 | Specifier/node kind byte |
| +176 | 2 | Entity kind (uint16, dispatch key for constexpr evaluator) |
| +185 | 1 | CV-qualifier bits (bit 0=const, 1=volatile, 2=restrict) |
| +200 | 1 | Attribute flags |

Type kind discriminator values at offset +140:

| Value | Type | Notes |
|-------|------|-------|
| 0 | void | |
| 1 | error type | Sentinel |
| 2–4 | fundamental (char, int, ...) | |
| 5 | pointer | Follows +160 chain |
| 6 | pointer-to-member | |
| 7 | function type | Complex: 17 sub-kinds for calling conventions |
| 8 | array | Element count at +128, element type at +160 |
| 9–11 | class / struct / union | Members at +168 |
| 12 | typedef / cv-qualified | Follow +160 for underlying type (critical: skip in type-walk loops) |
| 13 | enum | |
| 14 | void (incomplete) | |
| 15 | vector | Element count at +128 |
| 19 | decltype | |
| 21 | placeholder / auto | |

Scope Table Entry (776 bytes)

Indexed by dword_4F04C64 into base qword_4F04C68:

| Offset | Field |
|--------|-------|
| +0 | Scope identifier |
| +4 | Scope kind (5=namespace, 6=class, 7=function, 8=block, 9=enum, 12=template) |
| +6–10 | Flag bytes |
| +24 | Name list head |
| +32 | Name list tail |
| +208 | Class type pointer |
| +232 | Deferred list |
| +328 | Template info |
| +552 | Parent scope index |
| +624 | Declaration pointer |
| +680 | Linkage specification |

Type Comparison — sub_7386E0 (23KB)

The core type equivalence engine. Takes two type node pointers packed in an __int128 and a flags word, returns boolean equality. The flags word controls comparison mode: bits 0-1 select cv-qualifier strictness (0=strict, 1=relaxed, 2=overload), bit 2 enables template matching (class-equivalence shortcuts), and bit 5 enables anonymous-class structural comparison.

Entry sequence: both types are first canonicalized through sub_72EC50, which peels through chains of non-template typedef aliases. The canonicalizer checks three fields on the elaborated type node: +173 == 12 (typedef kind), +176 == 1 (single-member), and +170 bit 4 == 0 (no template specialization). If all hold, it unwraps one level via sub_72E9A0 and loops. This means typedef int MyInt; typedef MyInt YourInt; canonicalizes YourInt directly to int.

After canonicalization, a quick-reject compares three header bytes without recursing: byte +24 (type kind) must match exactly, bytes +25 XOR must be zero for bits 0x03 (const/volatile) and 0x40 (restrict), and byte +26 XOR must be zero for bit 0x04. Any mismatch short-circuits to return 0.

The main switch dispatches on 38 type kinds. Key cases for CUDA:

  • Case 1 (fundamental): compares sub-kind at +56, extra flags at +58 (bits 0x3A), and the base type chain at +72. For integer sub-kind (sub_kind == 'i'), follows a resolution chain to find the underlying class scope. In template matching mode (flags bit 2), uses sub_8C7520 to check whether two class instantiations share the same primary template, then sub_89AB40 to compare template argument lists. This path handles CUDA's exotic numeric types (__nv_fp8_e4m3, __nv_fp8_e5m2, etc.) which are represented as fundamental types with distinct sub-kinds.
  • Case 3 (class/struct/union): fast identity via scope pointer equality, then unique-ID shortcut via dword_4F07588. For anonymous classes with template matching, calls sub_740200 to extract canonical member lists and performs structural comparison. This is relevant for CUDA lambda closure types, which are anonymous classes.
  • Case 33 (using-declaration/alias): in overload mode (flags bit 1), performs a hash table lookup via *qword_4D03BF8 to retrieve base class triples and compare element-by-element. This ensures that two using declarations resolving to different base classes are treated as distinct for overload discrimination.

Overload mode specifics (flags & 2): the post-switch check additionally verifies that both types agree on the presence/absence of the +80 "extra declaration" pointer. Template parameters are forced unequal (never match for overload purposes without being identical). Scope pointer equivalence is verified via unique-ID for using-declaration discrimination.

CUDA type equivalence: the NVIDIA-specific float types (__nv_fp8_e4m3, __bf16, _Float16, etc.) each have distinct sub-kind values at type node +56 (see the type mangling table: sub-kind 0 = _Float16, 1 = __fp16, 9 = __bf16, 0xA = _Float16 alternate, 0xB = _Float32, 0xC = _Float64, 0xD = _Float128). The type comparison treats them as distinct fundamental types -- _Float16 and __fp16 are NOT equivalent despite both being 16-bit floats. The half type in CUDA maps to _Float16 (sub-kind 0 or 0xA depending on context), while __half in cuda_fp16.h is a wrapper struct (type kind 9, class/struct), so half and __half are never type-equivalent at the EDG level. User code relies on implicit conversions defined in the CUDA headers, not on type equivalence.

Type-to-String Emitter — sub_74A390 (29KB, 19 callers)

The backbone type printer. Walks type nodes recursively, emitting textual representation for diagnostics. Handles NVIDIA-specific types: __surface_type__, __texture_type__, __nv_bool.

IL Tree Infrastructure

EDG represents parsed code as an Intermediate Language (IL) tree — a rich AST that preserves full C++ semantic information including template instantiation state, scope chains, and type qualifiers. The IL is not LLVM IR; it is EDG's proprietary tree representation that predates the LLVM integration. All semantic analysis, template instantiation, and overload resolution operate on this tree.

The IL tree is traversed by four structurally identical walker functions that share the same 87 node-type dispatch table. The walkers are instantiated from a common template with different callback functions — a design pattern where the traversal logic is fixed but the action at each node is parameterized through function pointers stored in six global variables. This callback-driven walker system is central to EDG's architecture: template instantiation, type checking, code emission, and tree copying all use the same walker infrastructure with different callbacks.

| Function | Size | Self-recursive Calls | Purpose |
|----------|------|----------------------|---------|
| sub_7506E0 | 190KB | 297 | Primary walker |
| sub_760BD0 | 109KB | 427 | Parallel walker (deeper traversal) |
| sub_75C0C0 | 87KB | 316 | Third-pass walker |
| sub_766570 | 148KB | 2 | Copier/transformer (takes callback params) |

Walker Callback System

Six global function pointers form the visitor dispatch table:

| Global | Role |
|--------|------|
| qword_4F08028 | Node pointer remapper (called before recursion) |
| qword_4F08020 | Linked-list child remapper |
| qword_4F08038 | String field processor |
| qword_4F08030 | Pre-visit callback (return nonzero to skip) |
| qword_4F08040 | Post-visit callback |
| dword_4F08014 | Skip-shared-nodes flag |
| dword_4F08018 | Clear/detach mode (null out fields for ownership transfer) |
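The fixed-traversal/pluggable-action pattern is a classic visitor; a minimal sketch (dict-based nodes and callback names are illustrative, mirroring only the pre/post hooks and the skip-on-nonzero convention):

```python
# Sketch of the callback-parameterized walker: traversal is fixed,
# the action at each node is supplied as pre/post callbacks
# (cf. qword_4F08030 / qword_4F08040).
def walk(node, pre=None, post=None):
    if pre and pre(node):        # truthy return from pre-visit skips the subtree
        return
    for child in node.get("children", []):
        walk(child, pre, post)
    if post:
        post(node)

tree = {"id": 1, "children": [{"id": 2, "children": []},
                              {"id": 3, "children": [{"id": 4, "children": []}]}]}
visited = []
walk(tree, pre=lambda n: visited.append(n["id"]) and False)
assert visited == [1, 2, 3, 4]   # pre-order traversal
```

Template instantiation, type checking, and tree copying then become different callback sets over the same `walk`.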

IL Node Types (87 types, from walker case labels)

| ID | Type | ID | Type |
|----|------|----|------|
| 1 | source_file | 28 | integral_constant |
| 2 | scope (15 sub-kinds) | 29 | float_constant |
| 3 | type_qualifier | 30 | expression (generic) |
| 4 | simple_type | 41 | call_expression |
| 5 | pointer_type | 42 | cast_expression |
| 6 | function_type (17 sub-kinds) | 43 | conditional_expression |
| 7 | class_type | 44 | string_literal |
| 8 | enum_type | 48 | template_argument (4 sub-kinds) |
| 9 | array_type | 59 | concept_expression (10 sub-kinds) |
| 10 | bitfield_type | 65 | type_list (core linked list) |
| 13 | statement (30+ sub-kinds) | 75 | block/compound_statement |
| 23 | scope_entry (root) | 76 | access_specifier |

Deep Copy — sub_766570 with sub_8C2C50

sub_8C2C50 calls sub_766570 with copy callback sub_8C38E0 and list-copy callback sub_8C3810. Node size table at qword_4B6D500[node_type] provides memcpy sizes. Critical for template instantiation.

Constexpr Evaluator

The constexpr evaluator is arguably the most technically impressive subsystem in the EDG frontend. It is a complete tree-walking interpreter that can execute arbitrary C++ code at compile time, implementing the full C++20 constexpr specification including heap allocation (constexpr new), string literals, virtual function dispatch, and complex control flow. At 317KB for the expression evaluator alone, plus 77KB for the statement executor and ~200KB in supporting functions, it constitutes nearly 20% of the entire EDG frontend.

The evaluator operates on EDG's IL tree directly — it does not compile to bytecode or any intermediate form. Instead, it recursively walks expression and statement nodes, maintaining its own memory model (a 3-tier page arena), variable bindings (an open-addressing hash table), and lifetime tracking (scope epoch counters). This design trades execution speed for implementation simplicity and guaranteed semantic fidelity with the compiler's own type system.

Signature:

bool constexpr_eval_expr(
    constexpr_ctx *ctx,     // a1: evaluation context (hash table, arena, flags)
    expr_node **expr,       // a2: expression AST node
    __m128i *result,        // a3: output value slot (16 or 32 bytes)
    char *frame_base        // a4: stack frame base pointer for lifetime tracking
);
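The evaluator's shape (recursive dispatch on node kind with a hard step budget) can be conveyed by a toy interpreter. The node encoding, helper name, and tiny operator set are hypothetical; only the step-limit idea and the 0x97F error it corresponds to come from the analysis:

```python
# Minimal tree-walking evaluator sketch: dispatch on node kind,
# abort when the step budget is exhausted (the binary's loop limit is ~1M).
STEP_LIMIT = 1_000_000

def eval_expr(node, env, steps=[0]):
    steps[0] += 1
    if steps[0] > STEP_LIMIT:
        raise RuntimeError("constexpr step limit exceeded")   # cf. error 0x97F
    kind = node[0]
    if kind == "lit":                 # literal: value is stored in the node
        return node[1]
    if kind == "var":                 # variable reference: binding lookup
        return env[node[1]]
    if kind == "op":                  # operator expression: recurse on operands
        lhs = eval_expr(node[2], env)
        rhs = eval_expr(node[3], env)
        return {"+": lhs + rhs, "*": lhs * rhs}[node[1]]
    raise ValueError(f"non-constant expression kind {kind!r}")

assert eval_expr(("op", "+", ("lit", 2),
                  ("op", "*", ("var", "x"), ("lit", 3))), {"x": 4}) == 14
```

The real evaluator adds a memory model, lifetime epochs, and ~124 operator cases on top of exactly this recursive skeleton.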

Expression Evaluator — sub_786210 (317KB, 9,075 lines)

The largest function in the entire EDG frontend. Two-level dispatch: outer switch on expression kind *(a2+24), inner switch on operator code *(a2+56) with 124 cases.

Outer Switch — Expression Kinds

| Kind | Hex | Meaning | Notes |
|------|-----|---------|-------|
| 0 | 0x00 | Void/empty | Sets ctx+132 |
| 1 | 0x01 | Operator expression | → 124-case inner switch on *(a2+56) |
| 2 | 0x02 | Variable reference | Hash table lookup, kind==1 (const) or kind==3 (constexpr) |
| 3 | 0x03 | Function reference / enumerator | Subkind==5: has constexpr body → recurse |
| 4 | 0x04 | Literal (int/float constant) | Immediate return — value is in the node |
| 5–6 | 0x05–06 | String / compound literal | C++20 mode required (dword_4F077C4 == 2) |
| 7 | 0x07 | Function call | Most complex case (~1200 lines) |
| 10 | 0x0A | Parenthesized expression | Recurse on a2[7] |
| 11 | 0x0B | Member access (->) | Navigate member hierarchy via type-size table |
| 17 | 0x11 | Lambda expression | Save/restore ctx+72, execute body via sub_7987E0 |
| 18 | 0x12 | Capture variable | Hash table lookup by a2[7] |
| 20 | 0x14 | Address-of | Set flags a3+8 = 0x20 (IS_SYMBOLIC) |
| 23 | 0x17 | sizeof / alignof | Delegate to sub_620D80 |
| 24 | 0x18 | Subscript (array[index]) | Bounds check, compute elem_size * index |
| 27 | 0x1B | Implicit conversion | Navigate chain, recurse on inner |
| 31 | 0x1F | Requires expression (C++20) | Execute body via sub_79B7D0 |
| 32 | 0x20 | Type trait | sub_693DC0, xmmword_4F08280/xmmword_4F08290 |
| 33 | 0x21 | SFINAE / substitution failure | Template context check, sub_6F2300 |

Inner Switch — Operator Codes (124 cases, selected)

| Cases | Category | Operations |
|-------|----------|------------|
| 0–1 | Assignment | = / initialization (ref types: 32-byte memcpy) |
| 3–4 | Conversion | Lvalue-to-rvalue via sub_7A0070 |
| 5 | Type cast | static_cast — massive dispatch: int→int (sub_622780), float→float (sub_709EF0), int→float (sub_710280), ptr→ptr (sub_770010) |
| 14–15 | Member access | . and -> — offset via sub_8D5CF0, virtual base via sub_771030 |
| 16–17 | Pointer arithmetic | Subtraction, ptrdiff_t via sub_7764B0 |
| 20, 29 | Comparison | ==, != via sub_7759B0 |
| 26–28 | Unary | ++, --, unary minus (sub_621DB0) |
| 30–31 | Vector ops | Element-wise comparison loop, broadcast |
| 39–45 | Arithmetic | + (sub_621270), - (sub_6215F0), * (sub_621F20), / (sub_6220A0), % (sub_6220C0), << (sub_70BBE0), >> (sub_70BCF0) — all with overflow/divzero checks |
| 46–49 | Bitwise | &, \|, ^, ~ |
| 50–57 | Logical | && and \|\| with short-circuit evaluation |
| 58–59 | Detailed comparison | Integer (sub_621000), float (sub_70BE30), pointer (address+symbolic) |
| 64 | Spaceship | <=> → strong_ordering values at unk_4F06BD8–unk_4F06C30 |
| 73–84 | Compound assignment | += through ^= with lifetime validation, const-check (diag 0x1318) |
| 91–93 | Conditional | Ternary ?:, array subscript (bounds-checked, error 0xA84) |
| 94–95 | Virtual dispatch | Vtable lookup → sub_79CCD0 |
| 96–97 | Allocation | Placement new / operator new |
| 103 | Exception | throw (always fails in constexpr) |
| 105–108 | Delegated | sub_77FCB0 (builtin operators) |

Value Slot Layout (16 bytes at a3)

Offset | Size | Field
0–7 | 8 | Primary value (integer, IEEE float, or arena pointer)
8 | 1 | Flags byte (see below)
9–11 | 3 | Alignment info, compound assignment tracking
12–15 | 4 | Scope epoch ID (lifetime validation)

Extended slot (32 bytes for reference types) adds secondary address at +16 and frame base at +24.

Flags Byte (offset +8)

Bit | Mask | Name | Meaning
0 | 0x01 | IS_POINTER | Value is an indirect pointer
1 | 0x02 | IS_PAST_END | One-past-the-end pointer
2 | 0x04 | HAS_CLEANUP | Destructor chain at +16
3 | 0x08 | HAS_SUBOBJECT | Refers to a subobject
4 | 0x10 | HAS_BITFIELD | Bitfield offset in bits 8–31
5 | 0x20 | IS_SYMBOLIC | Unresolved symbolic reference
6 | 0x40 | IS_CONST | From a const declaration
7 | 0x80 | IS_ARRAY_MEMBER | Part of array storage
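A plain C++ rendering of the flags byte (the bit names come from the table above; the decoder function is purely illustrative):

```cpp
#include <cstdint>
#include <string>

// Flag bits of the value-slot byte at a3+8 (names per the recovered table).
enum slot_flags : uint8_t {
    IS_POINTER      = 0x01,
    IS_PAST_END     = 0x02,
    HAS_CLEANUP     = 0x04,
    HAS_SUBOBJECT   = 0x08,
    HAS_BITFIELD    = 0x10,
    IS_SYMBOLIC     = 0x20,
    IS_CONST        = 0x40,
    IS_ARRAY_MEMBER = 0x80,
};

// Illustrative decoder: list the names of the bits set in a flags byte.
std::string describe(uint8_t flags) {
    static const char* names[8] = {
        "IS_POINTER", "IS_PAST_END", "HAS_CLEANUP", "HAS_SUBOBJECT",
        "HAS_BITFIELD", "IS_SYMBOLIC", "IS_CONST", "IS_ARRAY_MEMBER",
    };
    std::string out;
    for (int bit = 0; bit < 8; ++bit)
        if (flags & (1u << bit)) {
            if (!out.empty()) out += "|";
            out += names[bit];
        }
    return out.empty() ? "none" : out;
}
```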

Statement Executor — sub_795660 (77KB)

Dispatch on *(a2+40) — statement kind:

Case | Kind | Notes
0 | Declaration | Arena alloc → eval initializer → insert into scoped hash table
1–4 | If / if-else / if-init / if-constexpr | Condition → bool via sub_620EE0 → branch
5 | While loop | Step counter at ctx+120, limit from qword_4D042E0 (~1M). Error 0x97F on exceeded.
6 | Jump (break/continue/goto) | Sets control flow bits: bit 1=continue, bit 2=break, bit 3=goto
7, 15, 24 | Null/empty | Return success
8 | Return | Walk call chain at ctx+72, store result, set "returned" flag
11 | Expression statement | Evaluate for side effects via sub_7987E0
12 | For loop | Init → alloc → [condition → body → increment → cleanup] loop
13 | Do-while | Delegates to sub_7A0E60
14 | Range-based for | 4 temp slots via sub_77A250, iterator advance via sub_7A0470

Memory Management — 3-Tier Page Arena

Tier | Location | Page Size | Threshold | Purpose
Primary | ctx+16/ctx+24 | 64KB | default | Expression evaluation temporaries
Secondary | ctx+144/ctx+152 | 64KB | lazy init (ctx+132 & 8) | Variable declarations
Tertiary | ctx+80 | 64KB | nullable | String/compound literals

Overflow: allocations >1024 bytes go to heap via sub_822B10(size+16), forming a singly-linked list from ctx+32. Freed by walking until scope epoch matches.
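A hypothetical sketch of this overflow path, assuming a 16-byte header of [next pointer, scope epoch] in front of each heap block (the struct and function names are invented; only the list-from-ctx+32 and epoch-based-free behavior come from the text above):

```cpp
#include <cstdint>
#include <cstdlib>

// Sketch: allocations over 1024 bytes bypass the 64KB page arena and go to
// the heap, threaded onto a singly-linked list headed at ctx+32 and tagged
// with the scope epoch active at allocation time.
struct big_alloc {
    big_alloc* next;   // list head lives at ctx+32
    uint64_t   epoch;  // scope epoch at allocation time
    // payload follows the 16-byte header (hence sub_822B10(size+16))
};

struct arena_ctx {
    big_alloc* overflow_head = nullptr;  // ctx+32
    uint64_t   epoch = 0;                // ctx+128, monotonic
};

void* arena_alloc_big(arena_ctx* ctx, size_t size) {
    auto* hdr = static_cast<big_alloc*>(std::malloc(size + sizeof(big_alloc)));
    hdr->next = ctx->overflow_head;
    hdr->epoch = ctx->epoch;
    ctx->overflow_head = hdr;
    return hdr + 1;                      // payload starts after the header
}

// On scope exit: walk the list, freeing blocks newer than the target epoch.
void arena_rewind(arena_ctx* ctx, uint64_t target_epoch) {
    while (ctx->overflow_head && ctx->overflow_head->epoch > target_epoch) {
        big_alloc* dead = ctx->overflow_head;
        ctx->overflow_head = dead->next;
        std::free(dead);
    }
    ctx->epoch = target_epoch;
}
```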

Value slot header: type pointer at offset -8 (8 bytes), lifetime bits at offset -9 (1 byte, bit 0 = "initialized").

Scope epoch: monotonic counter at ctx+128. Hash table at ctx+56/ctx+64 maps epoch → page state. Arena rewound on scope exit.

Hash Table (ctx+0/ctx+8)

Open-addressing with 16-byte entries [key, value]. Hash: key_pointer >> 3. Collision: linear probing. Doubles at 2 * count > capacity (via sub_7704A0). Secondary table at ctx+56/ctx+64/ctx+68 uses 4-byte integer keys (scope epoch IDs).

Diagnostic Codes

Code | Hex | Meaning
61 | 0x3D | Division by zero
2431 | 0x97F | Step limit exceeded
2692 | 0xA84 | Array index out of bounds
2695 | 0xA87 | Unsupported jump in constexpr
2698 | 0xA8A | Null pointer dereference
2705 | 0xA91 | Negative shift count
2707 | 0xA93 | Integer overflow/underflow
2712 | 0xA98 | Use of uninitialized variable
2721 | 0xAA1 | Not a constant expression (generic)
2727 | 0xAA7 | Invalid type conversion
2735 | 0xAAF | Pointer below array start
2751 | 0xABF | Access outside lifetime
2766 | 0xACE | Modification through null pointer
2959 | 0xB8B | Missing return in constexpr function
3007 | 0xBBF | reinterpret_cast in constexpr
3022 | 0xBCE | Call to undefined constexpr function

Silent mode: ctx+132 bit 5 (0x20) suppresses diagnostics (SFINAE contexts).

Constexpr and CUDA: Host-Side Evaluation of Device Code

A key architectural question for any CUDA compiler is whether constexpr functions annotated __device__ are evaluated at host compile time. In cicc v13.0, the answer is yes, conditionally. The constexpr evaluator operates entirely within the EDG frontend, which runs on the host. When a constexpr __device__ function is used in a context requiring a constant expression (template argument, array bound, static_assert, constexpr variable initializer), the evaluator executes it using its tree-walking interpreter regardless of the function's execution space annotation. The execution space attributes (__device__, __host__, __global__) are semantic annotations for code generation, not for the constexpr evaluator -- the evaluator sees only the IL tree and does not distinguish between host and device function bodies.

This works because EDG's constexpr evaluator uses software floating point (USE_SOFTFLOAT = 1 in the 737-define configuration block). All floating-point arithmetic in constexpr contexts goes through the softfloat library (sub_70B8D0 add, sub_70B9E0 sub, sub_70BBE0 mul, sub_70BCF0 div, sub_709EF0 convert) rather than the host CPU's FPU. This guarantees that constexpr evaluation of device code produces results consistent with IEEE 754 semantics regardless of the host platform's floating-point behavior. The softfloat library handles all precision levels including _Float16, __bf16, _Float32, _Float64, and __float128.

SM architecture gates influence constexpr relaxations. The global qword_4F077A8 (SM version) gates certain constexpr features:

  • SM >= 89 (qword_4F077A8 > 0x15F8F): relaxed constexpr rules for variables with incomplete types
  • dword_4F077C4 == 2: C++20 features including constexpr new, constexpr string literals, and constexpr member access (expression evaluator cases 5/6)
  • dword_4D04880: C++14 relaxed constexpr (loops, local variable mutation, multiple return statements)
  • C++23/26 extensions: constexpr try-catch (statement executor case 14), constexpr placement new (expression evaluator case 103), constexpr dynamic_cast (error 0xBB7)

The evaluator enforces a step limit (qword_4D042E0, default ~1M iterations) to prevent infinite loops in constexpr evaluation. This limit applies uniformly to both host and device constexpr functions. When exceeded, diagnostic 0x97F ("constexpr evaluation step limit exceeded") is emitted.

One important consequence: __global__ (kernel) functions cannot be constexpr because they have no return value in the conventional sense -- they are launched asynchronously. The parser enforces this at the declaration specifier level, not in the constexpr evaluator.
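Stripped of the `__device__` annotation (which, as described above, the evaluator ignores anyway), the observable behavior is the standard C++ one — a constexpr function used as a template argument or static_assert operand is executed by the frontend's tree-walking interpreter at compile time:

```cpp
// In CUDA source this could be declared `constexpr __device__`; the EDG
// evaluator executes it at host compile time either way.
constexpr int warp_mask(int lanes) {
    int mask = 0;
    for (int i = 0; i < lanes; ++i)   // C++14 relaxed constexpr: loops allowed
        mask |= 1 << i;
    return mask;
}

// Forces evaluation in the frontend's interpreter, not on the GPU.
static_assert(warp_mask(4) == 0xF, "evaluated at compile time");

// Use as a template argument -- another constant-expression context.
template <int N> struct smem_bytes { static constexpr int value = N * 4; };
static_assert(smem_bytes<warp_mask(3)>::value == 28, "template argument use");
```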

Supporting Functions

Function | Size | Role
sub_79CCD0 | 67KB | Object member accessor (base classes, virtual bases, union tracking)
sub_799B70 | 33KB | Aggregate initializer (arrays, structs, designated init, brace elision)
sub_79B7D0 | 29KB | Function call evaluator (argument binding, body execution, recursion limits)
sub_7987E0 | 11KB | Statement list executor entry
sub_77FCB0 | 150KB | Top-level dispatch (80 expression types + 62-entry intrinsic table)
sub_7764B0 | 18KB | Type size calculator (Robin Hood hash memoization, 64MB cap)
sub_7707D0 | -- | Clone constexpr object
sub_7790A0 | -- | Trivial aggregate copy
sub_7A0070 | -- | Lvalue-to-rvalue load
sub_77F5C0 | -- | Bounds check (ptr, type → idx, err, size)
sub_76FFC0 | -- | Run cleanup/destructor chain

Bigint Library (sub_621*)

Function | Operation
sub_621000 | compare(a, width_a, b, width_b) → {-1, 0, 1}
sub_621270 | add(dst, src, width, overflow_out)
sub_6215F0 | sub(dst, src, width, overflow_out)
sub_621F20 | mul(dst, src, width, overflow_out)
sub_6220A0 | div(dst, src, width, divzero_out)
sub_6220C0 | mod(dst, src, width, divzero_out)
sub_621DB0 | negate(dst)
sub_620EE0 | to_int(value, width, result_out)

Float Library (sub_70B*)

Function | Operation
sub_70B8D0 | add(type, lhs, rhs, dst, inexact, exception)
sub_70B9E0 | sub
sub_70BAF0 | negate
sub_70BBE0 | mul
sub_70BCF0 | div
sub_70BE30 | compare(type, lhs, rhs, nan_result) → {-1, 0, 1, NaN}
sub_709EF0 | convert(src, src_prec, dst, dst_prec, inexact)

Key Globals

Variable | Purpose
dword_4F077C4 | C++ standard version (2 = C++20, enables constexpr new/string)
dword_4D04880 | C++14 relaxed constexpr (enables loops, mutation)
qword_4D042E0 | Max constexpr evaluation steps (~1M)
xmmword_4F08280 | Canonical constexpr TRUE
xmmword_4F08290 | Canonical constexpr FALSE
qword_4F08380 | Global type-size hash table base
qword_4F08060 | Global allocator function pointer (constexpr new detection)

CUDA-Specific Extensions

NVIDIA's extensions to the EDG frontend fall into four categories: memory space qualifiers that map to GPU address spaces, kernel launch syntax that gets lowered to CUDA runtime API calls, registration stubs that tell the CUDA runtime about compiled kernels, and atomic builtin generation for the C++11 atomics model on GPU. These extensions are concentrated in the 0x650000–0x810000 range and reference SM architecture version globals extensively — many features are gated by qword_4F077A8 comparisons against architecture thresholds.

CUDA Keyword Extensions

NVIDIA extends the EDG keyword table with execution space qualifiers, memory space qualifiers, and type intrinsics. These exist in four distinct layers -- registered keywords, declaration specifier tokens, address space attribute tokens, and extended type tokens -- each integrated differently into the EDG parser infrastructure.

The critical architectural fact: __device__, __host__, and __global__ are not keywords in the EDG keyword table. They are processed through the C/C++ attribute system, where EDG maps them to internal single-character codes. The declaration specifier state machine (sub_672A20) and the address space handler together resolve these attributes into symbol-table fields that downstream passes consume.

Token ID Inventory

NVIDIA uses four non-contiguous token ID ranges:

Range | Category | Count | Registration
133–136 | Memory space declaration specifiers | 4 | Hardcoded in sub_672A20 switch
236, 339–354 | Extended numeric types (FP8/FP6/FP4/MX) | 17 | Resolved via sub_6911B0
272–280 | Address space qualifier / special type tokens | 9 | Hardcoded handlers in sub_672A20
328–330 | NVIDIA type trait intrinsics | 3 | Registered via sub_885C00 in sub_706250

Only tokens 328-330 use the standard sub_885C00(token_id, "keyword") registration path. All other CUDA tokens are wired directly into parser switch cases, bypassing the keyword table entirely.

Execution Space Qualifiers -- Attribute Path

__device__, __host__, and __global__ are recognized by the attribute parser, which stores them as single-character codes at declaration context offset +269. The complete internal attribute character map (sub_5C79F0 at 0x5C79F0):

Char | Hex | Attribute | Scope Entry Bits
'V' | 0x56 | __host__ | -- (host is the default)
'W' | 0x57 | __device__ | +198 bit 4 (0x10)
'X' | 0x58 | __global__ | +198 bit 4 (0x10) AND bit 5 (0x20)
'Y' | 0x59 | __tile_global__ | --
'Z' | 0x5A | __shared__ | -- (stored in +136 as space code 3)
'[' | 0x5B | __constant__ | -- (stored in +136 as space code 2)
'\' | 0x5C | __launch_bounds__ | Arguments at decl+336 struct
']' | 0x5D | __maxnreg__ | --
'^' | 0x5E | __local_maxnreg__ | --
'_' | 0x5F | __tile_builtin__ | --
'f' | 0x66 | __managed__ | -- (stored in +136 as space code 5)
'k' | 0x6B | __cluster_dims__ | Arguments at cluster config struct
'l' | 0x6C | __block_size__ | --
'r' | 0x72 | __nv_pure__ | --

The attribute character code at +269 is consumed by sub_6582F0 (declaration-side validation) and sub_65F400 (definition-side validation). These functions never see the CUDA qualifier as a keyword token -- they only see the resolved character code.

Execution space at scope entry offset +198 is the authoritative record of a function's execution space for all downstream passes:

  • Bit 4 (0x10): function is __device__ or __global__ -- activates device-scope variable validation
  • Bit 5 (0x20): function is __global__ (kernel entry point) -- triggers kernel metadata emission via sub_12735D0, which emits ("kernel", 1) to LLVM IR
  • Bit 2 (0x04) at offset +199: full_custom_abi flag

When a function has bit 5 set, the attribute emitter also iterates the parameter array (40-byte entries at decl+16) and emits ("grid_constant", param_index) for each parameter where byte +33 is nonzero. The preserve-register struct at decl+336 (three int32 fields: data, control, after) is consumed and cleared (set to -1) after emission.

Memory Space Declaration Specifiers (Tokens 133-136)

These piggyback on the signedness/width field v305 in the declaration specifier state machine with values 4-7, cleanly separated from the standard C width modifiers (0-3):

Token | Keyword | v305 Value | v325 Value | Formula
133 | __shared__ | 4 | 2 | Special case
134 | __device__ | 5 | 8 | token - 129
135 | __constant__ | 6 | 8 | token - 129
136 | __managed__ | 7 | 8 | token - 129

The type construction switch in sub_672A20 branches on v305 > 3 to invoke CUDA-specific type constructors (sub_72BC30 for signed, sub_72BDB0 for unsigned) instead of the standard C type constructors used for v305 values 0-3.

Address Space Qualifier Tokens (272-280)

Processed by dedicated handlers in the declaration specifier parser:

Token | Keyword | Handler | Argument
272 | __attribute__((address_space(N))) | sub_6210B0 | Parses integer N
273 | __global__ (addr space annotation) | sub_667B60(0, ...) | Space ID = 0
274 | __shared__ (addr space annotation) | sub_667B60(2, ...) | Space ID = 2
275 | __constant__ (addr space annotation) | sub_667B60(3, ...) | Space ID = 3
276 | __generic__ | sub_72B620(type, cv) | --
277 | __nv_tex_surf_handle_t | sub_72BA30(unk_4F06A51) | Texture/surface handle
278 | __nv_buffer_handle_t | sub_72BA30(unk_4F06A60) | Buffer handle
279 | __nv_grid_constant | sub_72C390() | Grid-constant marker
280 | __nv_is_extended_device_lambda | sub_72C270() | Lambda closure check

Note the dual role of __shared__, __constant__, and __global__: each appears both as a memory space declaration specifier (tokens 133-135) and as an address space qualifier (tokens 273-275). The declaration specifier path stores the result in the symbol-table entry's memory_space_code at offset +136 and memory_space_flags at offset +156. The address space qualifier path stores the result in the EDG type node's qualifier word at offset +18 (values 1=global, 32=shared, 33=constant). Both representations flow downstream: the symbol-table code controls declaration validation, while the type qualifier controls LLVM pointer type construction in sub_911D10.

The __grid_constant__ qualifier (token 279, handler sub_72C390) marks kernel parameters as grid-constant -- the parameter is read-only across all thread blocks and may be placed in constant memory by the backend. This is an SM 70+ feature.

NVIDIA Type Trait Keywords (Tokens 328-330)

The only CUDA tokens registered through the standard sub_885C00 keyword registration path. Always registered -- not gated by any version, language mode, or feature flag:

Token | Keyword | Registration
328 | __nv_is_extended_device_lambda_closure_type | sub_885C00(328, ...)
329 | __nv_is_extended_host_device_lambda_closure_type | sub_885C00(329, ...)
330 | __nv_is_extended_device_lambda_with_preserved_return_type | sub_885C00(330, ...)

These type traits are used by CUDA's extended lambda machinery to query whether a lambda closure type carries device or host-device execution space annotations. They participate in SFINAE and if constexpr contexts for compile-time dispatch between host and device lambda implementations.

The lambda mangling extensions in sub_80FE00 use the execution space information from these traits to choose between three proprietary Itanium ABI mangling prefixes: Unvdl (device lambda), Unvdtl (device template lambda), and Unvhdl (host-device lambda). The selection is based on flag byte +92 of the closure descriptor, where bit 5 (0x20) marks an extended CUDA lambda, bit 4 (0x10) marks host-device, and bit 2 (0x04) marks a template lambda.

Extended Numeric Type Tokens (236, 339-354)

Blackwell tensor core operations require exotic floating-point formats. These are resolved via sub_6911B0() to a type node, then set v325=20, v327|=4 in the declaration specifier state machine:

Token | Type | Format | Width
236 | __nv_fp8_e4m3 | FP8 | 8b
339 | __nv_fp8_e5m2 | FP8 | 8b
340–341 | __nv_fp8x2_e{4m3,5m2} | FP8 vector | 16b
342–343 | __nv_fp8x4_e{4m3,5m2} | FP8 vector | 32b
344–345 | __nv_fp6_e{2m3,3m2} | FP6 | 6b
346–347 | __nv_fp6x2_e{2m3,3m2} | FP6 vector | 12b
348–349 | __nv_mxfp8_e{4m3,5m2} | MX-format FP8 | 8b
350–351 | __nv_mxfp6_e{2m3,3m2} | MX-format FP6 | 6b
352 | __nv_mxfp4_e2m1 | MX-format FP4 | 4b
353 | __nv_satfinite | Saturation modifier | --
354 | __nv_e8m0 | Exponent-only E8M0 | 8b

These types are represented as fundamental types with distinct sub-kind values at type node +56 in the EDG type system. The type comparison engine (sub_7386E0, case 1) compares sub-kind, extra flags at +58 (bits 0x3A), and the base type chain at +72 to ensure each format is treated as a distinct type.

Attribute Processing Pipeline

The complete pipeline from CUDA source keyword to LLVM IR metadata:

CUDA source: __global__ void kernel() __launch_bounds__(256, 2)
  |
  v
Phase 1: Attribute parser → char code 'X' (0x58) at decl context +269
  |
  v
Phase 2: Declaration specifier state machine (sub_672A20)
         → scope entry +198 bit 5 set (kernel)
  |
  v
Phase 3: Post-parse fixup (sub_5D0FF0)
         → __launch_bounds__(256, 2) extracted to launch config struct
  |
  v
Phase 4: CUDA attribute validator (sub_826060)
         → validates __launch_bounds__ on __global__ function
         → diagnostic 0xDCE (3534) if __launch_bounds__ on non-kernel
         → diagnostic 0xE83 (3715) if values out of range
         → diagnostic 0xE87 (3719) if __launch_bounds__ + __maxnreg__ conflict
  |
  v
Phase 5: Attribute emission to LLVM IR (sub_12735D0)
         → emits ("kernel", 1) from bit 5 of decl+198
         → emits ("grid_constant", N) per qualifying parameter
  |
  v
Phase 6: Kernel metadata generation (sub_93AE30)
         → "nvvm.maxntid" = "256,1,1"
         → "nvvm.minctasm" = "2"

Memory Space Attributes

sub_6582F0 (22KB) and sub_65F400 (28KB) validate __shared__, __constant__, __managed__ on variable declarations and definitions respectively. Token cases 133-136 in the parser handle these as first-class declaration specifiers. The validation logic enforces CUDA semantics: __shared__ variables cannot have initializers (shared memory is not initialized on kernel launch), __constant__ variables must have static storage duration, and __managed__ variables require unified memory support on the target architecture.

Symbol Table Memory Space Encoding

Memory space is tracked in two locations within each symbol-table entry:

Offset | Size | Field | Values
+136 | 1 byte | memory_space_code | 0=default, 1=__device__, 2=__constant__, 3=__shared__, 5=__managed__
+156 | 1 byte | memory_space_flags | bit 0=device, bit 1=shared, bit 2=constant, bit 4=thread_local interaction
+157 | 1 byte | Extended flags | bit 0=managed

The dual encoding exists because the flags are additive from parsed attributes (multiple attributes can be OR'd in) while the code is the single resolved value used by downstream passes. The code at +136 is set by sub_735FB0 (symbol entry constructor) and queried throughout the compiler.

Declaration-Side Validation -- sub_6582F0

The validation follows a ten-phase pipeline:

  1. __managed__ pre-resolution: when dword_4F04C5C == dword_4F04C34 (host-only mode), managed variables are silently downgraded to __device__ (space code 1) and the extern flag is cleared.

  2. Extern handling: sets bit 0 of decl context +122 and the is_extern tracking variables.

  3. Type normalization: checks function-type declarations against CUDA criteria via sub_8D4C10; emits diagnostic 891 for function types with memory space.

  4. Specifier processing: calls sub_6413B0 against the current compilation target.

  5. Prior-declaration conflict detection: looks up existing symbol, compares memory space codes. Mismatch with dword_4F077C4 == 2 (separate compilation) triggers diagnostic 172 (warning 4).

  6. New symbol creation: sub_735FB0(type_ptr, space_code, target_id, is_new_decl).

  7. __managed__ namespace binding: validates namespace name via sub_703C10; checks class/struct type compatibility (diagnostic 1560 on failure).

  8. Storage class adjustments: processes constant-space read-only flags.

  9. Device-scope enforcement: when scope +198 bit 4 is set (inside __device__/__global__ function), local variables cannot carry device memory qualifiers. Diagnostic 3484: "an automatic variable may not be declared as __device__". The memory space name is determined by the priority cascade: __constant__ > __managed__ > __shared__ > __device__.

  10. Final fixup: type validation (sub_8D9350), attribute propagation (sub_8756F0), "main" function warnings (diagnostic 2948), thread-safety analysis (sub_826000).

Memory Space Mutual Exclusivity

The code consistently enforces these combinations:

Combination | Diagnostic | Severity
__shared__ + __constant__ | 3481 | error
__constant__ + __managed__ | 3568 | error
__constant__ + __shared__ + __managed__ | 3568 | error
thread_local + __device__ | 892 | error
thread_local + any device space | 3578 | error
auto variable + __device__/__constant__/__managed__ | 3484 | error
__shared__ + initializer | 3510 | error
__constant__ in device-function scope | 3512 | error
register + device memory | 3485/3688 | error
volatile + __constant__ | 1378 | error
redeclaration with different memory space | 3499 | warning 5
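The first few rules can be re-expressed as a bitmask check (an illustrative sketch, not the decompiled logic; the diagnostic numbers come from the table above, while the flag names and function are invented):

```cpp
#include <optional>

// Sketch of memory-space mutual-exclusivity validation. Diagnostic numbers
// follow the recovered table; everything else is invented for illustration.
enum space_bits : unsigned {
    SP_DEVICE = 1, SP_CONSTANT = 2, SP_SHARED = 4, SP_MANAGED = 8,
};

// Returns the EDG diagnostic number for an illegal combination, or nothing.
std::optional<int> check_spaces(unsigned spaces, bool is_thread_local) {
    if ((spaces & SP_SHARED) && (spaces & SP_CONSTANT))  return 3481;
    if ((spaces & SP_CONSTANT) && (spaces & SP_MANAGED)) return 3568;
    if (is_thread_local && (spaces & SP_DEVICE))         return 892;
    if (is_thread_local && spaces)                       return 3578;
    return std::nullopt;  // combination is legal
}
```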

The diagnostic name-string priority cascade (__constant__ > __managed__ > __shared__ > __device__) appears identically in six locations: sub_6582F0 lines 734-739, sub_65F400 lines 541-549 and 927-935, sub_5C6B80 lines 22-34, sub_667550 lines 87-98, and sub_5D9330 (the symbol printer).

Address Space Flow: EDG to LLVM to PTX

CUDA Source | EDG Token | Symbol +136 | Type Qualifier +18 | LLVM AS | PTX Directive
__device__ int x; | 134 | 1 | 1 | 1 | .global
__shared__ int x; | 133 | 3 | 32 | 3 | .shared
__constant__ int x; | 135 | 2 | 33 | 4 | .const
__managed__ int x; | 136 | 5 | 1 | 1 | .global + runtime registration
(local in kernel) | -- | 0 | -- | 0/5 | .local/.param

The EDG type node qualifier word (offset +18, masked to 0x7FFF) carries address space through the type system. During EDG-to-LLVM type translation, sub_911D10 reads this qualifier from pointer/reference types (kind 75/76) and maps to LLVM address space numbers via sub_5FFE90. __managed__ variables are compiled as __device__ (LLVM address space 1) with additional runtime registration calls generated by sub_806F60 for unified memory management.
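The qualifier-to-address-space step reduces to a small mapping (an illustrative sketch of the table above; the function name is invented, and the real lookup lives in sub_5FFE90):

```cpp
// Illustrative mapping from EDG type-node qualifier values (offset +18)
// to LLVM address space numbers, per the recovered flow table.
int edg_qual_to_llvm_as(int qualifier) {
    switch (qualifier & 0x7FFF) {   // qualifier word is masked to 15 bits
    case 1:  return 1;              // __device__   -> addrspace(1) -> .global
    case 32: return 3;              // __shared__   -> addrspace(3) -> .shared
    case 33: return 4;              // __constant__ -> addrspace(4) -> .const
    default: return 0;              // generic address space
    }
}
```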

Kernel Launch Lowering — sub_7F2B50 (16KB)

Transforms CUDA's <<<gridDim, blockDim, sharedMem, stream>>> kernel launch syntax into CUDA runtime API calls. The lowered sequence allocates a parameter buffer via cudaGetParameterBufferV2, copies kernel arguments into it, and launches with cudaLaunchDeviceV2. For the simpler launch path, it generates __cudaPushCallConfiguration followed by individual __cudaSetupArg/__cudaSetupArgSimple calls. This lowering happens entirely within EDG — by the time the code reaches the LLVM backend, kernel launches are ordinary function calls.
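A rough picture of the simpler lowering path, with the runtime entry points stubbed out so the sequence is visible (the real signatures are internal to the CUDA runtime; these stubs only record the call order named above and the argument shapes are simplified placeholders):

```cpp
#include <cstddef>
#include <string>

// Stubs standing in for CUDA runtime entry points; they record the call
// sequence EDG emits for the simple launch path.
static std::string g_log;

static int __cudaPushCallConfiguration(int grid, int block, std::size_t shmem) {
    g_log += "pushConfig(" + std::to_string(grid) + "," +
             std::to_string(block) + "," + std::to_string(shmem) + ");";
    return 0;  // 0 = configuration accepted
}
static void __cudaSetupArgSimple(const void*, std::size_t size, std::size_t off) {
    g_log += "setupArg(" + std::to_string(size) + "@" + std::to_string(off) + ");";
}
static void __cudaLaunch(const char* kernel) {
    g_log += std::string("launch(") + kernel + ");";
}

// What EDG conceptually emits for:  kernel<<<128, 256>>>(devPtr, n);
void lowered_launch(void* devPtr, int n) {
    if (__cudaPushCallConfiguration(128, 256, 0) == 0) {
        __cudaSetupArgSimple(&devPtr, sizeof devPtr, 0);
        __cudaSetupArgSimple(&n, sizeof n, 8);
        __cudaLaunch("kernel");
    }
}
```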

Registration Stub Generator — sub_806F60

Generates __cudaRegisterAll function with calls to: __cudaRegisterEntry, __cudaRegisterVariable, __cudaRegisterGlobalTexture, __cudaRegisterGlobalSurface, __cudaRegisterManagedVariable, __cudaRegisterBinary, ____cudaRegisterLinkedBinary.

Host-side stubs generated by sub_808590: "__device_stub_%s", "__cudaLaunch", "__cudaSetupArg", "__cudaSetupArgSimple".

Atomic Builtin Generator — sub_6BBC40 (34KB)

Constructs __nv_atomic_fetch_{add,sub,and,xor,or,max,min} names with type suffixes (_s, _f, _u) and width (_%u).

SM Architecture Gates

Two functions configure ~160 optimization/feature flags based on SM version:

Function | Role | Thresholds
sub_60D650 | Optimization level → 109 unk_4D04* flags | Single integer parameter (O-level)
sub_60E7C0 | SM arch → 60 unk_4D04* feature flags | SM 75 (30399), SM 80 (40000), SM 90 (89999), SM 100 (109999), SM 120 (119999)

Each flag is gated by a byte_4CF8* user-override check, preventing auto-configuration when the user explicitly sets a flag via CLI.

TileIR Backend

sub_5E3AD0 optionally loads libTileIRCompiler_shared.so via dlopen and looks up symbol "cudacc_back_end". A 17-entry function pointer table is passed. Gated by dword_4D045A0.

Diagnostic System

EDG's diagnostic system supports three output formats: human-readable terminal output (with ANSI color and word-wrapping), SARIF JSON for IDE integration, and a machine-readable log format for automated tooling. All three share the same diagnostic numbering scheme and severity classification. The terminal output handler alone is 37KB — it implements its own word-wrapping algorithm with configurable terminal width, recursive child diagnostic emission (for "note: see declaration of X" chains), and color coding by severity level.

Terminal Output — sub_681D20 (37KB)

Formats error/warning/remark messages with:

  • Severity labels: remark (2), warning (4), caution (5), severe-warning (6), error (7–8), catastrophe (9–10), internal-error (11)
  • Source location: file:line:col
  • ANSI color escapes (gated by dword_4F073CC)
  • Word-wrapping at dword_4D039D0 (terminal width)
  • Recursive child diagnostic emission

SARIF JSON Output — sub_6837D0 (20KB)

Structured diagnostics for IDE integration, enabled by --diagnostics_format=sarif (CLI case 0x125, sets unk_4D04198 = 1). The output is a comma-separated stream of SARIF result objects -- NOT a complete SARIF envelope with $schema, runs[], etc. The caller or a post-processor is expected to wrap the stream in the standard SARIF container.

Each diagnostic emits one JSON object:

{
  "ruleId": "EC<number>",
  "level": "error",
  "message": {"text": "<json-escaped message>"},
  "locations": [
    {
      "physicalLocation": {
        "artifactLocation": {"uri": "file://<path>"},
        "region": {
          "startLine": 42,
          "startColumn": 17
        }
      }
    }
  ],
  "relatedLocations": [
    {
      "message": {"text": "see declaration of X"},
      "physicalLocation": { ... }
    }
  ]
}

Rule ID format: "EC" + decimal error number from the diagnostic record at offset +176. For example, EDG error 1234 becomes "EC1234".

Severity mapping (byte at diagnostic node +180):

Severity | EDG Meaning | SARIF level
4 | remark | "remark"
5 | warning | "warning"
7, 8 | error | "error"
9 | catastrophe | "catastrophe"
11 | internal error | "internal_error"

Note that the SARIF spec defines only "warning", "error", and "note" as standard levels. The "remark", "catastrophe", and "internal_error" values are EDG extensions -- consuming tools should treat unknown levels as "error".

Message text escaping: sub_683690 renders the diagnostic text into qword_4D039E8, then copies character-by-character into the output buffer, escaping " as \" and \ as \\. No other JSON escaping (e.g., control characters, Unicode) is applied.
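The escaping rule is narrow enough to state in a few lines (a reimplementation for illustration, not the decompiled code):

```cpp
#include <string>

// Mirrors the escaping applied by the SARIF message copier: only '"' and
// '\' are escaped; control characters and non-ASCII pass through verbatim,
// which can produce technically invalid JSON for messages containing them.
std::string sarif_escape(const std::string& msg) {
    std::string out;
    for (char c : msg) {
        if (c == '"' || c == '\\') out += '\\';
        out += c;
    }
    return out;
}
```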

Location resolution: sub_67C120 calls sub_729E00 to decompose the packed source location into (file-id, line, column), then sub_722DF0 to resolve the file-id to a filesystem path. The startColumn field is omitted when column is zero.

Related locations: the linked list at diagnostic node +72 chains "note" sub-diagnostics. Each is emitted as a relatedLocations array entry with its own message and physical location.

Filtering before emission: diagnostics pass through severity threshold check (byte_4F07481[0]), duplicate detection (byte_4CFFE80[4*errnum + 2] bit flags), pragma-based suppression (sub_67D520), and error limit check (unk_4F074B0 + unk_4F074B8 >= unk_4F07478). All filtering happens before the SARIF/text format branch.

Machine-Readable Log

Writes to qword_4D04908 in format: <severity-char> "<filename>" <line> <col> <message>\n. Severity chars from "RwweeccccCli": R=remark, w=warning, e=error, c=catastrophe.
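For example, a warning at line 42, column 7 of kernel.cu would produce a line like the following (an illustrative formatter assuming single-space separators, which the description above does not pin down):

```cpp
#include <cstdio>
#include <string>

// Builds one machine-readable log line in the shape
//   <severity-char> "<filename>" <line> <col> <message>\n
// Exact whitespace between fields is an assumption of this sketch.
std::string log_line(char sev, const char* file, int line, int col,
                     const char* msg) {
    char buf[512];
    std::snprintf(buf, sizeof buf, "%c \"%s\" %d %d %s\n",
                  sev, file, line, col, msg);
    return buf;
}
```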

Name Mangling (Itanium ABI)

EDG includes a complete implementation of the Itanium C++ ABI name mangling specification. NVIDIA extends the standard mangling with three proprietary prefixes (Unvdl, Unvdtl, Unvhdl) for device lambdas, device template lambdas, and host-device lambdas respectively. These extensions are necessary because CUDA's execution model requires distinguishing between host and device versions of the same lambda — they must have different mangled names to avoid linker collisions when both host and device code are linked into the same binary.

Address range 0x810000–0x8EFFFF:

Function | Size | Role
sub_8E74B0 | 29KB | Primary mangling entry
sub_8E9FF0 | 26KB | Type mangling
sub_816460 | 24KB | Type component mangling
sub_813790 | 13KB | Expression mangling
sub_80E340 | 23KB | Builtin type mangling (incl. DF16_, DF16b, Cu6__bf16, u6__mfp8)
sub_80FE00 | 8KB | NVIDIA extension mangling (Unvdl, Unvdtl, Unvhdl)

NVIDIA lambda mangling extensions (sub_80FE00): standard Itanium ABI uses Ul<params>E<index>_ for unnamed lambda types and Ut<index>_ for unnamed non-lambda types. NVIDIA adds three proprietary prefixes chosen based on flag byte +92 of the lambda's closure descriptor:

Prefix | Meaning | Condition
Unvdl | __device__ lambda | flag_byte_92 & 0x20 set, not host-device, not template
Unvdtl | __device__ template lambda | flag_byte_92 & 0x20 set, flag_byte_92 & 4 set
Unvhdl | __host__ __device__ lambda | flag_byte_92 & 0x20 set, flag_byte_92 & 0x10 set

The Unvhdl prefix carries three single-digit flags separated by underscores after the prefix: Unvhdl<index>_<has_explicit_return>_<is_host_device>_<has_template_params>_. Each flag is '0' or '1'. This is richer than the standard Ul encoding, which captures only parameter types.
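The prefix selection reduces to a few bit tests on the closure descriptor's flag byte (a sketch following the table above; the function name is invented, and precedence between the host-device and template bits is an assumption):

```cpp
#include <string>

// Sketch of the prefix selection in sub_80FE00, driven by flag byte +92 of
// the lambda closure descriptor. Bit meanings per the recovered table:
// 0x20 = extended CUDA lambda, 0x10 = host-device, 0x04 = template lambda.
std::string lambda_prefix(unsigned char flag_byte_92) {
    if (!(flag_byte_92 & 0x20))   // not an extended CUDA lambda:
        return "Ul";              // standard Itanium unnamed-lambda prefix
    if (flag_byte_92 & 0x10)      // host-device lambda
        return "Unvhdl";
    if (flag_byte_92 & 0x04)      // device template lambda
        return "Unvdtl";
    return "Unvdl";               // plain device lambda
}
```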

NVIDIA vendor type manglings (sub_80E340): the type mangler handles CUDA-specific types as Itanium vendor types (prefix u + length + name):

Type | Mangling | Notes
__bf16 (bfloat16) | u6__bf16 or DF16b | ABI-gated: qword_4F077B4 lo32 selects vendor vs C++23 encoding
__mfp8 (FP8) | u6__mfp8 | NVIDIA micro-float 8-bit for transformer inference
__metainfo | U10__metainfo | Kernel parameter metadata type attribute
float80 | u7float80 | x87 extended precision (vendor type)

The __bf16 mangling has a three-way gate reflecting the ongoing ABI transition: qword_4F077B4 lo32 != 0 selects "u6__bf16" (vendor type); hi32 == 0 selects "DF16b" (C++23 standardized P1467); otherwise qword_4F06A78 determines which encoding. The ABI version variable unk_4D04250 controls this and other encoding decisions, with known thresholds at 0x76BF (GCC 3.3 compat) and 0xC350 (GCC 12 compat).

Standard float types follow Itanium: _Float16 = "DF16_", __fp16 = "Dh", float = "f", double = "d", __float128 = "g", with the complex variants adding a 'C' prefix.

Key Global Variables

Variable | Size | Role
dword_4F077C4 | 4 | Language mode: 0=neither, 1=C, 2=C++
unk_4F07778 | 4 | C/C++ standard year (199711, 201103, 201402, 201703, 202002, 202310)
qword_4F077B4 | 8 | Dialect extension flags (lo=CUDA extensions, hi=GNU extensions)
dword_4F077BC | 4 | NVCC mode flag
dword_4F077C0 | 4 | GCC compatibility mode
qword_4F077A8 | 8 | SM architecture version (controls feature gates throughout)
word_4F06418 | 2 | Current parser token
qword_4F04C68 | 8 | Scope table base pointer (776-byte entries)
dword_4F04C64 | 4 | Current scope index
qword_4CF7CE0 | 8 | AST printer callback vtable
qword_4D03FF0 | 8 | Current translation unit pointer
qword_4D04908 | 8 | Machine-readable diagnostic log FILE*
qword_4F08028–qword_4F08040 | 48 | IL tree walker callback table
dword_4D045A0 | 4 | TileIR mode flag

NVVM IR Generation

Between the EDG 6.6 frontend and the LLVM optimizer sits a layer that has no upstream LLVM equivalent: the NVVM IR generation subsystem. Its job is to translate the EDG intermediate language (IL) tree -- a C-level AST produced by EDG's source-to-source backend -- into LLVM IR suitable for the NVPTX target. This is cicc's equivalent of Clang's CodeGen library (lib/CodeGen/CGExpr.cpp, CGStmt.cpp, CGDecl.cpp, etc.), but it operates on EDG's proprietary IL node format rather than a Clang AST. Understanding this layer is essential because it determines every structural property of the LLVM IR that the optimizer and backend will see: address space annotations on pointers, alloca placement conventions, kernel metadata encoding, and the specific IR patterns used for CUDA-specific constructs like threadIdx.x or __shared__ memory.

The EDG frontend does not produce LLVM IR directly. Its backend mode (BACK_END_IS_C_GEN_BE = 1) emits transformed C code into .int.c, .device.c, and .stub.c files. A second compilation pass then parses these files back through EDG to produce an IL tree -- a typed, linked representation of every declaration, statement, and expression in the translation unit. The IR generation layer walks this IL tree recursively, creating LLVM BasicBlocks, Instructions, and GlobalVariables via a hand-rolled IR builder that directly manipulates LLVM's in-memory data structures. The result is a complete LLVM Module containing one function per device-side function definition, with kernel entry points annotated via nvvm.annotations metadata.

Dual-Path Architecture

One of the most distinctive features of cicc's IR generation is that two complete copies exist within the binary. This mirrors the dual-path design observed throughout cicc: Path A (LibNVVM API mode, 0x90xxxx) and Path B (standalone mode, 0x126xxxx).

Component | Path A (LibNVVM) | Path B (Standalone)
Expression codegen | 0x91xxxx--0x94xxxx | 0x127xxxx--0x12Bxxxx
EmitExpr (master dispatch) | sub_91DF90 | sub_128D0F0
EmitStmt (statement dispatch) | sub_9363D0 | (parallel at similar offset)
EmitFunction (entry block setup) | sub_946060 | (parallel)
GenerateFunctionProlog | sub_938240 | (parallel)
Builtin lowering mega-switch | sub_90AEE0 (109KB) | sub_12B3FD0 (103KB)
Bitfield load/store | sub_923780 / sub_925930 | sub_1282050 / sub_1284570
Special variable codegen | sub_920430 / sub_922290 | sub_127F7A0 / sub_1285550
Inline asm codegen | sub_932270 | sub_1292420
Global variable codegen | sub_916430 | (parallel)
Type translation | sub_91AED0 | (parallel)
Kernel metadata emitter | sub_93AE30 | (parallel)

These are not shared-library variations or template instantiations across different types. They are structurally identical copies of the same algorithms with the same string constants (e.g., "allocapt", "agg.result", "entry", "return", ".addr") and the same error messages (e.g., "unsupported expression!", "Argument mismatch in generation function prolog!"). The two copies use different calling conventions for their codegen context objects -- Path A passes codegen state through a flat struct with LLVM API vtable pointers, while Path B uses a pointer-to-pointer indirection scheme -- but the algorithmic structure is the same and the IR they emit is identical.

The remainder of this page uses Path B addresses (the 0x12xxxxx range) as the primary reference because they correspond to the standalone compilation path that nvcc invokes, and because the B-series analysis reports provide the most detailed coverage of this path. Every function described here has a direct counterpart in Path A at the corresponding 0x9xxxxx address.

Address Map

Address Range | Subsystem | Key Functions
0x126A000--0x126BFFF | Volatile detection, alignment queries | sub_126A420 (IsVolatileAddress)
0x1273000--0x1275FFF | Function attribute emission | sub_12735D0 (EmitFunctionAttrs), sub_1273F90 (AttributeReader)
0x127A000--0x127CFFF | Type translation helpers | sub_127A030 (GetLLVMType), sub_127B390 (GetSMVersion), sub_127B420 (IsAddressOfExpr), sub_127B550 (FatalDiag)
0x127D000--0x127FFFF | Constants, alloca creation, bool emission | sub_127D8B0 (EmitConstExpr), sub_127FC40 (CreateAlloca), sub_127FEC0 (EmitBoolExpr)
0x1280000--0x1285FFF | Bitfield access, member loads, inline asm | sub_1282050 (EmitBitfieldStore), sub_1284570 (EmitBitfieldLoad), sub_1285290 (EmitAsmCall)
0x1286000--0x128FFFF | L-value codegen, binary ops, expression dispatch | sub_1286D80 (EmitAddressOf), sub_128A450 (EmitCast), sub_128D0F0 (EmitExpr), sub_128F9F0 (EmitBinaryArithCmp)
0x1290000--0x129AFFF | Control flow helpers, inline asm, printf lowering | sub_1290AF0 (SetInsertPoint), sub_1292420 (EmitInlineAsm), sub_12992B0 (LowerPrintfToVprintf)
0x129B000--0x12AFFFF | Builtin helpers, atomic ops, surface/texture ops | sub_12A4D50 (CreateBasicBlock), sub_12A7DA0 (AtomicOps), sub_12ADE80 (SurfaceTexture)
0x12B0000--0x12BFFFF | Builtin mega-switch | sub_12B3FD0 (BuiltinLowering, 103KB, 770 IDs)

The IRGenState Object

Every codegen function receives a context object -- called IRGenState or CodeGenState in this wiki -- that carries all mutable state for the current function being compiled. Two distinct layouts exist depending on whether the context is accessed through the Path A flat struct or the Path B double-indirection pattern. Both layouts carry the same logical fields; the difference is structural.

Path B Layout (pointer-to-pointer pattern)

In Path B, the primary codegen context a1 is a CodeGenState** -- a pointer to a pointer. The outer pointer dereferences to a struct containing the core IR builder state, and sibling pointers at a1[1], a1[2], etc., reach related context objects:

Access | Offset | Field | Purpose
*a1 | +0 | IRBuilder state | Current function, insert point, module
a1[1] | +8 | Insertion context | [0] = debug location, [1] = current BB, [2] = insertion sentinel
a1[2] | +16 | LLVM context/module | Module handle, LLVMContext
a1[4] | +32 | Module pointer | LLVM Module*
a1[5] | +40 | Type context | Type table for GetLLVMType, getIntNTy
a1[6] | +48 | Debug location | Current DebugLoc to attach to new instructions
a1[7] | +56 | Current BasicBlock | BB for instruction insertion
a1[8] | +64 | Insertion point | Iterator into BB's instruction list
a1[9] | +72 | Address space context | For alloca type creation
a1[19] | +152 | Cached printf alloca | Reused "tmp" alloca for vprintf buffer packing

Path A Layout (flat struct, offsets from a1)

Offset | Field | Purpose
+32 | Module pointer | LLVM Module*
+40 | IR builder | Current builder state
+48, +56 | Operand pair array | Base and count for metadata pairs
+96 | Current BasicBlock | Active BB
+104 | Insertion point | Iterator
+128 | Instruction creation vtable | Virtual dispatch for instruction emission
+136 | Emitter context | Vtable at [0], dispatch at vtable[2]
+192 | Current Function | LLVM Function* being populated
+200 | Return BB | The "return" basic block
+208 | Return value alloca | "retval" alloca or sret pointer
+240 | Has-cleanups flag | Nonzero when C++ destructors are pending
+344 | Module (kernel metadata) | Used by sub_93AE30
+360/376 | In-kernel flag | Bit 0 set when compiling a __global__ function
+424 | Cleanup stack | Stack of pending destructor frames (24 bytes each)
+456 | Allocapt marker | The "allocapt" sentinel instruction

The "allocapt" marker deserves special attention. When EmitFunction (sub_946060) creates the entry block, it inserts a dummy bitcast void to void instruction named "allocapt" as a sentinel. All subsequent alloca instructions created by CreateTmpAlloca (sub_921D70 / sub_127FC40) are inserted before this sentinel, ensuring that every alloca ends up clustered at the top of the entry block. This is a hard requirement for LLVM's mem2reg pass to promote stack slots to SSA registers. The allocapt marker is removed by a later cleanup pass.

EDG IL Node Layout

Every codegen function traverses EDG IL nodes -- linked structures that represent declarations, statements, and expressions from the parsed CUDA source. The node layout is consistent across all codegen paths:

Expression node (passed as a2 to EmitExpr):

Offset | Field | Description
+0 | Type pointer | EDG type node (dereference for type info)
+18 | Qualifier word | 16-bit: bits 0--14 = qualifier ID, bit 15 = negation
+24 | Kind byte | Top-level expression category (1=operation, 2=literal, 3=member, 0x11=call, 0x14=decl-ref)
+25 | Flags byte | Bit 2 = assignment context (write-only)
+36 | Source location | Passed to debug info attachment
+56 | Sub-opcode / data | For kind=1: operator sub-opcode; for kind=2: literal data
+72 | Child/operand | Pointer to first child expression

Type node (accessed via expression's type pointer):

Offset | Field | Description
+8 | Type classification byte | 1--6 = float types, 11 = integer, 15 = pointer, 16 = vector
+128 | Byte size | Element count for arrays, byte size for scalars
+136 | Element size | Size in bits for non-typedef types
+140 | Type tag | 1=void, 8--11=aggregate (struct/union/class/array), 12=typedef alias, 16=__int128
+144 | Flags | Bit 2 = is_bitfield, bit 3 = signed
+160 | Inner type / next | Followed when tag==12 (typedef stripping)
+176 | Element count | For array types

The typedef-stripping idiom appears throughout every codegen function (15+ occurrences in EmitExpr alone):

for (t = *expr_type; *(BYTE*)(t + 140) == 12; t = *(QWORD*)(t + 160));

This walks through chains of typedef aliases (kind 12) until it reaches the canonical type.
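
The idiom is easy to model outside the decompiler. A minimal sketch, assuming a hypothetical EDGType struct -- the field names here are stand-ins; only the tag value 12 and the follow-the-inner-pointer behavior come from the binary:

```c
#include <stddef.h>
#include <assert.h>

/* Hypothetical mirror of the EDG type node, reduced to the two fields
 * the stripping idiom touches (+140 tag byte, +160 inner pointer). */
typedef struct EDGType {
    unsigned char tag;      /* stands in for *(BYTE*)(t + 140) */
    struct EDGType *inner;  /* stands in for *(QWORD*)(t + 160) */
} EDGType;

enum { TAG_TYPEDEF = 12 };

/* Walk typedef aliases (tag 12) until the canonical type is reached. */
const EDGType *strip_typedefs(const EDGType *t) {
    while (t && t->tag == TAG_TYPEDEF)
        t = t->inner;
    return t;
}
```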

Function Emission Pipeline

When cicc processes a device-side function, IR generation proceeds through a fixed sequence of stages. The entry point is EmitFunction (sub_946060), which sets up the function skeleton and then calls GenerateFunctionProlog (sub_938240) to emit parameter handling, followed by recursive statement emission.

Stage 1: Function skeleton (sub_946060). Creates the LLVM Function* object, resolves the function type through the EDG typedef chain, and optionally sets a section name. Then creates two basic blocks: "entry" (the function entry point) and "return" (the single return block -- all return paths branch here). Inserts the "allocapt" sentinel into the entry block. For non-void functions, creates a "retval" alloca to hold the return value; for sret functions (returning aggregates), uses the first argument directly.

Stage 2: Function prolog (sub_938240). Iterates the EDG parameter linked list (next pointer at offset +112, stride 40 bytes per LLVM argument slot) in lockstep with the LLVM function's argument list. For each parameter:

  • If the first parameter has ABI kind 2 (sret), names it "agg.result" and advances.
  • Unnamed parameters get the name "temp_param"; the implicit this parameter (flags bit 0 at offset +172) gets "this".
  • Creates an alloca named <param_name>.addr via CreateTmpAlloca.
  • Emits a store of the incoming SSA argument into the alloca.
  • Registers the EDG declaration -> LLVM Value mapping in a hash table (open addressing, quadratic probing) for later lookup during expression codegen.
  • Optionally emits "__val_param" temporaries for byval aggregate parameters.
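
The decl-to-Value registration in the prolog can be sketched as a plain open-addressing map. This is an illustration of the probing scheme, not the binary's actual layout -- the capacity, pointer hash, and step sequence are assumptions; only "open addressing, quadratic probing" comes from the analysis:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define CAP 64  /* power-of-two capacity (assumed) */

typedef struct { const void *key; void *val; } Slot;

static size_t hash_ptr(const void *p) {
    return (size_t)(((uintptr_t)p >> 3) * 2654435761u);  /* toy pointer hash */
}

/* Probe i, i+1, i+1+2, i+1+2+3, ... (triangular steps -> quadratic probing).
 * Caller must keep the table below full, or insertion never terminates. */
void map_put(Slot table[CAP], const void *key, void *val) {
    size_t idx = hash_ptr(key) & (CAP - 1);
    for (size_t step = 1; ; ++step) {
        if (!table[idx].key || table[idx].key == key) {
            table[idx].key = key;
            table[idx].val = val;
            return;
        }
        idx = (idx + step) & (CAP - 1);
    }
}

void *map_get(Slot table[CAP], const void *key) {
    size_t idx = hash_ptr(key) & (CAP - 1);
    for (size_t step = 1; ; ++step) {
        if (!table[idx].key) return NULL;           /* empty slot: not present */
        if (table[idx].key == key) return table[idx].val;
        idx = (idx + step) & (CAP - 1);
    }
}
```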

Stage 3: Body emission (recursive emitStmt / EmitExpr). Walks the IL tree for the function body, dispatching through the statement codegen switch and the expression codegen switch (detailed below).

Stage 4: Kernel metadata (sub_93AE30). For __global__ functions, emits nvvm.annotations metadata: kernel flag, __launch_bounds__ parameters (nvvm.maxntid, nvvm.reqntid, nvvm.minctasm, nvvm.maxnreg), cluster dimensions (nvvm.cluster_dim, nvvm.blocksareclusters), and per-parameter metadata (alignment, grid_constant, hidden-parameter flags).

Stage 5: Function attributes (sub_12735D0). Emits function-level metadata for CUDA-specific attributes: grid_constant (per-parameter), preserve_n_data / preserve_n_control / preserve_n_after (register preservation hints), and full_custom_abi (custom calling convention flag). These are later read back by sub_1273F90 and re-encoded as LLVM named metadata with MDString keys.

CUDA Semantic Mapping

The central task of this layer is mapping CUDA-specific semantics to LLVM IR constructs. The following table summarizes every CUDA concept and its IR representation:

CUDA Concept | LLVM IR Representation | Codegen Function
threadIdx.x | call i32 @llvm.nvvm.read.ptx.sreg.tid.x() | sub_1286E40 (EmitSpecialVarMemberAccess)
blockIdx.y | call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y() | same, category 2, component 1
blockDim.z | call i32 @llvm.nvvm.read.ptx.sreg.ntid.z() | same, category 1, component 2
gridDim.x | call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() | same, category 3, component 0
warpSize | call i32 @llvm.nvvm.read.ptx.sreg.warpsize() | sub_1285550 (EmitSpecialVarAccess)
__shared__ variable | @var = addrspace(3) global ... | sub_916430 (address space = 3)
__constant__ variable | @var = addrspace(4) global ... | same (address space = 4)
__device__ variable | @var = addrspace(1) global ... | same (address space = 1)
__global__ function | define void @kern() #0 + !{ptr @kern, !"kernel", i32 1} in nvvm.annotations | sub_93AE30
__launch_bounds__(N, M) | !{!"nvvm.maxntid", !"N,1,1"} + !{!"nvvm.minctasm", !"M"} | same
__cluster_dims__(x,y,z) | !{!"nvvm.cluster_dim", !"x,y,z"} + !{!"nvvm.blocksareclusters"} | same
__syncthreads() | Builtin ID dispatch -> llvm.nvvm.barrier0 | sub_12B3FD0 (cases 0xB5--0xCC)
atomicAdd(ptr, val) | Builtin dispatch -> atomicrmw add or llvm.nvvm.atomic.* | same (cases 0xBA--0xCC)
printf(fmt, ...) | Rewritten to vprintf(fmt, packed_buf) | sub_12992B0 (LowerPrintfToVprintf)
__asm__("ptx" : ...) | call void asm sideeffect "ptx", "=r,..."(...) | sub_1292420 (EmitInlineAsm)
Texture/surface ops | call @llvm.nvvm.tex.* / @llvm.nvvm.suld.* | sub_12ADE80, sub_12AA9B0
__nv_float2int_rz | call i32 @__nv_float2int_rz(float %v) | sub_128A450 (EmitCast, NVIDIA intrinsic path)

The special variable recognition pipeline (sub_127F7A0) checks five preconditions before treating a variable as a hardware register read: (1) the in-kernel flag at IRGenState+376 must be set, (2) the symbol must not be extern, (3) it must not be template-dependent, (4) its element count must be 1, and (5) its name must be non-null. The intrinsic IDs are stored in a static 5x3 table (unk_427F760): 5 categories (threadIdx, blockDim, blockIdx, gridDim, warpSize) times 3 components (x, y, z), with warpSize using only the first slot.
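
The 5x3 table lookup can be sketched as follows. The real table (unk_427F760) stores numeric intrinsic IDs; the readable register-name suffixes used here are stand-ins for illustration, but the row order (threadIdx, blockDim, blockIdx, gridDim, warpSize) and the warpSize-uses-only-slot-0 shape come from the analysis:

```c
#include <string.h>
#include <assert.h>

/* Rows: threadIdx, blockDim, blockIdx, gridDim, warpSize.
 * Cols: x, y, z components; warpSize occupies only column 0. */
static const char *const sreg_table[5][3] = {
    { "tid.x",    "tid.y",    "tid.z"    },  /* threadIdx (category 0) */
    { "ntid.x",   "ntid.y",   "ntid.z"   },  /* blockDim  (category 1) */
    { "ctaid.x",  "ctaid.y",  "ctaid.z"  },  /* blockIdx  (category 2) */
    { "nctaid.x", "nctaid.y", "nctaid.z" },  /* gridDim   (category 3) */
    { "warpsize", 0,          0          },  /* warpSize  (category 4) */
};

/* category 0-4, component 0-2; NULL for out-of-range or unused slots. */
const char *sreg_lookup(int category, int component) {
    if (category < 0 || category > 4 || component < 0 || component > 2)
        return 0;
    return sreg_table[category][component];
}
```

With this ordering, blockIdx.y is category 2, component 1 -- matching the annotations in the semantic-mapping table above.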

Common IR Emission Patterns

Alloca-at-entry

Every local variable and parameter copy uses the same pattern:

sub_127FC40(ctx, type, name, alignment, addrspace)
  -> sub_921B80(ctx, type, name, arraySize=0)
     -> insert AllocaInst BEFORE the allocapt sentinel
     -> set alignment bits
     -> return alloca pointer

The critical detail: when arraySize == 0 (the common case), the alloca is inserted at IRGenState+456+24 -- the position just before the allocapt marker. This ensures all allocas land at the top of the entry block regardless of where in the function body they are created.
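
The discipline is equivalent to inserting into an intrusive instruction list immediately before a sentinel node. A minimal model (struct and field names are illustrative, not the binary's layout):

```c
#include <stddef.h>
#include <assert.h>

/* Toy instruction node: doubly-linked, like LLVM's instruction list. */
typedef struct Inst {
    const char *name;
    struct Inst *prev, *next;
} Inst;

/* Link `in` immediately before `pos` (the allocapt sentinel).  Every
 * alloca inserted this way lands before the sentinel, so allocas
 * cluster at the top of the entry block no matter when the body
 * codegen creates them. */
void insert_before(Inst *pos, Inst *in) {
    in->prev = pos->prev;
    in->next = pos;
    if (pos->prev) pos->prev->next = in;
    pos->prev = in;
}
```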

Instruction insertion and debug location

After creating any instruction, the same 15-line pattern inserts it into the current basic block and attaches debug metadata:

bb = ctx[1][1];              // current BB
sentinel = ctx[1][2];        // insertion sentinel
sub_157E9D0(bb + 40, inst);  // update BB instruction list
// doubly-linked list pointer surgery with 3-bit tag in low bits
sub_164B780(inst, &name);    // set instruction name (e.g., "arraydecay")
debugLoc = *ctx_debug;
if (debugLoc) {
    sub_1623A60(&loc, debugLoc, 2);  // clone debug location
    *(inst + 48) = loc;              // attach at instruction offset +48
    sub_1623210(&loc, loc, inst+48); // register in debug info list
}

The low 3 bits of list pointers carry tag/flags (alignment guarantees those bits are zero for valid pointers). Offset +24 is prev, +32 is parent block, +48 is debug location on each instruction node.
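
The tagging trick relies on 8-byte alignment guaranteeing that the low 3 bits of any valid node pointer are zero. A sketch of the pack/unpack arithmetic (helper names are illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* Pack a 3-bit tag into the low bits of an 8-byte-aligned pointer. */
static inline uintptr_t pack(void *p, unsigned tag) {
    assert(((uintptr_t)p & 7u) == 0 && tag < 8u);  /* alignment frees the bits */
    return (uintptr_t)p | tag;
}

static inline void *untag(uintptr_t w)     { return (void *)(w & ~(uintptr_t)7); }
static inline unsigned tag_of(uintptr_t w) { return (unsigned)(w & 7u); }
```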

Constant vs instruction dispatch

Throughout expression codegen, a consistent threshold check determines whether to constant-fold or create an IR instruction:

if (*(BYTE*)(value + 16) > 0x10u)
    // Real IR instruction -> emit IR-level operation
    result = sub_15FDBD0(opcode, value, destTy, &out, 0);  // CastInst
else
    // Constant value -> constant-fold
    result = sub_15A46C0(opcode, value, destTy, 0);         // ConstantExpr

The byte at value+16 encodes the LLVM Value subclass kind. Values <= 0x10 are constants (ConstantInt, ConstantFP, ConstantPointerNull); values > 0x10 are Instruction subclasses. This avoids creating unnecessary instructions when both operands are compile-time constants.
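
In effect, the kind byte makes a single unsigned compare serve as an isa<Constant> test. A sketch of the check against a stand-in struct (the real llvm::Value layout is more involved; only the offset +16 and the 0x10 threshold come from the analysis):

```c
#include <stdint.h>
#include <assert.h>

/* Stand-in for the first bytes of an LLVM Value: the subclass ID
 * lives at offset +16. */
typedef struct {
    char pad[16];
    uint8_t kind;  /* subclass ID */
} FakeValue;

/* Constant subclasses (ConstantInt, ConstantFP, ConstantPointerNull, ...)
 * occupy IDs 0..0x10; anything above is an Instruction subclass. */
int is_constant_value(const FakeValue *v) {
    return v->kind <= 0x10;
}
```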

Short-circuit boolean evaluation

Logical AND (&&) and OR (||) use the same short-circuit pattern with PHI merge:

; Logical AND (a && b):
  %lhs = icmp ne i32 %a, 0
  br i1 %lhs, label %land.rhs, label %land.end
land.rhs:
  %rhs = icmp ne i32 %b, 0
  br label %land.end
land.end:
  %0 = phi i1 [ false, %entry ], [ %rhs, %land.rhs ]
  %land.ext = zext i1 %0 to i32

Logical OR inverts the branch sense: TRUE goes to the end block (result is true), FALSE falls through to evaluate the RHS. Both share the same ZExt epilogue code via a merged tail at LABEL_162, selecting the name "land.ext" or "lor.ext" through a variable.
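
The block structure exists to preserve C's short-circuit semantics: the RHS must not be evaluated when the LHS already decides the result. A small C demonstration with a side-effect counter makes the skipped evaluation observable:

```c
#include <assert.h>

static int rhs_evals;  /* counts how often the RHS is actually evaluated */

static int rhs(int v) { rhs_evals++; return v; }

/* C's && / || compile to exactly the land/lor block shape above. */
int logical_and(int a, int b) { return (a != 0) && (rhs(b) != 0); }
int logical_or (int a, int b) { return (a != 0) || (rhs(b) != 0); }
```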

Printf lowering

Device-side printf cannot use C varargs. The compiler rewrites it to CUDA's vprintf(fmt, packed_buffer) ABI:

  1. Look up or create @vprintf in the module via Module::getOrInsertFunction.
  2. Allocate a stack buffer ("tmp" alloca, cached at IRGenState+152 for reuse across multiple printf calls in the same function).
  3. For each vararg: compute its byte size, round offset to natural alignment, GEP into the buffer ("buf.indexed"), bitcast if needed ("casted"), and store.
  4. Promote float arguments to double per C variadic convention (fpext).
  5. If the total packed size exceeds the current alloca size, patch the alloca's size operand in-place by manipulating the use-def chain.
  6. Emit call i32 @vprintf(ptr %fmt, ptr %buf).

The alloca in-place resize (step 5) is unusual -- most LLVM passes would create a new alloca. NVIDIA's motivation is to maintain a single alloca that dominates all printf pack sites within a function.
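
The offset arithmetic in step 3 can be sketched in a few lines. This models only the round-to-natural-alignment packing (function names are illustrative); after float-to-double promotion, every scalar's natural alignment equals its size:

```c
#include <stddef.h>
#include <assert.h>

/* Round `off` up to a power-of-two alignment. */
static size_t align_up(size_t off, size_t align) {
    return (off + align - 1) & ~(align - 1);
}

/* Returns the offset at which an argument of `size` bytes is stored in
 * the vprintf pack buffer, advancing *off past it. */
size_t pack_arg(size_t *off, size_t size) {
    size_t at = align_up(*off, size);
    *off = at + size;
    return at;
}
```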

Type Translation System

The EDG-to-LLVM type translation (sub_91AED0 and its callees) is a worklist-driven fixed-point computation that runs before per-function codegen. It translates every EDG type node into an LLVM type, handling:

  • Primitive types: Direct mapping (EDG int -> LLVM i32, EDG float -> LLVM float).
  • Pointer types: Carry qualifier words at node+18 that encode CUDA address spaces (qualifier 1 = global/addrspace 1, qualifier 32 = shared/addrspace 3, qualifier 33 = constant/addrspace 4).
  • Struct/union/class types: Recursive member-by-member translation with reference counting to handle shared sub-types and diamond inheritance.
  • Typedef chains: Stripped by the standard for (t = type; tag == 12; t = *(t+160)) idiom.
  • Template specializations: Two-pass approach -- syntactic substitution (sub_908040) followed by semantic matching (sub_910920), gated by optimization flags.
  • Mutually recursive types: Handled by the fixed-point iteration do { changed = process_all(); } while (changed).

All hash tables in the type system use the standard DenseMap infrastructure with NVVM-layer sentinels (-8 / -16). See Hash Table and Collection Infrastructure for the common implementation.

Global Variable Codegen

Device-side globals (__device__, __constant__, __shared__, __managed__) are emitted by sub_916430 (determineAddressSpaceAndCreate) which reads EDG IL node attributes at offsets +0x88 (storage class), +0x9C, +0xAE, and +0xB0 to determine the NVPTX address space:

EDG Attribute | NVPTX Address Space | PTX Qualifier
__device__ | 1 (global) | .global
__constant__ | 4 (constant) | .const
__shared__ | 3 (shared) | .shared
Generic (default) | 0 (generic) | (none)

After creating the GlobalVariable, sub_915400 (finalizeGlobals) orchestrates module-level metadata emission: nvvmir.version (IR version metadata), nvvm.annotations (kernel and parameter annotations), llvm.used (prevents dead-global elimination), Debug Info Version module flag (value 3), and optionally llvm.ident.

Naming Conventions

The IR generation layer produces named IR values that match Clang's naming conventions almost exactly, confirming that NVVM's codegen was closely modeled on Clang's IRGen:

IR Name | Context | Source
"entry" | Function entry basic block | sub_946060
"return" | Return basic block | sub_946060
"allocapt" | Sentinel instruction for alloca grouping | sub_946060
"retval" | Return value alloca | sub_946060
"agg.result" | Sret argument | sub_938240
<name>.addr | Parameter alloca | sub_938240 / sub_9446C0
"temp_param" | Unnamed parameter | sub_938240
"this" | Implicit C++ this parameter | sub_938240
"__val_param"<name> | Byval parameter copy | sub_938240
"arraydecay" | Array-to-pointer decay GEP | sub_128D0F0 (opcode 0x15)
"lnot" / "lnot.ext" | Logical NOT + ZExt | sub_128D0F0 (opcode 0x1D)
"land.rhs" / "land.end" / "land.ext" | Logical AND blocks + result | sub_128D0F0 (opcode 0x57)
"lor.rhs" / "lor.end" / "lor.ext" | Logical OR blocks + result | sub_128D0F0 (opcode 0x58)
"cond.true" / "cond.false" / "cond.end" | Ternary operator blocks | sub_128D0F0 (opcode 0x67)
"tobool" / "conv" | Cast results | sub_128A450
"sub.ptr.lhs.cast" / "sub.ptr.rhs.cast" / "sub.ptr.sub" / "sub.ptr.div" | Pointer subtraction | sub_128D0F0 (opcode 0x34)
"if.then" / "if.else" / "if.end" | If statement blocks | sub_937020
"while.cond" / "while.body" / "while.end" | While loop blocks | sub_937180
"for.cond" / "for.body" / "for.inc" / "for.end" | For loop blocks | sub_936D30
"do.body" / "do.cond" / "do.end" | Do-while loop blocks | sub_936B50
"bf.*" | Bitfield access temporaries (30+ variants) | sub_1282050 / sub_1284570
"predef_tmp_comp" | Special register read result | sub_1286E40
"buf.indexed" / "casted" | Printf buffer GEP and cast | sub_12992B0
"asmresult" | Inline asm extractvalue result | sub_1292420

Sub-Page Navigation

The IR generation subsystem is documented in detail across four sub-pages, each covering a major functional area:

  • Expression & Constant Codegen -- The EmitExpr master dispatch (sub_128D0F0), its 40-operator inner switch, compile-time constant emission (sub_127D8B0), and the cast/conversion codegen (sub_128A450). Covers every C/C++ expression type from array decay to pointer subtraction to logical short-circuit.

  • Statement & Control Flow Codegen -- The emitStmt dispatcher (sub_9363D0), basic block creation for if/while/do-while/for/switch, cleanup scope management for C++ destructors, label and goto handling, and #pragma unroll metadata attachment.

  • Function, Call & Inline Asm Codegen -- Function skeleton creation (sub_946060), the parameter prolog (sub_938240), call instruction emission with ABI classification (sub_93CB50), inline asm template parsing and constraint construction (sub_1292420), printf-to-vprintf lowering (sub_12992B0), and the 770-entry builtin dispatch table (sub_12B3FD0).

  • Type Translation, Globals & Special Vars -- The fixed-point type translation system (sub_91AED0), address space mapping for CUDA memory qualifiers, global variable creation (sub_916430), kernel metadata emission (sub_93AE30), function attribute handling (sub_12735D0), and special variable codegen for threadIdx/blockIdx/blockDim/gridDim/warpSize.

Expression & Constant Codegen

The central expression emitter sub_128D0F0 (56 KB, 1751 decompiled lines) is the single function responsible for translating every C/C++ expression in the EDG AST into LLVM IR. It is a large recursive two-level switch: the outer switch classifies the expression node kind (operation, literal, member access, call, etc.), and the inner switch dispatches across 40+ C operators to emit the corresponding LLVM IR instruction sequences. Every named temporary in the output (%arraydecay, %land.ext, %sub.ptr.div, %cond, etc.) originates from explicit SetValueName calls within this function, closely mirroring Clang's IRGen naming conventions.

Two companion subsystems handle specialized expression domains: bitfield codegen (sub_1282050 store, sub_1284570 load) lowers C bitfield accesses to shift/mask/or sequences, and constant expression codegen (sub_127D8B0, 1273 lines) produces llvm::Constant* values for compile-time evaluable expressions. Cast codegen (sub_128A450, 669 lines) maps every C cast category to the appropriate LLVM cast opcode.
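
The shift/mask/or sequences the bitfield emitters produce are equivalent to the following C. This is a model of the emitted IR's arithmetic for an unsigned field, not the emitters' own code (helper names are illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* Mask covering `width` bits starting at bit `shift` (width <= 32). */
static uint32_t field_mask(unsigned shift, unsigned width) {
    uint32_t low = (width < 32) ? ((1u << width) - 1u) : 0xFFFFFFFFu;
    return low << shift;
}

/* EmitBitfieldLoad equivalent: lshr + and extracts the field. */
uint32_t bf_load(uint32_t word, unsigned shift, unsigned width) {
    return (word >> shift) & field_mask(0, width);
}

/* EmitBitfieldStore equivalent: read-modify-write -- clear the field
 * with an inverted mask, then OR in the shifted new value. */
uint32_t bf_store(uint32_t word, unsigned shift, unsigned width, uint32_t v) {
    uint32_t m = field_mask(shift, width);
    return (word & ~m) | ((v << shift) & m);
}
```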

Master dispatcher | sub_128D0F0 | EmitExpr (56 KB, address 0x128D0F0)
Bitfield store | sub_1282050 | EmitBitfieldStore (15 args, R-M-W sequence)
Bitfield load | sub_1284570 | EmitBitfieldLoad (12 args, extract sequence)
Constant expressions | sub_127D8B0 | EmitConstExpr (1273 lines, recursive)
Cast/conversions | sub_128A450 | EmitCast (669 lines, 11 LLVM opcodes)
Bool conversion | sub_127FEC0 | EmitBoolExpr (expr to i1)
Literal emission | sub_127F650 | EmitLiteral (numeric/string constants)

Master Expression Dispatcher

Reconstructed signature

// sub_128D0F0
llvm::Value *EmitExpr(CodeGenState **ctx, EDGExprNode *expr,
                      llvm::Type *destTy, unsigned flags, unsigned flags2);

The ctx parameter is a pointer-to-pointer hierarchy:

Access | Field
*ctx | IRBuilder state (current function, insert point)
ctx[1] | Debug info context: [0] = debug scope, [1] = current BB, [2] = insertion sentinel
ctx[2] | LLVM module/context handle

EDG expression node layout

Every expression node passed as expr has a fixed layout:

Offset | Size | Field
+0x00 | 8 | Type pointer (EDG type node)
+0x18 | 1 | Outer opcode (expression kind byte)
+0x19 | 1 | Flags byte
+0x24 | 12 | Source location info
+0x38 | 1 | Inner opcode (operator sub-kind, for kind=1)
+0x48 | 8 | Child/operand pointer

Type nodes carry a tag at offset +140: 12 = typedef alias (follow +160 to unwrap), 1 = void. The typedef-stripping idiom appears 15+ times throughout the function:

// Type unwrapping — strips typedef aliases to canonical type
for (Type *t = expr->type; *(uint8_t*)(t + 140) == 12; t = *(Type**)(t + 160))
    ;

Outer switch — expression categories

The byte at expr+0x18 selects the top-level expression category:

Kind | Category | Handler
0x01 | Operation expression | Inner switch on expr+0x38 (40+ C operators)
0x02 | Literal constant | EmitLiteral (sub_127F650)
0x03 | Member/field access | EmitAddressOf + EmitLoadFromAddress
0x11 | Call expression | EmitCall (sub_1296570)
0x13 | Init expression | EmitInitExpr (sub_1281220)
0x14 | Declaration reference | EmitAddressOf + EmitLoadFromAddress
default | -- | Fatal: "unsupported expression!"

Inner switch — complete opcode reference

When the outer kind is 0x01 (operation), the byte at expr+0x38 selects which C operator to emit. The complete dispatch table follows. Every opcode observed in the switch is listed; values not shown here fall through to the default fatal diagnostic.

Opcode | C operator | Handler / delegate | LLVM pattern
0x00 | Constant subexpr | sub_72B0F0 (evaluate) + sub_1286D80 (load) | Constant materialization
0x03 | Compound special A | EmitCompoundAssign (sub_1287ED0) | Read-modify-write
0x05 | Dereference (*p) | Elide if child is &: IsAddressOfExpr (sub_127B420). Otherwise: recursive EmitExpr + EmitLoad (sub_128B370) | %val = load T, ptr %p
0x06 | Compound special B | EmitCompoundAssign (sub_1287ED0) | Read-modify-write
0x08 | Compound special C | EmitCompoundAssign (sub_1287ED0) | Read-modify-write
0x15 | Array decay | See Array decay | %arraydecay = getelementptr inbounds ...
0x19 | Parenthesized (x) | Tail-call optimization: a2 = child, restart loop | (no IR emitted)
0x1A | sizeof / alignof | EmitSizeofAlignof (sub_128FDE0) | Constant integer
0x1C | Bitwise NOT (~x) | sub_15FB630 (xor with -1) | %not = xor i32 %x, -1
0x1D | Logical NOT (!x) | Two-phase: EmitBoolExpr + zext | %lnot = icmp eq ..., 0 / %lnot.ext = zext i1 ... to i32
0x1E | Type-level const | ConstantFromType (sub_127D2C0) | Compile-time constant
0x1F | Type-level const | ConstantFromType (sub_127D2C0) | Compile-time constant
0x23 | Pre-increment ++x | EmitIncDec (sub_128C390): prefix=1, inc=1 | %inc = add ... / %ptrincdec = getelementptr ...
0x24 | Pre-decrement --x | EmitIncDec (sub_128C390): prefix=0, inc=0 | %dec = sub ... / %ptrincdec = getelementptr ...
0x25 | Post-increment x++ | EmitIncDec (sub_128C390): prefix=1, inc=0 | Returns old value; %inc = add ...
0x26 | Post-decrement x-- | EmitIncDec (sub_128C390): prefix=0, inc=1 | Returns old value; %dec = sub ...
0x27--0x2B | +, -, *, /, % | EmitBinaryArithCmp (sub_128F9F0) | add/sub/mul/sdiv/srem (or u/f variants)
0x32 | Comma (a, b) | Emit both sides; return RHS | (LHS discarded)
0x33 | Subscript a[i] | EmitSubscriptOp (sub_128B750): GEP + load | %arrayidx = getelementptr ... + load
0x34 | Pointer subtraction | See Pointer subtraction | %sub.ptr.div = sdiv exact ...
0x35--0x39 | ==, !=, <, >, <=, >= | EmitBinaryArithCmp (sub_128F9F0) | icmp eq/ne/slt/sgt/sle/sge (or u/f variants)
0x3A | << | EmitShiftOrBitwise (sub_128F580): triple (1, 32, 32) | shl
0x3B | >> | EmitShiftOrBitwise (sub_128F580): triple (14, 33, 33) | ashr (signed) / lshr (unsigned)
0x3C | & | EmitShiftOrBitwise (sub_128F580): triple (2, 38, 34) | and
0x3D | ^ | EmitShiftOrBitwise (sub_128F580): triple (4, 40, 36) | xor
0x3E | "|" | EmitShiftOrBitwise (sub_128F580): triple (3, 39, 35) | or
0x3F | Rotate | EmitShiftOrBitwise (sub_128F580): triple (5, 41, 37) | llvm.fshl / llvm.fshr
0x41--0x46 | Type-level consts | ConstantFromType (sub_127D2C0) | Compile-time constant
0x49 | Member access ./-> | See Member access | getelementptr + load (or bitfield path)
0x4A | += | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288F60 | Load + add + store
0x4B | -= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288370 | Load + sub + store
0x4C | *= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288770 | Load + mul + store
0x4D | /= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1289D20 | Load + div + store
0x4E | %= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288DC0 | Load + rem + store
0x4F | &= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288B70 | Load + and + store
0x50 | "|=" | EmitCompoundAssignWrapper (sub_12901D0) + sub_1289360 | Load + or + store
0x51 | <<= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288090 | Load + shl + store
0x52 | >>= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1287F30 | Load + ashr/lshr + store
0x53 | ^= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288230 | Load + xor + store
0x54 | ,= (rare) | EmitCompoundAssignWrapper (sub_12901D0) + sub_128BE50 | Comma-compound
0x55 | []= (subscript compound) | EmitCompoundAssignWrapper (sub_12901D0) + sub_128B750 | GEP + R-M-W
0x56 | Bitfield assign | See Bitfield Codegen | R-M-W sequence
0x57 | Logical AND && | See Logical AND | land.rhs/land.end + PHI
0x58 | Logical OR || | See Logical OR | lor.rhs/lor.end + PHI
0x59, 0x5A, 0x5D | Type-level consts | ConstantFromType (sub_127D2C0) | Compile-time constant
0x5B | Statement expression ({...}) | EmitStmtExpr (sub_127FF60); create empty BB if (*a1)[7] == 0 | Body emission
0x5C, 0x5E, 0x5F | Compound special | EmitCompoundAssign (sub_1287ED0) | Read-modify-write
0x67 | Ternary ?: | See Ternary operator | cond.true/cond.false/cond.end + PHI
0x68 | Type-level const | ConstantFromType (sub_127D2C0) | Compile-time constant
0x69 | Special const | EmitSpecialConst (sub_1281200) | Constant materialization
0x6F | Label address &&label | GCC extension: sub_12A4D00 (lookup) + sub_1285E30(builder, label, 1) | blockaddress(@fn, %label)
0x70 | Label value | sub_12A4D00 + sub_12812E0(builder, label, type) | Indirect goto target
0x71 | Computed goto goto *p | sub_12A4D00 + sub_1285E30(builder, label, 0) | indirectbr
0x72 | va_arg | sub_12A4D00 on va_list child + sub_1286000 | va_arg lowering
default | -- | FatalDiag (sub_127B550) | "unsupported operation expression!"

Shift and bitwise triple encoding

The EmitShiftOrBitwise (sub_128F580) triple (signedOp, intOp, fpOp) encodes three things: signedOp controls signed-vs-unsigned selection for right shift (14 selects ashr for signed, lshr for unsigned), intOp is the LLVM integer opcode number, and fpOp is the floating-point variant (unused for shift/bitwise but present for uniformity).
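
The signedOp distinction matters only for >>: signed operands need an arithmetic shift (ashr, sign bit replicated) and unsigned operands a logical shift (lshr, zero fill). C expresses the same selection through the operand type; a small demonstration (assuming the usual two's-complement arithmetic-shift behavior of mainstream compilers for signed >>):

```c
#include <stdint.h>
#include <assert.h>

int32_t  shr_signed  (int32_t v, unsigned n)  { return v >> n; }  /* -> ashr */
uint32_t shr_unsigned(uint32_t v, unsigned n) { return v >> n; }  /* -> lshr */
```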

Increment / decrement detail

EmitIncDec (sub_128C390, 16 KB) handles integer, floating-point, and pointer types. It reads the expression type to select the arithmetic operation:

  • Integer path: add/sub nsw i32 %x, 1 with name "inc" or "dec". For prefix variants, the incremented value is returned; for postfix, the original value is returned and the increment is stored.
  • Floating-point path: fadd/fsub float %x, 1.0 with the same return-value semantics.
  • Pointer path: getelementptr inbounds T, ptr %p, i64 1 (or i64 -1 for decrement) with name "ptrincdec". Element type comes from the pointed-to type.

All paths load the current value, compute the new value, store back, and return either old or new depending on prefix/postfix.

Compound assignment wrapper mechanics

EmitCompoundAssignWrapper (sub_12901D0) implements the common load-compute-store pattern for all compound assignment operators (+=, -=, etc.):

// sub_12901D0 pseudocode
Value *EmitCompoundAssignWrapper(ctx, expr, impl_fn, flags) {
    Value *addr = EmitAddressOf(ctx, expr->lhs);     // sub_1286D80
    Value *old_val = EmitLoadFromAddress(ctx, addr);  // sub_1287CD0
    Value *rhs_val = EmitExpr(ctx, expr->rhs);        // sub_128D0F0 (recursive)
    Value *new_val = impl_fn(ctx, old_val, rhs_val);  // per-operator function
    EmitStore(ctx, new_val, addr);                     // store back
    return new_val;
}

Each impl_fn is a small function (typically 200-400 lines) that handles integer/float type dispatch and signedness. For example, sub_1288F60 (AddAssign) selects between add, fadd, and pointer-GEP addition.

Member access multi-path handler

Opcode 0x49 handles struct field access (. and ->) through a multi-path dispatcher:

  1. Simple scalar field (field count == 1): Computes field address via EmitAddressOf (sub_1286D80), checks the volatile bit (v349 & 1), copies 12 DWORDs of field descriptor into the local frame, then loads via EmitLoadFromAddress (sub_1287CD0).

  2. Bitfield field: If the field descriptor indicates a bitfield, routes to EmitBitfieldAccess (sub_1282050) which emits the shift/mask extraction sequence.

  3. Nested/union access (field count > 1): Calls ComputeCompositeMemberAddr (sub_1289860) for multi-level GEP computation, then EmitComplexMemberLoad (sub_12843D0).

  4. Write-only context: If the assignment bit (a2+25, bit 2) is set, returns null -- the caller only needs the address, not the loaded value.

Statement expression, label address, and va_arg

Statement expression (0x5B): Emits the compound statement body via EmitStmtExpr (sub_127FF60). If no return basic block exists yet ((*a1)[7] == 0), creates an anonymous empty BB via CreateBasicBlock + SetInsertPoint to serve as the fall-through target. The value of the last expression in the block is the statement expression's result.

Label address (0x6F): Implements the GCC &&label extension. Looks up the label via LookupLabel (sub_12A4D00), then creates a blockaddress(@current_fn, %label) constant via sub_1285E30(builder, label, 1). The second argument 1 distinguishes "take address" from "goto to".

Computed goto (0x71): The goto *ptr extension. Same LookupLabel call, but sub_1285E30(builder, label, 0) with flag 0 emits an indirectbr instruction targeting the resolved label.

va_arg (0x72): Extracts the va_list child node at +72, its sub-child at +16, resolves both via sub_12A4D00, then calls EmitVaArg (sub_1286000) which lowers to a va_arg LLVM instruction with the appropriate type.

Constant vs. instruction dispatch

Throughout all operator emission, a consistent pattern selects between constant folding and IR instruction creation. The byte at Value+16 encodes the LLVM Value subclass kind: values <= 0x10 are constants (ConstantInt, ConstantFP, etc.) and values > 0x10 are instructions. This check appears 20+ times throughout the function, always with the same structure:

// Constant-fold or emit IR? Decision pattern (appears 20+ times)
if (*(uint8_t*)(value + 16) > 0x10) {
    // Real IR instruction -- create via the IR builder
    result = CreateCast(opcode, value, destTy, &out, 0);    // sub_15FDBD0 (cast case)
    // ...or, for binary operators:
    result = CreateBinOp(opcode, lhs, rhs, &out, 0);        // sub_15FB440 (binop case)
} else {
    // Compile-time constant -- fold at the llvm::ConstantExpr level
    result = ConstantExprCast(opcode, value, destTy, 0);    // sub_15A46C0 (cast case)
    result = ConstantFoldBinOp(lhs, rhs, 0, 0);             // sub_15A2B60 (binop case)
}

The dispatch table for the constant-fold vs IR-instruction paths:

| Operation | IR path (Value > 0x10) | Constant path (Value <= 0x10) |
|---|---|---|
| Binary op | CreateBinOp (sub_15FB440) | ConstantFoldBinOp (sub_15A2B60) |
| Unary NOT | CreateUnaryOp (sub_15FB630) | ConstantFoldUnary (sub_15A2B00) |
| Cast | CreateCast (sub_15FDBD0) | ConstantExprCast (sub_15A46C0) |
| Int compare | sub_15FEC10(op=51, pred) | sub_15A37B0(pred, lhs, rhs) |
| Float compare | sub_15FEC10(op=52, pred) | sub_15A37B0(pred, lhs, rhs) |
| Sub (constant) | CreateBinOp(13=Sub) | ConstantFoldSub (sub_15A2B60) |
| SDiv exact | CreateBinOp(18=SDiv) + SetExactFlag | ConstantFoldSDiv (sub_15A2C90) |

When the constant path is taken, no LLVM instruction is created and no BB insertion occurs -- the result is a pure llvm::Constant* that can be used directly. This is critical for expressions like sizeof(int) + 4 where no runtime code should be emitted.

Key Expression Patterns

Array decay

Opcode 0x15. Converts an array lvalue to a pointer to its first element.

When IsArrayType (sub_8D23B0) confirms the source is an array type, the emitter creates an inbounds GEP with two zero indices. The GEP instruction is constructed manually: allocate 72 bytes for 3 operands via AllocateInstruction, compute the result element type, propagate address space qualifiers from the source, then fill operands (base, i64 0, i64 0) and mark inbounds:

%arraydecay = getelementptr inbounds [N x T], ptr %arr, i64 0, i64 0

If the source is already a pointer type (not an array), the function either passes through directly or inserts a ptrtoint / zext if the types differ.

Pointer subtraction

Opcode 0x34. The classic four-step Clang pattern for (p1 - p2):

%sub.ptr.lhs.cast = ptrtoint ptr %p1 to i64
%sub.ptr.rhs.cast = ptrtoint ptr %p2 to i64
%sub.ptr.sub      = sub i64 %sub.ptr.lhs.cast, %sub.ptr.rhs.cast
%sub.ptr.div      = sdiv exact i64 %sub.ptr.sub, 4    ; element_size=4 for int*

The final step (the sdiv exact) is skipped entirely when the element size is 1 (i.e., char* arithmetic), since division by 1 is a no-op. The element size comes from the pointed-to type at offset +128. The exact flag on sdiv tells the optimizer that the division is known to produce no remainder -- a critical optimization hint.
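The instructions above can be modeled directly in a few lines of Python (a sketch of the semantics; the function name is illustrative, not from the binary):

```python
def pointer_sub(p1, p2, elem_size):
    """Model of the ptrtoint/sub/sdiv-exact lowering for (p1 - p2)."""
    diff = p1 - p2                    # sub i64 of the two ptrtoint results
    if elem_size == 1:                # char*: the sdiv exact step is skipped
        return diff
    assert diff % elem_size == 0      # what the `exact` flag promises LLVM
    return diff // elem_size          # sdiv exact i64 %sub, elem_size

print(pointer_sub(0x1010, 0x1000, 4))   # int* pointers 16 bytes apart -> 4
```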

Logical AND (short-circuit)

Opcode 0x57. Creates two basic blocks and a PHI node for C's short-circuit && evaluation:

entry:
    %lhs = icmp ne i32 %a, 0
    br i1 %lhs, label %land.rhs, label %land.end

land.rhs:
    %rhs = icmp ne i32 %b, 0
    br label %land.end

land.end:
    %0 = phi i1 [ false, %entry ], [ %rhs, %land.rhs ]
    %land.ext = zext i1 %0 to i32

The construction sequence:

  1. Create blocks land.end and land.rhs via CreateBasicBlock (sub_12A4D50).
  2. Emit LHS as boolean via EmitBoolExpr (sub_127FEC0).
  3. Conditional branch: br i1 %lhs, label %land.rhs, label %land.end.
  4. Switch insertion point to %land.rhs.
  5. Emit RHS as boolean.
  6. Unconditional branch to %land.end.
  7. Switch to %land.end, construct PHI with 2 incoming edges.
  8. Zero-extend the i1 PHI result to the expression's declared type (i32 typically) with name land.ext.

The PHI node is allocated as 64 bytes via AllocatePHI (sub_1648B60), initialized with opcode 53 (PHI), and given a capacity of 2. Incoming values are stored in a compact layout: [val0, val1, ..., bb0, bb1, ...] where each value slot occupies 24 bytes (value pointer + use-list doubly-linked-list pointers), and basic block pointers form a parallel array after all value slots.

Logical OR (short-circuit)

Opcode 0x58. Identical structure to logical AND but with inverted branch sense: the TRUE outcome of the LHS branches to lor.end (short-circuits to true), and FALSE falls through to evaluate the RHS:

entry:
    %lhs = icmp ne i32 %a, 0
    br i1 %lhs, label %lor.end, label %lor.rhs

lor.rhs:
    %rhs = icmp ne i32 %b, 0
    br label %lor.end

lor.end:
    %0 = phi i1 [ true, %entry ], [ %rhs, %lor.rhs ]
    %lor.ext = zext i1 %0 to i32

Internally, the AND and OR paths share a common tail (merging at a single code point with a variable holding either "lor.ext" or "land.ext").

Ternary / conditional operator

Opcode 0x67. Constructs a full three-block diamond with PHI merge for a ? b : c:

entry:
    %cond.bool = icmp ne i32 %test, 0
    br i1 %cond.bool, label %cond.true, label %cond.false

cond.true:
    %v1 = <emit true expr>
    br label %cond.end

cond.false:
    %v2 = <emit false expr>
    br label %cond.end

cond.end:
    %cond = phi i32 [ %v1, %cond.true ], [ %v2, %cond.false ]

The function creates three blocks (cond.true, cond.false, cond.end), records which basic block each arm finishes in (since the true/false expression emission might create additional blocks), and builds the PHI from those recorded blocks. When one arm is void, the PHI is omitted and whichever arm produced a value is returned directly.

Logical NOT and bitwise NOT

Logical NOT (opcode 0x1D) is a two-phase emit:

%lnot     = icmp eq i32 %x, 0         ; Phase 1: convert to bool
%lnot.ext = zext i1 %lnot to i32      ; Phase 2: extend back to declared type

Phase 1 calls EmitBoolExpr which produces the icmp eq ... 0 comparison. Phase 2 zero-extends the i1 back to the expression's target type. If the value is already a compile-time constant, the constant folder handles it directly.

Bitwise NOT (opcode 0x1C) produces xor with all-ones:

%not = xor i32 %x, -1

Created via CreateUnaryOp (sub_15FB630) which synthesizes xor with -1 (all bits set). Optional zext follows if the result needs widening.

Dereference with address-of elision

Opcode 0x05. Before emitting a load for unary *, the function checks if the child is an address-of expression via IsAddressOfExpr (sub_127B420). If so, the dereference and address-of cancel out -- no IR is emitted, only a debug annotation is attached. This handles the common pattern *&x becoming just x.

Bitfield Codegen

Bitfield loads and stores are lowered to shift/mask/or sequences by two dedicated functions. A path selector CanUseFastBitfieldPath (sub_127F680) determines whether the bitfield fits within a single naturally-aligned container element (fast path) or must be processed byte-by-byte (general path).

EDG bitfield descriptor

The bitfield metadata object carries:

| Offset | Type | Field |
|---|---|---|
| +120 | qword | Container type node |
| +128 | qword | Byte offset within struct |
| +136 | byte | Bit offset within containing byte |
| +137 | byte | Bit width of the field |
| +140 | byte | Type tag (12 = array wrapper, walk chain) |
| +144 | byte | Flags (bit 3 = signed bitfield) |
| +160 | qword | Next/inner type pointer |

Fast path (single-container load)

When the bitfield plus its bit range fits within one container element, the fast path loads the entire container and extracts the field with a single shift and mask:

// Example: struct { unsigned a:3; unsigned b:5; } s;
// s.b: byte_offset=0, bit_offset=3, bit_width=5, container=i8

Load s.b (fast path):

%container  = load i8, ptr %s
%shifted    = lshr i8 %container, 3            ; "highclear" -- position field at bit 0
%result     = and i8 %shifted, 31              ; "zeroext" -- mask to 5 bits (0x1F)

The shift amount is computed as 8 * elem_size - bit_width - bit_offset - 8 * (byte_offset % elem_size). When this evaluates to zero, the lshr is constant-folded away.
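A minimal Python model of the fast-path extraction, with the shift and mask taken from the s.b example above (the function name is illustrative):

```python
def load_bitfield_fast(container, bit_offset, bit_width):
    """Fast-path load: lshr to bit 0 ("highclear"), then mask ("zeroext")."""
    shifted = container >> bit_offset            # lshr i8 %container, bit_offset
    return shifted & ((1 << bit_width) - 1)      # and i8 %shifted, (2^width - 1)

# Container byte 0x9D packs a=0b101 (bits 0-2) and b=0b10011 (bits 3-7):
print(load_bitfield_fast(0x9D, 3, 5))   # s.b -> 19
```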

For signed bitfields, the zero-extend is replaced with an arithmetic sign extension: the container is shifted left so the field's top bit lands in the container's sign bit, then arithmetic-shifted right:

%positioned = shl i8 %container, 0            ; 8 - bit_offset - bit_width = 0 here, so the shl folds away
%signext    = ashr i8 %positioned, 3          ; "signext" -- moves field to bit 0, propagating the sign bit

Store s.b = val (fast path read-modify-write):

%container     = load i8, ptr %s
%bf.value      = and i8 %val, 31              ; mask to 5 bits
%cleared       = and i8 %container, 7         ; "bf.prev.cleared" -- clear bits [3:7]
%positioned    = shl i8 %bf.value, 3          ; "bf.newval.positioned"
%merged        = or  i8 %cleared, %positioned ; "bf.finalcontainerval"
store i8 %merged, ptr %s

The clear mask is ~(((1 << bit_width) - 1) << bit_position). For containers wider than 64 bits, both the clear mask and the value mask are computed via APInt operations (sub_16A5260 to set bit range, sub_16A8F40 to invert).
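The read-modify-write sequence and its clear mask reduce to the following Python sketch (illustrative names; single-byte container):

```python
def store_bitfield_fast(container, val, bit_offset, bit_width):
    """Fast-path store: mask, clear, position, merge -- all within one i8."""
    width_mask = (1 << bit_width) - 1
    clear_mask = ~(width_mask << bit_offset) & 0xFF  # ~(((1<<w)-1) << pos)
    bf_value   = val & width_mask                    # "bf.value"
    cleared    = container & clear_mask              # "bf.prev.cleared"
    positioned = (bf_value << bit_offset) & 0xFF     # "bf.newval.positioned"
    return cleared | positioned                      # "bf.finalcontainerval"

print(hex(store_bitfield_fast(0x9D, 7, 3, 5)))   # s.b = 7 -> 0x3d
```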

Byte-by-byte path (spanning load)

When the bitfield spans multiple container elements, it is processed one byte at a time. Each iteration loads a byte, extracts the relevant bits, zero-extends to the accumulator width, shifts into position, and ORs into the running accumulator.

For example, a 20-bit field starting at byte 0, bit 0:

; Byte 0: bits [0:7]
%bf.base.i8ptr = bitcast ptr %s to ptr         ; pointer cast
%byte0.ptr     = getelementptr i8, ptr %bf.base.i8ptr, i64 0
%bf.curbyte.0  = load i8, ptr %byte0.ptr
%bf.byte_zext.0 = zext i8 %bf.curbyte.0 to i32
; accumulator = %bf.byte_zext.0 (shift=0 for first byte)

; Byte 1: bits [8:15]
%byte1.ptr     = getelementptr i8, ptr %bf.base.i8ptr, i64 1
%bf.curbyte.1  = load i8, ptr %byte1.ptr
%bf.byte_zext.1 = zext i8 %bf.curbyte.1 to i32
%bf.position.1  = shl i32 %bf.byte_zext.1, 8   ; "bf.position"
%bf.merge.1     = or  i32 %bf.byte_zext.0, %bf.position.1  ; "bf.merge"

; Byte 2: only 4 bits remain (20 - 16 = 4)
%byte2.ptr         = getelementptr i8, ptr %bf.base.i8ptr, i64 2
%bf.curbyte.2      = load i8, ptr %byte2.ptr
%bf.end.shl        = shl i8 %bf.curbyte.2, 4   ; push the 4 unused high bits out
%bf.end.highclear  = lshr i8 %bf.end.shl, 4    ; "bf.end.highclear" -- clear top 4 bits
%bf.byte_zext.2    = zext i8 %bf.end.highclear to i32
%bf.position.2     = shl i32 %bf.byte_zext.2, 16
%bf.merge.2        = or  i32 %bf.merge.1, %bf.position.2

The byte-by-byte store path mirrors this in reverse: for boundary bytes (first and last), it loads the existing byte, masks out the target bits with AND, positions the new bits with SHL, and merges with OR. Middle bytes that are entirely overwritten skip the read-modify-write and store directly.
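The accumulation loop for the load direction can be modeled as follows (a sketch with illustrative names; byte order follows the layout shown above):

```python
def load_bitfield_spanning(mem, byte_offset, bit_width):
    """Byte-by-byte load of a field starting at bit 0 of mem[byte_offset]."""
    acc, shift, remaining, i = 0, 0, bit_width, byte_offset
    while remaining > 0:
        byte = mem[i]                       # load i8 via GEP
        take = min(remaining, 8)
        if take < 8:                        # last byte: clear unused high bits
            byte &= (1 << take) - 1
        acc |= byte << shift                # "bf.position" + "bf.merge"
        shift += 8; remaining -= take; i += 1
    return acc

# 20-bit field spread over three bytes; top nibble of byte 2 is ignored:
print(hex(load_bitfield_spanning([0xDE, 0xBC, 0xFA], 0, 20)))   # -> 0xabcde
```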

The bf.* naming vocabulary

All bitfield IR values use a consistent naming scheme:

| Name | Path | Meaning |
|---|---|---|
| bf.base.i8ptr | Both | Pointer cast to i8* |
| bf.curbyte | Load | Current byte in iteration loop |
| bf.end.highclear | Load | lshr to clear unused high bits in last byte |
| bf.byte_zext | Load | zext of byte to accumulator width |
| bf.position | Both | shl to position byte/value within accumulator/container |
| bf.merge | Load | or to merge byte into accumulator |
| bf.highclear | Load | lshr before sign extension |
| bf.finalval | Load | ashr for sign extension |
| highclear | Load fast | Fast-path lshr to clear high bits |
| zeroext | Load fast | Fast-path zero-extend result |
| signext | Load fast | Fast-path ashr sign extension |
| bf.value | Store | and(input, width_mask) -- isolated field bits |
| bf.prev.cleared | Store fast | Container with old field bits cleared |
| bf.newval.positioned | Store fast | New value shifted to field position |
| bf.finalcontainerval | Store fast | or(cleared, positioned) -- final container |
| bf.reload.val | Store | Truncated value for compound assignment reload |
| bf.reload.sext | Store | Sign-extended reload via shift pair |
| bassign.tmp | Store | Alloca for temporary during bitfield assignment |

Wide bitfield support (> 64 bits)

Both load and store functions handle bitfields wider than 64 bits through APInt operations. The threshold check width > 0x40 (64) appears throughout: values <= 64 bits use inline uint64_t masks computed as 0xFFFFFFFFFFFFFFFF >> (64 - width), while wider values allocate heap-backed APInt word arrays. Every code path carefully frees heap APInts after use. This supports __int128 bitfields in CUDA.
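The inline-mask formula for the <= 64-bit case, as a one-line Python check (function name illustrative):

```python
def width_mask(width):
    """Inline uint64_t mask used for bitfields up to 64 bits wide."""
    assert 1 <= width <= 64      # width > 64 takes the heap-backed APInt path
    return 0xFFFFFFFFFFFFFFFF >> (64 - width)

print(width_mask(5))          # -> 31
print(hex(width_mask(64)))    # -> 0xffffffffffffffff
```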

Volatile and alignment

Volatile detection uses a global flag at unk_4D0463C. When set, sub_126A420 queries whether the GEP target address is in volatile memory, propagating the volatile bit to load/store instructions. The alignment parameter for bitfield container loads must be 1; the function asserts on other values with "error generating code for loading from bitfield!".

Duplicate implementations

Two additional copies exist at sub_923780 (store) and sub_925930 (load) -- identical algorithms with the same string names, same opcodes, same control flow. These likely correspond to different template instantiations or address-space variants in the original NVIDIA source. The 0x92xxxx copies are in the main NVVM frontend region while the 0x128xxxx copies are in the codegen helper region.

Constant Expression Codegen

EmitConstExpr (sub_127D8B0) converts EDG constant expression AST nodes into llvm::Constant* values. It is recursive: aggregate initializers call it for each element.

// sub_127D8B0
llvm::Constant *EmitConstExpr(CodeGenState *ctx, EDGConstExprNode *expr,
                               llvm::Type *arrayElemTyOverride);

The constant kind byte at expr[10].byte[13] is the primary dispatch:

| Kind | Category | Output type |
|---|---|---|
| 1 | Integer constant | ConstantInt |
| 2 | String literal | ConstantDataArray |
| 3 | Floating-point constant | ConstantFP |
| 6 | Address-of constant | GlobalVariable*, Function*, or string global |
| 0xA | Aggregate initializer | ConstantStruct, ConstantArray, or ConstantAggregateZero |
| 0xE | Null/empty | Returns 0 (no constant) |
| default | Fatal: "unsupported constant variant!" | — |

Integer constants

For normal integers (up to 64 bits), the value is extracted via edg::GetSignedIntValue or edg::GetUnsignedIntValue depending on signedness, masked to the actual bit width, and passed to ConstantInt::get(context, APInt).

For __int128 (type size == 16 bytes), the EDG IL stores the value as a decimal string. The path is: edg::GetIntConstAsString(expr) returns the decimal text, then APInt::fromString(128, str, len, radix=10) parses it into a 128-bit APInt. This string-based transfer suggests the EDG IL uses text encoding for portability of wide integers.
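The decimal-string transfer can be modeled in Python (a sketch of what APInt::fromString(128, str, len, 10) computes, not the binary's code):

```python
def apint_from_string(s, bits=128, radix=10):
    """Parse decimal text into a fixed-width value, wrapping to `bits`."""
    v = 0
    for ch in s:
        v = v * radix + int(ch, radix)
    return v & ((1 << bits) - 1)          # truncate to the declared bit width

print(apint_from_string(str(2**100)) == 2**100)   # -> True (round-trips)
```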

APInt memory management follows the standard pattern: values > 64 bits use heap-allocated word arrays (checked via width > 0x40). Every path frees heap APInts after consumption.

When the target LLVM type is a pointer (tag 15), the integer constant is first created, then ConstantExpr::getIntToPtr converts it.

String literals

The character width is determined from a lookup table qword_4F06B40 indexed by the encoding enum at expr[10].byte[8] & 7:

| Index | Width | C type |
|---|---|---|
| 0 | 1 byte | char / UTF-8 |
| 1 | platform | wchar_t |
| 2 | 1 byte | char8_t |
| 3 | from global | platform-dependent |
| 4 | from global | platform-dependent |

The raw byte buffer is built by copying byte_count bytes from the EDG node, reading each character through edg::ReadIntFromBuffer(src, width) -- an endian-aware read function (the EDG IL may store string data in a platform-independent byte order). The buffer is then passed to ConstantDataArray::getRaw(data, byte_count) to create the LLVM constant.

For each character width, the LLVM element type is selected: i8 for 1-byte, i16 for 2-byte, i32 for 4-byte, i64 for 8-byte. Empty strings create zero-element arrays. If the array type override (the arrayElemTyOverride parameter) specifies a larger size than the literal, the remaining bytes are zero-filled.

Floating-point constants

Raw bit patterns are extracted via edg::ExtractFloatBits(kind, data_ptr), then reinterpreted into native float or double values:

| EDG kind | C type | Conversion path |
|---|---|---|
| 2 | float | BitsToFloat -> APFloat(float) -> IEEEsingle semantics |
| 4 | double | BitsToDouble -> APFloat(double) -> IEEEdouble semantics |
| 6 | long double | Truncated to double (with warning 0xE51) |
| 7 | __float80 | Truncated to double (with warning 0xE51) |
| 8, 13 | __float128 | Truncated to double (with warning 0xE51) |

All extended-precision types (long double, __float80, __float128) are silently lowered through the double path. NVPTX has no hardware support for 80-bit or 128-bit floats, so CICC truncates them to 64-bit IEEE 754. When the compilation context has the appropriate flag (bit 4 at offset +198), a diagnostic warning is emitted identifying the specific type being truncated.
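The BitsToFloat/BitsToDouble reinterpretation above can be sketched with Python's struct module (a model of the semantics, not the binary's code):

```python
import struct

def bits_to_float(bits32):
    """Reinterpret a raw IEEE-754 single-precision bit pattern as a float."""
    return struct.unpack('<f', struct.pack('<I', bits32))[0]

def bits_to_double(bits64):
    """Reinterpret a raw IEEE-754 double-precision bit pattern as a double."""
    return struct.unpack('<d', struct.pack('<Q', bits64))[0]

print(bits_to_float(0x3F800000))           # -> 1.0
print(bits_to_double(0x3FF0000000000000))  # -> 1.0
```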

Address-of constants

Sub-dispatched by a byte at expr[11].byte[0]:

  • Byte 0 -- Variable/global reference: Calls GetOrCreateGlobalVariable (sub_1276020), returning a GlobalVariable* as a constant pointer. Debug info is optionally attached.
  • Byte 1 -- Function reference: Calls GetOrCreateFunction (sub_1277140). For static-linkage functions, resolves through LookupFunctionStaticVar.
  • Byte 2 -- String literal reference (&"..."): Validates the node kind is 2 (string), then calls CreateStringGlobalConstant (sub_126A1B0).

Post-processing applies a constant GEP offset if expr[12].qword[0] is nonzero, and performs pointer type cast if the produced type differs from the expected type. Same-address-space mismatches use ConstantExpr::getBitCast; cross-address-space mismatches use ConstantExpr::getAddrSpaceCast. Pointer-to-integer mismatches use ConstantExpr::getPtrToInt with address-space normalization to addrspace(0) first.

Aggregate initializers

The largest case (630+ lines). After stripping typedefs, dispatches on the canonical type tag at +140:

| Tag | Type | Output |
|---|---|---|
| 10 | Struct | ConstantStruct or ConstantAggregateZero |
| 11 | Union | Anonymous {member_type, [N x i8]} |
| 8 | Array | ConstantArray |
| 12 | Typedef | Strip and re-dispatch |
| other | Fatal: "unsupported aggregate constant!" | — |

Struct (tag 10): Walks the EDG field list and initializer list in parallel. The field chain is traversed via +112 pointers; the initializer list via +120 next pointers.

  • Padding/zero-width fields are skipped (flag byte at +146, bit 3).
  • For each non-bitfield field, GetFieldIndex (sub_1277B60) returns the LLVM struct element index. If gaps exist between the previous and current index, intermediate slots are filled with Constant::getNullValue (sub_15A06D0).
  • Each field's initializer is processed by recursive EmitConstExpr call.
  • Packed struct fields (flag at +145, bit 4) have their sub-elements extracted individually via ConstantExpr::extractvalue (sub_15A0A60).
  • Missing trailing fields are padded with null values.
  • If the struct has no fields and the initializer list is empty, returns ConstantAggregateZero::get (sub_1598F00) as a shortcut.
  • Final assembly: ConstantStruct::get (sub_159F090) with type compatibility check via Type::isLayoutIdentical (sub_1643C60). If packed, StructType::get(elts, n, true) (sub_15943F0).

Struct bitfield packing (post-processing)

When any bitfield field is detected during the main walk (flag bit 2, &4 at +144), the function re-enters a post-processing phase after the main field loop. This packs bitfield constant values byte-by-byte into the struct's byte array:

// Bitfield packing pseudocode — sub_127D8B0, case 0xA post-processing
StructLayout *layout = DataLayout::getStructLayout(structTy);  // sub_15A9930

for (each bitfield field where flag &4 at +144 && name at +8 is non-null) {
    uint32_t byte_offset = field->byte_offset;
    uint32_t elem_idx = StructLayout::getElementContainingOffset(layout, byte_offset);
                                                                // sub_15A8020
    // Validate the target byte is zero
    assert(elements[elem_idx] == ConstantInt::get(i8, 0),
           "unexpected error while initializing bitfield!");

    // Evaluate bitfield initializer
    Constant *val = EmitConstExpr(ctx, init_expr, 0);          // recursive
    assert(val != NULL, "bit-field constant must have a known value at compile time!");

    APInt bits = extractAPInt(val);  // at constant+24, width at constant+32
    uint8_t bit_width = field->bit_width;    // at +137
    if (bits.width > bit_width)
        bits = APInt::trunc(bits, bit_width);                  // sub_16A5A50

    // Pack into struct bytes, one byte at a time
    uint8_t bit_offset = field->bit_offset;  // at +136 (within first byte)
    while (remaining_bits > 0) {
        uint8_t available = (first_byte ? 8 - bit_offset : 8);
        uint8_t take = min(remaining_bits, available);

        APInt slice = bits;
        if (slice.width > take)
            slice = APInt::trunc(slice, take);                 // sub_16A5A50
        if (take < 8)
            slice = APInt::zext(slice, 8);                     // sub_16A5C50
        slice = slice << bit_offset;                           // shl
        existing_byte |= slice;                                // sub_16A89F0

        elements[byte_index] = ConstantInt::get(ctx, existing_byte);
        bits = bits >> take;                                   // sub_16A7DC0
        remaining_bits -= take;
        bit_offset = 0;       // subsequent bytes start at bit 0
        byte_index++;
    }
}

This implements the C standard's bitfield byte-packing model: bits are inserted starting at the field's bit_offset within its containing byte, potentially spanning multiple bytes. Values wider than 64 bits use heap-backed APInt word arrays.
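The packing loop reduces to the following Python model (illustrative; mirrors the pseudocode above for the <= 64-bit case):

```python
def pack_bitfield(elements, byte_index, bit_offset, value, bit_width):
    """Pack a constant bitfield value into a struct's byte array."""
    bits = value & ((1 << bit_width) - 1)       # APInt::trunc to field width
    remaining = bit_width
    while remaining > 0:
        take = min(remaining, 8 - bit_offset)   # first byte may be partial
        slice_ = (bits & ((1 << take) - 1)) << bit_offset
        elements[byte_index] |= slice_ & 0xFF   # or into the existing byte
        bits >>= take
        remaining -= take
        bit_offset = 0                          # subsequent bytes start at bit 0
        byte_index += 1
    return elements

# A 5-bit value (19 = 0b10011) at bit offset 6 spans two bytes:
print(pack_bitfield([0, 0], 0, 6, 19, 5))   # -> [192, 4]
```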

Union (tag 11): Finds the initialized member via two paths:

  1. Designated initializer (kind 13): *(init+184) is the designated field, *(init+120) is the actual value expression.
  2. Implicit: Walk the field chain (type+160) looking for the first non-skip, non-bitfield field. Named bitfield members are explicitly rejected: "initialization of bit-field in union not supported!". If no field is found: "cannot find initialized union member!".

The member value is emitted recursively. Padding to the full union byte size is added as [N x i8] zeroinitializer. The result is an anonymous {member_type, [N x i8]} struct via ConstantStruct::getAnon (sub_159F090).

Array (tag 8): Resolves element type via GetArrayElementType (sub_8D4050), walks the initializer linked list via +120 next pointers, calls EmitConstExpr recursively for each element. Designated initializers (kind 11) are supported: *(node+176) gives the designated element index, *(node+184) gives the range count. Type mismatches are handled by sub_127D000 (resize constant to target type).

When the declared dimension exceeds the initializer count, remaining elements are filled with Constant::getNullValue. The result uses ConstantArray::get (sub_159DFD0) when all elements have the same LLVM type (the common case), or falls back to an anonymous struct via StructType::get + ConstantStruct::get for heterogeneous cases (which should not occur in well-formed C but is handled defensively).

Cast / Conversion Codegen

EmitCast (sub_128A450) handles every C-level cast category. The function first checks for early exits (skip flag, identity cast where source type equals destination type), then dispatches by source and destination type tags.

// sub_128A450
llvm::Value *EmitCast(CodeGenState **ctx, EDGCastNode *expr,
                      uint8_t is_unsigned, llvm::Type *destTy,
                      uint8_t is_unsigned2, char skip_flag,
                      DiagContext *diag);

Type classification

Type tags at *(type+8):

| Tag | Type |
|---|---|
| 1-6 | Floating-point (1=half, 2=float, 3=double, 4=fp80, 5=fp128, 6=bf16) |
| 11 | Integer (bit-width encoded in upper bits) |
| 15 | Pointer |
| 16 | Vector/aggregate |

The test (tag - 1) > 5 is an unsigned range check meaning "NOT a float" (tags 1-6 are the float types).

Tobool patterns

When the destination type is i1 (bool), the codegen produces comparison-against-zero:

Integer/float source (tags 1-6, 11):

%tobool = icmp ne i32 %val, 0          ; integer source
%tobool = fcmp une float %val, 0.0     ; float source

Float-to-bool uses fcmp une (unordered not-equal), which returns true for any non-zero value including NaN. Integer-to-bool uses icmp ne with a zero constant of matching type.
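The tobool semantics, including the NaN case, in a few lines of Python (sketch only; function name illustrative):

```python
import math

def to_bool(val):
    """tobool lowering: fcmp une for floats (NaN -> true), icmp ne for ints."""
    if isinstance(val, float):
        # une is unordered-or-not-equal; Python's != on NaN is already true,
        # the isnan term just spells out the unordered half explicitly
        return val != 0.0 or math.isnan(val)
    return val != 0

print(to_bool(float('nan')))   # -> True
print(to_bool(0.0))            # -> False
```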

Pointer source (tag 15):

%tobool = icmp ne ptr %val, null

A shortcut exists: if the source expression is already a comparison result (opcode 61) and the source is already the bool type, the comparison result is returned directly without creating a new instruction.

Integer-to-integer (trunc / zext / sext)

The helper sub_15FE0A0 internally selects the operation based on relative widths:

  • dest_width < src_width -> trunc
  • dest_width > src_width AND unsigned -> zext
  • dest_width > src_width AND signed -> sext

All produce a value named "conv".
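The width-based selection can be modeled in Python, with two's-complement semantics spelled out (illustrative sketch of the selection logic, not the binary's code):

```python
def int_convert(val, src_width, dest_width, is_unsigned):
    """trunc / zext / sext selection by relative widths (model of sub_15FE0A0)."""
    if dest_width < src_width:                   # trunc: drop high bits
        return val & ((1 << dest_width) - 1)
    if is_unsigned:                              # zext: new high bits are zero
        return val & ((1 << src_width) - 1)
    val &= (1 << src_width) - 1                  # sext: replicate the sign bit
    if val >> (src_width - 1):
        val -= 1 << src_width
    return val & ((1 << dest_width) - 1)

print(hex(int_convert(0xFFFFFFFF, 32, 64, False)))  # sext -> 0xffffffffffffffff
print(hex(int_convert(0xFFFFFFFF, 32, 64, True)))   # zext -> 0xffffffff
```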

Pointer casts

Pointer-to-pointer: In LLVM opaque-pointer mode (which CICC v13 uses for modern SMs), same-address-space casts hit the identity return path and produce no IR. Cross-address-space casts use addrspacecast (opcode 47).

Pointer-to-integer: ptrtoint (opcode 45). Asserts that the destination is actually an integer type.

Integer-to-pointer: A two-step process. First, the integer is widened or narrowed to the pointer bit-width (32 or 64, obtained via sub_127B390). Then inttoptr (opcode 46) converts the properly-sized integer to a pointer:

%conv1 = zext i32 %val to i64          ; step 1: widen to pointer width
%conv  = inttoptr i64 %conv1 to ptr    ; step 2: int -> ptr

Float-to-integer and integer-to-float

Two paths exist for these conversions:

Standard path: Uses LLVM's native cast opcodes. Triggered when the global flag unk_4D04630 is set (relaxed rounding mode), or when the destination is 128-bit, or when the source is fp128:

| Direction | Signed opcode | Unsigned opcode |
|---|---|---|
| int -> float | sitofp (39) | uitofp (40) |
| float -> int | fptosi (41) | fptoui (42) |

NVIDIA intrinsic path: For SM targets that require round-to-zero semantics on float-int conversions. Constructs an intrinsic function name dynamically and emits it as a plain function call:

// Name construction pseudocode
char buf[64];
if (src_is_double)  strcpy(buf, "__nv_double");
else                strcpy(buf, "__nv_float");

strcat(buf, is_unsigned ? "2u" : "2");

if (dest_bits == 64) strcat(buf, "ll_rz");
else                 strcat(buf, "int_rz");

Producing names like:

| Intrinsic | Conversion |
|---|---|
| __nv_float2int_rz | f32 -> i32, signed, round-to-zero |
| __nv_float2uint_rz | f32 -> u32, unsigned, round-to-zero |
| __nv_double2ll_rz | f64 -> i64, signed, round-to-zero |
| __nv_double2ull_rz | f64 -> u64, unsigned, round-to-zero |
| __nv_float2ll_rz | f32 -> i64, signed, round-to-zero |

These are emitted as plain LLVM function calls (call i32 @__nv_float2int_rz(float %val)), not as LLVM intrinsics. The NVIDIA PTX backend later pattern-matches these __nv_ calls to cvt.rz.* PTX instructions. The intrinsic call is created by sub_128A3C0, which builds a function type, looks up or creates the declaration in the module, and emits a CallInst with one argument.

If the source integer is 32-bit but the target needs 64-bit conversion, the function first converts i32 to i64, then recursively calls itself to convert i64 to the target float type.
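The name construction is simple enough to replay directly (a Python transcription of the pseudocode above; function name illustrative):

```python
def nv_conv_intrinsic(src_is_double, is_unsigned, dest_bits):
    """Build the __nv_*_rz round-to-zero conversion helper name."""
    buf = "__nv_double" if src_is_double else "__nv_float"
    buf += "2u" if is_unsigned else "2"
    buf += "ll_rz" if dest_bits == 64 else "int_rz"
    return buf

print(nv_conv_intrinsic(False, True, 32))   # -> __nv_float2uint_rz
print(nv_conv_intrinsic(True, False, 64))   # -> __nv_double2ll_rz
```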

Float-to-float (fptrunc / fpext)

The source and destination type tags are compared directly. If the destination tag is larger (wider float), opcode 44 (fpext) is used. If smaller, opcode 43 (fptrunc).

%conv = fpext float %val to double       ; float -> double
%conv = fptrunc double %val to float     ; double -> float

Cast control flow summary

EmitCast(ctx, expr, is_unsigned, destTy, is_unsigned2, skip, diag)
  |
  +-- skip_flag set          --> return 0
  +-- destTy == BoolType?
  |     +-- src is float       --> fcmp une %val, 0.0    "tobool"
  |     +-- src is ptr/int     --> icmp ne %val, null/0  "tobool"
  +-- srcTy == destTy          --> return expr (identity)
  +-- ptr -> ptr               --> addrspacecast(47)     "conv"
  +-- ptr -> int               --> ptrtoint(45)          "conv"
  +-- int -> ptr               --> resize + inttoptr(46) "conv"
  +-- int -> int               --> trunc/zext/sext       "conv"
  +-- int -> float
  |     +-- standard           --> sitofp(39)/uitofp(40) "conv"
  |     +-- nvidia             --> __nv_*2*_rz call      "call"
  +-- float -> int
  |     +-- standard           --> fptosi(41)/fptoui(42) "conv"
  |     +-- nvidia             --> __nv_*2*_rz call      "call"
  +-- float -> float
        +-- wider              --> fpext(44)             "conv"
        +-- narrower           --> fptrunc(43)           "conv"

IR Instruction Infrastructure

BB insertion linked list

After creating any LLVM instruction, it must be inserted into the current basic block. This appears ~30 times across the expression codegen functions as a doubly-linked intrusive list manipulation. The low 3 bits of list pointers carry tag/flag bits (alignment guarantees valid pointers have zero in those positions):

// Repeated BB insertion pattern (pseudocode; pointer arithmetic shown untyped)
Value *tail = ctx[1][1];           // current BB's instruction list tail
if (tail) {
    Value *sentinel = ctx[1][2];   // sentinel node
    InsertIntoBB(tail + 40, inst); // sub_157E9D0
    // Doubly-linked-list fixup (low 3 bits of each pointer carry tag bits):
    inst->prev = (*sentinel & ~7) | (inst->prev & 7);      // preserve tag bits
    inst->parent = sentinel;
    *(Value**)((*sentinel & ~7) + 8) = inst + 24;          // old_tail.next = inst
    *sentinel = (*sentinel & 7) | (uintptr_t)(inst + 24);  // sentinel.head = inst
}

Instruction offsets: +24 = prev pointer, +32 = parent block, +48 = debug location metadata slot.

Debug metadata attachment

After every BB insertion, debug location metadata is cloned and attached:

SetValueName(inst, &name);                    // sub_164B780: e.g. "lnot.ext"
Value *debugLoc = *ctx_debug;
if (debugLoc) {
    Value *cloned = CloneDebugLoc(debugLoc, 2);  // sub_1623A60
    if (inst->debugLoc)
        ReleaseDebugLoc(inst + 48);              // sub_161E7C0: free old
    inst->debugLoc = cloned;
    if (cloned)
        RegisterDebugLoc(cloned, inst + 48);     // sub_1623210
}

Global flags

| Address | Purpose |
|---|---|
| dword_4D04720 + dword_4D04658 | Debug info emission control. When both zero, source location is forwarded before dispatch |
| dword_4D04810 | Bitfield optimization flag. When set, enables the bassign.tmp alloca path for bitfield assignments |
| unk_4D04630 | When set, forces standard LLVM casts (sitofp/fptosi) instead of __nv_*_rz intrinsics |
| unk_4D04700 | When set, marks tobool results as "potentially inexact" via flag bit |
| unk_4D0463C | Volatile detection flag. When set, queries address volatility |

Helper Function Reference

| Address | Recovered name | Role |
|---|---|---|
| sub_128D0F0 | EmitExpr | Master expression dispatcher (this page) |
| sub_128A450 | EmitCast | All C-level casts |
| sub_127D8B0 | EmitConstExpr | Compile-time constant expressions |
| sub_1282050 | EmitBitfieldStore | Bitfield write (R-M-W) |
| sub_1284570 | EmitBitfieldLoad | Bitfield read (extract) |
| sub_127FEC0 | EmitBoolExpr | Expression to i1 conversion |
| sub_127F650 | EmitLiteral | Numeric/string literal emission |
| sub_1286D80 | EmitAddressOf | Compute pointer to lvalue |
| sub_1287CD0 | EmitLoadFromAddress | Load via computed address |
| sub_1287ED0 | EmitCompoundAssign | Generic compound assignment |
| sub_128C390 | EmitIncDec | Pre/post increment/decrement |
| sub_128F9F0 | EmitBinaryArithCmp | Binary arithmetic and comparison |
| sub_128F580 | EmitShiftOrBitwise | Shift and bitwise operators |
| sub_128B750 | EmitSubscriptOp | Array subscript (GEP + load) |
| sub_128FDE0 | EmitSizeofAlignof | sizeof and alignof operators |
| sub_12901D0 | EmitCompoundAssignWrapper | Wrapper dispatching to per-operator impl |
| sub_1296570 | EmitCall | Function call emission |
| sub_12897E0 | EmitBitfieldStore (inner) | Actual bitfield store logic |
| sub_127A030 | GetLLVMType | EDG type to LLVM type translation |
| sub_127F680 | CanUseFastBitfieldPath | Bitfield path selector |
| sub_128A3C0 | EmitIntrinsicConvCall | __nv_*_rz intrinsic call helper |
| sub_12A4D50 | CreateBasicBlock | Create named BB |
| sub_12A4DB0 | EmitCondBranch | Conditional branch emission |
| sub_12909B0 | EmitUnconditionalBranch | Unconditional branch emission |
| sub_1290AF0 | SetInsertPoint | Switch current BB |
| sub_15FB440 | CreateBinOp | Binary instruction creation |
| sub_15FDBD0 | CreateCast | Cast instruction creation (IR path) |
| sub_15A46C0 | ConstantExprCast | Cast (constant-fold path) |
| sub_15A0680 | ConstantInt::get | Integer constant creation |
| sub_159C0E0 | ConstantInt::get (APInt) | Wide integer constant creation |
| sub_159CCF0 | ConstantFP::get | Float constant creation |
| sub_128B370 | EmitLoad | Load with volatile/type/srcloc |
| sub_128BE50 | EmitCommaOp | Comma operator RHS extraction |
| sub_1289860 | ComputeCompositeMemberAddr | Multi-level GEP for nested fields |
| sub_12843D0 | EmitComplexMemberLoad | Nested struct/union field load |
| sub_127FF60 | EmitStmtExpr | Statement expression body emission |
| sub_1281200 | EmitSpecialConst | Special constant materialization |
| sub_1281220 | EmitInitExpr | Init expression emission |
| sub_1285E30 | EmitBlockAddress | blockaddress / indirect branch |
| sub_1286000 | EmitVaArg | va_arg lowering |
| sub_127FC40 | CreateAlloca | Alloca with name and alignment |
| sub_127B420 | IsAddressOfExpr | Check if child is & (for elision) |
| sub_127B3A0 | IsVolatile | Volatile type query |
| sub_127B390 | GetSMVersion | Returns current SM target |
| sub_127B460 | IsPacked | Packed struct type query |
| sub_127B550 | FatalDiag | Fatal diagnostic (never returns) |
| sub_127C5E0 | AttachDebugLoc | Debug location attachment |
| sub_127D2C0 | ConstantFromType | Type-level constant (sizeof, etc.) |
| sub_12A4D00 | LookupLabel | Label resolution for goto/address |
| sub_1648A60 | AllocateInstruction | Raw instruction memory allocation |
| sub_1648B60 | AllocatePHI | PHI node memory allocation |
| sub_164B780 | SetValueName | Assigns %name to IR value |
| sub_157E9D0 | InsertIntoBasicBlock | BB instruction list insertion |
| sub_1623A60 | CloneDebugLoc | Debug location cloning |
| sub_1623210 | RegisterDebugLoc | Debug location list registration |
| sub_161E7C0 | ReleaseDebugLoc | Debug location list removal |
| sub_15F1EA0 | InitInstruction | Instruction field initialization |
| sub_15F1F50 | InitPHINode | PHI node initialization (opcode 53) |
| sub_15F2350 | SetExactFlag | Mark sdiv/udiv as exact |
| sub_15F55D0 | GrowOperandList | Realloc PHI operand array |
| sub_15FEC10 | CreateCmpInst | ICmp/FCmp instruction creation |
| sub_15FE0A0 | CreateIntResize | Trunc/zext/sext helper |
| sub_15FB630 | CreateUnaryOp | Unary NOT (xor -1) |
| sub_15F9CE0 | SetGEPOperands | GEP operand filling |
| sub_15FA2E0 | SetInBoundsFlag | Mark GEP as inbounds |
| sub_8D23B0 | IsArrayType | Array type check |
| sub_72B0F0 | EvaluateConstantExpr | EDG constant evaluation |
| sub_731770 | NeedsBitfieldTemp | Bitfield temp alloca check |

Constant expression helper functions

| Address | Recovered name | Role |
|---|---|---|
| sub_127D8B0 | EmitConstExpr | Master constant expression emitter |
| sub_127D000 | ResizeConstant | Resize constant to target type |
| sub_127D120 | DestroyAPFloatElement | APFloat cleanup in aggregate loop |
| sub_127D2E0 | PushElementBulk | Bulk push to element vector |
| sub_127D5D0 | PushElement | Single push to element vector |
| sub_1277B60 | GetFieldIndex | Struct field index query |
| sub_1276020 | GetOrCreateGlobalVar | Global variable creation/lookup |
| sub_1277140 | GetOrCreateFunction | Function creation/lookup |
| sub_1280350 | LookupFunctionStaticVar | Static local variable resolution |
| sub_126A1B0 | CreateStringGlobalConst | Global string constant creation |
| sub_1598F00 | ConstantAggregateZero::get | Zero-initialized aggregate |
| sub_15991C0 | ConstantDataArray::getRaw | Raw byte array constant |
| sub_159DFD0 | ConstantArray::get | Typed array constant |
| sub_159F090 | ConstantStruct::get | Struct constant |
| sub_15943F0 | StructType::get | Anonymous struct type |
| sub_15A06D0 | Constant::getNullValue | Zero constant for any type |
| sub_15A0A60 | ConstantExpr::extractvalue | Sub-element extraction |
| sub_15A2E80 | ConstantExpr::getGEP | Constant GEP expression |
| sub_15A4510 | ConstantExpr::getBitCast | Constant bitcast |
| sub_15A4A70 | ConstantExpr::getAddrSpaceCast | Constant addrspacecast |
| sub_15A4180 | ConstantExpr::getPtrToInt | Constant ptrtoint |
| sub_15A8020 | StructLayout::getElemContainingOffset | Bitfield byte lookup |
| sub_15A9930 | DataLayout::getStructLayout | Struct layout query |
| sub_620E90 | edg::IsSignedIntConst | Signedness query |
| sub_620FA0 | edg::GetSignedIntValue | Signed integer extraction |
| sub_620FD0 | edg::GetUnsignedIntValue | Unsigned integer extraction |
| sub_622850 | edg::GetIntConstAsString | __int128 decimal string extraction |
| sub_622920 | edg::ExtractFieldOffset | Field offset extraction |
| sub_709B30 | edg::ExtractFloatBits | Float raw bits extraction |
| sub_722AB0 | edg::ReadIntFromBuffer | Endian-aware integer read |
| sub_8D4050 | edg::GetArrayElementType | Array element type query |
| sub_8D4490 | edg::GetArrayElementCount | Array dimension query |

LLVM Opcode Constants

Numeric opcode constants used in CreateBinOp, CreateCast, and instruction creation calls throughout the expression codegen:

| Number | LLVM instruction | Used by |
|---|---|---|
| 13 | sub | Pointer subtraction step 4 |
| 18 | sdiv | Pointer subtraction step 5 (with exact flag) |
| 32 | shl | Left shift (<<) |
| 33 | ashr / lshr | Right shift (>>, signedness-dependent) |
| 34 | and (FP variant) | Bitwise AND |
| 35 | or (FP variant) | Bitwise OR |
| 36 | xor (FP variant) | Bitwise XOR |
| 37 | zext | Zero-extend (bool-to-int, lnot.ext, land.ext) |
| 38 | and | Bitwise AND (integer) |
| 39 | sitofp / or | Signed int-to-float / bitwise OR (integer) |
| 40 | uitofp / xor | Unsigned int-to-float / bitwise XOR (integer) |
| 41 | fptosi / funnel shift | Signed float-to-int / rotate |
| 42 | fptoui | Unsigned float-to-int |
| 43 | fptrunc | Float-to-float truncation |
| 44 | fpext | Float-to-float extension |
| 45 | ptrtoint | Pointer-to-integer cast |
| 46 | inttoptr | Integer-to-pointer cast |
| 47 | bitcast / addrspacecast | Pointer casts |
| 51 | ICmp instruction kind | Integer comparison creation |
| 52 | FCmp instruction kind | Float comparison creation |
| 53 | PHI node kind | PHI creation for &&, \|\|, ?: |

PHI Node Construction Detail

PHI nodes are used by three expression types: logical AND (0x57), logical OR (0x58), and ternary (0x67). The construction sequence is identical across all three:

  1. Allocate: AllocatePHI (sub_1648B60) with 64 bytes.
  2. Initialize: InitPHINode (sub_15F1F50) with opcode 53 (PHI), type, and zero for parent/count/incoming.
  3. Set capacity: *(phi+56) = 2 -- two incoming edges.
  4. Set name: SetValueName (sub_164B780) with "land.ext", "lor.ext", or "cond".
  5. Reserve slots: sub_1648880(phi, 2, 1) -- reserve 2 incoming at initial capacity 1.

Adding each incoming value:

count = *(phi+20) & 0xFFFFFFF;           // current operand count
if (count == *(phi+56))                   // capacity full?
    GrowOperandList(phi);                 // sub_15F55D0: realloc

new_idx = (count + 1) & 0xFFFFFFF;
*(phi+20) = new_idx | (*(phi+20) & 0xF0000000);  // update count, preserve flags

// Large-mode flag at *(phi+23) & 0x40 selects operand array location:
base = (*(phi+23) & 0x40) ? *(phi-8) : phi_alloc_base - 24*new_idx;

// Value slot: base + 24*(new_idx-1) — 24 bytes per slot (value ptr + use-list pointers)
slot = base + 24*(new_idx - 1);
*slot = value;                           // incoming value
slot[1] = value.use_next;               // link into value's use-list
slot[2] = &value.use_head | (slot[2] & 3);
value.use_head = slot;

// Basic block slot: stored after all value slots as parallel array
bb_offset = base + 8*(new_idx-1) + 24*num_incoming + 8;
*bb_offset = incoming_bb;

The PHI operand layout is [val0, val1, ..., bb0, bb1, ...] where each value slot occupies 24 bytes (value pointer + doubly-linked use-list pointers), and basic block pointers form a parallel 8-byte array after all value slots.
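The grow-and-append sequence above can be modeled in a few lines. This is a simplified sketch, not the binary's actual structures: `Use`, `PhiModel`, and the `std::vector` storage with a doubling policy are illustrative stand-ins for the recovered 24-byte slots and `GrowOperandList`.

```cpp
#include <cassert>
#include <vector>

// Model of the recovered layout [val0, val1, ..., bb0, bb1, ...]:
// each value slot carries use-list links; basic-block pointers form
// a parallel array appended after all value slots.
struct Use {            // one 24-byte value slot (illustrative)
    void *value;        // incoming value pointer
    Use  *nextUse;      // next use of the same value
    Use  *prevUse;      // previous use (low bits are tag bits in the binary)
};

struct PhiModel {
    std::vector<Use>    values;  // value slots
    std::vector<void *> blocks;  // parallel basic-block array
    unsigned capacity = 2;       // *(phi+56): initial two incoming edges

    void addIncoming(void *val, void *bb) {
        if (values.size() == capacity)   // GrowOperandList analogue
            capacity *= 2;
        values.push_back(Use{val, nullptr, nullptr});
        blocks.push_back(bb);
        assert(values.size() == blocks.size());  // arrays stay parallel
    }
};
```

A two-incoming PHI (the `&&`/`||`/`?:` case) never grows; the capacity check only fires for PHIs fed by more than two predecessors.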

Duplicate Implementations

Two additional copies of the bitfield codegen exist at sub_923780 (store) and sub_925930 (load) -- identical algorithms with the same string names, same opcodes, same control flow. These are in the 0x92xxxx range (NVVM frontend region) while the primary copies are in the 0x128xxxx range (codegen helper region). They likely correspond to different template instantiations or address-space variants in the original NVIDIA source code.

Diagnostic String Index

| String | Origin function | Trigger |
|---|---|---|
| "unsupported expression!" | EmitExpr (sub_128D0F0) | Default case in outer switch |
| "unsupported operation expression!" | EmitExpr (sub_128D0F0) | Default case in inner switch |
| "constant expressions are not supported!" | EmitConstExpr (sub_127D8B0) | Unsupported context kind (sub_6E9180 returns true) |
| "unsupported constant variant!" | EmitConstExpr (sub_127D8B0) | Unknown constant kind in main switch; also byte != 0/1/2 in address-of |
| "unsupported float variant!" | EmitConstExpr (sub_127D8B0) | Float kind 5, or kind < 2 |
| "long double" / "__float80" / "__float128" | EmitConstExpr (sub_127D8B0) | Warning 0xE51: extended precision truncated to double on CUDA target |
| "failed to lookup function static variable" | EmitConstExpr (sub_127D8B0) | Function static address with type tag > 0x10 |
| "taking address of non-string constant is not supported!" | EmitConstExpr (sub_127D8B0) | &literal where literal kind != 2 (non-string) |
| "unsupported cast from address constant!" | EmitConstExpr (sub_127D8B0) | Type mismatch that is not ptr-to-ptr or ptr-to-int |
| "unsupported aggregate constant!" | EmitConstExpr (sub_127D8B0) | Type tag not in {8, 10, 11, 12} for aggregate case |
| "initialization of bit-field in union not supported!" | EmitConstExpr (sub_127D8B0) | Union initializer targeting a named bitfield |
| "cannot find initialized union member!" | EmitConstExpr (sub_127D8B0) | Union field chain exhausted without finding target |
| "bit-field constant must have a known value at compile time!" | EmitConstExpr (sub_127D8B0) | Bitfield initializer evaluates to NULL |
| "unexpected error while initializing bitfield!" | EmitConstExpr (sub_127D8B0) | Pre-existing byte in struct is not zero when packing |
| "unexpected non-integer type for cast from pointer type!" | EmitCast (sub_128A450) | ptrtoint destination is not integer |
| "unexpected destination type for cast from pointer type" | EmitCast (sub_128A450) | inttoptr source is not integer |
| "error generating code for loading from bitfield!" | EmitBitfieldLoad (sub_1284570) | Alignment assertion failure |
| "expected result type of bassign to be void!" | EmitExpr (sub_128D0F0) | Bitfield assign result type validation |

Cross-References

Statement & Control Flow Codegen

The statement code generator converts EDG IL statement nodes into LLVM IR basic blocks and terminators. It is the control flow backbone of NVVM IR generation: every if, while, for, switch, goto, return, and compound block passes through a single recursive dispatcher (sub_9363D0) that reads a statement-kind byte and fans out to 17 specialized handlers. Each handler creates named basic blocks following a fixed naming convention, connects them with conditional or unconditional branches, and attaches metadata for branch prediction and loop optimization. Understanding this subsystem means understanding exactly how C/CUDA source-level control flow maps to the LLVM IR that downstream optimization passes will transform.

Binary coordinates: Handlers span 0x930000--0x948000 (~96 KB). The dispatcher itself is at 0x9363D0; the most complex handler (try/catch at sub_932270) is 57 KB alone.

Statement Dispatcher -- sub_9363D0 (emitStmt)

void emitStmt(CGModule *cg, StmtNode *stmt);

The dispatcher is the only entry point for statement lowering. All control flow handlers, compound statements, and even the top-level function body driver call emitStmt recursively.

Entry logic:

  1. If cg->currentBB (offset +96) is NULL, create an anonymous unreachable basic block via createBB("") and insert it. This is the "dead code after return" safety net -- it ensures the IR builder always has an insertion point, even for unreachable code that follows a return or goto.

  2. Read stmt->stmtKind (byte at StmtNode offset +40).

  3. Special fast path: if kind == 8 (return), call setDebugLoc + pushScope + emitReturnStmt and return immediately. Returns get priority handling because they terminate the current BB and may trigger cleanup scope unwinding.

  4. General path: setDebugLoc + pushScope, then dispatch on kind through a switch table.
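The entry logic can be condensed into a structural sketch. `StmtNode` and `CGModule` here are simplified stand-ins for the recovered field offsets, and the handlers are reduced to stubs that record which path ran; only the control flow mirrors the decompilation.

```cpp
#include <cstdint>
#include <string>

struct StmtNode { uint8_t stmtKind; };          // byte at StmtNode+40
struct CGModule { void *currentBB = nullptr; }; // pointer at CGModule+96

static int  dummyBB;                            // stand-in basic block
static void *createBB(const char *) { return &dummyBB; }
static void setDebugLoc(StmtNode *) {}          // stubbed helpers
static void pushScope(StmtNode *) {}
static std::string lastHandler;                 // records the dispatched path

void emitStmt(CGModule *cg, StmtNode *stmt) {
    if (!cg->currentBB)                   // safety net: dead code after a
        cg->currentBB = createBB("");     // return/goto still gets an insertion point
    setDebugLoc(stmt);
    pushScope(stmt);
    if (stmt->stmtKind == 8) {            // fast path: return terminates the BB
        lastHandler = "emitReturn";
        return;
    }
    switch (stmt->stmtKind) {             // general dispatch table (abridged)
    case 1:  lastHandler = "emitIfStmt";   break;
    case 11: lastHandler = "emitCompound"; break;
    case 24: lastHandler = "(empty)";      break;  // null statement
    default: lastHandler = "fatal: unsupported statement type"; break;
    }
}
```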

Kind Dispatch Table

| Kind | Statement type | Handler | Address |
|---|---|---|---|
| 0 | Expression statement | emitExprStmt | sub_921EA0 |
| 1 | if statement | emitIfStmt | sub_937020 |
| 2 | if constexpr (C++17) | emitConstexprIf | sub_936F80 |
| 5 | while loop | emitWhile | sub_937180 |
| 6 | goto | emitGoto | sub_931270 |
| 7 | Label statement | emitLabel | sub_930570 |
| 8 | return | emitReturn | sub_9313C0 |
| 11 | Compound { ... } | emitCompound | sub_9365F0 |
| 12 | do-while loop | emitDoWhile | sub_936B50 |
| 13 | for loop | emitFor | sub_936D30 |
| 15 | case label | emitCase | sub_935670 |
| 16 | switch statement | emitSwitch | sub_9359B0 |
| 17 | Variable declaration | emitDeclStmt | sub_9303A0 |
| 18 | try/catch | emitTryCatch | sub_932270 |
| 20 | Cleanup/destructor scope | emitCleanupScope | sub_931670 |
| 24 | Null/empty statement | (return immediately) | -- |
| 25 | Expression statement (alt) | emitExprStmt | sub_921EA0 |

Kinds 0 and 25 share the same handler. The split likely distinguishes C expression-statements from GNU statement-expressions or a similar EDG internal distinction. Any unrecognized kind triggers fatal("unsupported statement type").

Gaps in the numbering (3, 4, 9, 10, 14, 19, 21--23) either correspond to statement types handled entirely in the EDG frontend (lowered before codegen sees them) or are reserved for future use.


If Statement -- sub_937020

Reads from the StmtNode: condition expression at offset +48, then-body at +72, else-body at +80 (may be NULL).

BB Layout: if/else

    ┌─────────────────────┐
    │    current BB        │
    │  %cond = ...         │
    │  br i1 %cond,        │
    │    label %if.then,    │
    │    label %if.else     │
    └──┬──────────────┬────┘
       │              │
       ▼              ▼
 ┌──────────┐   ┌──────────┐
 │ if.then  │   │ if.else  │
 │ <then>   │   │ <else>   │
 │ br %end  │   │ br %end  │
 └────┬─────┘   └────┬─────┘
      │               │
      ▼               ▼
    ┌─────────────────────┐
    │      if.end          │
    └─────────────────────┘

BB Layout: if without else

    ┌─────────────────────┐
    │    current BB        │
    │  %cond = ...         │
    │  br i1 %cond,        │
    │    label %if.then,    │
    │    label %if.end      │
    └──┬──────────────┬────┘
       │              │
       ▼              │
 ┌──────────┐         │
 │ if.then  │         │
 │ <then>   │         │
 │ br %end  │         │
 └────┬─────┘         │
      │               │
      ▼               ▼
    ┌─────────────────────┐
    │      if.end          │
    └─────────────────────┘

LLVM IR pseudocode:

  %cond = icmp ne i32 %x, 0           ; evalCondition: convert to i1
  br i1 %cond, label %if.then, label %if.else, !prof !0

if.then:
  ; ... then-body codegen ...
  br label %if.end

if.else:
  ; ... else-body codegen ...
  br label %if.end

if.end:
  ; continues here

!0 = !{!"branch_weights", i32 2000, i32 1}  ; if __builtin_expect(x, 1)

Branch Weight Metadata

sub_92F9D0 examines __builtin_expect annotations on branch bodies by checking bit flags at StmtNode offset +41:

| Flag | Source annotation | Weight encoding | Metadata attached |
|---|---|---|---|
| bit 0x10 | __builtin_expect(x, 1) -- likely | weightHint = 1 | !{!"branch_weights", i32 2000, i32 1} |
| bit 0x20 | __builtin_expect(x, 0) -- unlikely | weightHint = 2 | !{!"branch_weights", i32 1, i32 2000} |
| neither | no annotation | weightHint = 0 | (no metadata) |

The 2000:1 ratio represents 99.95% prediction confidence. For compound statements (kind 11), the function recurses into the compound's first child statement to find the annotation.
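The flag-to-weights mapping is small enough to sketch directly. The 2000:1 values come from the recovered tables; the function name `branchWeights` and the pair return are illustrative, not the binary's interface.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Sketch of sub_92F9D0's weight selection: the flag byte at StmtNode+41
// maps to the (true-edge, false-edge) weights of the !prof metadata.
// A {0, 0} result means "no annotation, attach no metadata".
std::pair<int, int> branchWeights(uint8_t flags) {
    if (flags & 0x10) return {2000, 1};   // __builtin_expect(x, 1): likely
    if (flags & 0x20) return {1, 2000};   // __builtin_expect(x, 0): unlikely
    return {0, 0};                        // no annotation
}
```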

Constexpr If -- sub_936F80

C++17 if constexpr is fully resolved during EDG frontend semantic analysis. By the time the codegen sees it, only the taken branch body survives. The handler reads a selection record from offset +72: a bit at +24 determines which of two fields contains the surviving body pointer. If non-null, it creates constexpr_if.body and constexpr_if.end BBs and emits the body with an unconditional branch to .end. If null (dead branch entirely eliminated), no codegen occurs at all.


While Loop -- sub_937180

    ┌─────────────────────┐
    │    current BB        │
    │  br label %while.cond│
    └─────────┬───────────┘
              │
              ▼
    ┌─────────────────────┐◄──────────────┐
    │   while.cond         │               │
    │  %c = ...            │               │
    │  br i1 %c,           │               │
    │    label %while.body, │               │
    │    label %while.end   │               │
    └──┬──────────────┬────┘               │
       │              │                    │
       ▼              │                    │
 ┌──────────────┐     │                    │
 │ while.body   │     │                    │
 │ <body>       │     │                    │
 │ br %cond ────┼─────┼────────────────────┘
 └──────────────┘     │         backedge with
                      │         !llvm.loop metadata
                      ▼
              ┌──────────────┐
              │  while.end   │
              └──────────────┘

LLVM IR pseudocode:

  br label %while.cond

while.cond:
  %c = icmp slt i32 %i, %n
  br i1 %c, label %while.body, label %while.end

while.body:
  ; ... body codegen ...
  br label %while.cond, !llvm.loop !1

while.end:
  ; continues here

!1 = !{!1, !2}                                    ; self-referential loop ID
!2 = !{!"llvm.loop.mustprogress"}

The backedge branch (br label %while.cond from while.body) always receives !llvm.loop metadata via emitLoopMustProgress (sub_930810). If the loop carries #pragma unroll, additional unroll metadata is merged into the same MDNode (see Loop Metadata below).


Do-While Loop -- sub_936B50

The key structural difference from while: the body executes before the condition. The condition BB follows the body.

    ┌─────────────────────┐
    │    current BB        │
    │  br label %do.body   │
    └─────────┬───────────┘
              │
              ▼
    ┌─────────────────────┐◄──────────────┐
    │   do.body            │               │
    │  <body>              │               │
    │  br label %do.cond   │               │
    └─────────┬───────────┘               │
              │                            │
              ▼                            │
    ┌─────────────────────┐               │
    │   do.cond            │               │
    │  %c = ...            │               │
    │  br i1 %c,           │               │
    │    label %do.body, ──┼───────────────┘
    │    label %do.end     │     backedge
    └──────────────┬──────┘
                   │
                   ▼
           ┌──────────────┐
           │   do.end      │
           └──────────────┘

LLVM IR pseudocode:

  br label %do.body

do.body:
  ; ... body codegen ...
  br label %do.cond

do.cond:
  %c = icmp ne i32 %x, 0
  br i1 %c, label %do.body, label %do.end, !llvm.loop !1

do.end:
  ; continues here

The backedge is the conditional branch in do.cond (true edge back to do.body). Debug location is set separately for the condition expression using the condition node's own source location (offset +36 from the condition expression node).


For Loop -- sub_936D30

The most complex loop handler. Reads four components from the StmtNode: init statement at offset +80 field [0], condition at +48, increment expression at +80 field [1], and body at +72. Any of init, condition, and increment may be NULL.

    ┌─────────────────────┐
    │    current BB        │
    │  <init statement>    │    ← emitted in current BB if non-null
    │  br label %for.cond  │
    └─────────┬───────────┘
              │
              ▼
    ┌─────────────────────┐◄──────────────┐
    │   for.cond           │               │
    │  %c = ... or true    │               │
    │  br i1 %c,           │               │
    │    label %for.body,   │               │
    │    label %for.end     │               │
    └──┬──────────────┬────┘               │
       │              │                    │
       ▼              │                    │
 ┌──────────────┐     │                    │
 │  for.body    │     │                    │
 │  <body>      │     │                    │
 │  br %for.inc │     │                    │
 └──────┬───────┘     │                    │
        │             │                    │
        ▼             │                    │
 ┌──────────────┐     │                    │
 │  for.inc     │     │                    │
 │  <increment> │     │                    │
 │  br %for.cond┼─────┼────────────────────┘
 └──────────────┘     │         backedge
                      ▼
              ┌──────────────┐
              │   for.end    │
              └──────────────┘

LLVM IR pseudocode:

  ; init: i = 0
  store i32 0, ptr %i.addr, align 4
  br label %for.cond

for.cond:
  %i = load i32, ptr %i.addr, align 4
  %cmp = icmp slt i32 %i, %n
  br i1 %cmp, label %for.body, label %for.end

for.body:
  ; ... body codegen ...
  br label %for.inc

for.inc:
  %i1 = load i32, ptr %i.addr, align 4
  %inc = add nsw i32 %i1, 1
  store i32 %inc, ptr %i.addr, align 4
  br label %for.cond, !llvm.loop !1

for.end:
  ; continues here

Special cases:

  • Null condition: If the condition expression is NULL (e.g., for(;;)), the handler calls ConstantInt::getTrue (sub_ACD6D0) to create an unconditionally-true condition, producing an infinite loop.
  • Volatile increment: If the increment expression operates on a volatile pointer (type descriptor & 0xFB == 8 and isVolatile() returns true), the store is marked volatile.
  • Scope tracking: Outside "fast codegen" mode (dword_4D04658 == 0), pushes a DW_TAG_lexical_block debug scope at for-loop entry via sub_941230/sub_9415C0 and pops it at exit via sub_93FF00. This generates correct DWARF scoping so debuggers see for-local variables in the right scope.

The for.inc BB is only created when an increment expression exists. If omitted, the body branches directly back to for.cond.


Switch Statement -- sub_9359B0

The largest control flow handler after try/catch (~550 decompiled lines). Uses a three-phase approach with an internal open-addressing hash table.

Phase 1: Build case-to-BB mapping

Iterates the case list (linked list at stmt[10]+16, next pointer at +32). For each case label, creates a switch_case.target BB. Also creates one switch_case.default_target BB for the default case. Stores the mapping in an open-addressing hash table at CGModule offsets +496 through +520.

Hash table layout (32-byte entries):

| CGModule offset | Field |
|---|---|
| +496 | numEntries |
| +504 | bucket array pointer |
| +512 | numOccupied |
| +516 | numTombstones |
| +520 | capacity |

Uses the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function and growth policy.

Phase 2: Emit LLVM SwitchInst

Evaluates the switch condition via sub_92F410, then creates a SwitchInst via sub_B53A60 (SwitchInst::Create) with the case count, default target BB, and condition value. Each case constant is added via sub_B53E30 (SwitchInst::addCase).

Phase 3: Emit body

Creates a switch_child_entry BB, inserts it, and recursively emits the switch body. If the switch has no explicit default: case, emits a fallthrough to the switch_case.default_target BB.

LLVM IR pseudocode:

  %val = load i32, ptr %x.addr
  switch i32 %val, label %switch_case.default_target [
    i32 0, label %switch_case.target
    i32 1, label %switch_case.target1
    i32 5, label %switch_case.target2
  ]

switch_case.target:                ; case 0
  ; ...
  br label %switch_child_entry     ; fallthrough or break

switch_case.target1:               ; case 1
  ; ...

switch_case.default_target:        ; default
  ; ...

Note that cicc always emits an LLVM switch instruction. The decision to lower a switch into a jump table versus sequential comparisons is made later by the SelectionDAG backend (specifically NVPTXTargetLowering), not during IR generation. The codegen produces a clean, canonical switch and lets the backend optimize the dispatch strategy.

Case Label -- sub_935670

When the recursive statement walk encounters a case label (kind 15), it looks up the parent switch node (asserts stmtKind == 16), finds the pre-allocated target BB from the hash table, and calls insertBB to make it the current insertion point. Fatal error "basic block for case statement not found!" if the hash table lookup fails.

For the default case (identified by a null value at +8), retrieves the last entry in the mapping vector.


Goto and Label Statements

Goto -- sub_931270

Reads the target label from stmt->auxData+128. Fatal error if null: "label for goto statement not found!".

Two code paths based on cleanup state:

Simple goto (no active cleanups, CGModule offset +240 == 0): Resolves the label to its BB via sub_946C80 and emits an unconditional branch.

Goto with cleanups (offset +240 != 0): Before branching, the handler must destroy all local variables whose scope is being exited. Calls sub_9310E0 to compute the destruction set, iterates each variable calling sub_9465D0 to emit destructor calls, resets the cleanup stack, then resolves and branches to the label BB.

  ; goto with cleanup: jumping out of scope with a std::string local
  call void @_ZNSsD1Ev(ptr %str)     ; ~string()
  br label %label_target

Label -- sub_930570

Resolves the label to its BB via sub_946C80 and inserts it as the current basic block via insertBB. The BB name comes from the label's symbol name in the EDG IL.

Computed Goto (GCC &&label Extension)

Computed goto is handled in the expression codegen layer, not the statement dispatcher. Expression kind 0x71 at sub_921EA0 calls EmitBlockAddress (sub_1285E30) to produce an LLVM blockaddress constant, and expression kind 0x70 produces the label-as-value. The resulting indirectbr instruction is lowered later by IndirectBrExpandPass (pipeline parser index 247, "indirectbr-expand") because NVPTX does not natively support indirect branches -- they are expanded into a switch over all possible target labels.


Return Statement -- sub_9313C0

Reads the return expression from StmtNode offset +48. Dispatches on CGModule return-type information (offsets +208 and +216):

Path A -- Aggregate (struct) return: If the return type is aggregate (sub_91B770 returns true), emits a memcpy-like sequence into the sret pointer via sub_947E80. For multi-register returns (offset +216 > 0), uses bit-width analysis (_BitScanReverse64) to determine the return bit layout.

Path B -- Scalar return: Evaluates the expression, creates a ReturnInst via sub_B4D3C0, and may bitcast the value for ABI compliance via sub_AE5020.

Path C -- Void return with expression: Evaluates the expression for side effects only (calls emitExprStmt), then falls through to emit a void return.

Cleanup before return: If cg->hasCleanups (offset +240) is set, calls sub_9310E0 to compute the set of locals requiring destruction, emits destructor calls in reverse order, resets the cleanup stack, then emits an unconditional branch to the function's unified return block (offset +200).

  ; return with cleanup unwind
  call void @_ZN3FooD1Ev(ptr %obj)    ; ~Foo()
  store i32 %retval, ptr %retval.addr
  br label %return

return:                               ; unified return BB
  %0 = load i32, ptr %retval.addr
  ret i32 %0

The unified return block pattern means every return in a function branches to a single shared return BB rather than emitting ret directly. This is standard in compilers because it simplifies cleanup handling and produces cleaner IR for optimization.


Try/Catch -- sub_932270

The largest single statement handler at 2,225 decompiled lines (57 KB, 0x3B0 bytes of stack locals). Lowers C++ try/catch into LLVM's landingpad-based exception handling model.

High-level structure:

  1. Collect catch handlers: Traverses the linked list at stmt->auxData+136 to build a vector of catch clause pointers.

  2. Construct cleanup names: Builds a mangled cleanup function name from the function's symbol (reading name range from symbol +184/+176). Single $ characters are doubled to $$ for LLVM compatibility.

  3. Build dispatch mapping: Creates an outer dispatch vector mapping each catch clause to its target BB, stored in the same open-addressing hash table scheme used by switch.

  4. Emit try body: Installs the landingpad/invoke mechanism so that throwing calls within the try body become invoke instructions rather than call instructions.

  5. Emit catch handlers: For each catch clause, creates a BB, emits the handler body, and generates the cleanup/resume path.

Note that CUDA device code has exceptions disabled by default (EDG config DEFAULT_EXCEPTIONS_ENABLED = 0). This handler is exercised primarily for host-side code compiled through cicc, or for the rare case where exceptions are explicitly enabled via compiler flags. When exceptions are disabled, the EDG frontend strips try/catch entirely and the codegen never sees kind 18.

The NVVM IR verifier (sub_2C76F10) explicitly rejects landingpad, invoke, and resume instructions in device code, confirming that exception handling is a host-only feature.


Cleanup/Destructor Scope -- sub_931670

Handles statement kind 20. Only active when cg->hasCleanups (offset +240) is set.

Walks a linked list at StmtNode offset +72. For each entry where the byte at +8 equals 7 (indicating a variable with non-trivial destructor):

  1. Extracts the variable reference at entry[2] (offset +16).
  2. Checks visibility flags (bits 0x60 at +170, byte +177 != 5) to skip external and static symbols.
  3. Looks up the variable in the CGModule's var-lookup hash table (offsets +8 through +24) using the same hash function as the switch table.
  4. If the variable is already registered for cleanup (checked via sub_91CCF0), adds it to the pending cleanup list and emits an immediate destructor call via sub_9465D0.
  5. If not yet registered, just adds it to the pending list for later processing.

This mechanism ensures that C++ automatic variables with non-trivial destructors are properly destroyed when their scope exits -- whether by normal control flow, goto, return, or exception propagation.
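The reverse-order destruction discipline can be sketched as follows. Names are illustrative; the recovered implementation tracks 24-byte `{ptr, end, capacity}` frames on the cleanup stack rather than `std::vector`s.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Locals with non-trivial destructors are registered in declaration order
// and destroyed in reverse when the scope exits -- whether by fallthrough,
// goto, or return.
struct CleanupFrame {
    std::vector<std::string> pending;    // registered locals, in decl order
    std::vector<std::string> destroyed;  // emitted ~T() calls, for inspection

    void registerLocal(const std::string &v) { pending.push_back(v); }

    void exitScope() {                   // scope exit: unwind in reverse
        for (auto it = pending.rbegin(); it != pending.rend(); ++it)
            destroyed.push_back(*it);    // emit destructor call
        pending.clear();                 // reset the cleanup frame
    }
};
```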


Compound Statement -- sub_9365F0

Handles { ... } blocks (kind 11). This is the workhorse that ties everything together: the function body itself is a compound statement, and every block scope creates a nested compound.

Cleanup frame management: When cg->hasCleanups (offset +240) is set, pushes a new cleanup frame onto the cleanup stack (offset +424). Each frame is a 24-byte record: {pendingDestructors ptr, end, capacity}.

Variable declarations: Iterates local declarations at scope fields [14] and [15] (linked lists). For each local variable, emits an alloca or initializer as needed. If the variable has a non-trivial destructor, registers it in the cleanup set.

Statement iteration: Walks the child statement linked list starting at StmtNode offset +72, following nextStmt pointers at +16. For each child, calls emitStmt(cg, child) recursively. Between statements, checks whether pending cleanups need flushing (temporaries with non-trivial destructors). If new cleanup entries appeared since the last check, iterates them in reverse order and emits destructor calls.

Statement-expression support (GNU extension): For ({...}) expressions, if the last statement in the block is an expression (kind 0 or 25), treats its value as the compound's result. Fatal error: "unexpected: last statement in statement expression is not an expression!" if the last statement is not an expression type.

Scope tracking: Outside fast-codegen mode, pushes a DW_TAG_lexical_block debug scope at entry and pops at exit, so debuggers correctly associate variables with their lexical scope.


Variable Declaration -- sub_9303A0

Reads the variable descriptor from StmtNode offset +72, then the variable's symbol from descriptor +8.

Initialization dispatch (based on byte at symbol +177):

| Value | Meaning |
|---|---|
| 4 | Block-scope static -- fatal("block scope static variable initialization is not supported!") |
| 0, 3 | No dynamic init needed -- skip codegen |
| 2 | Dynamic initialization -- main path |

Dynamic init sub-dispatch (descriptor +48 byte):

| Sub-kind | Handler | Purpose |
|---|---|---|
| 1 | sub_91DAD0 | Load-address style init |
| 2 | sub_91FFE0 | Emit initializer expression |
| 3 | sub_92F410 | Direct expression evaluation |
| other | -- | fatal("unsupported dynamic initialization") |

After computing the initializer value, the handler checks for volatile store qualification, computes alignment via sub_91CB50, retrieves the alloca/global address via sub_9439D0, and emits the store via sub_923130.

  %x = alloca i32, align 4              ; from function prologue
  %init = call i32 @compute_value()     ; dynamic initialization
  store i32 %init, ptr %x, align 4      ; emitDeclStmt

Block-scope static variables (static int x = expr;) are explicitly unsupported and fatal. In CUDA device code, block-scope statics have no sensible semantics (no persistent thread-local storage across kernel invocations), so this restriction is intentional.


Loop Metadata: Pragma Unroll and Mustprogress

Pragma Unroll -- sub_9305A0

Called from while, do-while, and for handlers when StmtNode offset +64 (pragma annotation) is non-NULL. Parses "unroll %d" from the pragma string via sscanf.

Count value            Metadata produced
0x7FFFFFFF (INT_MAX)   !{!"llvm.loop.unroll.full"}
Specific N             !{!"llvm.loop.unroll.count", i32 N}
<= 0                   fatal("Unroll count must be positive.")
Parse failure          fatal("Parsing unroll count failed!")

The metadata is wrapped in the standard LLVM loop-ID self-referential MDNode pattern:

  br label %for.cond, !llvm.loop !3

!3 = !{!3, !4}                                      ; self-ref loop ID
!4 = !{!"llvm.loop.unroll.count", i32 8}

Global flag dword_4D046B4 ("skip pragma" mode) gates this entirely -- when set, sub_9305A0 returns immediately.
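The parse-and-dispatch behavior above can be sketched in Python. This is a simulation of the documented logic, not recovered code; a regex stands in for the binary's sscanf("unroll %d") call, and the metadata is rendered as strings rather than MDNodes.

```python
import re

INT_MAX = 0x7FFFFFFF  # a bare "#pragma unroll" is stored as INT_MAX

def lower_unroll_pragma(pragma: str) -> str:
    """Map a pragma annotation string to loop-unroll metadata text."""
    m = re.fullmatch(r"unroll (-?\d+)", pragma)
    if m is None:
        raise RuntimeError("Parsing unroll count failed!")
    count = int(m.group(1))
    if count <= 0:
        raise RuntimeError("Unroll count must be positive.")
    if count == INT_MAX:
        return '!{!"llvm.loop.unroll.full"}'
    return f'!{{!"llvm.loop.unroll.count", i32 {count}}}'
```

The global skip-pragma flag would simply short-circuit this function before the parse.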

Loop Mustprogress -- sub_930810

Called on every loop backedge (while, do-while, for). Creates !{!"llvm.loop.mustprogress"} and attaches it to the backedge branch. If the backedge already has !llvm.loop metadata (from pragma unroll), the existing operands are read and the mustprogress node is appended to create a combined MDNode:

  br label %while.cond, !llvm.loop !5

!5 = !{!5, !6, !7}                                  ; merged: self-ref + unroll + mustprogress
!6 = !{!"llvm.loop.unroll.count", i32 4}
!7 = !{!"llvm.loop.mustprogress"}

This metadata tells the LLVM optimizer that loops must make forward progress -- it is allowed to remove provably-infinite side-effect-free loops. This corresponds to the C++ forward progress guarantee required by the standard.
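The merge behavior can be modeled as list concatenation. Here an !llvm.loop MDNode is represented as a Python list whose first element stands for the self-reference (a stand-in representation, not the binary's data structure):

```python
MUSTPROGRESS = '!{!"llvm.loop.mustprogress"}'

def attach_mustprogress(loop_md):
    """Append mustprogress to a backedge's loop metadata, creating the
    node if no pragma-unroll metadata was attached earlier."""
    if loop_md is None:
        return ["self", MUSTPROGRESS]
    return loop_md + [MUSTPROGRESS]  # merge with existing operands
```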


Infrastructure Functions

createBB -- sub_945CA0

Allocates an 80-byte BasicBlock object and initializes it with the LLVM context from CGModule offset +40. The name parameter produces the characteristic BB names visible throughout this page: "if.then", "while.cond", "for.inc", "switch_case.target", "constexpr_if.body", etc.

insertBB -- sub_92FEA0

void insertBB(CGModule *cg, BasicBlock *bb, int canDelete);

Finalizes the current BB (emits an implicit unconditional branch to bb if the current BB lacks a terminator), then inserts bb into the function's BB list. If canDelete is 1 and the BB has no predecessors, the BB is immediately freed -- this garbage-collects unreachable continuation blocks (e.g., if.end when both branches terminate, while.end when the loop is infinite).

The canDelete=1 flag is used for if.end, while.end, for.end, and do.end BBs.

finalizeBB / emitBr -- sub_92FD90

If the current BB exists and its last instruction is NOT a terminator (opcode check: opcode - 30 > 10 filters out br, ret, switch, etc.), creates a BranchInst to the target BB and inserts it. Then clears cg->currentBB and the insert point.

emitCondBr -- sub_945D00

Creates a conditional BranchInst with true/false targets and optional branch weight metadata. When weightHint != 0, attaches !prof branch_weights metadata via MDBuilder::createBranchWeights.

evalCondition -- sub_921E00

Evaluates a condition expression and converts the result to i1. Checks for aggregate types (fatal error if the condition is an aggregate), determines signedness, evaluates the expression, then emits icmp ne 0 (integer) or fcmp une 0.0 (floating point) to produce a boolean.
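A minimal sketch of the to-boolean choice, assuming only the instruction selection matters (the i32/double type names are illustrative placeholders):

```python
def cond_to_i1(value_name: str, type_kind: str) -> str:
    """Render the compare that converts a condition value to i1."""
    if type_kind == "aggregate":
        raise RuntimeError("condition expression has aggregate type")
    if type_kind == "float":
        return f"fcmp une double {value_name}, 0.0"
    return f"icmp ne i32 {value_name}, 0"  # integer (and pointer) path
```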


EDG StmtNode Layout

Reconstructed from usage patterns across all statement handlers:

Offset   Size   Field
+0       4      Source location: line number
+4       2      Source location: column number
+16      8      nextStmt -- linked list pointer
+40      1      stmtKind -- enum value (0--25 observed)
+41      1      Flags (bit 0x10 = likely, bit 0x20 = unlikely)
+48      8      exprPayload / condition expression pointer
+64      8      Pragma annotation (NULL or "unroll N" string)
+72      8      auxData -- kind-specific (then-body, label, variable descriptor, etc.)
+80      8      auxData2 -- kind-specific (else-body for if, init/increment for for, etc.)

CGModule Offsets Used by Statement Codegen

Offset   Size   Field
+8       8      varLookupTable.buckets
+24      4      varLookupTable.capacity
+40      8      llvmContext
+96      8      currentBB (BasicBlock pointer)
+104     8      insertPoint
+192     8      currentFunction (Function pointer)
+200     8      returnBlock (unified return BB)
+208     8      returnValue / sret pointer
+216     4      returnAlignment
+240     1      hasCleanups flag
+248     --     cleanupSet (DenseSet tracking which vars need cleanup)
+424     8      cleanupStack pointer (24-byte frames)
+496     8      switchHashTable.count
+504     8      switchHashTable.buckets
+512     4      switchHashTable.numOccupied
+516     4      switchHashTable.numTombstones
+520     4      switchHashTable.capacity
+528     8      currentScope pointer

Global Mode Flags

Global          Purpose
dword_4D04658   Fast codegen mode. Skips debug location emission, scope tracking, and some pragma processing. Corresponds to -G0 or equivalent "no debug" mode.
dword_4D046B4   Skip pragma mode. emitUnrollPragma returns immediately. Also gates some compound-statement declaration processing.
dword_4F077C4   CUDA compilation mode. Value 2 triggers alternate volatile-qualification logic in for-loop increment and variable declaration codegen.

Complete BB Naming Reference

Every basic block created by the statement codegen uses one of these exact names:

Statement type   BB names created
if               if.then, if.else, if.end
if constexpr     constexpr_if.body, constexpr_if.end
while            while.cond, while.body, while.end
do-while         do.body, do.cond, do.end
for              for.cond, for.body, for.inc, for.end
switch           switch_case.target (per case), switch_case.default_target, switch_child_entry
goto / label     (named from label symbol)
return           (branch to unified return block)
compound { }     (no BBs unless cleanup)
dead code        "" (anonymous unreachable BB)

These names survive into the final LLVM IR dump (-Xcuda-ptxas=-v) and are visible in optimization pass debug output. Recognizing them immediately tells you which source-level construct produced a given IR region.

Function, Call & Inline Asm Codegen

This page covers the four subsystems that together translate CUDA/C++ function definitions and call sites into LLVM IR: function prolog generation, call instruction emission, inline assembly compilation, and builtin lowering. The code lives in the 0x930000--0x960000 address range (Path A) with a parallel copy at 0x1270000--0x12D0000 (Path B).

EmitFunction             sub_946060 (Path A) -- creates entry BB, allocapt sentinel, dispatches to prolog
GenerateFunctionProlog   sub_938240 (16 KB) -- parameter iteration, ABI dispatch, alloca emission
EmitCallExpr             sub_93CB50 (1,293 lines) -- type resolution, ABI classification, call emission
EmitInlineAsm            sub_1292420 (53 KB, 2,087 lines) -- 7-phase asm template-to-IR pipeline
BuiltinLowering          sub_12B3FD0 (103 KB, 3,409 lines) -- mega-switch over ~250 builtin IDs
EmitFunctionAttrs        sub_12735D0 / sub_1273F90 -- grid_constant, preserve_n, custom ABI metadata

Function Prolog: Entry Block Setup

Every LLVM function produced by cicc starts with the same structural skeleton: an entry basic block containing a sentinel instruction, a cluster of alloca instructions for parameters and locals, and a return basic block for the unified exit path. The outer driver EmitFunction (sub_946060) builds this skeleton; the inner workhorse GenerateFunctionProlog (sub_938240) populates it with parameter handling code.

EmitFunction -- The Outer Driver

EmitFunction executes a fixed 10-step initialization sequence before tail-calling into the prolog generator:

EmitFunction(IRGenState *S, FunctionDecl *Decl, Function *F,
             ParamList *Params, TypeInfoArray *TI, SourceLoc Loc, bool ByvalDemotion):

  1.  Resolve function type through typedef chain (kind==12 -> follow offset+160)
  2.  Call SetupFunctionMetadata(S, Decl)
  3.  Optionally set section name on F via Value::setSection
  4.  Create "entry" basic block:
        entryBB = BasicBlock::Create(S, "entry", F, nullptr)
  5.  Create the "allocapt" sentinel instruction:
        voidTy   = Type::getVoidTy(ctx)
        undef    = UndefValue::get(voidTy)
        allocapt = new BitCastInst(undef, voidTy)  // void-to-void no-op
        entryBB->getInstList().push_back(allocapt)
        allocapt->setName("allocapt")
        S->AllocaInsertPt = allocapt          // stored at IRGenState+456
  6.  Create "return" basic block:
        retBB = BasicBlock::Create(S, "return", nullptr, nullptr)
        S->ReturnBlock = retBB                // stored at IRGenState+200
  7.  Set up return value slot:
        if returnType is void:
            S->RetVal = nullptr
        elif ABI kind == 2 (sret) AND isAggregate(returnType):
            S->RetVal = F->arg_begin()        // reuse the sret pointer
        else:
            S->RetVal = CreateTmpAlloca(S, returnType, "retval")
  8.  Store alignment of return type at S+216
  9.  Initialize insertion state: S->CurrentBB = entryBB
 10.  Tail-call GenerateFunctionProlog(S, Decl, F, Params, TI, Loc, ByvalDemotion)

The allocapt sentinel is the critical mechanism. It is a dead bitcast void undef to void instruction that serves as an insertion anchor. When CreateTmpAlloca (at sub_921D70) is called with no explicit array size -- the common case -- it inserts the new AllocaInst before the allocapt marker rather than at the current builder insertion point. This ensures that all alloca instructions cluster at the top of the entry block regardless of where in the function body they were requested, which is a hard requirement for LLVM's mem2reg pass to promote them to SSA registers.
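The sentinel mechanism can be illustrated with a toy model where the instruction list is a plain Python list (an assumption for clarity; the real structure is LLVM's intrusive ilist):

```python
class EntryBlock:
    """Toy entry block: allocas always land before the allocapt sentinel."""
    def __init__(self):
        self.insts = ["allocapt"]  # sentinel created by EmitFunction

    def create_tmp_alloca(self, name):
        # CreateTmpAlloca inserts before the sentinel, not at the
        # current builder insertion point.
        self.insts.insert(self.insts.index("allocapt"), f"alloca {name}")

    def append(self, inst):
        # normal builder insertion at the end of the block
        self.insts.append(inst)

bb = EntryBlock()
bb.append("store i32 1, ptr %p")
bb.create_tmp_alloca("%retval")   # requested late, but clusters at the top
```

After these calls the alloca sits above the store, which is what mem2reg requires.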

The sentinel is eventually dead-code-eliminated in a later pass since it produces no usable value.

GenerateFunctionProlog -- Parameter Lowering

The prolog iterates four parallel data structures in lockstep:

Cursor               Source                    Stride                Termination
EDG parameter node   Linked list from Decl     next at offset +112   nullptr
LLVM argument slot   F->arg_begin()            40 bytes              F->arg_end()
Type info entry      From the ABI classifier   40 bytes              (parallel with args)
Parameter index      1-based counter           +1                    (parallel with params)
A post-loop assertion validates that both cursors reached their end simultaneously: "Argument mismatch in generation function prolog!".

Struct Return: The agg.result Convention

Before entering the parameter loop, a helper (sub_938130) checks whether the first argument's ABI kind equals 2 (sret). When true, the prolog names the first LLVM argument "agg.result" and advances the argument cursor by one slot (+40 bytes), so that subsequent parameter processing starts at the second argument. This mirrors the standard LLVM sret convention where the caller pre-allocates space for a returned struct and passes a pointer as a hidden first parameter.

ABI Variant Dispatch

For each parameter, the ABI variant field at TypeInfo+12 selects one of four lowering paths:

Variant 0/1 -- Indirect/Aggregate Pass. The parameter arrives as a pointer to caller-allocated memory. If the type is an aggregate (struct/union/class/array -- type kinds 8--11 checked by IsAggregateType at sub_91B770), the prolog creates a local alloca named <param>.addr, stores the incoming argument into it, and registers the alloca in the declaration map via EmitParamDecl. If the type is a scalar, it goes directly to EmitParamDecl without an intermediate alloca.

Variant 2 -- Direct Pass (most common). The parameter is passed by value in a register or register pair. Two sub-paths exist:

  • Byval demotion path. When the ByvalDemotion flag (parameter a7) is set and the parameter carries a byval attribute (TypeInfo+16 nonzero), the prolog consults a global name-set (dword_4D04688) to decide whether to create a __val_param temporary. If selected, it allocates a "tmp" alloca via CreateTmpAlloca, stores the argument into it, names the alloca "__val_param" + param_name, and falls through to EmitParamDecl. The __val_param prefix is NVIDIA-specific and marks parameters that have been demoted from byval to local copy for downstream optimization passes.

  • Normal path. For non-byval scalars, calls EmitParamDecl directly. A guard validates that non-aggregate arguments are not marked indirect: "Non-aggregate arguments passed indirectly are not supported!".

Variant 3 -- Coercion. The parameter's LLVM type does not match the source type and requires a coercion cast. For aggregates, a "tmp" alloca is created. For scalars, the declaration is looked up and wrapped with a bitcast. The result is forwarded to EmitParamDecl.
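The dispatch can be condensed into a small decision function. The return labels are ours, chosen only to name the four outcomes described above; the real handlers emit IR rather than strings:

```python
def lower_param(variant: int, is_aggregate: bool,
                byval_demotion: bool, has_byval_attr: bool) -> str:
    """Condensed model of the per-parameter ABI variant dispatch."""
    if variant in (0, 1):                    # indirect / aggregate pass
        return "addr-alloca" if is_aggregate else "direct-decl"
    if variant == 2:                         # direct pass (most common)
        if byval_demotion and has_byval_attr:
            return "__val_param-copy"        # NVIDIA byval demotion
        return "direct-decl"
    if variant == 3:                         # coercion
        return "tmp-alloca" if is_aggregate else "bitcast-wrap"
    raise ValueError(f"unknown ABI variant {variant}")
```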

EmitParamDecl -- Registration

EmitParamDecl (sub_9446C0) performs the final steps for each parameter:

  1. For scalar (non-aggregate, non-indirect) parameters: creates an alloca named <param>.addr, stores the incoming argument into it, and names the argument with the original parameter name.
  2. Inserts the mapping (EDG decl pointer -> LLVM Value*) into a hash map with open-addressing/quadratic-probing collision resolution. A duplicate check guards against re-declaration: "unexpected: declaration for variable already exists!".
  3. If debug info is enabled (dword_4D046B4), emits debug metadata for the parameter via sub_9433F0.

Naming Convention Table

IR Value              Name Assigned
sret argument         "agg.result"
Unnamed parameter     "temp_param"
C++ this parameter    "this" (detected by bit 0 at EDG node offset +172)
Parameter alloca      <param_name> + ".addr"
Byval temp alloca     "__val_param" + <param_name>
Return value alloca   "retval"
Entry basic block     "entry"
Return basic block    "return"
Alloca sentinel       "allocapt"

CreateTmpAlloca Internals

CreateTmpAlloca (sub_921D70) computes alignment from the type size using _BitScanReverse64 (effectively log2(size)), looks up or creates the pointer-to-type in the module's type system, then delegates to CreateAllocaInst (sub_921B80). The key detail: when no explicit array size is provided, the alloca is inserted at the allocapt marker position (IRGenState+456+24), not at the current builder insertion point.
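Since _BitScanReverse64(size) is floor(log2(size)), the derived alignment is the largest power of two not exceeding the type size. A sketch of that computation (our reading of the decompilation; any clamping of very large sizes is not modeled):

```python
def alloca_align(type_size: int) -> int:
    """Alignment from type size via the bit-scan-reverse trick."""
    assert type_size > 0
    bsr = type_size.bit_length() - 1   # _BitScanReverse64 equivalent
    return 1 << bsr

# alloca_align(4) == 4, alloca_align(12) == 8, alloca_align(16) == 16
```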

Call Codegen

Call emission (sub_93CB50) is a 1,293-line function that handles direct calls, indirect calls, builtins, special intrinsics, and printf interception. It receives the caller's codegen context, the EDG call expression node, and an optional pre-allocated destination for aggregate returns.

Phase 1: Type Resolution

The callee operand is extracted from the call node's first operand slot (offset +72). The function resolves the callee's declaration via sub_72B0F0, then peels through the type chain -- stripping typedef aliases (kind 12) by following offset +160 -- until it reaches a pointer-to-function type (kind 6) wrapping a function type (kind 7). Fatal assertions guard both steps: "Expected pointer to function!" and "unexpected: Callee does not have routine type!".

Phase 2: Builtin Dispatch

For direct calls (opcode 20), the resolved callee declaration is checked for the builtin flag: byte[199] & 2. When set, the entire normal call path is bypassed. Control transfers to sub_955A70 (or sub_12B3FD0 on Path B), the builtin lowering mega-switch described in a later section. If the builtin returns an aggregate, the call codegen allocates an "agg.tmp" stack slot and emits a store of the result into it.

Phase 3: Intrinsic Special Cases

If the callee is not a builtin but carries an intrinsic ID (word[176] != 0), a handful of intrinsic IDs receive special treatment:

Intrinsic ID   Description
10214          Surface/texture primitive
10219, 10227   Warp-level primitives (detected via ((id - 10219) & 0xFFF7) == 0)
15752          Special return convention intrinsic

These dispatch to sub_939370, a dedicated handler that bypasses the normal ABI classification entirely.

Phase 4: Argument Processing

Arguments are codegen'd by walking the argument linked list and calling sub_921F50 on each expression. Results are collected into a dynamically-growing array (24 bytes per entry, managed by sub_C8D5F0).

When bit 1 of the call node's flags byte (offset +60) is set -- indicating variadic or reversed-evaluation convention -- arguments are first collected into a temporary linked list and then written into the array in reverse order. This preserves the C right-to-left evaluation order for variadic calls.

Phase 5: ABI Classification

The ABI classifier (sub_9378E0) receives the return type, parameter types, and byval flags, and produces a calling-convention descriptor. Each parameter gets an ABI kind:

ABI Kind   Meaning              Codegen Action
0          Direct (register)    Push value directly if scalar; alloca + store if byval aggregate
1          Indirect (pointer)   Push pointer directly (only valid for aggregates)
2          Indirect + byval     Push value directly (callee copies)
3          Coercion/expand      Multi-register split, handled by sub_923000

For the return value, ABI kind 2 means sret: a hidden first parameter is prepended to the argument list, pointing to a caller-allocated "tmp" alloca.

Phase 6: Callee Bitcast Folding

If the callee operand is a bitcast (byte[0] == 5), the optimizer walks back to the original function pointer and compares return types and parameter counts. If the signature matches exactly (pointer equality on type nodes, parameter-by-parameter comparison), the bitcast is folded out. This removes unnecessary bitcast wrappers that arise from C-style casts between compatible function pointer types.

Phase 7: Pre-Call Hooks and printf Interception

Debug location metadata is emitted via sub_92FD10. Then a special case: if the call is direct (opcode 20) and the callee name is literally "printf", control transfers to sub_939F40 which performs GPU printf lowering -- converting the printf call into a vprintf-style call that writes formatted output through the GPU's printf buffer mechanism.

Phase 8: preserve_n Operand Bundles

If the call node's preserve_data field (offset +64) is non-null, up to three operand bundles are attached to the call instruction:

preserve_data[0] >= 0  =>  "preserve_n_data"    = ConstantInt(value)
preserve_data[1] >= 0  =>  "preserve_n_control"  = ConstantInt(value)
preserve_data[2] >= 0  =>  "preserve_n_after"    = ConstantInt(value)

These NVPTX-specific operand bundles are register-pressure hints consumed by the instruction scheduler and register allocator. The value -1 means "not specified" and suppresses the bundle.
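A minimal sketch of the bundle construction, modeling the bundles as a dict (the real code builds LLVM OperandBundleDef objects):

```python
def build_preserve_bundles(preserve_data):
    """Build preserve_n operand bundles; -1 means 'not specified'."""
    names = ("preserve_n_data", "preserve_n_control", "preserve_n_after")
    return {name: value
            for name, value in zip(names, preserve_data)
            if value >= 0}   # -1 suppresses the bundle
```

For example, `build_preserve_bundles([16, 4, -1])` yields only the first two bundles.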

Phase 9: Call Emission and Attribute Attachment

The LLVM CallInst is created by sub_921880, which takes the callee, the argument array, return type, and the optional operand bundle. Calling-convention attributes (sret, byval, alignment) are collected by sub_93AE30 and attached to the call. For indirect calls, the instruction is named "call" for readability; direct calls inherit the callee's name.

Phase 10: Return Value Handling

Return ABI Kind               Handling
0 or 1 (direct scalar)        Return the CallInst result directly
0 or 1 (direct aggregate)     Allocate "agg.tmp", store the result, return the alloca
2 (sret)                      Return the sret pointer (aggregate) or load from it (scalar)
3 (expanded/multi-register)   Call sub_923000 to split across multiple extracts

For indirect calls, callalign metadata is constructed by querying the alignment requirement of the return type and each argument type, wrapping them in an MDTuple, and attaching it to the call instruction. This metadata is consumed by the NVPTX backend to generate correct alignment annotations in PTX.

Call Emission Pseudocode

EmitCallExpr(Result *Out, CodegenCtx *Ctx, CallNode *Call, u64 DestFlags, u32 Align):

  callee_decl = ResolveCallee(Call->operand[0])
  func_type   = PeelTypedefs(callee_decl->type)  // kind 6 -> kind 7

  // ---- Builtin fast path ----
  if Call->opcode == CALL_DIRECT  AND  callee_decl->flags[199] & 2:
      result = BuiltinLowering(Ctx, Call)
      if isAggregate(func_type->returnType):
          dest = DestFlags.ptr  OR  CreateTmpAlloca("agg.tmp")
          Store(result, dest, ComputeAlign(returnType))
          Out = {dest, INDIRECT, sizeof(returnType)}
      else:
          Out = result
      return

  // ---- Special intrinsics ----
  if callee_decl->intrinsicID in {10214, 10219, 10227, 15752}:
      return SpecialIntrinsicHandler(Out, Ctx, callee_decl->intrinsicID, Call)

  // ---- Normal call path ----
  callee_val = CodegenCallee(Ctx, Call->operand[0])
  args[]     = CodegenArguments(Ctx, Call->argList)
  if Call->flags & REVERSED_EVAL:
      Reverse(args)

  abi_desc   = ClassifyABI(func_type->returnType, paramTypes, byvalFlags)

  if abi_desc.returnIsSRet:
      sret_ptr = DestFlags.ptr  OR  CreateTmpAlloca("tmp")
      PrependArg(args, sret_ptr)

  for each (arg, abi_entry) in zip(args, abi_desc.params):
      if abi_entry.kind == DIRECT  AND  abi_entry.isByval:
          tmp = CreateAllocaForAggregate(arg)
          Store(arg, tmp)
          arg = tmp
      elif abi_entry.kind == INDIRECT:
          assert isAggregate(arg.type)

  callee_val = FoldCalleeBitcast(callee_val, func_type)

  EmitDebugLoc(Ctx, Call->srcLoc)

  if Call->opcode == CALL_DIRECT  AND  callee_name == "printf":
      return PrintfExpansion(Ctx, abi_desc, args, Call->srcLoc)

  bundle = BuildPreserveNBundle(Call->preserveData)
  call_inst = EmitCall(func_type, callee_val, args, bundle)
  AttachCCAttrs(call_inst, abi_desc)

  Out = HandleReturnValue(call_inst, abi_desc, func_type->returnType)

Inline Assembly Codegen

The inline asm handler (sub_1292420, 53 KB) translates a CUDA __asm__() statement into an LLVM InlineAsm call instruction through a strict 7-phase pipeline. A nearly-identical duplicate exists at sub_932270 for the Path A codegen context -- same parsing logic, same constraint table, different diagnostic function pointers.

Phase 1: Template String Parsing

The raw PTX template string from the EDG AST is scanned character-by-character into a fragment array. Each fragment (48 bytes) is either a literal text chunk (kind=0) or an operand substitution reference (kind=1 with an operand index at offset +0x28).

The parser handles the CUDA-to-LLVM syntax translation:

CUDA Syntax              LLVM IR Output                         Parser Action
$ (literal dollar)       $$                                     Escape doubling
%%                       %                                      Literal percent
%N (operand ref)         Fragment kind=1, index=N               Multi-digit decimal parse
%= (unique ID)           ${:uid}                                LLVM unique-identifier modifier
%[name]                  --                                     Fatal: "symbolic operand reference not supported!"
%cN (modifier+operand)   Fragment kind=1, modifier=c, index=N   Alpha char + decimal parse

For operands referencing string literal constants (the C constraint), the parser resolves the constant through the EDG value chain, validates the type is array of char, extracts each byte, escapes any $ characters, strips the trailing NUL, and emits the entire string as a literal fragment.

Phase 2: Template Reconstruction

The fragment array is serialized into the final LLVM inline-asm template string:

  • Literal fragments: appended verbatim.
  • Operand references without modifier: converted to $N (e.g., operand 3 becomes $3).
  • Operand references with modifier: converted to ${N:c} (e.g., operand 0 with modifier h becomes ${0:h}).

This is where the CUDA %N convention is translated to LLVM's $N convention. Literal % characters in PTX (like %tid.x) pass through unchanged because they were never parsed as operand references.
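The core escaping rules can be sketched as a small scanner. This reimplements only the `$`, `%%`, and `%N` rules from the Phase 1 table; modifier (`%cN`) and `%=` handling are omitted, and the fragment array is collapsed into direct string output:

```python
def translate_asm_template(cuda_tmpl: str) -> str:
    """Translate a CUDA inline-asm template to LLVM's $-based convention."""
    out, i = [], 0
    while i < len(cuda_tmpl):
        c = cuda_tmpl[i]
        if c == "$":                               # literal dollar -> escaped
            out.append("$$"); i += 1
        elif c == "%" and i + 1 < len(cuda_tmpl):
            nxt = cuda_tmpl[i + 1]
            if nxt == "%":                         # "%%" -> literal percent
                out.append("%"); i += 2
            elif nxt.isdigit():                    # operand reference %N
                j = i + 1
                while j < len(cuda_tmpl) and cuda_tmpl[j].isdigit():
                    j += 1
                out.append("$" + cuda_tmpl[i + 1:j]); i = j
            else:                                  # e.g. "%tid.x" passes through
                out.append(c); i += 1
        else:
            out.append(c); i += 1
    return "".join(out)
```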

Phase 3: Constraint String Construction

The parser iterates the EDG operand linked list, building a comma-separated LLVM constraint string. Each EDG operand carries a constraint type-chain -- a linked list of tag bytes that map through a 256-byte global lookup table (aXg0123456789rh[]) to produce LLVM constraint letters.

Output operands (flags & 2 != 0):

  • Pointer types: constraint prefix "=*" + letters (indirect output).
  • Non-pointer types: constraint prefix "=" + letters (direct output).
  • Read-write operands (byte at +24 == 3): a tied input operand is generated with the output's index as the constraint, linking them as a two-address pair.

Input operands:

  • Same tag-to-letter mapping.
  • Tags 10--19 are prohibited: "tied input/output operands not supported!" (GCC-style matching-digit constraints are not implemented).
  • Tag 23 (the C constraint on inputs) creates an undef value -- the constant's value was already inlined into the template string during Phase 1.

Special tag handling:

Tag                                  Effect
8, 9                                 Sets is_address + is_memory flags; tag 9 also emits "imr" composite constraint
0x14, 0x15, 0x16, 0x18, 0x26, 0x2A   Pointer-through types: follow type chain, set is_address
0x19, 0x1B, 0x1C                     Memory constraints
23                                   Remapped to tag 20 before table lookup

Phase 4: Clobber List

The EDG clobber linked list (at asmInfo+144) is iterated. Each clobber node has a tag byte selecting the clobber type:

  • Tag 1: Memory clobber. Appends ",~{memory}" to the constraint string.
  • Tag 58: Named register clobber. Uses the name string from the node. Appends ",~{<name>}".
  • Other tags: Looks up the register name from a global table (off_4B6DCE0[tag]). Appends ",~{<name>}".

Phase 5: InlineAsm Object Creation

The LLVM function type for the asm is constructed based on the output count:

  • Zero outputs: void return type.
  • One output: scalar return type matching the output operand.
  • Multiple outputs: anonymous struct return type.

The volatile/sideeffect flag is read from asmInfo+128 (bit 2). A diagnostic (0xE9F) warns when outputs exist but the asm is not marked volatile, as this risks miscompilation.

The InlineAsm object is created via InlineAsm::get(funcType, asmString, constraintString, hasSideEffects, isAlignStack=0, dialect=0) and a CallInst is emitted to invoke it.

Phase 6: Result Extraction

For single-output asm, the CallInst result is used directly. For multiple outputs, each result is extracted with extractvalue instructions:

  • Results with type size <= 16 bytes: a compact extractvalue path.
  • Results with type size > 16 bytes: a full instruction node (88 bytes) is allocated, the extractvalue is constructed with explicit index arrays, linked into the basic block's instruction list, and named "asmresult".

Each extracted value is then stored into its output destination via sub_12843D0, which reads the output codegen-info records built during Phase 3.

Phase 7: Cleanup

All temporary vectors and strings are freed: the fragment array (with per-element string cleanup), constraint strings, operand/type/destination vectors, and tied-operand tracking arrays.

End-to-End Example

CUDA source:    __asm__("mov.u32 %0, %tid.x" : "=r"(result));

Phase 1 parse:  [literal("mov.u32 "), operand(idx=0), literal(", %tid.x")]
Phase 2 recon:  "mov.u32 $0, %tid.x"
Phase 3 constr: "=r"
Phase 4 clobber: ""
Phase 5 create: InlineAsm::get("mov.u32 $0, %tid.x", "=r", sideeffects=true)
                call i32 asm sideeffect "mov.u32 $0, %tid.x", "=r"()
Phase 6 extract: (single output -- use call result directly)
                 store i32 %asm_result, i32* %result.addr

Builtin Lowering

The builtin lowering mega-switch (sub_12B3FD0, 103 KB) is one of the largest single functions in the binary. It handles ~250 builtin IDs across ~130 case labels, dispatching CUDA intrinsic functions like __syncthreads(), __shfl_sync(), and __hmma_m16n16k16_mma_f16f16 into LLVM IR.

Entry Logic

The function extracts the callee from the call expression, validates the builtin bit (flags byte[199] & 2), then looks up the builtin ID by name via sub_12731E0. If the ID is 0 (name not in the builtin table), execution falls through to the LLVM intrinsic fallback path at line 3154.

Five Lowering Strategies

Strategy                  Usage (%)        Mechanism
Sub-handler delegation    66% (~165 IDs)   Calls a specialized function for a family of builtins
Intrinsic call emission   12% (~30 IDs)    1:1 mapping to a single llvm.nvvm.* intrinsic via sub_1285290
Inline IR generation      10% (~25 IDs)    Builds IR nodes directly (alloca, load, store, cast, insertvalue)
Table-driven selection    10% (~25 IDs)    Selects intrinsic ID from a table keyed by operand type/size
SM-gated conditional      2% (~5 IDs)      Different lowering depending on target SM version

Per-Category Dispatch

Atomics and synchronization (IDs 0xB5--0xCC, 181--204). Atomic operations delegate to sub_12A7DA0; fences and barriers to sub_12AB550. Cases 0xBA--0xBC map directly to LLVM intrinsic 6 (likely llvm.nvvm.atomic.*) with type-overloaded arguments. Case 0xCB is SM-gated: on SM <= 63 it emits an inline constant; on SM >= 70 it emits intrinsic 3769.

Warp shuffle (IDs 0x15F--0x166, 351--358). All eight variants delegate to sub_12ABB90 parameterized by shuffle mode (0=idx, 1=up, 2=down, 3=butterfly) and sync flag (0=legacy, 1=__shfl_sync_*). The clamp flag distinguishes butterfly from other modes.

Warp vote/ballot (IDs 0x12E--0x135, 0x152--0x159, 0x18B--0x192). Three groups of 8 IDs each, all delegating to sub_12B3540 with the builtin ID as a discriminator. This covers __ballot_sync, __all_sync, __any_sync across integer/float/predicate operand types.

Surface and texture operations (IDs 0xCF--0x113, 0x287--0x2A5, 207--275 + 647--677). The largest category at ~95 IDs (38%). Organized into pairs using two sub-handlers: sub_12ADE80(ctx, intrinsic_base, surface_type, variant, args) for individual load/store operations, and sub_12AA9B0(ctx, surface_type, expr) for combined operations. Surface types are encoded as integers (0=generic, 1=1D, 5=2D, 7=3D, 8=cubemap, 10=1D array, 11=2D array, 14=buffer). Intrinsic bases 3701/3702 are primary read/write; 3698/3699 are 2D-array variants.

The texture handler (case 0x287) is the most complex single case at ~230 lines. It walks the AST to extract the texture name string and return element type, constructs an intrinsic name as "<texname>_<typename>" using a type-name resolution switch (mapping integer subtypes 0--10 to strings like "uchar", "int", "ulonglong"), and emits the call. A global flag (dword_4F06B98) controls whether plain char maps to uchar or schar.

Tensor core / WMMA (IDs 0x16E--0x1D9, 0x2A6--0x2E8, 366--473 + 678--744). The second-largest category at ~85 IDs (34%). Three sub-handlers partition the work: sub_12AC1A0 handles wmma::mma_sync with bias/scale flags (has_bias, has_scale) encoding four accumulator modes; sub_12AC5F0 handles store_matrix_sync; sub_12ACA80 handles load_matrix_sync. IDs group into triplets by matrix shape: m16n16k16, m32n8k16, m8n32k16, m16n16k8 (TF32), bf16, and fp8 (SM 89+) families.

WGMMA (IDs 0x2E9--0x302, 745--770). SM 90+ warpgroup MMA operations. Cases 0x2E9--0x2EE handle fence/commit/wait. Cases 0x2F1--0x2FC implement __wgmma_mma_async through a massive ~800-line handler that selects from a 144-entry intrinsic table spanning IDs 5304--5447. The table is indexed by a 5-dimensional grid: N-size (16/32/64/128), B-operand source (shared vs register), element type (s64 vs other), scale/negate flags, and case variant. Mode bits are packed into a single integer: bit0=accumulate | bit1=transpose | bit2=negate-C | bit4=negate-A.
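The mode-bit packing can be written out directly (bit positions from the text above; the keyword names are ours):

```python
def pack_wgmma_mode(accumulate=False, transpose=False,
                    negate_c=False, negate_a=False) -> int:
    """Pack WGMMA mode flags into the single integer described above."""
    mode = 0
    if accumulate: mode |= 1 << 0
    if transpose:  mode |= 1 << 1
    if negate_c:   mode |= 1 << 2
    if negate_a:   mode |= 1 << 4   # note bit 3 is unused
    return mode
```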

Memory copy (IDs 0x199, 0x291--0x299, 409 + 657--665). Memcpy variants encode alignment directly in the builtin ID: ID 658 = align 2, ID 659 = align 4, ID 660 = align 8, ID 661 = align 16. The actual emission delegates to sub_12897A0. Memset operations (IDs 410, 663, 665) delegate to sub_12A6DF0.
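The memcpy IDs above follow a power-of-two pattern (658 maps to align 2, 659 to 4, 660 to 8, 661 to 16), i.e. align = 2^(id - 657). This is our observation from the listed values, not a formula recovered from the binary:

```python
def memcpy_align_from_id(builtin_id: int) -> int:
    """Decode the alignment encoded in an aligned-memcpy builtin ID."""
    assert 658 <= builtin_id <= 661, "aligned memcpy IDs only"
    return 2 ** (builtin_id - 657)
```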

TMA bulk operations (IDs 0x19B--0x1A0, 411--416). Cases 0x19B and 0x19C are the largest individual handlers (~300 and ~450 lines respectively) for SM 90+ tensor memory access bulk copy/scatter operations. They build operand vectors iteratively and select from intrinsic tables indexed by element count (IDs 4218--4223 for stores, 4244--4250 for loads).

LLVM Intrinsic Fallback Path

When the builtin ID is 0, the default path (lines 3154--3407) looks up the LLVM intrinsic by name via sub_15E2770. If the intrinsic is type-overloaded, argument types are used to resolve the declaration. Each argument is lowered via sub_128F980, with type-mismatch bitcasts (opcode 47) and vector zexts (opcode 33) inserted as needed. Struct-return intrinsics are handled by iterating the return struct's fields with extractvalue.

Function Attributes

CUDA function attributes are lowered through a three-stage pipeline: EDG frontend parsing, attribute emission during IR generation, and a final metadata-attachment pass.

Stage 1: Frontend Parsing (sub_64F1A0)

The EDG parser scans the token stream for preserve_n_data, preserve_n_control, and preserve_n_after identifiers, parses each as an integer, and stores them in a 12-byte struct at offset +336 of the function declaration node:

struct preserve_reg_info {
    int32_t preserve_n_data;     // +0, -1 = not specified
    int32_t preserve_n_control;  // +4, -1 = not specified
    int32_t preserve_n_after;    // +8, -1 = not specified
};

Stage 2: Attribute Emission (sub_12735D0)

During IR generation, the attribute emitter checks declaration flags and writes attribute bundles:

  • Bit 0x20 at decl+198 (kernel function): emits ("kernel", 1). Then iterates the parameter array (40-byte entries); for each parameter with byte[+33] != 0, emits ("grid_constant", param_index) where param_index is 1-based. This marks individual kernel parameters as grid-constant, enabling the backend to place them in constant memory.

  • Bit 0x04 at decl+199 (custom ABI): emits ("full_custom_abi", 0xFFFFFFFF).

  • Preserve-reg struct at decl+336: for each of the three fields, if the value is >= 0, emits the corresponding attribute and then writes -1 back (consumed pattern) to prevent double-emission.
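The consumed pattern on the preserve-reg struct can be illustrated with a small sketch (field and callback names are illustrative, not recovered symbols):

```python
NOT_SPECIFIED = -1  # sentinel value of unset fields in the 12-byte struct

def emit_preserve_attrs(preserve, emit):
    # Emit each field that is >= 0 exactly once, then write -1 back
    # (the consumed pattern) so a second pass over the same declaration
    # emits nothing.
    for key in ("preserve_n_data", "preserve_n_control", "preserve_n_after"):
        if preserve[key] >= 0:
            emit(key, preserve[key])
            preserve[key] = NOT_SPECIFIED
```

Running the emitter twice over the same declaration produces attributes only on the first pass, which is exactly what prevents double-emission.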

Stage 3: Metadata Attachment (sub_1273F90)

The reader pass iterates all functions' attribute bundles and re-encodes them as LLVM named metadata:

grid_constant. Per-parameter type values are collected into a vector, then bundled under the MDString key "grid_constant" as an MDTuple. The downstream consumer sub_CE8660 queries this metadata to determine aliasing/readonly semantics for kernel parameters.

preserve_reg_abi. The three preserve_n values are collected with their MDString keys ("preserve_n_data", "preserve_n_control", "preserve_n_after") into a vector, then bundled under the composite key "preserve_reg_abi" as an MDTuple. The register allocator and prologue-epilogue inserter query this via sub_314D260.

full_custom_abi. Emitted as a simple (MDString, MDNode(i32 0xFFFFFFFF)) pair. When a function uses a custom ABI but does NOT carry the full_custom_abi flag, the alternative "numParams" key records the explicit parameter count as a nested MDTuple instead.

Final Metadata Layout

For a __global__ kernel with grid_constant parameters and register preservation:

!kernel_attrs = !{
  !MDString("kernel"), !MDNode(i32 1),
  !MDString("grid_constant"), !MDTuple(
    !MDNode(i32 <param1_type>), !MDNode(i32 <param2_type>), ...
  ),
  !MDString("preserve_reg_abi"), !MDTuple(
    !MDString("preserve_n_data"),    !MDNode(i32 N),
    !MDString("preserve_n_control"), !MDNode(i32 M),
    !MDString("preserve_n_after"),   !MDNode(i32 K)
  )
}

Attribute Semantics

| Attribute | Meaning | Backend Effect |
|---|---|---|
| grid_constant | Kernel parameter is immutable across the grid | Place in constant memory; optimize loads |
| preserve_n_data | N data registers must be preserved across calls | Register allocator reserves R0--RN |
| preserve_n_control | N predicate registers to preserve | Prologue/epilogue saves predicates |
| preserve_n_after | N registers preserved after a call (callee-save count) | Adjusts spill/restore boundaries |
| full_custom_abi | Function bypasses standard CUDA calling convention | Parameter passing determined by explicit annotations |
| numParams | Explicit parameter count for non-full_custom_abi functions | Custom ABI parameter setup |

Cross-Reference

| Address | Function | Role |
|---|---|---|
| sub_946060 | EmitFunction | Creates entry BB, allocapt, return BB, dispatches to prolog |
| sub_938240 | GenerateFunctionProlog | Iterates parameters, ABI dispatch, alloca emission |
| sub_9446C0 | EmitParamDecl | Creates alloca+store, registers decl->Value mapping |
| sub_921D70 | CreateTmpAlloca | Alloca creation with alignment, inserted at allocapt |
| sub_921B80 | CreateAllocaInst | Low-level alloca IR emission |
| sub_938130 | IsSRetReturn | Checks ABI kind == 2 |
| sub_91B770 | IsAggregateType | Type kinds 8--11 (struct/union/class/array) |
| sub_93CB50 | EmitCallExpr | Full call instruction emission (1,293 lines) |
| sub_9378E0 | ClassifyABI | Return + parameter ABI classification |
| sub_939F40 | PrintfExpansion | GPU vprintf lowering for printf calls |
| sub_93AE30 | CollectCCAttrs | Builds sret/byval/align attribute list |
| sub_955A70 / sub_12B3FD0 | BuiltinLowering | Mega-switch over ~250 builtin IDs |
| sub_1292420 / sub_932270 | EmitInlineAsm | 7-phase asm template-to-IR pipeline |
| sub_12735D0 | EmitFunctionAttrs | Writes attribute bundles during IR gen |
| sub_1273F90 | ReadFunctionAttrs | Attaches LLVM named metadata from bundles |
| sub_64F1A0 | ParsePreserveAttrs | EDG parser for preserve_n_* tokens |

Type Translation, Globals & Special Variables

The type translation subsystem is one of the most algorithmically complex parts of NVVM IR generation. It converts the Edison Design Group (EDG) intermediate language type graph --- which can contain arbitrary mutual recursion, template-dependent types, and CUDA address-space qualifiers --- into a well-formed LLVM type system. The same IR generation phase also handles global variable materialization (with CUDA memory-space assignment), kernel metadata emission, and the translation of CUDA built-in variables (threadIdx, blockIdx, etc.) into LLVM intrinsic calls.

| Component | Function |
|---|---|
| Type translation entry | sub_91AED0 (640 bytes) |
| Fixed-point driver | sub_91AB30 (896 bytes) |
| Topological sort | sub_919CD0 (896 bytes, 10-level BFS) |
| Type-kind dispatch | sub_918E50 (2,400 bytes, 11+ categories) |
| Type-pair comparator | sub_911D10 (1,024 bytes) |
| Global var creation | sub_915C40 (2,018 bytes) |
| Address space logic | sub_916430 (482 bytes) |
| Annotation emitter | sub_914410 (3,524 bytes) |
| Kernel metadata | sub_93AE30 (~5,600 bytes) |
| Special var classifier | sub_920430 (old) / sub_127F7A0 (new) |
| Special var codegen | sub_922290 (old) / sub_1285550 (new) |

EDG-to-LLVM Type Translation

The Problem

EDG represents C++ types as a graph of IL nodes linked through child/parent pointers, member chains, and scope references. This graph can be arbitrarily cyclic: consider struct A { B* b; }; struct B { A* a; }; where translating A requires translating the pointee type B, which requires translating the pointee type A. Template instantiations add another dimension --- a template class body may reference types that cannot be resolved until the template arguments themselves are translated. The type translator must produce valid LLVM types from this graph without infinite recursion or stale mappings.

NVIDIA solves this with a fixed-point iteration scheme: translate every type, detect whether any translation changed a previously-emitted LLVM type, and if so, repeat the entire pass. The iteration terminates when a full pass produces no changes.

Context Object Layout

The type translation pass operates on a context structure initialized by sub_91AB30 and threaded through every function in the subsystem:

| Offset | Size | Field |
|---|---|---|
| +0x000 | 8 | debug_logger --- nullable, enables trace output when non-null |
| +0x008 | 8 | pass_list_ptr --- vector of (vtable_ptr, pass_instance) pairs |
| +0x010 | 8 | target_info |
| +0x018 | 8 | address_space_map --- qualifier-to-LLVM-AS translation table |
| +0x020 | 8 | llvm_context --- the LLVMContext* |
| +0x028 | 8 | module_ptr |
| +0x038 | 8 | edg_node_map --- hash table: EDG nodes to LLVM values |
| +0x038 | 16 | visited_set --- open-addressed hash set for dedup (at +0x38..+0x48) |
| +0x050 | 4 | iteration_counter |
| +0x060 | 12 | visited_set control (count, capacity, bucket_count) |
| +0x078 | 8 | processed_list --- vector of completed types |
| +0x090 | 16 | type_cache --- hash table: EDG type pointer to LLVM Type* |
| +0x0A0 | 8 | remap_list --- vector of type-remapping entries |
| +0x150 | 8 | alignment_table --- target-specific alignment data |
| +0x168 | 4 | threshold --- type index below which scope lookups are attempted |
| +0x2A0 | 16 | pending_replacements --- vector of (old_type, new_type) pairs |
| +0x310 | 1 | flags --- bit-packed control flags |

Fixed-Point Iteration Algorithm

The entry point sub_91AED0 recovers pass infrastructure objects by iterating a vector<pair<void*, void*>> at context+8. Each element is 16 bytes: a vtable pointer identifying the pass, and a pass instance pointer. The function compares vtable pointers against 8 known globals to extract the data layout, reflect pass, target transform info, module context, dominator tree, and alias analysis results. It then calls sub_91AB30, the actual iteration driver.

// sub_91AB30: TypeTranslationPass driver
fn translate_all_types(ctx: &mut TypeTransCtx, module: &EDGModule) {
    // Optional pre-processing (gated by byte_3C34E60)
    if PRE_PROCESS_FLAG {
        pre_process_types(ctx, module);  // sub_90F800
    }

    // Gather initial flags from all module members
    for member in module.members() {       // linked list from module+80
        gather_initial_flags(member);      // sub_AA3700
    }

    // MAIN FIXED-POINT LOOP
    loop {
        let changed = single_iteration(ctx, module);  // sub_91AA50
        if !changed { break; }
    }

    // Optional late fixup pass (gated by byte_3C35480)
    if OPTIMIZATION_FLAG {
        finalize_late_types(ctx, module);  // sub_90F750
        loop {
            let changed = late_fixup(ctx, module);  // sub_917E30
            if !changed { break; }
        }
    }

    // Optional cleanup (gated by dword_3C351E0)
    if CLEANUP_FLAG {
        cleanup_stale_types(ctx);  // sub_90EB40
    }

    flush_and_finalize(ctx);  // sub_909590
}

Each single iteration (sub_91AA50) performs three steps:

  1. Topological sort (sub_919CD0): Build a dependency ordering of all EDG type nodes reachable from the module root.
  2. Invalidate (sub_913880 for each type in reverse order): Remove stale cache entries for types whose dependencies have changed.
  3. Process (sub_9197C0 for each type in reverse order): Translate each type, returning whether any LLVM type was modified.

The iteration returns the logical OR of all sub_9197C0 results. If any type replacement occurred, the outer loop repeats.

10-Level Topological Sort

The function sub_919CD0 produces a dependency-ordered list of EDG types. Rather than a standard DFS-based topological sort, it uses a 10-level iterative BFS implemented with sorted sets at each level. This unusual depth accommodates deeply nested C++ class hierarchies with multiple inheritance, where types at depth N must be resolved before types at depth N+1 can be translated.

Each level maintains a sorted set (vector-backed, managed by sub_6CDA50 for initialization and sub_6CDC80 for merge/sort). Starting from the module's member list, the algorithm:

  1. Inserts root-level type declarations into level 0.
  2. For each level 0..9, discovers type dependencies and inserts them into the next level.
  3. After all 10 levels, concatenates the sets in reverse (leaf types first, composite types last).

The output is a vector of EDG type node pointers ordered so that leaf types precede the composite types that reference them.
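A minimal model of this level-by-level ordering (illustrative only; the real implementation uses sorted sets managed by sub_6CDA50/sub_6CDC80):

```python
def level_order_types(roots, deps, max_levels=10):
    # deps(t) yields the types that t references. Level 0 holds the
    # root declarations; each subsequent level holds their dependencies.
    # Concatenating the levels in reverse puts leaf types before the
    # composite types that reference them.
    levels = [sorted(set(roots))]
    for _ in range(max_levels - 1):
        nxt = sorted({d for t in levels[-1] for d in deps(t)})
        if not nxt:
            break
        levels.append(nxt)
    out, seen = [], set()
    for level in reversed(levels):
        for t in level:
            if t not in seen:
                seen.add(t)
                out.append(t)
    return out
```

For `struct S { int* p; }` the dependency chain S → ptr_int → int yields the order int, ptr_int, S: leaves first.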

EDG Type Kind Dispatch

The core dispatcher sub_918E50 (2,400 bytes) reads the type-kind byte at edg_node+16 and routes to specialized handlers:

| Kind Byte | Value | Handler | Description |
|---|---|---|---|
| 0x00--0x10 | 0--16 | Primitive dispatch | void, bool, char, int, float, double, etc. |
| 0x11 | 17 | Void special | Void type with swap handling in comparator |
| 0x05 | 5 | sub_5FFE90 | Qualified type (const/volatile/restrict) --- carries address-space info |
| 0x0D | 13 | Enum path | Enum type bridging C/C++ enum constants to LLVM integers |
| 0x0E | 14 | Function path | Function type with parameter chain traversal |
| 0x1A | 26 | sub_915850 | Array type (subscript form with enumeration base) |
| 0x1B | 27 | Inline handler | Compound type (struct/union/class) --- multi-child with dedup hash |
| 0x32--0x33 | 50--51 | Union variants | Union type (two internal representations) |
| 0x36 | 54 | sub_918C40 | Typedef / using declaration --- chains through EDG resolution |
| 0x37 | 55 | Using variant | Using declaration variant |
| 0x4B--0x4C | 75--76 | Pointer/ref | Pointer and reference types --- carry qualifier words for address spaces |
| 0x4D | 77 | Member pointer | Pointer-to-member type |
| 0x4E | 78 | sub_914070 | Dependent/nested type --- requires scope resolution |

For types with kind > 23 that are not special-cased, a default handler applies a bitmask test: 0x100000100003FF >> (kind - 25). If the low bit is set, the type requires scope tracking (kinds 25--34 selectively, plus kinds 57 and 73). The handler then looks up any existing LLVM type for this EDG type via the scope table, and if the mapping has changed, triggers a replacement plus metadata propagation.

Compound Type (Struct/Class) Translation

When kind 0x1B (27) is encountered, the dispatcher uses an inline handler that:

  1. Reads the child count from node+20 & 0xFFFFFFF and divides by 2 (children come in pairs: type descriptor + offset/alignment info).
  2. Builds a reference-counting hash table to detect shared sub-types. If a child type appears exactly once, it can be translated independently. If it appears multiple times, it indicates a shared base class or diamond inheritance pattern.
  3. For unique children, calls sub_911D10 (the type-pair comparator) with the parent scope to translate.

Diamond inheritance is detected by the reference count exceeding 1, which prevents the comparator from making conflicting replacements for the same sub-type.

Type-Pair Comparison Engine

The function sub_911D10 is the core workhorse for comparing and replacing type pairs. It takes (context, type_a, type_b, scope_pair, is_recursive_flag) and maintains a local worklist of (type_a, type_b) pairs:

fn compare_and_replace(ctx, type_a, type_b, scope, is_recursive) {
    let mut worklist = vec![(type_a, type_b)];

    while let Some((a, b)) = worklist.pop() {
        if a == b { continue; }

        // Normalize: larger type index = v15, smaller = v14
        let (v14, v15) = if type_index(a) < type_index(b) { (a, b) } else { (b, a) };

        // Primitive vs compound: record scope mapping
        if v14.kind <= 0x17 && v15.kind > 0x17 {
            record_scope_mapping(ctx, v14, v15);
        }

        // Check for UINT_MAX sentinel (incomplete type) -> swap
        if scope_table_lookup(v15) == UINT_MAX {
            swap(&mut v14, &mut v15);
        }

        // Perform actual replacement
        replace_type(ctx, v14, v15, is_recursive);

        // For pointer/reference types: propagate through children
        if v15.kind == 75 || v15.kind == 76 {
            let qualifier = v15.qualifier_word & 0x7FFF;
            // Address space qualifiers trigger child propagation
            if qualifier == 1 || qualifier == 32 || qualifier == 33 || qualifier == 14 {
                worklist.push((v14.child, v15.child));
            }
        }

        // For union types: push all variant children
        if v15.kind == 50 || v15.kind == 51 {
            for child in v15.children() { worklist.push((v14, child)); }
        }
    }
}

This worklist-based approach avoids stack overflow on deeply nested types while correctly propagating address-space information through pointer chains.

CUDA Address Space Propagation

CUDA memory-space qualifiers flow through the EDG type system via a 15-bit qualifier word stored at edg_node+18. The low 15 bits encode the qualifier ID; bit 15 is a negation flag. During type translation, when the type-pair comparator encounters pointer or reference types (kinds 75/76), it reads the qualifier word and maps it to an LLVM address space:

| EDG Qualifier | Value | LLVM Address Space | CUDA Meaning |
|---|---|---|---|
| Generic | 0 | 0 | Generic (default) |
| Global | 1 | 1 | __device__ / global memory |
| Function | 14 | --- | Method qualifier (not an address space) |
| Array context A | 26 | --- | Array subscript qualifier A |
| Array context B | 27 | --- | Array subscript qualifier B |
| Shared | 32 | 3 | __shared__ memory |
| Constant | 33 | 4 | __constant__ memory |

The conversion is performed by sub_5FFE90 (qualifier to LLVM address space number) and sub_5A3140 (creates the appropriately qualified LLVM pointer type). The function sub_911CB0 combines the conversion with a type-index computation: it takes (type_kind - 24) as a base and combines it with the qualifier to produce a unique index for the scope table.

Address-space propagation is transitive: if struct S contains a __shared__ int* field, the shared qualifier must be reflected in the LLVM type of the pointer field within S. The type-pair comparator achieves this by pushing child pairs onto its worklist whenever a pointer/reference type carries a non-zero qualifier.
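The qualifier-word decoding and mapping can be sketched as follows (table values from above; helper names are illustrative, not recovered symbols):

```python
QUAL_TO_AS = {0: 0, 1: 1, 32: 3, 33: 4}  # generic, global, __shared__, __constant__

def decode_qualifier_word(word):
    # edg_node+18: low 15 bits = qualifier ID, bit 15 = negation flag
    return word & 0x7FFF, bool(word & 0x8000)

def llvm_address_space(word):
    qual, _negated = decode_qualifier_word(word)
    # Non-address-space qualifiers (e.g. 14, 26, 27) map to nothing
    return QUAL_TO_AS.get(qual)
```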

Five Caching Layers

To avoid redundant work, the translator maintains five distinct caches:

| Cache | Location | Key | Value | Purpose |
|---|---|---|---|---|
| Visited set | ctx+0x38..+0x48 | EDG node ptr | (presence only) | Prevents re-processing the same declaration |
| Type cache | ctx+0x70..+0x94 | EDG decl ptr | child type ptr | Tracks which LLVM type a declaration was previously translated to |
| Type-value map | Per-call in sub_913E90 | EDG type ptr | LLVM Type* | Caches enum/struct translations; supports inline mode (up to 4 entries) |
| Scope table | ctx+0x10, hash at +8/+24 | scope ID | type info | Maps scope identifiers to type information for type-pair comparison |
| Type index table | ctx+0x98+ | compound key | monotonic index | Linear ordering of processed types; Jenkins-like hash for compound keys |

All hash tables use the standard DenseMap infrastructure with NVVM-layer sentinels (-8 / -16). See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth policy.

Cache invalidation is handled by sub_913880, which walks a type's member list and removes stale entries. Invalidation cascades: if a struct type is invalidated, all member types that are non-trivial (not kind 54/55 typedef/using) are also removed from the cache.

Template Specialization

Template types are handled by sub_918790 (struct/class type translation with template instantiation support):

  1. sub_41F0F0 extracts template argument descriptions from the EDG IL into a 1,536-byte stack buffer (heap fallback for > 50 arguments).
  2. sub_908040 performs syntactic template argument substitution, producing two lists: substituted types and original types.
  3. If both lists are non-empty and the optimization flags byte_3C35480 + byte_3C353A0 are both set, sub_910920 performs semantic type matching using the full optimization infrastructure.
  4. Otherwise, sub_906590 creates the LLVM type directly from the substitution result.

The two-pass approach (syntactic substitution then semantic matching) handles cases like template<typename T> struct Wrapper { T* data; } where Wrapper<__shared__ int> must produce a pointer in address space 3 --- the syntactic pass substitutes T = __shared__ int, and the semantic pass verifies the LLVM type is correct.

Template specialization support is entirely optional and gated behind configuration flags, allowing it to be disabled for faster compilation when not needed.

Primitive Type Translation Table

The dispatcher sub_918E50 handles kinds 0x00--0x10 (values 0--16) as primitive/scalar types. These map directly from EDG internal type representation to LLVM IR types. The correspondence between the three type-tag namespaces used across cicc is:

| EDG Type Kind | EDG Printer type_kind | Cast Codegen Tag (*(type+8)) | LLVM IR Type | Width |
|---|---|---|---|---|
| 0x00 | 0x00 error | --- | `<error>` | --- |
| 0x01 | 0x01 void | 3 | void | 0 |
| 0x02 | 0x02 scalar/integer | 17 | iN | N bits |
| 0x03 | 0x03 float | 1 (half), 2 (float), 3 (double), 4 (fp80), 5 (fp128), 6 (bf16) | see FP table | varies |
| 0x04 | 0x04 imaginary | --- | emulated | varies |
| 0x05 | 0x05 complex | --- | { fN, fN } struct | 2x float |
| 0x06 | 0x06 pointer/ref | 18 | ptr (opaque) or ptr addrspace(N) | 32/64 |
| 0x07 | 0x07 function | 15 (function), 16 (ptr-to-fn) | function type | --- |
| 0x08 | 0x08 array | 20 | `[N x elem]` | N * elem |
| 0x09--0x0B | 0x09--0x0B class/struct/union/enum | 21 (struct) | %struct.Name = type { ... } | layout |
| 0x0C | 0x0C elaborated/typedef | --- | resolved target | --- |
| 0x0D | 0x0D pointer-to-member | --- | { ptr, i64 } or i64 | 64/128 |
| 0x0E | 0x0E template param | --- | deduced | --- |
| 0x0F | 0x0F vector | 16 | `<N x elem>` | N * elem |
| 0x10 | 0x10 scalable vector | 16 | `<vscale x N x elem>` | runtime |

The integer type (EDG kind 0x02) carries its bit-width in the upper bytes of the type word. The cast codegen subsystem (sub_128A450) classifies types by the tag byte at *(type+8): tags 1--6 are floating-point (see next section), tag 11 is integer, tag 15 is pointer, and tag 16 is vector/aggregate. The key dispatch idiom (tag - 1) > 5u tests "is NOT a float"; (tag & 0xFD) != 0xB tests "is NOT integer-like".
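The unsigned-compare idiom is easy to verify in isolation (a sketch of the float-family test only):

```python
def is_float_tag(tag):
    # The decompiled idiom `(tag - 1) > 5u` is an unsigned compare: tag 0
    # wraps to UINT_MAX and fails, tags 1..6 (half, float, double, x86_fp80,
    # fp128, bfloat) pass, and everything above 6 fails. The negation of
    # the decompiled test is therefore "is a float".
    return ((tag - 1) & 0xFFFFFFFF) <= 5
```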

Floating-Point Type Encoding

Floating-point types use a sub-kind byte stored in the EDG type node at v3[10].m128i_i8[0] (type printer) or equivalently the cast codegen tag at *(type+8). The complete mapping including all NVIDIA-extended formats:

| Cast Tag | EDG FP Sub-kind | Mangling | C++ Type | LLVM Type | Width | SM Minimum |
|---|---|---|---|---|---|---|
| 1 | 0 / 0xA | DF16_ | _Float16 / __half | half | 16 | SM 53 (scalar), SM 70 (packed) |
| 1 | 1 | Dh | __fp16 | half | 16 | SM 53 |
| 2 | 2 | f | float | float | 32 | all |
| --- | 3 | DF32x | _Float32x | double (promoted) | 64 | all |
| 3 | 4 | d | double | double | 64 | all |
| --- | 5 | DF64x | _Float64x | fp128 (emulated) | 128 | all |
| --- | 6 | (single) | long double | platform-dependent | arch | --- |
| --- | 7 | u7float80 | float80 | x86_fp80 | 80 | N/A on GPU |
| --- | 8 | g | __float128 | fp128 | 128 | emulated |
| 6 | 9 | u6__bf16 or DF16b | __bf16 / __nv_bfloat16 | bfloat | 16 | SM 80 |
| --- | 0xB | DF32_ | _Float32 | float | 32 | all |
| --- | 0xC | DF64_ | _Float64 | double | 64 | all |
| --- | 0xD | DF128_ | _Float128 | fp128 | 128 | emulated |

The bf16 mangling has a three-way ABI gate controlled by qword_4F077B4 (low 32 = use_new_bf16_mangling, high 32 = bf16_abi_version) and qword_4F06A78 (secondary selector). Old ABI emits u6__bf16 (Itanium vendor-extended); C++23 ABI emits DF16b (P1467 standard). The __nv_bool type (EDG printer case 0x02, bit 4 of +162) is a CUDA-specific boolean that emits "__nv_bool" when sub_5D76E0 (CUDA mode check) returns true, or "_Bool" / "bool" otherwise.

Two additional NVIDIA-specific types have dedicated mangling:

| EDG Type Code | Mangling | C++ Type | Purpose |
|---|---|---|---|
| 17 | u11__SVCount_t | __SVCount_t | ARM SVE predicate count |
| 18 | u6__mfp8 | __mfp8 | 8-bit minifloat (FP8 E4M3/E5M2 base) |

On the LLVM side, the __mfp8 type maps to i8 storage with metadata annotations indicating the floating-point interpretation.

CUDA FP8/FP6/FP4 Extended Type Keywords

CUDA 12.x+ introduces narrow floating-point types for transformer inference and tensor core operations. The EDG parser (sub_691320) recognizes these as token values 236 and 339--354, all resolved through sub_6911B0 (CUDA type-token resolver):

| Token | Keyword | Format | Width | Packed Variant | SM Requirement |
|---|---|---|---|---|---|
| 236 | __nv_fp8_e4m3 | E4M3 (4-bit exponent, 3-bit mantissa) | 8 | --- | SM 89 |
| 339 | __nv_fp8_e5m2 | E5M2 (5-bit exponent, 2-bit mantissa) | 8 | --- | SM 89 |
| 340 | __nv_fp8x2_e4m3 | E4M3 packed pair | 16 | 2 elements | SM 89 |
| 341 | __nv_fp8x2_e5m2 | E5M2 packed pair | 16 | 2 elements | SM 89 |
| 342 | __nv_fp8x4_e4m3 | E4M3 packed quad | 32 | 4 elements | SM 89 |
| 343 | __nv_fp8x4_e5m2 | E5M2 packed quad | 32 | 4 elements | SM 89 |
| 344 | __nv_fp6_e2m3 | E2M3 (2-bit exponent, 3-bit mantissa) | 6 | --- | SM 100 |
| 345 | __nv_fp6_e3m2 | E3M2 (3-bit exponent, 2-bit mantissa) | 6 | --- | SM 100 |
| 346 | __nv_fp6x2_e2m3 | E2M3 packed pair | 12 | 2 elements | SM 100 |
| 347 | __nv_fp6x2_e3m2 | E3M2 packed pair | 12 | 2 elements | SM 100 |
| 348 | __nv_mxfp8_e4m3 | MX-format E4M3 | 8 | --- | SM 100 |
| 349 | __nv_mxfp8_e5m2 | MX-format E5M2 | 8 | --- | SM 100 |
| 350 | __nv_mxfp6_e2m3 | MX-format E2M3 | 6 | --- | SM 100 |
| 351 | __nv_mxfp6_e3m2 | MX-format E3M2 | 6 | --- | SM 100 |
| 352 | __nv_mxfp4_e2m1 | MX-format E2M1 (FP4) | 4 | --- | SM 100 |
| 353 | __nv_satfinite | Saturation-to-finite modifier | --- | --- | SM 89 |
| 354 | __nv_e8m0 | E8M0 exponent-only scale format | 8 | --- | SM 100 |

The resolver sub_6911B0 follows the field_140 == 12 (qualified/elaborated type) chain to find the base type node, then sets v325 = 20 (typename). At the LLVM level, these narrow types are lowered to integer storage types (i8, i16, i32) with type metadata or intrinsic-based interpretation. The cvt_packfloat intrinsic family handles conversion to and from these formats with explicit format specifiers:

| cvt_packfloat Case | PTX Suffix | Format |
|---|---|---|
| 2 | .e4m3x2 | FP8 E4M3 pair |
| 3 | .e5m2x2 | FP8 E5M2 pair |
| 4 | .bf16x2 | BFloat16 pair |
| 5 | .e2m1x2 | FP4 E2M1 pair (SM 100+) |
| 6 | .e2m3x2 | FP6 E2M3 pair (SM 100+) |
| 7 | .e3m2x2 | FP6 E3M2 pair (SM 100+) |
| 8 | .ue8m0x2 | UE8M0 scale pair (SM 100+) |
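For reference, the OCP FP8 E4M3 layout named above (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits; no infinities, with S.1111.111 reserved for NaN) can be decoded as:

```python
def decode_fp8_e4m3(byte):
    # 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        return float("nan")                       # E4M3 repurposes the top encoding as NaN
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6    # subnormal
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)
```

Dropping the infinity encodings is what gives E4M3 its extended finite range: the largest value, 0x7E, decodes to 1.75 * 2^8 = 448.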

Address Space Annotations on Types

CUDA memory-space qualifiers propagate through the EDG type system via a 15-bit qualifier word at edg_node+18. The low 15 bits encode a qualifier ID; bit 15 is a negation flag. The qualifier word is the single mechanism through which __device__, __shared__, __constant__, and __managed__ semantics reach the LLVM type system.

EDG qualifier word to LLVM address space mapping (performed by sub_5FFE90):

| Qualifier Word (node+18 & 0x7FFF) | LLVM Address Space | CUDA Source | Notes |
|---|---|---|---|
| 0 | 0 | (default/generic) | Unqualified pointers |
| 1 | 1 | __device__ / global | Explicit global annotation |
| 9 | 0 (with flag check via sub_5F3280) | (generic variant) | Conditional on context |
| 14 | --- | __host__ / method qualifier | Not an address space --- function qualifier |
| 26 | --- | (array subscript context A) | Internal, not an address space |
| 27 | --- | (array subscript context B) | Internal, not an address space |
| 32 | 3 | __shared__ | Per-block shared memory |
| 33 | 4 | __constant__ | Read-only constant memory |

The function sub_5A3140 creates the appropriately address-space-qualified LLVM pointer type given the qualifier output from sub_5FFE90. The helper sub_911CB0 combines address space information with the type kind to produce a unique scope-table index: it computes (type_kind - 24) as a base and combines it with the qualifier to produce a monotonic key.

EDG frontend encoding (from sub_691320 parser, tokens 133--136, and sub_667B60):

| Parser Token | CUDA Keyword | v305 Value | EDG memory_space_code | Target AS |
|---|---|---|---|---|
| 133 | __shared__ | 4 | 2 | 3 |
| 134 | __device__ | 5 | 1 | 1 |
| 135 | __constant__ | 6 | 3 | 4 |
| 136 | __managed__ | 7 | (special) | 0 + "managed" annotation |
| 273 | __global__ (addr-space attr) | --- | 0 | 0 |
| 274 | __shared__ (addr-space attr) | --- | 2 | 3 |
| 275 | __constant__ (addr-space attr) | --- | 3 | 4 |
| 276 | __generic__ (addr-space attr) | --- | (parsed) | (parsed) |

Address-space propagation through types is transitive: if struct S contains a __shared__ int* field, the shared qualifier flows through the pointer type and is preserved in the LLVM ptr addrspace(3) type of that field. The type-pair comparator sub_911D10 achieves this by pushing child pairs onto its worklist whenever a pointer/reference type (kinds 75/76) carries a non-zero qualifier. The qualifier-word masks 1, 14, 32, and 33 are the four values that trigger this child propagation.

For a full cross-reference of all 10 address spaces (including AS 5 local, AS 6 tensor memory, AS 7 shared cluster, AS 25 internal device, AS 53 MemorySpaceOpt annotation, AS 101 param), see Address Spaces.

Vector Type Handling

NVPTX has a highly constrained vector type model. Only four vector types are legal --- all packed into 32-bit Int32HalfRegs (%hh prefix in PTX):

| Legal Vector Type | LLVM MVT | PTX Register Class | PTX Suffix | SM Minimum |
|---|---|---|---|---|
| v2f16 | v2f16 | Int32HalfRegs | .f16x2 | SM 70 (arith), SM 53 (ld/st) |
| v2bf16 | v2bf16 | Int32HalfRegs | .bf16x2 | SM 80 |
| v2i16 | v2i16 | Int32HalfRegs | .s16x2 | SM 70 |
| v4i8 | v4i8 | Int32HalfRegs | (packed bytes) | SM 70 |

All wider vector types are illegal and undergo recursive split/scalarize during type legalization. The split depth for common CUDA vector types:

| CUDA Type | LLVM Type | Split Chain | Final Form |
|---|---|---|---|
| float4 | v4f32 | v4f32 -> 2x v2f32 -> 4x f32 | 4 scalar float ops |
| float2 | v2f32 | v2f32 -> 2x f32 | 2 scalar float ops |
| int4 | v4i32 | v4i32 -> 2x v2i32 -> 4x i32 | 4 scalar i32 ops |
| double2 | v2f64 | v2f64 -> 2x f64 | 2 scalar double ops |
| half2 | v2f16 | legal (no split) | single .f16x2 packed op |
| __nv_bfloat162 | v2bf16 | legal (no split, SM 80+) | single .bf16x2 packed op |
| short2 | v2i16 | legal (no split) | single .s16x2 packed op |
| char4 / uchar4 | v4i8 | legal (no split) | single packed-byte op |
| half (4 elements) | v4f16 | v4f16 -> 2x v2f16 | 2 packed .f16x2 ops |
| half (8 elements) | v8f16 | v8f16 -> v4f16 -> 2x v2f16 | 4 packed .f16x2 ops |

The critical architectural insight: v2f32 is NOT legal on NVPTX (no 64-bit packed float register class exists), so float4 always fully scalarizes to four independent f32 operations. In contrast, half2 stays packed throughout the pipeline, delivering 2x throughput via add.f16x2, mul.f16x2, and fma.rn.f16x2 PTX instructions.
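The halving behavior in the split-chain column can be modeled in a few lines (a sketch only; the legal set is the four packed types listed above, and the chain names omit the multiplicities):

```python
LEGAL_VECTORS = {("f16", 2), ("bf16", 2), ("i16", 2), ("i8", 4)}

def split_chain(elem, count):
    # Halve the element count until the vector is legal or fully scalar,
    # mirroring the recursive split/scalarize during type legalization.
    def name(n):
        return f"v{n}{elem}" if n > 1 else elem
    chain = [name(count)]
    while count > 1 and (elem, count) not in LEGAL_VECTORS:
        count //= 2
        chain.append(name(count))
    return chain
```

Because (f32, 2) is never legal, v4f32 splits all the way to scalar f32, while v8f16 stops at the legal v2f16.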

SM-version gating affects which types are legal at which pipeline stage:

  • SM < 53: No legal vector types; v2f16 must be scalarized, and scalar f16 is promoted to f32.
  • SM 53--69: Scalar f16 is legal; v2f16 is legal for load/store but packed arithmetic may be Custom or Expand.
  • SM 70+: v2f16 fully legal with packed arithmetic. i128 scalar register class added.
  • SM 80+: v2bf16 added as legal vector type.
  • SM 100+: Additional packed FP types for cvt_packfloat --- e2m1x2, e2m3x2, e3m2x2, ue8m0x2.

Tensor core matrix fragments bypass vector legalization entirely. WMMA and WGMMA intrinsics represent matrix data as individual scalar registers or {f16, f16, ...} struct aggregates, not as LLVM vector types. See MMA Codegen for the tensor-core lowering path.

Cast Codegen Type Tags

The cast emission function sub_128A450 uses a distinct type-tag namespace at *(type+8). This tag drives all cast instruction selection and must be clearly distinguished from the EDG type-kind byte at edg_node+16:

| Tag | LLVM Type | Cast Behavior |
|---|---|---|
| 1 | half (f16) | Float family; float-to-float casts use fpext/fptrunc |
| 2 | float (f32) | Float family |
| 3 | double (f64) | Float family |
| 4 | x86_fp80 | Float family (not used on GPU) |
| 5 | fp128 | Float family; triggers standard LLVM cast path (no __nv_*_rz intrinsic) |
| 6 | bfloat (bf16) | Float family |
| 11 | iN (integer) | Integer family; width at *(type+8) >> 8 |
| 15 | ptr | Pointer family |
| 16 | `<N x elem>` (vector) | Vector/aggregate; address-space extraction via sub_16463B0 |

Integer-to-float conversions (tags 11 -> 1..6) default to sitofp/uitofp but can route through NVIDIA-specific __nv_*_rz round-to-zero intrinsics when unk_4D04630 is clear. These intrinsics (__nv_float2int_rz, __nv_double2ll_rz, etc.) are emitted as plain function calls and later pattern-matched by the PTX backend to cvt.rz.* instructions. The fp128 path always uses standard LLVM casts because 128-bit floating point is emulated via FP128/I128 library calls.

SelectionDAG SimpleVT Encoding

After IR generation, types enter the SelectionDAG type system where they are encoded as single-byte SimpleVT values for the legality table lookup at NVPTXTargetLowering + 2422:

| SimpleVT | LLVM Type | Bitwidth |
|---|---|---|
| 0 | extended/custom | computed via sub_1F58D40 |
| 1 | i1 | 1 |
| 2 | i2 | 2 |
| 3 | i8 | 8 |
| 4 | i16 | 16 |
| 5 | i32 | 32 |
| 6 | i64 | 64 |
| 7 | i128 | 128 |
| 8 | f16 / bf16 | 16 |
| 9 | f32 | 32 |
| 10 | f64 | 64 |
| 14--55 | fixed-width vector types | vector of above |
| 56--109 | scalable vector types | scalable vector of above |

The bitwidth-to-SimpleVT conversion pattern appears 11 times in the 348KB DAGTypeLegalizer::run monolith (sub_20019C0), and the vector-to-scalar-element switch table (cases 14--109 mapping back to scalar VT 2--10) appears 6 times. This redundancy is an artifact of the monolithic inlining --- upstream LLVM factors these into per-category files (LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, etc.).
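A sketch of the scalar bitwidth-to-SimpleVT mapping that the legalizer repeats inline (values from the table above; helper names are illustrative):

```python
INT_VT = {1: 1, 2: 2, 8: 3, 16: 4, 32: 5, 64: 6, 128: 7}
FP_VT = {16: 8, 32: 9, 64: 10}   # SimpleVT 8 covers both f16 and bf16

def simple_vt(bits, is_float=False):
    # Widths outside the table fall back to 0 (extended/custom).
    return (FP_VT if is_float else INT_VT).get(bits, 0)
```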

Global Variable Code Generation

Module-Level Driver

Global variable codegen is driven by sub_915990 (~2,700 bytes), which iterates all EDG IL global declarations and categorizes them into sorted sets:

  • Regular device globals
  • __constant__ globals
  • __shared__ globals
  • __managed__ globals
  • Texture references
  • Surface references
  • Grid constants

After categorization, a topological sort (using the same sub_3FEBB0/sub_3FED60 graph primitives as the type translator) determines the order in which globals must be materialized. If global A's initializer references global B, then B must be code-generated first. The transitive dependency discovery is performed by sub_914960, a BFS that walks EDG IL linkage chains, filtering nodes with kind byte in range [25..34] (variable, function, and template declarations).

Address Space Determination

The function sub_916430 (482 bytes) examines EDG IL node attribute bytes to determine the NVPTX address space for a global variable:

fn determine_address_space(edg_node: &EDGNode) -> u32 {
    let storage_class = edg_node[0x88];
    let flags_9c      = edg_node[0x9C];
    let flags_b0      = edg_node[0xB0];
    let flags_ae      = edg_node[0xAE];
    let flags_a8      = edg_node[0xA8] as u64;

    // __constant__: storage class 2
    if storage_class == 2 {
        return 4;  // constant address space
    }

    // __shared__: bit 7 of flags_9c
    if flags_9c & 0x80 != 0 {
        if flags_ae & 1 != 0 {
            return 3;  // extern __shared__
        }
        if flags_b0 & 0x20 != 0 {
            return 5;  // local memory (stack-local shared variant)
        }
        return 3;  // __shared__
    }

    // Bit 6 of flags_9c: device-side memory
    if flags_9c & 0x40 != 0 {
        if edg_node[0xF0] != 0 {
            return 3;  // template-instantiated shared variable
        }
        return 0;  // generic device
    }

    // Extended attribute flags
    if flags_a8 & 0x2000100000 != 0 {
        return 3;  // shared-like semantics
    }

    if storage_class > 2 {
        emit_diagnostic("unsupported storage class!");
    }

    return 0;  // default: generic device memory
}

NVPTX Address Space Assignment

See Address Spaces for the complete master table mapping LLVM AS numbers to PTX qualifiers, hardware, and pointer widths.

In the IR generation context: address space 0 (generic) is the default for __device__ variables. Address space 1 (global) appears in pointer types when the global qualifier is explicit in the type annotation (as opposed to being inferred from the variable declaration). __managed__ variables use address space 0 (same as regular device globals) but receive a "managed" annotation in nvvm.annotations that the runtime uses to set up Unified Virtual Memory mappings.

GlobalVariable Object Creation

The function sub_915C40 (2,018 bytes) materializes an LLVM GlobalVariable:

  1. Hash table lookup: Checks whether the EDG node has already been materialized. The table at ctx+0x178..0x190 maps EDG node pointers to GlobalVariable*. If found with a different type, calls GlobalVariable::mutateType to reconcile.

  2. Allocation: Allocates 88 bytes (0x58) via operator new, then calls the GlobalVariable constructor with module, type, isConstant flag, linkage, initializer (null for declarations), name, and address space.

  3. Alignment: Computes alignment via sub_91CB50 (a DataLayout wrapper), then converts to log2 via BSR (bit-scan-reverse) for LLVM's MaybeAlign representation. Always explicitly set, even for naturally-aligned types.

  4. Initializer: If edg_node[0xB0] & 0x20 is set and the variable is not extern (edg_node[0x88] != 1), calls sub_916690 to generate the initializer IR. The initializer handler dispatches on a variant byte: variant 0/3 for constant expressions, variant 1/2 for aggregate initializers.

  5. __managed__ annotation: If edg_node[0x9D] & 1 is set, emits ("managed", 1) to the annotation list via sub_913680.

  6. Texture/surface detection: If the mode flag at ctx+0x168 has bit 0 set, calls sub_91C2A0 (isTextureType) and sub_91C2D0 (isSurfaceType). Matching variables get "texture" or "surface" annotations and are inserted into a red-black tree at ctx+0x200 for ordered tracking during annotation emission.

  7. Registration: The new GlobalVariable* is stored into the hash table for future lookups.
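The alignment handling in step 3 is a plain bit scan: a power-of-two byte alignment is converted to its log2 for LLVM's MaybeAlign encoding. A sketch of the arithmetic (Python's `bit_length` stands in for the x86 BSR instruction):

```python
def align_log2(align_bytes):
    """log2 of a power-of-two alignment, as BSR computes it:
    the index of the highest set bit."""
    assert align_bytes > 0 and (align_bytes & (align_bytes - 1)) == 0
    return align_bytes.bit_length() - 1   # BSR result
```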

Finalization: Metadata and @llvm.used

After all globals are materialized, sub_915400 calls four finalization functions in sequence:

sub_9151E0 --- emit nvvmir.version: Creates a named metadata node "nvvmir.version" containing version operands as ConstantInt values wrapped in ConstantAsMetadata. When debug info is present (ctx+0x170 non-null), the tuple has 4 operands including address-space-qualified indices; otherwise 2 operands.

sub_914410 --- emit nvvm.annotations: Iterates the annotation list at ctx+0x1B0..0x1B8 and creates MDTuple entries under the named metadata "nvvm.annotations". Each annotation record produces a {GlobalValue*, MDString-key, ConstantInt-value} triple. Three annotation categories receive special batching: "grid_constant", "preserve_n_data", and "preserve_reg_abi" --- these are collected into compound MDTuples rather than emitting one per parameter, reducing metadata size in kernels with many annotated parameters.

sub_90A560 --- emit @llvm.used: Builds the @llvm.used global array that prevents LLVM from dead-stripping texture references, surface references, and managed variables. The function iterates the registered global triples at ctx+0x198..0x1A0 (24-byte records, hence the 0xAAAAAAAAAAAAAAAB magic divisor for dividing by 3), bitcasts each GlobalValue* to i8*, constructs a ConstantArray of type [N x i8*], and creates a global with name "llvm.used", appending linkage, and section "llvm.metadata".
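The 0xAAAAAAAAAAAAAAAB constant is the standard strength reduction of unsigned division by 3: it equals (2^65 + 1)/3, so a 64x64-to-128-bit multiply followed by a 65-bit right shift reproduces n/3 exactly for any 64-bit n. A sketch of the arithmetic:

```python
MAGIC = 0xAAAAAAAAAAAAAAAB        # (2**65 + 1) // 3

def div3_u64(n):
    """Unsigned 64-bit division by 3 the way compilers emit it:
    take the high 64 bits of the 128-bit product, then shift by 1."""
    assert 0 <= n < 2**64
    hi = (n * MAGIC) >> 64        # high half of the MUL
    return hi >> 1

# Presumably the record count in sub_90A560 is ((end - start) >> 3) / 3,
# with the /3 implemented by this multiply (24-byte = 3-qword records).
```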

Conditional: If debug info is present, emits a "Debug Info Version" module flag with value 3 via Module::addModuleFlag. If enabled, also emits "llvm.ident" metadata identifying the compiler.

Kernel Metadata

Annotation Emitter (sub_93AE30)

After a kernel's function body has been code-generated, sub_93AE30 translates EDG-level kernel attributes (__launch_bounds__, __cluster_dims__) into LLVM named metadata under "nvvm.annotations". The function signature:

void emitKernelAnnotationMetadata(
    NVVMContext *ctx,       // ctx->module at offset +344
    FuncDecl    *funcDecl,  // EDG function declaration, params at +16, count at +8
    LaunchAttr  *launch,    // __launch_bounds__/cluster attrs, NULL if none
    MDNodeVec   *out        // output vector of metadata nodes
);

Parameter Metadata

For each function parameter (stride 40 bytes, iterated from funcDecl+16):

  1. Visibility check: If launch attributes exist and bit 0x20 of launch+198 is clear, or param+32 != 0, emits opcode 22 (hidden/implicit parameter). If dword_4D04628 is set and the launch bit is set, calls sub_8D2E30 to check for special types and emits opcode 40.

  2. Type dispatch:

    • Type 1 (pointer): Checks sub_91B6F0 for read-only image/sampler (opcode 54) and sub_91B730 for surface reference (opcode 79).
    • Type 2 (value): Computes alignment metadata via sub_91A390, then log2 via BSR, emits packed (log2, hasValue) pair. Checks for alignment attribute tag 92 via sub_A74D20.

  3. MDNode creation: sub_A7B020(module, paramIndex, &attrAccum) creates the MDNode for each parameter.

Cluster Metadata

Triggered when launch is non-null and *(launch+328) points to a valid cluster config. The cluster config struct:

| Offset | Field | Used As |
|--------|-------|---------|
| +20 | [5] | reqntid.x (cluster) |
| +24 | [6] | reqntid.y (cluster) |
| +28 | [7] | reqntid.z (cluster) |
| +40 | [10] | cluster_dim.z (also presence flag: > 0 triggers emission) |
| +44 | [11] | cluster_dim.y |
| +48 | [12] | cluster_dim.x |

When cluster_config[10] > 0, three metadata entries are emitted in order:

  1. nvvm.blocksareclusters --- boolean flag, no value string. Emitted unconditionally.
  2. nvvm.reqntid --- the three cluster dimension fields [12],[11],[10] are converted to decimal strings and concatenated with commas: "{x},{y},{z}". Uses SSO std::string objects with a two-digit lookup table ("00","01",...,"99") for fast integer-to-string conversion. A 0x3FFFFFFFFFFFFFFF sentinel triggers a fatal "basic_string::append" error on overflow.
  3. nvvm.cluster_dim --- the three fields [7],[6],[5] are similarly concatenated.
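The two-digit lookup table is the common fast path for integer-to-decimal conversion (it halves the number of divisions by emitting two characters per step). A Python sketch of the technique, not the decompiled code:

```python
DIGIT_PAIRS = [f"{i:02d}" for i in range(100)]   # "00", "01", ..., "99"

def u32_to_str(n):
    """Integer-to-decimal using a two-digit lookup table,
    emitting two characters per iteration."""
    if n == 0:
        return "0"
    out = []
    while n >= 100:
        n, pair = divmod(n, 100)
        out.append(DIGIT_PAIRS[pair])             # low two digits
    out.append(str(n))                            # final 1-2 leading digits
    return "".join(reversed(out))
```

The three dimension fields are then joined with commas, e.g. `",".join(...)` producing "2,1,1".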

Function-Level Metadata Node

After all per-parameter and cluster metadata is accumulated, if the accumulator is non-empty, sub_A7B020(module, 0xFFFFFFFF, &attrAccum) creates a function-level MDNode with parameter index -1 (sentinel). This node carries all function-level annotations combined.

Annotation Reader (sub_A84F90)

The inverse of the emitter. Reads "nvvm.annotations" named metadata from an LLVM Module and populates internal structures. For each {function_ref, key_string, value} operand tuple, the key is matched via raw integer comparisons (not strcmp):

| Key String | Match Method | Handler |
|------------|--------------|---------|
| "kernel" | 6-byte i32+i16 compare | sub_CE8040: set/clear nvvm.kernel flag |
| "maxntidx/y/z" | 7-byte prefix + suffix char | sub_A7C1C0 with "nvvm.maxntid" |
| "reqntidx/y/z" | 7-byte prefix + suffix char | sub_A7C1C0 with "nvvm.reqntid" |
| "cluster_dimx/y/z" | 12-byte qword+i32 + suffix | sub_A7C1C0 with "nvvm.cluster_dim" |
| "maxnreg" | 7-byte qword + byte 'g' | sub_B2CD60 with "nvvm.maxnreg" |
| "minctasm" | 8-byte single qword compare | sub_B2CD60 with "nvvm.minctasm" |
| "maxclusterrank" | 14-byte multi-width compare | sub_B2CD60 with "nvvm.maxclusterrank" |
| "cluster_max_blocks" | 18 bytes | Same handler as maxclusterrank |
| "align" | 5 bytes | sub_B2CCF0: BSR-based log2 alignment |

The raw integer comparison technique avoids strcmp overhead by loading the key bytes as i32/i64 values and comparing in a single instruction. For example, "kernel" is checked as two loads: *(uint32_t*)key == 0x6E72656B and *(uint16_t*)(key+4) == 0x6C65.
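The constants can be verified directly. A Python sketch of the i32+i16 match for "kernel" (the helper name is made up; the two unpacked loads correspond to the two compare instructions):

```python
import struct

def match_kernel_key(key: bytes) -> bool:
    """Match the 6-byte key "kernel" with one i32 and one i16
    compare instead of strcmp."""
    if len(key) < 6:
        return False
    word, half = struct.unpack_from("<IH", key)   # little-endian loads
    return word == 0x6E72656B and half == 0x6C65  # "kern" + "el"
```

A mismatch in either load rejects the key after at most two comparisons.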

Complete Metadata String Catalog

Module-level named metadata:

| Key | Purpose |
|-----|---------|
| nvvm.annotations | Container for all kernel and global annotations |
| nvvm.annotations_transplanted | Flag: annotations already migrated to function-level |
| nvvm.reflection | Compile-time reflection constants |
| nvvmir.version | NVVM IR version (2 or 4 operands) |
| llvm.used | Array preventing dead-stripping of annotated globals |
| llvm.ident | Compiler identification string |

Function-level metadata keys:

| Key | Value Format | Source |
|-----|--------------|--------|
| nvvm.kernel | (boolean presence) | __global__ qualifier or calling convention 0x47 |
| nvvm.maxntid | "x,y,z" | __launch_bounds__(maxThreads) |
| nvvm.reqntid | "x,y,z" | __launch_bounds__ or cluster config |
| nvvm.maxnreg | decimal string | __launch_bounds__(..., ..., maxRegs) |
| nvvm.minctasm | decimal string | __launch_bounds__(..., minCTAs) |
| nvvm.maxclusterrank | decimal string | SM >= 90 cluster rank limit |
| nvvm.blocksareclusters | (boolean presence) | __cluster_dims__ present |
| nvvm.cluster_dim | "x,y,z" | __cluster_dims__(x,y,z) |

Global variable annotations (emitted as {GlobalValue*, MDString, i32} triples in nvvm.annotations):

| Annotation | Value | Trigger |
|------------|-------|---------|
| "managed" | 1 | __managed__ qualifier |
| "texture" | 1 | Texture reference type detected |
| "surface" | 1 | Surface reference type detected |
| "grid_constant" | (batched) | __grid_constant__ parameter attribute |
| "preserve_n_data" | (batched) | NVIDIA-internal preservation hint |
| "preserve_reg_abi" | (batched) | NVIDIA-internal register ABI hint |

Metadata Accessor Functions

The backend reads metadata through typed accessor functions in the 0xCE7xxx--0xCE9xxx range:

| Address | Reconstructed Name | Returns |
|---------|--------------------|---------|
| sub_CE9220 | isKernel(func) | true if linkage == 0x47 OR nvvm.kernel present |
| sub_CE8D40 | getMaxNtid(out, func) | Parses "nvvm.maxntid" as (x,y,z) triple |
| sub_CE8DF0 | getReqNtid(out, func) | Parses "nvvm.reqntid" as (x,y,z) triple |
| sub_CE8EA0 | getClusterDim(out, func) | Parses "nvvm.cluster_dim" as (x,y,z) triple |
| sub_CE9030 | getMaxClusterRank(func) | Checks "cluster_max_blocks" then "nvvm.maxclusterrank" |
| sub_CE90E0 | getMinCtaSM(func) | Checks "minctasm" then "nvvm.minctasm" |
| sub_CE9180 | getMaxNReg(func) | Checks "maxnreg" then "nvvm.maxnreg" |

Each accessor first checks the function-level metadata (post-transplant), then falls back to the raw nvvm.annotations tuples (pre-transplant). The isKernel check is especially important: it recognizes kernels either by calling convention 0x47 or by the nvvm.kernel metadata presence, ensuring compatibility with both the EDG frontend path and bitcode loaded through LibNVVM.

Metadata Lifecycle

The complete flow from CUDA source to PTX directives:

CUDA:  __global__ void kern() __launch_bounds__(256, 2) __cluster_dims__(2, 1, 1)

EDG:   LaunchAttr { cluster_config[12]=256, [11]=1, [10]=1, [7]=1, [6]=1, [5]=2 }

sub_93AE30:
  -> nvvm.blocksareclusters (presence flag)
  -> nvvm.reqntid = "256,1,1"
  -> nvvm.cluster_dim = "2,1,1"
  -> function-level MDNode (index -1)

sub_A84F90:  reads back on bitcode load

Backend accessors (CE8xxx): typed access

PTX emitter (sub_3022E70):
  .blocksareclusters
  .reqntid 256, 1, 1
  .reqnctapercluster 2, 1, 1

Special Variables: threadIdx, blockIdx, blockDim, gridDim, warpSize

Recognition Pipeline

CUDA built-in variables (threadIdx, blockIdx, blockDim, gridDim, warpSize) are not stored in memory --- they map directly to PTX special registers accessed via LLVM intrinsics. Two parallel codegen paths exist: an older one in the 0x920xxx range and a newer one in the 0x1285xxx range. Both share the same logic structure.

The classifier function isSpecialRegisterVar (sub_920430 / sub_127F7A0) checks five preconditions before recognizing a variable:

  1. Inside kernel: (ctx->flags_at_360 & 1) != 0 --- only valid in __global__ function context.
  2. Not extern: (sym->byte_89 & 1) == 0.
  3. Not template-dependent: *(signed char*)(sym+169) >= 0.
  4. Element count == 1: sym->elem_count_at_136 == 1.
  5. Name non-null: sym->name_at_8 != NULL.

If all pass, the name is compared via strcmp against the five known strings. The output category:

| Category | Name | Type |
|----------|------|------|
| 0 | threadIdx | dim3 (3-component struct) |
| 1 | blockDim | dim3 |
| 2 | blockIdx | dim3 |
| 3 | gridDim | dim3 |
| 4 | warpSize | scalar int |

Intrinsic ID Table

A static 2D array int intrinsicIDs[5][3] maps (category, component) to LLVM intrinsic IDs:

| CUDA Variable | .x | .y | .z |
|---------------|----|----|----|
| threadIdx | @llvm.nvvm.read.ptx.sreg.tid.x | @llvm.nvvm.read.ptx.sreg.tid.y | @llvm.nvvm.read.ptx.sreg.tid.z |
| blockDim | @llvm.nvvm.read.ptx.sreg.ntid.x | @llvm.nvvm.read.ptx.sreg.ntid.y | @llvm.nvvm.read.ptx.sreg.ntid.z |
| blockIdx | @llvm.nvvm.read.ptx.sreg.ctaid.x | @llvm.nvvm.read.ptx.sreg.ctaid.y | @llvm.nvvm.read.ptx.sreg.ctaid.z |
| gridDim | @llvm.nvvm.read.ptx.sreg.nctaid.x | @llvm.nvvm.read.ptx.sreg.nctaid.y | @llvm.nvvm.read.ptx.sreg.nctaid.z |
| warpSize | @llvm.nvvm.read.ptx.sreg.warpsize | --- | --- |

Each intrinsic is a zero-argument call returning i32. The old codegen path uses intrinsic ID 9374 for warpSize; the new path uses 4348.

dim3 Member Access Codegen

Two functions handle the code generation, depending on whether the access is a full dim3 struct or a single component:

Full struct access (sub_922290 / sub_1285550): For threadIdx as a whole (all three components), loops 3 times:

for (component = 0; component < 3; component++) {
    intrinsicID = intrinsicIDs[category][component];
    decl = Module::getOrInsertIntrinsic(intrinsicID);
    callInst = CallInst::Create(decl);  // zero-arg, returns i32
    // Insert into struct via InsertValue
}

The three call results are composed into the struct type via CreateInsertValue. The IR value is named "predef_tmp".

Single component access (sub_9268C0 / sub_1286E40): For threadIdx.x specifically, the member name's first character is extracted from member_symbol+56+8:

  • 'x' (0x78) with null terminator '\0' at next byte -> component 0
  • 'y' (0x79) -> component 1
  • 'z' (0x7A) -> component 2

The null-terminator check prevents false matches on member names like "xy". A single intrinsic call is emitted, named "predef_tmp_comp":

%predef_tmp_comp = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()

Both paths compute alignment from the return type's bit-width via BSR and handle sign extension: if the type tag byte at +140 satisfies (tag & 0xFB) == 8 (signed int), the result is marked as signed.
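The single-character dispatch with the null-terminator guard can be sketched as follows (names are illustrative; the byte values come from the list above):

```python
COMPONENTS = {0x78: 0, 0x79: 1, 0x7A: 2}   # 'x', 'y', 'z'

def classify_member(name: bytes):
    """Map a dim3 member name to a component index, requiring the
    NUL right after the letter so "xy" is not mistaken for "x"."""
    if len(name) >= 2 and name[0] in COMPONENTS and name[1] == 0:
        return COMPONENTS[name[0]]
    return None                             # not a recognized component
```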

PTX Backend Mapping

The NVPTX backend (sub_21E86B0) maps internal register encodings (single-byte case labels using ASCII character codes) to PTX special register names:

| Code | ASCII | PTX Register |
|------|-------|--------------|
| 0x26 | & | %tid.x |
| 0x27 | ' | %tid.y |
| 0x28 | ( | %tid.z |
| 0x29 | ) | %ntid.x |
| 0x2A | * | %ntid.y |
| 0x2B | + | %ntid.z |
| 0x2C | , | %ctaid.x |
| 0x2D | - | %ctaid.y |
| 0x2E | . | %ctaid.z |
| 0x2F | / | %nctaid.x |
| 0x30 | 0 | %nctaid.y |
| 0x31 | 1 | %nctaid.z |

Codes 0x5E (^) and 0x5F (_) are delegated to sub_3958DA0 for cluster and warp-level registers. Any unhandled code triggers a fatal "Unhandled special register" error. Register names are written via optimized memcpy of 6--9 bytes directly to the output stream.
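Because the codes 0x26..0x31 are contiguous, the dispatch reduces to a flat table index. A sketch (the table contents come from the rows above; the function name is made up):

```python
SREG_NAMES = [
    "%tid.x", "%tid.y", "%tid.z",
    "%ntid.x", "%ntid.y", "%ntid.z",
    "%ctaid.x", "%ctaid.y", "%ctaid.z",
    "%nctaid.x", "%nctaid.y", "%nctaid.z",
]

def sreg_name(code):
    """Map an internal single-byte register code to its PTX name."""
    if 0x26 <= code <= 0x31:
        return SREG_NAMES[code - 0x26]     # contiguous range, flat lookup
    raise ValueError("Unhandled special register")
```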

ISel Lowering

The instruction selector (sub_36E4040) validates that the intrinsic declaration returns i32 (type code 7 at offset +48 of the overload descriptor). If the type does not match, it emits a fatal error: "Unsupported overloaded declaration of llvm.nvvm.read.sreg intrinsic". It then creates a MachineSDNode with NVPTX target opcode 3457.

EDG Frontend Diagnostic

The EDG frontend includes a diagnostic at sub_6A49A0 that detects writes to predefined read-only variables. When a store target matches any of the five built-in names, it emits diagnostic 0xDD0:

error: cannot assign to variable 'threadIdx' with predefined meaning in CUDA

This diagnostic fires during semantic analysis, long before IR generation. It ensures that CUDA programs cannot accidentally (or intentionally) write to hardware register proxies.

Libdevice Linking

NVIDIA embeds a complete copy of the libdevice math library -- 455,876 bytes of LLVM bitcode -- directly inside the cicc binary. This library provides GPU-optimized implementations of ~350 mathematical intrinsics (trigonometric, exponential, rounding, Bessel functions, error functions, type conversions, and integer utilities) that are linked into every CUDA compilation during the LNK pipeline stage. The linker (sub_12C06E0, 63KB) validates bitcode magic bytes, enforces the nvptx64- target triple prefix, checks NVVM IR version metadata for cross-release compatibility, and performs symbol-size matching across all modules before producing a single merged module. Two identical copies of the embedded bitcode exist in the binary -- one for each compilation path -- ensuring the library is always available without filesystem access.

Upstream LLVM has no equivalent of this embedded-library mechanism. Clang relies on external libdevice.10.bc files discovered through --cuda-path at driver level. NVIDIA's approach eliminates the file-lookup step entirely, making cicc self-contained: the entire math library ships inside the compiler binary itself.

| Attribute | Value |
|-----------|-------|
| Embedded size | 455,876 bytes (445 KB) per copy |
| Copies in binary | 2: unk_3EA0080 (Path A), unk_420FD80 (Path B) |
| Function count | 352 defined (349 __nv_* public + 3 __internal_* helper) |
| __nvvm_reflect calls | 2,016 (architecture/precision dispatch) |
| Target triple | nvptx64-nvidia-gpulibs |
| NVVM IR version | !nvvmir.version = !{i32 2, i32 0} (always-compatible sentinel) |
| Attribute group | #0 = { alwaysinline nounwind } on all public functions |
| Module linker | sub_12C06E0 (63KB, 2,154 lines) |
| Version checker | sub_12BFF60 (9KB, 362 lines) |
| Pipeline stage | LNK (first stage, before OPT) |
| Override | -nvvmir-library <path> CLI flag substitutes an external file |
| Version bypass | NVVM_IR_VER_CHK=0 disables IR version validation |

Embedded Bitcode Layout

The cicc binary contains two byte-identical copies of the libdevice bitcode at different virtual addresses. Each compilation path uses its own copy, avoiding any shared-state coordination between Path A (nvcc-invoked) and Path B (standalone/LibNVVM):

Binary offset         Path   Referenced by          Size
─────────────────────────────────────────────────────────────
unk_3EA0080           A      sub_905EE0 (43KB)      455,876 bytes
unk_420FD80           B      sub_1265970 (48KB)     455,876 bytes

Both copies contain identical LLVM bitcode with:

  • Data layout: e-i64:64-v16:16-v32:32-n16:32:64
  • Target triple: nvptx64-nvidia-gpulibs (note: gpulibs, not cuda)
  • Producer: clang version 3.8.0 (tags/RELEASE_380/final) -- the bitcode was originally compiled with an ancient Clang but has been maintained through bitcode format upgrades across CUDA toolkit releases
  • Version metadata: !nvvmir.version = !{i32 2, i32 0} -- this specific version tuple (2, 0) is hard-coded in the version checker as an always-compatible sentinel

The duplication exists because the two compilation paths (sub_905EE0 for Path A, sub_1265970 for Path B) are entirely independent code paths with no shared module state. Deduplicating the data would require introducing a shared pointer, which NVIDIA apparently considered not worth the ~445KB savings in a 60MB binary.

Loading the Embedded Bitcode

In both paths, the embedded bitcode is passed to sub_12BCB00 (the nvvmCUAddModuleFromBuffer API wrapper) with a hardcoded size constant:

// Path A (sub_905EE0, line ~167):
v19 = sub_12BCB00(compilation_unit, &unk_3EA0080, 455876, 0);

// Path B (sub_1265970, line ~448):
v19 = sub_12BCB00(compilation_unit, &unk_420FD80, 455876, 0);

When the -nvvmir-library <path> flag is provided, the corresponding path opens the file, reads its contents into memory, and passes that buffer to sub_12BCB00 instead of the embedded pointer. This override is used primarily for testing custom libdevice builds.

Libdevice Function Inventory

The library defines 352 functions across 10 categories. All 349 public functions carry alwaysinline nounwind attributes, meaning they will be unconditionally inlined during the OPT stage after linking. Three internal helper functions (__internal_trig_reduction_slowpathd, __internal_accurate_pow, __internal_lgamma_pos) use noinline nounwind to avoid code size explosion in their callers.

| Category | Count | Examples |
|----------|-------|----------|
| Type conversions | 75 | __nv_float2int_rn, __nv_double2ull_rz, __nv_int2float_rd, __nv_half2float |
| Rounded arithmetic | 74 | __nv_fmaf_rn, __nv_fdiv_rz, __nv_dsqrt_rd, __nv_dadd_ru, __nv_fmul_rn |
| Trigonometric | 34 | __nv_sinf, __nv_cos, __nv_tanf, __nv_asinf, __nv_atan2, __nv_sincospi |
| Special functions | 30 | __nv_erff, __nv_lgamma, __nv_j0, __nv_y1, __nv_cyl_bessel_i0, __nv_normcdf |
| Roots and norms | 28 | __nv_sqrtf, __nv_rsqrt, __nv_cbrt, __nv_hypot, __nv_norm3d, __nv_rnorm4d |
| Exponential/logarithmic | 28 | __nv_expf, __nv_log2, __nv_exp10, __nv_log1p, __nv_ldexp, __nv_frexp |
| Integer utilities | 27 | __nv_clz, __nv_popc, __nv_brev, __nv_mulhi, __nv_abs, __nv_byte_perm |
| Float utilities | 20 | __nv_fabsf, __nv_fminf, __nv_copysign, __nv_fmod, __nv_nextafter, __nv_nan |
| Rounding | 14 | __nv_floorf, __nv_ceil, __nv_truncf, __nv_roundf, __nv_nearbyintf, __nv_rint |
| Classification | 11 | __nv_isinff, __nv_isnand, __nv_isfinited, __nv_signbitf, __nv_ilogb, __nv_logb |
| Internal helpers | 3 | __internal_trig_reduction_slowpathd, __internal_accurate_pow, __internal_lgamma_pos |

Every public function body contains calls to @__nvvm_reflect with query strings (__CUDA_FTZ, __CUDA_ARCH, __CUDA_PREC_SQRT) that are resolved by the NVVMReflect pass during optimization. This is how the same bitcode adapts to different precision modes and SM architectures -- see NVVMReflect for details on the reflection mechanism. The 2,016 reflect calls across 352 functions work out to an average of ~5.7 architecture/precision branch points per function.

Struct Types

The bitcode defines five aggregate types used by multi-return functions:

%struct.uint2                  = type { i32, i32 }
%struct.float2                 = type { float, float }
%struct.trig_reduction_return  = type { double, i32 }
%struct.ulonglong2             = type { i64, i64 }
%struct.double2                = type { double, double }

trig_reduction_return is used by the internal trigonometric range-reduction helper. The float2/double2 types appear in the sincos/sincospi family, which computes both sine and cosine in a single call.

Constant Tables

The bitcode contains precomputed coefficient tables in address space 1 (global memory):

| Global | Type | Purpose |
|--------|------|---------|
| @__cudart_i2opi_f | [6 x i32] | Float-precision inverse-of-pi table for trig reduction |
| @__cudart_i2opi_d | [18 x i64] | Double-precision inverse-of-pi table for trig reduction |
| @__cudart_sin_cos_coeffs | [16 x double] | Chebyshev coefficients for sin/cos polynomial approximation |

Module Linker Algorithm

sub_12C06E0 (63KB) is the central module linker that operates during the LNK pipeline stage. It receives a list of user modules and a list of builtin modules (which includes libdevice), validates them, and produces a single merged LLVM module. The algorithm proceeds in seven phases:

Phase A: Module Iteration and Bitcode Validation

For each module in the input list (from a1[0] to a1[1], stepping by 4 qwords per entry), the linker:

  1. Opens and reads the module data via sub_16C2450
  2. Validates LLVM bitcode magic bytes -- accepts two formats:
    • Raw bitcode: bytes 0xDE 0xC0 0x17 0x0B (little-endian 0x0B17C0DE)
    • Bitcode wrapper: bytes 0x42 0x43 0xC0 0xDE (ASCII "BC" prefix)
  3. Determines the buffer name (falls back to "Unknown buffer" if the vtable function is sub_12BCB10)
  4. Parses bitcode into an LLVM Module via sub_15099C0

for each entry in modules[a1[0] .. a1[1]]:
    buffer = open_and_read(entry.data, entry.size, entry.name)
    magic = read_4_bytes(buffer)
    if magic != 0x0B17C0DE and magic != 0xDEC04342:
        *error_code = 9   // invalid bitcode
        return NULL
    name = (entry.vtable_func == sub_12BCB10)
           ? "Unknown buffer"
           : entry.vtable_func(entry)
    module = parse_bitcode(buffer, llvm_ctx, name)

Phase B: Triple Validation

After parsing all modules, the linker enforces that every module's target triple starts with nvptx64-. The comparison uses a prefix match against the global string at off_4CD49B0:

for each parsed_module:
    triple = get_triple(parsed_module)   // offset +240
    if triple.length == 0:
        error: "Module does not contain a triple, should be 'nvptx64-'"
        *error_code = 9
    else if !starts_with(triple, "nvptx64-"):
        error: "<module_name>: Module does not contain a triple, should be 'nvptx64-'"
        *error_code = 9

The libdevice bitcode has triple nvptx64-nvidia-gpulibs, which passes this prefix check. User modules typically have nvptx64-nvidia-cuda.

Phase C: IR Version Check

For each module, the linker calls sub_12BFF60 (the version checker -- see next section). If the check fails, the linker emits a diagnostic and returns error code 3:

for each parsed_module:
    result = NVVMIRVersionCheck(modules, parsed_module, flags)
    if result != 0:
        error: "<name>: error: incompatible IR detected. "
               "Possible mix of compiler/IR from different releases."
        *error_code = 3
        return NULL

Phase D: Single-Module Fast Path

When only one module exists (no linking needed), the linker returns it directly via sub_1C3DFC0 without invoking any linking machinery. This fast path avoids the overhead of LLVM's Linker::linkModules for the common case of a single translation unit without libdevice.

Phase E: Multi-Module User Linking

For N > 1 user modules, the linker:

  1. Selects one module as the "primary" (index v57)
  2. Copies the primary module's triple and data layout to all secondary modules (ensuring consistency)
  3. Calls sub_12F5610 -- NVIDIA's wrapper around LLVM's Linker::linkModules -- to merge all user modules into a single module

if module_count > 1:
    primary = modules[v57]
    for each secondary in modules where index != v57:
        set_triple(secondary, get_triple(primary))
        set_data_layout(secondary, get_data_layout(primary))
    result = LinkModules(&modules, linking_state, &error_str, &warnings, options)
    if result != 0:
        error: "<module_name>: link error: <details>"
        *error_code = 9

Phase F: Builtin Linking

After user modules are merged, the linker processes builtin modules from a1[3] to a1[4] (this is where libdevice lives). Each builtin module goes through the same bitcode validation and parsing as user modules, then is linked into the main module using sub_1CCEBE0 -- a different linking function than the user-module linker, likely Linker::linkModules with Linker::OverrideFromSrc flags for builtin definitions:

for each builtin in modules[a1[3] .. a1[4]]:
    validate_and_parse(builtin)
    set_triple(builtin, get_triple(main_module))
    result = LinkBuiltinModule(main_module, builtin, &error_string)
    if result != 0:
        error: "builtins: link error: <details>"
        // continues -- does not abort on builtin link failure
    post_link_cleanup(main_module, target_features)

The post-link cleanup sequence (sub_1611EE0 through sub_160FE50) configures target features on the merged module and finalizes symbol resolution.

Phase G: Symbol Size Matching

The final validation phase walks every global symbol in the linked module and checks that declarations and definitions agree on type sizes. The linker maintains a binary search tree keyed by symbol name and computes type sizes using a recursive size calculator:

| Type code | Type | Size formula |
|-----------|------|--------------|
| 1 | half | 16 bits |
| 2 | float | 32 bits |
| 3, 9 | double, i64 | 64 bits |
| 4 | fp80 | 80 bits |
| 5, 6 | fp128 | 128 bits |
| 7 | pointer | 8 * pointer_size |
| 0xB | integer | bits >> 8 |
| 0xD | struct | sum of member sizes |
| 0xE | array | alignment * count * ceil(element_bits / (8 * alignment)) |
| 0xF | named type | resolved recursively |
| 0x10 | vector | element_size * count |

for each global_symbol in linked_module:
    name = get_name(global_symbol)
    if name in size_tree:
        existing_size = size_tree[name].size
        new_size = compute_type_size(global_symbol.type)
        if existing_size != new_size:
            error: "Size does not match for <name> in <module_A> "
                   "with size X specified in <module_B> with size Y."
            size_mismatch = true
    else:
        size_tree.insert(name, compute_type_size(global_symbol.type))
if size_mismatch:
    *error_code = 9
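The recursive size computation can be sketched over a toy type encoding (the tuple representation is invented for illustration, and the array-padding rule from the table is omitted for brevity):

```python
# Leaf sizes in bits, matching the table above.
LEAF_BITS = {"half": 16, "float": 32, "double": 64, "i64": 64,
             "fp80": 80, "fp128": 128}

def type_size_bits(ty, ptr_bits=64):
    """Recursive type size in bits over tuples like
    ("leaf", "float"), ("struct", [members]), ("vector", elem, n)."""
    kind = ty[0]
    if kind == "leaf":
        return LEAF_BITS[ty[1]]
    if kind == "pointer":
        return ptr_bits                    # 8 * pointer_size
    if kind == "int":
        return ty[1]                       # arbitrary-width integer
    if kind == "struct":
        return sum(type_size_bits(m, ptr_bits) for m in ty[1])
    if kind == "vector":
        return type_size_bits(ty[1], ptr_bits) * ty[2]
    raise ValueError(kind)
```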

Triple and Version Validation

NVVM IR Version Checker (sub_12BFF60)

The version checker validates the nvvmir.version metadata node that every NVVM-produced bitcode module carries. It ensures that modules compiled by different CUDA toolkit versions are not accidentally mixed.

Metadata lookup: The checker searches for two named metadata nodes:

  1. "nvvmir.version" -- the IR version tuple
  2. "llvm.dbg.cu" -- debug compile unit (presence indicates debug info exists)

Both are looked up via sub_1632310 (named metadata search on the module).

Version tuple format: The metadata node contains either 2 or 4 constant integer operands:

| Format | Operands | Meaning |
|--------|----------|---------|
| 2-element | {major, minor} | IR version only |
| 4-element | {major, minor, dbg_major, dbg_minor} | IR version + debug IR version |

Compatibility check: For the IR version, sub_12BDA30 performs the actual comparison. The special case (major=2, minor=0) always passes -- this is exactly the version carried by the embedded libdevice, ensuring it is compatible with any user module regardless of toolkit version.

For the debug version, sub_12BD890 checks compatibility with a similar special case: (debug_major=3, debug_minor<=2) always passes.

Unique node deduplication: The checker builds a hash set of unique metadata nodes using the standard DenseMap infrastructure with NVVM-layer sentinels (-8 / -16). See Hash Table and Collection Infrastructure for the hash function and probing strategy. This deduplication handles the case where multiple source files within a compilation unit carry identical version metadata -- each unique version is checked exactly once.

Final gate: If debug info is present in the module (setting the debug mode flag) but no debug version was validated (because the metadata lacked operands 2-3), the checker returns 3 (incompatible). This catches the case where a debug-compiled user module is linked against a non-debug library that lacks debug version metadata.
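The gate logic can be sketched as follows; the sentinel rules come from the description above, while the handling of non-sentinel tuples (exact match against the toolkit's own version, here a hypothetical value) is an assumption of this sketch:

```python
TOOLKIT_IR, TOOLKIT_DBG = (2, 0), (3, 1)   # hypothetical toolkit versions

def nvvm_ir_version_ok(ir, dbg, has_debug_info):
    """Sketch of the version gate. Returns 0 (compatible) or
    3 (incompatible), mirroring the checker's return codes."""
    if ir != (2, 0) and ir != TOOLKIT_IR:   # (2, 0) always passes
        return 3
    if dbg is None:
        # debug info present but no debug version validated -> reject
        return 3 if has_debug_info else 0
    major, minor = dbg
    if not ((major == 3 and minor <= 2) or dbg == TOOLKIT_DBG):
        return 3
    return 0
```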

Symbol Resolution During LNK

The LNK stage processes libdevice functions through LLVM's standard symbol resolution mechanism. Because all 349 public libdevice functions carry the alwaysinline attribute, the resolution and inlining follow a specific sequence:

  1. Declaration matching: User code that calls __nv_sinf(x) contains an external declaration declare float @__nv_sinf(float). The linker resolves this declaration against the define float @__nv_sinf(float) in libdevice.

  2. __nvvm_reflect remains unresolved: After linking, libdevice function bodies contain calls to @__nvvm_reflect which are still unresolved declarations. These are handled during the OPT stage by the NVVMReflect pass, not during linking.

  3. Dead function elimination: Functions from libdevice that are never called by user code are eliminated by GlobalDCE during the OPT stage. Since libdevice provides 352 functions but a typical kernel uses only a handful, the vast majority are stripped.

  4. alwaysinline enforcement: During the OPT stage, the AlwaysInliner pass processes all libdevice functions. After inlining, the original function bodies become dead (no remaining callers) and are removed by subsequent DCE.

The net effect: a kernel calling __nv_sinf ends up with the sinf implementation inlined directly into the kernel body, with __nvvm_reflect calls already resolved to constants by NVVMReflect, and all unused branches from precision/architecture dispatch eliminated by SimplifyCFG.

Constant Folding Interaction

The constant folding engine (sub_14D90D0, 27KB) has special knowledge of libdevice functions. When a libdevice intrinsic is called with constant arguments, the fold eligibility checker determines whether the call can be evaluated at compile time -- before the libdevice function is inlined.

This creates an important ordering constraint:

LNK stage:  link libdevice → user module now has __nv_sinf definitions
OPT stage:  NVVMReflect  → resolve __CUDA_FTZ, __CUDA_ARCH queries
            ConstantFold → fold __nv_sinf(0.0) → 0.0 (if eligible)
            AlwaysInline → inline remaining __nv_sinf calls
            SimplifyCFG  → remove dead reflect branches
            GlobalDCE    → remove unused libdevice functions

The fold eligibility checker (sub_14D90D0) uses three dispatch mechanisms to identify foldable functions:

LLVM intrinsic ID switch (IDs 0-211): Covers standard LLVM intrinsics like llvm.sin, llvm.cos, llvm.sqrt, llvm.fma, llvm.floor, llvm.ceil, llvm.exp, llvm.log, llvm.pow, llvm.fabs, llvm.bswap, llvm.ctlz, llvm.ctpop, and overflow arithmetic.

NVVM intrinsic ID ranges (IDs > 211): Covers NVIDIA-specific intrinsics organized as binary-search ranges with bitmask dispatch:

| Range | IDs | Examples |
|---|---|---|
| 0xEB4-0xEE3 | 3764-3811 | nvvm.ceil.f, nvvm.ctlz.i, nvvm.cos.approx.ftz.f |
| 0xF1E-0xF72 | 3870-3954 | nvvm.exp2.approx, nvvm.fabs.f, nvvm.floor.f, nvvm.sqrt.f |
| 0xFE8-0xFEA | 4072-4074 | nvvm.sin.approx.ftz.f and similar |
| 0x1012-0x104C | 4114-4172 | nvvm.max.i, nvvm.min.ui, nvvm.min.ll |
| 0x1086-0x1087 | 4230-4231 | nvvm.mul.hi.* |
| 0x117B-0x1184 | 4475-4484 | nvvm.sqrt.rn.d, nvvm.sqrt.approx.ftz.f |
| 0x1C80-0x1CAC | 7296-7340 | nvvm.fmax.f, nvvm.fmin.ftz.nan.f |

Name-based matching (ID = 0): When the call target is not a recognized LLVM or NVVM intrinsic, the checker falls back to string matching on the function name. It dispatches on the first character, then uses DWORD integer comparisons for 4-byte names and memcmp for longer names:

Foldable C library names:
  sin, sinf, cos, cosf, tan, tanf, acos, acosf, asin, asinf,
  atan, atanf, atan2, atan2f, ceil, ceilf, cosh, coshf,
  exp, expf, exp2, exp2f, fabs, fabsf, floor, floorf,
  fmod, fmodf, log, logf, log10, log10f, pow, powf,
  round, roundf, sinh, sinhf, sqrt, sqrtf, tanh, tanhf

Convergent gate: Before any folding, the checker verifies that the callee does not carry the convergent attribute (kind 0x34). Convergent functions have warp-synchronous semantics and must not be speculatively constant-folded, even if all arguments are constants.

Configuration

Environment Variables

| Variable | Effect |
|---|---|
| NVVM_IR_VER_CHK | Set to "0" to disable IR version validation. Any other value or unset = enabled (default). Checked in sub_12BFF60 at 0x12BFF60 and in the duplicate verifier at 0x2259720. |

CLI Flags

| Flag | Effect |
|---|---|
| -nvvmir-library <path> | Override the embedded libdevice with an external bitcode file. The file is opened, read into memory, and passed to the linker in place of the embedded unk_3EA0080/unk_420FD80 pointer. |
| -opt / -llc | When passed as the first extra argument, skips builtin linking entirely (jumps past the libdevice linking code to direct pipeline stage invocation). |
| -keep | Preserves the .lnk.bc intermediate file showing the linked module (user + libdevice) before optimization. |

Intermediate Files

When -keep is active, the LNK stage serializes its output to a .lnk.bc file alongside the input:

input.cu  →  input.lnk.bc  (linked: user + libdevice)
          →  input.opt.bc  (optimized: after OPT stage)
          →  input.ptx     (final: after LLC stage)

The .lnk.bc file is useful for verifying which libdevice functions survived linking and how __nvvm_reflect calls appear before the NVVMReflect pass resolves them.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| ModuleLinker | sub_12C06E0 | 63KB | Main bitcode linker: validates magic, triple, version; links user modules, then builtins |
| NVVMIRVersionCheck | sub_12BFF60 | 9KB | Reads nvvmir.version metadata, checks compatibility via sub_12BDA30/sub_12BD890 |
| CheckIRVersion | sub_12BDA30 | ~2KB | IR version compatibility predicate (special-cases {2,0} as always-compatible) |
| CheckDebugVersion | sub_12BD890 | ~2KB | Debug IR version compatibility predicate (special-cases {3, <=2}) |
| PipelineOrchestrator | sub_12C35D0 | 41KB | 4-stage pipeline driver; calls sub_12C06E0 during LNK stage |
| LibNVVMPipelineA | sub_905EE0 | 43KB | Path A pipeline driver; references unk_3EA0080 for embedded libdevice |
| LibNVVMPipelineB | sub_1265970 | 48KB | Path B pipeline driver; references unk_420FD80 for embedded libdevice |
| nvvmCUAddModuleFromBuffer | sub_12BCB00 | ~1KB | API wrapper that adds a bitcode buffer to the compilation unit |
| LibNVVM API dispatch | sub_12BC0F0 | 3KB | Resolves LibNVVM API function pointers by hash ID |
| ParseBitcodeFile | sub_15099C0 | ~8KB | LLVM bitcode parser entry point |
| LinkBuiltinModule | sub_1CCEBE0 | ~4KB | Links a single builtin module into the main module (Linker::linkModules with OverrideFromSrc [MEDIUM confidence] -- inferred from the override-from-source semantics of builtin linking and the 4KB size matching a thin wrapper around LLVM's linker API, but no diagnostic string confirms the exact LLVM API call) |
| LinkUserModules | sub_12F5610 | ~4KB | Links multiple user modules (Linker::linkModules [MEDIUM confidence] -- same reasoning as above; wrapper size and call pattern match, but unconfirmed by string evidence) |
| CanFoldIntrinsic | sub_14D90D0 | 27KB | Constant-fold eligibility checker for math intrinsics |
| embedded libdevice (Path A) | unk_3EA0080 | 455,876B | Raw LLVM bitcode blob |
| embedded libdevice (Path B) | unk_420FD80 | 455,876B | Raw LLVM bitcode blob (identical copy) |

Reimplementation Checklist

  1. Embedded bitcode storage and loading. Embed the libdevice bitcode blob (455,876 bytes) directly in the compiler binary, provide two independent copies for dual-path compilation (Path A / Path B), and implement the nvvmCUAddModuleFromBuffer API wrapper to load the embedded blob or an external override file via -nvvmir-library.
  2. Bitcode magic validation. Accept two bitcode formats: raw bitcode (0x42 0x43 0xC0 0xDE, ASCII "BC" prefix) and the bitcode wrapper header (0xDE 0xC0 0x17 0x0B, i.e. little-endian magic 0x0B17C0DE). Reject anything else with error code 9.
  3. Target triple and IR version validation. Enforce nvptx64- prefix on all module triples. Implement the NVVM IR version checker that reads nvvmir.version metadata (2-element or 4-element tuples), special-cases version {2,0} as always-compatible (the libdevice sentinel), and checks debug IR version compatibility for {3, <=2}.
  4. Multi-module linking pipeline. Implement the six-phase linker: (A) module iteration with bitcode validation, (B) triple validation, (C) IR version check, (D) single-module fast path, (E) multi-module user linking with primary module selection and triple/data-layout propagation, (F) builtin linking with OverrideFromSrc semantics.
  5. Symbol size matching. Walk all global symbols in the linked module, compute type sizes recursively (handling half/float/double/pointer/integer/struct/array/vector types), and verify that declarations and definitions agree on type sizes using a binary search tree keyed by symbol name.
  6. Constant folding integration. Implement the fold eligibility checker for libdevice functions with three dispatch mechanisms (LLVM intrinsic ID switch for IDs 0--211, NVVM intrinsic ID ranges for IDs >211, name-based matching for C library names), gated by the convergent attribute check to prevent folding warp-synchronous functions.

Cross-References

LLVM Optimizer

NVIDIA's LLVM optimizer in cicc v13.0 is not a straightforward invocation of the upstream LLVM opt pipeline. Instead, it implements a proprietary two-phase compilation model where the same 49.8KB pipeline assembly function (sub_12E54A0) is called twice with different phase counters, allowing analysis passes to run in Phase I and codegen-oriented passes in Phase II. Individual passes read a TLS variable (qword_4FBB3B0) to determine which phase is active and skip themselves accordingly.

The optimizer also supports concurrent per-function compilation: after Phase I completes on the whole module, Phase II can be parallelized across functions using a thread pool sized to get_nprocs() or a GNU Jobserver token count. This is a significant departure from upstream LLVM, which processes functions sequentially within a single pass manager invocation.

The entire optimization behavior is controlled by the NVVMPassOptions system — a 4,512-byte struct with 221 option slots (114 string + 100 boolean + 6 integer + 1 string-pointer) that provides per-pass enable/disable toggles and parametric knobs. This system is completely proprietary and has no upstream equivalent.

Address range 0x12D00000x16FFFFF (~4.2 MB of code).

| Component | Function / Details |
|---|---|
| Pipeline assembler | sub_12E54A0 (49.8KB, 1,553 lines, ~150 pass insertions) |
| Phase orchestrator | sub_12E7E70 (9.4KB, Phase I / Phase II) |
| Concurrent entry | sub_12E1EF0 (51.3KB, jobserver + split-module + thread pool) |
| PassOptions init | sub_12D6300 (125KB, 4,786 lines, 221 option slots) |
| New PM registration | sub_2342890 (2,816 lines, 35 NVIDIA + ~350 LLVM passes) |
| Target creation | sub_12EA530 (4.1KB, "nvptx" / "nvptx64") |
| AddPass | sub_12DE0B0 (3.5KB, hash-table-based pass insertion) |
| Tier 0 sub-pipeline | sub_12DE330 (4.8KB, ~40 passes) |
| Tier 1/2/3 sub-pipeline | sub_12DE8F0 (17.9KB, phase-conditional) |
| Codegen dispatch | sub_12DFE00 (20.7KB) |
| LTO pipeline | sub_12F5F30 (37.8KB, dead kernel elimination) |
| jemalloc | 5.3.x statically linked (~400 functions at 0x12FC000) |

Architecture

sub_12E1EF0 (51KB, concurrent compilation entry)
  │
  ├─ GNU Jobserver init (sub_16832F0, --jobserver-auth=R,W from MAKEFLAGS)
  ├─ Bitcode reading + verification (sub_153BF40)
  ├─ Function sorting by priority (sub_12E0CA0)
  ├─ Thread pool creation (sub_16D4AB0, min(requested, num_functions) threads)
  │
  └─ sub_12E7E70 (9.4KB, two-phase orchestrator)
       │
       ├─ Phase I: qword_4FBB3B0 = 1
       │    └─ sub_12E54A0 (whole-module analysis + early optimization)
       │
       ├─ Concurrency check: sub_12D4250 (>1 defined function?)
       │    ├─ Yes, threads>1 → per-function Phase II via thread pool
       │    │    └─ sub_12E86C0 per function (qword_4FBB3B0 = 2)
       │    └─ No → sequential Phase II
       │         └─ sub_12E54A0 (qword_4FBB3B0 = 2)
       │
       └─ qword_4FBB3B0 = 3 (done)

sub_12E54A0 (49.8KB, MASTER PIPELINE ASSEMBLY)
  │
  ├─ Top branch: a4[4384] → Pipeline B (fast/codegen-only)
  │                    else → Pipeline A (normal LLVM)
  │
  ├─ Target machine setup
  │    ├─ Triple: "nvptx64" or "nvptx" (based on pointer size)
  │    ├─ sub_16D3AC0 → TargetRegistry::lookupTarget()
  │    ├─ TargetOptions: PIC=1, CodeModel=8, OptLevel=1, ThreadModel=1
  │    └─ DataLayout from qword_4FBB430
  │
  ├─ Phase 0: Infrastructure (TLI, TTI, Verifier, AssumptionCache, ProfileSummary)
  ├─ Phase 1: Language dispatch (a4[3648]: "ptx"/"mid"/default)
  ├─ Phase 2: Pre-optimization passes
  ├─ Phase 3: Main optimization loop (tier threshold dispatch)
  ├─ Phase 4: Post-opt language-specific pipelines
  ├─ Phase 5: Finalization (NVVMLowerBarriers, BreakCriticalEdges, codegen)
  ├─ Phase 6: Phase 2 codegen check (qword_4FBB3B0 == 2)
  ├─ Phase 7: PassManager::run
  └─ Phase 8: Basic block naming ("F%d_B%d" for debug)

Two-Phase Compilation — sub_12E7E70

| Field | Value |
|---|---|
| Address | 0x12E7E70 |
| Size | 9.4KB |
| Strings | "Phase I", "Phase II", "Concurrent=Yes/No" |

The two-phase model exists because certain optimization passes (e.g., inter-procedural memory space propagation, global inlining decisions) require whole-module visibility, while others (register pressure-driven rematerialization, instruction scheduling) operate per-function and benefit from parallelization. Phase I runs the whole-module analysis and early optimization passes; Phase II runs the per-function backend-oriented passes.

Both phases call the same sub_12E54A0. The difference: qword_4FBB3B0 (TLS variable) is set to 1 or 2 before each call. Individual passes read this counter and skip themselves if the current phase doesn't match their intended execution phase. When the module contains only a single defined function, the phase mechanism is bypassed entirely — a single unphased call handles everything.

Phase State Machine:

  START → [phase=1] → sub_12E54A0 (Phase I)
    │
    error? → RETURN
    │
    count_functions()
    ├─ 1 func → [phase=2] → sub_12E54A0 → [phase=3] → DONE
    ├─ N funcs, threads>1 → per-function Phase II (thread pool) → [phase=3] → DONE
    └─ N funcs, threads≤1 → [phase=2] → sub_12E54A0 → [phase=3] → DONE

Single-function modules skip the phase mechanism entirely — a single unphased call to sub_12E54A0.

GNU Jobserver Integration

When cicc is invoked from a parallel make -jN build, it can participate in the GNU Jobserver protocol to limit its own thread count to the available parallelism tokens. This prevents oversubscription — without it, a -j16 build could spawn 16 cicc processes each creating their own thread pool, resulting in hundreds of threads competing for CPU time. The jobserver reads the --jobserver-auth=R,W pipe file descriptors from the MAKEFLAGS environment variable.

In sub_12E1EF0 (lines 833–866), when a4+3288 is set:

v184 = sub_16832F0(&state, 0);   // parse MAKEFLAGS for --jobserver-auth=R,W
if (v184 == 5 || v184 == 6)      // pipe issues
    warning("jobserver pipe problem");
else if (v184 != 0)
    fatal("GNU Jobserver support requested, but an error occurred");

sub_16832F0 allocates a 296-byte state structure, parses MAKEFLAGS, creates a pipe for token management, and spawns a pthread to manage tokens. This throttles concurrent per-function compilations to match the build's -j level.

Split-Module Compilation

Split-module compilation is NVIDIA's mechanism for the -split-compile=N flag. It decomposes a multi-function module into individual per-function bitcode blobs, compiles each independently (potentially in parallel), then re-links the results. This trades away inter-procedural optimization opportunities for compilation speed and reduced peak memory usage — a worthwhile tradeoff for large CUDA kernels during development iteration.

When optimization level (a4+4104) is negative, enters split-module mode:

  1. Each function's bitcode is extracted via sub_1AB9F40 with filter callback sub_12D4BD0
  2. Module name: "<split-module>" (14 chars)
  3. After thread pool completes, split modules are re-linked via sub_12F5610
  4. Linkage attributes restored from hash table (external linkage types: bits 0–5, dso_local: bit 6 of byte+33)

Pipeline Assembly — sub_12E54A0

The pipeline assembly function is the heart of the optimizer. At 49.8KB with ~150 AddPass calls, it constructs the complete LLVM pass pipeline at runtime rather than using a static pipeline description. The function first sets up target machine infrastructure (triple, data layout, subtarget features), then dispatches into one of three language-specific paths that determine which passes run and in what order. After the language-specific path completes, a shared finalization phase runs barriers, critical edge breaking, and codegen preparation.

A distinguishing feature of NVIDIA's pipeline is the tier system: passes are organized into Tiers 0–3, each gated by a threshold counter. As compilation progresses through the main loop (which iterates over external plugin/extension pass entries), tiers fire when the accumulated pass count exceeds their threshold. This allows NVIDIA to precisely control where in the pipeline their custom passes interleave with standard LLVM passes.

Language-Specific Paths

The pipeline branches based on a4[3648] (language string). The three paths represent different optimization strategies for different IR maturity levels:

| String | Path | Pass Count | Key Difference |
|---|---|---|---|
| "ptx" | Path A | ~15 | Light: NVVMPeephole → LLVM standard → DCE → MemorySpaceOpt |
| "mid" | Path B | ~45 | Full: SROA → GVN → LICM → LoopIndexSplit → Remat → all NVIDIA passes |
| (default) | Path C | ~40 | General: 4 LLVM standard passes + NVIDIA interleaving |

Tier System

The main loop iterates over entries at a4[4488] (16-byte stride: vtable + phase_id):

if (opt_enabled && phase_id > opt_threshold) → sub_12DE330  // Tier 0 (full)
if (tier1_flag && phase_id > tier1_threshold) → sub_12DE8F0(1) // Tier 1
if (tier2_flag && phase_id > tier2_threshold) → sub_12DE8F0(2) // Tier 2
if (tier3_flag && phase_id > tier3_threshold) → sub_12DE8F0(3) // Tier 3

Each tier fires once (flag cleared after execution). Remaining tiers fire unconditionally after the loop.

Tier 0 — Full Optimization (sub_12DE330)

Tier 0 is the most aggressive optimization sub-pipeline. It runs ~40 passes in a carefully ordered sequence that interleaves standard LLVM passes with NVIDIA-specific ones. The ordering reveals NVIDIA's optimization strategy: start with GVN and SCCP for value simplification, then run NVIDIA's custom NVVMReflect and NVVMVerifier to clean up NVVM-specific constructs, followed by aggressive loop transformations (LoopIndexSplit, LoopUnroll, LoopUnswitch), and finally register-pressure-sensitive passes (Rematerialization, DSE, DCE) to prepare for codegen.

~40 passes in order:

Confidence note: Pass identifications are based on diagnostic strings, factory signatures, and pipeline ordering. Most are HIGH confidence. Entries with [MEDIUM confidence] are inferred from code structure rather than direct string evidence.

| # | Factory | Likely Pass | Guarded By |
|---|---|---|---|
| 1 | sub_1654860(1) | BreakCriticalEdges | |
| 2 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | |
| 3 | sub_1B26330 | MemCpyOpt | |
| 4 | sub_185D600 | IPConstantPropagation | |
| 5 | sub_1C6E800 | GVN | |
| 6 | sub_1C6E560 | NewGVN/GVNHoist [MEDIUM confidence] | |
| 7 | sub_1857160 | NVVMReflect | |
| 8 | sub_1842BC0 | SCCP | |
| 9 | sub_12D4560 | NVVMVerifier | |
| 10 | sub_18A3090 | NVVMPredicateOpt | |
| 11 | sub_184CD60 | ConstantMerge | |
| 12 | sub_1869C50(1,0,1) | Sink/MemSSA [MEDIUM confidence] | !opts[1040] |
| 13 | sub_1833EB0(3) | TailCallElim/JumpThreading [MEDIUM confidence] | |
| 14 | sub_1952F90(-1) | LoopIndexSplit | |
| 15 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | |
| 16 | sub_1A223D0 | NVVMIRVerification | |
| 17 | sub_1A7A9F0 | InstructionSimplify | |
| 18 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | |
| 19 | sub_1A02540 | GenericToNVVM | |
| 20 | sub_198DF00(-1) | LoopSimplify | |
| 21 | sub_1C76260 | ADCE | !opts[1320] |
| 22 | sub_195E880(0) | LICM | opts[2880] |
| 23 | sub_19C1680(0,1) | LoopUnroll | !opts[1360] |
| 24 | sub_19401A0 | InstCombine | |
| 25 | sub_1968390 | SROA | |
| 26 | sub_196A2B0 | EarlyCSE | |
| 27 | sub_19B73C0(2,...) | LoopUnswitch | |
| 28 | sub_190BB10(0,0) | SimplifyCFG | |
| 29 | sub_1A13320 | NVVMRematerialization | |
| 30 | sub_18F5480 | DSE | |
| 31 | sub_18DEFF0 | DCE | |
| 32 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | |
| 33 | sub_18B1DE0 | NVVMLoopPass [MEDIUM confidence] | |
| 34 | sub_1841180 | FunctionAttrs | |

"mid" Path — Complete Pass Ordering

The "mid" path is the primary optimization pipeline for standard CUDA compilation. At ~45 passes, it is the most comprehensive of the three paths. The key pattern is repeated interleaving of NVIDIA custom passes with standard LLVM passes: NVVMIntrinsicLowering runs 4 times at different points, NVVMReflect runs 3 times, and NVVMIRVerification runs after each major transformation to catch correctness regressions early. The MemorySpaceOpt pass appears once in this sequence (gated by !opts[1760]) — it runs again later via the parameterized <second-time> invocation in Tier 1/2/3.

ConstantMerge → NVVMIntrinsicLowering → MemCpyOpt → SROA → NVVMPeephole → NVVMAnnotations → LoopSimplify → GVN → NVVMIRVerification → SimplifyCFG → InstCombine → LLVM standard #5 → NVVMIntrinsicLowering → DeadArgElim → FunctionAttrs → DCE → ConstantMerge → LICM → NVVMLowerBarriers → MemorySpaceOpt → Reassociate → LLVM standard #8 → NVVMReflect → ADCE → InstructionSimplify → DeadArgElim → TailCallElim → DeadArgElim → CVP → Sink → SimplifyCFG → DSE → NVVMSinking2 → NVVMIRVerification → EarlyCSE → NVVMReflect → LLVM standard #8 → NVVMIntrinsicLowering → IPConstProp → LICM → NVVMIntrinsicLowering → NVVMBranchDist → NVVMRemat

NVVMPassOptions — sub_12D6300

NVVMPassOptions is NVIDIA's proprietary mechanism for fine-grained control over every optimization pass. Unlike LLVM's cl::opt system (which uses global command-line options), NVVMPassOptions stores per-pass configuration in a flat struct that is allocated once and passed through the pipeline by pointer. This design avoids the global-state problems of cl::opt and allows different compilation units to have different pass configurations within the same process — critical for the concurrent per-function compilation model.

The 125KB initialization function is the largest in the optimizer range. Its size comes from the sheer number of option slots: each of the 221 slots requires a hash-table lookup, a default-value resolution, and a type-specific store, with most slots organized in pairs (a string parameter + a boolean enable flag).

| Field | Value |
|---|---|
| Address | 0x12D6300 |
| Size | 125KB (4,786 lines) |
| Output struct | 4,512 bytes (allocated via sub_22077B0(4512)) |
| Slot count | 221 (indices 1–221) |
| Slot types | 114 string + 100 boolean + 6 integer + 1 string-pointer |

Struct Layout

| Region | Offset | Content |
|---|---|---|
| Header | 0–7 | int opt_level (from a2+112) |
| Registry ptr | 8–15 | Pointer to PassOptionRegistry |
| Slot pairs | 16–4479 | 221 option slots (string/bool/int pairs) |
| Sentinel | 4480–4511 | 4 qwords zeroed |

Option Slot Types

| Type | Size | Writer | Count |
|---|---|---|---|
| String | 24B | sub_12D6090 | 114 |
| Bool (compact) | 16B | sub_12D6100 | 83 |
| Bool (inline) | 16B | direct byte write | 17 |
| Integer | 16B | sub_16D2BB0 (parseInt) | 6 |
| String pointer | 28B | direct qword write (slot 181 only) | 1 |

Pair Organization

Slots are organized in pairs: even = string parameter (the pass's configuration value or name), odd = boolean enable/disable toggle (the do-X flag). This consistent pairing means each "pass knob" has both a parametric value and an on/off switch, allowing passes to be individually disabled without removing their configuration — useful for A/B testing optimizations.

Exceptions to the pair pattern: slots 160–162 (3 consecutive strings — a pass with 3 string parameters), slots 192–193 (2 consecutive bools — a pair of binary flags), slot 181 (the only string-pointer type, storing a char* + length directly — likely a file path or regex pattern).

Defaults Enabled (14 of 100 booleans)

Slots: 19, 25, 93, 95, 117, 141, 143, 151, 155, 157, 159, 165, 211, 219. These are passes that run by default and must be explicitly disabled.

Integer Defaults

| Slot | Default | Likely Purpose |
|---|---|---|
| 9 | 1 | Iteration count / threshold |
| 197 | 20 | Limit (e.g., unroll count) |
| 203 | -1 | Sentinel (unlimited/auto) |
| 205 | -1 | Sentinel |
| 207 | -1 | Sentinel |
| 215 | 0 | Disabled counter |

Known Option Names

Boolean toggles (do-X / no-X): do-ip-msp, do-licm, do-remat, do-clone-for-ip-msp, do-cssa, do-scev-cgp, do-function-scev-cgp, do-scev-cgp-aggresively, do-base-address-strength-reduce, do-base-address-strength-reduce-chain, do-comdat-renaming, do-counter-promotion, do-lsr-64-bit, do-sign-ext-expand, do-sign-ext-simplify

Parametric knobs: remat-for-occ, remat-gep-cost, remat-max-live-limit, remat-maxreg-ceiling, remat-move, remat-single-cost-limit, remat-use-limit, branch-dist-block-limit, branch-dist-func-limit, branch-dist-norm, scev-cgp-check-latency, scev-cgp-control, scev-cgp-cross-block-limit, scev-cgp-idom-level-limit, scev-cgp-inst-limit, scev-cgp-norm, cssa-coalesce, cssa-verbosity, base-address-strength-reduce-iv-limit

Dump flags: dump-ip-msp, dump-remat, dump-branch-dist, dump-scev-cgp, dump-sink2, dump-before-cssa, dump-normalize-gep, dump-simplify-live-out

New PM Pass Registration — sub_2342890

NVIDIA maintains both the Legacy Pass Manager and the New Pass Manager in cicc v13.0. The New PM registration lives in a single 2,816-line function that registers every analysis, pass, and printer by calling sub_E41FB0(pm, class_name, len, pass_name, len) for each. Standard LLVM passes use the llvm:: prefix (stripped during registration), while NVIDIA custom passes use their own class names.

The registration function also handles parameterized pass parsing: when the pipeline text parser encounters a pass name with angle-bracket parameters (e.g., memory-space-opt<first-time;warnings>), it calls a registered parameter-parsing callback that returns a configured pass options struct. This is how MemorySpaceOpt can run twice with different configurations in the same pipeline.

NVIDIA Custom Passes (35 total)

Module passes (12): check-gep-index, check-kernel-functions, cnp-launch-check, ipmsp, nv-early-inliner, nv-inline-must, nvvm-pretreat, nvvm-verify, printf-lowering, select-kernels, lower-ops*, set-global-array-alignment*

Function passes (20): basic-dbe, branch-dist, byval-mem2reg, bypass-slow-division, normalize-gep, nvvm-reflect-pp, nvvm-peephole-optimizer, old-load-store-vectorizer, remat, propagate-alignment, reuse-local-memory, set-local-array-alignment, sinking2, d2ir-scalarizer, sink<rp-aware>, memory-space-opt*, lower-aggr-copies*, lower-struct-args*, process-restrict*

Loop pass (1): loop-index-split

Analyses (2): rpa (RegisterPressureAnalysis), merge-sets (MergeSetsAnalysis)

* = parameterized

Key Discoveries

  • nvvm-reflect-pp is actually SimplifyConstantConditionalsPass, not a reflection pass. It runs after NVVMReflect resolves __nvvm_reflect() calls to constants, cleaning up the resulting dead branches and unreachable code. The misleading name ("pp" = post-processing) obscures what is essentially a targeted dead-code-elimination pass.
  • memory-space-opt runs twice in the pipeline with different parameterizations: <first-time> early in optimization (conservative, uses available alias information) and <second-time> late (aggressive, benefits from earlier optimizations having simplified the IR). This two-pass approach is necessary because address space resolution depends on pointer analysis quality, which improves as other passes simplify the code.
  • d2ir-scalarizer reuses LLVM's ScalarizerPass class under a different name, suggesting NVIDIA added a custom registration point to control when scalarization happens in the NVPTX pipeline without modifying the upstream pass.
  • Legacy PM co-existence: both Legacy PM and New PM registrations exist for the same passes, with slightly different names (e.g., "memory-space-opt-pass" vs "memory-space-opt"). This dual registration is necessary during the LLVM Legacy→New PM migration — cicc v13.0 appears to be in the middle of this transition.

Key Global Variables

| Variable | Purpose |
|---|---|
| qword_4FBB3B0 | Phase counter TLS: 1=Phase I, 2=Phase II, 3=done |
| qword_4FBB370 | Feature flag register (value 6 = barrier opt + memspace opt) |
| qword_4FBB410 | Tier execution tracker |
| qword_4FBB430 | Optimization level store |
| qword_4FBB510 | Debug/trace verbosity level |
| byte_3F871B3 | NVIDIA global flag byte (empty/null string in .rodata) |
| byte_4F99740 | CUTLASS optimization enable flag |

NVVMPassOptions Deep Dive

Memory Layout

The 4,512-byte NVVMPassOptions struct is allocated on the heap via sub_22077B0(4512) at the start of each compilation. The layout divides into four regions:

Offset 0x000 [8B]  : int32 opt_level (from config+112) + 4B padding
Offset 0x008 [8B]  : qword ptr to PassOptionRegistry (hash table source)
Offset 0x010 [4464B]: 221 option slots (indices 1-221)
Offset 0x1180[32B] : 4 qwords zeroed (sentinel/trailer)

The slots start at offset 16 and are packed contiguously. Each slot occupies a fixed size depending on its type, but the stride varies: string options take 24 bytes, boolean options take 16 bytes, integer options take 16 bytes, and the single string-pointer option (slot 181) takes 28 bytes. The overall packing is not uniform-stride; the offset of each slot must be computed from the cumulative widths of all preceding slots.

Slot Type Formats

Five distinct slot types exist, each written by a dedicated helper:

// TYPE A: String option (114 instances)
// Written by sub_12D6090 (writeStringOption)
struct StringSlot {        // 24 bytes
    char*   value_ptr;     // +0: pointer to string value
    int32_t option_index;  // +8: 1-based slot index
    int32_t flags;         // +12: from PassDef byte+40
    int32_t opt_level;     // +16: optimization level context
    int32_t pass_id;       // +20: resolved via sub_1691920
};

// TYPE B: Boolean compact (83 instances)
// Written by sub_12D6100 (writeBoolOption)
struct BoolCompactSlot {   // 16 bytes
    uint8_t value;         // +0: 0 or 1
    uint8_t pad[3];        // +1: padding
    int32_t option_index;  // +4
    int32_t flags;         // +8
    int32_t pass_id;       // +12
};

// TYPE C: Boolean inline (17 instances)
// Written directly as byte + int32 fields
struct BoolInlineSlot {    // 16 bytes
    uint8_t value;         // +0: 0 or 1
    uint8_t pad[3];        // +1
    int32_t option_index;  // +4: from sub_12D6240 return hi32
    int32_t opt_level;     // +8
    int32_t pass_id;       // +12: resolved inline
};

// TYPE D: Integer (6 instances)
// Value parsed by sub_16D2BB0 (parseInt)
struct IntegerSlot {       // 16 bytes
    int32_t value;         // +0: parsed integer
    int32_t option_index;  // +4
    int32_t opt_level;     // +8
    int32_t pass_id;       // +12
};

// TYPE E: String pointer (1 instance, slot 181 only)
struct StringPtrSlot {     // 28 bytes
    char*   char_ptr;      // +0: raw string data pointer
    int64_t str_length;    // +8: length of string
    int32_t option_index;  // +16
    int32_t opt_level;     // +20
    int32_t pass_id;       // +24
};

Helper Function Chain

The initialization function sub_12D6300 populates the struct by iterating all 221 slot indices and calling a chain of helpers for each:

  1. sub_12D6170 (PassOptionRegistry::lookupOption) -- looks up a slot index in the hash table at registry+120. Returns a pointer to an OptionNode struct: [+40] int16 flags, [+48] qword* value_array_ptr, [+56] int value_count. Returns null if the option was not set on the command line.

  2. sub_12D6240 (getBoolOption) -- resolves a boolean option. Calls sub_12D6170 to find the option, then if a string value exists, lowercases it via sub_16D2060 and tests if the first char is '1' (0x31) or 't' (0x74). If the option was not found, defaults to true (enabled). Returns the boolean packed with the flags in the low 40 bits.

  3. sub_1691920 (PassDefTable::getPassDef) -- looks up a PassDef entry in a table where each entry is 64 bytes. Computes: table[0] + (index - 1) * 64. The PassDef at [+32] holds the pass_id, at [+36] a has_overrides flag, and at [+40] an override index.

Initial Slots (1-6): Global Configuration

The first six slots are all string types at a uniform 24-byte stride, starting at offset 16. They do not follow the pair pattern and represent global pipeline parameters rather than per-pass knobs:

| Slot | Offset | Likely Content |
|---|---|---|
| 1 | 16 | ftz (flush-to-zero mode string) |
| 2 | 40 | prec-div (precise division setting) |
| 3 | 64 | prec-sqrt (precise square root setting) |
| 4 | 88 | fmad (fused multiply-add policy) |
| 5 | 112 | opt-level (optimization level string) |
| 6 | 136 | sm-arch (target SM architecture string) |

CLI Interface

Users interact with NVVMPassOptions via the -opt flag, which appends key=value pairs to the PassOptionRegistry before sub_12D6300 flattens them:

cicc -opt "-do-ip-msp=0"            # disable memory space propagation
cicc -opt "-do-licm=0"              # disable LICM
cicc -opt "-remat-max-live-limit=50" # set rematerialization threshold
cicc -opt "-dump-remat"             # enable remat dump output

The registry is a hash table populated from these CLI strings. Each -opt argument is parsed into a key (the option name) and value (the string after =). When sub_12D6300 runs, it queries the registry for each of the 221 slot indices. If a CLI override exists, it takes precedence; otherwise the compiled-in default is used.

Option Anomalies

Several regions break the standard string/boolean pair pattern:

  • Slots 160-162: Three consecutive string slots with no interleaved boolean. [LOW confidence] This represents a pass (likely MemorySpaceOpt or the CSSA pass) that takes three string configuration parameters followed by a single boolean enable flag at slot 163. The pass identity is uncertain because neither MemorySpaceOpt nor CSSA has been confirmed to consume three string parameters; the association is based on pipeline position proximity only.
  • Slots 192-193: Two consecutive boolean slots. One is the main enable toggle; the other appears to be a sub-feature flag (both default to disabled).
  • Slot 181 (offset 3648): The only STRING_PTR type. Its default is byte_3F871B3 (an empty string in .rodata). The raw pointer + length storage suggests this holds a file path or regex pattern for pass filtering.
  • Slots 196-207: Alternating string + integer slots instead of string + boolean. [LOW confidence] This high-numbered region contains all six integer options, likely controlling late-pipeline passes with numeric thresholds (unroll counts, live-variable limits, iteration bounds). The specific pass-to-slot associations are unconfirmed; the "unroll counts, live-variable limits, iteration bounds" interpretation is based on typical LLVM integer-valued pass options, not direct evidence.

Complete Slot-to-Offset Map with Known Consumers

The following table maps NVVMPassOptions slot indices to struct byte offsets, types, defaults, and -- where the cross-reference to the pipeline assembler's a4[offset] guards could be established -- the consuming pass(es). Offsets marked with * are confirmed by cross-referencing a4[offset] guards in sub_12E54A0 and sub_12DE8F0.

| Slot | Offset | Type | Default | Known Knob Name | Consuming Pass |
|------|--------|------|---------|-----------------|----------------|
| 1 | 16 | STRING | | ftz | Global: flush-to-zero mode |
| 2 | 40 | STRING | | prec-div | Global: precise division |
| 3 | 64 | STRING | | prec-sqrt | Global: precise sqrt |
| 4 | 88 | STRING | | fmad | Global: fused multiply-add |
| 5 | 112 | STRING | | opt-level | Global: optimization level |
| 6 | 136 | STRING | | sm-arch | Global: target SM architecture |
| 7 | 160 | BOOL | 0 | | |
| 8 | 176 | STRING | | | |
| 9 | 200* | INTEGER | 1 | | Opt level for sub_12DFE00 codegen |
| 10 | 216 | STRING | | | |
| 11 | 240 | BOOL | 0 | | |
| 13 | 280* | BOOL | 0 | no-dce | sub_18DEFF0 (DCE) |
| 15 | 320* | BOOL | 0 | no-tailcallelim | sub_1833EB0 (TailCallElim) |
| 17 | 360* | BOOL | 0 | no-late-opt | sub_1C46000 (NVVMLateOpt) |
| 19 | 400* | BOOL | 1 | no-inline-a | Inlining variant A |
| 21 | 440* | BOOL | 0 | no-inline-b | sub_1C4B6F0 (AlwaysInliner) |
| 23 | 480* | BOOL | 0 | no-inline-c | sub_1C4B6F0 in sub_12DE8F0 |
| 25 | 520* | BOOL | 1 | | sub_1AAC510 (NVIDIA pass A) |
| 27 | 560* | BOOL | 0 | | sub_1AAC510 (NVIDIA pass B) |
| 29 | 600* | BOOL | 0 | no-nvvm-verify | sub_12D4560 (NVVMVerifier) |
| 33 | 680* | BOOL | 0 | no-func-attrs | sub_1841180 (FunctionAttrs) |
| 35 | 720* | BOOL | 0 | no-sccp | sub_1842BC0 (SCCP) |
| 37 | 760* | BOOL | 0 | no-dse | sub_18F5480 (DSE) |
| 43 | 880* | BOOL | 0 | no-nvvm-reflect | sub_1857160 (NVVMReflect) |
| 45 | 920* | BOOL | 0 | no-ipconst | sub_185D600 (IPConstProp) |
| 47 | 960* | BOOL | 0 | no-simplifycfg | sub_190BB10 (SimplifyCFG) |
| 49 | 1000* | BOOL | 0 | no-instcombine | sub_19401A0 (InstCombine) |
| 51 | 1040* | BOOL | 0 | no-sink | sub_1869C50 (Sink/MemSSA) |
| 53 | 1080* | BOOL | 0 | no-dump | sub_17060B0 (PrintModulePass) |
| 55 | 1120* | BOOL | 0 | no-predopt | sub_18A3430 (NVVMPredicateOpt) |
| 57 | 1160* | BOOL | 0 | no-loopindexsplit | sub_1952F90 (LoopIndexSplit) |
| 59 | 1200* | BOOL | 0 | no-simplifycfg-b | SimplifyCFG variant B |
| 61 | 1240* | BOOL | 0 | do-licm (inverted) | sub_195E880 (LICM) |
| 63 | 1280* | BOOL | 0 | no-reassoc | sub_1B7FDF0 (Reassociate) |
| 65 | 1320* | BOOL | 0 | no-adce-a | sub_1C76260 (ADCE variant) |
| 67 | 1360* | BOOL | 0 | no-loopunroll | sub_19C1680 (LoopUnroll) |
| 69 | 1400* | BOOL | 0 | no-sroa | sub_1968390 (SROA) |
| 71 | 1440* | BOOL | 0 | no-earlycse | sub_196A2B0 (EarlyCSE) |
| 73 | 1480* | BOOL | 0 | no-adce-b | ADCE variant B |
| 75 | 1520* | BOOL | 0 | no-loopsimplify | sub_198DF00 (LoopSimplify) |
| 83 | 1680* | BOOL | 0 | | sub_19CE990 (NVIDIA pass) |
| 87 | 1760* | BOOL | 0 | do-ip-msp (inverted) | sub_1C8E680 (MemorySpaceOpt) |
| 91 | 1840* | BOOL | 0 | no-adce-c | sub_1C6FCA0 (ADCE) |
| 93 | 1880 | BOOL | 1 | | NVVMReduction param A |
| 95 | 1920 | BOOL | 1 | | NVVMReduction param B |
| 97 | 1960* | BOOL | 0 | no-constmerge | sub_184CD60 (ConstantMerge) |
| 99 | 2000* | BOOL | 0 | no-intrin-lower | sub_1CB4E40 (NVVMIntrinsicLowering) |
| 101 | 2040* | BOOL | 0 | no-memcpyopt | sub_1B26330 (MemCpyOpt) |
| 105 | 2120* | BOOL | 0 | no-branchdist-b | sub_1CB73C0 (NVVMBranchDist B) |
| 109 | 2200* | BOOL | 0 | no-generic2nvvm | sub_1A02540 (GenericToNVVM) |
| 113 | 2280* | BOOL | 0 | no-loweralloca-b | NVVMLowerAlloca B |
| 115 | 2320* | BOOL | 0 | do-remat (inverted) | sub_1A13320 (NVVMRemat) |
| 117 | 2360 | BOOL | 1 | | sub_1CC3990 (NVVMUnreachBlockElim) |
| 121 | 2440* | BOOL | 0 | no-sinking2 | sub_1CC60B0 (NVVMSinking2) |
| 127 | 2560* | BOOL | 0 | no-genericaddropt | sub_1CC71E0 (NVVMGenericAddrOpt) |
| 129 | 2600* | BOOL | 0 | no-irverify | sub_1A223D0 (NVVMIRVerification) |
| 131 | 2640* | BOOL | 0 | no-loopopt | sub_18B1DE0 (NVVMLoopOpt) |
| 133 | 2680* | BOOL | 0 | no-memspaceopt-b | MemorySpaceOpt in sub_12DE8F0 |
| 135 | 2720* | BOOL | 0 | no-instsimplify | sub_1A7A9F0 (InstructionSimplify) |
| 141 | 2840* | BOOL | 1 | | Enable ADCE (sub_1C6FCA0, reversed) |
| 143 | 2880* | BOOL | 1 | do-licm | Enable LICM (reversed logic) |
| 149 | 3000* | BOOL | 0 | | Extra DeadArgElim trigger |
| 151 | 3040 | BOOL | 1 | | Enable CorrelatedValuePropagation |
| 155 | 3120* | BOOL | 1 | | Address space optimization flag |
| 157 | 3160* | BOOL | 1 | dump-* master | Debug dump mode (PrintModulePass) |
| 159 | 3200* | BOOL | 1 | | Enable advanced NVIDIA passes group |
| 165 | 3328* | BOOL | 1 | | Enable SM-specific warp/reduction/sinking |
| 173 | 3488* | BOOL | 0 | | Enable barrier optimization |
| 175 | 3528* | BOOL | 0 | | Tier 1 optimization enable |
| 177 | 3568* | BOOL | 0 | | Tier 2 optimization enable |
| 179 | 3608* | BOOL | 0 | | Tier 3 optimization enable |
| 181 | 3648* | STR_PTR | "" | | Language string ("ptx"/"mid"/"idn") |
| 183 | 3704* | BOOL | 0 | | Late optimization / address-space mode |
| 193 | 3904* | BOOL | 0 | | Debug: verify after each plugin pass |
| 195 | 3944* | BOOL | 0 | | Debug: rename BBs to "F%d_B%d" |
| 197 | 3984 | INTEGER | 20 | | Limit/threshold (e.g., unroll count) |
| 203 | 4104 | INTEGER | -1 | | Sentinel: unlimited/auto |
| 205 | 4144 | INTEGER | -1 | | Sentinel: unlimited/auto |
| 207 | 4184 | INTEGER | -1 | | Sentinel: unlimited/auto |
| 209 | 4224* | BOOL | 0 | | Master optimization switch |
| 211 | 4264 | BOOL | 1 | | |
| 213 | 4304* | BOOL | 0 | | Device-code / separate-compilation |
| 215 | 4344 | INTEGER | 0 | | Disabled counter |
| 217 | 4384* | BOOL | 0 | | Fast-compile / bypass LLVM pipeline |
| 219 | 4424 | BOOL | 1 | | |
| 221 | 4464* | BOOL | 0 | | Disable late CFG cleanup variant B |

Slots not listed have no confirmed cross-reference to pipeline assembler guards. The full 221-slot table is in the NVVMPassOptions Reference.

Complete Option Name Inventory

The following option names were extracted from binary string references in .rodata. They are set via -opt "-name=value" on the cicc command line (requires NVVMCCWIZ=553282 in non-release builds).

Boolean toggles (do-X / no-X):

| Name | Effect |
|------|--------|
| do-ip-msp | Enable inter-procedural memory space propagation |
| do-licm | Enable LICM (loop-invariant code motion) |
| do-remat | Enable NVVMRematerialization |
| do-clone-for-ip-msp | Enable function cloning for IPMSP |
| do-cssa | Enable Conventional SSA construction |
| do-scev-cgp | Enable SCEV-based CodeGenPrepare |
| do-function-scev-cgp | Enable function-level SCEV-CGP |
| do-scev-cgp-aggresively | Aggressive SCEV-CGP mode [sic] |
| do-base-address-strength-reduce | Enable base address strength reduction |
| do-base-address-strength-reduce-chain | Enable chained base address SR |
| do-comdat-renaming | Enable COMDAT group renaming |
| do-counter-promotion | Enable counter promotion |
| do-lsr-64-bit | Enable 64-bit loop strength reduction |
| do-sign-ext-expand | Enable sign extension expansion |
| do-sign-ext-simplify | Enable sign extension simplification |

Parametric knobs:

| Name | Type | Default | Purpose |
|------|------|---------|---------|
| remat-for-occ | string | | Rematerialization occupancy target |
| remat-gep-cost | string | | GEP rematerialization cost |
| remat-ignore-single-cost | string | | Skip single-use cost analysis |
| remat-lli-factor | string | | Live-interval factor |
| remat-load-param | string | | Parameter load remat policy |
| remat-loop-trip | string | | Loop trip count for remat decisions |
| remat-max-live-limit | string | | Maximum live variable count |
| remat-maxreg-ceiling | string | | Register ceiling for remat |
| remat-move | string | | Rematerialization move policy |
| remat-single-cost-limit | string | | Single-value cost limit |
| remat-use-limit | string | | Use count limit for remat |
| branch-dist-block-limit | string | | Block count limit for branch distribution |
| branch-dist-func-limit | string | | Function-level branch dist limit |
| branch-dist-norm | string | | Normalization factor |
| scev-cgp-check-latency | string | | Latency check threshold |
| scev-cgp-control | string | | CGP control mode |
| scev-cgp-cross-block-limit | string | | Cross-block analysis limit |
| scev-cgp-idom-level-limit | string | | Immediate dominator depth limit |
| scev-cgp-inst-limit | string | | Instruction count limit |
| scev-cgp-norm | string | | Normalization factor |
| scev-cgp-old-base | string | | Legacy base address mode |
| scev-cgp-tid-max-value | string | | Thread ID maximum value |
| base-address-strength-reduce-iv-limit | string | | IV count limit for base addr SR |
| base-address-strength-reduce-max-iv | string | | Maximum IV for base addr SR |
| cssa-coalesce | string | | CSSA coalescing mode |
| cssa-verbosity | string | | CSSA debug verbosity |

Dump/debug flags:

| Name | Purpose |
|------|---------|
| dump-ip-msp | Dump IPMSP analysis results |
| dump-ir-before-memory-space-opt | Dump IR before MemorySpaceOpt |
| dump-ir-after-memory-space-opt | Dump IR after MemorySpaceOpt |
| dump-memory-space-warnings | Dump address space warnings |
| dump-remat | Dump rematerialization decisions |
| dump-remat-add | Dump remat additions |
| dump-remat-iv | Dump remat induction variables |
| dump-remat-load | Dump remat load decisions |
| dump-branch-dist | Dump branch distribution analysis |
| dump-scev-cgp | Dump SCEV-CGP analysis |
| dump-base-address-strength-reduce | Dump base address SR |
| dump-sink2 | Dump Sinking2 pass output |
| dump-before-cssa | Dump IR before CSSA |
| dump-phi-remove | Dump PHI node removal |
| dump-normalize-gep | Dump GEP normalization |
| dump-simplify-live-out | Dump live-out simplification |
| dump-process-restrict | Dump restrict processing |
| dump-process-builtin-assume | Dump builtin assume processing |
| dump-conv-dot | Dump convergence as DOT graph |
| dump-conv-func | Dump convergence per function |
| dump-conv-text | Dump convergence as text |
| dump-nvvmir | Dump NVVM IR |
| dump-va | Dump value analysis |

Tier-Based Pass Ordering

The Threshold Dispatch Mechanism

NVIDIA's tier system is a priority-driven scheduling mechanism that interleaves optimization sub-pipelines with external plugin passes. The master pipeline function sub_12E54A0 iterates over a pass registration array at a4[4488] (16-byte stride entries: [+0] vtable_ptr, [+8] phase_id). As it processes each entry, it checks whether the entry's phase_id exceeds a threshold. When it does, the corresponding tier sub-pipeline fires once:

// Pseudocode for the main loop in sub_12E54A0
for (entry = a4[4488]; entry < a4[4496]; entry += 16) {
    int phase_id = *(int*)(entry + 8);

    if (opt_enabled && phase_id > opt_threshold) {
        sub_12DE330(PM, opts);      // Tier 0: full optimization
        opt_enabled = 0;            // fire once
    }
    if (tier1_flag && phase_id > tier1_threshold) {
        sub_12DE8F0(PM, 1, opts);   // Tier 1
        tier1_flag = 0;
    }
    if (tier2_flag && phase_id > tier2_threshold) {
        sub_12DE8F0(PM, 2, opts);   // Tier 2
        tier2_flag = 0;
    }
    if (tier3_flag && phase_id > tier3_threshold) {
        sub_12DE8F0(PM, 3, opts);   // Tier 3
        tier3_flag = 0;
    }

    // Insert the plugin/external pass itself
    pass = vtable_call(entry, +72);  // entry->createPass()
    AddPass(PM, pass, 1, 0);
}

// Any tier that didn't fire during the loop fires now
if (opt_enabled)  sub_12DE330(PM, opts);
if (tier1_flag)   sub_12DE8F0(PM, 1, opts);
if (tier2_flag)   sub_12DE8F0(PM, 2, opts);
if (tier3_flag)   sub_12DE8F0(PM, 3, opts);

This design means tier placement is data-driven: the thresholds stored at config offsets 4224/4228 (Tier 0), 3528/3532 (Tier 1), 3568/3572 (Tier 2), and 3608/3612 (Tier 3) determine exactly where in the plugin pass sequence each tier's sub-pipeline gets inserted. Changing the threshold shifts an entire tier of ~40 passes to a different position relative to the external passes. After each tier fires, its flag is cleared so it cannot fire again.

Tier 0 Ordering Strategy

Tier 0 (sub_12DE330) is the most comprehensive sub-pipeline at ~40 passes. Its ordering reflects NVIDIA's optimization philosophy for GPU code:

Phase A -- Value Simplification (passes 1-8): BreakCriticalEdges normalizes the CFG, then the CGSCC inliner framework runs first to create optimization opportunities. NVVMReflect resolves __nvvm_reflect() calls to compile-time constants (GPU architecture queries), and SCCP propagates those constants. GVN and NewGVN/GVNHoist eliminate redundant computations.

Phase B -- NVIDIA-Specific Cleanup (passes 9-12): NVVMVerifier catches NVVM-specific IR errors early. NVVMPredicateOpt optimizes predicate expressions. ConstantMerge reduces module size.

Phase C -- Loop Transformations (passes 13-27): This is the core loop optimization sequence. Sink/MemSSA moves code out of hot paths. LoopIndexSplit divides loops at index boundaries. LICM hoists invariants. LoopUnroll with factor 3 expands small loops. LoopUnswitch moves conditionals out of loops. ADCE removes dead code exposed by loop transformations.

Phase D -- Register Pressure Management (passes 28-40): InstCombine and SROA simplify the IR further. NVVMRematerialization recomputes values to reduce register pressure -- critical for GPU occupancy. DSE and DCE clean up dead stores and code. The final CGSCC pass and FunctionAttrs prepare for per-function Phase II processing.

Tier 1/2/3 Incremental Additions -- sub_12DE8F0

  • Address: 0x12DE8F0
  • Size: 17,904 bytes
  • Signature: int64 sub_12DE8F0(int64 passMgr, int tier, int64 opts)

sub_12DE8F0 adds passes incrementally based on the tier value (1, 2, or 3). Its first action stores the tier into qword_4FBB410 (the tier tracker global), then checks qword_4FBB3B0 (phase counter) for phase-dependent behavior. Nearly every pass insertion is gated by a boolean in the NVVMPassOptions struct.

The full pass list for sub_12DE8F0 (all tiers combined, with tier-specific gates):

sub_1CB4E40(1) [!opts[2000]]            NVVMIntrinsicLowering (level=1)
sub_1A223D0()  [!opts[2600]]            NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]]            NVVMIntrinsicLowering (barrier=1)
sub_18E4A00()  [opts[3488]]             NVVMBarrierAnalysis
sub_1C98160(0) [opts[3488]]             NVVMLowerBarriers
sub_12D4560()  [!opts[600]]             NVVMVerifier
sub_185D600()  [opts[3200]&&!opts[920]] IPConstPropagation         [advanced group]
sub_1857160()  [opts[3200]&&!opts[880]] NVVMReflect                [advanced group]
sub_18A3430()  [opts[3200]&&!opts[1120]] NVVMPredicateOpt          [advanced group]
sub_1842BC0()  [opts[3200]&&!opts[720]] SCCP                       [advanced group]
sub_12D4560()  [!opts[600]]             NVVMVerifier
sub_18A3090()  [opts[3200]&&!opts[2160]] NVVMPredicateOpt variant  [advanced group]
sub_184CD60()  [opts[3200]&&!opts[1960]] ConstantMerge             [advanced group]
sub_190BB10(1,0)[tier!=1 && guards]     SimplifyCFG                [TIER 2/3 ONLY]
sub_1952F90(-1)[tier!=1 && guards]      LoopIndexSplit             [TIER 2/3 ONLY]
sub_12D4560()  [tier!=1 && !opts[600]]  NVVMVerifier               [TIER 2/3 ONLY]
sub_195E880(0) [opts[3704]&&opts[2880]] LICM
sub_1C8A4D0(v) [v=1 if opts[3704]]     EarlyCSE
sub_1869C50(1,0,1)[tier!=1&&!opts[1040]] Sink                     [TIER 2/3 ONLY]
sub_1833EB0(3) [tier==3 && !opts[320]]  TailCallElim              [TIER 3 ONLY]
sub_1CC3990()  [!opts[2360]]            NVVMUnreachableBlockElim
sub_18EEA90()  [opts[3040]]             CorrelatedValuePropagation
sub_12D4560()  [!opts[600]]             NVVMVerifier
sub_1A223D0()  [!opts[2600]]            NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]]            NVVMIntrinsicLowering
sub_1C4B6F0()  [!opts[440]&&!opts[480]] Inliner
sub_1A7A9F0()  [!opts[2720]]            InstructionSimplify
sub_12D4560()  [!opts[600]]             NVVMVerifier
sub_1A02540()  [!opts[2200]]            GenericToNVVM
sub_198DF00(-1)[!opts[1520]]            LoopSimplify
sub_1C76260()  [!opts[1320]&&!opts[1480]] ADCE
sub_195E880(0) [opts[2880]&&!opts[1240]] LICM
sub_1C98160(v) [opts[3488]]             NVVMLowerBarriers
sub_19C1680(0,1)[!opts[1360]]           LoopUnroll
sub_19401A0()  [!opts[1000]]            InstCombine
sub_196A2B0()  [!opts[1440]]            EarlyCSE
sub_1968390()  [!opts[1400]]            SROA
sub_19B73C0(t,...)[tier!=1]             LoopUnswitch (SM-dependent) [TIER 2/3 ONLY]
sub_1A62BF0(1,...)[!opts[600]]          LLVM standard pipeline #1
sub_1A223D0()  [!opts[2600]]            NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]]            NVVMIntrinsicLowering
sub_190BB10(0,0)[!opts[960]]            SimplifyCFG
sub_1922F90()  [opts[3080]]             NVIDIA-specific loop pass
sub_195E880(0) [opts[2880]&&!opts[1240]] LICM
sub_1A13320()  [!opts[2320]]            NVVMRematerialization
sub_1968390()  [!opts[1400]]            SROA
sub_18EEA90()  [opts[3040]]             CorrelatedValuePropagation
sub_18F5480()  [!opts[760]]             DSE
sub_18DEFF0()  [!opts[280]]             DCE
sub_1A62BF0(1,...)[!opts[600]]          LLVM standard pipeline #1
sub_1AAC510()  [!opts[520]&&!opts[560]] NVIDIA-specific pass
sub_1A223D0()  [!opts[2600]]            NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]]            NVVMIntrinsicLowering
sub_1C8E680()  [!opts[2680]]            MemorySpaceOpt (from opts[3120])
sub_1CC71E0()  [!opts[2560]]            NVVMGenericAddrOpt
sub_1C98270(1,v)[opts[3488]]            NVVMLowerBarriers variant
sub_1C6FCA0()  [opts[2840]&&!opts[1840]] ADCE
sub_18B1DE0()  [opts[3200]&&!opts[2640]] LoopOpt/BarrierOpt        [advanced group]
sub_1857160()  [opts[3200]&&tier==3]    NVVMReflect                [TIER 3 ONLY]
sub_1841180()  [opts[3200]&&!opts[680]] FunctionAttrs              [advanced group]
sub_1C46000()  [tier==3&&!opts[360]]    NVVMLateOpt                [TIER 3 ONLY]
sub_1841180()  [opts[3200]&&!opts[680]] FunctionAttrs (2nd call)   [advanced group]
sub_1CBC480()  [!opts[2240]&&!opts[2280]] NVVMLowerAlloca
sub_1CB73C0()  [!opts[2080]&&!opts[2120]] NVVMBranchDist
sub_1C7F370(1) [opts[3328]&&!opts[1640]] NVVMWarpShuffle           [SM-specific]
sub_1CC5E00()  [opts[3328]&&!opts[2400]] NVVMReduction             [SM-specific]
sub_1CC60B0()  [opts[3328]&&!opts[2440]] NVVMSinking2              [SM-specific]
sub_1CB73C0()  [opts[3328]&&guards]     BranchDist (2nd call)      [SM-specific]
sub_1B7FDF0(3) [opts[3328]&&!opts[1280]] Reassociate               [SM-specific]

Tier 1 (baseline) adds the passes above EXCEPT those gated by tier!=1: SimplifyCFG, LoopIndexSplit, Sink, and LoopUnswitch are all skipped. This is a conservative set focused on NVIDIA-specific cleanup without expensive LLVM optimization.

Tier 2 adds everything Tier 1 has plus the tier!=1-gated passes. The LoopUnswitch parameters are SM-architecture-dependent: sub_19B73C0 receives different vector widths based on the target subtarget.

Tier 3 adds TailCallElim (gated tier==3), NVVMReflect at a late position (gated tier==3), and NVVMLateOpt (gated tier==3). Critically, it also triggers feature flag escalation (see below).

Feature Flag Escalation

A notable pattern occurs only in Tier 3: if BYTE4(qword_4FBB370[2]) is zero (no advanced features enabled), the tier handler allocates a new integer with value 6 and stores it via sub_16D40E0. The value 6 (binary 110) enables two feature gates used by later passes: barrier optimization and memory-space optimization. This means Tier 3 (O3) automatically enables optimization features that lower tiers leave disabled, without requiring explicit CLI flags.

O-Level Pipeline Comparison

Pipeline Selection

The new-PM driver sub_226C400 selects pipeline name strings based on config flags:

byte[888]  set  →  "nvopt<O0>"
byte[928]  set  →  "nvopt<O1>"
byte[968]  set  →  "nvopt<O2>"
byte[1008] set  →  "nvopt<O3>"

These strings are passed to sub_2277440 (the new-PM text pipeline parser). The nvopt prefix is registered as a pipeline element in both sub_225D540 (new PM) and sub_12C35D0 (legacy PM), with vtables at 0x4A08350 and 0x49E6A58 respectively.

O0: No Optimization

O0 skips the full pipeline entirely. The code falls through to LABEL_159 which calls only sub_1C8A4D0(0) (NVVMFinalCleanup), then proceeds directly to finalization. No Tier 0/1/2/3 sub-pipelines fire. The result is ~5-8 passes total: TargetLibraryInfo, TargetTransformInfo, Verifier, AssumptionCache, ProfileSummary, NVVMFinalCleanup, and codegen setup.

O1/O2/O3: Full Pipeline with Tier Differentiation

All three levels call sub_12DE330 for the same ~40-pass Tier 0 sub-pipeline. The differences manifest through four mechanisms:

1. Tier sub-pipeline gating. sub_12DE8F0 is called with the tier number corresponding to the O-level. O1 gets tier=1 (conservative, skips several passes). O2 gets tier=2 (full set). O3 gets tier=3 (aggressive + feature flag escalation).

2. CGSCC iteration counts. The CGSCC pass manager wrapper sub_1A62BF0 takes an iteration count as its first argument. In the O1/O2/O3 base pipeline, it is called with 1 (single inliner pass). In the "mid" fast-compile path, it is called with 5 iterations. In the default path, it varies from 1 to 8 depending on pipeline position, allowing more aggressive devirtualization and inlining at higher optimization levels.

3. Loop unroll factor. sub_1833EB0 is called with factor 3 in the standard pipeline. Tier 3 adds an additional call to TailCallElim and more aggressive LoopUnswitch parameters (the sub_19B73C0 call receives SM-arch-dependent vector widths at Tier 2/3).

4. Vectorizer parameters. sub_19B73C0 receives different arguments based on tier:

  • Tier 0: (2, -1, -1, -1, -1, -1, -1) -- conservative vector width 2, all thresholds unlimited
  • "mid" path: (3, -1, -1, 0, 0, -1, 0) -- vector width 3, some thresholds zeroed (disabled)
  • Tier 2/3: Parameters vary by SM architecture via config struct lookups

Fast-Compile Levels vs O-Levels

| Pipeline | Entry Path | Passes | LSA | MemSpaceOpt | Key Difference |
|----------|-----------|--------|-----|-------------|----------------|
| nvopt&lt;O0&gt; | LABEL_159 | ~5-8 | off | off | No optimization |
| nvopt&lt;Ofcmax&gt; | LABEL_196 | ~12-15 | forced 0 | forced 0 | Sinking2(fast) + minimal canonicalization |
| nvopt&lt;Ofcmid&gt; | LABEL_297 | ~25-30 | normal | enabled | CGSCC(5), LoopVectorize(conservative) |
| nvopt&lt;Ofcmin&gt; | LABEL_297 | ~30-35 | normal | enabled | Like Ofcmid but more aggressive loop settings |
| nvopt&lt;O1&gt; | sub_12DE330 | ~35 | normal | enabled | Tier 1: conservative set |
| nvopt&lt;O2&gt; | sub_12DE330 | ~35+ | normal | enabled | Tier 2: full optimization set |
| nvopt&lt;O3&gt; | sub_12DE330 | ~35+ | normal | enabled | Tier 3: aggressive + feature escalation |

Ofcmax is architecturally distinct: it forces -lsa-opt=0 and -memory-space-opt=0 in the optimizer flags (confirmed in both sub_9624D0 line 1358 and sub_12CC750 line 2025). This means two of NVIDIA's most important proprietary passes -- LSA optimization and MemorySpaceOpt -- are unconditionally disabled regardless of what the user requests.

Pipeline Text Strings and nvopt<> Dispatch

The nvopt<> Naming Convention

NVIDIA replaces LLVM's standard default<O2> pipeline naming with a proprietary nvopt<> prefix. The new-PM driver sub_226C400 (35KB, at 0x226C400) selects one of exactly seven pipeline name strings based on optimization level and fast-compile flags. These strings are passed verbatim to sub_2277440 (60KB, at 0x2277440) -- NVIDIA's equivalent of LLVM's PassBuilder::buildDefaultPipeline().

nvopt<O0>       Optimization disabled. ~5-8 infrastructure passes only.
nvopt<O1>       Standard optimization, Tier 1 (conservative).
nvopt<O2>       Standard optimization, Tier 2 (full).
nvopt<O3>       Standard optimization, Tier 3 (aggressive + feature escalation).
nvopt<Ofcmax>   Fast-compile maximum speed. Forces -lsa-opt=0, -memory-space-opt=0.
nvopt<Ofcmid>   Fast-compile medium. MemorySpaceOpt enabled, CGSCC(5) iterations.
nvopt<Ofcmin>   Fast-compile minimum. Like Ofcmid but more aggressive loop settings.

Selection Algorithm (sub_226C400)

The config struct encodes O-level flags at fixed byte offsets. The fast-compile level string (if present) is at qwords 131/132 (offset 1048/1056), encoded as a 3-byte sequence compared via 2-byte word + 1-byte suffix:

// sub_226C400, lines 828-874 (pseudocode)
char* select_pipeline_name(Config* cfg) {
    if (cfg->byte[928])   return "nvopt<O1>";     // 9 chars
    if (cfg->byte[968])   return "nvopt<O2>";     // 9 chars
    if (cfg->byte[1008])  return "nvopt<O3>";     // 9 chars

    char* fc = cfg->qword[131];
    int fc_len = cfg->qword[132];
    if (fc_len == 3) {
        // Word comparison: *(uint16_t*)fc (little-endian), then byte fc[2]
        if (*(uint16_t*)fc == 24941 && fc[2] == 120)  // 24941 = LE "ma", 120 = 'x'
            return "nvopt<Ofcmax>";   // 14 chars
        if (*(uint16_t*)fc == 26989 && fc[2] == 100)  // 26989 = LE "mi", 100 = 'd'
            return "nvopt<Ofcmid>";   // 14 chars
        if (*(uint16_t*)fc == 26989 && fc[2] == 110)  // 26989 = LE "mi", 110 = 'n'
            return "nvopt<Ofcmin>";   // 14 chars
    }
    return "nvopt<O0>";              // 9 chars
}

The nvopt prefix is registered as a pipeline element in sub_225D540 (new PM, vtable 0x4A08350) and sub_12C35D0 (legacy PM, vtable 0x49E6A58). Both route into an nvopt pipeline builder class that creates a 512-byte pipeline object via sub_12EC960.

Mutual Exclusion

Combining -O# with --passes= or --foo-pass is an error:

Cannot specify -O#/-Ofast-compile=<min,mid,max> and --passes=/--foo-pass,
use -passes='default<O#>,other-pass' or -passes='default<Ofcmax>,other-pass'

Pipeline Text Parser (sub_2277440)

sub_2277440 (60KB) is the new-PM buildDefaultPipeline() equivalent. It tokenizes the pipeline name string via sub_2352D90, then dispatches to the appropriate pipeline builder based on the nvopt<> parameter. NVIDIA custom passes are injected via extension point callbacks at [PassBuilder+2208] (stride 32 bytes per entry, count at [PassBuilder+2216]). Each callback entry has a guard pointer at [+16] and a callback function at [+24].

Fast-Compile Level Encoding

In the libnvvm config struct, offset 1640 holds an integer encoding:

| Value | CLI Source | Pipeline Name | Notes |
|-------|-----------|---------------|-------|
| 0 | (no -Ofast-compile) | normal O-level | Default |
| 1 | -Ofast-compile=0 | reset to 0 | Treated as "off" |
| 2 | -Ofc=max | nvopt&lt;Ofcmax&gt; | Forces -lsa-opt=0, -memory-space-opt=0 |
| 3 | -Ofc=mid | nvopt&lt;Ofcmid&gt; | MemorySpaceOpt enabled |
| 4 | -Ofc=min | nvopt&lt;Ofcmin&gt; | Closest to full optimization |

Any other value produces: "libnvvm : error: -Ofast-compile called with unsupported level, only supports 0, min, mid, or max".

Pass Registration Architecture

Dual Pass Manager Support

cicc v13.0 maintains registrations for both the Legacy Pass Manager and the New Pass Manager simultaneously. This dual support is necessary during the LLVM Legacy-to-New PM migration. The Legacy PM path is taken when a4[4384] != 0 (the fast-compile/bypass flag), while the New PM path handles normal compilation.

Legacy PM registration occurs in pass constructor functions scattered throughout the binary. For example, MemorySpaceOpt registers as "memory-space-opt-pass" via sub_1C97F80. Each Legacy PM pass calls RegisterPass<> with a pass ID and description string.

New PM registration is centralized in sub_2342890 -- a single 2,816-line function that registers every analysis, pass, and printer. It calls sub_E41FB0(pm, class_name, len, pass_name, len) for each pass, inserting into a StringMap with open-addressing and linear probing.

New PM Registration Structure

sub_2342890 registers passes in a strict ordering by pipeline level:

| Section | Lines | Count | Content |
|---------|-------|-------|---------|
| Module analyses | 514-596 | ~18 | CallGraph, ProfileSummary, LazyCallGraph, etc. |
| Module passes | 599-1153 | ~95 | AlwaysInline, GlobalOpt, NVIDIA module passes |
| CGSCC analyses | 1155-1163 | ~5 | FunctionAnalysisManagerCGSCC, etc. |
| CGSCC passes | 1170-1206 | ~15 | Inliner, Attributor, ArgumentPromotion |
| Function analyses | 1208-1415 | ~65 | DominatorTree, LoopInfo, MemorySSA, rpa, merge-sets |
| Function passes | 1420-2319 | ~185 | SROA, GVN, LICM, all NVIDIA function passes |
| LoopNest passes | 2320-2339 | ~8 | LoopInterchange, LoopFlatten |
| Loop analyses | 2340-2362 | ~10 | LoopAccessAnalysis, IVUsers |
| Loop passes | 2367-2482 | ~40 | IndVarSimplify, LICM, LoopUnroll, loop-index-split |
| Machine analyses | 2483-2580 | ~30 | LiveIntervals, SlotIndexes |
| Machine passes | 2581-2815 | ~80 | ExpandPostRAPseudos, BranchFolding |

Parameterized Pass Parsing

When the pipeline text parser encounters a pass name with angle-bracket parameters (e.g., memory-space-opt<first-time;warnings>), a registered callback parses the parameter string. The parsing flow:

  1. sub_2337DE0 matches the pass name via a starts_with comparison
  2. sub_234CEE0 extracts the <...> parameter string
  3. The parameter-parsing callback (e.g., sub_23331A0 for MemorySpaceOpt) is invoked
  4. The parser splits on ; and matches each token against known parameter names
  5. A configured pass options struct is returned and used to construct the pass

For MemorySpaceOpt, the parameter parser (sub_23331A0) recognizes four tokens:

| Token | Length | Effect |
|-------|--------|--------|
| first-time | 10 | Sets first_time = true (default) |
| second-time | 11 | Sets first_time = false |
| warnings | 8 | Enables address-space warnings |
| no-warnings | 11 | Disables warnings |

Invalid parameters produce: "invalid MemorySpaceOpt pass parameter '{0}'".

Pass Serialization

Each parameterized NVIDIA pass also registers a serializer for pipeline text output (used by --print-pipeline-passes). The serializers write the pass class name followed by the current parameter state:

| Pass | Serializer | Output Format |
|------|-----------|---------------|
| MemorySpaceOpt | sub_2CE0440 | MemorySpaceOptPass]&lt;first-time;...&gt; |
| BranchDist | sub_2311040 | BranchDistPass] |
| Sinking2 | sub_2315E20 | llvm::Sinking2Pass] |
| Remat | sub_2311820 | RematerializationPass] |
| NVVMPeephole | sub_2314DA0 | NVVMPeepholeOptimizerPass] |
| LoopIndexSplit | sub_2312380 | LoopIndexSplitPass] |

Pipeline Construction Flow

The AddPass Mechanism -- sub_12DE0B0

  • Address: 0x12DE0B0
  • Size: 3,458 bytes
  • Signature: int64 sub_12DE0B0(int64 passMgr, int64 passObj, uint8 flags, char barrier)
  • Call count: ~137 direct calls from sub_12E54A0, ~40 from sub_12DE330, ~50+ per tier

sub_12DE0B0 is the sole entry point for adding passes to the pipeline. Every pass factory call in the entire pipeline assembler funnels through this function. It performs three operations atomically: hash-table insertion for O(1) lookup, flag encoding for the pass scheduler, and append to the ordered pass array.

// Detailed pseudocode for sub_12DE0B0
int64 AddPass(PassManager* PM, Pass* pass, uint8_t flags, char barrier) {
    // --- Step 1: Hash the pass pointer ---
    // Uses a custom shift-XOR hash, NOT a standard hash function.
    // The two shifts (9 and 4) spread pointer bits across the table.
    uint64_t hash = ((uint64_t)pass >> 9) ^ ((uint64_t)pass >> 4);

    // --- Step 2: Open-addressing insert into hash table at PM+80 ---
    // The hash table is a flat array of 16-byte entries at PM+80:
    //   [+0] uint64 pass_pointer (0 = empty slot)
    //   [+8] uint8  combined_flags
    // Table capacity is stored at PM+72 (initial: derived from 0x800000000 mask).
    // Collision resolution: linear probing with step 1.
    uint8_t combined = flags | (barrier ? 2 : 0);
    //   Bit 0 (0x01): 1 = FunctionPass, 0 = ModulePass/AnalysisPass
    //   Bit 1 (0x02): 1 = barrier (scheduling fence)
    //   Remaining bits: reserved

    size_t capacity = PM->ht_capacity;       // at PM+72
    size_t idx = hash & (capacity - 1);      // power-of-2 masking
    Entry* table = (Entry*)(PM + 80);

    while (table[idx].pass != 0) {
        if (table[idx].pass == pass) {
            // Pass already inserted -- update flags only
            table[idx].flags = combined;
            return 0;  // dedup: no second insertion
        }
        idx = (idx + 1) & (capacity - 1);   // linear probe
    }
    table[idx].pass = pass;
    table[idx].flags = combined;

    // --- Step 3: Append to ordered pass array at PM[0] ---
    // PM[0] = pointer to dynamic array of 8-byte pass pointers
    // PM[1] = count of passes (PM+8)
    // Growth: geometric reallocation (not shown here)
    uint64_t* array = (uint64_t*)PM->passes; // PM[0]
    array[PM->count] = (uint64_t)pass;
    PM->count++;                              // PM+8

    return 0;
}

The flags parameter encodes the pass type: 0 for module/analysis passes, 1 for function passes. The barrier parameter (bit 1) is a scheduling fence that tells the pass manager all preceding passes must complete before this pass runs -- used for passes that require the module in a globally consistent state (e.g., after whole-module inlining).

The hash table serves two purposes: (a) deduplication -- if the same pass factory is called twice (which happens for NVVMReflect, NVVMIntrinsicLowering, etc.), the second call updates flags rather than inserting a duplicate; and (b) O(1) flag lookup during the codegen dispatch phase (sub_12DFE00), where each pass's type and barrier status must be queried efficiently.

The pass manager container is initialized at line 390 of sub_12E54A0 with inline storage: v270 = v272 (stack buffer), v271 = 0x800000000 (capacity/flags encoding with 33-bit sentinel).

Complete 8-Phase Construction Algorithm

The full pipeline construction in sub_12E54A0 proceeds through eight phases. The pseudocode below is reconstructed from the decompiled 49.8KB function at lines 300-757 of the decompilation output. All a4 offsets refer to the CompilerOptions struct (parameter 4, ~4500 bytes).

Phase 0: Infrastructure (lines 396-420, always runs)

// Phase 0: Analysis infrastructure required by all subsequent passes
#01  TLI = sub_149CCE0(malloc(368), sub_14A04B0(triple));
     AddPass(PM, TLI, 0, 0);     // TargetLibraryInfoWrapperPass [Module]

#02  TTI = sub_1BFB520(malloc(208), sub_1BFB9A0(dataLayout));
     AddPass(PM, TTI, 1, 0);     // TargetTransformInfoWrapperPass [Function]

#03  verifier = sub_14A7550();
     AddPass(PM, verifier, 0, 0); // VerifierPass / BasicAliasAnalysis [Module]

#04  assumptions = sub_1361950();
     AddPass(PM, assumptions, 0, 0); // AssumptionCacheTracker [Module]

#05  profile = sub_1CB0F50();
     AddPass(PM, profile, 1, 0); // ProfileSummaryInfoWrapperPass [Function]

This initialization ordering has no upstream-LLVM equivalent: NVIDIA always adds these five analysis passes first, regardless of optimization level, language, or fast-compile mode.

Phase 1: Language Dispatch (lines 421-488)

Phase 1 reads the language string at a4[3648] (pointer) with length at a4[3656]. Three language paths exist; each produces a fundamentally different pass sequence. See the Language Path Differences section below for the complete per-path pass lists.

// Phase 1: Language-based pipeline branching
char* lang = *(char**)(a4 + 3648);
int lang_len = *(int*)(a4 + 3656);

bool opt_enabled = *(bool*)(a4 + 4224);
bool fc_max = false, fc_mid = false;
int v238 = *(int*)(a4 + 4304);  // device-code / additional-opt flag

if (lang_len == 3) {
    uint16_t w = *(uint16_t*)lang;
    if (w == 0x7470 && lang[2] == 0x78) {        // "ptx"
        goto PATH_A_PTX;
    }
    if (w == 0x696D && lang[2] == 0x64) {         // "mid"
        goto PATH_B_MID;
    }
    // "idn" (w == 0x696D && lang[2] == 0x6E) shares the default path
}
// Fall through to PATH_C_DEFAULT

// Fast-compile dispatch (within the language check):
// fc="max" AND !v238 → v244=1, v238=1, goto LABEL_191 (minimal + O0)
// fc="max" AND v238  → goto LABEL_196 → LABEL_188 (Sinking2 + common)
// fc="mid"           → goto LABEL_297 (mid pipeline)
// fc="min"           → goto LABEL_297 (min pipeline, differs via v238)
// no fc, no O-level  → LABEL_159 (O0 minimal pipeline)
// O-level set        → LABEL_38 → LABEL_39 (process pass list + tiers)
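The three-byte language tags are matched as a little-endian 16-bit load of the first two characters plus a separate compare of the third byte. The constants can be checked directly (Python sketch; note that "idn" encodes as w == 0x6469 with lang[2] == 0x6E):

```python
import struct

def tag_parts(s):
    # Model the decompiled check: a little-endian uint16 load of the
    # first two bytes, plus the third byte compared separately.
    b = s.encode()
    (w,) = struct.unpack("<H", b[:2])
    return w, b[2]

assert tag_parts("ptx") == (0x7470, 0x78)
assert tag_parts("mid") == (0x696D, 0x64)
assert tag_parts("idn") == (0x6469, 0x6E)
```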

Phase 2: Pre-Optimization (lines 442-480)

Phase 2 runs only when optimization is not completely skipped. Each pass is gated by a per-pass disable flag in the NVVMPassOptions struct.

// Phase 2: Early passes before the main optimization loop
if (!a4[1960] || a4[3000])                           // not disabled OR extra trigger
    AddPass(PM, sub_1857160(), 1, 0);                // NVVMReflect

if (a4[3000])                                        // extra DeadArgElim trigger
    AddPass(PM, sub_18FD350(0), 1, 0);              // DeadArgElimination

if (!a4[1680])                                       // NVIDIA pass not disabled
    AddPass(PM, sub_19CE990(), 1, 0);               // LoopStrengthReduce (NVIDIA)

AddPass(PM, sub_1CB4E40(0), 1, 0);                  // NVVMIntrinsicLowering(level=0)

if (!a4[2040])
    AddPass(PM, sub_1B26330(), 1, 0);                // MemCpyOpt

AddPass(PM, sub_12D4560(), 1, 0);                    // NVVMVerifier

if (!a4[1960])
    AddPass(PM, sub_184CD60(), 1, 0);                // ConstantMerge

if (!a4[440] && !a4[400])
    AddPass(PM, sub_1C4B6F0(), 1, 0);               // AlwaysInliner

if (a4[3160])                                        // debug dump enabled
    AddPass(PM, sub_17060B0(1, 0), 1, 0);           // PrintModulePass

Phase 3: Main Optimization Loop (lines 481-553)

The tier-threshold-driven loop iterates over the plugin/external pass array at a4[4488]. Each entry is 16 bytes (vtable pointer + phase_id). When a threshold is crossed, the corresponding tier sub-pipeline fires once and never again.

// Phase 3: Tier dispatch within the main plugin pass loop
uint64_t* entry = *(uint64_t**)(a4 + 4488);
uint64_t* end   = *(uint64_t**)(a4 + 4496);

while (entry < end) {
    int phase_id = *(int*)((char*)entry + 8);

    // Tier 0: full optimization sub-pipeline
    if (*(bool*)(a4+4224) && phase_id > *(int*)(a4+4228)) {
        sub_12DE330(PM, opts);        // ~40 passes
        *(bool*)(a4+4224) = false;    // fire once
    }
    // Tier 1: conservative
    if (*(bool*)(a4+3528) && phase_id > *(int*)(a4+3532)) {
        sub_12DE8F0(PM, 1, opts);
        *(bool*)(a4+3528) = false;
    }
    // Tier 2: full
    if (*(bool*)(a4+3568) && phase_id > *(int*)(a4+3572)) {
        sub_12DE8F0(PM, 2, opts);
        *(bool*)(a4+3568) = false;
    }
    // Tier 3: aggressive
    if (*(bool*)(a4+3608) && phase_id > *(int*)(a4+3612)) {
        sub_12DE8F0(PM, 3, opts);
        *(bool*)(a4+3608) = false;
    }

    // Insert the plugin/external pass itself
    Pass* plugin = vtable_call(entry, +72);  // entry->createPass()
    AddPass(PM, plugin, 1, 0);

    // Optional debug verification after each plugin pass
    if (a4[3904]) {
        sub_12D3E60();  // insert verification/print pass
        sub_16E8CB0();
        sub_15E9F00();
    }

    entry = (uint64_t*)((char*)entry + 16);  // next entry (16-byte stride)
}

// Any tier that didn't fire during the loop fires unconditionally now
if (*(bool*)(a4+4224))  sub_12DE330(PM, opts);
if (*(bool*)(a4+3528))  sub_12DE8F0(PM, 1, opts);
if (*(bool*)(a4+3568))  sub_12DE8F0(PM, 2, opts);
if (*(bool*)(a4+3608))  sub_12DE8F0(PM, 3, opts);

Phase 4: Post-Optimization Language Paths (lines 580-1371)

After the main loop, language-specific post-optimization runs. This is where the three paths diverge most significantly. Each path ends by falling through to LABEL_84 (Phase 5). See Language Path Differences for complete pass lists per path.

Phase 5: Finalization (LABEL_84, lines 640-653)

Always runs after the language-specific optimization path completes.

// Phase 5: Finalization -- barriers, cleanup, codegen
if (!v244 && a4[3488])                              // barrier optimization enabled
    AddPass(PM, sub_1C98160(a4[2920]!=0), 1, 0);   // NVVMLowerBarriers

AddPass(PM, sub_1CEBD10(), 1, 0);                   // NVVMFinalLowering (cleanup)

if (!a4[2800] && !a4[4464])                         // late CFG cleanup not disabled
    AddPass(PM, sub_1654860(1), 1, 0);              // BreakCriticalEdges

sub_12DFE00(PM, subtargetInfo, a4);                  // Codegen dispatch (see below)

Phase 6: Phase II Codegen Check (lines 654-693)

Reads the compilation phase counter and conditionally enters a special codegen extension block for multi-phase compilation.

// Phase 6: Phase-II-specific codegen extensions
int phase = sub_16D40F0(qword_4FBB3B0);  // read cl::opt<int> phase counter
if (phase == 2 && (*(int*)(a4 + 4480) & 4)) {
    // Enter special Phase II codegen block
    // Calls vtable at v245+56 (TargetMachine::addPhaseIIPasses)
    // Passes SubtargetInfo (v253) and CodeGenOpt config (v262)
    target_machine->addPhaseIIPasses(subtarget, codegen_config);
}

Phase 7: Pipeline Execution (lines 694-698)

// Phase 7: Run the assembled pipeline
sub_160FB70(PM, *output, output[1]);   // PassManager::run(Module, outputs)
sub_1619BD0(PM, module);               // PassManager::finalize(Module)
free(v274);                            // cleanup allocations
sub_160FE50(PM);                       // PassManager::destroy()

Phase 8: Basic Block Naming (lines 700-757)

Phase 8 runs only when a4[3944] (debug/naming mode) is set. It produces deterministic block names for debugging.

// Phase 8: Debug block naming for IR dump readability
if (a4[3944]) {
    int funcIdx = 0;
    for (Function* F = module->functions; F; F = F->next) {
        if (sub_15E4F60(F))  continue;  // skip declarations
        funcIdx++;
        int blockIdx = 0;
        for (BasicBlock* BB = F->blocks; BB; BB = BB->next) {
            blockIdx++;
            char name[32];
            sprintf(name, "F%d_B%d", funcIdx, blockIdx);
            sub_164B780(BB, &name);  // BB->setName()
        }
    }
}
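The naming scheme can be modeled as a runnable sketch (Python; the module structure is flattened to tuples, purely illustrative):

```python
def name_blocks(functions):
    # functions: list of (is_declaration, block_count) in module order.
    # Returns the deterministic names Phase 8 would assign: "F%d_B%d",
    # 1-based indices, declarations skipped without consuming an index.
    names, func_idx = [], 0
    for is_decl, n_blocks in functions:
        if is_decl:
            continue  # mirror the sub_15E4F60(F) declaration skip
        func_idx += 1
        for block_idx in range(1, n_blocks + 1):
            names.append("F%d_B%d" % (func_idx, block_idx))
    return names

assert name_blocks([(False, 2), (True, 0), (False, 1)]) == \
    ["F1_B1", "F1_B2", "F2_B1"]
```

The names depend only on module order, so two compilations of the same module produce identical IR dumps.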

Language Path Differences

The three language paths in Phase 1/4 represent fundamentally different IR maturity levels. The a4[3648] string pointer determines which path is taken, with length at a4[3656].

Path A: "ptx" -- Light Pipeline (~15 passes)

PTX text input has already been lowered by an earlier compilation stage. This path applies only light cleanup and canonicalization:

sub_1CEF8F0()               NVVMPeephole
sub_215D9D0()               NVVMAnnotationsProcessor
sub_1857160()  [!a4[880]]   NVVMReflect
sub_1A62BF0(1,0,0,1,0,0,1)  LLVM standard pipeline #1
sub_1B26330()  [!a4[2040]]  MemCpyOpt
sub_17060B0(0,0)            PrintModulePass (debug)
sub_18DEFF0()  [!a4[280]]   DCE
sub_1A62BF0(1,0,0,1,0,0,1)  LLVM standard pipeline #1 (repeat)
sub_18B1DE0()  [!a4[2640]]  LoopPass / BarrierOpt
sub_1C8E680(0) [!a4[1760]]  MemorySpaceOptimization
 --> LABEL_84 (finalization)

Key difference: no SROA, no GVN, no loop transformations, no CGSCC inlining. The PTX path trusts that the earlier compilation already optimized the code.

Path B: "mid" -- Full Optimization (~45 passes)

The primary path for standard CUDA compilation. The IR comes from the EDG frontend through IR generation and is at "mid-level" maturity (high-level constructs lowered, but not yet optimized).

sub_184CD60()  [!a4[1960]]    ConstantMerge
sub_1CB4E40(0) [!a4[2000]]    NVVMIntrinsicLowering (1st of 4)
sub_1B26330()  [!a4[2040]]    MemCpyOpt
sub_198E2A0()                  SROA
sub_1CEF8F0()                  NVVMPeephole
sub_215D9D0()                  NVVMAnnotationsProcessor
sub_198DF00(-1)[!a4[1520]]     LoopSimplify
sub_1C6E800()                  GVN
sub_1A223D0()  [!a4[2600]]    NVVMIRVerification (1st of 5+)
sub_190BB10(0,0)               SimplifyCFG
sub_1832270(1)                 InstructionCombining
sub_1A62BF0(5,0,0,1,0,0,1)    CGSCC pipeline (5 iterations)
sub_1CB4E40(0) [!a4[2000]]    NVVMIntrinsicLowering (2nd)
sub_18FD350(0)                 DeadArgElim
sub_1841180()  [!a4[680]]     FunctionAttrs
sub_18DEFF0()  [!a4[280]]     DCE
sub_184CD60()  [!a4[1960]]    ConstantMerge
sub_195E880(0) [!a4[1240]]    LICM
sub_1C98160(0)                 NVVMLowerBarriers
sub_1C8E680(0) [!a4[1760]]    MemorySpaceOpt (1st invocation)
sub_1B7FDF0(3) [!a4[1280]]    Reassociate
sub_1A62BF0(8,0,0,1,1,0,1)    CGSCC pipeline (8 iterations)
sub_1857160()  [!a4[880]]     NVVMReflect (2nd of 3)
sub_1C6FCA0()  [!a4[1840]]    ADCE
sub_1A7A9F0()  [!a4[2720]]    InstructionSimplify
sub_18FD350(0)                 DeadArgElim
sub_1833EB0(3) [!a4[320]]     TailCallElim
sub_18FD350(0)                 DeadArgElim
sub_18EEA90()                  CorrelatedValuePropagation
sub_1869C50(1,0,1)             Sink (MemorySSA-based)
sub_190BB10(0,0)[!a4[960]]     SimplifyCFG
sub_18F5480()  [!a4[760]]     DSE
sub_1CC60B0()  [!a4[2440]]    NVVMSinking2
sub_1A223D0()  [!a4[2600]]    NVVMIRVerification
sub_1C8A4D0(0)                 EarlyCSE
sub_1857160()  [!a4[880]]     NVVMReflect (3rd)
sub_1A62BF0(8,0,0,1,1,0,1)    CGSCC pipeline (8 iterations)
sub_1CB4E40(0) [!a4[2000]]    NVVMIntrinsicLowering (3rd)
sub_185D600()  [!a4[920]]     IPConstPropagation
sub_195E880(0) [!a4[1240]]    LICM
sub_1CB4E40(0) [!a4[2000]]    NVVMIntrinsicLowering (4th)
sub_1CB73C0()  [!a4[2120]]    NVVMBranchDist
sub_1A13320()  [!a4[2320]]    NVVMRematerialization
 --> LABEL_84 (finalization)

Key pattern: NVVMIntrinsicLowering runs 4 times, NVVMReflect runs 3 times, NVVMIRVerification runs 5+ times. The CGSCC pipeline is called with 5 and 8 iteration counts (aggressive devirtualization).

Path C: Default -- General Pipeline (~40 passes)

Used for bitcode from external sources (not marked as "ptx" or "mid"). Balances optimization breadth with conservative assumptions about IR maturity.

sub_1A62BF0(4,0,0,1,0,0,1)    LLVM standard pipeline #4
sub_1857160()  [!a4[880]]     NVVMReflect (1st)
sub_1CB4E40(0) [!a4[2000]]    NVVMIntrinsicLowering
sub_1857160()  [!a4[880]]     NVVMReflect (2nd)
sub_1CEF8F0()                  NVVMPeephole
sub_215D9D0()                  NVVMAnnotationsProcessor
sub_1A7A9F0()  [!a4[2720]]    InstructionSimplify
sub_1A62BF0(5,0,0,1,0,0,1)    LLVM standard pipeline #5
sub_185D600()  [!a4[920]]     IPConstPropagation
sub_1B26330()  [!a4[2040]]    MemCpyOpt
sub_184CD60()  [!a4[1960]]    ConstantMerge
sub_1A13320()  [!a4[2320]]    NVVMRematerialization
sub_1833EB0(3) [!a4[320]]     TailCallElim
sub_1C6E800()                  GVN
sub_1842BC0()  [!a4[720]]     SCCP
sub_18DEFF0()  [!a4[280]]     DCE
sub_184CD60()  [!a4[1960]]    ConstantMerge
sub_18FD350(0)                 DeadArgElim
sub_18EEA90()                  CorrelatedValuePropagation
sub_1A62BF0(1,0,0,1,0,0,1)    LLVM standard pipeline #1
sub_197E720()                  LoopUnroll
sub_19401A0()  [!a4[1000]]    InstCombine
sub_1857160()  [!a4[880]]     NVVMReflect (3rd)
sub_1A62BF0(7,0,0,1,0,0,1)    LLVM standard pipeline #7
sub_1C8A4D0(0)                 EarlyCSE
sub_1A223D0()  [!a4[2600]]    NVVMIRVerification
sub_1832270(1)                 InstructionCombining
sub_1869C50(1,0,1)             Sink
sub_1A68E70()                  LoopIdiomRecognize
sub_198DF00(-1)[!a4[1520]]     LoopSimplify
sub_195E880(0) [!a4[1240]]    LICM
sub_190BB10(0,0)[!a4[960]]     SimplifyCFG
sub_19B73C0(3,-1,-1,0,0,-1,0)  LoopUnswitch
sub_1A223D0()  [!a4[2600]]    NVVMIRVerification
sub_1C98160(0)                 NVVMLowerBarriers
sub_1C8E680(0) [!a4[1760]]    MemorySpaceOpt
sub_1B7FDF0(3) [!a4[1280]]    Reassociate
sub_18B1DE0()  [!a4[2640]]    LoopPass
sub_1952F90(-1)[!a4[1160]]     LoopIndexSplit
sub_18FD350(0)                 DeadArgElim
sub_1CC60B0()  [!a4[2440]]    NVVMSinking2
sub_1A62BF0(2,0,0,1,0,0,1)    LLVM standard pipeline #2
sub_1A223D0()  [!a4[2600]]    NVVMIRVerification
sub_18A3430()  [!a4[1120]]    NVVMPredicateOpt
sub_1A62BF0(4,0,0,1,1,0,1)    LLVM standard pipeline #4 (inlining)
 --> LABEL_84 (finalization)

Key difference from "mid": default path uses LLVM standard pipeline wrappers (IDs 1,2,4,5,7) more heavily, runs SCCP explicitly, includes LoopIdiomRecognize, and uses a conservative LoopUnswitch with zeroed thresholds (3,-1,-1,0,0,-1,0).

Codegen Dispatch -- sub_12DFE00

| Field | Value |
|---|---|
| Address | 0x12DFE00 |
| Size | 20,729 bytes |
| Signature | int64 sub_12DFE00(int64 passMgr, int64 subtargetInfo, int64 opts) |
| Called from | Phase 5 of sub_12E54A0 (LABEL_84, line 640) |

The codegen dispatch does not simply append passes to the pipeline. It performs a full dependency analysis over every pass already inserted, constructs an ordering graph, and then emits codegen passes in topologically-sorted order. This is necessary because machine-level passes (register allocation, instruction scheduling, frame lowering) have strict ordering dependencies that the flat AddPass model cannot express.

// Pseudocode for sub_12DFE00 (codegen dispatch with dependency analysis)
void CodegenDispatch(PassManager* PM, SubtargetInfo* STI, CompilerOpts* opts) {
    // Step 1: Read optimization level to determine analysis depth
    int opt_level = *(int*)(opts + 200);  // opts[200] = optimization level
    bool do_deps = (opt_level > 1);       // dependency tracking for O2+

    // Step 2: Classify existing passes
    // Iterates PM->passes[0..PM->count], calling two vtable methods per pass
    HashTable dep_graph;   // secondary hash table for dependencies (v134..v137)
    init_hashtable(&dep_graph);

    for (int i = 0; i < PM->count; i++) {
        Pass* p = PM->passes[i];

        // 2a. Check if pass is codegen-only (vtable+112)
        bool is_codegen = p->vtable->isCodeGenOnly(p);   // vtable offset +112
        if (is_codegen)
            continue;  // already classified, skip

        // 2b. Check registration status
        int status = sub_163A1D0(p);   // pass registry check
        sub_163A340(p, &status);       // update status

        // 2c. If pass needs codegen support, mark it in the hash table
        if (pass_needs_codegen(p)) {
            // Set flag |= 2 in the AddPass hash table entry
            // This marks the pass as "codegen-interacting"
            Entry* e = hashtable_find(PM + 80, p);
            if (e) e->flags |= 2;
        }

        // 2d. Build dependency edges (getAnalysisUsage)
        if (do_deps) {
            AnalysisUsage AU;
            p->vtable->getAnalysisUsage(p, &AU);   // vtable offset +16

            // For each required analysis, create an ordering edge
            // in the dependency hash table
            for (AnalysisID* req = AU.required; req; req = req->next) {
                dep_graph_add_edge(&dep_graph, p, req->pass);
            }
        }
    }

    // Step 3: Emit codegen passes in dependency-respecting order
    // Calls the SubtargetInfo hook to get the ordered codegen pass list
    // vtable+16 at STI -> STI->emitCodeGenPasses(PM, dep_graph)
    STI->vtable->emitCodeGenPasses(STI, PM, &dep_graph);
    // Each emitted pass gets a flag:
    //   0 = normal pass (no special ordering)
    //   1 = pass with codegen requirement (flag bit 0 from AddPass)
}

The dependency graph construction is what makes this function 20KB: it must handle the full LLVM analysis dependency model, including transitive dependencies and analysis preservation. The getAnalysisUsage calls return Required, RequiredTransitive, and Preserved sets that define the ordering constraints between passes.

For O0 compilation (opt_level == 0), the dependency tracking is skipped entirely -- codegen passes are emitted in a fixed default order since no optimization passes exist that could create ordering conflicts.
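The dependency-respecting emission order can be sketched with a standard topological sort (Kahn's algorithm). This is an illustrative reconstruction, not the recovered code, and the pass names below are placeholders:

```python
from collections import deque

def emit_order(passes, requires):
    # Kahn's algorithm: emit each pass only after every analysis it
    # requires has been emitted. 'requires' maps pass -> set of passes.
    indeg = {p: 0 for p in passes}
    dependents = {p: [] for p in passes}
    for p, reqs in requires.items():
        for r in reqs:
            indeg[p] += 1
            dependents[r].append(p)
    ready = deque(p for p in passes if indeg[p] == 0)
    order = []
    while ready:
        p = ready.popleft()
        order.append(p)
        for d in dependents[p]:
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    if len(order) != len(passes):
        raise ValueError("cycle in analysis dependencies")
    return order

order = emit_order(
    ["RegAlloc", "LiveIntervals", "SlotIndexes"],
    {"RegAlloc": {"LiveIntervals"}, "LiveIntervals": {"SlotIndexes"}},
)
assert order.index("SlotIndexes") < order.index("LiveIntervals") < order.index("RegAlloc")
```

The real function also has to account for RequiredTransitive and Preserved sets, which is where much of its 20KB goes.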

Pass Iteration and Convergence

CGSCC Fixed-Point Iteration

The CGSCC (Call Graph Strongly Connected Component) pass manager sub_1A62BF0 wraps a standard LLVM InlinerWrapper with a configurable iteration count. The first parameter controls how many times the CGSCC pipeline iterates over the call graph:

| Pipeline Position | Iteration Count | Context |
|---|---|---|
| O1/O2/O3 base (sub_12DE330) | 1 | Standard inlining: one pass over the call graph |
| "mid" path (Ofcmid/Ofcmin) | 5 | Aggressive: 5 iterations to resolve indirect calls |
| Default path (general IR) | 1, 2, 4, 5, 7, or 8 | Varies by position in pipeline |

Higher iteration counts allow the CGSCC framework to resolve more indirect calls through devirtualization. After each iteration, newly-inlined code may expose new call targets, which the next iteration can inline. The diminishing returns typically plateau after 3-5 iterations, which explains NVIDIA's choice of 5 for the "mid" fast-compile path (balancing compile time against code quality).
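The diminishing-returns dynamic can be shown with a toy model (Python; the per-depth numbers are invented for illustration, not measured from cicc):

```python
def iterate_cgscc(exposed_per_round, iterations):
    # Toy model of bounded CGSCC iteration: round k can only resolve
    # the indirect calls exposed by round k-1's inlining, so a fixed
    # iteration budget bounds how deep resolution can reach.
    resolved = 0
    for k in range(min(iterations, len(exposed_per_round))):
        resolved += exposed_per_round[k]
    return resolved

# Hypothetical calls exposed at successive inlining depths:
depths = [10, 4, 2, 1, 1, 0]
assert iterate_cgscc(depths, 1) == 10          # one pass: top level only
assert iterate_cgscc(depths, 5) == 18          # five passes: near plateau
assert iterate_cgscc(depths, 8) == sum(depths) # extra rounds add nothing
```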

NVVMReflect Multi-Run Pattern

NVVMReflect (sub_1857160) runs multiple times in the pipeline because NVVM IR may contain __nvvm_reflect("__CUDA_ARCH") calls at different nesting depths. The first run resolves top-level reflect calls to constants. Subsequent optimization passes (inlining, constant propagation, loop unrolling) may expose new reflect calls that were hidden inside inlined functions or unrolled loop bodies. Running NVVMReflect again after these transformations catches these newly-exposed calls.

In the "mid" path, NVVMReflect appears at three distinct positions:

  1. Early (before GVN) -- resolves top-level architecture queries
  2. Mid (after CGSCC inlining and DeadArgElim) -- catches reflect calls exposed by inlining
  3. Late (after LoopSimplify and second CGSCC) -- catches reflect calls exposed by loop transformations

NVVMIntrinsicLowering Repetition

Similarly, NVVMIntrinsicLowering (sub_1CB4E40) runs 4 times in the "mid" path. Each invocation lowers a different subset of NVVM intrinsics based on what the preceding optimization passes have simplified. The pass takes a level parameter (0 or 1) that controls which lowering rules are active. Level 0 handles basic intrinsic lowering; level 1 handles barrier-related lowering that only becomes safe after certain control flow transformations.

NVVMIRVerification as a Convergence Check

NVVMIRVerification (sub_1A223D0) runs after every major transformation group -- not for optimization, but as a correctness invariant check. In the "mid" path it appears at 5+ positions. In the tier 1/2/3 sub-pipeline it appears 4 times (after NVVMIntrinsicLowering, after barrier lowering, after GenericToNVVM, and after the late optimization sequence). If any transformation violates NVVM IR constraints (invalid address space usage, malformed intrinsic signatures, broken metadata), this pass reports the error immediately rather than allowing it to propagate to codegen where diagnosis would be much harder.

The Repeat-Until-Clean Philosophy

NVIDIA's pipeline does not use explicit fixed-point loops (run passes until IR stops changing). Instead, it achieves convergence through strategic repetition: the same pass appears at multiple carefully-chosen pipeline positions, with different optimization passes running between repetitions. This is more predictable than a true fixed-point approach because compilation time is bounded by the static pipeline length rather than by how many iterations are needed for convergence. The tradeoff is that the pipeline may not reach a true fixed point -- some optimization opportunities exposed by late passes may not be caught -- but in practice, the multi-position placement catches the vast majority of cases.
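The tradeoff can be made concrete with a toy "IR" (Python sketch; an integer stands in for a module, and a halving function stands in for a simplification pass):

```python
def run_static(pipeline, ir):
    # Static repetition: the pass appears at fixed positions, so cost
    # is bounded by pipeline length -- but late opportunities may remain.
    for p in pipeline:
        ir = p(ir)
    return ir

def run_fixed_point(passes, ir, max_iters=100):
    # True fixed point: iterate until the IR stops changing.
    for _ in range(max_iters):
        new_ir = ir
        for p in passes:
            new_ir = p(new_ir)
        if new_ir == ir:
            return new_ir
        ir = new_ir
    return ir

simplify = lambda n: n // 2  # toy "pass": halves the toy "IR"

# Three fixed placements (as cicc places NVVMReflect) reduce 9 -> 1;
# a true fixed-point loop would reach 0, at unbounded iteration cost.
assert run_static([simplify, simplify, simplify], 9) == 1
assert run_fixed_point([simplify], 9) == 0
```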

LLVM Standard Pass Pipeline Factory -- sub_1A62BF0

The LLVM standard pass pipeline is invoked multiple times throughout the optimizer via sub_1A62BF0. The first parameter is a pipeline ID that selects which LLVM extension point to inject passes at:

| Pipeline ID | LLVM Extension Point | Usage Context |
|---|---|---|
| 1 | EP_EarlyAsPossible / basic cleanup | Tier 0, default path |
| 2 | EP_LoopOptimizerEnd | Default path late |
| 4 | EP_ScalarOptimizerLate | Default path, Tier sub-pipeline |
| 5 | EP_VectorizerStart | "mid" path, default path |
| 7 | EP_OptimizerLast | Default path |
| 8 | EP_CGSCCOptimizerLate | "mid" path (with opt flag = 1 for inlining) |

The signature is sub_1A62BF0(pipelineID, 0, 0, 1, optFlag, 0, 1, outBuf), where optFlag at position 5 enables inlining within the CGSCC sub-pipeline. It is observed as 1 for pipeline ID 8 in the "mid" path (sub_1A62BF0(8,0,0,1,1,0,1)) and for pipeline ID 4 in the default path's final call.

Each call potentially returns a cleanup callback stored in v298, invoked as v298[0](s, s, 3) for destructor/finalization. The factory is called 9+ times across the three language paths.

CompilerOptions Struct Flag Map

The a4 parameter to sub_12E54A0 is a ~4500-byte CompilerOptions struct. The following offsets have been confirmed through cross-referencing guards in the pipeline assembler and tier sub-pipelines.

| Offset | Type | Purpose | Cross-Reference |
|---|---|---|---|
| +200 | int | Optimization level (0-3) | sub_12DFE00 codegen depth |
| +280 | bool | Disable DCE | sub_18DEFF0 guard |
| +320 | bool | Disable TailCallElim | sub_1833EB0 guard |
| +360 | bool | Disable NVVMLateOpt | sub_1C46000 guard |
| +400 | bool | Disable inlining variant A | |
| +440 | bool | Disable inlining variant B | sub_1C4B6F0 guard |
| +480 | bool | Disable inlining variant C | sub_12DE8F0 guard |
| +520 | bool | Disable NVIDIA pass A | sub_1AAC510 guard |
| +560 | bool | Disable NVIDIA pass B | sub_1AAC510 guard |
| +600 | bool | Disable NVVMVerifier | sub_12D4560 guard |
| +680 | bool | Disable FunctionAttrs | sub_1841180 guard |
| +720 | bool | Disable SCCP | sub_1842BC0 guard |
| +760 | bool | Disable DSE | sub_18F5480 guard |
| +880 | bool | Disable NVVMReflect | sub_1857160 guard |
| +920 | bool | Disable IPConstPropagation | sub_185D600 guard |
| +960 | bool | Disable SimplifyCFG | sub_190BB10 guard |
| +1000 | bool | Disable InstCombine | sub_19401A0 guard |
| +1040 | bool | Disable Sink/MemSSA | sub_1869C50 guard |
| +1080 | bool | Disable PrintModulePass | sub_17060B0 guard |
| +1120 | bool | Disable NVVMPredicateOpt | sub_18A3430 guard |
| +1160 | bool | Disable LoopIndexSplit | sub_1952F90 guard |
| +1240 | bool | Disable LICM | sub_195E880 guard |
| +1280 | bool | Disable Reassociate | sub_1B7FDF0 guard |
| +1320 | bool | Disable ADCE variant A | sub_1C76260 guard |
| +1360 | bool | Disable LoopUnroll | sub_19C1680 guard |
| +1400 | bool | Disable SROA | sub_1968390 guard |
| +1440 | bool | Disable EarlyCSE | sub_196A2B0 guard |
| +1520 | bool | Disable LoopSimplify | sub_198DF00 guard |
| +1680 | bool | Disable NVIDIA pass | sub_19CE990 guard |
| +1760 | bool | Disable MemorySpaceOpt | sub_1C8E680 guard |
| +1840 | bool | Disable ADCE C | sub_1C6FCA0 guard |
| +1960 | bool | Disable ConstantMerge | sub_184CD60 guard |
| +2000 | bool | Disable NVVMIntrinsicLowering | sub_1CB4E40 guard |
| +2040 | bool | Disable MemCpyOpt | sub_1B26330 guard |
| +2120 | bool | Disable NVVMBranchDist B | sub_1CB73C0 guard |
| +2200 | bool | Disable GenericToNVVM | sub_1A02540 guard |
| +2320 | bool | Disable NVVMRematerialization | sub_1A13320 guard |
| +2440 | bool | Disable NVVMSinking2 | sub_1CC60B0 guard |
| +2560 | bool | Disable NVVMGenericAddrOpt | sub_1CC71E0 guard |
| +2600 | bool | Disable NVVMIRVerification | sub_1A223D0 guard |
| +2640 | bool | Disable NVVMLoopOpt | sub_18B1DE0 guard |
| +2720 | bool | Disable InstructionSimplify | sub_1A7A9F0 guard |
| +2840 | bool | Enable ADCE (reversed logic) | sub_1C6FCA0 |
| +2880 | bool | Enable LICM (reversed logic) | sub_195E880 |
| +2920 | bool | NVVMLowerBarriers param | sub_1C98160 |
| +3000 | bool | Extra DeadArgElim trigger | sub_18FD350 |
| +3040 | bool | Enable CVP | sub_18EEA90 |
| +3080 | bool | Enable NVIDIA loop pass | sub_1922F90 |
| +3120 | bool | Address space optimization flag | sub_1C8E680 param |
| +3160 | bool | Debug dump mode | sub_17060B0 enable |
| +3200 | bool | Enable advanced NVIDIA group | IPConst/Reflect/SCCP/etc. |
| +3328 | bool | Enable SM-specific passes | Warp/Reduction/Sinking2 |
| +3488 | bool | Enable barrier optimization | sub_1C98160, sub_18E4A00 |
| +3528 | bool | Tier 1 enable | Phase 3 loop |
| +3532 | int | Tier 1 phase threshold | Phase 3 loop |
| +3568 | bool | Tier 2 enable | Phase 3 loop |
| +3572 | int | Tier 2 phase threshold | Phase 3 loop |
| +3608 | bool | Tier 3 enable | Phase 3 loop |
| +3612 | int | Tier 3 phase threshold | Phase 3 loop |
| +3648 | ptr | Language string ("ptx"/"mid"/"idn") | Phase 1 dispatch |
| +3656 | int | Language string length | Phase 1 dispatch |
| +3704 | bool | Late optimization mode | sub_195E880, sub_1C8A4D0 |
| +3904 | bool | Debug: verify after plugins | Phase 3 loop |
| +3944 | bool | Debug: BB naming "F%d_B%d" | Phase 8 |
| +4224 | bool | Optimization master switch | Tier 0 gate |
| +4228 | int | Optimization phase threshold | Tier 0 gate |
| +4304 | bool | Device-code flag | Phase 1 v238 |
| +4384 | bool | Fast-compile / bypass pipeline | Top branch Pipeline A vs B |
| +4464 | bool | Disable late CFG cleanup B | Phase 5 sub_1654860 |
| +4480 | ptr | SM feature capability | Phase 6: & 4 = codegen ext |
| +4488 | ptr | Plugin pass array start | Phase 3 loop |
| +4496 | ptr | Plugin pass array end | Phase 3 loop |
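A minimal sketch (Python; helper names are invented) of how these fixed-offset guards are read from the ~4,500-byte options blob -- each guard is just a plain load at a fixed byte offset, with the layout itself being an RE reconstruction:

```python
import struct

# Offsets taken from the table above (reconstructed, not authoritative).
OPT_LEVEL   = 200    # int
DISABLE_DCE = 280    # bool
LANG_LEN    = 3656   # int

def read_int(blob, off):
    # Model *(int*)(a4 + off)
    return struct.unpack_from("<i", blob, off)[0]

def read_bool(blob, off):
    # Model *(bool*)(a4 + off)
    return blob[off] != 0

blob = bytearray(4512)
struct.pack_into("<i", blob, OPT_LEVEL, 3)
blob[DISABLE_DCE] = 1
struct.pack_into("<i", blob, LANG_LEN, 3)

assert read_int(blob, OPT_LEVEL) == 3
assert read_bool(blob, DISABLE_DCE)   # guard set -> DCE would be skipped
assert read_int(blob, LANG_LEN) == 3
```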

Pass Factory Address Inventory

All unique pass factory addresses called from the pipeline assembler and tier sub-pipelines:

| Function | Address | Call Sites |
|---|---|---|
| NVVMVerifier | sub_12D4560 | many (tiers) |
| AssumptionCacheTracker | sub_1361950 | 1 |
| TargetLibraryInfoWrapperPass | sub_149CCE0 | 1 |
| VerifierPass / BasicAA | sub_14A7550 | 1 |
| BreakCriticalEdges | sub_1654860 | 2 |
| PrintModulePass (debug dump) | sub_17060B0 | ~30+ |
| InstructionCombining | sub_1832270 | 2 |
| TailCallElim / JumpThreading | sub_1833EB0 | 3 |
| FunctionAttrs | sub_1841180 | 3 |
| SCCP | sub_1842BC0 | 2 |
| NVVMReflect | sub_1857160 | ~8 |
| IPConstantPropagation | sub_185D600 | 3 |
| Sink (MemorySSA-based) | sub_1869C50 | 3 |
| NVVMPredicateOpt variant | sub_18A3090 | 2 |
| NVVMPredicateOpt / SelectionOpt | sub_18A3430 | 2 |
| NVVMLoopOpt / BarrierOpt | sub_18B1DE0 | 3 |
| Sinking2Pass (fast=1 for fc mode) | sub_18B3080 | 1 |
| DCE | sub_18DEFF0 | 4 |
| NVVMBarrierAnalysis | sub_18E4A00 | 1 |
| CorrelatedValuePropagation | sub_18EEA90 | 3 |
| DSE | sub_18F5480 | 2 |
| DeadArgElimination | sub_18FD350 | 5 |
| SimplifyCFG | sub_190BB10 | 4 |
| NVIDIA-specific loop pass | sub_1922F90 | 1 |
| LoopIndexSplit | sub_1952F90 | 3 |
| LICM / LoopRotate | sub_195E880 | 4 |
| SROA | sub_1968390 | 2 |
| EarlyCSE | sub_196A2B0 | 2 |
| LoopUnroll | sub_197E720 | 1 |
| LoopSimplify | sub_198DF00 | 3 |
| SROA (variant) | sub_198E2A0 | 1 |
| InstCombine | sub_19401A0 | 2 |
| LoopUnswitch (7 params) | sub_19B73C0 | 3 |
| LoopUnroll variant | sub_19C1680 | 2 |
| NVIDIA custom pass | sub_19CE990 | 1 |
| GenericToNVVM | sub_1A02540 | 1 |
| NVVMRematerialization | sub_1A13320 | 3 |
| NVVMIRVerification | sub_1A223D0 | 5+ |
| LLVM StandardPassPipeline | sub_1A62BF0 | ~9 |
| LoopIdiomRecognize | sub_1A68E70 | 1 |
| InstructionSimplify | sub_1A7A9F0 | 3 |
| NVIDIA-specific pass | sub_1AAC510 | 1 |
| MemCpyOpt | sub_1B26330 | 4 |
| Reassociate | sub_1B7FDF0 | 3 |
| TTIWrapperPass | sub_1BFB520 | 1 |
| NVVMLateOpt | sub_1C46000 | 1 |
| Inliner / AlwaysInline | sub_1C4B6F0 | 2 |
| NewGVN / GVNHoist | sub_1C6E560 | 1 |
| GVN | sub_1C6E800 | 2 |
| ADCE | sub_1C6FCA0 | 2 |
| ADCE variant | sub_1C76260 | 2 |
| NVVMWarpShuffle | sub_1C7F370 | 1 |
| EarlyCSE / GVN variant | sub_1C8A4D0 | 3 |
| MemorySpaceOptimization | sub_1C8E680 | 4 |
| NVVMLowerBarriers | sub_1C98160 | 4 |
| NVVMLowerBarriers variant | sub_1C98270 | 1 |
| ProfileSummaryInfo | sub_1CB0F50 | 1 |
| NVVMIntrinsicLowering | sub_1CB4E40 | ~10 |
| NVVMBranchDist | sub_1CB73C0 | 3 |
| NVVMLowerAlloca | sub_1CBC480 | 1 |
| NVVMUnreachableBlockElim | sub_1CC3990 | 1 |
| NVVMReduction | sub_1CC5E00 | 1 |
| NVVMSinking2 | sub_1CC60B0 | 3 |
| NVVMGenericAddrOpt | sub_1CC71E0 | 1 |
| NVVMFinalLowering | sub_1CEBD10 | 1 |
| NVVMPeephole | sub_1CEF8F0 | 2 |
| NVVMAnnotationsProcessor | sub_215D9D0 | 2 |

Total unique pass factory addresses listed above: 67.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| NVVMPassOptions::init | sub_12D6300 | 125KB | Populates 4,512-byte options struct |
| writeStringOption | sub_12D6090 | ~100B | Writes 24-byte string slot |
| writeBoolOption | sub_12D6100 | ~80B | Writes 16-byte boolean slot |
| PassOptionRegistry::lookupOption | sub_12D6170 | ~200B | Hash table lookup |
| getBoolOption | sub_12D6240 | ~300B | Boolean resolution with default |
| PassDefTable::getPassDef | sub_1691920 | ~50B | 64-byte stride table lookup |
| parseInt | sub_16D2BB0 | ~100B | String to int64 |
| Pipeline assembler (master) | sub_12E54A0 | 49.8KB | 8-phase pipeline construction |
| AddPass | sub_12DE0B0 | 3.5KB | Hash-table-based insertion |
| Tier 0 sub-pipeline | sub_12DE330 | 4.8KB | ~40 passes, full optimization |
| Tier 1/2/3 sub-pipeline | sub_12DE8F0 | 17.9KB | Phase-conditional, incremental |
| Codegen dispatch | sub_12DFE00 | 20.7KB | Dependency-ordered codegen |
| Phase I/II orchestrator | sub_12E7E70 | 9.4KB | Two-phase state machine |
| New PM registration | sub_2342890 | ~50KB | 2,816 lines, 35 NVIDIA + ~350 LLVM |
| registerPass (hash insert) | sub_E41FB0 | ~300B | StringMap insertion |
| Pass name prefix matcher | sub_2337DE0 | ~100B | starts_with comparison |
| Parameterized pass parser | sub_234CEE0 | ~200B | Extracts <params> |
| MemorySpaceOpt param parser | sub_23331A0 | ~300B | first-time/second-time/warnings |
| New PM pipeline driver | sub_226C400 | 35KB | nvopt<O0/O1/O2/O3/Ofcmax/Ofcmid/Ofcmin> selection |
| New PM text parser (buildDefaultPipeline) | sub_2277440 | 60KB | Parses pipeline name strings |
| nvopt registration (new PM) | sub_225D540 | ~32KB | Pipeline element vtable at 0x4A08350 |
| nvopt registration (legacy PM) | sub_12C35D0 | ~500B | Pipeline element vtable at 0x49E6A58 |
| nvopt object initializer | sub_12EC960 | ~100B | Creates 512-byte pipeline object |
| LLVM standard pipeline factory | sub_1A62BF0 | varies | Pipeline IDs 1,2,4,5,7,8 |
| Pass registry check | sub_163A1D0 | ~100B | Pass registration status |
| Pass status update | sub_163A340 | ~100B | Used in codegen dispatch |
| Pipeline text tokenizer | sub_2352D90 | ~200B | Tokenizes nvopt<> strings |

Reimplementation Checklist

  1. Two-phase compilation model. Implement a TLS phase variable (values 1=Phase I, 2=Phase II, 3=done) read by individual passes to skip themselves when the current phase does not match their intended execution phase. Phase I runs whole-module analysis; Phase II runs per-function codegen-oriented passes.
  2. Pipeline assembly function (~150 AddPass calls). Build the master pipeline at runtime using hash-table-based pass insertion (AddPass), with language-specific dispatch (paths for "ptx", "mid", and default), tier-based interleaving (Tiers 0--3 fired when plugin-pass phase IDs cross per-tier thresholds), and phase-conditional pass inclusion.
  3. NVVMPassOptions system (4,512-byte struct, 221 slots). Implement the proprietary per-pass enable/disable and parametric knob system with 114 string + 100 boolean + 6 integer + 1 string-pointer option slots, parsed from CLI flags and routed to individual passes.
  4. Concurrent per-function compilation. After Phase I completes on the whole module, split Phase II across a thread pool sized to get_nprocs() or GNU Jobserver token count, with per-function bitcode extraction, independent compilation, and re-linking of results.
  5. GNU Jobserver integration. Parse --jobserver-auth=R,W from MAKEFLAGS environment variable, create a token management pipe, and spawn a pthread to throttle concurrent compilations to the build system's -j level.
  6. Split-module compilation. Implement the -split-compile=N mechanism: decompose multi-function modules into per-function bitcode blobs via filter callbacks, compile each independently (potentially in parallel), re-link results, and restore linkage attributes from a hash table.
  7. Tier 0 full optimization sub-pipeline. Assemble the ~40-pass Tier 0 sequence: BreakCriticalEdges, GVN, NVVMReflect, SCCP, NVVMVerifier, LoopIndexSplit, ADCE, LICM, LoopUnroll, InstCombine, SROA, EarlyCSE, LoopUnswitch, SimplifyCFG, NVVMRematerialization, DSE, DCE, with per-pass NVVMPassOptions gating.
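As a concrete piece of checklist item 5, a sketch of extracting the fd pair from MAKEFLAGS (Python; this handles only the --jobserver-auth=R,W form named above, not GNU make's older --jobserver-fds spelling or the newer fifo: style):

```python
import re

def parse_jobserver_auth(makeflags):
    # Extract the (read_fd, write_fd) pair from a --jobserver-auth=R,W
    # token in MAKEFLAGS. Returns None when no jobserver is advertised,
    # in which case a reimplementation would fall back to get_nprocs().
    m = re.search(r"--jobserver-auth=(\d+),(\d+)", makeflags)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2))

assert parse_jobserver_auth("-j8 --jobserver-auth=3,4") == (3, 4)
assert parse_jobserver_auth("-j8") is None
```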

Cross-References

OptiX IR Generation

When cicc receives the --emit-optix-ir flag, it activates an alternate compilation path that produces OptiX IR instead of PTX. OptiX IR is the intermediate representation consumed by NVIDIA's OptiX ray tracing engine, which uses a continuation-based execution model fundamentally different from the standard CUDA kernel launch model. Rather than compiling all the way down to PTX machine code, the OPTIXIR pipeline stage serializes the optimized LLVM module in a form that the OptiX runtime can later JIT-compile, link with ray tracing shaders, and schedule across the RT cores' hardware intersection pipeline.

The OptiX path is the third of four stages in cicc's internal pipeline (LNK -> OPT -> OPTIXIR -> LLC), but it is mutually exclusive with LLC in practice: when OptiX mode is active, the pipeline bitmask enables OPTIXIR (0x40) and disables certain optimizations that would be incorrect for continuation-based code. The flag also forces the EDG frontend to emit lifetime intrinsics (--emit-lifetime-intrinsics, EDG option id 132), which mark the live ranges of local variables -- essential information for the OptiX runtime's continuation frame layout.

| Field | Value |
|---|---|
| Pipeline stage | OPTIXIR (stage 3 of 4) |
| Stage bit | Bit 6 (0x40) in pipeline bitmask |
| Mode bitmask | 0x43 = (a13 & 0x300) \| 0x43 |
| Core function | sub_12F9270 (~6 KB) |
| Timer name | "OPTIXIR" / "LibNVVM Optix IR step." |
| Container IR level | NVVM_IR_LEVEL_OPTIX (value 2) |
| CLI flag | --emit-optix-ir (15 bytes, inline-matched) |
| Input extension | .optixir (recognized at 0x8FC001) |
| Callback slot | CompilationState+144 (function), +152 (user data) |
| Availability | CUDA (0xABBA) and OpenCL (0xDEED) modes only |

Flag Processing

--emit-optix-ir in Real Main (sub_8F9C90)

In the standalone entry point, --emit-optix-ir is matched at 0x8FAD00 by a 15-byte inline comparison (split across three immediate compares: "--emit-o" + "ptix" + "-ir"). When matched, it performs three actions:

  1. Pushes three strings to the v266 pass-through vector:

    • "--emit-optix-ir" (literal, 15 bytes via explicit strcpy)
    • An 18-byte target string from xmmword_3C23B30 + "28" (likely target-related configuration)
    • A 20-byte GPU name string from xmmword_3C23B40 + "t128" (likely target capability)
  2. Sets v243 = 1 (the OptiX IR mode flag)

  3. Sets v258 = 1 (the NVC flag, also set by -nvc)

--emit-optix-ir in Flag Catalog (sub_9624D0)

In the 3-column flag fan-out system, --emit-optix-ir is processed at line 2415 of the decompiled flag catalog. Its behavior:

// Only valid when a4 == 0xDEED (OpenCL) or a4 == 0xABBA (CUDA)
if (a4 == 0xDEED || a4 == 0xABBA) {
    // Route to optimizer: disable IP-MSP and LICM
    append_to_opt_vector("-do-ip-msp=0");
    append_to_opt_vector("-do-licm=0");

    // Set mode bitmask: preserve 64/32-bit mode bits, set OptiX mode
    a13 = (a13 & 0x300) | 0x43;
}

The 0x43 value decomposes to:

  • Bits [1:0] = 0x03 -- LNK plus the base phase-control bit (LLC's own bit 2 remains clear)
  • Bit 6 = 0x40 -- OPTIXIR stage enabled

3-Column Fan-Out

The flag translation table maps --emit-optix-ir across all three compilation columns:

| Column | Forwarded As |
|---|---|
| nvcc -> EDG | --emit-lifetime-intrinsics |
| nvcc -> cicc (optimizer) | --emit-optix-ir + -do-ip-msp=0 + -do-licm=0 |
| cicc internal | Mode bitmask 0x43 |

This is notable because a single user-facing flag triggers a different flag in the EDG frontend (--emit-lifetime-intrinsics, EDG option id 132) while also routing the OptiX flag itself to the cicc optimizer. The EDG side-effect ensures that lifetime markers (llvm.lifetime.start / llvm.lifetime.end) are present in the generated LLVM IR, which the OptiX runtime needs to compute continuation frame sizes.

Pipeline Stage

Bitmask and Gating

The pipeline orchestrator sub_12C35D0 (41 KB, the nvvmCompileProgram internal) reads the pipeline stage bitmask from sub_12D2AA0 during initialization. This function parses the architecture code and options into four stage descriptors:

| Stage | Descriptor Pair | Bitmask Bit |
|---|---|---|
| LNK | (&v195, &v200) | Bit 0 (0x01) |
| OPT | (&v196, &v201) | Bit 7 (0x80) |
| OPTIXIR | (&v197, &v202) | Bit 6 (0x40) |
| LLC | (&v198, &v203) | Bit 2 (0x04) |

The OPTIXIR stage executes at lines 1093--1150 of the decompiled orchestrator, after OPT and before LLC:

// STAGE 3 -- OPTIXIR
if (v87 & 0x40) {
    // Start timer
    sub_16D8B50(timer_ctx, "OPTIXIR", 7,
                "LibNVVM Optix IR step.", 22, ...);

    // Generate OptiX IR from the optimized LLVM module
    err = sub_12F9270(arch_code,      // a3: SM architecture code
                      llvm_ctx,       // a4: LLVM context
                      module,         // current LLVM Module*
                      state + 6,      // output buffer for OptiX IR
                      &error_str);    // error string out

    if (err) {
        // Append error to state[10] error log
        ...
    }

    // Close timer
    sub_16D7950(timer_ctx);
}

Callback Mechanism

Like the other three stages, OPTIXIR has a callback slot in the CompilationState structure:

| Offset | Field |
|---|---|
| +112 | LNK callback function pointer |
| +120 | LNK callback user data |
| +128 | OPT callback function pointer |
| +136 | OPT callback user data |
| +144 | OPTIXIR callback function pointer |
| +152 | OPTIXIR callback user data |
| +160 | LLC callback function pointer |
| +168 | LLC callback user data |

In the standalone pipeline entry (sub_1265970), the OPTIXIR callback is registered when both verbose and keep-temps modes are active (the logical AND of -v and -keep, which requires wizard mode). The callback ID is 64222, registered via sub_1268040 through sub_12BC0F0.

sub_12F9270 -- OptiX IR Generator

| Field | Value |
|---|---|
| Address | 0x12F9270 |
| Size | ~6 KB |
| Parameters | (uint arch_code, LLVMContext *ctx, Module *module, OutputBuffer *out, char **error_str) |
| Return | unsigned int (0 = success) |

This function takes the fully optimized LLVM module and serializes it into OptiX IR format. The output goes into the state+6 output buffer in the CompilationState, not into the PTX output buffer at state+80. The architecture code and LLVM context are passed through from the pipeline orchestrator's arguments.

The function is relatively small (~6 KB) compared to the LLC stage (sub_12F5100, ~12 KB), consistent with it being primarily a serialization step rather than a full code generation pipeline. It does not run SelectionDAG, register allocation, or instruction scheduling -- those are the domain of the LLC stage, which is typically skipped when OptiX mode is active.

IR Level and Container Marking

When the NVVM container format wraps an OptiX IR payload, the IRLevel field in the binary header is set to NVVM_IR_LEVEL_OPTIX (value 2):

| IRLevel Value | Enum Name | Meaning |
|---|---|---|
| 0 | NVVM_IR_LEVEL_UNIFIED_AFTER_DCI | Default: IR after Device-Code-Interface unification |
| 1 | NVVM_IR_LEVEL_LTO | Link-Time Optimization IR (partially optimized) |
| 2 | NVVM_IR_LEVEL_OPTIX | OptiX pipeline IR |

In the binary header, this is stored as a uint16_t at offset 0x0C:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  IRLevel = 0x0002 (OPTIX)    |   0x0C in NvvmContainerBinaryHeader
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

In the XML serialization path (used for debugging), this appears as the "IRLevel" element with the symbolic name "NVVM_IR_LEVEL_OPTIX".

The .optixir file extension is recognized as an input format by cicc's argument parser (matched at 0x8FC001 by comparing the last 8 characters of the filename). This allows round-tripping: cicc can both produce and consume OptiX IR files.

Optimization Pipeline Differences

When OptiX mode is active, the flag catalog forces two critical optimizer changes via the pass-through vector to the OPT stage:

LICM Disabled (-do-licm=0)

Loop Invariant Code Motion is completely disabled when compiling for OptiX. The do-licm NVVMPassOption (at a known offset in the 4,512-byte options struct) gates the LICM pass insertion in the pipeline assembler sub_12E54A0. When set to 0, the sub_195E880(0) LICM pass at position 22 of the Tier 0 pipeline is skipped entirely.

The rationale is that OptiX uses a continuation-based execution model where functions can be suspended and resumed at hardware-defined continuation points (ray-surface intersection, any-hit shader invocation, etc.). LICM hoisting moves computations out of loops and into dominating blocks, which can move them across implicit continuation boundaries. If a hoisted value is live across a continuation point, the OptiX runtime must save it to the continuation frame -- potentially increasing frame size and reducing performance. Worse, the hoisting may move side-effecting operations across points where the program could be suspended, violating the continuation semantics. Disabling LICM avoids these correctness and performance hazards entirely.

IP-MSP Disabled (-do-ip-msp=0)

Interprocedural Memory Space Propagation is also disabled. IP-MSP (sub_12E6160, the NVVMMemorySpacePropagation pass) propagates memory space annotations (generic -> shared/local/global) across function boundaries. This optimization is meaningless for OptiX IR because the OptiX runtime performs its own memory space analysis during JIT compilation, and the intermediate representation must remain generic to allow runtime binding of hit attributes, payload data, and SBT (Shader Binding Table) records to their final memory spaces.

Forced Inlining (nv-inline-all)

The nv-inline-all knob (registered at constructor ctor_186_0 at 0x4DBEC0 in the NVIDIA custom inliner) bypasses cost analysis entirely and forces inlining of every call. This mode is used for OptiX compilation where the entire call graph must be flattened for the hardware intersection pipeline. The OptiX runtime requires monolithic shader functions because the RT core hardware executes individual ray tracing programs as atomic units -- there is no call stack during hardware intersection traversal.

In the inliner cost model (sub_1864060, 75 KB), the standard inline-budget (default 20,000) and inline-total-budget are irrelevant when nv-inline-all is active -- every call site is inlined unconditionally, regardless of cost.

Continuation-Based Execution Model

OptiX IR exists because NVIDIA's ray tracing hardware uses a fundamentally different execution model than standard CUDA kernels. Understanding this model explains every design decision in the OPTIXIR pipeline stage.

Standard CUDA vs. OptiX Execution

In standard CUDA, a kernel is a single function that runs to completion on an SM. The compiler produces PTX, which ptxas assembles into SASS machine code. The entire call graph is resolved at compile time, and the GPU executes instructions sequentially (modulo warp divergence and memory latency hiding).

In OptiX, a ray tracing pipeline consists of multiple programs (ray generation, closest-hit, any-hit, miss, intersection, callable) that are compiled separately and linked at runtime by the OptiX driver. When a ray-surface intersection occurs, the hardware suspends the current program, saves its live state to a continuation frame in device memory, and launches the appropriate hit shader. When the hit shader completes, execution resumes from the continuation point.

This model has several consequences for compilation:

  1. No cross-function calls during intersection. The RT core hardware does not support a general call stack. All function calls within a single program must be fully inlined before the OptiX runtime receives the IR -- hence nv-inline-all.

  2. Lifetime intrinsics are critical. The OptiX runtime uses llvm.lifetime.start / llvm.lifetime.end markers to determine which local variables are live at each potential continuation point. Variables that are provably dead at a continuation point do not need to be saved to the continuation frame. Without these markers, the runtime must conservatively assume all locals are live, inflating frame sizes and reducing performance.

  3. LICM is unsafe. Hoisting computations out of loops can move them across implicit continuation points, creating live ranges that span suspension/resumption boundaries. The OptiX runtime cannot reconstruct the hoisted value after resumption unless it is saved, but the compiler does not know where the continuation points will be (they are determined at runtime by the ray tracing pipeline topology).

  4. Memory space must remain generic. OptiX IR is JIT-compiled at runtime with knowledge of the full pipeline configuration. Memory space decisions that depend on the pipeline topology (shared memory for hit attributes, global memory for payload) cannot be made at cicc compile time.

  5. The output is IR, not machine code. Unlike the LLC stage which produces PTX text, the OPTIXIR stage serializes the LLVM module in a form suitable for the OptiX JIT. This is why sub_12F9270 is only ~6 KB -- it is a serializer, not a code generator.

Configuration

CLI Activation

# Standard OptiX compilation via nvcc
nvcc --emit-optix-ir -arch=sm_89 -o kernel.optixir kernel.cu

# Direct cicc invocation
cicc --emit-optix-ir -arch sm_89 -o kernel.optixir kernel.bc

# cicc also accepts .optixir input files for round-tripping
cicc -arch sm_89 -o kernel.ptx kernel.optixir

Effective Configuration When Active

When --emit-optix-ir is specified, the following configuration is implicitly applied:

| Setting | Value | Source |
|---|---|---|
| v243 (OptiX flag) | 1 | Real main sub_8F9C90 |
| v258 (NVC flag) | 1 | Real main sub_8F9C90 |
| Pipeline bitmask | 0x43 | Flag catalog sub_9624D0 |
| do-licm | 0 | Flag catalog, routed to OPT |
| do-ip-msp | 0 | Flag catalog, routed to OPT |
| EDG: emit-lifetime-intrinsics (id 132) | enabled | 3-column fan-out |
| Container IRLevel | 2 (NVVM_IR_LEVEL_OPTIX) | Container serializer |
| nv-inline-all | true | OptiX mode forces all inlining |

Bitmask Decomposition

The 0x43 mode value preserves the 64/32-bit mode bits (mask 0x300) from any previously-set a13 value:

a13 = (a13 & 0x300) | 0x43

Bit field:
  [9:8] = preserved (0x100 = 64-bit, 0x200 = 32-bit)
  [7]   = 0  (OPT stage -- controlled separately)
  [6]   = 1  (OPTIXIR stage enabled)
  [5:3] = 0  (no LTO, no verification override)
  [2]   = 0  (LLC stage -- typically not run in OptiX mode)
  [1:0] = 11 (LNK + base phase control)

Note that bit 2 (LLC) is 0 in the 0x43 bitmask, confirming that the LLC stage is not activated when OptiX mode is the primary output. The pipeline runs LNK -> OPT -> OPTIXIR and stops.

Diagnostic Strings

| String | Length | Context |
|---|---|---|
| "OPTIXIR" | 7 | Timer phase name (passed to sub_16D8B50) |
| "LibNVVM Optix IR step." | 22 | Timer description string |
| "--emit-optix-ir" | 15 | CLI flag literal (inline-matched in real main) |
| "--emit-lifetime-intrinsics" | 27 | EDG flag routed from --emit-optix-ir |
| ".optixir" | 8 | Input file extension (matched at 0x8FC001) |
| "-do-ip-msp=0" | 13 | Optimizer option routed when OptiX active |
| "-do-licm=0" | 12 | Optimizer option routed when OptiX active |

Function Map

| Function | Address | Size |
|---|---|---|
| OptiX IR generator (core OPTIXIR stage) | sub_12F9270 | ~6 KB |
| Pipeline orchestrator (nvvmCompileProgram internal) | sub_12C35D0 | ~41 KB |
| Bitmask / stage descriptor parser | sub_12D2AA0 | -- |
| Flag catalog (routes --emit-optix-ir) | sub_9624D0 | ~75 KB |
| Real main (matches --emit-optix-ir at 0x8FAD00) | sub_8F9C90 | ~10 KB |
| OPTIXIR callback registration (callback ID 64222) | sub_1268040 | -- |
| Pipeline callback dispatcher | sub_12BC0F0 | -- |
| Inliner cost model (nv-inline-all bypass) | sub_1864060 | ~75 KB |
| CGSCC inliner core (inlineCallsImpl) | sub_186CA00 | ~61 KB |
| Timer start (receives "OPTIXIR" phase name) | sub_16D8B50 | -- |
| Timer close | sub_16D7950 | -- |
| Pipeline assembler (skips LICM when do-licm=0) | sub_12E54A0 | ~49.8 KB |

Code Generation

NVPTX backend: SelectionDAG lowering, instruction selection, register allocation, and machine-level passes. Address range 0x1700000--0x35EFFFF (~37 MB of code) -- the largest address range in the binary. This page is the hub for the entire code generation pipeline; each stage has a dedicated deep-dive page linked below.

| Topic | Page |
|---|---|
| SelectionDAG pipeline | SelectionDAG & ISel — build, legalize, combine, select |
| Type legalization | Type Legalization — 348KB monolithic dispatch |
| ISel patterns | ISel Pattern Matching — three-level dispatch, 900KB |
| Register allocation | Register Allocation — pressure-driven greedy RA |
| Register classes | NVPTX Register Classes — nine classes, ID map |
| Scheduling | Instruction Scheduling — MRPA, pipeliner, post-RA |
| Machine passes | Machine-Level Passes — MRPA, remat, LDG, peephole |
| StructurizeCFG | StructurizeCFG — mandatory structured control flow |
| CodeGenPrepare | CodeGenPrepare & SCEV-CGP — IR-level backend prep |
| KnownBits | KnownBits & DemandedBits — fused analysis with GPU SR oracle |
| Tensor core codegen | MMA Code Generation — HMMA/IMMA/WGMMA/tcgen05 lowering pipeline |
| Tensor core builtins | Tensor / MMA Builtins — per-ID reference, validation rules |
| Atomics | Atomic Builtins — scope-aware atom lowering |
| Target infrastructure | NVPTX Target Infrastructure — TargetMachine, TTI, SubtargetFeatures |
| Live range calc | LiveRangeCalc — dual-bitvector liveness |
| Rematerialization | Rematerialization — IR-level + machine-level remat |
| InstrEmitter | InstrEmitter — DAG-to-MachineInstr conversion |
| DAG node layout | SelectionDAG Node Structure — 104-byte SDNode |

Architecture

The code generation pipeline runs after the LLVM optimizer and produces MachineIR that the PTX emission stage serializes to text. The pipeline follows upstream LLVM's SelectionDAG architecture with NVIDIA-specific passes inserted at key points.

LLVM IR
  │
  ├─ CodeGenPrepare (IR-level backend prep)
  │    sub_1D70000-1D7FFFF: sunkaddr, sunk_phi, block splitting
  │
  ├─ SelectionDAG Build
  │    sub_2065D30 (visit dispatcher)
  │    sub_2056920 (major worker, 69KB)
  │    sub_2077400 (NVVM tex/surf handle lowering) ★ NVIDIA
  │    sub_2072590 (NVPTX argument passing, 38KB) ★ NVIDIA
  │
  ├─ LegalizeTypes
  │    sub_20019C0 (348KB main loop)
  │    sub_201E5F0 (opcode dispatch, 81KB)
  │    sub_201BB90 (expand integer, 75KB)
  │
  ├─ LegalizeOp
  │    sub_1FFB890 (169KB, type action dispatch)
  │    sub_1FF6F70 (43KB, atomic target-specific lowering) ★ NVIDIA
  │
  ├─ DAG Combining
  │    sub_F681E0 (65KB, top-level orchestrator)
  │    sub_F20C20 (64KB, visitNode main)
  │
  ├─ Instruction Selection
  │    sub_3090F90 (91KB, NVPTXDAGToDAGISel::Select) ★ NVIDIA
  │    sub_33D4EF0 (complex addressing, calls sub_969240 399×)
  │
  ├─ Instruction Scheduling
  │    sub_355F610 (64KB, ScheduleDAGMILive post-RA)
  │    sub_3563190 (58KB, MachinePipeliner)
  │
  ├─ Register Allocation
  │    sub_2F49070 (82KB, RAGreedy::selectOrSplit)
  │    sub_2F2D9F0 (93KB, LiveRangeSplitter)
  │
  ├─ Machine-Level Passes
  │    MRPA, Block Remat, Mem2Reg, LDG, Peephole, etc.
  │
  └─ StructurizeCFG
       sub_35CC920 (95KB, mandatory for PTX structured control flow)

Items marked ★ NVIDIA are NVIDIA-proprietary additions not present in upstream LLVM.

Stage Overview

CodeGenPrepare (detail) -- last IR-level pass before ISel. Sinks address computations, creates PHI nodes for sunk values, and splits critical edges. NVIDIA adds an optional SCEV-CGP extension.

SelectionDAG Build (detail) -- converts LLVM IR into a target-independent DAG. NVPTX intercepts for .param-space argument passing and texture/surface handle lowering.

Type Legalization (detail) -- rewrites every illegal type into legal equivalents via promote, expand, soften, or split-vector actions.

Operation Legalization -- processes nodes whose opcodes are illegal for the target. Atomic operations receive NVIDIA-specific scope-aware lowering (CTA/GPU/SYS) with per-SM feature gates.

DAG Combining -- folds redundant operations, canonicalizes patterns, and reduces the DAG before instruction selection. The KnownBits analysis feeds into combining decisions.

Instruction Selection (detail) -- matches DAG nodes against PTX instruction patterns via a three-level dispatch hierarchy. A compressed per-SM-variant legality table gates which opcodes exist on which GPU architecture.

Instruction Scheduling (detail) -- post-RA scheduling plus an optional software pipeliner. NVIDIA's custom MRPA provides incremental register pressure tracking.

Register Allocation (detail) -- pressure-driven greedy allocator adapted for PTX's virtual register model. Works with nine typed register classes; live range splitting and rematerialization reduce spill pressure.

Machine-Level Passes (detail) -- NVIDIA-proprietary and stock LLVM passes that optimize register pressure, promote stack objects back to registers, and prepare clean PTX for ptxas.

StructurizeCFG (detail) -- mandatory pass that converts arbitrary CFGs into the structured form PTX requires, rejecting irreducible CFGs and EH funclets.

Two-Stage Compilation: cicc + ptxas

CUDA compilation is a two-stage process. cicc (this binary) compiles CUDA/NVVM IR down to PTX assembly text -- a virtual ISA with unlimited registers and structured control flow. ptxas then compiles the PTX into SASS machine code for a specific SM target. This split means that many of cicc's code generation decisions (register allocation, instruction scheduling, peephole optimization) are revisited by ptxas with full hardware knowledge. cicc's code generation pipeline therefore optimizes for two audiences simultaneously: (1) reducing register pressure and producing clean PTX that gives ptxas maximum optimization freedom, and (2) performing target-aware lowering (type legalization, instruction selection, structured CFG) that ptxas cannot undo. The practical consequence is that cicc's backend is pressure-driven rather than latency-driven -- scheduling for low register count matters more than scheduling for pipeline throughput, because ptxas will re-schedule for the hardware but cannot reduce register demand below what cicc emitted.

PTX Emission

PTX assembly output, function headers, stack frames, register declarations, special registers, atomic instructions, barriers, debug info, and output modes. Address range 0x2140000--0x21FFFFF for NVPTX-specific emission, 0x31E0000--0x3240000 for AsmPrinter.

| Component | Location / Notes |
|---|---|
| AsmPrinter::emitFunctionBody | sub_31EC4F0 (72KB) |
| Function header orchestrator | sub_215A3C0 (.entry/.func, .param, kernel attrs, .pragma) |
| Kernel attribute emission | sub_214DA90 (.reqntid, .maxntid, .minnctapersm, cluster) |
| Stack frame setup | sub_2158E80 (17KB, .local, .reg, __local_depot) |
| Register class map | sub_2163730 + sub_21638D0 (9 classes) |
| GenericToNVVM | sub_215DC20 / sub_215E100 (36KB, addrspace rewriting) |
| Special registers | sub_21E86B0 (%tid, %ctaid, %ntid, %nctaid) |
| Cluster registers | sub_21E9060 (15 registers, SM 90+) |
| Atomic emission | sub_21E5E70 (13 opcodes) + sub_21E6420 (L2 cache hints) |
| Memory barriers | sub_21E94F0 (membar.cta/gpu/sys, fence.sc.cluster) |
| Cluster barriers | sub_21E8EA0 (barrier.cluster.arrive/wait) |
| Global variable emission | sub_2156420 (texref/surfref/samplerref/data) |
| Global variable ordering | sub_2157D50 (5.9KB, topological sort with circular dependency detection) |
| Bitcode producer | "LLVM7.0.1" (NVVM IR compat marker, despite LLVM 20.0.0) |

Function Header Emission -- sub_215A3C0

Emits a complete PTX function prologue in this exact order:

| Step | Output | Condition |
|---|---|---|
| (a) | .pragma "coroutine";\n | Metadata node type 'N' linked to current function |
| (b) | CUDA-specific attributes | *(a1+232)->field_952 == 1 |
| (c) | .entry or .func | sub_1C2F070 (isKernelFunction) |
| (d) | Return type spec | .func only, via sub_214C940 |
| (e) | Mangled function name | sub_214D1D0 |
| (f) | .param declarations | sub_21502D0 (monotonic counter _param_0, _param_1, ...) |
| (g) | Kernel attributes | .entry only, via sub_214DA90 |
| (h) | Additional attributes | sub_214E300 |
| (i) | .noreturn | Non-kernel with noreturn attribute (metadata attr 29) |
| (j) | {\n | Open function body |
| (k) | Stack frame + registers | sub_2158E80 |
| (l) | DWARF debug info | If enabled |

Kernel Attributes -- sub_214DA90

Reads NVVM metadata and emits performance-tuning directives. Attribute emission order:

| Order | Attribute | Source Metadata | Condition |
|---|---|---|---|
| 1 | .blocksareclusters | nvvm.blocksareclusters | Fatal if reqntid not set |
| 2 | .reqntid X, Y, Z | nvvm.reqntid + sub_1C2EDB0 | Comma-separated strtol parse |
| 3 | .maxntid X, Y, Z | sub_1C2EC00 / structured | Unspecified dims default to 1 |
| 4 | .minnctapersm N | sub_1C2EF70 | -- |
| 5 | .explicitcluster | nvvm.cluster_dim | SM > 89 only |
| 6 | .reqnctapercluster X, Y, Z | Cluster dim readers | SM > 89 only |
| 7 | .maxclusterrank N | sub_1C2EF50 | SM > 89 only |
| 8 | .maxnreg N | sub_1C2EF90 | -- |
Cluster attributes (5--7) gated by *(a1+232)->field_1212 > 0x59 (SM > 89, i.e., SM 90+).

Stack Frame -- sub_2158E80

| Field | Value |
|---|---|
| Address | 0x2158E80 |
| Size | 17KB |

Emission Steps

  1. Local depot (if *(frame_info+48) != 0):

    .local .align 16 .b8 __local_depot0[256];
    

    Where alignment = *(frame_info+60), index = function index, size = frame size.

  2. Stack pointer registers:

    .reg .b64 %SP;    // stack pointer
    .reg .b64 %SPL;   // stack pointer local
    

    Uses .b32 in 32-bit mode (checked via *(a2+8)->field_936).

  3. Virtual register declarations -- iterates register map at *(a1+800), deduplicates via hash table at a1+808:

    .reg .pred  %p<5>;
    .reg .b16   %rs<12>;
    .reg .b32   %r<47>;
    .reg .b64   %rd<8>;
    .reg .f32   %f<20>;
    .reg .f64   %fd<3>;
    

Register Class Map

The complete 9-class register table (vtable addresses, PTX type suffixes, prefixes, encoded IDs, copy opcodes, and coalescing constraints) is in Register Classes. The encoding scheme (sub_21583D0: class_encoded_id | (register_index & 0x0FFFFFFF), fatal "Bad register class" on unrecognized vtable) is documented in Register Encoding Scheme.

Special Registers -- sub_21E86B0

Switch on operand value (ASCII-encoded):

| Opcode | Char | Register | Description |
|---|---|---|---|
| 0x26 | & | %tid.x | Thread ID, X |
| 0x27 | ' | %tid.y | Thread ID, Y |
| 0x28 | ( | %tid.z | Thread ID, Z |
| 0x29 | ) | %ntid.x | Block dim, X |
| 0x2A | * | %ntid.y | Block dim, Y |
| 0x2B | + | %ntid.z | Block dim, Z |
| 0x2C | , | %ctaid.x | Block ID, X |
| 0x2D | - | %ctaid.y | Block ID, Y |
| 0x2E | . | %ctaid.z | Block ID, Z |
| 0x2F | / | %nctaid.x | Grid dim, X |
| 0x30 | 0 | %nctaid.y | Grid dim, Y |
| 0x31 | 1 | %nctaid.z | Grid dim, Z |
| 0x5E | ^ | (dynamic) | Via sub_3958DA0(0, ...) -- %warpid/%laneid |
| 0x5F | _ | (dynamic) | Via sub_3958DA0(1, ...) |

Cluster Registers -- sub_21E9060 (SM 90+)

| Value | Register | Description |
|---|---|---|
| 0 | %is_explicit_cluster | Explicit cluster flag |
| 1 | %cluster_ctarank | CTA rank within cluster |
| 2 | %cluster_nctarank | CTAs in cluster |
| 3--5 | %cluster_nctaid.{x,y,z} | Cluster grid dimensions |
| 6--8 | %cluster_ctaid.{x,y,z} | CTA ID within cluster |
| 9--11 | %nclusterid.{x,y,z} | Number of clusters |
| 12--14 | %clusterid.{x,y,z} | Cluster ID |

Fatal: "Unhandled cluster info operand" on invalid value.

Atomic Instruction Emission

Operand Encoding

The atomic instruction word packs scope and operation into a single integer read from the operand array at *(operand_array + 16*a2 + 8):

Bit layout:
  [3:0]   — reserved
  [7:4]   — scope: 0=gpu (implicit), 1=cta, 2=sys
  [15:8]  — reserved
  [23:16] — atomic opcode (BYTE2)

The scope field emits a prefix before the atomic suffix: scope 0 produces no prefix (implicit .gpu), scope 1 emits ".cta", scope 2 emits ".sys". The complete PTX instruction format is atom[.scope].op.type.

Base Atomics -- sub_21E5E70

13-operation dispatch table. The switch on BYTE2(v4) selects both the operation suffix and its type class:

| Opcode | Suffix | Type Class | PTX Semantics |
|---|---|---|---|
| 0x00 | .exch.b | bitwise | Exchange -- atomically swap value |
| 0x01 | .add.u | unsigned | Unsigned integer addition |
| 0x03 | .and.b | bitwise | Bitwise AND |
| 0x05 | .or.b | bitwise | Bitwise OR |
| 0x06 | .xor.b | bitwise | Bitwise XOR |
| 0x07 | .max.s | signed | Signed integer maximum |
| 0x08 | .min.s | signed | Signed integer minimum |
| 0x09 | .max.u | unsigned | Unsigned integer maximum |
| 0x0A | .min.u | unsigned | Unsigned integer minimum |
| 0x0B | .add.f | float | Floating-point addition |
| 0x0C | .inc.u | unsigned | Unsigned increment (wrapping) |
| 0x0D | .dec.u | unsigned | Unsigned decrement (wrapping) |
| 0x0E | .cas.b | bitwise | Compare-and-swap |

Opcodes 0x02 and 0x04 are intentionally absent -- the PTX ISA has no signed atomic add at that slot, and no bitwise operation occupies slot 4. The 13 operations exactly match the PTX atom instruction repertoire.

The type width suffix (.b32, .b64, .u32, .u64, .s32, .s64, .f32, .f64) is appended separately by the instruction printer after the operation suffix, based on the register class of the destination operand.

L2 Cache-Hinted Atomics -- sub_21E6420 (Ampere+)

A parallel emission function that inserts L2::cache_hint between the operation and type suffix to produce the extended format:

atom[.scope].op.L2::cache_hint.type

All 13 atomic operations are supported with L2 hints. The hint instructs the GPU L2 cache controller to retain (or evict) the target cache line after the atomic completes -- a data-locality optimization introduced with Ampere (SM 80).

The function uses SSE xmmword loads from precomputed string constants at addresses xmmword_435F590 through xmmword_435F620 to fast-copy 16-byte prefixes of each suffix string. This avoids per-character string construction: each atomic variant's complete suffix (e.g., .exch.L2::cache_hint.b at 22 bytes) is assembled from a 16-byte SSE load of the prefix plus a patched tail. The compiler optimized this into aligned vector moves rather than memcpy calls.

Atomic Emission Pseudocode

void emitAtomicOp(raw_ostream &OS, unsigned operand) {
    unsigned scope = (operand >> 4) & 0xF;
    unsigned opcode = (operand >> 16) & 0xFF;  // BYTE2

    OS << "atom";
    if (scope == 1) OS << ".cta";
    else if (scope == 2) OS << ".sys";
    // scope 0 = implicit .gpu, no suffix

    switch (opcode) {
    case 0x00: OS << ".exch.b"; break;
    case 0x01: OS << ".add.u";  break;
    // ... 0x02, 0x04 absent ...
    case 0x03: OS << ".and.b";  break;
    case 0x05: OS << ".or.b";   break;
    case 0x06: OS << ".xor.b";  break;
    case 0x07: OS << ".max.s";  break;
    case 0x08: OS << ".min.s";  break;
    case 0x09: OS << ".max.u";  break;
    case 0x0A: OS << ".min.u";  break;
    case 0x0B: OS << ".add.f";  break;
    case 0x0C: OS << ".inc.u";  break;
    case 0x0D: OS << ".dec.u";  break;
    case 0x0E: OS << ".cas.b";  break;
    }
    // Type width appended by caller
}

The L2-hinted variant (sub_21E6420) follows identical dispatch logic but emits .op.L2::cache_hint.type instead of .op.type.

Memory Barriers -- sub_21E94F0

| Value | Instruction | Scope |
|---|---|---|
| 0 | membar.gpu | Device |
| 1 | membar.cta | Block |
| 2 | membar.sys | System |
| 4 | fence.sc.cluster | Cluster (SM 90+) |
| 3 | -- | Fatal: "Bad membar op" |

Cluster Barriers -- sub_21E8EA0 (SM 90+)

Encoding: bits[3:0] = operation (0=arrive, 1=wait), bits[7:4] = ordering (0=default, 1=relaxed).

| Instruction | Meaning |
|---|---|
| barrier.cluster.arrive | Signal arrival |
| barrier.cluster.arrive.relaxed | Relaxed-memory arrival |
| barrier.cluster.wait | Wait for all CTAs |
| barrier.cluster.wait.relaxed | Relaxed-memory wait |

GenericToNVVM -- sub_215DC20 / sub_215E100

Pass Registration

| Field | Value |
|---|---|
| Pass name | "generic-to-nvvm" |
| Description | "Ensure that the global variables are in the global address space" |
| Pass ID | unk_4FD155C |
| Factory | sub_215D530 (allocates 320-byte state) |
| Disable knob | NVVMPassOptions[2200] (bool) |
| Pipeline position | After InstructionSimplify, before LoopSimplify (position ~22 in optimizer) |

Registration uses a once-init pattern guarded by dword_4FD1558. The 80-byte pass descriptor stores the description at offset 0, pass kind 64 (ModulePass) at offset 8, the name string at offset 16, its length 15 at offset 24, the pass ID pointer at offset 32, flags 0 at offset 40, and the factory function pointer at offset 72. Registration dispatches through sub_163A800 (the LLVM pass registration infrastructure).

A new-pass-manager version also exists: GenericToNVVMPass, registered at sub_305ED20 / sub_305E2C0 with CLI name "generic-to-nvvm".

Algorithm -- sub_215E100 (36KB)

The pass body at sub_215E100 is 36KB because it must rewrite every address-space-dependent use of every affected global. The factory function sub_215D530 allocates a 320-byte state object containing two DenseMap-like hash tables:

| Table | Offset | Purpose | Initial Capacity |
|---|---|---|---|
| GVMap | +168 | Old GlobalVariable -> New GlobalVariable | 128 buckets, 48 bytes/bucket |
| ConstMap | +248 | Old Constant -> New Constant (for constant expressions) | 128 buckets, 48 bytes/bucket |

The algorithm proceeds in three phases:

Phase 1 -- Clone globals. Iterate over all GlobalVariable objects in the module. For each global in addrspace(0) (the LLVM generic address space):

  1. Create a new GlobalVariable in addrspace(1) (NVPTX global memory) with identical initializer, linkage, alignment, and section attributes.
  2. Store the old-to-new mapping in GVMap.

Phase 2 -- Rewrite uses. For each cloned global:

  1. Create an addrspacecast instruction from the new global (addrspace(1)*) back to the original pointer type (addrspace(0)*). This preserves type compatibility with all existing uses.
  2. Call RAUW (replaceAllUsesWith) on the original global, substituting the addrspacecast value. All instructions, constant expressions, and metadata references that pointed to the original global now point through the cast.
  3. The ConstMap table handles the tricky case of constant expressions that embed a global reference: ConstantExpr::getAddrSpaceCast, ConstantExpr::getGetElementPtr, and similar must be reconstructed with the new global. This is the bulk of the 36KB function body -- a recursive walk over the constant expression tree, rebuilding each node.

Phase 3 -- Erase originals. Iterate GVMap and erase each original global from the module. The cleanup helper sub_215D780 iterates the map, properly managing LLVM Value reference counts during deletion.

The destructor at sub_215D1A0 / sub_215CE20 frees both hash tables and all stored Value references.

// Pseudocode for GenericToNVVM::runOnModule
bool runOnModule(Module &M) {
    for (GlobalVariable &GV : M.globals()) {
        if (GV.getAddressSpace() != 0) continue;  // skip non-generic
        if (GV.isDeclaration()) continue;

        // Phase 1: Clone to addrspace(1)
        GlobalVariable *NewGV = new GlobalVariable(
            M, GV.getValueType(), GV.isConstant(),
            GV.getLinkage(), GV.getInitializer(),
            GV.getName(), /*InsertBefore=*/nullptr,
            GV.getThreadLocalMode(), /*AddressSpace=*/1);
        NewGV->copyAttributesFrom(&GV);
        GVMap[&GV] = NewGV;
    }

    for (auto &[OldGV, NewGV] : GVMap) {
        // Phase 2: addrspacecast + RAUW
        Constant *Cast = ConstantExpr::getAddrSpaceCast(NewGV,
            OldGV->getType());
        OldGV->replaceAllUsesWith(Cast);
    }

    for (auto &[OldGV, NewGV] : GVMap) {
        // Phase 3: Erase originals
        OldGV->eraseFromParent();
    }
    return !GVMap.empty();
}

Why this exists. The CUDA frontend (EDG) generates globals in addrspace(0) (LLVM's generic/default address space). The NVPTX backend requires device globals to reside in addrspace(1) (GPU global memory) for correct PTX emission. GenericToNVVM bridges this mismatch. Upstream LLVM has an equivalent NVPTXGenericToNVVM pass, but cicc's version carries the additional ConstMap machinery for handling nested constant expression trees that reference relocated globals -- a case that upstream handles differently through its GenericToNVVM + NVPTXAssignValidGlobalAddresses split.

Global Constructor Rejection -- sub_215ACD0

if (lookup("llvm.global_ctors") && type_tag == ArrayType && count != 0)
    fatal("Module has a nontrivial global ctor, which NVPTX does not support.");
if (lookup("llvm.global_dtors") && type_tag == ArrayType && count != 0)
    fatal("Module has a nontrivial global dtor, which NVPTX does not support.");

GPU kernels have no "program startup" phase -- no __crt_init equivalent. Static initialization with non-trivial constructors is incompatible with the GPU execution model.

Global Variable Emission -- sub_2156420

Overview

The function sub_2156420 (20KB, printModuleLevelGV) handles PTX emission for individual global variables. It processes each global in the module, categorizing it by type (texture reference, surface reference, sampler reference, or data variable) and emitting the appropriate PTX declaration.

Skipped globals: "llvm.metadata", "llvm.*", "nvvm.*".

Global Type            PTX Output
Texture reference      .global .texref NAME;
Surface reference      .global .surfref NAME;
Sampler reference      .global .samplerref NAME = { ... }
Managed memory         .attribute(.managed)
Demoted (addrspace 3)  // NAME has been demoted (comment only)

Sampler Reference Initializer

Sampler references receive a structured initializer block with addressing mode, filter mode, and normalization settings. The emission format:

.global .samplerref my_sampler = {
    addr_mode_0 = clamp_to_edge,
    addr_mode_1 = wrap,
    addr_mode_2 = mirror,
    filter_mode = linear,
    force_unnormalized_coords = 1
};

The addressing mode values are selected from four string literals:

Value  String
0      "wrap"
1      "clamp_to_border"
2      "clamp_to_edge"
3      "mirror"

Filter mode selects between "nearest" and "linear". The force_unnormalized_coords field is emitted only when the sampler uses unnormalized texture coordinates (integer addressing).
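The emission format above can be sketched as a small self-contained function. This is illustrative only: the string tables follow the documented addressing and filter modes, but `SamplerState` and `emitSamplerRef` are hypothetical names, not symbols recovered from the binary.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Illustrative sketch of the .samplerref initializer emission in sub_2156420.
// The string tables mirror the documented mode values; all names are ours.
static const char *kAddrMode[] = {"wrap", "clamp_to_border",
                                  "clamp_to_edge", "mirror"};
static const char *kFilterMode[] = {"nearest", "linear"};

struct SamplerState {
    int addrMode[3];    // one per texture dimension, values 0-3
    int filterMode;     // 0 = nearest, 1 = linear
    bool unnormalized;  // integer (unnormalized) texture coordinates
};

std::string emitSamplerRef(const char *name, const SamplerState &s) {
    std::string out = ".global .samplerref " + std::string(name) + " = {\n";
    char line[64];
    for (int i = 0; i < 3; ++i) {
        std::snprintf(line, sizeof line, "    addr_mode_%d = %s,\n",
                      i, kAddrMode[s.addrMode[i]]);
        out += line;
    }
    out += std::string("    filter_mode = ") + kFilterMode[s.filterMode];
    // force_unnormalized_coords is emitted only for unnormalized samplers.
    if (s.unnormalized)
        out += ",\n    force_unnormalized_coords = 1";
    out += "\n};\n";
    return out;
}
```

For a sampler using clamp-to-edge on dimension 0, linear filtering, and unnormalized coordinates, this reproduces the initializer block shown above.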

Address Space Qualifiers

sub_214FA80 maps NVPTX address space numbers to PTX qualifier strings (0=no qualifier, 1=.global, 3=.shared, 4=.const, 5+=.local). See Address Spaces for the complete mapping including tensor memory, shared cluster, and param spaces.
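A minimal sketch of that mapping, assuming only the qualifier strings documented here (address space 2 is not listed and is treated as unqualified):

```cpp
#include <cassert>
#include <cstring>

// Sketch of the address-space-to-qualifier mapping described for sub_214FA80:
// 0 = generic (no qualifier), 1 = .global, 3 = .shared, 4 = .const,
// 5+ = .local. The function name is ours; this is not recovered code.
const char *ptxAddrSpaceQualifier(unsigned as) {
    switch (as) {
    case 0:  return "";         // generic address space: no qualifier
    case 1:  return ".global";
    case 3:  return ".shared";
    case 4:  return ".const";
    default: return as >= 5 ? ".local" : "";  // 2 is not documented here
    }
}
```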

Additional attributes emitted by sub_214FEE0:

  • .attribute(.managed) for CUDA managed memory globals
  • .attribute(.unified) or .attribute(.unified(N)) for unified addressing

Data Type Emission

For aggregate or large types, the emitter uses .b8 NAME[SIZE] (byte array). For pointer types with initializers, it selects .u32 or .u64 arrays depending on the pointer width flag at *(a1+232)->field_936. Simple scalar types use the type from sub_214FBF0 (.u32, .u64, .f32, .f64, etc.).

Invalid Address Space Detection

If a global has an initializer in an address space that does not support static initialization:

fatal("initial value of 'NAME' is not allowed in addrspace(N)");

This diagnostic is emitted via sub_1C3F040.

Global Variable Ordering -- sub_2157D50 (Topological Sort)

Problem

Global variables with initializers can reference other globals. If global A's initializer contains a reference to global B, then B must be emitted before A in the PTX output. Circular dependencies are illegal and must be detected.

Algorithm -- DFS Topological Sort

sub_2157D50 (5.9KB) implements a depth-first topological sort over the global use-def chains. The algorithm:

  1. Build dependency graph. For each global variable in the emission set, walk its initializer constant expression tree. Every GlobalVariable reference found in the initializer creates a directed edge from the referencing global to the referenced global.

  2. DFS with three-color marking. Each global is in one of three states:

    • White (unvisited): not yet processed.
    • Gray (in progress): currently on the DFS stack -- its subtree is being explored.
    • Black (finished): all dependents have been emitted.
  3. Visit procedure. For each white global, mark it gray and recurse into its dependencies. When all dependencies return, mark it black and push it onto the output ordering (post-order).

  4. Cycle detection. If the DFS encounters a gray node, a back-edge has been found, which means a circular dependency. The pass emits the fatal diagnostic:

"Circular dependency found in global variable set"

This is a hard error -- cicc cannot emit globals with mutual references. The PTX format requires a linear declaration order, and there is no forward-declaration mechanism for global variable initializers.

Pseudocode

// sub_2157D50 — topological sort of globals for PTX emission
void orderGlobals(SmallVectorImpl<GlobalVariable *> &Ordered,
                  ArrayRef<GlobalVariable *> Globals) {
    enum Color { White, Gray, Black };
    DenseMap<GlobalVariable *, Color> color;

    for (GlobalVariable *GV : Globals)
        color[GV] = White;

    std::function<void(GlobalVariable *)> visit =
        [&](GlobalVariable *GV) {
        if (color[GV] == Black) return;
        if (color[GV] == Gray)
            fatal("Circular dependency found in global variable set");
        color[GV] = Gray;

        // Walk the initializer (if any) for GlobalVariable references.
        // hasInitializer() guards the call: getInitializer() asserts
        // when invoked on a declaration.
        if (GV->hasInitializer())
            for (GlobalVariable *Dep : globalsReferencedBy(GV->getInitializer()))
                if (color.count(Dep))
                    visit(Dep);

        color[GV] = Black;
        Ordered.push_back(GV);
    };

    for (GlobalVariable *GV : Globals)
        if (color[GV] == White)
            visit(GV);
}
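The same three-color DFS can be exercised standalone. The sketch below swaps `GlobalVariable` pointers for string IDs and reports the cycle via an exception; the structure (white/gray/black marking, post-order output, gray-hit fatal) matches the pseudocode above, while all names are ours.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Self-contained three-color DFS topological sort. Each key maps a "global"
// to the globals its initializer references.
using DepGraph = std::map<std::string, std::vector<std::string>>;

std::vector<std::string> orderGlobals(const DepGraph &deps) {
    enum Color { White, Gray, Black };
    std::map<std::string, Color> color;
    for (auto &kv : deps) color[kv.first] = White;

    std::vector<std::string> ordered;
    std::function<void(const std::string &)> visit =
        [&](const std::string &gv) {
        if (color[gv] == Black) return;      // already emitted
        if (color[gv] == Gray)               // back-edge: cycle
            throw std::runtime_error(
                "Circular dependency found in global variable set");
        color[gv] = Gray;
        auto it = deps.find(gv);
        if (it != deps.end())
            for (auto &dep : it->second)
                if (color.count(dep)) visit(dep);
        color[gv] = Black;
        ordered.push_back(gv);               // post-order: dependencies first
    };
    for (auto &kv : deps)
        if (color[kv.first] == White) visit(kv.first);
    return ordered;
}
```

Given A whose initializer references B, the output places B before A; a mutual A/B reference trips the gray-node check.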

Interaction with Sampler References

Sampler reference globals can have structured initializers that reference other sampler state. These initializers are walked by the same DFS traversal. The topological sort ensures that any sampler whose initializer references another sampler or texture object appears after its dependencies in the PTX output.

Call Context

sub_2157D50 is called from the module-level emission entry (sub_215ACD0 -> sub_214F370) after all globals have been collected but before any global PTX text is written. The ordered list is then iterated by sub_2156420 to emit each global in dependency order.

Output Mode Selection

Compilation output mode is controlled by a bitmask in the a13 mode flags parameter, passed through the pipeline from the CLI flag parser (sub_95C880). The low bits encode the output format, while bits 8--9 encode the address width (32/64-bit).

Mode Flag Bitmask

Bits      Value  Mode           Description
[2:0]     0x07   Phase control  Default = 7 (all phases: lnk + opt + llc)
[4]       0x10   Debug          Debug compile or line-info enabled
[5]       0x20   LTO gen        LTO generation enabled
combined  0x21   gen-lto        Generate LTO bitcode for later linking
combined  0x23   full LTO       Complete LTO compilation (lnk + opt + lto)
combined  0x26   link-lto       Link-time LTO phase (consume LTO bitcode)
combined  0x43   OptiX IR       Emit .optixir format
[7]       0x80   gen-opt-lto    Lowering flag for LTO
[8]       0x100  nvvm-64        64-bit pointer mode
[9]       0x200  nvvm-32        32-bit pointer mode
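A minimal decoder for the single-bit fields of this bitmask might look like the following. The struct and field names are our own; only the bit positions come from the table (combined values such as 0x21 or 0x43 are unions of these fields plus the phase bits).

```cpp
#include <cassert>

// Illustrative decoder for the a13 mode-flags bitmask. Field names are ours;
// bit positions follow the documented layout.
struct ModeFlags {
    unsigned phases;  // bits [2:0], default 7 = lnk + opt + llc
    bool debug;       // bit 4
    bool ltoGen;      // bit 5
    bool genOptLto;   // bit 7
    bool ptr64;       // bit 8 (nvvm-64)
    bool ptr32;       // bit 9 (nvvm-32)
};

ModeFlags decodeModeFlags(unsigned a13) {
    ModeFlags f;
    f.phases    = a13 & 0x07;
    f.debug     = (a13 & 0x10)  != 0;
    f.ltoGen    = (a13 & 0x20)  != 0;
    f.genOptLto = (a13 & 0x80)  != 0;
    f.ptr64     = (a13 & 0x100) != 0;
    f.ptr32     = (a13 & 0x200) != 0;
    return f;
}
```

For example, gen-lto in 64-bit pointer mode ((a13 & 0x300) | 0x21 with bit 8 set, i.e. 0x121) decodes to phases=1, ltoGen, ptr64.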

CLI Flag to Mode Mapping

CLI Flag          Mode Bits Set             Pipeline Effect
(default)         0x07                      All phases run, PTX text output
--emit-llvm-bc    (EDG flag id=59)          Emit raw LLVM bitcode .bc after optimization
--emit-optix-ir   (a13 & 0x300) | 0x43      Disables IP-MSP and LICM, emits .optixir
-gen-lto          (a13 & 0x300) | 0x21      Generates LTO-compatible bitcode
-gen-lto-and-llc  a13 | 0x20                LTO generation plus LLC codegen
-link-lto         (a13 & 0x300) | 0x26      Consumes LTO bitcode for final compilation
-lto              (a13 & 0x300) | 0x23      Full LTO mode (all phases)
-split-compile=N  (stored at offset +1480)  Per-function compilation, F%d_B%d output naming

OptiX IR Mode

The --emit-optix-ir flag is valid only when the compilation mode is CUDA (a4 == 0xABBA) or OpenCL (a4 == 0xDEED). It forces two optimizer passes to be disabled by routing "-do-ip-msp=0" and "-do-licm=0" to the opt phase. The output is an .optixir file containing NVVM IR in a format consumable by the OptiX ray-tracing runtime for JIT compilation. See OptiX IR for the full format details.

Split Compilation

The -split-compile=N flag (stored at options offset +1480, with a sentinel at +1488 to detect double-definition) enables per-function or per-block compilation for large kernels. The pipeline assembler at sub_12E54A0 generates output identifiers using the "F%d_B%d" format string (function index, block index). Each split unit is compiled independently and the results are linked back together. An extended variant -split-compile-extended=N sets the additional flag at offset +1644.

When split-compile is active, the optimization level is set to negative (typically -1), triggering special handling in sub_12E1EF0: each compiled function's bitcode is re-read via sub_153BF40, validated against the "<split-module>" identifier, and linked back through sub_12F5610 with linkage attributes restored from a hash table.
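The "F%d_B%d" identifier scheme is a plain printf-style format taking the function index first and the block index second. A trivial sketch (the helper name is ours):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Reproduces the split-compile output naming used by the pipeline assembler
// (sub_12E54A0): "F%d_B%d" with function index, then block index.
std::string splitUnitName(int funcIdx, int blockIdx) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "F%d_B%d", funcIdx, blockIdx);
    return buf;
}
```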

LTO Modes

Three LTO modes interact with emission:

  1. gen-lto (0x21): Runs optimization but skips LLC. Output is optimized LLVM bitcode suitable for later link-time optimization. The -gen-lto string is forwarded to the LTO phase.

  2. link-lto (0x26): Consumes bitcode produced by gen-lto. Runs the LTO linker and optimizer, then proceeds to LLC for final codegen. The -link-lto string is forwarded.

  3. full LTO (0x23): Single-invocation LTO that runs all phases including linking and codegen.

Bitcode Producer ID

The bitcode writer at sub_1538EC0 (58KB, writeModule) stamps "LLVM7.0.1" as the producer identification string in the IDENTIFICATION_BLOCK of every output bitcode file. This is despite cicc being built on LLVM 20.0.0 internally.

Dual-Constructor Mechanism

Two separate global constructors manage producer version strings, both reading the same environment variable but with different defaults:

Constructor  Address   Default   Stored At          Purpose
ctor_036     0x48CC90  "20.0.0"  qword_4F837E0      True LLVM version (internal use)
ctor_154     0x4CE640  "7.0.1"   (separate global)  NVVM IR compatibility marker

Both constructors execute this logic:

char *result = getenv("LLVM_OVERRIDE_PRODUCER");
if (!result) result = default_string;  // "20.0.0" or "7.0.1"
producer_global = result;

The bitcode writer uses the ctor_154 value, producing "LLVM" + "7.0.1" = "LLVM7.0.1" in the output. Setting LLVM_OVERRIDE_PRODUCER in the environment overrides both constructors to the same value.
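The constructor logic is the classic getenv-with-default pattern; a standalone sketch (with a function name of our choosing) behaves identically:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// getenv-with-default, as executed by both ctor_036 and ctor_154: one shared
// environment variable, two different fallback literals.
std::string producerString(const char *fallback) {
    const char *v = std::getenv("LLVM_OVERRIDE_PRODUCER");
    return v ? v : fallback;
}
```

The bitcode writer then prepends "LLVM" to the ctor_154 value, yielding "LLVM7.0.1" unless the override variable is set.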

Why "LLVM7.0.1"

The "LLVM7.0.1" string is the NVVM IR compatibility marker. It signals that the bitcode format conforms to the NVVM IR specification originally based on LLVM 7.0.1's bitcode structure. Even though cicc's internal passes operate at LLVM 20.0.0 capability, the output bitcode format (record encoding, metadata layout, type table) is constrained to be readable by older NVVM toolchain components (libNVVM, nvdisasm, Nsight) that expect LLVM 7.x-era bitcode. The writer achieves this by:

  1. Using the IDENTIFICATION_BLOCK producer string to declare compatibility.
  2. Constraining the MODULE_BLOCK record types to the LLVM 7.x repertoire.
  3. Enforcing nvvmir.version metadata with major == 3, minor <= 2.

The disable-bitcode-version-upgrade cl::opt (registered in ctor_036) controls whether the bitcode reader accepts version mismatches during ingestion.

NVVM_IR_VER_CHK=0 bypasses the NVVM IR version validation at sub_157E370 and sub_12BFF60, which normally enforces major == 3, minor <= 2 and fatals with "Broken module found, compilation aborted!" on mismatch.

Address Space Operations -- sub_21E7FE0

Multi-purpose helper for cvta, MMA operands, and address space qualifiers:

Query        Values                                      Output
"addsp"      0=generic, 1=.global, 3=.shared, 4+=.local  cvta address space suffix
"ab"         0="a", 1="b"                                cvta direction
"rowcol"     0="row", 1="col"                            MMA layout
"mmarowcol"  0--3                                        "row.row"/"row.col"/"col.row"/"col.col"
"satf"       0=(none), 1=".satfinite"                    MMA saturation
"abtype"     0--6                                        "u8"/"s8"/"u4"/"s4"/"b1"/"bf16"/"tf32"
"trans"      0=(none), 1=".trans"                        WGMMA transpose

Architecture-Gated Features

Feature                         Min Architecture  Evidence
Basic atomics (all 13 ops)      SM 20+ (all)      sub_21E5E70, no arch check
Atomic scopes (.cta/.sys)       SM 60+ (Pascal)   Scope bits in operand
L2 cache-hinted atomics         SM 80+ (Ampere)   sub_21E6420 separate function
membar.cta/gpu/sys              SM 20+ (all)      sub_21E94F0, no arch check
fence.sc.cluster                SM 90+ (Hopper)   Opcode 4 in membar handler
barrier.cluster.arrive/wait     SM 90+ (Hopper)   sub_21E8EA0 entire function
Cluster special registers (15)  SM 90+ (Hopper)   sub_21E9060 entire function
MMA row/col layout              SM 70+ (Volta)    mmarowcol in sub_21E7FE0
MMA abtype: bf16/tf32           SM 80+ (Ampere)   Ampere-class MMA formats
.trans modifier (WGMMA)         SM 90+ (Hopper)   WGMMA transpose

Key Global Variables

Variable       Purpose
byte_4FD17C0   Pass configuration flag
byte_4FD16E0   ISel dump enable
byte_4FD2160   Extra ISel pass enable
dword_4FD26A0  Scheduling mode (1=simple, else=full pipeline)
unk_4FD155C    GenericToNVVM pass ID
dword_4FD1558  GenericToNVVM once-init guard
qword_4F837E0  True LLVM producer version ("20.0.0")

ptxas Interaction

The PTX text emitted by cicc is not executed directly -- it is consumed by ptxas, which parses the PTX back into an internal IR, applies its own optimization and scheduling passes (195+ knobs), performs hardware register allocation, and emits SASS machine code. Every formatting decision in emission (register naming with %r<N> angle-bracket counts, .pragma annotations, kernel attribute placement) must conform to what ptxas's PTX parser expects. The "LLVM7.0.1" producer string exists specifically because ptxas gates certain parsing behaviors on the declared producer version. Emission quality directly affects ptxas optimization scope: cleaner PTX with fewer redundant moves gives ptxas more freedom to schedule and allocate efficiently.

Debug Info Pipeline

Debug information in cicc follows a four-stage lifecycle: generation in the EDG/IR-generation frontend, preservation and selective stripping in the optimizer, verification after each pass, and emission as .loc/.file directives in the PTX backend. This page traces the full journey of debug metadata from CUDA source to PTX output, covering the three compilation modes (-g, -generate-line-info, neither), the five stripping passes, the NVIDIA-custom verification infrastructure, and the backend emission format with its non-standard inlined-at extension. Understanding this flow is essential for anyone reimplementing cicc's debug info contract, because the NVPTX target's debug model is fundamentally different from x86 DWARF: PTX is a virtual ISA with no physical registers, no real stack, and no fixed instruction encoding, so the debug metadata cicc emits is consumed by ptxas rather than directly by a debugger.

Debug info generation      sub_9433F0 (per-parameter), sub_943430 (per-global), sub_941230 (source location)
Debug version module flag  sub_915400 -- emits "Debug Info Version" = 3
Flag filter                sub_12C6910 -- checks -debug-compile, -g, -generate-line-info
Verification pass          sub_29C8000 (12,480B, 434 BBs) -- runs after each optimization pass
Per-instruction verifier   sub_29C3AB0 (5,592B)
Debugify injector          sub_29C1CB0
Stripping passes           #110--#114 in the pipeline parser
.loc emission              sub_31D55F0 (per-instruction), sub_31E4280 (function-scope .file/.loc)
DWARF section emission     sub_399B1E0 (29KB, DwarfDebug::beginModule)
NVVM container field       DebugInfo at container offset +12 (enum: NONE/LINE_INFO/DWARF)
cl::opt registration       ctor_043 at 0x48D7F0 -- debug-compile, generate-line-info, line-info-inlined-at

Three Compilation Modes

cicc supports three debug info levels. The mode is selected at the CLI layer and propagated through the flag dispatch table into both the optimizer and the backend. The flag filter function sub_12C6910 reads the CLI flags and routes them to the appropriate pipeline stages.

CLI flag             Offset  Routing                                NVVM container DebugInfo       DICompileUnit emission kind
-g                   +296    -debug-compile to LNK and OPT stages   NVVM_DEBUG_INFO_DWARF (2)      FullDebug
-generate-line-info  +328    -generate-line-info to OPT stage only  NVVM_DEBUG_INFO_LINE_INFO (1)  LineTablesOnly
(neither)            --      --                                     NVVM_DEBUG_INFO_NONE (0)       NoDebug

The distinction between -g and -generate-line-info is critical and non-obvious:

  • -g routes as -debug-compile to both the linker (LNK) and optimizer (OPT) stages. The linker stage needs the flag because libdevice linking must preserve debug info from the user module when merging with the stripped libdevice bitcode. The optimizer preserves all metadata: DICompileUnit, DISubprogram, DILocalVariable, DIType, scope chains, dbg.value()/dbg.declare() intrinsics -- everything. The backend emits complete DWARF sections. cuda-gdb can step through source, inspect variables, and reconstruct inlined call stacks.

  • -generate-line-info routes only to the OPT stage (not the linker). Early in the optimizer, StripNonLineTableDebugInfoPass strips all metadata except DILocation / DISubprogram / DICompileUnit with LineTablesOnly emission kind. This is enough for profiler source correlation (Nsight Compute maps .loc directives back to source lines) but not enough for variable inspection or source-level debugging in cuda-gdb.

  • Neither flag: no debug metadata is generated. The IR-generation frontend skips all debug calls (the dword_4D046B4 / [ctx+0x170] guards prevent emission), and the module has no llvm.dbg.cu named metadata. The verification pass detects this in Phase 1 and returns immediately.
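The flag-to-level mapping can be condensed into a sketch. The enum values mirror the NvvmDebugInfo enum from the NVVM container format; the precedence of -g over -generate-line-info when both are passed is our assumption, not a recovered behavior.

```cpp
#include <cassert>

// Enum values as documented for the NVVM container DebugInfo field.
enum NvvmDebugInfo {
    NVVM_DEBUG_INFO_NONE      = 0,  // neither flag
    NVVM_DEBUG_INFO_LINE_INFO = 1,  // -generate-line-info
    NVVM_DEBUG_INFO_DWARF     = 2   // -g
};

// Hypothetical mapping; assumes -g dominates when both flags are given.
NvvmDebugInfo debugLevel(bool g, bool generateLineInfo) {
    if (g) return NVVM_DEBUG_INFO_DWARF;
    if (generateLineInfo) return NVVM_DEBUG_INFO_LINE_INFO;
    return NVVM_DEBUG_INFO_NONE;
}
```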

Stage 1: Frontend Debug Metadata Generation

EDG IL-to-IR Layer

The IR generation frontend creates debug metadata when the debug info flag is active. Two independent guards control this:

  • dword_4D046B4: a global flag checked at parameter and statement codegen entry points. When set, the function prolog emitter (sub_938240 / Path B equivalent) calls sub_9433F0 to emit DILocalVariable metadata for each parameter, and the statement emitter (sub_9363D0) calls sub_941230 to set the IR builder's debug location from the EDG source position.

  • [ctx+0x170]: a pointer to the DICompileUnit object in the codegen context. When non-null, the global variable emitter (sub_916430 and friends) calls sub_943430 to attach debug metadata to each GlobalVariable, and the module finalizer (sub_915400) emits the "Debug Info Version" module flag with value 3.

The metadata hierarchy created during IR generation:

DICompileUnit
  [ctx+0x170], emission kind: FullDebug or LineTablesOnly
  ├── DIFile (per source file)
  ├── DISubprogram (per __global__ / __device__ function)
  │     ├── DILocalVariable (per parameter, via sub_9433F0)
  │     │     arg: 1-based index from v10 in the parameter iteration loop
  │     │     scope: parent DISubprogram
  │     │     file, line, type: from EDG declaration node
  │     ├── DILocalVariable (per auto variable, via statement codegen)
  │     └── DILocation (per instruction, via sub_941230)
  │           line, column: from EDG source position
  │           scope: nearest enclosing DILexicalBlock or DISubprogram
  └── DIGlobalVariable (per device-side global, via sub_943430)
        [gv+0xAD] < 0 indicates debug info present on the GlobalVariable

The module finalizer sub_915400 runs after all globals and functions have been code-generated. Its debug-relevant actions:

  1. Calls sub_9151E0 to emit nvvmir.version metadata. When [ctx+0x170] is non-null, the version tuple has 4 operands instead of 2, including address-space-qualified indices.
  2. Calls sub_914410 to emit nvvm.annotations metadata.
  3. If [ctx+0x170] != 0: calls sub_BA93D0 (Module::addModuleFlag) with ("Debug Info Version", 3). This module flag is mandatory -- without it, LLVM's DWARF backend refuses to emit debug sections.

DIBuilder Infrastructure

The actual metadata node creation uses LLVM's DIBuilder infrastructure at 0xAD0000--0xAF0000 (Zone 2 of the type system module). This includes DIBasicType / DIDerivedType / DICompositeType uniquing, scope chain construction, and the standard LLVM !dbg attachment API. cicc uses the standard LLVM DIBuilder without modifications -- the NVIDIA-specific aspects are in the calling patterns (which EDG nodes map to which DI metadata), not in the metadata creation API itself.

Stage 2: Optimizer Preservation and Stripping

The StripNonLineTableDebugInfoPass

When -generate-line-info is active (but not -g), the optimizer runs StripNonLineTableDebugInfoPass ("strip-nonlinetable-debuginfo", pipeline parser slot #114) early in the pipeline. This pass:

  1. Strips all DILocalVariable and DIGlobalVariable metadata
  2. Removes all dbg.value() and dbg.declare() intrinsics
  3. Strips DIType nodes, imported entities, and retained nodes
  4. Downgrades DICompileUnit emission kind from FullDebug to LineTablesOnly
  5. Preserves DISubprogram, DILocation, DIFile, and DICompileUnit (the minimum needed for .loc directives)

After this pass, the module has enough metadata for line-table-based profiling but not for source-level debugging.

The Five Stripping Passes

cicc registers five debug stripping passes in the pipeline parser, all standard LLVM passes:

Pipeline name                   Slot  LLVM pass class                 What it strips                         What survives
"strip-dead-debug-info"         #110  StripDeadDebugInfoPass          Debug info for dead functions/globals  Everything for live code
"strip-debug-declare"           #112  StripDebugDeclarePass           dbg.declare() intrinsics only          dbg.value(), all metadata
"strip-nondebug"                #113  StripNonDebugSymbolsPass        Non-debug symbols                      All debug metadata
"strip-nonlinetable-debuginfo"  #114  StripNonLineTableDebugInfoPass  Everything except line tables          DILocation, DISubprogram, DIFile
(core stripping at 0xAE0000)    --    stripDebugInfo()                All llvm.dbg.* intrinsics              Nothing

The core debug stripping implementation at 0xAE0000 (Zone 3 of the type system module) is the nuclear option -- it calls stripDebugInfo() to remove everything. The four named passes provide finer granularity.

Optimizer Pass Behavior with Debug Info

Every standard LLVM optimization pass is expected to preserve debug metadata it does not intentionally modify. In practice, some passes degrade debug info quality:

Passes that preserve debug info well:

  • InstCombine: updates dbg.value() when simplifying instructions, uses replaceAllDbgUsesWith
  • SROA: splits dbg.declare() into multiple dbg.value() fragments when decomposing allocas
  • GVN: preserves debug locations on replacement instructions
  • SimplifyCFG: maintains DILocation through block merging

Passes that commonly degrade debug info:

  • Inlining: creates new DISubprogram for inlined functions, must maintain inlined-at chains. Failure to do so triggers the verifier's "did not generate DISubprogram" diagnostic.
  • LoopUnroll: duplicates instructions without always duplicating DILocation scope context
  • LICM: moves instructions out of loops, potentially detaching them from their original scope
  • Dead code elimination: removes instructions along with their dbg.value() references
  • Tail merging / BranchFolding: merges basic blocks from different source scopes

The verification pass (sub_29C8000) runs after each optimization pass and tracks exactly which passes degrade debug info. When the debugify-each knob is active, the full Debugify-then-CheckDebugify cycle runs around every pass, injecting synthetic debug metadata before the pass and verifying it survived afterward.

Stage 3: Debug Info Verification

The verification pass sub_29C8000 is documented in detail on the Debug Info Verification page. Here we summarize its role in the pipeline.

Pipeline Integration Protocol

The pipeline runner invokes the verifier as a sandwich around each optimization pass:

// Pseudocode for the verification protocol
snapshot_debug_metadata(M);          // Phase 2 of sub_29C8000: 8 hash tables
run_optimization_pass(M, "instcombine");
sub_29C8000(M, errs(), dbgCU, hashMap, "instcombine", 11, file, fileLen, jsonOut);
// Returns: true = PASS, false = FAIL (debug info degraded)

The pass name argument lets the JSON report attribute degradation to the specific pass responsible. The eight-table metadata snapshot captures DISubprogram, DIScope, DIGlobalVariable, DILocalVariable, DIType, DIImportedEntity, DILabel, and retained nodes -- far more comprehensive than upstream LLVM's CheckDebugInfoPass, which only tracks subprograms and debug variable intrinsics.

Verification Modes

Three modes of debug verification exist, controlled by LLVM knobs:

Mode       Knob                             What runs
Standard   verify-each or verify-after-all  sub_29C8000 after every pass
Debugify   debugify-each                    sub_29C1CB0 (inject) + pass + sub_29C8000 (check)
Selective  verify-debuginfo-preserve        Lighter-weight preservation checking

The Debugify mode is especially powerful: it first injects synthetic debug metadata via sub_29C1CB0 (ensuring every instruction has a DILocation and every variable has dbg.value()), then runs the optimization pass, then checks whether the synthetic metadata survived. This detects passes that drop debug info even when the original module had sparse or no debug metadata.

Behavior in -generate-line-info Mode

When the module is in LineTablesOnly mode (after StripNonLineTableDebugInfoPass has run), the verifier still executes but its scope is narrower. Phase 5 (per-function debug variable checking) skips variable intrinsic validation because dbg.value()/dbg.declare() were intentionally stripped. Only Phase 6 (per-instruction DILocation verification via sub_29C3AB0) remains fully active, checking that:

  • Every instruction with a DebugLoc has a valid DILocation
  • DILocation scope chains resolve to a valid DISubprogram
  • No orphaned debug locations reference deleted subprograms
  • BB-level consistency is maintained

Stage 4: Backend Emission

The .loc Directive

The AsmPrinter emits DWARF .loc directives as inline annotations in the PTX instruction stream. The per-instruction emitter sub_31D55F0 runs after each real (non-meta) instruction when HasDebugInfo (r15+0x1E8) is set. It reads the DebugLoc attached to each MachineInstr and emits:

.loc 1 42 0
ld.param.u64 %rd1, [_Z6kernelPf_param_0];
.loc 1 43 5
mul.wide.u32 %rd2, %r1, 4;

The function-scope emitter sub_31E4280 handles .file directives that establish the file index table, and sub_31E6100 (insertDebugLocEntry) maintains a file/line-to-MCSymbol mapping for MBB boundaries used in DWARF line table construction.

The NVIDIA Inlined-At Extension

Standard LLVM .loc emits only file line column. cicc extends .loc with function_name and inlined_at attributes that encode the full inlining chain:

.loc 1 42 0, function_name _Z6kernelPf, inlined_at 2 15 3

This allows ptxas to reconstruct the complete call stack at any point in inlined code, so cuda-gdb can show the user which function was inlined and where. The implementation in the AsmPrinter:

  1. Reads the DebugLoc from the MachineInstr
  2. Walks the inlined-at chain via DebugLoc::getInlinedAt()
  3. Builds a work list (SmallVector<DebugLoc, 8>) of the full chain
  4. Emits in reverse order (outer locations before inner) so ptxas sees the outermost caller first
  5. Tracks already-emitted inlined-at locations in an InlinedAtLocs set to prevent duplicates
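The five steps above can be exercised standalone. In this sketch, `Loc` stands in for llvm::DebugLoc and its `inlinedAt` pointer for `DebugLoc::getInlinedAt()`; the chain walk, reverse (outermost-first) emission, and already-emitted dedup set match the described algorithm, while all names are ours.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Stand-in for llvm::DebugLoc: a location plus its next-outer call site.
struct Loc {
    int file, line, col;
    const Loc *inlinedAt;  // outer call site, or nullptr at the top
};

// Returns the locations to emit for one instruction, outermost caller first,
// skipping any location already emitted earlier in the function.
std::vector<const Loc *> emissionOrder(const Loc *innermost,
                                       std::set<const Loc *> &alreadyEmitted) {
    std::vector<const Loc *> chain;  // innermost -> outermost
    for (const Loc *l = innermost; l; l = l->inlinedAt)
        chain.push_back(l);
    std::vector<const Loc *> out;
    // Emit in reverse so ptxas sees the outermost caller first.
    for (auto it = chain.rbegin(); it != chain.rend(); ++it)
        if (alreadyEmitted.insert(*it).second)
            out.push_back(*it);
    return out;
}
```

A second instruction sharing the same inlined-at chain emits nothing new, which is exactly the duplicate suppression the InlinedAtLocs set provides.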

The line-info-inlined-at LLVM knob (registered at 0x48D7F0, cl::opt<bool>) controls whether this extension is active. The CLI flag -no-lineinfo-inlined-at disables it by setting -line-info-inlined-at=0 on the backend command line. When disabled, only the immediate source location is emitted, losing inlining context but producing smaller PTX.

The dwarf-extended-loc Knob

The dwarf-extended-loc knob (enum: Default/Enable/Disable, registered at 0x490000 area) controls whether extended flags appear in .loc directives:

Value        Effect
Default (0)  Platform-dependent behavior
Enable (1)   Emit is_stmt, prologue_end, discriminator extensions
Disable (2)  Bare .loc file line column only

The Disable mode exists for compatibility with older ptxas versions that do not parse extended .loc flags. When enabled, the extended flags allow cuda-gdb to identify statement boundaries (is_stmt), function entry points (prologue_end), and distinguish between multiple code paths at the same source line (discriminator).

Source Interleaving

The -show-src CLI flag (flag struct offset +808, routed to the backend as -nvptx-emit-src) enables the InterleaveSrcInPtx mode. When active, the AsmPrinter reads source file lines and emits them as comments interleaved with the PTX:

// kernel.cu:42    float val = input[idx];
.loc 1 42 0
ld.global.f32 %f1, [%rd2];
// kernel.cu:43    val = val * val;
.loc 1 43 0
mul.f32 %f2, %f1, %f1;

This is purely a readability feature -- the comments are ignored by ptxas and have no effect on debug quality. The nvptx-emit-src LLVM knob description string is "Emit source line in ptx file".

.file Directive Emission

The .file directives are emitted by emitDwarfFileEntries during doFinalization (sub_3972F10, 24KB). They map source filenames to numeric file indices referenced by .loc:

.file 1 "/path/to/kernel.cu"
.file 2 "/usr/local/cuda/include/cuda_runtime.h"

The file table is built incrementally as .loc directives reference new files during instruction emission. The DWARF line section symbols are created via sub_E808D0 (createTempSymbol for DwarfLineSection) and bound via sub_E81A00 (emitDwarfLineSection).
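The incremental build of the file table can be sketched as follows; `FileTable` and `getOrAssign` are names of our choosing, and the 1-based indexing matches the .file examples above.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Sketch of the incrementally-built .file index table: the first .loc that
// references a new file assigns it the next 1-based index.
struct FileTable {
    std::map<std::string, int> index;
    std::vector<std::string> files;  // in first-reference order

    int getOrAssign(const std::string &path) {
        auto it = index.find(path);
        if (it != index.end()) return it->second;
        files.push_back(path);
        int id = (int)files.size();  // PTX .file indices start at 1
        index[path] = id;
        return id;
    }
};
```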

DWARF Section Emission

When full debug info (-g) is active, a separate DWARF emission module at 0x3990000--0x39DF000 generates complete DWARF debug sections. This is standard LLVM DWARF emission with no significant NVIDIA modifications to the section format:

AddressSizeFunction
sub_399B1E029KBDwarfDebug::beginModule() -- initializes from llvm.dbg.cu, strings: "DWARF Debug Writer", "DWARF Emission"
sub_3997B5033KB.debug_aranges emission -- address range tables
sub_399D1D012KBRange list emission (DW_RLE_base_address, DW_RLE_offset_pair, DW_RLE_start_length)
sub_399EB7012KBRegister location expressions -- strings: "no DWARF register encoding", "sub-register"
sub_39BDF6038KB.debug_names accelerator table -- bucket count, name count, augmentation string
sub_39B639033KBDWARF form size calculator -- switch on DW_FORM_* codes
sub_215ACD08.1KBModule-level emission entry (NVPTX Debug Info Emission)

The module-level entry sub_215ACD0 checks *(a1+240)->field_344 to determine if DWARF is enabled, then looks up the "NVPTX DWARF Debug Writer" / "NVPTX Debug Info Emission" pass info. The NVPTX backend does not emit physical register locations -- GPUs have no DWARF register numbering scheme that maps to hardware. Instead, it emits virtual register references that ptxas resolves through SASS-level debug info.

The DWARF string/enum tables at 0xE00000--0xE0FFFF (tag-to-string conversion, attribute-to-string, operation encoding) are stock LLVM 20 BinaryFormat/Dwarf.cpp utilities with no visible NVIDIA modifications.

.target Debug Suffix

The header emission function sub_214F370 appends , debug to the .target directive when MCAsmInfo::doesSupportDebugInformation() returns true:

.target sm_90, texmode_independent, debug

This suffix tells ptxas that the PTX contains debug information and should be processed accordingly. Without it, ptxas ignores .loc and .file directives.

NvvmDebugVersion

The NVVM container format includes a debug version field at header bytes 0x08--0x09:

| Offset | Size | Field |
|---|---|---|
| 0x08 | 1 byte | NvvmDebugVersion.Major |
| 0x09 | 1 byte | NvvmDebugVersion.Minor |

Current version: Major=3, Minor<=2. The version check logic in sub_CD41B0:

  • Major must equal 3 (hard fail on mismatch: "not compatible" error, returns NULL)
  • Minor > 2: warning printed, parse continues
  • If absent: default {3, 2} is assumed

This version tracks the debug metadata schema independently of the NVVM IR version (NvvmIRVersion at 0x06--0x07, current Major=2, Minor<=0x62). The separation allows debug format evolution without breaking IR compatibility -- NVIDIA can add new debug metadata fields (e.g., for new SM features) without requiring a full IR version bump.
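The version check in sub_CD41B0 can be modeled as follows (illustrative Python sketch; the function and variable names are assumptions, not recovered symbols -- only the Major==3 hard fail, Minor>2 warning, and {3, 2} default are observed behavior):

```python
def check_nvvm_debug_version(header: bytes):
    """Model of the NvvmDebugVersion check (names assumed, behavior per sub_CD41B0)."""
    # Header bytes 0x08-0x09 hold NvvmDebugVersion.{Major,Minor};
    # if the field is absent, the default {3, 2} is assumed.
    if len(header) < 0x0A:
        major, minor = 3, 2
    else:
        major, minor = header[0x08], header[0x09]
    if major != 3:
        # Hard fail: "not compatible" error, parse returns NULL
        return None, f"debug version {major}.{minor} not compatible"
    # Minor > 2: warning only, parse continues
    warning = f"debug version {major}.{minor} newer than supported 3.2" if minor > 2 else None
    return (major, minor), warning
```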

The container's DebugInfo field (at deserialized struct offset +12) also encodes the debug level as an enum that must be consistent with the module metadata:

enum NvvmDebugInfo {
    NVVM_DEBUG_INFO_NONE      = 0,  // no debug info
    NVVM_DEBUG_INFO_LINE_INFO = 1,  // -generate-line-info
    NVVM_DEBUG_INFO_DWARF     = 2   // -g
};

The standalone pipeline validates this at IR intake: if debug_info_present AND debug_mode_flag AND NOT debug_version_validated, the function returns error code 3 (incompatible).

Debug Records Format

cicc v13.0 inherits LLVM 20's support for the new debug records format (DbgRecord) as an alternative to the traditional dbg.value() / dbg.declare() intrinsics. Three knobs control this:

| Knob | Type | Default | Effect |
|---|---|---|---|
| write-experimental-debuginfo | bool | true | Write debug info in new non-intrinsic format |
| write-experimental-debuginfo-iterators-to-bitcode | bool | true | Serialize debug records to bitcode |
| preserve-input-debuginfo-format | boolOrDefault | false | When true, preserve whatever format the input uses |

The write-experimental-debuginfo default of true means cicc v13.0 uses the new DbgRecord format internally by default. This is an LLVM 20 feature where debug info is stored as DbgVariableRecord and DbgLabelRecord objects attached directly to instructions rather than as separate dbg.value() intrinsic calls. The format change is transparent to the optimizer and backend -- the verification pass and AsmPrinter handle both formats identically.
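For illustration, the same variable assignment in the two formats (a sketch -- metadata node numbering is arbitrary, and the exact printed form follows LLVM's textual IR conventions):

```llvm
; Intrinsic form (dbg.value call interleaved with instructions):
%x = add i32 %a, 1
call void @llvm.dbg.value(metadata i32 %x, metadata !10, metadata !DIExpression()), !dbg !20

; DbgRecord form (write-experimental-debuginfo=true):
; the record is attached to the preceding instruction, not a call
%x = add i32 %a, 1
  #dbg_value(i32 %x, !10, !DIExpression(), !20)
```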

End-to-End Flow Diagram

CUDA Source (.cu / .cup)
    │
    ▼
EDG 6.6 Frontend (IL tree)
    │  dword_4D046B4 / [ctx+0x170] guards debug emission
    │  sub_9433F0: per-parameter DILocalVariable
    │  sub_943430: per-global DIGlobalVariable
    │  sub_941230: per-instruction DILocation
    │  sub_915400: "Debug Info Version" = 3 module flag
    ▼
LLVM Module with debug metadata
    │  llvm.dbg.cu → DICompileUnit → DISubprogram → ...
    │
    ├─ If -generate-line-info:
    │    StripNonLineTableDebugInfoPass (#114)
    │    strips variables, types, scopes; keeps DILocation/DISubprogram
    │
    ▼
LLVM Optimizer (sub_12E54A0)
    │  ┌─────────────────────────────────────────────┐
    │  │  For each pass:                              │
    │  │    snapshot = sub_29C8000 Phase 2 (8 tables) │
    │  │    run_pass(M);                              │
    │  │    sub_29C8000(M, ..., passName, ...);       │
    │  │    if FAIL: JSON report + diagnostic         │
    │  └─────────────────────────────────────────────┘
    ▼
Optimized LLVM Module
    │
    ▼
NVPTX Backend (SelectionDAG → MachineInstr)
    │  DebugLoc attached to each MachineInstr
    │
    ▼
AsmPrinter (sub_31EC4F0)
    │  sub_31D55F0: per-instruction .loc emission
    │  sub_31E4280: .file/.loc at function scope
    │  inlined-at chain walking → function_name, inlined_at extensions
    │  InterleaveSrcInPtx: source line comments
    │
    ├─ If -g:
    │    sub_399B1E0: DwarfDebug::beginModule()
    │    sub_3997B50: .debug_aranges
    │    sub_39BDF60: .debug_names
    │
    ▼
PTX Output
    .target sm_90, texmode_independent, debug
    .file 1 "kernel.cu"
    .loc 1 42 0, function_name _Z6kernelPf
    ld.param.u64 %rd1, [_Z6kernelPf_param_0];

Knobs Reference

| Knob | Type | Default | Scope | Effect |
|---|---|---|---|---|
| -g / -debug-compile | bool | off | CLI | Full debug compilation (FullDebug emission) |
| -generate-line-info | bool | off | CLI | Line tables only (LineTablesOnly emission) |
| -no-lineinfo-inlined-at | bool | off | CLI | Disable inlined-at tracking (sets -line-info-inlined-at=0) |
| -show-src / -nvptx-emit-src | bool | off | CLI | Interleave source lines as PTX comments |
| dwarf-extended-loc | enum | Default | LLVM | Default/Enable/Disable extended .loc flags |
| dwarf-version | unsigned | (platform) | LLVM | DWARF version for debug sections |
| line-info-inlined-at | bool | true | LLVM | Emit inlined-at chains in .loc directives |
| debugify-each | bool | off | LLVM | Debugify + CheckDebugify around every pass |
| debugify-level | enum | location+variables | LLVM | locations or location+variables |
| debugify-quiet | bool | off | LLVM | Suppress debugify diagnostics |
| debugify-func-limit | int | unlimited | LLVM | Max functions to debugify |
| debugify-export | string | -- | LLVM | Export debugify results to file |
| verify-each | bool | off | LLVM | Run IR verifier after every pass |
| verify-debuginfo-preserve | bool | off | LLVM | Enable debug info preservation checking |
| no-inline-line-tables | bool | off | LLVM | Prevent inlining from merging line tables |
| write-experimental-debuginfo | bool | true | LLVM | Use DbgRecord format instead of intrinsics |
| preserve-input-debuginfo-format | boolOrDefault | false | LLVM | Preserve input debug info format as-is |
| NvvmDebugVersion | {u8,u8} | {3,2} | Container | Debug metadata schema version |
| qword_5008FC8 | bool | off | Global | Verbose diagnostic output enable |
| qword_5008C88 | int32 | >0 | Global | Metadata depth threshold (<=0 skips deep scope walk) |

NVIDIA Modifications vs Stock LLVM

  1. Inlined-at .loc extension. Upstream LLVM's NVPTX AsmPrinter emits standard .loc file line column. cicc appends function_name and inlined_at attributes that encode the full inlining chain for cuda-gdb call stack reconstruction.

  2. Eight-table verification. Upstream CheckDebugInfoPass tracks DISubprogram and debug variable intrinsics. NVIDIA's sub_29C8000 maintains eight separate hash tables covering subprograms, scopes, global variables, local variables, types, imported entities, labels, and retained nodes.

  3. JSON structured reporting. NVIDIA added a YAML/JSON serializer to the verification pass that produces machine-parseable bug reports with per-pass attribution -- no upstream equivalent.

  4. Metadata reconstruction. After verification, NVIDIA's pass reconstructs the module's metadata tables from verified versions (Phase 8), effectively serving as a "repair" pass that normalizes metadata after corruption.

  5. Container debug versioning. The NvvmDebugVersion field in the NVVM container header tracks the debug metadata schema independently of the IR version -- a concept that does not exist in upstream LLVM.

  6. Three-level debug info enum. The NVVM_DEBUG_INFO_NONE / LINE_INFO / DWARF enum in the container provides a compile-unit-level debug mode indicator that ptxas and libNVVM can check without parsing the full module metadata.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| Emit DILocalVariable for function parameter | sub_9433F0 | -- | -- |
| Emit debug info for GlobalVariable (conditional on [ctx+0x170]) | sub_943430 | -- | -- |
| Set IR builder DebugLoc from EDG source position | sub_941230 | -- | -- |
| Module finalizer: emit "Debug Info Version" = 3 module flag | sub_915400 | 133B | -- |
| Flag filter: checks -debug-compile, -g, -generate-line-info | sub_12C6910 | -- | -- |
| Debug info verification pass (main entry) | sub_29C8000 | 12,480B | -- |
| Per-instruction DILocation verifier | sub_29C3AB0 | 5,592B | -- |
| Debugify synthetic debug info injector | sub_29C1CB0 | -- | -- |
| NewPMCheckDebugifyPass wrapper | sub_22702B0 | -- | -- |
| NewPMDebugifyPass wrapper | sub_2270390 | -- | -- |
| Per-instruction .loc emission | sub_31D55F0 | -- | -- |
| Function-scope .file/.loc emission | sub_31E4280 | -- | -- |
| insertDebugLocEntry (file/line to MCSymbol mapping) | sub_31E6100 | -- | -- |
| Instruction-level debug comment emission | sub_31D89B0 | -- | -- |
| emitHeader (.version, .target ... , debug) | sub_214F370 | 7.2KB | -- |
| Module-level emission entry / NVPTX Debug Info Emission | sub_215ACD0 | 8.1KB | -- |
| DwarfDebug::beginModule() | sub_399B1E0 | 29KB | -- |
| .debug_aranges emission | sub_3997B50 | 33KB | -- |
| Range list emission (DW_RLE_*) | sub_399D1D0 | 12KB | -- |
| Register location expressions | sub_399EB70 | 12KB | -- |
| .debug_names accelerator table | sub_39BDF60 | 38KB | -- |
| DWARF form size calculator | sub_39B6390 | 33KB | -- |
| DIBuilder / debug metadata helper | sub_ADCDB0 | -- | -- |
| cl::opt registration: debug-compile, generate-line-info, line-info-inlined-at | sub_48D7F0 | -- | -- |
| NVVM container version check (validates NvvmDebugVersion.Major == 3) | sub_CD41B0 | -- | -- |

Cross-References

NVIDIA Custom Passes

25+ proprietary optimization passes not found in upstream LLVM. Registered into the New PM pipeline at sub_2342890 and into the pipeline assembler at sub_12E54A0.

| Category | Count |
|---|---|
| Module-level custom | 16 passes |
| Function-level custom | 9 passes |
| Loop-level custom | 1 pass |
| Custom analyses | 2 analyses |
| Machine-level custom | 13 passes |
| Registration | sub_2342890 (New PM) + sub_12E54A0 (pipeline builder) |
| Dedicated deep-dive pages | 22 |

IR-Level Module Passes

| Pass Name | Class / Function | Size | Description |
|---|---|---|---|
| memory-space-opt | sub_1C70910 / sub_1CA2920 | cluster | Resolves generic pointers to specific address spaces (global/shared/local/const). Warns on illegal ops: atomics on constant mem, wmma on wrong space. Parameterized: first-time, second-time, no-warnings, warnings |
| printf-lowering | sub_1CB1E60 | 31KB | Lowers printf to vprintf + local buffer. Validates format string is a literal. "vprintfBuffer.local", "bufIndexed" |
| nvvm-verify | sub_2C80C90 | 230KB | Three-layer NVVM IR verifier (module + function + intrinsic). Validates triples, address spaces, atomic restrictions, pointer cast rules, architecture-gated intrinsic availability |
| nvvm-pretreat | PretreatPass | -- | IR pre-treatment before optimization |
| check-kernel-functions | NVPTXSetFunctionLinkagesPass | -- | Kernel function linkage validation |
| check-gep-index | -- | -- | GEP index validation |
| cnp-launch-check | CNPLaunchCheckPass | -- | Cooperative launch validation |
| ipmsp | IPMSPPass | -- | Inter-procedural memory space propagation |
| nv-early-inliner | -- | -- | NVIDIA early inlining pass |
| nv-inline-must | InlineMustPass | -- | Force-inline functions marked __forceinline__ |
| select-kernels | SelectKernelsPass | -- | Kernel selection for compilation |
| set-global-array-alignment | -- | -- | Parameterized: modify-shared-mem, skip-shared-mem, modify-global-mem, skip-global-mem |
| lower-aggr-copies | -- | 72KB+58KB | Lower aggregate copies: struct splitting, memmove unrolling. Param: lower-aggr-func-args |
| lower-struct-args | -- | -- | Lower structure arguments. Param: opt-byval |
| process-restrict | -- | -- | Process __restrict__ annotations. Param: propagate-only |
| lower-ops | LowerOpsPass | -- | Lower special operations. Includes FP128/I128 emulation via 48 __nv_* library calls |

IR-Level Function Passes

| Pass Name | Function | Size | Description |
|---|---|---|---|
| branch-dist | sub_1C47810 cluster | -- | Branch distribution optimization. Knobs: branch-dist-block-limit, branch-dist-func-limit, branch-dist-norm |
| nvvm-reflect | sub_1857160 | -- | Resolves __nvvm_reflect() calls to integer constants based on target SM and FTZ mode. Runs multiple times as inlining exposes new calls |
| nvvm-reflect-pp | -- | -- | NVVM reflect preprocessor |
| nvvm-intrinsic-lowering | sub_2C63FB0 | 140KB | Lowers llvm.nvvm.* intrinsics to standard LLVM IR. Two levels: 0 = basic, 1 = barrier-aware. Runs up to 10 times in mid pipeline |
| nvvm-peephole-optimizer | -- | -- | NVVM-specific peephole optimizations |
| remat | sub_1CE7DD0 | 67KB | IR-level rematerialization. Analyzes live-in/live-out register pressure per BB. Contains IV demotion sub-pass (75KB) |
| reuse-local-memory | -- | -- | Local memory reuse optimization |
| set-local-array-alignment | -- | -- | Set alignment for local arrays |
| sinking2 | -- | -- | NVIDIA-specific instruction sinking (distinct from LLVM's Sink pass) |

IR-Level Loop Pass

| Pass Name | Function | Size | Description |
|---|---|---|---|
| loop-index-split | sub_2CC5900 / sub_1C7B2C0 | 69KB | Split loops on index conditions. NVIDIA-preserved pass (removed from upstream LLVM) |

Custom Analyses

| Analysis Name | Purpose |
|---|---|
| rpa | Register Pressure Analysis — feeds into scheduling and rematerialization decisions |
| merge-sets | Merge set computation — used by coalescing and allocation |

Machine-Level Passes

| Pass Name | Function | Pass ID | Size | Description |
|---|---|---|---|---|
| Block Remat | sub_2186D90 | nvptx-remat-block | 47KB | Two-phase candidate selection + iterative "pull-in" for register pressure reduction. "Max-Live-Function(", "Really Final Pull-in:" |
| Machine Mem2Reg | sub_21F9920 | nvptx-mem2reg | -- | Promotes __local_depot stack objects back to registers post-regalloc |
| MRPA | sub_2E5A4E0 | machine-rpa | 48KB | Machine Register Pressure Analysis — incremental tracking, not in upstream LLVM |
| LDG Transform | sub_21F2780 | ldgxform | -- | Transforms global loads to ldg.* (texture cache) for read-only data |
| GenericToNVVM | sub_215DC20 | generic-to-nvvm | 36KB | Moves globals from generic to global address space |
| Alloca Hoisting | sub_21BC7D0 | alloca-hoisting | -- | Ensures all allocas are in entry block (PTX requirement) |
| Image Optimizer | sub_21BCF10 | -- | -- | Optimizes texture/surface access patterns |
| NVPTX Peephole | sub_21DB090 | nvptx-peephole | -- | NVPTX-specific peephole optimization |
| Prolog/Epilog | sub_21DB5F0 | -- | -- | Custom frame management (PTX has no traditional prolog/epilog) |
| Replace Image Handles | sub_21DBEA0 | -- | -- | Replaces IR-level image handles with PTX texture/surface references |
| Extra MI Printer | sub_21E9E80 | extra-machineinstr-printer | -- | Register pressure statistics reporting |
| Valid Global Names | sub_21BCD80 | nvptx-assign-valid-global-names | -- | Sanitizes global names to valid PTX identifiers |
| NVVMIntrRange | sub_216F4B0 | nvvm-intr-range | -- | Adds !range metadata to NVVM intrinsics (e.g., tid.x bounds) |

Major Proprietary Subsystems

Dead Synchronization Elimination — sub_2C84BA0

| Field | Value |
|---|---|
| Size | 96KB |
| Purpose | Removes redundant __syncthreads() barriers |

Bidirectional fixed-point dataflow analysis across the CFG, tracking four memory access categories per BB through eight red-black tree maps. Each deletion triggers full restart. Distinct from lightweight basic-dbe. See dedicated page for full algorithm.

MemorySpaceOpt — Multi-Function Cluster

| Function | Size | Purpose |
|---|---|---|
| sub_1C70910 | -- | Pass entry point |
| sub_1C6A6C0 | -- | Pass variant |
| sub_1CA2920 | 32KB | Address space resolution — "Cannot tell what pointer points to, assuming global memory space" |
| sub_1CA9E90 | 28KB | Secondary resolver |
| sub_1CA5350 | 45KB | Infrastructure |
| sub_2CBBE90 | 71KB | Memory-space-specialized function cloning |

NV Rematerialization Cluster

| Function | Size | Role |
|---|---|---|
| sub_1CE7DD0 | 67KB | Main driver — live-in/live-out analysis, skip decisions |
| sub_1CE67D0 | 32KB | Block-level executor — "remat_", "uclone_" prefixes |
| sub_1CE3AF0 | 56KB | Pull-in cost analysis — "Total pull-in cost = %d" |

NLO — Simplify Live Output

| Function | Size | Strings |
|---|---|---|
| sub_1CE10B0 | 48KB | "Simplify Live Output", "nloNewBit", "newBit" |
| sub_1CDC1F0 | 35KB | "nloNewAdd", "nloNewBit" |

Creates new add/bit operations to simplify live-out values at block boundaries.

IV Demotion — sub_1CD74B0

| Field | Value |
|---|---|
| Size | 75KB |
| Strings | "phiNode", "demoteIV", "newInit", "newInc", "argBaseIV", "newBaseIV", "iv_base_clone_", "substIV" |

Demotes induction variables (e.g., 64-bit to 32-bit), creates new base IVs, clones IV chains for register pressure reduction. Sub-pass of rematerialization. See dedicated page for full algorithm.

RLMCAST — sub_2D13E90

| Field | Value |
|---|---|
| Size | 67KB |
| Purpose | Register-level multicast instruction lowering |

Broadcasts a value to multiple register destinations. Uses 216-byte and 160-byte node structures.

Texture Group Merge (.Tgm) — sub_2DDE8C0

Groups texture load operations to hide latency. Uses .Tgm suffix in scheduling and function pointer table (3 predicates) for grouping decisions.

NVVM Intrinsic Verifier — sub_2C7B6A0

| Field | Value |
|---|---|
| Size | 143KB |
| Purpose | Validates ALL NVVM intrinsics against SM capabilities |

Architecture-gated validation for every intrinsic call. Part of the three-layer NVVM verifier (230KB total).

NVVM Intrinsic Lowering — sub_2C63FB0

| Field | Value |
|---|---|
| Size | 140KB |
| Purpose | Lowers NVVM intrinsics to concrete operations |

Pattern-matching rewrite engine for llvm.nvvm.* intrinsics. Two levels (basic + barrier-aware), runs up to 10 times. See dedicated page for full dispatch table.

Base Address Strength Reduction — sub_2CA4A10

| Field | Value |
|---|---|
| Size | 58KB |
| Knobs | do-base-address-strength-reduce (two levels: 1 = no conditions, 2 = with conditions) |

Scans loop bodies for memory ops sharing a common base pointer, hoists the anchor computation, rewrites remaining addresses as (anchor + relative_offset). See dedicated page for the anchor selection algorithm.
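The rewrite can be illustrated on symbolic base+offset addresses (a toy Python sketch of the (anchor + relative_offset) idea only -- the recovered pass's actual anchor selection is described on the dedicated page):

```python
def strength_reduce(addrs):
    """Rewrite (base, offset) addresses sharing one base as anchor + relative offset.

    addrs: list of (base, offset) pairs from one loop body.
    Returns the hoisted anchor offset and the rewritten relative addresses.
    """
    anchor = min(off for _, off in addrs)  # toy policy: lowest offset is the anchor
    # One anchor computation is hoisted; remaining addresses become cheap
    # relative offsets from it.
    return anchor, [(base, off - anchor) for base, off in addrs]

anchor, rewritten = strength_reduce([("p", 16), ("p", 24), ("p", 48)])
assert anchor == 16 and rewritten == [("p", 0), ("p", 8), ("p", 32)]
```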

Common Base Elimination — sub_2CA8B00

| Field | Value |
|---|---|
| Size | 39KB |
| Purpose | Hoists shared base address expressions to dominating CFG points |

Operates at inter-block level (vs BASR intra-loop). The two passes form a complementary pair for comprehensive GPU address computation reduction. See dedicated page.

CSSA Transformation — sub_3720740

| Field | Value |
|---|---|
| Size | 22KB |
| Purpose | Conventional-SSA for GPU divergent control flow |
| Knobs | do-cssa, cssa-coalesce, cssa-verbosity, dump-before-cssa |
| Debug | "IR Module before CSSA" |

Rewrites PHI nodes to be safe under warp-divergent execution by inserting explicit copy instructions at reconvergence points. See dedicated page for the divergence model.

NVIDIA Codegen Knobs — sub_1C20170

70+ knobs parsed from the NVVM container format:

Graphics Pipeline

VSIsVREnabled, VSIsLastVTGStage, EnableZeroCoverageKill, AllowComputeDerivatives, AllowDerivatives, EnableNonUniformQuadDerivatives, UsePIXBAR, ManageAPICallDepth

Compute / Memory

DisableSAMRAM, DoMMACoalescing, DisablePartialHalfVectorWrites, AssumeConvertMemoryToRegProfitable, MSTSForceOneCTAPerSMForSmemEmu, AddDepFromGlobalMembarToCB

Register Allocation / Scheduling

AdvancedRemat, CSSACoalescing, DisablePredication, DisableXBlockSched, ReorderCSE, ScheduleKils, NumNopsAtStart, DisableERRBARAfterMEMBAR

Type Promotion

PromoteHalf, PromoteFixed, FP16Mode, IgnoreRndFtzOnF32F16Conv, DisableLegalizeIntegers

PGO

PGOProfileKind, PGOEpoch, PGOBatchSize, PGOCounterMemBaseVAIndex

Knob Forwarding

OCGKnobs, OCGKnobsFile, NVVMKnobsString, OmegaKnobs, FinalizerKnobs

Compile Modes — sub_1C21CE0

| Mode | Constant |
|---|---|
| Whole-program no-ABI | NVVM_COMPILE_MODE_WHOLE_PROGRAM_NOABI |
| Whole-program ABI | NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI |
| Separate ABI | NVVM_COMPILE_MODE_SEPARATE_ABI |
| Extensible WP ABI | NVVM_COMPILE_MODE_EXTENSIBLE_WHOLE_PROGRAM_ABI |

| Opt Level | Constant |
|---|---|
| None | NVVM_OPT_LEVEL_NONE |
| 1 | NVVM_OPT_LEVEL_1 |
| 2 | NVVM_OPT_LEVEL_2 |
| 3 | NVVM_OPT_LEVEL_3 |

| Debug Info | Constant |
|---|---|
| None | NVVM_DEBUG_INFO_NONE |
| Line info | NVVM_DEBUG_INFO_LINE_INFO |
| Full DWARF | NVVM_DEBUG_INFO_DWARF |

NVVMReflect

The NVVMReflect pass resolves calls to __nvvm_reflect() -- a compile-time introspection mechanism that lets CUDA device code query compilation parameters such as the target GPU architecture, flush-to-zero mode, and precision settings. Each __nvvm_reflect("__CUDA_ARCH") call is replaced with an integer constant derived from the target SM version, and each __nvvm_reflect("__CUDA_FTZ") is replaced with 0 or 1 depending on the -ftz flag. After replacement, the constant result feeds into conditional branches that standard LLVM passes (SimplifyCFG, SCCP, ADCE) can fold away, eliminating dead architecture-specific code paths at compile time. This is NVIDIA's primary mechanism for producing architecture-specialized code from a single portable source: libdevice alone contains hundreds of __nvvm_reflect calls that select between FTZ and non-FTZ instruction variants.

The pass is relatively small in code size but architecturally critical -- it runs multiple times at different pipeline positions because inlining, loop unrolling, and other transformations continuously expose new __nvvm_reflect calls that were previously hidden inside un-inlined function bodies.

Key Facts

| Property | Value |
|---|---|
| Pass factory | sub_1857160 |
| Pass level | Function pass (runs per-function) |
| Registration | Legacy PM only (not separately registered in New PM); post-processor nvvm-reflect-pp is New PM #381 at line 2237 |
| Runtime positions | Tier 0 #7; Tier 1/2/3 #9, #73 (see Pipeline) |
| Pipeline disable flag | NVVMPassOptions offset +880 |
| Knob | nvvm-reflect-enable (boolean, default: true) |
| Global knob constructor | ctor_271 |
| Vtable (likely) | unk_3C2026C |
| Post-processing pass | nvvm-reflect-pp = SimplifyConstantConditionalsPass |
| New PM registration | Not separately registered -- NVVMReflect is a legacy-PM pass invoked from the pipeline assembler; nvvm-reflect-pp is the New PM companion at registration line 2237 of sub_2342890 |
| Upstream equivalent | NVVMReflect in llvm/lib/Target/NVPTX/NVVMReflect.cpp |
| Occurrences in pipeline | ~8 invocations across all paths (see Multi-Run Pattern) |

Reflect Query Names

The __nvvm_reflect mechanism supports a fixed set of query strings. These are embedded as global string constants in NVVM IR (typically from libdevice bitcode) and matched by the pass:

| Query String | Meaning | Value Source |
|---|---|---|
| __CUDA_ARCH | Target GPU compute capability | -arch=compute_XX flag, encoded as major*100 + minor*10 |
| __CUDA_FTZ | Flush-to-zero mode for single-precision | -ftz=1 sets to 1; default 0 |
| __CUDA_PREC_DIV | Precise division mode | -prec-div=1 sets to 1; default 0 |
| __CUDA_PREC_SQRT | Precise square root mode | -prec-sqrt=1 sets to 1; default 0 |

__CUDA_ARCH Values

The __CUDA_ARCH value is an integer encoding SM_major * 100 + SM_minor * 10, propagated from the CLI through the EDG frontend as -R __CUDA_ARCH=NNN:

| Architecture | __CUDA_ARCH | SM Variants |
|---|---|---|
| Turing | 750 | sm_75 |
| Ampere | 800, 860, 870, 880 | sm_80, sm_86, sm_87, sm_88 |
| Ada Lovelace | 890 | sm_89 |
| Hopper | 900 | sm_90, sm_90a (both share 900) |
| Blackwell | 1000, 1030 | sm_100/100a/100f, sm_103/103a/103f |
| (SM 11.x) | 1100 | sm_110/110a/110f |
| (SM 12.x) | 1200, 1210 | sm_120/120a/120f, sm_121/121a/121f |

Note: Architecture variants with an a (accelerated) or f (forward-compatible) suffix share the same __CUDA_ARCH value as their base variant. They differ only in the -opt-arch and -mcpu flags, which affect instruction selection and scheduling but not reflect queries.
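The encoding in the table reduces to major*100 + minor*10, with suffix letters ignored. A minimal sketch (the helper name is ours, not a recovered symbol):

```python
def cuda_arch(sm: str) -> int:
    """Encode an sm_XY[a|f] name as a __CUDA_ARCH value (major*100 + minor*10).

    Suffix letters are dropped, matching the table above:
    sm_90 and sm_90a both encode to 900.
    """
    digits = sm.removeprefix("sm_").rstrip("af")
    major, minor = int(digits[:-1]), int(digits[-1])
    return major * 100 + minor * 10
```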

Algorithm

The NVVMReflect pass implements a straightforward pattern-matching replacement. In pseudocode:

bool NVVMReflectPass::runOnFunction(Function &F) {
    bool changed = false;
    if (!nvvm_reflect_enable)  // controlled by 'nvvm-reflect-enable' knob
        return false;

    SmallVector<CallInst *, 8> reflect_calls;

    // Phase 1: Collect all __nvvm_reflect call sites
    for (BasicBlock &BB : F) {
        for (Instruction &I : BB) {
            if (auto *CI = dyn_cast<CallInst>(&I)) {
                Function *callee = CI->getCalledFunction();
                if (callee && callee->getName() == "__nvvm_reflect")
                    reflect_calls.push_back(CI);
            }
        }
    }

    // Phase 2: Resolve each call to a constant
    for (CallInst *CI : reflect_calls) {
        // Extract the query string from the first argument.
        // The argument is a pointer to a global constant string:
        //   @.str = private constant [12 x i8] c"__CUDA_ARCH\00"
        // The pass traces through the GEP/bitcast to find the
        // ConstantDataArray initializer.
        StringRef query = extractStringArgument(CI->getArgOperand(0));

        int result = 0;
        if (query == "__CUDA_ARCH")
            result = sm_version;          // e.g., 900 for sm_90
        else if (query == "__CUDA_FTZ")
            result = ftz_enabled ? 1 : 0;
        else if (query == "__CUDA_PREC_DIV")
            result = prec_div ? 1 : 0;
        else if (query == "__CUDA_PREC_SQRT")
            result = prec_sqrt ? 1 : 0;
        else
            result = 0;  // unknown query => 0

        // Replace the call with the constant integer
        CI->replaceAllUsesWith(ConstantInt::get(CI->getType(), result));
        CI->eraseFromParent();
        changed = true;
    }
    return changed;
}

The string extraction logic must handle the IR pattern produced by the CUDA frontend and libdevice linking:

@.str = private unnamed_addr constant [12 x i8] c"__CUDA_ARCH\00", align 1

%1 = call i32 @__nvvm_reflect(ptr @.str)

The pass walks through the argument operand, stripping ConstantExpr GEPs and bitcasts, to reach the ConstantDataArray containing the query string. If the argument is not a resolvable constant string, the call is left unmodified (this is a no-op safety -- in practice, all reflect calls use literal string arguments).

Interaction with Constant Propagation and Dead Code Elimination

The reflect replacement produces a constant integer that feeds directly into an icmp and conditional branch. This is the canonical pattern in libdevice:

Before NVVMReflect (from libdevice.10.ll, function __nv_floorf):

define float @__nv_floorf(float %f) {
  %1 = call i32 @__nvvm_reflect(ptr @.str)   ; @.str = "__CUDA_FTZ"
  %2 = icmp ne i32 %1, 0
  br i1 %2, label %ftz_path, label %precise_path

ftz_path:
  %3 = call float @llvm.nvvm.floor.ftz.f(float %f)
  br label %merge

precise_path:
  %4 = call float @llvm.nvvm.floor.f(float %f)
  br label %merge

merge:
  %.0 = phi float [ %3, %ftz_path ], [ %4, %precise_path ]
  ret float %.0
}

After NVVMReflect (with -ftz=1):

define float @__nv_floorf(float %f) {
  %2 = icmp ne i32 1, 0            ; constant 1 replaces the call
  br i1 %2, label %ftz_path, label %precise_path

ftz_path:
  %3 = call float @llvm.nvvm.floor.ftz.f(float %f)
  br label %merge

precise_path:                       ; now unreachable
  %4 = call float @llvm.nvvm.floor.f(float %f)
  br label %merge

merge:
  %.0 = phi float [ %3, %ftz_path ], [ %4, %precise_path ]
  ret float %.0
}

After SimplifyCFG / SCCP / ADCE (subsequent passes):

define float @__nv_floorf(float %f) {
  %1 = call float @llvm.nvvm.floor.ftz.f(float %f)
  ret float %1
}

The icmp ne i32 1, 0 folds to true, SimplifyCFG eliminates the dead branch, and ADCE removes the unused llvm.nvvm.floor.f call. The function collapses from 4 basic blocks to 1.

This pattern repeats for every libdevice math function: __nv_fabsf, __nv_fminf, __nv_fmaxf, __nv_rsqrtf, __nv_exp2f, and dozens more all contain the same __nvvm_reflect("__CUDA_FTZ") branch. After reflect resolution, each function specializes to either FTZ or precise mode.

__CUDA_ARCH branching pattern

For architecture-dependent code, the pattern uses inequality comparisons:

%arch = call i32 @__nvvm_reflect(ptr @.str.1)  ; "__CUDA_ARCH"
%is_sm80_plus = icmp sge i32 %arch, 800
br i1 %is_sm80_plus, label %sm80_path, label %legacy_path

sm80_path:
  ; use SM 8.0+ specific intrinsics (e.g., async copy, cp.async)
  ...

legacy_path:
  ; fallback path for older architectures
  ...

After NVVMReflect replaces %arch with (e.g.) 900 for Hopper, the comparison icmp sge i32 900, 800 folds to true, and the legacy path is eliminated.

Multi-Run Pattern

NVVMReflect (sub_1857160) is invoked multiple times across the pipeline because optimization passes continuously expose new reflect calls. The key insight is that __nvvm_reflect calls originate primarily from libdevice functions, which are linked as bitcode and initially exist as un-inlined function calls. Each inlining pass expands these functions inline, exposing their internal __nvvm_reflect calls to the containing function.

Tier 0 Pipeline (Full Optimization via sub_12DE330)

In the Tier 0 (O1/O2/O3) full optimization pipeline, NVVMReflect appears once:

| Position | Factory | Context |
|---|---|---|
| #7 | sub_1857160() | After CGSCC inliner (#2), GVN (#5-6). Catches reflect calls exposed by first-round inlining |

"mid" Path Pipeline (Ofcmid/Ofcmin via sub_12E54A0 PATH B)

In the "mid" fast-compile path, NVVMReflect appears at three distinct positions:

| Position | Factory | Guard | Context |
|---|---|---|---|
| After CGSCC pipeline #8 | sub_1857160() | !opts[880] | After aggressive CGSCC inlining (8 iterations). Catches reflect calls from freshly inlined libdevice bodies |
| After Sinking2 + EarlyCSE | sub_1857160() | !opts[880] | After loop transformations and code motion. Catches reflect calls in loop bodies after unrolling |
| (appears once more in late position) | sub_1857160() | !opts[880] | Final cleanup after late CGSCC pass and NVVMIntrinsicLowering |

Default/General Path Pipeline (PATH C)

In the default path (external bitcode input), NVVMReflect appears at three positions:

| Position | Factory | Context |
|---|---|---|
| After CGSCC pipeline #4 | sub_1857160() | First resolution after initial inlining |
| After NVVMIntrinsicLowering | sub_1857160() | Intrinsic lowering may expose new reflect patterns |
| After LoopUnroll + InstCombine | sub_1857160() | Loop unrolling duplicates loop bodies containing reflect calls |

Tiered Pipeline Insertions (sub_12DE8F0)

Within the tiered sub-pipeline, NVVMReflect appears with additional gating:

| Tier | Guard | Position |
|---|---|---|
| 1, 2, 3 | opts[3200] && !opts[880] | Mid-tier, after NVVMVerifier and IPConstPropagation |
| 3 only | opts[3200] && tier==3 && !opts[880] | Late-tier, after ADCE and LoopOpt/BarrierOpt. This extra run at O3 catches reflect calls exposed by the most aggressive transformations |

Why Multiple Runs Are Necessary

Consider this scenario:

  1. User code calls __nv_sinf(x) (a libdevice function).
  2. Initially, __nv_sinf is an external function call -- its body contains __nvvm_reflect("__CUDA_FTZ") but the reflect call is not visible to the optimizer.
  3. First NVVMReflect run: No-op for this function (the reflect is inside __nv_sinf's body, which has not been inlined yet).
  4. CGSCC Inliner runs: Inlines __nv_sinf into the caller, expanding its body with the __nvvm_reflect call.
  5. Second NVVMReflect run: Now sees the freshly-inlined __nvvm_reflect call and resolves it to a constant.
  6. Loop Unrolling runs: If the __nv_sinf call was inside a loop, unrolling duplicates the call site. If the loop body was too complex to inline before unrolling simplified it, a third inlining opportunity may arise.
  7. Third NVVMReflect run: Resolves any remaining reflect calls exposed by unrolling + re-inlining.

Without multiple runs, libdevice functions inlined late in the pipeline would retain their reflect-based branching, defeating the specialization mechanism and leaving dead code paths in the final binary.
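The interleaving above can be modeled as a fixpoint over a toy module in which inlining reveals reflect calls one step at a time (a Python sketch of the scheduling argument only -- the "instructions" are strings, not real IR):

```python
# Toy model: a function body is a list of pseudo-instructions. A call to a
# libdevice function hides its body (and any reflect calls) until inlined.
libdevice = {"__nv_sinf": ["reflect:__CUDA_FTZ", "math"]}

def run_reflect(body, ftz):
    """One NVVMReflect run: replace visible reflect pseudo-ops with constants."""
    return [f"const:{int(ftz)}" if i == "reflect:__CUDA_FTZ" else i for i in body]

def run_inliner(body):
    """One inliner run: expand calls to known libdevice functions."""
    out = []
    for i in body:
        if i.startswith("call:"):
            out.extend(libdevice.get(i.removeprefix("call:"), [i]))
        else:
            out.append(i)
    return out

kernel = ["call:__nv_sinf"]
kernel = run_reflect(kernel, ftz=True)   # run 1: reflect still hidden -> no-op
assert kernel == ["call:__nv_sinf"]
kernel = run_inliner(kernel)             # inlining exposes the reflect call
kernel = run_reflect(kernel, ftz=True)   # run 2: now resolves to a constant
assert kernel == ["const:1", "math"]
```

A single early run would have left the reflect call unresolved inside the not-yet-inlined body, which is exactly the failure mode the repeated pipeline insertions prevent.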

The nvvm-reflect-pp Post-Processing Pass

After NVVMReflect replaces calls with constants, the resulting IR contains trivially-foldable comparisons and dead branches. While standard LLVM passes (SimplifyCFG, ADCE) handle most of this, NVIDIA registers a dedicated post-processing pass under the misleading name nvvm-reflect-pp.

Despite its name, nvvm-reflect-pp is SimplifyConstantConditionalsPass (class llvm::SimplifyConstantConditionalsPass), not a reflection pass. It is a targeted dead-branch elimination pass that:

  1. Finds conditional branches where the condition is a constant (icmp with both operands constant).
  2. Replaces the branch with an unconditional branch to the taken target.
  3. Marks the not-taken successor as potentially unreachable.
  4. Cleans up resulting dead phi nodes and empty blocks.

This pass is registered in the New PM at sub_2342890 line 2237 as a function-level pass. It runs immediately after NVVMReflect in some pipeline configurations to ensure that reflected constants are cleaned up before subsequent optimization passes see the IR.
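Steps 1 and 2 of that cleanup can be sketched on a toy CFG (illustrative Python; block names and the terminator encoding are ours, not the recovered data structures):

```python
def fold_constant_branches(cfg):
    """Rewrite conditional branches whose condition is constant as unconditional.

    cfg maps block -> ("br", cond, taken, not_taken) or ("jmp", target),
    where cond is either a bool constant or a symbolic condition name.
    Returns the rewritten CFG and the set of newly dead successor blocks.
    """
    dead = set()
    for block, term in list(cfg.items()):
        if term[0] == "br" and isinstance(term[1], bool):
            _, cond, taken, not_taken = term
            cfg[block] = ("jmp", taken if cond else not_taken)
            dead.add(not_taken if cond else taken)
    # Steps 3-4 (unreachable-block and phi cleanup) would prune `dead` here.
    return cfg, dead

cfg = {"entry": ("br", True, "ftz_path", "precise_path"),
       "ftz_path": ("jmp", "merge"),
       "precise_path": ("jmp", "merge")}
folded, dead = fold_constant_branches(cfg)
assert folded["entry"] == ("jmp", "ftz_path") and dead == {"precise_path"}
```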

Configuration

| Knob | Type | Default | Effect |
|---|---|---|---|
| nvvm-reflect-enable | bool | true | Master enable for NVVMReflect. When false, all __nvvm_reflect calls are left unresolved (they default to 0 at link time, selecting the non-FTZ/non-precise/lowest-arch path). |

Pipeline Disable Flag

NVVMPassOptions offset +880 is the per-compilation disable flag for NVVMReflect. When set (e.g., by an internal debugging mechanism), all pipeline insertion points skip the pass via the !opts[880] guard. This flag is distinct from the nvvm-reflect-enable knob: the knob controls the pass's internal behavior, while the pipeline flag prevents the pass from being added to the pipeline at all.

Reflect Value Propagation Path

The reflect query values flow from the CLI through three layers:

  1. CLI: -arch=compute_90 is parsed by sub_95EB40 / sub_12C8DD0
  2. EDG frontend: Receives -R __CUDA_ARCH=900 and defines the preprocessor macro
  3. Optimizer: Receives -opt-arch=sm_90. The NVVMReflect pass reads the SM version from the target machine configuration (not from -R flags -- those are for the preprocessor)

For FTZ/precision flags, the path is:

  1. -ftz=1 maps to -R __CUDA_FTZ=1 (EDG) and -nvptx-f32ftz (optimizer/backend)
  2. The NVVMReflect pass reads the FTZ setting from the NVPTX subtarget or a global variable set during pipeline configuration

Differences from Upstream LLVM

Upstream LLVM's NVVMReflect pass (in llvm/lib/Target/NVPTX/NVVMReflect.cpp) is functionally similar, but CICC v13.0 differs in several respects:

Aspect | Upstream LLVM | CICC v13.0
Pipeline placement | Runs once, typically early | Runs ~8 times at strategic positions throughout the pipeline
Post-processing | Relies on standard SimplifyCFG | Has dedicated nvvm-reflect-pp (SimplifyConstantConditionalsPass)
Pipeline integration | New PM function pass | Legacy PM function pass invoked from the pipeline assembler (sub_12E54A0), with the pipeline disable flag at NVVMPassOptions[880]
Tier 3 extra run | Not applicable | Extra late-pipeline run gated by tier==3 for O3-only cleanup
Query string set | __CUDA_ARCH, __CUDA_FTZ | Same set plus __CUDA_PREC_DIV, __CUDA_PREC_SQRT

The multi-run strategy is the most significant difference. Upstream LLVM assumes that NVVMReflect runs once before optimization, resolving all reflect calls in the linked libdevice bitcode. CICC's pipeline accounts for the reality that aggressive inlining and loop transformations in a GPU-focused compiler expose reflect calls at many different pipeline stages.

Function Map

Function | Address | Size | Role
NVVMReflect pass factory | sub_1857160 | -- | Creates and returns a new NVVMReflect pass instance
NVVMReflect constructor knob | ctor_271 | -- | Registers nvvm-reflect-enable cl::opt
SimplifyConstantConditionalsPass (nvvm-reflect-pp) | registered at line 2237 of sub_2342890 | -- | Post-reflect dead branch cleanup
Pipeline assembler | sub_12E54A0 | -- | Inserts NVVMReflect at multiple positions
Tier 0 pipeline builder | sub_12DE330 | -- | Inserts NVVMReflect as pass #7
Tiered sub-pipeline | sub_12DE8F0 | -- | Inserts NVVMReflect at tier-gated positions
Architecture detection table | sub_95EB40 | -- | Maps -arch=compute_XX to __CUDA_ARCH values
Architecture detection (libnvvm) | sub_12C8DD0 | -- | Parallel mapping table for the libnvvm path

Test This

The following kernel calls a libdevice math function whose implementation branches on __CUDA_FTZ and __CUDA_ARCH. Compile for two configurations and compare the PTX to see NVVMReflect in action.

#include <math.h>

__global__ void reflect_test(float* out, const float* in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        out[tid] = sinf(in[tid]);
    }
}

Compile twice:

nvcc -ptx -arch=sm_90 -ftz=true  reflect_test.cu -o reflect_ftz.ptx
nvcc -ptx -arch=sm_90 -ftz=false reflect_test.cu -o reflect_noftz.ptx

What to look for in PTX:

  • With -ftz=true: the PTX should contain flush-to-zero math instructions (e.g., sin.approx.ftz.f32). The NVVMReflect pass resolved __nvvm_reflect("__CUDA_FTZ") to 1, SimplifyCFG folded the branch, and only the FTZ code path survived.
  • With -ftz=false: the PTX should contain precise math instructions without the .ftz suffix. The reflect resolved to 0, selecting the non-FTZ path.
  • The key evidence is that the PTX contains only one code path -- no conditional branch choosing between FTZ and non-FTZ variants. If both paths survive, NVVMReflect or its downstream cleanup passes failed.
  • Comparing -arch=sm_75 vs. -arch=sm_90 exercises the __CUDA_ARCH reflect. Functions like __nv_dsqrt_rn use architecture comparisons (icmp sge i32 %arch, 800) to select between SM 8.0+ instruction sequences and legacy fallbacks.

Common Pitfalls

These are mistakes a reimplementor is likely to make when building an equivalent compile-time reflection mechanism.

1. Returning the wrong __CUDA_ARCH encoding. The __CUDA_ARCH value is major * 100 + minor * 10, not major * 10 + minor. For SM 9.0, the correct value is 900, not 90. For SM 10.0, the correct value is 1000, not 100. A reimplementation that uses the wrong encoding will select the wrong code paths in libdevice, potentially enabling instructions not supported by the target architecture (e.g., SM 7.0 paths on an SM 9.0 target) or disabling instructions that should be available. This encoding is also used by the CUDA preprocessor (__CUDA_ARCH__), so consistency between the frontend macro and the reflect value is critical.
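The encoding rule can be stated as a one-line sketch (the helper name is ours, not CICC's):

```cpp
#include <cassert>

// __CUDA_ARCH encoding as described above: major * 100 + minor * 10.
// sm_90 -> 900, sm_75 -> 750, sm_100 -> 1000 (not 90, 75, or 100).
constexpr int cudaArchValue(int major, int minor) {
    return major * 100 + minor * 10;
}
```

This is the same value space the preprocessor macro `__CUDA_ARCH__` uses, which is why the frontend macro and the reflect value must agree.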

2. Running NVVMReflect only once in the pipeline. The pass must run multiple times (approximately 8 invocations across the full pipeline) because __nvvm_reflect calls are hidden inside un-inlined libdevice function bodies. The first run resolves calls visible at the top level, but each subsequent inlining pass exposes new reflect calls from freshly inlined libdevice functions. A reimplementation with a single early invocation will leave reflected branches unresolved in all functions inlined after that point, resulting in both FTZ and non-FTZ code paths surviving to the final binary -- doubling code size and defeating the entire specialization mechanism.

3. Not running SimplifyConstantConditionalsPass (nvvm-reflect-pp) after reflect resolution. After NVVMReflect replaces __nvvm_reflect("__CUDA_FTZ") with the constant 1, the IR contains icmp ne i32 1, 0 feeding a conditional branch. If no pass simplifies this to an unconditional branch, the dead code path survives through the rest of the pipeline, consuming compile time in every subsequent pass and inflating the final binary. While standard LLVM SimplifyCFG will eventually handle it, the dedicated nvvm-reflect-pp pass provides immediate cleanup at the point where it matters most.

4. Returning 0 for unknown query strings instead of propagating a diagnostic. The pass returns 0 for any unrecognized __nvvm_reflect query string. This is the correct behavior (documented default), but a reimplementation that raises an error or leaves the call unresolved will break forward compatibility: future CUDA toolkit versions may introduce new query strings that libdevice checks. The value 0 is the safe default because libdevice code always treats 0 as "feature not available" and falls back to the conservative code path.
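The lookup rule from pitfall 4 can be sketched as follows; the table contents are illustrative values for one compilation, not CICC's actual state, and the function name is ours:

```cpp
#include <cassert>
#include <map>
#include <string>

// Known queries resolve to their configured values; any unrecognized query
// string resolves to 0 (the conservative "feature not available" default)
// rather than raising an error.
int resolveReflect(const std::string& query) {
    static const std::map<std::string, int> known = {
        {"__CUDA_ARCH", 900},
        {"__CUDA_FTZ", 1},
        {"__CUDA_PREC_DIV", 1},
        {"__CUDA_PREC_SQRT", 1},
    };
    auto it = known.find(query);
    return it == known.end() ? 0 : it->second;   // unknown -> 0, never an error
}
```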

5. Reading the SM version from the wrong source. The reflect query values flow through three layers: CLI (-arch=compute_90), EDG frontend (-R __CUDA_ARCH=900), and optimizer (-opt-arch=sm_90). The NVVMReflect pass must read the SM version from the target machine configuration (the optimizer-level value), not from the -R preprocessor flags. A reimplementation that reads from the wrong layer may get a stale or mismatched value, especially in LTO scenarios where the preprocessor flags were consumed during an earlier compilation phase.

Cross-References

NVVM IR Verifier (Deep Dive)

The NVVM IR Verifier (nvvm-verify) is NVIDIA's three-layer correctness gate that runs between optimization passes throughout the CICC pipeline. Unlike LLVM's generic Verifier pass, which validates structural IR invariants, this pass enforces the complete NVVM IR contract: valid target triples, legal address space usage, architecture-gated intrinsic availability, MMA dimension/type constraints, function attribute restrictions, and atomic operation rules. It is the single largest verification subsystem in CICC at approximately 230KB across three cooperating functions. The verifier is inserted at roughly a dozen points in every optimization tier, guarded only by NVVMPassOptions[600] (disable). Every NVVM intrinsic call, every address space cast, and every unsupported CPU-oriented feature triggers a check here; failure produces a diagnostic message and sets the module error flag, but compilation continues to collect as many errors as possible in a single run.

Key Facts

Property | Value
Pass name | nvvm-verify
Pass class | llvm::NVVMIRVerifierPass
Registration | sub_2342890 (New PM), sub_12E54A0 (pipeline builder)
Entry point | sub_12D4560
Module verifier | sub_2C80C90 (51KB, ~1671 lines)
Function verifier | sub_2C771D0 (36KB, ~1165 lines)
Intrinsic verifier | sub_2C7B6A0 (143KB, ~4139 lines)
Binary size | ~230KB decompiled
Pipeline slot | ~12 per tier (O1-O3), after GVN, after DSE, after LICM, etc.
Disable flag | NVVMPassOptions[600] (bool)
Primary knobs | nvvm-verify-show-info
Error model | Accumulate-and-continue (no early abort)
SM encoding | Internal SM * 10 (e.g., sm_90 = 900) at context offset +8
Upstream equivalent | None -- fully proprietary

Three-Layer Verification Architecture

The pass operates as three nested verification functions. The module verifier is the entry point; it calls the function verifier once per function, and the function verifier dispatches to the intrinsic verifier for every intrinsic call instruction.

sub_2C80C90 (NVVMModuleVerifier)
  |
  +-- Validate data layout string
  +-- Validate target triple against whitelist
  +-- sub_2C797D0() for each global variable
  +-- sub_2C7A130() for each function declaration
  +-- sub_2C7AA20() for each named metadata node
  |
  +-- For each function:
  |     |
  |     +-- sub_2C771D0 (NVVMFunctionVerifier)
  |     |     +-- Cluster dimension validation (Hopper+ gate)
  |     |     +-- Parameter width validation (>=32-bit or sext/zext)
  |     |     +-- Function attribute rejection (17 attributes)
  |     |     +-- Entry/exit handler constraints
  |     |
  |     +-- For each instruction in each basic block:
  |           |
  |           +-- Switch on opcode 0x1E..0x60
  |           +-- Opcode 0x55 (intrinsic call) --> sub_2C7B6A0
  |                 (NVVMIntrinsicVerifier, 143KB)
  |                 +-- Switch on intrinsic ID
  |                 +-- SM version gate checks
  |                 +-- Type, address space, constant arg validation
  |                 +-- MMA shape/type cross-validation

Context Object Layout

All three verifiers share a context object passed as the first argument:

Offset | Type | Field
0 | int32 | Mode (0 = standard, 1 = UnifiedNVVMIR)
4 | int32 | Flags
8 | int32 | SM version (SM * 10, e.g., 900 for sm_90)
16 | ptr | Error flag pointer (set on any validation failure)
24 | ptr | Error output stream

Target Triple Whitelist

The module verifier validates the module's target triple against two whitelists depending on mode.

UnifiedNVVMIR Mode (mode == 1) -- Exact Match

Eight triples are accepted:

Triple | Arch | API
nvptx-nvidia-cuda | 32-bit PTX | CUDA
nvptx64-nvidia-cuda | 64-bit PTX | CUDA
nvptx-nvidia-nvcl | 32-bit PTX | OpenCL
nvptx64-nvidia-nvcl | 64-bit PTX | OpenCL
nvsass-nvidia-cuda | SASS direct | CUDA
nvsass-nvidia-nvcl | SASS direct | OpenCL
nvsass-nvidia-directx | SASS direct | DirectX
nvsass-nvidia-spirv | SASS direct | SPIR-V

The nvsass triples confirm that CICC can compile directly to native GPU assembly (SASS) without the PTX intermediate step, and can do so for DirectX shader and SPIR-V/Vulkan shader pipelines. This reveals CICC's role in NVIDIA's shader compiler toolchain beyond CUDA.

Failure message: "Invalid target triple".

Standard Mode (mode != 1) -- Prefix + Suffix Match

The triple must begin with "nvptx-" or "nvptx64-" and end with "-cuda". The middle component is wildcarded.

Failure message: "Invalid target triple (<actual>), must be one of:" followed by "nvptx-*-cuda" and "nvptx64-*-cuda".
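The standard-mode check reduces to a prefix/suffix test, which can be sketched as follows (helper names are ours):

```cpp
#include <cassert>
#include <string>

// Standard mode (mode != 1): the triple must start with "nvptx-" or
// "nvptx64-" and end with "-cuda"; the vendor component is wildcarded.
bool isValidStandardTriple(const std::string& t) {
    auto startsWith = [&](const std::string& p) { return t.rfind(p, 0) == 0; };
    auto endsWith = [&](const std::string& s) {
        return t.size() >= s.size() &&
               t.compare(t.size() - s.size(), s.size(), s) == 0;
    };
    return (startsWith("nvptx-") || startsWith("nvptx64-")) && endsWith("-cuda");
}
```

Note that this accepts nvptx64-nvidia-cuda but rejects nvptx64-nvidia-nvcl, which is only admitted by the UnifiedNVVMIR exact-match whitelist.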

Data Layout Validation

If the module's data layout string is empty: "Empty target data layout, must exist".

Otherwise, sub_2C74F70 parses and validates the layout. On failure, the verifier prints "Example valid data layout:" with reference strings from:

Global | Description
off_4C5D0A0 | 32-bit layout example
off_4C5D0A8 | 64-bit layout example
off_4C5D070 | 64-bit with mixed pointer widths (p3:32:32:32)

Per-Instruction Validation (Module Verifier)

After calling sub_2C771D0 for function-level checks, the module verifier iterates every instruction in every basic block and dispatches on the LLVM IR opcode. The opcode range is 0x1E through 0x60:

Opcode | IR Instruction | Validation
0x1F | call (non-intrinsic) | Calls sub_2C795F0. Checks for "pragma" metadata; rejects "unroll" pragma with: "pragma unroll is not supported. Please use llvm.loop.unroll.count instead". Validates branch pragma operand count.
0x21 | indirectbr | Rejected via sub_2C76F10(ctx, "indirectbr", instr)
0x22 | invoke | Rejected via sub_2C76F10(ctx, "invoke", instr)
0x23 | resume | Rejected via sub_2C76F10(ctx, "resume", instr)
0x3C | alloca | Alignment must be <= 2^23. Address space must be Generic (AS 0): "Allocas are not supported on address spaces except Generic"
0x3D | load | Rejects atomic loads: "Atomic loads/stores are not supported". Rejects tensor memory (AS 6): "Tensor Memory loads/stores are not supported"
0x3E | store | Same atomic and tensor memory checks as load
0x40 | fence | In UnifiedNVVMIR mode: only acq_rel and seq_cst allowed. Otherwise: rejected entirely via sub_2C76F10
0x41 | cmpxchg | Only i32/i64/i128 types. Pointer must be in generic, global, or shared AS
0x42 | (GEP/addrspacecast helper) | Calls sub_2C7AF00
0x4F | addrspacecast | Validates source and target AS are in range. "Cannot cast non-generic pointer to different non-generic pointer" -- at least one side must be AS 0 (generic)
0x55 | call (intrinsic) | Dispatches to sub_2C7B6A0 (NVVMIntrinsicVerifier)
0x5F | landingpad | Rejected: "landingpad" unsupported

The unsupported instructions -- indirectbr, invoke, resume, landingpad -- are CPU exception-handling features with no GPU equivalent. Their rejection at the IR level prevents downstream passes from encountering them.

Address Space Casting Rules

The addrspacecast validation enforces NVIDIA's GPU address space model:

Rule: At least one operand of addrspacecast must be AS 0 (generic).
      Non-generic-to-non-generic casts are illegal.

Legal:   addrspacecast i32* addrspace(0) to i32* addrspace(1)  ; generic -> global
Legal:   addrspacecast i32* addrspace(3) to i32* addrspace(0)  ; shared -> generic
Illegal: addrspacecast i32* addrspace(3) to i32* addrspace(1)  ; shared -> global

The valid address space range check uses the expression ((AS + ~2) & 0xFFFFFF) > 2, which means AS values 0 (generic), 1 (global), and 3 (shared) are always valid for atomic and cast operations. AS 2 (constant) and higher values have restricted usage contexts.
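The core cast rule reduces to a single predicate (the function name is ours; the AS numbers follow the NVVM convention quoted above: 0 generic, 1 global, 3 shared):

```cpp
#include <cassert>

// addrspacecast legality per the rule above: at least one side of the cast
// must be the generic address space (AS 0). Non-generic-to-non-generic
// casts are rejected.
bool legalAddrSpaceCast(unsigned srcAS, unsigned dstAS) {
    return srcAS == 0 || dstAS == 0;
}
```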

Function Attribute Rejection

The function verifier (sub_2C771D0) rejects 17 LLVM function attributes that have no GPU meaning. Each is identified by its LLVM attribute kind ID:

Attr ID | Attribute Name | Error Message
4 | builtin | "builtin function attribute is not supported."
17 | jumptable | "jumptable function attribute is not supported."
20 | naked | "naked function attribute is not supported."
23 | nobuiltin | "nobuiltin function attribute is not supported."
30 | noimplicitfloat | "noimplicitfloat function attribute is not supported."
35 | noredzone | "noredzone function attribute is not supported."
42 | nonlazybind | "nonlazybind function attribute is not supported."
53 | returns_twice | "returns_twice function attribute is not supported."
55 | safestack | "safestack function attribute is not supported."
56 | sanitize_address | "sanitize_address function attribute is not supported."
59 | sanitize_memory | "sanitize_memory function attribute is not supported."
63 | sanitize_thread | "sanitize_thread function attribute is not supported."
69 | ssp | "ssp function attribute is not supported."
70 | sspreq | "sspreq function attribute is not supported."
71 | sspstrong | "sspstrong function attribute is not supported."
86 | alignstack | "alignstack function attribute is not supported."
95 | uwtable | "uwtable function attribute is not supported."

These attributes fall into four categories: (1) CPU ABI (naked, alignstack, noredzone), (2) security hardening (ssp/sspreq/sspstrong, safestack, sanitizers), (3) EH-related (uwtable, returns_twice, personality), and (4) linker features (jumptable, nonlazybind, builtin, nobuiltin). None have GPU equivalents.

Additional Function-Level Checks

Check | Error Message | Notes
Cluster dimensions on pre-Hopper | "Cluster dimensions and cluster maximum blocks are not supported on pre-Hopper Architectures" | SM version <= 899 (i.e., before sm_90)
Cluster dims on non-kernel | "Cluster dimensions and cluster maximum blocks are only allowed for kernel functions" | Checked via sub_CE9220
Partial zero cluster dims | "If any cluster dimension is specified as 0 then all other dimensions must be specified as 0"
Zero max cluster blocks | "Cluster maximum blocks must be non-zero"
Narrow int param without sign attr | "Integer parameter less than 32-bits without sext/zext flag" | PTX requires >=32-bit params
Narrow int return without sign attr | "Integer return less than 32-bits without sext/zext flag"
InReg attribute | "InReg attribute on parameter will be ignored" | Warning only
Nest attribute | "Nest attribute on parameter will be ignored" | Warning only
Explicit section | "Explicit section marker <name> is not allowed."
Explicit alignment | "Explicit alignment is not allowed."
Prefix data | "Prefix data is not allowed." | CPU feature
Prologue data | "Prologue data is not allowed." | CPU feature
Personality function | "Personality function is not allowed." | EH feature
GC names | "GC names are not supported."
Non-void kernel/entry | "non-void entry function." | Return type must be void
Entry with params | "entry function with parameters." | Non-kernel entries only
Non-void exit handler | "non-void exit handler function."
Exit handler with params | "exit handler function with parameters."

Architecture Gates (SM-Gated Features)

The intrinsic verifier (sub_2C7B6A0) uses the SM version stored at context offset +8 (encoded as SM*10) to gate feature availability. The threshold checks use <=, so e.g. <= 899 means "below sm_90".

SM Gate | Threshold | Intrinsics / Features | Error Message
sm_70 (Volta) | <= 699 | llvm.nvvm.branch.if.all.convergent (ID 0x205A) | "...not supported on pre-Volta Architectures"
sm_72 (Volta+) | <= 719 | llvm.nvvm.cvt base conversion (ID 0x2106) | "this instrinsic is only supported for Volta (sm_72)+"
sm_75 (Turing) | <= 749 | cvt extended types -- BF16, TF32 conversions (within ID 0x2106) | "conversion type only supported for Turing (sm_75)+"
sm_80 (Ampere) | <= 799 | llvm.nvvm.branch.if.convergent (ID 0x205B) | "...not supported on pre-Ampere Architectures"
sm_89 (Ada) | <= 889 | Extended type conversion intrinsic (ID 0x2107) | "this instrinsic is only supported for Ada (sm_89)+"
sm_90 (Hopper) | <= 899 | TMA, async copy (IDs 0x2279, 0x232D), cluster dims, bulk async (IDs 0x244D-0x2459, 0x2487-0x2489) | "this intrinsic is only supported for Hopper+"
sm_90 (Hopper) | <= 899 | 64-bit pointer requirement for TMA | "this intrinsic is only supported when pointer size is >= 64 bits"
sm_100+ (Blackwell) | <= 1199 | .offset.bindless intrinsics (checked via sub_CEA320) | ".offset.bindless intrinsics are not supported on pre-Blackwell architectures"

Note the typo "instrinsic" in the Volta and Ada messages -- it is present verbatim in the binary. Note also that while the .offset.bindless gate row is labeled sm_100+, the threshold of 1199 admits only encoded values of 1200 and above -- i.e., the consumer Blackwell parts (sm_120/121); the datacenter Blackwell parts (sm_100/103, values 1000/1030) fall below the threshold and are still rejected by the <= 1199 check.
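The gating convention can be summarized in one predicate (helper name is ours): the verifier stores sm_XY as XY*10 at context offset +8, and a feature gated "<= threshold" is rejected on any architecture at or below that encoded value.

```cpp
#include <cassert>

// SM*10 gate check per the table above: true means the feature is rejected
// on this architecture.
bool rejectedByGate(int smEncoded, int threshold) {
    return smEncoded <= threshold;
}
```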

Intrinsic Verification Categories

The intrinsic verifier is a single monolithic switch on the NVVM internal intrinsic ID (stored at function value offset +36). The 143KB function covers 26+ validation categories:

A. Constant Argument Validation

Many NVVM intrinsics require one or more arguments to be compile-time constants (typically mode selectors, masks, or task IDs):

  • "arg0 of intrinsic not constant"
  • "op0 of intrinsic not constant" / "op1 of intrinsic not constant"
  • "Flag argument must be an immediate."
  • "the task_id parameter must be constant"
  • "the mask parameter must be constant"
  • "Mode operand must be constant"

B. Rounding Mode Validation

Rounding mode encoding: bits[2:0] of the mode word
Valid range: 1..4 (round-to-nearest-even, round-down, round-up, round-to-zero)
Reject: value == 0 or value > 4
Message: "rounding mode not a valid value"

C. Subword Mode Validation

For conversion intrinsics that operate on sub-word portions:

Source subword mode:  bits[9:7], valid range 0..2
Dest subword mode:    bits[12:10], valid range 0..2
Messages: "src subword mode not a valid value"
          "dest subword mode not a valid value"
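The bitfield layouts in sections B and C decode as follows (function names are ours):

```cpp
#include <cassert>
#include <cstdint>

// Rounding mode lives in bits[2:0] of the mode word, valid 1..4;
// source subword mode in bits[9:7] and destination subword mode in
// bits[12:10], each valid 0..2.
bool validRoundingMode(uint32_t mode) {
    uint32_t r = mode & 7;               // bits[2:0]
    return r >= 1 && r <= 4;
}
uint32_t srcSubwordMode(uint32_t mode) { return (mode >> 7) & 7; }   // bits[9:7]
uint32_t dstSubwordMode(uint32_t mode) { return (mode >> 10) & 7; }  // bits[12:10]
```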

D. Reserved Bits Checking

Multiple locations verify that high/reserved bits in mode words are zero:

  • "reserved flag bits used"

This preserves forward compatibility: if NVIDIA later assigns meaning to a currently reserved field, IR that already left those bits zero remains valid, while IR that set them nonzero would have silently changed behavior.

E. Address Space Validation

Intrinsics that access memory enforce specific address space requirements:

Check | Message
Global pointer required | "pointer address space not global"
Invalid arg1 address space | "arg1 invalid addrspace"
Arg0 must be pointer | "arg0 of intrinsic not pointer"
Constant AS required | "Operand must be in constant address space"
Memcpy/memmove targets constant AS | "memmove/memcpy cannot target constant address space"
Memset targets constant AS | "memset cannot point to constant address space"
Stack ops require local AS (5) | "llvm.nvvm.stackrestore is only supported with local address space pointers"
Stack ops require local AS (5) | "llvm.nvvm.stacksave is only supported with local address space pointers"

F. Type Validation

Check | Message
bswap operand | "Invalid type for bswap, need i16, i32, or i64"
ctpop/ctlz/cttz operand | "Invalid type for ctpop/ctlz/cttz, need i8, i16, i32, ..." (i64)
Arithmetic overflow | "Invalid type for arithmetic overflow intrinsic, need i16, i32, or i64"
Inline asm type | "Invalid type in inline assembly, must be i1, i8, i16, i32, i64, float, or double"
MMA element | "op1 of intrinsic not containing f32 or i32 element"

Inline assembly type validation uses a bitmask check: valid bit widths are 1, 8, 16, 32, 64 (encoded as 0x1000000010001 for fast lookup).

G. Atomic Intrinsic Validation

Check | Message
CAS opcode mismatch | "the opcode of atomic_cas must be CAS"
RMW opcode error | "the opcode of atomic_rmw must not be CAS, CAST or CAST_SPIN"
CAST opcode error | "the opcode of atomic_cast must be CAST or CAST_SPIN"
CAST type restriction | "atomic.cast only overloads on i32 and i64"
CAST pointer restriction | "atomic.cast is only allowed on shared pointers"
CAST ordering restriction | "atomic.cast works on shared memory, so cannot be ordered"
Global ordering scope | "Global ordering on atomics is only allowed on generic/global pointers"
Ordering mode | "ordering mode not a valid value"
Scope mode | "scope mode not a valid value"
Cache hint | "Cache operation hint not a valid value"
Operation mode | "operation mode not a valid value"

H. Texture/Surface Validation

Check | Message
Texture dimensionality | "dimensionality not a valid value"
LOD adjust | "LOD Adjust mode not a valid value"
Binding mode | "Binding Mode is not a valid value"
Border mode | "border mode not a valid value"
Address mode | "address mode not a valid value"
Scope | "scope not a valid value"
Semantic mode | "semantic mode not a valid value"
Query mode | "query mode is not a valid value"
Handle source | "Op0 of nvvm.texsurf.handle must be a metadata wrapper around a tex/surf GlobalVariable"
Deprecated desc | "Desc parameter is deprecated and should be undef." (IDs 8937, 9549)

I. SATF (Saturate-to-Finite) Validation

For math intrinsics with saturation control (IDs 0x2281-0x229C, covering fma/mul/add variants):

Message: "satf operand must be a constant zero"

The satf parameter was deprecated but the intrinsic signatures retain it for ABI compatibility. The verifier enforces it must be zero.

J. Constant Load Validation

For ID 0x2310 (constant bank load):

Check | Message
Load kind | "Invalid constant load kind"
Bound bank type | "Bound bank must be i32"
Bindless bank type | "Bindless bank must be i64"

K. TMA/Shared Memory Validation

For IDs 0x2319-0x231B:

Check | Message
Column-major restriction | "ColMajor is not supported for this size"
Size encoding | "Invalid size" (bits[3:1] > 4)

L. Load Bounds Check

For ID 0x231C:

Validation: (value & 7) must be <= 2
Message: "invalid load bounds check type"
Also: "pointer address space not global"

M. Convergent Branch Result Validation

For IDs 8282 (llvm.nvvm.branch.if.all.convergent) and 8283 (llvm.nvvm.branch.if.convergent):

Message: "result of llvm.nvvm.branch.if.convergent and
          llvm.nvvm.branch.if.all.convergent can only be
          used by exactly one branch instruction"

This enforces that the convergent branch intrinsic's boolean result flows directly to a single terminator branch, preventing misuse that would break convergence guarantees.

N. MMA (Matrix Multiply-Accumulate) Validation

The most complex validation category (ID 0x2366 = 9062). Validates WMMA/MMA intrinsics against a multidimensional constraint space:

Opcode byte encoding:

Byte | Bits | Field
byte0 | [2:0] | Rounding mode
byte0 | [7:4] | MMA opcode
byte1 | all | A matrix element type (1-13, lookup via dword_43A2620)
byte2 | all | B matrix element type
byte4 | all | MNK dimension encoding (cases 1-0x19)
byte5 | all | Additional type info

MNK dimension decoding (selected cases):

Encoding | M | N | K | Notes
1 | 8 | 8 | 8 | Legacy HMMA
0x10 | 16 | 8 | 8 |
0x17 | 16 | 8 | 16 |
0x18 | 32 | 8 | 8 |
0x19 | 16 | 8 | 16 |

Validation checks:

Check | Message
MNK dimensions | "Invalid MMA MNK"
A element type | "Invalid MMA AType"
Fragment A bit width | "Invalid MMA FragASize"
Fragment B bit width | "Invalid MMA FragBSize"
Fragment C bit width | "Invalid MMA FragCSize"
Fragment A IR type | "Invalid fragA type"
Rounding mode | "Invalid MMA Rounding Mode"
MMA opcode | "Invalid MMA Opcode"
A/B type match | "Mismatched MMA A B Type"
Fragment element consistency | "Mismatched fragA, fragB and fragC element type"
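The byte0 fields and the MNK encodings listed above can be decoded with a small sketch (function names are ours; the switch covers only the table's excerpted cases):

```cpp
#include <cassert>
#include <cstdint>

// byte0[2:0] carries the rounding mode, byte0[7:4] the MMA opcode.
uint8_t mmaRounding(uint8_t byte0) { return byte0 & 7; }
uint8_t mmaOpcode(uint8_t byte0)   { return (byte0 >> 4) & 0xF; }

struct MNK { int m, n, k; };

// Decode byte4 per the selected cases above; {0,0,0} marks encodings
// outside this excerpt (the real table has cases 1..0x19).
MNK decodeMNK(uint8_t enc) {
    switch (enc) {
        case 0x01: return {8, 8, 8};    // legacy HMMA
        case 0x10: return {16, 8, 8};
        case 0x17: return {16, 8, 16};
        case 0x18: return {32, 8, 8};
        case 0x19: return {16, 8, 16};
        default:   return {0, 0, 0};
    }
}
```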

O. Type Conversion Validation

For IDs 0x2106 and 0x2107:

Conversion type: bits[3:1], must be 1..4
Messages: "conversion type not a valid value"
          "Invalid dst type" / "Invalid src type"
          "Src and dst type must be different types"
          "Src and dst type must be different bit widths"

P. Other Validation Categories

Category | IDs | Key Messages
Coroutine | -- | "llvm.nvvm.coro.create.suspend must have exactly one argument, which must be a constant integer"
Subop mode | 9383-9384 | "Invalid subop mode" (bits[3:1] > 5)
Geometry output | -- | "geometry out mode not a valid value", "op1 of GeometryOut intrinsic must be constant when CUT mode", "op1 of GeometryOut intrinsic must be 0 when CUT mode"
Syncwarp | -- | "syncwarp mode not a valid value"
Cache operations | -- | "invalid cache type", "invalid cache op"
Wait intrinsic | -- | "Invalid wait mode"
ISBE | 0x2BC1 (11201) | "Only writes to MAP or ATTR are supported", "Cannot write to input ISBE"
Unsupported fallback | -- | "Unsupported intrinsic: <name>"

Cmpxchg Restrictions

The module verifier enforces strict constraints on cmpxchg:

Allowed types:  i32, i64, i128
Allowed spaces: generic (AS 0), global (AS 1), shared (AS 3)

Messages:
  "Atomic operations on non-i32/i64/i128 types are not supported"
  "cmpxchg pointer operand must point to generic, global, or shared address space"

This rules out i8/i16 atomics (hardware does not support sub-word CAS) and atomics on constant/local address spaces.
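Both constraints combine into a single legality predicate (helper name is ours):

```cpp
#include <cassert>

// cmpxchg legality per the rules above: the operand width must be 32, 64,
// or 128 bits, and the pointer must be in generic (0), global (1), or
// shared (3) address space.
bool cmpxchgLegal(unsigned bits, unsigned addrSpace) {
    bool typeOK = (bits == 32 || bits == 64 || bits == 128);
    bool asOK = (addrSpace == 0 || addrSpace == 1 || addrSpace == 3);
    return typeOK && asOK;
}
```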

Tensor Memory Restrictions

Load and store instructions targeting address space 6 (tensor memory) are rejected at the IR level:

Message: "Tensor Memory loads/stores are not supported"

Tensor memory access is handled through dedicated intrinsics (TMA/cp.async) rather than generic load/store instructions. The verifier enforces this indirection.

Pipeline Placement

The NVVMVerifier is inserted repeatedly throughout the optimization pipeline, not just once. In the pipeline assembler (sub_12E54A0), it appears after nearly every major optimization pass, gated by !NVVMPassOptions[600]:

Position | After Pass | Notes
10 (O1 tier) | GVN | Verify IR after value numbering
After DSE | Dead Store Elimination | Verify after store removal
After EarlyCSE | Early CSE | O2+ only
After LoopIndexSplit | Loop Index Split | O2+ only
After NVVMReflect | NVVM Reflect | Common tail
After LICM | Loop-Invariant Code Motion | Common tail
After LowerSwitch | Switch lowering | Final position in common tail

This aggressive re-verification catches bugs introduced by any optimization pass. In debug/development builds, this is the primary mechanism for detecting optimizer-introduced IR invalidity.

Configuration

Knob | Storage | Type | Default | Description
NVVMPassOptions[600] | opts array | bool | false | When true, disables ALL NVVMVerifier insertions in the pipeline
nvvm-verify-show-info | ctor_257 | bool | false | Enables informational messages (e.g., "IR Kind is UnifiedNVVMIR")

Diagnostic Infrastructure

Error messages are produced through a chain of helper functions:

Function | Role
sub_2C764C0 | Create diagnostic message with severity level
sub_2C76A00 | Create error diagnostic for a specific instruction
sub_2C76240 | Flush diagnostic to error stream
sub_2C76F10 | Report an unsupported instruction by name (takes a string literal like "indirectbr")
sub_904010 | Append string to diagnostic buffer
sub_CB6200 | Write raw bytes to output buffer
sub_CB5AE0 | Flush buffer

The error model is accumulate-and-continue: the verifier sets the error flag at context offset +16 and writes the diagnostic, but does not abort. This allows a single verification run to report all errors in the module.
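The accumulate-and-continue model can be sketched as follows; the types and function names are ours, standing in for the real context object whose error flag lives at offset +16:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Each failed check sets the shared error flag and appends a diagnostic,
// but verification never aborts early, so one run reports every error.
struct VerifierCtx {
    bool errorFlag = false;
    std::vector<std::string> diagnostics;
    void report(const std::string& msg) {
        errorFlag = true;
        diagnostics.push_back(msg);   // written to the error stream; no abort
    }
};

// Runs two deliberately failing checks and returns how many diagnostics
// accumulated, demonstrating that the first failure does not stop the run.
int demoRun() {
    VerifierCtx ctx;
    ctx.report("indirectbr is not supported");
    ctx.report("Invalid target triple");
    return ctx.errorFlag ? static_cast<int>(ctx.diagnostics.size()) : 0;
}
```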

Function Map

Function | Address | Size | Role
NVVMModuleVerifier | sub_2C80C90 | 51KB | Module entry: triples, data layout, per-instruction dispatch
NVVMFunctionVerifier | sub_2C771D0 | 36KB | Function-level: attributes, params, cluster dims, entry funcs
NVVMIntrinsicVerifier | sub_2C7B6A0 | 143KB | Intrinsic-level: SM gates, types, MMA, atomics, tex/surf
NVVMVerifier pass wrapper | sub_12D4560 | small | Pipeline entry point, creates context, invokes module verifier
Verify global variable | sub_2C797D0 | -- | Per-global validation
Verify function declaration | sub_2C7A130 | -- | Checks function declarations (not definitions)
Verify named metadata | sub_2C7AA20 | -- | Named metadata validation
Verify address space cast | sub_2C7AF00 | -- | addrspacecast / GEP rule checker
Verify generic call | sub_2C795F0 | -- | Non-intrinsic call validation, pragma check
Report unsupported instruction | sub_2C76F10 | -- | Produces "<name> is not supported" diagnostics
Is kernel function? | sub_CE9220 | -- | Checks kernel calling convention
Extract cluster dimensions | sub_CE8EA0 | -- | Reads cluster dims from function metadata
Extract cluster max blocks | sub_CE9030 | -- | Reads max cluster blocks from metadata
Check function attribute | sub_A73ED0 | -- | Tests presence of attribute by ID
Is .offset.bindless? | sub_CEA320 | -- | Blackwell gate predicate
Get intrinsic name string | sub_BD5D20 | -- | Returns intrinsic name for error messages
Get integer bit width | sub_BCAE30 | -- | Type query helper
Compute total bit width | sub_CA1930 | -- | Aggregate/vector width computation

Cross-References

NVVM Intrinsic Lowering

The NVVMIntrinsicLowering pass is a pattern-matching rewrite engine that transforms NVVM intrinsic calls into equivalent sequences of standard LLVM IR operations. NVVM IR uses hundreds of target-specific intrinsics (llvm.nvvm.*) for GPU-specific operations -- texture/surface access, warp shuffles, type conversions, wide vector manipulations, barrier synchronization, and tensor core primitives. These intrinsics encode NVIDIA-specific semantics that have no direct LLVM IR equivalent. This pass bridges the gap: it matches each intrinsic call against a database of lowering rules and, when a match is found, replaces the call with a combination of standard LLVM instructions (shufflevector, extractelement, insertelement, bitcast, arithmetic) that express the same semantics in a form amenable to standard LLVM optimization passes.

The pass runs repeatedly throughout the pipeline -- up to 10 times in the "mid" compilation path -- because other optimization passes (NVVMReflect, InstCombine, inlining) can expose new intrinsic calls or simplify existing ones into forms that become lowerable. Two distinct invocation levels exist: level 0 for basic intrinsic lowering, and level 1 for barrier-related intrinsic lowering that must happen after barrier analysis infrastructure is in place.

Pass factory | sub_1CB4E40 (creates pass instance with level parameter)
Core engine | sub_2C63FB0 (140KB, 2,460 lines)
Pass type | FunctionPass (Legacy PM)
Registration | Legacy PM only (not separately registered in New PM); invoked from pipeline assembler
Runtime positions | Tier 1/2/3 #1, #3, #28, #50, #64 (level 1); "mid" path has 4 level-0 invocations (see Pipeline)
NVVMPassOptions slot | 99 (offset 2000, BOOL_COMPACT, default = 0 = enabled)
Disable flag | opts[2000] = 1 disables all invocations
Level parameter | 0 = basic lowering, 1 = barrier-aware lowering
Iteration limit | 30 (global qword_5010AC8)
Upstream equivalent | None -- entirely NVIDIA-proprietary
Address range | 0x2C4D000--0x2C66000 (lowering engine cluster)

Algorithm

Entry and Dispatch

The pass factory sub_1CB4E40 takes a single integer parameter -- the lowering level. Level 0 performs basic intrinsic lowering (type conversions, vector decomposition, shuffle lowering). Level 1 adds barrier-related intrinsic lowering that depends on barrier analysis having already run. The factory allocates a pass object and stores the level in it; the pass entry point reads this level to filter which intrinsics are candidates for lowering.

At the core engine (sub_2C63FB0), the entry check validates that the instruction is an intrinsic call: the byte at call->arg_chain->offset_8 must equal 17 (intrinsic call marker), and call->offset_16 must be non-null (the callee exists). If either check fails, the function returns 0 (no lowering performed).

Pattern-Matching Rewrite Loop

The algorithm operates as a worklist-driven rewrite system:

function lowerIntrinsic(ctx, call, level):
    if not isIntrinsicCall(call): return 0
    if not hasCallee(call): return 0

    operands = collectOperands(call)        // v285/v286 arrays
    worklist_direct = []                     // v288: direct operand replacements
    worklist_typed  = []                     // v294: type-changed operands
    worklist_shuf   = []                     // v300: shuffle/reorganized operands

    iterations = 0
    while iterations < qword_5010AC8:       // default 30
        iterations++

        // Phase 1: build candidate lowerings
        candidates = buildCandidates(operands)   // sub_2C4D470
        for each candidate in candidates:
            pattern = extractPattern(candidate)  // sub_2C4D5A0

            // Phase 2: type compatibility check
            if not checkTypeCompat(pattern, operands):  // sub_AD7630
                continue

            // Phase 3: operand matching
            if not matchOperands(pattern, operands):
                continue

            // Phase 4: additional pattern checks
            if not additionalChecks(pattern):    // sub_2C50020
                continue

            // Phase 5: core lowering -- create replacement
            replacement = buildReplacement(      // sub_2C515C0
                ctx, operands,
                worklist_direct, worklist_typed, worklist_shuf)

            // Phase 6: substitute
            replaceAllUses(call, replacement)    // sub_BD84D0
            transferMetadata(call, replacement)  // sub_BD6B90
            queueForDeletion(call)               // sub_F15FC0
            return 1

    return 0  // no lowering found within iteration limit

The iteration limit of 30 (stored in qword_5010AC8) exists because lowering one intrinsic can produce new intrinsic calls that themselves need lowering. For example, lowering a wide vector intrinsic into narrower operations may produce calls to narrower intrinsics. Without the limit, pathological patterns could cause infinite expansion. In practice, most intrinsics lower in a single iteration; the limit is a safety net.
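
The capped fixed-point behavior can be sketched in a few lines of Python. This is purely illustrative: the rule table and intrinsic names below are invented for the example, not recovered from the binary; only the 30-iteration cap (qword_5010AC8) comes from the analysis above.

```python
MAX_ITERATIONS = 30  # analogue of qword_5010AC8

# Hypothetical rules: lowering one intrinsic may emit new, narrower intrinsics
RULES = {
    "nvvm.wide.op":   ["nvvm.narrow.op", "nvvm.narrow.op"],
    "nvvm.narrow.op": ["mul"],
}

def lower_all(worklist):
    """Rewrite until a fixed point is reached or the iteration cap is hit."""
    for _ in range(MAX_ITERATIONS):
        changed = False
        result = []
        for op in worklist:
            if op in RULES:
                result.extend(RULES[op])  # lowering exposes new intrinsics
                changed = True
            else:
                result.append(op)
        worklist = result
        if not changed:                   # fixed point: nothing left to lower
            return worklist
    return worklist                       # safety net: cap reached

lowered = lower_all(["nvvm.wide.op"])     # two iterations: wide -> narrow -> mul
```

A self-referential rule (an intrinsic that lowers to itself plus something new) would loop forever without the cap, which is exactly the pathological expansion the limit guards against.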

Three Worklist Structures

The rewrite engine maintains three parallel worklist structures that categorize how operands are transformed:

Worklist | Variable | Purpose
Direct | v288 | Operands that pass through unchanged -- same value, same type
Type-changed | v294 | Operands that need a type conversion (e.g., NVVM-specific type to standard LLVM type)
Shuffle/reorganized | v300 | Operands that need positional rearrangement (vector lane reordering, element extraction)

When sub_2C515C0 builds the replacement instruction, it reads all three worklists to assemble the final operand list: direct operands are copied verbatim, type-changed operands go through a bitcast or type conversion, and shuffle operands are processed through a shufflevector or extractelement/insertelement sequence.
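
The three-worklist assembly can be modeled abstractly. This sketch is not the decompiled logic of sub_2C515C0; it only shows the categorization, with operand values and type names as placeholder strings:

```python
def build_operands(direct, typed, shuffled):
    """Assemble a replacement operand list from the three worklists."""
    ops = []
    ops.extend(direct)                                      # copied verbatim (v288)
    ops.extend(f"bitcast({v} to {t})" for v, t in typed)    # type conversion (v294)
    ops.extend(f"shufflevector({v}, mask={m})"              # lane rearrangement (v300)
               for v, m in shuffled)
    return ops

ops = build_operands(
    direct=["%a"],
    typed=[("%b", "float")],
    shuffled=[("%c", [1, 0])],
)
```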

Lowering Categories

Vector Operation Decomposition

Wide vector NVVM intrinsics (operating on v4f32, v2f64, v4i32, etc.) are decomposed into sequences of narrower operations. The NVVM IR frontend emits vector intrinsics to express data-parallel GPU operations, but the NVPTX backend's instruction selector handles scalar or narrow-vector operations more efficiently.

The decomposition pattern:

// Before: single wide-vector intrinsic call
%result = call <4 x float> @llvm.nvvm.wide.op(<4 x float> %a, <4 x float> %b)

// After: four scalar operations + vector reconstruction
%a0 = extractelement <4 x float> %a, i32 0
%a1 = extractelement <4 x float> %a, i32 1
%a2 = extractelement <4 x float> %a, i32 2
%a3 = extractelement <4 x float> %a, i32 3
%b0 = extractelement <4 x float> %b, i32 0
...
%r0 = call float @llvm.nvvm.narrow.op(float %a0, float %b0)
%r1 = call float @llvm.nvvm.narrow.op(float %a1, float %b1)
...
%v0 = insertelement <4 x float> undef, float %r0, i32 0
%v1 = insertelement <4 x float> %v0,   float %r1, i32 1
...

This decomposition enables scalar optimizations (constant folding, CSE) to work on individual lanes, and the narrower intrinsics may themselves lower in subsequent iterations -- hence the iteration limit.

Shuffle Vector Lowering

When an NVVM intrinsic performs pure data reorganization -- lane permutation, broadcast, or subvector extraction -- without any arithmetic, the pass replaces it with an LLVM shufflevector instruction. The core lowering for this path goes through sub_DFBC30, which takes:

sub_DFBC30(context, operation=6, type_info, shuffle_indices, count, flags)

The operation=6 constant identifies this as a shufflevector creation. The shuffle_indices array encodes the lane mapping: for a warp shuffle that broadcasts lane 0 to all lanes, the mask would be <0, 0, 0, 0, ...>. For a rotation, it might be <1, 2, 3, 0>.
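
The two mask shapes mentioned above (broadcast and rotation) are easy to construct programmatically; this helper sketch shows how shuffle index lists for them would be computed:

```python
def broadcast_mask(src_lane, width):
    """Mask that replicates one source lane across all lanes."""
    return [src_lane] * width

def rotate_mask(amount, width):
    """Mask that rotates lanes left by `amount` positions."""
    return [(i + amount) % width for i in range(width)]

broadcast_mask(0, 4)  # every lane reads lane 0
rotate_mask(1, 4)     # lane i reads lane i+1, wrapping around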

Shuffle lowering handles several NVVM intrinsic families:

  • Warp-level shuffle operations (__shfl_sync, __shfl_up_sync, etc.) when the shuffle amount is a compile-time constant
  • Subvector extraction from wider types (e.g., extracting the low v2f16 from a v4f16)
  • Lane broadcast patterns used in matrix fragment loading

Type Conversion Lowering

NVVM defines intrinsic-based type conversions for types that LLVM's standard type system does not directly support, such as:

  • BF16 (bfloat16) to/from FP32 -- intrinsic ID 0x2106, gated by sm_72+
  • TF32 (tensorfloat32) conversions -- intrinsic ID 0x2106 with conversion type 3+, gated by sm_75+
  • FP8 (E4M3/E5M2) conversions -- intrinsic ID 0x2107, gated by sm_89+ (Ada)
  • Extended type conversions with saturate, rounding mode control

The lowering replaces these intrinsic calls with sequences of:

  • bitcast for reinterpretation between same-width types
  • fptrunc / fpext for standard floating-point width changes
  • trunc / zext / sext for integer width changes
  • Arithmetic sequences for rounding mode emulation when the hardware rounding mode is not directly expressible

The sub_2C52B30 helper ("get canonical type") resolves NVVM-specific type encodings to their standard LLVM Type* equivalents during this process.

Multi-Run Pattern

NVVMIntrinsicLowering appears more times in the compilation pipeline than any other NVIDIA custom pass. In the "mid" path (standard CUDA compilation), it runs approximately 10 times across the main path and the Tier 1/2/3 sub-pipelines. The pattern reveals a deliberate interleaving strategy.

"mid" Path Invocations (Level 0)

All four invocations in the main "mid" path use level 0 and are guarded by !opts[2000]:

Position | Context | Preceding Pass | Following Pass | Purpose
1st | Early pipeline | ConstantMerge | MemCpyOpt | Lower intrinsics from the original IR before SROA/GVN operate
2nd | After InstCombine + standard pipeline #5 | LLVM standard #5 | DeadArgElim | Re-lower intrinsics that InstCombine may have simplified or inlining may have exposed
3rd | After NVVMReflect + standard pipeline #8 | LLVM standard #8 | IPConstProp | Lower intrinsics whose arguments became constant after NVVMReflect resolved __nvvm_reflect() calls
4th | Late pipeline | LICM | NVVMBranchDist | Final cleanup of any remaining lowerable intrinsics before register-pressure-sensitive passes

Tier 1/2/3 Invocations (Level 1)

Within the sub_12DE8F0 tier sub-pipeline, the pass runs with level 1 at five distinct points:

Position | Context | Notes
1st | Tier entry | Immediately at tier start -- lower barrier-related intrinsics before barrier analysis
2nd | After 1st NVVMIRVerification | Re-lower after verification may have canonicalized IR
3rd | After CVP + NVVMVerifier + NVVMIRVerification | Post-optimization cleanup of barrier intrinsics
4th | After LoopUnswitch + standard pipeline #1 | Re-lower intrinsics exposed by loop transformations
5th | After DSE + DCE + standard pipeline #1 | Final tier cleanup before MemorySpaceOpt

Each tier (1, 2, and 3) runs this same sequence independently, so in a full compilation with all three tiers active, level-1 lowering executes up to 15 times total.

Level Parameter Semantics

The level parameter partitions the intrinsic lowering rules into two sets:

Level 0 -- Basic lowering. Handles intrinsics whose lowering depends only on the intrinsic's operands and types. This includes vector decomposition, shuffle lowering, and standard type conversions. These are safe to run at any point in the pipeline because they have no dependencies on analysis results. The "mid" path runs level 0 exclusively.

Level 1 -- Barrier-aware lowering. Handles intrinsics related to synchronization barriers (__syncthreads, __syncwarp, barrier-guarded memory operations) whose lowering must coordinate with the barrier analysis infrastructure. In the tier sub-pipeline, level 1 runs at the entry point before NVVMBarrierAnalysis and NVVMLowerBarriers, and again after those passes have run. This two-phase pattern within the tier ensures that:

  1. Barrier intrinsics are lowered to a canonical form that the barrier analysis can recognize
  2. After barrier analysis and lowering, any residual barrier-related intrinsics are cleaned up

The reason level 1 is restricted to tiers rather than the main "mid" path: the tier sub-pipeline (sub_12DE8F0) sets up the barrier analysis state (via sub_18E4A00 / sub_1C98160) that level 1 lowering depends on. Running level 1 in the main path before this state exists would produce incorrect results.

Interaction with NVVMReflect

NVVMReflect resolves compile-time queries about the target GPU architecture:

%arch = call i32 @llvm.nvvm.reflect(metadata !"__CUDA_ARCH__")
; After NVVMReflect: %arch = i32 900  (for sm_90)

This resolution has a cascading effect on intrinsic lowering. Many NVVM intrinsics are conditionally emitted by the frontend behind architecture checks:

if (__CUDA_ARCH__ >= 900) {
    // Hopper-specific intrinsic
    __nvvm_tma_load_async(...);
} else {
    // Fallback path using standard loads
}

After NVVMReflect replaces the architecture query with a constant, and nvvm-reflect-pp (SimplifyConstantConditionalsPass) eliminates the dead branch, the surviving path may contain intrinsics that were previously unreachable. The pipeline runs NVVMIntrinsicLowering after NVVMReflect specifically to catch these newly-exposed intrinsics. This is why the 3rd invocation in the "mid" path immediately follows NVVMReflect + LLVM standard pipeline #8.
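
The cascade can be simulated abstractly. This sketch stands in for NVVMReflect plus the dead-branch cleanup; the operation names are taken from the example above, but the representation (lists of opcode strings) is an invention for illustration:

```python
def resolve_reflect_and_prune(arch, threshold, then_ops, else_ops):
    """Model NVVMReflect + SimplifyConstantConditionals on one arch-gated branch."""
    cond = arch >= threshold        # reflect query folded to a constant comparison
    return then_ops if cond else else_ops  # only the live branch survives

# sm_90: the Hopper-only intrinsic becomes reachable and must now be lowered
surviving = resolve_reflect_and_prune(
    arch=900, threshold=900,
    then_ops=["llvm.nvvm.tma.load.async"],
    else_ops=["load"],
)
```

Running intrinsic lowering before this resolution would never see the TMA call; running it after is what the 3rd "mid"-path invocation exists for.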

Configuration

NVVMPassOptions

Slot | Offset | Type | Default | Semantics
98 | 1976 | STRING | (empty) | Paired string parameter for the pass (unused or reserved)
99 | 2000 | BOOL_COMPACT | 0 | Disable flag: 0 = enabled, 1 = disabled

Setting slot 99 to 1 disables all invocations of NVVMIntrinsicLowering across the entire pipeline -- both level 0 and level 1. There is no mechanism to disable one level independently.

Global Variables

Variable | Default | Purpose
qword_5010AC8 | 30 | Maximum iterations per invocation of the rewrite loop

This global is not exposed as a user-facing knob. It is initialized at program startup and is constant for the lifetime of the process.

Key Helper Functions

Pattern Matching

Function | Role
sub_2C4D470 | Build candidate lowering list from intrinsic operands
sub_2C4D5A0 | Extract pattern from candidate -- returns the lowering rule
sub_2C50020 | Additional pattern compatibility checks beyond type matching
sub_2C52B30 | Get canonical LLVM type for an NVVM-specific type encoding
sub_AD7630 | Type-lowering query -- checks if source type can lower to target type

Instruction Construction

Function | Role
sub_2C515C0 | Build replacement instruction from three worklist structures
sub_2C4FB60 | Opcode dispatch -- selects the LLVM opcode for the lowered operation
sub_DFBC30 | Create shufflevector or similar vector IR construct (operation=6)

IR Mutation

Function | Role
sub_BD84D0 | Replace all uses of old instruction with new value
sub_BD6B90 | Transfer metadata from old instruction to replacement
sub_F15FC0 | Queue old instruction for deletion

Pass Infrastructure

Function | Address | Role
sub_1CB4E40 | 0x1CB4E40 | Pass factory -- creates pass with level parameter
sub_2C63FB0 | 0x2C63FB0 | Core lowering engine (140KB, 2,460 lines)

Diagnostic Strings

The core engine at sub_2C63FB0 contains no user-visible diagnostic strings. This is unusual for a 140KB function and reflects the fact that intrinsic lowering is a mechanical pattern-matching operation: either a lowering rule matches (silently applied) or it does not (silently skipped). Failures are not reported because an unlowered intrinsic is not necessarily an error -- it may be handled by a later pass (NVVMLowerBarriers, GenericToNVVM) or by the NVPTX instruction selector directly.

The pass factory sub_1CB4E40 similarly contains no diagnostic strings.

Pipeline Position Summary

sub_12E54A0 (Master Pipeline Assembly)
  │
  ├─ "mid" path (level 0, 4 invocations):
  │    ├─ #1: After ConstantMerge, before MemCpyOpt/SROA
  │    ├─ #2: After InstCombine + LLVM standard #5
  │    ├─ #3: After NVVMReflect + LLVM standard #8
  │    └─ #4: After LICM, before NVVMBranchDist/Remat
  │
  ├─ "ptx" path (level 0, 0 invocations):
  │    └─ (not present -- PTX input already has intrinsics lowered)
  │
  ├─ default path (level 0, 1 invocation):
  │    └─ #1: After NVVMReflect, before NVVMPeephole
  │
  └─ Tier 1/2/3 sub-pipeline (level 1, 5 invocations per tier):
       ├─ #1: Tier entry
       ├─ #2: After NVVMIRVerification
       ├─ #3: After CVP + NVVMVerifier
       ├─ #4: After LoopUnswitch + LLVM standard #1
       └─ #5: After DSE + DCE + LLVM standard #1

Cross-References

FP128/I128 Emulation

No NVIDIA GPU in any SM generation has native 128-bit arithmetic hardware. Neither fp128 (IEEE 754 binary128) nor i128 (128-bit integer) operations can be lowered to PTX instructions directly. CICC handles this by replacing every fp128 and i128 operation in LLVM IR with a call to one of 48 distinct NVIDIA runtime library functions whose implementations live in a separate bitcode module. The pass at sub_1C8C170 walks each function in the module, inspects every instruction, dispatches on the LLVM opcode byte, and emits the appropriate __nv_* call in place of the original operation. This is a correctness-critical legalization pass -- if any fp128/i128 operation survives past it, instruction selection will abort because NVPTX has no patterns for 128-bit types.

The pass is structurally part of lower-ops (LowerOpsPass), NVIDIA's umbrella module pass for lowering operations that the NVPTX backend cannot handle natively. Within the lower-ops framework, sub_1C8C170 is the dedicated handler for 128-bit types. It runs as a module-level pass early in the pipeline, after libdevice linking and before the main optimization sequence, so that the generated calls can be inlined and optimized by subsequent passes.

Entry point: sub_1C8C170
Size: 25 KB (~960 lines decompiled)
Pass framework: Part of lower-ops / LowerOpsPass (module pass)
Registration: New PM slot 144 at sub_2342890; param enable-optimization
Runtime functions: 48 distinct __nv_* library calls
Upstream equivalent: None. Upstream LLVM lowers fp128 through SoftenFloat in type legalization. CICC replaces this with explicit call insertion at the IR level.

Opcode Dispatch

The pass reads the LLVM instruction opcode from the byte at offset +16 of the instruction node and dispatches through a dense switch. The following table lists every handled opcode and the corresponding lowering action. All unlisted opcodes in the range 0x18--0x58 produce an early return (no 128-bit type involvement, or handled elsewhere).

Opcode | LLVM Instruction | Lowering Target | Handler
0x24 | fadd | __nv_add_fp128 | sub_1C8A5C0
0x26 | fsub | __nv_sub_fp128 | sub_1C8A5C0
0x28 | fmul | __nv_mul_fp128 | sub_1C8A5C0
0x29 | udiv | __nv_udiv128 | sub_1C8BD70
0x2A | sdiv | __nv_idiv128 | sub_1C8BD70
0x2B | fdiv | __nv_div_fp128 | sub_1C8A5C0
0x2C | urem | __nv_urem128 | sub_1C8BD70
0x2D | srem | __nv_irem128 | sub_1C8BD70
0x2E | frem | __nv_rem_fp128 | sub_1C8A5C0
0x36 | trunc/ext | Type-based conversion | sub_1C8ADC0
0x3F | fptoui | __nv_fp128_to_uint* or __nv_cvt_f*_u128_rz | sub_1C8ADC0 / sub_1C8BF90
0x40 | fptosi | __nv_fp128_to_int* or __nv_cvt_f*_i128_rz | sub_1C8ADC0 / sub_1C8BF90
0x41 | uitofp | __nv_uint*_to_fp128 or __nv_cvt_u128_f*_rn | sub_1C8ADC0 / sub_1C8BF90
0x42 | sitofp | __nv_int*_to_fp128 or __nv_cvt_i128_f*_rn | sub_1C8ADC0 / sub_1C8BF90
0x43 | fptrunc | __nv_fp128_to_float or __nv_fp128_to_double | sub_1C8ADC0
0x44 | fpext | __nv_float_to_fp128 or __nv_double_to_fp128 | sub_1C8ADC0
0x4C | fcmp | __nv_fcmp_* (predicate-selected) | dedicated

Ignored opcode ranges: 0x18--0x23, 0x25, 0x27, 0x2F--0x35, 0x37--0x3E, 0x45--0x4B, 0x4D--0x58. Opcode 0x37 (store) receives a similar type check as 0x36 but for store target types.
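
The dense-switch dispatch reduces to a table lookup keyed on the opcode byte. This sketch models it with the opcode/function pairs from the table above (the dispatch function itself is illustrative, not decompiled code):

```python
# Opcode byte -> runtime library function, per the dispatch table above
FP128_BINOPS = {0x24: "__nv_add_fp128", 0x26: "__nv_sub_fp128",
                0x28: "__nv_mul_fp128", 0x2B: "__nv_div_fp128",
                0x2E: "__nv_rem_fp128"}
I128_DIVREM = {0x29: "__nv_udiv128", 0x2A: "__nv_idiv128",
               0x2C: "__nv_urem128", 0x2D: "__nv_irem128"}

def dispatch(opcode):
    """Return the runtime call name, or None for an ignored opcode (early return)."""
    if opcode in FP128_BINOPS:
        return FP128_BINOPS[opcode]
    if opcode in I128_DIVREM:
        return I128_DIVREM[opcode]
    return None  # no 128-bit involvement, or handled by a conversion path

dispatch(0x28)  # fmul -> __nv_mul_fp128
dispatch(0x25)  # ignored opcode -> None
```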

Library Function Inventory

FP128 Arithmetic (5 functions)

Binary operations on IEEE 754 binary128. Each takes two fp128 operands, returns fp128.

Function | Operation | String Length
__nv_add_fp128 | fp128 addition | 14
__nv_sub_fp128 | fp128 subtraction | 14
__nv_mul_fp128 | fp128 multiplication | 14
__nv_div_fp128 | fp128 division | 14
__nv_rem_fp128 | fp128 remainder | 14

All five are lowered through sub_1C8A5C0, which constructs the call with a fixed string length of 14 characters.

I128 Division and Remainder (4 functions)

Integer division and remainder for i128. No native PTX instruction exists for 128-bit integer divide.

Function | Operation | Signedness | String Length
__nv_udiv128 | i128 division | unsigned | 12
__nv_idiv128 | i128 division | signed | 12
__nv_urem128 | i128 remainder | unsigned | 12
__nv_irem128 | i128 remainder | signed | 12

Lowered through sub_1C8BD70 with string length 12. Note: i128 add/sub/mul are NOT lowered here -- those can be decomposed into pairs of 64-bit operations by standard LLVM legalization. Only division and remainder require the runtime call path because they involve complex multi-word algorithms.
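
To illustrate why add/sub need no runtime call: a 128-bit addition decomposes into two 64-bit additions plus a carry, which maps directly onto PTX add/addc-style instruction pairs. A reference sketch (values held as (lo, hi) 64-bit halves):

```python
MASK64 = (1 << 64) - 1

def add_i128(a_lo, a_hi, b_lo, b_hi):
    """Add two i128 values represented as unsigned 64-bit (lo, hi) halves."""
    lo = (a_lo + b_lo) & MASK64
    carry = 1 if lo < a_lo else 0          # unsigned overflow of the low half
    hi = (a_hi + b_hi + carry) & MASK64
    return lo, hi

add_i128(MASK64, 0, 1, 0)  # low half overflows, carry propagates into high half
```

Division has no such two-instruction decomposition (it needs a multi-word long-division loop), which is why only div/rem take the runtime-call path.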

FP128-to-Integer Conversions (10 functions)

Convert fp128 to integer types of various widths. The target width is determined by calling the bit-width test sub_1642F90 and reading the type's bit-width field (type_id >> 8).

Function | Conversion
__nv_fp128_to_uint8 | fp128 -> i8 (unsigned)
__nv_fp128_to_uint16 | fp128 -> i16 (unsigned)
__nv_fp128_to_uint32 | fp128 -> i32 (unsigned)
__nv_fp128_to_uint64 | fp128 -> i64 (unsigned)
__nv_fp128_to_uint128 | fp128 -> i128 (unsigned)
__nv_fp128_to_int8 | fp128 -> i8 (signed)
__nv_fp128_to_int16 | fp128 -> i16 (signed)
__nv_fp128_to_int32 | fp128 -> i32 (signed)
__nv_fp128_to_int64 | fp128 -> i64 (signed)
__nv_fp128_to_int128 | fp128 -> i128 (signed)

Integer-to-FP128 Conversions (10 functions)

Convert integer types to fp128.

Function | Conversion
__nv_uint8_to_fp128 | i8 (unsigned) -> fp128
__nv_uint16_to_fp128 | i16 (unsigned) -> fp128
__nv_uint32_to_fp128 | i32 (unsigned) -> fp128
__nv_uint64_to_fp128 | i64 (unsigned) -> fp128
__nv_uint128_to_fp128 | i128 (unsigned) -> fp128
__nv_int8_to_fp128 | i8 (signed) -> fp128
__nv_int16_to_fp128 | i16 (signed) -> fp128
__nv_int32_to_fp128 | i32 (signed) -> fp128
__nv_int64_to_fp128 | i64 (signed) -> fp128
__nv_int128_to_fp128 | i128 (signed) -> fp128

String lengths for both fp128-to-integer and integer-to-fp128 conversions vary from 18 to 21 characters depending on the function name. Lowered through sub_1C8ADC0.

FP128-to-Float/Double Conversions (4 functions)

Truncation and extension between fp128 and the native floating-point types.

Function | Conversion | Opcode
__nv_fp128_to_float | fp128 -> float | 0x43 (fptrunc)
__nv_fp128_to_double | fp128 -> double | 0x43 (fptrunc)
__nv_float_to_fp128 | float -> fp128 | 0x44 (fpext)
__nv_double_to_fp128 | double -> fp128 | 0x44 (fpext)

I128-to-Float/Double Conversions (8 functions)

These handle the non-fp128 path: converting i128 directly to/from float/double without going through fp128 as an intermediate. The _rz suffix denotes round-toward-zero mode; _rn denotes round-to-nearest-even.

Function | Conversion | Rounding | String Length
__nv_cvt_f32_u128_rz | i128 (unsigned) -> float | toward zero | 20
__nv_cvt_f32_i128_rz | i128 (signed) -> float | toward zero | 20
__nv_cvt_f64_u128_rz | i128 (unsigned) -> double | toward zero | 20
__nv_cvt_f64_i128_rz | i128 (signed) -> double | toward zero | 20
__nv_cvt_u128_f32_rn | float -> i128 (unsigned) | to nearest | 20
__nv_cvt_i128_f32_rn | float -> i128 (signed) | to nearest | 20
__nv_cvt_u128_f64_rn | double -> i128 (unsigned) | to nearest | 20
__nv_cvt_i128_f64_rn | double -> i128 (signed) | to nearest | 20

All eight are lowered through sub_1C8BF90 with a fixed string length of 20 characters. The rounding mode choice is deliberate: _rz for integer-from-float (truncation semantics matching C/C++ cast behavior) and _rn for float-from-integer (IEEE 754 default rounding for conversions).
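
The two rounding behaviors are standard numerics, not CICC-specific, so they can be demonstrated directly in Python (whose float-to-int truncation and int-to-float nearest-even rounding match the _rz/_rn split described above):

```python
import math

def cvt_i_from_f_rz(x):
    """Float -> integer, round toward zero: C/C++ cast semantics (the _rz family)."""
    return math.trunc(x)

def cvt_f_from_i_rn(n):
    """Integer -> float, round to nearest-even: IEEE 754 default (the _rn family).
    Python's float(int) already rounds inexact conversions to nearest-even."""
    return float(n)

cvt_i_from_f_rz(-2.9)       # -2: truncation toward zero, not floor (-3)
cvt_f_from_i_rn(2**53 + 1)  # 2**53 + 1 is not representable; rounds to 2**53
```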

The dispatch logic selects between the __nv_fp128_to_* / __nv_*_to_fp128 family and the __nv_cvt_* family based on whether the source or destination type is fp128 (type_id == 5). If neither operand is fp128 but one is i128, the __nv_cvt_* path is taken.

FP128 Comparison Predicates

The fcmp instruction (opcode 0x4C) is dispatched by extracting the comparison predicate from bits 0--14 of the halfword at instruction offset +18. Each LLVM fcmp predicate maps to a dedicated runtime function.

Ordered Comparisons (7 functions)

Ordered comparisons return false if either operand is NaN.

Function | Predicate | Semantics
__nv_fcmp_oeq | oeq | ordered equal
__nv_fcmp_ogt | ogt | ordered greater-than
__nv_fcmp_oge | oge | ordered greater-or-equal
__nv_fcmp_olt | olt | ordered less-than
__nv_fcmp_ole | ole | ordered less-or-equal
__nv_fcmp_one | one | ordered not-equal
__nv_fcmp_ord | ord | ordered (neither is NaN)

Unordered Comparisons (7 functions)

Unordered comparisons return true if either operand is NaN.

Function | Predicate | Semantics
__nv_fcmp_uno | uno | unordered (either is NaN)
__nv_fcmp_ueq | ueq | unordered or equal
__nv_fcmp_ugt | ugt | unordered or greater-than
__nv_fcmp_uge | uge | unordered or greater-or-equal
__nv_fcmp_ult | ult | unordered or less-than
__nv_fcmp_ule | ule | unordered or less-or-equal
__nv_fcmp_une | une | unordered or not-equal

The predicate naming follows the standard LLVM fcmp convention: o prefix = ordered, u prefix = unordered. The 14 predicates cover the complete set of IEEE 754 comparison semantics excluding true and false (which are constant-folded before reaching this pass). Each function takes two fp128 operands and returns i1.
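
The ordered/unordered split follows mechanically from one rule -- o* predicates are "not NaN, and the comparison holds", u* predicates are "NaN, or the comparison holds". A reference model of the 14 predicates (standard LLVM fcmp semantics, not decompiled code):

```python
import math

def fcmp(pred, a, b):
    """Reference semantics for the 14 fcmp predicates on two floats."""
    unordered = math.isnan(a) or math.isnan(b)
    cmp = {"eq": a == b, "gt": a > b, "ge": a >= b,
           "lt": a < b, "le": a <= b, "ne": a != b}
    if pred == "ord":
        return not unordered
    if pred == "uno":
        return unordered
    base = pred[1:]                      # e.g. "olt" -> "lt"
    if pred.startswith("o"):
        return (not unordered) and cmp[base]
    return unordered or cmp[base]        # u-prefixed

fcmp("olt", math.nan, 1.0)  # False: ordered comparison fails on NaN
fcmp("ult", math.nan, 1.0)  # True: unordered comparison succeeds on NaN
```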

Trunc/Ext Handling (Opcode 0x36)

The trunc/zext/sext opcode path requires special logic because it must distinguish between genuine 128-bit truncation/extension and other type conversions that happen to use the same opcode.

sub_1C8C170::handle_trunc_ext(inst):
    if sub_1642F90(*operand, 128):      // Is the operand type 128-bit?
        // Determine source and dest bit-widths from DataLayout
        src_bits = type_id >> 8          // Bit-width encoded in high byte
        dst_bits = target_type_id >> 8
        if src_bits > dst_bits:
            emit_truncation(inst, src_bits, dst_bits)
        else:
            emit_extension(inst, src_bits, dst_bits, is_signed)
    elif type_id == 5:                   // fp128 type marker
        emit_fp128_conversion(inst)
    else:
        return                           // Not a 128-bit operation

The type_id value 5 is the LLVM type tag for fp128 in CICC's internal representation (consistent with the type code table: 1=half, 2=float, 3=double, 4=fp80, 5=fp128, 6=bf16, 0xB=integer with bit-width at type_id >> 8).
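
A decoder for this encoding, as described above, is straightforward; the helper below is an illustration of the scheme, not a recovered routine:

```python
# Low byte = type tag; for integers (tag 0xB) the bit-width lives at type_id >> 8
FP_TAGS = {1: "half", 2: "float", 3: "double", 4: "fp80", 5: "fp128", 6: "bf16"}

def decode_type_id(type_id):
    tag = type_id & 0xFF
    if tag == 0xB:
        return f"i{type_id >> 8}"        # integer: width in the upper bits
    return FP_TAGS.get(tag, "unknown")   # floating-point: direct tag lookup

decode_type_id(5)                # 'fp128' -- the marker tested by the pass
decode_type_id((128 << 8) | 0xB) # 'i128'
```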

Lowering Helpers

Four internal helper functions perform the actual call construction. Each creates a new CallInst with the library function name, replaces all uses of the original instruction with the call result, and erases the original instruction.

Helper | Address | Purpose | Name Length
sub_1C8A5C0 | 0x1C8A5C0 | Binary fp128 arithmetic (add/sub/mul/div/rem) | 14
sub_1C8BD70 | 0x1C8BD70 | Binary i128 division (udiv/idiv/urem/irem) | 12
sub_1C8ADC0 | 0x1C8ADC0 | FP128 conversions (to/from all integer widths, to/from float/double) | 18--21 (varies)
sub_1C8BF90 | 0x1C8BF90 | I128-to/from-float/double conversions | 20

The "name length" column refers to the string length passed to the call construction routine. This is a fixed constant in each helper, not computed at runtime, which means the function name strings are embedded as literals in the binary (confirmed by string sweep at 0x1C8C170).

Each helper follows the same pattern:

helper(module, instruction, name_string, name_length):
    // 1. Get or create function declaration in module
    func = module.getOrInsertFunction(name_string, return_type, param_types...)
    // 2. Build argument list from instruction operands
    args = extract_operands(instruction)
    // 3. Create CallInst
    call = IRBuilder.CreateCall(func, args)
    // 4. Replace uses and erase
    instruction.replaceAllUsesWith(call)
    instruction.eraseFromParent()

Libdevice Resolution

The 48 __nv_* functions emitted by this pass are not present in the standard libdevice.10.bc. The standard libdevice (455,876 bytes embedded at unk_3EA0080 / unk_420FD80) contains ~400+ math functions (__nv_sinf, __nv_expf, etc.) but does not include any fp128 or i128 emulation routines.

Instead, these functions are resolved through one of two mechanisms:

  1. Separate bitcode library: A dedicated 128-bit emulation bitcode module linked after lower-ops runs. This module contains the actual multi-word software implementations of 128-bit arithmetic using 64-bit operations.

  2. Late synthesis during type legalization: The SelectionDAG type legalization pass (SoftenFloat action) can also handle fp128 operations, but CICC's IR-level lowering preempts this by replacing operations before they reach the backend. The __nv_* functions, once declared in the module, must be resolvable at link time.

The call declarations emitted by the pass use external linkage, meaning the linker must supply definitions. If a definition is missing, the compilation will fail at the NVPTX link stage with an unresolved symbol error. The benefit of performing this lowering at the IR level rather than in SelectionDAG is that the resulting calls are visible to the LLVM optimizer: the inliner can inline the emulation routines, SROA can decompose the intermediate values, and the loop optimizers can hoist invariant 128-bit computations.

Configuration

The pass has no dedicated knobs. It is controlled indirectly through the lower-ops pass framework:

Parameter | Effect
enable-optimization | Parameter to LowerOpsPass registration (slot 144). When enabled, the lowered calls may be marked with optimization attributes.

There are no knobs in knobs.txt specific to fp128 or i128 lowering. The pass runs unconditionally whenever lower-ops is in the pipeline -- there is no way to disable 128-bit emulation because leaving fp128/i128 operations in the IR would cause a fatal error in the NVPTX backend.

Diagnostic Strings

The pass itself emits no diagnostic messages or debug prints. All diagnostic information comes from the embedded function name strings:

"__nv_add_fp128"         "__nv_sub_fp128"         "__nv_mul_fp128"
"__nv_div_fp128"         "__nv_rem_fp128"
"__nv_udiv128"           "__nv_idiv128"
"__nv_urem128"           "__nv_irem128"
"__nv_fp128_to_uint8"    "__nv_fp128_to_int8"
"__nv_fp128_to_uint16"   "__nv_fp128_to_int16"
"__nv_fp128_to_uint32"   "__nv_fp128_to_int32"
"__nv_fp128_to_uint64"   "__nv_fp128_to_int64"
"__nv_fp128_to_uint128"  "__nv_fp128_to_int128"
"__nv_uint8_to_fp128"    "__nv_int8_to_fp128"
"__nv_uint16_to_fp128"   "__nv_int16_to_fp128"
"__nv_uint32_to_fp128"   "__nv_int32_to_fp128"
"__nv_uint64_to_fp128"   "__nv_int64_to_fp128"
"__nv_uint128_to_fp128"  "__nv_int128_to_fp128"
"__nv_fp128_to_float"    "__nv_fp128_to_double"
"__nv_float_to_fp128"    "__nv_double_to_fp128"
"__nv_cvt_f32_u128_rz"   "__nv_cvt_f32_i128_rz"
"__nv_cvt_f64_u128_rz"   "__nv_cvt_f64_i128_rz"
"__nv_cvt_u128_f32_rn"   "__nv_cvt_i128_f32_rn"
"__nv_cvt_u128_f64_rn"   "__nv_cvt_i128_f64_rn"
"__nv_fcmp_oeq"          "__nv_fcmp_ogt"          "__nv_fcmp_oge"
"__nv_fcmp_olt"          "__nv_fcmp_ole"          "__nv_fcmp_one"
"__nv_fcmp_ord"          "__nv_fcmp_uno"          "__nv_fcmp_ueq"
"__nv_fcmp_ugt"          "__nv_fcmp_uge"          "__nv_fcmp_ult"
"__nv_fcmp_ule"          "__nv_fcmp_une"

Function Map

Function | Address | Size | Role
Main entry | sub_1C8C170 | 25 KB | Opcode dispatch, instruction walk, type checks
FP128 binary lowering | sub_1C8A5C0 | -- | Emits __nv_{add,sub,mul,div,rem}_fp128 calls
FP128 conversion lowering | sub_1C8ADC0 | -- | Emits __nv_fp128_to_* / __nv_*_to_fp128 calls
I128 division lowering | sub_1C8BD70 | -- | Emits __nv_{u,i}div128 / __nv_{u,i}rem128 calls
I128-float lowering | sub_1C8BF90 | -- | Emits __nv_cvt_* calls (rz/rn variants)
Type width check | sub_1642F90 | -- | Tests whether a type has a given bit-width (e.g., 128)

Cross-References

Struct/Aggregate Splitting

GPU register files are typed and scalar. An SM has no concept of loading a struct, storing a struct, or passing a struct through a register -- every value that survives past IR lowering must reduce to a set of individually-named scalar registers. LLVM's standard SROA pass handles alloca-based aggregates by promoting them to scalars, but a large class of aggregate operations never touch an alloca: return values, call arguments, PHI nodes carrying struct types, and aggregate load/store patterns from memcpy lowering. NVIDIA's struct-splitting pass operates on these non-alloca aggregate operations at the NVVM IR level, decomposing every struct-typed value into its constituent scalar fields so that downstream register allocation sees only scalar types.

The pass exists in two binary instances. The primary implementation at sub_1C86CA0 (72KB, ~1,200 lines, 500+ locals) lives in the aggregate-splitting cluster at 0x1C80000--0x1CBFFFF and operates on NVVM IR using NVIDIA-proprietary type IDs. A second, closely related implementation at sub_2CCF450 (58KB) handles the lower-aggr-copies pipeline pass and shares the same string constants ("splitStruct", "srcptr", "dstptr", "remsrc", "remdst", "split", "vld"). Both instances produce the same fundamental transformation: aggregate operations become sequences of scalar operations on individual struct elements.

Key Facts

Entry point: sub_1C86CA0
Size: 72KB (~1,200 lines decompiled), 500+ local variables
Binary cluster: 0x1C80000--0x1CBFFFF (Aggregate Splitting + Memory Ops)
Second instance: sub_2CCF450 (58KB, lower-aggr-copies pass)
Pipeline pass name: lower-aggr-copies (parameterized: lower-aggr-func-args)
Related pass: lower-struct-args (parameterized: opt-byval)
IR level: NVVM IR (NVIDIA-proprietary type IDs, not LLVM Type::TypeID)
Key opcode: 32 (splitStruct instruction)
Use replacement: sub_164D160 (RAUW -- Replace All Uses With)
LLVM upstream: No equivalent -- this is entirely NVIDIA-proprietary

Algorithm

The pass walks every instruction in a function, looking for operations whose result type or operand type is an aggregate (struct or array). For each such operation, it decomposes the aggregate into its scalar elements, creates a splitStruct multi-output instruction, and rewires all uses to reference individual element extractions.

Step 1: Type Decomposition

For each struct type encountered, the pass retrieves the struct layout from the DataLayout and enumerates its elements:

function decomposeStructType(struct_type, data_layout):
    layout = sub_1643350(data_layout, struct_type)  // GetStructLayout
    element_types = []
    for each element in struct_type.elements:
        scalar_ty = sub_159C470(element)            // getScalarType
        element_types.append(scalar_ty)
    return element_types

sub_1643350 retrieves the StructLayout from the DataLayout, giving byte offsets and sizes for each field. sub_159C470 maps each element to its scalar type -- for nested structs, this recurses; for arrays, it yields the element type; for scalars, it returns the type directly.

The element types accumulate in a local array v505[] with the count tracked in v506. This flattened type list drives all subsequent instruction creation.
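
The recursive flattening can be sketched on a toy type representation (tuples for structs, (count, element) lists for arrays, strings for scalars). This is an illustration of the decomposition idea only; in particular, the assumption that an array contributes one leaf per element is a plausible reading, not confirmed from the binary:

```python
def flatten_type(ty, out=None):
    """Flatten a nested aggregate type into its scalar leaf types, in order."""
    if out is None:
        out = []
    if isinstance(ty, tuple):        # struct: recurse into each field
        for field in ty:
            flatten_type(field, out)
    elif isinstance(ty, list):       # array: element type per element (assumed)
        count, elem = ty
        for _ in range(count):
            flatten_type(elem, out)
    else:                            # scalar leaf
        out.append(ty)
    return out

# struct { float; struct { i32; i32 }; float[2] }
flatten_type(("float", ("i32", "i32"), [2, "float"]))
```

The flattened list plays the role of v505[]/v506: its length determines how many outputs the splitStruct instruction produces.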

Step 2: splitStruct Instruction Creation (Opcode 32)

The pass creates a new multi-output instruction with NVVM opcode 32:

function createSplitStruct(original_inst, element_types, count):
    composite_ty = sub_15F9F50(element_types, count)     // ComputeCompositeType
    aligned_ty   = sub_1646BA0(composite_ty, data_layout) // SetAlignmentFromDL

    // If original was a vector type (type_id == 16), wrap in vector
    if getTypeId(original_inst.type) == 16:
        aligned_ty = sub_16463B0(aligned_ty)              // WrapInVectorType

    split_inst = sub_15F1EA0(aligned_ty, 32, parent, nops, flags)
                                                          // InitInstruction(opcode=32)
    // Store original type info at inst+56, composite at inst+64
    split_inst[+56] = original_type_info
    split_inst[+64] = sub_15F9F50(composite_ty)
    return split_inst

The splitStruct instruction is the NVVM-specific multi-result node that represents the decomposition. It produces N outputs, one per struct element. The instruction stores both the original aggregate type (at offset +56) and the composite element type (at offset +64) for later phases that may need to reconstruct type information.

Step 3: Element Pointer Extraction

For each element of the decomposed struct, the pass creates an indexed load from the splitStruct result:

for i in 0..count:
    ptr = sub_15FD590(split_inst, element_types[i],
                      operand=i, name="ptr", insertion_point)
    // Creates opcode 56 (extractvalue-like) with type=1

sub_15FD590 creates an instruction with opcode 56 that extracts the i-th element from the multi-output splitStruct node. The "ptr" name prefix appears in debug output. Each extraction yields a scalar-typed value that downstream passes can assign to an individual PTX register.

Step 4: Split Load with Alignment Preservation

For the actual memory access that feeds the splitStruct, the pass creates a split load instruction:

function createSplitLoad(original_load, element_types):
    alignment = computeAlignment(original_load)
    split_load = sub_15F90A0(element_types, alignment, ...)
    additional_align = sub_1CCB4A0(data_layout, element_types)
    final_align = alignment & (-additional_align)  // min power-of-2
    return split_load

The resulting instruction carries the "split" name prefix. The alignment computation is described in detail in the next section.

Step 5: Use Replacement

After creating all scalar operations, sub_164D160 (RAUW -- Replace All Uses With) replaces every use of the original aggregate operation with the corresponding scalar element extraction:

sub_164D160(original_aggregate_inst, split_inst)

This is the same RAUW infrastructure used across CICC (also called from GlobalOpt, DSE, the inliner, and other passes). After replacement, the original aggregate instruction has zero uses and is eligible for dead code elimination.
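The rewiring can be pictured with a toy use-list model. This is an illustrative sketch only — the dict-based instructions and `uses` map are hypothetical stand-ins, not CICC's actual data structures:

```python
# Toy use-list model: `uses` maps a value name to (user, operand-index) pairs.
# rauw() rewires every operand slot that references `old` to reference `new`.
def rauw(uses, old, new):
    for user, idx in uses.pop(old, []):
        user["ops"][idx] = new          # rewire the operand slot in place
    return uses

# An add instruction that uses the original aggregate value twice:
add = {"op": "add", "ops": ["agg", "agg"]}
uses = {"agg": [(add, 0), (add, 1)]}
rauw(uses, "agg", "split0")
print(add["ops"])   # -> ['split0', 'split0']; "agg" now has zero uses
```

After the call, the old value has no remaining users, which is exactly the state that makes the original aggregate instruction eligible for dead code elimination.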

Alignment Preservation

The pass must preserve memory alignment when splitting aggregate loads/stores into per-element accesses. GPU memory transactions have strict alignment requirements: a misaligned access can silently produce wrong results or trap, depending on the address space and SM architecture.

The Alignment Formula

The decompiled alignment calculation is:

aligned_value = 1 << (alignment_field >> 1) >> 1

Breaking this down:

  1. alignment_field >> 1 -- the alignment is stored in a compressed encoding where the field value is approximately 2 * (log2(alignment) + 1); the shift discards the low bit and recovers the biased exponent.
  2. 1 << (result) -- converts the biased exponent back to a power-of-two value.
  3. >> 1 -- removes the +1 bias in the exponent (after the first shift the field holds log2(alignment) + 1, so the final shift corrects it).

For example, if alignment_field = 9, then 9 >> 1 = 4, 1 << 4 = 16, 16 >> 1 = 8, yielding 8-byte alignment. This encoding is compact and used throughout NVVM's type system to store alignment in a single byte.
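The decode can be expressed directly as a one-liner; this sketch just transcribes the decompiled expression so the worked example can be checked:

```python
def decode_alignment(field):
    """Decode NVVM's compressed one-byte alignment encoding (transcription
    of the decompiled expression; field ~ 2 * (log2(align) + 1))."""
    return (1 << (field >> 1)) >> 1

# field = 9: 9 >> 1 = 4, 1 << 4 = 16, 16 >> 1 = 8 -> 8-byte alignment
print(decode_alignment(9))   # -> 8
```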

Additional Alignment Computation

sub_1CCB4A0 provides a DataLayout-aware alignment computation for the element type. The final alignment is the minimum of the original alignment and the element's natural alignment, computed via:

final_align = original_align & (-element_natural_align)

The bitwise AND with the negation of the element alignment masks the original alignment down to a multiple of the element's natural alignment, ensuring the per-element access is always naturally aligned for its type without exceeding the original aggregate's alignment guarantee.

NVVM Type ID System

The pass operates on NVVM's proprietary type ID system, not LLVM's Type::TypeID. The size classification logic (decompiled lines 997--1030) reveals the mapping:

| NVVM Type ID | Type | Bit Width |
|---|---|---|
| 1 | BFloat16 (i8 pair with padding) | 16 |
| 2 | Float | 32 |
| 3 | Double / i32 (context-dependent) | 64 |
| 4 | i64 | 80 (with padding to 10 bytes) |
| 5, 6 | FP128 / PPC FP128 | 128 |
| 7 | Pointer | 8 * DataLayout::getPointerSizeInBits(0) |
| 9 | Float (alternate, possibly metadata) | 64 |
| 0xB (11) | Integer (arbitrary width) | element_encoding >> 8 |
| 0xD (13) | Array | 8 * DataLayout::getStructLayout(type) total size |
| 0xE (14) | Struct | Recursive sum of element sizes |
| 16 | Vector | Triggers vector-type wrapping via sub_16463B0 |

For struct types (ID 0xE), the size computation is recursive: the pass sums the sizes of all elements, each resolved through the same type-ID dispatch table. Array types (ID 0xD) use sub_15A9930 to look up the total allocation size from the DataLayout's StructLayout cache (which also handles arrays despite the name).
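The recursive size computation can be sketched over a hypothetical type model (strings for scalars, tuples for structs, a tagged tuple for arrays); padding and alignment from the real StructLayout cache are deliberately ignored here:

```python
def type_size_bits(ty):
    """Recursive size computation mirroring the type-ID dispatch (sketch:
    strings are scalars, ('arr', elem, n) is an array, other tuples are
    structs; StructLayout padding is omitted for brevity)."""
    scalar_bits = {"bf16": 16, "f32": 32, "f64": 64, "i32": 32, "i64": 64}
    if isinstance(ty, tuple) and ty and ty[0] == "arr":
        _, elem, n = ty
        return n * type_size_bits(elem)            # array: count * element size
    if isinstance(ty, tuple):                      # struct (ID 0xE)
        return sum(type_size_bits(e) for e in ty)  # recursive element sum
    return scalar_bits[ty]

# struct {i32, f32[4], f64}: 32 + 4*32 + 64 bits
print(type_size_bits(("i32", ("arr", "f32", 4), "f64")))   # -> 224
```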

Nested Struct and Array Handling

When a struct element is itself a struct or an array, the pass recurses. The sub_159C470 (getScalarType) call during type decomposition flattens nested aggregates: a struct {i32, {f32, f64}, i16} decomposes not into three elements but into four scalars: i32, f32, f64, i16. The flattening continues until every element is a primitive scalar or a pointer.
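The flattening itself is a plain depth-first recursion. A minimal sketch, using nested tuples as a stand-in for NVVM aggregate types:

```python
def flatten_aggregate(ty):
    """Flatten nested aggregates into scalar leaves, as the getScalarType-
    driven decomposition does (sketch: tuples model structs/arrays,
    strings model scalar types)."""
    if isinstance(ty, tuple):
        leaves = []
        for elem in ty:
            leaves.extend(flatten_aggregate(elem))   # recurse into nesting
        return leaves
    return [ty]

# struct {i32, {f32, f64}, i16} flattens to four scalars, not three elements
print(flatten_aggregate(("i32", ("f32", "f64"), "i16")))
# -> ['i32', 'f32', 'f64', 'i16']
```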

Arrays within structs are handled differently depending on their size. Small arrays may be fully unrolled into individual element accesses. The size threshold is governed by the max-aggr-copy-size and large-aggr-store-limit knobs. Arrays that exceed the threshold are not decomposed into per-element loads but instead lowered to byte-copy loops (the "remsrc" / "remdst" / "i8dst" paths correspond to this remainder-byte handling when the aggregate cannot be evenly split into typed elements).

The remainder path:

  1. Computes the number of whole elements that can be extracted as typed loads.
  2. For any trailing bytes that do not fill a complete element, generates an i8 byte loop: "remsrc" is the source pointer for the remainder, "remdst" is the destination, and "i8dst" is the byte-typed destination pointer.
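The split between typed elements and the byte tail is simple integer arithmetic; this sketch shows the partition that the "remsrc"/"remdst"/"i8dst" path implements:

```python
def copy_plan(total_bytes, elem_bytes):
    """Partition an aggregate copy into whole typed-element loads plus an
    i8 byte-loop tail (sketch of the remainder-path arithmetic)."""
    whole = total_bytes // elem_bytes      # complete typed elements
    tail = total_bytes % elem_bytes        # trailing bytes -> i8 loop
    return whole, tail

# 22-byte aggregate with 8-byte elements: 2 typed loads, 6 remainder bytes
print(copy_plan(22, 8))   # -> (2, 6)
```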

Relationship with SROA

LLVM's SROA (Scalar Replacement of Aggregates) and NVIDIA's struct splitting are complementary, not overlapping:

| Aspect | LLVM SROA | NVIDIA Struct Splitting |
|---|---|---|
| Target | alloca instructions in entry block | Non-alloca aggregate operations |
| Scope | Stack-allocated structs | Return values, call args, PHI nodes, memcpy results |
| IR level | LLVM IR (standard Type::TypeID) | NVVM IR (proprietary type IDs) |
| Pipeline position | Early scalar optimization passes | After LLVM optimization, NVVM lowering phase |
| Output | SSA scalars replacing alloca uses | splitStruct (opcode 32) multi-output nodes |
| Upstream | Standard LLVM pass | No upstream equivalent |

SROA runs during the standard LLVM optimization pipeline and eliminates alloca-based aggregates. By the time struct splitting runs, all remaining aggregate operations are those SROA could not handle: function return values carrying struct types, call sites passing or receiving struct-typed parameters, and aggregate-typed PHI nodes at control flow merges. Struct splitting is the final lowering step that ensures no aggregate-typed values survive into register allocation.

PTX Register Mapping

After struct splitting, every value in the IR is scalar-typed. During instruction selection and register allocation, each scalar maps to a PTX virtual register of the corresponding type:

// Before struct splitting:
%result = load {i32, f32, i64}, ptr %p, align 8

// After struct splitting:
%split = splitStruct {i32, f32, i64}   // opcode 32, multi-output
%r0 = extractvalue %split, 0           // i32 -> %r1 (32-bit register)
%r1 = extractvalue %split, 1           // f32 -> %f1 (32-bit FP register)
%r2 = extractvalue %split, 2           // i64 -> %rd1 (64-bit register)

In PTX, register types are explicit:

  • %r registers: 32-bit integers
  • %rd registers: 64-bit integers
  • %f registers: 32-bit floats
  • %fd registers: 64-bit floats
  • %h registers: 16-bit values (half/bfloat)
  • %p registers: predicates (1-bit)

Without struct splitting, the register allocator would need to handle aggregate-typed live ranges, which is impossible on GPU hardware where the register file has no concept of a "struct register." The pass is therefore a hard prerequisite for correct register allocation.
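The scalar-type-to-register-class mapping above can be captured in a small lookup table. This is purely illustrative of the table in the text, not CICC's actual selection code:

```python
# Scalar type -> PTX virtual register class, per the list above (sketch).
PTX_REG_CLASS = {
    "i32": "%r",   "i64": "%rd",
    "f32": "%f",   "f64": "%fd",
    "f16": "%h",   "bf16": "%h",
    "i1":  "%p",
}

# The {i32, f32, i64} example maps to three distinct register classes:
print([PTX_REG_CLASS[t] for t in ("i32", "f32", "i64")])   # -> ['%r', '%f', '%rd']
```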

Pipeline Position

The pass runs as part of the NVVM lowering phase, after the main LLVM optimization pipeline has completed. It is registered as lower-aggr-copies in the New PM pipeline parser at index 417 (sub_2342890), with parameter lower-aggr-func-args controlling whether function argument aggregates are also lowered.

Pipeline position:
  LLVM Optimizer (SROA, GVN, DSE, etc.)
    -> NVIDIA NVVM Lowering Phase
      -> lower-struct-args (opt-byval)     [lower struct function args]
      -> lower-aggr-copies (lower-aggr-func-args)  [struct splitting]
      -> memory-space-opt                   [address space resolution]
      -> register allocation preparation

The companion pass lower-struct-args (pass index 418) handles byval-attributed function parameters specifically, converting struct-typed byval parameters into explicit copy + scalar access patterns. It runs before lower-aggr-copies to ensure that byval struct arguments are already decomposed when the main splitting pass encounters them.

Configuration

Knobs (ctor_265 at 0x4F48E0)

| Knob | Default | Description |
|---|---|---|
| devicefn-param-always-local | -- | Treat parameter space as local in device functions |
| skiploweraggcopysafechk | false | Skip safety check in aggregate copy lowering |
| large-aggr-store-limit | -- | Threshold for large aggregate store unrolling |
| max-aggr-copy-size | -- | Maximum aggregate size for full decomposition |
| lower-aggr-unrolled-stores-limit | -- | Limit on unrolled stores per aggregate copy |

InstCombine Aggregate Knobs (ctor_086 at 0x49E670)

| Knob | Default | Description |
|---|---|---|
| max-aggr-lower-size | 128 | Size threshold (bytes) below which InstCombine lowers aggregates |
| aggressive-max-aggr-lower-size | 256 | Aggressive threshold for aggregate lowering |
| instcombine-merge-stores-from-aggr | true | Merge stores originating from aggregate decomposition |

| Knob | Scope | Description |
|---|---|---|
| lsa-opt | lower-struct-args | Controls struct argument lowering |
| lower-read-only-devicefn-byval | lower-struct-args | Lower read-only device function byval params |
| hoist-load-param | lower-struct-args | Hoist parameter loads |
| nvptx-force-min-byval-param-align | backend | Force 4-byte minimum alignment for byval params |
| nvptx-early-byval-copy | backend | Copy byval arguments early in the pipeline |

Diagnostic Strings

"splitStruct"     -- Name prefix for the opcode-32 multi-output node
"srcptr"          -- Source pointer in aggregate copy lowering
"dstptr"          -- Destination pointer in aggregate copy lowering
"remsrc"          -- Remainder source pointer (byte-copy tail loop)
"remdst"          -- Remainder destination pointer (byte-copy tail loop)
"i8dst"           -- Byte-typed destination for remainder copies
"split"           -- Name prefix for the per-element split load
"ptr"             -- Name prefix for element pointer extractions
"vld"             -- Vector load variant in the second instance

Function Map

Primary Instance (sub_1C86CA0, 72KB)

| Function | Address | Role |
|---|---|---|
| Main driver | sub_1C86CA0 | Top-level struct splitting pass |
| StructLayout query | sub_1643350 | DataLayout::getStructLayout |
| Scalar type query | sub_159C470 | Get scalar element type (recursive for nested structs) |
| Composite type creation | sub_15F9F50 | Build composite type from element array |
| Alignment from DL | sub_1646BA0 | Set type alignment from DataLayout |
| Vector type wrapping | sub_16463B0 | Wrap in vector type if original was vector |
| Instruction creation | sub_15F1EA0 | InitInstruction(type, opcode=32, parent, nops, flags) |
| Element extraction | sub_15FD590 | Create indexed load from multi-output node |
| Split load creation | sub_15F90A0 | Create load with alignment preservation |
| Alignment computation | sub_1CCB4A0 | DataLayout-aware alignment for element type |
| Use replacement | sub_164D160 | RAUW (Replace All Uses With) |
| Pointer size query | sub_15A9520 | DataLayout::getPointerSizeInBits(AS) |
| Struct size query | sub_15A9930 | DataLayout::getStructLayout for size lookup |

Second Instance (sub_2CCF450, 58KB)

| Function | Address | Role |
|---|---|---|
| Aggregate lowering | sub_2CCF450 | lower-aggr-copies pass implementation |

Pipeline Registration

| Function | Address | Role |
|---|---|---|
| New PM registration | sub_2342890 | Pass index 417 (lower-aggr-copies) |
| Parameter parser | sub_233A3B0 | Parses lower-aggr-func-args parameter |
| lower-struct-args parser | sub_233A370 | Parses opt-byval parameter |

Test This

The following kernel returns a struct from a device function. Struct splitting should decompose the aggregate return value into individual scalar registers.

struct Result {
    float value;
    int   index;
    float confidence;
};

__device__ Result compute(const float* data, int tid) {
    Result r;
    r.value      = data[tid] * 2.0f;
    r.index      = tid;
    r.confidence = 0.95f;
    return r;
}

__global__ void struct_split_test(const float* in, float* out_val,
                                   int* out_idx, float* out_conf, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    Result r = compute(in, tid);
    out_val[tid]  = r.value;
    out_idx[tid]  = r.index;
    out_conf[tid] = r.confidence;
}

What to look for in PTX:

  • The compute function should be inlined, but even if it is not, the struct return should be decomposed. Look for the absence of .local memory for the Result struct -- all three fields (value, index, confidence) should live in individual PTX registers (%f for floats, %r for int).
  • No ld.local/st.local pairs for passing the struct between compute and the kernel. If the struct survives unsplit, the caller allocates local memory for the return value, the callee stores into it, and the caller loads from it -- a 200+ cycle penalty per field.
  • In the PTX, the three stores to out_val, out_idx, out_conf should use values directly from registers without any intermediate local memory traffic. Look for st.global.f32 and st.global.u32 with register operands, not loaded-from-local operands.
  • To see the unsplit case, make compute a __noinline__ function and compile at -O0. The struct will be passed through .param space with explicit st.param/ld.param sequences, showing the overhead that struct splitting eliminates.

Cross-References

Memmove Unrolling

CUDA GPUs have no hardware instruction for bulk memory copy. On a CPU, memcpy and memmove compile down to optimized microcode sequences (REP MOVSB, AVX-512 scatter/gather, or libc hand-tuned SIMD loops). On an SM, every byte of a copy must pass through explicit load and store instructions executed by individual threads. LLVM's standard memcpy lowering in SelectionDAG produces reasonable load/store sequences, but it operates late in the pipeline and cannot reason about NVVM IR semantics -- address spaces, alignment guarantees from the CUDA memory model, or the interaction between copy direction and overlapping shared-memory buffers. NVIDIA's memmove unrolling pass replaces llvm.memmove and llvm.memcpy intrinsic calls at the NVVM IR level with explicit element-wise copy loops, generating both forward and reverse copy paths to handle overlapping memory correctly.

The pass lives in the aggregate-lowering cluster at 0x1C80000--0x1CBFFFF, adjacent to struct splitting (sub_1C86CA0) and FP128/I128 emulation (sub_1C8C170). It is part of the lower-aggr-copies pipeline pass (pass index 417), which coordinates memmove unrolling, struct splitting, and aggregate store lowering as a single pipeline unit. Upstream LLVM has no equivalent IR-level memmove unroller -- this is entirely NVIDIA-proprietary.

Key Facts

| Property | Value |
|---|---|
| Entry point | sub_1C82A50 |
| Size | 39KB (~1,200 lines decompiled) |
| Binary cluster | 0x1C80000--0x1CBFFFF (Aggregate Splitting + Memory Ops) |
| Pipeline pass | lower-aggr-copies (pass index 417, parameterized: lower-aggr-func-args) |
| Pass registration | sub_233A3B0 (parameter parser for LowerAggrCopiesPass) |
| IR level | NVVM IR (pre-instruction-selection) |
| Unroll threshold global | dword_4FBD560 |
| Knob constructor | ctor_265 at 0x4F48E0 |
| LLVM upstream | No equivalent -- NVIDIA-proprietary |
| Neighbor passes | Struct splitting (sub_1C86CA0), FP128 emulation (sub_1C8C170) |

Why This Pass Exists

On a CPU, memmove(dst, src, n) is a single function call that the runtime library implements with architecture-specific optimized loops, often using SIMD instructions that move 32 or 64 bytes per cycle. On a GPU:

  1. No bulk copy instruction. PTX and SASS have ld and st but no memcpy or rep movsb equivalent. Every byte must be an explicit load followed by an explicit store.

  2. Per-thread execution model. Each thread in a warp copies its own portion of data. A 128-byte struct copy in a kernel with 1024 threads means 1024 independent 128-byte copy sequences, all of which must resolve to individual load/store pairs.

  3. Address space semantics. The source and destination may live in different address spaces (global, shared, local, constant). Generic-pointer memmove requires runtime address-space resolution, but if the compiler can resolve the spaces at IR time, it can emit space-qualified loads and stores that map directly to the correct PTX instructions.

  4. Overlap semantics. memmove guarantees correct behavior when source and destination overlap. The pass must emit both a forward path (for dst < src) and a reverse path (for dst >= src) to preserve this guarantee. memcpy is also routed through this pass because the NVVM verifier enforces overlap-safety uniformly.

Algorithm

The pass scans each function for llvm.memmove and llvm.memcpy intrinsic calls. For each call, it replaces the intrinsic with a 4-block CFG that implements element-wise copying. The generated code has two paths: one for when the element count is statically known and small enough to fully unroll, and one for dynamic or large counts that use a loop with a PHI induction variable.

Step 1: Basic Block Structure Creation

The pass creates four new basic blocks, splitting the block containing the memmove call:

              +-------+
              | split |       (direction comparison)
              +---+---+
             /         \
    +-------------+   +-------------+
    | forward.for |   | reverse.for |
    +-------------+   +-------------+
             \         /
          +-------------+
          | nonzerotrip |     (exit / continuation)
          +-------------+

| Block | Name string | Purpose |
|---|---|---|
| Entry | "split" | Compares src and dst addresses to choose copy direction |
| Forward | "forward.for" | Copies elements from index 0 upward |
| Reverse | "reverse.for" | Copies elements from index count-1 downward |
| Exit | "nonzerotrip" | Continuation after the copy completes |

Step 2: Forward vs. Reverse Decision

The split block determines copy direction by comparing the source and destination base addresses:

; Pseudocode for the split block
%cmp = icmp ult ptr %dst, ptr %src     ; sub_12AA0C0, opcode 0x22 (34)
br i1 %cmp, label %forward.for, label %reverse.for   ; sub_15F83E0

The ICMP instruction is created via sub_12AA0C0 with opcode 0x22 (34 decimal, corresponding to an unsigned-less-than integer comparison). The conditional branch is created via sub_15F83E0. When dst < src, memory does not overlap in the forward direction, so the forward path is safe. When dst >= src, copying forward would overwrite source bytes before they are read, so the reverse path is required.
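The direction choice and both copy loops can be sketched in a few lines. This models pointers as indices into one shared buffer; it illustrates the overlap reasoning, not the generated IR:

```python
def memmove_elements(buf, dst, src, n):
    """Element-wise memmove within one buffer, choosing the copy direction
    the way the generated "split" block does (sketch: indices stand in
    for pointers)."""
    if dst < src:                          # forward.for: low-to-high is safe
        for i in range(n):
            buf[dst + i] = buf[src + i]
    else:                                  # reverse.for: copy high-to-low
        for i in range(n - 1, -1, -1):
            buf[dst + i] = buf[src + i]

data = list(range(8))
memmove_elements(data, 2, 0, 4)            # overlapping, dst >= src
print(data)   # -> [0, 1, 0, 1, 2, 3, 6, 7]
```

Copying the same overlapping range forward would read `buf[2]` and `buf[3]` after they had already been overwritten, which is exactly the hazard the reverse path avoids.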

Step 3: Copy Generation -- Small/Static Path

When the copy size is statically known and satisfies size <= dword_4FBD560 (the compile-time unroll threshold), the pass generates fully unrolled element-by-element copies with no loop overhead.

Reverse copy (decompiled lines 606--690):

; Fully unrolled reverse copy, count elements
; For i = count-1 downto 0:
%src.gep.N = getelementptr i8, ptr %src, i64 N     ; named "src.memmove.gep.unroll"
%val.N     = load i8, ptr %src.gep.N, align A       ; sub_15F9210 (InitLoadInstruction)
%dst.gep.N = getelementptr i8, ptr %dst, i64 N     ; named "dst.memmove.gep,unroll" [sic]
store i8 %val.N, ptr %dst.gep.N, align A            ; sub_15F9650 (InitStoreInstruction)
; ... repeated for each index from count-1 down to 0

Forward copy (decompiled lines 1036--1123):

; Fully unrolled forward copy, count elements
; For i = 0 to count-1:
%src.gep.N = getelementptr i8, ptr %src, i64 N     ; "src.memmove.gep.unroll"
%val.N     = load i8, ptr %src.gep.N, align A
%dst.gep.N = getelementptr i8, ptr %dst, i64 N     ; "dst.memmove.gep,unroll" [sic]
store i8 %val.N, ptr %dst.gep.N, align A
; ... repeated for each index from 0 up to count-1

Each load is created via sub_15F9210 (InitLoadInstruction, opcode 64 type 1) and each store via sub_15F9650 (InitStoreInstruction, opcode 64 type 2). Alignment is set on both loads and stores via sub_15F8F50 / sub_15F9450, preserving the alignment from the original memmove intrinsic call (passed as parameter a15). Memory attributes (volatile flags, etc.) are propagated through parameters a16 and a17.

Step 4: Copy Generation -- Large/Dynamic Path

When the copy size exceeds the threshold or is not statically known, the pass generates a single-iteration loop body with a PHI induction variable:

Forward loop:

forward.for:
  %iv = phi i64 [ 0, %split ], [ %iv.next, %forward.for ]   ; sub_15F1EA0, opcode 53
  %src.gep = getelementptr i8, ptr %src, i64 %iv
  %val = load i8, ptr %src.gep, align A
  %dst.gep = getelementptr i8, ptr %dst, i64 %iv
  store i8 %val, ptr %dst.gep, align A
  %iv.next = add i64 %iv, 1        ; sub_15A0680 (constant 1) + sub_15FB440 (ADD, opcode 13)
  %done = icmp eq i64 %iv.next, %count
  br i1 %done, label %nonzerotrip, label %forward.for   ; sub_15F83E0

Reverse loop:

reverse.for:
  %iv = phi i64 [ %count.minus1, %split ], [ %iv.next, %reverse.for ]
  %src.gep = getelementptr i8, ptr %src, i64 %iv
  %val = load i8, ptr %src.gep, align A
  %dst.gep = getelementptr i8, ptr %dst, i64 %iv
  store i8 %val, ptr %dst.gep, align A
  %iv.next = sub i64 %iv, 1
  %done = icmp eq i64 %iv.next, -1      ; or icmp slt i64 %iv.next, 0
  br i1 %done, label %nonzerotrip, label %reverse.for

The PHI node is created via sub_15F1EA0 with opcode 53. The constant 1 for the increment is created via sub_15A0680. The addition/subtraction uses sub_15A2B60 or sub_15FB440 (the 5-argument node constructor, opcode 13 for ADD). The nonzerotrip block serves as the exit target for both loop directions.

Step 5: Alignment Propagation

The pass preserves the alignment annotation from the original memmove/memcpy intrinsic call. The alignment value is passed through the internal parameter a15 to the load/store alignment setter functions sub_15F8F50 (SetLoadAlignment) and sub_15F9450 (SetStoreAlignment). This matters because downstream PTX emission can generate wider loads (e.g., ld.global.v4.b32 for 16-byte aligned accesses) if the alignment permits it.
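Why the propagated alignment matters can be shown with a simple width-selection sketch. The function below is illustrative only (CICC's actual access-widening lives in PTX emission, not in this pass):

```python
def widest_access(alignment, remaining_bytes):
    """Pick the widest power-of-two access a given alignment permits
    (illustrative: 16-byte alignment enables e.g. ld.global.v4.b32)."""
    for width in (16, 8, 4, 2, 1):
        if alignment >= width and remaining_bytes >= width:
            return width
    return 1

print(widest_access(16, 64))   # -> 16 (vectorized 16-byte access possible)
print(widest_access(4, 64))    # -> 4  (limited to 4-byte accesses)
```

Dropping the alignment annotation during lowering would force the worst case (byte accesses), which is why the setter calls thread `a15` through every generated load and store.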

Step 6: Cleanup

After generating the replacement CFG, the original memmove/memcpy intrinsic call is erased. The pass uses sub_164D160 (RAUW -- Replace All Uses With) to rewire any remaining references.

Unroll Threshold

The global variable dword_4FBD560 controls the boundary between full unrolling and loop generation. This value is registered at ctor_265 (0x4F48E0) as part of the aggregate copy lowering knob group.

| Condition | Code generation |
|---|---|
| count statically known AND count <= dword_4FBD560 | Fully unrolled: N load/store pairs with no loop overhead |
| count statically known AND count > dword_4FBD560 | Dynamic loop with PHI induction variable |
| count not statically known | Dynamic loop with PHI induction variable |

The tradeoff is straightforward: full unrolling eliminates loop overhead (branch, PHI, compare) but increases code size linearly. For GPU kernels where instruction cache pressure is rarely the bottleneck, unrolling small copies is almost always profitable. The threshold prevents pathological code size explosion for large static copies (e.g., a 4KB struct assignment would generate 4,096 load/store pairs without the limit).
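The decision logic reduces to a single guarded comparison; a sketch, using `None` for a count that is not statically known:

```python
def select_codegen(count, threshold):
    """Unroll-vs-loop decision mirroring the dword_4FBD560 check (sketch);
    count is None when the copy size is not statically known."""
    if count is not None and count <= threshold:
        return "unrolled"                  # N load/store pairs, no loop
    return "loop"                          # PHI induction-variable loop

print(select_codegen(16, 64))    # -> unrolled
print(select_codegen(4096, 64))  # -> loop
print(select_codegen(None, 64))  # -> loop
```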

The related knob lower-aggr-unrolled-stores-limit provides an additional limit on the number of stores generated in unrolled mode, and large-aggr-store-limit controls when aggregate stores transition from unrolled sequences to loops.

Naming Conventions

The pass names its generated GEP instructions with distinctive prefixes that are visible in IR dumps and useful for debugging:

| Instruction | Name string | Notes |
|---|---|---|
| Source GEP | "src.memmove.gep.unroll" | Period-separated |
| Destination GEP | "dst.memmove.gep,unroll" | Comma before unroll -- a typo in the binary [sic] |

The comma in "dst.memmove.gep,unroll" (where a period would be expected by analogy with the source GEP name) is a benign naming inconsistency baked into the binary string table. It has no semantic effect -- LLVM IR value names are arbitrary strings -- but it serves as a reliable fingerprint for identifying output from this specific pass. A reimplementation should preserve this exact string if binary-identical IR output is desired, or normalize it to "dst.memmove.gep.unroll" if not.

Configuration

Knobs registered at ctor_265 (0x4F48E0), applicable to the lower-aggr-copies pass cluster:

| Knob | Global | Description |
|---|---|---|
| lower-aggr-unrolled-stores-limit | -- | Maximum number of stores in unrolled mode |
| large-aggr-store-limit | -- | Element count above which aggregate stores use a loop |
| max-aggr-copy-size | -- | Maximum aggregate copy size the pass will handle |
| skiploweraggcopysafechk | -- | Skip safety check in aggregate copy lowering |
| devicefn-param-always-local | -- | Treat device function parameter space as local |

The pass can be invoked via the pipeline text interface:

-Xcicc "-passes=lower-aggr-copies"
-Xcicc "-passes=lower-aggr-copies<lower-aggr-func-args>"

Related aggregate lowering knobs from ctor_089 (0x4A0D60):

| Knob | Default | Description |
|---|---|---|
| max-aggr-lower-size | 128 | Threshold size (bytes) below which aggregates are lowered |
| aggressive-max-aggr-lower-size | 256 | Aggressive threshold for aggregate lowering |

Diagnostic Strings

"split"
"forward.for"
"reverse.for"
"nonzerotrip"
"src.memmove.gep.unroll"
"dst.memmove.gep,unroll"
"memmove/memcpy cannot target constant address space"   (from nvvm-verify)

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| Memmove unroller | sub_1C82A50 | 39KB | Main pass: CFG construction, copy generation |
| ICMP creation | sub_12AA0C0 | -- | Creates integer comparison (opcode 0x22) |
| Conditional branch | sub_15F83E0 | -- | Creates br i1 |
| InitLoadInstruction | sub_15F9210 | -- | Creates load instruction (opcode 64, type 1) |
| InitStoreInstruction | sub_15F9650 | -- | Creates store instruction (opcode 64, type 2) |
| SetLoadAlignment | sub_15F8F50 | -- | Sets alignment on load |
| SetStoreAlignment | sub_15F9450 | -- | Sets alignment on store |
| InitInstruction (PHI) | sub_15F1EA0 | -- | Creates PHI node (opcode 53) |
| CreateConstant | sub_15A0680 | -- | Creates integer constant (e.g., 1 for increment) |
| CreateBinaryOp | sub_15FB440 | -- | Creates binary operation node (5-arg constructor) |
| CreateBinaryOp (variant) | sub_15A2B60 | -- | Alternative binary op constructor |
| RAUW | sub_164D160 | -- | Replace All Uses With |
| Pipeline param parser | sub_233A3B0 | -- | Parses lower-aggr-func-args parameter |

Cross-References

  • Struct/Aggregate Splitting -- sibling pass in the same lower-aggr-copies pipeline unit; decomposes struct-typed operations into scalar field operations
  • FP128/I128 Emulation -- neighbor in the 0x1C80000 cluster; replaces wide arithmetic with runtime library calls
  • NVVM Verifier -- validates that memmove/memcpy targets are not in constant address space
  • NVIDIA Custom Passes -- master index of all proprietary passes
  • SROA -- upstream LLVM pass that splits alloca-based aggregates; handles memcpy/memmove during alloca rewriting

printf-lowering

The printf lowering pass rewrites device-side printf() calls into CUDA's runtime vprintf() ABI. GPU hardware does not support C variadic function calls, so the compiler must pack all arguments into a stack buffer and emit a two-argument call to vprintf(format_string, arg_buffer_ptr). CICC implements this transformation at two levels: a module-level IR pass and an AST-level lowering function.

| Property | Value |
|---|---|
| Pass name | printf-lowering |
| Class | llvm::PrintfLoweringPass |
| Scope | Module pass |
| Registration | New PM slot 130, sub_2342890 |
| Module-level entry | sub_1CB1E60 (31 KB) |
| AST-level lowering | sub_12992B0 (24 KB) |
| Enable knob | nvvm-lower-printf (registered at ctor_269) |

Two Lowering Stages

Printf lowering happens at two points in the compilation pipeline:

Stage 1 -- AST-level (sub_12992B0): During initial IR generation from the EDG frontend output, when the code generator encounters a direct call to printf, it intercepts the call and emits the vprintf rewrite inline. This is the earlier, more detailed pass that handles type promotion, buffer packing, and alloca management.

Stage 2 -- Module-level (sub_1CB1E60): A cleanup pass that runs during the LLVM optimization pipeline. It catches any remaining printf calls that survived the AST lowering (e.g., from linked bitcode modules or inlined functions) and applies the same transformation. This pass validates that the format string is a string literal: "The first argument for printf must be a string literal!".

AST-Level Lowering Algorithm (sub_12992B0)

The AST-level lowering is the more thoroughly analyzed implementation. It operates in six phases:

Phase 1: Resolve the vprintf Symbol

The pass looks up or creates the "vprintf" function declaration in the module:

  1. Build the vprintf parameter type list: (i8*, i8*)
  2. Create the FunctionType via sub_1644EA0
  3. Call sub_1632190(Module*, "vprintf", 7, funcType) -- this is Module::getOrInsertFunction

The literal string "vprintf" with length 7 is stored in a local variable.

Phase 2: Set Up Argument List

  • The format string (**a3) becomes the first argument
  • The remaining varargs (a3[1..]) are collected into a dynamic argument array
  • A 22-QWORD (176-byte) stack small-buffer optimization avoids heap allocation for typical printf calls with fewer than ~16 arguments

Fast path: if argCount <= 1 (format string only, no varargs), the pass skips buffer creation entirely and emits vprintf(fmt, undef) using sub_15A06D0 (UndefValue::get).

Phase 3: Allocate Packed Argument Buffer

For the varargs case, a stack buffer named "tmp" is allocated:

  • sub_127FC40(context, type, "tmp", alignment=8, addrspace=0) creates an alloca
  • The alloca is cached at a1[19] and reused across multiple printf calls within the same function
  • If a cached alloca exists, its size is reused (and potentially grown in Phase 5)

Phase 4: Per-Argument Processing

For each vararg, the pass:

  1. Float promotion: per C variadic calling convention, float arguments are promoted to double via an fpext instruction. Detected when type_info[+12] == 2 and type_info[+16] != 0.

  2. Type size calculation: a multi-level switch on the LLVM type tag computes the byte width:

     | Type tag | Size (bits) | Notes |
     |---|---|---|
     | 1 | 16 | half / i16 |
     | 2 | 32 | float / i32 |
     | 3, 9 | 64 | double / i64 |
     | 4 | 80 | x86_fp80 |
     | 5, 6 | 128 | fp128 / ppc_fp128 |
     | 7 | target-dependent | Pointer size from DataLayout |
     | 11 | custom | dword >> 8 (arbitrary-width integer) |
     | 13 | aggregate | Struct size from DataLayout |
     | 14 | packed struct | Complex alignment calculation, up to 3 levels of nesting |
  3. Alignment and offset: each argument is placed at the next naturally-aligned offset in the buffer. If offset % argSize != 0, the offset is rounded up.

  4. GEP creation: a GetElementPtr named "buf.indexed" indexes into the packed buffer at the computed byte offset.

  5. Bitcast: if the GEP result type differs from the argument type, a bitcast instruction named "casted" (opcode 47) is emitted.

  6. Store: the argument value is stored into the buffer slot via a StoreInst.
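The alignment-and-offset step above is the heart of buffer packing; a sketch of the offset computation, with sizes in bytes and floats assumed already promoted to 8-byte doubles:

```python
def pack_varargs(arg_sizes):
    """Compute naturally-aligned byte offsets in the packed vprintf argument
    buffer (sketch of the rounding rule: each argument goes at the next
    offset that is a multiple of its own size)."""
    offsets, off = [], 0
    for size in arg_sizes:
        if off % size:                     # round up to natural alignment
            off += size - (off % size)
        offsets.append(off)
        off += size
    return offsets, off                    # per-arg offsets, total buffer size

# i32, double, i16, 8-byte pointer -> offsets [0, 8, 16, 24], 32-byte buffer
print(pack_varargs([4, 8, 2, 8]))
```

Note the padding this rule inserts (bytes 4--7 and 18--23 in the example); it is the packed-buffer equivalent of the C variadic promotion and alignment rules.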

Phase 5: Alloca Resize

After processing all arguments, the pass checks whether the total packed size exceeds the current alloca size. If so, it patches the alloca's size operand in-place by manipulating the use-def chain directly -- unlinking the old size constant and linking a new one. This unusual technique avoids creating a second alloca while ensuring a single allocation dominates all printf pack sites.

Phase 6: Emit vprintf Call

sub_1285290 emits the final call: vprintf(format_string, arg_buffer_ptr).

Cleanup frees any heap-allocated argument arrays (from the small-buffer overflow path).

Module-Level Pass (sub_1CB1E60)

The module-level pass at 0x1CB1E60 (31 KB) performs a similar transformation but operates on already-lowered LLVM IR rather than AST nodes. Key recovered strings:

| String | Purpose |
|---|---|
| "DataLayout must be available for lowering printf!" | Guard: DataLayout required |
| "vprintf" | Target function name |
| "The first argument for printf must be a string literal!" | Format string validation |
| "vprintfBuffer.local" | Name of the packed argument buffer alloca |
| "bufIndexed" | Name of GEP instructions into the buffer |

The module-level pass uses "vprintfBuffer.local" as the alloca name (versus "tmp" in the AST-level lowering), and "bufIndexed" for the GEP instructions (versus "buf.indexed"). These naming differences confirm the two implementations are distinct codepaths.

Implementation Details

Small-buffer optimization: the argument array uses a 22-QWORD (176-byte) stack buffer. Heap allocation via the SmallVector grow path (sub_16CD150) happens only when the argument list overflows this buffer (roughly more than 16 arguments), so typical printf calls never touch malloc.

Alloca caching: a1[19] in the IRGenState caches the "tmp" alloca across multiple printf calls within the same function. This reduces alloca instruction count in functions with many printf calls.

Struct nesting limit: the type-size calculation handles up to 3 levels of nested struct packing (three nested switch statements in the decompilation). Deeper nesting hits a JUMPOUT at 0x129A22F -- likely an assertion for structs nested more than 3 levels in printf arguments.

Pointer tag bits: the basic block instruction list uses an intrusive doubly-linked list where the low 3 bits of next/prev pointers carry metadata tags (masked with 0xFFFFFFFFFFFFFFF8). This is consistent with LLVM's ilist implementation using pointer-int pairs.

Diagnostic Strings

Diagnostic strings recovered from p2-B08-printf-lowering.txt and p1.7-04-sweep-0x1B00000-0x1CFFFFF.txt.

| String | Source | Category | Trigger |
|---|---|---|---|
| "DataLayout must be available for lowering printf!" | sub_1CB1E60 (module-level pass) | Assertion/Error | Module lacks DataLayout; fatal guard at module pass entry |
| "The first argument for printf must be a string literal!" | sub_1CB1E60 (module-level pass) | Error | Format string argument is not a constant string; validation failure |
| "vprintf" | sub_1632190 / sub_12992B0 | Symbol | Target function name looked up or created in the module (literal string, length 7) |
| "vprintfBuffer.local" | sub_1CB1E60 (module-level pass) | IR name | Name of the packed argument buffer alloca in the module-level pass |
| "bufIndexed" | sub_1CB1E60 (module-level pass) | IR name | Name of GEP instructions into the argument buffer in the module-level pass |
| "tmp" | sub_12992B0 (AST-level lowering) | IR name | Name of the packed argument buffer alloca in the AST-level lowering; cached at a1[19] |
| "buf.indexed" | sub_12992B0 (AST-level lowering) | IR name | Name of GEP instructions into the argument buffer in the AST-level lowering |
| "casted" | sub_12992B0 (AST-level lowering) | IR name | Name of bitcast instructions when GEP result type differs from argument type (opcode 47) |
| "nvvm-lower-printf" | ctor_269 | Knob | Enable knob for the printf lowering pass |

The two lowering stages produce different IR names for the same conceptual entities ("vprintfBuffer.local" vs "tmp" for the alloca, "bufIndexed" vs "buf.indexed" for the GEPs), confirming they are distinct codepaths.

IRGenState Layout

The codegen context object used by the AST-level lowering:

| Offset | Field | Purpose |
|---|---|---|
| a1[4] | Module* | The LLVM module |
| a1[5] | Return type | Function return type / type context |
| a1[6] | DebugLoc | Current debug location |
| a1[7] | BasicBlock* | Current insertion block |
| a1[8] | Iterator | Insertion point in BB's instruction list |
| a1[9] | AS context | Address space context for alloca type creation |
| a1[19] | AllocaInst* | Cached "tmp" alloca (reused across printf calls) |

ipmsp -- Inter-Procedural Memory Space Propagation

The IPMSP pass resolves generic (address space 0) pointer arguments to concrete NVIDIA address spaces by analyzing call sites across the entire module. When all callers of a function agree that a pointer argument points to a specific memory space (global, shared, local, constant), the pass either specializes the function in place or clones it with narrowed pointer types. This enables downstream passes to emit space-specific load/store instructions (e.g., ld.shared instead of generic ld) and eliminates addrspacecast overhead.

Disabling this pass (-disable-MemorySpaceOptPass) causes 2--20x performance regressions on real workloads. The pass is automatically disabled in OptiX IR mode (--emit-optix-ir routes -do-ip-msp=0).

| Property | Value |
|---|---|
| Pass name | ipmsp |
| Class | llvm::IPMSPPass |
| Scope | Module pass |
| Registration | New PM slot 125, line 1111 in sub_2342890 |
| Main function | sub_2CBBE90 (71 KB) -- MemorySpaceCloning worklist driver |
| LIBNVVM variant | sub_1C6A6C0 (54 KB) |
| Inference engine | sub_2CE96D0 -> sub_2CE8530 |
| Cloning engine | sub_F4BFF0 (CloneFunction) |
| Callee matching | sub_2CE7410 |
| Propagation | sub_2CF5840 -> sub_2CF51E0 |
| Pipeline control | do-ip-msp NVVMPassOption (default: enabled) |

NVPTX Address Spaces

The pass resolves generic (AS 0) pointers to specific address spaces: global (AS 1), shared (AS 3), constant (AS 4), local (AS 5), or param (AS 101). Generic pointers require a runtime address space check on every access; resolving them statically eliminates this overhead. See Address Spaces for the complete table with hardware mapping, pointer widths, aliasing rules, and the MemorySpaceOpt bitmask encoding.

Algorithm Overview

The pass operates as a worklist-driven inter-procedural fixed-point analysis. The top-level loop:

function IPMSP_Run(Module M):
    worklist = deque<Function*>{}
    argSpaceMap = map<Value*, int>{}        // formal arg -> resolved AS
    returnSpaceMap = map<Function*, int>{}  // function -> return AS
    calleeInfoMap = map<Function*, set<Function*>>{}  // reverse call graph

    // Phase 1: seed
    for each F in M.functions():
        if shouldProcess(F):
            worklist.push_back(F)
        for each caller of F:
            calleeInfoMap[F].insert(caller)

    debug("Initial work list size : %d", worklist.size())

    // Phase 2: fixed-point iteration
    while worklist not empty:
        F = worklist.pop_back()

        // Analyze and specialize F's callee arguments
        changed = analyzeAndSpecialize(F, argSpaceMap, calleeInfoMap)

        if changed:
            // Propagate to F's callees
            propagateSpacesToCallees(F, argSpaceMap)
            for each callee C of F in calleeInfoMap:
                if shouldProcess(C):
                    worklist.push_back(C)
            debug("%d callees are affected")

        // Check return space
        if resolveReturnSpace(F, returnSpaceMap):
            debug("%s : return memory space is resolved : %d")
            // propagate to callers and push them onto worklist

Phase 1: Build Worklist

The pass iterates all functions in the module. A function enters the worklist if sub_2CBA650 returns true, meaning:

  • The function is not a declaration or available_externally
  • Its linkage is not extern_weak or common
  • It is not an intrinsic (sub_B2DDD0 filter)
  • It has at least one formal argument that is a generic pointer not yet in the resolved-space map

Specifically, sub_2CBA650 checks:

function shouldProcess(this, F):
    if F has no users (F[16] == 0): return false

    linkage = F.linkage & 0xF
    if (linkage + 14) & 0xF <= 3: return false   // available_externally, appending
    if (linkage + 7) & 0xF <= 1: return false     // common, extern_weak

    if isIntrinsic(F): return false

    retType = F.getReturnType()
    if retType is pointer with AS 0 and not in returnSpaceMap:
        return true

    return hasUnresolvedPointerArgs(this, F)

sub_2CBA520 (hasUnresolvedPointerArgs) walks the formal arg list (stride 40 bytes) and returns true if any arg has type byte 14 (pointer) and is not already in the arg-space map.

A reverse call graph is also constructed: for each callee, the pass records which callers invoke it.

Debug output (when dump-ip-msp is enabled): "Initial work list size : N"

Phase 2: Per-Function Analysis

For each function popped from the worklist:

  1. Classify arguments: allocate a per-arg array initialized to 1000 ("unresolved"). Non-pointer args and already-resolved args are marked 2000 ("skip").

  2. Walk call sites: for each call instruction, examine each actual argument:

    • If the actual's address space is non-zero (already specific), record it.
    • If the actual is generic (AS 0), first check the callee-space map for a cached result. If not found, invoke the dataflow inference engine sub_2CE96D0 to trace the pointer's provenance.
    • If this is the first call site for this arg, record the space. If a subsequent call site disagrees, mark 2000 ("conflicting -- give up").
  3. Count resolved arguments: any arg where all call sites agree on a single address space is a candidate for specialization.

function analyzeArgSpaces(F, argSpaceMap, calleeSpaceMap):
    numArgs = F.arg_size()
    spaces[numArgs] = {1000, ...}     // 1000 = unresolved

    for i in 0..numArgs:
        arg = F.getArg(i)
        if arg.type != pointer:
            spaces[i] = 2000          // not a pointer, skip
        else if arg in argSpaceMap:
            spaces[i] = 2000          // already resolved

    for each CallInst CI using F:
        calledFn = CI.getCalledFunction()
        for i in 0..numArgs:
            if spaces[i] == 2000: continue
            actual = CI.getOperand(i)
            if actual == F.getArg(i): continue  // passthrough

            as = actual.type.addrspace
            if as == 0:
                // Check cache first
                if actual in calleeSpaceMap:
                    as = calleeSpaceMap[actual]
                else:
                    ok = inferAddressSpace(calledFn, actual, &as, ...)
                    if !ok:
                        spaces[i] = 2000
                        continue

            if spaces[i] == 1000:
                spaces[i] = as         // first call site
            else if spaces[i] != as:
                spaces[i] = 2000       // conflict

    return count(s for s in spaces if s != 1000 and s != 2000)

Debug output: "funcname : changed in argument memory space (N arguments)"

Phase 3: Specialization Decision

The pass chooses between two strategies based on linkage:

| Linkage | Strategy | Mechanism |
|---|---|---|
| Internal / Private (7, 8) | In-place specialization | Modify the function's arg types directly. No clone needed since all callers are visible. |
| External / Linkonce / Weak | Clone | Create a new function with specialized arg types and internal linkage. Rewrite matching call sites to target the clone. Keep the original for external callers. |

The decision at line 1114 in sub_2CBBE90:

if (F.linkage & 0xF) - 7 <= 1:
    // Internal/Private: specialize in place
    for each resolved arg:
        argSpaceMap[arg] = resolvedAS
else:
    // External: must clone
    if resultsTree is empty:
        debug("avoid cloning of %s")
    else:
        createClone(F, resolvedArgs)

The clone is created by sub_F4BFF0 (CloneFunction):

  • Builds a new FunctionType with specific-space pointer arg types
  • Allocates a new Function object (136 bytes via sub_BD2DA0)
  • Copies the body via a ValueMap-based cloner (sub_F4BB00)
  • For each specialized arg, inserts an addrspacecast from specific back to generic at the clone's entry (these fold away in later optimization)
  • Sets clone linkage to internal (0x4007)

Debug output: "funcname is cloned"

Phase 4: Transitive Propagation

After specializing a function, the pass propagates resolved spaces to its callees via sub_2CF5840. This function:

  1. Creates an analysis context similar to sub_2CE96D0
  2. Calls sub_2CF51E0 which walks F's body
  3. For each call instruction in F that targets a known function, determines if the called function's args now have resolved spaces
  4. Updates the arg-space map accordingly

Affected callees are pushed back onto the worklist. This enables bottom-up resolution through call chains: if A -> B -> C, specializing A's args may resolve B's args, which in turn resolves C's args.

Debug output: "N callees are affected"

Phase 5: Return Space Resolution

After argument processing, the pass checks return values:

  • If the function returns a generic pointer, walk all ret instructions.
  • Follow the def chain through GEPs to the base pointer.
  • If all returns agree on a single address space, record it in the return-space map and propagate to callers.

Debug output: "funcname : return memory space is resolved : N"

The Dataflow Inference Engine

The inference engine is the core analysis that determines what address space a generic pointer actually points to. It is invoked when a call-site argument has address space 0 (generic) and the pass needs to determine the concrete space.

Entry Point: sub_2CE96D0

function inferAddressSpace(calledFn, actualArg, &result, module, symtab, argSpaceMap):
    as = actualArg.type.addrspace
    if as != 0:
        *result = as
        return true                    // trivially resolved

    // Generic pointer: need full analysis
    context = alloca(608)              // 608-byte stack context
    // Initialize 6 tracking sets:
    //   [0]  visited set (bitset for cycle detection in PHI chains)
    //   [1]  user-list collector
    //   [2]  callee mapping
    //   [3]  load tracking (when track-indir-load)
    //   [4]  inttoptr tracking (when track-int2ptr)
    //   [5]  alloca tracking

    return coreDataflowWalker(context, calledFn, actualArg,
                              &loadsVec, &callsVec, result)

The 608-byte context is allocated on the stack and contains all working state for the backward dataflow walk.

Core Backward Dataflow Walker: sub_2CE8530

The walker traces the pointer's provenance backward through the SSA def chain. It uses a worklist plus visited-set to handle cycles (primarily PHI nodes).

IR nodes handled:

| IR node | Action |
|---|---|
| getelementptr | Transparent: follow the base pointer operand |
| bitcast | Transparent: follow the source operand |
| addrspacecast | Extract target address space, record it |
| phi | Add all incoming values to the worklist |
| select | Add both arms to the worklist (result = OR of both) |
| call / invoke | Look up callee in return-space map; if found, use that |
| load | If track-indir-load enabled: follow the loaded pointer; otherwise opaque |
| inttoptr | If track-int2ptr enabled: follow the integer source; otherwise opaque |
| alloca | If process-alloca-always: immediately resolve to AS 5 (local) |
| argument | If in arg-space map: use the recorded space |

Inference rules (lattice):

The engine collects candidate address spaces from all reachable definitions. The resolution follows these rules:

// All sources agree:     resolved to that space
// Sources disagree:      unresolvable (return false)
// param bit set + param-always-point-to-global:  resolve to global (AS 1)
// alloca found + process-alloca-always:  resolve to local (AS 5)
// __builtin_assume(__isGlobal(p)) + process-builtin-assume:  resolve to global

The walker collects three separate vectors during traversal:

  • loads: pointers loaded from memory (indirect provenance)
  • GEPs: getelementptr instructions encountered along the chain
  • calls: function calls whose return values contribute to the pointer

Per-Callee Space Propagation: sub_2CE8CB0

This function is the heavyweight driver called from the worklist loop for each function. It processes a function's call graph entries and determines concrete address spaces for callees by examining actual arguments at all call sites.

Architecture:

  1. A global limit at qword_3CE3528 caps maximum analysis depth to prevent explosion on large call graphs.

  2. The function iterates the BB instruction list (offset +328, linked list). For each callee encountered:

    • Check visited set. The set has two representations:
      • Small set: flat array at object offsets +32..+52 (checked when flag at +52 is set)
      • Large set: hash-based DenseSet at offset +24 (checked via sub_18363E0)
    • If callee has no body (*(_DWORD *)(callee + 120) == 0): collect it as a leaf and record its argument address spaces via sub_2CE80A0
    • Otherwise: skip (will be processed when popped from worklist)
  3. For each collected callee, a DenseMap cache at offset +160 is checked:

    • Hash function: (ptr >> 9) ^ (ptr >> 4), linear probing
    • Empty sentinel: -4096 (0xFFFFFFFFFFFFF000)
    • If found in cache: skip re-analysis (use cached result)
  4. After collecting all callees: invoke sub_2CE88B0 for merge/commit.

  5. For single-entry results (exactly 1 callee entry in the vector): special fast path via sub_2CE2F10 that commits directly through a vtable dispatch.

function perCalleePropagate(this, F):
    if this.firstVisit:
        // Reset tracking vectors
        clearUserVectors()

    // Walk BB instruction list
    for each BB in F.body():
        if BB in visitedSet: continue
        if BB.isDeclaration(): continue

        collectCalleeInfo(BB)       // -> sub_2CE80A0
        addToVisitedSet(BB)

    // Check depth limit
    if userVector.size() > depthLimit:
        return false

    // Merge phase
    if userVector.size() > 1:
        return mergeAndCommit(this, F)    // sub_2CE88B0
    elif userVector.size() == 1:
        commitSingleResult(this)          // fast path
    return false

Callee Matching Engine: sub_2CE7410

When multiple call instructions target the same callee, this function determines the best pair to use for space inference. This is critical for correctness -- the pass must ensure that the inferred space is valid for all uses.

Algorithm:

  1. Parallel operand walk: for each pair of call instructions to the same callee, walk their operand use-chains in parallel. Compare the instructions at each position via the instruction equivalence DenseMap at offset +80.

  2. Coverage scoring: count the number of matching operands (variable v95). Higher coverage means more confidence in the match.

  3. Dominance check: call sub_2403DE0(A, B) to test if BB A dominates BB B. Both directions are checked:

    • If A dominates B and B dominates A (same BB or trivial loop): strong match.
    • If only one direction: check if the non-dominating one is the entry BB's first instruction.
  4. Loop membership gate: sub_24B89F0 checks whether both call instructions are in the same loop. If both are in the same loop and the coverage score > 1, the match is accepted even without strict dominance (loops create natural fixed-point convergence).

  5. Attribute check: for each matched pair, sub_245A9B0 verifies metadata flags (at instruction offset +44) to ensure the transformation is legal.

  6. Output: the best-scoring pair is written into the results vector for subsequent instruction rewriting.

Post-Inference Merge: sub_2CE88B0

After the per-callee analysis produces a list of (instruction, resolved_space) entries:

function mergeAndCommit(this, F):
    entries = this.resultVector
    if entries.size() > 1:
        qsort(entries, comparator=sub_2CE2BD0)  // sort by callee ID

    changed = false
    while entries.size() > 1:
        entry = entries.back()
        calleeId = entry.calleeId

        // Find best match for this callee
        matchScore = sub_2CE7410(this, calleeId, ...)

        if matchScore > 0:
            // Commit via instruction specialization
            sub_2CE4830(this, matchedCallee)     // edge weight
            sub_2CE3B60(this, bestMatchIdx)      // commit space

            // Propagate to other entries sharing this callee
            for each other entry with same callee:
                if other != bestMatch:
                    sub_2CE3780(this, other.users, matchedCallee)

            // Compact the entries vector
            changed = true
        else:
            // No match: fallback propagation
            sub_2CE3A70(this, calleeId, ...)

    return changed

Instruction Specialization: sub_2CE8120

Once a callee's address space is determined, this function creates a specialized copy of the instruction:

  1. Legality check: vtable dispatch at offset +408 (sub_25AE460 default). Returns false if the instruction cannot be legally specialized (e.g., volatile operations, intrinsics with fixed types).

  2. Create specialized instruction: sub_244CA00 creates a new instruction with the modified pointer type (generic -> specific address space).

  3. Insert into BB: sub_24056C0 places the new instruction in the basic block's instruction list.

  4. Rewrite use chain: all uses of the old instruction are updated to reference the new specialized version.

  5. Update DenseMap caches:

    • Instruction-to-space map at offset +80: insert mapping from new instruction to resolved space
    • Edge count at offset +72: update via sub_24D8EE0
    • If nested clone tracking (offset +131 flag): update debug info via sub_2D2DBE0

Handling Recursion and Clone Limits

  • Transitive: clones are pushed back onto the worklist, so chains A->B->C are handled iteratively.
  • Mutual recursion: already-resolved args are detected via the map (marked 2000), preventing infinite re-processing.
  • Self-recursion: after the first pass resolves args, re-processing finds agreement and applies specialization.
  • Clone limit: do-clone-for-ip-msp (default -1 = unlimited) caps the total number of clones. Each clone increments a counter at this[200]. When the limit is exceeded, cloning stops but in-place specialization continues for internal functions.
  • Analysis depth limit: qword_3CE3528 limits the per-function callee analysis depth to prevent explosion on large modules.

The LIBNVVM Variant

A second implementation at sub_1C6A6C0 (54 KB) serves the LIBNVVM/module-pass path. Key differences:

  • Uses DenseMap-style hash tables (empty sentinel = -8, tombstone = -16, 16-byte entries)
  • Includes loop-induction analysis via sub_1BF8310 with maxLoopInd tracking (debug: "phi maxLoopInd = N: Function name")
  • Three processing phases controlled by globals:
    • Phase A (dword_4FBD1E0, default=4): call-site collection, threshold dword_4FBC300 = 500
    • Phase B (dword_4FBD2C0, default=2): address space resolution. If dword_4FBCAE0 (special mode), picks the callee with the smallest constant value (minimum address space ID).
    • Phase C (dword_4FBCD80, default=2): WMMA-specific sub-pass via sub_1C5FDC0, called with wmma_mode=1 first (WMMA-specific), then wmma_mode=0
  • Threshold: v302 > 5 triggers sub_1C67780 for deeper analysis
  • Pre/post analysis toggle: byte_4FBC840 controls calls to sub_1C5A4D0

Interaction with memory-space-opt

The ipmsp and memory-space-opt passes are complementary:

  • ipmsp is inter-procedural: it analyzes call graphs, infers address spaces across function boundaries, and specializes function signatures via cloning.
  • memory-space-opt is intra-procedural: it resolves generic pointers within a single function body using backward dataflow analysis and bitmask accumulation.

The typical pipeline flow:

  1. ipmsp runs first (module pass) to propagate address spaces across function boundaries
  2. memory-space-opt runs with first-time mode to resolve obvious intra-procedural cases
  3. Further optimization passes run (may create new generic pointers via inlining, SROA, etc.)
  4. memory-space-opt runs with second-time mode to clean up remaining generic pointers, fold isspacep intrinsics to constants

Both passes share the same set of knobs (with ias- prefixed mirrors for the IAS variant). The inference engine sub_2CE96D0 is shared between IPMSP and the alternate algorithm selected by mem-space-alg.

Knobs

IPMSP-Specific Knobs

| Knob | Default | Storage | Description |
|---|---|---|---|
| dump-ip-msp | 0 | qword_5013548 | Enable debug tracing |
| do-clone-for-ip-msp | -1 (unlimited) | qword_5013468 | Max clones allowed |
| do-ip-msp | 1 (enabled) | NVVMPassOption | Enable/disable the entire pass |

Shared Inference Knobs (MemorySpaceOpt variant)

| Knob | Default | Storage | Description |
|---|---|---|---|
| param-always-point-to-global | true | unk_4FBE1ED | Parameter pointers always resolve to global (AS 1) |
| strong-global-assumptions | true | (adjacent) | Assume constant buffer pointers always point to globals |
| process-alloca-always | true | unk_4FBE4A0 | Treat alloca-derived pointers as local (AS 5) unconditionally |
| wmma-memory-space-opt | true | unk_4FBE3C0 | Specialize WMMA call args to shared memory (AS 3) |
| track-indir-load | true | byte_4FBDE40 | Track indirect loads during inference |
| track-int2ptr | true | byte_4FBDC80 | Track inttoptr in inference |
| mem-space-alg | 2 | dword_4FBDD60 | Algorithm selection for address space optimization |
| process-builtin-assume | -- | (ctor_531_0) | Process __builtin_assume(__is*(p)) for space deduction |

IAS Variant Knobs (IPMSPPass path, ctor_610)

Each shared knob has an ias- prefixed mirror that controls the InferAddressSpaces-based code path (sub_2CBBE90):

| Knob | Mirrors |
|---|---|
| ias-param-always-point-to-global | param-always-point-to-global |
| ias-strong-global-assumptions | strong-global-assumptions |
| ias-wmma-memory-space-opt | wmma-memory-space-opt |
| ias-track-indir-load | track-indir-load |
| ias-track-int2ptr | track-int2ptr |

The unprefixed versions control the LIBNVVM variant (sub_1C6A6C0). The ias- prefixed versions control the New PM / IAS variant (sub_2CBBE90).

LIBNVVM Variant Globals

| Global | Default | Description |
|---|---|---|
| dword_4FBD1E0 | 4 | Phase A call-site collection level |
| dword_4FBD2C0 | 2 | Phase B resolution level |
| dword_4FBCD80 | 2 | Phase C WMMA sub-pass level |
| dword_4FBC300 | 500 | Max analysis depth threshold |
| dword_4FBCAE0 | -- | Special minimum-selection mode |
| byte_4FBC840 | -- | Pre/post analysis toggle |
| dword_4FBD020 | -- | Debug: maxLoopInd dump |

Debug Dump Knobs

| Knob | Description |
|---|---|
| dump-ir-before-memory-space-opt | Dump IR before MemorySpaceOpt runs |
| dump-ir-after-memory-space-opt | Dump IR after MemorySpaceOpt completes |
| dump-process-builtin-assume | Dump __builtin_assume processing |
| msp-for-wmma | Enable Memory Space Optimization for WMMA (tensor core) |

Data Structures

Worklist

The worklist is a std::deque<Function*> with 512-byte pages (64 pointers per page). Push-back via sub_2CBB610 (extends the deque when the current page is full). Pop-back from the last page.

Red-Black Tree Maps

The cloning engine uses red-black trees (std::map) for four separate maps:

| Map | Key | Value | Purpose |
|---|---|---|---|
| Return-space | Function* | Resolved AS | Return value address space |
| Arg-space | Value* | Resolved AS | Per-argument address space |
| Callee-space | Value* | Resolved AS | Callee pointer spaces (cached inference results) |
| Callee-info | Function* | Sub-tree | Reverse call graph (which callers invoke this callee) |

Red-black tree nodes are 0x58 bytes with the standard {left, right, parent, color, key} layout at offsets 16, 24, 8, 0, 32.

DenseMap Caches

The inference engine and per-callee propagation use DenseMap hash tables with LLVM-layer sentinels (-4096 / -8192) and 16-byte entries (key + value). Growth is handled by sub_240C8E0. See Hash Table and Collection Infrastructure for the hash function, probing, and growth policy.

Three independent DenseMaps are used:

  1. Offset +80: instruction -> resolved space (per-function analysis cache)
  2. Offset +160: callee -> inference result (cross-function cache)
  3. Offset +232: edge weight tracking (call graph weights for profitability)

Visited Sets

Two representations depending on set size:

  • Small set (flag at offset +52): flat array at offsets +32..+44, capacity at +40, count at +44. Linear scan for membership test.
  • Large set (default): hash-based DenseSet at offset +24, with sub_18363E0 handling both insert and membership testing.

Inference Context

The 608-byte stack-allocated context for sub_2CE8530 contains:

| Offset range | Content |
|---|---|
| 0--23 | Result vector (pointer, size, capacity) |
| 24--47 | Loads vector (indirect pointer sources) |
| 48--71 | GEPs vector (getelementptr chains) |
| 72--95 | Calls vector (call instructions returning pointers) |
| 96--127 | Worklist for PHI traversal |
| 128--607 | Visited bitset, callee tracking, metadata |

Sentinel Values

| Value | Meaning | Used in |
|---|---|---|
| 1000 | Unresolved pointer argument (not yet seen at any call site) | Per-arg analysis array |
| 2000 | Non-pointer, already resolved, or conflicting (skip) | Per-arg analysis array |
| -4096 | DenseMap empty slot | All DenseMap caches |
| -8192 | DenseMap tombstone (deleted entry) | All DenseMap caches |

Diagnostic Messages

| Message | Source | Condition |
|---|---|---|
| "Initial work list size : %d" | sub_2CBBE90 | Always (when dump-ip-msp) |
| "funcname : changed in argument memory space (N arguments)" | sub_2CBBE90 | Args resolved |
| "funcname is cloned" | sub_2CBBE90 | Clone created |
| "avoid cloning of funcname" | sub_2CBBE90 | External linkage, empty results |
| "N callees are affected" | sub_2CBBE90 | After propagation |
| "funcname : return memory space is resolved : N" | sub_2CBBE90 | Return space resolved |
| "phi maxLoopInd = N: Function name" | sub_1C6A6C0 | LIBNVVM loop-ind analysis |

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| MemorySpaceCloning | sub_2CBBE90 | 71 KB | Worklist driver (New PM variant) |
| IPMSPPass | sub_1C6A6C0 | 54 KB | LIBNVVM variant |
| inferAddressSpace | sub_2CE96D0 | -- | Inference entry point |
| coreDataflowWalker | sub_2CE8530 | -- | Backward dataflow analysis |
| perCalleePropagate | sub_2CE8CB0 | -- | Per-callee space propagation |
| mergeAndCommit | sub_2CE88B0 | -- | Post-inference merge (qsort) |
| rewriteCalleePair | sub_2CE85D0 | -- | Instruction rewriting for matched pairs |
| calleeMatchingEngine | sub_2CE7410 | -- | Dominance + coverage scoring |
| pushInferenceResult | sub_2CE80A0 | -- | Append to result vector |
| vectorRealloc | sub_2CE7E60 | -- | Grow inference result vector |
| computeEdgeWeight | sub_2CE4830 | -- | Call graph edge weight |
| commitSpace | sub_2CE3B60 | -- | Commit resolved space to callee |
| fallbackPropagate | sub_2CE3A70 | -- | Propagate unmatched entries |
| propagateToAlternate | sub_2CE3780 | -- | Propagate to alternate callee users |
| commitSingleCallee | sub_2CE2F10 | -- | Single-callee commit via vtable |
| singlePredecessorCheck | sub_2CE2DE0 | -- | Check single-predecessor property |
| qsortComparator | sub_2CE2BD0 | -- | Compare callee entries for sorting |
| mergeSmallVectors | sub_2CE2A70 | -- | Merge small vector pairs |
| extractAddressSpace | sub_2CE27A0 | -- | Extract AS from Value's type |
| cloneInstruction | sub_2CE8120 | -- | Clone instruction + DenseMap update |
| populateUserSet | sub_2CE97F0 | -- | Build per-arg user list |
| propagateSpacesToCallees | sub_2CF5840 | -- | Post-specialization propagation |
| bodyWalker | sub_2CF51E0 | -- | Walk function body for propagation |
| shouldProcessFunction | sub_2CBA650 | -- | Worklist eligibility predicate |
| hasUnresolvedPointerArgs | sub_2CBA520 | -- | Check for unresolved generic ptr args |
| CloneFunction | sub_F4BFF0 | -- | Full function clone with arg rewriting |
| ValueMapCloner | sub_F4BB00 | -- | ValueMap-based body cloner |
| replaceAllUsesWith | sub_BD84D0 | -- | Redirect call sites to clone |
| mapInsertOrFind | sub_2CBB230 | -- | Red-black tree insert |
| mapLookup | sub_2CBB490 | -- | Red-black tree search |
| dequeGrow | sub_2CBB610 | -- | Worklist deque push_back |
| checkAttributeBundle | sub_245A9B0 | -- | Attribute flag membership test |
| instructionEquivalence | sub_245AA10 | -- | Test instruction equivalence |
| bbDominates | sub_2403DE0 | -- | BasicBlock dominance test |
| loopMembership | sub_24B89F0 | -- | Check if two instructions share a loop |
| createSpecializedInst | sub_244CA00 | -- | Create instruction with modified types |
| insertIntoBlock | sub_24056C0 | -- | Insert instruction into BB |
| updateDebugInfo | sub_2D2DBE0 | -- | Debug info update for cloned inst |

Cross-References

Memory Space Optimization

The Memory Space Optimization pass (memory-space-opt) is NVIDIA's inter-procedural address space resolution engine. Its job is to convert generic (flat) pointers into specific address spaces -- global, shared, local, constant, or parameter -- so that the backend can emit specialized memory instructions (ld.shared, st.global, etc.) instead of generic ones (ld, st) that require address translation hardware at runtime. On NVIDIA GPUs, generic memory accesses go through an address translation unit that adds latency; resolving pointer provenance at compile time eliminates this overhead entirely and is one of the most impactful optimizations in the CUDA compilation pipeline.

The pass is implemented as a multi-function cluster totaling roughly 250 KB of decompiled code, with two cooperating systems: an intra-procedural address space resolver and an inter-procedural function cloning engine.

Key Facts

Property             | Value
Pass name (pipeline) | memory-space-opt
Class                | MemorySpaceOptPass
Pass type            | Parameterized FunctionPass (NVIDIA-custom)
Registration         | New PM #416, parameterized: first-time;second-time;no-warnings;warnings
Runtime positions    | Tier 1/2/3 #65 (after DSE + DCE + LLVM standard pipeline); also runs early in "mid" path (see Pipeline)
Pass entry point     | sub_1C70910 (2,427 lines)
Pass factory         | sub_1C8E680
NVVMPassOptions slot | Offset +2680 (disable), offset +3120 (mode parameter)
Binary size          | ~250 KB total (multi-function cluster)
Upstream equivalent  | None -- entirely NVIDIA-proprietary

NVPTX Address Space Numbering

The pass operates on the standard NVPTX address spaces (0=generic, 1=global, 3=shared, 4=constant, 5=local, 101=param). See Address Spaces for the complete table with hardware mapping, pointer widths, and aliasing rules.

Internally, the pass encodes address spaces as a single-bit bitmask for efficient dataflow computation (0x01=global, 0x02=shared, 0x04=constant, 0x08=local, 0x10=param, 0x0F=unknown). When multiple pointer sources contribute different spaces, the bitmask is OR'd together. A singleton bit (popcount == 1) means the space is fully resolved; multiple bits set means ambiguous. See the MemorySpaceOpt Internal Bitmask section for the complete mapping and resolution algorithm.
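
The bitmask lattice described above can be modeled in a few lines. This is an illustrative sketch, not recovered code; the bit values follow the documented encoding, while the helper names are invented:

```python
# Documented bit encoding for the MemorySpaceOpt internal bitmask.
GLOBAL, SHARED, CONSTANT, LOCAL, PARAM = 0x01, 0x02, 0x04, 0x08, 0x10

def merge(mask_a: int, mask_b: int) -> int:
    """Meet operation: contributions from multiple pointer sources are OR'd."""
    return mask_a | mask_b

def is_resolved(mask: int) -> bool:
    """A singleton bit (popcount == 1) means the space is fully resolved."""
    return mask != 0 and (mask & (mask - 1)) == 0

# A pointer reached only from shared memory is resolved:
assert is_resolved(SHARED)
# A PHI merging global and shared sources stays ambiguous:
assert not is_resolved(merge(GLOBAL, SHARED))
```

Because the meet is a bitwise OR, a value's mask can only grow as more sources are discovered, which is what makes the later fixed-point iteration terminate.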

IR Before/After Example

The following illustrates the core transformation: generic-pointer loads/stores are resolved to specific address spaces, enabling specialized PTX memory instructions.

Before (generic pointers, AS 0):

define void @kernel(ptr addrspace(0) %shared_buf, ptr addrspace(0) %global_out) {
  %val = load float, ptr addrspace(0) %shared_buf, align 4
  %add = fadd float %val, 1.0
  store float %add, ptr addrspace(0) %global_out, align 4
  %check = call i1 @llvm.nvvm.isspacep.shared(ptr %shared_buf)
  br i1 %check, label %fast, label %slow
fast:
  ret void
slow:
  ret void
}

After (resolved address spaces):

define void @kernel(ptr addrspace(3) %shared_buf, ptr addrspace(1) %global_out) {
  %val = load float, ptr addrspace(3) %shared_buf, align 4    ; -> ld.shared.f32
  %add = fadd float %val, 1.0
  store float %add, ptr addrspace(1) %global_out, align 4     ; -> st.global.f32
  ; isspacep.shared folded to true (phase 2), branch simplified by later DCE
  br label %fast
fast:
  ret void
}

The addrspacecast instructions are inserted during resolution and consumed by downstream passes. The isspacep folding (phase 2 only) eliminates runtime address space checks when the space is statically known.

Two-Phase Architecture

The pass entry point (sub_1C70910) accepts a mode parameter controlling execution:

Mode | Name                     | Behavior
0    | First-time               | Conservative resolution via sub_1CA2920. Called early in the pipeline.
1    | Second-time              | Hash-table-based resolution via sub_1CA9E90. Called after IP-MSP propagation.
2    | First-time, no warnings  | Same as mode 0 but suppresses "Cannot tell what pointer points to" messages.
3    | Second-time, no warnings | Same as mode 1 but silent. Used on re-runs where repeated warnings would be noise.

Both phases share the same instruction dispatch structure, handling loads (opcode 0x36), stores (0x37), calls (0x4E), atomic loads (0x3A), and atomic stores (0x3B).

Phase 1 (first-time) resolves obvious cases where pointer origin is statically known. It uses sub_1C9F820 for dataflow analysis and sub_1C98370 for annotation-based resolution.

Phase 2 (second-time) runs after inter-procedural propagation has enriched the analysis context. It uses hash-table lookups (sub_1CA8350) and can fold isspacep intrinsics (builtins 0xFD0-0xFD5) to constants when the address space is already known, eliminating runtime space checks.

Inter-Procedural Memory Space Propagation (IP-MSP)

Complexity. Let F = number of functions in the module, A = total number of pointer-typed arguments across all functions, E = total call-graph edges, and I = total instructions. The intra-procedural use-def chain walk is O(I) per function (bounded by visited-set to avoid cycles through PHI nodes). The IP-MSP worklist iterates until no argument's bitmask changes; since each of the A arguments has a 5-bit bitmask that can only grow (OR of incoming values), the worklist converges in at most O(A) rounds. Each round re-analyzes at most O(F) functions, and adding callers back to the worklist costs O(E) in total across all rounds. Worst-case: O(A * (F * I_avg + E)) where I_avg is average instructions per function. Function cloning adds at most O(F) clones (bounded by do-clone-for-ip-msp), each clone being O(I_f) to create. In practice, GPU modules have small call graphs (F < 200 after inlining) and the worklist converges in 2--4 rounds, making the pass effectively O(F * I_avg + E).

The IP-MSP driver in sub_1C70910 implements a fixed-point worklist algorithm that propagates address space information across function boundaries:

  1. Build a worklist of all functions in the module. Debug: "Initial work list size: %d".
  2. Pop a function from the worklist.
  3. Run intra-procedural resolution (phase 1 or 2).
  4. If argument memory spaces changed ("changed in argument memory space"), add all callers back to the worklist ("callees are affected").
  5. If the return memory space is resolved ("return memory space is resolved"), propagate to callers.
  6. Repeat until the worklist is empty.
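
The six steps above amount to a standard monotone fixed-point loop. A minimal sketch, assuming a call graph given as `callers[f]` and a pluggable intra-procedural resolver (all names are illustrative, not recovered symbols):

```python
from collections import deque

def ip_msp(functions, callers, resolve_intraproc):
    """Fixed-point worklist over per-function argument bitmasks.

    resolve_intraproc(f, all_masks) returns f's per-argument bitmask dict,
    possibly using masks already computed for other functions.
    """
    arg_masks = {f: resolve_intraproc(f, {}) for f in functions}  # initial state
    worklist = deque(functions)                                   # step 1
    while worklist:
        f = worklist.popleft()                                    # step 2
        new_masks = resolve_intraproc(f, arg_masks)               # step 3
        if new_masks != arg_masks[f]:                             # step 4
            # Bitmasks only grow (monotone OR), so this terminates.
            arg_masks[f] = new_masks
            for caller in callers.get(f, ()):
                if caller not in worklist:
                    worklist.append(caller)
    return arg_masks                                              # step 6: fixed point
```

Return-space propagation (step 5) would hang off the same change detection; it is omitted here to keep the skeleton short.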

A second IP-MSP implementation exists at sub_1C6A6C0 (54KB), which appears to be the LIBNVVM/module-pass variant. It uses DenseMap-style hash tables (sentinel -8 for empty, -16 for tombstone), has explicit loop-induction analysis (sub_1BF8310), and runs three sub-phases: call-site collection (level controlled by dword_4FBD1E0, default 4), address space resolution (level dword_4FBD2C0, default 2), and a WMMA-specific pass (sub_1C5FDC0).

Function Cloning for Specialization

When different call sites pass pointers from different address spaces to the same function argument, the pass clones the function so that each clone can be specialized for a single address space. This is the key mechanism that eliminates generic pointers at call boundaries.

The cloning engine (sub_2CBBE90, 71KB) uses two distinct strategies based on function linkage:

Strategy 1 -- In-place specialization (internal/private linkage): All call sites are visible within the module, so the function is modified directly. Pointer argument types are changed from generic (AS 0) to the resolved specific space. No clone is created. This is the cheaper path.

Strategy 2 -- Clone and specialize (external/linkonce/weak linkage): The function might have callers outside the module, so the original must be preserved. A clone is created with internal linkage (0x4007), its argument types are specialized, and internal call sites are rewritten to target the clone. The original is kept for any callers that still pass generic pointers.

The cloning process (sub_F4BFF0):

  1. Iterate all formal args of the original function.
  2. For each arg whose address space was resolved, create a new function type with the specific address space.
  3. Allocate a new Function object via sub_BD2DA0(136).
  4. Copy linkage, attributes, and calling convention.
  5. Clone the body via sub_F4BB00 (ValueMap-based cloner).
  6. For specialized args, insert addrspacecast instructions at the clone's entry.
  7. Rewrite matching call sites via sub_BD84D0.

After cloning, the clone is pushed back onto the worklist, enabling recursive specialization through call chains: if A calls B calls C, each level's arguments resolve bottom-up as the worklist iterates.
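
The linkage dispatch between the two strategies can be sketched as follows. The `Function` model and names are illustrative; only the decision logic mirrors the documented behavior:

```python
from dataclasses import dataclass

INTERNAL_LINKAGES = {"internal", "private"}

@dataclass
class Function:
    name: str
    linkage: str
    arg_spaces: list  # resolved address space per argument (0 = generic)

def specialize(fn: Function, resolved: list) -> Function:
    if fn.linkage in INTERNAL_LINKAGES:
        # Strategy 1: all call sites are visible -- rewrite arg types in place.
        fn.arg_spaces = resolved
        return fn
    # Strategy 2: preserve the original for external callers; emit an
    # internal-linkage clone with specialized argument types.
    return Function(fn.name + ".clone", "internal", resolved)
```

Inverting this check is pitfall 3 below: cloning internal functions wastes compile time, while mutating external ones breaks out-of-module callers.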

Intra-Procedural Resolution Algorithm

Use-Def Chain Walking (sub_1CA5350)

The core resolver walks backward through use-def chains to find the original allocation a pointer derives from:

IR Node               | Behavior
GEP (H)               | Transparent -- follow pointer operand
Bitcast (G)           | Transparent -- follow source operand
PHI (O)               | Follow all incoming values (adds all to worklist)
Call (M)              | Check if returns a known-space pointer
Load (subcode 32)     | Tracked if track-indir-load is enabled
inttoptr (subcode 47) | Tracked if track-int2ptr is enabled
ptrtoint (subcode 48) | Transparent
Alloca (8)            | Resolves to local (AS 5)

The walker uses a worklist with a visited bitset to handle cycles through phi nodes. It collects three separate vectors: loads (indirect pointers), GEPs, and calls returning pointers.
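
A condensed sketch of this backward walk, using a toy dict-based IR model (the node kinds follow the table above; everything else is illustrative):

```python
LOCAL = 0x08  # alloca roots resolve to local (AS 5)

def walk_origins(value, mask_of_root):
    """OR together address-space bits over all reachable pointer roots."""
    worklist, visited, mask = [value], set(), 0
    while worklist:
        v = worklist.pop()
        if id(v) in visited:
            continue                           # visited set breaks PHI cycles
        visited.add(id(v))
        kind = v["kind"]
        if kind in ("gep", "bitcast", "ptrtoint"):
            worklist.append(v["src"])          # transparent: follow operand
        elif kind == "phi":
            worklist.extend(v["incoming"])     # follow all incoming values
        elif kind == "alloca":
            mask |= LOCAL                      # definite local root
        else:
            mask |= mask_of_root(v)            # calls, loads, arguments...
    return mask
```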

Resolution Decision

Once the bitmask is computed:

  • Single bit set: resolved. Insert addrspacecast to the target space.
  • Multiple bits set: ambiguous. If param-always-point-to-global is true and the param bit is set, resolve to global. Otherwise emit a warning and default to global.
  • Zero bits: unreachable or error.
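
A sketch of this decision procedure, assuming the documented bit encoding (the warning string is the one recovered from the binary; the function shape is invented):

```python
GLOBAL, PARAM = 0x01, 0x10

def resolve(mask: int, param_always_global: bool, warn=print) -> int:
    if mask == 0:
        raise ValueError("unreachable: no address-space evidence")
    if mask & (mask - 1) == 0:
        return mask                  # single bit set: fully resolved
    if param_always_global and (mask & PARAM):
        return GLOBAL                # param bit present: assume global
    warn("Cannot tell what pointer points to, assuming global memory space")
    return GLOBAL                    # ambiguous: conservative default
```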

Address Space Inference Engine (sub_2CE96D0)

For generic-pointer arguments at call sites, the inference engine creates a 608-byte analysis context on the stack, sets up six independent tracking sets, and calls sub_2CE8530 for deep dataflow analysis tracing pointer provenance through GEPs, bitcasts, PHI nodes, and loads from known-space pointers.

Post-Resolution Optimizations

After resolving a pointer's address space, the pass performs several follow-up transformations:

  • addrspacecast insertion: sub_1CA1B70 (first-time) / sub_1CA28F0 (second-time) inserts a cast from generic to the resolved space and replaces all uses of the generic pointer.
  • Instruction rewriting: Loads and stores on generic pointers are rewritten to use the specific space, enabling the backend to emit ld.shared, st.global, etc.
  • isspacep folding (second-time only): If a pointer's space is known, isspacep.shared(%p) folds to true or false.
  • Dead cast elimination: Redundant addrspacecast chains (e.g., generic-to-shared followed by shared-to-generic) are simplified.
  • Call site specialization: After cloning, call sites are rewritten to call the specialized version with casted arguments.
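
The isspacep folding step can be sketched as a lookup against the resolved space. The intrinsic names follow the llvm.nvvm.isspacep.* family; the table and function are illustrative:

```python
# NVPTX address space number implied true by each isspacep intrinsic.
SPACE_OF_INTRINSIC = {
    "llvm.nvvm.isspacep.global": 1,
    "llvm.nvvm.isspacep.shared": 3,
    "llvm.nvvm.isspacep.const":  4,
    "llvm.nvvm.isspacep.local":  5,
}

def fold_isspacep(intrinsic: str, resolved_space):
    """Return True/False if foldable, or None if the pointer is still generic."""
    if resolved_space is None or resolved_space == 0:
        return None                 # unresolved: keep the runtime check
    return SPACE_OF_INTRINSIC[intrinsic] == resolved_space
```

Once folded to a constant, the branch on the result becomes trivially dead, which is why the example above leaves branch cleanup to later DCE.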

Error Handling for Illegal Operations

The pass detects and reports illegal address-space/operation combinations as soft warnings (compilation continues):

Operation         | Illegal Space | Warning Message
Atomic load/store | Constant      | "Cannot do atomic operation on const memory"
Atomic load/store | Local         | "Cannot do atomic on local memory"
WMMA              | Constant      | "Cannot do WMMA on constant memory"
WMMA              | Local         | "Cannot do WMMA on local memory"
Vector atomic     | Shared        | "Cannot to vector atomic on shared memory"
Vector atomic     | Local         | "Cannot to vector atomic on local memory"
Vector atomic     | Constant      | "Cannot to vector atomic on const memory"

Note: The vector atomic messages contain a typo in NVIDIA's source -- "Cannot to" should read "Cannot do". This typo is present in all three vector atomic warning strings.
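
The soft-warning behavior amounts to a table lookup that never aborts compilation. A sketch, reusing the recovered strings verbatim (including the typo) with invented structure:

```python
CONSTANT, LOCAL, SHARED = 4, 5, 3  # NVPTX address space numbers

ILLEGAL = {
    ("atomic", CONSTANT): "Cannot do atomic operation on const memory",
    ("atomic", LOCAL):    "Cannot do atomic on local memory",
    ("wmma",   CONSTANT): "Cannot do WMMA on constant memory",
    ("wmma",   LOCAL):    "Cannot do WMMA on local memory",
    ("vector_atomic", SHARED):   "Cannot to vector atomic on shared memory",
    ("vector_atomic", LOCAL):    "Cannot to vector atomic on local memory",
    ("vector_atomic", CONSTANT): "Cannot to vector atomic on const memory",
}

def check_op(op: str, space: int, warn) -> bool:
    """Warn (soft) on an illegal combination; compilation continues."""
    msg = ILLEGAL.get((op, space))
    if msg:
        warn(msg)
        return False
    return True
```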

Key Functions

Function                   | Address     | Size       | Role
Pass entry / IP-MSP driver | sub_1C70910 | 2427 lines | Main entry point, worklist iteration, mode dispatch
First-time resolver        | sub_1CA2920 | 1119 lines | Conservative address space resolution
Second-time resolver       | sub_1CA9E90 | 933 lines  | Hash-table-based resolution with isspacep folding
Use-def chain walker       | sub_1CA5350 | 1641 lines | Backward pointer origin tracking
Per-BB scanner             | sub_1CA8CD0 | 898 lines  | Instruction scan, bitmask builder
Pass initialization        | sub_1CAB590 | 1040 lines | Global registration, data structure setup
MemorySpaceCloning engine  | sub_2CBBE90 | 71KB       | Inter-procedural function cloning
IPMSPPass variant          | sub_1C6A6C0 | 54KB       | LIBNVVM module-pass variant
Address space inference    | sub_2CE96D0 | --         | Dataflow analysis for single argument
CloneFunction              | sub_F4BFF0  | --         | Full function clone with type rewriting
shouldProcessFunction      | sub_2CBA650 | --         | Multi-condition filter for worklist eligibility
hasUnresolvedPointerArgs   | sub_2CBA520 | --         | Checks if any arg is an unresolved generic pointer
replaceAllUsesWith         | sub_BD84D0  | --         | Rewrites call sites to target the clone
propagateSpacesToCallees   | sub_2CF5840 | --         | Propagates resolved spaces through call graph

Alternate Algorithm

A parallel implementation exists at sub_2CBBE90 / sub_2CEAC10 / sub_2CF2C20, selected when mem-space-alg != 2. The default algorithm (value 2) is the one documented above; the alternate may be a simpler or older version optimized for different patterns.

Configuration Knobs

Primary Knobs (ctor_264 / ctor_267_0)

Knob                            | Global        | Type | Default | Description
dump-ip-msp                     | dword_4FBD480 | bool | false   | Dump inter-procedural memory space propagation debug info
do-clone-for-ip-msp             | dword_4FBD3A0 | int  | -1      | Max number of clones (-1 = unlimited). Set to 0 to disable cloning.
param-always-point-to-global    | unk_4FBE1ED   | bool | true    | Assume kernel parameters always point to global memory
dump-ir-before-memory-space-opt | byte_4FBE000  | bool | false   | Dump IR before the pass runs
dump-ir-after-memory-space-opt  | byte_4FBDF20  | bool | false   | Dump IR after the pass completes
track-indir-load                | byte_4FBDE40  | bool | true    | Track pointers loaded from memory during use-def walking
mem-space-alg                   | dword_4FBDD60 | int  | 2       | Algorithm selection for address space optimization
track-int2ptr                   | byte_4FBDC80  | bool | true    | Track inttoptr casts during analysis

Additional Knobs (ctor_267_0 / ctor_531_0)

Knob                      | Default | Description
process-alloca-always     | true    | Treat alloca instructions as definite local (AS 5) regardless of context
wmma-memory-space-opt     | true    | Enable memory space optimization for WMMA operations
strong-global-assumptions | true    | Assume const buffer pointers always point to globals
process-builtin-assume    | --      | Process __builtin_assume(__is*(p)) assertions for space deduction

IP-MSP Pass Knobs (ctor_528)

Knob                | Global        | Default | Description
dump-ip-msp         | qword_5013548 | 0       | Debug tracing for IPMSP variant
do-clone-for-ip-msp | qword_5013468 | -1      | Clone limit for IPMSP variant

Optimization Level Behavior

Level  | Phase 1 (first-time)                                                           | Phase 2 (second-time)                  | IP-MSP Cloning
O0     | Runs (mode 0) -- address space resolution is required for correct PTX emission | Not run                                | Not run
Ofcmax | Runs (mode 0); LSA-Opt forced to 0, limiting resolution depth                  | Not run                                | Not run
Ofcmid | Runs (mode 0)                                                                  | Runs (mode 1) after IP-MSP propagation | Enabled (do-clone-for-ip-msp=-1)
O1+    | Runs (mode 0) early in pipeline                                                | Runs (mode 1) after IP-MSP propagation | Enabled; iterates to fixed point

This pass is unusual in that it runs even at O0 -- address space resolution is a correctness requirement, not purely an optimization. Without it, all memory accesses would use generic (flat) addressing, which is functionally correct but significantly slower due to the address translation hardware penalty. At Ofcmax, the pass runs in a reduced mode with LSA-Opt disabled. See Optimization Levels for the complete pipeline structure.

Diagnostic Strings

"Initial work list size: %d"
"changed in argument memory space"
"is cloned"
"avoid cloning of"
"callees are affected"
"return memory space is resolved"
"Cannot tell what pointer points to, assuming global memory space"
"Cannot do atomic operation on const memory"
"Cannot do atomic on local memory"
"Cannot do WMMA on constant memory"
"Cannot do WMMA on local memory"
"Cannot to vector atomic on shared memory"
"Cannot to vector atomic on local memory"
"Cannot to vector atomic on const memory"

Multi-Pass Data Flow: MemorySpaceOpt / IP-MSP / Alias Analysis

The following diagram shows how three cooperating subsystems exchange data to resolve generic pointers into specific address spaces. The left column is MemorySpaceOpt (per-function), the center is IP-MSP (module-level), and the right is NVVM Alias Analysis (query service). Arrows show data produced (-->) and consumed (<--).

 MemorySpaceOpt (per-function)       IP-MSP (module-level)          NVVM Alias Analysis
 ==============================      ==========================      ======================

 1. EARLY RUN (mode 0)
 +----------------------------+
 | Use-def chain walker       |
 | (sub_1CA5350)              |
 | Walk: GEP, bitcast, PHI,  |
 | alloca, call returns       |
 |                            |
 | Produces:                  |
 |  - per-arg bitmask         |
 |    (0x01=global,0x02=shr,  |
 |     0x04=const,0x08=local, |
 |     0x10=param)            |
 |  - unresolved arg list     |
 +---+------------------------+
     |                                                              +----------------------+
     | per-arg bitmasks                                             | Address space         |
     | (singleton bit = resolved,                                   | disjointness table:  |
     |  multi-bit = ambiguous)                                      |                      |
     v                                                              | AS 1 vs AS 3: NoAlias|
 +---+------------------------+                                     | AS 1 vs AS 5: NoAlias|
 | addrspacecast insertion    |                                     | AS 3 vs AS 5: NoAlias|
 | (sub_1CA1B70)              |                                     | AS 0 vs any: MayAlias|
 | Rewrites loads/stores to   |                                     | (stateless, trivial) |
 | ld.shared / st.global etc. |                                     +----------+-----------+
 +---+------------------------+                                                |
     |                                                                         |
     | Resolved pointer types on                                               |
     | function args + return values                                           |
     v                                                                         |
 +---+-----------------------------+      +--------------------------+         |
 | Unresolved args remain generic  | ---> | IP-MSP worklist driver   |         |
 | Need cross-function evidence    |      | (sub_1C70910 / 2CBBE90)  |         |
 +---+-----------------------------+      |                          |         |
     ^                                    | For each function F:     |         |
     |                                    |  1. Collect all callers  |         |
     |                                    |  2. Intersect arg AS     |         |
     |                                    |     across call sites    |         |
     |                                    |  3. If unanimous:        |         |
     |                                    |     specialize or clone  |         |
     |  propagated arg spaces             |                          |         |
     |  (from callers)                    | Produces:                |         |
     +------------------------------------+  - cloned functions      |         |
                                          |    with AS-specific args |         |
                                          |  - updated call sites    |         |
                                          |  - "changed in argument  |         |
                                          |    memory space" events  |         |
                                          +---+----------------------+         |
                                              |                                |
 2. LATE RUN (mode 1)                         | Enriched module with           |
 +----------------------------+               | resolved pointer types          |
 | Hash-table resolver        |               v                                |
 | (sub_1CA9E90)              | <--- cloned functions re-enter worklist        |
 |                            |                                                |
 | Additional capabilities:   |      Each resolved addrspacecast               |
 |  - isspacep folding        |      feeds into...                             |
 |    (builtins 0xFD0-0xFD5) |                                                |
 |  - Dead cast elimination   |                                     +----------v-----------+
 |                            |                                     | NVVM AA (nvptx-aa)   |
 | Consumes:                  |                                     |                      |
 |  - IP-MSP propagated       |                                     | With resolved AS on  |
 |    address spaces          |                                     | pointers, queries    |
 |  - hash table of known     |                                     | return NoAlias for   |
 |    pointer->space mappings |                                     | cross-space pairs    |
 +---+------------------------+                                     |                      |
     |                                                              | Enables downstream:  |
     | Fully resolved IR                                            |  - GVN load forward  |
     | (minimal generic ptrs)                                       |  - DSE elimination   |
     v                                                              |  - LICM hoisting     |
 +---+------------------------+                                     |  - MemorySSA queries |
 | Downstream consumers:      |                                     +----------------------+
 |  - Instruction selection   |
 |    (ld.shared, st.global)  |
 |  - Backend PTX emission    |
 |  - Register allocation     |
 |    (no generic-ptr spills) |
 +----------------------------+

Data flow summary:

Producer               | Data                                              | Consumer
MemorySpaceOpt phase 1 | Per-arg address space bitmask                     | IP-MSP worklist
IP-MSP worklist        | Cloned functions with specialized arg types       | MemorySpaceOpt phase 2
IP-MSP worklist        | Call-site rewriting (addrspacecast at boundaries) | All downstream passes
MemorySpaceOpt phase 2 | isspacep folded to true/false                     | Dead code elimination
Both phases            | Resolved pointer address spaces on all IR values  | NVVM AA (nvptx-aa)
NVVM AA                | NoAlias for cross-space pointer pairs             | GVN, DSE, LICM, MemorySSA

The feedback loop between MemorySpaceOpt and IP-MSP is the critical insight: phase 1 resolves locally-obvious cases, IP-MSP propagates those resolutions across call boundaries (cloning when necessary), and phase 2 picks up the newly-available information to resolve cases that were previously ambiguous. The worklist iterates until no more argument spaces change, guaranteeing a fixed point. NVVM AA is the downstream beneficiary -- every resolved pointer pair that previously required a conservative MayAlias answer can now return NoAlias, enabling more aggressive optimization in GVN, DSE, LICM, and scheduling.

Common Pitfalls

These are mistakes a reimplementor is likely to make when building an equivalent address space resolution engine.

1. Resolving ambiguous pointers to the wrong default space. When the bitmask has multiple bits set (e.g., 0x03 = global OR shared), the pass defaults to global if param-always-point-to-global is true. A reimplementation that defaults to shared instead will silently produce ld.shared instructions for what is actually global memory, causing out-of-bounds accesses on the shared memory aperture. The correct behavior is: ambiguous always resolves to global (the safe conservative choice), never to a more restrictive space.

2. Forgetting to re-run after inter-procedural propagation. The pass must run twice: once before IP-MSP to resolve locally-obvious cases, and again after IP-MSP to consume propagated information. A single-pass reimplementation will miss every case where a callee's argument space is only known from the caller's context. The second run (mode 1) is not optional -- it catches the majority of inter-procedural resolutions and performs isspacep folding that the first run cannot do.

3. Cloning functions with external linkage instead of specializing in-place. The pass uses two strategies: in-place specialization for internal/private functions (all call sites visible) and clone-and-specialize for external/weak linkage. Reversing this logic -- cloning internal functions or modifying external ones -- either wastes compile time on unnecessary clones or breaks callers outside the module who still pass generic pointers. The linkage check (0x4007 for internal) is the discriminator and must not be inverted.

4. Failing to handle the addrspacecast chain correctly. After resolving a pointer's space, the pass inserts addrspacecast from generic to the specific space and replaces all uses. A reimplementation that replaces the pointer type directly (without the cast) will break LLVM's type system invariants, causing assertion failures in downstream passes. The cast must exist in the IR even though it is semantically a no-op -- LLVM's type-based alias analysis and GEP arithmetic depend on it.

5. Not iterating the IP-MSP worklist to a fixed point. The worklist must iterate until no argument bitmask changes. A reimplementation that runs one pass over all functions and stops will miss transitive resolutions through call chains (A calls B calls C). The bitmask OR is monotone (can only grow), so convergence is guaranteed, but early termination produces incomplete resolutions that leave generic pointers in the IR and forfeit the performance benefit of specialized memory instructions.

Test This

The following minimal kernel exercises address space resolution. Compile with nvcc -ptx -arch=sm_90 and inspect the PTX output.

__global__ void memspace_test(float *global_out, int n) {
    __shared__ float smem[64];
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    float val = smem[threadIdx.x];
    global_out[threadIdx.x] = val + 1.0f;
}

What to look for in PTX:

  • ld.shared.f32 for the read from smem -- confirms the pass resolved the shared pointer from generic (AS 0) to shared (AS 3). If you see a plain ld.f32 without the .shared qualifier, the access goes through the generic address translation unit at runtime.
  • st.global.f32 for the write to global_out -- confirms global pointer resolution (AS 1).
  • Absence of cvta.to.shared / cvta.to.global instructions. These cvta (convert address) instructions indicate the backend is converting generic pointers at runtime instead of using resolved address spaces at compile time. Their absence means the pass succeeded fully.
  • Compare with -O0 to see the unresolved version where generic ld/st instructions dominate.

Reimplementation Checklist

  1. Address space bitmask dataflow engine. Implement the per-value bitmask lattice (0x01=global, 0x02=shared, 0x04=constant, 0x08=local, 0x10=param) with OR-based meet, use-def chain walking through GEP/bitcast/PHI/alloca/inttoptr, and a visited-set to handle cycles through PHI nodes.
  2. Two-phase resolution with mode dispatch. Build a mode-parameterized entry point: mode 0 (conservative first-time), mode 1 (hash-table-based second-time with isspacep folding), and warning-suppression variants (modes 2/3).
  3. Inter-procedural fixed-point worklist (IP-MSP). Implement the module-level worklist that propagates per-argument address space bitmasks across call boundaries, re-adding callers when an argument's bitmask changes, iterating until no bitmask grows.
  4. Function cloning for specialization. Implement two strategies: in-place specialization for internal-linkage functions (modify arg types directly) and clone-and-specialize for external-linkage functions (create internal clone, rewrite call sites, insert addrspacecast at clone entry).
  5. isspacep intrinsic folding (phase 2). When a pointer's address space is resolved, fold isspacep.shared/.global/etc. builtins (IDs 0xFD0--0xFD5) to true or false constants.
  6. Post-resolution cleanup. Insert addrspacecast instructions, rewrite loads/stores to specific address spaces, eliminate dead cast chains (generic-to-shared followed by shared-to-generic), and rewrite call sites to target specialized clones.
  7. Illegal operation detection. Check and warn on illegal address-space/operation combinations (atomics on constant/local, WMMA on constant/local, vector atomics on shared/local/constant) without aborting compilation.

Pipeline Interaction

The pass runs at two points in the CICC pipeline: once early (first-time, mode 0) to resolve obvious cases before optimization, and again after inter-procedural propagation (second-time, mode 1) to catch cases that became resolvable after inlining and constant propagation. The no-warnings variants (modes 2/3) suppress repeated diagnostics on re-runs. The pass feeds directly into instruction selection, where resolved address spaces determine which PTX memory instructions are emitted. It also interacts with the ipmsp module pass, which drives the inter-procedural cloning engine separately from the per-function resolver.

nvvm-peephole-optimizer

The NVVM Peephole Optimizer is an NVIDIA-proprietary function-level IR pass that performs NVVM-specific pattern matching and instruction simplification. It is distinct from both LLVM's standard InstCombine pass (which handles general-purpose peephole optimization across ~600 functions in the 0x1700000--0x17B0000 range) and the machine-level nvptx-peephole pass (sub_21DB090) that operates on MachineInstrs after instruction selection.

This page documents all three peephole layers in cicc, their pipeline positions, their transformations, and the satellite machine-level peephole passes that complement them.

Pass name            | nvvm-peephole-optimizer
Class                | llvm::NVVMPeepholeOptimizerPass
Scope                | Function pass (IR level)
Registration         | New PM slot 382 in sub_2342890
Serializer           | sub_2314DA0
Factory (legacy PM)  | sub_1CEF8F0
Pipeline parser line | 3534 in sub_233C410
Enable knob          | enable-nvvm-peephole (bool, default = true)
Knob address         | ctor_358_0 @ 0x50E8D0
NVVMPassOptions slot | nvvm-peephole-optimizer in 4512-byte options struct
Pipeline position    | Function-level, runs after NVVMReflect + NVVMIntrinsicLowering

Purpose

CUDA programs produce IR patterns that standard LLVM optimizations do not recognize or cannot legally transform. The NVVM peephole pass fills this gap by matching NVVM-specific idioms -- address space casts, intrinsic call sequences, convergent operation patterns, and GPU-specific type conversions -- and rewriting them into simpler, cheaper forms. It operates at the LLVM IR level before code generation, complementing the machine-level nvptx-peephole pass that runs later in the pipeline.

The pass is always paired with sub_215D9D0 (NVVMAnnotationsProcessor), which runs immediately after the peephole in every pipeline path. This companion pass processes NVVM annotations (e.g., tcgen05 tensor annotation metadata) on the IR that the peephole has just simplified.

Three Peephole Layers

CICC contains three distinct peephole optimization layers, each operating at a different abstraction level and targeting different pattern classes.

Layer            | Pass                    | Level        | Address / Slot                | Targets
LLVM InstCombine | instcombine             | IR           | 0x1700000+ (~600 funcs)       | General-purpose: algebraic simplification, constant folding, dead instruction removal
NVVM Peephole    | nvvm-peephole-optimizer | IR           | slot 382, factory sub_1CEF8F0 | NVVM-specific: address space casts, intrinsic sequences, GPU type conversions
NVPTX Peephole   | nvptx-peephole          | MachineInstr | sub_21DB090                   | PTX-specific: redundant cvta folding, predicate optimization, move elimination

The NVVM peephole pass handles transformations that require knowledge of NVVM's address space model, intrinsic semantics, or GPU-specific type system -- patterns that InstCombine cannot match because they depend on NVPTX target information not available to target-independent passes. The machine-level NVPTX peephole then handles patterns that only emerge after instruction selection has lowered IR to MachineInstrs.
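
One pattern class in the address-space-cast family is the round-trip cast (specific to generic and back to the same specific space), which is a semantic no-op. A toy sketch of the match, on an invented tuple IR model rather than real LLVM IR:

```python
def simplify_cast_chain(inst):
    """inst: ("addrspacecast", src, from_as, to_as) tuples; src may itself
    be an instruction tuple. Folds specific -> generic -> same-specific
    round trips back to the original pointer."""
    op, src, from_as, to_as = inst
    if op == "addrspacecast" and isinstance(src, tuple) and src[0] == "addrspacecast":
        _, inner_src, inner_from, inner_to = src
        if inner_from == to_as and inner_to == from_as:
            return inner_src        # round trip: fold to the original value
    return inst
```

Whether this exact rewrite lives in the NVVM peephole, InstCombine, or the NVPTX machine peephole (as cvta folding) is not settled by the string evidence; the sketch only illustrates the pattern-matching style these layers share.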

Pipeline Positions

IR-Level: nvvm-peephole-optimizer

The IR-level peephole (sub_1CEF8F0) is invoked from the legacy pipeline assembler (sub_12E54A0) in all three language-specific code paths. Its companion sub_215D9D0 always follows immediately.

Path A -- "ptx" language (lines 580--638 in sub_12E54A0):

sub_1CEF8F0()    NVVMPeephole
sub_215D9D0()    NVVMAnnotationsProcessor
sub_1857160()    NVVMReflect (conditional)
sub_1A62BF0(1)   LLVM standard pipeline #1
sub_1B26330()    MemCpyOpt
sub_18DEFF0()    DCE
...

Path B -- "mid" language (Ofcmid, lines 814--1075):

sub_184CD60()    ConstantMerge / GlobalDCE
sub_1CB4E40(0)   NVVMIntrinsicLowering
sub_1B26330()    MemCpyOpt
sub_198E2A0()    SROA / CorrelatedValuePropagation
sub_1CEF8F0()    NVVMPeephole                   <<<
sub_215D9D0()    NVVMAnnotationsProcessor
sub_17060B0(1,0) PrintModulePass
sub_198DF00(-1)  JumpThreading / CVP
sub_1C6E800()    GVN / LICM
...

Path C -- default/general (O2/O3, lines 1077--1371):

sub_1A62BF0(4)   LLVM standard pipeline #4
sub_1857160()    NVVMReflect
sub_1CB4E40(0)   NVVMIntrinsicLowering
sub_1857160()    NVVMReflect (second pass)
sub_1CEF8F0()    NVVMPeephole                   <<<
sub_215D9D0()    NVVMAnnotationsProcessor
sub_1A7A9F0()    InstructionSimplify
sub_1A62BF0(5)   LLVM standard pipeline #5
...

Late position (O3 tier finalization):

sub_1B7FDF0(n)   BranchFolding / CFGSimplify
sub_1CEF8F0()    NVVMPeephole                   <<<
sub_215D9D0()    NVVMAnnotationsProcessor
sub_18B3080(f)   Sinking2Pass (fast mode)
sub_1CC60B0()    NVVMSinking
sub_18A3430()    AggressiveInstCombine
...

In every path, the peephole runs after NVVMIntrinsicLowering (sub_1CB4E40) and NVVMReflect (sub_1857160) have resolved intrinsics and reflect calls. This ensures the peephole sees simplified IR where previously-opaque intrinsic call patterns have been reduced to simpler forms amenable to pattern matching.

Machine-Level: nvptx-peephole

The machine-level peephole (sub_21DB090) runs in addPreRegAlloc() (sub_2166ED0):

EarlyTailDuplicate
codegen DCE
Machine LICM + CSE + Sinking        (conditional on enable-mlicm, enable-mcse)
PeepholeOptimizerPass                (stock LLVM, slot 492, disable-peephole)
NVPTXPeephole (sub_21DB090)         <<<
DeadMachineInstrElim
MachineCopyPropagation

The string "After codegen peephole optimization pass" in sub_2166ED0 marks the checkpoint after both the stock LLVM peephole and the NVPTX peephole have completed.

New PM Registration

The pass is registered as a function-level pass in the New Pass Manager at registration line 2242 in sub_2342890. It sits in the mid-optimization phase alongside other NVIDIA function passes:

Slot | Pass | Class
376 | basic-dbe | BasicDeadBarrierEliminationPass
377 | branch-dist | BranchDistPass
378 | byval-mem2reg | ByValMem2RegPass
379 | bypass-slow-division | BypassSlowDivisionPass
380 | normalize-gep | NormalizeGepPass
381 | nvvm-reflect-pp | SimplifyConstantConditionalsPass
382 | nvvm-peephole-optimizer | NVVMPeepholeOptimizerPass
383 | old-load-store-vectorizer | OldLoadStoreVectorizerPass
384 | print<merge-sets> | MergeSetsAnalysisPrinterPass
385 | remat | RematerializationPass

IR-Level Transformation Categories

Based on pipeline position (after NVVMReflect + NVVMIntrinsicLowering, before sinking and rematerialization) and the patterns visible in NVVM IR, the peephole optimizer targets several categories.

Address Space Cast Simplification

After memory-space-opt and ipmsp resolve generic pointers to specific address spaces, redundant addrspacecast chains remain in the IR. The peephole rewrites these:

; Before:
%p1 = addrspacecast ptr addrspace(3) %src to ptr        ; shared -> generic
%p2 = addrspacecast ptr %p1 to ptr addrspace(3)         ; generic -> shared
store i32 %val, ptr addrspace(3) %p2

; After:
store i32 %val, ptr addrspace(3) %src                   ; chain eliminated

; Before:
%p = addrspacecast ptr addrspace(1) %src to ptr addrspace(1)  ; identity cast

; After:
; (use %src directly — identity addrspacecast removed)
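The two rewrite rules above (round-trip collapse and identity-cast removal) can be sketched compactly. The following is an illustrative toy model over tuple-encoded casts -- the representation and function name are invented here, not recovered from the binary:

```python
GENERIC = 0  # NVPTX address space 0

def fold_cast_chains(insts):
    """Fold specific->generic->specific round trips and identity casts.

    insts: list of (dst, op, src, src_as, dst_as) tuples.
    Returns (surviving instructions, name-replacement map).
    """
    src_of = {}   # dst name -> (src name, src AS, dst AS) for surviving casts
    replace = {}  # folded name -> name to use instead
    out = []
    for dst, op, src, src_as, dst_as in insts:
        src = replace.get(src, src)           # chase earlier replacements
        if op != "addrspacecast":
            out.append((dst, op, src, src_as, dst_as))
            continue
        if src_as == dst_as:                  # identity cast: drop it
            replace[dst] = src
            continue
        prev = src_of.get(src)
        if prev and prev[1] == dst_as and src_as == GENERIC:
            replace[dst] = prev[0]            # x -> generic -> x round trip
            continue
        src_of[dst] = (src, src_as, dst_as)
        out.append((dst, op, src, src_as, dst_as))
    return out, replace
```

Running it on the shared-memory example folds `%p2` back to `%src` and leaves only the first cast (which later DCE can remove if unused).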

The validation function sub_21BEE70 ("Bad address space in addrspacecast", 4.1KB) ensures the peephole does not create illegal address space transitions. NVPTX address spaces are:

AS | Name | Legal cast targets
0 | Generic | All
1 | Global | Generic
3 | Shared | Generic
4 | Constant | Generic
5 | Local | Generic

Intrinsic Call Folding

After NVVMIntrinsicLowering has expanded NVVM intrinsics, some expansion sequences can be further simplified:

; Before (after intrinsic lowering, launch_bounds known):
%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%cmp = icmp ult i32 %tid, 256       ; blockDim.x known = 256

; After (when nvvm-intr-range has set !range !{i32 0, i32 256}):
;   %cmp folds to i1 true             ; always true for valid threads

; Before:
call void @llvm.nvvm.barrier0()
; (no shared memory operations between barriers)
call void @llvm.nvvm.barrier0()

; After:
call void @llvm.nvvm.barrier0()     ; redundant barrier removed
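The redundant-barrier rule lends itself to a linear scan: a `barrier0` can be dropped when nothing that communicates through memory happened since the previous barrier. This is an illustrative sketch over a toy instruction stream (opcode strings are invented for the example), not the decompiled pass body:

```python
# Memory operations that make a following barrier meaningful again.
MEM_OPS = {"load.shared", "store.shared", "load.global", "store.global"}

def drop_redundant_barriers(insts):
    """Remove back-to-back barrier0 calls with no memory traffic between."""
    out = []
    pending_mem = True  # conservatively assume memory ops before block entry
    for inst in insts:
        if inst == "barrier0":
            if not pending_mem:
                continue          # nothing to synchronize: drop the barrier
            pending_mem = False
        elif inst in MEM_OPS:
            pending_mem = True
        out.append(inst)
    return out
```

A real implementation must additionally prove all threads reach both barriers (convergence), which the sketch ignores.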

Type Conversion Cleanup

GPU-specific type representations (bf16, tf32, fp8) produce conversion chains not present in standard LLVM IR:

; Before (roundtrip through wider type):
%wide = fpext half %x to float
%back = fptrunc float %wide to half

; After:
; (use %x directly -- roundtrip eliminated when no precision loss)

; Before (bf16 roundtrip):
%f32 = call float @llvm.nvvm.bf16.to.f32(i16 %bf)
%bf2 = call i16 @llvm.nvvm.f32.to.bf16(float %f32)

; After:
; (use %bf directly)
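The widen-then-narrow fold is safe when the narrow type round-trips exactly through the wide one (every half and bf16 value is exactly representable as float). A minimal sketch of the matcher over tuple-encoded conversions -- representation invented for illustration:

```python
def fold_roundtrips(insts):
    """Find fpext/fptrunc pairs that return to the original type.

    insts: list of (dst, op, src, from_ty, to_ty).
    Returns a map of eliminated names to their replacements.
    """
    widened = {}  # dst of an fpext -> (original value, original type)
    replace = {}
    for dst, op, src, from_ty, to_ty in insts:
        if op == "fpext":
            widened[dst] = (src, from_ty)
        elif op == "fptrunc" and src in widened:
            orig_src, orig_ty = widened[src]
            if orig_ty == to_ty:      # narrows back to the starting type
                replace[dst] = orig_src
    return replace
```

The same shape applies to the bf16 intrinsic pair, with the intrinsic calls standing in for fpext/fptrunc.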

Post-Reflect Dead Code Cleanup

The companion pass nvvm-reflect-pp (SimplifyConstantConditionalsPass) runs immediately before the peephole in the pipeline. It resolves __nvvm_reflect() calls and simplifies constant conditionals:

; Before (after nvvm-reflect-pp resolves __nvvm_reflect("__CUDA_FTZ") = 1):
%ftz = call i32 @__nvvm_reflect(ptr @"__CUDA_FTZ")     ; resolved to 1
%cmp = icmp ne i32 %ftz, 0                              ; always true
br i1 %cmp, label %ftz_path, label %no_ftz_path

; After nvvm-reflect-pp:
br label %ftz_path                   ; unconditional

; The peephole then cleans up dead instructions in %no_ftz_path
; and simplifies any resulting phi nodes at merge points

Convergent Operation Canonicalization

CUDA's convergent operations (__syncwarp, __ballot_sync, etc.) have specific semantic constraints that standard InstCombine cannot reason about because it must treat convergent calls as opaque. The peephole, with knowledge of NVVM semantics, can simplify convergent call sequences when the mask or participating threads can be determined at compile time.

Machine-Level NVPTXPeephole (sub_21DB090)

The machine-level peephole operates on MachineInstr objects after instruction selection has converted LLVM IR to PTX pseudo-instructions. It targets patterns specific to the PTX instruction set.

Redundant cvta Folding

The cvta (convert address) instruction converts between generic and specific address spaces. Address space lowering often inserts redundant conversions:

// Before:
cvta.to.global %rd1, %rd2        ; convert generic -> global
cvta.global %rd3, %rd1           ; convert global -> generic (redundant pair)

// After:
mov.b64 %rd3, %rd2               ; direct copy, cvta pair eliminated
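The pair rewrite reduces to: when a `cvta.SPACE` consumes the result of a `cvta.to.SPACE` of the same space, replace it with a register copy. An illustrative toy-MIR sketch (tuple encoding invented here, not the decompiled sub_21DB090):

```python
def fold_cvta_pairs(insts):
    """Collapse cvta.to.SPACE / cvta.SPACE pairs into mov.b64 copies.

    insts: list of (opcode, dst, src) tuples.
    """
    to_space = {}  # dst reg -> (space, original generic src) from cvta.to.*
    out = []
    for op, dst, src in insts:
        if op.startswith("cvta.to."):
            to_space[dst] = (op[len("cvta.to."):], src)
            out.append((op, dst, src))
        elif op.startswith("cvta.") and src in to_space:
            space, orig = to_space[src]
            if op == "cvta." + space:
                out.append(("mov.b64", dst, orig))  # pair collapses to copy
            else:
                out.append((op, dst, src))
        else:
            out.append((op, dst, src))
    return out
```

The leftover `cvta.to.global` survives for other users; if none exist, DeadMachineInstrElim (which runs next in the pipeline) removes it.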

The companion pass sub_21DA810 ("NVPTX optimize redundant cvta.to.local instruction") handles the remaining cvta.to.local instructions that survive to late post-RA:

// Before (late pipeline):
cvta.to.local %rd1, %rd2         ; redundant when %rd2 is already local-space

// After (sub_21DA810 removes it):
// (use %rd2 directly)

Predicate Pattern Optimization

PTX uses predicate registers for conditional execution. The peephole simplifies predicate sequences:

// Before:
setp.ne.s32 %p1, %r1, 0;
@%p1 bra target;

// After (folds setp into branch when pattern is recognized):
// Combined compare-and-branch

PTX Move Elimination

sub_2204E60 ("Remove redundant moves") eliminates identity moves:

// Before:
mov.b32 %r5, %r5;               ; identity move

// After:
// (deleted)

Satellite Machine Peephole Passes

Three additional machine-level passes perform specialized peephole transformations adjacent to the main NVPTXPeephole:

param-opt (sub_2203290)

Pass name | param-opt
Entry point | sub_2203290
Description | "Optimize NVPTX ld.param"

Optimizes parameter load patterns. In PTX, kernel parameters are loaded via ld.param instructions into registers. When the same parameter is loaded multiple times (e.g., after inlining or loop unrolling), param-opt consolidates them:

// Before:
ld.param.u32 %r1, [_param_0];
...
ld.param.u32 %r7, [_param_0];    ; redundant reload of same parameter

// After:
ld.param.u32 %r1, [_param_0];
...
mov.b32 %r7, %r1;                ; reuse previous load
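The consolidation is a forward scan that remembers the first register loaded from each parameter slot and rewrites later reloads as moves. A hypothetical sketch over tuple-encoded instructions (not the decompiled sub_2203290):

```python
def consolidate_ld_param(insts):
    """Rewrite repeated ld.param of the same slot as a register move.

    insts: list of (dst_reg, opcode, operand) tuples.
    """
    first_load = {}  # param symbol -> register holding its value
    out = []
    for dst, op, operand in insts:
        if op == "ld.param" and operand in first_load:
            out.append((dst, "mov", first_load[operand]))  # reuse first load
            continue
        if op == "ld.param":
            first_load[operand] = dst
        out.append((dst, op, operand))
    return out
```

A real pass must also verify no intervening write clobbers the cached register, and that both loads are on the same control-flow path.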

nvptx-trunc-opts (sub_22058E0)

Pass name | nvptx-trunc-opts
Entry point | sub_22058E0
Description | "Optimize redundant ANDb16ri instrunctions" [sic]

Eliminates redundant AND operations on b16 (16-bit) registers. When type legalization widens a sub-16-bit value to 16 bits, it inserts an AND with a mask to preserve the original width. If the value is already correctly masked (e.g., from a load that zero-extends), the AND is redundant:

// Before:
ld.u8 %rs1, [%rd1];              ; loads 8-bit, zero-extended to 16
and.b16 %rs2, %rs1, 0xFF;        ; redundant mask — already 8-bit clean

// After:
ld.u8 %rs1, [%rd1];
// (AND deleted, use %rs1 directly)

The binary contains the string with a typo: "instrunctions" instead of "instructions".
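The redundancy test amounts to known-bits tracking: an AND is a no-op when its mask covers every bit that can be set in the operand. A toy sketch of that check (tuple encoding and function name invented for illustration):

```python
def drop_redundant_and(insts):
    """Drop and.b16 masks that keep all significant bits of their operand.

    insts: (dst, "ld.u8", addr) or (dst, "and.b16", src, mask) tuples.
    Returns (surviving instructions, replacement map for dropped results).
    """
    known_bits = {}  # reg -> number of possibly-set low bits
    replace = {}
    out = []
    for inst in insts:
        op = inst[1]
        if op == "ld.u8":
            known_bits[inst[0]] = 8      # zero-extended 8-bit load
            out.append(inst)
        elif op == "and.b16":
            dst, _, src, mask = inst
            bits = known_bits.get(src, 16)
            if mask == (1 << bits) - 1:  # mask preserves the whole value
                replace[dst] = src       # uses read src directly
            else:
                out.append(inst)
        else:
            out.append(inst)
    return out, replace
```

LLVM's real machinery for this is `computeKnownBits`; the sketch hard-codes the one load form the pass description mentions.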

Remove Redundant Moves (sub_2204E60)

Entry point | sub_2204E60
Description | "Remove redundant moves"

Eliminates move instructions where source and destination are the same register, or where the move is immediately dead. This complements the stock LLVM MachineCopyPropagation pass with PTX-specific move patterns.

Knobs

Knob | Type | Default | Scope | Effect
enable-nvvm-peephole | bool | true | IR + Machine | Master switch for both the IR-level nvvm-peephole-optimizer and the machine-level nvptx-peephole. Registered at ctor_358_0 (0x50E8D0).
disable-peephole | bool | false | Machine only | Disables the stock LLVM PeepholeOptimizerPass (slot 492). Does not affect the NVIDIA-specific passes. Registered at ctor_314 (0x502360).
aggressive-ext-opt | bool | (varies) | Machine only | Controls aggressive extension optimization in the stock LLVM peephole.
disable-adv-copy-opt | bool | false | Machine only | Disables advanced copy optimization in the stock LLVM peephole.
rewrite-phi-limit | int | (varies) | Machine only | Limits PHI rewriting in the stock LLVM peephole.
recurrence-chain-limit | int | (varies) | Machine only | Limits recurrence chain analysis in the stock LLVM peephole.

The enable-nvvm-peephole description string recovered from the binary is: "Enable NVVM Peephole Optimizer". Its default-on status indicates the pass is mature and does not require opt-in behavior.

Optimization Level Behavior

The IR-level peephole runs in all optimization paths except -O0:

Level | Path | NVVMPeephole invocations
Ofcmin | "ptx" path | 1 (early)
Ofcmid | "mid" path | 1 (after SROA/CVP)
O2/O3 | "default" path | 1 (after NVVMReflect + IntrinsicLowering)
O3 (late) | Tier finalization | 1 (after BranchFolding/CFGSimplify)

At -O0, the peephole is likely skipped along with most optimization passes. The factory function sub_1CEF8F0 appears only in code paths that are active at O1 and above.

End-to-End Peephole Pipeline

The complete peephole optimization flow through cicc, from IR to PTX:

Source CUDA
    |
    v
[LLVM IR after clang/EDG frontend]
    |
    v
InstCombine (0x1700000+)           General algebraic simplification
    |                               ~600 functions, target-independent
    v
NVVMReflect (sub_1857160)          Resolve __nvvm_reflect() calls
    |
    v
nvvm-reflect-pp                    Simplify constant conditionals from reflect
    |
    v
NVVMIntrinsicLowering (sub_1CB4E40) Expand NVVM intrinsics
    |
    v
nvvm-peephole-optimizer             NVVM-specific IR patterns:
  (sub_1CEF8F0 factory)              - addrspacecast chain folding
    |                                - intrinsic sequence simplification
    v                                - type conversion roundtrip elimination
NVVMAnnotationsProcessor             - post-reflect dead code cleanup
  (sub_215D9D0 companion)
    |
    v
[Further IR optimization: GVN, LICM, Sinking2, etc.]
    |
    v
[Instruction Selection: DAGToDAG (sub_2200150, 78KB)]
    |     Hash-table pattern matching: hash = (37*idx) & (tableSize-1)
    v
PeepholeOptimizerPass (slot 492)    Stock LLVM machine peephole:
    |                                - redundant copy folding
    v                                - compare-and-branch simplification
NVPTXPeephole (sub_21DB090)         PTX-specific machine peephole:
    |                                - cvta pair elimination
    v                                - predicate folding
param-opt (sub_2203290)              - ld.param consolidation
    |
    v
nvptx-trunc-opts (sub_22058E0)      - ANDb16ri elimination
    |
    v
Remove Redundant Moves (sub_2204E60) - identity move deletion
    |
    v
[Register Allocation]
    |
    v
ProxyRegErasure (sub_21DA810)       Late cvta.to.local removal
    |
    v
[PTX Emission]

Function Map

Function | Address | Size | Role
-- | sub_1CEF8F0 | small | NVVMPeephole factory (legacy PM)
-- | sub_215D9D0 | -- | NVVMAnnotationsProcessor (companion, always paired)
-- | sub_2314DA0 | small | NVVMPeepholeOptimizerPass serializer (New PM)
-- | sub_2342890 | -- | New PM registration function (slot 382)
-- | sub_233C410 | -- | Pipeline text parser (line 3534)
-- | sub_21DB090 | small | NVPTXPeephole machine pass registration
-- | sub_2166ED0 | 1.6KB | addPreRegAlloc() -- hosts NVPTXPeephole
-- | sub_21DA810 | -- | ProxyRegErasure (cvta.to.local removal)
-- | sub_2203290 | small | param-opt (ld.param optimization)
-- | sub_2204E60 | small | Remove Redundant Moves
-- | sub_22058E0 | small | nvptx-trunc-opts (ANDb16ri elimination)
-- | sub_21BEE70 | 4.1KB | "Bad address space in addrspacecast" validation
-- | sub_20DA7F0 | 30KB | DAG combine / peephole on MachineInstrs
-- | sub_37E1AE0 | 18KB | Late-stage machine optimization (peephole or copy prop)

Differences from Upstream LLVM

Upstream LLVM (as of LLVM 17/18) contains NVPTXPeephole.cpp in llvm/lib/Target/NVPTX/, which implements a small machine-level pass that:

  1. Folds cvta address-space-conversion pseudo-instructions
  2. Removes NVPTX::PROXY_REG pseudo-instructions (now split into a separate NVPTXProxyRegErasure pass in cicc)

CICC v13.0 extends this significantly:

  • The IR-level pass (nvvm-peephole-optimizer) has no upstream counterpart. It is entirely NVIDIA-proprietary, filling a gap between target-independent InstCombine and target-specific machine peephole.
  • Three satellite machine passes (param-opt, nvptx-trunc-opts, Remove Redundant Moves) have no upstream equivalents.
  • The machine-level nvptx-peephole is larger than upstream, likely incorporating additional pattern rules for newer PTX features (tensor core operations, cluster operations, etc.).
  • ProxyRegErasure is separated from NVPTXPeephole into its own pass (sub_21DA810) and runs late post-RA rather than inline with the peephole.

Evidence Summary

The pass's existence and classification are confirmed through multiple independent sources:

Source | Address / Location | Evidence
Pipeline parser | sub_233C410 line 3534 | Registers "nvvm-peephole-optimizer" as function-level NVIDIA custom pass
New PM registration | sub_2342890 slot 382 | Maps string to llvm::NVVMPeepholeOptimizerPass
Serializer | sub_2314DA0 | Produces "nvvm-peephole-optimizer" text for pipeline printing
Legacy PM factory | sub_1CEF8F0 | Called 2x from sub_12E54A0 (pipeline assembler)
Companion pairing | sub_215D9D0 | Always immediately follows sub_1CEF8F0 in all paths
Knob sweep | 0x50E8D0 (ctor_358_0) | enable-nvvm-peephole = "Enable NVVM Peephole Optimizer", default true
Knob duplicate | 0x560000 sweep line 292 | Confirmed with identical description
NVVMPassOptions | p2a.3-03-passoptions.txt | Listed as nvvm-peephole-optimizer in option table
Machine pass | sub_21DB090 | "NVPTX Peephole" / "nvptx-peephole" registration string
Machine pipeline | sub_2166ED0 | "After codegen peephole optimization pass" checkpoint string

Confidence note. The pass registration, knobs, pipeline position, and factory function are confirmed at HIGH confidence from binary evidence. The specific transformation patterns described above are at MEDIUM confidence -- inferred from pipeline position (runs after NVVMReflect + NVVMIntrinsicLowering), NVVM IR semantics, and address space validation code, but the actual NVVMPeepholeOptimizerPass::run() body has not been individually decompiled. The factory sub_1CEF8F0 creates the pass object; the run method is dispatched through the object's vtable.

Cross-References

Sinking2 (NVIDIA Code Sinking)

Sinking2 (pipeline name sinking2) is an NVIDIA-proprietary instruction sinking pass that moves instructions closer to their uses, with specific awareness of GPU texture and surface memory operations. It is entirely distinct from LLVM's stock sink pass: while both perform code sinking, Sinking2 is tailored for NVIDIA's memory hierarchy and iterates to a fixed point rather than making a single pass. The primary motivation is reducing register pressure by deferring computation of values until just before they are consumed, which is especially impactful on GPUs where register files are shared across hundreds of concurrent threads.

The pass is particularly focused on sinking instructions into texture load blocks. Texture operations on NVIDIA GPUs have high latency but are served by a dedicated cache; by sinking the address computation and other operands into the block that performs the texture fetch, the compiler reduces the live range of those values and frees registers for other warps. This directly improves occupancy -- the number of warps that can execute simultaneously on an SM.

Pipeline Position

Field | Value
Pass name (pipeline) | sinking2
Pass ID | sink2
Display name | Code sinking
Pass type | FunctionPass (NVIDIA-custom)
Class | llvm::Sinking2Pass
Registration | New PM #390, line 2282 in sub_2342890
Runtime positions | Tier 1/2/3 #81 (NVVMSinking2 via sub_1CC60B0, gated by opts[3328] && !opts[2440]); see Pipeline
Legacy PM entry | sub_1CCA270
New PM entry | sub_2D1C160 (19KB)
Legacy PM registration | sub_1CC7010
New PM registration | sub_2D1B410
Knob constructor | ctor_275 at 0x4F7750
Vtable (Legacy) | off_49F8BC0
Vtable (New PM) | off_4A260F0

Relationship to All Sink Passes in cicc

CICC v13.0 contains five distinct sinking mechanisms. Understanding which is which is essential when reading the pipeline or debugging register pressure issues:

Pass ID / Factory | Class | Origin | Key difference
sink / sub_1A634D0 | LLVM SinkingPass | Upstream LLVM | Stock single-pass sinking, uses MemorySSA for alias safety
sink2 / sub_1CCA270 | llvm::Sinking2Pass | NVIDIA | Texture-aware, iterative fixpoint, custom AA layer
sink<rp-aware> | Parameterized variant | LLVM + NVIDIA | Register-pressure-aware sinking (stock sink with rp-aware-sink=true)
NVVMSinking2 / sub_1CC60B0 | NVIDIA late sinking | NVIDIA | Late-pipeline SM-specific sinking, gated by opts[3328]
MachineSink | LLVM MachineSinking | LLVM | MIR-level sinking, opt-in for NVPTX via nvptx-enable-machine-sink

The stock LLVM sink (sub_1869C50, called with params (1,0,1)) uses MemorySSA for alias queries and makes a single pass. Sinking2 uses its own alias analysis layer routed through sub_13575E0 and iterates to convergence. NVVMSinking2 (sub_1CC60B0) is a separate NVIDIA pass that runs late in the pipeline after barrier lowering and warp-level optimizations, gated by the SM-specific pass group flag opts[3328].

IR Before/After Example

The pass sinks address computation closer to texture/surface use sites, reducing register pressure by shortening live ranges.

Before (address computation in preheader, live across loop body):

preheader:
  %base = getelementptr float, ptr addrspace(1) %tex_ptr, i64 %offset
  %addr = getelementptr float, ptr addrspace(1) %base, i64 %stride
  br label %loop

loop:
  %i = phi i64 [ 0, %preheader ], [ %i.next, %loop ]
  ; ... many instructions using registers, %base and %addr are live ...
  %tex_addr = getelementptr float, ptr addrspace(1) %addr, i64 %i
  %val = call float @llvm.nvvm.tex.unified.1d.v4f32.f32(i64 %tex_addr)
  %i.next = add i64 %i, 1
  %cmp = icmp slt i64 %i.next, %n
  br i1 %cmp, label %loop, label %exit

After (address computation sunk into loop, next to texture use):

preheader:
  br label %loop

loop:
  %i = phi i64 [ 0, %preheader ], [ %i.next, %loop ]
  ; ... many instructions, but %base and %addr are no longer live here ...
  %base = getelementptr float, ptr addrspace(1) %tex_ptr, i64 %offset
  %addr = getelementptr float, ptr addrspace(1) %base, i64 %stride
  %tex_addr = getelementptr float, ptr addrspace(1) %addr, i64 %i
  %val = call float @llvm.nvvm.tex.unified.1d.v4f32.f32(i64 %tex_addr)
  %i.next = add i64 %i, 1
  %cmp = icmp slt i64 %i.next, %n
  br i1 %cmp, label %loop, label %exit

The GEP instructions now execute inside the loop (higher execution count) but free registers in the rest of the loop body. This is a deliberate tradeoff: extra ALU work for reduced register pressure, which typically improves occupancy and net throughput.
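The occupancy side of this tradeoff is simple arithmetic. A back-of-envelope sketch, assuming a 64K-entry 32-bit register file per SM, 32 threads per warp, and a 48-warp cap (all of these vary by SM generation, so treat the numbers as illustrative):

```python
def warps_per_sm(regs_per_thread, regfile=65536, warp_size=32, max_warps=48):
    """Register-limited resident warps per SM (other limits ignored)."""
    return min(max_warps, regfile // (regs_per_thread * warp_size))
```

Cutting a kernel from 64 to 32 live registers per thread lifts the register-limited warp count from 32 to the 48-warp cap, which is exactly the effect shorter live ranges from sinking aim for.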

Algorithm

Entry Point

The legacy PM entry sub_1CCA270 performs these steps:

  1. Fetches DominatorTree analysis (via DominatorTreeWrapperPass at unk_4F9E06C)
  2. Fetches LoopInfo analysis (via LoopInfoWrapperPass at unk_4F96DB4)
  3. Reads sink-into-texture knob (qword_4FBF2C0[20]) -- must be non-zero (enabled)
  4. Reads sink-limit knob (qword_4FBF1E0[20]) -- must be greater than zero
  5. Calls the main worklist driver sub_1CC9110

The New PM entry sub_2D1C160 (19KB) performs the same logic using AnalysisManager to fetch analyses, then dispatches to sub_2D1CFB0 (13KB).

The pass does not require ScalarEvolution (SCEV), MemorySSA, or PostDominatorTree, keeping it simpler and cheaper than loop-oriented or MemorySSA-dependent passes.

Main Worklist Driver (sub_1CC9110, 22KB)

The core algorithm is a fixpoint iteration over the dominator tree:

function SinkingWorklist(F, DT, LI, textureLevel, sinkLimit):
    changed = false
    do:
        roundChanged = false
        sinkCount = 0

        // Walk dominator tree in DFS preorder
        for BB in DT.dfs_preorder():
            // Skip loop headers to avoid creating loop-carried deps
            if LI.isLoopHeader(BB):
                continue

            // Process instructions bottom-up within each block
            for I in reverse(BB.instructions()):
                if sinkCount >= sinkLimit:
                    break                  // complexity limiter

                if I.mayHaveSideEffects() or I.isTerminator():
                    continue               // unsinkable
                if I.use_empty():
                    continue               // dead, leave for DCE

                // Level 3: consider instructions used only outside BB
                if textureLevel < 3 and allUsesInSameBlock(I, BB):
                    continue

                targetBB = findBestSinkTarget(I, DT, LI)  // sub_1CC7510
                if targetBB == BB:
                    continue               // already in best position

                // Profitability: prefer texture/surface blocks
                if textureLevel >= 1:
                    if not blockContainsTextureOps(targetBB):
                        if not dominatesTextureBlock(targetBB, DT):
                            continue       // not profitable

                // Safety: alias analysis check
                if not isSafeToSink(I, BB, targetBB):     // sub_1CC8920
                    continue
                // Safety: memory dependency check
                if not checkMemDep(I, BB, targetBB):       // sub_1CC8CA0
                    continue

                I.moveBefore(targetBB.firstNonPHI())
                roundChanged = true
                sinkCount++

        changed |= roundChanged
    while roundChanged            // iterate until no more changes

    return changed

Key design points:

  • DFS preorder ensures parent blocks are processed before children. Instructions sunk from a parent into a child on one iteration may expose further sinking opportunities for grandchild blocks on the next iteration -- hence the fixpoint loop.
  • Bottom-up within each block processes the last instruction first. This is important because sinking an instruction may make an earlier instruction's operands dead, which DCE will clean up later.
  • Loop headers are skipped to prevent creating loop-carried dependencies (a value defined in the header, consumed in the latch, sunk into the latch would create a cycle).

Instruction Processing (sub_1CC7510, 16KB)

For each candidate instruction, this function:

  1. Walks the use chain to find all consumers (via sub_15F4D60, multi-use check)
  2. For each user, determines the containing basic block
  3. Computes the lowest common dominator (LCD) of all user blocks using the dominator tree
  4. If LCD == current block, no benefit from sinking -- the instruction is already as close to its uses as possible while dominating all of them
  5. Builds a sink mapping: instruction to target block
  6. Checks memory safety via alias analysis (sub_13575E0)
  7. Validates that sinking does not violate memory ordering constraints
  8. Respects PHI nodes (LLVM opcode PHI) as sink boundaries -- an instruction cannot be sunk past a PHI insertion point

The target block selection algorithm effectively finds the nearest common dominator of all uses that is strictly dominated by the current block. If the instruction has a single use, the target is trivially the use's block (or its immediate dominator if the use is a PHI operand).
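The nearest-common-dominator computation described above can be sketched by walking immediate-dominator chains. This is an illustrative implementation over a plain idom map (entry maps to itself), not recovered code:

```python
def dom_chain(idom, b):
    """Blocks dominating b, from b up to the entry block."""
    chain = [b]
    while idom[b] != b:
        b = idom[b]
        chain.append(b)
    return chain

def nearest_common_dominator(idom, blocks):
    """Deepest block that dominates every block in `blocks`."""
    common = set(dom_chain(idom, blocks[0]))
    for b in blocks[1:]:
        common &= set(dom_chain(idom, b))
    # the deepest surviving block is the nearest common dominator
    return max(common, key=lambda b: len(dom_chain(idom, b)))
```

Production dominator trees answer this with depth fields in O(depth) per query; the set intersection here trades speed for clarity.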

Dominance Ordering (sub_1CC8170, 13KB)

Implements a hash-based ordering of basic blocks for comparing sink profitability. Uses DFS numbering from the dominator tree to determine which block comes "earlier" in the program. This ordering ensures:

  • Instructions are only sunk toward uses, never away from them
  • When multiple sink targets exist (multi-use instruction), the lowest common dominator is chosen
  • The ordering is consistent across iterations of the fixpoint loop
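A standard way to get such an ordering is interval numbering of the dominator tree: assign entry/exit timestamps in a DFS, after which "A dominates B" is two integer comparisons. A sketch under the assumption that sub_1CC8170 uses this classic scheme (the DFS-numbering role is recovered; the exact encoding is not):

```python
def number_dom_tree(children, root):
    """DFS entry/exit timestamps over a dominator tree (iterative)."""
    tin, tout, clock = {}, {}, 0
    stack = [(root, False)]
    while stack:
        node, done = stack.pop()
        if done:
            tout[node] = clock
            clock += 1
            continue
        tin[node] = clock
        clock += 1
        stack.append((node, True))          # close node after its subtree
        for child in children.get(node, []):
            stack.append((child, False))
    return tin, tout

def dominates(tin, tout, a, b):
    """a dominates b iff b's interval nests inside a's."""
    return tin[a] <= tin[b] and tout[b] <= tout[a]
```

The timestamps are stable for a fixed tree, which gives the cross-iteration consistency the fixpoint loop needs.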

Alias Checking (sub_1CC8920, 4KB)

Validates that moving instruction I from block From to block To does not reorder I past any conflicting memory access:

function isSafeToSink(I, From, To):
    if not I.mayReadOrWriteMemory():
        return true                    // pure computation, always safe

    // Walk all instructions on the domtree path From -> To
    for BB in pathBlocks(From, To):
        for J in BB.instructions():
            if J == I: continue
            if AA.getModRefInfo(I, J) != NoModRef:
                return false           // conflict: I aliases with J

    return true

This is not MemorySSA-based (unlike stock LLVM sink). The pass invokes the traditional AliasAnalysis query interface through sub_13575E0. This is less precise than MemorySSA but avoids the cost of building and maintaining the MemorySSA graph, which matters because Sinking2 iterates to fixpoint and would need to update MemorySSA on every move.

Memory Dependency Checking (sub_1CC8CA0, 6KB)

Additional memory safety layer beyond alias checking:

  • Store-load forwarding: if I is a load and there is a store between From and To that may alias the loaded location, sinking would change the value loaded
  • Store ordering: if I is a store, moving it past another store to a potentially-aliasing location changes program semantics
  • Volatile/atomic barrier: volatile loads/stores and atomic operations are never sunk (treated as having side effects)
  • Synchronization intrinsics: barrier calls (__syncthreads, bar.sync) are treated as memory fences; no instruction may be sunk past them

Texture/Surface Awareness

The pass identifies "texture blocks" -- basic blocks containing calls to texture/surface intrinsics (the tex.*, suld.*, sust.* family). Address computations that feed these intrinsic calls are the primary sink candidates, because texture address computation chains (GEP + index arithmetic) produce intermediate values that are consumed only at the texture fetch site. Without sinking, these intermediates occupy registers across potentially many instructions.

The sink-into-texture knob controls aggressiveness:

Level | Behavior
0 | Disabled -- no texture-aware sinking
1 | Cross-block only: move instructions across block boundaries into texture blocks
2 | Cross-block + intra-block: also reorder instructions within a block to position them immediately before their texture use
3 (default) | All of the above + outside-only: consider instructions whose only uses are in blocks other than where the instruction is defined

Level 3 catches the important case where a GEP in a preheader feeds a texture load inside a loop -- the GEP has no uses in its own block, only "outside" uses.

Address space checks for NVPTX (see reference/address-spaces):

  • AS 1 (global): may alias with texture reads in some configurations
  • AS 3 (shared): texture operations never access shared memory, so shared-space stores are not barriers to texture sinking
  • AS 4 (const): texture/surface descriptors typically live in constant memory
  • AS 5 (local): thread-local, no cross-thread interference

Loop Considerations

Sinking2 is loop-aware but conservative:

  1. Never sinks OUT of a loop: moving an instruction from a loop body to an exit block would change its execution count. The pass skips this entirely.
  2. May sink INTO loop bodies: when an instruction in a loop preheader feeds only uses inside the loop (particularly texture fetches), sinking it into the loop is profitable despite increasing execution count -- the register pressure reduction from shorter live ranges outweighs the extra computation.
  3. Skips loop headers: prevents creating loop-carried dependencies.
  4. Runs after LoopSimplify: the early instance (sub_18B1DE0) runs after LoopSimplify/LCSSA have canonicalized loop structure, so preheaders, latches, and exit blocks are well-formed.

This creates a deliberate tension with LICM:

  • LICM hoists loop-invariant code into the preheader (reducing execution count)
  • Sinking2 sinks non-invariant address computation out of the preheader and into the loop body (reducing register pressure)

The two passes run at different pipeline positions and balance each other. LICM runs first; Sinking2 runs after GVN and CGSCC inlining, when texture patterns are fully exposed.

Barrier Awareness

Sinking2 itself does not contain explicit __syncthreads / bar.sync detection logic. Instead, it relies on the LLVM side-effect model:

  • Barrier intrinsics are marked as having side effects, so they are never sunk
  • Barrier intrinsics are treated as memory fences by alias analysis, so no memory instruction may be sunk past them

The late NVVMSinking2 (sub_1CC60B0) runs after barrier lowering (sub_1CB73C0) and warp-level optimization passes. By that point, barriers have been lowered to their final form. The pipeline ordering is:

NVVMBranchDist -> NVVMWarpShuffle -> NVVMReduction -> NVVMSinking2

This sequence ensures NVVMSinking2 can sink past warp-level operations that are no longer opaque barriers, while still respecting the lowered barrier representation.

Multi-Run Pipeline Pattern

Sinking2 appears at three to four pipeline positions. Each run has different context and different opportunities:

Position | Factory | Mode | Context
Early (pass ~39) | sub_18B1DE0() | Standard | After stock Sink, GVN, and CGSCC inlining. Texture patterns are exposed.
Post-peephole | sub_18B3080(1) | Fast (flag=1) | After NVVMPeephole. Peephole may create new sinking opportunities. Reduced iteration budget.
Late SM-specific | sub_1CC60B0() | SM-gated | After barrier lowering and warp shuffle. Gated by opts[3328] && !opts[2440].

For fast-compile mode (Ofcmax), only sub_18B3080(1) runs -- the single Sinking2 in fast mode with reduced iteration budget. No stock Sink, no NVVMSinking2.

The rationale for multiple runs:

  • Run 1 (stock Sink) handles straightforward cases using MemorySSA's precise alias information
  • Run 2 (Sinking2 early) performs texture-aware sinking now that GVN/CGSCC have simplified the IR
  • Run 3 (Sinking2 fast) cleans up opportunities created by peephole optimization
  • Run 4 (NVVMSinking2) performs SM-specific late sinking after barrier and warp-level transforms

NVVMPassOptions Gating

Offset | Type | Effect
opts[1040] | bool | Disable stock Sink/MemSSA
opts[2440] | bool | Disable NVVMSinking2 (sub_1CC60B0)
opts[3328] | bool | Enable SM-specific warp/reduction/sinking pass group (gates NVVMSinking2)

Cost Model (New PM)

The New PM object (176 bytes) contains floating-point thresholds at offsets +88 and +144, both initialized to 1065353216 (IEEE 754 1.0f). These thresholds suggest the New PM implementation has a more sophisticated cost model than the Legacy PM version:

  • Profitability threshold (+88): minimum benefit score for a sink to be accepted. A value of 1.0 means the benefit must at least equal the cost.
  • Cost threshold (+144): maximum acceptable cost for the sinking motion itself. A value of 1.0 means the movement cost must not exceed the baseline.

The Legacy PM version uses a simpler boolean profitability model (is the target a texture block? yes/no).
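The identification of 1065353216 as 1.0f follows from reinterpreting the integer's bytes as an IEEE 754 single (0x3F800000 is sign 0, exponent 127, mantissa 0). A quick check of that decoding:

```python
import struct

def f32_from_bits(bits):
    """Reinterpret a 32-bit integer as an IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

This is the same trick used when scanning decompiled constant pools for float thresholds: any 0x3F800000 constant loaded into an XMM register is a strong hint of a 1.0f comparison.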

Configuration Knobs

Sinking2-Specific (ctor_275 at 0x4F7750)

| Knob | Type | Default | Storage | Description |
|---|---|---|---|---|
| sink-into-texture | int | 3 | qword_4FBF2C0 | Texture sinking aggressiveness (0=off, 1=cross-block, 2=+intra, 3=+outside-only) |
| sink-limit | int | 20 | qword_4FBF1E0 | Max instructions to sink per invocation (complexity limiter) |
| dump-sink2 | bool | false | qword_4FBF100 | Dump debug information during sinking |

Related Sinking Knobs (other owners)

| Knob | Type | Default | Owner | Description |
|---|---|---|---|---|
| sink-check-sched | bool | true | stock Sink | Check scheduling effects of sinking |
| sink-single-only | bool | true | stock Sink | Only sink single-use instructions |
| rp-aware-sink | bool | false | stock Sink | Consider register pressure (controls sink<rp-aware> variant) |
| max-uses-for-sinking | int | (default) | stock Sink | Don't sink insts with too many uses |
| sink-ld-param | bool | (default) | NVPTX backend | Sink one-use ld.param to use point |
| hoist-load-param | bool | (default) | NVPTX backend | Hoist all ld.param to entry block (counterpart to sink-ld-param) |
| enable-andcmp-sinking | bool | (default) | CodeGenPrepare | Sink and/cmp into branches |
| aggressive-no-sink | bool | (default) | (unknown) | Sink all generated instructions |
| instcombine-code-sinking | bool | (default) | InstCombine | Enable code sinking within instcombine |
| nvptx-enable-machine-sink | bool | (default) | NVPTX backend | Enable MIR-level MachineSink |
| SinkRematEnable | bool | (default) | ptxas | Enable sink+rematerialization in ptxas |

Analysis Dependencies

| Legacy PM | New PM | Purpose |
|---|---|---|
| DominatorTreeWrapperPass (unk_4F9E06C) | DominatorTreeAnalysis (sub_CF6DB0) | Dominator tree for sink legality and ordering |
| LoopInfoWrapperPass (unk_4F96DB4) | LoopAnalysis (sub_B1A2E0) | Avoid sinking out of loops; skip loop headers |

Does not require: SCEV, MemorySSA, PostDominatorTree, BranchProbabilityInfo.

This is a key difference from stock LLVM SinkingPass, which requires MemorySSAAnalysis. Sinking2 uses its own alias analysis queries through helpers sub_1CC8920 and sub_1CC8CA0, routed through the traditional AA interface at sub_13575E0. This avoids the overhead of building/maintaining MemorySSA across fixpoint iterations.

Pass Object Layout

Legacy PM (160 bytes):

| Offset | Type | Content |
|---|---|---|
| +0 | ptr | Vtable pointer (off_49F8BC0) |
| +8 | ptr | Pass link (next pass in chain) |
| +16 | ptr | Pass ID pointer (&unk_4FBF0F4) |
| +24 | int32 | Mode (default=3, from sink-into-texture) |
| +28 | int32 | Sink limit (default=20, from sink-limit) |
| +32--48 | ptr[3] | Worklist data (head, tail, size) |
| +56 | ptr | DominatorTree* (set during runOnFunction) |
| +64 | ptr | List head 1 (self-referential sentinel) |
| +72--80 | ptr[2] | List next/prev 1 |
| +96 | int64 | Counter (sink count for current iteration) |
| +104 | ptr | LoopInfo* (set during runOnFunction) |
| +112 | ptr | List head 2 (self-referential sentinel) |
| +120--128 | ptr[2] | List next/prev 2 |
| +144 | int64 | Data field |
| +152 | byte | Changed flag (for fixpoint termination) |

New PM (176 bytes): two embedded worklists and float thresholds at offsets +88 and +144 (value 1065353216 = 1.0f IEEE 754).

Differences from Upstream LLVM

| Aspect | Upstream LLVM sink | NVIDIA sinking2 |
|---|---|---|
| Alias analysis backend | MemorySSA | Custom AA layer (sub_13575E0) |
| Iteration strategy | Single pass | Fixpoint iteration |
| Texture awareness | None | 3-level configurable |
| Address space awareness | Generic | NVPTX-specific (AS 1,3,4,5) |
| Complexity limiter | None | sink-limit knob (default=20) |
| Intra-block reordering | No | Level >= 2 |
| Outside-only pattern | No | Level == 3 |
| Debug dump | Standard LLVM debug | dump-sink2 knob |
| Cost model | Boolean (profitable or not) | Float thresholds in New PM |
| Pipeline occurrences | 1 | 3--4 (multi-run strategy) |
| Fast-compile variant | Same pass | Dedicated fast=1 mode |

Diagnostic Strings

| String | Context |
|---|---|
| "llvm::Sinking2Pass]" | RTTI name at sub_2315E20 |
| "sink2" | Pipeline parser ID |
| "Code sinking" | Display name (shared with stock LLVM sink) |
| "sinking2" | New PM pipeline string match |

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_1CC7010 | -- | Legacy PM pass registration |
| -- | sub_1CC7100 | -- | Legacy PM factory |
| -- | sub_1CC71E0 | -- | Legacy PM alternate factory |
| -- | sub_1CC7510 | 16KB | processInstruction: sink candidate evaluation, use-chain walk, LCD computation |
| -- | sub_1CC8170 | 13KB | Dominance ordering: DFS numbering for block comparison |
| -- | sub_1CC8920 | 4KB | Alias checking helper: validates no conflicting memory accesses on path |
| -- | sub_1CC8CA0 | 6KB | Memory dependency helper: store-load forwarding, store ordering, volatile |
| -- | sub_1CC9110 | 22KB | Main worklist driver: fixpoint iteration over dominator tree |
| -- | sub_1CCA270 | -- | Legacy PM runOnFunction entry |
| -- | sub_2D1B410 | -- | New PM pass registration |
| -- | sub_2D1BC50 | -- | New PM factory |
| -- | sub_2D1C160 | 19KB | New PM run() entry |
| -- | sub_2D1CFB0 | 13KB | New PM core logic |
| -- | sub_2D1D770 | 7KB | New PM helper |
| -- | sub_2D1DCF0 | 7KB | New PM helper |
| -- | sub_2315E20 | -- | RTTI name printer |
| -- | 0x4F7750 | -- | Knob constructor (ctor_275) |

Related pipeline factories:

| Address | Role |
|---|---|
| sub_18B1DE0 | Sinking2 early-pipeline factory |
| sub_18B3080 | Sinking2 fast-mode factory (accepts fast flag parameter) |
| sub_1CC60B0 | NVVMSinking2 late-pipeline factory |
| sub_1A634D0 | Stock LLVM Sink legacy PM registration |
| sub_29776B0 | Stock LLVM Sink New PM registration |
| sub_1B51110 | Stock Sink core (51KB, creates .sink.split / .sink blocks) |
| sub_1869C50 | Stock Sink pipeline factory (called with params 1,0,1) |

Total code size: ~80KB (Legacy PM) + ~65KB (New PM) = ~145KB

GPU-Specific Motivation

Register pressure directly determines occupancy -- each additional live register per thread reduces the number of warps available for latency hiding, with discrete cliff boundaries where a single register can drop an entire warp group.

Sinking instructions closer to their uses shortens live ranges and reduces the peak number of simultaneously live registers. This is especially valuable for texture load sequences, which typically involve address computation (GEP chains, index arithmetic) that produces values consumed only at the texture fetch site. Without sinking, these intermediate values occupy registers across potentially many instructions, bloating register pressure unnecessarily.

The three-level sink-into-texture design reflects a graduated approach to this optimization: level 1 handles the common case (cross-block sinking), level 2 adds intra-block reordering for tighter packing, and level 3 (the default) handles the edge case where an instruction's only uses are in blocks other than where it is defined, enabling more aggressive motion.

The multi-run pattern (early Sinking2, post-peephole fast Sinking2, late NVVMSinking2) ensures that sinking opportunities created by other optimization passes are captured throughout the pipeline, rather than relying on a single sinking point that may miss opportunities not yet exposed.

Cross-References

  • Dead Synchronization Elimination -- runs earlier, removes barriers that Sinking2 would otherwise treat as memory fences
  • LICM -- counterpart: hoists loop-invariant code into preheaders; Sinking2 sinks address computation out of preheaders
  • NVVMPeephole -- runs before late Sinking2, may create new sinking opportunities
  • Rematerialization -- runs after all sinking; rematerialization + sinking together minimize register pressure (ptxas SinkRematEnable knob)
  • MemorySpaceOpt -- changes address spaces which affects sinking profitability
  • NVVMPassOptions -- opts[1040] disables stock Sink; opts[2440] disables NVVMSinking2
  • Register Allocation -- ultimate consumer of the register pressure reduction that sinking provides
  • Optimization Levels -- Ofcmax runs only fast-mode Sinking2; O2/O3 run full multi-run pattern

Loop Index Split

loop-index-split is a loop transformation pass that splits or peels loops when a condition inside the loop body depends on the loop induction variable. The pass was originally part of upstream LLVM 2.x (circa 2008--2009) but was removed around LLVM 3.0 due to correctness concerns and limited applicability. NVIDIA revived and heavily modified it for CUDA workloads, where loops with index-dependent conditionals are extremely common -- boundary handling in stencil computations, tile edge processing, and index-based predication are pervasive GPU kernel patterns. The NVIDIA version is substantially more sophisticated than the original, implementing three distinct transformation modes with full SCEV-based analysis.

By eliminating index-dependent branches from loop bodies, the pass reduces warp divergence on NVIDIA GPUs. When threads in a warp take different paths through a branch, the GPU must serialize both paths (predicated execution or divergent branch), wasting throughput. Splitting the loop so that each resulting loop has a uniform body eliminates this divergence entirely within the split regions, restoring full SIMT efficiency.

Pipeline Position

| Field | Value |
|---|---|
| Pass name (pipeline) | loop-index-split |
| Display name | Index Split Loops |
| Pass type | LoopPass (NVIDIA-custom, revived from LLVM 2.x) |
| Class | llvm::LoopIndexSplitPass |
| Legacy PM registration | sub_1C76080 |
| New PM registration | sub_2CBEC60 |
| Pass ID | dword_4FBD4A8 / unk_4FBD4AC |
| New PM vtable | off_4A25510 |

Transformation Modes

The pass implements three transformation strategies, attempted in priority order. When the first applicable transformation is found, it is applied and the pass moves on.

Mode A: All-But-One Iteration Peel (processAllButOneIterationLoop)

When: The loop body contains a condition that is true for all iterations except exactly one (typically i == K for a constant K).

What: The pass peels the single exceptional iteration out of the loop and removes the condition from the remaining iterations.

Before:

for (i = 0; i < N; i++) {
    if (i == K) special();
    else normal();
}

After:

for (i = 0; i < K; i++) normal();
special();
for (i = K+1; i < N; i++) normal();

This eliminates the branch from both resulting loops entirely. On a GPU, this means warps executing the pre-K or post-K loops never diverge on this condition.

Implementation: sub_2CC3FF0 (13KB, New PM) / part of sub_1C77080 (46KB, Legacy PM).

Mode B: Only-One-Iteration Collapse (processOnlyOneIterationLoop)

When: The condition is true for exactly one iteration, and the loop body does nothing useful on other iterations.

What: The pass replaces the entire loop with a guarded single execution of the body.

Before:

for (i = 0; i < N; i++) {
    if (i == K) doWork();
}

After:

if (K >= 0 && K < N) doWork();

This transforms an O(N) loop into O(1) code -- a dramatic optimization when the original loop's only purpose was to find and execute a single iteration.

Implementation: sub_2CC4A70 (19KB, New PM) / part of sub_1C77080 (46KB, Legacy PM).

Mode C: Range Split (processSplitRangeLoop)

When: The condition splits the iteration space into two contiguous ranges (e.g., i < M vs i >= M).

What: The pass splits the loop at the boundary point so each resulting loop has a simpler, branch-free body.

Before:

for (i = 0; i < N; i++) {
    if (i < M) a(); else b();
}

After:

for (i = 0; i < min(M, N); i++) a();
for (i = M; i < N; i++) b();

This is the most common transformation for GPU boundary handling code, where the first/last few iterations of a tile perform padding or clamping.

Implementation: sub_2CC5900 (68KB, New PM) / sub_1C7B2C0 (84KB, Legacy PM). The loop cloning and rewiring logic is in sub_2CC1B10 (42KB), with split point computation in sub_2CC0040 and sub_2CC0CC0 (7KB each).

Algorithm Detail

The main driver (sub_2CC5900, 68KB) proceeds as follows:

  1. Verify loop structure: The loop must have exactly one exit, a preheader, a latch block, and an identifiable header.
  2. Initialize SCEV analysis: Obtains the ScalarEvolution result for the loop to identify the induction variable and compute trip counts.
  3. Find the induction variable and exit condition from the loop's back-edge.
  4. Scan the loop body for ICmp or Select instructions that compare the IV against a loop-invariant value.
  5. Validate the comparison uses constant integer bounds (checked via APInt extraction at multiple points).
  6. Safety checks (lines 760--830 of sub_2CC5900):
    • Iterate all loop BBs, checking each instruction:
      • Opcode 85 (Call): reject if callee may have side effects
      • Opcodes 34--85: checked against bitmask 0x8000000000041 for safe operations
      • Store instructions: checked for non-interference with the split
    • No volatile loads permitted
    • No memory operations that prevent reordering
  7. Determine which transformation applies:
    • Try processAllButOneIterationLoop first
    • Try processOnlyOneIterationLoop second
    • Fall back to processSplitRangeLoop
  8. For range splits: Compute the split point, clone the loop (including all basic blocks, PHI nodes, and branch conditions), adjust iteration bounds, and rewire predecessors/successors.

Comparison Classifiers

Four small functions classify how the ICmp operands relate to the induction variable:

| Function | Purpose |
|---|---|
| sub_2CBED80 | Determine which operand is the IV |
| sub_2CBED00 | Determine which operand is the bound |
| sub_2CBEE00 | Classify comparison direction (ascending/descending) |
| sub_2CBEE80 | Extended classification for range splits |

Legality Validation

| Function | Size | Purpose |
|---|---|---|
| sub_2CBFC80 | -- | Validate split is legal (check exit conditions) |
| sub_2CBF770 | -- | Validate loop structure for splitting |
| sub_2CBF180 | -- | Create new loop preheader for split result |

Diagnostic Strings

Diagnostic strings recovered from p2b.4-5-sinking2-loopindexsplit.txt. The pass emits optimization remarks via the standard LLVM OptimizationRemark system.

| String | Source | Category | Trigger |
|---|---|---|---|
| "LoopIndexSplit: performed processAllButOneIterationLoop" | sub_2CC3FF0 (New PM) / sub_1C77080 (Legacy PM) | Remark | Mode A transformation applied: single exceptional iteration peeled |
| "LoopIndexSplit: performed processOnlyOneIterationLoop" | sub_2CC4A70 (New PM) / sub_1C77080 (Legacy PM) | Remark | Mode B transformation applied: entire loop replaced with guarded single body |
| "LoopIndexSplit: performed processSplitRangeLoop" | sub_2CC5900 (New PM) / sub_1C7B2C0 (Legacy PM) | Remark | Mode C transformation applied: loop split at range boundary |
| "Index Split Loops" | sub_1C76080 / sub_2CBEC60 | Registration | Display name used in both Legacy PM and New PM pass registration |
| "loop-index-split" | Pipeline parser (sub_2377300 line 3768, sub_2368220 line 5081) | Registration | Pipeline ID string (16 characters) |
| "LoopSplitIndex" / "LoopIndexSplit" | Remark infrastructure | Remark tag | Optimization remark tag names (both variants observed in binary) |

Configuration Knobs

No dedicated cl::opt knobs were found for LoopIndexSplit. The pass is enabled or disabled at the pipeline level via the pass name loop-index-split in the pipeline string or by including/excluding it during pipeline assembly. It can also be controlled by the global pass-control and disable-passno mechanisms.

Analysis Dependencies

| Legacy PM | New PM | Purpose |
|---|---|---|
| DominatorTreeWrapperPass (sub_15CD350) | DominatorTreeAnalysis (sub_D4AA90) | Dominance checks for loop cloning |
| LoopInfoWrapperPass (sub_13FBE20) | LoopAnalysis (sub_B1A2E0) | Loop structure and nesting |
| ScalarEvolutionWrapperPass (sub_1AE1AE0) | ScalarEvolutionAnalysis (sub_11CDF60) | IV identification, trip count, range proofs |
| LoopAccessAnalysis (sub_1AF93A0) | LoopAccessAnalysis (sub_F67EE0) | Memory dependence in loops |

SCEV is the critical dependency: it provides induction variable identification, trip count computation, and the mathematical proofs needed to establish that split points are correct and that bounds do not overflow.

Pass Object Layout

Legacy PM: 80-byte pass descriptor.

New PM: 176-byte pass object with embedded worklists and float thresholds. Key fields during execution:

| Offset (QWORDs) | Content |
|---|---|
| 0 | Vtable / loop pointer |
| 1--3 | Sub-loop tracking |
| 4 | Sinkable instruction count |
| 5 | Exit condition block |
| 6 | Split condition (ICmp/FCmp instruction) |
| 7 | Loop bound (lower) |
| 8 | Loop bound (upper) |
| 9 | Split instruction |
| 10 | Instruction counter / worklist |
| 11--13 | DenseSet for tracking visited blocks |
| 14 | Iteration counter |
| 18--24 | Computed values (preheader, header, latch, exitBB, etc.) |
| 25 | SCEV analysis result pointer |
| 26 | New loop blocks array (for split range) |

Function Map

New PM Implementation

| Function | Address | Size | Role |
|---|---|---|---|
| -- | 0x2CBEC60 | -- | New PM pass registration |
| -- | 0x2CBFF20 | -- | New PM factory |
| -- | 0x2CC3FF0 | 13KB | processAllButOneIterationLoop (Mode A) |
| -- | 0x2CC4A70 | 19KB | processOnlyOneIterationLoop (Mode B) |
| -- | 0x2CC5900 | 68KB | Main driver + processSplitRangeLoop (Mode C) |
| -- | 0x2CC1B10 | 42KB | Loop cloning and CFG rewiring |
| -- | 0x2CC0040 | 7KB | Split boundary computation |
| -- | 0x2CC0CC0 | 7KB | Alternate split boundary computation |
| -- | 0x2CC9AA0 | 18KB | Helper |
| -- | 0x2CCB3B0 | 25KB | Helper |
| -- | 0x2CCCE20 | 13KB | Helper |
| -- | 0x2CCDD70 | 15KB | Helper |
| -- | 0x2CCED30 | 8KB | Helper |
| -- | 0x2CCF450 | 57KB | Large helper / alternate path |
| -- | 0x2CBED80 | -- | Comparison classifier (IV operand) |
| -- | 0x2CBED00 | -- | Comparison classifier (bound operand) |
| -- | 0x2CBEE00 | -- | Comparison direction classifier |
| -- | 0x2CBEE80 | -- | Extended comparison classifier |
| -- | 0x2CBFC80 | -- | Split legality validation |
| -- | 0x2CBF770 | -- | Loop structure validation |
| -- | 0x2CBF180 | -- | Create new preheader |

Legacy PM Implementation

| Function | Address | Size | Role |
|---|---|---|---|
| -- | 0x1C76080 | -- | Legacy PM pass registration |
| -- | 0x1C76180 | -- | Legacy PM factory |
| -- | 0x1C76260 | -- | Alternate factory |
| -- | 0x1C76340 | 7KB | Hash table management for visited set |
| -- | 0x1C768C0 | 4KB | Helper |
| -- | 0x1C76B50 | 4KB | Block cloning helper |
| -- | 0x1C76EB0 | 2.5KB | Recursive loop tree walker |
| -- | 0x1C77080 | 46KB | processAllButOneIterationLoop + processOnlyOneIterationLoop |
| -- | 0x1C797A0 | 15KB | Split legality checking |
| -- | 0x1C7A300 | 21KB | Loop body cloning |
| -- | 0x1C7B2C0 | 84KB | processSplitRangeLoop + main driver |

Total code size: ~180KB (Legacy PM) + ~260KB (New PM) = ~440KB. This is one of the largest individual passes in cicc.

GPU-Specific Motivation

Index-dependent conditionals inside loops are ubiquitous in GPU kernels:

  • Boundary handling: Threads at tile edges must check whether their index falls within the valid data range, leading to if (threadIdx.x + blockIdx.x * blockDim.x < N) patterns inside processing loops.
  • Stencil codes: Halo region processing requires different behavior for the first and last few iterations of a tile.
  • Reduction patterns: The final iteration of a reduction loop often has special aggregation logic.
  • Predicated execution: CUDA warp-level programming frequently uses index-based predicates to assign work to specific lanes.

Each of these patterns introduces a branch that causes warp divergence: threads in the same warp take different paths, forcing the GPU to serialize both sides. By splitting the loop at the index boundary, the pass ensures that within each resulting loop, all threads in a warp execute the same path. This eliminates divergence entirely within the split regions, recovering full SIMT throughput.

The pass's large code size (~440KB) reflects the complexity of correct loop cloning on GPU IR, where PHI nodes, memory dependencies, and SCEV invariants must all be preserved across the transformation.

Branch Distribution (Dead Synchronization Elimination)

Despite its name, the branch-dist pass does not distribute or restructure branches. It is a GPU-specific dead synchronization elimination pass that removes __syncthreads() barriers and fence intrinsics when no actual memory hazard exists across the barrier boundary. In CUDA kernels, programmers often insert barriers conservatively to guarantee correctness, but many of these barriers protect code regions that have no conflicting read/write patterns on shared or global memory. Removing them eliminates warp serialization points and reduces the latency cost of unnecessary thread coordination.

The pass works by classifying every instruction in the function as a shared/global memory read, a write, or neither. It then propagates this information through the control flow graph using a standard dataflow fixed-point iteration. For each synchronization instruction, it examines the memory access patterns above and below the barrier; if no read-after-write, write-after-read, or write-after-write hazard exists, the barrier is dead and is deleted. Because removing one barrier may expose others as redundant, the entire analysis restarts after each deletion until no more dead barriers remain.

Pipeline Position

| Field | Value |
|---|---|
| Pass name | branch-dist |
| Pass type | FunctionPass (NVIDIA-custom, not in upstream LLVM) |
| Registration | New PM #377, line 2217 in sub_2342890 |
| Runtime positions | Tier 1/2/3 #78, #82 (NVVMBranchDist via sub_1CB73C0, gated by !opts[2080] && !opts[2120]); see Pipeline |
| Core function | sub_1C47810 (2357 lines) |
| Pass wrapper | sub_1C49D10 (179 lines) |
| Knob constructor | ctor_525_0 at 0x563730 (493 lines) |
| Global enable flag | byte_4FBB6C0 (initialized to 0 in ctor_261) |

The pass runs during the NVIDIA IR optimization pipeline. The global enable flag at byte_4FBB6C0 is set by the pipeline setup when appropriate for the current optimization level.

IR Before/After Example

The pass removes __syncthreads() barriers that protect no actual shared/global memory hazard.

Before (conservative barrier placement):

define void @kernel(ptr addrspace(3) %smem) {
entry:
  %x = add i32 %tid, 1               ; pure register computation
  %y = mul i32 %x, 42                ; pure register computation
  call void @llvm.nvvm.barrier0()     ; __syncthreads() -- no shared/global R/W above
  %z = add i32 %y, %x                ; pure register computation
  ret void
}

After (dead barrier removed):

define void @kernel(ptr addrspace(3) %smem) {
entry:
  %x = add i32 %tid, 1
  %y = mul i32 %x, 42
  ; barrier removed: no shared/global reads or writes above or below
  %z = add i32 %y, %x
  ret void
}

When the dataflow analysis determines that neither side of the barrier accesses shared or global memory, the barrier is dead and removed. The pass restarts after each removal since deleting one barrier may expose another as redundant.

Algorithm

Phase 1: Instruction Classification (sub_1C46330)

The classifier (sub_1C45690, 117 lines) examines each instruction's opcode byte at offset +16 and determines whether it reads or writes shared/global memory:

| Opcode | Hex | Meaning | Action |
|---|---|---|---|
| 0x36 | '6' | Load | Check address space; mark as read if shared/global |
| 0x37 | '7' | Store | Check address space; mark as write |
| 0x3A | ':' | Memory op | Check address space |
| 0x3B | ';' | Memory op | Check address space |
| 0x4E | 'N' (78) | Call | Complex analysis: filter sync intrinsics, check callee attributes |

The classifier is invoked twice per basic block:

  • Forward scan (a3=1): iterates from the last instruction backward to the first sync instruction. Everything after the sync is classified as "above" the barrier.
  • Backward scan (a3=0): iterates from the first instruction forward to the first sync instruction. Everything before the sync is classified as "below" the barrier.

This produces four boolean flags per block, stored in red-black tree maps: reads_above, writes_above, reads_below, writes_below.

Phase 2: CFG Propagation (sub_1C46620)

A classic dataflow fixed-point iteration propagates memory access information through successor edges. For each basic block, the read/write flags from its successors' "below" maps are OR-combined into the current block's "above" maps. The iteration repeats until no flags change (convergence). This ensures that a barrier's necessity accounts for memory accesses reachable through any control flow path, not just the local block.

The branch-dist-norm knob modifies the dataflow meet operator: the default (0) uses OR-propagation (conservative), while a non-zero value likely switches to AND-normalization (more aggressive, requiring all paths to access memory before considering a sync necessary).

Phase 3: Dead Sync Identification and Removal

After propagation, the main function (sub_1C47810) iterates over all blocks and instructions. For each synchronization intrinsic, it looks up the four per-instruction flags:

ra = inst_read_above[I]    wa = inst_write_above[I]
rb = inst_read_below[I]    wb = inst_write_below[I]

A sync is dead (removable) when any of these conditions holds:

| Condition | Meaning |
|---|---|
| !ra && !wa | Nothing above the barrier accesses shared/global memory |
| !rb && !wb | Nothing below the barrier accesses shared/global memory |
| !ra && !wb | No write-after-read or write-after-write hazard |
| !wa && !rb | No read-after-write or write-after-write hazard |

When a sync is removed, the pass calls sub_15F20C0 to delete it from the IR, then restarts the entire algorithm (goto LABEL_2). This restart is necessary because removing one barrier may cause another to become dead.

Special Cases

Barrier variants that carry data -- __syncthreads_count, __syncthreads_and, __syncthreads_or (intrinsic IDs 3734--3736) -- are explicitly excluded from removal. Their return values encode lane participation information, so they cannot be elided even when no memory hazard exists.

Address Space Filtering

The pass only considers memory accesses to shared and global address spaces as relevant for synchronization. The address space check in sub_1C45690:

  • Address space IDs <= 0x1FF (511) or in the 0x300 range: considered local/private -- do not require synchronization.
  • Address space IDs > 511 and not in the 0x3xx range: considered shared/global -- these are the accesses that justify keeping a barrier.

This distinction is critical: local memory is per-thread and never visible to other threads in the warp, so barriers protecting only local accesses are always dead.

Intrinsic Classification

Two predicates classify synchronization-related intrinsics:

sub_1C301F0 (is-sync-intrinsic): Returns true for intrinsic IDs representing barrier operations:

| ID | Likely Mapping |
|---|---|
| 34 | llvm.nvvm.barrier0 (basic __syncthreads) |
| 3718--3720 | barrier.sync / bar.warp.sync variants |
| 3731--3736 | __syncthreads_count/and/or, bar.arrive |

sub_1C30240 (is-fence-intrinsic): Returns true for IDs 4046 and 4242, which are memory fence/membar intrinsics. These are excluded from the sync test -- they impose memory ordering but are not full barriers that can be elided by this pass.

Configuration Knobs

All registered in ctor_525_0 at 0x563730. All are cl::opt<> with hidden visibility.

| Knob | Type | Default | Description |
|---|---|---|---|
| dump-branch-dist | bool | false | Emit diagnostic output on each removed sync |
| ignore-call-safety | bool | true | Treat function calls as non-memory-accessing (aggressive) |
| ignore-variance-cond | int | 0 | Ignore warp divergence on branch conditions |
| ignore-address-space-check | int | 0 | Treat all memory accesses as requiring sync (conservative) |
| ignore-phi-overhead | int | 0 | Ignore PHI node overhead from sync removal in cost model |
| disable-complex-branch-dist | int | 0 | Disable inter-block CFG propagation (Phase 2) |
| no-branch-dist | string | (empty) | Comma-separated list of function names to skip |
| branch-dist-func-limit | int | -1 | Max functions to process (-1 = unlimited) |
| branch-dist-block-limit | int | -1 | Max blocks per function (-1 = unlimited) |
| branch-dist-norm | int | 0 | Dataflow meet operator mode (0 = OR, non-zero = AND) |

The default for ignore-call-safety is notably true (aggressive): device function calls are assumed not to access shared/global memory unless proven otherwise. This is reasonable for typical CUDA kernels where helper functions operate on registers and local memory.

Diagnostic Strings

Diagnostic strings recovered from p2b.3-01-branchdist.txt. All runtime diagnostics are gated by the dump-branch-dist knob (default false).

| String | Source | Category | Trigger |
|---|---|---|---|
| "[filename:line] Removed dead synch: Read above: X, Write above: Y, Read below: Z, Write below: W in function NAME" | sub_1C47810 phase 3 | Debug | dump-branch-dist enabled and a barrier is removed; prints the four read/write flags and the function name |
| "Dump information from Branch Distribution" | ctor_525_0 at 0x563730 | Knob | dump-branch-dist knob description |
| "Ignore calls safety in branch Distribution" | ctor_525_0 | Knob | ignore-call-safety knob description |
| "Ignore variance condition in branch Distribution" | ctor_525_0 | Knob | ignore-variance-cond knob description |
| "Ignore address-space checks in branch Distribution" | ctor_525_0 | Knob | ignore-address-space-check knob description |
| "Ignore the overhead due to phis" | ctor_525_0 | Knob | ignore-phi-overhead knob description |
| "Disable more complex branch Distribution" | ctor_525_0 | Knob | disable-complex-branch-dist knob description |
| "Do not do Branch Distribution on some functions" | ctor_525_0 | Knob | no-branch-dist knob description (value format: "function1,function2,...") |
| "Control number of functions to apply" | ctor_525_0 | Knob | branch-dist-func-limit knob description |
| "Control number of blocks to apply" | ctor_525_0 | Knob | branch-dist-block-limit knob description |
| "Control normalization for branch dist" | ctor_525_0 | Knob | branch-dist-norm knob description |

Data Structures

The pass allocates a large state object (~696 bytes, 87 QWORDs) containing 14 red-black tree maps (2 + 4 + 4 + 4, at 6 QWORDs per map) organized in three tiers: block-level sync flags, block-level read/write flags, and instruction-level flags:

| Maps | Keys | Values | Purpose |
|---|---|---|---|
| a1[3..14] (2 maps) | Block pointer | bool | Has-sync-above/below per block |
| a1[15..38] (4 maps) | Block pointer | bool | Propagated read/write above/below (Phase 2 output) |
| a1[39..62] (4 maps) | Block pointer | bool | Initial read/write above/below (Phase 1 output) |
| a1[63..86] (4 maps) | Instruction pointer | bool | Per-instruction read/write above/below (Phase 3) |

All maps are std::map-like red-black trees with 48-byte nodes (left/right/parent pointers + key + 1-byte boolean value at offset 40). Tree operations are implemented in sub_1C46280 (insert-or-find for block maps), sub_1C47760 (insert-or-find for instruction maps), sub_1C45B10 (erase), and sub_1C45C70/sub_1C45940 (recursive destructors).

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| -- | 0x1C47810 | 2357L | Core algorithm: classify + propagate + remove |
| -- | 0x1C49D10 | 179L | Pass wrapper: init state, call core, cleanup |
| -- | 0x1C46330 | 197L | Phase 1: forward/backward instruction scan |
| -- | 0x1C46620 | 1157L | Phase 2: CFG successor propagation (fixed-point) |
| -- | 0x1C45690 | 117L | Instruction classifier: determines R/W flags |
| -- | 0x1C458C0 | 28L | Helper: classify all instructions in a block |
| -- | 0x1C46280 | 38L | Map insert-or-find (block-level maps) |
| -- | 0x1C47760 | 37L | Map insert-or-find (instruction-level maps) |
| -- | 0x1C475C0 | 43L | Map lower_bound lookup |
| -- | 0x1C47660 | 50L | Map find with hint |
| -- | 0x1C45B10 | 113L | Map erase operation |
| -- | 0x1C45C70 | 133L | Tree destructor (recursive free) |
| -- | 0x1C45940 | 133L | Tree destructor (recursive free, alt type) |
| -- | 0x1C301F0 | 15L | Is-sync-intrinsic predicate |
| -- | 0x1C30240 | 13L | Is-fence-intrinsic predicate |
| -- | 0x563730 | 493L | CLI knob registration (ctor_525_0) |

Common Pitfalls

These are mistakes a reimplementor is likely to make when building an equivalent dead barrier elimination pass using CFG dataflow.

1. Using address-level tracking instead of boolean per-category flags. The pass tracks four boolean flags per block (reads_above, writes_above, reads_below, writes_below) for shared/global memory, not specific addresses. A reimplementation that attempts to track precise addresses ("smem[0] is only written above, smem[1] is only read below") will appear to find more dead barriers but is fundamentally unsound for GPU execution. Different threads access different addresses through the same pointer expression (smem[tid] vs smem[tid-1]), making address-based alias analysis across threads impossible at compile time. The boolean-per-category approach is the correct conservative abstraction.

2. Not excluding __syncthreads_count/and/or (IDs 3734--3736) from removal. These barrier variants return a value that encodes lane participation information (__syncthreads_count returns the number of threads that passed a non-zero predicate). Even when no memory hazard exists across the barrier, the return value carries data that the program depends on. A reimplementation that removes these barriers based solely on memory analysis will break programs that use the return value for algorithmic purposes (e.g., warp-level voting patterns, early-exit counting).

3. Treating the ignore-call-safety default as conservative. The default for ignore-call-safety is true (aggressive): function calls are assumed not to access shared/global memory. This is correct for typical CUDA helper functions that operate on registers and local memory, but a reimplementation that uses false as the default will retain nearly all barriers in code that calls device functions, defeating the optimization. Conversely, a reimplementation that uses true but does not also check the callee's isSharedMemoryAccess attribute when available will miss cases where a called function does access shared memory through a pointer argument.

4. Not restarting the analysis after removing a barrier. The pass restarts from Phase 1 (goto LABEL_2) after each barrier deletion because removing one barrier merges the regions it separated, potentially exposing adjacent barriers as dead. A reimplementation that collects all dead barriers in one pass and removes them simultaneously will miss cascading redundancies. Worse, it may remove barriers in the wrong order: if barrier B2 is dead only because barrier B1 separates it from a hazard, removing both simultaneously removes B1's protection while the hazard still exists.

5. Conflating address space filtering with memory visibility. The pass considers only shared and global memory accesses (address spaces > 511 and not in the 0x3xx range) as relevant for barrier justification. Local/private memory (per-thread, invisible to other threads) is correctly excluded. A reimplementation that includes local memory accesses in the analysis will never remove any barrier in code that uses local arrays, since every function with local variables would show "read+write above and below." The address space filter is essential for the optimization to have any effect.
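The boolean-per-category abstraction (pitfall 1) and the local-memory filter (pitfall 5) can be sketched together. This is an illustrative Python model, not the binary's code; every name here (BlockState, record_access, the address-space strings) is invented for the sketch.

```python
# Illustrative model of the boolean-per-category abstraction (pitfall 1)
# and the local-memory filter (pitfall 5). All names are invented for
# this sketch; none appear in the binary.

SHARED, GLOBAL, LOCAL = "shared", "global", "local"

class BlockState:
    """Four booleans per block -- no addresses are tracked."""
    def __init__(self):
        self.reads_above = False
        self.writes_above = False
        self.reads_below = False
        self.writes_below = False

def is_cross_thread_visible(space):
    # Local/private memory is per-thread and can never create a
    # cross-thread hazard, so it must not justify a barrier (pitfall 5).
    return space in (SHARED, GLOBAL)

def record_access(state, space, is_write, above_barrier):
    if not is_cross_thread_visible(space):
        return
    if above_barrier:
        if is_write: state.writes_above = True
        else:        state.reads_above = True
    else:
        if is_write: state.writes_below = True
        else:        state.reads_below = True

# smem[tid] written above the barrier, smem[tid-1] read below it:
# single-thread address comparison sees disjoint locations, but the
# boolean abstraction conservatively keeps the barrier.
s = BlockState()
record_access(s, SHARED, is_write=True,  above_barrier=True)
record_access(s, SHARED, is_write=False, above_barrier=False)
barrier_needed = s.writes_above and s.reads_below
```

The point of the abstraction is visible in the last line: the hazard check never consults addresses, only the four category flags.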

GPU-Specific Motivation

On NVIDIA GPUs, __syncthreads() forces all threads in a thread block to reach the barrier before any can proceed. This is one of the most expensive control flow operations in CUDA -- it serializes warp execution and creates a pipeline stall. In practice, CUDA programmers insert barriers conservatively (every shared memory access pattern gets a barrier "just in case"), leading to significant over-synchronization. This pass recovers the performance lost to unnecessary barriers by proving, through static dataflow analysis, that specific barriers protect no actual memory hazard.

The ignore-variance-cond knob connects to warp divergence analysis: when a branch condition is provably uniform (all lanes take the same path), synchronization across that branch is trivially unnecessary regardless of memory access patterns. This is a common case in well-structured CUDA code where control flow depends on blockIdx or compile-time constants.

Dead Barrier Elimination

CICC contains three independent passes that eliminate redundant __syncthreads() barriers from CUDA kernels. This page documents the lightweight basic-dbe pass -- a single-pass, intra-block pattern matcher that removes trivially dead barriers without dataflow analysis. The two heavyweight engines are covered on their own pages: Dead Synchronization Elimination (sub_2C84BA0, 96KB, full bidirectional fixed-point dataflow) and Branch Distribution (sub_1C47810, 63KB, NVVM-IR-level fixed-point with restart). All three target the same goal -- eliminating barriers that provably do not order any memory hazard -- but at different cost/precision tradeoffs.

Key Facts: basic-dbe

| Property | Value |
|---|---|
| Pass name | basic-dbe |
| Class | llvm::BasicDeadBarrierEliminationPass |
| Scope | Function pass (LLVM IR level) |
| Registration | New PM #376, line 2212 in sub_2342890 (first NVIDIA function pass registered) |
| Runtime position | Inserted via pipeline extension callbacks; not in the Tier 0/1/2/3 tables (see Pipeline) |
| Parameters | None (non-parameterized pass) |
| Knob constructor | ctor_261 (below 5KB, in 0x4F0000--0x51FFFF range) |
| Enable global | byte_4FBB6C0 (initialized to 0 in ctor_261, set to 1 by pipeline setup) |
| Binary size | Small (< 5KB compiled) |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |

Why a Lightweight Pass Exists

The full dead synchronization elimination engine at sub_2C84BA0 is 96KB of code implementing bidirectional fixed-point dataflow with complete restart after each removal. That is expensive. For the common cases -- consecutive barriers with no intervening memory operations, barriers at function entry/exit with no shared memory traffic in the block, or barriers immediately followed by another barrier -- the heavyweight engine is overkill.

basic-dbe exists as a cheap pre-filter: it handles the trivially dead cases in a single linear scan per function, eliminating the low-hanging fruit before the full engine (if scheduled) performs its expensive inter-block analysis. By removing obvious dead barriers early, basic-dbe also reduces the iteration count of the heavyweight pass, since fewer barriers remain for it to analyze.

Algorithm

basic-dbe operates as a single-pass function pass with no dataflow propagation, no fixed-point iteration, and no restart-on-removal. It scans each basic block once and applies local pattern matching to identify barriers that are trivially dead.

Barrier Identification

The pass reuses the same barrier predicate logic as the full engine. An instruction is a synchronization barrier if all of the following hold:

  1. Opcode == 85 (internal call opcode for intrinsics)
  2. The callee pointer at offset -32 is non-null
  3. The callee's byte at offset 0 == 0 (intrinsic, not user-defined function)
  4. The convergent attribute flag (bit 0x20 at byte+33) is set
  5. sub_CEA1A0(callee.field[36]) confirms the intrinsic ID falls within the known barrier ID range

This is the same check implemented by sub_2C83D20 in the full engine.
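The five conditions can be modeled over a dict-based instruction representation. This is a sketch: the field names loosely mirror the decompiled offsets, and the concrete barrier ID set is a placeholder standing in for sub_CEA1A0's range check.

```python
# Model of the five-condition barrier predicate. Field names loosely
# mirror the decompiled offsets; BARRIER_IDS is a placeholder set
# standing in for sub_CEA1A0's actual ID range check.

CALL_OPCODE = 85
CONVERGENT_FLAG = 0x20
BARRIER_IDS = {8260, 8261, 8262}          # placeholder ID range

def in_barrier_id_range(intrinsic_id):    # stands in for sub_CEA1A0
    return intrinsic_id in BARRIER_IDS

def is_sync_barrier(inst):
    if inst["opcode"] != CALL_OPCODE:                  # condition 1
        return False
    callee = inst.get("callee")                        # offset -32
    if callee is None:                                 # condition 2
        return False
    if callee["byte0"] != 0:                           # condition 3: intrinsic
        return False
    if not (callee["attr_byte33"] & CONVERGENT_FLAG):  # condition 4: convergent
        return False
    return in_barrier_id_range(callee["intrinsic_id"]) # condition 5
```

Note how the conditions short-circuit in the same order as the list above: the cheap opcode test rejects the vast majority of instructions before any callee field is touched.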

Elimination Patterns

basic-dbe identifies four categories of trivially dead barriers, all detectable without inter-block analysis:

Pattern 1: Consecutive Barriers

Two or more __syncthreads() calls with no intervening instructions (or only non-memory instructions between them). The second and subsequent barriers are redundant because the first already forces all threads to synchronize.

; Before basic-dbe:
  call void @llvm.nvvm.barrier0()     ; barrier A
  call void @llvm.nvvm.barrier0()     ; barrier B -- DEAD (consecutive)

; After basic-dbe:
  call void @llvm.nvvm.barrier0()     ; barrier A retained

Pattern 2: Barrier in Empty Block

A basic block whose only non-terminator instructions are barriers and non-memory operations (debug info, metadata). If no instruction in the block reads or writes shared/global memory, every barrier in the block is dead -- there is nothing to order.

; Before basic-dbe:
bb_empty:
  call void @llvm.nvvm.barrier0()     ; DEAD -- no memory ops in block
  br label %bb_next

; After basic-dbe:
bb_empty:
  br label %bb_next

Pattern 3: Barrier at Function Entry

A barrier at the start of a kernel (or device function) with no memory operations between function entry and the barrier. Since no thread has performed any shared memory access yet, the barrier orders nothing.

Pattern 4: Barrier Before Return

A barrier immediately before a return with no memory operations between the barrier and the function exit. The barrier would order accesses that have already been performed, but since no subsequent access follows, no hazard exists in the forward direction.

Pseudocode

function BasicDeadBarrierEliminationPass::run(F):
    if not byte_4FBB6C0:          // global enable flag
        return PreservedAnalyses::all()

    changed = false

    for each BB in F:
        barriers = []
        has_memory_op = false

        for each inst in BB:
            if isSyncBarrier(inst):
                if not has_memory_op:
                    // Pattern 2/3: barrier with no preceding memory op
                    // Also handles Pattern 1: consecutive barriers
                    //   (first barrier is not a memory op, so second is dead)
                    mark inst for deletion
                    changed = true
                else:
                    barriers.append(inst)
                    has_memory_op = false   // reset for next segment

            else if classifyMemoryAccess(inst) has read or write:
                has_memory_op = true

        // Pattern 4: check trailing barrier before terminator
        if not barriers.empty() and not has_memory_op:
            mark barriers.back() for deletion
            changed = true

    // Delete all marked instructions
    for each marked inst:
        inst.eraseFromParent()

    if changed:
        return PreservedAnalyses::none()  // IR modified
    else:
        return PreservedAnalyses::all()

The key design choice: basic-dbe treats each basic block as an isolated unit. It does not look at predecessor or successor blocks. This means it will miss cases where a barrier is dead because all reaching paths lack memory accesses -- those cases require the full inter-block dataflow of sub_2C84BA0 or sub_1C47810.
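The pseudocode above can be made executable with a toy instruction encoding. The sketch below reduces instructions to three tokens ("barrier", "mem" for a shared/global access, "other") and treats each block in isolation, exactly as basic-dbe does; it models the pseudocode, not the binary's data structures.

```python
# Executable model of the basic-dbe single-pass scan. Instructions are
# reduced to tokens: "barrier", "mem" (shared/global access), "other".

def basic_dbe(blocks):
    """blocks: list of instruction lists. Returns the blocks with
    trivially dead barriers removed, each block treated in isolation."""
    result = []
    for bb in blocks:
        dead = set()
        kept = []                    # barriers preceded by a memory op
        has_memory_op = False
        for i, inst in enumerate(bb):
            if inst == "barrier":
                if not has_memory_op:
                    dead.add(i)      # Patterns 1/2/3: nothing above to order
                else:
                    kept.append(i)
                    has_memory_op = False   # new segment begins
            elif inst == "mem":
                has_memory_op = True
        if kept and not has_memory_op:
            dead.add(kept[-1])       # Pattern 4: trailing barrier
        result.append([x for i, x in enumerate(bb) if i not in dead])
    return result

# write | barrier A | read | barrier B | barrier C:
# A survives; C is consecutive-dead, B is trailing-dead (no memory op
# between it and the block exit).
bb = ["mem", "barrier", "mem", "barrier", "barrier"]
```

Running `basic_dbe([bb])` leaves only barrier A, matching the patterns described above.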

Memory Access Classification

Within the basic block scan, basic-dbe must determine which instructions constitute memory operations that could create cross-thread hazards. The classification mirrors the logic in sub_2C83AE0 (the full engine's classifier):

| Opcode | Value | Instruction | Classification |
|---|---|---|---|
| 61 | 0x3D | Store | Memory write |
| 62 | 0x3E | Load | Memory read |
| 65 | 0x41 | Atomic | Memory read + write |
| 66 | 0x42 | AtomicCmpXchg | Memory write |
| 85 | 0x55 | Call/Intrinsic | Read + write if callee accesses shared/global memory |

Non-memory instructions (arithmetic, comparisons, PHI nodes, debug info, branches) do not set the has_memory_op flag.

The byte_4FBB6C0 Enable Flag

The global byte at byte_4FBB6C0 serves as a shared enable flag initialized to 0 in ctor_261. The pipeline setup code sets it to 1 when the optimization level and target configuration warrant running barrier elimination. This same flag gates branch-dist (sub_1C49D10 checks it before invoking sub_1C47810), confirming that ctor_261 initializes shared state for the barrier elimination subsystem as a whole, not just basic-dbe.

Relationship to Other Dead-Sync Passes

CICC's three barrier elimination passes form a layered strategy:

| Property | basic-dbe | branch-dist | Dead Sync Elimination |
|---|---|---|---|
| Entry point | llvm::BasicDeadBarrierEliminationPass | sub_1C47810 | sub_2C84BA0 |
| PM slot | 376 (New PM function pass) | 377 (New PM function pass) | None (module-level caller) |
| Scope | Intra-block only | Inter-block (CFG propagation) | Inter-block (full restart) |
| Dataflow | None (pattern match) | Fixed-point, 13 RB-tree maps | Fixed-point, 12 RB-tree maps |
| Restart on removal | No | Yes (goto LABEL_2) | Yes (goto LABEL_2) |
| IR level | LLVM IR (opcodes 61/62/65/66/85) | NVVM IR (opcodes 0x36/0x37/0x3A/0x3B/0x4E) | LLVM IR (opcodes 61/62/65/66/85) |
| Binary size | < 5KB | 63KB core + helpers | 96KB core + helpers |
| Knobs | byte_4FBB6C0 enable flag | 10 knobs (ctor_525) | None known (controlled by caller) |
| Complexity | O(n_instructions) | O(B * F * C) | O(B * F * C) |
| Typical runtime | Microseconds | Milliseconds | Milliseconds |

The intended execution order:

  1. basic-dbe runs first in the function pass pipeline, eliminating trivially dead barriers in O(n) time.
  2. branch-dist runs next (slot 377, immediately after basic-dbe at slot 376), performing full inter-block analysis on the reduced barrier set using NVVM IR opcodes.
  3. Dead Sync Elimination (sub_2C84BA0) runs later from module-level callers (sub_2C88020, sub_2C883F0), performing the most aggressive analysis using LLVM IR opcodes with the element-size gate and special intrinsic ID handling.

Configuration

| Knob | Type | Default | Effect |
|---|---|---|---|
| byte_4FBB6C0 | bool (global) | 0 (disabled) | Master enable for basic-dbe and branch-dist |

No dedicated per-pass knobs (threshold, dump flags, or limits) have been identified for basic-dbe itself. The pass is controlled entirely by its enable flag. This is consistent with its role as a lightweight pre-filter -- there is nothing to tune.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_2342890 line 2212 | -- | New PM registration: maps "basic-dbe" to llvm::BasicDeadBarrierEliminationPass |
| -- | ctor_261 (0x4F range) | -- | Global constructor: initializes byte_4FBB6C0 to 0, registers basic-dbe knob string |
| -- | byte_4FBB6C0 | -- | Global enable flag (shared with branch-dist) |
| -- | sub_2C83D20 | -- | isSyncBarrier predicate (shared with full engine) |
| -- | sub_2C83AE0 | -- | classifyMemoryAccess (shared with full engine) |
| -- | sub_CEA1A0 | -- | Barrier intrinsic ID confirmation |
| -- | sub_B49E00 | -- | isSharedMemoryAccess -- CUDA address space check |
| -- | sub_B43D60 | -- | Instruction::eraseFromParent -- barrier deletion |

Cross-References

Dead Synchronization Elimination

The dead synchronization elimination engine at sub_2C84BA0 is the largest NVIDIA-custom pass in cicc at 96KB (~3,400 decompiled lines). It removes __syncthreads() barriers that provably do not order any memory hazard, reducing warp stall cycles in CUDA kernels without affecting correctness. The algorithm performs a bidirectional fixed-point dataflow analysis across the entire function's CFG, tracking four memory access categories per basic block through eight red-black tree maps. After convergence, it evaluates every barrier against the computed access sets and deletes those that protect no actual hazard. Each deletion triggers a full restart of the analysis, handling cascading redundancies at the cost of quadratic worst-case complexity.

This pass is distinct from the lightweight basic-dbe pass (slot 376, llvm::BasicDeadBarrierEliminationPass) and from the branch-dist pass. All three target dead barriers, but only this engine performs full inter-block dataflow with complete restart -- the other two handle simpler local or single-pass cases.

Key Facts

| Property | Value |
|---|---|
| Entry point | sub_2C84BA0 |
| Binary size | 96KB (~3,400 decompiled lines) |
| Pass type | Module-level NVIDIA custom (not registered in New PM) |
| Callers | sub_2C88020, sub_2C883F0, self-recursive |
| Barrier predicate | sub_2C83D20 |
| Access classifier | sub_2C83AE0 |
| Per-BB analysis | sub_2C84640 (bidirectional, parameterized by direction) |
| State object | 12 red-black tree maps at known offsets in a1 |
| Diagnostic | " Removed dead synch: " with per-category read/write counts |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |

Five-Phase Algorithm

Phase 1: Barrier Identification (sub_2C83D20)

The helper sub_2C83D20 classifies whether a given instruction is a synchronization barrier. The check is a conjunction of five conditions:

function isSyncBarrier(inst) -> bool:
    if inst.opcode != 85:                       // internal call opcode
        return false
    callee = inst.field[-32]                     // callee pointer at offset -32
    if callee == null:
        return false
    if callee.byte[0] != 0:                     // byte 0 == 0 means intrinsic (not user-defined)
        return false
    if callee.field[24] != inst.field[80]:       // scope match
        return false
    if !(callee.byte[33] & 0x20):               // convergent attribute flag
        return false
    return CEA1A0(callee.field[36])              // confirm barrier intrinsic ID

The convergent attribute flag (bit 0x20 at byte+33) is the key discriminator. LLVM marks barrier intrinsics as convergent to prevent optimizations from moving them across control flow boundaries. The final sub_CEA1A0 call validates that the intrinsic ID falls within the known barrier ID range, distinguishing barriers from other convergent intrinsics (e.g., warp vote operations).

Phase 2: Memory Access Classification (sub_2C83AE0)

For every non-barrier instruction, sub_2C83AE0 determines whether it reads from or writes to memory that could create a hazard across a barrier. It outputs two boolean flags via pointer parameters a2 (read) and a3 (write).

| Opcode | Value | Instruction | Classification |
|---|---|---|---|
| 61 | 0x3D | Store | Write, if element size > 0x1FF bits |
| 62 | 0x3E | Load | Read, with the same large-type gate |
| 65 | 0x41 | Atomic | Read + write |
| 66 | 0x42 | AtomicCmpXchg | Write |
| 85 | 0x55 | Call/Intrinsic | Context-dependent (see below) |

For call instructions (opcode 85), the classifier applies recursive analysis:

  1. Check if the callee has intrinsic flag 0x20 set.
  2. For barrier-like intrinsics with opcode 25 and field+96 == 0: classify as Read only.
  3. For general calls: invoke sub_B49E00 (isSharedMemoryAccess) to determine whether the callee accesses shared/global memory. If yes: Read + Write.

The element size gate (> 0x1FF bits, i.e., > 511 bits) filters out trivially small memory operations that target scalar types in registers rather than actual memory-backed storage. Loads and stores of types narrower than 512 bits are assumed to operate on register-promoted values and do not participate in cross-thread hazards.
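The classifier's decision table, including the size gate, can be captured in a few lines. This is a sketch: the opcode numbers come from the table above, the returned tuple models the a2/a3 (read, write) out-parameters, and `callee_touches_shared` stands in for the sub_B49E00 query.

```python
# Sketch of the access classifier with the element-size gate applied to
# plain loads and stores. The (read, write) tuple models the a2/a3
# out-parameters of sub_2C83AE0.

STORE, LOAD, ATOMIC, CMPXCHG, CALL = 61, 62, 65, 66, 85
SIZE_GATE_BITS = 0x1FF    # loads/stores of <= 511-bit types are ignored

def classify(opcode, elem_bits=0, callee_touches_shared=False):
    if opcode == STORE:
        return (False, elem_bits > SIZE_GATE_BITS)
    if opcode == LOAD:
        return (elem_bits > SIZE_GATE_BITS, False)
    if opcode == ATOMIC:
        return (True, True)
    if opcode == CMPXCHG:
        return (False, True)          # write-only, as recovered
    if opcode == CALL:
        # stands in for the sub_B49E00 (isSharedMemoryAccess) query
        return (True, True) if callee_touches_shared else (False, False)
    return (False, False)
```

A 32-bit scalar store classifies as no-access under the gate, while a 1024-bit vector store registers as a write.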

Phase 3: Bidirectional Fixed-Point Dataflow

Complexity. Let B = number of basic blocks, S = number of barrier instructions, and I = total instructions across all blocks. Phase 1 (barrier identification) is O(S). Phase 2 (access classification) is O(I). The dataflow fixed-point iterates until no boolean in the 4 * B * 2 lattice positions flips from 0 to 1; since the lattice has height 1, convergence is bounded by O(B) iterations, each costing O(B + I) for the forward and backward scans, giving O(B * (B + I)) per convergence cycle. Phase 4 (elimination decision) is O(S). Phase 5 restarts the entire analysis from Phase 3 on each removal, yielding a worst-case total of O(S * B * (B + I)). In practice, CUDA kernels have B < 100, S < 20, and convergence in 2--3 iterations, so the pass behaves as near-linear in typical use. The red-black tree maps contribute O(log B) per insert/lookup, but this is dominated by the iteration cost.

This is the core of the pass and accounts for the majority of its 96KB size. The algorithm maintains eight red-black tree maps organized into forward and backward analysis sets, plus four bridge maps for the final elimination decision.

Map Layout

| Offset range | Direction | Contents |
|---|---|---|
| a1[15..20] | Forward | ReadAbove per basic block |
| a1[21..26] | Forward | WriteAbove per basic block |
| a1[27..32] | Forward | ReadBelow per basic block |
| a1[33..38] | Forward | WriteBelow per basic block |
| a1[39..44] | Backward | ReadAbove per basic block |
| a1[45..50] | Backward | WriteAbove per basic block |
| a1[51..56] | Backward | ReadBelow per basic block |
| a1[57..62] | Backward | WriteBelow per basic block |
| a1[63..68] | Bridge | ReadAbove crossing barrier |
| a1[69..74] | Bridge | WriteAbove crossing barrier |
| a1[75..80] | Bridge | ReadBelow crossing barrier |
| a1[81..86] | Bridge | WriteBelow crossing barrier |

Each map is a std::map-style red-black tree (48-byte nodes: left/right/parent pointers, key = basic block pointer, value = 1-byte boolean at offset 40). The helper sub_2C84590 performs map insertion; sub_2C84AF0 is a variant for a different node type used in the bridge maps.
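The described 48-byte node is consistent with libstdc++'s _Rb_tree node for a pointer-keyed map with a 1-byte value on x86-64 (4-byte color plus padding, three node pointers, 8-byte key, value at offset 40, padded to 48). A ctypes model reproduces the offsets; the field names are invented and the libstdc++ correspondence is an inference, not confirmed from the binary.

```python
import ctypes

# Model of the 48-byte red-black tree node described above. Field names
# are invented; the layout matches libstdc++'s _Rb_tree node for a
# std::map<void*, bool>-style container on x86-64 (an assumption).

class RBNode(ctypes.Structure):
    _fields_ = [
        ("color",  ctypes.c_uint32),   # offset 0
        ("_pad",   ctypes.c_uint32),   # offset 4 (alignment padding)
        ("parent", ctypes.c_void_p),   # offset 8
        ("left",   ctypes.c_void_p),   # offset 16
        ("right",  ctypes.c_void_p),   # offset 24
        ("key",    ctypes.c_void_p),   # offset 32: basic block pointer
        ("value",  ctypes.c_uint8),    # offset 40: 1-byte boolean
    ]
```

Pointer alignment pads the structure from 41 to 48 bytes, matching the node size observed in the decompilation.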

Iteration Algorithm

The analysis loop is implemented as a goto-based iteration between labels LABEL_2 and LABEL_178 in the decompiled output:

function analyzeBarriers(F, state):
    LABEL_2:  // restart point after barrier removal

    // --- Forward pass ---
    for each BB in F:
        sub_2C84640(state, BB, direction=1)  // scan BB forward
            // For each instruction from BB start toward first barrier:
            //   classify as read/write via sub_2C83AE0
            //   OR the flags into forward maps [15..38]
            // Propagate successor BBs' flags backward if they
            // contain already-analyzed barriers

    // --- Forward convergence check ---
    changed_fwd = false
    for each BB in F:
        if forward_maps[BB] != previous_forward_maps[BB]:
            changed_fwd = true
            break

    // --- Backward pass ---
    for each BB in F:
        sub_2C84640(state, BB, direction=0)  // scan BB backward
            // For each instruction from BB end toward last barrier:
            //   classify as read/write
            //   OR into backward maps [39..62]
            // Propagate predecessor BBs' flags forward

    // --- Backward convergence check ---
    changed_bwd = false
    for each BB in F:
        if backward_maps[BB] != previous_backward_maps[BB]:
            changed_bwd = true
            break

    // If either direction changed, iterate
    if changed_fwd or changed_bwd:
        goto LABEL_2_inner  // re-run dataflow (not full restart)

    // Both converged -- proceed to Phase 4
    goto elimination_phase

The sub_2C84640 helper is the per-BB analysis workhorse. It takes a direction parameter:

  • direction=1 (forward): scans from block entry toward the first barrier, accumulating ReadAbove/WriteAbove. Propagates read/write information from successor blocks.
  • direction=0 (backward): scans from block exit toward the last barrier, accumulating ReadBelow/WriteBelow. Propagates information from predecessor blocks.

The convergence check compares the entire map contents (all four categories for every BB) against their values from the previous iteration. If any single boolean flipped from 0 to 1, the changed flag is set. Since the analysis is monotone (booleans can only transition from 0 to 1, never back), convergence is guaranteed in at most O(|BB|) iterations, though in practice it converges in 2--3 iterations for typical CUDA kernels.

Phase 4: Elimination Decision

After the dataflow converges, the pass examines every barrier instruction and checks the bridge maps (a1[63..86]) which represent the combined read/write sets crossing barrier boundaries.

A barrier is redundant (dead) if any of the following holds:

| Condition | Interpretation |
|---|---|
| ReadAbove == 0 AND WriteAbove == 0 | No shared-memory accesses reach this barrier from above; the barrier orders nothing |
| ReadBelow == 0 AND WriteBelow == 0 | No accesses reach from below |
| ReadAbove == 0 AND WriteBelow == 0 | No WAR or WAW hazard across the barrier |
| WriteAbove == 0 AND ReadBelow == 0 | No RAW or WAW hazard across the barrier |

The first two conditions capture the case where one side of the barrier has no memory traffic at all. The latter two capture the case where both sides access memory, but the access patterns cannot conflict.
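The four conditions reduce to a small predicate over the per-barrier bridge flags. This sketch follows the table row by row; the function name and signature are invented.

```python
# The four redundancy conditions from the table above, expressed over the
# per-barrier bridge flags (each 0 or 1). A barrier is dead if any
# condition holds.

def barrier_is_dead(ra, wa, rb, wb):
    """ra/wa/rb/wb: ReadAbove/WriteAbove/ReadBelow/WriteBelow flags."""
    no_above = not ra and not wa        # nothing above to order
    no_below = not rb and not wb        # nothing below to order
    cond3 = not ra and not wb           # third row of the table
    cond4 = not wa and not rb           # fourth row of the table
    return no_above or no_below or cond3 or cond4
```

For example, a barrier with no accesses above it (0, 0, 1, 1) is dead by the first condition, while a fully contested barrier (1, 1, 1, 1) fails every condition and survives.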

Special Case: Intrinsic IDs 8260--8262

For call instructions (opcode 85) where the callee's intrinsic ID satisfies (ID - 8260) <= 2 (i.e., IDs 8260, 8261, or 8262), the pass applies an additional test via sub_BD3660 (hasOneUse). If the barrier-like intrinsic has only a single use, it is considered removable even if the standard dataflow check would keep it. These IDs likely correspond to specialized barrier variants (__syncthreads_count, __syncthreads_and, __syncthreads_or) where the return value is used as data. When the return value has only one use, the compiler can reason that the data-carrying aspect is trivially handled and the barrier itself may still be dead from a memory ordering perspective.

Phase 5: Removal and Complete Restart

When a barrier is identified as dead, the pass:

  1. Emits a diagnostic string (if the controlling dump flag is enabled):

     Removed dead synch: [filename:line] in function <name>
     Read above: N, Write above: N, Read below: N, Write below: N

    where N is 0 or 1 for each category.

  2. Calls sub_B43D60 (Instruction::eraseFromParent) to delete the barrier instruction from the IR.

  3. Restarts from Phase 3 (goto LABEL_2) -- a complete re-analysis of the entire function.

The restart is not optional. Removing a barrier changes the memory access pattern visible between adjacent barriers: what was previously two separate "above/below" regions separated by a barrier now becomes a single merged region. This merging may cause an adjacent barrier to lose its hazard justification, making it dead as well. The cascading effect can propagate through a chain of barriers.

Dataflow Analysis Detail

The Eight Red-Black Tree Maps

The forward and backward map sets track identical categories but are populated from different CFG traversal directions. This bidirectional approach ensures that the analysis captures memory accesses reachable from any control flow path, not just straight-line code.

Forward maps (a1[15..38]):

  • Populated by scanning each BB from entry toward the first barrier.
  • ReadAbove[BB] = 1 if any instruction before the first barrier in BB reads shared/global memory, OR if any successor BB contributes a read.
  • WriteAbove[BB] = same for writes.
  • ReadBelow/WriteBelow[BB] = propagated from successor blocks' analysis.

Backward maps (a1[39..62]):

  • Populated by scanning each BB from exit toward the last barrier.
  • ReadBelow[BB] = 1 if any instruction after the last barrier in BB reads memory, OR if any predecessor BB contributes a read.
  • WriteBelow[BB] = same for writes.
  • ReadAbove/WriteAbove[BB] = propagated from predecessor blocks.

Bridge maps (a1[63..86]):

  • Keyed by barrier instruction pointer (not BB pointer).
  • Represent the combined access sets that cross the specific barrier boundary.
  • Populated during the final pass over barrier instructions after dataflow convergence.

Monotone Dataflow Framework

The analysis is a classic monotone dataflow problem on a Boolean lattice:

  • Domain: {0, 1} per (basic-block, category) pair.
  • Transfer function: OR of local classification with propagated values.
  • Meet operator: OR (any path contributing an access sets the flag).
  • Direction: Bidirectional (forward pass propagates from successors, backward pass propagates from predecessors).
  • Convergence: Guaranteed because the lattice has height 1 (a value can only change from 0 to 1, never back). The fixed point is reached when no additional propagation changes any value.

In the worst case, each iteration may set one new bit, and there are 4 * |BB| bits per direction, so convergence takes at most 4 * |BB| iterations per direction. In practice, CUDA kernels have shallow CFGs and the iteration converges in 2--3 rounds.
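The convergence argument can be demonstrated on a toy CFG. The sketch below is a minimal monotone OR-propagation loop, modeling only the termination reasoning (flags flip 0 to 1 and never back), not the engine's actual maps or traversal order.

```python
# Minimal monotone OR-propagation on a toy CFG. Each iteration can only
# flip flags 0 -> 1, so the loop is guaranteed to terminate; this models
# the convergence argument, not the engine's data structures.

def propagate(succs, seed):
    """succs: block -> successor list; seed: block -> locally-set flag.
    Returns (flags, iterations) after reaching the fixed point."""
    flags = dict(seed)
    changed, iters = True, 0
    while changed:
        changed = False
        iters += 1
        for bb, ss in succs.items():
            new = flags[bb] or any(flags[s] for s in ss)
            if new and not flags[bb]:
                flags[bb] = True
                changed = True
    return flags, iters

# Diamond CFG: entry -> {left, right} -> exit; only exit touches memory.
succs = {"entry": ["left", "right"], "left": ["exit"],
         "right": ["exit"], "exit": []}
flags, iters = propagate(succs, {"entry": False, "left": False,
                                 "right": False, "exit": True})
```

On this diamond the memory flag reaches every block, and the loop converges after a final no-change pass, well under the 4 * |BB| bound.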

Cascading Restart Logic

The most expensive aspect of the algorithm is the complete restart after each barrier removal. Consider a function with N barriers:

B0 -- barrier_1 -- B1 -- barrier_2 -- B2 -- barrier_3 -- B3

If barrier_2 is removed first, blocks B1 and B2 merge into a single region. Suppose B1 contained only writes and B2 contained only reads: the merged region now contains both reads and writes, which changes the access flags visible to the adjacent barriers -- barrier_1 now sees reads as well as writes below it, and barrier_3 sees writes as well as reads above it. Either barrier may gain or lose its hazard justification as a result; barrier_3, for instance, becomes dead if B3 performs no memory accesses. This cascading effect requires full re-analysis.

Worst-case complexity: O(N_barriers * N_BBs * convergence_iterations), where convergence_iterations is bounded by 4 * |BB| but is typically 2--3. For a kernel with B barriers removed in sequence, the total work is O(B * F * C) where F is the per-iteration cost of the dataflow and C is the convergence bound.

In practice, CUDA kernels rarely have more than 10--20 barriers, and cascading removals are uncommon (typically 0--3 restarts), so the theoretical quadratic cost is not a bottleneck.

Relationship to basic-dbe and branch-dist

CICC contains three passes that eliminate dead synchronization barriers. They differ in scope, cost, and the cases they handle:

| Property | basic-dbe | branch-dist | Dead Sync Elimination |
|---|---|---|---|
| Pass name | basic-dbe | branch-dist | (unnamed, called from module pass) |
| Entry point | llvm::BasicDeadBarrierEliminationPass | sub_1C47810 | sub_2C84BA0 |
| Registration | New PM slot 376 | New PM slot (function pass) | Module-level caller |
| Scope | Single BB / local | Function-level with CFG propagation | Function-level with full restart |
| Dataflow | None (pattern match) | Fixed-point, 13 rb-tree maps | Fixed-point, 12 rb-tree maps |
| Restart on removal | No | Yes (goto LABEL_2) | Yes (goto LABEL_2) |
| Binary size | Small (ctor_261) | 63KB core + helpers | 96KB core + helpers |
| Knobs | byte_4FBB6C0 enable flag | 10 knobs (ctor_525) | None known (controlled by caller) |

basic-dbe handles trivially dead barriers detectable without dataflow analysis -- cases where the barrier is immediately adjacent to another barrier, or where the enclosing block contains no memory operations at all. It runs in the standard function pass pipeline and is cheap.

branch-dist performs full CFG propagation with 13 red-black tree maps and restart-on-removal, but it uses NVVM IR opcodes (0x36/0x37/0x3A/0x3B/0x4E) rather than the generic LLVM IR opcodes (61/62/65/66/85) used by the full engine. It also has its own address space filtering logic and 10 configurable knobs.

The full dead synchronization elimination engine (sub_2C84BA0) is the most aggressive of the three. It uses the LLVM IR opcode set, applies the element-size gate for loads/stores, and handles the special intrinsic IDs 8260--8262. It runs separately from the New PM function pass pipeline, invoked from module-level callers sub_2C88020 and sub_2C883F0.

Configuration

No dedicated knobs have been identified for the full engine at sub_2C84BA0. Its behavior is controlled entirely by its callers (sub_2C88020, sub_2C883F0), which determine when and whether the engine runs. This is in contrast to branch-dist, which has 10 knobs, and basic-dbe, which has at least an enable flag.

The diagnostic output is gated by an internal condition in the caller, not by a standalone dump knob.

Diagnostic Strings

" Removed dead synch: "
"Read above: "
", Write above: "
", Read below: "
", Write below: "
" in function "
"dbg"

The complete diagnostic message, assembled from these fragments:

 Removed dead synch: [filename:line] in function <name>
Read above: 0, Write above: 0, Read below: 1, Write below: 1

The numeric values are the boolean (0/1) access flags for each category. When the pass removes a barrier, the diagnostic shows exactly why it was safe: which of the four access categories was absent.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_2C84BA0 | 96KB (3,400 lines) | Main engine: 5-phase algorithm |
| -- | sub_2C83D20 | small | isSyncBarrier predicate |
| -- | sub_2C83AE0 | small | classifyMemoryAccess (read/write classification) |
| -- | sub_2C84640 | medium | Per-BB analysis (bidirectional, direction parameter) |
| -- | sub_2C84590 | small | Red-black tree insert (forward/backward maps) |
| -- | sub_2C84AF0 | small | Red-black tree insert (bridge maps, different node type) |
| -- | sub_2C84080 | small | Map lookup / convergence check helper |
| -- | sub_2C83F20 | small | Map initialization / clear helper |
| -- | sub_2C83D50 | small | Map destructor / cleanup |
| -- | sub_BD3660 | small | hasOneUse -- used for intrinsic IDs 8260--8262 special case |
| -- | sub_CEA1A0 | small | Barrier intrinsic ID confirmation |
| -- | sub_B49E00 | small | isSharedMemoryAccess -- CUDA address space check |
| -- | sub_B43D60 | small | Instruction::eraseFromParent -- barrier deletion |
| -- | sub_B46E30 | small | getNumSuccessors -- CFG successor count |
| -- | sub_B46EC0 | small | getSuccessor(i) -- i-th successor retrieval |
| -- | sub_CB6200 | small | raw_ostream::write -- diagnostic string output |
| -- | sub_B91420 | small | Debug location extraction (filename/line) |
| -- | sub_B91F50 | small | Debug info accessor |
| -- | sub_BD5D20 | small | Type/value accessor |
| -- | sub_22409D0 | small | IR utility (instruction manipulation) |
| -- | sub_CB59D0 | small | raw_ostream integer write |
| -- | sub_CB59F0 | small | raw_ostream integer write (variant) |
| -- | sub_2C88020 | -- | Caller: module-level pass invoking the engine |
| -- | sub_2C883F0 | -- | Caller: module-level pass invoking the engine (variant) |

Common Pitfalls

These are mistakes a reimplementor is likely to make when building an equivalent dead synchronization elimination engine.

1. Removing a barrier that protects a cross-thread shared memory hazard invisible to single-thread analysis. The most dangerous mistake is treating the analysis as a single-thread dataflow problem. The pass classifies memory accesses as read/write per thread, but the barrier's purpose is to order accesses across threads. If thread A writes to smem[tid] above the barrier and thread B reads smem[tid-1] below it, a single-thread view sees no RAW hazard (different addresses). The correct analysis must conservatively assume that any shared memory write above and any shared memory read below constitutes a hazard -- the pass uses boolean flags (not address tracking) precisely because aliasing across threads is unknowable at compile time. A reimplementation that attempts to be "smarter" by tracking addresses will remove barriers that are needed.

2. Not restarting the full analysis after each barrier removal. When a barrier is deleted, the two regions it separated merge into one. This merged region may expose an adjacent barrier as dead (it no longer has memory accesses on one side). A reimplementation that removes all identified dead barriers in a single pass and then stops will miss these cascading redundancies. The restart is mandatory: the pass deliberately uses a goto back to Phase 3 after each removal, re-analyzing the entire function from scratch.

3. Incorrectly classifying call instructions as non-memory-accessing. The access classifier (sub_2C83AE0) must recursively analyze callees to determine if they access shared/global memory. A reimplementation that conservatively marks all calls as read+write will be correct but will retain too many barriers (poor optimization). Conversely, one that ignores calls entirely will remove barriers protecting memory accesses hidden inside called functions. The correct behavior checks the isSharedMemoryAccess predicate on the callee and falls back to read+write if the callee is opaque.

4. Treating __syncthreads_count/and/or (IDs 8260--8262) the same as plain __syncthreads. These barrier variants return a value (lane participation count/and/or). Even when the barrier is dead from a memory-ordering perspective, the return value may be used as data by the program. The pass applies a special hasOneUse check for these IDs. A reimplementation that blindly removes them when the dataflow says "no hazard" will break programs that depend on the return value for algorithmic purposes.

5. Applying the element-size gate too aggressively. The pass filters out loads/stores of types narrower than 512 bits (> 0x1FF), assuming they are register-promoted scalars. A reimplementation that raises this threshold (e.g., to 1024 bits) will miss legitimate memory operations that should keep a barrier alive. Conversely, lowering it to 0 will make the analysis overly conservative, retaining dead barriers for trivial register operations.

Test This

The following kernel contains consecutive __syncthreads() barriers with no shared memory accesses between them. The dead synchronization elimination pass should remove the redundant barriers.

__global__ void dead_sync_test(float* out, int n) {
    __shared__ float smem[256];

    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();    // barrier 1: needed (write above, read below)

    float val = smem[threadIdx.x ^ 1];
    __syncthreads();    // barrier 2: dead -- no smem access between barrier 1 and 2's "below"

    __syncthreads();    // barrier 3: consecutive with barrier 2 -- trivially dead

    out[threadIdx.x] = val;
}

What to look for in PTX:

  • Count the number of bar.sync 0; instructions. The kernel has three __syncthreads() calls in source, but only one should survive: barrier 1 (which orders the write to smem against the read from smem[tid^1]). Barriers 2 and 3 have no shared memory hazard to protect.
  • The diagnostic "Removed dead synch:" (visible with internal dump flags) shows the per-category access flags that justified removal: Read above: 0, Write above: 0 means no memory accesses reach the barrier from above.
  • To verify the pass preserves necessary barriers, move the float val = smem[...] read to between barriers 2 and 3. Now barrier 2 orders the write against this read and must survive -- expect two bar.sync instructions.
  • The cascading restart behavior is observable with 5 consecutive __syncthreads() with no memory between them. The pass removes one, restarts the analysis, removes the next, and repeats until only one remains.

Reimplementation Checklist

  1. Barrier identification predicate. Implement the five-condition conjunction: opcode == 85 (internal call), non-null callee, byte[0] == 0 (intrinsic flag), scope match (callee.field[24] == inst.field[80]), convergent attribute (bit 0x20 at byte+33), and barrier intrinsic ID confirmation.
  2. Memory access classifier. Classify every non-barrier instruction as read/write/both/neither based on opcode (store=0x3D, load=0x3E, atomic=0x41, cmpxchg=0x42, call=0x55), with the element-size gate (>511 bits) for loads/stores and recursive analysis for call instructions including shared-memory-access checks.
  3. Bidirectional fixed-point dataflow. Maintain eight red-black tree maps (forward ReadAbove/WriteAbove/ReadBelow/WriteBelow per BB, backward same) populated by scanning each BB in both directions, propagating from successors (forward) and predecessors (backward), iterating until no boolean flips from 0 to 1.
  4. Bridge map construction. After dataflow convergence, populate four bridge maps keyed by barrier instruction pointer, representing the combined read/write access sets crossing each specific barrier boundary.
  5. Elimination decision logic. A barrier is dead if: (ReadAbove==0 AND WriteAbove==0), OR (ReadBelow==0 AND WriteBelow==0), OR (ReadAbove==0 AND WriteBelow==0), OR (WriteAbove==0 AND ReadBelow==0). Handle the special case for intrinsic IDs 8260--8262 (__syncthreads_count/and/or) where single-use return values allow additional removal.
  6. Complete restart after removal. After each barrier deletion, restart the entire dataflow analysis from scratch to handle cascading redundancies where removing one barrier makes adjacent barriers dead.

Cross-References

Rematerialization

NVIDIA's rematerialization infrastructure in CICC operates at two levels: an IR-level pass (nvvmrematerialize / "Legacy IR Remat") that reduces register pressure before instruction selection, and a machine-level pass (nv-remat-block / "Do Remat Machine Block") that performs the same transformation on MachineIR after register allocation decisions have been made. Both passes share the same fundamental strategy -- recompute cheap values at their use sites rather than keeping them live across long spans -- but they differ significantly in their cost models, candidate selection criteria, and interaction with the surrounding pipeline.

On NVIDIA GPUs, register pressure directly determines occupancy -- the number of concurrent warps per SM -- with discrete cliff boundaries where a single additional register can drop an entire warp group. Rematerialization trades extra ALU work for reduced register count, a tradeoff that is almost always profitable on GPUs where compute throughput vastly exceeds register file bandwidth.
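
To make the cliff concrete, here is a simplified register-file occupancy model. The constants (64K-register file, 48-warp cap, 256-register allocation granularity) are illustrative, roughly Turing-class values -- the real per-SM limits are presumably what the occupancy model at sub_1C01730 encodes.

```cpp
#include <cassert>

// Simplified occupancy model: warps per SM as limited by the register file.
// Constants are illustrative; actual limits vary by SM version.
int warpsPerSM(int regsPerThread) {
    const int regFile = 65536;      // 32-bit registers per SM
    const int maxWarps = 48;        // hardware warp slot cap
    const int granularity = 256;    // register allocation granularity
    const int warpSize = 32;
    int regsPerWarp = ((regsPerThread * warpSize + granularity - 1)
                       / granularity) * granularity;   // round up
    int warps = regFile / regsPerWarp;
    return warps < maxWarps ? warps : maxWarps;
}
```

In this model, 128 registers per thread permits 16 warps while 129 permits only 15 -- a single extra register costs a full warp, which is the cliff rematerialization tries to step back from.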

Key Facts

| Property | Value |
|---|---|
| Pass name (New PM) | remat |
| Pass name (Legacy PM) | nvvmrematerialize / "Legacy IR Remat" |
| Class | RematerializationPass |
| Registration | New PM #385, line 2257 in sub_2342890 |
| Runtime positions | Tier 0 #34 (NVVMRematerialization via sub_1A13320); Tier 1/2/3 #55 (gated by !opts[2320]); see Pipeline |
| Pass factory | sub_1A13320 |
| Machine-level companion | nv-remat-block / "Do Remat Machine Block" at sub_2186D90 |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |

IR-Level Rematerialization (nvvmrematerialize)

Registration and Dependencies

The pass is registered at sub_1CD0BE0 with pass ID "nvvmrematerialize" and entry point sub_1CD0CE0. Before running, it initializes five analysis passes:

| Analysis | Function | Purpose |
|---|---|---|
| Dominator tree | sub_15CD350 | Dominance queries for instruction placement |
| Loop info | sub_1440EE0 | Loop nest structure for cost scaling |
| Unknown | sub_13FBE20 | Possibly alias analysis |
| Live variable analysis | sub_1BFC830 | Builds live-in/live-out bitvector sets |
| Unknown | sub_1BFB430 | Possibly register pressure estimation |

Main Algorithm (sub_1CE7DD0, 67KB)

Complexity. Let B = number of basic blocks, I = total instructions, K = candidates selected, U = uses per candidate, and D = def-chain depth (bounded by max-recurse-depth).

IR-level (sub_1CE7DD0): the live-in analysis uses hardware popcnt on bitvectors of size ceil(I / 64) per block, giving O(B * I / 64) per iteration, and the intersection of live-in sets (bitwise AND) is likewise O(B * I / 64). The rematerializability check walks each candidate's def chain in O(D). The pull-in cost model (sub_1CE3AF0) scores each candidate in O(U * D). Candidate sorting is O(K^2) via selection sort, and the block executor clones instructions in O(K * B). With the outer loop capped at 5 iterations, the overall bound is O(5 * (B * I / 64 + K * U * D + K * B)).

Machine-level (sub_2186D90): max-live computation is one reverse walk per block, O(I) total. Candidate classification is O(I) for the initial scan, plus O(K * 50) for recursive pullability checks (depth bounded at 50). The second-chance heuristic iterates until convergence, bounded by the candidate count K. With the outer loop capped at nv-remat-max-times (default 10) iterations, the overall bound is O(10 * (I + K^2)).

The driver implements an iterative register pressure reduction loop with up to 5 iterations. The high-level flow:

  1. Function exclusion check: The no-remat knob stores a comma-separated list of function names. If the current function matches, the pass prints "Skip rematerialization on <funcname>" and bails.

  2. Master gate: If all three sub-passes are disabled (do-remat, remat-iv, remat-load all zero), return immediately.

  3. Live-in/live-out analysis: For each basic block, the pass looks up the block's live-in bitvector from the analysis (sub_1BFDF20), counts live-in values via hardware popcnt (sub_39FAC40), and stores per-block counts in a hash map. The maximum live-in across all blocks becomes the pressure target baseline. At dump-remat >= 2, the pass prints "Block %s: live-in = %d".

  4. Register target computation: The algorithm computes how many registers it wants to reduce to:

    • If remat-maxreg-ceiling is set and lower than the actual register count, cap at that value.
    • If remat-for-occ is non-zero (default 120): call sub_1BFBA30 for register usage, then sub_1C01730 for an occupancy-based target. Apply heuristic adjustments based on occupancy level.
    • Otherwise: target = 80% of the current register count.
  5. Iterative loop (up to 5 iterations):

    • If max live-in is already at or below the target, skip to the IV/load phases.
    • Compute the intersection of live-in bitvectors across blocks (bitwise AND). Values that are live-in everywhere are the best rematerialization candidates because pulling them in at each use site eliminates a register everywhere.
    • Walk the intersection bitvector. For each candidate, check rematerializability via sub_1CD06C0. Partition into rematerializable and non-rematerializable sets.
    • Call sub_1CE3AF0 (pull-in cost analysis) to rank candidates by cost.
    • Build a per-block rematerialization plan and execute via sub_1CE67D0.
    • Recompute max live-in. If it decreased, continue iterating.
  6. Post-remat phases: After the main loop, run IV demotion (sub_1CD74B0) if remat-iv is enabled, then load rematerialization (sub_1CDE4D0) if remat-load is enabled, then cleanup (sub_1CD2540).

  7. Expression factoring: When remat-add is non-zero, the pass also performs strength reduction on chains of add/mul/GEP instructions, factoring common sub-expressions into "factor" named values. This is a mini-pass embedded within rematerialization.

Block-Level Executor (sub_1CE67D0, 32KB)

This function processes one basic block at a time, creating two kinds of instruction clones distinguished by their name prefixes:

remat_ prefix: The value was live-in to the block and is being recomputed from scratch. The defining instruction is duplicated via sub_15F4880, named with the "remat_" prefix via sub_164B780, and inserted at the use site. This is full rematerialization.

uclone_ prefix: The value already has a definition in the block's dominance chain, but a local copy is needed to shorten the live range. The instruction is cloned and named "uclone_". This is a use-level clone for live range splitting, not pure rematerialization.

After cloning, both variants update use-def chains via sub_1648780 and set debug locations via sub_15F22F0.

Pull-In Cost Model (sub_1CE3AF0, 56KB)

The cost model evaluates each candidate for rematerialization by computing:

pull_in_cost = base_cost * use_factor

Where base_cost is the sum of per-instruction costs along the value's def chain (sub_1CD0460), and use_factor is accumulated from per-use costs (sub_1CD3A10), with different cost tables for uses in different loop nests.

Candidates are filtered by three thresholds:

| Filter | Condition | Default |
|---|---|---|
| Use limit | use_count > remat-use-limit AND use_factor >= remat-loop-trip | 10 uses, 20 trips |
| GEP cost | cost > remat-gep-cost AND opcode is GEP | 6000 |
| Single cost | cost > remat-single-cost-limit (unless remat-ignore-single-cost) | 6000 |

After scoring, candidates are sorted by cost (cheapest first via selection sort), and the cheapest N are selected where N is the target reduction count. At dump-remat >= 4, the pass prints "Total pull-in cost = %d".
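
A sketch of the three filters with the documented defaults; the `Candidate` struct is illustrative, not the binary's layout. A candidate survives only if none of the filter conditions fires.

```cpp
#include <cassert>

// Illustrative candidate record for the pull-in cost filters.
struct Candidate {
    int cost;        // base_cost * use_factor
    int useCount;
    int useFactor;   // accumulated per-use / loop-trip factor
    bool isGEP;
};

// Returns true when the candidate passes all three documented filters.
bool passesFilters(const Candidate& c,
                   int useLimit = 10, int loopTrip = 20,
                   int gepCost = 6000, int singleCostLimit = 6000,
                   bool ignoreSingleCost = false) {
    if (c.useCount > useLimit && c.useFactor >= loopTrip) return false; // use limit
    if (c.isGEP && c.cost > gepCost) return false;                     // GEP cost
    if (!ignoreSingleCost && c.cost > singleCostLimit) return false;   // single cost
    return true;
}
```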

NLO -- Simplify Live Output (sub_1CE10B0 + sub_1CDC1F0)

The NLO sub-pass normalizes live-out values at block boundaries to reduce register pressure. Controlled by simplify-live-out (default 2):

  • Level 1: Basic normalization only.
  • Level 2 (default): Full normalization. Walks each block's live-out set and replaces values with simpler expressions.
  • Level 3+: Extended patterns.

NLO creates two kinds of synthetic instructions:

  • nloNewBit: A bit-level operation (AND, extract, truncation) to reduce a live-out value to its actually-used bit width.
  • nloNewAdd: A local add instruction to recompute an address/offset that was previously live-out, replacing it with a local computation.

IV Demotion (sub_1CD74B0, 75KB)

The induction variable demotion sub-pass reduces register pressure by narrowing wide IVs (typically 64-bit to 32-bit). Controlled by remat-iv (default 4, meaning full demotion):

| Level | Behavior |
|---|---|
| 0 | Disabled |
| 1-2 | Basic IV demotion |
| 3 | Extended IV demotion |
| 4 | Full demotion including complex patterns (default) |
| 5+ | Aggressive mode |

The algorithm identifies PHI nodes at loop headers, checks whether the IV's value range fits in a smaller type (for 64-bit IVs: (val + 0x80000000) <= 0xFFFFFFFF), and creates narrower replacements:

  • demoteIV: A truncation of the original IV to a narrower type.
  • newBaseIV: A new narrow PHI node to replace the wide loop IV.
  • iv_base_clone_: A clone of the IV's base value for use in comparisons that need the original width.
  • substIV: Replaces uses of the old IV with the demoted version.
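
The range check above is the classic unsigned-wraparound idiom for "fits in signed 32 bits"; a standalone version:

```cpp
#include <cassert>
#include <cstdint>

// (val + 0x80000000) <= 0xFFFFFFFF in unsigned 64-bit arithmetic is
// true exactly when val lies in [-2^31, 2^31 - 1], i.e. fits in i32.
bool fitsInI32(int64_t val) {
    return (uint64_t)val + 0x80000000ULL <= 0xFFFFFFFFULL;
}
```

Once the check passes, the IV is truncated (demoteIV), a narrow PHI (newBaseIV) replaces the wide one, and substIV rewrites the remaining uses.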

Multi-Pass Data Flow: Rematerialization / IV Demotion / NLO

The IR-level rematerialization pass (nvvmrematerialize) contains three cooperating sub-passes that execute in a fixed sequence within a single pass invocation. The following diagram shows the data each sub-pass produces and consumes, and the feedback loop that drives iterative pressure reduction.

               Live Variable Analysis (prerequisite)
               +------------------------------------+
               | Builds per-block live-in/live-out   |
               | bitvector sets via sub_1BFDF20     |
               | Produces:                           |
               |  - live-in bitvector per BB         |
               |  - live-out bitvector per BB        |
               |  - max live-in count (pressure)     |
               +------------------+-----------------+
                                  |
                                  v
  +===============================================================+
  |  MAIN REMATERIALIZATION LOOP (sub_1CE7DD0, up to 5 iterations)|
  |                                                               |
  |  Inputs:                                                      |
  |   - live-in bitvectors (from analysis above)                  |
  |   - register target (from occupancy model or 80% heuristic)  |
  |   - remat cost thresholds (knobs)                             |
  |                                                               |
  |  +----------------------------------------------------------+ |
  |  | Step 1: Compute intersection of live-in sets              | |
  |  | (bitwise AND across all blocks)                           | |
  |  | --> Values live everywhere = best candidates              | |
  |  +---------------------------+------------------------------+ |
  |                              |                                |
  |                              | candidate value set            |
  |                              v                                |
  |  +---------------------------+------------------------------+ |
  |  | Step 2: Pull-In Cost Analysis (sub_1CE3AF0)              | |
  |  | For each candidate:                                       | |
  |  |   cost = base_cost(def chain) * use_factor(loop nesting) | |
  |  | Filter by: remat-use-limit, remat-gep-cost,              | |
  |  |            remat-single-cost-limit                        | |
  |  | Sort by cost (cheapest first)                             | |
  |  | Produces: ranked list of N cheapest candidates            | |
  |  +---------------------------+------------------------------+ |
  |                              |                                |
  |                              | remat plan per block           |
  |                              v                                |
  |  +---------------------------+------------------------------+ |
  |  | Step 3: Block Executor (sub_1CE67D0)                     | |
  |  | For each selected candidate in each block:                | |
  |  |   "remat_" clone: full rematerialization at use site     | |
  |  |   "uclone_" clone: live range split within dom chain     | |
  |  | Produces:                                                 | |
  |  |   - cloned instructions at use sites                      | |
  |  |   - reduced live-in counts per block                      | |
  |  +---------------------------+------------------------------+ |
  |                              |                                |
  |                              | updated IR                     |
  |                              v                                |
  |  Recompute max live-in. If decreased and < 5 iters, loop.   |
  +=======================+=====================================+
                          |
                          | IR with reduced register pressure
                          v
  +=======================+=====================================+
  | IV DEMOTION (sub_1CD74B0, controlled by remat-iv)           |
  |                                                              |
  | Consumes:                                                    |
  |  - Loop header PHI nodes (from LoopInfo)                     |
  |  - Type widths (from DataLayout)                             |
  |  - post-remat IR (live ranges already shortened)             |
  |                                                              |
  | Algorithm:                                                   |
  |  for each loop L:                                            |
  |    for each 64-bit PHI in L.header:                          |
  |      if (val + 0x80000000) <= 0xFFFFFFFF:                    |
  |        create "demoteIV" (trunc to i32)                      |
  |        create "newBaseIV" (narrow PHI replacement)            |
  |        rewrite uses with "substIV"                            |
  |                                                              |
  | Produces:                                                    |
  |  - narrowed IVs (64->32 bit, halving register cost)          |
  |  - "iv_base_clone_" values for comparisons needing           |
  |    original width                                            |
  |  - updated loop exit conditions                              |
  +=======================+=====================================+
                          |
                          | IR with narrowed IVs
                          v
  +=======================+=====================================+
  | NLO -- SIMPLIFY LIVE OUTPUT (sub_1CE10B0, simplify-live-out)|
  |                                                              |
  | Consumes:                                                    |
  |  - per-block live-out bitvector sets                          |
  |  - post-IV-demotion IR                                       |
  |                                                              |
  | For each block's live-out set:                               |
  |  - If a value is live-out but only its low bits are used     |
  |    downstream: create "nloNewBit" (AND/extract/trunc)        |
  |  - If a value is an address live-out that can be recomputed  |
  |    locally in successors: create "nloNewAdd" (local add)     |
  |                                                              |
  | Produces:                                                    |
  |  - "nloNewBit" bit-narrowing instructions                    |
  |  - "nloNewAdd" local recomputation instructions              |
  |  - reduced live-out register count at block boundaries       |
  +=======================+=====================================+
                          |
                          | Final IR: pressure-reduced,
                          | IVs narrowed, live-outs simplified
                          v
  +-------------------------------------------------------+
  | Downstream consumers:                                  |
  |  - Instruction selection (register model now concrete) |
  |  - Machine-level remat (nv-remat-block, second pass)  |
  |  - Register allocation (lower pressure = higher occ.) |
  +-------------------------------------------------------+

Data flow summary:

| Producer | Data | Consumer |
|---|---|---|
| Live Variable Analysis | Per-block live-in/live-out bitvectors | Main remat loop |
| Occupancy model (sub_1C01730) | Register pressure target | Main remat loop |
| Main remat loop | remat_/uclone_ cloned instructions | Updated IR for IV demotion |
| IV Demotion | demoteIV, newBaseIV, substIV narrowed values | NLO and downstream |
| NLO | nloNewBit, nloNewAdd local recomputations | Final IR for instruction selection |
| All three sub-passes | Cumulative register pressure reduction | Machine-level remat (nv-remat-block) |

The sequencing is important: the main loop reduces cross-block live-in pressure first (the broadest and cheapest wins), IV demotion then halves the cost of loop induction variables (converting two registers to one), and NLO cleans up block-boundary live-out values that survived both earlier phases. The machine-level nv-remat-block pass runs much later in the pipeline (after instruction selection and register allocation) as a final safety net, operating on concrete register assignments rather than abstract SSA values.

Machine-Level Block Rematerialization (nv-remat-block)

Registration

Registered at ctor_361_0 (address 0x5108E0) with pass name "nv-remat-block" and description "Do Remat Machine Block". Main entry point: sub_2186D90 (47KB, ~1742 lines).

Algorithm Overview

The machine-level pass implements a sophisticated iterative pull-in algorithm operating on MachineIR after instruction selection:

  1. Measure: Compute max-live register pressure across all blocks via sub_2186590. Prints "Max-Live-Function(<num_blocks>) = <max_live>".

  2. Identify: For each block where pressure exceeds the target, enumerate live-out registers.

  3. Classify: For each live-out register, determine pullability:

    • MULTIDEF check (sub_217E810): The register must have exactly one non-dead, non-debug definition. Registers with multiple definitions print "MULTIDEF" and are rejected.
    • Opcode exclusion: A large switch/comparison tree excludes memory ops, atomics, barriers, texture ops, surface ops, and other side-effecting instructions. Specific exclusions exist for sm_62 (opcodes 380-396).
    • Operand safety: Instructions that define additional tied registers beyond the target are rejected.
    • Recursive verification (sub_2181550): All operands of the defining instruction must themselves be pullable, checked recursively up to depth 50.
  4. Second-chance heuristic (sub_2181870): Registers initially rejected because one of their operands was non-pullable are re-evaluated when those operands become pullable. This iterates until convergence, using a visit-count mechanism to prevent infinite loops. The hash function throughout is h(regID) = 37 * regID. Debug: "After pre-check, <N> good candidates, <N> given second-chance", "ADD <N> candidates from second-chance".

  5. Cost analysis (sub_2183E30): Each candidate receives a clone cost. Candidates with cost 0 are non-rematerializable.

  6. Selection: Sort candidates by cost (ascending). Greedily select the cheapest candidates until pressure is reduced to target. Double-wide register classes (size > 32) count as 2 for pressure purposes and have their cost doubled. Debug: "Really Final Pull-in: <count> (<total_cost>)".

  7. Execute: For each selected register:

    • Clear from live-out bitmap via sub_217F620.
    • Propagate backward through predecessors via sub_2185250.
    • Clone the defining instruction at use sites via sub_217E1F0.
    • Replace register references via sub_21810D0.
    • Remove now-dead original definitions.
  8. Iterate: Repeat up to nv-remat-max-times (default 10) iterations until max pressure is at or below target, or no further progress is made.
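
The recursive pullability check (step 3's last bullet) can be sketched over a toy def graph. MULTIDEF, opcode exclusions, and tied-operand checks are collapsed into a single illustrative `selfOk` flag, and the second-chance re-evaluation is not modeled; only the depth-50 bound and the 37*regID hash come from the binary.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Toy def record: reg -> its single definition's safety and operand regs.
struct DefInfo {
    bool selfOk;                    // single def, safe opcode, safe operands
    std::vector<unsigned> operands; // registers the defining instruction reads
};

// A register is pullable if its def is safe and every operand register
// is itself pullable, recursively, up to depth 50 (per sub_2181550).
bool isPullable(unsigned reg,
                const std::unordered_map<unsigned, DefInfo>& defs,
                int depth = 0) {
    if (depth > 50) return false;               // recursion bound
    auto it = defs.find(reg);
    if (it == defs.end() || !it->second.selfOk) return false;
    for (unsigned op : it->second.operands)
        if (!isPullable(op, defs, depth + 1)) return false;
    return true;
}

// The candidate maps in the binary hash register IDs as h = 37 * regID.
unsigned rematHash(unsigned regID) { return 37u * regID; }
```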

Instruction Replacement (sub_21810D0)

When replacing a rematerialized register:

  1. Create a new virtual register of the same class via sub_1E6B9A0.
  2. Call the target's replaceRegWith method (vtable offset 152).
  3. Walk all uses of the original register ID and rewrite operands via sub_1E310D0.
  4. Handle special cases: DBG_VALUE (opcode 45) and NOP/PHI (opcode 0) instructions use stride-2 operand scanning.

Register Pressure Computation (sub_2186590)

Per-block pressure is computed by starting with the live-out set size, walking instructions in reverse, tracking register births (defs) and deaths (last uses), and recording the peak pressure point. The maximum across all blocks is returned.

Key Functions

IR-Level

| Function | Address | Size | Role |
|---|---|---|---|
| Pass registration | sub_1CD0BE0 | -- | Registers "nvvmrematerialize" |
| Main driver | sub_1CE7DD0 | 67KB | Iterative live-in reduction loop |
| Block executor | sub_1CE67D0 | 32KB | "remat_" / "uclone_" creation |
| Pull-in cost | sub_1CE3AF0 | 56KB | Cost model and candidate selection |
| NLO main | sub_1CE10B0 | 48KB | Live-out normalization |
| NLO helper | sub_1CDC1F0 | 35KB | Inter-block NLO propagation |
| IV demotion | sub_1CD74B0 | 75KB | Induction variable narrowing |
| Load remat | sub_1CDE4D0 | -- | Load rematerialization sub-pass |
| Per-function init | sub_1CDA600 | -- | Data structure initialization |
| Rematerializability check | sub_1CD06C0 | -- | Determines if a value can be recomputed |

Machine-Level

| Function | Address | Size | Role |
|---|---|---|---|
| Main engine | sub_2186D90 | 47KB | Iterative pull-in algorithm |
| Max-live computation | sub_2186590 | -- | Per-block pressure analysis |
| MULTIDEF check | sub_217E810 | ~230 lines | Single-definition verification |
| Recursive pullability | sub_2181550 | ~110 lines | Operand chain verification (depth 50) |
| Second-chance | sub_2181870 | ~800 lines | Re-evaluation of rejected candidates |
| Cost evaluator | sub_2183E30 | -- | Clone cost computation |
| Liveness propagation | sub_2185250 | ~650 lines | Backward propagation + cloning |
| Instruction replacement | sub_21810D0 | ~290 lines | Register use rewriting |
| Remat allocation helper | sub_2184890 | ~477 lines | Pressure simulation |

Configuration Knobs

IR-Level Knobs (ctor_277_0 at 0x4F7BE0)

| Knob | Global | Default | Description |
|---|---|---|---|
| do-remat | dword_4FC05C0 | 3 | Master control. 0=off, 1=conservative, 2=normal, 3=full. |
| no-remat | qword_4FC0440 | (empty) | Comma-separated function exclusion list |
| remat-iv | dword_4FBFB40 | 4 | IV demotion level. 0=off, 4=full. |
| remat-load | dword_4FBFA60 | 1 | Load rematerialization. 0=off, 1=on. |
| remat-add | dword_4FBF980 | 0 | Add/GEP factoring. 0=off. |
| remat-single-cost-limit | dword_4FC0080 | 6000 | Max cost per single live-in reduction |
| remat-loop-trip | dword_4FBFFA0 | 20 | Default assumed loop trip count |
| remat-gep-cost | dword_4FBFEC0 | 6000 | Max cost for GEP rematerialization |
| remat-use-limit | dword_4FBFDE0 | 10 | Max number of uses for a candidate |
| remat-max-live-limit | dword_4FBFD00 | 10 | Max live-in limit for rematerialization |
| remat-maxreg-ceiling | dword_4FBF600 | 0 | Register ceiling (0 = uncapped) |
| remat-for-occ | dword_4FBF8A0 | 120 | Occupancy-driven rematerialization target |
| remat-lli-factor | dword_4FC0320 | 10 | Long-latency instruction cost factor |
| remat-ignore-single-cost | byte_4FBFC20 | false | Bypass per-value cost filter |
| remat-move | byte_4FC0400 | false | Remat move instructions |
| simplify-live-out | dword_4FBF520 | 2 | NLO level. 0=off, 2=full. |
| dump-remat | dword_4FC0240 | 0 | Debug dump level (0-4+) |
| dump-remat-iv | dword_4FC0160 | 0 | IV remat debug dump |
| dump-remat-load | dword_4FBF720 | 0 | Load remat debug dump |
| dump-remat-add | dword_4FBF640 | 0 | Add remat debug dump |
| dump-simplify-live-out | byte_4FBF400 | false | NLO debug dump |

Machine-Level Knobs (ctor_361_0 at 0x5108E0)

| Knob | Global | Default | Description |
|---|---|---|---|
| nv-remat-block | dword_4FD3820 | 14 | Bitmask controlling remat modes (bits 0-3) |
| nv-remat-max-times | dword_4FD3740 | 10 | Max outer loop iterations |
| nv-remat-block-single-cost | dword_4FD3660 | 10 | Max cost per single live value pull-in |
| nv-remat-block-map-size-limit | dword_4FD3580 | 6 | Map size limit for single pull-in |
| nv-remat-block-max-cost | dword_4FD3040 | 100 | Max total clone cost per live value reduction |
| nv-remat-block-liveout-min-percentage | dword_4FD3120 | 70 | Min liveout % for special consideration |
| nv-remat-block-loop-cost-factor | unk_4FD3400 | 20 | Loop cost multiplier |
| nv-remat-default-max-reg | unk_4FD3320 | 70 | Default max register pressure target |
| nv-remat-block-load-cost | unk_4FD2EC0 | 10 | Cost assigned to load instructions |
| nv-remat-threshold-for-spec-reg | unk_4FD3860 | 20 | Threshold for special register remat |
| nv-dump-remat-block | byte_4FD2E80 | false | Debug dump toggle |
| nv-remat-check-internal-live | byte_4FD2DA0 | false | Check internal liveness during MaxLive |
| max-reg-kind | qword_4FD2C20 | 0 | Kind of max register pressure info |
| no-mi-remat | qword_4FD2BE0 | (empty) | Skip remat for named functions |
| load-remat | word_4FD32F0 | true | Enable load rematerialization |
| vasp-fix1 | word_4FD3210 | false | VASP fix for volatile/addsp |

Complementary ptxas-side Knobs

The assembler (ptxas) has its own rematerialization controls that complement the CICC passes:

  • RegAllocRematEnable=1
  • RegAllocEnableOptimizedRemat=1
  • RematEnable=1
  • SinkRematEnable=1
  • RematBackOffRegTargetFactor=N

Optimization Level Behavior

| Level | IR-Level Remat (nvvmrematerialize) | Machine-Level Remat (nv-remat-block) |
|---|---|---|
| O0 | Not run | Not run |
| Ofcmax | Not run | Not run |
| Ofcmid | Runs with do-remat=3 (full) | Not run |
| O1 | Runs with do-remat=3, remat-iv=4, remat-load=1 | Runs with nv-remat-block=14 (default bitmask) |
| O2 | Same as O1 | Same as O1 |
| O3 | Same as O1; may see more candidates due to additional inlining/unrolling | Same as O1; operates on more aggressively optimized MIR |

The do-remat master control (default 3) enables all rematerialization sub-phases at O1+. The machine-level pass is gated by its own NVVMPassOptions slot and runs only when the codegen pipeline includes the full register allocation sequence. At Ofcmax, neither pass runs because the fast-compile pipeline skips the full optimization and codegen stack. See Optimization Levels for the complete pipeline tier structure.

Diagnostic Strings

"Skip rematerialization on <funcname>"
"Block %s: live-in = %d"
"Total pull-in cost = %d"
"remat_"
"uclone_"
"nloNewBit"
"nloNewAdd"
"demoteIV"
"newBaseIV"
"iv_base_clone_"
"substIV"
"factor"
"Max-Live-Function(<num_blocks>) = <max_live>"
"Really Final Pull-in: <count> (<total_cost>)"
"MULTIDEF"
"Skip machine-instruction rematerialization on <name>"
"After pre-check, <N> good candidates, <N> given second-chance"
"ADD <N> candidates from second-chance"
"Pullable: <count>"
"live-out = <count>"
"Total Pullable before considering cost: <count>"

Reimplementation Checklist

  1. Live-in/live-out bitvector analysis. Build per-basic-block bitvector sets tracking which values are live-in and live-out, compute max live-in via hardware popcnt, and maintain a hash map of per-block counts.
  2. Occupancy-driven register target. Query the occupancy model to compute a target register count (default: remat-for-occ=120), apply heuristic adjustments based on occupancy cliff boundaries, and cap at remat-maxreg-ceiling when set.
  3. Candidate selection and cost model. Compute the live-in intersection across all blocks (bitwise AND), check rematerializability of each candidate via def-chain walking (bounded by max-recurse-depth), score candidates as base_cost * use_factor with loop-nesting scaling, filter by remat-use-limit/remat-gep-cost/remat-single-cost-limit, and sort cheapest-first.
  4. Block-level instruction cloning. Implement two clone types: remat_ prefix clones (full rematerialization of live-in values at use sites) and uclone_ prefix clones (use-level copies for live range splitting within the dominance chain), with proper use-def chain and debug location updates.
  5. IV demotion sub-pass. Identify 64-bit loop-header PHI nodes whose value range fits in 32 bits ((val + 0x80000000) <= 0xFFFFFFFF), create narrowed PHI replacements (demoteIV/newBaseIV/substIV), and rewrite loop exit conditions.
  6. NLO live-out simplification. Walk each block's live-out set, create nloNewBit instructions (AND/extract/trunc to actual used bit-width) and nloNewAdd instructions (local address recomputations) to reduce live-out register count at block boundaries.
  7. Machine-level pull-in algorithm (nv-remat-block). Implement the iterative MachineIR rematerialization engine: max-live computation via reverse instruction walk, MULTIDEF verification, recursive pullability checking (depth 50), second-chance heuristic for re-evaluating rejected candidates, cost-sorted greedy selection, and liveness propagation with instruction cloning at use sites.
  8. Iterative convergence loop. Wrap the IR-level pass in an up-to-5-iteration loop (recompute max live-in after each round, stop when target is met) and the machine-level pass in an up-to-nv-remat-max-times loop.

Architecture-Specific Behavior

The machine-level MULTIDEF checker (sub_217E810) contains architecture-specific opcode exclusions: opcodes 380-396 are rejected only when the target SM is sm_62 (GP10B, the Pascal-family Tegra part), suggesting these instructions have rematerialization hazards specific to that microarchitecture. All other opcode exclusions apply uniformly across SM targets.

Test This

The following kernel creates high register pressure by keeping many independent values alive simultaneously. Compile with nvcc -ptx -arch=sm_90 -maxrregcount=32 to force a low register cap and observe rematerialization in action.

__global__ void remat_test(const float* __restrict__ in, float* __restrict__ out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float a = in[tid];
    float b = in[tid + n];
    float c = in[tid + 2*n];
    float d = in[tid + 3*n];
    float e = in[tid + 4*n];
    float f = in[tid + 5*n];
    float g = in[tid + 6*n];
    float h = in[tid + 7*n];

    float r0 = a * b + c;
    float r1 = d * e + f;
    float r2 = g * h + a;
    float r3 = b * c + d;
    float r4 = e * f + g;

    out[tid]       = r0 + r1;
    out[tid + n]   = r2 + r3;
    out[tid + 2*n] = r4 + r0;
}

What to look for in PTX:

  • Address recomputation: the expressions tid + k*n are cheap to recompute. With -maxrregcount=32, the pass should rematerialize these address calculations at use sites rather than keeping them in registers. Look for repeated mad.lo.s32 or add.s32 instructions computing the same offset near each ld.global instead of a single computation early on.
  • Compare the register usage reported by ptxas -v (or the .maxnreg directive emitted when a register cap is set) between -maxrregcount=32 and the default. The rematerialization pass trades extra ALU instructions for fewer registers to hit the lower target.
  • With -Xcicc -dump-remat=4, cicc prints "Total pull-in cost = %d" for each candidate, showing the cost/benefit analysis.
  • The remat_ prefix on SSA names in LLVM IR dumps identifies rematerialized values.

Pipeline Interaction

The IR-level pass runs after live variable analysis has been computed and before instruction selection. Its register pressure reduction directly influences the occupancy achievable by the final kernel. The machine-level pass runs later, after instruction selection and register allocation, providing a second opportunity to reduce pressure on MachineIR where the register model is concrete rather than abstract. Together, the two passes form a layered rematerialization strategy: the IR pass makes broad, cost-effective reductions early, and the machine pass performs precise, targeted reductions late. Both passes interact with the register pressure analysis (rpa / machine-rpa) that feeds pressure estimates into scheduling and allocation decisions throughout the pipeline.

IV Demotion

IV demotion is NVIDIA's proprietary induction variable narrowing sub-pass, embedded within the IR-level rematerialization pass (nvvmrematerialize). It reduces register pressure by converting wide induction variables -- typically 64-bit -- to narrower types, typically 32-bit. On NVIDIA GPUs this is a high-impact optimization: the NVPTX ISA provides native 32-bit integer arithmetic in a single instruction, while 64-bit operations require multi-instruction sequences (add.cc + addc for a single 64-bit add, for example). A 64-bit loop induction variable that provably fits in 32 bits wastes two registers where one would suffice, and every arithmetic operation on it costs roughly twice the instruction count.

The sub-pass is large -- 75KB of compiled code, larger than the main rematerialization driver itself -- reflecting the complexity of proving that narrowing is safe across all uses of an IV, rewriting PHI nodes, adjusting comparisons, and handling edge cases where some uses require the original width while others can consume the narrowed version.

Key Facts

| Property | Value |
|---|---|
| Entry point | sub_1CD74B0 (75KB, ~2500 lines) |
| Parent pass | nvvmrematerialize (IR-level rematerialization) |
| Invocation site | sub_1CE7DD0 line ~2276 (post-remat phase) |
| Primary knob | remat-iv (default 4 = full demotion) |
| Debug knob | dump-remat-iv (default 0) |
| Gate condition | dword_4FBFB40 != 0 (non-zero enables the sub-pass) |
| Helper: IV analysis | sub_1CD5F30 |
| Helper: IV base lookup | sub_1CD5400 |
| Helper: cleanup | sub_1CD0600 |
| IR builder | sub_15FB440 (opcode, type, operand, name, insertpt) |
| Width query | sub_127FA20 (DataLayout::getTypeStoreSize) |

Demotion Levels

The remat-iv knob controls five demotion aggressiveness levels:

| Level | Behavior | Gate in binary |
|---|---|---|
| 0 | Disabled -- IV demotion entirely skipped | dword_4FBFB40 == 0 |
| 1-2 | Basic IV demotion. Only simple induction variables with constant step and all uses in the same loop body. | Default path |
| 3 | Extended IV demotion. Enables demotion of IVs whose uses extend to loop-exit comparisons and address computations outside the innermost loop. | line 1380: if (dword > 3) |
| 4 | Full demotion (default). Includes complex patterns: IVs used in GEP chains, IVs with multiple PHI consumers, and IVs that feed into both narrow and wide downstream computations. | line 1546: if (dword <= 4) |
| 5+ | Aggressive mode. Relaxes safety margins on range proofs, allowing demotion when the range check is tight (no headroom). | -- |

Level 4 is the default because it captures the vast majority of profitable demotion opportunities in real CUDA kernels without the correctness risk of aggressive mode.

Algorithm

Phase 1: Loop Iteration and PHI Identification

The algorithm iterates over every loop in the function (obtained from LoopInfo, sub_1440EE0). For each loop, it examines the loop header block's PHI nodes. Each PHI node is a candidate induction variable. The pass checks the PHI's type width via sub_127FA20 (DataLayout::getTypeStoreSize).

for each loop L in function:
    header = L.getHeader()
    for each PHI in header:
        width = getTypeStoreSize(PHI.getType())    // sub_127FA20
        if width != 64:
            continue   // only demote 64-bit IVs to 32-bit

Phase 2: Increment Pattern Analysis

For each 64-bit PHI, the pass identifies the increment pattern -- the value feeding back from the latch block. It verifies the pattern is a simple add/sub by a constant. The helper sub_1CD5F30 (IV analysis helper) walks the def-use chain of the PHI's backedge operand to extract the step value and verify linearity.

backedge_val = PHI.getIncomingValueForBlock(latch)
if backedge_val is not (PHI + constant) and
   backedge_val is not (PHI - constant):
    skip this PHI    // non-linear IV, cannot demote
step = extract_constant(backedge_val)

Phase 3: Value Range Fitting

The critical safety check. The pass must prove that the IV's value never exceeds the 32-bit signed range throughout the loop's execution. The check uses an unsigned comparison trick:

(val + 0x80000000) <= 0xFFFFFFFF

This is equivalent to checking -2^31 <= val <= 2^31 - 1 (the signed i32 range). Adding 0x80000000 shifts the signed range to [0, 0xFFFFFFFF], which can be checked with a single unsigned comparison. The pass evaluates this condition on:

  1. The initial value (from the preheader incoming edge of the PHI).
  2. The final value (derived from the loop trip count and step).
  3. Conservatively, any intermediate values if the step is not +1/-1.
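The shifted-unsigned comparison is easy to verify directly. A hand-written C demonstration (ours, not extracted from the binary): adding 0x80000000 maps the signed i32 range [-2^31, 2^31-1] onto [0, 0xFFFFFFFF], so a single unsigned compare replaces a pair of signed bound checks.

```c
#include <stdint.h>

/* Returns non-zero iff val lies in the signed 32-bit range. */
int fits_in_i32(int64_t val) {
    /* The addition cannot overflow int64_t for any i64 input:
       INT64_MAX + 0x80000000 is representable only after the IV
       range proof, so we widen through uint64_t for the compare. */
    return (uint64_t)(val + 0x80000000LL) <= 0xFFFFFFFFULL;
}
```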

The initial value and trip count information come from the loop analysis infrastructure. The pass does not directly invoke SCEV (ScalarEvolution) -- it operates on NVIDIA's own IR-level live variable analysis and loop info passes (sub_1440EE0 for loop structure, sub_1BFC830 for live variable analysis). However, upstream LLVM's IndVarSimplify (sub_1945A50) may have already widened or simplified IVs using SCEV before this pass runs. The IV demotion pass operates on whatever IV structure remains after the main optimization pipeline.

If the range check fails, the IV is skipped. There is no speculative demotion with runtime guards.

Phase 4: Use Analysis and Classification

Before rewriting, the pass classifies every use of the original 64-bit IV:

  • Narrow-safe uses: Arithmetic (add, sub, mul, shift), array indexing within the loop body. These can consume the 32-bit value directly.
  • Comparison uses: Loop exit conditions (icmp). These need a narrow comparison instruction (newICmp).
  • Address uses: GEP instructions that use the IV as an index. At level 4+, these are handled by cloning the base address computation (iv_base_clone_).
  • Escape uses: Uses outside the loop (LCSSA PHIs, return values). These require sign/zero extension back to 64-bit.

The level knob gates which use categories are eligible:

| Use category | Minimum level |
|---|---|
| Same-block arithmetic | 1 |
| Loop exit comparisons | 2 |
| Cross-block GEP indexing | 3 |
| Multi-PHI consumers | 4 |
| Tight-range speculation | 5 |

Phase 5: Instruction Generation

Once an IV is approved for demotion, the pass generates four types of synthetic instructions:

demoteIV -- Truncation

v475 = "demoteIV";
v366 = sub_15FB440(11, destg, v401, &v475, v115);
// opcode 11 = trunc

Creates a trunc i64 %iv to i32 instruction, inserted at the point where the original IV was defined. This is the primary demotion: the new 32-bit value replaces the old 64-bit value for all narrow-safe uses.

IR before:

%iv = phi i64 [ %init, %preheader ], [ %iv.next, %latch ]
%iv.next = add i64 %iv, 1

IR after (demoteIV inserted):

%iv = phi i64 [ %init, %preheader ], [ %iv.next, %latch ]
%demoteIV = trunc i64 %iv to i32

newBaseIV -- Narrow PHI Replacement

v475 = "newBaseIV";
desth = sub_15FB440(11, v289, v427, &v475, destd);

When the entire loop can use a 32-bit IV, the pass creates a completely new PHI node with i32 type in the loop header. The old 64-bit PHI is not simply truncated -- a new narrow induction cycle is constructed:

  • A narrow initial value: %newInit = trunc i64 %init to i32
  • A narrow PHI: %newBaseIV = phi i32 [ %newInit, %preheader ], [ %newInc, %latch ]
  • A narrow increment: %newInc = add i32 %newBaseIV, <step32>

The old 64-bit IV becomes dead if all uses are successfully rewritten.

IR after (full base IV replacement):

%newInit = trunc i64 %init to i32
%newBaseIV = phi i32 [ %newInit, %preheader ], [ %newInc, %latch ]
%newInc = add i32 %newBaseIV, 1

iv_base_clone_ -- Comparison Clone

v475 = "iv_base_clone_";
v214 = sub_15F4880(v210);         // clone instruction
sub_164B780(v214, &v475);         // set name
sub_15F2120(v395, v198);          // insert into block

When some uses of the IV require the original 64-bit width -- typically the loop exit comparison or an address computation that cannot be narrowed -- the pass clones the IV's base value. The clone instruction preserves the original semantics while allowing the primary loop computation to proceed with the narrow type. The clone is placed at the specific use site rather than at the loop header, avoiding the register pressure cost of keeping the wide value live across the entire loop body.

substIV -- Use Replacement

After generating the narrow IV infrastructure, the pass walks all uses of the original wide IV and replaces them with the demoted version. This is the final rewriting step:

  • Arithmetic uses: replaced with uses of %newBaseIV or %demoteIV.
  • Comparison uses: replaced with narrow comparisons (newICmp) on the demoted value.
  • PHI uses at LCSSA boundaries: a sext/zext is inserted to restore 64-bit width for consumers outside the loop.

The pass also creates newICmp instructions -- narrower comparison instructions that compare i32 values instead of i64 values, rewriting the loop exit condition to match the demoted IV.

After all use replacement, sub_1CD0600 performs dead code cleanup: if the original 64-bit IV has no remaining uses, the wide PHI and its increment chain are deleted.
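Putting the pieces together, a hand-written before/after sketch (not a recovered dump) of a fully demoted counted loop, using the name prefixes the pass emits; %n32 stands in for a truncated copy of the original i64 trip-count bound:

```llvm
; before: canonical 64-bit counted loop
loop:
  %iv = phi i64 [ 0, %preheader ], [ %iv.next, %loop ]
  %iv.next = add i64 %iv, 1
  %cmp = icmp slt i64 %iv.next, %n
  br i1 %cmp, label %loop, label %exit

; after: newBaseIV cycle, newICmp exit test, sext for the escape use
loop:
  %newBaseIV = phi i32 [ 0, %preheader ], [ %newInc, %loop ]
  %newInc = add i32 %newBaseIV, 1
  %newICmp = icmp slt i32 %newInc, %n32
  br i1 %newICmp, label %loop, label %exit
exit:
  %iv.wide = sext i32 %newInc to i64    ; width restored outside loop
```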

GPU Motivation: 32-bit vs. 64-bit Performance

The performance gap between 32-bit and 64-bit integer operations on NVIDIA GPUs is substantial and architectural, not merely a throughput difference:

Instruction count. 64-bit integer addition on PTX compiles to two machine instructions (add.cc.u32 + addc.u32) because the hardware ALU is 32-bit wide. A 64-bit multiply is even worse: it decomposes into multiple 32-bit multiplies and adds. Every loop iteration with a 64-bit IV pays this tax on the increment alone.
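The two-instruction sequence can be modeled in plain C; the explicit carry propagation below mirrors what add.cc.u32 (low halves, sets carry) and addc.u32 (high halves, consumes carry) do in hardware. The helper name is ours:

```c
#include <stdint.h>

/* 64-bit add built from two 32-bit adds, as on a 32-bit-wide ALU. */
uint64_t add64_via_32(uint64_t a, uint64_t b) {
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);
    uint32_t lo    = alo + blo;          /* add.cc.u32: sets carry   */
    uint32_t carry = lo < alo;           /* carry-out of the low add */
    uint32_t hi    = ahi + bhi + carry;  /* addc.u32: adds carry in  */
    return ((uint64_t)hi << 32) | lo;
}
```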

Register pressure. A single i64 value occupies a pair of 32-bit registers. In a loop with 3 IVs, demoting all three frees 3 registers -- enough to cross an occupancy cliff and gain an entire warp group on many kernels.

Address arithmetic. CUDA uses 64-bit pointers (nvptx64 target), so loop index computations are promoted to i64 by default during LLVM IR generation. But most CUDA kernels operate on arrays smaller than 4 GB, making the upper 32 bits of the index perpetually zero. The IV demotion pass recovers this wasted precision.

Pipeline utilization. GPU SM pipelines have limited integer execution units. Halving the instruction count for IV arithmetic directly translates to higher utilization of other functional units (FP, memory) in the same warp cycle.

Configuration

Knobs (registered at ctor_277_0, address 0x4F7BE0)

| Knob | Global | Default | Description |
|---|---|---|---|
| remat-iv | dword_4FBFB40 | 4 | IV demotion level. 0=off, 1-2=basic, 3=extended, 4=full, 5+=aggressive. |
| dump-remat-iv | dword_4FC0160 | 0 | Debug dump verbosity for IV demotion. Non-zero enables diagnostic output. |

The remat-iv knob is read by the main rematerialization driver (sub_1CE7DD0) at the post-remat phase gate. When non-zero, sub_1CD74B0 is invoked. The level value is then read inside sub_1CD74B0 to control which demotion patterns are attempted.

Interaction with ptxas

The ptxas assembler has its own rematerialization controls (--knob RegAllocRematEnable, RematEnable, etc.) but does not have an IV demotion equivalent. IV demotion is purely an IR-level transformation -- by the time ptxas sees the code, the IVs are already narrow. The ptxas knob --advanced-remat (0/1/2) controls machine-level rematerialization but does not perform type narrowing.

Diagnostic Strings

All strings emitted by sub_1CD74B0:

"phiNode"           -- PHI node identification during loop header scan
"demoteIV"          -- Truncation instruction creation
"newInit"           -- Narrow initial value for new base IV
"newInc"            -- Narrow increment for new base IV
"argBaseIV"         -- Base IV argument lookup
"newBaseIV"         -- New narrow PHI node creation
"newICmp"           -- Narrow comparison instruction creation
"iv_base_clone_"    -- Clone of IV base for original-width uses
"substIV"           -- Use replacement pass

These strings are set as instruction name prefixes via sub_164B780 (for cloned instructions) or passed directly to the IR builder sub_15FB440. They appear in IR dumps when dump-remat-iv is non-zero or when the module is printed after the rematerialization pass.

Differences from Upstream LLVM

Upstream LLVM's IndVarSimplify pass (indvars) performs IV widening and narrowing through SCEV-based analysis. NVIDIA's IV demotion sub-pass is a completely separate implementation with several key differences:

| Aspect | Upstream IndVarSimplify | NVIDIA IV Demotion |
|---|---|---|
| Analysis framework | SCEV (ScalarEvolution) | NVIDIA live variable analysis + LoopInfo |
| Direction | Primarily widens narrow IVs to canonical form | Narrows wide IVs to reduce register pressure |
| Motivation | Canonical form for other optimizations | Register pressure reduction for GPU occupancy |
| Placement | Early in optimization pipeline | Late, inside rematerialization (post-optimization) |
| Range proof | SCEV range analysis | Direct (val + 0x80000000) <= 0xFFFFFFFF check |
| IV creation | SCEV expander | Direct IR builder calls (sub_15FB440) |
| Configuration | indvars-widen-indvars (bool) | remat-iv (6-level integer knob) |

The two passes are complementary. IndVarSimplify runs early and may widen IVs for canonical form. Later, IV demotion runs inside rematerialization and narrows them back when the wide form causes excessive register pressure. This is not redundant work -- the early widening enables other optimizations (loop vectorization, strength reduction), and the late narrowing recovers the register cost after those optimizations have completed.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| IV demotion entry | sub_1CD74B0 | 75KB | Main algorithm: PHI scan, range check, rewrite |
| IV analysis helper | sub_1CD5F30 | -- | Walks def-use chain to extract step/linearity |
| IV base lookup | sub_1CD5400 | -- | Finds base value of induction variable |
| Dead IV cleanup | sub_1CD0600 | -- | Removes unreferenced wide IVs after demotion |
| IR builder | sub_15FB440 | -- | Creates instruction (opcode, type, operand, name, insertpt) |
| Clone instruction | sub_15F4880 | -- | Clones an IR instruction (for iv_base_clone_) |
| Set name prefix | sub_164B780 | -- | Sets name string on cloned instruction |
| Insert into block | sub_15F2120 | -- | Inserts instruction at specified position |
| Replace uses | sub_1648780 | -- | Rewrites all uses of a value to a new value |
| Delete dead instr | sub_15F20C0 | -- | Erases instruction from parent block |
| Type store size | sub_127FA20 | -- | DataLayout::getTypeStoreSize -- returns type width |

Cross-References

  • Rematerialization -- parent pass; IV demotion is invoked in the post-remat phase
  • ScalarEvolution -- upstream SCEV framework; not used directly by IV demotion but related
  • IndVarSimplify -- upstream IV canonicalization pass
  • LLVM Optimizer -- pipeline context showing where rematerialization runs
  • Knobs -- central knob inventory

Base Address Strength Reduction

Address computation is a disproportionately expensive category of work on NVIDIA GPUs. The integer ALU units that compute memory addresses are a scarce resource relative to the FP/tensor throughput the hardware is designed to maximize. A typical unrolled loop body touching four arrays at A[tid + i], B[tid + i], C[tid + i], D[tid + i] -- where tid is a function of threadIdx.x, blockIdx.x, and blockDim.x -- may emit four independent 64-bit multiply-add chains per iteration, each recomputing the same base expression base_ptr + tid * element_size. Reducing those four chains to one base computation plus three cheap constant-offset additions can halve the integer instruction count in the loop body and free address registers that would otherwise stay live across the entire loop.

Base Address Strength Reduction (BASR) is an NVIDIA-proprietary IR-level pass that performs exactly this transformation. It scans loop bodies for memory operations that share a common base pointer expression, finds the one with the minimum constant offset (the "anchor"), hoists the anchor computation, and rewrites all remaining addresses as (anchor + relative_offset). The pass is confirmed by the string "BaseAddressStrengthReduce" at decompiled line 457 of sub_1C67780.

Key Facts

| Property | Value |
|---|---|
| Pass name | BaseAddressStrengthReduce |
| Entry point | sub_1C67780 (Legacy PM), sub_2CA4A10 (New PM) |
| Binary size | 58 KB (~1,400 decompiled lines) |
| Pass type | NVIDIA-proprietary, IR-level, loop body transform |
| Primary knob | do-base-address-strength-reduce (two levels: 1 = no conditions, 2 = with conditions) |
| Chain variant | do-base-address-strength-reduce-chain (separate boolean toggle) |
| Negative offset control | dword_4FBCAE0 (aggressiveness for negative-offset patterns) |
| IV limit | base-address-strength-reduce-iv-limit (parametric) |
| Max IV | base-address-strength-reduce-max-iv (parametric) |
| Debug dump | dump-base-address-strength-reduce |
| Required analyses | LoopInfo (sub_1632FA0), DataLayout |
| Option registration | ctor_263_0 at 0x4F36F0 (shared with SCEV-CGP, 44 strings total) |
| Companion pass | Common Base Elimination (sub_1C5DFC0) |
| Helper | Bitcast helper at sub_1C637F0 (28 KB, strings "baseValue", "bitCastEnd") |

Algorithm

The pass operates in six phases, executing once per function. It processes all loop bodies simultaneously using worklists seeded from LoopInfo.

Phase 1 -- Initialization (lines 452-497)

The entry function retrieves LoopInfo via sub_1632FA0 and extracts the module's DataLayout from the function object (path: (a1+184)->field+24->field+40). It then allocates bookkeeping state:

  • Eight hash maps at stack offsets v374-v399, keyed by Value* (the base pointer). Each map entry holds a linked list of memory instructions that share that base.
  • Multiple worklists for basic blocks containing loads vs. stores.
  • Threshold: v429 = 2 -- the minimum number of uses of the same base before the pass considers strength reduction worthwhile.
  • Pass counter: v438 = 1 -- the initial pass number (the pass may iterate).

Phase 2 -- Address Pattern Collection (lines 518-600)

For each instruction in the target basic blocks (drawn from the a4 worklist):

  1. sub_1C57390 classifies the address expression, extracting its structural form.
  2. sub_1CCB2B0 computes alignment information from the DataLayout.
  3. sub_1456040 extracts the base pointer from the address expression.

The base pointer is then categorized into one of two buckets:

| Category | Condition | Hash map | Worklist | Description |
|---|---|---|---|---|
| Non-pointer-type base | type_id != 15 | v382 | v363 | Integer/GEP-derived base addresses |
| Pointer-type base | type_id == 15 | v378 | v360 | Bases that are raw pointers to globals |

For pointer-type bases, sub_1CCDC20 further extracts the underlying global variable, allowing grouping of addresses to the same global even when accessed through different local pointer variables.

Hash map insertion uses sub_1C50900. If the base pointer is new (not yet in the map), the instruction list is initialized and the base is appended to the corresponding worklist. Otherwise, the instruction is appended to the existing list for that base.

for each instruction I in target BBs:
    addr_info = classify_address(I)          // sub_1C57390
    alignment = compute_alignment(addr_info)  // sub_1CCB2B0
    base_ptr  = extract_base(addr_info)       // sub_1456040

    if type_of(base_ptr) != POINTER_TYPE:
        map_insert(hash_map_v382, base_ptr, I)    // sub_1C50900
        if is_new_entry:
            worklist_v363.append(base_ptr)
    else:
        global = extract_global(base_ptr)         // sub_1CCDC20
        map_insert(hash_map_v378, global, I)
        if is_new_entry:
            worklist_v360.append(global)

Phase 3 -- Anchor Finding (lines 430-470)

For each base pointer that has accumulated at least v429 (2) uses, the pass determines the "anchor" -- the use with the minimum constant offset. This is the instruction whose address computation will be hoisted and shared.

For each candidate base:

  1. sub_1C53170 decomposes each address expression into a (base, constant_offset) pair.
  2. The pass iterates over all uses and finds the one with the smallest constant offset:
    • For offsets that fit in 64 bits: direct integer comparison via sign-extended values.
    • For offsets wider than 64 bits: reads from extended-precision word arrays and compares word-by-word.
  3. The minimum-offset use becomes the anchor.

function find_anchor(base_ptr, use_list):
    min_offset = +INF
    anchor = null

    for each use U in use_list:
        (base, offset) = decompose_address(U)  // sub_1C53170

        if bit_width(offset) <= 64:
            val = sign_extend_64(offset)
        else:
            val = read_extended_precision(offset)

        if val < min_offset:
            min_offset = val
            anchor = U

    return (anchor, min_offset)

Phase 4 -- Address Rewriting (lines 578-600)

Once the anchor is identified:

  1. sub_13A5B00 creates a new base address instruction from the anchor's address computation. This instruction is placed at the loop preheader or the dominating point of all uses.
  2. For every other instruction sharing the same base, the pass computes the relative offset: relative_offset = original_offset - anchor_offset.
  3. sub_14806B0 creates a new address expression (new_base + relative_offset) and replaces the original address operand.

function rewrite_addresses(anchor, anchor_offset, use_list):
    new_base = create_base_instruction(anchor)  // sub_13A5B00

    for each use U in use_list:
        if U == anchor:
            replace_address(U, new_base)
        else:
            (_, orig_offset) = decompose_address(U)
            rel_offset = orig_offset - anchor_offset
            new_addr = create_offset_add(new_base, rel_offset)  // sub_14806B0
            replace_address(U, new_addr)

After this transformation, a loop body that previously contained:

load (base + tid*stride + 0)    // original: full GEP chain
load (base + tid*stride + 16)   // original: full GEP chain
store (base + tid*stride + 32)  // original: full GEP chain
store (base + tid*stride + 48)  // original: full GEP chain

Becomes:

anchor = base + tid*stride + 0  // hoisted once
load anchor                     // offset 0: use anchor directly
load (anchor + 16)              // cheap add
store (anchor + 32)             // cheap add
store (anchor + 48)             // cheap add

The three 64-bit multiply-add chains are replaced by three 64-bit immediate additions.

Phase 5 -- Negative Offset Handling (lines 512-520)

When dword_4FBCAE0 > 1 (the aggressiveness knob is set above default), the pass also considers address groups where the maximum offset has a negative sign bit. These represent patterns like:

load (base + tid*stride - 32)
load (base + tid*stride + 0)
load (base + tid*stride + 32)

Without this phase, the anchor would be the instruction at offset -32, producing negative relative offsets for the first use. Some hardware addressing modes handle negative offsets less efficiently, so this phase is gated separately.

For negative-offset candidates, the pass:

  1. Checks whether the base is loop-invariant via sub_1C51340.
  2. If loop-invariant, creates a separate common base via sub_1C55CE0 that absorbs the negative component.

Phase 6 -- Red-Black Tree Tracking

The pass uses a red-black tree infrastructure (sub_220F040 for insertion, sub_220EF80 for lookup) shared with other NVIDIA passes. This provides O(log n) sorted-set operations for maintaining collections of instruction pointers and efficiently checking membership during the rewriting phase.

Hash Map Implementation

The address pattern hash maps use the standard DenseMap growth policy (75% load factor, 12.5% tombstone compaction) with NVVM-layer sentinels (-8 / -16). The resize/rehash logic lives in sub_1C54050 -- the same function used by Common Base Elimination. Hash keys are Value* pointers with linear probing. See Hash Table and Collection Infrastructure for the hash function and probing strategy.
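A minimal sketch of the probing scheme described above: open addressing with linear probing and sentinel keys for empty and tombstone slots. Only the sentinel values (-8 / -16) mirror the description; the toy hash function, fixed capacity, and all names are our simplifications -- a real DenseMap grows at 75% load factor instead of probing a fixed array:

```c
#include <stdint.h>

#define EMPTY     ((uintptr_t)-8)    /* NVVM-layer "empty" sentinel */
#define TOMBSTONE ((uintptr_t)-16)   /* NVVM-layer tombstone sentinel */
#define CAP 16                       /* fixed here; real map grows */

static uintptr_t keys[CAP];

void map_init(void) {
    for (int i = 0; i < CAP; i++) keys[i] = EMPTY;
}

/* Insert a key (real keys are Value* and never equal a sentinel);
   returns the slot index used. */
int map_insert(uintptr_t key) {
    unsigned i = (unsigned)(key * 2654435769u) % CAP;  /* toy hash */
    while (keys[i] != EMPTY && keys[i] != TOMBSTONE && keys[i] != key)
        i = (i + 1) % CAP;           /* linear probing */
    keys[i] = key;
    return (int)i;
}

/* Find a key; returns its slot or -1. Probing stops at EMPTY but
   walks through tombstones, preserving probe chains after erases. */
int map_find(uintptr_t key) {
    unsigned i = (unsigned)(key * 2654435769u) % CAP;
    while (keys[i] != EMPTY) {
        if (keys[i] == key) return (int)i;
        i = (i + 1) % CAP;
    }
    return -1;
}
```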

Relationship with Common Base Elimination

BASR and Common Base Elimination (sub_1C5DFC0) attack the same problem -- redundant address computation -- but at different scopes and with different strategies:

| Dimension | Base Address Strength Reduction | Common Base Elimination |
|---|---|---|
| Scope | Intra-loop: operates within a single loop body | Inter-block: operates across the CFG using dominance |
| Grouping | Groups addresses by shared induction-variable-based base | Groups addresses by shared base pointer to the same global |
| Placement | Anchor placed at loop preheader | Anchor placed at common dominator of all uses |
| Offset model | Constant offsets relative to IV-derived base | Constant offsets relative to global-derived base |
| Entry point | sub_1C67780 | sub_1C5DFC0 |
| Size | 58 KB | 38 KB |

The two-pass approach is deliberate. Common Base Elimination runs first at the IR level, hoisting shared base expressions across control flow boundaries. BASR then runs within loop bodies, strength-reducing the IV-dependent address chains that CBE cannot handle because the IV changes each iteration.

Both passes share the same address decomposition helper (sub_1C53170), the same hash map infrastructure (sub_1C50900, sub_1C54050), and the same instruction creation routines (sub_13A5B00, sub_14806B0).

Relationship with SCEV-CGP

The BASR knobs are registered together with SCEV-CGP (Scalar-Evolution-based CodeGenPrepare) in ctor_263_0 at 0x4F36F0. This constructor registers 44 option strings total, covering both SCEV-CGP and BASR. The do-base-address-strength-reduce and do-scev-cgp knobs are stored in the same ctor_526_0 option block.

SCEV-CGP is a broader pass that performs SCEV-based address optimization using thread ID as an induction variable (scev-cgp-tid-max-value controls the maximum thread ID value for analysis). BASR is a sub-transformation within this address optimization framework -- it handles the specific case of multiple memory operations sharing a base, while SCEV-CGP handles the broader case of rewriting address expressions using scalar evolution.

Related SCEV-CGP knobs that interact with BASR:

| Knob | Purpose |
|---|---|
| scev-cgp-old-base | Controls whether SCEV-CGP creates new base expressions |
| ignore-bad-base | Bypasses validity checks on base pointer classification |
| ignore-32-bit-overflow | Skips 32-bit overflow checks in address arithmetic |
| ignore-signed-32-bit-overflow | Skips signed 32-bit overflow checks |
| topo-sort-begin | Controls topological sort start point for address chains |
| special-reassociate-for-threadid | Prevents reassociation from moving threadId-dependent expressions |

Configuration

Boolean Knobs

| Knob | Default | Description |
|---|---|---|
| do-base-address-strength-reduce | Enabled (level 2) | Master enable. Level 1 = unconditional; level 2 = with conditions (default); 0 = disabled. |
| do-base-address-strength-reduce-chain | Enabled | Enables the chain variant, which strength-reduces chains of dependent address computations |
| dump-base-address-strength-reduce | false | Prints diagnostic output when set |

Parametric Knobs

| Knob | Description |
|---|---|
| base-address-strength-reduce-iv-limit | Maximum number of induction variables to consider per loop |
| base-address-strength-reduce-max-iv | Maximum IV value for strength reduction eligibility |

Global Variables

| Global | Purpose |
|---|---|
| dword_4FBCAE0 | Negative offset aggressiveness. When > 1, enables strength reduction of address groups with negative offsets. Also used as a special minimum-selection mode in MemorySpaceOpt. |

Diagnostic Strings

"BaseAddressStrengthReduce"   -- Pass identification (line 457)
"baseValue"                   -- Bitcast helper: base value operand name (sub_1C637F0)
"bitCastEnd"                  -- Bitcast helper: end-of-chain marker (sub_1C637F0)

When dump-base-address-strength-reduce is enabled, the pass emits additional diagnostic output showing which base pointers were grouped, which anchor was selected, and which addresses were rewritten.

Key Functions

| Function | Address (Legacy) | Size | Role |
|---|---|---|---|
| Main entry | sub_1C67780 | 58 KB | Pass driver: initialization, collection, anchor finding, rewriting |
| Main entry (New PM) | sub_2CA4A10 | 62 KB | New Pass Manager variant |
| Address classifier | sub_1C57390 | -- | Classifies address expression structure |
| Address decomposer | sub_1C53170 | -- | Decomposes address into (base, constant_offset) pairs |
| Hash map insert | sub_1C50900 | -- | Inserts base pointer into pattern hash map |
| Hash map resize | sub_1C54050 | -- | Load-factor-based resize/rehash |
| Loop invariance check | sub_1C51340 | -- | Tests whether a value is loop-invariant |
| Negative offset handler | sub_1C55CE0 | -- | Creates common base for negative-offset patterns |
| Base instruction creation | sub_13A5B00 | -- | Creates the hoisted anchor address instruction |
| Offset rewriting | sub_14806B0 | -- | Creates (base + relative_offset) replacement |
| Base extraction | sub_1456040 | -- | Extracts base pointer from address expression |
| Global extraction | sub_1CCDC20 | -- | Extracts underlying global variable from pointer chains |
| Alignment computation | sub_1CCB2B0 | -- | Computes alignment from DataLayout |
| Bitcast helper | sub_1C637F0 | 28 KB | Handles bitcast chains in base address expressions |
| RB-tree insert | sub_220F040 | -- | Red-black tree insertion (shared infrastructure) |
| RB-tree lookup | sub_220EF80 | -- | Red-black tree membership check |
| LoopInfo retrieval | sub_1632FA0 | -- | Gets LoopInfo analysis for the function |

Cross-References

Common Base Elimination

The Common Base Elimination pass hoists shared base address expressions to dominating points in the control flow graph, eliminating redundant recomputations of the same base pointer across multiple basic blocks. Where Base Address Strength Reduction targets intra-loop patterns driven by induction variables, Common Base Elimination operates at the inter-block level: it groups memory operations that share the same base pointer regardless of loop structure, finds their common dominator, and creates a single base computation at that dominator. Every original address is then rewritten as (hoisted_base + relative_offset).

This is a strictly GPU-motivated optimization. NVIDIA GPUs have limited integer ALU throughput relative to their floating-point pipelines, so any reduction in address arithmetic directly translates to freed execution slots for other work. On a typical CUDA kernel performing strided accesses across multiple branches (e.g., different cases of a switch over tile indices), the pass can eliminate dozens of redundant GEP chains that independently recompute the same base address.

The two-pass approach -- Common Base Elimination first at the IR level for inter-block redundancies, then Base Address Strength Reduction for intra-loop induction-variable patterns -- ensures comprehensive coverage of GPU address computation overhead.

Key Facts

| Property | Value |
|---|---|
| Pass name | "Common Base Elimination" |
| Entry point | sub_1C5DFC0 |
| Binary offset | 0x1C5DFC0 |
| Binary size | 38 KB (~850 decompiled lines) |
| Scope | Function-level |
| IR level | LLVM IR (pre-codegen) |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
| Complementary pass | Base Address Strength Reduction (sub_1C67780) |
| Primary knobs | scev-cgp-cross-block-limit -- limits common bases from a single block |
| Required analysis | Dominator tree (a1[23]), DataLayout |

Algorithm

The pass has four major phases: address decomposition, base pointer grouping, dominator-based hoisting, and address rewriting.

Phase 1 -- Address Expression Decomposition

For every memory operation (load, store, GEP-based address) in the function, the pass calls sub_1C53170 to decompose the address into a structured form:

struct AddressExpr {
    Value *base_ptr;          // The root pointer (alloca, global, argument)
    Operand  operands[];      // List of (index, constant_offset) pairs
    unsigned operand_count;   // Number of index terms
};

The result is stored as a (base_ptr, operand_list, operand_count) tuple. The decomposition strips away GEP chains to expose the underlying base pointer and accumulates constant offset terms separately from variable index terms. This is the same decomposition helper used by BASR (sub_1C67780), ensuring both passes reason about addresses in a compatible representation.
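
The decomposition can be modeled as a walk from the leaf of a GEP chain back to its root, separating constant offsets from variable index terms. A minimal C++ sketch (the flat GepNode encoding and all names are illustrative, not the recovered structures):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative flat encoding of a GEP chain: each node derives from a
// parent (-1 = root pointer) and contributes a constant byte offset
// and/or one variable index term.
struct GepNode {
    int     parent;     // index of the parent node, -1 for the root
    int64_t const_off;  // constant byte offset contributed here
    int     var_index;  // id of a variable index term, -1 if none
};

struct AddressExpr {
    int              base;          // root pointer id
    int64_t          const_offset;  // accumulated constant offset
    std::vector<int> var_indices;   // variable terms, leaf-to-root order
};

// Walk from the leaf back to the root, accumulating constants separately
// from variable terms -- the shape the shared helper is believed to expose.
AddressExpr decompose(const std::vector<GepNode>& nodes, int leaf) {
    AddressExpr e{-1, 0, {}};
    for (int n = leaf; n != -1; n = nodes[n].parent) {
        e.const_offset += nodes[n].const_off;
        if (nodes[n].var_index != -1) e.var_indices.push_back(nodes[n].var_index);
        if (nodes[n].parent == -1) e.base = n;
    }
    return e;
}
```

Given a chain `root -> (+16, index 7) -> (+8)`, this yields base = root, const_offset = 24, and one variable term, matching the tuple described above.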

Phase 2 -- Base Pointer Grouping

The pass maintains two hash maps for grouping addresses:

Non-pointer-type bases (hash map at v382, keyed by base pointer value):

  • Each memory operation whose decomposed base is not a pointer type (type_id != 15) is inserted via sub_1C50900.
  • The hash entry accumulates a list of all instructions sharing that base.
  • New bases are appended to worklist v363.

Pointer-to-global bases (hash map at v378, keyed by underlying global variable):

  • For pointer-type bases, sub_1CCDC20 extracts the underlying global variable by walking through bitcast and GEP chains.
  • This allows grouping addresses to the same global even when accessed through different local pointer variables.
  • New globals are appended to worklist v360.

The hash maps use the standard DenseMap growth policy (75% load factor, 12.5% tombstone compaction) with NVVM-layer sentinels (-8 / -16). sub_1C54050 handles both resize and in-place rehash. See Hash Table and Collection Infrastructure for the complete specification.
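
The 75% / 12.5% policy can be written as a single trigger predicate. A sketch assuming LLVM DenseMap-style semantics (the thresholds are from the recovered policy; the exact comparisons are an assumption):

```cpp
#include <cassert>

// Growth/rehash trigger modeled on the described policy: grow when
// occupancy would exceed 75% of capacity, or rehash in place when
// tombstones leave fewer than 12.5% of slots truly empty.
bool needs_grow_or_rehash(unsigned entries, unsigned tombstones,
                          unsigned capacity) {
    bool over_load = 4 * (entries + 1) >= 3 * capacity;            // > 75% full
    bool few_empty = capacity - (entries + tombstones) <= capacity / 8;
    return over_load || few_empty;                                 // < 12.5% empty
}
```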

Phase 3 -- Dominator Walk and Base Hoisting

For each base pointer group containing two or more uses, the pass:

  1. Finds the anchor. Among all constant offsets in the group, the operand with the minimum constant offset becomes the anchor. For offsets up to 64 bits, the constant is extracted directly from the GEP operand. For wider offsets (> 64 bits), the pass reads from extended-precision word arrays. Sign-extended comparisons determine the minimum.

  2. Computes the common dominator. The pass reads the function's dominator tree from a1[23] and walks it to find the nearest block that dominates all use sites. This is the standard findNearestCommonDominator operation -- iteratively walk both paths toward the root until they meet.

  3. Inserts the hoisted base. sub_13A5B00 creates a new base address computation (a GEP or add instruction) at the terminator insertion point of the common dominator block. The hoisted instruction computes base_ptr + min_offset, which is the anchor's address.

  4. Rewrites all uses. For each original memory operation in the group, sub_14806B0 rewrites the address as (hoisted_base + (original_offset - min_offset)). Since the anchor's own relative offset is zero, it becomes a direct use of the hoisted base.

In pseudocode:

fn run_common_base_elimination(F: &Function):
    let dom_tree = F.dominator_tree    // a1[23]
    let data_layout = F.module.data_layout

    // Phase 1+2: decompose and group
    let base_groups: HashMap<Value*, Vec<(Instruction*, ConstantOffset)>> = {}
    let global_groups: HashMap<GlobalVariable*, Vec<(Instruction*, ConstantOffset)>> = {}

    for bb in F.basic_blocks():
        for inst in bb.instructions():
            if !is_memory_op(inst): continue
            let (base, offsets, count) = sub_1C53170(inst)

            if base.type_id != POINTER_TYPE:
                base_groups[base].push((inst, offsets))
            else:
                let gv = sub_1CCDC20(base)   // extract global
                global_groups[gv].push((inst, offsets))

    // Phase 3+4: hoist and rewrite
    for (base, uses) in chain(base_groups, global_groups):
        if uses.len() < 2: continue

        let min_offset = uses.iter().map(|u| u.offset).min()
        let anchor_inst = uses.find(|u| u.offset == min_offset).inst

        // Find common dominator of all use blocks
        let dom_block = uses[0].inst.parent
        for use in uses[1..]:
            dom_block = dom_tree.find_nearest_common_dominator(
                            dom_block, use.inst.parent)

        // Hoist: create base+min_offset at dominator
        let hoisted = sub_13A5B00(dom_block, base, min_offset)

        // Rewrite all uses
        for (inst, offset) in uses:
            let relative = offset - min_offset
            sub_14806B0(inst, hoisted, relative)
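
The dominator walk in the pseudocode above is the textbook depth-based algorithm. A minimal C++ sketch over a hypothetical (idom, depth) encoding of the tree the pass reads from a1[23]:

```cpp
#include <cassert>
#include <vector>

// Classic depth-based nearest-common-dominator walk over an
// immediate-dominator encoding: lift whichever node is deeper (or the
// first on a tie) until both paths meet. idom[root] == root.
int find_nearest_common_dominator(const std::vector<int>& idom,
                                  const std::vector<int>& depth,
                                  int a, int b) {
    while (a != b) {
        if (depth[a] >= depth[b]) a = idom[a];
        else                      b = idom[b];
    }
    return a;
}
```

For the diamond CFG used throughout this page (entry = 0 dominating then = 1, else = 2, join = 3), the common dominator of any pair of non-entry blocks is the entry block.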

Pointer-to-Global Grouping in Depth

The global-variable grouping deserves special attention. Consider two local pointers p and q that both derive from the same global array g:

%p = getelementptr [1024 x float], ptr @g, i64 0, i64 %tid
%q = getelementptr [1024 x float], ptr @g, i64 0, i64 %tid2

Without the global extraction step, these would be in different groups (keyed by %p vs %q). The sub_1CCDC20 helper walks through the pointer chain to find the underlying @g, allowing the pass to recognize that both addresses target the same global and can share a hoisted base.
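
With @g recognized as the shared underlying base, both GEPs can be rewritten against a single hoisted base. A sketch of the transformed IR (illustrative, not recovered compiler output):

```llvm
; hoisted base anchored at @g's element 0 (the minimum-offset anchor)
%base = getelementptr [1024 x float], ptr @g, i64 0, i64 0
; both addresses rewritten as (hoisted_base + relative_offset)
%p = getelementptr float, ptr %base, i64 %tid
%q = getelementptr float, ptr %base, i64 %tid2
```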

Cost-Benefit Analysis

The pass trades register pressure at the dominator for reduced address computation at use sites. This trade-off is particularly favorable on GPUs for two reasons:

Benefit -- Reduced integer ALU pressure. Each eliminated GEP chain frees integer ALU slots. On SM architectures, integer instructions compete for the same warp scheduler slots as floating-point instructions. A kernel with N memory operations sharing the same base saves up to (N-1) complete base address recomputations. For a kernel doing 8 loads from the same struct through different control-flow paths, this eliminates 7 redundant address computations.

Cost -- Extended live range at the dominator. The hoisted base must remain live from the dominator block down to every use site. On GPUs, each additional live register reduces occupancy (the number of concurrent warps per SM). The pass implicitly relies on the subsequent rematerialization pass (sub_1CE7DD0) to undo any hoisting decisions that prove too costly for register pressure -- if the hoisted value's live range crosses too many basic blocks, rematerialization will re-derive it closer to the use point.

The SCEV-CGP knob scev-cgp-cross-block-limit provides an explicit limit on how many common bases can be created from a single block, acting as a safety valve against excessive register pressure growth. The related scev-cgp-idom-level-limit constrains how far up the dominator tree the pass is willing to hoist.

Relationship with Base Address Strength Reduction

The two passes operate at different granularities and are intentionally complementary:

| Aspect | Common Base Elimination | Base Address Strength Reduction |
|---|---|---|
| Scope | Inter-block (dominator-based) | Intra-loop (induction-variable-based) |
| Target pattern | Multiple BBs accessing the same base | Loop body with base + stride * iv |
| Mechanism | Hoist to common dominator | Factor out common base, use incremented pointer |
| Key helper | sub_1C53170 (address decomposition) | sub_1C53170 (same decomposition) |
| Offset handling | Minimum-offset anchor | Minimum-offset anchor (same strategy) |
| Pipeline order | Runs first | Runs after CBE |

The shared address decomposition helper (sub_1C53170) and the shared rewriting infrastructure (sub_13A5B00 for creating new base computations, sub_14806B0 for rewriting addresses) confirm that these passes were designed as a coordinated pair. Common Base Elimination runs first to eliminate inter-block redundancies, leaving BASR to focus on the remaining intra-loop stride patterns. Without CBE running first, BASR would encounter more diverse base expressions in loop bodies, reducing its grouping effectiveness.

Both passes share the same 0x1C50000-0x1CCFFFF address range in the binary, and BASR's helper functions (e.g., sub_1C637F0 -- base address bitcast helper, strings "baseValue", "bitCastEnd") are directly adjacent to CBE's entry point.

Configuration

Direct Knobs

No CBE-specific enable/disable knob has been identified in the binary. The pass appears to be unconditionally enabled when the SCEV-CGP subsystem is active.

| Knob | Type | Description |
|---|---|---|
| scev-cgp-cross-block-limit | int | Maximum number of common bases that can be created from a single block. Limits the register pressure increase from hoisting. |
| scev-cgp-idom-level-limit | int | Maximum dominator tree depth for hoisting. Prevents hoisting too far from use sites. |
| do-scev-cgp | bool | Master enable for the SCEV-CGP subsystem. Disabling this may also disable CBE. |
| do-base-address-strength-reduce | int | Two levels: 1 = basic, 2 = with conditions. Controls the companion BASR pass. |
| do-base-address-strength-reduce-chain | bool | Enables chained strength reduction in BASR. |
| base-address-strength-reduce-iv-limit | int | IV limit for BASR. |
| base-address-strength-reduce-max-iv | int | Maximum IVs considered by BASR. |

BASR Aggressiveness Knob

The global dword_4FBCAE0 controls aggressiveness for negative-offset handling in the BASR companion pass. When dword_4FBCAE0 > 1, BASR also considers base groups where the maximum offset has a negative sign bit, checking via sub_1C51340 whether the base is loop-invariant before creating a separate common base via sub_1C55CE0. This knob does not directly affect CBE but influences how much address redundancy remains for CBE to handle.

Diagnostic Strings

"Common Base Elimination"

The pass registers a single diagnostic string (its name). No additional debug/dump strings have been identified. The pass does not appear to have a dedicated dump knob analogous to dump-base-address-strength-reduce for BASR.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| CommonBaseElimination::run | sub_1C5DFC0 | 38 KB | Main entry point -- orchestrates all four phases |
| decomposeAddress | sub_1C53170 | -- | Decomposes a memory address into (base, offset_list, count) tuple. Shared with BASR. |
| hashMapGrowOrRehash | sub_1C54050 | -- | Hash map resize/rehash with load-factor policy |
| hashMapInsertOrLookup | sub_1C50900 | -- | Insert into or look up in the base-pointer hash map |
| extractGlobalFromPointerChain | sub_1CCDC20 | -- | Walks bitcast/GEP chains to find the underlying GlobalVariable |
| createCommonBaseForNegativeOffsets | sub_1C55CE0 | -- | Creates a separate common base when the max offset is negative. Used by BASR, available to CBE. |
| isBaseLoopInvariant | sub_1C51340 | -- | Checks whether a base address is loop-invariant |
| classifyAddressExpression | sub_1C57390 | -- | Classifies an instruction's address expression type |
| createNewBaseInstruction | sub_13A5B00 | -- | Creates a new base address computation at the insertion point |
| rewriteAddressAsBaseOffset | sub_14806B0 | -- | Rewrites an address as (new_base + relative_offset) |
| extractBasePointer (SCEV helper) | sub_1456040 | -- | Extracts the base pointer from an address expression (SCEV getStart/getOperand(0)) |

Cross-References

CSSA -- Conventional SSA for GPU Divergence

Standard SSA form assumes that a PHI node selects its incoming value based solely on the control flow edge along which execution arrived. On a scalar CPU, exactly one predecessor edge is taken per dynamic execution of the PHI, so this assumption holds trivially. On an NVIDIA GPU, it does not. A warp of 32 threads executes in lockstep, and when control flow diverges -- different threads take different branches -- all paths are eventually serialized and the warp reconverges. At the reconvergence point, a standard PHI node cannot correctly select a single incoming value because the warp carries live values from multiple predecessors simultaneously. The wrong thread could see the wrong value.

CSSA (Conventional SSA) is NVIDIA's transformation that rewrites the IR so that every PHI node is safe under warp-divergent execution. It does this by inserting explicit copy instructions at points where threads reconverge, ensuring that each thread's value is materialized into its own copy before the PHI merges anything. The name "Conventional SSA" comes from the SSA literature: a program is in CSSA form when every PHI node's operands can be simultaneously live without interfering -- the PHI web has no overlapping live ranges. This property is exactly what GPU divergence demands.

| Property | Value |
|---|---|
| Pass location | sub_3720740 (22KB, ~800 lines decompiled) |
| Address range | 0x3720740--0x3721501 (3521 bytes) |
| Gate knob | do-cssa (NVVMPassOptions boolean toggle) |
| Coalesce knob | cssa-coalesce (controls copy coalescing aggressiveness) |
| Debug knobs | cssa-verbosity, dump-before-cssa |
| Container knob | CSSACoalescing (NVVM container format, parsed at sub_CD9990) |
| Debug string | "IR Module before CSSA:\n" |
| Helper cluster | sub_371F790 (27KB), sub_371F160, sub_371EDF0, sub_371CDC0 |
| Pass-option slot | One of the 221 NVVMPassOptions slots (boolean do/don't pair) |
| Pipeline position | Late IR, after optimization, before SelectionDAG lowering |
| Upstream equivalent | None. LLVM has no concept of warp-divergent PHI semantics. |

GPU Divergence Background

The Warp Execution Model

NVIDIA GPUs execute threads in warps of 32 under the SIMT model: all threads share a program counter, and divergent branches serialize both paths before the warp reconverges. The full warp execution model and its implications for cicc are documented in the GPU Execution Model.

Why Standard SSA Breaks

Consider a diamond CFG:

         entry
        /     \
     then     else
        \     /
         join        <-- PHI(%x = [then: %a], [else: %b])

On a CPU, the PHI at join works correctly: execution came from exactly one predecessor, so the PHI selects the corresponding value. On a GPU warp where threads 0-15 took then and threads 16-31 took else, both paths executed sequentially. When the warp reconverges at join, the PHI must produce %a for threads 0-15 and %b for threads 16-31 simultaneously in the same register. A naive lowering of the PHI to a simple register copy is incorrect -- whichever path executed last would overwrite the value from the first path.

The CSSA Solution

CSSA transforms the IR so that the PHI web has non-interfering live ranges. Concretely, it inserts copy instructions at the end of each predecessor block so that each thread's value is written into a dedicated copy before the warp reconverges:

         entry
        /     \
     then     else
     %a_copy  %b_copy     <-- inserted copies (one per predecessor)
        \     /
         join
     %x = PHI [then: %a_copy], [else: %b_copy]

Now the PHI's operands occupy distinct virtual registers. During later register allocation, the allocator can assign them the same physical register only when their live ranges truly do not overlap -- which is the correct condition for divergent execution. The copies give the allocator the freedom to keep the values separate when divergence requires it.
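
The copy-insertion mechanics can be modeled on a toy CFG: one copy per PHI incoming, placed before the predecessor's terminator, with the PHI operand rewired to the copy. All structures and names below are illustrative, not the binary's own:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Minimal block model: a list of instruction strings whose last entry
// is the block terminator.
struct Block { std::vector<std::string> insts; };

struct PhiIncoming { std::string pred, value; };

// For each PHI incoming value, insert a "pcp"-prefixed copy at the end
// of the predecessor block (before its terminator) and point the PHI
// operand at the copy -- the CSSA transformation in miniature.
std::vector<PhiIncoming> make_cssa(std::map<std::string, Block>& cfg,
                                   std::vector<PhiIncoming> phi) {
    int n = 0;
    for (auto& in : phi) {
        std::string copy = "pcp" + std::to_string(n++);
        auto& insts = cfg[in.pred].insts;
        insts.insert(insts.end() - 1, copy + " = copy " + in.value);
        in.value = copy;   // the PHI now reads the copy, not the original
    }
    return phi;
}
```

Running this on the diamond above leaves each predecessor ending with its own "pcp" copy followed by the original branch, and the PHI referencing only the copies.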

Algorithm

The sub_3720740 function implements CSSA in several phases:

Phase 1: Basic Block Ordering and Numbering

The function begins by iterating over all basic blocks in the LLVM function (accessed via [r15], the LLVM Module/Function pointer) and assigning sequential numbering. Each basic block receives an ordinal stored at offset +0x48 (preorder index) and +0x4C (reverse postorder index). These indices are used later for dominance and reconvergence queries. The block list is walked via the standard LLVM doubly-linked list at function offsets +0x48/+0x50 (begin/end sentinels), with a secondary worklist stored in a dynamic array at [rbp-0x240] that grows via the standard SmallVector growth function sub_C8D5F0.

After ordering, the function sets byte [r8+0x70] = 1 and dword [r8+0x74] = 0 on the pass state object (at [r15+8]), marking the ordering phase as complete. If the ordering was already done (byte [r8+0x70] is non-zero on entry), the function skips directly to phase 2.

Phase 2: PHI Node Scanning and Hash Map Population

The function iterates over every basic block (outer loop at 0x37208C0) and within each block walks the instruction use-list (inner loop at 0x3720930). Instructions are identified by checking byte [rbx-0x18] (the LLVM Value tag / opcode byte) against 0x54 (decimal 84), which is the LLVM PHI node opcode. Non-PHI instructions are skipped.

For each PHI node found, the function:

  1. Increments a monotonic counter at [r15+0x78] to assign a unique PHI ID.
  2. Computes a hash of the PHI's pointer value using the standard NVIDIA hash: h = (ptr >> 4) ^ (ptr >> 9), masked by (table_size - 1). This is the same hash function used across CICC's DenseMap infrastructure.
  3. Inserts the PHI (or looks it up) in the hash map at [r15+0x60] with metadata fields: key at [slot+0], PHI ID at [slot+8]. The hash table uses LLVM-layer sentinels (-4096 / -8192); see Hash Table and Collection Infrastructure for the probing and growth policy.
  4. Calls sub_A41E30 to resize the hash table when the load factor exceeds the 75% threshold.
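
The bucket computation in step 2 is straightforward to express. A sketch assuming a power-of-two table size, as the mask implies:

```cpp
#include <cassert>
#include <cstdint>

// Bucket index for a pointer key: h = (ptr >> 4) ^ (ptr >> 9), masked to
// the (power-of-two) table size -- the formula recovered in step 2.
// Dropping the low 4 bits discards pointer alignment; XOR-ing in higher
// bits spreads nearby allocations across buckets.
uint64_t bucket_for(uint64_t ptr, uint64_t table_size) {
    return ((ptr >> 4) ^ (ptr >> 9)) & (table_size - 1);
}
```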

Phase 3: Copy Insertion at Reconvergence Points

After populating the PHI map, the function enters the copy-insertion phase. For each basic block that contains PHI nodes, it:

  1. Walks the PHI's incoming values (the use-list at offset +0x18 through the instruction's operand chain at 32-byte stride).
  2. For each incoming value, calls sub_371F160 with r8d=1 (the "insert copy" flag). This helper creates a copy instruction at the end of the predecessor block, before the terminator. The copy is named with the "pcp" (PHI copy propagation) prefix string, as evidenced by the lea rax, aPcp instruction at 0x3720D34.
  3. Calls sub_ACA8A0 to set the name on the newly created copy instruction.
  4. Calls sub_371CDC0 with an instruction builder struct to create the actual copy/move IR instruction. The call passes opcode 0x22D7 (8919 decimal) as the first argument via edi -- this is likely an NVVM-internal opcode for a divergence-safe copy.
  5. Calls sub_371EDF0 to insert the new copy instruction into the predecessor block's instruction list. This is followed by sub_BD84D0 (the standard LLVM insertBefore/insertAfter) to splice the instruction into position.
  6. Updates the PHI node's use chain: the operand that previously pointed to the original value now points to the copy. This rewiring is done at 0x3720C87--0x3720CDD by manipulating the 32-byte use-def chain entries (pointer at [use+0], predecessor at [use+8], backlink at [use+0x10]).

Phase 4: Instruction-Level Copy Propagation

After copy insertion, the function iterates over all basic blocks a second time (0x3720A2F--0x3720A62). For each instruction in each block, it calls sub_371F790 (27KB, the "NVPTX intrinsic operand builder" / copy propagation helper). This function propagates the "pcp" copies through the instruction graph, replacing uses of the original values with uses of the copies where appropriate, and eliminating redundant copies where the original value and the copy provably carry the same value for all threads.

Phase 5: Dead Copy Cleanup

The final phase walks a linked list at [r15+0x28] (a cleanup worklist). For each entry, it checks whether the instruction at [entry+8] has zero remaining uses ([rdi+0x10] == 0). If so, it calls sub_B43D60 to erase the dead instruction. This removes copies that were rendered unnecessary by the propagation phase.

Copy Coalescing

The cssa-coalesce knob controls how aggressively the pass coalesces the inserted copies back together. Without coalescing, CSSA inserts one copy per PHI operand per predecessor -- potentially a large number of copies in control flow with many branches. Coalescing identifies cases where two or more copies carry the same value and can share a single register, reducing the copy overhead.

The CSSACoalescing knob in the NVVM container format (parsed by sub_CD9990 from the finalizer knobs structure) provides a separate control path for the same behavior. The container knob is categorized alongside register allocation and scheduling controls (AdvancedRemat, DisablePredication, DisableXBlockSched, ReorderCSE), confirming that CSSA coalescing is considered part of the register allocation subsystem.

deSSA Alternative

The usedessa knob (default value 2, registered at ctor_358_0 at 0x50E8D0, stored in dword_4FD26A0) selects an alternative path for PHI elimination during the transition from SSA to machine code. Despite its name suggesting "de-Static Single Assignment", analysis of the dispatch functions shows it controls the scheduling and PHI elimination pipeline:

| Mode | Pre-RA Scheduling | Post-RA Scheduling | Behavior |
|---|---|---|---|
| 1 | Skipped | Minimal (single pass) | Simple mode -- no pre-RA scheduling |
| 2 (default) | Full (&unk_4FC8A0C) | Three passes + StackSlotColoring | Full mode -- complete scheduling pipeline |

The deSSA mode and CSSA transformation are complementary. CSSA operates at the LLVM IR level, converting PHI nodes into a form safe for GPU divergence before instruction selection. The usedessa mode controls how PHI nodes are ultimately eliminated during the MachineIR lowering, after SelectionDAG has already consumed the CSSA-transformed IR. When usedessa=2 (default), the full scheduling pipeline runs, giving the register allocator maximum flexibility to handle the extra copies that CSSA introduced. When usedessa=1, the minimal scheduling mode may be appropriate for debugging or for kernels where scheduling causes regressions.

Configuration Knobs

NVVMPassOptions Knob

| Knob | Type | Description |
|---|---|---|
| do-cssa | bool | Master enable/disable for the CSSA pass |

Set via -opt "-do-cssa=0" to disable the pass entirely.

cl::opt Knobs (ctor_705 at 0x5BD430)

| Knob | Type | Default | Global | Description |
|---|---|---|---|---|
| cssa-coalesce | int | (unknown) | (ctor_705 data) | Controls PHI operand coalescing aggressiveness. Higher values = more aggressive coalescing = fewer copies but higher risk of incorrect merging under divergence. |
| cssa-verbosity | int | 0 | (ctor_705 data) | Verbosity level for diagnostic output during the CSSA transformation. |
| dump-before-cssa | bool | false | qword_5050A28 | When non-zero, dumps the entire IR module before CSSA runs. Triggers the "IR Module before CSSA:\n" output followed by sub_A69980 (Module::print). |

Container-Format Knob

| Knob | Parsed At | Category | Description |
|---|---|---|---|
| CSSACoalescing | sub_CD9990 | Register allocation / scheduling | Controls CSSA coalescing from the NVVM container format. Parsed alongside AdvancedRemat, DisablePredication, DisableXBlockSched. |

| Knob | Type | Default | Global | Description |
|---|---|---|---|---|
| usedessa | int | 2 | dword_4FD26A0 | Selects deSSA method / scheduling pipeline mode. Mode 1 = simple (no pre-RA scheduling), mode 2 = full. |

Diagnostic Strings

"IR Module before CSSA:\n"          -- Module dump header (dump-before-cssa)
"pcp"                               -- PHI copy propagation instruction name prefix

The "pcp" prefix is assigned to all copy instructions created by the CSSA pass. These copies can be identified in IR dumps by their %pcp naming. After register allocation, these copies may be eliminated (coalesced into the same physical register) or materialized as actual move instructions in the final PTX.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| CSSA main | sub_3720740 | 22KB | BB ordering, PHI scanning, copy insertion, cleanup |
| PCP builder | sub_371F790 | 27KB | PHI copy propagation / intrinsic operand builder |
| Copy insertion helper | sub_371F160 | -- | Creates copy instruction in predecessor block |
| Copy instruction creator | sub_371EDF0 | -- | Inserts copy into instruction list |
| Copy IR builder | sub_371CDC0 | -- | Builds the copy instruction IR node |
| Hash table grow | sub_A41E30 | -- | DenseMap resize for PHI hash table |
| Module printer | sub_A69980 | -- | Module::print (for dump-before-cssa) |
| raw_ostream::write | sub_CB6200 | -- | String output for debug dump |
| Debug stream getter | sub_C5F790 | -- | Returns current debug output stream |
| Instruction eraser | sub_B43D60 | -- | Erases dead instruction from parent block |
| Instruction insert | sub_BD84D0 | -- | BasicBlock::insert (instruction splice) |
| Name setter | sub_ACA8A0 | -- | Value::setName for "pcp" prefix |
| Use chain rewrite | sub_B96E90 | -- | replaceAllUsesWith on operand |
| Use helper | sub_B91220 | -- | Use-list manipulation |
| DenseMap grow helper | sub_C8D5F0 | -- | SmallVector/DenseMap capacity growth |
| Knob registration | ctor_705 (0x5BD430) | 5.4KB | Registers cssa-coalesce, cssa-verbosity, dump-before-cssa |
| Container knob parser | sub_CD9990 | 31KB | Parses CSSACoalescing from NVVM container |
| deSSA dispatch (post-RA) | sub_21668D0 | -- | Scheduling pipeline mode selector |
| deSSA dispatch (pre-RA) | sub_2165850 | -- | Pre-RA scheduling mode selector |

Differences from Upstream LLVM

LLVM's standard PHI elimination pass (llvm::PHIEliminationPass, registered as "phi-node-elimination" at pipeline slot 493 in CICC's pass parser) lowers PHI nodes to machine copies during the SelectionDAG-to-MachineIR transition. It operates under the assumption that PHI semantics follow scalar control flow -- exactly one predecessor contributes a value at each dynamic execution.

NVIDIA's CSSA pass runs before instruction selection, at the LLVM IR level, and transforms the IR into a form where PHI elimination can proceed safely even when the underlying execution model is SIMT. The two passes are not alternatives -- CSSA runs first to prepare the IR, then standard PHI elimination runs later to lower the CSSA-safe PHI nodes to machine copies.

This is one of the fundamental semantic gaps between LLVM's CPU-centric IR model and GPU reality. LLVM assumes sequential scalar semantics; NVIDIA's CSSA pass bridges that gap by making the implicit thread-level parallelism explicit in the copy structure of the IR.

Common Pitfalls

These are mistakes a reimplementor is likely to make when building an equivalent CSSA transformation for GPU targets.

1. Inserting copies only at the merge block instead of at the end of each predecessor. The entire point of CSSA is that copies must be placed before the warp reconverges, not at the reconvergence point. If you insert the copy instruction at the beginning of the merge block (after the PHI), the warp has already reconverged and whichever path executed last has overwritten the register value for all threads. Copies must be at the terminator position of each predecessor block, before control leaves that block. This is the fundamental GPU-vs-CPU distinction: on a CPU, only one predecessor executes so placement does not matter; on a GPU, all predecessors may execute sequentially within the same warp.

2. Coalescing copies that have divergent live ranges. The cssa-coalesce knob controls how aggressively copies are merged back together. Over-aggressive coalescing can assign two copies to the same physical register when their live ranges overlap under divergence -- threads from different predecessor paths would see each other's values. The coalescer must verify that live ranges are truly non-interfering under the SIMT execution model, not just under the sequential CFG model. A reimplementation that reuses a standard LLVM register coalescer without divergence-aware interference checking will produce silent miscompilation on any kernel with divergent control flow.

3. Failing to insert copies for uniform PHI nodes that become divergent after later transformations. CSSA runs before instruction selection, but divergence analysis at that point may be imprecise. A PHI node classified as uniform (all threads agree on the incoming edge) may become effectively divergent after subsequent loop transformations or predication changes the control flow. The safe approach is to insert copies for all PHI nodes and let the coalescing phase remove unnecessary ones. A reimplementation that skips "uniform" PHI nodes based on divergence analysis risks correctness if that analysis is later invalidated.

4. Using a standard LLVM PHIElimination pass without the CSSA preprocessing step. LLVM's built-in PHI elimination assumes scalar control flow semantics (exactly one predecessor contributes at runtime). Running it directly on GPU IR without first converting to CSSA form will produce incorrect register assignments whenever a warp diverges at a branch leading to a PHI merge point. CSSA is not a replacement for PHI elimination -- it is a prerequisite that transforms PHI semantics into a form safe for the standard lowering.

5. Not propagating the "pcp" copy through the instruction graph after insertion. Phase 4 of the algorithm (copy propagation via sub_371F790) replaces uses of original values with uses of the inserted copies. A reimplementation that inserts copies but skips this propagation step will leave the PHI node still referencing the original value, making the copies dead. The subsequent dead-copy cleanup (Phase 5) will then erase them, and the transformation has no effect -- the original divergence-unsafe PHI remains.

Reimplementation Checklist

  1. Basic block ordering and numbering. Assign preorder and reverse-postorder indices to every basic block (stored at block offsets +0x48/+0x4C), used later for dominance and reconvergence queries.
  2. PHI node scanning and hash map population. Walk all instructions across all basic blocks, identify PHI nodes (opcode 0x54), assign monotonic IDs, and insert into a DenseMap using the hash (ptr >> 4) ^ (ptr >> 9) with LLVM-layer sentinels (-4096/-8192) and 75% load-factor growth.
  3. Copy insertion at reconvergence points. For each PHI node's incoming value, insert a "pcp"-prefixed copy instruction at the end of the predecessor block (before the terminator) using opcode 0x22D7 (divergence-safe copy), then rewire the PHI's use chain so the operand points to the copy instead of the original value.
  4. Copy propagation. Iterate all blocks a second time, invoking the PCP builder on each instruction to propagate inserted copies through the instruction graph, replacing uses of original values with uses of copies where appropriate and eliminating redundant copies where original and copy provably carry the same value for all threads.
  5. Dead copy cleanup. Walk the cleanup worklist, check each entry for zero remaining uses, and erase dead copy instructions via eraseFromParent.
  6. Copy coalescing (cssa-coalesce). Implement configurable coalescing that identifies cases where multiple "pcp" copies carry the same value and can share a single register, reducing copy overhead while preserving correctness under warp divergence.

Cross-References

Minor NVIDIA Passes

This page indexes NVIDIA-proprietary passes that are too small or insufficiently decompiled for dedicated pages. For the ten passes that were previously documented here and now have full pages, see the links below.

Passes with Dedicated Pages

| Pass | Page |
|---|---|
| NVVM IR Verifier | nvvm-verify (Deep Dive) |
| NVVM Intrinsic Lowering | nvvm-intrinsic-lowering |
| Dead Synchronization Elimination | dead-sync-elimination |
| IV Demotion | iv-demotion |
| Struct/Aggregate Splitting | struct-splitting |
| Base Address Strength Reduction | base-address-sr |
| Common Base Elimination | common-base-elim |
| CSSA (Conventional SSA) | cssa |
| FP128/I128 Emulation | fp128-emulation |
| Memmove Unrolling | memmove-unroll |

alloca-hoisting -- Entry Block Alloca Consolidation

| Field | Value |
|---|---|
| Pass ID | alloca-hoisting |
| Entry point | sub_21BC7D0 |
| Scope | Machine-level pass |

PTX requires all stack allocations to reside in the entry block. This pass moves alloca instructions inserted by inlining or loop transforms into the entry block, preserving order and alignment. Without it, non-entry-block allocas produce invalid PTX.
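The transformation is mechanical. Below is a Python sketch of the assumed semantics, not recovered code: instructions are modeled as tuples, each non-entry alloca is appended to the entry block in original order, and the entry block's last instruction is assumed to be its terminator.

```python
def hoist_allocas(func):
    """func: dict mapping block name -> list of (opcode, *operands) tuples.
    Moves every 'alloca' in a non-entry block to the end of the entry
    block (before its terminator), preserving relative order."""
    hoisted = []
    for name, insts in func.items():
        if name == 'entry':
            continue
        kept = []
        for inst in insts:
            (hoisted if inst[0] == 'alloca' else kept).append(inst)
        func[name] = kept
    entry = func['entry']
    # Insert before the entry block's terminator (assumed last instruction).
    func['entry'] = entry[:-1] + hoisted + entry[-1:]
    return func
```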

image-optimizer -- Texture/Surface Access Optimization

| Field | Value |
|---|---|
| Pass ID | nvptx-image-optimizer |
| Entry point | sub_21BCF10 |
| Scope | Machine-level pass (pre-emission) |

Groups related texture loads for cache utilization and merges redundant surface operations. Works in coordination with Replace Image Handles (below). See also Machine-Level Passes.

nvptx-peephole -- Machine-Level Peephole

| Field | Value |
|---|---|
| Pass ID | nvptx-peephole |
| Entry point | sub_21DB090 |
| Scope | Machine-level pass (pre-RA) |
| Knob | enable-nvvm-peephole (default: on) |

PTX-specific peephole that folds redundant cvta address space conversions, optimizes predicate patterns, and simplifies PTX-specific instruction sequences. Distinct from the IR-level NVVM Peephole. See Machine-Level Passes for pipeline position.

proxy-reg-erasure -- Redundant cvta.to.local Removal

| Field | Value |
|---|---|
| Pass ID | nvptx-proxy-reg-erasure |
| Entry point | sub_21DA810 |
| Scope | Machine-level pass (late post-RA) |

Removes redundant cvta.to.local instructions left by address space lowering. Runs late in the pipeline after register allocation. See Machine-Level Passes.

valid-global-names -- PTX Identifier Sanitization

| Field | Value |
|---|---|
| Pass ID | nvptx-assign-valid-global-names |
| Entry point | sub_21BCD80 |
| Scope | Machine-level pass (pre-emission) |

Rewrites global symbol names to comply with PTX naming rules, removing characters illegal in PTX identifiers (@, $, etc.). Runs immediately before PTX emission.
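A minimal model of the sanitization rule described above. This is an assumption-laden sketch: it maps every character outside [A-Za-z0-9_] to '_' and guards against a leading digit; the exact substitution scheme cicc uses was not recovered.

```python
import re

def sanitize_ptx_name(name):
    """Rewrite a symbol name so it contains only characters assumed
    legal in a PTX identifier; illegal characters (@, $, ., etc. per
    the description above) become '_'."""
    out = re.sub(r'[^A-Za-z0-9_]', '_', name)
    if out and out[0].isdigit():
        out = '_' + out       # identifiers may not start with a digit
    return out
```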

replace-image-handles -- Texture/Surface Handle Substitution

| Field | Value |
|---|---|
| Pass ID | nvptx-replace-image-handles |
| Entry point | sub_21DBEA0 |
| Scope | Machine-level pass (pre-emission) |

Replaces IR-level texture/surface handle references with PTX-level .tex / .surf declarations. Paired with image-optimizer above. See Machine-Level Passes.

extra-mi-printer -- Register Pressure Diagnostics

| Field | Value |
|---|---|
| Pass ID | extra-machineinstr-printer |
| Entry point | sub_21E9E80 |
| Scope | Diagnostic (debug-only) |

Prints per-function register pressure statistics. Used for tuning pressure heuristics during development. Not active in release builds.

nvvm-intr-range -- Intrinsic Range Metadata

| Field | Value |
|---|---|
| Pass ID | nvvm-intr-range |
| Entry point | sub_216F4B0 |
| Scope | Function pass (IR level) |
| Knob | nvvm-intr-range-sm (ctor_359) |

Attaches !range metadata to NVVM intrinsics that return hardware-bounded values (threadIdx.x, blockIdx.x, etc.), enabling downstream known-bits analysis and range-based dead code elimination. Tightens ranges when __launch_bounds__ metadata is present. Documented in detail in KnownBits & DemandedBits.
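The effect on the range bound can be shown directly. This is an illustrative model, not recovered code; it assumes the default upper bound for tid.x is the architectural 1024-threads-per-block limit, tightened when a __launch_bounds__ maximum is known.

```python
def tid_x_range(launch_bounds_max=None, hw_max_threads=1024):
    """Return the half-open [lo, hi) range attached as !range metadata
    to the tid.x intrinsic result (model; bounds are assumptions)."""
    if launch_bounds_max:
        upper = min(launch_bounds_max, hw_max_threads)
    else:
        upper = hw_max_threads
    return (0, upper)
```

Downstream known-bits analysis can then, for example, prove that a tid.x value under __launch_bounds__(256) fits in 8 bits.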

GenericToNVVM -- Global Address Space Migration

| Field | Value |
|---|---|
| Pass ID | generic-to-nvvm |
| Entry point | sub_215DC20 |
| Size | 36 KB |

Moves global variables from generic address space (AS 0) to global address space (AS 1), inserting addrspacecast at use sites. Required because PTX globals must reside in .global memory. Documented in detail in PTX Emission.


Other Passes Documented Elsewhere

These passes appear in the NVPTX backend but have primary documentation on other pages:

| Pass | Entry | Primary Page |
|---|---|---|
| nvvm-pretreat | PretreatPass (New PM slot 128) | Optimizer Pipeline |
| NLO (Simplify Live Output) | sub_1CE10B0, sub_1CDC1F0 | Rematerialization |
| Prolog/Epilog | sub_21DB5F0 | Machine-Level Passes, PrologEpilogInserter |
| LDG Transform | sub_21F2780 (ldgxform) | Machine-Level Passes, Code Generation |
| Machine Mem2Reg | sub_21F9920 (nvptx-mem2reg) | Machine-Level Passes, Code Generation |

Pipeline & Pass Ordering

CICC v13.0 implements the LLVM New Pass Manager pipeline infrastructure, with NVIDIA injecting 33 custom passes into the registration table alongside approximately 493 standard LLVM passes. The master registration function at sub_2342890 populates a StringMap<PassInfo> hash table with every known pass name at startup, and a text-based pipeline parser allows the full pass ordering to be specified as a parenthesized string (e.g., module(function(instcombine,dse))). This page documents the complete pass inventory, the registration mechanism, the NVIDIA-specific additions, and — critically — the runtime pass execution order for each optimization level including the tier system and pass factory addresses.

| Item | Address / Detail |
|---|---|
| Master registration | sub_2342890 (0x2342890, ~2,816 lines) |
| Hash table insert | sub_E41FB0 (0xE41FB0) -- open-addressing, 48-byte entries |
| String equality | sub_9691B0 (0x9691B0) -- len==len && memcmp==0 |
| AA name resolver | sub_233BD40 (0x233BD40) -- chain of string comparisons |
| AA pipeline parser | sub_233C0C0 (0x233C0C0) -- splits on ',', special-cases "default" |
| Extension callback | sub_233C300 (0x233C300) -- iterates [PassBuilder+2208], stride 32 |
| Option parser | sub_233A120 (0x233A120) -- splits on ';', validates tokens |
| Help/listing | sub_233C410 (0x233C410) -- --print-pipeline-passes handler |
| Pipeline assembler | sub_12E54A0 (0x12E54A0, 49.8 KB, 1,553 lines) |
| AddPass | sub_12DE0B0 (0x12DE0B0, hash-based pass insertion) |
| Tier 0 sub-pipeline | sub_12DE330 (0x12DE330, ~40 passes) |
| Tier 1/2/3 sub-pipeline | sub_12DE8F0 (0x12DE8F0, phase-conditional) |
| Codegen dispatch | sub_12DFE00 (0x12DFE00, 20.7 KB) |
| Total passes | ~526 unique registrations |
| NVIDIA additions | 33 passes (12 module, 20 function, 1 loop) |

Registration Architecture

The pipeline infrastructure follows the standard LLVM New Pass Manager design. At startup, sub_2342890 is called once and inserts every known pass into a StringMap living at [PassBuilder+8]. The insertion function sub_E41FB0 uses open-addressing with linear probing; each entry occupies 48 bytes containing the key pointer, key length, value pointer, value length, and 16 bytes of inline storage for short class names.
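The recovered 48-byte entry layout can be expressed as a struct. Field names below are guesses from the decompilation, not recovered symbols; only the sizes and the 16-byte inline buffer are established.

```python
import ctypes

class PassMapEntry(ctypes.Structure):
    """Model of one 48-byte StringMap entry (field names assumed)."""
    _fields_ = [
        ("key_ptr",    ctypes.c_uint64),      # pointer to pass-name string
        ("key_len",    ctypes.c_uint64),
        ("value_ptr",  ctypes.c_uint64),      # pointer to PassInfo value
        ("value_len",  ctypes.c_uint64),
        ("inline_buf", ctypes.c_char * 16),   # inline storage, short names
    ]

assert ctypes.sizeof(PassMapEntry) == 48     # matches the recovered stride
```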

Pass lookup during pipeline parsing uses the hash function at sub_C94890 (likely DJB/FNV-family). Parameterized passes are detected by the presence of <...> angle brackets after the pass name; the parameter string is extracted and forwarded to a pass-specific callback. The generic parameter validator sub_233A120 splits option strings on semicolons and compares each token to expected values, emitting "invalid {PassName} pass parameter '{token}'" on mismatch.
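A Python model of the sub_233A120 behavior described above. The return convention is an assumption; the diagnostic string matches the shape reported from the binary.

```python
def validate_pass_params(pass_name, params, accepted):
    """Split a parameter string on ';' and check each token against the
    pass's accepted option set; return the first diagnostic, or None."""
    for tok in filter(None, params.split(';')):
        if tok not in accepted:
            return f"invalid {pass_name} pass parameter '{tok}'"
    return None
```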

The alias analysis pipeline has its own parser at sub_233C0C0. It special-cases the string "default" (which calls sub_23A1380 then sub_23038C0 to build the default AA stack), and otherwise splits on commas, resolving each name through sub_233BD40:

| AA Name | Constructor |
|---|---|
| globals-aa | sub_2396EC0 |
| basic-aa | sub_2361CE0 |
| objc-arc-aa | sub_2361F60 |
| scev-aa | sub_2362040 |
| scoped-noalias-aa | sub_2362120 |
| tbaa | sub_2362200 |

Extension callbacks for target-specific pipeline customization are stored at [PassBuilder+2208] with a count at [PassBuilder+2216]. Each entry is 32 bytes with a guard at offset +16 (must be non-null) and the callback function pointer at offset +24. The string "all" in extension context triggers invalidate<all>.
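The callback walk can be modeled as follows. Memory is modeled as a list of 8-byte words, and the invoke step stands in for the indirect call through the function pointer at offset +24; the skip-on-null-guard behavior is the recovered detail being illustrated.

```python
def run_extension_callbacks(buf, count, invoke):
    """buf models the array at [PassBuilder+2208] as 8-byte words:
    each 32-byte entry is 4 words; word 2 is the guard at +16,
    word 3 the callback 'pointer' at +24."""
    words_per_entry = 32 // 8
    for i in range(count):
        base = i * words_per_entry
        guard = buf[base + 2]      # offset +16: must be non-null
        cb    = buf[base + 3]      # offset +24: callback pointer
        if guard:
            invoke(cb)             # models the indirect call
```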

Pipeline Text Parser

The pipeline text parser accepts a nesting grammar where each level specifies the pass manager scope:

module(
  function(
    instcombine<max-iterations=1>,
    dse,
    loop(indvars, loop-deletion)
  ),
  globalopt
)

The parser splits on commas and parentheses, recognizing module(...), cgscc(...), function(...), and loop(...) as scope wrappers. Bare names are looked up in the StringMap built by sub_2342890. For parameterized passes, the <...> suffix is extracted and dispatched to per-pass option parsers. Several NVIDIA-specific parameter parsers are thin wrappers around sub_233A120:

| Parser | Pass | Recognized Options |
|---|---|---|
| sub_233A330 | process-restrict | propagate-only |
| sub_233A370 | lower-struct-args | opt-byval |
| sub_233A3B0 | lower-aggr-copies | lower-aggr-func-args |

More complex passes (GVN, SimplifyCFG, InstCombine) use chained sub_9691B0 string comparisons for multi-option parsing.
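The grammar above is small enough to model with a recursive-descent sketch. This is an illustrative parser, not the recovered one: it handles the four scope wrappers and <...> parameter suffixes, returning a nested list in which scopes become (name, children) tuples.

```python
SCOPES = {"module", "cgscc", "function", "loop"}

def parse_pipeline(text):
    """Parse a pipeline string like 'module(function(dse),globalopt)'."""
    pos = 0
    def parse_list(stop):
        nonlocal pos
        items = []
        while pos < len(text) and text[pos] != stop:
            name = ''
            while pos < len(text) and text[pos] not in ',()<':
                name += text[pos]; pos += 1
            name = name.strip()
            if pos < len(text) and text[pos] == '(' and name in SCOPES:
                pos += 1                       # consume '('
                items.append((name, parse_list(')')))
                pos += 1                       # consume ')'
            else:
                params = ''
                if pos < len(text) and text[pos] == '<':
                    depth = 0                  # copy balanced <...> suffix
                    while pos < len(text):
                        params += text[pos]
                        depth += text[pos] == '<'
                        depth -= text[pos] == '>'
                        pos += 1
                        if depth == 0:
                            break
                if name:
                    items.append(name + params)
            if pos < len(text) and text[pos] == ',':
                pos += 1
        return items
    return parse_list(None)
```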

The pipeline name strings recognized by the nvopt<> dispatch table are:

| Pipeline Name | CLI Source | Pass Count |
|---|---|---|
| nvopt<O0> | (no -O flag, no -Ofc) | ~5--8 |
| nvopt<O1> | -O1 | ~35 |
| nvopt<O2> | -O2 | ~35+ |
| nvopt<O3> | -O3 | ~35+ |
| nvopt<Ofcmax> | -Ofast-compile=max / -Ofc=max | ~12--15 |
| nvopt<Ofcmid> | -Ofast-compile=mid / -Ofc=mid | ~25--30 |
| nvopt<Ofcmin> | -Ofast-compile=min / -Ofc=min | ~30--35 |

Key addresses for pipeline name dispatch: sub_226C400 selects the pipeline name string, which is passed to sub_2277440 (pipeline text parser). The nvopt prefix is registered in sub_225D540 (new PM) and sub_12C35D0 (legacy PM), both calling into a pipeline builder class at vtable unk_4A08350.

Mutual exclusion: combining -O# with --passes= is an error: "Cannot specify -O#/-Ofast-compile=<min,mid,max> and --passes=/--foo-pass, use -passes='default<O#>,other-pass' or -passes='default<Ofcmax>,other-pass'".
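The dispatch and mutual-exclusion rules reduce to a few lines. Flag spellings follow the table above; the precedence between -Ofc and -O#, and the function shape, are assumptions for illustration.

```python
def select_pipeline(opt_level=None, ofc=None, passes=None):
    """Pick the nvopt<...> pipeline-name string from CLI flags, or pass
    through an explicit --passes= string; combining the two is an error."""
    if passes is not None and (opt_level is not None or ofc is not None):
        raise ValueError(
            "Cannot specify -O#/-Ofast-compile=<min,mid,max> and --passes=")
    if passes is not None:
        return passes
    if ofc is not None:                 # ofc in {"min", "mid", "max"}
        return f"nvopt<Ofc{ofc}>"
    if opt_level is None:
        return "nvopt<O0>"
    return f"nvopt<O{opt_level}>"
```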

Complete Pass Inventory

The following tables list every pass in exact registration order within sub_2342890. NVIDIA-specific passes are marked with bold names. Registration line numbers are from the decompiled output.

Module Analyses (18)

| # | Pass Name | LLVM Class | Reg. Line |
|---|---|---|---|
| 1 | callgraph | CallGraphAnalysis | 514 |
| 2 | collector-metadata | CollectorMetadataAnalysis | |
| 3 | ctx-prof-analysis | CtxProfAnalysis | |
| 4 | dxil-metadata | DXILMetadataAnalysis | |
| 5 | dxil-resource-binding | DXILResourceBindingAnalysis | |
| 6 | dxil-resource-type | DXILResourceTypeAnalysis | |
| 7 | inline-advisor | InlineAdvisorAnalysis | |
| 8 | ir-similarity | IRSimilarityAnalysis | |
| 9 | last-run-tracking | via sub_2342820 | |
| 10 | lcg | LazyCallGraphAnalysis | |
| 11 | module-summary | ModuleSummaryIndexAnalysis | |
| 12 | no-op-module | NoOpModuleAnalysis | |
| 13 | pass-instrumentation | via sub_2342830 | |
| 14 | profile-summary | ProfileSummaryAnalysis | |
| 15 | reg-usage | PhysicalRegisterUsageAnalysis | |
| 16 | stack-safety | StackSafetyGlobalAnalysis | |
| 17 | verify | via sub_2342840 | 596 |
| 18 | globals-aa | GlobalsAA | |

Module Passes (131)

Registration lines 599--1153 in sub_2342890. Entries 19--121 are standard LLVM; entries 122--131 are NVIDIA custom passes registered at lines 1096--1153.

Standard LLVM Module Passes (entries 19--131)

#Pass NameLLVM Class
19always-inlineAlwaysInlinerPass
20annotation2metadataAnnotation2MetadataPass
21assign-guidAssignGUIDPass
22attributorAttributorPass
23attributor-lightAttributorLightPass
24called-value-propagationCalledValuePropagationPass
25canonicalize-aliasesCanonicalizeAliasesPass
26check-debugifyNewPMCheckDebugifyPass
27constmergeConstantMergePass
28coro-cleanupCoroCleanupPass
29coro-earlyCoroEarlyPass
30cross-dso-cfiCrossDSOCFIPass
31ctx-instr-genPGOInstrumentationGen
32ctx-prof-flattenPGOCtxProfFlatteningPass
33noinline-nonprevailingNoinlineNonPrevailing
34deadargelimDeadArgumentEliminationPass
35debugifyNewPMDebugifyPass
36dfsanDataFlowSanitizerPass
37dot-callgraphCallGraphDOTPrinterPass
38dxil-upgradeDXILUpgradePass
39elim-avail-externEliminateAvailableExternallyPass
40extract-blocksBlockExtractorPass
41expand-variadicsExpandVariadicsPass
42forceattrsForceFunctionAttrsPass
43function-importFunctionImportPass
44global-merge-funcGlobalMergeFuncPass
45globaloptGlobalOptPass
46globalsplitGlobalSplitPass
47hotcoldsplitHotColdSplittingPass
48inferattrsInferFunctionAttrsPass
49inliner-ml-advisor-releasevia sub_2342850 (InlinerWrapper)
50inliner-wrappervia sub_2342850 (InlinerWrapper)
51inliner-wrapper-no-mandatory-firstvia sub_2342850
52insert-gcov-profilingGCOVProfilerPass
53instrorderfileInstrOrderFilePass
54instrprofInstrProfilingLoweringPass
55ctx-instr-lowerPGOCtxProfLoweringPass
56print<ctx-prof-analysis>CtxProfAnalysisPrinterPass
57invalidate<all>via sub_2342860
58iroutlinerIROutlinerPass
59jmc-instrumenterJMCInstrumenterPass
60lower-emutlsLowerEmuTLSPass
61lower-global-dtorsLowerGlobalDtorsPass
62lower-ifuncLowerIFuncPass
63lowertypetestsLowerTypeTestsPass
64fatlto-cleanupFatLtoCleanup
65pgo-force-function-attrsPGOForceFunctionAttrsPass
66memprof-context-disambiguationMemProfContextDisambiguation
67memprof-moduleModuleMemProfilerPass
68mergefuncMergeFunctionsPass
69metarenamerMetaRenamerPass
70module-inlineModuleInlinerPass
71name-anon-globalsNameAnonGlobalPass
72no-op-moduleNoOpModulePass
73nsanNumericalStabilitySanitizerPass
74objc-arc-apelimObjCARCAPElimPass
75openmp-optOpenMPOptPass
76openmp-opt-postlinkOpenMPOptPass
77partial-inlinerPartialInlinerPass
78pgo-icall-promPGOIndirectCallPromotion
79pgo-instr-genPGOInstrumentationGen
80pgo-instr-usePGOInstrumentationUse
81pre-isel-intrinsic-loweringPreISelIntrinsicLoweringPass
82printPrintModulePass
83print-callgraphCallGraphPrinterPass
84print-callgraph-sccsCallGraphSCCsPrinterPass
85print-ir-similarityIRSimilarityAnalysisPrinterPass
86print-lcgLazyCallGraphPrinterPass
87print-lcg-dotLazyCallGraphDOTPrinterPass
88print-must-be-executed-contextsMustBeExecutedContextPrinterPass
89print-profile-summaryProfileSummaryPrinterPass
90print-stack-safetyStackSafetyGlobalPrinterPass
91print<dxil-metadata>DXILMetadataAnalysisPrinterPass
92print<dxil-resource-binding>DXILResourceBindingPrinterPass
93print<inline-advisor>InlineAdvisorAnalysisPrinterPass
94print<module-debuginfo>ModuleDebugInfoPrinterPass
95print<reg-usage>PhysicalRegisterUsageInfoPrinterPass
96pseudo-probeSampleProfileProbePass
97pseudo-probe-updatePseudoProbeUpdatePass
98recompute-globalsaaRecomputeGlobalsAAPass
99rel-lookup-table-converterRelLookupTableConverterPass
100rewrite-statepoints-for-gcRewriteStatepointsForGC
101rewrite-symbolsRewriteSymbolPass
102rpo-function-attrsReversePostOrderFunctionAttrsPass
103rtsanRealtimeSanitizerPass
104sample-profileSampleProfileLoaderPass
105sancov-moduleSanitizerCoveragePass
106sanmd-moduleSanitizerBinaryMetadataPass
107scc-oz-module-inlinervia sub_2342850 (InlinerWrapper)
108shadow-stack-gc-loweringShadowStackGCLoweringPass
109stripStripSymbolsPass
110strip-dead-debug-infoStripDeadDebugInfoPass
111strip-dead-prototypesStripDeadPrototypesPass
112strip-debug-declareStripDebugDeclarePass
113strip-nondebugStripNonDebugSymbolsPass
114strip-nonlinetable-debuginfoStripNonLineTableDebugInfoPass
115trigger-crash-moduleTriggerCrashModulePass
116trigger-verifier-errorTriggerVerifierErrorPass
117tsan-moduleModuleThreadSanitizerPass
118tysanTypeSanitizerPass
119verifyvia sub_2342870
120view-callgraphCallGraphViewerPass
121wholeprogramdevirtWholeProgramDevirtPass

NVIDIA Module Passes (entries 122--131)

| # | Pass Name | LLVM Class | Reg. Line | Purpose |
|---|---|---|---|---|
| 122 | check-gep-index | CheckGepIndexPass | 1096 | Validates GEP index bounds |
| 123 | check-kernel-functions | NVPTXSetFunctionLinkagesPass | 1101 | Enforces kernel linkage |
| 124 | cnp-launch-check | CNPLaunchCheckPass | 1106 | Cooperative launch validation |
| 125 | ipmsp | IPMSPPass | 1111 | Inter-procedural memory space propagation |
| 126 | nv-early-inliner | via sub_2342850 | 1114 | NVIDIA early inlining heuristic |
| 127 | nv-inline-must | InlineMustPass | 1119 | Force-inlines __forceinline__ functions |
| 128 | nvvm-pretreat | PretreatPass | 1124 | IR canonicalization before optimization |
| 129 | nvvm-verify | NVVMIRVerifierPass | 1129 | NVVM IR constraint validation |
| 130 | printf-lowering | PrintfLoweringPass | 1134 | Lowers printf to vprintf ABI |
| 131 | select-kernels | SelectKernelsPass | 1139 | Selects kernels for compilation |

Parameterized Module Passes (entries 132--145)

| # | Pass Name | Class | Parameters |
|---|---|---|---|
| 132 | asan | AddressSanitizerPass | kernel |
| 133 | cg-profile | CGProfilePass | in-lto-post-link |
| 134 | global-merge | GlobalMergePass | group-by-use;ignore-single-use;max-offset=N |
| 135 | embed-bitcode | EmbedBitcodePass | thinlto;emit-summary |
| 136 | globaldce | GlobalDCEPass | in-lto-post-link |
| 137 | hwasan | HWAddressSanitizerPass | kernel;recover |
| 138 | internalize | InternalizePass | preserve-gv=GV |
| 139 | ipsccp | IPSCCPPass | no-func-spec;func-spec |
| 140 | loop-extract | LoopExtractorPass | single |
| 141 | memprof-use | MemProfUsePass | profile-filename=S |
| 142 | msan | MemorySanitizerPass | recover;kernel;eager-checks;track-origins=N |
| 143 | print<structural-hash> | StructuralHashPrinterPass | detailed;call-target-ignored |
| 144 | lower-ops | LowerOpsPass | enable-optimization |
| 145 | set-global-array-alignment | SetGlobalArrayAlignmentPass | modify-shared-mem;skip-shared-mem;modify-global-mem;skip-global-mem |

CGSCC Analyses and Passes (entries 146--158)

| # | Pass Name | LLVM Class | Level |
|---|---|---|---|
| 146 | no-op-cgscc | NoOpCGSCCAnalysis | Analysis |
| 147 | fam-proxy | FunctionAnalysisManagerCGSCCProxy | Analysis |
| 148 | pass-instrumentation | via sub_2342830 | Analysis |
| 149 | argpromotion | ArgumentPromotionPass | Pass |
| 150 | attributor-cgscc | AttributorCGSCCPass | Pass |
| 151 | attributor-light-cgscc | AttributorLightCGSCCPass | Pass |
| 152 | invalidate<all> | via sub_2342860 | Pass |
| 153 | no-op-cgscc | NoOpCGSCCPass | Pass |
| 154 | openmp-opt-cgscc | OpenMPOptCGSCCPass | Pass |
| 155 | coro-annotation-elide | CoroAnnotationElidePass | Pass |
| 156 | coro-split | CoroSplitPass | Param: reuse-storage |
| 157 | function-attrs | PostOrderFunctionAttrsPass | Param: skip-non-recursive-function-attrs |
| 158 | inline | InlinerPass | Param: only-mandatory |

Function Analyses (entries 159--201)

Registration lines 1208--1415 in sub_2342890.

#Pass NameLLVM Class
159aaAAManager
160access-infoLoopAccessAnalysis
161assumptionsAssumptionAnalysis
162bb-sections-profile-readerBasicBlockSectionsProfileReaderAnalysis
163block-freqBlockFrequencyAnalysis
164branch-probBranchProbabilityAnalysis
165cyclesCycleAnalysis
166daDependenceAnalysis
167debug-ataDebugAssignmentTrackingAnalysis
168demanded-bitsDemandedBitsAnalysis
169domfrontierDominanceFrontierAnalysis
170domtreeDominatorTreeAnalysis
171func-propertiesFunctionPropertiesAnalysis
172machine-function-infoMachineFunctionAnalysis
173gc-functionGCFunctionAnalysis
174inliner-size-estimatorInlineSizeEstimatorAnalysis
175last-run-trackingvia sub_2342820
176lazy-value-infoLazyValueAnalysis
177loopsLoopAnalysis
178memdepMemoryDependenceAnalysis
179memoryssaMemorySSAAnalysis
180no-op-functionNoOpFunctionAnalysis
181opt-remark-emitOptimizationRemarkEmitterAnalysis
182pass-instrumentationvia sub_2342830
183phi-valuesPhiValuesAnalysis
184postdomtreePostDominatorTreeAnalysis
185regionsRegionInfoAnalysis
186scalar-evolutionScalarEvolutionAnalysis
187should-not-run-function-passesShouldNotRunFunctionPassesAnalysis
188should-run-extra-vector-passesShouldRunExtraVectorPasses
189ssp-layoutSSPLayoutAnalysis
190stack-safety-localStackSafetyAnalysis
191target-irTargetIRAnalysis
192target-lib-infoTargetLibraryAnalysis
193uniformityUniformityInfoAnalysis
194verifyvia sub_2342840
195rpaRegisterPressureAnalysis
196merge-setsMergeSetsAnalysis

Function AA Analyses (entries 197--201)

| # | Pass Name | LLVM Class |
|---|---|---|
| 197 | basic-aa | BasicAA |
| 198 | objc-arc-aa | objcarc::ObjCARCAA |
| 199 | scev-aa | SCEVAA |
| 200 | scoped-noalias-aa | ScopedNoAliasAA |
| 201 | tbaa | TypeBasedAA |

Function Passes (entries 202--419)

Registration lines 1420--2319 in sub_2342890. Entries 202--375 are standard LLVM; entries 376--392 are NVIDIA-specific; entries 393--419 are parameterized passes (both standard and NVIDIA).

Standard LLVM Function Passes (entries 202--375)

#Pass NameLLVM Class
202aa-evalAAEvaluator
203adceADCEPass
204add-discriminatorsAddDiscriminatorsPass
205aggressive-instcombineAggressiveInstCombinePass
206alignment-from-assumptionsAlignmentFromAssumptionsPass
207annotation-remarksAnnotationRemarksPass
208assume-builderAssumeBuilderPass
209assume-simplifyAssumeSimplifyPass
210atomic-expandAtomicExpandPass
211bdceBDCEPass
212break-crit-edgesBreakCriticalEdgesPass
213callbr-prepareCallBrPreparePass
214callsite-splittingCallSiteSplittingPass
215chrControlHeightReductionPass
216codegenprepareCodeGenPreparePass
217complex-deinterleavingComplexDeinterleavingPass
218consthoistConstantHoistingPass
219constraint-eliminationConstraintEliminationPass
220coro-elideCoroElidePass
221correlated-propagationCorrelatedValuePropagationPass
222count-visitsCountVisitsPass
223dceDCEPass
224declare-to-assignAssignmentTrackingPass
225dfa-jump-threadingDFAJumpThreadingPass
226div-rem-pairsDivRemPairsPass
227dot-cfgCFGPrinterPass
228dot-cfg-onlyCFGOnlyPrinterPass
229dot-domDOTGraphTraitsPrinter<DominatorTree, false>
230dot-dom-onlyDOTGraphTraitsPrinter<DominatorTree, true>
231dot-post-domDOTGraphTraitsPrinter<PostDominatorTree, false>
232dot-post-dom-onlyDOTGraphTraitsPrinter<PostDominatorTree, true>
233dseDSEPass
234dwarf-eh-prepareDwarfEHPreparePass
235expand-large-div-remExpandLargeDivRemPass
236expand-large-fp-convertExpandLargeFpConvertPass
237expand-memcmpExpandMemCmpPass
238extra-vector-passesExtraFunctionPassManager<ShouldRunExtraVectorPasses>
239fix-irreducibleFixIrreduciblePass
240flatten-cfgFlattenCFGPass
241float2intFloat2IntPass
242gc-loweringGCLoweringPass
243guard-wideningvia sub_2342880
244gvn-hoistGVNHoistPass
245gvn-sinkGVNSinkPass
246helloworldHelloWorldPass
247indirectbr-expandIndirectBrExpandPass
248infer-address-spacesInferAddressSpacesPass
249infer-alignmentInferAlignmentPass
250inject-tli-mappingsInjectTLIMappings
251instcountInstCountPass
252instnamerInstructionNamerPass
253instsimplifyInstSimplifyPass
254interleaved-accessInterleavedAccessPass
255interleaved-load-combineInterleavedLoadCombinePass
256invalidate<all>via sub_2342860
257irceIRCEPass
258jump-threadingJumpThreadingPass
259jump-table-to-switchJumpTableToSwitchPass
260kcfiKCFIPass
261kernel-infoKernelInfoPrinter
262lcssaLCSSAPass
263libcalls-shrinkwrapLibCallsShrinkWrapPass
264lintLintPass
265load-store-vectorizerLoadStoreVectorizerPass
266loop-data-prefetchLoopDataPrefetchPass
267loop-distributeLoopDistributePass
268loop-fusionLoopFusePass
269loop-load-elimLoopLoadEliminationPass
270loop-simplifyLoopSimplifyPass
271loop-sinkLoopSinkPass
272loop-versioningLoopVersioningPass
273lower-atomicLowerAtomicPass
274lower-constant-intrinsicsLowerConstantIntrinsicsPass
275lower-expectLowerExpectIntrinsicPass
276lower-guard-intrinsicLowerGuardIntrinsicPass
277lower-invokeLowerInvokePass
278lower-widenable-conditionLowerWidenableConditionPass
279make-guards-explicitMakeGuardsExplicitPass
280mem2regPromotePass
281memcpyoptMemCpyOptPass
282memprofMemProfilerPass
283mergeicmpsMergeICmpsPass
284mergereturnUnifyFunctionExitNodesPass
285move-auto-initMoveAutoInitPass
286nary-reassociateNaryReassociatePass
287newgvnNewGVNPass
288no-op-functionNoOpFunctionPass
289normalizeIRNormalizerPass
290objc-arcObjCARCOptPass
291objc-arc-contractObjCARCContractPass
292objc-arc-expandObjCARCExpandPass
293pa-evalPAEvalPass
294partially-inline-libcallsPartiallyInlineLibCallsPass
295pgo-memop-optPGOMemOPSizeOpt
296place-safepointsPlaceSafepointsPass
297printPrintFunctionPass
298--338print<access-info> ... print-predicateinfo(41 printer passes)
339reassociateReassociatePass
340redundant-dbg-inst-elimRedundantDbgInstEliminationPass
341reg2memRegToMemPass
342safe-stackSafeStackPass
343sandbox-vectorizerSandboxVectorizerPass
344scalarize-masked-mem-intrinScalarizeMaskedMemIntrinPass
345sccpSCCPPass
346select-optimizeSelectOptimizePass
347separate-const-offset-from-gepSeparateConstOffsetFromGEPPass
348sinkSinkingPass
349sjlj-eh-prepareSjLjEHPreparePass
350slp-vectorizerSLPVectorizerPass
351slsrStraightLineStrengthReducePass
352stack-protectorStackProtectorPass
353strip-gc-relocatesStripGCRelocates
354tailcallelimTailCallElimPass
355transform-warningWarnMissedTransformationsPass
356trigger-crash-functionTriggerCrashFunctionPass
357trigger-verifier-errorTriggerVerifierErrorPass
358tsanThreadSanitizerPass
359unify-loop-exitsUnifyLoopExitsPass
360vector-combineVectorCombinePass
361verifyvia sub_2342870
362--368verify<cycles> ... verify<scalar-evolution>(7 verifiers)
369--374view-cfg ... view-post-dom-only(6 viewers)
375wasm-eh-prepareWasmEHPreparePass

NVIDIA Function Passes (entries 376--392)

Registered at lines 2212--2292 of sub_2342890.

| # | Pass Name | LLVM Class | Reg. Line | Purpose |
|---|---|---|---|---|
| 376 | basic-dbe | BasicDeadBarrierEliminationPass | 2212 | Removes dead bar.sync instructions |
| 377 | branch-dist | BranchDistPass | 2217 | Branch distribution for divergence control |
| 378 | byval-mem2reg | ByValMem2RegPass | 2222 | Promotes byval arguments to registers |
| 379 | bypass-slow-division | BypassSlowDivisionPass | 2227 | Fast-path for small-operand division |
| 380 | normalize-gep | NormalizeGepPass | 2232 | GEP canonicalization for address arithmetic |
| 381 | nvvm-reflect-pp | SimplifyConstantConditionalsPass | 2237 | Folds __nvvm_reflect results (post-processing) |
| 382 | nvvm-peephole-optimizer | NVVMPeepholeOptimizerPass | 2242 | NVVM-specific peephole rewrites |
| 383 | old-load-store-vectorizer | OldLoadStoreVectorizerPass | 2247 | Legacy load/store vectorization |
| 384 | print<merge-sets> | MergeSetsAnalysisPrinterPass | 2252 | Printer for merge-sets analysis |
| 385 | remat | RematerializationPass | 2257 | Register-pressure-aware rematerialization |
| 386 | print<rpa> | RegisterPressurePrinterPass | 2262 | Printer for register pressure analysis |
| 387 | propagate-alignment | PropagateAlignmentPass | 2267 | Propagates alignment through pointer chains |
| 388 | reuse-local-memory | ReuseLocalMemoryPass | 2272 | Shares local memory across kernels |
| 389 | set-local-array-alignment | SetLocalArrayAlignmentPass | 2277 | Aligns stack arrays for coalescing |
| 390 | sinking2 | Sinking2Pass | 2282 | Enhanced instruction sinking |
| 391 | d2ir-scalarizer | ScalarizerPass (NVIDIA alias) | 2287 | NVIDIA-branded scalarization |
| 392 | sink<rp-aware> | SinkingPass (variant) | 2292 | Register-pressure-aware sinking |

Parameterized Function Passes (entries 393--419)

| # | Pass Name | Class | Parameters |
|---|---|---|---|
| 393 | cfguard | CFGuardPass | check;dispatch |
| 394 | early-cse | EarlyCSEPass | memssa |
| 395 | ee-instrument | EntryExitInstrumenterPass | post-inline |
| 396 | function-simplification | (byte_3F871B3) | O1;O2;O3;Os;Oz |
| 397 | gvn | GVNPass | no-pre;pre;no-load-pre;load-pre;... |
| 398 | instcombine | InstCombinePass | no-aggressive-aggregate-splitting;...;max-iterations=N |
| 399 | loop-unroll | LoopUnrollPass | O0;O1;O2;O3;full-unroll-max=N;... |
| 400 | loop-vectorize | LoopVectorizePass | no-interleave-forced-only;... |
| 401 | lower-allow-check | LowerAllowCheckPass | (empty) |
| 402 | lower-matrix-intrinsics | LowerMatrixIntrinsicsPass | minimal |
| 403 | lower-switch | LowerSwitchPass | enable-jump-table |
| 404 | mldst-motion | MergedLoadStoreMotionPass | no-split-footer-bb;split-footer-bb |
| 405 | print<da> | DependenceAnalysisPrinterPass | normalized-results |
| 406 | print<memoryssa> | MemorySSAPrinterPass | no-ensure-optimized-uses |
| 407 | print<stack-lifetime> | StackLifetimePrinterPass | may;must |
| 408 | scalarizer | ScalarizerPass | load-store;no-load-store;variable-insert-extract;... |
| 409 | separate-const-offset-from-gep | SeparateConstOffsetFromGEPPass | lower-gep |
| 410 | simplifycfg | SimplifyCFGPass | simplify-unreachable;...;bonus-inst-threshold=N |
| 411 | speculative-execution | SpeculativeExecutionPass | only-if-divergent-target |
| 412 | sroa | SROAPass | preserve-cfg;modify-cfg |
| 413 | structurizecfg | StructurizeCFG | skip-uniform-regions |
| 414 | win-eh-prepare | WinEHPreparePass | demote-catchswitch-only |
| 415 | bounds-checking | BoundsCheckingPass (modified) | trap |
| 416 | memory-space-opt | MemorySpaceOptPass | first-time;second-time;no-warnings;warnings |
| 417 | lower-aggr-copies | LowerAggrCopiesPass | lower-aggr-func-args |
| 418 | lower-struct-args | LowerStructArgsPass | opt-byval |
| 419 | process-restrict | ProcessRestrictPass | propagate-only |

LoopNest Passes (entries 420--423)

#Pass NameLLVM Class
420loop-flattenLoopFlattenPass
421loop-interchangeLoopInterchangePass
422loop-unroll-and-jamLoopUnrollAndJamPass
423no-op-loopnestNoOpLoopNestPass

Loop Analyses (entries 424--428)

#Pass NameLLVM Class
424ddgDDGAnalysis
425iv-usersIVUsersAnalysis
426no-op-loopNoOpLoopAnalysis
427pass-instrumentationvia sub_2342830
428should-run-extra-simple-loop-unswitchShouldRunExtraSimpleLoopUnswitch

Loop Passes (entries 429--455)

#Pass NameLLVM Class
429canon-freezeCanonicalizeFreezeInLoopsPass
430dot-ddgDDGDotPrinterPass
431guard-wideningvia sub_2342880
432extra-simple-loop-unswitch-passesExtraLoopPassManager<...>
433indvarsIndVarSimplifyPass
434invalidate<all>via sub_2342860
435loop-bound-splitLoopBoundSplitPass
436loop-deletionLoopDeletionPass
437loop-idiomLoopIdiomRecognizePass
438loop-idiom-vectorizeLoopIdiomVectorizePass
439loop-instsimplifyLoopInstSimplifyPass
440loop-predicationLoopPredicationPass
441loop-reduceLoopStrengthReducePass
442loop-term-foldLoopTermFoldPass
443loop-simplifycfgLoopSimplifyCFGPass
444loop-unroll-fullLoopFullUnrollPass
445loop-versioning-licmLoopVersioningLICMPass
446no-op-loopNoOpLoopPass
447printPrintLoopPass
448--450print<ddg>, print<iv-users>, print<loop-cache-cost>, print<loopnest>(printers)
451loop-index-splitLoopIndexSplitPass

Parameterized Loop Passes (entries 452--455)

| # | Pass Name | Class | Parameters |
|---|---|---|---|
| 452 | licm | LICMPass | allowspeculation;conservative-calls |
| 453 | lnicm | LNICMPass | allowspeculation |
| 454 | loop-rotate | LoopRotatePass | no-header-duplication;header-duplication;... |
| 455 | simple-loop-unswitch | SimpleLoopUnswitchPass | nontrivial;no-nontrivial;trivial;no-trivial |

Machine Function Analyses (entries 456--475)

#Pass NameLLVM Class
456edge-bundlesEdgeBundlesAnalysis
457livedebugvarsLiveDebugVariablesAnalysis
458live-intervalsLiveIntervalsAnalysis
459live-reg-matrixLiveRegMatrixAnalysis
460live-stacksLiveStacksAnalysis
461live-varsLiveVariablesAnalysis
462machine-block-freqMachineBlockFrequencyAnalysis
463machine-branch-probMachineBranchProbabilityAnalysis
464machine-cyclesMachineCycleAnalysis
465machine-dom-treeMachineDominatorTreeAnalysis
466machine-loopsMachineLoopAnalysis
467machine-opt-remark-emitterMachineOptimizationRemarkEmitterAnalysis
468machine-post-dom-treeMachinePostDominatorTreeAnalysis
469machine-trace-metricsMachineTraceMetricsAnalysis
470pass-instrumentationvia sub_2342830
471regalloc-evictRegAllocEvictionAdvisorAnalysis
472regalloc-priorityRegAllocPriorityAdvisorAnalysis
473slot-indexesSlotIndexesAnalysis
474spill-code-placementSpillPlacementAnalysis
475virtregmapVirtRegMapAnalysis

Machine Function Passes (entries 476--526)

#Pass NameLLVM Class
476dead-mi-eliminationDeadMachineInstructionElimPass
477detect-dead-lanesDetectDeadLanesPass
478early-ifcvtEarlyIfConverterPass
479early-machinelicmEarlyMachineLICMPass
480early-tailduplicationEarlyTailDuplicatePass
481finalize-iselFinalizeISelPass
482fixup-statepoint-caller-savedFixupStatepointCallerSavedPass
483localstackallocLocalStackSlotAllocationPass
484machine-cpMachineCopyPropagationPass
485machine-cseMachineCSEPass
486machine-latecleanupMachineLateInstrsCleanupPass
487machine-schedulerMachineSchedulerPass
488machinelicmMachineLICMPass
489no-op-machine-functionNoOpMachineFunctionPass
490opt-phisOptimizePHIsPass
491patchable-functionPatchableFunctionPass
492peephole-optPeepholeOptimizerPass
493phi-node-eliminationPHIEliminationPass
494post-RA-schedPostRASchedulerPass
495postmischedPostMachineSchedulerPass
496post-ra-pseudosExpandPostRAPseudosPass
497printPrintMIRPass
498--510print<livedebugvars> ... print<virtregmap>(13 MF printers)
511reg-usage-collectorRegUsageInfoCollectorPass
512reg-usage-propagationRegUsageInfoPropagationPass
513register-coalescerRegisterCoalescerPass
514rename-independent-subregsRenameIndependentSubregsPass
515remove-redundant-debug-valuesRemoveRedundantDebugValuesPass
516require-all-machine-function-propertiesRequireAllMachineFunctionPropertiesPass
517stack-coloringStackColoringPass
518stack-slot-coloringStackSlotColoringPass
519tailduplicationTailDuplicatePass
520trigger-verifier-errorTriggerVerifierErrorPass
521two-address-instructionTwoAddressInstructionPass
522verifyMachineVerifierPass
523verify<machine-trace-metrics>MachineTraceMetricsVerifierPass
524machine-sinkMachineSinkingPass (parameterized)
525regallocfastRegAllocFastPass (parameterized)
526greedyRAGreedyPass (parameterized, LAST registered)

No NVIDIA-specific machine function passes were identified in the registration table; NVIDIA's machine-level customizations are implemented through target hooks in the NVPTX backend rather than as separately registered passes.

Runtime Pass Execution Order

Registration order (above) describes what is known to the pipeline parser. Runtime execution order is determined by sub_12E54A0 (the pipeline assembler) and controlled by the tier system. The execution order varies dramatically depending on: (1) optimization level, (2) fast-compile mode, (3) language string, and (4) individual pass enable/disable flags in NVVMPassOptions.

The AddPass Mechanism -- sub_12DE0B0

All runtime pass insertion uses sub_12DE0B0 (0x12DE0B0), a hash-table-based function that:

  1. Hashes the pass pointer: (pass >> 9) ^ (pass >> 4)
  2. Probes an open-addressed hash table at passMgr+80
  3. Stores the pass pointer and a flags byte (flags | 2 if barrier set)
  4. Appends the pass pointer to a dynamic array at passMgr[0]
  5. Increments the counter at passMgr+8

The third parameter encodes pass type: 0 = ModulePass/AnalysisPass, 1 = FunctionPass. The fourth parameter is a scheduling barrier hint.
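Putting the five steps together, a simplified Python model of sub_12DE0B0 follows. The exact flags-byte encoding and the table's growth policy are assumptions, and duplicates are not deduplicated in this model.

```python
class PassManagerModel:
    """Toy model of the pass-manager state touched by sub_12DE0B0."""
    def __init__(self, nslots=64):
        self.table = [None] * nslots   # models the hash table at passMgr+80
        self.order = []                # models the dynamic array at passMgr[0]
        self.count = 0                 # models the counter at passMgr+8

    def add_pass(self, pass_ptr, is_function_pass, barrier=False):
        flags = 1 if is_function_pass else 0   # third parameter: pass type
        if barrier:
            flags |= 2                         # fourth parameter: barrier hint
        # 1. Hash the pass pointer.
        i = ((pass_ptr >> 9) ^ (pass_ptr >> 4)) % len(self.table)
        # 2. Probe the open-addressed table.
        while self.table[i] is not None and self.table[i][0] != pass_ptr:
            i = (i + 1) % len(self.table)
        # 3. Store the pass pointer and flags byte.
        self.table[i] = (pass_ptr, flags)
        # 4. Append to the ordered pass array.
        self.order.append(pass_ptr)
        # 5. Increment the counter.
        self.count += 1
```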

Tier System Architecture

The tier system is NVIDIA's mechanism for interleaving custom passes with standard LLVM passes at precise points. The main optimization loop in sub_12E54A0 iterates over a plugin/extension pass array at opts[4488..4496] (16-byte stride: vtable + phase_id), and fires tier sub-pipelines when the accumulated phase counter exceeds their thresholds:

// Pseudocode from sub_12E54A0, lines 481-553
for (entry = opts[4488]; entry < opts[4496]; entry += 16) {
    phase_id = entry[8];

    if (opts[4224] && phase_id > opts[4228]) {   // Tier 0
        sub_12DE330(PM, opts);                    // Full optimization
        opts[4224] = 0;                           // Fire once
    }
    if (opts[3528] && phase_id > opts[3532]) {    // Tier 1
        sub_12DE8F0(PM, 1, opts);
        opts[3528] = 0;
    }
    if (opts[3568] && phase_id > opts[3572]) {    // Tier 2
        sub_12DE8F0(PM, 2, opts);
        opts[3568] = 0;
    }
    if (opts[3608] && phase_id > opts[3612]) {    // Tier 3
        sub_12DE8F0(PM, 3, opts);
        opts[3608] = 0;
    }

    pass = entry->vtable[72]();                   // Plugin pass factory call
    sub_12DE0B0(PM, pass, 1, 0);                  // Insert plugin pass

    if (opts[3904])                               // Debug mode
        insert_verifier_after_each();
}
// Remaining unfired tiers fire unconditionally after loop

The tier control fields in the NVVMPassOptions struct:

Offset | Type | Field
+3528 | bool | Tier 1 enable
+3532 | int | Tier 1 phase threshold
+3568 | bool | Tier 2 enable
+3572 | int | Tier 2 phase threshold
+3608 | bool | Tier 3 enable
+3612 | int | Tier 3 phase threshold
+4224 | bool | Tier 0 (full optimization) enable
+4228 | int | Tier 0 phase threshold

Infrastructure Setup (Always Runs)

These five passes are always inserted first, regardless of optimization level:

Pos | Factory | Identity | AddPass Flags
1 | sub_149CCE0 (alloc 368B) | TargetLibraryInfoWrapperPass | (PM, TLI, 0, 0) Module
2 | sub_1BFB520 (alloc 208B) | TargetTransformInfoWrapperPass | (PM, TTI, 1, 0) Function
3 | sub_14A7550 | VerifierPass / BasicAliasAnalysis | (PM, _, 0, 0) Module
4 | sub_1361950 | AssumptionCacheTracker | (PM, _, 0, 0) Module
5 | sub_1CB0F50 | ProfileSummaryInfoWrapperPass | (PM, _, 1, 0) Function

Tier 0 -- Full Optimization (sub_12DE330)

Called when opts[4224] (optimization enabled) and the phase threshold is exceeded. This is the primary optimization sub-pipeline for O1/O2/O3, adding ~40 passes. Address: 0x12DE330.

Confidence note: Pass identifications are based on diagnostic strings, factory-function signatures, and pipeline ordering. Most identifications are HIGH confidence (confirmed by unique string literals). Entries marked [MEDIUM confidence] are inferred from code structure, argument patterns, or address proximity rather than direct string evidence.

Pos | Factory Address | Likely Pass | Guard Condition
1 | sub_1654860(1) | BreakCriticalEdges | always
2 | sub_1A62BF0(1,0,0,1,0,0,1) | LLVM standard pipeline #1 | always
3 | sub_1B26330 | MemCpyOpt | always
4 | sub_185D600 | IPConstantPropagation | always
5 | sub_1C6E800 | GVN | always
6 | sub_1C6E560 | NewGVN/GVNHoist [MEDIUM confidence] | always
7 | sub_1857160 | NVVMReflect | always
8 | sub_1842BC0 | SCCP | always
9 | sub_17060B0(1,0) | PrintModulePass | opts[3160]
10 | sub_12D4560 | NVVMVerifier | always
11 | sub_18A3090 | NVVMPredicateOpt | always
12 | sub_184CD60 | ConstantMerge | always
13 | sub_1869C50(1,0,1) | Sink/MemSSA [MEDIUM confidence] -- three-arg factory matches Sink with MemSSA parameters, but could also be a custom sinking variant | !opts[1040]
14 | sub_1833EB0(3) | TailCallElim/JumpThreading [MEDIUM confidence] -- integer arg=3 could be JumpThreading threshold or TailCallElim mode; no disambiguating string | always
15 | sub_17060B0(1,0) | PrintModulePass | opts[3160]
16 | sub_1952F90(-1) | LoopIndexSplit | always
17 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | always
18 | sub_1A223D0 | NVVMIRVerification | always
19 | sub_17060B0(1,0) | PrintModulePass | opts[3160]
20 | sub_1A7A9F0 | InstructionSimplify | always
21 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | always
22 | sub_1A02540 | GenericToNVVM | always
23 | sub_198DF00(-1) | LoopSimplify | always
24 | sub_1C76260 | ADCE | !opts[1320]
25 | sub_195E880(0) | LICM | opts[2880]
26 | sub_19C1680(0,1) | LoopUnroll | !opts[1360]
27 | sub_17060B0(1,0) | PrintModulePass | opts[3160]
28 | sub_19401A0 | InstCombine | always
29 | sub_1968390 | SROA | always
30 | sub_196A2B0 | EarlyCSE | always
31 | sub_19B73C0(2,-1,-1,-1,-1,-1,-1) | LoopUnswitch | always
32 | sub_17060B0(1,0) | PrintModulePass | opts[3160]
33 | sub_190BB10(0,0) | SimplifyCFG | always
34 | sub_1A13320 | NVVMRematerialization | always
35 | sub_17060B0(1,1) | PrintModulePass (verified) | opts[3160]
36 | sub_18F5480 | DSE | always
37 | sub_18DEFF0 | DCE | always
38 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | always
39 | sub_18B1DE0 | NVVMLoopPass/BarrierOpt [MEDIUM confidence] -- address is in NVVM pass range, but dual name reflects ambiguity between loop optimization and barrier optimization roles | always
40 | sub_1841180 | FunctionAttrs | always

Tier 1/2/3 -- Phase-Specific Sub-pipeline (sub_12DE8F0)

Called with tier number (1, 2, or 3). Address: 0x12DE8F0. Stores the tier value to qword_4FBB410. When tier==3, sets qword_4FBB370 = 6 if BYTE4 was 0 (enables advanced barrier and memory space optimization features).

The pass sequence is significantly longer than Tier 0 and varies by tier. The following shows the superset of all passes that can be inserted; tier-based guards are annotated.

Confidence note: Same methodology as Tier 0 table above. Most identifications confirmed by diagnostic strings or NVVMPassOptions slot cross-references.

Pos | Factory Address | Likely Pass | Guard
1 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000]
2 | sub_1A223D0 | NVVMIRVerification | !opts[2600]
3 | sub_1CB4E40(1) | NVVMIntrinsicLowering (barrier) | !opts[2000]
4 | sub_18E4A00 | NVVMBarrierAnalysis | opts[3488]
5 | sub_1C98160(0) | NVVMLowerBarriers | opts[3488]
6 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080]
7 | sub_12D4560 | NVVMVerifier | !opts[600]
8 | sub_185D600 | IPConstPropagation | opts[3200] && !opts[920]
9 | sub_1857160 | NVVMReflect | opts[3200] && !opts[880]
10 | sub_18A3430 | NVVMPredicateOpt | opts[3200] && !opts[1120]
11 | sub_1842BC0 | SCCP | opts[3200] && !opts[720]
12 | sub_17060B0(1,0) | PrintModulePass | !opts[1080]
13 | sub_12D4560 | NVVMVerifier | !opts[600]
14 | sub_18A3090 | NVVMPredicateOpt variant | opts[3200] && !opts[2160]
15 | sub_184CD60 | ConstantMerge | opts[3200] && !opts[1960]
16 | sub_190BB10(1,0) | SimplifyCFG | tier!=1 && !opts[1040] && !opts[1200]
17 | sub_1952F90(-1) | LoopIndexSplit | (same guard) && !opts[1160]
18 | sub_12D4560 | NVVMVerifier | (same guard) && !opts[600]
19 | sub_17060B0(1,0) | PrintModulePass | (same guard) && !opts[1080]
20 | sub_195E880(0) | LICM | opts[3704] && opts[2880] && !opts[1240]
21 | sub_1C8A4D0(v) | EarlyCSE | v=1 if opts[3704]
22 | sub_1869C50(1,0,1) | Sink | tier!=1 && !opts[1040]
23 | sub_1833EB0(3) | TailCallElim | tier==3 && !opts[320]
24 | sub_1CC3990 | NVVMUnreachableBlockElim | !opts[2360]
25 | sub_18EEA90 | CorrelatedValuePropagation | opts[3040]
26 | sub_12D4560 | NVVMVerifier | !opts[600]
27 | sub_1A223D0 | NVVMIRVerification | !opts[2600]
28 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000]
29 | sub_1C4B6F0 | Inliner | !opts[440] && !opts[480]
30 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080]
31 | sub_1A7A9F0 | InstructionSimplify | !opts[2720]
32 | sub_12D4560 | NVVMVerifier | !opts[600]
33 | sub_1A02540 | GenericToNVVM | !opts[2200]
34 | sub_198DF00(-1) | LoopSimplify | !opts[1520]
35 | sub_1C76260 | ADCE | !opts[1320] && !opts[1480]
36 | sub_17060B0(1,0) | PrintModulePass | (same guard)
37 | sub_12D4560 | NVVMVerifier | (same guard)
38 | sub_195E880(0) | LICM | opts[2880] && !opts[1240]
39 | sub_1C98160(0/1) | NVVMLowerBarriers | opts[3488]
40 | sub_19C1680(0,1) | LoopUnroll | !opts[1360]
41 | sub_17060B0(1,0) | PrintModulePass | !opts[1080]
42 | sub_19401A0 | InstCombine | !opts[1000]
43 | sub_196A2B0 | EarlyCSE | !opts[1440]
44 | sub_1968390 | SROA | !opts[1400]
45 | sub_19B73C0(tier,...) | LoopUnswitch | tier!=1, SM-arch-dependent params
46 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080]
47 | sub_19B73C0(tier,...) | LoopUnswitch (2nd) | !opts[2760]
48 | sub_1A62BF0(1,...) | LLVM standard pipeline | !opts[600]
49 | sub_1A223D0 | NVVMIRVerification | !opts[2600]
50 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000]
51 | sub_17060B0(1,0) | PrintModulePass | !opts[1080]
52 | sub_190BB10(0,0) | SimplifyCFG | !opts[960]
53 | sub_1922F90 | NVIDIA loop pass | opts[3080]
54 | sub_195E880(0) | LICM | opts[2880] && !opts[1240]
55 | sub_1A13320 | NVVMRematerialization | !opts[2320]
56 | sub_1968390 | SROA | !opts[1400]
57 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080]
58 | sub_18EEA90 | CorrelatedValuePropagation | opts[3040]
59 | sub_18F5480 | DSE | !opts[760]
60 | sub_18DEFF0 | DCE | !opts[280]
61 | sub_1A62BF0(1,...) | LLVM standard pipeline | !opts[600]
62 | sub_1AAC510 | NVIDIA-specific pass | !opts[520] && !opts[560]
63 | sub_1A223D0 | NVVMIRVerification | !opts[2600]
64 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000]
65 | sub_1C8E680 | MemorySpaceOpt | !opts[2680], param from opts[3120]
66 | sub_1A223D0 | NVVMIRVerification | opts[3120] && !opts[2600]
67 | sub_17060B0(1,0) | PrintModulePass (barrier) | !opts[1080]
68 | sub_1CC71E0 | NVVMGenericAddrOpt | !opts[2560]
69 | sub_1C98270(1,opts[2920]) | NVVMLowerBarriers variant | opts[3488]
70 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080]
71 | sub_1C6FCA0 | ADCE | opts[2840] && !opts[1840]
72 | sub_18B1DE0 | LoopOpt/BarrierOpt | opts[3200] && !opts[2640]
73 | sub_1857160 | NVVMReflect | opts[3200] && tier==3 && !opts[880]
74 | sub_1841180 | FunctionAttrs | opts[3200] && !opts[680]
75 | sub_1C46000 | NVVMLateOpt | tier==3 && !opts[360]
76 | sub_1841180 | FunctionAttrs (2nd) | opts[3200] && !opts[680]
77 | sub_1CBC480 | NVVMLowerAlloca | !opts[2240] && !opts[2280]
78 | sub_1CB73C0 | NVVMBranchDist | !opts[2080] && !opts[2120]
79 | sub_1C7F370(1) | NVVMWarpShuffle | opts[3328] && !opts[1640]
80 | sub_1CC5E00 | NVVMReduction | opts[3328] && !opts[2400]
81 | sub_1CC60B0 | NVVMSinking2 | opts[3328] && !opts[2440]
82 | sub_1CB73C0 | NVVMBranchDist (2nd) | opts[3328] && !opts[2080] && !opts[2120]
83 | sub_17060B0(1,0) | PrintModulePass | opts[3328] && !opts[1080]
84 | sub_1B7FDF0(3) | Reassociate | opts[3328] && !opts[1280]
85 | sub_17060B0(1,0) | PrintModulePass (final) | opts[3160] && !opts[1080]

Optimization Level Summary

Pipeline | Sub-pipeline called | lsa-opt | mem-space-opt | Approx. passes
nvopt<O0> | (minimal, sub_1C8A4D0(0) only) | off | off | ~5--8
nvopt<Ofcmax> | Sinking2 + common tail only | forced 0 | forced 0 | ~12--15
nvopt<Ofcmid> | mid-level pipeline | normal | enabled | ~25--30
nvopt<Ofcmin> | close to full pipeline | normal | enabled | ~30--35
nvopt<O1> | sub_12DE330 (Tier 0) | normal | enabled | ~35
nvopt<O2> | sub_12DE330 + Tier 1/2 | normal | enabled | ~35+
nvopt<O3> | sub_12DE330 + Tier 1/2/3 | normal | enabled | ~35+

O1/O2/O3 all route through the same sub_12DE330 (Tier 0). The difference manifests through the tiered pass inserter sub_12DE8F0: O1 only fires Tier 1, O2 fires Tiers 1--2, O3 fires all three tiers. Within the tiers, passes additionally vary by: loop unroll factor (parameter to sub_1833EB0), vectorizer width (parameters to sub_19B73C0), CGSCC iteration count (first parameter to sub_1A62BF0), and the SM-architecture-dependent late passes gated by opts[3328].

Ofcmax critical behavior: when fast-compile level == 2 (max), the libnvvm pipeline builder forces -lsa-opt=0 and -memory-space-opt=0 even if the user explicitly enables them. This is confirmed in both sub_9624D0 (line 1358) and sub_12CC750 (line 2025).

Codegen Dispatch -- sub_12DFE00

After all optimization tiers complete, sub_12DFE00 (0x12DFE00) performs codegen pass scheduling. This is NOT a simple pass adder -- it performs a full dependency graph construction:

  1. Reads optimization level from opts[200] (0 = minimal, >1 = enable dependency tracking)
  2. Iterates all passes already in the pass manager
  3. For each pass, calls vtable+112 (isCodeGenOnly()) to filter
  4. Calls vtable+16 (getAnalysisUsage()) to extract dependencies
  5. Builds a secondary hash table of ordering constraints
  6. Dispatches each pass to the codegen subsystem in topological order via the subtarget hook at vtable+16

Pass Classification Statistics

Category | Count
Module analyses | 18
Module passes | ~131
CGSCC analyses | 3
CGSCC passes | ~10
Function analyses | ~39
Function AA analyses | 5
Function passes | ~219
LoopNest passes | 4
Loop analyses | 5
Loop passes | ~26
MachineFunction analyses | 20
MachineFunction passes | ~50
Total | ~526
NVIDIA additions | 33
Standard LLVM | ~493

Complete Pass Factory Address Map

Every unique pass factory address observed in sub_12E54A0, sub_12DE330, and sub_12DE8F0:

Function | Address | Pipeline occurrences
NVVMVerifier | sub_12D4560 | many (tiers)
AssumptionCacheTracker | sub_1361950 | 1
TargetLibraryInfoWrapperPass | sub_149CCE0 | 1
VerifierPass/BasicAA | sub_14A7550 | 1
BreakCriticalEdges | sub_1654860 | 2
PrintModulePass (debug dump) | sub_17060B0 | ~30+
InstructionCombining | sub_1832270 | 2
TailCallElim/JumpThreading | sub_1833EB0 | 3
FunctionAttrs | sub_1841180 | 3
SCCP | sub_1842BC0 | 2
NVVMReflect | sub_1857160 | ~8
IPConstantPropagation | sub_185D600 | 3
Sink (MemorySSA-based) | sub_1869C50 | 3
NVVMPredicateOpt | sub_18A3090 | 2
AggressiveInstCombine | sub_18A3430 | 2
NVVMLoopOpt/BarrierOpt | sub_18B1DE0 | 3
Sinking2Pass (fast-mode) | sub_18B3080 | 1
DCE | sub_18DEFF0 | 4
NVVMBarrierAnalysis | sub_18E4A00 | 1
CorrelatedValuePropagation | sub_18EEA90 | 3
DSE | sub_18F5480 | 2
DeadArgElimination | sub_18FD350 | 5
SimplifyCFG | sub_190BB10 | 4
NVIDIA loop pass | sub_1922F90 | 1
LoopIndexSplit | sub_1952F90 | 3
LICM | sub_195E880 | 4
SROA | sub_1968390 | 2
EarlyCSE | sub_196A2B0 | 2
LoopUnroll/Vectorize | sub_197E720 | 1
LoopSimplify/IndVarSimplify | sub_198DF00 | 3
CorrelatedValuePropagation | sub_198E2A0 | 1
InstCombine | sub_19401A0 | 2
LoopUnswitch | sub_19B73C0 | 3
LoopUnroll | sub_19C1680 | 2
NVIDIA pass (unknown) | sub_19CE990 | 1
GenericToNVVM | sub_1A02540 | 1
NVVMRematerialization | sub_1A13320 | 3
NVVMIRVerification | sub_1A223D0 | 5+
LLVM StandardPassPipeline | sub_1A62BF0 | ~9
LoopIdiomRecognize | sub_1A68E70 | 1
InstructionSimplify | sub_1A7A9F0 | 3
NVIDIA-specific pass | sub_1AAC510 | 1
MemCpyOpt | sub_1B26330 | 4
Reassociate/Sinking | sub_1B7FDF0 | 3
TTIWrapperPass | sub_1BFB520 | 1
NVVMLateOpt | sub_1C46000 | 1
Inliner/AlwaysInline | sub_1C4B6F0 | 2
NewGVN/GVNHoist | sub_1C6E560 | 1
GVN | sub_1C6E800 | 2
ADCE (AggressiveDCE) | sub_1C6FCA0 | 2
ADCE variant | sub_1C76260 | 2
NVVMWarpShuffle | sub_1C7F370 | 1
EarlyCSE/GVN variant | sub_1C8A4D0 | 3
MemorySpaceOpt | sub_1C8E680 | 4
NVVMLowerBarriers | sub_1C98160 | 4
NVVMLowerBarriers variant | sub_1C98270 | 1
ProfileSummaryInfo | sub_1CB0F50 | 1
NVVMIntrinsicLowering | sub_1CB4E40 | ~10
NVVMBranchDist | sub_1CB73C0 | 3
NVVMLowerAlloca | sub_1CBC480 | 1
NVVMUnreachableBlockElim | sub_1CC3990 | 1
NVVMReduction | sub_1CC5E00 | 1
NVVMSinking2 | sub_1CC60B0 | 3
NVVMGenericAddrOpt | sub_1CC71E0 | 1
NVVMFinalLowering | sub_1CEBD10 | 1
NVVMPeephole | sub_1CEF8F0 | 2
NVVMAnnotationsProcessor | sub_215D9D0 | 2

Total unique pass factories: ~65.

NVVMPassOptions Offset-to-Pass Guard Map

The NVVMPassOptions struct (4,512 bytes, 221 slots) controls which passes execute. The pipeline assembler reads boolean flags at specific offsets to gate pass insertion. See NVVMPassOptions for the full slot layout. Key offset-to-pass mappings:

Offset | Slot | Type | Controls
+200 | 9 | int | Optimization level (0/1/2/3)
+280 | 15 | bool | DCE disable
+320 | 17 | bool | TailCallElim/JumpThreading disable
+360 | 19 | bool (default=1) | NVVMLateOpt disable
+600 | 31 | bool | NVVMVerifier disable
+720 | 37 | bool | SCCP disable
+760 | 39 | bool | DSE disable
+880 | 45 | bool | NVVMReflect disable
+920 | 47 | bool | IPConstantPropagation disable
+960 | 49 | bool | SimplifyCFG disable
+1000 | 51 | bool | InstCombine disable
+1040 | 53 | bool | Sink/MemSSA disable
+1080 | 55 | bool | PrintModulePass disable
+1160 | 59 | bool | LoopIndexSplit disable
+1240 | 63 | bool | LICM disable
+1280 | 65 | bool | Reassociate disable
+1320 | 67 | bool | ADCE disable
+1360 | 69 | bool | LoopUnroll disable
+1400 | 71 | bool | SROA disable
+1440 | 73 | bool | EarlyCSE disable
+1760 | 89 | bool | MemorySpaceOpt disable
+2000 | 101 | bool | NVVMIntrinsicLowering disable
+2320 | 117 | bool (default=1) | NVVMRematerialization disable
+2440 | 123 | bool | NVVMSinking2 disable
+2600 | 131 | bool | NVVMIRVerification disable
+2840 | 141 | bool (default=1) | ADCE enable (reversed logic)
+2880 | 143 | bool (default=1) | LICM enable (reversed logic)
+3120 | 155 | bool (default=1) | MemorySpaceOpt (2nd pass) enable
+3160 | 157 | bool (default=1) | PrintModulePass/debug dump enable
+3200 | 159 | bool (default=1) | Advanced NVIDIA passes group enable
+3328 | 165 | bool (default=1) | SM-specific late passes enable
+3488 | 175 | bool | Barrier optimization enable
+3648 | 181 | ptr | Language string ("ptx"/"mid"/"idn")
+3656 | -- | int | Language string length
+3704 | 185 | bool | Late optimization / address-space flag
+4064 | 201 | bool | Concurrent compilation enable
+4104 | 203 | int (default=-1) | Thread count
+4224 | 211 | bool (default=1) | Master optimization enable
+4304 | 213 | bool | Device-code / separate-compilation flag
+4384 | 217 | bool | Fast-compile bypass (skip LLVM pipeline)
+4464 | 219 | bool (default=1) | Late CFG cleanup guard

Infrastructure Functions

Address | Function | Role
0x2342890 | sub_2342890 | Master pass registration (~2,816 lines)
0xE41FB0 | sub_E41FB0 | StringMap::insert (48-byte entries, open-addressing)
0xE41C70 | sub_E41C70 | StringMap::grow (hash table resize)
0xC94890 | sub_C94890 | String hash function (DJB/FNV-family)
0x9691B0 | sub_9691B0 | String equality (len + memcmp)
0xC931B0 | sub_C931B0 | StringRef::find_first_of (delimiter search)
0x95CB50 | sub_95CB50 | StringRef::consume_front (strip llvm:: prefix)
0x233C410 | sub_233C410 | Help listing (--print-pipeline-passes)
0x233BD40 | sub_233BD40 | AA name resolver (chain of comparisons)
0x233C0C0 | sub_233C0C0 | AA pipeline parser
0x233C300 | sub_233C300 | Extension callback dispatch
0x233A120 | sub_233A120 | Generic parameterized option parser
0x12E54A0 | sub_12E54A0 | Master pipeline assembler (49.8KB)
0x12DE0B0 | sub_12DE0B0 | AddPass (hash-table-based insertion)
0x12DE330 | sub_12DE330 | Tier 0 full optimization sub-pipeline
0x12DE8F0 | sub_12DE8F0 | Tier 1/2/3 phase-specific sub-pipeline
0x12DFE00 | sub_12DFE00 | Codegen dispatch (dependency-ordered)
0x226C400 | sub_226C400 | Pipeline name selector (nvopt<O#>)
0x2277440 | sub_2277440 | Pipeline text parser entry
0x225D540 | sub_225D540 | New PM nvopt registration
0x12C35D0 | sub_12C35D0 | Legacy PM pipeline orchestrator
0x2342820 | sub_2342820 | LastRunTrackingAnalysis factory
0x2342830 | sub_2342830 | PassInstrumentationAnalysis factory
0x2342840 | sub_2342840 | VerifierAnalysis factory
0x2342850 | sub_2342850 | InlinerWrapper factory (shared by 4 inliner variants)
0x2342860 | sub_2342860 | InvalidateAllAnalysesPass factory
0x2342870 | sub_2342870 | VerifierPass factory
0x2342880 | sub_2342880 | GuardWideningPass factory
0x2339850 | sub_2339850 | PassBuilder destructor
0x233B610 | sub_233B610 | PassBuilder::~PassBuilder cleanup

Scalar Passes: SROA, EarlyCSE & JumpThreading

Three LLVM scalar optimization passes play outsized roles in cicc's GPU pipeline. Each is a stock LLVM implementation with NVIDIA configuration overrides (and in EarlyCSE's case, binary-level modifications). Each appears multiple times in the pipeline at different tier levels, and each can be independently disabled via NVVMPassOptions flags.

SROA (Scalar Replacement of Aggregates)

SROA eliminates alloca instructions by decomposing aggregates into individual SSA values that the register allocator can place in registers. On a GPU this is existential: every surviving alloca becomes a spill to .local memory (DRAM-backed, 200-800 cycle latency on cache miss versus zero for a register). A single un-promoted alloca in a hot loop can degrade kernel throughput by 10-50x. SROA also eliminates the .param space copies generated for byval struct parameters, preventing round-trips through local memory.

Full SROA analysis >>>

EarlyCSE (Early Common Subexpression Elimination)

Cicc's EarlyCSE is not stock LLVM. The binary contains four CUDA-specific extensions: barrier-aware memory versioning that prevents CSE across __syncthreads() and other synchronization points, shared memory address space 7 protection against unsafe store-to-load forwarding between threads, a dedicated NVVM intrinsic call CSE handler with fast-path recognition for thread-invariant special register reads (threadIdx.x, etc.), and a PHI operand limit of 5 for compile-time control. It also adds a fourth scoped hash table (store-forwarding) that upstream LLVM lacks.

Full EarlyCSE analysis >>>

JumpThreading

JumpThreading duplicates basic blocks so that predecessors with statically-determinable branch conditions jump directly to the correct successor, eliminating warp divergence. The pass is fundamentally at odds with PTX's requirement for reducible control flow: block duplication can create irreducible cycles. Cicc addresses this through loop header protection (jump-threading-across-loop-headers defaults to false), conservative duplication thresholds (6-instruction block limit), and a late-pipeline StructurizeCFG safety net that catches any irreducibility that slips through. NVIDIA provides a separate "disable-jump-threading" kill switch (distinct from upstream's "disable-JumpThreadingPass"), with an OCG experiment annotation suggesting architecture-specific cases where the CFG disruption outweighs the benefit.

Full JumpThreading analysis >>>

Cross-References

  • Pipeline & Ordering -- tier-dependent scheduling of all three passes
  • Register Allocation -- surviving allocas after SROA become register pressure; failed promotion leads to .local memory spills
  • StructurizeCFG -- the safety net that catches irreducible CFG created by JumpThreading or other passes
  • GVN -- GVN performs load CSE and redundancy elimination complementary to EarlyCSE, running later in the pipeline with more expensive analysis
  • MemorySpaceOpt -- resolves generic pointers to specific address spaces; interacts with EarlyCSE's address-space-aware load forwarding
  • DSE -- Dead Store Elimination complements EarlyCSE's within-block store-to-load forwarding with cross-block dead store detection

SROA (Scalar Replacement of Aggregates)

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

LLVM version note: Based on LLVM 20.0.0 SROA.cpp. Evidence: preserve-cfg / modify-cfg pipeline parser parameters match LLVM 16+ new PM integration; two-pass analysis mode (qword_50055E8) matches LLVM 17+ pre-analysis path. Core splitting algorithm is stock LLVM with no CUDA-specific modifications detected.

SROA is the single most important early-pipeline optimization for NVIDIA GPU compilation. Every alloca instruction that survives into code generation is lowered to .local memory (NVPTX address space 5) -- physically backed by device DRAM and accessed through the L1/L2 cache hierarchy. A .local access that misses L1 costs 200-400 cycles; a register read costs zero. A single un-promoted alloca in a hot loop can degrade kernel throughput by 10-50x. SROA's job is to decompose aggregate allocas (structs, arrays, unions) into individual scalar SSA values that the register allocator can place in registers, eliminating the memory traffic entirely.

Property | Value
Pass name | "sroa"
Pipeline parser params | preserve-cfg, modify-cfg
Entry function | sub_2935C30 (runOnAlloca)
Core function | sub_2930B90 (splitAlloca)
Binary footprint | ~138 KB primary (80 KB + 58 KB), ~200 KB secondary (legacy PM)
Binary address range | 0x2910000-0x293FFFF (178 functions)
Pipeline positions | Position 4 (early, after NVVMReflect) and post-sinking (late)
Disable flag | NVVMPassOptions offset +1400
Size threshold knob | qword_50056C8 (max alloca size in bits)
Two-pass flag | qword_50055E8 (enables pre-analysis for new PM)
NVIDIA modifications | None to core algorithm
Upstream source | llvm/lib/Transforms/Scalar/SROA.cpp

Why SROA Is Existential on GPU

On a CPU, an alloca that cannot be promoted to a register lives on the stack -- a cached, low-latency memory region with typical access times of 1-4 cycles. On an NVIDIA GPU there is no hardware stack cache: every surviving alloca becomes a .local allocation backed by DRAM with 200-800 cycle latency on cache miss versus zero for a register. See the GPU Execution Model memory hierarchy table for per-tier latencies.

Every alloca that survives SROA becomes a .local allocation. The NVPTX backend emits these as frame objects in the NVPTXFrameLowering::emitPrologue path, and ptxas maps them to per-thread local memory. Because occupancy is bounded by register count per SM, and .local spills effectively consume both registers (for the address) and memory bandwidth, the performance impact compounds.

The pipeline runs SROA twice: once early (position 4, immediately after NVVMReflect) to eliminate allocas before any other transform sees them, and once late (after NVVMCustomSinking2 and BreakCriticalEdges) to catch allocas created or exposed by loop unrolling, inlining, and other mid-pipeline transforms. The early invocation handles the common case (byval parameter copies, local struct variables); the late invocation cleans up whatever the loop optimizer and sinking passes left behind.

The isAllocaPromotable Fast Path

Before performing any splitting, runOnAlloca checks whether the alloca is trivially promotable via sub_B4CE70 (isAllocaPromotable). An alloca is promotable if every use is a simple load or store with no address-taken escape -- the same criterion as mem2reg. When this returns true, SROA marks the alloca for mem2reg and returns without performing any slice analysis or splitting. This fast path avoids the O(n) slice-building cost for the vast majority of CUDA local variables (scalar int, float, simple pointers), which are already simple enough for mem2reg to handle directly.

Algorithm: runOnAlloca (sub_2935C30)

The top-level per-alloca entry point. Validates the alloca as a candidate, builds the partition/slice table, and delegates to splitAlloca for the actual transformation.

Phase 1: Candidate Validation

runOnAlloca(state, alloca):
    if alloca has no users:
        eraseFromParent(alloca)
        return

    if isAllocaPromotable(alloca):
        defer to mem2reg
        return

    type = getAllocatedType(alloca)
    type_byte = getTypeID(type)

    // Accept: integers(3), half(4), bfloat(5), float(6),
    //         pointers(10), vectors(11), arrays(12), structs(15-18, 20)
    // Reject structs/composites unless isVectorType returns true
    if type_byte not in {3,4,5,6,10,11,12,15,16,17,18,20}:
        return
    if type_byte in {15,16,17,18,20} and not isVectorType(type):
        return  // function types, labels, etc.

    size = getTypeSizeInBits(type)   // sub_BDB740
    if size > qword_50056C8:         // SROA size threshold
        return  // alloca too large, leave for backend

The size threshold at qword_50056C8 is a global tuning knob, likely controlled by the sroa<preserve-cfg> / sroa<modify-cfg> pipeline parameter. Allocas larger than this threshold are left untouched; the backend will lower them to .local memory. The exact default is not exposed in the binary's constructor initializers, but upstream LLVM uses a default of 128 bytes (1024 bits) for the sroa-threshold flag.

Phase 2: Use Analysis and Slice Building

    metadata = buildMetadataTable(alloca)   // sub_D5F1F0

    if qword_50055E8:                       // two-pass mode
        buildSlices(state, alloca, 1)       // sub_2927160 — pre-analysis
        slices = buildPartitions(state)     // sub_2924690
    else:
        slices = buildPartitions(state)     // single-pass

buildSlices (sub_2927160) walks all users of the alloca, classifying each use as a "slice" -- a byte range [start, end) with associated flags. Each slice is a 24-byte entry:

Offset | Size | Field
+0 | 8 | start (byte offset into alloca)
+8 | 8 | end (byte offset, exclusive)
+16 | 8 | flags -- bit 2 = splittable, bits [63:3] = user instruction metadata pointer

buildPartitions (sub_2924690) groups non-overlapping slices into partitions. Each partition represents a contiguous byte range that can be replaced by a single sub-alloca. Overlapping slices are merged; slices that cross partition boundaries are marked as "unsplittable."

The two-pass flag (qword_50055E8) enables a pre-analysis pass that runs buildSlices first with a "dry-run" mode to count slices and pre-allocate arrays, then runs the actual partition builder. This is the new PM (PassManager) style -- the legacy PM code path at 0x1A10000 does a single pass.

Phase 3: Contiguous Slice Merging

After building slices, runOnAlloca scans for contiguous ranges that share the same base type and can be merged:

    for each group of contiguous slices:
        if all loads/stores in group use the same type:
            if none are volatile (isVolatile check via sub_B46500):
                if all are in-bounds (byte +2, bit 0):
                    mergeSlices(group)   // sub_11D2BF0 + sub_11D3120 + sub_11D7E80

This merging step reduces redundant slices before the splitting phase. For example, if a 16-byte struct has four contiguous 4-byte i32 loads, the merger can combine them into a single slice covering the full struct, which may then map to a single <4 x i32> register rather than four separate scalar registers.

Phase 4: Dead Instruction Processing

    for each dead instruction found during analysis:
        for each operand:
            addToWorklist(operand)         // sub_29220F0
        replaceAllUsesWith(undef)          // sub_BD84D0 + sub_ACADE0
        eraseFromParent(instruction)       // sub_BD60C0

Dead instructions identified during slice building (stores to never-loaded ranges, loads of write-only ranges) are removed immediately, before the splitting phase begins.

Phase 5: Recursive Splitting

    if slices is non-empty:
        splitAlloca(state, alloca, slices)  // sub_2930B90 — recursive

This is the key: splitAlloca may create new sub-allocas that are themselves candidates for further splitting. The newly created sub-allocas are added to the worklist and processed in stack order (LIFO).

Phase 6-8: Post-Split Processing

After splitting, runOnAlloca processes newly created sub-allocas (56-byte records stored in a SmallVector with 2-element inline buffer), rewrites per-sub-alloca slice lists, and returns a two-byte result: byte 0 = changed flag, byte 1 = re-run needed flag.

Algorithm: splitAlloca (sub_2930B90)

The core splitting function. Given a partitioned alloca and its use-slices, it creates new sub-allocas and rewrites all users.

Phase 1: Pre-Filter Slices

Iterates the 24-byte slice array. For slices whose instruction is a load (opcode 61) or store (opcode 62) of a simple scalar type that fits entirely within the alloca boundary, clears the "splittable" bit (flag & 4). This prevents unnecessary splitting of trivial accesses -- a scalar i32 load from an i32 alloca does not need splitting. If any slices were de-flagged, calls sortSlices (sub_2912200) and compactSlices (sub_2915A90 / sub_2914CE0) to remove the now-redundant entries.

Phase 2: Partition Iteration

buildPartitionTable (sub_2913C40) produces a partition list from the sorted slices. Each partition is a local tuple [start, end, first_slice_ptr, last_slice_ptr]. The main loop advances through partitions via sub_2912870 (advancePartitionIterator).

Phase 3: Find Rewrite Target

For each partition [start, end):

  1. Get the DataLayout via sub_B43CC0 (getDL).
  2. If the partition contains only unsplittable slices, call findExistingValue (sub_291A860) to search for an existing SSA value that already covers [start, end). If found, reuse it instead of creating a new alloca.
  3. Otherwise, scan slices for a single dominating load or store. Dispatch on opcode:
    • 61 (load): extract the loaded type.
    • 62 (store): extract the stored value type from the store's value operand.
    • 85 (intrinsic): memcpy/memset/memmove -- follow the pointer chain to determine the affected type.
  4. Compare type sizes via getTypeSizeInBits (sub_BDB740).
  5. If no suitable existing value, create a new alloca via CreateAlloca (sub_BCD420) or CreateBitCast (sub_BCD140).

Phase 4: Size and Alignment Check

    alloc_size = getTypeAllocSize(partition_type)    // sub_9208B0
    if alloc_size > 0x800000:                        // 8 MB sanity limit
        skip partition

    // Verify rewrite target matches partition size (8-byte aligned)
    if match:
        checkTypeCompatibility(both_directions)      // sub_29191E0
        validateUnsplittableSlices(partition)         // sub_291A4D0

The 8 MB sanity limit prevents SROA from creating absurdly large sub-allocas from pathological input.

Phase 5: Slice Classification

For each slice in the partition, classifySlice (sub_29280E0) sorts it into one of two lists:

List | Variable | Contents
splittable-inside | v446 | Slices fully contained within [start, end)
splittable-outside | v452 | Slices that reference bytes outside the partition (integer widening)

The classification also tracks:

  • v413 (sameType flag): whether all slices in the partition use the same LLVM type.
  • v415 (common type): the shared type if sameType is true.
  • v412 (hasPointerType): whether any slice involves a pointer type.
  • Integer types (type byte == 14) are routed to the outside list for special handling (widening/narrowing may be needed).

Then rewritePartition (sub_29197E0) is called twice: first for inside slices with callback sub_2919EF0, then for outside slices if the first call produced nothing.

Phase 6: New Sub-Alloca Creation

    // Compute alignment
    align_log2 = _BitScanReverse64(alloca_alignment)
    abi_align = getABITypeAlignment(type)            // sub_AE5020
    pref_align = getPrefTypeAlignment(type)          // sub_AE5260

    // Build name: original_name + ".sroa." + index
    name = getName(alloca) + ".sroa."                // sub_BD5D20

    // Create the new alloca (80-byte AllocaInst object)
    new_alloca = AllocaInst::Create(type, size, alignment, name)
                                                     // sub_BD2C40 + sub_B4CCA0
    // Insert before the original alloca
    insertBefore(new_alloca, alloca)

    // Copy debug metadata
    copyDebugInfo(alloca, new_alloca)                // sub_B96E90 + sub_B976B0

Each sub-alloca is an 80-byte AllocaInst object with the .sroa. name prefix. The insertion point is always directly before the original alloca in the entry block, maintaining the invariant that all allocas are grouped at the function entry.
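The alignment computation above reduces to taking the base-2 logarithm of the (power-of-two) alloca alignment. A portable sketch of what the `_BitScanReverse64` call computes (the function name below is ours; the binary uses the x86 BSR intrinsic directly):

```cpp
#include <cstdint>

// Portable equivalent of _BitScanReverse64 on a power-of-two alignment:
// the index of the highest set bit is log2(alignment).
unsigned align_log2(uint64_t alignment) {
    unsigned idx = 0;
    while (alignment >>= 1) ++idx;   // count shifts until the top bit falls off
    return idx;
}
```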

Phase 7: Instruction Rewriting

The visitUse function (sub_292A4F0) rewrites each user of the original alloca to reference the appropriate sub-alloca:

  • GEP chains: retargeted to the new sub-alloca with adjusted offsets (sub_29348F0).
  • Loads: rewritten with type-casts if the sub-alloca type differs from the original load type (sub_F38250).
  • Stores: same treatment as loads (sub_F38250).
  • Memcpy/memset: split into smaller operations covering only the sub-alloca's byte range (sub_F38330).

Each rewritten instruction is validated via sub_291F660 (validateRewrite).

Phase 8: Worklist Management

Dead instructions are removed from the pass's open-addressing hash table (at pass state offset +432, mask at +896). New sub-allocas are added to the worklist (sub_2928360) for re-processing. Allocas that cannot be split are recorded via sub_2916C30 (recordNonSplitAlloca).

Phase 9: Result Recording

For each partition that produced a new alloca, the result is stored as a 24-byte entry [new_alloca, bit_offset, bit_size] in the output array. Hash table capacity is computed using the classic 4n/3 + 1 formula (next power of 2), and entries are stored via open-addressing with linear probing (sub_29222D0 handles resizing).
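The capacity rule described above can be sketched as follows (a minimal model; the helper names are ours, not symbols recovered from the binary): grow the requested entry count by a third, add one, then round up to a power of two so the table stays under roughly 75% occupancy.

```cpp
#include <cstdint>

// Round v up to the next power of two (>= 1).
static uint64_t next_pow2(uint64_t v) {
    uint64_t p = 1;
    while (p < v) p <<= 1;
    return p;
}

// The 4n/3 + 1 capacity formula, rounded to a power of two.
uint64_t sroa_table_capacity(uint64_t num_entries) {
    return next_pow2(num_entries * 4 / 3 + 1);
}
```

For example, 3 entries yield a request of 5 buckets, rounded up to 8; 6 entries yield 9, rounded up to 16.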

Phase 10: Post-Split Use Rewriting

The most complex phase. For every use of the original alloca:

  1. getOperandNo (sub_B59530) determines which operand references the alloca.
  2. getAccessRange (sub_AF47B0) computes the byte range [begin, end) within the alloca that this use touches.
  3. For each new sub-alloca in the result array, checkSubAllocaOverlap (sub_AF4D30) tests whether the sub-alloca's range overlaps the use's range.
  4. If overlap: computeRewrittenValue (sub_2916270) produces the replacement value by combining reads from multiple sub-allocas if the original use spans a partition boundary.
  5. Dead uses are identified by isDeadUse (sub_291D8F0) and erased.

The use-list implementation uses a tagged-pointer scheme: bit 2 indicates "heap-allocated list" vs. "inline single element," bits [63:3] are the actual pointer. Lists are freed via _libc_free after extracting the data pointer.

Phases 11-12: Lifetime and Debug Info

Lifetime markers (llvm.lifetime.start / llvm.lifetime.end) are rewritten via sub_291E540 to cover only the sub-alloca's byte range. Debug declarations (dbg.declare, dbg.value) are similarly rewritten: each debug-info entry pointing to the original alloca is retargeted to the sub-alloca whose byte range covers the relevant fragment, using the debug expression's DW_OP_LLVM_fragment to indicate the piece.

Speculative Loads Through Select

When a load reaches its pointer through a select instruction, SROA hoists the load into both branches:

; Before SROA:
%p = select i1 %cond, ptr %a, ptr %b
%v = load float, ptr %p, align 4

; After SROA:
%vt = load float, ptr %a, align 4          ; .sroa.speculate.load.true
%vf = load float, ptr %b, align 4          ; .sroa.speculate.load.false
%v  = select i1 %cond, float %vt, float %vf ; .sroa.speculated

This is significant on GPU for two reasons:

  1. SIMT execution model. A select on a GPU maps to a predicated move, which executes in a single cycle without divergence. The two speculative loads execute unconditionally and in parallel (both issue to the memory pipeline regardless of the predicate). This is cheaper than a control-dependent load that would require branch divergence handling.

  2. Alloca elimination. The original pattern requires the select to produce a pointer, which means the alloca must remain in memory (the pointer must be materializable). After speculation, both pointers are consumed directly by loads, and if %a and %b are themselves sub-allocas that can be promoted to registers, the entire chain collapses to register-only operations.

The implementation (Kind 3, lines 1024-1235 of splitAlloca) creates:

  • Two BitCastInst with names .sroa.speculate.cast.true and .sroa.speculate.cast.false.
  • Two LoadInst with names .sroa.speculate.load.true and .sroa.speculate.load.false, preserving alignment from the original load.
  • One SelectInst with name .sroa.speculated via sub_B36550 (SelectInst::Create).
  • Metadata copied from the original load via sub_B91FC0 (copyMetadata).

Interaction with .param Space

Function parameters passed by value in CUDA/PTX use the .param address space (NVPTX address space 101). The EDG frontend generates an alloca to hold a copy of each byval parameter, then loads fields from it. Consider:

struct Vec3 { float x, y, z; };

__device__ float sum(Vec3 v) {
    return v.x + v.y + v.z;
}

The IR before SROA contains:

define float @sum(%struct.Vec3* byval(%struct.Vec3) align 4 %v) {
  %v.addr = alloca %struct.Vec3, align 4           ; byval copy
  %x = getelementptr %struct.Vec3, ptr %v.addr, i32 0, i32 0
  %0 = load float, ptr %x, align 4
  %y = getelementptr %struct.Vec3, ptr %v.addr, i32 0, i32 1
  %1 = load float, ptr %y, align 4
  %z = getelementptr %struct.Vec3, ptr %v.addr, i32 0, i32 2
  %2 = load float, ptr %z, align 4
  %add = fadd float %0, %1
  %add1 = fadd float %add, %2
  ret float %add1
}

SROA splits %v.addr into three scalar allocas (%v.addr.sroa.0, .sroa.1, .sroa.2), each holding a single float. Because each sub-alloca has only simple loads and stores, mem2reg (which runs in the next pipeline iteration) promotes all three to SSA registers. The final IR has no allocas and no memory traffic -- the three float values live entirely in registers.

Without SROA, the byval copy would persist as a .local allocation, and every field access would be a .local load. For a kernel that calls sum() in a tight loop, that is the difference between register-speed and DRAM-speed execution.

The NVPTXTargetLowering::LowerCall function (sub_3040BF0) emits DeclareParam (opcode 505) and StoreV1/V2/V4 (opcodes 571-573) for the .param writes on the caller side; SROA's job is to ensure the callee's reads never touch memory.

Auxiliary SROA Functions (Secondary Instance)

The binary contains a second SROA instance at 0x1A10000-0x1A3FFFF (~200 KB), corresponding to the legacy pass manager code path. This instance contains additional rewriting functions not visible in the primary (new PM) instance:

Function | Size | Role | Key strings
sub_1A3B290 | 58 KB | rewritePartition (memcpy/memset) | "memcpy.load.fca", "memcpy.store.fca", "memset.store.fca", ".fca"
sub_1A2D070 | 35 KB | presplitLoadsAndStores | "select.gep.sroa", "select.sroa", "phi.sroa", "phi.gep.sroa"
sub_1A2C2F0 | 9 KB | Select speculation | ".sroa.speculate.load.true", ".sroa.speculate.load.false"
sub_1A2FFA0 | 12 KB | Vector splat handling | "vsplat", ".splatinsert", ".splat"
sub_1A30D10 | 16 KB | Load rewriting | "copyload", "oldload"
sub_1A31B60 | 9 KB | Extract/load patterns | "extract", "load.ext", "endian_shift", "load.trunc"
sub_1A23B30 | 11 KB | Type casting | "sroa_raw_cast", "sroa_raw_idx", "sroa_cast"
sub_1A3A670 | 13 KB | Speculative load promotion | ".sroa.speculated", ".sroa.speculate.load."
sub_1A13B30 | 36 KB | Alloca analysis / slice building | --
sub_1A15E70 | 34 KB | Partition computation | --
sub_1A18770 | 38 KB | Use analysis | --
sub_1A3DCD0 | 15 KB | Cleanup | --

The .fca suffix stands for "first-class aggregate" -- LLVM's term for structs and arrays passed by value. The presplitLoadsAndStores function handles a special case where loads and stores of aggregates can be split before the main SROA algorithm runs, decomposing load { i32, i32 } into separate load i32 instructions and store { i32, i32 } into separate store i32 instructions. The select.gep.sroa and phi.gep.sroa strings indicate that this pre-split phase also handles GEP chains through PHI nodes and selects, a pattern common in CUDA code after inlining.

Data Structures

Slice Entry (24 bytes)

struct SROASlice {
    uint64_t start;     // +0:  byte offset into alloca (inclusive)
    uint64_t end;       // +8:  byte offset into alloca (exclusive)
    uint64_t flags;     // +16: bit 2 = splittable, bits [63:3] = user metadata ptr
};

The splittable bit indicates whether the slice can be split across partition boundaries. Loads and stores of simple scalars that fit entirely within the alloca have this bit cleared in Phase 1 of splitAlloca.

Sub-Alloca Record (56 bytes)

struct SubAllocaRecord {
    void* alloca_ptr;       // +0:  pointer to the new AllocaInst
    void* slice_list;       // +8:  pointer to slice list for this sub-alloca
    uint64_t slice_list_cap; // +16: capacity of slice list
    // ... additional fields through +55
};

Stored in a SmallVector<SubAllocaRecord, 2> -- the inline buffer holds two elements (common case: a struct with two fields), spilling to heap for larger aggregates.

Pass State Hash Table

The SROA pass state object (parameter a1 to both main functions) contains an open-addressing hash table at offsets +432 through +896. It uses LLVM-layer sentinels (-4096 / -8192) with instruction pointer keys. This table tracks which instructions have already been processed or are pending in the worklist. See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth policy.

Tagged Pointer Scheme

Use-lists and debug-info lists use a tagged-pointer encoding for memory efficiency:

  • Bit 2 clear: the "pointer" field directly contains a single element (inline storage for the common case of one use).
  • Bit 2 set: bits [63:3] are a heap-allocated pointer to a variable-length list. Freed via _libc_free after masking off the tag bits.

This avoids heap allocation for the overwhelmingly common case where an alloca field has exactly one load or one store.
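A minimal sketch of this encoding, assuming 8-byte-aligned heap pointers so the low three bits are free for tagging (constant and function names are illustrative, not recovered symbols):

```cpp
#include <cstdint>

constexpr uint64_t kHeapTag = 1ull << 2;    // bit 2: heap-allocated list
constexpr uint64_t kPtrMask = ~0ull << 3;   // bits [63:3]: pointer payload

// Tag a heap-allocated list pointer.
inline uint64_t make_heap_ref(void* list) {
    return (reinterpret_cast<uint64_t>(list) & kPtrMask) | kHeapTag;
}
inline bool is_heap_ref(uint64_t word) { return (word & kHeapTag) != 0; }

// Mask off the tag bits before dereferencing or freeing.
inline void* heap_ptr(uint64_t word) {
    return reinterpret_cast<void*>(word & kPtrMask);
}
```

The mask-before-free step corresponds to the _libc_free call noted above: the tag bits must be stripped before the word is treated as a pointer again.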

IR Before/After Example

Consider a CUDA kernel that uses a local struct:

__global__ void kernel(float* out, int n) {
    struct { float a; int b; float c; } local;
    local.a = 1.0f;
    local.b = n;
    local.c = 2.0f;
    out[0] = local.a + local.c;
    out[1] = (float)local.b;
}

Before SROA:

define void @kernel(ptr %out, i32 %n) {
entry:
  %local = alloca { float, i32, float }, align 4
  %a = getelementptr { float, i32, float }, ptr %local, i32 0, i32 0
  store float 1.0, ptr %a, align 4
  %b = getelementptr { float, i32, float }, ptr %local, i32 0, i32 1
  store i32 %n, ptr %b, align 4
  %c = getelementptr { float, i32, float }, ptr %local, i32 0, i32 2
  store float 2.0, ptr %c, align 4
  %v0 = load float, ptr %a, align 4
  %v2 = load float, ptr %c, align 4
  %sum = fadd float %v0, %v2
  store float %sum, ptr %out, align 4
  %v1 = load i32, ptr %b, align 4
  %conv = sitofp i32 %v1 to float
  %idx = getelementptr float, ptr %out, i64 1
  store float %conv, ptr %idx, align 4
  ret void
}

After SROA (three sub-allocas, then mem2reg promotes to registers):

define void @kernel(ptr %out, i32 %n) {
entry:
  ; No allocas remain -- all promoted to SSA values
  %sum = fadd float 1.0, 2.0          ; constant-folded later by InstCombine
  store float %sum, ptr %out, align 4
  %conv = sitofp i32 %n to float
  %idx = getelementptr float, ptr %out, i64 1
  store float %conv, ptr %idx, align 4
  ret void
}

SROA splits %local into %local.sroa.0 (float), %local.sroa.1 (i32), %local.sroa.2 (float). Each sub-alloca has trivial load/store patterns, so mem2reg promotes all three. The stores and loads collapse, GEPs disappear, and the kernel runs entirely from registers.

Name Suffixes Created During Splitting

Suffix | Purpose
.sroa. | New sub-alloca name prefix
.sroa.speculate.cast.true | Bitcast for true branch of select
.sroa.speculate.cast.false | Bitcast for false branch of select
.sroa.speculate.load.true | Speculative load from true branch
.sroa.speculate.load.false | Speculative load from false branch
.sroa.speculated | Final select combining speculative loads
.cont | Continuation block (after branch splitting)
.then | Then-branch block
.else | Else-branch block
.val | Value extracted from split load/store
.fca | First-class aggregate decomposition
select.gep.sroa | GEP through select, pre-split
select.sroa | Select pointer, pre-split
phi.sroa | PHI pointer, pre-split
phi.gep.sroa | GEP through PHI, pre-split
sroa_raw_cast | Raw bitcast during type rewriting
sroa_raw_idx | Raw index computation during rewriting
sroa_cast | Generic SROA type cast
vsplat | Vector splat element
.splatinsert | Splat insert element
.splat | Splat shuffle
copyload | Copy of a load during rewriting
oldload | Original load being replaced
extract | Extracted sub-value
load.ext | Load with extension
endian_shift | Endianness-adjustment shift
load.trunc | Load with truncation
memcpy.load.fca | Memcpy load of first-class aggregate
memcpy.store.fca | Memcpy store of first-class aggregate
memset.store.fca | Memset store of first-class aggregate

Differences from Upstream LLVM

The core SROA algorithm in cicc v13.0 is stock LLVM SROA. No CUDA-specific modifications to the splitting logic, slice building, or partition computation were detected. The NVIDIA-specific elements are limited to:

  1. Pass state object layout. The offsets within the pass state structure (worklist at +432, hash table at +824-+864, sub-alloca records at +1080-+1096) reflect NVIDIA's PassManager integration, not upstream's.

  2. IR node encoding. Opcode numbers (61 = load, 62 = store, 85 = intrinsic, 55 = phi) and operand layout (32-byte basic blocks, tagged pointers) follow NVIDIA's modified IR format.

  3. Debug metadata system. The metadata kind for debug info uses MD_dbg = 38 (NVIDIA assignment), queried via sub_B91C10.

  4. Global threshold knob. The value at qword_50056C8 may have an NVIDIA-specific default different from upstream's 128-byte / 1024-bit default. The knob is likely settable via the pipeline text sroa<preserve-cfg> or sroa<modify-cfg>.

  5. Pipeline positioning. The early-pipeline placement (position 4, before NVVMLowerArgs and NVVMLowerAlloca) is NVIDIA-specific. Upstream LLVM typically places SROA after InstCombine and SimplifyCFG; cicc places it before those passes to eliminate byval parameter copies as early as possible.

Configuration

Knob | Name | Description
qword_50056C8 | SROA size threshold | Maximum alloca size (in bits) that SROA will attempt to split. Allocas exceeding this are left for the backend.
qword_50055E8 | Two-pass analysis flag | When set, enables a pre-analysis pass before slice building (new PM integration).
NVVMPassOptions offset +1400 | Disable flag | Setting this byte disables SROA entirely.
Pipeline param preserve-cfg | -- | Runs SROA without modifying the CFG (no block splitting for speculative loads across PHIs).
Pipeline param modify-cfg | -- | Allows SROA to modify the CFG (enables full speculative load hoisting including PHI/select decomposition).

Function Map

Function | Address | Size
Primary instance (new PM):
SROAPass::runOnAlloca | sub_2935C30 | 58 KB
SROAPass::splitAlloca | sub_2930B90 | 80 KB
buildSlices (use analysis) | sub_2927160 | --
buildPartitions (group slices) | sub_2924690 | --
buildPartitionTable | sub_2913C40 | --
sortSlices | sub_2912200 | --
compactSlices (with filter) | sub_2915A90 | --
compactSlices (simple) | sub_2914CE0 | --
findExistingValue | sub_291A860 | --
rewritePartition | sub_29197E0 | --
rewriteCallback | sub_2919EF0 | --
visitUse (rewrite one use) | sub_292A4F0 | 54 KB
validateRewrite | sub_291F660 | --
analyzeSlice | sub_29150D0 | --
addToNewAllocaWorklist | sub_2929FB0 | --
addToWorklist | sub_2928360 | --
addOperandToWorklist | sub_29220F0 | --
clearPendingQueue | sub_2921860 | --
classifySlice | sub_29280E0 | --
recordNonSplitAlloca | sub_2916C30 | --
computeRewrittenValue | sub_2916270 | --
advancePartitionIterator | sub_2912870 | --
rewriteGEPChain | sub_29348F0 | --
replaceAndErase | sub_2914800 | --
collectUsesForRewrite (variant) | sub_2914380 | --
collectUsesForRewrite (original) | sub_2914550 | --
Hash table resize | sub_29222D0 | --
Alloca rewriting helper | sub_292D810 | 67 KB
SROA pass metadata | sub_2912100 | --
SROA pass registration ("Scalar Replacement Of Aggregates", "sroa") | sub_2912340 | --
Secondary instance (legacy PM):
SROAPass::runOnAlloca (legacy) | sub_1A33E80 | 61 KB
SROAPass::splitAlloca (legacy) | sub_1A37040 | 46 KB
rewritePartition (memcpy/memset) | sub_1A3B290 | 58 KB
presplitLoadsAndStores | sub_1A2D070 | 35 KB
Select speculation | sub_1A2C2F0 | 9 KB
Vector splat handling | sub_1A2FFA0 | 12 KB
Load rewriting | sub_1A30D10 | 16 KB
Extract/load patterns | sub_1A31B60 | 9 KB
Type casting | sub_1A23B30 | 11 KB
Speculative load promotion | sub_1A3A670 | 13 KB
Alloca analysis / slice building | sub_1A13B30 | 36 KB
Partition computation | sub_1A15E70 | 34 KB
Use analysis | sub_1A18770 | 38 KB
Cleanup | sub_1A3DCD0 | 15 KB
Shared helpers:
isAllocaPromotable | sub_B4CE70 | --
getDL (DataLayout) | sub_B43CC0 | --
getTypeSizeInBits | sub_BDB740 | --
getTypeAllocSize | sub_9208B0 | --
getType | sub_BD5C60 | --
getName | sub_BD5D20 | --
AllocaInst::Create | sub_BD2C40 | --
PHINode::Create | sub_BD2DA0 | --
AllocaInst constructor | sub_B4CCA0 | --
CreateBitCast | sub_BCD140 | --
CreateAlloca | sub_BCD420 | --
replaceAllUsesWith | sub_BD84D0 | --
eraseFromParent | sub_B43D60 | --
SelectInst::Create | sub_B36550 | --
UndefValue::get | sub_ACADE0 | --
getABITypeAlignment | sub_AE5020 | --
getPrefTypeAlignment | sub_AE5260 | --
copyMetadata | sub_B91FC0 | --
isVolatile | sub_B46500 | --
isVectorType | sub_BCEBA0 | --
rewriteLoadStoreOfSlice | sub_F38250 | --
rewriteMemTransferOfSlice | sub_F38330 | --
collectAllUses | sub_AE74C0 | --
getAccessRange | sub_AF47B0 | --
checkSubAllocaOverlap | sub_AF4D30 | --
buildMetadataTable | sub_D5F1F0 | --
addToErasedSet | sub_D6B260 | --
Slice optimizer init | sub_11D2BF0 | --
Slice optimizer run | sub_11D3120 | --
Slice optimizer finalize | sub_11D7E80 | --

Test This

The following kernel allocates a local struct and accesses its fields. SROA should completely eliminate the alloca, promoting all fields to registers.

struct Particle {
    float x, y, z;
    float vx, vy, vz;
};

__global__ void sroa_test(float* out, int n) {
    Particle p;
    p.x  = (float)threadIdx.x;
    p.y  = (float)threadIdx.y;
    p.z  = 0.0f;
    p.vx = 1.0f;
    p.vy = 2.0f;
    p.vz = 3.0f;

    float energy = 0.5f * (p.vx*p.vx + p.vy*p.vy + p.vz*p.vz);
    out[threadIdx.x] = p.x + p.y + p.z + energy;
}

What to look for in PTX:

  • Absence of .local memory declarations. If SROA succeeds, there should be no .local .align directives in the PTX for the Particle struct. All six fields (x, y, z, vx, vy, vz) should live in %f (float) registers.
  • No st.local or ld.local instructions. These indicate that the struct survived into .local memory -- a 200-400 cycle penalty per access versus zero cycles for a register.
  • The PTX should show direct register arithmetic: mov.f32, fma.rn.f32, add.f32 -- no memory traffic at all for the struct fields.
  • To see the failure case, add volatile to the struct declaration (volatile Particle p;). This prevents SROA from promoting the alloca, and ld.local/st.local instructions will appear in the PTX, demonstrating the performance cliff that SROA normally prevents.
  • At -O0, SROA still runs (it is correctness-relevant for address space resolution), but with a more conservative threshold. Compare the .local frame size between -O0 and -O2.

Cross-References

  • Scalar Passes Hub -- hub page linking SROA, EarlyCSE, and JumpThreading with GPU-context summaries
  • Pipeline & Ordering -- pipeline positions 4 and post-sinking
  • Register Allocation -- surviving allocas become .local spills, directly increasing register pressure
  • Rematerialization -- recomputes cheap values to reduce register pressure; operates downstream of SROA
  • StructSplitting -- NVIDIA custom pass that splits struct arguments at the call boundary; complements SROA's intra-procedural splitting
  • MemorySpaceOpt -- resolves generic pointers to specific address spaces; runs after SROA
  • Hash Infrastructure -- the open-addressing hash table used by the SROA pass state

EarlyCSE (Early Common Subexpression Elimination)

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

LLVM version note: Based on LLVM 20.0.0 EarlyCSE.cpp. Evidence: iterative (non-recursive) dominator-tree walk matches the LLVM 16+ refactoring; MemorySSA-backed variant with early-cse-memssa pipeline parameter matches LLVM 14+. NVIDIA adds four GPU extensions (barrier-aware versioning, AS 7 handling, NVVM call CSE, PHI limit) and a fourth scoped hash table not present in any upstream version.

EarlyCSE is a fast dominator-tree-walk pass that eliminates redundant computations, loads, and calls within a function. Cicc's version is not stock LLVM 20.0.0 -- the binary contains four CUDA-specific extensions that handle GPU memory model semantics: barrier-aware memory versioning with hardcoded NVVM intrinsic ID checks, shared memory address space 7 protection against unsafe store-to-load forwarding, a dedicated NVVM intrinsic call CSE handler with a fast-path for thread-invariant special register reads, and a PHI operand limit of 5 for compile-time control. It also adds a fourth scoped hash table (store-forwarding) that upstream lacks.

Key Facts

Property | Value
Pass name | "early-cse" (standard), "early-cse-memssa" (MemorySSA variant)
Pipeline parser params | memssa (selects MemorySSA-backed variant)
Entry point (standard) | sub_2778270
Entry point (MemorySSA) | sub_27783D0
Core function | sub_2780B00 (12,350 bytes)
NVVM call CSE handler | sub_2780450 (1,142 bytes, ~263 decompiled lines)
Pipeline slot | 245, 291 (tier 1); 525, 593 (tier 2+); ~370 (late)
Disable flag | NVVMPassOptions offset +1440
Pipeline assembler | sub_18E4A00 (MemorySSA variant), sub_196A2B0 (standard)
Upstream LLVM file | llvm/lib/Transforms/Scalar/EarlyCSE.cpp
NVIDIA modifications | Barrier generation tracking, AS 7 handling, NVVM call CSE, PHI limit, store-fwd table

Algorithm Overview

The pass performs a stack-driven iterative DFS over the dominator tree. At each basic block it scans instructions linearly, attempting three forms of elimination:

  1. Expression CSE -- arithmetic, casts, comparisons, GEPs with identical operands are looked up in a scoped hash table. If a matching canonical instruction exists, the redundant one is replaced via RAUW and erased.

  2. Load CSE and store-to-load forwarding -- loads from the same address and type as a prior load (or a prior store) are replaced with the already-available value. This is gated by a CurrentGeneration counter that invalidates stale entries whenever a memory-writing instruction or barrier intrinsic is encountered.

  3. Call CSE -- readonly/readnone calls with identical targets and arguments are deduplicated. The NVVM-specific handler sub_2780450 provides a fast-path for thread-invariant NVVM intrinsics (llvm.nvvm.read.ptx.sreg.*).

The dominator tree walk is not recursive. It uses an explicit growable stack (initial capacity 8 entries, 64 bytes) with DomTreeScope nodes that record per-scope hash table insertions. On scope exit all insertions are tombstoned. This matters for deeply-nested GPU kernel CFGs where stack overflow from recursion is a real risk.

function EarlyCSE(ctx):
    root = ctx.Function.DomTree.root
    stack.push(DomTreeScope(root))

    while stack is not empty:
        scope = stack.top()
        ctx.CurrentGeneration = scope.generation_begin

        if not scope.visited:
            for inst in scope.bb.instructions:
                processNode(ctx, inst)       // CSE logic below
            scope.visited = true
            scope.generation_end = ctx.CurrentGeneration
        else:
            if scope has unvisited children:
                child = scope.children.pop_front()
                stack.push(DomTreeScope(child))
                continue
            else:
                unwindScope(ctx, scope)      // tombstone entries, free node
                stack.pop()

DomTreeScope Structure

Each scope node is 160 bytes (0xA0), allocated via sub_22077B0:

Offset | Type | Field
+0x00 | u32 | generation_begin -- snapshot of CurrentGeneration at scope entry
+0x04 | u32 | generation_end -- value at scope exit (after processing all instructions)
+0x08 | BasicBlock* | The basic block for this domtree node
+0x10 | DomTreeNode** | children_begin
+0x18 | DomTreeNode** | children_end
+0x20 | scope link | Expression ScopedHT chain -> ctx+0x78
+0x38 | scope link | Load ScopedHT chain -> ctx+0x108
+0x50 | scope link | Call ScopedHT chain -> ctx+0x198
+0x68 | scope link | Call-values ScopedHT chain -> ctx+0x228
+0x80 | scope link | Store-fwd ScopedHT chain -> ctx+0x250
+0x98 | u8 | visited flag (0 = not yet processed, 1 = instructions scanned)

Each chain entry is a triplet [link_fwd, link_back, insertion_list_head] occupying 24 bytes. On scope exit, the pass walks each insertion list and tombstones the corresponding hash table entries, then frees the scope node.
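The scope-exit discipline can be sketched with a simplified model (illustrative only: the real ScopedHashTable threads its insertion lists intrusively through the chain entries above, and writes tombstone sentinels rather than erasing):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified scoped hash table: each scope records its own insertions
// so they can be undone when the dominator-tree walk leaves the scope.
struct ScopedTable {
    std::unordered_map<uint64_t, uint64_t> table;    // key -> canonical value
    std::vector<std::vector<uint64_t>> scopes;       // per-scope insertion lists

    void enter_scope() { scopes.emplace_back(); }

    void insert(uint64_t key, uint64_t value) {
        table[key] = value;
        scopes.back().push_back(key);                // remember for unwind
    }

    // On scope exit, every insertion made in that scope is retracted,
    // so entries from a sibling subtree can never be matched.
    void exit_scope() {
        for (uint64_t key : scopes.back()) table.erase(key);
        scopes.pop_back();
    }

    bool lookup(uint64_t key) const { return table.count(key) != 0; }
};
```

This is what makes the lookup dominance-correct: an entry is visible only while its defining block dominates the block being scanned.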

Four Scoped Hash Tables

Upstream LLVM EarlyCSE has three scoped hash tables (expression, load, call). Cicc adds a fourth dedicated to store-to-load forwarding.

Table | Context offset | Hash function | Equality | Key | Value
Expression | +0xE8 / +0xF8 | sub_277F590 | sub_277AC50 | Opcode + operand value-numbers | Canonical instruction pointer
Load | +0x178 / +0x188 | sub_277CF80 | sub_27792F0 | Load address + type | Previously loaded value
Call | +0x230 / +0x240 | sub_277CF80 | sub_27792F0 | Call target + arguments | Return value
Store-fwd | +0x2C0 / +0x2D0 | sub_277C800 | sub_27781D0 | Store address + type | Stored value

All four use open-addressing with linear probing. Sentinel values: 0xFFFFFFFFFFFFF000 = empty, 0xFFFFFFFFFFFFE000 = tombstone. Resize triggers at 75% load factor (4 * (count + 1) >= 3 * bucket_count) or when tombstones exceed 12.5% of capacity. Bucket counts are always a power of two.

The store-forwarding table is the NVIDIA addition. Upstream EarlyCSE performs store-to-load forwarding through the load table by inserting the stored value when a store is processed. Cicc separates this into a dedicated table, which enables more aggressive dead-store detection within the early pipeline -- two stores to the same address with no intervening load or barrier can be recognized without polluting the load table's namespace.

CUDA Extension 1: Barrier-Aware Memory Versioning

The context structure holds a CurrentGeneration counter at offset +0x2E0 (type u32). This counter acts as a memory version number. Every load and call CSE lookup checks whether the cached entry's generation matches the current generation -- a mismatch means an intervening memory-modifying operation invalidated the entry.

Generation is incremented when:

  • A trivially dead instruction is skipped (minor bump at 0x2781950)
  • sub_B46490 (hasMemoryWriteSideEffects) returns true for a call instruction
  • Any of four hardcoded NVVM barrier intrinsic IDs is encountered

The barrier intrinsic checks are explicit cmp dword ptr [rax+24h], IMM instructions at specific addresses in the binary:

Address | Encoding | Intrinsic ID | Decimal | Identity
0x2781B30 | cmp ..., 9Bh | 0x9B | 155 | llvm.nvvm.barrier0 (__syncthreads)
0x27812AF | cmp ..., CDh | 0xCD | 205 | llvm.nvvm.membar.* (device/system memory barrier)
0x2781F4D | cmp ..., 123h | 0x123 | 291 | llvm.nvvm.bar.sync (named barrier sync)
0x2781F40 | cmp ..., 144h | 0x144 | 324 | NVVM cluster barrier (SM 90+ cluster-scope fence)

These checks are a safety net on top of the intrinsics' declared memory-effect attributes. Upstream LLVM relies solely on the memory-effect modeling to determine whether a call clobbers memory. Cicc adds the explicit ID checks because the barrier intrinsics' memory effects, as declared in the NVVM tablegen files, may not fully capture the GPU-specific semantics: a bar.sync does not just write memory from the perspective of one thread -- it makes writes from other threads visible. The LLVM memory model has no native concept of inter-thread visibility guarantees at the IR level, so the explicit ID checks are the correctness backstop.

When any of these four intrinsics appears between two memory operations, EarlyCSE refuses to forward the earlier value. This prevents optimizations like:

;; INCORRECT optimization that barriers prevent:
%v1 = load i32, ptr addrspace(3) %p          ;; load from shared memory
call void @llvm.nvvm.barrier0()               ;; __syncthreads()
%v2 = load i32, ptr addrspace(3) %p          ;; CANNOT be replaced with %v1
;; Another thread may have written to %p between the barrier and this load
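The generation mechanics that enforce this can be sketched as follows (a simplified model, not the binary's layout: the real counter lives at context offset +0x2E0 and the cache is the scoped load table):

```cpp
#include <cstdint>
#include <unordered_map>

// Barrier-aware memory versioning: every cached load remembers the
// generation it was recorded under, and forwarding only succeeds when
// no barrier or memory write has bumped the counter since.
struct LoadCache {
    struct Entry { uint64_t value; uint32_t generation; };
    std::unordered_map<uint64_t, Entry> cache;   // address -> cached load
    uint32_t current_generation = 0;

    void record_load(uint64_t addr, uint64_t value) {
        cache[addr] = {value, current_generation};
    }

    // Called for memory-writing calls and the four barrier intrinsic IDs.
    void on_clobber() { ++current_generation; }

    // Returns true and fills *out only for a same-generation entry.
    bool try_forward(uint64_t addr, uint64_t* out) const {
        auto it = cache.find(addr);
        if (it == cache.end() || it->second.generation != current_generation)
            return false;                        // stale: a clobber intervened
        *out = it->second.value;
        return true;
    }
};
```

A __syncthreads() between the two loads in the IR above corresponds to one on_clobber() call, which makes the cached %v1 unmatchable.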

CUDA Extension 2: Shared Memory Address Space 7 Handling

Stores targeting NVPTX address space 7 (the internal representation for __shared__ memory) receive special treatment that prevents unsafe store-to-load forwarding.

At address 0x2781BB6, the pass checks byte [rdx+8] == 7 on the store's pointer operand type. When this matches, the store is routed through sub_B49E20 (isSharedMemoryStore), which calls sub_B43CB0 (getCalledFunction) and sub_B2D610 (hasIntrinsicID) to confirm the target is a shared memory variable (string ID 0x31 = "shared").

The motivation: shared memory is written by one thread and potentially read by a different thread after a barrier. Forwarding a stored value to a subsequent load in the same thread is only safe if no barrier intervenes -- but even then, a reimplementor must be careful because the CUDA memory model permits a thread to read its own store without a barrier, while other threads cannot. The shared-memory path in EarlyCSE conservatively disables forwarding for shared-memory stores to avoid the case where a load is CSE'd to the stored value, but the actual runtime value has been modified by another thread's post-barrier store to the same location.

processStore(ctx, store_inst):
    ptr_type = store_inst.pointer_operand.type
    if ptr_type.address_space == 7:                 // NVPTX shared memory
        if isSharedMemoryStore(store_inst):         // sub_B49E20
            ctx.CurrentGeneration++                 // invalidate load/call tables
            return                                  // do NOT insert into store-fwd table
    // Normal path: insert stored value into store-fwd table for later forwarding
    insertStoreForwarding(ctx, store_inst)

CUDA Extension 3: NVVM Intrinsic Call CSE (sub_2780450)

The dedicated function sub_2780450 (1,142 bytes, ~263 decompiled lines) handles CSE for calls to NVVM builtin intrinsics. It is entered when the main instruction loop detects a single-use-by-call pattern: the instruction's result has exactly one user, that user is a CallInst (opcode 0x1F), and the operand index is 3.

The function provides a fast-path for thread-invariant special register reads. Many NVVM intrinsics return values that are constant for the lifetime of a kernel invocation from a given thread's perspective:

  • llvm.nvvm.read.ptx.sreg.tid.x/y/z -- threadIdx.x/y/z
  • llvm.nvvm.read.ptx.sreg.ntid.x/y/z -- blockDim.x/y/z
  • llvm.nvvm.read.ptx.sreg.ctaid.x/y/z -- blockIdx.x/y/z
  • llvm.nvvm.read.ptx.sreg.nctaid.x/y/z -- gridDim.x/y/z
  • llvm.nvvm.read.ptx.sreg.warpsize
  • llvm.nvvm.read.ptx.sreg.laneid

Upstream LLVM would model these as readnone and CSE them through the generic call table. The NVVM-specific handler recognizes these intrinsic IDs directly via sub_987FE0 (getIntrinsicID), avoiding the overhead of the general readonly-call analysis. For a kernel that references threadIdx.x twenty times, the fast-path eliminates nineteen redundant intrinsic calls in a single pass.

The function also handles two additional NVVM intrinsic IDs:

ID | Decimal | Identity | CSE behavior
0xE4 | 228 | NVVM load intrinsic | CSE-able if same address and no intervening clobber
0xE6 | 230 | NVVM store intrinsic | Blocks CSE (generation bump)

The check at 0x2783890 tests for intrinsic ID 228 and at 0x27839BC for intrinsic ID 230. The store intrinsic (230) triggers a generation bump, while the load intrinsic (228) is treated as a CSE candidate.
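The thread-invariant fast path reduces to a lookup keyed by intrinsic ID alone, since a special-register read returns the same value for the whole kernel invocation. A simplified, hypothetical model (real EarlyCSE tracks canonical instructions, not raw integers):

```cpp
#include <cstdint>
#include <unordered_map>

// Fast-path CSE for thread-invariant special register reads: the first
// call per intrinsic ID becomes canonical; later calls with the same ID
// are replaced regardless of the memory generation, because sreg values
// never change during a kernel.
struct SregCSE {
    std::unordered_map<uint32_t, uint64_t> canonical;  // intrinsic ID -> first call

    uint64_t canonicalize(uint32_t intrinsic_id, uint64_t call_inst) {
        // try_emplace keeps the existing entry if the ID was already seen
        return canonical.try_emplace(intrinsic_id, call_inst).first->second;
    }
};
```

In this model, twenty reads of threadIdx.x all map back to the first call, matching the nineteen eliminated duplicates described above.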

CUDA Extension 4: PHI Operand Limit

At address 0x2781BED, the pass checks:

if PHINode.getNumIncomingValues() > 5:
    skip CSE analysis for this PHI

This is a compile-time heuristic absent from upstream LLVM. GPU kernel code after loop unrolling and predication commonly produces PHI nodes with dozens of operands. Comparing all incoming values for CSE equivalence becomes quadratic in the operand count (each pair of values must be checked for dominance and equivalence), and the benefit for wide PHIs is marginal -- they rarely represent true common subexpressions.

The threshold of 5 is hardcoded with no cl::opt override.

Instruction Classification

The inner processing loop at 0x2780EB5--0x2781110 classifies each instruction by its opcode byte at [instr-0x18]:

| Opcode (hex) | Name | Instruction | EarlyCSE action |
|---|---|---|---|
| 0x55 | Store | StoreInst | Store-to-load forwarding path; shared memory check |
| 0x3D | Call | CallInst | Call CSE or generation bump (if memory effects) |
| 0x3E | Invoke | InvokeInst | Same as CallInst |
| 0x3F | Select | SelectInst | Expression CSE with type-size check |
| 0x40 | PHI | PHINode | Expression CSE if operand count <= 5 |
| <= 0x1C | -- | Constants/args | Skip (not instructions) |
| 0x29 | Return | ReturnInst | Skip |
| 0x43--0x4F | Casts | Cast instructions | Expression CSE |

The classification dispatches to these helper predicates:

| Helper | Address | Purpose |
|---|---|---|
| sub_AA54C0 | 0x2780EC6 | isTriviallyDead -- if true, bump generation and skip |
| sub_D222C0 | 0x2780F97 | isSimpleExpression -- arithmetic, casts, comparisons, GEPs |
| sub_F50EE0 | 0x2780F7A | canCSE / doesNotAccessMemory |
| sub_1020E10 | 0x2781967 | getCallCSEValue -- readonly/readnone call check |
| sub_B46420 | 0x2781B95 | isLoadCSECandidate |
| sub_B46490 | 0x2781CC6 | hasMemoryWriteSideEffects -- triggers generation bump |

Load-Store Forwarding Detailed Flow

The most complex code path (0x2781B48--0x2781F32) handles load CSE and store-to-load forwarding:

processLoad(ctx, load_inst):
    key = computeLoadCSEKey(load_inst, ctx.DataLayout)    // sub_2779A20
    if key.status != 0:
        // Cannot form clean key -- check if call/invoke returns equivalent value
        if load_inst is CallInst (0x3D) or InvokeInst (0x3E):
            tryCallValueForwarding(ctx, load_inst)
        return

    // Check for preceding store to same address
    store_entry = lookupStoreTable(ctx, key)
    if store_entry and store_entry.generation == ctx.CurrentGeneration:
        // Forward stored value to this load
        salvageDebugInfo(load_inst, store_entry.value)    // sub_BD84D0
        replaceAllUsesWith(load_inst, store_entry.value)  // sub_11C4E30
        eraseInstruction(load_inst)                       // sub_B43D60
        return CHANGED

    // Check for preceding load from same address
    load_entry = lookupLoadTable(ctx, key)
    if load_entry and load_entry.generation == ctx.CurrentGeneration:
        // Replace with previously loaded value
        replaceAllUsesWith(load_inst, load_entry.value)
        eraseInstruction(load_inst)
        return CHANGED

    // Not found -- insert into load table for future lookups
    insertLoadTable(ctx, key, load_inst, ctx.CurrentGeneration)

For stores, the pass also performs dead-store detection within the same scope: if two stores target the same address with no intervening load or barrier, the earlier store is dead. The barrier check uses the same four intrinsic ID comparisons described above.
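
The dead-store rule can be sketched as a single scan over memory events, with barriers (intrinsic IDs 155, 205, 291, 324 in the binary) closing the forwarding window (a simplified model):

```python
# Sketch of within-scope dead-store detection: a store is dead if a later
# store hits the same address with no intervening load of that address and
# no barrier. Events are ("store", addr), ("load", addr), or ("barrier",).
def find_dead_stores(events):
    last_store = {}   # addr -> index of the most recent unconsumed store
    dead = []
    for i, ev in enumerate(events):
        if ev[0] == "barrier":
            last_store.clear()           # pre-barrier stores may be observed
        elif ev[0] == "load":
            last_store.pop(ev[1], None)  # load consumes the pending store
        else:  # store
            if ev[1] in last_store:
                dead.append(last_store[ev[1]])  # earlier store overwritten
            last_store[ev[1]] = i
    return dead

events = [("store", "p"), ("store", "p"),                 # index 0 is dead
          ("store", "q"), ("barrier",), ("store", "q")]   # barrier protects q
dead = find_dead_stores(events)
```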

Type Compatibility and Bitwidth Handling

At 0x27829C3--0x2782B87, for expression CSE of SelectInst and PHINode:

  • sub_AE43F0 computes type size in bits via the DataLayout
  • If size <= 64 bits: use a u64 bitmask as the CSE key
  • If size > 64 bits: allocate a BitVector via sub_C43690 and use bit-level comparison

At 0x2782F72--0x2782FD5, integer constant range analysis computes leading zeros/ones to determine the effective bit-width. If the value fits in fewer bits, EarlyCSE allows CSE across different integer types (e.g., a value zero-extended from i32 to i64 matched against a native i64). This is an NVIDIA extension that upstream LLVM does not perform -- upstream requires exact type matches for expression CSE.
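
A sketch of the effective-width idea, assuming the simplest interpretation of the leading-zero/one analysis (the binary's actual key layout is not reproduced here):

```python
# Sketch of effective bit-width: count leading zeros (non-negative values)
# or leading sign bits (negative values) to find the narrowest integer that
# still holds the constant. Keying on (effective width, value) lets two
# constants of different declared widths share a CSE key.
def effective_bits(value):
    if value >= 0:
        return max(value.bit_length(), 1)   # leading zeros ignored
    return (~value).bit_length() + 1        # sign bits ignored, +1 for sign

def cse_key(value):
    # Declared type width deliberately not part of the key.
    return (effective_bits(value), value)
```

With this keying, the constant 7 held in an i32 and the same 7 zero-extended to i64 hash identically, which is the cross-type match upstream refuses.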

Context Structure Layout

The EarlyCSEContext structure passed to sub_2780B00 in rdi:

| Offset | Field | Size |
|---|---|---|
| +0x00 | Current instruction pointer | 8 |
| +0x08 | DataLayout* / TargetData* | 8 |
| +0x10 | Function* (-> [+0x60] = DomTree root) | 8 |
| +0x18 | TargetLibraryInfo* | 8 |
| +0x20 | AssumptionCache* | 8 |
| +0x68 | MemDep result tracking | 8 |
| +0x70 | MemDep analysis reference | 8 |
| +0xE8--+0x110 | Expression hash table (buckets, count, ScopedHT, free list, allocator) | 40 |
| +0x170--+0x198 | Load hash table + ScopedHT | 40 |
| +0x200--+0x258 | Call hash table + ScopedHT | 88 |
| +0x2B8--+0x2D8 | Store-fwd hash table + ScopedHT | 32 |
| +0x2E0 | CurrentGeneration (u32) | 4 |

Stack frame: 0x1D0 bytes (sub rsp, 0x1A8 + 5 callee-saved pushes).

Scope Page Management

The scoped hash tables use 512-byte (0x200) scope pages chained together. When a page fills:

  1. At 0x2781328: fetch previous page via [stack.end - 8], advance by 0x200 to the next chained page.
  2. At 0x2782260: when reclaiming, free the current page and pop from the page pointer array.

The initial worklist stack is 64 bytes (8 entries of 8 bytes each). The scope-page-pointer array is 8-byte aligned via lea rbx, [rdx*4 - 4]; and rbx, ~7; add rbx, rax.
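
The alignment sequence computes base + ((4n - 4) & ~7), i.e. the scaled offset rounded down to an 8-byte boundary; in sketch form:

```python
# Sketch of the scope-page-pointer alignment sequence:
#   lea rbx, [rdx*4 - 4] ; and rbx, ~7 ; add rbx, rax
# The AND with ~7 clears the low 3 bits, rounding the offset DOWN to a
# multiple of 8 before adding the array base.
def page_array_slot(base, n):
    offset = (n * 4 - 4) & ~7
    return base + offset
```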

memssa Pipeline Parameter

The pipeline parser registers "early-cse" at slot 394 with the parameter keyword memssa. When memssa is specified, the pass uses the MemorySSA-backed variant (sub_27783D0, pass name "Early CSE w/ MemorySSA") instead of the standard variant (sub_2778270, pass name "Early CSE"). Both variants call the same core function sub_2780B00; the difference is that the MemorySSA variant receives a pre-built MemorySSA graph in the context structure and uses it for more precise clobber queries, avoiding the O(n^2) scanning that the non-MSSA path falls back to for load CSE.

Knobs

| Knob | Default | Description |
|---|---|---|
| enable-earlycse-memoryssa | true | Master switch for MemorySSA integration |
| earlycse-debug-hash | false | Debug: log hash function inputs/outputs |
| earlycse-mssa-optimization-cap | 500 | Max MemorySSA queries per block before falling back to conservative scanning |
| enable-earlycse-imprecision | false | Allow approximate analysis in pathological cases (huge blocks, deep PHI nests) |

No dedicated cl::opt flags exist for any of the four NVIDIA extensions. The PHI operand limit of 5, the four barrier intrinsic IDs, and the shared-memory address space 7 check are all hardcoded in the binary.

Pipeline Positions and Tier Gating

| Tier | Position(s) | Notes |
|---|---|---|
| Tier 1 (O1) | Skipped | sub_12DE8F0 explicitly gates EarlyCSE with tier != 1 |
| Tier 2 (O2) | 525, 593 | Two invocations: early function simplification and post-loop-optimization |
| Tier 3 (O3) | 245, 291, ~370 | Three invocations; additional late-pipeline run |
| Ofcmid | After Sinking2 | Single invocation in the moderate-optimization path |

The pass is independently disableable via NVVMPassOptions at offset +1440. The same offset gates the standard and MemorySSA variants identically.

Key Constants

| Value | Hex | Meaning |
|---|---|---|
| 160 | 0xA0 | DomTreeScope node size |
| 512 | 0x200 | Scope page size |
| 64 | 0x40 | Initial stack capacity (8 entries) |
| 48 | 0x30 | Hash table entry node size |
| 40 | 0x28 | Insertion record size |
| 0xFFFFFFFFFFFFF000 | -- | Hash table EMPTY sentinel |
| 0xFFFFFFFFFFFFE000 | -- | Hash table TOMBSTONE sentinel |
| 155 | 0x9B | llvm.nvvm.barrier0 intrinsic ID |
| 205 | 0xCD | llvm.nvvm.membar.* intrinsic ID |
| 291 | 0x123 | NVVM bar.sync intrinsic ID |
| 324 | 0x144 | NVVM cluster barrier intrinsic ID |
| 228 | 0xE4 | NVVM load intrinsic ID |
| 230 | 0xE6 | NVVM store intrinsic ID |
| 5 | -- | PHI operand limit for CSE |

Differences from Upstream LLVM 20.0.0

| Feature | Upstream | Cicc |
|---|---|---|
| Scoped hash tables | 3 (expression, load, call) | 4 (+ store-forwarding) |
| Barrier intrinsic checks | Relies on memory-effect attributes only | Explicit ID checks for IDs 155, 205, 291, 324 |
| Shared memory handling | No address-space-specific logic | AS 7 stores skip store-fwd insertion, bump generation |
| NVVM intrinsic call CSE | Generic readonly-call path | Dedicated sub_2780450 with fast-path for sreg.* reads |
| PHI operand limit | None | Skip CSE for PHI nodes with >5 incoming values |
| Cross-type expression CSE | Exact type match required | Allows CSE across integer widths when value range fits |
| Dominator tree walk | Recursive in many LLVM builds | Always iterative (explicit stack) |

Function Map

| Function | Address | Size (bytes) |
|---|---|---|
| EarlyCSEPass::run (standard variant entry) | sub_2778270 | -- |
| EarlyCSEPass::run (MemorySSA variant entry) | sub_27783D0 | -- |
| Core pass body (domtree walk + instruction processing) | sub_2780B00 | 12,350 |
| handleNVVMCallCSE (NVVM intrinsic call CSE) | sub_2780450 | 1,142 |
| Expression hash function | sub_277F590 | -- |
| Expression equality check | sub_277AC50 | -- |
| Load/call key hash | sub_277CF80 | -- |
| Load/call key equality | sub_27792F0 | -- |
| Store key hash | sub_277C800 | -- |
| Store key equality | sub_27781D0 | -- |
| isSimpleExpression | sub_D222C0 | -- |
| canCSE / doesNotAccessMemory | sub_F50EE0 | -- |
| isSharedMemoryStore (AS 7 check) | sub_B49E20 | -- |
| isSharedMemoryAccess | sub_B49E00 | -- |
| getCallCSEValue (readonly/readnone check) | sub_1020E10 | -- |
| isLoadCSECandidate | sub_B46420 | -- |
| hasMemoryWriteSideEffects | sub_B46490 | -- |
| computeCSEHash / isVolatile | sub_B46500 | -- |
| getIntrinsicID (NVVM intrinsic ID from call) | sub_987FE0 | -- |
| isTriviallyDead | sub_AA54C0 | -- |
| replaceAllUsesWith (RAUW) | sub_11C4E30 | -- |
| salvageDebugInfo | sub_BD84D0 | -- |
| eraseInstruction | sub_B43D60 | -- |
| removeFromParent | sub_27793B0 | -- |
| computeLoadCSEKey | sub_2779A20 | -- |
| insertStoreForwarding | sub_27808D0 | -- |
| insertExprIntoScopedHT | sub_27801B0 | -- |
| lookupScope (find value by generation) | sub_277D510 | -- |
| lookupCallTable | sub_277D3C0 | -- |
| lookupInScopedHT | sub_2778110 | -- |
| shouldInsertIntoTable | sub_27785B0 | -- |
| growTable (double hash table size) | sub_277C980 | -- |
| insertIntoTable (post-grow insert) | sub_277C8A0 | -- |
| cleanupLoadTable (compact after scope exit) | sub_277FFC0 | -- |
| cleanupCallTable (compact after scope exit) | sub_277A110 | -- |
| compareLoadTypes (type compatibility) | sub_277A9A0 | -- |
| TargetData::getTypeSizeInBits | sub_AE43F0 | -- |
| getCalledFunction | sub_B43CB0 | -- |
| hasIntrinsicID | sub_B2D610 | -- |

Common Pitfalls

These are mistakes a reimplementor is likely to make when extending EarlyCSE for a GPU target with barrier semantics.

1. Relying solely on LLVM memory-effect attributes to model barrier semantics. Upstream LLVM models barrier intrinsics as memory-writing calls, which triggers a generation bump through the standard hasMemoryWriteSideEffects path. This is insufficient for GPU barriers: a bar.sync does not just write memory from one thread's perspective -- it makes writes from other threads visible. The LLVM memory model has no native concept of inter-thread visibility guarantees. Cicc adds explicit hardcoded checks for four intrinsic IDs (155, 205, 291, 324) as a safety net. A reimplementation that trusts the declared memory effects alone will forward values across barriers, producing load CSE that reads stale pre-barrier data written by a different thread.

2. Forwarding stores to loads across barriers in shared memory (AS 7). When thread T0 stores to smem[0], a barrier fires, and thread T1 loads from smem[0], the load must see T1's own value (if it wrote) or the value written by whichever thread last stored before the barrier. Forwarding T0's stored value to T0's subsequent load is only safe if no barrier intervenes and no other thread could have written to the same location. Cicc's AS 7 handling conservatively disables store-to-load forwarding for all shared memory stores by bumping the generation counter. A reimplementation that allows shared memory store forwarding without barrier awareness will produce reads that return the local thread's stale value instead of the globally-visible post-barrier value.

3. Missing one or more of the four barrier intrinsic IDs. Cicc checks for IDs 155 (barrier0 / __syncthreads), 205 (membar.*), 291 (bar.sync), and 324 (cluster barrier for SM 90+). A reimplementation that only handles __syncthreads (ID 155) will fail to invalidate the load/call tables when a bar.sync or cluster barrier is encountered. The result: loads before and after a named barrier or cluster-scope fence are incorrectly CSE'd, producing silent data corruption in multi-CTA cooperative kernels.

4. Applying expression CSE to PHI nodes with more than 5 incoming values. Cicc hardcodes a PHI operand limit of 5 for CSE analysis. GPU kernel code after loop unrolling and predication commonly produces PHI nodes with dozens of operands. Comparing all incoming values for CSE equivalence is quadratic in operand count, and the benefit for wide PHIs is negligible -- they rarely represent true common subexpressions. A reimplementation without this threshold will experience severe compile-time regressions on heavily unrolled GPU kernels.

5. Not adding a dedicated store-forwarding hash table. Upstream LLVM uses three scoped hash tables (expression, load, call). Cicc adds a fourth table dedicated to store-to-load forwarding. Without this separation, inserting stored values into the load table pollutes the load namespace, making dead-store detection within the same scope unreliable. Two stores to the same address with no intervening load or barrier should trigger dead-store elimination of the earlier store; mixing stores into the load table obscures this pattern.

Cross-References

  • Scalar Passes Hub -- hub page linking SROA, EarlyCSE, and JumpThreading with GPU-context summaries
  • MemorySSA Builder for GPU -- the MemorySSA infrastructure consumed by the early-cse-memssa variant
  • Hash Infrastructure -- the universal DenseMap mechanics shared by all four hash tables
  • Barriers & Sync -- the barrier builtins whose intrinsic IDs trigger generation bumps
  • Dead Synchronization Elimination -- the 96KB pass that removes dead barriers; interacts with EarlyCSE's barrier-aware generation tracking
  • GVN -- the more expensive redundancy elimination pass that complements EarlyCSE later in the pipeline
  • DSE -- Dead Store Elimination, which complements EarlyCSE's within-scope store-to-load forwarding with cross-block analysis
  • Pipeline & Ordering -- tier-dependent scheduling and NVVMPassOptions gating
  • Alias Analysis & NVVM AA -- address-space-aware alias analysis that feeds into MemorySSA clobber queries

InstCombine

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/Transforms/InstCombine/InstructionCombining.cpp, llvm/lib/Transforms/InstCombine/InstCombine*.cpp (LLVM 20.0.0). The upstream is split across ~15 files by instruction category; cicc inlines them into a single monolithic visitor.

NVIDIA's InstCombine in CICC v13.0 is approximately twice the size of upstream LLVM's, weighing in at roughly 405 KB for the main visitor alone. The monolithic visitor function at sub_10EE7A0 dispatches across 80 unique opcode cases through a three-level switch structure, handling standard LLVM instructions, NVIDIA-extended vector and FMA operations, and three high-opcode NVVM intrinsic dead-code elimination patterns. A separate 87 KB intrinsic folding function (sub_1169C30) handles NVVM-specific canonicalization, and a 127 KB computeKnownBits implementation (sub_11A7600) provides the dataflow backbone. This page covers the visitor architecture, the per-instruction-type visitors recovered from the binary, and the NVIDIA-specific extensions that distinguish this implementation from upstream.

Registration: New PM #398, parameterized: no-aggressive-aggregate-splitting;...;max-iterations=N
Runtime positions: Tier 0 #28 (via sub_19401A0); Tier 1/2/3 #42 (gated by !opts[1000]); see Pipeline
Main visitor: sub_10EE7A0 (0x10EE7A0, ~405 KB, 9,258 lines)
Intrinsic folding: sub_1169C30 (0x1169C30, ~87 KB, 2,268 lines)
computeKnownBits: sub_11A7600 (0x11A7600, ~127 KB, 4,156 lines)
SimplifyDemandedBits: sub_11AE870 / sub_11AE3E0 (wrapper + hash table)
Opcode cases: 80 unique case labels across 3 switch blocks
NVIDIA extra size: ~200 KB beyond upstream (~87 KB intrinsic fold + ~113 KB expanded cases)

Visitor Architecture

The main visitor sub_10EE7A0 receives an NVVM IR node pointer (__m128i* a2) and attempts to simplify it. A persistent local v1612 aliases the instruction being visited. The function has four structural regions:

Preamble (lines ~1760--2000) performs pre-dispatch checks: validating call-site attributes (opcode 41 for bitwise-assert), handling ternary FMA instructions (opcodes 238--245), checking for constant-foldable select patterns, canonicalizing operand ordering (constant to RHS), and running SimplifyDemandedBits via sub_11A3F30 on the result type.

Opcode dispatch reads the NVVM opcode via sub_987FE0 (getOpcode) and uses a three-level switch:

| Switch level | Opcode range | Description |
|---|---|---|
| Level 1 | 0x99--0x2A5 (main) | Standard LLVM instructions (GEP, select, stores, casts, compares, calls, vectors) |
| Level 2 | 0x01--0x42 (low) | Binary operations, casts, early comparisons |
| Level 3 | > 0x13CF (high) | NVIDIA proprietary intrinsic IDs (9549, 9553, 9567) |

Additional if-else chains handle intermediate ranges: opcodes 0xC7E (3198), 0x2E2 (738), 0x827 (2087), 0x2CC (716), 0xE07--0xE08, 0xE4F--0xE51, 0x13C6--0x13C7, and 0x13CD--0x13CE.

The fallback path at LABEL_95 calls sub_F0C430 for generic simplification. The no-change return path at LABEL_155 is referenced 101 times throughout the function.

Per-Instruction Visitors

Each major instruction type is handled by a dedicated visitor function called from the main dispatch. The following table summarizes the recovered visitors with their sizes and key characteristics.

visitBinaryOperator -- sub_10D8BB0

Address: 0x10D8BB0 (102 KB, 2,078 lines)
Dispatch case: 0x3A in the master dispatcher
Sibling cases: 0x39 (NSW/NUW-focused), 0x3B (associative/commutative)

This is the second-largest visitor. It implements approximately 25 cascading simplification phases for all binary arithmetic (Add, Sub, Mul, Div, Rem, Shl, LShr, AShr, And, Or, Xor, and their floating-point counterparts). The phases execute in a strict try-and-return order:

Phase 0 runs quick exits: pattern-matched constant fold (sub_101E960), SimplifyBinOp (sub_F29CA0), algebraic identities (sub_F0F270), NSW/NUW simplification (sub_F11DB0), and critically the NVIDIA-specific intrinsic handler sub_11AE870 which runs before any standard LLVM folds.

Phases 1--9 handle associative/commutative factoring, cross-operand Mul-of-Add matching, delegated simplification, overflow detection, and multiply-shift strength reduction. Phase 5 detects multiply-by-power-of-2 and converts to shift; sub_10BA120 builds the full strength reduction for patterns like x * (2^n + 1) into (x << n) + x.
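
The Phase 5 pattern can be checked with a quick arithmetic sketch (sub_10BA120's actual matching operates on IR nodes; this only demonstrates the identity x * (2^n + 1) = (x << n) + x):

```python
# Sketch of the multiply strength reduction: when the constant factor is
# 2**n + 1, a multiply becomes a shift plus an add.
def strength_reduce_mul(x, c):
    if c >= 2:
        n = (c - 1).bit_length() - 1
        if c - 1 == 1 << n:                # c is 2**n + 1
            return (x << n) + x            # shift-and-add form
    return x * c                           # pattern not applicable
```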

Phases 10--25 cover Add-of-Mul factoring, shift chains, linear expression folding, subtraction of multiplied constants, demanded-bits masking, reciprocal elimination, overflow intrinsic decomposition, and division/remainder folding. The division constant folder uses sub_C46BD0 (APInt::udiv), sub_C499A0 (APInt::urem), sub_C45F70 (APInt::sdiv), and sub_C49AB0 (APInt::srem).

Four template-instantiated helpers at sub_10D2680--sub_10D2D70 (2,767 bytes each, identical structure) implement matchBinOpReduction parameterized by NVVM intrinsic ID (329, 330, 365, 366) and acceptable opcode range. These detect NVVM horizontal reduction intrinsics (e.g., horizontal add/mul across vector lanes) and simplify them to scalar binary operations.

visitICmpInst -- sub_1136650 + sub_113CA70

Comprehensive folder: sub_1136650 (0x1136650, 149 KB, 3,697 lines)
Per-opcode dispatch: sub_113CA70 (0x113CA70) -- 12 case labels

The ICmp folder is the single largest function in InstCombine. It runs before the per-opcode dispatch table and handles 15 major fold categories: all-ones/sign-bit constant folds, Mul-with-constant strength reduction (NUW-gated), nested Mul decomposition, common sub-operand cancellation, NUW/NSW flag-gated predicate conversion, known-nonnegativity folds, ConstantRange intersection, shared sub-operand elimination, Sub sign-bit analysis, min/max pattern recognition, computeKnownBits sign-bit analysis, power-of-2 optimizations, remainder pattern matching, XOR/shift decomposition, and Or/And decomposition with type width folding.

NVVM uses a custom predicate encoding stored at ICmpInst+2 as a 6-bit field (*(_WORD*)(inst+2) & 0x3F):

| Value | Predicate | Value | Predicate |
|---|---|---|---|
| 32 | EQ | 33 | NE |
| 34 | UGT | 35 | UGE |
| 36 | ULT | 37 | ULE |
| 38 | SGT | 39 | SGE |
| 40 | SLT | 41 | SLE |
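
Decoding the field is a straightforward mask-and-lookup (a sketch, assuming a plain 16-bit read; the mapping is as recovered above):

```python
# Sketch of decoding the 6-bit NVVM predicate field stored at ICmpInst+2:
#   predicate = *(uint16_t*)(inst + 2) & 0x3F
PREDICATES = {32: "EQ", 33: "NE", 34: "UGT", 35: "UGE", 36: "ULT",
              37: "ULE", 38: "SGT", 39: "SGE", 40: "SLT", 41: "SLE"}

def decode_predicate(word_at_inst_plus_2):
    return PREDICATES.get(word_at_inst_plus_2 & 0x3F, "unknown")
```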

The per-opcode dispatch at sub_113CA70 routes based on the non-constant operand's opcode tag:

| Tag | Instruction | Handler | Size |
|---|---|---|---|
| * (42) | Mul | sub_1128290 | 1,178 lines |
| , (44) | Add | sub_1119FB0 | 413 lines |
| . (46) | Trunc | sub_1115510 | -- |
| 0 (48) | SExt | sub_11164F0 | -- |
| 1 (49) | ZExt | sub_1122A30 | -- |
| 4 (52) | Select | sub_1115C10 | 428 lines |
| 6 (54) | And | sub_1120680 | 911 lines |
| 7 (55) | Or | sub_1126B10 | 786 lines |
| 8 (56) | Xor | sub_1126B10 | shared with Or |
| 9 (57) | Shl | sub_112C930 | 664 lines |
| : (58) | LShr | sub_1133500 | -- |
| ; (59) | Sub | sub_111CED0 + sub_113BFE0 | 519 lines |

visitCastInst -- sub_110CA10

Address: 0x110CA10 (93 KB, 2,411 lines)
Cast chain helper: sub_110B960 (22 KB, 833 lines)

Handles all cast simplification: same-type identity elimination, bool-to-float chains, integer-to-integer narrowing/widening, FP-to-int special cases, FP narrowing, cast-through-select/PHI, and the major cast-of-cast chain folding. The helper sub_110B960 implements deep cast chain folding for aggregate types using a worklist with a DenseMap for O(1) deduplication, preventing exponential blowup on diamond-shaped use-def graphs. The function is conservative about side effects: sub_B46500 (isVolatile) is called before every fold.
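
The deduplicated-worklist pattern can be sketched as follows; the visited set plays the role of the DenseMap, making the walk linear even on diamond-shaped use-def graphs:

```python
# Sketch of the deduplicated worklist walk used for cast-chain folding:
# each node is visited exactly once, so diamond-shaped graphs (one def
# reaching a node along two paths) cost linear, not exponential, time.
def walk_chain(root, users):
    visited = set()                 # plays the role of the DenseMap
    order, worklist = [], [root]
    while worklist:
        node = worklist.pop()
        if node in visited:
            continue                # O(1) dedup check
        visited.add(node)
        order.append(node)
        worklist.extend(users.get(node, ()))
    return order

# Diamond: a feeds b and c, both feed d -- d is still visited only once.
users = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
visited_order = walk_chain("a", users)
```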

visitSelectInst -- sub_1012FB0

Address: 0x1012FB0 (74 KB, 1,801 lines)
Local variables: 190 total

Implements 18 prioritized select simplifications: constant fold, undef arm elimination, both-same identity, PHI-through-select, KnownBits sign analysis, ConstantRange analysis, full-range analysis, KnownBits cross-validation, ICmpInst arm synthesis, ExtractValue decomposition, implied condition, canonicalization (delegated to sub_1015760, 27 KB), min/max pattern detection (smin/smax/umin/umax/abs/nabs via four helpers), select-in-comparison chains, PHI-select worklist scan (DenseMap with hash (ptr >> 9) ^ (ptr >> 4)), ValueTracking classification, pointer-null folding, and load/trunc delegation.

visitPHINode -- sub_1175E90

Address: 0x1175E90 (~57 KB, ~2,130 lines)

Implements 16 PHI optimization strategies tried in sequence: SimplifyInstruction constant fold, foldPHIArgOpIntoPHI (binary/cast with one varying operand), foldPHIArgConstantOp, typed opcode dispatch (GEP via sub_1172510, InsertValue, ExtractValue, CmpInst, BinOp/Cast), GEP incoming deduplication with loop back-edge analysis, single-use PHI user check, GEP-of-PHI transform (sub_1174BB0, 1,033 lines), phi-cycle escape detection, trivial PHI elimination (all-same non-PHI value), recursive PHI cycle resolution (sub_116D410), operand reordering canonicalization, identical-PHI-in-block deduplication, pointer-type struct GEP optimization, all-undef incoming check, and dominator-tree GEP index hoisting using two DenseMaps.

visitCallInst -- sub_1162F40

Address: 0x1162F40 (50 KB, 1,647 lines)

Processes calls through a 15-step cascade: LibCall simplification (sub_100A740), standard intrinsic folding (sub_F0F270), return attribute analysis (sub_F11DB0), overflow/saturating arithmetic (sub_115C220), inline mul-by-constant folding, generic call combining (sub_115A080), FMA/fneg/fsub canonicalization (the largest block, requiring all of nnan+ninf+nsz+arcp+reassoc on both call and function), constant-argument intrinsic folding, unary intrinsic constant folding, exp/log pair detection (IDs 325 and 63), sqrt/rsqrt folding (IDs 284, 285), min/max folding (IDs 88, 90), nested intrinsic composition, division-to-reciprocal-multiply, and finally the NVIDIA-specific sub_115A4C0 which dispatches to the 87 KB intrinsic folding table.

visitLoadInst -- sub_1152CF0

Address: 0x1152CF0 (~68 KB, ~1,680 lines)
Stack frame: 0x4F0 (1,264 bytes)

Four major paths: constant-address fold (loads from known constant pointers with types <= 64 bits are replaced via symbol table lookup using sub_BCD420), address-space-based elimination (loads from non-AS(32) pointers are replaced with constants, exploiting CUDA's read-only address spaces), the main store-to-load forwarding worklist (BFS over the def-use graph following GEPs, PHIs, and bitcasts, depth-limited by global qword_4F90528), and dominator-based forwarding for non-pointer loads. Alignment is propagated as the maximum of source and destination, with the volatile bit carefully preserved through the *(node+2) 16-bit field (bits [5:0] = log2(alignment), bit [6] = volatile flag).
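
The bit-field packing can be sketched directly from the recovered layout:

```python
# Sketch of the 16-bit field at *(node+2) on loads:
#   bits [5:0] = log2(alignment), bit [6] = volatile flag.
def pack_load_flags(alignment, is_volatile):
    log2a = alignment.bit_length() - 1
    assert 1 << log2a == alignment, "alignment must be a power of two"
    return log2a | (0x40 if is_volatile else 0)

def unpack_load_flags(field):
    return (1 << (field & 0x3F), bool(field & 0x40))
```

Propagating "the maximum of source and destination alignment" then amounts to taking the larger log2 value while ORing the volatile bits through unchanged.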

NVIDIA-Specific Extensions

NVVM Intrinsic Folding -- sub_1169C30

This 87 KB function is the core of NVIDIA's additions to InstCombine. Called from the main visitor when the instruction is an NVIDIA intrinsic, it uses a two-layer dispatch:

Layer 1 (primary switch, entered when the uses-list is empty or the "fast" flag at a1+336 is set) dispatches on the node's byte-tag:

| Tag | Char | Fold type |
|---|---|---|
| 42 | * | FNeg/negation -- pushes negation through arithmetic via the "Negator" chain |
| 55 | 7 | Vector extract from intrinsic result (full-width extract becomes identity) |
| 56 | 8 | Vector insert into intrinsic result (full-width insert becomes And mask) |
| 59 | ; | Multiply-like symmetric intrinsic (folds when one operand is known non-negative) |
| 68 | D | ZExt of i1 intrinsic result (bypasses intrinsic wrapper) |
| 69 | E | SExt of i1 intrinsic result (bypasses intrinsic wrapper) |
| 85 | U | Call-site fold for llvm.nvvm.* with specific IDs (313, 362) |
| 86 | V | Select-like intrinsic fold (dead select elimination) |

Layer 2 (depth-gated by qword_4F908A8 = instcombine-negator-max-depth) adds aggressive cases:

| Tag | Char | Fold type |
|---|---|---|
| 46 | . | Dot product fold |
| 54 | 6 | Indexed access / extract with fold |
| 58 | : | Comparison intrinsic fold |
| 67 | C | Type conversion intrinsic fold |
| 84 | T | Tensor / multi-operand intrinsic fold |
| 90 | Z | Zero-extend intrinsic fold |
| 91 | [ | Three-operand fold (e.g., fma) |
| 92 | \ | Four-operand fold (e.g., dp4a) |
| 96 | ` | Unary special intrinsic fold |

The FNeg case (tag 42) is the most complex. It first attempts constant folding: if the operand is all-ones (-1), it creates sub(0, operand) via CSE lookup with opcode 30. When the simple fold fails, it falls through to the Negator chain at LABEL_163: sub_1168D40 collects all negatable sub-expressions, sub_1169800 attempts to fold negation into each operand, and the results are combined with sub_929C50 or sub_929DE0. This pushes negation through chains of arithmetic to find a cheaper representation, depth-gated to prevent exponential blowup. Created replacement instructions carry .neg modifier metadata for PTX emission.

Three High-Opcode NVIDIA Intrinsics

Opcodes 0x254D (9549), 0x2551 (9553), and 0x255F (9567) are NVIDIA-proprietary intrinsic IDs handled directly in the main visitor. All three share the same pattern: extract the commuted-operand index via v1612->m128i_i32[1] & 0x7FFFFFF, verify the other operand has byte-tag 12 or 13 (ConstantInt/ConstantFP), query metadata via sub_10E0080 with mask 0xFFFFFFFFFFFFFFFF, and test specific bit patterns:

| Opcode | Test | Fold condition |
|---|---|---|
| 0x2551 (9553) | ((result >> 40) & 0x1E) == 0x10 | Fold when bit pattern mismatches |
| 0x255F (9567) | (result & 0x10) != 0 | Fold when bit 4 is clear |
| 0x254D (9549) | (result & 0x200) != 0 | Fold when bit 9 is clear |

When the filter passes, the shared epilogue calls sub_F207A0(v6, v1612->m128i_i64) (eraseInstFromFunction), deleting the instruction entirely. These implement dead-code elimination for NVIDIA intrinsics with constant arguments matching known-safe-to-remove criteria.

Separate Storage Assume Bundles

At lines 6557--6567 of the main visitor, the code iterates over operand bundles on llvm.assume calls (opcode 0x0B). For each bundle with a tag of exactly 16 bytes matching "separate_storage" (verified by memcmp), it calls sub_10EA360 on both bundle operands. This implements NVIDIA's separate_storage alias analysis hint, allowing InstCombine to exploit non-aliasing assumptions for pairs of pointers declared to reside in separate memory spaces.

Expanded GEP Handling

The GEP case (opcode 0x99 = 153) is significantly expanded compared to upstream. The global dword_4F901A8 controls a depth-limited chain walk for nested GEP simplification:

v729 = getOperand(0) of GEP
if (dword_4F901A8) {
    v730 = 0;
    do {
        if (!isConstantGEP(v729)) break;
        ++v730;
        v729 = getOperand(0, v729);  // walk up
    } while (v730 < dword_4F901A8);
}
if (*(_BYTE*)v729 != 85)  // 85 = CallInst
    goto LABEL_155;        // bail

This walks backward through constant-index GEP chains up to dword_4F901A8 steps, looking for a CallInst base pointer. The knob controls how many GEP levels to look through when simplifying GEP(GEP(GEP(..., call_result))).

Ternary/FMA Support

The preamble handles 3-operand instructions (opcodes 238--245) representing fused multiply-add variants. This includes checking whether the third operand is a zero-constant, converting between FMA opcode variants (238 vs. 242), and handling address space mismatches on FMA operand types -- entirely NVIDIA-specific for CUDA's FMA intrinsics.

computeKnownBits -- sub_11A7600

The 127 KB computeKnownBits implementation dispatches on the first byte of the NVVM IR node (the type tag):

| Tag | Char | Node type |
|---|---|---|
| 42 | * | Truncation (extracts low bits) |
| 44 | , | GEP (computes known bits through pointer arithmetic) |
| 46 | . | Comparison (known result bits) |
| 48 | 0 | Select (intersection of known bits from both arms) |
| 52 | 4 | Branch-related |
| 54 | 6 | Vector shuffle |
| 55 | 7 | Vector extract |
| 56 | 8 | Vector insert |
| 57 | 9 | PHI node (intersection across incoming values) |
| 58 | : | Comparison variant |
| 59 | ; | Invoke / call |
| 67 | C | Cast chain |
| 68 | D | Binary op path 1 |
| 69 | E | Binary op path 2 |
| 85 | U | CallInst (sub-dispatch: 0x0F=abs, 0x42=ctpop, 0x01=bitreverse) |
| 86 | V | LoadInst |

A debug assertion at lines 2204--2212 fires when computeKnownBits and SimplifyDemandedBits produce inconsistent results, printing both APInt values and calling abort(). This invariant check (known_zero & known_one == 0, plus consistency with the demanded mask) is compiled in for debug/checked builds.
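
The asserted invariant and the intersection rule for multi-input nodes (select, PHI) can be sketched in a simplified 8-bit model:

```python
# Sketch of the KnownBits invariant and the PHI/select intersection rule:
# a bit is known only if it is known identically in every incoming value,
# and known_zero & known_one must always be 0 (the abort() invariant).
def known_bits_intersect(values):
    kz = ko = ~0                   # start fully known, intersect downward
    for z, o in values:
        assert z & o == 0, "inconsistent KnownBits"
        kz &= z
        ko &= o
    return kz & 0xFF, ko & 0xFF    # clamp to 8 bits for the example

v1 = (0b11111110, 0b00000001)      # constant 0b0001, all 8 bits known
v2 = (0b11111100, 0b00000011)      # constant 0b0011, all 8 bits known
kz, ko = known_bits_intersect([v1, v2])
# Bit 0 stays known-one, bits 7..2 stay known-zero, bit 1 becomes unknown.
```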

SimplifyDemandedBits -- sub_11AE870

The wrapper sub_11AE870 gets the bit-width via sub_BCB060 (or sub_AE43A0 for non-integer types), allocates two APInts sized to the width, delegates to sub_11AE3E0, and frees any heap-allocated storage. The core implementation at sub_11AE3E0 (235 lines) calls computeKnownBits, then if the instruction was simplified, walks the use-chain and inserts each user into a hash table (open-addressing with quadratic probing, hash = (ptr >> 9) ^ (ptr >> 4)) at offset +2064 from the InstCombiner context. This "seen instructions" set prevents infinite recursion during demanded-bits propagation.
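
The probe structure can be sketched as follows (a simplified model; capacity and growth policy are assumptions, while the hash function is as recovered):

```python
# Sketch of the "seen instructions" set: open addressing with quadratic
# probing, hash(ptr) = (ptr >> 9) ^ (ptr >> 4), as recovered from
# sub_11AE3E0. insert() returns False when the pointer was already seen,
# which is what stops the demanded-bits recursion.
class SeenSet:
    def __init__(self, capacity=64):       # capacity: assumed power of two
        self.slots = [None] * capacity
        self.mask = capacity - 1

    def _hash(self, ptr):
        return ((ptr >> 9) ^ (ptr >> 4)) & self.mask

    def insert(self, ptr):
        i, step = self._hash(ptr), 1
        while self.slots[i] is not None:
            if self.slots[i] == ptr:
                return False               # already seen
            i = (i + step) & self.mask     # quadratic probe sequence
            step += 1
        self.slots[i] = ptr
        return True

s = SeenSet()
first = s.insert(0x11AE870)
second = s.insert(0x11AE870)
```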

Configuration Knobs

| Global | CLI flag | Default | Used in |
|---|---|---|---|
| dword_4F901A8 | (GEP chain look-through depth) | unknown | GEP handler (case 0x99) |
| qword_4F908A8 | instcombine-negator-max-depth | -1 | sub_1169C30 (depth gate) |
| qword_4F90988 | instcombine-negator-enabled | 1 | ctor_090 |
| qword_4F8B4C0 | instcombine-split-gep-chain | -- | ctor_068 |
| qword_4F8B340 | instcombine-canonicalize-geps-i8 | -- | ctor_068 |
| qword_4F909E0 | instcombine-max-num-phis | -- | ctor_091 |
| qword_4F90120 | instcombine-guard-widening-window | 3 | ctor_087 |
| qword_4F90528 | (load forwarding search depth) | -- | sub_1152CF0 |

Key Helper Functions

| Address | Recovered name | Purpose |
|---|---|---|
| sub_987FE0 | getOpcode() | Reads NVVM opcode from IR node |
| sub_B46B10 | getOperand(idx) | Operand access |
| sub_B44E20 | eraseFromParent() | Unlink instruction |
| sub_F207A0 | eraseInstFromFunction() | Delete instruction from worklist |
| sub_F162A0 | replaceInstUsesWith() | RAUW and return replacement |
| sub_F20660 | setOperand(i, val) | Replace operand in-place |
| sub_B33BC0 | CreateBinOp() | IRBuilder binary op creation |
| sub_B504D0 | CreateBinOp(no-flags) | Binary op without flags |
| sub_B51D30 | CreateCast() | Cast instruction creation |
| sub_AD8D80 | ConstantInt::get(type, APInt) | Constant integer factory |
| sub_AD64C0 | ConstantInt::get(type, val, signed) | Constant integer factory (scalar) |
| sub_BCB060 | getScalarSizeInBits() | Type bit-width query |
| sub_10E0080 | getKnownBitsProperty() | Metadata property query |
| sub_B43CB0 | getFunction() | Get parent function |
| sub_B43CA0 | getParent() | Get parent basic block |
| sub_10A0170 | extractFlags() | Read fast-math, exact, etc. |
| sub_B44900 | isCommutative() | Check commutativity |
| sub_C444A0 | APInt::countLeadingZeros() | Bit analysis |
| sub_986760 | APInt::isZero() | Zero test |
| sub_10EA360 | recordSeparateStorageOperand() | Separate storage alias hint |

Diagnostic Strings

Diagnostic strings recovered from the InstCombine binary region. InstCombine uses assertion-style diagnostics rather than optimization remarks; the computeKnownBits consistency check is the primary runtime diagnostic.

| String | Source | Category | Trigger |
|---|---|---|---|
| "computeKnownBits(): " | sub_904010 in sub_11A7600 line ~2204 | Assertion | Debug build: computeKnownBits and SimplifyDemandedBits produce inconsistent results (prints both APInt values, then calls abort()) |
| "SimplifyDemandedBits(): " | sub_904010 in sub_11A7600 line ~2212 | Assertion | Debug build: paired with the computeKnownBits() inconsistency diagnostic above |
| "separate_storage" | Main visitor lines 6557--6567 | Bundle tag | Matched via memcmp (16 bytes) on llvm.assume operand bundles; not a user-visible diagnostic |
| "instcombine-negator-max-depth" | ctor_090 at 0x4F908A8 | Knob | Knob registration (default -1, unlimited) |
| "instcombine-negator-enabled" | ctor_090 at 0x4F90988 | Knob | Knob registration (default 1, enabled) |
| "instcombine-split-gep-chain" | ctor_068 at 0x4F8B4C0 | Knob | Knob registration |
| "instcombine-canonicalize-geps-i8" | ctor_068 at 0x4F8B340 | Knob | Knob registration |
| "instcombine-max-num-phis" | ctor_091 at 0x4F909E0 | Knob | Knob registration |
| "instcombine-guard-widening-window" | ctor_087 at 0x4F90120 | Knob | Knob registration (default 3) |

InstCombine does not emit OptimizationRemark diagnostics. The only runtime-visible diagnostic is the debug assertion that fires when computeKnownBits and SimplifyDemandedBits produce inconsistent results (known_zero & known_one != 0, or results disagree with the demanded mask). This check is compiled into debug/checked builds only and calls abort() after printing both APInt values.
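The consistency check can be modeled in a few lines. This is an illustrative sketch, not the recovered routine: the struct and function names are invented, and only the contradiction condition (`known_zero & known_one != 0`, or the two analyses disagreeing) and the print-both-then-abort() behavior come from the text above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical 64-bit model of an APInt-backed KnownBits result.
struct KnownBits64 {
  uint64_t Zero;  // bits known to be 0
  uint64_t One;   // bits known to be 1
};

// Debug-build style check: a bit known both zero and one is a contradiction,
// and the two analyses must agree on both masks.
void checkConsistency(KnownBits64 fromKnownBits, KnownBits64 fromDemanded) {
  bool conflict = (fromKnownBits.Zero & fromKnownBits.One) != 0 ||
                  fromKnownBits.Zero != fromDemanded.Zero ||
                  fromKnownBits.One != fromDemanded.One;
  if (conflict) {
    // Mirrors the recovered behavior: print both results, then abort().
    std::fprintf(stderr, "computeKnownBits(): Zero=%llx One=%llx\n",
                 (unsigned long long)fromKnownBits.Zero,
                 (unsigned long long)fromKnownBits.One);
    std::fprintf(stderr, "SimplifyDemandedBits(): Zero=%llx One=%llx\n",
                 (unsigned long long)fromDemanded.Zero,
                 (unsigned long long)fromDemanded.One);
    std::abort();
  }
}
```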

Size Contribution Estimate

| Component | Size | Description |
|---|---|---|
| Upstream visitor baseline | ~200 KB | Standard LLVM visitor covering ~50 instruction types |
| sub_1169C30 intrinsic folding | ~87 KB | NVVM-specific intrinsic canonicalization |
| NVVM GEP/FMA/vector cases | ~40 KB | Expanded GEP chains, ternary FMA, vector width-changing |
| separate_storage + assume | ~10 KB | Operand bundle handling for alias hints |
| High-opcode NVIDIA intrinsics | ~15 KB | DCE for opcodes 0x254D/0x2551/0x255F |
| Expanded comparator/cast | ~50 KB | Extended ICmp, cast chain, select handling |
| NVIDIA total addition | ~200 KB | Roughly doubles upstream InstCombine |

Optimization Level Behavior

| Level | Scheduled | Instances | Notes |
|---|---|---|---|
| O0 | Not run | 0 | No optimization passes |
| Ofcmax | Runs | 1 | Single instance in fast-compile pipeline |
| Ofcmid | Runs | 2 | Early + post-GVN cleanup |
| O1 | Runs | 3-4 | Early, post-SROA, post-GVN, late cleanup |
| O2 | Runs | 4-5 | Same as O1 + additional Tier 2 instance after loop passes |
| O3 | Runs | 5-6 | Same as O2 + Tier 3 instance; benefits from more aggressive inlining/unrolling |

InstCombine is the most frequently scheduled pass in the CICC pipeline. Each instance runs the full 405KB visitor but benefits from different preceding transformations: the post-SROA instance cleans up cast chains from aggregate decomposition, the post-GVN instance simplifies expressions exposed by redundancy elimination, and the late instance performs final canonicalization before codegen. The instcombine-negator-max-depth and instcombine-negator-enabled knobs apply uniformly across all instances. Even at Ofcmax, at least one InstCombine run is considered essential for basic IR canonicalization. See Optimization Levels for pipeline tier details.

Differences from Upstream LLVM

| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Binary size | ~200 KB main visitor | ~405 KB main visitor + 87 KB intrinsic folding (~2x upstream) |
| NVVM intrinsic folding | No NVVM-specific intrinsic canonicalization | Dedicated 87 KB function (sub_1169C30) with two-layer dispatch for negation, vector extract/insert, FMA, tensor, dot product, and 15+ fold types |
| High-opcode DCE | Not present | Three NVIDIA proprietary intrinsic IDs (9549, 9553, 9567) with constant-argument dead-code elimination |
| separate_storage bundles | No separate_storage operand bundle handling | Iterates llvm.assume bundles, extracting "separate_storage" hints for alias-based optimization |
| Ternary FMA opcodes | Standard llvm.fma / llvm.fmuladd folding | Extended preamble handles opcodes 238--245 for CUDA FMA variants with address-space mismatch handling |
| GEP chain look-through | Single-level GEP simplification | Depth-limited chain walk (dword_4F901A8 steps) backward through constant-index GEP chains to find CallInst base pointers |
| Horizontal reduction | Standard intrinsic-based reduction fold | Four template-instantiated matchBinOpReduction helpers for NVVM horizontal reduction intrinsics (IDs 329, 330, 365, 366) |
| KnownBits integration | Separate computeKnownBits in ValueTracking | Fused 127 KB computeKnownBits + SimplifyDemandedBits with GPU special-register range oracle |

GVN (Global Value Numbering)

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/Transforms/Scalar/GVN.cpp, llvm/lib/Transforms/Scalar/NewGVN.cpp (LLVM 20.0.0)

CICC v13.0 ships two GVN implementations: the classic GVN pass at 0x1900BB0 (83 KB, ~2314 decompiled lines) and a NewGVN pass at 0x19F99A0 (68 KB, ~2460 decompiled lines). Both are derived from upstream LLVM but carry substantial NVIDIA modifications for GPU-specific value numbering, store splitting, and intrinsic-aware CSE. The knob constructor at ctor_201 (0x4E0990) registers eleven tunables that control PRE, store splitting, PHI removal, dominator caching, and recursion depth.

Key Facts

| Property | Value |
|---|---|
| Pass name (pipeline) | gvn (parameterized) |
| Registration | New PM #397, parameterized: no-pre;pre;no-load-pre;load-pre;... |
| Runtime positions | Tier 0 #5 (via sub_1C6E800); also appears at NewGVN/GVNHoist position #6; see Pipeline |
| Classic GVN entry | sub_1900BB0 (83 KB, 2,314 lines) |
| NewGVN entry | sub_19F99A0 (68 KB, 2,460 lines) |
| Knob constructor | ctor_201 at 0x4E0990 |
| Upstream source | llvm/lib/Transforms/Scalar/GVN.cpp, NewGVN.cpp (LLVM 20.0.0) |

Knob Inventory

Knobs are registered in ctor_201 at 0x4E0990. Bool knobs use cl::opt<bool> (vtable 0x49EEC70); int knobs use cl::opt<int> (vtable 0x49EEB70). The store-split limit knobs route through a custom NVIDIA registrar at sub_190BE40 that accepts an int** default initializer.

| Knob | Type | Default | Global Address | Purpose |
|---|---|---|---|---|
| enable-pre | bool | true | 0x4FAEEE0 | Enable Partial Redundancy Elimination |
| enable-load-pre | bool | true | 0x4FAEE00 | Enable load PRE (load sinking across edges) |
| enable-split-backedge-in-load-pre | bool | false | 0x4FAED20 | Allow splitting backedges during load PRE |
| enable-phi-remove | int | 2 | 0x4FAEC40 | PHI removal aggressiveness (0=off, 2=aggressive) |
| dump-phi-remove | int | 0 | 0x4FAEB60 | Dump PHI removal decisions (debug) |
| no-split-stores-below | int | -1 | 0x4FAEA80 | Minimum store width in bits for splitting (-1 = no limit) |
| no-split-stores-above | int | -1 | 0x4FAE9A0 | Maximum store width in bits for splitting (-1 = no limit) |
| split-stores | bool | true | 0x4FAE8C0 | Master enable for store splitting |
| profusegvn | bool | true | 0x4FAE7E0 | Verbose diagnostics via NVIDIA profuse framework |
| gvn-dom-cache | bool | true | 0x4FAE700 | Cache dominator tree query results (cache size 32) |
| max-recurse-depth | int | 1000 | 0x4FAE620 | Maximum recursion depth during simplification |

IR Before/After Example

GVN eliminates redundant computations and forwards store values to loads. The following shows a common GPU pattern: a redundant load eliminated via value numbering, and a store-to-load forward.

Before:

define void @f(ptr addrspace(1) %p, ptr addrspace(1) %q) {
  %a = load float, ptr addrspace(1) %p, align 4
  %b = fmul float %a, 2.0
  %c = load float, ptr addrspace(1) %p, align 4        ; redundant load (same %p, no intervening store)
  %d = fadd float %b, %c
  store float 42.0, ptr addrspace(1) %q, align 4
  %e = load float, ptr addrspace(1) %q, align 4        ; load from location just stored to
  ret void
}

After:

define void @f(ptr addrspace(1) %p, ptr addrspace(1) %q) {
  %a = load float, ptr addrspace(1) %p, align 4
  %b = fmul float %a, 2.0
  ; %c eliminated -- replaced with %a (same value number)
  %d = fadd float %b, %a
  store float 42.0, ptr addrspace(1) %q, align 4
  ; %e eliminated -- forwarded from store (value 42.0)
  ret void
}

The second load from %p is eliminated because GVN assigns it the same value number as %a. The load from %q after the store is forwarded directly from the stored constant. On GPU, eliminating memory loads is especially valuable because each avoided ld.global saves hundreds of cycles of memory latency.

Classic GVN Algorithm

The main entry point is GVN::runOnFunction at sub_1900BB0. The pass object is approximately 600 bytes and carries four scoped hash tables plus a dominator tree reference.

Pass Object Layout

| Offset | Field | Purpose |
|---|---|---|
| +0 | vtable | Pass vtable pointer |
| +16 | Function* | Current function being processed |
| +72 | MemoryDependenceResults* | MemDep analysis handle |
| +88 | DominatorTree* | Dominator tree |
| +240 | LeaderTable | Hash: value number to canonical leader |
| +392 | StoreExprTable | Hash: store expressions |
| +544 | LoadExprTable | Hash: load expressions |
| +592 | RPO counter | Current block's RPO number |

Complexity

Let N = number of instructions, B = number of basic blocks, and D = depth of the dominator tree.

  • RPO walk: the classic GVN traversal visits every instruction exactly once -- O(N). Each instruction is hashed (O(1) amortized via the scoped hash tables) and looked up in the leader table (O(1) amortized).
  • Memory dependence: getDependency queries are O(D) per load in the worst case, cached by MemDep to amortize across the function.
  • PRE insertion: adds at most O(N) new instructions.
  • Store splitting: bounded by the number of stores times the split factor (controlled by no-split-stores-below/above).
  • Dominance queries: the gvn-dom-cache (size 32) converts repeated queries from O(D) to O(1).
  • PHI removal: replaceAndErase is O(U) per replaced value, where U = number of uses.

Overall: O(N * D) in the worst case due to dominance queries; O(N) in practice with the dominator cache enabled (default). NewGVN's partition-based algorithm is O(N * alpha(N)) amortized, where alpha is the inverse Ackermann function from union-find, though the fixpoint iteration can degrade to O(N^2) on pathological inputs.

Traversal Strategy

The pass walks the dominator tree in reverse post-order using an explicit segmented stack rather than recursion. The initial allocation is an 8-slot array of segment pointers (sub_22077B0(64)), each segment holding 64 pointers (512 bytes). The stack grows by allocating new segments and shrinks by freeing segments when popping past a boundary.

Each dominator tree node is a 136-byte structure (sub_22077B0(136)) containing RPO in/out numbers, basic block pointer, child pointers, scope chain links for all four hash tables, an undo list for backtracking, and a visited flag at offset +128.
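The segmented stack can be sketched as follows. This is an illustrative model under the sizes stated above (64-pointer segments, 8 initial segment slots); the class and member names are invented, not recovered symbols.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the explicit segmented work stack: segments of 64
// pointers (512 bytes each), allocated on demand and freed when the stack
// pops back past a segment boundary.
class SegmentedStack {
  static constexpr size_t kSegSize = 64;
  std::vector<void**> segments_;  // array of segment pointers
  size_t top_ = 0;                // global index of the next free slot

public:
  SegmentedStack() { segments_.reserve(8); }  // 8 initial segment slots
  ~SegmentedStack() {
    for (void** seg : segments_) delete[] seg;
  }

  void push(void* node) {
    size_t seg = top_ / kSegSize;
    if (seg == segments_.size())
      segments_.push_back(new void*[kSegSize]);  // grow by one segment
    segments_[seg][top_ % kSegSize] = node;
    ++top_;
  }

  void* pop() {
    --top_;
    void* node = segments_[top_ / kSegSize][top_ % kSegSize];
    // Shrink: free trailing segments no longer covered by the stack.
    size_t needed = (top_ + kSegSize - 1) / kSegSize;
    while (segments_.size() > needed) {
      delete[] segments_.back();
      segments_.pop_back();
    }
    return node;
  }

  bool empty() const { return top_ == 0; }
};
```

Compared to recursion over the dominator tree, this keeps the worst-case stack depth off the native call stack, which matters for the deeply nested CFGs of heavily unrolled GPU kernels.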

Main Processing Loop

For each dominator tree node popped from the stack, the pass:

  1. Sets the RPO number from the node's RPO_in field.
  2. Skips already-visited nodes (checked via the byte at offset +128).
  3. Iterates every instruction in the basic block.
  4. Attempts SimplifyInstruction (sub_1AE9990) first; if it succeeds, replaces all uses and erases via sub_19003A0.
  5. Dispatches on the instruction opcode byte at offset +16:
    • Case 4 (call/intrinsic): Classifies purity via bitmask 0x1F133FFE23FFFF, checks volatility through sub_1560260 (flag 36), looks up in the LeaderTable via sub_18FDEE0 (hash) + sub_18FB980 (compare). Inserts new leaders via sub_18FEF10.
    • Case 79 (load): Queries memory dependence, checks four NVIDIA intrinsic IDs for special pointer extraction, then attempts store-to-load forwarding or PRE.
    • Case 114 (store): Inserts into the StoreExprTable using a 5-element hash key (opcode, type, pointer, value, alignment) via sub_18FEB70 / sub_18FFC60.
    • Default: General expression numbering through sub_13E3350, with sub-dispatch for branches (opcode 57), loads (54/55), and call-like instructions (78).

NVIDIA Intrinsic-Aware Value Numbering

The classic GVN recognizes four NVIDIA-specific LLVM intrinsic IDs and extracts their pointer operands with non-standard indices:

| Intrinsic ID | Name | Pointer Operand Index | Semantics |
|---|---|---|---|
| 4057 | llvm.nvvm.ldu | 1 - numOperands | Load from uniform memory; aggressively CSE-able |
| 4085 | llvm.nvvm.ldg | 1 - numOperands | Load via texture/global cache; CSE if same address |
| 4492 | (NVIDIA-specific) | 2 - numOperands | Variant load with 2-operand pointer extraction |
| 4503 | (NVIDIA-specific) | 2 - numOperands | Variant load with 2-operand pointer extraction |

These intrinsics bypass the standard volatility checks and use custom operand extraction, allowing CSE of texture and surface loads that upstream LLVM GVN would not touch.

Scoped Hash Tables

GVN maintains four ScopedHashTable instances, pushed on dominator tree entry and popped on exit. The scope teardown at lines 1858-2101 restores the LoadExprTable via the undo list at offset +120, restores the StoreExprTable via the undo list at offset +72, frees the MemDepTable scope through sub_18FE3A0, and deallocates the 136-byte dom node.

The hash function (sub_18FDEE0, approximately 140 lines) is NVIDIA-modified. For binary ops (opcodes 35-52), it hashes the opcode and operand pointers with canonicalization (smaller pointer first for commutative operations). For comparisons, it includes the predicate. For GEPs (opcodes 86/87), it hashes the entire index sequence via sub_1597510. Hash mixing uses the formula (ptr >> 9) ^ (ptr >> 4) with XOR combining. The 5-element store expression variant (sub_18FEB70) computes:

hash = (v12>>9)^(v12>>4) ^ (v11>>9)^(v11>>4) ^ (v10>>9)^(v10>>4) ^ (37*v13) ^ (v9>>9)^(v9>>4)
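The 5-element store hash above transcribes directly into code. The `v9..v13` names are the decompiler's temporaries (pointer-like values mixed as `(v >> 9) ^ (v >> 4)`, the scalar `v13` mixed as `37 * v13`); the function name here is invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Pointer mixing as recovered from sub_18FDEE0 / sub_18FEB70.
static uint64_t mix(uint64_t v) { return (v >> 9) ^ (v >> 4); }

// The recovered 5-element store-expression hash, term for term:
// hash = mix(v12) ^ mix(v11) ^ mix(v10) ^ (37*v13) ^ mix(v9)
uint64_t storeExprHash(uint64_t v12, uint64_t v11, uint64_t v10,
                       uint64_t v13, uint64_t v9) {
  return mix(v12) ^ mix(v11) ^ mix(v10) ^ (37 * v13) ^ mix(v9);
}
```

Because the pointer terms are combined with XOR, the hash is symmetric in the pointer-like slots; equality is decided by the separate compare routine (sub_18FB980), not by the hash alone.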

Store Splitting

Three knobs control this NVIDIA-specific extension: split-stores (master enable), no-split-stores-below and no-split-stores-above (bit-width bounds, both default -1 meaning unlimited). The custom registrar at sub_190BE40 handles the limit knobs.

When GVN discovers a store that partially overlaps with a load, it attempts to split the store into sub-stores that individually satisfy dependence constraints. This is critical for GPU code where vector stores (float4, int4) partially overlap with subsequent scalar loads, texture/surface stores have alignment constraints, and shared memory bank conflicts may favor different store granularities.

The function sub_18FECC0 classifies store expressions by instruction type: store (54), atomic store (55), shufflevector (58), extractelement (59), and insertelement (82). The shufflevector/extract/insert handling reflects NVIDIA's lowering of vector operations into intermediate forms before GVN runs.
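The splitting decision can be sketched as follows. This is a hypothetical model, not recovered code: it shows only how the three knobs (split-stores master enable, no-split-stores-below/above bit-width bounds with -1 meaning unbounded) would gate splitting a wide store into sub-stores at the granularity of an overlapping load; all names are invented.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct SubStore {
  uint64_t offsetBytes;
  uint64_t sizeBits;
};

// Hypothetical sketch: split a storeBits-wide store into loadBits-wide pieces
// if the master enable is on and the store width passes both bounds.
std::vector<SubStore> splitStore(uint64_t storeBits, uint64_t loadBits,
                                 int64_t belowLimit, int64_t aboveLimit,
                                 bool splitStoresEnabled) {
  std::vector<SubStore> out;
  bool widthOk =
      (belowLimit < 0 || (int64_t)storeBits >= belowLimit) &&   // -1 = no limit
      (aboveLimit < 0 || (int64_t)storeBits <= aboveLimit);
  if (!splitStoresEnabled || !widthOk || loadBits == 0 ||
      storeBits % loadBits != 0) {
    out.push_back({0, storeBits});  // keep the store whole
    return out;
  }
  for (uint64_t off = 0; off < storeBits; off += loadBits)
    out.push_back({off / 8, loadBits});
  return out;
}
```

Under this model, a 128-bit float4 store overlapped by 32-bit scalar loads splits into four float-sized sub-stores, each of which can then individually satisfy a load's dependence.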

Dominator Cache

The gvn-dom-cache knob (default true, cache size 32) addresses a known performance bottleneck. GVN's dominance queries are O(n * depth) and can become expensive on deeply nested GPU kernels with many divergent branches. The cache stores recent dominates(A, B) results keyed by basic block pointer, converting repeated queries to O(1). The working set size of 32 was chosen empirically: GPU kernels typically have moderate dominator tree depth because shared memory parallelism keeps CFGs relatively flat.
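A fixed-size dominance cache in the spirit of gvn-dom-cache can be sketched as below. This is an assumption-laden illustration: the 32-entry size and the dominates(A, B) keying come from the text, while the round-robin eviction policy and all names are invented (the real policy is not recovered).

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical 32-entry cache for dominates(A, B) results, keyed by the
// basic-block pointer pair.
class DomCache {
  struct Entry {
    uintptr_t a = 0, b = 0;
    bool result = false;
    bool valid = false;
  };
  std::array<Entry, 32> entries_;
  size_t next_ = 0;  // round-robin eviction cursor (illustrative choice)

public:
  // `slowQuery` stands in for the O(depth) dominator-tree walk.
  bool dominates(uintptr_t a, uintptr_t b,
                 const std::function<bool(uintptr_t, uintptr_t)>& slowQuery) {
    for (const Entry& e : entries_)
      if (e.valid && e.a == a && e.b == b) return e.result;  // O(1) hit
    bool r = slowQuery(a, b);
    entries_[next_] = {a, b, r, true};
    next_ = (next_ + 1) % entries_.size();
    return r;
  }
};
```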

PHI Removal

After GVN identifies equivalent values, some PHI nodes become trivial. The enable-phi-remove knob controls aggressiveness: level 0 disables removal, level 1 removes only trivially redundant PHIs, and level 2 (default) removes PHIs that become trivial after leader substitution.

The core replaceAndErase routine (sub_19003A0, 11 KB) iterates all uses of a replaced value, checks each PHI-node use for trivial foldability using a SmallDenseSet (opcode 23), and employs a 4-way unrolled loop (lines 301-317) for use scanning. This micro-optimization targets the common case of PHIs with many incoming edges after switch lowering or loop unrolling.
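The 4-way unrolled use-scanning loop has the following shape. This is a generic illustration of the unrolling pattern, not the recovered routine: the predicate and names are invented, and only the group-of-four body plus scalar tail reflects what was observed at lines 301-317.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical use scan: process uses four at a time, then a scalar tail.
// The unrolled body reduces loop-control overhead when a PHI has many
// incoming edges (e.g. after switch lowering or loop unrolling).
size_t countMatchingUses(const std::vector<int>& uses, int target) {
  size_t n = uses.size(), i = 0, count = 0;
  for (; i + 4 <= n; i += 4) {  // unrolled body: four uses per iteration
    count += (uses[i] == target);
    count += (uses[i + 1] == target);
    count += (uses[i + 2] == target);
    count += (uses[i + 3] == target);
  }
  for (; i < n; ++i)            // scalar tail for the remainder
    count += (uses[i] == target);
  return count;
}
```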

NewGVN

The NewGVN implementation at sub_19F99A0 (68 KB) uses congruence classes instead of simple leader tables, following the partition-based algorithm from Karthik Gargi (2002). The pass object stores a congruence class hash table at offset +1400 with count, bucket array, entry count, tombstone count, and bucket count fields.

The algorithm:

  1. Builds initial partitions from the RPO-ordered instruction list.
  2. For each worklist instruction, queries the current congruence class and computes the new value expression.
  3. If the expression maps to a different class, splits the partition.
  4. Repeats until fixpoint (no more splits).

Hash table growth is handled by sub_19F5120; insert-or-find by sub_19E6B80. Congruence class members are sorted (sub_19F5A00 + sub_19F6B20) for efficient merge operations.
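The fixpoint skeleton of the congruence-class algorithm can be sketched as follows. This is a minimal model, not recovered code: values are (opcode, operand, operand) triples, operands index earlier values, and all names are invented. It starts optimistically with everything congruent and refines until no value changes class -- the partition shape of the algorithm, without NewGVN's full expression language, memory states, or worklist ordering.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <vector>

struct Val {
  std::string op;  // opcode name
  int lhs, rhs;    // operand indices into the value list, -1 = none
};

// A value's class is determined by its opcode plus the classes of its
// operands; recompute until the partition stops changing (fixpoint).
std::vector<int> congruenceClasses(const std::vector<Val>& vals) {
  std::vector<int> cls(vals.size(), 0);  // optimistic: all congruent
  bool changed = true;
  while (changed) {
    changed = false;
    std::map<std::tuple<std::string, int, int>, int> next;
    std::vector<int> newCls(vals.size());
    for (size_t i = 0; i < vals.size(); ++i) {
      auto key = std::make_tuple(vals[i].op,
                                 vals[i].lhs < 0 ? -1 : cls[vals[i].lhs],
                                 vals[i].rhs < 0 ? -1 : cls[vals[i].rhs]);
      auto it = next.find(key);
      if (it == next.end())
        it = next.emplace(key, (int)next.size()).first;  // new class id
      newCls[i] = it->second;
      if (newCls[i] != cls[i]) changed = true;
    }
    cls = newCls;
  }
  return cls;
}
```

Starting from the coarsest partition, each pass only refines the previous one, so the iteration terminates; values left in the same class at fixpoint compute the same value and can share a leader.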

Memory Dependence Integration

GVN interacts with MemoryDependenceResults at offset +72 through three key functions:

| Function | Address | Role |
|---|---|---|
| getDependency | sub_1422850 | Returns the memory instruction this load depends on |
| getDominatorTree | sub_1423BA0 | Extracts the DomTree from MemDep for dominance queries |
| properlyDominates | sub_1428550 | Tests strict dominance through the MemDep tree |

The replacement safety check (sub_18FBB40) returns true immediately when RPO numbers match, and otherwise chains through getDependency -> getIDom -> dominates().

Profuse Diagnostics

The profusegvn knob (default true) enables verbose output through NVIDIA's custom profuse diagnostic framework, not the standard LLVM OptimizationRemark system. When active, diagnostics are emitted at value replacement decisions, store/load expression matches, and PRE insertion decisions. The framework is likely controlled by environment variables such as CICC_PROFUSE_DIAGNOSTICS.

Key Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| GVN::runOnFunction | 0x1900BB0 | 83 KB | Main classic GVN pass |
| replaceAndErase | 0x19003A0 | 11 KB | Replace uses + erase instruction |
| NewGVN::run | 0x19F99A0 | 68 KB | NewGVN algorithm |
| ctor_201 | 0x4E0990 | 9 KB | GVN knob registration |
| hashExpression | 0x18FDEE0 | ~5 KB | Expression hash function |
| compareExpression | 0x18FB980 | ~2 KB | Expression equality test |
| lookupExpr5 | 0x18FEB70 | ~3 KB | 5-key store expression lookup |
| insertExpr5 | 0x18FFC60 | ~3 KB | 5-key insert with scoped undo |
| insertLeader | 0x18FEF10 | ~5 KB | Leader table insert |
| checkStoreSplit | 0x18FECC0 | ~3 KB | Store expression classification for splitting |
| canReplace | 0x18FBB40 | <1 KB | Dominance-based replacement check |
| preAvailCheck | 0x18FC460 | ~3 KB | PRE availability analysis |
| performPRE | 0x18FF290 | 10 KB | PRE insertion |
| largeGVNHelper | 0x18F6D00 | 60 KB | PRE / load forwarding helper |
| phiGVNHelper | 0x18FAA90 | 20 KB | PHI-related GVN helper |
| storeSplitHelper | 0x1906720 | 26 KB | Store splitting implementation |
| storeSplitVisit | 0x1905CD0 | 16 KB | Store-split worklist visitor |
| postGVNCleanup | 0x1908A00 | 10 KB | Post-GVN cleanup |
| gvnFinalCleanup | 0x190C3B0 | 8 KB | Final cleanup after GVN |

Expression Classification Bitmask

The bitmask 0x1F133FFE23FFFF classifies opcodes that are safe for value numbering (pure, side-effect-free). It appears eight times in the main function. Bit positions correspond to (opcode - 35), covering standard arithmetic, logical, comparison, and cast operations, plus NVIDIA-specific opcodes in the extended range.
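The purity test reduces to a single bit probe, reproduced here from the constant and the (opcode - 35) bit mapping stated above; only the function name is invented.

```cpp
#include <cassert>
#include <cstdint>

// Bit (opcode - 35) of 0x1F133FFE23FFFF is set when the opcode is pure
// (side-effect-free) and therefore safe to value number.
bool isPureOpcode(unsigned opcode) {
  const uint64_t kPureMask = 0x1F133FFE23FFFFULL;
  if (opcode < 35 || opcode - 35 >= 64) return false;  // outside the window
  return (kPureMask >> (opcode - 35)) & 1;
}
```

As a consistency check against the hash-function section: bits 51 and 52 of the mask are set, which correspond to opcodes 86 and 87 -- the GEP opcodes that the expression hash handles specially.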

Multi-Pass Data Flow: SROA / InstCombine / GVN / DSE

These four passes form the core scalar optimization chain in CICC's mid-pipeline. They execute in sequence (often multiple times through the pipeline), with each pass producing IR transformations that create opportunities for the next. The following diagram traces data flow through a single iteration of the chain, showing what each pass produces and what the next pass consumes.

 SROA (Scalar Replacement of Aggregates)
 ========================================
 Input:  IR with aggregate alloca instructions (structs, arrays)
         Example: %s = alloca %struct.float4   -->  lives in .local memory (AS 5)

 +--------------------------------------------------------------+
 | Phase 1: Slice analysis                                      |
 |   Walk all uses of each alloca, build byte-range slices      |
 |   Group non-overlapping slices into partitions               |
 |                                                              |
 | Phase 2: Partition splitting                                 |
 |   Replace each partition with a scalar alloca or SSA value   |
 |   Insert extractvalue/insertvalue for partial accesses       |
 |   Defer trivially-promotable allocas to mem2reg              |
 |                                                              |
 | Produces:                                                    |
 |   - Scalar SSA values replacing aggregate members            |
 |   - Inserted bitcasts, trunc, zext for type mismatches       |
 |   - Dead aggregate allocas (erased)                          |
 |   - GEP chains pointing at sub-fields (now redundant)        |
 +------------------------------+-------------------------------+
                                |
                                | Scalar SSA values with redundant
                                | casts, dead GEPs, identity ops
                                v
 InstCombine (Instruction Combining)
 ========================================
 Input:  Post-SROA IR with redundant instructions

 +--------------------------------------------------------------+
 | 405KB visitor dispatches across 80 opcode cases:             |
 |                                                              |
 | Consumes from SROA:                                          |
 |   - Redundant bitcasts from type-punned accesses             |
 |   - trunc(zext(x)) chains from width mismatches              |
 |   - Dead GEP arithmetic (base + 0)                           |
 |   - Identity selects from conditional stores                 |
 |                                                              |
 | Canonicalization:                                            |
 |   - Constant folding (sub_101E960)                           |
 |   - Algebraic identities: x+0, x*1, x&-1 (sub_F0F270)      |
 |   - Strength reduction: x*2^n -> x<<n (sub_10BA120)         |
 |   - Cast chain collapse: trunc(zext(x)) -> x or smaller     |
 |   - NVIDIA intrinsic folding (sub_1169C30, 87KB)             |
 |   - computeKnownBits propagation (sub_11A7600, 127KB)        |
 |                                                              |
 | Produces:                                                    |
 |   - Canonical instruction forms (const on RHS, etc.)         |
 |   - Simplified expressions (fewer instructions)              |
 |   - Known-bits metadata on values                            |
 |   - Opportunities for value numbering (same expression       |
 |     in different blocks now looks identical)                  |
 +------------------------------+-------------------------------+
                                |
                                | Canonical IR with duplicate
                                | expressions across blocks
                                v
 GVN (Global Value Numbering)
 ========================================
 Input:  Canonicalized IR from InstCombine

 +--------------------------------------------------------------+
 | Traverses dominator tree in RPO with scoped hash tables:     |
 |                                                              |
 | Consumes from InstCombine:                                   |
 |   - Canonical expression forms (enables hash-table matching) |
 |   - Known-bits info (used in SimplifyInstruction)            |
 |   - Folded NVIDIA intrinsics (enables ldu/ldg CSE)           |
 |                                                              |
 | Value numbering:                                             |
 |   - Hash expression: (opcode, type, operands) -> leader      |
 |   - Scoped tables: LeaderTable, StoreExprTable, LoadExprTable|
 |   - NVIDIA ldu/ldg CSE (intrinsics 4057, 4085, 4492, 4503)  |
 |                                                              |
 | Load forwarding:                                             |
 |   - Query MemoryDependenceResults for store->load forwarding |
 |   - Store splitting: float4 store -> scalar float load       |
 |     (NVIDIA extension, controlled by split-stores knob)      |
 |                                                              |
 | PRE (Partial Redundancy Elimination):                        |
 |   - Insert computations at merge points to enable CSE        |
 |   - Load PRE across edges (enable-load-pre)                  |
 |                                                              |
 | Consumes from alias analysis:                                |
 |   - MemoryDependence results (which store feeds which load?) |
 |   - NVVM AA NoAlias answers for cross-address-space pairs    |
 |                                                              |
 | Produces:                                                    |
 |   - Eliminated redundant computations (replaced with leader) |
 |   - Forwarded loads (replaced with stored value)             |
 |   - Trivial PHIs (from leader substitution)                  |
 |   - Dead stores exposed (stored value is never loaded)       |
 +------------------------------+-------------------------------+
                                |
                                | IR with eliminated redundancies,
                                | forwarded loads, exposed dead stores
                                v
 DSE (Dead Store Elimination)
 ========================================
 Input:  Post-GVN IR with dead stores exposed

 +--------------------------------------------------------------+
 | 91KB across three major functions:                           |
 |                                                              |
 | Consumes from GVN:                                           |
 |   - Stores whose values were forwarded to loads (now dead)   |
 |   - Stores to locations that GVN proved are overwritten      |
 |   - Simplified store patterns from PRE insertion             |
 |                                                              |
 | Consumes from alias analysis:                                |
 |   - MemorySSA graph (which stores are visible to which loads)|
 |   - NVVM AA NoAlias (cross-space stores never conflict)      |
 |   - TBAA metadata (type-based aliasing for struct fields)    |
 |                                                              |
 | Dead store detection:                                        |
 |   - Complete overwrite: later store covers same location     |
 |   - Partial overwrite: float4 store then float4 store with   |
 |     overlapping range (72-byte hash table tracking)          |
 |   - Store chain decomposition: aggregate stores decomposed   |
 |     via GEP into element-level dead-store checks             |
 |                                                              |
 | NVIDIA extensions:                                           |
 |   - Partial store forwarding with type conversion            |
 |     (float4 -> float via GEP + load extraction)              |
 |   - Cross-store 6-element dependency records                 |
 |   - CUDA vector type-aware size computation                  |
 |                                                              |
 | Produces:                                                    |
 |   - Eliminated dead stores (fewer memory writes)             |
 |   - Replacement loads for partial forwards                   |
 |   - Reduced memory traffic (critical for GPU bandwidth)      |
 +--------------------------------------------------------------+

Cross-pass data dependency table:

| Pass | Consumes from predecessor | Produces for successor |
|---|---|---|
| SROA | Aggregate allocas from frontend/inliner | Scalar SSA values, redundant casts/GEPs |
| InstCombine | Redundant casts, identity ops from SROA | Canonical expressions, known-bits metadata |
| GVN | Canonical forms from InstCombine, MemDep/AA results | Forwarded loads, eliminated redundancies, exposed dead stores |
| DSE | Dead stores exposed by GVN, MemorySSA/AA results | Eliminated stores, reduced memory traffic |

Why this ordering matters for GPU code: SROA is essential because un-promoted allocas become .local memory (200-400 cycle penalty). InstCombine must run before GVN because GVN's hash-table matching requires canonical expression forms -- without InstCombine, (a + 0) and a would hash differently and miss the CSE opportunity. GVN must run before DSE because GVN's load forwarding is what exposes dead stores: once GVN proves that a load reads a value already available as an SSA register, the store that was keeping that value alive becomes dead. DSE then removes it, reducing the memory write traffic that is the primary bandwidth bottleneck on GPU architectures.
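The "canonicalize before value numbering" argument can be made concrete with a toy example. This is an illustration, not CICC code: expressions are modeled as strings, and a stand-in fold for the x + 0 identity shows why the uncanonicalized forms would be distinct hash keys.

```cpp
#include <cassert>
#include <string>

// Stand-in for InstCombine's x + 0 -> x identity fold: a hash-based value
// numbering table keys on the expression, so "a + 0" and "a" only match
// after canonicalization makes them the same key.
std::string canonicalize(const std::string& expr) {
  if (expr.size() > 4 && expr.compare(expr.size() - 4, 4, " + 0") == 0)
    return expr.substr(0, expr.size() - 4);
  return expr;
}
```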

Optimization Level Behavior

| Level | Classic GVN | NewGVN | PRE | Store Splitting |
|---|---|---|---|---|
| O0 | Not run | Not run | N/A | N/A |
| Ofcmax | Not run | Not run | N/A | N/A |
| Ofcmid | Runs (1 instance) | Not run | Enabled (enable-pre=true) | Enabled (split-stores=true) |
| O1 | Runs (1-2 instances in Tier 0/1) | Not run | Enabled | Enabled |
| O2 | Runs (2-3 instances across Tier 0/1/2) | Not run | Enabled | Enabled |
| O3 | Runs (2-3 instances, most aggressive inlining exposes more CSE) | Not run | Enabled | Enabled |

GVN is a core mid-pipeline pass that runs at O1 and above. It appears multiple times in the pipeline -- typically once after CGSCC inlining and once in the late scalar cleanup. Each instance benefits from different preceding transformations (inlining, SROA, InstCombine). NewGVN is compiled into the binary but not scheduled in any standard pipeline tier. The enable-pre and enable-load-pre knobs are both true by default across all levels. See Optimization Levels for the complete tier structure.

Differences from Upstream LLVM

| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Store splitting | Not present; GVN handles stores only for forwarding | Three knobs (split-stores, no-split-stores-below, no-split-stores-above) enable splitting wide vector stores into sub-stores matching load granularity |
| NVIDIA intrinsic CSE | No awareness of nvvm.ldu, nvvm.ldg | Four NVIDIA intrinsic IDs (4057, 4085, 4492, 4503) with custom pointer operand extraction, enabling CSE of texture/global cache loads |
| Dominator cache | No caching; dominance queries are O(n * depth) | gvn-dom-cache (default true, size 32) caches recent dominates(A, B) results for O(1) repeated queries |
| PHI removal aggressiveness | Basic trivial PHI cleanup | Three-level enable-phi-remove knob (0=off, 1=trivial, 2=aggressive); 4-way unrolled use-scanning loop for PHI-heavy IR |
| Knob count | ~4 knobs (enable-pre, enable-load-pre, enable-split-backedge-in-load-pre, max-recurse-depth) | 11 knobs including store splitting limits, dominator caching, profuse diagnostics, and PHI removal depth |
| Diagnostic framework | Standard OptimizationRemark system | profusegvn knob (default true) uses NVIDIA's custom profuse diagnostic framework, not LLVM's ORE |
| NewGVN | Standard partition-based NewGVN | Same algorithm, ships alongside classic GVN at separate address; both carry NVIDIA modifications |

Diagnostic Strings

All diagnostic strings recovered from the binary. GVN uses NVIDIA's custom profuse diagnostic framework rather than LLVM's OptimizationRemark system.

| String | Source | Category | Trigger |
|---|---|---|---|
| "profuse for GVN" | 0x4FAE7E0 (profusegvn knob description) | Knob | Knob registration |
| "enable caching of dom tree nodes" | 0x4FAE700 (gvn-dom-cache knob description) | Knob | Knob registration |
| "Max recurse depth (default = 1000)" | 0x4FAE620 (max-recurse-depth knob description) | Knob | Knob registration |
| (profuse GVN diagnostic output) | sub_1909530 (~5 KB) | Debug | profusegvn knob enabled (default true); emits at value replacement, store/load match, and PRE insertion decisions |
| (PHI removal diagnostic output) | sub_19003A0 region | Debug | dump-phi-remove > 0; dumps which PHI nodes are being removed and why |

The profusegvn framework follows the same pattern as profuseinline -- it is a custom NVIDIA diagnostic channel likely controlled by environment variables such as CICC_PROFUSE_DIAGNOSTICS, not the standard LLVM OptimizationRemark / ORE system. The dump-phi-remove knob (default 0) separately enables diagnostic output during PHI removal.

Allocation Strategy

The 136-byte domtree nodes and 48-byte expression entries are allocated through sub_145CBF0 (BumpPtrAllocator) and sub_22077B0 (malloc wrapper). This allocation strategy keeps per-node overhead low for the potentially large number of expressions produced by heavily unrolled GPU kernels.

Test This

The following kernel contains redundant loads from the same global address. GVN should eliminate the second load by recognizing it has the same value number as the first.

__global__ void gvn_test(const float* __restrict__ in, float* __restrict__ out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float a = in[tid];        // first load
    float b = a * 2.0f;
    float c = in[tid];        // redundant -- same address, no intervening store
    float d = c * 3.0f;

    out[tid] = b + d;
}

What to look for in PTX:

  • Only one ld.global.f32 instruction for in[tid], not two. GVN assigns the same value number to both loads (same pointer, no intervening aliasing store thanks to __restrict__) and replaces the second with the first's result.
  • The arithmetic should reduce to something equivalent to in[tid] * 5.0f. After GVN eliminates the redundant load, InstCombine or the backend may simplify a*2 + a*3 into a*5.
  • Remove __restrict__ and add an intervening store (out[tid] = b; between the two loads). Without __restrict__, GVN cannot prove the second load is redundant (the store to out might alias in), so both ld.global.f32 instructions survive. This demonstrates how alias analysis feeds GVN.
  • For store-to-load forwarding: insert out[tid] = 42.0f; followed by float e = out[tid];. GVN should replace the load with the constant 42.0f -- no ld.global emitted for e.

JumpThreading

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

LLVM version note: Based on LLVM 20.0.0 JumpThreading.cpp. Evidence: DFA JumpThreading variant (dfa-jump-threading) present as a separate pass matches LLVM 14+; early-exit-heuristic knob matches LLVM 16+. Core algorithm is unmodified; NVIDIA changes are configuration-level (adjusted thresholds, three pipeline positions, OCG disable flag).

CICC v13.0 ships LLVM's JumpThreadingPass at sub_2DC4260 (12,932 bytes, address range 0x2DC4260--0x2DC74E4). The pass duplicates basic blocks so that predecessors whose branch conditions can be statically resolved jump directly to the correct successor, eliminating a conditional branch from the critical path. On a GPU, this directly reduces warp divergence: a branch that was previously data-dependent becomes unconditional along each incoming edge, so the warp scheduler never needs to serialize the two paths.

The pass is fundamentally at odds with PTX's requirement for reducible control flow. Block duplication can create multi-entry loops (irreducible cycles) when the duplicated block is a loop header or when the threading target sits inside a loop whose header is not the threading source. CICC addresses this through three layered mechanisms -- loop header protection, conservative duplication thresholds, and a late-pipeline StructurizeCFG safety net -- that collectively keep the CFG reducible without sacrificing the pass's optimization value.

| Property | Value |
|---|---|
| Pass name (pipeline parser) | "jump-threading" |
| Pass class | llvm::JumpThreadingPass |
| Entry function | sub_2DC4260 |
| Binary size | 12,932 bytes |
| Stack frame | 0x748 (1,864) bytes |
| Block duplication helper | sub_2DC22F0 (2,797 bytes) |
| CFG finalization | sub_2DC30A0 (1,094 bytes) |
| Single-instruction threading | sub_2DC37C0 (2,288 bytes) |
| Select unfolding | sub_2DC40B0 (420 bytes) |
| Pipeline positions | Three invocations: ~position 234, ~278, and a late tier-3 position (~239) |
| NVVMPassOptions disable offset | +320 |
| Upstream LLVM source | lib/Transforms/Scalar/JumpThreading.cpp |

Why JumpThreading Matters on GPU

Consider a CUDA kernel containing:

if (threadIdx.x < threshold)
    val = computeA();
else
    val = computeB();

if (val > 0)
    result = pathX(val);
else
    result = pathY(val);

The second branch depends on val, which is a PHI of computeA() and computeB(). If JumpThreading can determine that computeA() always returns a positive value, it duplicates the second if block and wires the computeA predecessor directly to pathX. Threads that took the first branch path never execute the second conditional at all.

On a CPU this saves a branch misprediction. On a GPU the payoff is larger: eliminating the second branch prevents a second point of warp divergence. If both branches would diverge on different thread subsets, removing one cuts the total serialization overhead in half. The threads that took computeA proceed straight to pathX without waiting for the computeB threads to rejoin.

Knob Inventory

Six cl::opt globals control the pass, registered in ctor_456 at 0x544220:

| Knob | Default | Global | Description |
|---|---|---|---|
| jump-threading-threshold | 6 | qword_4FFDBA0 | Max instructions in a block eligible for duplication |
| jump-threading-implication-search-threshold | 3 | qword_4FFDAC0 | Max predecessors to search for condition implications |
| jump-threading-phi-threshold | 76 (0x4C) | qword_4FFD9E0 | Max PHI nodes in a block eligible for duplication |
| jump-threading-across-loop-headers | false | qword_4FFD900 | Allow threading across loop headers (testing only) |
| jump-threading-disable-select-unfolding | false | qword_4FFDC80 | Disable unfolding select instructions into branches |
| print-lvi-after-jump-threading | false | -- | Debug: print LazyValueInfo cache after pass completes |

The block-size threshold of 6 matches upstream LLVM. The PHI threshold of 76 is higher than upstream's typical default, reflecting GPU kernels' tendency toward wide PHI nodes due to predication and convergence patterns. The implication search depth of 3 is conservative, limiting compile-time cost from predecessor-chain analysis in the typically shorter basic-block chains of GPU code.

Two Disable Flags

CICC registers two independent cl::opt flags that suppress jump threading behavior. They live in different subsystems and control different things:

| Flag | Registration | Subsystem | Effect |
|---|---|---|---|
| "disable-JumpThreadingPass" | ctor_637 @ 0x5934A7 | JumpThreading pass itself | Disables the standalone JumpThreadingPass invocations in the pipeline |
| "disable-jump-threading" | ctor_073 @ 0x49A91E (also ctor_243 @ 0x4ED0C0) | SimplifyCFG | Disables jump threading logic within SimplifyCFG -- the per-block branch-through-PHI threading that SimplifyCFG performs as part of its CFG simplification |

The "disable-jump-threading" flag carries the annotation "Disable jump threading for OCG experiments", where OCG is NVIDIA's Optimizing Code Generation research infrastructure. This is a SimplifyCFG option, not a JumpThreadingPass option -- SimplifyCFG has its own internal implementation of branch threading through PHI nodes that is separate from the standalone pass. NVIDIA engineers can disable either or both independently.

The "fold-with-var-cond" flag is registered alongside "disable-jump-threading" in the same SimplifyCFG constructor group, controlling a related NVIDIA-specific extension for folding branches with variance conditions.

Interaction with StructurizeCFG

The fundamental tension: JumpThreading duplicates blocks to bypass conditionals, which can transform a reducible loop into an irreducible cycle. PTX requires all loops to be natural (single-entry, reducible). An irreducible CFG causes StructurizeCFG to emit "UnsupportedIrreducibleCFG" and bail out, leaving the function in a state that ptxas will likely reject.

CICC addresses this through three layered mechanisms:

1. Loop Header Protection via LoopInfo

The jump-threading-across-loop-headers flag defaults to false. Before threading any block, the pass queries LoopInfo through a red-black tree lookup at 0x2DC4781 using dword_501D5A8 as the analysis key. If the target block is a loop header (the LoopInfo query returns a non-null loop containing the block as its header), the pass skips it entirely.

A parallel DominatorTree query at 0x2DC4839 (using dword_501D4C8) verifies loop membership and nesting depth. If the block is found within a loop, a threshold override is loaded from qword_501D628, replacing the standard duplication threshold with a loop-specific one. A second override from qword_501D548 applies to blocks found via the DominatorTree-based lookup.

This double check -- LoopInfo for header identification, DominatorTree for membership -- prevents the most common source of irreducibility: duplicating a loop header creates a second entry into the loop body.

2. Conservative Duplication Thresholds

The three thresholds (6 instructions, 3 predecessors, 76 PHIs) restrict duplication to small, simple blocks where the CFG outcome is highly predictable and the duplication cost is bounded. A block must satisfy all three limits simultaneously -- the checks are conjunctive, not additive: a 6-instruction block with 4 predecessors exceeds the implication search depth and is rejected, and a 5-instruction block with 100 PHIs exceeds the PHI threshold.
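
The eligibility gate can be sketched as a conjunction of the three knob defaults. The struct and function names below are illustrative, not recovered from the binary, and the use of the implication-search threshold as a hard predecessor limit follows this page's description rather than decompiled logic.

```cpp
// Hypothetical eligibility gate mirroring the three knob defaults.
struct BlockStats {
    unsigned instructions;
    unsigned predecessors;
    unsigned phis;
};

constexpr unsigned kDupThreshold  = 6;   // jump-threading-threshold
constexpr unsigned kImplThreshold = 3;   // jump-threading-implication-search-threshold
constexpr unsigned kPhiThreshold  = 76;  // jump-threading-phi-threshold

bool eligibleForDuplication(const BlockStats& b) {
    // All three limits must hold simultaneously (conjunctive, not additive).
    return b.instructions <= kDupThreshold &&
           b.predecessors <= kImplThreshold &&
           b.phis         <= kPhiThreshold;
}
```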

3. StructurizeCFG Safety Net

StructurizeCFG (sub_35CC920) runs late in the pipeline, after all IR-level scalar and loop transforms. Its irreducibility detector (sub_35CA2C0) checks every back-edge: if the target does not dominate the source, the loop has multiple entries and is irreducible. If JumpThreading or any other pass creates an irreducible cycle that slipped past the loop header protection, StructurizeCFG will catch it.

This is defense-in-depth: the threading constraints prevent most irreducible cases, and structurization catches the rest. The design deliberately tolerates a small number of "false acceptances" at the JumpThreading level because the cost of occasionally running StructurizeCFG's rejection path is far lower than the cost of being too conservative and missing profitable threading opportunities.
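
The back-edge test that StructurizeCFG's detector applies -- a cycle is reducible only if every back-edge target dominates its source -- can be demonstrated on a toy CFG. This sketch computes dominance the slow, obvious way (can the source be reached from the entry without passing through the candidate dominator?), which is fine for illustration but nothing like a compiler's dominator tree.

```cpp
#include <vector>
#include <functional>

using CFG = std::vector<std::vector<int>>;  // adjacency list, node 0 = entry

// Can `to` be reached from `from` without ever visiting `banned`?
bool reachableAvoiding(const CFG& g, int from, int to, int banned) {
    if (from == banned) return false;
    if (from == to) return true;
    std::vector<bool> seen(g.size(), false);
    std::function<bool(int)> dfs = [&](int n) {
        if (n == to) return true;
        seen[n] = true;
        for (int s : g[n])
            if (s != banned && !seen[s] && dfs(s)) return true;
        return false;
    };
    return dfs(from);
}

// Does a dominate b? (every entry->b path passes through a)
bool dominates(const CFG& g, int a, int b) {
    if (a == b) return true;
    return !reachableAvoiding(g, 0, b, a);
}

// A back-edge src->dst makes the cycle irreducible when dst does not
// dominate src -- i.e. the loop body is reachable without the "header".
bool irreducibleBackEdge(const CFG& g, int src, int dst) {
    return !dominates(g, dst, src);
}
```

For the natural loop `0 -> 1 -> 2 -> 1`, node 1 dominates node 2, so the back-edge `2 -> 1` passes. Add a second entry `0 -> 2` and node 1 no longer dominates node 2: the cycle `1 <-> 2` has two entries and the detector flags it.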

Cost Model

The pass enforces a multi-level cost model that bounds total code growth per function.

Global Budget

At 0x2DC4887, the pass initializes a global instruction budget:

mov ebx, 200h    ; 512 instructions total budget

Each block duplication charges the duplicated block's instruction count against this budget. The budget is tracked in var_460 and checked before each duplication. Once exhausted, no further threading occurs in that invocation regardless of how profitable individual candidates might be.

Per-Predecessor Cost Division

When threading involves multiple predecessors, the per-predecessor cost is the block instruction count divided by the number of predecessors being threaded, with ceiling rounding:

cost_per_pred = block_instr_count / num_predecessors
; ceiling via: sbb eax, -1 (adds 1 if remainder was nonzero)

This division at 0x2DC4D78--0x2DC4D8E means a 6-instruction block being threaded for 3 predecessors costs only 2 instructions per predecessor against the global budget. The logic recognizes that multi-predecessor threading amortizes the code growth across more eliminated branches.
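
In plain C++, the cost division with its ceiling rounding looks like the sketch below; the function name is illustrative. The `cmp edx, 1; sbb eax, -1` sequence in the binary is just a branch-free way of adding 1 exactly when the remainder is nonzero.

```cpp
// Per-predecessor cost with ceiling rounding, equivalent to the
// cmp/sbb idiom in the compiled code.
unsigned costPerPredecessor(unsigned blockInstrCount, unsigned numPreds) {
    unsigned q = blockInstrCount / numPreds;
    unsigned r = blockInstrCount % numPreds;
    // sbb-style ceiling: add 1 iff the remainder was nonzero
    return q + (r != 0 ? 1 : 0);
}
```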

Special Cases

  • Single-instruction blocks (checked at 0x2DC4D94): Always eligible, regardless of budget. A block containing only a terminator instruction costs nothing to duplicate.
  • Empty blocks (checked at 0x2DC4D70): Skipped entirely.
  • Blocks with <=1 effective instructions (0x2DC4BF1): The comparison cmp edx, 1; jbe gates a fast path where the pass bypasses the full cost analysis.

LazyValueInfo Integration

The pass accepts a LazyValueInfo pointer as its third parameter (rdx). When non-null (checked at 0x2DC42BD), LVI provides range-based condition evaluation that enables threading even when the branch condition is not a simple constant comparison.

LVI State

The LVI cache occupies approximately 600 bytes (0x258) of local state:

| Field | Offset | Purpose |
|---|---|---|
| Cache structure | var_2F0 through var_98 | LVI range cache local state |
| Valid flag | var_C0 | Set to 1 when LVI is initialized |
| Cached ranges | var_B0 | SmallVector-like structure |
| Initial capacity | var_A8 | 8 entries |

Range-Based Threading

For ICMP_NE conditions (opcode 0xBA = 186), the pass calls sub_11F3070 (LVI::getPredicateAt) with the ICmp operand and a comparison predicate of 2, followed by sub_DFABC0 (evaluateConditionOnEdge) to resolve the branch direction along a specific incoming edge.

For alternate opcode paths (opcode 0x165 = 357), the pass uses sub_988330 (getConstantOnEdge) instead, which returns a concrete constant value if LVI can prove the condition evaluates to a known value along that edge.

The virtual dispatch at 0x2DC67D6 (call qword ptr [rax+78h]) invokes LVI::getPredicateOnEdge. If the vtable matches sub_920130 (the default implementation), a fallback path calls sub_AC4810 (isImpliedCondition) with predicate 0x27 (39), and if that also fails, sub_AA93C0 (SimplifyICmpInst).

Cleanup

On exit, if LVI was used, three cleanup calls occur:

  • sub_FFCE90 -- LVI::eraseBlock (invalidation)
  • sub_FFD870 -- LVI::clear
  • sub_FFBC40 -- LVI::releaseMemory

Main Algorithm

Outer Loop

The pass iterates over the function's basic block list via a linked-list traversal (BB->next chain at [BB+8]):

run(result_ptr, function, lvi_ptr, tli, ...):
    if lvi_ptr != null:
        initialize_lvi_cache(lvi_ptr)

    budget = 512
    changed = false

    loop:
        current_bb = function.entry_block    // sub_B2BEC0
        end = function + 0x48               // end sentinel

        while current_bb != end:
            if try_thread_block(current_bb, budget):
                changed = true
            current_bb = current_bb.next     // [current_bb + 8]

        if changed:
            changed = false
            goto loop    // restart: threading may expose new opportunities

    cleanup_lvi()
    return results

The restart-on-change behavior means threading is iterative: eliminating one branch can expose a new statically-determinable branch downstream.

Per-Block Classification

For each basic block, the pass examines the terminator instruction:

  1. Opcode check (0x2DC443E): The instruction opcode byte is compared against 0x55 (85), which is LLVM's BranchInst opcode. Only conditional branches are considered.

  2. Metadata check (0x2DC4449--0x2DC446E): Two calls to sub_A73ED0 check for metadata kinds 0x17 (23, "prof" branch weights) and 0x04 (debug). Then sub_B49560 (hasMetadataOtherThanDebugLoc) is called on the branch instruction.

  3. Condition extraction (0x2DC45F8--0x2DC4636): sub_981210 (getBranchCondition) returns a success flag and a condition code. Two condition codes are handled:

    • 0x165 (357): likely CmpInst::ICMP_EQ or a switch opcode
    • 0x0BA (186): likely CmpInst::ICMP_NE

    Other condition codes cause the block to be skipped.

  4. Operand analysis (0x2DC465F--0x2DC467C): The operand count is extracted (AND with 0x7FFFFFF mask -- the use-count field in LLVM's Value layout). If the branch condition is an ICmp with a constant operand (type byte 0x11 = 17 = ConstantInt), threading is potentially profitable.

Condition-Specific Threading Paths

The pass contains four specialized threading strategies:

Constant-value threading (0x2DC66B7): When a predecessor can determine the branch outcome via a constant PHI incoming value, the simplest path. Creates a direct unconditional branch.

Single-instruction threading (sub_2DC37C0, 2,288 bytes): For blocks containing exactly one instruction (the terminator), called at 0x2DC6704. Creates a direct branch bypass.

Switch threading (0x2DC6A76--0x2DC6B0C): When the terminator is a SwitchInst (opcode byte 0x37 = 55), calls sub_2DC40B0 (tryToUnfoldSelect). This checks for SelectInst (opcode 0x52 = 82) and unfolds the select into explicit branches that can be individually threaded.

Implication-based threading (0x2DC6E71--0x2DC6EB3): For ICmpInst variants (opcode 0x28 = 40), the pass checks whether the predicate implies the branch condition via sub_B532B0, creates the threaded edge via sub_B52EF0, and wires the new block via sub_92B530.

All-Ones Constant Detection

Four sites (0x2DC71B0, 0x2DC71CA, 0x2DC7380, 0x2DC74DA) check for all-ones constants as PHI incoming values:

or rax, -1          ; create all-ones mask
shr rax, cl         ; cl = 64 - bitwidth, shift to match width
cmp [rdx+18h], rax  ; compare against actual constant value
setz al             ; true if constant is all-ones

For an i1 type, all-ones means true. This handles the common pattern where a PHI incoming value from one predecessor is the constant true (all bits set), allowing the pass to resolve the branch direction for that predecessor.
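
The assembly sequence above translates to a simple width-masked comparison; this C++ rendering is a sketch of the same check (the bail-out for widths above 64 matches the PHI-iteration logic described next).

```cpp
#include <cstdint>

// C++ equivalent of: or rax,-1 ; shr rax,cl ; cmp [rdx+18h],rax ; setz al
bool isAllOnes(uint64_t value, unsigned bitwidth) {
    if (bitwidth == 0 || bitwidth > 64) return false;  // wide ints: bail out
    uint64_t mask = ~uint64_t{0} >> (64 - bitwidth);   // all-ones at this width
    return value == mask;
}
```

For `i1`, the mask is 1, so the constant `true` is recognized; for `i8`, the mask is 0xFF, and so on.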

PHI Operand Iteration

Two nearly identical loops at 0x2DC7206--0x2DC726E and 0x2DC7456--0x2DC74CD iterate PHI operands to determine if all incoming values from relevant predecessors resolve to the same constant:

for pred_idx in range(phi.num_operands):    // var_668
    incoming = phi.getIncomingValueForBlock(pred)  // sub_AD69F0
    type_tag = incoming.type_byte

    if type_tag == 0x0D:     // ConstantInt::getTrue()
        continue
    if type_tag == 0x11:     // ConstantInt with bitwidth check
        if bitwidth <= 64:
            if value == all_ones_for_width:
                continue     // resolves to true
        else:
            skip             // wide integers, bail out

    // If any incoming value is non-constant, threading is unprofitable
    bail_out()

If every relevant predecessor provides the same constant value, the branch direction is fully determined and threading proceeds.

Created Block Names

When threading occurs, the pass creates new basic blocks with diagnostic names:

| Name | String address | Purpose |
|---|---|---|
| "endblock" | 0x42E9094 | Terminal block of the threaded path; created via sub_F36990 (SplitBlockAndInsertIfThen) |
| "phi.res" | 0x42E90C0 | PHI resolution node for merged values; created via sub_D5C860 (PHINode::Create) |
| "res_block" | 0x42E909D | Result block for the threaded path; allocated as 0x50-byte BasicBlock via sub_22077B0 |
| "loadbb" | 0x42E90B9 | Load basic block for load threading; created in a loop at 0x2DC4F05--0x2DC4FFB |
| "phi.src1" | 0x42E90A7 | First PHI source block |
| "phi.src2" | 0x42E90B0 | Second PHI source block |

The "loadbb" blocks are created in a dynamic loop for multi-way threading, where each iteration allocates a 0x50-byte (sizeof(BasicBlock)) object and wires it into the CFG via sub_AA4D50 (BasicBlock::insertInto).

Block Duplication Engine: sub_2DC22F0

The 2,797-byte helper performs actual block cloning. Parameters:

| Register | Role |
|---|---|
| rdi | Duplication context structure (at var_490) |
| rsi | Source block's value table |
| rdx | Destination hash table |
| rcx | PHI operand map |
| r8d | Instruction count for the source block |

The cloning process:

  1. Clone each instruction from the source block
  2. Insert cloned instructions into use-def chains (0x2DC59A1--0x2DC59E7: linked-list surgery on LLVM's Value use-list)
  3. Update PHI operands to reference the new predecessor (0x2DC5E1E onward)
  4. Update branch targets in the predecessor blocks

CFG Finalization: sub_2DC30A0

The 1,094-byte helper, called at 0x2DC5015 and 0x2DC6408 after threading completes for a block, performs:

  • Successor edge updates
  • Dead block elimination for blocks made unreachable by the threading
  • DominatorTree updates if available (via sub_FFB3D0, DominatorTree::changeImmediateDominator)

Pipeline Positions

JumpThreading appears three times in the CICC pipeline, at different stages with different surrounding context:

| Position | Pipeline context | Parameter | Purpose |
|---|---|---|---|
| ~234 | After ADCE, within the main function simplification loop | sub_198DF00(-1) | First opportunity: thread branches exposed by dead code elimination |
| ~278 | After NVVMPeephole2 and optionally GVN, in the NVIDIA-specific tier-2 sequence | sub_198DF00(-1) | Second opportunity: thread branches exposed by value numbering and peephole |
| Late tier-3 | Within the ADCE/MemCpyOpt/DSE sequence | sub_198DF00(t) | Final opportunity: catch any remaining threadable branches before StructurizeCFG |

The sub_198DF00 function is the combined CorrelatedValuePropagation/JumpThreading registration wrapper. The -1 parameter likely selects the default mode; the t parameter in the third position may be an optimization-level-dependent configuration.

All three positions are conditional on NVVMPassOptions offset +320 not being set to disable. Each invocation resets the 512-instruction global budget, so the total code growth across all three invocations can reach up to 1,536 instructions per function.

DFA JumpThreading

A separate DFA-based JumpThreading variant exists at sub_276AF50, registered as "dfa-jump-threading" (llvm::DFAJumpThreadingPass). This pass is controlled by:

| Knob | Registration | Description |
|---|---|---|
| enable-dfa-jump-thread | ctor_445 @ 0x53F5C0 | Enable/disable the DFA variant |
| dfa-jump-view-cfg-before | ctor_445 | Debug: dump CFG before DFA threading |
| dfa-early-exit-heuristic | ctor_445 | Early-exit heuristic for compile time |

DFA JumpThreading handles state-machine patterns (switch statements in loops with predictable transitions between cases) that the standard JumpThreading cannot resolve. It is a separate pass with its own pipeline registration and does not share the budget or thresholds of the standard JumpThreading pass.
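
The state-machine shape DFA JumpThreading targets looks like the following (an illustrative source pattern, not code from the binary): a switch inside a loop where each case assigns the state variable a constant, so the state sequence is known at compile time and the switch can be threaded away by cloning the loop body along the 0 -> 1 -> 2 -> 0 path.

```cpp
// Each case sets `state` to a constant, so the transition graph is a
// compile-time-known DFA -- exactly what dfa-jump-threading resolves.
int runStateMachine(int iterations) {
    int state = 0, acc = 0;
    for (int i = 0; i < iterations; ++i) {
        switch (state) {
        case 0: acc += 1;   state = 1; break;
        case 1: acc += 10;  state = 2; break;
        case 2: acc += 100; state = 0; break;
        }
    }
    return acc;
}
```

Standard JumpThreading cannot resolve this because the branch condition (`state`) is loop-carried; the DFA variant tracks the value across iterations.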

Before/After IR Example

Consider a kernel with a two-branch diamond:

Before JumpThreading:

entry:
  %cond1 = icmp sgt i32 %x, 0
  br i1 %cond1, label %positive, label %negative

positive:
  %a = call i32 @computeA()
  br label %merge

negative:
  %b = call i32 @computeB()
  br label %merge

merge:
  %val = phi i32 [ %a, %positive ], [ %b, %negative ]
  %cond2 = icmp eq i32 %val, 42
  br i1 %cond2, label %match, label %nomatch

match:
  ...
nomatch:
  ...

If LVI can prove that computeA() always returns 42 (e.g., it is a known constant), JumpThreading duplicates the merge block for the %positive predecessor:

After JumpThreading:

entry:
  %cond1 = icmp sgt i32 %x, 0
  br i1 %cond1, label %positive, label %negative

positive:
  %a = call i32 @computeA()
  br label %match              ; threaded: skip %merge entirely

negative:
  %b = call i32 @computeB()
  br label %merge

merge:                          ; now has only one predecessor
  %val = phi i32 [ %b, %negative ]
  %cond2 = icmp eq i32 %val, 42
  br i1 %cond2, label %match, label %nomatch

match:
  ...
nomatch:
  ...

The %positive path no longer passes through merge. The second branch is eliminated for threads that took the first path.

Differences from Upstream LLVM

| Aspect | CICC v13.0 | Upstream LLVM 20 |
|---|---|---|
| PHI threshold default | 76 | Lower (typically ~32 or similar) |
| disable-jump-threading in SimplifyCFG | Present, annotated for OCG experiments | Present (standard LLVM flag) |
| Annotation | "Disable jump threading for OCG experiments" | No OCG reference |
| Pipeline invocations | Three positions, combined with CVP via sub_198DF00 | Typically two (early and late in the function simplification pipeline) |
| NVVMPassOptions disable | Offset +320 | N/A |
| Loop header override thresholds | qword_501D628, qword_501D548 | Standard LoopInfo check only |
| fold-with-var-cond | NVIDIA-specific SimplifyCFG companion flag | Not present |

The core algorithm is unmodified from upstream. NVIDIA's changes are configuration-level: adjusted thresholds, additional pipeline positions, the OCG disable flag, and integration with the NVVMPassOptions system.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| JumpThreadingPass::run (main pass body) | sub_2DC4260 | 12,932 bytes | -- |
| Block cloning engine (duplicateBlock) | sub_2DC22F0 | 2,797 bytes | -- |
| CFG finalization after threading | sub_2DC30A0 | 1,094 bytes | -- |
| Single-instruction threading | sub_2DC37C0 | 2,288 bytes | -- |
| tryToUnfoldSelect | sub_2DC40B0 | 420 bytes | -- |
| SmallVector append/copy for instruction map | sub_2DC1F40 | 349 bytes | -- |
| LVI::getPredicateAt | sub_11F3070 | -- | -- |
| evaluateConditionOnEdge | sub_DFABC0 | -- | -- |
| getConstantOnEdge | sub_988330 | -- | -- |
| isImpliedCondition | sub_AC4810 | -- | -- |
| SimplifyICmpInst | sub_AA93C0 | -- | -- |
| getBranchCondition | sub_981210 | -- | -- |
| BranchInst::getCondition | sub_B43CB0 | -- | -- |
| BranchInst::Create (conditional) | sub_B4C9A0 | -- | -- |
| BranchInst::Create (unconditional) | sub_B4C8F0 | -- | -- |
| PHINode::addIncoming | sub_B99FD0 | -- | -- |
| PHINode::Create | sub_D5C860 | -- | -- |
| SplitBlockAndInsertIfThen | sub_F36990 | -- | -- |
| BasicBlock::getContext | sub_BD5C60 | -- | -- |
| operator new(0x50) (allocate BasicBlock) | sub_22077B0 | -- | -- |
| BasicBlock::insertInto | sub_AA4D50 | -- | -- |
| Value::replaceAllUsesWith | sub_BD84D0 | -- | -- |
| Instruction::eraseFromParent | sub_B43D60 | -- | -- |
| DominatorTree::changeImmediateDominator | sub_FFB3D0 | -- | -- |
| PHINode::getIncomingValueForBlock | sub_AD69F0 | -- | -- |
| LoopInfo pass lookup | sub_C959E0 | -- | -- |
| Predicate implies branch check | sub_B532B0 | -- | -- |
| ConstantExpr::getICmp or create threaded edge | sub_B52EF0 | -- | -- |
| CloneBasicBlock or wire new block | sub_92B530 | -- | -- |
| CloneBasicBlock (alternate path) | sub_929DE0 | -- | -- |

Cross-References

  • StructurizeCFG -- the late-pipeline safety net that catches irreducible CFG created by threading or other passes
  • Scalar Passes Hub -- hub page linking SROA, EarlyCSE, and JumpThreading with GPU-context summaries
  • GVN -- runs between JumpThreading invocations in the tier-2 sequence; can expose new threadable branches
  • Pipeline & Ordering -- tier-dependent scheduling of all three invocations
  • Knobs -- master knob inventory including all six JumpThreading knobs

LICM (Loop-Invariant Code Motion)

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Loop-Invariant Code Motion in cicc v13.0 operates at three distinct levels: an IR-level pass ("licm", backed by MemorySSA), a pre-RA machine pass ("early-machinelicm"), and a post-RA machine pass ("machinelicm"). The IR-level pass runs in two modes within the same pipeline -- a hoist invocation early in the optimization sequence that pulls invariant computations and loads out of loops into preheaders, and a sink invocation via LoopSinkPass (or implicit re-processing) later that pushes unprofitable hoists back into cold loop blocks.

On a CPU, hoisting is almost universally profitable because the preheader executes once per loop entry rather than once per iteration. On a GPU, the calculus is different: every value hoisted into the preheader extends its live range across the entire loop body, consuming a register for all iterations. If that extra register pushes the kernel past an occupancy cliff -- the threshold where the SM can fit one fewer warp -- the net effect is a slowdown, not a speedup.

NVIDIA addresses this tension through the interplay of the two invocations, the NVVM alias analysis pipeline that makes cross-address-space loads trivially hoistable, and the downstream rematerialization passes that can undo hoists that turned out to be unprofitable after register allocation.

Key Facts

| Property | Value |
|---|---|
| IR pass name | "licm" (new PM), "LICMPass" (legacy) |
| IR pass factory | sub_195E880(0) -- creates LICM with AllowSpeculation=false |
| IR pass factory (alt) | sub_184CD60() -- creates LICM (also identified as ConstantMerge in some sweeps; identity ambiguous -- see Analysis Notes) |
| Machine pass (pre-RA) | "early-machinelicm" / EarlyMachineLICMPass |
| Machine pass (post-RA) | "machinelicm" / MachineLICMPass |
| Knob registration | ctor_457_0 at 0x544C40 (18,398 bytes -- 11 knobs) |
| MachineLICM knob registration | ctor_305 (4 knobs) |
| Disable flag | -disable-LICMPass via -Xcicc |
| PassOptions disable | -opt "-do-licm=0" (also forced by --emit-optix-ir) |
| NVVMPassOptions slot | opts[1240] (disable), opts[2880] (enable, reversed logic) |
| Upstream LLVM source | llvm/lib/Transforms/Scalar/LICM.cpp, llvm/lib/CodeGen/MachineLICM.cpp |

Pipeline Positions

LICM appears at multiple pipeline positions depending on the optimization tier and compilation mode. The pass uses two distinct factory functions, and because the binary is stripped, it is not always certain which factory is definitively LICM and which is another pass. The following table lists all confirmed appearances.

IR-Level LICM

| Position | Call site | Factory | Guard condition | Context |
|---|---|---|---|---|
| O1 baseline, position 12 | sub_12DE330 | sub_184CD60() | none | After LoopRotate, before IndVarSimplify. First hoist invocation. |
| Main optimizer, mid-pipeline | sub_12DE8F0 | sub_195E880(0) | !opts[1240] | Guarded by the LICM disable flag. Runs after DCE and before NVVMLowerBarriers. |
| Main optimizer, late | sub_12DE8F0 | sub_195E880(0) | opts[2880] && !opts[1240] | Second invocation, guarded by both enable and disable flags. Runs after ADCE, before LoopUnroll. |
| Extended pipeline | sub_12E54A0 | sub_195E880(0) | opts[2880] && !opts[1240] | After NVVMLowerBarriers, before LoopUnroll. |
| Late pipeline | sub_12E54A0 | sub_195E880(0) | !opts[1240] | After LoopIdiomRecognize and LoopSimplify, before SimplifyCFG. Late cleanup invocation. |
| Aggressive (O3, "mid" path) | sub_12E54A0 | sub_184CD60() | none | Position 1 and position 18 of the aggressive pipeline. Second invocation follows GVN. |

Machine-Level LICM

| Position | Pass | Guard | Context |
|---|---|---|---|
| Pre-RA | early-machinelicm | enable-mlicm | After EarlyTailDuplicate, before MachineCSE. Controlled by the NVPTX target. |
| Post-RA | machinelicm | !disable-postra-machine-licm | After ExpandPostRAPseudos, before post-RA MachineSink. |

Algorithm

IR-Level: Hoist Mode

LICM's hoist mode is the upstream LLVM 20.0.0 algorithm with no visible NVIDIA patches to the core logic. The NVIDIA delta is entirely in the analysis results that LICM consumes (NVVM AA, MemorySSA precision, convergent-call handling) and in the pipeline orchestration (multiple invocations, register-pressure-aware sink mode).

The algorithm processes each loop from innermost to outermost:

for each loop L in post-order (innermost first):
    preheader = L.getLoopPreheader()
    if preheader is null: skip

    // 1. Collect candidates
    for each basic block BB in L:
        for each instruction I in BB:
            if isLoopInvariant(I, L) and isSafeToHoist(I, L):
                candidates.push(I)

    // 2. Hoist each candidate
    for I in candidates:
        if I is a load:
            // Query MemorySSA walker for clobbering stores
            clobber = MSSA.getClobberingMemoryAccess(I)
            if clobber is outside L:
                hoist(I, preheader)
        else if I is a pure computation (no side effects):
            hoist(I, preheader)
        else if I is a store and hoist-const-stores is enabled:
            if store address is loop-invariant and
               no other store in L aliases this address:
                hoist(I, preheader)

The isLoopInvariant check verifies that all operands of the instruction are either defined outside the loop or are themselves loop-invariant. The isSafeToHoist check queries MemorySSA to determine whether the instruction's memory behavior is loop-invariant -- for loads, this means no store inside the loop may alias the load's address.

MemorySSA walker interaction. When LICM calls getClobberingMemoryAccess(load_in_loop), the MemorySSA walker walks upward from the load's MemoryUse through the MemorySSA graph. If the walk reaches the loop's entry MemoryPhi without encountering a MemoryDef that may-alias the load, the load is hoistable. The walk is bounded by licm-mssa-optimization-cap to prevent compile-time explosion on functions with dense memory SSA graphs.

The licm-mssa-max-acc-promotion knob limits how many MemoryAccesses LICM will attempt to promote (scalar-replace loads from loop-invariant addresses with SSA values held in registers across iterations). This is the LICM variant of store-to-load forwarding within a loop.

IR-Level: Sink Mode

The LoopSink pass ("loop-sink", registered at pipeline parser entry 271) is the inverse of hoist mode. It runs late in the pipeline and pushes instructions that were hoisted to the preheader back into the loop body, specifically into cold blocks that execute infrequently relative to the loop header.

The decision to sink is driven by block frequency analysis:

for each instruction I in preheader:
    if I has uses only in cold blocks of the loop:
        coldest_block = argmin(blockFreq(B) for B where I is used in B)
        if blockFreq(preheader) / blockFreq(coldest_block) > threshold:
            sink(I, coldest_block)

On GPUs, the sink mode is particularly important because:

  1. Occupancy recovery. A hoist that added one live register at the preheader may have pushed the kernel from 8 to 7 warps per SM. Sinking that value back undoes the damage.
  2. Divergent control flow. If the hoisted value is only used in a branch taken by some threads (divergent execution), hoisting forces all threads to compute it. Sinking limits the computation to the threads that actually take the branch.
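
The block-frequency decision in the pseudocode above reduces to a ratio test; the sketch below restates it in C++. The function name and the threshold value are stand-ins, not values recovered from the binary.

```cpp
#include <vector>
#include <algorithm>

// Sink when the preheader runs often enough, relative to the coldest block
// that uses the value, that keeping it hoisted is a net loss.
bool shouldSink(double preheaderFreq,
                const std::vector<double>& useBlockFreqs,
                double threshold = 2.0) {  // illustrative threshold
    if (useBlockFreqs.empty()) return false;
    double coldest = *std::min_element(useBlockFreqs.begin(),
                                       useBlockFreqs.end());
    return preheaderFreq / coldest > threshold;
}
```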

Machine-Level: MachineLICM

MachineLICM operates on MachineInstr after instruction selection. The pre-RA variant (early-machinelicm) is gated by the enable-mlicm knob, which is controlled by the NVPTX target. The post-RA variant (machinelicm) runs unconditionally unless disable-postra-machine-licm is set.

The machine-level algorithm differs from the IR level in that it has concrete register pressure information:

for each machine loop ML (innermost first):
    preheader = ML.getLoopPreheader()
    for each MachineInstr MI in ML:
        if isLoopInvariant(MI) and isSafeToHoist(MI):
            // Compute pressure impact
            pressure_delta = estimatePressureIncrease(MI, preheader)
            if sink-insts-to-avoid-spills and
               pressure_delta would cause spills:
                skip MI  // Do not hoist
            else:
                hoist(MI, preheader)

The sink-insts-to-avoid-spills knob (registered at ctor_305) is the critical GPU-specific control: it tells MachineLICM to abandon a hoist when the resulting register pressure in the preheader would exceed the spill threshold. This directly prevents the occupancy-cliff problem at the machine level.

GPU-Specific Considerations

Register Pressure and Occupancy Cliffs

Each SM's register file is shared among all resident warps, creating discrete occupancy cliffs where a single additional register per thread can drop maximum occupancy by an entire warp group.

Hoisting one additional value into the preheader extends its live range across the entire loop body, increasing peak register pressure by one. If that increase crosses an occupancy cliff boundary, the kernel loses an entire warp's worth of parallelism per SM. This is why cicc invokes LICM early (to expose optimization opportunities for GVN, DSE, and InstCombine) and then relies on the downstream rematerialization infrastructure to undo hoists that became unprofitable after the register allocator made its decisions.
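
The cliff arithmetic can be sketched as follows. The register-file size and warp cap are illustrative, and real allocation rounds to a per-warp granularity that this sketch ignores:

```python
REGFILE = 65536   # 32-bit registers per SM (illustrative; varies by architecture)
WARP = 32         # threads per warp

def max_warps_per_sm(regs_per_thread, hw_warp_cap=64):
    """Occupancy limit imposed by the register file alone."""
    return min(hw_warp_cap, REGFILE // (regs_per_thread * WARP))

print(max_warps_per_sm(256))  # 8
print(max_warps_per_sm(257))  # 7: one extra live register per thread costs a whole warp
```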

NVVM AA and Cross-Address-Space Independence

The single most impactful NVIDIA-specific behavior in LICM is not a patch to LICM itself but the NVVM alias analysis (nvptx-aa) that feeds into MemorySSA. When LICM queries whether a load from addrspace(1) (global memory) is clobbered by a store to addrspace(3) (shared memory), NVVM AA returns NoAlias immediately. This means:

  • A load from global memory inside a loop is trivially hoistable past any number of shared memory stores.
  • A shared memory load is hoistable past global stores.
  • Only stores to the same address space (or to addrspace(0) / generic) prevent hoisting.

This dramatically increases the set of hoistable instructions compared to a flat-memory architecture. Without NVVM AA, a conservative alias analysis would assume any store could clobber any load, making most loads inside GPU kernels non-hoistable.
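
The disambiguation rule reduces to a small predicate over address-space numbers (spaces per the CUDA convention used above; the real query also handles param and other spaces):

```python
GENERIC, GLOBAL, SHARED, CONSTANT, LOCAL = 0, 1, 3, 4, 5

def addrspace_alias(a, b):
    """NoAlias when both pointers live in distinct non-generic address spaces."""
    if a == GENERIC or b == GENERIC:
        return "MayAlias"               # a generic pointer can point anywhere
    return "MayAlias" if a == b else "NoAlias"

print(addrspace_alias(GLOBAL, SHARED))  # NoAlias: global load hoistable past shared stores
```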

Barrier-Aware Motion Constraints

CUDA __syncthreads() barriers are lowered to llvm.nvvm.barrier0 intrinsic calls, which are marked convergent and have memory side effects on shared memory. The convergent attribute prevents LICM from hoisting any instruction that depends (directly or transitively through the call graph) on a convergent call. The memory side effect on the barrier prevents hoisting loads across it even when the load does not depend on the barrier's value, because the barrier's MemoryDef in MemorySSA clobbers all shared-memory accesses.

This means LICM correctly refuses to hoist a shared memory load from below a __syncthreads() to above it -- doing so would read a value that the barrier was supposed to synchronize.

The NVVMLowerBarriers pass (sub_1C98160) runs between LICM invocations in the pipeline. Its position matters: barriers are still at the intrinsic level during the first LICM invocation, providing the convergent/memory-effect constraint. After lowering, the barrier semantics are encoded differently, which could affect what a later LICM invocation can move.

Interaction with Downstream Passes

LICM's hoist decisions feed into several downstream passes that can undo or refine them:

  1. Rematerialization (nvvmrematerialize, nv-remat-block): If hoisting increased register pressure past the target, the rematerialization pass will clone the hoisted instruction back to each use site, effectively undoing the hoist while keeping the optimization benefits at the IR level. See Rematerialization.

  2. Sinking2 (sub_1CC60B0): NVIDIA's custom sinking pass runs after LICM and can push instructions back toward their uses. The rp-aware-sink and max-uses-for-sinking knobs control whether the sink considers register pressure impact. See Sinking2.

  3. Base Address Strength Reduction: Hoisted address computations are candidates for strength reduction. The sub_1C51340 function checks whether a base address is loop-invariant, which is trivially true after LICM has hoisted it.

Configuration

IR-Level LICM Knobs (ctor_457_0 at 0x544C40)

These are standard LLVM knobs present in the cicc binary. No NVIDIA-specific knobs were found in the IR-level LICM registration.

Knob | Type | Default | Effect
disable-licm-promotion | bool | false | Disable scalar promotion of memory locations (store-to-load forwarding within loops). When set, LICM will not replace repeated loads from a loop-invariant address with a register-held value.
licm-control-flow-hoisting | bool | false | Enable hoisting of instructions with control-flow-dependent execution. When disabled, only instructions that dominate the loop latch can be hoisted.
licm-force-thread-model-single | bool | false | Override the thread model to single-threaded, allowing LICM to hoist atomic operations. Not useful on GPU.
licm-max-num-uses-traversed | int | 8 | Maximum number of uses to traverse when checking whether all uses of a hoisted value are inside the loop. Limits compile time on values with many uses.
licm-max-num-fp-reassociations | int | (default) | Maximum FP reassociation chains LICM will attempt to hoist as a group.
licm-hoist-bo-association-user-limit | int | (default) | User count limit for binary operator association hoisting.
licm-skip-unrolled-loops | bool | false | Skip LICM on loops that have been unrolled (identified by metadata). Avoids re-hoisting values that were deliberately placed by the unroller.
licm-insn-limit | int | (default) | Maximum number of instructions LICM will process per loop. Compile-time safety valve.
licm-max-num-int-reassociations | int | (default) | Maximum integer reassociation chains for group hoisting.
licm-mssa-optimization-cap | int | (default) | Maximum number of MemorySSA accesses the walker will visit per query. Prevents pathological compile times on functions with dense memory access patterns.
licm-mssa-max-acc-promotion | int | (default) | Maximum number of MemoryAccesses LICM will attempt to promote (scalar-replace) per loop.

IR-Level LICM Pipeline Parameters

The pass text-pipeline parser accepts two parameters for the "licm" pass:

Parameter | Effect
allowspeculation | Allow speculative execution of hoisted instructions (loads that might trap).
conservative-calls | Use conservative call analysis -- treat all calls as potentially clobbering.

The factory function sub_195E880(0) creates LICM with AllowSpeculation=false, which is the safe default for GPU code where speculative loads from unmapped memory would fault the entire kernel.

Machine-Level MachineLICM Knobs (ctor_305)

Knob | Type | Default | Effect
avoid-speculation | bool | (default) | Avoid hoisting instructions that could speculatively execute and trap.
hoist-cheap-insts | bool | (default) | Hoist instructions with very low cost even when register pressure is high.
sink-insts-to-avoid-spills | bool | (default) | Critical GPU knob. When enabled, MachineLICM will sink (not hoist) instructions when hoisting would increase register pressure past the spill threshold. This directly trades code motion for spill avoidance.
hoist-const-stores | bool | (default) | Hoist stores of constant values out of loops. Enabled at the NVIDIA sinking/code-motion category level.

NVPTX Target Gating Knobs

Knob | Type | Default | Effect
enable-mlicm | bool | opt-level dependent | Master enable for pre-RA EarlyMachineLICM on NVPTX.
disable-machine-licm | bool | false | Disable pre-RA MachineLICM (stock LLVM knob).
disable-postra-machine-licm | bool | false | Disable post-RA MachineLICM (stock LLVM knob).

Global Pipeline Controls

Control | Mechanism | Effect
do-licm=0 | PassOptions (-opt flag) | Disables IR-level LICM entirely. Automatically set by --emit-optix-ir.
disable-LICMPass | -Xcicc flag | Disables IR-level LICM via the pass-disable mechanism.
opts[1240] | NVVMPassOptions bit | Per-invocation disable flag for IR LICM.
opts[2880] | NVVMPassOptions bit | Per-invocation enable flag for IR LICM (reversed logic).

Diagnostic Strings

The IR-level LICM pass emits optimization remarks via the standard LLVM remark infrastructure. The following remark identifiers are present in upstream LLVM 20 and apply unchanged in cicc:

Remark | Condition
"hoisted" | Instruction was successfully hoisted to preheader.
"sunk" | Instruction was sunk from preheader into a loop block.
"promoted" | Memory location was scalar-promoted (repeated load replaced with register).
"licm" | General LICM diagnostic (pass name in remark metadata).

MachineLICM emits its own set:

String | Condition
"Hoisting to BB#%d" | Machine instruction hoisted to the specified preheader block.
"Won't hoist cheap instruction" | Instruction deemed too cheap to justify the pressure increase.
"Can't hoist due to spill pressure" | sink-insts-to-avoid-spills vetoed the hoist.

Analysis Notes

Identity Ambiguity: sub_184CD60 and sub_195E880

The pipeline analysis identified two factory functions as LICM candidates:

  • sub_195E880(0): Called with explicit LICM disable guards (!opts[1240], opts[2880]). Present in the main optimizer and extended pipeline. This is the higher-confidence identification as the IR-level LICM factory.

  • sub_184CD60(): Called in the O1 baseline pipeline at position 12 (after LoopRotate), and in the aggressive pipeline. Some sweeps identify this as ConstantMerge or GlobalDCE. The O1 pipeline context (LoopRotate -> sub_184CD60 -> IndVarSimplify) strongly suggests this is LICM, as this is the canonical upstream LLVM loop optimization sequence. However, the aggressive pipeline uses it in a position where ConstantMerge would also make sense. Without the stripped symbol, the definitive identification relies on structural context.

Both functions likely create the same underlying LICMPass -- the difference may be in the parameters (e.g., AllowSpeculation, ConservativeCalls) or the analysis dependencies they request.

No Visible NVIDIA Patches to IR-Level LICM

Unlike DSE, GVN, and InstCombine, the IR-level LICM code does not appear to contain NVIDIA-specific modifications. The 11 knobs registered at ctor_457_0 are all standard upstream LLVM options. The NVIDIA delta for LICM is architectural:

  1. Analysis precision: NVVM AA and enhanced MemorySSA provide better aliasing information, making LICM more aggressive without code changes.
  2. Pipeline orchestration: Multiple invocations at different pipeline stages with different guard conditions.
  3. Machine-level integration: sink-insts-to-avoid-spills and enable-mlicm provide GPU-specific pressure management.
  4. Downstream safety net: Rematerialization undoes unprofitable hoists after register allocation.

LICM Disabled for OptiX IR

The --emit-optix-ir mode (triggered by OptiX runtime compilation with device type 0xDEED or 0xABBA) automatically sets do-licm=0, disabling LICM entirely. This suggests that OptiX IR is intended to be consumed by a downstream optimizer (the OptiX JIT compiler) that performs its own code motion decisions, and pre-hoisting at the cicc level would interfere with those decisions.

Function Map

Function | Address | Size | Role
LICMPass::create | sub_195E880 | -- | IR-level LICM factory (AllowSpeculation=false)
LICMPass::create (alt) | sub_184CD60 | -- | IR-level LICM factory (identity ambiguous, may be ConstantMerge)
LICM knob registration | ctor_457_0 (0x544C40) | -- | 11 cl::opt registrations for IR LICM
MachineLICM knob registration | ctor_305 | -- | 4 cl::opt registrations for MachineLICM
EarlyMachineLICMPass | (in codegen pipeline) | -- | Pre-RA machine-level LICM
MachineLICMPass | (in codegen pipeline) | -- | Post-RA machine-level LICM
LoopSinkPass | pipeline parser entry 271 | -- | Inverse of LICM hoist -- sinks unprofitable hoists
NVVMLowerBarriers | sub_1C98160 | -- | Runs between LICM invocations; lowers barrier intrinsics
NVVM AA query | sub_146F1B0 | -- | Address-space-based NoAlias determination used by MemorySSA
MemorySSA clobber walk | sub_1A6AFB3 | -- | Walker that LICM uses to determine load hoistability
Loop-invariant check | sub_1C51340 | -- | Utility for checking if a value is loop-invariant

Differences from Upstream LLVM

Aspect | Upstream LLVM 20 | cicc v13.0
Pipeline invocations | Typically one LICM invocation in the function pipeline, plus LoopSink. | 4-6 invocations at different pipeline stages with conditional guards.
Alias analysis precision | BasicAA + TBAA. Cross-address-space aliasing not exploited (all code shares one address space). | NVVM AA returns NoAlias for cross-address-space pairs, dramatically increasing hoistable instruction count.
MemorySSA sparsity | Dense graphs on flat-memory architectures. | Sparse graphs due to NVVM AA, reducing walker overhead and improving LICM precision.
Register pressure feedback | MachineLICM has sink-insts-to-avoid-spills but no GPU occupancy model. | sink-insts-to-avoid-spills interacts with NVPTX's occupancy-based register targets. enable-mlicm provides target-level gating.
Speculative hoisting | Allowed by default on most targets. | Disabled (AllowSpeculation=false) because GPU kernels fault on speculative loads from unmapped memory.
OptiX mode | N/A. | LICM entirely disabled for OptiX IR emission.
Downstream undo | No systematic mechanism to undo unprofitable hoists. | Rematerialization (nvvmrematerialize, nv-remat-block) systematically undoes hoists that increase pressure past the occupancy target.

Cross-References

LICM (Loop-Invariant Code Motion) -- Redirect

This page previously contained LoopUnroll content due to a sweep misidentification. The LoopUnroll pass factory at sub_19B73C0 was incorrectly labeled as LICM because the two passes are adjacent in the binary. All LoopUnroll content has been merged into the Loop Unrolling page.

For the actual LICM documentation, see: LICM (Loop-Invariant Code Motion)

The LICM page covers:

  • IR-level LICM ("licm", backed by MemorySSA) -- hoist and sink modes
  • Machine-level LICM ("early-machinelicm", "machinelicm") -- pre-RA and post-RA
  • GPU-specific considerations: register pressure, occupancy cliffs, NVVM AA cross-address-space independence
  • All pipeline positions, knobs, and diagnostic strings
  • Interaction with downstream passes (rematerialization, Sinking2)

DSE (Dead Store Elimination)

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp (LLVM 20.0.0)

CICC v13.0 contains a heavily modified Dead Store Elimination pass totaling approximately 91 KB of decompiled code across three major functions: the core DSE::runOnFunction at sub_19DA750 (33 KB), the overwrite detection engine at sub_19DDCB0 (28 KB), and the partial overwrite tracking system at sub_19DF5F0 (30 KB). This substantially exceeds the size of upstream LLVM DSE, primarily due to NVIDIA's additions for partial store forwarding with type conversion, cross-store dependency tracking, store-chain decomposition for aggregates, and native CUDA vector type awareness.

IR Before/After Example

DSE removes stores that are overwritten before any load reads them. The NVIDIA extension handles partial overwrites common in CUDA vector code.

Before (dead store followed by overwrite):

define void @f(ptr addrspace(1) %p, float %x, float %y) {
  store float %x, ptr addrspace(1) %p, align 4          ; dead: overwritten below before any load
  %other = fadd float %x, %y
  store float %other, ptr addrspace(1) %p, align 4      ; overwrites the first store completely
  ret void
}

After:

define void @f(ptr addrspace(1) %p, float %x, float %y) {
  ; first store removed -- overwritten by second store, no intervening load
  %other = fadd float %x, %y
  store float %other, ptr addrspace(1) %p, align 4
  ret void
}

NVIDIA's DSE also handles partial overwrite patterns with CUDA vector types. When a float4 store partially overwrites a previous float4 store, the pass decomposes via GEP to determine which elements are dead. This is a key GPU extension that upstream LLVM DSE does not handle.
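
The element-level deadness decision can be sketched by intersecting byte ranges; the helper name and offsets are illustrative, with a 4-byte element as in float4:

```python
def dead_elements(early_off, early_n, late_off, late_n, elem=4):
    """Indices of the earlier store's elements fully covered by the later store."""
    late_lo, late_hi = late_off, late_off + late_n * elem
    dead = []
    for i in range(early_n):
        lo = early_off + i * elem
        if late_lo <= lo and lo + elem <= late_hi:   # element entirely overwritten
            dead.append(i)
    return dead

# a float4 store at offset 0 followed by a float2 store at offset 8:
# elements 2 and 3 of the first store are dead
print(dead_elements(0, 4, 8, 2))  # [2, 3]
```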

Analysis Dependencies

DSE requires five analysis passes, resolved through the pass manager at registration time (sub_19DD1D0):

Analysis | Global Address | Pass ID
MemorySSA | unk_4F9E06C | Memory SSA graph
DominatorTree | unk_4F9A488 | Dominator tree
MemoryDependence | unk_4F9B6E8 | Memory dependence queries
PostDominatorTree | unk_4F9D764 | Post-dominator tree
AliasAnalysis | unk_4F9D3C0 | NVVM-aware alias analysis

Core Algorithm

The main entry point DSE::runOnFunction (sub_19DA750) processes a function by iterating over store instructions and checking whether each store is dead (fully or partially overwritten by a later store to the same location before any intervening load).

Early Exit and Setup

The pass begins with an early exit check via sub_1636880() to determine whether the function should be skipped entirely. It then retrieves MemoryDependence and AliasAnalysis from the pass manager and calls sub_14A4050 / sub_14A2F00 to verify the function contains stores worth analyzing. If no stores are present, the pass returns immediately.

Store Instruction Identification

Store instructions are identified by checking byte +16 of the instruction structure for value 77. The operand count is read from offset +20 (masked with 0xFFFFFFF), and the "has-operand-list-pointer" flag at byte +23, bit 0x40, indicates indirect operand storage for instructions with many operands.
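
These field tests can be reconstructed as a sketch (field layout per the offsets above; the function name and return shape are ours):

```python
def decode_store_fields(byte16, word20, byte23):
    """Decode the recovered instruction fields: opcode tag, operand count, storage flag."""
    is_store = (byte16 == 77)                  # opcode tag at offset +16
    operand_count = word20 & 0x0FFFFFFF        # low 28 bits of the word at +20
    indirect_operands = bool(byte23 & 0x40)    # "has-operand-list-pointer" bit at +23
    return is_store, operand_count, indirect_operands

print(decode_store_fields(77, 0x10000002, 0x40))  # (True, 2, True)
```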

Type Size Computation

DSE computes store sizes through a type-walker switch on byte +8 of the type structure. This logic is shared between the core pass and the overwrite detector:

Type Code | Size | Notes
1 | 16 bits | Half-precision float
2 | 32 bits | Float / int32
3, 9 | 64 bits | Double / int64
4 | 80 bits | x86 long double / PTX f80
5, 6 | 128 bits | Quad precision / int128
7 | pointer-sized | Resolved via sub_15A9520
0xB | immediate | Size from upper bits of type word
0xD | struct | Layout computed by sub_15A9930
0xE | vector | element_size * num_elements with alignment
0xF | integer | Arbitrary-width integer
0x10 | array | Recurses into element type, multiplies by count
0, 8, A, C | array-like | Follows pointer chain

The vector type formula (case 0xE) accounts for element alignment: each element's byte size is rounded up to its alignment before being multiplied by the element count, i.e. size_bits = 8 * num_elements * element_alignment * floor((ceil(element_bits / 8) + element_alignment - 1) / element_alignment). This handles CUDA native vector types (float2, float4, int4).
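
In code form, the padding rule looks like this (a sketch; the helper name is ours):

```python
import math

def vector_store_bits(element_bits, num_elements, element_align):
    """Each element is padded up to its alignment, then multiplied by the count."""
    elem_bytes = math.ceil(element_bits / 8)
    padded_bytes = element_align * ((elem_bytes + element_align - 1) // element_align)
    return 8 * num_elements * padded_bytes

print(vector_store_bits(32, 4, 4))   # float4: 128 bits
print(vector_store_bits(16, 2, 4))   # two half elements padded to 4-byte slots: 64 bits
```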

Overwrite Detection

The overwrite analysis engine at sub_19DDCB0 (28 KB) determines whether one store completely or partially covers another. It receives the instruction, an operand index, alias analysis results, and address-space information.

Alias Queries

The function calls sub_14C2730 to perform alias queries with full parameters: (target_ptr, data_layout, 0, instruction, store_address, alias_analysis). This returns whether two memory locations may alias. The alias analysis already incorporates CUDA address-space separation (shared=3, global=1, local=5, constant=4), so DSE itself does not need explicit address-space checks.

Partial Store Forwarding

When store sizes do not match, NVIDIA's DSE creates truncation or extension casts to extract the relevant portion. This is a critical GPU-specific extension:

  • If the source is smaller than the destination: creates an extension (opcode 36 = zext).
  • If the source is larger than the destination: creates a truncation (opcode 38 = trunc).
  • Alignment requirements are verified through sub_16431D0.
  • Complex types use sub_15FDBD0 for cast creation; simple types use sub_15A46C0.

Standard LLVM DSE bails on size mismatches. NVIDIA's version handles the common CUDA pattern of a float4 store followed by a scalar float load by extracting the relevant component via GEP + load.

Store Size Ratio Check

At labels LABEL_25 / LABEL_29 in the core function, DSE performs a ratio check:

  1. Computes v159 = aligned size of destination type.
  2. Computes v48 = aligned size of source type.
  3. Calculates v148 = v48 / v159 (how many destination elements fit in source).
  4. If v48 % v159 != 0, bails (partial overlap that cannot be forwarded).
  5. If sizes differ, creates a GEP + load to extract the relevant portion.
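
The divisibility gate in steps 3-4 can be sketched as:

```python
def forwardable_chunks(src_bits, dst_bits):
    """How many destination-sized pieces the source store covers, or None to bail."""
    if src_bits % dst_bits != 0:
        return None          # partial overlap that cannot be forwarded
    return src_bits // dst_bits

print(forwardable_chunks(128, 32))  # 4: a float4 store covers four scalar float slots
print(forwardable_chunks(96, 64))   # None
```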

Metadata Preservation

After creating a replacement instruction, the pass preserves metadata:

  • Debug location via sub_157E9D0.
  • Use-chain linkage by updating prev/next pointers at offsets +24/+32.
  • Basic block insertion via sub_164B780.
  • TBAA metadata propagation through sub_1623A60 / sub_1623210.
  • nonnull attribute copying via sub_15FA300 / sub_15FA2E0.
  • Use replacement via sub_164B7C0.

Partial Overwrite Tracking

The function-level partial overwrite pass at sub_19DF5F0 (30 KB) maintains a hash table of all stores in a function and tracks which stores partially overwrite each other.

Hash Table Structure

Each hash table entry is 72 bytes:

Offset | Content
+0 | Key (store instruction pointer; -8 = empty, -16 = tombstone)
+8 | Operand list pointer
+16 | Operand count
+24 | Inline storage (when count <= small threshold)
+48 | Additional metadata

The hash function, probing strategy, and growth/compaction thresholds follow the standard DenseMap infrastructure; see Hash Table and Collection Infrastructure. This instance uses NVVM-layer sentinels (-8 / -16) and a minimum table size of 64 entries.
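
A lookup against such a table can be sketched with the NVVM sentinels; this uses linear probing and Python's built-in hash for illustration, whereas the real DenseMap-style code probes quadratically over 72-byte entries:

```python
EMPTY, TOMBSTONE = -8, -16           # key sentinels observed in the binary

def find_slot(keys, key):
    """Return the slot holding `key`, or the slot where it should be inserted."""
    mask = len(keys) - 1             # table size is a power of two (minimum 64)
    i = hash(key) & mask
    first_tombstone = None
    while True:
        k = keys[i]
        if k == key:
            return i
        if k == EMPTY:               # end of probe chain: reuse a tombstone if seen
            return first_tombstone if first_tombstone is not None else i
        if k == TOMBSTONE and first_tombstone is None:
            first_tombstone = i
        i = (i + 1) & mask           # linear probe (sketch only)

keys = [EMPTY] * 64
keys[find_slot(keys, 0x19DA750)] = 0x19DA750   # insert a store-instruction pointer
print(keys[find_slot(keys, 0x19DA750)] == 0x19DA750)
```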

Cross-Store Dependency Records

When a new store aliases an existing entry, DSE records both stores in a 6-element record: {store1, store2, operand1, operand2, ptr1, ptr2}. This enables tracking stores that partially overwrite each other even when the overwritten value has been modified between stores. Reference counting is managed through sub_1649AC0 / sub_1649B30, and per-entry operand lists grow via sub_170B450.

Store-Chain Decomposition

In the LABEL_47 region of the core function, DSE walks store chains through struct/array GEPs and decomposes aggregate stores into element-level dead store checks. sub_19D94E0 handles chain-level elimination, while sub_19D91E0 builds the comparison set for overlap detection.

Address-Space Handling

DSE does not contain explicit CUDA address-space comparisons. Address-space separation is handled entirely by the underlying NVVM alias analysis (unk_4F9D3C0), which knows that different address spaces cannot alias. The alias query function sub_14C2730 receives the full instruction context including address space, so query results already incorporate this constraint.

Store Forwarding to Loads

The function sub_19DBD20 (20 KB) attempts store-to-load forwarding. When sub_19DD7C0 finds a store feeding into a load, it constructs a replacement using sub_12815B0. Sign/zero extension matching uses type byte 15 (float types) and type byte 11 (integer types), with opcodes 45 (float-to-int truncation), 46 (int-to-float), and 47 (generic cast).

Two related passes are registered alongside DSE in the same code region:

  • MergedLoadStoreMotion (sub_19DCD20, pass name mldst-motion): Shares the same alias infrastructure and is registered with the same analysis dependencies.
  • NaryReassociate (sub_19DD420 / sub_19DD530): N-ary reassociation pass factory, registered at sub_19DD1D0 with its own analysis set.

Key Function Map

Function | Address | Size | Role
DSE::runOnFunction | 0x19DA750 | 33 KB | Main dead store elimination
DSE::analyzeOverwrite | 0x19DDCB0 | 28 KB | Complete/partial overwrite detection
DSE::runPartialOverwritePass | 0x19DF5F0 | 30 KB | Function-level partial tracking
DSE::tryForwardStoresToLoad | 0x19DBD20 | 20 KB | Store-to-load forwarding
DSE::buildOverwriteRecord | 0x19D8AF0 | -- | Overlap record construction
DSE::buildComparisonSet | 0x19D91E0 | -- | Set of stores to compare
DSE::eliminateStoreChain | 0x19D94E0 | -- | Chain-level elimination
DSE::scanLoopForDeadStores | 0x19DCB70 | -- | Loop-level DSE
DSE::runOnBasicBlock | 0x19DCC90 | -- | Block-level entry point
DSE::extractStoreOperands | 0x19DD690 | -- | Get base pointer and stored value
DSE::lookupDeadStoreCandidate | 0x19DD7C0 | -- | Hash table lookup
DSE::decomposeGEPStore | 0x19DD950 | -- | GEP-based store decomposition
DSE::collectPartialOperands | 0x19DEFC0 | -- | Partial overwrite operand collection
DSE::checkPartialOverwrite | 0x19DEE70 | -- | Individual partial overwrite check
DSE::tryEliminateStore | 0x19DF200 | -- | Attempt store elimination
DSE::rehashStoreTable | 0x19DF220 | -- | Hash table resize

Differences from Upstream LLVM

  1. Partial store forwarding with type conversion. Standard LLVM DSE bails when store and load sizes differ. NVIDIA's version creates GEP + load sequences to extract relevant portions, handling float4 -> float patterns.
  2. 72-byte hash table entries with cross-store tracking. Upstream uses simpler data structures. NVIDIA tracks which stores partially overwrite each other through 6-element dependency records.
  3. Store-chain decomposition. Aggregate stores are decomposed through struct/array GEPs into element-level checks, enabling elimination of stores that are collectively dead.
  4. Vector type awareness. The type walker includes a dedicated case for CUDA vector types with proper alignment computation.
  5. Total code size. At ~91 KB across three functions, NVIDIA's DSE is roughly 3x the size of upstream LLVM's equivalent.

Constant Folding: Math & Intrinsics

NVIDIA-modified pass. GPU-specific changes (110+ math name variants, 60+ NVVM intrinsic IDs, exception-safe host evaluation) are documented throughout this page.

Upstream source: llvm/lib/Analysis/ConstantFolding.cpp (LLVM 20.0.0). The upstream ConstantFoldCall function handles standard llvm.* intrinsics; NVIDIA's extensions (sub_14D90D0 eligibility checker, sub_14D1BC0 evaluator) are layered on top.

LLVM version note: The upstream ConstantFolding.cpp in LLVM 20 handles approximately 30 standard math intrinsics (llvm.sin, llvm.cos, llvm.sqrt, etc.) and a small set of NVPTX-specific intrinsics (ceil, floor, fabs, sqrt in nvvm.* form). CICC extends this to 110+ math name variants (C, glibc __*_finite, C++ mangled _Z*) and 60+ NVVM intrinsic IDs. The upstream disable-fp-call-folding knob (cl::Hidden, default false) is preserved; NVIDIA adds a separate FPFoldDisable CiccOption for independent control.

CICC v13.0 extends LLVM's ConstantFolding analysis with two large custom functions that together enable compile-time evaluation of over 110 distinct math function name variants and 60+ NVVM intrinsic IDs. Upstream LLVM's ConstantFoldCall handles standard llvm.sin, llvm.cos, llvm.sqrt, and a handful of NVPTX-specific intrinsics (ceil, floor, fabs, sqrt in their nvvm.* forms, plus FP-to-integer conversion intrinsics). CICC goes far beyond this: it recognizes every C math library name (sin, sinf), every glibc __*_finite internal variant, every C++ mangled form (_Z3cosf, _Z4acosd), and the full set of NVVM approximate/FTZ math intrinsics -- then evaluates them using the host C math library with an exception-safe wrapper that refuses to produce results when the host FPU signals domain errors, overflow, or underflow.
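
The refusal behavior of the exception-safe wrapper can be sketched with a guard around the host call; in the binary this is done by clearing and checking the host FPU exception flags rather than language-level exceptions:

```python
import math

def safe_fold_unary(fn, x):
    """Fold only when host evaluation raises no FP error and yields a finite value."""
    try:
        result = fn(x)
    except (ValueError, OverflowError):    # domain error / overflow on the host
        return None                        # refuse to fold; leave the call in the IR
    return result if math.isfinite(result) else None

print(safe_fold_unary(math.log, -1.0))   # None: domain error, the call is not folded
print(safe_fold_unary(math.cos, 0.0))    # 1.0
```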

The system is split into two cooperating functions. The eligibility checker sub_14D90D0 (27 KB, called nvvmIntrinsicConstantFold in the sweep analysis) is a fast predicate that answers "can this call be constant-folded?" without touching operand values. The evaluator sub_14D1BC0 (54 KB, called nvvmConstantFoldLibCall) performs the actual computation when all operands are constant. A third function, the NVVM InstCombine intrinsic folder sub_1169C30 (87 KB), handles algebraic simplification of NVVM intrinsics and is documented separately on the InstCombine page.

Eligibility checker | sub_14D90D0 (0x14D90D0, 27 KB, 282 basic blocks, 489 edges)
Math evaluator | sub_14D1BC0 (0x14D1BC0, 54 KB)
Constant extractor | sub_14D1620 (0x14D1620)
Safe unary eval wrapper | sub_14D19F0 (0x14D19F0)
Safe binary eval wrapper | sub_14D1A80 (0x14D1A80)
ConstantFP builder | sub_14D17B0 (0x14D17B0)
Custom fabs | sub_14D1280 (0x14D1280) -- SSE2 sign-bit mask
Custom floor | sub_14D13B0 (0x14D13B0) -- truncation + sign correction
Custom ceil | sub_14D1410 (0x14D1410) -- truncation + sign correction
Custom sqrt | sub_14D1470 (0x14D1470) -- thin wrapper around libc sqrt
Vector math mapping | sub_149E420 (0x149E420, 26 KB)
LLVM knob | disable-fp-call-folding (upstream, cl::Hidden, default false)
NVIDIA knob | FPFoldDisable (NVIDIA CiccOption, disables FP constant folding)

Two-Tier Architecture: Eligibility vs. Evaluation

The constant folding system operates as a two-phase protocol. The caller (from the ConstantFolding pass or InstCombine visitCallInst path) first invokes the eligibility checker to determine whether a call instruction is a candidate, then invokes the evaluator to produce the folded constant. This split exists for performance: the eligibility check is cheap (no operand extraction, no FP computation), while the evaluator is expensive (extracts APFloat values, calls host math library, checks FP exceptions).

Eligibility Checker: sub_14D90D0

The function takes a tagged IR node pointer and a context (intrinsic descriptor). The node pointer carries a 3-bit tag in its low bits; the function masks with ~7 to recover the aligned base. Before examining intrinsic IDs, it performs three attribute pre-filter checks on the callee:

  1. Speculatable/ReadNone (attribute kind 0x15 = 21): The callee must be safe to speculatively execute. If the direct callee lacks this attribute, the function follows one level of indirection through the resolved function target at [callee + 0x70] and re-checks.

  2. NoUnwind (attribute kind 5): The callee must not throw. Same indirection chain.

  3. Convergent gate (attribute kind 0x34 = 52): If the callee is marked convergent, the function returns 0 immediately. This is the critical safety check for GPU code -- convergent intrinsics like __syncthreads(), __ballot_sync(), and warp shuffle operations have warp-synchronous semantics that would be violated by folding them away, even when all arguments happen to be constant.
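
The pre-filter reduces to a predicate over the callee's attribute set (attribute kind numbers as recovered above; the function name is ours, and the one-level indirection through the resolved callee is omitted):

```python
SPECULATABLE, NOUNWIND, CONVERGENT = 0x15, 0x05, 0x34   # attribute kind enums

def passes_attribute_prefilter(attrs):
    """Eligible only if speculatable and nounwind; never if convergent."""
    if CONVERGENT in attrs:        # warp-synchronous semantics: never fold
        return False
    return SPECULATABLE in attrs and NOUNWIND in attrs

print(passes_attribute_prefilter({0x15, 0x05}))        # True
print(passes_attribute_prefilter({0x15, 0x05, 0x34}))  # False: convergent gate
```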

After attribute filtering, the function reads the intrinsic ID from [context + 0x24] (offset +36, unsigned 32-bit enum) and dispatches through a two-level scheme.

Evaluation: sub_14D1BC0

The evaluator receives the function name string, its length, an opcode/intrinsic-ID enum, a return type descriptor, an array of constant operand IR nodes, the operand count (1, 2, or 3), a flag enabling name-based matching, and a context pointer. It returns a ConstantFP or ConstantInt IR node on success, or null on failure.

The top-level dispatch is on operand count:

  • Unary (count = 1): Trigonometric, exponential, logarithmic, rounding, and absolute value functions.
  • Binary (count = 2): pow, fmod, atan2, copysign, fmin, fmax.
  • Ternary (count = 3): FMA / fused multiply-add (opcodes 99 and 100 only).

Foldable Intrinsics Master Table

Standard LLVM Intrinsic IDs (0--211)

These are dispatched via a jump table at jpt_14D91F0 in the eligibility checker. The evaluator handles them via cascading opcode comparisons.

ID | Hex | Intrinsic | Category
5 | 0x05 | llvm.bswap | Bitwise
6 | 0x06 | llvm.ceil | Rounding
8 | 0x08 | llvm.copysign | Sign
11 | 0x0B | llvm.cos | Trig
12 | 0x0C | llvm.ctlz | Bitwise
13 | 0x0D | llvm.ctpop | Bitwise
30 | 0x1E | llvm.exp | Exponential
31 | 0x1F | llvm.exp2 | Exponential
32 | 0x20 | llvm.fabs | Absolute
33 | 0x21 | llvm.floor | Rounding
54 | 0x36 | llvm.fma | Ternary
55 | 0x37 | llvm.fmuladd | Ternary
96 | 0x60 | llvm.log | Logarithmic
97 | 0x61 | llvm.log10 | Logarithmic
99 | 0x63 | llvm.log2 | Logarithmic
100 | 0x64 | llvm.lround | Rounding
115 | 0x73 | llvm.maxnum | MinMax
122 | 0x7A | llvm.minnum | MinMax
123 | 0x7B | llvm.nearbyint | Rounding
124 | 0x7C | llvm.pow | Power
129 | 0x81 | llvm.powi | Power
132 | 0x84 | llvm.rint | Rounding
139 | 0x8B | llvm.round | Rounding
140 | 0x8C | llvm.roundeven | Rounding
146 | 0x92 | llvm.sin | Trig
147 | 0x93 | llvm.tan | Trig
187 | 0xBB | llvm.sqrt | Root
188 | 0xBC | llvm.trunc | Rounding
189--211 | 0xBD--0xD3 | Integer ops (umax, sadd.with.overflow, etc.) | Integer

NVVM-Specific Intrinsic IDs (>211)

These are dispatched via cascading range checks with bitmask tests in the eligibility checker.

ID Range | Hex | Intrinsic | Category
3637--3639 | 0xE35--0xE37 | nvvm.bitcast.* / nvvm.move.* | Bitwise
3660 | 0xE4C | nvvm.ptr.gen.to.* | Pointer
3764--3765 | 0xEB4--0xEB5 | nvvm.ceil.f / nvvm.ceil.d | Rounding
3778--3779 | 0xEC2--0xEC3 | nvvm.ctlz.i / nvvm.ctlz.ll | Bitwise
3787 | 0xECB | nvvm.cos.approx.ftz.f | Trig
3811 | 0xEE3 | nvvm.div.* / nvvm.fabs variant | Arith
3870--3871 | 0xF1E--0xF1F | nvvm.exp2.approx.ftz.f / .d | Exponential
3911--3912 | 0xF47--0xF48 | nvvm.fabs.f / .d | Absolute
3924--3925 | 0xF54--0xF55 | nvvm.floor.f / .d | Rounding
3944 | 0xF68 | nvvm.log.approx.ftz.f | Logarithmic
3946 | 0xF6A | nvvm.log2.approx.ftz.f | Logarithmic
3948 | 0xF6C | nvvm.log10.approx.ftz.f | Logarithmic
3950 | 0xF6E | nvvm.rcp.approx.ftz.d | Reciprocal
3952 | 0xF70 | nvvm.rsqrt.approx.ftz.f | Root
3954 | 0xF72 | nvvm.sqrt.f / .approx.ftz.f | Root
4072--4074 | 0xFE8--0xFEA | nvvm.sin/cos.approx.ftz variants | Trig
4114--4115 | 0x1012--0x1013 | nvvm.max.i / .ui | MinMax
4118--4119 | 0x1016--0x1017 | nvvm.min.i / .ui | MinMax
4167--4168 | 0x1047--0x1048 | nvvm.max.ll / .ull | MinMax
4170--4172 | 0x104A--0x104C | nvvm.min.ll / .ull | MinMax
4230--4231 | 0x1086--0x1087 | nvvm.mul.hi.* | Multiply
4413 | 0x113D | nvvm.sin.approx.ftz.f | Trig
4475, 4478 | 0x117B, 0x117E | nvvm.sqrt.f / .rn.d | Root
4483--4484 | 0x1183--0x1184 | nvvm.sqrt.approx.f / .ftz.f | Root
5293 | 0x14AD | nvvm.f2i / nvvm.d2i | Conversion
5300 | 0x14B4 | nvvm.i2f / nvvm.i2d | Conversion
7297--7298 | 0x1C81--0x1C82 | nvvm.fmax.f / .d | MinMax
7301--7302 | 0x1C85--0x1C86 | nvvm.fmin.f / .d | MinMax
7334--7335 | 0x1CA6--0x1CA7 | nvvm.fmax.ftz.f / .ftz.nan.f | MinMax
7339--7340 | 0x1CAB--0x1CAC | nvvm.fmin.ftz.f / .ftz.nan.f | MinMax

Name-Based Foldable Functions (Case 0 Fallthrough)

When the intrinsic ID is 0 (unrecognized LLVM intrinsic), both the eligibility checker and the evaluator fall through to string-based matching. The evaluator uses a two-tier name matching system: fast-path intrinsic ID dispatch, then slow-path name comparison when the a7 flag is set.

Plain C library names (44 entries):

| Category | Functions |
|---|---|
| Trigonometric | sin, sinf, cos, cosf, tan, tanf |
| Inverse trig | acos, acosf, asin, asinf, atan, atanf, atan2, atan2f |
| Hyperbolic | sinh, sinhf, cosh, coshf, tanh, tanhf |
| Exponential | exp, expf, exp2, exp2f |
| Logarithmic | log, logf, log10, log10f |
| Rounding | ceil, ceilf, floor, floorf, round, roundf |
| Absolute / Root | fabs, fabsf, sqrt, sqrtf |
| Binary | pow, powf, fmod, fmodf, atan2, atan2f |

Glibc __*_finite variants (20 entries):

__acos_finite, __acosf_finite, __asin_finite, __asinf_finite, __atan2_finite, __atan2f_finite, __cosh_finite, __coshf_finite, __exp_finite, __expf_finite, __exp2_finite, __exp2f_finite, __log_finite, __logf_finite, __log10_finite, __log10f_finite, __pow_finite, __powf_finite, __sinh_finite, __sinhf_finite

C++ mangled names (~48 entries): _Z3cosf, _Z3cosd, _Z3sinf, _Z3sind, _Z3tanf, _Z3tand, _Z3expf, _Z3expd, _Z3logf, _Z3logd, _Z4acosf, _Z4acosd, _Z4asinf, _Z4asind, _Z4atanf, _Z4atand, _Z4ceilf, _Z4ceild, _Z4coshf, _Z4coshd, _Z4exp2f, _Z4exp2d, _Z4fabsf, _Z4fabsd, _Z4sinhf, _Z4sinhd, _Z4sqrtf, _Z4sqrtd, _Z4tanhf, _Z4tanhd, _Z4fmodff, _Z4fmoddd, _Z5floorf, _Z5floord, _Z5log10f, _Z5log10d, _Z5atan2ff, _Z5atan2dd, _Z5powff, _Z5powdd, _Z5roundf, _Z5roundd

Total across all three name forms: approximately 112 distinct recognized strings.

Name Matching Algorithm

The evaluator's name matching is a hand-tuned trie-like dispatch optimized for the specific set of math function names. It avoids hash tables or sorted arrays in favor of cascading character comparisons:

nameMatch(name, length):
    // Strip C++ mangling prefix
    if name[0] == '_' and name[1] == 'Z':
        dispatch on name[2]:  // length digit
            '3' -> match 3-char base: cos, sin, tan, exp, log
            '4' -> match 4-char base: acos, asin, atan, ceil, cosh, exp2, fabs, sinh, sqrt, tanh, fmod
            '5' -> match 5-char base: floor, log10, atan2, pow, round
        verify trailing type suffix: 'f' = float, 'd' = double
        return FOUND

    // Strip glibc __finite prefix
    if name[0] == '_' and name[1] == '_':
        dispatch on name[2]:
            'a' -> __acos_finite, __acosf_finite, __asin_finite, __asinf_finite,
                   __atan2_finite, __atan2f_finite
            'c' -> __cosh_finite, __coshf_finite
            'e' -> __exp_finite, __expf_finite, __exp2_finite, __exp2f_finite
            'l' -> __log_finite, __logf_finite, __log10_finite, __log10f_finite
            'p' -> __pow_finite, __powf_finite
            's' -> __sinh_finite, __sinhf_finite
        verify with memcmp against string constant
        return FOUND

    // Plain C library name
    dispatch on name[0]:
        'a' -> acos, asin, atan + 'f' variants
        'c' -> cos, cosf, ceil, ceilf, cosh, coshf
        'e' -> exp, expf, exp2, exp2f
        'f' -> fabs, fabsf, floor, floorf
        'l' -> log, logf, log10, log10f
        'p' -> pow, powf
        'r' -> round, roundf
        's' -> sin, sinf, sinh, sinhf, sqrt, sqrtf
        't' -> tan, tanf, tanh, tanhf

    // Within each group, dispatch on name length:
    length 3: direct 3-byte compare ("sin", "cos", "tan", "exp", "log", "pow")
    length 4: DWORD compare (4-byte integer, little-endian):
        0x736F6361 = "acos"    0x6E697361 = "asin"
        0x6E617461 = "atan"    0x6C696563 = "ceil"
        0x68736F63 = "cosh"    0x73626166 = "fabs"
        0x66736F63 = "cosf"    0x686E6973 = "sinh"
        0x74727173 = "sqrt"    0x686E6174 = "tanh"
        0x32707865 = "exp2"    0x66707865 = "expf"
        ...
    length 5+: memcmp against literal string constant
    return FOUND or NOT_FOUND

The 4-byte integer comparison trick deserves attention: instead of calling memcmp for 4-character names, the code loads the name as a uint32_t and compares against a pre-computed little-endian constant. For example, *(uint32_t*)name == 0x736F6361 checks for "acos" ('a'=0x61, 'c'=0x63, 'o'=0x6F, 's'=0x73). This micro-optimization eliminates function call overhead for the most common name lengths.
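A minimal re-creation of the trick, assuming a little-endian host as on x86-64 (the function name is ours, not cicc's):

```c
#include <stdint.h>
#include <string.h>

/* DWORD-compare check for the 4-character name "acos". memcpy performs an
   unaligned-safe load; the compiler lowers it to a single mov on x86-64. */
static int is_acos4(const char *name) {
    uint32_t w;
    memcpy(&w, name, 4);
    return w == 0x736F6361u;   /* 'a'=0x61, 'c'=0x63, 'o'=0x6F, 's'=0x73 */
}
```

The same pattern repeats once per 4-character name, each with its own pre-computed constant.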

Exception-Safe Host Evaluation

The core safety mechanism is the FP exception wrapper used for all transcendental evaluation. Both the unary wrapper (sub_14D19F0) and binary wrapper (sub_14D1A80) follow the same protocol:

Value* safeMathEval(double (*mathFunc)(double), Type* resultType, double arg) {
    feclearexcept(FE_ALL_EXCEPT);        // clear all FP exception flags
    *__errno_location() = 0;             // clear errno

    double result = mathFunc(arg);       // call host C library

    // Check errno for domain/range error
    int e = *__errno_location();
    if (e == EDOM || e == ERANGE) {      // errno 33 or 34
        feclearexcept(FE_ALL_EXCEPT);
        *__errno_location() = 0;
        return nullptr;                  // refuse to fold
    }

    // Check FP exception flags (mask = 0x1D = 29)
    // FE_INVALID(1) | FE_DIVBYZERO(4) | FE_OVERFLOW(8) | FE_UNDERFLOW(16)
    if (fetestexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW)) {
        feclearexcept(FE_ALL_EXCEPT);
        *__errno_location() = 0;
        return nullptr;                  // refuse to fold
    }

    // FE_INEXACT (32) is intentionally NOT checked --
    // most transcendentals produce inexact results and that is acceptable.

    return createConstantFP(resultType, result);
}

This design means the folder refuses to produce a result whenever the host FPU signals any exceptional condition other than inexact. The implications:

  • exp(710.0) overflows on the host (FE_OVERFLOW, ERANGE) -- not folded, left in IR for runtime evaluation.
  • log(-1.0) produces a domain error -- not folded.
  • sqrt(-0.01) triggers FE_INVALID -- not folded.
  • sin(0.5) produces an inexact result (since sin(0.5) is irrational) -- folded normally.

Domain Pre-Checks

In addition to the post-evaluation exception check, certain functions have explicit domain guards before calling the host math library:

| Function | Precondition | Rationale |
|---|---|---|
| log, logf, log10, log10f | argument > 0.0 | Negative inputs produce NaN |
| sqrt, sqrtf | argument >= 0.0 | Negative inputs produce NaN |
| acos, asin | no pre-check | Relies on FP exception mechanism |

The asymmetry is deliberate: log/sqrt get explicit checks because their domain violations are common and cheap to detect, while acos/asin rely on the post-evaluation FE_INVALID check.

Host FPU vs. GPU Precision

The constant folder evaluates using the host CPU's math library (j_sin, j_cos, j_exp, etc. -- PLT stubs to glibc). This creates a potential precision mismatch: the folded constant may not be bit-identical to what the GPU hardware would compute. NVIDIA mitigates this through several mechanisms:

  1. Custom implementations for exact functions. fabs, floor, ceil, and round have custom host-side implementations that match GPU rounding semantics exactly:

    • fabs (sub_14D1280): Pure SSE2 bitwise AND with 0x7FFFFFFFFFFFFFFF (clear sign bit). Bit-exact regardless of platform.
    • floor (sub_14D13B0): Custom truncation: for |x| < 2^52, truncate to integer, subtract 1.0 if truncation rounded toward zero for negative values, preserve sign bit. For |x| >= 2^52, return unchanged (already integral).
    • ceil (sub_14D1410): Mirror of floor: truncate to integer, add 1.0 if truncation rounded toward zero for positive values.
    • round (j__round): Uses libc round() directly -- round-half-away-from-zero, matching the device-side round() function (note that PTX cvt.rni rounds ties to even, so this is not the .rni mode).
  2. Exception rejection for transcendentals. For sin, cos, exp, log and other transcendentals, CICC accepts the host result because host and device library implementations agree to within roughly 1 ULP (IEEE-754 does not require transcendentals to be correctly rounded). The exception wrapper catches cases where host and device behavior might diverge (denormals, overflow boundary).

  3. exp2(x) folded as pow(2.0, x). Rather than calling exp2() directly (which might differ between host and device implementations), the evaluator computes pow(2.0, x) through the binary wrapper, ensuring consistent behavior.

  4. No half-precision transcendental folding. The type check at the evaluator's entry rejects type byte 1 (half) for all trig/exp/log functions. Only basic operations (convert, compare) work on fp16. This is safe because half-precision math functions are implemented as promote-to-float, compute, demote-to-half -- by the time the constant folder runs, the promotion has already been inlined.
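The custom floor described in item 1 can be reconstructed as follows; this is our reading of the described truncate-and-correct algorithm, not a decompiler dump of sub_14D13B0:

```c
/* Bit-exact floor via truncation: for |x| < 2^52 truncate toward zero and
   step down for negative non-integers; larger magnitudes (and NaN/inf) are
   already integral and pass through unchanged. */
static double floor_trunc(double x) {
    double ax = x < 0.0 ? -x : x;
    if (!(ax < 9007199254740992.0))   /* 2^52; also catches inf and NaN */
        return x;
    if (x == 0.0)
        return x;                     /* preserve the sign of zero */
    double t = (double)(long long)x;  /* truncation toward zero */
    return (t > x) ? t - 1.0 : t;     /* correct negative fractional values */
}
```

The truncation is exact because every double below 2^52 in magnitude converts to `long long` without overflow, so no host rounding mode can perturb the result.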

FTZ and Approximate Intrinsics

NVVM intrinsics like nvvm.exp2.approx.ftz.f and nvvm.sin.approx.ftz.f carry .approx (reduced precision) and .ftz (flush-to-zero for denormals) modifiers. These are present in the foldable ID list, which may seem surprising -- folding an "approximate" intrinsic with exact host math could produce a different value than the hardware.

The rationale: constant folding evaluates the mathematical function, not the hardware instruction. If the input is a normal float and the result is a normal float, the folded value is correct regardless of FTZ or approximation quality. The FTZ modifier only affects denormal inputs (which the exception wrapper would catch via FE_UNDERFLOW), and the .approx modifier only matters for runtime execution speed. For compile-time constants, exact evaluation is strictly better.

Comparison with Upstream LLVM

Upstream LLVM's ConstantFolding.cpp (as of LLVM 19.x) handles NVPTX intrinsics in canConstantFoldCallTo and ConstantFoldCall. The overlap and gaps:

| Capability | Upstream LLVM | CICC v13.0 |
|---|---|---|
| llvm.sin, llvm.cos, llvm.exp, llvm.log, etc. | Yes | Yes |
| nvvm.ceil.f, nvvm.floor.f, nvvm.fabs, nvvm.sqrt.* | Yes | Yes |
| nvvm.fmax.*, nvvm.fmin.* (all variants) | Yes (including .xorsign_abs) | Yes (subset: .f, .d, .ftz, .ftz.nan) |
| nvvm.f2i_*, nvvm.d2i_* (FP-to-int with rounding modes) | Yes (all 32 variants) | Partial (IDs 5293, 5300 only) |
| Plain C math names (sin, cosf, exp2f, etc.) | Via TargetLibraryInfo | Direct name matching (44 entries) |
| Glibc __*_finite variants | No | Yes (20 entries) |
| C++ mangled _Z3cosf, _Z4acosd, etc. | No | Yes (~48 entries) |
| nvvm.cos.approx.ftz.f, nvvm.exp2.approx.ftz.f, etc. | No | Yes |
| nvvm.rcp.approx.ftz.d, nvvm.rsqrt.approx.ftz.f | No | Yes |
| nvvm.mul.hi.* | No | Yes |
| Convergent intrinsic rejection | Implicit (no fold path) | Explicit attribute check |
| FMA constant fold | Yes (via APFloat) | Yes (opcodes 99/100, APFloat fma) |
| Integer min/max/ctlz/cttz | Partial | Yes (full NVVM ID coverage) |

The critical CICC-only capabilities are the __*_finite variants (needed when code is compiled with -ffinite-math-only), the C++ mangled names (emitted by device-side C++ math overloads), and the .approx.ftz intrinsic family.

Integer Constant Folding

The evaluator also handles integer-domain operations when operands have type tag 13 (ConstantInt) or when FP operands encode integer comparisons:

Binary integer ops (operand count = 2, both ConstantInt):

  • Opcodes 189, 195, 198, 209, 210, 211: APInt binary operations (add, sub, mul, sdiv, udiv, srem) via sub_16A7290 and related APInt helpers.
  • Opcodes 0xEC2/0xEC3 (3778/3779): ctlz (count leading zeros).
  • Opcodes 0x1014/0x1015, 0x1016/0x1017: Signed/unsigned min/max via APInt comparison.
  • Opcodes 0x104B/0x104C, 0x1087/0x1088: Additional signed/unsigned min/max encodings.
  • Opcode 3811: Division where divisor is known zero -- returns UndefValue.

Integer comparison fold (type tag 14 with integer-domain opcodes):

  • Opcode 0xBB (187), 0x8C (140): icmp eq/ne -- predicate 0.
  • Opcode 0x61 (97): icmp slt -- predicate 2.
  • Opcode 0xBC (188): icmp sgt -- predicate 4.
  • Opcode 0xCE (206): icmp uge -- predicate 3.
  • Opcode 0x08 (8): icmp ult -- predicate 1.

These produce ConstantInt 0 or 1 via sub_169EBA0/sub_169D440.

Libdevice Integration

NVIDIA's libdevice (libdevice.10.bc) provides optimized LLVM bitcode implementations of math functions. After linking libdevice, calls like __nv_sinf are typically inlined and disappear before constant folding runs. However, if inlining fails or is disabled, residual __nv_* calls may survive.

The constant folder does not recognize __nv_* prefixed names directly. The __ name-matching path only handles glibc __*_finite patterns, not NVIDIA's __nv_* convention. Un-inlined libdevice residuals are handled upstream by the NVVM InstCombine intrinsic canonicalizer (sub_1169C30), which recognizes __nv_* prefixes and may convert them to standard LLVM intrinsics that the constant folder can then process.

The __nvvm_reflect mechanism (used for __CUDA_ARCH queries) is resolved by a separate earlier pass (NVVMReflect) that replaces __nvvm_reflect("__CUDA_ARCH") with a constant integer based on the target SM. By the time the constant folder runs, all __nvvm_reflect calls have been eliminated.

Configuration Knobs

| Knob | Type | Default | Effect |
|---|---|---|---|
| disable-fp-call-folding | cl::opt&lt;bool&gt; | false | Upstream LLVM hidden flag. When true, prevents constant folding of any function returning or accepting floating-point types. Checked in canConstantFoldCallTo. |
| FPFoldDisable | NVIDIA CiccOption | false | NVIDIA-specific flag that disables FP constant folding at the NVVM level. |
| instcombine-negator-enabled | cl::opt&lt;bool&gt; | true | Controls the negation propagation system in sub_1169C30 (InstCombine intrinsic folder). |
| instcombine-negator-max-depth | cl::opt&lt;int&gt; | platform-dependent | Depth limit for the negator chain in InstCombine intrinsic folding. Prevents exponential blowup when pushing negation through deep arithmetic chains. |

The FPFoldDisable knob is significant for debugging precision issues: when a kernel produces different results with -O0 vs -O2, disabling FP folding isolates whether constant-folded values are the source of the discrepancy.

ConstantFP Result Creation

The result builder sub_14D17B0 creates the final LLVM ConstantFP IR node from the evaluated double result. It dispatches on the return type byte at *(type + 8):

| Type byte | Precision | Behavior |
|---|---|---|
| 1 | half | Not reached from math folder (filtered at entry). Infrastructure exists: converts through APFloat semantics. |
| 2 | float | Truncates double to float via C cast, then converts float to APFloat via sub_169D3B0. |
| 3 | double | Stores full double precision via sub_169D3F0 (double to APFloat). |

Both paths finish with sub_159CCF0(*type, &storage) which constructs the ConstantFP node from the APFloat storage. The float path's truncation via C cast means the folded float value matches what (float)host_result produces -- this is IEEE-754 correct because the cast performs round-to-nearest-even.
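A small demonstration of the cast's ties-to-even behavior, using doubles that sit exactly halfway between adjacent floats (the helper name is ours):

```c
/* The float path of the result builder reduces to a plain C cast, which
   performs IEEE-754 round-to-nearest-even on the double->float conversion. */
static float fold_to_float(double host_result) {
    return (float)host_result;
}
```

`1.0 + 2^-24` is midway between 1.0f (even mantissa) and the next float up (odd mantissa), so it rounds down; `1.0 + 3·2^-24` is midway between two floats whose upper neighbor has the even mantissa, so it rounds up.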

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| nvvmIntrinsicConstantFold | 0x14D90D0 | 27 KB | Eligibility predicate: can this intrinsic be constant-folded? |
| nvvmConstantFoldLibCall | 0x14D1BC0 | 54 KB | Math evaluator: compute constant result from constant args |
| extractDoubleFromConstantFP | 0x14D1620 | -- | Extract double from ConstantFP IR node |
| safeMathEvalUnary | 0x14D19F0 | -- | Exception-safe unary evaluation wrapper |
| safeMathEvalBinary | 0x14D1A80 | -- | Exception-safe binary evaluation wrapper |
| createConstantFPResult | 0x14D17B0 | -- | Build ConstantFP from evaluated double |
| customFabs | 0x14D1280 | -- | SSE2 sign-bit clear |
| customFloor | 0x14D13B0 | -- | Truncation + sign correction |
| customCeil | 0x14D1410 | -- | Truncation + sign correction |
| customSqrt | 0x14D1470 | -- | Thin wrapper around libc sqrt |
| fptoui_fptosi_fold | 0x14D1500 | -- | FP-to-integer conversion fold |
| apintMoveTransfer | 0x14D15E0 | -- | APInt move/transfer helper |
| vectorMathLibMapping | 0x149E420 | 26 KB | Scalar-to-vectorized math mapping table |
| platformFuncCanonicalize | 0x149FA60 | 15 KB | Platform-specific name canonicalization |
| constantExprFoldSCEV | 0x14D44C0 | 20 KB | ConstantExpr fold / SCEV integration |
| constantFoldAggregate | 0x14D5510 | 16 KB | ConstantFold for aggregate types |
| constantFoldGEPExtract | 0x14D66F0 | 17 KB | ConstantFold for GEP and extract |
| constantExprSCEVBuild | 0x14DBA90 | 22 KB | ConstantExpr + SCEV builder |
| AttributeList::hasAttribute | 0x1560260 | -- | Attribute query (used 8 times in eligibility checker) |
| Value::getName | 0x1649960 | -- | Name string extraction (case 0 path) |
| NVVM InstCombine intrinsic fold | 0x1169C30 | 87 KB | Algebraic simplification of NVVM intrinsics (see InstCombine) |

Cross-References

  • InstCombine -- The NVVM intrinsic canonicalizer (sub_1169C30) handles algebraic simplification, negation propagation, and operand folding for NVVM intrinsics. It calls constant folding as a sub-step.
  • Pipeline & Ordering -- Where constant folding sits in the optimization pipeline (runs within InstCombine and as a standalone analysis).
  • Builtin Table: Math Functions -- The complete list of CUDA math builtins and their mapping to NVVM intrinsics.
  • CLI Flags -- FPFoldDisable and other optimization control flags.
  • LLVM Knobs -- The full disable-fp-call-folding flag and related InstCombine depth limits.

KnownBits & DemandedBits for GPU

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

NVIDIA's KnownBits and DemandedBits infrastructure in cicc v13.0 diverges from upstream LLVM in three structural ways. First, the two analyses are fused into a single 127 KB function (sub_11A7600) that simultaneously computes known-zero/known-one bitmasks and simplifies instructions whose demanded bits allow constant folding or narrowing -- upstream LLVM separates computeKnownBits (in ValueTracking) from SimplifyDemandedBits (in InstCombine). Second, a dedicated GPU-specific known-bits oracle (sub_F0C4B0) provides range constraints for NVIDIA special registers (%tid, %ntid, %ctaid, %nctaid, %warpsize, %laneid) that have no CPU equivalent. Third, an early NVVM pipeline pass (nvvm-intr-range at sub_216F4B0) attaches !range metadata to every special-register read intrinsic, giving downstream analyses the same bounded-range information that CPU targets only get from profile data or programmer assertions. Together these form the primary dataflow backbone for address calculation optimization, type narrowing, and dead-bit elimination in GPU kernels.

| Component | Location |
|---|---|
| Merged computeKnownBits + SimplifyDemandedBits | sub_11A7600 (0x11A7600, 127 KB, 4,156 lines) |
| Secondary SimplifyDemandedBits helper | sub_11A1430 (0x11A1430, 6.3 KB, 6 opcodes) |
| Per-operand demand propagation trampoline | sub_11AE940 (0x11AE940) |
| Generic computeKnownBits (reference) | sub_9AC0E0 (fallback for unhandled opcodes) |
| Debug-only reference computeKnownBits | sub_9AC330 (cross-validation oracle) |
| computeKnownBitsFromOperator | sub_11A3F30 (0x11A3F30, 50 KB) |
| computeKnownBitsFromAssume | sub_11A6910 (0x11A6910, 12.5 KB) |
| computeKnownBitsFromRangeMetadata | sub_11A68C0 |
| Post-analysis NVIDIA fixup | sub_99B5E0 (alignment + range refinement) |
| NVIDIA intrinsic known-bits oracle | sub_F0C4B0 (special register ranges) |
| Intrinsic return range analysis | sub_10CA790 + sub_11A1390 |
| NVVMIntrRange pass | sub_216F4B0 (nvvm-intr-range) |
| SelectionDAG computeKnownBits | sub_33D4EF0 (0x33D4EF0, 114 KB, 3,286 lines) |
| Pointer alignment known-bits | sub_BD5420 (getPointerAlignmentBits) |
| Debug cross-validation flag | qword_4F90C28 (enables abort-on-mismatch) |
| Max recursion depth | 6 (checked in sub_11AE940) |

GPU-Specific Known-Bits Sources

The key difference from CPU targets: GPU code has dozens of values with statically knowable ranges that never exist on a CPU. Every CUDA thread reads its identity from special registers whose values are bounded by hardware launch parameters. NVIDIA exploits this in two places: the nvvm-intr-range pass adds !range metadata at the IR level, and the target-specific known-bits oracle sub_F0C4B0 provides bitmask information directly to computeKnownBits.

Special Register Range Table

The following ranges apply to every NVVM intrinsic that reads a PTX special register. The !range metadata attached by nvvm-intr-range (sub_216F4B0) encodes [lo, hi) as an LLVM MDNode. The known-bits column shows which bits are guaranteed zero given the maximum value.

| Register | PTX | NVVM Intrinsic ID Range | Value Range | i32 Known Zero (upper bits) |
|---|---|---|---|---|
| %tid.x/y/z | %tid.x | 350--352 | [0, maxntid-1] | bits [ceil(log2(maxntid)), 31] |
| %ntid.x/y/z | %ntid.x | 353--355 | [1, 1024] | bits [11, 31] (at most 1024) |
| %ctaid.x/y/z | %ctaid.x | 356--358 | [0, gridDim-1] | bits [ceil(log2(gridDim)), 31] |
| %nctaid.x/y/z | %nctaid.x | 359--361 | [1, 2^31-1] | bit 31 (always non-negative) |
| %warpsize | %WARP_SZ | ~370 | {32} (constant) | bits [0,4] = 00000, bit 5 = 1, bits [6,31] = 0 |
| %laneid | %laneid | ~371 | [0, 31] | bits [5, 31] |
| %warpid | %warpid | ~372 | [0, maxWarpsPerSM-1] | SM-dependent upper bits |
| %smid | %smid | ~375 | [0, numSMs-1] | architecture-dependent |
| %nsmid | %nsmid | ~376 | [1, numSMs] | architecture-dependent |
| %gridid | %gridid | ~378 | [0, 2^32-1] | none (full range) |
| %clock | %clock | ~380 | [0, 2^32-1] | none |
| %lanemask_eq/lt/le/gt/ge | %lanemask_* | ~382--386 | [0, 2^32-1] | none |

When __launch_bounds__(maxThreadsPerBlock, minBlocksPerMP) is present on a kernel, nvvm-intr-range tightens the %tid ranges to [0, maxThreadsPerBlock-1] and %ntid to [1, maxThreadsPerBlock]. Similarly, nvvm.reqntid metadata (from __launch_bounds__ with exact dimensions or reqntid pragmas) can constrain each dimension independently to an exact value.

The knob nvvm-intr-range-sm (constructor ctor_359) selects the SM variant used to determine architectural limits for registers like %warpid, %smid, and %nsmid.

Address Space Known Bits

CUDA uses separate address spaces with distinct pointer bit-widths and alignment properties. These feed directly into sub_BD5420 (getPointerAlignmentBits), which ORs known-zero low bits into the KnownBits result for any pointer-typed value:

| Address Space | PTX | Pointer Width | Known Alignment | Known Bits Effect |
|---|---|---|---|---|
| 0 (generic) | default | 64 bits | none guaranteed | pointer alignment only |
| 1 (global) | .global | 64 bits | >= 16 bytes (typical) | low 4 bits often known-zero |
| 3 (shared) | .shared | 32 bits | >= 4 bytes (minimum) | low 2 bits known-zero, bits [32,63] irrelevant |
| 4 (constant) | .const | 64 bits | >= 4 bytes | low 2 bits known-zero |
| 5 (local) | .local | 32 bits | >= 4 bytes (stack) | low 2 bits known-zero, bits [32,63] irrelevant |

The 32-bit address spaces (shared and local) are critical: any value known to be a shared-memory pointer has bits [32, 63] entirely dead. The DemandedBits analysis exploits this to eliminate zero-extensions and truncations around shared-memory address calculations, keeping everything in 32-bit arithmetic.
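A minimal sketch of why the narrowing is legal: 32-bit arithmetic agrees with the i64 computation modulo 2^32, which is all an AS3 consumer ever demands (the function names are ours):

```c
#include <stdint.h>

/* Generic-address-space style: the index math runs in i64. */
static uint32_t shared_index_wide(uint64_t tid, uint64_t stride) {
    uint64_t off = tid * stride + 16;   /* 64-bit form */
    return (uint32_t)off;               /* AS3 consumer keeps bits [0,31] only */
}

/* Narrowed form after DemandedBits proves bits [32,63] are dead. */
static uint32_t shared_index_narrow(uint32_t tid, uint32_t stride) {
    return tid * stride + 16;           /* equivalent 32-bit arithmetic */
}
```

Because unsigned wraparound is congruence modulo 2^32, the two forms agree for every input, even when the 64-bit product exceeds 32 bits.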

Launch Parameter Integration

The __launch_bounds__ attribute, __maxnreg__ pragma, and nvvm.reqntid / nvvm.maxntid metadata all flow into the known-bits infrastructure:

  1. nvvm-intr-range pass (sub_216F4B0): Runs early in the pipeline. Reads kernel metadata (nvvm.reqntid, nvvm.maxntid) via sub_93AE30. Attaches !range metadata to every llvm.nvvm.read.ptx.sreg.* intrinsic call. The metadata format is !{i32 lo, i32 hi} where hi is exclusive.

  2. computeKnownBitsFromRangeMetadata (sub_11A68C0): Called during standard computeKnownBits traversal. Reads !range metadata from any value and derives known-zero/known-one masks. For a range [0, 1024), this yields knownZero = 0xFFFFFC00 (bits 10--31 known zero).

  3. Intrinsic return range analysis (sub_10CA790 + sub_11A1390): A separate path used when the merged computeKnownBits+SimplifyDemandedBits processes ZExt/SExt of intrinsic calls. Computes [lo, hi] bounds for the intrinsic's return value and checks whether the extension can be eliminated because the return range fits within the demanded bits.
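The mask derivation in step 2 can be sketched for a non-negative [0, hi) range, consistent with the 0xFFFFFC00 example above (the function name is ours, not cicc's):

```c
#include <stdint.h>

/* Known-zero mask for an i32 value carrying !range [0, hi_exclusive):
   every bit above the highest bit of (hi_exclusive - 1) is provably zero. */
static uint32_t known_zero_from_range(uint32_t hi_exclusive) {
    uint32_t maxval = hi_exclusive - 1;   /* largest value the range allows */
    uint32_t mask = 0xFFFFFFFFu;
    while (maxval) {                      /* clear one mask bit per value bit */
        mask <<= 1;
        maxval >>= 1;
    }
    return mask;
}
```

For a %ntid-style range [0, 1024) this yields 0xFFFFFC00 (bits 10--31 known zero); for %laneid's [0, 32) it yields 0xFFFFFFE0.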

The Merged Analysis: Algorithm and Pseudocode

Unlike upstream LLVM where InstCombiner::SimplifyDemandedBits calls computeKnownBits as a subroutine, cicc fuses them. The entry point sub_11AE870 wraps sub_11AE3E0, which calls the core sub_11A7600. A hash table at InstCombiner + 2064 tracks visited instructions to prevent infinite recursion.

Core Algorithm

// sub_11A7600 — merged computeKnownBits + SimplifyDemandedBits
// Returns: replacement instruction pointer, or NULL if no simplification
Instruction* computeKnownBitsAndSimplify(
    AnalysisCtx    *ctx,        // a1 — holds IR module, pass info
    IRNode         *inst,       // a2 — instruction to analyze
    APInt          *demanded,   // a3 — which output bits the consumer needs
    KnownBits      *result,     // a4 — output {knownZero, knownOne}
    unsigned        depth,      // a5 — recursion depth (checked in caller)
    QueryState     *state       // a6 — worklist context
) {
    uint8_t opcode = inst->opcode_tag;   // single-byte opcode at offset 0
    unsigned width = demanded->getBitWidth();

    // Stack-allocate 4 APInt accumulators for operand known bits
    APInt kz0(width, 0), ko0(width, 0);  // operand 0
    APInt kz1(width, 0), ko1(width, 0);  // operand 1

    switch (opcode) {
    case '*': { // Mul — lines 654-1037
        // Pattern: if one operand is known power-of-2 from intrinsic call,
        //          replace Mul with Shl (critical for threadIdx * stride)
        if (auto *rhs = matchConstantPow2Call(inst->getOperand(1))) {
            if (inst->getOperand(0)->hasOneUse())
                return createShl(inst->getOperand(0), log2(rhs));
        }
        // Generic: narrow demanded mask by leading zeros, propagate to operands
        unsigned effectiveBits = width - demanded->countLeadingZeros();
        APInt narrowDemand = APInt::getLowBitsSet(width, effectiveBits);
        propagateDemandToOperand(ctx, inst, 0, narrowDemand, &kz0, &ko0, depth+1, state);
        propagateDemandToOperand(ctx, inst, 1, narrowDemand, &kz1, &ko1, depth+1, state);
        KnownBits::computeForMul(result, {kz0,ko0}, {kz1,ko1}, inst->hasNUW(), inst->hasNSW());
        break;
    }

    case '6': // ZExt — lines 1677-1919
        // Check if source is intrinsic call with known return range
        if (auto range = getIntrinsicReturnRange(inst->getOperand(0))) {
            if (range.fitsBitWidth(demanded->getActiveBits()))
                return inst->getOperand(0);  // eliminate extension
        }
        // Standard: shift demanded bits down, propagate to source, zext result
        propagateDemandToOperand(ctx, inst, 0, demanded->trunc(srcWidth), ...);
        KnownBits::zext(result, srcWidth);
        break;

    case 'U': { // NVIDIA Intrinsic — lines 3521-4085
        unsigned intrinsicID = getIntrinsicID(inst);
        switch (intrinsicID) {
        case 0x0F:  handleBFE_BFI(inst, demanded, result);  break;
        case 0x42:  handlePopcount(inst, demanded, result); break;
        case 0x01:  handleAbs(inst, demanded, result);      break;
        case 0xB4:  handleFSHL(inst, demanded, result);     break;
        case 0xB5:  handleFSHR(inst, demanded, result);     break;
        case 0x12B: handleBswap(inst, demanded, result);    break;
        default:
            // Fall through to NVIDIA intrinsic known-bits oracle
            sub_F0C4B0(inst, result, depth, state);
            break;
        }
        break;
    }

    // ... 13 more opcode cases (Add, Sub, Xor, PHI, Trunc, SExt, etc.)

    default:
        sub_9AC0E0(inst, result, depth, state);  // generic fallback
        break;
    }

    // POST-ANALYSIS REFINEMENT (lines 2134-2281)
    // 1. Pointer alignment: if type is pointer, OR alignment bits into knownZero
    if (inst->getType()->isPointerTy()) {
        unsigned alignBits = getPointerAlignmentBits(inst);  // sub_BD5420
        result->knownZero |= APInt::getLowBitsSet(width, alignBits);
    }

    // 2. Debug cross-validation (when qword_4F90C28 is set)
    if (DEBUG_FLAG) {
        KnownBits reference;
        sub_9AC330(inst, &reference, depth, state);  // independent computation
        if (reference != *result) {
            print("computeKnownBits(): ", reference);
            print("SimplifyDemandedBits(): ", *result);
            abort();
        }
    }

    // 3. Demand-covers-known check: can we replace with constant?
    if (demanded->isSubsetOf(result->knownZero | result->knownOne))
        return ConstantInt::get(inst->getType(), result->knownOne);

    return nullptr;
}

Demand Propagation Per Operand

The trampoline sub_11AE940 is the per-operand demand propagation entry point. It increments depth, checks the depth limit (depth > 6 returns all-unknown), and dispatches between the big handler (sub_11A7600) and the binary-arithmetic-specific helper (sub_11A1430) based on opcode class:

// sub_11AE940 — per-operand demand propagation trampoline
Instruction* propagateDemandToOperand(
    AnalysisCtx *ctx, IRNode *parent, unsigned opIdx,
    APInt *demand, KnownBits *out, unsigned depth, QueryState *state
) {
    if (depth > 6)
        return nullptr;  // MaxAnalysisRecursionDepth reached

    IRNode *operand = parent->getOperand(opIdx);
    uint8_t opcode = operand->opcode_tag;

    // Binary arithmetic subset goes to the helper
    if (opcode == '*' || opcode == '9' || opcode == ':' ||
        opcode == ';' || opcode == ',' || opcode == '8')
        return sub_11A1430(ctx, operand, demand, out, depth, state);

    // Everything else goes to the big merged handler
    return sub_11A7600(ctx, operand, demand, out, depth, state);
}

The secondary helper sub_11A1430 handles Add/Sub/Xor/Mul/BitCast/ExtractElement with a tighter structure: it uses a four-accumulator cascade with three successive isSubsetOf checks per operation, which is more aggressive than upstream LLVM's single post-merge check.

The Four-Accumulator Cascade

For binary operators (Add, Sub, Xor), cicc maintains four APInt accumulators (two per operand) and performs a three-tier check:

// Three-tier demand satisfaction check (sub_11A1430 pattern)
// More aggressive than upstream single-check approach
KnownBits kb0, kb1;
computeKnownBits(op0, &kb0, depth+1, state);
computeKnownBits(op1, &kb1, depth+1, state);
KnownBits merged = mergeForOpcode(kb0, kb1, opcode);
sub_99B5E0(inst, &merged, depth, state);  // NVIDIA post-fixup

// Check 1: merged result covers demand?
if (demanded.isSubsetOf(merged.knownZero | merged.knownOne))
    return ConstantInt::get(merged.knownOne);

// Check 2: union of operand known-bits covers demand?
if (demanded.isSubsetOf((kb0.knownZero | kb1.knownZero) |
                        (kb0.knownOne  | kb1.knownOne)))
    return ConstantInt::get(...);

// Check 3: all accumulated zero|one covers demand?
if (demanded.isSubsetOf(allAccumulatedZero | allAccumulatedOne))
    return followUseDef(...);

The post-analysis fixup sub_99B5E0 is NVIDIA-specific and does not exist in upstream LLVM. It applies additional refinements from thread index range constraints, warp-level uniformity, and shared memory alignment guarantees.

DemandedBits for GPU: Narrowing Optimizations

The DemandedBits analysis is the backward complement to KnownBits' forward analysis. When a consumer only needs the low N bits of a value, the producer can be narrowed or eliminated. On GPU, this interaction is dramatically more productive than on CPU because of three factors:

  1. 32-bit address spaces: Shared memory (AS 3) and local memory (AS 5) use 32-bit pointers. When address calculations are performed in i64 (as the generic address space requires), the upper 32 bits are entirely undemanded for shared/local accesses. DemandedBits proves this and enables truncation to i32.

  2. Bounded thread indices: threadIdx.x * stride + offset patterns produce values that fit in far fewer bits than i32. If threadIdx.x < 256 (from __launch_bounds__) and stride < 4096, the product fits in 20 bits. DemandedBits propagates this, enabling downstream shifts and masks to operate on narrower types.

  3. Type demotion to i16/fp16: When DemandedBits proves only the low 16 bits of an i32 computation matter, cicc can demote to 16-bit operations. The function at sub_1185740 (InstCombine's visitTrunc) inserts narrowing truncations. This is particularly valuable for texture coordinate calculations and index arithmetic in tensor core operations.
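The bound propagation in factor 2 can be sketched as follows (`bitsForProduct` is a hypothetical helper for illustration, not a recovered function):

```cpp
#include <cassert>
#include <cstdint>

// Given upper bounds on threadIdx.x and the stride, how many low bits can
// the product threadIdx.x * stride occupy? Illustrative only.
unsigned bitsForProduct(uint64_t tidMax, uint64_t strideMax) {
    uint64_t maxVal = tidMax * strideMax;  // largest possible product
    unsigned bits = 0;
    while (maxVal >> bits)                 // position of highest set bit + 1
        ++bits;
    return bits;
}
```

With threadIdx.x < 256 and stride < 4096, the product occupies at most 20 bits, matching the example above; all higher bits become known-zero for downstream consumers.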

Dead Bit Elimination

The core optimization check appears approximately 15 times across the analysis functions:

// Inline version (width <= 64):
uint64_t unknown = ~(knownZero | knownOne);
if ((demanded & unknown) == 0) {
    // All demanded bits are determined -> replace with constant
    return ConstantInt::get(type, knownOne);
}

// Wide version (width > 64):
if (demanded.isSubsetOf(knownZero | knownOne)) {
    return ConstantInt::get(type, knownOne);  // sub_AD6220
}

This is the heart of the analysis: backward-propagated demand meets forward-propagated known-bits. When they cover every bit the consumer needs, the entire instruction is dead and can be replaced with a compile-time constant.
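A minimal sketch of the inline (width <= 64) form of this check; `KB64` and `allDemandedBitsKnown` are illustrative names, not recovered symbols:

```cpp
#include <cassert>
#include <cstdint>

struct KB64 {
    uint64_t knownZero;  // bits proven 0
    uint64_t knownOne;   // bits proven 1
};

// True when every demanded bit is already determined, i.e. the instruction
// can be replaced by the constant formed from knownOne.
bool allDemandedBitsKnown(uint64_t demanded, KB64 kb) {
    uint64_t unknown = ~(kb.knownZero | kb.knownOne);
    return (demanded & unknown) == 0;
}
```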

GPU Patterns Enabled by Known Bits

The following simplifications are GPU-specific and do not have CPU equivalents:

Mul to Shl for threadIdx arithmetic (lines 714--861): When both operands of a multiply originate from intrinsic calls with known power-of-2 returns (e.g., threadIdx.x * blockDim.x where blockDim is a power-of-2 from __launch_bounds__), the multiply is replaced with a left shift. The pattern matcher checks sub_BCAC40 (hasOneUse) and sub_10A0620 (createShl replacement).
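The rewrite itself reduces to the classic power-of-two test plus a log2. A sketch (`mulOrShl` and `isPow2` are illustrative, not recovered symbols):

```cpp
#include <cassert>
#include <cstdint>

// When one operand is a known power of two (e.g. blockDim.x pinned by
// __launch_bounds__), the multiply becomes a left shift.
bool isPow2(uint64_t v) { return v != 0 && (v & (v - 1)) == 0; }

uint64_t mulOrShl(uint64_t tid, uint64_t dim) {
    if (isPow2(dim)) {
        unsigned sh = 0;
        while ((1ULL << sh) != dim) ++sh;  // sh = log2(dim)
        return tid << sh;                  // shl replaces the multiply
    }
    return tid * dim;                      // otherwise keep the mul
}
```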

Bswap + BFE fusion (lines 3959--4007): Detects a byte-swap feeding into a bit-field extract and replaces with a direct byte read at the swapped offset. Common in endianness conversion code for shared memory operations.
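The identity behind the fusion: extracting byte k of bswap(x) equals extracting byte (3 - k) of x directly for a 32-bit value, so the byte swap disappears. A sketch under that assumption:

```cpp
#include <cassert>
#include <cstdint>

uint32_t byteOf(uint32_t x, unsigned k) { return (x >> (8 * k)) & 0xFF; }

uint32_t bswap32(uint32_t x) {
    return (x << 24) | ((x & 0xFF00) << 8) |
           ((x >> 8) & 0xFF00) | (x >> 24);
}
```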

ZExt/SExt elimination via intrinsic return range (sub_10CA790 path): When a ZExt or SExt extends the result of an NVVM intrinsic call, and the intrinsic's annotated return range fits entirely within the demanded bits, the extension is eliminated. This fires frequently for threadIdx.x reads extended to i64 for address calculations.

BitCast-through-ZExt folding (sub_11A1430 at 0x11A2360): When a BitCast's source is a ZExt and the demanded bits fit within the original narrow type, the bitcast+zext chain collapses to the original value. Common in CUDA address calculations involving zero-extension followed by pointer reinterpretation.

SelectionDAG computeKnownBits

The DAG-level known-bits analysis at sub_33D4EF0 (114 KB, 3,286 lines) mirrors the IR-level analysis but operates on SDNode opcodes. It handles 112 opcode cases organized into 14 groups.

NVPTX Target Node Known Bits

For NVPTX-specific DAG opcodes (above ISD::BUILTIN_OP_END = 499), the function delegates to NVPTXTargetLowering::computeKnownBitsForTargetNode via vtable slot 254 at offset 2032. The key NVPTX-specific cases:

Opcode Range | NVPTX DAG Node | Known-Bits Behavior
0x152--0x161 (338--353) | TEX, SULD, surface ops | Result width known: bits above element size set to zero
0x12A (298) | LoadV2, LoadParam | Extension mode from flags byte bits[2:3]: zext/sext/none
0x16A, 0x16C (362, 364) | StoreParam, StoreRetval | When flags bits[2:3] == 0b11: element type width known
0x175 (373) | ConstantPool | Uses ConstantRange::fromKnownBits intersection
0xCA (202) | INTRINSIC_WO_CHAIN | Boolean-like: bit 0 unknown, bits [1..width] known zero
>= 499 | All target-specific | Delegates to vtable[254] computeKnownBitsForTargetNode

The DAG-level analysis uses the same recursion depth cap of 6 (a6 > 5 returns all-unknown), matching LLVM's MaxRecursionDepth.

Texture/Surface Fetch Result Width

Cases 0x152--0x161 encode the known bit-width of texture and surface fetch results. For an 8-bit texture fetch zero-extended to i32, the analysis sets bits [8, 31] as known-zero in the result. This enables downstream shift and mask elimination in texture sampling code.
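The known-zero mask for an element width follows directly; `knownZeroMaskForFetch` is an illustrative helper, not a recovered symbol:

```cpp
#include <cassert>
#include <cstdint>

// For a fetch result zero-extended to i32, every bit at or above the
// element width is provably zero.
uint32_t knownZeroMaskForFetch(unsigned elemBits) {
    return (elemBits >= 32) ? 0u : ~((1u << elemBits) - 1u);
}
```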

KnownBits Data Structure Layout

Both the IR-level and DAG-level implementations use the same 32-byte struct:

struct KnownBits {                      // 32 bytes total
    union {
        uint64_t  val;                  // +0x00: inline storage (width <= 64)
        uint64_t *ptr;                  // +0x00: heap pointer  (width > 64)
    } knownZero;
    uint32_t knownZero_width;           // +0x08: bit-width
    uint32_t _pad0;                     // +0x0C: padding
    union {
        uint64_t  val;                  // +0x10: inline storage (width <= 64)
        uint64_t *ptr;                  // +0x10: heap pointer  (width > 64)
    } knownOne;
    uint32_t knownOne_width;            // +0x18: bit-width
    uint32_t _pad1;                     // +0x1C: padding
};

// Invariant: (knownZero & knownOne) == 0   (no bit both 0 and 1)
// Threshold: width > 64 triggers heap allocation via sub_C43690

Roughly 43% of sub_11A1430's binary size consists of APInt destructor sequences (cmp [rbp+var], 0x40; jbe skip; call free) for the width > 64 cleanup paths.
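A minimal mirror of the inline (width <= 64) case with the documented invariant, for experimentation; field names follow the recovered struct above:

```cpp
#include <cassert>
#include <cstdint>

// Invariant: no bit may be simultaneously known-zero and known-one.
struct KnownBits64 {
    uint64_t knownZero;
    uint64_t knownOne;
    uint32_t width;
    bool valid() const { return (knownZero & knownOne) == 0; }
};
```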

Configuration

Knob | Source | Default | Effect
nvvm-intr-range-sm | ctor_359 | Current target SM | SM variant used to compute special register ranges for the nvvm-intr-range pass
scev-cgp-tid-max-value | ctor_XXX | Architecture limit | Maximum value of thread ID used in SCEV-based CodeGenPrep address calculations
nv-remat-threshold-for-spec-reg | unk_4FD3860 | 20 | Threshold controlling when special register reads are rematerialized instead of spilled (interacts with known-bits because remat preserves range metadata)
qword_4F90C28 | internal debug flag | 0 (disabled) | Enables cross-validation abort: runs independent reference computeKnownBits (sub_9AC330) and aborts if results disagree with merged analysis
Max recursion depth | hardcoded | 6 | Matches LLVM's MaxAnalysisRecursionDepth; checked in sub_11AE940
APInt inline threshold | hardcoded | 64 bits | Values <= 64 bits use inline uint64 storage; wider values heap-allocate

Diagnostic Strings

The merged analysis emits the following diagnostics (only in debug/assert builds when qword_4F90C28 is set):

String | Location | Trigger
"computeKnownBits(): " | sub_11A7600 line ~2204 | Cross-validation mismatch: prints the reference implementation's result
"SimplifyDemandedBits(): " | sub_11A7600 line ~2208 | Cross-validation mismatch: prints the merged analysis result
"Mismatched known bits for <inst> in <func>" | sub_11A7600 line ~2200 | Precedes the two values above; followed by abort()

The nvvm-intr-range pass emits:

String | Location
"Add !range metadata to NVVM intrinsics." | sub_216F4B0 (pass registration)

NVVM IR Node Layout

The KnownBits analysis traverses IR nodes using cicc's internal representation. Each node is 32 bytes:

struct IRNode {             // 32 bytes (0x20)
    uint8_t  opcode;        // +0x00: single-byte opcode tag (ASCII-based)
    uint8_t  flags;         // +0x01: bit 1, bit 2 = nsw/nuw flags
    uint16_t _reserved;     // +0x02
    uint32_t operand_idx;   // +0x04: 27-bit operand index + 5-bit flags
                            //        byte 7 bit 6 (0x40) = use-list vs indexed
    // ... remaining 24 bytes: use-list pointers, type info, metadata
};

// Operand resolution:
// If byte[7] & 0x40 (use-list flag set):
//     operand = *(node - 8) -> *(ptr + 0x20)
// If byte[7] & 0x40 == 0 (indexed):
//     idx = (node[4..7] & 0x7FFFFFF)
//     operand = node - (idx << 5)    // 27-bit index * 32 bytes

The 27-bit index allows up to 134 million nodes (4 GB theoretical IR size).
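The indexed-resolution arithmetic from the comments above, sketched on raw addresses (`resolveIndexedOffset` is an illustrative helper, not a recovered symbol):

```cpp
#include <cassert>
#include <cstdint>

// The low 27 bits of the word at +0x04 are a backward index, scaled by
// the 32-byte node size.
uint64_t resolveIndexedOffset(uint64_t nodeAddr, uint32_t word4) {
    uint64_t idx = word4 & 0x7FFFFFF;   // low 27 bits
    return nodeAddr - (idx << 5);       // step back idx 32-byte nodes
}
```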

Function Map

IR-Level Known-Bits

Function | Address | Size
computeKnownBitsAndSimplify -- merged main analysis | sub_11A7600 | 127 KB
SimplifyDemandedBitsHelper -- binary arithmetic subset | sub_11A1430 | 6.3 KB
Per-operand demand propagation trampoline (depth check) | sub_11AE940 | varies
SimplifyDemandedBits entry wrapper (allocates APInts) | sub_11AE870 | thin
SimplifyDemandedBits result caching (hash table at IC+2064) | sub_11AE3E0 | 235 lines
computeKnownBitsFromOperator / PHI merge | sub_11A3F30 | 50 KB
computeKnownBitsFromAssume (processes @llvm.assume) | sub_11A6910 | 12.5 KB
computeKnownBitsFromRangeMetadata (reads !range) | sub_11A68C0 | varies
Generic computeKnownBits (fallback, no simplification) | sub_9AC0E0 | varies
Reference computeKnownBits (debug cross-validation only) | sub_9AC330 | varies
NVIDIA post-analysis fixup (alignment + range refinement) | sub_99B5E0 | varies
NVIDIA intrinsic known-bits oracle (special registers) | sub_F0C4B0 | varies
isNVVMFunction check (NVIDIA-specific flag) | sub_F0C3D0 | varies
Intrinsic return range analysis (computes [lo, hi]) | sub_10CA790 | 11.2 KB
Extract return range bounds from range analysis result | sub_11A1390 | varies
getPointerAlignmentBits (alignment-derived known zeros) | sub_BD5420 | varies
isDemandedBitsFullyKnown (demand subset-of known) | sub_10024C0 | varies
NVVMIntrRange pass -- attaches !range metadata | sub_216F4B0 | varies

SelectionDAG-Level Known-Bits

Function | Address | Size
SelectionDAG::computeKnownBits (recursive, 112 opcode cases) | sub_33D4EF0 | 114 KB
Creates all-demanded mask, delegates to sub_33D4EF0 | sub_33DD090 | wrapper
computeMinLeadingZeros (calls sub_33D25A0 + returns) | sub_33D4D80 | wrapper
computeNumSignBits (parallel switch structure) | sub_33D25A0 | 49 KB
computeOverflowForAdd / computeOverflowForSub | sub_33DCF10 | varies

KnownBits Arithmetic Helpers

Function | Address
KnownBits::computeForMul(result, nuw, nsw, kb0, kb1) | sub_C70430
KnownBits::add(a, b, nsw, nuw, carry) | sub_C74E10
KnownBits::sub(a, b, nsw, nuw) | sub_C75B70
KnownBits::computeForAddSub(isSub, nsw, nuw, a, b) | sub_C76560
KnownBits::shl(a, shamt) | sub_C73220
KnownBits::lshr(a, b) | sub_C738B0
KnownBits::ashr(a, b) | sub_C73E40
KnownBits::and(a, b, commutative) | sub_C787D0
KnownBits::or(a, b) | sub_C78F20
KnownBits::xor(a, b) | sub_C790F0
KnownBits::mergeForPHI / smax(a, b) | sub_C79480
KnownBits::truncate / smulh(a, b) | sub_C7B4D0
KnownBits::cttz(a, shift) | sub_C7BCF0
KnownBits::ctpop(a) | sub_C7BD50
KnownBits::bswap(a) | sub_C7BDB0
KnownBits::abs(a, known_shift) | sub_C746C0
KnownBits::umin(a, b) | sub_C740A0
KnownBits::umax(a, b) | sub_C74180
KnownBits::ctlz(a, poisonAtZero) | sub_C778B0

APInt Utilities

Function | Address
APInt(width, 0) -- zero-init constructor (heap for width > 64) | sub_C43690
APInt copy constructor | sub_C43780
APInt::operator&= | sub_C43B90
APInt::operator|= | sub_C43BD0
APInt::setBits(lo, hi) | sub_C43C90
APInt::flipAllBits | sub_C43D10
APInt::trunc(width) | sub_C44740
APInt::zext(width) | sub_C449B0
APInt::sext(width) | sub_C44830
APInt::countTrailingZeros | sub_C44590
APInt::countLeadingZeros | sub_C444A0
APInt::countPopulation | sub_C44630
APInt::isSubsetOf(other) | sub_C446F0
APInt::reverseBits / byteSwap | sub_C44AB0
ConstantInt::get(type, APInt) -- creates constant replacement | sub_AD6220
ConstantInt::get(type, value, isSigned) | sub_AD64C0

Differences from Upstream LLVM

Aspect | Upstream LLVM | CICC v13.0
Analysis architecture | Separate computeKnownBits (ValueTracking) and SimplifyDemandedBits (InstCombine) | Fused into single 127 KB function (sub_11A7600) that simultaneously computes bitmasks and simplifies instructions
GPU register ranges | No special register concept; all values have full-width range | Dedicated oracle (sub_F0C4B0) provides known-zero bits for %tid, %ntid, %ctaid, %warpsize, %laneid, and 10+ PTX special registers
Range metadata injection | No equivalent pass; range info comes from profile data or programmer annotations | nvvm-intr-range pass (sub_216F4B0) attaches !range metadata to every special-register read; tightened by __launch_bounds__
Warp size | Not a concept; no constant is known | %warpsize is statically known to be exactly 32 (known-zero bits [0,4] and [6,31], bit 5 = 1)
Cross-validation | No cross-validation in release builds | Debug flag qword_4F90C28 enables abort-on-mismatch between computeKnownBits and SimplifyDemandedBits results
SelectionDAG integration | Separate DAG-level computeKnownBits (~60 KB) | Extended DAG-level version at sub_33D4EF0 (114 KB, 3,286 lines) with GPU-specific value tracking
Max recursion depth | 6 (configurable) | Same default 6, checked in sub_11AE940 with identical semantics

Cross-References

  • InstCombine -- The primary consumer of KnownBits analysis; sub_11AE870 is called from the binary operator visitor's Phase 0
  • SelectionDAG -- DAG-level known-bits at sub_33D4EF0 feeds into DAGCombine and instruction selection pattern matching
  • Loop Strength Reduction -- LSR interacts with shared-memory known-bits through the lsr-no-ptr-address-space3 knob that disables LSR for 32-bit shared memory pointers
  • GVN -- sub_9AC330 (reference computeKnownBits) is also called from GVN to validate value numbering decisions
  • LICM -- Loop-invariant code motion uses known-bits to prove that hoisted expressions are safe (no integer overflow when known-bits constrain the range)

CodeGenPrepare and SCEV-CGP

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

LLVM version note: Upstream CodeGenPrepare is stock LLVM 20.0.0 CodeGenPrepare.cpp with all 20+ cl::opt knobs unchanged. SCEV-CGP is a fully proprietary NVIDIA pass with no upstream equivalent; it is disabled by default (nv-disable-scev-cgp = true).

cicc v13.0 contains two distinct passes that prepare LLVM IR for the NVPTX backend's instruction selection. The first is upstream LLVM's CodeGenPreparePass, registered as "codegenprepare" in the New PM pipeline (line 216 of sub_2342890), which sinks address computations, creates PHI nodes for sunk values, and splits critical edges. The second is NVIDIA's proprietary SCEV-CGP (Scalar-Evolution-based Code Generation Preparation), a fully custom pass that uses SCEV analysis to rewrite address expressions with GPU thread ID as an induction variable.

Both passes operate at the LLVM IR level, immediately before SelectionDAG construction. They share the goal of making address expressions cheap for the backend to lower, but they work at different abstraction levels: CodeGenPrepare operates syntactically on individual memory instructions; SCEV-CGP operates semantically on entire address expression families using scalar evolution. NVIDIA disables SCEV-CGP by default (nv-disable-scev-cgp defaults to true), relying on upstream CodeGenPrepare plus the downstream Base Address Strength Reduction and Common Base Elimination passes to handle GPU address optimization.

Key Facts

Property | Value
Pass name (upstream) | codegenprepare (New PM)
Pass name (NVIDIA) | SCEV-CGP (no formal New PM pass name found in binary)
Binary range (v12.x) | 0x1D60000--0x1D7FFFF (helpers + main transforms)
Binary range (v13.0) | 0x2D75700--0x2D88660 (Cluster 6 in 0x2D sweep)
Address sinking | sub_1D73760 / sub_2D75700 (65--72 KB), string "sunkaddr"
PHI sinking | sub_1D706F0 / sub_2D784F0 (64--68 KB), string "sunk_phi"
Block splitting | sub_1D7AA30 / sub_2D88660 (54--74 KB), strings ".unlikely", ".cond.split"
Main transform | sub_2D80050 (54 KB) -- orchestrates address mode lowering
SCEV-CGP knob ctor | ctor_263_0 at 0x4F36F0 (9.9 KB, 44 option strings)
CGP knob ctor | ctor_288_0 at 0x4FA950 (8.6 KB, 44 option strings)
Master disable | nv-disable-scev-cgp (default: true -- SCEV-CGP is disabled)
Upstream source | llvm/lib/CodeGen/CodeGenPrepare.cpp
Pipeline position | Late IR, immediately before SelectionDAG ISel

Upstream CodeGenPrepare

Purpose

CodeGenPrepare is the last IR-level pass before instruction selection. Its job is to transform the IR into a form that the SelectionDAG builder can lower efficiently: address computations should be adjacent to their memory uses (reducing live ranges), complex addressing modes should be materialized as GEP chains that ISel can pattern-match, and unlikely branches should be split into cold blocks so that block placement can isolate them.

On NVPTX this pass is less critical than on x86 because PTX has simpler addressing modes (base + offset, no scaled index), but it still performs three important transforms.

Transform 1: Address Sinking (sunkaddr)

The address sinking logic lives in sub_1D73760 (v12.x) / sub_2D75700 (v13.0). It identifies memory instructions whose address operand is computed in a dominating block, then sinks the computation to the block containing the memory instruction. The sunk address is named "sunkaddr" in the IR, appearing as a GEP, inttoptr, or bitcast chain:

Before:
  entry:
    %addr = getelementptr float, ptr %base, i64 %idx
    br label %loop

  loop:
    %val = load float, ptr %addr          ; addr live across loop

After:
  entry:
    br label %loop

  loop:
    %sunkaddr0 = getelementptr float, ptr %base, i64 %idx
    %val = load float, ptr %sunkaddr0     ; addr local to use

The naming convention "sunkaddr" with a numeric suffix (20+ occurrences in binary string references) is the standard LLVM naming. Each sunk address gets a unique suffix: sunkaddr0, sunkaddr1, etc.

The sinking decision is controlled by a cache called ValueToSunkAddr (a DenseMap at sub_2CE7CF0 in the v13.0 build). Before sinking a value, the pass checks whether the same address expression has already been sunk into the target block. If so, it reuses the existing sunk copy rather than creating a duplicate.

The core sinking algorithm:

for each basic block BB in function:
    for each instruction I in BB:
        if I is a memory instruction (load/store/atomic):
            addr = I.getPointerOperand()
            if addr.getParent() != BB:
                // addr defined in a dominating block
                addr_mode = matchAddressMode(addr)           // sub_2D67BB0
                if addr_mode.isFoldable():
                    sunk = materializeAddrMode(addr_mode, BB) // sub_2D68450
                    I.setPointerOperand(sunk)
                    mark changed

Key helpers in the v13.0 build:

Function | Address | Size | Role
-- | sub_2D749D0 | -- | Address mode cache lookup
-- | sub_2D67BB0 | -- | Address mode legality test
-- | sub_2D6E640 | -- | Address mode cache insert
-- | sub_2D68450 | -- | Address mode materialization
-- | sub_2CE7CF0 | -- | ValueToSunkAddr DenseMap

Transform 2: PHI Sinking (sunk_phi)

When an address computation has multiple uses in successor blocks of a conditional branch, the pass creates a PHI node in the merge block rather than sinking independent copies into each successor. The resulting PHI is named "sunk_phi":

Before:
  entry:
    %addr = getelementptr float, ptr %base, i64 %idx
    br i1 %cond, label %then, label %else

  then:
    %v1 = load float, ptr %addr
    br label %merge

  else:
    %v2 = load float, ptr %addr
    br label %merge

After (conceptual):
  then:
    %sunkaddr0 = getelementptr float, ptr %base, i64 %idx
    %v1 = load float, ptr %sunkaddr0
    br label %merge

  else:
    %sunkaddr1 = getelementptr float, ptr %base, i64 %idx
    %v2 = load float, ptr %sunkaddr1
    br label %merge

When the two sunk copies would be identical and the value is needed in the merge block for other uses, the pass instead creates:

  merge:
    %sunk_phi = phi ptr [ %sunkaddr0, %then ], [ %sunkaddr1, %else ]

The PHI creation calls sub_B44260 (PHI node setup), with naming via sub_BD6B50. The addr-sink-new-phis cl::opt knob (registered at ctor_288_0) controls whether the pass is allowed to create new PHIs during address sinking. The addr-sink-new-select knob similarly controls creation of new select instructions.

Transform 3: Block Splitting

sub_1D7AA30 (v12.x) / sub_2D88660 (v13.0) splits basic blocks to isolate unlikely paths. The pass creates blocks with suffixes ".unlikely" and ".cond.split", allowing MachineBlockPlacement to push cold code away from the hot path. This is driven by branch probability metadata and profile-guided section prefix hints.

On NVPTX, block splitting interacts with StructurizeCFG: the split blocks must still form reducible control flow, otherwise StructurizeCFG will have to insert additional flow blocks to restore structure. The profile-guided-section-prefix knob controls whether section prefix metadata (.hot, .unlikely, .unknown) is attached to split blocks.

Upstream CodeGenPrepare Knobs

All registered at ctor_288_0 (0x4FA950, 8.6 KB, 44 strings). These are standard LLVM cl::opt knobs, unchanged from upstream:

Knob | Type | Effect
disable-cgp-branch-opts | bool | Disable CodeGenPrepare branch optimizations
disable-cgp-gc-opts | bool | Disable CodeGenPrepare GC optimizations
disable-cgp-select2branch | bool | Disable select-to-branch conversion
addr-sink-using-gep | bool | Use GEP instructions for address sinking (vs. inttoptr)
enable-andcmp-sinking | bool | Sink and/cmp instruction pairs into branches
disable-cgp-store-extract | bool | Disable store-extractvalue optimization
stress-cgp-store-extract | bool | Stress test store-extractvalue path
disable-cgp-ext-ld-promotion | bool | Disable extension-load promotion
disable-preheader-prot | bool | Disable loop preheader protection
profile-guided-section-prefix | bool | Attach section prefix based on profile data
cgp-freq-ratio-to-skip-merge | int | Block frequency ratio threshold to skip block merging
force-split-store | bool | Force store splitting
cgp-type-promotion-merge | bool | Merge type promotions
disable-complex-addr-modes | bool | Disable complex addressing mode optimization
addr-sink-new-phis | bool | Allow creating new PHIs during address sinking
addr-sink-new-select | bool | Allow creating new select during address sinking
addr-sink-combine-base-reg | bool | Combine base register in address sink
addr-sink-combine-gv | bool | Combine global value in address sink
addr-sink-combine-offs | bool | Combine offset in address sink
addr-sink-combine-scaled-reg | bool | Combine scaled register in address sink
cgp-split-large-offset-gep | bool | Split GEPs with large offsets

GPU Relevance of Upstream Knobs

Most of these knobs are effectively no-ops on NVPTX because the target's addressing modes are simple (base + immediate offset, no scaled index register). However, a few matter:

  • addr-sink-using-gep: Controls whether sunk addresses use GEP or inttoptr chains. On NVPTX, GEP chains are preferred because they preserve address space information through lowering. The inttoptr path strips address space, forcing the backend to re-derive it.

  • cgp-split-large-offset-gep: Relevant for large array accesses where the constant offset exceeds the PTX immediate encoding width (±2^31 for 64-bit addressing). Splitting the GEP allows the backend to use a base register plus a small offset rather than a 64-bit constant.

  • addr-sink-new-phis: On GPU, creating new PHIs can increase divergent live ranges. If the condition driving the PHI is thread-divergent, the PHI result will be divergent, potentially requiring a wider (per-lane) register allocation.
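The large-offset split in the second bullet can be modeled simply: keep as much of the offset as the immediate encoding holds, and hoist the remainder into the base register. This is an illustrative model under the ±2^31 figure quoted above, not the pass's actual algorithm; `splitOffset` is a hypothetical helper:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Returns {hoisted, folded}: hoisted is added to the base register once,
// folded stays inside the addressing mode's immediate field.
std::pair<int64_t, int64_t> splitOffset(int64_t offset, int64_t immMax) {
    int64_t folded = offset % immMax;   // fits the immediate encoding
    int64_t hoisted = offset - folded;  // materialized into the base
    return {hoisted, folded};
}
```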

NVIDIA SCEV-CGP

What Is It?

SCEV-CGP is a fully custom NVIDIA pass that uses LLVM's ScalarEvolution analysis to optimize address mode expressions at the function level, with specific awareness of GPU thread ID as an induction variable. Where upstream CodeGenPrepare operates syntactically (pattern-matching individual instructions), SCEV-CGP operates semantically: it analyzes address expressions as SCEV recurrences, factors out common base computations, and rewrites them to minimize register pressure.

The pass is registered in ctor_263_0 at 0x4F36F0 alongside Base Address Strength Reduction knobs. The 44 strings registered in this single constructor cover both SCEV-CGP and BASR, confirming they are part of the same address optimization subsystem.

Why NVIDIA Disables It By Default

The nv-disable-scev-cgp knob defaults to true (the description reads "Disable optimize addr mode with SCEV pass", and the raw data at ctor_609_0 marks it def=on, i.e. the disable is active by default). This is a deliberate choice:

  1. Redundancy with BASR/CBE. NVIDIA has invested heavily in Base Address Strength Reduction (62 KB) and Common Base Elimination (39 KB), which handle the most profitable GPU address optimizations (sharing base computations across array accesses in loop bodies). These passes are simpler, more predictable, and better-tested than the general SCEV-CGP framework.

  2. Interaction with LSR. Both SCEV-CGP and Loop Strength Reduction operate on SCEV expressions. If both are active, they can fight over the same address expressions: LSR rewrites IVs for loop-carried efficiency, then SCEV-CGP undoes part of that work to optimize address modes. The result can be worse than either pass alone. By disabling SCEV-CGP, NVIDIA lets LSR (with its full GPU-aware formula solver) handle SCEV-based address optimization without interference.

  3. Compile-time cost. SCEV-CGP with aggressive mode (do-scev-cgp-aggresively [sic]) is expensive. The scev-cgp-inst-limit and scev-cgp-control knobs exist precisely because uncontrolled SCEV-CGP can balloon compile times on large kernels with many address expressions.

  4. Overflow hazards. The ignore-32-bit-overflow and ignore-signed-32-bit-overflow knobs in ctor_263_0 indicate that SCEV-CGP can produce address arithmetic that overflows 32-bit intermediates. On GPU where 32-bit addressing is common (shared memory, constant memory), this is a correctness risk that NVIDIA mitigates by keeping the pass off by default.

When SCEV-CGP Would Be Beneficial

Despite being disabled by default, the pass has 11 dedicated knobs -- NVIDIA clearly uses it selectively:

  • Kernels with complex strided access patterns where thread ID participates in multi-dimensional address calculations (e.g., base + tid.x * stride_x + tid.y * stride_y + tid.z * stride_z). BASR handles the case where multiple accesses share a base, but it does not factor thread ID expressions across dimensions.

  • Register-pressure-critical kernels at occupancy cliffs where SCEV-based address strength reduction can save enough registers to cross an occupancy boundary. The scev-cgp-tid-max-value knob lets the pass reason about the bounded range of thread IDs, enabling tighter value range analysis.

  • Function-level address optimization (enabled by do-function-scev-cgp) where cross-loop base sharing matters more than per-loop IV optimization.

Thread ID Max Value Knob

The scev-cgp-tid-max-value knob deserves special attention. It provides SCEV analysis with the maximum possible value of a GPU thread ID, which is architecture-dependent:

  • threadIdx.x: max 1024 (all architectures sm_70+)
  • threadIdx.y: max 1024
  • threadIdx.z: max 64
  • blockIdx.x: max 2^31 - 1

By telling SCEV that threadIdx.x is bounded by 1024, the analysis can prove that threadIdx.x * element_size fits in 32 bits for element sizes up to ~2 million bytes. This enables 32-bit address arithmetic where the expression would otherwise be widened to 64 bits. The knob links to the Known Bits analysis documented in Known Bits, where the nvvm-intr-range pass provides similar bounded-range information for special registers.
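The arithmetic behind that claim, as a hedged sketch (`productFitsInI32` is an illustrative helper, not part of the pass):

```cpp
#include <cassert>
#include <cstdint>

// With threadIdx.x bounded by tidBound, the product tid * elemSize stays
// within signed 32-bit range whenever tidBound * elemSize <= 2^31 - 1.
bool productFitsInI32(uint64_t tidBound, uint64_t elemSize) {
    return tidBound * elemSize <= 0x7FFFFFFFULL;
}
```

For tidBound = 1024 the bound on elemSize is (2^31 - 1) / 1024 ≈ 2 million bytes, matching the figure above.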

SCEV-CGP Knobs (Complete Reference)

All registered in ctor_263_0 at 0x4F36F0. These are NVVMPassOptions values, stored in the 222-slot pass option registry.

Knob | Type | Default | Effect
do-scev-cgp | bool | true [MEDIUM confidence] | Master enable for SCEV-based CodeGenPrepare transforms. Default inferred from the fact that nv-disable-scev-cgp exists as an override, implying this defaults to enabled.
do-scev-cgp-aggresively [sic] | bool | false [MEDIUM confidence] | Enable aggressive SCEV-CGP mode with expanded search. Default inferred from naming convention (aggressive modes are typically off by default).
do-function-scev-cgp | bool | false [MEDIUM confidence] | Enable function-level (cross-loop) SCEV-CGP. Default inferred from naming convention.
nv-disable-scev-cgp | bool | true | Master disable switch in NVPTX backend (overrides do-scev-cgp)
scev-cgp-control | int | unknown | Limit the total number of SCEV-CGP transformations per function
scev-cgp-cross-block-limit | int | unknown | Max number of common base expressions from a single block
scev-cgp-idom-level-limit | int | unknown | Max dominator tree depth for hoisting base computations
scev-cgp-inst-limit | int | unknown | Max instructions analyzed per parameter expression
scev-cgp-old-base | bool | unknown | Use old (legacy) base computation method instead of new
scev-cgp-tid-max-value | int | arch-dependent | Maximum value of thread ID for address range analysis
scev-cgp-check-latency | int | unknown | Latency threshold for address computation profitability
scev-cgp-norm | int | unknown | Normalization control for SCEV expression canonicalization
print-after-scev-cgp | bool | false | Dump function IR after SCEV-CGP completes
dump-scev-cgp | bool | false | Debug dump during SCEV-CGP execution

The same constructor also registers these knobs, documented in their respective pages:

Knob | See
do-base-address-strength-reduce | Base Address Strength Reduction
do-base-address-strength-reduce-chain | Base Address Strength Reduction
base-address-strength-reduce-iv-limit | Base Address Strength Reduction
base-address-strength-reduce-max-iv | Base Address Strength Reduction
topo-sort-begin | Topological sort starting point for address expression graph
ignore-bad-base | Bypass validity checks on base pointer classification
ignore-32-bit-overflow | Skip 32-bit overflow checks in address arithmetic
ignore-signed-32-bit-overflow | Skip signed 32-bit overflow checks

Interaction with LSR

CodeGenPrepare/SCEV-CGP and Loop Strength Reduction both optimize address expressions, but at different pipeline stages and granularities.

Aspect | LSR | CodeGenPrepare | SCEV-CGP
Pipeline position | Late IR optimization (loop passes) | Pre-ISel (after all IR opts) | Pre-ISel (NVIDIA custom position)
Scope | Per-loop IV rewriting | Per-instruction address sinking | Per-function address expression rewriting
SCEV usage | Full: formula generation, stride factoring, chain construction | None (syntactic pattern matching) | Full: base decomposition, range analysis
Register pressure | Explicit RP tracking with occupancy ceiling | Implicit (sinking reduces live ranges) | Implicit via scev-cgp-cross-block-limit
Address space | Full awareness (shared memory protection, 64-bit IV gating) | No special GPU handling | Thread ID aware (scev-cgp-tid-max-value)
Default status | Enabled (with GPU-custom formula solver) | Enabled (standard upstream) | Disabled (nv-disable-scev-cgp = true)

The key insight is the pipeline ordering: LSR runs first during the optimization phase, rewriting IVs across the loop. CodeGenPrepare runs later, sinking the results into individual use sites. If SCEV-CGP were also enabled, it would run between these two, potentially undoing LSR's IV choices to create "better" address modes -- which may conflict with LSR's register-pressure-informed formula selection.

NVIDIA's solution is pragmatic: keep SCEV-CGP off, let LSR handle SCEV-level optimization, let BASR/CBE handle GPU-specific base sharing, and let upstream CodeGenPrepare handle the final address sinking.

Differences from Upstream LLVM

Area | Upstream LLVM | cicc v13.0
CodeGenPrepare pass | Standard, used as-is | Retained unchanged from LLVM 20.0.0
SCEV-CGP | Does not exist | NVIDIA proprietary, disabled by default
Address sinking | Always uses TTI::getAddrModeType | Same, but NVPTX TTI returns simple modes (base+offset only)
Block splitting | Hot/cold based on PGO | Same, but must preserve reducibility for StructurizeCFG
BASR/CBE | Do not exist | NVIDIA proprietary alternatives to SCEV-CGP for GPU
Knob count | ~20 cl::opt for CGP | 20 upstream CGP + 14 SCEV-CGP + 8 BASR = 42 total

Function Map

CodeGenPrepare (v12.x Addresses)

Function | Address | Size | Role
-- | sub_1D73760 | 65 KB | optimizeMemoryInst -- address sinking, creates "sunkaddr"
-- | sub_1D706F0 | 68 KB | PHI optimization, creates "sunk_phi"
-- | sub_1D7AA30 | 74 KB | Block splitting, creates ".unlikely", ".cond.split"
-- | sub_1D779D0 | 71 KB | IR transform (DAG combine-level, possibly optimizeInst)
-- | sub_1D765D0 | 34 KB | Select lowering ("cond.false", "cond.end")
-- | sub_1D7F9D0 | 31 KB | Deque-based worklist processor

CodeGenPrepare (v13.0 Addresses)

Function | Address | Size | Role
-- | sub_2D75700 | 72 KB | Address sinking with "sunk_phi", ValueToSunkAddr DenseMap
-- | sub_2D784F0 | 64 KB | Address mode lowering orchestrator, calls sub_2D75700
-- | sub_2D80050 | 54 KB | Main CodeGenPrepare transform, calls TTI and address mode logic
-- | sub_2D82850 | 62 KB | Late lowering/expansion (type widening, custom lowering)
-- | sub_2D88660 | 70 KB | Block splitting with branch weights ("hot", "unlikely", "unknown")
-- | sub_2D749D0 | -- | Address mode cache lookup
-- | sub_2D67BB0 | -- | Address mode legality test
-- | sub_2D6E640 | -- | Address mode cache insert
-- | sub_2D68450 | -- | Address mode materialization
-- | sub_2D6DEE0 | -- | Address mode matching
-- | sub_2D69E90 | -- | Cleanup/init

Helper Range (0x1D60000--0x1D6FFFF)

This 64 KB sub-range contains CodeGenPrepare helper functions. The sweep identifies it as "CodeGenPrepare helpers" but no individual functions are called out with string evidence. These likely include address mode computation utilities, operand analysis, and GEP canonicalization.

SCEV-CGP Option Registration

| Function | Address | Size | Role |
|---|---|---|---|
| ctor_263_0 | 0x4F36F0 | 9.9 KB | Registers 44 cl::opt strings for SCEV-CGP + BASR |
| ctor_288_0 | 0x4FA950 | 8.6 KB | Registers 44 cl::opt strings for upstream CodeGenPrepare |
| ctor_591 | 0x57C1A0 | 9.3 KB | Additional CodeGenPrepare sink/split options |
| ctor_544_0 | 0x56C190 | 13.1 KB | CodeGenPrepare options (v13.0 duplicate registration) |
| ctor_609_0 | 0x585D30 | 37.3 KB | NVPTX backend mega-block, includes nv-disable-scev-cgp |

Cross-References

ScalarEvolution Overview & Construction

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/Analysis/ScalarEvolution.cpp, llvm/include/llvm/Analysis/ScalarEvolution.h (LLVM 20.0.0)

LLVM version note: CICC v13.0 is based on LLVM 20.0.0 ScalarEvolution.cpp. Evidence: the non-recursive worklist-based createSCEV driver (sub_DD8130) matches the LLVM 16+ refactoring that replaced the recursive createNodeForValue. The getSmallConstantTripCount/getSmallConstantMaxTripCount API matches LLVM 17+ signatures. NVIDIA's three extension categories -- simple_mode complexity control, GPU-specific SCEV sources (thread index bounds), and CUDA loop idiom recognition (warp-stride, grid-stride) -- are layered on top of the stock LLVM 20 analysis with no modifications to the core SCEV algebra.

ScalarEvolution (SCEV) is the foundational analysis that models how values change across loop iterations. Every loop optimization in cicc -- vectorization, unrolling, strength reduction, interchange, distribution -- depends on SCEV to answer three questions: "what is the trip count?", "what is the stride?", and "what is the value range?" NVIDIA's cicc v13.0 ships an LLVM 20.0.0-based ScalarEvolution with three categories of proprietary extensions: a complexity control system (simple_mode) that prevents SCEV from spending unbounded time on GPU kernels with hundreds of induction variables, GPU-specific SCEV sources that inject thread index bounds and launch configuration constraints into the analysis, and recognition of CUDA-specific loop idioms (warp-stride and grid-stride patterns) that have no analog in CPU code. This page documents SCEV expression construction -- the core getSCEV / createSCEV / createNodeForInstruction call chain. Range computation and trip count analysis are covered in SCEV Range Analysis & Trip Counts; cache invalidation and delinearization in SCEV Invalidation & Delinearization.

Key Facts

| Property | Value |
|---|---|
| LLVM base version | 20.0.0 ScalarEvolution.cpp |
| Top-level entry | sub_DD8400 (getSCEV) |
| Core builder | sub_DD65B0 (createNodeForInstruction, 1103 lines) |
| Worklist driver | sub_DD8130 (non-recursive worklist createSCEV, 154 lines) |
| Instruction decomposer | sub_D94080 (452 lines) |
| PHI handler | sub_DD92B0 (createNodeForPHI) |
| GEP handler | sub_DD3A70 (getGEPExpr) |
| Cache lookup | sub_D98300 (lookupSCEV) |
| Cache store | sub_DB77A0 (insertSCEV) |
| NVIDIA complexity scorer | sub_DB3670 (expression size estimator) |
| SE object size | >1572 bytes (fields documented through offset +1572) |
| Calling conventions bypassing budget | CC 42, CC 43 (PTX kernel entry points) |

ScalarEvolution Object Layout

The ScalarEvolution context (SE) is a large heap-allocated object. The fields relevant to SCEV construction:

| Offset | Type | Field | Notes |
|---|---|---|---|
| +0 | Module* | LLVM module / context pointer | |
| +8 | TargetLibraryInfo* | TLI | Used for intrinsic recognition |
| +32 | DominatorTree* | Dominator tree | Required for PHI analysis |
| +40 | LoopInfo* | Loop analysis | AddRec construction needs this |
| +48 | void* | Analysis pointer | Used by complexity scorer |
| +320 | SmallDenseSet | PHI visited set | Prevents infinite recursion |
| +976 | void* | Unsigned range cache table | 40-byte entries, open addressing |
| +992 | uint32_t | Unsigned range cache capacity | Power-of-two |
| +1008 | void* | Signed range cache table | Same structure |
| +1024 | uint32_t | Signed range cache capacity | |
| +1560 | uint8_t | simple_mode flag | 0 = normal, 1 = NVIDIA complexity control |
| +1564 | uint32_t | failure_count | Simple mode: bailed instructions |
| +1568 | uint32_t | recursion_count | Normal mode: depth counter |
| +1572 | uint8_t | Complexity config bits | Tuning for the scorer |

The SE object also contains the ValueExprMap (primary SCEV cache mapping Value* to SCEV*), the backedge-taken count cache at offset +648/+656/+672, and the per-exit BTC cache at +1168/+1184. These are documented in the range/BTC page.

The getSCEV Entry Point

sub_DD8400 (getSCEV) is the single entry point for obtaining a SCEV expression for any LLVM Value*. Every consumer -- LoopVectorize, LoopUnroll, LSR, IndVarSimplify, LoopInterchange -- calls this function. The algorithm:

SCEV* getSCEV(SE *se, Value *V) {
    // 1. Memo-table check
    SCEV *cached = lookupSCEV(se, V);      // sub_D98300
    if (cached) return cached;

    // 2. Dispatch based on mode
    if (se->simple_mode == 0) {
        // NORMAL PATH
        CallingConv cc = V->getParent()->getParent()->getCallingConv();
        if (cc == 42 || cc == 43) {
            // PTX kernel entry: bypass budget entirely
            return createSCEV(se, V);
        }
        se->recursion_count++;
        if (se->recursion_count <= MaxRecursionDepth) {
            return createSCEV(se, V);
        }
        return getUnknown(se, V);           // budget exceeded
    }

    // NVIDIA SIMPLE MODE (complexity control)
    if (se->failure_count > MaxExprFailures) {
        SCEV *u = getUnknown(se, V);
        insertSCEV(se, V, u);              // cache the Unknown
        return u;
    }
    uint64_t complexity = computeExprSize(se, V);  // sub_DB3670
    if (complexity > MaxExprSize) {
        se->failure_count++;
        SCEV *u = getUnknown(se, V);
        insertSCEV(se, V, u);
        return u;
    }
    // Expression is small enough: run normal path with mode toggled off
    se->simple_mode = 0;
    se->recursion_count = 0;
    SCEV *result = createSCEV(se, V);
    se->simple_mode = 1;
    return result;
}

The PTX kernel bypass (calling conventions 42 and 43) is significant: kernel functions always receive full SCEV analysis regardless of budget. NVIDIA considers kernels important enough that truncating their analysis would lose more performance than the extra compile time costs. Device helper functions, by contrast, are subject to the budget.

NVIDIA Simple Mode (Complexity Control)

Upstream LLVM uses a single recursion counter to bound getSCEV. NVIDIA replaces this with a staged gating system called simple_mode (enabled by the scalar-evolution-complexity-control flag, default true). The system is stored entirely in four fields of the SE object:

| Offset | Type | Field | Role |
|---|---|---|---|
| +1560 | uint8 | simple_mode | 0 = normal (upstream-style), 1 = NVIDIA complexity control |
| +1564 | uint32 | failure_count | Running count of instructions classified as SCEVUnknown by the size gate |
| +1568 | uint32 | recursion_count | Upstream-style depth counter, only active when simple_mode == 0 |
| +1572 | uint8 | complexity_config | Tuning bits read by the expression size scorer |

When scalar-evolution-complexity-control is true (the default), the SE constructor initializes simple_mode to 1. The gating operates in three stages:

Stage 1 -- Failure gate. Before scoring anything, getSCEV checks failure_count > scalar-evolution-max-expr-failures (global qword_4F88348, default 100). If the function has already exceeded the failure budget, the instruction is classified as SCEVUnknown, the result is cached via sub_DB77A0 (insertSCEV), and control returns immediately. This prevents a single pathological function from burning O(N^2) time trying to score thousands of instructions that will all fail.

Stage 2 -- Expression size scoring. The scorer sub_DB3670 (expressionComplexity, 35KB binary, self-recursive) estimates how large the resulting SCEV expression tree would be. It walks the instruction's def-use chain bottom-up, counting nodes and weighting by expression kind:

uint64_t expressionComplexity(SE *se, Value *V) {
    // sub_DB3670 -- self-recursive, calls sub_CF4090 for SCEV node size
    if (V is Constant)     return 1;
    if (V is Argument)     return 1;
    if (!isSCEVable(V))    return 0;      // non-integer/pointer: free

    // Look up V in the SCEV cache; if already a SCEV node,
    // delegate to the node-size estimator
    SCEV *cached = lookupSCEV(se, V);
    if (cached)
        return sub_CF4090(cached);         // count nodes in SCEV tree

    // Not yet in cache: estimate from instruction structure
    Instruction *I = dyn_cast<Instruction>(V);
    if (!I) return 1;

    uint64_t score = 1;                    // 1 for this node
    uint32_t depth = 0;
    Loop *L = LoopInfo->getLoopFor(I);
    if (L) {
        depth = L->getLoopDepth();
        score += depth;                    // loop nesting multiplier
    }

    // Walk operands, accumulating recursively
    for (unsigned i = 0; i < I->getNumOperands(); i++) {
        score += expressionComplexity(se, I->getOperand(i));
    }

    // Apply configuration scaling from SE+1572
    if (se->complexity_config & 0x1)
        score = score * 3 / 2;            // 50% penalty for aggressive mode
    if (se->complexity_config & 0x2)
        score += depth * 2;               // extra loop nesting weight

    return score;
}

The helper sub_CF4090 counts nodes in an existing SCEV expression tree: it returns 1 for SCEVConstant and SCEVUnknown, recurses into operands for SCEVAddExpr/SCEVMulExpr/SCEVAddRecExpr (summing child sizes + 1), and handles casts (Truncate/ZeroExtend/SignExtend) as 1 + child size. The node-size estimate is precise because SCEV expressions are uniqued -- the same sub-expression pointer is never double-counted within a single scoring call.
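
The node-counting scheme can be sketched as a toy model of the behavior described above. Everything here (the Expr type, the kind names, the visited set) is illustrative, not recovered from the binary; the visited set stands in for the uniquing guarantee that a shared sub-expression is the same pointer:

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Toy model of the sub_CF4090 node counter: leaves count 1, interior
// nodes count 1 plus their children. Kind names are illustrative.
struct Expr {
    enum Kind { Constant, Unknown, Cast, NAry } kind;
    std::vector<const Expr*> ops;   // empty for leaves, one child for casts
};

// Expressions are uniqued, so a shared subtree is literally the same
// pointer; tracking visited pointers keeps it from being double-counted.
static std::size_t exprSizeImpl(const Expr* e,
                                std::unordered_set<const Expr*>& seen) {
    if (!seen.insert(e).second) return 0;   // already counted
    std::size_t n = 1;                      // this node
    for (const Expr* op : e->ops) n += exprSizeImpl(op, seen);
    return n;
}

std::size_t exprSize(const Expr* e) {
    std::unordered_set<const Expr*> seen;
    return exprSizeImpl(e, seen);
}
```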

If the total score exceeds scalar-evolution-max-expr-size (global dword_4F88428, default 384), the instruction is classified as SCEVUnknown and failure_count is incremented. The SCEVUnknown result is cached immediately so that later queries from different loop passes return instantly rather than re-running the scorer.

Stage 3 -- Mode toggle. When an instruction passes the size check (score <= 384), simple_mode is temporarily set to 0 and the recursion counter reset to 0 before calling createSCEV:

se->simple_mode = 0;        // disable complexity gating
se->recursion_count = 0;    // reset upstream counter for this sub-tree
SCEV *result = createSCEV(se, V);
se->simple_mode = 1;        // restore

This prevents double budget-checking: the upstream recursion counter inside createSCEV starts from 0 for the sub-expression tree rather than inheriting a parent depth. Each createSCEV call thus gets a fresh budget of scalar-evolution-max-recursion-depth (default 100) for its own sub-tree.

Practical effect: GPU kernels with hundreds of address computations (common in tiled matrix multiply, convolution stencils) hit the complexity wall early for outer variables, but the important inner loop induction variables -- which have simple affine structure -- always get analyzed. The two-stage gate (score first, then depth-limit) avoids the upstream problem where a single deep operand chain exhausts the entire recursion budget for the function.

Why not just raise the upstream recursion limit? The upstream counter is a global depth counter -- raising it means every instruction in the function gets more budget, including ones that will never produce useful SCEV expressions. The NVIDIA approach is per-instruction: each instruction is independently scored, and only instructions with manageable complexity get the full treatment. This keeps total SCEV compile time bounded at O(N * max_expr_size) rather than O(N * max_recursion_depth^2).

Worklist-Driven createSCEV

sub_DD8130 implements a non-recursive worklist to avoid deep stack frames. NVIDIA replaced the upstream recursive createSCEV with this iterative approach to handle GPU kernels that can have extremely deep expression trees (deeply nested address computations involving multiple grid dimensions).

The worklist stores Value* pointers with tag bits in the low 3 bits:

BitMeaning
Bit 2 (0x4)First visit: needs full createNodeForInstruction
Bits 0-1 clearPost-processing: operands have been evaluated, collect results

Algorithm:

  1. Push initial value with bit 2 set.
  2. Pop top entry.
    • If bit 2 set: call sub_DD80F0 (createSCEV wrapper), which checks isSCEVable(V->getType()) via sub_D97040, then delegates to sub_DD65B0 (createNodeForInstruction).
    • If the result is immediately available: cache it via sub_DB77A0 and continue.
    • If operands are needed: push operands (without bit 2) for deferred processing.
  3. Repeat until worklist empty.
  4. Return lookupSCEV(initial_value).

The isSCEVable check (sub_D97040) accepts integer types and pointer types. Floating-point values and aggregate types produce SCEVUnknown.
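
A minimal model of the tagged-pointer worklist above, assuming nodes are at least 8-byte aligned so the low 3 bits are free for tags. The Node type, the cache, and the "every interior op is an add" folding are illustrative stand-ins, not recovered structures:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy model of the sub_DD8130 worklist. Bit 2 set = first visit (push
// operands); bits clear = operands already evaluated, fold the result.
struct Node { int constant; std::vector<Node*> ops; };  // leaf if ops empty

static constexpr std::uintptr_t kFirstVisit = 0x4;

int evaluate(Node* root) {
    std::unordered_map<Node*, int> cache;   // stands in for ValueExprMap
    std::vector<std::uintptr_t> worklist;
    worklist.push_back(reinterpret_cast<std::uintptr_t>(root) | kFirstVisit);
    while (!worklist.empty()) {
        std::uintptr_t entry = worklist.back(); worklist.pop_back();
        Node* n = reinterpret_cast<Node*>(entry & ~std::uintptr_t(0x7));
        if (entry & kFirstVisit) {
            if (cache.count(n)) continue;                // memoized
            if (n->ops.empty()) { cache[n] = n->constant; continue; }
            // Re-push untagged for post-processing, then push operands.
            worklist.push_back(reinterpret_cast<std::uintptr_t>(n));
            for (Node* op : n->ops)
                worklist.push_back(reinterpret_cast<std::uintptr_t>(op) | kFirstVisit);
        } else {
            int sum = 0;                    // model every interior op as add
            for (Node* op : n->ops) sum += cache.at(op);
            cache[n] = sum;
        }
    }
    return cache.at(root);
}
```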

Instruction Decomposer

Before the main opcode dispatch, sub_D94080 (decomposeIRInstruction) analyzes each instruction and fills a 48-byte decomposition struct:

struct SCEVDecomp {          // 48 bytes
    uint32_t kind;           // +0   decomposition opcode
    void    *operandL;       // +8   left operand (Value*)
    void    *operandR;       // +16  right operand (Value*)
    bool     hasNUW;         // +24  no-unsigned-wrap flag
    bool     hasNSW;         // +25  no-signed-wrap flag
    void    *extra;          // +32  third operand / loop variable
    bool     valid;          // +40  decomposition succeeded
};

The decomposer extracts NUW/NSW flags from inst->byte[1] (bit 2 = NUW, bit 1 = NSW), and these flags are only captured for opcodes matching the bitmask 0x40540000000000 -- covering add, sub, mul, shl, and related flag-bearing arithmetic. The kind field values:

| Kind | Decimal | SCEV Construction |
|---|---|---|
| 0x0D | 13 | Add/Sub -- iterative addend collection |
| 0x0F | 15 | MulRec -- multiply-recurrence (loop-carried) |
| 0x11 | 17 | Multiply -- iterative multiplicand collection |
| 0x13 | 19 | UDiv |
| 0x16 | 22 | UMax select pattern |
| 0x19 | 25 | Shl -- converted to multiply by 2^N |
| 0x1A | 26 | Generic shift/bitop fallback |
| 0x1B | 27 | LShr -- complex truncate+extend chain |
| 0x1C | 28 | AShr -- sign-extend analysis |
| 0x1D | 29 | ICmp / comparison |
| 0x1E | 30 | And (bitwise) -- pointer truncation patterns |

The decomposer includes a GPU-specific PHI detection path (kind 64): when a PHI node's incoming value chain traces through a comparison instruction (byte == 0x55) whose operand is a function-entry value (byte == 0) that resolves to one of the recognized NVIDIA builtins (intrinsic IDs 312, 333, 339, 360, 369, 372), the decomposer creates a specialized recurrence form. This is how threadIdx.x-bounded loop variables become proper AddRec expressions.

createNodeForInstruction: The Core Builder

sub_DD65B0 (1103 lines) is the largest function in the SCEV subsystem. It operates in three phases:

Phase 1: Fast Path (lines 300-312)

Checks the instruction's type byte. Constants (byte 17) go directly to getConstant. Non-instruction values go to getUnknown. Real instructions check loop depth via LoopInfo -- if the instruction's loop nesting exceeds the maximum tracked depth, it bails to getUnknown with a simplified operand from sub_ACADE0.

Phase 2: Decomposition-Based Dispatch (lines 336-933)

After calling the instruction decomposer, dispatches on decomp.kind:

Add/Sub (kind 13): Iteratively collects addends into a SmallVector. For each operand with a non-zero extra field (the loop iteration variable), checks the SCEV cache, and if the operand has a known loop context (from sub_DD86E0 / getLoopForExpr), builds an SCEVAddRecExpr. Otherwise recursively calls getSCEV and optionally negates (for subtraction via getNegativeSCEV). Final result: getAddExpr(collected_operands).

Multiply (kind 17): Same iterative structure as Add but builds getMulExpr. For loop-carried chains, constructs getAddRecExpr(start, step, flags).

Shl (kind 25): Converts shift-left to multiplication by a power of two. When the shift amount is a constant: extracts the shift amount, verifies it fits in the type width (sub_986EE0), then builds getMulExpr(getSCEV(base), getConstant(1 << shamt), flags). Handles nested shl-of-shl by re-decomposing.

LShr (kind 27): When shifting right by a constant amount, builds a chain of getMulExpr + getTruncateExpr + getZeroExtendExpr to represent the bit extraction pattern. Falls back for non-constant shifts.

AShr (kind 28): Complex bit-extraction logic. For constant shifts, analyzes known bits to determine whether the shift extracts only zeros from the sign position. If provable, builds getSignExtendExpr(getTruncateExpr(getSCEV(base), intermediate_type), original_type). For non-constant shifts, tries SMin/SMax pattern matching.

And (kind 30): Handles pointer truncation patterns. When the mask equals (1 << ptr_bits) - 1 (a ptrtoint-then-mask pattern), builds getPtrToIntExpr + getSignExtendExpr. Otherwise bails.
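
The constant-shift legality guard in the Shl rule (kind 25) reduces to a small check before the multiplier is built; a sketch, with the function name ours:

```cpp
#include <cstdint>
#include <optional>

// Sketch of the kind-25 rule: shl %x, C becomes mul %x, (1 << C), provided
// the shift amount fits in the type width (the role the text attributes to
// sub_986EE0). Returns the multiplier, or nothing when the shift would be
// out of range and construction must bail.
std::optional<std::uint64_t> shlToMulConstant(std::uint64_t shamt,
                                              unsigned bitWidth) {
    if (shamt >= bitWidth) return std::nullopt;   // over-shift: bail
    return std::uint64_t(1) << shamt;             // 2^shamt multiplier
}
```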

Phase 3: Opcode-Based Dispatch (lines 936-1101)

Handles instructions not captured by the decomposer. The normalized opcode maps raw instruction bytes to semantic categories:

Call/Intrinsic (cases 5, 56): First tries the intrinsic SCEV lookup table (sub_B494D0). For known intrinsics, dispatches on intrinsic ID:

| ID | Hex | SCEV Construction | Likely Intrinsic |
|---|---|---|---|
| 1 | 0x001 | getNotSCEV(op0) | bitwise NOT |
| 7 | 0x007 | getSCEV(op0) (identity) | llvm.assume |
| 292 | 0x124 | getSCEV(op0) (identity) | PTX intrinsic passthrough |
| 329 | 0x149 | getUMinExpr(op0, op1) | llvm.umin |
| 330 | 0x14A | getSMinExpr(op0, op1) | llvm.smin |
| 344 | 0x158 | getSCEV(op0) (identity) | passthrough |
| 359 | 0x167 | getSMinExpr + getUDivExpr + getAddExpr | complex min/div |
| 365 | 0x16D | getSMaxExpr(op0, op1) | llvm.smax |
| 366 | 0x16E | getSMinExpr(op0, op1) | llvm.smin variant |
| 371 | 0x173 | getAddRecExpr(op0, getUDivExpr(op0, op1)) | recurrence with division |
| 493 | 0x1ED | getConstant(inst->qword[1]) | constant from intrinsic metadata |

PHI Node (case 34): Dispatches to sub_DD92B0 (createNodeForPHI). Walks PHI incoming values, checks for loop recurrence. If the PHI forms a recurrence: builds {start, +, step} as an SCEVAddRecExpr. Otherwise returns SCEVUnknown.

GEP (case 47): Calls sub_DD3A70 (getGEPExpr). Computes the SCEV of the base pointer, then adds the SCEV of each index scaled by the element size. If the result is SCEVUnknown, bails.

Casts (cases 38-40): Trunc produces getTruncateExpr. SExt produces getSignExtendExpr. ZExt has a special optimization: if the source decomposes as a multiply-recurrence (kind 15), it builds separate zero-extensions of start and step, then constructs getAddRecExpr(zext(start), zext(step), NUW) -- preserving the recurrence structure across the extension.

BitCast/AddrSpaceCast (case 49): If both source and target types are SCEV-able, returns getSCEV(source) (transparent). Otherwise getUnknown.

Select (cases 20, 23): If condition and true-value are loop-invariant (sub_DBED40), builds getUDivExpr (case 20) or getUMaxExpr (case 23) of the branches.

GPU-Specific SCEV Sources

Thread and Block Index Builtins

When the instruction decomposer encounters a PHI whose incoming value chain traces to one of NVIDIA's special register intrinsics, it recognizes it as a bounded induction variable. The recognized intrinsic IDs and their SCEV significance:

| Intrinsic ID | CUDA Variable | SCEV Range Bound |
|---|---|---|
| 312 | blockDim.x / gridDim.x | Dimension query -- provides trip count upper bound |
| 333 | threadIdx.x | Range: [0, blockDim.x) |
| 339 | threadIdx.y / blockIdx.x | Range: [0, blockDim.y) or [0, gridDim.x) |
| 360 | threadIdx.z / blockIdx.y | Range: [0, blockDim.z) or [0, gridDim.y) |
| 369 | blockIdx.z | Range: [0, gridDim.z) |
| 372 | warpSize / laneid | Range: [0, 32) (constant on all architectures) |

These ranges are injected during SCEV construction, not during range analysis. When a PHI node tests a value against threadIdx.x (for example, a loop for (int i = threadIdx.x; i < N; i += blockDim.x)), the decomposer produces an SCEVAddRecExpr whose start value carries the constraint [0, blockDim.x). This propagates through all downstream SCEV consumers.

The CUDA variable to LLVM intrinsic mapping is:

| CUDA | LLVM Intrinsic | PTX Register |
|---|---|---|
| threadIdx.x | @llvm.nvvm.read.ptx.sreg.tid.x | %tid.x |
| threadIdx.y | @llvm.nvvm.read.ptx.sreg.tid.y | %tid.y |
| threadIdx.z | @llvm.nvvm.read.ptx.sreg.tid.z | %tid.z |
| blockDim.x | @llvm.nvvm.read.ptx.sreg.ntid.x | %ntid.x |
| blockIdx.x | @llvm.nvvm.read.ptx.sreg.ctaid.x | %ctaid.x |
| gridDim.x | @llvm.nvvm.read.ptx.sreg.nctaid.x | %nctaid.x |

PTX Kernel Calling Convention Bypass

Functions with calling convention 42 or 43 (PTX __global__ kernels) bypass the SCEV recursion budget entirely. The rationale: kernels are the units of work the programmer explicitly marked for GPU execution. Spending extra compile time to fully analyze their loop structure always pays off because:

  1. Kernels are where vectorization decisions have the highest payoff.
  2. GPU hardware constraints (occupancy, shared memory) demand precise trip count knowledge.
  3. Kernel functions are few per compilation unit, so the budget bypass does not cause compile-time explosion.

Device functions (__device__, conventions other than 42/43) remain subject to the standard budget.

Warp-Stride and Grid-Stride Loop Patterns

Two CUDA-specific loop idioms produce distinctive SCEV expressions. Neither has an analog in CPU code, and cicc's SCEV subsystem recognizes both at construction time -- not as a post-hoc pattern match.

Warp-Stride Loop

for (int i = threadIdx.x; i < N; i += warpSize) { ... }

The PHI decomposer (sub_D94080) recognizes the increment value as the constant 32 (warpSize is a compile-time constant on all NVIDIA architectures). The resulting SCEV:

{threadIdx.x, +, 32}<nuw><loop>
  • Start: SCEVUnknown(@llvm.nvvm.read.ptx.sreg.tid.x), range [0, blockDim.x) (injected from the builtin table, intrinsic ID 333).
  • Step: SCEVConstant(32).
  • Flags: NUW (no-unsigned-wrap) is set because the start is non-negative and the step is positive. The PHI decomposer sets this flag when the incoming value (intrinsic ID 372 = warpSize) resolves to a constant and the start range has a non-negative lower bound.
  • Trip count: The backedge-taken count (sub_DB9E00) computes:
    BTC = udiv(N - threadIdx.x + 31, 32)
        = udiv(sext(N) - sext(start) + step - 1, step)
    
    This is the standard SCEV computeExitCountFromICmpUN path for i < N with stride 32.

The NUW flag is critical: it allows the loop vectorizer to prove that the induction variable never wraps, enabling vectorization without a runtime overflow check. Without the warp-stride recognition, the vectorizer would see SCEVUnknown(threadIdx.x) as an opaque value and conservatively assume wrapping is possible.
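
The udiv formula quoted above can be checked against a direct simulation of the warp-stride loop. The function names here are ours, for the sketch only:

```cpp
#include <cstdint>

// Iteration count of `for (i = tid; i < N; i += 32)` per the formula
// udiv(N - tid + step - 1, step), guarded for the empty-loop case.
std::uint32_t warpStrideTrips(std::uint32_t tid, std::uint32_t N) {
    if (tid >= N) return 0;
    return (N - tid + 31) / 32;
}

// Brute-force simulation of the same loop, for cross-checking.
std::uint32_t simulate(std::uint32_t tid, std::uint32_t N) {
    std::uint32_t trips = 0;
    for (std::uint32_t i = tid; i < N; i += 32) ++trips;
    return trips;
}
```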

Grid-Stride Loop

for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { ... }

The instruction decomposer traces through the PHI's increment chain. The addition blockDim.x * gridDim.x is recognized as two calls to special register intrinsics (IDs 312 for blockDim.x and 312 again for gridDim.x) combined in a multiply. The resulting SCEV:

{blockIdx.x * blockDim.x + threadIdx.x, +, blockDim.x * gridDim.x}<loop>

Decomposition detail:

  • Start: SCEVAddExpr(SCEVMulExpr(SCEVUnknown(blockIdx.x), SCEVUnknown(blockDim.x)), SCEVUnknown(threadIdx.x)).
    • blockIdx.x (ID 339): range [0, gridDim.x).
    • blockDim.x (ID 312): range [1, 1024] (hardware limit).
    • threadIdx.x (ID 333): range [0, blockDim.x).
    • The combined start range is [0, gridDim.x * blockDim.x) = [0, total_threads).
  • Step: SCEVMulExpr(SCEVUnknown(blockDim.x), SCEVUnknown(gridDim.x)) -- this is the total grid size. Both operands are SCEVUnknown values with ranges from the builtin table.
  • Trip count: computeBackedgeTakenCount (sub_DB9E00) produces:
    BTC = udiv(N - start + step - 1, step)
    
    where start and step are symbolic. The trip count itself is SCEVUnknown (the exact value depends on runtime launch configuration), but the maximum trip count can be bounded using the range constraints.

Delinearization of Grid-Stride Patterns

The delinearization system (sub_DE9D10, documented in SCEV Invalidation & Delinearization) specifically recognizes the grid-stride pattern. In the ZeroExtend/SignExtend handlers (cases 3 and 4 of the delinearizer), when an AddRecExpr's step matches the delinearization context's step_recurrence field (ctx+0x68), the following sequence runs:

  1. The delinearizer checks if step == blockDim.x * gridDim.x by comparing the step SCEV pointer against ctx[+0x68].
  2. If matched and the AddRec has exactly 2 operands (start + step), the delinearizer treats this as a dimension boundary -- the step represents the stride of the outer dimension in a multi-dimensional array access.
  3. The dimension size is extracted and added to the term collector at ctx[+0x58]. The element count is obtained via sub_D33D80 (getElementSize) and sub_DA4270 (getConstant).
  4. The delinearizer reconstructs the multi-dimensional subscript by applying getZeroExtendExpr (or getSignExtendExpr) to the start and step separately, preserving the recurrence structure across the extension.

This is how cicc recovers the original multi-dimensional array indices from grid-stride loops over flattened arrays -- essential for dependence analysis in LoopVectorize and LoopInterchange.
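
What delinearization recovers can be illustrated by the inverse mapping: given a flattened subscript i = row * W + col with 0 <= col < W, the two subscripts come back out by division and remainder. This sketch shows only the arithmetic, not the binary's SCEV-level matching:

```cpp
#include <cstdint>
#include <utility>

// Split a flattened 2-D subscript back into (outer, inner) given the
// outer-dimension stride W -- the quantity the delinearizer identifies
// as the step (e.g. blockDim.x * gridDim.x for grid-stride loops).
std::pair<std::uint64_t, std::uint64_t>
delinearize2D(std::uint64_t linear, std::uint64_t W) {
    return { linear / W, linear % W };   // (outer subscript, inner subscript)
}
```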

Block-Stride Loop (Variant)

A less common but recognized pattern:

for (int i = threadIdx.x; i < N; i += blockDim.x) { ... }

Produces: {threadIdx.x, +, blockDim.x}<loop>. The step is SCEVUnknown(blockDim.x) with range [1, 1024]. The trip count is udiv(N - threadIdx.x + blockDim.x - 1, blockDim.x) -- symbolic but bounded. This pattern is common in reduction kernels and shared-memory tiling.

Aggressive Positive Stride Analysis

The NVIDIA-specific knob aggressive-positive-stride-analysis (see nvbug 3972412) enables additional reasoning about stride signs. When enabled, the SCEV range analysis assumes that strides derived from blockDim.x, gridDim.x, and warpSize are always positive (range [1, ...) rather than [0, ...)). This allows the loop vectorizer and LSR to prove monotonic increase of induction variables, eliminating runtime overflow checks. The knob is registered in ctor_131_0 (constructor at 0x4E1CD0 area) and can be disabled via -no-aggressive-positive-stride-analysis.

The special-reassociate-for-threadid knob (description: "Don't move back expressions with threadid") prevents SCEV-based reassociation from hoisting threadIdx.x expressions out of their canonical position. Without this guard, the reassociator might combine threadIdx.x + offset into a form that obscures the warp/grid-stride pattern for downstream consumers.

SCEV Expression Types and the FoldingSet

SCEV expressions are uniqued in a FoldingSet (LLVM's hash-based deduplication container). Each expression type is identified by a uint16 opcode at scev_expr+24:

| Opcode | Type | Operands | Notes |
|---|---|---|---|
| 0 | SCEVConstant | 1 (APInt) | Leaf: integer constant |
| 1 | SCEVUnknown | 1 (Value*) | Leaf: opaque value, possibly with range info |
| 2 | SCEVTruncateExpr | 1 + type | Truncation cast |
| 3 | SCEVZeroExtendExpr | 1 + type | Zero extension |
| 4 | SCEVSignExtendExpr | 1 + type | Sign extension |
| 5 | SCEVAddExpr | N-ary | Commutative sum |
| 6 | SCEVMulExpr | N-ary | Commutative product |
| 7 | SCEVUDivExpr | 2 | Unsigned division |
| 8 | SCEVAddRecExpr | 2+ (start, step, ...) | {start, +, step}<loop> recurrence |
| 9 | SCEVSMaxExpr | N-ary | Signed maximum |
| 10 | SCEVUMaxExpr | N-ary | Unsigned maximum |
| 11 | SCEVSMinExpr | N-ary | Signed minimum |
| 12 | SCEVUMinExpr | N-ary | Unsigned minimum |
| 13 | (variant min/max) | N-ary | Additional min/max form |
| 14 | SCEVCouldNotCompute | 0 | Sentinel: analysis failed |
| 15 | SCEVSequentialUMinExpr | N-ary | Short-circuit unsigned min |

The expression node layout:

| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Vtable / tag |
| +24 | 2 | Opcode (SCEV kind) |
| +28 | 2 | Flags: NUW=0x2, NSW=0x4 |
| +32 | 8 | Operand array pointer or first operand |
| +40 | varies | Operand count (for N-ary) or second operand |

Pointer comparisons suffice for SCEV equality because of the uniquing: two SCEV* values are equal if and only if they point to the same node.
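
A minimal sketch of why pointer equality works, assuming a FoldingSet-like interning table. The real FoldingSet hashes a node profile; std::map keyed on (opcode, operands) is used here only for brevity:

```cpp
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Structurally identical expressions are interned once, so pointer
// comparison is value comparison. Types and names are illustrative.
struct Expr { int opcode; std::vector<const Expr*> ops; };

class Interner {
    using Key = std::pair<int, std::vector<const Expr*>>;
    std::map<Key, std::unique_ptr<Expr>> table;
public:
    const Expr* get(int opcode, std::vector<const Expr*> ops) {
        auto& slot = table[{opcode, ops}];          // find-or-create
        if (!slot)
            slot = std::make_unique<Expr>(Expr{opcode, std::move(ops)});
        return slot.get();                          // canonical node
    }
};
```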

SCEV Constructor Functions

Each expression type has a dedicated constructor that canonicalizes and deduplicates:

| Address | Function signature |
|---|---|
| sub_DC8BD0 | getAddExpr(SmallVector &operands, flags, depth) |
| sub_DC7ED0 | getAddExpr(SCEV *a, SCEV *b, flags, depth) |
| sub_DCA690 | getMulExpr(SCEV *a, SCEV *b, flags, depth) |
| sub_DCC810 | getAddRecExpr(SCEV *start, SCEV *step, flags, depth) |
| sub_DCB270 | getUDivExpr(SCEV *lhs, SCEV *rhs) |
| sub_DCFA50 | getUMaxExpr(SCEV *a, SCEV *b) |
| sub_DCEE80 | getSMinExpr(SCEV *a, SCEV *b) |
| sub_DCE050 | getSMaxExpr(SCEV *a, SCEV *b) |
| sub_DCDFA0 | getUMinExpr(SCEV *a, SCEV *b) |
| sub_DC5200 | getTruncateExpr(SCEV *op, Type *ty, depth) |
| sub_DC5000 | getZeroExtendExpr(SCEV *op, Type *ty, depth) |
| sub_DC2B70 | getSignExtendExpr(SCEV *op, Type *ty, depth) |
| sub_DD1D00 | getPtrToIntExpr(SCEV *ptr) |
| sub_DA26C0 | getConstant(APInt val) |
| sub_DA3860 | getUnknown(Value *V) |
| sub_DCAF50 | getNegativeSCEV(SCEV *expr, flags) |
| sub_DCE000 | getNotSCEV(SCEV *expr, bool isNSW) -- -1 - x |

The N-ary constructors (getAddExpr, getMulExpr, min/max) canonicalize operand order and fold constants. For example, getAddExpr({5, x, 3}) folds to getAddExpr({8, x}) and orders the constant first.
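
The constant-folding step described above can be sketched as follows. Term and addExpr are illustrative names; the real getAddExpr also sorts symbolic operands and merges AddRecs, which this sketch omits:

```cpp
#include <vector>

// A toy addend: either a constant or an opaque symbolic term named by id.
struct Term { bool isConst; long long value; int id; };

// Fold all constant addends into one and emit it first, mirroring the
// canonicalization getAddExpr({5, x, 3}) -> getAddExpr({8, x}).
std::vector<Term> addExpr(const std::vector<Term>& ops) {
    long long c = 0;
    bool haveConst = false;
    std::vector<Term> symbolic;
    for (const Term& t : ops) {
        if (t.isConst) { c += t.value; haveConst = true; }
        else symbolic.push_back(t);
    }
    std::vector<Term> out;
    if (haveConst && c != 0)
        out.push_back({true, c, 0});     // folded constant, ordered first
    out.insert(out.end(), symbolic.begin(), symbolic.end());
    return out;
}
```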

The SCEV Cache

The primary SCEV cache (ValueExprMap) maps Value* to SCEV* using an open-addressed hash table with the standard hash function used throughout cicc's SCEV subsystem:

slot = ((uint32_t)key >> 9) ^ ((uint32_t)key >> 4)
slot &= (capacity - 1)

Sentinels: EMPTY = 0xFFFFFFFFFFFFF000 (-4096), TOMBSTONE = 0xFFFFFFFFFFFFE000 (-8192). Capacity is always a power of two. Growth occurs at 75% load factor (doubling), and in-place rehashing (tombstone cleanup) triggers when fewer than 1/8 of slots are truly empty.
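
A sketch of the open-addressed probe under the hash and sentinels stated above. The linear probe sequence and the absence of resizing are our simplifications; only the slot hash and sentinel values come from the text:

```cpp
#include <cstdint>
#include <vector>

static constexpr std::uint64_t EMPTY     = 0xFFFFFFFFFFFFF000ull;  // -4096
static constexpr std::uint64_t TOMBSTONE = 0xFFFFFFFFFFFFE000ull;  // -8192

struct Slot { std::uint64_t key = EMPTY; std::uint64_t value = 0; };

// slot = ((uint32_t)key >> 9) ^ ((uint32_t)key >> 4), masked by capacity-1.
std::uint32_t slotFor(std::uint64_t key, std::uint32_t capacity) {
    std::uint32_t k = static_cast<std::uint32_t>(key);
    return ((k >> 9) ^ (k >> 4)) & (capacity - 1);   // capacity: power of two
}

// Insert may land on EMPTY, a TOMBSTONE, or the key's existing slot.
// Assumes the table always keeps at least one EMPTY slot (no resize here).
void insert(std::vector<Slot>& t, std::uint64_t key, std::uint64_t value) {
    std::uint32_t mask = static_cast<std::uint32_t>(t.size() - 1);
    std::uint32_t i = slotFor(key, static_cast<std::uint32_t>(t.size()));
    while (t[i].key != EMPTY && t[i].key != TOMBSTONE && t[i].key != key)
        i = (i + 1) & mask;
    t[i] = {key, value};
}

// Lookup stops only at EMPTY: tombstones keep the probe going.
const std::uint64_t* lookup(const std::vector<Slot>& t, std::uint64_t key) {
    std::uint32_t mask = static_cast<std::uint32_t>(t.size() - 1);
    std::uint32_t i = slotFor(key, static_cast<std::uint32_t>(t.size()));
    while (t[i].key != EMPTY) {
        if (t[i].key == key) return &t[i].value;
        i = (i + 1) & mask;
    }
    return nullptr;
}
```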

Cache lookup (sub_D98300) is called at the top of every getSCEV invocation. Cache store (sub_DB77A0) is called after every successful SCEV construction, and also when the complexity control bails to SCEVUnknown (caching the Unknown result prevents re-scoring the same instruction).

The simple mode's failure caching is critical for performance: once an instruction is classified as SCEVUnknown, the result is cached so that subsequent queries (from different loop analysis passes) return instantly rather than re-running the complexity scorer.

How SCEV Feeds Loop Optimizations

SCEV is consumed by every loop optimization in cicc. The key interfaces:

LoopVectorize (sub_DFAE00 and callers): Calls getBackedgeTakenCount (sub_DCF980) to determine whether the loop has a computable trip count. If not, vectorization is abandoned. Uses getSmallBestKnownTC (sub_2AA7EC0) for the trip count upper bound, which is compared against -vectorizer-min-trip-count. SCEV range analysis (sub_DBB9F0) proves that the epilogue trip count is sufficient for the minimum vector factor. Runtime SCEV overflow checks generate scev.check basic blocks.

LoopUnroll (sub_19B6690): The unroll factor selection function extracts MaxTripCount from SCEV. Runtime trip counts below flat-loop-tripcount-threshold (default 5) mark the loop as "flat" and skip unrolling. Partial unrolling requires BackedgeCount % UnrollCount computation. After unrolling, sub_2A13F00 reconciles SCEV and LoopInfo for the modified loop.

Loop Strength Reduction (sub_19A87A0): The NVIDIA custom LSR reads SCEV expressions for each loop use (base SCEV at +0, stride SCEV at +8, loop bounds at +712/+720). The formula solver generates alternatives by factoring common strides out of SCEV expressions. SCEV normalization (sub_199D980) provides canonical forms for hash-table keying.

IndVarSimplify (sub_1945A50): Uses SCEV to compute exit values, rewrite loop exit conditions, and perform LFTR (Linear Function Test Replace). NVIDIA adds two guards:

  • Disable-unknown-trip-iv (registered in ctor_203 at 0x4E1CD0, global qword_4FAF520): When set, the pass is skipped entirely for loops whose trip count is SCEVCouldNotCompute. The check in the run() wrapper (sub_19489B0, lines 119-122) calls sub_1CED350 (trip count query) and sub_1CED620 (trip count for header). This protects GPU-specific loops with divergent control flow from incorrect IV transforms.
  • iv-loop-level (default 1, global qword_4FAF440): Limits IndVarSimplify to loops at nesting depth <= the configured level. sub_193DD90 (getLoopDepth) returns 1 for outermost loops. The default restricts IV simplification to outermost loops only, avoiding compile-time explosion on deeply-nested GPU kernels (stencil, tensor code).

Loop Strength Reduction guards: NVIDIA adds disable-unknown-trip-lsr to skip LSR entirely for unknown-trip-count loops, plus lsr-check-rp / lsr-rp-limit to gate LSR on register pressure.

LoopInterchange (sub_E05-loop-interchange): Uses SCEV stride analysis to determine which loops carry memory strides. If a subscript has stride in both inner and outer loops, it is marked "ambiguous" and interchange is blocked. For grid-stride loops, the step blockDim.x * gridDim.x is recognized as an outer-loop stride, allowing interchange when the array subscript depends on a single loop dimension.

Configuration: All SCEV Knobs

NVIDIA-Specific Knobs

Knob | Default | Effect
scalar-evolution-complexity-control | true | Enables the simple_mode system
scalar-evolution-max-expr-size | 384 | Max SCEV expression complexity score before bailing to Unknown
scalar-evolution-max-expr-failures | 100 | Max bailed instructions before giving up on entire function
scalar-evolution-max-add-items | 500 | Max addends in a single SCEVAddExpr
do-sign-ext-expand | false | Expand sign-extensions during SCEV construction
do-sign-ext-simplify | (bool) | Simplify SCEV on sign-extend expressions
track-trip-count-more | true | More aggressive trip count tracking
common-factor-with-mr265 | true | SCEV common factor optimization (internal MR reference)
scalar-evolution-classify-expressions | true | Enable SCEV expression classification
aggressive-positive-stride-analysis | (bool) | Aggressive stride sign reasoning for blockDim/gridDim/warpSize (see nvbug 3972412)
special-reassociate-for-threadid | (bool) | Prevent hoisting threadIdx expressions out of canonical position
Disable-unknown-trip-iv | (bool) | Skip IndVarSimplify for loops with SCEVCouldNotCompute trip count
disable-unknown-trip-lsr | (bool) | Skip Loop Strength Reduction for unknown-trip-count loops
iv-loop-level | 1 | Max loop nesting depth for IndVarSimplify (1 = outermost only)
scev-cgp-tid-max-value | (int) | Max value of thread ID for SCEV-CGP address mode optimization

Upstream LLVM Knobs (Preserved in cicc)

Knob | Default | Effect
scalar-evolution-max-recursion-depth | 100 | Hard counter for getSCEV depth in normal mode
scalar-evolution-max-iterations | 100 | Max iterations for constant evolution
scalar-evolution-max-arith-depth | 32 | Max arithmetic simplification depth
scalar-evolution-max-cast-depth | 8 | Max cast folding depth
scalar-evolution-max-ext-depth | 8 | Max extension analysis depth
scalar-evolution-max-constant-evolving-depth | 32 | Max depth for constant evolving analysis
scalar-evolution-max-scev-compare-depth | 32 | Max depth for SCEV comparison
scalar-evolution-max-scev-operations-implication-depth | 2 | Max depth for implication reasoning
scalar-evolution-max-value-compare-depth | 2 | Max depth for value comparison
scev-mulops-inline-threshold | 32 | Max multiply operands before outline
scev-addops-inline-threshold | 500 | Max add operands before outline
verify-scev | false | Enable SCEV verification
verify-scev-strict | false | Stricter SCEV verification
verify-scev-maps | false | Verify SCEV map consistency

SCEV Global Variables (Binary Addresses)

Global | Knob String | Default | Used By
dword_4F88268 | scalar-evolution-max-recursion-depth | 100 | getSCEV normal mode depth counter
qword_4F88348 | scalar-evolution-max-expr-failures | 100 | getSCEV simple mode failure gate
dword_4F88428 | scalar-evolution-max-expr-size | 384 | expressionComplexity size threshold
qword_4F88DC8 | (loop iteration bound) | -- | Exit analysis iteration limit
qword_4F88EA8 | (range recursion limit) | -- | getRangeRef max recursion depth

SCEV-CGP Knobs (Address Mode Optimization)

Knob | Effect
do-scev-cgp | Enable SCEV-based CodeGenPrepare
do-scev-cgp-aggresively | Aggressive mode (sic -- typo preserved in binary)
do-function-scev-cgp | Function-level SCEV-CGP
nv-disable-scev-cgp | Disable the SCEV-CGP pass entirely
scev-cgp-control | Control number of transformations
scev-cgp-cross-block-limit | Max common bases from a single block
scev-cgp-idom-level-limit | Limit IDOM traversal level
scev-cgp-inst-limit | Max instructions considered per parameter
scev-cgp-old-base | Use old base computation method
scev-cgp-tid-max-value | Max thread ID value for address mode analysis
print-after-scev-cgp | Print function IR after SCEV-CGP

Differences from Upstream LLVM

The cicc v13.0 SCEV subsystem diverges from upstream LLVM 20.0.0 ScalarEvolution.cpp in the following ways:

Feature | Upstream LLVM | cicc v13.0
Budget system | Single recursion_count depth counter | Two-stage: expression size scoring (sub_DB3670) + failure counting, toggled via simple_mode flag
Kernel bypass | No concept of calling convention bypass | CC 42/43 (PTX __global__) bypass all SCEV budgets
createSCEV | Recursive | Non-recursive worklist (sub_DD8130) to handle deep GPU expression trees
GPU builtin ranges | No thread/block index knowledge | Intrinsic IDs 312/333/339/360/369/372 inject ranges at SCEV construction time
PHI decomposition | Standard recurrence detection | GPU-specific path (kind 64) traces PHI chains through NVIDIA special register intrinsics
Delinearization | Standard dimension recovery | Polymorphic predicate collector recognizes grid-stride patterns; step_recurrence field enables GPU memory coalescing analysis
Trip count tracking | Standard | track-trip-count-more (default true) enables more aggressive BTC computation
Stride sign reasoning | Standard | aggressive-positive-stride-analysis assumes blockDim/gridDim/warpSize are always positive
Expression canonicalization | Standard | special-reassociate-for-threadid prevents moving threadIdx expressions
SCEV-CGP | Not present | Complete NVIDIA SCEV-based CodeGenPrepare pass with 11 dedicated knobs
Knob count | ~15 standard knobs | 15 upstream + 15 NVIDIA-specific + 11 SCEV-CGP = ~41 total SCEV knobs

The most consequential divergence is the simple_mode system: it changes the compile-time complexity class of SCEV analysis from O(N * D^2) (where D is recursion depth) to O(N * S) (where S is the per-instruction size limit), making SCEV analysis tractable on large GPU kernels without sacrificing accuracy on the important inner-loop induction variables.

Function Map

Function | Address | Size | Role
getSCEV | sub_DD8400 | -- | Top-level entry; cache + mode dispatch
Worklist createSCEV | sub_DD8130 | -- | Non-recursive worklist driver
createSCEV wrapper | sub_DD80F0 | -- | Type check + delegate
createNodeForInstruction | sub_DD65B0 | -- | Core 3-phase opcode dispatch
decomposeIRInstruction | sub_D94080 | -- | Instruction to decomposition struct
createNodeForPHI | sub_DD92B0 | -- | PHI to AddRec conversion
createNodeForSelectOrPHI | sub_DD99C0 | -- | Select/PHI combined handler
getExistingExpr | sub_DD6410 | -- | Fast path for phi recurrence
getGEPExpr | sub_DD3A70 | -- | GEP to SCEV conversion
getLoopForExpr | sub_DD86E0 | -- | Determine loop context for expression
lookupSCEV | sub_D98300 | -- | Cache lookup (ValueExprMap)
insertSCEV | sub_DB77A0 | -- | Cache store
expressionComplexity | sub_DB3670 | -- | NVIDIA expression size scorer; self-recursive, uses sub_CF4090
SCEV node size counter | sub_CF4090 | -- | Counts nodes in existing SCEV tree for complexity scoring
getSmallConstantTripCount | sub_DB04E0 | -- | Extract small constant trip count
classifyExpressions / print | sub_1495EB0 | -- | Debug: "Classifying expressions for: "
isSCEVable | sub_D97040 | -- | Type is integer or pointer
isUnknown / isFailedSCEV | sub_D96A50 | -- | Check SCEVUnknown
getSCEVType | sub_D95540 | -- | Extract LLVM Type from SCEV expr
getTypeBitWidth | sub_D97050 | -- | Bit width of a type
lookupIntrinsicSCEV | sub_B494D0 | -- | Intrinsic fast-path table
isIntrinsicCall | sub_988010 | -- | Intrinsic detection
isLoopInvariant | sub_DBED40 | -- | Loop invariance check
isIntegerTy | sub_BCAC40 | -- | Integer type check
getRangeRef | sub_DBB9F0 | -- | ConstantRange evaluator (see range page)
computeBackedgeTakenCount | sub_DB9E00 | -- | BTC computation (see range page)
forgetLoop | sub_DE2750 | -- | Cache invalidation (see invalidation page)
delinearize | sub_DE9D10 | -- | Array delinearization (see invalidation page)

Cross-References

SCEV Range Analysis & Backedge-Taken Counts

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Every loop optimization in cicc ultimately depends on two questions: "what values can this expression take?" and "how many times does this loop iterate?" The SCEV range analysis (sub_DBB9F0, corresponding to ScalarEvolution::getRangeRef) answers the first by propagating ConstantRange intervals through SCEV expression trees. The backedge-taken count (BTC) machinery (sub_DB9E00 / sub_DB9040, corresponding to computeBackedgeTakenCount / computeExitCountForBranch) answers the second by solving loop exit conditions algebraically. The two systems feed each other: range analysis uses trip counts to bound AddRec expressions, and trip count computation uses ranges to prove overflow behavior. On GPU targets, these analyses gain additional precision from NVIDIA-specific range sources -- thread indices are bounded by block dimensions, warpSize is the constant 32, and __launch_bounds__ metadata constrains block dimensions -- all of which flow into tighter ranges and more computable trip counts.

Key Facts

Property | Value
Range evaluator | sub_DBB9F0 (0xDBB9F0), 31 KB
BTC dispatcher | sub_DCF3A0 (0xDCF3A0), mode 0=exact, 1=constant-max, 2=symbolic-max
BTC cache builder | sub_DB9E00 (0xDB9E00), 2,265 bytes
Exit count engine | sub_DB9040 (0xDB9040), 18 KB
howFarToZero | sub_DBA850 (0xDBA850), 8 KB
howManyLessThans | sub_DCE310 (0xDCE310), 317 lines
Range cache (unsigned) | scev_ctx+976, 40-byte entries, open-addressing
Range cache (signed) | scev_ctx+1008, 40-byte entries, open-addressing
BTC cache | scev_ctx+656, 168-byte entries, open-addressing
Per-exit BTC cache | scev_ctx+1168, 56-byte entries
Max range recursion depth | qword_4F88EA8 (global, configurable)
Extended exit analysis flag | qword_4F88C08 (global, enables Phase D)
NVIDIA knobs | track-trip-count-more, aggressive-positive-stride-analysis, do-sign-ext-simplify, do-sign-ext-expand

ConstantRange Propagation Algorithm

The range evaluator sub_DBB9F0 takes a SCEV expression, a signedness flag (is_signed: 0=unsigned, 1=signed), and a recursion depth counter. It returns a pointer to a cached 32-byte ConstantRange representing the half-open interval [lower, upper) with wrapping semantics. The algorithm is a recursive descent over the SCEV expression tree with aggressive caching.

Cache Structure

Two separate hash tables store signed and unsigned ranges:

if (is_signed) {
    table    = scev_ctx[+1008];   // signed range cache
    capacity = scev_ctx[+1024];
} else {
    table    = scev_ctx[+976];    // unsigned range cache
    capacity = scev_ctx[+992];
}

Each entry is 40 bytes: an 8-byte key (SCEV pointer, with 0xFFFFFFFFFFFFF000 as the empty sentinel) followed by a 32-byte ConstantRange value. The hash function is:

slot = ((uint32_t)scev_ptr >> 9) ^ ((uint32_t)scev_ptr >> 4);
slot &= (capacity - 1);  // capacity is always a power of two

Linear probing resolves collisions. On a cache hit, the function returns immediately without recomputation.
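The lookup described above can be sketched as follows (a minimal Python model of the hash-and-probe sequence; `probe` and `keys` are illustrative names, not symbols recovered from the binary):

```python
# Sketch of the range-cache lookup: truncate to 32 bits, hash, mask,
# then linear-probe. EMPTY mirrors the 0xFFFFFFFFFFFFF000 sentinel.
EMPTY = 0xFFFFFFFFFFFFF000

def probe(keys, capacity, scev_ptr):
    """Return the slot holding scev_ptr, or the first empty slot."""
    p = scev_ptr & 0xFFFFFFFF                    # binary casts to uint32 first
    slot = ((p >> 9) ^ (p >> 4)) & (capacity - 1)
    while keys[slot] != scev_ptr and keys[slot] != EMPTY:
        slot = (slot + 1) & (capacity - 1)       # linear probing, power-of-two wrap
    return slot
```

Two pointers whose hashes collide (e.g., 0x1000 and 0x2000 with capacity 8) land in adjacent slots, which is why tombstone management matters for the BTC cache discussed later.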

Dispatch by SCEV Kind

After a cache miss, the evaluator dispatches on the SCEV opcode at scev_expr+24 (uint16):

Opcode | Kind | Range Computation
0 | SCEVConstant | Single-value range from the constant's APInt
1 | SCEVUnknown | sub_988CD0: range from ValueTracking / instruction semantics
2 | SCEVTruncate | Recurse on operand, apply ConstantRange::truncate
3 | SCEVZeroExtend | Recurse on operand, apply ConstantRange::zeroExtend
4 | SCEVSignExtend | Recurse on operand, apply ConstantRange::signExtend
5 | SCEVAddExpr | Fold operand ranges with addWithNoWrap, respecting NUW/NSW
6 | SCEVMulExpr | Fold operand ranges with ConstantRange::multiply
7 | SCEVUDivExpr | ConstantRange::udiv of LHS and RHS ranges
8 | SCEVAddRecExpr | Multi-phase analysis (see below)
9-13 | SMax/UMax/SMin/UMin | Fold via lookup table dword_3F74E60[opcode-9] + sub_ABD750
14 | SCEVCouldNotCompute | Passthrough (identity range)
15 | SCEVSequentialUMin | Complex instruction-level analysis (PHI, intrinsics, metadata)

Every computed range is intersected with an initial range derived from the type's bit width and any known-bits / sign-bits information before being stored in the cache. This intersection can only narrow the range, never widen it.

Initial Range Narrowing

Before the SCEV-kind dispatch, the evaluator computes an initial range from type information:

  • Unsigned mode: calls sub_DB5510 (getKnownBits) to extract known high zero bits, constructs a range [0, 2^(bitwidth - leading_zeros)) and intersects it with the full-set range.
  • Signed mode: calls sub_DB55F0 (getNumSignBits) and constructs a symmetric signed range from the sign-bit count, e.g., if 3 sign bits are known, the range is [-2^(bw-3), 2^(bw-3)).

This pre-narrowing ensures that even when the SCEV-kind dispatch returns a full-set (e.g., for complex expressions at the depth limit), the result still reflects type-level constraints.
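The two initial ranges reduce to simple arithmetic on the bit counts. A minimal sketch of the formulas stated above (function names are illustrative):

```python
def initial_unsigned_range(bitwidth, leading_zeros):
    """Unsigned pre-narrowing: lz known high zero bits give [0, 2^(bw - lz))."""
    return (0, 1 << (bitwidth - leading_zeros))

def initial_signed_range(bitwidth, sign_bits):
    """Signed pre-narrowing: s known sign bits give [-2^(bw-s), 2^(bw-s))."""
    bound = 1 << (bitwidth - sign_bits)
    return (-bound, bound)
```

For a 32-bit value with 24 known leading zeros this yields [0, 256); with 3 known sign bits, [-2^29, 2^29).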

AddRec Range Analysis (The Core)

The SCEVAddRecExpr case (opcode 8) is the most complex, executing up to five phases that progressively narrow the range of a loop induction variable {start, +, step}:

Phase A -- NoWrap Start Refinement. If the AddRec has NUW or NSW flags (bits at scev_expr+28), the unsigned range of the start value is computed and intersected. This ensures that the IV's initial value constrains the overall range even before considering the step.

Phase B -- Step Monotonicity. If the NSW flag (bit 2, value 0x4) is set:

  • sub_DBED40 checks if all step operands are non-negative (monotone up). If so, the signed minimum of start becomes the lower bound: range [smin(start), SMAX].
  • sub_DBEC80 checks if all steps are non-positive (monotone down). If so, the signed maximum of start becomes the upper bound: range [SMIN, smax(start)+1].

Phase C -- Trip Count Refinement. For simple two-operand recurrences ({start, +, step} with operand count == 2):

  1. Call sub_DCF3A0(ctx, loop, 1) to get the max backedge-taken count.
  2. If the trip count is computable, compute range(start + step * [0, trip_count]) for both unsigned (sub_DBEFC0) and signed (sub_DBF480) domains.
  3. Intersect both results into the accumulated range.

This is where range analysis and BTC computation form their feedback loop: the BTC is used to bound the AddRec's range.
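For the common case of a constant, non-negative step and no wrapping, the Phase C computation reduces to shifting the start range by step * trip_count. A simplified sketch (the real sub_DBEFC0/sub_DBF480 handle symbolic steps and wrap checks):

```python
def addrec_range_from_btc(start_range, step, max_btc):
    """Range of {start, +, step} over iterations 0..max_btc, assuming a
    constant non-negative step and no wrapping. Ranges are half-open."""
    lo, hi = start_range                  # half-open [lo, hi)
    return (lo, hi + step * max_btc)
```

With start in [0, 256), step 4, and a max backedge-taken count of 10, the AddRec's range narrows to [0, 296) instead of full-set.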

Phase D -- Exit Value Analysis (NVIDIA-gated). Enabled only when global qword_4F88C08 is set. Gets the exact backedge-taken count (mode=2 via sub_DCF3A0), and if the trip count bit width fits within the AddRec's bit width and NSW is set, calls sub_DE4FD0 to compute the exit value range. This provides the tightest possible bound but is more expensive.

Phase E -- Cache and Return. The final accumulated range (from all intersections) is stored in the cache.

SCEVUnknown and Instruction-Level Analysis

For SCEVUnknown (opcode 1) and the complex instruction-level path (opcode 15), the range evaluator performs several specialized analyses:

  • !range metadata: if the underlying instruction carries !range metadata (kind=4), sub_B91C10 extracts it and sub_ABEA30 builds a ConstantRange directly.
  • Predecessor merging: sub_DBB110 computes ranges by analyzing incoming values from predecessor basic blocks, intersecting the results.
  • PHI node analysis: for PHI nodes (instruction opcode 84), the evaluator iterates all incoming values, computes each one's SCEV range, and unions them. A visited-PHI set at scev_ctx+320 prevents infinite recursion through cyclic PHIs.
  • Intrinsic ranges: sub_988010 identifies specific intrinsics (e.g., ctpop, ctlz, cttz) and constrains their ranges to non-negative values via sub_ABB6C0.
  • Stride alignment: sub_BD4FF0 computes stride/alignment information for loads and stores, narrowing the range to multiples of the known stride.

Signed/Unsigned Cross-Pollination

A critical detail: the AddRec analysis explicitly recurses with the opposite signedness flag in certain sub-analyses. Phase A always computes the start in unsigned mode (is_signed=0), while Phase B always uses signed mode (is_signed=1). This cross-referencing allows information from one domain to constrain the other, producing tighter bounds than either domain alone.

GPU-Specific Range Sources

Three categories of NVIDIA-specific range information feed into SCEV range analysis, all derived from the CUDA execution model:

Thread and Block Index Ranges

The intrinsics @llvm.nvvm.read.ptx.sreg.tid.{x,y,z} (threadIdx) produce values in [0, blockDim-1]. The intrinsics @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z} (blockIdx) produce values in [0, gridDim-1]. When these intrinsics appear as SCEVUnknown nodes, the range evaluator propagates their constrained ranges through the expression tree.

The block dimension intrinsics @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z} are bounded by __launch_bounds__ metadata when present. Specifically, nvvm.maxntid (from __launch_bounds__(maxThreads)) provides an upper bound on ntid.x * ntid.y * ntid.z, and nvvm.reqntid provides an exact value. These bounds are read by sub_CE8D40 (NvvmMeta_getMaxNTID) and sub_CE8DF0 (NvvmMeta_getReqNTID).

warpSize (@llvm.nvvm.read.ptx.sreg.warpsize) is the constant 32 on all architectures from sm_70 onward, producing the singleton range [32, 33).

Grid-Stride Loop Patterns

SCEV delinearization (sub_DE9D10) specifically recognizes the grid-stride pattern:

// CUDA:  for (int i = tid + bid * bdim; i < N; i += bdim * gdim)
// SCEV:  {threadIdx.x + blockIdx.x * blockDim.x, +, blockDim.x * gridDim.x}

The step blockDim.x * gridDim.x inherits known-positive range from both operands, enabling the monotonicity analysis in Phase B to prove the IV is non-decreasing. Combined with the bounded start value (tid.x + bid.x * bdim.x is non-negative), the range of the entire AddRec is [0, N) rather than full-set.

KnownBits and DemandedBits Integration

The sub_99B5E0 post-analysis in SimplifyDemandedBits applies NVIDIA-specific refinements including thread index range constraints (threadIdx.x < blockDim.x) and warp-level uniformity assumptions. These propagate through SCEV's getKnownBits (sub_DB5510) to tighten the initial unsigned range of expressions involving GPU special registers.

Backedge-Taken Count Computation

The BTC machinery computes how many times a loop's backedge executes before any exit is taken. The result has three variants:

  • Exact count: the precise number of iterations, or SCEVCouldNotCompute if unknown.
  • Constant max: a constant upper bound on the iteration count.
  • Symbolic max: a SCEV expression bounding the iteration count (may involve loop-invariant values).

BTC Cache Layout

The primary BTC cache at scev_ctx+656 uses 168-byte entries:

Offset | Size | Field
+0x00 | 8 | Key: SCEV pointer (sentinels: empty=-4096, tombstone=-8192)
+0x08 | 128 | Per-exit count data (SmallVector of {BasicBlock*, SCEV* count, flags})
+0x88 | 8 | Exact backedge-taken count (SCEV pointer or null)
+0x90 | 1 | Flag: exact count is valid
+0x98 | 8 | Max backedge-taken count (SCEV pointer or null)
+0xA0 | 1 | Flag: max count is valid

The hash function is identical to the range cache: ((key >> 9) ^ (key >> 4)) & (capacity - 1). Load factor threshold is 75% for capacity doubling (via sub_DB6980) and 87.5% (only capacity/8 truly empty slots remaining) for in-place rehash to reclaim tombstones.
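The two thresholds can be expressed as simple predicates. This sketch assumes the exact comparison direction (strict vs. non-strict), which was not recovered from the binary:

```python
def should_double(used, capacity):
    """Grow (via the sub_DB6980 path) when one more insert would push the
    load factor past 75%."""
    return 4 * (used + 1) > 3 * capacity

def should_rehash_in_place(truly_empty, capacity):
    """Rehash in place to reclaim tombstones when fewer than capacity/8
    slots remain truly empty (87.5% occupied-or-tombstoned)."""
    return truly_empty < capacity // 8
```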

A secondary per-exit table at scev_ctx+1168 stores 56-byte entries indexing individual exit block trip counts, avoiding linear scans through the main entry's embedded exit array.

Exit Count Computation Pipeline

sub_DB9040 (computeExitCountForBranch) is the heavy lifter. For each exiting block, it:

  1. Extracts the branch condition's ICmp instruction.
  2. Identifies the comparison operands as SCEV expressions.
  3. Classifies the exit condition into one of the standard shapes.
  4. Dispatches to the appropriate solver.

The two primary solvers are:

howFarToZero (sub_DBA850, 8 KB) -- handles x != 0 exit conditions. The exit condition is normalized to V = LHS - RHS, so the loop exits when V == 0. For affine AddRec {Start, +, Step}:

// The loop exits when: Start + Step * N = 0 (mod 2^BW)
// Solving: N = -Start / Step (mod 2^BW)
// For positive step (counting up to overflow): N = -Start / Step
// For negative step (counting down to zero):  N = Start / (-Step)

For quadratic AddRec {L, +, M, +, N}, it solves the quadratic equation via SolveQuadraticAddRecExact. If the expression is not affine or quadratic, it returns CouldNotCompute.
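The affine solve can be illustrated for an odd step, which is always invertible modulo 2^BW (upstream LLVM's full algorithm also factors trailing zero bits out of the step; that refinement is omitted here, and the function name is illustrative):

```python
def how_far_to_zero_affine(start, step, bitwidth):
    """Smallest N >= 0 with start + step*N == 0 (mod 2^bitwidth).
    Assumes an odd step, which is invertible modulo a power of two."""
    mod = 1 << bitwidth
    inv = pow(step % mod, -1, mod)        # modular inverse of the step
    return (-start * inv) % mod
```

For example, {5, +, 3} over 4-bit arithmetic reaches zero after 9 steps (5 + 3*9 = 32 ≡ 0 mod 16), and a 32-bit IV {100, +, -1} reaches zero after 100 steps.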

howManyLessThans (sub_DCE310, 317 lines) -- handles x < bound (signed or unsigned) exit conditions. For affine IV = {Start, +, Step} with loop-invariant Bound:

// Unsigned: count = ceil_div(max(Bound, Start) - Start, Step)
// Signed:   count = ceil_div(max_signed(Bound, Start) - Start, Step)
// With overflow checks based on NUW/NSW flags

This function also contains special logic for zero-extended IVs: if the comparison involves zext(IV) < Bound, it can infer NUW on the inner AddRec by proving that the bound is small enough that unsigned overflow cannot occur before the exit.
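The no-wrap fast path above is just a ceiling division. A minimal sketch for a positive constant step (overflow guards omitted):

```python
def how_many_less_thans(start, bound, step):
    """Iteration count of {start, +, step} < bound for a positive step with
    no wrapping: ceil_div(max(bound, start) - start, step)."""
    delta = max(bound, start) - start
    return -(-delta // step)              # ceiling division
```

So {0, +, 3} < 10 runs 4 iterations (i = 0, 3, 6, 9), and a loop whose start already exceeds the bound runs zero.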

Loop Shape Handling

The BTC computation handles several loop shapes through the exit condition classification:

  • Countable (for-style): for (i = 0; i < N; i++) produces {0, +, 1} < N, solved by howManyLessThans as N - 0 = N iterations.
  • While-do: the exit test precedes the body. Trip count equals the number of backedge traversals, which is one less than the number of condition evaluations.
  • Do-while: the exit test follows the body. The backedge is taken at least once if the loop is entered. Trip count comes directly from the exit condition solver.
  • Multiple exits: computeBackedgeTakenCount (sub_DB9E00) iterates all exiting blocks, computes per-exit counts, and takes the minimum. If any exit is not computable, the exact count is CouldNotCompute but the max count may still be known from the computable exits.
  • Exhaustive evaluation: sub_DCFD50 (computeExitCountExhaustively) brute-force iterates small constant-evolving loops (up to scalar-evolution-max-iterations = 100 iterations) to find exit counts that algebraic methods cannot handle.
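The exhaustive-evaluation fallback amounts to simulating the loop for up to 100 steps. A hedged sketch of the idea (the real sub_DCFD50 interprets constant-evolving PHIs in IR rather than Python callables):

```python
def exit_count_exhaustively(init, step_fn, exits, max_iterations=100):
    """Brute-force a small constant-evolving loop, capped at the
    scalar-evolution-max-iterations budget (default 100). Returns the
    number of completed iterations before the exit fires, or None
    (the analogue of CouldNotCompute)."""
    v = init
    for n in range(max_iterations):
        if exits(v):
            return n
        v = step_fn(v)
    return None
```

This agrees with the algebraic solvers where both apply: a loop stepping by 3 from 0 until v >= 10 exits after 4 iterations, matching ceil(10/3).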

Overflow Handling and NoWrap Flags

Trip count precision depends critically on the NoWrap flags (NUW = bit 1, NSW = bit 2) stored at scev_expr+28:

  • NUW (No Unsigned Wrap): if an AddRec {Start, +, Step} has NUW, unsigned arithmetic cannot wrap, so Start + Step * N is monotonically increasing in the unsigned domain. This allows howManyLessThans to compute an exact count without overflow guards.
  • NSW (No Signed Wrap): similarly for signed arithmetic. Enables signed comparison trip counts and the Phase B monotonicity analysis in range computation.
  • Neither flag: the solver must account for wrapping. howFarToZero solves modular arithmetic; howManyLessThans may fall back to constant-max estimates or CouldNotCompute.

The NVIDIA-specific knob aggressive-positive-stride-analysis (documented as "See nvbug 3972412") enables more aggressive inference of NUW flags on AddRec expressions with positive strides, particularly for GPU loop patterns where the step is a known-positive grid dimension.

How BTC Feeds Loop Optimizations

Loop Unrolling

The unroll decision engine (sub_19BB5C0) queries getSmallBestKnownTC (sub_2AA7EC0) which calls the BTC machinery. The result determines the unroll strategy:

  • Exact trip count known and small: enables full unrolling -- the loop body is replicated exactly N times with no remainder loop. This is the most profitable case for GPU code since it eliminates all loop overhead.
  • Exact trip count known but large: enables partial unrolling with an exact remainder. The unroll factor is chosen to divide the trip count, avoiding a remainder loop entirely.
  • Only max trip count known: enables partial unrolling with a runtime remainder check. The unroll factor is bounded by the max trip count.
  • Trip count unknown: unrolling is gated by the NVIDIA knob Disable-unknown-trip-iv -- when set, IndVarSimplify (sub_19489B0) skips loops entirely if the trip count is not computable.

Loop Vectorization

The vectorizer (sub_2AE3460) uses BTC in two ways:

  1. Minimum trip count threshold: getSmallBestKnownTC is compared against dword_500EAE8 (-vectorizer-min-trip-count). If the known trip count is below this threshold, vectorization bails with "LowTripCount" (note the preserved typo: "The trip count is below the minial threshold value.").

  2. Divisibility for epilogue: when the exact trip count is known, the vectorizer checks if it is divisible by the vectorization factor. If so, no scalar epilogue is needed. If not, it generates an epilogue loop. The exact trip count from SCEV enables eliminating the runtime divisibility check.
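The divisibility decision reduces to a modulus on the exact trip count. An illustrative sketch (the function name is not from the binary):

```python
def epilogue_plan(exact_trip_count, vf):
    """Given an exact trip count and vectorization factor vf, decide whether
    a scalar epilogue loop is needed and how many iterations it runs."""
    remainder = exact_trip_count % vf
    return (remainder != 0, remainder)
```

A trip count of 128 with VF=8 needs no epilogue; 130 with VF=8 leaves a 2-iteration scalar epilogue.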

IRCE (Inductive Range Check Elimination)

IRCE (sub_194D450) uses SCEV ranges to split loops into pre-loop / main-loop / post-loop regions. The BTC determines the main loop's iteration space, and the range checks within the loop body define the boundaries for the pre/post loops. Tighter SCEV ranges mean tighter pre/post loops (fewer wasted iterations), which is significant for GPU kernels where every wasted iteration occupies a warp lane.

IndVarSimplify

IndVarSimplify (sub_1945A50) uses the exact BTC for Linear Function Test Replacement (LFTR): replacing the original loop exit test with a comparison against the trip count. This is gated by three NVIDIA knobs: disable-lftr, Disable-unknown-trip-iv, and iv-loop-level (default 1, restricting IV simplification to outermost loops only to limit compile-time on deeply nested GPU kernels).

GPU-Specific Trip Count Patterns

Grid-Stride Loops

for (int i = threadIdx.x + blockIdx.x * blockDim.x;
     i < N;
     i += blockDim.x * gridDim.x)

SCEV representation: {tid.x + ctaid.x * ntid.x, +, ntid.x * nctaid.x}. The start is bounded by [0, ntid.x * nctaid.x) and the step is provably positive (product of two positive values). Trip count: ceil((N - start) / step). With __launch_bounds__, the step's range can be computed precisely, enabling exact trip count computation when N is loop-invariant.
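The per-thread trip count formula above can be checked numerically. A minimal sketch (illustrative helper, not a recovered function):

```python
def grid_stride_trip_count(N, tid, bid, bdim, gdim):
    """Per-thread iterations of the grid-stride loop above:
    start = tid + bid*bdim, step = bdim*gdim, count = ceil((N - start)/step)."""
    start = tid + bid * bdim
    step = bdim * gdim
    if start >= N:
        return 0
    return -(-(N - start) // step)        # ceiling division
```

With N=1000, blockDim.x=128, gridDim.x=4 (step 512), thread (tid=0, bid=0) runs 2 iterations (i = 0, 512) while thread (tid=127, bid=3) runs 1 (i = 511).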

Warp-Stride Loops

for (int i = threadIdx.x % 32; i < N; i += 32)

SCEV representation: {tid.x urem 32, +, 32}. The start is [0, 31] (since warpSize=32), and the step is the constant 32. Trip count: ceil((N - (tid.x % 32)) / 32). This is always computable when N is loop-invariant.

Block-Bounded Loops

for (int i = 0; i < blockDim.x; i++)

When nvvm.reqntid metadata is present, blockDim.x has a known constant value, and the loop has a compile-time-known trip count. This enables full unrolling -- common for shared memory initialization and reduction loops.

Configuration Knobs

Knob | Default | Effect
scalar-evolution-max-iterations | 100 | Max iterations for exhaustive BTC evaluation
scalar-evolution-max-scev-compare-depth | 32 | Recursion limit for SCEV comparison
scalar-evolution-max-arith-depth | 32 | Recursion limit for arithmetic simplification
scalar-evolution-max-cast-depth | 8 | Recursion limit for ext/trunc handling
scalar-evolution-max-ext-depth | 8 | Recursion limit for extension expressions
scalar-evolution-max-constant-evolving-depth | 32 | Depth limit for constant evolution
scalar-evolution-max-expr-size | 384 | Expression complexity budget (NVIDIA simple mode)
scalar-evolution-max-expr-failures | 100 | Max failures before all expressions bail to Unknown
scev-addops-inline-threshold | 500 | Max add operands before bailing
scev-mulops-inline-threshold | 32 | Max mul operands before bailing
scev-cheap-expansion-budget | (default) | Cost budget for SCEVExpander materialization
track-trip-count-more | false | "Track loop trip count more aggressively" (NVIDIA-specific)
aggressive-positive-stride-analysis | true | More aggressive NUW inference for positive strides (nvbug 3972412)
do-sign-ext-simplify | (default) | Simplify sign-extension SCEV expressions
do-sign-ext-expand | (default) | Expand sign-extensions during SCEV construction
qword_4F88EA8 | (global) | Max recursion depth for range computation
qword_4F88C08 | (global) | Enable extended exit-value analysis (Phase D)

The NVIDIA-specific knobs are particularly important. track-trip-count-more enables additional effort in BTC computation that upstream LLVM does not attempt -- the exact mechanism is not fully reversed, but the typo in its description string ("aggresively") matches the binary. aggressive-positive-stride-analysis is tied to a specific NVIDIA bug (nvbug 3972412) and enables proving NUW on AddRec expressions whose step is known positive from range analysis, creating a positive feedback loop between range computation and NoWrap inference.

Function Map

Function | Address | Size | Role
ScalarEvolution::getRangeRef() | sub_DBB9F0 | -- | Core range evaluator
getRangeForAffineARViaRange() | sub_DBB110 | -- | Predecessor-based range
computeUnsignedRangeFromAddRecTripCount() | sub_DBEFC0 | -- | --
computeSignedRangeFromAddRecTripCount() | sub_DBF480 | -- | --
computeExitValueRange() | sub_DE4FD0 | -- | Phase D exit value analysis
getFullRangeFallback() | sub_DDFBD0 | -- | Depth-exceeded fallback
cacheRange() | sub_DB0AC0 | -- | Insert range into hash table
getKnownBits() for SCEV | sub_DB5510 | -- | Unsigned known bits
getNumSignBits() for SCEV | sub_DB55F0 | -- | Signed known bits
isKnownNonNegative(step) | sub_DBED40 | -- | --
isKnownNonPositive(step) | sub_DBEC80 | -- | --
getBackedgeTakenCount(loop, mode) | sub_DCF3A0 | -- | BTC dispatcher
computeBackedgeTakenCount() | sub_DB9E00 | -- | Per-loop BTC with caching
computeExitCountForBranch() | sub_DB9040 | -- | Exit condition analysis
howFarToZero() | sub_DBA850 | -- | "Reaches zero" trip count
howManyLessThans() | sub_DCE310 | -- | "Less than" trip count
computeExitCountExhaustively() | sub_DCFD50 | -- | Brute-force small loops
computeExitLimit() | sub_DCB270 | -- | Exit limit from condition
getSmallConstantTripCount() | sub_DB04E0 | -- | --
getSmallConstantMaxTripCount() | sub_DB06C0 | -- | --
BTC hash table growth / rehash | sub_DB6980 | -- | --
BTC hash table rehash-in-place | sub_DE0180 | -- | Tombstone cleanup
getRangeFromUnknownSCEV() | sub_988CD0 | -- | Range for SCEVUnknown
ConstantRange::intersectWith() | sub_AB2160 | -- | --
ConstantRange::unionWith() | sub_AB3510 | -- | --
ConstantRange::addWithNoWrap() | sub_ABA0E0 | -- | --
ConstantRange::multiply() | sub_AB5480 | -- | --
ConstantRange::udiv() | sub_AB6A50 | -- | --
ConstantRange::minmax_combine() | sub_ABD750 | -- | --
ConstantRange from !range metadata | sub_ABEA30 | -- | --
ConstantRange from KnownBits | sub_C4B490 | -- | --

Differences from Upstream LLVM

Aspect | Upstream LLVM | CICC v13.0
Range sources | Profile data, __builtin_assume, !range metadata from user annotations | Additional GPU-specific sources: nvvm-intr-range pass injects !range on all special register reads; __launch_bounds__ constrains %tid/%ntid ranges; warpSize = 32 constant
Thread index bounds | No concept of bounded thread indices | %tid.x/y/z bounded by [0, maxntid-1], %ntid.x/y/z by [1, 1024], %laneid by [0, 31]; these tighten trip count computation for thread-indexed loops
Trip count precision | Depends on programmer-visible range annotations | Substantially higher precision on GPU due to statically known hardware launch bounds; most CUDA loops have computable trip counts
Range feedback loop | Range analysis and BTC computation feed each other | Same mutual feeding, but GPU-specific ranges make the feedback loop converge faster and more precisely
Warp-stride loops | No concept; stride analysis treats all strides equally | NVIDIA SCEV recognizes warp-stride patterns (stride = warpSize or stride = blockDim.x), enabling specialized BTC computation for cooperative thread loops
Overflow analysis | Standard NSW/NUW flag analysis | Same flags, plus GPU-specific insight: 32-bit IVs with %tid or %ctaid bases are often provably non-wrapping given launch dimension bounds

Cross-References

SCEV Invalidation & Delinearization

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

SCEV analysis results are expensive to compute and are cached aggressively. When the IR mutates -- a loop is unrolled, a value is replaced, a block is deleted -- cached SCEV expressions, range information, and backedge-taken counts can become stale. The invalidation subsystem (forgetLoop, forgetValue, forgetAllLoops) determines exactly which cache entries must be discarded after each transformation. Get it wrong in either direction and the compiler either produces incorrect code (stale data) or wastes time recomputing everything (over-invalidation).

Delinearization is the complementary recovery problem: given a flat pointer expression like base + i*N*M + j*M + k, recover the original multi-dimensional subscripts [i][j][k]. This is critical for GPU code because memory coalescing analysis needs to know whether adjacent threads in a warp are accessing adjacent addresses -- a question that can only be answered by examining per-dimension subscripts against the thread index structure.
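For constant dimension sizes, the recovery is successive division by the inner strides. A sketch of the idea only (illustrative helper; cicc's parametric-term algorithm, sub_DE8D20, also handles symbolic sizes):

```python
def delinearize(flat_offset, inner_sizes):
    """Recover subscripts from a flattened index: for base + i*N*M + j*M + k,
    inner_sizes = [N, M] yields [i, j, k]."""
    subs = []
    for size in reversed(inner_sizes):        # peel innermost dimension first
        flat_offset, sub = divmod(flat_offset, size)
        subs.append(sub)
    subs.append(flat_offset)                  # whatever remains is outermost
    return subs[::-1]
```

For a [?][5][7] array, the flat offset 2*35 + 3*7 + 4 = 95 delinearizes back to subscripts [2, 3, 4].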

In cicc v13.0, both subsystems carry NVIDIA-specific modifications. The invalidation engine has an extended exit-analysis depth threshold and an early-out for simple two-operand AddRec expressions common in GPU loops. The delinearization engine has a polymorphic predicate collector that supports GPU-aware strategies for shared memory bank conflict detection and coalescing analysis, plus at least 9 configuration knobs not present in upstream LLVM.

| Property | Value |
|---|---|
| forgetLoop address | sub_DE2750 (0xDE2750) |
| forgetLoop size | 10,051 bytes (~2,271 asm lines) |
| forgetValue address | sub_D9EE30 (0xD9EE30) |
| forgetValue size | ~9 KB |
| forgetAllLoops address | sub_D9D700 (0xD9D700) |
| forgetAllLoops size | ~8 KB |
| delinearize address | sub_DE9D10 (0xDE9D10) |
| delinearize size | 3,614 bytes (~849 asm lines) |
| collectParametricTerms address | sub_DE8D20 (0xDE8D20) |
| Hash function | ((key >> 9) ^ (key >> 4)) & (capacity - 1) |
| Empty sentinel | 0xFFFFFFFFFFFFF000 |
| Tombstone sentinel | 0xFFFFFFFFFFFFE000 |

Cache Invalidation

The Seven Caches

SCEV maintains seven distinct cache tables that must be kept consistent. Each has its own eviction path inside forgetLoop:

| # | Cache | Entry size | Key | Value | Context offset |
|---|---|---|---|---|---|
| 1 | ValueExprMap (primary) | 16 bytes | Value* | SCEV* | main SE object |
| 2 | Unsigned range cache | 40 bytes | SCEV* | ConstantRange | +976 |
| 3 | Signed range cache | 40 bytes | SCEV* | ConstantRange | +1008 |
| 4 | BTC cache | 168 bytes (0xA8) | loop SCEV* | BackedgeTakenInfo | +0x290 |
| 5 | Per-exit BTC cache | 16 bytes | exit SCEV* | exit count | +0x490 |
| 6 | AddRec folding cache | per-expression | AddRec pair | folded form | per-expression |
| 7 | Predicated BTC cache | 16 bytes | loop SCEV* | predicated count | secondary table |

All hash tables use the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth/compaction thresholds.

forgetLoop: The 8-Phase Algorithm

sub_DE2750 is the largest invalidation function -- 10 KB of machine code organized into eight sequential phases. It is called after every loop transformation that might invalidate SCEV data.

Signature:

void forgetLoop(
    ScalarEvolution *SE,    // rdi -- the SE context
    Loop *L,                // rsi -- loop being forgotten
    BasicBlock *Header,     // rdx -- loop header block
    ExitInfo *Exits,        // rcx -- exit block info (nullable)
    int DepthFlag,          // r8d -- 0=shallow, 1=deep, >1=bounded
    int ExtraFlag,          // r9d -- controls AddRec early-out
    SmallDenseSet *Visited  // stack -- prevents cycles in nested loops
);

Phase 1 -- Block value collection (0xDE27C9). Iterates the loop's basic blocks and collects all Values that have cached SCEV entries. The block array is at loop[+0x20] -> [+0x10] (pointer) / [+0x18] (count), stored as 32-byte entries. For each Value, a dominance check (sub_B19D00) confirms it belongs to the loop, then the SCEV index is extracted from a 27-bit field at value[+4] & 0x7FFFFFF. Collected pointers are stored in a SmallVector (inline capacity 6) with bit 2 set as a tag.

Phase 1B -- Scope chain collection (0xDE28A7). Walks a scope chain obtained via sub_B6AC80(SE[0][+0x28], 0x99), where 0x99 is the SCEV scope identifier. Filters to SCEVUnknown entries (type byte 0x55) with specific flag conditions (byte [+0x21] & 0x20), verifying loop membership and dominance. This captures values not directly in the loop's blocks but semantically part of its analysis scope.

Phase 2 -- Exit block processing (0xDE29D9). Enumerates exit blocks via sub_AE6EC0 and processes their AddRec chains. For each exit, reads the chain at [exit+0x30] & ~7 (stripping tag bits), checks expression kind byte (range 0x1E--0x28), and extracts operands. For the common case of simple {start, +, step} two-operand recurrences, an early-out stops after processing 2 operands when ExtraFlag != 0. If the loop has exactly 2 exits and ExtraFlag >= qword_4F88DC8 (a global threshold for maximum exit analysis depth), deep exit analysis is skipped entirely.

Phase 3 -- Expression dependency analysis (0xDE2BC5). The core invalidation loop. Iterates the collected values in reverse order and builds a transitive closure of all dependent SCEV expressions. Uses a stack-based worklist (SmallVector, inline capacity 8) and a SmallDenseSet for visited tracking. The dependency walk dispatches on expression type:

Type 0x52 ('R' = AddRec):  Follow Start and Step operands via getSCEV,
                            compare ranges with getRangeRef
Type 0x56 ('V' = variant):  Check function pointer equality at [-0x60]
                            and [+0x08], follow if simple recurrence
Type 0x39/0x3A (flagged):   Check bit 6 of flags byte, follow base pointer
                            or compute canonical form from 27-bit index
General:                    Follow underlying object, check for pointer
                            types (0x11/0x12), verify integer type

Phase 4 -- Primary cache eviction (0xDE2DFF). For each expression identified by Phase 3, looks it up in the ValueExprMap, computes both unsigned and signed ranges via getRangeRef (sub_DBB9F0), compares old and new ranges via ConstantRange::contains (sub_AB1BB0), and clears validity bits in the range cache ([entry+0x20] for unsigned, [entry+0x21] for signed). Wide APInt buffers (>64 bits) are freed through __libc_free.

Phase 5 -- BTC eviction (0xDE3D2F). For each collected value, looks it up in the BTC hash table. On hit: writes TOMBSTONE, decrements entry count, increments tombstone count, then calls forgetMemoizedResults (sub_DE2690) to recursively invalidate any expressions that depended on this backedge-taken count. Also evicts the corresponding predicated BTC entry from the secondary table.

Phase 6 -- AddRec folding cache cleanup (0xDE3230). For AddRec expressions (type 0x52), invalidates pre-computed folding results. Extracts the 6-bit opcode from [expr+2] & 0x3F and dispatches:

  • Opcode 0x20 (shift/power-of-two multiply): checks via countPopulation whether the step is a power of two, then calls tryFoldAddRecWithStep (sub_DCFD50)
  • Opcodes 0x22--0x29 (binary operations): constructs the appropriate folded expression per operation type and marks it for invalidation
  • Opcode 0x24 with pointer type (0x0E): skips pointer-integer cast invalidation

Phase 7 -- Predicate and assumption cleanup (0xDE3856). Processes the predicate hash table via the loop object's fields. Performs range intersection (sub_AB0910), union (sub_AB0A00), and emptiness/fullness checks (sub_AAFBB0, sub_AAF760). If the resulting range is neither empty nor full, stores the updated BTC in the loop's entry.

Phase 8 -- Final output (0xDE3CCD). Writes 0x0101 to loop->flags[+0x20], marking the loop as SCEV-forgotten (bit 0 = primary cache invalidated, bit 8 = secondary cache invalidated). Frees heap-allocated collection and output buffers.

forgetValue and forgetAllLoops

forgetValue (sub_D9EE30, ~9 KB) performs single-value eviction. It removes the value's entry from the ValueExprMap, then walks all expressions that transitively depend on it and evicts those as well. Used when a single instruction is replaced (RAUW) or deleted.

forgetAllLoops (sub_D9D700, ~8 KB) iterates every loop in the function's LoopInfo and calls forgetLoop for each one. Used when the entire function's loop structure changes (e.g., after inlining or full function cloning).

Which Passes Trigger Invalidation

forgetLoop is called after these loop transformations:

| Pass | Why invalidation is needed |
|---|---|
| LoopUnroll | Trip counts change; unrolled body has different IVs |
| LoopVectorize | Widened types; vector IVs replace scalar ones |
| LoopPeeling | Peeled iterations change the start value of recurrences |
| LoopUnswitching | Exit conditions change; control flow restructured |
| LICM | Hoisted values have new SCEV forms outside the loop |
| LoopSimplify | Preheader/exit block insertion changes loop structure |
| LoopRotate | Header/latch swap requires BTC recomputation |
| LoopDistribute | Original loop split into multiple loops |
| LoopIdiomRecognize | Pattern replacement changes loop body |
| LoopIndexSplit (NVIDIA) | IV range split into subranges |
| MemorySpaceOpt (NVIDIA) | Address space changes invalidate pointer SCEVs |

The DepthFlag parameter controls the aggressiveness of invalidation: 0 does shallow invalidation (only direct loop values), 1 follows all dependency chains, and values >1 impose a bounded depth useful for performance in deeply nested loops. The Visited parameter (a SmallDenseSet*) prevents infinite cycles when nested loops have mutual SCEV dependencies.
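These DepthFlag semantics can be sketched with a generic dependency walk; the Graph type and the collectInvalid name are illustrative stand-ins for the real walker, which operates on SCEV expression nodes rather than integers.

```cpp
#include <set>
#include <utility>
#include <vector>

// node -> list of dependent nodes (illustrative stand-in for the SCEV
// expression dependency structure).
using Graph = std::vector<std::vector<int>>;

// DepthFlag semantics as described above: 0 = direct dependents only,
// 1 = full transitive closure, >1 = bounded walk depth.
std::set<int> collectInvalid(const Graph &g, int root, int depthFlag) {
    std::set<int> visited{root};              // plays the Visited set's role
    if (depthFlag == 0) {                     // shallow invalidation
        for (int d : g[root]) visited.insert(d);
        return visited;
    }
    std::vector<std::pair<int, int>> work{{root, 0}};  // (node, depth) worklist
    while (!work.empty()) {
        auto [n, depth] = work.back();
        work.pop_back();
        if (depthFlag > 1 && depth >= depthFlag) continue;  // bounded mode
        for (int d : g[n])
            if (visited.insert(d).second)     // Visited prevents cycles
                work.push_back({d, depth + 1});
    }
    return visited;
}
```

The visited-set insertion is what keeps mutually dependent nested loops from looping the walker forever, mirroring the role of the SmallDenseSet* parameter.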

The forget-scev-loop-unroll knob (boolean) controls whether the SCEV cache is invalidated after unrolling -- disabling it is unsound but can be used for compile-time experimentation.

Delinearization

The Problem

CUDA kernels routinely access multi-dimensional arrays:

float val = A[blockIdx.x * BLOCK_H + threadIdx.y][blockIdx.y * BLOCK_W + threadIdx.x];

By the time this reaches LLVM IR, the address computation has been flattened:

%addr = getelementptr float, ptr %A, i64 %flat_idx
; where %flat_idx = (blockIdx.x * BLOCK_H + threadIdx.y) * N + (blockIdx.y * BLOCK_W + threadIdx.x)

SCEV sees this as a single polynomial. Delinearization recovers the per-dimension subscripts, which are essential for:

  1. Coalescing analysis: determining whether adjacent threads (threadIdx.x, threadIdx.x+1, ...) access adjacent memory addresses (coalesced) or strided addresses (uncoalesced). This requires isolating the dimension where threadIdx.x appears.
  2. Shared memory bank conflict detection: 32 banks, 4-byte stride. Knowing whether the innermost subscript is threadIdx.x (conflict-free) vs. threadIdx.x * stride (potential conflicts) requires dimensional decomposition.
  3. Dependence analysis: per-dimension dependence tests (Banerjee, GCD, MIV) are more precise than whole-expression tests. Delinearized subscripts feed DependenceInfo for vectorization legality.
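The recovery itself is simple arithmetic once the dimension sizes are known. The sketch below uses the flat expression from above (flat = i*N*M + j*M + k); N and M are assumed constants here, whereas discovering them from the SCEV polynomial is the hard part the pass actually solves.

```cpp
#include <array>
#include <cstdint>

// Recover [i][j][k] from a flattened 3-D index, given the two inner
// dimension sizes N and M. Assumes the index is in bounds.
std::array<int64_t, 3> delinearize3D(int64_t flat, int64_t N, int64_t M) {
    int64_t k = flat % M;          // innermost subscript
    int64_t j = (flat / M) % N;    // middle subscript
    int64_t i = flat / (M * N);    // outermost subscript
    return {i, j, k};
}
```

For i=2, j=3, k=4 with N=5, M=7 the flat index is 2*35 + 3*7 + 4 = 95, and successive division recovers the subscripts.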

Delinearization Context

The delinearizer (sub_DE9D10) operates on a context object:

| Offset | Type | Field | Purpose |
|---|---|---|---|
| +0x00 | ScalarEvolution* | SE | Parent SCEV context |
| +0x08 | SCEV* | ElementSize | Innermost element size |
| +0x10 | uint8_t | Flags | Bit 0: inline cache mode |
| +0x18 | 64 bytes | InlineCache | 4-slot direct-mapped table (inline mode) |
| +0x20 | uint32_t | Capacity | Heap table capacity (heap mode) |
| +0x58 | SCEV* | TargetArrayPtr | Array being delinearized |
| +0x60 | void* | PredicateCollector | Nullable; collects validity predicates |
| +0x68 | SCEV* | StepRecurrence | AddRec step for innermost dimension |

The inline cache (4 slots of 16 bytes at +0x18) is a small-buffer optimization sized for the overwhelmingly common GPU case of 1D or 2D array accesses. Cache entries use the same (key >> 9) ^ (key >> 4) hash as all other SCEV tables.

The Recursive Delinearization Algorithm

sub_DE9D10 is a recursive function dispatching on 17 SCEV expression kinds via a jump table:

| Kind | Expression type | Handling |
|---|---|---|
| 0, 1, 16 | Constant, TruncateExpr (ident), Unknown | Leaf -- return unchanged |
| 2 | TruncateExpr | Recurse into inner, rebuild with getTruncateExpr |
| 3 | SignExtendExpr | Recurse; dimension discovery on AddRec step match |
| 4 | ZeroExtendExpr | Recurse; dimension discovery on AddRec step match |
| 5 | AddExpr | N-ary: delinearize each operand, rebuild with getAddExpr |
| 6 | MulExpr | N-ary: delinearize each factor, rebuild with getMulExpr |
| 7 | UDivExpr | Delinearize both operands, rebuild with getUDivExpr |
| 8 | AddRecExpr | N-ary with wrap flag preservation; critical path |
| 9--13 | SMax/UMax/SMin/UMin/SeqUMin | N-ary: delinearize operands, rebuild |
| 14 | PtrToIntExpr | Recurse into pointer, rebuild |
| 15 | GEP | Primary dimension discovery entry point |

The N-ary pattern. Cases 5, 6, 8--13 share a common template:

SmallVector<const SCEV*, 2> NewOps;  // inline capacity 2
bool Changed = false;
for (auto *Op : Expr->operands()) {
    const SCEV *NewOp = delinearize(Ctx, Op);  // recursive
    NewOps.push_back(NewOp);
    if (NewOp != Op) Changed = true;
}
if (!Changed) return Expr;  // pointer identity optimization
return rebuildExpr(SE, Kind, NewOps);

The "changed" flag enables pointer identity short-circuiting: if no operand was modified during recursion, the original expression pointer is returned without allocation.

AddRecExpr (case 8) is the most critical case for GPU code. Multi-dimensional array accesses manifest as nested AddRec expressions: {A[0][0], +, dim1}<outer_loop> wrapping {init, +, 1}<inner_loop>. The delinearizer preserves wrap flags (NSW/NUW/NW from bits [+0x1C] & 7) and the step value ([+0x30]) when reconstructing via getAddRecExpr (sub_DBFF60).

ZeroExtend/SignExtend (cases 3, 4) are secondary dimension discovery points. When the inner operand is an AddRec whose step matches Ctx->StepRecurrence (+0x68) and the AddRec has exactly 2 operands (the common {start, +, step} form), the handler extracts dimension information: it calls getElementSize (sub_D33D80) and getConstant (sub_DA4270) to compute the element count, then pushes a new term into the term collector at Ctx[+0x58]. This identifies a dimension boundary -- the extend operation wrapping a matching-step AddRec indicates the point where one array dimension ends and another begins.

GEP (case 15) is the primary entry for actual dimension discovery. It first checks the predicate collector (Ctx[+0x60]). If present, it searches the collector's table for a matching GEP index entry (type field == 1, matching scev_expr, operation == 0x20). If no predicate collector or no match, it falls back to structural delinearization via sub_DE97B0, which analyzes the GEP's index computation structure, iterates discovered terms, and classifies them by dimension type. Terms matching Ctx->StepRecurrence go to the direct collector; others go through the predicate collector's virtual dispatch (vtable[+0x10]).

Fixed-Point Iteration

The function itself is a single recursive pass, but its callers implement a fixed-point loop:

  1. Initialize the context with an initial guess for dimension sizes
  2. Call sub_DE9D10 to delinearize using those dimensions
  3. During recursion, the GEP and extend handlers collect new dimension information into Ctx[+0x58] (term collector) and Ctx[+0x60] (predicate collector)
  4. If collected dimensions differ from the initial guess, update and repeat from step 2
  5. Terminate when dimensions stabilize or a maximum iteration count is exceeded

The memoization cache ensures unchanged sub-expressions are not recomputed across iterations.
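The driver loop above can be sketched as follows; fixedPointDelinearize and delinearizeOnce are illustrative stand-ins for the caller-side logic, not recovered symbols, and the dimension "guess" is reduced to a vector of integers.

```cpp
#include <vector>

// Iterate delinearization until the discovered dimensions match the
// guess they were computed from, or an iteration cap is hit.
// delinearizeOnce stands in for a sub_DE9D10-style single pass: it
// consumes the current dimension guess and returns what it discovered.
std::vector<int> fixedPointDelinearize(
        std::vector<int> guess,
        std::vector<int> (*delinearizeOnce)(const std::vector<int>&),
        int maxIters = 8) {
    for (int it = 0; it < maxIters; ++it) {
        std::vector<int> found = delinearizeOnce(guess);
        if (found == guess) return guess;   // dimensions stabilized
        guess = std::move(found);           // update and repeat
    }
    return guess;                           // iteration cap exceeded
}
```

The iteration cap corresponds to step 5's termination condition; memoization (not modeled here) is what keeps repeated passes cheap.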

Parametric vs Fixed-Size Arrays

Upstream LLVM has the delinearize-use-fixed-size-array-heuristic knob (default: false). When the standard parametric delinearization fails -- typically because dimension sizes are runtime values with no SCEV relationship -- the fixed-size heuristic uses compile-time-known array dimensions from type metadata to guide decomposition.

cicc extends this with an alternative delinearization entry point at sub_147EE30 (25 KB), which applies additional heuristics controlled by at least 3 of the delinearization config globals (dword_4F9AB60, dword_4F9AE00, dword_4F9B340). This second path is likely NVIDIA-enhanced for cases common in GPU code, such as dynamically-allocated shared memory with dimensions derived from kernel launch parameters.

The dependence analysis subsystem has its own entry points into delinearization (sub_146F1B0 at 40 KB for delinearizeAccess, sub_146B5E0 at 18 KB for tryDelinearize) that combine delinearization with per-dimension dependence testing in a single pass.

GPU-Specific Delinearization Patterns

Thread grid indexing. The canonical GPU pattern threadIdx.x + blockIdx.x * blockDim.x produces an AddRec with step = blockDim.x (grid stride). The delinearizer recognizes this by matching the step recurrence against Ctx[+0x68]. When the step corresponds to a grid dimension, the subscript identifies which dimension of a multi-dimensional array is parallelized across the thread grid.

Shared memory bank conflicts. For shared memory arrays, the delinearizer feeds into bank conflict analysis. Shared memory has 32 banks with 4-byte interleaving. If delinearization reveals A[threadIdx.y][threadIdx.x] with row stride 32 (or any multiple of 32), every thread in a warp hits the same bank -- a 32-way conflict. If the stride is relatively prime to 32, accesses are conflict-free. This analysis requires knowing per-dimension subscripts, which only delinearization can provide from the flat pointer arithmetic.
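The bank arithmetic can be checked with a small helper that computes the conflict degree for a full warp accessing 4-byte elements at a given element stride; conflictDegree is an illustrative helper, not a recovered function.

```cpp
#include <algorithm>
#include <array>

// Shared memory model from above: 32 banks, 4-byte interleaving.
// Lane t accesses element t * strideElems; the conflict degree is the
// maximum number of lanes mapping to the same bank.
int conflictDegree(int strideElems) {
    std::array<int, 32> hitsPerBank{};
    for (int tid = 0; tid < 32; ++tid) {
        int byteAddr = tid * strideElems * 4;   // 4-byte elements
        int bank = (byteAddr / 4) % 32;
        ++hitsPerBank[bank];
    }
    return *std::max_element(hitsPerBank.begin(), hitsPerBank.end());
}
```

A stride of 32 elements puts every lane in bank 0 (32-way conflict), while a stride of 33 -- relatively prime to 32 -- spreads the warp across all banks, which is why padding a shared array row by one element is the classic fix.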

Predicate collector polymorphism. The PredicateCollector at Ctx[+0x60] uses virtual dispatch (vtable[+0x10]), allowing different delinearization strategies to be plugged in:

  • Standard delinearization for host code
  • GPU-aware delinearization that considers shared memory bank geometry
  • Coalescing-aware delinearization that checks whether the innermost subscript varies with threadIdx.x

High-dimensional tensors. The term collector at Ctx[+0x58] is a growable SmallVector, supporting arrays with arbitrary dimensionality. This matters for tensor operations in CUDA (e.g., CUTLASS library patterns, which cicc special-cases elsewhere -- see the cutlass substring check in the dependence analysis region).

SCEV Term Collection

Before delinearization runs, collectParametricTerms (sub_DE8D20) walks the SCEV expression tree to extract candidate terms:

  • SCEVAddRecExpr operands yield stride candidates (the step of each AddRec)
  • SCEVUnknown and SCEVMulExpr nodes yield dimension-size candidates
  • SCEVSignExtendExpr nodes are also collected (they often wrap dimension-related terms)

These candidates are passed to findArrayDimensions (sub_147B0D0) which uses product decomposition to determine which terms correspond to array dimensions. The resulting dimension list seeds the delinearization context before sub_DE9D10 is invoked.
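The product-decomposition idea can be illustrated on concrete numbers. This sketch assumes exact divisibility and that stride terms arrive as plain integers; in the compiler they are symbolic SCEV terms, and dimsFromStrides is an invented name.

```cpp
#include <algorithm>
#include <vector>

// Given the stride terms collected from the SCEV tree (e.g. {N*M, M}
// for an array subscripted [i][j][k]), recover per-dimension extents by
// dividing consecutive strides: N*M / M = N, and the innermost stride
// is itself the innermost extent in elements.
std::vector<long> dimsFromStrides(std::vector<long> strides) {
    std::sort(strides.rbegin(), strides.rend());   // largest stride first
    std::vector<long> dims;
    for (size_t i = 0; i + 1 < strides.size(); ++i)
        dims.push_back(strides[i] / strides[i + 1]);
    dims.push_back(strides.back());
    return dims;
}
```

With N=5, M=7 the collected strides are {35, 7} and the recovered extents are {5, 7} -- the {N, M} pair that seeds the delinearization context.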

Configuration

SCEV Invalidation Knobs

| Knob | Default | Effect |
|---|---|---|
| forget-scev-loop-unroll | true | Enable SCEV invalidation after loop unrolling |
| verify-scev | false | Verify SCEV consistency after transformations |
| verify-scev-strict | false | Stricter verification (compare old/new trip counts) |
| verify-scev-maps | false | Verify SCEV map consistency |
| qword_4F88DC8 (max exit analysis depth) | unknown | Threshold beyond which deep exit analysis is skipped |

SCEV Analysis Depth Limits (shared with invalidation)

| Knob | Default | Effect |
|---|---|---|
| scalar-evolution-max-iterations | 100 | Maximum loop iterations for constant evaluation |
| scalar-evolution-max-scev-compare-depth | 32 | Maximum SCEV comparison recursion depth |
| scalar-evolution-max-arith-depth | 32 | Maximum SCEV arithmetic simplification depth |
| scalar-evolution-max-ext-depth | 8 | Maximum sign/zero-extend nesting depth |
| scalar-evolution-max-cast-depth | 8 | Maximum cast chain depth |
| scalar-evolution-max-constant-evolving-depth | 32 | Maximum constant evolution depth |
| scalar-evolution-max-expr-size | 384 | Maximum expression node count |
| scalar-evolution-max-expr-failures | 100 | Maximum SCEV creation failures before bailout |
| scalar-evolution-max-scev-operations-implication-depth | 2 | Maximum depth for implications |
| scalar-evolution-max-value-compare-depth | 2 | Maximum value comparison depth |

NVIDIA-Specific SCEV Knobs

| Knob | Effect |
|---|---|
| aggressive-positive-stride-analysis | More aggressive positive-stride IV analysis (nvbug 3972412) |
| do-sign-ext-simplify | Simplify SCEV sign-extend expressions |
| do-sign-ext-expand | Expand sign-extends during SCEV construction |
| track-trip-count-more | Track loop trip counts more aggressively |
| scev-mulops-inline-threshold (32) | Max MulExpr operands before out-of-line |
| scev-addops-inline-threshold (500) | Max AddExpr operands before out-of-line |

Delinearization Knobs

| Global | Likely identity | Notes |
|---|---|---|
| byte_4F9A8C0 | Delinearization enable flag | Master enable for the delinearization subsystem |
| dword_4F9A620 | Config 1 | Referenced by combined delinearize-and-test |
| dword_4F9A700 | Config 2 | Referenced by delinearizeAccess core |
| dword_4F9A7E0 | Config 3 | Referenced by delinearizeAccess core |
| dword_4F9AB60 | Config 4 | Referenced by alternative delinearization v2 |
| dword_4F9AC40 | Config 5 | Referenced by dependence distance with delinearization |
| dword_4F9AE00 | Config 6 (shared) | Referenced by both combined-test and v2 paths |
| dword_4F9B260 | Config 7 | Referenced by combined delinearize-and-test |
| dword_4F9B340 | Config 8 | Referenced by alternative delinearization v2 |
| da-delinearize | Try to delinearize array references | DependenceAnalysis pass knob (upstream LLVM) |
| da-miv-max-level-threshold | MIV test depth limit | DependenceAnalysis pass knob (upstream LLVM) |

Function Map

Invalidation Functions

| Function | Address | Size | Role |
|---|---|---|---|
| ScalarEvolution::forgetLoop | sub_DE2750 | 10,051 B | 8-phase loop invalidation |
| ScalarEvolution::forgetValue | sub_D9EE30 | ~9 KB | Single-value eviction |
| ScalarEvolution::forgetAllLoops | sub_D9D700 | ~8 KB | Invalidate all loops |
| forgetMemoizedResults | sub_DE2690 | small | Recursive BTC invalidation helper |
| ScalarEvolution::verify | sub_DE5FA0 | ~52 KB | Debug verification (old/new trip count comparison) |
| Loop invalidation helper | sub_DE5640 | ~178 lines | Helper for forgetLoop |
| SCEV expression invalidator | sub_DCE1C0 | small | Callback for AddRec folding cleanup |

Delinearization Functions

| Function | Address | Size | Role |
|---|---|---|---|
| ScalarEvolution::delinearize | sub_DE9D10 | 3,614 B | Recursive delinearizer (17-case switch) |
| collectParametricTerms | sub_DE8D20 | ~521 lines | Term extraction before delinearization |
| Structural GEP delinearization | sub_DE97B0 | small | Sub-analysis called from GEP case |
| canonicalizeExpr | sub_D9ABD0 | small | SCEV normalization |
| computeAccessFunctions | sub_D94080 | ~12 KB | Access function computation |
| SCEV_delinearize (dependence region) | sub_CF5550 | 6,276 B | Alternate copy in dependence analysis |

Dependence Analysis Delinearization

| Function | Address | Size | Role |
|---|---|---|---|
| delinearizeAccess | sub_146F1B0 | 40 KB | Core delinearization for dependence analysis |
| tryDelinearize | sub_146B5E0 | 18 KB | Delinearization attempt with fallback |
| Delinearize subscript | sub_1472640 | 10 KB | Per-subscript extraction |
| Array dimension inference | sub_1473850 | 12 KB | Infers dimensions from access patterns |
| collectSubscripts | sub_1476060 | 22 KB | Multi-dimensional GEP subscript collection |
| Dependence distance with delinearization | sub_14747F0 | 15 KB | Computes dependence vectors using delinearized subscripts |
| findArrayDimensions | sub_147B0D0 | 11 KB | Dimension sizes from SCEV product decomposition |
| Combined delinearize-and-test | sub_147C070 | 34 KB | Delinearize + per-dimension dependence test |
| Alternative delinearization v2 | sub_147EE30 | 25 KB | NVIDIA-enhanced heuristics |
| Partial result combiner | sub_147DF40 | 11 KB | Combines partial delinearization results |

Key SCEV Callees (shared by both subsystems)

| Function | Address |
|---|---|
| getRangeRef -- range computation | sub_DBB9F0 |
| ConstantRange::contains | sub_AB1BB0 |
| ConstantRange::intersectWith | sub_AB0910 |
| ConstantRange::unionWith | sub_AB0A00 |
| ConstantRange::isEmptySet | sub_AAFBB0 |
| ConstantRange::isFullSet | sub_AAF760 |
| getSCEV -- expression resolution | sub_DD8400 |
| tryFoldAddRecWithStep | sub_DCFD50 |
| getAddExpr (N-ary) | sub_DC7EB0 |
| getMulExpr (N-ary) | sub_DC8BD0 |
| getAddRecExpr | sub_DBFF60 |
| getUDivExpr | sub_DCB270 |
| getZeroExtendExpr | sub_DC5000 |
| getSignExtendExpr | sub_DC2B70 |
| getTruncateExpr | sub_DC5200 |
| getPtrToIntExpr | sub_DD3A70 |
| DominatorTree::dominates | sub_B19D00 |
| SmallDenseSet::insert | sub_C8CC70 |
| Cache insert (delinearization result memoization) | sub_DB11F0 |

Differences from Upstream LLVM

| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Delinearization purpose | Optimize for cache locality; multi-dimensional subscript recovery for polyhedral analysis | Optimize for memory coalescing: recover subscripts to determine whether adjacent warp threads access adjacent addresses |
| Invalidation triggers | Standard loop transformations (unroll, vectorize, simplify) | Additional triggers from NVIDIA-specific passes: MemorySpaceOpt (address space transformations), IV Demotion (narrowing changes SCEV types), NVLoopStrengthReduce |
| Delinearization result caching | No explicit memoization in upstream | Memoization cache via sub_DB11F0 prevents redundant delinearization of the same GEP across multiple consumers |
| Thread index awareness | No concept of thread-index-based access patterns | Delinearized subscripts are analyzed against threadIdx dimensions to determine coalescing quality; feeds into vectorization and LSR decisions |
| forget-scev-loop-unroll knob | Present in upstream LLVM | Same knob, but more critical on GPU because over-invalidation forces expensive SCEV recomputation on deeply nested kernel loops |
| Range source diversity | Profile data, programmer assertions (__builtin_assume) | Additional sources: !range metadata from nvvm-intr-range, __launch_bounds__, warpSize constant, special register bounded ranges |

Cross-References

  • ScalarEvolution Overview & Construction -- SCEV expression creation, the ValueExprMap, and the expression DAG structure that invalidation walks
  • SCEV Range Analysis & Trip Counts -- range caches and BTC caches that invalidation must clear; the getRangeRef and BTC computation functions called during eviction
  • LoopVectorize & VPlan -- primary consumer of delinearization results for vectorization legality; calls forgetLoop after vectorizing
  • Loop Unrolling -- calls forgetLoop after unrolling; the forget-scev-loop-unroll knob controls this
  • Loop Strength Reduction (NVIDIA) -- uses SCEV for IV analysis; its transformations trigger forgetValue calls
  • MemorySpaceOpt -- NVIDIA-specific pass that triggers SCEV invalidation after address space transformations
  • Alias Analysis & NVVM AA -- delinearization results feed into alias analysis for disambiguating multi-dimensional array accesses

Loop Optimization Passes

Loop optimization is the single most performance-sensitive area of the cicc pipeline. On an NVIDIA GPU, the constraints are fundamentally different from CPU: register pressure dominates (every additional register per thread reduces SM occupancy), memory coalescing replaces cache locality as the primary memory optimization target, and warp divergence caused by loop-carried control flow destroys SIMT efficiency. NVIDIA's cicc v13.0 addresses these constraints by shipping a mix of stock LLVM loop passes, LLVM passes with GPU-specific threshold overrides, and fully proprietary loop transformations -- all orchestrated through a carefully ordered pipeline where the position of each pass reflects hard-won engineering tradeoffs between register pressure, instruction count, and memory access patterns.

This page provides the big-picture view of loop optimization in cicc: what passes exist, how they are ordered, what analyses they share, and why the ordering matters for GPU targets. Each pass links to a dedicated sub-page with full algorithmic detail.

Why Loop Optimization Is Different on GPU

Four properties of the GPU execution model distinguish GPU loop optimization from the CPU case that upstream LLVM targets:

Register pressure is the primary constraint. Every loop transformation that increases live values (unrolling, vectorization, LICM hoisting) must be evaluated against the SM's register budget and its discrete occupancy cliffs -- adding one register can drop occupancy by a full warp group. CPU compilers never face this tradeoff.

Memory coalescing replaces cache line optimization. Loop transformations that improve stride-1 access patterns (interchange, vectorization) improve coalescing; transformations that increase the number of live pointers (unrolling, distribution) may degrade it by interleaving access streams.

No out-of-order execution. Warps execute instructions in program order; the only latency-hiding mechanism is warp-level multithreading. Unrolling creates ILP within a single warp by exposing independent instructions that the ptxas backend can interleave, but the benefit is bounded by the register pressure cost.

Address space semantics. GPU memory is partitioned into address spaces with different pointer widths, hardware addressing modes, and performance characteristics. Loop passes that rewrite address computations (LSR, IndVarSimplify) must respect these distinctions -- strength-reducing a 32-bit shared memory pointer into 64-bit generic form defeats the backend's ability to emit efficient .shared:: instructions.

Pipeline Ordering

The loop passes execute within the main optimization pipeline assembled by sub_12E54A0. The ordering below reflects the Tier 1/2/3 optimization path (the normal path for -O1 and above). Passes marked with (N) are NVIDIA-specific or have significant NVIDIA modifications; unmarked passes are stock LLVM with at most threshold overrides.

LoopSimplify + LCSSA                   (canonicalization)
    |
    v
LoopRotate                             (do-while canonical form)
    |
    v
LICM (hoist)                           (move invariants out)
    |
    v
LoopIndexSplit (N)                     (split index-dependent branches)
    |
    v
IndVarSimplify (N)                     (canonicalize IVs, LFTR)
    |
    v
LoopIdiomRecognize                     (memcpy/memset/mismatch idioms)
    |
    v
LoopDistribute                         (fission for vectorization)
    |
    v
LoopVectorize (N)                      (widen scalar loops to v2/v4)
    |
    v
LoopUnroll (N)                         (replicate body, GPU-tuned)
    |
    v
LoopInterchange                        (swap nest levels for coalescing)
    |
    v
IRCE                                   (range check elimination)
    |
    v
NVLoopStrengthReduce (N)               (NVIDIA custom LSR solver)
    |
    v
LoopDeletion                           (remove dead loops)
    |
    v
LoopSink / LICM (sink)                 (demote unprofitable hoists)

Several passes appear more than once. LICM runs in both hoist and sink mode. LoopUnroll has an early invocation in the main pipeline and a late invocation gated by opts[1360] (nv-disable-loop-unrolling). IndVarSimplify runs before vectorization to canonicalize induction variables, then again after unrolling to clean up newly exposed IVs. LoopSimplify and LCSSA are implicit -- they run as required analyses whenever any loop pass requests them, ensuring loops remain in canonical form throughout.

The ordering reflects a deliberate strategy: canonicalize first (LoopSimplify, LoopRotate, IndVarSimplify), transform for parallelism (LoopDistribute, LoopVectorize, LoopInterchange), replicate for ILP (LoopUnroll), and clean up addressing (LSR, LoopDeletion, LoopSink). Reordering these passes produces measurably different code: running LSR before LoopVectorize would pollute the cost model with strength-reduced IVs that confuse SCEV; running LoopUnroll before LoopVectorize would prevent vectorization of unrolled-but-still-vectorizable loops.

LoopPassManager Structure

cicc uses the LLVM New Pass Manager's LoopPassManager infrastructure. Loop passes are grouped inside a FunctionPassManager that contains a FunctionToLoopPassAdaptor wrapping the LoopPassManager. The adaptor iterates over all loops in the function in post-order of the loop forest (innermost first), running the full sequence of loop passes on each loop before moving to the next.

The LoopStandardAnalysisResults struct is threaded through all loop passes, providing shared access to:

| Analysis | Typical Accessor | Purpose |
|---|---|---|
| ScalarEvolution | AR.SE | Trip counts, strides, value ranges |
| LoopInfo | AR.LI | Loop structure, nesting depth |
| DominatorTree | AR.DT | Dominance queries for code motion |
| AssumptionCache | AR.AC | __builtin_assume facts |
| TargetTransformInfo | AR.TTI | Cost model, addressing modes |
| MemorySSA | AR.MSSA | Memory alias queries for LICM/DSE |
| AAResults | AR.AA | Alias analysis chain |

Passes that structurally modify loops (LoopUnroll, LoopDistribute, IRCE) call LPMUpdater::markLoopAsDeleted() or LPMUpdater::addSiblingLoops() to inform the pass manager of changes. SCEV is invalidated per-loop via SE.forgetLoop() after any transformation that changes the loop's backedge-taken count.

Complete Pass Inventory

The table below lists every loop pass present in cicc v13.0 with its pipeline position, NVIDIA modification status, and primary function address.

| Pass Name | Pipeline Position | NVIDIA Modified | Entry Address | Status |
|---|---|---|---|---|
| loop-simplify | Infrastructure (on demand) | No | stock LLVM | Canonicalizes loop form |
| lcssa | Infrastructure (on demand) | No | stock LLVM | Ensures loop-closed SSA |
| loop-rotate | Early, before LICM | No | stock LLVM | Converts to do-while form |
| licm | Early (hoist) + Late (sink) | Threshold only | stock LLVM | Invariant code motion |
| loop-index-split | After LICM, before IndVars | Yes (proprietary) | sub_2CBEC60 (New PM) | Splits index-dependent branches |
| indvars | Before vectorize | Yes (3 knobs) | sub_19489B0 | IV canonicalization + LFTR |
| loop-idiom | Before distribute | No | stock LLVM | Memcpy/memset/mismatch recognition |
| loop-distribute | Before vectorize | Threshold only | sub_1A8CD80 | Loop fission for vectorization |
| loop-vectorize | Main loop slot | Yes (cost model) | sub_2AF1970 | Vectorize inner loops to v2/v4 |
| loop-unroll | After vectorize (x2) | Yes (decision engine) | sub_19BE360 | Replicate loop body |
| loop-interchange | After unroll | Threshold only | sub_1979A90 | Swap loop nest levels |
| irce | After interchange | No | sub_194D450 | Range check elimination |
| loop-reduce | Late, after unroll | Yes (complete rewrite) | sub_19CE990 (NV wrapper) | Strength reduction for GPU |
| loop-deletion | Late | No | stock LLVM | Remove dead/empty loops |
| loop-sink | Late | No | stock LLVM | Sink invariants back into loops |
| loop-instsimplify | Utility | No | stock LLVM | Simplify instructions in loops |
| loop-flatten | Utility | No | stock LLVM | Flatten nested counted loops |
| loop-guard-widening | Utility | No | stock LLVM | Widen loop guards |
| loop-predication | Utility | No | stock LLVM | Predicate unswitched loops |
| loop-reroll | Utility | No | stock LLVM | Reverse unrolling (rarely used) |

Passes marked "Utility" are registered in the pipeline infrastructure but are not part of the default optimization sequence -- they are available for explicit pipeline specification via -mllvm -passes=....

Canonicalization Passes

LoopSimplify and LCSSA run on demand before any loop transformation pass executes. LoopSimplify ensures each loop has a single preheader, a single backedge (latch), and dedicated exit blocks. LCSSA (Loop-Closed SSA) ensures that values defined inside a loop and used outside it pass through PHI nodes at loop exit blocks. These are stock LLVM utilities with no NVIDIA modifications. Together they establish the invariants that all subsequent loop passes depend on.

LoopRotate converts a loop from while-form (while (cond) { body }) to do-while form (do { body } while (cond)). This creates a single-entry loop body and moves the exit test to the latch, which is the canonical form expected by SCEV, LoopVectorize, and LoopUnroll. Stock LLVM, no NVIDIA modifications.
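The rotation can be illustrated with a small equivalence sketch (plain Python, hypothetical function names): the rotated form guards the loop with one copy of the exit test and moves the remaining test to the bottom of the body, exactly as LoopRotate does at the IR level.

```python
def while_form(n):
    i, acc = 0, 0
    while i < n:          # exit test at the top (loop header)
        acc += i
        i += 1
    return acc

def rotated_form(n):
    i, acc = 0, 0
    if i < n:             # guard: one copy of the test hoisted before the loop
        while True:       # do-while: body runs first ...
            acc += i
            i += 1
            if not (i < n):   # ... exit test moved to the latch
                break
    return acc

# Both forms compute the same result for every trip count, including zero.
assert all(while_form(n) == rotated_form(n) for n in range(10))
```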

NVIDIA-Custom Loop Passes

Loop Index Split is a revived and heavily reworked version of a pass removed from upstream LLVM 3.0. It splits loops when the loop body contains a condition that depends on the induction variable (e.g., if (i == K)), producing two or three loops where each has a uniform body. On GPU, this eliminates warp divergence caused by index-dependent branches. The pass implements three transformation modes: all-but-one peel (for i == K), only-one collapse (for nearly-empty special iterations), and full range split (for i < K vs i >= K). Proprietary, no upstream equivalent.
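The "all-but-one peel" mode can be sketched as a source-level transformation (plain Python; the callback names are illustrative, not from the binary): the single special iteration is peeled out, leaving two loops with uniform, divergence-free bodies.

```python
def original(n, k, normal, special):
    for i in range(n):
        if i == k:          # index-dependent branch: warp-divergent on GPU
            special(i)
        else:
            normal(i)

def index_split(n, k, normal, special):
    # "all-but-one peel": iterations [0,k) and (k,n) get a uniform body;
    # the single special iteration is peeled out of the loop entirely.
    for i in range(min(k, n)):
        normal(i)
    if 0 <= k < n:
        special(k)
    for i in range(k + 1, n):
        normal(i)

a, b = [], []
original(8, 3, a.append, lambda i: a.append(-i))
index_split(8, 3, b.append, lambda i: b.append(-i))
assert a == b  # same side effects in the same order
```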

IndVarSimplify (NVIDIA) is upstream LLVM's induction variable canonicalization pass with three NVIDIA-specific extensions: Disable-unknown-trip-iv (bool, qword_4FAF520) -- bypasses the pass entirely when SCEV cannot compute the trip count, preventing aggressive IV transforms on warp-divergent loops; iv-loop-level (int, default 1, qword_4FAF440) -- restricts the pass to loops at a maximum nesting depth to control compile time on deeply nested stencil kernels; and disable-lftr (bool, byte_4FAF6A0) -- disables Linear Function Test Replace when the IV canonicalization would increase register pressure.

LoopVectorize (GPU-Adapted) is the largest single pass in the cicc loop pipeline (88 KB). On GPU, vectorization means generating ld.v2/ld.v4 wide loads rather than filling SIMD lanes. The pass builds VPlans, selects VF through a GPU-aware cost model that penalizes register pressure, and caps VF at 4 for most GPU targets. Scalable vectors are always disabled. The pass includes an outer-loop vectorization path (rarely triggered on GPU) and an inner-loop path (the main code path).
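A rough model of the GPU-style VF decision (a hypothetical sketch, not the recovered cost model; parameter names and the register budget are assumptions): pick the widest factor whose total load width fits a 128-bit ld.v4 and whose register cost stays under budget.

```python
def select_vf(element_bits, est_regs_per_lane, reg_budget=64, max_vf=4):
    """Hedged sketch of a GPU-flavored VF pick: widest vector load
    (<= 128 bits total, i.e. ld.v2/ld.v4) whose estimated register
    cost stays under budget. Scalable VFs are never considered."""
    best = 1
    for vf in (2, 4):
        if vf > max_vf or element_bits * vf > 128:
            break
        if est_regs_per_lane * vf > reg_budget:
            break  # register-pressure penalty rejects the wider VF
        best = vf
    return best

print(select_vf(32, 8))   # 32-bit elements, cheap body -> VF=4 (ld.v4)
print(select_vf(64, 8))   # 64-bit elements: 64*4 > 128 bits -> VF=2
print(select_vf(32, 40))  # high pressure: 40*2 > 64 regs -> VF=1
```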

Loop Unrolling (GPU-Tuned) ships a substantially reworked computeUnrollCount decision engine with GPU heuristics: a local-array threshold multiplier that aggressively unrolls loops over __shared__ arrays, power-of-two factor enforcement, a pragma threshold 200x larger than stock LLVM, and a register-pressure-aware cost model. The transformation engine is lightly modified upstream UnrollLoop. The pass runs twice: once in the main pipeline, once as a late cleanup.
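The shape of such a decision engine can be sketched as follows (illustrative numbers only; the real thresholds and multiplier values are not reproduced here): a size budget divided by body size, boosted for shared-memory arrays, then forced down to a power of two.

```python
def compute_unroll_count(trip_count, body_size, threshold=400,
                         uses_shared_array=False, local_array_mult=4):
    """Sketch of a computeUnrollCount-style decision: budget / body size,
    boosted for __shared__ arrays, then rounded down to a power of two.
    All constants here are assumptions, not recovered values."""
    budget = threshold * (local_array_mult if uses_shared_array else 1)
    count = max(1, min(budget // body_size, trip_count or 1))
    # power-of-two factor enforcement: round down to 2^k
    count = 1 << (count.bit_length() - 1)
    return count

print(compute_unroll_count(16, 50))                          # budget 400//50 = 8
print(compute_unroll_count(16, 60))                          # 6 -> rounded to 4
print(compute_unroll_count(16, 60, uses_shared_array=True))  # boosted, capped at 16
```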

NVLoopStrengthReduce (NVIDIA Custom) is the most GPU-specific LLVM pass in cicc. NVIDIA ships a complete replacement formula solver (160 KB, 2688 lines) with 11 custom knobs controlling register pressure checking, address-space-aware formula selection, sign-extension optimization, and 64-bit IV handling. The stock LLVM LSR remains in the binary but the NVIDIA overlay replaces the formula generation and selection phases.

Standard Loop Passes (Threshold Overrides Only)

LICM (Loop-Invariant Code Motion) hoists loop-invariant computations above the loop and sinks them below it. On GPU, LICM's hoist mode must be conservative: hoisting increases register pressure in the loop preheader, which may push past occupancy cliffs. The sink mode (running later) undoes unprofitable hoists. Stock LLVM with NVIDIA-tuned thresholds.

LoopInterchange swaps the nesting order of a perfectly-nested loop pair when doing so improves memory access locality. In cicc, the threshold loop-interchange-threshold (dword_4FB07E0) defaults to 0, meaning interchange is only performed when the net locality benefit is non-negative AND parallelism improves. The pass has a 100-pair dependence limit (0x960 bytes) as a compile-time safety valve. There is no visible CUDA-specific memory space awareness -- the standard LLVM stride-1 locality model applies uniformly. See the standard loop passes page for details.

IRCE (Inductive Range Check Elimination) splits a loop into preloop/mainloop/postloop regions, eliminating range checks from the mainloop where the induction variable is provably within bounds. The implementation is stock LLVM with no visible NVIDIA modifications. Configuration globals include a block count threshold (dword_4FB0000), a debug flag (byte_4FAFE40), and a "constrained" relaxation mode (byte_4FAFBA0) that handles slightly non-canonical range checks common in GPU thread-coarsened loops.

LoopDistribute (loop fission) splits a single loop into multiple loops to separate unsafe memory dependences from safe ones, enabling LoopVectorize to vectorize the safe partition. Stock LLVM algorithm. The SCEV runtime check threshold (qword_4FB5480) is likely GPU-tuned. The pass runs before LoopVectorize in the pipeline.

LoopIdiomRecognize detects loops that implement common patterns (byte-by-byte copy, memset, mismatch search, string search) and replaces them with optimized multi-block IR or library calls. The expansion routines generate vectorized mismatch detection (sub_2AA00B0, 48 KB) and vectorized first-occurrence string search (sub_2AA3190, 40 KB), both with page-boundary-safe masked vector loads. Stock LLVM pass; the expansion quality benefits GPU targets where wide loads are profitable.

LoopDeletion removes loops proven dead (no observable side effects). Stock LLVM. LoopSink moves loop-invariant operations that were hoisted by LICM back into the loop body when doing so reduces register pressure -- particularly valuable on GPU where the register pressure tradeoff is acute.

Loop Analysis Infrastructure

All loop passes share three core analysis frameworks.

ScalarEvolution (SCEV)

SCEV models how values evolve across loop iterations. Every loop pass depends on it for trip count computation, stride analysis, and value range queries. cicc ships an LLVM 20.0.0-based SCEV with three NVIDIA extensions: a complexity control system (simple_mode) that prevents unbounded analysis time, GPU-specific SCEV sources that inject thread index bounds, and recognition of CUDA loop idioms (warp-stride, grid-stride). See ScalarEvolution Overview, Range Analysis & Trip Counts, and Invalidation & Delinearization.

LoopInfo

LoopInfo provides the loop forest structure: which basic blocks belong to which loops, nesting depth, header/latch/exit identification. It is the primary structural query interface for all loop passes. Stock LLVM, no NVIDIA modifications.

DependenceInfo

DependenceInfo computes memory dependence direction vectors between instruction pairs across loop iterations. LoopInterchange and LoopDistribute are its primary consumers. The analysis uses SCEV to classify dependences as forward (<), backward (>), equal (=), scalar (S), independent (I), or unknown (*). Direction vectors drive the legality checks for loop interchange (no reversed backward-carried dependences after swap) and loop distribution (which instructions must stay in the same partition).

The following table consolidates all loop-pass-specific configuration knobs discovered in cicc v13.0. These are controllable via -mllvm -<knob>=<value>.

| Knob | Pass | Type | Default | Effect |
|---|---|---|---|---|
| Disable-unknown-trip-iv | IndVarSimplify | bool | false | Skip IV canonicalization for unknown-trip loops |
| iv-loop-level | IndVarSimplify | int | 1 | Max nesting depth for IV simplification |
| disable-lftr | IndVarSimplify | bool | false | Disable Linear Function Test Replace |
| replexitval | IndVarSimplify | enum | 1 (cheap) | Exit value replacement strategy: 0=never, 1=cheap, 2=always |
| indvars-widen-indvars | IndVarSimplify | bool | true | Allow IV widening to eliminate sign/zero extension |
| loop-interchange-threshold | LoopInterchange | int | 0 | Minimum net locality improvement for interchange |
| vectorize-loops | LoopVectorize | bool | true | Master vectorization enable |
| enable-early-exit-vectorization | LoopVectorize | bool | false | Allow vectorization of early-exit loops |
| force-vector-width-outer | LoopVectorize | bool | false | Force VF=4 for outer loops |
| nv-disable-loop-unrolling | LoopUnroll | bool | false | Disable the late unroll invocation |
| disable-unknown-trip-lsr | NV LSR | bool | false | Skip LSR for unknown-trip loops |
| lsr-check-rp | NV LSR | bool | true | Enable register pressure checking in LSR |
| lsr-rp-limit | NV LSR | int | ~32-64 | Register pressure ceiling for LSR |
| filter-bad-formula | NV LSR | bool | true | NVIDIA custom formula filtering |
| do-lsr-64-bit | NV LSR | bool | arch-dep | Enable LSR for 64-bit IVs (false on sm_3x-5x) |
| count-sxt-opt-for-reg-pressure | NV LSR | bool | true | Credit sign-ext savings in cost model |
| lsr-sxtopt | NV LSR | bool | true | Fold sign-extensions into IV expressions |
| lsr-loop-level | NV LSR | int | 0 (all) | Restrict LSR to specific loop nesting depth |
| lsr-skip-outer-loop | NV LSR | bool | false | Skip outer loop IVs in nested loops |
| disable-lsr-for-sharedmem32-ptr | NV LSR | bool | false | Disable LSR for addrspace(3) pointers |
| disable-lsr-complexity-discount | NV LSR | bool | false | Disable complexity discount in cost model |
| irce-block-threshold | IRCE | int | varies | Max basic blocks before IRCE bails |
| enable-loop-distribute | LoopDistribute | bool | false | Force-enable distribution |
| loop-distribute-scev-check-threshold | LoopDistribute | int | varies | Max SCEV runtime checks allowed |

Cross-References

Standard Loop Passes

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

CICC v13.0 includes a full complement of LLVM loop transformation passes beyond the major ones (LoopVectorize, LoopUnroll, LICM, LSR) that have their own pages. This page covers the remaining loop passes: LoopInterchange, IRCE, IndVarSimplify, LoopDistribute, LoopIdiom, LoopRotate, LoopSimplify, and LCSSA. Most are stock LLVM with default thresholds, but IndVarSimplify carries three NVIDIA-specific knobs that materially change behavior on GPU code. LoopRotate appears multiple times in the pipeline as a canonicalization prerequisite for LICM and unrolling. The canonicalization trio -- LoopSimplify, LCSSA, and LoopRotate -- runs so frequently that it constitutes the backbone of the loop pass infrastructure in cicc.

Barrier awareness. None of these 8 passes have explicit barrier (__syncthreads()) awareness. Barrier handling in cicc occurs through dedicated NVIDIA passes: Dead Barrier Elimination (sub_2C83D20) and convergence control token verification (sub_E35A10). The structural passes (LoopRotate, LoopSimplify, LCSSA) do not move instructions across basic blocks in ways that could reorder barriers. LoopInterchange and LoopDistribute could theoretically reorder barriers, but barriers in CUDA kernels typically occur outside perfectly-nested loop bodies (interchange) or create non-distributable loop bodies (distribution).

Occupancy interaction. None of the 8 passes interact with occupancy or register pressure directly. Occupancy-aware loop optimization occurs in LSR (register pressure tracking at a1+32128 with occupancy ceiling), LoopUnroll (TTI-based register pressure estimation), and register allocation. These 8 passes are IR-level transforms that run before register allocation.

Address space awareness. None of the 8 passes distinguish between addrspace(0) (generic), addrspace(1) (global), addrspace(3) (shared), or addrspace(5) (local). Only LSR has address space awareness via the disable-lsr-for-sharedmem32-ptr knob. This is a notable gap: LoopInterchange's cost model should ideally weight global memory coalescing higher than shared memory locality, and LoopDistribute could benefit from knowing that shared-memory and global-memory partitions have different cost characteristics.


LoopInterchange

Swaps the iteration order of a perfectly-nested loop pair to improve memory access locality. On GPUs, interchange can convert non-coalesced global memory accesses (strided across warps) into coalesced ones (consecutive addresses per warp), which is often the single largest performance lever for memory-bound kernels.

| Property | Value |
|---|---|
| Entry point | sub_1979A90 (69 KB) -- processLoopList |
| Legality checker | sub_1975210 (45 KB) |
| Dependence helper | sub_1978000 (37 KB) |
| Pass name | "loop-interchange" |
| Knob | loop-interchange-threshold at dword_4FB07E0, default 0 |
| Knob constructor | ctor_208 at 0x4E39E0 |
| NVIDIA delta | None -- stock LLVM algorithm and threshold |

Required analyses (from sub_19743F0): ScalarEvolution (unk_4F9A488), LoopInfoWrapperPass (unk_4F96DB4), DominatorTreeWrapperPass (unk_4F9E06C), AAResultsWrapperPass (unk_4F9920C), DependenceAnalysisWrapperPass (unk_4F98D2D), OptimizationRemarkEmitter (unk_4FB66D8), TargetTransformInfoWrapperPass (unk_4FB65F4), LoopAccessLegacyAnalysis (unk_4F99CB0). The pass preserves both DominatorTree and LoopInfo.

Algorithm. The pass collects the loop nest as a SmallVector by walking the single-subloop chain (enforcing the "perfectly nested" constraint -- each loop must have exactly one child). For nests with fewer than two levels, it returns immediately. It then builds direction vectors for every memory-dependence pair via DependenceInfo (sub_13B1040), encoding each dimension as one of < (forward), > (backward), = (equal), S (scalar), I (independent), or * (unknown). A hard bail-out fires if the number of dependence pairs exceeds 100 (0x960 bytes at 24 bytes per entry) -- a compile-time safety valve.

For each candidate pair from outermost inward, the decision pipeline runs five checks in sequence:

  1. Dependence safety -- any * or backward-carried dependence that would be reversed by interchange bails with remark "Dependence". The safety check uses two bitmasks: 0x803003 for valid direction combination and 0x400801 for the "all equal-like before inner" pattern. A special case allows inner > when all preceding levels are = or S (zero distance in those dimensions).
  2. Call instructions -- calls in the inner body that are not provably readonly intrinsics bail with "CallInst". The intrinsic check calls sub_1560260(callee, -1, 36) and sub_1560260(callee, -1, 57) for two classes of safe intrinsics.
  3. Tight nesting -- extra computation between the loops (non-PHI, non-terminator instructions) bails with "NotTightlyNested". Checks sub_15F3040 (extra computation), sub_15F3330 (volatile/atomic operations), and sub_15F2ED0 (calls with side effects).
  4. Exit PHI validation -- complex PHI nodes at the loop exit bail with "UnsupportedExitPHI". For each exit PHI, the pass walks the use chain checking operand count via (v287 & 0xFFFFFFF), verifying each operand references the latch block and that sub_157F120 (hasLoopInvariantOperands) returns true.
  5. Cost model -- counts memory subscripts with stride in the inner vs. outer loop. Net cost = benefit - penalty. Interchange proceeds only if cost >= -threshold (default: >= 0) AND all direction vectors show a parallelism improvement (outer dimension becomes scalar/independent while inner becomes equal).

Cost model details. For each memory instruction (opcode byte 0x38 at offset -8), the pass extracts the subscript count via (*(_DWORD*)(instr-4) & 0xFFFFFFF) and calls sub_146F1B0(ScalarEvolution, operand) to get the SCEV expression. Strides are classified per-loop. Subscripts with stride in both loops are counted as penalties (ambiguous). The net cost is locality_benefit - locality_penalty. The parallelism override requires ALL direction vectors to have the outer dimension as S (83) or I (73) and the inner dimension as = (61) -- even a non-negative cost is rejected if this pattern fails, with remark "InterchangeNotProfitable".
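The core of the legality test for a two-level nest can be sketched directly in terms of direction vectors (a simplified model of check 1 above, not the bitmask encoding the binary uses): interchange is illegal if any vector is unknown or would have a forward dependence become backward after the swap.

```python
def interchange_legal(direction_vectors):
    """Simplified direction-vector legality test for a 2-level nest:
    swapping (outer, inner) must not reverse a carried dependence."""
    for outer, inner in direction_vectors:
        if outer == '*' or inner == '*':
            return False              # unknown direction: bail ("Dependence")
        if outer == '<' and inner == '>':
            return False              # '<,>' would become '>,<' after the swap
    return True

assert interchange_legal([('=', '<')])      # inner-carried only: safe to swap
assert not interchange_legal([('<', '>')])  # would reverse after swap
assert not interchange_legal([('*', '=')])  # unknown direction: bail
```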

Post-interchange bookkeeping. After transformation, the pass: (a) calls sub_1AF8F90 to update LCSSA form for inner loop first, then outer; (b) reruns legality check via sub_1975210 as a safety recheck after LCSSA updates; (c) swaps direction-vector columns and loop-list positions; (d) decrements indices to try the next pair inward. The TTI availability boolean at a1+192 (checked via sub_1636850) is passed to the LCSSA updater as its 4th argument, controlling rewrite aggressiveness.

GPU considerations. The cost model counts memory accesses generically via SCEV stride analysis. There is no visible special handling for address spaces (shared vs. global vs. texture). The standard "stride-1 is good" locality model applies uniformly. For a reimplementation targeting GPUs, you would want to weight global-memory accesses (addrspace 1) far more heavily than shared-memory accesses (addrspace 3), since shared memory has no coalescing requirement. The 100-pair dependence limit prevents the pass from even being considered for CUDA kernels with massive shared-memory access patterns (e.g., tiled matrix multiplication). The pass does not check for barriers -- perfectly-nested loops with __syncthreads() in the inner body would be blocked by the call-instruction check unless the barrier is lowered to an intrinsic classified as safe (which it is not).


IRCE (Inductive Range Check Elimination)

Splits a loop into pre/main/post regions so that inductive range checks (bounds checks on the induction variable) can be eliminated from the main loop body, which executes the vast majority of iterations.

| Property | Value |
|---|---|
| Entry point | sub_194D450 (71 KB) -- InductiveRangeCheckElimination::run |
| Pass name | "irce" |
| Block threshold | dword_4FB0000 -- max basic blocks before bail-out |
| Debug flag | byte_4FAFE40 -- prints "irce: looking at loop" |
| Constrained mode | byte_4FAFBA0 -- relaxes canonical-form requirements |
| SCEV verify | byte_4FAFC80 -- post-transform range verification |
| Metadata flag | byte_4FAFF20 -- propagate "irce.loop.clone" metadata |
| NVIDIA delta | Minimal -- stock algorithm, "constrained" mode may help GPU strided patterns |

Stack frame and signature. The function allocates ~0x960 bytes (2400 bytes) of local state. Signature: sub_194D450(void *this_pass, void *Loop, void *LoopAnalysisManager, void *LoopStandardAnalysisResults, void *LPMUpdater). Returns PreservedAnalyses by value.

Algorithm (8 phases).

Phase 1 -- Early validation. Extracts ScalarEvolution, DominatorTree, LoopInfo, and BranchProbabilityInfo from LoopStandardAnalysisResults. Loads block count threshold from dword_4FB0000 and bails if the loop exceeds it. Checks simplify form (single latch, single exit, proper preheader).

Phase 2 -- Range check discovery. IRCE scans conditional branches in the loop body for ICmp instructions comparing the induction variable against loop-invariant bounds. The ICmp predicate dispatch table:

| Predicate value | LLVM predicate | Range check kind |
|---|---|---|
| 0x20 (32) | SLT (signed less-than) | UPPER |
| 0x22 (34) | SGT (signed greater-than) | LOWER (swapped operands) |
| 0x24 (36) | SGE (signed greater-equal) | LOWER |
| 0x26 (38) | UGE (unsigned greater-equal) | LOWER |
| 0x28 (40) | ULT (unsigned less-than) | UPPER |

Each candidate is classified into one of four kinds:

RANGE_CHECK_UNKNOWN = 0   (skip)
RANGE_CHECK_LOWER   = 1   (indvar >= lower_bound)
RANGE_CHECK_UPPER   = 2   (indvar < upper_bound)
RANGE_CHECK_BOTH    = 3   (lower <= indvar < upper)
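Since the kinds are bit flags (LOWER=1, UPPER=2, BOTH=3), classification of the checks guarding one branch reduces to OR-ing per-predicate kinds from the dispatch table above, as in this sketch (the table keys mirror the recovered predicate bytes; the combining logic is an assumption about how BOTH arises):

```python
# Range-check kinds are bit flags, so a LOWER and an UPPER check on the
# same branch combine to BOTH.
RANGE_CHECK_UNKNOWN, RANGE_CHECK_LOWER, RANGE_CHECK_UPPER, RANGE_CHECK_BOTH = 0, 1, 2, 3

PREDICATE_KIND = {
    0x20: RANGE_CHECK_UPPER,   # SLT: indvar <  bound
    0x22: RANGE_CHECK_LOWER,   # SGT: swapped operands, bound < indvar
    0x24: RANGE_CHECK_LOWER,   # SGE: indvar >= bound
    0x26: RANGE_CHECK_LOWER,   # UGE
    0x28: RANGE_CHECK_UPPER,   # ULT
}

def classify(pred_bytes):
    """OR together the kinds of all checks guarding one branch."""
    kind = RANGE_CHECK_UNKNOWN
    for p in pred_bytes:
        kind |= PREDICATE_KIND.get(p, RANGE_CHECK_UNKNOWN)
    return kind

assert classify([0x28]) == RANGE_CHECK_UPPER
assert classify([0x26, 0x28]) == RANGE_CHECK_BOTH   # lower <= i < upper
```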

The InductiveRangeCheck structure is 40 bytes (0x28), iterated with stride 0x28: Begin (SCEV, +0x00), Step (SCEV, +0x08), End (SCEV, +0x10), CheckUse (Use*, +0x18), Operand (Value*, +0x20), Kind (uint32, +0x24).

Phase 3 -- Filtering and validation. Calls sub_1949EA0 (classifyRangeCheckICmp) to validate each candidate. A bitvector (allocated at [rbp+var_460]) tracks valid checks. The "constrained" relaxation flag (byte_4FAFBA0) routes to sub_1949670 (canHandleRangeCheckExtended), allowing range checks where the induction variable relationship is slightly non-canonical -- useful for GPU thread-coarsened loops with strided access patterns. Validation requires: constant step (+1 or -1), loop-invariant bounds, simplify form, and SCEV-computable trip count.

Phase 4 -- SCEV-based bound computation. For each valid check, computes the safe iteration range [safe_begin, safe_end) using SCEV. Calls sub_145CF80 (SCEV getConstant), sub_147DD40 (SCEV getAddRecExpr / max/min), and sub_3870CB0 (isSafeToExpandAt). If expansion safety fails, the check is abandoned.

Phase 5 -- Preloop creation. Calls sub_194C320 (createPreLoop, ~1200 bytes) to clone the loop for iterations [0, safe_begin). Creates basic blocks named "preloop" and "exit.preloop.at". The clone remaps instructions and PHI nodes, creates the branch from preloop exit to mainloop entry, and updates dominator tree and loop info.

Phase 6 -- Postloop creation. Calls sub_194AE30 (createPostLoop, ~1300 bytes) for iterations [safe_end, trip_count). Calls sub_1949270 (adjustSCEVAfterCloning) to refresh SCEV expressions invalidated by cloning.

Phase 7 -- Two-path splitting for BOTH checks. When kind=3, IRCE creates TWO separate cloning operations, producing three loop clones total. Both sub_194C320 and a second call produce pre/main/post regions with BOTH range checks eliminated from the center.

Phase 8 -- Cleanup. Cleans up InductiveRangeCheck entries (stride 0x40 after alignment). If metadata flag byte_4FAFF20 is set, propagates "irce.loop.clone" metadata to cloned loops via red-black tree manipulation. Releases SCEV expression references via sub_1649B30.
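The net effect of phases 5-7 is an iteration-space partition, sketched here (plain Python; clamping behavior is an assumption for degenerate bounds): the main loop covers exactly the iterations where the range check is provably true, so the check can be deleted there, while pre/post keep the original checked body.

```python
def irce_partition(trip_count, safe_begin, safe_end):
    """Split [0, trip_count) into preloop / mainloop / postloop regions.
    The mainloop [safe_begin, safe_end) runs check-free."""
    lo = max(0, min(safe_begin, trip_count))
    hi = max(lo, min(safe_end, trip_count))
    return range(0, lo), range(lo, hi), range(hi, trip_count)

pre, main, post = irce_partition(100, 3, 90)
# The three regions cover the original iteration space exactly once.
assert list(pre) + list(main) + list(post) == list(range(100))
print(len(pre), len(main), len(post))  # 3 87 10
```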

GPU considerations. The block count threshold (dword_4FB0000) protects against pathologically large GPU kernel loops from unrolled or tiled computations. The constrained relaxation mode helps with range checks in GPU kernels where induction variables use non-canonical strides (common after thread coarsening). IRCE has no barrier awareness -- if a loop body contains __syncthreads(), the loop cloning would duplicate the barrier into all three clones (pre/main/post), which is correct but increases code size and instruction cache pressure. The pass does not check for convergent calls, so it could clone a loop containing warp-level primitives; this is safe because all three clones execute the same iterations as the original (just partitioned differently).

Pipeline position. IRCE runs after LoopSimplify and before LoopUnroll. It consumes canonicalized induction variables produced by IndVarSimplify and feeds into vectorization by removing bounds checks that would otherwise prevent LoopVectorize.


IndVarSimplify

Canonicalizes induction variables: simplifies IV users, performs Linear Function Test Replace (LFTR), replaces exit values with closed-form SCEV expressions, and sinks dead IV computations. This is the pass with the most significant NVIDIA modifications in this group.

| Property | Value |
|---|---|
| Core function | sub_1945A50 (65 KB) -- IndVarSimplify::run |
| NewPM wrapper | sub_19489B0 -- applies NVIDIA guards before core |
| Pass name | "indvars" |
| NVIDIA knob 1 | Disable-unknown-trip-iv at qword_4FAF520 -- skip pass for unknown-trip loops |
| NVIDIA knob 2 | iv-loop-level at qword_4FAF440, default 1 -- max nesting depth |
| NVIDIA knob 3 | disable-lftr at byte_4FAF6A0 -- disable LFTR entirely |
| Upstream knob | replexitval at dword_4FAF860 -- {never=0, cheap=1, always=2} |
| All knobs registered | ctor_203 at 0x4E1CD0 |
| NVIDIA delta | Significant -- two custom guard knobs plus depth limiter |

NVIDIA guards. Before the core algorithm runs, sub_19489B0 checks two NVIDIA-specific conditions:

  1. Loop depth gate (iv-loop-level): if sub_193DD90(loop) > qword_4FAF440[20], the pass is skipped entirely. sub_193DD90 is a recursive getLoopDepth() returning 1 for outermost loops. Default 1 means only outermost loops receive IV simplification. This controls compile time on deeply-nested stencil and tensor kernels that commonly have 3-5 nested loops.

  2. Unknown trip count gate (Disable-unknown-trip-iv): if LOBYTE(qword_4FAF520[20]) is set AND (sub_1CED350(loop) <= 1 OR !sub_1CED620(loop, header)), the pass is skipped. sub_1CED350 returns the SCEV-computed trip count; values <= 1 indicate unknown or trivial loops. This protects GPU kernels with divergent or dynamic bounds (where trip count depends on threadIdx or blockIdx) from aggressive IV transforms that can cause correctness issues with warp-level scheduling assumptions.
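The two gates compose into a simple predicate, sketched here (knob names mirror the recovered ones; modeling an unknown trip count as <= 1 follows the sub_1CED350 description above):

```python
def should_run_indvars(loop_depth, trip_count,
                       iv_loop_level=1, disable_unknown_trip_iv=False):
    """Sketch of the NewPM wrapper's guard logic: both gates must pass
    before the core IndVarSimplify algorithm runs on a loop."""
    if loop_depth > iv_loop_level:
        return False    # depth gate: only outermost loops by default
    if disable_unknown_trip_iv and trip_count <= 1:
        return False    # unknown-trip gate: protect divergent-bound loops
    return True

assert should_run_indvars(loop_depth=1, trip_count=128)
assert not should_run_indvars(loop_depth=2, trip_count=128)  # too deeply nested
assert not should_run_indvars(1, 0, disable_unknown_trip_iv=True)
```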

Core algorithm (five phases):

  1. Header PHI collection -- walks the loop header's instruction list via **(a2+32)+48, collecting all PHI nodes (opcode 77) as candidate induction variables into worklist v342.

  2. Per-IV rewriting -- for each PHI, calls sub_1B649E0 (SimplifyIndVar::simplifyIVUsers, via vtable at off_49F3848) to fold truncs/sexts/zexts, fold comparisons with known ranges, and eliminate redundant increment chains. Sets changed flag at a1+448. Then calls sub_1943460 (rewriteLoopExitValues) to replace uses of the IV outside the loop with closed-form SCEV expressions. New PHIs discovered during rewriting are pushed back to the worklist for fixpoint iteration.

  3. LFTR (Linear Function Test Replace) -- gated by four conditions: dword_4FAF860 != 0 (replexitval not "never") AND trip count not constant (!sub_14562D0), !byte_4FAF6A0 (disable-lftr not set), hasCongruousExitingBlock (sub_193E1A0), and exitValueSafeToExpand (sub_193F280). Selects the best IV via sub_193E640 (isBetterIV) preferring non-sign-extending, wider IVs with higher SCEV complexity (sub_1456C90). Computes a wide trip count via sub_1940670 (computeWideTripCount). Three rewriting strategies:

    • Strategy A: Integer IV with matching types -- computes exact exit value via APInt arithmetic, materializes as constant.
    • Strategy B: Type mismatch -- expands SCEV expression via sub_14835F0 (SCEVExpander::expandCodeFor), creates "wide.trip.count" instruction using ZExt (opcode 37) or SExt (opcode 38).
    • Strategy C: Direction check failure -- creates "lftr.wideiv" as a truncation (opcode 36, Trunc) down to exit condition type.
    • Finally creates "exitcond" ICmp instruction (opcode 51) with computed predicate v309 = 32 - depth_in_loop_set.
  4. Exit value replacement -- materializes closed-form exit values via SCEVExpander. The "cheap" mode (replexitval=1) adds a cost gate at sub_1941790 where dword_4FAF860 == 1 && !v136 && v31[24] skips expensive expansions (v136 = simple loop flag, v31[24] = per-candidate "expensive" flag from sub_3872990, the SCEV expansion cost model).

  5. Cleanup -- dead instruction removal (drains worklist at a1+48..a1+56, using opcode check: type <= 0x17 = LLVM scalar type), IV computation sinking (walks latch block backwards, tracks live set in red-black tree via sub_220EF30/sub_220EF80/sub_220F040, sinks dead IVs past loop exit via sub_15F2240), PHI predecessor fixup (handles Switch opcode 27 and Branch opcode 26 terminators), and sub_1AA7010 (deleteDeadPhis) on the loop header.

Additional upstream knobs present: indvars-post-increment-ranges (bool, default true), indvars-predicate-loops (bool, default true), indvars-widen-indvars (bool, default true), verify-indvars (bool, default false).

Pass state object layout:

| Offset | Type | Content |
|---|---|---|
| +0 | ptr | TargetTransformInfo |
| +8 | ptr | DataLayout / Module |
| +16 | ptr | DominatorTree |
| +24 | ptr | LoopInfo |
| +32 | ptr | DeadInstVector |
| +40 | ptr | ScalarEvolution |
| +48 | ptr | DeadInstWorklist array |
| +56 | u32 | DeadInstWorklist count |
| +60 | u32 | DeadInstWorklist capacity |
| +448 | byte | Changed flag |

GPU relevance. The depth limiter is important because CUDA stencil codes often have 3-5 nested loops, and running IndVarSimplify on inner loops can blow up compile time without meaningful benefit (inner loops typically have simple IVs already). The unknown-trip guard prevents miscompiles on kernels where the trip count depends on threadIdx or blockIdx. The interaction with IV Demotion (sub_1CD74B0) is notable: IndVarSimplify runs first and may widen IVs to 64-bit, then IV Demotion (a separate NVIDIA pass) narrows them back to 32-bit where the value range permits, reducing register pressure -- a critical factor for GPU occupancy.


LoopDistribute

Splits a single loop into multiple loops (loop fission), each containing a subset of the original instructions. The primary motivation is separating memory accesses with unsafe dependences from safe ones, enabling LoopVectorize to vectorize the safe partition.

| Property | Value |
|---|---|
| Entry point | sub_1A8CD80 (63 KB) -- LoopDistributePass::run |
| Pass name | "loop-distribute" |
| Force flag | byte_4FB5360 -- force distribution ignoring metadata |
| SCEV check threshold | qword_4FB5480 -- max runtime checks before bail-out |
| Secondary limit | qword_4FB53A0 -- max dependence checks per partition |
| Verify flag | byte_4FB56E0 -- post-distribution verification |
| NVIDIA delta | None -- stock LLVM algorithm |

Stack frame. ~0x780 bytes (1920 bytes). Signature: sub_1A8CD80(void *this_pass, void *Function, void *FunctionAnalysisManager).

Algorithm. The pass runs a gauntlet of six bail-out conditions per loop:

  1. "NotLoopSimplifyForm" -- sub_157F0D0 (Loop::isLoopSimplifyForm) fails.
  2. "MultipleExitBlocks" -- sub_157F0B0 (Loop::getUniqueExitBlock) returns null.
  3. Metadata "llvm.loop.distribute.enable" disabled (checked via sub_15E0530 MDNode lookup). byte_4FB5360 (force flag) overrides this.
  4. "NoUnsafeDeps" -- LAI flag at +0xDA (HasUnsafeDependences) is zero.
  5. "MemOpsCanBeVectorized" -- all memory operations already vectorizable.
  6. "TooManySCEVRuntimeChecks" -- SCEV check count at LAI +0x118 exceeds qword_4FB5480.

LoopAccessInfo (LAI) structure (0x130 = 304 bytes):

| Offset | Content |
|---|---|
| +0x00 | Loop* TheLoop |
| +0x08 | PredicatedScalarEvolution* PSE |
| +0x10 | RuntimeCheckingPtrGroup* PtrRtChecks |
| +0x90 | SmallVector buffer (16-byte aligned) |
| +0xDA | bool HasUnsafeDependences |
| +0xE0 | MemoryDepChecker::Dependence* DepArray |
| +0xE8 | uint32 NumDependences |
| +0x108 | SCEVUnionPredicate* Predicates |
| +0x110 | SCEVCheck* SCEVChecks |
| +0x118 | uint32 NumSCEVChecks |

Dependence entry (0x40 = 64 bytes per entry): source instruction (+0x00), destination instruction (+0x08), dep type info (+0x10), SCEV distance (+0x18), DependenceType byte (+0x28). Stride confirmed at shl rax, 6 (0x1A8E6B9).

If validation passes, the core phase builds a partition graph. Each instruction starts in its own partition. The partition hash set uses 16-byte slots with NVVM-layer sentinels (-8 / -16) and an additional -2 value for "unassigned" partitions. See Hash Table and Collection Infrastructure for the hash function, probing, and growth policy.

For each unsafe memory dependence pair, the pass either merges source and destination partitions (if the dependence cannot be broken) or marks it as cross-partition. A union-find structure tracks merged partitions. After merging, if at least two distinct partitions remain, sub_1B1E040 (distributeLoopBody, ~2000 bytes) clones the loop body once per partition, removes instructions not belonging to each partition, and wires the clones in dependence order. Optional runtime dependence checks (loop versioning) are added. Post-distribution: sub_1B1DC30 updates the dominator tree, sub_197E390 registers new loops, sub_143AA50 (ScalarEvolution::forgetLoop) invalidates SCEV cache. Metadata "distributed loop" (16 chars) is attached to prevent future re-distribution.

GPU relevance. Distribution is valuable for CUDA kernels that mix shared-memory and global-memory accesses in the same loop -- the shared-memory partition can often be vectorized independently. The "llvm.loop.distribute.enable" metadata is controllable via #pragma clang loop distribute(enable). The SCEV runtime check threshold (qword_4FB5480) balances runtime check overhead against distribution benefit -- GPU kernels often have simple loop structures but complex pointer arithmetic from tiled access patterns.


LoopIdiom

Recognizes loop patterns that correspond to standard library calls (memset, memcpy, memcmp, strstr) and replaces them with optimized implementations. CICC includes both the standard LoopIdiomRecognize pass and the newer LoopIdiomVectorize pass.

| Property | Value |
|----------|-------|
| Recognizer core | sub_196FF90 (51 KB) -- LoopIdiomRecognize::run |
| Memset detection | sub_196B740 (10 KB) -- detects memset_pattern16 |
| Memcpy/memmove | sub_196E000 (43 KB) |
| Mismatch expansion | sub_2AA00B0 (48 KB) -- expandMemCmpMismatch |
| String search expansion | sub_2AA3190 (40 KB) -- expandFindFirst |
| Pass name | "loop-idiom" (recognizer), "loop-idiom-vectorize" (vectorizer) |
| Vectorize knobs | disable-loop-idiom-vectorize-all, loop-idiom-vectorize-style (masked/predicated), loop-idiom-vectorize-bytecmp-vf, etc. |
| NVIDIA delta | None visible -- stock LLVM |

Standard idioms. The recognizer scans loops for store patterns that correspond to memset (constant value stored on every iteration) and memcpy/memmove (load-store pairs with matching strides). It also detects trip-count-decrement patterns ("tcphi", "tcdec") used in hand-written copy loops. Recognized patterns are lowered to @llvm.memset / @llvm.memcpy / @llvm.memmove intrinsics.
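For illustration, the memset idiom is simply a constant stored through a stride-1 pointer on every iteration; a loop like the one below would be rewritten to a single @llvm.memset intrinsic (hedged C++ sketch of the source pattern, not cicc's matcher):

```cpp
#include <cassert>

// A store loop the recognizer targets: same constant, unit stride,
// known trip count -> equivalent to memset(p, 0, n).
void zeroLoop(unsigned char *p, int n) {
    for (int i = 0; i < n; ++i)
        p[i] = 0;  // lowered to @llvm.memset(p, 0, n) when recognized
}
```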

Vectorized idiom expansion -- MemCmpMismatch (sub_2AA00B0). The expansion generates a two-tier multi-block IR structure:

  1. LoopIdiomExpansionState structure (80+ bytes): idiom type at +0 (0=byte, 1=word), loop info at +8, DataLayout at +16, alloc context at +24, target info at +32, output blocks at +48 through +80.

  2. 11 basic blocks created in sequence: "mismatch_end", "mismatch_min_it_check", "mismatch_mem_check", "mismatch_vec_loop_preheader", "mismatch_vec_loop", "mismatch_vec_loop_inc", "mismatch_vec_loop_found", "mismatch_loop_pre", "mismatch_loop", "mismatch_loop_inc", "byte.compare".

  3. Page-boundary safety protocol (shared with string search expansion): PtrToInt -> LShr by log2(pagesize) (from sub_DFB4D0 via DataLayout) -> ICmpNE of start/end page numbers. If both pointers stay within a single page, wider-than-element vector loads are safe; otherwise, @llvm.masked.load provides the fallback. The page size is retrieved via sub_DFB4D0(*a1[32]) from the target DataLayout.

  4. Vector loop body: dispatches to sub_2A9D690 (byte-granularity) or sub_2A9EC20 (word-granularity) based on *a1 idiom type. Generates vector load + compare + cttz (count trailing zeros via sub_B34870).

  5. Scalar fallback: byte-by-byte comparison with "mismatch_index" phi node, induction variable add (sub_929C50), and ICmpULT (sub_92B530(0x20)) loop bound check.

  6. LCSSA verification: explicit assertion "Loops must remain in LCSSA form!" via sub_D48E00. SE/LI/DT invalidated/recalculated on exit (sub_FFCE90, sub_FFD870, sub_FFBC40).
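The page-boundary protocol in step 3 reduces to comparing page numbers after a logical right shift; a minimal sketch, assuming a 4 KB page (cicc reads the real page size from the DataLayout via sub_DFB4D0):

```cpp
#include <cassert>
#include <cstdint>

// PtrToInt -> LShr by log2(pagesize) -> ICmpNE, as described above.
// If start and end fall in the same page, a wider-than-element vector
// load cannot fault; otherwise the masked-load fallback is used.
constexpr unsigned kPageShift = 12;  // log2(4096); assumption for this sketch

bool samePage(uint64_t start, uint64_t end) {
    return (start >> kPageShift) == (end >> kPageShift);
}
```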

Vectorized idiom expansion -- FindFirst (sub_2AA3190). Implements vectorized first-occurrence search (strstr-like):

  1. 7 basic blocks: "scalar_preheader", "mem_check", "find_first_vec_header", "match_check_vec", "calculate_match", "needle_check_vec", "search_check_vec".

  2. Needle splatting: needle[0] is extracted via ExtractElement (sub_B4DE80) with index 0, frozen via sub_B37620, then splatted across all vector lanes via ShuffleVector (sub_B36550). The splat enables parallel comparison of the haystack against the needle's first character.

  3. Masked loads: @llvm.masked.load (sub_B34C20) provides page-boundary-safe vectorized reads. Same page-boundary protocol as mismatch expansion.

  4. Two nested loops: outer scans haystack, inner verifies full needle match at candidate positions. PHI nodes: "psearch" (haystack), "pneedle" (needle position), "match_start", "match_vec".

GPU considerations. LoopIdiom is present in cicc but its value on GPU code is limited. GPU memset/memcpy are typically handled by device runtime calls or specialized PTX instructions (st.global, ld.global with vectorized widths) rather than loop-based patterns. The vectorized mismatch/search expansions target CPU-style byte-level operations that are rare in GPU kernels. The page-boundary safety protocol is irrelevant on GPU (virtual memory page faults work differently -- GPU global memory is always accessible within the allocation). The pass runs but likely fires infrequently. When it does fire, the generated @llvm.memset/@llvm.memcpy intrinsics are later lowered to PTX-specific sequences by the NVPTX backend.


LoopRotate

Transforms loops so that the latch block (back-edge source) becomes the exiting block (where the exit condition is tested). This converts "while" loops into "do-while" form, which is a prerequisite for LICM (the loop body is guaranteed to execute at least once, enabling unconditional hoisting) and simplifies trip count computation for SCEV.

| Property | Value |
|----------|-------|
| Entry point (legacy) | sub_18A3090 -- called directly in O1/O2/O3 pipeline |
| Entry point (new PM) | sub_28448D0 -- LoopRotatePass with "header-duplication;" param |
| Core implementation | sub_2A0CFD0 (65 KB) -- LoopRotation::runOnLoop |
| String markers | ".lr.ph" (preheader), "h.rot", "pre.rot" |
| Pass name | "loop-rotate" |
| Params | no-header-duplication / header-duplication |
| Pipeline knob | enable-loop-header-duplication (bool) -- controls default param |
| NVIDIA delta | None -- stock LLVM, but appears multiple times in pipeline |

Pipeline placement. LoopRotate appears at least four times in the cicc pipeline across different tiers:

  1. Full O1+ pipeline, position 11: sub_18A3090() in sub_12DE330 -- runs before LICM (sub_184CD60) and IndVarSimplify.
  2. Tier 1 passes: appears alongside SimplifyCFG and InstCombine as part of the canonicalization loop.
  3. Tier 2 passes: appears again in the LoopRotate+LICM pair.
  4. Pipeline assembler: sub_195E880 appears 4 times (labeled "LICM/LoopRotate"), conditional on opts[1240] and opts[2880].

This multiple invocation is standard LLVM practice -- rotation may be needed again after other transforms invalidate the rotated form. In the Ofcmid fast-compile pipeline, LoopRotate does not appear as a standalone pass; LICM (which internally depends on rotation) handles it.

Algorithm. The pass duplicates the loop header into the preheader (creating a "rotated" header named "h.rot" or "pre.rot"), then rewires the CFG so the original header becomes the latch. The header-duplication parameter controls whether the header is actually duplicated (which increases code size) or only the branch is restructured. After rotation, SCEV's backedge-taken count computation becomes straightforward because the exit test is at the latch.
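The effect of rotation can be seen in source form: the duplicated header test becomes a guard, and the exit test moves to the latch, so the body executes at least once whenever the guard passes (a hedged C++ analogy to the CFG transformation, not the pass's own code):

```cpp
#include <cassert>

// "while" form: exit test in the header (pre-rotation).
int sumWhile(const int *a, int n) {
    int s = 0, i = 0;
    while (i < n) { s += a[i]; ++i; }
    return s;
}

// Rotated "do-while" form: guard = duplicated header test; the exit
// test now sits at the latch, which is what lets LICM hoist
// unconditionally and lets SCEV compute the backedge-taken count.
int sumRotated(const int *a, int n) {
    int s = 0, i = 0;
    if (i < n) {
        do { s += a[i]; ++i; } while (i < n);
    }
    return s;
}
```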

SCEV interaction. LoopRotate requires BTC (backedge-taken count) recomputation after the header/latch swap. This is handled by ScalarEvolution::forgetLoop being called by downstream passes that depend on fresh SCEV data.

GPU considerations. LoopRotate is purely a structural transformation that does not examine instruction semantics. It has no barrier awareness -- if a barrier (__syncthreads()) is in the loop header, it will be duplicated into the preheader during rotation. In practice, barriers in CUDA kernels are rarely in loop headers (they are typically in loop bodies or between loops). The header duplication can increase code size, which affects instruction cache utilization on GPU -- SM instruction caches (L0/L1 I-cache) are small (typically 12-48 KB per SM depending on architecture), so excessive duplication of large loop headers across many loops in a kernel could cause I-cache pressure. The pass does not have a size threshold to prevent this.


LoopSimplify

Enforces LLVM's canonical loop form: single preheader, single latch, single dedicated exit block, and no abnormal edges. Nearly every loop optimization pass requires simplify form as a precondition.

| Property | Value |
|----------|-------|
| Canonicalization core | sub_1A5B3D0 (62 KB) |
| DomTree update helper | sub_1A593E0 (47 KB) |
| Preheader insertion | sub_1A5E350 (25 KB) |
| Exit block normalization | sub_1A5F590 (42 KB) |
| Pass name | "loop-simplify" |
| String markers | ".backedge", "llvm.loop" |
| Pipeline wrapper (standalone) | sub_1832270(n) where n = verify flag |
| Pipeline wrapper (bundled) | sub_1841180() -- LoopSimplify + LCSSA combined |
| NVIDIA delta | None -- stock LLVM |

Pipeline placement. LoopSimplify is the most frequently invoked loop pass in the cicc pipeline:

| Context | Call site | Position |
|---------|-----------|----------|
| Full O1+ pipeline | sub_1841180() | Position 40 (bundled with LCSSA) |
| Ofcmid pipeline | sub_1832270(1) | Position 11 (standalone) |
| Ofcmid pipeline | sub_1841180() | Position 15 (bundled with LCSSA) |
| Post-tier insertion | sub_1841180() | Tier 2/3 additional invocations |
| As precondition | sub_157F0D0 (check) | Called by LoopInterchange, LoopDistribute, IRCE, LoopVectorize |

The pass appears at least 5 times across different pipeline tiers. It also runs as a utility called by other loop passes -- LoopInterchange, LoopDistribute, IRCE, and LoopVectorize all check isLoopSimplifyForm() (sub_157F0D0) and bail out if it fails.

What it does. If a loop lacks a single preheader, LoopSimplify creates one by inserting a new basic block on the entry edge (named with .lr.ph suffix via sub_1A5E350). If multiple latch blocks exist, it merges them into one (inserting .backedge blocks). If exit blocks are shared with other loops, it creates dedicated exit blocks via sub_1A5F590 (42 KB normalization function). After transformation, loop metadata ("llvm.loop" nodes) is preserved on the new latch terminator.

GPU considerations. LoopSimplify is purely structural and has no GPU-specific implications. However, it is worth noting that StructurizeCFG (which runs after all loop optimizations, during NVPTX code generation) re-canonicalizes the CFG for GPU divergence handling. Loop structures created by LoopSimplify may be further modified by StructurizeCFG when the loop contains divergent branches. The two passes do not interfere because they run in different pipeline phases (IR optimization vs. code generation).


LCSSA (Loop-Closed SSA)

Ensures that every value defined inside a loop and used outside it passes through a PHI node at the loop exit. This invariant simplifies SSA-based transformations: passes can modify loop internals without worrying about breaking uses outside the loop.

| Property | Value |
|----------|-------|
| Formation pass | sub_1AE2630 (49 KB) |
| Lightweight form | sub_1961B00 (13 KB) -- creates .lcssa PHI nodes |
| LCSSA updater | sub_1AF8F90 -- used by LoopInterchange post-transformation |
| Pass name | "lcssa" |
| Verify knob | verify-loop-lcssa registered at ctor_094 (~0x4A2491) |
| String markers | ".lcssa" suffix on PHI node names |
| NVIDIA delta | None -- stock LLVM |

Pipeline placement. LCSSA runs bundled with LoopSimplify via sub_1841180() at position 40 in the full pipeline. In the Ofcmid fast-compile pipeline, it appears at position 15 via the same bundled wrapper. It is also maintained incrementally by every pass that modifies loop structure:

  • LoopInterchange calls sub_1AF8F90 to update LCSSA form for both inner and outer loops after transformation. The inner loop is updated first. The TTI availability boolean from a1+192 is passed as the 4th argument to the updater.
  • LoopUnroll checks LCSSA form via sub_D49210 and generates .unr-lcssa blocks for unrolled iterations.
  • LoopIdiom expansions (sub_2AA00B0, sub_2AA3190) end with explicit verifyLoopLCSSA assertion ("Loops must remain in LCSSA form!").

What it does. For each instruction defined inside the loop, LCSSA checks all uses outside the loop's exit blocks. For each such use, it inserts a PHI node in the exit block with the defined value as the incoming value from the latch. The PHI node is named with a .lcssa suffix. After LCSSA formation, all external uses of loop-internal values go through these PHI nodes, and loop transforms only need to update the PHI nodes rather than chasing all external uses.

GPU considerations. LCSSA is purely structural and has no GPU-specific behavior. However, LCSSA PHI nodes interact with the NVPTX backend's divergence analysis: when a loop exit depends on a divergent condition (different threads take different exit iterations), the .lcssa PHI node at the exit carries a divergent value. The divergence analysis pass (NVVMDivergenceLowering, sub_1C76260) must handle these PHIs correctly to avoid generating incorrect predication. This is not an issue with LCSSA itself but with downstream consumers.


Function Map

| Function | Address | Size | Role |
|----------|---------|------|------|
| IndVarSimplify::run (core) | sub_1945A50 | 65 KB | -- |
| IndVarSimplifyPass::run (NewPM wrapper with NVIDIA guards) | sub_19489B0 | -- | -- |
| rewriteLoopExitValues | sub_1943460 | -- | -- |
| replaceExitValuesWithCompute (LFTR commit) | sub_1941790 | -- | -- |
| computeWideTripCount | sub_1940670 | -- | -- |
| hasCongruousExitingBlock | sub_193E1A0 | -- | -- |
| getLoopDepth (recursive, 1 for outermost) | sub_193DD90 | -- | -- |
| isBetterIV (candidate comparison for LFTR) | sub_193E640 | -- | -- |
| exitValueSafeToExpand (SCEV expandability check) | sub_193F280 | -- | -- |
| findFinalIVValue (trace IV to exit value) | sub_193F190 | -- | -- |
| hasSafeExitBlock (exit block LFTR safety) | sub_193F750 | -- | -- |
| initPassState (initialize pass-level state) | sub_1940CE0 | -- | -- |
| clearPassState (cleanup per-iteration state) | sub_1940B30 | -- | -- |
| SimplifyIndVar::simplifyIVUsers | sub_1B649E0 | -- | -- |
| LoopInterchange::processLoopList | sub_1979A90 | 69 KB | -- |
| LoopInterchange legality checker | sub_1975210 | 45 KB | -- |
| LoopInterchange dependence analysis helper | sub_1978000 | 37 KB | -- |
| LoopInterchange::getAnalysisUsage | sub_19743F0 | -- | -- |
| SmallVector copy helper (dep vector / loop list) | sub_19742B0 | -- | -- |
| vector<DepVector> push_back | sub_1974CB0 | -- | -- |
| Swap loop bounds / trip count metadata | sub_1973F90 | -- | -- |
| InductiveRangeCheckElimination::run | sub_194D450 | 71 KB | -- |
| createPreLoop / cloneLoopForRange (~1200 bytes) | sub_194C320 | -- | -- |
| createPostLoop / wirePostLoop (~1300 bytes) | sub_194AE30 | -- | -- |
| classifyRangeCheckICmp (~800 bytes) | sub_1949EA0 | -- | -- |
| canHandleRangeCheck (~400 bytes) | sub_1949540 | -- | -- |
| canHandleRangeCheckExtended (~300 bytes, constrained mode) | sub_1949670 | -- | -- |
| buildInductiveRangeCheck (~500 bytes) | sub_1949C30 | -- | -- |
| adjustSCEVAfterCloning | sub_1949270 | -- | -- |
| simplifyLoopAfterCloning (~200 bytes) | sub_1948FD0 | -- | -- |
| verifyLoopStructure (~200 bytes) | sub_1948D70 | -- | -- |
| LoopDistributePass::run | sub_1A8CD80 | 63 KB | -- |
| distributeLoopBody (core fission engine, ~2000 bytes) | sub_1B1E040 | -- | -- |
| updateDominatorTree (post-distribution, ~400 bytes) | sub_1B1DC30 | -- | -- |
| updateLoopInfo (post-distribution, ~300 bytes) | sub_1B1DDA0 | -- | -- |
| cleanupPartitions (~400 bytes) | sub_1B1F0F0 | -- | -- |
| verifyDistribution (~300 bytes) | sub_1B216C0 | -- | -- |
| cleanupAfterDistribution (~200 bytes) | sub_1A8C510 | -- | -- |
| lookupPartitionForInstruction (hash table lookup) | sub_3860240 | -- | -- |
| hasDirectDependence(partA, partB) | sub_385DBB0 | -- | -- |
| alreadyMerged(partA, partB) | sub_385DB90 | -- | -- |
| isSafeToDistribute (final safety check) | sub_1452CB0 | -- | -- |
| LoopIdiomRecognize::run | sub_196FF90 | 51 KB | -- |
| LoopIdiom memset pattern detection | sub_196B740 | 10 KB | -- |
| LoopIdiom memcpy/memmove patterns | sub_196E000 | 43 KB | -- |
| expandMemCmpMismatch | sub_2AA00B0 | 48 KB | -- |
| expandFindFirst (string search vectorization) | sub_2AA3190 | 40 KB | -- |
| expandByteMismatchLoopBody (type 0) | sub_2A9D690 | -- | -- |
| expandWordMismatchLoopBody (type 1) | sub_2A9EC20 | -- | -- |
| replaceUsesOfPhiInSuccessors (LCSSA fixup) | sub_2A9D330 | -- | -- |
| LoopRotation::runOnLoop | sub_2A0CFD0 | 65 KB | -- |
| LoopRotatePass (NewPM, "header-duplication;") | sub_28448D0 | -- | -- |
| LoopRotate (legacy pipeline call) | sub_18A3090 | -- | -- |
| LoopSimplify canonical form enforcement | sub_1A5B3D0 | 62 KB | -- |
| LoopSimplify DomTree update helper | sub_1A593E0 | 47 KB | -- |
| LoopSimplify preheader insertion | sub_1A5E350 | 25 KB | -- |
| LoopSimplify exit block normalization | sub_1A5F590 | 42 KB | -- |
| LoopSimplify pipeline wrapper (with verify flag) | sub_1832270 | -- | -- |
| LoopSimplify + LCSSA bundled pass | sub_1841180 | -- | -- |
| LCSSA formation pass | sub_1AE2630 | 49 KB | -- |
| LCSSA lightweight .lcssa PHI insertion | sub_1961B00 | 13 KB | -- |
| LCSSA form updater (used post-interchange) | sub_1AF8F90 | -- | -- |
| verifyLoopLCSSA (assertion: "Loops must remain in LCSSA form!") | sub_D48E00 | -- | -- |

Differences from Upstream LLVM

| Aspect | Upstream LLVM | CICC v13.0 |
|--------|---------------|------------|
| IndVarSimplify knobs | Stock LLVM defaults; no GPU-specific configuration | Three NVIDIA-specific knobs that change IV widening/narrowing behavior for GPU register pressure management |
| Barrier awareness | No concept of GPU barriers or synchronization primitives | None of the 8 standard passes have explicit barrier awareness; barrier handling deferred to dedicated NVIDIA passes (Dead Barrier Elimination, convergence token verification) |
| LoopRotate frequency | Runs once or twice in pipeline | Appears multiple times as canonicalization prerequisite for LICM and unrolling; forms the backbone of loop pass infrastructure |
| LoopIdiom patterns | memset, memcpy recognition for CPU targets | Same patterns; GPU-specific expansion handled downstream by MemmoveUnroll pass |
| IRCE | Range check elimination for deoptimization-safe targets | Present but effectiveness limited on GPU: no deoptimization support, relies on SCEV range analysis for bound proofs |
| LoopInterchange | Cost model driven by cache locality | Same legality checks; profitability analysis implicitly favors stride-1 access (coalescing) over cache line optimization |
| IV Demotion | Not present | Downstream NVIDIA pass (IV Demotion) narrows IVs widened by IndVarSimplify back to 32-bit where GPU value ranges permit |

Cross-References

  • LoopVectorize & VPlan -- LoopDistribute feeds vectorization; IRCE removes bounds checks that block it.
  • Loop Unrolling -- Runs after IndVarSimplify canonicalizes IVs; requires LoopSimplify form. The unroll-runtime-convergent knob forces epilogue mode when convergent calls (warp-level primitives) are present -- an interaction with GPU barrier semantics that these 8 standard passes do not handle.
  • LICM -- Requires LoopRotate and LoopSimplify as prerequisites.
  • ScalarEvolution -- IndVarSimplify and IRCE are among the heaviest SCEV consumers; LoopInterchange uses SCEV for stride analysis. LoopRotate and LoopDistribute call ScalarEvolution::forgetLoop after transformation.
  • SCEV Invalidation -- LoopRotate requires BTC recomputation after header/latch swap; LoopDistribute calls forgetLoop after fission.
  • Loop Strength Reduction -- Runs after IndVarSimplify; consumes the canonicalized IV forms it produces. LSR has address-space-aware chain construction for shared memory (addrspace 3) that these 8 passes lack.
  • IV Demotion -- NVIDIA's custom pass that narrows IVs widened by IndVarSimplify back to 32-bit where value ranges permit, reducing register pressure for GPU occupancy.
  • Dead Barrier Elimination -- Handles barrier optimization that these standard loop passes do not address.
  • Pipeline & Ordering -- LoopRotate at position 11, LoopSimplify/LCSSA at position 40 in the full O1+ pipeline.
  • NVVMDivergenceLowering -- Handles divergent LCSSA PHI nodes at loop exits when different threads take different exit iterations.

Loop Unrolling

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (decision engine), llvm/lib/Transforms/Utils/LoopUnroll.cpp (transformation engine), llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp (runtime unrolling) (LLVM 20.0.0)

Loop unrolling in cicc is one of the most heavily tuned transformations in the entire pipeline. On a GPU, unrolling directly trades register pressure against instruction-level parallelism: every additional copy of the loop body increases live register count, which reduces SM occupancy and the number of concurrent warps available to hide memory latency. Conversely, too little unrolling leaves performance on the table by failing to expose independent instructions that the hardware scheduler can overlap. NVIDIA's unroller resolves this tension through a priority-based decision cascade with GPU-specific heuristics that have no upstream equivalent -- most notably a local-array threshold multiplier, power-of-two factor enforcement, and a pragma threshold twice that of stock LLVM. The transformation engine itself is a lightly modified version of upstream llvm::UnrollLoop, but the decision engine (computeUnrollCount) is substantially reworked.

The pass appears twice in the cicc pipeline. The first invocation (sub_197E720) runs early, interleaved with loop vectorization in the main optimization sequence. The second invocation (sub_19C1680) runs later as a cleanup pass, gated by opts[1360] (the nv-disable-loop-unrolling flag). Both share the same decision engine; the second invocation operates on loops that were created or exposed by intervening passes (InstCombine, SROA, EarlyCSE).

| Property | Value |
|----------|-------|
| Decision engine | sub_19BB5C0 / computeUnrollCount (50 KB, ~1681 lines) |
| Transformation engine | sub_2A15A20 / UnrollLoop (85 KB, ~2434 lines) |
| Top-level driver | sub_19BE360 / tryToUnrollLoop |
| Runtime-check unroller | sub_2A25260 / UnrollLoopWithRuntimeChecks (91 KB) |
| Pipeline slot (early) | sub_197E720 -- runs once in main opt pipeline |
| Pipeline slot (late) | sub_19C1680 -- conditional on !opts[1360] |
| Disable knob | -Xcicc "-disable-LoopUnrollPass" or opts[1360] |
| LLVM base | LoopUnrollPass from LLVM 20.0.0 |

Why Unrolling Matters More on GPU

On a CPU, the primary benefit of unrolling is reducing branch overhead and enabling wider SIMD scheduling. On a GPU, the calculus is different in three ways that all trace back to the GPU execution model:

First, unrolling increases register pressure, and register pressure determines occupancy. If unrolling pushes a kernel from 64 to 96 registers per thread, the SM drops from 32 to 21 resident warps -- a 34% reduction. Fewer warps means less latency hiding, so the unroll factor selection must be conservative in ways that a CPU unroller never needs to be.

Second, there is no out-of-order execution within a warp; the hardware issues instructions in program order. Unrolling creates independent instructions that the compiler (ptxas) can interleave, particularly independent loads that can overlap with arithmetic. This is the ILP benefit, and it is the primary argument for aggressive unrolling.

Third, GPU loops often access shared memory (__shared__) or local memory arrays indexed by threadIdx. Unrolling these loops enables the backend to promote array elements to registers and to rearrange memory accesses to avoid bank conflicts. NVIDIA's local-array heuristic (see below) exists specifically to exploit this opportunity.

The unroller's job is to find the sweet spot: enough copies to saturate the instruction pipeline, few enough to keep register pressure within occupancy targets.

The Decision Engine: computeUnrollCount

The decision engine at sub_19BB5C0 implements a strict six-level priority cascade. Each level is tried in order; the first level that produces a valid unroll factor wins. Every decision is logged through optimization remarks, making the logic traceable from -Rpass-analysis=loop-unroll.

UnrollParams Struct Layout

The decision communicates its result through a struct passed by pointer (a12 / v14):

| Offset | Field | Type | Description |
|--------|-------|------|-------------|
| +0 | Threshold | u32 | Cost budget for full unroll |
| +4 | MaxPercentThresholdBoost | u32 | Max boost percentage (default 400) |
| +12 | PartialThreshold | u32 | Cost budget for partial unroll |
| +20 | Count | u32 | Chosen unroll factor (primary output) |
| +24 | PeelCount | u32 | Loop peel iteration count |
| +28 | DefaultUnrollCount | u32 | Fallback count when no factor found |
| +32 | MaxCount | u32 | Hard cap on unroll factor |
| +36 | FullUnrollMaxCount | u32 | Max trip count for full unroll |
| +40 | FixedCost | u32 | Non-scaling cost (IV increments, branches) |
| +44 | AllowPartial | u8 | Partial unrolling permitted |
| +45 | AllowRemainder | u8 | Remainder loop generation permitted |
| +46 | UserProvidedCount | u8 | True when pragma supplies count |
| +48 | (reserved) | u8 | -- |
| +49 | AllowUpperBound | u8 | Use max-trip-count when exact unknown |

The Cost Model

Every decision in the cascade uses the same linear cost model to estimate unrolled loop size:

estimated_size = FixedCost + Count * (LoopBodySize - FixedCost)

LoopBodySize is the instruction cost of one iteration (parameter a11, computed by LLVM's CodeMetrics). FixedCost captures instructions that do not replicate with unrolling -- induction variable increments, the backedge branch, loop overhead. The difference (LoopBodySize - FixedCost) is the per-copy marginal cost.

For full unrolls, an additional dynamic cost simulation (sub_19B9A90) constant-folds through the unrolled body. If the loop contains iteration-dependent simplifications (constant array indices, strength-reduced expressions), the simulation reports a cost lower than worst-case. The effective budget for this check is boosted:

dynamic_budget = Threshold * MaxPercentThresholdBoost / 100

With the default boost of 400%, this means a loop whose body simplifies substantially after unrolling gets 4x the normal cost budget.
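The two formulas above can be stated directly; using the documented defaults (Threshold 300, boost 400%), a fully-simplifying loop gets a 1200-unit budget instead of 300:

```cpp
#include <cassert>
#include <cstdint>

// Linear size estimate used throughout the cascade:
// FixedCost does not replicate; (bodySize - fixedCost) is per-copy cost.
uint32_t estimatedSize(uint32_t fixedCost, uint32_t count, uint32_t bodySize) {
    return fixedCost + count * (bodySize - fixedCost);
}

// Boosted budget applied to the dynamic cost simulation for full unrolls.
uint32_t dynamicBudget(uint32_t threshold, uint32_t boostPct) {
    return threshold * boostPct / 100;
}
```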

Priority Cascade (Pseudocode)

int computeUnrollCount(Loop *L, SE, TTI, TripCount, MaxTripCount,
                       BodySize, UnrollParams *UP, bool *AllowRuntime) {

    // PRIORITY 1: Local array threshold multiplier (NVIDIA-specific)
    int localSize = computeLocalArraySize(L);  // scans for AS5 allocas
    int multiplier = min(max(localSize, 1), 6);
    int effectiveThreshold = multiplier * UP->Threshold;

    // PRIORITY 2: #pragma unroll N
    int pragmaCount = getMetadataCount(L, "llvm.loop.unroll.count");
    if (pragmaCount != 0) {
        if (pragmaCount == 1) {
            UP->Count = 1;  // disable unrolling
            return UNROLL_DISABLED;
        }
        UP->Count = pragmaCount;
        int estSize = UP->FixedCost + pragmaCount * (BodySize - UP->FixedCost);
        if (estSize > multiplier * PragmaUnrollThreshold) {
            // too large -- try to find smaller factor
            searchSmallerDivisibleFactor(UP, TripCount);
        }
        if (TripMultiple % pragmaCount != 0)
            emitRemark("remainder loops not allowed");
        return UNROLL_PRAGMA;
    }

    // PRIORITY 3: #pragma unroll (full, no count)
    if (hasMetadata(L, "llvm.loop.unroll.full")) {
        if (TripCount > 0 && TripCount <= UP->FullUnrollMaxCount) {
            int estSize = UP->FixedCost + TripCount * (BodySize - UP->FixedCost);
            if (estSize <= effectiveThreshold) {
                if (simulateLoopBody(L, TripCount, dynamicBudget))
                    { UP->Count = TripCount; return FULL_UNROLL; }
            }
        }
        // fallthrough to lower priorities
    }

    // PRIORITY 4: Loop peeling
    int peelCount = computePeelCount(L, SE, UP);
    if (peelCount > 0) {
        UP->PeelCount = peelCount;
        UP->Count = 1;
        return PEEL;
    }

    // PRIORITY 5: Static partial unrolling (known trip count)
    if (TripCount > 0 && (UP->AllowPartial || pragmaOversize) && isInnermost(L)) {
        int count = UP->Count ? UP->Count : UP->DefaultUnrollCount;

        // Size clamp
        if (UP->PartialThreshold < UP->FixedCost + count * (BodySize - UP->FixedCost))
            count = (UP->PartialThreshold - UP->FixedCost) / (BodySize - UP->FixedCost);
        count = min(count, UP->MaxCount);

        // Power-of-two + trip-divisible search
        while (count > 0) {
            if (TripCount % count == 0 && isPowerOfTwo(count))
                break;
            count--;
        }

        // Fallback: halve DefaultUnrollCount until it fits
        if (count == 0 && UP->UserProvidedCount) {
            count = UP->DefaultUnrollCount;
            while (UP->PartialThreshold <
                   UP->FixedCost + count * (BodySize - UP->FixedCost))
                count >>= 1;
        }

        if (count > 1) { UP->Count = count; return PARTIAL_UNROLL; }
    }

    // PRIORITY 6: Runtime unrolling (unknown trip count)
    if (!hasMetadata(L, "llvm.loop.unroll.runtime.disable")
        && RuntimeUnrollThreshold >= BodySize
        && isInnermost(L)) {

        int rtTripCount = computeRuntimeTripCount(L, SE);
        if (rtTripCount < FlatLoopTripCountThreshold) return NO_UNROLL;

        int count = UP->Count ? UP->Count : UP->DefaultUnrollCount;
        // same halving + threshold logic as Priority 5
        while (UP->PartialThreshold <
               UP->FixedCost + count * (BodySize - UP->FixedCost))
            count >>= 1;
        count = min(count, UP->MaxCount);

        if (count > 1) {
            UP->Count = count;
            *AllowRuntime = true;
            return RUNTIME_UNROLL;
        }
    }

    // Small-function override (tiny kernels get aggressive unrolling)
    if (functionInstructionCount < SmallFunctionThreshold)
        return handleSmallFunction(L, UP, BodySize);

    return NO_UNROLL;
}

Local Array Heuristic

The function sub_19B5DD0 (computeLocalArraySize) is entirely NVIDIA-specific. It scans every basic block in the loop for load/store instructions that access address space 5 (GPU local memory). For each such access, it traces back to the underlying alloca, determines the array type, and computes the product of array dimensions. If any dimension is unknown at compile time, it substitutes the unroll-assumed-size knob (default 4). The returned value is the maximum local-array size found across all accesses.

This value becomes a threshold multiplier, capped at 6:

int computeLocalArraySize(Loop *L) {
    int maxSize = 0;
    for (BasicBlock *BB : L->blocks()) {
        for (Instruction &I : *BB) {
            if (!isLoadOrStore(I) || getAddressSpace(I) != 5) continue;
            Value *base = getUnderlyingAlloca(I);
            if (!base || !isArrayType(base->getType())) continue;
            int size = 1;
            for (int dim : getArrayDimensions(base))
                size *= (dim > 0) ? dim : UnrollAssumedSize;  // default 4
            maxSize = max(maxSize, size);
        }
    }
    return maxSize;
}

The rationale: GPU kernels frequently use __shared__ or local arrays indexed by threadIdx. Unrolling such loops by a factor proportional to the array size enables register promotion of individual array elements and eliminates bank-conflict-prone access patterns. The cap at 6 prevents pathological explosion when arrays are large.

Power-of-Two Factor Enforcement

The partial-unroll factor search at Priority 5 requires the chosen count to satisfy two constraints simultaneously: it must evenly divide the trip count and must be a power of two. The implementation uses the classic bitmask test:

while (count > 0) {
    if (tripCount % count == 0 && (count & (count - 1)) == 0)
        break;
    count--;
}

This is a GPU-specific requirement. Warp size is 32 (a power of two), and many GPU memory access patterns, shared-memory bank calculations, and reduction operations assume power-of-two alignment. An unroll factor of, say, 6 would create asymmetric loop bodies that interact poorly with warp-level execution.
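The Priority 5 search above, written out as a standalone function (a sketch of the documented loop, not decompiled code):

```cpp
#include <cassert>

// Largest factor <= start that both divides the trip count and is a
// power of two; returns 0 if none exists.
int findPartialFactor(int start, int tripCount) {
    for (int count = start; count > 0; --count)
        if (tripCount % count == 0 && (count & (count - 1)) == 0)
            return count;
    return 0;
}
```

Note how a non-power-of-two divisor such as 6 is skipped in favor of the next power of two below it.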

Pragma Handling

The frontend (sub_9305A0 / emitUnrollPragma) translates CUDA pragmas to LLVM metadata during codegen:

| CUDA Source | LLVM Metadata |
|-------------|---------------|
| #pragma unroll (bare) | !{!"llvm.loop.unroll.full"} |
| #pragma unroll N (N > 1) | !{!"llvm.loop.unroll.count", i32 N} |
| #pragma unroll 1 | Disables unrolling at Priority 2 |

The metadata is attached to the backedge branch as a self-referential !llvm.loop node. A guard flag (dword_4D046B4) skips pragma processing entirely in fast-codegen mode.

The pragma threshold is 32768 (0x8000), compared to upstream LLVM's 16384 (0x4000). This means #pragma unroll succeeds on loop bodies up to approximately 32K cost units -- covering virtually any realistic GPU kernel loop. When even this generous budget is exceeded, the decision engine falls through to lower priorities and attempts partial unrolling.

The __launch_bounds__ attribute does not directly feed the unroll decision. Instead, it constrains register allocation downstream, which indirectly limits the benefit of aggressive unrolling. There is no feedback loop from register pressure estimation back into the unroll factor at this stage of the pipeline; that coordination happens implicitly through the PartialThreshold provided by TTI.

Runtime Unrolling

Runtime unrolling (Priority 6) handles loops whose trip count is unknown at compile time. cicc enables it by default (unroll-runtime = true), with several GPU-specific twists:

Convergent instruction support. The knob unroll-runtime-convergent (default true, NVIDIA-specific) allows unrolling loops that contain convergent operations like warp-level primitives (__shfl_sync, __ballot_sync). Upstream LLVM refuses to unroll such loops because it cannot guarantee all threads in the warp execute the same iterations. cicc overrides this, relying on the waterfall-epilogue mechanism to preserve convergence.

Epilog vs. prolog remainder. The choice is controlled by a cascade:

  1. If waterfall-unrolling-force-epilogue is true (default, NVIDIA-specific) and the loop has runtime trip count: epilog mode is selected.
  2. If the loop body contains function calls (hasCallInLoop / sub_2A10B40 checks for opcode 17): epilog mode is forced. This preserves the property that all threads in a warp participate in calls, which matters for convergent operations.
  3. Otherwise, unroll-runtime-epilog (default false) determines the mode.

In practice, GPU loops almost always use epilog-style remainders.

Flat-loop exclusion. If the estimated runtime trip count is below flat-loop-tripcount-threshold (default 5), runtime unrolling is skipped. The overhead of generating the modulo check and epilog loop is not worth it for loops that iterate fewer than 5 times.

Body size gate. Runtime unrolling only proceeds if runtime-unroll-threshold (default 95) is greater than or equal to the loop body size. This is more conservative than the static partial-unroll threshold, preventing code explosion for large loop bodies when the trip count is unknown.

Thresholds: NVIDIA vs. Upstream LLVM

Parameter | Upstream LLVM (O3) | Upstream LLVM (NVPTX TTI) | cicc v13.0
Threshold | 300 | 300 | From TTI (300), then multiplied by local-array factor (1-6x)
PartialThreshold | 150 | 75 (Threshold / 4) | From TTI (75), plus local-array scaling
MaxPercentThresholdBoost | 400% | 400% | 400% (same)
PragmaUnrollThreshold | 16384 | 16384 | 32768
RuntimeUnrollThreshold | -- | -- | 95 (NVIDIA addition)
FlatLoopTripCountThreshold | 5 | 5 | 5 (same)
MaxUpperBound | 8 | 8 | 8 (same)
MaxPragmaUpperBound | -- | -- | 64 (NVIDIA addition)
DefaultUnrollRuntimeCount | 8 | 8 | From TTI
AllowPartial | false | true | true (from TTI)
Runtime | false | true | true (from TTI)
AllowRemainder | true | true | true
MaxIterationsCountToAnalyze | 10 | 10 | 10 (same)
UnrollAssumedSize | -- | -- | 4 (NVIDIA addition)

The critical differences: cicc doubles the pragma threshold, introduces a body-size gate for runtime unrolling (95), adds the local-array multiplier (up to 6x on base thresholds), and enforces power-of-two partial factors. The upstream NVPTX TTI enables partial and runtime unrolling but leaves thresholds at modest CPU-oriented values; cicc's decision engine applies substantial additional logic on top.

Interaction with Loop Vectorization

In the cicc pipeline, loop vectorization (LoopVectorizePass) runs before the first unroll invocation. Specifically, sub_197E720 combines both vectorization and unrolling decisions in the early pipeline slot. The vectorizer decides the vector width first (VF), and if it applies a transformation, the resulting loop (possibly with a scalar epilog) is then presented to the unroller.

This means vectorization and unrolling do not "coordinate" in the planning sense -- the vectorizer runs to completion before the unroller sees the loop. However, the vectorizer's interleave count (IC) serves a similar role to unrolling: it replicates the vectorized loop body to increase ILP. When the vectorizer chooses IC > 1, the subsequent unroller typically finds the loop body too large to unroll further, producing a de facto coordination through cost thresholds.

The second unroll invocation (sub_19C1680) runs much later, after InstCombine, SROA, and EarlyCSE have had a chance to simplify the vectorized code. Loops that were too large to unroll earlier may become eligible after dead code elimination within the unrolled-and-vectorized body.

The Transformation Engine: UnrollLoop

The transformation at sub_2A15A20 takes a loop and an unroll factor and physically duplicates the loop body. It is structurally close to upstream llvm::UnrollLoop with the following entry guards:

  1. Loop must have a preheader (sub_D4B130)
  2. Loop must have a single latch (sub_D47930)
  3. Loop must be in LCSSA form (sub_D49210)
  4. Header flags must be clean (no special bits set)

The duplication proceeds by iterating Count - 1 times, each iteration cloning every basic block in the loop body, remapping instructions through a value map, and rewiring PHI nodes so that iteration i's latch feeds iteration i+1's header. After all copies, the backedge of the last copy is reconnected to the first copy's header (for partial unroll) or removed entirely (for full unroll).

For partial unrolls where TripCount % Count != 0, a remainder loop is generated by sub_2A23640. If remainder generation fails (e.g., multi-exit loops), the engine delegates to sub_2A25260 which generates the runtime-check variant with prologue/epilogue.

The return value encodes the result: 0 = no change, 1 = partial unroll, 2 = full unroll.

Configuration Knobs

Standard LLVM Knobs (with NVIDIA defaults)

Knob | Default | Global | Effect
unroll-threshold | From TTI | sub_19B7760 struct | Base cost budget for full unroll
unroll-partial-threshold | From TTI | 0x4FB3140 area | Cost budget for partial unroll
unroll-max-percent-threshold-boost | 400 | dword_4FB3100 | Max dynamic cost boost (%)
unroll-max-iteration-count-to-analyze | 10 | dword_4FB3020 | Max iterations for cost simulation
unroll-count | Unset | dword_4FB2EA8 | Force specific unroll factor
unroll-max-count | Unset | sub_19B7760 struct | Hard cap on unroll factor
unroll-full-max-count | Unset | 0x4FB2CE0 area | Max trip count for full unroll
unroll-peel-count | Unset | 0x4FB2C00 area | Force specific peel count
unroll-allow-partial | false | 0x4FB2B20 area | Enable partial unrolling override
unroll-allow-remainder | false | 0x4FB2A40 area | Enable remainder loop generation
unroll-runtime | true | 0x4FB2960 area | Enable runtime (dynamic TC) unrolling
unroll-max-upperbound | 8 | dword_4FB2920 | Max trip count for upper-bound unroll
pragma-unroll-threshold | 32768 | dword_4FB2760 | Cost budget for pragma-directed unrolls
flat-loop-tripcount-threshold | 5 | 0x4FB2680 area | Min estimated TC for runtime unroll
runtime-unroll-threshold | 95 | dword_4FB3560 | Max body size for runtime unroll
max-pragma-upperbound-unroll | 64 | dword_4FB2840 | Max upper-bound factor for pragma
unroll-assumed-size | 4 | dword_4FB33A0 | Assumed array size for unknown dims

NVIDIA-Specific Knobs

Knob | Default | Global | Effect
unroll-runtime-convergent | true | 0x500A440 area | Allow unrolling loops with convergent ops
unroll-runtime-epilog | false | qword_500A3E8 | Force epilog-style remainder (override)
waterfall-unrolling-force-epilogue | true | qword_500A148 | Force epilog for waterfall patterns

Knobs are registered in two constructors: standard LLVM knobs in ctor_216_0 at 0x4E5C30, NVIDIA-specific knobs in ctor_501 at 0x559890.

Function Map

Function | Address | Size | Role
emitUnrollPragma | 0x09305A0 | -- | Frontend: #pragma unroll to metadata
parseUnrollMetadata | 0x19B4C50 | -- | Reads llvm.loop.unroll.* metadata
computeLocalArraySize | 0x19B5DD0 | -- | NVIDIA: local array threshold heuristic
handleSmallFunction | 0x19B6500 | -- | Special aggressive unroll for tiny kernels
selectUnrollFactor | 0x19B6690 | -- | Trip count analysis helper
emitRemainderNotAllowedRemark | 0x19B78B0 | -- | Diagnostic emission
simulateLoopBody | 0x19B9A90 | -- | Dynamic cost simulation with constant folding
computeUnrollCount | 0x19BB5C0 | -- | Main decision engine
tryToUnrollLoop | 0x19BE360 | -- | Top-level driver
computePeelCount | 0x1B0B080 | -- | Loop peeling logic
computeRuntimeTripCount | 0x1B18810 | -- | Runtime trip count estimation
hasCallInLoop | 0x2A10B40 | -- | Checks for call/invoke in loop body
createSideExitPHI | 0x2A10DD0 | -- | PHI nodes for side-exit unrolled loops
cloneInstructionsInBlock | 0x2A12AD0 | -- | Instruction-level cloning
reconcileLoopAfterUnroll | 0x2A13F00 | -- | Post-unroll SCEV/LoopInfo fixup
UnrollLoop | 0x2A15A20 | -- | Main transformation engine
unrollCostModel | 0x2A1AA10 | -- | Cost estimation helper
UnrollAndJamLoop | 0x2A1CF00 | -- | Unroll-and-jam variant
generateRemainderLoop | 0x2A23640 | -- | Remainder loop construction
UnrollLoopWithRuntimeChecks | 0x2A25260 | -- | Prologue/epilogue generation

Pass Factory and Object Layout

The following section documents the LoopUnroll pass factory at sub_19B73C0, which was originally misidentified as LICM in the P2C.3 sweep due to binary adjacency with the actual LICM pass. The vtable at unk_4FB224C, the 7-parameter constructor signature, and diagnostic function strings all confirm LoopUnroll identity.

The pass factory at sub_19B73C0 allocates a 184-byte pass object and accepts seven parameters that control unroll behavior. When a parameter is -1, the pass uses its compiled-in default.

Constructor Parameters

Parameter | Offset | Enable Flag | Semantics
a1 (optimization level) | +156 | -- | 2 = standard, 3 = aggressive
a2 (unroll threshold) | +168 | +172 | Trip count threshold; -1 = use default
a3 (unroll count) | +160 | +164 | Explicit unroll factor; -1 = use default
a4 (allow partial) | +176 | +177 | 0 = disable partial unroll, 1 = enable
a5 (runtime unroll) | +178 | +179 | 0 = disable runtime unroll, 1 = enable
a6 (upper bound) | +180 | +181 | 0 = disable upper-bound unroll, 1 = enable
a7 (profile-based) | +182 | +183 | 0 = disable profile-guided unroll, 1 = enable

Object Construction

The factory allocates 184 bytes via sub_22077B0, sets the vtable to off_49F45F0 (loop-unroll pass vtable), stores pass ID unk_4FB224C at offset +16, initializes self-referential linked-list pointers at offsets +80/+88 and +128/+136, sets pass type 2 (FunctionPass) at offset +24, and calls sub_163A1D0 / sub_19B71A0 for pass registration.

Pipeline Invocation Configurations

CICC invokes LoopUnroll with six distinct configurations at different pipeline stages, reflecting NVIDIA's careful tuning of unroll aggressiveness per compilation phase. These are the factory-level parameter sets passed to sub_19B73C0; see also the decision engine's per-invocation behavior in The Decision Engine above.

Configuration A: Standard Pipeline (O1/O2)

Call site: sub_12DE330

LoopUnroll(2, -1, -1, -1, -1, -1, -1)

All parameters at defaults. Standard unrolling with default thresholds at optimization level 2.

Configuration B: Code-Size Mode

Call site: sub_12DE8F0, when *(a3+4480) < 0 (NVIDIA code-size flag set)

LoopUnroll(a2, -1, -1, 0, 0, 0, 0)

All unrolling features disabled: partial, runtime, upper-bound, and profile-based are all zeroed. The pass only unrolls when the trip count is statically known and the benefit is certain. This reflects the constraint that GPU register pressure makes speculative unrolling expensive when code size matters.

Configuration C: Normal Optimizer

Call site: sub_12DE8F0, when *(a3+4480) >= 0 (normal mode)

LoopUnroll(a2, -1, -1, -1, -1, -1, -1)

Fully aggressive unrolling with all defaults. The optimization level is passed through from the caller.

Configuration D: Late Pipeline (Conservative)

Call site: sub_12DE8F0, late pipeline position

LoopUnroll(a2, -1, -1, 0, 0, -1, -1)

Partial and runtime unrolling disabled, but upper-bound and profile-based unrolling retain their defaults. This conservative late-pipeline configuration avoids creating new runtime overhead in code that has already been substantially optimized.

Configuration E: Aggressive Pipeline (O3)

Call site: sub_12E54A0

LoopUnroll(3, -1, -1, 0, 0, -1, 0)

Optimization level 3 with aggressive thresholds, but partial, runtime, and profile-based unrolling are disabled. Only upper-bound unrolling retains its default. The rationale is that at O3, the higher thresholds already capture most profitable unrolling opportunities without needing speculative runtime checks.

Configuration F: User-Configured

Call site: sub_12EA3A0

LoopUnroll(a1[4], a1[5], a1[6], a1[7], a1[8], a1[9], a1[10])

All seven parameters are read from a stored configuration object, enabling user-specified unroll behavior via command-line flags or pragmas.

Threshold Initialization (Pass-Level)

The function sub_19B6690 (17 KB) configures unroll thresholds based on optimization level and LLVM knobs at pass construction time. These values feed into the UnrollParams struct consumed by the decision engine.

Default Threshold Values

Offset | Field | Default (O2+) | Default (O1)
+0 | OptThreshold | 405 | 150
+4 | Threshold | 400 | 400
+12 | SmallTripCountThreshold | 150 | 150
+56 | MaxIterationsCountToAnalyze | 60 | 60

Function-Attribute-Aware Override

The threshold initializer queries function attributes via sub_1560180:

  • Attribute ID 34 (minsize): Reduces OptThreshold to SmallTripCountThreshold (150).
  • Attribute ID 17 (optsize): Same reduction.

This means kernels annotated with size constraints get conservative unroll thresholds regardless of the global optimization level.

Per-Function Knob Override via BST

The function queries the LLVM option registry (dword_4FA0208 BST) ten times, each time looking up a different knob address. For each knob, it searches the BST rooted at dword_4FA0208[2], compares the current function hash (sub_16D5D50) against node ranges, and applies the override if the knob value meets the threshold. The knob-to-field mapping:

Knob Address | Override Address | Field
dword_4FB3228 | dword_4FB32C0 | OptThreshold (+0)
dword_4FB3148 | dword_4FB31E0 | SmallTripCountThreshold (+12)
dword_4FB3068 | dword_4FB3100 | Threshold (+4)
dword_4FB2DC8 | dword_4FB2E60 | field +32
dword_4FB2CE8 | dword_4FB2D80 | field +36
dword_4FB2C08 | dword_4FB2CA0 | field +24
dword_4FB2B28 | (next value) | field +40

The per-function BST lookup keyed by function hash enables fine-grained tuning of unroll behavior per kernel, a capability not present in upstream LLVM.

Diagnostic Functions

Three diagnostic emission functions produce optimization remarks:

Function | Address | Diagnostic
emitPragmaCountDiag | sub_19B78B0 | Reports when pragma unroll count conflicts with trip multiple
emitThresholdDiag | sub_19B7B10 | Reports when unrolled size exceeds threshold
emitLoopSizeDiag | sub_19B7D80 | Reports when loop body is too large to unroll

Main Loop Processing and Hash Infrastructure

The primary analysis function sub_19B7FA0 (11 KB) analyzes each candidate loop. The pass uses hash table infrastructure shared with other CICC LLVM passes:

Function | Address | Size | Role
rehashSmallTable | sub_19B60B0 | 5 KB | Small hash table resize
rehashTable | sub_19B8820 | 4 KB | Key-value hash table resize
rehashSet | sub_19B89E0 | 7 KB | Set hash table resize
insertIntoSet | sub_19B8DA0 | -- | Set insert with growth

All hash tables use the same (value >> 9) ^ (value >> 4) hash function and linear probing strategy found throughout CICC's LLVM passes. See Hash Infrastructure for the common implementation.

Differences from Upstream LLVM

Aspect | Upstream LLVM | CICC v13.0
Pragma threshold | PragmaUnrollThreshold default 16384 | PragmaUnrollThreshold doubled to 32768; enables aggressive pragma-directed unrolling for GPU kernels
Power-of-two enforcement | No power-of-two requirement; any profitable factor accepted | Enforces power-of-two unroll factors; non-power-of-two factors are rounded down to avoid irregular loop tails
Local array multiplier | No concept of local array bonus | Dedicated local-array threshold multiplier boosts unroll budget when loop body accesses alloca/.local arrays indexed by IV, enabling register promotion
Decision engine | ~20 KB computeUnrollCount | Substantially reworked 50 KB computeUnrollCount (sub_19BB5C0) with 6-level priority cascade and GPU-specific occupancy heuristics
Register pressure model | Generic TTI-based unroll cost; no occupancy concept | Occupancy-aware cost model considers register pressure cliffs where one additional register per thread drops warp occupancy
Pipeline invocations | Single invocation in optimization pipeline | Two invocations: early (interleaved with vectorization) and late (cleanup, gated by opts[1360] / nv-disable-loop-unrolling)
Transformation engine | Stock llvm::UnrollLoop | Lightly modified UnrollLoop (sub_2A15A20, 85 KB); decision engine is where the changes concentrate

Test This

The following kernel contains a simple counted loop that is a prime candidate for full unrolling. Compile and compare PTX output with and without #pragma unroll.

__global__ void unroll_test(float* out, const float* in) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    #pragma unroll
    for (int i = 0; i < 8; i++) {
        sum += in[tid + i * 128];
    }
    out[tid] = sum;
}

What to look for in PTX:

  • With #pragma unroll: the loop should be fully unrolled into 8 sequential ld.global.f32 + add.f32 sequences with no backedge branch. Look for the absence of bra instructions targeting a loop header and the presence of 8 distinct ld.global.f32 instructions with addresses offset by 128*sizeof(float).
  • Without #pragma unroll (remove the pragma): the compiler may still unroll if the trip count (8) times body size fits within the threshold (default 300). Check whether the PTX has a loop or is fully unrolled -- this exercises the automatic decision engine.
  • With #pragma unroll 1: the loop must remain as a counted loop with a backedge branch. This tests that pragma disabling works.
  • Compare .nreg values across the three variants. Full unrolling increases register pressure (8 loads live simultaneously); the partial or no-unroll variant uses fewer registers at the cost of loop overhead.
  • The power-of-two enforcement is visible when the trip count is not a power of two: change the loop bound to 6 and check whether the compiler partially unrolls by 4 (the factor 6 rounded down to the nearest power of two) rather than by 6.

Cross-References

LoopVectorize and VPlan (GPU-Adapted)

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp, llvm/lib/Transforms/Vectorize/VPlan*.cpp (LLVM 20.0.0). VPlan infrastructure lives in llvm/lib/Transforms/Vectorize/VPlan.cpp, VPlanRecipes.cpp, VPlanTransforms.cpp, and related files.

LLVM version note: CICC v13.0 is based on LLVM 20.0.0 trunk. Evidence includes histogram-pattern support (merged in LLVM 19), early-exit vectorization (LLVM 20 experimental feature, gated by byte_500CDA8), and the VPlan-native path. The VPlan object size (656 bytes) is consistent with LLVM 17/18+ layout. Scalable vectors are always disabled for NVPTX.

NVIDIA's cicc ships a heavily modified copy of LLVM's LoopVectorizePass, the single largest pass in the vectorization pipeline at 88 KB of decompiled output (2,612 lines in sub_2AF1970). The modifications do not change the pass's fundamental architecture -- it still builds VPlans, selects a vectorization factor (VF) through cost modeling, and transforms IR through VPlan execution -- but the cost model, VF selection heuristics, interleave count logic, and legality checker are all tuned for a target where "vectorization" means something fundamentally different than on a CPU.

On a CPU, loop vectorization fills SIMD lanes: a VF of 4 on SSE processes four float elements per vector instruction. On an NVIDIA GPU, there are no SIMD lanes in the CPU sense -- each thread already executes scalar code, and the warp executes 32 threads in lockstep. The reasons to vectorize on GPU are:

  1. Memory coalescing -- adjacent threads issuing adjacent loads produce 128-byte cache line transactions, and vectorizing a per-thread loop body with VF=2 or VF=4 produces ld.v2/ld.v4 wide loads that maximize bytes-per-transaction.
  2. Reducing instruction count -- a single ld.global.v4.f32 replaces four ld.global.f32 instructions, saving fetch/decode/issue bandwidth.
  3. Register-to-memory width matching -- PTX supports 32-, 64-, and 128-bit load/store widths, and vectorization widens narrow scalar accesses to fill these naturally.

Key Facts

Property | Value
Registration | New PM #400, parameterized: no-interleave-forced-only;...
Runtime positions | Not in Tier 0/1/2/3 tables; invoked via LLVM standard sub-pipeline sub_1A62BF0 when vectorization is enabled (see Pipeline)
Main entry point | sub_2AF1970 (0x2AF1970) -- LoopVectorizePass::processLoop()
Binary size | 88 KB decompiled, 2,612 lines
VPlan builder | sub_2AEE460 (0x2AEE460) -- tryToBuildVPlanWithVPRecipes(), 56 KB
VPlan object size | 656 bytes (0x290), consistent with LLVM 17/18 layout
LLVM base | LLVM 20 trunk (evidence: histogram-pattern support, early-exit vectorization, VPlan-native path)
Scalable vectors | Always disabled -- sub_DFE610 returns false for NVPTX
Register bit width (TTI) | 32 bits fixed (TypeSize::getFixed(32) in upstream NVPTXTTIImpl)
Pass name string | "vectorize-loops" at 0x439F095
Address cluster | 0x2AA0000--0x2C20000 (loop vectorizer + VPlan infrastructure)

Why Vectorize on GPU

GPU vectorization is not about filling SIMD lanes -- the SIMT model already replicates scalar code across 32 threads. Vectorization targets three orthogonal benefits related to memory coalescing and instruction throughput:

Memory coalescing width. The GPU memory subsystem services requests in 128-byte transactions. If a single thread's inner loop accesses 4 consecutive floats in sequence, those 4 accesses become 4 separate scalar loads issued over 4 iterations. Vectorizing with VF=4 converts them into one ld.global.v4.f32, which the memory subsystem can service in a single wider transaction per thread. Across the warp, this multiplies the effective memory bandwidth.

Instruction count reduction. PTX's ld.v2 and ld.v4 instructions load 2 or 4 elements with a single instruction. The instruction issue pipeline has finite throughput (typically 1-2 instructions per clock per scheduler), so halving instruction count directly improves throughput-bound kernels.

Register-width matching. PTX has 32-bit typed registers. A 128-bit ld.v4.f32 loads directly into four consecutive registers via a single instruction, which is strictly better than four separate 32-bit loads (each requiring its own address computation).

These benefits are bounded by register pressure -- the primary constraint that does not exist on CPU. On a GPU, every additional register per thread can cross an occupancy cliff, potentially losing an entire warp group. A VF=4 vectorization that quadruples the live register count may halve occupancy and lose net throughput.

The 8-Phase Pipeline

The main function sub_2AF1970 implements eight phases, closely following upstream LLVM's structure but with GPU-specific decision points at each stage.

Phase 1: Legality Pre-Check

sub_31A4FD0(legalityCtx, Loop, Function, ORE, SE)    // init legality scratch
TTI = *(**(Loop+32) + 72)                             // Loop->getHeader()->getParent()->getTTI()
if (!sub_31A91F0(legalityCtx, TTI, Loop, LoopInfo))   // canVectorize() quick check
    return false

sub_31AF060(costCtx, ForceVectorization)              // canVectorize() full check
// ForceVectorization = qword_500D340[17] ("vectorize-loops" knob)

The legality checker (sub_31AF060) performs standard LLVM legality analysis: loop simplify form, single exit, computable backedge-taken count, no irreducible control flow. The NVIDIA-specific addition is early-exit loop handling:

if (hasUncountableEarlyExit && !byte_500CDA8)         // -enable-early-exit-vectorization
    emit "UncountableEarlyExitLoopsDisabled"
    return false

This knob (byte_500CDA8) gates an LLVM 20 feature that NVIDIA includes but disables by default. Early-exit vectorization requires predicated execution, which on GPU means divergent warps -- typically unprofitable.

Phase 2: Outer vs Inner Loop Dispatch

if (Loop->getSubLoops().size() > 0)
    goto outerLoopPath                                // PATH A (rarely taken on GPU)
else
    goto innerLoopPath                                // PATH B (the main path)

Outer loop vectorization is controlled by byte_500D208 (-force-vector-width-outer). When enabled and TTI-based VF selection returns VF <= 1, the pass forces VF=4 -- a hardcoded NVIDIA override for kernel patterns where outer-loop vectorization benefits warp-level memory access patterns. In practice, inner loop vectorization (Path B) handles the vast majority of GPU kernels.

Phase 3: Trip Count and Safety Checks

tripCount = getSmallBestKnownTC(PSE, Loop)            // sub_2AA7EC0
if (tripCount < VectorizerMinTripCount                // dword_500EAE8
    && !isForceVectorize(legalCtx)
    && !(exactTC >= userVF))
    emit "LowTripCount"
    reduce hint to interleave-only

if (hasAttribute(TTI, NoImplicitFloat))               // attribute 30
    bail "NoImplicitFloat"

if (hasUnsafeFPOps && !canReorderFP(override))
    bail "UnsafeFP" / "CantReorderFPOps"

The FP reorder safety check has an override mechanism: dword_500D508 selects whether the override is active, and byte_500D588 provides the override value. This lets NVIDIA force-allow FP reordering for specific compilation modes (e.g., -ffast-math propagated from nvcc).

Phase 4: VF Selection

This is where NVIDIA diverges most from upstream. The upstream algorithm queries TTI::getRegisterBitWidth() which returns the vector register width (256 for AVX2, 512 for AVX-512), then computes VF = registerWidth / elementSize. On NVPTX, getRegisterBitWidth() returns 32 -- a single scalar register width. This means the upstream formula would always produce VF=1 for 32-bit types.

NVIDIA's VF selection (sub_2AB8AC0 for outer loops, sub_2AE08E0 for inner loops via VPlan cost) works differently:

// sub_2AB8AC0 — outer loop VF selection (simplified)
elementBits = getWideningElementSize(CostModel)       // sub_2AB4370: top 32 bits
regWidth    = TTI.getRegisterBitWidth(Vector)          // sub_DFE640: returns 32
VF          = regWidth / (elementBits / 8)

if (!isScalable && VF <= 1 && forceOuterMode)          // byte_500D208
    VF = 4                                             // NVIDIA hardcoded override

For inner loops (the common path), VF selection goes through the full VPlan cost model:

// sub_2AE08E0 — selectBestVF() from VPlan candidates
bestCost = INT64_MAX
for each VPlan in candidatePlans:
    for each VF in VPlan.VFRange:
        cost = computeCostForVF(VPlan, VF)             // sub_2AE0750
        if isBetterThan(cost, bestCost):               // sub_2AB3FE0
            bestVF = VF
            bestCost = cost
return {bestVF, isScalable, bestIC}

The cost accumulation uses saturating arithmetic -- __OFADD__ overflow detection clamping to INT64_MAX/INT64_MIN -- preventing wrap-around in cost comparisons. This is defensive engineering for GPU kernels with very large loop bodies where naive summation could overflow.

Phase 5: Cost Model Construction

The cost model object (sub_2AB2780, 16 parameters) assembles all analysis results into a single context:

CostModel = {
    Loop*, DominatorTree*, LoopBlocksRPO*, ScalarEvolution*,
    TargetLibraryInfo*, AssumptionCache*, PredicatedScalarEvolution*,
    ValuesToIgnore=0, ORE*,
    /* additional context fields */
}

The VPlan planner (sub_2AF13F0) generates VPlans for all candidate VFs, then sub_2AE08E0 selects the best one. Each VPlan recipe provides its own cost through the virtual getVPCost(VF, CostCtx) method, which delegates to NVPTXTargetTransformInfo for GPU-specific instruction costs.

Phase 6: Profitability Decision and Interleave Selection

After VF selection, the pass evaluates a decision matrix:

Condition | Result
VF=1, not scalable | VectorizationNotBeneficial -- bail
IC=1 but user wanted more | InterleavingNotBeneficial
IC>1 but user disabled | InterleavingBeneficialButDisabled
Histogram loop + scalar interleave | HistogramPreventsScalarInterleaving -- bail
VF=1, IC>1 | Interleave-only path: executeVPlan(VF=1, IC)
VF>1 | Full vectorization path

The histogram diagnostic (HistogramPreventsScalarInterleaving) is an NVIDIA addition not present in upstream LLVM. It blocks scalar interleaving of histogram-pattern loops where reduction ordering constraints make interleaving incorrect without vectorization.

Interleave count selection (sub_2AED330) is register-pressure-bounded on GPU:

// sub_2AED330 — selectInterleaveCount() (simplified)
maxIC = TTI.getMaxInterleaveFactor(VF)                 // sub_DFB120(TTI+448)
// Override knobs:
if (VF.isScalar() && ForceTargetMaxScalarInterleave)   // dword_500E148
    maxIC = ForceTargetMaxScalarInterleave
if (VF.isVector() && ForceTargetMaxVectorInterleave)   // dword_500E068
    maxIC = ForceTargetMaxVectorInterleave

tripCount = getSmallBestKnownTC(PSE, Loop)
IC = bit_floor(tripCount / (VF * 2))                  // conservative: vector loop runs >= 2x
IC = min(IC, maxIC)

// Small loop boost
if (loopCost < SmallLoopCost)                          // qword_500DC88
    smallIC = min(IC, bit_floor(SmallLoopCost / loopCost))
    IC = max(IC, smallIC)

// Scheduling-based cap (NVIDIA-specific TTI path)
issueWidth = *(TTI + 56 + 32)                          // scheduling info at TTI+88
latency    = *(TTI + 56 + 36)                          // scheduling info at TTI+92
IC = IC / max(issueWidth, latency)                     // cap by scheduling model

// Aggressive interleave mode
if (byte_500D908)                                      // AggressiveInterleave
    IC = maxIC                                         // bypass all heuristics

IC = clamp(IC, 1, maxIC)
return powerOf2Floor(IC)

On CPU, the interleave count is bounded by vector register count (e.g., 16 YMM registers / registers per iteration). On GPU, it is bounded by register pressure impact on occupancy -- the TTI scheduling info encodes this constraint. The AggressiveInterleave knob (byte_500D908) bypasses all heuristics and sets IC to the maximum, useful for benchmarking or known-good kernels.

Phase 7: VPlan Execution and Epilogue Vectorization

mainVPlan = getBestPlanFor(bestVF)                     // sub_2BF1320
executeVPlan(mainVPlan, bestVF, IC)                    // sub_2AE3460

// Epilogue vectorization (when byte_500ED88 is set)
epilogueVF = selectEpilogueVectorizationFactor()       // sub_2ABBD40
if (epilogueVF > 1):
    clonedPlan = cloneVPlan(mainVPlan)                 // sub_2BF7CB0
    epiloguePlan = getBestPlanFor(epilogueVF)
    mergeVPlans(clonedPlan, epiloguePlan)              // sub_2AB0350
    // Remap operands between main and epilogue plans:
    //   recipe types 29 (load/store), 36 (phi), 17 (GEP)
    //   types 19-20 (inttoptr/ptrtoint casts)
    executeVPlan(merged, epilogueVF, epilogueIC, isEpilogue=true)

Epilogue vectorization is particularly relevant on GPU: the scalar remainder loop after vectorization forces warp divergence (some threads in the warp execute the epilogue while others are masked off), which is expensive. A vectorized epilogue with a smaller VF reduces the scalar remainder to fewer iterations, minimizing divergence overhead.

The epilogue VF selection (sub_2ABBD40) can be forced via qword_500ECA8 (-epilogue-vectorization-force-VF). When not forced, it uses SCEV range analysis (sub_DC3A60, sub_DBB9F0) to prove the epilogue trip count is sufficient for the candidate VF.

Phase 8: Post-Vectorization Metadata

The pass applies follow-up loop metadata (llvm.loop.vectorize.followup_all, llvm.loop.vectorize.followup_epilogue) and emits optimization remarks through sub_2AC2B40. Generated basic blocks use naming conventions vec.epilog.middle.block and vec.epilog.vector.body.

VPlan Construction (sub_2AEE460)

The VPlan builder allocates a 656-byte VPlan object and iterates over candidate VFs in powers of 2 (VF *= 2 each iteration, visible as add r15d, r15d in the binary). For each VF, it calls sub_2AA9E60 (tryToBuildRecipesForVF).

Recipe type tags observed in the binary:

Tag | Recipe Type
0x04 | VPWidenMemoryInstructionRecipe
0x0F | VPWidenRecipe
0x1D | VPReplicateRecipe
0x21 | VPWidenSelectRecipe
0x43 | VPWidenCallRecipe

Interleave group recipes are built from LoopAccessInfo at [Planner+0x28]+0x150. The builder removes individual load/store recipes and replaces them with interleave group recipes via sub_2AB9570 (replaceAllUsesWith), using a hash map with the pointer-hash function (ptr >> 4) ^ (ptr >> 9) & mask -- identical to LLVM's DenseMap hash.

Cost annotation happens in Phase 6 of VPlan construction via sub_2C2E3C0, which walks all recipes and annotates them with TTI-derived costs. This is where NVPTXTargetTransformInfo shapes the cost model: it prices ld.v4 cheaper than 4x ld.f32, making vectorization profitable even with register pressure increase.

The VPlan verification flag at 0x500D2E8 enables VPlan dump/verify paths -- useful for debugging vectorization decisions with -mllvm -vplan-verify-or-dont.

NVPTXTargetTransformInfo Hooks

The loop vectorizer reaches NVIDIA's TTI through Loop->getHeader()->getParent()->getTTI() (recovered as *(**(Loop+32)+72)). Key hooks:

TTI Method | Address | GPU Behavior
getRegisterBitWidth(Vector) | sub_DFE640 | Returns 32 (fixed) -- single scalar register width
supportsScalableVectors() | sub_DFE610 | Returns false -- no SVE/RVV equivalent
getMaxInterleaveFactor() | sub_DFB120 | Queried at TTI+448; register-pressure-bounded
getMaxInterleaveFactor(vectorized) | sub_DFB730 | Separate limit for vectorized loops
hasAttribute(47) | sub_B2D610 | "alwaysvectorize" check
hasAttribute(30) | sub_B2D610 | "noimplicitfloat" check

The 32-bit register width return is the critical difference from CPU targets. It means the standard VF formula (regWidth / elemSize) produces VF=1 for 32-bit types, VF=2 for 16-bit types, and VF=4 for 8-bit types. Wider vectorization (VF=4 for float) must come from the cost model determining that ld.v4.f32 is profitable despite the VF exceeding the "register width."

The scheduling info at TTI+56 (with issue width at offset +32 and latency at +36 within that sub-structure) feeds interleave count capping. This models the SM's instruction issue pipeline: even if register pressure allows IC=8, the issue pipeline may saturate at IC=4.

Knobs and Thresholds

Knob | Global Address | CLI Name | Default | Effect
ForceVectorization | qword_500D340[17] | vectorize-loops | true | Master switch for loop vectorization
EnableEarlyExitVectorization | byte_500CDA8 | -enable-early-exit-vectorization | false | Gates LLVM 20 early-exit loop vectorization
ForceOuterLoopVectorization | byte_500D208 | -force-vector-width-outer | false | Forces VF=4 for outer loops when TTI returns VF<=1
ForceCanReorderFP (selector) | dword_500D508 | -- | 0 | Whether FP reorder override is active
ForceCanReorderFP (value) | byte_500D588 | -- | -- | FP reorder override value
ForceScalarEpilogue (selector) | dword_500E308 | -- | 0 | Whether scalar epilogue is forced
ForceScalarEpilogue (value) | byte_500E388 | -- | -- | Scalar epilogue override value
VectorizerMinTripCount | dword_500EAE8 | vectorizer-min-trip-count | 16 (upstream) | Minimum trip count to attempt vectorization
CostThreshold | qword_500EA08 | -- | -- | Maximum cost for memory reorder safety check
EnableEpilogueVectorization | byte_500ED88 | -enable-epilogue-vectorization | true (upstream) | Enables vectorized epilogue loop
EpilogueVectorizationForceVF | qword_500ECA8 | -epilogue-vectorization-force-VF | 0 | Forces specific epilogue VF
AggressiveInterleave | byte_500D908 | -- | false | Bypasses IC heuristics, sets IC=max
PreferPredicateOverEpilogue | byte_500DAC8 | prefer-predicate-over-epilogue | -- | Uses predication instead of scalar epilogue
SmallLoopCost | qword_500DC88 | small-loop-cost | 20 (upstream) | Threshold below which loops get boosted IC
ForceTargetMaxScalarInterleave | dword_500E148 | force-target-max-scalar-interleave | 0 | Overrides max IC for scalar loops
ForceTargetMaxVectorInterleave | dword_500E068 | force-target-max-vector-interleave | 0 | Overrides max IC for vectorized loops

NVIDIA vs upstream defaults: The upstream vectorizer-min-trip-count default is 16. The upstream small-loop-cost default is 20. The upstream enable-epilogue-vectorization default is true. NVIDIA preserves these defaults from the knob registration code, but the TTI hooks (particularly getRegisterBitWidth returning 32 and getMaxInterleaveFactor being register-pressure-bounded) shift the effective behavior dramatically. Where a CPU target with AVX-512 might select VF=16 for float, NVPTX typically selects VF=2 or VF=4 -- just enough to use ld.v2/ld.v4 instructions without excessive register pressure.

Diagnostic Strings

All diagnostic strings are embedded in the binary with OptimizationRemarkAnalysis tags. Source: p2-E01-loop-vectorize.txt.

Tag | Message | Trigger
UncountableEarlyExitLoopsDisabled | "Auto-vectorization of loops with uncountable early exit is not enabled" | Early-exit loop + byte_500CDA8 knob off
LowTripCount | "The trip count is below the minial threshold value." | TC < dword_500EAE8 min threshold (note: "minial" is a typo [sic] in the NVIDIA binary)
NoImplicitFloat | "Can't vectorize when the NoImplicitFloat attribute is used" | Function attribute 30 check
UnsafeFP | "Potentially unsafe FP op prevents vectorization" | FP safety check failure
CantReorderFPOps | "loop not vectorized: cannot prove it is safe to reorder floating-point operations" | FP reorder proof failure
CantReorderMemOps | "loop not vectorized: cannot prove it is safe to reorder memory operations" | Memory reorder proof failure
VectorizationNotBeneficial | "the cost-model indicates that vectorization is not beneficial" | Cost model: VF=1 wins
InterleavingNotBeneficial | "the cost-model indicates that interleaving is not beneficial" | Cost model: IC=1 wins
InterleavingNotBeneficialAndDisabled | (appended: " and is explicitly disabled or interleave count is set to 1") | IC=1 + explicitly disabled
InterleavingBeneficialButDisabled | (tag only, no message body recovered) | IC>1 but user disabled interleaving
InterleavingAvoided | "Ignoring UserIC, because interleaving was avoided up front" | User-specified IC overridden
HistogramPreventsScalarInterleaving | "Unable to interleave without vectorization due to constraints on the order of histogram operations" | NVIDIA-specific: histogram loop + scalar IC
ScalableVFUnfeasible | "Scalable vectorization requested but not supported by the target" | Scalable VF on NVPTX
UncountableEarlyExitUnsupported | "Auto-vectorization of early exit loops requiring a scalar epilogue is unsupported" | Early-exit + epilogue
(success remark) | "interleaved loop (interleaved count: N)" | Vectorization/interleaving succeeded via sub_2AC2B40
(metadata) | "llvm.loop.vectorize.followup_all" | Post-vectorization loop metadata tag
(metadata) | "llvm.loop.vectorize.followup_epilogue" | Post-vectorization epilogue metadata tag
(block name) | "vec.epilog.middle.block" | Epilogue vectorization middle block
(block name) | "vec.epilog.vector.body" | Epilogue vectorization body block
(block name) | "scev.check" | Runtime SCEV overflow check block (sub_27C1C30)
(VPlan debug) | "Initial VPlan" | VPlan builder debug output at 0x2AEFC7B

Function Map

Function | Address | Size | Role
LoopVectorizePass::processLoop() | sub_2AF1970 | 88 KB | --
tryToBuildVPlanWithVPRecipes() | sub_2AEE460 | 56 KB | --
Planner::plan() -- generate VPlans for candidate VFs | sub_2AF13F0 | -- | --
selectBestVF() -- iterate VPlans, pick lowest cost | sub_2AE08E0 | -- | --
computeCostForVF() -- per-VF cost query | sub_2AE0750 | -- | --
isBetterThan() -- VF cost comparator | sub_2AB3FE0 | -- | --
executeVPlan() -- IR transformation from VPlan | sub_2AE3460 | -- | --
selectInterleaveCount() -- IC heuristic | sub_2AED330 | -- | --
selectEpilogueVectorizationFactor() | sub_2ABBD40 | -- | --
LoopVectorizationCostModel constructor (16 params) | sub_2AB2780 | -- | --
selectVectorizationFactor() -- outer loop path | sub_2AB8AC0 | -- | --
selectVectorizationFactor() -- hint/pre-check | sub_2AAEAB0 | -- | --
computeExpectedScalarCost() | sub_2AAD640 | -- | --
LoopVectorizationLegality::init() | sub_31A4FD0 | -- | --
canVectorize() -- pre-check | sub_31A91F0 | -- | --
canVectorize() -- full check | sub_31AF060 | -- | --
getBestPlanFor(VF) -- VPlan lookup | sub_2BF1320 | -- | --
cloneVPlan() | sub_2BF7CB0 | -- | --
mergeVPlans() -- main + epilogue merge | sub_2AB0350 | -- | --
buildInterleaveGroupRecipes() | sub_2C06CE0 | -- | --
VPlan cost annotation pass | sub_2C2E3C0 | -- | --
VPlan simplification / recipe combining | sub_2C32950 | -- | --
VPlan legality re-verification | sub_2C2A390 | -- | --
getSmallBestKnownTC() -- trip count upper bound | sub_2AA7EC0 | -- | --
tryToBuildRecipesForVF() -- per-VF body builder | sub_2AA9E60 | -- | --
finalizeRecipesForVF() -- scaling/widening | sub_2AD9850 | -- | --
TTI::getMaxInterleaveFactor() | sub_DFB120 | -- | --
TTI::getRegisterBitWidth(Vector) | sub_DFE640 | -- | --
TTI::supportsScalableVectors() | sub_DFE610 | -- | --
Emit vectorization success remarks | sub_2AC2B40 | -- | --
VPlan fixup/finalize | sub_ABDAE0 | -- | --
Interactions with Other Passes

  • Loop Strength Reduction -- LSR runs after vectorization and must handle the wider induction variables and address expressions that vectorization introduces. NVIDIA's custom LSR is occupancy-aware and interacts with the same register pressure model.
  • Register Allocation -- The register pressure that bounds VF and IC decisions is ultimately resolved by the register allocator. VF=4 with IC=2 may request 8x the base register count; the allocator must either accommodate this or spill to local memory.
  • Scheduling -- The TTI scheduling info (issue width and latency at TTI+56) that caps interleave count comes from the same target model used by instruction scheduling.
  • SelectionDAG -- Vectorized IR produces vector types (<4 x float>) that SelectionDAG must lower to PTX ld.v4/st.v4 instructions.
  • SLP Vectorizer -- SLP vectorization (sub_2BD1C50) handles straight-line code and horizontal reductions; loop vectorization handles loop bodies. Both share the same TTI cost model.

What Upstream LLVM Gets Wrong for GPU

Upstream LLVM's LoopVectorize pass was built for CPU SIMD: fill wider vector registers to process more data elements per instruction. On a GPU, every foundational assumption is inverted:

  • Upstream assumes SIMD lanes need filling. The CPU vectorizer exists to pack 4/8/16 scalar operations into one vector instruction (SSE/AVX/NEON). On GPU, there are no SIMD lanes in the CPU sense -- the SIMT model already executes 32 threads in lockstep per warp. "Vectorization" on GPU means widening per-thread memory accesses to ld.v2/ld.v4 for coalescing, not filling SIMD lanes.
  • Upstream computes VF from vector register width. The standard formula is VF = registerWidth / elementSize (e.g., AVX-512 gives VF=16 for float). NVPTX's getRegisterBitWidth() returns 32 bits -- a single scalar register width -- so this formula always produces VF=1 for 32-bit types. Wider VFs must come entirely from the cost model deciding that ld.v4.f32 is profitable, bypassing the standard VF selection path.
  • Upstream ignores register pressure when selecting VF. On CPU, VF=16 using 16 ZMM registers has no throughput penalty -- there is no occupancy concept. On GPU, VF=4 that quadruples live registers can cross an occupancy cliff, losing an entire warp group and halving net throughput. Every VF and IC decision must be bounded by register pressure impact on occupancy.
  • Upstream assumes scalable vectors are desirable. LLVM supports SVE/RISC-V V scalable vector types. NVPTX disables them entirely (supportsScalableVectors() = false) because PTX has no scalable vector model -- only fixed-width ld.v2/ld.v4 instructions exist.
  • Upstream's interleave count is bounded by CPU port pressure. CPU IC selection considers execution port contention and register file depth (e.g., 16 YMM registers). GPU IC selection is capped by the TTI scheduling model's issue width and latency at TTI+56, reflecting the SM's instruction issue pipeline saturation -- a completely different bottleneck.

Optimization Level Behavior

Level | Scheduled | Max VF | Interleave | Notes
O0 | Not run | N/A | N/A | No optimization passes
Ofcmax | Not run | N/A | N/A | Fast-compile skips vectorization entirely
Ofcmid | Not run | N/A | N/A | Vectorization not in medium fast-compile tier
O1 | Runs (Tier 1) | 4 | Enabled | Single instance after loop canonicalization
O2 | Runs (Tier 1) | 4 | Enabled | Same scheduling as O1; benefits from more aggressive scalar optimization preceding it
O3 | Runs (Tier 1) | 4 | Enabled | Same as O2; additional Tier 3 loop passes (interchange, distribution) may create more vectorization opportunities

Loop vectorization is a Tier 1 pass, meaning it runs at O1 and above but not in any fast-compile tier. The maximum VF is effectively capped at 4 by the GPU register pressure constraint -- higher VFs would multiply live registers past occupancy cliffs. The vectorize-loops master switch (qword_500D340[17]) defaults to on, as recorded in the knob table above; forcing vectorization when the cost model deems it unprofitable is instead done through the force-* override knobs, which default to off and are typically used only for debugging. Early-exit vectorization (byte_500CDA8) is gated separately and defaults to disabled. See Optimization Levels for the complete tier structure.

Differences from Upstream LLVM

Aspect | Upstream LLVM | CICC v13.0
Vectorization purpose | Fill SIMD lanes (SSE/AVX/NEON) for data parallelism | Memory coalescing (ld.v2/ld.v4), instruction count reduction, and register-to-memory width matching; no SIMD lanes on GPU
Scalable vectors | Supported (SVE, RISC-V V) | Always disabled -- sub_DFE610 returns false for NVPTX; only fixed-width VF=2/4
Register bit width (TTI) | Target-dependent (128/256/512 for x86) | Fixed 32 bits (TypeSize::getFixed(32)) reflecting PTX's 32-bit register model
VF selection cost model | SIMD-width-driven: higher VF fills wider vector registers | Occupancy-bounded: VF must not increase register pressure past warp occupancy cliffs; VF=4 is typically the maximum
Interleave count | Profile-guided or port-pressure-based (2--8 typical) | Capped by TTI scheduling info at TTI+56; conservative due to register pressure cost per interleaved iteration
Early-exit vectorization | Experimental (behind flag) | Present, gated by byte_500CDA8 (-enable-early-exit-vectorization)
Convergent call handling | Standard legality rejection | Additional barrier-aware legality: convergent intrinsics (__syncthreads, warp shuffles) block vectorization of the containing loop body

SLP Vectorizer

NVIDIA-modified pass. See Key Behavioral Differences from Upstream for GPU-specific changes.

The SLP (Superword-Level Parallelism) vectorizer packs independent scalar operations on adjacent data into vector operations. Unlike the loop vectorizer, SLP operates on straight-line code within a single basic block --- it does not require a loop. On NVPTX, the practical payoff is combining two or four scalar loads/stores into ld.v2/ld.v4 (or st.v2/st.v4), and folding arithmetic on adjacent elements into a single wider instruction. CICC runs the SLP vectorizer as part of the combined LoopVectorize / SLPVectorize pass group at step 31 of the O2 pipeline (sub_19B73C0), after SCCP/GlobalOpt and before the post-vectorization GVN cleanup. The pass is registered under the name slp-vectorizer (pipeline slot 350, llvm::SLPVectorizerPass).

Property | Value
Pass name | slp-vectorizer
Pipeline slot | 350 (llvm::SLPVectorizerPass)
Constructor registration | ctor_517 at 0x560FD0 (12,410 bytes)
Option constructor | ctor_248 at 0x4EEF30 (8,219 bytes)
Horizontal reduction entry | sub_2BD1C50 (~85 KB, ~3,005 decompiled lines)
Straight-line SLP entry | sub_2BCE070
Store-SLP entry | sub_2BCA110
SLP tree code cluster | 0x1BC0000--0x1BFFFFF (~1,353 KB across ~266 files)
Key diagnostic strings | "slp-vectorizer", "HorSLPNotBeneficial", "VectorizedHorizontalReduction", "const.rdx", "SLP vectorized with cost", "Cannot SLP vectorize list:", "Stores SLP vectorized with cost"

SLP vs Loop Vectorization on GPU

The loop vectorizer (see LoopVectorize & VPlan) transforms counted loops by widening the loop body to process multiple iterations per step, driven by VPlan. SLP vectorization is fundamentally different: it searches a single basic block for groups of isomorphic scalar instructions that operate on adjacent memory or independent data, then replaces them with a single vector instruction. No loop structure is required.

On a GPU, SLP opportunities arise in three main patterns:

  1. Adjacent memory operations. Two consecutive f32 loads from addresses p and p+4 become a single ld.v2.f32. Four consecutive i32 stores become st.v4.b32. This is the highest-value SLP transformation on NVPTX because coalesced memory transactions are critical for throughput.

  2. Same-typed arithmetic on independent operands. Two fadd instructions with no data dependency between them can become a single vector fadd on <2 x float>. The PTX backend later lowers this back to scalar instructions if the target has no native wide ALU, but the combined form enables better scheduling and may survive to the load/store vectorizer's benefit.

  3. Texture coordinate packing. Texture/surface sampling requires coordinate tuples (u, v) or (u, v, w). When the scalar coordinates are computed independently, SLP can pack them into a <2 x float> or <4 x float> bundle that feeds directly into the sampling intrinsic, avoiding per-element extract/insert overhead.

NVPTX TTI Hooks Affecting SLP

The SLP vectorizer consults TargetTransformInfo at several decision points. NVIDIA's proprietary TTI implementation differs significantly from the upstream open-source NVPTX backend.

Upstream Open-Source NVPTX TTI (for reference)

Hook | Return Value | Comment
getRegisterBitWidth(Vector) | 32 bits | "Only <2 x half> should be vectorized"
getMinVectorRegisterBitWidth() | 32 bits | Matches 32-bit register file
getNumberOfRegisters() | 1 (all classes) | FIXME in source: "this is conservative"
getArithmeticInstrCost(i64) | 2x base for ADD/MUL/XOR/OR/AND | Reflects 32-bit ALU emulation
supportsScalableVectors() | false | No SVE/RVV equivalent in PTX

With these returns, the standard LLVM VF formula (registerBitWidth / elementBitWidth) produces VF = 1 for f32 and VF = 2 for f16. The open-source backend effectively limits SLP to <2 x half> bundles only.

CICC v13.0 Proprietary TTI

CICC overrides the upstream returns at three levels: the TTI wrapper pass, the SLP tree's internal scheduling-width parameter, and several SLP-specific helper functions that query TTI indirectly.

TTI hooks queried by SLP (directly or via the cost model):

Hook | Address | Return / Behavior | SLP Impact
getRegisterBitWidth(Vector) | sub_DFE640 | TypeSize::getFixed(32) | Formal register width --- same as upstream. But see a2+840 override below.
getRegisterBitWidth(Scalar) | sub_DFB1B0 | 32 | Confirms 32-bit register file for scalar cost comparison.
supportsScalableVectors() | sub_DFE610 | false | Scalable VF never attempted.
getInstructionCost() | sub_20E14F0 (33KB) | Per-opcode latency from scheduling model | Called indirectly through getTreeCost() (sub_2B94A80) for each tree node.
getInstructionCost() (IR-level) | sub_B91420 | Per-instruction cost estimate | Called 7 times per instruction during per-node SLP cost evaluation.
hasAttribute(47) | sub_B2D610 | Checks alwaysvectorize | When set, SLP skips profitability check and vectorizes unconditionally.
hasAttribute(18) | sub_B2D610 | Checks optnone | When set, SLP is entirely disabled.

The a2+840 scheduling-width override:

The SLP tree object (BoUpSLP, parameter a2 in the horizontal reduction entry sub_2BD1C50) stores a max register pressure / scheduling width at offset +840. This value does NOT come from getRegisterBitWidth(Vector) directly. Instead, it is computed during SLP tree initialization from a combination of the target's scheduling model and available register budget. In the decompiled code, the VF derivation at lines 1354-1578 reads this value and clamps the resulting bit width to [128, 512]:

// VF derivation from a2+840 (decompiled sub_2BD1C50)
uint64_t max_sched_width = *(a2 + 840);       // NOT from TTI.getRegisterBitWidth()
uint64_t scalar_width = sub_2B49BC0(a2, first_scalar);  // getScalarTypeWidth()

uint64_t vf;
if (scalar_width <= max_sched_width) {
    vf = 1 << bsr(max_sched_width / scalar_width);  // round-down power-of-2
    vf = clamp(vf, 128, 512);                        // clamp to [128, 512] BITS
} else {
    vf = 128;
}
// For f32 (32 bits) with max_sched_width=256: vf = 256/32 = 8 elements
// For f64 (64 bits) with max_sched_width=256: vf = 256/64 = 4 elements

This is the single most important NVIDIA divergence from upstream for SLP: the 32-bit getRegisterBitWidth(Vector) return would produce VF=1 for f32 operations and kill SLP entirely for 32-bit types, but the a2+840 scheduling width allows VF=4 or VF=8 for f32. The result is that CICC's SLP can produce <4 x float> bundles (later lowered to ld.v4.f32 / st.v4.f32) that the open-source backend would never attempt.

SLP-specific TTI helper functions:

Function | Address | Upstream Equivalent | Behavior
getScalarTypeWidth() | sub_2B49BC0 | DL.getTypeSizeInBits() | Returns bit width of a scalar type for VF computation.
getNextLegalVF() | sub_2B1E190 | No direct equivalent | Steps down through legal vector factors when current VF is unprofitable. Takes (TTI, type, currentVF), returns next smaller legal VF >= minimum VF. Respects PTX v2/v4 legality constraints.
adjustVF() | sub_2B1FA70 | Partial in BoUpSLP::buildTree | When SLPMaxVF (qword_500F628) is non-zero and operand_count+1 is a power of 2, returns operand_count directly (non-power-of-2 VF). Otherwise computes a power-of-2 VF.
isTreeNotBeneficialForArch() | sub_2B2DA40 | Not in upstream | NVIDIA-specific early rejection based on SM reduction type (a1+1576). Rejects trees whose structure is known to be unprofitable on the current GPU architecture.

Arithmetic Cost Impact on SLP Trees

The TTI cost model for i64 operations directly affects SLP profitability. Since NVPTX GPUs emulate all 64-bit integer arithmetic through pairs of 32-bit operations, the cost differential inflates the scalar cost baseline, making i64 SLP trees more profitable in relative terms:

Operation | i32 Scalar Cost | i64 Scalar Cost | i64 Vector Cost (v2) | SLP Delta
ADD/SUB | 1 | 2 (add.cc + addc) | 4 (two add.cc + addc pairs) | Neutral (2x scalar = 2x vector)
MUL | 1 | ~4 (mul.lo + mul.hi + add chain) | ~8 | Neutral
Loads | 1 | 1 (ld.b64) | 1 (ld.v2.b64) | Profitable --- single wide load
Stores | 1 | 1 (st.b64) | 1 (st.v2.b64) | Profitable --- single wide store

The asymmetry is clear: SLP profit on NVPTX comes almost entirely from memory coalescing (loads and stores), not from arithmetic. The arithmetic cost for a v2 bundle is roughly 2x the scalar cost for all types, providing no ALU benefit. But a ld.v2.f32 replaces two separate load instructions with one, reducing instruction count and improving coalescing. This is why Store-SLP (sub_2BCA110) and the load/store adjacency heuristics dominate profitable SLP on GPU.

Maximum Vector Width on NVPTX

PTX supports vector types up to .v4 for most data types, but the actual hardware constraint is tighter:

  • v2: Supported for all types (.b8 through .b64, .f16, .f32, .f64). This is the sweet spot for SLP.
  • v4: Supported for .b8, .b16, .b32, .f16, .f32. NOT supported for .b64/.f64.
  • v8/v16: Not supported in PTX at all. CPU-style AVX-width vectorization is never legal.

The SLP vectorizer's VF selection logic at sub_2BD1C50 lines 1354--1578 computes:

// VF selection pseudocode (from decompiled sub_2BD1C50)
uint64_t max_sched_width = *(a2 + 840);  // scheduling width, NOT TTI.getRegisterBitWidth()
uint64_t scalar_width = getScalarTypeWidth(a2, first_scalar);

uint64_t vf;
if (scalar_width <= max_sched_width) {
    vf = 1 << bsr(max_sched_width / scalar_width);  // round-down power-of-2
    vf = clamp(vf, 128, 512);                        // clamp to [128, 512] bits
} else {
    vf = 128;
}

For f32 (32 bits) with a max scheduling width of 256 bits, this yields VF = 8 elements. However, PTX legalization later splits anything wider than v4 into multiple instructions, so the effective maximum is v4 for 32-bit types and v2 for 64-bit types. The SLP cost model accounts for this split cost.

GPU-Specific Vectorization Constraints

The NVPTX target has exactly ONE vector register class --- Int32HalfRegs (.b32, prefix %hh) --- which holds 32 bits of packed data. The only legal vector types at the SelectionDAG level are:

Type | Packing | Register | Legal Since
v2f16 | Two f16 in 32 bits | %hh | SM 53+
v2bf16 | Two bf16 in 32 bits | %hh | SM 80+
v2i16 | Two i16 in 32 bits | %hh | SM 53+
v4i8 | Four i8 in 32 bits | %hh | SM 70+

Every other vector type is illegal and must be split or scalarized during type legalization (sub_2029C10 / sub_202E5A0). This includes <2 x float>, <4 x float>, <2 x i32>, and <2 x double> --- the very types SLP produces for 32-bit and 64-bit operations.

How SLP Vectors Survive to PTX

SLP-produced vector types such as <4 x float> are not killed by type legalization. Instead, the path is:

  1. SLP vectorizer (IR level) produces <4 x float> loads, stores, and arithmetic in LLVM IR.
  2. SelectionDAG type legalization splits <4 x float> into four scalar f32 values for arithmetic operations. However, load and store nodes are intercepted by NVPTX's custom lowering (NVPTXTargetLowering::LowerOperation) which converts them to target-specific NVPTX::LD_v4_f32 / NVPTX::ST_v4_f32 pseudo-instructions.
  3. Instruction selection maps these pseudo-instructions to PTX ld.v4.f32 / st.v4.f32.
  4. Arithmetic on the vector elements becomes four independent scalar instructions, which the scheduler can interleave with memory operations.

The net effect: SLP's primary benefit on NVPTX is vectorized memory access, while vectorized arithmetic is a wash. The cost model at sub_2B94A80 (getTreeCost) accounts for this by assigning low cost to vector loads/stores and high scalarization overhead to vector arithmetic.

PTX Vector Width Ceiling

PTX .v2 and .v4 load/store support imposes hard ceilings:

Element Type | Max .vN | Max Bits | SLP VF Ceiling
.b8 / .u8 | .v4 | 32 | 4
.b16 / .f16 | .v4 | 64 | 4
.b32 / .f32 | .v4 | 128 | 4
.b64 / .f64 | .v2 | 128 | 2
.b128 | .v1 only | 128 | 1 (no vectorization)

When the SLP VF exceeds the PTX ceiling (e.g., VF=8 for f32 from the [128,512] bit-width clamping), the backend splits the single wide operation into multiple legal operations. The SLP cost model at sub_2B889C0 factors this split cost into the tree evaluation, ensuring that overly wide VFs are rejected if the split overhead eliminates the coalescing benefit.

Algorithm Overview

CICC's SLP vectorizer has three entry points that collectively implement the upstream BoUpSLP / SLPVectorizerPass:

Straight-Line SLP (sub_2BCE070)

Scans each basic block for groups of isomorphic instructions (same opcode, adjacent or compatible operands). Builds a bottom-up SLP tree using sub_2BAACB0 (buildTree), evaluates cost via sub_2B94A80 (getTreeCost), and emits vector code via sub_2BC6BE0 (vectorizeTree) when profitable. Diagnostic: "SLP vectorized with cost N" on success, "Cannot SLP vectorize list:" on failure.

Store-SLP (sub_2BCA110)

Seeds the SLP tree from consecutive stores to adjacent memory addresses. This is the primary entry point for memory coalescing. Diagnostic: "Stores SLP vectorized with cost N".

Horizontal Reduction SLP (sub_2BD1C50)

The most complex path. Handles horizontal reductions (e.g., summing all elements of a vector). Proceeds in six phases:

Phase 0 -- Scalar chain scan. Reads the reduction operand array at a1+304 (pointer) and a1+312 (count). Each bundle entry is 64 bytes. Classifies operands by opcode: values <= 0x1C are simple scalars (add/sub/mul/etc.), values > 0x1C are complex (fcmp, icmp variants). Calls sub_2B0D8B0 (isReductionOp) to validate each operation as a legal reduction (add, fadd, mul, fmul, and, or, xor, smin/smax/umin/umax, fmin/fmax).

Phase 1 -- Hash table construction. Builds two open-addressing hash tables. The "AllOps" table uses 32-byte entries with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth/compaction thresholds.

Phase 2 -- Bundle pair extraction. Calls sub_2B5F980 per bundle to classify reduction opcode pairs. When two consecutive bundles both contain fadd reductions (opcode 90), NVIDIA attempts a paired fadd bundle merge via sub_2B3C030/sub_2B25EA0/sub_2B38BA0. This is an NVIDIA-specific optimization for warp-level fadd reductions not present in upstream LLVM.

Phase 3 -- Main vectorization loop. For each bundle, builds candidate operand lists, selects a VF, and tries vectorization with progressively smaller VFs on failure. The VF trial loop uses memoization (sub_2B3C060) to avoid re-trying the same (offset, VF) pair. Key substeps: canVectorize (legality), buildTree, isTreeTinyAndNotFullyVectorizable / isTreeNotBeneficialForArch (early rejection), scheduleBlock, getTreeCost + getReductionCost (profitability).

Phase 4 -- Final reduction codegen. Produces the final horizontal reduction instruction via sub_2B21C80 (createFinalReduction), chaining multiple entries with sub_2B34820 when multiple sub-trees were vectorized.

Phase 5 -- Multi-tree scheduling and cleanup. Builds a multi-tree reduction schedule, iteratively calling sub_2B2F4A0 (reduceTreeLevel) until a single root value remains, then replaceAllUsesWith + eraseFromParent.

Paired fadd Bundle Merging (NVIDIA-Specific)

This optimization is absent from upstream LLVM and targets warp-level floating-point reduction patterns common in CUDA kernels (e.g., block-level sum reductions, dot products, softmax denominators). When two consecutive reduction bundles both contain fadd operations, CICC attempts to merge them into a single wider bundle, doubling the effective vectorization width for the reduction.

Trigger Condition

During Phase 2 of the horizontal reduction path (sub_2BD1C50, lines 921-1098), sub_2B5F980 (classifyReductionPair) is called per bundle and returns a pair of reduction opcodes (reductionOpcodeA, reductionOpcodeB). The merge path activates when:

  1. Both opcodes in the current bundle equal 90 (0x5A), which is the internal opcode for fadd reduction.
  2. The next consecutive bundle also has both opcodes equal to 90.
  3. The two bundles are adjacent in the reduction operand array (no intervening non-fadd bundles).
// Trigger check (decompiled from Phase 2, sub_2BD1C50)
if (v83 == v84 && v83 == 90) {        // both opcodes in bundle[i] are fadd
    if (v83_next == v84_next && v83_next == 90) {  // bundle[i+1] also all-fadd
        // Try paired merge
        sub_2B3C030(bundle_i, bundle_i_plus_1, ...);  // tryMergeFaddBundles
    }
}

Three-Function Pipeline

The merge proceeds through three functions in sequence:

Step | Function | Address | Role
1. Try | tryMergeFaddBundles() | sub_2B3C030 | Checks whether the two bundles' operand lists can be concatenated without violating data dependencies. Verifies that no operand in bundle B depends on the result of bundle A (or vice versa). Returns a candidate merged-bundle descriptor or null on failure.
2. Validate | validateMergedBundle() | sub_2B25EA0 | Confirms that the merged bundle satisfies SLP legality: all operands are isomorphic (same opcode), the combined operand count does not exceed SLPMaxVF limits, and the merged bundle's scheduling pressure stays within a2+840. Also checks that external uses of intermediate reduction values are compatible with the wider bundle.
3. Rewrite | rewriteMergedBundle() | sub_2B38BA0 | Physically merges the two bundle entries in the reduction operand array. The combined bundle gets double the operand count, and the second bundle slot is marked as consumed (skipped in Phase 3). Updates the AllOps hash table entries to point to the new merged bundle.

Why This Matters on GPU

Consider a warp-level sum reduction of 64 f32 values, structured as two consecutive 32-element fadd reduction trees. Without merging, the SLP vectorizer processes each 32-element tree independently, producing two separate vectorized reduction chains. With merging, the combined 64-element tree exposes a wider VF window, allowing the vectorizer to produce wider v4 bundles and reduce the total number of reduction shuffle steps.

The merged bundle also benefits the final reduction codegen (sub_2B21C80, createFinalReduction): instead of producing two separate reduction results and combining them with a scalar fadd, the merged tree produces a single reduction result directly.

Commutativity Classification

The SM reduction type at a1+1576 drives commutativity via bitmask 0x10804:

bool is_commutative;
if (reduction_type <= 0x10) {
    is_commutative = !((1 << reduction_type) & 0x10804);
    // Non-commutative types: 2, 11, 16 (the bits set in 0x10804;
    // likely fsub and cmp variants)
} else {
    is_commutative = true;
}

SLP and the Load/Store Vectorizer

CICC runs two distinct passes that vectorize memory operations, and their scopes partially overlap:

 | SLP Vectorizer | OldLoadStoreVectorizerPass
Pass name | slp-vectorizer | old-load-store-vectorizer
Scope | Isomorphic ops in a BB | Adjacent loads/stores only
Seed | Any instruction group | Store/load chains
Handles arithmetic | Yes | No
Handles reductions | Yes (horizontal) | No
Pipeline position | Step 31 (with LoopVectorize) | Post-optimization (NVIDIA-specific)
Disable flag | vectorize-slp | disable-nvptx-load-store-vectorizer

The NVIDIA-proprietary old-load-store-vectorizer (llvm::OldLoadStoreVectorizerPass) is a separate pass distinct from LLVM's LoadStoreVectorizerPass. It runs later in the pipeline and handles NVVM-specific intrinsic vectorization (nvvm_load/nvvm_ld, nvvm_store/nvvm_st) via the vect-intrinsics knob. SLP may vectorize the same load/store chains if they also contain arithmetic; the load/store vectorizer catches whatever SLP missed.

Register Pressure Impact

SLP vectorization increases register pressure because vector values occupy wider registers. On NVPTX, a <2 x float> consumes two 32-bit registers (PTX has no native 64-bit float register file for packed types --- the backend lowers <2 x f32> to a pair of .f32 registers). The benefit comes from reduced instruction count and improved memory coalescing, not from register savings.

The SLP cost model accounts for register pressure through a2+840 (max scheduling width), and the profitability check rejects vectorization when the combined cost (tree cost + reduction cost) exceeds the threshold. When register pressure is already high, the TTI cost model inflates the scalarization overhead, making SLP less likely to fire.

SLP Cost Model and TTI Callouts

The SLP profitability decision is the product of two cost functions that both delegate to TTI: getTreeCost() (sub_2B94A80, 71KB) and getReductionCost() (sub_2B28940). Understanding exactly how these call into TTI is essential for predicting when SLP will fire on a given kernel.

getTreeCost() (sub_2B94A80)

This 71KB function walks every node in the SLP tree and accumulates the cost difference between the vectorized form and the original scalar form. For each tree node, it:

  1. Calls sub_2B889C0 (45KB, the inner cost computation) which dispatches to TTI via sub_B91420 (TTI::getInstructionCost() at the IR level) --- called approximately 7 times per instruction to query costs for the scalar original, the vector alternative, and scalarization overhead (insert/extract elements).
  2. For load/store nodes, queries the memory cost model which returns favorable costs for adjacent accesses (reflecting ld.v2 / ld.v4 coalescing benefit) and high costs for gather/scatter patterns.
  3. For shuffle nodes (operand reordering), queries TTI::getShuffleCost() which on NVPTX returns high cost for any non-identity shuffle --- GPU has no native shuffle-within-register instruction for packed 32-bit values.
  4. Returns a pair: (vectorCost : i64, isExact : i32). When isExact == 1, the cost is a precise measurement from the scheduling model; the profitability check accepts it unconditionally regardless of the threshold.

getReductionCost() (sub_2B28940)

Called with the TTI pointer (a4) as the second parameter, this function computes the cost of the horizontal reduction itself --- the shuffle-and-reduce tree that turns a vector into a scalar. Parameters:

sub_2B28940(
    a1,     // HorizontalReduction object
    a4,     // TargetTransformInfo*
    v478,   // operand window start
    v479,   // operand window end
    v432,   // hasExternalUses flag
    v433,   // common opcode mask from Phase 1
    a2      // BoUpSLP tree
)
// Returns: (reductionCost : i64, costKind : i32)

The reduction cost on NVPTX is typically high because the GPU has no native horizontal reduction instruction for arbitrary vector widths. A <4 x float> fadd reduction requires 2 shuffle-and-add steps (log2(4) = 2), each involving an extractelement and a scalar fadd. The TTI cost model at sub_20E14F0 (33KB) provides the per-step latency from the scheduling model.
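The log2-step structure can be sketched directly; the Python model below captures only the shape the cost model is pricing (each step stands in for one shuffle/extractelement plus a scalar fadd), not the actual PTX the backend emits:

```python
def horizontal_reduce(vals):
    """Halve the vector each step, adding the two halves: log2(n) steps."""
    steps = 0
    while len(vals) > 1:
        half = len(vals) // 2
        vals = [vals[i] + vals[i + half] for i in range(half)]
        steps += 1
    return vals[0], steps

print(horizontal_reduce([1.0, 2.0, 3.0, 4.0]))  # (10.0, 2): 2 steps for VF=4
```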

Combined Profitability Decision

// Profitability check (decompiled from sub_2BD1C50, lines 2062-2163)
int64_t treeCost   = sub_2B94A80(tree, ...);   // vector tree cost
int64_t reducCost  = sub_2B28940(rd, TTI, ...); // reduction overhead
int64_t combined   = treeCost + reducCost;      // overflow-checked via __OFADD__

int64_t threshold  = -(int64_t)qword_5010428;  // SLPCostThreshold, default 0

if (costKind == 1) {
    // Exact cost from scheduling model: always accept
    goto vectorize;
}
if (combined > threshold) {
    // Not profitable: emit "HorSLPNotBeneficial" diagnostic
    // Try smaller VFs via getNextLegalVF() loop
    goto try_smaller_vf;
}
// Profitable: proceed to vectorizeTree()

The costKind == 1 fast path is notable: when the cost model can determine the exact scheduling benefit (rather than a heuristic estimate), it bypasses the threshold entirely. This typically fires for small, fully-analyzable SLP trees where every instruction's latency is known from the TTI scheduling tables at TTI+56.

VF Stepping on Failure

When vectorization at the current VF is unprofitable, the horizontal reduction path does not immediately give up. Instead, it calls sub_2B1E190 (getNextLegalVF) to step down to the next smaller legal VF, then re-tries the entire build-tree / get-cost cycle:

// VF step-down loop (decompiled from sub_2BD1C50, lines 2097-2163)
while (currentVF > minVF) {
    currentVF = sub_2B1E190(TTI, elementType, currentVF);
    if (sub_2B3C060(&memoSet, {offset, currentVF}))  // alreadyTried?
        continue;
    // Re-try vectorization at new VF
    sub_2BAACB0(tree, ops, currentVF, ...);  // buildTree
    treeCost  = sub_2B94A80(tree, ...);       // getTreeCost
    reducCost = sub_2B28940(rd, TTI, ...);    // getReductionCost
    combined  = treeCost + reducCost;
    if (combined <= threshold)
        goto vectorize;
}
// All VFs exhausted: emit "HorSLPNotBeneficial"

The memoization set (sub_2B3C060) prevents re-evaluating the same (offset, VF) pair, which is essential because the VF step-down loop can iterate many times for large operand counts.
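In outline, the step-down loop with memoization behaves like the Python sketch below. The halving in next_legal_vf and the single cost callback are illustrative assumptions; the real getNextLegalVF consults TTI, and the real loop re-runs buildTree/getTreeCost/getReductionCost at each VF:

```python
def next_legal_vf(vf):
    return vf // 2  # assumed: step to the next smaller power-of-two VF

def try_vectorize(offset, start_vf, min_vf, cost_at, threshold, memo):
    vf = start_vf
    while vf > min_vf:
        vf = next_legal_vf(vf)
        if (offset, vf) in memo:      # alreadyTried (sub_2B3C060)
            continue
        memo.add((offset, vf))
        if cost_at(vf) <= threshold:  # stands in for tree + reduction cost
            return vf                 # profitable: vectorize at this VF
    return None                       # exhausted: "HorSLPNotBeneficial"

# toy cost model that is only profitable at VF=4
print(try_vectorize(0, 16, 2, lambda v: 0 if v == 4 else 5, 0, set()))  # 4
```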

Configuration Knobs

Upstream LLVM Knobs (present in CICC)

| Knob | Type | LLVM Default | CICC Default | Effect |
|---|---|---|---|---|
| slp-threshold | int | 0 | 0 | Profitability threshold. Vectorize when cost <= -threshold. Default 0 means any non-positive cost is profitable. |
| slp-vectorize-hor | bool | true | true | Enable horizontal reduction vectorization. |
| slp-vectorize-hor-store | bool | false | false | Seed horizontal reduction from stores. |
| slp-max-reg-size | int | 128 | 128 | Maximum vector register size in bits for SLP scheduling. |
| slp-min-reg-size | int | 128 | 128 | Minimum vector register size. |
| slp-schedule-budget | int | 100000 | 100000 | Maximum scheduling region size per block. |
| slp-recursion-max-depth | int | 12 | 12 | Maximum recursion depth for tree building. |
| slp-min-tree-size | int | 3 | 3 | Minimum tree size for full vectorization. |
| vectorize-slp | bool | true | true | Master switch for the SLP pass. |
| view-slp-tree | bool | false | false | Display SLP trees with Graphviz (debug). |
| slp-max-vf | int | 0 | 0 | Maximum vector factor override (0 = unlimited). |

NVIDIA-Specific Globals

| Global | Address | Default | Effect |
|---|---|---|---|
| SLPMaxVF | qword_500F628 | 0 | When zero: minimum VF = 4 elements. When non-zero: minimum VF = 3, and the value caps the maximum VF. Also bypasses the power-of-2 VF requirement. |
| SLPCostThreshold | qword_5010428 | 0 | Cost threshold for horizontal reduction profitability. Test is cost > -(int)threshold. Default 0: any non-positive cost is profitable. |
| Straight-line max VF | qword_500FEE8 | unknown | Maximum VF override for straight-line SLP (sub_2BCE070), separate from horizontal reduction. |

Key Behavioral Differences from Upstream

  1. Minimum VF default. When SLPMaxVF is zero (default), CICC requires at least 4 scalar operands to attempt horizontal reduction vectorization. Upstream LLVM has no such global minimum; it relies on slp-min-tree-size (default 3) instead.

  2. VF clamping. CICC clamps VF to [128, 512] bits based on the a2+840 scheduling width, then steps down via getNextLegalVF() (sub_2B1E190). Upstream computes VF from TTI::getMaximumVF() or slp-max-reg-size without the explicit bit-width clamping. The [128, 512] range allows VF=4 through VF=16 for f32 types, whereas upstream NVPTX (32-bit register width) would produce VF=1.

  3. Paired fadd merging. CICC merges consecutive fadd reduction bundles into wider bundles via sub_2B3C030 / sub_2B25EA0 / sub_2B38BA0. This is absent from upstream and is targeted at GPU warp-level reduction patterns. See the dedicated section above.

  4. Scheduling-width-driven VF (not register-width-driven). The upstream SLP vectorizer derives VF from TTI::getRegisterBitWidth(Vector). CICC stores a separate scheduling width at a2+840 that reflects available register budget after accounting for live-in pressure. This decouples SLP VF from the register file width, allowing profitable vectorization even though getRegisterBitWidth(Vector) returns 32.

  5. isTreeNotBeneficialForArch(). CICC adds a GPU-architecture-specific early rejection filter (sub_2B2DA40) that takes the SM reduction type as a parameter. This rejects tree shapes known to be unprofitable on the target SM variant (e.g., trees that would produce reduction patterns not supported by the SM's warp-level primitives).

  6. O-level gating. SLP vectorization is gated by tier != 1 in the pipeline assembler: it is disabled at O1 and enabled at O2 and O3. At O2/O3, the LoopVectorize/SLP parameter width is set to tier (2 at O2, 3 at O3), affecting the scheduling width multiplier. SM-architecture-dependent thresholds are resolved at runtime via the a2+840 value.

  7. Non-power-of-2 VF support. When SLPMaxVF (qword_500F628) is non-zero and operand_count + 1 is a power of 2, adjustVF() (sub_2B1FA70) returns operand_count directly, enabling odd VFs such as 3, 7, and 15. Upstream LLVM requires power-of-2 VFs except in specific recent patches for fixed-length non-power-of-2 vectorization.
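Two of these differences are mechanical enough to model directly. The Python sketch below shows the VF window implied by the [128, 512]-bit clamp and the SLPMaxVF odd-VF rule; the power-of-two fallback in adjust_vf is an assumption for illustration:

```python
def vf_bounds(elem_bits, lo_bits=128, hi_bits=512):
    """VF window implied by CICC's [128, 512]-bit width clamp."""
    return lo_bits // elem_bits, hi_bits // elem_bits

def is_pow2(x):
    return x > 0 and (x & (x - 1)) == 0

def adjust_vf(operand_count, slp_max_vf):
    """Sketch of the adjustVF() rule: with SLPMaxVF set, an operand count
    one less than a power of two is used directly as the VF."""
    if slp_max_vf != 0 and is_pow2(operand_count + 1):
        return operand_count              # odd VFs: 3, 7, 15, ...
    vf = 1                                # assumed fallback: round down to pow2
    while vf * 2 <= operand_count:
        vf *= 2
    return vf

print(vf_bounds(32))               # (4, 16): VF 4 through VF 16 for f32
print(adjust_vf(7, slp_max_vf=8))  # 7
print(adjust_vf(7, slp_max_vf=0))  # 4
```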

Diagnostic Strings

| String | Function | Meaning |
|---|---|---|
| "SLP vectorized with cost N" | sub_2BCE070 | Straight-line SLP succeeded |
| "Cannot SLP vectorize list:" | sub_2BCE070 | Straight-line SLP failed legality/cost |
| "Stores SLP vectorized with cost N" | sub_2BCA110 | Store-seeded SLP succeeded |
| "HorSLPNotBeneficial" | sub_2BD1C50 | Horizontal reduction not profitable |
| "Vectorizing horizontal reduction is possible but not beneficial with cost C and threshold T" | sub_2BD1C50 | Full rejection diagnostic with cost details |
| "VectorizedHorizontalReduction" / "Vectorized horizontal reduction with cost C and with tree size N" | sub_2BD1C50 | Horizontal reduction succeeded |
| "const.rdx" | sub_2B21B90 | Intermediate reduction variable name |
| "rdx.shuf.l", "rdx.shuf.r" | (cluster 0x1BDDB00) | Left/right reduction shuffle names |
| "op.rdx", "op.extra" | (cluster 0x1BDDB00) | Reduction operation and extra operation names |

Function Map

| Function | Address | Size |
|---|---|---|
| HorizontalReduction::tryToReduce() -- main horizontal reduction entry | sub_2BD1C50 | 85 KB |
| Straight-line SLP vectorizer entry | sub_2BCE070 | -- |
| Store-SLP vectorizer entry | sub_2BCA110 | -- |
| BoUpSLP::buildTree() | sub_2BAACB0 | -- |
| BoUpSLP::getTreeCost() | sub_2B94A80 | 71 KB |
| BoUpSLP::vectorizeTree() (codegen) | sub_2BC6BE0 | 71 KB |
| BoUpSLP::computeScheduleData() | sub_2BBDBE0 | 40 KB |
| BoUpSLP::scheduleBlock() | sub_2BBFB60 | 71 KB |
| BoUpSLP::optimizeGatherSequence() | sub_2BB3590 | -- |
| BoUpSLP::reorderInputsIfNecessary() | sub_2BB0460 | -- |
| BoUpSLP::buildExternalUses() | sub_2B4F3D0 | -- |
| getReductionCost() | sub_2B28940 | -- |
| createFinalReduction() | sub_2B21C80 | -- |
| createReductionOp() ("const.rdx") | sub_2B21B90 | -- |
| buildReductionResult() | sub_2B2FE10 | -- |
| reduceTreeLevel() | sub_2B2F4A0 | -- |
| isReductionOp() | sub_2B0D8B0 | -- |
| isHomogeneous() (all ops satisfy predicate) | sub_2B0D880 | -- |
| canVectorize() (legality check) | sub_2B4B450 | -- |
| isTreeTinyAndNotFullyVectorizable() | sub_2B2DB00 | -- |
| isTreeNotBeneficialForArch() | sub_2B2DA40 | -- |
| adjustVF() (vectorization factor selection) | sub_2B1FA70 | -- |
| getNextLegalVF() | sub_2B1E190 | -- |
| getScalarTypeWidth() | sub_2B49BC0 | -- |
| hasVectorizableReductions() | sub_2B6E610 | -- |
| tryMergeFaddBundles() (NVIDIA-specific) | sub_2B3C030 | -- |
| validateMergedBundle() (NVIDIA-specific) | sub_2B25EA0 | -- |
| rewriteMergedBundle() (NVIDIA-specific) | sub_2B38BA0 | -- |
| perBundleVectorize() | sub_2B77B90 | -- |
| emitVectorizedReductionDiagnostic() | sub_2B44ED0 | -- |
| reorderForCanonical() | sub_2B33D00 | -- |
| SLP tree scheduling | sub_2BD7F70 | 46 KB |
| SLP tree cost computation | sub_2B889C0 | 45 KB |
| SLP value rewriting (scalar-to-vector) | sub_2BCFB90 | 44 KB |
| SLP node creation (tree construction) | sub_2BCAEC0 | 42 KB |
| deleteTree() (cleanup on failure) | sub_2B5C350 | -- |
| alreadyTried() (VF memoization) | sub_2B3C060 | -- |
| tryNextVF() (advance or fail) | sub_2B399C0 | -- |
| classifyReductionPair() (per-bundle opcode pair extraction) | sub_2B5F980 | -- |
| hasExternalUses() (external use check for bundles) | sub_2B27F10 | -- |
| getTargetInfo() (TTI accessor) | sub_BD5C60 | -- |
| initDominatorContext() | sub_D5F1F0 | -- |
| hashOperandSlice() (operand slice hash for scheduling cache) | sub_27B0000 | -- |
| Extended opcode classifier (opcodes > 0x1C) | sub_2B15E10 | -- |
| buildOperandOrder() (commutative reorder table) | sub_2B3D4E0 | -- |
| isInScheduledSet() (scheduling membership test) | sub_2B3D560 | -- |
| Reduction use counter (per-operand) | sub_2B54920 | -- |
| TTI::getRegisterBitWidth(Vector) -- returns 32 | sub_DFE640 | -- |
| TTI::supportsScalableVectors() -- returns false | sub_DFE610 | -- |
| TTI::getRegisterBitWidth(Scalar) -- returns 32 | sub_DFB1B0 | -- |
| TTI::getInstructionCost() (scheduling cost model) | sub_20E14F0 | 33 KB |
| TTI::getInstructionCost() (IR-level variant) | sub_B91420 | -- |
| TTI::hasAttribute(N) (function attribute query) | sub_B2D610 | -- |

Data Structure: HorizontalReduction Object

| Offset | Type | Field |
|---|---|---|
| +0 | ReductionBundle* | Array of reduction bundle structs |
| +8 | u32 | Bundle count |
| +304 | Value** | Pointer to operand arrays (each bundle = 64 bytes) |
| +312 | u32 | Operand array count |
| +384 | void* | Auxiliary dependency table |
| +392 | void* | useDef map (bit 0 = inline/external flag) |
| +400 | void* | useDef map pointer |
| +408 | u32 | useDef map capacity |
| +1568 | Value* | Root function / reduction entry value |
| +1576 | u32 | SM reduction type (arch-specific opcode) |
| +1580 | u8 | Commutative flag |
| +1584 | char* | Output result array |
| +1592 | u32 | Output result count |
| +1596 | u32 | Output result capacity |
| +1600 | char[16] | Inline result storage |

Cross-References

Loop Strength Reduction (NVIDIA Custom LSR)

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp (LLVM 20.0.0).

NVIDIA ships two entirely separate LSR implementations inside cicc v13.0. The first is upstream LLVM's LoopStrengthReducePass (approximately 200 helpers across 0x284F650--0x287C150, compiled from llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp). The second is a custom 160KB formula solver (sub_19A87A0, 2688 decompiled lines) sitting at 0x199A--0x19BF, wrapped by NVLoopStrengthReduce at sub_19CE990. Both are invoked through the "loop-reduce" pass name in the LLVM new pass manager pipeline, but NVIDIA's overlay replaces the formula generation and selection phases with GPU-aware logic while reusing LLVM's SCEV infrastructure, IV rewriting, and chain construction.

This page documents the NVIDIA overlay -- the most GPU-specific LLVM pass in cicc. If you are reimplementing cicc's optimizer, this is the pass you cannot skip.

Why NVIDIA Rebuilt LSR

The root motivation is a single equation that does not exist on CPUs: register count determines occupancy, and occupancy determines performance. On a GPU, each additional register per thread can cross a discrete occupancy cliff, dropping warp-level parallelism by an entire warp group -- see the GPU Execution Model for the register budget and cliff table.

On a CPU, LSR's primary concern is minimizing the number of live induction variables to reduce register pressure, with a secondary goal of producing address expressions that fold into hardware addressing modes. The cost model compares formulae by counting registers, base additions, immediate encoding costs, and setup instructions. This works because a CPU's register file is fixed (16 GPRs on x86-64) and the cost of spilling to cache is relatively uniform.

On an NVIDIA GPU, four properties break this model:

  1. Discrete occupancy cliffs. A formula that saves one instruction but adds one register might push the kernel past a cliff and lose 50% throughput. The cliff boundaries and their impact are documented in the occupancy cliff table.

  2. No equivalent of L1 spill cost. When a GPU "spills," values go to local memory (DRAM, 200-800 cycles), which is orders of magnitude slower than CPU L1 cache.

  3. Address space semantics. GPU memory is partitioned into address spaces with different widths and hardware addressing modes. Shared memory (addrspace(3)) uses 32-bit pointers with specialized .shared:: load/store instructions. Generic pointers are 64-bit. Strength-reducing a 32-bit shared-memory pointer can produce 64-bit intermediate values that force truncation, defeating the optimization.

  4. Typed registers. PTX uses typed virtual registers (%r for 32-bit, %rd for 64-bit, %f for float). A 64-bit induction variable costs two 32-bit register slots. On older architectures (sm_3x through sm_5x), 64-bit integer operations are emulated and expensive; on sm_70+, native 64-bit addressing makes them acceptable.

LLVM's stock cost model knows none of this. It calls TTI::isLSRCostLess which compares an 8-field cost tuple ({Insns, NumRegs, AddRecCost, NumIVMuls, NumBaseAdds, ImmCost, SetupCost, ScaleCost}), but the NVPTX TTI implementation cannot express occupancy cliffs, address space constraints, or the sign-extension register savings that matter on GPU. NVIDIA's solution: replace the formula solver entirely, with 11 knobs for fine-grained control.
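For reference, upstream's comparison reduces to ordering that 8-field tuple. The Python stand-in below is a deliberate simplification of TTI::isLSRCostLess (the real hook can weigh fields per target rather than comparing lexicographically); it is included to show what the tuple cannot express:

```python
from collections import namedtuple

# The 8 fields named in the LSRCost tuple above.
LSRCost = namedtuple(
    "LSRCost",
    "Insns NumRegs AddRecCost NumIVMuls NumBaseAdds ImmCost SetupCost ScaleCost",
)

def is_lsr_cost_less(a, b):
    return tuple(a) < tuple(b)  # simplified lexicographic ordering

# Fewer instructions but 10 more registers still "wins" under this model,
# even though 40 regs/thread may cross a GPU occupancy cliff.
cheap_but_fat = LSRCost(10, 40, 1, 0, 2, 0, 1, 0)
lean          = LSRCost(11, 30, 1, 0, 2, 0, 1, 0)
print(is_lsr_cost_less(cheap_but_fat, lean))  # True
```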

Architecture Overview

The NVIDIA LSR overlay is structured as a 7-phase formula solver pipeline. The main entry point is sub_19A87A0, which takes a single argument: a pointer to an LSR state object (referred to as a1 throughout). The state object is large -- relevant fields span from offset 0 through offset 32160.

LSR State Object Layout

| Offset | Type | Field |
|---|---|---|
| +8 | ScalarEvolution* | SCEV analysis handle |
| +32 | LoopInfo* | Loop analysis handle |
| +40 | uint64_t | Target address space identifier |
| +192 | int64_t** | Stride factor table (array of stride values) |
| +200 | uint32_t | Stride factor count |
| +320 | void* | Reuse chain table base |
| +328 | uint32_t | Reuse chain count |
| +368 | LoopRecord* | Loop use-groups array base |
| +376 | uint32_t | Loop use-groups count |
| +32128 | RPTracker | Register pressure tracking structure |
| +32136 | void* | Formula hash table base |
| +32152 | uint32_t | Formula hash table bucket count |
| +32160 | void* | Working formula set |

Each loop record is 1984 bytes. Each use record within a loop is 96 bytes. The loop's use array starts at loop record offset +744, with the use count at +752. These record sizes -- 1984 bytes per loop, 96 bytes per use -- are constant across all 7 phases. The solver iterates loop_count * uses_per_loop in every phase, making this an O(L * U * S) algorithm where S is the stride factor table size.

The 7-Phase Formula Solver Pipeline

Phase 1: Initial Use Setup (lines 471--537)

The solver iterates all loop use-groups and, within each, all individual uses. For each 96-byte use record, it:

  1. Copies the use record to stack locals (the record contains base SCEV, stride SCEV, flags, formula kind, scaled register array, offset expression, and secondary immediate).
  2. Calls sub_19930D0 to expand the scaled register array into a working formula operand list.
  3. Calls sub_19A22F0 (per-register formula generation): iterates over the scaled register count and calls sub_19A1B20 for each register operand to generate one initial formula candidate per operand. If formula_kind == 1 (with-offset mode), it also calls sub_19A1B20 with operand_idx = -1 to generate a formula for the offset expression itself.
  4. Calls sub_19A23A0 (alternative formula generation): a second pass with different addressing-mode generation logic, likely producing formulae with combined base+offset or folded immediates.

The output of Phase 1 is a set of initial formula candidates, one per (use, scaled register) pair, covering the basic addressing modes.

Phase 2: Expression Folding with Unfolded Offsets (lines 548--662)

This phase targets uses where base == NULL (pure IV uses, no base pointer). It performs two sub-passes:

Sub-pass A (unfold offset into base): For each pure-IV use, calls sub_19A2680 per scaled register to generate candidate formulae that move the offset expression into the base register field. This is the inverse of LLVM's stock "fold offset into immediate" transform -- NVIDIA sometimes wants the offset in a register because GPU addressing modes have limited immediate widths.

Sub-pass B (factor loop bounds into formula): Builds an iterator set from the loop's start bound (+712) and end bound (+720), then calls sub_19A2820 per (scaled register, iterator bound) pair. This generates formulae that factor common strides out of the loop bounds. For example, if the loop runs i = 0..N and a use computes 4*i + base, this phase can factor out the stride 4 and produce a formula with a single-step IV.

Phase 3: Stride Factor Expansion (lines 662--862)

Runs only for loops where the use type is 3 (immediate-only addressing). This is the phase that explores alternative stride factors from the stride factor table (a1+192).

For each use:

  1. Extract the representative SCEV via sub_1456040.
  2. Verify bit width is at most 64 via sub_1456C90.
  3. Verify single loop bound (start == end, meaning unit stride).
  4. Reject floating-point constant offsets (type 15).
  5. For each stride factor S in the table:
    • Compute scaled_stride = S * use_stride.
    • Overflow check: verify S * stride / S == stride and that INT64_MIN is not involved (avoiding signed overflow UB).
    • Validate SCEV representability via sub_1594790.
    • Also validate S * loop_start and S * loop_end.
    • Construct a candidate formula with the factored stride.
    • Run formula legality check via sub_1995490 (validates against TTI target legality, loop dominance, and SCEV overflow).
    • If legal: rewrite all scaled operands via sub_13A5B60, check value equivalence via sub_1999100, then commit via sub_19A1660.

The overflow guards in this phase are critical. The multiplication overflow check (v94 * v455 / v94 == v455) prevents generating formulae whose stride values cannot be represented in 64-bit arithmetic, which would silently produce wrong results.
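A Python model of that guard (v94 and v455 are the decompiler's names for the stride factor and stride; wrap64 and tdiv emulate C's 64-bit signed wraparound and truncating division, which Python integers do not have natively):

```python
INT64_MIN = -(1 << 63)

def tdiv(a, b):
    """C-style truncating integer division."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

def wrap64(x):
    """Reinterpret an arbitrary Python int as a signed 64-bit value."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

def stride_mul_ok(s, stride):
    """Phase 3 guard: S * stride must survive the round trip
    (S * stride) / S == stride in 64-bit signed arithmetic."""
    if s == 0 or s == INT64_MIN or stride == INT64_MIN:
        return False
    return tdiv(wrap64(s * stride), s) == stride

print(stride_mul_ok(4, 1 << 40))  # True
print(stride_mul_ok(4, 1 << 62))  # False: the product wraps
```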

Phase 4: Chain-Based Formula Generation (lines 872--1082)

For each use, the solver attempts chain-based strength reduction: building formulae where one IV feeds the next use through a simple increment rather than a full recomputation.

Key logic:

  • Extracts the representative SCEV for the use.
  • If formula_kind != 1 (not with-offset) or the formula has a single element, iterates stride factors and builds chained formulae.
  • For immediate-type uses (type == 3), also considers promoting to with-offset mode (type == 1).
  • Each candidate is validated through sub_1995490.
  • Operands are rewritten via sub_145CF80 (SCEV multiply by stride factor).
  • The flag at loop record +728 controls address-space-aware chain construction. When set, chains respect address space constraints -- critical for shared memory (see the disable-lsr-for-sharedmem32-ptr knob section).

Phase 5: Reuse Chain Matching (lines 1093--1256)

For uses where base == NULL, the solver attempts to match existing IV chains for reuse rather than creating new ones.

  1. Extract the representative SCEV and compute its "extent" (value range) via sub_1456E10.
  2. Iterate the reuse chain table (a1+320 through a1+328).
  3. For each chain entry, check if the use's extent matches the chain target.
  4. Validate legality via sub_14A2CF0.
  5. If matched: rewrite the use's offset via sub_147BE70 (SCEV rebase).
  6. Register pressure check: validate via sub_19955B0(rp_tracker, scev_value, loop_idx) that the register pressure after adding this formula stays under the limit.
  7. If RP passes: tag with address space via sub_19932F0, commit via sub_19A1660.

This is where the lsr-check-rp and lsr-rp-limit knobs have direct effect. The sub_19955B0 function compares projected register pressure against the configured ceiling and returns false if the formula would exceed it.

Phase 6: Formula-to-Use Hash Table Construction (lines 1260--1940)

The most complex phase. It builds two hash tables and uses them to identify which formulae serve multiple uses (shared IV expressions).

Hash Table 1 (7-QWORD entries per slot): maps SCEV expression to a linked list of formula candidates.

| Field Offset | Size | Content |
|---|---|---|
| +0 | 8 | Key: SCEV pointer, or sentinel (-8 = empty, -16 = tombstone) |
| +8 | 8 | Formula candidate linked list head |
| +16 | 4 | Candidate count |
| +24 | 8 | Linked list tail |
| +32 | 8 | Previous pointer (for median walk) |
| +40 | 8 | Next pointer (for median walk) |
| +48 | 8 | Total cost accumulator |

Hash Table 2 (2-QWORD entries): maps SCEV expression to a use-count bitmap tracking how many uses reference the expression.

Both tables use the same hash function: (val >> 9) ^ (val >> 4) masked to the bucket count. Probing is quadratic with tombstone support. Resize triggers at 75% load factor via sub_19A82C0.
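A Python sketch of the hash and probe sequence. Only the hash mix (val >> 9) ^ (val >> 4) is taken from the decompilation; the exact quadratic increment (i*i here) is an assumption:

```python
def slot_hash(val, bucket_count):
    """Hash used by both Phase 6 tables, masked to a power-of-two bucket count."""
    return ((val >> 9) ^ (val >> 4)) & (bucket_count - 1)

def probe_sequence(val, bucket_count, tries):
    """Quadratic probing from the home slot (probe-step form assumed)."""
    h = slot_hash(val, bucket_count)
    return [(h + i * i) & (bucket_count - 1) for i in range(tries)]

print(slot_hash(16, 64))          # 1
print(probe_sequence(16, 64, 4))  # [1, 2, 5, 10]
```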

The phase then:

  1. Inserts every formula from the working set into Hash Table 1 with SCEV normalization via sub_199D980.
  2. Cross-references into Hash Table 2 for use counting, merging bitmaps via sub_1998630.
  3. Iterates the formula set again and, for each formula, traverses the linked list of referencing uses.
  4. Computes combined cost using sub_220EFE0 (reads cost from a binary tree node at +32).
  5. Finds the median-cost insertion point (threshold at total_cost / 2) -- this is a key difference from upstream LLVM, which always picks the cheapest formula. NVIDIA's median heuristic avoids both extremes: the cheapest formula might use too many registers, while the most register-efficient formula might use too many instructions.
  6. Builds (register_id, distance) candidate pairs for each (formula, use) combination. If the candidate set exceeds 31 entries, it migrates from an inline SmallVector to a balanced tree set (sub_19A5C50).

The use-count bitmap uses a compact inline representation: if (value & 1), the high bits encode max_reg_id and the remaining bits form the bitmap directly; otherwise, the value is a pointer to a heap-allocated BitVector (size at +16, data at +0). The popcount check at line 1927 (popcount != 1) filters out expressions used by only one use -- they cannot benefit from strength reduction.
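The inline-versus-heap discrimination and the popcount filter can be modeled as follows. The inline payload layout is simplified here (the real encoding also packs max_reg_id into the high bits), and the heap pointer is modeled as a dictionary key:

```python
def use_popcount(word, heap):
    """If bit 0 is set, the word carries the bitmap inline (payload
    layout simplified); otherwise it is a pointer into heap storage."""
    bits = word >> 1 if word & 1 else heap[word]
    return bin(bits).count("1")

def worth_strength_reducing(word, heap):
    # the popcount != 1 filter: single-use expressions gain nothing
    return use_popcount(word, heap) != 1

inline_two_uses = (0b101 << 1) | 1   # inline bitmap, two bits set
print(worth_strength_reducing(inline_two_uses, {}))   # True
print(worth_strength_reducing((0b100 << 1) | 1, {}))  # False: one use
```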

Phase 7: Final Formula Selection and Commitment (lines 2042--2686)

After hash table cleanup, the solver iterates the candidate triples (register_id, distance, scev_use) and performs the final selection:

for each candidate (reg_id, distance, scev_use):
    loop_record = a1->loops[loop_idx]
    repr_scev   = getStart(scev_use)                    // sub_1456040
    extent      = getExtent(repr_scev)                   // sub_1456E10
    offset_expr = getAddExpr(extent, -distance, 0)       // sub_15A0680
    offset_norm = foldNormalize(scev_ctx, offset_expr)   // sub_146F1B0
    bit_budget  = getBitWidth(offset_norm)                // sub_1456C90

    for each use in loop_record:
        copy 96-byte use record to stack
        if formula_kind == 1:     // with-offset mode
            fold offset into scaled_regs
            set formula_kind = 0  // demote to normal

        if candidate IV appears in use's scaled_regs:
            // Direct replacement path
            validate via sub_1995490 (formula legality)
            build replacement AddRecExpr: sub_147DD40(scev_ctx, [target_iv], 0, 0)
            replace matching operand in formula

            // Sign-extension / width-fit check:
            value_range = computeRange(replacement)
            if abs(distance) < value_range:
                tag with address space (sub_19932F0)
                commit formula (sub_19A1660)

        else if use references a different IV:
            // Cross-IV replacement path
            alt_offset = stride + num_uses * distance
            alt_formula = getAddExpr(extent, -alt_offset, 0)
            validate via sub_1995490

            // Sign-bit check: if sign(replacement) == sign(distance),
            // the formula may wrap -- reject
            if sign_bit_matches: continue

            // Width-fit check via APInt:
            if countLeadingZeros(result) confirms fit in register width:
                commit formula

The width-fit checks use full APInt arithmetic (sub_16A4FD0 for copy, sub_16A7490 for shift/add, sub_16A8F40 for negate, sub_16A7400 for absolute value, sub_16A7B50 for bitwise AND, sub_16A57B0 for leading zero count) to determine whether the replacement formula's value range fits in the bit budget. This is essential for correctness: a formula that overflows its register width produces wrong results silently.

Register Pressure Integration

The integration between LSR and register pressure is the single most important difference from upstream LLVM. It works at four levels:

Level 1: Hard Gate (lsr-check-rp + lsr-rp-limit)

Before committing any reuse chain formula (Phase 5) and internally within the legality check sub_1995490, the solver calls sub_19955B0(rp_tracker, scev_value, loop_idx). This function reads the pre-computed per-loop register pressure estimate from offset a1+32128 and compares the projected post-formula RP against lsr-rp-limit. If the projected RP exceeds the limit, the formula is rejected outright -- it does not even enter the candidate set.

This prevents the pathological case where LSR produces a formula that requires one less instruction per iteration but needs two more live registers, pushing the kernel past an occupancy cliff. On GPU, that one extra instruction is vastly cheaper than the occupancy loss.

Level 2: Bit Budget Proxy (Phase 7)

The "bit budget" computed in Phase 7 (v325 = sub_1456C90(offset_norm)) acts as an indirect register pressure proxy. Wider values need more register slots: a 64-bit value occupies two 32-bit register slots on NVPTX. By enforcing that replacement formulae fit within the bit budget, the solver prevents needless register widening.
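The slot arithmetic behind the proxy is trivial but worth pinning down (Python sketch):

```python
def reg_slots(bit_width):
    """32-bit register slots a value of the given width occupies on NVPTX."""
    return (bit_width + 31) // 32

# a 64-bit IV costs double the slots of a 32-bit one
print(reg_slots(32), reg_slots(64))  # 1 2
```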

Level 3: Sign-Extension Credit (count-sxt-opt-for-reg-pressure + lsr-sxtopt)

When lsr-sxtopt is enabled, LSR attempts to fold sign-extension operations into IV expressions, producing narrower IVs. When count-sxt-opt-for-reg-pressure is also enabled, the cost model credits the register pressure savings from eliminated sign-extensions. A formula that requires one more base register but eliminates a sign-extension might be net-neutral or even beneficial in RP terms.

Level 4: Median-Cost Heuristic (Phase 6)

Rather than always selecting the cheapest formula (as upstream LLVM does), NVIDIA uses a median-cost heuristic. The total cost is summed across all uses of a formula, and the selection threshold is total_cost / 2. This balances instruction cost against register pressure: the cheapest formula often has the highest register pressure, while the formula nearest the median typically represents a balanced tradeoff.
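A Python sketch of the selection rule. The total_cost / 2 threshold is from the decompilation; the walk order and tie-breaking here are assumptions:

```python
def pick_by_median(candidates):
    """Walk candidates in ascending cost order and pick the first whose
    cumulative cost reaches total_cost / 2."""
    ordered = sorted(candidates, key=lambda c: c[1])
    total = sum(cost for _, cost in ordered)
    running = 0
    for name, cost in ordered:
        running += cost
        if 2 * running >= total:
            return name
    return ordered[-1][0]

# the cheapest formula "A" is skipped; the median-cost "B" wins
print(pick_by_median([("A", 1), ("B", 5), ("C", 6)]))  # B
```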

GPU-Specific Knobs

All 11 knobs are registered at ctor_214_0 (0x4E4B00). They are LLVM cl::opt command-line options injected through NVIDIA's option registration infrastructure.

Complete Knob Reference Table

| Knob | Type | Default | Category | Effect |
|---|---|---|---|---|
| disable-unknown-trip-lsr | bool | false | Scope Control | Skips LSR entirely for loops where SCEV cannot determine the trip count. Unknown-trip loops on GPU may be warp-divergent; applying LSR without trip count knowledge can increase register pressure with no loop-count-informed gain. |
| lsr-check-rp | bool | true [MEDIUM confidence] | Register Pressure | Master switch for register pressure checking. When disabled, LSR ignores occupancy constraints and behaves more like upstream LLVM. Default inferred from observed RP-aware behavior in O2 compilations; constructor default not directly confirmed. |
| lsr-rp-limit | int | ~32-64 [LOW confidence] | Register Pressure | Register pressure ceiling. If current RP for the loop meets or exceeds this value, LSR is skipped for that loop. The threshold is set to coincide with occupancy cliff boundaries. Range estimated from SM occupancy math; actual compiled-in default not extracted from binary. |
| filter-bad-formula | bool | true [MEDIUM confidence] | Formula Quality | Enables NVIDIA's custom formula pruning pass. "Bad" formulae are those requiring too many registers or producing address modes unsupported by SASS (for example, formulae that require scaled-index modes that only exist on CPU). Default inferred from observed pruning behavior; constructor value unconfirmed. |
| do-lsr-64-bit | bool | arch-dependent | IV Width | Enables LSR for 64-bit induction variables. Default is false on sm_3x through sm_5x (where 64-bit integer ops are emulated), true on sm_70+ (native 64-bit datapath). |
| count-sxt-opt-for-reg-pressure | bool | true [MEDIUM confidence] | Register Pressure | When calculating RP cost, credits the register savings from sign-extension eliminations that LSR enables. Default inferred from observed behavior; constructor value unconfirmed. |
| lsr-sxtopt | bool | true [MEDIUM confidence] | Sign Extension | Master switch for sign-extension folding within LSR. Folds sign-extension operations into IV expressions to produce narrower IVs, reducing register file consumption. Default inferred from observed behavior; constructor value unconfirmed. |
| lsr-loop-level | int | 0 (all) | Scope Control | Restricts LSR to loops at a specific nesting depth. 0 = all levels. 1 = innermost loops only (where address arithmetic is hottest). |
| lsr-skip-outer-loop | bool | false | Scope Control | Skips the outer loop's IV when processing nested loops. Prevents strength-reducing the outer IV when the inner loop is the performance bottleneck. |
| disable-lsr-for-sharedmem32-ptr | bool | false | Address Space | Disables LSR for pointers into 32-bit shared memory (addrspace(3)). Protects efficient .shared:: addressing modes and bank-conflict-free access patterns. |
| disable-lsr-complexity-discount | bool | false | Cost Model | Disables the complexity estimation discount. When the discount is active (this knob is false), the cost model gives a bonus to formulae that reduce addressing complexity even if they use more registers. Disabling forces strict register-count-based comparison. |

Knob Grouping by Function

Register pressure control (4 knobs): lsr-check-rp, lsr-rp-limit, count-sxt-opt-for-reg-pressure, lsr-sxtopt. These collectively determine whether and how aggressively the solver factors occupancy into formula selection. With all four active, NVIDIA's LSR is deeply occupancy-aware. With all four disabled, it degrades toward upstream LLVM behavior.
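The occupancy stakes behind these knobs can be made concrete with back-of-the-envelope math. The sketch below is illustrative only and not extracted from the binary: it uses sm_75's published limits (64K 32-bit registers per SM, at most 32 resident warps) and ignores register allocation granularity and shared-memory limits.

```python
REGFILE = 65536   # 32-bit registers per SM (sm_75)
MAX_WARPS = 32    # resident-warp limit per SM on sm_75
WARP_SIZE = 32

def resident_warps(regs_per_thread: int) -> int:
    """Warps that fit in the register file at a given per-thread register count
    (simplified: no allocation granularity, no shared-memory constraint)."""
    return min(REGFILE // (regs_per_thread * WARP_SIZE), MAX_WARPS)

for r in (32, 48, 64, 65, 96, 128):
    print(f"{r:3d} regs/thread -> {resident_warps(r)} warps")
```

Occupancy is flat at 32 warps up to 64 registers per thread and drops immediately past it, which is consistent with an RP ceiling in the 32-64 range sitting right at a cliff boundary.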

Scope control (3 knobs): disable-unknown-trip-lsr, lsr-loop-level, lsr-skip-outer-loop. These restrict which loops LSR operates on. They are safety valves: if LSR is hurting a specific kernel, these allow narrowing its scope without disabling it entirely.

Address space control (2 knobs): disable-lsr-for-sharedmem32-ptr, do-lsr-64-bit. These control how LSR interacts with GPU memory semantics. The shared-memory knob protects 32-bit pointer optimality; the 64-bit knob controls IV width policy.

Cost model control (2 knobs): filter-bad-formula, disable-lsr-complexity-discount. These tune the formula evaluation heuristics. The bad-formula filter removes candidates early; the complexity discount adjusts the tradeoff between instruction count and register count.

Address-Space Awareness

Shared Memory 32-Bit Pointer Protection

Shared memory on NVIDIA GPUs uses addrspace(3) with 32-bit addressing. The hardware provides dedicated .shared:: load/store instructions with efficient addressing modes, including bank-conflict-free access patterns tied to pointer alignment.

NVIDIA's LSR overlay tracks address spaces at two levels:

  1. Loop level: the address space identifier at loop record +40.
  2. Use level: the field at use record +48 (within the padding/additional-fields region), which Phase 4 compares against the loop's address space.

In Phase 4 (chain-based formula generation), line 983 checks use+48 == a1+40 || flag_at_+728. If the use's address space matches the target or the address-space-crossing flag is set, the solver uses address-space-aware chain construction. The sub_19932F0 helper tags committed formulae with the correct address space.

When disable-lsr-for-sharedmem32-ptr is enabled, the solver skips all formulae targeting addrspace(3) pointers. The rationale: strength-reducing a 32-bit shared memory pointer can create 64-bit intermediate values (the IV increment may be computed in 64-bit before truncation to 32-bit). This defeats the optimization and can prevent the backend from using efficient 32-bit .shared:: addressing modes.

64-Bit IV Control

The do-lsr-64-bit knob controls whether LSR generates formulae using 64-bit induction variables. The architecture-dependent default reflects hardware reality:

  • sm_30 through sm_52: 64-bit integer operations are emulated (two 32-bit ops + carry). A 64-bit IV costs roughly 2x the register pressure and 2x the instruction cost. LSR is disabled for 64-bit IVs.
  • sm_60 through sm_62: Partial native 64-bit support for address computation.
  • sm_70 and above: Full native 64-bit addressing and arithmetic. 64-bit IVs become acceptable.

Phase 3 (stride factor expansion) checks the bit width of the representative SCEV (sub_1456C90 must return at most 64). Phase 7's bit budget check ensures replacement formulae fit within the available register width. Together, these prevent 64-bit IV generation on architectures where it is disabled.
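Phase 3's overflow guards (called "load-bearing" in the reimplementation notes below: the S * stride / S == stride check and the INT64_MIN guard) reduce to a short routine. This is a hedged sketch with invented helper names; the binary performs the same checks on APInt-style fixed-width values.

```python
INT64_MIN = -(1 << 63)

def wrap64(x: int) -> int:
    """Reduce an arbitrary Python int to a signed two's-complement 64-bit value."""
    return (x + (1 << 63)) % (1 << 64) - (1 << 63)

def tdiv(a: int, b: int) -> int:
    """C-style truncating integer division."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def checked_stride_mul(s: int, stride: int):
    """Return s*stride as a 64-bit value, or None if the product would wrap."""
    if s == INT64_MIN or stride == INT64_MIN:
        return None            # INT64_MIN guard: abs() is unrepresentable
    prod = wrap64(s * stride)
    if s != 0 and tdiv(prod, s) != stride:
        return None            # the S * stride / S == stride wrap check
    return prod
```

Formulae whose stride products fail the check are simply never generated, which is why removing the guard silently miscompiles kernels with large strides.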

Sign-Extension Optimization

When lsr-sxtopt is enabled, the solver actively seeks to fold sign-extension operations into IV expressions. On NVPTX, this is important because:

  1. PTX uses typed registers. A sext i32 %x to i64 creates a new 64-bit value occupying a separate register pair.
  2. If LSR can express the IV in a narrower type from the start, the sign-extension becomes dead code.
  3. When count-sxt-opt-for-reg-pressure is also enabled, the cost model credits this saving.

The sign-extension check appears in Phase 7's width-fit verification. After constructing a replacement formula, the solver computes the value range using APInt arithmetic and checks whether abs(distance) < value_range. If the replacement fits, the sign-extension can be eliminated. An additional sign-bit check (line 2545) rejects replacements where the sign bit of the result matches the sign of the distance -- this would cause the formula to wrap, producing incorrect values.
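The width-fit portion of that test reduces to a one-line predicate. A hedged sketch with invented naming (the binary computes value_range via SCEV extent analysis rather than from the raw bit width):

```python
def sext_removable(distance: int, narrow_bits: int) -> bool:
    """Phase 7 width-fit test: the IV's total travel must fit inside the
    narrow type's signed value range, or the sign-extension cannot be dropped."""
    value_range = 1 << (narrow_bits - 1)   # half-width of the signed range
    return abs(distance) < value_range
```

For a 32-bit IV this admits any distance up to 2^31 - 1 and rejects anything that would force the IV back into 64-bit arithmetic.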

Complexity Discount Heuristic

When disable-lsr-complexity-discount is false (the default), the cost model applies a discount to formulae that reduce addressing complexity, even if they use more registers. "Addressing complexity" here means the number of operations required to compute the effective address for a memory operation.

Consider two formulae for a memory access inside a loop:

  • Formula A: base + 4*i -- one multiplication, one addition. Requires a scaled index register.
  • Formula B: ptr += 4 each iteration -- one addition per iteration, no multiplication. Requires one increment register.

Formula B is "simpler" in addressing complexity but might use one more register (the incrementing pointer) alongside the existing base. The complexity discount gives Formula B a bonus in the cost model, reflecting the GPU reality that address computation instructions compete with arithmetic instructions for issue slots, while an extra register has low cost when the kernel is not at an occupancy cliff.

When the discount is disabled (the knob is set to true), the cost model falls back to strict register-count comparison, similar to upstream LLVM behavior.

Comparison: NVIDIA LSR vs Upstream LLVM LSR

Aspect | Upstream LLVM LSR | NVIDIA Custom LSR
Code size | ~180 KB compiled (500+ helpers, 4 mega-functions) | ~160 KB compiled (30 functions, main solver 83 KB)
Binary location | 0x284F650--0x287C150 | 0x199A--0x19BF overlay
Cost model | 8-field tuple: {Insns, NumRegs, AddRecCost, NumIVMuls, NumBaseAdds, ImmCost, SetupCost, ScaleCost}. Compared via TTI::isLSRCostLess. | Register-pressure-aware with occupancy ceiling. Median-cost heuristic. Complexity discount. Sign-extension credit.
Formula selection | Always picks cheapest formula per cost tuple ordering | Median-cost heuristic: picks near cost midpoint to balance instructions vs registers
Register pressure | Counted but not capped. No occupancy awareness | Hard-gated: lsr-check-rp + lsr-rp-limit reject formulae that exceed RP ceiling
Address spaces | Single flat address space assumed | Full address-space tracking. Shared memory (addrspace 3) gets special 32-bit protection
64-bit IVs | Always considered if legal | Gated by do-lsr-64-bit with architecture-dependent defaults
Sign-extension | Not a first-class concern | Dedicated optimization path with RP credit (lsr-sxtopt, count-sxt-opt-for-reg-pressure)
Loop scope | All loops | Filterable by nesting depth (lsr-loop-level) and outer-loop exclusion (lsr-skip-outer-loop)
Trip count requirement | Attempts all loops | Can skip unknown-trip loops (disable-unknown-trip-lsr)
Hash table | DenseSet<SmallVector<SCEV*>> for uniquification | Custom 7-QWORD-per-entry hash table with quadratic probing, tombstones, 75% load factor resize, linked-list formula chains, and use-count bitmaps
Formula phases | Single-pass candidate generation followed by cost-based pruning | 7 sequential phases: initial setup, expression folding, stride expansion, chain generation, reuse matching, hash table construction, final selection
SCEV infrastructure | Native | Reused from LLVM (shared SCEV, IV rewriting, chain construction)
Tuning knobs | 7 cl::opt knobs (general-purpose: lsr-insns-cost, lsr-filter-same-scaled-reg, lsr-complexity-limit, etc.) | 11 GPU-specific knobs (register pressure, address space, loop scope, cost model)
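The selection difference in the "Formula selection" row can be sketched in a few lines. This is a simplification of the recovered behavior: the real heuristic operates on multi-field costs, not scalars.

```python
def upstream_pick(costs: list[int]) -> int:
    """Upstream LLVM: index of the outright cheapest formula."""
    return min(range(len(costs)), key=costs.__getitem__)

def median_pick(costs: list[int]) -> int:
    """NVIDIA-style (per this analysis): index of the candidate whose cost is
    nearest the median, trading instruction count against register pressure."""
    ordered = sorted(costs)
    median = ordered[len(ordered) // 2]
    return min(range(len(costs)), key=lambda i: abs(costs[i] - median))
```

On costs = [3, 11, 7, 21, 9], upstream picks index 0 (cost 3); the median heuristic picks index 4 (cost 9, the median).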

What NVIDIA Reuses From Upstream

The NVIDIA overlay does not replace everything. It reuses:

  • SCEV infrastructure (0xDB--0xDF range): ScalarEvolution analysis, AddRecExpr construction, range analysis, and trip count computation.
  • IV rewriting (sub_1997F10): creates the replacement IV values with the naming convention "IV.S." and "IV.S.next.".
  • Chain construction (sub_199EAC0): builds IV chains with the "lsr.chain" naming prefix.
  • Formula cost model base (sub_1995010): the underlying cost computation, which NVIDIA then wraps with RP checking and sign-extension credit.
  • Terminator folding (sub_287C150): the "lsr_fold_term_cond" transform that folds loop exit comparisons.

What NVIDIA Replaces

  • Formula generation (Phases 1--5): entirely custom, with address-space awareness, stride factor expansion, and reuse chain matching with RP validation.
  • Formula-to-use mapping (Phase 6): custom hash tables replacing LLVM's DenseSet-based uniquification with a design optimized for linked-list traversal and median-cost computation.
  • Final selection (Phase 7): custom selection with width-fit checks, sign-extension validation, and cross-IV replacement -- none of which exist in upstream LLVM.

Key Helper Function Map

For reimplementation reference, the critical helpers and their roles:

Address | Function | Role
sub_19A87A0 | Main 7-phase solver | Entry point (83 KB, 2688 lines)
sub_19CE990 | NVLoopStrengthReduce::run() | Pass wrapper
sub_1995490 | Formula legality validator | TTI + SCEV + loop constraint check
sub_19955B0 | Register pressure check | Compares projected RP vs limit
sub_19932F0 | Address space tagger | Sets addrspace on formula
sub_19A1660 | Formula commit | Sorts, deduplicates, inserts into candidate set
sub_19A22F0 | Per-register formula gen (Phase 1) | Loops sub_19A1B20 per operand
sub_19A2680 | Unfolded-offset formula gen (Phase 2a) | Offset-to-base transform
sub_19A2820 | Loop-bound-factored formula gen (Phase 2b) | Stride factoring
sub_19A82C0 | Hash table resize | Power-of-two bucket growth
sub_199D980 | SCEV normalization | Canonical form for hashing
sub_1998630 | Use-count bitmap merge | Inline bitmap + heap fallback
sub_1456040 | SCEV getStart() | Extract base from AddRecExpr
sub_1456C90 | SCEV getBitWidth() | Register width determination
sub_1456E10 | SCEV extent computation | Value range of IV
sub_145CF80 | SCEV getMulExpr() | Multiply SCEV by stride factor
sub_147BE70 | SCEV rebase | Rewrite base in AddRecExpr
sub_147DD40 | AddRecExpr constructor | Build replacement IV chain
sub_15A0680 | SCEV getAddExpr() | Add constant offset
sub_146F1B0 | SCEV fold/normalize | Simplify expression

Data Structure Reference

Use Record (96 bytes)

+0   [8]  base_scev       : SCEV* (NULL for pure-IV uses)
+8   [8]  stride_scev     : SCEV* (loop stride expression)
+16  [1]  flags           : bit 0 = is_address, bit 1 = has_offset
+24  [8]  formula_kind    : 0 = normal, 1 = with-offset, 3 = immediate-only
+32  [8]  scaled_regs_ptr : pointer to SmallVector<SCEV*>
+40  [4]  scaled_regs_cnt : number of scaled register operands
+48  [32] padding / alignment / additional fields
+80  [8]  offset_scev     : SCEV* (offset expression)
+88  [8]  secondary_imm   : secondary immediate value

Loop Record (1984 bytes)

+32   [4]  use_type        : 0 = generic, 1 = address-check, 3 = immediate
+40   [8]  addr_space      : address space identifier
+48   [4]  alignment       : alignment constraint (bytes)
+712  [8]  loop_start      : SCEV* (loop start bound)
+720  [8]  loop_end        : SCEV* (loop end bound)
+728  [1]  as_aware_flag   : address-space-aware LSR active
+729  [1]  dead_guard_flag : if set && use_count > 0, skip loop
+744  [8]  use_array_ptr   : pointer to array of use records
+752  [4]  use_count       : number of uses in this loop
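The two layouts above can be mirrored with ctypes to sanity-check the recovered offsets. Field widths in the gaps are guesses (the binary only proves the named offsets and the total sizes); the interior regions are kept as opaque padding.

```python
import ctypes as C

class UseRecord(C.Structure):           # 96 bytes
    _fields_ = [
        ("base_scev",       C.c_void_p),      # +0
        ("stride_scev",     C.c_void_p),      # +8
        ("flags",           C.c_uint8),       # +16: bit 0 is_address, bit 1 has_offset
        ("_pad0",           C.c_uint8 * 7),
        ("formula_kind",    C.c_uint64),      # +24
        ("scaled_regs_ptr", C.c_void_p),      # +32
        ("scaled_regs_cnt", C.c_uint32),      # +40
        ("_pad1",           C.c_uint8 * 4),
        ("_extra",          C.c_uint8 * 32),  # +48 padding / additional fields
        ("offset_scev",     C.c_void_p),      # +80
        ("secondary_imm",   C.c_uint64),      # +88
    ]

class LoopRecord(C.Structure):          # 1984 bytes
    _fields_ = [
        ("_pad0",           C.c_uint8 * 32),
        ("use_type",        C.c_uint32),      # +32
        ("_pad1",           C.c_uint8 * 4),
        ("addr_space",      C.c_uint64),      # +40
        ("alignment",       C.c_uint32),      # +48
        ("_pad2",           C.c_uint8 * 660),
        ("loop_start",      C.c_void_p),      # +712
        ("loop_end",        C.c_void_p),      # +720
        ("as_aware_flag",   C.c_uint8),       # +728
        ("dead_guard_flag", C.c_uint8),       # +729
        ("_pad3",           C.c_uint8 * 14),
        ("use_array_ptr",   C.c_void_p),      # +744
        ("use_count",       C.c_uint32),      # +752
        ("_pad4",           C.c_uint8 * 1228),
    ]
```

Both structures check out against the documented sizes (96 and 1984 bytes) under default x86-64 alignment, which lends some confidence to the field placement.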

Reimplementation Notes

  1. Start with the knob infrastructure. Register all 11 cl::opt knobs before anything else. The pass wrapper (sub_19CE990) reads these early and uses them to gate entire phases.

  2. The RP tracker must exist before the solver runs. The register pressure estimate at a1+32128 is computed by an earlier pass (likely during loop analysis). The NVIDIA LSR does not compute RP itself -- it only reads and compares.

  3. The hash function is deterministic. (val >> 9) ^ (val >> 4) masked to bucket count. Quadratic probing with tombstone support. If you are reimplementing the hash tables, use the same scheme or your formula deduplication will differ.

  4. The median-cost heuristic is the secret sauce. Upstream LLVM always picks the cheapest formula. NVIDIA picks near the median. This single difference is responsible for most of the occupancy improvements. If you must simplify, keep this heuristic.

  5. The overflow checks in Phase 3 are load-bearing. The S * stride / S == stride check and the INT64_MIN guard prevent generating formulae with wrapped arithmetic. Removing these checks will produce silently wrong code on kernels with large strides.

  6. Address space tagging (sub_19932F0) must happen before commit. Every formula committed via sub_19A1660 must carry the correct address space tag. Forgetting this will produce PTX that uses generic loads/stores where shared-memory instructions are required, breaking both performance and correctness.

  7. The use-count bitmap has two representations. Inline (when value & 1) and heap-allocated. The inline form is fast but limited to small register ID ranges. The heap form uses a BitVector with the size at +16. Both must be supported.

  8. Phase ordering is strict. The 7 phases must run in order. Later phases depend on candidates generated by earlier ones, and the hash tables in Phase 6 assume all candidates have been generated by Phases 1--5.
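Note 3's hash scheme is easy to reproduce. The sketch below is a simplified model: integer keys stand in for SCEV pointer values, and the 7-QWORD entry payload, formula chains, and bitmaps are omitted. The hash, probe sequence, tombstones, and 75% resize rule follow the recovered description.

```python
EMPTY, TOMB = object(), object()

class FormulaSet:
    """Open-addressing set using the recovered hashing scheme."""

    def __init__(self):
        self.slots = [EMPTY] * 16
        self.live = 0

    def _find(self, val):
        mask = len(self.slots) - 1
        idx = ((val >> 9) ^ (val >> 4)) & mask        # recovered hash function
        step, first_tomb = 0, None
        while True:
            s = self.slots[idx]
            if s is EMPTY:
                return (first_tomb if first_tomb is not None else idx), False
            if s is TOMB:
                if first_tomb is None:
                    first_tomb = idx                  # reuse tombstone slots
            elif s == val:
                return idx, True
            step += 1
            idx = (idx + step) & mask                 # quadratic (triangular) probe

    def add(self, val):
        if self.live + 1 > 3 * len(self.slots) // 4:  # 75% load factor
            old = [s for s in self.slots if s is not EMPTY and s is not TOMB]
            self.slots = [EMPTY] * (len(self.slots) * 2)  # power-of-two growth
            self.live = 0
            for v in old:
                self.add(v)                           # rehash, dropping tombstones
        idx, found = self._find(val)
        if not found:
            self.slots[idx] = val
            self.live += 1

    def __contains__(self, val):
        return self._find(val)[1]

    def discard(self, val):
        idx, found = self._find(val)
        if found:
            self.slots[idx] = TOMB
            self.live -= 1
```

The triangular probe sequence on a power-of-two table visits every slot, and the load-factor cap guarantees an EMPTY slot exists, so lookups always terminate.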

Differences from Upstream LLVM

Aspect | Upstream LLVM | CICC v13.0
Formula solver | Single LLVM LoopStrengthReduce with TTI-based cost model | Two implementations: stock LLVM LSR + custom 160 KB NVIDIA formula solver (sub_19A87A0, 2688 lines) that replaces formula generation/selection
Cost model | 8-field cost tuple ({Insns, NumRegs, AddRecCost, ...}), no occupancy concept | Occupancy-aware cost: register count evaluated against discrete warp occupancy cliffs where +1 register can halve throughput
Address space awareness | No address space semantics in formula selection | Address space tagging (sub_19932F0) ensures formulae preserve shared memory (addrspace 3) 32-bit pointer width; prevents strength-reducing 32-bit pointers into 64-bit generic form
Knob count | ~5 knobs for cost tuning | 11 knobs for fine-grained GPU-specific control (lsr-no-ptr-address-space3, stride limits, formula depth, etc.)
Algorithm structure | Monolithic formula generator + greedy selector | 7-phase formula solver pipeline: candidate generation, stride-based filtering, use-group analysis, formula selection, commit, rewrite
State object | Modest state for formula tracking | 32,160-byte state object with embedded register pressure tracker, formula hash table, and per-use-group formula arrays
Typed register cost | All registers weigh the same | 64-bit IVs cost two 32-bit register slots; emulated on sm_3x--5x; native on sm_70+ but still double the pressure

StructurizeCFG

Prerequisites: Familiarity with GPU execution model (warp divergence, reconvergence), LLVM dominator tree and post-dominator tree concepts, and the PTX emission pipeline. Understanding of reducible vs. irreducible control flow is assumed.

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/Transforms/Scalar/StructurizeCFG.cpp (LLVM 20.0.0). The upstream version was originally written for AMDGPU; cicc ships both the stock AMDGPU copy at sub_1F0EBC0 and a separate NVPTX-customized copy at sub_35CC920.

CICC v13.0 ships two copies of the StructurizeCFG pass: an NVPTX-specific version at sub_35CC920 (95 KB, 2,397 decompiled lines) and a stock LLVM/AMDGPU version at sub_1F0EBC0. Both exist because the binary links both the NVPTX backend and the generic LLVM Scalar library; only the NVPTX instance is scheduled in the CUDA compilation pipeline. This page documents the NVPTX version exclusively.

The pass is mandatory for PTX emission. It is registered as "structurizecfg" in the pipeline parser (sub_2377300, sub_233F860) and listed as a required late pass by sub_29882C0 and sub_1A6D600.

Why PTX Requires Structured Control Flow

PTX is a structured instruction set. Unlike x86 or ARM, where a branch can target any address and the hardware resolves control flow at retirement, the NVIDIA GPU execution model imposes three hard constraints:

  1. Reconvergence at post-dominators. When a warp diverges (threads take different sides of a branch), the hardware needs a defined reconvergence point where all threads synchronize before continuing. This reconvergence point must be the immediate post-dominator of the branch. An unstructured CFG has no guarantee that such a point exists or is reachable from both sides.

  2. No multi-entry loops. A loop header must dominate every block in the loop body. If two distinct blocks serve as loop entries (an irreducible cycle), the hardware has no single point to insert the loop counter logic and the warp-level loop exit barrier. PTX therefore requires all loops to be natural (single-entry, reducible).

  3. No exception handling funclets. CUDA device code has no runtime support for stack unwinding, personality routines, or catch dispatch. The funclet-based EH model (Windows SEH, C++ landing pads) produces control flow patterns that cannot be expressed in PTX.

The StructurizeCFG pass converts reducible-but-unstructured flow into structured form by inserting "Flow" blocks that serve as explicit reconvergence points. It rejects irreducible flow and EH funclets with diagnostic remarks rather than attempting to restructure them.
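Constraint (1) can be made concrete: for a simple if-then-else diamond, the reconvergence point is the branch block's immediate post-dominator. A small self-contained sketch using standard iterative dataflow (this is textbook machinery, not a transcription of the binary's algorithm):

```python
def post_dominators(succ, exit_block):
    """pdom(b) = {b} ∪ ⋂ pdom(s) over successors s; every non-exit block
    must have at least one successor."""
    nodes = list(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_block] = {exit_block}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == exit_block:
                continue
            new = set.intersection(*(pdom[s] for s in succ[n])) | {n}
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom

def reconvergence_point(pdom, block):
    """Immediate post-dominator: the strict post-dominator nearest to block
    (in the post-dominator tree, the one with the largest pdom set)."""
    strict = [p for p in pdom[block] if p != block]
    return max(strict, key=lambda p: len(pdom[p]))

cfg = {"Entry": ["Then", "Else"], "Then": ["Merge"], "Else": ["Merge"],
       "Merge": ["Exit"], "Exit": []}
```

For this diamond, reconvergence_point correctly identifies Merge as the point where a warp diverging at Entry must resynchronize.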

Binary Layout

Function | Address | Size | Role
sub_35CC920 | 0x35CC920 | 95 KB | Main pass body
sub_35CF930 | 0x35CF930 | ~2 KB | Entry gate / dispatch wrapper
sub_35CA2C0 | 0x35CA2C0 | ~4 KB | Irreducibility detector
sub_35CB4A0 | 0x35CB4A0 | ~8 KB | Uniform branch classifier
sub_35CBCD0 | 0x35CBCD0 | ~6 KB | Region structurizer core
sub_35CA580 | 0x35CA580 | ~1 KB | Diagnostic emitter
sub_35CA9C0 | 0x35CA9C0 | ~1 KB | Hash-set insert for BB tracking
sub_35C9CD0 | 0x35C9CD0 | ~2 KB | Edge reroute through new block
sub_35C9ED0 | 0x35C9ED0 | ~1 KB | Domtree NCA (nearest common ancestor) walk
sub_35C9B40 | 0x35C9B40 | trivial | Successor array offset (return a1 + 8*a3)

Entry Gate: sub_35CF930

sub_35CF930 is the runOnFunction entry. It implements a multi-stage filter before committing to the expensive structurization:

sub_35CF930(pass, function):
    // 1. Early-out for trivially uninteresting functions
    if sub_BB98D0(pass, function) fails:
        return 0

    // 2. Single-block functions need no structurization
    bb_list = function + 40
    if bb_list points to itself (single block):
        return 0

    // 3. Query target machine for a structurizer strategy object
    strategy = target_machine->vtable[136](...)

    // 4. Check enable-shrink-wrap override
    switch qword_50400C8:
        case 1:  goto force_structurize    // always run
        case 2:  return 0                  // always skip
        case 0:                            // ask strategy object
            if not strategy->vtable[72](function):
                return 0                   // strategy says skip

    // 5. Check function attributes for safe-to-skip markers
    for attr_id in [56, 63, 59, 64, 57]:
        if sub_B2D610(function, attr_id):
            return 0

    // 6. Run the actual structurizer
    force_structurize:
        return sub_35CC920(pass, function)

The attribute IDs likely map to: 56 = convergent, 63 = nodivergencesource, 59 = nounwind, 64 = alwaysinline, 57 = optnone. [MEDIUM confidence] These numeric-to-name associations are inferred from LLVM attribute enumeration ordering in the upstream source and the semantic context of their usage (skip-structurize guard), not from string evidence in the binary. The attribute enum may differ in NVIDIA's fork. Functions carrying any of these are either already guaranteed to have uniform control flow or are explicitly marked as not-to-be-optimized.

CLI Knobs

Knob | Registration | Type | Default | Effect
structurizecfg-skip-uniform-regions | ctor_227 @ 0x4E9E40, ctor_489 @ 0x553F30 | bool | false | When true, regions with only uniform (warp-coherent) branches are left unstructured, avoiding unnecessary code bloat
structurizecfg-relaxed-uniform-regions | ctor_489 @ 0x553F30 | bool | true | Allows treating a region as uniform even if sub-regions contain non-uniform branches, provided there is at most one conditional direct child
enable-shrink-wrap (qword_50400C8) | ctor_688 @ 0x5A6520 | int (0/1/2) | 0 | 0 = ask TargetRegisterInfo (vtable+72) whether to structurize; 1 = force structurize unconditionally; 2 = skip structurize entirely

The enable-shrink-wrap knob is stored as a global at qword_50400C8. Despite its name (borrowed from the generic LLVM shrink-wrapping pass infrastructure), it serves as a master override for the structurization decision. Mode 2 effectively disables the pass, which would produce miscompilation for any function with divergent branches -- it exists purely as a debugging/override mechanism.

Irreducibility Detection: sub_35CA2C0

Called early in sub_35CC920 (line ~743 of the decompiled output), this function determines whether the CFG contains irreducible cycles. It detects irreducibility but does not restructure it.

Algorithm

The function receives the RPO-ordered basic block list from the SCC decomposition phase and iterates backwards:

sub_35CA2C0(result, domtree_data, bb_list, bb_count):
    for each BB in reverse(bb_list):
        for each successor S of BB:
            // Probe dominator tree hash table
            // Hash: ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1)
            dom_node = lookup(domtree_data, S)

            // If S does NOT dominate BB, but there is a back-edge
            // from BB to S, this is an irreducible cycle
            if back_edge(BB, S) and not dominates(S, BB):
                return 1  // irreducible

    return 0  // reducible

The core invariant: in a reducible CFG, every back-edge target dominates its source. If a back-edge exists where the target does not dominate the source, the loop has multiple entries and is irreducible.
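The invariant can be checked directly. A compact sketch using standard dominator dataflow plus an RPO scan (equivalent in effect to the recovered routine, but not a transcription of sub_35CA2C0's hash-table walk):

```python
def dominators(succ, entry):
    """dom(n) = {n} ∪ ⋂ dom(p) over predecessors p (iterative dataflow)."""
    nodes = list(succ)
    preds = {n: [p for p in nodes if n in succ[p]] for n in nodes}
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def is_irreducible(succ, entry):
    dom = dominators(succ, entry)
    post, seen = [], set()
    def dfs(n):                          # post-order for RPO numbering
        seen.add(n)
        for s in succ[n]:
            if s not in seen:
                dfs(s)
        post.append(n)
    dfs(entry)
    rpo = {n: i for i, n in enumerate(reversed(post))}
    for u in succ:
        for v in succ[u]:
            # Retreating edge whose target does not dominate its source:
            # a multi-entry (irreducible) cycle.
            if rpo[v] <= rpo[u] and v not in dom[u]:
                return True
    return False
```

A natural loop (Entry -> Header -> Body -> Header -> Exit) passes the check; a two-entry cycle (Entry branching into both A and B, with A and B branching to each other) is flagged irreducible.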

Rejection behavior

When sub_35CA2C0 returns 1 (irreducible detected), the main pass emits:

remark: UnsupportedIrreducibleCFG
        "Irreducible CFGs are not supported yet."

via sub_35CA580 and returns without modifying the function. The return value is forced to 0 (no modification made).

This is a critical design choice. LLVM upstream provides a separate FixIrreduciblePass (sub_29D33E0, registered as "fix-irreducible") that performs node-splitting to convert irreducible cycles into reducible ones. However, the NVPTX pipeline in CICC v13.0 does not schedule FixIrreduciblePass before StructurizeCFG. The assumption is that well-formed CUDA C++ source never produces irreducible flow. If it does (extreme goto abuse, or a prior optimization pass introducing an irreducible pattern), the compilation emits the diagnostic and the resulting PTX will likely be rejected by ptxas.

EH Funclet Rejection

During the per-block iteration in the main loop, each basic block is checked for funclet status at offset BB+235 (a boolean flag indicating the block is a catchpad, cleanuppad, or catchret target):

if BB->isEHFunclet():   // *(BB + 235) != 0
    emit_diagnostic("UnsupportedEHFunclets",
                     "EH Funclets are not supported yet.")
    clear visited bitvector
    bail out

The funclet model (Windows x64, ARM64) structures exception handling into mini-functions that require personality routines and unwind tables. None of this exists in the GPU runtime. If a funclet block appears, it means the frontend erroneously lowered exception handling into device code.

After emitting the diagnostic, the pass checks qword_503FFE8 (a global flag, possibly a debug override). If nonzero, it attempts to find a single-entry point and process the rest of the function; if zero, it bails out entirely.

Uniform Branch Classification: sub_35CB4A0

This function (~500 decompiled lines) classifies whether a branch instruction is warp-uniform (all threads in the warp take the same direction) or divergent. The classification determines whether the region under that branch needs structurization.

Classification logic

sub_35CB4A0(pass_state, BB, ...):
    terminator_opcode = BB->opcode_category   // BB + 68, unsigned short

    // Non-conditional terminators (ret, unreachable, switch) skip analysis
    if (terminator_opcode - 1) > 1:
        return 0  // not a conditional branch, no structurization needed

    // Check function-level flags
    func_flags = BB->parent->flags   // BB + 32 + 64
    // bit 3 (0x08) = hasConvergentCalls
    // bit 4 (0x10) = hasDivergentBranches

    // Check block-level properties
    block_flags = BB->properties   // BB + 44
    // bit 2 (0x04) = already classified
    // bit 3 (0x08) = uses profile data

    // Query DivergenceAnalysis
    uniformity = sub_2E88A90(divergence_info, BB, mask_bits)
    // mask_bits: 0x80000 = uniform, 0x100000 = divergent, 0x80 = other

    // Additional uniformity check
    is_uniform = sub_2E8B090(divergence_info, BB)

    if is_uniform and skip_uniform_regions_enabled:
        return 0  // uniform, can skip structurization

    return 1  // divergent, needs structurization

When the structurizecfg-skip-uniform-regions knob is active, regions with all-uniform branches are left unmodified. This is sound because uniform branches do not cause warp divergence and therefore do not require explicit reconvergence points. Skipping these regions reduces code bloat from the insertion of unnecessary Flow blocks.

The structurizecfg-relaxed-uniform-regions knob relaxes the uniformity check for sub-regions. In upstream LLVM, hasOnlyUniformBranches refuses to treat a region as uniform if any sub-region contains a non-uniform branch. The relaxed mode allows this if there is at most one conditional direct child, under the reasoning that a single divergent sub-region can be handled by an inner structurization pass invocation.

Region Structurizer Core: sub_35CBCD0

This is the heart of the transformation. When a non-uniform, non-EH block is identified, sub_35CBCD0 processes its region:

sub_35CBCD0(pass_state, BB, context):
    // 1. Manage region boundaries
    head = pass_state[67]   // current region head
    tail = pass_state[68]   // current region tail

    // 2. Iterate successors
    for each successor S of BB (via sub_2E313E0):

        // 3. Check uniformity of successor edge
        if sub_35CB4A0(pass_state, S, ...) returns 0:
            continue  // uniform edge, skip

        // 4. Compute reconvergence point via NCA
        nca = sub_35C9ED0(domtree, BB, S)
        // NCA = nearest common ancestor in dominator tree
        // This is where threads from both sides of the branch
        // must reconverge

        // 5. Update region boundaries
        pass_state[67] = update_head(head, nca)
        pass_state[68] = update_tail(tail, nca)

    // 6. Update visited-BB bitvector
    set_bit(pass_state[91], BB->ordinal)

The NCA computation (sub_35C9ED0) walks the dominator tree upward from both the current block and its successor until finding their nearest common ancestor. This NCA becomes the reconvergence point: the block where the hardware must synchronize all threads before continuing.
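The NCA walk is the textbook two-pointer climb. A sketch assuming immediate-dominator links are available (the binary instead probes its own hash-table representation of the tree):

```python
def domtree_nca(idom, a, b):
    """Nearest common ancestor of a and b in the dominator tree, given
    immediate-dominator links; the entry block maps to None."""
    def depth(n):
        d = 0
        while idom[n] is not None:
            n, d = idom[n], d + 1
        return d
    da, db = depth(a), depth(b)
    while da > db:                 # level the deeper node first
        a, da = idom[a], da - 1
    while db > da:
        b, db = idom[b], db - 1
    while a != b:                  # then climb both in lockstep
        a, b = idom[a], idom[b]
    return a

# Dominator tree of the if-then-else diamond: Entry dominates everything.
idom = {"Entry": None, "Then": "Entry", "Else": "Entry", "Merge": "Entry"}
```

For the diamond, the NCA of Then and Else in the dominator tree is Entry, the branch block that opened the divergent region.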

Main Structurization Loop: sub_35CC920

The main pass body executes in four phases.

Phase 1: Initialization (lines 433-648)

// Store analysis results in pass object fields
pass[65] = DivergenceAnalysis + 200
pass[66] = LoopInfo + 200
pass[67] = 0              // current head
pass[68] = 0              // current tail
pass[69] = DomTree + 200
pass[70] = PostDomTree + 200
pass[71] = loop_depth_info

// Compute RPO (reverse post-order)
rpo = sub_2EA7130() -> sub_2EA7B20()

// Build SCC ordering (cross-references RPO with SCC decomposition)
scc_order = sub_357E170(rpo)

// Check for irreducible cycles
if sub_35CA2C0(scc_order, domtree, ...):
    emit "UnsupportedIrreducibleCFG"
    return 0

Phase 2: Per-block classification (lines 816-2253)

Iterates blocks in reverse RPO order (bottom-to-top):

for each BB in reverse_rpo(scc_order):

    // (a) Reject EH funclets
    if BB->isEHFunclet:
        emit "UnsupportedEHFunclets"
        clear bitvector, bail out

    // (b) Already marked for structurization
    if BB->structurize_flag (BB+216) or BB->flag_262 (BB+262):
        sub_35CBCD0(pass, BB, ...)  // structurize this region
        continue

    // (c) Check successors for back-edges to visited blocks
    has_loop = false
    for each successor S of BB:
        if bitvector_test(S->ordinal):
            has_loop = true   // back-edge detected = loop header

    // (d) Classify uniformity of predecessors
    needs_structurize = false
    for each predecessor P of BB:
        if sub_35CB4A0(pass, P, ...):
            needs_structurize = true
            break

    // (e) Apply structurization
    if needs_structurize:
        sub_35CBCD0(pass, BB, ...)

    // (f) Update bitvector
    bitvector_set_or_clear(BB->ordinal, needs_structurize)

Phase 3: Domtree-guided reconvergence (lines 2255-2396)

After the per-block loop, if a split point was identified (pass[67] != 0 and pass[68] != 0):

// Walk domtree from split point upward
current = split_point
while current != null:
    // Query strategy object for split decisions
    if strategy->shouldSplit(current):       // vtable+312
        sub_35CBCD0(pass, current, ...)

    if strategy->shouldSplitChild(current):  // vtable+320
        // second round for child regions
        ...

    current = domtree_parent(current)

// Store results in function metadata for PTX emission
function_obj[672] = head    // reconvergence head
function_obj[680] = tail    // reconvergence tail

These stored head/tail values are read by subsequent PTX emission passes to emit the correct convergence/reconvergence annotations in the output PTX.

Phase 4: Cleanup (lines 2383-2396)

Frees the helper object allocated at line 771 (0xA8 bytes), the SCC ordering buffer, and returns the modification flag (0 = no changes, 1 = modified).

Reconvergence Insertion Path

When a non-uniform divergent region is identified between a head block and a tail block, the pass performs the actual CFG transformation:

Step 1: Dominance validation

// Head must dominate tail
if not sub_2E6D360(domtree, head, tail):
    skip  // invalid region, cannot structurize

// Tail must post-dominate head
if not sub_2EB3EB0(postdomtree, tail, head):
    skip

Step 2: Edge classification

Collect successors of the tail into two sets:

  • External edges: successors pointing outside the region (into v395/v396)
  • Internal edges: successors pointing back inside the region (into v404/v405)

The strategy object (vtable+344) classifies each edge to determine if restructuring is needed.

Step 3: Flow block creation

// Create new "Flow" basic block
new_block = sub_2E7AAE0(function, 0, ...)  // BasicBlock::Create
sub_2E33BD0(new_block, insert_point)       // insert into BB list

// Copy phi-node entries from original target
for each phi in original_target:
    sub_2E33140(phi, ...)   // copy incoming value
    sub_2E341F0(phi, ...)   // update predecessor

Step 4: Edge rerouting

// Reroute edges from old target to new Flow block
sub_2E337A0(old_target, new_block)         // replaceAllUsesWith
sub_2E33F80(new_block)                     // finalize successors

// For each stale edge, update divergence info
for each stale_edge:
    sub_35C9CD0(stale_edge, ...)
    strategy->updateDivergence(...)        // vtable+368

Step 5: Recursive child splitting

If the strategy's shouldSplitChild (vtable+320) returns true, the newly created Flow block itself may need further splitting. This creates another block, reroutes edges again, and recurses. This handles deeply nested divergent regions where a single Flow block is insufficient.

Before/After CFG Example

Consider a function with a divergent if-then-else:

Before structurization:

    Entry
    /    \
  Then   Else
    \    /
    Merge
      |
    Exit

If the branch at Entry is divergent (some threads go to Then, others to Else), the hardware needs an explicit reconvergence point. After structurization:

After structurization:

    Entry
    / T
   |    \
   |   Then
   |    /
  Flow1         <- new block: reconvergence for Then
   | F  \
   |   Else
   |    /
  Flow2         <- new block: reconvergence for Else
    |
   Merge
    |
   Exit

The Flow1 and Flow2 blocks are inserted with conditional branches controlled by PHI networks. Flow1 branches on its PHI: threads that executed Then bypass Else and proceed toward Flow2, while threads that took Entry's "false" exit (skipping Then) enter Else. Flow2 then reconverges both paths, so every thread reaches Merge through the same block regardless of which side it took.

For a divergent loop:

Before:

    Entry
      |
    Header <---+
    /    \     |
  Body    |    |
    \    /     |
   Latch ------+
      |
    Exit

After:

    Entry
      |
    Header <------+
      |            |
    Body           |
      |            |
    FlowLoop       |
    / (back) \     |
   |          +----+
   | (exit)
   Exit

FlowLoop is a new block whose branch condition is a PHI: true incoming from Body means exit the loop, false means take the back-edge. This inverted convention (true = break, false = continue) matches upstream LLVM's structurization invariant.
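To make the inverted convention concrete, here is a tiny Python simulation (entirely illustrative -- thread masking is modeled with a set, and nothing here is CICC code) of a warp executing a divergent loop through a FlowLoop-style exit decision, where true means break and false means take the back-edge:

```python
def run_warp(trip_counts):
    """One divergent loop: each thread needs a different number of iterations."""
    iterations = [0] * len(trip_counts)
    active = set(range(len(trip_counts)))   # threads still inside the loop
    while active:                           # Header: loop while any thread is active
        for t in list(active):
            iterations[t] += 1              # Body
            # FlowLoop PHI: true = exit the loop, false = take the back-edge
            if iterations[t] >= trip_counts[t]:
                active.remove(t)            # true  -> Exit
            # else: false -> back-edge to Header

    return iterations

print(run_warp([1, 2, 4, 4]))  # each thread runs exactly its own trip count
```

The common case (keep iterating) is the false/fall-through path, matching the convention described above.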

Flow Block Insertion Algorithm

The previous sections describe the pass at the function-dispatch level. This section provides the complete algorithmic detail of how Flow blocks are actually created, wired, and how PHI networks are maintained -- the core transformation that converts a reducible-but-unstructured CFG into a fully structured CFG suitable for PTX emission.

Complexity

Let B = number of basic blocks, E = number of CFG edges, and D = depth of the dominator tree.

  • Irreducibility detection (sub_35CA2C0): O(B * E) -- for each block in the traversal order, it probes successors against the dominator tree hash table (O(1) per probe).
  • Per-block classification loop: O(B * (P_avg + S_avg)), where P_avg and S_avg are average predecessor and successor counts -- effectively O(B + E).
  • Uniform branch classifier (sub_35CB4A0): O(1) per block (a few flag checks and one DivergenceAnalysis query).
  • NCA computation (sub_35C9ED0): walks the domtree upward from two nodes until convergence -- O(D) per call.
  • Flow block insertion: O(D + PHI_count), where PHI_count is the number of PHI nodes at the original merge point (each needs entry copying).
  • Recursive child splitting: adds at most O(B) new blocks total across the entire function.
  • Bitvector tracking: O(B / 64) per test/set operation.

Overall: O(B * D + E + F * PHI_total), where F = number of Flow blocks created. Since F <= B (one Flow per divergent region) and D = O(B) in the worst case, the theoretical worst case is O(B^2 + E). In practice, CUDA CFGs are shallow (D < 20) and sparsely divergent, making the pass effectively O(B + E).
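The O(D) NCA walk is simple enough to sketch. The following Python reconstruction is illustrative only (names and data layout are ours, not recovered from the binary): it assumes each domtree node stores its parent and depth, lifts the deeper node first, then climbs both in lockstep until they meet.

```python
def domtree_nca(parent, depth, a, b):
    """Nearest common ancestor of a and b in a dominator tree,
    given parent pointers and per-node depths."""
    while depth[a] > depth[b]:      # lift the deeper node
        a = parent[a]
    while depth[b] > depth[a]:
        b = parent[b]
    while a != b:                   # climb in lockstep until convergence
        a, b = parent[a], parent[b]
    return a

# Toy dominator tree: entry -> {head -> {then, else_}, exit}
parent = {"entry": None, "head": "entry", "then": "head",
          "else_": "head", "exit": "entry"}
depth  = {"entry": 0, "head": 1, "then": 2, "else_": 2, "exit": 1}
print(domtree_nca(parent, depth, "then", "else_"))  # -> head
```

Each loop iterates at most D times, giving the O(D) bound used in the analysis above.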

Conceptual Model

A "Flow block" is a synthetic basic block that serves as an explicit thread reconvergence point. In an unstructured CFG, divergent branches may merge at a common successor without any indication of which predecessor each thread arrived from. The hardware's reconvergence mechanism needs a single merge point where it can resume lockstep execution. Flow blocks provide this by:

  1. Interposing between the divergent region and its exit.
  2. Carrying a PHI node whose value encodes the path taken by each thread.
  3. Branching conditionally on that PHI to either enter the next region body or skip to the next Flow block.

The algorithm processes the function bottom-to-top (reverse RPO), which ensures that inner regions are structurized before outer ones. Each region is defined by a head (dominator) and tail (post-dominator). The output is a function where every conditional branch leads to at most one "then" block followed by a Flow block, guaranteeing single-entry single-exit regions.
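As a concrete illustration of this ordering, here is a minimal Python sketch (our own helper over a toy successor-map CFG, not decompiled code) that computes reverse post-order and then reverses it to obtain the bottom-to-top processing order:

```python
def rpo(cfg, entry):
    """Reverse post-order over a dict-of-successors CFG."""
    order, seen = [], set()
    def dfs(b):
        seen.add(b)
        for s in cfg[b]:
            if s not in seen:
                dfs(s)
        order.append(b)          # post-order append
    dfs(entry)
    order.reverse()              # reverse post-order: entry first
    return order

cfg = {"Entry": ["Then", "Else"], "Then": ["Merge"],
       "Else": ["Merge"], "Merge": ["Exit"], "Exit": []}
order = rpo(cfg, "Entry")
print(order)                     # Entry first, Exit last
print(list(reversed(order)))     # bottom-to-top processing order
```

In the reversed order, Merge and Exit are seen before Entry, so inner/later regions are structurized before the regions that contain them.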

Top-Level Algorithm: sub_35CC920

This is the complete algorithm for the main pass body, including the Flow block insertion logic interleaved with the classification phases already described above.

sub_35CC920(pass, function):
    // ---- Phase 1: Analysis setup ----
    div_info    = getAnalysis<DivergenceAnalysis>(function) + 200
    loop_info   = getAnalysis<LoopInfo>(function) + 200
    dom_tree    = getAnalysis<DominatorTree>(function) + 200
    post_dom    = getAnalysis<PostDominatorTree>(function) + 200
    pass[65]    = div_info
    pass[66]    = loop_info
    pass[67]    = NULL          // region_head
    pass[68]    = NULL          // region_tail
    pass[69]    = dom_tree
    pass[70]    = post_dom

    // Compute RPO via sub_2EA7130 -> sub_2EA7B20
    rpo_list = computeRPO(function)

    // Cross-reference RPO with SCC decomposition (sub_357E170)
    scc_order = buildSCCOrdering(rpo_list)

    // ---- Phase 1b: Reject irreducible ----
    if sub_35CA2C0(scc_order, dom_tree) == 1:   // irreducible detected
        sub_35CA580(pass, "UnsupportedIrreducibleCFG",
                    "Irreducible CFGs are not supported yet.")
        return 0

    // ---- Phase 1c: Initialize bitvector ----
    bb_count = countBasicBlocks(function)
    word_count = (bb_count + 63) >> 6
    bitvector = allocate(word_count * 8)
    memset(bitvector, 0, word_count * 8)
    pass[91] = bitvector   // at offset +728

    // ---- Phase 2: Bottom-up region identification and Flow insertion ----
    modified = false
    order = reverse(scc_order)      // process bottom-to-top

    for each BB in order:
        // 2a. Reject EH funclets
        if *(BB + 235) != 0:       // isEHFunclet flag
            sub_35CA580(pass, "UnsupportedEHFunclets",
                        "EH Funclets are not supported yet.")
            resetBitvector(pass)
            return 0

        // 2b. Already marked for structurization (from prior inner-region pass)
        if *(BB + 216) != 0 or *(BB + 262) != 0:
            sub_35CBCD0(pass, BB, context)
            continue

        // 2c. Detect back-edges to already-visited blocks (loop detection)
        has_loop_backedge = false
        for each successor S of BB:
            if bitvectorTest(pass[91], S->ordinal):
                has_loop_backedge = true

        // 2d. Classify predecessors for divergence
        needs_structurize = false
        for each predecessor P of BB:
            if sub_35CB4A0(pass, P, ...) == 1:  // divergent branch
                needs_structurize = true
                break

        // 2e. Structurize the region rooted at BB
        if needs_structurize:
            sub_35CBCD0(pass, BB, context)      // collect region bounds

            // If region bounds are valid, insert Flow blocks
            head = pass[67]
            tail = pass[68]
            if head != NULL and tail != NULL:
                modified |= insertFlowBlocks(pass, head, tail, function)

        // 2f. Update bitvector
        if needs_structurize:
            bitvectorSet(pass[91], BB->ordinal)
        else:
            bitvectorClear(pass[91], BB->ordinal)

    // ---- Phase 3: Domtree-guided outer-region finalization ----
    if pass[67] != NULL and pass[68] != NULL:
        current = pass[67]    // split_point
        while current != NULL:
            if strategy->shouldSplit(current):            // vtable+312
                sub_35CBCD0(pass, current, context)
                modified |= insertFlowBlocks(pass, pass[67], pass[68], function)

            if strategy->shouldSplitChild(current):       // vtable+320
                // recurse into child regions
                modified |= insertFlowBlocksForChildren(pass, current, function)

            current = domtreeParent(dom_tree, current)

        // Store reconvergence metadata for PTX emission
        *(function + 672) = pass[67]    // reconvergence head
        *(function + 680) = pass[68]    // reconvergence tail

    // ---- Phase 4: Cleanup ----
    free(scc_order)
    free(bitvector)
    return modified ? 1 : 0

Flow Block Insertion Detail: insertFlowBlocks

This function (inlined within the Phase 2/Phase 3 loops of sub_35CC920, approximately decompiled lines 980--2027) performs the actual CFG transformation for a single region.

insertFlowBlocks(pass, head, tail, function):
    // Step 1: Validate region boundaries via dominator/post-dominator trees
    if not dominates(pass[69], head, tail):
        return false    // head does not dominate tail => not a valid region
    if not postDominates(pass[70], tail, head):
        return false    // tail does not post-dominate head => not a valid region

    // Step 2: Classify edges leaving the tail block
    external_edges = []     // edges pointing outside the region
    internal_edges = []     // edges pointing back inside the region

    for each successor S of tail:
        if not dominatedBy(S, head) or S == tail:
            external_edges.append((tail, S))
        else:
            internal_edges.append((tail, S))

    // Step 3: Query strategy object for each edge
    for each edge E in (external_edges + internal_edges):
        classification = strategy->classifyEdge(E)   // vtable+344
        if classification == SKIP:
            continue
        // else: edge needs restructuring

    // Step 4: Create the Flow block
    //   sub_2E7AAE0 = BasicBlock::Create(context, name_hint, function)
    flow_bb = sub_2E7AAE0(function->getContext(), "Flow", function)

    //   sub_2E33BD0 = insert into function's BB list after tail
    sub_2E33BD0(flow_bb, tail->getNextNode())

    // Step 5: Build PHI node in the Flow block
    //   The PHI encodes "which path did threads arrive from?"
    //   Convention: true (i1 1) = came from the "then" body
    //              false (i1 0) = skipped the body (fell through)
    phi = createPHINode(Type::i1, flow_bb)
    phi.addIncoming(ConstantInt::getTrue(),  body_block)    // threads that executed body
    phi.addIncoming(ConstantInt::getFalse(), head_block)    // threads that skipped body

    // Step 6: Create conditional branch in the Flow block
    //   Branch on PHI: true -> next_region_or_exit, false -> next_flow_or_exit
    createCondBranch(flow_bb, phi, next_target_true, next_target_false)

    // Step 7: Reroute original edges through the Flow block
    //   For each predecessor that previously branched to the original merge:
    for each edge (P, original_merge) that should go through flow_bb:
        // sub_2E337A0 = replaceAllUsesWith for the branch target
        P->getTerminator()->replaceSuccessor(original_merge, flow_bb)

    // Step 8: Copy PHI entries from original merge to Flow block
    //   If the original merge had PHI nodes, their incoming values from
    //   rerouted predecessors must be transferred.
    for each phi_node in original_merge->phis():
        value = phi_node->getIncomingValueForBlock(rerouted_pred)
        // sub_2E33140 = addIncoming to new PHI at flow_bb
        // sub_2E341F0 = removeIncomingValue from original PHI
        flow_bb_phi.addIncoming(value, rerouted_pred)
        phi_node.removeIncomingBlock(rerouted_pred)
        phi_node.addIncoming(flow_bb_phi, flow_bb)

    // Step 9: Update dominator tree
    //   The new Flow block is immediately dominated by head.
    //   It immediately dominates the original merge (if flow_bb is its
    //   only predecessor now).
    dom_tree->addNewBlock(flow_bb, head)

    // Step 10: Update divergence analysis
    //   sub_35C9CD0 = edge reroute handler
    for each rerouted_edge:
        sub_35C9CD0(pass, rerouted_edge)
        strategy->updateDivergence(rerouted_edge)   // vtable+368

    // Step 11: Recursive child-split (if needed)
    //   The strategy may determine that the Flow block itself needs
    //   further splitting (deeply nested divergent regions).
    if strategy->shouldSplitChild(flow_bb):         // vtable+320
        child_flow = sub_2E7AAE0(function->getContext(), "Flow", function)
        sub_2E33BD0(child_flow, flow_bb->getNextNode())
        // ... repeat Steps 5-10 for the child Flow block ...
        // This recursion terminates when shouldSplitChild returns false.

    // Step 12: Expand bitvector if function grew
    new_bb_count = countBasicBlocks(function)
    if new_bb_count > pass[bb_count_field]:
        sub_C8D5F0(pass[91], new_bb_count)    // SmallVector::grow
        // Initialize new words to 0xFF...FF (conservatively "visited")
        // Then clear trailing bits beyond actual block count

    return true

PHI Network Construction for Nested Regions

When multiple Flow blocks are created for a chain of if-then-else regions, the PHI networks form a cascade. Each Flow block's PHI determines whether threads should enter the next body or skip to the subsequent Flow block.

Consider a three-way branch (implemented as nested if-then-else):

Before:                          After:
    Entry                            Entry
    / | \                            |
   A  B  C                          cond_A?
    \ | /                           / T   F
    Merge                          A      |
                                   |      |
                                  Flow1   |
                                  / F  T  |
                                 |   cond_B?
                                 |   / T   F
                                 |  B      |
                                 |  |      |
                                 | Flow2   |
                                 | / F  T  |
                                 ||   C    |
                                 ||   |    |
                                 || Flow3  |
                                 | \ | /   |
                                  Merge----+

The PHI cascade at each Flow block:

Flow1:
    %path_A = phi i1 [ true, %A ], [ false, %Entry ]
    br i1 %path_A, <continue to cond_B>, <skip to Merge via Flow3>

Flow2:
    %path_B = phi i1 [ true, %B ], [ false, %Flow1 ]
    br i1 %path_B, <continue to C>, <skip to Merge via Flow3>

Flow3:
    %path_C = phi i1 [ true, %C ], [ false, %Flow2 ]
    br i1 %path_C, <Merge>, <Merge>
    // Flow3's branch is unconditional to Merge (both sides converge)
    // but the PHI values propagated through the chain ensure each
    // thread sees the correct value at Merge's PHI nodes.

Each Flow block carries exactly one i1 PHI and one conditional branch. The chain length equals the number of divergent exits from the region minus one. The final Flow block has an unconditional branch (or a branch where both targets are the same) because all paths must converge at the region exit.
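The path-selection semantics of the cascade can be sanity-checked with a short Python model (ours, purely illustrative; branch polarities are simplified to the "executed body" convention described above): every thread executes exactly one body, and all threads converge at Merge.

```python
def route(cond_a, cond_b):
    """Model one thread through the structurized three-way branch."""
    executed = []
    if cond_a:
        executed.append("A")
        # Flow1 PHI records true -> this thread skips ahead toward Merge
    else:
        # Flow1 PHI records false -> continue to the cond_B test
        if cond_b:
            executed.append("B")
            # Flow2 PHI records true -> skip ahead toward Merge
        else:
            executed.append("C")
    # Flow3: all paths converge here, then fall through to Merge
    return executed

# Every combination of conditions executes exactly one body.
for ca in (True, False):
    for cb in (True, False):
        assert len(route(ca, cb)) == 1
print("all paths converge at Merge")
```

The PHI values threaded through the chain play the role of the booleans here: they record which body ran so Merge's own PHIs select the right incoming values.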

Loop Flow Block Insertion

For divergent loops, Flow blocks serve double duty: they both gate the loop body and control the back-edge. The algorithm handles loops specially:

insertLoopFlowBlock(pass, header, latch, exit, function):
    // The loop has structure: header -> body -> latch -> {header, exit}
    // After structurization:
    //   header -> body -> FlowLoop -> {header (back-edge), exit}

    // Step 1: Create FlowLoop block between latch and exit
    flow_loop = sub_2E7AAE0(context, "Flow", function)
    sub_2E33BD0(flow_loop, latch->getNextNode())

    // Step 2: PHI in FlowLoop encodes continue/break decision
    //   Convention: true = exit the loop, false = take back-edge
    //   This is INVERTED from what you might expect.
    //   Rationale: the "default" path (false) continues the loop,
    //   and the "exception" path (true) exits. This matches
    //   upstream LLVM's structurization invariant and simplifies
    //   the PHI lowering in CSSA.
    phi_loop = createPHINode(Type::i1, flow_loop)
    phi_loop.addIncoming(ConstantInt::getTrue(),  exit_pred)   // threads exiting
    phi_loop.addIncoming(ConstantInt::getFalse(), body_block)  // threads continuing

    // Step 3: Conditional branch
    createCondBranch(flow_loop, phi_loop, exit, header)
    // true -> exit, false -> header (back-edge)

    // Step 4: Reroute latch
    latch->getTerminator()->replaceSuccessor(header, flow_loop)
    latch->getTerminator()->replaceSuccessor(exit, flow_loop)

    // Step 5: Update loop info
    //   FlowLoop is inside the loop (it has the back-edge to header).
    //   LoopInfo must be updated so that FlowLoop is recognized as
    //   a loop block, otherwise subsequent passes (LICM, LSR) may
    //   misclassify it.
    loop_info->addBlockToLoop(flow_loop, loop)

    // Step 6: Domtree update
    //   FlowLoop is dominated by latch (or by header if the latch
    //   was the only block between header and exit).
    dom_tree->addNewBlock(flow_loop, latch)

The inverted convention (true = break) is critical. It ensures that the "natural" loop iteration (the common case) follows the fall-through path, which maps to the hardware's predicted branch direction. The PTX assembler uses this hint to generate the @p bra instruction with the back-edge as the taken path, minimizing branch misprediction overhead on the GPU.

Irreducible CFG Rejection: Why FixIrreducible is Not Scheduled

The pass rejects irreducible CFGs rather than attempting to restructure them. This section documents the design rationale and the consequences.

What Makes a CFG Irreducible

A CFG is irreducible if it contains a cycle with multiple entry points -- that is, there exist two blocks A and B in the cycle such that neither dominates the other, yet both can be reached from outside the cycle. The classic example is a goto into the middle of a loop:

Irreducible:
    Entry
    / \
   v   v
   A -> B
   ^   /
    \ v
     C

Both A and B are reachable from Entry, and both are in the cycle A->B->C->A.
Neither A dominates B nor B dominates A.

In a reducible CFG, every retreating edge is a true back-edge: its target dominates its source. This is the invariant that sub_35CA2C0 checks: it iterates blocks in RPO -- so any edge to an already-visited block is a retreating edge -- and verifies that each such target dominates its source via the dominator tree hash table.
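The invariant can be restated as executable pseudocode. The following Python sketch is our reconstruction (toy dominator sets instead of the binary's hash-table probing) of the check attributed to sub_35CA2C0:

```python
def is_irreducible(cfg, order, dominates):
    """True if any edge to an already-visited block has a target
    that does not dominate its source."""
    visited = set()
    for bb in order:
        visited.add(bb)
        for succ in cfg[bb]:
            # Edge to a visited block = retreating edge; in a reducible
            # CFG its target must dominate its source.
            if succ in visited and not dominates(succ, bb):
                return True
    return False

# The goto-into-a-loop example above: Entry -> {A, B}, A -> B -> C -> A.
cfg = {"Entry": ["A", "B"], "A": ["B"], "B": ["C"], "C": ["A"]}
dom = {"Entry": {"Entry"}, "A": {"Entry", "A"},
       "B": {"Entry", "B"}, "C": {"Entry", "B", "C"}}
print(is_irreducible(cfg, ["Entry", "A", "B", "C"],
                     lambda a, b: a in dom[b]))   # True: C->A, A !dom C

# A natural loop (Body -> Header back-edge) stays reducible.
loop = {"Entry": ["Header"], "Header": ["Body", "Exit"],
        "Body": ["Header"], "Exit": []}
ldom = {"Entry": {"Entry"}, "Header": {"Entry", "Header"},
        "Body": {"Entry", "Header", "Body"},
        "Exit": {"Entry", "Header", "Exit"}}
print(is_irreducible(loop, ["Entry", "Header", "Body", "Exit"],
                     lambda a, b: a in ldom[b]))  # False
```

In the irreducible example, the edge C -> A fails the check because A is not on every path from Entry to C.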

The FixIrreducible Pass Exists But Is Not Used

CICC v13.0 links FixIrreduciblePass at sub_29D33E0 (registered as "fix-irreducible" at pipeline-parser index 239). Its core implementation at sub_29D3E80 (60KB) performs controlled node splitting: it duplicates blocks to create a single-entry version of each irreducible cycle. This is the standard compiler technique (T1-T2 node splitting from Hecht and Ullman).

However, the NVPTX pipeline in CICC v13.0 does not schedule FixIrreduciblePass before StructurizeCFG. The pipeline ordering is:

... -> SimplifyCFG -> Sink -> StructurizeCFG -> CSSA -> ISel -> ...
                              ^
                              |
                     fix-irreducible is NOT here

Design Rationale

Three factors explain this decision:

  1. CUDA source language guarantee. Well-formed CUDA C++ does not produce irreducible control flow. The language has no goto across loop boundaries (the EDG frontend rejects it), and structured constructs (if/for/while/do/switch) always produce reducible CFGs. The only way to get irreducible flow is through extreme goto abuse in C mode or through a buggy optimization pass that introduces one.

  2. Code size explosion. Node splitting can exponentially increase code size in pathological cases. For a cycle with N entry points, splitting may duplicate up to 2^N blocks. On a GPU where register pressure is the primary performance limiter, this expansion would be catastrophic -- more blocks means more live ranges, more register pressure, and lower occupancy.

  3. Correctness risk. FixIrreduciblePass transforms the CFG before divergence analysis has finalized. If the splitting creates new blocks with divergent branches, those branches would need re-analysis. The interaction between FixIrreducible, DivergenceAnalysis, and StructurizeCFG is not validated in the NVPTX pipeline.

Consequence: Silent Miscompilation Risk

When sub_35CA2C0 detects irreducibility, it emits a diagnostic remark:

remark: UnsupportedIrreducibleCFG
        "Irreducible CFGs are not supported yet."

The pass then returns 0 (no modification). The function proceeds through the rest of the pipeline with its irreducible CFG intact. Downstream, one of two things happens:

  1. ptxas rejects the PTX. If the irreducible pattern produces a branch target that violates PTX's structured control flow rules, ptxas will emit an error. This is the safe outcome.

  2. ptxas silently accepts malformed PTX. If the irreducible pattern happens to look like valid PTX (perhaps it only involves uniform branches), the resulting code may execute with undefined reconvergence behavior. Threads may reconverge at the wrong point, producing silent data corruption. This is the dangerous outcome.

The Stock LLVM Version Has the Same Limitation

The stock LLVM StructurizeCFG at sub_1F0EBC0 (linked from llvm/lib/Transforms/Scalar/StructurizeCFG.cpp) contains identical rejection logic. The AMDGPU backend, which also requires structured control flow, schedules FixIrreduciblePass explicitly before StructurizeCFG. NVIDIA chose not to do this.

Instance         Address                     Size     Irreducible handling
NVPTX custom     sub_35CC920                 95 KB    Reject with diagnostic
Stock LLVM       sub_1F0EBC0                 ~58 KB   Reject with diagnostic
FixIrreducible   sub_29D33E0 / sub_29D3E80   60 KB    Node splitting (not scheduled)

The Stock StructurizeCFG Entry Block Handling

The stock LLVM version also includes explicit entry block handling at sub_1A74020 (13KB). When the function's entry block has predecessors (which can happen if the function is a loop body extracted by a prior pass), this function creates a new entry block named "entry" and renames the original to "entry.orig". The NVPTX version at sub_35CC920 handles this inline in Phase 1.

PTX Structured Control Flow Contract

This section documents the precise contract that StructurizeCFG must satisfy for downstream passes to emit correct PTX.

What "Structured" Means for PTX

After StructurizeCFG completes, the function's CFG must satisfy these five invariants:

  1. Single-entry regions. Every natural loop has exactly one entry (the loop header dominates all loop blocks). No irreducible cycles exist.

  2. Post-dominator reconvergence. For every divergent conditional branch at block B, there exists a block P that post-dominates B and dominates all merge points of the two branch targets. A Flow block is inserted at P if one does not already exist.

  3. Linear Flow chain. Between any divergent branch and its reconvergence point, the CFG forms a chain of Flow blocks with single-entry single-exit semantics. Each Flow block has exactly two predecessors (the "then" body exit and the "skip" path) and two successors (the next body entry or the final merge).

  4. PHI-encodable path selection. Every Flow block contains an i1 PHI that encodes which path was taken. This PHI is the sole branch condition of the Flow block's terminator. No other computation occurs in Flow blocks.

  5. Metadata tagging. Uniform branches are tagged with !structurizecfg.uniform metadata (metadata kind registered at sub_298D780). This prevents CSSA from inserting unnecessary copies at reconvergence points for branches where all threads agree.
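Invariants 3 and 4 are mechanical enough to check. Here is an illustrative Python verifier over a toy block representation (the data model is ours, invented for the sketch, not CICC's):

```python
def is_wellformed_flow_block(bb):
    """Invariants 3/4: exactly two predecessors, and nothing in the block
    except one i1 PHI that is the sole condition of the terminator."""
    if len(bb["preds"]) != 2:
        return False
    if len(bb["insts"]) != 2:          # exactly [phi, conditional branch]
        return False
    phi, term = bb["insts"]
    return (phi["op"] == "phi" and phi["type"] == "i1"
            and term["op"] == "br" and term["cond"] is phi)

phi = {"op": "phi", "type": "i1"}
flow = {"preds": ["Then", "Entry"],
        "insts": [phi, {"op": "br", "cond": phi}]}
print(is_wellformed_flow_block(flow))   # True for a well-formed Flow block

# Any extra computation inside the block violates invariant 4.
bad = {"preds": ["Then", "Entry"],
       "insts": [phi, {"op": "add"}, {"op": "br", "cond": phi}]}
print(is_wellformed_flow_block(bad))    # False
```

A real verifier would also confirm that the PHI's incoming blocks match the predecessor list, but the shape above is the core of the contract.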

Downstream Consumer: CSSA

The CSSA pass (sub_3720740) consumes the structured CFG and inserts explicit copy instructions at every reconvergence point. It relies on:

  • The Flow block chain to identify where reconvergence happens.
  • The i1 PHI in each Flow block to determine which threads took which path.
  • The !structurizecfg.uniform metadata to skip copy insertion for uniform regions.

Without StructurizeCFG, CSSA would not know where to insert copies, and the resulting register allocation would be unsound under warp divergence.

Downstream Consumer: Convergence Control in AsmPrinter

The reconvergence head/tail stored at function offsets +672 and +680 are consumed by the AsmPrinter's convergence control framework (see AsmPrinter). The AsmPrinter emits CONVERGENCECTRL_ENTRY (opcode 24) and CONVERGENCECTRL_LOOP (opcode 33) pseudo-instructions at the boundaries defined by these metadata values. The hardware uses these to program the convergence barrier stack.

Interaction with SIAnnotateControlFlow (AMDGPU Comparison)

AMDGPU uses a different approach: SIAnnotateControlFlow inserts explicit if/else/end_cf intrinsics after StructurizeCFG. NVPTX does not use this -- instead, the convergence information flows through:

  1. StructurizeCFG (Flow blocks + function metadata)
  2. CSSA (copy insertion at reconvergence)
  3. SelectionDAG / ISel (structured branch patterns)
  4. AsmPrinter (convergence pseudo-instructions)

This four-stage pipeline is NVIDIA-specific. Upstream LLVM for AMDGPU collapses stages 1-2 into StructurizeCFG + SIAnnotateControlFlow and has no equivalent of stage 4.

The Two Binary Instances

CICC v13.0 contains two complete copies of the StructurizeCFG pass because the binary links both the NVPTX backend (custom) and the generic LLVM Scalar library (stock). Only the NVPTX version is scheduled in the pipeline.

                      NVPTX Custom                                              Stock LLVM
Main body             sub_35CC920 (95 KB)                                       sub_1F0EBC0 (~58 KB)
Entry gate            sub_35CF930                                               (inlined)
Region processing     sub_35CBCD0                                               sub_1A761E0 (28 KB)
Entry block handler   (inlined in Phase 1)                                      sub_1A74020 (13 KB, strings "entry.orig", "entry")
Region-based          Operates on entire function                               Operates on individual Region objects
Uniform metadata      sub_298D780 ("structurizecfg.uniform")                    Same string, different address
Registration          sub_29882C0 ("Structurize the CFG")                       sub_2988270 ("Structurize control flow")
Pipeline parser       Index 413: "structurizecfg" with skip-uniform-regions param   Same index, same params

The NVPTX version is 37 KB larger because it inlines the entry-block handler and region-processing logic (avoiding virtual dispatch overhead) and adds the CUDA-specific attribute checks (IDs 56, 63, 59, 64, 57) and the convergence metadata writes at offsets +672/+680.

Bitvector Tracking for Region Membership

The pass tracks which basic blocks have been visited using a dynamically sized bitvector stored in the pass object:

Field                 Offset       Meaning
uint64_t *array       pass + 728   Pointer to the word array
uint64_t word_count   pass + 736   Current number of 64-bit words
uint64_t capacity     pass + 740   Allocated capacity in words
uint64_t bb_count     pass + 792   Total number of basic blocks

Index computation for a block with ordinal idx:

word_offset = idx >> 6;          // idx / 64
bit_mask    = 1ULL << (idx & 63); // idx % 64

// Test
is_visited = (array[word_offset] & bit_mask) != 0;

// Set
array[word_offset] |= bit_mask;

// Clear
array[word_offset] &= ~bit_mask;

When new basic blocks are created during structurization (the function grows), the bitvector is expanded via sub_C8D5F0 (the SmallVector::grow equivalent). New words are initialized to 0xFFFFFFFFFFFFFFFF (all bits set = "visited"), then trailing bits beyond the actual block count are cleared. This ensures newly created blocks are conservatively marked as visited until explicitly processed.
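A Python sketch of this grow discipline (our reconstruction; class and field names are invented) -- note that, per the description above, only whole new words are filled with ones, and the trailing bits past the block count are then cleared:

```python
class VisitedBits:
    """Toy model of the pass's visited-block bitvector."""
    WORD_ONES = (1 << 64) - 1

    def __init__(self, bb_count):
        self.bb_count = bb_count
        self.words = [0] * ((bb_count + 63) >> 6)

    def test(self, idx):
        return (self.words[idx >> 6] >> (idx & 63)) & 1

    def set(self, idx):
        self.words[idx >> 6] |= 1 << (idx & 63)

    def grow(self, new_count):
        new_words = (new_count + 63) >> 6
        # New words start all-ones: blocks are conservatively "visited".
        while len(self.words) < new_words:
            self.words.append(self.WORD_ONES)
        # Clear trailing bits beyond the actual block count.
        extra = new_words * 64 - new_count
        if extra:
            self.words[-1] &= self.WORD_ONES >> extra
        self.bb_count = new_count

bv = VisitedBits(70)
bv.set(5)
bv.grow(130)   # appends word 2 (blocks 128..191), masks bits past 129
print(bv.test(5), bv.test(128), bv.test(129), bv.test(131))  # 1 1 1 0
```

Blocks 128 and 129 land in the freshly appended all-ones word and read as visited; bit 131 was cleared by the trailing mask.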

Hash Table Implementation

The pass uses two DenseSet-style hash tables with LLVM-layer sentinels (-4096 / -8192); see Hash Table and Collection Infrastructure for the hash function, probing, and growth policy. The resize function for this pass is sub_2E61F50. Table v394 tracks BBs already processed during the BFS expansion, and v417 serves as a scratch set for child-split deduplication.

Comparison with Upstream LLVM StructurizeCFG

The NVIDIA version and upstream LLVM share the same fundamental algorithm. Both are derived from the same codebase (confirmed by identical diagnostic strings and strategy-object vtable layouts). The differences are:

Architectural differences

Aspect             NVIDIA (sub_35CC920)                                              Upstream LLVM
Granularity        Operates on entire function, iterating blocks in SCC/RPO order    Operates on individual Region objects, one region per invocation
Region discovery   Inline SCC decomposition + domtree walk                           Relies on RegionInfo analysis pass
Object layout      Pass fields at a1[65..91]; BB flags at +216, +235, +262           Different offsets reflecting different BasicBlock subclass
SCC ordering       sub_357E170 computes RPO/SCC cross-product                        Uses scc_iterator from llvm/ADT/SCCIterator.h
Strategy object    Queried via vtable+312/320/344/368                                Uses TargetTransformInfo for cost decisions

Functional differences

  1. Irreducibility handling. Both reject irreducible CFGs with the same diagnostic. Neither performs restructuring. Upstream LLVM relies on FixIrreduciblePass being scheduled separately (AMDGPU does this). NVIDIA does not schedule it.

  2. EH funclet handling. Both reject funclets. The NVIDIA version checks BB+235 (a wider BasicBlock struct with CUDA-specific fields). Upstream checks via isa<FuncletPadInst>.

  3. Uniform region skipping. Both support structurizecfg-skip-uniform-regions. The NVIDIA version integrates DivergenceAnalysis queries inline (sub_2E88A90, sub_2E8B090). Upstream uses UniformityInfo::isUniform(BranchInst*).

  4. Metadata tagging. Both use the "structurizecfg.uniform" metadata kind to mark branches that have been classified as uniform, preventing re-analysis in nested region processing.

  5. Zero-cost hoisting. Upstream LLVM (recent versions) includes hoistZeroCostElseBlockPhiValues to reduce VGPR pressure from structurization-induced phi nodes. The NVIDIA version may or may not include this optimization; the decompiled code at the corresponding offset shows similar phi-manipulation logic but uses different register-pressure heuristics.

  6. Reconvergence metadata. The NVIDIA version writes reconvergence head/tail to function metadata at offsets +672 and +680. This is consumed by downstream PTX emission passes (AsmPrinter, convergence barrier insertion). Upstream LLVM has no equivalent because AMDGPU uses SIAnnotateControlFlow instead.

What NVIDIA did NOT change

The core structurization algorithm is identical: topological ordering of region nodes, iterative flow-block insertion, PHI-node reconstruction via SSAUpdater, and domtree maintenance. The strategy-object interface (shouldSplit, shouldSplitChild, classifyEdge, updateDivergence) has the same vtable layout in both versions. The FlowBlock naming convention ("Flow") is preserved.

Pipeline Position

StructurizeCFG runs late in the NVPTX backend pipeline, after most IR-level optimizations and before machine code generation:

... -> SimplifyCFG -> Sink -> StructurizeCFG -> CSSA -> ISel -> ...

It must run after divergence analysis (so it can query which branches are uniform) and before instruction selection (which assumes structured control flow). The CSSA (Convergent SSA) pass that follows converts phi nodes to respect warp divergence semantics at the reconvergence points that StructurizeCFG inserted.

Summary of Pass Decisions

Input condition                                  Action        Diagnostic
Single-block function                            Skip          None
Function with convergent/optnone attributes      Skip          None
enable-shrink-wrap = 2                           Skip          None
Strategy object declines                         Skip          None
All-uniform branches (with skip-uniform knob)    Skip          None
Irreducible CFG detected                         Reject        "UnsupportedIrreducibleCFG"
EH funclet block detected                        Reject        "UnsupportedEHFunclets"
Reducible, divergent regions                     Restructure   None (new Flow blocks inserted, edges rerouted)

Common Pitfalls

These are mistakes a reimplementor is likely to make when building an equivalent CFG structurization pass for a GPU target.

1. Attempting to restructure irreducible CFGs instead of rejecting them. The LLVM codebase includes FixIrreduciblePass (sub_29D33E0) which performs T1-T2 node splitting, but NVIDIA deliberately does not schedule it before StructurizeCFG. A reimplementation that adds node splitting to "handle" irreducible CFGs risks exponential code size blowup (2^N blocks for N entry points), catastrophic register pressure increases from the duplicated live ranges, and untested interaction with divergence analysis. The correct approach for an NVPTX target is to reject irreducible CFGs with a diagnostic and rely on the CUDA language guarantee that well-formed source never produces them.

2. Forgetting to update LoopInfo when inserting Flow blocks inside loops. When insertLoopFlowBlock creates a new block between the latch and the exit, that block carries the back-edge to the header and is therefore inside the loop. If LoopInfo is not updated (loop_info->addBlockToLoop), subsequent passes (LICM, LSR, LoopUnroll) will not recognize the Flow block as a loop member and may hoist or sink code across it incorrectly. This is a silent miscompilation: the kernel produces wrong results only for inputs that exercise the divergent loop path.

3. Inverting the Flow block PHI convention. The pass uses true = exit loop (break) and false = continue loop (back-edge) for loop Flow blocks. This is counterintuitive -- most programmers expect true to mean "condition is met, continue." Reversing this convention causes the back-edge to be the taken path for true, which not only produces wrong control flow but also defeats the branch prediction hint that maps the fall-through (false) path to the common-case loop continuation. A reimplementation must match the exact convention documented in the upstream LLVM structurization invariant.

4. Not writing reconvergence metadata to function offsets +672/+680. The AsmPrinter's convergence control framework reads the head and tail stored at these offsets to emit CONVERGENCECTRL_ENTRY and CONVERGENCECTRL_LOOP pseudo-instructions. A reimplementation that structures the CFG correctly but does not write these metadata values will cause the AsmPrinter to emit PTX without convergence barriers. On architectures with hardware convergence tracking (SM 7.0+), this can lead to threads reconverging at incorrect points, producing silent data corruption.

5. Skipping structurization for regions where all branches appear uniform but sub-regions contain divergent branches. The structurizecfg-relaxed-uniform-regions knob allows skipping outer regions when they have at most one conditional direct child. A reimplementation that skips any region marked "uniform" without checking sub-region divergence will fail to insert Flow blocks for inner divergent branches, leaving the PTX with unstructured control flow that ptxas may reject or (worse) silently miscompile.
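Pitfall 3's convention can be sanity-checked with a tiny simulation. This is illustrative Rust, not recovered code: with the correct convention (true = break) a trip-limit-3 loop runs three iterations, while the inverted convention exits on the first.

```rust
// Model of a structurized loop whose Flow block predicate decides
// break (true) vs back-edge (false). `invert_convention` simulates
// a reimplementation that got the polarity backwards.
fn run_loop(trip_limit: u32, invert_convention: bool) -> u32 {
    let mut iterations = 0u32;
    loop {
        iterations += 1;
        // The structurized exit predicate: true means "exit the loop".
        let done = iterations >= trip_limit;
        // Correct convention: take the back-edge while `done` is false.
        // Inverted convention: take the back-edge when `done` is true.
        let take_backedge = if invert_convention { done } else { !done };
        if !take_backedge {
            break;
        }
        if iterations > 1000 { break; } // safety net for the buggy variant
    }
    iterations
}
```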
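Pitfall 5's distinction between a region's own branches and its sub-regions can be modeled with a toy region tree; the names here are illustrative, not recovered symbols:

```rust
// A region is "skippable" only if neither it nor any nested
// sub-region contains a divergent branch.
struct Region {
    has_divergent_branch: bool,
    children: Vec<Region>,
}

// Buggy gate: inspects only the region's own branches.
fn naive_skip(r: &Region) -> bool {
    !r.has_divergent_branch
}

// Correct gate: recurses into sub-regions before skipping.
fn safe_skip(r: &Region) -> bool {
    !r.has_divergent_branch && r.children.iter().all(safe_skip)
}
```

An outer region whose direct branches are uniform but which wraps a divergent inner region passes the naive gate and fails the safe one; the naive reimplementation would leave the inner branch unstructured.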

Cross-References

  • CSSA -- the Conventional SSA pass that consumes Flow blocks to insert warp-safe copies
  • AsmPrinter -- convergence control pseudo-instruction emission consuming the +672/+680 metadata
  • GPU Execution Model -- warp divergence and reconvergence fundamentals
  • Branch Folding -- may eliminate redundant Flow blocks after code generation
  • Hash Infrastructure -- details on the DenseSet implementation used by the BB tracking tables
  • Pipeline -- exact position of structurizecfg in the pass ordering
  • Knobs -- structurizecfg-skip-uniform-regions, structurizecfg-relaxed-uniform-regions, enable-shrink-wrap
  • Upstream LLVM source: llvm/lib/Transforms/Scalar/StructurizeCFG.cpp

Differences from Upstream LLVM

Aspect | Upstream LLVM (AMDGPU) | CICC v13.0 (NVPTX)
Binary copies | Single StructurizeCFG in LLVM Scalar library | Two copies: NVPTX-specific at sub_35CC920 (95 KB) and stock LLVM/AMDGPU at sub_1F0EBC0; only the NVPTX instance is scheduled
Divergence query | Queries AMDGPU divergence analysis | Queries NVPTX warp divergence analysis; uniform-branch skip via the structurizecfg-skip-uniform-regions knob
Flow block metadata | Flow blocks inserted without convergence metadata | Inserts convergence control metadata at offsets +672/+680, consumed by the AsmPrinter for warp reconvergence pseudo-instructions
Relaxed uniform regions | Not present | structurizecfg-relaxed-uniform-regions knob allows less aggressive structurization when all branches in a region are provably uniform
Irreducible CFG handling | Attempts T1/T2 node-folding reduction | Rejects with the NVPTX-specific diagnostic "UnsupportedIrreducibleCFG"; GPU code with an irreducible CFG is a hard error
Skip conditions | Skip for single-block functions | Extended skip: single-block, convergent/optnone attributes, enable-shrink-wrap = 2, and strategy-object decline
Mandatory status | Required for AMDGPU but can be skipped via flag | Mandatory for PTX emission: registered as a required late pass by both sub_29882C0 and sub_1A6D600

Machine-Level Passes

Machine-level passes in CICC v13.0 operate on MachineFunction / MachineBasicBlock / MachineInstr representations after SelectionDAG instruction selection has converted LLVM IR into target-specific pseudo-instructions. On a conventional CPU target, these passes ultimately produce native machine code; on NVPTX, they produce PTX assembly -- a virtual ISA with unlimited virtual registers and a structured instruction set. This distinction is fundamental: NVPTX's "machine code" still uses virtual registers (%r0, %f1, %p3), and the final PTX text is consumed by ptxas which performs the actual register allocation against the hardware register file. The machine-level passes in CICC therefore serve a different purpose than on CPU: they optimize register pressure (to maximize occupancy), structure control flow (PTX requires structured CFG), compute .local memory frame layouts, and prepare clean PTX for ptxas to finish.

Pass pipeline parser (MF) | sub_235E150 (53 KB)
Master pass registry | sub_2342890 (102 KB)
Codegen pass config | ctor_335_0 at 0x507310 (88 strings)
NVPTX target pass config | ctor_358_0 at 0x50E8D0 (43 strings)
Total registered MF passes | 51 (stock LLVM) + 13 (NVIDIA custom)
Total MF analyses | 14 registered
Pipeline configuration | sub_2166D20 (addISelPasses), sub_2166ED0 (addPreRegAlloc), sub_21668D0 (addPostRegAlloc)

Why Machine Passes Matter on GPU

In upstream LLVM for x86 or AArch64, the machine pass pipeline assigns physical registers, inserts spill code, schedules instructions for pipeline hazards, and emits relocatable object code. On NVPTX, none of this maps directly:

  1. No physical register file. PTX registers are virtual. The greedy register allocator in CICC does not assign physical registers -- it tracks register pressure per class and enforces the -maxreg limit (default 70) that controls SM occupancy. When the allocator "spills," it moves values to .local memory rather than to stack slots addressed by %rsp.

  2. No prolog/epilog in the traditional sense. There is no call stack with push/pop sequences. PrologEpilogInserter in CICC computes .local frame offsets for spilled virtual registers and inserts ld.local/st.local pairs.

  3. Structured control flow is mandatory. PTX requires structured control flow (bra, @p bra, bra.uni). The StructurizeCFG pass runs before instruction selection, and BranchFolding must preserve the structured property.

  4. Instruction scheduling targets ptxas, not hardware. Machine scheduling optimizes the instruction stream that ptxas will consume. Since ptxas performs its own scheduling against the actual hardware pipeline, CICC's scheduling focuses on register pressure reduction (nvptx-sched4reg) and exposing parallelism that ptxas can exploit.

  5. Two peephole levels. CICC runs both the stock LLVM PeepholeOptimizer (operates on generic MachineInstr patterns) and the NVIDIA-specific NVPTXPeephole (sub_21DB090) which handles PTX-specific patterns like redundant cvta instructions, predicate folding, and address space conversions.
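The occupancy stakes behind the -maxreg limit in point 1 can be made concrete with back-of-envelope arithmetic. The 65,536-register file per SM is the Turing/Ampere figure; the 8-register allocation granule and the 1,024-thread cap are simplifying assumptions for this sketch:

```rust
// Rough occupancy model: how many threads fit on one SM at a given
// per-thread register budget. Thread counts are rounded down to
// whole warps and capped at the (assumed) hardware thread limit.
fn max_threads_per_sm(regs_per_thread: u32) -> u32 {
    const REG_FILE: u32 = 65_536;   // 32-bit registers per SM (Turing/Ampere)
    const GRANULE: u32 = 8;         // assumed register allocation granularity
    const WARP: u32 = 32;
    const HW_THREAD_CAP: u32 = 1_024; // assumed per-SM thread limit
    let rounded = (regs_per_thread + GRANULE - 1) / GRANULE * GRANULE;
    let threads = REG_FILE / rounded;
    (threads / WARP * WARP).min(HW_THREAD_CAP)
}
```

Under these assumptions, the default budget of 70 registers (rounded to 72) supports 896 resident threads, whereas a 32-register budget saturates the thread cap -- which is why the allocator "spills" to .local memory rather than exceed the limit.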

Pipeline Flow

SelectionDAG ISel
    │
    ▼
FinalizeISel ─── expand pseudo-instructions from ISel
    │
    ▼
┌─────────────────────────────────────┐
│  Pre-RA Optimization                │
│  ┌─ EarlyTailDuplicate             │
│  ├─ EarlyMachineLICM               │
│  ├─ MachineCSE (RP-aware)          │
│  ├─ MachineSink (gated by knob)    │
│  ├─ PeepholeOptimizer              │
│  ├─ NVPTXPeephole             ★    │
│  ├─ DeadMachineInstrElim           │
│  └─ MachineCopyPropagation         │
└─────────────────────────────────────┘
    │
    ▼
TwoAddressInstruction ─── convert 3-addr to 2-addr form
    │
    ▼
PHIElimination (CSSA/deSSA) ─── lower MachineInstr PHIs to copies
    │
    ▼
┌─────────────────────────────────────┐
│  Register Allocation                │
│  ┌─ LiveIntervals + SlotIndexes    │
│  ├─ RegisterCoalescing             │
│  ├─ RAGreedy (pressure-driven)     │
│  ├─ NVPTXBlockRemat           ★    │
│  └─ StackSlotColoring              │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Post-RA Optimization               │
│  ┌─ ExpandPostRAPseudos            │
│  ├─ MachineLICM (post-RA)          │
│  ├─ MachineSink (post-RA, gated)   │
│  ├─ MachineCopyPropagation         │
│  ├─ BranchFolding / TailMerge      │
│  ├─ MachineBlockPlacement          │
│  └─ MachinePipeliner (SMS)         │
└─────────────────────────────────────┘
    │
    ▼
PrologEpilogInserter ─── .local frame layout
    │
    ▼
MachineOutliner ─── OUTLINED_FUNCTION_ stub creation
    │
    ▼
NVPTXProxyRegErasure ★ ─── remove redundant cvta.to.local
    │
    ▼
AsmPrinter ─── PTX text emission

Passes marked with ★ are NVIDIA-custom. The exact ordering varies by optimization level; at -O0, most pre-RA and post-RA optimization passes are skipped and RegAllocFast replaces RAGreedy.

Pipeline Configuration Functions

The NVPTX backend configures the machine pass pipeline through three key functions:

sub_2166D20 -- addISelPasses(): Configures passes before instruction selection. Diagnostic string: "\n\n*** Final LLVM Code input to ISel ***\n". Adds: alloca hoisting, ISel DAG printer (conditional), NVPTXProxyRegErasure, NVPTXLowerArgs, NVPTX-specific ISel.

sub_2166ED0 -- addPreRegAlloc(): Configures machine passes before register allocation. Diagnostic strings: "After Pre-RegAlloc TailDuplicate", "After codegen DCE pass", "After Machine LICM, CSE and Sinking passes", "After codegen peephole optimization pass". Adds: TailDuplicate, codegen DCE, Machine LICM + CSE + Sinking (conditional on byte_4FD1980, byte_4FD18A0, byte_4FD1A60), codegen peephole.

sub_21668D0 -- addPostRegAlloc(): Configures post-register-allocation passes. Diagnostic strings: "After Machine Scheduling", "After StackSlotColoring". Adds: Machine scheduling (2 modes controlled by dword_4FD26A0 -- value 1 selects simple scheduling, otherwise full pipeline), Stack slot coloring, nvptx-mem2reg (conditional on byte_4FD25C0).

Machine Pass Inventory

NVIDIA-Custom Machine Passes

Pass ID | Class / Address | Pipeline Position | Description
nvptx-peephole | sub_21DB090 | Pre-RA | PTX-specific peephole: folds redundant address space conversions (cvta), optimizes predicate patterns, simplifies PTX-specific instruction sequences. Controlled by enable-nvvm-peephole (default: on).
nvptx-remat-block | sub_217DBF0 | During RA | Machine-level block rematerialization. Iterative "pull-in" algorithm that recomputes values near their use rather than loading from spill slots. Two-phase candidate selection with a "second-chance" heuristic. See Rematerialization.
machine-rpa | sub_21EAA00 | Analysis (pre-RA) | Machine Register Pressure Analysis. Provides per-basic-block pressure data consumed by MachineCSE, scheduling, and rematerialization.
extra-machineinstr-printer | sub_21E9E80 | Diagnostic | Prints per-function register pressure statistics. Debug-only pass for tuning pressure heuristics.
nvptx-mem2reg | sub_21F9920 | Post-RA | Machine-level mem2reg: promotes .local memory loads/stores back to virtual registers when profitable. Conditional on byte_4FD25C0 (nv-disable-mem2reg inverts). Scheduled from addPostRegAlloc().
ldgxform | sub_21F2780 | Pre-RA | Transforms qualifying global memory loads into ld.global.nc (LDG -- load through the read-only data cache). Splits wide vector loads for hardware constraints.
nvptx-prolog-epilog | sub_21DB5F0 | Post-RA | NVPTX-specific PrologEpilog pass. Works alongside or replaces the stock PEI to handle PTX frame semantics, where there is no traditional stack pointer.
nvptx-proxy-reg-erasure | sub_21DA810 | Late post-RA | Removes redundant cvta.to.local instructions left by address space lowering.
nvptx-assign-valid-global-names | sub_21BCD80 | Pre-emission | Sanitizes symbol names to comply with PTX naming rules (no @, $, or other characters illegal in PTX identifiers).
nvptx-replace-image-handles | sub_21DBEA0 | Pre-emission | Replaces IR-level texture/surface handle references with PTX-level .tex / .surf declarations.
nvptx-image-optimizer | sub_21BCF10 | Pre-emission | Texture/surface instruction optimization: coalesces related texture operations, validates image type consistency for tex, suld, sust, suq.
alloca-hoisting | sub_21BC7D0 | Early post-ISel | Hoists alloca instructions to the entry basic block, enabling the frame layout pass to assign fixed offsets.
generic-to-nvvm | sub_215DC20 | Early post-ISel | Converts generic address space (0) references to global address space (1). Runs before instruction selection on some pipelines, but is also present as a machine-level fixup.
param-opt | sub_2203290 | Post-ISel | Optimizes ld.param instructions. NVIDIA-custom pass for parameter load coalescing and redundant parameter load elimination.
nvptx-trunc-opts | sub_22058E0 | Post-ISel | Optimizes redundant ANDb16ri instructions generated during i16 truncation patterns [sic: the binary string reads "instrunctions"].
redundant-move-elim | sub_2204E60 | Post-ISel | Removes redundant register-to-register moves left by instruction selection.

Stock LLVM Machine Passes (NVPTX Configuration)

Pass ID | Class | NVIDIA Modification | Notes
finalize-isel | FinalizeISelPass | None | Expands ISel pseudo-instructions; mandatory first MF pass.
early-tailduplication | EarlyTailDuplicatePass | None | Pre-RA tail duplication. Can be disabled via disable-early-taildup.
early-machinelicm | EarlyMachineLICMPass | Gated | Controlled by enable-mlicm. Hoists loop-invariant machine instructions before register allocation.
machine-cse | MachineCSEPass | Modified | NVIDIA adds register-pressure-aware CSE (rp-aware-mcse, pred-aware-mcse, copy-prop-mcse). Uses MRPA (sub_2E5A4E0) for incremental pressure tracking. See Instruction Scheduling.
machine-sink | MachineSinkingPass | Gated | Disabled by default on NVPTX; enabled via nvptx-enable-machine-sink. When active, sinks instructions closer to uses to reduce register pressure.
peephole-opt | PeepholeOptimizerPass | None | Stock LLVM peephole: folds redundant copies, simplifies compare-and-branch patterns, optimizes sub-register operations. Can be disabled via disable-peephole.
dead-mi-elimination | DeadMachineInstrElimPass | None | Eliminates dead machine instructions. Can be disabled via disable-machine-dce.
machine-cp | MachineCopyPropagationPass | None | Propagates copies to reduce move instructions. Can be disabled via disable-copyprop.
machinelicm | MachineLICMPass | Gated | Post-RA variant. Controlled by disable-postra-machine-licm. NVIDIA adds sink-insts-to-avoid-spills to trade hoisting for spill reduction.
two-address-instruction | TwoAddressInstructionPass | None (stock) | Converts three-address instructions to two-address form by inserting copies. sub_1F53550 (79 KB, 2470 lines). Shared between cicc and libNVVM (twin at sub_F4EA80).
phi-node-elimination | PHIEliminationPass | Modified | NVIDIA's CSSA/deSSA method selection via usedessa (default 2). Controls how machine-level PHI nodes are lowered to copies; affects register allocation quality. See cssa-coalesce, cssa-verbosity.
register-coalescer | RegisterCoalescerPass | Custom NVPTX variant | The NVPTX backend has its own register coalescing framework at 0x349--0x34B (separate from LLVM's stock coalescer at 0xB40000). Uses interference oracle sub_349D6E0, open-addressing hash with (reg >> 9) ^ (reg >> 4). See Register Coalescing.
greedy | RAGreedyPass | Modified | Pressure-driven rather than assignment-driven. Dual instances (legacy + new PM). Core at sub_2F49070 (82 KB). See Register Allocation.
stack-coloring | StackColoringPass | None | Colors stack slots to reduce .local memory usage by sharing slots with non-overlapping lifetimes.
stack-slot-coloring | StackSlotColoringPass | None | Secondary stack slot optimization. Can be disabled via disable-ssc.
post-ra-pseudos | ExpandPostRAPseudosPass | None | Expands post-RA pseudo-instructions (e.g., COPY to actual move).
post-RA-sched | PostRASchedulerPass | Gated | Post-RA instruction scheduling. Controlled by disable-post-ra.
machine-scheduler | MachineSchedulerPass | Modified | NVIDIA adds nvptx-sched4reg mode for register-pressure-driven scheduling. Pre-RA scheduling variant.
postmisched | PostMachineSchedulerPass | None | Post-RA machine scheduling with ScheduleDAGMILive (sub_355F610, 64 KB). Controlled by misched-postra.
early-ifcvt | EarlyIfConverterPass | None | If-conversion before register allocation. Can be disabled via disable-early-ifcvt.
machine-combiner | MachineCombinerPass | None | Combines machine instructions using target-defined patterns. Knob: machine-combiner-inc-threshold.
block-placement | MachineBlockPlacement | None (stock) | Profile-guided basic block ordering. sub_3521FF0 (82 KB). Uses ext-TSP and chain-based algorithms. See Block Placement.
machine-outliner | MachineOutliner | None | Creates OUTLINED_FUNCTION_ stubs for repeated instruction sequences. sub_3537010 (77 KB). See MachineOutliner.
prologepilog | PrologEpilogInserter | Modified | NVIDIA's PEI (sub_35B1110, 68 KB) computes .local memory frame offsets. Frame objects are 40-byte records with offset, size, alignment, and spill-slot flags. See PrologEpilogInserter.
opt-phis | OptimizePHIsPass | None | Optimizes machine-level PHI nodes (removes trivially dead or redundant PHIs).
tailduplication | TailDuplicatePass | None | Post-RA tail duplication. Controlled by disable-tail-duplicate.
detect-dead-lanes | DetectDeadLanesPass | None | Detects unused sub-register lanes; minimal impact on NVPTX since register classes are fully disjoint.
rename-independent-subregs | RenameIndependentSubregsPass | None | Splits sub-register live ranges into independent virtual registers.
localstackalloc | LocalStackSlotAllocationPass | None | Allocates local frame indices for large stack objects.
machine-latecleanup | MachineLateInstrsCleanupPass | None | Late-stage dead instruction cleanup.
machine-pipeliner | MachinePipeliner | None (stock) | Swing Modulo Scheduling for loop bodies. sub_3563190 (58 KB). See below.

Per-Pass Algorithm Descriptions

NVPTXPeephole (sub_21DB090) -- PTX-Specific Peephole Optimizer

Registration: sub_21DB090 at 0x21DB090, pass ID "nvptx-peephole". Enabled by default; controlled by enable-nvvm-peephole.

This pass runs pre-RA and performs pattern-matching rewrites on MachineInstr sequences that are specific to the NVPTX target. Unlike the stock LLVM PeepholeOptimizer (which operates on generic copy/compare patterns), NVPTXPeephole handles PTX address space semantics and predicate register idioms.

Patterns handled:

  1. Redundant cvta elimination. When address space lowering inserts cvta.to.global or cvta.to.shared followed by an operation that already operates in the correct address space, the cvta is dead. The pass scans for cvta instructions whose result is used only by instructions with matching address space qualifiers, and deletes the cvta.

  2. Predicate folding. PTX predicates (%p0, %p1, ...) are first-class. The pass identifies patterns where a setp instruction produces a predicate that is consumed by exactly one @p bra and folds them into a conditional branch with embedded comparison.

  3. Address space conversion simplification. When generic-to-nvvm inserts addrspacecast and the consuming instruction directly emits the correct address qualifier (.global, .shared, .local, .const), the intermediate cast is redundant.

// Pseudocode: NVPTXPeephole main loop
fn nvptx_peephole(MF: &mut MachineFunction) -> bool {
    let mut changed = false;
    for mbb in MF.basic_blocks() {
        let mut dead_list = vec![];
        for mi in mbb.instrs() {
            match mi.opcode() {
                NVPTX::CVTAToGeneric | NVPTX::CVTAToGlobal
                | NVPTX::CVTAToShared | NVPTX::CVTAToLocal => {
                    if single_user_in_matching_addrspace(mi) {
                        propagate_operand_and_kill(mi);
                        dead_list.push(mi);
                        changed = true;
                    }
                }
                NVPTX::SETP_* => {
                    if let Some(bra) = single_predicate_consumer(mi) {
                        fold_setp_into_branch(mi, bra);
                        dead_list.push(mi);
                        changed = true;
                    }
                }
                _ => {}
            }
        }
        for mi in dead_list { mi.erase_from_parent(); }
    }
    changed
}

NVPTXBlockRemat (sub_217DBF0) -- Machine-Level Block Rematerialization

Registration: sub_217DBF0 at 0x217DBF0, pass name "NVPTX Specific Block Remat", pass ID "nvptx-remat-block". Knob constructor at ctor_361_0 (0x5108E0). Main engine: sub_2186D90 (47KB, ~1742 decompiled lines).

This is NVIDIA's custom register-pressure-reduction pass. It re-computes values at their use sites instead of keeping them live across long spans. The algorithm is iterative with a two-phase candidate selection including a "second-chance" heuristic for marginal candidates.

Knobs (16 total):

Global Variable | CLI Flag | Default | Description
dword_4FD3820 | nv-remat-block | 14 | Bitmask controlling remat modes (bits 0-3)
dword_4FD3740 | nv-remat-max-times | 10 | Max iterations of the outer remat loop
dword_4FD3660 | nv-remat-block-single-cost | 10 | Max cost per single live value pull-in
dword_4FD3580 | nv-remat-block-map-size-limit | 6 | Map size limit for single pull-in
dword_4FD3040 | nv-remat-block-max-cost | 100 | Max total clone cost per live value reduction
dword_4FD3120 | nv-remat-block-liveout-min-percentage | 70 | Min liveout % for special consideration
unk_4FD3400 | nv-remat-block-loop-cost-factor | 20 | Loop cost multiplier
unk_4FD3320 | nv-remat-default-max-reg | 70 | Default max register pressure target
unk_4FD2EC0 | nv-remat-block-load-cost | 10 | Cost assigned to load instructions
unk_4FD3860 | nv-remat-threshold-for-spec-reg | 20 | Threshold for special register remat
byte_4FD2E80 | nv-dump-remat-block | off | Debug dump toggle
byte_4FD2DA0 | nv-remat-check-internal-live | off | Check internal liveness during MaxLive
qword_4FD2C20 | max-reg-kind | 0 | Kind of max register pressure info
qword_4FD2BE0 | no-mi-remat | (list) | Skip remat for named functions
word_4FD32F0 | load-remat | on | Enable load rematerialization
word_4FD3210 | vasp-fix1 | off | VASP fix (volatile/addsp)

Algorithm pseudocode (sub_2186D90):

fn nvptx_block_remat(MF: &mut MachineFunction) -> bool {
    // (A) INITIALIZATION
    let target = max_reg_override.unwrap_or(nv_remat_default_max_reg);  // default 70
    if MF.block_count() == 1 { return false; }
    if function_name in no_mi_remat_list {
        log("Skip machine-instruction rematerialization on {name}");
        return false;
    }

    // (B) LIVEOUT FREQUENCY COUNTING
    for bb in MF.blocks() {
        for reg in bb.live_out() {
            freq_map[reg] += 1;
        }
    }
    // Normalize: freq_pct = (100 * count) / num_blocks

    // (C) OUTER ITERATIVE LOOP
    let mut iteration = 0;
    let mut overall_changed = false;
    loop {
        iteration += 1;
        if iteration > nv_remat_max_times { break; }  // default 10

        // Phase 1: COMPUTE MAX-LIVE
        let max_live = sub_2186590(MF);  // scan all blocks
        log("Max-Live-Function({num_blocks}) = {max_live}");
        if target >= max_live { break; }  // no pressure problem

        let mut changed = false;
        // Phase 2: FOR EACH OVER-PRESSURE BLOCK
        for bb in blocks_where(pressure > target) {
            let excess = bb.pressure - target;

            // Phase 3: CLASSIFY LIVE-OUT REGISTERS
            let (pullable, non_pullable) = classify_liveout(bb);
            // sub_217E810 (MULTIDEF check) -- must have single unique def
            // sub_2181550 (recursive pullability, depth <= 50)
            log("Pullable: {pullable.len()}");

            // Phase 4: SECOND-CHANCE HEURISTIC (sub_2181870)
            if excess > pullable.len() && second_chance_list.not_empty() {
                second_chance_promote(&mut pullable, &mut non_pullable);
                // Re-evaluates rejected candidates with relaxed criteria
                // Uses visit-count mechanism to prevent infinite loops
                // Hash: h(regID) = 37 * regID, open-addressing
                log("ADD {n} candidates from second-chance");
            }

            log("Total Pullable before considering cost: {pullable.len()}");

            // Phase 5: COST ANALYSIS (sub_2183E30)
            let candidates = pullable.filter_map(|reg| {
                let cost = compute_remat_cost(reg);  // 0 = cannot remat
                (cost > 0).then(|| (reg, cost))
            });

            // Phase 6: SELECT BY COST-BENEFIT (cheapest first)
            candidates.sort_by_key(|(_, cost)| *cost);  // selection sort
            let mut final_list = vec![];
            let mut pulled = 0;
            for (reg, cost) in candidates {
                if cost > nv_remat_block_single_cost { break; } // default 10
                // Double-wide registers (class size > 32) count as 2 toward pressure
                let width = if reg_class_size(reg) > 32 { 2 } else { 1 };
                final_list.push(reg);
                pulled += width;
                if pulled >= excess { break; }
            }

            log("Really Final Pull-in: {final_list.len()} ({total_cost})");

            // Phase 7: EXECUTE REMATERIALIZATION
            for reg in &final_list {
                clear_from_liveout(bb, reg);            // sub_217F620
            }
            bb.pressure -= pulled;
            propagate_backward(bb, &final_list);         // sub_2185250
            // Clone defining instructions at use sites
            // sub_21810D0 replaces register references
            changed = true;
        }

        overall_changed |= changed;
        if !changed { break; }
    }

    // (D) DEAD INSTRUCTION REMOVAL -- cascading deletion
    remove_dead_instructions();  // sub_217DA10
    overall_changed
}

MULTIDEF detection (sub_217E810): Returns the defining instruction if the register has exactly one non-dead, non-debug definition. Rejects instructions with hazardous descriptor flags (desc->flags & 0x3F80), opcodes in the non-rematerializable set (memory ops 534-609, texture ops 680-681, atomics 817-832, barriers 2913-2918, surface ops 3281-3287, 3449-3454, large MMA blocks 4423-4447), and instructions with tied extra defs.

Recursive pullability (sub_2181550): Walks the operand chain up to depth 50, checking each operand register against the non-pullable set and the MULTIDEF oracle. All operands in the chain must be single-def, safe-opcode, and themselves pullable.
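A minimal sketch of this depth-limited walk, assuming a map from each single-def register to the registers its defining instruction reads; all names here are illustrative, not recovered symbols. A register absent from the map stands in for one that fails the MULTIDEF check:

```rust
use std::collections::{HashMap, HashSet};

// A register is pullable if it passes MULTIDEF (exactly one safe definition,
// modeled by presence in `defs`), is not in the non-pullable set, and every
// operand of its defining instruction is transitively pullable, to depth 50.
fn pullable(
    reg: u32,
    defs: &HashMap<u32, Vec<u32>>,
    non_pullable: &HashSet<u32>,
    depth: u32,
) -> bool {
    if depth > 50 || non_pullable.contains(&reg) {
        return false;
    }
    match defs.get(&reg) {
        None => false, // fails the single-definition (MULTIDEF) oracle
        Some(operands) => operands
            .iter()
            .all(|&op| pullable(op, defs, non_pullable, depth + 1)),
    }
}
```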

Cost model: sub_2183E30 computes the clone cost of rematerializing a register. Load instructions cost nv-remat-block-load-cost (default 10). Instructions in loops are penalized by nv-remat-block-loop-cost-factor (default 20x). Double-wide registers (class size > 32) count as 2 for pressure and have 2x cost.
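The documented cost inputs compose as in this hedged sketch: only the knob defaults (load cost 10, loop factor 20, 2x for double-wide classes) are recovered; the base cost of 1 for non-load instructions and the single application of the loop factor are assumptions.

```rust
// Illustrative remat-cost shape for one candidate register's
// defining instruction.
struct RematCand {
    is_load: bool,
    loop_depth: u32,
    class_bits: u32, // register class width in bits
}

fn remat_cost(c: &RematCand) -> u32 {
    const LOAD_COST: u32 = 10;   // nv-remat-block-load-cost default
    const LOOP_FACTOR: u32 = 20; // nv-remat-block-loop-cost-factor default
    let mut cost = if c.is_load { LOAD_COST } else { 1 }; // base cost assumed
    if c.loop_depth > 0 {
        cost *= LOOP_FACTOR; // penalize cloning into loop bodies
    }
    if c.class_bits > 32 {
        cost *= 2; // double-wide registers cost (and count) double
    }
    cost
}
```

Under this model a double-wide load inside a loop costs 400, far beyond the nv-remat-block-single-cost cutoff of 10, so in practice only cheap, loop-free candidates survive Phase 6's selection.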

Machine Register Pressure Analysis (sub_21EAA00) -- MRPA

Registration: sub_21EAA00 at 0x21EAA00, pass name "Register pressure analysis on Machine IRs", pass ID "machine-rpa". Main analysis body: sub_21EEB40 (68KB). Incremental updater: sub_2E5A4E0 (48KB). Backend variant: sub_1E00370 (78KB).

MRPA is NVIDIA's custom analysis pass that provides per-basic-block register pressure data. Unlike LLVM's stock RegisterPressure tracking (which is tightly coupled to the scheduler), MRPA is consumed by multiple clients: RP-aware MachineCSE, instruction scheduling, and the block rematerialization pass.

Architecture:

The MRPA system has two modes:

  1. Full recomputation (sub_21EEB40): Walks every instruction in every basic block, tracking register births (defs) and deaths (last uses), recording the peak pressure per register class per block.
  2. Incremental update (sub_2E5A4E0): When a single instruction is moved or deleted (e.g., by MachineCSE), MRPA updates the affected blocks' pressure without rescanning the entire function.

Incremental update algorithm (sub_2E5A4E0):

fn mrpa_incremental_update(context, bb, instruction_delta) {
    // DenseMap hash: (ptr >> 9) ^ (ptr >> 4)
    // Empty sentinel: -8, Tombstone: -16
    // Minimum 64 buckets, always power-of-2

    // 1. Build worklist of affected BBs via DFS
    let worklist = dfs_from(bb, context.visited_set);

    // 2. For each BB: create/update tracking entry
    for bb in worklist {
        let entry = context.pressure_map.get_or_insert(bb);

        // 3. Filter schedulable instructions via sub_2E501D0
        for mi in bb.instrs().filter(schedulable) {
            // 4. For each virtual register operand (40-byte entries):
            for operand in mi.operands() {
                sub_2EBEF70(operand);  // find existing rename mapping
                sub_2EBEE10(operand);  // query register info
                sub_2EBE820(operand);  // attempt rename if profitable
                sub_2EBF120(operand);  // free old register after rename
            }
            // 5. Check register class constraints via sub_E922F0
            // 6. Validate pressure feasibility via sub_2E4F9C0
        }
        // 7. Erase unprofitable instructions via sub_2E88E20
    }
}
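The recovered hash and sentinel constants correspond to an open-addressing probe of the following shape. This is an illustrative sketch only: table growth and tombstone-aware duplicate handling are omitted.

```rust
// Sentinels recovered from the DenseMap in sub_2E5A4E0.
const EMPTY: u64 = -8i64 as u64;
const TOMBSTONE: u64 = -16i64 as u64;

struct DenseSet {
    buckets: Vec<u64>, // keys only; EMPTY/TOMBSTONE are reserved values
}

impl DenseSet {
    fn new() -> Self {
        // Minimum 64 buckets, always a power of two.
        DenseSet { buckets: vec![EMPTY; 64] }
    }

    fn hash(k: u64) -> u64 {
        (k >> 9) ^ (k >> 4)
    }

    fn insert(&mut self, k: u64) {
        let mask = self.buckets.len() as u64 - 1;
        let mut i = (Self::hash(k) & mask) as usize;
        loop {
            match self.buckets[i] {
                x if x == EMPTY || x == TOMBSTONE => {
                    self.buckets[i] = k;
                    return;
                }
                x if x == k => return, // already present
                _ => i = (i + 1) & mask as usize, // linear probe
            }
        }
    }

    fn contains(&self, k: u64) -> bool {
        let mask = self.buckets.len() as u64 - 1;
        let mut i = (Self::hash(k) & mask) as usize;
        loop {
            match self.buckets[i] {
                x if x == k => return true,
                x if x == EMPTY => return false, // stop at empty; skip tombstones
                _ => i = (i + 1) & mask as usize,
            }
        }
    }
}
```

The power-of-two size is what makes the `& mask` reduction valid, and the distinct tombstone value lets lookups probe past deleted slots without terminating early.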

Verification: When verify-update-mcse is enabled (qword_501F8A8, default OFF), MRPA runs a full recomputation after every incremental update and compares results. Mismatch triggers: "Incorrect RP info from incremental MRPA update" via sub_C64ED0. The print-verify knob (qword_501F7C8) controls whether detailed per-register-class diagnostic output is printed on mismatch.

Diagnostic output (sub_21E9A60): The companion pass extra-machineinstr-printer at sub_21E9E80 prints: "Max Live RRegs: {n}\tPRegs: {m}\nFunction Size: {s}" for each function, providing per-function register pressure statistics for tuning.

LDG Transform (sub_21F2780) -- Read-Only Data Cache Load Transformation

Registration: sub_21F2780 at 0x21F2780, pass name "Ldg Transformation", pass ID "ldgxform". Transformation body: sub_21F2C80 (19KB). Vector splitting engine: sub_21F3A20 (44KB).

This pass transforms qualifying global memory loads into ld.global.nc (LDG) instructions, routing them through the read-only texture cache (L1 on Kepler+, unified L1/tex on Maxwell+). The transformation is profitable for read-only data because the texture cache has separate bandwidth from the L1 data cache, effectively doubling memory throughput for qualifying loads.

Algorithm:

fn ldgxform(MF: &mut MachineFunction) -> bool {
    let mut changed = false;
    for mi in MF.all_instrs() {
        if !is_global_load(mi) { continue; }
        if is_volatile(mi) { continue; }
        if !pointer_is_readonly(mi.address_operand()) { continue; }

        // Replace ld.global with ld.global.nc (LDG)
        mi.set_opcode(ldg_variant(mi.opcode()));

        // Split wide loads if necessary
        if load_width(mi) > hardware_max_ldg_width() {
            // sub_21F2C80: LDG split transformation
            // Tags: ".ldgsplit", ".load", ".ldgsplitinsert"
            let (lo, hi) = split_wide_load(mi);
            // Insert: lo = ldg.64 [addr]
            //         hi = ldg.64 [addr + 8]
            //         result = INSERT_SUBREG lo, hi
            changed = true;
        }
        changed = true;
    }
    changed
}

Vector splitting (sub_21F3A20, 44KB): This is the third-largest function in the 0x21F range. NVPTX supports limited native vector widths (typically .v2 and .v4 of 32-bit elements). When wider vectors (e.g., v8f32, v16f16) appear, this engine splits them into legal widths. Operations handled:

  • vecBitCast: bitcast between vector types
  • splitVec: split a vector into sub-vectors
  • extractSplitVec / insertSplitVec: element access on split vectors
  • splitVecGEP: GEP computation on split vector elements

The split width depends on TargetOpt.HasLDG (stored at target options offset 5, extracted from p2h-01 analysis). When LDG is available, 128-bit loads (LDG.128) are preferred, resulting in .v4.b32 patterns.
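The width legalization can be sketched as greedy chunking into legal lane counts. The exact piece-selection policy here is an assumption; only the .v2/.v4 limits and the LDG.128 preference come from the text above.

```rust
// Split an N-lane vector of 32-bit elements into legal PTX pieces.
// With LDG available, 128-bit (.v4.b32) pieces are preferred; without
// it, this sketch caps pieces at 64 bits (.v2.b32).
fn split_widths(num_lanes: u32, has_ldg: bool) -> Vec<u32> {
    let max = if has_ldg { 4 } else { 2 }; // lanes per piece
    let mut remaining = num_lanes;
    let mut pieces = Vec::new();
    while remaining > 0 {
        // Largest power-of-two piece that fits, capped at the legal max.
        let mut w = max;
        while w > remaining {
            w /= 2;
        }
        pieces.push(w);
        remaining -= w;
    }
    pieces
}
```

Under this policy a v8f32 becomes two .v4 pieces when LDG is available and four .v2 pieces otherwise, matching the .v4.b32 pattern described above.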

NVPTXMem2Reg (sub_21F9920) -- Machine-Level Mem2Reg

Registration: sub_21F9920 at 0x21F9920, pass name "Mem2Reg on Machine Instructions to remove local stack objects", pass ID "nvptx-mem2reg". Main body: sub_21FA880 (22KB), engine: sub_21FC920 (33KB). Controlled by byte_4FD25C0 (inverted by nv-disable-mem2reg, default: enabled).

Standard LLVM mem2reg operates on LLVM IR alloca instructions. This NVIDIA-custom pass operates on MachineInstr -- specifically on ld.local / st.local pairs that access __local_depot frame slots. After register allocation, some values that were spilled to .local memory can be promoted back to virtual registers if their access pattern is simple enough (single def, multiple uses, no aliasing stores).

Algorithm:

fn nvptx_machine_mem2reg(MF: &mut MachineFunction) -> bool {
    if nv_disable_mem2reg { return false; }  // byte_4FD25C0

    let mut changed = false;
    for frame_idx in MF.frame_info().stack_objects() {
        if !is_local_depot_slot(frame_idx) { continue; }
        // Collect all loads and stores to this frame slot
        let stores = find_stores_to(MF, frame_idx);
        let loads = find_loads_from(MF, frame_idx);

        if stores.len() != 1 { continue; }  // must be single-def
        let store = stores[0];
        let src_reg = store.source_register();

        // Check: no aliasing stores between def and uses
        // Check: store dominates all loads
        if !dominates_all(store, &loads) { continue; }

        // Promote: replace all ld.local with the source register
        for load in &loads {
            replace_load_with_reg(load, src_reg);
            load.erase_from_parent();
        }
        store.erase_from_parent();
        MF.frame_info().remove_object(frame_idx);
        changed = true;
    }
    changed
}

This pass is positioned in addPostRegAlloc(), meaning it runs after the greedy register allocator has already assigned physical registers and spill slots. It acts as a cleanup: register allocation may have conservatively spilled values that turn out to be unnecessary once coalescing and copy propagation have eliminated the intermediate uses.

GenericToNVVM (sub_215DC20) -- Address Space Normalization

Registration: sub_215DC20 at 0x215DC20, pass name "Ensure that the global variables are in the global address space", pass ID "generic-to-nvvm". Pass descriptor: 80-byte allocation. Factory: sub_215D530 (allocates 320-byte state with two 128-bucket DenseMaps). New PM variant: sub_305ED20.

CUDA and LLVM IR use address space 0 (generic) as the default for globals, but NVPTX requires globals in address space 1. This pass rewrites every GlobalVariable in address space 0 to address space 1, inserting addrspacecast instructions at all use sites.

Algorithm:

fn generic_to_nvvm(M: &mut Module) -> bool {
    let mut gv_map = DenseMap::new(128);     // old -> new Value mapping
    let mut const_map = DenseMap::new(128);  // old -> new Constant mapping (rewritten ConstantExprs; elided below)

    for gv in M.globals().filter(|g| g.address_space() == 0) {
        // 1. Clone to address space 1
        let new_gv = GlobalVariable::new(
            gv.value_type(), gv.is_constant(), gv.linkage(),
            gv.initializer(), gv.name(), /*addrspace=*/ 1
        );
        new_gv.set_alignment(gv.alignment());

        // 2. Insert addrspacecast(1 -> 0) at each use
        let cast = ConstantExpr::addrspace_cast(new_gv, gv.type());

        // 3. Replace all uses
        gv.replace_all_uses_with(cast);

        // 4. Track in map and erase original
        gv_map.insert(gv, new_gv);
        gv.erase_from_parent();
    }

    // Cleanup: sub_215D780 iterates gv_map, properly ref-counting Values
    cleanup_gv_map(&gv_map);
    !gv_map.is_empty()
}

NVPTXProxyRegErasure (sub_21DA810) -- Redundant cvta.to.local Removal

Registration: sub_21DA810 at 0x21DA810, pass name "NVPTX optimize redundant cvta.to.local instruction".

This late post-RA pass removes cvta.to.local instructions that are left over from address space lowering. After frame layout is complete, local memory addresses are known, and cvta.to.local (which converts a generic pointer to a .local pointer) is redundant when the address is already known to be in .local space. The pass is simple: scan for cvta.to.local MachineInstrs, verify the source is already a .local address, replace uses with the source operand, delete the cvta.

NVPTXAssignValidGlobalNames (sub_21BCD80) -- PTX Name Sanitization

Registration: sub_21BCD80 at 0x21BCD80, pass name "Assign valid PTX names to globals", pass ID "nvptx-assign-valid-global-names".

PTX has stricter naming rules than LLVM IR. Characters like @, $, . (in certain positions), and Unicode are illegal in PTX identifiers. This pass walks all GlobalValues in the module and replaces illegal characters with safe alternatives (typically _). It also handles name demangling artifacts and ensures the final names are unique after sanitization.
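
The sanitization described above can be sketched as follows. This is an illustrative reimplementation, not recovered code, under the simplifying assumption that only `[A-Za-z0-9_]` survive (real PTX identifier rules for `$` and `%` are more nuanced):

```rust
use std::collections::HashSet;

// Hypothetical sketch of PTX identifier sanitization: replace characters
// that are illegal in PTX names with '_' and uniquify collisions.
fn sanitize_ptx_name(name: &str, taken: &mut HashSet<String>) -> String {
    let mut out: String = name
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' })
        .collect();
    // PTX identifiers may not start with a digit.
    if out.chars().next().map_or(true, |c| c.is_ascii_digit()) {
        out.insert(0, '_');
    }
    // Uniquify after sanitization, since distinct LLVM names can collide
    // once their illegal characters are all mapped to '_'.
    let mut candidate = out.clone();
    let mut counter = 0u32;
    while !taken.insert(candidate.clone()) {
        counter += 1;
        candidate = format!("{}_{}", out, counter);
    }
    candidate
}
```

The uniquification step matters: `foo@bar` and `foo.bar` both sanitize to `foo_bar`, so the second arrival must get a fresh suffix.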

NVPTXImageOptimizer (sub_21BCF10) -- Texture/Surface Optimization

Registration: sub_21BCF10 at 0x21BCF10, pass name "NVPTX Image Optimizer". Type validation helper: sub_21DD1A0 (16KB).

This pre-emission pass optimizes texture and surface access patterns. It validates image type consistency for tex, suld, sust, and suq operations, emitting errors for mismatches: "Invalid image type in .tex", "Invalid image type in .suld", "Invalid image type in suq.", "Invalid image type in .sust". The pass coalesces related texture operations when they access the same texture handle with compatible coordinates and can be merged into wider vector fetches.

NVPTXReplaceImageHandles (sub_21DBEA0) -- Image Handle Lowering

Registration: sub_21DBEA0 at 0x21DBEA0, pass name "NVPTX Replace Image Handles".

Replaces IR-level texture/surface handle references (which are LLVM Value pointers to @texture_handle globals) with PTX-level .tex / .surf declarations and integer handle indices. This is a pre-emission pass that bridges the gap between LLVM IR's opaque handle model and PTX's explicit texture declaration model.

AllocaHoisting (sub_21BC7D0) -- Entry Block Alloca Hoisting

Registration: sub_21BC7D0 at 0x21BC7D0, pass name "Hoisting alloca instructions in non-entry blocks to the entry block", pass ID "alloca-hoisting". Registration helper: sub_21BC5A0.

PTX requires that all local memory declarations be hoisted to the function entry. This pass scans all basic blocks for alloca instructions and moves them to the entry block. This enables the frame layout pass (PrologEpilogInserter) to assign fixed offsets to all stack objects -- a requirement because PTX emits .local .align N .b8 __local_depotX[SIZE] at the function prologue and all local accesses are indexed from this single base.

ParamOpt (sub_2203290) -- Parameter Load Optimization

Registration: sub_2203290 at 0x2203290, pass name "Optimize NVPTX ld.param", pass ID "param-opt".

NVPTX-custom pass that optimizes ld.param instructions generated during kernel argument passing. When a kernel parameter is loaded multiple times (common when the same argument is used in different basic blocks), this pass eliminates redundant loads by propagating the first load's result to subsequent uses. Related knob: remat-load-param ("Support remating const ld.param that are not exposed in NVVM IR").
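
The redundancy rule can be sketched as a first-load-wins rewrite. This is a hypothetical simplification that ignores the cross-block dominance checks the real pass must perform:

```rust
use std::collections::HashMap;

// Hypothetical sketch of the param-opt rule: the first ld.param of each
// (param_index, byte_offset) slot is kept; later loads of the same slot
// are rewritten to reuse its result register. Dominance checks omitted.
fn dedup_param_loads(loads: &[(u32, u32, u32)]) -> Vec<(u32, u32)> {
    // loads: (param_index, byte_offset, dest_reg), in program order
    // returns: (redundant_dest_reg, replacement_reg) rewrite pairs
    let mut first: HashMap<(u32, u32), u32> = HashMap::new();
    let mut rewrites = Vec::new();
    for &(param, off, dst) in loads {
        match first.get(&(param, off)) {
            Some(&orig) => rewrites.push((dst, orig)),
            None => {
                first.insert((param, off), dst);
            }
        }
    }
    rewrites
}
```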

NVPTXTruncOpts (sub_22058E0) -- i16 Truncation Optimization

Registration: sub_22058E0 at 0x22058E0, pass name "Optimize redundant ANDb16ri instrunctions" [sic], pass ID "nvptx-trunc-opts".

When LLVM lowers trunc i32 to i16 operations, the NVPTX backend emits an AND.b16 with mask 0xFFFF to ensure the high bits are zero. In many cases this AND is redundant -- the producing instruction already guarantees a 16-bit result. This pass pattern-matches ANDb16ri instructions with the 0xFFFF immediate and removes them when the source provably fits in 16 bits.
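
The core redundancy test can be expressed as a known-bits query. The struct and predicate below are an illustrative sketch, not the recovered implementation:

```rust
// Hypothetical sketch of the test behind nvptx-trunc-opts: an AND with
// mask 0xFFFF is removable when the producer's known-zero bits already
// cover everything above bit 15.
#[derive(Clone, Copy)]
struct KnownBits {
    zero: u32, // bits known to be zero in the producing value
}

fn and_ffff_is_redundant(producer: KnownBits) -> bool {
    // If all bits >= 16 are known zero, masking with 0xFFFF is a no-op.
    producer.zero & 0xFFFF_0000 == 0xFFFF_0000
}
```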

RP-Aware MachineCSE (NVIDIA-Modified machine-cse)

Stock LLVM MachineCSE eliminates redundant machine instructions by matching instruction patterns within dominance regions. NVIDIA adds three extensions via ctor_302_0 (0x4FEB70, 7.8KB, 14 strings):

RP-aware CSE (rp-aware-mcse): Before eliminating a common subexpression, queries MRPA (sub_2E5A4E0) for the current register pressure. If eliminating the CSE candidate would increase pressure beyond the target (because the shared result must stay live longer), the CSE is suppressed. This prevents the classic GPU problem where CSE reduces instruction count but increases register pressure, reducing occupancy.

Predicate-aware CSE (pred-aware-mcse): Extends RP awareness to predicate registers (PTX %p class). Predicate registers are a scarce resource (maximum 7 per thread on most architectures), so predicate pressure is tracked separately from general-purpose register pressure.

Copy-prop CSE (copy-prop-mcse): Embeds copy propagation within the CSE framework. When CSE eliminates an instruction, the resulting COPY instructions can often be propagated immediately rather than waiting for the separate MachineCopyPropagation pass.

Incremental MRPA integration: The MCSE pass uses qword_501F988 (incremental-update-mcse, default ON) to incrementally update MRPA as CSE decisions are made, avoiding full recomputation per CSE candidate.
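
The rp-aware gating decision reduces to a simple profitability test. The function below is an illustrative sketch with a hypothetical one-register cost model, not the recovered logic:

```rust
// Hypothetical sketch of the rp-aware-mcse gating rule: a CSE candidate
// that extends the shared value's live range is folded only if the
// estimated pressure (queried from MRPA in the real pass) stays within
// the per-class target.
fn should_cse(current_pressure: u32, pressure_target: u32, extends_live_range: bool) -> bool {
    if !extends_live_range {
        return true; // no live-range cost: folding is a pure win
    }
    // Keeping the shared result live costs (at least) one register.
    current_pressure + 1 <= pressure_target
}
```

Predicate-aware MCSE applies the same test to the %p class, whose budget (7 predicate registers per thread on most architectures) is far tighter than the general-purpose budget.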

MachinePipeliner (SMS) Detail

The Swing Modulo Scheduler at sub_3563190 performs software pipelining -- overlapping successive loop iterations to hide latency. It operates on a single loop body at the MachineInstr level:

  1. DAG construction: builds a data dependency graph with sub_2F97F60, computes latencies via sub_3559990, adds edges via sub_3542B20.
  2. MII computation: RecMII (recurrence-based) via sub_354CBB0, ResMII (resource-based) via sub_35449F0. MII = max(RecMII, ResMII).
  3. Early exits: MII == 0 is invalid; MII > SwpMaxMii (default 27, -pipeliner-max-mii) aborts.
  4. II search: starts at MII, tries up to pipeliner-ii-search-range (default 10, qword_503E428) consecutive II values. First valid schedule wins.
  5. Schedule construction: ASAP via sub_354BFF0, ALAP via sub_354BFF0, topological sort, core SMS node placement via sub_354C3A0, then finalization.
  6. Kernel generation: Three code generation backends selected by priority -- annotation-only (pipeliner-annotate-for-testing), MVE-based (pipeliner-mve-cg, default enabled), and experimental peeling (pipeliner-experimental-cg).

The pipeliner stores its schedule context as a 616-byte (0x268) structure with four SmallVectors and per-BB data at 256-byte stride. Maximum pipeline stages: SwpMaxStages (default 3, -pipeliner-max-stages).
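
The MII computation and II search described in steps 2-4 can be sketched as a small driver. The closure stands in for the full schedule-construction attempt; names are illustrative:

```rust
// Sketch of the SMS initiation-interval search: MII = max(RecMII, ResMII);
// abort on MII == 0 ("Invalid Minimal Initiation Interval") or
// MII > max_mii ("too large"); otherwise try up to `search_range`
// consecutive II values and take the first that schedules.
fn find_ii<F>(rec_mii: u32, res_mii: u32, max_mii: u32, search_range: u32,
              mut schedule_at: F) -> Option<u32>
where
    F: FnMut(u32) -> bool,
{
    let mii = rec_mii.max(res_mii);
    if mii == 0 || mii > max_mii {
        return None;
    }
    (mii..mii + search_range).find(|&ii| schedule_at(ii))
}
```

With the documented defaults (max_mii = 27, search_range = 10), a loop whose first schedulable II exceeds MII + 9 is rejected with "Unable to find schedule".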

Core scheduling pipeline (10 sequential calls):

| Step | Function | Purpose |
|---|---|---|
| 1 | sub_35476E0 | DAG construction / dependency analysis |
| 2 | sub_35523F0 | Recurrence detection / RecMII computation |
| 3 | sub_35546F0 | Resource usage / ResMII computation |
| 4 | sub_3543340 | MII = max(RecMII, ResMII) finalization |
| 5 | sub_35630A0 | Node ordering / priority assignment |
| 6 | sub_35568E0 | Schedule table initialization |
| 7 | sub_35433F0 | Pre-scheduling transforms |
| 8 | sub_3557A10 | Instruction ordering/selection (heuristic) |
| 9 | sub_354A760 | Schedule finalization / modulo expansion |
| 10 | sub_355F610 | ScheduleDAGMILive integration (64KB) |

Instruction selection heuristic (sub_3557A10): Priority ordering: (1) deeper instructions first (offset 240 = latency/depth), (2) target priority table at a1+3944 (16-byte entries: [start, end, priority, window_width]), (3) narrower schedule windows first. Latency recomputation via sub_2F8F5D0 during comparison.

Error messages:

  • "Invalid Minimal Initiation Interval: 0" -- MII computation returned zero
  • "Minimal Initiation Interval too large: MII > SwpMaxMii. Refer to -pipeliner-max-mii." -- loop is too complex
  • "Unable to find schedule" -- no valid II found within search range
  • "No need to pipeline - no overlapped iterations in schedule." -- numStages == 0
  • "Too many stages in schedule: numStages > SwpMaxStages. Refer to -pipeliner-max-stages." -- pipeline depth exceeded

PrologEpilogInserter (sub_35B1110) -- .local Frame Layout

Address: sub_35B1110 (68KB, 2388 decompiled lines). Stack frame: 0x490 bytes of local state. This is NVIDIA's monolithic PEI for PTX. Unlike a traditional PEI that emits push/pop sequences and adjusts %rsp, this one computes .local memory frame offsets.

10-phase structure:

| Phase | Lines | Description |
|---|---|---|
| 1 | 443-490 | Target/subtarget retrieval, initial setup |
| 2 | 491-566 | Callee-saved register determination |
| 3 | 567-730 | Pre-pass: collect fixed objects from frame info |
| 4 | 733-1070 | Stack object offset assignment (main layout engine) |
| 5 | 1078-1600 | General local variable layout |
| 6 | 1688-1795 | Frame-pointer stack area |
| 7 | 1803-1872 | Prolog/epilog instruction insertion per BB |
| 8 | 1873-2132 | Scavenger / frame-index elimination |
| 9 | 2270-2304 | Stack-size warning & diagnostic reporting |
| 10 | 2305-2388 | Cleanup & deallocation |

Frame object record (40 bytes):

| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Byte offset in .local memory (assigned by PEI) |
| +8 | 8 | Object size in bytes |
| +16 | 1 | Alignment (log2) |
| +20 | 1 | isDead flag (skip if set) |
| +32 | 1 | isSpillSlot flag |
| +36 | 1 | Category byte (0/1/2/3) |

Stack layout algorithm (Phase 4):

fn assign_frame_offsets(MF: &MachineFunction, frame: &mut FrameInfo) {
    let grows_neg = frame.stack_direction == 1;
    let mut offset = frame.initial_offset;
    let mut max_align = frame.max_alignment;

    // Fixed objects first
    for obj in frame.fixed_objects() {
        if obj.is_dead { continue; }
        let align = 1 << obj.log2_align;
        offset = align_to(offset, align);
        obj.offset = if grows_neg { -offset } else { offset };
        offset += obj.size;
        max_align = max(max_align, align);
    }

    // Callee-saved register region
    for csr in frame.callee_saved_range() {
        if csr.is_dead || csr.size == -1 { continue; }
        let align = 1 << csr.log2_align;
        offset = align_to(offset, align);
        csr.offset = if grows_neg { -offset } else { offset };
        offset += csr.size;
    }

    // General locals: three category buckets, each via sub_35B0830
    for category in [1, 2, 3] {
        for obj in frame.objects_of_category(category) {
            let align = 1 << obj.log2_align;
            offset = align_to(offset, align);
            obj.offset = if grows_neg { -offset } else { offset };
            offset += obj.size;
        }
    }

    frame.stack_size = offset;
}

The final PTX emission (sub_2158E80) uses these offsets to emit: .local .align N .b8 __local_depotX[SIZE]; at the function prologue, and ld.local / st.local instructions reference [%SPL + offset] where %SPL is the local stack pointer register.
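
The alignment arithmetic driving the Phase 4 loop above can be reproduced as a runnable sketch (simplified to the grows-positive direction, with illustrative names):

```rust
// Round `offset` up to `align`, which must be a power of two.
fn align_to(offset: u64, align: u64) -> u64 {
    (offset + align - 1) & !(align - 1)
}

// Minimal sketch of the layout loop: align each object's offset,
// accumulate sizes, and return per-object offsets plus the total
// depot size emitted as __local_depotX[SIZE].
fn layout(objects: &[(u64, u64)]) -> (Vec<u64>, u64) {
    // objects: (size, align) pairs in layout order
    let mut offset = 0;
    let mut offsets = Vec::new();
    for &(size, align) in objects {
        offset = align_to(offset, align);
        offsets.push(offset);
        offset += size;
    }
    (offsets, offset)
}
```

For example, objects of (size, align) = (4,4), (1,1), (8,8) land at offsets 0, 4, and 8, for a 16-byte depot: the 8-byte object's alignment forces a 3-byte padding hole after the 1-byte object.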

ScheduleDAGMILive (sub_355F610) -- Post-RA Instruction Ordering

Address: sub_355F610 (64KB). This is the post-RA machine instruction scheduler, consuming either the pipeliner's output or standalone scheduling regions.

Data structures:

  • SUnit (Scheduling Unit): 88 bytes per instruction
  • Instruction-to-node hash map: 632-byte entries
  • RP tracking structure: 112 bytes (offsets 32-48: per-class pressure current, offsets 56-72: per-class pressure limits)

Scheduling flow:

  1. Initialize RP tracking via sub_3551AB0 (if pipeliner-register-pressure is set)
  2. Set per-class pressure defaults via sub_2F60A40
  3. Walk BB instruction list, build instruction-to-node hash map (632-byte entries)
  4. Compute ASAP via sub_354BFF0 -> earliest cycle per instruction
  5. Compute ALAP via sub_354BFF0 -> latest cycle per instruction
  6. Place instructions via sub_354C3A0 (returns success/failure)
  7. Calculate stage count: (lastCycle - firstCycle) / II
  8. Verify placement via sub_355C7C0
  9. Build stage descriptors via sub_355D7E0 (80 bytes per stage)

Machine-Level Analysis Infrastructure

Machine passes depend on a set of analysis passes that compute liveness, dominance, and frequency information over the MachineFunction representation.

| Analysis ID | Class | Description |
|---|---|---|
| slot-indexes | SlotIndexesAnalysis | Assigns a dense integer index to every instruction slot in the function. All liveness computations reference slot indexes rather than instruction pointers, enabling O(log n) interval queries. |
| live-intervals | LiveIntervalsAnalysis | Computes live ranges for every virtual register as a set of [start, end) slot-index intervals. The LiveRangeCalc engine (sub_2FC4FC0, 12.9KB) manages 296-byte segment entries with inline small-object buffers for endpoint, register mask, kill-set, and use-def chain data. See LiveRangeCalc. |
| live-reg-matrix | LiveRegMatrixAnalysis | Tracks physical register unit interference. On NVPTX, used primarily for register-class-level pressure tracking rather than physical unit assignment. |
| machine-dom-tree | MachineDominatorTreeAnalysis | Dominance tree over the MachineBasicBlock graph. Required by LICM, CSE, sinking, and register allocation. |
| machine-post-dom-tree | MachinePostDominatorTreeAnalysis | Post-dominance tree. Used by block placement (sub_3521FF0 stores at this+544). |
| machine-loops | MachineLoopAnalysis | Loop detection on the machine CFG. Used by LICM, block placement, and the pipeliner. |
| machine-block-freq | MachineBlockFrequencyAnalysis | Block frequency estimates (profile-guided or static). Block placement uses this at this+528 to drive chain construction. |
| machine-branch-prob | MachineBranchProbabilityAnalysis | Branch probability data. Block placement stores at this+536. |
| machine-trace-metrics | MachineTraceMetricsAnalysis | Trace-based metrics (critical path length, resource depth). Used by MachineCombiner and if-conversion. |
| machine-opt-remark-emitter | MachineOptRemarkEmitterAnalysis | Optimization remark emission for machine passes. |
| edge-bundles | EdgeBundlesAnalysis | Groups CFG edges into bundles for spill placement. |
| spill-code-placement | SpillPlacementAnalysis | Determines optimal spill/reload points using edge bundles and frequency data. |
| regalloc-evict | RegAllocEvictionAdvisorAnalysis | Advises the greedy allocator on which live range to evict. |
| regalloc-priority | RegAllocPriorityAdvisorAnalysis | Assigns allocation priority to live ranges. |
| virtregmap | VirtRegMapAnalysis | Maps virtual registers to their assigned physical registers (or spill slots). |
| machine-rpa | sub_21EAA00 | NVIDIA-custom machine register pressure analysis. Provides per-BB pressure data consumed by RP-aware MCSE, scheduling, and rematerialization. |

Machine Pass Knobs Summary

NVIDIA Target Pass Enable/Disable

| Knob | Type | Default | Effect |
|---|---|---|---|
| enable-nvvm-peephole | bool | true | Enable NVPTX-specific peephole optimizer |
| nvptx-enable-machine-sink | bool | false | Enable MachineSink on NVPTX (off by default due to pressure concerns) |
| enable-mlicm | bool | (opt-level dependent) | Enable MachineLICM on NVPTX |
| enable-mcse | bool | (opt-level dependent) | Enable MachineCSE on NVPTX |
| nv-disable-mem2reg | bool | false | Disable machine-level mem2reg |
| nv-disable-remat | bool | false | Disable all NVIDIA rematerialization passes |
| enable-new-nvvm-remat | bool | (varies) | Enable new NVVM remat, disable old |
| usedessa | int | 2 | Select deSSA method for PHI elimination |
| cssa-coalesce | int | (varies) | Controls PHI operand coalescing aggressiveness |

Stock LLVM Codegen Controls

| Knob | Type | Default | Effect |
|---|---|---|---|
| disable-machine-dce | bool | false | Disable dead machine instruction elimination |
| disable-machine-licm | bool | false | Disable pre-RA MachineLICM |
| disable-postra-machine-licm | bool | false | Disable post-RA MachineLICM |
| disable-machine-cse | bool | false | Disable MachineCSE |
| disable-machine-sink | bool | false | Disable MachineSink (NVPTX also gates via nvptx-enable-machine-sink) |
| disable-postra-machine-sink | bool | false | Disable post-RA MachineSink |
| disable-branch-fold | bool | false | Disable BranchFolding / tail merge |
| disable-tail-duplicate | bool | false | Disable post-RA tail duplication |
| disable-early-taildup | bool | false | Disable pre-RA tail duplication |
| disable-block-placement | bool | false | Disable MachineBlockPlacement |
| disable-copyprop | bool | false | Disable MachineCopyPropagation |
| disable-ssc | bool | false | Disable Stack Slot Coloring |
| disable-post-ra | bool | false | Disable post-RA scheduler |
| disable-early-ifcvt | bool | false | Disable early if-conversion |
| disable-peephole | bool | false | Disable stock LLVM peephole optimizer |
| enable-machine-outliner | enum | (varies) | disable / enable / guaranteed beneficial |
| misched-postra | bool | false | Run MachineScheduler post-RA |
| optimize-regalloc | bool | true | Enable optimized register allocation path |
| verify-machineinstrs | bool | false | Run MachineVerifier after each pass |

NVIDIA RP-Aware MachineCSE Knobs

| Knob | Type | Default | Effect |
|---|---|---|---|
| rp-aware-mcse | bool | (varies) | Enable register-pressure-aware MachineCSE |
| pred-aware-mcse | bool | (varies) | Enable predicate-register-pressure-aware MCSE |
| copy-prop-mcse | bool | (varies) | Enable copy propagation within MachineCSE |
| incremental-update-mcse | bool | true | Incrementally update MRPA during MCSE |
| verify-update-mcse | bool | false | Debug: verify incremental MRPA updates against full recomputation |
| print-verify | bool | false | Debug: print detailed RP mismatch diagnostic |
| cta-reconfig-aware-mrpa | bool | (varies) | CTA reconfiguration aware machine RP analysis |

NVPTXBlockRemat Knobs

| Knob | Type | Default | Effect |
|---|---|---|---|
| nv-remat-block | int | 14 | Bitmask controlling remat modes (bits 0-3) |
| nv-remat-max-times | int | 10 | Max iterations of the outer remat loop |
| nv-remat-block-single-cost | int | 10 | Max cost per single live value pull-in |
| nv-remat-block-map-size-limit | int | 6 | Map size limit for single pull-in |
| nv-remat-block-max-cost | int | 100 | Max total clone cost per live value reduction |
| nv-remat-block-liveout-min-percentage | int | 70 | Min liveout % for special consideration |
| nv-remat-block-loop-cost-factor | int | 20 | Loop cost multiplier |
| nv-remat-default-max-reg | int | 70 | Default max register pressure target |
| nv-remat-block-load-cost | int | 10 | Cost assigned to load instructions |
| nv-remat-threshold-for-spec-reg | int | 20 | Threshold for special register remat |
| nv-dump-remat-block | bool | false | Debug dump toggle |
| load-remat | bool | true | Enable load rematerialization |

Pipeliner Knobs

| Knob | Type | Default | Effect |
|---|---|---|---|
| enable-pipeliner | bool | true | Enable the MachinePipeliner pass |
| pipeliner-max-mii | int | 27 | Maximum Minimal Initiation Interval before abort |
| pipeliner-max-stages | int | 3 | Maximum pipeline stages |
| pipeliner-ii-search-range | int | 10 | Number of consecutive II values to try |
| pipeliner-register-pressure | bool | false | Enable RP tracking during pipelining |
| pipeliner-register-pressure-margin | int | 5 | RP margin before pipeliner backs off |
| pipeliner-ignore-recmii | bool | false | Zero out RecMII, use only ResMII |
| pipeliner-annotate-for-testing | bool | false | Annotate schedule without modifying code |
| pipeliner-experimental-cg | bool | false | Use experimental peeling code generator |
| pipeliner-mve-cg | bool | true | Use MVE code generator (default path) |
| outliner-benefit-threshold | int | 1 | Minimum size in bytes for outlining candidate |

Register Pressure Target Knobs

| Knob | Type | Default | Effect |
|---|---|---|---|
| reg-target-adjust | int | 0 | Adjust register pressure target (-10 to +10) |
| pred-target-adjust | int | 0 | Adjust predicate register pressure target (-10 to +10) |
| fca-size | int | 8 | Max size of first-class aggregates in bytes |
| remat-load-param | bool | (varies) | Support remating const ld.param not exposed in NVVM IR |
| cta-reconfig-aware-rpa | bool | (varies) | CTA reconfiguration aware register pressure analysis |

Function Address Map

| Address | Size | Function | Role |
|---|---|---|---|
| sub_215DC20 | -- | GenericToNVVM registration | Address space normalization |
| sub_215D530 | 320B state | GenericToNVVM factory | Allocates pass state with 2 DenseMaps |
| sub_215D780 | -- | GenericToNVVM cleanup | GVMap iteration and Value ref-counting |
| sub_2166D20 | 1.5KB | addISelPasses | Pre-ISel pass configuration |
| sub_2166ED0 | 1.6KB | addPreRegAlloc | Pre-RA pass configuration |
| sub_21668D0 | 1.2KB | addPostRegAlloc | Post-RA pass configuration |
| sub_217D300 | -- | BlockRemat pass name | "NVPTX Machine Block Level Rematerialization" |
| sub_217DBF0 | -- | BlockRemat registration | "nvptx-remat-block" |
| sub_217E810 | 5.2KB | MULTIDEF detection | Single-def checker with opcode exclusion table |
| sub_2181550 | ~3KB | Recursive pullability | Depth-limited chain validation (depth <= 50) |
| sub_2181870 | 19KB | Second-chance heuristic | Re-evaluates rejected remat candidates |
| sub_2183E30 | -- | Cost evaluator | Computes clone cost for rematerialization |
| sub_2184890 | 12KB | Remat allocation helper | Simulates pressure after remat |
| sub_2185250 | 17KB | Liveness propagation | Core instruction cloning/replacement engine |
| sub_2186590 | -- | Max-live computation | Per-block pressure scan |
| sub_2186D90 | 47KB | BlockRemat main engine | Iterative pull-in algorithm (1742 lines) |
| sub_21810D0 | 9.4KB | Instruction replacement | Replaces register uses after remat |
| sub_21BC5A0 | -- | AllocaHoisting name | Pass name registration |
| sub_21BC7D0 | -- | AllocaHoisting registration | "alloca-hoisting" |
| sub_21BCD80 | -- | ValidGlobalNames registration | "nvptx-assign-valid-global-names" |
| sub_21BCF10 | -- | ImageOptimizer registration | "NVPTX Image Optimizer" |
| sub_21DA810 | -- | ProxyRegErasure | Redundant cvta.to.local removal |
| sub_21DB090 | -- | NVPTXPeephole registration | "nvptx-peephole" |
| sub_21DB5F0 | -- | NVPTXPrologEpilog registration | "NVPTX Prolog Epilog Pass" |
| sub_21DBEA0 | -- | ReplaceImageHandles registration | "NVPTX Replace Image Handles" |
| sub_21DD1A0 | 16KB | Image type validation | tex/suld/sust/suq type checking |
| sub_21E9A60 | 4.9KB | RP stats printer | "Max Live RRegs: " / "PRegs: " |
| sub_21E9E80 | -- | ExtraMachineInstrPrinter registration | "extra-machineinstr-printer" |
| sub_21EAA00 | -- | MRPA registration | "machine-rpa" |
| sub_21EEB40 | 68KB | MRPA full recomputation | Per-BB pressure computation |
| sub_21F2780 | -- | LdgXform registration | "ldgxform" |
| sub_21F2C80 | 19KB | LDG split body | .ldgsplit / .ldgsplitinsert |
| sub_21F3A20 | 44KB | Vector splitting engine | splitVec / vecBitCast / extractSplitVec |
| sub_21F9920 | -- | NVPTXMem2Reg registration | "nvptx-mem2reg" |
| sub_21FA880 | 22KB | Mem2Reg body | Machine-level mem2reg driver |
| sub_21FC920 | 33KB | Mem2Reg engine | Promotion/replacement logic |
| sub_2200150 | 78KB | DAGToDAG ISel main | Hash-table pattern matching (h = (37*idx) & (size-1)) |
| sub_2203290 | -- | ParamOpt registration | "param-opt" |
| sub_2204E60 | -- | Redundant move elim | "Remove redundant moves" |
| sub_22058E0 | -- | TruncOpts registration | "nvptx-trunc-opts" |
| sub_2E5A4E0 | 48KB | MRPA incremental updater | Incremental RP tracking for MCSE |
| sub_1E00370 | 78KB | MRPA backend variant | Alternative RP tracker |
| sub_35B1110 | 68KB | PrologEpilogInserter | .local frame layout (2388 lines) |
| sub_3563190 | 58KB | MachinePipeliner | Swing Modulo Scheduling |
| sub_355F610 | 64KB | ScheduleDAGMILive | Post-RA instruction ordering |
| sub_3557A10 | -- | SMS instruction selection | Scheduling heuristic |

Global Variable Reference

| Variable | Type | Default | Role |
|---|---|---|---|
| byte_4FD1980 | byte | (opt-level) | MachineLICM enable flag |
| byte_4FD18A0 | byte | (opt-level) | MachineCSE enable flag |
| byte_4FD1A60 | byte | (opt-level) | MachineSink enable flag |
| byte_4FD25C0 | byte | (opt-level) | nvptx-mem2reg enable |
| byte_4FD2160 | byte | -- | Extra ISel pass enable |
| byte_4FD2E80 | byte | off | nv-dump-remat-block |
| dword_4FD26A0 | dword | -- | Scheduling mode (1 = simple, else = full) |
| dword_4FD3740 | dword | 10 | nv-remat-max-times |
| dword_4FD3820 | dword | 14 | nv-remat-block mode bitmask |
| dword_4FD33C0 | dword | 70 | nv-remat-default-max-reg (global) |
| qword_501F988 | qword | 1 | incremental-update-mcse |
| qword_501F8A8 | qword | 0 | verify-update-mcse |
| qword_501F7C8 | qword | 0 | print-verify |

Cross-References

SelectionDAG & Instruction Selection

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: Target-independent DAG infrastructure: llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp, DAGCombiner.cpp, LegalizeDAG.cpp, LegalizeTypes.cpp, SelectionDAGBuilder.cpp, SelectionDAGISel.cpp. NVPTX target: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp, NVPTXISelDAGToDAG.cpp, NVPTXInstrInfo.td (LLVM 20.0.0).

LLVM version note: The target-independent SelectionDAG infrastructure at 0xF05000--0xF70000 appears to be stock LLVM 20 with no detectable NVIDIA modifications. All NVIDIA customization lives in the NVPTX target range (0x3290000--0x35FFFFF) via virtual dispatch through NVPTXTargetLowering and NVPTXDAGToDAGISel. The intrinsic lowering switch covers IDs up to 14196 (0x3774), far exceeding upstream NVPTX which covers approximately IDs 0--300.

CICC v13.0 contains a complete NVPTX SelectionDAG backend derived from LLVM 20.0.0, with substantial NVIDIA customizations for GPU-specific lowering, the PTX .param-space calling convention, tensor core intrinsic selection, and a 343KB intrinsic lowering mega-switch covering over 200 CUDA intrinsic IDs. The SelectionDAG pipeline converts LLVM IR into machine-level PTX instructions through four major phases: type legalization, operation legalization, DAG combining, and pattern-based instruction selection.

The NVPTX SelectionDAG backend spans roughly 4MB of code across two address ranges: 0xF05000--0xF70000 for the target-independent DAG infrastructure (combining, known-bits, node management) and 0x3290000--0x35FFFFF for the NVPTX-specific lowering, instruction selection, and register allocation. The infrastructure range is stock LLVM with no detectable NVIDIA modifications; all NVIDIA customization lives in the latter range via target hooks and virtual dispatch.

| Component | Function |
|---|---|
| LowerOperation dispatcher | sub_32E3060 (111KB, 3,626 lines) |
| LowerCall (.param ABI) | sub_3040BF0 (88KB, 2,909 lines) |
| Intrinsic lowering switch | sub_33B0210 (343KB, 9,518 lines) |
| ISel::Select driver | sub_3090F90 (91KB, 2,828 lines) |
| LegalizeTypes | sub_20019C0 (348KB, 10,739 lines) |
| LegalizeOp dispatcher | sub_1FCE100 (91KB, ~100 opcodes) |
| LegalizeOp action dispatch | sub_1FFB890 (137KB, 967 cases) |
| DAG combiner visitor | sub_F20C20 (64KB) |
| DAG combiner orchestrator | sub_F681E0 (65KB) |
| DAGCombiner::combine (NVPTX) | sub_3425710 (142KB, "COVERED"/"INCLUDED" tracing) |
| PerformDAGCombine (NVPTX) | sub_33C0CA0 (62KB) |
| DAG combine: post-legalize | sub_32EC4F0 (92KB) |
| computeKnownBits (NVPTX) | sub_33D4EF0 (114KB, 3,286 lines) |
| Inline asm lowering | sub_2079C70 (83KB, 2,797 lines) |
| Inline asm constraints (NVPTX) | sub_338BA40 (79KB) |
| NVPTXTargetLowering init | sub_3056320 (45KB, constructor) |
| Type legalization setup | sub_3314670 (73KB, table population) |
| Upstream | lib/CodeGen/SelectionDAG/, lib/Target/NVPTX/NVPTXISelLowering.cpp |

Complexity

Let N = number of DAG nodes, E = number of edges (use-def relationships), and I = LLVM IR instruction count. The SelectionDAG pipeline runs eight sequential phases:

  • SelectionDAGBuilder converts IR instructions to DAG nodes in O(I).
  • Each DAG Combiner pass is worklist-driven: O(N) nodes are visited, each matched against pattern rules in O(1) via opcode dispatch. ReplaceAllUsesWith costs O(U) per node, where U = number of uses; the three combiner passes total O(3 * N * U_avg).
  • Type legalization (sub_20019C0, 348KB) iterates until all types are legal. Each iteration processes O(N) nodes, and convergence is guaranteed in O(T) iterations, where T = maximum type-promotion depth (typically 2--3 for GPU types).
  • Operation legalization (sub_1FFB890, 137KB) visits each node once: O(N). The action-table lookup is O(1) via the 2D array at TLI + 2422 + 259 * VT + opcode.
  • ISel pattern matching (sub_3090F90, 91KB) visits each node once in topological order: O(N). Per-node matching is O(P), where P = number of patterns for that opcode, but NVPTX patterns are organized into opcode-indexed tables, making this effectively O(1) for common opcodes.
  • The DAG worklist uses ((addr >> 9) ^ (addr >> 4)) & (cap - 1) hashing for O(1) amortized membership tests.

Overall: O(I + 3 * N * U_avg + N * T + N), which simplifies to O(N * U_avg) in practice. The intrinsic lowering mega-switch (343KB, 200+ IDs) adds O(1) per intrinsic call via its jump table, not O(200).
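
The worklist membership hash quoted in the complexity analysis can be reproduced directly as a standalone sketch, assuming only that the table capacity is a power of two:

```rust
// The recovered DAG worklist hash: mix two shifted copies of the node
// address, then mask into a power-of-two table.
fn dag_worklist_hash(addr: u64, cap: u64) -> u64 {
    assert!(cap.is_power_of_two());
    ((addr >> 9) ^ (addr >> 4)) & (cap - 1)
}
```

The two shift amounts discard low alignment bits of heap-allocated SDNode addresses while keeping enough entropy for small tables.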

Pipeline Position

The SelectionDAG phases execute in a fixed sequence after SelectionDAGBuilder (sub_2081F00) converts LLVM IR into an initial DAG:

  1. SelectionDAGBuilder -- IR-to-DAG lowering, visitor dispatch at sub_2065D30
  2. DAG Combiner (sub_F681E0 / sub_F20C20) -- initial algebraic simplification
  3. DAGTypeLegalizer (sub_20019C0) -- iterates to fixpoint until all types are legal; see Type Legalization
  4. DAG Combiner -- second pass after type legalization
  5. LegalizeDAG (sub_1FCE100 dispatcher, sub_1FFB890 action engine) -- legalizes operations on legal types
  6. DAG Combiner -- third pass after operation legalization
  7. NVPTXTargetLowering::PerformDAGCombine (sub_33C0CA0) -- NVPTX-specific post-legalize combines
  8. Instruction Selection (sub_3090F90) -- see ISel Patterns

Type Legalization

Type legalization (sub_20019C0) is the largest single function in the SelectionDAG pipeline at 348KB. Unlike upstream LLVM, which splits legalization across LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, and LegalizeVectorTypes.cpp, NVIDIA ships all type-legalization logic inlined into a single monolithic dispatch. This may be an LTO artifact or a deliberate choice for branch-prediction locality.

The master switch dispatches on approximately 50 ISD opcodes. Type legalization actions follow the standard LLVM model:

  • Promote -- widen small types to register width (e.g., i8 to i32) via ANY_EXTEND/ZERO_EXTEND, perform the operation, then TRUNCATE the result.
  • Expand -- split wide types into halves (e.g., i128 into two i64 values) using shift-and-OR sequences.
  • Soften -- emulate unsupported FP types through integer libcall sequences.
  • Scalarize/Split Vector -- decompose illegal vector types into scalar element operations.

The legality table lives inside NVPTXTargetLowering at offset +2422, organized as a 2D array indexed by 259 * VT + opcode. The 259-byte row stride accommodates LLVM's ~250 generic opcodes plus approximately 10 NVPTX target-specific opcodes. A secondary condition-code action table at offset +18112 uses 4-bit packed nibbles indexed by (VT_row + 15 * CC).
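
The two lookup schemes can be sketched as follows. The byte offsets (2422, the 259-byte row stride, 15-entry CC rows) come from the decompilation; the flat-slice wrappers themselves are illustrative:

```rust
// Byte index of the legality action for (VT, opcode), relative to the
// start of the NVPTXTargetLowering object.
fn op_action_index(vt: usize, opcode: usize) -> usize {
    2422 + 259 * vt + opcode // one action byte per (VT, opcode) pair
}

// Condition-code action table: 4-bit packed nibbles indexed by
// (VT_row + 15 * CC), two actions per byte.
fn cc_action(table: &[u8], vt_row: usize, cc: usize) -> u8 {
    let nibble = vt_row + 15 * cc;
    let byte = table[nibble / 2];
    if nibble % 2 == 0 { byte & 0xF } else { byte >> 4 }
}
```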

The SimpleVT type encoding appears as a recurring pattern throughout the function (at least 11 instances of the same bitwidth-to-VT mapping):

SimpleVT   Type         SimpleVT   Type
1          i1           7          i128
3          i8           8          f16
4          i16          9          f32
5          i32          10         f64
6          i64          14--109    vector types

The vector type range 14--109 maps fixed-width (14--55) and scalable (56--109) vector MVTs to their scalar element types through a ~100-case switch block that appears six times in the function body. The definitive MVT::getSizeInBits() mapping (confirmed at sub_1FDDC20) is:

MVT Range   Bits     Description
0, 1        0        Other, Glue
2           1        i1
3           8        i8
4, 8        16       i16, f16
5, 9        32       i32, f32
6, 10       64       i64, f64
7           128      i128
11          80       ppcf128 / x87 f80
14--23      varies   2-element vectors
24--109     varies   3+ element vectors
111--114    0        token, metadata, untyped

Type legalization workers fan out from several dispatch functions:

Dispatcher    Role                                Size               Cases
sub_201E5F0   Promote/expand secondary dispatch   81KB               441 case labels, 6 switches
sub_201BB90   ExpandIntegerResult                 75KB               632 case labels
sub_2000100   PromoteIntegerResult                45KB               recursive self-calls
sub_2029C10   SplitVectorResult                   5KB (dispatcher)   ~190 cases
sub_202E5A0   SplitVectorOperand                  6KB (dispatcher)   ~157 cases
sub_2036110   ScalarizeVectorResult               dispatch           "Do not know how to scalarize..."
sub_2035F80   ScalarizeVectorOperand              dispatch           "Do not know how to scalarize..."

For complete detail, see Type Legalization.

Operation Legalization

LegalizeOp Dispatcher: sub_1FCE100

The top-level operation legalizer (sub_1FCE100, 91KB) is a massive switch on SDNode::getOpcode() (read as *(uint16_t*)(node + 24)) that dispatches approximately 100 ISD opcodes to dedicated per-opcode handler functions. The switch covers all major categories:

Opcode       ISD Name             Handler       Size / Notes
0x02         EntryToken           sub_1F823C0
0x03--0x04   TokenFactor          sub_1F73660
0x32         CopyFromReg          sub_1F78510
0x33         CopyToReg            sub_1F987D0
0x34         MERGE_VALUES         sub_1FC08F0
0x35         ADD                  sub_1FA8F90   31KB
0x36         SUB                  sub_1FAA420   26KB
0x37         MUL                  sub_1FAB9E0
0x38         SDIV/UDIV            sub_1FABFF0
0x39--0x3A   SREM/UREM            sub_1F99DA0
0x3B         AND                  sub_1FD2F20
0x3C         OR                   sub_1FD2A20
0x40         SHL                  sub_1FA27D0
0x41         SRA                  sub_1FA2510
0x42         SRL                  sub_1F71080
0x43         ROTL                 inline        builds opcode 65 target node
0x44         ROTR                 sub_1FA2D60
0x47         CTLZ                 sub_1FA7370
0x49         CTPOP                sub_1FA2A00
0x4A         BSWAP                inline        16-bit width check
0x4B         BITREVERSE           inline
0x4C         SELECT               sub_1FAC480   78KB
0x4D         SELECT_CC            sub_1FAE680   87KB
0x4E         SETCC                sub_1FB04B0   26KB
0x4F         VSELECT              sub_1FCC170
0x63         SIGN_EXTEND          sub_1F8D440   22KB
0x65         ZERO_EXTEND          sub_1F74E80
0x68         TRUNCATE             sub_1F912F0   77KB
0x69         FP_ROUND             sub_1F97850   27KB
0x6A         FP_EXTEND            sub_1FC15C0   36KB
0x6C         BITCAST              sub_1F94350   22KB
0x6D         LOAD                 inline        alignment+memtype checks
0x70         STORE                sub_1F766E0
0x72--0x75   ATOMIC_FENCE..LOAD   sub_1FAA010
0x76         ATOMIC_STORE         sub_1FBDC00   76KB
0x77         ATOMIC_LOAD_ADD      sub_1FB1F30   37KB
0x78         ATOMIC_LOAD_SUB      sub_1FBB600   44KB
0x7A         ATOMIC_LOAD_AND      sub_1FB8710   47KB
0x7B         ATOMIC_LOAD_OR       sub_1FBA730   24KB
0x7C         ATOMIC_LOAD_XOR      sub_1FB6C10   39KB
0x86         INTRINSIC_WO_CHAIN   sub_1F9E480   47KB
0x87         INTRINSIC_W_CHAIN    sub_1F9D3D0   26KB
0x88         INTRINSIC_VOID       sub_1F9CFD0
0x8E         BUILD_VECTOR         sub_1FA3B00   26KB
0x8F         INSERT_VECTOR_ELT    sub_1FA4AC0   67KB
0x90         EXTRACT_VECTOR_ELT   sub_1FA0CA0   20KB
0x91         CONCAT_VECTORS       sub_1FB3BB0   65KB
0x94         EXTRACT_SUBVECTOR    sub_1FB5FC0   19KB
0x9A         DYNAMIC_STACKALLOC   sub_1F8F600
0x9E         BR_CC                sub_1F8B6C0

Opcodes not listed (0x00--0x01, 0x05--0x31, 0x3D--0x3F, 0x46, 0x48, 0x51--0x62, etc.) return immediately with code 0 (legal, no transformation needed).

Action Dispatch Engine: sub_1FFB890

The operation legalization action engine (sub_1FFB890, 137KB) determines what to do for each DAG node based on the target's action table, then executes the chosen strategy. It reads the per-opcode action byte from NVPTXTargetLowering + 2422 using the formula *(uint8_t*)(TLI + 259 * VT + opcode + 2422):

Action    Code   Behavior
Legal     0      Return immediately -- node is natively supported
Custom    1      Call NVPTXTargetLowering::LowerOperation (vtable slot #164, offset +1312); if NULL returned, fall through to expand
Expand    2      Try LegalizeTypes, then ExpandNode (sub_1FF6F70) as fallback
LibCall   3      Call ExpandNode directly for libcall substitution
Promote   4      Find a larger legal type and rebuild the node

The function contains 967 case labels dispatching on opcode. When LowerOperation returns NULL (the custom lowering cannot handle the node), the framework falls through to the expansion path. When it returns a different node, ReplaceAllUsesWith (sub_1D44C70) splices the replacement into the DAG and marks the old node as dead (tombstone value -2 in the worklist hash set).

The promote path contains approximately 30 opcode-specific expansion strategies covering integer arithmetic, FP operations, vector operations, bitcasts, shifts, and NVPTX-specific operations. For FP promotion, the pattern is: FP_EXTEND both operands to the promoted type, apply the original operation, then FP_ROUND the result back.

Worklist management uses sub_1FF5010 with a DenseSet-like structure. The hash function for SDNode pointers follows the standard LLVM pattern: ((addr >> 9) ^ (addr >> 4)) & (capacity - 1).

Load/Store Legalization

The largest individual per-opcode handlers deal with memory operations:

Handler       Opcode                     Size   Behavior
sub_1FC2C30   LOAD (complex)             70KB   Extending loads, vector loads, memory type conversion
sub_1FC66B0   Load/Store vectorization   68KB   Offset-based coalescing with introsort (sub_1F6CA30)
sub_1FC9570   STORE legalization         60KB   Alignment checks, store splitting, scatter sequences

The load/store vectorization helper sorts operands by memory offset to detect coalescing opportunities, then creates vector load/store sequences when contiguous accesses are found. This is important for NVPTX because PTX supports ld.v2/ld.v4/st.v2/st.v4 instructions that load/store 2 or 4 elements in a single transaction.

Atomic Legalization

All atomic operations (ATOMIC_STORE through ATOMIC_LOAD_XOR, opcodes 0x72--0x7C) follow a shared structural pattern:

  1. Check operation legality via sub_1D16620 (isAtomicStoreLegal / isOperationLegalOrCustom)
  2. If legal, emit the operation directly
  3. If custom, call NVPTXTargetLowering::LowerOperation for scope-aware NVPTX atomics
  4. Build atomic fence pairs around the operation when needed
  5. Lower to target-specific NVPTX atomic operations with CTA/GPU/SYS scope

The ATOMIC_LOAD_SUB handler at sub_1FBB600 converts subtraction to atom.add of the negated operand when the target lacks native atom.sub.

NVPTX Custom Lowering: sub_32E3060

The LowerOperation dispatcher (sub_32E3060, 111KB) handles NVPTX-specific ISD opcode lowering. This is the second-largest function in the 0x32XXXXX range. It operates through a multi-phase approach rather than a clean switch-on-opcode, with approximately 620 local variables and a 0x430-byte stack frame.

The dispatcher is reached via vtable slot #164 (offset +1312) of the NVPTXTargetLowering object whenever the operation legalizer encounters action code 1 (Custom).

Supported Opcodes

Opcode   ISD Node             Lowering Strategy
51       UNDEF                Direct pass-through via getNode(UNDEF)
156      BUILD_VECTOR         Iterates operands, detects all-same, calls dedicated handler
186      VECTOR_SHUFFLE       Three-level approach by result count (1, 2, 3+)
234      EXTRACT_VECTOR_ELT   Three sub-paths: predicate check, direct sub-register, general extract

Additionally, the function handles load/store lowering (sub_32D2680, 81KB companion), integer/FP operation legalization (sub_32983B0, 79KB), address space casts (sub_32C3760, 54KB), bitcast/conversion (sub_32C7250, 57KB), and conditional/select patterns (sub_32BE8D0, 54KB). These large helper functions are called from within sub_32E3060's dispatch logic.

BUILD_VECTOR Lowering

BUILD_VECTOR (opcode 156) lowering begins by iterating all operands to detect the all-same (splat) case. When all elements are the same value, the lowering produces a single scalar load followed by register-class-appropriate replication. When elements differ, it falls through to a per-element insert chain.

For NVPTX, BUILD_VECTOR is significant because PTX has no native vector construction instruction -- vectors are built by storing elements into .param space and reloading as a vector type, or through register-pair packing for 2-element vectors.

VECTOR_SHUFFLE Three-Level Lowering

Vector shuffle lowering (lines 2665--3055 of the decompilation) implements a three-level strategy based on the result element count:

Level 1 -- Single-result shuffle. When the shuffle produces a single element, the lowering extracts the source element directly via EXTRACT_VECTOR_ELT and wraps it in a BUILD_VECTOR if needed. This avoids any actual shuffle machinery.

Level 2 -- Two-result shuffle. The handler uses a two-phase identity/extract detection with BitVector tracking. Phase A scans the shuffle mask to identify which source elements map to which result positions. Phase B determines whether each result position is an identity (element already in the correct position in one of the source vectors) or requires extraction. Results that are identities are left in place; non-identity elements are extracted and inserted.

Level 3 -- General shuffle (3+ results). Falls back to a BUILD_VECTOR-based reconstruction. Each result element is individually extracted from the appropriate source vector using EXTRACT_VECTOR_ELT, then all elements are combined via BUILD_VECTOR. For certain mask patterns, pairwise shuffle via sub_32B2430 is attempted first as an optimization.

EXTRACT_VECTOR_ELT Three Sub-Paths

EXTRACT_VECTOR_ELT (opcode 234) lowering takes one of three paths based on the extraction context:

  1. Predicate extraction. When extracting from a vector of i1 (predicates), the lowering produces a bitwise test on the packed predicate register. This is NVPTX-specific: PTX stores predicate vectors packed into integer registers.

  2. Direct sub-register extraction. When the element index is a compile-time constant and the element aligns with a register boundary, the lowering generates a direct sub-register reference. This maps to PTX's mov.b32 or mov.b64 for extracting elements from packed register pairs.

  3. General extraction. For non-constant indices or non-aligned elements, the lowering stores the entire vector to local memory, computes the byte offset from the index, and loads the element back. This generates st.local + ld.local sequences, which is expensive but handles all cases.

Supporting NVPTX Lowering Functions

The custom lowering infrastructure at 0x3290000--0x32FFFFF consists of approximately 13 large functions totaling ~850KB:

Function      Size    Role
sub_32E3060   111KB   Master LowerOperation dispatcher
sub_32A1EF0   109KB   Custom type promotion for NVPTX types
sub_32EC4F0   92KB    Post-legalize DAG combine
sub_32FE970   88KB    Vector operation splitting/scalarization
sub_32D2680   81KB    Load/store DAG lowering (address space, alignment)
sub_32983B0   79KB    Integer/FP operation legalization
sub_32B8A20   71KB    NVVM intrinsic lowering (tex/surf/special)
sub_32CBCB0   57KB    Extended type legalization
sub_32C7250   57KB    Bitcast/conversion lowering
sub_32A9030   55KB    Vector operation lowering
sub_32C3760   54KB    Address space cast / pointer lowering
sub_32BE8D0   54KB    Conditional/select lowering
sub_32B6540   50KB    Special register / intrinsic lowering

Common helpers shared across all functions in this cluster:

Range         Role
sub_325Fxxx   EVT/MVT type utilities
sub_326xxxx   DAG node creation (getNode variants)
sub_327xxxx   DAG memory node creation
sub_328xxxx   Target-specific node creation
sub_33Exxxx   NVPTX-specific node builders
sub_33Fxxxx   NVPTX instruction node helpers
sub_340xxxx   NVPTX constant/register node helpers
sub_341xxxx   NVPTX chain/glue node construction

The .param-Space Calling Convention

PTX does not use registers for argument passing. Instead, all arguments flow through .param memory space, a compiler-managed address space specifically for call sites. LowerCall (sub_3040BF0, 88KB) implements this convention by emitting a structured sequence of NVPTXISD custom DAG nodes.

Call Sequence DAG Structure

CallSeqBegin(315, seq_id, 0)
  DeclareScalarParam(506, align=4, idx=0, size=32)   // scalar arg
  DeclareParam(505, align=4, idx=1, size=N)           // struct arg (byval)
    StoreV1(571, ...)                                  // 8 bytes at a time
    StoreV2(572, ...)                                  // or 2-element vector
  DeclareRetScalarParam(508, 1, 32, 0)                // return decl
  CallProto(518, callee, ...)
  CallStart(514, ...)                                  // actual call
  LoadRetParam(515, 1, 0, ...)                         // load return value
  CallSeqEnd(517, ...)
CallSeqEnd_Outer(316, ...)

Each call increments a monotonic sequence counter at NVPTXTargetLowering + 537024 (offset 134256 * 4), used to match CallSeqBegin/CallSeqEnd pairs and generate unique .param variable names (e.g., __param_0, __param_1, etc.).

Scalar Widening Rules

Scalar arguments narrower than 32 bits are widened to 32 bits; values between 32 and 64 bits are widened to 64 bits. This matches the PTX ABI requirement that .param scalars have a minimum 32-bit size:

Source Width   Widened To        PTX Type
i1 (1 bit)     i32 (32 bit)      .param .b32
i8 (8 bit)     i32 (32 bit)      .param .b32
i16 (16 bit)   i32 (32 bit)      .param .b32
i32 (32 bit)   i32 (no change)   .param .b32
i64 (64 bit)   i64 (no change)   .param .b64
f16 (16 bit)   i32 (32 bit)      .param .b32
f32 (32 bit)   f32 (no change)   .param .f32
f64 (64 bit)   f64 (no change)   .param .f64

Vector Parameter Passing

Vector arguments use StoreV1/StoreV2/StoreV4 (opcodes 571--573) mapping to PTX st.param.b32, st.param.v2.b32, st.param.v4.b32 and their 64-bit variants. The element count determines the opcode:

Opcode   Name      PTX                         Description
571      StoreV1   st.param.b32 / .b64         Single-element store
572      StoreV2   st.param.v2.b32 / .v2.b64   2-element vector store
573      StoreV4   st.param.v4.b32 / .v4.b64   4-element vector store

For byval struct arguments, the lowering decomposes the aggregate into chunks that fit the largest available vector store. An 80-byte struct, for example, might be lowered as five StoreV4.b32 operations (5 x 4 x 4 = 80 bytes).

NVPTXISD DAG Node Opcodes

The complete set of NVPTXISD opcodes used in call lowering:

Opcode     Name                    Role
315        CallSeqBegin            Marks start of call parameter setup (maps to ISD opcode)
316        CallSeqEnd              Outer end-of-call marker (maps to ISD opcode)
505        DeclareParam            Declares a byval .param aggregate parameter
506        DeclareScalarParam      Declares a scalar .param parameter with width+alignment
508        DeclareRetScalarParam   Declares the return value .param parameter
510        CallDirect              Direct call with prototype
511        CallDirectNoProto       Direct call without prototype (old-style C)
512        CallIndirect            Indirect call (function pointer) with prototype
513        CallIndirectNoProto     Indirect call without prototype
514        CallStart               The actual call instruction
515        LoadRetParam            Loads return value from .param space
517        CallSeqEnd (inner)      Inner end-of-call marker
518        CallProto               Call prototype declaration (type signature)
571--573   StoreV1/V2/V4           Stores to .param space

Four Call Flavors

Call dispatch is selected by prototype availability and call directness:

Opcode   Name                  When Used
510      CallDirect            Direct call to a named function with a known prototype
511      CallDirectNoProto     Direct call without prototype (K&R C style, rare in CUDA)
512      CallIndirect          Function pointer call with known prototype
513      CallIndirectNoProto   Function pointer call without prototype

In CUDA code, CallDirect (510) dominates because the vast majority of device function calls are direct with full prototypes. CallIndirect (512) appears when calling through __device__ function pointers. The no-prototype variants are legacy paths that may not be exercisable from CUDA C++ but are retained for C compatibility.

Libcall Generation

When the lowering needs to synthesize a library call (e.g., for __divdi3 software division), it attaches "nvptx-libcall-callee" metadata set to "true" on the callee. This metadata string was extracted from the binary at sub_3040BF0. The metadata tells later passes that the callee is a compiler-generated runtime helper rather than user code.

The primary helpers called from LowerCall:

Helper        Role
sub_302F170   Parameter marshaling setup
sub_3031480   Argument type coercion
sub_3031850   Scalar widening
sub_30351C0   Struct decomposition for byval args
sub_303E700   Return value handling

DAG Combining

The DAG combiner runs three times during the SelectionDAG pipeline: once after initial DAG construction, once after type legalization, and once after operation legalization. The combiner consists of a target-independent framework and NVPTX-specific target hooks.

Target-Independent Combiner Framework

The combiner orchestrator (sub_F681E0, 65KB) manages the worklist-driven iteration over all DAG nodes:

function DAGCombine(dag):
    worklist = dag.allNodes()    // linked list iteration
    visited = SmallPtrSet()
    while worklist not empty:
        node = worklist.pop()
        if visited.count(node): continue
        visited.insert(node)     // sub_C8CA60 / sub_C8CC70
        result = visitNode(node) // sub_F20C20
        if result != node:
            ReplaceAllUsesWith(node, result) // sub_F162A0
            add users of result to worklist
            mark node dead

The worklist operates on the SDNode linked list. Nodes are processed via sub_C8CA60 (SmallPtrSet::count for visited check) and sub_C8CC70 (SmallPtrSet::insert with vector growth for worklist membership). The exclusion list at this + 64 (with count at this + 76) prevents certain nodes from being visited.

Global flag byte_4F8F8E8 enables verbose/debug tracing of the combining process.

Visitor: sub_F20C20

The per-node combine visitor (sub_F20C20, 64KB) implements six sequential optimization phases for each node:

Phase 1: Opcode-specific combine. Calls sub_100E380, the target-independent combine dispatcher, which switches on the node's opcode and applies algebraic simplifications (e.g., x + 0 -> x, x & -1 -> x, x * 1 -> x). For NVPTX, this also invokes the target-specific combine hook via vtable dispatch.

Phase 2: Known-bits narrowing. For nodes with constant operands, the combiner builds APInt masks and calls sub_11A3F30 (computeKnownBits / SimplifyDemandedBits) to narrow constants. When all high bits of a result are known-zero, the operation can be narrowed to a smaller type. Two global cl::opt flags gate this phase: qword_4F8B3C8 controls strict-FP known-bits combining, and qword_4F8B548 controls 2-operand reassociation.

Phase 3: Operand type-narrowing loop. For each operand, the combiner computes the legalized type, skips zero-constant operands, creates legalized replacements, and inserts SIGN_EXTEND/TRUNCATE cast nodes as needed. This handles the common case where an operation was originally on i64 but only uses the low 32 bits.

Phase 4: All-constant-operand fold. Detects when every operand is a ConstantSDNode (opcode 17) and calls sub_1028510 for full constant-fold evaluation. The constant check uses a 4x-unrolled loop for performance. The operand count is extracted via the 0x7FFFFFF mask from the packed SDNode header.

Phase 5: Division-by-constant strength reduction. Replaces division by power-of-two constants with shift+mask sequences via APInt shift/mask computation. Division by non-power-of-two constants uses the magic-number reciprocal multiplication technique: x / C becomes (x * M) >> shift where M is the multiplicative inverse.

Phase 6: Vector stride / reassociation patterns. Attempts associative FP decomposition via sub_F15980, with fast-math flag propagation when both sub-results are known non-negative. This handles patterns like (a + b) + c -> a + (b + c) when nsz and arcp flags permit.

ReplaceAllUsesWith: sub_F162A0

The combiner's RAUW implementation walks the use-list and hashes each user into a worklist map using the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function and growth policy.

Supporting Combine Functions

Function     Size     Role
sub_F0F270   25.5KB   Pattern matcher (STORE/BITCAST/CONSTANT)
sub_F24210   34.6KB   DAG simplification pass
sub_F2B940   29.8KB   Truncation/extension chain combines
sub_F29CA0   26.9KB   Node morphing / operand updating
sub_F27020   25KB     Specific operation combines
sub_F2D1B0   22.2KB   Comparison combines
sub_F2DD30   11.5KB   Shift combines
sub_F62E00   46.7KB   Address/memory operation combines
sub_F657D0   26.1KB   Vector operation combines
sub_F6C1B0   15.7KB   TokenFactor chain management

SDNode Data Structure

The combiner manipulates SDNodes using these field offsets (reconstructed from access patterns throughout the combining code):

Offset   Size   Field
-8       8      Operand list pointer (when bit 6 of byte +7 is set)
0        8      First operand / use chain linked list
+4       4      Packed: NumOperands (bits 0--26) | Flags (bits 27--31)
+7       1      Extra flags (bit 6 = has operand pointer at -8)
+8       8      ValueType / MVT
+16      8      Use chain (next user pointer, 0 if none)
+24      2      Opcode (uint16_t)
+32      4      Result type info
+36      4      DebugLoc / location ID
+40      8      Chain operand
+48      8      Value pointer / type info
+72      4      NumResults
+80      4      Additional operand count / mask index

Operand stride is 32 bytes. Access pattern: node - 32 * (node[+4] & 0x7FFFFFF) yields the first operand.

NVPTX Target-Specific Combines: sub_33C0CA0

NVPTXTargetLowering::PerformDAGCombine (sub_33C0CA0, 62KB) provides NVPTX-specific algebraic optimizations. This function is called from the target-independent combiner framework via vtable dispatch. It receives an SDNode and returns either NULL (no transformation) or a replacement node.

The function calls sub_2FE8D10 (13x), sub_2FE6CC0 (12x), sub_30070B0 (14x), and sub_2D56A50 (9x), with 27 calls into sub_B2D*/B2C* for debug value builders.

A secondary NVPTX DAG combine function at sub_32EC4F0 (92KB) handles post-legalize optimization, operating after the main legalization pass. It calls into the same shared DAG construction helpers (sub_2FE3480, sub_2FE6750, sub_325F5D0, sub_3262090).

The NVIDIA-side DAGCombiner at sub_3425710 (142KB) includes debug tracing with "COVERED: " and "INCLUDED: " prefix strings, confirming it was built with NVIDIA's internal debug infrastructure. This function calls sub_C8D5F0 (31x for type action checks), sub_2E79000 (14x for value type access), and sub_3423E80 (8x for combine helper dispatch).

NVPTX Address Spaces

Address space constants appear throughout the SelectionDAG lowering. See Address Spaces for the master table and SelectionDAG Address Space Encoding for the backend-specific secondary encoding used in .param passing conventions.

In LowerCall, pointer arguments undergo addrspacecast to generic (AS 0) via sub_33F2D30. The pointer size for AS 5 follows a power-of-two encoding: sizes 1, 2, 4, 8, 16, 32, 64, 128 bytes map to codes 2, 3, 4, 5, 6, 7, 8, 9.

Address space handling permeates the entire lowering infrastructure. Functions sub_33067C0 (74KB), sub_331F6A0 (62KB), sub_331C5B0 (60KB), and sub_33D4EF0 (114KB) all contain address-space-aware logic for NVPTX memory operations, global address lowering, argument handling, and complex pattern matching respectively.

Intrinsic Lowering

The intrinsic lowering mega-switch (sub_33B0210, 343KB) dispatches over 200 distinct NVPTX intrinsic IDs into DAG node construction. The switch covers intrinsic IDs 0--0x310 in the main body, with high-ID ranges for texture/surface operations extending to ID 14196 (0x3774). The function contains approximately 1,000 local variables and calls sub_338B750 (getValue helper) 195 times, sub_3406EB0 (getNode) 116 times, and sub_337DC20 (setValue) 100 times.

Key intrinsic categories:

Category                    ID Range                   Handler                     Count
Math ops (rounding modes)   2, 10, 12, 20, 21, 63, ...  sub_33FA050                 ~20
WMMA / MMA (tensor core)    0xA4--0xA8, 0x194--0x1EC   sub_33A64B0                 95
Texture sampling            0x5D--0x8D                 sub_33A4350                 50
Surface read/write          0x8E--0x90                 sub_33A3180                 3
Warp shuffle                0xD4, 0xD5, 0xDF, 0xE0     sub_33FAF80                 4
Vote intrinsics             0xE1--0xE6                 sub_339CDA0 / sub_339E310   6
Atomics                     0xEB--0xF8                 sub_3405C90 / sub_340AD50   ~14
cp.async / TMA              0x175--0x17C               sub_33AD3D0                 ~8
MMA sm90+ (Hopper wgmma)    0x183--0x191               sub_33AC8F0                 15
Texture/surface handle      10578                      inline                      nvvm_texsurf_handle

The WMMA/MMA block is the largest single-handler group: 95 consecutive case labels (intrinsic IDs 404--492) all delegate to sub_33A64B0, covering wmma.load, wmma.store, wmma.mma, mma.sync (sm70+), mma.sp (sm80+), and mma.f64 (sm90+). The warp shuffle intrinsics map to specific NVPTXISD opcodes: __shfl_down_sync to 277, __shfl_up_sync to 275, __shfl_xor_sync to 278, and __shfl_sync to 276.

Math intrinsics encode explicit rounding modes via an inner opcode table. For example, ADD_RN (round-to-nearest) maps to opcode 252, ADD_RZ (round-toward-zero) to 249, ADD_RM (round-toward-minus-infinity) to 245, and ADD_RP (round-toward-plus-infinity) to 270.

NVIDIA-specific intrinsic IDs include high-value entries: ID 10578 handles nvvm_texsurf_handle, IDs 8920/8937--8938 handle texture/surface operations. The overflow path at sub_33A1E80 handles intrinsic IDs that fall outside the main switch range.

NVPTX computeKnownBits

The NVPTX target provides a custom computeKnownBitsForTargetNode implementation (sub_33D4EF0, 114KB) that propagates bit-level information through 112 opcode cases in the SelectionDAG. This function calls sub_969240 (SDNode accessor) 399 times and itself recursively 99 times. It supports demanded-bits pruning via an APInt mask parameter and caps recursion at depth 6 (matching LLVM's default MaxRecursionDepth).

Notable NVPTX-specific known-bits behaviors:

  • Memory operation type inference (opcode 0x12A): Propagates known bits through load operations based on extension mode (zero-extend, sign-extend, any-extend) encoded in the node flags byte at bits [2:3]. Handles ld.global.u32 vs ld.global.s32 vs ld.global.b32 distinctions.
  • Texture/surface fetch results (opcodes 0x152--0x161): Sets known bits in the range [elementSize..width] based on the result type, encoding the known bit-width of texture fetch results.
  • Constant pool integration (opcode 0x175): Uses LLVM's ConstantRange class to derive known bits from constant pool values, chaining fromKnownBits through intersect to toKnownBits.
  • Target fence at opcode 499 (ISD::BUILTIN_OP_END): All opcodes above 499 delegate to the TargetLowering virtual method; below that, the generic ISD switch handles everything.

APInt values with width at most 64 bits use inline storage; wider values trigger heap allocation. The constant 0x40 (64) appears hundreds of times as the inline/heap branch condition.

The target-independent known-bits infrastructure at 0xF50000--0xF60000 includes:

Function     Size     Role
sub_F5A610   36.7KB   computeKnownBits for generic ISD opcodes (depth limit at a4 == 48)
sub_F5F040   52.4KB   Extended known-bits with recursive expansion limit: (v74-1)*v77 > qword_4F8BF28
sub_F5CD10   26.6KB   DAG combine using known-bits results
sub_F54050   17.8KB   Known-bits for multi-result nodes
sub_F54F50   10.7KB   Known-bits for vector operations

Global qword_4F8BF28 is a threshold that limits recursive known-bits expansion to prevent combinatorial blowup.

Inline Assembly Lowering

Inline assembly lowering spans two locations in the binary: the target-independent SelectionDAGBuilder::visitInlineAsm at sub_2079C70 (83KB) and the NVPTX-specific constraint handler at sub_338BA40 (79KB).

Target-Independent Framework: sub_2079C70

The inline assembly visitor (sub_2079C70, 83KB, 2,797 lines) lowers LLVM IR asm statements into ISD::INLINEASM (opcode 193) or ISD::INLINEASM_BR (opcode 51) DAG nodes. The function allocates an 8.4KB stack frame and processes operands in five phases:

  1. Initialization. Parses the asm string and metadata. Looks up "srcloc" metadata on the asm instruction for error location reporting.

  2. Constraint pre-processing. Each constraint string is parsed into a 248-byte record. Constraints are classified as: immediate ('i', flag 0x20000), memory ('m', flag 0x30000), or register (determined by target).

  3. Tied operand resolution. Input operands tied to output operands (e.g., "=r" and "0") are matched and validated for type compatibility. Diagnostic: "inline asm not supported yet: don't know how to handle tied indirect register inputs".

  4. Per-operand lowering. Each operand is lowered to an SDValue. Register operands go through TargetLowering::getRegForInlineAsmConstraint() (virtual dispatch). Diagnostics: "couldn't allocate output register for constraint '", "couldn't allocate input reg for constraint '".

  5. DAG node finalization. All operands are assembled into an INLINEASM SDNode with chain and flag operands.

The function uses a 16-entry inline operand buffer (7,088 bytes on stack), reflecting the assumption that CUDA inline asm rarely exceeds 16 operands. Each operand working structure is 440 bytes. Overflow triggers heap reallocation via sub_205BBA0.

Diagnostic strings found in the binary:

String                                                                          Condition
"couldn't allocate output register for constraint '"                            Register constraint unsatisfiable
"couldn't allocate input reg for constraint '"                                  Input constraint unsatisfiable
"Don't know how to handle indirect register inputs yet..."                      Indirect tied operand
"inline asm error: This value type register class is not natively supported!"   Unsupported type for register
"invalid operand for inline asm constraint '"                                   Generic operand mismatch
"Indirect operand for inline asm not a pointer!"                                Non-pointer indirect operand

NVPTX Constraint Handler: sub_338BA40

The NVPTX-specific inline asm constraint handler (sub_338BA40, 79KB) is part of the NVPTXTargetLowering class. It processes constraint strings specific to the NVPTX backend:

  • Simplified constraint model. NVPTX recognizes single-character 'i' (immediate) and 'm' (memory) constraints through sub_2043C80, avoiding the complex multi-character constraint tables used by x86/ARM backends.

  • Register class mapping. The function maps MVT values to NVPTX register classes using a 544-case switch (confirmed at sub_204AFD0, 60KB): MVTs 0x18--0x20 map to Int32Regs, 0x21--0x28 to Int64Regs, 0x29--0x30 to Float32Regs, 0x31--0x36 to Float64Regs, 0x37 to Int128Regs, 0x56--0x64 to 2-element vector registers.

  • Convergent flag handling (bit 5): Ensures barrier semantics are preserved for inline asm, checked via operand bundle attribute or function-level convergent.

  • Scalar-to-vector conversion. String "non-trivial scalar-to-vector conversion" indicates that the handler attempts to pack scalar inline-asm results into vector register classes when the output constraint specifies a vector type.

Additional support at sub_2046E60 emits ", possible invalid constraint for vector type" when a vector type is used with an incompatible constraint.

ISel Pattern Matching Driver

The instruction selection driver (sub_3090F90) manages the top-level selection loop rather than performing pattern matching directly. It builds a cost table for function arguments using a hash table with hash function key * 37, processes the topological worklist using a min-heap priority queue, and calls the actual pattern matcher (sub_308FEE0) for each node.

The driver maintains an iteration budget of 4 * numInstructions * maxBlockSize to guard against infinite loops. When the budget is exceeded, selection terminates for the current function.

For complete ISel detail, see ISel Pattern Matching & Instruction Selection.

NVPTXTargetLowering Initialization

The NVPTXTargetLowering constructor (sub_3056320, 45KB + sub_3314670, 73KB) populates the legalization action tables that drive all subsequent SelectionDAG processing. It calls sub_302E500, sub_302F030, sub_3030230, and sub_3034720 to register legal/custom/expand actions for each {ISD_opcode, MVT} pair.

Key aspects of the initialization:

  • Subtarget-gated feature checks. Offsets +2843, +2584, and +2498 in the subtarget object encode SM-version-dependent feature availability. These control which types and operations are marked Legal vs. Custom vs. Expand.

  • Vector support. NVPTX has limited native vector support. Most vector operations are marked Custom or Expand, forcing them through the custom lowering at sub_32E3060.

  • Atomic support. The string "vector atomics not supported on this architecture!" at sub_3048C30 confirms SM-version-gated vector atomic support, likely SM 90+ (Hopper) or SM 100+ (Blackwell).

  • Address space assertions. AS values (generic=0, global=1, shared=3, const=4, local=5) are encoded directly into the legalization tables, with different legal operation sets per address space.

What Upstream LLVM Gets Wrong for GPU

Upstream LLVM's SelectionDAG framework was designed for CPU ISAs where register classes overlap and share a unified physical register file. The NVPTX target breaks these assumptions at every level:

  • Upstream assumes register classes interfere with each other. On x86, GR32 is a sub-register of GR64; allocating eax constrains rax. The interference graph, coalescing, and copy elimination infrastructure all assume overlapping classes. NVPTX has nine completely disjoint classes (%r, %f, %fd, %p, etc.) with zero cross-class interference. The DAG's register pressure tracking, copy coalescing hints, and class constraint propagation solve a problem that does not exist on this target.
  • Upstream assumes function calls are cheap register shuffles. CPU calling conventions move arguments through registers (rdi, rsi, etc.) or a stack backed by L1 cache. NVPTX function calls go through the .param address space with explicit DeclareParam/st.param/ld.param sequences -- O(n) memory operations per argument. The LowerCall function in cicc is 88KB (vs. upstream's few KB) because it must handle four call flavors, monotonic .param naming, and "nvptx-libcall-callee" metadata for synthesized calls.
  • Upstream assumes a small set of intrinsics. Upstream NVPTX intrinsic lowering covers approximately IDs 0-300. CICC's intrinsic mega-switch at sub_33B0210 (343KB) handles IDs up to 14196, covering cp.async, TMA, WGMMA, and the full SM 90/100 tensor operation set. The upstream framework's assumption that intrinsic lowering is a small switch case is off by two orders of magnitude.
  • Upstream assumes vector types are natively supported. CPU targets have native vector registers (XMM/YMM/ZMM, NEON Q-registers). NVPTX has no native vector registers -- most vector operations are marked Custom or Expand, forcing them through 111KB of custom lowering at sub_32E3060. The "legalize then select" pipeline spends most of its time decomposing vectors that never should have been formed.
  • Upstream assumes known-bits propagation is a small target hook. Upstream NVPTX's computeKnownBitsForTargetNode handles fewer than 20 opcodes. CICC's version at sub_33D4EF0 (114KB, 112 opcode cases) propagates bits through texture fetches, address space loads, and NVPTX-specific operations -- a 50x expansion that upstream's hook interface was never designed to support cleanly.

Differences from Upstream LLVM

The NVPTX SelectionDAG backend in cicc v13.0 diverges from upstream LLVM NVPTX in several structural and behavioral ways. This section catalogs the known differences.

Structural Divergences

Monolithic type legalizer. Upstream LLVM splits type legalization across four source files (LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, LegalizeVectorTypes.cpp, LegalizeTypes.cpp). In cicc, all four are collapsed into a single 348KB function (sub_20019C0), likely an LTO artifact. The behavioral result is identical, but the code layout makes the function nearly impossible to patch incrementally.

Dual-address ISel infrastructure. The NVPTX lowering code exists at two address ranges (0x32XXXXX and 0x33XXXXX), with functions at sub_32E3060 (LowerOperation) and sub_3377410 (secondary dispatch) forming a two-level dispatch. Upstream NVPTX uses a single LowerOperation method. The binary has a secondary overflow path for intrinsic IDs that fall outside the main switch range.

142KB NVPTX DAGCombiner. The function sub_3425710 includes "COVERED:" and "INCLUDED:" debug trace strings not present in any upstream LLVM release. This is NVIDIA internal instrumentation for tracking combine coverage during development.

Two inline asm subsystems. The target-independent visitInlineAsm at sub_2079C70 (83KB) and the NVPTX-specific constraint handler at sub_338BA40 (79KB) total 162KB. The upstream NVPTX inline asm support is approximately 200 lines of code. The cicc version is vastly more complex, likely handling NVIDIA-internal PTX inline asm patterns.

Behavioral Divergences

Calling convention. Upstream LLVM NVPTX uses a simplified LowerCall that handles only the standard .param space protocol. CICC's sub_3040BF0 (88KB) adds "nvptx-libcall-callee" metadata for synthesized libcalls, monotonic sequence counters for unique .param names, and four call flavors (with/without prototype x direct/indirect). The upstream has two flavors.

Intrinsic count. The cicc intrinsic lowering switch (sub_33B0210, 343KB) handles intrinsic IDs up to 14196 (0x3774), with dedicated handlers for cp.async/TMA and WGMMA instructions. Upstream LLVM's NVPTX intrinsic lowering covers approximately IDs 0--300. The extended range covers SM 90 (Hopper) and SM 100 (Blackwell) tensor operations.

Vector shuffle lowering. The three-level shuffle lowering (identity detection, BitVector tracking, BUILD_VECTOR fallback) is more sophisticated than upstream NVPTX, which scalarizes all shuffles unconditionally.

Atomic scope awareness. CICC's atomic lowering at sub_3048C30 (86KB) supports CTA/GPU/SYS scope atomics with SM-version gating. Upstream LLVM NVPTX handles basic atomics but lacks the full scope hierarchy.

Known-bits propagation. The NVPTX computeKnownBitsForTargetNode at sub_33D4EF0 (114KB, 112 opcode cases, 399 SDNode accesses, 99 recursive calls) is far more extensive than the upstream version, which typically handles fewer than 20 target-specific opcodes. The cicc version propagates bits through texture fetches, address space loads, and NVPTX-specific operations.

PerformDAGCombine depth. The NVPTX-specific combine at sub_33C0CA0 (62KB) plus the post-legalize combine at sub_32EC4F0 (92KB) total 154KB. Upstream NVPTXISelLowering::PerformDAGCombine is approximately 2KB.

Address space 101. CICC uses address space 101 as an alternative .param encoding (seen in sub_33067C0), which does not exist in upstream LLVM NVPTX. This may be an internal convention for distinguishing kernel .param from device-function .param.

Unchanged from Upstream

The following components appear to be stock LLVM with no NVIDIA modifications:

  • SelectionDAG core infrastructure at 0xF05000--0xF70000 (combining, known-bits, node management)
  • DAG node hashing with ((a3 >> 4) ^ (a3 >> 9)) & (capacity - 1) at sub_F4CEE0
  • Constrained FP intrinsic lowering at sub_F47010 (36KB, "round.tonearest", "fpexcept.ignore")
  • ReplaceAllUsesWith implementation at sub_F162A0
  • All SDNode creation, deduplication, and lifecycle management
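The CSE bucket computation from the second bullet is trivially reproducible; a sketch, assuming the table capacity is kept a power of two (standard for this masking idiom):

```python
def cse_bucket(node_hash, capacity):
    """Bucket index for the SelectionDAG CSE hash table (sub_F4CEE0 pattern).

    capacity must be a power of two for the mask to be valid.
    """
    return ((node_hash >> 4) ^ (node_hash >> 9)) & (capacity - 1)
```

The two shift amounts fold the middle bits of the profile hash into the low bits before masking, which spreads structurally similar nodes across buckets.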

Function Map

Function | Address | Size
SelectionDAGLegalize::LegalizeOp dispatcher (~100 opcodes) | sub_1FCE100 | 91KB
SelectionDAGLegalize action dispatch (967 cases) | sub_1FFB890 | 137KB
Legalization worklist management | sub_1FF5010 | --
ExpandNode fallback | sub_1FF6F70 | --
DAGCombiner::visitNode (6-phase per-node combine) | sub_F20C20 | 64KB
DAGCombiner::combine orchestrator (worklist management) | sub_F681E0 | 65KB
ReplaceAllUsesWith (hash: ((id >> 9) ^ (id >> 4))) | sub_F162A0 | --
Combine pattern matcher (STORE/BITCAST/CONSTANT) | sub_F0F270 | 25.5KB
Target-independent opcode-specific combine dispatcher | sub_100E380 | --
All-constant-operand fold evaluation | sub_1028510 | --
Vector stride / reassociation combine | sub_F15980 | --
Generic computeKnownBits | sub_F5A610 | 36.7KB
Extended known-bits (recursive expansion limit) | sub_F5F040 | 52.4KB
SelectionDAG::getNode / CSE hash table | sub_F4CEE0 | 41.3KB
DAG node builder (operand/result setup) | sub_F49030 | 38.2KB
Constrained FP intrinsic lowering | sub_F47010 | 36.4KB
NVPTXTargetLowering::LowerOperation dispatcher | sub_32E3060 | 111KB
LowerOperation secondary dispatch (overflow) | sub_3377410 | 75KB
NVPTX custom type promotion | sub_32A1EF0 | 109KB
NVPTX post-legalize DAG combine | sub_32EC4F0 | 92KB
NVPTX vector operation splitting | sub_32FE970 | 88KB
NVPTX load/store lowering | sub_32D2680 | 81KB
NVPTX integer/FP legalization | sub_32983B0 | 79KB
NVPTX intrinsic lowering (tex/surf) | sub_32B8A20 | 71KB
NVPTX vector operation lowering | sub_32A9030 | 55KB
NVPTX addrspacecast / pointer lowering | sub_32C3760 | 54KB
NVPTX conditional/select lowering | sub_32BE8D0 | 54KB
NVPTX special register lowering | sub_32B6540 | 50KB
NVPTXTargetLowering::PerformDAGCombine | sub_33C0CA0 | 62KB
NVPTX DAGCombiner with "COVERED"/"INCLUDED" tracing | sub_3425710 | 142KB
NVPTXTargetLowering::LowerCall | sub_3040BF0 | 88KB
NVPTX atomic operation lowering | sub_3048C30 | 86KB
NVPTXTargetLowering constructor (action setup) | sub_3056320 | 45KB
Type legalization table population | sub_3314670 | 73KB
Intrinsic lowering mega-switch | sub_33B0210 | 343KB
NVPTX computeKnownBitsForTargetNode | sub_33D4EF0 | 114KB
NVPTX inline asm constraint handler | sub_338BA40 | 79KB
SelectionDAGBuilder::visitInlineAsm | sub_2079C70 | 83KB
NVPTX visitNVVMTexSurf handler | sub_2077400 | 20KB
NVPTX argument passing / type coercion | sub_2072590 | 38KB
NVPTXDAGToDAGISel::Select driver | sub_3090F90 | 91KB
Address space / memory operation support | sub_33067C0 | 74KB
Global address lowering | sub_331F6A0 | 62KB
Formal arguments / return lowering | sub_3349730 | 82KB
Call lowering (visitCall / LowerCallTo) | sub_332FEA0 | 79KB

Reimplementation Checklist

  1. NVPTXTargetLowering with legality tables. Populate the 2D action table at offset +2422 (259-byte row stride, indexed by 259 * VT + opcode) with per-SM-version legal/custom/expand/promote actions for all ISD opcodes and NVPTX-specific opcodes. Include the condition-code action table at offset +18112 and the SM-gated type legality rules (f16 on SM 53+, v2f16 on SM 70+, bf16 on SM 80+).
  2. LowerOperation dispatcher (111KB equivalent). Implement the master LowerOperation switch dispatching ~3,626 lines of GPU-specific lowering for loads, stores, calls, atomics, vector operations, and address space casts, including the .param-space calling convention with DeclareParam/StoreV1-V4/LoadRetParam sequences.
  3. Intrinsic lowering mega-switch (343KB equivalent). Build the intrinsic lowering function covering 200+ CUDA intrinsic IDs (up to ID 14196/0x3774), organized as a jump table with per-intrinsic lowering handlers for tensor core, warp, surface/texture, and math intrinsics.
  4. PerformDAGCombine for NVPTX. Implement the NVPTX-specific DAG combines (62KB) that run after operation legalization, including load/store vectorization (offset-based coalescing with sorting for ld.v2/ld.v4/st.v2/st.v4 detection), NVPTX-specific algebraic simplifications, and the "COVERED"/"INCLUDED" tracing infrastructure.
  5. ISel::Select pattern matching (91KB equivalent). Implement the top-down instruction selection driver that visits DAG nodes in topological order, matching against NVPTX-specific patterns via opcode-indexed tables, with special handling for tensor core instructions, inline assembly constraints, and multi-result nodes.
  6. computeKnownBits for NVPTX (114KB). Implement the NVPTX-specific known-bits analysis covering ctaid, tid, ntid, address space pointer width constraints, and GPU-specific intrinsic range information to enable downstream optimization.

Cross-References

Type Legalization

Prerequisites: Familiarity with SelectionDAG, NVPTX register classes, and LLVM type system basics. Understanding of the compilation pipeline up to instruction selection is assumed.

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Type legalization is the SelectionDAG phase that rewrites every DAG node whose result or operand type is illegal for the target into equivalent sequences of legal-type operations. In upstream LLVM this logic spans four source files (LegalizeTypes.cpp, LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, LegalizeVectorTypes.cpp) totaling roughly 16,000 lines. In CICC v13.0, NVIDIA ships all of it as a single 348KB monolithic function -- sub_20019C0 -- the largest function in the SelectionDAG address range and among the largest in the entire binary. Operation legalization follows in a separate 169KB function (sub_1FFB890), and vector split/scalarize dispatchers fan out into an additional 25+ worker functions.

The monolithic structure is either an LTO inlining artifact (all four upstream .cpp files collapsed by link-time optimization) or a deliberate choice for branch-prediction locality. The functional behavior is a faithful reproduction of upstream LLVM's DAGTypeLegalizer, but the legality tables, legal-type set, and vector legalization rules are heavily NVPTX-specific.

Type legalizer monolith | sub_20019C0 (348KB, 10,739 lines)
Operation legalizer | sub_1FFB890 (169KB)
SplitVectorResult | sub_2029C10 (dispatcher, 190 cases)
SplitVectorOperand | sub_202E5A0 (dispatcher, 157 cases)
ScalarizeVectorResult | sub_2036110
ScalarizeVectorOperand | sub_2035F80
WidenVector | sub_2036AE0 (31KB, limited NVPTX usage)
ExpandIntegerResult | sub_201BB90 (75KB, 632 case labels)
PromoteIntegerResult | sub_2000100 (45KB)
PerformExpensiveChecks | sub_2010FB0 (62KB, debug verifier)
NVPTXTargetLowering init | sub_3314670 (73KB, table population)
Upstream | LegalizeTypes.cpp, LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, LegalizeVectorTypes.cpp

Pipeline Position

Type legalization runs as the first major SelectionDAG transformation after the initial DAG is built by SelectionDAGBuilder (sub_2081F00). The full sequence:

  1. SelectionDAGBuilder converts LLVM IR to an initial DAG with potentially illegal types
  2. DAG Combiner (sub_F20C20) runs initial combines
  3. DAGTypeLegalizer (sub_20019C0) iterates until all types are legal -- this page
  4. LegalizeDAG (sub_1FFB890) legalizes operations on now-legal types
  5. DAG Combiner runs again to clean up
  6. Instruction selection (sub_3090F90) pattern-matches the final legal DAG

The type legalizer iterates to a fixpoint: each pass may create new nodes with illegal types (e.g., splitting a vector creates two half-width vectors that may themselves be illegal), so the worklist loops until every node in the DAG has only legal result and operand types.

The legal type set is defined in the NVPTXTargetLowering constructor (sub_3314670, 73KB) which populates the action table at offset +2422. NVPTX has a narrow set of legal types dictated by the PTX register file:

Register Class | Legal MVTs
Int1Regs (%p) | i1
Int16Regs (%rs) | i16
Int32Regs (%r) | i32
Int64Regs (%rd) | i64
Float32Regs (%f) | f32
Float64Regs (%fd) | f64
Int16HalfRegs (%h) | f16, bf16
Int32HalfRegs (%hh) | v2f16, v2bf16, v2i16, v4i8
Int128Regs (%rq) | i128 (SM 70+)

For the complete register class table (vtable addresses, PTX types, encoded IDs, copy opcodes) see Register Classes.

The critical constraint: Int32HalfRegs is the only vector register class. It holds exactly 32 bits of packed data. The only legal vector types are those that pack into 32 bits:

  • v2f16 -- two f16 values in one 32-bit register
  • v2bf16 -- two bf16 values (SM 80+)
  • v2i16 -- two i16 values in one 32-bit register
  • v4i8 -- four i8 values in one 32-bit register

Every other vector type (v4f32, v2f32, v8i32, v4f16, v2f64, etc.) is illegal and must be split, scalarized, or expanded during type legalization. There is no packed float32 SIMD on NVPTX -- this is a fundamental architectural constraint.
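The 32-bit packing constraint can be stated as a predicate. A sketch (the SM gating follows the legality rules described on this page; v4i8 and v2i16 are treated as ungated here, which is an assumption):

```python
# Element bit widths for scalar types that can appear in packed vectors.
ELEM_BITS = {"i8": 8, "i16": 16, "f16": 16, "bf16": 16, "f32": 32, "f64": 64}

def is_legal_packed_vector(elem, count, sm):
    """A vector type is legal only if it packs into exactly one 32-bit
    register (Int32HalfRegs) and the SM version supports the element type."""
    if ELEM_BITS[elem] * count != 32:
        return False                 # must be exactly 32 bits of packed data
    if elem == "bf16":
        return sm >= 80              # v2bf16 requires SM 80+
    if elem == "f16":
        return sm >= 70              # packed v2f16 arithmetic requires SM 70+
    return elem in ("i8", "i16")     # v4i8, v2i16
```

Any (elem, count) pair failing this predicate is split, scalarized, or expanded by the type legalizer.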

SM-Gated Type Legality

The legal type set changes with the SM version. The constructor at sub_3314670 queries subtarget features and conditionally marks types legal or illegal:

SM Range | Legal Types Added | Legalization Change
SM < 53 | (base: i1, i16, i32, i64, f32, f64) | f16 ops promoted to f32; no legal vectors
SM 53--69 | Scalar f16 | v2f16 legal for ld/st but packed arithmetic is Custom/Expand
SM 70+ | v2f16 packed arithmetic, i128 | f16x2 PTX instructions (add.f16x2, mul.f16x2, fma.rn.f16x2)
SM 80+ | v2bf16 | bf16x2 PTX instructions
SM 100+ | e2m1x2 (FP4), e2m3x2 (FP6), e3m2x2 (FP6), ue8m0x2 | Additional packed narrow FP types for tensor core feeders

On SM 70+, v2f16 operations marked Legal or Custom in the action table map directly to packed PTX instructions, delivering 2x throughput versus scalarized f16. This is why CUDA __half2 operations are efficient: the type stays packed through the entire pipeline. In contrast, float4 is always fully scalarized to four independent f32 operations on every SM generation.

The Legality Table

Primary Action Table (offset +2422)

The core data structure is a 2D array inside NVPTXTargetLowering:

action = *(uint8_t *)(TLI + 259 * VT + opcode + 2422)

Where:

  • TLI = pointer to NVPTXTargetLowering object (loaded from this->TLI at a1[1])
  • VT = SimpleVT enum value (1--10 for scalar types, 14--109 for vector types)
  • opcode = ISD opcode (0--258), capped at 0x102 by a guard check
  • 259 = row stride (256 generic opcodes + 3 metadata bytes per VT row)

The action byte encodes:

Value | Action | Meaning
0 | Legal | Node is natively supported -- return immediately
1 | Custom | Call NVPTXTargetLowering::LowerOperation (vtable slot #164, offset +1312)
2 | Expand | Call LegalizeTypes, then ExpandNode (sub_1FF6F70) as fallback
3 | LibCall | Call ExpandNode directly for library-call substitution
4 | Promote | Find a larger legal type and rebuild the node at that type

The legality check uses (action & 0xFB) == 0 as the "legal" predicate. This means bit 2 is a don't-care -- a node with action byte 0x04 is still treated as legal in certain fast-path checks, which is the standard LLVM encoding where bit 2 flags "custom-but-legal" operations.
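The table lookup and the fast-path predicate reduce to a few lines. A sketch over a flat byte table (the real table lives at offset +2422 inside the TLI object; the +2422 base is dropped here for clarity):

```python
LEGAL, CUSTOM, EXPAND, LIBCALL, PROMOTE = 0, 1, 2, 3, 4
ROW_STRIDE = 259   # 256 generic opcodes + 3 metadata bytes per VT row

def get_operation_action(table, vt, opcode):
    """Flat 2D lookup mirroring *(uint8_t *)(TLI + 259 * VT + opcode + 2422)."""
    return table[ROW_STRIDE * vt + opcode]

def is_legal_fast_path(action):
    """Bit 2 (0x04) is a don't-care in the fast-path legality predicate."""
    return (action & 0xFB) == 0
```

Note that `is_legal_fast_path(0x04)` is true: an action byte of 0x04 still passes the fast-path check, exactly as described above.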

Type-Supported Flag Array (offset +120)

A second structure at TLI + 8*VT + 120 is a pointer array: non-null means the type VT is natively supported by the target. This provides a fast "is this type legal at all?" check before the per-opcode lookup.

Promotion Action Table (offset +2681)

A 1D table indexed by opcode only (no VT dimension):

action = *(uint8_t *)(TLI + opcode + 2681)

Used for four specific opcodes: BSWAP (43), CTLZ (44), CTTZ (45), and BITREVERSE (199). Also used for opcode 204 (CONCAT_VECTORS) when the operand type is zero. This table encodes whether these operations should be promoted regardless of operand type.

FSINCOS Action Table (offset +3976)

Another 1D table for FSINCOS (opcode 211):

action = *(uint8_t *)(TLI + opcode + 3976)

FSINCOS has unique legalization requirements because it produces two results (sin and cos simultaneously).

Condition Code Action Table (offset +18112)

A packed 4-bit nibble table for condition-code-dependent operations (FP_TO_SINT, FP_TO_UINT, SELECT_CC, BR_CC):

base   = (VT_id >> 3) + 15 * condcode_type + 18112
action = (*(uint32_t *)(TLI + base * 4 + 12) >> (4 * (VT_id & 7))) & 0xF

The 15-entry stride per condition code allows per-CC/per-VT legalization decisions. Each nibble stores a 4-bit action code, so two VT actions pack into one byte. This is the standard LLVM condition-code action encoding, but the table is populated with NVPTX-specific rules (e.g., PTX's limited set of comparison predicates determines which CCs are legal for which types).
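The nibble packing is easy to get wrong when reimplementing, so here is a sketch of both directions over a plain word array (the base-offset arithmetic against TLI is dropped; `set_cc_action` is a hypothetical helper for populating the table):

```python
def cc_action(words, vt_id, condcode_type):
    """Extract a 4-bit action nibble; eight VT entries pack per 32-bit word."""
    word_index = (vt_id >> 3) + 15 * condcode_type   # 15-word stride per CC
    return (words[word_index] >> (4 * (vt_id & 7))) & 0xF

def set_cc_action(words, vt_id, condcode_type, action):
    """Inverse operation: store a 4-bit action for a (VT, CC) pair."""
    word_index = (vt_id >> 3) + 15 * condcode_type
    shift = 4 * (vt_id & 7)
    words[word_index] = (words[word_index] & ~(0xF << shift)) | (action << shift)
```

The `vt_id >> 3` / `vt_id & 7` split selects the word and the nibble within it, matching the decompiled address arithmetic shown above.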

SimpleVT Type Encoding

Types throughout the legalizer are encoded as a single byte, the SimpleVT enum:

SimpleVT | Type
0 | extended/custom
1 | i1
2 | i2 (rare)
3 | i8
4 | i16
5 | i32
6 | i64
7 | i128
8 | f16
9 | f32
10 | f64
14--55 | fixed-width vectors
56--109 | scalable vectors

The bitwidth-to-SimpleVT conversion pattern appears as a recurring code fragment at least 11 times in sub_20019C0:

// Reconstructed from decompilation -- 11 instances in the function
if (bits == 32)       VT = 5;  // i32
else if (bits > 32) { VT = 6;  // i64 tentative
  if (bits != 64) { VT = 0;    // extended type
    if (bits == 128) VT = 7;   // i128
  }
} else {
  VT = 3;                      // i8 tentative
  if (bits != 8) VT = 4 * (bits == 16);  // i16 or 0
}

The vector type range 14--109 maps to scalar element types through a ~100-case switch block that also appears six times in the function body:

MVT Range | Scalar Element | Description
14--23 | i2 (VT 2) | Fixed-width v2i2..v1024i2
24--32 | i8 (VT 3) | Fixed-width v2i8..v256i8
33--40 | i16 (VT 4) | Fixed-width v2i16..v64i16
41--48 | i32 (VT 5) | Fixed-width v2i32..v64i32
49--54 | i64 (VT 6) | Fixed-width v2i64..v32i64
55 | i128 (VT 7) | Fixed-width v2i128
56--61 | i2 (VT 2) | Scalable nxv2i2..nxv64i2
62--67 | i8 (VT 3) | Scalable nxv2i8..nxv64i8
68--73 | i16 (VT 4) | Scalable nxv2i16..nxv64i16
74--79 | i32 (VT 5) | Scalable nxv2i32..nxv64i32
80--85 | i64 (VT 6) | Scalable nxv2i64..nxv64i64
86--88 | f16 (VT 8) | Scalable nxv2f16..nxv8f16
89--93 | f32 (VT 9) | Scalable nxv2f32..nxv32f32
94--97 | f64 (VT 10) | Scalable nxv2f64..nxv16f64
98--100 | f16 (VT 8) | Fixed-width v2f16..v8f16 (additional)
101--105 | f32 (VT 9) | Fixed-width v2f32..v32f32 (additional)
106--109 | f64 (VT 10) | Fixed-width v2f64..v16f64 (additional)

This switch implements getVectorElementType() on the decompiled SimpleVT enum. Its six-fold repetition in the monolith accounts for a significant fraction of the function's 348KB size.
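For a reimplementation, the ~100-case switch collapses to a range table. A sketch condensed directly from the table above:

```python
# (first MVT, last MVT, scalar element VT), condensed from the range table.
VECTOR_ELEMENT_RANGES = [
    (14, 23, 2), (24, 32, 3), (33, 40, 4), (41, 48, 5), (49, 54, 6), (55, 55, 7),
    (56, 61, 2), (62, 67, 3), (68, 73, 4), (74, 79, 5), (80, 85, 6),
    (86, 88, 8), (89, 93, 9), (94, 97, 10),
    (98, 100, 8), (101, 105, 9), (106, 109, 10),
]

def vector_element_type(mvt):
    """Range lookup replacing the repeated getVectorElementType() switch."""
    for lo, hi, elem in VECTOR_ELEMENT_RANGES:
        if lo <= mvt <= hi:
            return elem
    raise ValueError(f"MVT {mvt} is not a vector type")
```

A table-driven lookup like this avoids the six-fold code duplication that inflates the monolith.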

The Four Legalization Actions

Promote (Type Widening)

Promotion widens a narrow type to the nearest legal register width. The pattern is consistent across integer and FP promotion:

promoted_vt = TLI.getTypeToPromoteTo(opcode, VT)       // sub_1F40B60
extended    = DAG.getNode(ANY_EXTEND, DL, promoted_vt, input)   // opcode 143
result      = DAG.getNode(original_op, DL, promoted_vt, extended, ...)
truncated   = DAG.getNode(TRUNCATE, DL, original_vt, result)   // opcode 145

For integer promotion, ANY_EXTEND (opcode 143) or ZERO_EXTEND (opcode 144) widens the input depending on whether the high bits need defined values (unsigned operations use ZERO_EXTEND). For FP promotion, the pattern uses FP_EXTEND/FP_ROUND instead:

ext0 = DAG.getNode(FP_EXTEND, DL, promoted_vt, op0)
ext1 = DAG.getNode(FP_EXTEND, DL, promoted_vt, op1)
res  = DAG.getNode(FADD, DL, promoted_vt, ext0, ext1)
out  = DAG.getNode(FP_ROUND, DL, original_vt, res)

The promote path in sub_1FFB890 contains approximately 30 opcode-specific expansion strategies. The custom-promotion BST (red-black tree at TLI + 9257/9258) stores (opcode, VT) pairs that override the default promotion target. When no BST entry exists, a linear scan walks upward from the current VT until it finds a type where the action is not Custom (i.e., Legal or Expand).

Expand (Type Splitting)

Expansion splits a wide type into two halves and reassembles the result:

// i128 ADD expansion (simplified)
lo_a = DAG.getNode(EXTRACT_ELEMENT, DL, i64, a, 0)   // low half
hi_a = DAG.getNode(EXTRACT_ELEMENT, DL, i64, a, 1)   // high half
lo_b = DAG.getNode(EXTRACT_ELEMENT, DL, i64, b, 0)
hi_b = DAG.getNode(EXTRACT_ELEMENT, DL, i64, b, 1)
lo_r = DAG.getNode(ADD, DL, i64, lo_a, lo_b)
carry = ...  // carry detection via SETCC
hi_r = DAG.getNode(ADD, DL, i64, hi_a, hi_b)
hi_r = DAG.getNode(ADD, DL, i64, hi_r, carry)
result = DAG.getNode(BUILD_PAIR, DL, i128, lo_r, hi_r)
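The elided carry step is the standard double-width-add idiom: an unsigned compare of the low-half result against one of its inputs. A Python sketch of the whole expansion (the exact SETCC form cicc emits is not confirmed from the binary; this is the conventional lowering):

```python
MASK64 = (1 << 64) - 1

def add_i128(a, b):
    """Two-halves i128 addition as the expander builds it: 64-bit lo add,
    carry via unsigned compare, hi add plus carry, BUILD_PAIR reassembly."""
    lo_a, hi_a = a & MASK64, (a >> 64) & MASK64   # EXTRACT_ELEMENT 0 / 1
    lo_b, hi_b = b & MASK64, (b >> 64) & MASK64
    lo_r = (lo_a + lo_b) & MASK64
    carry = 1 if lo_r < lo_a else 0               # SETCC ult: wrapped => carry
    hi_r = (hi_a + hi_b + carry) & MASK64
    return (hi_r << 64) | lo_r                    # BUILD_PAIR
```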

For CTLZ (case 53), expansion builds an all-ones mask, AND chain, and shift sequence. For SINT_TO_FP/UINT_TO_FP (cases 59/60), the helper sub_20B5C20 performs iterative two-way splitting: it finds the half-type, builds the pair, and recursively legalizes each half.

The ExpandIntegerResult handler at sub_201BB90 (75KB, 632 case labels) is itself a major function that dispatches expansion for specific opcodes including STORE (case 77), shifts (81--93), and atomics.

Soften (Float-to-Integer Emulation)

Softening converts unsupported FP operations to integer-based library call sequences. On NVPTX this primarily affects f128 (which has no hardware support on any SM) and f16 on SM < 53. The softened path at sub_2019DA0 (18KB) dispatches via the SoftenedFloats DenseMap.

The FADD/FMUL cases (74/75 in the main switch) compute twice the bit width, find the promoted FP type, and build SUB (opcode 54) / SRL (opcode 123) chains that implement the FP operation in integer arithmetic.

Scalarize and Split Vector

Vector legalization proceeds through recursive halving:

v8f32  -> split -> 2x v4f32
v4f32  -> split -> 2x v2f32
v2f32  -> scalarize -> 2x f32    (v2f32 is NOT legal on NVPTX)

v4f16  -> split -> 2x v2f16     (LEGAL on SM 70+ -- stops here)
v8f16  -> split -> 2x v4f16 -> 4x v2f16

v4i8   -> LEGAL (packed in Int32HalfRegs, no split needed)
v8i8   -> split -> 2x v4i8     (one split, then legal)

The splitting strategy follows LLVM's standard approach:

  1. Determine half type: v4f32 splits to v2f32 via EVT::getVectorVT(scalar_element, count/2) (sub_1F58CC0)
  2. Split operands: Look up the SplitVectors DenseMap to get {Lo, Hi} halves from the input's own legalization
  3. Apply operation: Lo_result = DAG.getNode(opcode, DL, half_type, Lo_op1, Lo_op2), and similarly for Hi
  4. Record result: Store {Lo_result, Hi_result} in the SplitVectors DenseMap via sub_20167D0

The critical observation for NVPTX: v2f32 is not legal (no 64-bit packed float register class), so v4f32 ends up fully scalarized to 4x f32. In contrast, v4f16 on SM 70+ splits to 2x v2f16 which is legal, enabling the f16x2 packed instruction path.
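The recursive-halving behavior described above can be sketched as a small function (the legal-type set here models SM 70+; real legalization tracks MVT codes and DenseMap state, not tuples):

```python
# Packed vector types that are legal on SM 70+ (per the register class table).
LEGAL_VECTORS_SM70 = {("f16", 2), ("i16", 2), ("i8", 4)}

def legalize_vector(elem, count, legal=LEGAL_VECTORS_SM70):
    """Recursive halving: split until legal, scalarize when no packed type fits."""
    if (elem, count) in legal:
        return [(elem, count)]             # legal packed type -- stop here
    if count == 1:
        return [(elem, 1)]                 # already scalar
    half = count // 2
    if half == 1 and (elem, 2) not in legal:
        return [(elem, 1)] * count         # v2 of an unpackable element: scalarize
    return legalize_vector(elem, half, legal) * 2
```

Running this on v4f32 yields four f32 scalars, while v4f16 stops at two v2f16 halves, reproducing the f16x2 observation above.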

Master Opcode Dispatch (sub_20019C0)

The main body of sub_20019C0 is a switch on *(int16_t *)(node + 24) -- the ISD opcode of the current SDNode. Approximately 50 cases are handled:

Case | ISD Opcode | Action
10 | LOAD | legalizeLoad -- type-aware load splitting
11 | STORE | Iterative type demotion loop (see below)
20--21, 26 | Generic arithmetic | Promote via sub_1D38BB0 (getConstant)
27 | EXTRACT_ELEMENT | Split + re-extract
29 | BUILD_PAIR | Promote to i32
48 | BITCAST | Promote or expand depending on isSimple()
49 | EXTRACT_SUBVECTOR | Extract + rebuild via TRUNCATE (opcode 145)
50 | INSERT_SUBVECTOR | Low/upper split via ANY_EXTEND (143) / ZERO_EXTEND_INREG (144)
51 | CONCAT_VECTORS | Iterate operands, copy each to result list
53 | CTLZ / CTPOP | Expand via mask-then-shift (AND=120, ADD=52)
54 | ATOMIC_CMP_SWAP | Full promote path: check legality table, fallback to libcall
55--56 | SIGN_EXTEND_INREG / SMIN | Legality check via TLI + 259*VT + opcode + 2422
57--58 | FP_TO_SINT / FP_TO_UINT | Chain of promote + expand nodes
59--60 | SINT_TO_FP / UINT_TO_FP | Iterative split via sub_20B5C20
70, 72 | FMINNUM / FMAXNUM | BUILD_PAIR (opcode 0x89) reassembly
74--75 | FADD / FMUL | Promote to wider FP type
77 | FMA | Extend operands, FMA at wider type, round back
105 | BUILD_VECTOR | Delegate to sub_1FEC5F0
106 | EXTRACT_VECTOR_ELT | Check vector element count, dispatch
108 | MGATHER / MSCATTER | Load/store with alignment fixup via sub_20BD400
110 | VSELECT | Element-by-element type demotion loop
112--113 | SETCC | Legality check with swapped-direction fallback
114--117 | VECREDUCE_* | Opcode lookup in dword_42FEAE0, chain to VECREDUCE
122--124 | SHL / SRL / SRA | Iterative width expansion
125--126 | ROTL / ROTR | 4-way split: shift + mask + OR
136 | BR_CC | Uses CC action table at offset +18112
152 | ATOMIC_LOAD_* | Delegate to sub_20B7F50 (atomic promote)
153 | ATOMIC_CMP_SWAP_WITH_SUCCESS | Full CAS expansion with APInt mask
199--200 | INTRINSIC_W_CHAIN / INTRINSIC_WO_CHAIN | TLI+112 check, intrinsic lowering dispatch
211 | UNDEF | Replicate zero-constant to fill operand count
243 | TOKEN_FACTOR | Duplicate single operand to all slots

Cases not listed fall through to LABEL_25 (node already legal or handled by a different legalization category).

Store Iterative Demotion (Case 11)

The STORE case contains an explicit type-walking loop that searches downward for a legal store type:

// Reconstructed from case 11, lines ~2077-2095
while ((vt_byte - 8) > 1) {          // while VT is not f16(8) or f32(9)
    --vt_byte;                        // try next smaller type
    if (TLI.getTypeAction(VT))        // sub_1D16180
        if (TLI.isOperationLegal(STORE, VT))
            break;                    // found a legal store type
}

This walks i64 -> i32 -> i16 -> i8 (or f64 -> f32 -> f16) until it finds a type the target can store natively, then emits a truncating store sequence via sub_1D3C080 (getTruncStore).
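A simplified, runnable version of that demotion walk (the decompiled loop condition uses unsigned byte arithmetic; this sketch keeps only the observable behavior, with i8 as the explicit floor):

```python
# SimpleVT codes: i8=3, i16=4, i32=5, i64=6, f16=8, f32=9, f64=10
def find_legal_store_type(vt, is_store_legal):
    """Case-11 demotion sketch: step down through smaller SimpleVT codes
    until the target reports a natively legal store type."""
    while vt > 3:                  # i8 (VT 3) is the floor of the walk
        vt -= 1
        if is_store_legal(vt):
            return vt              # found -- caller emits a truncating store
    return vt
```

For example, an i64 store on a target that can only store i32 natively resolves to VT 5, and the legalizer then emits the truncating store sequence via getTruncStore.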

Atomic CAS Expansion (Cases 54, 153)

Atomic operations receive extensive legalization because PTX has limited atomic type support. The CAS expansion at case 153 (ATOMIC_CMP_SWAP_WITH_SUCCESS) builds APInt masks via sub_16A4EF0, constructs compare-and-swap loops, and handles the success flag as a separate result. The helper sub_20B7E10 decides whether to use a CAS loop or a direct atomic based on the target SM's capabilities.

Vector Legalization Workers

SplitVectorResult (sub_2029C10)

This thin dispatcher reads the opcode from *(uint16_t *)(node + 0x18), subtracts base 0x30 (48), and dispatches across 190 cases (opcodes 48--237) to SplitVecRes_XXX workers. Key handler categories:

Handler | Cases | Description
sub_20230C0 | FADD--FREM, SHL/SRA/SRL, int arith | Generic binary op split: split both inputs, apply op to each half
sub_2028A10 | CONCAT, INSERT_ELT, load/store variants | Unary/multi-input split with reassembly
sub_2025910 | Strict FP (cases 81--98) | Strict FP split with exception chain propagation
sub_2023B70 | BUILD_VECTOR (case 104) | Split BUILD_VECTOR into two half-width constructs
sub_2023F80 | CONCAT inner (case 107) | Trivial: return two operands as Lo and Hi
sub_20293A0 | VECTOR_SHUFFLE (case 110, 10KB) | Decompose shuffle into sub-shuffles on half-width vectors
sub_20251A0 | VSELECT, EXTRACT_ELT | Split condition mask along with operands
sub_2025380 | Extending loads (cases 149--151) | Split load into two half-width loads

Four handlers in the 0x214xxxx range are NVPTX-specific split workers not present in upstream:

Handler | Opcode | NVPTX-Specific Behavior
sub_2146BB0 | CONCAT_VECTORS | Checks VT range 0x0E--0x6D for packed-type dispatch
sub_2146C90 | SELECT_CC / BR_CC (2.7KB) | Multi-operand split with per-operand type classification
sub_2147770 | FP_ROUND-like | NVPTX-specific FP rounding split
sub_2147AE0 | BITCAST | NVPTX-specific bitcast split for packed registers

After a handler returns, the dispatcher stores the {Lo, Hi} result pair in the SplitVectors DenseMap via sub_20167D0 (hash = 37 * key, quadratic probing, rehash at 75% load).

Fatal error on unhandled opcode: "Do not know how to split the result of this operator!" via sub_16BD130.
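A sketch of the SplitVectors map's probing scheme as described (hash = 37 * key, quadratic probing via triangular steps, rehash at 75% load; the class name and API are hypothetical):

```python
class SplitMap:
    """Open-addressing map: hash = 37 * key, quadratic probing, 75% load cap.

    Capacity is kept a power of two so triangular probing visits every slot.
    """
    def __init__(self, capacity=16):
        self.slots = [None] * capacity
        self.count = 0

    def _probe(self, key):
        cap = len(self.slots)
        i = (37 * key) & (cap - 1)
        step = 0
        while self.slots[i] is not None and self.slots[i][0] != key:
            step += 1
            i = (i + step) & (cap - 1)   # quadratic (triangular) probe sequence
        return i

    def put(self, key, lo_hi):
        if (self.count + 1) * 4 > len(self.slots) * 3:   # rehash at 75% load
            old = [s for s in self.slots if s is not None]
            self.slots = [None] * (len(self.slots) * 2)
            self.count = 0
            for k, v in old:
                self.put(k, v)
        i = self._probe(key)
        if self.slots[i] is None:
            self.count += 1
        self.slots[i] = (key, lo_hi)     # store the {Lo, Hi} pair

    def get(self, key):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot else None
```

In the binary, keys are SDValue identities and the payload is the {Lo, Hi} half pair recorded by sub_20167D0.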

SplitVectorOperand (sub_202E5A0)

Same dispatch pattern as SplitVectorResult but for operand-side legalization. Base opcode 0x65 (101), range 157 (opcodes 101--258). Notable inline handling for FP_EXTEND/FP_ROUND (cases 146--147, 152--153) that compares source and destination type sizes to choose the correct split strategy:

// Inline in SplitVectorOperand, cases 146-147
src_size = getSizeInBits(src_vt);    // sub_2021900
dst_size = getSizeInBits(dst_vt);
if (dst_size < src_size)
    SplitVecOp_VSELECT(...)          // sub_202D8A0 -- shrinking
else
    SplitVecOp_Generic(...)          // sub_202A670 -- standard split

After the handler, ReplaceAllUsesOfValueWith (sub_2013400) substitutes the old node with the split result.

Scalarize and Widen

ScalarizeVectorResult (sub_2036110) handles vector types that reduce to scalar. ScalarizeVectorOperand (sub_2035F80) has 80 cases starting from base opcode 106. These cover the final step when splitting has reduced a vector to width 1 or 2 elements, and those elements must become individual scalars.

WidenVector (sub_2036AE0, 31KB) sees limited use on NVPTX. Widening is only useful when the wider type is legal:

  • Widening v1f16 to v2f16 is useful (promotes to legal packed type)
  • Widening v3i8 to v4i8 is useful (promotes to legal packed type)
  • Widening v3f32 to v4f32 is not useful (v4f32 is still illegal)

The WidenVector path uses the MVT lookup table at word_4305480 to determine element counts and find the nearest wider legal vector type.

Operation Legalization (sub_1FFB890)

After type legalization, operation legalization processes each node through a per-opcode action lookup. The same primary action table is used:

action = *(uint8_t *)(TLI + 259 * VT + opcode + 2422)

The dispatch:

Action | Code | Path
Legal | 0 | Return immediately
Custom | 1 | TLI->LowerOperation(node, DAG) via vtable slot #164 (offset +1312)
Expand | 2 | sub_20019C0 (LegalizeTypes), then sub_1FF6F70 (ExpandNode) as fallback
LibCall | 3 | sub_1FF6F70 (ExpandNode) directly
Promote | 4 | Find larger legal type, rebuild node
Special | 5+ | sub_1FF9780 (ExpandLoad) or sub_1FF5310 (LegalizeLoadOps) for load/store variants

When Custom lowering returns NULL, the framework falls through to expansion. When it returns a different node, ReplaceAllUsesWith splices the replacement into the DAG and marks the old node dead (tombstone value -2 in the worklist hash set).

The operation legalizer also contains an outer switch on the ISD opcode (v11 = *(uint16_t *)(node + 24)) for opcode-specific handling before the table lookup. Shift/rotate opcodes (81--98) are remapped to internal opcode numbers before the table lookup (e.g., case 81 maps to internal opcode 76, case 82 to 77). The opcode-specific dispatch covers approximately 30 opcode groups.
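The per-node dispatch, including the Custom-returns-NULL fallthrough, can be sketched as follows (callback names are hypothetical stand-ins for the subroutines named above):

```python
LEGAL, CUSTOM, EXPAND, LIBCALL, PROMOTE = range(5)

def legalize_op(node, action, lower_operation, expand_node, promote_node):
    """Action dispatch for one node. Custom lowering may decline by
    returning None, in which case legalization falls through to expansion."""
    if action == LEGAL:
        return node                        # natively supported
    if action == CUSTOM:
        replacement = lower_operation(node)
        if replacement is not None:
            return replacement             # RAUW splices this into the DAG
        action = EXPAND                    # NULL return: fall through
    if action in (EXPAND, LIBCALL):
        return expand_node(node)
    if action == PROMOTE:
        return promote_node(node)
    raise ValueError(f"unknown action {action}")
```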

How CUDA Vector Types Get Legalized

Tracing common CUDA types through the full legalization pipeline:

float4 (v4f32) -- fully scalarized on every SM:

  1. SplitVectorResult: v4f32 -> 2x v2f32
  2. ScalarizeVectorResult: v2f32 -> 2x f32 (no packed f32 register class)
  3. Final: 4 independent f32 scalar operations
  4. PTX: 4 separate add.f32 / mul.f32 instructions

half2 (__half2 / v2f16) -- stays packed on SM 70+:

  1. Legal type, no splitting needed
  2. Final: single v2f16 packed operation
  3. PTX: add.f16x2, mul.f16x2, fma.rn.f16x2

__nv_bfloat162 (v2bf16) -- legal on SM 80+:

  1. Same as half2 but with bf16x2 PTX instructions

float2 (v2f32) -- scalarized, not packed:

  1. ScalarizeVectorResult: v2f32 -> 2x f32
  2. No 64-bit packed float register class exists

v4f16 on SM 70+:

  1. SplitVectorResult: v4f16 -> 2x v2f16 (legal -- stops here)
  2. Final: 2x f16x2 packed operations (2x throughput vs scalarized)

v4f16 on SM < 53:

  1. Split: v4f16 -> 2x v2f16
  2. Scalarize: each v2f16 -> 2x f16
  3. Promote: each f16 -> FP_EXTEND -> f32
  4. Final: 4x f32 operations with FP_EXTEND/FP_ROUND wrappers

double2 (v2f64):

  1. Scalarize: v2f64 -> 2x f64 (splitting would give v1f64 which is scalar)

Tensor core fragments bypass vector legalization entirely. WMMA/MMA intrinsics represent matrix fragments as individual scalar registers, not LLVM vector types. However, packed conversion types used with tensor cores (e4m3x2, e5m2x2, e2m1x2, etc.) do pass through legalization and map to Int32HalfRegs.
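The traces above follow a simple recursive rule: split illegal vectors in half until a legal packed type or a scalar remains. A sketch of that rule (SM gating reduced to the thresholds named on this page; the sub-SM-53 f16-to-f32 promotion and the v4i8 packed case are omitted for brevity):

```python
# Sketch of NVPTX vector legalization per the traces above. Simplified:
# only the x2 packed types are modeled, and f16 promotion below SM 53
# is left out. Not recovered code -- an illustration of the recursion.
def legalize_vector(elem, count, sm):
    """Return the list of operations a <count x elem> op legalizes to."""
    packed_ok = {"f16": sm >= 70, "bf16": sm >= 80, "i16": True}
    work, out = [(elem, count)], []
    while work:
        e, n = work.pop()
        if n == 1:
            out.append(e)                          # plain scalar op
        elif n == 2 and packed_ok.get(e, False):
            out.append(e + "x2")                   # stays packed
        elif n == 2:
            out += [e, e]                          # ScalarizeVectorResult
        else:
            work += [(e, n // 2), (e, n - n // 2)] # SplitVectorResult
    return sorted(out)

assert legalize_vector("f32", 4, 90) == ["f32"] * 4        # float4 scalarizes
assert legalize_vector("f16", 2, 70) == ["f16x2"]          # half2 stays packed
assert legalize_vector("f16", 4, 70) == ["f16x2", "f16x2"] # v4f16 -> 2x f16x2
assert legalize_vector("bf16", 2, 75) == ["bf16", "bf16"]  # bf16x2 needs SM 80
```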

Verification Infrastructure

sub_2010FB0 (62KB) implements DAGTypeLegalizer::PerformExpensiveChecks, gated by the enable-legalize-types-checking flag (registered at ctor_341). It validates nine DenseMap categories that track the state of every legalized value:

Map | Content
PromotedIntegers | Values widened to a larger integer type
ExpandedIntegers | Values split into two halves
SoftenedFloats | FP values converted to integer representation
PromotedFloats | FP values widened to a larger FP type
ExpandedFloats | FP values split into halves
ScalarizedVectors | Vectors reduced to scalar elements
SplitVectors | Vectors split into {Lo, Hi} pairs
WidenedVectors | Vectors widened to a larger legal type
ReplacedValues | Values replaced by RAUW

Diagnostic strings on verification failure: "Processed value not in any map!", "Value in multiple maps!", "Value with legal type was transformed!".
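The invariant these maps enforce -- every processed value lives in exactly one map -- can be sketched directly (map names from the table above; the checker itself is illustrative, not recovered code):

```python
# Sketch of the PerformExpensiveChecks invariant: every processed value
# must appear in exactly one of the nine tracking maps.
MAPS = ["PromotedIntegers", "ExpandedIntegers", "SoftenedFloats",
        "PromotedFloats", "ExpandedFloats", "ScalarizedVectors",
        "SplitVectors", "WidenedVectors", "ReplacedValues"]

def check_value(value, maps):
    """Return the diagnostic cicc would emit, or None if the value is fine."""
    hits = [m for m in MAPS if value in maps.get(m, set())]
    if not hits:
        return "Processed value not in any map!"
    if len(hits) > 1:
        return "Value in multiple maps!"
    return None

state = {"SplitVectors": {"v0"}, "PromotedIntegers": {"v0", "v1"}}
assert check_value("v0", state) == "Value in multiple maps!"
assert check_value("v1", state) is None
assert check_value("v2", state) == "Processed value not in any map!"
```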

DAG Node Builder Subroutines

Key subroutines called from the type legalizer for constructing replacement DAG nodes:

Function | Upstream Equivalent | Notes
sub_1D309E0 | DAG.getNode(opc, DL, VT, op) | 1-operand (TRUNCATE, ANY_EXTEND, etc.)
sub_1D332F0 | DAG.getNode(opc, DL, VT, op1, op2) | 2-operand
sub_1D3A900 | DAG.getNode(opc, DL, VT, op1, op2, op3) | 3-operand (FMA)
sub_1D38BB0 | DAG.getConstant(val, DL, VT) | Integer constant creation
sub_1D38970 | DAG.getConstant(APInt) | Wide constant / all-ones mask
sub_1D364E0 | DAG.getUNDEF(VT) | Undefined value
sub_1D37440 | DAG.getSetCC(DL, VT, LHS, RHS, CC) | Comparison node
sub_1D36A20 | DAG.getSelectCC(DL, VT, ..., CC) | Select-on-comparison
sub_1D3BC50 | DAG.getExtLoad(opc, DL, VT, ...) | Extending load
sub_1D3C080 | DAG.getTruncStore(...) | Truncating store
sub_1D23890 | DAG.ReplaceAllUsesWith(old, new) | RAUW for result replacement
sub_1FEB8F0 | MVT::getSizeInBits(SimpleVT) | Bit width from SimpleVT
sub_1F58D40 | EVT::getSizeInBits() | Bit width from extended VT
sub_1F58D30 | EVT::getVectorNumElements() | Vector element count
sub_1F40B60 | TLI.getTypeToPromoteTo(opc, VT) | Promotion target lookup
sub_1D16180 | TLI.getTypeAction(VT) | Action for type
sub_1D16EF0 | TLI.getCondCodeAction(CC, VT) | Condition code legality

Result Accumulation and Worklist

Results from each legalization step are accumulated into a SmallVector of {SDValue, SDValue} pairs (node pointer + result index). The vector grows via sub_16CD150 (SmallVector::grow()) when count exceeds capacity. After each pass, new nodes feed back into the worklist for iterative re-legalization until fixpoint -- all types are legal.

The worklist hash set uses open addressing with hash function ((id >> 9) ^ (id >> 4)) & (size - 1) and grows at 75% load factor. Dead nodes are marked with sentinel -2 (tombstone). The DenseMap instances used by the split/scalarize infrastructure use hash 37 * key with quadratic probing.
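A Python sketch of the recovered worklist hash set, using the stated hash function, tombstone sentinel, and growth trigger (linear probing and the lookup-stops-at-tombstone simplification are assumptions of this model, not recovered behavior):

```python
# Model of the worklist hash set: open addressing with
# hash ((id >> 9) ^ (id >> 4)) & (size - 1), tombstone -2, growth at 75%.
TOMBSTONE = -2
EMPTY = -1

class Worklist:
    def __init__(self, size=16):
        self.slots = [EMPTY] * size

    def _probe(self, node_id):
        mask = len(self.slots) - 1
        slot = ((node_id >> 9) ^ (node_id >> 4)) & mask
        # Simplified: a production table keeps probing past tombstones
        # on lookup; stopping at one is good enough for this sketch.
        while self.slots[slot] not in (EMPTY, TOMBSTONE, node_id):
            slot = (slot + 1) & mask          # linear probing (assumed)
        return slot

    def add(self, node_id):
        live = sum(s not in (EMPTY, TOMBSTONE) for s in self.slots)
        if 4 * (live + 1) >= 3 * len(self.slots):   # 75% load factor
            old, self.slots = self.slots, [EMPTY] * (2 * len(self.slots))
            for s in old:
                if s not in (EMPTY, TOMBSTONE):
                    self.slots[self._probe(s)] = s
        self.slots[self._probe(node_id)] = node_id

    def remove(self, node_id):                # mark dead node
        slot = self._probe(node_id)
        if self.slots[slot] == node_id:
            self.slots[slot] = TOMBSTONE

    def __contains__(self, node_id):
        return self.slots[self._probe(node_id)] == node_id

wl = Worklist()
for n in (0x1234, 0x5678, 0x9ABC):
    wl.add(n)
wl.remove(0x5678)
assert 0x1234 in wl and 0x5678 not in wl and 0x9ABC in wl
```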

Differences from Upstream LLVM

Aspect | Upstream LLVM 20 | CICC v13.0
Source organization | 4 files, ~16,000 lines total | 1 monolithic function, 10,739 lines (348KB)
Vector legal types | Target-dependent, often includes v4f32, v2f64 | Only v2f16, v2bf16, v2i16, v4i8 (32-bit packed)
v2f32 | Legal on most targets (x86, ARM) | Illegal -- scalarized
Scalable vectors | Actively used (AArch64 SVE) | Encoded in tables but no SM target uses them
i128 | Expanded on most targets | Legal on SM 70+ (Int128Regs / .b128 / %rq)
NVPTX-specific split handlers | N/A | 4 functions in 0x214xxxx range for packed-type dispatch
Custom-promotion BST | Standard red-black tree | Same, at TLI offsets +9257/+9258
Type-supported flag array | Pointer array at known offset | At TLI + 8*VT + 120
CC action table | 4-bit packed nibbles | Same encoding, NVPTX-specific CC legal set

The monolithic structure means that code changes to any legalization category (integer promote, float soften, vector split) require recompilation of the entire 348KB function. In upstream LLVM, these are independent compilation units.

Configuration

Knob | Location | Default | Description
enable-legalize-types-checking | ctor_341 | false | Enables PerformExpensiveChecks debug verifier

No CICC-specific legalization knobs beyond the standard LLVM flag were found. The ptxas assembler has a related knob MercuryDisableLegalizationOfTexToURBound for texture-to-uniform-register legalization, but this operates at the assembler level, not in CICC.

Key Functions

Function | Address | Size | Role
Type legalizer monolith | sub_20019C0 | 348KB | DAGTypeLegalizer::run() master dispatch
PromoteIntegerResult | sub_2000100 | 45KB | Integer type promotion
PromoteFloatResult | sub_2019DA0 | 18KB | Float type promotion / softening
ExpandFloatResult | sub_201B410 | 11KB | Float type expansion
ExpandIntegerResult | sub_201BB90 | 75KB | Integer type expansion (632 case labels)
Promote+expand dispatch | sub_201E5F0 | 81KB | Secondary dispatch (441 case labels)
PerformExpensiveChecks | sub_2010FB0 | 62KB | Debug verifier for 9 DenseMap categories
SplitVectorResult | sub_2029C10 | 5KB | Dispatcher for 190 opcode cases
SplitVectorOperand | sub_202E5A0 | 6KB | Dispatcher for 157 opcode cases
SplitVecRes_BinOp | sub_20230C0 | -- | Generic binary op split
SplitVecRes_VECTOR_SHUFFLE | sub_20293A0 | 10KB | Shuffle decomposition
ScalarizeVectorResult | sub_2036110 | -- | Vector-to-scalar reduction
ScalarizeVectorOperand | sub_2035F80 | -- | Operand scalarization (80 cases)
WidenVector | sub_2036AE0 | 31KB | Vector widening (limited NVPTX use)
Operation legalizer | sub_1FFB890 | 169KB | LegalizeOp per-node action dispatch
ExpandNode | sub_1FF6F70 | 43KB | Full node expansion fallback
ExpandLoad | sub_1FF9780 | 55KB | Load legalization
LegalizeLoadOps | sub_1FF5310 | 41KB | Store splitting/coalescing
NVPTX split: CONCAT | sub_2146BB0 | 219B | NVPTX-specific CONCAT_VECTORS split
NVPTX split: SELECT_CC | sub_2146C90 | 2.7KB | NVPTX-specific SELECT_CC split
NVPTX split: FP_ROUND | sub_2147770 | -- | NVPTX-specific FP rounding split
NVPTX split: BITCAST | sub_2147AE0 | -- | NVPTX-specific bitcast split
NVPTXTargetLowering init | sub_3314670 | 73KB | Populates legality tables
FP conversion split helper | sub_20B5C20 | -- | Iterative SINT_TO_FP/UINT_TO_FP
Atomic promote helper | sub_20B7F50 | -- | ATOMIC_LOAD promotion
CAS expansion decision | sub_20B7E10 | -- | CAS loop vs direct atomic
Gather/scatter alignment | sub_20BD400 | -- | MGATHER/MSCATTER alignment fixup

Reimplementation Checklist

  1. NVPTX legal type model. Define the narrow set of legal types dictated by PTX register classes (i1, i16, i32, i64, f32, f64, f16, bf16, v2f16, v2bf16, v2i16, v4i8, i128), with SM-gated legality: f16 arithmetic on SM 53+, v2f16 packed ops on SM 70+, v2bf16 on SM 80+, FP4/FP6 packed types on SM 100+.
  2. Primary legality table population. Build the 2D action table at TLI + 259 * VT + opcode + 2422 with per-opcode-per-type action bytes (0=Legal, 1=Custom, 2=Expand, 3=LibCall, 4=Promote), plus the type-supported flag array at offset +120, the promotion action table at offset +2681, and the condition-code action table at offset +18112 with 4-bit packed nibbles.
  3. Four legalization actions. Implement Promote (widen via ANY_EXTEND/ZERO_EXTEND, operate, TRUNCATE), Expand (split via shift-and-OR for integers, libcall for floats), Soften (integer emulation of unsupported FP types), and Scalarize/Split-Vector (decompose illegal vectors into scalar or half-width vector operations).
  4. Iterative fixpoint loop. Run the type legalizer worklist until every node in the DAG has only legal result and operand types, since each pass may create new nodes with illegal types (e.g., splitting a vector creates half-width vectors that may themselves require further splitting).
  5. Vector legalization for NVPTX. Handle the critical constraint that Int32HalfRegs is the only vector class (32 bits total): scalarize all vectors wider than 32 bits (v4f32, v2f32, v8i32, etc.) while keeping v2f16/v2bf16/v2i16/v4i8 legal. Implement the SplitVectorResult/SplitVectorOperand/ScalarizeVector dispatchers with their 190+/157+/~100 case switches.
  6. SimpleVT type encoding. Implement the bitwidth-to-SimpleVT conversion (11 instances in NVIDIA's monolith) and the ~100-case vector-element-type switch (6 instances) mapping MVT ranges 14--109 to their scalar element types.

Cross-References

ISel Pattern Matching & Instruction Selection

Prerequisites: Familiarity with SelectionDAG, Type Legalization, and DAG Node Layout. Understanding of the Pattern Database structure and NVPTX opcodes is recommended.

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

The NVPTX instruction selector in cicc v13.0 translates legal SelectionDAG nodes into target MachineInstr opcodes through a three-level dispatch hierarchy totaling approximately 900KB of code. At the top sits NVPTXDAGToDAGISel::Select (sub_3090F90, 91KB), which builds a per-function cost table, manages a priority-queue-driven topological worklist, and calls the pattern matcher (sub_308FEE0) for every node. The pattern matcher fans out to a hand-written NVPTX-specific select switch (sub_347A8D0, 309KB) and a TableGen-generated SelectCode function (sub_348D3E0, 256KB). Surrounding this core are six NVPTX-specific sub-selectors covering memory operations, texture/surface fetches, complex addressing modes, vector patterns, and atomics. NVIDIA's key deltas from upstream LLVM are (1) a compressed per-SM-variant legality table that gates which target opcodes exist on which GPU architecture, (2) a secondary 4-bit packed bitfield for fine-grained operand-class legality, and (3) an iteration budget that prevents the selector from looping indefinitely on pathological DAGs.

ISel driver | sub_3090F90 (91KB, 2,828 lines)
Pattern matcher entry | sub_308FEE0
NVPTX Select switch | sub_347A8D0 (309KB -- largest ISel function)
SelectCode (TableGen) | sub_348D3E0 (256KB -- auto-generated)
Vector/SIMD patterns | sub_3475BB0 (89KB)
Memory operation patterns | sub_306D850 (77KB)
Complex addressing modes | sub_30811D0 (77KB)
Addressing mode helper | sub_30783B0 (39KB)
Texture/surface ISel | sub_306A930 (52KB)
Atomic lowering | sub_3048C30 (86KB)
Constraint table | word_3F3E6C0 (see Pattern Database)
Compressed legality table | Base + 6414, 500-byte stride per SM variant
Secondary 4-bit bitfield | Base + 521536
Legalize action table | Object + 72760, 4-bit packed
Knob registration | ctor_286 at 0x4FA0C0 (5KB)
Upstream LLVM source | lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp, lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp

ISel Driver: sub_3090F90

The top-level driver is not the pattern matcher itself; it is the orchestration loop that feeds nodes to the matcher in the right order and maintains shared state. It breaks into three phases.

Phase 1: Function Argument Cost Table

Before selecting any instructions, the driver builds a DenseMap-style hash table at this + 408 that maps function argument indices to their byte sizes. The hash table uses LLVM's standard integer-key hash function key * 37, open addressing with linear probing, and the tombstone sentinel -2. Growth triggers at 75% load factor (4 * (count + 1) >= 3 * capacity).

// Phase 1: build argument cost table
hash_table = this->arg_cost_map;  // at this + 408
for each argument A in function->args():
    byte_size = alignTo(getSizeInBits(A.type) / 8, A.alignment)
    key = A.index
    slot = (key * 37) & (capacity - 1)
    while hash_table[slot] is occupied and != key:
        slot = (slot + 1) & (capacity - 1)
    hash_table[slot] = { key, byte_size }
    if load_factor > 0.75: rehash()

The table layout:

Field | Offset from this | Description
data | +416 | Pointer to hash bucket array
count | +424 | Number of live entries
tombstone_count | +428 | Number of tombstone slots
capacity | +432 | Total bucket count (power of 2)

If the function has a non-void return type, the driver also inserts the return value sizes into the same table, computing aligned_size = ((((size + 7) >> 3) + (1 << align) - 1) >> align) << align for each return element (bits rounded up to bytes, then rounded up to the next multiple of the ABI alignment, where align is the log2 alignment). The return-type attribute check uses attribute kind 81 (likely sret).
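With explicit precedence the rounding reads: bits round up to bytes, then bytes round up to the next multiple of the alignment, where align is the log2 of the ABI alignment. A small model (function and argument names are descriptive, not recovered):

```python
# Model of the return-value size computation: bits -> bytes -> aligned bytes.
# align_log2 is log2 of the ABI alignment, so (1 << align_log2) is the
# alignment in bytes.
def aligned_byte_size(size_bits, align_log2):
    byte_size = (size_bits + 7) >> 3                     # round bits up to bytes
    alignment = 1 << align_log2
    return ((byte_size + alignment - 1) >> align_log2) << align_log2

assert aligned_byte_size(32, 2) == 4    # i32, 4-byte aligned
assert aligned_byte_size(48, 3) == 8    # 6 bytes rounded up to 8
assert aligned_byte_size(1, 0) == 1     # i1 -> 1 byte
```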

Phase 2: Return Value Processing

For non-void functions, the driver iterates each return value element via:

  • sub_A74710(attribute, 81) -- checks for sret attribute
  • sub_A748A0(index) -- gets return type at given index
  • sub_AE5020(dataLayout, type) -- computes ABI alignment
  • sub_9208B0(dataLayout, type) -- computes size in bits

Each return value's aligned byte size is inserted into the argument cost table, so the pattern matcher can look up the cost of materializing any function parameter or return value during instruction selection.

Phase 3: Topological Selection Loop

The main selection loop processes DAG nodes in topological order using a min-heap priority queue where priority equals topological order (lower number = earlier in the DAG, processed first). The iteration is bounded by an explicit budget.

// Phase 3: main ISel loop
sub_308B6F0(this);  // initialize worklist from DAG
budget = 4 * numInstructions * maxBlockSize
iteration = 0

while heap is not empty:
    node = heap.extractMin()         // sub_3089BD0: heap-sift-down
    sub_308FEE0(this, node, &tmp)    // pattern matcher dispatch

    if this->selectionChanged:       // byte at this + 400
        re-scan affected nodes

    iteration++
    if iteration > budget:
        break  // anti-infinite-loop guard

sub_308AB30(this)    // cleanup
sub_264E600(this)    // deallocate worklist
sub_308B100(this)    // destroy hash table

The min-heap stores (SDNode*, priority) pairs at 16-byte stride. The heap-sift-down operation (sub_3089BD0) maintains the heap invariant after extraction. The selectionChanged flag at this + 400 is set by the pattern matcher when it replaces a node, signaling the driver to re-examine downstream users.

The iteration budget formula 4 * numInstructions * maxBlockSize is an NVIDIA addition -- upstream LLVM's SelectionDAGISel does not have this guard. It prevents pathological DAGs (for example, from heavily inlined device functions with thousands of parameters) from causing the selector to spin indefinitely when combine/legalize/select cycles interact.
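The loop structure can be sketched with Python's heapq standing in for the 16-byte-stride min-heap (select_fn is a placeholder for the pattern-matcher call; this is an illustration, not recovered code):

```python
# Sketch of the budgeted selection loop: nodes come out of a min-heap in
# topological order, and the NVIDIA iteration budget caps total work.
import heapq

def select_all(nodes, num_instructions, max_block_size, select_fn):
    """nodes: list of (topo_order, node_id) pairs. Returns ids selected."""
    budget = 4 * num_instructions * max_block_size
    heap = list(nodes)
    heapq.heapify(heap)                    # min-heap keyed on topological order
    selected, iteration = [], 0
    while heap:
        _, node = heapq.heappop(heap)      # extractMin (sub_3089BD0 analog)
        select_fn(node)                    # pattern matcher dispatch stand-in
        selected.append(node)
        iteration += 1
        if iteration > budget:
            break                          # anti-infinite-loop guard
    return selected

done = select_all([(2, "c"), (0, "a"), (1, "b")], 10, 10, lambda n: None)
assert done == ["a", "b", "c"]             # processed in topological order
```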

Pattern Matcher Dispatch: sub_308FEE0

The pattern matcher is called once per SDNode. It reads the node's opcode at *(node + 24) and dispatches through a multi-level decision tree:

  1. Quick-reject filter. If the node is already selected (machine opcode bit set in flags), return immediately.
  2. NVPTX-specific hand-written patterns. Calls sub_347A8D0 for NVPTX custom opcodes (NVPTXISD range >= 499). This handles texture loads, MMA instructions, atomic operations, .param-space loads/stores, and other GPU-specific patterns.
  3. TableGen auto-generated matcher. Calls sub_348D3E0 (SelectCode) for standard ISD opcodes. This function is mechanically generated from the .td pattern files in the NVPTX backend and contains a massive switch table mapping DAG patterns to MachineInstr opcodes.
  4. Complex pattern matching. For load/store addressing modes, calls sub_30811D0 (77KB) and sub_30783B0 (39KB), which match base + offset, base + scaled_index, and address-space-qualified patterns.
  5. Fallback. If no pattern matches, the node is marked as "failed ISel" and the driver may retry after DAG combining.

NVPTX Select Switch: sub_347A8D0 (309KB)

This is the largest single ISel function, containing the hand-written pattern matching for all NVIDIA-specific DAG nodes. It calls sub_969240 263 times (SDNode accessor), is self-recursive 42 times, and dispatches to:

Sub-selector | Size | Coverage
sub_3447D70 | 32KB | Specific pattern sub-dispatch
sub_3441190 | -- | Pattern helpers
sub_343FD60 | -- | Type-aware matching
sub_3475BB0 | 89KB | Vector/SIMD patterns (v2, v4 packed types)

The function switches on the SDNode opcode to handle:

  • Load/store with address spaces -- selects between ld.global, ld.shared, ld.local, ld.param, ld.const, and generic-space loads, each requiring different PTX instructions.
  • Texture/surface operations -- dispatches to sub_306A930 for tex, suld, sust instruction patterns.
  • MMA/WMMA/tensor ops -- selects the correct mma.sync, wmma.mma, wgmma variant based on operand types and SM architecture.
  • Atomic operations -- selects between atom.global.add, atom.shared.cas, red.global.add, etc., with scope qualifiers (.cta, .gpu, .sys).
  • Barrier/fence operations -- selects bar.sync, bar.warp.sync, membar.cta, membar.gl, membar.sys.

SelectCode (TableGen): sub_348D3E0 (256KB)

This auto-generated function implements the standard LLVM TableGen pattern matching algorithm. It is a giant switch-table compiled from the .td instruction pattern files in lib/Target/NVPTX/*.td. The function:

  • Calls sub_969240 45 times and sub_32889F0 38 times (opcode/type checkers).
  • Contains no string literals (purely mechanical code).
  • Works in tandem with sub_347A8D0: the hand-written selector handles NVPTX custom nodes first, and anything that falls through goes to SelectCode.

The auto-generated matcher encodes patterns as a sequence of opcode checks, type checks, and operand recursive matches. When a full pattern matches, it calls MorphNodeTo to convert the SDNode into a MachineSDNode with the target opcode and register operands.

Compressed Instruction Legality Table

NVIDIA's instruction selector uses a per-SM-variant legality table to determine whether a given target opcode is legal on the current GPU architecture. This table is checked during instruction selection to gate SM-specific instructions (for example, wgmma instructions are illegal on SM 70 but legal on SM 90+).

The table lives at a fixed offset from the base of the ISel object, accessed by sub_376DE90:

legality = *(uint8_t*)(base + 500 * arch_variant + opcode + 6414)

Field | Encoding
Base offset | 6414 bytes from object base
Row stride | 500 bytes per architecture variant
Index | 500 * arch_variant + opcode
Value 0 | Illegal -- this opcode does not exist on this SM
Value 1 | Custom -- requires custom lowering before emission
Value 2 | Legal -- can be emitted directly

The arch_variant value selects which row of the table to consult. Each row contains 500 entries, one per target opcode. The table is read-only after initialization and occupies approximately num_variants * 500 bytes in the .data section.
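The lookup reduces to a single indexed byte load. A sketch against a hypothetical table image (only the indexing scheme and the 0/1/2 encoding come from the decompilation; the contents below are test data):

```python
# Model of the sub_376DE90 lookup: one byte per (arch_variant, opcode).
BASE_OFFSET = 6414
ROW_STRIDE = 500   # entries (and bytes) per SM variant

LEGALITY = {0: "Illegal", 1: "Custom", 2: "Legal"}

def opcode_legality(table, arch_variant, opcode):
    return LEGALITY[table[BASE_OFFSET + ROW_STRIDE * arch_variant + opcode]]

# Hypothetical blob: an opcode that is illegal on variant 0, legal on 5
# (the way a wgmma-class instruction would be gated off pre-SM-90 rows).
blob = bytearray(BASE_OFFSET + ROW_STRIDE * 8)
blob[BASE_OFFSET + ROW_STRIDE * 5 + 321] = 2

assert opcode_legality(blob, 0, 321) == "Illegal"
assert opcode_legality(blob, 5, 321) == "Legal"
```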

Secondary 4-bit Packed Bitfield

A second legality table at base + 521536 provides fine-grained operand-class legality using 4-bit packed nibbles:

byte_offset = (opcode_class >> 3) + 36 * arch_id - arch_id
nibble      = (*(uint8_t*)(base + 521536 + byte_offset) >> (4 * (opcode_class & 7))) & 0xF

The offset simplification 36 * arch_id - arch_id equals 35 * arch_id, giving a 35-byte stride per architecture variant. The nibble within the loaded byte is selected by the low bits of opcode_class (the 4 * (opcode_class & 7) shift term in the formula above), and the 4-bit values encode a richer set of actions than the primary table's 3-value encoding.

Legalize Action Table

The operation legalization subsystem (separate from the ISel legality table above) uses a 4-bit packed action table at object offset 72760 to determine how to legalize each (opcode, type) pair:

index  = type_bits + 15 * opcode + 18112
action = (*(uint32_t*)(object + 4 * index + 72760) >> (4 * (type & 7))) & 0xF

Action | Value | Behavior
Legal | 0 | Node is natively supported
Promote | 1 | Widen to a larger legal type
Custom | 5 | Call NVPTXTargetLowering::LowerOperation via vtable slot 164
ExpandInteger | 9 | Split wide integers into halves
ExpandFloat | 13 | Emulate unsupported FP via libcalls
SplitVector | 14 | Decompose illegal vector into legal sub-vectors

This table is distinct from the type-legality table at TLI + 2422 (described in SelectionDAG), which uses a 259-byte stride and encodes the simpler 5-action set (Legal/Custom/Expand/LibCall/Promote). The table at +72760 is the operation-level action table used during the LegalizeOp phase, while the +2422 table is the type-level action table used during LegalizeTypes.
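A sketch of the nibble decode, treating the region at +72760 as an array of 32-bit words indexed by the formula above (the word contents here are test data, not dumped values):

```python
# Nibble decode for the +72760 action table. Indexing and action values
# follow the recovered formulas; the buffer is hypothetical test data.
ACTIONS = {0: "Legal", 1: "Promote", 5: "Custom",
           9: "ExpandInteger", 13: "ExpandFloat", 14: "SplitVector"}

def legalize_action(words, opcode, type_bits):
    """words: uint32 view of the region starting at object + 72760."""
    index = type_bits + 15 * opcode + 18112
    nibble = (words[index] >> (4 * (type_bits & 7))) & 0xF
    return ACTIONS.get(nibble, "Unknown")

words = [0] * 20000                 # all-zero words decode to Legal
# Plant SplitVector (14) for (opcode=3, type=5): index 18162, nibble 5.
words[5 + 15 * 3 + 18112] |= 14 << (4 * 5)

assert legalize_action(words, 3, 5) == "SplitVector"
assert legalize_action(words, 3, 4) == "Legal"
```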

NVPTX-Specific Pattern Categories

Memory Operations: sub_306D850 (77KB)

Selects PTX load/store instructions with the correct address space qualifier, vector width, and volatility. The function handles the full matrix of {ld,st} x {.global,.shared,.local,.param,.const,.gen} x {.b8,.b16,.b32,.b64,.b128} x {.v1,.v2,.v4} x {.volatile,.relaxed,.acquire,.release} instruction variants. Address space is determined by querying the pointer operand's address space attribute through the DAG.

The memory pattern matching also covers:

  • Vector loads/stores -- ld.global.v2.b32, ld.global.v4.b32, and their 64-bit variants, selected based on the vector element count (1, 2, or 4).
  • Parameter loads -- ld.param.b32 and st.param.b32 for call ABI (see SelectionDAG: .param ABI).
  • Generic-space loads with addrspacecast -- when the address space is generic (AS 0), the selector checks whether the source can be proven to be in a specific space and emits a non-generic load if so.

Texture/Surface Instructions: sub_306A930 (52KB)

Selects tex, suld, and sust instructions from DAG nodes produced by the intrinsic lowering mega-switch. The selector dispatches through helper functions:

Helper | Purpose
sub_2FE5F00 | Texture fetch type selection
sub_2FE5F30 | Surface read type selection
sub_2FE5F60 | Surface write type selection
sub_2FE69A0 | Texture sampler mode selection
sub_2FE6CC0 | Unified texture/surface dispatch

Texture instructions have complex operand requirements: sampler reference, texture reference, coordinate type (1D/2D/3D/cube), data type (f32/i32/f16), and optional LOD/gradient parameters. The selector maps each combination to a specific PTX tex.1d.v4.f32.f32 (or similar) opcode.

Complex Addressing Modes: sub_30811D0 (77KB)

Matches addressing patterns for load/store operands. NVPTX supports a limited set of addressing modes compared to x86:

  • Register + immediate offset -- [%r1 + 16], the most common PTX addressing mode.
  • Register -- [%r1], zero-offset variant.
  • Immediate -- [0x1000], absolute address (rare on GPU).
  • Register + register -- not directly supported in PTX; decomposed into add + register addressing.

The complex pattern matcher at sub_30811D0 calls seven helper functions (sub_307B990 through sub_307FEF0) to decompose DAG address expressions into base-register + offset pairs. When the offset is a constant that fits in the PTX immediate field, it folds into the instruction encoding. When the offset is too large or non-constant, it generates a separate add instruction and uses register addressing.
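The fold-or-materialize decision can be sketched as follows (the immediate range and the virtual-register naming are illustrative assumptions, not recovered constants):

```python
# Sketch of base+offset decomposition: fold a constant offset into the
# instruction when it fits the immediate field, otherwise emit a separate
# add and fall back to plain register addressing. The 32-bit signed
# immediate range below is an assumption for illustration.
IMM_MIN, IMM_MAX = -(2**31), 2**31 - 1

def match_address(base_reg, offset):
    """Return (extra_instructions, base, folded_offset) for [base + offset]."""
    if IMM_MIN <= offset <= IMM_MAX:
        return ([], base_reg, offset)                      # folds into encoding
    tmp = base_reg + "_plus_off"                           # hypothetical vreg name
    return ([f"add {tmp}, {base_reg}, {offset}"], tmp, 0)  # materialize the add

assert match_address("%r1", 16) == ([], "%r1", 16)
insns, base, off = match_address("%r1", 2**40)
assert insns and off == 0 and base == "%r1_plus_off"
```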

MMA / Tensor Core Instructions

Tensor core instruction selection is split across the intrinsic lowering stage (which generates NVPTXISD nodes from wmma.load, wmma.mma, mma.sync, wgmma intrinsics) and the ISel stage (which selects the specific PTX opcode). The ISel switch in sub_347A8D0 handles these by checking:

  1. SM architecture -- wmma requires SM 70+, mma.sync requires SM 75+, wgmma requires SM 90+.
  2. Matrix dimensions -- m16n16k16, m8n8k4, m16n8k8, etc.
  3. Data types -- f16, bf16, tf32, f64, i8, i4, b1, fp8 (SM 90+), fp4 (SM 100+).
  4. Accumulator type -- f16 or f32 for half-precision MMA.

The architecture check consults the compressed legality table to determine whether a given MMA variant is legal on the target SM.

Atomic Operations: sub_3048C30 (86KB)

Atomic instruction selection generates atom.{scope}.{op}.{type} instructions. The selector handles:

Operation | PTX | NVPTXISD opcodes
Compare-and-swap | atom.cas | 462
Add (int) | atom.add | 294--297
Min (signed) | atom.min | 302--305
Max (signed) | atom.max | 314--317
Exchange | atom.exch | (via generic path)
AND/OR/XOR | atom.and / atom.or / atom.xor | (via generic path)

The selector emits the diagnostic "vector atomics not supported on this architecture!" for vector-width atomics and gates them behind an SM version check (likely SM 90+). Scope qualifiers (.cta, .gpu, .sys) are determined from the memory ordering of the LLVM atomic instruction.

Vector / SIMD Patterns: sub_3475BB0 (89KB)

Handles vector-type instruction selection for NVPTX's limited vector support (v2 and v4 packed types). The function calls sub_969240 121 times and is self-recursive 28 times. It selects between:

  • Packed register operations -- add.v2.f32, mul.v2.f32 when the SM supports native vector operations.
  • Scalarized fallback -- decomposes vector operations into per-element scalar operations when the vector type is not natively supported.
  • mov.v2 / mov.v4 -- register-to-register vector moves for shuffles and extracts.

Knobs

The ISel subsystem registers its knobs at ctor_286 (0x4FA0C0, 5KB):

Knob | Type | Description
fast-isel-abort | int | Abort mode for FastISel failures (0=silent, 1=warn, 2=abort)
fast-isel-report-on-fallback | bool | Report when FastISel falls back to SelectionDAG
use-mbpi | bool | Use Machine Branch Probability Info during ISel
dag-disable-combine | bool | Disable DAG combining entirely
pre-RA-sched | enum | Pre-RA scheduler variant: "default", "list-burr", "source", "list-hybrid", "list-ilp"

Note that cicc does not use FastISel for GPU code generation. The fast-isel-* knobs exist because the upstream LLVM SelectionDAGISel framework registers them unconditionally, but the NVPTX backend always takes the full SelectionDAG path. The dag-disable-combine flag is the only ISel-phase knob that has a meaningful effect on NVPTX code generation; setting it skips the DAG combiner entirely, which produces worse code but can be useful for debugging.

Differences from Upstream LLVM

Aspect | Upstream LLVM 20.0 | NVIDIA cicc v13.0
Iteration budget | No explicit budget; relies on DAG invariants to terminate | Budget = 4 * numInstructions * maxBlockSize
Argument cost table | Not present in SelectionDAGISel | Hash table with key * 37 hash for argument byte sizes
Legality table | Simple isLegal() callback per target | Compressed 500-stride table + 4-bit packed secondary table
FastISel | Used for -O0 on most targets | Never used; always full SelectionDAG
ISel function size | Typical NVPTX Select() is ~50KB upstream | 309KB hand-written + 256KB TableGen = 565KB total
Memory patterns | Standard load/store | 5 address spaces, each with distinct PTX encoding
Texture/surface | Not present in upstream NVPTX (handled by intrinsics only) | 52KB dedicated sub-selector for tex/suld/sust
Atomic patterns | Standard expansion via AtomicExpandPass | 86KB custom selector with scope qualifiers and architecture gating

Function Map

Function | Address | Size
NVPTXDAGToDAGISel::Select -- ISel driver | sub_3090F90 | 91KB
Pattern matcher entry (dispatches to Select switch and SelectCode) | sub_308FEE0 | --
NVPTX hand-written Select switch | sub_347A8D0 | 309KB
TableGen-generated SelectCode | sub_348D3E0 | 256KB
Vector/SIMD pattern selection | sub_3475BB0 | 89KB
Memory operation patterns (ld/st with address spaces) | sub_306D850 | 77KB
Complex addressing mode matching | sub_30811D0 | 77KB
Addressing mode helper (base + offset extraction) | sub_30783B0 | 39KB
Texture/surface instruction selection | sub_306A930 | 52KB
Atomic operation selection | sub_3048C30 | 86KB
Sub-selector for specific NVPTX patterns | sub_3447D70 | 32KB
Pattern matching helpers | sub_3472970 | 36KB
Operand matching | sub_343A2E0 | 49KB
Compressed legality table lookup | sub_376DE90 | --
Initialize topological worklist | sub_308B6F0 | --
Min-heap sift-down (priority queue) | sub_3089BD0 | --
ISel cleanup | sub_308AB30 | --
Hash table destruction | sub_308B100 | --

Cross-References

InstrEmitter

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

LLVM version note: SDNode field layout matches LLVM 20.0.0 base. NVIDIA merges the upstream EmitNode/EmitSpecialNode split into a single monolithic function, adds a dedicated CopyToReg handler, an extended MachineInstr flag at bit 36, and a triple vtable dispatch for GPU pseudo-expansion.

InstrEmitter is the final translation layer between LLVM's SelectionDAG representation and the machine-level MachineInstr pipeline. After instruction selection has converted LLVM IR into a DAG of target-specific SDNodes, and after scheduling has linearized those nodes into a sequence, InstrEmitter walks the scheduled sequence and converts each SDNode into one or more MachineInstrs inserted into the current MachineBasicBlock. In CICC v13.0, the emitter lives at sub_2EDDF20 (11,722 bytes) and is called by ScheduleDAGSDNodes::EmitSchedule (sub_2EE0CF0). NVIDIA's build contains three key modifications relative to upstream LLVM: a dedicated CopyToReg handler factored out for NVPTX's physical-register-heavy parameter ABI, a triple vtable dispatch pattern that gates custom pseudo-expansion for GPU-specific instructions, and an extended MachineInstr flag at bit 36 (0x1000000000) not present in stock LLVM.

EmitNode / EmitMachineNode | sub_2EDDF20 (11,722 bytes, 872-byte stack frame)
EmitSchedule (top-level driver) | sub_2EE0CF0 (59KB)
EmitCopyToReg handler | sub_2ED95B0
EmitSubregNode | sub_2EDB7A0
EmitCopyToRegClassOp | sub_2EDD7E0
ProcessOperands / EmitMachineNode core | sub_2ED3660
getRegForValue | sub_2E8B400
isDeadNode predicate | sub_2DADC00
MinRCSize threshold | 4 (upstream default, unchanged)
VReg hash load factor | 3/4 (rehash when count * 4 >= capacity * 3)
Hash function | key * 37, masked by capacity - 1
SDOperand stride | 40 bytes (0x28) per entry

Emission Architecture

In upstream LLVM, InstrEmitter::EmitNode is a trivial dispatcher: if the SDNode carries a target-specific (machine) opcode, it calls EmitMachineNode; otherwise it calls EmitSpecialNode for ISD-level pseudo-operations. CICC merges both paths into a single monolithic function (sub_2EDDF20) that dispatches on the raw 16-bit opcode at SDNode offset +0x44. The entry point performs a bit-table test against a 64-bit immediate (0x80001078000) to classify opcodes <= 0x2B as "special" ISD nodes requiring dedicated handling; everything above falls through to the generic machine emission path.

The driver, ScheduleDAGSDNodes::EmitSchedule (sub_2EE0CF0), iterates the scheduled SUnit sequence. For each SUnit, it first walks the glue chain backwards (via SDNode::getGluedNode) and emits each glued predecessor before emitting the SUnit's own node. This guarantees that glued instructions appear as a contiguous sequence in the MachineBasicBlock, which is critical for NVPTX where texture sampling sequences must remain bundled with their address computation.

The Emission Algorithm

The combined EmitNode function proceeds through fourteen phases. The condensed flow:

EmitNode(InstrEmitter *self, SDNode *node):
    // Phase 1: Early exit for dead nodes
    if !self->forceEmit && node->useCount <= 1:
        return false  // single-use folded into consumer

    // Phase 2: Glue chain traversal
    root = node
    while root->predecessor has chain/glue bit set:
        root = strip_tag(root->predecessor)
        if root->hasChainResult:
            walk further to data-producing node

    // Phase 3: Opcode dispatch
    opc = node->opcode  // uint16 at +0x44
    switch opc:
        0x0E (CopyToReg):  call EmitCopyToReg(self, node)
        0x13 (TokenFactor): skip entirely
        0x14 (CopyFromReg): goto copyfromreg_path
        0x0F, 0x10, 0x1C, 0x2B: special ISD handling
        default: goto generic_emission

    // Phase 4: Generic machine emission
    desc = TII->get(opc)
    MI = BuildMI(MBB, node->debugLoc, desc)
    CreateVirtualRegisters(node, MI, desc)
    for each operand in node->operands:
        AddOperand(MI, operand)
    MI.setMemRefs(node->memoperands)
    MBB->insert(InsertPos, MI)

    // Phase 5: Custom inserter check (triple vtable dispatch)
    if TII->vtable[0xB8] != sub_2ED11C0:  // not default
        call custom inserter for NVPTX pseudos
    if TII->vtable[0x348] != sub_2ED11F0:
        call expandPostRAPseudo
    if TII->vtable[0x160] != sub_2ED11E0:
        call sub-register inserter

    // Phase 6: Implicit physreg defs
    collect UsedRegs from glue chain (CopyFromReg, RegisterSDNode)
    mark unused implicit defs as dead

    // Phase 7: Post-emission dead copy elimination
    for each emitted copy:
        if copy result has no remaining uses:
            eraseFromParent(copy MI)

Opcode Dispatch Details

The bit-table dispatch uses a 64-bit constant as a compressed lookup: the immediate 0x80001078000 is loaded into a register and tested with bt against the opcode. The set bits correspond to ISD opcodes that need special (non-generic) handling:

| Opcode | ISD Value | Handler |
| 0x0E | ISD::CopyToReg | sub_2ED95B0 -- dedicated handler |
| 0x0F | ISD::EH_LABEL / special | Label emission path |
| 0x10 | ISD::INLINEASM | Inline assembly emission |
| 0x13 | ISD::TokenFactor | Skipped (ordering-only, no MI) |
| 0x14 | ISD::CopyFromReg | Physical-to-virtual register copy |
| 0x1C | ISD::LIFETIME_START/END | Frame index annotation |
| 0x2B | ISD::PSEUDO_PROBE | Profiling probe emission |

For opcodes above 0x2B, the emitter falls through to the generic path that calls TII->get(opc) to obtain the MCInstrDesc and builds a MachineInstr from its operand descriptors.

CopyToReg Emission

CopyToReg (sub_2ED95B0) handles the common case of copying a value from a virtual register into a physical register. Upstream LLVM handles this inline within EmitSpecialNode; NVIDIA factors it into a separate function, likely for code size reasons given how frequently CopyToReg appears in NVPTX code. NVPTX's parameter-passing convention maps kernel parameters to fixed physical registers %r1--%r255, which generates large CopyToReg cascades at function entry and before calls.

The handler:

  1. Reads the destination register from SDNode->operand(1) (a RegisterSDNode).
  2. If the destination is virtual and the source is an IMPLICIT_DEF, emits IMPLICIT_DEF dest directly instead of a COPY.
  3. Otherwise resolves the source value to a virtual register via getVR (which consults the VRBaseMap).
  4. If source and destination are the same register, does nothing (copy coalesced away).
  5. Emits COPY dest, src.

CopyFromReg Emission

CopyFromReg (opcode 0x14) is the reverse: it copies a physical register into the virtual register domain. The CICC implementation at sub_2EDDF20 offset 0x2EDF423 follows a multi-step process:

  1. Extract the source register from SDNode->operand(1). If virtual, insert the SDValue-to-VReg mapping directly into VRBaseMap and return.
  2. If physical, determine the correct register class:
    • Query all users of this CopyFromReg. If the sole user is a CopyToReg to a virtual register in the same class, reuse that destination register.
    • Otherwise compute UseRC as the intersection of all user register class constraints via TRI->getCommonSubClass.
    • Fall back to TRI->getMinimalPhysRegClass(SrcReg, VT).
  3. If copying the physical register is impossible or expensive (RC->expensiveOrImpossibleToCopy()), use the physical register directly.
  4. Otherwise emit COPY VRBase, SrcReg where VRBase is a new virtual register in DstRC.

The register class membership test at 0x2EDF4C2 uses LLVM's compressed bit-vector representation:

bool RegisterClass::contains(unsigned Reg) {
    unsigned byte_idx = Reg >> 3;           // one membership bit per register
    if (byte_idx >= desc->regset_size)      // out of range -> not a member
        return false;
    return (desc->regset[byte_idx] >> (Reg & 7)) & 1;
}

NVPTX Custom Pseudo-Expansion

The triple vtable dispatch pattern is the emitter's most distinctive NVIDIA modification. After inserting a MachineInstr for a target-specific opcode, the emitter checks three separate vtable slots to determine whether the instruction requires custom expansion:

Vtable slot 0xB8: EmitInstrWithCustomInserter Default stub: sub_2ED11C0 (returns false). When the NVPTX target overrides this for a given opcode, the custom inserter replaces the pseudo MachineInstr with an expanded sequence. Approximately 15--20 NVPTX pseudo-instructions use this path:

  • Texture load operations (tex.1d, tex.2d, tex.3d) -- these expand into address register setup, sampler state configuration, and the actual texture fetch instruction.
  • Surface operations (sust, suld) -- surface load/store instructions that need coordinate clamping and format conversion.
  • Warp-level intrinsics (shfl, vote, match) -- instructions that require lane mask setup and predicate register manipulation.
  • Atomic operations -- certain atomics expand into compare-and-swap loops on older architectures.

Vtable slot 0x348: expandPostRAPseudo Default stub: sub_2ED11F0. This handles pseudo-instructions that can only be expanded after register allocation has assigned physical registers. In NVPTX this is less common since the PTX virtual register model defers most allocation to ptxas.

Vtable slot 0x160: sub-register insertion Default stub: sub_2ED11E0. Handles INSERT_SUBREG and related patterns that need target-specific lowering.

All three stubs are adjacent in memory (within 48 bytes of each other), confirming they are trivial return-false implementations in the NVPTXInstrInfo class.

Register Class Assignment During Emission

When creating virtual registers for SDNode results, CreateVirtualRegisters (sub_2E8B400 path) performs:

  1. For each result value of the SDNode, obtain the register class from TII->getRegClass(II, i).
  2. Refine based on the value type: if the type is legal, compute TLI->getRegClassFor(VT, isDivergent) and intersect with the instruction constraint via TRI->getCommonSubClass.
  3. The divergence flag (SDNode::isDivergent) is critical in NVPTX: divergent values must go into general-purpose registers (not uniform/constant registers), which affects class selection.
  4. If a result's sole consumer is a CopyToReg to a virtual register in a compatible class, reuse the CopyToReg destination directly to avoid a redundant copy.
  5. Create the virtual register via MRI->createVirtualRegister(RC) and add it as a def operand on the MachineInstr.

The MinRCSize threshold (4, unchanged from upstream) prevents over-constraining: if the intersection of all register class constraints would yield a class with fewer than 4 registers, the emitter inserts a COPY to a less-constrained virtual register instead.

Implicit Def/Use Handling

After inserting a MachineInstr, the emitter processes implicit physical register definitions. This is essential for GPU instructions that clobber status registers or have side effects beyond their explicit operands.

The flow collects UsedRegs by scanning:

  1. Implicit defs beyond explicit results: if NumResults > NumDefs, the extra results correspond to implicit physical register definitions from MCInstrDesc::implicit_defs(). For each such def that has at least one use, a CopyFromReg is emitted to capture the value.
  2. Glue chain uses: the emitter walks the glue chain upward from the current node, collecting physical registers referenced by CopyFromReg nodes and RegisterSDNode operands.
  3. Dead marking: MachineInstr::setPhysRegsDeadExcept(UsedRegs) marks any implicit def that is NOT in UsedRegs as dead, allowing the register allocator and later passes to ignore it.

NVIDIA Extended Flag: Bit 36 (0x1000000000)

Standard LLVM MachineInstr flags occupy bits 0--31 of the flags word (is_def, is_implicit, is_dead, is_kill, is_undef, is_early_clobber, etc.). CICC extends this to a 64-bit flags field and reserves bit 36 (0x1000000000) for an NVIDIA-specific purpose. The flag is queried via sub_2E88A90 (hasProperty) with argument rsi = 0x1000000000, edx = operand_index.

Where Bit 36 Is Checked

There are exactly two call sites within sub_2EDDF20:

Site 1 -- Generic emission path (0x2EDE50A--0x2EDE523)

0x2EDE4EF: mov  eax, [r13+2Ch]          ; load SDNode property flags
0x2EDE4F3: test eax, 0x20000            ; bit 17 = hasDebugValue?
0x2EDE4F8: jnz  skip_flag_check         ; if set, skip the bit-36 test
0x2EDE4FA: test al, 4                   ; bit 2 = isTied
0x2EDE4FC: jnz  loc_2EDF064             ; tied operand -> different path
0x2EDE502: test al, 8                   ; bit 3 = hasGlue
0x2EDE504: jz   loc_2EDF064             ; no glue -> different path
0x2EDE50A: mov  edx, 1                  ; operand index = 1
0x2EDE50F: mov  rdi, r13                ; SDNode*
0x2EDE512: mov  rsi, 0x1000000000       ; bit 36 flag mask
0x2EDE51C: call sub_2E88A90             ; hasProperty(node, flag, idx)
0x2EDE521: test al, al
0x2EDE523: jnz  loc_2EDE086             ; if set -> skip emission entirely

Site 2 -- CopyFromReg-adjacent path (0x2EDEE5D--0x2EDEE86)

0x2EDEE5D: test al, 4                   ; bit 2 = isTied
0x2EDEE5F: jnz  loc_2EDEFA2             ; tied -> sub-register path
0x2EDEE65: test al, 8                   ; bit 3 = hasGlue
0x2EDEE67: jz   loc_2EDEFA2             ; no glue -> sub-register path
0x2EDEE6D: mov  edx, 1                  ; operand index = 1
0x2EDEE72: mov  rdi, r13                ; SDNode*
0x2EDEE75: mov  rsi, 0x1000000000       ; bit 36 flag mask
0x2EDEE7F: call sub_2E88A90             ; hasProperty(node, flag, idx)
0x2EDEE84: test al, al
0x2EDEE86: jnz  loc_2EDE100             ; if set -> skip (no MI emitted)

Guard Conditions and Semantics

Both sites share the same guard pattern: the flag is only checked when the SDNode's property byte at +0x2C satisfies bit_3_set AND NOT bit_2_set -- i.e., the node has a glue result chain but is not a tied operand. This narrows the check to nodes that participate in glue chains: typically multi-instruction sequences like texture fetches, surface operations, and warp-level intrinsics where a chain of SDNodes must emit as a contiguous bundle.

When hasProperty(node, 0x1000000000, 1) returns true, the emitter skips the node entirely. The operand index of 1 means the flag is checked on the first data operand (operand 0 is typically the chain input). The effect is that nodes carrying bit 36 on operand 1 are treated as "already materialized" -- their value has been produced by a preceding glued instruction and does not require a separate MachineInstr.

The most likely interpretation of bit 36 is "implicit glue consumer already emitted": when a glued predecessor has already produced the value as a side effect (e.g., a texture fetch that writes both the result and a predicate), the glue consumer SDNode carries bit 36 to tell the emitter that no additional COPY or MI is needed. This is consistent with the check position immediately after getRegForValue succeeds -- the VReg mapping exists, the glue chain has been walked, and the emitter is about to create a potentially redundant MI.

sub_2E88A90 Calling Convention

The function serves as a universal property query across the emitter and other codegen passes. Observed flag values and their meanings:

| Flag Value | Bit | Meaning | Call Sites |
| 0x80 | 7 | isCall | Instruction scheduler (sub_2EE40E0) |
| 0x200 | 9 | isReservedReg | Branch folding (sub_2F33DD0) |
| 0x80000 | 19 | isImplicit | InstrEmitter generic path, StructurizeCFG |
| 0x100000 | 20 | isSimple / isMachineReg | InstrEmitter CopyFromReg, dead copy pass |
| 0x400000 | 22 | isSubRegister | InstrEmitter sub-register resolution |
| 0x40000000 | 30 | isAllocatable | InstrEmitter CopyFromReg class check |
| 0x1000000000 | 36 | NVIDIA: implicit glue consumer | InstrEmitter only (2 sites) |

The function signature is bool hasProperty(SDNode *node, uint64_t flag_mask, unsigned operand_idx). It reads the MCInstrDesc via [node+10h] -> [desc+18h], extracts a bit field by shifting right by the appropriate amount, and ANDs with 1 to produce a boolean result.

Internal Data Structures

InstrEmitter Object Layout

The InstrEmitter instance carries three hash tables for tracking the SDNode-to-MachineInstr mapping:

| Offset | Name | Entry Size | Purpose |
| +0x410 | VReg Map (Table A) | 16 bytes | SDNode result to virtual register |
| +0x460 | MI Map (Table B) | 40 bytes | Glue chain to MachineInstr mapping |
| +0x4D0 | Result Map (Table C) | 32 bytes | SDNode to result number |
| +0x4E0 | forceEmit flag | 1 byte | When set, emit even dead nodes |

All three use LLVM's DenseMap implementation with open addressing and linear probing. The hash function is key * 37 (LLVM's DenseMapInfo<unsigned>::getHashValue). Empty sentinel: 0xFFFFFFFF. Tombstone: 0xFFFFFFFE. Table C uses an extended sentinel 0xFFFFFFFFFFFFF000. Rehash triggers at 3/4 load factor: entry_count * 4 >= capacity * 3. Growth is handled by sub_2E29BA0 which doubles capacity and rehashes.

SDOperand Output Record

Each emitted result is recorded in a 40-byte (0x28) structure:

struct EmitResultRecord {  // 40 bytes
    SDNode *producer;         // +0x00: SDNode that produced this result
    int32_t src_vreg;         // +0x08: source virtual register (-1 if physical)
    int32_t dst_vreg;         // +0x0C: destination virtual register (-1 if unassigned)
    TargetRegisterClass *RC;  // +0x10: register class pointer (or NULL)
    unsigned sub_reg_idx;     // +0x18: sub-register index (or 0)
    uint32_t flags;           // +0x20: tied, early_clobber, implicit bits
};

SDNode Field Offsets

Confirmed SDNode field layout from the binary (matches LLVM 20.0.0 base with minor NVIDIA extensions):

| Offset | Type | Field |
| +0x00 | tagged ptr | Chain/glue link (low 3 bits = type tag) |
| +0x08 | uint32 | Use count / reference count |
| +0x20 | ptr | Operand array pointer |
| +0x28 | uint32 | Operand count (low 24 bits) |
| +0x2C | uint8 | Property flags (bit 2 = isTied, bit 3 = hasGlue) |
| +0x30 | tagged ptr | First predecessor link |
| +0x38 | tagged ptr | Glue result chain |
| +0x44 | uint16 | Opcode |
| +0x78 | uint32 | Reference count (dead node detection) |

Tagged pointers are stripped throughout with AND 0xFFFFFFFFFFFFFFF8 (clear low 3 bits). Physical registers are encoded with bit 31 set (negative int32); extraction uses AND 0x7FFFFFFF followed by a shift-left by 4 to index the register descriptor table.

Dead Copy Elimination

After the main emission loop completes, a dedicated cleanup pass (Phase 12 in the binary, offset 0x2EE0816--0x2EE09AC) scans all emitted result records and eliminates redundant COPY instructions. This is notably aggressive compared to upstream LLVM, which defers dead copy removal to a separate DeadMachineInstrElimination pass later in the pipeline. CICC performs it inline because NVPTX's SelectionDAG generates massive numbers of redundant copies when lowering kernel parameter loads -- each parameter maps to a fixed physical register (%r1--%r255 corresponding to PTX parameter registers), and the DAG legalizer inserts CopyFromReg nodes for every parameter access.

Dead Copy Elimination Algorithm

The algorithm walks the emitted result record array (0x28-byte stride, accumulated during Phases 4--11) and classifies each record for deletion or preservation.

DeadCopyElimination(InstrEmitter *self, ResultRecord *records, int count):
    // records is at [rbp-0x250], count at [rbp-0x248]
    // stride = 0x28 (40 bytes per record)

    end = records + count * 0x28
    cursor = records

    while cursor < end:
        MI = cursor->producer             // [rbx+0x00]: the MachineInstr*
        TII = self->TargetInstrInfo       // [r14+0x08]

        // Step 1: Classify by opcode
        if MI->opcode == 0x14:            // CopyFromReg
            // CopyFromReg-specific path: virtual dispatch to target
            vtable = TII->vtable
            result = vtable[0xF0](         // ~30th virtual method
                MI,                        // the CopyFromReg MI
                &cursor[0x08],             // source vreg slot
                /* additional args */
            )
            // This checks whether the target considers the copy
            // sinkable or rematerializable -- NVPTX overrides this
            // for parameter register copies that are trivially dead

        else:
            // Generic MI path: check via vtable[0x350]
            result = TII->vtable[0x350](MI, cursor, ...)

        // Step 2: Check source register kill flags
        src_reg = cursor->src_vreg        // [rbx+0x08]
        if src_reg < 0:                   // physical register (sign bit set)
            clearKillFlags(self->MRI, src_reg)  // sub_2EBF120

        // Step 3: Check dest register kill flags
        dst_reg = cursor->dst_vreg        // [rbx+0x0C]
        if dst_reg < 0:                   // physical register
            clearKillFlags(self->MRI, dst_reg)  // sub_2EBF120

        // Step 4: Determine if MI is dead
        //   Check opcode: if (MI->opcode - 1) <= 1 (opcode 1 or 2)
        //   then check MI->operand[0] byte [+0x40] bit 4 (0x10)
        //   which indicates "result consumed by inline fold"
        opc = MI->opcode
        if (opc == 1 || opc == 2):        // COPY or REG_SEQUENCE
            if MI->operands[0].flags & 0x10:   // inline folded
                goto mark_dead

        // Step 5: Property gate
        flags_2c = MI->flags_2c           // [rdi+2Ch]
        if !(flags_2c & 0x04):            // bit 2 not set
            // Check TSFlags bit 20 via descriptor
            desc = MI->MCInstrDesc        // [rdi+10h]
            tsflags = desc->TSFlags       // [desc+18h]
            is_simple = (tsflags >> 20) & 1
            if !is_simple:
                goto emit_and_advance     // not a candidate

        // (falls through only when bit 2 set OR TSFlags bit 20 set)

        // Step 6: Check hasProperty(0x100000, 1) -- isMachineReg
        has_prop = hasProperty(MI, 0x100000, 1)   // sub_2E88A90
        if !has_prop:
            // MI is deletable: call eraseFromParent
            eraseFromParent(MI)            // sub_2E88E20
            advance cursor by 0x28
            continue

    mark_dead:
        // Step 7: Liveness check via isUnusedReg
        unused = isUnusedReg(MI)           // sub_2E8B100
        if unused:
            // Still has a def -- erase immediately
            eraseFromParent(MI)            // sub_2E88E20
        else:
            // Defer: add to dead list for bulk deletion
            addToDeadList(self->deadList, MI)  // sub_2ED56A0
            // deadList is at InstrEmitter+0x4A0

        advance cursor by 0x28

Glue Chain Walk in Dead Copy Context

After the per-record loop, the emitter performs a secondary traversal for CopyFromReg records that survived deletion. For each surviving copy whose SDNode has a glue result ([r13+38h] != 0):

  1. Walk the glue chain backward via [r13+0] & 0xFFFFFFFFFFFFFFF8 (strip tag bits).
  2. For each predecessor in the chain, check [rax+2Ch] & 4 -- if the predecessor has been scheduled (bit 2 set), continue walking.
  3. If the predecessor has an unresolved glue reference ([r13+38h] non-null) and the predecessor's MI has zero uses after copy elimination, mark it for deferred deletion too.

This secondary walk catches cascading dead copies: when a CopyFromReg is deleted, its glued predecessor may also become dead.

Deferred Deletion via Dead List

MIs added to InstrEmitter+0x4A0 via sub_2ED56A0 are not deleted immediately. Instead, they are accumulated and deleted in bulk during Phase 14 (final cleanup at 0x2EE0C0B). The dead list is a SmallVector<MachineInstr*> with 8 inline entries (64 bytes inline buffer), growing via sub_C8D5F0 if needed. Bulk deletion avoids iterator invalidation during the emission loop and is more cache-friendly for large basic blocks.

Why NVPTX Needs Aggressive Dead Copy Elimination

NVPTX kernel signatures routinely have 20--60 parameters, each lowered through a CopyFromReg from a fixed physical register. The SelectionDAG legalizer creates CopyFromReg SDNodes for each parameter load, but many parameters are only used in a subset of the kernel's basic blocks. Without immediate dead copy elimination, a kernel with 50 parameters would carry 50 COPY MachineInstrs at function entry, most of which are dead in any given block. The standard LLVM DeadMachineInstrElimination pass would eventually clean these up, but doing so immediately during emission:

  1. Reduces the MachineBasicBlock size that subsequent passes (register allocation, scheduling) must process.
  2. Avoids creating unnecessary VReg-to-PhysReg interference entries in the register allocator.
  3. Prevents false register pressure signals from dead copies during the MRPA (Machine Register Pressure Analysis) pass that NVIDIA uses for scheduling decisions.

NVIDIA-Specific Emission Patterns

Parameter Cascade Emission

NVPTX kernel entry functions map each parameter to a physical register via a cascade of CopyFromReg SDNodes. During emission, this produces a dense block of COPY MachineInstrs at the top of the entry MachineBasicBlock. The emitter handles this pattern specially:

  1. When EmitSchedule processes the first SUnit, it detects a sequence of CopyFromReg nodes whose source registers are consecutive physical parameter registers (%r1, %r2, ...).
  2. Each CopyFromReg is processed through the Phase 5 path (at 0x2EDF423). The register class resolution at 0x2EDF4C2 uses the compressed bit-vector test to verify the destination belongs to the Int32Regs or Int64Regs class.
  3. Dead copy elimination (Phase 12) immediately removes copies whose destinations have no users, reducing the entry block size before subsequent passes see it.

Texture/Surface Glue Bundle Emission

Texture and surface operations are emitted as glue bundles: a chain of SDNodes connected by glue edges that must produce a contiguous sequence of MachineInstrs. The emitter walks the glue chain backward from the final node and emits predecessors first. The bit 36 flag is critical here: when a texture fetch produces both a data result and a predicate condition, the predicate-producing node carries bit 36 on its data operand, telling the emitter that the preceding glued instruction already materialized the value and no separate COPY is needed.

The triple vtable dispatch at the end of emission (Phase 5 in the algorithm) handles the expansion of texture pseudo-instructions: EmitInstrWithCustomInserter (vtable 0xB8) replaces the texture pseudo-MI with the actual address setup, sampler configuration, and fetch instruction sequence.

Multi-Result SDNode Self-Recursion

When an SDNode produces multiple results (e.g., a div+rem pair or a load-with-predicate), the emitter calls itself recursively at sub_2EDDF20 to emit MIs for each additional result. The self-recursive call shares the same InstrEmitter instance and hash tables. This is a CICC-specific pattern; upstream LLVM handles multi-result nodes in a loop within EmitMachineNode rather than via recursion. The recursive approach simplifies the handling of multi-result nodes that themselves have glue chains (e.g., a texture fetch that returns 4 components).

Opcode-1/Opcode-2 Inline Fold Detection

During the dead copy scan (Phase 12, offset 0x2EE08A0--0x2EE08BA), the emitter checks if the MI's opcode is 1 or 2 (COPY or REG_SEQUENCE). For these opcodes, it reads the first operand's byte at [operand_array + 0x40] and tests bit 4 (0x10). This bit indicates the result was consumed via an inline fold -- the consumer instruction selected a pattern that folds the copy directly into its own operand. When this bit is set, the COPY MI is marked dead regardless of its use count, because the consuming instruction no longer references it.

0x2EE08A0: movzx eax, word ptr [rdi+44h]   ; MI->opcode
0x2EE08A4: sub   eax, 1                     ; opcode - 1
0x2EE08A7: cmp   eax, 1                     ; is it 1 (COPY) or 2 (REG_SEQUENCE)?
0x2EE08AA: ja    not_copy                   ; no -> skip
0x2EE08AC: mov   rax, [rdi+20h]             ; MI->operands array
0x2EE08B0: test  byte ptr [rax+40h], 0x10   ; bit 4 = inline fold consumed
0x2EE08B4: jnz   mark_dead                  ; if folded -> dead

NVIDIA Modifications vs Stock LLVM

| Area | Upstream LLVM | CICC v13.0 |
| EmitNode dispatch | Two separate functions: EmitMachineNode + EmitSpecialNode | Single merged function sub_2EDDF20 with bit-table dispatch |
| CopyToReg | Inline in EmitSpecialNode | Factored into dedicated sub_2ED95B0 |
| Custom inserter check | Single vtable call to EmitInstrWithCustomInserter | Triple vtable dispatch (0xB8, 0x348, 0x160) |
| Extended MI flags | Standard LLVM flag set (32 bits) | Bit 36 (0x1000000000) for NVPTX-specific semantics |
| Dead copy elimination | Post-emission pass in ScheduleDAGSDNodes | Inlined aggressive cleanup within EmitNode |
| Stack frame | ~300--400 bytes typical | 872 bytes (multiple inline SmallVectors and hash tables) |
| Self-recursion | Not self-recursive | Self-recursive for multi-result SDNode chains |
| Inline fold detection | Not present at this stage | Opcode-1/2 fold bit check during dead copy scan |
| Glue chain secondary walk | Not present | Cascading dead copy detection through glue predecessors |

Complexity

  • Main emission loop: O(N) in the number of scheduled SDNodes.
  • Hash table lookups: O(1) amortized with rehashing at 3/4 load.
  • Dead copy elimination: O(C * U) where C = copies emitted, U = average uses per register.
  • Glue chain traversal: O(G) per node where G = glue chain length (typically 1--5).
  • Memory: O(N) for the three hash tables + O(R) for result records.

Function Map

| Function | Address | Role |
| InstrEmitter::EmitNode | sub_2EDDF20 | Main entry, 11,722 bytes |
| ScheduleDAGSDNodes::EmitSchedule | sub_2EE0CF0 | Top-level driver, 59KB |
| EmitCopyToReg | sub_2ED95B0 | Dedicated CopyToReg handler |
| getRegForValue | sub_2E8B400 | SDValue to VReg mapping |
| isUnusedReg | sub_2E8B100 | Dead register predicate |
| isDeadNode | sub_2DADC00 | Dead SDNode predicate |
| eraseFromParent | sub_2E88E20 | MachineInstr deletion |
| hasProperty | sub_2E88A90 | Register/operand flag query |
| getVRegDef | sub_2EBEE10 | Virtual register definition lookup |
| isPhysReg | sub_2EBEF70 | Physical vs virtual register check |
| replaceRegWith | sub_2EBECB0 | Virtual register substitution |
| clearKillFlags | sub_2EBF120 | Remove kill annotations |
| Sub-register resolution | sub_2ED7930 | SUBREG_TO_REG handling |
| EmitSubregNode | sub_2EDB7A0 | Sub-register copy emission |
| EmitCopyToRegClassOp | sub_2EDD7E0 | Class-constrained copy |
| ProcessOperands | sub_2ED3660 | EmitMachineNode core |
| isAllocatableInClass | sub_2E6D360 | Register class membership |
| DenseMap::find | sub_2E5E6D0 | SDNode-to-MI lookup |
| addToDeadList | sub_2ED56A0 | Queue MI for deletion |
| DenseMap::grow | sub_2E29BA0 | Hash table resize |
| NVPTXInstrInfo default | sub_2ED11C0 | EmitInstrWithCustomInserter stub |
| NVPTXInstrInfo default | sub_2ED11E0 | getInsertSubreg stub |
| NVPTXInstrInfo default | sub_2ED11F0 | expandPostRAPseudo stub |
| operand comparison | sub_2ED1840 | Operand equality helper |
| MI builder | sub_2ED19B0 | Additional MachineInstr construction |
| register mapping | sub_2ED41E0 | Register mapping utility |
| register info query | sub_2ED4900 | Register info accessor |
| MI property query | sub_2ED5D10 | MachineInstr property reader |
| emission utility | sub_2EDA920 | Additional emission helper |
| setDesc | sub_2EAB0C0 | Sets MI operand descriptors during emission |
| addOperand | sub_2E31210 | Appends operand to MachineInstr |
| MI manipulation | sub_2E31DD0 | Additional MI manipulation utility |
| TRI utility | sub_2E4EE60 | TargetRegisterInfo helper |
| NVPTXRegisterInfo | sub_2E4F5F0 | Register class query vtable method |

Differences from Upstream LLVM

| Aspect | Upstream LLVM | CICC v13.0 |
| EmitNode structure | Separate EmitNode and EmitSpecialNode dispatchers | Merged into single monolithic function (sub_2EDDF20, 11,722 bytes) with bit-table opcode classification |
| CopyToReg handling | Inline within EmitSpecialNode | Factored out to dedicated handler (sub_2ED95B0) for NVPTX's physical-register-heavy .param ABI |
| MachineInstr flags | Standard flag bits (up to bit ~20) | Extended flag at bit 36 (0x1000000000) not present in stock LLVM; marks NVIDIA-specific instruction properties |
| Pseudo-expansion | Single vtable dispatch for target pseudo-instructions | Triple vtable dispatch pattern gating custom expansion for GPU-specific pseudo-instructions |
| Dead node predicate | Standard isDeadNode check | Custom sub_2DADC00 predicate with NVPTX-specific liveness criteria |
| VReg hash table | Standard DenseMap for value-to-VReg mapping | Custom hash with key * 37 and 3/4 load factor rehash policy |

Cross-References

TwoAddressInstruction

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

LLVM version note: Structurally identical to LLVM 20.0.0 TwoAddressInstructionPass.cpp. NVIDIA extensions are limited to deeper EXTRACT_SUBREG handling for multi-register results (texture/tensor/warp ops), extended LiveVariables maintenance, OptimizationRemarkEmitter integration, and the standard optnone/fast-compile gate.

The TwoAddressInstruction pass converts three-address MachineInstrs into two-address form by inserting COPY pseudo-instructions so that tied operand constraints are satisfied before register allocation. In upstream LLVM, many CPU targets have instructions where one source operand must be the same physical register as the destination (x86 addl %esi, %edi means %edi = %edi + %esi); the pass rewrites A = B op C into A = COPY B; A op= C. On NVPTX this pass is largely a formality -- PTX instructions are three-address and the virtual register file has no physical-register constraints -- but it still performs essential bookkeeping: eliminating REG_SEQUENCE and INSERT_SUBREG pseudo-instructions, building copy-equivalence maps for downstream coalescing, and handling the tied operands that arise from multi-result NVPTX intrinsics (texture loads, tensor core operations, warp-level collectives).

| Pass name | "Two-Address instruction pass" |
| Pass ID | "twoaddressinstruction" |
| Pipeline slot | "two-address-instruction" (MachineFunction pass #521) |
| runOnMachineFunction | sub_1F53550 (79KB, 2,470 lines) |
| tryInstructionTransform | sub_1F4EF20 (28KB, 1,127 lines) |
| processTiedPairs | sub_1F50270 (63KB, 2,209 lines) |
| Cluster address range | 0x1F4D000 -- 0x1F56000 |
| libNVVM twin | sub_F4EA80 (2,455 lines, structurally identical) |
| Verification string | "After two-address instruction pass" |
| Ordering | After PHI elimination, before RegisterCoalescer |

Why This Pass Exists on NVPTX

PTX is a three-address virtual ISA -- every arithmetic instruction takes separate dst, src0, src1 operands, and the hardware register allocator inside ptxas handles physical assignment. On a CPU target like x86, the TwoAddress pass is critical because most ALU instructions destroy one source register. On NVPTX, the pass fires primarily for three categories:

  1. Pseudo-instruction lowering. REG_SEQUENCE, INSERT_SUBREG, and EXTRACT_SUBREG are LLVM-internal pseudo-opcodes that must be eliminated before register allocation regardless of target. The TwoAddress pass rewrites INSERT_SUBREG into COPY and expands REG_SEQUENCE into per-subreg copies.

  2. Multi-result intrinsics. NVPTX texture/surface loads return v4f32 or v2f64 as multi-register results. Warp-level operations (wmma, mma) produce multi-register outputs. These get lowered into chains of EXTRACT_SUBREG pseudo-instructions that the pass must decompose into individual COPYs, one per extracted component.

  3. Inline assembly tied operands. CUDA inline asm blocks with "+r" (read-write) constraints produce tied operands where the output register must match the input. The pass inserts a COPY from the input virtual register to the output register to satisfy the constraint.

For most ordinary NVPTX arithmetic instructions, collectTiedOperands finds nothing and the pass skips the instruction after updating the distance map and processing any copy-equivalence information. The pass is not a no-op, but the heavy transformation paths (commutation, 3-address conversion, load unfolding) almost never fire for GPU code.

Algorithm

The pass iterates over every MachineBasicBlock and every MachineInstr within it, maintaining per-block data structures that are cleared at block boundaries.

for each MBB in MF:
    clear DistanceMap, SrcRegMap, DstRegMap, SrcEqClassMap, DstEqClassMap, Processed
    dist = 0

    for each MI in MBB:
        skip bundle internals
        skip COPY (opcode 12) and SUBREG_TO_REG (opcode 13)
        skip if MI is in the "reprocess" set

        if MI is EXTRACT_SUBREG (opcode 14):
            // NVPTX extended path -- multi-result decomposition
            // See detailed algorithm below
            decomposeExtractSubreg(MI)
            continue

        if MI is REG_SEQUENCE (opcode 15):
            // Standard LLVM: expand into per-subreg COPYs
            eliminateRegSequence(MI)
            continue

        DistanceMap[MI] = ++dist

        // Build copy-equivalence classes for downstream coalescing
        processCopy(MI)  // tracks COPY, REG_SEQUENCE, INSERT_SUBREG chains

        // Collect (srcIdx, dstIdx) pairs for all tied operands
        if not collectTiedOperands(MI, TiedOperandMap):
            continue

        // Single-pair fast path: attempt commutation / 3-addr conversion
        if TiedOperandMap has exactly 1 register with 1 pair:
            if tryInstructionTransform(MI, srcIdx, dstIdx, dist):
                continue  // constraint eliminated without COPY

        // General path: insert COPYs for all remaining tied pairs
        for each (reg, pairs) in TiedOperandMap:
            processTiedPairs(MI, pairs, dist)

        // Rewrite INSERT_SUBREG to COPY after tied constraints satisfied
        if MI is INSERT_SUBREG:
            remove operands 3 and 1
            rewrite descriptor to COPY

tryInstructionTransform (sub_1F4EF20)

This is the optimization core. When OptLevel != None, it attempts to satisfy a tied constraint without inserting a COPY, in priority order:

  1. Commutation. If swapping operands makes src match dst, commute the instruction via TII->commuteInstruction(). On NVPTX, most arithmetic instructions are commutative, so this is the most frequent success path. Upstream uses isProfitableToCommute() which walks up to MaxDataFlowEdge (default 3) dataflow edges to evaluate benefit.

  2. 3-address conversion. Call TII->convertToThreeAddress() to produce a true three-operand form. On NVPTX this is essentially dead code -- PTX instructions are already three-address -- but the infrastructure exists because the pass is shared LLVM code.

  3. Rescheduling. When twoaddr-reschedule is enabled (default true), attempt to move the kill of the source register closer to the current instruction (rescheduleMIBelowKill) or move the current instruction below the kill (rescheduleKillAboveMI). This can eliminate the need for a copy by making the source register die at the tied use.

  4. Load unfolding. For instructions with folded loads where the source is not killed, unfold the load into a separate MOV + arithmetic pair. Not applicable on NVPTX (no load folding).

  5. COPY insertion. If all optimization attempts fail, fall through to processTiedPairs which inserts an explicit COPY.
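The priority ladder can be sketched as a small Python model (the strategy callables and names here are illustrative stand-ins for the real TII hooks, not decompiled symbols):

```python
def try_instruction_transform(strategies, opt_level):
    """Model of the sub_1F4EF20 priority ladder: 'strategies' is an
    ordered list of (name, zero-arg callable) pairs standing in for
    commutation, 3-address conversion, rescheduling, and unfolding."""
    if opt_level == 0:
        return "copy"          # optnone/fast-compile gate: no optimization
    for name, attempt in strategies:
        if attempt():
            return name        # constraint eliminated without a COPY
    return "copy"              # fall through to processTiedPairs

assert try_instruction_transform([("commute", lambda: True)], 2) == "commute"
assert try_instruction_transform([("commute", lambda: False),
                                  ("reschedule", lambda: True)], 2) == "reschedule"
assert try_instruction_transform([("commute", lambda: True)], 0) == "copy"
```

The last assertion captures the optnone behavior: with the gate active, even a trivially commutable instruction still gets an explicit COPY.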

The function is self-recursive (22 cross-references, including the recursive call back into sub_1F4EF20) for transitive constraint resolution -- when unfolding creates a new instruction that itself has tied operands, the resolution recurses.

EXTRACT_SUBREG Multi-Result Decomposition Algorithm

This is the most substantial NVIDIA extension to the upstream pass. The code lives at lines 821--994 of sub_1F53550 (decompilation line numbers from the 2,470-line function body). Standard LLVM handles single-result EXTRACT_SUBREG; the NVPTX version handles multi-result instructions where the InstrEmitter has produced a single EXTRACT_SUBREG pseudo with multiple operand pairs representing all extracted components.

Why Multi-Result EXTRACT_SUBREG Exists

When InstrEmitter::EmitNode (sub_2EDDF20, 872-byte stack frame, self-recursive for multi-result SDNode chains) lowers a multi-result NVPTX intrinsic, it produces a single MachineInstr with opcode 14 (EXTRACT_SUBREG) carrying N operand pairs -- one per result component. Each pair contains a def register (the extracted component destination) and a use register (the source super-register) plus a subreg index encoding which component to extract. The TwoAddress pass must decompose this single multi-operand pseudo into N separate COPY instructions.

The major producer categories:

| Producer | Handler | ID range | Typical result width |
|---|---|---|---|
| Texture/surface loads | sub_33A4350 | 50 IDs (0x5D--0x8D) | v4f32, v2f64, v4i32 |
| WMMA / MMA operations | sub_33A64B0 | 95 IDs (0xA4--0xA8, 0x194--0x1EC) | 2--8 register fragments |
| Multi-element surface ops | case 0xA2 | single | loop over elements |
| MMA sm90+ (wgmma) | sub_33AC8F0 | 0x183--0x191 | 8--16 register fragments |
| TMA operations | sub_33AD3D0 | 0x179--0x17C | varies |
| Async copy | sub_33ADA20 | 0x17F--0x182 | 2 results (data + token) |

The DAG-level builders that produce multi-result nodes are sub_3411BE0 (multi-result DAG node), sub_33FC220 (multi-result variadic node), and sub_33F7800 (multi-result alternate form). The type list is built by sub_1D25C30 (SelectionDAG::getVTList for multi-result).

Operand Memory Layout

Each MachineOperand occupies 40 bytes in memory (stride 40 per operand in the operand array):

| Offset within operand | Size | Field |
|---|---|---|
| +0 | byte | Flags byte 0: bit 0 = isDef |
| +2 | word | Flags word: bits 4--11 = subreg class index, bits 8--19 = subreg index |
| +3 | byte | Flags byte 3: bit 4 = isTied, bit 6 = earlyTied |
| +4 | byte | Flags byte 4: bit 0 = isTied flag (secondary) |
| +8 | int64 | Register number (virtual reg > 0, physical reg < 0) |

The subreg index is extracted by the formula:

subregIdx = (*(uint32_t*)(operand + 0) >> 8) & 0xFFF

This 12-bit field encodes which sub-register of the source to extract: sub0, sub1, sub2, sub3, etc. For a v4f32 texture result, the values are typically 1 through 4.
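The extraction formula can be modeled in a few lines of Python (a sketch of the bit arithmetic above, not the decompiled code itself):

```python
def subreg_index(flags: int) -> int:
    # Bits 8-19 of the operand's first dword encode the sub-register index.
    return (flags >> 8) & 0xFFF

# A v4f32 texture result typically carries indices 1 through 4:
assert subreg_index(0x00000100) == 1
assert subreg_index(0x00000400) == 4
# Bits outside the 12-bit field are masked off:
assert subreg_index(0xFFF00300) == 3
```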

Decomposition Pseudocode

// sub_1F53550 lines 821-994: EXTRACT_SUBREG handler (opcode == 14)
decomposeExtractSubreg(MI):
    numOps = MI.getNumOperands()              // v405
    pairIdx = 0                               // v286, stride-2 counter

    while pairIdx < numOps:
        defOp  = MI.getOperand(pairIdx)       // base + pairIdx * 40
        useOp  = MI.getOperand(pairIdx + 1)   // base + (pairIdx+1) * 40

        dstReg = defOp.getReg()               // *(int64*)(defOp + 8)
        srcReg = useOp.getReg()               // *(int64*)(useOp + 8)

        // Extract subreg index from def operand flags (bits 8-19)
        subregIdx = (defOp.flags >> 8) & 0xFFF

        // Check if this operand is already tied (bit 0 of byte +4)
        alreadyTied = (defOp.flagsByte4 & 1) != 0

        // === CREATE COPY INSTRUCTION ===
        // sub_1E0B640(MBB, insertPoint, MI.getDebugLoc(), 0)
        // This is BuildMI -- allocates a new MachineInstr with opcode COPY
        newCOPY = BuildMI(MBB, MI, MI.getDebugLoc(), TII.get(TargetOpcode::COPY))

        // Insert into block's instruction list
        if MI.isBundledWithSucc():
            sub_1DD6E10(MBB, MI, newCOPY)     // insertBefore (bundled variant)
        else:
            sub_1DD5BA0(MBB, MI, newCOPY)     // standard list insert

        // Add def operand: destination register with subreg class encoding
        // sub_1E1A9C0(newCOPY, dstReg, flags_with_subregclass)
        newCOPY.addOperand(MachineOperand::CreateReg(dstReg, /*isDef=*/true))

        // Add use operand: source register
        // sub_1E1A9C0(newCOPY, srcReg, flags_use)
        newCOPY.addOperand(MachineOperand::CreateReg(srcReg, /*isDef=*/false))

        // === EARLY TIED OPTIMIZATION ===
        // When this is NOT the first pair (pairIdx > 0) and the instruction
        // has tied constraints, check if a later pair shares the same dest
        // register. If so, mark the first operand of this COPY with isTied,
        // allowing the register coalescer to merge them without an extra COPY.
        if pairIdx > 0:
            earlyTiedCheck = (defOp.flagsByte3 >> 6) & 1   // bit 6
            isTiedCheck    = (defOp.flagsByte3 >> 4) & 1   // bit 4
            if earlyTiedCheck AND isTiedCheck:
                newCOPY.getOperand(0).setTied()  // set bit 0 of byte +4

        // === OPTIMIZATION REMARK ===
        if ORE != null:                       // pass object offset +272
            sub_1DCCCA0(ORE, dstReg, MI, newCOPY)   // emit copy remark
            remarkData = sub_1DCC790(ORE, dstReg)    // lookup remark data
            sub_1F4C640(remarkData)                   // filter/emit remark
            sub_1DCBB50(ORE)                          // push to output
            if newCOPY.isInsideBundle():
                walk to bundle head via successor chain
            if sub_1E1AFE0(bundleHead):               // hasProperty check
                sub_1DCC370(ORE, remarkNode)          // append to list

        // === LIVEVARIABLES UPDATE ===
        if LV != null:                        // pass object offset +280
            sub_1DBF6C0(LV, MBB, MI, newCOPY, ...)
            // This calls the full update chain:
            //   sub_1DBA290: createNewVarInfo for newCOPY's def register
            //   sub_1DBB110: initVarInfo (initialize kill/def lists)
            //   sub_1DB3C70: findKill (locate kill point in block)
            //   sub_1DB4410: addKill (update kill tracking for srcReg)
            //   sub_1DB8610: addNewBlock (update block-level liveness)

        pairIdx += 2                          // v286 += 2 (stride-2)

    // === CLEANUP ===
    // Remove all operands from original MI, then erase it
    sub_1E16240(MI)                           // RemoveOperand (bulk)
    MI.eraseFromParent()
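The stride-2 pair walk at the heart of the decomposition can be modeled in Python (a sketch under simplifying assumptions: operands are (reg, flags) tuples rather than 40-byte MachineOperand records, and the emitted "COPY" is just a tuple):

```python
def decompose_extract_subreg(operands):
    """Model of the sub_1F53550 EXTRACT_SUBREG handler: 'operands' is a
    flat list alternating def/use pairs; each pair yields one COPY
    described as (dstReg, srcReg, subregIdx)."""
    copies = []
    for i in range(0, len(operands), 2):     # stride-2 counter (v286)
        dst_reg, def_flags = operands[i]     # def operand of the pair
        src_reg, _ = operands[i + 1]         # use operand (super-register)
        subreg_idx = (def_flags >> 8) & 0xFFF
        copies.append((dst_reg, src_reg, subreg_idx))
    return copies

# v4f32 texture result: 4 def/use pairs -> 4 COPYs extracting sub1..sub4
ops = [(100, 0x100), (50, 0), (101, 0x200), (50, 0),
       (102, 0x300), (50, 0), (103, 0x400), (50, 0)]
assert decompose_extract_subreg(ops) == [
    (100, 50, 1), (101, 50, 2), (102, 50, 3), (103, 50, 4)]
```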

earlyTied Optimization Detail

The earlyTied optimization is a critical performance path. Consider a v4f32 texture load producing 4 results. Without earlyTied, the decomposition creates 4 independent COPY instructions. The register coalescer must then discover independently that some of these COPYs can be coalesced.

The earlyTied flag (bit 6 of operand flags byte +3) is set during instruction emission when the emitter knows that consecutive extract results target adjacent sub-registers of a contiguous super-register. When detected, the pass marks the COPY's def operand with the isTied bit, creating a chain of tied constraints:

// Without earlyTied (4 independent COPYs, coalescer must work harder):
%dst0 = COPY %src.sub0
%dst1 = COPY %src.sub1
%dst2 = COPY %src.sub2
%dst3 = COPY %src.sub3

// With earlyTied (COPYs carry tie hints, coalescer has direct information):
%dst0 = COPY %src.sub0                          // first pair: no tie
%dst1 = COPY %src.sub1   [tied to %dst0.succ]   // isTied bit set
%dst2 = COPY %src.sub2   [tied to %dst1.succ]   // isTied bit set
%dst3 = COPY %src.sub3   [tied to %dst2.succ]   // isTied bit set

The condition is: (flagsByte3 >> 6) & 1 (earlyTied set) AND (flagsByte3 >> 4) & 1 (isTied set) AND pairIdx > 0 (not the first pair). This triple-guard prevents false positives on single-result extracts and on the first component which has no predecessor to tie to.
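The triple guard is a pure bit predicate on flags byte +3 and the pair index, which a short Python model makes explicit:

```python
def early_tied_hint(flags_byte3: int, pair_idx: int) -> bool:
    # Guard from the decomposition loop: earlyTied (bit 6) AND isTied
    # (bit 4) of flags byte +3, AND not the first extracted pair.
    return pair_idx > 0 and bool((flags_byte3 >> 6) & 1) \
                        and bool((flags_byte3 >> 4) & 1)

assert early_tied_hint(0b0101_0000, 2)        # both bits set, later pair
assert not early_tied_hint(0b0101_0000, 0)    # first pair never tied
assert not early_tied_hint(0b0001_0000, 2)    # earlyTied bit clear
```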

LiveVariables Update Chain

Every COPY produced by the decomposition triggers a six-function update sequence. This is deeper than upstream LLVM's TwoAddress LiveVariables handling and suggests NVIDIA's downstream register allocator (the greedy RA at sub_1E5B110) is particularly sensitive to stale liveness:

| Step | Function | Purpose |
|---|---|---|
| 1 | sub_1DBF6C0 | Entry: transfer liveness from old MI to new COPY |
| 2 | sub_1DBA290 | createNewVarInfo: allocate VarInfo for the COPY's def register |
| 3 | sub_1DBB110 | initVarInfo: initialize the VarInfo's kill list, def list, and alive-block bitvector |
| 4 | sub_1DB3C70 | findKill: scan the current block to locate where srcReg is killed |
| 5 | sub_1DB4410 | addKill / removeKill: move the kill point from the original MI to the new COPY (srcReg now dies at the COPY, not at the original EXTRACT_SUBREG) |
| 6 | sub_1DB8610 | addNewBlock: update block-level liveness bitvectors if srcReg is live-in to this block from a predecessor |

For a v4f32 decomposition, this executes 24 function calls (6 per component times 4 components). For a wmma.mma producing 8 fragments, it is 48 calls. The cost is quadratic in the worst case because findKill scans from the block start, but in practice the kill is always close to the insertion point.

Multi-Result Producers on NVPTX

The EXTRACT_SUBREG decomposition path fires for all NVPTX operations that produce more than one register result. These originate in the intrinsic lowering pass (sub_33A64B0 and friends in the 0x33A cluster) and flow through SelectionDAG ISel and InstrEmitter before reaching TwoAddress.

Texture and Surface Loads

The texture bulk handler sub_33A4350 covers 50 intrinsic IDs (0x5D through 0x8D). A tex.1d.v4.f32 intrinsic produces an SDNode with value type list {f32, f32, f32, f32, chain} via sub_1D25C30 (getVTList). InstrEmitter converts this into a single MachineInstr with 8 operands (4 def/use pairs), which TwoAddress decomposes into 4 COPYs.

Surface read/write handlers at sub_33A3180 (IDs 0x8E--0x90) and the scatter/gather handler at case 0xA2 follow the same pattern with variable result widths.

WMMA and MMA Operations

The mega-handler sub_33A64B0 services 95 intrinsic IDs covering all wmma/mma variants across sm70+. A wmma.mma.sync on sm70 with fp16 accumulation produces 8 f16x2 fragments; on sm80 with tf32 it produces 4 f32 fragments. The sm90+ wgmma handler at sub_33AC8F0 (IDs 0x183--0x191) can produce up to 16 register fragments for large matrix shapes.

Each fragment becomes one operand pair in the EXTRACT_SUBREG pseudo. The TwoAddress pass decomposes a 16-fragment wgmma result into 16 individual COPYs, each with full LiveVariables update. This is the most expensive decomposition path in the entire pass.

TMA and Async Copy

TMA bulk operations (sub_33AD3D0, IDs 0x179--0x17C) and async copy operations (sub_33ADA20, IDs 0x17F--0x182) produce 2-result nodes (data + completion token). These are simpler decompositions with only 2 COPY instructions.

Inline Assembly Tied Operands

CUDA inline assembly with "+r" read-write constraints is the third category that exercises the TwoAddress pass on NVPTX. The tied operand pipeline spans three compilation stages:

Stage 1: EDG Constraint Construction (sub_1286D80 path)

The EDG frontend's inline asm codegen (analyzed in p2-B07-inline-asm-codegen.txt) detects tied operands when the operand descriptor byte at offset +24 equals 3. It constructs the constraint string by:

  1. Emitting the input value via sub_1286D80
  2. Appending * for indirect operands
  3. Appending the tied operand index as a decimal number to the constraint string

If the type size is a power of 2 and 64 bits or less, it may insert a bitcast to a matching integer type. GCC-style matching-digit constraints in input position are explicitly rejected with "tied input/output operands not supported!".
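The Stage 1 string construction can be sketched in Python (the helper name is hypothetical; the real logic is inlined in the EDG codegen path around sub_1286D80):

```python
def tied_constraint(tied_index: int, indirect: bool) -> str:
    """Hypothetical model of the EDG-side constraint for a tied input:
    '*' marks an indirect operand, then the tied output's operand index
    is appended in decimal."""
    return ("*" if indirect else "") + str(tied_index)

assert tied_constraint(0, False) == "0"    # plain "+r" tie to operand 0
assert tied_constraint(3, True) == "*3"    # indirect tie to operand 3
```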

Stage 2: DAG-Level Tied Resolution (sub_2079C70)

SelectionDAGBuilder::visitInlineAsm (sub_2079C70, 83KB) uses:

  • sub_20B4290: hasTiedOperand() -- checks if tied index is not -1
  • sub_20B42B0: getTiedOperand() -- returns the tied index
  • sub_2045250: resolveTiedOperand() -- creates the DAG-level constraint

The error string "inline asm not supported yet: don't know how to handle tied indirect register inputs" guards against the unsupported case of tied operands on memory-indirect inline asm operands.

Stage 3: TwoAddress COPY Insertion

After ISel, the tied operand from inline asm appears as a regular tied constraint in the MachineInstr operand list. The TwoAddress pass processes it through the standard collectTiedOperands / processTiedPairs path. For "+r" constraints this typically produces a single COPY before the INLINEASM instruction.

processTiedPairs Detail (sub_1F50270)

This 63KB / 2,209-line function is the heavyweight tied-operand resolver. It is called from the main loop whenever collectTiedOperands finds constraints that the fast path (tryInstructionTransform) could not resolve.

processTiedPairs(MI, tiedPairs, distance):
    for each (srcIdx, dstIdx) in tiedPairs:
        srcReg = MI.getOperand(srcIdx).getReg()
        dstReg = MI.getOperand(dstIdx).getReg()

        if srcReg == dstReg:
            continue    // constraint already satisfied

        // === ATTEMPT COMMUTATION (OptLevel != None) ===
        if canCommute(MI):
            // isProfitableToCommute walks up to MaxDataFlowEdge (default 3)
            // dataflow edges from srcReg and dstReg, comparing distances
            // in DistanceMap to determine if commuting reduces copies
            if isProfitableToCommute(MI, srcIdx, dstIdx, distance):
                TII->commuteInstruction(MI)
                if MI.getOperand(srcIdx).getReg() == MI.getOperand(dstIdx).getReg():
                    continue    // resolved by commutation

        // === ATTEMPT RESCHEDULING (twoaddr-reschedule = true) ===
        if twoAddrReschedule:
            // Try to move MI below the kill of srcReg
            if rescheduleMIBelowKill(MI, srcIdx, dstIdx, distance):
                continue    // resolved by rescheduling
            // Try to move the kill of srcReg above MI
            if rescheduleKillAboveMI(MI, srcIdx, dstIdx, distance):
                continue    // resolved by rescheduling

        // === ATTEMPT 3-ADDRESS CONVERSION ===
        // On NVPTX, convertToThreeAddress always returns null (dead code)
        if TII->convertToThreeAddress(MI, LIS):
            continue    // resolved by conversion (never happens on NVPTX)

        // === INSERT COPY (last resort) ===
        newCOPY = BuildMI(MBB, MI, DL, TII.get(COPY), dstReg).addReg(srcReg)

        // Extract subreg index from original operand
        subregIdx = (MI.getOperand(srcIdx).flags >> 8) & 0xFFF
        if subregIdx != 0:
            newCOPY.getOperand(1).setSubReg(subregIdx)

        // Insert into DistanceMap with incremented counter
        // Walk predecessor chain to find scheduling unit
        DistanceMap[newCOPY] = ++distance
        DistanceMap[MI] = ++distance

        // Rewrite srcReg to dstReg in original MI
        MI.getOperand(srcIdx).setReg(dstReg)     // sub_1E310D0

        // Update SrcEqClassMap: map srcReg -> dstReg
        SrcEqClassMap.insert(srcReg, dstReg)      // sub_1F4E3A0

        // === LIVEVARIABLES UPDATE ===
        if LV:
            varInfo = LV.getVarInfo(dstReg)        // sub_1DC1550
            if varInfo not found:
                varInfo = LV.createNewVarInfo(dstReg)  // sub_1DBA290
                LV.initVarInfo(varInfo)                 // sub_1DBB110
            // Transfer kill info: srcReg kill moves from MI to newCOPY
            killInfo = varInfo.findKill(MBB)       // sub_1DB3C70
            varInfo.addKill(newCOPY, flags)        // sub_1DB4410
            // Update block-level liveness
            varInfo.addNewBlock(MBB, position)     // sub_1DB8610

        // === OPTIMIZATION REMARK ===
        if ORE and commutationWasAttempted:        // v384 flag
            sub_1DCC790(ORE, srcReg)               // lookup remark data
            sub_1F4C640(remarkData)                 // filter remark
            sub_1DCBB50(ORE)                        // push
            if newCOPY.isInsideBundle():
                walk to bundle head
            if sub_1E1AFE0(bundleHead):
                sub_1DCC370(ORE, remarkNode)       // append to list

        // === REGISTER CLASS TIGHTENING ===
        // sub_1E69410(SubtargetInfo, dstReg, regClass, 0)
        // constrainRegClass on the destination register to the intersection
        // of the current class and the class required by the tied operand

INSERT_SUBREG Rewrite (lines 2386--2396)

After all tied pairs are processed for an INSERT_SUBREG instruction (opcode 8), the pass converts it into a plain COPY:

if MI.getOpcode() == INSERT_SUBREG:
    // Propagate subreg encoding from operand[3] into operand[0]
    subregBits = MI.getOperand(3).getSubRegIdx()
    MI.getOperand(0).setSubReg(subregBits)
    // Copy tie flag from operand[1] into operand[0]
    MI.getOperand(0).setTied(MI.getOperand(1).isTied())
    // Remove operands 3 and 1 (in reverse order to preserve indices)
    MI.RemoveOperand(3)              // sub_1E16C90(MI, 3)
    MI.RemoveOperand(1)              // sub_1E16C90(MI, 1)
    // Rewrite opcode descriptor to COPY
    MI.setDesc(TII.get(COPY))        // descriptor at TII + 960

Copy-Equivalence Classes

The pass builds two maps (SrcEqClassMap at offset +552, DstEqClassMap at +584) that track transitive copy chains. When it encounters COPY, REG_SEQUENCE, or INSERT_SUBREG instructions, it records the source-to-destination register mapping. The helper collectRegCopies (sub_1F4E620, 357 lines) walks use-def chains to build transitivity: if A -> B -> C via COPYs, then A maps directly to C. These maps are consumed by the downstream RegisterCoalescer to improve copy elimination.

The collectRegCopies algorithm:

collectRegCopies(startReg):
    chain = SmallVector()
    reg = startReg

    while true:
        if not MRI.hasOneUse(reg):        // sub_1E69E00
            break
        defMI = MRI.getVRegDef(reg)
        if defMI.getOpcode() not in {COPY, REG_SEQUENCE, INSERT_SUBREG}:
            break
        nextReg = defMI.getOperand(1).getReg()
        chain.push(reg)
        reg = nextReg

    // Process chain in reverse: build transitivity
    for i in range(len(chain) - 2, -1, -1):
        SrcEqClassMap.insert(chain[i], chain[i+1])    // sub_1F4E3A0
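A compact Python model of the chain walk (a sketch: def-chains are plain dicts here, and the result maps every chained register directly to the chain root, as the transitivity text describes):

```python
def collect_reg_copies(start_reg, def_of, single_use):
    """Model of sub_1F4E620: walk the unique-use def chain through
    COPY-like instructions, then resolve every register in the chain
    to its ultimate source."""
    COPY_LIKE = {"COPY", "REG_SEQUENCE", "INSERT_SUBREG"}
    chain, reg = [], start_reg
    while single_use.get(reg) and def_of.get(reg, (None, None))[0] in COPY_LIKE:
        chain.append(reg)
        reg = def_of[reg][1]          # source operand of the defining copy
    return {r: reg for r in chain}    # A -> B -> C collapses to A -> C

# %1 = COPY %2 ; %2 = COPY %3 : both %1 and %2 resolve to %3
defs = {1: ("COPY", 2), 2: ("COPY", 3)}
uses = {1: True, 2: True}
assert collect_reg_copies(1, defs, uses) == {1: 3, 2: 3}
```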

Data Structures

TiedOperandMap (stack-allocated SmallDenseMap<unsigned, SmallVector<pair<unsigned,unsigned>, 4>> with 4 inline entries):

| Offset in entry | Type | Field |
|---|---|---|
| +0 | int32 | Key (virtual register number; -1 = empty, -2 = tombstone) |
| +8 | ptr | Pair list pointer (points to +24 for inline storage) |
| +16 | int32 | Pair list size |
| +20 | int32 | Pair list capacity |
| +24 | int64[4] | Inline pair storage (each qword packs srcIdx and dstIdx) |

Entry stride: 56 bytes. Hash function: 37 * key, linear probing, load factor 3/4. Total inline size: 224 bytes on stack.
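The hashing scheme is simple enough to model directly (a Python sketch of the probing order, not the inlined binary code):

```python
def probe_sequence(key, n_buckets, max_probes=8):
    """Linear-probe bucket order for the TiedOperandMap model:
    hash = 37 * key, power-of-two table size, wrap-around probing."""
    assert n_buckets & (n_buckets - 1) == 0   # power-of-two sizing
    h = (37 * key) & (n_buckets - 1)
    return [(h + i) & (n_buckets - 1) for i in range(max_probes)]

# 37 * 5 = 185, 185 mod 16 = 9, then linear probing: 9, 10, 11, ...
assert probe_sequence(5, 16, 3) == [9, 10, 11]
```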

DistanceMap (DenseMap<MachineInstr*, unsigned> at pass object offsets +312..+336): maps each MI to its sequential position within the current block. Hash: (ptr >> 4) ^ (ptr >> 9). Used by tryInstructionTransform and processTiedPairs for rescheduling decisions and commutation profitability evaluation.

Pass Object Layout (selected fields):

| Offset | Type | Field |
|---|---|---|
| +232 | MachineFunction* | Current function |
| +240 | MachineRegisterInfo* | MRI |
| +248 | TargetInstrInfo* | TII |
| +256 | TargetRegisterInfo* | TRI |
| +264 | ptr | InstrItineraryData* or TargetSubtargetInfo* |
| +272 | OptimizationRemarkEmitter* | ORE (NVIDIA addition) |
| +280 | LiveVariables* | LV |
| +288 | LiveIntervals* | LIS (via SlotIndexes at +160) |
| +296 | int | Effective optimization level |
| +304 | MachineBasicBlock* | Current MBB |
| +312..+336 | DenseMap | DistanceMap |
| +344..+376 | SmallPtrSet | Processed set |
| +448..+476 | SmallPtrSet | Second set (reprocessing) |
| +552..+576 | DenseMap | SrcEqClassMap |
| +584..+608 | DenseMap | DstEqClassMap |

Tied Operand Scanning (Lines 1183--1413)

The collectTiedOperands logic iterates all operands of an instruction checking for tied constraints. The inner loop (at STEP 7 in the raw analysis) contains a special-case direct resolution path:

for opIdx in 0..numOps-1:
    // Skip defs, already-tied, and operands with no subreg class
    if operand.isDef():           continue     // byte +0 != 0
    if operand.isTied():          continue     // bit 4 of byte +3
    if operand.subregClass == 0:  continue     // bits 4-11 of word +2

    tiedIdx = MI.findTiedOperandIdx(opIdx)     // sub_1E16AB0
    srcReg = operand[opIdx].getReg()
    dstReg = operand[tiedIdx].getReg()

    if srcReg == dstReg:
        continue    // already satisfied

    // SPECIAL CASE: direct resolution without COPY
    if operand.isTied(secondary) AND def.subregClass == 0:
        if dstReg < 0:    // physical register
            regClass = sub_1F3AD60(MRI, instrDesc, opIdx, TII, MF)
            if regClass:
                MRI.constrainRegClass(dstReg, regClass)   // sub_1E69410
        operand.setReg(dstReg)                     // sub_1E310D0
        operand.clearSubregBits()                  // *operand &= 0xFFF000FF
        // Constraint resolved: use now points to same reg as def
        continue

    // NORMAL: add to TiedOperandMap
    TiedOperandMap[srcReg].push({opIdx, tiedIdx})  // packed as qword

The special-case path at the isTied(secondary) check (bit 0 of byte +4) handles the case where the operand carries a secondary tie flag from instruction emission and the def side has no subreg class constraint. In this case the pass can directly rewrite the use register to match the def without inserting a COPY, and clears the subreg bits with the mask 0xFFF000FF.
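The flag rewrite is a single AND; a Python model of the 32-bit arithmetic on the operand's first dword:

```python
def clear_subreg_bits(flags: int) -> int:
    # Mask 0xFFF000FF zeroes bits 8-19 (the subreg fields), leaving the
    # low flag byte and the top 12 bits untouched.
    return flags & 0xFFF000FF

f = 0xABC12345
assert clear_subreg_bits(f) == 0xABC00045
assert (clear_subreg_bits(f) >> 8) & 0xFFF == 0   # subreg index now 0
```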

NVIDIA Modifications

The pass is structurally stock LLVM -- the libNVVM build at sub_F4EA80 is structurally identical, confirming shared source. The NVIDIA delta consists of four additions:

  1. Extended EXTRACT_SUBREG handling (lines 821--994 of the decompilation). Standard LLVM handles single EXTRACT_SUBREG; the NVPTX version handles multi-result instructions with multiple extract chains via stride-2 operand iteration. This is required for texture/surface loads returning v4f32, wmma/mma producing multi-register fragments, and similar multi-result NVPTX intrinsics. The earlyTied optimization (checking bits 4 and 6 of operand flags byte +3) is unique to this extension and provides direct coalescing hints for contiguous sub-register sequences.

  2. Deeper LiveVariables maintenance (lines 1791--2064). When a COPY is inserted, the pass creates new VarInfo entries (sub_1DBA290), initializes them (sub_1DBB110), updates kill info (sub_1DB3C70 / sub_1DB4410), and maintains block-level liveness (sub_1DB8610). This six-function chain executes per COPY, not per instruction. For a 16-fragment wgmma result, this produces 96 function calls for liveness maintenance alone.

  3. OptimizationRemarkEmitter integration (lines 2207--2258). The pass reports cases where tied-operand constraints forced extra COPY insertions, providing performance diagnostic information. This is absent in upstream LLVM's TwoAddress pass. The ORE pointer is stored at pass object offset +272 and acquired via analysis lookup of unk_4FC4534. The five-function chain (sub_1DCCCA0 through sub_1DCC370) handles remark creation, filtering, and bundle-aware emission.

  4. optnone/fast-compile gate (sub_1636880). When the function has optnone or when NVIDIA's fast-compile mode is active, the effective optimization level is forced to 0. This disables commutation, 3-address conversion, and rescheduling attempts in tryInstructionTransform (which returns false immediately when OptLevel == None), making the pass a pure COPY-insertion pass with no optimization.

Knobs

| Knob | Default | Effect |
|---|---|---|
| twoaddr-reschedule | true | Enable/disable instruction rescheduling to coalesce copies. When true, the pass attempts to move instructions up or down within the block to avoid needing a COPY. |
| dataflow-edge-limit | 3 | Maximum number of dataflow edges to traverse when evaluating the profitability of commuting operands in isProfitableToCommute(). Higher values allow deeper analysis at compile-time cost. |

Both knobs are registered in constructor ctor_337 (found in the sweep at 0x4F0000--0x51FFFF). They are standard upstream LLVM options with no NVIDIA-specific modifications to their defaults.

The optnone/fast-compile gate is not a knob per se but has the effect of disabling all optimization paths in the pass, equivalent to setting both knobs to their most conservative values.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| Pass registration (name + ID) | sub_1F4D900 | small | Sets "Two-Address instruction pass" and "twoaddressinstruction" |
| Constructor | sub_1F4D9F0 | small | |
| Helper: rescheduleMIBelowKill support | sub_1F4CC10 | -- | Called by sub_1F4EF20 |
| Helper: rescheduleKillAboveMI support | sub_1F4D060 | -- | Called by sub_1F4EF20 |
| SmallPtrSet::contains(MI*) | sub_1F4DD40 | 67 lines | Processed set membership check |
| SmallDenseMap::clear() | sub_1F4DE20 | 180 lines | TiedOperandMap cleanup, frees heap-allocated pair lists |
| DenseMap<int,int>::insert | sub_1F4E3A0 | 166 lines | EqClassMap insertion, hash = 37 * key |
| collectRegCopies | sub_1F4E620 | 357 lines | Walks COPY chains to build transitive equivalence classes |
| DenseMap<ptr,int>::insert | sub_1F4EC70 | 164 lines | DistanceMap insertion, hash = (ptr>>4) ^ (ptr>>9) |
| tryInstructionTransform | sub_1F4EF20 | 28KB / 1,127 lines | Core tied-operand rewriter: commutation, 3-addr, COPY. Recursive (22 xrefs). |
| processTiedPairs | sub_1F50270 | 63KB / 2,209 lines | Full pipeline: commute, convert, COPY insertion, LV/LI update |
| SmallDenseMap::grow | sub_1F53020 | 312 lines | TiedOperandMap rehash, 56-byte entry stride |
| runOnMachineFunction | sub_1F53550 | 79KB / 2,470 lines | Pass entry point |
| Helper: find matching superclass | sub_1F3AD60 | -- | Finds register class for tied physical reg constraints |
| Helper: implicit tied operands | sub_1F4C460 | -- | Checks if MI has implicit tied operand pairs |
| Helper: filter/emit remark | sub_1F4C640 | -- | ORE filtering for copy-insertion diagnostics |
| LiveVariables::createNewVarInfo | sub_1DBA290 | -- | Allocates VarInfo for new register |
| LiveVariables::initVarInfo | sub_1DBB110 | -- | Initializes kill/def lists and alive bitvector |
| VarInfo::findKill | sub_1DB3C70 | -- | Scans block for register kill point |
| VarInfo::addKill / removeKill | sub_1DB4410 | -- | Updates kill tracking |
| VarInfo::addNewBlock | sub_1DB8610 | -- | Updates block-level liveness bitvectors |
| LiveVariables::HandlePhysRegDef | sub_1DBF6C0 | -- | Transfer liveness from old MI to new COPY |
| ORE::emit (copy remark) | sub_1DCCCA0 | -- | Emits optimization remark for COPY insertion |
| ORE::lookup | sub_1DCC790 | -- | Looks up remark data for register |
| ORE::push | sub_1DCBB50 | -- | Pushes remark to output |
| ORE::appendToList | sub_1DCC370 | -- | Appends remark (bundle-aware) |
| MachineFunction::verify | sub_1E926D0 | -- | Called with "After two-address instruction pass" |
| isOptNone / fast-compile check | sub_1636880 | -- | Forces OptLevel = 0 when active |

Binary Size Note

The 79KB runOnMachineFunction plus 63KB processTiedPairs plus 28KB tryInstructionTransform total approximately 170KB of machine code. Upstream LLVM source for the entire pass is approximately 2,000 lines of C++. The binary bloat is almost entirely explained by aggressive inlining: every DenseMap::insert, DenseMap::find, DenseMap::clear, SmallPtrSet::insert, and SmallPtrSet::find operation is fully expanded inline with all template specialization, sentinel initialization, grow/rehash, and power-of-2 computation logic. This inlined container code accounts for roughly 40% of the pass's machine-code footprint. The remaining expansion comes from the COPY-creation path (operand setup, flag manipulation, list splicing) being duplicated for each opcode-specific branch rather than factored into a shared helper.

Differences from Upstream LLVM

| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Primary purpose | Convert 3-address to 2-address form for physical register constraints (x86 tied operands) | Largely a formality on NVPTX (PTX is 3-address); primary role is eliminating REG_SEQUENCE/INSERT_SUBREG and building copy-equivalence maps |
| EXTRACT_SUBREG handling | Standard sub-register extraction for CPU multi-result instructions | Extended decomposition for multi-register NVPTX results: texture loads, tensor core operations (WMMA/MMA), and warp-level collectives |
| LiveVariables maintenance | Standard liveness tracking | Deeper LiveVariables maintenance with explicit VarInfo allocation/init (sub_1DBA290/sub_1DBB110) for new registers created during decomposition |
| ORE integration | Basic or absent remark emission for copies | Full OptimizationRemarkEmitter integration for COPY insertion diagnostics (sub_1DCCCA0/sub_1DCC790/sub_1DCBB50) |
| Binary size | ~2,000 lines of C++ source | 170 KB of machine code (79 KB runOnMachineFunction + 63 KB processTiedPairs + 28 KB tryInstructionTransform); bloat from aggressive DenseMap inlining |
| optnone/fast-compile gate | Standard OptLevel check | NVIDIA optnone / fast-compile check (sub_1636880) forces OptLevel = 0 for fast-compile kernels |

Cross-References

Instruction Scheduling

Prerequisites: Familiarity with Register Allocation, NVPTX register classes, and the codegen pipeline. Understanding of the GPU execution model (warp scheduling, latency hiding) is essential.

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/CodeGen/MachineScheduler.cpp (ScheduleDAGMILive), llvm/lib/CodeGen/MachinePipeliner.cpp (Swing Modulo Scheduler) (LLVM 20.0.0). The MRPA incremental pressure tracker and Texture Group Merge pass are NVIDIA-only with no upstream equivalent.

CICC v13.0 implements three distinct scheduling subsystems: MRPA (Machine Register Pressure Analysis) for incremental pressure tracking during MCSE, a Swing Modulo Scheduling pipeliner for loop bodies, and ScheduleDAGMILive for post-RA instruction ordering. All three maintain per-register-class pressure arrays but differ in granularity and update frequency. A texture group merge pass (sub_2DDE8C0) acts as a scheduling-adjacent optimization that groups texture load instructions for hardware coalescing.

| Subsystem | Entry point(s) |
|---|---|
| MRPA incremental tracker | sub_2E5A4E0 (primary), sub_1E00370 (backend variant) |
| MachinePipeliner (SMS) | sub_3563190 |
| ScheduleDAGMILive | sub_355F610 |
| Instruction selection heuristic | sub_3557A10 |
| Texture group merge | sub_2DDE8C0 |
| Scheduling mode switch | sub_21668D0 (post-RA), sub_2165850 (pre-RA) |

MRPA: Incremental Register Pressure Tracking

MRPA (Machine Register Pressure Analysis) provides incremental register pressure tracking for the Machine Common Subexpression Elimination (MCSE) pass. Rather than recomputing pressure from scratch after each instruction move or elimination, MRPA applies delta updates to maintain a running pressure state.

The primary implementation lives at sub_2E5A4E0 (48KB), with a backend variant at sub_1E00370 (78KB). Both use DenseMap hash tables for per-instruction pressure data with the hash function (ptr >> 9) ^ (ptr >> 4), empty sentinel -8, tombstone sentinel -16, minimum 64 buckets, and power-of-two sizing. The sub_1E00370 backend variant calls the pressure computation core at sub_1DF7390 (8 call sites) and sub_1DFB9D0 (6 call sites), plus pressure set queries via sub_1E1C690 / sub_1E15D60.
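The recovered DenseMap parameters can be sketched as a tiny open-addressing table. This is illustrative Python under the stated assumptions (hash `(ptr >> 9) ^ (ptr >> 4)`, empty sentinel -8, tombstone -16, power-of-two sizing with a 64-bucket minimum), not recovered code; grow/rehash and tombstone reuse are omitted for brevity.

```python
# Illustrative open-addressing map with the parameters observed in CICC's
# MRPA DenseMaps. Names are ours; grow/rehash is omitted.
EMPTY, TOMBSTONE = -8, -16

class PressureMap:
    def __init__(self, min_buckets=64):
        self.keys = [EMPTY] * min_buckets   # power-of-two bucket count
        self.vals = [None] * min_buckets

    def _hash(self, ptr):
        return (ptr >> 9) ^ (ptr >> 4)      # the shared pointer hash

    def _probe(self, ptr):
        mask = len(self.keys) - 1           # power-of-2 size -> cheap modulo
        i = self._hash(ptr) & mask
        while self.keys[i] not in (EMPTY, ptr):
            i = (i + 1) & mask              # probe past tombstones/other keys
        return i

    def insert(self, ptr, value):
        i = self._probe(ptr)
        self.keys[i], self.vals[i] = ptr, value

    def find(self, ptr):
        i = self._probe(ptr)
        return self.vals[i] if self.keys[i] == ptr else None

    def erase(self, ptr):
        i = self._probe(ptr)
        if self.keys[i] == ptr:
            self.keys[i] = TOMBSTONE        # keep later probe chains intact
```

The tombstone sentinel is what makes erase cheap: the slot is marked dead rather than emptied, so probe chains that pass through it still find their keys.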

The MRPA pressure cluster spans the address range 0x1DF0000--0x1E0FFFF:

| Function | Role |
|---|---|
| sub_1DF3D00 | Scheduler support (lowest address) |
| sub_1DF4120 | Scheduler support |
| sub_1DF4FB0 | Scheduler support |
| sub_1DF5810 | Machine function pass (pressure-aware scheduling) |
| sub_1DF7390 | Pressure computation core (called 8x from sub_1E00370) |
| sub_1DF76E0 | Register liveness query |
| sub_1DF7A80 | Code motion feasibility check |
| sub_1DF81C0 | Pressure computation core |
| sub_1DF9E90 | Schedule optimization pass |
| sub_1DFB810 | DenseMap (64-bit value variant) |
| sub_1DFB9D0 | DenseMap (32-bit value variant, called 6x) |
| sub_1E00370 | MRPA entry -- backend variant |

Incremental Update Flow

The incremental update is the core algorithm. Rather than performing a full O(n) pressure recomputation after every MCSE transform, it maintains a running pressure state through delta operations. The pseudocode below is reconstructed from sub_2E5A4E0:

function mrpa_incremental_update(context, basicBlock):
    // Phase 1: Build worklist via DFS
    visited = DenseSet()                        // v292--v295
    worklist = []
    dfs_push(worklist, basicBlock, visited)     // standard DFS seed

    while worklist is not empty:
        bb = worklist.pop()

        // Phase 2: Create instruction tracking entries
        tracking = context.densemap[+80..+104]      // DenseMap at context offsets
        for each instr in bb.instructions:
            tracking.insert(instr, PressureEntry{})

            // Phase 3: Filter schedulable instructions
            if not sub_2E501D0(instr):              // schedulability predicate
                continue

            // Phase 4: Scan operands (40-byte entries)
            for i in range(instr.num_operands):     // iterated at v69/v70
                operand = instr.operand[i]          // 40-byte stride
                if not operand.isVirtualRegister():
                    continue

                // Phase 5: Virtual register operand processing
                old_reg = sub_2EBEF70(operand)      // find existing rename mapping
                reg_info = sub_2EBEE10(operand)     // query register class, constraints
                new_reg = sub_2EBE820(operand)      // attempt rename if profitable
                if new_reg != old_reg:
                    sub_2EBF120(old_reg)            // free old register

                // Phase 6: Register class constraint validation
                sub_reg_list = sub_E922F0(reg_info) // sub-register list for class
                for each sub_reg in sub_reg_list:
                    validate_class_constraint(sub_reg, context.class_limits)

            // Phase 7: Pressure feasibility check
            bb_pressure = context.per_bb_data[bb]   // at v279[36]
            if not sub_2E4F9C0(bb_pressure):        // exceeds class limits?
                // Rename was unprofitable -- roll back
                context.rename_count_fail++         // *((_DWORD*)v254 + 17)
                sub_2E88E20(instr)                  // erase unprofitable instruction
            else:
                context.rename_count_success++      // *((_DWORD*)v254 + 16)

The key insight is that steps 5--7 form a speculative rename-then-validate loop: MRPA tentatively renames a virtual register, checks whether the rename reduces pressure below the class limit, and rolls back if it does not. The rename counts at *((_DWORD*)v254 + 16) (success) and *((_DWORD*)v254 + 17) (failure) provide a diagnostic ratio of how often speculative renames succeed.

Register Liveness Queries

Register liveness (sub_1DF76E0) checks whether a register is live in an instruction range [a3, a4] using _bittest on register class bitmaps. A compressed alias table at context offset +240 stores sub-register overlap information in 24-byte entries containing alias counts and alias data offsets.

The alias table structure:

| Offset | Size | Content |
|---|---|---|
| +0 | 8 | Sub-table pointer |
| +8 | 56 | Alias data block |
| +8..+10 (per entry) | 2 | Alias count (uint16) |
| +10.. | variable | Alias register IDs (2 bytes each) |

Sub-register overlap is resolved through an incremental alias walk: for each register in the query range, the alias table is consulted to expand the register into its physical sub-registers, and each sub-register is tested against the liveness bitmap.
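The alias-table layout above can be sketched as a hypothetical encoder/decoder: 24-byte entries with a uint16 alias count at entry offset +8 followed by 2-byte alias register IDs, plus the liveness test over expanded sub-registers. Field names and the packing of the first 8 bytes are our assumptions, not recovered structure.

```python
import struct

# Hypothetical model of the 24-byte alias-table entries described above.
ENTRY_SIZE = 24

def pack_entry(reg_id, aliases):
    assert len(aliases) <= (ENTRY_SIZE - 10) // 2   # 7 alias IDs fit inline
    blob = struct.pack("<Q", reg_id)                 # +0: key slot (assumed)
    blob += struct.pack("<H", len(aliases))          # +8: alias count (uint16)
    for a in aliases:
        blob += struct.pack("<H", a)                 # +10..: 2-byte alias IDs
    return blob.ljust(ENTRY_SIZE, b"\x00")

def expand_subregs(table, index):
    entry = table[index * ENTRY_SIZE:(index + 1) * ENTRY_SIZE]
    count = struct.unpack_from("<H", entry, 8)[0]
    return [struct.unpack_from("<H", entry, 10 + 2 * i)[0] for i in range(count)]

def is_live(table, index, live_bitmap):
    # Incremental alias walk: live if any overlapping physical sub-register
    # is set in the liveness bitmap.
    return any(live_bitmap >> sub & 1 for sub in expand_subregs(table, index))
```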

Code Motion Feasibility

Code motion feasibility (sub_1DF7A80) validates whether an instruction can be moved between basic blocks:

  1. Check single-predecessor relationship between source and destination BBs.
  2. Validate against the allocation bitmask at allocator offset +38.
  3. Walk an instruction window bounded by offset +296 (configurable window size).
  4. Count conflicting operands within the window.
  5. Track affected registers in an rb-tree set (offsets 56--88) with node structure [left(16), right(24), value(32)].

An instruction is movable only if the conflicting operand count within the window is zero and the allocation bitmask permits the move.
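The five-step feasibility walk can be condensed into a sketch. This is a minimal model under our own assumptions (the bitmask is treated as one permit bit per register; def/use sets stand in for operand lists), not the recovered check.

```python
# Illustrative code-motion feasibility check: single-predecessor test,
# allocation-bitmask gate, and a bounded-window conflict count.
from dataclasses import dataclass, field

@dataclass
class Instr:
    defs: set = field(default_factory=set)
    uses: set = field(default_factory=set)

def can_hoist(instr, src_bb, dst_bb, preds, alloc_mask, reg_id, window):
    if preds.get(src_bb) != [dst_bb]:         # 1. single-predecessor relationship
        return False
    if not (alloc_mask >> reg_id) & 1:        # 2. allocation bitmask permits move
        return False
    conflicts = 0
    for other in window:                      # 3./4. bounded instruction window
        if instr.defs & (other.defs | other.uses) or instr.uses & other.defs:
            conflicts += 1
    return conflicts == 0                     # 5. movable only with zero conflicts
```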

MRPA Verification

A debug-only verification path checks incremental update correctness against full recomputation. The trigger path in sub_2E5A4E0 (decompiled lines 1702--1708):

if ( *(_BYTE *)(v7 + 40)               // [1] context enable flag -- always ON during MCSE
  && (_BYTE)qword_501F8A8              // [2] verify-update-mcse -- user must enable
  && (_BYTE)qword_501F988              // [3] incremental-update-mcse -- default ON
  && !sub_2E59B70(                     // [4] full recomputation DISAGREES
       *(_QWORD*)(v7+48),
       qword_501F7C8, ...) )
{
  sub_C64ED0("Incorrect RP info from incremental MRPA update\n", 1u);
}

All four conditions must hold simultaneously:

  1. Context enable flag (v7 + 40) is set -- always true during MCSE.
  2. verify-update-mcse is ON -- user must explicitly enable this debug knob.
  3. incremental-update-mcse is ON -- default is ON.
  4. sub_2E59B70 returns false -- full recomputation disagrees with the incremental state.

When all conditions hold, the error "Incorrect RP info from incremental MRPA update" fires via sub_C64ED0 (LLVM's report_fatal_error). The print-verify knob controls whether detailed per-register-class mismatch data is printed.
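The verification idea is simple to model: keep pressure current with O(1) delta updates, and let a debug knob trigger a full O(n) rescan that must agree. The structures below are illustrative, not the recovered ones; only the error string matches the binary.

```python
# Sketch of the verify-update-mcse check: incremental deltas vs. full rescan.
def full_recompute(instrs, num_classes):
    pressure = [0] * num_classes
    for regclass, delta in instrs:
        pressure[regclass] += delta
    return pressure

class IncrementalRP:
    def __init__(self, num_classes):
        self.pressure = [0] * num_classes
        self.log = []                        # instruction stream seen so far

    def apply(self, regclass, delta):        # O(1) per move/elimination
        self.pressure[regclass] += delta
        self.log.append((regclass, delta))

    def verify(self):                        # the expensive debug-only rescan
        if self.pressure != full_recompute(self.log, len(self.pressure)):
            raise RuntimeError("Incorrect RP info from incremental MRPA update")
```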

The backend variant (sub_1E00370, decompiled lines 2416--2420) uses byte_4FC6020 as its guard flag, calls sub_1DFF720 for verification, and falls back to byte_4FC62C0 (a cached result) if verification is disabled.

| Knob | Default | Description |
|---|---|---|
| incremental-update-mcse | true | Incrementally update register pressure analysis |
| verify-update-mcse | false | Verify incremental update by full RP analysis |
| print-verify | false | Print problematic RP info if verification failed |

To trigger verification: cicc -Xcuda -verify-update-mcse input.cu. NVIDIA keeps this check off by default since the full rescan is O(n) and expensive.

MachinePipeliner: Swing Modulo Scheduling

Complexity. Let N = number of instructions in the loop body and E = number of dependency edges in the DDG.

  • DDG construction: O(N + E).
  • RecMII (computeRecMII): finds the maximum cycle ratio by enumerating elementary circuits in the DDG -- worst-case exponential, but bounded in practice by small loop sizes (N < 100) and sparse dependency graphs.
  • ResMII: O(N) (sum of resource vectors).
  • ASAP/ALAP: O(N + E) each (topological traversals).
  • II search: probes at most pipeliner-ii-search-range (default 10) candidate IIs; for each II, node placement is O(N * II) -- each of N nodes probes up to II cycle slots.

The total scheduling cost is O((N + E) + R * N * II_max), where R = search range. The pipeliner-max-stages (default 3) and pipeliner-max-mii (default 27) knobs provide additional constant-factor bounds. For MRPA, the incremental pressure update is O(1) per instruction move (delta update), compared to O(N) for a full recomputation -- the key efficiency gain over a naive approach.

The MachinePipeliner (sub_3563190, ~2030 decompiled lines, ~58KB) implements Swing Modulo Scheduling (SMS) for software pipelining of loop bodies. It overlaps iterations of a loop body to improve throughput on pipelined hardware by interleaving instructions from different iterations. The upstream LLVM equivalent is SwingSchedulerDAG::schedule().

Pass discovery: the pipeliner walks an analysis array at this+3456 (offset 3456) looking for vtable unk_4F86530 (the MachinePipeliner analysis pass), then extracts the SwingSchedulerDAG context at offset +176.

Phase 1: Initialization and DDG Construction

The setup chain builds the data dependence graph and computes MII lower bounds:

| Step | Function | Description |
|---|---|---|
| 1 | sub_2F97F60 | initializeDAG -- build data dependence graph (DDG) over the single-BB loop body |
| 2 | sub_3559990 | computeNodeLatencies -- fill latency fields per SUnit from the target scheduling model |
| 3 | sub_3542B20 | addDependencies -- add register/memory/order dependency edges to the DDG |
| 4 | sub_2F90200 | updateRegPressure -- compute initial register pressure state for the loop body |
| 5 | sub_354CBB0 | computeRecMII -- find the maximum cycle length of any recurrence in the DDG |
| 6 | sub_35449F0 | computeResMII -- compute ceil(total_resource_usage / functional_unit_count) |

The context object SwingSchedulerDAG occupies approximately 4100 bytes:

| Offset | Field |
|---|---|
| +32 | MachineFunction* |
| +48..56 | BB range (iterated at 256-byte stride) |
| +944 | DenseMap: pre-existing ordering constraints |
| +3456 | Analysis pass vector |
| +3472 | MII (int32) |
| +3480 | schedulingSucceeded (bool) |
| +3488 | DiagnosticsEngine / remark context |
| +3520 | TargetSubtargetInfo* |
| +3944..3952 | DDG node storage (vector) |
| +4016..4072 | Recurrence DenseMap (24-byte entries) |

MII computation combines two lower bounds:

  • RecMII (Recurrence MII): the longest cycle in the DDG, computed by sub_354CBB0. Each recurrence (loop-carried dependency cycle) constrains the minimum II because the cycle must fit within one iteration interval. If pipeliner-ignore-recmii is set, RecMII is forced to zero so only resource constraints matter.
  • ResMII (Resource MII): ceil(sum of resource usage across all instructions / number of available functional units), computed by sub_35449F0. This reflects the throughput bottleneck of the hardware.
function compute_MII():
    recMII = sub_354CBB0()           // max recurrence length
    resMII = sub_35449F0()           // resource throughput limit
    if pipeliner-ignore-recmii:      // qword_503E888
        recMII = 0
    MII = sub_3542AB0(resMII, recMII)  // max(resMII, recMII)
    sub_3542AE0()                    // store MII at this+3472
    return MII
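The two lower bounds can be worked through concretely. The sketch below follows the definitions above -- RecMII as the longest dependency cycle, ResMII as ceil(total resource usage / functional units) -- on a made-up two-node recurrence; the graph, latencies, and resource numbers are illustrative only.

```python
# Worked example of MII = max(RecMII, ResMII) on a tiny DDG.
from math import ceil

def rec_mii(edges, latency, nodes):
    # Tiny DFS cycle enumeration -- adequate for small loop bodies (N < 100).
    best = 0
    def dfs(start, node, seen, length):
        nonlocal best
        for succ in edges.get(node, []):
            l = length + latency[(node, succ)]
            if succ == start:
                best = max(best, l)          # closed a recurrence cycle
            elif succ not in seen:
                dfs(start, succ, seen | {succ}, l)
    for n in nodes:
        dfs(n, n, {n}, 0)
    return best

def res_mii(resource_usage, num_units):
    return ceil(sum(resource_usage) / num_units)

def compute_mii(edges, latency, nodes, usage, units, ignore_recmii=False):
    rec = 0 if ignore_recmii else rec_mii(edges, latency, nodes)  # pipeliner-ignore-recmii
    return max(rec, res_mii(usage, units))
```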

The II search algorithm starts at MII and probes upward:

function ii_search(MII):
    max_ii = MII + pipeliner-ii-search-range      // default: MII + 10
    if pipeliner-force-ii != 0:                    // qword_503EB80
        return try_schedule(pipeliner-force-ii)    // skip search entirely

    for II = MII to max_ii:
        // 1. Compute ASAP/ALAP at this II
        asap = compute_ASAP(DDG, II)               // sub_354BFF0 -> v369
        alap = compute_ALAP(DDG, II)               // sub_354BFF0 -> v373

        // 2. Place all nodes into II-wide modulo reservation table
        success = place_nodes(asap, alap, II)       // sub_354C3A0

        if not success:
            continue                                // try next II

        // 3. Compute stage count
        numStages = (lastCycle - firstCycle) / II   // (v84 - v80) / v88

        // 4. Validate stage count
        if numStages > pipeliner-max-stages:        // default 3
            continue

        // 5. Register pressure check (if enabled)
        if pipeliner-register-pressure:             // qword_503E2C0
            if not verify_pressure(II, pipeliner-register-pressure-margin):
                continue                            // sub_355C7C0

        return (II, schedule)

    return FAILURE   // "Unable to find schedule"

The pipeliner-force-ii knob (default 0) bypasses the search entirely and forces a specific II value. This is useful for testing or when the compiler team knows the optimal II for a specific loop shape.

Phase 3: ASAP/ALAP Computation

ASAP (As Soon As Possible) and ALAP (As Late As Possible) define the scheduling window for each instruction at a given II:

ASAP computation (sub_354BFF0, first invocation producing v369): traverses the DDG in topological order. For each node, ASAP = max over all predecessors of (predecessor.ASAP + edge.latency). The root nodes (no predecessors) have ASAP = 0. This gives the earliest cycle each instruction can execute without violating data dependencies.

ALAP computation (sub_354BFF0, second invocation producing v373): traverses the DDG in reverse topological order. For each node, ALAP = min over all successors of (successor.ALAP - edge.latency). Leaf nodes (no successors) have ALAP = II - 1 (or the schedule length bound). This gives the latest cycle an instruction can execute.

The scheduling window for instruction i is [ASAP(i), ALAP(i)]. Instructions with narrow windows (ASAP close to ALAP) are more constrained and are typically scheduled first by the node ordering heuristic.
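The two traversals above can be written out directly. A minimal sketch, assuming nodes arrive in topological order and using a toy three-node chain; not the recovered implementation.

```python
# ASAP/ALAP per the definitions above: ASAP from predecessors forward,
# ALAP from successors backward, scheduling window = [ASAP, ALAP].
def asap_alap(nodes, edges, latency, schedule_len):
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    asap = {}
    for n in nodes:                       # nodes assumed in topological order
        asap[n] = max((asap[p] + latency[(p, n)] for p in preds[n]), default=0)

    alap = {}
    for n in reversed(nodes):             # reverse topological order
        alap[n] = min((alap[s] - latency[(n, s)] for s in succs[n]),
                      default=schedule_len - 1)
    return asap, alap
```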

Phase 4: Node Placement

Node placement (sub_354C3A0) attempts to assign each instruction to a specific cycle in the modulo reservation table (MRT). The MRT has II columns (one per cycle in the initiation interval) and tracks resource usage per cycle.

The placement algorithm follows the Swing Modulo Scheduling strategy:

  1. Node ordering (sub_35630A0): nodes are prioritized by a combination of critical-path depth, recurrence membership, and scheduling freedom (ALAP - ASAP). Nodes in tight recurrences and on the critical path are placed first.
  2. Direction selection: for each node, the scheduler decides whether to place it "forward" (from ASAP toward ALAP) or "backward" (from ALAP toward ASAP) based on its dependency relationships. The "swing" refers to alternating direction between predecessor-constrained and successor-constrained nodes.
  3. Cycle probing: starting from the preferred direction, the scheduler tries each cycle in the node's [ASAP, ALAP] window. At each candidate cycle, it checks resource availability in the MRT (the cycle modulo II must have sufficient functional unit capacity) and verifies that all dependency constraints remain satisfied.
  4. Conflict resolution: if no cycle in the window is feasible, the placement fails for this II and the search continues with II+1.
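The probing loop above can be sketched as a minimal modulo reservation table. This is illustrative code in the spirit of sub_354C3A0, with a single uniform resource class and no swing direction selection -- a deliberate simplification of the recovered algorithm.

```python
# Minimal MRT placement: try each cycle in [ASAP, ALAP], checking
# functional-unit capacity at (cycle % II); fail -> caller tries II + 1.
def place_nodes(order, window, ii, units_per_cycle):
    mrt = [0] * ii                         # resource use per modulo cycle
    placement = {}
    for node in order:                     # nodes in SMS priority order
        lo, hi = window[node]
        for cycle in range(lo, hi + 1):    # probe the scheduling window
            if mrt[cycle % ii] < units_per_cycle:
                mrt[cycle % ii] += 1
                placement[node] = cycle
                break
        else:
            return None                    # infeasible at this II
    return placement
```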

Phase 5: Kernel Generation

After a valid schedule is found, the pipeliner builds the kernel, prolog, and epilog. The numStages value ((lastCycle - firstCycle) / II) determines how many iterations overlap.

function build_kernel(schedule, II, numStages):
    // Build instruction-to-stage and instruction-to-cycle DenseMaps
    instrToStage = DenseMap<SUnit*, int>()      // v317/v318/v319
    instrToCycle = DenseMap<SUnit*, int>()       // v320/v321/v322
    // DenseMap config: hash=(key>>9)^(key>>4), empty=-4096, tombstone=-8192

    for stage in range(numStages):
        for each SUnit in schedule.stage_bundle(stage):
            instrToStage[SUnit] = stage
            instrToCycle[SUnit] = SUnit.assigned_cycle

    // Cross-reference recurrence edges with stage assignments
    if this+4064 (recurrence count) != 0:
        for each recurrence_edge in this+4056:
            edge.stage = instrToStage[edge.instruction]
            // Build per-recurrence analysis DenseMap (24-byte entries)

    // Select codegen backend (priority order):
    if pipeliner-annotate-for-testing:          // testing mode: annotate only
        sub_359AD80(schedule)
        return
    if pipeliner-experimental-cg:               // peeling code generator
        if numStages == 0:
            sub_35A5710()                       // trivial kernel (no overlap)
        else:
            sub_35A93B0()                       // experimental peeling CG
            sub_3598EB0()                       // finalize prolog/epilog
        return
    if pipeliner-mve-cg:                        // MVE code generator (DEFAULT)
        if numStages == 0 and target_supports_mve():
            sub_35A7730()                       // MVE compatibility check
            sub_35A76E0()                       // MVE code generator
            return
        // else fall through to experimental CG
    // Default fallthrough: experimental CG path

The codegen backend priority is: (1) pipeliner-annotate-for-testing for test infrastructure, (2) pipeliner-experimental-cg for peeling-based generation, (3) pipeliner-mve-cg (default enabled) for the MVE (Modulo Variable Expansion) code generator. The MVE path is gated on numStages == 0 and a target callback at **(this+3520)+72 returning non-default (i.e., not sub_2FDC510).

The SBO (Small Buffer Optimization) pattern is used for nodeInfo arrays: v416 = v418 (inline buffer of 704 bytes = 8 nodes x 88 bytes). When the loop body exceeds 8 instructions, sub_35498F0 sorts and possibly heap-allocates.

Error Conditions

| Condition | Diagnostic | Severity |
|---|---|---|
| MII == 0 | "Invalid Minimal Initiation Interval: 0" | 0x15 (missed) |
| MII > pipeliner-max-mii | "Minimal Initiation Interval too large: MII > SwpMaxMii" | 0x15 (missed) |
| Scheduling failure | "Unable to find schedule" | 0x15 (missed) |
| numStages == 0 | "No need to pipeline - no overlapped iterations in schedule." | 0x15 (missed) |
| numStages > pipeliner-max-stages | "Too many stages in schedule: numStages > SwpMaxStages" | 0x15 (missed) |
| Success | "Pipelined succesfully!" [sic] | 0x13 (passed) |

The typo "succesfully" (single 's') is preserved from upstream LLVM.

Pipeliner Knobs

| Knob | Global | Default | Description |
|---|---|---|---|
| enable-pipeliner | unk_503EE20 | true | Master switch for SMS |
| enable-pipeliner-opt-size | qword_503ED40 | false | Enable SWP at -Os |
| pipeliner-max-mii | qword_503ECE8 | 27 | Maximum allowed MII |
| pipeliner-force-ii | qword_503EB80 | 0 | Force specific II (0 = auto) |
| pipeliner-max-stages | qword_503EB28 | 3 | Maximum pipeline stages |
| pipeliner-prune-deps | qword_503E9C0 | true | Prune deps between unrelated Phi nodes |
| pipeliner-prune-loop-carried | qword_503E8E0 | true | Prune loop-carried order deps |
| pipeliner-ignore-recmii | qword_503E888 | false | Ignore RecMII (hidden knob) |
| pipeliner-show-mask | qword_503E720 | false | Debug: show scheduling mask |
| pipeliner-dbg-res | qword_503E640 | false | Debug: resource usage |
| pipeliner-annotate-for-testing | qword_503E5E8 | false | Annotate instead of codegen |
| pipeliner-experimental-cg | qword_503E508 | false | Use peeling code generator |
| pipeliner-ii-search-range | qword_503E3A0 | 10 | Range to search for II |
| pipeliner-register-pressure | qword_503E2C0 | false | Consider register pressure |
| pipeliner-register-pressure-margin | qword_503E1E0 | 5 | Margin % for reg pressure limit |
| pipeliner-mve-cg | unk_503E100 | true | Use MVE code generator |
| pipeliner-enable-copytophi | qword_503E020 | true | Enable CopyToPhi DAG Mutation |
| pipeliner-force-issue-width | qword_503DF40 | 0 | Force issue width (0 = auto) |

All registered in ctor_676_0_0x5a3430.c.

MachinePipeliner Function Map

| Function | Identity |
|---|---|
| sub_3563190 | Top-level SMS orchestrator (SwingSchedulerDAG::schedule) |
| sub_2F97F60 | initializeDAG -- build DDG |
| sub_3559990 | computeNodeLatencies |
| sub_3542B20 | addDependencies -- register/memory/order edges |
| sub_2F90200 | updateRegPressure |
| sub_354CBB0 | computeRecMII |
| sub_35449F0 | computeResMII |
| sub_3542AB0 | setMII = max(ResMII, RecMII) |
| sub_3542AE0 | validateMII / store at +3472 |
| sub_3556270 | collectNodeInfo -- gather 88-byte per-node records |
| sub_35476E0 | initNodeOrder -- compute scheduling order |
| sub_35523F0 | computeSchedule -- build SUnit ordering |
| sub_35546F0 | orderDependences -- topological sort |
| sub_3543340 | computeStart -- ASAP/ALAP times |
| sub_35630A0 | normalizeSchedule -- adjust cycle numbering |
| sub_35568E0 | scheduleNodes -- core SMS placement |
| sub_35433F0 | adjustSchedule -- post-adjustment |
| sub_3557A10 | computeFinalSchedule -- finalize stage/cycle |
| sub_354A760 | buildStageMap -- iteration-to-stage mapping |
| sub_355F610 | schedule() -- II search loop (2351 lines) |
| sub_354BE50 | getScheduleForStage |
| sub_35498F0 | sortNodeInfo (for >8 nodes) |
| sub_359AD80 | annotateForTesting |
| sub_35A5710 | generateTrivialKernel |
| sub_35A93B0 | experimentalPeelingCG |
| sub_3598EB0 | finalizeExperimentalKernel |
| sub_35A76E0 | mveCG -- MVE code generator |
| sub_35A7730 | mveCompatCheck |

ScheduleDAGMILive: Post-RA Instruction Ordering

ScheduleDAGMILive (sub_355F610, 64KB) is the post-RA machine instruction scheduler. It takes the pipeliner's output (or standalone scheduling regions) and determines the final instruction order while respecting register pressure limits.

Data structures:

  • SUnit (Scheduling Unit): 88 bytes per instruction, consistent across both the pipeliner and ScheduleDAGMILive.
  • Instruction-to-node hash map: 632-byte entries per instruction. The unusually large entry size suggests extensive caching of per-instruction metadata (RP deltas, latency info, dependency edges) to avoid recomputation.
  • RP tracking structure: 112 bytes, with per-register-class pressure arrays at offsets 32--48 (current) and 56--72 (limits).

The scheduling flow:

  1. Initialize RP tracking via sub_3551AB0 (if pipeliner-register-pressure is set).
  2. Set per-class pressure defaults via sub_2F60A40.
  3. Walk BB instruction list, build instruction-to-node hash map.
  4. Compute ASAP (earliest cycle) via sub_354BFF0 -> v369.
  5. Compute ALAP (latest cycle) via sub_354BFF0 -> v373.
  6. Place instructions via sub_354C3A0 (returns success/failure).
  7. Calculate stage count: (lastCycle - firstCycle) / II = (v84 - v80) / v88.
  8. Verify placement via sub_355C7C0.
  9. Build stage descriptors via sub_355D7E0 (80 bytes per stage, 10 QWORDs each).

Instruction Selection Heuristic

The instruction selection heuristic (sub_3557A10, 47KB) determines which instruction to schedule next from the ready set. It implements a multi-level priority scheme operating on 88-byte SUnit entries:

Level 1 -- Latency/Depth priority (SUnit offset +240): instructions deeper in the dependency graph are scheduled first. Depth is measured as the longest path from the instruction to a sink node in the DDG. This ensures that critical-path instructions are placed early, preventing them from becoming bottlenecks. Latency recomputation occurs via sub_2F8F5D0 during priority comparison to account for any scheduling decisions already made.

Level 2 -- Target priority table (context a1+3944): a table of 16-byte entries, each containing:

| Offset | Size | Field |
|---|---|---|
| +0 | 4 | start -- first cycle of priority window |
| +4 | 4 | end -- last cycle of priority window |
| +8 | 4 | priority -- target-assigned priority value |
| +12 | 4 | window_width -- scheduling window size |

The target (NVPTX backend) populates this table to express hardware-specific ordering preferences -- for example, prioritizing memory operations that can be overlapped with computation, or ensuring that warp-synchronous instructions are scheduled in specific relative positions. Instructions that fall within a priority window with a higher priority value are selected first.

Level 3 -- Schedule window width: when levels 1 and 2 are tied, the instruction with the narrower scheduling window (ALAP - ASAP) is preferred. Narrower windows mean fewer legal placement options, so these instructions should be placed before more flexible ones to avoid creating conflicts.

The ready queue is managed by sub_3553D90. Pattern matching on ready instructions proceeds through sub_35540D0 (applicability check) and sub_35543E0 (pattern application), with validation via sub_3546B80. A hash table at a1+3976 maps instructions to schedule nodes for O(1) lookup during priority comparison.

function select_next_instruction(ready_set):
    best = null
    for each candidate in ready_set:
        if best is null:
            best = candidate
            continue

        // Level 1: depth comparison
        if candidate.depth > best.depth:        // offset +240
            best = candidate
            continue
        if candidate.depth < best.depth:
            continue

        // Level 2: target priority table
        cand_prio = lookup_target_priority(candidate, priority_table)
        best_prio = lookup_target_priority(best, priority_table)
        if cand_prio > best_prio:
            best = candidate
            continue
        if cand_prio < best_prio:
            continue

        // Level 3: prefer narrower window
        cand_width = candidate.ALAP - candidate.ASAP
        best_width = best.ALAP - best.ASAP
        if cand_width < best_width:
            best = candidate

    return best

Texture Group Merge

The Texture Group Merge pass (sub_2DDE8C0, 74KB, 2382 decompiled lines) groups texture load instructions that access related memory locations, enabling the hardware texture unit to coalesce them into fewer requests. This is an NVIDIA-specific pass not present in upstream LLVM.

Fibonacci Hashing

The pass uses Fibonacci hashing for candidate bucketing:

hash = (ptr * 0xBF58476D1CE4E5B9) >> shift

The scheme is the classic Fibonacci (multiplicative) hash: a single multiply by a well-chosen odd 64-bit constant spreads consecutive pointer keys nearly uniformly across the output range, and the top bits select the bucket. One correction to the folklore, though: 0xBF58476D1CE4E5B9 is not 2^64 / phi (that golden-ratio constant is 0x9E3779B97F4A7C15, where phi = (1 + sqrt(5)) / 2); it is best known as the first mixing multiplier from the splitmix64 finalizer, which serves the same bit-diffusion purpose. Golden-ratio relatives of this multiplier appear in the Linux kernel's hash_64() (GOLDEN_RATIO_64 in include/linux/hash.h) and in LLVM's hashing internals. The same 0xBF58476D1CE4E5B9 constant recurs elsewhere in CICC:

  • CICC's SCEV expression uniquing (sub_DC2B70)
  • CICC's OpenMP SPMD region hash table

The shift parameter controls how many high bits are retained, effectively determining the number of hash buckets as 2^(64 - shift). For a 1024-bucket table, shift would be 54.
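The bucketing scheme reduces to two lines of code. A sketch of the multiply-and-shift with the constant and shift described above:

```python
# Multiplicative bucketing: multiply by the 64-bit constant, keep the top
# (64 - shift) bits. shift = 54 yields 2**(64 - 54) = 1024 buckets.
MULT = 0xBF58476D1CE4E5B9
MASK64 = (1 << 64) - 1

def bucket(ptr, shift=54):
    return ((ptr * MULT) & MASK64) >> shift   # value in [0, 2**(64 - shift))
```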

Algorithm Detail

  1. Walk the BB instruction list.
  2. For each instruction, call sub_2DDC600 (candidate identification) to determine if it is a texture load eligible for merging.
  3. Hash the candidate's key pointer using Fibonacci hashing to assign it to a bucket.
  4. Insert the candidate into the group table.

Group table entries are 56 bytes (7 QWORDs):

| Offset | Size | Content |
|---|---|---|
| +0 | 8 | Key pointer (texture base address or descriptor) |
| +8 | 8 | Data pointer (to member array) |
| +16 | 4 | Member count |
| +20 | 4 | Member capacity |
| +24 | 32 | Reserved / padding |

Group members are 32 bytes each:

| Offset | Size | Content |
|---|---|---|
| +0 | 8 | MachineInstr* -- the texture load instruction |
| +8 | 8 | Symbol -- the texture symbol reference |
| +16 | 8 | Debug info -- source location |
| +24 | 8 | Scope info -- DWARF scope |

Generated group names carry a .Tgm (Texture Group Merge) suffix via sub_2241490. This suffix appears in debug output and internal symbol tables.

4-Callback Framework

The pass operates through a general instruction grouper framework (sub_3147BA0) that supports multiple types of instruction grouping through a common callback interface. Four callbacks are registered for texture group merge:

| # | Callback | Function | Purpose |
|---|---|---|---|
| 1 | Candidate identification | sub_2DDC600 | Examines each MachineInstr and returns true if it is a texture load eligible for grouping. Checks opcode, address space (texture memory), and operand constraints. |
| 2 | Group formation | sub_2DDBF40 | After candidates are identified and hashed into buckets, decides which candidates within a bucket should form a group. Checks address proximity, common base registers, and compatible access patterns. |
| 3 | Merge execution | sub_2DDB3F0 | Applies the actual merge transformation. Replaces individual texture loads with a single grouped load instruction, rewrites operands, and updates dependency edges. |
| 4 | Cleanup | sub_2DDB400 | Frees temporary data structures (group tables, member arrays, hash buckets) after merging is complete. |

Additional helper functions in the texture group merge:

| Function | Role |
|---|---|
| sub_2DDD850 | Node insertion into group table |
| sub_2DDDD70 | Resize/grow scheduling data |
| sub_2DDD530 | Scheduling iteration over groups |
| sub_2DDDAB0 | Node analysis (profitability check) |
| sub_2DDB710 | Data dependency edge creation |
| sub_2DDE490 | Grouping operation (merge groups) |
| sub_2DDBC50 | Constraint application |
| sub_2DDBBA0 | Constraint application (secondary) |
| sub_2DDBA80 | Finalize group (seal and emit) |

The grouper framework is designed to be reusable: by registering different callback tuples, the same framework can group surface loads, shared memory accesses, or other coalescing-friendly instruction patterns.
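The callback contract can be modeled abstractly: the framework owns the walk and bucketing, while a client supplies the (identify, form_group, merge, cleanup) tuple. The recovered callbacks are opaque binary functions; the dictionary-based instruction model and bucketing key below are our own assumptions.

```python
# Sketch of the 4-callback grouper interface described above.
def run_grouper(instrs, identify, form_group, merge, cleanup):
    buckets = {}
    for instr in instrs:                       # 1. candidate identification
        if identify(instr):
            buckets.setdefault(instr["base"], []).append(instr)
    merged = []
    for members in buckets.values():           # 2. group formation per bucket
        for group in form_group(members):
            if len(group) > 1:                 # singleton groups are not merged
                merged.append(merge(group))    # 3. merge execution
    cleanup(buckets)                           # 4. cleanup
    return merged
```

Registering a different callback tuple -- say, one whose identify predicate matches surface loads -- reuses the same walk, which is the reusability the framework is built for.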

Scheduling Mode: The usedessa Knob

The usedessa knob (dword_4FD26A0, default 2) controls the scheduling pass pipeline configuration despite its name suggesting deSSA (de-Static Single Assignment) method selection. Pre-RA scheduling dispatches through sub_2165850; post-RA through sub_21668D0.

Mode 1 (simple): Pre-RA scheduling is skipped entirely. Post-RA runs only unk_4FCE24C (the post-RA scheduler). This minimal configuration is useful for debugging or when scheduling is harmful to performance.

Mode 2 (full, default): Pre-RA scheduling runs unk_4FC8A0C. Post-RA scheduling runs three passes sequentially:

  1. unk_4FC8A0C -- pre-RA pass (disabled/noop in post-RA context).
  2. unk_4FCE24C -- post-RA scheduler.
  3. unk_4FC9D8C -- extra scheduling pass.

After scheduling completes, the framework prints "After Machine Scheduling", optionally runs sub_21F9D90, then runs unk_4FCAC8C and prints "After StackSlotColoring".

The "disabled" passes in mode 2 are registered but gated internally, allowing the framework to maintain a uniform pass list while selectively activating passes based on the current compilation phase.

Cross-Cutting Observations

Register pressure tracking appears in three distinct places within the scheduling infrastructure, each serving a different consumer:

| Tracker | Consumer | Update Frequency |
|---|---|---|
| MRPA incremental (sub_2E5A4E0) | MCSE decisions | Per instruction move/elimination |
| ScheduleDAGMILive (sub_355F610) | Scheduling decisions | Per scheduling region |
| MachinePipeliner stage tracking | II feasibility | Per pipeline stage |

All three maintain per-register-class pressure arrays but with different granularities. The MRPA tracker uses incremental delta updates for efficiency; the scheduler computes ASAP/ALAP bounds per region; the pipeliner tracks pressure per modulo stage.

The DenseMap hash function (ptr >> 9) ^ (ptr >> 4) is shared across both the 32-bit value variant (sub_1DFB9D0) and 64-bit value variant (sub_1DFB810), indicating a common template instantiation pattern consistent with LLVM's DenseMap<K, V> template.

Contrast with ptxas scheduling: ptxas has its own instruction scheduling subsystem with 195 knobs (including scoreboard-aware scheduling via the AdvancedSB* family, SchedDisableAll, SchedForceReverseOrder, and the GemmPipeliner* family of 8 knobs for matrix multiply detection and pipelining). CICC's scheduling operates at the MachineInstr level before PTX emission; ptxas re-schedules at the SASS level after PTX assembly. The two scheduling layers are independent but complementary.

What Upstream LLVM Gets Wrong for GPU

Upstream LLVM's instruction scheduling framework was designed for CPU cores with out-of-order execution, branch prediction, and deep reorder buffers. On a GPU SM, these hardware features do not exist:

  • Upstream assumes out-of-order hardware will hide scheduling mistakes. Modern CPUs have 200+ entry reorder buffers that dynamically reorder instructions, making compiler scheduling a second-order optimization. GPU SMs execute instructions in-order within each warp -- every scheduling decision is final. A poorly ordered instruction stream on GPU means stalls that no hardware can recover from.
  • Upstream optimizes for pipeline hazards and port pressure. CPU schedulers model execution port contention (e.g., port 0 vs. port 1 on Intel), dispatch group rules, and pipeline bubble avoidance. GPU scheduling targets register pressure minimization (nvptx-sched4reg) because the SM's warp scheduler handles instruction-level parallelism through warp interleaving, not through instruction reordering within a single thread.
  • Upstream assumes a single scheduling pass produces the final order. On CPU, LLVM's ScheduleDAGMILive emits the final instruction sequence. On NVPTX, cicc's scheduling is the first of two layers -- ptxas re-schedules the entire program at the SASS level with its own 195-knob subsystem (including scoreboard-aware scheduling via the AdvancedSB* family). CICC's scheduler optimizes for ptxas consumption, not for direct hardware execution.
  • Upstream has no concept of texture instruction grouping. CPU scheduling never considers grouping memory operations for hardware coalescing units. NVIDIA adds a dedicated Texture Group Merge pass (sub_2DDE8C0, 74KB) that groups texture load instructions by base address for the hardware texture unit -- an entirely GPU-specific optimization absent from upstream.
  • Upstream does not track register pressure incrementally during CSE. Upstream LLVM recomputes register pressure from scratch after each Machine CSE transform. NVIDIA's MRPA subsystem (sub_2E5A4E0, 48KB) maintains running pressure state through delta updates, because on GPU the pressure-to-occupancy relationship makes every CSE decision a potential occupancy cliff crossing that must be evaluated cheaply.

Differences from Upstream LLVM

  • Scheduling subsystems -- Upstream: ScheduleDAGMILive + optional MachinePipeliner, no incremental pressure tracker. CICC: three distinct subsystems (MRPA incremental tracker, Swing Modulo Scheduler, ScheduleDAGMILive) plus the texture group merge pass.
  • MRPA (incremental pressure) -- Upstream: not present; pressure recomputed from scratch after each CSE transform. CICC: sub_2E5A4E0 (48 KB) + backend variant sub_1E00370 (78 KB) maintain running pressure state through delta operations during MCSE.
  • Texture group merge -- Upstream: no concept of texture instruction grouping. CICC: dedicated pass (sub_2DDE8C0) groups texture load instructions for hardware coalescing; a scheduling-adjacent optimization absent from upstream.
  • Scheduling target -- Upstream: optimize for hardware pipeline hazards and port pressure. CICC: optimize the MachineInstr stream for ptxas consumption, focusing on register pressure reduction (nvptx-sched4reg) rather than hardware pipeline timing.
  • Two-level scheduling -- Upstream: a single scheduling pass produces the final instruction order. CICC: cicc scheduling is the first layer; ptxas re-schedules at the SASS level with its own 195-knob subsystem.
  • Register pressure model -- Upstream: per-register-class pressure sets from TRI. CICC: same model with GPU occupancy awareness; pressure arrays are used to detect occupancy cliff crossings.
  • Scheduling mode switch -- Upstream: configured at pipeline construction time. CICC: runtime mode switch between pre-RA (sub_2165850) and post-RA (sub_21668D0) with different heuristic weights.

ptxas Interaction

cicc's instruction scheduling operates at the MachineInstr level and produces a PTX instruction order that is not final. ptxas re-schedules the entire program at the SASS level using its own 195-knob scheduling subsystem, including scoreboard-aware scheduling (AdvancedSB* family), the GemmPipeliner* family for matrix multiply detection and software pipelining, and SchedForceReverseOrder for debugging. cicc's scheduler therefore optimizes for ptxas consumption rather than direct hardware execution: its primary goal is minimizing register pressure (nvptx-sched4reg) so that ptxas starts from a low-pressure baseline. The two scheduling layers are independent but complementary -- cicc controls the virtual register count visible to ptxas, and ptxas maps the resulting instruction stream onto the SM's hardware pipeline with full knowledge of scoreboard latencies and functional unit availability.

LiveRangeCalc

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

LLVM version note: based on LLVM 17.x LiveRangeCalc.cpp, the baseline used by the diff table later on this page. NVIDIA adds dual-bitvector GP/predicate tracking, a small-function bypass (instruction count <= 15), an enlarged 296-byte segment structure with inlined SmallVectors, and a 4/5 active-block fraction not present in any upstream version.

LiveRangeCalc is the low-level engine inside LLVM's CodeGen that turns def/use information into live intervals -- contiguous [SlotIndex, SlotIndex) segments describing when each virtual register holds a value. It sits between the SlotIndexes numbering pass and the LiveIntervals analysis, performing the actual iterative dataflow computation that propagates liveness backward through the CFG and inserts PHI-def value numbers at merge points. In CICC v13.0 the implementation at sub_2FC4FC0 is structurally based on upstream LLVM's LiveRangeCalc::extend / calculateValues but carries several NVIDIA-specific modifications: a dual-bitvector tracking scheme that separates general-purpose and predicate register liveness, a small-function bypass that skips the full dataflow for trivial kernels, and an enlarged per-segment structure (296 bytes) that inlines four separate SmallVector buffers to avoid heap allocations on the hot path.

Main entry              sub_2FC4FC0 (12,900 bytes, 78KB decompiled)
Stack frame             504 bytes (0x1F8)
Callers                 sub_2FC8470 (LiveIntervals::computeRegUnitRange), sub_2FC8230 (createDeadDef/addSegment), self-recursive
SlotIndexes pass        sub_1F10BF0 (11KB), registered as "slotindexes" / "Slot index numbering"
LiveIntervals analysis  pipeline entry "live-intervals" (analysis ID unk_4F96DB4)
Address range           0x2FBF390 -- 0x2FC8470 (full LiveRangeCalc cluster)
Returns                 bool -- whether any live range was extended

SlotIndex Infrastructure

Before LiveRangeCalc can operate, every MachineInstr must have a SlotIndex -- a monotonically increasing integer that encodes both the instruction's position and a sub-slot discriminator (early-clobber, register, dead, etc.). The SlotIndexes pass at sub_1F10BF0 walks the MachineFunction and assigns these numbers. CICC's implementation matches upstream LLVM: each MachineBasicBlock owns a contiguous range [StartIdx, EndIdx), and the mapping from SlotIndex back to MachineBasicBlock* is maintained in a sorted array that supports binary search.
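Upstream LLVM packs the sub-slot discriminator into the low two bits of the index (Slot_Block, Slot_EarlyClobber, Slot_Register, Slot_Dead, in ascending order). A minimal model of that encoding, assuming CICC follows the upstream scheme (the real SlotIndex wraps a pointer into the index list plus two slot bits; this sketch flattens it to a plain integer):

```python
# Sub-slot kinds as defined in upstream llvm/CodeGen/SlotIndexes.h
SLOT_BLOCK, SLOT_EARLY_CLOBBER, SLOT_REGISTER, SLOT_DEAD = range(4)

def make_index(instr_number: int, slot: int) -> int:
    # Simplified model: instruction position in the high bits,
    # sub-slot in the low 2 bits, so indices sort by position
    # first and by slot kind second
    return (instr_number << 2) | slot

def instr_of(index: int) -> int:
    return index >> 2

def slot_of(index: int) -> int:
    return index & 3
```

The ordering property is what LiveRangeCalc relies on: for the same instruction, the early-clobber slot sorts before the register slot, which sorts before the dead slot, so [start, end) segments can be compared as plain integers.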

The sentinel values found in the binary confirm standard LLVM DenseMap usage:

Sentinel   Value               Meaning
Empty key  0xFFFFFFFFFFFFF000  Slot has never been occupied
Tombstone  0xFFFFFFFFFFFFE000  Slot was occupied, then erased

These appear throughout the segment hash table, the pending-def table, and the VNInfo chain, always as DenseMap<SlotIndex, ...> sentinels.
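The two constants are simply -4096 and -8192 reinterpreted as unsigned 64-bit values -- addresses no valid pointer-like key takes, matching the "-4096 / -8192" sentinels cited later for the segment hash table. A quick check of the equivalence:

```python
EMPTY_KEY = 0xFFFFFFFFFFFFF000
TOMBSTONE = 0xFFFFFFFFFFFFE000

def to_signed64(x: int) -> int:
    # Reinterpret an unsigned 64-bit value as two's-complement signed
    return x - (1 << 64) if x >= (1 << 63) else x

assert to_signed64(EMPTY_KEY) == -4096
assert to_signed64(TOMBSTONE) == -8192
```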

Segment Structure Layout

Each live range segment in CICC is 296 bytes (0x128), substantially larger than upstream's LiveRange::Segment (which is 24 bytes). The inflation comes from four inlined SmallVector buffers that avoid separate heap allocations for the common case:

Segment (296 bytes / 0x128):
  +0x00   u64   status / SlotIndex start (sentinel if free)
  +0x08   ptr   endpoint buffer (or inline at +0x18)
  +0x18   [16]  inline endpoint buffer
  +0x28         additional metadata (segment flags, subrange info)
  +0x50   ptr   register mask buffer (or inline at +0x60)
  +0x60   [56]  inline register mask buffer
  +0x98   ptr   kill-set buffer (or inline at +0xA8)
  +0xA8   [48]  inline kill-set buffer
  +0xD8   u32   kill count
  +0xE0   ptr   use-def chain buffer (or inline at +0xF0)
  +0xF0   [48]  inline use-def chain buffer
  +0x120  u32   total instruction count covered

Each pointer field follows the LLVM SmallVector convention: if the pointer equals the address of the inline buffer immediately following it, the data lives inline; otherwise it points to a heap allocation. During cleanup (Phase 1 of the algorithm), each segment's four buffers are freed individually before the segment is marked with the empty sentinel.
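The inline-vs-heap test can be modeled with an explicit identity comparison. In this illustrative sketch (names like `SmallBuf` are mine, not from the binary), `data` stands for the pointer field and `inline_buf` for the buffer that immediately follows it:

```python
class SmallBuf:
    """Models one SmallVector-style field pair: a data pointer
    followed by an inline buffer, as in the segment layout above."""

    def __init__(self, inline_cap: int):
        self.inline_buf = [0] * inline_cap
        self.data = self.inline_buf      # starts pointing at inline storage

    def is_inline(self) -> bool:
        # LLVM's check: the pointer equals the inline buffer's address
        return self.data is self.inline_buf

    def grow_to_heap(self, new_cap: int):
        heap = list(self.data) + [0] * (new_cap - len(self.data))
        self.data = heap                 # now a separate heap allocation

    def free_if_heap(self):
        # Mirrors the Phase 1 cleanup rule: free only when the pointer
        # differs from the inline region
        if not self.is_inline():
            self.data = self.inline_buf
```

Usage follows the cleanup loop's logic: a buffer that never outgrew its inline capacity is skipped, while a grown one is freed before the segment is stamped with the empty sentinel.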

VNInfo Structure

Value numbers are tracked via 120-byte (0x78) VNInfo nodes, allocated from a bump-pointer allocator at [this+0x4A0]:

VNInfo (120 bytes / 0x78):
  +0x00   ptr   endpoint buffer (inline at +0x10)
  +0x08   u64   capacity (initial: 0x200000000 = inline cap 2)
  +0x10   [48]  inline endpoint buffer
  +0x40   ptr   kill-set buffer (inline at +0x50)
  +0x48   u64   capacity for kill-set
  +0x60   ptr   sub-chain pointer (phi resolution)
  +0x68   ptr   sub-chain pointer 2
  +0x70   u32   block number
  +0x74   u32   value number (initially unassigned)

The allocator is a classic bump allocator: a cursor at [this+0x4A0] advances by 0x10 per allocation, checked against capacity at [this+0x448]. When the arena fills, a slow-path reallocation grows the backing store. Deallocation chains through sub_2FBF390, which walks sub-chains and calls free with size 0x38 (56 bytes) per intermediate node and 0x78 (120 bytes) for the VNInfo itself.
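A minimal model of that bump allocator (the 0x10 node stride matches the recovered constant; the doubling growth policy is an assumption, since the binary's exact slow-path behavior has not been characterized):

```python
class BumpArena:
    NODE = 0x10  # cursor advances by 0x10 per allocation

    def __init__(self, capacity: int):
        self.capacity = capacity  # analog of [this+0x448]
        self.cursor = 0           # analog of [this+0x4A0]

    def alloc(self) -> int:
        # Fast path: check cursor against capacity, then bump
        if self.cursor + self.NODE > self.capacity:
            self._grow()          # slow path when the arena fills
        off = self.cursor
        self.cursor += self.NODE
        return off

    def _grow(self):
        # Assumed doubling policy (not recovered from the binary)
        self.capacity *= 2

arena = BumpArena(capacity=0x40)
offsets = [arena.alloc() for _ in range(5)]
```

The fifth allocation trips the slow path: the first four fill the 0x40-byte arena, so the cursor check fails and the capacity grows before offset 0x40 is handed out.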

Algorithm

The computation in sub_2FC4FC0 proceeds in eight phases. It is self-recursive: when iterative refinement discovers new work, the function calls itself to converge.

Phase 1 -- Initialization and Cleanup (0x2FC4FC0 -- 0x2FC50C2)

Links the SlotIndex base ([rdi] = [rsi+0x30]), increments the iteration counter at [this+0x10], and walks the existing segment table (stride 0x128) freeing stale entries. Segments marked with the empty sentinel (0xFFFFFFFFFFFFF000) are skipped; tombstoned entries (0xFFFFFFFFFFFFE000) and live entries both have their four internal buffers freed and are then marked empty.

The cleanup loop at 0x2FC5040--0x2FC50AE iterates with stride 0x128 over the segment array beginning at rbx. For each entry it checks [rbx+0x00] against both sentinels. If the entry is live or tombstoned, it frees four inlined SmallVector buffers in reverse allocation order:

  1. [rbx+0xE0] -- use-def chain buffer (freed if pointer differs from inline region at rbx+0xF0).
  2. [rbx+0x98] -- kill-set buffer (freed if pointer differs from inline region at rbx+0xA8).
  3. [rbx+0x50] -- register mask buffer (freed if pointer differs from inline region at rbx+0x60).
  4. [rbx+0x08] -- segment endpoint buffer (freed if pointer differs from inline region at rbx+0x18).

After freeing, the entry is stamped with the empty sentinel: mov qword [rbx], 0xFFFFFFFFFFFFF000. The old segment count stored at [rdi+0x20] is loaded into r15d at entry and used to bound the cleanup iteration.

Phase 2 -- Auxiliary Table Cleanup (0x2FC50C2 -- 0x2FC52A3)

Resets the old segment count, increments the auxiliary sequence counter, and walks three secondary tables:

  • Pending-def table at [this+0x40] (16-byte stride): cleared with empty sentinels.
  • VNInfo chain at [this+0xA0]: walked back-to-front, freeing each node through sub_2E0AFD0 (getRegInfo) and sub_2FBF390. The walk reads count from [r13+0xA8], loads each entry at [r12-8], decrements r12. For each VNInfo: frees sub-chains via sub_2FBF390 (size 0x38 = 56 bytes per intermediate node), then frees the VNInfo itself (size 0x78 = 120 bytes) via j_j___libc_free_0.
  • Auxiliary tables at offsets 0x130 (48-byte stride) and 0x480 (16-byte stride): freed/resized via sub_C7D6A0 (realloc).
  • Checks [r13+0x458] for additional pending work from a previous iteration.

Phase 3 -- Block Count and Threshold Check (0x2FC52A3 -- 0x2FC53F4)

Computes the active block count from the MBB array: active = (total_blocks * 4/5) - dead_block_count. The * 4/5 fraction is computed via the classic imul 0xCCCCCCCD trick for unsigned division by 5 on x86. If the result is zero, the function returns immediately.

The precise x86 idiom:

mov   rax, [rdx+10h]
sub   rax, [rdx+8]          ; pointer diff on MBB array
sar   rax, 3                ; divide by sizeof(pointer) = 8
imul  eax, 0xCCCCCCCD       ; unsigned multiply by magic constant
shr   eax, 2                ; result = total_blocks * 4 / 5 (rounded down)
sub   eax, [rdx+20h]        ; subtract dead_block_count
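The magic constant is ceil(2^34 / 5) = 0xCCCCCCCD, so the high 32 bits of the 32x32-bit product give floor(n * 4/5), and two further shift bits give floor(n / 5). Note that the snippet as listed keeps only the low half of the product (imul eax followed by shr eax); the conventional form of this reciprocal idiom consumes the high half (e.g. edx after mul r/m32). The sketch below assumes that conventional high-half reading, which is what matches the stated total_blocks * 4/5 result:

```python
M = 0xCCCCCCCD  # ceil(2**34 / 5)

def mul_hi32(n: int) -> int:
    # High 32 bits of the 64-bit product n * M
    return (n * M) >> 32

def four_fifths(n: int) -> int:
    return mul_hi32(n)       # == n * 4 // 5 for all 32-bit n

def div5(n: int) -> int:
    return mul_hi32(n) >> 2  # == n // 5 for all 32-bit n

for n in (0, 1, 4, 5, 9, 100, 12345, 2**32 - 1):
    assert four_fifths(n) == n * 4 // 5
    assert div5(n) == n // 5
```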

Two bitvectors are allocated on the stack for the live-in set. Initial inline capacity is 8 words (512 registers); if the block count exceeds 8, SmallVector::grow at sub_C8D5F0 expands them. The pre-allocated capacity at [r13+0xAC] is also checked; if insufficient, sub_2FC1040 (grow per-block segment table) is called.

Small-function bypass: if the total instruction count is 15 or fewer, the block count is 1 or fewer, or the global flag qword_5025F68 is set, the function skips the full dataflow and returns early. (The flag is tentatively identified with -Ofast-compile mode [LOW confidence]: it triggers a compile-time shortcut consistent with a fast-compile option, but no string or CLI mapping for this global has been recovered; it could also be a debug-only override or an internal tuning knob.) This bypass is an NVIDIA addition not present in upstream LLVM -- it avoids the quadratic cost of bitvector dataflow on trivial kernel bodies where liveness is obvious from local analysis alone.

Phase 4 -- Per-Block Segment Allocation (0x2FC538D -- 0x2FC55E7)

Calls sub_2FC1A70 (ensureCapacity) to prepare per-block storage, then loops over all non-dead blocks summing instruction counts. For each block:

  1. Allocates a 120-byte VNInfo via the bump allocator (sub_22077B0). If allocation fails, jumps to error path at 0x2FC7E1C.
  2. Initializes inline buffers with capacity markers (0x200000000 -- encodes inline capacity 2 in the high 32 bits with size 0 in the low 32 bits, the standard LLVM SmallVector representation).
  3. Sets [vn+0x00] = pointer to inline endpoint buffer (rax+0x10), [vn+0x40] = pointer to inline kill-set buffer (rax+0x50).
  4. Clears sub-chain pointers: [vn+0x60] = 0, [vn+0x68] = 0.
  5. Records the block number at [vn+0x70] = ebx and clears the value number [vn+0x74] = 0.
  6. Advances the bump-pointer allocator at [r14+0x4A0] by 0x10 to allocate a "pending use" object. The allocator checks against capacity at [r14+0x448] and falls back to a slow-path reallocation when the arena fills.
  7. Inserts the VNInfo into the [this+0xA0] vector (grows if needed via sub_C7D6A0).
  8. Registers the block number in the [this+0xC0] map (grows if needed).
  9. Frees old VNInfo if it was a placeholder from a previous iteration.
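The 0x200000000 initializer packs a SmallVector's (capacity, size) pair into a single little-endian 64-bit word: capacity in the high 32 bits, size in the low 32 bits. A quick model of the encoding:

```python
def pack_cap_size(capacity: int, size: int) -> int:
    # High 32 bits = capacity, low 32 bits = size
    return (capacity << 32) | size

def unpack(word: int):
    return word >> 32, word & 0xFFFFFFFF

INIT = 0x200000000  # initializer observed in Phase 4 and in VNInfo +0x08
assert pack_cap_size(2, 0) == INIT  # inline capacity 2, empty vector
```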

Phase 5 -- Liveness Propagation via Bitvector Dataflow (0x2FC5656 -- 0x2FC5CC6)

This is the core computation -- a standard backward-dataflow fixed-point iteration, operating on 64-bit word bitvectors. It implements the classic liveness equation:

LiveIn(B) = (LiveOut(B) \ Kill(B)) | Def(B)
LiveOut(B) = Union over all successors S of LiveIn(S)

The iteration continues until no bitvector word changes across a complete pass over all pending blocks. The changed flag (var_1B0 on the stack) is cleared at the top of each outer iteration and set whenever any bitvector word is modified.

Detailed dataflow pseudocode

// Phase 5 reconstructed from sub_2FC4FC0 at 0x2FC5656--0x2FC5CC6
//
// State:
//   segment_table[]    -- hash table, stride 0x128, keyed by block ID
//     .gp_bv   (+0x98) -- general-purpose register bitvector (live set)
//     .pred_bv (+0xE0) -- predicate register bitvector (live set)
//     .kill_set(+0xA8) -- inline kill-set buffer
//     .kill_cnt(+0xD8) -- number of killed registers
//     .def_bv  (+0x08) -- def-set bitvector
//   worklist           -- pending blocks at [r13+0x50]
//   bv_words           -- number of 64-bit words = ceil(num_regs / 64)
//   changed            -- var_1B0 on stack

fn liveness_propagation(this: &mut LiveRangeCalc) {
    let bv_words: usize = (this.num_regs + 63) / 64;
    loop {
        let mut changed: bool = false;

        for block in this.worklist.iter() {
            // --- Step 1: Hash lookup for block's segment entry ---
            // Hash function: h = ((block.id >> 4) ^ (block.id >> 9))
            //                     & (capacity - 1)
            // Linear probing until key match or empty sentinel
            let entry = this.segment_table.lookup(block.id);

            // --- Step 2: Accumulate kill bitvector from kill set ---
            // The kill set at entry.kill_set contains register IDs
            // that are killed (last-use) within this block.
            // For each killed register, look up its own segment entry
            // and OR its kill bitvector into a local accumulator.
            let mut kill_accum = vec![0u64; bv_words];
            for i in 0..entry.kill_cnt {
                let killed_reg = entry.kill_set[i];
                let kill_entry = this.segment_table.lookup(killed_reg);
                // x86: OR [kill_accum + rdx*8], [kill_entry.kill_bv + rdx*8]
                for w in 0..bv_words {
                    kill_accum[w] |= kill_entry.gp_bv[w];
                }
            }

            // --- Step 3: Compute live_in for general-purpose registers ---
            // Standard backward dataflow: live_in = (live_out & ~kills) | defs
            // live_out is the current content of entry.gp_bv (propagated
            // from successors in previous iterations or initialization)
            let mut src = vec![0u64; bv_words];
            for w in 0..bv_words {
                // x86: rax = NOT [kill_accum + w*8]
                //       rax = AND rax, [entry.gp_bv + w*8]    -- live_out & ~kills
                //       rax = OR  rax, [entry.def_bv + w*8]   -- | defs
                src[w] = (entry.gp_bv[w] & !kill_accum[w]) | entry.def_bv[w];
            }

            // Boundary mask: clear unused high bits in last word
            // x86: ecx = num_regs & 63
            //       shl rdx, cl; not rdx; and [src + (bv_words-1)*8], rdx
            if this.num_regs % 64 != 0 {
                let tail_bits = this.num_regs % 64;
                let mask = (1u64 << tail_bits) - 1;
                src[bv_words - 1] &= mask;
            }

            // --- Step 4: Interference check against allocated set ---
            // Compares computed live_in against the segment's "allocated"
            // bitvector at +0x98. Any bit set in src but NOT in allocated
            // indicates a new live register that extends the range.
            // x86 at 0x2FC5B86:
            //   rax = NOT [entry.gp_bv + rdx*8]   -- ~allocated
            //   rax = AND rax, [src + rdx*8]       -- new bits
            //   test rax, rax / jnz -> extend
            for w in 0..bv_words {
                let new_bits = src[w] & !entry.gp_bv[w];
                if new_bits != 0 {
                    entry.gp_bv[w] |= src[w];   // extend coverage
                    changed = true;
                }
            }

            // --- Step 5: Repeat identically for predicate register bv ---
            // The predicate bitvector at entry offset +0xE0 is processed
            // with exactly the same kill-accumulate / dataflow / interference
            // sequence. Predicate registers (%p0, %p1, ...) occupy a
            // physically separate register file in NVPTX hardware, so they
            // get their own independent bitvector to avoid inflating the
            // interference graph of the main register namespace.
            // [identical loop over pred_bv words omitted for brevity]

        } // end for each block

        if !changed {
            break;  // Fixed point reached
        }
        // Otherwise: var_1B0 was set to 1, loop back to top
    }
}

Convergence criteria

The fixed-point iteration terminates when a complete pass over all pending blocks produces no change to any bitvector word. Formally, convergence is guaranteed because:

  1. Monotonicity. Each bitvector word can only gain bits (the |= operation in the interference-check step is monotone). Bits are never cleared during the iteration.
  2. Finite lattice. The bitvector domain is a finite lattice of height num_regs. Each word can change at most 64 times (once per bit), so the total number of changes across all words and all blocks is bounded by N * W * 64 where N = block count and W = bitvector width in words.
  3. Worst-case iterations. In practice, the iteration converges in O(D) passes where D = maximum loop nesting depth of the CFG. Each pass propagates liveness information one level deeper through nested loops. The theoretical worst case is N iterations for a pathological CFG with a chain of N blocks each feeding into the next, but CUDA kernels rarely exhibit such structure.

The changed flag (var_1B0) is a single byte on the stack. It is zeroed with mov byte [rbp+var_1B0], 0 at the top of each outer iteration and set with mov byte [rbp+var_1B0], 1 whenever the interference check finds new bits. The outer do { ... } while (changed) loop tests this byte at 0x2FC5CC0 with cmp byte [rbp+var_1B0], 0; jne back to the loop head at 0x2FC5656.
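The convergence argument can be made concrete on a toy CFG. The following minimal Python model runs the same backward iteration with sets standing in for bitvectors (the block structure and register names are invented for illustration; note it uses the textbook use/def form rather than CICC's kill-set encoding):

```python
# Toy CFG: B0 -> B1 -> B2, plus a back edge B2 -> B1
succs = {0: [1], 1: [2], 2: [1]}
defs  = {0: {"a"}, 1: {"b"}, 2: set()}
uses  = {0: set(), 1: {"a"}, 2: {"a", "b"}}  # "a" used deep in the loop

live_in = {b: set() for b in succs}
passes = 0
changed = True
while changed:                    # fixed point: stop when nothing grows
    changed = False
    passes += 1
    for b in succs:
        # live_out = union of successors' live_in
        live_out = set().union(*(live_in[s] for s in succs[b]))
        new_in = uses[b] | (live_out - defs[b])
        if new_in - live_in[b]:   # monotone: sets only ever gain members
            live_in[b] |= new_in
            changed = True
```

Monotonicity is visible in the `|=` update: each set only grows, so the loop must terminate, and the back edge is why a second pass is needed -- liveness of "a" inside the loop only stabilizes after it has propagated around B2 -> B1 once.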

Kill and Def computation

The kill and def sets are not computed inside sub_2FC4FC0 itself. They are pre-populated by callers before invoking the dataflow engine:

  • Kill set (+0xA8 inline buffer, count at +0xD8): Populated by sub_2FC8470 (LiveIntervals::computeRegUnitRange) which walks each MachineBasicBlock's instruction list. A register is added to the kill set when an instruction has a use operand that is the last use before the next def (or end of block). The kill set is stored as a flat array of register IDs, not a bitvector -- the dataflow loop then expands it into a bitvector accumulator by looking up each killed register in the hash table.

  • Def set (+0x08 endpoint buffer): Populated by the same caller. A register is added when a MachineInstr defines it (operand flag isDef). For NVPTX, since all registers are virtual, every def creates a fresh value number. The def set is stored as a bitvector where bit i is set if virtual register i is defined in the block.

  • Initial live-out (+0x98 for GP, +0xE0 for predicate): Initialized to the empty set for all blocks. The dataflow iteration propagates liveness backward: when a use is found in a successor block with no preceding def, the register becomes live-out in the current block. The first iteration seeds liveness from the use/def information; subsequent iterations propagate it through the CFG.

This separation means the hash table must be fully populated with per-block kill and def information before sub_2FC4FC0 enters Phase 5. The hash table at sub_2FC0880 supports insert, lookup, and resize operations with open addressing.

Bitvector word-at-a-time implementation

All bitvector operations operate on 64-bit words with standard x86-64 bitwise instructions:

  • Union (OR): or [rdx+rax*8], rcx -- bv[w] |= src[w]
  • Difference (AND-NOT): mov rax, [rsi+rdx*8]; not rax; and rax, [rdi+rdx*8] -- new = src[w] & ~allocated[w]
  • Boundary mask: mov ecx, count_mod_64; mov rdx, -1; shl rdx, cl; not rdx; and [ptr+last_word], rdx -- clears unused high bits in the last word
  • Zero test: test rax, rax; jnz target -- any bit set?

The boundary mask is critical for correctness: without it, garbage bits in the padding region of the last word would create phantom interference. The mask is computed once per iteration entry and applied after every live-in computation. The instruction sequence shl rdx, cl; not rdx creates a mask with count % 64 low bits set and the rest cleared.
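A direct translation of that shl/not sequence, with the 64-bit wraparound made explicit (as in the Phase 5 pseudocode, the mask is only applied when num_regs % 64 != 0, since a shift count of 0 would otherwise mask out the whole word):

```python
U64 = (1 << 64) - 1

def boundary_mask(num_regs: int) -> int:
    """Mask with (num_regs % 64) low bits set, mirroring
    mov rdx, -1 ; shl rdx, cl ; not rdx  with cl = num_regs % 64."""
    cl = num_regs % 64
    rdx = U64                # -1 as an unsigned 64-bit value
    rdx = (rdx << cl) & U64  # shl rdx, cl
    return ~rdx & U64        # not rdx
```

For 70 registers the last word keeps only its 6 low bits: boundary_mask(70) == 0x3F.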

Hash table for segment lookup

The segment hash table (sub_2FC0880) uses the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192) and an entry stride of 0x128 (296 bytes), matching the full segment structure size. See Hash Table and Collection Infrastructure for the hash function, probing, and growth policy.

During the dataflow iteration, each block requires two hash lookups per killed register (one for the block entry, one for each killed register's entry), so the total hash table traffic per iteration is O(N * K_max) where K_max is the maximum kill-set size across all blocks. Since NVPTX virtual register counts are typically in the hundreds (bounded by -maxreg, default 70), the hash table remains small and cache-friendly.

Phase 6 -- PHI Value Resolution (0x2FC5ED8 -- 0x2FC5F95)

After the dataflow converges, resolves PHI-def values at block boundaries. For each block, walks the predecessor chain at [block+0x30] and calls sub_2FBF8B0 (resolvePhiValue / findReachingDef) with four arguments: the LiveRangeCalc*, predecessor MBB, current bitvector, and a stack-allocated phi resolution buffer. This is the same algorithm as upstream LiveRangeCalc::updateSSA -- it propagates live-out values down the dominator tree and inserts PHI-def VNInfo nodes where multiple values reach a merge point.

The var_181 byte is initialized to 0 before each block as a "phi_resolved" flag. If sub_2FBF8B0 returns true, control jumps to 0x2FC710C for phi merge handling -- this path allocates a new VNInfo, links it into the sub-chain at [vn+0x60]/[vn+0x68], and updates the block's value number at [vn+0x74]. The temporary phi resolution buffer is freed after each block regardless of the outcome.

Phase 7 -- Segment Endpoint Fixup (0x2FC5FA8 -- 0x2FC6021)

For each word in the destination bitvector that has bits set (masked with 0xFFFFFFFFFFFFFFF8 to skip low tag bits), looks up the block's SlotIndex via [r14+0x18] shifted and indexed into the SlotIndex table at [rcx+0x98], retrieves the segment's use-def chain at [rdi+0x40], and calls sub_2E0F080 (addSegment / extendInBlock) to materialize the [start, end) segment in the LiveRange object. After processing all pending blocks, advances to the next MBB in the linked list via [r14+8], continuing until hitting the sentinel at [rbp+var_1F0].

Phase 8 -- Finalization and Return (0x2FC5974 -- 0x2FC59E6)

If no interference was found across all iterations, frees pending blocks from the [this+0x4A8] array (via sub_2E88E20), sets the pending count to zero ([r13+0x4B0] = 0), frees any dynamically-allocated bitvectors, and returns bool indicating whether any live range was extended. The return value is derived from var_1F0 = (count != 0).

Dual Bitvector Tracking

The most significant NVIDIA-specific modification is maintaining two independent bitvectors per segment:

Offset  Register class             Purpose
+0x98   General-purpose registers  %r, %rd, %f, %fd, %h, %fh liveness
+0xE0   Predicate registers        %p liveness

Both bitvectors are processed by identical code paths in Phase 5, but independently -- kills in one class do not affect the other. This separation reflects NVPTX's hardware architecture where predicate registers occupy a physically separate register file from data registers. Upstream LLVM's LiveRangeCalc handles all register classes through a single unified mechanism; CICC's split avoids interference-graph inflation by keeping the small predicate namespace out of the main bitvector.

The two bitvectors are processed sequentially within the same iteration body (not in separate passes). For each pending block, the general-purpose bitvector at +0x98 is processed first, then the predicate bitvector at +0xE0 is processed with structurally identical code. The changed flag is shared between both -- a change in either bitvector triggers another iteration of the outer loop. This means the predicate register dataflow rides for free on the same convergence pass, and the two bitvectors converge simultaneously.

The register coalescer at sub_34A46B0 also maintains a bitvector-per-block structure (a 12,336-byte stack buffer v90[12336] at offset 0x270 used as a bitmap for tracking live-through blocks during range rebuild after coalescing). That coalescer bitvector feeds updated information back into the LiveRangeCalc segment table when live intervals are modified by register coalescing.

Differences from Upstream LLVM

CICC v13.0's LiveRangeCalc diverges from upstream LLVM LiveRangeCalc (as of LLVM 17.x) in these specific ways:

  1. Dual bitvector tracking. Upstream uses a single mechanism for all register classes. CICC splits GP and predicate into independent bitvectors to exploit the physical separation in NVPTX hardware.

  2. Small-function bypass. The instruction-count threshold of 15 and the block-count threshold of 1 are NVIDIA additions. Upstream always runs the full dataflow. This optimization is significant because CUDA kernels frequently contain tiny __device__ helper functions that are inlined by the optimizer.

  3. Global fast-compile flag. The qword_5025F68 check that bypasses the entire dataflow loop has no upstream equivalent. It is likely tied to the -Ofast-compile or -O0 optimization level in cicc.

  4. Enlarged segment structure. Upstream's LiveRange::Segment is 24 bytes (start SlotIndex, end SlotIndex, VNInfo pointer). CICC's segment is 296 bytes (0x128), inlining four SmallVector buffers to avoid heap allocations on the hot path. This is a performance optimization for the common case where segments have small kill sets and few endpoints.

  5. Active-block fraction. The * 4/5 computation in Phase 3 (via imul 0xCCCCCCCD) to determine the active block count is not present in upstream. Upstream counts all blocks equally. CICC discounts approximately 20% of blocks, likely accounting for unreachable or dead blocks that StructurizeCFG may have created but not yet eliminated.

  6. PhysReg parameter always zero. Upstream's findReachingDefs takes a Register PhysReg parameter for physical register interference. Since NVPTX has no physical registers (all registers are virtual and hardware-mapped at launch time), this parameter is always Register() (zero). The binary confirms: sub_2E0FDD0 (isAllocatable) is called but its return value never gates segment creation.

GPU-Specific Considerations

Virtual-only register file. NVPTX has no physical registers in the LLVM sense -- all registers are virtual (%r0, %f0, %p0, ...) and the hardware thread scheduler maps them at launch time. This means LiveRangeCalc never needs to handle physical register liveness, live-in lists for calling conventions, or register unit interference. The PhysReg parameter in upstream's findReachingDefs is always Register() (zero). The binary confirms this: sub_2E0FDD0 (isAllocatable / reserved register check) is called but its return value is never used to gate segment creation.

Pressure-driven analysis. The live intervals produced by LiveRangeCalc feed directly into the greedy register allocator's interference cache (at selectOrSplit offset +648). Since NVPTX allocation is pressure-driven rather than assignment-driven, the intervals primarily serve to detect which virtual registers are simultaneously live, not to assign physical registers. The total count of simultaneously-live intervals at any program point determines the register pressure, which the allocator compares against the -maxreg limit (default 70).

Small-kernel bypass. The threshold check in Phase 3 (instruction count <= 15 OR block count <= 1) is absent from upstream LLVM. CUDA kernels frequently contain tiny helper device functions that are inlined into the caller; computing full dataflow liveness for a 10-instruction single-block function is pure overhead. The bypass returns immediately, letting the register allocator fall back to local analysis.

Configuration

Knob                         Default  Effect
early-live-intervals         false    Runs LiveIntervals analysis earlier in the pipeline, before the standard scheduling pass
join-liveintervals           true     Master enable for register coalescing over live intervals
qword_5025F68 (global flag)  0        When nonzero (likely -Ofast-compile), skips the full dataflow loop entirely

The instruction-count threshold of 15 and the block-count threshold of 1 are hardcoded constants, not configurable via LLVM cl::opt flags.

LiveRangeCalc Object Layout

The LiveRangeCalc object (this pointer passed in rdi) is reconstructed from register offsets observed throughout sub_2FC4FC0:

LiveRangeCalc (approx 0x4C0 bytes):
  +0x00   ptr    SlotIndex base (set from [rsi+0x30] in Phase 1)
  +0x08   ptr    VNInfo* / MBB* parameter (set from rsi in Phase 1)
  +0x10   u32    iteration counter (incremented each call)
  +0x14   u32    (padding / alignment)
  +0x20   u32    old segment count (r15d loaded in Phase 1)
  +0x30   u32    auxiliary sequence counter (incremented in Phase 2)
  +0x40   ptr    pending-def table (16-byte stride)
  +0x50   ptr    worklist (pending blocks array)
  +0xA0   ptr    VNInfo chain (vector of VNInfo*)
  +0xA8   u64    VNInfo chain count
  +0xAC   u32    pre-allocated capacity for per-block segment table
  +0xC0   ptr    block-number-to-VNInfo map
  +0x130  ptr    auxiliary table (48-byte stride)
  +0x440  ptr    bump allocator arena base
  +0x448  u64    bump allocator capacity
  +0x458  ptr    additional pending work (checked in Phase 2)
  +0x480  ptr    secondary auxiliary table (16-byte stride)
  +0x4A0  ptr    bump allocator cursor (advances by 0x10 per allocation)
  +0x4A8  ptr    pending-blocks array (freed in Phase 8)
  +0x4B0  u64    pending block count (zeroed in Phase 8)

Complexity

  • Per iteration: O(N * W) where N = number of basic blocks, W = bitvector width in words (ceil(num_regs / 64)). Both GP and predicate bitvectors are processed per iteration, so the actual cost is O(N * (W_gp + W_pred)), but since predicate register counts are small (typically < 64, fitting in a single word), the predicate contribution is O(N).
  • Kill-set expansion per iteration: O(N * K_max * W) where K_max = maximum kill-set size per block. For each of the N blocks, up to K_max hash lookups and W-word OR operations are performed.
  • Convergence: Typically O(D) iterations where D = maximum loop nesting depth. The monotonicity of the OR-based bitvector union guarantees termination. Worst case is O(N) iterations for a pathological single-predecessor chain, but CUDA kernels (especially after StructurizeCFG) have bounded nesting depth.
  • Total: O(N * W * D) for the core liveness computation, plus O(N * K_max * W * D) for kill-set expansion.
  • Hash table operations: O(1) amortized per lookup. Load factor is maintained below 75% by the DenseMap rehash policy.
  • Memory: O(N * W) for bitvectors + O(S * 296) for the segment table where S = number of live segments + O(V * 120) for VNInfo nodes where V = number of value numbers.
  • Phase 1 cleanup: O(S_old) where S_old = segment count from previous iteration. Each segment requires checking four buffer pointers and potentially freeing four allocations.
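The convergence behavior described above can be sketched with a minimal backward dataflow loop (an assumed model, not the decompiled code): live_out[B] is the OR of the successors' live_in, live_in[B] = (live_out[B] & ~def[B]) | use[B], and the monotone OR-union guarantees termination.

```python
# Minimal model of the OR-based backward liveness loop (assumed, not the
# decompiled code). Bitvectors are plain Python ints here; CICC packs them
# into 64-bit words.

def compute_liveness(blocks, succ, use, defs):
    live_in = {b: 0 for b in blocks}
    live_out = {b: 0 for b in blocks}
    iterations = 0
    changed = True
    while changed:                 # monotone OR-union => guaranteed convergence
        changed = False
        iterations += 1
        for b in reversed(blocks):           # reverse order speeds convergence
            out = 0
            for s in succ.get(b, ()):
                out |= live_in[s]
            new_in = (out & ~defs[b]) | use[b]
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out, iterations

# Straight-line CFG 0 -> 1 -> 2 with two registers (bits 0 and 1).
blocks = [0, 1, 2]
succ = {0: [1], 1: [2]}
defs = {0: 0b01, 1: 0b10, 2: 0b00}
use  = {0: 0b00, 1: 0b01, 2: 0b10}
live_in, live_out, iters = compute_liveness(blocks, succ, use, defs)
```

On this loop-free CFG the fixed point is reached after one changing pass plus one confirming pass, matching the O(D) convergence claim above.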

Function Map

Function | Address | Size | Role
LiveRangeCalc::extend / calculateValues -- main entry, self-recursive (12,900 bytes, 78KB decompiled) | sub_2FC4FC0 | -- | --
LiveIntervals::computeRegUnitRange (caller, populates kill/def sets) | sub_2FC8470 | -- | --
LiveIntervals::createDeadDef / addSegment (caller) | sub_2FC8230 | -- | --
ensureCapacity / resetLiveRanges (per-block storage preparation) | sub_2FC1A70 | -- | --
grow per-block segment table (called when [r13+0xAC] insufficient) | sub_2FC1040 | -- | --
interval building helper (called from sub_2FC1040) | sub_2FC1190 | -- | --
hash table operations: insert/lookup/resize with open addressing | sub_2FC0880 | -- | --
segment creation / initialization (296-byte struct setup) | sub_2FC0040 | -- | --
resolvePhiValue / findReachingDef (PHI resolution, 4 args) | sub_2FBF8B0 | -- | --
free VNInfo chain (frees 0x38-byte intermediate nodes, 0x78-byte VNInfo) | sub_2FBF390 | -- | --
segment merge / extend (interference update) | sub_2FBFCC0 | -- | --
live range query | sub_2FC3C20 | -- | --
live range intersection test | sub_2FC3A50 | -- | --
getRegInfo / MachineRegisterInfo query | sub_2E0AFD0 | -- | --
isAllocatable / reserved register check (return value unused in NVPTX) | sub_2E0FDD0 | -- | --
addSegment / extendInBlock (materializes [start, end) segments) | sub_2E0F080 | -- | --
MachineFunction helper | sub_2E76F70 | -- | --
eraseFromParent (MachineInstr deletion, used in Phase 8 cleanup) | sub_2E88E20 | -- | --
register property check (called with flags 0x80000, 0x100000) | sub_2E88A90 | -- | --
operator new (VNInfo allocation, 120 bytes) | sub_22077B0 | -- | --
SlotIndexes::runOnMachineFunction (11KB) | sub_1F10BF0 | -- | --
SlotIndexes pass registration ("slotindexes" / "Slot index numbering") | sub_1F10320 | -- | --
SlotIndexes insertion / repair (13KB) | sub_1F112A0 | -- | --
SlotIndex validity check (string: "invalid") | sub_1F10810 | -- | --
computeLiveIntervals (RA integration, called from greedy RA init) | sub_2F54D60 | -- | --
SmallVector::grow (bitvector expansion when block count > 8) | sub_C8D5F0 | -- | --
realloc (SmallVector resize / auxiliary table resize) | sub_C7D6A0 | -- | --
malloc (new allocation) | sub_C7D670 | -- | --

Cross-References

  • Register Allocation -- consumes live intervals to drive the pressure-based greedy allocator
  • Register Coalescing -- merges live ranges of copy-connected virtual registers; runs before RA, feeds updated intervals back through LiveRangeCalc
  • Instruction Scheduling -- the SlotIndexes numbering assigned here is consumed during post-RA scheduling for latency-aware reordering
  • SelectionDAG -- produces the initial MachineInstr stream that SlotIndexes numbers

Register Coalescing

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Register coalescing in CICC v13.0 eliminates redundant copy instructions by merging the live ranges of their source and destination virtual registers. NVPTX's unlimited virtual register model (PTX has no fixed physical register file) changes the purpose of coalescing compared to CPU targets: rather than reducing physical register pressure to avoid spills, the goal is strictly copy elimination -- fewer mov instructions in the emitted PTX, which in turn gives ptxas a cleaner input with fewer live-range constraints to resolve during its own physical allocation. CICC runs two coalescing passes in sequence: the standard LLVM RegisterCoalescer at sub_2F71140 (which handles generic COPY pseudo-instructions) and a separate NVPTX-specific coalescer rooted at sub_34AF4A0 (which handles NVPTX copy instruction families in the opcode 440--503 range that the generic pass does not recognize). This page documents both, with emphasis on the NVPTX-specific pass where the bulk of the proprietary logic resides.

Standard LLVM RegisterCoalescer | sub_2F71140 (80KB, 2,190 lines)
NVPTX coalescing driver | sub_34AF4A0 (67KB, 2,373 lines)
Per-instruction coalesce attempt | sub_34AE060 (28KB)
Interference check | sub_34AA450 (11.5KB)
Block-level coalescing | sub_34BAAF0 (31.7KB)
Live-out / weight computation | sub_34B7280 (22KB)
Interval tree (red-black BST) | sub_34A0610 (14.7KB)
Range rebuild after merge | sub_34A46B0 (13KB)
Opcode -> copy-type mapping | sub_3494EA0 (12.7KB)
Operand type classification table | byte_444C4A0 (16-byte entries)
Address range | 0x3494EA0 -- 0x34BF740
Pass parameters | (pass_obj*, func_info*, MF*, copy_limit, coalesce_limit)
Pass ordering | After TwoAddressInstruction, before greedy RA

Why Coalescing Matters on a Virtual-Register Target

On CPU targets, coalescing reduces register pressure by allowing two virtual registers to share one physical register, potentially preventing a spill. On NVPTX the motivation is different. PTX is a virtual ISA with typed, unlimited registers (%r0, %r1, ... for 32-bit integers; %f0, %f1, ... for 32-bit floats). The "physical" allocation is deferred entirely to ptxas, which maps virtual registers to the hardware register file at kernel launch time based on occupancy targets. CICC's coalescing therefore serves three purposes:

  1. Copy elimination. Every mov instruction that survives into emitted PTX is dead weight -- it costs an issue slot and extends the live range of both source and destination. Coalescing removes these by unifying src and dst into a single virtual register.

  2. Reduced register name count. Even though PTX registers are virtual, ptxas must solve a graph-coloring problem on them. Fewer distinct register names (after coalescing merges equivalents) give ptxas a smaller interference graph and faster compilation.

  3. Cleaner SSA destruction. PHI elimination during the transition from SSA form to machine code inserts copies at every PHI edge. Many of these are immediately coalesceable because the PHI operand's live range does not extend past the copy point. The coalescer cleans up the mechanical output of PHI lowering.

Copies that the coalescer processes arise from three sources: PHI elimination copies, ABI/calling-convention .param register copies for kernel call boundaries, and sub-register operations (EXTRACT_SUBREG, INSERT_SUBREG).

Standard LLVM RegisterCoalescer (sub_2F71140)

CICC includes the stock LLVM RegisterCoalescer at sub_2F71140, registered as pass "register-coalescer" with debug output markers "Before register coalescing" / "After register coalescing". This pass handles the generic COPY pseudo-instruction (LLVM's TargetOpcode::COPY) using the standard worklist-driven algorithm from upstream.

The key LLVM knobs that apply to this instance:

Knob | Default | Effect
join-liveintervals | true | Master enable for copy coalescing
join-splitedges | subtarget | Coalesce copies on split critical edges
join-globalcopies | subtarget | Coalesce copies that span basic blocks
terminal-rule | true | Apply the terminal rule (copies at block ends)
verify-coalescing | false | Verify MachineInstrs before and after coalescing
late-remat-update-threshold | 100 | Batch live-interval updates when a def has many copy uses
large-interval-size-threshold | 100 | Intervals with more valnos than this are "large"
large-interval-freq-threshold | 256 | Stop coalescing a large interval after this many joins

The standard pass operates on COPY pseudo-instructions only. It does not understand NVPTX-specific move instruction families (opcodes 440--503), which is why the NVPTX-specific pass exists.

NVPTX-Specific Coalescer (sub_34AF4A0)

The proprietary coalescer at sub_34AF4A0 runs after the standard RegisterCoalescer and targets NVPTX copy instruction families that the generic pass skips. It operates on the MachineFunction representation and accepts two limit parameters beyond the standard pass/MF arguments: copy_limit (maximum number of copy instructions to consider) and coalesce_limit (maximum number of successful merges before bailing out). These are compile-time budget controls that prevent quadratic behavior on large functions.

Opcode Classification

The function sub_3494EA0 contains a giant switch statement mapping NVPTX instruction opcodes (range 1--0x12) to copy families in the 440--503 opcode range. Each family represents a distinct copy semantic:

  • Opcodes 440--443: Type-preserving moves within a single register class (i32-to-i32, f32-to-f32, etc.). These map from internal opcodes 12, 13, 15 in the operand-type classification table.
  • Opcodes 444--503: Cross-class moves, paired/wide register moves (128-bit pairs for tensor core paths), and ABI-related .param copies.

The return value is an __m128i pair encoding both the copy semantics and the register class constraints, which subsequent stages use to decide whether coalescing is legal.

Operand-type classification happens via sub_34961A0, which reads operands and classifies them through a lookup table at byte_444C4A0. Each entry in this table is 16 bytes:

struct OperandTypeEntry {
    uint8_t type_code;        // +0: 12=i32, 13=i64, 15=f32, etc.
    uint8_t size_class;       // +1: size in register-width units
    uint8_t register_bank;    // +2: bank identifier
    uint8_t constraint_flags; // +3: bit 0x10 = participates in coalescing
    uint8_t reserved[12];     // +4: padding/future use
};

The constraint flag at offset +3 (mask 0x10) gates whether the operand participates in coalescing at all. Operands with this bit cleared are excluded from the worklist.
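Decoding one table entry is a simple fixed-offset unpack. The sketch below uses the field layout documented above; the example byte values are invented:

```python
# Sketch of decoding one 16-byte entry from a byte_444C4A0-style table
# (field layout as documented above; the example bytes are invented).
import struct

COALESCE_FLAG = 0x10   # constraint_flags bit gating worklist membership

def decode_entry(raw):
    type_code, size_class, bank, flags = struct.unpack_from("<4B", raw, 0)
    return {
        "type_code": type_code,          # 12=i32, 13=i64, 15=f32, ...
        "size_class": size_class,        # size in register-width units
        "register_bank": bank,
        "coalesceable": bool(flags & COALESCE_FLAG),
    }

i32_entry = bytes([12, 1, 0, 0x10]) + bytes(12)   # flag set: joins worklist
```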

Register Class Constraints

Coalescing is constrained to same-class merges. The NVPTX register classes are completely disjoint -- an Int32Regs (%r) register cannot coalesce with a Float32Regs (%f) register even though both are 32 bits wide. This is a consequence of PTX's typed register model: .reg .b32 %r0 and .reg .f32 %f0 are distinct storage locations from ptxas's perspective. The complete register class table and coalescing constraint flags are in Register Classes. All eight primary classes are same-class-only; Int128Regs is excluded from the coalescing worklist entirely (constraint flag cleared).

Cross-class copies (e.g., bitcasting an i32 to f32) use distinct cross-class copy opcodes (see the copy opcode table) and are never eliminated by the coalescer -- they must survive as explicit instructions in PTX.

Sub-Register Handling

NVPTX has a flat register file with no sub-register structure in the CPU sense. There are no %eax/%ax/%al hierarchies. The exception is wide register pairs: 128-bit values used by tensor core operations are represented as pairs of 64-bit registers. sub_3497B40 handles paired-register decomposition, and when coalescing the low half of a pair, the high half inherits corresponding constraints. The coalesce candidate record (248 bytes) stores sub-operand arrays at offset +16 (4 entries of 32 bytes each, inline SBO) specifically for tracking these pair relationships.

Coalescing Algorithm

The NVPTX coalescer follows the standard LLVM pattern of worklist-driven interval joining but uses proprietary data structures throughout.

Phase 1: Initialization (lines 494--617)

Load TargetInstrInfo, TargetRegisterInfo, and TargetSubtargetInfo from the MachineFunction vtables. Initialize approximately 15 open-addressing hash maps, 2 min-heaps, 3 interval trees (red-black BSTs), and 2 linked lists. The stack frame is approximately 4.5KB. Walk all basic blocks, filter virtual-register operands via sub_2DADC00 (the isVirtualRegister check), and collect copy instructions into the worklist hash.

Phase 2: Block-Level Scanning (lines 618--857)

For each basic block, walk instructions and identify NVPTX copy instructions (opcode field at instruction offset +68 equals 14 or 15). For each copy:

  1. Validate source type via sub_B10CD0 (extract register class).
  2. Check physical register constraints (vestigial on NVPTX but present in the code).
  3. Build a coalesce pair via sub_34A70E0, creating a 248-byte candidate record.

Track live-through registers per block using bitvectors.

Phase 3: Interference Graph Construction (lines 858--998)

Build the interval tree via sub_2DACB60 and sub_C8CD80. Cross-compare forward and backward interval lists via sub_2E564A0. Flatten into indexed format via sub_2E507D0. The result is a set of live intervals indexed by register number, stored in a red-black BST where each node is 448 bytes (0x1C0).

Phase 4: Worklist-Driven Coalescing (lines 1040--2092)

This is the core loop. Candidates are extracted from a min-heap ordered by register number (lowest first -- a standard LLVM heuristic that processes defs before uses in reverse postorder).

function CoalesceWorklistDriven(heap, intervals, hash_map):
    while heap is not empty:
        candidate = heap.extract_min()
        src_interval = lookup(hash_map, candidate.src_key)
        dst_interval = lookup(hash_map, candidate.dst_key)

        // Same-class check
        if register_class(src_interval) != register_class(dst_interval):
            continue

        // Interference check
        if CheckInterference(src_interval, dst_interval) != 0:
            push candidate to secondary_heap
            continue

        // Pre-coalesce validation
        if not ValidateCopy(candidate):
            push candidate to secondary_heap
            continue

        // Execute the merge
        merged = MergeIntervals(src_interval, dst_interval)
        RewriteOperands(candidate.copy_instr, merged)
        UpdateHashMap(hash_map, merged)

        // Verify and rebuild
        VerifyMergedInterval(merged)
        RebuildRanges(merged)

    // Double-buffer swap: retry with secondary heap
    swap(heap, secondary_heap)
    if secondary_heap was non-empty:
        repeat from top

The double-buffer swap (lines 2073--2093) alternates between two heaps (v373 and v376). After exhausting one worklist, the pass swaps and retries -- implementing the LLVM-style "iterate until convergence" pattern where an earlier merge may resolve interference that blocked a later merge.
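The retry dynamic can be modeled abstractly. This is an assumption-level reconstruction, not the decompiled v373/v376 structures: failed candidates move to a secondary heap, and the pass swaps buffers and retries until a full pass makes no progress.

```python
# Abstract model (assumed) of the double-buffer retry loop: failed candidates
# go to a secondary heap; the pass swaps and retries until no progress.
import heapq

def coalesce_until_convergence(candidates, try_merge):
    heap = list(candidates)
    heapq.heapify(heap)                  # min-heap, lowest register first
    merged = []
    while heap:
        secondary = []
        progress = False
        while heap:
            cand = heapq.heappop(heap)
            if try_merge(cand):
                merged.append(cand)
                progress = True
            else:
                heapq.heappush(secondary, cand)
        if not progress:
            break                        # deferred candidates can never succeed
        heap = secondary                 # double-buffer swap, then retry
    return merged

# Toy dependency: merging register 1 only becomes legal after register 3
# has been merged (an earlier merge resolves the blocking interference).
done = set()

def try_merge(cand):
    if cand == 1 and 3 not in done:
        return False
    done.add(cand)
    return True

result = coalesce_until_convergence([1, 3], try_merge)
```

In the toy run, candidate 1 is deferred on the first pass, candidate 3's merge unblocks it, and the swap-and-retry picks it up on the second pass.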

Phase 5: Code Patching (lines 2095--2144)

For each coalesced pair, rewrite instruction operands:

  1. sub_349D6E0 -- look up the merged interval's representative register.
  2. sub_349FA50 -- find the instruction position.
  3. sub_2E31040 -- patch the operand's register field.
  4. Fix linked-list pointers using the ptr & 0xFFFFFFFFFFFFFFF8 mask (the low 3 bits encode tags on MachineOperand pointers: 0 = normal, 3 = tied operand, 4 = implicit operand).

Phase 6: Cleanup (lines 2145--2371)

Destroy interval trees (sub_349E8A0), perform final range rebuild (sub_34A46B0), finalize coalescing metadata (sub_34A2530), commit merged intervals (sub_34AA090), and deallocate all hash maps, heaps, and trees (16+ free calls).

Interference Check (sub_34AA450)

The interference check is the critical decision point. Given two intervals (identified by their register keys), it determines whether merging them would create a conflict -- that is, whether both registers are simultaneously live at any program point.

function CheckInterference(interval_A, interval_B) -> {0 = safe, 1 = interfering}:
    for each instruction I in interval_A.instruction_vector:
        if I is in the "already-coalesced" set:
            continue
        reg_class = extract_register_class(I)
        dst_interval = lookup(reg_to_interval_hash, I.dst_reg)
        if dst_interval overlaps with interval_B:
            return 1  // interfering
    return 0  // safe to coalesce

The "already-coalesced" set is an open-addressing hash map (pointer keys, hash (key >> 9) ^ (key >> 4), sentinels -4096/-8192). The sentinel check at a3+8 (a flag byte) determines whether the set uses inline or heap storage (small-buffer optimization for sets under approximately 8 entries).

Since NVPTX has no physical register file, "interference" here means purely that two virtual register live ranges overlap at a program point. On CPU targets this would also involve physical register conflict checks, but on NVPTX that dimension is absent.
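The "already-coalesced" set's behavior follows from its documented constants. Below is a toy reconstruction (the hash and sentinel values come from the decompilation; the Python class itself is an assumption, and removal via tombstones is not modeled):

```python
# Toy reconstruction of the pointer-keyed "already-coalesced" set:
# hash (key >> 9) ^ (key >> 4), empty sentinel -4096, tombstone -8192.
EMPTY, TOMBSTONE = -4096, -8192

class PtrSet:
    def __init__(self, capacity=64):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        cap = len(self.slots)
        idx = ((key >> 9) ^ (key >> 4)) % cap
        while self.slots[idx] not in (EMPTY, key):
            idx = (idx + 1) % cap        # linear probing on collision
        return idx

    def insert(self, key):
        self.slots[self._probe(key)] = key

    def contains(self, key):
        return self.slots[self._probe(key)] == key

seen = PtrSet()
seen.insert(0x7F1234567890)              # a MachineInstr* as an integer key
```

The shifted-XOR hash mixes the low bits of 8-byte-aligned pointers, which would otherwise cluster into a few buckets.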

Priority and Weight System

The coalescing priority determines the order in which candidates are processed when the min-heap's register-number ordering produces ties.

Weight computation (sub_34B7280):

weight = instruction_count + spill_weight[offset+240] + use_count[offset+252]

The flag at offset+254 & 1 guards weight computation: if set, the interval was pre-weighted by an earlier pass and the coalescer uses the existing weight rather than recomputing.

Higher weight means higher coalescing priority. The overall ordering is:

  1. Primary key: register number (min-heap, lowest first).
  2. Secondary key: weight (higher breaks ties in favor of more-used registers).

Block frequency integration: The pass reads a boolean from TargetPassConfig (via sub_35DDE70 at *(_QWORD*)(pass[4]+256)+856) that controls whether block frequency data influences priority. When enabled, copies in hot blocks receive higher priority, biasing the coalescer toward eliminating copies on the critical execution path.
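The two-level ordering can be expressed as a composite heap key. This sketch uses the documented keys (register number primary, weight as tiebreak); the candidate data is invented:

```python
# Sketch of the documented ordering: min-heap keyed by register number,
# with weight breaking ties in favor of heavier, more-used intervals.
import heapq

def priority_key(cand):
    reg, weight = cand
    return (reg, -weight)       # lowest register first; higher weight wins ties

cands = [(7, 3.0), (7, 9.0), (2, 1.0)]   # (register number, weight)
heap = [(priority_key(c), c) for c in cands]
heapq.heapify(heap)
order = [heapq.heappop(heap)[1] for _ in range(len(cands))]
```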

Data Structures

Hash Maps

All hash maps use the standard DenseMap open-addressing infrastructure described in Hash Table and Collection Infrastructure. Two sentinel variants appear in this pass:

Variant | Key Type | Sentinel pair
Integer-key | int32_t | -1 / -2 (hash: key * 37)
Pointer-key | int64_t | -4096 / -8192 (hash: (key >> 9) ^ (key >> 4))

Growth policy: next_power_of_2(2 * old_capacity - 1), minimum 64 entries.
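Worked out in code, the growth rule looks like this (a sketch of the stated formula, not the binary's implementation):

```python
# Sketch of the documented growth rule: new capacity is
# next_power_of_2(2 * old_capacity - 1), clamped to at least 64 entries.
def next_capacity(old_capacity):
    target = max(2 * old_capacity - 1, 64)
    return 1 << (target - 1).bit_length()   # round up to a power of two
```

The `- 1` before rounding means a power-of-two capacity doubles exactly: a 64-entry table grows to next_power_of_2(127) = 128 rather than 256.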

Allocator: sub_C7D670(size, alignment=8) / sub_C7D6A0(ptr, size, alignment=8) -- CICC's aligned malloc/free wrappers.

Interval Tree (Red-Black BST)

Managed by sub_34A0610. Each node is 448 bytes (0x1C0):

Offset | Size | Field
+0 | 24 | Tree links (left, right, parent pointers)
+32 | 8 | Interval key (register/slot encoding)
+64 | 8 | Instruction vector pointer
+72 | 4 | Instruction count
+192 | 16 | Debug name (SBO: inline if len <= 15)
+200 | 4 | Sub-operand count
+224 | 4 | Instruction opcode
+240 | 2 | Priority/weight (uint16)

Comparator: sub_34A0190 (compares interval start positions). Rebalancing: sub_34A0330. The tree maintains count (a2[5]) and cached min/max (a2[3]/a2[4]).

Coalesce Candidate Record (248 bytes)

Built by sub_349AB40 for each potential coalescing opportunity:

Offset | Size | Field
+0 | 8 | Source interval key
+8 | 8 | Destination interval key
+16 | 128 | Sub-operand array (SBO, 4 entries x 32 bytes)
+64 | 112 | Type-constraint array (SBO, 2 entries x 56 bytes)
+192 | 32 | Debug name (SBO string)
+224 | 4 | Opcode classification (1--6: copy, subreg, extract, ...)
+232 | 4 | Copy source register
+240 | 2 | Priority (default: 1)

MachineOperand Pointer Encoding

Throughout the coalescing code, MachineOperand pointers use low-bit tagging (8-byte alignment guarantees 3 unused low bits):

Tag (ptr & 7) | Meaning
0 | Normal operand
3 | Tied operand (requires special coalescing -- both operands must map to same register)
4 | Implicit operand (flag bit at operand offset +44, bit 3)

The code consistently masks with & 0xFFFFFFFFFFFFFFF8 before dereferencing and checks (ptr & 7) == 3 or (ptr & 4) != 0 for branching decisions.
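The tag scheme is easy to demonstrate: 8-byte alignment leaves the three low bits free, so the mask recovers the pointer and `ptr & 7` the tag. The helper functions below are illustrative, not names from the binary:

```python
# Demonstration of the low-bit tag scheme on 8-byte-aligned pointers.
# (Tag meanings per the table above; the helpers are illustrative.)
PTR_MASK = 0xFFFFFFFFFFFFFFF8

def tag_operand(ptr, tag):
    assert ptr & 7 == 0 and 0 <= tag < 8   # pointer must be 8-byte aligned
    return ptr | tag

def untag_operand(tagged):
    return tagged & PTR_MASK, tagged & 7

tagged = tag_operand(0x7FFE12345678, 3)    # tag 3 = tied operand
ptr, tag = untag_operand(tagged)
```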

CSSA Coalescing (PHI-Specific)

Separate from the two coalescing passes above, CICC includes a CSSA (Conventional SSA) coalescing stage controlled by the cssa-coalesce knob (constructor at ctor_705, address 0x5BD430). This pass operates at the SSA level rather than the machine level, coalescing PHI operands before PHI elimination to reduce the number of copies that PHI lowering generates. Associated knobs:

Knob | Effect
cssa-coalesce | Enable/disable PHI operand coalescing
cssa-verbosity | Verbosity level for CSSA debug output
dump-before-cssa | Dump IR before CSSA coalescing
usedessa | Select deSSA method (alternative to CSSA)

Knobs and Thresholds Summary

Knob | Source | Default | Effect
join-liveintervals | LLVM | true | Master enable for standard RegisterCoalescer
join-splitedges | LLVM | subtarget | Coalesce on split critical edges
join-globalcopies | LLVM | subtarget | Coalesce cross-block copies
terminal-rule | LLVM | true | Terminal rule for block-end copies
verify-coalescing | LLVM | false | Pre/post verification
late-remat-update-threshold | LLVM | 100 | Batch remat update threshold
large-interval-size-threshold | LLVM | 100 | Large interval valno threshold
large-interval-freq-threshold | LLVM | 256 | Large interval coalesce limit
twoaddr-reschedule | LLVM | -- | Coalesce copies by rescheduling in TwoAddress
copy_limit | NVPTX | runtime | Max copies to consider in NVPTX pass
coalesce_limit | NVPTX | runtime | Max merges before bailout in NVPTX pass
cssa-coalesce | NVPTX | -- | PHI operand coalescing
cssa-verbosity | NVPTX | -- | CSSA debug verbosity
block frequency flag | NVPTX | config | Weight copies by block hotness

The copy_limit and coalesce_limit parameters are passed into sub_34AF4A0 at call time (not static cl::opt knobs). Their values come from the pass pipeline configuration and serve as compile-time budget caps to avoid quadratic worst-case behavior on functions with thousands of copies.

Impact on ptxas

The quality of CICC's coalescing directly affects ptxas's register allocation phase:

  • Fewer virtual registers means a smaller interference graph for ptxas to color, reducing its compilation time.
  • Eliminated copies reduce instruction count, giving ptxas's scheduler more freedom and fewer false dependencies.
  • Preserved type invariants (no cross-class coalescing) ensure ptxas never encounters type-inconsistent register usage, which would require additional conversion instructions.
  • Wide register pair tracking ensures tensor core instruction patterns remain intact -- ptxas expects specific register pair relationships for mma and wmma instructions.

A pathological case is over-aggressive coalescing that creates very long live ranges spanning many basic blocks. On NVPTX this does not cause spills (there is no physical register file to spill from), but it can increase ptxas's reported register usage, reducing occupancy. The coalesce_limit parameter and the large-interval frequency threshold exist partly to avoid this scenario.

Function Map

Function | Address | Size | Role
Main NVPTX coalescing driver | sub_34AF4A0 | 67KB | --
Per-instruction coalesce attempt | sub_34AE060 | 28KB | --
Pre-coalesce validation (opcode 14/15 check) | sub_34AB5C0 | 16KB | --
Post-coalesce update (rewrite def-use chains) | sub_34AC810 | 19KB | --
Constrained-copy validation variant | sub_34AD8B0 | 8.5KB | --
Interference check | sub_34AA450 | 11.5KB | --
Range rebuild (bitvector v90[12336]) | sub_34A46B0 | 13KB | --
Interval equivalence verify | sub_34A2770 | 7.3KB | --
Interval tree insert/rebalance (RB-tree) | sub_34A0610 | 14.7KB | --
Register-to-interval hash lookup | sub_34A3910 | 2.7KB | --
Build worklist from BB operand scan | sub_34A3D10 | 5KB | --
Build worklist from instruction iteration | sub_34A41A0 | 4.8KB | --
Block-level coalescing driver | sub_34BAAF0 | 31.7KB | --
Live-out analysis + weight computation | sub_34B7280 | 22KB | --
Per-register interference build | sub_34B6620 | 17.7KB | --
Operand-type classification | sub_34961A0 | 26.6KB | --
Register-pair decomposition | sub_3497B40 | 16.5KB | --
Opcode -> copy-type mapping (switch) | sub_3494EA0 | 12.7KB | --
Build coalesce candidate list | sub_349AB40 | 24.5KB | --
Merged-interval representative lookup | sub_349D6E0 | -- | --
Instruction position lookup/creation | sub_349FA50 | 7.1KB | --
Interval tree destructor (variant A) | sub_349E330 | 4KB | --
Interval tree destructor (variant B) | sub_349E500 | 4KB | --
Interval tree destructor (variant C) | sub_349E6D0 | 4KB | --
Interval tree destructor (variant D) | sub_349E8A0 | 4KB | --
Interval info populate from instruction | sub_349F140 | 4.7KB | --
Interval structure reset | sub_349F740 | 4KB | --
Generic map cleanup (callback sub_349D600) | sub_34A2010 | -- | --
Finalize coalescing metadata | sub_34A2530 | -- | --
Commit merged intervals | sub_34AA090 | -- | --
Secondary coalesce commit | sub_34A9A60 | -- | --
Register info initializer | sub_35065A0 | -- | --
Standard LLVM RegisterCoalescer | sub_2F71140 | 80KB | --
RegisterCoalescer::getPassName | sub_2F60C50 | -- | --

Differences from Upstream LLVM

Aspect | Upstream LLVM | CICC v13.0
Number of passes | Single RegisterCoalescer pass handling COPY pseudo-instructions | Two passes in sequence: stock LLVM RegisterCoalescer (sub_2F71140) + NVPTX-specific coalescer (sub_34AF4A0)
Opcode coverage | Handles only TargetOpcode::COPY (generic copy pseudo) | NVPTX pass handles NVPTX copy instruction families in opcode range 440--503 that the generic pass does not recognize
Coalescing goal | Reduce physical register pressure to prevent spills | Strictly copy elimination (PTX has unlimited virtual registers); goal is fewer mov instructions in emitted PTX and smaller interference graphs for ptxas
Interference check | Standard LiveIntervals query | Custom interference check (sub_34AA450, 11.5 KB) with interval tree (red-black BST at sub_34A0610) for NVPTX register classes
Block-level coalescing | Part of the unified worklist | Separate block-level coalescing pass (sub_34BAAF0, 31.7 KB) processes copies within each block before cross-block coalescing
Operand classification | Generic operand handling | Custom operand type classification table (byte_444C4A0, 16-byte entries) maps NVPTX opcode families to copy semantics
Pass parameters | Standard runOnMachineFunction with no limits | Parameterized with explicit (copy_limit, coalesce_limit) bounds for compile-time control on large kernels

Cross-References

  • Register Allocation -- the greedy allocator that runs after coalescing; shares the register class table and interference hash pattern.
  • Instruction Scheduling -- scheduling runs after RA and benefits from reduced copy count; MRPA pressure tracking is affected by coalescing decisions.
  • LLVM Knobs -- full knob inventory including all coalescing-related flags.
  • Code Generation -- pipeline ordering showing where coalescing fits relative to other machine passes.

Register Allocation

Prerequisites: Familiarity with NVPTX register classes, the GPU execution model (especially occupancy and register pressure), and Live Range Calculation. Understanding of Register Coalescing is helpful.

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/CodeGen/RegAllocGreedy.cpp, llvm/lib/CodeGen/SplitKit.cpp, llvm/lib/CodeGen/RegisterCoalescer.cpp, llvm/lib/CodeGen/LiveRangeEdit.cpp (LLVM 20.0.0). NVPTX register class definitions: llvm/lib/Target/NVPTX/NVPTXRegisterInfo.td.

LLVM version note: CICC v13.0 ships two complete copies of RAGreedy (legacy PM at 0x1EC0400, new PM at 0x2F4C2E0). The new PM variant matches the LLVM 20 RAGreedyPass interface. The PriorityAdvisor/EvictionAdvisor infrastructure matches LLVM 15+ patterns. All NVPTX-specific behavior (pressure-driven allocation, -maxreg ceiling, occupancy-aware rematerialization) is layered on top of stock RAGreedy via TTI hooks and custom knobs.

NVPTX register allocation in CICC v13.0 operates under a fundamentally different model from CPU targets. PTX has no fixed physical register file -- registers are virtual (%r0, %r1, %f0, ...) and the hardware scheduler maps them to physical resources at launch time. The "physical register" concept in LLVM's greedy allocator maps to register pressure constraints rather than actual hardware registers, making the allocator pressure-driven rather than assignment-driven. The primary constraint is the -maxreg limit (default 70), which bounds total live registers across all classes to control occupancy on the SM.

Greedy RA driver | sub_2F5A640 (466 lines)
selectOrSplit core | sub_2F49070 (82KB, 2,314 lines)
Live range splitting | sub_2F2D9F0 (93KB, 2,339 lines)
Register coalescing | sub_2F71140 (80KB, 2,190 lines)
Register info init (new) | sub_30590F0
Register info init (old) | sub_2163AB0
Allocation failure handler | sub_2F418E0

Dual Greedy RA Instances

CICC contains two complete copies of the Greedy Register Allocator infrastructure, corresponding to the legacy and new LLVM pass managers:

  • Instance A (legacy, 0x1EC0400 region): registered through the old pass manager pipeline.
  • Instance B (new, 0x2F4C2E0 region): registered through sub_2F504C0 as the factory function.

Both are registered under the pass name "Greedy Register Allocator" via RAGreedyPass (sub_2342890). The selectOrSplit entry point at sub_2F4BAF0 is a thin wrapper that redirects to sub_2F49070(this + 200, ...). A separate entry at sub_2F4BB00 handles the spill-or-split path with SplitEditor integration.

NVPTX Register Classes

CICC defines nine register classes plus one internal-only class. The complete register class table -- vtable addresses, PTX type suffixes, prefixes, encoded IDs, copy opcodes, and coalescing constraints -- is in Register Classes.

The classes are completely disjoint -- there is no cross-class interference. Each type lives in its own namespace: integer 32-bit values occupy %r registers, 32-bit floats occupy %f registers, and so on. Copy instructions are class-specific, with both same-class and cross-class opcodes dispatched by sub_2162350 (see the copy opcode table).

Greedy selectOrSplit -- Detailed Algorithm

Complexity. Let V = number of virtual registers, R = number of register units, and I = total MachineInstr count.

  • The main allocation loop processes V virtual registers in priority order. For each VReg, selectOrSplit performs: (1) operand scanning in O(operands) with 40-byte stride, (2) interference scanning (scanInterference) in O(R) via the RegAllocMatrix, (3) assignment or eviction attempts in O(R) per candidate.
  • The tryLastChanceRecoloring path is bounded by lcr-max-depth (default 5) and lcr-max-interf (default 8), giving O(8^5) = O(32768) per VReg in the absolute worst case -- though this path is rarely taken.
  • Live range splitting (splitAroundRegion, 93KB) iterates segments in O(S), where S = number of live range segments, with interference analysis per segment in O(R).
  • The interference cache's open-addressing hash map with hash 37 * reg provides O(1) amortized lookups. Spill cost computation (setupSpillCosts) is O(V * I_avg), where I_avg is the average instruction count per VReg's live range.
  • Overall: O(V * R) for the common case, O(V * R + V * 8^D) when last-chance recoloring is exercised at depth D. On NVPTX, the completely disjoint register classes mean cross-class interference is zero, reducing the effective R to the per-class register count.

The core allocation algorithm (sub_2F49070, 82KB, 2,314 decompiled lines) follows LLVM's standard RAGreedy::selectOrSplit structure with NVPTX-specific adaptations for pressure-driven allocation. The following pseudocode is reconstructed from the decompiled binary and covers the key phases visible in the new-pass-manager instance.

Initialization (lines 381--484)

fn selectOrSplit(this: &mut RAGreedyState, VirtReg: &LiveInterval) -> PhysReg {
    let TRI     = this.TargetRegisterInfo;
    let NumRegs = TRI[+44];                             // total reg unit count

    // --- RegUnitStates: per-register-unit state array ---
    //     Stored at this+1112, 4 bytes per unit.
    //     Values: 0 = free, 1 = interfering, 2 = reserved
    this.RegUnitStates = alloc_zeroed(NumRegs * 4);     // at this+1112

    // --- Live-through bitvector ---
    //     Stored at this+736, one bit per register unit,
    //     packed into 64-bit words.  Set bits mark units
    //     live across the entire interval.
    let bv_words = (NumRegs + 63) / 64;
    this.LiveThrough = alloc_zeroed(bv_words * 8);      // at this+736

    // --- Interference cache ---
    //     Open-addressing hash map at this+648/656/664.
    //     Key   = register number (unsigned 32-bit)
    //     Hash  = 37 * reg  (mod table_capacity)
    //     Empty = 0xFFFFFFFF (-1),  Tombstone = 0xFFFFFFFE (-2)
    //     Growth: when 4*(count+1) >= 3*capacity, double & rehash.
    this.IntfCache.buckets  = alloc_sentinel(initial_cap); // this+648
    this.IntfCache.count    = 0;                           // this+656
    this.IntfCache.capacity = initial_cap;                 // this+664
    ...

The RegUnitStates array is the central per-unit bookkeeping structure for the entire allocation of a single virtual register. Each 4-byte slot tracks whether that register unit is free, already interfering with the current live range, or reserved by the target. The array is zeroed at the start of every selectOrSplit invocation and released at cleanup (lines 2192--2313).

The interference cache at this+648 is distinct from LLVM's standard InterferenceCache (a 0x2C0-byte allocation created by sub_2FB0E40 during driver setup). This per-invocation cache is a lightweight open-addressing map used to deduplicate interference queries within a single selectOrSplit call. The hash function 37 * reg is a small Knuth-style multiplicative hash that favors speed over distribution quality -- adequate because register numbers are small consecutive integers.
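The recovered scheme can be exercised with a minimal Python sketch. The 37 * reg hash, the -1/-2 sentinels, and the 4*(count+1) >= 3*capacity growth trigger come from the decompilation; the linear-probing choice and all names here are illustrative assumptions.

```python
EMPTY, TOMBSTONE = 0xFFFFFFFF, 0xFFFFFFFE   # -1 / -2 sentinels

class IntfCache:
    """Sketch of the per-invocation interference cache at this+648/656/664."""

    def __init__(self, capacity=16):
        self.buckets = [EMPTY] * capacity    # sentinel-filled table
        self.count = 0

    def _probe(self, reg):
        cap = len(self.buckets)
        i = (37 * reg) % cap                 # Knuth-style multiplicative hash
        while self.buckets[i] not in (EMPTY, reg):
            i = (i + 1) % cap                # linear probing (assumed)
        return i

    def insert(self, reg):
        # Grow at 75% load: 4*(count+1) >= 3*capacity -> double & rehash.
        if 4 * (self.count + 1) >= 3 * len(self.buckets):
            old = [r for r in self.buckets if r not in (EMPTY, TOMBSTONE)]
            self.buckets = [EMPTY] * (2 * len(self.buckets))
            self.count = 0
            for r in old:
                self.insert(r)
        i = self._probe(reg)
        if self.buckets[i] != reg:
            self.buckets[i] = reg
            self.count += 1

    def contains(self, reg):
        return self.buckets[self._probe(reg)] == reg
```

Because register numbers are small consecutive integers, even this weak hash keeps probe chains short in practice.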

Operand Scanning (lines 690--1468)

The function walks every MachineOperand attached to the live range's segment list. Operands are stored in a flat array with a 40-byte stride per entry. The type byte at offset +0 of each operand classifies it:

Type Byte | Meaning | Action
0 | Virtual register | Check copyable/tied flags; record in VReg worklist
12 | Register mask (call clobber) | Store pointer in regmask list at this+1176/1184
other | Physical register | Mark in reserved bitvector; update RegUnitStates

For each operand:

    for op in VirtReg.operands(stride=40):
        match op.type_byte:
            0 =>                                        // virtual register
                if op.reg & 0x80000000:                 // negative = virtual
                    check_copyable(op);
                    check_tied(op);
                    update needsRecoloringFlag;          // v321
                else:
                    mark_reserved(op.reg, RegUnitStates);
                    update hasPhysicalAssignment;         // v323
            12 =>                                       // regmask
                append(this.regmask_list, op);
            _ =>
                mark_reserved(op.reg, RegUnitStates);

The 40-byte operand stride is wider than upstream LLVM's MachineOperand (typically 32 bytes) because CICC embeds an additional 8-byte field for NVPTX-specific metadata (likely the register class tag and a flags word). The scanning loop at line 690 uses v321 (needsRecoloringFlag) and v323 (hasPhysicalAssignment) as accumulator flags that gate later phases: if no virtual registers need work, the function returns early.
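A Python sketch of the 40-byte-stride scan. The stride, the type byte at +0, and the high-bit virtual test are recovered facts; the reg field at +4 and all helper names are assumptions for illustration.

```python
import struct

OPERAND_STRIDE = 40          # vs. upstream LLVM's 32-byte MachineOperand
TYPE_VREG, TYPE_REGMASK = 0, 12

def make_op(type_byte, reg):
    """Build one 40-byte operand entry: type byte at +0, reg dword at +4
    (the +4 reg offset is an assumption, not recovered)."""
    entry = bytearray(OPERAND_STRIDE)
    entry[0] = type_byte
    struct.pack_into("<I", entry, 4, reg)
    return bytes(entry)

def scan_operands(buf):
    """Walk a flat operand array at 40-byte stride, classifying entries
    by the type byte, as in the lines 690-1468 scanning loop."""
    vregs, physregs, regmask_offsets = [], [], []
    for off in range(0, len(buf), OPERAND_STRIDE):
        type_byte = buf[off]
        reg = struct.unpack_from("<I", buf, off + 4)[0]
        if type_byte == TYPE_VREG and reg & 0x80000000:
            vregs.append(reg & 0x7FFFFFFF)       # high bit set = virtual
        elif type_byte == TYPE_REGMASK:
            regmask_offsets.append(off)          # regmask: record pointer
        else:
            physregs.append(reg)                 # physical: mark reserved
    return vregs, physregs, regmask_offsets
```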

Interference Processing via sub_2F43DC0 (lines 714--955)

After operand scanning, the allocator calls sub_2F43DC0 (scanInterference) to populate the interference cache:

    scanInterference(this, VirtReg, &IntfCache);
    // IntfCache now contains register units that conflict.
    // Iterate the conflict list at this+1128/1136:

    for conflict in IntfCache.entries():
        if conflict.is_constrained:
            // Tied operand or early-clobber -- must try eviction
            result = tryEviction(conflict);              // sub_2F48CE0
        else:
            // Normal overlap -- try simple direct assignment first
            result = tryAssign(conflict);                // sub_2F47B00

        if result.success:
            record_assignment(result.phys_reg);
            break;
        // else: continue to next candidate

sub_2F43DC0 is the interference scanner. It walks the RegAllocMatrix (set up by sub_3501A90 during driver init) to find live range overlaps. For each physical register unit that overlaps the current virtual register's live range, it inserts an entry into the interference cache using the 37 * reg hash. The scanner distinguishes between two conflict types:

  • Constrained conflicts (tied operands, early-clobber, regmask kills) -- these route to sub_2F48CE0 (tryEviction), which attempts to evict the conflicting virtual register from its current assignment if the eviction cost is lower than the current candidate's spill weight.
  • Normal conflicts -- these route to sub_2F47B00 (tryAssign), which attempts a simple recoloring without eviction.

Additional helper functions participate in this phase:

Function | Role
sub_2F47200 | processConstrainedCopies -- handles operands where a COPY forced a specific register
sub_2F46530 | tryLastChanceRecoloring -- last-resort recoloring bounded by lcr-max-depth (default 5) and lcr-max-interf (default 8)
sub_2F46EE0 | rehashInterferenceTable -- grows/rehashes when load factor exceeds 75%
sub_2F424E0 | updateInterferenceCache -- inserts a newly discovered conflict
sub_2F42840 | markRegReserved -- marks a physical register as reserved in RegUnitStates

The tryLastChanceRecoloring path (sub_2F46530) is the most expensive fallback. It recursively attempts to reassign conflicting registers, up to lcr-max-depth levels deep and considering at most lcr-max-interf conflicting live ranges at each level. The exhaustive-register-search flag bypasses both cutoffs, trading compile time for allocation quality.
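The bounded recursion can be illustrated with a simplified Python model in which register classes are reduced to per-vreg allowed sets. Only the depth and per-level interference cutoffs mirror lcr-max-depth and lcr-max-interf; everything else is a hypothetical sketch, not recovered code.

```python
def recolor(vreg, allowed, assign, depth=0, max_depth=5, max_interf=8):
    """assign maps phys reg -> occupying vreg (or None); allowed maps
    vreg -> registers its class permits. Try a free permitted register,
    then recursively displace occupants, bounded by the cutoffs."""
    if depth >= max_depth:
        return None
    regs = allowed[vreg]
    # Trivial case: a permitted register is free.
    for r in regs:
        if assign.get(r) is None:
            assign[r] = vreg
            return r
    # Otherwise displace at most max_interf occupants and recurse.
    for r in regs[:max_interf]:
        occupant = assign[r]
        assign[r] = vreg                     # tentatively take r
        if recolor(occupant, allowed, assign, depth + 1,
                   max_depth, max_interf) is not None:
            return r
        assign[r] = occupant                 # roll back on failure
    return None
```

Recoloring pays off exactly when class constraints differ: a vreg restricted to one register can displace a neighbor that has alternatives.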

Copy Coalescing Hints -- Kinds 20 and 21 (lines 1060--1163)

During operand scanning, the allocator identifies COPY-like instructions by checking the operand kind field. Two kind values trigger coalescing hint recording:

    for op in VirtReg.operands(stride=40):
        match op.kind:
            20 =>                                       // direct COPY hint
                record_hint(op.source_reg, op.dest_reg);
            21 =>                                       // parent-chain COPY hint
                if op.flags[+44] & (1 << 4):           // "has parent" flag
                    // Walk up the parent live range chain
                    let parent = op.parent_LR;
                    while parent != null:
                        recordCoalescingHint(parent);   // sub_2F41240
                        parent = parent.parent;
                    // Coalescing opportunities tracked at this+832/840

Kind 20 represents a simple register-to-register COPY where the source and destination should ideally receive the same physical register. Kind 21 is more complex: it indicates a COPY from a split sub-range that has a parent live range. The has parent flag at byte +44 bit 4 triggers a chain walk via sub_2F41240 (recordCoalescingHint), which records each parent in a coalescing hint list at this+832/840. The hint list is later consumed by sub_2F434D0 (collectHintInfo) during the allocation priority computation, biasing the allocator toward assigning the same physical register to the entire chain.

This is standard LLVM coalescing hint infrastructure, but on NVPTX it interacts with the complete class separation: hints only apply within a single register class, since cross-class coalescing is impossible.
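A minimal sketch of the kind-21 chain walk. The LiveRange shape is hypothetical; only the walk-to-root behavior (record every ancestor, cf. sub_2F41240) is taken from the decompilation.

```python
class LiveRange:
    """Toy live range node with an optional parent (split sub-ranges
    point at the range they were split from)."""
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

def record_parent_chain_hints(op_lr, hints):
    """Kind-21 handling: walk up the parent chain, recording a
    coalescing hint for every ancestor."""
    lr = op_lr.parent
    while lr is not None:
        hints.append(lr.name)
        lr = lr.parent
```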

Virtual Register Assignment (lines 1005--1368)

After interference processing and copy hint collection, the function enters the main assignment loop:

    for vreg in unassigned_vregs:
        // Check the live-through bitvector at this+736
        if is_live_through(vreg, this.LiveThrough):
            // This vreg is live across the entire region -- expensive
            result = tryLastChanceRecoloring(vreg);     // sub_2F46530
        else:
            result = tryAssignFromHints(vreg);

        if result.success:
            recordAssignment(result);                    // sub_2F42240
            refresh_operand_list();                      // re-scan
        else:
            // Allocation failed for this vreg -- proceed to splitting
            add_to_split_worklist(vreg);

The live-through bitvector at this+736 is the key data structure for this phase. A set bit indicates that the register unit is live from the beginning to the end of the current region, making it the hardest case for the allocator because there is no gap in which to insert a split point. These live-through ranges go directly to last-chance recoloring.
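The packed layout (one bit per register unit, 64-bit words, as in the initialization pseudocode above) behaves like this minimal Python sketch:

```python
def make_live_through(num_units):
    # One bit per register unit, packed into 64-bit words (this+736):
    # (NumRegs + 63) / 64 words, zero-initialized.
    return [0] * ((num_units + 63) // 64)

def set_live_through(bv, unit):
    bv[unit >> 6] |= 1 << (unit & 63)

def is_live_through(bv, unit):
    return (bv[unit >> 6] >> (unit & 63)) & 1 == 1
```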

Cleanup (lines 2192--2313)

The function releases the RegUnitStates array, clears the interference cache, frees the live-through bitvector, and returns 1 on success (physical register assigned) or 0 on failure (must spill).

Live Range Splitting -- Detailed Algorithm

The splitting engine (sub_2F2D9F0, 93KB, 2,339 lines) implements RAGreedy::splitAroundRegion with SplitAnalysis and SplitEditor integration. This is the largest single function in the register allocation cluster.

Segment Enumeration (40-byte stride, gap/sub-range flags)

The splitting engine iterates the live range's segment linked list using the same 40-byte stride as the operand scanner. Two flag bits in the segment header control splitting decisions:

Flag | Location | Meaning
Gap flag | bit 2 of byte[0] | Segment has a gap before it (potential split point)
Sub-range flag | bit 3 of byte[44] | Segment is a sub-range of a larger interval

fn splitAroundRegion(this: &mut SplitEditor, MF: &MachineFunction) {
    let SubTarget = MF.vtable[+128];
    let TRI       = SubTarget.vtable[+200];

    // Per-region loop -- worklist at this+320
    for region in this.worklist:

        // (a) Hash table init -- 16-byte entries per tracked register
        clear_and_resize(this.region_hash, initial_cap=16);

        // (b) Segment enumeration
        let seg = region.first_segment;
        while seg != null:
            let is_gap      = (seg[0] >> 2) & 1;       // bit 2 of byte[0]
            let is_subrange = (seg[44] >> 3) & 1;       // bit 3 of byte[44]

            if is_gap:
                // Potential split point -- record in visit set
                record_gap(seg, this.visit_set);         // sub_C8CC70

            if is_subrange:
                // Chain through sub-ranges
                process_subranges(seg);

            seg = seg.next;                              // stride = 40 bytes

The gap flag is the primary signal for split point selection. When the allocator detects a gap between two live segments, it can insert a split there without introducing a new spill -- the value is simply not live during the gap, so the split editor can create two separate live ranges that each get a different physical register. The sub-range flag indicates that the segment belongs to a sub-register lane (e.g., the low half of an Int64Regs value), which requires special handling to avoid breaking the lane structure.
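Gap-based split-point selection reduces to finding dead zones between consecutive live segments; a simplified sketch on (start, end) pairs (the representation is illustrative, not the binary's segment layout):

```python
def find_split_points(segments):
    """segments: sorted, non-overlapping (start, end) pairs. A gap
    between consecutive segments is a free split point: the value is
    dead there, so splitting introduces no new spill."""
    points = []
    for (s0, e0), (s1, e1) in zip(segments, segments[1:]):
        if s1 > e0:                  # dead zone between e0 and s1
            points.append((e0, s1))
    return points
```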

Copy Hint Detection and Local Splitting

For COPY instructions (kind values 68 and 0), the splitter extracts register pairs and builds a conflict set:

        // (c) Copy hint detection
        for inst in region.instructions:
            if inst.kind == 68 || inst.kind == 0:       // COPY variants
                let (src, dst) = extract_reg_pair(inst); // operands at +32, stride 40, reg at +8
                conflict_set.insert(src);
                conflict_set.insert(dst);

                // Try local split first
                if tryLocalSplit(conflict_set):          // sub_2F2A2A0
                    // Success -- materialize the new segments
                    materializeSplitSegment();            // sub_2FDF330
                    continue;

sub_2F2A2A0 (tryLocalSplit) attempts a low-cost split within a single basic block. On success, sub_2FDF330 inserts the new split segments into the live interval data structure. The result entries from a local split use a 24-byte stride, where byte +16 is a quality flag and dwords at +8/+12 are the start/end positions of the split segment.

Interference Analysis for Non-COPY Segments

For non-COPY segments, the splitting engine performs interference analysis using regmasks:

        // (d) Interference analysis (lines 785-914)
        for seg in region.non_copy_segments:
            for op in seg.operands:
                if op.is_def && op.flags[+3] & (1 << 4):
                    check_def_interference(op);          // sub_2F28E80

                // Regmask check -- type 12 operands
                if op.type_byte == 12:
                    for entry in region_hash:
                        if bittest(op.mask_data[+24], entry.reg):
                            // Register killed by mask -- tombstone it
                            tombstone(entry);            // set to -2
The _bittest operation on regmask data at offset +24 identifies which registers are killed by call clobber masks. Killed entries are tombstoned in the tracking hash table (sentinel value -2), removing them from further consideration.

Coalescing and Reassignment Dispatch

The splitting engine dispatches through vtable offsets for coalescing:

        // (e) Coalescing / reassignment (lines 917-999)
        if vtable[1064](this, region):                   // tryReassign
            markRegUsed(result_reg);                     // sub_2E88E20
            goto DONE;

        if vtable[1072](this, region):                   // canRecolorVirtReg
            markRegUsed(result_reg);                     // sub_2E88E20
            goto DONE;

        // Also try alternative local split via vtable[480]
        vtable[480](this, region, &SmallVectorArgs);

The vtable-indirect calls at offsets [1064] and [1072] correspond to tryReassign and canRecolorVirtReg in upstream LLVM. The offset [480] call is a fallback local split strategy. On success, sub_2E88E20 (markRegUsed) updates the allocation state.

Register Pressure and the -maxreg Constraint

The real allocation constraint on NVPTX is not register scarcity but register pressure -- higher per-thread register usage reduces occupancy, directly impacting throughput through fewer warps available for latency hiding. The -maxreg CLI flag (parsed at sub_900130, stored at compilation context offset +1192) caps the total live register count. Duplicate -maxreg definitions produce the error: "libnvvm : error: -maxreg defined more than once" (sub_9624D0).

Concrete Occupancy Examples

The occupancy formula and cliff table are documented in the GPU Execution Model. Here the relevant values are shown for the -maxreg settings that the allocator targets:

-maxreg | Regs/Warp | Warps (SM 8.0) | Occupancy | Warps (SM 9.0) | Occupancy
32 | 1,024 | 64 | 100% | 48 | 100%
64 | 2,048 | 32 | 50% | 32 | 67%
96 | 3,072 | 21 | 33% | 21 | 44%
128 | 4,096 | 16 | 25% | 16 | 33%
192 | 6,144 | 10 | 16% | 10 | 21%
255 | 8,160 | 8 | 13% | 8 | 17%

The -maxreg flag sets the ceiling, and the remat infrastructure aggressively reduces pressure below the nearest cliff to avoid losing an entire warp slot.
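The table rows follow from a simple formula, assuming a 64K-register file per SM and the per-SM warp limits the table implies (64 for SM 8.0, 48 for SM 9.0); both constants are assumptions for this sketch, not recovered from the binary:

```python
REGFILE = 65536                          # assumed: 64K 32-bit regs per SM
MAX_WARPS = {"sm_80": 64, "sm_90": 48}   # warp slots implied by the table

def occupancy(maxreg, sm):
    """Resident warps per SM and occupancy % for a given -maxreg cap."""
    regs_per_warp = maxreg * 32          # 32 threads per warp
    warps = min(MAX_WARPS[sm], REGFILE // regs_per_warp)
    pct = int(100 * warps / MAX_WARPS[sm] + 0.5)   # round half up
    return warps, pct
```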

The remat-for-occ knob (default 120) encodes an occupancy target. When set, the IR-level rematerialization pass (sub_1CE7DD0) calls sub_1C01730 to compute an occupancy-based register target. The heuristic applies a scale factor: if the computed occupancy level exceeds 4, it multiplies the target by 3/2 (effectively allowing more registers when occupancy is already high). If the result still exceeds the ceiling, it applies target = 2*target/3 as a tighter bound.
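As described, the heuristic is a two-step adjustment; a direct Python transcription (the function and parameter names are hypothetical, only the 3/2 and 2/3 factors and the level-4 threshold are from the analysis):

```python
def occupancy_register_target(base_target, occ_level, ceiling):
    """Two-step adjustment attributed to sub_1C01730: scale up by 3/2
    when the occupancy level exceeds 4, then clamp with 2*target/3 if
    the result still exceeds the ceiling."""
    target = base_target
    if occ_level > 4:
        target = target * 3 // 2     # allow more regs at high occupancy
    if target > ceiling:
        target = 2 * target // 3     # tighter bound when over the ceiling
    return target
```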

ptxas Register Allocation Knobs

In addition to cicc's LLVM-side allocator, ptxas has its own register allocation stage with 72+ dedicated knobs. These are independent of the LLVM greedy allocator and operate on the ptxas-internal IR after PTX parsing:

ptxas Knob | Description
RegAllocRematEnable | Enable ptxas-level rematerialization
RegAllocEnableOptimizedRemat | Use optimized remat algorithm
RegAllocSpillForceXBlockHoistRefill | Force cross-block spill hoist/refill
RegAllocSpillValidateDebug | Validate spill code in debug builds
RegAllocDebugConflictDetails | Print conflict details during allocation
RegAllocPrintDetails | Print allocation decisions
RegAllocPerfDiffBackoff | Back off allocation when perf difference is small
RegAllocPerfDiffBackoffBegin/End | Range for perf backoff
CTAReconfigMaxRegAlloc | Max registers for CTA reconfiguration
MaxRegsForMaxWarp | Register ceiling for maximum warp occupancy
RegTgtSelHigherWarpCntHeur | Heuristic favoring higher warp count
RegTgtSelLowerWarpCntHeur | Heuristic favoring lower warp count
CommonCrossBlockRegLimit | Cross-block register usage limit
DisableHMMARegAllocWar | Disable HMMA register allocation workaround

These ptxas knobs are accessed via nvcc -Xptxas "--knob KnobName=Value". The MaxRegsForMaxWarp and RegTgtSel* knobs directly implement the occupancy-aware allocation strategy at the ptxas level, complementing cicc's -maxreg ceiling.

NVIDIA Rematerialization Knobs (cicc)

NVIDIA provides an extensive set of custom rematerialization knobs to reduce pressure below the target threshold:

Knob | Default | Description
nv-remat-default-max-reg | 70 | Default maximum register target
nv-remat-max-times | 10 | Max rematerialization iterations
nv-remat-block-single-cost | 10 | Single live pull-in cost limit
nv-remat-block-max-cost | 100 | Max clone cost for reducing one live
nv-remat-block-loop-cost-factor | 20 | Loop body cost scaling factor
nv-remat-block-liveout-min-percentage | 70 | Minimum live-out percentage for block remat
nv-remat-block-map-size-limit | 6 | Map size limit for block-level remat
nv-remat-block-load-cost | 10 | Load cost in Remat Machine Block
nv-remat-threshold-for-spec-reg | 20 | Threshold for special register remat
load-remat | (flag) | Enable load rematerialization
no-mi-remat | (flag) | Disable MI remat for specific functions

The greedy allocator itself has additional tuning knobs:

Knob | Default | Description
split-spill-mode | 1 | 0=default, 1=size, 2=speed
lcr-max-depth | 5 | Last chance recoloring max depth
lcr-max-interf | 8 | Last chance recoloring max interferences
exhaustive-register-search | (flag) | Bypass LCR depth/interference cutoffs
enable-deferred-spilling | (flag) | Defer spill code to end of allocation
grow-region-complexity-budget | 10000 | growRegion() edge budget
split-threshold-for-reg-with-hint | 75 | Split threshold percentage

Additional rematerialization knobs registered separately include do-remat (default 3), remat-maxreg-ceiling (default 0), remat-single-cost-limit (default 6000), remat-loop-trip (default 20), and remat-for-occ (default 120, targeting higher occupancy).

Spill Cost Computation

Spill costs are computed during driver initialization by sub_2FAD5E0 (step 5 of the driver sequence), which calculates VirtRegAuxInfo spill weights for every virtual register before the main allocation loop begins. The spill weight determines priority in the allocation queue and eviction decisions.

On NVPTX, "spilling" is a misnomer because PTX has no stack spill in the traditional CPU sense -- a spilled value either gets rematerialized (re-computed from inputs) or written to local memory (per-thread DRAM-backed memory, orders of magnitude slower than registers). The cost model therefore heavily penalizes local memory spills and strongly favors rematerialization.
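The bias toward rematerialization falls out of the latency asymmetry; a deliberately crude cost comparison (all constants are assumptions drawn from the 200-800 cycle local memory figure cited below, not values recovered from the binary):

```python
LOCAL_MEM_LATENCY = 500      # cycles; mid-range of the 200-800 figure
REG_OP_LATENCY = 4           # cycles per simple ALU recompute step

def cheaper_to_remat(recompute_insts, spill_uses):
    """Spilling pays DRAM-backed latency at the store plus every
    reload; remat pays a few ALU ops per use. Returns True when
    recomputing the value is the cheaper option."""
    spill_cost = LOCAL_MEM_LATENCY * (1 + spill_uses)   # store + reloads
    remat_cost = REG_OP_LATENCY * recompute_insts * spill_uses
    return remat_cost < spill_cost
```

Under these numbers, even a chain of dozens of recompute instructions beats a single spill/reload pair, which is why remat candidates get near-zero spill weight.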

The PriorityAdvisor (looked up via global dword_5023AC8) determines the order in which virtual registers enter the allocation queue. The EvictionAdvisor (looked up via dword_5023BA8) determines when to evict a lower-priority register to make room for a higher-priority one. Both advisors are initialized via vtable [24] calls during driver setup and can be customized via the regalloc-evict and regalloc-priority analysis passes registered in the pipeline parser.

Allocation Failure Handler (sub_2F418E0) -- Three Error Paths

When physical register assignment fails (sub_2F418E0), three error paths exist:

Path 1: Empty Allocation Order

"no registers from class available to allocate"

The register class has zero allocatable registers. This can happen for the internal-only class (off_4A026E0) if the target configuration excludes all environment registers. Diagnostic emitted via sub_B6EB20 (DiagnosticHandler).

Path 2: All Registers Occupied

"ran out of registers during register allocation"

The allocation order exists but all registers are occupied/interfering. This fires when the eviction/split pipeline exhausts all options -- the sequence is: tryAssign -> tryEviction -> tryLastChanceRecoloring -> trySplit -> fail. Uses sub_B2BE50 for source location, sub_B157E0 for DebugLoc, and sub_B158E0 for diagnostic formatting.

Path 3: Inline Assembly Overflow

"inline assembly requires more registers than available"

Special handling for inline asm operands (kind values 1--2 at offset +68). Inline assembly can specify explicit register constraints that consume all available registers in a class, leaving nothing for surrounding code.

FailedRegAlloc Flag

All three paths set the FailedRegAlloc flag (bit 10 in MachineFunction properties, sub_2E78A80). This flag allows downstream passes to handle the failure gracefully rather than crashing. Passes that check this flag can skip optimization or emit degraded but correct code.
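As a property bit, the flag is trivial to model:

```python
FAILED_REGALLOC = 1 << 10    # bit 10 of the MachineFunction properties word

def set_failed(props):
    return props | FAILED_REGALLOC

def has_failed(props):
    return bool(props & FAILED_REGALLOC)
```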

The RAGreedy Driver

The top-level driver (sub_2F5A640) orchestrates the full allocation pass:

  1. Store MachineFunction at a1[96], retrieve SubTarget (vtable +128).
  2. Optional debug dump: "Before greedy register allocator".
  3. sub_35B4B20 -- calculate register class info.
  4. sub_2F55040 -- check if any virtual registers need allocation.
  5. sub_2FAD5E0 -- setup spill costs.
  6. sub_2F54D60 -- compute live intervals.
  7. Query vtable +328 for getRegPressureSetLimit (stored at a1[3633]).
  8. Look up EvictionAdvisor (dword_5023BA8) and PriorityAdvisor (dword_5023AC8) via std::map lookups.
  9. Initialize advisors via vtable [24].
  10. Allocate InterferenceCache (0x2C0 bytes, sub_2FB0E40).
  11. Allocate SplitAnalysis (0x738 bytes, sub_2FB1ED0).
  12. sub_3501A90 -- setup RegAllocMatrix.
  13. Initialize PhysRegEntries array (32 entries, 144-byte stride).
  14. sub_2F55730 -- reset priority queue.
  15. sub_35B5380 -- seed queue from virtual registers.
  16. sub_2F58C00 -- main allocation loop.
  17. Optional debug dump: "Before post optimization".
  18. Post-allocation optimization via vtable [24].
  19. sub_2F5A580, sub_2F50510 -- finalize.

Differences from Upstream LLVM

The following table summarizes where CICC's register allocator diverges from upstream LLVM 20.0.0 RAGreedy:

Aspect | Upstream LLVM 20 | CICC v13.0
Primary constraint | Fixed physical register set (CPU ISA-defined) | Pressure ceiling via -maxreg; no fixed physical registers
Register classes | Often overlapping (e.g., GR32 is a subset of GR64 on x86) | 9 completely disjoint classes; no cross-class interference
Spill destination | Stack frame (cheap, L1/L2 latency) | Local memory (DRAM-backed, 100x+ latency) or rematerialization
Rematerialization | LLVM built-in MachineInstr::isRematerializable() | Massive custom infrastructure: 11+ nv-remat-* knobs, separate IR-level remat pass (sub_1CE7DD0), iterative pressure reduction loop
Occupancy awareness | None -- CPU has no occupancy concept | remat-for-occ (default 120) drives occupancy-targeted register reduction; MaxRegsForMaxWarp ptxas knob
Interference cache hash | Standard LLVM DenseMap with (ptr >> 4) ^ (ptr >> 9) | Custom open-addressing map with 37 * reg hash, -1/-2 sentinels
Operand stride | 32 bytes (MachineOperand size) | 40 bytes (8-byte NVPTX extension for class tag + flags)
Dual pass manager | Single implementation used by both old and new PM | Two complete copies: Instance A at 0x1EC0400, Instance B at 0x2F4C2E0
Register encoding | LLVM MCRegister (16-bit class + index) | 32-bit: 4-bit class tag in [31:28], 28-bit index in [27:0]
Spill weight formula | length / (spill_cost * block_freq) | Same formula, but cost model penalizes local memory heavily; rematerialization candidates get near-zero weight
Last-chance recoloring | Same knobs, but rarely critical | Frequently exercised due to tight -maxreg ceilings; exhaustive-register-search flag more relevant
Post-RA remat | Minimal | ptxas performs a second register allocation with its own 72+ knobs (RegAllocRematEnable, etc.)
Splitting strategy | Region-based splitting (splitAroundRegion) | Same algorithm, but gap flag (bit 2) and sub-range flag (bit 3) in 40-byte segment entries use NVPTX-specific encoding
Callee-saved registers | CSR-first-time-cost matters for ABI compliance | NVPTX has no callee-saved convention; regalloc-csr-first-time-cost is effectively dead code
Debug strings | "Before greedy register allocator" | Same string, but emitted conditionally on unk_503FCFD (a debug flag at a fixed BSS address)
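The 32-bit register encoding (4-bit class tag, 28-bit index) can be sketched directly; helper names here are illustrative:

```python
CLASS_SHIFT, INDEX_MASK = 28, 0x0FFFFFFF

def encode_reg(class_tag, index):
    """4-bit class tag in bits [31:28], 28-bit index in [27:0]."""
    assert 0 <= class_tag < 16 and 0 <= index <= INDEX_MASK
    return (class_tag << CLASS_SHIFT) | index

def decode_reg(reg):
    return reg >> CLASS_SHIFT, reg & INDEX_MASK
```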

What Upstream LLVM Gets Wrong for GPU

Upstream LLVM's register allocation framework was designed for CPU targets where the register file is a fixed, small, physically-interfering resource. Every core assumption breaks on NVPTX:

  • Upstream assumes spills are cheap (L1/L2 latency). On x86/AArch64, a spill is a store to the stack frame backed by L1 cache (3-5 cycles). On GPU, a "spill" writes to local memory backed by device DRAM at 200-800 cycle latency. This 40-160x penalty makes rematerialization nearly always preferable to spilling, which is why NVIDIA ships 11+ custom nv-remat-* knobs and an iterative remat loop that has no upstream equivalent.
  • Upstream assumes a fixed physical register set with cross-class interference. CPU ISAs have a static register file (e.g., 16 GPRs on x86-64) where GR32 is a sub-register of GR64 and allocating one constrains the other. NVPTX has no fixed register count and its nine register classes are completely disjoint -- allocating %r5 (Int32Regs) never conflicts with %f5 (Float32Regs). The entire interference-graph framework is solving the wrong problem.
  • Upstream has no concept of occupancy. CPU register allocation never reduces parallelism -- a function uses N registers and that is the end of the story. On GPU, every additional register per thread can cross an occupancy cliff, losing an entire warp group and halving throughput. The allocator must minimize pressure to a target, not just avoid running out of registers.
  • Upstream assumes one allocation pass produces the final assignment. On CPU, LLVM's greedy RA emits final machine code. On NVPTX, cicc's allocator emits PTX with virtual registers bounded by -maxreg, and then ptxas performs an entirely separate second allocation pass with its own 72+ knobs to map virtual PTX registers to hardware resources. The LLVM allocator is half the pipeline, not the whole thing.
  • Upstream's callee-saved register convention is irrelevant. CPU ABIs define callee-saved sets (e.g., rbx, rbp on SysV x86-64) that the allocator must respect. NVPTX has no callee-saved convention at all -- there is no hardware call stack for registers. The regalloc-csr-first-time-cost knob is dead code on this target.

Common Pitfalls

These are mistakes a reimplementor is likely to make when building a register allocator for an NVPTX-like GPU target.

1. Treating register allocation as an assignment problem instead of a pressure problem. On CPU targets, the allocator must map N virtual registers to K physical registers, and the problem is coloring a fixed interference graph. On NVPTX, there is no fixed physical register file -- PTX registers are virtual and unlimited. The real constraint is the -maxreg ceiling, which controls occupancy. A reimplementation that tries to assign physical registers will produce correct but meaningless output; the correct approach is to minimize peak live register count below the -maxreg threshold, and let ptxas handle the final hardware mapping.
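The pressure view can be made concrete: compute the peak number of simultaneously live values over a linear block and compare it against the -maxreg ceiling. This is a standard sweep-line sketch, not recovered code.

```python
def peak_pressure(intervals):
    """Peak live value count over a linear block -- the quantity an
    NVPTX allocator must keep below -maxreg, rather than a physical
    assignment. intervals: (start, end) pairs, end exclusive."""
    events = []
    for s, e in intervals:
        events.append((s, 1))        # value becomes live
        events.append((e, -1))       # value dies
    live = peak = 0
    # Sorting puts deaths (-1) before births (+1) at the same point,
    # matching the end-exclusive convention.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak
```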

2. Ignoring occupancy cliffs when setting the register target. Going from 64 to 65 registers per thread crosses an occupancy cliff on SM 8.0: the 64-register tier runs 32 warps at 50% occupancy, while the next tier (96 registers) runs only 21 warps at 33%. A reimplementation that treats the register ceiling as a hard binary constraint (under = good, over = bad) will miss the fact that reducing from 65 to 64 is worth enormous effort (it recovers an entire occupancy tier), while reducing from 63 to 62 is nearly worthless. The remat-for-occ knob (default 120) exists specifically to drive rematerialization toward the nearest cliff boundary, not just toward the ceiling.

3. Using CPU-calibrated spill costs. On x86, a spill is a store to L1-cached stack memory at 3-5 cycle latency. On GPU, a "spill" writes to per-thread local memory backed by device DRAM at 200-800 cycle latency -- a 40-160x penalty. A reimplementation that uses upstream LLVM's default spill cost formula without recalibrating for GPU memory latency will spill aggressively when it should rematerialize. NVIDIA's 11+ nv-remat-* knobs and the iterative rematerialization loop exist because rematerialization is almost always cheaper than spilling on GPU.

4. Assuming cross-class register interference exists. NVPTX's nine register classes are completely disjoint: Int32Regs (%r) never conflicts with Float32Regs (%f), Int64Regs (%rd) never conflicts with Float64Regs (%fd), and so on. A reimplementation that builds a global interference graph spanning all classes will waste significant compile time computing interference relationships that are always empty. The correct approach is per-class allocation with independent pressure tracking.

5. Forgetting that cicc's allocation is only half the pipeline. The LLVM greedy allocator in cicc emits PTX with virtual registers bounded by -maxreg. Then ptxas performs an entirely separate second allocation pass with its own 72+ knobs to map virtual PTX registers to hardware resources. A reimplementation that tries to produce final hardware register assignments at the LLVM level is solving the wrong problem -- the output should be well-pressure-managed virtual registers, not hardware assignments.

Diagnostic Strings

Diagnostic strings recovered from the register allocation binary region (p2c.5-01-register-alloc.txt) and the rematerialization passes (p2b.2-01-remat-ir.txt, p2b.2-02-remat-machine.txt).

Allocation Failure Diagnostics

String | Source | Category | Trigger
"no registers from class available to allocate" | sub_2F418E0 path 1 | Error | Register class has zero allocatable registers; emitted via sub_B6EB20 (DiagnosticHandler)
"ran out of registers during register allocation" | sub_2F418E0 path 2 | Error | All registers occupied/interfering after tryAssign -> tryEviction -> tryLastChanceRecoloring -> trySplit exhausted
"inline assembly requires more registers than available" | sub_2F418E0 path 3 | Error | Inline asm explicit register constraints consume all available registers in a class
"libnvvm : error: -maxreg defined more than once" | sub_9624D0 | Error | Duplicate -maxreg CLI flag definitions

Debug/Trace Diagnostics

String | Source | Category | Trigger
"Before greedy register allocator" | sub_2F5A640 step 2 | Debug | Conditional on unk_503FCFD debug flag
"Before post optimization" | sub_2F5A640 step 17 | Debug | Post-allocation debug dump
"Before register coalescing" | sub_2F60C50 | Debug | Register coalescer debug dump
"After register coalescing" | sub_2F60C50 | Debug | Register coalescer debug dump

Rematerialization Diagnostics (nv-remat-block)

String | Source | Category | Trigger
"Skip machine-instruction rematerialization on <name>" | sub_1CE7DD0 region | Debug | Function name matches no-mi-remat skip list
"Max-Live-Function(<num_blocks>) = <max_live>" | remat-block step 10 | Debug | Reports maximum live register count across all blocks
"live-out = <count>" | remat-block step 7 | Debug | Per-block live-out register count
"Pullable: <count>" | remat-block step 5 | Debug | Number of pullable (rematerializable) instructions
"Total Pullable before considering cost: <count>" | remat-block step 8 | Debug | Total pullable candidates before cost filtering
"Really Final Pull-in: <count> (<total_cost>)" | remat-block step 11 | Debug | Final rematerialization candidate count and total cost
"After pre-check, <N> good candidates, <M> given second-chance" | remat two-phase selection | Debug | Two-phase candidate selection with second-chance
"ADD <N> candidates from second-chance" | remat two-phase selection | Debug | Candidates recovered from second-chance pass
"\treplaced" | remat code emission | Debug | Rematerialized instruction replacement confirmation

Pass Registration Strings

| String | Source |
|---|---|
| "Greedy Register Allocator" | Pass name for both Instance A (0x1EC0400) and Instance B (0x2F4C2E0) |
| "Register Coalescer" | sub_2F60C50 pass registration |
| "nv-remat-block" | ctor_361_0 at 0x5108E0 -- machine-level remat pass registration |
| "Legacy IR Remat" | sub_1CE7DD0 region -- IR-level remat pass display name |
| "nvvmrematerialize" | IR-level remat pass pipeline ID |

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| RAGreedy::runOnMachineFunction | sub_2F5A640 | -- | Top-level driver (466 lines) |
| RAGreedy::selectOrSplit | sub_2F49070 | -- | Core allocator (82KB, 2,314 lines) |
| selectOrSplit thunk | sub_2F4BAF0 | -- | Redirects to sub_2F49070(this+200) |
| selectOrSplit + SplitEditor | sub_2F4BB00 | -- | Spill-or-split path |
| SplitEditor::splitAroundRegion | sub_2F2D9F0 | -- | Live range splitting (93KB) |
| tryLocalSplit | sub_2F2A2A0 | -- | Local split within single BB |
| materializeSplitSegment | sub_2FDF330 | -- | Insert split segments |
| scanInterference | sub_2F43DC0 | -- | Populate interference cache |
| tryAssign | sub_2F47B00 | -- | Simple assignment path |
| tryEviction | sub_2F48CE0 | -- | Evict conflicting VReg |
| tryLastChanceRecoloring | sub_2F46530 | -- | Recursive recoloring fallback |
| processConstrainedCopies | sub_2F47200 | -- | Handle tied-operand COPYs |
| rehashInterferenceTable | sub_2F46EE0 | -- | Interference cache rehash |
| rehashCoalescingTable | sub_2F46A90 | -- | Coalescing hint table rehash |
| markRegReserved | sub_2F42840 | -- | Mark unit as reserved |
| recordAssignment | sub_2F42240 | -- | Record successful assignment |
| updateInterferenceCache | sub_2F424E0 | -- | Insert conflict entry |
| recordCoalescingHint | sub_2F41240 | -- | Record parent-chain hint |
| collectHintInfo | sub_2F434D0 | -- | Gather all hints for priority |
| assignRegFromClass | sub_2F418E0 | -- | Allocation failure handler |
| hasVRegsToAllocate | sub_2F55040 | -- | Pre-flight check |
| computeLiveIntervals | sub_2F54D60 | -- | Build live interval data |
| resetPriorityQueue | sub_2F55730 | -- | Clear and re-init queue |
| mainAllocationLoop | sub_2F58C00 | -- | Per-VReg dispatch loop |
| finalize | sub_2F50510 | -- | Post-allocation cleanup |
| setupSpillCosts | sub_2FAD5E0 | -- | Compute VirtRegAuxInfo weights |
| InterferenceCache::init | sub_2FB0E40 | -- | Allocate 0x2C0-byte cache |
| SplitAnalysis::init | sub_2FB1ED0 | -- | Allocate 0x738-byte analysis |
| setupRegAllocMatrix | sub_3501A90 | -- | Build the global interference matrix |
| calculateRegClassInfo | sub_35B4B20 | -- | Pre-compute class sizes/orders |
| seedQueueFromVRegs | sub_35B5380 | -- | Initial queue population |
| RegisterCoalescer::runOnMachineFunction | sub_2F71140 | -- | Register coalescing (80KB) |
| printMachineProperties | sub_2E78A80 | -- | Includes FailedRegAlloc flag |
| encodeVirtualReg | sub_21583D0 | -- | `CLASS_BITS \ |
| emitCopyInstruction | sub_2162350 | -- | Class-specific copy opcodes |

Reimplementation Checklist

  1. Pressure-driven allocation model. Replace the standard assignment-to-physical-registers model with a pressure-tracking model: PTX registers are virtual, so the allocator must track and bound total live register count per class against the -maxreg ceiling (default 70) rather than assigning to a finite physical register set.
  2. Nine disjoint register classes. Define the nine NVPTX register classes (Int1Regs, Int16Regs, Int32Regs, Int64Regs, Float32Regs, Float64Regs, Int16HalfRegs, Int32HalfRegs, Int128Regs) with complete cross-class disjointness -- no interference between classes, class-specific copy opcodes, and per-class pressure tracking.
  3. Greedy selectOrSplit with NVPTX adaptations. Implement the core allocation loop: per-unit RegUnitStates array (free/interfering/reserved), interference cache with 37 * reg hash, 40-byte-stride operand scanning, copy coalescing hints (kinds 20/21), and live-through bitvector for detecting worst-case live ranges.
  4. Live range splitting with SplitKit. Implement splitAroundRegion (93KB equivalent): identify split points at block boundaries and within blocks, create sub-ranges with new virtual registers, insert copies at split points, and update the interference cache.
  5. Eviction and last-chance recoloring. Implement tryEviction (compare spill weights to decide whether evicting a conflicting VReg is cheaper) and tryLastChanceRecoloring (recursive reassignment bounded by lcr-max-depth=5 and lcr-max-interf=8).
  6. Occupancy-aware spill cost computation. Weight spill costs by occupancy impact: spills to local memory (device DRAM, 200--800 cycle latency) must account for the GPU-specific penalty, and the register ceiling must respect occupancy cliff boundaries.
  7. Dual pass manager instances. Register the allocator for both legacy and new pass managers, ensuring both instances share the same NVPTX-specific hooks (custom rematerialization interaction, pressure-driven priority queues, maxreg enforcement).
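Item 1's pressure model can be sketched in a few lines. This is an illustrative reconstruction, not cicc code: the interval tuple format, the dictionary of class names, and the `max_pressure_per_class` helper are all assumptions for the example; only the per-class disjointness and the -maxreg ceiling come from the text above.

```python
# Hypothetical sketch of per-class pressure tracking against -maxreg.
# Each interval is (vreg, class, start, end); classes never interfere.

MAXREG = 70  # default -maxreg ceiling

def max_pressure_per_class(intervals):
    """Return the peak simultaneous live count for each register class."""
    events = {}  # class -> list of (point, +1/-1) live-range events
    for _vreg, cls, start, end in intervals:
        events.setdefault(cls, []).extend([(start, 1), (end, -1)])
    peaks = {}
    for cls, evs in events.items():
        live = peak = 0
        for _, delta in sorted(evs):  # ends sort before starts at a point
            live += delta
            peak = max(peak, live)
        peaks[cls] = peak
    return peaks

intervals = [
    ("v0", "Int32Regs", 0, 10),
    ("v1", "Int32Regs", 5, 15),
    ("v2", "Float32Regs", 0, 20),  # never interferes with Int32Regs
]
peaks = max_pressure_per_class(intervals)
assert peaks == {"Int32Regs": 2, "Float32Regs": 1}
assert all(p <= MAXREG for p in peaks.values())
```

The point of the sketch is that each class gets its own peak: an allocator that bounds `peaks[cls]` per class against the ceiling never needs a cross-class interference graph.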

Architectural Uniqueness

NVPTX's register allocation differs from all other LLVM targets in several fundamental ways:

  • Unlimited virtual registers: PTX has no fixed register count. The allocator manages pressure, not assignment to a finite set of physical registers.
  • Complete class separation: The nine register classes are fully disjoint. An Int32Regs allocation never conflicts with a Float32Regs allocation.
  • Pressure as the primary constraint: The -maxreg ceiling and NVIDIA's custom rematerialization infrastructure (nv-remat-* knobs) exist specifically to control occupancy, which has no equivalent in CPU register allocation.
  • Two-stage allocation: cicc performs LLVM greedy RA to emit PTX with virtual registers bounded by -maxreg, then ptxas performs a second allocation pass with its own 72+ knobs to map virtual PTX registers to hardware resources.
  • Dual implementation: Two complete RA copies exist (old at 0x1E*--0x1F*, new at 0x2F*--0x35*), one per pass manager generation.

ptxas Interaction

Register allocation in cicc is the first of two allocation stages. cicc's greedy RA assigns virtual PTX registers (%r0, %f3, etc.) bounded by the -maxreg ceiling to control occupancy, but these are not hardware registers -- they are symbolic names in the PTX text. ptxas then performs its own complete register allocation pass, mapping cicc's virtual registers onto the SM's physical register file (e.g., 255 32-bit registers per thread on SM 80+). ptxas has 72+ RA-related knobs (RegAllocScheme, DynamicRegAlloc, RegUsageOpt, etc.) and may split, coalesce, or spill registers differently than cicc anticipated. The -maxreg value cicc enforces serves as a hint to ptxas about the desired occupancy target, but ptxas makes the final hardware binding decision.

PrologEpilogInserter & Frame Layout

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

NVIDIA GPUs have no hardware stack pointer. There is no push, no pop, no %rsp — the entire concept of a "stack frame" is a compiler fiction. When a CUDA kernel needs local storage (spill slots, alloca, local arrays), cicc allocates a byte array called __local_depot in PTX .local address space and computes all offsets at compile time. The PrologEpilogInserter (PEI) pass is responsible for this: it takes abstract MachineFrameInfo frame indices produced by register allocation and earlier lowering, assigns concrete byte offsets within the depot, emits the two-instruction prologue that sets up the %SP/%SPL pseudo-registers, and rewrites every frame-index operand in the MachineFunction to [%SP + offset] form. At 68 KB and ~2,400 decompiled lines, cicc's PEI is a heavily modified monolith — the upstream open-source NVPTX backend replaces LLVM's standard PEI with a stripped-down 280-line NVPTXPrologEpilogPass that handles only offset calculation and frame-index elimination. cicc restores and extends nearly all of the standard PEI's functionality: callee-saved register handling, register scavenging, bitmap-based frame packing, categorized layout ordering, and a stack-size diagnostic system.

| Property | Value |
|---|---|
| Binary address | sub_35B1110 (0x35B1110) |
| Binary size | 68,332 bytes (~2,388 decompiled lines) |
| Pass identity | PrologEpilogInserter::runOnMachineFunction |
| Pass position | Post-register-allocation, before NVPTXPeephole |
| Stack frame | 0x490 bytes of local state (~400 variables) |
| Upstream equivalent | NVPTXPrologEpilogPass (280 lines) + NVPTXFrameLowering (101 lines) |
| Key strings | "warn-stack-size", "stack frame size" |
| Knobs | warn-stack-size (function attribute), nvptx-short-ptr, nv-disable-mem2reg |

The GPU "Stack" Model

__local_depot: The Frame Array

Every PTX function that needs local storage declares a .local byte array:

.local .align 16 .b8  __local_depot0[256];

This is the entire "stack frame." The alignment value is the maximum alignment of any object in the frame. The size is the total frame size computed by PEI. The suffix number (0, 1, ...) is the function index within the module.

There is no call stack in the CPU sense. GPU threads have a fixed local memory allocation (typically 512 KB per thread on modern architectures). The .local directive reserves a region within this per-thread memory. Recursive functions and dynamic allocations are legal in PTX but the driver/ptxas resolves their addresses — cicc only needs to produce a statically-sized depot for each function's fixed-size locals.

%SP and %SPL: The Two Frame Pseudo-Registers

PTX declares two pseudo-register pairs for frame access:

.reg .b64  %SP;     // generic address space pointer to the frame
.reg .b64  %SPL;    // local address space (AS 5) pointer to the frame

In 32-bit mode these are .reg .b32. The distinction exists because NVIDIA GPUs use address space qualification:

  • %SPL (Stack Pointer Local) — points directly into the .local address space (PTX address space 5). Loads/stores using %SPL emit ld.local/st.local instructions, which ptxas can optimize for the L1 cache local-memory path. This is the efficient pointer.

  • %SP (Stack Pointer) — a generic address space pointer obtained by converting %SPL via cvta.local. Loads/stores using %SP go through generic address resolution, which adds a TLB lookup to determine the address space at runtime. This is slower but required when the address escapes to code that expects generic pointers (e.g., passing a local variable's address to a called function).

The prologue sequence is:

mov.u64   %SPL, __local_depot0;     // MOV_DEPOT_ADDR_64
cvta.local.u64  %SP, %SPL;          // cvta_local_64

The cvta.local (Convert Address) instruction is the key: it takes a .local pointer and produces the equivalent generic-space pointer. When nvptx-short-ptr is enabled, %SPL is 32 bits (sufficient for the per-thread local memory window, always < 4 GB) while %SP may still be 64 bits on 64-bit targets.

Upstream's NVPTXFrameLowering::emitPrologue implements this directly. It checks MachineRegisterInfo::use_empty for each register — if %SP has no uses, it skips the cvta.local; if %SPL has no uses, it skips the mov.depot. The NVPTXPeephole pass runs immediately after PEI and rewrites LEA_ADDRi64 %VRFrame64, offset followed by cvta_to_local_64 into LEA_ADDRi64 %VRFrameLocal64, offset, eliminating the generic-to-local conversion when the address stays in local space.

Frame Index Resolution

During instruction selection and register allocation, local memory references use abstract frame indices: %stack.0, %stack.1, etc. Each maps to a MachineFrameInfo frame object with a size, alignment, and (after PEI) a byte offset.

Frame-index elimination in upstream is simple — NVPTXRegisterInfo::eliminateFrameIndex replaces the frame-index operand with VRFrame (which prints as %SP) and sets the immediate offset:

MI.getOperand(FIOperandNum).ChangeToRegister(getFrameRegister(MF), false);
MI.getOperand(FIOperandNum + 1).ChangeToImmediate(Offset);

The VRDepot physical register (prints as %Depot internally) serves as the canonical frame base in getFrameIndexReference. For debug info, %Depot is remapped to %SP since cuda-gdb resolves stack frames via the generic pointer.

Frame Layout Algorithm

cicc's PEI executes in ten sequential phases within a single monolithic function. The algorithm is significantly more sophisticated than upstream's linear scan.

Phase 1–2: Setup and Callee-Saved Registers (lines 443–566)

Retrieves the TargetFrameLowering and TargetRegisterInfo from the MachineFunction's subtarget. If callee-saved registers exist (determined by vtable(FrameLowering, +480)), allocates a 0xA8-byte callee-save info structure at PEI state offset +200 containing two inline SmallVectors for register indices.

On GPU targets, callee-saved registers are unusual — PTX functions use a fully virtual register file, so there is no hardware register saving in the CPU sense. However, cicc models device-function calling conventions that may require preserving certain virtual registers across calls, and this mechanism handles that.

Phase 3: Fixed Object Collection (lines 567–730)

Initializes a chunk table (deque-like structure) with -4096 sentinel values. Collects prolog/epilog insertion points from the PEI state arrays at offsets +216 (prolog points, count at +224) and +264 (epilog points, count at +272).

When callee-saves exist and optimization level is not 20 (a special threshold), manually inserts save/restore instructions:

  • Simple saves: storeRegToStackSlot(MBB, MI, reg, kill=1, FI, RC, TRI)
  • Compound saves: handles sub-register decomposition via sub_2F26260 when byte+9 == 1 in the callee-save info.

Phase 4: Offset Assignment — The Core Layout Engine (lines 733–1070)

This is the heart of PEI. It assigns byte offsets within __local_depot to every frame object.

MachineFrameInfo layout:
  StackDirection:    1 = grows-negative (toward lower addresses)
                     0 = grows-positive (toward higher addresses)
  LocalFrameSize:    initial offset base
  NumFixedObjects:   count of pre-positioned objects
  MaxAlignment:      tracks largest alignment seen

Fixed objects are laid out first. Each frame object is a 40-byte record:

Offset  Type   Field
+0      i64    Byte offset (written by PEI)
+8      i64    Object size in bytes
+16     u8     Alignment (log2)
+20     u8     isDead flag
+32     u8     isSpillSlot flag
+36     u8     Category (0–3)

The alignment formula appears ~20 times throughout the pass:

// Round up 'value' to next multiple of (1 << align_log2):
aligned = -(1 << align_log2) & (value + (1 << align_log2) - 1);
// Equivalent to: aligned = (value + mask) & ~mask  where mask = (1<<n) - 1
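The two forms can be checked against each other exhaustively for small inputs. A minimal sketch (the function names are invented for the example; the arithmetic is exactly the formula above, relying on Python's arbitrary-precision two's-complement semantics for `-(1 << n)`):

```python
# Verify the negate-and-mask round-up equals the classic mask form
# for power-of-two alignments stored as log2, as in the frame records.

def align_up_neg(value, align_log2):
    a = 1 << align_log2
    return -a & (value + a - 1)

def align_up_mask(value, align_log2):
    mask = (1 << align_log2) - 1
    return (value + mask) & ~mask

for value in range(257):
    for log2 in range(6):
        assert align_up_neg(value, log2) == align_up_mask(value, log2)

assert align_up_neg(13, 4) == 16   # 13 rounds up to a 16-byte boundary
assert align_up_neg(32, 4) == 32   # aligned values are unchanged
```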

For grows-negative direction, offsets are stored as negative values; for grows-positive, they accumulate upward.

Callee-saved region is laid out next, iterating frame indices in range [PEI+208 .. PEI+212]. Each CSR object gets an aligned offset using the same formula.

Separate stack area: if MachineFrameInfo+665 flag is set, NVIDIA supports a physically separate stack region with its own alignment at +664 and total size at +656. This likely corresponds to a distinct .local segment for shared-memory scratch or ABI-reserved zones.

Phase 5: Categorized Local Variable Layout (lines 1060–1600)

This is cicc's most significant divergence from upstream PEI. Objects are classified into three priority buckets by a category byte at frame-object offset +36:

| Category | Bucket | Typical contents | Layout order |
|---|---|---|---|
| 3 | v427 | Vector/tensor spills (high alignment) | First |
| 2 | v419 | Medium-aligned objects | Second |
| 1 | v412 | General locals | Third |
| 0 | -- | Skip (already placed or dead) | -- |

Each bucket is processed by sub_35B0830 which assigns aligned offsets. The ordering minimizes alignment waste: laying out large-alignment objects first avoids padding gaps.
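The padding-avoidance claim is easy to demonstrate numerically. A hypothetical sketch (the `linear_layout` helper and the object list are invented; the rounding is the same formula used in Phase 4):

```python
# Show that placing high-alignment objects first wastes less padding.
# Objects are (size, alignment) pairs; alignment is a power of two.

def linear_layout(objects):
    off = 0
    for size, align in objects:
        off = (off + align - 1) & ~(align - 1)  # round up to alignment
        off += size
    return off  # total frame size, padding included

objs = [(4, 4), (16, 16), (4, 4), (16, 16)]
ascending  = linear_layout(sorted(objs, key=lambda o: o[1]))
descending = linear_layout(sorted(objs, key=lambda o: -o[1]))
assert ascending == 48 and descending == 40  # category-3-first wins
```

Interleaving small 4-byte locals between 16-byte objects forces padding before each large object; laying out the category-3 bucket first packs the small objects into the tail instead.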

Objects are skipped if:

  • They are spill slots in a separate stack area
  • They fall within the callee-saved index range
  • Their size is -1 (sentinel for dynamic-size objects)
  • They are the frame-pointer object
  • They are dead

Bitmap-Based Packing — When register count is nonzero and canUseStackBitmap returns true (frame size <= 0x7FFFFFFF), cicc builds a bitset representing every byte of the frame:

// Bitmap size in qwords:
bitmap_size = (frame_size + 63) >> 6;

// Mark all bytes as free (bits set to 1)
// Then clear bits for fixed objects and CSR objects
for each placed_object:
    clear bits [offset .. offset + size)

For each unassigned general object, the algorithm scans the bitmap using tzcnt (trailing zero count) to find contiguous runs of set bits that match the object's size and alignment:

for each unassigned_obj in v412:
    candidate = tzcnt_scan(bitmap, obj.size);
    if (candidate != NOT_FOUND):
        // Verify alignment
        if aligned(candidate, obj.alignment):
            // Verify all bits available (inner loop)
            if all_bits_set(bitmap, candidate, candidate + obj.size):
                assign_offset(obj, candidate);
                clear_bits(bitmap, candidate, candidate + obj.size);
                continue;
    // Fallback: linear allocation at end of frame
    offset = align(running_offset);
    assign_offset(obj, offset);
    running_offset += obj.size;

This is substantially more aggressive than both upstream LLVM PEI (which does a single linear pass) and the upstream NVPTX PrologEpilogPass (which has no packing at all). It enables reuse of "holes" left by fixed objects, callee-saves, and dead objects.
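The hole-finding loop can be sketched with a Python integer standing in for the bitmap qwords. This is an illustrative reconstruction, not decompiled logic: `find_hole` and its skip strategy after a failed candidate are assumptions; the set-bit-means-free convention and the tzcnt-style lowest-set-bit scan follow the pseudocode above.

```python
# One bit per frame byte; set bits are free. (b & -b).bit_length() - 1
# plays the role of tzcnt on the composite bitset.

def find_hole(bitmap, frame_size, size, align):
    b = bitmap
    while b:
        cand = (b & -b).bit_length() - 1          # lowest free byte
        cand = (cand + align - 1) & ~(align - 1)  # round up to alignment
        run = ((1 << size) - 1) << cand           # bits the object needs
        if cand + size <= frame_size and bitmap & run == run:
            return cand                           # whole run is free
        b &= ~((1 << (cand + 1)) - 1)             # skip past this candidate
    return None  # caller falls back to linear allocation at frame end

frame_size = 32
bitmap = (1 << frame_size) - 1           # all 32 bytes free
bitmap &= ~(((1 << 8) - 1) << 0)         # fixed object occupies [0, 8)
bitmap &= ~(((1 << 4) - 1) << 16)        # CSR object occupies [16, 20)

assert find_hole(bitmap, frame_size, 8, 8) == 8    # hole between objects
assert find_hole(bitmap, frame_size, 12, 4) == 20  # tail region
assert find_hole(bitmap, frame_size, 16, 1) is None  # no run fits
```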

Phase 6: Final Alignment and Frame Size (lines 1688–1795)

After all objects are laid out:

  1. If targetHandlesStackFrameRounding returns true, skip to finalization.
  2. Add MaxCallFrameSize to the running offset if the function adjusts the stack.
  3. Choose alignment: StackAlign (from TFI.getStackAlign()) for functions with calls or alloca, or TransientStackAlign for leaf functions. The subtarget stores these at FrameLowering[12] and [13] respectively.
  4. Round up: final = align(running_offset, max(StackAlign, MaxAlignment)).
  5. If alignment changed the total and direction is grows-negative, shift all callee-save offsets by the delta to maintain correct relative positions.
  6. Write FrameInfo.StackSize = final_offset - initial_offset.

This value becomes the SIZE in .local .align ALIGN .b8 __local_depotN[SIZE].
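Steps 2–4 reduce to a short computation. A hedged sketch (the function name and argument order are invented; step 1's early exit and step 5's callee-save offset shifting are omitted):

```python
# Phase 6 finalization: pick the alignment source, fold in the call
# frame, and round the running offset up to the final depot size.

def finalize_frame(running_offset, max_alignment, stack_align,
                   transient_align, has_calls_or_alloca, max_call_frame=0):
    off = running_offset + max_call_frame                      # step 2
    align = stack_align if has_calls_or_alloca else transient_align  # step 3
    align = max(align, max_alignment)
    return (off + align - 1) & ~(align - 1)                    # step 4

# Leaf function, 100 bytes of locals, 16-byte max object alignment:
assert finalize_frame(100, 16, 16, 8, False) == 112
# Function with calls and a 32-byte call frame area:
assert finalize_frame(100, 4, 16, 8, True, 32) == 144
```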

Phase 7: Prologue/Epilogue Insertion (lines 1803–1872)

Executed when optimization level is not at threshold 20. For each prolog insertion point, calls emitPrologue(MF, MBB) via RegisterInfo vtable at +96. For each epilog point, calls emitEpilogue(MF, MBB) at +104.

Post-fixup via sub_35AC7B0, then a second pass over prolog points for insertPrologueSaveCode (vtable +152, if not a null stub).

Architecture-specific extension: checks (*(Module+2) >> 4) & 0x3FF == 0xB (SM arch code 11). When matched, calls an additional prolog handler at vtable +176. This likely targets an early or internal SM variant.

Phase 8–9: Frame Index Elimination (lines 1873–2268)

Two strategies selected by vtable(FrameLowering, +616):

Forward elimination (Path A): walks each MBB's instruction list forward. For each instruction, checks the opcode against FRAME_SETUP and FRAME_DESTROY pseudos — these adjust the SP offset tracker. For other instructions, scans operands for type-5 (FrameIndex), then calls sub_35ABF20 to attempt elimination or falls back to the target-specific handler.

Backward elimination (Path B): same logic but iterates instructions in reverse order. Handles FRAME_SETUP/FRAME_DESTROY with different SP adjustment accumulation.

This dual-path approach is unique to cicc — upstream NVPTX PrologEpilogPass only does a single backward walk. The forward path may be needed for instructions where the SP adjustment at a given point depends on preceding pseudo-ops.

Phase 10: Diagnostics and Cleanup (lines 2270–2388)

Stack size warning: default threshold is 0xFFFFFFFF (4 GB, effectively disabled). If the function has a "warn-stack-size" attribute, it parses the value via strtoul(str, &end, 10). When the total frame size (plus optional regspill area at MF+86*wordsize if opt-level flag 55 is set) exceeds the threshold, emits a "stack frame size" diagnostic.

Stack annotation: if annotation output is enabled (checked via sub_B6EA50/sub_B6F970), formats and writes stack-size metadata to the analysis output for the NVVM container.

Cleanup frees the 0xA8 callee-save info structure, resets prolog/epilog point counts, resets frame metadata, and walks the chunk table to free non-inline instruction arrays.

Dynamic Stack Allocation (alloca)

PTX supports alloca semantics at the LLVM IR level — the alloca instruction lowers to a local memory reservation. However, truly dynamic-sized allocations (variable-length arrays, runtime alloca(N)) are constrained:

  • MachineFrameInfo.hasVarSizedObjects (flag at +36) tracks whether the function contains VLA-style allocations.
  • When present, PEI selects StackAlign (the full stack alignment) rather than TransientStackAlign for final frame rounding.
  • ptxas ultimately resolves dynamic allocations at JIT time, not cicc. cicc's role is to set up the frame pointer correctly so that dynamic objects can be addressed relative to it.
  • The FramePointerIndex (at MachineFrameInfo+68) is laid out last among general objects, ensuring the frame pointer anchors the top of the fixed frame with dynamic objects growing beyond it.

For fixed-size allocas, SROA (Scalar Replacement of Aggregates) typically promotes them to SSA registers before PEI ever runs. When SROA succeeds for all allocas, MachineFrameInfo has no stack objects and PEI emits no __local_depot at all — the function runs entirely in registers.

Spill Slots

Register spills are the primary consumer of __local_depot space. When the register allocator cannot fit a virtual register's live range into the available physical registers, it creates a spill slot — a frame object marked with isSpillSlot = 1 (byte at frame-object +32).

Spill-slot frame objects are created during register allocation. PEI does not create them; it only assigns their offsets. In cicc, spill slots interact with the categorized layout:

  • Spill slots in a separate stack area (when hasSeparateStackArea is set) are excluded from the general layout and handled in Phase 4's separate-area processing.
  • Remaining spill slots are classified into categories 1–3 based on their alignment requirements and register class — vector register spills (e.g., 128-bit %rq registers) end up in category 3, scalar spills in category 1.

After PEI assigns offsets, the spill loads/stores reference [%SP + offset] or [%SPL + offset] directly. The post-PEI NVPTXPeephole pass optimizes these: when a LEA_ADDRi64 %VRFrame64, offset feeds directly into cvta_to_local_64, the peephole collapses this to LEA_ADDRi64 %VRFrameLocal64, offset, saving the generic address conversion.

Interaction with SROA

SROA runs early in the optimization pipeline (see SROA) and aggressively promotes alloca instructions to SSA values. For many GPU kernels — especially those that avoid taking addresses of locals — SROA eliminates all allocas, resulting in an empty MachineFrameInfo. In this case:

  1. PEI's frame size computes to 0.
  2. The PTX emitter (sub_2158E80) checks FrameInfo.StackSize; if zero, it emits no .local directive and no %SP/%SPL declarations.
  3. The function runs entirely in the virtual register file — the ideal case for GPU performance.

When SROA cannot promote (address-taken locals, aggregates too large for SROA's threshold controlled by sroa-size-limit, or when sroa-skip-mem2reg is set), PEI becomes essential. Additionally, cicc has a custom MI Mem2Reg pass (nv-disable-mem2reg controls it) that runs post-register-allocation and promotes MachineIR local-memory accesses back to registers — effectively a second chance at eliminating __local_depot usage after regalloc.

Comparison with Upstream

| Aspect | Upstream NVPTXPrologEpilogPass | cicc sub_35B1110 |
|---|---|---|
| Size | 280 lines | ~2,400 lines |
| Callee-saved regs | Not handled | Full save/restore infrastructure |
| Register scavenging | Not used | Both forward and backward paths |
| Layout algorithm | Single linear pass over all objects | Categorized 3-bucket layout + bitmap packing |
| Frame packing | None — objects placed sequentially | tzcnt-accelerated bitmap hole-finding |
| Stack direction | Supports both, simple | Supports both, with per-direction callee-save adjustment |
| Diagnostics | None | warn-stack-size attribute + annotation output |
| Separate stack area | Not supported | Full support (flag at MFI+665) |
| Arch-specific prolog | None | SM arch code 0xB extension |
| Optimization gating | None | opt-level 20 skips prolog/epilog emission |
| Frame-index elimination | Single backward walk | Dual forward/backward strategies |

The upstream pass explicitly disables LLVM's standard PrologEpilogCodeInserterID and replaces it. cicc's version is closer to the full standard LLVM PEI but with GPU-specific extensions — it re-enables callee-saved handling, register scavenging, and the frame-rounding logic that upstream strips out.

Configuration

| Knob | Type | Default | Effect |
|---|---|---|---|
| warn-stack-size | Function attribute (string→int) | 0xFFFFFFFF (disabled) | Emit diagnostic when frame size exceeds threshold |
| nvptx-short-ptr | cl::opt<bool> | false | Use 32-bit pointers for local/const/shared address spaces; affects %SPL width |
| nv-disable-mem2reg | cl::opt<bool> | false | Disable post-regalloc MI Mem2Reg pass (more objects remain for PEI to lay out) |
| sroa-size-limit | cl::opt<int> | (varies) | Max aggregate size SROA will promote; larger values reduce PEI workload |
| Opt-level flag 20 | Internal | -- | Skips prolog/epilog instruction emission and callee-save handling |
| Opt-level flag 55 | Internal | -- | Includes regspill area in stack-size diagnostic total |
| FrameLowering[12] | Subtarget | arch-dependent | Stack alignment for functions with calls/alloca |
| FrameLowering[13] | Subtarget | arch-dependent | Stack alignment for leaf functions (TransientStackAlign) |

Key Data Structures

MachineFrameInfo (at MachineFunction+48)

Offset  Type   Field
+8      ptr    Objects array base pointer (40-byte records)
+16     ptr    Objects array end pointer
+32     i32    NumFixedObjects
+36     u8     hasVarSizedObjects
+48     i64    StackSize  ← WRITTEN by PEI
+64     u8     MaxAlignment (log2)
+65     u8     hasCalls / needsStackAlignment
+68     i32    FramePointerIndex (-1 if none)
+80     i64    MaxCallFrameSize (-1 if unknown)
+96     ptr    Separate-area array base
+104    ptr    Separate-area array end
+120    u8     hasCalleeSaves  ← SET by PEI
+128    ptr    Extra-area array pointer
+136    i64    Extra-area count
+656    i64    Separate area total size
+664    u8     Separate area alignment
+665    u8     hasSeparateStackArea flag

PEI State (pass object, offset from a1)

Offset  Type   Field
+8      ptr    Analysis list (tagged analysis pointers)
+200    ptr    Callee-save info (0xA8-byte struct, or null)
+208    u32    First CSR frame index
+212    u32    Last CSR frame index
+216    ptr    Prolog insertion points array
+224    u32    Prolog point count
+264    ptr    Epilog insertion points array
+272    u32    Epilog point count
+312    u8     hasReservedCallFrame flag
+313    u8     requiresRegisterScavenging flag
+320    ptr    Stack-size annotation analysis pointer

Frame Object Record (40 bytes each)

Offset  Type   Field
+0      i64    Byte offset in __local_depot (assigned by PEI)
+8      i64    Object size in bytes
+16     u8     Alignment (log2 encoding)
+20     u8     isDead flag
+32     u8     isSpillSlot flag
+36     u8     Category: 0=skip, 1=general, 2=medium, 3=large

Diagnostic Strings

| String | When emitted |
|---|---|
| "warn-stack-size" | Function attribute name — read and parsed as an integer threshold |
| "stack frame size" | Diagnostic message when total frame size exceeds the warn-stack-size threshold |

Function Map

| Function | Address |
|---|---|
| PrologEpilogInserter::runOnMachineFunction — main entry (68 KB) | sub_35B1110 |
| PEI pre-setup: initialize frame object tracking | sub_35AC440 |
| Record frame object into chunk table | sub_35AFAD0 |
| Determine CSR frame index range (writes PEI+208, +212) | sub_35AEEB0 |
| Post-save fixup | sub_35AE230 |
| Insert restore instructions at epilog points | sub_35ADBC0 |
| Assign offsets to categorized frame object bucket | sub_35B0830 |
| Push frame object index into categorized bucket | sub_35B0B10 |
| Post-prolog/epilog fixup | sub_35AC7B0 |
| Try to eliminate a single frame index operand | sub_35ABF20 |
| Initialize register scavenger for a MBB | sub_35C5BD0 |
| Advance register scavenger | sub_35C5C00 |
| Post-scavenging callee-save cleanup | sub_35C6D20 |
| Format stack-size annotation | sub_35AE7D0 |
| PTX emitter: emitFunctionFrameSetup() (__local_depot + %SP/%SPL) | sub_2158E80 |
| Local depot helper | sub_214C040 |
| Local depot helper | sub_2154370 |
| Collect callee-saved registers | sub_2E77EA0 |
| Get register class for physical register | sub_2FF6500 |
| Build sub-register decomposition list | sub_2F26260 |
| Insert compound save instruction | sub_2E8EAD0 |
| Check optimization level flag | sub_B2D610 |
| Check function attribute existence | sub_B2D620 |
| Get function attribute value | sub_B2D7E0 |
| Build stack-size diagnostic message | sub_B15960 |

Differences from Upstream LLVM

| Aspect | Upstream LLVM (NVPTX open-source) | CICC v13.0 |
|---|---|---|
| Implementation | Stripped-down NVPTXPrologEpilogPass (~280 lines); handles only offset calculation and frame-index elimination | Full 68 KB PEI monolith with callee-saved register handling, register scavenging, bitmap-based frame packing, categorized layout ordering |
| Stack concept | No hardware stack; minimal __local_depot offset assignment | Same __local_depot model but with full-featured offset assignment: categorized frame objects, alignment-based bucketing, dead frame object elimination |
| Callee-saved registers | Skipped entirely (no function calls in typical kernels) | Restored: full callee-saved register scan, compound save/restore instruction insertion for non-inlined device function calls |
| Register scavenging | Absent | Included: sub_35C5BD0/sub_35C5C00 initialize and advance a register scavenger per MBB for emergency spill resolution |
| Frame packing | Sequential offset assignment | Bitmap-based packing with categorized buckets; objects sorted by alignment to minimize padding waste |
| Stack-size diagnostics | No diagnostic system | Annotation system (sub_35AE7D0) formats stack-size remarks; integrates with -Rpass-analysis for occupancy tuning |
| Prologue emission | Two-instruction %SP/%SPL setup | Same two-instruction prologue (sub_2158E80) but with additional __local_depot sizing logic for complex frame layouts |

Cross-References

  • Register Allocation — creates spill slots that PEI lays out; the number and alignment of spills directly determines frame size.
  • Register Coalescing — reduces register pressure, which reduces spills, which reduces frame size.
  • SROA — SROA eliminates allocas before they reach MachineIR; when fully successful, PEI has nothing to do.
  • AsmPrinter & PTX Body Emission — sub_2158E80 emits the .local directive and %SP/%SPL declarations that PEI computed.
  • Instruction Scheduling — runs before PEI; scheduling decisions affect register pressure and thus spill count.
  • Pipeline & Ordering — PEI runs post-regalloc, followed immediately by NVPTXPeephole for %VRFrame to %VRFrameLocal optimization.

BranchFolding & TailMerge

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

LLVM version note: Based on LLVM 20.0.0 BranchFolding.cpp. The critical divergence is that cicc removes the requiresStructuredCFG() gate that upstream uses to disable tail merging for GPU targets, and compensates with a reserved-register merge safety check not present in any upstream version.

BranchFolding is LLVM's post-register-allocation CFG optimizer. It runs after block placement and performs three transformations in a fixed-point loop: tail merging (extracting identical instruction tails from multiple blocks into a shared block), branch optimization (eliminating redundant or unreachable branches, merging single-predecessor blocks into predecessors), and common-code hoisting (lifting identical instructions from successors into a shared predecessor). In cicc v13.0 the pass lives at sub_2F336B0 (the OptimizeBlock / TailMergeBlocks core, 11,347 bytes) with pass entry at sub_2F36310. The NVPTX version carries one critical divergence from upstream LLVM: tail merging is not disabled by requiresStructuredCFG(). Instead, cicc keeps tail merging enabled but gates individual merge decisions on a reserved-register check that prevents merging when NVPTX special registers (%tid.x, %ntid.x, etc.) cross the merge boundary.
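The tail-merging transformation itself is conceptually simple. A toy sketch on instruction strings (not MachineInstrs; `common_tail` is invented for illustration, and `min_len` loosely mirrors the -tail-merge-size minimum of 3 instructions):

```python
# Find the longest common instruction suffix of two blocks; a real
# tail merge would move it into a shared successor block.

def common_tail(a, b, min_len=3):
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[-n:] if n >= min_len else []

bb1 = ["add r1, r2", "mul r3, r1", "st [r4], r3", "ret"]
bb2 = ["sub r1, r2", "mul r3, r1", "st [r4], r3", "ret"]
assert common_tail(bb1, bb2) == ["mul r3, r1", "st [r4], r3", "ret"]
assert common_tail(["a", "ret"], ["b", "ret"]) == []  # below min_len
```

In cicc the interesting part is not this suffix scan but the merge legality checks around it, covered below.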

Key Facts

| Property | Value |
|---|---|
| Core function | sub_2F336B0 (OptimizeBlock / TailMergeBlocks) |
| Function size | 11,347 bytes (792-byte stack frame) |
| Pass entry point | sub_2F36310 (iterates all MBBs) |
| Pass ID (upstream) | "branch-folder" / BranchFolderPassID |
| Pipeline position | After register allocation, after block placement |
| Disable knob | -disable-branch-fold (global at qword_5022CC8) |
| Tail-merge gate | enable-tail-merge (tri-state: unset/true/false) |
| Tail-merge threshold | -tail-merge-threshold (default 150) |
| Minimum tail length | -tail-merge-size (default 3 instructions) |
| Knob constructor | ctor_346 |
| Required property | NoPHIs -- SSA phi nodes must already be eliminated |

Upstream vs. NVPTX Behavior

In stock LLVM, BranchFolderPass::run checks requiresStructuredCFG() on the TargetMachine and, if true, disables tail merging entirely:

bool EnableTailMerge = !MF.getTarget().requiresStructuredCFG()
                       && PassConfig->getEnableTailMerge();

NVPTX returns true from requiresStructuredCFG(), so upstream LLVM would completely suppress tail merging for GPU targets. cicc removes this gate. The binary evidence is the vtable check at 0x2F337A3 (cmp rax, offset sub_2DAC790), which verifies that the NVPTXInstrInfo vtable supports analyzeBranch -- if it does, tail merging proceeds. The structured-CFG check is absent. This makes sense: StructurizeCFG has already run by this point and guaranteed reducible control flow; tail merging two blocks that share a common successor preserves reducibility because it only introduces a new unconditional branch to the merged tail, which does not create irreducible cycles.

However, cicc compensates with three safety mechanisms that upstream does not need:

  1. Reserved-register check. At 0x2F3427B, the pass calls sub_2E88A90 with flag 0x200 (isReservedReg) on every register live across the proposed merge boundary. NVPTX special registers (%tid.x, %ntid.x, %ctaid.x, etc.) are reserved and cannot be live-in to a newly created shared tail block because their values are implicitly defined by the hardware. If any reserved register is detected, the merge is rejected. See the Reserved-Register Safety Mechanism section below for full detail.

  2. Priority ordering for conditional branches. The pattern or ecx, 2 at 0x2F33B1C assigns priority >= 2 to conditional branch terminators and lower priority to unconditional branches. This ensures unconditional-branch tails are merged first, because those merges never alter branch conditions and are always safe within structured CFG. Conditional tail merges are attempted only after unconditional ones are exhausted.

  3. NVPTXInstrInfo vtable validation. The vtable check at 0x2F337A3 (cmp rax, offset sub_2DAC790) verifies that the TargetInstrInfo object supports analyzeBranch before any merge is attempted. This is a guard against running the pass on a MachineFunction whose InstrInfo does not implement branch analysis -- a scenario that cannot occur in the normal NVPTX pipeline but could if the pass were invoked from an unexpected context. The check loads the vtable pointer from [TII], compares against the known NVPTXInstrInfo vtable base, and short-circuits to "no merge" if the match fails.

Algorithm

The pass entry sub_2F36310 calls OptimizeFunction, which runs a fixed-point loop:

OptimizeFunction(MF):
    repeat:
        changed  = TailMergeBlocks(MF)
        changed |= OptimizeBranches(MF)
        changed |= HoistCommonCode(MF)
    until !changed
    // clean up dead jump tables

TailMergeBlocks

TailMergeBlocks operates in two phases.

Phase A -- return/exit blocks. Collect all blocks with no successors (return blocks, noreturn calls) into MergePotentials, capped at tail-merge-threshold (150). Hash each block's tail via sub_2F26260 (HashEndOfMBB), which computes HashMachineInstr on the last non-debug instruction. If two or more candidates share a hash, call TryTailMergeBlocks to attempt the merge.
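This bucket-then-compare structure can be sketched in Python. The dict-based block and instruction records below are hypothetical stand-ins for MachineBasicBlock/MachineInstr; the real pass buckets by the HashEndOfMBB value rather than a tuple key:

```python
from collections import defaultdict

def phase_a_merge_candidates(blocks, threshold=150):
    """Bucket no-successor blocks by their tail instruction; only
    buckets with two or more candidates reach pairwise comparison."""
    # MergePotentials: return/noreturn blocks, capped at the threshold
    potentials = [b for b in blocks if not b["succs"]][:threshold]

    buckets = defaultdict(list)
    for b in potentials:
        tail = [i for i in b["insts"] if not i.get("debug")]
        if tail:  # empty / all-debug blocks hash to nothing
            last = tail[-1]
            buckets[(last["opc"], last["ops"])].append(b)

    # Only hash collisions proceed to TryTailMergeBlocks
    return [group for group in buckets.values() if len(group) >= 2]

blocks = [
    {"succs": [], "insts": [{"opc": "ret", "ops": ()}]},
    {"succs": [], "insts": [{"opc": "ret", "ops": ()}]},
    {"succs": ["exit"], "insts": [{"opc": "bra", "ops": ("exit",)}]},
]
groups = phase_a_merge_candidates(blocks)
print(len(groups), len(groups[0]))  # 1 group containing both return blocks
```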

Phase B -- multi-predecessor blocks. For each block IBB with >= 2 predecessors, collect the predecessors into MergePotentials. For each predecessor PBB:

  • Skip self-loops (PBB == IBB), EH-pad successors, inline-asm-br blocks.
  • Call AnalyzeBranch (sub_2E09D00) on PBB. If PBB conditionally branches to IBB, reverse the condition so the unconditional fall-through to IBB is removed, leaving only the conditional branch to the "other" target. This normalization enables tail comparison.
  • Hash the tail of the normalized PBB and push it into MergePotentials.

Then call TryTailMergeBlocks(IBB, PredBB, MinCommonTailLength):

TryTailMergeBlocks(SuccBB, PredBB, MinTail):
    sort MergePotentials by hash
    for each group of candidates sharing a hash:
        for each pair (MBB1, MBB2) in the group:
            tail_len = ComputeCommonTailLength(MBB1, MBB2)
            if tail_len >= MinTail:
                // check reserved-register constraint (NVPTX addition)
                for each reg live across merge point:
                    if hasProperty(reg, 0x200):  // isReservedReg
                        reject merge; continue
                // perform the merge
                create new MBB "CommonTail"
                splice tail instructions from MBB1 into CommonTail
                ReplaceTailWithBranchTo(MBB2, CommonTail)
                UpdateTerminator on both blocks
                update live-ins for CommonTail
                merged = true
    return merged

ComputeCommonTailLength walks backwards from both block ends, comparing instructions via isIdenticalTo. It skips debug and CFI instructions. Inline asm is never merged (hard-coded rejection in upstream). The cicc binary performs this comparison at 0x2F33B0F--0x2F33BDD, extracting opcode from [ptr+18h] and comparing sub-fields via sar/and arithmetic on the instruction encoding.

HashEndOfMBB -- sub_2F26260

The hash function at sub_2F26260 computes a 32-bit hash of a block's tail for fast merge-candidate matching. The algorithm:

HashEndOfMBB(MBB):
    iter = MBB.rbegin()        // last instruction
    // skip debug instructions
    while iter != MBB.rend() && iter.isDebugInstr():
        iter++
    if iter == MBB.rend():
        return 0               // empty block (or all-debug)
    // skip terminator branches -- hash the last non-branch
    while iter != MBB.rend() && iter.isTerminator():
        iter++
    if iter == MBB.rend():
        return 0               // block contains only terminators
    return HashMachineInstr(*iter)

HashMachineInstr (at sub_2E89C70) hashes the instruction's opcode, number of operands, and the first two operands' register/immediate values. It does not hash memory operands or metadata -- this is intentional, because the hash is only used to bucket candidates for pairwise comparison. False collisions are resolved by the subsequent ComputeCommonTailLength call. The hash uses a simple multiply-and-XOR scheme:

HashMachineInstr(MI):
    h = MI.getOpcode()
    h = h * 37 + MI.getNumOperands()
    if MI.getNumOperands() >= 1:
        h = h * 37 + hashOperand(MI.getOperand(0))
    if MI.getNumOperands() >= 2:
        h = h * 37 + hashOperand(MI.getOperand(1))
    return h

The * 37 constant is standard LLVM hashing (the same multiplier used in DenseMapInfo). The hash is deliberately coarse -- it accepts false positives (two different instructions hashing to the same value) but never produces false negatives (two identical instructions hashing differently), which is the correct tradeoff for a merge-candidate filter.
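A direct Python transcription of the pseudocode above illustrates both halves of the tradeoff (operand hashing is simplified here to the raw integer value):

```python
def hash_machine_instr(opcode, operands):
    """Multiply-by-37 bucketing hash: opcode, operand count,
    and at most the first two operand values."""
    h = opcode
    h = h * 37 + len(operands)
    for op in operands[:2]:
        h = h * 37 + op
    return h

# Identical instructions always hash identically (no false negatives)...
assert hash_machine_instr(0x26, (1, 2)) == hash_machine_instr(0x26, (1, 2))
# ...but instructions differing only in the third operand collide
# (a false positive, resolved later by the full isIdenticalTo
# comparison inside ComputeCommonTailLength)
assert hash_machine_instr(0x26, (1, 2, 3)) == hash_machine_instr(0x26, (1, 2, 4))
```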

ComputeCommonTailLength -- Detailed Binary Walkthrough

The comparison loop at 0x2F33B0F proceeds as follows:

ComputeCommonTailLength(MBB1, MBB2):
    iter1 = MBB1.rbegin()     // walk backwards from end
    iter2 = MBB2.rbegin()
    count = 0

    // skip debug instructions at tails
    skip_debug(iter1, MBB1)
    skip_debug(iter2, MBB2)

    while iter1 != MBB1.rend() && iter2 != MBB2.rend():
        MI1 = *iter1
        MI2 = *iter2

        // extract opcode from [MI + 0x18]
        opc1 = *(uint32_t*)(MI1 + 0x18)
        opc2 = *(uint32_t*)(MI2 + 0x18)

        // reject if either is inline asm (opcode check)
        if is_inline_asm(opc1) || is_inline_asm(opc2):
            break

        // reject if either is a CFI pseudo-instruction
        if is_cfi(opc1) || is_cfi(opc2):
            skip to next non-CFI; continue

        // full comparison: opcode, operand count, each operand
        if !isIdenticalTo(MI1, MI2):
            break

        count++
        iter1++; iter2++
        skip_debug(iter1, MBB1)
        skip_debug(iter2, MBB2)

    return count

The isIdenticalTo comparison at the binary level extracts fields from the MachineInstr layout:

  • [MI + 0x18]: opcode (32-bit)
  • [MI + 0x08]: operand list pointer
  • [MI + 0x10]: operand count (16-bit at +0x10, flags at +0x12)
  • Each operand at stride 40 bytes: [operand + 0x00] = type tag, [operand + 0x08] = register/immediate value

Two instructions are identical if and only if: same opcode, same number of operands, and for each operand pair: same type tag and same value. Memory operands (MachineMemOperand) are not compared -- two loads from different memory locations with the same register operands will compare as identical. This is correct for tail merging because if the instructions are in the tail of two blocks that reach the same successor, their memory operands must be equivalent by construction (they access the same state at the same program point).
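A minimal Python model of that comparison (field names mirror the decompiled offsets; real operands also carry subregister indices and flags, omitted here):

```python
from dataclasses import dataclass, field

@dataclass
class Operand:
    type_tag: int   # [operand + 0x00]
    value: int      # [operand + 0x08] register or immediate

@dataclass
class MachineInstr:
    opcode: int                                   # [MI + 0x18]
    operands: list = field(default_factory=list)  # stride-40 array at [MI + 0x08]

def is_identical_to(a: MachineInstr, b: MachineInstr) -> bool:
    """Same opcode, same operand count, per-operand (type, value) match.
    Memory operands are deliberately not compared."""
    if a.opcode != b.opcode or len(a.operands) != len(b.operands):
        return False
    return all(x.type_tag == y.type_tag and x.value == y.value
               for x, y in zip(a.operands, b.operands))

ld1 = MachineInstr(0x40, [Operand(0, 5), Operand(1, 100)])
ld2 = MachineInstr(0x40, [Operand(0, 5), Operand(1, 100)])
print(is_identical_to(ld1, ld2))  # True
```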

Merge Candidate Ordering and the Priority System

The or ecx, 2 pattern at 0x2F33B1C implements a priority-based ordering within hash groups. When building the MergePotentials list, each entry is annotated with a priority value:

| Priority | Condition | Meaning |
|---|---|---|
| 0 | Block ends with unconditional branch only | Safest merge -- no condition changes needed |
| 1 | Block ends with fallthrough (no explicit branch) | Safe -- may need a branch inserted |
| 2+ | Block ends with conditional branch | Riskier -- merge may require condition reversal |

The sort at TryTailMergeBlocks sorts first by hash (grouping candidates), then within each hash group by priority (ascending). This ensures that the O(K^2) pairwise comparison within each hash group tries unconditional-only pairs first. If a merge succeeds for a low-priority (safe) pair, the modified block may no longer be a candidate for a higher-priority (conditional) pair, reducing the number of conditional merges attempted.
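The two-level ordering reduces to a composite sort key; a sketch with hypothetical candidate tuples:

```python
# Each MergePotentials entry: (tail_hash, priority, block_name)
# priority: 0 = unconditional branch, 1 = fallthrough, 2+ = conditional
candidates = [
    (0xBEEF, 2, "bb.7"),   # conditional tail -- tried last
    (0xBEEF, 0, "bb.3"),   # unconditional tail -- tried first
    (0xCAFE, 0, "bb.9"),   # different hash group
    (0xBEEF, 1, "bb.5"),   # fallthrough
]

# Group by hash first, then try the safest (lowest-priority) pairs first
candidates.sort(key=lambda c: (c[0], c[1]))
print([name for _, _, name in candidates])
# ['bb.3', 'bb.5', 'bb.7', 'bb.9']
```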

On NVPTX, this ordering is particularly important because conditional branch reversal (sub_2E09D00 AnalyzeBranch + condition inversion) can alter the fall-through layout. In a structured CFG, the fall-through direction often corresponds to the "then" path of an if-then-else, and reversing the condition flips which path falls through. While this does not change correctness, it can change the reconvergence point's distance from the branch, affecting I-cache locality. By preferring unconditional-only merges, the pass minimizes layout disruption.

OptimizeBranches

OptimizeBranches (sub_2F36310 inner loop) walks every MBB and calls OptimizeBlock to perform local branch simplifications:

  1. Empty-block elimination. If MBB contains only debug instructions, redirect all predecessors to the fallthrough successor.
  2. Unconditional-to-same-target folding. If the previous block's conditional and unconditional branches both target the same block, replace with a single unconditional branch (or fallthrough).
  3. Single-predecessor merge. If MBB has exactly one predecessor and that predecessor falls through unconditionally, splice MBB's instructions into the predecessor and remove MBB.
  4. Redundant branch removal. If the previous block branches only to MBB (the natural fallthrough), remove the branch entirely.
  5. Condition reversal. If the previous block conditionally branches to MBB on true and somewhere else on false, reverse the condition to create a fallthrough.
  6. Tail-block relocation. If MBB has no successors (return/noreturn) and the predecessor could fall through to the next block instead, move MBB to the end of the function and reverse the predecessor's condition.

Each transformation triggers goto ReoptimizeBlock to re-analyze the modified block. Dead blocks (no predecessors after optimization) are removed via sub_2E790D0 (RemoveBlock).

HoistCommonCode

For each block with exactly two successors, if both successors begin with identical instructions, hoist those instructions into the predecessor. This is the inverse of tail merging -- it reduces code size when two divergent paths start with the same setup sequence. The EnableHoistCommonCode flag (always true in cicc) controls this phase.
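A sketch of the hoist on plain tuple-encoded instruction streams (the encoding is hypothetical, and the real pass performs legality checks not modeled here):

```python
def hoist_common_code(pred, succ_a, succ_b):
    """If both successors begin with identical instructions, lift that
    common prefix into the predecessor (the inverse of tail merging)."""
    n = 0
    while (n < len(succ_a) and n < len(succ_b)
           and succ_a[n] == succ_b[n]):
        n += 1
    pred.extend(succ_a[:n])   # hoist the shared prefix
    del succ_a[:n]
    del succ_b[:n]
    return n

pred = [("setp", "p0")]
then_bb = [("mov", "r1", 0), ("ld", "r2"), ("add", "r3")]
else_bb = [("mov", "r1", 0), ("ld", "r2"), ("sub", "r3")]
hoisted = hoist_common_code(pred, then_bb, else_bb)
print(hoisted)  # 2 -- two shared setup instructions moved to the predecessor
```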

Reserved-Register Safety Mechanism

This section documents the NVIDIA-specific reserved-register check that gates tail merging in cicc. This mechanism has no equivalent in upstream LLVM because upstream disables tail merging entirely for structured-CFG targets.

Why Reserved Registers Cannot Cross Merge Boundaries

NVPTX "special registers" (%tid.x, %ntid.x, %ctaid.x, %nctaid.x, %laneid, %warpid, and the SM 90+ cluster registers) are not stored in the virtual register file. They are hardware-defined, read-only values whose definitions are implicit -- there is no MachineInstr that defines %tid.x. Instead, these registers appear as implicit uses on instructions that read thread/block/grid coordinates.

When tail merging creates a new shared tail block CommonTail, LLVM's infrastructure computes the live-in set for CommonTail from the union of live-outs of the merged predecessors. For a normal virtual register, this is safe: the register has a concrete definition (a MachineInstr somewhere in the function), and the live-in annotation tells the downstream passes that the value is available at block entry.

For a reserved register, there is no concrete definition. The value is implicitly available at every point in the function -- it is defined by the hardware thread context, not by any instruction. Creating a new block with a reserved register in its live-in set is semantically meaningless but causes three concrete problems:

  1. LiveIntervals confusion. The LiveIntervals analysis (already computed by this point) has no interval for reserved registers. Adding a live-in for a reserved register to CommonTail would require creating a new LiveInterval that spans from CommonTail's entry to the last use within CommonTail. But reserved registers do not participate in LiveIntervals -- they are excluded during interval construction at sub_2F5A640. The resulting inconsistency triggers assertions in debug builds and can silently corrupt the interference matrix in release builds.

  2. Register pressure miscounting. The greedy register allocator tracks pressure per register class. Reserved registers belong to the internal-only class at off_4A026E0 (the "!Special!" class documented in Register Classes). This class has no encoded ID, no PTX declaration, and is excluded from pressure accounting. If a reserved register appeared as a live-in, the pressure tracker would attempt to look up its class and fail -- or worse, miscount it against one of the nine real classes.

  3. Emission failure. During PTX emission, sub_21583D0 (the register encoding function) maps each register to its 4-bit class tag via vtable comparison. Reserved registers use the off_4A026E0 vtable, which triggers the fatal "Bad register class" error. A reserved register in a live-in set could propagate to a point where the emitter attempts to declare it, causing an unconditional abort.

The hasProperty Check -- sub_2E88A90

sub_2E88A90 is a multi-purpose property query function used across several subsystems in cicc:

| Call site | Flag | Meaning |
|---|---|---|
| BranchFolding (0x2F3427B) | 0x200 | isReservedReg -- register is a hardware-defined special register |
| StructurizeCFG | 0x80000 / 0x100000 | Uniformity/divergence classification |
| InstrEmitter | 0x1000000000 (bit 36) | NVPTX-specific implicit-use flag |

The function takes three arguments:

sub_2E88A90(context_ptr, register_or_operand, flag_mask) -> bool

For the BranchFolding call at 0x2F3427B, the calling convention is:

; rdi = TargetRegisterInfo*  (from MachineFunction->getSubtarget().getRegisterInfo())
; esi = register ID           (physical register number from live-in set)
; edx = 0x200                 (isReservedReg flag)
; returns: al = 1 if reserved, 0 if not

The function internally indexes into a per-register property table at [TRI + 0x58]. This table is initialized during NVPTXRegisterInfo construction (sub_2163AB0 for legacy PM, sub_30590F0 for new PM) and contains one entry per physical register. Each entry is a 64-bit bitmask of properties. The 0x200 bit (bit 9) is set for every register in the NVPTX special/environment register set.
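The lookup reduces to a bitmask test; a minimal model assuming one 64-bit property word per physical register (register IDs and table contents below are illustrative):

```python
IS_RESERVED = 1 << 9     # 0x200, bit 9 of the per-register property word

# Hypothetical slice of the table at [TRI + 0x58]: regID -> property mask
property_table = {
    1: IS_RESERVED,      # e.g. %tid.x
    2: IS_RESERVED,      # e.g. %SP
    3: 0x0,              # an ordinary register
}

def has_property(reg: int, flag_mask: int) -> bool:
    """Model of sub_2E88A90: index the per-register property table
    and test the requested flag bits."""
    return bool(property_table.get(reg, 0) & flag_mask)

print(has_property(1, 0x200), has_property(3, 0x200))  # True False
```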

Which Registers Are Marked Reserved (flag 0x200)

The following registers have bit 9 (0x200) set in the property table and will cause a merge rejection if live across the merge boundary:

| Register Group | PTX Names | Emission Function | Count |
|---|---|---|---|
| Thread ID | %tid.x, %tid.y, %tid.z | sub_21E86B0 (opcodes 0x26--0x28) | 3 |
| Block dimensions | %ntid.x, %ntid.y, %ntid.z | sub_21E86B0 (opcodes 0x29--0x2B) | 3 |
| Block ID | %ctaid.x, %ctaid.y, %ctaid.z | sub_21E86B0 (opcodes 0x2C--0x2E) | 3 |
| Grid dimensions | %nctaid.x, %nctaid.y, %nctaid.z | sub_21E86B0 (opcodes 0x2F--0x31) | 3 |
| Warp/lane ID | %warpid, %laneid | sub_21E86B0 (opcodes 0x5E--0x5F, via sub_3958DA0) | 2 |
| Cluster registers (SM 90+) | %cluster_ctarank, %cluster_nctarank, %cluster_ctaid.{x,y,z}, %cluster_nctaid.{x,y,z}, %clusterid.{x,y,z}, %nclusterid.{x,y,z}, %is_explicit_cluster | sub_21E9060 (values 0--14) | 15 |
| Stack pointer | %SP, %SPL | inline in frame setup | 2 |
| Environment regs | ENVREG0--ENVREG31 | internal (not emitted to PTX) | 32 |

Total: 63 reserved registers. These correspond to the physical register set in NVPTX -- recall that NVPTX has no general-purpose physical registers, so the only physical registers are the special hardware-defined ones plus the stack pointer pair.

The environment registers (ENVREG0--ENVREG31) are used internally by the CUDA runtime to pass kernel arguments and configuration data. They are read-only from the kernel's perspective and never appear explicitly in emitted PTX. Their presence in the reserved set is a safety measure against internal IR manipulations that might introduce them as explicit operands.

The Check in Context: Full Merge Decision Sequence

The reserved-register check is the third of four gates in the merge decision path. The complete sequence at 0x2F33B0F--0x2F34300 is:

MergeDecision(MBB1, MBB2, MinTail):
    // Gate 1: Instruction comparison
    tail_len = ComputeCommonTailLength(MBB1, MBB2)
    if tail_len < MinTail:
        return REJECT

    // Gate 2: Branch analysis feasibility
    ok = AnalyzeBranch(MBB1, ...)
    if !ok:
        return REJECT    // unanalyzable terminator (inline asm, etc.)

    // Gate 3: Reserved-register check (NVPTX-specific)
    for each reg in LiveIns(MBB1[split_point:]) ∪ LiveIns(MBB2[split_point:]):
        if sub_2E88A90(TRI, reg, 0x200):
            return REJECT    // reserved register crosses merge boundary

    // Gate 4: Profitability (code size)
    overhead = 1     // one branch instruction to CommonTail
    if MBB1 needs UpdateTerminator:
        overhead += 1
    if tail_len <= overhead:
        return REJECT    // no net code-size reduction

    return ACCEPT

Gate 3 iterates every register that would be live-in to the proposed CommonTail block. The live-in set is computed by walking the tail instructions backwards and collecting register uses that have no definition within the tail. If any register in this set has the 0x200 property, the entire merge is rejected -- there is no fallback or partial merge.
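Gate 3's live-in computation -- walk the tail backwards, keep uses with no earlier def in the tail, then test each survivor for the reserved bit -- can be modeled as:

```python
def gate3_reject(tail, is_reserved):
    """tail: list of (defs, uses) register-ID sets, in program order.
    Returns True if any upward-exposed use is a reserved register,
    in which case the entire merge is rejected."""
    live = set()
    for defs, uses in reversed(tail):   # walk backwards from the end
        live -= defs                    # a def kills liveness above it
        live |= uses                    # a use extends liveness upward
    return any(is_reserved(r) for r in live)

RESERVED = {"%tid.x"}
# mov r1 <- %tid.x ; add r2 <- r1, r1  => %tid.x is upward-exposed
tail = [({"r1"}, {"%tid.x"}), ({"r2"}, {"r1"})]
print(gate3_reject(tail, RESERVED.__contains__))  # True: merge rejected
```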

Interaction with computeLiveIns -- sub_2E16F10

After a merge is accepted and the CommonTail block is created, sub_2E16F10 (computeLiveIns) populates the new block's live-in set. This function must agree with the pre-merge reserved-register check: if the check passed (no reserved registers), then computeLiveIns will produce a live-in set containing only virtual registers and non-reserved physical registers. The function at sub_2E16F10 performs its own filtering:

computeLiveIns(CommonTail):
    for each reg in upward_exposed_uses(CommonTail):
        if isReserved(reg):
            continue    // redundant safety -- already filtered by Gate 3
        addLiveIn(CommonTail, reg)

The double-check (once in the merge decision, once in computeLiveIns) is a defense-in-depth pattern. The merge decision check prevents the merge from happening at all; the computeLiveIns filter prevents a reserved register from entering the live-in set even if the merge decision check were somehow bypassed (e.g., by a future code change that added a new merge path).

GPU-Specific Considerations

Tail Merging and Warp Divergence

Tail merging on GPU does not interact with warp divergence in the way that branch duplication does. When two blocks A and B both end with the same instruction sequence and share a common successor C, merging the tails into a shared CommonTail block that falls through to C does not change which warps execute which instructions. Every warp that previously executed the tail of A now executes the same instructions in CommonTail; similarly for B. The branch from A (or B) to CommonTail is unconditional and therefore non-divergent by definition.

However, there is one subtle interaction: if A and B are the two sides of a divergent branch, and the tail merge creates CommonTail between them and their common successor C, the reconvergence point may shift. Previously, warps reconverged at C's entry. After the merge, warps reconverge at CommonTail's entry -- which is equivalent but changes the block numbering. StructurizeCFG has already inserted any necessary reconvergence tokens before BranchFolding runs, and those tokens are block-relative. The UpdateTerminator call at sub_2FAD510 and the ReplaceUsesOfBlockWith call at sub_2E0E0B0 update all references, so the reconvergence semantics are preserved.

Code Size vs. Instruction Cache

On GPU, the primary motivation for tail merging is code size reduction, which translates directly to reduced instruction cache pressure. NVIDIA GPUs have small instruction caches per SM partition (32--128 KB depending on architecture generation). Tail merging reduces the number of unique instructions the I-cache must hold.

The tail-merge-size default of 3 reflects the GPU's branch cost: one bra instruction to redirect flow to CommonTail, plus one additional instruction if the predecessor's terminator needs rewriting. With a minimum tail length of 3, the merge always saves at least one instruction's worth of I-cache footprint. On a GPU where each instruction occupies 8--16 bytes (PTX instructions vary in encoding width, but ptxas expands them to fixed-width SASS), a 3-instruction merge saves 24--48 bytes of I-cache per merge site.

The tail-merge-threshold of 150 matches upstream LLVM's default; the value is unchanged, but upstream never exercises it on GPU targets because it disables the entire mechanism there. In practice, GPU kernels rarely have blocks with 150+ predecessors -- the threshold exists primarily to prevent pathological compile times on machine-generated code with massive switch tables.

Structured CFG Preservation Proof

The claim that tail merging preserves structured (reducible) control flow deserves a rigorous argument, since this is the justification for NVIDIA removing the requiresStructuredCFG() gate.

Claim: If the input CFG is reducible, then the CFG after tail merging is also reducible.

Proof sketch: Tail merging performs one operation: it takes two blocks A and B that share a common tail instruction sequence, creates a new block T containing the tail, and replaces the tail portions of A and B with unconditional branches to T. The successors of A and B in the tail (which were the same for both, by construction) become successors of T instead.

Consider the back-edge structure. In a reducible CFG, every cycle has a single entry point (the loop header). Tail merging cannot create a new cycle because:

  • T is a new block with no incoming edges except from A and B.
  • T's outgoing edges are a subset of the original outgoing edges of A and B's tails.
  • No edge into T can form a back-edge of an existing cycle unless A or B was already a back-edge target, in which case the cycle's entry point was A or B, not T.
  • The only new edges are A->T and B->T (unconditional). These cannot create a new cycle because T does not dominate A or B (it was just created).

Therefore, no new irreducible cycle is introduced. The disable-nvptx-require-structured-cfg knob (at qword_5022CC8 in NVPTXTargetMachine) provides a backdoor to disable the structured-CFG requirement entirely, but it is false by default and should never be set in production.
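The claim can be checked mechanically on small graphs. The sketch below uses the standard characterization (a CFG is reducible iff deleting every edge whose target dominates its source leaves an acyclic graph) and simulates a tail merge; the example graphs and the naive dominance test are illustrative only:

```python
def reachable(cfg, src, dst, banned=None):
    """Can src reach dst in cfg without passing through `banned`?"""
    stack, seen = [src], set()
    while stack:
        x = stack.pop()
        if x == banned or x in seen:
            continue
        if x == dst:
            return True
        seen.add(x)
        stack.extend(cfg.get(x, []))
    return False

def is_reducible(cfg, entry):
    """Reducible iff removing every back edge (u, v) -- an edge whose
    target v dominates its source u -- leaves an acyclic graph."""
    def dom(d, n):  # d dominates n: removing d cuts entry off from n
        return d == n or not reachable(cfg, entry, n, banned=d)
    fwd = {u: [v for v in vs if not dom(v, u)] for u, vs in cfg.items()}
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in cfg}
    def cyclic(u):                      # DFS cycle check on forward edges
        color[u] = GRAY
        for v in fwd[u]:
            if color[v] == GRAY or (color[v] == WHITE and cyclic(v)):
                return True
        color[u] = BLACK
        return False
    return not any(color[n] == WHITE and cyclic(n) for n in cfg)

# Blocks A and B share a tail; the merge reroutes both through new block T.
before = {"entry": ["A", "B"], "A": ["C"], "B": ["C"], "C": ["entry"]}
after  = {"entry": ["A", "B"], "A": ["T"], "B": ["T"], "T": ["C"], "C": ["entry"]}
print(is_reducible(before, "entry"), is_reducible(after, "entry"))  # True True
```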

Interaction with EH and Cleanup Pads

NVPTX does not support C++ exceptions in the traditional sense -- there is no stack unwinding on GPU. However, cicc does handle cleanup semantics for CUDA cooperative groups and destructor calls. The branch folding pass skips blocks that are EH landing pads (isEHPad() check at the start of OptimizeBlock). On NVPTX, this check is typically a no-op because no blocks are marked as EH pads, but the check remains active because the same binary serves both CUDA and non-CUDA compilation paths.

Interaction with Convergence Control Tokens

On SM 90+ (Hopper and later), cicc emits convergence control pseudo-instructions (bra.convergent, .pragma "convergent") that are consumed by ptxas to guide reconvergence behavior. These pseudo-instructions are MachineInstrs with specific opcodes that BranchFolding must not merge or reorder. The isIdenticalTo comparison in ComputeCommonTailLength considers opcode, operands, and flags, so two convergence control instructions with different target blocks will not compare as identical and will naturally terminate the common-tail scan. This prevents the tail merger from accidentally merging convergence annotations that belong to different reconvergence points.

Data Structures

The MBBInfo structure passed via rdi to sub_2F336B0:

| Offset | Type | Field |
|---|---|---|
| +0x00 | MachineFunction* | Parent function / block list head |
| +0x08 | MachineBasicBlock* | Fallthrough candidate block |
| +0x10 | BranchAnalysisResult* | Cached result from AnalyzeBranch |
| +0x28 | DenseMap<uint, list> | Hash-to-candidate-list merge table |

The pass allocates a 792-byte stack frame holding:

| Stack variable | Purpose |
|---|---|
| var_2E0 | merge_count (number of merges performed) |
| var_309 | modified flag |
| var_30A | should_try_fold flag (initialized to 1) |
| var_224 | Hash table allocated flag |
| var_1E4 | Operand table allocated flag |

Configuration

| Knob | Type | Default | Effect |
|---|---|---|---|
| disable-branch-fold | bool | false | Skips the entire pass |
| enable-tail-merge | tri-state | unset (uses target default) | Force-enable or disable tail merging |
| tail-merge-threshold | unsigned | 150 | Max predecessors considered per merge round; caps MergePotentials size |
| tail-merge-size | unsigned | 3 | Minimum common tail length (in instructions) to justify a merge |
| branch-fold-placement | bool | true | Enables branch folding within MachineBlockPlacement (separate invocation) |
| ifcvt-branch-fold | bool | true | Enables branch folding within the if-converter pass |

The tail-merge-threshold of 150 exists purely as a compile-time throttle. For a block with N predecessors, the pass performs O(N^2) pairwise comparisons within each hash group. Setting the threshold to 0 effectively disables tail merging for blocks with many predecessors while keeping branch optimization active.

The tail-merge-size of 3 is the break-even point: creating a new shared block plus a branch instruction costs roughly 2 instructions of overhead, so merging fewer than 3 common instructions produces no net code-size reduction.
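Gate 4's break-even arithmetic, as a one-function sketch:

```python
def merge_profitable(tail_len, needs_terminator_rewrite):
    """Net code-size win requires the shared tail to outweigh the
    redirect branch plus any rewritten terminator (Gate 4)."""
    overhead = 1 + (1 if needs_terminator_rewrite else 0)
    return tail_len > overhead

# The default tail-merge-size of 3 clears the worst-case overhead of 2
print(merge_profitable(3, True), merge_profitable(2, True))  # True False
```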

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| BranchFolder::OptimizeFunction | sub_2F36310 | -- | Pass entry; fixed-point loop over TailMerge + OptimizeBranches + HoistCommonCode |
| BranchFolder::OptimizeBlock / inner logic | sub_2F336B0 | 11,347B | Per-block optimization + tail merge core (792-byte stack frame) |
| HashEndOfMBB | sub_2F26260 | -- | Tail hash computation; hashes last non-debug non-terminator instruction |
| isBranchFoldable | sub_2F31250 | -- | Checks if operand represents a foldable branch target |
| Merge candidate map lookup | sub_2F33020 | -- | Hash table lookup in MergePotentials DenseMap |
| TryTailMergeBlocks | sub_2E2B9F0 | -- | Attempts merge across candidate set; calls Gates 1--4 |
| AnalyzeBranch | sub_2E09D00 | -- | NVPTXInstrInfo branch analysis: type, targets, conditions |
| RemoveBranch | sub_2E0C3B0 | -- | Removes terminator branch instructions from MBB |
| InsertBranch | sub_2E0F080 | -- | Inserts new branch instruction to redirect flow |
| ReplaceTailWithBranchTo | sub_2E0A600 | -- | Splices tail into shared block, inserts unconditional redirect |
| ReplaceUsesOfBlockWith | sub_2E0E0B0 | -- | Updates phi nodes and predecessor lists after merge |
| getBlockNumbered | sub_2E192D0 | -- | MBB number to pointer lookup |
| UpdateTerminator | sub_2FAD510 | -- | Fixes terminators after CFG modification |
| RemoveBlock | sub_2E790D0 | -- | Removes dead MBB from function; updates predecessor/successor lists |
| computeLiveIns | sub_2E16F10 | -- | Updates live-in register sets for merged block; filters reserved registers |
| getVRegDef | sub_2EBEE10 | -- | Virtual register definition lookup |
| hasProperty(flag) | sub_2E88A90 | -- | Multi-purpose register/operand property query (flag 0x200 = reserved, 0x80000 = uniform, 0x100000 = divergent) |
| HashMachineInstr | sub_2E89C70 | -- | Instruction hash for merge candidate bucketing (* 37 multiply scheme) |
| SpliceBlock | sub_2E31080 | -- | Unlinks MBB from doubly-linked list |
| NVPTXInstrInfo vtable | sub_2DAC790 | -- | Vtable base checked at 0x2F337A3 to validate InstrInfo supports analyzeBranch |
| Dynamic special register resolver | sub_3958DA0 | -- | Resolves opcodes 0x5E/0x5F to %warpid/%laneid |
| Special register emission | sub_21E86B0 | -- | Emits %tid, %ctaid, %ntid, %nctaid (opcodes 0x26--0x31) |
| Cluster register emission (SM 90+) | sub_21E9060 | -- | Emits 15 cluster registers (%cluster_ctarank, %clusterid, etc.) |

Interaction with StructurizeCFG

StructurizeCFG runs during the IR-level pipeline (before SelectionDAG), while BranchFolding runs after register allocation at the machine level. By the time BranchFolding executes, all control flow is already structured and reducible. The key interaction:

  • StructurizeCFG may insert "Flow" blocks that serve as reconvergence points. These are often empty or contain only an unconditional branch. BranchFolding's empty-block elimination (step 1 of OptimizeBranches) can remove these if they have become redundant after code generation.
  • Tail merging never introduces irreducible control flow because it only adds unconditional branches to a new shared tail block. The new block post-dominates the merged tails, preserving reducibility.
  • The branch-fold-placement knob controls a separate invocation of branch folding logic embedded within MachineBlockPlacement. That invocation runs before the standalone BranchFolding pass and performs a limited subset of the same transformations during layout decisions.

Complexity

The hash-based matching makes the typical case efficient. For N blocks and average predecessor count M, the overall complexity is O(N * M) for hash computation, plus O(K^2 * T) for pairwise comparison within hash groups, where K is the number of blocks sharing a hash and T is the common tail length. The tail-merge-threshold caps K at 150. The recursive self-call pattern (the pass re-invokes itself when a merge creates new opportunities) means worst-case is O(N^2) iterations, but this is rare in practice -- most functions converge in 2-3 iterations.

Differences from Upstream LLVM

| Aspect | Upstream LLVM 20 | cicc v13.0 |
|---|---|---|
| Tail merge for structured-CFG targets | Disabled (requiresStructuredCFG() returns true -> tail merge off) | Enabled -- structured-CFG gate removed |
| Reserved-register merge gate | Not present (unnecessary -- tail merge disabled for GPU) | Gate 3: sub_2E88A90 with flag 0x200 rejects merges when special registers are live across the boundary |
| Priority ordering | Candidates sorted by hash only | Additional priority sort within hash groups: unconditional branches first (priority 0), then conditional (priority 2+) |
| NVPTXInstrInfo vtable check | Not present | cmp rax, offset sub_2DAC790 at 0x2F337A3 validates InstrInfo before merge attempts |
| computeLiveIns filtering | No reserved-register filter | Double-filters reserved registers (once at merge decision, once at live-in computation) |
| Convergence control awareness | Not present (no convergence tokens in upstream) | isIdenticalTo naturally prevents merging convergence pseudo-instructions with different targets |
| MachineInstr stride | 32-byte operand stride | 40-byte operand stride (extra 8 bytes for NVPTX-specific metadata) |
| Upstream source | llvm/lib/CodeGen/BranchFolding.cpp | Binary at 0x2F336B0--0x2F36310 range |

Cross-References

  • Block Placement -- runs before BranchFolding; its branch-fold-placement knob triggers inline branch folding during layout.
  • StructurizeCFG -- guarantees structured control flow before BranchFolding runs; inserts Flow blocks that BranchFolding may later eliminate. Uses the same sub_2E88A90 for divergence queries.
  • Register Allocation -- BranchFolding requires NoPHIs property, meaning it runs post-regalloc in the NVPTX pipeline. The greedy RA at sub_2F5A640 excludes reserved registers from pressure tracking.
  • Instruction Scheduling -- scheduling runs after BranchFolding; the final CFG shape from branch folding determines scheduling regions.
  • Register Classes -- documents the internal-only off_4A026E0 class ("!Special!") that holds reserved/environment registers. The register encoding function sub_21583D0 fatally aborts on this class.
  • PTX Emission -- special register emission functions sub_21E86B0 and sub_21E9060 that handle the 63 reserved registers.
  • NVPTX Target Infrastructure -- the disable-nvptx-require-structured-cfg knob that controls the structured-CFG requirement.
  • Machine-Level Passes -- pipeline context showing BranchFolding's position after register allocation and before instruction scheduling.
  • InstrEmitter -- another consumer of sub_2E88A90 that uses flag bit 36 for NVPTX-specific implicit-use detection.

MachineBlockPlacement for GPU

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

MachineBlockPlacement decides the physical ordering of basic blocks in a MachineFunction. On CPU, it is primarily an I-cache optimization. On GPU, block ordering has deeper consequences: PTX is a structured ISA where every taken branch stalls the SM instruction fetch pipeline, warp divergence must reconverge at post-dominators, and instruction cache capacity is measured in tens of kilobytes per SM partition. cicc carries two separate instances of this pass -- a stock LLVM copy for internal use and an NVPTX-pipeline copy at sub_3521FF0 that participates in GPU-specific analysis. The NVPTX instance queries a divergence flag on the MachineFunction to decide whether tail duplication is profitable, and adds an alternative layout proposal path (sub_34BEDF0 / sub_34C7080) that is absent from upstream LLVM.

Key Facts

Property | Value
Entry point | sub_3521FF0 (82 KB decompiled, 2435 lines)
Pass name | "Branch Probability Basic Block Placement"
Pass ID | "block-placement"
Registration (NVPTX) | sub_350FE30 (pass), sub_350FEE0 (stats)
Registration (generic) | sub_1DE8060 (pass), sub_1DE8500 (stats)
Stats pass ID | "block-placement-stats", callback sub_3517680
Knob constructor | ctor_671_0 at 0x5A0470
Required analyses | MachineBlockFrequencyInfo, MachineBranchProbabilityInfo, MachinePostDominatorTree, MachineLoopInfo, TargetPassConfig

Why Block Placement Matters on GPU

Three properties of GPU execution make block ordering non-trivial.

Instruction fetch pipeline. GPU SMs fetch instructions sequentially. A taken branch introduces a fetch bubble -- the warp scheduler cannot issue from the new target until the instruction cache services the request. Every fall-through edge is free; every taken branch costs at least one cycle of fetch latency. The misfetch-cost (default 1) and jump-inst-cost (default 1) knobs model this cost. Maximizing fall-through sequences directly reduces warp stall cycles at branch points.

Instruction cache pressure. GPU instruction caches are small (typically 32-128 KB per SM partition). Code duplication through tail-dup increases I-cache working set. The tail-dup-placement-penalty (default 2%) penalizes code copies that improve fall-through at the expense of I-cache pressure. The ext-TSP model, when enabled, explicitly optimizes for I-cache utilization by modeling forward/backward reference distances.

Warp divergence. When a branch is divergent (different lanes take different paths), all paths must execute serially, and the warp reconverges at the post-dominator. Block ordering cannot eliminate the divergence cost, but it determines which side of the branch falls through vs. takes a jump. The divergence flag at MF+8+688 bit 0 gates whether tail duplication is even attempted: duplicating a tail block that sits below a divergent branch wastes code size because divergent warps execute both paths regardless of which one falls through.

Pass Object Layout

The pass object at a1 is populated during runOnMachineFunction:

Offset | Type | Content
+488 | ptr | Loop chain working data (cleared by sub_35142F0)
+520 | MachineFunction* | Current function being processed
+528 | ptr | MachineBlockFrequencyInfo* (adjusted +169 from raw analysis pointer)
+536 | ptr | MachineBranchProbabilityInfo* (40-byte struct at +200)
+544 | ptr | MachinePostDominatorTree* (+200)
+552 | u64 | Working state (cleared to 0)
+560 | ptr | TargetInstrInfo* (nullptr if default vtable)
+568 | ptr | TargetRegisterInfo* (nullptr if default vtable)
+576 | ptr | TailDuplicator* (from unk_50209DC analysis, +200)
+584 | ptr | MachineLoopInfo*
+592 | ptr | TargetPassConfig*
+600 | inline | Chain-builder state (initialized by sub_2FD5DC0)
+776 | u64 | Profile-derived hot threshold
+784 | i32 | Tail-dup threshold (2 or 4)
+788 | bool | Profile count was explicitly provided
+792 | ptr | Bump allocator base (for chain node allocation)
+800 | u64 | Bump allocator capacity
+872 | u64 | Bump allocator total allocation counter
+888 | struct | Chain-map (BB-to-chain DenseMap, queried via sub_3515040)

Chain nodes are 64 bytes each, allocated from the bump allocator:

struct ChainNode {          // 64 bytes
    MachineBasicBlock** bb_array;   // +0:  pointer to BB array (initially +16)
    uint32_t count;                 // +8:  number of BBs in chain
    uint32_t capacity;              // +12: capacity (initial: 1)
    MachineBasicBlock* inline_bb;   // +16: inline storage for single-BB chain
    uint8_t  padding[24];           // +24: space for up to 3 more inline BBs
    void*    chain_map;             // +48: pointer to parent chain-map
    uint64_t flags;                 // +56: chain flags
};
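The growth behavior implied by this layout can be sketched as a small-size-optimized vector. This is a hypothetical reconstruction for illustration, not the binary's code; the doubling policy and method names are assumptions, and the append path loosely mirrors the grow-on-append behavior attributed to sub_C7D6A0:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Simplified model of the recovered 64-byte chain node: a vector of
// basic-block pointers whose first slot lives inline in the node itself.
struct ChainNode {
    void**   bb_array;      // +0:  points at inline_bbs until the chain outgrows it
    uint32_t count;         // +8
    uint32_t capacity;      // +12
    void*    inline_bbs[4]; // +16: inline storage (1 slot used initially, 3 spare)
    void*    chain_map;     // +48: parent chain-map
    uint64_t flags;         // +56

    void init(void* first_bb, void* map) {
        bb_array      = inline_bbs;
        inline_bbs[0] = first_bb;
        count = capacity = 1;   // matches the recovered initial capacity of 1
        chain_map = map;
        flags     = 0;
    }

    // Append one BB, migrating from inline storage to the heap on overflow
    // (doubling growth is an assumption).
    void append(void* bb) {
        if (count == capacity) {
            uint32_t new_cap = capacity * 2;
            void**   heap = (void**)std::malloc(new_cap * sizeof(void*));
            std::memcpy(heap, bb_array, count * sizeof(void*));
            if (bb_array != inline_bbs) std::free(bb_array);
            bb_array = heap;
            capacity = new_cap;
        }
        bb_array[count++] = bb;
    }
};
```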

Algorithm Overview

The entry point sub_3521FF0 dispatches to one of two layout algorithms: the standard chain-based placement, or the ext-TSP layout when explicitly enabled. The overall flow:

runOnMachineFunction(MF):
    if MF.empty(): return 0

    // Fetch analyses
    MBFI  = getAnalysis<MachineBlockFrequencyInfo>()
    MBPI  = getAnalysis<MachineBranchProbabilityInfo>()
    MPDT  = getAnalysis<MachinePostDominatorTree>()
    MLI   = getAnalysis<MachineLoopInfo>()
    TPC   = getAnalysis<TargetPassConfig>()
    TII   = MF.getSubtarget().getInstrInfo()
    TRI   = MF.getSubtarget().getRegisterInfo()

    // Compute tail-dup threshold
    threshold = computeTailDupThreshold(optLevel, TII)

    // Decide layout algorithm
    if enable-ext-tsp-block-placement AND MF.size() fits:
        applyExtTsp(MF)
    else:
        buildChains(MF)           // sub_3521900
        tailDupPlacement(MF)      // sub_35185B0 (if enabled + not divergent)
        tryAlternativeLayout(MF)  // sub_34BEDF0 + sub_34C7080 (NVIDIA addition)

    // Post-placement
    optimizeBranches()            // flip branches for fall-through
    alignBlocks()                 // sub_3516980
    cleanup()
    return 1

Chain-Based Placement (Standard Path)

sub_3521900 (buildChains) is the workhorse. It operates in four steps.

Step 1 -- Initial Chain Construction

For every BB in the MachineFunction (iterated via the doubly-linked intrusive list from MF+328 to sentinel MF+320), the builder:

  1. Allocates a 64-byte chain node from the bump allocator at pass+792. The node is initialized with count=1, capacity=1, the inline BB pointer set to the current BB, and the chain-map pointer set to pass+888.
  2. Inserts the BB-to-chain mapping into the chain-map via sub_3515040 (DenseMap insert with pointer hash ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1)).
  3. Attempts to extend the chain forward: calls TII->analyzeBranch() (vtable+344) on the current BB. If analyzable and a fall-through successor exists, calls sub_2E32580 to verify the successor is valid for chaining (not already claimed by a different chain, not a landing pad, not the function entry if it would create a cycle). If valid, the successor is appended to the chain's BB array (growing from inline storage to heap allocation via sub_C7D6A0 when needed), and the walk continues from the successor.

The result is a set of maximal fall-through chains -- each chain represents a sequence of BBs where every transition is a fall-through edge according to analyzeBranch.

Step 2 -- Loop Chain Merging

Read the MachineLoopInfo structure at pass+584. Iterate loops from innermost outward. For each loop, call sub_351EBB0 (buildLoopChains), which:

  1. Identifies all chains that contain BBs belonging to the loop.
  2. Merges these chains into a single loop chain, ordering them to maximize fall-through within the loop body.
  3. Applies loop rotation via sub_351C710 (rotateLoop) to place the exiting block at the bottom, making the back-edge a fall-through and the exit a taken branch (or the reverse, whichever minimizes cost according to the profile data).

Cold blocks within the loop (where loop_freq / block_freq > loop-to-cold-block-ratio) are ejected from the loop chain and will be placed at the function's end during the commit step.
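The ratio test can be written division-free (isColdInLoop is an illustrative name; the frequencies stand in for raw MachineBlockFrequencyInfo counts):

```cpp
#include <cstdint>

// A block is "cold" relative to its loop when the loop header runs more than
// loop-to-cold-block-ratio times as often as the block. This is the
// division-free form of: loop_freq / block_freq > ratio.
bool isColdInLoop(uint64_t loop_freq, uint64_t block_freq, uint64_t ratio = 5) {
    return loop_freq > ratio * block_freq;
}
```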

Step 3 -- Global Successor Ordering

Call sub_35157A0 (selectBestSuccessor) for each BB to find the globally best successor chain ordering. The selection considers:

  • Edge probability from sub_2E441D0 (getEdgeProbability)
  • Whether the successor is already the fall-through (free) or would require a taken branch (cost = misfetch-cost + jump-inst-cost)
  • Whether chaining the successor would break an existing profitable chain connection

Then sub_351D700 (buildChainForBlock) performs a greedy walk from the function entry, building the top-level chain by repeatedly selecting the best unchained successor and appending it.

Step 4 -- Commit

Walk the final chain's BB array and splice each BB into position using intrusive-list pointer swaps on the MachineFunction's BB list (pointer updates at BB+0 and BB+8 -- the prev/next pointers of the doubly-linked list).
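The splice amounts to standard intrusive-list pointer surgery. A minimal sketch, assuming the prev/next fields at BB+0 and BB+8 and a sentinel-terminated circular list as LLVM's ilist uses (Node and moveBefore are illustrative names):

```cpp
// Minimal intrusive doubly-linked list node, mimicking the prev/next
// pointers at BB+0 and BB+8 of a MachineBasicBlock.
struct Node {
    Node* prev;
    Node* next;
    int   id;
};

// Unlink `bb` from wherever it sits and re-insert it immediately before
// `pos` -- the per-block operation of the commit step's walk.
void moveBefore(Node* bb, Node* pos) {
    bb->prev->next = bb->next;   // unlink from current position
    bb->next->prev = bb->prev;
    bb->prev = pos->prev;        // splice in front of pos
    bb->next = pos;
    pos->prev->next = bb;
    pos->prev = bb;
}
```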

Ext-TSP Layout (Optional Path)

When enable-ext-tsp-block-placement is true (default: false), the pass uses the Extended Travelling Salesman Problem formulation from LLVM's CodeLayout.h. This is a profile-guided model that explicitly optimizes I-cache utilization by penalizing backward references and rewarding fall-through edges.

The ext-TSP path builds a BB index hash-map using LLVM's DenseMap pattern (hash: (ptr >> 9) ^ (ptr >> 4), 75% load factor), computes block frequencies and edge weights, then runs three solver functions:

Function | Role
sub_29BAF70 | calcExtTspScore() -- score the original layout
sub_29BAC40 | calcExtTspScore() -- score the alternative layout
sub_29BB2B0 | computeExtTspLayout() -- reorder chains by ext-TSP objective

The pass compares original vs. reordered cost and commits the better ordering via sub_3519A10 (applyBlockOrder). Additional ext-TSP tuning knobs (registered in ctor_492 at 0x5545a0):

Knob | Description
ext-tsp-forward-weight-cond / uncond | Weight for conditional/unconditional forward jumps
ext-tsp-backward-weight-cond / uncond | Weight for conditional/unconditional backward jumps
ext-tsp-fallthrough-weight-cond / uncond | Weight for fall-through edges
ext-tsp-forward-distance / backward-distance | Distance thresholds for cache modeling
ext-tsp-max-chain-size | Maximum chain size for ext-TSP merging
ext-tsp-chain-split-threshold | Threshold for splitting chains
ext-tsp-max-merge-density-ratio | Density ratio cap for chain merges
ext-tsp-apply-without-profile | Run ext-TSP even without PGO data
cdsort-cache-entries / cache-size | CDSort cache model parameters
cdsort-max-chain-size | CDSort chain size limit
cdsort-distance-power / frequency-scale | CDSort cost model tuning
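The objective these knobs parameterize can be sketched as follows. This is a much-simplified model: the default weights, the linear distance decay, and the name edgeScore are illustrative stand-ins for LLVM's CodeLayout implementation, which has more cases:

```cpp
#include <cstdint>

// Score one CFG edge under a simplified ext-TSP model: fall-through edges
// earn full weight, short forward/backward jumps earn a distance-decayed
// weight, and far jumps earn nothing. The weights correspond to the
// ext-tsp-*-weight knobs; the distances to ext-tsp-forward/backward-distance.
double edgeScore(int64_t src_end, int64_t dst_begin, uint64_t count,
                 double w_fallthrough = 1.0, double w_forward = 0.1,
                 double w_backward = 0.1, int64_t fwd_dist = 1024,
                 int64_t bwd_dist = 640) {
    int64_t d = dst_begin - src_end;
    if (d == 0)
        return w_fallthrough * count;                           // fall-through
    if (d > 0 && d < fwd_dist)
        return w_forward * count * (1.0 - double(d) / fwd_dist);    // short forward jump
    if (d < 0 && -d < bwd_dist)
        return w_backward * count * (1.0 - double(-d) / bwd_dist);  // short backward jump
    return 0.0;                                                 // out of modeled cache range
}
```

The layout score is the sum of edgeScore over all profiled edges; computeExtTspLayout searches for the block ordering that maximizes it.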

NVIDIA-Specific Modifications

Divergence-Gated Tail Duplication

The most significant GPU-specific behavior is the divergence check before tail duplication. At step (G) in the algorithm, the pass reads MF+8+688 bit 0 -- a flag set by earlier divergence analysis passes indicating the function contains warp-divergent branches. When this bit is set, sub_35185B0 (tailDupPlacement) is skipped entirely.

The rationale: tail duplication creates an additional copy of a basic block to convert a diamond-shaped CFG into a straight-line fall-through. On CPU, this eliminates a taken branch on the hot path. On GPU with divergent branches, both sides of the diamond execute regardless (the warp mask simply toggles), so duplicating the tail block doubles code size for zero fall-through benefit. The divergence flag is a conservative gate -- it disables tail-dup for the entire function, not per-branch.
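A sketch of the gate (hypothetical names; the flag word is the one recovered at MF+8+688, and the knob is tail-dup-placement):

```cpp
#include <cstdint>

// Conservative whole-function gate: bit 0 of the divergence flag word.
// When set, tailDupPlacement (sub_35185B0) is skipped for the entire
// function, not per-branch.
bool tailDupAllowed(uint64_t mf_divergence_flags, bool tail_dup_placement_knob) {
    bool divergent = (mf_divergence_flags & 1) != 0;
    return tail_dup_placement_knob && !divergent;
}
```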

Alternative Layout Proposal Algorithm

When the standard chain-based path is selected (not ext-TSP), the function has more than 3 basic blocks, profile data is available, and the function is not marked divergent, the pass runs a complete alternative layout evaluation through a pipeline absent from upstream LLVM. This is one of cicc's most significant code-layout additions.

Activation Gate

if (byte_503C568 is set AND MF.size() > 3):
    evaluator = sub_34BEDF0(state, profile_flag, MBFI, TII, MBPI)
    changed   = sub_34C7080(evaluator, MF, chain_data, ...)
    if changed:
        commit(evaluator_layout)

The gate variable byte_503C568 corresponds to the branch-fold-placement knob (default true). When branch-fold-placement is active and the function has enough basic blocks to justify the extra analysis cost, the alternative path fires.

State Object Initialization -- sub_34BEDF0 (321 bytes)

sub_34BEDF0 is a constructor that initializes a 0x100-byte evaluator state object. It takes six arguments: (rdi=state, rsi=profile_available, rdx=?, rcx=MBFI*, r8=TII*, r9=MBPI*). The initialization zeroes the majority of the structure and sets up internal storage pointers:

struct LayoutEvaluatorState {      // 0x100 bytes, initialized by sub_34BEDF0
    void*    bb_array_ptr;         // +0x00: BB ordering array (initially null)
    uint64_t bb_array_size;        // +0x08: count
    uint64_t bb_array_cap;         // +0x10: capacity
    uint64_t iteration_count;      // +0x18: cleared to 0
    void*    inline_storage_ptr;   // +0x20: points to +0x38 (inline array)
    uint64_t initial_capacity;     // +0x28: set to 2
    uint32_t current_count;        // +0x30: set to 0
    uint8_t  is_fresh;             // +0x34: set to 1 (first-run flag)
    uint8_t  padding[3];           // +0x35
    uint8_t  inline_array[72];     // +0x38: inline storage for small chains
    uint8_t  profile_available;    // +0x80: bit 0 = profile flag
    uint8_t  force_mode;           // +0x81: set from qword_503AD08 knob
    uint8_t  divergence_aware;     // +0x82: set from dl argument
    uint8_t  needs_reconverge;     // +0x83: cleared to 0, set during evaluation
    uint32_t bb_limit;             // +0x84: from stack argument (BB count cap)
    // +0x88..+0xA8: five qword slots, all zeroed
    void*    bb_ptr_array;         // +0xB0: points to +0xC8 (inline)
    uint64_t bb_ptr_array_pad;     // +0xB8: cleared
    uint64_t bb_ptr_array_cap;     // +0xC0: set to 8
    uint8_t  bb_ptr_inline[24];    // +0xC8: inline BB pointer storage
    uint64_t total_cost;           // +0xD8: cleared
    uint32_t cost_flags;           // +0xE0: cleared
    void*    mbfi_ptr;             // +0xE8: MachineBlockFrequencyInfo*
    void*    tii_ptr;              // +0xF0: TargetInstrInfo*
    void*    mbpi_ptr;             // +0xF8: MachineBranchProbabilityInfo*
};

The force_mode field at offset +0x81 is set based on the global qword_503AD08. When this global equals 0, the force mode takes the profile_available argument. When it equals 1, force mode is unconditionally set to 1 (always evaluate). Any other value causes a straight return (skip evaluation). This provides a three-way override: 0=auto, 1=always, other=never.
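The three-way override can be expressed as a small predicate (illustrative names; EvalMode and the function names are not from the binary):

```cpp
#include <cstdint>

enum class EvalMode { Auto, Always, Never };

// Three-way override derived from the global qword_503AD08:
// 0 = follow the profile_available argument, 1 = always evaluate,
// any other value = skip evaluation entirely.
EvalMode resolveForceMode(uint64_t knob) {
    if (knob == 0) return EvalMode::Auto;
    if (knob == 1) return EvalMode::Always;
    return EvalMode::Never;
}

bool shouldEvaluate(uint64_t knob, bool profile_available) {
    switch (resolveForceMode(knob)) {
    case EvalMode::Auto:   return profile_available;
    case EvalMode::Always: return true;
    default:               return false;
    }
}
```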

Dispatch Wrapper -- sub_34C7080 (17 bytes)

sub_34C7080 is a thin guard:

// sub_34C7080(rdi=evaluator, rsi=MF, rdx=chain_data, rcx=..., r8=..., r9=changed_flag)
if (rdx == NULL) return 0;     // no chain data -> nothing to evaluate
return sub_34C6AF0(rdi, rsi, rdx, rcx, r8, (bool)r9);

The NULL check on rdx (the chain-data pointer) provides a fast exit when the chain builder produced no intermediate state worth re-evaluating.

Core Layout Evaluator -- sub_34C6AF0 (1419 bytes)

sub_34C6AF0 is the real body of the alternative layout evaluator. It operates on the evaluator state object (from sub_34BEDF0) and the MachineFunction, performing a complete re-evaluation of the chain-based layout against a different cost model. The algorithm proceeds in six steps:

Step 1 -- Iteration counter and hash table reset. Increment the iteration count at state+0x18. If the hash table at state+0x20 is not fresh (byte at state+0x34 is 0), compute a minimum table size as max(32, 4 * (capacity - count)), and if the current table is undersized, fill it with 0xFF sentinels via memset. This hash table tracks which BBs have been visited during the current evaluation pass.

Step 2 -- State initialization from MachineFunction. Clear the running cost accumulator at state+0x2C..+0x30. Read the first BB from the MachineFunction's chain data. Store the chain data pointer, the iteration limit from state+0x84, and the analysis pointers (MBPI at state+0x98, TII at state+0xA0) into the evaluator's working slots.

Read an unidentified subtarget field (MF->getSubtarget() + 0x220) and subtract 0x2A (decimal 42). This produces an SM-generation index (sm_70=0, sm_75=1, sm_80=2, ..., sm_90=6, sm_100=16 based on this encoding). This index determines which cost table row is used for the fetch-penalty model.

Step 3 -- Divergence-aware block scanning. For the first BB in the chain, check bit 2 of the flags at BB[0]+0x158. If set, dispatch to TII->vtable+0x210 (which is compared against sub_2FF52D0 -- the default stub). If the target overrides this vtable slot, call the override with the MachineFunction to determine if the block needs special handling. When the default is in use, set state+0x83 (needs_reconverge) to 1 unconditionally. This appears to be an NVPTX check for whether the block is in a reconvergence region where layout ordering has correctness implications, not just performance.

Step 4 -- Main evaluation loop. Call sub_34BA1B0 to snapshot the current chain state into a temporary structure on the stack. Then enter the main loop:

while (true):
    status = sub_34C4890(state, MF)   // advance to next BB in evaluation order
    changed_bit = (state->profile_available XOR 1) OR status
    if changed_bit == 0:
        // Reached the evaluation boundary without changes
        if (sm_index <= 1):           // sm_70 or sm_75
            check qword_503AA68 knob  // additional gate for older archs
            if set: call sub_34C0690(state, loop) for each loop in MF
        if state->divergence_aware:
            call sub_34C56D0(state, loop) for each loop in MF
        break if no further changes

    // A change was proposed
    call sub_34C2D70(state, MF)       // apply the proposed reordering step
    accumulate changed flags

    if (sm_index <= 1):               // sm_70/sm_75
        check qword_503AA68 knob
        if set: call sub_34C0690 for each loop
    if state->divergence_aware:
        call sub_34C56D0 for each loop

    if changed_this_iteration:
        continue loop

sub_34C4890 advances through the MachineFunction's basic blocks in frequency-priority order, proposing a reordering when a higher-frequency successor is not the current fall-through. sub_34C2D70 performs the actual chain manipulation to implement the proposed swap.

Step 5 -- Loop-level re-evaluation. The calls to sub_34C56D0 (5137 bytes, called from sub_34C6AF0 via the loop-iteration path at 0x34C6E90) perform loop-level cost re-evaluation. This function:

  • Walks the MachineFunction's loop tree (from MF+0x148, the MachineLoopInfo block list)
  • For each loop, evaluates whether the proposed layout improves or degrades the loop body's fall-through density
  • Calls sub_34C0EE0 for block-level cost queries
  • Calls sub_34BE7F0 for chain adjacency analysis
  • Queries sub_2E88AF0 (divergence analysis) and sub_2E88FE0 for convergence properties
  • Uses sub_2FDC710/sub_2FDC700 for target-specific cost overrides via the TII vtable
  • Calls sub_3509790 for reconvergence point identification

sub_34C0690 (called on the sm_70/sm_75 path gated by qword_503AA68) is a lighter variant that omits the divergence-aware sub-evaluations, appropriate for older SM architectures where divergence reconvergence is handled differently.

Step 6 -- Final cost comparison and bitvector scan. After the evaluation loop terminates, build a bitvector tracking which BBs changed position. The bitvector uses 64-bit words with word index = bb_index >> 6 and bit position = bb_index & 63. Walk the MachineFunction's loop tree blocks (MF+0x148 linked list):

  • For each block in the loop, walk the instruction list starting at BB+0x20
  • For each instruction, mask the opcode with 0xFFFFFF and compute opcode * 5 as a stride
  • If the instruction byte at offset 0 is 0x08 (a branch instruction), set the corresponding bit in the bitvector

Then scan the bitvector against the evaluator's proposed ordering to detect any BB that would need to move. If at least one BB is displaced, set the return flag.
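The word/bit indexing is the standard 64-bit bitvector scheme; a minimal sketch (BitVector64 is an illustrative name):

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Standard 64-bit bitvector indexing as used by the branch-tracking scan:
// word index = bb_index >> 6, bit position = bb_index & 63.
struct BitVector64 {
    std::vector<uint64_t> words;
    explicit BitVector64(size_t nbits) : words((nbits + 63) / 64, 0) {}
    void set(size_t i)        { words[i >> 6] |= uint64_t(1) << (i & 63); }
    bool test(size_t i) const { return (words[i >> 6] >> (i & 63)) & 1; }
};
```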

On the final cost-comparison path (at 0x34C6FD3), the evaluator reads TII->vtable+0x5D8 and compares against sub_2FDC810. If the target overrides this slot, the override is called to provide a final accept/reject decision. Otherwise, a default threshold of 3 is used: the proposed layout is accepted only if the cost reduction exceeds the acceptance threshold. The stat-based knobs at dword_503AAC8 and qword_503AB48 provide tuning for the threshold lookup via the sub_C52410/sub_C959E0 statistics infrastructure.

Shared Infrastructure with Register Allocation

A surprising discovery: sub_34BEDF0 and sub_34C7080/sub_34C6AF0 are also called from sub_34ED530 (RegAllocGreedy, 91 KB) via sub_34F1190. The register allocator uses the same layout evaluator to assess whether a spill-induced block split would degrade code layout quality. This sharing keeps the cost model consistent between register-allocation decisions and post-RA block placement, preventing the two passes from working at cross purposes. The evaluator state is separate per invocation (stack-allocated), so there is no state leakage between the two callers.

SM-Generation-Dependent Behavior

The SM index computation ((MF->getSubtarget()+0x220) - 0x2A) creates generation-dependent behavior:

SM Generation | Index | Loop Evaluator | Divergence Sub-Eval
sm_70 (Volta) | 0 | sub_34C0690 if qword_503AA68 | Only if divergence flag
sm_75 (Turing) | 1 | sub_34C0690 if qword_503AA68 | Only if divergence flag
sm_80+ (Ampere+) | 2+ | Skipped (only sub_34C56D0) | Always if divergence flag

This split reflects the hardware difference: Volta and Turing use a stack-based reconvergence mechanism that benefits from the lighter sub_34C0690 analysis, while Ampere and later use the uniform warp scheduler where the more thorough sub_34C56D0 evaluation is worthwhile.
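The gate reduces to a small predicate (a sketch; smIndex and the raw-field encoding are inferred from the recovered subtraction of 0x2A):

```cpp
#include <cstdint>

// SM-generation index as recovered: raw subtarget field minus 0x2A.
int smIndex(uint32_t raw_field) { return int(raw_field) - 0x2A; }

// Index <= 1 (sm_70/sm_75) takes the lighter sub_34C0690 path when the
// qword_503AA68 knob is set; later generations rely on sub_34C56D0 alone.
bool usesLightLoopEvaluator(uint32_t raw_field, bool knob_503AA68) {
    return smIndex(raw_field) <= 1 && knob_503AA68;
}
```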

Dual Pass Registration

The binary contains two complete instances of MachineBlockPlacement:

Instance | Registration | Purpose
sub_350FE30 (NVPTX) | NVPTX backend pipeline | GPU-specific analysis results, divergence-aware
sub_1DE8060 (generic) | Default LLVM pipeline | Standard pass for any non-GPU path

Having a separate NVPTX instance allows NVIDIA to control pass ordering independently. The NVPTX version is inserted at a specific point in the backend pipeline where divergence analysis results are available.

Target Tail-Dup Threshold Override

The tail-dup threshold (how many instructions a tail block can have before duplication is rejected) is determined by a multi-level decision:

default_threshold = 2                               // tail-dup-placement-threshold
aggressive_threshold = 4                            // tail-dup-placement-aggressive-threshold

if TII->getTailDupThreshold(optLevel) overrides:    // vtable+1488
    threshold = TII_override                        // NVPTX can take full control
elif optLevel > 2 (-O3):
    threshold = aggressive_threshold                // 4
else:
    threshold = default_threshold                   // 2

The default stub at sub_2FDC800 returns 2 * ((optLevel > 2) + 1), i.e., 2 at -O2 and 4 at -O3. If NVPTX's TargetInstrInfo overrides this (the pass explicitly checks whether the vtable slot points to sub_2FDC800), the override takes full control. This allows the NVPTX backend to set a different tail-dup aggressiveness based on SM generation or kernel properties.
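The stub's recovered formula can be checked directly (the function name is illustrative):

```cpp
// sub_2FDC800's recovered formula: 2 * ((optLevel > 2) + 1),
// i.e. the default threshold 2 below -O3 and the aggressive threshold 4 at -O3.
int defaultTailDupThreshold(int optLevel) {
    return 2 * ((optLevel > 2) + 1);
}
```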

Loop Rotation and Header Placement

Loop rotation (sub_351C710, called from buildLoopChains) determines whether the loop header is placed at the top or bottom of the loop chain. The goal is to place the exiting block at the bottom so the back-edge is a fall-through and the exit is a taken branch (or vice versa, whichever is more profitable).

Two rotation strategies exist:

Basic rotation (default): Place the exiting block last. Skip rotation if the header already has a viable fall-through from outside the loop, unless the exit edge frequency exceeds the fall-through frequency. This avoids introducing an unnecessary branch at loop entry.

Profile-guided rotation (precise-rotation-cost): Enumerate all possible rotations, compute fall-through cost for each (missed fall-through from loop entry, missed fall-throughs at exit points, missed back-edge fall-through), and select the rotation with minimum total cost. Controlled by two knobs:

  • precise-rotation-cost (default false): enable profile-guided rotation cost model
  • force-precise-rotation-cost (default false): force it even without good profile data

For GPU kernels, where loops are the dominant compute pattern, correct loop rotation determines whether the loop body executes as a straight fall-through sequence or requires a taken back-edge branch on every iteration. Since misfetch-cost is low (default 1), the benefit is modest per iteration but accumulates over the millions of iterations typical of GPU compute.

Hot/Cold Splitting

cicc does not perform function-level hot/cold splitting. This is expected: GPU kernels are designed for all threads in a warp to execute the same path. There is no equivalent of a CPU "cold" exception handler that should be placed far from hot code. The loop-to-cold-block-ratio knob (default 5) does enable outlining individual cold blocks from loop chains -- moving them to the end of the function -- but this is intra-function block reordering, not function splitting.

The knob force-loop-cold-block (default false) forces cold block outlining from loops regardless of the frequency ratio. When loop_freq / block_freq > loop-to-cold-block-ratio, the block is moved out of the loop chain to reduce the loop body's I-cache footprint.

Post-Placement Passes

After layout is committed, two post-processing steps run:

Branch optimization. Walk the final BB ordering. For each analyzable branch with profile info, check whether reversing the branch direction would improve fall-through. Call TII->reverseBranchCondition() (vtable+880) to flip the condition, then update the branch targets via vtable+360/368. This is controlled by sub_2EE6AD0, which checks profitability by comparing edge costs from sub_2E441D0 (getEdgeProbability).

Block alignment (sub_3516980). Walk each BB and set alignment based on block frequency, loop depth, and whether the block is a fall-through target. Controlled by:

  • align-all-blocks (default 0): force log2 alignment on every block
  • align-all-nofallthru-blocks (default 0): force alignment on blocks without fall-through predecessors
  • max-bytes-for-alignment (default 0): cap padding bytes

On GPU, block alignment is generally not useful -- PTX does not expose alignment constraints on basic blocks, and the hardware instruction fetch unit does not benefit from aligned block boundaries the way a CPU I-cache line does.

Configuration Knobs

All knobs are LLVM-standard with stock defaults. The NVIDIA delta is behavioral, not configurational.

Knob | Type | Default | Effect
disable-block-placement | bool | false | Disable the pass entirely
enable-block-placement-stats | bool | false | Collect placement statistics
tail-dup-placement | bool | true | Enable tail duplication during placement
tail-dup-placement-threshold | int | 2 | Max instructions for tail-dup candidate
tail-dup-placement-aggressive-threshold | int | 4 | Aggressive threshold at -O3
tail-dup-placement-penalty | int | 2 | I-cache pressure penalty (percent)
tail-dup-profile-percent-threshold | int | 50 | Min hot-count percentage for profile-guided tail-dup
triangle-chain-count | int | 2 | Consecutive triangles before triangle heuristic activates
branch-fold-placement | bool | true | Fold branches during placement
misfetch-cost | int | 1 | Taken-branch fetch penalty
jump-inst-cost | int | 1 | Cost of a jump instruction
block-placement-exit-block-bias | int | 0 | Frequency percentage for loop exit replacement
loop-to-cold-block-ratio | int | 5 | Ratio threshold for cold block outlining
force-loop-cold-block | bool | false | Force outlining cold blocks from loops
precise-rotation-cost | bool | false | Profile-guided loop rotation cost
force-precise-rotation-cost | bool | false | Force precise rotation cost
align-all-blocks | int | 0 | Force block alignment (log2)
align-all-nofallthru-blocks | int | 0 | Force alignment on non-fall-through blocks
max-bytes-for-alignment | int | 0 | Max padding for alignment
enable-ext-tsp-block-placement | bool | false | Enable ext-TSP layout algorithm
ext-tsp-block-placement-max-blocks | int | -1 | Max BB count for ext-TSP (unlimited)
apply-ext-tsp-for-size | bool | false | Use ext-TSP for code size optimization
renumber-blocks-before-view | bool | false | Renumber BBs before dot-graph output

DenseMap Implementation Pattern

The pass uses LLVM's DenseMap for BB-to-chain and BB-to-index lookups. The open-addressing hash-map pattern appears 20+ times in the decompiled code:

// Hash function for pointer keys
size_t hash = ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1);

// Probing: linear with increment counter
// Empty sentinel:   0xFFFFFFFFFFFFF000 (-4096)
// Deleted sentinel: 0xFFFFFFFFFFFFE000 (-8192)
// Rehash trigger:   4 * (count + 1) >= 3 * bucket_count  (75% load)
// Rehash function:  sub_2E3E470(map, new_capacity)
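A runnable sketch of this pattern (simplified: rehashing is omitted, PtrMap is an illustrative name, and raw integers stand in for pointer keys; LLVM's real DenseMap additionally stores values inline with keys):

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Open-addressing pointer map mirroring the recovered DenseMap pattern:
// hash ((ptr >> 9) ^ (ptr >> 4)), empty/deleted sentinels, probing with an
// increasing increment, and the 75% load-factor rehash trigger (the rehash
// itself, sub_2E3E470, is not reproduced here).
struct PtrMap {
    static constexpr uint64_t kEmpty   = 0xFFFFFFFFFFFFF000ULL; // -4096
    static constexpr uint64_t kDeleted = 0xFFFFFFFFFFFFE000ULL; // -8192

    std::vector<uint64_t> keys, vals;
    size_t count = 0;

    explicit PtrMap(size_t buckets)                  // buckets must be a power of 2
        : keys(buckets, kEmpty), vals(buckets, 0) {}

    bool insert(uint64_t key, uint64_t val) {
        if (4 * (count + 1) >= 3 * keys.size()) return false; // would need rehash
        size_t mask = keys.size() - 1;
        size_t idx  = ((key >> 9) ^ (key >> 4)) & mask;
        for (size_t probe = 1;; ++probe) {
            if (keys[idx] == kEmpty || keys[idx] == kDeleted) {
                keys[idx] = key; vals[idx] = val; ++count;
                return true;
            }
            if (keys[idx] == key) { vals[idx] = val; return true; }
            idx = (idx + probe) & mask;              // increment-counter probing
        }
    }

    bool lookup(uint64_t key, uint64_t* out) const {
        size_t mask = keys.size() - 1;
        size_t idx  = ((key >> 9) ^ (key >> 4)) & mask;
        for (size_t probe = 1;; ++probe) {
            if (keys[idx] == kEmpty) return false;   // hit an empty slot: absent
            if (keys[idx] == key) { *out = vals[idx]; return true; }
            idx = (idx + probe) & mask;
        }
    }
};
```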

GPU-Specific Placement Considerations

Why This Pass Matters More on GPU Than on CPU

On CPU, MachineBlockPlacement is primarily an I-cache optimization -- placing hot blocks contiguously reduces cache misses. On GPU, the stakes are higher for three reasons:

  1. No branch prediction. GPU SMs do not speculate. Every taken branch is a guaranteed fetch stall. The ratio of taken branches to fall-throughs directly translates to warp scheduler utilization. Optimal block placement can eliminate 10-30% of fetch bubbles in branch-heavy kernels.

  2. Instruction cache is tiny and shared. A single SM partition has 32-128 KB of instruction cache shared across all active warps. Code duplication (tail-dup, loop unrolling) competes with warp occupancy for this shared resource. The tail-dup-placement-penalty (2%) is conservative -- on kernels with high warp counts, even small code size increases can cause I-cache thrashing.

  3. Reconvergence is layout-sensitive. On architectures before Ampere (sm_70, sm_75), the stack-based reconvergence mechanism depends on the post-dominator being reachable from both sides of a divergent branch. Block placement that separates a post-dominator from its divergent predecessors can increase the live warp state, consuming scarce convergence stack entries. The alternative layout evaluator's sub_34C0690 path specifically addresses this by evaluating reconvergence distance.

Structured Control Flow Constraint

Unlike CPU backends where block placement has complete freedom, the NVPTX backend runs StructurizeCFG before MachineBlockPlacement. This means:

  • All irreducible control flow has already been eliminated
  • Structured regions (loops, if-then-else diamonds) are contiguous in the CFG
  • Block placement cannot violate structured region boundaries without re-structurizing

This constraint actually simplifies placement in some cases (fewer valid orderings to consider) but eliminates certain profitable reorderings that would be legal on CPU (e.g., outlining a cold exception handler to a distant location that breaks region contiguity).

Interaction with PTX Emission

The final block ordering directly determines which branches in the PTX output are bra instructions (taken) vs. fall-throughs (implicit). The AsmPrinter (see AsmPrinter) emits bra only for non-fall-through edges. Since ptxas performs its own block scheduling on the PTX input, the cicc block ordering serves as a strong hint rather than a final answer -- but ptxas generally respects the input ordering for blocks within the same structured region.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| runOnMachineFunction | sub_3521FF0 | -- | Entry point, 82 KB |
| buildChains | sub_3521900 | -- | Initial chain construction |
| tailDupPlacement | sub_35185B0 | -- | Tail-dup-aware chain merging |
| applyBlockOrder | sub_3519A10 | -- | Commit final BB ordering to MF |
| alignBlocks | sub_3516980 | -- | Post-placement alignment |
| buildLoopChains | sub_351EBB0 | -- | Loop-aware chain merging |
| buildChainForBlock | sub_351D700 | -- | Greedy successor chain walk |
| selectBestSuccessor | sub_35157A0 | -- | Pick best fall-through successor |
| chainLookup | sub_3515040 | -- | DenseMap BB-to-chain lookup |
| rotateLoop | sub_351C710 | -- | Loop rotation heuristic |
| mergeTails | sub_351A710 | -- | Chain tail merge logic |
| lowerChain | sub_35161F0 | -- | Final lowering of chain to BB list |
| (helper) | sub_3515CB0 | -- | Chain cost model evaluation |
| (helper) | sub_3515280 | -- | Chain building iteration |
| (helper) | sub_3516000 | -- | Chain length query |
| (NVIDIA addition) | sub_34BEDF0 | -- | Layout evaluator state constructor (321 bytes) |
| (NVIDIA addition) | sub_34C7080 | -- | Layout evaluator dispatch wrapper (17 bytes, guards sub_34C6AF0) |
| (NVIDIA addition) | sub_34C6AF0 | -- | Core layout evaluator body (1419 bytes, SM-aware) |
| (NVIDIA addition) | sub_34C4890 | -- | Frequency-priority BB advancement |
| (NVIDIA addition) | sub_34C2D70 | -- | Chain swap application |
| (NVIDIA addition) | sub_34C56D0 | -- | Loop-level cost re-evaluation (5137 bytes, divergence-aware) |
| (NVIDIA addition) | sub_34C0690 | -- | Lightweight loop evaluator (sm_70/sm_75 path) |
| (NVIDIA addition) | sub_34BA1B0 | -- | Chain state snapshot |
| (NVIDIA addition) | sub_34C0EE0 | -- | Block-level cost query |
| (NVIDIA addition) | sub_34BE7F0 | -- | Chain adjacency analysis |
| (NVPTX) | sub_350FE30 | -- | Pass registration |
| (NVPTX) | sub_350FEE0 | -- | Stats pass registration |
| (generic) | sub_1DE8060 | -- | Generic LLVM pass registration |
| (generic) | sub_1DE8500 | -- | Generic LLVM stats registration |
| cleanup | sub_3511770 | -- | Chain-map teardown |
| cleanup | sub_35142F0 | -- | Loop chain data teardown |
| cleanup | sub_3510940 | -- | Bump allocator teardown |
| calcExtTspScore | sub_29BAF70 | -- | Ext-TSP score (original layout) |
| calcExtTspScore | sub_29BAC40 | -- | Ext-TSP score (alternative layout) |
| computeExtTspLayout | sub_29BB2B0 | -- | Ext-TSP chain reordering solver |
| (helper) | sub_2EE6520 | -- | Ext-TSP enable decision |
| (helper) | sub_2EE6AD0 | -- | Branch redirect profitability check |
| getEdgeProbability | sub_2E441D0 | -- | Edge probability query |
| (default stub) | sub_2FDC800 | -- | Default getTailDupThreshold implementation |
| (default stub) | sub_2FF52D0 | -- | Default reconvergence-region query |
| (default stub) | sub_2FDC810 | -- | Default layout-accept threshold query |

Differences from Upstream LLVM

| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Pass instances | Single MachineBlockPlacement per pipeline | Two instances: stock LLVM copy + NVPTX-pipeline copy at sub_3521FF0 |
| Divergence awareness | No divergence concept; layout optimizes for I-cache locality | Queries warp divergence flag on MachineFunction; divergent branches affect tail duplication profitability |
| Alternative layout proposal | Absent; single layout path only | Additional proposal path (sub_34BEDF0 / sub_34C7080) evaluates alternative orderings with SM-aware cost |
| Tail duplication threshold | TailDupPlacementThreshold (default 2) | GPU-specific threshold via vtable query (sub_2FDC810); controlled by reconvergence-region analysis |
| Loop cost evaluation | Frequency-weighted chain cost | Divergence-aware loop cost re-evaluation (sub_34C56D0, 5137 bytes) considers warp reconvergence overhead |
| Ext-TSP scoring | Standard profile-guided layout scoring | Same Ext-TSP solver but gated by NVPTX-specific enable decision (sub_2EE6520) |
| Structured CFG constraint | No structured CFG requirement (targets like x86 have arbitrary CFG) | Must preserve structured regions from StructurizeCFG; contiguous structured blocks cannot be interleaved |

Cross-References

  • StructurizeCFG -- runs before block placement; produces the structured CFG that constrains which block orderings are legal. Structured regions must remain contiguous.
  • BranchFolding -- runs after placement; performs tail merging and branch folding on the committed layout. See sub_2F336B0.
  • Instruction Scheduling -- block ordering affects scheduling windows. Post-placement scheduling operates within the committed layout.
  • Register Allocation -- register pressure is affected by block ordering through live range extent.
  • AsmPrinter -- emits PTX from the final block ordering, generating bra instructions for taken branches and fall-through for sequential blocks.

MachineOutliner for GPU

The MachineOutliner in CICC v13.0 is the stock LLVM MachineOutliner pass, compiled into the binary at two address ranges: a candidate finder at sub_3539E80 and a core outlining engine at sub_3537010, totaling approximately 136 KB of combined code. A second instance at sub_1E3D600 (62 KB) appears in the MIR infrastructure region (0x1E20000--0x1E3FFFF), contains the same diagnostic strings ("NotOutliningCheaper", "OutliningBenefit", etc.), and [MEDIUM confidence] likely represents the runOnModule entry point that delegates to the two primary functions. This identification rests on the function's address falling in the MIR infrastructure region and on its diagnostic-string overlap with the primary outliner; it could alternatively be a separate pass-manager wrapper or a legacy code path.

The pass extracts repeated MachineInstr sequences across all functions in a module, factors them into shared OUTLINED_FUNCTION_* stubs, and replaces the original sequences with calls. On GPU targets this matters because code size directly affects the per-SM instruction cache (L0/L1i) footprint, and every instruction that survives into PTX also adds to ptxas compilation time and register pressure during its own allocation pass.

CICC ships the pass as part of its standard LLVM codegen infrastructure, controlled by the enable-machine-outliner TargetPassConfig knob (tri-state: disable, enable, guaranteed beneficial). The binary does not override the upstream default -- meaning the outliner's activation depends on whether the NVPTX backend's TargetPassConfig::addMachineOutliner() enables it. The presence of full outliner infrastructure (pass registration at sub_35320A0, ~136KB of outliner code, the benefit-threshold knob, and the "nooutline" function-attribute check) confirms the pass is callable. The critical question is whether NVIDIA's default pipeline activates it. The evidence is ambiguous but leans toward conditionally enabled: the TargetPassConfig enum includes "guaranteed beneficial" mode, and the NVPTX-specific calling convention 95 (assigned to outlined functions when no special CC is required) would serve no purpose if the pass were dead code.

| Item | Value |
|---|---|
| Pass name | "Machine Function Outliner" / "machine-outliner" |
| Registration | sub_35320A0 -- stores pass ID at unk_503D78C |
| Core outlining engine | sub_3537010 (77 KB, 2,185 decompiled lines) |
| Candidate finder | sub_3539E80 (59 KB) |
| Second instance (MIR region) | sub_1E3D600 (62 KB, 0x1E3D600) |
| Pass factory | sub_3534A50 |
| Benefit threshold knob | qword_503DAC8 = outliner-benefit-threshold (default: 1) |
| Cost mode flag | qword_503DC88 (loaded into pass state at offset +184) |
| Debug flag | qword_503D828 (verbose outliner output) |
| Options constructor | ctor_675 at 0x5A2820 (10,602 bytes) |
| NVPTX outlined-function CC | Calling convention 95 (PTX .func linkage) |
| Outlined function naming | OUTLINED_FUNCTION_{round}_{index} |
| Function attributes applied | nounwind (47), minsize (18), internal linkage |

Suffix Tree Algorithm

The outliner's core algorithm is Ukkonen's suffix tree construction, applied to a flattened sequence of MachineInstr encodings from every eligible basic block in the module. The process proceeds in three stages.

Stage 1: Instruction Mapping

sub_3508720 (buildInstrLegalityMapping) walks each MachineBasicBlock and encodes every instruction as a uint16 alphabet symbol. The encoding incorporates both the opcode and a structurally significant operand pattern, so that two instruction sequences with different register names but identical structure map to the same suffix-tree substring. The helper sub_35082F0 initializes from the MBB's scheduling info (offset +32), and sub_35085F0 populates the actual mapping.

Register-class resolution happens in a second pass via sub_3508F10 (buildRegClassMapping): sub_3508B80 builds register-class bitmask information, and sub_3508890 computes the final mapping. This two-layer encoding is critical because NVPTX has typed register classes (i32, i64, f32, f64, pred, etc.) and an outlined sequence must be valid across all call sites regardless of which specific virtual register names appear.

Instructions that cannot participate in outlining receive a special encoding: unique negative integers starting at -3 (matching upstream's IllegalInstrNumber). Each illegal instruction gets a distinct value so it acts as a suffix-tree terminator, preventing matches from spanning across them. The sentinel value 0xFFFFFFFF (-1 as uint32) in the cost array explicitly marks these.

Stage 2: Suffix Tree Construction and Candidate Extraction

sub_35364E0 (insertIntoSuffixTree) inserts each MBB's encoded instruction sequence into the suffix tree working set. The suffix tree identifies all repeated substrings of length >= 2. For each repeated substring with at least 2 occurrences, the pass creates a candidate group.

Function filtering happens before insertion. sub_3539E80 iterates all MachineFunctions in the module's linked list and applies three gates:

  1. nooutline attribute check -- sub_B2D620 tests whether the function has the "nooutline" string attribute. If present, all MBBs in that function are skipped.

  2. shouldOutlineFrom -- vtable dispatch at offset +1440 on the TargetInstrInfo. The NVPTX backend's implementation of this hook determines whether a given function is eligible based on target constraints.

  3. isFunctionSafeToOutlineFrom -- vtable dispatch at offset +1432, receiving the outliner cost mode byte from qword_503DC88. This is where target-specific safety checks (e.g., functions with special register constraints or inline assembly) can reject outlining.

Additional per-block filters: a block must contain more than one instruction, must not already be marked as outlined (byte at MBB offset +217), and must have no special flag (qword at MBB offset +224 must be zero).

Stage 3: Sorting and Pruning

After suffix-tree extraction, the candidate list is sorted using a hybrid merge sort:

  • sub_3534120 -- parallel merge sort for large arrays (recursive, splits at midpoint)
  • sub_3533600 -- in-place merge sort for small arrays (fallback when size < 14 pointers = 112 bytes)
  • sub_3533450 -- insertion sort for very small partitions (<= 14 elements)

The sorted suffix array is then scanned by sub_3532120 (findIllegalInRange), which performs a 4-way unrolled linear scan searching for the sentinel value 0xFFFFFFFF in the integer cost array. Any candidate whose instruction range contains an illegal sentinel is pruned. The compaction loop copies valid entries forward in place and frees discarded entries' internal string buffers via _libc_free.

Benefit/Cost Model

The outliner accepts a candidate only if the net benefit exceeds the threshold. The formula:

Benefit = NumOccurrences * PerOccurrenceCost - FrameOverheadCost

Where:

  • NumOccurrences = number of identical sequences found (vtable dispatch at slot 0 on the candidate)
  • PerOccurrenceCost = bytes saved per replacement (effectively the cost of the call instruction that replaces the inlined sequence, dispatched via vtable slot 0 multiplied by the repeat_count at candidate offset +40)
  • FrameOverheadCost = cost of the outlined function itself: the function entry/exit, the return instruction, and any callee-saved register saves (vtable dispatch at slot 8)

The decision rule:

int benefit = num_occurrences * per_call_cost - frame_overhead;
if (benefit < 0) benefit = 0;
if (benefit < outliner_benefit_threshold) continue;  // skip candidate

The threshold qword_503DAC8 defaults to 1, meaning any candidate that saves at least one byte is accepted. This is identical to upstream LLVM's default and is intentionally aggressive -- the outliner relies on the cost model's accuracy rather than a conservative threshold to filter bad candidates.

NVPTX Cost Model Considerations

The cost model is dispatched through the TargetInstrInfo vtable, meaning the NVPTX backend supplies its own getOutliningCandidateInfo, buildOutlinedFrame, and insertOutlinedCall implementations. Several factors make the GPU cost model structurally different from CPU targets:

Call overhead in PTX is expensive. A PTX .func call requires .param space declaration, parameter marshaling (each argument is copied to .param memory), the call instruction itself, and result retrieval from .param space. On CPU targets, a call instruction is a single opcode plus a return address push. On NVPTX, the overhead is proportional to the number of live values that must be passed to the outlined function. This means the FrameOverheadCost for NVPTX candidates is significantly higher than on CPU, and only sequences with many occurrences or substantial length achieve positive benefit.

No hardware call stack. PTX function calls are lowered by ptxas into something closer to inlined code with register renaming. The actual "call" may or may not involve a hardware subroutine mechanism depending on the SM architecture and ptxas optimization level. This makes the cost model somewhat speculative from CICC's perspective -- the outlined function may be re-inlined by ptxas.

Calling convention 95. When no candidate entry in a group requires a special calling convention, the outlined function is assigned CC 95 -- an NVPTX-specific calling convention not present in upstream LLVM. CC 95 maps to PTX .func linkage with internal visibility, meaning the function is private to the compilation unit and ptxas has full freedom to inline or optimize it. See Calling Convention 95 below for the complete assignment algorithm and CC comparison table.

Outlined Function Creation

When a candidate group passes the benefit threshold, sub_3537010 creates the outlined function through these steps:

Name generation. The name follows the pattern OUTLINED_FUNCTION_{round}_{index}. The round number (pass counter at state offset +188) is omitted in round 0, producing OUTLINED_FUNCTION_0, OUTLINED_FUNCTION_1, etc. for the first pass and OUTLINED_FUNCTION_2_0, OUTLINED_FUNCTION_2_1, etc. for subsequent reruns. The integer-to-string conversion uses a standard two-digit lookup table ("00010203...9899") for fast decimal formatting.

LLVM Function creation. sub_BCB120 (getOrInsertFunction) creates or retrieves the Function in the LLVM Module. sub_BCF640 creates the function type (void return, no arguments by default). sub_B2C660 creates the corresponding MachineFunction.

Function flags. The flag word at function offset +32 is set to (existing & 0xBC00) | 0x4087. The bit pattern 0x4087 encodes internal linkage, norecurse, and nounwind. The mask 0xBC00 preserves target-dependent alignment and visibility bits. Two explicit attributes are added: nounwind (attribute ID 47) and minsize (attribute ID 18).

Register liveness. A calloc-allocated byte array (one byte per physical register, count from TargetRegisterInfo::getNumRegs() at TRI offset +16) tracks which registers are live-through versus defined-inside the outlined region. sub_35095B0 (populateOutlinedFunctionBody) walks the outlined MBB's instruction stream, checking the TargetRegisterInfo live-in bitmap (offset +48 in the subtarget). Registers not in the live-in set are inserted as phantom definitions. Super-register chains are walked via delta tables at TRI offset +56, following standard LLVM MCRegisterInfo encoding.

Outlined body. The TargetInstrInfo hook buildOutlinedFrame (vtable offset +1408) constructs the actual machine instructions in the outlined function by copying from the candidate entries. The isOutlined flag is set at MachineFunction offset +582.

Call-Site Rewriting

After creating the outlined function, the pass rewrites each call site:

  1. For each candidate entry, insertOutlinedCall (vtable offset +1416) is invoked with the caller's MBB, an insertion point, the outlined Function, and the candidate metadata. This returns the new call MachineInstr.

  2. If the outlined function has callee-saved register information (flag at candidate offset 344), the pass builds live-in/live-out register sets using red-black trees (sub_3536E40 for classification). Registers are classified as defs (implicit-def, flag 0x30000000), uses (implicit-use, flag 0x20000000), or implicitly-defined. These operands are attached to the call instruction via sub_2E8F270.

  3. The original instruction range in the cost array is memset to 0xFF, marking it with illegal sentinels. This prevents future outlining passes (reruns) from attempting to re-outline already-outlined code.

Candidate Entry Structure

Each candidate is a 224-byte structure (56 x uint32 stride):

| Offset | Size | Field |
|---|---|---|
| +0x00 | 4 | start_index -- index into module instruction array |
| +0x04 | 4 | length -- number of instructions in sequence |
| +0x08 | 8 | call_info_ptr -- pointer to MBB or instruction range |
| +0x10 | 8 | metadata_0 |
| +0x18 | 8 | metadata_1 |
| +0x20 | 4 | num_occurrences_field |
| +0x28 | 4 | cost_field |
| +0x2C | 48 | SSO string data (via sub_3532560) |
| +0x70 | 4 | benefit_or_flags |
| +0x78 | 40 | Second SSO string field |
| +0xA0 | 1 | flag_byte_0 |
| +0xA1 | 1 | flag_byte_1 |
| +0xA8 | 4 | field_A8 |
| +0xAC | 4 | field_AC |
| +0xB0 | 4 | field_B0 |
| +0xB4 | 4 | field_B4 |

The two string fields use LLVM's small-string optimization (SSO): strings shorter than the inline buffer are stored directly in the struct; longer strings allocate on the heap. The copy function sub_3532560 handles both cases.

Calling Convention 95: The NVPTX Outlined-Function CC

CICC defines calling convention 95 (0x5F) as an NVPTX-specific calling convention that does not exist in upstream LLVM. It is assigned exclusively to outlined functions and signals to both the AsmPrinter and ptxas that the function is a module-internal device helper with PTX .func linkage.

CC Assignment Algorithm

The CC assignment happens in Phase 5 of sub_3537010 (lines 838--877 of the decompilation), after the outlined MachineFunction is created and before its body is populated. The algorithm:

fn assign_outlined_cc(candidate_group, outlined_fn):
    max_cc = 0
    for entry in candidate_group:
        cc = sub_A746B0(entry)          // extract caller's CC from candidate
        max_cc = max(max_cc, cc)

    if max_cc > 0:
        // At least one call site has a non-default CC.
        // Inherit the highest CC and create a callee-saved register mask.
        sub_B2BE50(outlined_fn, max_cc)         // setCallingConv
        sub_A77AA0(outlined_fn, max_cc)         // create callee-saved mask
    else:
        // All call sites have default CC (0) -- typical case for
        // device functions compiled from __device__ code.
        // Assign the NVPTX-specific outlined-function CC.
        outlined_fn.setCallingConv(95)

sub_A746B0 extracts the calling convention from each candidate entry's source MachineFunction. The "max" selection rule means that if candidates come from functions with different CCs, the outlined function inherits the most restrictive one. In practice, since the outliner only groups structurally identical MachineInstr sequences, all entries in a group typically come from functions with the same CC.

CC 95 vs Other NVPTX Calling Conventions

| CC | Hex | PTX Linkage | Meaning |
|---|---|---|---|
| 0 | 0 | .func | Default C calling convention (non-kernel device function) |
| 42 | 0x2A | .entry | PTX kernel entry (one of two kernel CCs; used in SCEV budget bypass) |
| 43 | 0x2B | .entry | PTX kernel entry (variant; also bypasses SCEV budget) |
| 71 | 0x47 | .entry | Primary CUDA kernel CC (isKernel returns true when linkage == 0x47) |
| 95 | 0x5F | .func | NVPTX outlined-function CC -- internal, never a kernel |

CC 95 functions are emitted as .func by the AsmPrinter (sub_215A3C0). The .entry vs .func branch at line 30--33 of the PTX header emission calls sub_1C2F070 (isKernelFunction), which checks whether the CC is one of the kernel CCs (42, 43, 71) or the nvvm.kernel metadata flag. CC 95 fails all kernel tests, so the function is always emitted as .func.

What CC 95 Communicates

The CC carries three semantic signals:

  1. Internal linkage. CC 95 functions are never externally visible. The flag word 0x4087 applied at function offset +32 encodes internal linkage. Combined with the nounwind (47) and minsize (18) attributes, this tells the backend and ptxas that the function is private to the compilation unit.

  2. No .param-space calling convention overhead. Unlike CC 0 device functions, which must declare .param space for every argument and marshal values through st.param/ld.param sequences (the full sub_3040BF0 LowerCall path with DeclareParam/DeclareScalarParam nodes), CC 95 functions use a simplified call interface. The outlined function takes no explicit arguments -- live values are passed implicitly through the register state, and the TargetInstrInfo::insertOutlinedCall hook (vtable +1416) handles the call-site ABI.

  3. ptxas is free to inline. Because CC 95 functions are internal .func with no special ABI constraints, ptxas can and frequently does inline them back at the call site during its own optimization passes. This makes the outlining decision partially speculative from CICC's perspective -- the code size reduction measured by the benefit model may be undone by ptxas.

Callee-Saved Register Mask Interaction

When max_cc > 0 (the non-default path), sub_A77AA0 creates a callee-saved register mask for the outlined function. This mask determines which registers the outlined function must preserve across its body. For CC 95 (the max_cc == 0 path), no callee-saved mask is created. Instead, the call-site rewriting logic at Phase 11 of sub_3537010 (lines 1469--1968) builds explicit implicit-def (flag 0x30000000) and implicit-use (flag 0x20000000) operands on the call instruction using the RB-tree-based register classifier at sub_3536E40. This makes the register interface fully explicit rather than relying on a convention-defined preserved set.

launch_bounds Interaction and Cross-Kernel Outlining

The MachineOutliner operates at module scope -- it considers all functions in the module simultaneously. On NVPTX, this raises the question of whether sequences can be outlined across functions with different __launch_bounds__ annotations.

How launch_bounds Metadata Flows

The __launch_bounds__ attribute on a __global__ function flows through CICC as follows:

  1. EDG frontend (sub_826060): Validates __launch_bounds__ arguments. Rejects __launch_bounds__ on non-__global__ functions. Detects conflicts with __maxnreg__.

  2. Post-parse fixup (sub_5D0FF0): Converts __launch_bounds__ values into structured metadata.

  3. Kernel metadata emission (sub_B05_kernel_metadata): Stores as LLVM named metadata under nvvm.annotations:

    • nvvm.maxntid -- max threads per block (from first __launch_bounds__ argument)
    • nvvm.minctasm -- minimum CTAs per SM (from second argument, if present)
    • nvvm.maxnreg -- max registers per thread (from __maxnreg__ or third argument)
  4. PTX emission (sub_214DA90): Reads the metadata back and emits .maxntid, .minnctapersm, .maxnreg directives. These are emitted only for .entry functions -- the guard at step (g) of sub_215A3C0 ensures .func functions never receive these directives.

The Outlined Function Inherits Nothing

Because outlined functions are created with internal linkage, void return type, and CC 95 (.func), they are device functions -- never kernels. The function creation code in Phase 5 of sub_3537010 does not copy any metadata from source functions. Specifically:

  • No nvvm.kernel flag is set.
  • No nvvm.maxntid metadata is attached.
  • No nvvm.maxnreg metadata is attached.
  • No nvvm.minctasm metadata is attached.
  • No nvvm.cluster_dim or nvvm.maxclusterrank metadata is attached.
  • The isKernel check (sub_CE9220) returns false: the CC is not 0x47, there is no nvvm.kernel metadata, and there is no "kernel" entry in nvvm.annotations.

The only function-level metadata the outlined function receives is the isOutlined flag at MachineFunction offset +582 and the two attributes nounwind (47) and minsize (18).

Function Eligibility Gating

The candidate finder (sub_3539E80) applies three gates before considering a function's basic blocks for outlining:

fn is_eligible(func, cost_mode):
    // Gate 1: explicit opt-out
    if sub_B2D620(func, "nooutline"):       // has "nooutline" attribute?
        return false

    // Gate 2: target hook -- "should we outline FROM this function?"
    tii = get_target_instr_info(func)
    if !tii.vtable[1440](func):             // shouldOutlineFrom
        return false

    // Gate 3: target hook -- "is it SAFE to outline from this function?"
    if !tii.vtable[1432](func, cost_mode):  // isFunctionSafeToOutlineFrom
        return false

    return true

The NVPTX backend's implementation of shouldOutlineFrom (vtable +1440) and isFunctionSafeToOutlineFrom (vtable +1432) determines whether kernel functions and launch_bounds-constrained functions participate. The evidence does not contain the NVPTX-specific implementation of these hooks, so we cannot state definitively whether kernels with nvvm.maxnreg are rejected. However, the architectural implications are clear:

If the hooks permit outlining from constrained kernels, the outliner may extract a sequence shared between a maxnreg=32 kernel and a maxnreg=64 kernel into a single CC 95 .func. That .func has no register budget. When ptxas processes the maxnreg=32 kernel's call to this .func, it must either:

  1. Inline the call -- absorbing the outlined function's register usage into the kernel's allocation. If the outlined body fits within 32 registers, this is transparent.
  2. Keep the call -- allocating the outlined function's registers within the kernel's 32-register budget. If the outlined function needs more registers than available after the kernel's own allocation, ptxas will spill to local memory.

Both outcomes preserve correctness. The performance risk is that spilling may occur in a kernel that would not have spilled without outlining, because the CICC-side cost model has no visibility into ptxas's register allocation decisions.

If the hooks reject constrained kernels, the outliner only operates on unconstrained device functions (CC 0) and kernels without __launch_bounds__. This is the conservative and likely behavior, given that NVIDIA is aware of the register-pressure implications.

Per-Block Eligibility

Even within an eligible function, individual basic blocks are filtered:

| Condition | Check | Effect |
|---|---|---|
| Block has <= 1 instruction | MBB.size() <= 1 | Skipped -- too small to outline |
| Block already outlined | byte at MBB offset +217 | Skipped -- prevents re-outlining |
| Block has special flag | qword at MBB offset +224 != 0 | Skipped -- target-specific block exclusion |

The "already outlined" flag at MBB offset +217 is set by the call-site rewriting phase (Phase 11) after replacing a sequence with a call to the outlined function. Combined with the cost-array sentinel memset (0xFF fill), this provides a two-layer defense against re-outlining.

Outlining vs. Inlining Tension

The MachineOutliner and the LLVM inliner operate in opposite directions: the inliner copies callee bodies into call sites (increasing code size, reducing call overhead), while the outliner extracts common sequences out of function bodies (decreasing code size, adding call overhead). In CICC, the two passes do not directly coordinate -- the inliner runs during the IR optimization pipeline (CGSCC pass manager), while the MachineOutliner runs late in the machine codegen pipeline after register allocation and scheduling.

The tension manifests in two ways:

  1. The inliner may create outlining opportunities. Aggressive inlining of small device functions can produce multiple copies of the same instruction sequence in different callers, which the outliner then detects and re-extracts. This round-trip (inline then outline) is wasteful but not incorrect. The net result depends on whether the outliner's shared function is more cache-friendly than the inlined copies.

  2. The outliner may undo inlining benefits. If the inliner carefully decided that inlining a hot function improves performance by eliminating call overhead and enabling cross-function optimization, the outliner may later extract the inlined sequence back out if it appears in multiple callers. The minsize attribute on outlined functions does not prevent this -- it only signals that the outlined function should be optimized for size rather than speed.

The enable-machine-outliner knob's "guaranteed beneficial" mode addresses this partially by only outlining sequences where the cost model is confident the savings are worthwhile, but it cannot reason about the inliner's original intent.

Configuration Knobs

All knobs are LLVM cl::opt command-line options, passable via -Xllc in CICC:

| Knob | Type | Default | Effect |
|---|---|---|---|
| outliner-benefit-threshold | unsigned | 1 | Minimum net byte savings for a candidate to be accepted. Higher values make outlining more conservative. |
| enable-machine-outliner | enum | target-dependent | Tri-state: disable, enable, guaranteed beneficial. Controls whether the pass runs at all. |
| enable-linkonceodr-outlining | bool | false | Whether to outline from linkonce_odr functions. Off by default because the linker can deduplicate these. Should be enabled under LTO. |
| machine-outliner-reruns | unsigned | 0 | Number of additional outliner passes after the initial run. Each rerun can find new candidates from code modified by previous outlining. |
| outliner-leaf-descendants | bool | true | Consider all leaf descendants of internal suffix-tree nodes as candidates (not just direct leaf children). |
| disable-global-outlining | bool | false | Disable global (cross-module) outlining, ignoring codegen data generation/use. |

The options constructor at ctor_675 (0x5A2820, 10,602 bytes) registers the outliner-specific options including the linkonce-odr and rerun knobs. The benefit threshold is registered separately in the same constructor.

Diagnostic Strings

The outliner emits LLVM optimization remarks under the "machine-outliner" pass name:

| Remark key | Meaning |
|---|---|
| "OutlinedFunction" | A new outlined function was created |
| "NotOutliningCheaper" | Candidate rejected because outlining would not save bytes |
| "Did not outline" | Candidate rejected for other reasons (illegal instructions, safety checks) |
| "OutliningBenefit" | Named integer: net byte savings |
| "OutliningCost" | Named integer: cost of the outlined call sequence |
| "NotOutliningCost" | Named integer: cost of keeping the sequence inline |
| "NumOccurrences" | Named integer: how many times the sequence was found |
| "Length" | Named integer: number of instructions in the sequence |
| "StartLoc" / "OtherStartLoc" | Source locations of the outlined regions |

The remark message format: "Saved {N} bytes by outlining {M} instructions from {K} locations. (Found at: {loc1}, {loc2}, ...)".

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| Pass registration (name, ID, factory) | sub_35320A0 | -- | -- |
| Pass factory function | sub_3534A50 | -- | -- |
| Core outlining engine (outline + rewrite) | sub_3537010 | 77 KB | -- |
| Candidate finder / suffix-tree builder | sub_3539E80 | 59 KB | -- |
| MachineOutliner runOnModule entry (MIR region) | sub_1E3D600 | 62 KB | -- |
| insertIntoSuffixTree -- adds MBB instruction hashes | sub_35364E0 | -- | -- |
| SuffixArray::allocateWorkBuffer | sub_3535DB0 | -- | -- |
| SuffixArray::parallelMergeSort | sub_3534120 | -- | -- |
| SuffixArray::inPlaceMergeSort (fallback for small arrays) | sub_3533600 | -- | -- |
| Insertion sort for <= 14 elements | sub_3533450 | -- | -- |
| findIllegalInRange (4-way unrolled sentinel scan) | sub_3532120 | -- | -- |
| buildInstrLegalityMapping -- MBB to suffix alphabet | sub_3508720 | -- | -- |
| buildRegClassMapping -- register-class constraint resolution | sub_3508F10 | -- | -- |
| populateOutlinedFunctionBody -- instruction insertion | sub_35095B0 | -- | -- |
| classifyOperandRegisters -- RB-tree register tracking | sub_3536E40 | -- | -- |
| RBTree::destroyAll -- recursive tree deallocation | sub_3532B90 | -- | -- |
| std::string constructor (for name generation) | sub_35323D0 | -- | -- |
| SmallString SSO-aware deep copy | sub_3532560 | -- | -- |
| RemarkBuilder::appendField | sub_3534BB0 | -- | -- |
| RemarkBuilder::emitOutlinedFunctionRemark | sub_35341F0 | -- | -- |
| Extract calling convention from candidate entry's source function | sub_A746B0 | -- | -- |
| Create callee-saved register mask for non-default CC | sub_A77AA0 | -- | -- |
| hasAttribute("nooutline") -- function attribute check | sub_B2D620 | -- | -- |
| isKernel(func) -- returns true for CC 0x47 or nvvm.kernel metadata | sub_CE9220 | -- | -- |
| isKernelFunction -- .entry vs .func emission branch | sub_1C2F070 | -- | -- |
| Kernel attribute emission (.maxntid, .maxnreg, .minnctapersm) | sub_214DA90 | -- | -- |
| PTX function header orchestrator (.entry / .func branch + params) | sub_215A3C0 | -- | -- |

Differences from Upstream LLVM

Activation
  Upstream: default off for most targets; explicit -enable-machine-outliner required.
  CICC:     conditionally enabled via TargetPassConfig::addMachineOutliner(); evidence of a "guaranteed beneficial" mode for NVPTX.

Calling convention
  Upstream: uses the target default CC for outlined functions.
  CICC:     assigns CC 95 to outlined functions -- a dedicated NVPTX convention that bypasses .param-space ABI overhead.

Kernel interaction
  Upstream: no kernel concept; all functions treated equally.
  CICC:     isKernel(func) check (sub_CE9220) for CC 0x47 / nvvm.kernel metadata; kernel attributes (.maxntid, .maxnreg, .minnctapersm) may constrain outlining profitability.

nooutline attribute
  Upstream: standard function attribute check.
  CICC:     same check (sub_B2D620 / hasAttribute("nooutline")); kernels with tight __launch_bounds__ may implicitly disable outlining.

Code size motivation
  Upstream: reduce instruction cache footprint and binary size.
  CICC:     primary motivation is L0/L1i instruction cache pressure per SM partition; every surviving PTX instruction also costs ptxas compilation time.

Suffix tree/array
  Upstream: standard suffix array construction.
  CICC:     same algorithm; parallel merge sort (sub_3534120) with fallback insertion sort for <= 14 elements.

Cross-References

  • Inliner Cost Model -- the opposing force: inlining decisions that the outliner may partially reverse
  • AsmPrinter & PTX Body Emission -- how outlined .func functions are emitted as PTX
  • Register Allocation -- the outliner runs after RA; outlined functions affect register pressure
  • Register Coalescing -- coalescing happens before outlining; the outliner operates on already-coalesced code
  • Block Placement -- block layout interacts with code size; the outliner reduces the instruction footprint that placement must arrange
  • Pipeline & Ordering -- where the outliner sits in the overall pass sequence
  • NVPTX Call ABI -- the .param-space calling convention that CC 0 device functions use; CC 95 outlined functions bypass this
  • SCEV Analysis -- SCEV budget bypass for CC 42/43 kernel functions; illustrates CC-based dispatch in CICC

Tensor Core / MMA Code Generation

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

CICC v13.0 contains a complete tensor core code generation pipeline spanning five SM generations (Volta through Blackwell), three distinct MMA instruction families (HMMA/IMMA/BMMA), the SM 90 Warp Group MMA (WGMMA) system, and the SM 100 Tensor Core Generation 5 (tcgen05) engine. The pipeline transforms NVVM intrinsic calls through two parallel lowering paths -- one in the NVVM IR lowering layer (sub_955A70) and one in the SelectionDAG backend (sub_33B0210) -- before reaching a common PTX instruction emission layer that constructs MMA instructions from packed 64-bit descriptors encoding shape, type, layout, rounding, and saturation.

This page documents the code generation mechanics: how MMA operations flow from source-level __hmma_* / __wmma_* / __wgmma_* builtins through LLVM intrinsic selection, SelectionDAG lowering, and PTX string emission. For the builtin-to-intrinsic mapping and per-ID reference, see Tensor / MMA Builtins. For the SelectionDAG infrastructure that hosts this lowering, see SelectionDAG.

NVVM builtin dispatch          sub_955A70 (105KB) -- main NVVM builtin lowering dispatcher
SelectionDAG intrinsic switch  sub_33B0210 (343KB, 9,518 lines) -- intrinsic lowering mega-switch, CAT-17
SelectionDAG MMA handler       sub_33A64B0 -- WMMA/MMA DAG node construction (95 intrinsic IDs)
WMMA load handler              sub_94CAB0 / sub_94DCB0 -- fragment load codegen
WMMA MMA handler               sub_94E0D0 -- matrix multiply-accumulate codegen
MMA PTX string builder         sub_21E74C0 (AsmPrinter) / sub_35F3E90 (backend)
tcgen05.mma lowering           sub_304E6C0 (SelectionDAG) / sub_36E9630 (instruction emission)
tcgen05 infrastructure         sub_30462A0 -- fence/wait/alloc/dealloc/cp/commit
Address range                  0x21D0000--0x21F0000 (AsmPrinter MMA), 0x304xxxx--0x36Fxxxx (backend)
Upstream                       lib/Target/NVPTX/NVPTXISelLowering.cpp (no upstream MMA; entirely NVIDIA-proprietary)

Pipeline Overview

MMA code generation follows a three-stage pipeline. The first two stages exist in parallel copies; the third is shared.

CUDA source:  __hmma_m16n16k16_mma_f32f32(d, a, b, c, 0)
                │
    ┌───────────┴───────────┐
    │ NVVM builtin lowering │ SelectionDAG intrinsic lowering
    │ (sub_955A70)          │ (sub_33B0210, CAT-17)
    │                       │
    │ 3-table lookup:       │ sub_33A64B0 -> SDNode construction
    │ dword_3F14840/7E0/7A0 │ 95 case labels (0xA4-0xA8, 0x194-0x1EC)
    │                       │
    │ sub_94E0D0 (MMA)      │
    │ sub_94CAB0 (load)     │
    │ sub_9493D0 (store)    │
    └───────────┬───────────┘
                │
    ┌───────────┴───────────┐
    │ PTX Instruction Emit  │
    │ sub_21E74C0 (printer) │
    │ sub_1D23DE0 (emitter) │
    └───────────────────────┘

The NVVM builtin lowering path handles builtins that arrive as direct function calls from the EDG frontend. The SelectionDAG path handles the same operations when they arrive as LLVM intrinsic calls (the normal path when CUDA C++ compiles through Clang-style IR generation). Both paths converge at the PTX string builder, which reads a packed 64-bit descriptor word and emits text like mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32.

Packed MMA Descriptor

All MMA operations are encoded as a single 64-bit descriptor word stored at *(QWORD*)(*(QWORD*)(a1+16) + 16*a2 + 8). The PTX string builder (sub_21E74C0) queries this descriptor through a string-keyed interface. The caller passes a query string (e.g., "shape", "ety", "mid"), and the builder extracts the relevant bits and emits the corresponding PTX text.

Bit Layout

Bits     Field       Query key   Values
───────  ──────────  ─────────   ──────
[0]      rowcol      "rowcol"    0=row, 1=col
[2:1]    mid         "mid"       0=a, 1=b, 2=c, 3=d
[7:4]    opc         "opc"       0=default, 1=.and.popc, 2=.xor.popc
[2:0]    rnd         "rnd"       0=none, 1=.rn, 2=.rm, 3=.rp, 4=.rz
[15:8]   aty         "aty"       A element type enum (see below)
[23:16]  bty         "bty"       B element type enum
[25:24]  al          "al"        A layout: 0=row, nonzero=col
[27:26]  bl          "bl"        B layout: 0=row, nonzero=col
[28]     satf        "satf"      0=off, 1=.satfinite
[39:32]  shape       "shape"     Shape enum (see below)

The "ety" query reads the result/accumulator element type from bits [27:24], sharing bit positions with al/bl in a context-dependent manner -- the builder dispatches on the query string to select the correct extraction mask.
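The bit layout above can be sketched as plain field extractors. This is a minimal sketch assuming the recovered bit positions; the function names are ours, not symbols from the binary.

```c
#include <assert.h>
#include <stdint.h>

/* Field extractors for the packed 64-bit MMA descriptor (names are ours). */
static unsigned mma_rowcol(uint64_t d) { return (unsigned)(d & 1); }
static unsigned mma_mid(uint64_t d)    { return (unsigned)((d >> 1) & 3); }
static unsigned mma_opc(uint64_t d)    { return (unsigned)((d >> 4) & 0xF); }
static unsigned mma_aty(uint64_t d)    { return (unsigned)((d >> 8) & 0xFF); }
static unsigned mma_bty(uint64_t d)    { return (unsigned)((d >> 16) & 0xFF); }
static unsigned mma_al(uint64_t d)     { return (unsigned)((d >> 24) & 3); }
static unsigned mma_bl(uint64_t d)     { return (unsigned)((d >> 26) & 3); }
static unsigned mma_satf(uint64_t d)   { return (unsigned)((d >> 28) & 1); }
static unsigned mma_shape(uint64_t d)  { return (unsigned)((d >> 32) & 0xFF); }
/* "ety" reuses bits [27:24] in accumulator-type query contexts. */
static unsigned mma_ety(uint64_t d)    { return (unsigned)((d >> 24) & 0xF); }
```

For example, a descriptor with shape 0x12 (m16n8k16), aty=bty=7 (bf16), satf set, and rowcol=1 decodes back to exactly those fields.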

Type Enum

Value  Type  Bits  PTX string
-----  ----  ----  ----------
1      b1    1     "b1"
2      s4    4     "s4"
3      u4    4     "u4"
4      s8    8     "s8"
5      u8    8     "u8"
6      f16   16    "f16"
7      bf16  16    "bf16"
8      tf32  19    "tf32"
9      f64   64    "f64"
10     f32   32    "f32"
11     s32   32    "s32"

Any other value triggers the fatal error "Wrong MMA element type".

Shape Enum

Value  Shape      PTX string   Notes
-----  -----      ----------   -----
0x01   m8n8k4     "m8n8k4"     Original Volta HMMA
0x02   m8n8k16    "m8n8k16"    Integer MMA (s8/u8)
0x03   m8n8k32    "m8n8k32"    Sub-byte (s4/u4)
0x04   m8n8k64    "m8n8k64"    Extended sub-byte
0x05   m8n8k128   "m8n8k128"   Binary MMA (b1)
0x06   m8n32k16   "m8n32k16"   Appears unused in standard paths
0x10   m16n8k4    "m16n8k4"    Turing HMMA, Ampere f64
0x11   m16n8k8    "m16n8k8"    Turing/Ampere HMMA
0x12   m16n8k16   "m16n8k16"   Ampere (bf16, tf32)
0x13   m16n8k32   "m16n8k32"   Ampere integer
0x14   m16n8k64   "m16n8k64"   Sub-byte integer
0x15   m16n8k128  "m16n8k128"  Extended sub-byte
0x16   m16n8k256  "m16n8k256"  Largest shape (binary/sub-byte)
0x17   m16n16k16  "m16n16k16"  Square shape (Hopper+)
0x18   m32n8k16   "m32n8k16"   Tall shape
0x19   m16n16k8   "m16n16k8"   WMMA f16 path

Unrecognized shape values hit the default branch and trigger BUG() abort.

PTX String Emission

The string builder uses an optimized emission pattern: short constant strings are stored as integer literals for single-store writes. For example, "m16n8k16" is emitted as:

*(QWORD*)ptr = 0x36316B386E36316DLL;  // "m16n8k16" in little-endian

When the output buffer has sufficient remaining capacity, the builder writes directly via DWORD/WORD/BYTE stores. On buffer overflow, it falls back to sub_16E7EE0 (slow-path string append).
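The single-store trick can be verified directly: interpreting the 8 bytes of the constant in little-endian order yields the shape string. The helper below is ours; it just demonstrates the encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Demonstration of the integer-literal string store: the bytes of the
 * constant, read low-to-high, spell "m16n8k16". */
int check_shape_literal(void) {
    uint64_t lit = 0x36316B386E36316DULL;
    char buf[9] = {0};
    memcpy(buf, &lit, 8);  /* one 8-byte store instead of a character copy loop */
    return strcmp(buf, "m16n8k16") == 0;  /* holds on little-endian hosts (x86-64) */
}
```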

HMMA / IMMA / BMMA Lowering (SM 70--89)

The pre-Hopper MMA families share a common architecture: a three-table builtin-to-intrinsic lookup, per-family handler functions for load/store/MMA, and a consistent operand processing pattern.

Three-Table Intrinsic Lookup

Table          Entries  Builtin ID Range  Description
-----          -------  ----------------  -----------
dword_3F14840  0--29    678--707          HMMA (FP16, first-gen)
dword_3F147E0  0--23    708--731          IMMA (INT8)
dword_3F147A0  0--12    732--744          BMMA (binary) / INT4

Each table maps (builtin_id - base) to an LLVM intrinsic ID. The first table additionally sets a v43=1 flag indicating "first generation WMMA", which affects fragment size determination.
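The range-based table selection can be sketched as follows. Table contents are omitted; this only models which table and index a builtin ID resolves to, using the base IDs from the table above. The enum and function names are ours.

```c
#include <assert.h>

typedef enum { TBL_HMMA, TBL_IMMA, TBL_BMMA, TBL_NONE } MmaTable;

/* Map a builtin ID to (table, index) per the three-table layout above.
 * first_gen models the v43=1 flag set for the HMMA table. */
static MmaTable select_mma_table(unsigned builtin_id, unsigned *index, int *first_gen) {
    *first_gen = 0;
    if (builtin_id >= 678 && builtin_id <= 707) {  /* dword_3F14840 */
        *index = builtin_id - 678;
        *first_gen = 1;                            /* first-generation WMMA */
        return TBL_HMMA;
    }
    if (builtin_id >= 708 && builtin_id <= 731) {  /* dword_3F147E0 */
        *index = builtin_id - 708;
        return TBL_IMMA;
    }
    if (builtin_id >= 732 && builtin_id <= 744) {  /* dword_3F147A0 */
        *index = builtin_id - 732;
        return TBL_BMMA;
    }
    return TBL_NONE;
}
```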

HMMA Handler Family (SM >= 70)

Four functions implement half-precision MMA operations. All share a common pattern:

  1. Architecture gate: *(target_info + 252) > 0x45 (SM >= 70)
  2. Fetch debug location
  3. Validate rowcol operand is constant (opcode 10 or 32 check)
  4. Resolve address space via sub_21DEF90
  5. Build operands via sub_1D38BB0 calls
  6. Emit instruction via sub_1D23DE0
Function     Address    Operation            Operand Count
--------     -------    ---------            -------------
sub_21E0360  0x21E0360  hmmaldab (load A/B)  6
sub_21E0630  0x21E0630  hmmaldc (load C)     5
sub_21DFBF0  0x21DFBF0  hmmastc (store C/D)  9 or 13 (shape-dependent)
sub_21E0870  0x21E0870  hmmamma (MMA)        19 or 23 + 1 metadata

For hmmastc, the operand count depends on the accumulator width: 9 operands for narrow accumulators, 13 for wide (when the a2 shape flag is set).

For hmmamma, the handler loads A fragments (v100 iterations), B fragments (v95 iterations), C fragments (v101 iterations), emits the MMA call via sub_921880, then scatters results through v103 iterations of element-wise stores.

IMMA Handler Family (SM >= 72)

Integer MMA follows the same pattern but with additional SM 72 (Xavier) restrictions:

Function     Address    Operation                   SM Gate
--------     -------    ---------                   -------
sub_21E1280  0x21E1280  immaldab (load A/B)         SM > 0x47 (>= 72)
sub_21E15D0  0x21E15D0  immaldc (load C)            SM > 0x47
sub_21E1830  0x21E1830  immastc (store C)           SM > 0x47
sub_21E1D20  0x21E1D20  immamma (MMA + saturation)  SM > 0x47

SM 72 special case. Xavier's tensor cores support only basic IMMA shapes (variant 0 or 1). The gate check is:

if (sm_version <= 0x47 || (sm_version == 72 && shape_variant > 1))
    fatal_error("not supported on this architecture");

For immaldc at SM 72, certain intrinsic opcodes (610, 611, 179, 180) are explicitly blocked:

if (sm_version <= 0x47 || ((opcode-610 <= 1 || opcode-179 <= 1) && sm_version == 72))
    fatal_error(...);

The immamma handler includes an explicit satf (saturation-to-finite) constant extraction. The .satfinite modifier is appended to the PTX instruction when bit 28 of the descriptor is set. This clamps infinities and NaNs to the largest representable finite value.

IMMA operand counts vary by opcode:

Opcode  Fragment Count  Shape
------  --------------  -----
584     12              Large integer shape
609     4               Compact integer shape
other   13              Default

BMMA Handler (SM >= 73/75)

Binary MMA (sub_21E2280, 0x21E2280) handles b1 operations with XOR-POPC and AND-POPC modes. Gate: SM > 0x48 (>= 73, in practice SM 75). The handler takes 8+ operands.

Fragment Size Determination

Fragment size (the number of register-width elements per warp fragment) is computed differently per family:

WMMA (first-gen, v43=1):

Condition                             Fragment Count
---------                             --------------
BF16, store operation (a6==1 && !a5)  4
Default first-gen                     8
Intrinsic 8914 or 8280                2

IMMA (v43=0):

Intrinsic IDs                                   Fragment Count
-------------                                   --------------
0x22B3--0x22B6, 0x22CF                          2
0x22BB--0x22BC, 0x22C5--0x22C6                  4
0x22BD--0x22BE, 0x22C3--0x22C4, 0x22CB--0x22CE  1
0x22B7, 0x22BF, 0x22C7                          8

BMMA: Always 2 fragments, with v101=2, v95=1, v100=1.

MMA Codegen (sub_94E0D0)

The WMMA multiply-accumulate handler processes five input operands:

  1. v102 -- destination fragment pointer (output)
  2. v7 -- A matrix fragment pointer
  3. v93 -- B matrix fragment pointer
  4. v92 -- C accumulator fragment pointer
  5. v8 -- rowcol operand (validated range: 0--3 for MMA)
  6. v9 -- satf flag (validated: 0 or 1; skipped for intrinsic 8279)

Fragment counts for the MMA operation itself:

Family               v95 (A frags)  v100 (B frags)  v101 (C frags)  v103 (D frags)
------               -------------  --------------  --------------  --------------
BMMA                 1              1               2               2
IMMA 0x22C0--0x22C1  1              4               8               8
IMMA 0x22B8--0x22B9  2              2               8               8
IMMA 0x22C8--0x22C9  4              1               8               8
WMMA (default)       8              8               varies          4 or 8

For first-gen WMMA, v103 (D fragment count) is determined by a bit test:

if ((0x300C003 >> (intrinsic_id + 127)) & 1)
    v103 = 4;
else
    v103 = 8;

The code generation sequence is:

1. LOAD  A fragments: v100 iterations of sub_94B510 (extract from ptr v7)
2. LOAD  B fragments: v95  iterations (extract from ptr v93)
3. LOAD  C fragments: v101 iterations (extract from ptr v92)
4. EMIT  MMA call:    sub_90A810(tables, intrinsic_id, 0, 0) -> sub_921880
5. STORE D fragments: v103 iterations of sub_94B940 (scatter to ptr v102)

Address Space Resolution (sub_21DEF90)

MMA load/store operations resolve the target memory address space through sub_21DEF90, which checks the instruction opcode at offset +24:

Opcode Range  Condition                          Address Space
------------  ---------                          -------------
185--237      Bit test against 0x3FFFFD00000003  varies
44--45        Bit 1 of byte at offset +26        varies
>= 659        unconditional                      accepted
default       --                                 generic (0)

Return values: 0=generic, 1=global, 2=shared, 3=local, 4=constant, 5=special, 404=special (from value 101).

SelectionDAG Path (sub_33B0210 / sub_33A64B0)

In the SelectionDAG intrinsic lowering mega-switch (sub_33B0210), 95 consecutive case labels (IDs 0xA4--0xA8 and 0x194--0x1EC, corresponding to LLVM intrinsic IDs 164--168 and 404--492) all dispatch to a single helper: sub_33A64B0.

This function handles every WMMA/MMA SelectionDAG intrinsic for SM 70--89:

  • wmma.load.a / wmma.load.b / wmma.load.c
  • wmma.store.d
  • wmma.mma for all shape/type combinations
  • mma.sync (SM 70+), mma.sp (SM 80+, structured sparsity), mma.f64 (SM 80+)

The SelectionDAG path constructs NVPTXISD target-specific DAG nodes that are later matched by the instruction selection tables. The intrinsic IDs from the mega-switch are distinct from the builtin IDs used in the NVVM path -- the mega-switch IDs are LLVM intrinsic table indices, not CUDA builtin numbers.

WGMMA -- Warp Group MMA (SM 90 Hopper)

WGMMA operates on a warp group (4 warps, 128 threads) instead of a single warp. Four builtin IDs (765--768) expand to over 150 LLVM intrinsic variants through compile-time dimension and type dispatch.

Builtin-to-Intrinsic Expansion

Builtin ID   Builtin                 Variants
----------   -------                 --------
765 (0x2FD)  __wgmma_mma_async_f16   Full 6-operand set (a, b, c, scale, negate, sparsity)
766 (0x2FE)  __wgmma_mma_async_bf16  2-operand (no scale/negate)
767 (0x2FF)  __wgmma_mma_async_tf32  Reduced operand set
768 (0x300)  __wgmma_mma_async_f8    Minimal (2 scale operands only)

The lowering handler (in sub_955A70, cases 0x2FD--0x300, ~800 lines) extracts 7 levels of chained operands:

v263 -- M dimension (constant)
v512 -- accumulator fragments
v528 -- A descriptor
v524 -- B descriptor
v519 -- scale factors
v264 -- layout params
v540 -- element type info

Dimension-to-Intrinsic Mapping

The N dimension (extracted via sub_620FD0 as a constant integer) maps to one of 144 LLVM intrinsic IDs spanning 10654--10779. The mapping forms a dense table with stride 4 per N step:

N    Integer-type Intrinsic  Float-type Intrinsic
-    ----------------------  --------------------
8    10774                   10775
16   10690                   10691
32   10742                   10743
64   10758                   10759
128  10666                   10667
256  10738                   10739

For intermediate N values (multiples of 8 from 8 to 256), the mapping continues at stride +4 per N increment. Even intrinsic IDs encode integer-element variants; odd IDs encode float-element variants. The element type is determined by checking whether the LLVM type is an integer with width 10 (i.e., tf32 or bf16 packed as i10 -- a quirk of the NVVM type system).

If constant extraction overflows, the compiler emits:

"unexpected constant overflow in __wgmma_mma_async operand"

If N is not a power of two: (N & (N - 1)) != 0 triggers:

"N only supported for powers of two"
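The two validation checks above can be sketched directly. The power-of-two test and its error string come from the recovered code; the out-of-range message is a placeholder of ours, since the exact string for that case was not recovered.

```c
#include <assert.h>
#include <string.h>

/* N-dimension validation for __wgmma_mma_async: N must be a power of two,
 * and the dimension table covers multiples of 8 in 8..256. */
static const char *validate_wgmma_n(unsigned n) {
    if (n & (n - 1))                         /* recovered check: power of two */
        return "N only supported for powers of two";
    if (n < 8 || n > 256)
        return "unsupported N";              /* placeholder; exact message not recovered */
    return 0;                                /* valid */
}
```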

WGMMA 5-Dimensional Intrinsic Grid

The full WGMMA intrinsic table (sub_12B2E10) uses a 144-entry grid spanning IDs 5304--5447:

Dimension       Values           Count
---------       ------           -----
N               16, 32, 64, 128  4
B_shared        false, true      2
is_s64          false, true      2
A_scale/negate  combo            varies
case variant    0x2FD--0x300     4

Each WGMMA call packs mode bits into a single integer:

bit 0:  accumulate flag     (from operand v433)
bit 1:  transpose flag      (from operand v445)
bit 2:  negate-C flag       (from operand v433)
bit 3:  reserved
bit 4:  negate-A flag       (from operand v427)

Combined: v79 = bit0 | (bit1 << 1) | (bit2 << 2) | (bit4 << 4).
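The packing above (with bit 3 reserved) is a direct bit-OR; a minimal sketch, with a function name of our choosing:

```c
#include <assert.h>
#include <stdint.h>

/* Pack the WGMMA mode bits described above into one integer (v79). */
static uint32_t pack_wgmma_modes(uint32_t accumulate, uint32_t transpose,
                                 uint32_t negate_c, uint32_t negate_a) {
    return (accumulate & 1)          /* bit 0: accumulate */
         | ((transpose & 1) << 1)    /* bit 1: transpose */
         | ((negate_c  & 1) << 2)    /* bit 2: negate-C */
         | ((negate_a  & 1) << 4);   /* bit 4: negate-A (bit 3 reserved) */
}
```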

WGMMA Parameter Lookup (sub_953BA0)

On first call, sub_953BA0 lazily initializes a red-black tree at ctx+560 with 7 entries encoding per-ID shape, transpose, register count, and type information:

ID   trans_a  shape  a_nregs  b_nregs  a_type  b_type  c_type
--   -------  -----  -------  -------  ------  ------  ------
745  0        1      1        1        i64     i64     --
746  1        0      9        9        i32     i32     i32x2
747  0        0      8        8        i16x2   i16x2   --
748  0        0      7        7        i32x4   i32x4   i32x8
749  0        0      7        7        i32x4   i32x4   i32x8
750  0        0      7        7        i64     i32x2   i32x8

The output is packed into a 64-bit value:

bits[3:0]    = trans_a
bits[7:4]    = shape << 4
bits[15:8]   = a_nregs << 8
bits[27:16]  = b_nregs << 16
bits[31:28]  = padding << 28
bits[63:32]  = trans_b << 32
bit[25]      = ((rowcol & 2)==0) ? 0x2000000 : 0x1000000
bits[27:26]  = ((rowcol & 1)+1) << 26

WGMMA MMA Async Load (sub_9547E0)

A second red-black tree at ctx+656 holds 12 entries for MMA async load parameters:

ID   Shape  NRegs  Variant  Fragment Type
--   -----  -----  -------  -------------
753  1      9      0        --
754  1      9      1        --
755  1      9      2        i16x2
756  25     8      0        --
757  25     8      1        --
758  25     10     2        i32x8
759  23     7      0        i32x4
760  23     7      1        i32x4
761  24     7      0        i32x4
762  24     7      1        i32x4
763  6      7      0        i32x2/i64
764  6      7      1        i32x2/i64

WGMMA Fence/Store Dispatch

IDs       Operation       Intrinsic                Handler
---       ---------       ---------                -------
745--750  fence_aligned   9062 (3 type overloads)  sub_953BA0 -> sub_94B510 x3 -> sub_94B940
751--752  store           9145 (2 type overloads)  sub_954350
753--764  mma_async load  9067 (2 type overloads)  sub_9547E0

The fence operations pack A/B/C fragment operands via sub_94B510 and scatter results via sub_94B940 with name hint "mmafrag".

tcgen05 -- Tensor Core Generation 5 (SM 100 Blackwell)

SM 100 introduces tcgen05, a completely new tensor core instruction family with support for MX floating-point formats (MXF4, MXF8F6F4), structured sparsity, weight stationary mode, block scaling, and scaled input accumulators. The tcgen05 system includes both computation (tcgen05.mma) and lifecycle management (alloc, dealloc, fence, wait, commit, cp, relinquish) instructions.

Architecture Gate

All tcgen05 operations require SM >= 100. The gate check reads two architecture fields:

v1 = *(int*)(arch_struct + 340);  // arch_value: 1000=sm100, 1030=sm103, 1200=sm120
v2 = *(int*)(arch_struct + 336);  // ptx_version

// Family-conditional: ptx >= 86
// Arch-conditional: ptx >= 88
if (v1 != 0x3E8 && v1 != 0x406)  // neither sm_100 (1000) nor sm_103 (1030)
    fatal_error("tcgen05.mma supported only on arch-conditional "
                "or family-conditional variants from SM100 onwards.");

tcgen05 Infrastructure Operations

All handled by sub_30462A0:

Operation                  Intrinsic Opcode  ISD Opcode  Operands
---------                  ----------------  ----------  --------
tcgen05.alloc              10080             4765        basic allocation
tcgen05.alloc (multicast)  10083             4770/4771   32-bit flag variant
tcgen05.dealloc            10140             4827        4 operands
tcgen05.commit             10090             4772--4777  multicast mask variants
tcgen05.fence              10143             4830        2 operands
tcgen05.wait               10351             5020        2 operands
tcgen05.relinquish.alloc   10311             4941        2 operands
tcgen05.cp.*               10101             4790        4 operands

Commit operations validate multicast mask size -- only 16-bit and 32-bit masks are supported:

"tcgen05.commit.* supports only 16-bit and 32-bit multicast mask size."

tcgen05.mma Data Types

The "kind" field occupies bits [8:6] of the packed operand word:

Value  Kind      Description
-----  ----      -----------
0      mxf4nvf4  MX FP4 with NV FP4
1      f8f6f4    FP8/FP6/FP4 standard
2      mxf8f6f4  MX variant of f8f6f4
3      f16       Half precision
4      i8        8-bit integer (arch-conditional only)
5      tf32      TensorFloat-32
7      mxf4      MX FP4

tcgen05.mma Modifiers

Scale vector size (bits [3:2]):

Value  Modifier        Constraints
-----  --------        -----------
0/1    .scale_vec::1X  Cannot use for mxf4nvf4 type
2      .scale_vec::2X  Cannot use for mxf8f6f4 type
3      .scale_vec::4X  Cannot use for mxf8f6f4 or mxf4 type

Block scale alias (bits [10:9]):

Value  Modifier  Constraint
-----  --------  ----------
0      .block16  Not supported for f16, tf32, f8f6f4, i8
1      .block32  Same constraint

Weight stationary (bit 0): .ws flag. Not compatible with cta_group::2, mxf8f6f4, or fp4 types.

CTA group (bits [1:0]): .cta_group::1 (bit 1 clear) or .cta_group::2 (bit 1 set).

Sparsity (bit 5): Adds one extra operand. Restricted for MXF4 and MXF4NVF4 types to arch-conditional variants only.

Scale input accumulator (bit 4): Only usable with f16 and tf32 types. Not supported on sm_100a (v=1001) or sm_103a (v=1033), but supported on sm_100 (v=1000), sm_103 (v=1030), and sm_120+ (v>=1101).

Collector modes (emitted by sub_35F38B0):

Value  PTX modifier
-----  ------------
1      .collector::a::lastuse
2      .collector::a::fill
3      .collector::a::use

Cannot use collector::a::use or collector::a::fill with ashift.

tcgen05.mma ISD Opcode Selection (sub_36E9630)

The intrinsic lowering handler (sub_304E6C0) maps 10 shape cases (intrinsic opcodes 10299--10308) to ISD opcodes 4905--4940:

Case   Shape Class            Base ISD  +scaleD  +sparsity  +ws        +scaleInputAccum
----   -----------            --------  -------  ---------  ---        ----------------
10299  Small                  4906      --       4907       --         --
10300  Small v2               4908      --       4909       --         --
10301  Medium                 4905      4910     4911/4912  4937/4938  yes
10302  Medium v2              4913      4914     4915/4916  --         yes
10303  Large                  4917      4918     4919/4920  --         yes
10304  Block-scale small      4922      --       4923       --         --
10305  Block-scale small v2   4924      --       4925       --         --
10306  Block-scale medium     4921      4926     4927/4928  4939/4940  yes
10307  Block-scale medium v2  4929      4930     4931/4932  --         --
10308  Block-scale large      4933      4934     4935/4936  --         --

Operand count varies by variant: small shapes take 5--6 base operands plus optional sparsity operand; medium shapes take 6 base plus optional scale factor; large shapes iterate over additional operands spanning offsets 440--600 (or 440--760 on sm_103 extended variants).

tcgen05.mma Validation Errors

The full set of compile-time validation errors (emitted via sub_C64ED0):

Error Message                                                                             Condition
-------------                                                                             ---------
"INT8 type is supported only on arch-conditional variants."                               kind==i8 on family-conditional SM100
"MXF4 and MXF4NVF4 types with Sparsity are supported only on arch-conditional variants."  (type+7)%8 > 5 AND sparsity set, on family-conditional
"Explicit scale vector size is supported only on arch-conditional variants."              scale_vec_size 1--3 on family-conditional
"Scale input accumulator can only be used with f16 and tf32 types"                        bit 4 set but kind not f16 or tf32
"Scale input accumulator is not supported on this architecture."                          scaleInputAccum on sm_100a or sm_103a
"Block scale is not supported for f16, tf32, f8f6f4 and i8 types"                         block_scale with incompatible type
"ashift is not supported with tcgen05.mma.block_scale variants"                           ashift + block_scale
"cta_group::2 is not supported with weight stationary"                                    cta_group::2 + .ws
"Cannot use weight stationary with mxf8f6f4 and fp4 types"                                .ws + mxf8f6f4 or fp4
"Cannot use collector::a::use or colletor::a::fill with ashift" [sic]                     collector + ashift
"Cannot use 2X or 4X as scale vector size for mxf8f6f4 type"                              scale_vec >= 2X + mxf8f6f4
"Cannot use 1X as scale vector size for mxf4nvf4 type"                                    scale_vec 1X + mxf4nvf4
"Cannot use 1X or 4X as scale vector size for mxf4 type"                                  scale_vec 1X or 4X + mxf4

Note the typo "colletor" (missing 'c') in the binary -- this is a genuine NVIDIA binary string, not a transcription error.
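The scale-vector rows of the error table can be sketched as a standalone check. This is our reconstruction of the decision logic, assuming the kind encoding from the data-type table (0=mxf4nvf4, 2=mxf8f6f4, 7=mxf4) and the scale-vector encoding from bits [3:2] (0/1=1X, 2=2X, 3=4X); the error strings are the recovered ones.

```c
#include <assert.h>
#include <string.h>

enum { KIND_MXF4NVF4 = 0, KIND_MXF8F6F4 = 2, KIND_MXF4 = 7 };

/* Returns a recovered error string, or NULL if the combination is accepted. */
static const char *check_scale_vec(unsigned kind, unsigned scale_vec) {
    int is_1x = (scale_vec == 0 || scale_vec == 1);
    if (!is_1x && kind == KIND_MXF8F6F4)
        return "Cannot use 2X or 4X as scale vector size for mxf8f6f4 type";
    if (is_1x && kind == KIND_MXF4NVF4)
        return "Cannot use 1X as scale vector size for mxf4nvf4 type";
    if ((is_1x || scale_vec == 3) && kind == KIND_MXF4)
        return "Cannot use 1X or 4X as scale vector size for mxf4 type";
    return 0;
}
```

Under this model, mxf4 with .scale_vec::2X is the only accepted MX FP4 scale-vector combination, matching the constraint table above.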

tcgen05 Scaled MMA Operand Builder

Two identical copies exist for the tcgen05 scaled MMA descriptor:

Copy         Address    Layer
----         -------    -----
sub_21E8CD0  0x21E8CD0  AsmPrinter / PTX emission
sub_35F3E90  0x35F3E90  NVPTX backend / SelectionDAG

The packed descriptor encodes Blackwell-specific modifiers:

Bit  Query     Set Value  Clear Value  Semantics
---  -----     ---------  -----------  ---------
0    "scaleD"  "1"        "0"          Scale output accumulator
1    "negA"    "-1"       "1"          Negate A matrix
2    "negB"    "-1"       "1"          Negate B matrix
3    "transA"  "1"        "0"          Transpose A
4    "transB"  "1"        "0"          Transpose B

scaleD and transA/transB emit boolean "0"/"1" strings. negA and negB emit sign multiplier strings "-1"/"1" because PTX applies negation as a multiplication factor.
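The query interface above can be sketched as a small dispatcher; the function name is ours, and the bit positions and answer strings follow the table.

```c
#include <assert.h>
#include <string.h>

/* String-keyed query over the tcgen05 scaled-MMA descriptor bits:
 * booleans answer "0"/"1", negation fields answer a sign multiplier. */
static const char *tcgen05_query(unsigned desc, const char *key) {
    if (!strcmp(key, "scaleD")) return (desc & 1u)        ? "1"  : "0";
    if (!strcmp(key, "negA"))   return (desc & (1u << 1)) ? "-1" : "1";
    if (!strcmp(key, "negB"))   return (desc & (1u << 2)) ? "-1" : "1";
    if (!strcmp(key, "transA")) return (desc & (1u << 3)) ? "1"  : "0";
    if (!strcmp(key, "transB")) return (desc & (1u << 4)) ? "1"  : "0";
    return 0;  /* unknown key */
}
```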

tcgen05.cp Copy Operations

Shape variants (bits [3:1]):

Value  PTX shape
-----  ---------
0      .128x256b
1      .4x256b
2      .128x128b
3      .64x128b
4      .32x128b

Destination format variants:

Condition  PTX format
---------  ----------
default    .b8x16
bit 7 = 0  .b6x16_p32
bit 7 = 1  .b4x16_p64
bit 8 set  error: "Unsupported tcgen05.cp destination format"

Multicast modes:

Type             PTX modifier
----             ------------
type 1, shape 3  .warpx2::02_13
type 2, shape 3  .warpx2::01_23
type 3, shape 4  .warpx4

Duplicate Backend Copies

Several MMA functions exist as near-identical pairs -- one in the AsmPrinter emission layer (0x21Dxxxx--0x21Exxxx) and one in the NVPTX backend layer (0x36Exxxx). The difference is limited to error reporting and reference counting functions:

AsmPrinter Copy  Backend Copy  Operation
---------------  ------------  ---------
sub_21DFBF0      sub_36E91F0   hmmastc
sub_21E0360      sub_36E72A0   hmmaldab
sub_21E0630      sub_36E7580   hmmaldc
sub_21E0870      sub_36E77C0   hmmamma
sub_21E1280      sub_36E7B50   immaldab
sub_21E15D0      sub_36E7EA0   immaldc
sub_21E1830      sub_36E8110   immastc
sub_21E1D20      sub_36E8630   immamma
sub_21E2280      sub_36E8BD0   bmmamma
sub_21E8CD0      sub_35F3E90   tcgen05 scaled MMA

AsmPrinter copies use sub_16BD130 for errors; backend copies use sub_C64ED0. AsmPrinter copies use sub_1623A60/sub_161E7C0 for refcounting; backend copies use sub_B96E90/sub_B91220.

Shape x Type x Architecture Matrix

Shape                  A/B Types                                        Accumulator  Min SM   Notes
-----                  ---------                                        -----------  ------   -----
m8n8k4                 f16                                              f16, f32     SM 70    Original Volta
m16n8k4                f64                                              f64          SM 80    Ampere double precision
m16n8k8                f16                                              f16, f32     SM 75    Turing+
m16n8k16               f16, bf16, tf32                                  f16, f32     SM 80    Ampere+
m16n16k8               f16                                              f16, f32     SM 70    WMMA path
m16n16k16              f16, bf16                                        f16, f32     SM 90    Hopper+
m32n8k16               f16, bf16                                        f16, f32     SM 80    Tall shape
m8n8k16                s8, u8                                           s32          SM 72    Integer MMA
m16n8k16               s8, u8                                           s32          SM 75    Turing+ integer
m16n8k32               s8, u8                                           s32          SM 75    Turing+ integer
m8n8k32                s4, u4                                           s32          SM 75    Sub-byte
m16n8k64               s4, u4                                           s32          SM 75    Sub-byte
m8n8k64                s4, u4                                           s32          SM 75    Extended sub-byte
m16n8k128              s4, u4                                           s32          SM 75    Extended sub-byte
m8n8k128               b1                                               s32          SM 75    Binary (.and.popc / .xor.popc)
m16n8k256              b1                                               s32          SM 75    Binary extended
tcgen05 (10 variants)  mxf4nvf4, f8f6f4, mxf8f6f4, f16, tf32, i8, mxf4  varies       SM 100+  block_scale, +sparsity, +ws

LLVM Intrinsic ID Reference

Key intrinsic IDs used in the MMA code generation pipeline:

Intrinsic ID  Symbol                          Usage
------------  ------                          -----
8181          llvm.nvvm.wmma.store (complex)  WMMA complex store
8210          llvm.nvvm.wmma.store            WMMA store
8279          (special)                       IMMA MMA without satf
8280          (special)                       Fragment count = 2 trigger
8914          (special)                       Fragment count = 2 trigger
9062          llvm.nvvm.wgmma.fence.aligned   WGMMA fence (3 type overloads)
9067          llvm.nvvm.wgmma.mma.async       WGMMA MMA async (2 type overloads)
9145          llvm.nvvm.wgmma.store           WGMMA store
10654--10779  llvm.nvvm.wgmma.mma.async.*     Per-dimension WGMMA variants (144 entries)
5304--5447    (WGMMA grid)                    5-dimensional intrinsic grid for WGMMA

Error Handling

Two error-reporting functions serve the two layers:

Function     Address    Layer                         Behavior
--------     -------    -----                         --------
sub_16BD130  0x16BD130  AsmPrinter / PTX emission     Fatal (severity=1 -> abort)
sub_C64ED0   0xC64ED0   NVPTX backend / SelectionDAG  Fatal (severity=1 -> abort)

Error categories:

  1. Architecture not supported: "X is not supported on this architecture" -- SM gate failure
  2. Constant validation: "rowcol not constant", "satf not constant" -- non-constant operand
  3. Type restrictions: "Wrong MMA element type" -- invalid type enum
  4. Feature combination: "ashift is not supported with tcgen05.mma.block_scale" -- conflicting modifiers
  5. Scale restrictions: "Cannot use N as scale vector size for X type" -- type/scale mismatch

Differences from Upstream LLVM

Upstream LLVM's NVPTX backend has no MMA code generation. The entire MMA pipeline -- builtin tables, three-table lookup, fragment size computation, WGMMA dimension dispatch, tcgen05 lowering, packed descriptor encoding, and all shape/type validation -- is NVIDIA-proprietary code with no upstream equivalent.

Upstream LLVM handles MMA operations at the PTX level only: the upstream NVPTXAsmPrinter can print PTX mma.sync instructions, but the instruction selection, intrinsic lowering, and code generation logic that produces them exists only in NVIDIA's cicc binary. An open-source reimplementation would need to build the entire pipeline from the WMMA/MMA intrinsic definitions through SelectionDAG lowering and PTX emission.

Cross-References

NVVM Builtin Table Structure

770 builtins mapped to integer IDs (1--770) in a wyhash open-addressing hash table. Dual tables exist: pre-optimization (sub_90AEE0) and post-optimization (sub_126A910), both with identical content but separate address spaces.

Pre-opt table builder   sub_90AEE0 (109 KB, populates all 770 entries)
Pre-opt dispatcher      sub_913450 (name -> ID lookup)
Post-opt table builder  sub_126A910 (123 KB)
Post-opt dispatcher     sub_12731E0 (name -> ID lookup)
Hash function           sub_CBF760 (wyhash v4 family)
Hash table insert       sub_90ADD0 -> sub_C92610 -> sub_C92740
Hash table find         sub_C92860 (find-only, quadratic probing)
Rehash                  sub_C929D0 (75% load factor trigger)
Total builtins          770 (IDs 1--770)
Storage                 Open-addressing at context+480 (20-byte header)

Architecture

sub_913450 (public API: name -> builtin ID)
  |
  +-- Guard: context+492 == 0?
  |    +-- sub_90AEE0 (lazy init: populate all 770 entries, once)
  |
  +-- strlen(name)
  +-- sub_C92610(name, len)         -> compute wyhash
  +-- sub_C92860(context+480, ...)  -> quadratic probe find
  |
  +-- return *(uint32*)(entry + 8)  -> the builtin ID

Hash Table Infrastructure

The builtin name table uses a specialized 20-byte hash table header at context+480 with a parallel hash cache array and wyhash-v4 string hashing. The table employs quadratic probing with triangular-number increments and grows at 75% load factor. For 770 entries the capacity sequence is 16 -> 32 -> 64 -> 128 -> 256 -> 512 -> 1024.

Full structural details -- table layout, bucket format, string entry format, wyhash length-dispatch table with pseudocode, probing algorithm, triple-gated comparison guard, rehash procedure, and sentinel values -- are documented in Hash Table and Collection Infrastructure. The "wyhash v4 String Hasher" and "Probing Strategy" sections on that page are the canonical references.
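The growth and probing behavior described above can be sketched as follows. This is a minimal model, assuming growth is triggered strictly above the 75% load factor and that the triangular-number probe sequence is applied to a power-of-two table; the function names are ours.

```c
#include <assert.h>

/* Capacity after inserting `entries` items, doubling from 16 whenever
 * occupancy would exceed the 75% load factor. */
static unsigned capacity_for(unsigned entries) {
    unsigned cap = 16;
    while (entries > (cap * 3) / 4)
        cap *= 2;
    return cap;
}

/* i-th probe slot for hash h: quadratic probing with triangular-number
 * increments (offsets 0, 1, 3, 6, 10, ...), masked into the table. */
static unsigned probe_slot(unsigned h, unsigned i, unsigned cap) {
    return (h + (i * (i + 1)) / 2) & (cap - 1);
}
```

Triangular increments visit every slot of a power-of-two table exactly once before repeating, which is why this probing scheme pairs with doubling growth.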

Complete Builtin ID Inventory

Synchronization & Compiler Intrinsics (IDs 1–7)

ID  Name
--  ----
1   __syncthreads
2   __nvvm_bar0
3   __nvvm_membar_cta
4   __nvvm_membar_gl
5   __nvvm_membar_sys
6   __builtin_is_constant_evaluated
7   __builtin_unreachable

Cluster Operations — SM 90+ (IDs 8–14)

ID  Name
--  ----
8   __nv_clusterDimIsSpecifed_impl
9   __nv_clusterRelativeBlockRank_impl
10  __nv_clusterSizeInBlocks_impl
11  __nv_cluster_barrier_arrive_impl
12  __nv_cluster_barrier_wait_impl
13  __nv_cluster_barrier_arrive_relaxed_impl
14  __nv_threadfence_cluster_impl

Barrier Extensions (IDs 15–20)

ID     Name
--     ----
15–17  __nvvm_bar0_{popc,and,or}
18–20  __nvvm_bar{_sync_all,rier_sync,_warp_sync}

Bit Manipulation (IDs 21–26)

__nvvm_clz_{i,ll}, __nvvm_popc_{i,ll}, __nvvm_brev{32,64}

Math — Rounding/Abs/Saturate (IDs 27–56)

__nvvm_{floor,ceil,abs,fabs,round,trunc,saturate}_{ftz_f,f,d}, __nvvm_{ex2,lg2,sin,cos}_approx_{ftz_f,f,d}

Reciprocal / Sqrt / Rsqrt (IDs 57–87)

__nvvm_rcp_{rn,rz,rm,rp}_{ftz_f,f,d}, __nvvm_sqrt_{f,rn,rz,rm,rp}_{ftz_f,f,d}, __nvvm_rsqrt_approx_{ftz_f,f,d}

Type Conversions (IDs 88–184)

97 entries covering all float↔int, double↔int, float↔half, bitcast combinations with all four rounding modes and FTZ variants.

Address Space & Memory Queries (IDs 185–204)

ID        Name
--        ----
185       __nv_isGlobal_impl
186–188   __nv_bswap{16,32,64}_impl
189–192   __nv_is{Shared,Constant,Local,GridConstant}_impl
193–200   __nv_cvta_{generic_to,to_generic}_{global,shared,constant,local}_impl
201       __builtin_assume
202       __nv_isClusterShared_impl
203       __nv_cluster_query_shared_rank_impl
204       __nv_associate_access_property_impl

Atomic Operations — Legacy NVVM (IDs 207–275)

69 entries: __nvvm_atom_{,cta_,sys_}{add,xchg,min,max,inc,dec,and,or,xor}_gen_{i,ll,f,d,ui,ull,128}

FP Arithmetic (IDs 276–349)

__nvvm_{min,max}_{i,ui,ll,ull}, __nvvm_f{min,max}_{f,ftz_f,d}, __nvvm_mulhi_{i,ui,ll,ull}, __nvvm_mul_{rn,rz,rm,rp}_{ftz_f,f,d}, __nvvm_div_*, __nvvm_add_*

Vote Operations (IDs 351–358)

__nvvm_vote_{all,any,uni,ballot} + _sync variants

Match Operations (IDs 361–364)

__match{32,64}_{any,all}_sync

FMA (IDs 383–403)

__nvvm_fma_{rn,rz,rm,rp}_{ftz_f,f,d,ftz_f2,f2}

C++11 Atomics (IDs 417–473)

Sized variants: __nv_atomic_{load,store,fetch_add,fetch_sub,fetch_and,fetch_or,fetch_xor,fetch_max,fetch_min,exchange,compare_exchange}_{1,2,4,8,16}_{u,s,f}

Surface Stores — sust (IDs 474–638)

165 entries covering __nvvm_sust_b_{1d,1d_array,2d,2d_array,3d}_{i8,...,v4i32}_{clamp,trap,zero}.

Pattern: sust_b_<dim>_<type>_<oob_mode> across 5 dimensions × 11 types × 3 OOB modes.
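The 165-entry fan-out can be reproduced mechanically. In this sketch the dimension and OOB-mode lists come from the source above, but the 11-element type list is an assumption (only i8 and v4i32 are confirmed by the recovered strings):

```python
from itertools import product

dims  = ["1d", "1d_array", "2d", "2d_array", "3d"]
# Assumed 11-type list; only "i8" and "v4i32" are confirmed in the binary.
types = ["i8", "i16", "i32", "i64", "v2i8", "v2i16",
         "v2i32", "v2i64", "v4i8", "v4i16", "v4i32"]
modes = ["clamp", "trap", "zero"]

# Cartesian product: 5 dimensions x 11 types x 3 OOB modes = 165 names.
names = [f"__nvvm_sust_b_{d}_{t}_{m}" for d, t, m in product(dims, types, modes)]
print(len(names))  # 165
```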

CUDA Varargs (IDs 639–642)

__cu_va_{start,end,arg,copy}

Tex/Surf Handler (ID 647)

__nv_tex_surf_handler — generic dispatch for texture/surface reads (surface stores use the dedicated sust builtins above).

C++ ABI (IDs 648–677)

__cxa_vec_{ctor,cctor,dtor,new2,new,new3,delete2,delete,delete3}, __gen_nvvm_mem{cpy,set}_*, _Znw{j,m,y}, _Zna{j,m,y}, _ZdlPv{,m,y}, _ZdaPv{,m,y}

WMMA Tensor Core — SM 70+ (IDs 678–707)

30 entries: __hmma_m{16n16k16,32n8k16,8n32k16}_{ld_a,ld_b,ld_c_f16,ld_c_f32,st_c_f16,st_c_f32,mma_f16f16,mma_f32f16,mma_f16f32,mma_f32f32}

Integer/Binary Tensor Core — SM 75+ (IDs 708–745)

38 entries: __imma_m{16n16k16,32n8k16,8n32k16}_{ld_a,ld_b,ld_c,st_c,mma}_{s8,u8}, __imma_m8n8k32_{s4,u4}, __bmma_m8n8k128_{b1}

Extended Tensor Core — SM 80+ (IDs 746–764)

__dmma_m8n8k4_mma_f64, __mma_tf32_m16n16k8_mma_f32, __mma_bf16_m*_mma_f32 + load/store variants

WGMMA — SM 90+ (IDs 765–768)

__wgmma_mma_async_{f16,bf16,tf32,f8}

Alloca (IDs 769–770)

_alloca, __builtin_alloca

Category Summary

Category                          ID Range   Count
--------                          --------   -----
Sync/barriers/cluster             1–20       20
Bit manipulation                  21–26      6
Math (floor/ceil/abs/round/etc)   27–56      30
Reciprocal/sqrt/rsqrt             57–87      31
Type conversions                  88–184     97
Address space queries/cvta        185–204    20
Atomic ops (NVVM legacy)          207–275    69
FP min/max, mulhi, arithmetic     276–349    74
Vote + match operations           351–364    12
Compare-and-swap                  370–379    10
FMA                               383–403    21
Shuffle + misc                    404–416    13
C++11 atomics (sized)             417–473    57
Surface stores (sust)             474–638    165
CUDA varargs + math shim          639–646    8
Tex/surf handler                  647        1
C++ ABI + memgen + new/delete     648–677    30
WMMA tensor core (f16)            678–707    30
IMMA/BMMA tensor core             708–745    38
Extended tensor (dmma/tf32/bf16)  746–764    19
WGMMA (SM 90+ warpgroup)          765–768    4
Alloca                            769–770    2
TOTAL                                        770

SM Generation Coverage

Generation      Features Enabled
----------      ----------------
SM 70 (Volta)   WMMA (half-precision tensor core)
SM 75 (Turing)  IMMA (integer), BMMA (binary)
SM 80 (Ampere)  DMMA (double), TF32, BF16
SM 90 (Hopper)  WGMMA (warpgroup), cluster ops, f8

All 770 builtins are registered regardless of target SM. Architecture gating happens in the lowering layer that consumes the builtin IDs.

Key Observations

  • Lazy initialization: The entire table is built on first lookup, guarded by the init flag at context+492 (zero until sub_90AEE0 populates the table).
  • No texture reads (suld): Only surface store builtins are registered. Texture/surface reads go through __nv_tex_surf_handler (ID 647).
  • Write-once table: Tombstone mechanics exist but deletions never occur for the builtin table.
  • Duplicate prefix optimization: IDA shows SSE xmmword constant loads for long common prefixes (__nvvm_sust_b_2d_array_*) — this is compiler optimization of string literal loads, not a different code path.

Atomic Operations Builtins

Atomic builtins constitute the largest and most complex category in the NVVM builtin system, spanning over 130 IDs across two distinct subsystems: the legacy NVVM intrinsic atomics (IDs 207--275, 370--379) and the C++11-model atomics (IDs 366, 417--473). Both families converge in the lowering layer at sub_12AE930 (EDG) / sub_9502D0 (NVVM), a 1495-line handler that generates inline PTX assembly with explicit memory ordering and scope annotations.

Two Atomic Subsystems

The compiler maintains two parallel atomic APIs that reflect CUDA's historical evolution. The legacy NVVM atomics (__nvvm_atom_*) predate the C++ memory model and encode scope directly in the builtin name (e.g., __nvvm_atom_cta_add_gen_i for block-scoped integer add). The C++11 atomics (__nv_atomic_*) accept ordering and scope as runtime parameters, matching the cuda::atomic_ref interface.

Both subsystems lower to identical PTX instructions. The distinction matters only during the EDG frontend phase, where sub_6BBC40 generates the mangled __nv_atomic_* names from C++ source, and the NVVM lowering layer sub_12B3FD0 dispatches them by ID.

Legacy NVVM Atomics (IDs 207--275)

These 69 builtins encode the operation, scope, and type directly in the name. The lowering dispatches through sub_12AA9B0 for exchange-style operations and sub_12ADE80 for load/store/fetch operations. Each operation exists in three scope variants: default (device), _cta_ (block), and _sys_ (system).

ID Range  Operation  Builtin Pattern                                      PTX Mnemonic
--------  ---------  ---------------                                      ------------
207--218  Add        __nvvm_atom_{,cta_,sys_}add_gen_{i,ll,f,d}           atom.add
219--227  Exchange   __nvvm_atom_{,cta_,sys_}xchg_gen_{i,ll,128}          atom.exch
228--251  Min/Max    __nvvm_atom_{,cta_,sys_}{min,max}_gen_{i,ll,ui,ull}  atom.min / atom.max
252--257  Inc/Dec    __nvvm_atom_{,cta_,sys_}{inc,dec}_gen_ui             atom.inc / atom.dec
258--275  Bitwise    __nvvm_atom_{,cta_,sys_}{and,or,xor}_gen_{i,ll}      atom.and / atom.or / atom.xor

Legacy CAS (IDs 370--379)

Compare-and-swap builtins include 128-bit variants for SM 70+ targets. The handler sub_12AA280 builds an AtomicCmpXchg IR node with acquire ordering on both success and failure paths and weak exchange semantics.

ID Range  Operation  Builtin Pattern
--------  ---------  ---------------
370--379  CAS        __nvvm_atom_{,cta_,sys_}cas_gen_{i,ll,us,128}

Half-Precision Atomics (IDs 459--468)

Added for SM 90+ (Hopper), these support f16x2 and f16x4 packed atomic adds:

ID Range  Operation  Builtin Pattern                     SM Gate
--------  ---------  ---------------                     -------
459--461  f16x2 add  __nvvm_atom_{,cta_,sys_}add_gen_f2  SM 90+
466--468  f16x4 add  __nvvm_atom_{,cta_,sys_}add_gen_f4  SM 100+ (Blackwell)

C++11 Atomics (IDs 366, 417--473)

These 57 builtins implement the CUDA C++ atomic model with explicit memory ordering and scope parameters. The EDG frontend generator at sub_6BBC40 constructs the mangled names using a __nv_atomic_fetch_{op}_{width}_{type} pattern, where width is the byte count (1, 2, 4, 8, or 16) and the type suffix is _u (unsigned), _s (signed), or _f (float).

Thread Fence (ID 366)

__nv_atomic_thread_fence emits either a volatile fence (SM <= 69) or an explicit fence.{ordering}.{scope}; PTX instruction (SM 70+). Ordering and scope are extracted from constant operand parameters at compile time.

Load/Store (IDs 417--428)

ID        Builtin                        Width        PTX
--        -------                        -----        ---
417       __nv_atomic_load               generic      ld.{ordering}.{scope}.{type}
418--422  __nv_atomic_load_{1,2,4,8,16}  1--16 bytes  same
423       __nv_atomic_store              generic      st.{ordering}.{scope}.{type}
424--428  __nv_atomic_store_{1,2,4,8,16} 1--16 bytes  same

Fetch-Op (IDs 429--458)

Arithmetic and bitwise fetch operations are registered with width and type suffixes. Bitwise operations (and, or, xor) omit the type suffix since signedness is irrelevant for bitwise logic.

ID Range  Operation         Builtin Pattern
--------  ---------         ---------------
429--434  fetch_add         __nv_atomic_fetch_add_{4,8}_{u,s,f}
435--440  fetch_sub         __nv_atomic_fetch_sub_{4,8}_{u,s,f}
441--446  fetch_and/or/xor  __nv_atomic_fetch_{and,or,xor}_{4,8}
447--452  fetch_max         __nv_atomic_fetch_max_{4,8}_{u,s,f}
453--458  fetch_min         __nv_atomic_fetch_min_{4,8}_{u,s,f}

For fetch_sub with floating-point types (IDs 437, 440), the lowering negates the operand and emits atom.add rather than a dedicated subtraction instruction.

Exchange and CAS (IDs 462--473)

ID Range  Operation  Builtin Pattern
--------  ---------  ---------------
462--465  Exchange   __nv_atomic_exchange{,_4,_8,_16}
469--473  CAS        __nv_atomic_compare_exchange{,_2,_4,_8,_16}

PTX Inline Assembly Generation

The atomic codegen handler at sub_12AE930 (address 0x12AE930, 41KB) generates PTX inline assembly strings at compile time. The generated instruction format depends on the target SM:

Pre-SM 70 (volatile mode, unk_4D045E8 <= 0x45):

ld.volatile.b32 $0, [$1];
atom.add.volatile.u32 $0, [$1], $2;

SM 70+ (explicit memory model):

ld.acquire.gpu.b32 $0, [$1];
st.release.sys.b32 [$0], $1;
atom.add.acq_rel.cta.u32 $0, [$1], $2;
atom.cas.relaxed.gpu.b64 $0, [$1], $2, $3;

The sub_12AE930 / sub_9502D0 Algorithm in Detail

Both the EDG-side handler (sub_12AE930, 0x12AE930) and its NVVM-side twin (sub_9502D0, 0x9502D0) follow identical logic. They accept five parameters: (result, codegen_state, builtin_id, call_arg_list, type_info). The algorithm proceeds in six phases.

Phase 1: SM Version Check and Path Selection

v186 = (unk_4D045E8 <= 0x45)    // SM <= 69 -> volatile mode

When v186 is true, the handler enters the pre-SM 70 "volatile" path. All atomic operations receive a .volatile qualifier instead of explicit memory ordering and scope qualifiers. The 128-bit atomics emit diagnostic 0xEB6 (3766) and are rejected entirely.

When v186 is false (SM 70+), the handler enters the memory model path, which constructs the full {mnemonic}.{ordering}.{scope}.{type} format.

Phase 2: Operand Extraction and Builtin ID Dispatch

The handler extracts between 2 and 5 operands from the call argument list (pointer, value, compare-value for CAS, plus the ordering and scope parameters encoded as compile-time constants). The builtin ID selects the PTX mnemonic via a switch:

switch (builtin_id) {
    case 417..422:  mnemonic = "ld";          // atomic load
    case 423..428:  mnemonic = "st";          // atomic store
    case 429..434:  mnemonic = "atom.add";    // fetch-add (unsigned, signed, float)
    case 435..440:  mnemonic = "atom.add";    // fetch-sub (negated; see below)
    case 441..442:  mnemonic = "atom.and";    // fetch-and
    case 443..444:  mnemonic = "atom.or";     // fetch-or
    case 445..446:  mnemonic = "atom.xor";    // fetch-xor
    case 447..452:  mnemonic = "atom.max";    // fetch-max
    case 453..458:  mnemonic = "atom.min";    // fetch-min
    case 462..465:  mnemonic = "atom.exch";   // exchange
    case 469..473:  mnemonic = "atom.cas";    // compare-and-swap
    default:        fatal("unexpected atomic builtin function");
}

For IDs 435--440 (fetch_sub), the handler does not emit atom.sub (which does not exist in PTX). Instead, for integer types it negates the operand and emits atom.add; for float types it negates via fneg and emits atom.add.f.
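For integers, the negate-and-add rewrite is exact because two's-complement addition wraps. A minimal model of the equivalence (Python; helper names are ours, the real handler operates on IR nodes):

```python
MASK32 = (1 << 32) - 1

def atom_add_u32(mem, addr, val):
    """Word-sized atomic fetch-add on a 32-bit cell; returns the old value."""
    old = mem[addr]
    mem[addr] = (old + val) & MASK32
    return old

def fetch_sub_via_add(mem, addr, val):
    # The handler negates the operand and reuses the atom.add path;
    # two's-complement wrapping makes this exact for integers.
    return atom_add_u32(mem, addr, (-val) & MASK32)

mem = {0: 10}
old = fetch_sub_via_add(mem, 0, 3)
print(old, mem[0])  # 10 7
```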

For thread fence (ID 366), the handler branches to sub_12AE0E0 (volatile fence, pre-SM 70) or sub_12AE4B0 (explicit fence, SM 70+) and returns immediately, bypassing the rest of the atomic pipeline.

Phase 3: Memory Ordering Resolution

The ordering parameter is extracted from the first constant operand of the C++11 atomic call via sub_620EE0. The value (0--5) maps to a PTX qualifier string:

Value  C++ Ordering                  PTX Qualifier                      Applies To
-----  ------------                  -------------                      ----------
0      relaxed / monotonic           relaxed                            All operations
1      consume (treated as acquire)  acquire                            Loads, RMW
2      acquire                       acquire                            Loads, RMW
3      release                       release                            Stores, RMW
4      acq_rel                       acq_rel                            RMW operations
5      seq_cst                       acquire (loads), release (stores)  All

Sequential consistency (value 5) is downgraded: loads get acquire, stores get release, and RMW operations get acq_rel. True seq_cst semantics are achieved by inserting explicit fences around the operation (see "Fence Insertion for Seq_Cst" below).

Store-specific validation. For store builtins (IDs 423--428), only ordering values 0, 3, and 5 are legal. Any other value triggers fatal("unexpected memory order."). Value 5 is treated as relaxed for the store instruction itself, with the seq_cst fence handling the ordering guarantee externally.

Load-specific validation. For load builtins (IDs 417--422), values 3 (release) and 4 (acq_rel) are illegal and trigger the same fatal error.
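The ordering resolution and per-kind validation can be condensed into one function (Python sketch; the seq_cst store case follows the store-specific note above, where the instruction itself carries relaxed and fences supply the guarantee):

```python
def resolve_ordering(value, kind):
    """Map the C++11 ordering operand (0-5) to a PTX qualifier,
    with the load/store legality checks described above."""
    if kind == "load":
        if value in (3, 4):                      # release / acq_rel illegal on loads
            raise ValueError("unexpected memory order.")
        return {0: "relaxed", 1: "acquire", 2: "acquire", 5: "acquire"}[value]
    if kind == "store":
        if value not in (0, 3, 5):               # only relaxed/release/seq_cst legal
            raise ValueError("unexpected memory order.")
        return {0: "relaxed", 3: "release", 5: "relaxed"}[value]  # 5: fences added around
    # RMW: seq_cst downgrades to acq_rel
    return {0: "relaxed", 1: "acquire", 2: "acquire",
            3: "release", 4: "acq_rel", 5: "acq_rel"}[value]

print(resolve_ordering(5, "load"), resolve_ordering(5, "rmw"))  # acquire acq_rel
```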

Phase 4: Scope Resolution

The scope parameter is extracted from the second constant operand via sub_620EE0. The value (0--4) maps to a PTX scope qualifier:

switch (scope_value) {
    case 0:  // fall through
    case 1:  scope_str = "cta";      break;   // thread block
    case 2:
        if (unk_4D045E8 > 0x59)               // SM > 89
            scope_str = "cluster";             // SM 90+ (Hopper)
        else
            scope_str = "gpu";                 // SM <= 89: fallback
        break;
    case 3:  scope_str = "gpu";      break;   // device
    case 4:  scope_str = "sys";      break;   // system
    default: fatal("unexpected atomic operation scope.");
}

The cluster scope fallback is the critical SM gate at line 255 / 424 of sub_12AE930 / sub_9502D0: when the SM version is 89 or below, scope value 2 ("cluster") silently degrades to gpu. No diagnostic is emitted; the scope is simply rewritten. On SM 90+ (Hopper and later), cluster passes through to the PTX output.
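The decompiled switch reduces to a few lines (Python sketch; the real code compares the raw SM byte unk_4D045E8 against 0x59, which is the same test as `sm >= 90`):

```python
def resolve_scope(scope_value, sm_version):
    """Scope operand (0-4) -> PTX scope string, including the silent
    cluster -> gpu fallback below SM 90."""
    if scope_value in (0, 1):
        return "cta"
    if scope_value == 2:
        return "cluster" if sm_version >= 90 else "gpu"  # no diagnostic on fallback
    if scope_value == 3:
        return "gpu"
    if scope_value == 4:
        return "sys"
    raise ValueError("unexpected atomic operation scope.")

print(resolve_scope(2, 89))  # gpu
print(resolve_scope(2, 90))  # cluster
```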

Phase 5: Type Suffix Construction

The type suffix is built from two components: a type-class letter and a byte-width number. The type-class lookup uses a 4-entry table stored in local variable v196:

v196[0] = 'b'    // bitwise   (for exch, and, or, xor, cas)
v196[1] = 'u'    // unsigned  (for add, inc, dec, max, min on unsigned)
v196[2] = 's'    // signed    (for max, min on signed)
v196[3] = 'f'    // float     (for add on float/double)

The type-class index is derived from the LLVM type of the atomic operand:

  • Integer type with unsigned semantics: index 1 (u)
  • Integer type with signed semantics: index 2 (s)
  • Floating-point type: index 3 (f)
  • All other cases (exchange, CAS, bitwise): index 0 (b)

The byte-width is the size of the atomic operand in bytes. Valid sizes are validated against the bitmask 0x10116:

valid = ((1LL << byte_size) & 0x10116) != 0

This bitmask has bits set at positions 1, 2, 4, 8, and 16, accepting exactly the byte widths {1, 2, 4, 8, 16}. Any other size triggers fatal("unexpected size1").

The resulting suffix is the letter concatenated with the bit width (byte_size * 8): .u32, .s64, .f32, .b128, etc.
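Suffix construction and the size check fit in a few lines (Python sketch mirroring the v196 table and the 0x10116 mask):

```python
def type_suffix(type_class_index, byte_size):
    """Build the PTX type suffix: class letter + bit width. Sizes are
    validated against the 0x10116 bitmask (bits 1, 2, 4, 8, 16 set)."""
    if (1 << byte_size) & 0x10116 == 0:
        raise ValueError("unexpected size1")
    letter = "busf"[type_class_index]   # b=bitwise, u=unsigned, s=signed, f=float
    return f".{letter}{byte_size * 8}"

print(type_suffix(1, 4))   # .u32
print(type_suffix(0, 16))  # .b128
```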

Phase 6: Inline ASM String Assembly and Emission

The handler assembles the final PTX string by concatenating the components. Two string buffers are maintained throughout: v190 (ordering string) and v193 (scope string), set during phases 3 and 4.

For SM 70+ (memory model mode):

// Loads:
sprintf(buf, "ld.%s.%s.%c%d $0, [$1];", v190, v193, type_letter, bit_width);
// Stores:
sprintf(buf, "st.%s.%s.%c%d [$0], $1;", v190, v193, type_letter, bit_width);
// RMW atomics:
sprintf(buf, "%s.%s.%s.%c%d $0, [$1], $2;", mnemonic, v190, v193, type_letter, bit_width);
// CAS:
sprintf(buf, "%s.%s.%s.%c%d $0, [$1], $2, $3;", mnemonic, v190, v193, type_letter, bit_width);

For pre-SM 70 (volatile mode):

// Loads:
sprintf(buf, "ld.volatile.%c%d $0, [$1];", type_letter, bit_width);
// RMW atomics:
sprintf(buf, "%s.volatile.%c%d $0, [$1], $2;", mnemonic, type_letter, bit_width);

Constraint string construction. The LLVM inline ASM constraint string is built dynamically to match the operand pattern:

Pattern         Constraint String                     Meaning
-------         -----------------                     -------
Load (ld)       "=r,l,~{memory}" or "=l,l,~{memory}"  result in reg, address in 64-bit reg, memory clobber
Store (st)      "l,r,~{memory}" or "l,l,~{memory}"    address in 64-bit reg, value in reg, memory clobber
RMW (atom.*)    "=r,l,r,~{memory}"                    result, address, operand, memory clobber
CAS (atom.cas)  "=r,l,r,r,~{memory}"                  result, address, compare, swap, memory clobber

The register class for result and value operands is r for 32-bit types and l for 64-bit types. 128-bit types use l with pair operands.

The assembled PTX string and constraint string are passed to sub_B41A60 (NVVM side) or the equivalent EDG-side helper, which creates an LLVM InlineAsm node. The node is then emitted via sub_921880 / sub_1285290.
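Putting the sprintf patterns and constraint rules together, the SM 70+ assembly step looks roughly like this (Python sketch; function and parameter names are ours):

```python
def emit_atomic_asm(mnemonic, ordering, scope, letter, bits):
    """Assemble the SM 70+ inline-ASM string and a matching constraint
    string, following the sprintf patterns and constraint table above."""
    reg = "l" if bits == 64 else "r"     # r for 32-bit, l for 64-bit operands
    if mnemonic == "ld":
        asm = f"ld.{ordering}.{scope}.{letter}{bits} $0, [$1];"
        cons = f"={reg},l,~{{memory}}"
    elif mnemonic == "st":
        asm = f"st.{ordering}.{scope}.{letter}{bits} [$0], $1;"
        cons = f"l,{reg},~{{memory}}"
    elif mnemonic == "atom.cas":
        asm = f"{mnemonic}.{ordering}.{scope}.{letter}{bits} $0, [$1], $2, $3;"
        cons = f"={reg},l,{reg},{reg},~{{memory}}"
    else:  # RMW atomics
        asm = f"{mnemonic}.{ordering}.{scope}.{letter}{bits} $0, [$1], $2;"
        cons = f"={reg},l,{reg},~{{memory}}"
    return asm, cons

print(emit_atomic_asm("atom.add", "acq_rel", "cta", "u", 32)[0])
# atom.add.acq_rel.cta.u32 $0, [$1], $2;
```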

Fence Insertion for Seq_Cst

When the memory ordering is sequential consistency (value 5) and the SM version supports explicit fences (SM 70+), the handler does not simply emit atom.sc.{scope}. Instead, it implements seq_cst through a fence-bracketed pattern:

  1. Pre-fence: If the operation is a store or RMW and ordering >= release, the handler calls sub_94F9E0 (membar) or sub_94FDF0 (fence) to emit a leading fence:

    • sub_94F9E0 emits membar.{scope}; as inline PTX
    • sub_94FDF0 emits fence.sc.{scope}; or fence.acq_rel.{scope};
  2. The atomic operation: Emitted with downgraded ordering (acquire for loads, release for stores, acq_rel for RMW).

  3. Post-fence: If the operation is a load or RMW and ordering >= acquire, a trailing fence is emitted.

The fence scope matches the atomic operation's scope. The decision to emit membar vs fence depends on the SM version and the specific ordering level: membar is used for the pre-SM 70 path (though that path should not reach this code), and fence.sc / fence.acq_rel for SM 70+.

The pre/post-fence logic is gated by two conditions in the NVVM-side handler:

PRE-FENCE:  if (v186 && (v187 - 3) <= 2)    // v187 is ordering; range [3,5] = release, acq_rel, seq_cst
POST-FENCE: if (!v175 && v169 == 5)          // v175 = is_store; v169 = ordering = seq_cst

The Volatile Fence Handler (sub_12AE0E0)

For thread fence on SM <= 69, sub_12AE0E0 emits a volatile memory barrier. The function takes an ASM buffer and fence configuration parameters. It produces:

membar.{scope};

where the scope is derived from the fence's scope parameter (cta / gl / sys). This is the pre-memory-model equivalent of the explicit fence path.

The Explicit Fence Handler (sub_12AE4B0)

For thread fence on SM 70+, sub_12AE4B0 constructs an explicit fence.{ordering}.{scope}; instruction. The ordering for fences is a restricted set compared to atomics:

Ordering Value  Fence Qualifier
--------------  ---------------
3               sc (sequentially consistent)
4               acq_rel
5               sc (same as 3)
Other           fatal("unexpected memory order.")

The scope string follows the same rules as atomics. The assembled string is emitted as LLVM inline ASM with a ~{memory} clobber.

Memory Ordering Encoding

The ordering parameter (values 0--5) maps to PTX qualifiers:

Value  Ordering  Used For
-----  --------  --------
0      relaxed   Default / monotonic
1, 2   acquire   Loads, RMW
3      release   Stores
4      acq_rel   RMW operations
5      acquire   Sequential consistency (downgraded)

Scope Encoding

The scope parameter (values 0--4) maps to PTX scope qualifiers:

Value  Scope    PTX       SM Requirement
-----  -----    ---       --------------
0, 1   Block    .cta      All
2      Cluster  .cluster  SM 90+ (Hopper); falls back to .gpu on SM <= 89
3      Device   .gpu      All
4      System   .sys      All

Type Suffix Construction

The type suffix is built from a 4-entry table: b (bitwise), u (unsigned), s (signed), f (float). Combined with the byte size, this produces suffixes like .u32, .f64, .b128. Valid sizes are validated against the bitmask 0x10116 (bits for 1, 2, 4, 8, and 16 bytes).

The 13 Atomic Operations at PTX Emission

The PTX emission layer at sub_21E5E70 (base) and sub_21E6420 (L2-hinted) implements the final encoding from the NVPTX MachineInstr opcode to the PTX text. The instruction operand word at this stage encodes both scope and operation:

bits[7:4]    — scope:  0 = gpu (default), 1 = cta, 2 = sys
bits[23:16]  — atomic operation opcode (BYTE2)

The dispatch table (15 slots, 13 occupied):

Opcode  PTX Suffix  L2-Hinted Suffix        Description
------  ----------  ----------------        -----------
0x00    .exch.b     .exch.L2::cache_hint.b  Bitwise exchange
0x01    .add.u      .add.L2::cache_hint.u   Unsigned add
0x02    (missing)   (missing)               No .add.s in PTX ISA
0x03    .and.b      .and.L2::cache_hint.b   Bitwise AND
0x04    (missing)   (missing)               Unused slot
0x05    .or.b       .or.L2::cache_hint.b    Bitwise OR
0x06    .xor.b      .xor.L2::cache_hint.b   Bitwise XOR
0x07    .max.s      .max.L2::cache_hint.s   Signed max
0x08    .min.s      .min.L2::cache_hint.s   Signed min
0x09    .max.u      .max.L2::cache_hint.u   Unsigned max
0x0A    .min.u      .min.L2::cache_hint.u   Unsigned min
0x0B    .add.f      .add.L2::cache_hint.f   Float add
0x0C    .inc.u      .inc.L2::cache_hint.u   Unsigned increment
0x0D    .dec.u      .dec.L2::cache_hint.u   Unsigned decrement
0x0E    .cas.b      .cas.L2::cache_hint.b   Compare-and-swap

Opcodes 0x02 and 0x04 are unoccupied. There is no signed atomic add in PTX (signed add uses .add.u since two's-complement wrapping is identical). Slot 0x04 is simply skipped.

The scope prefix is emitted before the operation suffix:

bits[7:4] & 0xF:
    0  ->  (nothing; implicit .gpu scope)
    1  ->  ".cta"
    2  ->  ".sys"

Full PTX emission format:

atom[.scope].{op}.{type}{size}

Example: atom.cta.add.u32, atom.sys.cas.b64, atom.exch.b32.
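The operand-word decode and text rendering can be modeled directly from the table (Python sketch; dictionary contents follow the dispatch table, the decode helpers are ours):

```python
SUFFIX = {0x00: ".exch.b", 0x01: ".add.u", 0x03: ".and.b", 0x05: ".or.b",
          0x06: ".xor.b", 0x07: ".max.s", 0x08: ".min.s", 0x09: ".max.u",
          0x0A: ".min.u", 0x0B: ".add.f", 0x0C: ".inc.u", 0x0D: ".dec.u",
          0x0E: ".cas.b"}
SCOPE = {0: "", 1: ".cta", 2: ".sys"}   # bits[7:4]; 0 = implicit .gpu

def emit_atom(operand_word, size_bits, l2_hint=False):
    """Decode scope (bits 7:4) and opcode (bits 23:16) from the operand
    word and render atom[.scope].{op}[.L2::cache_hint].{type}{size}."""
    scope = SCOPE[(operand_word >> 4) & 0xF]
    op, ty = SUFFIX[(operand_word >> 16) & 0xFF].rsplit(".", 1)
    hint = ".L2::cache_hint" if l2_hint else ""
    return f"atom{scope}{op}{hint}.{ty}{size_bits}"

print(emit_atom((0x01 << 16) | (1 << 4), 32))  # atom.cta.add.u32
print(emit_atom((0x0E << 16) | (2 << 4), 64))  # atom.sys.cas.b64
```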

L2 Cache Hint System (SM 80+ / Ampere)

sub_21E6420 (address 0x21E6420) is a parallel version of the base atomic emitter sub_21E5E70. It inserts .L2::cache_hint between the operation and type suffix for all 13 atomic operations:

atom[.scope].{op}.L2::cache_hint.{type}{size}

The L2 cache hint instructs the GPU's L2 cache to retain (or evict) the atomic target data after the operation completes. This is a PTX 7.3+ feature introduced with Ampere (SM 80+).

The L2-hinted path is selected when bit 0x400 is set in the instruction's encoding flags. The hint is applied at the MachineInstr level during instruction selection, not during the inline ASM generation phase of sub_12AE930. Both paths produce identical scope and type encoding; the L2 path adds exactly the .L2::cache_hint substring.

String emission uses SSE (xmm) register loads from precomputed constant data at addresses xmmword_435F590 through xmmword_435F620 to fast-copy the 16-byte prefix of each operation string, then patches the remaining bytes. This avoids branch-heavy string concatenation for the 13 cases.

AtomicExpandPass: IR-Level Expansion (sub_20C9140)

Before sub_12AE930 handles the C++11 atomics, and separately from the legacy builtin lowering, an LLVM FunctionPass named "Expand Atomic instructions" (pass ID "atomic-expand", registered at sub_20CA900) runs on LLVM IR to decide which atomic operations the NVPTX target can handle natively and which must be expanded into CAS loops.

Expansion Decision Tree

For each atomic instruction in the function:

  1. shouldExpandAtomicCmpXchgInIR (vtable +0x258): Default expands all cmpxchg to LL/SC or CAS-based loops. The NVPTX override may keep native i32/i64 cmpxchg on SM 70+.

  2. shouldExpandAtomicRMWInIR (vtable +0x280):

    • i32 xchg/add/min/max: kept native on all SM.
    • i64 xchg/add: kept native on SM 70+.
    • i32/i64 sub/nand: always expanded to CAS loop (no native PTX instruction).
    • i8/i16 (any operation): always expanded via partword masking.
    • Float atomicAdd: native on SM 70+ (fp32), SM 80+ (fp16/bf16).
  3. shouldExpandAtomicLoadInIR (vtable +0x270): Native for aligned i32/i64. Expanded for i8/i16 (widen to i32 load + extract) and i128+ (decompose to multiple loads).

  4. shouldExpandAtomicStoreInIR (vtable +0x278): Native for aligned i32/i64. Expanded for sub-word and >64-bit types.

Sub-Word Atomic Expansion (sub_20CB200)

No NVIDIA GPU architecture through SM 120 supports native sub-word (i8/i16) atomics. The pass generates mask-and-shift wrappers around word-sized CAS loops. The mask generation function sub_20CB200 (2896 bytes) produces a 6-field output struct:

Field  Name         Purpose
-----  ----         -------
+0x00  AlignedAddr  Pointer masked to word boundary: ptr & ~(word_size - 1)
+0x08  AlignedType  Always i32
+0x10  PtrLSB       Low address bits: ptr & (word_size - 1)
+0x18  ShiftAmt     Bit position within the word: PtrLSB * 8 (little-endian)
+0x20  Inv_Mask     Inverted mask: ~(((1 << (type_size * 8)) - 1) << ShiftAmt)
+0x28  Mask         Mask: (1 << (type_size * 8)) - 1

The CAS loop (sub_20CBD50, 1646 bytes) then:

  1. Shifts the new value into position: ValOperand_Shifted = new_val << ShiftAmt.
  2. Loops: loads the word, applies the RMW operation on the masked sub-word, attempts CAS on the full word.
  3. On success: extracts the sub-word result via shift + mask.
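The field computations are simple enough to reproduce exactly (Python sketch; field names follow the table above, the function name is ours):

```python
def partword_fields(ptr, type_size, word_size=4):
    """Compute the 6-field struct produced by the mask generator for a
    sub-word atomic at address ptr (little-endian, i32 container word)."""
    aligned_addr = ptr & ~(word_size - 1)
    ptr_lsb = ptr & (word_size - 1)
    shift = ptr_lsb * 8
    mask = (1 << (type_size * 8)) - 1
    return {
        "AlignedAddr": aligned_addr,
        "PtrLSB": ptr_lsb,
        "ShiftAmt": shift,
        "Mask": mask,
        "Inv_Mask": ~(mask << shift) & 0xFFFFFFFF,
    }

f = partword_fields(0x1003, type_size=1)
print(hex(f["Inv_Mask"]))  # 0xffffff: the top byte of the word is carved out
```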

CAS Loop Generation (sub_20C96A0)

For operations that cannot be handled natively, the pass builds a compare-and-swap loop with three basic blocks:

entry -> "atomicrmw.start" -> (CAS failure) -> "atomicrmw.start" (retry)
                           -> (CAS success) -> "atomicrmw.end"

Steps:

  1. Load current value from pointer.
  2. Compute new value using the RMW operation (dispatched through an 11-case switch at sub_20CC690: Xchg, Add, Sub, And, Or, Xor [implied], Nand, Max, Min, UMax, UMin, FMin, FMax).
  3. Emit cmpxchg with packed success+failure orderings.
  4. Branch back to start on failure, fall through to end on success.
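The generated control flow is equivalent to this loop (Python simulation of the three-block structure; the real pass emits LLVM IR, and `cas` here stands in for the cmpxchg instruction):

```python
def cas(mem, addr, expected, new):
    """Word-sized compare-and-swap; returns (old_value, success)."""
    old = mem[addr]
    if old == expected:
        mem[addr] = new
        return old, True
    return old, False

def atomicrmw_via_cas(mem, addr, op):
    """entry: initial load; 'atomicrmw.start': compute + cmpxchg,
    retry on failure; 'atomicrmw.end': return the old value."""
    loaded = mem[addr]
    while True:
        new_val = op(loaded)
        loaded, ok = cas(mem, addr, loaded, new_val)
        if ok:
            return loaded

mem = {0: 6}
old = atomicrmw_via_cas(mem, 0, lambda v: v + 10)
print(old, mem[0])  # 6 16
```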

Ordering-to-Fence Table (address 0x428C1E0)

The pass uses a 7-entry fence decision table indexed by LLVM AtomicOrdering enum:

Ordering                Index  Release Fence Before?  Acquire Fence After?
--------                -----  ---------------------  --------------------
NotAtomic               0      No                     No
Unordered               1      No                     No
Monotonic               2      No                     No
Acquire                 3      No                     Yes
Release                 4      Yes                    No
AcquireRelease          5      Yes                    Yes
SequentiallyConsistent  6      Yes                    Yes (+ barrier)

Fence emission calls sub_15F9C80 which creates an LLVM fence instruction with the specified ordering and sync scope.
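The table reduces to a small lookup (Python sketch; names follow the LLVM AtomicOrdering enum, the helper is ours):

```python
# Index -> (release fence before, acquire fence after), mirroring the
# 7-entry decision table; seq_cst additionally gets a barrier.
FENCE_TABLE = [
    ("NotAtomic",              False, False),
    ("Unordered",              False, False),
    ("Monotonic",              False, False),
    ("Acquire",                False, True),
    ("Release",                True,  False),
    ("AcquireRelease",         True,  True),
    ("SequentiallyConsistent", True,  True),
]

def fences_for(ordering_index):
    _, before, after = FENCE_TABLE[ordering_index]
    return before, after

print(fences_for(4))  # (True, False): release needs a fence before only
```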

Memory Barrier and Fence Emission

The PTX emission layer has two dedicated handlers for barriers and fences, separate from the atomic operation emitters.

Memory Barrier (sub_21E94F0)

Emits membar instructions based on a 4-bit operand encoding:

Value  Instruction             Scope
-----  -----------             -----
0      membar.gpu              Device
1      membar.cta              Thread block
2      membar.sys              System
3      fatal("Bad membar op")  Invalid
4      fence.sc.cluster        Cluster (SM 90+)

NVVM-Side Membar (sub_94F9E0)

At the NVVM lowering level, sub_94F9E0 handles membar emission with a different scope encoding:

Scope Value  Scope String  PTX Instruction
-----------  ------------  ---------------
0, 1         cta           membar.cta;
2, 3         gl            membar.gl;
4            sys           membar.sys;
Other        fatal("unexpected atomic operation scope.")

NVVM-Side Fence (sub_94FDF0)

Constructs fence.{ordering}.{scope}; from a state array. The ordering mapping is:

Value  Ordering String
-----  ---------------
3      sc
4      acq_rel
5      sc
Other  fatal("unexpected memory order.")

Both membar and fence are emitted as inline PTX assembly (not LLVM IR fence instructions) because PTX-level memory ordering semantics have no direct LLVM IR equivalent at the precision NVIDIA requires.

Architecture Gates

SM Threshold                    Effect
------------                    ------
SM <= 59                        Diagnostic 0xEB6 warning for certain atomic patterns
SM 60--69                       Diagnostic 0xEB2 (3762) for specific atomic patterns
SM <= 69                        Volatile mode; 128-bit atomics not supported (diagnostic 0xEB4)
SM 70+                          Explicit ordering/scope in PTX output
SM <= 89                        Scope value 2 silently falls back from cluster to gpu
SM <= 89                        Half-precision (2-byte FP) atomics not supported
SM 90+ (Hopper)                 Cluster scope (.cluster) becomes available
SM 90+                          f16x2 packed atomic add (IDs 459--461)
SM 90+                          fence.sc.cluster becomes available
SM 100+ (Blackwell datacenter)  f16x4 packed atomic add (IDs 466--468)

EDG Frontend Name Construction

The EDG atomic builtin generator sub_6BBC40 (address 0x6BBC40, 1251 lines) constructs internal function names from C++ cuda::atomic_ref calls. The algorithm uses a dispatch key v165 = *(uint16_t*)(type_node + 176), the EDG "builtin kind" tag, to select the operation:

v165 (hex)      v165 (dec)    Operation
----------      ----------    ---------
0x6241, 0x6242  25153, 25154  compare_exchange
0x6248, 0x6249  25160, 25161  exchange
0x624F, 0x6250  25167, 25168  fetch_add
0x6257, 0x6258  25175, 25176  fetch_sub
0x625F, 0x6260  25183, 25184  fetch_and
0x6263, 0x6264  25187, 25188  fetch_xor
0x6267, 0x6268  25191, 25192  fetch_or
0x626B, 0x626C  25195, 25196  fetch_max
0x6273, 0x6274  25203, 25204  fetch_min
0x627B, 0x627C  25211, 25212  load
0x6280, 0x6281  25216, 25217  store
0x6286          25222         thread_fence

Within each pair, the odd ID is the "generic" overload that enters the renaming path; the even ID has its base name string set explicitly via strcpy.

Name Construction Algorithm (lines 877--996 of sub_6BBC40)

Step 1 -- Base name. Copy the EDG source name, then overwrite with the canonical base for the seven fetch-op builtins:

v165     Base name
------   ---------------------------
0x6250   "__nv_atomic_fetch_add"
0x6258   "__nv_atomic_fetch_sub"
0x6260   "__nv_atomic_fetch_and"
0x6264   "__nv_atomic_fetch_xor"
0x6268   "__nv_atomic_fetch_or"
0x626C   "__nv_atomic_fetch_max"
0x6274   "__nv_atomic_fetch_min"

Step 2 -- Width suffix. Append "_%u" formatted with the type size in bytes from *(uint32_t*)(type_node + 128). For fetch-op builtins, the size is validated as (type_size - 4) <= 4, accepting only 4 and 8 bytes.

Step 3 -- Type suffix (only for add/sub/max/min; lines 960--996). Reads type_kind = *(uint8_t*)(type_node + 140):

type_kind  Meaning            Suffix  Condition
---------  -------            ------  ---------
2          integer            _s      byte_4B6DF90[signedness_byte] != 0 (signed)
2          integer            _u      byte_4B6DF90[signedness_byte] == 0 (unsigned)
3          float              _f      Always
6          unsigned explicit  _u      Always
byte_4B6DF90 is a 256-entry lookup table that maps the EDG "integer kind" sub-tag (at type_node + 160) to a boolean: 1 = signed, 0 = unsigned.

Bitwise operations (and/or/xor) omit the type suffix entirely.

Naming Pattern Summary

__nv_atomic_fetch_{op}_{width}[_{type}]

{op}    = add | sub | and | xor | or | max | min
{width} = 4 | 8  (bytes)
{type}  = _s (signed), _u (unsigned), _f (float), or omitted (bitwise)

For load/store/exchange/compare_exchange, only the width suffix is appended; no type suffix.
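The naming algorithm condenses to a short function (Python sketch; the real generator works on EDG type nodes and emits the diagnostics listed below, all omitted here):

```python
def edg_atomic_name(op, type_size, type_kind, signed=False):
    """Rebuild the __nv_atomic_fetch_* mangled name per the three-step
    algorithm: base name, width suffix, then optional type suffix."""
    name = f"__nv_atomic_fetch_{op}_{type_size}"
    if op in ("add", "sub", "max", "min"):        # bitwise ops get no type suffix
        if type_kind == "float":
            name += "_f"
        elif type_kind == "unsigned":             # explicitly unsigned (type_kind 6)
            name += "_u"
        elif type_kind == "integer":
            name += "_s" if signed else "_u"      # byte_4B6DF90 signedness lookup
    return name

print(edg_atomic_name("add", 4, "integer", signed=True))   # __nv_atomic_fetch_add_4_s
print(edg_atomic_name("xor", 8, "integer"))                # __nv_atomic_fetch_xor_8
```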

Validation Diagnostics

Diagnostic  Hex    Condition
----------  ---    ---------
852         0x354  Unsupported atomic operation for target
1645        0x66D  Wrong return type for builtin
1646        0x66E  Unsupported type size (not in {1,2,4,8,16})
3745        0xEA1  Atomic not supported for given type
3746        0xEA2  First param scope exceeds range (>5)
3747        0xEA3  Return param scope exceeds range (>4)
3748        0xEA4  fetch_op type size not 4 or 8 bytes
3749        0xEA5  Store with type_size <= 1 (too small)
3750        0xEA6  Load with type_size > 3 (too large)
3756        0xEAC  CAS parameter type mismatch
3757        0xEAD  Exchange parameter type mismatch
3759        0xEAF  Float return not supported below SM 90
3762        0xEB2  SM 60--69 atomic variant diagnostic
3763        0xEB3  Return type on store (SM <= 89)
3764        0xEB4  128-bit store/load not supported on this SM
3765        0xEB5  16-bit store not supported on SM <= 69
3766        0xEB6  Generic warning for SM <= 59
3767        0xEB7  Type size not in {1,2,4,8,16} bitmask
3769        0xEB9  Null argument list error

EDG Type Node Field Map

Offset  Size    Field
------  ----    -----
+128    8       type_size (byte count: 1, 2, 4, 8, 16)
+140    1       type_kind (0=void, 2=integer, 3=float, 6=unsigned, 8=pointer, 12=typedef)
+160    varies  For type_kind 12 (typedef): pointer to underlying type. For type_kind 2 (integer): uint8_t signedness sub-tag indexed into byte_4B6DF90.
+168    8       Pointer chain (for struct/compound types)
+176    2       builtin_kind (the v165 dispatch tag, uint16_t)

NVPTX MachineInstr Atomic Opcodes

At the SelectionDAG / MachineInstr level, atomic operations map to NVPTX-specific opcodes distinct from the inline ASM emission:

MachineInstr Opcode  PTX Operation
-------------------  -------------
149                  ATOMIC_LOAD
294--297             atom.add (f32 / f64 / i32 / i64)
302--305             atom.min (s32 / s64 / u32 / u64)
314--317             atom.max (s32 / s64 / u32 / u64)
462                  atom.cas (generic)

These opcodes are emitted by the SelectionDAG lowering for native atomic operations that survive the AtomicExpandPass without expansion.

Function Map

Function     Address    Size         Role
--------     -------    ----         ----
sub_6BBC40   0x6BBC40   ~1251 lines  EDG atomic builtin name generator
sub_12AA280  0x12AA280               Legacy CAS IR node builder
sub_12AA9B0  0x12AA9B0               Legacy atomic exchange handler
sub_12ADE80  0x12ADE80               Scoped atomic load/store/fetch handler
sub_12AE010  0x12AE010               Fence acquire/release emitter (EDG only; BUG on NVVM)
sub_12AE0E0  0x12AE0E0               Volatile fence emitter (pre-SM 70)
sub_12AE4B0  0x12AE4B0               Explicit fence emitter (SM 70+)
sub_12AE930  0x12AE930  41KB         PTX inline ASM atomic codegen (EDG side)
sub_12B3FD0  0x12B3FD0  103KB        Main builtin lowering mega-switch
sub_20C7CE0  0x20C7CE0  1399         AtomicExpandPass: recursive type walker
sub_20C84C0  0x20C84C0  1656         AtomicExpandPass: address space checker
sub_20C9140  0x20C9140  1204         AtomicExpandPass: runOnFunction
sub_20C96A0  0x20C96A0  1814         AtomicExpandPass: CAS loop generation
sub_20CA900  0x20CA900  218          AtomicExpandPass: registration
sub_20CB200  0x20CB200  2896         AtomicExpandPass: sub-word mask generation
sub_20CBD50  0x20CBD50  1646         AtomicExpandPass: partword RMW expansion
sub_20CC690  0x20CC690  43           AtomicExpandPass: 11-case operation dispatch
sub_20CD3E0  0x20CD3E0  6030         AtomicExpandPass: partword CmpXchg expansion
sub_20CEB70  0x20CEB70  10640        AtomicExpandPass: full CmpXchg LL/SC expansion
sub_21E5E70  0x21E5E70               PTX emission: base atomic opcode emitter
sub_21E6420  0x21E6420               PTX emission: L2-hinted atomic opcode emitter
sub_21E8EA0  0x21E8EA0               PTX emission: cluster barrier emitter
sub_21E94F0  0x21E94F0               PTX emission: membar/fence emitter
sub_9502D0   0x9502D0   55KB         PTX inline ASM atomic codegen (NVVM side)
sub_94F9E0   0x94F9E0                NVVM membar emitter
sub_94FDF0   0x94FDF0                NVVM fence emitter

Cross-References

Math Function Builtins

Math builtins cover floating-point rounding, transcendental approximations, reciprocal/square-root operations, type conversions, and precise arithmetic with explicit rounding modes. They span IDs 21--184 and 276--403, totaling over 230 entries. Unlike most other builtin categories, many math builtins fall through the dispatch switch entirely and resolve via the generic LLVM intrinsic path.

Bit Manipulation (IDs 21--26)

These integer utility operations map directly to hardware instructions available on all SM targets.

| ID | Builtin | Operation |
|---|---|---|
| 21--22 | __nvvm_clz_{i,ll} | Count leading zeros (32/64-bit) |
| 23--24 | __nvvm_popc_{i,ll} | Population count (32/64-bit) |
| 25--26 | __nvvm_brev_{i,ll} | Bit reverse (32/64-bit) |

Rounding and Absolute Value (IDs 27--46)

Float rounding and absolute value operations exist in three type variants: flush-to-zero single (ftz_f), IEEE single (f), and double (d).

| ID Range | Operation | Variants |
|---|---|---|
| 27--29 | __nvvm_floor_{ftz_f,f,d} | Floor |
| 30--32 | __nvvm_ceil_{ftz_f,f,d} | Ceiling |
| 33--35 | __nvvm_abs_{ftz_f,f,d} | Absolute value (integer-style) |
| 36--38 | __nvvm_fabs_{ftz_f,f,d} | Absolute value (float) |
| 39--41 | __nvvm_round_{ftz_f,f,d} | Round to nearest |
| 42--44 | __nvvm_trunc_{ftz_f,f,d} | Truncate toward zero |
| 45--46 | __nvvm_saturate_{ftz_f,f} | Clamp to [0.0, 1.0] |

Transcendental Approximations (IDs 47--56)

Hardware-accelerated approximations for transcendental functions. These use the GPU's special function units (SFU) and are not IEEE-compliant.

| ID Range | Operation | Variants |
|---|---|---|
| 47--49 | __nvvm_ex2_approx_{ftz_f,f,d} | Base-2 exponential |
| 50--52 | __nvvm_lg2_approx_{ftz_f,f,d} | Base-2 logarithm |
| 53--55 | __nvvm_sin_approx_{ftz_f,f,d} | Sine |
| 56 | __nvvm_cos_approx_ftz_f | Cosine (FTZ only registered) |

Reciprocal (IDs 57--69)

Full-precision reciprocal with all four IEEE rounding modes and three type variants.

| ID Range | Operation | Rounding Modes |
|---|---|---|
| 57--69 | __nvvm_rcp_{rn,rz,rm,rp}_{ftz_f,f,d} | RN (nearest), RZ (zero), RM (minus), RP (plus) |

The 13 entries cover 4 rounding modes x 3 type variants (12 combinations), with one additional FTZ single-precision entry.

Square Root and Reciprocal Square Root (IDs 70--87)

| ID Range | Operation | Description |
|---|---|---|
| 70--84 | __nvvm_sqrt_{f,rn,rz,rm,rp}_{ftz_f,f,d} | Square root (5 modes x 3 types) |
| 85--87 | __nvvm_rsqrt_approx_{ftz_f,f,d} | Reciprocal square root (SFU approximation) |

The sqrt_f variant (without rounding qualifier) uses the default hardware rounding. The rsqrt_approx variants use the SFU fast path.

Type Conversions (IDs 88--184)

The largest math subcategory with 97 entries, covering every combination of source type, destination type, rounding mode, and FTZ flag.

Double-to-Float (IDs 88--95)

__nvvm_d2f_{rn,rz,rm,rp}_{ftz,} -- 4 rounding modes x 2 FTZ variants.

Integer/Float Cross-Conversions (IDs 96--177)

82 entries covering all permutations of:

  • Source types: d (double), f (float), i (int32), ui (uint32), ll (int64), ull (uint64)
  • Destination types: same set
  • Rounding modes: rn, rz, rm, rp

Pattern: __nvvm_{src}2{dst}_{rounding} (e.g., __nvvm_d2i_rn, __nvvm_f2ull_rz).
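The pattern expands mechanically. The sketch below generates candidate names from the pattern only; it does not model which of the possible combinations the builtin table actually registers (the section above counts 82):

```python
# Expand the __nvvm_{src}2{dst}_{rounding} naming pattern. This
# enumerates candidates only; which combinations are registered is a
# property of the builtin table, not of the pattern.
SRC_DST  = ["d", "f", "i", "ui", "ll", "ull"]
ROUNDING = ["rn", "rz", "rm", "rp"]

def conversion_name(src: str, dst: str, rnd: str) -> str:
    return f"__nvvm_{src}2{dst}_{rnd}"

candidates = {conversion_name(s, d, r)
              for s in SRC_DST for d in SRC_DST if s != d
              for r in ROUNDING}
```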

Half Precision (IDs 178--180)

| ID | Builtin | Description |
|---|---|---|
| 178 | __nvvm_f2h_rn_ftz | Float to half (FTZ, round nearest) |
| 179 | __nvvm_f2h_rn | Float to half (round nearest) |
| 180 | __nvvm_h2f | Half to float |

Bitcast (IDs 181--184)

Reinterpret-cast between integer and float types without value conversion. Lowered via sub_12A7DA0 which emits opcode 0x31 (49, bitcast).

| ID | Builtin | Direction |
|---|---|---|
| 181 | __nvvm_bitcast_f2i | float -> int32 |
| 182 | __nvvm_bitcast_i2f | int32 -> float |
| 183 | __nvvm_bitcast_ll2d | int64 -> double |
| 184 | __nvvm_bitcast_d2ll | double -> int64 |

Integer Min/Max and Multiply-High (IDs 276--293)

| ID Range | Operation | Types |
|---|---|---|
| 276--279 | __nvvm_{min,max}_{i,ui} | 32-bit signed/unsigned |
| 280--283 | __nvvm_{min,max}_{ll,ull} | 64-bit signed/unsigned |
| 284--289 | __nvvm_f{min,max}_{f,ftz_f,d} | Float min/max (with FTZ) |
| 290--293 | __nvvm_mulhi_{i,ui,ll,ull} | Upper half of multiplication |

Precise Float Arithmetic (IDs 294--349)

These builtins provide IEEE-compliant arithmetic with explicit rounding mode control. Each operation exists in all four rounding modes and up to five type variants (ftz_f, f, ftz_f2, f2, d).

| ID Range | Operation | Entries |
|---|---|---|
| 294--313 | __nvvm_mul_{rn,rz,rm,rp}_{ftz_f,f,ftz_f2,f2,d} | 20 |
| 314--333 | __nvvm_add_{rn,rz,rm,rp}_{ftz_f,f,ftz_f2,f2,d} | 20 |
| 334--349 | __nvvm_div_{rn,rz,rm,rp}_{ftz_f,f,d} | 16 |

FMA (IDs 383--402)

Fused multiply-add with all rounding/type combinations:

| ID Range | Operation | Entries |
|---|---|---|
| 383--402 | __nvvm_fma_{rn,rz,rm,rp}_{ftz_f,f,d,ftz_f2,f2} | 20 |

Miscellaneous (IDs 350, 380--382, 403)

| ID | Builtin | Description |
|---|---|---|
| 350 | __nvvm_lohi_i2d | Compose double from two 32-bit halves |
| 380 | __nvvm_prmt | Byte permute (PRMT instruction) |
| 381--382 | __nvvm_sad_{i,ui} | Sum of absolute differences |
| 403 | __nvvm_fns | Find Nth set bit |

Table-Based Lowering for Precise Arithmetic

The precise arithmetic builtins (mul, add, div, fma with rounding modes) are lowered through sub_12B3540 (address 0x12B3540, 10KB), which uses two lazily-initialized red-black trees (std::map<int, triple>) to map builtin IDs to IR opcode triples.

Tree 1 serves three-operand builtins (FMA): maps ID ranges to opcode 0xF59 with variant codes encoding the rounding mode and type.

Tree 2 serves two-operand builtins (mul, add, div): maps to opcodes 0xE3A, 0xE3B, 0x105E, 0x1061 depending on the operation.

The lookup procedure:

  1. Extract up to 4 operand arguments from the call expression
  2. Find the builtin ID in the appropriate tree to obtain (opcode, variant)
  3. Look up the IR function via sub_126A190
  4. Emit the call instruction via sub_1285290
  5. Generate the inline asm fragment via sub_12A8F50

LLVM Intrinsic Fallback Path

Many standard math builtins (floor, ceil, sin, cos, sqrt, fma, exp, log) are not handled by the switch cases at all. When the builtin table lookup returns ID 0 (name not found), the dispatcher falls through to the generic LLVM intrinsic path at LABEL_4 in sub_955A70. This path:

  1. Checks if the name starts with "llvm." (prefix constant 0x6D766C6C)
  2. Looks up the intrinsic via sub_B6ACB0 (LLVM intrinsic name-to-ID)
  3. Lowers all arguments with type-cast insertion where needed
  4. Emits a standard LLVM call via sub_921880

This means functions like llvm.floor.f32, llvm.cos.f64, and llvm.fma.f32 bypass the builtin ID system entirely and map directly to LLVM's intrinsic infrastructure.
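The prefix constant is simply the four bytes of "llvm" read as a little-endian DWORD, which lets the dispatcher test the start of the name with a single integer compare. A sketch (the trailing-dot comparison is this page's assumption about the surrounding code; only the DWORD compare is documented above):

```python
import struct

# 0x6D766C6C is "llvm" as a little-endian 32-bit load.
assert struct.pack("<I", 0x6D766C6C) == b"llvm"

def has_llvm_prefix(name: str) -> bool:
    data = name.encode()
    # The dot check is an assumption; the binary documents only the
    # DWORD compare against 0x6D766C6C.
    return data[:4] == b"llvm" and data[4:5] == b"."
```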

Float Compatibility Wrappers (IDs 643--646)

Four C runtime float functions are registered as builtins for compatibility:

| ID | Builtin | Maps To |
|---|---|---|
| 643 | __ceilf | __nvvm_ceil_f equivalent |
| 644 | __floorf | __nvvm_floor_f equivalent |
| 645 | __roundf | __nvvm_round_f equivalent |
| 646 | __truncf | __nvvm_trunc_f equivalent |

Tensor Core / MMA Builtins

Tensor core builtins implement the Warp Matrix Multiply-Accumulate (WMMA) and Warp Group MMA (WGMMA) interfaces, spanning IDs 678--770 across four SM generations. Each generation added new data types and matrix shapes, resulting in 91 registered builtins that cover half-precision, integer, binary, double-precision, TF32, BF16, and FP8 matrix operations. SM 100 (Blackwell) adds a fifth generation -- tcgen05 -- documented in Tensor / MMA Codegen.

Key Facts

| Property | Value |
|---|---|
| Builtin IDs | 678--770 (93 entries) |
| WGMMA handler (IDs 753--768) | ~800 lines in sub_12B3FD0 / sub_955A70 |
| LLVM intrinsic range (WGMMA) | 5304--5447 (144-entry 5-D grid) plus 10654--10779 (N-dimension table) |
| NVVM lowering | sub_955A70 (105KB), sub_12B3FD0 (103KB) |
| Backend emission | sub_21E74C0 (PTX builder), sub_36E9630 (tcgen05 ISD selection) |
| SM gates | SM 70+ HMMA, SM 72+ IMMA, SM 75+ BMMA, SM 80+ DMMA/TF32/BF16, SM 90+ WGMMA |

WMMA Architecture Evolution

| SM Generation | Feature | ID Range | Count |
|---|---|---|---|
| SM 70 (Volta) | HMMA: FP16 tensor core | 678--707 | 30 |
| SM 75 (Turing) | IMMA: INT8/INT4, BMMA: binary | 708--745 | 38 |
| SM 80 (Ampere) | DMMA: FP64, TF32, BF16 | 746--764 | 19 |
| SM 90 (Hopper) | WGMMA: warp-group MMA, FP8 | 765--768 | 4 |
| SM 100 (Blackwell) | tcgen05: MX formats, block-scale, sparsity | (intrinsic path) | -- |

HMMA -- Half-Precision (IDs 678--707, SM 70+)

The original tensor core builtins provide 16-bit floating-point matrix multiply for three tile shapes. Each shape has 10 operations: load A, load B, load C (f16 and f32 accumulators), store C (f16 and f32), and four MMA variants for input/output precision combinations.

| ID Range | Shape | Builtin Prefix |
|---|---|---|
| 678--687 | 16x16x16 | __hmma_m16n16k16_* |
| 688--697 | 32x8x16 | __hmma_m32n8k16_* |
| 698--707 | 8x32x16 | __hmma_m8n32k16_* |

Per-shape operations (10 each):

| Suffix | Operation | Description |
|---|---|---|
| ld_a | Load A fragment | Load matrix A tile from memory |
| ld_b | Load B fragment | Load matrix B tile from memory |
| ld_c_f16 | Load C (f16) | Load accumulator as half-precision |
| ld_c_f32 | Load C (f32) | Load accumulator as single-precision |
| st_c_f16 | Store C (f16) | Store result as half-precision |
| st_c_f32 | Store C (f32) | Store result as single-precision |
| mma_f16f16 | MMA f16->f16 | FP16 input, FP16 accumulator |
| mma_f32f16 | MMA f16->f32 | FP16 input, FP32 accumulator |
| mma_f16f32 | MMA f32->f16 | FP32 accumulator, FP16 output |
| mma_f32f32 | MMA f32->f32 | FP32 input and accumulator |

IMMA -- Integer MMA (IDs 708--739, SM 75+)

Integer tensor core operations for INT8 and INT4 data types.

INT8 (IDs 708--731)

Three shapes (16x16x16, 32x8x16, 8x32x16), each with 8 operations:

| Suffix | Description |
|---|---|
| ld_a_s8 / ld_a_u8 | Load A fragment (signed/unsigned INT8) |
| ld_b_s8 / ld_b_u8 | Load B fragment (signed/unsigned INT8) |
| ld_c | Load accumulator (INT32) |
| st_c_i32 | Store result (INT32) |
| mma_s8 / mma_u8 | INT8 MMA (signed/unsigned) |

INT4 (IDs 732--739)

Single shape (8x8x32) with the same operation set but _s4 / _u4 type suffixes.

BMMA -- Binary MMA (IDs 740--745, SM 75+)

Binary (1-bit) matrix multiply with XOR-POPC and AND-POPC accumulation modes. Single shape: 8x8x128.

| ID | Builtin | Description |
|---|---|---|
| 740 | __bmma_m8n8k128_ld_a_b1 | Load A fragment (binary) |
| 741 | __bmma_m8n8k128_ld_b_b1 | Load B fragment (binary) |
| 742 | __bmma_m8n8k128_ld_c | Load accumulator |
| 743 | __bmma_m8n8k128_st_c_i32 | Store result |
| 744 | __bmma_m8n8k128_mma_xor_popc_b1 | Binary MMA (XOR + popcount) |
| 745 | __bmma_m8n8k128_mma_and_popc_b1 | Binary MMA (AND + popcount) |

Extended Tensor Core (IDs 746--764, SM 80+)

SM 80 (Ampere) added double-precision, TF32, and BF16 tensor operations.

DMMA -- Double Precision (IDs 746, 751--754)

| ID | Builtin | Description |
|---|---|---|
| 746 | __dmma_m8n8k4_mma_f64 | FP64 MMA |
| 751 | __dmma_m8n8k4_st_c_f64 | Store FP64 result |
| 752--754 | __dmma_m8n8k4_{ld_a,ld_b,ld_c} | Load fragments |

TF32 (IDs 747, 755--757)

| ID | Builtin | Description |
|---|---|---|
| 747 | __mma_tf32_m16n16k8_mma_f32 | TF32 MMA producing FP32 |
| 755--757 | __mma_tf32_m16n16k8_{ld_a,ld_b,ld_c} | Load fragments |

BF16 (IDs 748--750, 758--764)

| ID | Builtin | Description |
|---|---|---|
| 748 | __mma_bf16_m16n16k16_mma_f32 | BF16 16x16x16 MMA |
| 749 | __mma_bf16_m32n8k16_mma_f32 | BF16 32x8x16 MMA |
| 750 | __mma_bf16_m8n32k16_mma_f32 | BF16 8x32x16 MMA |
| 758--764 | __mma_bf16_m*_{ld_a,ld_b} | Load fragments for each shape |

WMMA Lowering Details

Three-Table Lookup

WMMA builtins use a three-table structure for mapping builtin IDs to LLVM intrinsic IDs:

| Table Address (NVVM) | Entries | ID Range | Description |
|---|---|---|---|
| dword_3F14840 | 0--29 | 678--707 | HMMA (first-generation, FP16) |
| dword_3F147E0 | 0--23 | 708--731 | IMMA (INT8) |
| dword_3F147A0 | 0--12 | 732--744 | BMMA (binary) / INT4 |

The EDG-side parallel tables live at dword_42810C0 (678--709), dword_4281060 (708--731), dword_4281020 (732--744), addressed from sub_12AC1A0.

Fragment Size Determination

The number of register-level fragments varies by operation and data type:

| Condition | Fragment Count | Example |
|---|---|---|
| First-gen WMMA, BF16, store | 4 | BF16 store_c |
| First-gen WMMA, default | 8 | FP16 mma |
| IMMA, intrinsic 8914/8280 | 2 | INT8 ld_a compact |
| BMMA | 2 | Binary operations |
| IMMA intrinsic 0x22BB/0x22BC/0x22C5/0x22C6 | 4 | INT4 load A/B |
| IMMA intrinsic 0x22BD/0x22BE/0x22C3/0x22C4/0x22CB--0x22CE | 1 | Sub-byte single-element |
| IMMA intrinsic 0x22B7/0x22BF/0x22C7 | 8 | INT8 full-width |

MMA Codegen Flow

The MMA handler (sub_94E0D0 / sub_12AC5F0) processes 5 input operands:

  1. dest_ptr -- Pointer to output fragment storage
  2. A_fragment -- Matrix A input (loaded v100 times)
  3. B_fragment -- Matrix B input (loaded v95 times)
  4. C_fragment -- Accumulator input (loaded v101 times)
  5. rowcol -- Layout operand (validated 0--3 for MMA)

An optional satf flag (saturation, validated 0--1) is consumed for most intrinsics except ID 8279.

The handler emits the MMA call via sub_921880 and scatters results back to the destination fragment through v103 iterations of element-wise stores.

Fragment iteration counts per family (NVVM path, sub_94E0D0):

| Family | v95 (load B) | v100 (load A) | v101 (load C) | v103 (store D) |
|---|---|---|---|---|
| BMMA (b1) | 1 | 1 | 2 | 2 |
| IMMA (0x22C0-0x22C1) | 1 | 4 | 8 | 8 |
| IMMA (0x22B8-0x22B9 = 8888-8889) | 2 | 2 | 8 | 8 |
| IMMA (0x22C8-0x22C9 = 8904-8905) | 4 | 1 | 8 | 8 |
| HMMA (default, first-gen) | 8 | 8 | varies | varies (4 or 8) |

The output fragment count is determined by bit-test: (0x300C003 >> (intrinsic_id + 127)) & 1 selects 4 vs 8 fragments.

Architecture Gating -- Exact Thresholds

The architecture version is stored at *(target_info + 252) as a DWORD.

| Function | Gate Expression | Minimum SM | Notes |
|---|---|---|---|
| sub_21DFBF0 hmmastc | v8 > 0x45 | SM 70 | FP16 store |
| sub_21E0360 hmmaldab | v8 > 0x45 | SM 70 | FP16 load A/B |
| sub_21E0870 hmmamma | v8 > 0x45 | SM 70 | FP16 MMA |
| sub_21E1280 immaldab | v8 > 0x47 | SM 72 | INT load; v8==72 && variant>1 rejected |
| sub_21E1D20 immamma | v8 > 0x47 | SM 72 | INT MMA; variant>1 && v8==72 rejected |
| sub_21E2280 bmmamma | v8 > 0x48 | SM 73/75 | Binary MMA |
| sub_36E9630 tcgen05 | arch >= 0x3E8 | SM 100 | Blackwell only |

SM 72 (Xavier) has a unique partial IMMA implementation: only variant 0/1 shapes are supported, with explicit gating that blocks higher variants. This matches hardware reality where Xavier had limited INT8 tensor cores.

WGMMA -- Warp Group MMA (SM 90+ Hopper)

WGMMA operates on an entire warp group (4 warps, 128 threads) rather than a single warp. The system is split across four builtin IDs, 20 auxiliary IDs for fence/store/load operations, and two massive handler blocks totaling ~800 lines of lowering logic.

Builtin Registration

Four builtins are registered in sub_90AEE0 (NVVM) and sub_126A910 (EDG):

| ID | Builtin | Data Type | Lowering Case |
|---|---|---|---|
| 765 (0x2FD) | __wgmma_mma_async_f16 | FP16 | Full operand set (6 chained: A, B, C, scale, negate, sparsity) |
| 766 (0x2FE) | __wgmma_mma_async_bf16 | BF16 | 2-operand (no scale/negate) |
| 767 (0x2FF) | __wgmma_mma_async_tf32 | TF32 | Reduced operand set |
| 768 (0x300) | __wgmma_mma_async_f8 | FP8 (SM 90a+) | Minimal (2 scale operands only) |

WGMMA ID Space Overview

The full WGMMA ID range spans 745--770, subdivided into four functional groups:

| ID Range | Function | Handler |
|---|---|---|
| 745--750 (0x2E9--0x2EE) | Fence / commit / wait | sub_12B1C20 / sub_953BA0 |
| 751--752 (0x2EF--0x2F0) | Store | sub_12B27B0 / sub_954350 |
| 753--764 (0x2F1--0x2FC) | MMA async load (12 variants) | inline / sub_9547E0 |
| 765--768 (0x2FD--0x300) | MMA async compute (4 type builtins) | inline ~800 lines / sub_12B2E10 |
| 769--770 (0x301--0x302) | Warp-group barrier | inline IR via sub_127FC40 |

WGMMA Fence / Commit / Wait (IDs 745--750)

sub_953BA0 (NVVM) / sub_12B1C20 (EDG) builds a red-black tree on first call with 7 entries keyed by builtin ID. Each entry packs:

struct wgmma_fence_entry {
    uint32_t id;           // builtin ID (745--751)
    uint32_t trans_a;      // transpose A flag
    uint32_t shape;        // shape code (0 or 1)
    uint32_t trans_b;      // transpose B flag
    uint32_t a_nregs;      // register count for A fragment
    uint32_t b_nregs;      // register count for B fragment
    uint32_t padding;      // unused alignment
    llvm_type *a_type;     // LLVM type for A (i64, i32, i16x2, i32x4)
    llvm_type *b_type;     // LLVM type for B
    llvm_type *c_type;     // LLVM type for C (i32x2, i32x8)
};

Decoded entries from local variables v47--v106:

| ID | trans_a | shape | trans_b | a_nregs | b_nregs | A type | B type | C type |
|---|---|---|---|---|---|---|---|---|
| 745 | 0 | 1 | 5 | 1 | 1 | i64 | i64 | -- |
| 746 | 1 | 0 | 1 | 9 | 9 | i32 | i32 | i32x2 |
| 747 | 0 | 0 | 25 | 8 | 8 | i16x2 | i16x2 | -- |
| 748 | 0 | 0 | 23 | 7 | 7 | i32x4 | i32x4 | i32x8 |
| 749 | 0 | 0 | 24 | 7 | 7 | i32x4 | i32x4 | i32x8 |
| 750 | 0 | 0 | 6 | 7 | 7 | i64 | i32x2 | i32x8 |

Output packed encoding (*a4, 64-bit):

| Bits | Field | Source |
|---|---|---|
| [3:0] | trans_a | *(entry+40) |
| [7:4] | shape | *(entry+48) << 4 |
| [15:8] | a_nregs | *(entry+64) << 8 |
| [27:16] | b_nregs | *(entry+72) << 16 |
| [31:28] | padding | *(entry+80) << 28 |
| [63:32] | trans_b | *(entry+56) << 32 |
| [25] | rowcol bit 1 | (rowcol & 2) == 0 ? 0x2000000 : 0x1000000 |
| [27:26] | rowcol bit 0 | ((rowcol & 1) + 1) << 26 |
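A sketch of the packing, transcribed from the shifts above; `pack_fence_word` is this page's helper, and the field masks are inferred from the adjacent bit positions rather than recovered from the binary:

```python
# Pack the fence entry fields into the 64-bit word per the bit layout
# above. Masks are inferred from the documented bit positions.
def pack_fence_word(trans_a, shape, a_nregs, b_nregs, padding, trans_b, rowcol):
    word  = (trans_a & 0xF)
    word |= (shape   & 0xF)        << 4
    word |= (a_nregs & 0xFF)       << 8
    word |= (b_nregs & 0xFFF)      << 16
    word |= (padding & 0xF)        << 28
    word |= (trans_b & 0xFFFFFFFF) << 32
    # rowcol bits are OR'ed on top of the b_nregs/padding range:
    word |= 0x2000000 if (rowcol & 2) == 0 else 0x1000000
    word |= ((rowcol & 1) + 1) << 26
    return word
```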

The fence dispatch validates the rowcol operand (must be 0--3) and emits a 4-argument call to intrinsic 9062 (llvm.nvvm.wgmma.fence.aligned) with 3 type overloads. Fragment operands are prepared via sub_94B510.

WGMMA Store (IDs 751--752)

sub_954350 / sub_12B27B0 builds a separate parameter lookup tree. Store operations validate rowcol (0 or 1) and emit a 5-argument call using intrinsic 9145 (llvm.nvvm.wgmma.store) with 2 type overloads. Operands: {constant, B_fragment, descriptor, rowcol, zero}.

WGMMA MMA Async Load (IDs 753--764)

sub_9547E0 (NVVM) / sub_12B2E10 (EDG) builds a 12-entry red-black tree at ctx+656:

| ID | Shape | nregs | Variant | Fragment Type |
|---|---|---|---|---|
| 753 | 1 | 9 | 0 | -- |
| 754 | 1 | 9 | 1 | -- |
| 755 | 1 | 9 | 2 | i16x2 |
| 756 | 25 | 8 | 0 | -- |
| 757 | 25 | 8 | 1 | -- |
| 758 | 25 | 10 | 2 | i32x8 |
| 759 | 23 | 7 | 0 | i32x4 |
| 760 | 23 | 7 | 1 | i32x4 |
| 761 | 24 | 7 | 0 | i32x4 |
| 762 | 24 | 7 | 1 | i32x4 |
| 763 | 6 | 7 | 0 | i32x2/i64 |
| 764 | 6 | 7 | 1 | i32x2/i64 |

Output packed encoding (*a4, 64-bit):

| Bits | Field |
|---|---|
| [63:32] | *(entry+40) << 32 |
| [31:4] | *(entry+48) << 4 \| rowcol |
| [1] | *(entry+56) << 1 |

Emits intrinsic 9067 (llvm.nvvm.wgmma.mma.async) with 2 type overloads. Arguments: {constant, B_fragment, rowcol_value, zero_constant}. Results scattered via sub_94B940.

WGMMA MMA Async Compute -- The 800-Line Handler (IDs 765--768)

This is the primary WGMMA lowering path. It lives inline in the mega-switch of sub_955A70 (NVVM, lines ~2850--3138) and sub_12B3FD0 (EDG, lines ~2270--3138). The handler implements two completely different intrinsic selection strategies depending on which builtin ID triggered entry.

Argument Extraction

The handler walks the argument chain 7 levels deep from the call expression:

v263 = M dimension              (first constant argument)
v512 = accumulator fragments    (pointer to fragment array)
v528 = A descriptor             (64-bit matrix descriptor or register fragments)
v524 = B descriptor             (64-bit matrix descriptor)
v519 = scale factors            (A and D scale constants)
v264 = layout params            (rowcol encoding)
v516, v265 = shape params       (additional dimension info)
v540 = element type info        (integer type tag from AST)

Each constant argument is validated through sub_620FD0 (shared by the EDG and NVVM paths), which extracts the integer value and sets an overflow flag. On overflow:

"unexpected constant overflow in __wgmma_mma_async operand"

This check is applied 5 times: once for N dimension, once for each scale factor, and once for each negate/saturation bit.

Per-Builtin Argument Layouts

| ID | Builtin | Operand Chain |
|---|---|---|
| 765 (0x2FD) | _f16 | 6 chained: A, B, C, scaleA, scaleD, negate/saturation |
| 766 (0x2FE) | _bf16 | Separate branch (LABEL_56 path), 2-operand (no scale/negate) |
| 767 (0x2FF) | _tf32 | Rearranged arguments, fewer config bits |
| 768 (0x300) | _f8 | Simplest form, 2 matrix descriptors + config |

Strategy 1: N-Dimension Dispatch (IDs 765--768, inner path)

When the element type is checked and the first argument yields an N dimension, the handler enters a 33-entry switch mapping N values to LLVM intrinsic IDs in the range 10654--10779:

| N | Integer-type Intrinsic | Float-type Intrinsic |
|---|---|---|
| 8 | 10774 | 10775 |
| 16 | 10690 | 10691 |
| 24 | 10734 | 10735 |
| 32 | 10742 | 10743 |
| 40 | 10746 | 10747 |
| 48 | 10750 | 10751 |
| 56 | 10754 | 10755 |
| 64 | 10758 | 10759 |
| 72 | 10762 | 10763 |
| 80 | 10766 | 10767 |
| 88 | 10770 | 10771 |
| 96 | 10778 | 10779 |
| 104 | 10654 | 10655 |
| 112 | 10658 | 10659 |
| 120 | 10662 | 10663 |
| 128 | 10666 | 10667 |
| 136 | 10670 | 10671 |
| 144 | 10674 | 10675 |
| 152 | 10678 | 10679 |
| 160 | 10682 | 10683 |
| 168 | 10686 | 10687 |
| 176 | 10694 | 10695 |
| 184 | 10698 | 10699 |
| 192 | 10702 | 10703 |
| 200 | 10706 | 10707 |
| 208 | 10710 | 10711 |
| 216 | 10714 | 10715 |
| 224 | 10718 | 10719 |
| 232 | 10722 | 10723 |
| 240 | 10726 | 10727 |
| 248 | 10730 | 10731 |
| 256 | 10738 | 10739 |

The even/odd intrinsic ID pairing encodes the distinction between integer-element and float-element variants. Type discrimination uses the AST element type: if the element type is integer with width 10 (i.e., a 10-bit integer signaling bf16/tf32 internal encoding), the even (integer) intrinsic is selected; otherwise the odd (float) intrinsic.

N dimension validation:

if ((N & (N - 1)) != 0)
    error("N only supported for powers of two");

This is applied when the N value does not match any case in the 33-entry switch. The N values 8, 16, 32, 64, 128, 256 are powers of two; the intermediate values (24, 40, 48, ..., 248) are non-power-of-two multiples of 8 that are still valid WGMMA dimensions.
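The dispatch and its fallback error can be sketched as a table lookup; only a few rows of the N table are reproduced here, and `select_intrinsic` is this page's own helper:

```python
# Sketch of the N-dimension dispatch: each supported N (a multiple of 8
# from 8 to 256) maps to an (integer-variant, float-variant) pair, with
# even IDs for the integer element type and odd IDs for float.
N_TO_INTRINSIC = {
    8: (10774, 10775), 16: (10690, 10691), 24: (10734, 10735),
    32: (10742, 10743), 64: (10758, 10759), 256: (10738, 10739),
    # ... remaining multiples of 8 per the table above
}

def select_intrinsic(n: int, integer_elem: bool) -> int:
    if n not in N_TO_INTRINSIC:
        if n & (n - 1):
            raise ValueError("N only supported for powers of two")
        raise ValueError(f"unsupported N dimension: {n}")
    even, odd = N_TO_INTRINSIC[n]
    return even if integer_elem else odd
```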

Strategy 2: 5-Dimensional Intrinsic Grid (IDs 753--764 path, shared)

For the full WGMMA async variants (handled through sub_12B2E10), the handler selects from a 144-entry intrinsic table spanning IDs 5304--5447, organized as a 5-dimensional grid:

| Dimension | Values | Description |
|---|---|---|
| 1. N | {16, 32, 64, 128} | Output column dimension |
| 2. B_shared | {false, true} | Is B operand from shared memory? (sub_12A71A0 != 0) |
| 3. is_s64 | {false, true} | Is accumulator type s64/int? (type tag 2, subtype 10) |
| 4. scale/negate | varies | A scale nonzero? D scale nonzero? |
| 5. variant | {0x2FD, 0x2FE, 0x2FF, 0x300} | Which builtin triggered entry |

Base addresses and stride:

| N | Base ID | Stride per N |
|---|---|---|
| 128 | 5304 | 24 variants |
| 64 | ~5328 | 24 |
| 32 | ~5352 | 24 |
| 16 | ~5376 | 24 |
| overflow | ~5400--5447 | remaining |

Size-based opcode selection (for f16, ID 765):

| Accumulator Size | Opcode (integer) | Opcode (float) |
|---|---|---|
| 16 | 5332 | 5333 |
| 32 | 5380 | 5381 |
| 64 | 5404 | 5405 |
| 128 | 5308 | 5309 |
| other | 5356/5428 | 5357/5429 |

The mapping formula: base + N_offset + shared_offset + type_offset + variant_offset. The accumulator size is extracted by sub_12A71A0(expr) from the expression type chain.

WGMMA Config Bit Packing

Multiple boolean arguments are packed into a single configuration word passed to the final intrinsic call:

| Bit | Field | Source | Value Semantics |
|---|---|---|---|
| 0 | Accumulate / saturation flag | Final constant operand (v433) | 1 = accumulate into D, 0 = overwrite |
| 1 | ScaleD / transpose flag | v445 constant | 1 = transpose B descriptor |
| 2 | Negate-C / layout flag | v81 / v433 constant | 1 = negate accumulator input |
| 3 | Sign bit for B | v427 constant (if present) | Reserved / sign extension |
| 4 | Negate-A / additional mode | v80 / v427 constant (if present) | 1 = negate A operand |

Combined via: v79 = bit0 | (bit1 << 1) | (bit2 << 2) | (bit4 << 4).
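Transcribed directly, the combine step looks like the helper below (note that bit 3 is collected in the table but absent from the final OR, exactly as in the decompiled formula):

```python
# v79-style config word. Bit 3 is extracted by the handler but does not
# appear in the final OR, matching the decompiled formula verbatim.
def pack_config(bit0: int, bit1: int, bit2: int, bit4: int) -> int:
    return (bit0 & 1) | ((bit1 & 1) << 1) | ((bit2 & 1) << 2) | ((bit4 & 1) << 4)
```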

After intrinsic selection, the handler:

  1. Converts the accumulator pointer to a vector pointer (.asvecptr tag)
  2. Extracts bitfield from constant operands for mode flags
  3. Calls sub_1285290 / sub_921880 with name hint "mmafrag"
  4. Scatters results via sub_94B940 / sub_1280F50 (size 4 = float elements)

WGMMA Validation Summary

All constant arguments pass through sub_620FD0, which extracts the integer value and sets an overflow flag.

| Check | Error Message | Condition |
|---|---|---|
| Constant overflow | "unexpected constant overflow in __wgmma_mma_async operand" | Any integer operand overflows extraction (5 occurrences) |
| N power-of-two | "N only supported for powers of two" | (N & (N - 1)) != 0 and N not in the 33-entry switch |
| rowcol range (fence) | "'rowcol' operand can be 0 or 1 only" | rowcol > 1 for load/store |
| rowcol range (MMA) | (implicit -- validated 0--3) | rowcol > 3 for MMA operations |

WGMMA Support Functions

| Function | Address | EDG Parallel | Purpose |
|---|---|---|---|
| sub_953BA0 | 0x953BA0 | sub_12B1C20 | Fence/commit/wait parameter lookup, builds packed 64-bit encoding |
| sub_9547E0 | 0x9547E0 | sub_12B2E10 | MMA async load parameter lookup, 12-entry red-black tree |
| sub_954350 | 0x954350 | sub_12B27B0 | Store variant parameter lookup |
| sub_94B510 | 0x94B510 | -- | Prepare fragment operand for WGMMA call |
| sub_94B940 | 0x94B940 | sub_1280F50 | Scatter MMA results back to fragment outputs |
| sub_94B2B0 | 0x94B2B0 | -- | Extract fragment element at index (WMMA shared) |
| sub_12A71A0 | 0x12A71A0 | -- | Extract size/dimension from expression type (EDG-only) |
| sub_12A6F10 | 0x12A6F10 | -- | Validate constant integer in range (EDG-only) |
| sub_620FD0 | 0x620FD0 | -- | Extract constant integer with overflow detection (shared) |

Packed MMA Descriptor Word

The MMA PTX string builder at sub_21E74C0 (AsmPrinter) / sub_35F_range (NVPTX backend) reads a packed 64-bit descriptor for all MMA instruction emission. The descriptor is stored at:

v22 = *(QWORD *)(*(QWORD *)(a1 + 16) + 16 * a2 + 8)
| Bits | Field | Query Key | Values |
|---|---|---|---|
| [0] | Row/col layout | "rowcol" | 0=row, 1=col |
| [2:1] | Matrix ID | "mid" | 0=a, 1=b, 2=c, 3=d |
| [7:4] | Binary opcode | "opc" | 0=default, 1=.and.popc, 2=.xor.popc |
| [2:0] | Rounding mode | "rnd" | 0=none, 1=.rn, 2=.rm, 3=.rp, 4=.rz |
| [15:8] | A element type | "aty" | Type enum 1--11 |
| [23:16] | B element type | "bty" | Type enum 1--11 |
| [25:24] | A layout | "al" | 0=row, nonzero=col |
| [27:26] | B layout | "bl" | 0=row, nonzero=col |
| [28] | Saturation | "satf" | 1=.satfinite |
| [39:32] | Shape enum | "shape" | 0x01--0x19, 18 entries |

Shape Enum

| Enum | Shape | PTX String | Min SM | Notes |
|---|---|---|---|---|
| 0x01 | m8n8k4 | "m8n8k4" | SM 70 | Original Volta HMMA |
| 0x02 | m8n8k16 | "m8n8k16" | SM 72 | Integer MMA (s8/u8) |
| 0x03 | m8n8k32 | "m8n8k32" | SM 75 | Sub-byte (s4/u4) |
| 0x04 | m8n8k64 | "m8n8k64" | SM 75 | Extended sub-byte |
| 0x05 | m8n8k128 | "m8n8k128" | SM 75 | Binary MMA (b1) |
| 0x06 | m8n32k16 | "m8n32k16" | -- | Appears unused in standard paths |
| 0x10 | m16n8k4 | "m16n8k4" | SM 75 | Turing HMMA, f64 on Ampere |
| 0x11 | m16n8k8 | "m16n8k8" | SM 75 | Turing/Ampere HMMA |
| 0x12 | m16n8k16 | "m16n8k16" | SM 80 | Ampere HMMA (bf16, tf32) |
| 0x13 | m16n8k32 | "m16n8k32" | SM 75 | Ampere integer |
| 0x14 | m16n8k64 | "m16n8k64" | SM 75 | Sub-byte integer |
| 0x15 | m16n8k128 | "m16n8k128" | SM 75 | Extended sub-byte |
| 0x16 | m16n8k256 | "m16n8k256" | SM 75 | Binary/sub-byte (largest) |
| 0x17 | m16n16k16 | "m16n16k16" | SM 90 | Square shape, Hopper+ |
| 0x18 | m32n8k16 | "m32n8k16" | SM 80 | Tall shape |
| 0x19 | m16n16k8 | "m16n16k8" | SM 70 | WMMA f16 path |

Unknown shape codes hit the default branch and abort via BUG(). String emission uses fast-path integer stores: *(QWORD *)ptr = 0x36316B386E36316DLL emits "m16n8k16" as a single 8-byte write.
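The constant decodes back to the text on a little-endian machine (the binary is x86-64), which can be checked directly; `qword_string` is this page's helper:

```python
import struct

# On little-endian x86-64 the QWORD constant is exactly the eight
# characters of the shape string, so one store writes the whole token.
def qword_string(constant: int) -> str:
    return struct.pack("<Q", constant).decode("ascii")
```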

Type Enum

| Enum | Type | Bits | PTX String |
|---|---|---|---|
| 1 | b1 | 1 | "b1" |
| 2 | s4 | 4 | "s4" |
| 3 | u4 | 4 | "u4" |
| 4 | s8 | 8 | "s8" |
| 5 | u8 | 8 | "u8" |
| 6 | f16 | 16 | "f16" |
| 7 | bf16 | 16 | "bf16" |
| 8 | tf32 | 19 | "tf32" |
| 9 | f64 | 64 | "f64" |
| 10 | f32 | 32 | "f32" |
| 11 | s32 | 32 | "s32" |

Any other type code produces fatal error: "Wrong MMA element type".

Shape x Type x Architecture Summary

| Shape | A/B Types | Acc Types | Min SM | Notes |
|---|---|---|---|---|
| m8n8k4 | f16 | f16, f32 | SM 70 | Original Volta |
| m16n8k4 | f64 | f64 | SM 80 | Ampere f64 |
| m16n8k8 | f16 | f16, f32 | SM 75 | Turing+ |
| m16n8k16 | f16, bf16, tf32 | f16, f32 | SM 80 | Ampere+ |
| m16n16k8 | f16 | f16, f32 | SM 70 | WMMA path |
| m16n16k16 | f16, bf16 | f16, f32 | SM 90 | Hopper+ |
| m32n8k16 | f16, bf16 | f16, f32 | SM 80 | Tall shape |
| m8n8k16 | s8, u8 | s32 | SM 72 | Integer MMA |
| m16n8k16 | s8, u8 | s32 | SM 75 | Turing+ |
| m16n8k32 | s8, u8 | s32 | SM 75 | Turing+ |
| m8n8k32 | s4, u4 | s32 | SM 75 | Sub-byte |
| m16n8k64 | s4, u4 | s32 | SM 75 | Sub-byte |
| m8n8k64 | s4, u4 | s32 | SM 75 | Extended sub-byte |
| m16n8k128 | s4, u4 | s32 | SM 75 | Extended sub-byte |
| m8n8k128 | b1 | s32 | SM 75 | Binary (.and.popc, .xor.popc) |
| m16n8k256 | b1 | s32 | SM 75 | Binary extended |
| WGMMA (N=8..256) | f16, bf16, tf32, f8 | f16, f32 | SM 90 | Warp-group, 33 N values |
| tcgen05 (10 variants) | mxf8f6f4, mxf4, mxf4nvf4, f16, bf16, tf32, i8, fp4 | varies | SM 100 | See mma-codegen |

tcgen05 Blackwell Overview (SM 100+)

Full tcgen05 documentation lives in Tensor / MMA Codegen. Key points summarized here for cross-reference:

Data type kinds (bits [8:6] of the tcgen05 operand, emitted by sub_35F3330):

| Value | Kind | Notes |
|---|---|---|
| 0 | mxf4nvf4 | MX FP4 with NV FP4 |
| 1 | f8f6f4 | FP8/FP6/FP4 standard |
| 2 | mxf8f6f4 | MX variant of f8f6f4 |
| 3 | f16 | Half precision |
| 4 | i8 | 8-bit integer (arch-conditional only) |
| 5 | tf32 | TensorFloat-32 |
| 7 | mxf4 | MX FP4 |

Modifier fields:

| Modifier | Bits | Description |
|---|---|---|
| Weight stationary (.ws) | bit 0 | NOT compatible with cta_group::2, mxf8f6f4, fp4 |
| CTA group | bit 1 | cta_group::1 (clear) or cta_group::2 (set) |
| Scale vector size | [3:2] | .scale_vec::1X/2X/4X with per-type constraints |
| Scale input accumulator | bit 4 | f16/tf32 only; NOT on sm_100a/sm_103a |
| Sparsity | bit 5 | MXF4/MXF4NVF4 restricted to arch-conditional |
| Block scale alias | [10:9] | .block16 (0) or .block32 (1) |

Collector modes (emitted by sub_35F38B0):

| Value | Modifier | Constraint |
|---|---|---|
| 1 | .collector::a::lastuse | -- |
| 2 | .collector::a::fill | Cannot combine with .ashift |
| 3 | .collector::a::use | Cannot combine with .ashift |

tcgen05 scaled MMA operand builder (sub_21E8CD0 / sub_35F3E90):

| Bit | Query | Clear | Set |
|---|---|---|---|
| 0 | "scaleD" | "0" | "1" |
| 1 | "negA" | "1" (no negate) | "-1" (negate) |
| 2 | "negB" | "1" | "-1" |
| 3 | "transA" | "0" | "1" |
| 4 | "transB" | "0" | "1" |

Note the asymmetry: scaleD/transA/transB emit boolean "0"/"1" strings, while negA/negB emit sign multiplier "1"/"-1" strings. This reflects the PTX encoding where negation is a multiplication factor and transpose is a boolean flag.

LLVM Intrinsic Reference

| Intrinsic ID | Name | Usage |
|---|---|---|
| 9062 | llvm.nvvm.wgmma.fence.aligned | WGMMA fence (3 type overloads) |
| 9067 | llvm.nvvm.wgmma.mma.async | WGMMA MMA async load (2 type overloads) |
| 9145 | llvm.nvvm.wgmma.store | WGMMA store (2 type overloads) |
| 10654--10779 | llvm.nvvm.wgmma.mma.async.* | Per-N-dimension variants (126 entries, even=int, odd=float) |
| 5304--5447 | (WGMMA 5-D grid) | Per-N x shared x type x scale x variant (144 entries) |
| 4905--4940 | (tcgen05 ISD opcodes) | tcgen05.mma variants (36 opcodes via 10-way shape switch) |

NVPTX Backend Duplicate Functions

All MMA emission functions exist in two structurally identical copies:

| AsmPrinter (0x21Dxxxx) | NVPTX Backend (0x36Exxxx) | Function |
|---|---|---|
| sub_21DFBF0 | sub_36E91F0 | hmmastc (HMMA store C) |
| sub_21E0360 | sub_36E72A0 | hmmaldab (HMMA load A/B) |
| sub_21E0630 | sub_36E7580 | hmmaldc (HMMA load C) |
| sub_21E0870 | sub_36E77C0 | hmmamma (HMMA MMA) |
| sub_21E1280 | sub_36E7B50 | immaldab (IMMA load A/B) |
| sub_21E15D0 | sub_36E7EA0 | immaldc (IMMA load C) |
| sub_21E1830 | sub_36E8110 | immastc (IMMA store C) |
| sub_21E1D20 | sub_36E8630 | immamma (IMMA MMA) |
| sub_21E2280 | sub_36E8BD0 | bmmamma (Binary MMA) |
| sub_21E8CD0 | sub_35F3E90 | tcgen05 scaled MMA operands |

The pairs differ only in error reporting (sub_16BD130 vs sub_C64ED0) and reference counting functions (sub_1623A60/sub_161E7C0 vs sub_B96E90/sub_B91220).

Cross-References

Surface and Texture Builtins

Surface and texture builtins form the largest contiguous block in the builtin table, with 165 surface store entries (IDs 474--638) plus a generic texture/surface handler (ID 647). CUDA separates texture reads (which go through a unified handler) from surface writes (which have dedicated per-format builtins). This asymmetry reflects the hardware: texture reads use a programmable texture pipeline, while surface stores map directly to typed sust (surface store) instructions.

Surface Store Builtins (IDs 474--638)

The 165 sust (surface store) builtins encode the dimensionality, data type, and out-of-bounds behavior directly in the builtin name. They follow the pattern:

__nvvm_sust_b_{dim}_{type}_{oob_mode}

Dimensions (5 variants)

| Dimension | Description |
|---|---|
| 1d | One-dimensional surface |
| 2d | Two-dimensional surface |
| 3d | Three-dimensional surface |
| 1d_array | Array of 1D surfaces |
| 2d_array | Array of 2D surfaces |

Data Types (11 variants)

| Type Suffix | Element Size | Vector |
|---|---|---|
| i8 | 8-bit integer | Scalar |
| i16 | 16-bit integer | Scalar |
| i32 | 32-bit integer | Scalar |
| i64 | 64-bit integer | Scalar |
| v2i8 | 8-bit integer | 2-element vector |
| v2i16 | 16-bit integer | 2-element vector |
| v2i32 | 32-bit integer | 2-element vector |
| v2i64 | 64-bit integer | 2-element vector |
| v4i8 | 8-bit integer | 4-element vector |
| v4i16 | 16-bit integer | 4-element vector |
| v4i32 | 32-bit integer | 4-element vector |

Out-of-Bounds Modes (3 variants)

| Mode | ID Range | Behavior |
|---|---|---|
| clamp | 474--528 | Clamp coordinates to valid range |
| trap | 529--583 | Trigger hardware trap on OOB access |
| zero | 584--638 | Write zero for OOB coordinates |

The total 5 x 11 x 3 = 165 entries are registered as a contiguous block. IDA shows SSE xmmword constant loads for the long common prefix strings (__nvvm_sust_b_2d_array_*), which is the compiler's optimization of string literal initialization during registration.

Surface Store ID Layout

Within each OOB-mode block of 55 entries, the ordering is dimension-major, type-minor:

base + 0..10:  1d       x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
base + 11..21: 1d_array x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
base + 22..32: 2d       x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
base + 33..43: 2d_array x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
base + 44..54: 3d       x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}

Given a surface store builtin ID, the decomposition is:

mode_offset = (id - 474)
oob_block   = mode_offset / 55          // 0=clamp, 1=trap, 2=zero
within_block = mode_offset % 55
dim_index    = within_block / 11         // 0=1d, 1=1d_array, 2=2d, 3=2d_array, 4=3d
type_index   = within_block % 11         // 0=i8 .. 10=v4i32
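The decomposition above can be expressed as a runnable sketch (Python used purely for illustration; the function name is hypothetical, the tables are transcribed from the layout above):

```python
# Hypothetical decoder for surface store builtin IDs 474-638, following
# the dimension-major, type-minor layout documented above.
DIMS = ["1d", "1d_array", "2d", "2d_array", "3d"]
TYPES = ["i8", "i16", "i32", "i64",
         "v2i8", "v2i16", "v2i32", "v2i64",
         "v4i8", "v4i16", "v4i32"]
MODES = ["clamp", "trap", "zero"]

def decompose_sust_id(builtin_id):
    """Map a builtin ID in [474, 638] to (oob_mode, dimension, type)."""
    assert 474 <= builtin_id <= 638
    mode_offset = builtin_id - 474
    oob = MODES[mode_offset // 55]   # 55 entries per OOB block
    within = mode_offset % 55
    dim = DIMS[within // 11]         # 11 types per dimension
    ty = TYPES[within % 11]
    return oob, dim, ty
```

ID 474 is the first clamp entry (1d, i8) and ID 638 the last zero entry (3d, v4i32).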

Texture/Surface Read Handler (ID 647)

All texture reads and surface reads are funneled through a single generic handler:

| ID | Builtin | Description |
|---|---|---|
| 647 | __nv_tex_surf_handler | Dispatch for all texture/surface read operations |

Unlike the surface stores which have 165 dedicated builtins, texture reads use a string-based dispatch mechanism. The handler is a single builtin that receives the texture/surface operation name as a string operand, then dynamically constructs the appropriate LLVM intrinsic name and emits the call.

Handler Dispatch Algorithm (case 0x287 in sub_955A70)

The NVVM-side lowering for __nv_tex_surf_handler (builtin ID 647, hex 0x287) is the most complex string-based builtin dispatch in cicc. It performs five steps:

Step 1 -- String extraction. Walks the AST operand tree from the call expression to locate the constant string naming the texture/surface operation. Validates that byte 173 of the operand node equals 2 (the constant-string-type marker in the EDG AST). The string is the NVVM intrinsic base name, for example __tex_fetch or __surf_read.

Step 2 -- Element type determination. Decodes the return element type from the AST type node attached to the call. The type switch maps to suffix strings:

| AST Type | Suffix String | LLVM Type |
|---|---|---|
| void | "void" | void |
| char (as signed) | "char_as_schar" | i8 |
| char (as unsigned) | "char_as_uchar" | i8 |
| signed char | "schar" | i8 |
| unsigned char | "uchar" | i8 |
| short | "short" | i16 |
| unsigned short | "ushort" | i16 |
| int | "int" | i32 |
| unsigned int | "uint" | i32 |
| long | "long" | i32/i64 |
| unsigned long | "ulong" | i32/i64 |
| long long | "longlong" | i64 |
| unsigned long long | "ulonglong" | i64 |
| float | "float" | float |

The long/ulong width follows the host ABI convention (32-bit on NVPTX).

Step 3 -- Intrinsic name construction. Concatenates the operation base name with the element type suffix using underscore separation:

intrinsic_name = "{operation_string}_{element_type_suffix}"

For example, __tex_fetch_v4 + float yields __tex_fetch_v4_float.

Step 4 -- Intrinsic lookup. Resolves the constructed name string via sub_BA8CA0 (NVVM intrinsic table lookup) to obtain the corresponding LLVM intrinsic function declaration. The EDG-side parallel path uses sub_1632190. If the intrinsic is not found, this is a fatal error.

Step 5 -- Call emission. Collects all arguments from the call expression, builds the LLVM function type signature from the argument types, and emits the intrinsic call via sub_921880. Returns a dummy i32 value via sub_AD6530.

This design allows the compiler to support an arbitrary number of texture/surface read variants without enumerating them in the builtin table. The single ID 647 entry is a trampoline that dispatches to hundreds of different NVVM intrinsics at runtime.
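Steps 2--3 can be modeled with a short sketch (Python for illustration; the suffix table is transcribed from above, and the function name is hypothetical):

```python
# Element-type suffixes used in intrinsic name construction (subset of the
# table above; long/ulong width follows the 32-bit NVPTX convention).
TYPE_SUFFIX = {
    "void": "void", "signed char": "schar", "unsigned char": "uchar",
    "short": "short", "unsigned short": "ushort",
    "int": "int", "unsigned int": "uint",
    "long long": "longlong", "unsigned long long": "ulonglong",
    "float": "float",
}

def make_intrinsic_name(operation, element_type):
    """Step 3: concatenate the operation base name and the type suffix."""
    return "{}_{}".format(operation, TYPE_SUFFIX[element_type])
```

For example, `make_intrinsic_name("__tex_fetch_v4", "float")` yields the `__tex_fetch_v4_float` name that Step 4 then resolves against the intrinsic table.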

__nv_tex_surf_handle_t Built-in Type

The EDG parser recognizes __nv_tex_surf_handle_t as a built-in type (keyword index 277 in the keyword table at sub_72BA30). This opaque type is the C++-level representation of a texture or surface reference handle. When the type appears as a function parameter, the PTX emitter (sub_21502D0, 22KB) produces one of:

| Parameter ABI | PTX Syntax |
|---|---|
| By-value .texref | .param .texref NAME |
| By-value .surfref | .param .surfref NAME |
| By-value .samplerref | .param .samplerref NAME |
| Pointer to .texref | .param .u64 .ptr .texref NAME |
| Pointer to .surfref | .param .u64 .ptr .surfref NAME |
| Pointer to .samplerref | .param .u64 .ptr .samplerref NAME |

The selection between .texref / .surfref / .samplerref is determined by the NVVM metadata attached to the GlobalVariable that the handle references. The NVPTXReplaceImageHandles pass (sub_21DBEA0) performs the final substitution of IR-level image handles into PTX-level texture/surface references during machine-level code emission.

Texture/Surface Map Initialization

The NVVM-side handler sub_954F10 maintains two lazily-initialized red-black tree maps for resolving texture and surface operations. These maps are built once (guarded by flag bytes byte_4F6D3B0 and byte_4F6D378) and cleaned up via __cxa_atexit.

Surface Operation Map (unk_4F6D3C0)

Used when the handler's v8 flag is nonzero (surface path). Contains entries mapping builtin IDs to LLVM intrinsic IDs for surface read operations. Each entry is a 12-byte packed triple:

| Intrinsic ID | Description |
|---|---|
| 0x21CA (8650) | Surface read (primary) |

The map contains 4 entries covering surface read and write variants with address space 4 (constant memory surface descriptors).

Texture Operation Map (unk_4F6D380)

Contains entries for texture fetch operations. The map has 12 entries covering the full matrix of texture modes:

| Intrinsic ID | Mapped Builtin Base | Description |
|---|---|---|
| 0x1FC6 (8134) | ID 338 | Texture fetch (sync variant) |
| 0x23C5 (9157) | ID 302 | Texture fetch (base variant) |
| 0x23C8 (9160) | ID 303 | Texture fetch (alternate) |

These 12 entries span the following texture fetch modes:

| Mode | Behavior |
|---|---|
| Unfiltered fetch | Direct texel access at integer coordinates |
| Filtered fetch | Hardware-interpolated fetch at float coordinates |
| LOD fetch | Explicit level-of-detail selection |
| Gradient fetch | Gradient-based LOD computation |

Map Lookup and Dispatch (sub_954F10)

function TexSurfSampleHandler(retval, ctx, builtin_id, arglist):
    // Determine surface vs texture path
    is_surface = (v8 flag != 0)

    if is_surface:
        map = unk_4F6D3C0     // surface map
        if not initialized:
            populate 4 entries into red-black tree
            byte_4F6D3B0 = 1
    else:
        map = unk_4F6D380     // texture map
        if not initialized:
            populate 12 entries into red-black tree
            byte_4F6D378 = 1

    // Tree lookup
    entry = rbtree_find(map, builtin_id)
    if found:
        intrinsic_id = entry.intrinsic_id   // e.g. 0x1FC6
    else:
        intrinsic_id = 0
        default_mode = 1

    // Create type constant from element type
    type_const = sub_BCB2D0(sub_ACD640(...))

    // Process 4 standard operands
    for operand in [sampler, coordinate, lod, bias]:
        if operand != null:
            lowered = type_cast(operand, expected_llvm_type)
            emit_store(lowered)   // sub_B4D190 or sub_B4D3C0

    // Build and emit intrinsic call
    fn_decl = sub_90A810(intrinsic_tables, intrinsic_id, ...)
    sub_921880(fn_decl, args)        // emit call
    sub_B4D3C0(result)               // store result

Operand Processing

For each of the 4 standard texture operands (sampler, coordinate, LOD, bias), the handler:

  1. Checks if the operand is non-null
  2. Type-casts to match the expected LLVM type
  3. Emits the lowered value via sub_B4D190 (load path) or sub_B4D3C0 (store path)
  4. Builds the LLVM call via sub_90A810 with the resolved intrinsic ID

SelectionDAG Lowering Layer

After NVVM builtin lowering produces LLVM IR intrinsic calls, the SelectionDAG layer translates these into NVPTX-specific DAG nodes. Three subsystems handle different aspects.

Intrinsic Lowering Dispatch (sub_33B0210, 343KB)

The central intrinsic lowering function dispatches on LLVM intrinsic IDs via a giant switch covering ~440 case labels. Texture and surface operations occupy three distinct ID ranges:

| Intrinsic ID Range | Handler | Category |
|---|---|---|
| 0x5D--0x8D (93--141) | sub_33A4350 | Texture fetch bulk handler (50 IDs) |
| 0x8E--0x90 (142--144) | sub_33A3180 | Surface read/write handler (3 IDs) |
| 0x91 (145) | Inline | Complex texture sample with LOD/bias |
| 0x92--0x98 (146--152) | Various | Surface store variants |
| 0x9C--0x9D (156--157) | sub_33AEC60 | Surface atomics |
| 0x9E--0x9F (158--159) | sub_33AFBA0 / sub_340EC60 | Surface special ops |
| 0xA0--0xA2 (160--162) | Various | Surface/texture helpers |
| 0x2952 (10578) | Inline | nvvm_texsurf_handle binding |
| 0x254D+ (9549+) | sub_34B8FD0 | Unified texture sample core |

Texture Fetch Bulk Handler: sub_33A4350

The 50 consecutive intrinsic IDs 0x5D through 0x8D all delegate to a single helper sub_33A4350(state, dag_node). This function maps the intrinsic ID to an NVPTXISD opcode for one of the tex.1d, tex.2d, tex.3d, or tex.a1d/tex.a2d (array) variants.

The intrinsic-to-opcode mapping encodes:

dimension:    1d / 2d / 3d / 1d_array / 2d_array / cubemap
data_type:    u32 / s32 / f32 / f32f32 (filtered)
return_width: scalar / v2 / v4
access_mode:  level / grad / unified

Each opcode corresponds to a PTX texture instruction pattern that the instruction emitter will later produce.

Complex Texture Sample (Intrinsic ID 0x91)

The most complex texture lowering path. Handles hardware-filtered texture sampling with programmable LOD computation:

  1. sub_3281100 -- Determines element count for the return type
  2. sub_3281590 -- Computes alignment for the result buffer
  3. sub_327FD70 -- Resolves the return MVT (machine value type)
  4. sub_33CC4A0 -- SM-specific path selection (some SM levels use different instruction encodings)
  5. sub_3406EB0(opcode=57) -- Creates the core sample DAG node
  6. sub_33FAF80(opcode=213) -- LOD computation DAG node
  7. sub_3406EB0(opcode=186) -- Merge result node
  8. sub_33FAF80(opcode=389) -- Final type fixup
  9. Fallback via sub_33A1E80 if the target architecture does not support this texture mode

Surface Read/Write Handler: sub_33A3180

Intrinsic IDs 0x8E (surf1Dread), 0x8F (surf2Dread), 0x90 (surf3Dread) delegate to sub_33A3180(state, dag_node, intrinsic_id). The intrinsic_id parameter selects the dimensionality. This handler produces NVPTXISD suld (surface load) DAG nodes.

Texture/Surface Handle Binding (Intrinsic 0x2952)

The nvvm_texsurf_handle intrinsic (ID 10578) is the mechanism for binding a GlobalVariable to a texture or surface reference. The DAG lowering:

  1. Validates that operand 0 is metadata wrapping a GlobalVariable -- errors with "nvvm_texsurf_handle op0 must be metadata wrapping a GlobalVariable" otherwise
  2. Creates a DAG constant node for the handle via sub_3400BD0(opcode=10579)
  3. Binds the handle via sub_3406EB0(opcode=46)

The NVPTXReplaceImageHandles pass (sub_21DBEA0) later resolves these abstract handles into concrete PTX .texref / .surfref globals during machine-level emission.

Unified Texture Sample Core (Intrinsic IDs 0x254D+)

For SM 30+ unified texture mode, a more complex sampling path handles the full matrix of texture configurations:

  1. sub_34B8FD0 -- Unpacks the parameter block encoding dimension, filtering, coordinate type
  2. Vtable dispatch at *src+88 -- Selects the sampling mode (point, linear, etc.)
  3. sub_3409320 -- Creates the sampler state DAG node
  4. sub_33EB1C0(opcode=47) -- Creates the core tex/surf sample DAG node with memory semantics
  5. sub_33FC220(opcode=2) -- Merges vector result components
  6. sub_33E5830 + sub_3411630(opcode=55) -- Packages the final result
  7. sub_B91FC0 -- Attaches debug info

Two modes exist: v2637=true (unified texture) and v2637=false (legacy separate-handle texture). The unified path is the modern default.

Texture/Surface Binding Lowering (Intrinsic IDs 0x44, 0x45, 0x47)

These intrinsics handle the compile-time binding of texture and surface references. The lowering checks the a1+120 flag to determine whether the reference is a .texref or .surfref:

  1. sub_3382030 -- Initial binding setup
  2. sub_3382930 -- Variant analysis via sub_3380DB0 and sub_B58DC0
  3. sub_3386E40 -- Final binding emission

Intrinsic 0x48 (opcode 332) handles global texture handles, while 0x162 (opcode 331) handles sampler handles. Intrinsic 0x169 dispatches to sub_3400BD0 + sub_3406EB0(opcode=333) for indirect texture access.

Instruction Selection: sub_306A930 (52KB)

The NVPTX instruction selection pass contains a 52KB handler (sub_306A930) dedicated to matching texture/surface DAG nodes to machine instructions. It calls five helper functions:

| Helper | Address | Role |
|---|---|---|
| sub_2FE5F00 | 0x2FE5F00 | Texture instruction type selection |
| sub_2FE5F30 | 0x2FE5F30 | Surface instruction type selection |
| sub_2FE5F60 | 0x2FE5F60 | Image type validation |
| sub_2FE69A0 | 0x2FE69A0 | Coordinate mode encoding |
| sub_2FE6CC0 | 0x2FE6CC0 | Return type dispatch |

The ISel handler selects among tex, suld, sust machine instruction patterns, with address space awareness for the different texture/surface memory regions.

Image Type Validation: sub_21DD1A0 (16KB)

A dedicated 16KB validation function (sub_21DD1A0) checks that the image type encoding is legal for the instruction class. Four error messages cover the instruction categories:

| Error String | Instruction Class |
|---|---|
| "Invalid image type in .tex" | Texture fetch |
| "Invalid image type in .suld" | Surface load |
| "Invalid image type in suq." | Surface query |
| "Invalid image type in .sust" | Surface store |

This validation occurs during instruction emission, catching type mismatches that survived earlier lowering.

Surface Store Lowering Details

Surface store builtins in the 474--638 range are handled by the main dispatch switch with a block of consecutive cases. Each case:

  1. Extracts the surface handle, coordinate(s), and data value(s) from the argument list
  2. The number of coordinate arguments varies by dimensionality (1D: 1, 2D: 2, 3D: 3, arrays: +1 for layer index)
  3. The number of data arguments varies by vector width (scalar: 1, v2: 2, v4: 4)
  4. Emits a call to the corresponding llvm.nvvm.sust.b.* intrinsic

The out-of-bounds mode is encoded in the intrinsic name itself, not as a parameter, which is why each mode requires a separate builtin ID.
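The argument-count rules in steps 2--3 can be sketched as follows (Python for illustration; the helper name is hypothetical, the counts are transcribed from the text above):

```python
# Coordinate argument counts per dimensionality; array variants add one
# layer-index argument on top of the base dimensionality.
COORD_ARGS = {"1d": 1, "2d": 2, "3d": 3, "1d_array": 2, "2d_array": 3}
# Data argument counts per vector width.
DATA_ARGS = {"scalar": 1, "v2": 2, "v4": 4}

def sust_arg_counts(dim, width):
    """Return (coordinate_count, data_count) for a surface store builtin."""
    return COORD_ARGS[dim], DATA_ARGS[width]
```

A `sust.b.2d_array.v4i8` store, for instance, consumes three coordinate arguments (x, y, layer) and four data arguments.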

PTX Emission: Sampler State Initializers

The PTX emitter sub_2156420 (20KB) handles module-level emission of texture, surface, and sampler global variables. Sampler references receive structured initializers:

.global .samplerref my_sampler = {
    addr_mode_0 = wrap,          // or clamp_to_border, clamp_to_edge, mirror
    addr_mode_1 = clamp_to_edge,
    addr_mode_2 = clamp_to_edge,
    filter_mode = linear,        // or nearest
    force_unnormalized_coords = 1
};

The addressing mode and filter mode values are extracted from NVVM metadata attached to the sampler GlobalVariable. The emitter recognizes these sampler reference types via sub_1C2E890 and generates the structured PTX initializer. Texture and surface references use the simpler forms:

.global .texref my_texture;
.global .surfref my_surface;

End-to-End Pipeline

The complete texture/surface compilation pipeline spans five compiler phases:

| Phase | Function(s) | What Happens |
|---|---|---|
| EDG Frontend | sub_72BA30 | Parses __nv_tex_surf_handle_t as built-in type; keyword 277 |
| NVVM Builtin Lowering | sub_955A70 case 0x287 / sub_954F10 | String-based dispatch constructs LLVM intrinsic names; red-black tree maps resolve builtin IDs to intrinsic IDs |
| SelectionDAG Lowering | sub_33B0210 / sub_33A4350 / sub_33A3180 | 50+ texture intrinsic IDs become NVPTXISD DAG nodes; handle binding validated against GlobalVariable metadata |
| Instruction Selection | sub_306A930 (52KB) | DAG nodes matched to tex.* / suld.* / sust.* machine instructions |
| PTX Emission | sub_2156420 / sub_21DD1A0 | .texref/.surfref/.samplerref globals emitted; image type validated; NVPTXReplaceImageHandles substitutes abstract handles |

Architecture Considerations

Surface and texture operations are available on all SM architectures. However, the texture pipeline has evolved significantly:

  • All SM: Basic texture fetch, surface read/write with clamp/trap/zero modes
  • SM 30+: Unified texture mode via __nv_tex_surf_handler generic dispatch; v2637=true path in DAG lowering
  • SM 90+ (Hopper): Tensor memory accelerator (TMA) operations provide an alternative high-throughput path for bulk data movement, partially overlapping with texture/surface functionality but handled through separate builtins (IDs 411--412)

The 165 surface store builtins are registered unconditionally regardless of target SM. Architecture gating occurs at the PTX emission layer, not during builtin registration or lowering. The complex texture sample path (intrinsic 0x91) has an explicit SM feature gate via sub_33CC4A0 that selects alternate instruction encodings for older architectures, with sub_33A1E80 as the fallback for unsupported targets.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| NVVM builtin lowering dispatch | sub_955A70 | -- | Main switch; case 0x287 handles __nv_tex_surf_handler |
| Texture/surface sample handler | sub_954F10 | -- | Red-black tree dispatch for IDs 302--309, 338--345, 395--402 |
| EDG keyword handler | sub_72BA30 | -- | Parses __nv_tex_surf_handle_t built-in type (keyword 277) |
| NVPTX intrinsic lowering | sub_33B0210 | -- | 343KB central dispatch; tex IDs 0x5D--0x8D, surf IDs 0x8E--0x90 |
| Texture fetch bulk handler | sub_33A4350 | -- | 50 consecutive intrinsic IDs for all tex1D/2D/3D/array variants |
| Surface read/write handler | sub_33A3180 | -- | 3 intrinsic IDs for surf1D/2D/3D read |
| Tex/surf sample DAG node builder | sub_33EB1C0 | -- | Creates memory-typed NVPTXISD sample nodes (opcode 47) |
| Sampler state DAG node builder | sub_3409320 | -- | Creates sampler state binding nodes |
| Surface atomics handler | sub_33AEC60 | -- | Intrinsic IDs 0x9C--0x9D |
| Surface special handler | sub_33AFBA0 | -- | Intrinsic ID 0x9E |
| Texture/surface ISel | sub_306A930 | -- | 52KB instruction selection for tex/suld/sust patterns |
| Image type validator | sub_21DD1A0 | -- | 16KB; validates .tex/.suld/.sust/suq. image types |
| NVPTXReplaceImageHandles | sub_21DBEA0 | -- | Replaces IR image handles with PTX .texref/.surfref |
| Global variable emitter | sub_2156420 | -- | 20KB; emits .texref/.surfref/.samplerref with initializers |
| Parameter list emitter | sub_21502D0 | -- | 22KB; emits .param .texref/.surfref/.samplerref in function signatures |
| visitNVVMTexSurf | sub_2077400 | -- | 20KB SelectionDAGBuilder extension for tex/surf handle lowering |
| NVVM intrinsic lookup | sub_BA8CA0 | -- | Resolves constructed intrinsic name string to LLVM function declaration |
| Intrinsic table lookup | sub_90A810 | -- | Resolves intrinsic ID to function declaration with type overloads |

Cross-References

Barrier and Synchronization Builtins

Barrier builtins handle thread synchronization, memory fencing, and cluster-level coordination. They span IDs 1--5 (core barriers), 8--20 (cluster and barrier extensions), and several scattered IDs for memory barriers and fences. The lowering layer emits either LLVM intrinsic calls or inline PTX assembly, depending on whether the operation has a direct LLVM IR equivalent.

Core Barriers (IDs 1--5)

The most fundamental synchronization primitives in CUDA map to the lowest builtin IDs.

| ID | Builtin | PTX Equivalent | Description |
|---|---|---|---|
| 1 | __syncthreads | bar.sync 0 | Block-wide barrier |
| 2 | __nvvm_bar0 | bar.sync 0 | Alias for __syncthreads |
| 3 | __nvvm_membar_cta | membar.cta | CTA-scope memory fence |
| 4 | __nvvm_membar_gl | membar.gl | Device-scope memory fence |
| 5 | __nvvm_membar_sys | membar.sys | System-scope memory fence |

The core __syncthreads (ID 1) lowers to the LLVM intrinsic llvm.nvvm.barrier0 (intrinsic ID 8259). Memory barriers at IDs 3--5 are lowered via inline IR generation: the handler builds a barrier store node through sub_128B420 / sub_92C9E0 and inserts it into the current basic block.

Barrier Extensions (IDs 15--20)

These builtins extend the basic barrier with predicate reduction and explicit warp/block synchronization.

| ID | Builtin | Intrinsic | Description |
|---|---|---|---|
| 15 | __nvvm_bar0_popc | llvm.nvvm.barrier0.popc | Barrier + population count of predicate |
| 16 | __nvvm_bar0_and | llvm.nvvm.barrier0.and | Barrier + AND reduction of predicate |
| 17 | __nvvm_bar0_or | llvm.nvvm.barrier0.or | Barrier + OR reduction of predicate |
| 18 | __nvvm_bar_sync_all | llvm.nvvm.barrier.sync (8925) | Named barrier sync (all threads) |
| 19 | __nvvm_barrier_sync | llvm.nvvm.barrier.sync.cnt (9296) | Named barrier sync with count |
| 20 | __nvvm_bar_warp_sync | llvm.nvvm.bar.warp.sync (8258) | Warp-level barrier |

The reduction barriers (IDs 15--17) are dispatched through sub_12AB550 / sub_94C360. The handler looks up intrinsic 3767 (EDG) or the corresponding entry from dword_3F14778[] (NVVM) and emits a function call via sub_1285290 / sub_921880. ID 16 sets flag=1 (AND) and ID 17 sets flag=16|0 (OR); the population count variant uses the default flag.

Barriers with explicit count (IDs 205--206, __nvvm_bar_sync_all_cnt and __nvvm_barrier_sync_cnt) follow the same pattern with additional count arguments.

Cluster Operations (IDs 8--14, SM 90+)

Thread block cluster operations were introduced with SM 90 (Hopper). These builtins query cluster geometry and perform inter-block synchronization within a cluster.

Cluster Geometry Queries (IDs 8--10, 405--408)

| ID | Builtin | Handler | Description |
|---|---|---|---|
| 8 | __nv_clusterDimIsSpecified_impl | sub_12AB0E0(ctx, 0) | Whether cluster dimensions are explicit |
| 9 | __nv_clusterRelativeBlockRank_impl | sub_12AB0E0(ctx, 1) | Block rank within cluster |
| 10 | __nv_clusterSizeInBlocks_impl | sub_12AB0E0(ctx, 2) | Number of blocks in cluster |
| 405 | __nv_clusterDim_impl | -- | Cluster dimension |
| 406 | __nv_clusterRelativeBlockIdx_impl | -- | Block index within cluster |
| 407 | __nv_clusterGridDimInClusters_impl | -- | Grid dimension in cluster units |
| 408 | __nv_clusterIdx_impl | -- | Cluster index |

Cluster Barriers (IDs 11--14)

| ID | Builtin | Intrinsic ID | Description |
|---|---|---|---|
| 11 | __nv_cluster_barrier_arrive_impl | 3767 | Signal arrival at cluster barrier |
| 12 | __nv_cluster_barrier_wait_impl | 3767 | Wait at cluster barrier |
| 13 | __nv_cluster_barrier_arrive_relaxed_impl | 3767 | Relaxed arrival (no ordering guarantee) |
| 14 | __nv_threadfence_cluster_impl | 4159 / 9052 | Cluster-scope memory fence |

The cluster fence at ID 14 emits intrinsic llvm.nvvm.cp.async.commit.group (EDG intrinsic 4159, NVVM intrinsic 9052) with a flag constant of 4, encoding the thread-fence semantic.

Cluster Shared Memory (IDs 202--203, 365)

| ID | Builtin | Description |
|---|---|---|
| 202 | __nv_isClusterShared_impl | Query if address is in cluster shared memory |
| 203 | __nv_cluster_query_shared_rank_impl | Get rank of block that owns shared address |
| 365 | __nv_cluster_map_shared_rank_impl | Map address to another block's shared memory |

ID 203 has an SM-dependent lowering path: on SM <= 63, the handler returns an inline constant (passthrough); on SM 64+, it emits intrinsic 3769 (EDG) / 8825 (NVVM). The same pattern applies to ID 365, which gates on intrinsic 3770 / 9005.

Memory Fence Lowering

Memory fences are emitted as inline PTX assembly because they have no direct LLVM IR equivalent. Two handlers exist:

sub_94F9E0 -- membar (CTA/Device/System)

Generates membar.{scope}; where scope is determined by the scope parameter:

| Scope Value | PTX Output |
|---|---|
| 0, 1 | membar.cta; |
| 2, 3 | membar.gl; |
| 4 | membar.sys; |

The constraint string is ~{memory} to ensure the compiler treats the fence as a full memory clobber. The emitted node receives two memory attributes: inaccessiblemem (attribute 41) and a readonly fence marker (attribute 6).
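The scope-to-instruction mapping can be sketched directly from the table (Python for illustration; the function name is hypothetical):

```python
def membar_for_scope(scope):
    """Map the handler's scope parameter to a membar PTX string,
    per the table above (0/1 -> cta, 2/3 -> gl, 4 -> sys)."""
    if scope in (0, 1):
        return "membar.cta;"
    if scope in (2, 3):
        return "membar.gl;"
    if scope == 4:
        return "membar.sys;"
    raise ValueError("unknown membar scope: {}".format(scope))
```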

sub_94FDF0 -- fence (with explicit ordering)

Generates fence.{ordering}.{scope}; for SM 70+ targets:

| Ordering Value | PTX Qualifier |
|---|---|
| 3 | sc (sequentially consistent) |
| 4 | acq_rel |
| 5 | sc (same as 3) |

Both fence handlers use sub_B41A60 to create the inline assembly call and sub_921880 to emit it into the instruction stream.

Async Memory Copy Barriers (IDs 367--369)

The cp.async instructions for asynchronous shared-to-global memory copies include implicit barrier semantics:

| ID | Builtin | Size | Description |
|---|---|---|---|
| 367 | __nv_memcpy_async_shared_global_4_impl | 4 bytes | Async copy with barrier |
| 368 | __nv_memcpy_async_shared_global_8_impl | 8 bytes | Async copy with barrier |
| 369 | __nv_memcpy_async_shared_global_16_impl | 16 bytes | Async copy with barrier |

These are lowered through sub_12AB730 / sub_94C5F0, which builds the cp.async PTX instruction with the specified transfer size.

Architecture Gates

| SM Threshold | Barrier Feature |
|---|---|
| All SM | __syncthreads, membar.{cta,gl,sys}, barrier reductions |
| SM 70+ | Explicit fence ordering (fence.{ordering}.{scope}) |
| SM 70+ | cp.async asynchronous memory copy with barrier |
| SM 90+ (Hopper) | Cluster barriers, cluster fence, cluster shared memory queries |

Lowering Strategy Summary

Barrier builtins use three distinct lowering strategies:

  1. LLVM intrinsic call -- __syncthreads, barrier reductions, cluster barriers. These map to well-known LLVM/NVVM intrinsic IDs (8259, 8925, 9296, etc.) and emit via sub_1285290.

  2. Inline IR generation -- Memory barriers (__nvvm_membar_*). The handler directly constructs barrier store IR nodes without going through an intrinsic lookup.

  3. Inline PTX assembly -- Memory fences (membar.*, fence.*). These have no LLVM IR equivalent and are emitted as inline asm strings with ~{memory} clobber constraints.

Warp-Level Operation Builtins

Warp-level builtins provide lane-to-lane communication within a 32-thread warp. They cover four major categories: shuffle (data exchange between lanes), vote (predicate aggregation), match (value matching across lanes), and redux (warp-wide reductions). The shuffle operations also serve as the lowering target for the WMMA fragment load/store operations described in the tensor core page.

Shuffle Operations (IDs 413--416)

The __shfl_sync family enables direct register-to-register communication between warp lanes. Four shuffle modes exist, each registered as a _sync variant:

| ID | Builtin | Mode | Description |
|---|---|---|---|
| 413 | __nvvm_shfl_up_sync | Up | Lane reads from lane - delta |
| 414 | __nvvm_shfl_down_sync | Down | Lane reads from lane + delta |
| 415 | __nvvm_shfl_bfly_sync | Butterfly | Lane reads from lane XOR delta |
| 416 | __nvvm_shfl_idx_sync | Index | Lane reads from arbitrary srcLane |

Shuffle Dispatch via Table Lookup

All shuffle builtins route through sub_12B3540 (EDG) / sub_954F10 (NVVM), the table-based lowering handler. Three groups of 8 IDs each cover the complete shuffle interface:

| ID Range | Group | Description |
|---|---|---|
| 302--309 | Legacy __shfl | Non-sync variants (4 modes x 2 types: i32/f32) |
| 338--345 | __shfl_sync | Sync variants with mask (4 modes x 2 types) |
| 395--402 | __shfl_*_sync | Newer SM interface (4 modes x 2 types) |

Within each group of 8, the layout is:

| Offset | Mode | i32 Variant | f32 Variant |
|---|---|---|---|
| +0, +1 | shfl_up | offset +0 | offset +1 |
| +2, +3 | shfl_down | offset +2 | offset +3 |
| +4, +5 | shfl_xor | offset +4 | offset +5 |
| +6, +7 | shfl_idx | offset +6 | offset +7 |
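The group-plus-offset layout can be expressed as a small sketch (Python for illustration; names are hypothetical, the bases and offsets come from the two tables above):

```python
# Group base IDs from the shuffle dispatch table.
GROUP_BASES = {"legacy": 302, "sync": 338, "new": 395}
# Each mode occupies two consecutive slots: i32 first, then f32.
MODE_OFFSETS = {"up": 0, "down": 2, "xor": 4, "idx": 6}

def shfl_builtin_id(group, mode, is_f32):
    """Compute a shuffle builtin ID from (group, mode, element type)."""
    return GROUP_BASES[group] + MODE_OFFSETS[mode] + (1 if is_f32 else 0)
```

For example, the f32 `__shfl_down_sync` variant in the sync group lands at 338 + 2 + 1 = 341.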

The handler builds the argument list (mask, value, delta/lane, width), looks up the target intrinsic by shuffle mode and data type from its red-black tree map, and emits a function call.

Vote Operations (IDs 351--358)

Warp vote builtins aggregate a boolean predicate across all participating lanes. Both legacy (non-sync) and sync variants are registered.

| ID | Builtin | Operation | Sync |
|---|---|---|---|
| 351 | __nvvm_vote_all | All predicates true? | No |
| 352 | __nvvm_vote_any | Any predicate true? | No |
| 353 | __nvvm_vote_uni | All predicates equal? | No |
| 354 | __nvvm_vote_ballot | Bitmask of predicates | No |
| 355 | __nvvm_vote_all_sync | All predicates true? | Yes |
| 356 | __nvvm_vote_any_sync | Any predicate true? | Yes |
| 357 | __nvvm_vote_uni_sync | All predicates equal? | Yes |
| 358 | __nvvm_vote_ballot_sync | Bitmask of predicates | Yes |

Vote Lowering

The handler sub_12ABB90 (EDG) / sub_94D570 (NVVM) takes parameters:

(result, ctx, vote_op, args, is_ballot, is_sync)

The vote_op encoding: 0 = all, 1 = any, 2 = uni, 3 = ballot.

When is_sync=1, an extra mask argument is consumed from the call arguments. For non-sync variants, the handler looks up intrinsic 5301 (llvm.nvvm.vote). For sync variants, it generates an inline predicate pattern. The ballot variant (vote_op=3) sets is_ballot=1, which changes the return type from i1 (predicate) to i32 (bitmask).
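The parameter encoding can be modeled as a sketch (Python for illustration; the function name is hypothetical, the encodings follow the text above):

```python
# vote_op encoding used by the lowering handler.
VOTE_OPS = {"all": 0, "any": 1, "uni": 2, "ballot": 3}

def vote_lowering(op, is_sync):
    """Return (vote_op, is_ballot, return_type, consumes_mask_arg).
    Ballot returns an i32 bitmask; the other ops return an i1 predicate."""
    code = VOTE_OPS[op]
    is_ballot = int(op == "ballot")
    ret_type = "i32" if is_ballot else "i1"
    return code, is_ballot, ret_type, int(is_sync)
```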

Match Operations (IDs 361--364)

Match builtins find lanes with equal values and return a bitmask of matching lanes. Available in 32-bit and 64-bit variants with two matching modes.

| ID | Builtin | Width | Mode | Intrinsic |
|---|---|---|---|---|
| 361 | __match32_any_sync | 32-bit | Any match | 0x1011 |
| 362 | __match64_any_sync | 64-bit | Any match | 0x1011 |
| 363 | __match32_all_sync | 32-bit | All match | 0x100F |
| 364 | __match64_all_sync | 64-bit | All match | 0x100F |

The handler sub_12AD230 (EDG) dispatches on two opcodes: 0x1011 for any-match and 0x100F for all-match. The NVVM-side handler sub_94F430 uses intrinsic pairs 0x2017 / 0x2018 with mode variants 0, 1, 2 to encode the width and match type.

Warp Redux (IDs 413--416 range, via sub_12ADD20)

Warp-wide reduction operations perform arithmetic reductions across all active lanes in a single instruction. These are dispatched through sub_12ADD20 (EDG) / sub_94F250 (NVVM).

| Operation | NVVM Intrinsic | Reduction | Description |
|---|---|---|---|
| redux.sync.add | 0x24F5 (9461) | Sum reduction | Sum of values across warp |
| redux.sync.min | 0x24ED (9453) | Minimum reduction | Minimum value across warp |
| redux.sync.max | 0x24E9 (9449) | Maximum reduction | Maximum value across warp |
| redux.sync.or | 0x24F1 (9457) | Bitwise OR reduction | OR of values across warp |

The EDG side uses intrinsic codes 0x2332 and 0x2330 for the two redux variant families.

Activemask and Lanemask

The active mask and per-lane mask builtins are handled through sub_12ADB00 (EDG) / sub_94CF30 (NVVM):

These builtins return the set of currently active lanes (__activemask()) or per-lane positional masks (__lanemask_lt(), __lanemask_le(), __lanemask_eq(), __lanemask_ge(), __lanemask_gt()). They compile to PTX special register reads (%lanemask_*).

Predicate-Register Conversion (IDs 411--412)

Two builtins convert between predicate registers and general-purpose registers:

| ID | Builtin | Direction | Description |
|---|---|---|---|
| 411 | __nv_p2r | Predicate -> Register | Pack predicates into a 32-bit register |
| 412 | __nv_r2p | Register -> Predicate | Unpack a 32-bit register into predicates |

The handler generates element-wise operations: sub_9483E0 iterates over vector elements using sub_39FAC40 to compute the element count, then builds per-element extractelement + store (for p2r) or load + insertelement (for r2p) chains.
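The semantics of the two conversions can be sketched as follows (Python for illustration; function names mirror the builtins but the implementations are hypothetical models, not the emitted IR):

```python
def p2r(predicates):
    """Pack a list of 0/1 predicates into a 32-bit register (lane i -> bit i)."""
    reg = 0
    for i, p in enumerate(predicates):
        reg |= (p & 1) << i
    return reg

def r2p(reg, n):
    """Unpack the low n bits of a 32-bit register back into predicates."""
    return [(reg >> i) & 1 for i in range(n)]
```

The two operations are inverses: `r2p(p2r(ps), len(ps))` recovers the original predicate list.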

Nanosleep and CP.Async

Warp-adjacent utility builtins handled through sub_12AD230 / sub_94ED50:

| ID Range | Builtin | Description |
|---|---|---|
| 367--369 | __nv_memcpy_async_shared_global_{4,8,16}_impl | Asynchronous copy (cp.async) |

These builtins combine data movement with implicit synchronization and are lowered through sub_12AB730 / sub_94C5F0, which builds the cp.async PTX instruction with the specified transfer size (4, 8, or 16 bytes).

Architecture Requirements

| Feature | Minimum SM | Notes |
|---|---|---|
| __shfl (legacy, non-sync) | SM 30+ | Deprecated; requires full warp convergence |
| __shfl_sync | SM 70+ (Volta) | Explicit mask; independent thread scheduling |
| Vote (non-sync) | SM 30+ | Deprecated |
| Vote (_sync) | SM 70+ | Explicit mask required |
| Match (_sync) | SM 70+ | Warp-level value matching |
| Redux (redux.sync.*) | SM 80+ (Ampere) | Hardware-accelerated warp reduction |
| Elect sync | SM 90+ (Hopper) | Single-lane election from active mask |
| cp.async | SM 80+ | Asynchronous shared memory copy |

GPU Target Architecture

45 SM variants across 6 generations. Processor table at qword_502A920 (stride-2 layout: name + PTX version). Architecture gating throughout the binary controls feature availability.

- SM table: qword_502A920 (45 entries, ctor_605 at 0x584510)
- Arch detection: sub_95EB40 (38KB, CLI -> 3-column mapping)
- NVVM arch enum: sub_CD09E0 (14.5KB, NVVM_ARCH_* strings)
- EDG arch gates: sub_60E7C0 (~60 feature flags based on SM version)
- Backend subtarget: NVPTXSubtarget (feature offsets at +2498, +2584, +2843)
- Target triples: nvptx64-nvidia-cuda, nvsass-nvidia-directx, nvsass-nvidia-spirv

Per-SM Deep Dives:

Complete SM Table

| SM | __CUDA_ARCH | PTX Ver | Generation | Suffix | Status | Deep Dive |
|---|---|---|---|---|---|---|
| sm_75 | 750 | 5 | Turing | -- | Production | sm70-89 |
| sm_80 | 800 | 5 | Ampere | -- | Production | sm70-89 |
| sm_82 | 820 | 5 | Ampere | -- | Undocumented | sm70-89 |
| sm_86 | 860 | 5 | Ampere | -- | Production | sm70-89 |
| sm_87 | 870 | 5 | Ampere | -- | Production | sm70-89 |
| sm_88 | 880 | 5 | Ada | -- | Undocumented | sm70-89 |
| sm_89 | 890 | 5 | Ada | -- | Production | sm70-89 |
| sm_90 | 900 | 5 | Hopper | -- | Production | sm90 |
| sm_90a | 900 | 6 | Hopper | a | Production | sm90 |
| sm_100 | 1000 | 6 | Blackwell | -- | Production | sm100 |
| sm_100a | 1000 | 7 | Blackwell | a | Production | sm100 |
| sm_100f | 1000 | 7 | Blackwell | f | Production | sm100 |
| sm_101 | 1010 | 6 | Jetson Thor (pre-rename) | -- | Undocumented | sm100 |
| sm_101a | 1010 | 7 | Jetson Thor (pre-rename) | a | Undocumented | sm100 |
| sm_101f | 1010 | 7 | Jetson Thor (pre-rename) | f | Undocumented | sm100 |
| sm_102 | 1020 | 6 | Blackwell | -- | Undocumented | sm100 |
| sm_102a | 1020 | 7 | Blackwell | a | Undocumented | sm100 |
| sm_102f | 1020 | 7 | Blackwell | f | Undocumented | sm100 |
| sm_103 | 1030 | 6 | Blackwell | -- | Production | sm100 |
| sm_103a | 1030 | 7 | Blackwell | a | Production | sm100 |
| sm_103f | 1030 | 7 | Blackwell | f | Production | sm100 |
| sm_110 | 1100 | 6 | Jetson Thor | -- | Production | sm120 |
| sm_110a | 1100 | 7 | Jetson Thor | a | Production | sm120 |
| sm_110f | 1100 | 7 | Jetson Thor | f | Production | sm120 |
| sm_120 | 1200 | 6 | Blackwell (sm120) | -- | Production | sm120 |
| sm_120a | 1200 | 7 | Blackwell (sm120) | a | Production | sm120 |
| sm_120f | 1200 | 7 | Blackwell (sm120) | f | Production | sm120 |
| sm_121 | 1210 | 6 | Blackwell (sm120) | -- | Production | sm120 |
| sm_121a | 1210 | 7 | Blackwell (sm120) | a | Production | sm120 |
| sm_121f | 1210 | 7 | Blackwell (sm120) | f | Production | sm120 |

Legacy architectures also present in the table but not in the CLI mapping: sm_20, sm_21, sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_73.
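Target names decompose into a numeric version plus an optional feature suffix, which can be sketched as (Python for illustration; the function name is hypothetical):

```python
import re

def parse_sm(name):
    """Split an sm_* target name into (version, suffix), where suffix is
    '', 'a' (accelerated), or 'f' (forward-compatible)."""
    m = re.fullmatch(r"sm_(\d+)([af]?)", name)
    if not m:
        raise ValueError("not an sm_* target name: {}".format(name))
    return int(m.group(1)), m.group(2)
```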

Suffix Meanings

| Suffix | Meaning | PTX Version | Detail |
|---|---|---|---|
| (none) | Base feature set | 5 (legacy) or 6 (sm_100+) | All architectures; sm70-89 has no suffix-gated logic |
| a | Accelerated / advanced features | 6 (sm_90a) or 7 (sm_100a+) | sm_90a enables one EDG gate (sm90); sm_100a+ enables tcgen05 arch-conditional path (sm100) |
| f | Forward-compatible feature set | 7 | Implies a; never read by cicc logic (sm120); reserved for ptxas |

PTX Version Mapping

| PTX Version | SM Range | Notes |
|---|---|---|
| 5 | sm_20 through sm_90 (legacy/base) | All pre-Blackwell base variants |
| 6 | sm_90a, sm_100/101/102/103/110/120/121 (base) | sm_90a is the sole pre-Blackwell PTX 6 target (sm90) |
| 7 | sm_100a/f through sm_121a/f (extended features) | Required for tcgen05 arch-conditional intrinsics (sm100) |
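The mapping above can be condensed into one decision function. This is a minimal sketch of the selection logic implied by the tables, not a recovered routine; the function name and signature are ours.

```c
#include <assert.h>

/* Hypothetical reconstruction of PTX-version selection. `arch` is the
 * numeric SM (75, 90, 100, 121, ...) and `suffix` is 0, 'a', or 'f'. */
static int ptx_version(int arch, char suffix) {
    if (arch >= 100)
        return suffix ? 7 : 6;   /* Blackwell-era: base -> 6, a/f -> 7 */
    if (arch == 90 && suffix == 'a')
        return 6;                /* sm_90a: sole pre-Blackwell PTX 6 target */
    return 5;                    /* all other legacy/base variants */
}
```

Per the table, only the a/f sub-variants of sm_100+ reach PTX 7, which is what gates the tcgen05 arch-conditional intrinsics.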

Architecture Gating

Four subsystems cooperate to configure feature flags from the SM version. The master configurator sub_60E7C0 runs last and has the highest non-CLI priority. For the complete flag table per tier, see SM 70-89 Complete sub_60E7C0 Flag Table.

Feature Configuration Pipeline

CLI parser (sub_617BD0)              Sets byte_4CF8* override flags
    |
sub_60DFC0 (secondary)               Sets unk_4D041B8 at sm_80+ (__VA_OPT__)
    |
sub_60D650 (optimization level)      ~109 flags from -O level
    |
sub_60E7C0 (master SM configurator)  ~60 flags via SM threshold comparisons
    |--- sub_60E530 (tertiary)        Supplementary progressive unlocks
    |
sub_982C80 (NVPTX subtarget)         224-byte bitfield for LLVM backend

Override priority: CLI flag > SM version > Optimization level > C++ standard version > CUDA mode > Virtual arch flag. See CLI Flag Inventory for the complete CLI flag-to-pipeline routing and Optimization Levels for per-level flag differences.

EDG-Level Gates -- sub_60E7C0

Sets ~60 unk_4D04* feature flags based on SM version thresholds. Each flag is gated by a byte_4CF8* user-override check.

| Threshold | SM Boundary | Features Enabled | Detail |
|---|---|---|---|
| > 30399 | sm_75 (Turing) | Base CUDA features, dynamic parallelism | sm70-89 Turing |
| > 40000 | sm_80 (Ampere) | C++20 __VA_OPT__, L2 cache hints, extended atomics | sm70-89 Ampere |
| > 89999 | sm_90 (Hopper) | Cluster ops, TMA, setmaxnreg, WGMMA fence | sm90 Feature Flags |
| > 109999 | sm_100 (Blackwell) | tcgen05, match instruction, dword_4D041AC | sm100 Feature Flags |
| > 119999 | sm_120 | unk_4D047BC disabled, unk_4D0428C | sm120 Feature Flags |

Backend Subtarget Feature Offsets (NVPTXSubtarget)

| Offset | Purpose | Stride | Detail |
|---|---|---|---|
| +2498 | Type legality flags (per MVT) | 259 bytes | See Type Legalization |
| +2584 | Float legality flags (per MVT) | 259 bytes | See Type Legalization |
| +2843 | Integer type support flag | 1 byte | -- |
| +2870 | Branch distance flag | 1 byte | See Block Placement |
| +2871 | Jump table eligibility flag | 1 byte | See BranchFolding |

For the complete NVPTXSubtarget analysis, see NVPTX Target Infrastructure.

Intrinsic Verifier Architecture Gates -- sub_2C7B6A0

The NVVMIntrinsicVerifier (143KB) gates intrinsics by SM version. For the complete three-layer verification architecture, see NVVM IR Verifier.

| SM Gate | Intrinsics | Detail |
|---|---|---|
| sm_72 (Volta) | Convergent branch intrinsics, some atomic ops | sm70-89 Volta |
| sm_75 (Turing) | Conversion type intrinsics | sm70-89 Turing |
| sm_89 (Ada) | Specific intrinsics | sm70-89 Ada |
| sm_90 (Hopper) | Cluster dimensions, TMA, WGMMA | sm90 TMA, sm90 WGMMA |
| sm_100+ (Blackwell) | .offset.bindless intrinsics, tcgen05 | sm100 tcgen05, sm120 .offset.bindless |

Feature Gate Matrix

This matrix shows which major compiler features are available at each SM tier. Each cell links to the detailed discussion in the per-SM deep-dive page.

Tensor Core / MMA Instructions

| Feature | sm_70-75 | sm_80-89 | sm_90/90a | sm_100/103 | sm_110 | sm_120/121 |
|---|---|---|---|---|---|---|
| HMMA m16n16k16 (f16) | Yes | Yes | Yes | Yes | Yes | Yes |
| IMMA int8/int4, BMMA | sm_75+ | Yes | Yes | Yes | Yes | Yes |
| DMMA fp64, TF32, BF16 | -- | sm_80+ | Yes | Yes | Yes | Yes |
| WGMMA async (f16/bf16/tf32/f8) | -- | -- | Yes | Yes | Yes | -- |
| tcgen05.mma (MX formats) | -- | -- | -- | a/f only | a/f only | No |
| mma.sync.block_scale | -- | -- | -- | -- | -- | Future |

See Tensor / MMA Builtins for the per-builtin ID reference and Tensor / MMA Codegen for the code generation pipeline.

Memory and Synchronization

| Feature | sm_70-75 | sm_80-89 | sm_90/90a | sm_100/103 | sm_110 | sm_120/121 |
|---|---|---|---|---|---|---|
| Full atomic memory ordering | sm_70+ | Yes | Yes | Yes | Yes | Yes |
| 128-bit atomics | sm_70+ | Yes | Yes | Yes | Yes | Yes |
| L2 cache hint atomics | -- | sm_80+ | Yes | Yes | Yes | Yes |
| Cluster scope atomics | -- | -- | Yes | Yes | Yes | Yes |
| cp.async | -- | sm_80+ | Yes | Yes | Yes | Yes |
| TMA (tensor memory access) | -- | -- | Yes | Yes | Yes | Yes |
| TMA 2CTA mode, Im2Col_W | -- | -- | -- | sm_100+ | sm_100+ | sm_100+ |
| setmaxnreg | -- | -- | Yes | Yes | Yes | Yes |
| fence.sc.cluster | -- | -- | Yes | Yes | Yes | Yes |

See Atomics Builtins for atomic PTX generation detail and Barriers & Sync for barrier builtins.

Thread Block Clusters

| Feature | sm_70-89 | sm_90/90a | sm_100+ |
|---|---|---|---|
| __cluster_dims__ attribute | Diagnostic 3687 | Yes | Yes |
| __launch_bounds__ 3rd param | Diagnostic 3704 | Yes | Yes |
| __block_size__ 5th arg | Diagnostic 3790 | Yes | Yes |
| Cluster special registers (15) | -- | Yes | Yes |
| barrier.cluster.arrive/wait | -- | Yes | Yes |
| Cluster query builtins (9) | -- | Yes | Yes |
| Distributed shared memory | -- | Yes | Yes |
| .blocksareclusters directive | -- | Yes | Yes |

Numeric Formats

| Format | First Available | Gate Location | Detail |
|---|---|---|---|
| f16, f32, f64 | All | -- | Standard types |
| bf16 (bfloat16) | sm_80+ | Ampere tensor core | Tensor core and cvt |
| tf32 (TensorFloat-32) | sm_80+ | Ampere tensor core | Tensor core only |
| fp8 e4m3, e5m2 | sm_90+ | WGMMA | cvt_packfloat cases 2-3 |
| fp6 e2m3, e3m2 | sm_100+ | cvt_packfloat | Arch-conditional only |
| fp4 e2m1 | sm_100+ | cvt_packfloat | Arch-conditional only |
| ue8m0 (scale factor) | sm_100+ | cvt_packfloat | Both arch- and family-conditional |
| MX formats (mxf4, mxf8f6f4, mxf4nvf4) | sm_100+ | tcgen05.mma | tcgen05 a/f sub-variants only |

Texture and Surface

| Feature | sm_70-89 | sm_90 | sm_100/103 | sm_120/121 |
|---|---|---|---|---|
| Standard texture intrinsics | Yes | Yes | Yes | Yes |
| .offset.bindless intrinsics (68 variants) | -- | -- | -- | sm_120+ |
| f16 texture element types | Limited (builtin 3811 only) | Limited | Limited | Full support |

See Surface & Texture Builtins for the tex_surf_handler dispatch algorithm.

EDG Frontend Feature Flags

| Feature | Threshold | Flag | Detail |
|---|---|---|---|
| C++17 feature gates (EDG) | sm_70+ | unk_4D041DC, unk_4D04858, unk_4D041EC | sm70-89 Flag Table |
| C++20 __VA_OPT__ | sm_80+ | unk_4D041B8 | sm70-89 sub_60DFC0 |
| C++23 extended float suffixes | sm_70+ | unk_4D0428C | sm70-89 Tertiary Cascade |
| C++20 feature gates | sm_90+ | unk_4D043D0, unk_4D041B0, unk_4D04814 | sm90 Feature Flags |
| Blackwell extended features | sm_100+ | unk_4D04184, dword_4D041AC | sm100 Feature Flags |

See EDG 6.6 Frontend for the 737-define configuration system.

tcgen05 Sub-Variant Access Table

The tcgen05 instruction family uses a two-tier gating system unique to Blackwell. Base variants (sm_100, sm_103, sm_110) are excluded; only a and f sub-variants pass the bitmask check.

| SmVersion | Target | tcgen05 | Detail |
|---|---|---|---|
| 1001 | sm_100a | Allowed | sm100 Arch-Conditional Gate |
| 1002 | sm_100f | Allowed | sm100 Arch-Conditional Gate |
| 1031 | sm_103a | Allowed | sm100 Arch-Conditional Gate |
| 1032 | sm_103f | Allowed | sm100 Arch-Conditional Gate |
| 1101 | sm_110a | Allowed | sm120: Jetson Thor |
| 1102 | sm_110f | Allowed | sm120: Jetson Thor |
| 1000, 1030, 1100 | base variants | Blocked | Bitmask 0xC0000C03 rejects; see sm100 |
| 1200-1212 | all sm_120/121 | Blocked | v-1101 > 1; see sm120 No tcgen05 |
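The accept/reject behavior in the table can be sketched as follows. This is an illustrative reconstruction only: the real verifier uses the 0xC0000C03 bitmask, which is not reproduced here, and the assumption that SmVersion encodes the sub-variant in its low digit (1000 = sm_100 base, 1001 = sm_100a, 1002 = sm_100f) follows the table rows above.

```c
#include <assert.h>

/* Sketch of the two-tier tcgen05 gate: only the a/f sub-variants of the
 * sm_100/103/110 families pass; every base variant and every sm_120/121
 * variant is rejected. Function name is ours, not a recovered symbol. */
static int tcgen05_allowed(int sm_version) {
    int family = sm_version / 10;   /* 100, 103, 110, 120, 121 */
    int sub    = sm_version % 10;   /* 0 = base, 1 = 'a', 2 = 'f' */
    if (family != 100 && family != 103 && family != 110)
        return 0;                   /* all sm_120/121 rejected outright */
    return sub != 0;                /* base variants blocked; a/f pass */
}
```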

Generation-Specific Features

Turing (sm_75)

sm_75 is the default architecture for cicc v13.0, hardcoded as "compute_75" in sub_900130 and sub_125FB30.

Full detail: SM 70-89 (Volta through Ada)

Ampere (sm_80-sm_89)

  • L2::cache_hint on atomic operations (sub_21E6420) -- see Atomics Builtins
  • Extended tensor core shapes (tf32, bf16) -- see Tensor / MMA Builtins
  • Async copy (cp.async) -- see SM 70-89: Ampere
  • C++20 __VA_OPT__ support -- the sole differentiator between sm_75 and sm_80+ in sub_60E7C0/sub_60DFC0

Full detail: SM 70-89 (Volta through Ada)

Hopper (sm_90/90a)

  • Cluster operations: barrier.cluster.arrive/wait, fence.sc.cluster -- see Cluster Barriers
  • Cluster registers: %cluster_ctarank, %clusterid.x/y/z, %is_explicit_cluster -- see Cluster Special Registers
  • Kernel attributes: .blocksareclusters, .maxclusterrank, .reqnctapercluster, .cluster_dim -- see PTX Directives
  • setmaxnreg: Dynamic register allocation limit (sub_21EA5F0) -- see setmaxnreg
  • TMA: Tensor Memory Access with Im2Col, dimension validation, 2CTA mode -- see TMA
  • WGMMA: Warpgroup MMA async (f16, bf16, tf32, f8) -- see WGMMA
  • Distributed shared memory: .shared::cluster qualifier for cross-CTA access -- see DSMEM
  • Mbarrier extensions: DMA fence/arrive/wait for TMA coordination -- see Mbarrier

Full detail: SM 90 -- Hopper

Blackwell Datacenter (sm_100-sm_103)

  • tcgen05: Next-gen tensor core instruction set (scaleD, transA, negA, negB at sub_21E8CD0) -- see tcgen05
  • Arch-conditional vs. family-conditional gating: Two-tier feature system for tcgen05 sub-instructions -- see Gating
  • match instruction: Architecture-gated ("match instruction not supported on this architecture!") -- see sm100
  • Extended MMA shapes: m16n8k256 with MX format support
  • .offset.bindless intrinsics -- gated at sm_120+, NOT sm_100 (see sm120 .offset.bindless)
  • cvt_packfloat extended types: FP4, FP6, MX formats -- see cvt_packfloat

Full detail: SM 100 -- Blackwell Datacenter

Jetson Thor (sm_110)

sm_110 is architecturally a datacenter Blackwell derivative (originally sm_101 before rename). It retains full tcgen05/TMEM hardware on a/f sub-variants. The sm_110 section is documented on the sm_120 page because the two are often compared.

Full detail: SM 120 -- Jetson Thor section

Blackwell Consumer (sm_120, sm_121)

  • No tcgen05: The entire tcgen05 ISA is rejected by cicc for all sm_120/121 variants -- see No tcgen05
  • .offset.bindless texture intrinsics (68 variants) -- see .offset.bindless
  • 16-bit texture element types -- see f16 Texture
  • mma.sync.block_scale: Present in upstream LLVM 22 but NOT emitted by cicc v13.0 -- see block_scale
  • Tensor core falls back to HMMA/IMMA inherited from sm_70-sm_90 path

Full detail: SM 120 -- Blackwell Consumer

NVVM Container Architecture Enum -- sub_CD09E0

The NVVM container format uses an architecture enumeration. See NVVM Container for the complete tag inventory.

| Enum String | Implied SM | Detail |
|---|---|---|
| NVVM_ARCH_BLACKWELL_10_0 | sm_100 | sm100 |
| NVVM_ARCH_BLACKWELL_10_1 | sm_101 | Undocumented |
| NVVM_ARCH_BLACKWELL_10_3 | sm_103 | sm100 |
| NVVM_ARCH_BLACKWELL_11_0 | sm_110 | sm120: Jetson Thor |
| NVVM_ARCH_BLACKWELL_12_0 | sm_120 | sm120 |
| NVVM_ARCH_BLACKWELL_12_1 | sm_121 | sm120 |
| NVVM_ARCH_HOPPER_9_0 | sm_90 | sm90 |
| NVVM_ARCH_ADA_8_9 | sm_89 | sm70-89 |
| NVVM_ARCH_AMPERE_8_0 through 8_8 | sm_80-sm_88 | sm70-89 |
| NVVM_ARCH_HW_SM_5_0 through 10_4 | sm_50-sm_104 | Hardware SM enum |

Notable: NVVM_ARCH_HW_SM_10_4 (sm_104) and NVVM_ARCH_BLACKWELL_11_0 are not publicly documented. NVIDIA's internal naming uses "BLACKWELL" for all sm_100-sm_121 variants, even though sm_110 is marketed as Jetson Thor and sm_120/121 are a distinct consumer microarchitecture (RTX 50xx). See SM 120: Architecture Identity for the "SM 10.4" internal designation.

Target Triples

| Triple | Purpose | Detail |
|---|---|---|
| nvptx64-nvidia-cuda | Standard 64-bit CUDA compilation | Default; see NVPTX Target Infrastructure |
| nvptx-nvidia-cuda | 32-bit CUDA compilation | Legacy |
| nvptx64-nvidia-nvcl | OpenCL target | -- |
| nvsass-nvidia-cuda | SASS backend (native assembly) | -- |
| nvsass-nvidia-directx | DirectX SASS backend | Discovered in sub_2C80C90; see NVVM IR Verifier |
| nvsass-nvidia-spirv | SPIR-V SASS backend | Discovered in sub_2C80C90 |

The nvsass-nvidia-directx and nvsass-nvidia-spirv triples (discovered in sub_2C80C90) reveal that NVIDIA's SASS-level backend supports DirectX and SPIR-V targets alongside traditional CUDA and OpenCL.

Data Layout Strings

| Mode | Layout | Notes |
|---|---|---|
| 64-bit + shared | e-p:64:64:64-p3:32:32:32-i1:8:8-...-n16:32:64 | p3:32:32:32 = 32-bit shared mem pointers |
| 64-bit | e-p:64:64:64-i1:8:8-...-n16:32:64 | No shared memory specialization |
| 32-bit | e-p:32:32:32-i1:8:8-...-n16:32:64 | 32-bit mode |

Address space 3 (shared memory) uses 32-bit pointers even in 64-bit mode, controlled by nvptx-short-ptr and nvptx-32-bit-smem flags. See Address Spaces for the complete address space reference.

SM Version Encoding

Two parallel version tracking systems coexist in the binary:

  • qword_4F077A8 -- Encodes SM_MAJOR * 10000 + SM_MINOR * 100. Used in approximately 309 decompiled files, primarily in the NVVM frontend and optimizer. Boundary thresholds use the XX99 pattern (e.g., 69999 for pre-Volta, 89999 for pre-Hopper). See SM 70-89: SM Version Encoding for full detail.

  • unk_4D045E8 -- Stores the raw SM number as a decimal (e.g., 75 for sm_75, 89 for sm_89). Used in approximately 12 decompiled files, primarily in the builtin checker and atomic lowering logic. See SM 70-89: unk_4D045E8 Frontend Gates for the complete gate table.
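The two encodings and the XX99 threshold idiom can be shown side by side. The globals are the recovered names; the helper functions are ours, added for illustration.

```c
#include <assert.h>

/* qword_4F077A8 form: SM_MAJOR * 10000 + SM_MINOR * 100, so sm_75 -> 70500,
 * sm_90 -> 90000. (unk_4D045E8 simply stores the raw number, e.g. 75.) */
static long encode_qword_form(int major, int minor) {
    return (long)major * 10000 + (long)minor * 100;
}

/* Thresholds use the XX99 pattern so that every minor revision of a
 * generation clears the previous generation's boundary. */
static int is_volta_or_newer(long v)  { return v > 69999; }
static int is_hopper_or_newer(long v) { return v > 89999; }
```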

Cross-References

Volta through Ada Lovelace (sm_70 – sm_89)

The sm_70 through sm_89 range spans four GPU generations — Volta, Turing, Ampere, and Ada Lovelace — and represents the most mature feature tier in cicc v13.0. Turing (sm_75) serves as the compiler's default architecture. Volta (sm_70/72) is no longer directly targetable: no compute_70 or compute_72 entry exists in the CLI parser, though the sm_70 feature boundary is still checked at 23 locations throughout the binary.

Supported Compute Capabilities

The architecture registration table at sub_95EB40 maps CLI strings to internal flags. Only the following are accepted for this generation range:

| Compute Capability | Internal Target | __CUDA_ARCH | PTX Version | Generation |
|---|---|---|---|---|
| compute_75 | sm_75 | 750 | 5 | Turing |
| compute_80 | sm_80 | 800 | 5 | Ampere |
| compute_86 | sm_86 | 860 | 5 | Ampere |
| compute_87 | sm_87 | 870 | 5 | Ampere (Jetson Orin) |
| compute_88 | sm_88 | 880 | 5 | Ada Lovelace |
| compute_89 | sm_89 | 890 | 5 | Ada Lovelace |

There is no compute_70, compute_72, compute_73, or compute_82. The sm_73, sm_82, and sm_88 targets exist only as internal processor table entries — they have no publicly documented differentiation and no unique feature gates in the compiler.

SM Version Encoding

Two parallel version tracking systems coexist in the binary:

  • qword_4F077A8 — Encodes SM_MAJOR * 10000 + SM_MINOR * 100. Used in approximately 309 decompiled files, primarily in the NVVM frontend and optimizer. Boundary thresholds use the XX99 pattern (e.g., 69999 for pre-Volta, 79999 for pre-Ampere, 89999 for pre-Hopper).

  • unk_4D045E8 — Stores the raw SM number as a decimal (e.g., 75 for sm_75, 89 for sm_89). Used in approximately 12 decompiled files, primarily in the builtin checker and atomic lowering logic.

Feature Configuration Call Order

The compiler configures feature flags through a strict call sequence — the CLI parser followed by four configurator functions. Each subsequent step can override or augment the previous one's settings:

  1. CLI parser — Sets byte_4CF8* override flags from user-specified options. These prevent any subsequent auto-configuration from touching the guarded flag.
  2. sub_60DFC0 — Basic initialization. Sets unk_4D041B8 for sm_80+ (C++20 __VA_OPT__ support).
  3. sub_60D650(opt_level) — Optimization-level-based flag configuration. Sets approximately 109 flags based on the -O level. Many of the same unk_4D04* flags set by SM gates are also set here under C++17/C++20 language-version conditions.
  4. sub_60E7C0 — Master SM architecture feature configurator. Reads qword_4F077A8 and sets approximately 60 backend flags through threshold comparisons. Also calls sub_60E530 (tertiary cascade) for supplementary flags.
  5. sub_982C80 — NVPTX subtarget feature table initialization (224-byte bitfield for the LLVM backend). This is a separate path from the EDG flags above.

Override priority: CLI flag > SM version > Optimization level > C++ standard version > CUDA mode > Virtual arch flag.

Feature Gates by Generation

Volta (sm_70+) — Threshold qword_4F077A8 > 69999

Volta introduced the first tensor core generation and independent thread scheduling. Although not directly targetable in this compiler version, the sm_70 boundary enables:

  • HMMA tensor core intrinsics — Builtin IDs 678–707 registered in sub_90AEE0. Three shape variants (m16n16k16, m32n8k16, m8n32k16) with load, store, and MMA operations across f16/f32 accumulator combinations.

  • Convergent branch intrinsicllvm.nvvm.branch.if.all.convergent (builtin 3755/8282) requires sm_70+. Error: "not supported on pre-Volta Architectures" (checked in sub_1C36530 and sub_2C7B6A0).

  • Proper atomic memory ordering — At sm_70+, atomics use acquire/release/relaxed semantics instead of falling back to volatile qualification. The gate is unk_4D045E8 > 69.

  • 128-bit atomic operations — Enabled at sm_70+. Below this threshold, diagnostic 3758 is emitted: "16-byte atomics only supported on sm_70+".

  • Optimizer feature flagsunk_4D041DC, unk_4D04858, unk_4D041EC are set by sub_60E7C0. The tertiary cascade sub_60E530 additionally sets unk_4D0428C (extended float suffix support for C++23 std::float*_t / std::bfloat16_t). Multiple SelectionDAG patterns in sub_706250 activate for sm_70+ codegen.

  • Variant-flag-gated features — When dword_4F077BC (SM variant flag, the a/f suffix) is set and sm_70+ is active, unk_4D043C4 is enabled. When compiling for a virtual architecture with effective SM > 69999, unk_4D04740 is set for multi-arch optimization.

  • WMMA memory space optimization — The wmma-memory-space-opt pass (registered at ctor_267, ctor_531) optimizes memory access patterns for tensor core operations.

Turing (sm_75) — Default Architecture

sm_75 is the baseline for cicc v13.0. The default is hardcoded in sub_900130 and sub_125FB30 via strcpy("compute_75"), and in sub_95EB40 as "-arch=compute_75".

No explicit sm_75-specific feature gates exist beyond the sm_70 tier. All Volta-era features are available. The key behavioral distinction is that sm_75 passes all pre-Volta gates cleanly — no diagnostic 3703 (sub_5C68F0), no volatile atomic fallback, no 128-bit atomic restrictions.

Ampere (sm_80+) — Threshold qword_4F077A8 > 79999

  • C++20 __VA_OPT__ supportunk_4D041B8 set at sub_60DFC0 line 132–133. This is the only flag set exclusively by sub_60DFC0 at the sm_80 threshold. It enables __VA_OPT__ recognition in the EDG macro expander (sub_A03 line 1010), variadic trailing argument elision (line 1584), and diagnostic 2939 for misuse.

  • Additional convergent branchllvm.nvvm.branch.if.convergent (builtin 3754/8283) requires sm_80+. Error: "not supported on pre-Ampere Architectures". Note the distinction: branch.if.all.convergent requires only sm_70+, while branch.if.convergent requires sm_80+.

  • L2 cache hint atomics — The L2::cache_hint suffix on atomic operations, emitted from sub_21E6DD0 when bit 0x400 is set in instruction encoding flags. Supported operations: exch, add, and, or, xor, max, min, cas, and floating-point add. These are PTX 7.3+ features. Emission logic lives in sub_21E6420.

  • cp.async.bulk patterns — String matching for cp.async.bulk.tensor.g2s. and cp.async.bulk. in inline assembly validation at sub_A8E250.

Important correction: The master SM feature configurator sub_60E7C0 does NOT set any new flags at the sm_80 boundary (> 79999). The Ampere-specific unk_4D041B8 is set by the secondary configurator sub_60DFC0. The next threshold in sub_60E7C0 after sm_70+ (> 69999) is sm_90+ (> 89999). This means sm_80 through sm_89 share the same sub_60E7C0 flag profile as sm_75.

Ada Lovelace and Ampere Variants (sm_86 – sm_89)

All of sm_86, sm_87, sm_88, and sm_89 share identical feature gates within cicc. They occupy unk_4D045E8 values 86–89 and qword_4F077A8 range 86000–89999, all below the 89999 Hopper boundary.

The primary gate at this tier is unk_4D045E8 <= 89, which delineates pre-Hopper from Hopper+:

| Location | Feature | Behavior at sm_89 and below |
|---|---|---|
| sub_5D1A60 | __block_size__ attribute | Diagnostic 3790; only 4 args parsed (5th cluster arg is sm_90+) |
| sub_5D1FE0 | __cluster_dims__ attribute | Diagnostic 3687 emitted (cluster dimensions are Hopper-only) |
| sub_5D2430 | __launch_bounds__ 3rd param | Diagnostic 3704 emitted (cluster launch bounds) |
| sub_6BBC40 | Atomic scope "cluster" | Falls through to "gpu" scope; diagnostic 3763/3759 |
| sub_6BBC40 | 16-byte extended atomics | Diagnostic 3764 for certain scope+type combinations |
| sub_9502D0 / sub_12AE930 | Atomic scope emission | "gpu" used instead of "cluster" |
| sub_214DA90 | Cluster PTX directives | Skipped entirely (arch_id <= 89) |

No code path differentiates sm_89 from sm_86/87/88. Hardware differences between these sub-architectures (e.g., Ada Lovelace RTX 4090 at sm_89 vs. Jetson Orin at sm_87) are resolved at the ptxas assembler level, not in cicc.

Atomic Lowering Detail

The atomic builtin lowering (sub_12AE930 / sub_9502D0) follows two paths split at the sm_70 boundary:

Pre-sm_70 path (unk_4D045E8 <= 69): Atomics are emitted with a volatile qualifier instead of memory ordering. Scope (cta/gpu/sys) is parsed but ordering is forced to volatile. 128-bit atomics emit diagnostic 3758.

sm_70+ path (unk_4D045E8 > 69): Full memory ordering support — relaxed, acquire, release, acq_rel. Scope resolution: cta (scope 0–1), gpu (scope 3), sys (scope 4). Cluster scope (scope 2) is only available at sm_90+; on sm_70–89, scope 2 falls through to "gpu".

Operations: ld, st, atom.add, atom.and, atom.or, atom.xor, atom.max, atom.min, atom.exch, atom.cas. Type suffixes via lookup table: b (bitwise), u (unsigned), s (signed), f (float).
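The two-path split can be sketched as a pair of small helpers. This is a simplified reconstruction, not the decompiled code: raw_sm plays the role of unk_4D045E8, and the function names and returned strings are ours.

```c
#include <assert.h>
#include <string.h>

/* Pre-sm_70: requested ordering is collapsed to a volatile qualifier.
 * sm_70+: relaxed / acquire / release / acq_rel pass through unchanged. */
static const char *atomic_order_qualifier(int raw_sm, const char *requested) {
    return (raw_sm <= 69) ? "volatile" : requested;
}

/* Scope resolution per the text: cta (0-1), gpu (3), sys (4); cluster (2)
 * only exists at sm_90+, otherwise it falls through to gpu. */
static const char *atomic_scope_name(int raw_sm, int scope) {
    switch (scope) {
    case 0:
    case 1:  return "cta";
    case 2:  return raw_sm > 89 ? "cluster" : "gpu";
    case 3:  return "gpu";
    default: return "sys";
    }
}
```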

Hopper-Gated Intrinsics Rejected on sm_70–89

Multiple intrinsics emit "this intrinsic is only supported for Hopper+" when the SM version field is non-zero and <= 899:

| Builtin ID | Description |
|---|---|
| 0x10B3 (4275) | Hopper+ intrinsic requiring i1 or i32 return |
| 0xFD5 (4053) | Hopper+ intrinsic |
| 0xEB7 (3767) | Memory ordering/fence intrinsic with operation modes |
| 0xEB9–0xEBA (3769–3770) | Pointer-size-dependent intrinsics (>= 64-bit) |

Complete sub_60E7C0 Flag Table

The master feature configurator sub_60E7C0 (address 0x60E7C0, 12,466 bytes, 56 qword_4F077A8 comparisons) is the primary SM-architecture-to-feature-flag mapper. Every flag assignment follows a guarded pattern: if the corresponding byte_4CF8* override byte is nonzero (set by a CLI flag), the auto-configuration is skipped and the user's explicit value is preserved.
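The guarded pattern repeats once per flag. The sketch below shows it for one real guard/flag pair (byte_4CF8123 / unk_4D047C8, the sm_35+ dynamic-parallelism gate); declaring the globals locally is our simplification, since in the binary they live in writable data sections.

```c
#include <assert.h>

static char byte_4CF8123;   /* nonzero once a CLI flag set unk_4D047C8 */
static char unk_4D047C8;    /* the feature flag being auto-configured  */
static long qword_4F077A8;  /* encoded SM version (sm_75 = 70500)      */

/* One instance of the ~60 guarded assignments in sub_60E7C0. */
static void configure_unk_4D047C8(void) {
    if (byte_4CF8123)
        return;                   /* user override wins; leave flag alone */
    if (qword_4F077A8 > 30399)    /* sm_35 and newer */
        unk_4D047C8 = 1;
}
```

This is what produces the documented override priority: an explicit CLI value sets the guard byte first, so the SM-driven auto-configuration never touches the flag.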

Unconditional Assignments

These flags are set regardless of SM version with no user override check:

| Flag | Value | Notes |
|---|---|---|
| unk_4D047C0 | 1 | Always enabled |
| unk_4D047B4 | 1 | Always enabled |
| unk_4F07584 | 0 | Always cleared |
| unk_4D0423C | 1 | Always enabled |
| unk_4D04208 | 0 | Always cleared |
| unk_4D04218 | 1 | Always enabled |
| unk_4D04214 | 1 | Always enabled |
| unk_4F06970 | 0 | Always cleared |
| unk_4F06964 | 0 | Always cleared |
| unk_4F06904 | 0 | Always cleared |

SM-Dependent Unconditional Flags

These depend on SM version but have no user override check:

| Flag | Condition | Value | Notes |
|---|---|---|---|
| unk_4D047BC | SM <= 119999 | 1 | Disabled only for sm_120+ |
| unk_4D04758 | SM <= 30300 | 1 | sm_32 and below only |
| unk_4D04764 | SM <= 30399 | 1 | sm_32 and below only |
| unk_4D044B8 | SM <= 40299 | 1 | Pre-Maxwell only |

Guarded Flags (byte_4CF8* Override Bypass)

Each flag is set only when its guard byte is zero (user has not overridden via CLI):

| Guard | Flag | Default Value (guard=0) |
|---|---|---|
| byte_4CF807B | dword_4D048B8 | = 1 |
| byte_4CF810C | dword_4D04824 | = 1 |
| byte_4CF80F0 | unk_4D04388 | = 1 |
| byte_4CF8108 | unk_4D04338 | = (dword_4F077BC && !dword_4F077B4 && SM <= 30399) ? 1 : 0 |
| byte_4CF8123 | unk_4D047C8 | = 1 (only if SM > 30399, sm_35+) |
| byte_4CF8125 | dword_4D047B0 | = 1 (only if SM > 30399, sm_35+) |
| byte_4CF8139 | unk_4D04314 | = (SM <= 40299), pre-Maxwell |
| byte_4CF814D | unk_4D047C4 | = 0 |
| byte_4CF8119 | unk_4D047D0 | = (SM <= 40000) |
| byte_4CF8119 | unk_4D047CC | = (SM <= 40099) |
| byte_4CF810F | unk_4D047EC | = 1 |
| byte_4CF8107 | unk_4D04340 | = 0 |
| byte_4CF8116 | unk_4D047E0 | = 1 |
| byte_4CF815F | unk_4F0771C | = 1 |
| byte_4CF813E | unk_4D044B0 | = (SM > 40299), Maxwell+ |
| byte_4CF8149 | unk_4D04470 | Complex Maxwell+ gate |
| byte_4CF8172 | dword_4D041AC | = 0 (when SM <= 109999) |
| byte_4CF8159 | dword_4D048B0 | = (dword_4D048B8 && dword_4D048B4 && SM > 40799) |
| byte_4CF811C | unk_4D04790 | = 0 (when virtual arch flag set) |
| byte_4CF813C | dword_4D047AC | sm_35+ feature gate |
| byte_4CF8156 | unk_4D04408 | CUDA C++ feature gate |
| byte_4CF815D | unk_4D048A0 | = 1 (when SM > 40699) |

Total: 21 override bytes controlling approximately 25 feature flags.

Feature Escalation by SM Version

The cumulative flag-setting cascade. Each tier inherits all flags from lower tiers. Only the tiers relevant to sm_70–89 plus their immediate predecessors and successors are shown.

SM > 59999 (sm_60+, Pascal+):

| Flag | Identified Meaning |
|---|---|
| unk_4D043CC | EDG C++17 feature gate (also set by C++17 language block) |
| unk_4D04404 | EDG extended feature gate (also set by C++17 language block) |
| unk_4D043D8 | EDG C++17 feature gate (also set by C++17 language block) |
| unk_4D043D4 | EDG feature gate (also set via virtual arch > 30599) |
| dword_4F07760 | PTX generation mode flag |
| unk_4D04870 | EDG C++20 feature gate (also set by C++20 language block) |

SM > 69999 (sm_70+, Volta+):

| Flag | Identified Meaning |
|---|---|
| unk_4D041DC | EDG C++17 feature gate (also set by C++17 language block) |
| unk_4D04858 | EDG C++17 feature gate (also set by C++17 language block) |
| unk_4D041EC | EDG C++17/Pascal virtual arch feature gate |

SM > 89999 (sm_90+, Hopper+) — NOT active for sm_70–89:

| Flag | Identified Meaning |
|---|---|
| unk_4D043D0 | EDG C++20 feature gate (also set by C++20 language block) |
| unk_4D041B0 | EDG C++20 feature gate (also set by C++20 language block) |
| unk_4D04814 | EDG C++20 feature gate (also set by C++20 language block) |
| unk_4D0486C | (with additional C++ version check) |

sub_60E530 Tertiary Cascade

This supplementary function provides additional progressive unlocks. For the sm_70–89 range:

| Threshold | Hex | Flags Set |
|---|---|---|
| > 40599 | 0x9E97 | unk_4F07764 |
| > 40699 | 0x9EFB | unk_4D043F0, unk_4D043F4 |
| > 40899 | 0x9FC3 | unk_4D04220, unk_4D044D0 |
| > 59999 | 0xEA5F | unk_4D043CC (duplicates sub_60E7C0) |
| > 69999 | 0x1116F | unk_4D0428C (extended float suffixes: C++23 std::float*_t / std::bfloat16_t) |
| > 89999 | 0x15F8F | dword_4F07760 (duplicates sub_60E7C0) |
| > 99999 | 0x1869F | dword_4D043F8, dword_4D041E8 |

Note: unk_4D0428C is set at > 69999 (sm_70+) by the cascade but at > 119999 (sm_120+) by sub_60E7C0. The cascade runs as part of sub_60E7C0, so the sm_70+ activation wins for all practical SM versions. This flag gates C++23 extended float suffixes (std::float16_t, std::float32_t, std::float64_t, std::bfloat16_t) in the EDG numeric parser at sub_A02 line 1612.

sub_60DFC0 SM-Gated Flags

The secondary configurator adds one flag at the sm_80 boundary:

| Threshold | Flag | Identified Meaning |
|---|---|---|
| > 79999 (sm_80+) | unk_4D041B8 | C++20 __VA_OPT__ support in EDG macro expander. Enables __VA_OPT__ recognition, variadic trailing argument elision, and diagnostic 2939. |

Virtual Architecture Downgrade Path

When compiling for a virtual architecture (dword_4F077B4 = 1), sub_60E7C0 uses unk_4F077A0 (the effective/real SM) for a secondary tier of feature decisions:

| Effective SM > | Flags Set |
|---|---|
| 29999 | unk_4D043E4 |
| 30099 | unk_4D044D0 |
| 30199 | unk_4D043F0 |
| 30299 | unk_4D04220 |
| 30599 | unk_4D043D4 |
| 59999 | unk_4D041EC, unk_4D043D8, unk_4D04404 |
| 69999 | unk_4D04740 |
| 79999 | unk_4D043D0 |
| 89999 | unk_4D043D0 (redundant — already set at > 79999) |
| 129999 | unk_4D04184 |

Note: In the virtual arch path, unk_4D043D0 is set at > 79999 (sm_80+), while in the primary path it requires > 89999 (sm_90+). Virtual arch compilation is thus more permissive here: it enables features the real target supports even when the virtual arch tier alone would gate them.

unk_4D045E8 Frontend Gates

These gates use the raw SM number and control frontend semantic checks rather than backend flags:

| Gate | Locations | Effect |
|---|---|---|
| <= 69 | sub_12AE930 ln 241, sub_9502D0 ln 294 | Atomic volatile fallback |
| <= 69 | sub_6BBC40 ln 763 | 128-bit atomic error 3758 |
| <= 69 | sub_5C68F0 | Diagnostic 3703 |
| <= 51 | sub_691790 ln 126 | Surface builtin warning |
| <= 59 | sub_6BBC40 ln 639 | Atomic scope restriction |
| 60–69 | sub_6BBC40 ln 814 | Diagnostic 3762 |
| <= 79 | sub_5C6950 ln 15 | Diagnostic 3660 |
| <= 89 | sub_5D1A60 ln 35 | __block_size__ 5th arg blocked |
| <= 89 | sub_5D1FE0 ln 19 | __cluster_dims__ diagnostic 3687 |
| <= 89 | sub_5D2430 ln 33 | __launch_bounds__ 3rd param diagnostic 3704 |
| <= 89 | sub_6BBC40 ln 684 | Atomic scope diagnostic 3763/3759 |
| <= 89 | sub_6BBC40 ln 805, 827 | 16-byte atomic diagnostic 3764 |
| <= 89 | sub_9502D0 ln 424, sub_12AE930 ln 255 | Cluster scope falls through to "gpu" |
| <= 89 | sub_214DA90 ln 66 | Cluster PTX directives skipped |

Cumulative Flag Profile per SM Version

This table shows the net flag state for each SM version in the range, combining all three configurators (sub_60E7C0 + sub_60E530 + sub_60DFC0). Only flags that differ across the sm_70–89 range are shown.

| Flag | sm_75 | sm_80 | sm_86–89 | Set By | Identified Role |
|---|---|---|---|---|---|
| unk_4D041DC | 1 | 1 | 1 | sub_60E7C0 > 69999 | EDG C++17 feature gate |
| unk_4D04858 | 1 | 1 | 1 | sub_60E7C0 > 69999 | EDG C++17 feature gate |
| unk_4D041EC | 1 | 1 | 1 | sub_60E7C0 > 69999 | EDG C++17 / virtual arch feature gate |
| unk_4D0428C | 1 | 1 | 1 | sub_60E530 > 69999 | Extended float suffixes (C++23) |
| unk_4D041B8 | 0 | 1 | 1 | sub_60DFC0 > 79999 | C++20 __VA_OPT__ support |
| unk_4D043D0 | 0 | 0 | 0 | sub_60E7C0 > 89999 | (sm_90+ only) |
| unk_4D041B0 | 0 | 0 | 0 | sub_60E7C0 > 89999 | (sm_90+ only) |
| unk_4D04814 | 0 | 0 | 0 | sub_60E7C0 > 89999 | (sm_90+ only) |
| unk_4D0486C | 0 | 0 | 0 | sub_60E7C0 > 89999 | (sm_90+ only) |

The sole differentiator between sm_75 and sm_80+ within sub_60E7C0/sub_60DFC0 is unk_4D041B8. All flags set at > 69999 are shared by all sm_70–89 targets. All flags set at > 89999 are absent from all sm_70–89 targets. There is no per-flag difference between sm_86, sm_87, sm_88, and sm_89.

Identified Flag Semantics

Where flag consumers have been positively identified in the decompiled binary:

| Flag | Set At | Consumer | Meaning |
|---|---|---|---|
| unk_4D041B8 | sm_80+ (sub_60DFC0) | EDG macro expander (sub_A03 ln 1010) | C++20 __VA_OPT__ support: recognition, variadic trailing argument elision, diagnostic 2939 |
| unk_4D0428C | sm_70+ (sub_60E530), sm_120+ (sub_60E7C0) | EDG numeric parser (sub_A02 ln 1612) | Extended float suffixes: C++23 std::float16_t, std::float32_t, std::float64_t, std::bfloat16_t |
| dword_4F07760 | sm_60+ (sub_60E7C0, sub_60E530) | PTX generation path | PTX emission mode flag |
| unk_4D047C8 | sm_35+ (sub_60E7C0) | Backend | Dynamic parallelism optimization |
| dword_4D047B0 | sm_35+ (sub_60E7C0) | Backend | Dynamic parallelism support |
| unk_4D04780 | always | EDG macro expander | GNU ##__VA_ARGS__ comma-deletion extension |

The remaining approximately 50 flags feed into the EDG frontend and NVVM IR generation pipeline. Based on the pattern that sub_60D650 (optimization level) and sub_60E7C0 (SM version) set the same flags with overlapping conditions, most are language feature gates (C++17/20/23 features that are also SM-gated) or optimization pass enables that depend on target capability.

Key Binary Locations

| Function | Address | Role |
|---|---|---|
| sub_60E7C0 | 0x60E7C0 | Master SM feature flag initialization (12,466 bytes, 56 comparisons) |
| sub_60DFC0 | 0x60DFC0 | Secondary feature flag initialization (unk_4D041B8 at sm_80+) |
| sub_60E530 | 0x60E530 | Tertiary feature cascade (unk_4D0428C at sm_70+) |
| sub_60D650 | 0x60D650 | Optimization-level flag configurator (~109 flags) |
| sub_982C80 | 0x982C80 | NVPTX subtarget 224-byte feature bitfield |
| sub_617BD0 | 0x617BD0 | CLI parser; sets unk_4D045E8 per compute_XX |
| sub_12AE930 | 0x12AE930 | Atomic builtin lowering (volatile vs. ordering) |
| sub_9502D0 | 0x9502D0 | Duplicate atomic lowering (standalone pipeline) |
| sub_6BBC40 | 0x6BBC40 | Builtin semantic checker (atomics, scope validation) |
| sub_90AEE0 | 0x90AEE0 | Builtin registration table (HMMA builtins 678–707) |
| sub_95EB40 | 0x95EB40 | Architecture registration (compute_XX to sm_XX) |
| sub_1C36530 | 0x1C36530 | NVVM verifier (convergent intrinsic SM gates) |
| sub_2C7B6A0 | 0x2C7B6A0 | NVVM lowering (convergent intrinsic SM gates) |
| sub_21E6DD0 | 0x21E6DD0 | PTX emission (volatile / L2::cache_hint / .unified) |
| sub_21E6420 | 0x21E6420 | Atomic L2 cache hint PTX emission |
| sub_214DA90 | 0x214DA90 | Kernel attribute PTX emitter (cluster directives gated at arch_id > 89) |
| sub_5D1A60 | 0x5D1A60 | __block_size__ attribute (cluster dims at sm_90+) |
| sub_5D1FE0 | 0x5D1FE0 | __cluster_dims__ attribute (sm_90+ feature) |
| sub_5D2430 | 0x5D2430 | __launch_bounds__ 3rd param (sm_90+ cluster) |
| sub_5C68F0 | 0x5C68F0 | Pre-sm_70 diagnostic 3703 |
| sub_5C6950 | 0x5C6950 | Pre-sm_80 diagnostic 3660 |

Hopper (sm_90, sm_90a)

Hopper represents the largest single-generation feature expansion in cicc v13.0. The sm_90 gate at qword_4F077A8 > 89999 unlocks thread block clusters, distributed shared memory, Tensor Memory Access (TMA), Warpgroup Matrix Multiply-Accumulate (WGMMA), dynamic register count control, and a new fence instruction. The sm_90a "accelerated" sub-variant shares __CUDA_ARCH=900 with sm_90 but uses a higher PTX version and enables one additional feature gate in the EDG frontend.

Architecture Identity

The NVVM container format registers Hopper as NVVM_ARCH_HOPPER_9_0 with numeric value 900, assigned in sub_CD09E0 (line 255) and sub_1C1B150 (line 270) via the pattern v62(a1, "NVVM_ARCH_HOPPER_9_0", v64) => *a2 = 900.

| Variant | Subtarget Enum | __CUDA_ARCH | PTX Version | -opt-arch | -mcpu |
|---|---|---|---|---|---|
| sm_90 | 38 | 900 | 5 | sm_90 | sm_90 |
| sm_90a | 39 | 900 | 6 | sm_90a | sm_90a |

Both variants share __CUDA_ARCH=900. The distinction lies in the -opt-arch and -mcpu flags passed through the internal pipeline (sub_95EB40 lines 461–469, sub_12C8DD0 lines 435–457). The sm_90a variant is the only pre-Blackwell SM that uses PTX version 6; all sm_20 through sm_90 base variants use PTX version 5.

The a flag is stored in unk_4D045E4 and read in exactly one location: sub_6C4D80 line 167, where the check unk_4D045E8 != 90 || !unk_4D045E4 gates a specific sm_90a-only feature (error code 0xE90 = 3728).

Thread Block Cluster Infrastructure

Clusters are the headline Hopper feature. The compiler gates all cluster functionality at arch_id >= 90 (unk_4D045E8 > 89).

Frontend Attributes

The EDG frontend recognizes three cluster-related kernel attributes:

__cluster_dims__ — Attribute code k in sub_5C79F0. Processing in sub_5D1FE0 validates three integer arguments (x, y, z) and stores them at offsets +20, +24, +28 of the kernel metadata structure. Error codes 3685/3686 on invalid values. On sm_89 and below, diagnostic 3687 is emitted as a warning.

__launch_bounds__ 3rd parameter — The cluster dimension extension to __launch_bounds__ is processed in sub_5D2430. On sm_89 and below, diagnostic 3704 is emitted.

__block_size__ attribute — Handled in sub_5D1A60. At sm_90+, five block dimension arguments are parsed (including the cluster dimension). At sm_89 and below, diagnostic 3790 is emitted and only four arguments are accepted.

NVVM Metadata

Cluster configuration propagates through NVVM IR via several metadata keys:

| Metadata Key | Writers | Readers |
|---|---|---|
| nvvm.cluster_dim | sub_93AE30, sub_129A750 | sub_A84F90, sub_CE8EA0 |
| cluster_dim_x/y/z | sub_913C80, sub_1273830 | sub_CE8C00/40/80 |
| cluster_max_blocks | sub_913C80, sub_1273830 | (kernel metadata) |
| nvvm.blocksareclusters | sub_93AE30, sub_129A750 | sub_214DA90 |
| nvvm.maxclusterrank | (external) | sub_A84F90, sub_CE9030 |

The blocksareclusters metadata requires reqntid to be set — error message: "blocksareclusters requires reqntid" (sub_214DA90 line 111).

PTX Directives

The kernel attribute emitter at sub_214DA90 gates cluster directives at arch_id >= 90. When the gate passes, four directives may be emitted:

  • .blocksareclusters — Declares that thread blocks form clusters
  • .explicitcluster — Emitted when all three cluster dimensions are present
  • .reqnctapercluster X, Y, Z — Required CTA count per cluster
  • .maxclusterrank N — Maximum cluster rank

Cluster Special Registers

The PTX emitter at sub_21E9060 handles 15 cluster special registers via a switch statement:

| Case | Register | Description |
|---|---|---|
| 0 | %is_explicit_cluster | Boolean: was cluster explicitly set |
| 1 | %cluster_ctarank | CTA rank within the cluster |
| 2 | %cluster_nctarank | Number of CTAs in cluster |
| 3–5 | %cluster_nctaid.{x,y,z} | Cluster grid dimensions |
| 6–8 | %cluster_ctaid.{x,y,z} | CTA position within cluster |
| 9–11 | %nclusterid.{x,y,z} | Cluster grid count |
| 12–14 | %clusterid.{x,y,z} | Cluster ID |

Cluster Barrier Operations

The barrier.cluster instruction is emitted from sub_21E8EA0 with two operation modes and two memory ordering modes:

| Opcode (bits 0–3) | Operation | Memory Mode (bits 4–7) | Qualifier |
|---|---|---|---|
| 0 | arrive | 0 | (default acquire/release) |
| 1 | wait | 1 | .relaxed |

Error strings: "bad cluster barrier op" for invalid opcode, "bad cluster barrier mem mode" for invalid memory mode.
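
The operand split above can be expressed as a small decoder. This is a sketch, not the decompiled code; the function and table names are ours:

```python
# Sketch of the barrier.cluster operand split described above: the low
# nibble selects the operation, the next nibble the memory ordering.
# OPS/MEM tables and the function name are ours, not recovered symbols.
OPS = {0: "arrive", 1: "wait"}
MEM = {0: "", 1: ".relaxed"}

def cluster_barrier_ptx(operand: int) -> str:
    op = operand & 0xF          # bits 0-3: operation
    mem = (operand >> 4) & 0xF  # bits 4-7: memory ordering
    if op not in OPS:
        raise ValueError("bad cluster barrier op")
    if mem not in MEM:
        raise ValueError("bad cluster barrier mem mode")
    return f"barrier.cluster.{OPS[op]}{MEM[mem]}"
```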

Three corresponding builtins are registered in sub_90AEE0:

| Builtin | ID |
|---|---|
| __nv_cluster_barrier_arrive_impl | 11 |
| __nv_cluster_barrier_wait_impl | 12 |
| __nv_cluster_barrier_arrive_relaxed_impl | 13 |

Cluster Query Builtins

Nine cluster information builtins are registered in sub_90AEE0:

| Builtin | ID | Purpose |
|---|---|---|
| __nv_clusterDimIsSpecifed_impl | 8 | Check if cluster dims are set |
| __nv_clusterRelativeBlockRank_impl | 9 | Block rank within cluster |
| __nv_clusterSizeInBlocks_impl | 10 | Total blocks in cluster |
| __nv_cluster_query_shared_rank_impl | 203 | Query shared memory rank |
| __nv_cluster_map_shared_rank_impl | 365 | Map to shared memory rank |
| __nv_clusterDim_impl | 405 | Get cluster dimensions |
| __nv_clusterRelativeBlockIdx_impl | 406 | Relative block index |
| __nv_clusterGridDimInClusters_impl | 407 | Grid dimension in clusters |
| __nv_clusterIdx_impl | 408 | Cluster index |

fence.sc.cluster Instruction

A new fence instruction is emitted from sub_21E94F0, the membar/fence printer. The opcode encoding uses the low 4 bits of the operand:

| Value | Instruction | Generation |
|---|---|---|
| 0 | membar.gpu | All |
| 1 | membar.cta | All |
| 2 | membar.sys | All |
| 4 | fence.sc.cluster | Hopper+ |

A duplicate implementation exists in the NVPTX backend at sub_35F18E0.
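
The low-nibble selector can be sketched as a lookup; the dict is ours, built from the recovered cases:

```python
# Sketch of the membar/fence selector from the table above: the low
# 4 bits of the operand pick the instruction. Value 3 has no recovered
# case here, so it is absent from the table.
FENCE_BY_LOW4 = {
    0: "membar.gpu",
    1: "membar.cta",
    2: "membar.sys",
    4: "fence.sc.cluster",  # Hopper+ only
}

def fence_instruction(operand: int) -> str:
    return FENCE_BY_LOW4[operand & 0xF]
```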

Atomic Cluster Scope

At sm_90+, the atomic lowering paths (sub_12AE930 line 255, sub_9502D0 line 424) add cluster scope support. Scope value 2 now resolves to "cluster" instead of falling through to "gpu" as it does on sm_70–89. This enables atom.*.cluster operations for intra-cluster synchronization.

setmaxnreg — Dynamic Register Count

Hopper introduces dynamic register count adjustment via setmaxnreg.{inc,dec}.sync.aligned.u32.

NVVM IR validation (sub_BFC6A0 lines 1732–1754): Builtin IDs 9431–9432 correspond to nvvm.setmaxnreg.inc and nvvm.setmaxnreg.dec. Validation rules enforce that the register count must be a multiple of 8 and within the range [24, 256].
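
The validation rule is small enough to state executably; the helper name is ours:

```python
# Sketch of the setmaxnreg count validation stated above: the register
# count must be a multiple of 8 and lie in [24, 256].
def setmaxnreg_count_is_valid(n: int) -> bool:
    return 24 <= n <= 256 and n % 8 == 0
```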

Inline assembly recognition (sub_FCDCB0, sub_21EA5F0): The compiler scans inline asm for setmaxnreg. followed by .sync.aligned.u32, extracting the immediate operand from either a $0 placeholder or a literal integer. Backend duplicates exist at sub_307BA30 and sub_3953170.

WGMMA — Warpgroup Matrix Multiply-Accumulate

WGMMA is Hopper's primary tensor core interface, superseding HMMA for large matrix operations.

Registered Builtins

Four type variants are registered in sub_90AEE0 (lines 2941–2944) with a duplicate table in sub_126A910:

| Builtin | ID | Accumulator Type |
|---|---|---|
| __wgmma_mma_async_f16 | 765 | FP16 |
| __wgmma_mma_async_bf16 | 766 | BF16 |
| __wgmma_mma_async_tf32 | 767 | TF32 |
| __wgmma_mma_async_f8 | 768 | FP8 |

Shape Selection

The WGMMA lowering at sub_955A70 (lines 2850–2910+) uses a switch on the M dimension (output rows) to select MachineInstr opcodes:

| M Dimension | Opcode |
|---|---|
| 8 | 10774 |
| 16 | 10690 |
| 24 | 10734 |
| 32 | 10742 |
| 40–88 (stride 8) | 10746–10770 |

Error on invalid M: "unexpected constant overflow in __wgmma_mma_async operand".
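
The switch can be modeled as follows. The discrete cases come straight from the table; mapping the 40–88 range linearly (stride-8 M values onto stride-4 opcodes) is an inference from the recovered endpoints, not a decompiled fact:

```python
# Sketch of the M-dimension -> opcode selection in the WGMMA lowering.
# The 40-88 range is ASSUMED to map linearly onto 10746-10770.
def wgmma_opcode_for_m(m: int) -> int:
    fixed = {8: 10774, 16: 10690, 24: 10734, 32: 10742}
    if m in fixed:
        return fixed[m]
    if 40 <= m <= 88 and m % 8 == 0:
        return 10746 + ((m - 40) // 8) * 4
    raise ValueError(
        "unexpected constant overflow in __wgmma_mma_async operand")
```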

Operand Modifiers

The NVPTX printer at sub_35F3330 emits WGMMA operand modifiers encoded in bitfields:

  • kind (bits 6–8): mxf4nvf4 (0), f8f6f4 (1), mxf8f6f4 (2), f16 (3), i8 (4), tf32 (5), mxf4 (7)
  • cta_group (bit 1): cta_group::1 (clear) or cta_group::2 (set)
  • scale (bits 2–3): Additional scaling modifier

TMA — Tensor Memory Access

TMA provides hardware-accelerated bulk data movement between global and shared memory, driven by a tensor map descriptor that encodes the multi-dimensional layout. Three independent subsystems in cicc cooperate to implement TMA: the intrinsic name parser (sub_A8E250), the SelectionDAG lowering handler (sub_33AD3D0), and the NVPTX ISel pattern matcher for CpAsyncBulkTensor (sub_36EC510).

TMA Descriptor Format (NVVM Container Tag 401)

The host-side tensor map descriptor is embedded in the NVVM container under tag 401. The tag is conditional on ExtOpt.Field344 (tag 301) having value 1, which identifies the Hopper TMA path. (Blackwell uses tag 402 for TCGen05Config instead, gated by Field344==4; the two are mutually exclusive.)

| Component | Size | Description |
|---|---|---|
| Fixed header | 44 bytes | Tensor map metadata (dimensions, strides, element type, interleave, swizzle, fill, OOB policy) |
| Per-descriptor entry | 16 bytes each | One entry per cp.async.bulk.tensor call site in the kernel |
| Total struct at offset 408 | 44 + 16*N bytes | N = number of distinct TMA operations |

The compiler serializes this into the NVVM container (sub_CDD2D0) so ptxas can validate shared memory allocation sizes and descriptor compatibility at link time.
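
The payload size formula from the layout above is simply:

```python
# Size of the tag-401 payload per the layout table above: a 44-byte
# fixed header plus one 16-byte entry per distinct TMA operation.
def tma_tag401_size(num_tma_ops: int) -> int:
    return 44 + 16 * num_tma_ops
```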

TMA Descriptor ABI in Kernel Parameters

The EDG frontend detects TMA descriptor parameters during kernel registration stub generation. The detection function sub_8D4C10 (edg::get_tma_descriptor_flags) checks:

```
if (unk_4F068E0
    && arch > 0x9EFB
    && type_is_struct_or_class(type)
    && (*(type+140) & ~4) == 8
    && get_tma_descriptor_flags(type) & 4):
  insert copy_node(sub_7E7ED0, calling_convention=7)
  byte_at(node+88) |= 4   // TMA descriptor flag
```

This gives TMA descriptors a distinct ABI: calling convention 7 with flag bit 4, separate from normal struct-by-value passing. The copy node ensures the descriptor is materialized at the correct address space boundary before kernel launch.

TMA Intrinsic Name Parsing (sub_A8E250)

The intrinsic dispatcher sub_A8E250 (52 KB) matches TMA intrinsic names via string comparison and assigns internal opcode IDs. Two families exist:

Tensor-structured copies (require a tensor map descriptor):

| Intrinsic Pattern | Dimensions | Opcode |
|---|---|---|
| cp.async.bulk.tensor.g2s.tile.1d | 1D | 9222 |
| cp.async.bulk.tensor.g2s.tile.2d | 2D | 9223 |
| cp.async.bulk.tensor.g2s.tile.3d | 3D | 9224 |
| cp.async.bulk.tensor.g2s.tile.4d | 4D | 9225 |
| cp.async.bulk.tensor.g2s.tile.5d | 5D | 9226 |
| cp.async.bulk.tensor.g2s.im2col.3d | 3D | 9213 |
| cp.async.bulk.tensor.g2s.im2col.4d | 4D | 9214 |
| cp.async.bulk.tensor.g2s.im2col.5d | 5D | 9215 |
| cp.async.bulk.tensor.gmem.to.smem.1d | 1D | 8324 |
| cp.async.bulk.tensor.gmem.to.smem.2d | 2D | 8325 |
| cp.async.bulk.tensor.gmem.to.smem.3d | 3D | 8326 |
| cp.async.bulk.tensor.gmem.to.smem.4d | 4D | 8327 |
| cp.async.bulk.tensor.gmem.to.smem.5d | 5D | 8328 |
| cp.async.bulk.tensor.gmem.to.smem.im2col.w.3d | 3D | 8329 |
| cp.async.bulk.tensor.gmem.to.smem.im2col.w.4d | 4D | 8330 |
| cp.async.bulk.tensor.gmem.to.smem.im2col.w.5d | 5D | 8331 |

Unstructured bulk copies (byte-level, no tensor map descriptor):

| Intrinsic Pattern | Opcode |
|---|---|
| cp.async.bulk.global.to.shared.cluster | 8315 |
| cp.async.bulk.gmem.to.dsmem | 8316 |

Fragment-indexed TMA (from builtin IDs 411/412 via sub_9483E0):

| LLVM Intrinsic | Base Opcode | Index Range |
|---|---|---|
| llvm.nvvm.tma.load | 9233 | 9227–9232 (6 entries, indexed by fragment count) |
| llvm.nvvm.tma.store | 9257 | (corresponding store entries) |

TMA SelectionDAG Lowering (sub_33AD3D0)

The unified TMA handler sub_33AD3D0 receives a mode argument from the main intrinsic lowering switch in sub_33B0210:

| Case | Mode | Operation | Memory Direction |
|---|---|---|---|
| 0x179 | 2 | TMA load | global -> shared |
| 0x17A | 3 | TMA store | shared -> global |
| 0x17B | 5 | TMA prefetch | global (read-only) |
| 0x17C | 7 | TMA multicast load | global -> N shared (across cluster) |

Related cp.async handlers in the same dispatch table:

| Case | Handler | Operation |
|---|---|---|
| 0x175 | sub_33AC2B0 | cp.async (non-TMA async copy) |
| 0x176 | sub_33AC130 | cp.async.wait |
| 0x177 | sub_33AB690 | cp.async.bulk (non-tensor bulk copy) |
| 0x178 | goto LABEL_32 | No-op — commit/barrier (scheduling fence only) |

The 0x178 no-op is significant: it represents the cp.async.bulk commit/barrier intrinsic that exists purely for scheduling purposes. The compiler preserves it as a DAG ordering constraint even though it produces no data-flow SDNode.

CpAsyncBulkTensor G2S Lowering (sub_36EC510)

The 27 KB function sub_36EC510 (1185 lines) implements the complete cp.async.bulk.tensor global-to-shared lowering with full architecture gating and mode validation.

Architecture gates (read from offset+340 of the subtarget object):

| SM Value | Hex | Features Unlocked |
|---|---|---|
| >= 1000 | 0x3E8 | SM 90: tile mode (1D–5D), Im2Col mode (3D–5D) |
| >= 1032 | 0x408 | SM 100: adds 2CTA mode, Im2Col_W, Im2Col_W128 |

Mode bit decoding from operand v11:

| Bits | Mask | Meaning |
|---|---|---|
| 2–4 | v11 & 0x1C | Im2Col variant: Im2Col, Im2Col_W, Im2Col_W128 |
| 3–4 | v11 & 0x18 | 2CTA mode flag |

Validation error strings (emitted as fatal diagnostics):

  • "NumDims should be at least 3 for Im2Col or Im2Col_W or Im2Col_W128 mode" — Im2Col requires >= 3D tensors
  • "Im2Col_W and Im2Col_W128 modes are not supported on this architecture." — SM 90 does not support Im2Col_W/W128; requires SM 100+
  • "2CTA Mode for CpAsyncBulkTensorG2S not supported on this architecture" — 2CTA mode requires SM 100+
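
The dimension and 2CTA checks can be sketched as follows. Names are ours; the recovered masks overlap (0x18 is a subset of 0x1C) and are reproduced as-is, and the Im2Col_W-specific arch check is omitted because the exact distinguishing bit is not recovered here:

```python
# Sketch of sub_36EC510's validation logic per the mode table and error
# strings above. This is a model, not the decompiled code.
def validate_g2s(sm: int, v11: int, num_dims: int) -> None:
    im2col_variant = (v11 & 0x1C) != 0  # Im2Col / Im2Col_W / Im2Col_W128
    two_cta = (v11 & 0x18) != 0         # 2CTA mode flag
    if im2col_variant and num_dims < 3:
        raise ValueError("NumDims should be at least 3 for Im2Col or "
                         "Im2Col_W or Im2Col_W128 mode")
    if two_cta and sm < 1032:           # 2CTA requires the SM 100 gate (0x408)
        raise ValueError("2CTA Mode for CpAsyncBulkTensorG2S "
                         "not supported on this architecture")
```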

TMA Builtin Codegen (EDG -> LLVM IR)

The EDG-to-LLVM builtin lowering handles TMA as builtin IDs 411 and 412 (hex 0x19B / 0x19C).

ID 411 (scatter/store path) — sub_12A7070 extracts TMA descriptor info, then an iterative loop builds a vector of per-element store nodes. The intrinsic table 0x107A–0x107F (4218–4223) selects among 6 entries indexed by element count. Approximately 300 lines of handler code (lines 1256–1501 of sub_12A71A0).

ID 412 (gather/load path) — Similar structure but for the load direction. Uses intrinsic table 0x1094–0x109A (4244–4250). Includes bitcast insertion (opcode 47) for type mismatches between the descriptor element type and the destination register type. Approximately 450 lines (lines 1503–1713).

Both paths use:

  • sub_12AA280 — TMA descriptor builder (constructs the multi-operand struct from the builtin arguments)
  • sub_12A9E60extractvalue emission (decomposes aggregate returns into individual registers)
  • sub_39FAC40 — Fragment count computation (determines how many load/store fragments the TMA operation expands into)

TMA Scheduling Constraints

TMA operations impose specific scheduling constraints visible in cicc's SelectionDAG construction:

  1. Chain dependencies by mode. Every TMA operation produces a memory chain in the SelectionDAG. The mode parameter determines the chain direction:

    | Mode | Reads | Writes | Chain Effect |
    |---|---|---|---|
    | 2 (load) | global | shared | Load chain |
    | 3 (store) | shared | global | Store chain |
    | 5 (prefetch) | global | (none) | Load chain |
    | 7 (multicast) | global | N x shared | Load chain |
  2. Commit-as-fence. Intrinsic ID 0x178 lowers to no-op (goto LABEL_32), functioning as a pure scheduling barrier. This prevents the DAG scheduler from reordering TMA operations past their commit point.

  3. Async qualifier hierarchy. The memory space qualifiers emitted by sub_35F4B50 form an ordered fence hierarchy:

    | Qualifier | Scope | Strength |
    |---|---|---|
    | .async | Unscoped | Weakest |
    | .async.global | Global memory domain | |
    | .async.shared::cta | CTA-local shared memory | |
    | .async.shared::cluster | Cluster shared memory (DSMEM) | Strongest |

Distributed Shared Memory

Hopper's cluster architecture enables distributed shared memory (DSMEM) across CTAs in a cluster. The NVPTX backend emits memory space qualifiers from two functions:

sub_35F4B50 — Async memory space qualifier emission (switch on operand):

| Line | Qualifier | Semantic |
|---|---|---|
| 20 | .async | Base async qualifier (unscoped) |
| 32 | .async.global | Async from global memory |
| 45 | .async.shared::cta | Async to CTA-local shared memory |
| 59 | .async.shared::cluster | Async to cluster distributed shared memory |
| 73 | .alias | Aliased access modifier (permits overlapping accesses) |

sub_35F4E30 — Commit modifier emission (switch on operand):

| Line | Qualifier | Semantic |
|---|---|---|
| 28 | .cta_group::1 | CTA group 1 selection |
| 38 | .cta_group::2 | CTA group 2 selection |
| 51 | .mbarrier::arrive::one | Single-thread mbarrier arrive |
| 67 | .shared::cluster | Cluster shared memory scope |
| 80 | .multicast::cluster | Multicast to all CTAs in cluster |

sub_35F4080 — Secondary .shared::cluster emission (line 68), used in non-commit contexts.

These qualifiers attach to cp.async.bulk and mbarrier instructions to specify the scope and direction of asynchronous data movement within the cluster.

Mbarrier Extensions — DMA Fence/Arrive/Wait

Hopper extends the async barrier (mbarrier) mechanism to coordinate TMA data movement. The TMA DMA pipeline follows a three-phase synchronization protocol:

Phase 1: Initialization

.mbarrier_init (emitted from sub_35F4AD0) initializes the async barrier with the expected transaction byte count. The arrive_expect_tx variant sets both the expected arrival count and the transaction byte count atomically.

Phase 2: Arrive (Producer Signals Completion)

When a TMA operation completes, it signals the mbarrier:

  • .mbarrier::arrive::one (sub_35F4E30 line 51) — single-thread arrive notification. The TMA hardware auto-arrives with the transferred byte count.
  • .cta_group::1 / .cta_group::2 (sub_35F4E30 lines 28/38) — selects which CTA group the arrive targets, enabling pipelined producer-consumer patterns where two groups alternate roles.

Phase 3: Wait (Consumer Blocks)

The consumer thread issues mbarrier.try_wait with a phase bit. The phase alternates each time the barrier completes a full cycle, enabling pipelined double-buffered access patterns. No additional cicc emission function is needed; the standard mbarrier wait path handles this.
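
The three-phase protocol can be illustrated with a toy model of the phase-bit parity. This models the double-buffering pattern only; it is not cicc code, and all names are ours:

```python
# Toy model of the mbarrier transaction-count protocol described above:
# init sets an expected byte count, TMA arrivals accumulate bytes, and
# the phase (parity) bit flips each time the barrier completes a cycle.
class MBarrier:
    def __init__(self, expected_tx: int):
        self.expected_tx = expected_tx  # set via init / arrive_expect_tx
        self.arrived_tx = 0
        self.phase = 0                  # parity bit, flips per cycle

    def arrive_tx(self, nbytes: int) -> None:
        self.arrived_tx += nbytes       # TMA auto-arrives with byte count
        if self.arrived_tx >= self.expected_tx:
            self.arrived_tx -= self.expected_tx
            self.phase ^= 1             # cycle complete: flip parity

    def try_wait(self, parity: int) -> bool:
        # Consumer passes the parity of the cycle it is waiting on;
        # the wait completes once the barrier has moved past it.
        return self.phase != parity
```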

WGMMA Fence/Commit/Wait (Distinct Pipeline)

WGMMA has its own synchronization cycle, separate from TMA mbarriers:

| Builtin | IDs | Handler | LLVM Intrinsic |
|---|---|---|---|
| __wgmma_fence | 745–750 | sub_12B1C20 | 9062 (wgmma.fence.aligned, 3 type overloads) |
| __wgmma_commit_group | (same range) | sub_12B1C20 | (same dispatch) |
| __wgmma_wait_group | (same range) | sub_12B1C20 | (same dispatch) |

WGMMA fences synchronize the tensor core accumulator pipeline; TMA mbarriers synchronize the DMA engine. A typical Hopper kernel pipelines both: TMA loads data into shared memory (mbarrier-synchronized), then WGMMA consumes the data from shared memory (fence-synchronized). The two synchronization domains must not be confused in a reimplementation.

Feature Flag Configuration

The master feature configurator sub_60E7C0 sets the following flags at the sm_90+ threshold (qword_4F077A8 > 89999):

| Flag | Source |
|---|---|
| unk_4D043D0 | sub_60E7C0 |
| unk_4D041B0 | sub_60E7C0 |
| unk_4D04814 | sub_60E7C0 |
| unk_4D0486C | sub_60E7C0 (with C++ version check) |
| dword_4F07760 | sub_60E530 |
| dword_4D043F8 | sub_60E530 (at > 99999) |
| dword_4D041E8 | sub_60E530 (at > 99999) |

Key Binary Locations

| Function | Address | Role |
|---|---|---|
| sub_CD09E0 | 0xCD09E0 | NVVM arch enum (NVVM_ARCH_HOPPER_9_0) |
| ctor_356 | 0x50C890 | Subtarget registration (sm_90 enum 38, sm_90a enum 39) |
| sub_214DA90 | 0x214DA90 | Kernel attribute emitter (cluster PTX directives) |
| sub_21E9060 | 0x21E9060 | Cluster special register PTX emission |
| sub_21E8EA0 | 0x21E8EA0 | Cluster barrier instruction emission |
| sub_21E94F0 | 0x21E94F0 | Membar/fence printer (fence.sc.cluster) |
| sub_BFC6A0 | 0xBFC6A0 | setmaxnreg NVVM IR validation |
| sub_FCDCB0 | 0xFCDCB0 | setmaxnreg inline asm pattern matching |
| sub_955A70 | 0x955A70 | WGMMA lowering (M-dimension switch) |
| sub_90AEE0 | 0x90AEE0 | Builtin registration (WGMMA, cluster barriers/queries) |
| sub_A8E250 | 0xA8E250 | TMA intrinsic name parsing (52 KB) |
| sub_33AD3D0 | 0x33AD3D0 | TMA SelectionDAG lowering handler (modes 2/3/5/7) |
| sub_33AB690 | 0x33AB690 | cp.async.bulk non-tensor handler |
| sub_33AC2B0 | 0x33AC2B0 | cp.async handler |
| sub_33AC130 | 0x33AC130 | cp.async.wait handler |
| sub_36EC510 | 0x36EC510 | CpAsyncBulkTensor G2S lowering (27 KB, 1185 lines) |
| sub_9483E0 | 0x9483E0 | TMA descriptor extraction |
| sub_12AA280 | 0x12AA280 | TMA descriptor builder (EDG -> LLVM IR) |
| sub_12A7070 | 0x12A7070 | TMA scatter/store builtin handler |
| sub_8D4C10 | 0x8D4C10 | edg::get_tma_descriptor_flags |
| sub_35F4B50 | 0x35F4B50 | DSMEM qualifier emission |
| sub_35F4E30 | 0x35F4E30 | Commit modifier emission (mbarrier, multicast) |
| sub_35F4AD0 | 0x35F4AD0 | .mbarrier_init emission |
| sub_35F4080 | 0x35F4080 | Secondary .shared::cluster emission |

Blackwell Datacenter (sm_100, sm_100a, sm_103, sm_103a)

The Blackwell datacenter family introduces the fifth-generation tensor core instruction set (tcgen05), new floating-point formats (FP4, FP6, MX formats), and a sophisticated arch-conditional versus family-conditional feature gating system. sm_100/sm_100a targets the NVIDIA B200, while sm_103/sm_103a targets Blackwell Ultra (GB300 system). Both share the tcgen05 ISA but differ in __CUDA_ARCH values and minor tensor core configuration.

Architecture Identity

Six Blackwell arch constants are defined in sub_CD09E0:

| NVVM Enum | Numeric Value | Implied SM |
|---|---|---|
| NVVM_ARCH_BLACKWELL_10_0 | 1000 | sm_100 |
| NVVM_ARCH_BLACKWELL_10_1 | 1010 | sm_101 |
| NVVM_ARCH_BLACKWELL_10_3 | 1030 | sm_103 |
| NVVM_ARCH_BLACKWELL_11_0 | 1100 | sm_110 (Jetson Thor) |
| NVVM_ARCH_BLACKWELL_12_0 | 1200 | sm_120 |
| NVVM_ARCH_BLACKWELL_12_1 | 1210 | sm_121 |

Notable: sm_110 (Jetson Thor) was originally designated sm_101 before being renumbered to its own 11.x line. Despite the rename, both remain in the Blackwell family (NVVM_ARCH_BLACKWELL_*). The numeric encoding follows the standard major*100 + minor*10 formula: 11*100 + 0*10 = 1100.
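
The encoding formula as a one-liner, checked against the enum values above:

```python
# The standard major*100 + minor*10 arch-value encoding used by the
# NVVM_ARCH_* constants.
def nvvm_arch_value(major: int, minor: int) -> int:
    return major * 100 + minor * 10
```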

SM Variant Table

Each Blackwell datacenter target has base, accelerated (a), and forward-compatible (f) sub-variants:

| Variant | __CUDA_ARCH | PTX Version | Product |
|---|---|---|---|
| sm_100 | 1000 | 6 | B200 base |
| sm_100a | 1000 | 7 | B200 accelerated |
| sm_100f | 1000 | 7 | B200 forward-compatible |
| sm_103 | 1030 | 6 | Blackwell Ultra / GB300 base |
| sm_103a | 1030 | 7 | Blackwell Ultra / GB300 accelerated |
| sm_103f | 1030 | 7 | Blackwell Ultra / GB300 forward-compatible |

The undocumented sm_101 and sm_102 targets also exist in the processor table (ctor_605) with their own a/f variants. sm_101 maps to __CUDA_ARCH=1010 and sm_102 to __CUDA_ARCH=1020. No unique feature gates differentiate them from sm_100 in cicc.

Suffix Semantics

The sub-variant flags are stored in EDG frontend globals:

  • unk_4D045E8 — Major SM number (100, 103)
  • unk_4D045E4 — Accelerated flag; set for both a and f variants
  • unk_4D045E0 — Forward-compatible flag; set only for f variants

The f suffix implies a — whenever the forward-compatible flag is set, the accelerated flag is also set. In cicc v13.0, the f flag is set during CLI parsing and reset in sub_615CB0 but is never read by any compiler logic. It exists for future-proofing and potential ptxas-level differentiation.
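
The flag assignment can be sketched as a tiny parser. The parser shape is ours; only the flag semantics (f sets both flags, a sets only the accelerated flag) come from the binary:

```python
# Sketch of the sm_NNN[a|f] suffix handling described above, returning
# (major SM number, accelerated flag, forward-compatible flag) as the
# EDG globals unk_4D045E8 / unk_4D045E4 / unk_4D045E0 would hold them.
def parse_sm_suffix(target: str):
    base = target.removeprefix("sm_")
    a_flag = f_flag = 0
    if base.endswith("f"):
        a_flag, f_flag, base = 1, 1, base[:-1]  # f implies a
    elif base.endswith("a"):
        a_flag, base = 1, base[:-1]
    return int(base), a_flag, f_flag
```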

Arch-Conditional vs. Family-Conditional Gating

Blackwell introduces a two-tier feature gating system that distinguishes between "arch-conditional" and "family-conditional" access to instructions. This pattern repeats across every tcgen05 handler.

The gate check at sub_30462A0, sub_304E6C0, and sub_36E9630 uses a complex encoding:

```
v = arch_version (offset +340 of arch struct)
if (v > 0x408) {           // 0x408 = 1032 = sm_103f
    if (v - 1101 > 1)      // allows {1101, 1102} — sm_110a/sm_110f (Jetson Thor)
        goto ERROR;
} else if (v <= 0x3E8 || ((1LL << ((v & 0xFF) + 23)) & 0xC0000C03) == 0) {
    goto ERROR;             // 0x3E8 = 1000 = sm_100 base
}
```

The bitmask 0xC0000C03 selects specific sub-variants when shifted by (v & 0xFF) + 23. PTX version gates further refine access: family-conditional features require PTX >= 86, while arch-conditional features require PTX >= 88.
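
The gate can be modeled executably. One assumption: the decompiled shift count (v & 0xFF) + 23 can exceed 63, so we take x86 shl semantics (count reduced mod 64), which is the only reading consistent with the per-variant acceptance list recovered from the binary:

```python
# Executable model of the tcgen05 arch gate shown above. Names are ours;
# the mod-64 shift wrap is an ASSUMPTION about the decompiled 1LL << n.
TCGEN05_MASK = 0xC0000C03

def tcgen05_allowed(v: int) -> bool:
    if v > 0x408:                   # above 1032 (sm_103f)
        return 1101 <= v <= 1102    # only sm_110a / sm_110f (Jetson Thor)
    if v <= 0x3E8:                  # 1000 (sm_100 base) and below
        return False
    shift = ((v & 0xFF) + 23) % 64  # x86 shl wraps the shift count
    return (TCGEN05_MASK >> shift) & 1 == 1
```

Running this over the variant values reproduces the pattern the text describes: only a/f sub-variants of sm_100/101/103 plus sm_110a/f pass; every base variant and the whole sm_120 family fail.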

Features gated by both arch-conditional and family-conditional (broader access): tcgen05.fence, tcgen05.wait, tcgen05.relinquish.alloc, tcgen05.cp, tcgen05.commit, tcgen05.alloc, tcgen05.mma, and the ue8m0x2 type in cvt_packfloat.

Features gated by arch-conditional only (stricter): {fp6/fp4}x2 types in cvt_packfloat, INT8 type in tcgen05.mma, MXF4/MXF4NVF4 with sparsity, and explicit scale vector size.

tcgen05 — Tensor Core Generation 5

The tcgen05 instruction family is the primary new ISA extension for Blackwell datacenter. All tcgen05 instructions are handled in sub_30462A0 and sub_304E6C0.

Lifecycle Instructions

| Instruction | Opcode | ISD | Operands | Purpose |
|---|---|---|---|---|
| tcgen05.alloc | 10080 | 4765 | Basic allocation | Allocate tensor core accumulator memory |
| tcgen05.alloc (multicast) | 10083 | 4770/4771 | 32-bit flag variant | Multicast allocation |
| tcgen05.dealloc | 10140 | 4827 | 4 operands | Deallocate tensor core memory |
| tcgen05.commit | 10090/10091 | 4772–4777 | Mask variants | Commit pending operations |
| tcgen05.fence | 10143 | 4830 | 2 operands | Memory fence for tensor ops |
| tcgen05.wait | 10351 | 5020 | 2 operands | Wait for tensor ops to complete |
| tcgen05.relinquish.alloc | 10311 | 4941 | 2 operands | Relinquish allocated tensor memory |
| tcgen05.cp.* | 10101 | 4790 | 4 operands | Copy operations for tensor data |

The commit instruction has multiple variants based on multicast mask size. Only 16-bit and 32-bit masks are valid; other sizes produce an error.

tcgen05.mma — Matrix Multiply-Accumulate

The main MMA instruction is handled in sub_304E6C0 (opcodes 10299–10309) and validated in sub_36E9630. The operand encoding packs configuration into bitfields:

Data types (bits 8–6 of operand):

| Value | Kind | Notes |
|---|---|---|
| 0 | kind::mxf4nvf4 | MX FP4 with NV FP4 |
| 1 | kind::f8f6f4 | Standard FP8/FP6/FP4 |
| 2 | kind::mxf8f6f4 | MX variant of f8f6f4 |
| 3 | kind::f16 | Half precision |
| 4 | kind::i8 | 8-bit integer (arch-conditional only) |
| 5 | kind::tf32 | TensorFloat-32 |
| 7 | kind::mxf4 | MX FP4 |

Scale vector sizes (bits 3–2):

| Value | Modifier | Constraints |
|---|---|---|
| default | .scale_vec::1X | Not for mxf4nvf4 or mxf4 |
| 2 | .scale_vec::2X | Not for mxf8f6f4 |
| 3 | .scale_vec::4X | Not for mxf8f6f4 or mxf4 |

Block scale (bits 10–9): .block16 (16-element block scaling) or .block32 (32-element block scaling). Not supported for f16, tf32, f8f6f4, or i8.

Weight stationary (bit 0): .ws flag. Incompatible with cta_group::2, mxf8f6f4, and FP4 types.

Sparsity (bit 5): Restricted for MXF4 and MXF4NVF4 types on arch-conditional variants only.

Scale input accumulator (bit 4): Scales the accumulator input. Only usable with f16 and tf32 types. Notably, this is NOT supported on the a sub-variants (sm_100a at v=1001, sm_103a at v=1033) but IS supported on base variants (sm_100 at v=1000, sm_103 at v=1030) and sm_120+.

CTA group (bit 1): cta_group::1 (clear) or cta_group::2 (set).

Collector modes (from sub_35F38B0): .collector::a::fill, .collector::a::use, .collector::a::lastuse, and .collector::b with ::ws sub-variants. Constraint: cannot use collector::a::use or collector::a::fill with the ashift modifier.

tcgen05.cp Copy Shapes

The copy instruction shape emission at sub_35F5090 supports:

| Shape | Bits 3–1 Value |
|---|---|
| .128x256b | 0 |
| .4x256b | 1 |
| .128x128b | 2 |
| .64x128b | 3 |
| .32x128b | 4 |

Destination format modifiers: .b8x16 (base), .b6x16_p32 (6-bit with 32-bit padding), .b4x16_p64 (4-bit with 64-bit padding).

Multicast modes: .warpx2::02_13 (warp pairs 0,2 and 1,3), .warpx2::01_23 (warp pairs 0,1 and 2,3), .warpx4 (all 4 warps).

cvt_packfloat — Extended Numeric Formats

The cvt_packfloat intrinsic (sub_304FBD0 for validation, sub_35ED820 for emission) has a base requirement of SM >= 90 and PTX >= 78. Blackwell adds four new types:

| Case | Type | Generation |
|---|---|---|
| 0 | .f32 | sm_90+ |
| 1 | .f16x2 | sm_90+ |
| 2 | .e4m3x2 (FP8 E4M3) | sm_90+ |
| 3 | .e5m2x2 (FP8 E5M2) | sm_90+ |
| 4 | .bf16x2 (BFloat16) | sm_90+ |
| 5 | .e2m1x2 (FP4 E2M1) | sm_100+ |
| 6 | .e2m3x2 (FP6 E2M3) | sm_100+ |
| 7 | .e3m2x2 (FP6 E3M2) | sm_100+ |
| 8 | .ue8m0x2 (UE8M0 scale) | sm_100+ |

The ue8m0x2 type is gated by both arch-conditional and family-conditional paths, while {fp6/fp4}x2 types (e2m1x2, e2m3x2, e3m2x2) are arch-conditional only.
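
The per-case minimum-SM gating from the table above can be captured as a lookup; the table and function names are ours:

```python
# Sketch of the cvt_packfloat case gating described above: the base
# requirement is SM >= 90 (and PTX >= 78); cases 5-8 additionally
# require a Blackwell (sm_100+) target.
CVT_PACKFLOAT_MIN_SM = {
    0: 90, 1: 90, 2: 90, 3: 90, 4: 90,  # f32/f16x2/e4m3x2/e5m2x2/bf16x2
    5: 100, 6: 100, 7: 100, 8: 100,     # e2m1x2/e2m3x2/e3m2x2/ue8m0x2
}

def cvt_packfloat_case_allowed(case: int, sm_major: int) -> bool:
    return (case in CVT_PACKFLOAT_MIN_SM
            and sm_major >= CVT_PACKFLOAT_MIN_SM[case])
```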

tcgen05 Commit with Mbarrier

The commit modifier emission at sub_35F4E30 combines tensor core commit with mbarrier synchronization:

  • .cta_group::1 / .cta_group::2 — Group selection
  • .mbarrier::arrive::one — Mbarrier arrive modifier
  • .shared::cluster — Shared memory cluster scope
  • .multicast::cluster — Multicast cluster scope

sm_100 vs. sm_103 Differences

Both families share the full tcgen05 ISA. Observable differences in cicc:

  • __CUDA_ARCH: 1000 vs. 1030
  • Tensor core operand range: sm_103 may handle wider operand loops (offset 760 vs. 600 for simpler variants in cases 10303/10308)
  • Scale input accumulator: Not available on a sub-variants of either family

No sm_103-specific feature gates exist beyond the __CUDA_ARCH value. Hardware differences between B200 and GB300 are resolved at the ptxas level.

Feature Flag Configuration

At the sm_100+ threshold (qword_4F077A8 > 109999), the master configurator sub_60E7C0 enables:

| Flag | Condition |
|---|---|
| unk_4D04184 | Unconditional |
| unk_4D04800 | Requires CUDA mode + C++20 |
| dword_4D041AC | Guarded by byte_4CF8172 |

Key Binary Locations

| Function | Address | Role |
|---|---|---|
| sub_CD09E0 | 0xCD09E0 | NVVM arch enum (all Blackwell constants) |
| sub_1C1B150 | 0x1C1B150 | Second arch enum copy (LLVM module metadata) |
| sub_30462A0 | 0x30462A0 | tcgen05 intrinsic handler (alloc/dealloc/commit/fence/wait/cp) |
| sub_304E6C0 | 0x304E6C0 | tcgen05.mma intrinsic handler + SelectionDAG lowering |
| sub_36E9630 | 0x36E9630 | tcgen05.mma validation + ISD opcode selection |
| sub_304FBD0 | 0x304FBD0 | cvt_packfloat intrinsic handler |
| sub_35ED820 | 0x35ED820 | cvt_packfloat type string emission |
| sub_35F3330 | 0x35F3330 | tcgen05.mma modifier emission (kind, scale, cta_group) |
| sub_35F38B0 | 0x35F38B0 | tcgen05.mma modifier emission (ashift, collector) |
| sub_35F4E30 | 0x35F4E30 | tcgen05 commit modifier emission |
| sub_35F5090 | 0x35F5090 | tcgen05.cp shape/format emission |
| sub_95EB40 | 0x95EB40 | CLI arch string mapping |
| sub_617BD0 | 0x617BD0 | compute_NNN string parsing |
| ctor_605 | 0x584510 | Processor variant string table |
| ctor_356 | 0x50C890 | LLVM processor description table |

Blackwell (sm120) — Consumer and Enterprise (sm_120, sm_121)

The sm_120 family targets the consumer RTX 50-series and enterprise RTX Blackwell Pro GPUs. Despite sharing the "Blackwell" marketing name with sm_100, the sm_120 microarchitecture is a distinct design — a chimera of Hopper and Ada Lovelace silicon, with fundamentally different tensor core hardware. sm_121 targets DGX Spark.

Critical architectural difference: sm_120 does NOT have tcgen05 tensor core instructions. The tcgen05 arch-conditional gate in cicc (sub_30462A0, sub_304E6C0, sub_36E9630) reads SmVersion at offset +0x154 and performs:

```
if (SmVersion > 1032):            // above sm_103f
    if ((SmVersion - 1101) > 1):  // only 1101 (sm_110a) and 1102 (sm_110f) pass
        → ERROR "tcgen05 supported only on arch-conditional..."
```

sm_120's SmVersion is 1200, so 1200 - 1101 = 99 > 1 and the intrinsic is rejected by cicc itself, not by ptxas. The values 1101/1102 correspond to sm_110a/sm_110f (Jetson Thor), confirming that Jetson Thor retains tcgen05/TMEM hardware while consumer Blackwell does not.

The upstream LLVM 22 NVPTX backend (NVPTXSubtarget.h) independently confirms this: hasTcgen05InstSupport() lists only {100, 110}, and hasMMABlockScale() lists only {120}.

The complete tcgen05 acceptance list from cicc's binary (all three gate functions use identical logic):

| SmVersion | Target | tcgen05 |
|---|---|---|
| 1001 | sm_100a | Allowed (bitmask bit 0) |
| 1002 | sm_100f | Allowed (bitmask bit 1) |
| 1011 | sm_101a | Allowed (bitmask bit 10) |
| 1012 | sm_101f | Allowed (bitmask bit 11) |
| 1031 | sm_103a | Allowed (bitmask bit 30) |
| 1032 | sm_103f | Allowed (bitmask bit 31) |
| 1101 | sm_110a | Allowed ((v-1101) <= 1) |
| 1102 | sm_110f | Allowed ((v-1101) <= 1) |
| 1000, 1010, 1030, 1100 | base variants | Blocked (no suffix) |
| 1200–1212 | all sm_120/121 | Blocked (v-1101 > 1) |

From the user-visible feature perspective in cicc v13.0, sm_120 adds exactly two compiler-visible features beyond the shared Blackwell base: .offset.bindless texture intrinsics and 16-bit texture element type support.

Architecture Identity

NVIDIA's internal naming places sm_120/sm_121 squarely in the Blackwell family:

| NVVM Enum | Numeric Value | __CUDA_ARCH | Product |
|---|---|---|---|
| NVVM_ARCH_BLACKWELL_12_0 | 1200 | 1200 | RTX 50xx / RTX Blackwell Pro |
| NVVM_ARCH_BLACKWELL_12_1 | 1210 | 1210 | DGX Spark |

The hardware SM enum NVVM_ARCH_HW_SM_10_4 maps to value 1200, revealing that NVIDIA internally considers sm_120 as "SM 10.4" — a continuation of the Blackwell 10.x line rather than a distinct generation.

SM Variant Table

| Variant | __CUDA_ARCH | PTX Version | a flag | f flag |
|---|---|---|---|---|
| sm_120 | 1200 | 6 | 0 | 0 |
| sm_120a | 1200 | 7 | 1 | 0 |
| sm_120f | 1200 | 7 | 1 | 1 |
| sm_121 | 1210 | 6 | 0 | 0 |
| sm_121a | 1210 | 7 | 1 | 0 |
| sm_121f | 1210 | 7 | 1 | 1 |

The PTX version pattern is identical to sm_100: base variants use PTX 6, accelerated and forward-compatible variants use PTX 7. sm_120 does not require a higher PTX version than sm_100.

Suffix Behavior

For the sm_120 family, the a and f suffixes have no behavioral impact on compiler internals in cicc v13.0:

  • unk_4D045E4 (accelerated flag): Read in exactly one location (sub_6C4D80 line 167), but only for unk_4D045E8 == 90 — the sm_90a gate. The flag is never checked for sm_120.
  • unk_4D045E0 (forward-compatible flag): Set during CLI parsing, reset in sub_615CB0, but never read anywhere in the compiler logic.

The suffixes exist for forward-proofing, __CUDA_ARCH macro consistency (all sub-variants share the same value), and potential ptxas-level differentiation not visible in cicc.

SM 120 Exclusive Feature Gates

The entire cicc codebase contains exactly two locations gated on sm_120. Both check __CUDA_ARCH >= 1200 (i.e., the arch value field at offset +8 must exceed 1199).

Feature 1: .offset.bindless Texture Intrinsics

Frontend gate: sub_1C36530, line 2724. Backend gate: sub_2C7B6A0, line 2160.

When *(int*)(a1 + 8) <= 1199, the compiler emits: ".offset.bindless intrinsics are not supported on pre-Blackwell architectures". The error message is misleading — sm_100 IS Blackwell, yet .offset.bindless requires sm_120+. The message likely reflects an earlier internal naming convention or considers sm_120 the "true" consumer Blackwell.

The .offset.bindless intrinsics provide texture and surface operations using bindless handles with an additional offset parameter. This enables runtime-flexible texture resource indexing, indirect texture access via descriptor heaps, and offset-based resource aliasing within a descriptor pool.

68 intrinsic variants are classified by two functions:

  • Frontend: sub_1C303A0 — Checks three ID ranges:

    • Range 1: IDs 4419–4469 (26 IDs, odd numbers only)
    • Range 2: IDs 4722, 4725, 4726, 4731, 4734, 4736, 4739 (7 IDs)
    • Range 3: IDs 5085–5153 (35 IDs, odd numbers only)
  • Backend: sub_CEA320 — Checks corresponding backend intrinsic IDs

These 68 intrinsics cover the full matrix of texture dimensions (1D, 2D, 3D, cube, array variants), data types (i32, f32, and others), and operation types (sample, fetch, gather). The sm_120 gate means these intrinsics physically require sm_120 hardware — the texture unit changes needed for offset-based bindless addressing are not present on sm_100 silicon.

Feature 2: 16-bit Texture Element Types

Frontend gate: sub_1C36530, line 3381. Backend gate: sub_2C7B6A0, line 2386.

When *(int*)(a1 + 8) > 1199, 16-bit (f16) element types become legal for most texture intrinsics. The legalization logic at frontend line 3397:

type_legal = (elem_is_i8_or_i16_raw) || is_32bit(type) ||
             (is_16bit(type) && tex16_allowed_flag)

The tex16_allowed_flag differs by architecture:

  • sm < 120: True only for builtin ID 3811 (checked by sub_1C30390)
  • sm >= 120: True for all texture intrinsics except IDs 5116–5131 (checked by sub_1C30470 on frontend, sub_CEA3F0 for backend IDs 10462–10477)

This change reduces memory bandwidth requirements for texture operations on sm_120 by enabling native f16 texture reads without promotion to 32-bit.

sm_120 vs. sm_121

Both variants pass the same > 1199 gate. In cicc v13.0, there is no code path that differentiates sm_121 from sm_120. The only distinction is the __CUDA_ARCH macro value (1200 vs. 1210), which affects user-level #ifdef checks in CUDA source code.

sm_121 is a minor revision of sm_120, analogous to how sm_103 relates to sm_100 — both have different __CUDA_ARCH values but no compiler-internal behavioral difference beyond the macro.

Relationship to sm_100

What sm_120 Inherits from sm_100

sm_120 shares the Blackwell family identity and inherits most non-tensor-core features: Hopper cluster operations, TMA bulk copy, setmaxnreg, narrow FP conversion support (e2m3/e3m2/e2m1/ue8m0), tensormap.replace, and Blackwell ldstmatrix instructions.

What sm_120 Does NOT Have

sm_120 lacks the entire tcgen05 instruction family and its prerequisite Tensor Memory (TMEM) hardware:

  • No tcgen05.alloc / tcgen05.dealloc (no TMEM to allocate)
  • No tcgen05.mma (the async TMEM-based tensor core path)
  • No tcgen05.cp / tcgen05.commit / tcgen05.fence / tcgen05.wait
  • No tcgen05.relinquish.alloc

What sm_120 Has Instead

The sm_120 hardware extends the existing mma.sync instruction family (which has been the standard tensor core interface since Volta/sm_70) with new block_scale qualifiers and MX-format data types:

mma.sync.aligned.kind::mxf8f6f4.block_scale.scale_vec::1X.m16n8k32.row.col.f32.e4m3.e4m3.f32.ue8m0

This adds per-block MX-format scaling to the synchronous register-based MMA, supporting FP8 (e4m3, e5m2), FP6 (e3m2, e2m3), and FP4 (e2m1) operand types with ue8m0 scale factors. The tile shape is m16n8k32. Upstream LLVM 22 confirms this with hasMMABlockScale() returning true only for {120} and hasMMASparseBlockScaleF4() for {120, 121}.

The block_scale variant is restricted to TN layout (.row.col is hardcoded as a string literal in LLVM's tablegen — not parameterized, no NN/NT/TT variants exist). This is consistent with the broader mma.sync family where all post-Volta shapes are effectively TN-only (only the original m8n8k4 f16 from Volta supports all four layout combinations). By contrast, tcgen05.mma on sm_100/103/110 has no layout qualifier at all — data layout is implicit in the tensor memory descriptor (idesc).

cicc v13.0 does not yet emit mma.sync.block_scale for sm_120. The binary contains the string "nvvm.mma.blockscale currently supports non-sync aligned variants only!", confirming that block-scaled MMA is only available through the tcgen05 (async) path in this release — which sm_120 doesn't have access to. The mma.sync.block_scale support for sm_120 is present in upstream LLVM 22 and presumably coming in a future CUDA release (13.1+).

In cicc v13.0, sm_120 falls back to the standard HMMA/IMMA tensor core codegen inherited from sm_70–sm_90. The new Blackwell-generation tensor features (tcgen05 async path OR block_scale sync path) are both unavailable for sm_120 in this compiler version.

Tensor Core Instruction Timeline

| Generation | SM | Instruction | Memory Model |
|---|---|---|---|
| Volta/Turing | sm_70/75 | mma.sync (HMMA) | Register-to-register, synchronous |
| Ampere | sm_80 | mma.sync (extended shapes) | Register-to-register, synchronous |
| Hopper | sm_90 | wgmma.mma_async | Shared memory → registers, async warpgroup |
| Blackwell datacenter | sm_100/103/110 | tcgen05.mma | Tensor Memory (TMEM), fully async |
| Blackwell consumer | sm_120/121 | mma.sync.block_scale (LLVM 22+) | Register-to-register, synchronous + MX scaling |

sm_110 — Jetson Thor

sm_110 (Jetson Thor, for automotive and robotics SoCs) sits between sm_100 and sm_120 in the architecture numbering. Despite the higher SM number, sm_110 is architecturally a datacenter Blackwell derivative (originally sm_101 before rename) and retains tcgen05/TMEM support — the tcgen05 gate explicitly allows sm_110a (SmVersion 1101) and sm_110f (1102). It lacks sm_120's .offset.bindless and f16 texture features but has full tensor core parity with sm_100/sm_103.

| Variant | __CUDA_ARCH | PTX Version |
|---|---|---|
| sm_110 | 1100 | 6 |
| sm_110a | 1100 | 7 |
| sm_110f | 1100 | 7 |

Feature Flag Configuration

At the sm_120+ threshold (qword_4F077A8 > 119999), the master configurator sub_60E7C0 enables:

| Flag | Purpose |
|---|---|
| unk_4D047BC | Disabled (set to 0) for sm_120+; enabled for all lower architectures |
| unk_4D0428C | Enabled at sm_120+ |

The unk_4D047BC flag is unconditionally assigned based on SM <= 119999, making it the only flag that is actively disabled at sm_120+. This likely controls a legacy optimization or codegen path that is incompatible with sm_120 hardware.

Key Binary Locations

| Function | Address | Role |
|---|---|---|
| sub_CD09E0 | 0xCD09E0 | NVVM arch enum (NVVM_ARCH_BLACKWELL_12_0/12_1) |
| sub_95EB40 | 0x95EB40 | CLI arch string mapping |
| sub_617BD0 | 0x617BD0 | compute_NNN string parsing |
| ctor_605 | 0x584510 | Processor variant table (PTX versions) |
| ctor_356 | 0x50C890 | LLVM processor description table |
| sub_1C36530 | 0x1C36530 | Frontend verifier (.offset.bindless + f16 texture gates) |
| sub_2C7B6A0 | 0x2C7B6A0 | Backend verifier (.offset.bindless + f16 texture gates) |
| sub_1C303A0 | 0x1C303A0 | .offset.bindless intrinsic classifier (frontend) |
| sub_CEA320 | 0xCEA320 | .offset.bindless intrinsic classifier (backend) |
| sub_1C30470 | 0x1C30470 | f16 texture exclusion list (frontend) |
| sub_CEA3F0 | 0xCEA3F0 | f16 texture exclusion list (backend) |
| sub_6C4D80 | 0x6C4D80 | Accelerated flag reader (sm_90a only, not sm_120) |
| sub_615CB0 | 0x615CB0 | Forward-compatible flag reset |

NVVM IR Node Layout

The NVVM frontend in cicc v13.0 uses a custom intermediate representation distinct from LLVM's native IR. Each IR node is a variable-length structure allocated from a bump allocator, with operands stored backward from the node header pointer. The node uniquing infrastructure lives in sub_162D4F0 (49KB), which routes each opcode to a dedicated DenseMap inside the NVVM context object.

Node Header Layout

The pointer a1 returned from allocation points to the start of the fixed header. Operands are at negative offsets behind it.

| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
| +0 | 1B | uint8_t | opcode | Switch key in sub_162D4F0; values 0x04..0x22+ |
| +2 | 2B | uint16_t | subopcode | Intrinsic ID; read for opcodes 0x1C, 0x1D, 0x1E |
| +4 | 4B | -- | (padding) | Not accessed directly |
| +8 | 4B | uint32_t | num_operands | Controls operand access range |
| +16 | 8B | tagged_ptr | context_ptr | Low 3 bits are tag; mask with & ~7 for pointer |
| +24 | 8B | varies | extra_A | DWORD for opcodes 0x1A/0x1B; pointer for 0x10/0x22 |
| +28 | 4B | uint32_t | extra_B | Present for opcode 0x1B |
| +32 | 8B | varies | extra_C | Present for opcode 0x10 |
| +40 | 1B | uint8_t | extra_flag | Present for opcode 0x10 |

Minimum header size is 24 bytes. Total node allocation: 24 + 8 * num_operands bytes minimum, though opcode-specific extra fields extend the header region for certain node types.

Operand Storage

Operands are stored as 8-byte QWORD pointers at negative offsets from the header. The stride is exactly 8 bytes per operand. Access follows this pattern (decompiled from sub_162D4F0):

operand[k] = *(_QWORD *)(a1 + 8 * (k - num_ops))

For a node with num_operands = 3:

  • operand[0] is at a1 - 24
  • operand[1] is at a1 - 16
  • operand[2] is at a1 - 8

A 2-operand node occupies 40 bytes total (16 operand bytes + 24 header bytes). A node with opcode 0x1B and 5 operands requires approximately 88 bytes (40 operand bytes + ~48 header bytes including extra fields).

Tagged Pointer Semantics

The context_ptr at offset +16 uses low-bit tagging to encode indirection:

  • Bits [2:0] = 0: pointer is a direct reference to the context object.
  • Bit [2] = 1: pointer is an indirect reference (pointer-to-pointer).

The decompiled dereferencing pattern:

v = *(a1 + 16) & 0xFFFFFFFFFFFFFFF8;  // mask off tag bits
if (*(a1 + 16) & 4)                    // bit 2 set = indirect
    v = *v;                             // one extra dereference

This technique saves a field by encoding the indirection flag inside the pointer itself, relying on 8-byte alignment guarantees.

Opcode Dispatch Table

The uniquing function sub_162D4F0 performs a byte-level switch on *(_BYTE *)a1. Each case extracts the tagged context pointer, dereferences it, then probes an opcode-specific DenseMap for a structurally identical node.

Uniquing Opcode Dispatch (sub_162D4F0, 49KB)

The opcodes fall into two categories: "simple" opcodes that use sub-function tables at fixed stride, and "complex" opcodes that use dedicated DenseMap instances at individually-known offsets.

Simple opcodes (0x04--0x15) -- These 18 opcodes share a uniform dispatch pattern. Each routes to a sub-function table at a fixed byte offset within the context object, spaced 32 bytes apart:

| Opcode | Context Byte Offset | Semantic Category |
|---|---|---|
| 0x04 | +496 | Type / value constant |
| 0x05 | +528 | Binary operation |
| 0x06 | +560 | (simple node) |
| 0x07 | +592 | (simple node) |
| 0x08 | +624 | (simple node) |
| 0x09 | +656 | Undef / poison |
| 0x0A | +688 | (simple node) |
| 0x0B | +720 | (simple node) |
| 0x0C | +752 | (simple node) |
| 0x0D | +784 | Integer constant |
| 0x0E | +816 | FP constant |
| 0x0F | +848 | Constant expression |
| 0x10 | -- | Special: uses DenseMap at qw[178] |
| 0x11 | +912 | (simple node) |
| 0x12 | +944 | (simple node) |
| 0x13 | +976 | Struct / aggregate type |
| 0x14 | +1008 | (simple node) |
| 0x15 | +1072 | (simple node) |

Each sub-function table entry at these offsets is a 32-byte structure containing the callback address and metadata for hash-table probing.

Complex opcodes (0x16--0x22) -- These opcodes each own a full DenseMap within the context object. Each DenseMap occupies 4 qwords at the indicated base, plus associated dword counters:

| Opcode | QWord Base | Byte Offset | DenseMap Dwords | Identified Semantic |
|---|---|---|---|---|
| 0x16 | qw[130] | +1040 | dw[264..266] | Metadata node |
| 0x17 | -- | +1104 | -- | (simple-table path at +1104) |
| 0x18 | -- | +1136 | -- | Alloca (bitcode 0x18/0x58) |
| 0x19 | -- | -- | -- | Load |
| 0x1A | qw[146] | +1168 | dw[296..298] | Branch (br) |
| 0x1B | qw[150] | +1200 | dw[304..306] | Switch |
| 0x1C | qw[154] | +1232 | dw[312..314] | Invoke (reads subopcode) |
| 0x1D | qw[158] | +1264 | dw[320..322] | Unreachable / resume (reads subopcode) |
| 0x1E | qw[162] | +1296 | dw[328..330] | LandingPad (reads subopcode) |
| 0x1F | qw[166] | +1328 | dw[336..338] | Call instruction |
| 0x20 | -- | -- | -- | PHI node |
| 0x21 | -- | -- | -- | IndirectBr |
| 0x22 | qw[178] | +1424 | dw[360..362] | Special (extra_A = ptr) |

Opcodes 0x1C, 0x1D, and 0x1E read the subopcode field at *(unsigned __int16 *)(a1 + 2) as part of the hash key, because these node types require the intrinsic ID to distinguish structurally identical nodes with different semantic meaning.

Hash Function

Every DenseMap in the uniquing tables uses the same hash:

hash(ptr) = (ptr >> 9) ^ (ptr >> 4)

Hash computation for multi-operand nodes (sub_15B3480) extends this by combining the hash of each operand pointer with a mixing step. The hash seed is the opcode byte, then each operand is folded in:

seed ^= hash(operand[i]) + 0x9E3779B9 + (seed << 6) + (seed >> 2);

Sentinel values: empty = -8 (0xFFFFFFFFFFFFFFF8), tombstone = -16 (0xFFFFFFFFFFFFFFF0).

Node Erasure (sub_1621740, 14KB)

The mirror of insertion. Dispatches by the same opcode byte, finds the node in the corresponding DenseMap, overwrites the bucket with the tombstone sentinel (-16), and decrements NumItems while incrementing NumTombstones. When tombstone count exceeds NumBuckets >> 3, a rehash at the same capacity is triggered to reclaim tombstone slots.

Bitcode Instruction Opcode Table

NVIDIA uses LLVM's standard instruction opcode numbering with minor adjustments. The bitcode reader sub_166A310 / sub_151B070 (parseFunctionBody, 60KB/123KB) dispatches on a contiguous range. The NVVM verifier sub_2C80C90 confirms the mapping via its per-opcode validation switch:

| Opcode | Dec | LLVM Instruction | Verifier Checks |
|---|---|---|---|
| 0x0B | 11 | ret | -- |
| 0x0E | 14 | br | -- |
| 0x0F | 15 | switch | -- |
| 0x15 | 21 | invoke | "invoke" unsupported via sub_2C76F10 |
| 0x18 | 24 | alloca | Alignment <= 2^23; AS must be Generic |
| 0x19 | 25 | load | -- |
| 0x1A | 26 | br (cond) | Validates "Branch condition is not 'i1' type!" |
| 0x1B | 27 | switch (extended) | -- |
| 0x1C | 28 | invoke (extended) | -- |
| 0x1D | 29 | unreachable | -- |
| 0x1E | 30 | resume | -- |
| 0x1F | 31 | call | Pragma metadata validation |
| 0x20 | 32 | phi | -- |
| 0x21 | 33 | indirectbr | "indirectbr" unsupported |
| 0x22 | 34 | call (variant) | Validates callee type signature |
| 0x23 | 35 | resume (verifier) | "resume" unsupported |
| 0x23--0x34 | 35--52 | Binary ops (add/sub/mul/div/rem/shift/logic) | -- |
| 0x35--0x38 | 53--56 | Casts (trunc/zext/sext/fpcast) | -- |
| 0x3C | 60 | alloca | Alignment and address-space checks |
| 0x3D | 61 | load | Atomic loads rejected; tensor memory AS rejected |
| 0x3E | 62 | store | Atomic stores rejected; tensor memory AS rejected |
| 0x40 | 64 | fence | Only acq_rel/seq_cst in UnifiedNVVMIR mode |
| 0x41 | 65 | cmpxchg | Only i32/i64/i128; must be generic/global/shared AS |
| 0x42 | 66 | atomicrmw | Address space validation |
| 0x4F | 79 | addrspacecast | "Cannot cast non-generic to different non-generic" |
| 0x55 | 85 | Intrinsic call | Routes to sub_2C7B6A0 (143KB verifier) |
| 0x58 | 88 | alloca (inalloca) | Same as 0x18 |
| 0x5F | 95 | landingpad | "landingpad" unsupported |

The binary opcodes in the 0x23--0x34 range follow LLVM's BinaryOperator numbering:

| Opcode | Dec | Operation | IRBuilder Helper |
|---|---|---|---|
| 0x23 | 35 | add | -- |
| 0x24 | 36 | fadd | -- |
| 0x25 | 37 | sub | -- |
| 0x26 | 38 | fsub | -- |
| 0x27 | 39 | mul | -- |
| 0x28 | 40 | fmul | -- |
| 0x29 | 41 | udiv | -- |
| 0x2A | 42 | sdiv | -- |
| 0x2B | 43 | fdiv | -- |
| 0x2C | 44 | urem | -- |
| 0x2D | 45 | srem | -- |
| 0x2E | 46 | frem | -- |
| 0x2F | 47 | shl | -- |
| 0x30 | 48 | lshr | -- |
| 0x31 | 49 | ashr | -- |
| 0x32 | 50 | and | -- |
| 0x33 | 51 | or | -- |
| 0x34 | 52 | xor | -- |

InstCombine Internal Opcode Table

The InstCombine mega-visitor sub_10EE7A0 (405KB, the single largest function in cicc) uses a different opcode numbering -- the full LLVM Instruction::getOpcode() values rather than the bitcode record codes. These are accessed via sub_987FE0 (getOpcode equivalent). Key ranges observed:

| Opcode Range | LLVM Instructions |
|---|---|
| 0x0B | Ret |
| 0x0E | Br |
| 0x0F | Switch |
| 0x15 | Invoke |
| 0x1A | Unreachable |
| 0x3F | FNeg |
| 0x41--0x43 | Add, FAdd, Sub |
| 0x99 | GetElementPtr |
| 0xAA | Trunc |
| 0xAC--0xAE | ZExt, SExt, FPToUI |
| 0xB4--0xB5 | PtrToInt, IntToPtr |
| 0xCF--0xD2 | ICmp, FCmp, PHI, Call |
| 0xE3--0xEB | VAArg, ExtractElement, InsertElement, ShuffleVector, ExtractValue, InsertValue |
| 0x11A | Fence |
| 0x11D | AtomicCmpXchg |
| 0x125 | AtomicRMW |
| 0x134--0x174 | FPTrunc, FPExt, UIToFP, Alloca, Load, Store, FMul, UDiv, SDiv, ... |
| 0x17D--0x192 | BitCast, Freeze, LandingPad, CatchSwitch, CatchRet, CallBr, ... |
| 0x2551, 0x255F, 0x254D | NVIDIA custom intrinsic operations |

The NVIDIA custom opcodes (0x2551, 0x255F, 0x254D) are in a range far above standard LLVM and handle CUDA-specific operations (texture, surface, or warp-level ops encoded as custom IR nodes) that have no upstream LLVM equivalent.

NVVM Context Object

The context object referenced by context_ptr is a large structure of ~3,656 bytes (a size confirmed by the 97KB destructor sub_B76CB0, which tears down an object of that size) containing uniquing tables for every NVVM opcode, plus type caches, metadata interning tables, and allocator state.

Context Layout Overview

| Byte Offset | Size | Field | Description |
|---|---|---|---|
| +0..+200 | 200B | Core state | Module pointer, allocator, flags |
| +200 | 8B | vtable_0 | Points to unk_49ED3E0 |
| +224 | 8B | vtable_1 | Points to unk_49ED440 |
| +248 | 8B | vtable_2 | Points to unk_49ED4A0 |
| +272..+792 | 520B | Hash table array | 16 DenseMaps freed at stride 32 by destructor |
| +496..+1136 | 640B | Simple opcode tables | 18 sub-function tables, 32B each (opcodes 0x04..0x15) |
| +1040..+1424 | 384B | Complex opcode DenseMaps | Dedicated DenseMaps for opcodes 0x16..0x22 |
| +1424..+2800 | ~1376B | Extended tables | Additional hash tables, type caches, metadata maps |
| +2800..+3656 | ~856B | Allocator state | Bump allocator slabs, counters, statistics |

Simple Opcode Table Region (+496..+1136)

The 18 entries for opcodes 0x04 through 0x15 (plus a few extras) are 32-byte structures at fixed offsets:

struct SimpleOpcodeTable {
    void  *buckets;       // +0:  heap-allocated bucket array
    int32  num_items;     // +8:  live entry count
    int32  num_tombstones; // +12: tombstone count
    int32  num_buckets;   // +16: always power-of-2
    int32  reserved;      // +20: padding
    void  *callback;      // +24: hash-insert function pointer (or NULL)
};

Byte offsets increase monotonically: +496, +528, +560, +592, +624, +656, +688, +720, +752, +784, +816, +848, +880, +912, +944, +976, +1008, +1072, +1104, +1136.

Complex Opcode DenseMap Region (+1040..+1424)

Each DenseMap for a complex opcode occupies 4 qwords plus associated dword counters:

struct OpcodeUniqueMap {
    int64  num_entries;    // qw[N]:   includes tombstones
    void  *buckets;        // qw[N+1]: heap-allocated bucket array
    int32  num_items;      // dw[2*N + offset]: live entries
    int32  num_tombstones; // dw[2*N + offset + 1]: tombstone count
    int32  num_buckets;    // dw[2*N + offset + 2]: capacity (power-of-2)
};

Complete mapping:

| Opcode | qw Base | Byte Offset (qw) | dw Counters | Byte Offset (dw) |
|---|---|---|---|---|
| 0x16 | qw[130] | +1040 | dw[264..266] | +2112..+2120 |
| 0x1A | qw[146] | +1168 | dw[296..298] | +2368..+2376 |
| 0x1B | qw[150] | +1200 | dw[304..306] | +2432..+2440 |
| 0x1C | qw[154] | +1232 | dw[312..314] | +2496..+2504 |
| 0x1D | qw[158] | +1264 | dw[320..322] | +2560..+2568 |
| 0x1E | qw[162] | +1296 | dw[328..330] | +2624..+2632 |
| 0x1F | qw[166] | +1328 | dw[336..338] | +2688..+2696 |
| 0x10 | qw[178] | +1424 | dw[360..362] | +2880..+2888 |

Destructor (sub_1608300, 90KB)

The context destructor confirms the layout by freeing resources in order:

  1. Calls j___libc_free_0 on bucket pointers at offsets +272 through +792 (stride 32) -- frees all 16 simple opcode hash tables.
  2. Destroys sub-objects via sub_16BD9D0, sub_1605960, sub_16060D0 -- these tear down the complex DenseMap instances and any heap-allocated overflow chains.
  3. Releases vtable-referenced objects at offsets +200, +224, +248.

The separate LLVMContext destructor (sub_B76CB0, 97KB) frees 28+ hash tables from the full ~3,656-byte context structure, confirming that the uniquing tables are only part of the overall context.

Type Tag System

The context object's hash tables also serve as uniquing tables for type nodes. The byte at offset +16 in each IR node encodes the type tag (distinct from the opcode byte at +0):

| Type Tag | Meaning | Notes |
|---|---|---|
| 5 | Instruction / expression | Binary ops, comparisons |
| 8 | Constant aggregate | ConstantArray, ConstantStruct |
| 9 | Undef / poison | UndefValue |
| 13 | Integer constant | APInt at +24, bitwidth at +32 |
| 14 | FP constant | APFloat storage |
| 15 | Constant expression | ConstantExpr (GEP, cast, etc.) |
| 16 | Struct / aggregate type | Element list at +32 |
| 17 | MDTuple / metadata node | Metadata tuple |
| 37 | Comparison instruction | ICmp / FCmp predicate |

The type tag at +16 is used by InstCombine (sub_1743DA0) and many other passes to quickly classify nodes without reading the full opcode. The observed range is 5--75, considerably denser than standard LLVM's Value subclass IDs.

Instruction Creation Helpers

NVIDIA's LLVM fork provides a set of instruction creation functions that allocate nodes from the bump allocator, insert them into the appropriate uniquing table, and update use-lists. These are the core IR mutation API:

Primary Instruction Factories

| Address | Size | Signature | LLVM Equivalent |
|---|---|---|---|
| sub_B504D0 | -- | (opcode, op0, op1, state, 0, 0) | BinaryOperator::Create / IRBuilder::CreateBinOp |
| sub_B50640 | -- | (val, state, 0, 0) | Result-typed instruction / CreateNeg wrapper |
| sub_B51BF0 | -- | (inst, src, destTy, state, 0, 0) | IRBuilder::CreateZExtOrBitCast |
| sub_B51D30 | -- | (opcode, src, destTy, state, 0, 0) | CmpInst::Create / IRBuilder::CreateCast |
| sub_B52190 | -- | (...) | BitCastInst::Create |
| sub_B52260 | -- | (...) | GetElementPtrInst::Create (single-index) |
| sub_B52500 | -- | (...) | CastInst::Create with predicate |
| sub_B33D10 | -- | (ctx, intrinsicID, args, numArgs, ...) | IRBuilder::CreateIntrinsicCall |
| sub_BD2DA0 | -- | (80) | Instruction::Create (allocates 80-byte IR node) |
| sub_BD2C40 | -- | (72, N) | Instruction::Create (72-byte base, N operands) |

Opcode Constants for Creation

These numeric opcode values are passed as the first argument to sub_B504D0:

| Value | Operation | Example |
|---|---|---|
| 13 | Sub | sub_B504D0(13, a, b, ...) |
| 15 | FNeg / FSub variant | sub_B504D0(15, ...) |
| 18 | SDiv | sub_B504D0(18, ...) |
| 21 | FMul | sub_B504D0(21, ...) |
| 25 | Or | sub_B504D0(25, ...) |
| 26 | And | sub_B504D0(26, a, mask, ...) |
| 28 | Xor | sub_B504D0(28, ...) |
| 29 | Add | sub_B504D0(29, ...) |
| 30 | Sub | sub_B504D0(30, zero, operand) |
| 32 | Shl | sub_B504D0(32, ...) |
| 33 | AShr | sub_B504D0(33, ...) |
| 38 | And (FP context) | sub_B504D0(38, ...) |
| 40 | ZExt (via sub_B51D30) | sub_B51D30(40, source, resultType) |
| 49 | CastInst | sub_B51D30(49, src, destTy, ...) |

Node Builder / Cloner (sub_16275A0, 21KB)

The IR builder at sub_16275A0 creates new nodes by cloning operand lists from a source node, using the tagged pointer Use-list encoding described above. It dispatches to three specialized constructors:

| Address | Role |
|---|---|
| sub_1627350 | Multi-operand node create (MDTuple::get equivalent). Takes (ctx, operand_array, count, flag0, flag1). Called 463+ times from func-attrs and metadata passes. |
| sub_15B9E00 | Binary node create. Fixed 2-operand layout, minimal header. |
| sub_15C4420 | Variadic node create. Variable operand count, allocates backward operand storage. |

All three ultimately route through the uniquing function sub_162D4F0 to deduplicate structurally-identical nodes.

Infrastructure Functions

| Address | Call Count | Role |
|---|---|---|
| sub_1623A60 | 349x | IRBuilder::CreateBinOp or SCEV type extension |
| sub_1623210 | 337x | IRBuilder::CreateUnaryOp or SCEV use registration |
| sub_15FB440 | 276x | Create node with 5 args: (opcode, type, op1, op2, flags) |
| sub_161E7C0 | 463x | Node accessor / property query (most-called IR function) |
| sub_164B780 | 336x | Use-chain linked list manipulation |
| sub_1648A60 | 406x | Memory allocator: (size, alignment) |

Allocation

NVVM IR nodes are allocated from a slab-based bump allocator:

  • Slab growth: 4096 << (slab_index >> 7) -- exponential, capped at 4TB.
  • Alignment: 8 bytes (pointer aligned via (ptr + 7) & ~7).
  • Deallocation: no individual free; entire slabs are released at once.
  • Overflow: triggers a new slab via malloc().

This is the standard LLVM BumpPtrAllocator pattern, consistent with how upstream LLVM manages IR node lifetimes. The lack of per-node deallocation means the NVVM frontend cannot reclaim memory for dead nodes until the entire context is destroyed.

Cross-References

Function Map

| Function | Size | Role |
|---|---|---|
| sub_162D4F0 | 49KB | Node uniquing: lookup-or-insert, opcode dispatch |
| sub_1621740 | 14KB | Node erase from uniquing tables (tombstone writer) |
| sub_16275A0 | 21KB | IR builder / node cloner |
| sub_1627350 | -- | Multi-operand node create (MDTuple::get) |
| sub_15B9E00 | -- | Binary node create |
| sub_15C4420 | -- | Variadic node create |
| sub_15B3480 | -- | Hash computation for multi-operand nodes |
| sub_1608300 | 90KB | Context destructor (frees 20+ hash tables) |
| sub_B76CB0 | 97KB | LLVMContext destructor (~3,656-byte object) |
| sub_B504D0 | -- | BinaryOperator::Create / IRBuilder::CreateBinOp |
| sub_B50640 | -- | Result-typed instruction create / CreateNeg |
| sub_B51BF0 | -- | IRBuilder::CreateZExtOrBitCast |
| sub_B51D30 | -- | CmpInst::Create / IRBuilder::CreateCast |
| sub_B52190 | -- | BitCastInst::Create |
| sub_B52260 | -- | GetElementPtrInst::Create (single-index) |
| sub_B52500 | -- | CastInst::Create with predicate |
| sub_B33D10 | -- | IRBuilder::CreateIntrinsicCall |
| sub_BD2DA0 | -- | Instruction::Create (80-byte allocation) |
| sub_BD2C40 | -- | Instruction::Create (variable-size) |
| sub_72C9A0 | -- | create_empty_ir_node (204 callers, EDG front-end) |
| sub_1623A60 | -- | IR builder / node constructor (349x calls) |
| sub_1623210 | -- | IR builder / node constructor variant (337x calls) |
| sub_15FB440 | -- | Create node with 5 args (276x calls) |
| sub_161E7C0 | -- | Node accessor / property query (463x calls) |
| sub_166A310 | 60KB | BitcodeReader::parseFunctionBody (stock LLVM) |
| sub_151B070 | 123KB | parseFunctionBody (two-phase compilation path) |
| sub_9F2A40 | 185KB | parseFunctionBody (standalone libNVVM path) |
| sub_10EE7A0 | 405KB | InstCombinerImpl::visitInstruction (full opcode switch) |
| sub_F2CFA0 | -- | InstCombine master visit dispatcher |
| sub_2C80C90 | 51KB | NVVMModuleVerifier (per-opcode validation) |

Instruction Constraint Table (Pattern Database)

The instruction selection backend in cicc v13.0 uses a global constraint table to map target opcodes to their operand requirements. This table drives the sub_B612D0 constraint emission function (104KB), which consults a packed 16-bit word array to determine register classes and constraint patterns for each machine instruction. The constraint table is the single authoritative source of truth for every NVPTX MachineInstr's register requirements -- any reimplementation of the backend codegen must reproduce it exactly.

Global Table: word_3F3E6C0

The constraint table is a statically allocated array of 16-bit words in the .data section at address 0x3F3E6C0, indexed by (opcode - 1). Each entry packs two pieces of information into a single 16-bit word:

| Bits | Field | Meaning |
|---|---|---|
| Low byte (bits 0..7) | constraint_class | Index into the constraint switch (0x00..0xB2) |
| High byte (bits 8..15) | register_class_id | Target register class for the result |

The access pattern from sub_B612D0:

// sub_B612D0(a1, a2)  where a2 = MachineInstr opcode
v4 = HIBYTE(word_3F3E6C0[a2 - 1]);    // register class for output
switch (LOBYTE(word_3F3E6C0[a2 - 1]))  // constraint class -> switch case

There are exactly 179 distinct constraint classes (0x00 through 0xB2), each encoding a specific operand pattern for a category of instructions. Multiple opcodes can share the same constraint class if they have identical operand signatures.

Constraint Descriptor Layout

Each constraint descriptor is a stack-allocated array of 16-byte entries built within sub_B612D0's frame. The frame is approximately 0x160 bytes deep. Stack slots span [rsp-0x158] through [rsp-0x20]:

| Offset | Size | Field |
|---|---|---|
| +0 | 4B | constraint_kind (int32) |
| +4 | 4B | (padding / alignment) |
| +8 | 8B | value (int64: register class ID or operand reference) |

Entry stride: 16 bytes (8-byte aligned pairs of {int32 kind, int32 pad, int64 value}).

The constraint_kind values determine the role of each entry in the descriptor array:

| Kind | Meaning |
|---|---|
| -1 | Output/result operand (always the last entry in the array) |
| 0 | Input operand at position 0 |
| 1 | Input operand at position 1 |
| 2 | Input operand at position 2 |
| 3..N | Input operands at higher positions |

The output entry (kind = -1) carries the result register class. Input entries carry the register class constraint for each source operand. The maximum observed operand count is 17 (constraint class 0xB0, corresponding to opcode 176 in the table), requiring 18 descriptor entries = 288 bytes of stack space.

Register Class IDs

The register_class_id in the high byte maps to NVIDIA GPU register files. Values recovered from sub_A778C0 (register class constraint creator), sub_B5BA00 (register class set builder, 111 cases), and sub_2163730 (PTX emission naming):

These IDs are specific to the pattern database constraint system and differ from the 4-bit class tags used in register encoding (see Register Classes for vtable addresses, PTX types, prefixes, and encoded IDs).

| ID | Register Class | Width |
|---|---|---|
| 14 | Int32Regs (%r) | 32 bits |
| 22 | Int16Regs (%rs) | 16 bits |
| 24 | Int16HalfRegs (%h) | 16 bits (f16/bf16) |
| 27 | Int32HalfRegs (%hh) | 32 bits (v2f16/v2bf16) |
| 29 | (unidentified) | -- |
| 32 | (unidentified) | -- |
| 36 | (unidentified) | -- |
| 39 | (unidentified) | -- |
| 40 | Float32Regs (%f) | 32 bits |
| 41 | (unidentified) | -- |
| 43 | Float16Regs (%h, alias of Int16HalfRegs) | 16 bits |
| 50 | Int64Regs (%rd) | 64 bits |
| 51 | Float64Regs (%fd) | 64 bits |
| 52 | Int128Regs (%rq) | 128 bits |
| 67 | (unidentified) | -- |
| 72 | (unidentified) | -- |
| 76 | (unidentified) | -- |
| 78 | Int1Regs (%p) | 1 bit |
| 86 | SpecialRegs (internal-only, off_4A026E0) | varies |

IDs 29, 32, 36, 39, 41, 67, 72, 76 appear in the sub_B612D0 table but have not been definitively mapped to named register classes. They likely correspond to sub-register classes, tied-operand classes, or WMMA accumulator classes that cicc defines beyond the 9 primary classes documented in reference/register-classes.md.

Constraint Type Classification

A secondary classification table at byte_3F252E0 categorizes constraint entries into four families (recovered from the constraint merge/intersection logic in sub_A7A6D0):

| Classification Byte | Family | Applies To |
|---|---|---|
| 0x00 | Simple/scalar | Single-register operands; the vast majority of ALU constraints |
| 0x08 | Ordered | Operands with fixed positional requirements (tied operands) |
| 0x10 | Sized/ranged | Operands with explicit bit-width requirements (sub-register extracts) |
| 0x18 | Compound | Multi-register operands; types 86-97 in the classification table |

The merge function sub_A7A6D0 (7KB) performs set intersection across constraint families when two constraint sets must be unified (e.g., during register coalescing or inline asm constraint resolution). The "compound" family (0x18) covers instructions that require register pairs or wider groupings -- tensor core MMA instructions fall into this category.

Key Sub-Functions

The constraint emission pipeline involves these collaborating functions:

| Address | Size | Function | Purpose |
|---|---|---|---|
| sub_A778C0 | -- | createRegClassConstraint(a1, regclass, flags) | Build a register-class constraint entry; stores class ID in value field |
| sub_A77AD0 | -- | createAnyRegConstraint(a1, flags) | Build an "any register" constraint (unconstrained operand) |
| sub_A79C90 | -- | composeConstraints(a1, &desc, N) | Compose N descriptor entries into a single constraint record |
| sub_A7A6D0 | 7KB | mergeConstraints(a1, a2) | Merge/intersect two constraint sets using byte_3F252E0 classification |
| sub_B5BA00 | 21KB | createOutputConstraint(a1, regclass_id) | Build the output register constraint; 111-case switch on class ID |
| sub_A78010 | -- | emitConstraint(a1, &desc_array, N) | Emit the final constraint with N entries to the instruction descriptor |
| sub_B612D0 | 104KB | emitInstrConstraint(a1, opcode) | Top-level: lookup word_3F3E6C0, dispatch on constraint class, build and emit |

The sub_B5BA00 function (21KB) is itself a 111-case switch that translates register class IDs into the internal constraint representation. It produces the value field for output constraint entries. Its size suggests that it handles not just the 9 primary register classes but also sub-register classes, paired classes, and special accumulator classes for tensor operations.

Constraint Switch Structure

The 179-case switch in sub_B612D0 is the heart of the pattern database. Each case constructs a fixed sequence of constraint descriptors on the stack, then calls sub_A78010 to emit them. The cases can be organized into major families based on operand count and register class patterns.

Family 1: Unary Instructions (1 input, 1 output)

These are the simplest constraints: one input operand and one result. Two descriptor entries (32 bytes on stack). Representative constraint classes:

// Constraint class 0x01 — Unary ALU, same type in/out
// Example: MOV, NEG, NOT, ABS for Int32Regs
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x0E01  (class=0x01, regclass=14=Int32)
case 0x01:
    desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) }   // input[0]: same class as output
    desc[1] = { kind=-1, value=sub_B5BA00(a1, v4) }      // output: regclass from high byte
    sub_A78010(a1, desc, 2)

Constraint classes in this family include 0x01 through approximately 0x08, covering unary operations across all scalar register classes. The register class v4 (from the high byte) determines whether the instruction operates on Int32, Int64, Float32, Float64, Pred, or another class. The same constraint class is reused for multiple opcodes that share the same operand signature.
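The packing shown in the lookup comments (low byte = constraint class, high byte = register class) can be decoded mechanically. The sketch below just reproduces that split for the entry values quoted in this section:

```python
# Decode a word_3F3E6C0 entry: the low byte is the constraint class that
# selects the switch case in sub_B612D0, the high byte is the primary
# register class ID (v4 in the decompiled code).
def decode_entry(word):
    constraint_class = word & 0xFF
    reg_class = (word >> 8) & 0xFF
    return constraint_class, reg_class

# Entry values documented elsewhere in this section:
assert decode_entry(0x0E01) == (0x01, 14)  # unary ALU, Int32
assert decode_entry(0x320C) == (0x0C, 50)  # shift-like, Int64
assert decode_entry(0x4E10) == (0x10, 78)  # compare, Pred output
assert decode_entry(0x2818) == (0x18, 40)  # FMA, Float32
```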

Family 2: Binary ALU Instructions (2 inputs, 1 output)

The most common family. Three descriptor entries (48 bytes on stack). Covers all two-operand arithmetic and logic instructions:

// Constraint class 0x09 — Binary ALU, all same type
// Example: ADD, SUB, MUL, AND, OR, XOR for Int32Regs
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x0E09  (class=0x09, regclass=14=Int32)
case 0x09:
    desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) }   // input[0]: Int32
    desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) }   // input[1]: Int32
    desc[2] = { kind=-1, value=sub_B5BA00(a1, v4) }      // output:  Int32
    sub_A78010(a1, desc, 3)

Variants within this family differ in whether inputs are constrained to the same class as the output or to a different class. For instance, shift instructions constrain the shift amount (input[1]) to Int32 regardless of the data type of input[0]:

// Constraint class 0x0C — Binary with mixed types (shift-like)
// Example: SHL.b64, SHR.b64  (data=Int64, shift_amount=Int32)
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x320C  (class=0x0C, regclass=50=Int64)
case 0x0C:
    desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) }     // input[0]: Int64 (data)
    desc[1] = { kind=1, value=sub_A778C0(a1, 14, 0) }     // input[1]: Int32 (shift amount)
    desc[2] = { kind=-1, value=sub_B5BA00(a1, v4) }        // output:  Int64
    sub_A78010(a1, desc, 3)

Family 3: Comparison / Predicate-Producing Instructions (2 inputs, predicate output)

Comparison instructions produce a predicate register result regardless of the input type. Three descriptor entries:

// Constraint class 0x10 — Compare, predicate output
// Example: SETP.EQ.s32, SETP.LT.f32
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x4E10  (class=0x10, regclass=78=Pred)
case 0x10:
    desc[0] = { kind=0, value=sub_A778C0(a1, <input_class>, 0) }  // input[0]: operand type
    desc[1] = { kind=1, value=sub_A778C0(a1, <input_class>, 0) }  // input[1]: operand type
    desc[2] = { kind=-1, value=sub_B5BA00(a1, 78) }               // output: Pred (%p)
    sub_A78010(a1, desc, 3)

The input register class is determined by the instruction variant (integer comparison vs. float comparison), while the output is always predicate register class 78.

Family 4: Ternary / FMA Instructions (3 inputs, 1 output)

Fused multiply-add and select instructions require four descriptor entries (64 bytes on stack):

// Constraint class 0x18 — Ternary FMA, all same float type
// Example: FMA.RN.f32 (a * b + c)
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x2818  (class=0x18, regclass=40=Float32)
case 0x18:
    desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) }   // input[0]: Float32 (a)
    desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) }   // input[1]: Float32 (b)
    desc[2] = { kind=2, value=sub_A778C0(a1, v4, 0) }   // input[2]: Float32 (c)
    desc[3] = { kind=-1, value=sub_B5BA00(a1, v4) }      // output:  Float32 (result)
    sub_A78010(a1, desc, 4)

Select/conditional-move instructions also fall here, with one predicate input and two data inputs:

// Constraint class 0x1A — Select (pred, trueval, falseval)
// Example: SELP.b32 (predicated select)
case 0x1A:
    desc[0] = { kind=0, value=sub_A778C0(a1, 78, 0) }   // input[0]: Pred (condition)
    desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) }   // input[1]: data (true value)
    desc[2] = { kind=2, value=sub_A778C0(a1, v4, 0) }   // input[2]: data (false value)
    desc[3] = { kind=-1, value=sub_B5BA00(a1, v4) }      // output:  data (selected)
    sub_A78010(a1, desc, 4)

Family 5: Memory Instructions (load/store with address operands)

Load instructions produce a data result from an address operand. Store instructions consume both data and address. These constraint classes handle the different address space qualifiers and vector widths:

// Constraint class 0x20 — Scalar load from address
// Example: LD.GLOBAL.b32 (global memory load)
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x0E20  (class=0x20, regclass=14=Int32)
case 0x20:
    desc[0] = { kind=0, value=sub_A778C0(a1, 50, 0) }   // input[0]: Int64 (address pointer)
    desc[1] = { kind=-1, value=sub_B5BA00(a1, v4) }      // output:  Int32 (loaded data)
    sub_A78010(a1, desc, 2)

Vector load variants (LoadV2, LoadV4) use additional output entries for each vector lane:

// Constraint class 0x22 — Vector load V2 (two-element)
// Example: LD.GLOBAL.V2.b32 (load 2x Int32)
case 0x22:
    desc[0] = { kind=0, value=sub_A778C0(a1, 50, 0) }   // input[0]: Int64 (address)
    desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) }   // input[1]: (offset/predicate)
    desc[2] = { kind=-1, value=sub_B5BA00(a1, v4) }      // output:  data element 0
    // Second output encoded separately via sub_A79C90 composition
    sub_A78010(a1, desc, 3)

Store instructions produce no data result; the kind = -1 slot is still emitted, but it carries the SpecialRegs chain/token class rather than a data register class:

// Constraint class 0x28 — Scalar store
// Example: ST.GLOBAL.b32 (global memory store)
case 0x28:
    desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) }   // input[0]: data to store
    desc[1] = { kind=1, value=sub_A778C0(a1, 50, 0) }   // input[1]: Int64 (address)
    desc[2] = { kind=-1, value=sub_B5BA00(a1, 86) }      // output:  SpecialRegs (chain/token)
    sub_A78010(a1, desc, 3)

Family 6: Type Conversion Instructions (input and output differ)

Conversion instructions have an input class that differs from the output class. The constraint class encodes the specific pair:

// Constraint class 0x30 — CVT from Int32 to Float32
// Example: CVT.RN.f32.s32
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x2830  (class=0x30, regclass=40=Float32)
case 0x30:
    desc[0] = { kind=0, value=sub_A778C0(a1, 14, 0) }   // input[0]: Int32 (source)
    desc[1] = { kind=-1, value=sub_B5BA00(a1, 40) }      // output:  Float32 (result)
    sub_A78010(a1, desc, 2)
// Constraint class 0x32 — CVT from Float64 to Int64
// Example: CVT.RTZ.s64.f64
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x3232  (class=0x32, regclass=50=Int64)
case 0x32:
    desc[0] = { kind=0, value=sub_A778C0(a1, 51, 0) }   // input[0]: Float64 (source)
    desc[1] = { kind=-1, value=sub_B5BA00(a1, 50) }      // output:  Int64 (result)
    sub_A78010(a1, desc, 2)

Widening/narrowing conversions between integer sizes and float-to-half conversions each have their own constraint class.

Family 7: Copy / Move Instructions (register transfer)

The copy family (opcodes 440-503) maps to constraint classes that encode same-class and cross-class register transfers:

// Constraint class 0x40 — Same-class copy
// Example: MOV.b32  (Int32 -> Int32)
// Used by opcodes 440-443 (type-preserving moves)
case 0x40:
    desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) }   // input[0]: same class
    desc[1] = { kind=-1, value=sub_B5BA00(a1, v4) }      // output:  same class
    sub_A78010(a1, desc, 2)
// Constraint class 0x42 — Cross-class copy (Int32 <-> Float32)
// Example: MOV from Int32Regs to Float32Regs (bitcast-level move)
// Used by opcodes 444+ (cross-class moves)
case 0x42:
    desc[0] = { kind=0, value=sub_A778C0(a1, <source_class>, 0) }   // input: source class
    desc[1] = { kind=-1, value=sub_B5BA00(a1, <dest_class>) }        // output: dest class
    sub_A78010(a1, desc, 2)

Cross-class copies are never coalesced by the register coalescer (they remain as explicit mov instructions in PTX output). The constraint table enforces this by assigning distinct source and destination classes.

Family 8: Call ABI Instructions (parameter declaration and passing)

The NVPTX calling convention uses special opcodes for .param space management. These have unique constraint classes with no data register operands:

// Constraint class 0x50 — DeclareParam (opcode 505)
// Declares a .param space allocation for function argument passing
case 0x50:
    desc[0] = { kind=0, value=sub_A77AD0(a1, 0) }       // input[0]: "any" (chain token)
    desc[1] = { kind=-1, value=sub_B5BA00(a1, 86) }      // output:  SpecialRegs (chain)
    sub_A78010(a1, desc, 2)

Call sequence opcodes (315=CallSeqBegin, 514=CallStart, 517=CallSeqEnd, 518=CallProto) all use constraint classes that operate on chain tokens rather than data registers. Their inputs and outputs are in the SpecialRegs class (ID 86).

Family 9: Atomic Instructions (address + data + result)

Atomic operations require an address, a data operand, and produce a result of the same data type:

// Constraint class 0x60 — Atomic RMW (read-modify-write)
// Example: ATOM.ADD.s32 (atomic add on Int32)
// Opcodes 294-297 (atom.add family)
case 0x60:
    desc[0] = { kind=0, value=sub_A778C0(a1, 50, 0) }   // input[0]: Int64 (address)
    desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) }   // input[1]: data (value to add)
    desc[2] = { kind=-1, value=sub_B5BA00(a1, v4) }      // output:  data (old value)
    sub_A78010(a1, desc, 3)

Atomic compare-and-swap (opcode 462 = atom.cas) requires four operands (address, expected, desired, result):

// Constraint class 0x62 — Atomic CAS
// Example: ATOM.CAS.b32 (compare-and-swap)
case 0x62:
    desc[0] = { kind=0, value=sub_A778C0(a1, 50, 0) }   // input[0]: Int64 (address)
    desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) }   // input[1]: data (expected)
    desc[2] = { kind=2, value=sub_A778C0(a1, v4, 0) }   // input[2]: data (desired)
    desc[3] = { kind=-1, value=sub_B5BA00(a1, v4) }      // output:  data (old value)
    sub_A78010(a1, desc, 4)

Family 10: Tensor Core / MMA Instructions (many inputs, many outputs)

The most complex constraint classes handle tensor core matrix operations. These instructions consume multiple register-pair or register-quad operands and produce multiple results. Constraint class 0xB0 is the extreme case with 17 input operands:

// Constraint class 0xB0 — Complex MMA (17 inputs, 1+ outputs)
// Example: tcgen05.mma variants (Blackwell, opcodes 4905-4940)
// This is the maximum-operand constraint class.
case 0xB0:
    for (i = 0; i < 17; i++) {
        desc[i] = { kind=i, value=sub_A778C0(a1, <operand_class[i]>, 0) }
    }
    desc[17] = { kind=-1, value=sub_B5BA00(a1, v4) }
    sub_A78010(a1, desc, 18)

HMMA/IMMA/BMMA instructions (the SM70+ tensor core families at sub_21E0360-sub_21E2280) use constraint classes in the 0x90-0xAF range, typically with 4-8 register inputs (accumulator fragments) and 4-8 register outputs. The operand classes include Int32HalfRegs (ID 27) for packed f16 pairs and Int128Regs (ID 52) for wide accumulator state.

Family 11: Predicated Instructions (extra predicate input)

Many NVPTX instructions support predication, where execution is conditional on a predicate register. Predicated variants append an extra Pred-class input:

// Constraint class 0x70 — Predicated binary ALU
// Example: @%p0 ADD.s32 %r1, %r2, %r3  (conditional add)
case 0x70:
    desc[0] = { kind=0, value=sub_A778C0(a1, 78, 0) }   // input[0]: Pred (guard)
    desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) }   // input[1]: data (src0)
    desc[2] = { kind=2, value=sub_A778C0(a1, v4, 0) }   // input[2]: data (src1)
    desc[3] = { kind=-1, value=sub_B5BA00(a1, v4) }      // output:  data (result)
    sub_A78010(a1, desc, 4)

Family 12: Special / Barrier Instructions (chain-only)

Barrier and synchronization instructions have no data operands. They operate purely on the chain token for ordering:

// Constraint class 0x80 — Barrier/Fence (chain-only)
// Example: BAR.SYNC (opcodes 287-290)
case 0x80:
    desc[0] = { kind=0, value=sub_A77AD0(a1, 0) }       // input[0]: "any" (chain in)
    desc[1] = { kind=-1, value=sub_B5BA00(a1, 86) }      // output:  SpecialRegs (chain out)
    sub_A78010(a1, desc, 2)

Pattern Matching Dispatch

The constraint table is consumed during instruction selection by a dispatch hierarchy of five collaborating functions:

  1. Driver (sub_3090F90, 91KB): Builds a cost table for function arguments via hash(key*37), uses a min-heap priority queue for topological-order traversal, iterates with budget = 4 * numInstructions * maxBlockSize.

  2. Matcher (sub_308FEE0): Called per-SDNode from the driver. Dispatches to the hand-written selector or the TableGen-generated selector.

  3. Hand-written selector (sub_347A8D0, 309KB): Giant switch on ISD/NVPTXISD opcodes. Calls sub_969240 (SDNode accessor) 263 times. Recursive with 42 self-calls. Handles tex/surf, wmma, atomics, barriers.

  4. TableGen-generated selector (sub_348D3E0, 256KB): Auto-generated from NVPTX .td instruction pattern definitions. Calls sub_969240 45 times, sub_32889F0 38 times.

  5. Complex addressing mode selector (sub_33D4EF0, 114KB): Handles NVPTX load/store addressing with address space qualifiers. Calls sub_969240 399 times -- the single function with the most SDNode accesses in the entire binary.

After pattern matching selects a MachineInstr opcode, the constraint table is queried via sub_B612D0 to determine register requirements. The selected opcode, minus one, is the index into word_3F3E6C0.

Operand Binding

When the constraint emission function sub_B612D0 builds the descriptor array, operand binding follows this protocol:

  1. Lookup: Read word_3F3E6C0[opcode - 1]. Extract constraint_class (low byte) and register_class_id (high byte, stored as v4).

  2. Switch dispatch: Branch to the case for constraint_class.

  3. Input construction: For each input operand position i:

    • Call sub_A778C0(a1, class_id, flags) to create a register-class constraint entry.
    • The class_id is either v4 (same class as output) or a hardcoded value (different class for mixed-type instructions).
    • The flags parameter encodes operand modifiers (tied, early-clobber, etc.).
    • Store the result in desc[i] with kind = i.
  4. Output construction: Call sub_B5BA00(a1, v4) to create the output constraint.

    • sub_B5BA00 is a 21KB function with 111 switch cases that translates the register class ID into the internal output representation.
    • Store in desc[N] with kind = -1.
  5. Emission: Call sub_A78010(a1, desc, N+1) to finalize. This function walks the descriptor array, validates constraint consistency, and writes the constraint record into the instruction's operand descriptor table.

For instructions that use sub_A77AD0 ("any register" constraint), the operand accepts any register class. This is used for chain tokens, inline asm operands with unconstrained registers, and certain special-purpose slots.

For composition of multi-output instructions, sub_A79C90 merges multiple descriptor sub-arrays into a single compound constraint. This is needed for vector loads (LoadV2, LoadV4) and MMA instructions that produce multiple result registers.
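The five binding steps above can be condensed into a small model. The descriptor contents are simplified stand-ins for what sub_A778C0 / sub_B5BA00 actually produce, and the example table entry is hypothetical:

```python
# Model of sub_B612D0's operand binding for a binary ALU opcode (class 0x09).
# kind >= 0 marks input position i; kind == -1 marks the output slot.
def build_binary_alu_constraints(opcode, table):
    word = table[opcode - 1]            # step 1: word_3F3E6C0 lookup
    constraint_class = word & 0xFF      # low byte -> switch dispatch
    v4 = (word >> 8) & 0xFF             # high byte -> register class
    assert constraint_class == 0x09
    desc = [
        {"kind": 0, "reg_class": v4},   # step 3: input[0] via sub_A778C0
        {"kind": 1, "reg_class": v4},   # step 3: input[1] via sub_A778C0
        {"kind": -1, "reg_class": v4},  # step 4: output via sub_B5BA00
    ]
    return desc                         # step 5: sub_A78010(a1, desc, 3)

table = [0x0E09]                        # hypothetical: opcode 1 is an Int32 ADD
descs = build_binary_alu_constraints(1, table)
assert [d["kind"] for d in descs] == [0, 1, -1]
assert all(d["reg_class"] == 14 for d in descs)
```

Mixed-type classes (e.g. 0x0C shifts) differ only in that some inputs use a hardcoded class ID instead of v4.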

Allocation

The global table word_3F3E6C0 is in the .data section, allocated at link time. It is read-only after cicc process startup. Constraint descriptors are purely stack-allocated within sub_B612D0's frame (approximately 0x160 bytes deep). No heap allocation occurs during constraint emission. This makes the constraint emission path allocation-free and safe for use in concurrent compilation (the function is reentrant as long as each thread has its own stack frame).

Cross-References

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| createRegClassConstraint | sub_A778C0 | -- | Build register-class input constraint entry |
| createAnyRegConstraint | sub_A77AD0 | -- | Build unconstrained ("any") input constraint |
| composeConstraints | sub_A79C90 | -- | Merge N descriptor entries into compound constraint |
| mergeConstraints | sub_A7A6D0 | 7KB | Set-intersection of constraints using byte_3F252E0 |
| emitConstraint | sub_A78010 | -- | Finalize and emit constraint record |
| createOutputConstraint | sub_B5BA00 | 21KB | 111-case switch: class ID to output representation |
| emitInstrConstraint | sub_B612D0 | 104KB | Top-level: 179-case constraint class dispatch |
| decodeOperandType | sub_B6B200 | 44KB | 101-case operand type decoder from bytecode stream |

SelectionDAG Node Structure

The SelectionDAG (SDNode) is the central data structure in cicc's code generation backend. Nodes represent operations in the target-independent DAG before instruction selection lowers them to machine instructions. The DAG builder (sub_2081F00, 267KB) converts LLVM IR into an initial DAG by visiting each IR instruction through a dispatch chain rooted at sub_2065D30. Nodes are deduplicated via a CSE hash table (sub_F4CEE0, 41KB) and allocated from a bump allocator embedded in the builder context object. The complete SelectionDAG pipeline then runs type legalization, operation legalization, DAG combining, and instruction selection over this graph before emitting PTX machine instructions.

SDNode Layout (104 Bytes, Two Views)

Every SDNode is allocated as exactly 104 bytes, hardcoded in sub_163D530. After allocation, all fields are zeroed. Two complementary views of the layout have been recovered: the "allocator view" from the zeroing pattern in sub_163D530, and the "accessor view" from field access patterns across the combiner (sub_F20C20), legalization (sub_1FFB890), and known-bits engine (sub_33D4EF0).

Allocator View (from sub_163D530)

The raw 104 bytes are zeroed via a combination of qword and dword stores:

qw[0..5] = 0, dw[6] = 0, qw[8..10] = 0, dw[11] = 0, byte[96] = 0

The statistics counter at context offset +96 is incremented by 104 for every allocation: *(_QWORD *)(v4 + 96) += 104LL.

Accessor View (Composite from Combiner, Legalizer, KnownBits)

The following table reconciles field accesses across sub_F20C20 (DAG combiner visitor), sub_1FFB890 (LegalizeOp), sub_33D4EF0 (computeKnownBits, 114KB), and sub_1FCE100 (LegalizeOp dispatcher):

| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| +0 | 8B | SDNode* | chain_next / first operand value | D03: *(qword*)(N+0) used as first operand in single-operand patterns |
| +4 | 4B | uint32_t | NumOperands_packed | D03: *(dword*)(N+4) & 0x7FFFFFF = NumOperands (low 27 bits); bits 27--30 = flags; bit 30 (0x40 in byte +7) = hasChainOps |
| +7 | 1B | uint8_t | node_flags_byte | D03: bit 4 = hasDebugLoc; bit 6 = hasChainPtr (operand list at N-8) |
| +8 | 8B | SDVTList* | VTList / ValueType pointer | D03: *(qword*)(N+8) = result value type descriptor; D05: read for MVT extraction |
| +16 | 8B | SDUse* | UseList | D03: head of use-def chain (doubly-linked list) |
| +24 | 4B | uint16_t | opcode | D02: *(uint16_t*)(node+24) = SDNode::getOpcode(); D05: *(a3+24) switched upon |
| +28 | 4B | uint32_t | opcode_flags | D05: *(a3+28) = sub-flags (nsw/nuw/exact bits) |
| +32 | 8B | SDUse* | operand_list | D02: *(node+32) = pointer to first operand SDUse; operand stride = 40 bytes |
| +33 | 1B | uint8_t | extension_mode | D05: *(a3+33) bits[2:3] = load extension mode (0=none, 1=anyext, 2=sext, 3=zext) |
| +40 | 8B | ptr | value_list / operand[0] type | D02: *(node+40) = SDValue type info; D01: result type descriptor |
| +48 | 8B | EVT | result_VT | D05: *(a3+48) = result VT list, 16-byte entries {u16 MVT, pad, u64 ext} |
| +60 | 4B | uint32_t | num_values | D02: number of result values |
| +64 | 4B | uint32_t | flags / num_operands_alt | D05: *(a3+64) = operand count (alternate access path in KnownBits) |
| +72 | 8B | SDValue | chain_operand / result EVT | D03: *(qword*)(N+72) = result value type; D01: chain operand for memory ops |
| +80 | 8B | ptr | metadata / mem operand | D01: *(node+80) = predicate for CAS; extra metadata |
| +88 | 4B | uint32_t | address_space / ordering | D01: *(node+88) = memory operand / address-space descriptor |
| +96 | 8B | uint64_t | immediate_value | D05: *(a3+96) = constant value for ConstantSDNode (width <= 64) |
| +104 | 8B | ptr | extended_data | D05: *(a3+104) = second immediate, type info for wide constants |
| +112 | 8B | ptr | mem_chain / alignment | D05: *(a3+112) = MemSDNode chain / alignment info |

Note on dual access patterns. The combiner accesses opcodes at N+24 as a 4-byte field with flags, while the legalizer reads *(uint16_t*)(node+24) for a clean 16-bit opcode. The KnownBits engine (sub_33D4EF0) accesses fields at offsets up to +112, confirming that ConstantSDNode and MemSDNode subclasses extend beyond the base 104-byte allocation. These extended nodes are allocated via sub_BD2DA0 (80 bytes for lightweight variants) or sub_22077B0 (128 bytes for MemSDNode), while the base SDNode remains 104 bytes.
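The three allocation sizes implied above can be summarized in a small sketch. The kind names are illustrative labels, not recovered identifiers; the binary selects the allocator per call site rather than by a runtime key:

```python
# Allocation sizing implied by the note above: base SDNodes are 104 bytes
# (sub_163D530), lightweight variants 80 bytes (sub_BD2DA0), and
# MemSDNode-style nodes 128 bytes (sub_22077B0).
ALLOC_SIZE = {"base": 104, "lightweight": 80, "mem": 128}

def alloc_node(stats, kind):
    size = ALLOC_SIZE[kind]
    stats["bytes"] += size      # mirrors *(_QWORD *)(v4 + 96) += 104LL
    return bytearray(size)      # all fields zeroed, as in sub_163D530

stats = {"bytes": 0}
node = alloc_node(stats, "base")
assert len(node) == 104 and all(b == 0 for b in node)
assert stats["bytes"] == 104
```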

Operand Storage

Operands are stored in a contiguous array of SDUse structures. Two storage modes exist:

Mode A -- backward inline (common for small operand counts). Operands are stored before the node in memory, growing toward lower addresses:

operand[i] = *(qword*)(N + 32*(i - NumOps))
// or equivalently: N - 32*NumOps = first operand address

This 32-byte operand stride is confirmed across sub_F3D570, sub_F20C20, and sub_F5A610.

Mode B -- indirect pointer (when node_flags_byte bit 6 is set). An 8-byte pointer at N-8 points to a separately allocated operand array:

if (*(byte*)(N+7) & 0x40):
    operand_base = *(qword*)(N - 8)
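Both storage modes reduce to simple address arithmetic. The sketch below uses the combiner's 32-byte stride and treats addresses as plain integers; the indirect base stands in for the pointer read from N-8:

```python
# Compute the address of operand slot i for a node at address N.
# Mode A: operands sit inline before the node, growing downward.
# Mode B: flag bit 0x40 in the byte at N+7 selects an indirect array.
STRIDE = 32  # combiner-view stride; the legalizer view uses 40 bytes

def operand_addr(N, num_ops, i, flags_byte, indirect_base=None):
    if flags_byte & 0x40:
        return indirect_base + STRIDE * i      # array pointed to from N - 8
    return N + STRIDE * (i - num_ops)          # backward inline layout

# Node at 0x1000 with 2 inline operands: first operand at N - 64.
assert operand_addr(0x1000, 2, 0, 0x00) == 0x1000 - 64
assert operand_addr(0x1000, 2, 1, 0x00) == 0x1000 - 32
# Indirect mode with a separately allocated operand array at 0x2000:
assert operand_addr(0x1000, 2, 1, 0x40, indirect_base=0x2000) == 0x2020
```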

The SDUse structure (each operand slot) has a 40-byte stride in the legalizer view (sub_1FFB890) and a 32-byte stride in the combiner view. The 40-byte stride includes use-chain forward/backward pointers:

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8B | Val | Pointer to the SDNode this use points to |
| +8 | 4B | ResNo | Result number within the pointed-to node |
| +16 | 8B | Next | Next SDUse in the use-list of the defining node |
| +24 | 8B | Prev | Previous SDUse (for doubly-linked list) |
| +32 | 8B | User | Back-pointer to the node that owns this operand |

Use-list traversal functions: sub_B43C20 (add to use list), sub_B43D60 (remove from use list).

SDValue

An SDValue is a lightweight {SDNode*, unsigned ResNo} pair identifying a specific result of a specific DAG node. In the decompiled code, SDValues appear as 16-byte pairs at various points:

struct SDValue {
    SDNode *Node;     // +0: pointer to the defining node
    uint32_t ResNo;   // +8: which result of that node (0-based)
};

SDValues are passed by value in registers (packed into __m128i in many decompiled signatures) and stored in operand arrays. The SDUse structure wraps an SDValue with use-chain linkage for the def-use graph.
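The 16-byte layout (8-byte pointer, 4-byte ResNo, 4 bytes of padding) can be checked with a packing sketch; the little-endian format string is an assumption consistent with the x86-64 target:

```python
import struct

# The 16-byte SDValue pair: node pointer at +0, ResNo at +8, then padding
# up to the 16-byte size seen packed into __m128i in decompiled signatures.
SDVALUE_FMT = "<QI4x"
assert struct.calcsize(SDVALUE_FMT) == 16

def pack_sdvalue(node_ptr, res_no):
    return struct.pack(SDVALUE_FMT, node_ptr, res_no)

blob = pack_sdvalue(0xDEADBEEF, 1)
node_ptr, res_no = struct.unpack("<QI", blob[:12])
assert (node_ptr, res_no) == (0xDEADBEEF, 1)
```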

SelectionDAG Builder Context

The builder context is the a1/v4 parameter to sub_163D530. It holds the function being compiled, target information, the bump allocator state, and several DenseMaps for node deduplication.

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8B | func_ptr | The LLVM function being compiled (a2) |
| +8 | 8B | target_ptr | Target machine info (a4) |
| +16 | 8B | alloc_cursor | Bump allocator current position |
| +24 | 8B | alloc_end | Bump allocator end boundary |
| +32 | 8B | slab_array | Pointer to array of slab pointers |
| +40 | 4B | slab_index | Current slab number (dword) |
| +44 | 4B | slab_capacity | Max slabs in array (dword) |
| +48 | var | inline_slab | Start of first allocation region |
| +80 | 8B | bb_list_head | Basic block list sentinel (points to +96) |
| +88 | 8B | bb_list_count | Number of basic blocks (init 0) |

Embedded DenseMaps

Three DenseMap/DenseSet instances are embedded inline in the context for node deduplication and worklist tracking. All use the standard DenseMap infrastructure with NVVM-layer sentinels (-8 / -16); see Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth policy.

Map A (CSE node mapping) at offsets +120..+148:

| Offset | Size | Field |
|---|---|---|
| +120 | 8B | NumEntries |
| +128 | 8B | Buckets pointer |
| +136 | 4B | NumItems |
| +140 | 4B | NumTombstones |
| +144 | 4B | NumBuckets |

Map B (secondary set) at offsets +152..+176, same layout.

Set C (worklist) at offsets +184..+208, same layout.

Total minimum context size: 212 bytes.

Map A uses 16-byte bucket stride (key + value pairs), confirmed by the decompiled access pattern:

v30 = (_QWORD *)(v28 + 16LL * v29);   // 16-byte stride
*v30 = v11;                             // key
v30[1] = v19;                           // value

DAG Builder Algorithm (SelectionDAGBuilder)

The SelectionDAGBuilder converts LLVM IR to an initial SelectionDAG. The main entry is sub_2081F00 (267KB, ~9,000 lines), with the visit dispatcher at sub_2065D30 (25KB). The builder processes one basic block at a time, walking the IR instruction list and emitting corresponding SDNode subgraphs.

Entry and Dispatch

sub_2081F00(SelectionDAGBuilder *this, BasicBlock *BB):
    // this+552 = SelectionDAG pointer
    // this+560 = DataLayout pointer
    // Walk BB instruction list via linked list at BB+40/+48
    for each instruction I in BB:
        sub_2065D30(this, I)     // main visit dispatch

The visit dispatcher (sub_2065D30) contains a DenseMap for node deduplication (hash function: (key >> 9) ^ (key >> 4)). It switches on the IR opcode and delegates to per-instruction visitors:

| IR Instruction | Visitor Function | Size | Notes |
|---|---|---|---|
| Binary ops | sub_206E5B0--sub_206F0D0 | 2.3KB each | 8 identical template instantiations for different ISD opcodes |
| Call | sub_208CF60 | 56KB | Calls sub_20C7CE0 (NVPTX ComputeCalleeInfo) |
| Load | sub_209B000 | 15KB | Chains via sub_2051C20 |
| Store | sub_2090780 | 14KB | Alignment, volatile, chain tokens |
| Switch/Br | sub_20912B0 | 18KB | Jump tables, range checks |
| PHI | sub_20920A0 | 13KB | Block ordering, vreg setup |
| GEP | sub_209FCA0 | 13KB | Recursive address building |
| Intrinsic | sub_208C8A0 | 9KB | Dispatches to intrinsic handlers |
| Debug | sub_208C270 | 7KB | Debug value/location handling |
| Inline Asm | sub_2079C70 | 83KB | Full constraint parsing |
| NVVM Tex/Surf | sub_2077400 | 20KB | "nvvm_texsurf_handle" metadata, NVIDIA custom |
| NVVM Args | sub_2072590 | 38KB | CUDA argument coercion, NVIDIA custom |

Chain Management

Every memory-touching SDNode carries a chain operand (token type) that enforces memory ordering. The chain is a linked sequence of token-typed SDValues threading through all memory operations in program order.

Chain creation. The builder maintains a "current chain" (PendingChain) that is updated after every memory operation. When a load or store is emitted, the current chain becomes its chain input, and the node's token result becomes the new current chain.

TokenFactor merging. When multiple independent memory operations can be reordered (e.g., independent loads), the builder creates a TokenFactor (opcode 2/55 depending on context) node that merges multiple chains into one:

// sub_F429C0: merge node creation
TokenFactor = getNode(ISD::TokenFactor, dl, MVT::Other, chains[])

Chain handling utilities in the builder:

  • sub_20993A0 (11KB) -- chain/token helper for load/store sequences
  • sub_2098400 -- chain token node creator
  • sub_20989A0 -- memory scheduling chain builder
  • sub_F6C1B0 (16KB) -- chain management in combining, uses sub_B46970 (isTokenFactor)

Glue (flag) chains. Certain node pairs must be scheduled adjacently (e.g., CopyToReg + CALL). These use a "glue" value type (MVT::Glue) as an additional operand/result. The call lowering in sub_3040BF0 threads glue through the entire call sequence: CallSeqBegin -> DeclareParam* -> Store* -> CallProto -> CallStart -> LoadRetParam* -> CallSeqEnd.
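The chain-threading discipline described above can be captured in a toy builder. The node representation (tuples) and names are illustrative only; the point is that each memory op consumes the current chain and its token result becomes the new chain, while TokenFactor merges independent chains:

```python
# Toy model of SelectionDAG chain management during DAG building.
class Builder:
    def __init__(self):
        self.chain = ("EntryToken",)
        self.nodes = []

    def emit_memop(self, name):
        node = (name, self.chain)           # current chain is the chain input
        self.nodes.append(node)
        self.chain = (name + ".token",)     # token result becomes new chain
        return node

    def token_factor(self, *chains):
        self.chain = ("TokenFactor", chains)  # merge independent chains
        return self.chain

b = Builder()
b.emit_memop("LD.GLOBAL.b32")
st = b.emit_memop("ST.GLOBAL.b32")
assert st[1] == ("LD.GLOBAL.b32.token",)   # store is ordered after the load
```

Glue differs from chains in that it forbids any scheduling gap between the glued nodes, which this model does not attempt to capture.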

Per-Node Analysis Structure

During DAG construction, sub_163D530 creates per-node analysis objects (accessed via v381) with the following layout:

| Offset | Size | Field |
|---|---|---|
| +8 | 8B | array_ptr |
| +16 | 4B | array_count |
| +24 | 4B | array_capacity |
| +72 | 8B | set.Buckets |
| +80 | 4B | set.NumItems |
| +84 | 4B | set.NumTombstones |
| +88 | 4B | set.NumBuckets |

Operations: sub_163BE40(v381, ptr) inserts into the +8 array; sub_163BBF0(context, key) looks up the analysis structure for a node in the context's DenseMap.

CSE (Common Subexpression Elimination) Hash Table

The getNode() family of functions deduplicates SDNodes via a CSE hash table. The primary implementation is sub_F4CEE0 (41KB):

sub_F4CEE0(SelectionDAG *DAG, unsigned Opcode, SDVTList VTs, SDValue *Ops, unsigned NumOps):
    // 1. Compute profile hash via sub_F4B360 (SDNode::Profile)
    //    Hash combines: opcode, VTs, all operand node pointers
    // 2. Lookup in CSE hash table:
    //    hash = ((profile >> 4) ^ (profile >> 9)) & (capacity - 1)
    //    Quadratic probing: step 1, 2, 3, ...
    //    Sentinels: -4096 (empty), -8192 (tombstone)
    // 3. If found: return existing node
    // 4. If not found:
    //    Allocate via sub_BD2C40 (bump allocator)
    //    Initialize via sub_B44260 (SDNode constructor)
    //    Insert into hash table
    //    Add to AllNodes list (global sentinel: qword_4F81430)
    //    Return new node
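The lookup portion of this flow can be modeled directly from the recovered constants. The table representation below (a Python list holding sentinels or (profile, node) pairs) is a simplification; the hash fold, power-of-two mask, probe sequence, and sentinel values are the ones documented above:

```python
# Model of the CSE lookup in sub_F4CEE0: hash folding, power-of-two mask,
# quadratic probing with step 1, 2, 3, ..., and the two sentinels.
EMPTY, TOMBSTONE = -4096, -8192

def cse_find_or_insert(table, profile, node):
    capacity = len(table)                    # must be a power of two
    idx = ((profile >> 4) ^ (profile >> 9)) & (capacity - 1)
    step = 1
    first_free = None
    while True:
        slot = table[idx]
        if slot == EMPTY:
            target = first_free if first_free is not None else idx
            table[target] = (profile, node)  # not found: insert new node
            return node, False
        if slot == TOMBSTONE:
            if first_free is None:
                first_free = idx             # reusable slot, keep probing
        elif slot[0] == profile:
            return slot[1], True             # CSE hit: reuse existing node
        idx = (idx + step) & (capacity - 1)
        step += 1                            # quadratic probing

table = [EMPTY] * 16
n1, hit = cse_find_or_insert(table, 0x1234, "node_a")
assert not hit
n2, hit = cse_find_or_insert(table, 0x1234, "node_b")
assert hit and n2 == "node_a"                # second request deduplicated
```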

Node builder variants handle different operand counts:

  • sub_F49030 (38KB) -- complex node construction with operand/result type setup
  • sub_F429C0 (34KB) -- merge/TokenFactor/indexed node creation
  • sub_F44160 (22KB) -- CSE rebuild after modification
  • sub_F40FD0 (16KB) -- node construction with chain initialization

The AllNodes list (qword_4F81430) is a doubly-linked intrusive list of all SDNodes in the current DAG, used for iteration during combining and legalization passes.

NVPTX-Specific Node Types (NVPTXISD)

NVPTX target-specific ISD opcodes begin at ISD::BUILTIN_OP_END = 0x1DC9 (confirmed by sub_2095B00 delegation threshold for getTargetNodeName()). In the decompiled code, target opcodes are referenced by small integers (the NVPTXISD enum value minus BUILTIN_OP_END). The following table consolidates all NVPTXISD opcodes discovered across sub_3040BF0, sub_32E3060, sub_33B0210, and the legalization infrastructure:

Call ABI Nodes

| Opcode | Name | Operands | Description |
|---|---|---|---|
| 315 | CallSeqBegin | chain, seqId, frameSize | Mark start of call frame |
| 316 | CallSeqEnd_Outer | chain, ... | Outer call-sequence-end wrapper |
| 505 | DeclareParam | chain, align, idx, size | Declare .param (byval/aggregate) |
| 506 | DeclareScalarParam | chain, align, idx, size | Declare .param (scalar, widened) |
| 507 | DeclareRetParam | chain, ... | Declare .param for return (byval callee) |
| 508 | DeclareRetScalarParam | chain, ... | Declare .param for return (scalar callee) |
| 510 | CallDirect | chain, callee, ... | Direct call (callee not extern) |
| 511 | CallDirectNoProto | chain, callee, ... | Direct call without prototype |
| 512 | CallIndirect | chain, ptr, ... | Indirect call via function pointer |
| 513 | CallIndirectNoProto | chain, ptr, ... | Indirect call without prototype |
| 514 | CallStart | chain, ... | Actual call instruction emission |
| 515 | LoadRetParam | chain, offset | Load return value from .param (not last) |
| 516 | LoadRetParamLast | chain, offset | Load last return value from .param |
| 517 | CallSeqEnd | chain, seqId, ... | End of call sequence (inner chain) |
| 518 | CallProto | chain, paramCount | Declare call prototype (.callprototype) |
| 521 | DeclareRetParam_Ext | chain, ... | Declare .param for return (extended path) |
| 527 | StoreCalleeRetAddr | chain, ... | Store callee return address in .param |
| 528 | StoreRetValToParam | chain, ... | Store return value to .param (return path) |

Memory / Vector Nodes

| Opcode | Name | Operands | Description |
|---|---|---|---|
| 568 | LoadV1 | chain, ptr, offset | Load 1-element from .param (scalar return) |
| 569 | LoadV2 | chain, ptr, offset | Load 2-element vector from .param |
| 570 | LoadV4 | chain, ptr, offset | Load 4-element vector from .param |
| 571 | StoreV1 | chain, val, ptr, offset | Store 1-element to .param (st.param) |
| 572 | StoreV2 | chain, val, ptr, offset | Store 2-element vector to .param |
| 573 | StoreV4 | chain, val, ptr, offset | Store 4-element vector to .param |

Math / Rounding-Mode Nodes

| Opcode | Name | Description |
|---|---|---|
| 245 | ADD_RM | Add, round toward -inf |
| 246 | SQRT_RP | Sqrt, round toward +inf |
| 248 | SQRT_RZ | Sqrt, round toward zero |
| 249 | ADD_RZ | Add, round toward zero |
| 250 | DIV_RZ | Div, round toward zero |
| 251 | MUL_RN | Mul, round to nearest |
| 252 | ADD_RN | Add, round to nearest |
| 253 | FMA_RN | FMA, round to nearest |
| 254 | SQRT_RM | Sqrt, round toward -inf |
| 255 | MUL_RZ | Mul, round toward zero |
| 256 | DIV_RM | Div, round toward -inf |
| 267 | FMA_RZ | FMA, round toward zero |
| 268 | DIV_RN | Div, round to nearest |
| 269 | DIV_RP | Div, round toward +inf |
| 270 | ADD_RP | Add, round toward +inf |
| 271 | FMA_RM | FMA, round toward -inf |
| 272 | MUL_RP | Mul, round toward +inf |
| 273 | FMA_RP | FMA, round toward +inf |
| 274 | MUL_RM | Mul, round toward -inf |

Address Space / Miscellaneous Nodes

| Opcode | Name | Description |
|---|---|---|
| 22 | TargetAddr | Target address computation |
| 24 | Wrapper | Global address wrapping |
| 149 | ATOMIC_LOAD | Atomic load with scope |
| 152 | SELECT_CC | Ternary select on condition code |
| 154 | SQRT_RN | Sqrt, round to nearest |
| 189 | MoveParam | Read thread index / special register |
| 193--196 | MIN/MAX | Integer min/max variants |
| 197 | CTPOP | Population count |
| 198--204 | ConstPool* | Constant pool variants by size |
| 208 | CMPXCHG | Compare-and-exchange atomic |
| 230 | DeclareLocal | Declare local .param / address of param |
| 233--234 | AddrSpaceCast | Bidirectional address space cast pair |
| 287--290 | Barrier/Fence | Memory barrier/fence variants |
| 310 | Annotation | Annotation metadata node |
| 321 | StackRestore | Restore stack pointer |
| 322 | StackAlloc | Dynamic stack allocation |
| 330 | FunctionAddr | Function address |
| 335 | BinaryArith | Generic binary arithmetic |
| 371 | DynAreaOffset | Dynamic alloca offset |
| 499 | ConditionalBranch | Conditional branch with chain |

Atomic Opcodes (from sub_20BED60)

| Opcode Range | Operation | Widths |
|---|---|---|
| 294--297 | atom.add | f32/f64/i32/i64 |
| 302--305 | atom.min | s32/s64/u32/u64 |
| 314--317 | atom.max | s32/s64/u32/u64 |
| 462 | atom.cas | generic |

DAG Legalization Flow

After the initial DAG is built, three legalization phases transform it into a form the NVPTX backend can select:

Phase 1: Type Legalization (sub_20019C0, 348KB)

The DAGTypeLegalizer iterates to fixpoint. For each node, it reads the result/operand types and checks the legality table at TLI + 259 * VT + opcode + 2422. If illegal, it applies one of: promote, expand, soften, scalarize, or split-vector. The worklist iterates until no node has an illegal type.

NVPTX legal vector types are extremely limited (only v2f16, v2bf16, v2i16, v4i8 -- all packing into 32-bit registers via Int32HalfRegs). This means virtually all LLVM-IR vector operations pass through the split/scalarize paths.
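The practical consequence can be sketched as a small model of the legalizer's decision for vector types. The function and set names below are illustrative, and the split-before-scalarize policy is an assumption based on standard DAGTypeLegalizer behavior, not recovered code:

```python
# Illustrative model: only the four packed NVPTX vector types are legal;
# everything else is split (halved) or scalarized by the type legalizer.
NVPTX_LEGAL_VECTORS = {("f16", 2), ("bf16", 2), ("i16", 2), ("i8", 4)}

def classify_vector(elem_ty: str, num_elems: int) -> str:
    if (elem_ty, num_elems) in NVPTX_LEGAL_VECTORS:
        return "legal"        # packs into one 32-bit register (Int32HalfRegs)
    if num_elems > 1 and num_elems % 2 == 0:
        return "split"        # halve the vector and re-check
    return "scalarize"        # fall back to element-wise scalars
```

Under this model, a v8f32 operation is split three times (v8 -> v4 -> v2) and finally scalarized, which matches the observation that almost all IR vector code funnels through the split/scalarize workers listed below.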

Type legalization workers:

  • sub_201E5F0 (81KB) -- promote/expand secondary dispatch (441 case labels, 6 switches)
  • sub_201BB90 (75KB) -- ExpandIntegerResult (632 case labels)
  • sub_2029C10 -- SplitVectorResult dispatcher (reads opcode at node+24)
  • sub_202E5A0 -- SplitVectorOperand dispatcher
  • sub_2036110 -- ScalarizeVectorResult
  • sub_2035F80 -- ScalarizeVectorOperand

Phase 2: Operation Legalization (sub_1FFB890, 169KB)

After types are legal, the operation legalizer checks whether each operation at its now-legal type is supported. The action lookup:

action = *(uint8_t*)(TLI + 259*VT + opcode + 2422)

Actions dispatch through a five-way switch:

| Action | Code | Behavior |
|---|---|---|
| Legal | 0 | Return immediately |
| Custom | 1 | Call TLI->LowerOperation() via vtable slot #164 (offset 1312) |
| Expand | 2 | Try sub_20019C0 (LegalizeTypes), then sub_1FF6F70 (ExpandNode) |
| LibCall | 3 | Call sub_1FF6F70 directly |
| Promote | 4 | Find next legal type, rebuild at promoted type |
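As a minimal sketch of the five-way dispatch (action codes as read from the byte table; the returned strings stand in for the recovered handlers, whose names here are placeholders):

```python
def legalize_op(action: int, node: str) -> str:
    # Action byte read from TLI + 259*VT + opcode + 2422.
    if action == 0:
        return node                              # Legal: keep as-is
    if action == 1:
        return f"LowerOperation({node})"         # Custom: vtable slot #164
    if action == 2:
        return f"ExpandNode(LegalizeTypes({node}))"  # Expand path
    if action == 3:
        return f"ExpandNode({node})"             # LibCall path
    if action == 4:
        return f"promote({node})"                # rebuild at next legal type
    raise ValueError(f"unknown action {action}")
```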

Custom lowering invokes NVPTXTargetLowering::LowerOperation() (sub_32E3060, 111KB) through the vtable. This is where all NVPTX-specific operation lowering happens: BUILD_VECTOR splat detection, VECTOR_SHUFFLE three-level lowering, EXTRACT_VECTOR_ELT three-path dispatch, and the .param-space calling convention.

Additional action tables:

  • Second table at TLI + opcode + 2681 -- for BSWAP/CTLZ/CTTZ/BITREVERSE (opcodes 43--45, 199)
  • Third table at TLI + opcode + 3976 -- for FSINCOS (opcode 211)
  • Fourth table at TLI + 18112 -- packed nibble format for FP_TO_SINT/FP_TO_UINT/SELECT_CC, indexed by (VT_id >> 3) + 15 * condcode_type

Phase 3: DAG Combining (Three Passes)

DAG combining runs after each legalization phase. The orchestrator (sub_F681E0, 65KB) manages a worklist of SDNodes and calls the per-node visitor (sub_F20C20, 64KB) for each. The visitor implements a six-phase combine algorithm:

  1. Opcode-specific combine via sub_100E380 -- target-independent pattern matching
  2. Known-bits narrowing -- for constants, calls sub_11A3F30 (computeKnownBits/SimplifyDemandedBits) and narrows if fewer bits demanded
  3. Operand type-narrowing loop -- walks all operands, promotes/truncates to legal types, creates SIGN_EXTEND/TRUNCATE casts
  4. All-constant-operand fold -- 4x-unrolled check via sub_1028510 (ConstantFold)
  5. Division-by-constant strength reduction -- shift+mask replacement for power-of-2 divisors
  6. Vector stride / reassociation -- sub_F15770 (shift-fold), sub_F17ED0 (stride patterns)
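Phase 5's transform can be illustrated for the unsigned case. This is a sketch of the arithmetic identity, not the recovered code; signed division requires an extra rounding adjustment that is omitted here:

```python
def reduce_udiv_pow2(x: int, d: int) -> int:
    # Unsigned x // d with d a power of two becomes a logical right shift.
    assert d > 0 and d & (d - 1) == 0, "only power-of-two divisors qualify"
    return x >> (d.bit_length() - 1)

def reduce_urem_pow2(x: int, d: int) -> int:
    # Unsigned x % d with d a power of two becomes a mask of the low bits.
    assert d > 0 and d & (d - 1) == 0
    return x & (d - 1)
```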

NVPTX-specific combines run as a post-legalize pass:

  • sub_33C0CA0 (62KB) -- PerformDAGCombine, the NVPTX target hook
  • sub_32EC4F0 (92KB) -- post-legalize combine
  • sub_3425710 (142KB) -- the NVIDIA DAGCombiner with internal "COVERED"/"INCLUDED" debug tracing strings (not present in upstream LLVM)

The worklist uses the same DenseMap infrastructure as the builder context, with the hash at DAG+2072 (capacity at DAG+2088, count at DAG+2080). Node replacement goes through sub_F162A0 (CombineTo/ReplaceAllUsesWith), which walks the use-list, hashes each user into the worklist map, then calls sub_BD84D0 for the actual use-chain splice.

Bump Allocator

The builder context uses a slab-based bump allocator identical to the one used for NVVM IR nodes:

  • Slab growth: 4096 << (slab_index >> 7) -- exponential, capped at 4TB.
  • Alignment: 8 bytes.
  • No per-node free: entire slabs are released when the DAG is destroyed.
  • Overflow: allocates a new slab via malloc().

Since every base SDNode is exactly 104 bytes (13 qwords), a single 4096-byte initial slab holds approximately 39 nodes before overflow triggers slab growth. Extended node types (ConstantSDNode, MemSDNode) may be larger and are allocated via separate paths:

  • sub_BD2C40 -- standard SDNode allocation (bump allocator)
  • sub_BD2DA0 -- SDNode allocation variant (80 bytes, for lightweight nodes)
  • sub_22077B0 -- operator new[] (128 bytes, for MemSDNode with chain/alignment fields)
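The slab-growth schedule and the 104-byte node math can be checked numerically; `slab_size` is an illustrative name for the recovered formula:

```python
def slab_size(slab_index: int) -> int:
    # 4096 << (slab_index >> 7): the slab size doubles every 128 slabs,
    # capped at 4 TB per the recovered allocator.
    return min(4096 << (slab_index >> 7), 1 << 42)

SDNODE_SIZE = 104                             # 13 qwords per base SDNode
nodes_in_first_slab = 4096 // SDNODE_SIZE     # 39 nodes; 40 bytes of slack
```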

Basic Block Iteration

The builder iterates over the function's basic blocks via a linked list rooted at a2 + 72 (the function parameter). Each list node embeds the data pointer at offset -24 from the node:

bb_data = node_ptr - 24

Within each basic block, instructions are iterated via an inner list:

  • Inner list sentinel at bb_data + 40
  • Inner list head at bb_data + 48

This matches the LLVM ilist intrusive linked list pattern where the list hook is embedded at a fixed offset within the contained object.

Differences from Upstream LLVM

| Area | NVIDIA (cicc v13.0) | Upstream LLVM 20.0 |
|---|---|---|
| Type legalizer structure | Single 348KB monolithic function (sub_20019C0) | Split across 4 files (LegalizeIntegerTypes.cpp, etc.) |
| NVIDIA DAGCombiner | 142KB sub_3425710 with "COVERED"/"INCLUDED" internal tracing | No equivalent; target combines via PerformDAGCombine hook only |
| computeKnownBits | 114KB sub_33D4EF0, covers 112+ ISD opcodes including NVPTX target nodes | ~30 opcodes in generic computeKnownBits, target extends via hook |
| Inline asm | 162KB total (sub_2079C70 + sub_338BA40) | ~200 lines per target |
| Intrinsic lowering | 343KB switch covering 200+ intrinsic IDs up to 14196 | ~300 standard intrinsic IDs |
| Address spaces | AS 101 (param alt), AS 7 (.param), CTA/GPU/SYS scope atomics | No AS 101; no scope atomics |
| Libcall metadata | "nvptx-libcall-callee" metadata for custom libcall routing | Not present |
| Legal vector types | Only v2f16, v2bf16, v2i16, v4i8 (packed into 32-bit registers) | Varies by target; typically much wider vectors |

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| SelectionDAG builder context init | sub_163D530 | 73KB | Allocator, DenseMaps, BB iteration |
| SelectionDAGBuilder::visit | sub_2081F00 | 267KB | IR-to-DAG main lowering |
| SelectionDAGBuilder visit dispatch | sub_2065D30 | 25KB | Per-instruction routing |
| visitCall | sub_208CF60 | 56KB | Call lowering into DAG |
| visitLoad | sub_209B000 | 15KB | Load chain emission |
| visitStore | sub_2090780 | 14KB | Store alignment/chain |
| visitSwitch/Br | sub_20912B0 | 18KB | Control flow lowering |
| visitPHI | sub_20920A0 | 13KB | PHI node handling |
| visitGEP | sub_209FCA0 | 13KB | Address computation |
| visitInlineAsm | sub_2079C70 | 83KB | Inline asm constraint parsing |
| visitNVVMTexSurf | sub_2077400 | 20KB | NVIDIA tex/surf handle lowering |
| NVPTX argument coercion | sub_2072590 | 38KB | CUDA kernel argument lowering |
| getNode / CSE hash table | sub_F4CEE0 | 41KB | Node deduplication |
| SelectionDAG node builder | sub_F49030 | 38KB | Complex node construction |
| Merge/TokenFactor creation | sub_F429C0 | 34KB | Chain merging, indexed nodes |
| DAG combiner orchestrator | sub_F681E0 | 65KB | Worklist management |
| DAG combiner visitor | sub_F20C20 | 64KB | Per-node combine algorithm |
| combine() opcode dispatch | sub_100E380 | -- | Target-independent combines |
| CombineTo / RAUW | sub_F162A0 | -- | Use-chain replacement + worklist push |
| SDNode allocation | sub_BD2C40 | -- | Bump allocator |
| SDNode constructor | sub_B44260 | -- | Initialization |
| SDUse add to use list | sub_B43C20 | -- | Use-chain linkage |
| SDUse remove from use list | sub_B43D60 | -- | Use-chain unlinkage |
| ReplaceAllUsesWith | sub_BD84D0 | -- | Raw use-chain splice |
| transferDbgValues | sub_BD6B90 | -- | Debug info transfer |
| setOperand | sub_B91C10 | -- | Operand mutation |
| replaceOperand | sub_B99FD0 | -- | Single operand swap |
| DAGTypeLegalizer::run | sub_20019C0 | 348KB | Type legalization master dispatch |
| LegalizeOp | sub_1FFB890 | 169KB | Operation legalization |
| ExpandNode | sub_1FF6F70 | -- | Full node expansion fallback |
| NVPTXTargetLowering::LowerOperation | sub_32E3060 | 111KB | NVPTX custom operation lowering |
| NVPTXTargetLowering::LowerCall | sub_3040BF0 | 88KB | .param calling convention |
| Intrinsic lowering switch | sub_33B0210 | 343KB | 200+ CUDA intrinsic IDs |
| PerformDAGCombine (NVPTX) | sub_33C0CA0 | 62KB | Post-legalize NVPTX combines |
| NVIDIA DAGCombiner | sub_3425710 | 142KB | NVIDIA-specific combine engine |
| computeKnownBits (NVPTX) | sub_33D4EF0 | 114KB | 112-opcode known-bits transfer |
| ISel::Select driver | sub_3090F90 | 91KB | Pattern matching entry |
| getOperationName | sub_2095B00 | 35KB | ISD opcode -> string mapping |

Cross-References

DenseMap, Symbol Table, and EDG Frontend Structures

The EDG 6.6 frontend layered on LLVM's DenseMap maintains its own declaration nodes, type nodes, and scope stack for C/C++/CUDA semantic analysis. This page documents the EDG-level structures that ride on top of the DenseMap. For the DenseMap implementation itself -- layout, hash function, probing, sentinel values, and growth policy -- see Hash Table and Collection Infrastructure.

The EDG symbol tables in this subsystem use the NVVM-layer sentinel pair (-8 / -16) and the pointer hash (ptr >> 9) ^ (ptr >> 4). See the sentinel reference table for other subsystems.
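Under those conventions, a lookup can be sketched as follows. Linear probing is an assumption made here for brevity; the actual probing scheme is specified in Hash Table and Collection Infrastructure:

```python
EMPTY = (-8) & (2**64 - 1)       # empty-slot sentinel, stored as a "pointer"
TOMBSTONE = (-16) & (2**64 - 1)  # deleted-slot sentinel

def ptr_hash(ptr: int) -> int:
    # The pointer hash used by the EDG symbol tables.
    return ((ptr >> 9) ^ (ptr >> 4)) & (2**64 - 1)

def find_slot(buckets: list, ptr: int) -> int:
    # Probe until the key or an empty slot is found; tombstones are skipped.
    mask = len(buckets) - 1          # capacity assumed to be a power of two
    i = ptr_hash(ptr) & mask
    while buckets[i] not in (EMPTY, ptr):
        i = (i + 1) & mask
    return i
```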


EDG Declaration Node Layout

The EDG 6.6 frontend represents every C/C++ declaration as a variable-length structure. The canonical declaration node layout was recovered from the top-level declarator parser sub_662DE0 and the declaration-specifier resolver sub_7C0F00.

Declaration Node (a_decl_node) -- 456+ bytes

| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| +0 | 8B | ptr | decl_id / entity pointer | *v31 in sub_662DE0 |
| +8 | 8B | uint64_t | decl_flags bitfield (see below) | v31[1] |
| +16 | 8B | uint64_t | decl_extra_flags | v31[2] |
| +24 | 16B | -- | name / identifier info | v31[3..4] |
| +40 | 8B | -- | name string for "main" check | strcmp target |
| +72 | 4B | uint32_t | saved_specifier_word1 | v239 in sub_662DE0 |
| +76 | 2B | uint16_t | saved_specifier_word2 | v240 |
| +80 | 1B | uint8_t | entity_kind (for scope dispatch) | checked in sub_860B80 |
| +120 | 1B | uint8_t | accessibility (bits 0-6, bit 7 reserved) | v241 = *(a1+120) & 0x7F |
| +124 | 1B | uint8_t | context_flags_124 | bit 5=explicit_spec, bit 6=class_member |
| +125 | 1B | uint8_t | context_flags_125 | bit 5=was_friend, bit 6=in_class_body, bit 7=template_decl_head |
| +126 | 1B | uint8_t | state_flags (see below) | mask tests throughout sub_662DE0 |
| +127 | 1B | uint8_t | extra_state | bit 0=class_scope_pushed, bit 1=needs_deferred_parse |
| +128 | 8B | ptr | entity_ptr / scope pointer | compared early in sub_739430 |
| +130 | 1B | uint8_t | modifier_flags | bit 5=deferred_parse, bit 6=virtual_specifier |
| +131 | 1B | uint8_t | inline/constexpr flag | bit 4 |
| +132 | 1B | uint8_t | needs_semicolon_check | bit 1 |
| +140 | 1B | uint8_t | type_kind (for type_def nodes) | switch discriminant in sub_766570 case 6 |
| +160 | 8B | ptr | underlying_type (for typedef) | typedef unwrap chain |
| +168 | 8B | ptr | flags_ptr | bit 3 checked for fn-pointer |
| +173 | 1B | uint8_t | elaborate_kind | primary switch in sub_739430 |
| +176 | var | -- | elaborate_sub_kind / secondary | sub-switch in case 12 |
| +184 | 8B | ptr | parm_list | v31[23] via sub_5CC190(1) |
| +224 | 4B | uint32_t | init_kind | bit 0 = brace-init |
| +256 | 4B | uint32_t | additional_flags | -- |
| +268 | 1B | uint8_t | decl_kind_enum | 0=variable, 4=function, 6=namespace |
| +269 | 1B | uint8_t | storage_class_kind | 0=none, 1=extern, 2=static |
| +272 | 8B | ptr | decl_type | v31[34] |
| +280 | 8B | ptr | result_type | v31[35] |
| +288 | 8B | ptr | entity_type / return_type | v31[36] |
| +304 | 8B | ptr | template_info | -- |
| +352 | 8B | ptr | body_ptr | v31[44] |
| +360 | 8B | ptr | scope_or_context | v31[45] |
| +368 | 8B | ptr | forward_decl_chain | linked list |
| +416 | 8B | ptr | pending_list | v31[52] |
| +456 | 8B | ptr | extra_entity | v31[57] |

decl_flags (+8) Bit Definitions

| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | is_definition / linkage related |
| 1 | 0x2 | has_initializer / needs init check |
| 4 | 0x10 | is_typedef |
| 5 | 0x20 | is_template_decl / friend declaration |
| 6 | 0x40 | is_inline |
| 7 | 0x80 | is_extern |
| 14 | 0x4000 | structured_binding / decomposition decl |

state_flags (+126) Bit Definitions

| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | has_saved_tokens |
| 1 | 0x2 | abstract_declarator_mode |
| 2 | 0x4 | has_leading_attributes |
| 3 | 0x8 | no_declarator_needed (typedef etc.) |
| 4 | 0x10 | suppress_error_recovery |
| 5 | 0x20 | in_declarator_parsing (set on entry) |
| 6 | 0x40 | in_multi_declarator_loop |
| 7 | 0x80 | scope_pushed |

entity_kind (+80) Dispatch Values

Used by sub_860B80 and sub_7C0F00 phase 3:

| Value | Entity Kind |
|---|---|
| 3 | class |
| 4 | enum (variant A) |
| 5 | enum (variant B) |
| 6 | namespace |
| 10 | function |
| 11 | variable |
| 16 | typedef |
| 17 | template |
| 19 | class template |
| 22 | dependent name |
| 23 | using-declaration |
| 24 | injected-class-name |

Declaration Node Allocation

sub_84DCB0 allocates 152-byte declaration entries from a free-list at qword_4D03C68, with fallback to the global allocator sub_823970(152). The full node size table at qword_4B6D500 provides per-tag sizes for all 87 IL node types; the declaration tag (6) indexes into this table for memcpy during template instantiation.


EDG Type Node Layout

Type nodes are the central representation for C/C++ types throughout the EDG frontend. Two distinct layouts exist: the IL-level type node used by the tree walker (sub_7506E0) and the semantic type node used by the type comparison engine (sub_7386E0). The type translation system (sub_91AED0) bridges between these and LLVM types.

IL-Level Type Node (from sub_7506E0 tree walker)

The IL tree walker addresses fields as a1[N] (8-byte indexed), with byte-level sub-kind tags at specific offsets:

| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| -16 | 8B | ptr | parent_ptr / owner | shared-node check path |
| -8 | 1B | uint8_t | flags_byte | bit 0=shared, bit 2=visit-mark |
| +0..+N*8 | var | ptr[] | child pointers (typed per kind) | a1[0]..a1[N] |
| +24 | 1B | uint8_t | expression sub-kind (case 13) | switch discriminant |
| +28 | 1B | uint8_t | scope sub-kind (case 23) | 18 sub-kinds |
| +40 | 1B | uint8_t | declaration sub-kind (case 21) | 25 sub-kinds |
| +48 | 1B | uint8_t | template_arg sub-kind (case 30) | 9 sub-kinds |
| +140 | 1B | uint8_t | type_def_sub_kind (case 6) | 17 sub-kinds |
| +161 | 1B | uint8_t | type_def_flags | -- |
| +168-177 | var | -- | type sub-kind / sub-sub-kind | -- |
| +173 | 1B | uint8_t | type_main_kind (case 2) | 14 sub-kinds by +173 |
| +176 | 1B | uint8_t | type_sub_sub_kind | case 6 elaborated |

Semantic Type Node (from sub_7386E0 comparison engine)

| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| +0 | 8B | ptr | associated_decl | *v10 == *v14 comparison |
| +24 | 1B | uint8_t | type_kind (0..37) | primary switch discriminant |
| +25 | 1B | uint8_t | cv_qualifiers | bits 0-1 = const/volatile, bit 6 = restrict |
| +26 | 1B | uint8_t | type_flags_1 | bit 2 compared |
| +27 | 1B | uint8_t | type_flags_2 | bit 1 compared (case 1) |
| +56 | 8B | -- | type_payload / sub_kind | case 1: char at +56 = base_type_kind |
| +58 | 1B | uint8_t | type_extra_flags | case 1: bits 0x3A compared |
| +64 | 8B | -- | varies per kind | case 30: word at +64 |
| +72 | 8B | ptr | type_child / pointer | case 1 integer path |
| +80 | 8B | ptr | linkage_chain | case 33: namespace list +80 = next |

EDG-to-LLVM Type Translation Node (from sub_918E50)

The type translation system reads a third view of the type node with offsets optimized for LLVM type construction:

| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| -72 | 8B | ptr | grandparent_type | nested lookups |
| -48 | 8B | ptr | parent_type_A | -- |
| -24 | 8B | ptr | parent_type_B / first child | -- |
| -8 | 8B | ptr | indirect_child_array | if flag 0x40 at +23 |
| +0 | 8B | ptr | llvm_type_descriptor | *node -> LLVM type info |
| +8 | 8B | ptr | member_chain_head | linked list of class members |
| +16 | 1B | uint8_t | type_kind | see kind table below |
| +18 | 2B | uint16_t | qualifier_word | bits 0-14: qualifier ID, bit 15: negation |
| +20 | 4B | uint32_t | child_count | low 28 bits masked & 0xFFFFFFF |
| +23 | 1B | uint8_t | flags | bit 6 (0x40) = indirect children |
| +24 | 8B | -- | type-specific data | varies by kind |
| +32 | 8B | -- | bitwidth | enum/integer types |
| +33 | 1B | uint8_t | additional_flags | bit 5 (0x20) = special treatment |
| +36 | 4B | uint32_t | sub_kind_discriminator | nested types |
| +40 | 8B | ptr | scope_linkage_ptr | -- |
| +48 | 8B | ptr | member_list_head | linked list |

type_kind Enumeration (semantic type comparison)

The full enumeration recovered from sub_7386E0:

| Value | Name | Comparison Strategy |
|---|---|---|
| 0 | tk_none / void | trivially equal |
| 1 | tk_fundamental | sub_kind + base type + class scope |
| 2 | tk_pointer | delegate to sub_739430 on pointee |
| 3 | tk_class | scope identity, unique_id, template args |
| 4 | tk_enum | scope identity, unique_id |
| 5 | tk_function | sub_73A280 pair compare |
| 6 | tk_bitfield | width + base compare |
| 7 | tk_member_pointer | multi-field descriptor |
| 8 | tk_reference | referent descriptor |
| 10 | tk_array | element type recursion |
| 11 | tk_qualified | child + qualifier bit |
| 12 | tk_elaborated | sub_kind switch (typedef/class/enum) |
| 13 | tk_pack_expansion | sub_kind switch |
| 14 | tk_typeof_expr | sub_kind switch |
| 15 | tk_decltype | sub_kind switch |
| 16 | tk_nullptr | trivially equal |
| 17 | tk_auto | identity on entity |
| 18 | tk_function_alt | sub_73A280 |
| 20 | tk_dependent_name | scope identity + unique_id |
| 22 | tk_unresolved | sub_8D97D0 decl compare |
| 23 | tk_attributed | attribute kind + child |
| 24 | tk_decltype_auto | identity on entity |
| 25 | tk_paren | child list compare |
| 26 | tk_adjusted | child type recursion |
| 27 | tk_typeof_decl | resolve decl -> type, recurse |
| 30 | tk_complex | element type(s) recursion |
| 32 | tk_template_template_param | identity + template args |
| 33 | tk_using_decl | child list + base class hash table |
| 34 | tk_atomic | child + qualifier bit |
| 35 | tk_vla | element type recursion |
| 37 | tk_concept_constraint | identity on entity |

EDG-to-LLVM Type Kind Encoding (byte at node+16)

| Value | Hex | Kind |
|---|---|---|
| 0-16 | 0x00-0x10 | Primitive / scalar types |
| 17 | 0x11 | Void (special) |
| 5 | 0x05 | Qualified type (const/volatile/restrict) |
| 13 | 0x0D | Enum type |
| 14 | 0x0E | Function type |
| 26 | 0x1A | Array type (subscript form) |
| 27 | 0x1B | Compound type (struct/union/class) |
| 50 | 0x32 | Union variant A |
| 51 | 0x33 | Union variant B |
| 54 | 0x36 | Typedef / using declaration |
| 55 | 0x37 | Using declaration variant |
| 75 | 0x4B | Pointer type |
| 76 | 0x4C | Reference type (lvalue or rvalue) |
| 77 | 0x4D | Member pointer type |
| 78 | 0x4E | Dependent / nested type |

Qualifier Word Values (node+18 & 0x7FFF)

| Value | CUDA Memory Space |
|---|---|
| 1 | Address space 1 (global memory) |
| 9 | Address space 9 (generic, gated by sub_5F3280) |
| 14 | Function / method qualifier |
| 26 | Array subscript context A |
| 27 | Array subscript context B |
| 32 | Address space 32 (shared memory) |
| 33 | Address space 33 (constant memory) |

Type Canonicalization -- sub_72EC50

Before any type comparison, both sides are canonicalized by stripping non-template typedef aliases:

fn edg_canonicalize_type(type) -> type:
    while type.type_kind == 2:               // tk_elaborated
        scope = type.payload_at_56
        if scope.elaborate_kind != 12:       // not typedef_name
            break
        if scope.elaborate_sub_kind != 1:    // not single-member typedef
            break
        if scope.class_flags & 0x10:         // has template specialization
            break
        type = sub_72E9A0(type)              // unwrap one layer
    return type

This peels through chains like `typedef int MyInt; typedef MyInt YourInt;` down to the fundamental type. Template specialization aliases are never unwrapped.


EDG Scope Stack

The scope stack is a global array of 776-byte entries, indexed by a scope depth counter. It represents the C++ scope nesting at parse time (file scope -> namespace -> class -> function -> block).

Global State

| Address | Type | Name | Purpose |
|---|---|---|---|
| qword_4F04C68 | ptr | Scope stack base | heap-allocated array of 776B entries |
| dword_4F04C64 | int32_t | Current scope index | top of the scope stack |
| dword_4F04C5C | int32_t | Previous scope index | saved parent index |
| dword_4F04C44 | int32_t | Namespace scope index | deepest enclosing namespace |
| dword_4F04C34 | int32_t | Class scope index | deepest enclosing class |
| dword_4F04C40 | int32_t | Another scope index | auxiliary scope tracking |
| dword_4F04C3C | int32_t | Module linkage flag | C++20 module scope state |
| unk_4F04C48 | int32_t | Parent scope check | used by using-declaration handler |

Scope Stack Entry Layout (776 bytes)

Each entry at qword_4F04C68[0] + 776 * index:

| Offset | Size | Type | Field |
|---|---|---|---|
| +0 | 4B | uint32_t | scope_id |
| +4 | 2B | uint16_t | scope_kind (see table below) |
| +6 | 1B | uint8_t | flags_a |
| +7 | 1B | uint8_t | flags_b |
| +8 | 1B | uint8_t | flags_c |
| +9 | 1B | uint8_t | flags_d |
| +10 | 1B | uint8_t | flags_e |
| +24 | 8B | ptr | name_list_head |
| +32 | 8B | ptr | name_list_tail |
| +208 | 8B | ptr | class_type_ptr |
| +232 | 8B | ptr | deferred_list |
| +328 | 8B | ptr | template_info |
| +552 | 4B | int32_t | parent_scope_index |
| +624 | 8B | ptr | declaration_ptr |
| +680 | 8B | -- | field used by sub_7C0F00 |
| +688 | 4B | uint32_t | entity_number_counter (for mangling) |
| +696 | 4B | uint32_t | entity_number_counter_2 |

scope_kind Values

| Value | Scope Kind |
|---|---|
| 5 | namespace |
| 6 | class |
| 7 | function |
| 8 | block (compound statement) |
| 9 | enum |
| 12 | template parameter |

Push / Pop Operations

sub_854590(0)   // push_scope -- increments dword_4F04C64, initializes new entry
sub_854430()    // pop_scope  -- decrements dword_4F04C64, restores parent
sub_854AB0(...) // pop_declarator_scope (context-specific cleanup)
sub_854B40()    // push_declarator_scope (declarator-specific init)

The scope depth counter at qword_4F061C8 + 64 is bumped independently for declarator nesting depth tracking. Class scope depth lives at qword_4F061C8 + 81.
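A minimal model of the push/pop discipline, using the entry layout above (the class and method names are illustrative; only the index bookkeeping is modeled):

```python
class ScopeStack:
    # Models qword_4F04C68 (entry array) and dword_4F04C64 (current index).
    def __init__(self):
        self.entries = []
        self.current = -1

    def push_scope(self, scope_kind: int):
        parent = self.current
        self.current += 1
        del self.entries[self.current:]   # reuse slots left by earlier pops
        self.entries.append({"scope_kind": scope_kind,
                             "parent_scope_index": parent})

    def pop_scope(self):
        # Restore the parent index saved at entry offset +552.
        self.current = self.entries[self.current]["parent_scope_index"]
```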


Scope Chain Traversal Algorithm

The declaration-specifier resolver sub_7C0F00 performs scope chain traversal to resolve qualified names. The algorithm was recovered from Phase 4 (lines 1197-1600) of that function.

Unqualified Name Lookup

fn lookup_unqualified(name, scope_index) -> entity:
    // Phase 2 of sub_7C0F00
    // Try each lookup strategy in priority order:

    result = sub_7D5DD0(name)                    // unqualified lookup in current scope
    if result:
        return result

    result = sub_7D2AC0(name, flags)             // lookup with specific flags
    if result:
        return result

    result = sub_7ACA80(name)                    // ADL / ambiguity resolution
    return result

Qualified Name Lookup (A::B::C)

The scope iteration loop at LABEL_282/283/285/288 walks the scope chain:

fn lookup_qualified(base_entity, remaining_name) -> entity:
    current = base_entity
    while true:
        // Check if "::" follows the current entity
        if current_token != TK_SCOPE_RESOLUTION:  // token 37
            return current

        consume_token()  // sub_7B8B50

        // Classify the current entity
        kind = current.entity_kind  // byte at +80
        switch kind:
            case 6:   // namespace
                result = sub_7D4A40(current, remaining_name)  // namespace lookup
            case 3:   // class
            case 19:  // class template
                result = sub_7D2AC0(current, remaining_name, MEMBER_FLAG)
            case 17:  // template
                result = sub_830940(current, remaining_name)  // class template lookup
            default:
                result = sub_7D4600(current, remaining_name)  // generic qualified lookup

        if !result:
            // Error: member not found in scope
            sub_6851C0(error_code, context)
            return null

        current = result

        // Check member access, visibility, redeclaration
        sub_8841F0(current, scope_entry)  // access check for C++ members

Self-Recursive Qualified Resolution

When the declaration-specifier resolver encounters a :: after resolving a name, it recurses into itself at sub_7C0F00(20, a2) where flags=20 decodes as:

  • bit 2 (0x04) = nested declarator sub-parse context
  • bit 4 (0x10) = restrict parse to type-specifiers only

This handles arbitrarily deep qualified names like A::B::C::D. Recursion depth is bounded by the nesting depth of the qualified name.
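The flags value decodes arithmetically as the union of the two documented bits (the bit names below are descriptive, not recovered symbols):

```python
NESTED_DECLARATOR_SUBPARSE = 0x04   # bit 2: nested declarator sub-parse
TYPE_SPECIFIERS_ONLY = 0x10         # bit 4: restrict to type-specifiers

# flags=20 in the recursive call is exactly these two bits combined
assert 20 == NESTED_DECLARATOR_SUBPARSE | TYPE_SPECIFIERS_ONLY
```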

Scope Chain Walking for Declaration Resolution

sub_868D90 (ADL / instantiation lookup) walks the scope chain upward:

fn walk_scope_chain(start_index) -> entity:
    index = start_index
    while index >= 0:
        entry = scope_table_base + 776 * index
        // Check if this scope contains the target declaration
        // ... name lookup within the scope's name list ...

        // Move to parent scope
        index = entry.parent_scope_index  // at offset +552

Type Comparison Engine

sub_7386E0 implements structural type comparison for the EDG frontend. It performs a parallel tree walk over two type nodes, comparing them field-by-field with mode-dependent strictness.

Calling Convention

sub_7386E0(packed_pair: __int128, flags: int) -> bool
    packed_pair.low  = type_A pointer
    packed_pair.high = type_B pointer
    flags bits:
        0-1: cv_compare_mode (0=strict, 1=relaxed, 2=overload)
        2:   template_matching_mode
        5:   anonymous_class_structural_compare

Comparison Algorithm

fn compare_types(type_A, type_B, flags) -> bool:
    // 1. Null handling
    if both null: return true
    if either null: return false

    // 2. Canonicalize (strip non-template typedefs)
    type_A = sub_72EC50(type_A)
    type_B = sub_72EC50(type_B)

    // 3. Quick-reject on header bytes
    if type_A.type_kind != type_B.type_kind: return false
    if (type_A.cv_quals ^ type_B.cv_quals) & 0x43: return false  // const/volatile/restrict
    if (type_A.flags_1 ^ type_B.flags_1) & 0x04: return false

    // 4. Type-specific structural comparison
    switch type_A.type_kind:
        case 3 (class):
            if type_A.scope == type_B.scope: return true     // identity shortcut
            if unique_id_enabled:
                if scope_A.unique_id == scope_B.unique_id: return true
            if template_mode:
                return sub_89BAF0(...)  // template arg list compare
            if anonymous_mode && both_anonymous:
                return sub_739430(member_list_A, member_list_B)

        case 7 (member_pointer):
            // Compare: flags, class ptr, scope ptr, return type,
            // params, exception spec -- 6 sub-comparisons

        case 33 (using_decl) in overload mode:
            // Hash table lookup at qword_4D03BF8 for base class lists
            // Element-by-element comparison of 24-byte triples

        // ... 35 other cases ...

    // 5. Post-switch: declaration pointer compare
    if type_A.decl != type_B.decl:
        if !sub_8D97D0(type_A.decl, type_B.decl): return false
    return true

Helper Functions

| Address | Name | Purpose |
|---|---|---|
| sub_7386E0 | edg_compare_type_nodes | Top-level structural compare |
| sub_739370 | edg_compare_type_lists | Linked-list comparator (next at +16) |
| sub_739430 | edg_compare_decl_types | Declaration-level comparator (661 lines) |
| sub_73A280 | edg_compare_type_pair_triv | Trivial wrapper: null=equal |
| sub_72EC50 | edg_canonicalize_type | Strip typedef / elaborated aliases |
| sub_8D97D0 | edg_compare_decl_identity | Name/entity identity comparison |
| sub_8C7520 | edg_class_same_template | Same primary class template check |
| sub_89AB40 | edg_compare_template_args | Template argument list comparison |
| sub_89BAF0 | edg_compare_template_arg_lists_full | Full template context compare |

Key Global: dword_4F07588 -- unique_id optimization

When set, enables O(1) identity comparison via the unique_id field at scope+32. This avoids recursive structural comparison for named classes and enums. The field is compared as a non-null integer; matching non-null values prove the two types refer to the same entity.


IL Tree Walker and Copier

Tree Walker -- sub_7506E0 (190KB, 7283 lines)

The generic IL tree walker visits every node in the EDG intermediate representation. It dispatches on 83 node kinds (1-86 with gaps at 24-26) using a massive switch statement.

Callback table at .bss 0x4F08014..0x4F08040:

| Address | Type | Callback | Call Sites |
|---|---|---|---|
| dword_4F08014 | bool | skip_shared_nodes | flag |
| dword_4F08018 | bool | clear_back_pointers | 49 sites |
| qword_4F08020 | fn(node, kind) -> node | list_node_rewrite_fn | 206 sites |
| qword_4F08028 | fn(node, kind) -> node | child_rewrite_fn | 926 sites |
| qword_4F08030 | fn(node, kind) -> bool | pre_visit_fn | 2 sites |
| qword_4F08038 | fn(str, kind, len) | string_visitor_fn | 80 sites |
| qword_4F08040 | fn(node, kind) | post_visit_fn | 14 sites |

Visit-mark protocol: Each node has a flag byte at node[-8]. Bit 2 tracks "visited in current pass" with polarity toggled per walk pass via dword_4D03B64. This avoids clearing visited marks between walks.
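The polarity trick can be sketched as follows (illustrative names; the real flag byte lives at node[-8] and the polarity flag at dword_4D03B64):

```python
VISIT_SHIFT = 2   # bit 2 of the node's flag byte carries the visit mark

class MarkWalker:
    def __init__(self):
        self.polarity = 0          # models dword_4D03B64

    def begin_pass(self):
        self.polarity ^= 1         # flip the meaning of "visited" per pass

    def is_visited(self, flags: int) -> bool:
        return (flags >> VISIT_SHIFT) & 1 == self.polarity

    def mark_visited(self, flags: int) -> int:
        return (flags & ~(1 << VISIT_SHIFT)) | (self.polarity << VISIT_SHIFT)
```

Because each pass flips the expected bit value, a node marked in pass N automatically reads as unvisited in pass N+1, so no clearing sweep is needed.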

Linked-list traversal pattern (60+ lists walked):

for cursor = node.field; cursor; cursor = cursor.next:
    if list_node_rewrite_fn:
        cursor = list_node_rewrite_fn(cursor, child_kind)
    if cursor:
        walk_il_node(cursor, child_kind)
        cursor = node.field  // re-read (rewrite may have changed it)

Next-pointer stride varies by node kind: +0, +16, +24, +32, +56, +112, +120 bytes.

Tree Copier -- sub_766570 (148KB, 5187 lines)

The copier is driven by template instantiation (sub_8C5CD0 -> sub_8C4EC0 -> sub_8C2C50 -> sub_766570). It uses the walker's callback infrastructure:

  • sub_8C38E0 = copy_ref callback: resolves pending copy destinations
  • sub_8C3810 = copy_scope callback: resolves scope-level copies
  • Node sizes from qword_4B6D500[tag] (87+ entries, one per IL node type)

Copy protocol using flag bits at node[-8]:

| Bits | Meaning |
|---|---|
| 0x1 | needs copy, not yet started |
| 0x2 | copy in progress |
| 0x3 | pending copy (both bits) |
| 0x4 | copy destination allocated |

Copy destination stored at *(node - 24). When both bits 0 and 1 are set, sub_8C3650 forces the copy by allocating qword_4B6D500[tag] bytes and performing memcpy followed by pointer rewriting.


EDG-to-LLVM Type Translation System

Entry: sub_91AED0 -> sub_91AB30. Uses a worklist-driven fixed-point iteration.

Translation Context Object (at a1+160)

| Offset | Size | Field |
|---|---|---|
| +0x000 | 8B | debug_logger |
| +0x008 | 8B | pass_list_ptr |
| +0x038 | 8B | edg_node_map (DenseMap: EDG -> LLVM values) |
| +0x058 | 8B | visited_set (DenseSet for dedup) |
| +0x060 | 4B | visited_count |
| +0x064 | 4B | visited_capacity |
| +0x068 | 4B | bucket_count |
| +0x090 | 8B | type_cache (DenseMap: EDG type -> LLVM Type*) |
| +0x168 | 4B | threshold |
| +0x2A0 | 8B | pending_replacements |
| +0x2A8 | 4B | pending_count |

Fixed-Point Algorithm

fn translate_all_types(ctx, module):
    // Phase 1: iterate module members
    for member in module.member_list:
        sub_AA3700(member)  // gather initial flags

    // Phase 2: fixed-point iteration
    do:
        ordering = sub_919CD0(module)        // topological sort (10-level BFS)
        for type in ordering.reverse():
            sub_913880(ctx, type)            // invalidate stale cache entries
        for type in ordering.reverse():
            changed |= sub_9197C0(ctx, type) // process single declaration
    while changed

    // Phase 3: optional late fixup (byte_3C35480-gated)
    if optimization_enabled:
        do:
            changed = sub_917E30(ctx)
        while changed

    // Phase 4: cleanup
    sub_909590(ctx)

Bitmask for Scope-Tracking Types

The expression 0x100000100003FF >> (kind - 25) selects which type kinds in the range [25..78] require scope tracking during translation. This covers compound types, pointer types, and dependent types that carry CUDA address-space qualifiers.
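The mask can be expanded to see exactly which kinds are selected; this is a direct evaluation of the recovered expression (the function name is illustrative):

```python
MASK = 0x100000100003FF

def needs_scope_tracking(kind: int) -> bool:
    # Valid for kind in [25..78]; the decompile shifts by (kind - 25).
    return (MASK >> (kind - 25)) & 1 == 1

# Bits 0-9, 28, and 52 of the mask are set, i.e. kinds 25-34, 53, and 77.
tracked = [k for k in range(25, 79) if needs_scope_tracking(k)]
```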


Usage Across the Compiler

DenseMap instances appear at these known locations:

  • NVVM context object: 8+ tables for IR node uniquing (opcodes 0x10..0x1F), plus sub-function tables for opcodes 0x04..0x15.
  • SelectionDAG builder context: Map A (+120), Map B (+152), Set C (+184) for node deduplication and worklist.
  • Per-node analysis: embedded DenseSet at +72 inside analysis structures created during DAG construction.
  • Instruction constraint table: the global word_3F3E6C0 array is a flat table rather than a DenseMap, but the constraint emission functions use DenseMaps for lookup caching.
  • EDG type translation: 5 distinct caches -- visited set, type cache, type-value map, scope table, and type index table.
  • Base class comparison: qword_4D03BF8 hash table for overload-resolution base class triple lookup.

The consistency of the hash function, sentinel values, and growth policy across all instances is documented in Hash Table and Collection Infrastructure.

Cross-References

Function Map

Function | Address | Size | Role
edg_parse_declarator | sub_662DE0 | -- | Top-level declarator parser
edg_parse_decl_specifiers_core | sub_672A20 | -- | While/switch token dispatcher
edg_resolve_decl_specifiers | sub_7C0F00 | -- | Scope chain + qualified name resolver
edg_compare_type_nodes | sub_7386E0 | -- | Structural type tree comparison
edg_compare_type_lists | sub_739370 | -- | Linked-list type comparator
edg_compare_decl_types | sub_739430 | -- | Declaration-level type comparator
edg_canonicalize_type | sub_72EC50 | -- | Typedef / elaborated alias stripper
edg_type_to_string | sub_74A390 | -- | Type-to-string for diagnostics
edg_walk_il_node | sub_7506E0 | -- | 190 KB IL tree walker (297 recursive calls)
edg_copy_il_node | sub_766570 | -- | 148 KB IL tree copier
edg_push_scope | sub_854590 | -- | Push scope stack entry
edg_pop_scope | sub_854430 | -- | Pop scope stack entry
edg_emit_scope_chain | sub_82BDA0 | -- | Scope chain emission
edg_unqualified_lookup | sub_7D5DD0 | -- | Unqualified name lookup
edg_qualified_lookup | sub_7D4600 | -- | Qualified name lookup (after ::)
edg_lookup_with_flags | sub_7D2AC0 | -- | Lookup with specific mode flags
edg_namespace_lookup | sub_7D4A40 | -- | Lookup in namespace scope
edg_compare_decl_identity | sub_8D97D0 | -- | Entity identity comparison
edg_type_translation_entry | sub_91AED0 | -- | Top-level EDG-to-LLVM type translation
edg_type_translation_driver | sub_91AB30 | -- | Fixed-point iteration driver
edg_type_kind_dispatch | sub_918E50 | -- | Type-kind dispatch for translation
edg_type_pair_compare | sub_911D10 | -- | Core type-pair comparison + replacement
edg_alloc_decl_node | sub_84DCB0 | -- | 152-byte declaration node allocator

NVVM Container Binary Format

The NVVM container is a proprietary binary envelope that wraps LLVM bitcode with compiler metadata for transport between pipeline stages in cicc v13.0. It carries target architecture, optimization options, fast-math flags, memory window configurations, per-kernel resource tables, and the IR payload itself -- all in a single serializable blob. Two serialization paths exist: a compact binary wire format used in production (nvcc / ptxas pipelines) and an XML-based format used for debugging and interchange. This page specifies the binary format in sufficient detail to write a conformant parser and serializer.

The format is implemented across 26 functions in the 0xCCBB10--0xCDD2D0 address range (Cluster C in the binary layout). The six top-level entry points:

Function | Address | Size | Role
NvvmContainer_serialize | 0xCDD2D0 | 47,540 B | Binary + XML serializer
NvvmContainer_deserialize_options | 0xCD1D80 | 51,859 B | Binary tag/value decoder
NvvmContainer_parse_header | 0xCDCA30 | 10,206 B | XML path header parser
NvvmContainer_check_versions | 0xCD41B0 | 16,708 B | Version compatibility gate
NvvmContainer_validate_versions | 0xCCD5F0 | 8,987 B | Standalone version validator
NvvmContainer_init_options_struct | 0xCCBB10 | small | Zero-init 248-byte container struct

Supporting parsers called from NvvmOptions_parse_compile_options (0xCDB4D0, 26,643 bytes):

Function | Address | Size | Role
NvvmOptions_parse_arch_enum | 0xCD09E0 | 14,516 B | ArchVariant enum string-to-int
NvvmOptions_parse_fast_math | 0xCCF590 | 12,771 B | FastMathOptions sub-structure
NvvmOptions_parse_multi_view | 0xCD6D20 | 12,188 B | MultiViewOptions sub-structure
NvvmOptions_parse_cb_reserved_area | 0xCCE780 | 9,802 B | CB reserved area config
NvvmOptions_parse_reg_targets | 0xCD7CE0 | 9,542 B | Register target config
NvvmOptions_parse_serialize_helper | 0xCD58A0 | 9,579 B | Option serialization helper
NvvmOptions_parse_shader_const_iface | 0xCCEEA0 | 8,355 B | ShaderConstIface (DCI)
NvvmOptions_parse_align_entries | 0xCD8610 | 6,739 B | Alignment entry config
NvvmOptions_parse_pgo_section | 0xCD02C0 | 5,482 B | PGO configuration
NvvmOptions_parse_section | 0xCD5510 | 5,166 B | Nested YAML section parser
NvvmOptions_parse_memory_windows | 0xCCE100 | 5,042 B | Memory window config
NvvmOptions_parse_cbank_config | 0xCCE4B0 | 4,173 B | Constant bank config
NvvmOptions_parse_bool_or_int | 0xCCC4A0 | small | Boolean/int option parser
NvvmOptions_parse_tristate | 0xCCCFB0 | small | Tri-state option parser
NvvmOptions_parse_string | 0xCD5150 | small | String option parser

The finalizer knobs parser (0xCD9990, 31,702 bytes) is called separately to ingest the full set of NVIDIA-specific backend knobs (see NVVMPassOptions).

Binary-level helpers:

Function | Address | Role
NvvmContainer_write_tag_value | 0xCD17A0 | Write one tag/value pair (called 121 times from serializer)
NvvmContainer_write_blob | 0xCD1AB0 | Write blob data + tag reference
NvvmContainer_compute_crc | 0xCCD2B0 | CRC with seeds 0x8DF5D74C, 0xBAA56A96

Global state: qword_4F87148 holds the NVVM options global state pointer, checked by many downstream consumers.

Binary Header

Every binary container begins with a fixed 24-byte header. The header is self-describing: HeaderSize at offset 0x0E stores its own length (always 24), and two size fields partition the remainder into a scalar tag region and a blob data region.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Magic  (0x7F4E5C7D)                       |  0x00
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Ver.Major    |  Ver.Minor    | NvvmIR.Major  | NvvmIR.Minor  |  0x04
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NvvmDbg.Major | NvvmDbg.Minor | Llvm.Major    | Llvm.Minor    |  0x08
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         IRLevel (u16)         |       HeaderSize (u16)        |  0x0C
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     ScalarFieldsEnd (u32)                     |  0x10
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      BlobDataEnd (u32)                        |  0x14
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
struct NvvmContainerBinaryHeader {
    uint32_t magic;              /* 0x00: must be 0x7F4E5C7D              */
    uint8_t  version_major;      /* 0x04: container format major (1)      */
    uint8_t  version_minor;      /* 0x05: container format minor (<=0x41) */
    uint8_t  nvvm_ir_major;      /* 0x06: NVVM IR version major (2)       */
    uint8_t  nvvm_ir_minor;      /* 0x07: NVVM IR version minor (<=0x62)  */
    uint8_t  nvvm_debug_major;   /* 0x08: debug info version major (3)    */
    uint8_t  nvvm_debug_minor;   /* 0x09: debug info version minor (<=2)  */
    uint8_t  llvm_major;         /* 0x0A: LLVM version (see encoding)     */
    uint8_t  llvm_minor;         /* 0x0B: LLVM version (see encoding)     */
    uint16_t ir_level;           /* 0x0C: IRLevel enum                    */
    uint16_t header_size;        /* 0x0E: always 24 (0x0018)              */
    uint32_t scalar_fields_end;  /* 0x10: byte offset past scalar region  */
    uint32_t blob_data_end;      /* 0x14: byte offset past blob region    */
};

The three data regions in order:

[0 .. 24)                         -- Header (fixed)
[24 .. scalar_fields_end)         -- Scalar tag/value pairs
[scalar_fields_end .. blob_data_end) -- Blob data region

The total container size is blob_data_end bytes. After the blob data region, the IR payload (LLVM bitcode, optionally compressed) follows immediately.

LLVM Version Encoding

The llvm_major and llvm_minor bytes encode the LLVM version as a combined integer: llvm_major * 100 + llvm_minor. For cicc v13.0 (LLVM 20), this yields 20 * 100 + 0 = 2000. The version check compares the combined value, not the individual bytes.

IRLevel Enum

Value | Name | Meaning
0 | NVVM_IR_LEVEL_UNIFIED_AFTER_DCI | Default: IR after Device-Code-Interface unification
1 | NVVM_IR_LEVEL_LTO | Link-Time Optimization IR (partially optimized)
2 | NVVM_IR_LEVEL_OPTIX | OptiX pipeline IR

Scalar Tag/Value Encoding

Immediately after the 24-byte header, a sequence of (tag, value) pairs encodes every container field that differs from its default value. The encoding is a variable-length scheme optimized for small values:

Case 1 -- value fits in 16 bits (0x0000..0xFFFE):
  [tag : int16] [value : int16]          -- 4 bytes total

Case 2 -- value needs 32 bits:
  [tag : int16] [0xFFFF : int16] [value : int32]  -- 8 bytes total

Terminator:
  [0x0000 : int16]                       -- tag 0 ends the sequence

All multi-byte fields are little-endian. The sentinel value 0xFFFF in the value slot signals that a full 32-bit value follows. This means the maximum encodable 16-bit value is 0xFFFE (65534); values of exactly 0xFFFF or larger require the extended form.

The serializer (sub_CD17A0, called 121 times from NvvmContainer_serialize) writes each tag/value pair using this scheme. The deserializer enters a switch loop over tags 1--402, decoding each value and writing it to the appropriate offset in the deserialized container struct.

Delta Encoding Strategy

The serializer allocates a default-initialized 440-byte Options struct and compares each field in the current Options against the corresponding default. Only fields that differ from the default are written as tag/value pairs. This makes typical containers very compact -- a standard compilation targeting SM 89 with -O2 might emit fewer than 20 tag/value pairs, covering just SmMajor, SmMinor, CompileMode, and a handful of target-specific flags.

The deserializer reverses this: it allocates a default Options struct first, then overwrites individual fields as tags are encountered. Unknown tags are silently skipped, which is the mechanism that provides forward compatibility -- a newer serializer can emit tags that an older deserializer simply ignores.

Blob Data Region

Tags in the 200+ and 400+ ranges reference variable-length data stored in the blob region. The scalar value for a blob tag is the byte offset into the blob region where the data begins. The blob region starts at scalar_fields_end bytes from the container start.

To resolve a blob reference: blob_ptr = container_base + scalar_fields_end + offset_value.

Blob entries do not carry explicit length fields in the tag/value stream. The deserializer knows each blob type's expected size from the tag ID (e.g., tag 201 is always 24 bytes, tag 203 is always 40 bytes). Variable-length blobs like strings (tags 209, 210, 213, 216, 217) are null-terminated. Length-prefixed blobs (tag 218) carry a 4-byte length prefix.

Complete Tag Table

144 distinct tag IDs organized into six ranges, plus the standalone compression tag 99. The "Offset" column refers to the byte position within the deserialized 440-byte Options struct.

Range 1--39: Core Scalar Options

Tag | Type | Name | Options Offset | Notes
1 | int32 | SmMajor | +0 (ArchVariant) | SM major version (e.g., 8 for SM 89)
2 | int32 | SmMinor | +0 (ArchVariant) | SM minor version (e.g., 9 for SM 89)
3 | int32 | NumRegs | +216 | Register count hint
4 | int32 | NumBarriers | +220 | Barrier count
5 | int32 | SharedMemorySize | +224 | Shared memory size in bytes
6 | int32 | VertexMode | +72 | See VertexMode enum
7 | bit | ReserveLocalAddressZero | +20 bit 0 | Reserve address 0 in local memory
8 | bit | FastMath.IgnoreInf | +200 bit 0 | Treat infinities as NaN
9 | bit | FastMath.IgnoreNaN | +200 bit 1 | Assume no NaN values present
10 | bit | FastMath.IgnoreSignedZero | +200 bit 2 | Ignore sign of zero
11 | bit | FastMath.ReorderFloat | +200 bit 3 | Allow float reordering
12 | bit | FastMath.ReorderHalf | +200 bit 4 | Allow half-precision reordering
13 | bit | FastMath.Ftz | +200 bit 5 | Flush denormals to zero
14 | bit | FastMath.FastSqrt | +200 bit 6 | Use fast sqrt approximation
15 | bit | FastMath.Fmad | +200 bit 7 | Allow fused multiply-add
16 | bit | FastMath.AllowRcpRsqToSqrt | +201 bit 0 | Allow rcp(rsqrt(x)) to sqrt(x)
17 | bit | FastMath.CanReorderFloatDistribute | +201 bit 1 | Allow distributive reordering
18 | int32 | FastMath.Reserved | +204 | Reserved fast-math field
19 | int32 | MaxRRegsAllowed | +216 | Maximum registers per thread (primary)
20 | int32 | SchedRegTarget | +220 | Scheduling register pressure target
21 | int32 | UnrollControl | +224 | Unroll factor control
22 | bool | AcceleratedArch | +232 | True for sm_XXa variants
23 | bool | StdELF | +233 | Use standard ELF output format
24 | int32 | MaxRRegsAllowed2 | +216 | Secondary max-regs (override)
25 | int32 | SchedRegTarget2 | +220 | Secondary sched target
26 | bit | FastMath.ReassociateFloatAddOverMad | +201 bit 2 | Float add reassociation over MAD
27 | bit | ForceImmediateConstants | +20 bit 1 | Force immediate constant loading
28 | bit | HideFunctions | +20 bit 2 | Hide internal functions from output
29 | bit | UseDX10AddressInRange | +20 bit 3 | DX10 address range mode
30 | int32 | UnrollControl2 | +224 | Secondary unroll control
31 | bit | FastMath.NoFloatMAD | +201 bit 3 | Disable float MAD formation
32 | bool | AcceleratedArch2 | +232 | Secondary accelerated-arch flag
33 | bit | FastMath.LaxFP16ApproximateDivision | +201 bit 4 | Lax FP16 approximate division
34 | bool | StdELF2 | +233 | Secondary StdELF
35 | int32 | ShaderCodegenSelMask | +236 | Shader codegen selection bitmask
36 | bool | OmegaPtxErrorHandling | +240 | Enable Omega-style PTX error handling
37 | int32 | FDLInsertMode | +244 | See FDLInsertMode enum
38 | bit | IsPIC | +20 bit 4 | Position-independent code flag
39 | bit | NoSpillsConstraint | +20 bit 5 | Hard constraint: no register spills

Tag 99: Compression Metadata

Tag | Type | Name | Notes
99 | int32 | CompressAlgoId | Compression algorithm selector for IR payload

When present, the IR payload following the blob region is compressed. The value selects a codec via sub_16886D0(algo_id). If the value is 0, the runtime substitutes the default algorithm ID 0x75D49913 (1,976,867,091 decimal). The codec is a pluggable compression/encryption layer accessed through four function pointers:

/* Compression codec API (addresses in the 0x1688xxx range) */
void *codec_acquire(uint32_t algo_id);            /* sub_16886D0 */
int   codec_compress(void *codec, void *data,
                     size_t size);                 /* sub_1688730 */
int   codec_decompress(void *codec, void *data,
                       size_t size);               /* sub_16887A0 */
void  codec_release(void *codec);                  /* sub_1688720 */

The write path in NvvmContainer_serialize (0xCDD2D0) compresses the LLVM bitcode payload via sub_C8D290, then computes a CRC hash via NvvmContainer_compute_crc (0xCCD2B0) with the two seed values 0x8DF5D74C and 0xBAA56A96. The CRC value is stored as the CompressAlgoId tag 99 value, which doubles as an integrity-check token: the deserializer uses the same CRC seeds to verify the payload before decompression.

The compression subsystem lives outside the main container cluster at addresses 0x16886D0--0x16887A0, in the utility library region of the binary.

Range 101--173: Extended Target Options

These tags configure per-kernel and target-specific hardware parameters. Most map into a sub-structure accessed through the Options struct. The "Location" column indicates either the field offset or the packed bitfield position (byte, bit) within the target options sub-structure.

Tag | Type | Name | Location | Notes
101 | bool | HasTextureOps | offset 0 | Target supports texture operations
102 | bool | HasSurfaceOps | offset 0 | Target supports surface operations
103 | bool | HasAtomics | offset 0 | Target supports atomic operations
104 | bool | HasVote | offset 0 | Target supports warp vote intrinsics
105 | int32 | MaxThreadsPerBlock | offset 4 | Maximum CTA thread count
106 | byte | PreferL1SizeFlag | offset 8 | L1 cache vs shared memory preference
107 | bool | HasWarpShuffle | offset 0 | Target supports warp shuffle
108 | bool | HasFunnelShift | offset 0 | Target supports funnel shift
109 | int32 | CBankOfstLow | offset 12 | Constant bank offset lower bound
110 | int32 | CBankOfstHi | offset 16 | Constant bank offset upper bound
111 | int32 | CBankSize | offset 20 | Constant bank size in bytes
112 | bit | Bit0_68 | byte 68, bit 0 | Target capability flag
113 | bit | Bit1_68 | byte 68, bit 1 | Target capability flag
114 | bit | Bit2_68 | byte 68, bit 2 | Target capability flag
115 | bit | Bit3_68 | byte 68, bit 3 | Target capability flag
116 | bit | Bit4_68 | byte 68, bit 4 | Target capability flag
117 | bit | Bit5_68 | byte 68, bit 5 | Target capability flag
118 | bit | Bit7_68 | byte 68, bit 7 | Target capability flag (bit 6 skipped)
119 | bit | EnableCoalesce | byte 69, bit 0 | Enable memory coalescing optimization
120 | bit | EnableVectorize | byte 69, bit 2 | Enable auto-vectorization
121 | 2-bit | CompactionMode | byte 69, bits 3--4 | Thread compaction strategy (0--3)
122 | int32 | StackFrameSize | offset 96 | Stack frame size in bytes
123 | int32 | StackAlignment | offset 100 | Stack alignment requirement
124 | int32 | ParamSpaceSize | offset 104 | Parameter space size
125 | int32 | ParamAlignment | offset 108 | Parameter space alignment
126 | int32 | LocalMemSize | offset 116 | Local memory size per thread
127 | int32 | SharedBankConfig | offset 156 | Shared memory bank configuration
128 | int32 | MinGridSize | offset 248 | Minimum grid size for occupancy
129 | int32 | MaxGridDimX | offset 252 | Maximum X-dimension grid size
130 | int32 | SharedMemPerBlock | offset 264 | Shared memory per block
131 | 2-bit | WarpScheduleMode | byte 70, bits 0--1 | Warp scheduling strategy
132 | bit | EnablePrefetch | byte 70, bit 2 | Enable memory prefetch instructions
133 | bit | Bit4_70 | byte 70, bit 4 | Target capability flag
134 | bit | Bit5_70 | byte 70, bit 5 | Target capability flag
135 | bit | Bit6_70 | byte 70, bit 6 | Target capability flag
136 | bit | Bit7_70 | byte 70, bit 7 | Target capability flag
137 | int32 | MaxDynShared | offset 268 | Maximum dynamic shared memory
138 | bool | HasLDG | offset 5 | Target supports LDG instruction
139 | bit | Bit1_71 | byte 71, bit 1 | Target capability flag
140 | bit | Bit2_71 | byte 71, bit 2 | Target capability flag
141 | bool | HasBarrierReduce | offset 40 | Target supports barrier-reduce
142 | int32 | CacheConfig | offset 280 | Cache configuration selector
143 | bit | Bit6_68 | byte 68, bit 6 | Target capability flag
144 | bit | Bit3_71 | byte 71, bit 3 | Target capability flag
145 | bit | Bit0_71 | byte 71, bit 0 | Target capability flag
146 | int32 | ConstBankSize | offset 256 | Constant bank total size
147 | int32 | ShMemBankStride | offset 152 | Shared memory bank stride
148 | 2-bit | ScheduleMode2 | byte 71, bits 4--5 | Secondary scheduling mode
149 | bit | Bit6_71 | byte 71, bit 6 | Target capability flag
150 | bit | Bit7_71 | byte 71, bit 7 | Target capability flag
151 | int32 | LocalMemAlignment | offset 112 | Local memory alignment
152 | bit | EnableBarrierOpt | byte 69, bit 5 | Enable barrier optimization
153 | bit | Bit0_72 | byte 72, bit 0 | Target capability flag
154 | bit | Bit6_69 | byte 69, bit 6 | Target capability flag
155 | bit | Bit7_69 | byte 69, bit 7 | Target capability flag
156 | bit | Bit1_72 | byte 72, bit 1 | Target capability flag
157 | bool | HasDP4A | offset 1 | Target supports DP4A dot-product
158 | bit | Bit3_72 | byte 72, bit 3 | Target capability flag
159 | int32 | ConstBankSize2 | offset 260 | Secondary constant bank size
160 | int32 | MaxRegsPerThread | offset 284 | Hard limit on registers per thread
161 | int32 | ClusterSize | offset 276 | Thread block cluster size (SM 90+)
162 | bit | Bit4_72 | byte 72, bit 4 | Target capability flag
163 | bit | Bit5_72 | byte 72, bit 5 | Target capability flag
164 | bit | Bit6_72 | byte 72, bit 6 | Target capability flag
165 | bit | Bit7_72 | byte 72, bit 7 | Target capability flag
166 | int32 | MaxCTAPerSM | offset 160 | Maximum CTAs per SM
167 | int32 | TexIndirectLimit | offset 272 | Texture indirect access limit
168 | bit | Bit0_432 | byte 432, bit 0 | Extended capability flag
169 | bit | Bit1_432 | byte 432, bit 1 | Extended capability flag
170 | bit | Bit2_432 | byte 432, bit 2 | Extended capability flag
171 | bool | HasTMAOps | offset 289 | Target supports TMA operations (SM 90+)
172 | bit | Bit3_70 | byte 70, bit 3 | Target capability flag
173 | bool | HasTCGen05 | offset 290 | Target supports TCGen05 (SM 100+)

Range 201--218: Blob Data Tags

Tag | Size | Name | Description
201 | 24 B | MemoryWindowCBank | 3 memory window entries for constant bank (see below)
202 | 24 B | MemoryWindowLocal | 3 memory window entries for local memory
203 | 40 B | MemoryWindowShared | 10 x uint32_t for shared memory windows + flags
204 | 48 B | MultiViewOptions | Multi-view rendering header + typed arrays
205 | var | TargetResourceTable | 24-byte header + 36 bytes per entry
206 | var | PerKernelCBankOffsets | 4-byte count + 4 bytes per kernel
207 | var | PerKernelStackSizes | 4-byte count + 4 bytes per kernel
208 | var | PerKernelSMEMSizes | 8-byte count + 8 bytes per kernel
209 | var | TargetFuncName | Null-terminated string
210 | var | TargetEntryName | Null-terminated string
211 | 8 B | PerKernelQWORD | 8-byte per-kernel datum
212 | 12 B | ExtraMemParams | 8 + 4 bytes of memory parameters
213 | var | AuxString1 | Null-terminated auxiliary string
214 | var | PerKernelRegisters | 4-byte count + 4 bytes per kernel
215 | var | PerKernelBarriers | 4-byte count + 4 bytes per kernel
216 | var | AuxString2 | Null-terminated auxiliary string
217 | var | AuxString3 | Null-terminated auxiliary string
218 | var | AuxByteArray | 4-byte length prefix + raw bytes

Range 301--309: Extended Int32 Fields

Tag | Type | Name | Options Offset | Notes
301 | int32 | ExtOpt.Field344 | +344 | Cluster/group configuration selector
302 | int32 | ExtOpt.Field348 | +348 | Extended option
303 | int32 | ExtOpt.Field352 | +352 | Extended option
304 | int32 | ExtOpt.Field356 | +356 | Extended option
305 | int32 | ExtOpt.Field360 | +360 | Extended option
306 | int32 | ExtOpt.Field400 | +400 | Extended option
307 | int32 | ExtOpt.Field364 | +364 | Extended option
308 | int32 | ExtOpt.Field368 | +368 | Extended option
309 | int32 | ExtOpt.Field372 | +372 | Extended option

Range 351--353: Extended Int64 Blob References

Tag | Size | Name | Options Offset
351 | 8 B | ExtOpt.QWord376 | +376
352 | 8 B | ExtOpt.QWord384 | +384
353 | 8 B | ExtOpt.QWord392 | +392

Range 401--402: Structured Blob Data

These tags are conditionally parsed based on the value of tag 301 (ExtOpt.Field344):

Tag | Condition | Size | Name | Notes
401 | Field344 == 1 | 56+ B | TMADescriptor | SM 90 Hopper TMA bulk-copy descriptors. 44-byte fixed header + 16 bytes per entry.
402 | Field344 == 4 | 40+ B | TCGen05Config | SM 100 Blackwell TCGen05 tensor configurations. 32-byte fixed header + 12 bytes per entry.

The conditional parsing means a single container cannot carry both TMA and TCGen05 data -- the Field344 value selects which hardware generation's tensor memory interface is active.

TMADescriptor Layout (Tag 401, Field344 == 1)

TMA (Tensor Memory Access) descriptors configure cp.async.bulk operations on SM 90 Hopper. The TMA descriptor extraction is performed by sub_9483E0 during intrinsic lowering. The blob layout:

struct TMADescriptor {
    /* +0  */ uint32_t num_entries;          /* Number of TMA descriptors     */
    /* +4  */ uint32_t dimensionality;       /* 1d..5d tensor rank            */
    /* +8  */ uint32_t element_size;         /* Bytes per element             */
    /* +12 */ uint32_t interleave_layout;    /* Memory interleave pattern     */
    /* +16 */ uint32_t swizzle_mode;         /* Swizzle mode selector         */
    /* +20 */ uint32_t fill_mode;            /* Out-of-bounds fill behavior   */
    /* +24 */ uint32_t global_dims[5];      /* Global tensor dimensions      */
    /* +44 */ /* --- 16 bytes per entry --- */
    /*        uint32_t box_dim;              Per-entry box dimension          */
    /*        uint32_t stride;               Per-entry stride                 */
    /*        uint32_t elem_stride;          Per-entry element stride         */
    /*        uint32_t reserved;             Reserved/padding                 */
};

See SM 90 Hopper for the TMA instruction format and the cp.async.bulk.tensor.g2s.tile.{1d,2d,3d,4d,5d} intrinsic family.

TCGen05Config Layout (Tag 402, Field344 == 4)

TCGen05 (Tensor Core Generation 5) configurations describe Blackwell SM 100 tensor memory operations. The TCGen05 instruction set includes tcgen05.alloc, tcgen05.dealloc, tcgen05.commit, tcgen05.fence, tcgen05.wait, and tcgen05.relinquish.alloc -- all gated by the SM 100 arch-conditional check at sub_30462A0. The blob layout:

struct TCGen05Config {
    /* +0  */ uint32_t num_entries;          /* Number of TCGen05 configs     */
    /* +4  */ uint32_t accumulator_size;     /* Accumulator memory size       */
    /* +8  */ uint32_t commit_mode;          /* Commit mode (multicast flags) */
    /* +12 */ uint32_t fence_mode;           /* Fence mode selector           */
    /* +16 */ uint32_t reserved[4];         /* Reserved fields               */
    /* +32 */ /* --- 12 bytes per entry --- */
    /*        uint32_t config_id;            TCGen05 config identifier        */
    /*        uint32_t fragment_count;       Number of fragments              */
    /*        uint32_t flags;                Per-config flags                 */
};

See SM 100 Blackwell for the TCGen05 instruction set and the tcgen05.* intrinsic family.

Deserialized Container Struct

After parsing, the container is represented as a 248-byte in-memory structure allocated by NvvmContainer_init_options_struct (0xCCBB10). This struct holds the container metadata plus a pointer to the full 440-byte Options struct.

struct NvvmContainerHeader {          /* 248 bytes total                   */
    /* 0x00 */ uint32_t sm_major;     /* Tag 1: SM major version           */
    /* 0x04 */ uint32_t sm_minor;     /* Tag 2: SM minor version           */
    /* 0x08 */ uint32_t num_regs;     /* Tag 3                             */
    /* 0x0C */ uint32_t num_barriers; /* Tag 4                             */
    /* 0x10 */ uint32_t shared_mem_size; /* Tag 5                          */
    /* 0x14 */ uint8_t  flags_14;     /* Packed bits: tags 7,27,28,29,38,39*/
    /*         bit 0: ReserveLocalAddressZero  (tag 7)                     */
    /*         bit 1: ForceImmediateConstants  (tag 27)                    */
    /*         bit 2: HideFunctions            (tag 28)                    */
    /*         bit 3: UseDX10AddressInRange     (tag 29)                   */
    /*         bit 4: IsPIC                    (tag 38)                    */
    /*         bit 5: NoSpillsConstraint       (tag 39)                   */
    /* 0x15 */ uint8_t  _pad15[3];
    /* 0x18 */ uint8_t  multi_view_options[48]; /* Tag 204 blob            */
    /* 0x48 */ uint32_t vertex_mode;  /* Tag 6                             */
    /* 0x4C */ uint8_t  _pad4c[4];
    /* 0x50 */ uint32_t max_rregs;    /* Tag 19                            */
    /* 0x54 */ uint32_t sched_reg_target; /* Tag 20                        */
    /* 0x58 */ uint32_t unroll_control; /* Tag 21                          */
    /* 0x5C */ uint8_t  _pad5c[4];
    /* 0x60 */ uint8_t  mem_win_cbank[24];  /* Tag 201 blob                */
    /* 0x78 */ uint8_t  mem_win_local[24];  /* Tag 202 blob                */
    /* 0x90 */ uint8_t  mem_win_shared[40]; /* Tag 203 blob                */
    /* 0xB8 */ uint8_t  _padb8[12];
    /* 0xC4 */ uint8_t  accelerated_arch; /* Tag 22                        */
    /* 0xC5 */ uint8_t  std_elf;      /* Tag 23                            */
    /* 0xC6 */ uint8_t  _padc6[2];
    /* 0xC8 */ uint8_t  fast_math[8]; /* Tags 8-17,26,31,33 bitfields     */
    /* 0xD0 */ uint8_t  _padd0[8];
    /* 0xD8 */ uint32_t max_rregs_2;  /* Tag 24                            */
    /* 0xDC */ uint32_t sched_reg_2;  /* Tag 25                            */
    /* 0xE0 */ uint32_t unroll_ctl_2; /* Tag 30                            */
    /* 0xE4 */ uint32_t compress_algo_id; /* Tag 99                        */
    /* 0xE8 */ uint8_t  omega_ptx_err; /* Tag 32                           */
    /* 0xE9 */ uint8_t  std_elf_2;    /* Tag 34                            */
    /* 0xEA */ uint8_t  _padea[2];
    /* 0xEC */ uint32_t shader_cg_sel; /* Tag 35                           */
    /* 0xF0 */ uint8_t  fdl_bit;      /* Tag 36                            */
    /* 0xF1 */ uint8_t  _padf1[3];
    /* 0xF4 */ uint32_t fdl_insert_mode; /* Tag 37                         */
};
/* sizeof(NvvmContainerHeader) == 248 (0xF8) */

The Options pointer is stored at offset 208 (0xD0) of the container header during deserialization -- the container header acts as both a data holder and an index into the full Options struct.

Options Struct (440 bytes)

The full compiler options structure is allocated separately and linked from the container header. It is parsed by NvvmOptions_parse_compile_options (0xCDB4D0, 26,643 bytes) in the XML path, or populated field-by-field from tags in the binary path.

struct NvvmOptions {                   /* 440 bytes total                  */
    /* +0   */ uint32_t arch_variant;  /* ArchVariant enum                 */
    /* +4   */ uint32_t compile_mode;  /* CompileMode enum                 */
    /* +8   */ uint32_t opt_level;     /* OptLevel enum                    */
    /* +12  */ uint32_t debug_info;    /* DebugInfo enum                   */
    /* +16  */ uint32_t client_version;
    /* +20  */ uint8_t  flags_20;      /* Packed booleans: 6 bits          */
    /*          bit 0: ReserveLocalAddressZero                             */
    /*          bit 1: ForceImmediateConstants                             */
    /*          bit 2: HideFunctions                                       */
    /*          bit 3: UseDX10AddressInRange                               */
    /*          bit 4: IsPIC                                               */
    /*          bit 5: NoSpillsConstraint                                  */
    /* +21  */ uint8_t  _pad21[3];
    /* +24  */ uint8_t  multi_view[48]; /* MultiViewOptions sub-structure  */
    /* +72  */ uint32_t vertex_mode;    /* VertexMode enum                 */
    /* +76  */ uint8_t  _pad76[4];
    /* +80  */ uint8_t  dci_info[120];  /* DCIInfo sub-structure           */
    /* +200 */ uint8_t  fast_math_byte0; /* FastMath bits 0-7              */
    /* +201 */ uint8_t  fast_math_byte1; /* FastMath bits 8-12             */
    /* +202 */ uint8_t  _pad202[2];
    /* +204 */ uint32_t fast_math_reserved;
    /* +208 */ uint8_t  _pad208[8];
    /* +216 */ uint32_t max_rregs_allowed;
    /* +220 */ uint32_t sched_reg_target;
    /* +224 */ uint32_t unroll_control;
    /* +228 */ uint32_t okey;           /* CompressAlgoId / OKey           */
    /* +232 */ uint8_t  accelerated_arch;
    /* +233 */ uint8_t  std_elf;
    /* +234 */ uint8_t  _pad234[2];
    /* +236 */ uint32_t shader_codegen_sel_mask;
    /* +240 */ uint8_t  omega_ptx_error_handling;
    /* +241 */ uint8_t  _pad241[3];
    /* +244 */ uint32_t fdl_insert_mode;
    /* +248 */ uint8_t  target_opts[192]; /* Extended target options (tags 101-173) */
};
/* sizeof(NvvmOptions) == 440 (0x1B8) */

DCIInfo Sub-Structure (Options +80, 120 bytes)

The Device-Code-Interface sub-structure at offset +80 contains the shader constant interface and constant bank reserved area configurations. Parsed by NvvmOptions_parse_shader_const_iface (0xCCEEA0, 8,355 bytes) and NvvmOptions_parse_cb_reserved_area (0xCCE780, 9,802 bytes).

ShaderConstIface XML fields (from sub_CCEEA0):

Field | Type | Description
OptimizerConstBank | int32 | Constant bank index used by the optimizer
DriverConstBank | int32 | Constant bank index used by the driver
BindlessTextureBank | int32 | Constant bank for bindless texture handles
LocalMemoryWindow | struct | Memory window config for local memory
SharedMemoryWindow | struct | Memory window config for shared memory
VectorizeAndRemapTLD | bool | Enable vectorization and TLD remapping
ELFControlsDCI | bool | ELF controls DCI interface layout
DiscardDefaultValueOutputs | bool | Discard outputs that match default values

CBReservedArea XML fields (from sub_CCE780):

Field | Type | Description
ByteOffsetToEndOfReservedArea | int32 | End-of-reserved-area offset in constant bank
CbAddressBitsInReservedVABase | int32 | Address bits for reserved virtual address base
CbBankToReservedVABase | int32 | Constant bank index for reserved VA base
ForceHighLatencyConstExpr | bool | Force high-latency constant expression evaluation
ReservedCbReadBank | int32 | Reserved constant bank read bank index

MultiViewOptions Sub-Structure (Options +24, 48 bytes)

The multi-view rendering options sub-structure at offset +24 carries graphics pipeline multi-view configuration. Parsed by NvvmOptions_parse_multi_view (0xCD6D20, 12,188 bytes). Serialized as blob tag 204.

Field | Type | Description
NumViews | int32 | Number of rendering views
NominalViewIDs | int32[] | Array of nominal view identifiers
PerViewRTIndexConstants | int32[] | Per-view render target index constants
EnableViewInstanceMask | bool | Enable per-view instance masking
ComputePerPatchAttribsForViewZero | bool | Compute per-patch attributes for view 0
IsImplicit | bool | Implicit multi-view mode

CompileMode Enum

Value | Name | Meaning
0 | NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI | Whole-program with ABI compliance
1 | NVVM_COMPILE_MODE_WHOLE_PROGRAM_NOABI | Whole-program without ABI (internal)
2 | NVVM_COMPILE_MODE_SEPARATE_ABI | Separate compilation (relocatable, --device-c)
3 | NVVM_COMPILE_MODE_EXTENSIBLE_WHOLE_PROGRAM_ABI | Extensible whole-program with ABI

OptLevel Enum

Value | Name
0 | NVVM_OPT_LEVEL_NONE
1 | NVVM_OPT_LEVEL_1
2 | NVVM_OPT_LEVEL_2 (default)
3 | NVVM_OPT_LEVEL_3

DebugInfo Enum

Value | Name
0 | NVVM_DEBUG_INFO_NONE (default)
1 | NVVM_DEBUG_INFO_LINE_INFO
2 | NVVM_DEBUG_INFO_DWARF

VertexMode Enum

Value | Name
0 | NVVM_VERTEX_MODE_SINGLE
1 | NVVM_VERTEX_MODE_A
2 | NVVM_VERTEX_MODE_B
3 | NVVM_VERTEX_MODE_AB

FDLInsertMode Enum

Value | Name
----- | ----
0 | NVVM_FDL_MODE_NONE
1 | NVVM_FDL_MODE_ALL
2 | NVVM_FDL_MODE_APP

ArchVariant Enum

The architecture enum uses a numeric encoding where the value equals major * 10 + minor, producing two-digit values for pre-Blackwell architectures and three-digit values for Blackwell. There are two parallel enum spaces: "virtual" architecture variants (used for compute_XX targets) and "HW" variants (used for sm_XX real silicon targets). The virtual variants are serialized by name in the XML format via NvvmOptions_parse_arch_enum (0xCD09E0, 14,516 bytes).

Virtual Architecture Variants

Enum Name | Numeric Value | Generation | SM
--------- | ------------- | ---------- | --
NVVM_ARCH_KEPLER_3_0 | 30 | Kepler | 3.0
NVVM_ARCH_KEPLER_3_2 | 32 | Kepler | 3.2
NVVM_ARCH_KEPLER_3_5 | 35 | Kepler | 3.5
NVVM_ARCH_KEPLER_3_7 | 37 | Kepler | 3.7
NVVM_ARCH_MAXWELL_5_0 | 50 | Maxwell | 5.0
NVVM_ARCH_MAXWELL_5_2 | 52 | Maxwell | 5.2
NVVM_ARCH_MAXWELL_5_3 | 53 | Maxwell | 5.3
NVVM_ARCH_PASCAL_6_0 | 60 | Pascal | 6.0
NVVM_ARCH_PASCAL_6_1 | 61 | Pascal | 6.1
NVVM_ARCH_PASCAL_6_2 | 62 | Pascal | 6.2
NVVM_ARCH_VOLTA_7_0 | 70 | Volta | 7.0
NVVM_ARCH_VOLTA_7_2 | 72 | Volta | 7.2
NVVM_ARCH_TURING_7_3 | 73 | Turing | 7.3
NVVM_ARCH_TURING_7_5 | 75 | Turing | 7.5
NVVM_ARCH_AMPERE_8_0 | 80 | Ampere | 8.0
NVVM_ARCH_AMPERE_8_2 | 82 | Ampere | 8.2
NVVM_ARCH_AMPERE_8_6 | 86 | Ampere | 8.6
NVVM_ARCH_AMPERE_8_7 | 87 | Ampere | 8.7
NVVM_ARCH_AMPERE_8_8 | 88 | Ampere | 8.8
NVVM_ARCH_ADA_8_9 | 89 | Ada Lovelace | 8.9
NVVM_ARCH_HOPPER_9_0 | 90 | Hopper | 9.0
NVVM_ARCH_BLACKWELL_10_0 | 100 | Blackwell | 10.0
NVVM_ARCH_BLACKWELL_10_1 | 101 | Blackwell | 10.1
NVVM_ARCH_BLACKWELL_10_3 | 103 | Blackwell | 10.3
NVVM_ARCH_BLACKWELL_11_0 | 110 | Blackwell (Jetson Thor) | 11.0
NVVM_ARCH_BLACKWELL_12_0 | 120 | Blackwell (RTX 50xx / Pro) | 12.0
NVVM_ARCH_BLACKWELL_12_1 | 121 | Blackwell (DGX Spark) | 12.1

Note: NVVM_ARCH_BLACKWELL_10_1 maps to __CUDA_ARCH__ 1010, while NVVM_ARCH_BLACKWELL_11_0 maps to __CUDA_ARCH__ 1100. Despite both being in the BLACKWELL family, they are distinct architectures with separate entries in the processor table. sm_110 (Jetson Thor) was originally designated sm_101 before being renumbered to its own 11.x line.

HW Architecture Variants

The HW variants use a major * 100 + minor * 10 encoding for their internal numeric values. These map to real silicon rather than virtual compute capabilities:

Enum Name | Internal Value | Notes
--------- | -------------- | -----
NVVM_ARCH_HW_SM_5_0 | 500 | Maxwell HW baseline
... | ... | One entry per supported HW SM through 9.0
NVVM_ARCH_HW_SM_10_0 | 1000 | Blackwell datacenter
NVVM_ARCH_HW_SM_10_1 | 1010 | Blackwell Ultra (GB300)
NVVM_ARCH_HW_SM_10_3 | 1030 | Blackwell variant
NVVM_ARCH_HW_SM_10_4 | 1200 | Maps to SM 120 value -- not publicly documented

The HW_SM_10_4 = 1200 mapping is notable: SM 10.4 in the HW enum space corresponds to the SM 120 consumer architecture. This reveals that "SM 120" is internally considered a Blackwell 10.4 die variant, not a separate generation.
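The two encodings above can be sketched as plain arithmetic. This is an illustration of the numbering scheme described in this section, not recovered code; the helper names are invented:

```python
# Illustrative sketch of the two ArchVariant numeric encodings described
# above. arch_virtual matches the virtual compute_XX enum values; arch_hw
# matches the HW sm_XX enum values, except for the SM 10.4 anomaly, which
# the binary stores as 1200 (i.e., the SM 12.0 slot).
def arch_virtual(major: int, minor: int) -> int:
    """Virtual encoding: major * 10 + minor."""
    return major * 10 + minor

def arch_hw(major: int, minor: int) -> int:
    """HW encoding: major * 100 + minor * 10."""
    return major * 100 + minor * 10

assert arch_virtual(7, 5) == 75       # NVVM_ARCH_TURING_7_5
assert arch_virtual(12, 1) == 121     # NVVM_ARCH_BLACKWELL_12_1
assert arch_hw(5, 0) == 500           # NVVM_ARCH_HW_SM_5_0
assert arch_hw(10, 1) == 1010         # NVVM_ARCH_HW_SM_10_1
# NVVM_ARCH_HW_SM_10_4 breaks the pattern: it is stored as arch_hw(12, 0).
assert arch_hw(12, 0) == 1200
```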

FastMathOptions Bitfields

The fast-math configuration occupies two bytes at Options offsets +200 and +201, with an additional int32 at +204. Each bit independently controls one floating-point relaxation.

Byte +200 (tags 8--15)

  Bit 7   Bit 6   Bit 5   Bit 4   Bit 3   Bit 2   Bit 1   Bit 0
+-------+-------+-------+-------+-------+-------+-------+-------+
|  Fmad | Fast  |  Ftz  |Reorder|Reorder|Ignore | Ignore|Ignore |
|       | Sqrt  |       | Half  | Float | Sign0 |  NaN  |  Inf  |
+-------+-------+-------+-------+-------+-------+-------+-------+
  tag 15  tag 14  tag 13  tag 12  tag 11  tag 10  tag 9   tag 8

Byte +201 (tags 16--17, 26, 31, 33)

  Bit 7   Bit 6   Bit 5   Bit 4   Bit 3   Bit 2   Bit 1   Bit 0
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       | Lax   | No    |Reassoc|CanReor| Allow |
|       |       |       | FP16  | Float | Float |derDist| Rcp   |
|       |       |       | Div   | MAD   |AddMAD |ribute | Rsq   |
+-------+-------+-------+-------+-------+-------+-------+-------+
                          tag 33  tag 31  tag 26  tag 17  tag 16
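A decoder for these two bytes follows directly from the diagrams. The flag names and bit positions are taken from the layout above; the function itself is a reconstruction for illustration, not recovered code:

```python
# Sketch: decode the two fast-math bitfield bytes into named flags,
# following the bit/tag layout diagrammed above (byte +200 bits 0-7 are
# tags 8-15; byte +201 bits 0-4 are tags 16, 17, 26, 31, 33).
BYTE_200_FLAGS = [  # bit 0 .. bit 7
    "IgnoreInf", "IgnoreNaN", "IgnoreSignedZero", "ReorderFloat",
    "ReorderHalf", "Ftz", "FastSqrt", "Fmad",
]
BYTE_201_FLAGS = [  # bit 0 .. bit 4
    "AllowRcpRsqToSqrt", "CanReorderFloatDistribute",
    "ReassociateFloatAddOverMad", "NoFloatMAD",
    "LaxFP16ApproximateDivision",
]

def decode_fast_math(byte_200: int, byte_201: int) -> dict:
    flags = {name: bool(byte_200 >> i & 1) for i, name in enumerate(BYTE_200_FLAGS)}
    flags.update({name: bool(byte_201 >> i & 1) for i, name in enumerate(BYTE_201_FLAGS)})
    return flags

# 0xA0 sets bit 7 (Fmad) and bit 5 (Ftz); 0x01 sets bit 0 (AllowRcpRsqToSqrt).
f = decode_fast_math(0xA0, 0x01)
assert f["Fmad"] and f["Ftz"] and f["AllowRcpRsqToSqrt"]
assert not f["IgnoreInf"]
```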

FastMath Divide Sub-Enum

The Divide field within FastMathOptions is a nested enum serialized by name in the XML path:

Value | Name | Meaning
----- | ---- | -------
0 | NVVM_FAST_MATH_DIVIDE_PRECISE_NO_FTZ | IEEE-compliant division, no flush-to-zero
1 | NVVM_FAST_MATH_DIVIDE_PRECISE_ALLOW_FTZ | IEEE division with FTZ permitted
2 | NVVM_FAST_MATH_DIVIDE_FULL_RANGE_APPROX | Full-range approximation
3 | NVVM_FAST_MATH_DIVIDE_FAST_APPROX | Fast approximation (least precise)

These correspond to the nvcc flags -prec-div=1 (precise) and -prec-div=0 (fast), with FTZ interaction determined by -ftz.

Complete FastMath XML Field Inventory

The full set of XML field names parsed by NvvmOptions_parse_fast_math (0xCCF590, 12,771 bytes):

XML Field Name | Binary Tag | Type | Description
-------------- | ---------- | ---- | -----------
IgnoreInf | 8 | bit | Treat infinities as NaN
IgnoreNaN | 9 | bit | Assume no NaN values present
IgnoreSignedZero | 10 | bit | Ignore sign of zero
ReorderFloat | 11 | bit | Allow float reordering
ReorderHalf | 12 | bit | Allow half-precision reordering
Ftz | 13 | bit | Flush denormals to zero
FastSqrt | 14 | bit | Use fast sqrt approximation
Fmad | 15 | bit | Allow fused multiply-add
AllowRcpRsqToSqrt | 16 | bit | Allow rcp(rsqrt(x)) to sqrt(x)
CanReorderFloatDistribute | 17 | bit | Allow distributive reordering
ReassociateFloatAddOverMad | 26 | bit | Float add reassociation over MAD
NoFloatMAD | 31 | bit | Disable float MAD formation
LaxFP16ApproximateDivision | 33 | bit | Lax FP16 approximate division
Divide | -- | enum | Division precision sub-enum (above)

The Divide field is serialized as a nested enum element in XML; in the binary format it is encoded as part of the fast-math reserved int32 at Options +204 (tag 18).

Memory Window Configuration

Memory windows define how the compiler maps address spaces to hardware memory banks. Three window types are serialized as blobs via tags 201--203, parsed by NvvmOptions_parse_cbank_config (0xCCE4B0) and NvvmOptions_parse_memory_windows (0xCCE100).

MemoryWindow Type Enum

Value | Name | Meaning
----- | ---- | -------
0 | NVVM_MEMORY_WINDOW_SPECIAL_REGISTER | Accessed via special registers
1 | NVVM_MEMORY_WINDOW_CBANK | Constant bank window
2 | NVVM_MEMORY_WINDOW_IMMEDIATE | Immediate offset addressing

Window Entry Layout (8 bytes)

struct MemoryWindowEntry {
    uint32_t window_type;   /* MemoryWindow type enum      */
    uint32_t cbank;         /* Constant bank index         */
    /* The following are part of the containing blob: */
    /* uint32_t cbank_ofst_low;  -- lower bound of offset range */
    /* uint32_t cbank_ofst_hi;   -- upper bound of offset range */
};
  • Tag 201 (MemoryWindowCBank): 24 bytes = 3 entries of {window_type, cbank, low, hi} truncated to fit, or 3 x 8 bytes depending on sub-field packing.
  • Tag 202 (MemoryWindowLocal): 24 bytes, same structure.
  • Tag 203 (MemoryWindowShared): 40 bytes = 10 x uint32_t values encoding shared memory bank strides, offsets, and configuration flags.
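Under the assumption of little-endian packing, decoding one window entry plus the two offset-range words that follow it in the containing blob looks like this. The layout beyond the 8-byte struct is the "sub-field packing" caveat above, so this is a sketch, not a definitive parser:

```python
# Minimal sketch: decode a MemoryWindowEntry plus the two offset-range
# words from the containing blob, per the struct above. Assumes
# little-endian uint32 packing throughout.
import struct

WINDOW_TYPES = {0: "SPECIAL_REGISTER", 1: "CBANK", 2: "IMMEDIATE"}

def parse_window_entry(blob: bytes, offset: int = 0) -> dict:
    wtype, cbank, lo, hi = struct.unpack_from("<4I", blob, offset)
    return {"type": WINDOW_TYPES.get(wtype, wtype), "cbank": cbank,
            "ofst_low": lo, "ofst_hi": hi}

# Example input (values invented for illustration): a CBANK window in
# bank 0 covering offsets [0x160, 0x8000).
blob = struct.pack("<4I", 1, 0, 0x160, 0x8000)
entry = parse_window_entry(blob)
assert entry["type"] == "CBANK" and entry["ofst_hi"] == 0x8000
```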

Version Compatibility Logic

Version checking is the first operation performed on a container buffer, implemented in NvvmContainer_check_versions (0xCD41B0). The logic is conservative on major versions and lenient on minor versions:

1. Verify magic == 0x7F4E5C7D
   Fail: return NULL (not a container)

2. Version.Major must == 1
   Fail: "NvvmContainer major version N not compatible" → return NULL

3. Version.Minor compared to 0x41 (65)
   If container minor > tool minor:
     Warning: "Linked container's NvvmContainer minor version N newer than tool"
   Parse continues regardless.

4. NvvmIRVersion.Major must == 2
   Fail: "NvvmIR major version N not compatible" → return NULL

5. NvvmIRVersion.Minor compared to 0x62 (98)
   If container minor > tool minor: warning, parse continues.

6. NvvmDebugVersion.Major must == 3
   Fail: "NvvmDebug major version N not compatible" → return NULL

7. NvvmDebugVersion.Minor compared to 2
   If container minor > tool minor: warning, parse continues.

8. LlvmVersion (major*100 + minor) must be <= 2000
   Fail: "LLVM version N not compatible" → return NULL

A separate standalone validator (0xCCD5F0) adds a mode-dependent check: in binary dump mode (a5=0), the LLVM version must be exactly 20; in normal mode (a5=1), it must be <= 20.

The philosophy is clear: major version bumps signal breaking format changes and are hard failures. Minor version bumps add new tags but never change existing tag semantics -- the delta encoding and unknown-tag-skipping design ensures forward compatibility.
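The eight steps above amount to a simple policy function. The sketch below reproduces that behavior with the cicc v13.0 tool-side constants; it is a behavioral reconstruction, not the decompiled NvvmContainer_check_versions:

```python
# Reconstruction of the version-check policy: hard-fail on magic and
# major-version mismatches, warn-and-continue on newer minor versions,
# and cap the LLVM version at major*100 + minor <= 2000.
MAGIC = 0x7F4E5C7D
TOOL = {"container": (1, 0x41), "ir": (2, 0x62), "debug": (3, 2)}
MAX_LLVM = 2000

def check_versions(magic, container, ir, debug, llvm):
    warnings = []
    if magic != MAGIC:
        return None, warnings                 # not a container
    for name, pair in (("container", container), ("ir", ir), ("debug", debug)):
        tool_major, tool_minor = TOOL[name]
        if pair[0] != tool_major:
            return None, warnings             # breaking format change: hard fail
        if pair[1] > tool_minor:
            warnings.append(f"{name} minor version {pair[1]} newer than tool")
    if llvm[0] * 100 + llvm[1] > MAX_LLVM:
        return None, warnings
    return "ok", warnings

ok, w = check_versions(MAGIC, (1, 70), (2, 0x62), (3, 2), (20, 0))
assert ok == "ok" and len(w) == 1             # newer container minor: warn only
bad, _ = check_versions(MAGIC, (2, 0), (2, 0x62), (3, 2), (20, 0))
assert bad is None                            # major mismatch: hard failure
```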

Current Version Constants (cicc v13.0)

Field | Major | Minor
----- | ----- | -----
Version (container format) | 1 | 0x41 (65)
NvvmIRVersion | 2 | 0x62 (98)
NvvmDebugVersion | 3 | 2
LlvmVersion | 20 | 0

XML Serialization Format

The XML path (NvvmContainer_parse_header at 0xCDCA30) uses NVIDIA's YAML-based serialization framework with virtual dispatch. The top-level XML document contains these elements:

<NvvmContainer>
  <Version major="1" minor="65"/>
  <NvvmIRVersion major="2" minor="98"/>
  <NvvmDebugVersion major="3" minor="2"/>
  <LlvmVersion major="20" minor="0"/>
  <IRLevel>NVVM_IR_LEVEL_UNIFIED_AFTER_DCI</IRLevel>
  <Options>
    <ArchVariant>NVVM_ARCH_ADA_8_9</ArchVariant>
    <CompileMode>NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI</CompileMode>
    <OptLevel>NVVM_OPT_LEVEL_2</OptLevel>
    <DebugInfo>NVVM_DEBUG_INFO_NONE</DebugInfo>
    <FastMathOptions>
      <Ftz>1</Ftz>
      <Fmad>1</Fmad>
      <Divide>NVVM_FAST_MATH_DIVIDE_FAST_APPROX</Divide>
      ...
    </FastMathOptions>
    <MaxRRegsAllowed>255</MaxRRegsAllowed>
    ...
  </Options>
  <IsBinary>1</IsBinary>
  <Module>... base64-encoded LLVM bitcode ...</Module>
</NvvmContainer>

All enum values are serialized by their full string names (e.g., NVVM_COMPILE_MODE_SEPARATE_ABI), not by numeric value. The XML format does not use delta encoding -- every field is written regardless of whether it matches the default, making XML containers significantly larger but human-readable.

Serialization Flow

The serializer (0xCDD2D0) has two modes controlled by parameter a3: binary (a3=1) and XML (a3=0).

Binary Serialization (a3=1)

 1. Compute version fields (use defaults if not set):
      Version        = {1, 0x41}
      NvvmIRVersion  = {2, 0x62}
      NvvmDebugVersion = {3, 2}
      LlvmVersion    = {20, 0}

 2. Allocate 248-byte NvvmContainerHeader (zeroed)
 3. Allocate 440-byte default Options struct
 4. Allocate two growable arrays:
      scalar_tags[]   -- int32 entries for tag/value pairs
      blob_data[]     -- byte entries for blob payloads

 5. For each field in current Options vs. default Options:
      If field differs:
        Scalar → sub_CD17A0(scalar_tags, tag_id, value)
        Blob   → sub_CD1AB0(blob_data, scalar_tags, tag_id, ptr, size)

 6. Optional IR compression:
      If a4 flag set:
        Compress LLVM bitcode via sub_C8D290
        Compute CRC via sub_CCD2B0 → store as tag 99
        Compress via sub_1688730(codec, data, size)

 7. Append terminator: tag 0 to scalar_tags
 8. Write 24-byte header (with computed ScalarFieldsEnd, BlobDataEnd)
 9. Write scalar_tags array
10. Write blob_data array
11. Write compressed or raw IR payload

Deserialization (0xCD1D80)

 1. Verify magic == 0x7F4E5C7D
 2. Allocate 248-byte NvvmContainerHeader
 3. Allocate 440-byte Options struct with defaults
 4. Store Options pointer at container header offset 208
 5. Compute tag_ptr = buffer + header_size  (from offset 0x0E)
 6. Compute blob_base = buffer + scalar_fields_end  (from offset 0x10)
 7. Enter switch loop:
      Read tag (int16), decode value (int16 or sentinel + int32)
      Switch on tag (103 unique case labels):
        Tags 1-39:    → write scalar to Options field
        Tag 99:       → store compression algo ID
        Tags 101-173: → write to extended target options
        Tags 201-218: → resolve blob offset, copy blob data
        Tags 301-309: → write to extended int32 fields
        Tags 351-353: → copy 8-byte blob to extended fields
        Tags 401-402: → conditionally parse structured blob
      Tag 0 → exit loop
 8. If tag 99 present: decompress IR payload
 9. Return container pointer

Annotated Hex Dump

A minimal container targeting SM 89 (Ada Lovelace) with default options (only SmMajor and SmMinor differ from defaults):

Offset  Hex                                        Decoded
------  -----------------------------------------  ---------------------------------
0x0000  7D 5C 4E 7F                                Magic: 0x7F4E5C7D
0x0004  01 41                                      Version: 1.65
0x0006  02 62                                      NvvmIRVersion: 2.98
0x0008  03 02                                      NvvmDebugVersion: 3.2
0x000A  14 00                                      LlvmVersion: 20.0
0x000C  00 00                                      IRLevel: 0 (UNIFIED_AFTER_DCI)
0x000E  18 00                                      HeaderSize: 24
0x0010  2C 00 00 00                                ScalarFieldsEnd: 44
0x0014  2C 00 00 00                                BlobDataEnd: 44 (no blobs)

--- Scalar tag/value region ---
0x0018  01 00 08 00                                Tag 1 (SmMajor) = 8
0x001C  02 00 09 00                                Tag 2 (SmMinor) = 9
0x0020  0D 00 01 00                                Tag 13 (Ftz) = 1
0x0024  0F 00 01 00                                Tag 15 (Fmad) = 1
0x0028  00 00                                      Terminator (tag 0)
0x002A  00 00                                      Padding to alignment

--- Blob data region ---
(empty -- ScalarFieldsEnd == BlobDataEnd)

--- IR payload follows at offset 0x002C ---
0x002C  DE C0 17 0B ...                            LLVM bitcode (0xDEC0170B magic)

This example shows the efficiency of delta encoding: only 4 tag/value pairs (16 bytes of tags) plus the 24-byte header produce a fully-specified container. All other fields (CompileMode, OptLevel, DebugInfo, all target options) inherit their defaults during deserialization.

A container with a 32-bit value would look like:

0x00XX  13 00 FF FF  00 04 00 00                   Tag 19 (MaxRRegsAllowed) = 1024
                                                   (0xFFFF sentinel, then 0x0400 LE)
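A decoder for this wire format, written from the documented encoding rather than lifted from the binary, reproduces the annotated dump exactly:

```python
# Sketch: decode a scalar tag/value region. Each record is an int16 tag,
# then either an inline int16 value or the 0xFFFF sentinel followed by a
# little-endian int32. Tag 0 terminates the region.
import struct

def decode_scalar_region(buf: bytes) -> dict:
    tags, pos = {}, 0
    while True:
        (tag,) = struct.unpack_from("<H", buf, pos); pos += 2
        if tag == 0:
            return tags                      # terminator
        (val,) = struct.unpack_from("<H", buf, pos); pos += 2
        if val == 0xFFFF:                    # sentinel: 32-bit value follows
            (val,) = struct.unpack_from("<I", buf, pos); pos += 4
        tags[tag] = val

# Scalar region from the SM 89 example dump above:
region = bytes.fromhex("01000800" "02000900" "0D000100" "0F000100" "0000")
assert decode_scalar_region(region) == {1: 8, 2: 9, 13: 1, 15: 1}
# 32-bit sentinel form: Tag 19 (MaxRRegsAllowed) = 1024
assert decode_scalar_region(bytes.fromhex("1300FFFF000400000000")) == {19: 1024}
```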

Pipeline Integration

The container serves as the inter-stage transport format within the cicc compilation pipeline. Two entry paths exist:

Path | Entry Function | Address | Pipeline
---- | -------------- | ------- | --------
Path A (LibNVVM) | nvvmCompileProgram dispatcher | 0x9047E0 | 3-phase: LNK -> OPT -> LLC
Path B (standalone) | cicc_main orchestrator | 0x12642A0 | 4-stage: LNK -> OPT -> OPTIXIR -> LLC

Both paths deserialize the container at phase 1, then translate Options into per-stage compiler flags:

  • SmMajor / SmMinor from tags 1--2 become -mcpu=sm_XX
  • FastMath.Ftz from tag 13 becomes -nvptx-f32ftz
  • FastMath.Fmad from tag 15 becomes the IEEE mode flag
  • OptLevel becomes -nvptx-opt-level=N
  • CompileMode == 2 (SEPARATE_ABI) adds --device-c
  • IRLevel == 1 (LTO) enters the LTO pipeline with partially-optimized bitcode
  • IRLevel == 2 (OPTIX) activates the OptiX IR stage (bit 6 of pipeline bitmask) and disables LICM and IP-MSP

The container format is the single source of truth for all compilation parameters. When cicc is invoked by nvcc, the driver serializes its accumulated flags into a container, passes the container as input, and cicc deserializes it back into compiler options. This round-trip through binary serialization ensures that all pipeline stages see exactly the same configuration, eliminating the flag-parsing divergence that would otherwise arise from each stage having its own CLI parser.
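The tag-to-flag translation in the bullets above can be sketched as a small mapping function. The flag strings are the ones listed; the function itself and its named-option input are illustrative, not the actual per-stage translation code:

```python
# Sketch: translate decoded container options into per-stage compiler
# flags, following the mapping listed above (SM version -> -mcpu, Ftz ->
# -nvptx-f32ftz, OptLevel -> -nvptx-opt-level, SEPARATE_ABI -> --device-c).
def options_to_flags(opts: dict) -> list:
    flags = [f"-mcpu=sm_{opts['SmMajor']}{opts['SmMinor']}"]
    if opts.get("Ftz"):
        flags.append("-nvptx-f32ftz")
    if "OptLevel" in opts:
        flags.append(f"-nvptx-opt-level={opts['OptLevel']}")
    if opts.get("CompileMode") == 2:          # NVVM_COMPILE_MODE_SEPARATE_ABI
        flags.append("--device-c")
    return flags

assert options_to_flags({"SmMajor": 8, "SmMinor": 9, "Ftz": 1,
                         "OptLevel": 3, "CompileMode": 2}) == \
    ["-mcpu=sm_89", "-nvptx-f32ftz", "-nvptx-opt-level=3", "--device-c"]
```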

YAML Serialization Framework

The XML/YAML path uses a generic serialization framework built on a bundled YAML parser/emitter library (Cluster A: 0xCB0000--0xCBFA60). The library provides:

Function | Address | Role
-------- | ------- | ----
yaml_parser_main | 0xCB9640 | Top-level YAML parser (25,873 bytes)
yaml_emitter_main_loop | 0xCBDA10 | Main YAML emitter loop (23,583 bytes)
yaml_scanner_scan_tokens | 0xCB7E40 | Token scanner (17,924 bytes)
yaml_parser_parse_flow | 0xCB8C00 | Flow-style parsing (15,188 bytes)
yaml_parser_load_document | 0xCBA570 | Document loader/resolver (9,695 bytes)

The serialization framework uses virtual dispatch: each serializable type registers a serialize/deserialize function pair, and the framework dispatches based on the YAML node type (scalar=1, sequence, mapping). All enum values are serialized by their full string names (NVVM_COMPILE_MODE_SEPARATE_ABI, NVVM_ARCH_ADA_8_9, etc.), not by numeric value.

Finalizer Knobs Integration

The container Options struct also feeds into the NVIDIA finalizer knobs system through NvvmOptions_parse_finalizer_knobs (0xCD9990, 31,702 bytes -- the 7th largest function in the binary). This parser ingests the complete set of NVIDIA-specific backend configuration knobs:

  • Shader pipeline controls: PromoteHalf, PromoteFixed, USePIXBAR, VSIsVREnabled, VSIsLastVTGStage
  • Codegen controls: DisablePredication, DisableXBlockSched, EnableJumpTable, ScheduleKils
  • Memory controls: DoMMACoalescing, AssumeConvertMemoryToRegProfitable
  • Barrier controls: DisableERRBARAfterMEMBAR, GenConvBranchForWarpSync
  • PGO controls: PGOEpoch, PGOBatchSize, PGOCounterMemBaseVAIndex
  • Per-CTA controls: CTASizeX, CTASizeY, CTASizeZ, SharedMemorySize, SMemScratchBase
  • Register controls: MaxActiveWarpsPerSM, NumReservedUReg, NumScratchURegs

These knobs are distinct from the NVVMPassOptions system (see NVVMPassOptions) -- the finalizer knobs configure the backend code generator, while NVVMPassOptions configure the optimization pipeline.

Tag Summary Statistics

Range | Count | Description
----- | ----- | -----------
1--39 | 38 | Core scalar options (SM version, fast-math, unroll, flags)
99 | 1 | Compression metadata
101--173 | 73 | Extended target options (hardware capabilities, memory config)
201--218 | 18 | Blob data (memory windows, resource tables, strings)
301--309 | 9 | Extended int32 fields (cluster config, extended options)
351--353 | 3 | Extended int64 blob references
401--402 | 2 | Structured conditional blobs (TMA / TCGen05)
Total | 144 | Distinct tag IDs across 6 ranges

The deserializer switch statement has 103 unique case labels -- the remaining 41 tags share code paths with other tags (e.g., all single-bit tags in a byte share a case that reads the bit position from a secondary table).

Cross-References

NVPTX Target Infrastructure

The NVPTXTargetMachine, NVPTXSubtarget, and NVPTXTargetTransformInfo form the target description layer that the entire LLVM backend consults for every decision, from type legality through instruction cost to vectorization factor selection. In upstream LLVM these are three separate source files totaling roughly 1,500 lines. In cicc v13.0 they are spread across the 0xDF0000-0xE00000 address range (TTI hooks), the 0x330-0x35B range (NVPTXTargetLowering), the type legalization tables embedded in NVPTXSubtarget, and the pipeline assembler at 0x12EA000-0x12F0000 (TargetMachine construction). The NVIDIA delta relative to upstream is moderate: the TTI hooks return GPU-specific constants rather than CPU ones, the SubtargetFeatures carry NVIDIA-proprietary math precision flags, and the TargetMachine creation path has a dual-path design that handles both the cicc standalone pipeline and the LibNVVM API pipeline.

Key Facts

Property | Value
-------- | -----
SM processor table | qword_502A920 (45 entries, stride-2, ctor_605 at 0x584510)
Target lookup | sub_12EA530 (4KB, calls sub_16D3AC0 = TargetRegistry::lookupTarget)
TargetMachine creation | sub_12F4060 (16KB, NVIDIA options) / sub_12E54A0 (50KB, pipeline path)
TTI wrapper pass | sub_1BFB520 (208-byte alloc, wraps sub_1BFB9A0)
Register bit width (Vector) | sub_DFE640 -- returns 32 (fixed)
Scalable vectors | sub_DFE610 -- returns false
Max interleave factor | sub_DFB120 (at TTI+448), sub_DFB730 (vectorized variant)
SubtargetFeatures | Offsets +2498, +2584, +2843, +2870, +2871
Target triples | nvptx64-nvidia-cuda, nvptx-nvidia-cuda, nvsass-nvidia-* (6 total)

NVPTXTargetMachine

Dual-Path Target Initialization

cicc constructs the TargetMachine through two independent code paths depending on whether compilation enters through the standalone cicc CLI or through the LibNVVM API. Both converge on TargetRegistry::lookupTarget (sub_16D3AC0) but assemble the target triple, feature string, and TargetOptions differently.

Path 1 -- cicc standalone (sub_12F7D90 -> sub_12F4060):

sub_12F7D90 — CLI parser:
    parse "-arch=compute_XX" → SM version (multiplied by 10)
    parse "-opt=N"           → optimization level
    parse "-ftz=N"           → flush-to-zero mode
    parse "-fma=N"           → FMA contraction level
    parse "-prec-div=N"      → float division precision
    parse "-prec-sqrt=N"     → sqrt precision
    parse "--device-c"       → device compilation flag

sub_12F4060 — TargetMachine creation (16KB):
    triple = (pointerWidth == 64) ? "nvptx64" : "nvptx"
    features = ""
    if (sharedmem32bit):
        features += "+sharedmem32bitptr"
    features += ",+fma-level=N,+prec-divf32=N,+prec-sqrtf32=N"

    opts = TargetOptions {
        flags: 0,
        reloc: PIC (1),
        codeModel: 8,
        optLevel: from_cli,
        threadModel: 1
    }

    TM = TargetRegistry::lookupTarget(triple, cpu_string)
    if (!TM):
        error "Error: Cannot specify multiple -llcO#\n"
    return TM->createTargetMachine(triple, cpu, features, opts)

Path 2 -- pipeline assembler (sub_12E54A0):

The master pipeline assembly function (50KB, called from both Phase I and Phase II) constructs the target independently:

sub_12E54A0:
    ptrSize = Module::getDataLayout().getPointerSize(0)   // pointer size in bytes
    if (8 * ptrSize == 64):
        triple = "nvptx64"                          // 7 chars
    else:
        triple = "nvptx"                            // 5 chars

    target = sub_16D3AC0(&triple, &cpu_string)      // TargetRegistry::lookupTarget
    if (!target):
        error "Failed to locate nvptx target\n"     // sub_1C3EFD0

    // TargetOptions setup:
    opts[0] = 0                                     // no flags
    opts[1] = 1                                     // PIC relocation
    opts[2] = 8                                     // code model
    opts[3] = 1                                     // opt level indicator
    opts[4] = 1                                     // thread model
    opts[5] = 0                                     // reserved

    sub_167F890(subtargetInfo)                       // initialize SubtargetInfo
    TLI = sub_14A04B0(targetLibInfo, moduleName)     // TargetLibraryInfo
    sub_149CBC0(TLI)                                 // finalize TLI
    TTI = sub_1BFB9A0(DataLayout, a2, a3, v269)     // TargetTransformInfo

    optLevel = read qword_4FBB430                    // cl::opt<int> value
    PassManagerBuilder = sub_1611EE0(PM)

The pipeline assembler path also checks for an extension hook: if the target has a createExtendedTargetMachine vtable entry at offset +88, it calls that instead, enabling custom target backends. The returned TargetMachine pointer feeds into the 150+ pass registrations that follow.

TargetOptions

The TargetOptions struct passed to both paths uses LLVM's standard layout. The key NVIDIA-specific values:

Field | Value | Meaning
----- | ----- | -------
Relocation model | 1 (PIC) | Position-independent code, always
Code model | 8 | Large code model (matches PTX's flat addressing)
Thread model | 1 | POSIX-style threading assumed
Optimization level | From CLI | Stored in qword_4FBB430, default from qword_4FBB430[2]

NVIDIA-Specific Target Features

The feature string passed to createTargetMachine encodes math precision and shared memory configuration as subtarget features. These are not upstream LLVM features -- they are NVIDIA extensions:

Feature | CLI Source | Subtarget Effect
------- | ---------- | ----------------
+sharedmem32bitptr | nvptx-short-ptr / nvptx-32-bit-smem | Enables 32-bit pointers for address space 3 (shared memory); adds p3:32:32:32 to data layout
+fma-level=N | -fma=N | 0=off, 1=on, 2=aggressive FMA contraction
+prec-divf32=N | -prec-div=N | 0=approx, 1=full, 2=IEEE+ftz, 3=IEEE compliant
+prec-sqrtf32=N | -prec-sqrt=N | 0=approx (rsqrt.approx), 1=rn (sqrt.rn)

Registered in ctor_607 (0x584B60, 14KB):

Knob | Type | Default | Description
---- | ---- | ------- | -----------
nvptx-sched4reg | bool | -- | Schedule for register pressure
nvptx-fma-level | int | -- | FMA contraction level
nvptx-prec-divf32 | int | -- | F32 division precision
nvptx-prec-sqrtf32 | int | -- | Sqrt precision
nvptx-approx-log2f32 | bool | -- | Use lg2.approx for log2
nvptx-force-min-byval-param-align | bool | -- | Force 4-byte byval alignment
nvptx-normalize-select | bool | -- | Override shouldNormalizeToSelectSequence
enable-bfi64 | bool | -- | Enable 64-bit BFI instructions

NVPTXSubtarget Feature Flags

The NVPTXSubtarget object carries the type legalization tables and architecture-specific feature flags that the SelectionDAG, register allocator, and type legalizer consult at every step. These are populated during target construction and indexed by the SM processor table.

Feature Flag Offsets

Offset | Size | Purpose | Stride
------ | ---- | ------- | ------
+120 | ptr | Register class array (8-byte stride entries) | --
+2498 | 259 | Type legality flags (indexed per MVT) | 259 bytes per type action
+2584 | 259 | Float legality flags (indexed per MVT) | 259 bytes per type action
+2843 | 1 | Integer type support flag | --
+2870 | 1 | Branch distance flag | --
+2871 | 1 | Jump table eligibility flag | --

The type legality arrays at +2498 and +2584 are the backbone of SelectionDAG's getTypeAction() and isTypeLegal() queries. Each entry covers one MVT (Machine Value Type) and stores the action: Legal, Promote, Expand, Scalarize, or SplitVector. For NVPTX, i32 and f32 are always Legal; i64 and f64 are Legal on all supported SM versions but with expanded arithmetic costs; vectors wider than 128 bits are always Split or Scalarized.

The function sub_201BB90 reads these offsets during type legalization to determine expansion strategy. The branch distance flags at +2870/+2871 control sub_20650A0, which decides jump table eligibility beyond the standard no-jump-tables flag.

Initialization Flow

The SubtargetFeatures initialization follows this path:

  1. ctor_605 (0x584510, 2.6KB) populates qword_502A920 with the 45-entry SM processor table at static init time.
  2. sub_167F890 initializes the SubtargetInfo during pipeline setup.
  3. sub_982C80 initializes the 224-byte NVPTX feature flag table based on SM version and OS/ABI info.
  4. sub_97DEE0 performs initial population of the feature bitfield.
  5. sub_982B20 applies SM-version-specific refinements from the global table at qword_4F7FCC8.

The 224-byte feature table (sub_982C80) initializes bytes 0-127 to all-1s (0xFF), then selectively clears bits based on the target configuration. This "default-enabled, selectively-disabled" pattern means that features are assumed present unless explicitly turned off for a given target.

NVPTXTargetTransformInfo Hook Table

The TTI is the interface through which all LLVM optimization passes query target-specific costs and capabilities. For NVPTX, every hook returns a value calibrated for a scalar-register GPU architecture rather than a SIMD-register CPU.

TTI Hook | Address | Return Value | Upstream Equivalent
-------- | ------- | ------------ | -------------------
getRegisterBitWidth(Vector) | sub_DFE640 | TypeSize::getFixed(32) | AVX2 returns 256, AVX-512 returns 512
supportsScalableVectors() | sub_DFE610 | false | AArch64 SVE returns true
getMaxInterleaveFactor() | sub_DFB120 | Register-pressure-bounded | CPU returns 2-4 based on uarch
getMaxInterleaveFactor(vectorized) | sub_DFB730 | Separate limit for vectorized loops | --
getRegisterBitWidth(Scalar) | sub_DFB1B0 | 32 | Matches PTX 32-bit register file
getInstructionCost() | sub_20E14F0 (32KB) | Per-opcode latency from sched model | --
hasAttribute(30) | sub_B2D610 | Checks noimplicitfloat | Standard LLVM
hasAttribute(47) | sub_B2D610 | Checks alwaysvectorize | Standard LLVM
hasAttribute(18) | sub_B2D610 | Checks optnone | Standard LLVM

Impact on Loop Vectorization

The 32-bit register width return from sub_DFE640 is the single most consequential TTI hook for GPU compilation. The standard LLVM VF formula is:

VF = registerBitWidth / elementBitWidth

With registerBitWidth = 32:

  • float (32-bit): VF = 1 -- no vectorization from the register-width formula alone
  • half (16-bit): VF = 2
  • i8 (8-bit): VF = 4

This means that profitable vectorization of 32-bit types (the dominant case in CUDA) must come entirely from the cost model determining that ld.v2.f32 or ld.v4.f32 is cheaper than multiple scalar loads, not from the register-width heuristic. The LoopVectorize pass (sub_2AF1970) has an explicit override: when the VF formula produces VF <= 1 and the byte_500D208 knob is set, it forces VF = 4 for outer loops.
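The VF selection described above reduces to a few lines. The override knob is modeled here as a plain boolean (byte_500D208 in the binary); this is a sketch of the formula, not the LoopVectorize code itself:

```python
# The register-width VF formula, with registerBitWidth fixed at 32 as
# returned by sub_DFE640, plus the outer-loop VF=4 override described above.
def select_vf(element_bits: int, force_outer_loop_vf4: bool = False) -> int:
    vf = max(32 // element_bits, 1)       # registerBitWidth / elementBitWidth
    if vf <= 1 and force_outer_loop_vf4:
        return 4                          # LoopVectorize outer-loop override
    return vf

assert select_vf(32) == 1   # float: no vectorization from the formula alone
assert select_vf(16) == 2   # half
assert select_vf(8) == 4    # i8
assert select_vf(32, force_outer_loop_vf4=True) == 4
```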

Impact on SLP Vectorization

The SLP vectorizer (sub_2BD1C50) receives the target vector register width as parameter a3 and uses it to determine maximum bundle width. With 32 bits, SLP bundles are limited to:

  • 2x i16 (32 bits total)
  • 4x i8 (32 bits total)
  • 1x i32 or f32 (degenerate -- no SLP benefit)

In practice, the SLP vectorizer's profitability model can override this limit when paired loads/stores demonstrate memory coalescing benefit, but the register width serves as the initial upper bound.

Impact on Interleave Count

The getMaxInterleaveFactor hook (sub_DFB120, queried at TTI+448) caps the interleave count (IC) for loop unroll-and-jam. The interleave selection algorithm in sub_2AED330 reads this value and combines it with scheduling info at TTI+56:

maxIC    = TTI.getMaxInterleaveFactor(VF)
issueWidth = *(TTI + 56 + 32)              // scheduling model: issue width
latency    = *(TTI + 56 + 36)              // scheduling model: latency
IC         = IC / max(issueWidth, latency)  // cap by pipeline throughput

This models the SM's instruction issue pipeline: even if register pressure allows IC=8, the warp scheduler may saturate at lower IC values, making additional interleaving waste register budget without throughput gain.
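The cap from the pseudocode above is plain integer arithmetic. The values in the assertions are hypothetical scheduling-model numbers chosen only to exercise the formula:

```python
# The interleave-count cap from the pseudocode above: divide the candidate
# IC by the larger of issue width and latency, clamped to at least 1.
def cap_interleave(ic: int, issue_width: int, latency: int) -> int:
    return max(ic // max(issue_width, latency), 1)

assert cap_interleave(8, 2, 4) == 2   # latency-bound: 8 / 4
assert cap_interleave(8, 1, 1) == 8   # no throughput cap
```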

Arithmetic Cost for i64

NVPTX GPUs have 32-bit ALUs. All 64-bit integer arithmetic is emulated through pairs of 32-bit operations with carry propagation. The TTI getArithmeticInstrCost hook reflects this by returning approximately 2x the base cost for i64 operations:

Operation | i32 Cost | i64 Cost | Ratio
--------- | -------- | -------- | -----
ADD/SUB | 1 | 2 | 2x (add.cc + addc)
MUL | 1 | ~4 | 4x (mul.lo + mul.hi + add chain)
DIV/REM | high | very high | Library call on both
Shift | 1 | 2-3 | funnel shift pair

This cost differential causes LLVM optimization passes (InstCombine, SCEV-based transformations, IV widening) to prefer i32 operations, which NVIDIA's custom IV Demotion pass (sub_18B1DE0) further exploits by narrowing 64-bit induction variables to 32-bit where the trip count permits.

SM Processor Table

The processor table at qword_502A920 is a flat array of 90 entries (45 SM variants x 2 fields per entry) with stride-2 layout: even indices hold the SM name string pointer, odd indices hold the PTX version code.

Populated by ctor_605 at 0x584510 (2.6KB), called during static initialization before main. The table is read-only after construction.

qword_502A920[2*i + 0] = const char* sm_name    // e.g., "sm_100"
qword_502A920[2*i + 1] = uint64_t   ptx_version // 5, 6, or 7
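The stride-2 layout can be modeled with a flat Python list. The entries below are a small subset chosen for illustration (PTX version codes per the table in the next section); the lookup mirrors how an even/odd-slot walk over qword_502A920 would work:

```python
# Sketch of the stride-2 processor table: even slots hold the SM name,
# odd slots hold the PTX version code.
TABLE = ["sm_75", 5, "sm_90", 5, "sm_90a", 6, "sm_100", 6, "sm_100a", 7]

def ptx_version(sm_name: str):
    for i in range(0, len(TABLE), 2):     # stride-2 walk over name slots
        if TABLE[i] == sm_name:
            return TABLE[i + 1]           # paired version-code slot
    return None

assert ptx_version("sm_90a") == 6
assert ptx_version("sm_100a") == 7
assert ptx_version("sm_42") is None
```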

PTX Version Codes

Code | Meaning | SM Range
---- | ------- | --------
5 | Legacy PTX | sm_20 through sm_90 (all base variants)
6 | Modern PTX | sm_90a, sm_100-sm_121 (base variants only)
7 | Extended PTX | sm_100a/f through sm_121a/f (accelerated/forward-compatible)

Notable observations:

  • sm_90a is the only pre-Blackwell SM with PTX version 6.
  • The f (forward-compatible) suffix uses the same PTX version as a (accelerated).
  • No entries exist for sm_84, sm_85 (Ada Lovelace numbering gap).
  • sm_73 (Volta sub-variant) and sm_88 (Ada sub-variant) are present but not publicly documented.
  • The table contains 15 legacy architectures (sm_20 through sm_75) that are no longer accessible through the CLI mapping but remain in the backend's processor table.

Data Layout String

The NVPTX data layout string follows LLVM's standard format with three variants selected based on pointer width and shared memory pointer mode:

64-bit with shared memory specialization (most common)

e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64

64-bit without shared memory specialization

e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64

32-bit mode

e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64

Key fields

Field | Meaning | NVIDIA Note
----- | ------- | -----------
e | Little-endian | All NVIDIA GPUs
p:64:64:64 | Generic pointers: 64-bit, 64-bit aligned | Default for 64-bit compilation
p3:32:32:32 | Address space 3 (shared memory): 32-bit pointers | Controlled by nvptx-short-ptr / nvptx-32-bit-smem / unk_4D0461C
n16:32:64 | Native integer widths: 16, 32, 64 | Tells LLVM that i16/i32/i64 are all hardware-supported
v16:16:16 / v32:32:32 | Vector alignment: natural | 16-bit and 32-bit vectors aligned to their width

The p3:32:32:32 entry is the NVIDIA delta: shared memory lives in a 48KB-228KB on-chip SRAM per SM, addressable with 32-bit pointers even in 64-bit mode. Using 32-bit pointers for shared memory saves register pressure and instruction count for every shared memory access.
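A hypothetical selector for the three variants, keyed on the two conditions described above. The function and parameter names are illustrative, not recovered from the binary; `is64Bit` stands for the pointer-width check on unk_4F06A68 and `shortSharedPtr` for the nvptx-short-ptr / nvptx-32-bit-smem knobs:

```cpp
#include <cassert>
#include <string>

// Build the NVPTX data layout string for one of the three observed variants.
static std::string nvptxDataLayout(bool is64Bit, bool shortSharedPtr) {
    if (!is64Bit)  // 32-bit mode: no shared-memory specialization
        return "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-"
               "i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-"
               "v32:32:32-n16:32:64";
    std::string dl = "e-p:64:64:64";
    if (shortSharedPtr)
        dl += "-p3:32:32:32";  // shared memory gets 32-bit pointers
    dl += "-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-i128:128:128-"
          "f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64";
    return dl;
}
```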

A separate data layout string e-i64:64-v16:16-v32:32-n16:32:64 appears in the IR linker (sub_106AB30) as a compatibility check during module linking. This shortened form is used to validate that two modules being linked share the same NVPTX target data layout.

Data layout validation is performed at multiple points:

  • sub_2C74F70 in the NVVM verifier checks the layout string on every module
  • If empty: "Empty target data layout, must exist"
  • If invalid: prints "Example valid data layout:" with reference 32-bit and 64-bit strings from off_4C5D0A0 / off_4C5D0A8

Target Triple Construction

The target triple is constructed at module creation time by checking the pointer width:

if (unk_4F06A68 == 8)                    // 64-bit data model
    triple = "nvptx64-nvidia-cuda"       // 19 chars
else
    triple = "nvptx-nvidia-cuda"         // 17 chars

Eight triples are valid in UnifiedNVVMIR mode:

| Triple | Width | Runtime |
|--------|-------|---------|
| nvptx-nvidia-cuda | 32-bit | CUDA |
| nvptx64-nvidia-cuda | 64-bit | CUDA |
| nvptx-nvidia-nvcl | 32-bit | OpenCL |
| nvptx64-nvidia-nvcl | 64-bit | OpenCL |
| nvsass-nvidia-cuda | SASS | CUDA native assembly |
| nvsass-nvidia-nvcl | SASS | OpenCL native assembly |
| nvsass-nvidia-directx | SASS | DirectX backend |
| nvsass-nvidia-spirv | SASS | SPIR-V backend |

In non-UnifiedNVVMIR mode, validation is looser: the triple must start with nvptx- or nvptx64- and contain -cuda. The nvsass-nvidia-directx and nvsass-nvidia-spirv triples (discovered in sub_2C80C90) are notable evidence that NVIDIA's SASS-level backend supports DirectX and SPIR-V shader compilation alongside traditional CUDA/OpenCL.
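The loose check can be sketched in a few lines of standalone C++ (function name hypothetical; the real validation logic lives in the NVVM verifier):

```cpp
#include <cassert>
#include <string>

// Non-UnifiedNVVMIR validation: prefix must be nvptx- or nvptx64-,
// and the triple must contain "-cuda" somewhere.
static bool isLooseNVPTXTriple(const std::string &t) {
    bool prefixOK = t.rfind("nvptx-", 0) == 0 || t.rfind("nvptx64-", 0) == 0;
    return prefixOK && t.find("-cuda") != std::string::npos;
}
```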

Configuration Knobs

Backend Options (ctor_609_0, 0x585D30, 37KB)

| Knob | Type | Default | Description |
|------|------|---------|-------------|
| nvptx-short-ptr | bool | -- | 32-bit pointers for const/local/shared |
| nvptx-32-bit-smem | bool | -- | 32-bit shared memory pointers |
| nvptx-enable-machine-sink | bool | -- | Enable Machine Sinking |
| enable-new-nvvm-remat | bool | true | Enable new rematerialization |
| nv-disable-remat | bool | false | Disable all remat passes |
| nv-disable-mem2reg | bool | false | Disable MI Mem2Reg pass |
| nv-disable-scev-cgp | bool | false | Disable SCEV address mode opt |
| disable-nvptx-load-store-vectorizer | bool | false | Disable load/store vectorizer |
| disable-nvptx-require-structured-cfg | bool | false | Turn off structured CFG requirement |
| nvptx-exit-on-unreachable | bool | true | Lower unreachable as exit |
| nvptx-early-byval-copy | bool | -- | Copy byval args early |
| enable-nvvm-peephole | bool | true | Enable NVVM Peephole Optimizer |
| lower-func-args | bool | true | Lower large aggregate params |
| enable-sink | bool | true | Enable Sinking |
| disable-post-opt | bool | false | Disable LLVM IR opts post-opt |
| usedessa | int | 2 | Select deSSA method |
| ldg | bool | true | Load Global Constant Transform |
| print-isel-input | bool | false | Print LLVM IR input to isel |
| no-reg-target-nvptxremat | bool | false | Only old remat without reg targets |
| disable-set-array-alignment | bool | false | Disable alignment enhancements |
| nvptx-lower-global-ctor-dtor | bool | -- | Lower GPU ctor/dtors to globals |

Register Pressure & FCA Options (ctor_074, 0x49AAB0)

| Knob | Type | Default | Description |
|------|------|---------|-------------|
| fca-size | int | 8 | Max size of first-class aggregates (bytes) |
| reg-target-adjust | int | 0 (range -10..+10) | Register pressure target adjustment |
| pred-target-adjust | int | 0 (range -10..+10) | Predicate register target adjustment |
| remat-load-param | bool | -- | Support remating const ld.param not in NVVM IR |
| cta-reconfig-aware-rpa | bool | -- | CTA reconfiguration-aware register pressure analysis |

Extension Options (ctor_610, 0x5888A0)

| Knob | Type | Default | Description |
|------|------|---------|-------------|
| unroll-assumed-size | int | 4 | Assumed size for unknown local array types |
| enable-loop-peeling | bool | -- | Enable loop peeling |
| enable-256-bit-load-store | bool | -- | Enable 256-bit vector loads/stores |
| ias-param-always-point-to-global | bool | -- | Parameters always point to global memory |
| ias-strong-global-assumptions | bool | -- | Strong global memory assumptions |
| ias-wmma-memory-space-opt | bool | -- | Memory Space Optimization for WMMA |

TTI Cost Model Options (ctor_061, 0x494D20)

| Knob | Type | Default | Description |
|------|------|---------|-------------|
| costmodel-reduxcost | bool | -- | Recognize reduction patterns |
| cache-line-size | int | -- | Cache line size for cost model |
| min-page-size | int | -- | Minimum page size |
| predictable-branch-threshold | float | -- | Threshold for predictable branch cost |

Differences from Upstream LLVM

  1. Dual-path TargetMachine construction. Upstream LLVM has a single target creation path through LLVMTargetMachine::createPassConfig. NVIDIA has two independent paths (CLI and pipeline assembler) that converge at TargetRegistry::lookupTarget.

  2. NVIDIA-proprietary target features. The +sharedmem32bitptr, +fma-level=N, +prec-divf32=N, +prec-sqrtf32=N features do not exist in upstream NVPTX. Upstream NVPTX has +ptx75, +sm_90 style features. NVIDIA's math precision features are passed through the target feature string to avoid adding new cl::opt for each.

  3. 224-byte feature table. The sub_982C80 feature table with its "default all-1s then selectively clear" initialization pattern is unique to cicc. Upstream NVPTXSubtarget uses a much simpler feature set derived from +sm_XX and +ptx_YY features.

  4. Scheduling info at TTI+56. The issue-width and latency values stored in the TTI sub-structure at offset +56 are used by the interleave count selection algorithm. Upstream LLVM's NVPTX backend does not populate these scheduling parameters -- it relies on the default "no scheduling model" behavior.

  5. Extension hook at vtable+88. The pipeline assembler checks for a createExtendedTargetMachine entry, enabling loadable target backend extensions. This is not present in upstream LLVM.

Function Map

| Function | Address | Size | Role |
|----------|---------|------|------|
| NVPTX Target Lookup and Creation | sub_12EA530 | 4 KB | -- |
| TargetMachine Creation with NVIDIA Options | sub_12F4060 | 16 KB | -- |
| Master Pipeline Assembly (includes TM setup) | sub_12E54A0 | 50 KB | -- |
| CICC CLI Argument Parser | sub_12F7D90 | 14 KB | -- |
| TargetRegistry::lookupTarget() | sub_16D3AC0 | -- | -- |
| SubtargetInfo initialization | sub_167F890 | -- | -- |
| TTIWrapperPass allocation (208 bytes) | sub_1BFB520 | -- | -- |
| TargetTransformInfo / DataLayout creation | sub_1BFB9A0 | -- | -- |
| TargetLibraryInfo creation | sub_14A04B0 | -- | -- |
| TargetLibraryInfo finalization | sub_149CBC0 | -- | -- |
| TTI::getRegisterBitWidth(Vector) -- returns 32 | sub_DFE640 | -- | -- |
| TTI::supportsScalableVectors() -- returns false | sub_DFE610 | -- | -- |
| TTI::getMaxInterleaveFactor() (at TTI+448) | sub_DFB120 | -- | -- |
| TTI::getMaxInterleaveFactor(vectorized) | sub_DFB730 | -- | -- |
| TTI::getRegisterBitWidth(Scalar) or cache-line query | sub_DFB1B0 | -- | -- |
| TTI::getInstructionCost() / scheduling cost model | sub_20E14F0 | 33 KB | -- |
| TTI::hasAttribute(N) -- function attribute query | sub_B2D610 | -- | -- |
| TTI::getInstructionCost() (IR-level variant) | sub_B91420 | -- | -- |
| NVPTX feature flag table initializer (224 bytes) | sub_982C80 | -- | -- |
| Feature bitfield initial population | sub_97DEE0 | -- | -- |
| SM-version-specific feature refinements | sub_982B20 | -- | -- |
| SubtargetFeature reads at +2843, +2584, +2498 | sub_201BB90 | -- | -- |
| Branch distance / jump table checks at +2870, +2871 | sub_20650A0 | -- | -- |
| EDG SM architecture feature gating (38KB, ~60 flags) | sub_60E7C0 | -- | -- |
| Module initialization with triple and data layout | sub_908850 | -- | -- |
| SM processor table population (0x584510, 2.6KB) | ctor_605 | -- | -- |
| NVPTX backend math options (0x584B60, 14KB) | ctor_607 | -- | -- |
| NVPTX backend options (0x585D30, 37KB) | ctor_609_0 | -- | -- |

Cross-References

Alias Analysis & NVVM AA

cicc ships a custom alias analysis pass (NVVM AA, registered as nvptx-aa) that exploits GPU address space disjointness to prove pointer pairs cannot alias. On a GPU, each hardware memory partition -- global DRAM, shared scratchpad, local stack, constant cache, kernel parameter window -- occupies a physically separate address range. Pointers into different address spaces can never reference the same byte, a property that does not hold on any mainstream CPU ISA. NVVM AA encodes this hardware invariant into the LLVM AA pipeline, returning NoAlias for any cross-address-space pointer pair. This single fact unlocks aggressive dead-store elimination, load-store motion, GVN load forwarding, and MemorySSA precision that would be impossible on a flat-memory machine. The pass is stateless, trivially cheap, and runs first in the AA chain so that more expensive analyses (BasicAA, TBAA) can skip pairs that NVVM AA already resolved.

Beyond pure address-space disjointness, cicc augments the standard LLVM AA infrastructure in three further ways: (1) a process-restrict pass that propagates noalias attributes from __restrict__ kernel parameters, (2) !noalias.addrspace metadata (metadata kind 42) that tags pointers with the set of address spaces they provably do not alias with, and (3) NVIDIA-specific knobs controlling traversal depth, TBAA strictness, and fence relaxation.

Key Facts

| Property | Value |
|----------|-------|
| Pass name (legacy PM) | nvptx-aa |
| Pass name (new PM) | Registered via NVPTXTargetMachine::registerEarlyDefaultAliasAnalyses |
| Legacy wrapper | NVPTXAAWrapperPass (ImmutablePass, char ID) |
| External wrapper | NVPTXExternalAAWrapper (hooks into ExternalAAWrapperPass, RunEarly=true) |
| Result class | NVPTXAAResult : AAResultBase |
| State | Stateless -- invalidate() always returns false |
| AA chain position | First (before BasicAA) |
| Address traversal depth | Controlled by nvptx-traverse-address-aliasing-limit (default 6) |
| AA evaluator pass | aa-eval at sub_13549C0 (11,038 bytes) |
| AA query entry point | sub_134CB50 -- AAResults::alias(MemoryLocation, MemoryLocation) |
| ModRef query (call, loc) | sub_134F0E0 -- AAResults::getModRefInfo(CallBase, MemoryLocation) |
| ModRef query (call, call) | sub_134F530 -- AAResults::getModRefInfo(CallBase, CallBase) |

GPU Address Space Table

NVPTX defines six logically disjoint address spaces plus a generic (flat) umbrella. See Address Spaces for the complete master table with hardware mapping, pointer widths, latency numbers, and data layout strings.

The critical property exploited by NVVM AA: an (AS_x, AS_y) pair returns NoAlias whenever x != y, neither space is 0 (generic), and the pair is neither shared/shared-cluster (AS 3 vs AS 7) nor global/param (AS 1 vs AS 101); the latter exception exists because cvta.param on SM 70+ makes param memory addressable as global. See the Aliasing Rules section for the complete cross-space aliasing specification and the MemorySpaceOpt Internal Bitmask section for the dataflow bitmask encoding used during address space resolution.

The NVVM AA Algorithm

The core alias function follows upstream NVPTXAliasAnalysis.cpp in structure, enhanced with cicc-specific extensions. The pseudocode:

// NVPTXAAResult::alias -- the heart of NVVM AA
AliasResult alias(const MemoryLocation &Loc1,
                  const MemoryLocation &Loc2,
                  AAQueryInfo &AAQI) {
    unsigned AS1 = getAddressSpace(Loc1.Ptr, TraverseLimit);
    unsigned AS2 = getAddressSpace(Loc2.Ptr, TraverseLimit);

    // If either pointer is in generic (flat) space, we cannot disambiguate.
    // Generic pointers can point to any physical memory at runtime.
    if (AS1 == ADDRESS_SPACE_GENERIC || AS2 == ADDRESS_SPACE_GENERIC)
        return AliasResult::MayAlias;

    // Distributed shared memory (AS 7) overlaps with regular shared (AS 3).
    if ((AS1 == 3 && AS2 == 7) || (AS1 == 7 && AS2 == 3))
        return AliasResult::MayAlias;

    // Parameter space (AS 101) overlaps global (AS 1) after cvta.param
    // on SM 70+, so this pair stays conservative.
    if ((AS1 == 1 && AS2 == 101) || (AS1 == 101 && AS2 == 1))
        return AliasResult::MayAlias;

    // Same address space: cannot determine from space alone.
    // Fall through to BasicAA / TBAA for further analysis.
    if (AS1 == AS2)
        return AliasResult::MayAlias;

    // Different non-generic, non-overlapping spaces: provably disjoint.
    return AliasResult::NoAlias;
}

// getAddressSpace -- walk through casts to find the underlying space.
// Traverses up to MaxLookup levels of getUnderlyingObject().
unsigned getAddressSpace(const Value *V, unsigned MaxLookup) {
    while (MaxLookup-- > 0) {
        unsigned AS = V->getType()->getPointerAddressSpace();
        if (AS != ADDRESS_SPACE_GENERIC)
            return AS;
        const Value *Next = getUnderlyingObject(V, /*MaxLookup=*/1);
        if (Next == V)
            break;  // Reached a root (alloca, argument, global)
        V = Next;
    }
    return V->getType()->getPointerAddressSpace();
}

The getAddressSpace helper is the key difference from a naive check. A pointer may be in generic address space (AS 0) at its use site but was produced by an addrspacecast from a specific space. The traversal walks backward through getUnderlyingObject (which strips GEPs, bitcasts, PHIs) to find the original non-generic space. The depth limit (nvptx-traverse-address-aliasing-limit, default 6) prevents exponential blowup on deeply nested pointer chains.

The getModRefInfoMask method adds a further optimization: pointers into constant memory (AS 4) or parameter memory (AS 101) are read-only, so it returns NoModRef -- the pointer's memory is never modified. This allows DSE to skip analysis of stores that might alias with const/param loads, and lets LICM hoist loads from constant memory without checking for intervening stores.

The getMemoryEffects method handles inline assembly: PTX inline asm without side-effects or {memory} clobbers is treated as having no memory effects, which prevents it from blocking optimizations.

The Generic Address Space Problem

The generic (flat, AS 0) address space is the fundamental obstacle to alias precision on GPUs. When the frontend cannot determine which physical memory a pointer targets, it emits the pointer in AS 0. The hardware resolves generic addresses at runtime using address range checks -- a pointer into the shared memory window maps to shared, otherwise it maps to global.

For NVVM AA, a generic pointer forces MayAlias against every other pointer, destroying the disjointness guarantee. This is why MemorySpaceOpt is so critical: it runs before the main optimization pipeline and converts generic pointers to specific address spaces wherever possible, feeding precise AS information into NVVM AA.

Three mechanisms address the generic pointer problem:

1. MemorySpaceOpt (pre-optimization conversion). The two-phase interprocedural pass at sub_1C70910 resolves generic pointers by tracing them back to their allocation sites. If a generic pointer is always derived from a __shared__ variable, the pass inserts an addrspacecast to AS 3 and rewrites all uses. When different call sites pass different address spaces for the same argument, the pass clones the function into space-specialized versions. This is the most impactful optimization: every generic pointer that MemorySpaceOpt resolves gives NVVM AA an additional NoAlias edge.

2. Address space traversal in AA. Even without MemorySpaceOpt, the getAddressSpace helper in NVVM AA walks through addrspacecast chains. If a generic pointer %p was produced by addrspacecast i8 addrspace(3)* %s to i8*, the traversal discovers AS 3. The traversal depth limit (default 6) controls how far back the walk goes.

3. !noalias.addrspace metadata (kind 42). cicc attaches this metadata to instructions when address space information is known but the pointer itself remains generic. The AA evaluator (sub_13549C0) detects this metadata via opcode byte 0x4E ('N') and sets bit 2 in a pointer-tagged value (OR with 4), propagating the address-space disambiguation information through to AAResults::alias. This is a cicc-specific extension not found in upstream LLVM.

AA Pipeline Ordering

cicc configures the AA chain with NVVM AA running first, as confirmed by the NVPTXExternalAAWrapper which passes RunEarly=true to ExternalAAWrapperPass. The full chain:

NVVM AA  -->  BasicAA  -->  TBAA  -->  ScopedNoAliasAA  -->  GlobalsAA
  |              |            |              |                    |
  |              |            |              |                    +-- Module-level: which globals
  |              |            |              |                        escape? (enable-unsafe-
  |              |            |              |                        globalsmodref-alias-results)
  |              |            |              |
  |              |            |              +-- !noalias / !alias.scope metadata
  |              |            |                  (enable-scoped-noalias, default true)
  |              |            |
  |              |            +-- Type-based: !tbaa metadata tree
  |              |                (enable-tbaa, default true)
  |              |
  |              +-- Stateless: GEP decomposition, alloca vs argument,
  |                  capture analysis (basic-aa-recphi, default true;
  |                  basic-aa-separate-storage, default true)
  |
  +-- Address space disjointness (stateless, O(depth) per query)

The chain is queried through AAResults::alias() (sub_134CB50), which dispatches through the registered AA providers in order. Each provider returns NoAlias, MayAlias, PartialAlias, or MustAlias. If any provider returns NoAlias, the chain short-circuits -- subsequent providers are not consulted. This is why NVVM AA runs first: cross-address-space pairs are resolved in O(1) without invoking the more expensive BasicAA GEP decomposition.

The AAResults object consumed by MemorySSA, GVN, DSE, and LICM is the same chained result. All memory-aware passes benefit transparently from NVVM AA without any code changes.
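The first-NoAlias-wins dispatch can be modeled with a minimal stand-in chain. The types and query signature are simplified stand-ins, not the real AAResults::alias at sub_134CB50:

```cpp
#include <cassert>
#include <functional>
#include <vector>

enum class AliasResult { NoAlias, MayAlias, PartialAlias, MustAlias };

// Simplified AA chain: providers are queried in order; the first one that
// proves NoAlias short-circuits, so later (more expensive) providers never run.
struct AAChain {
    std::vector<std::function<AliasResult(int, int)>> providers;
    AliasResult alias(int a, int b) const {
        for (const auto &p : providers)
            if (p(a, b) == AliasResult::NoAlias)
                return AliasResult::NoAlias;   // short-circuit
        return AliasResult::MayAlias;          // nobody could disambiguate
    }
};
```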

Integration with Memory Optimization Passes

NVVM AA's impact flows through every pass that queries alias information:

MemorySSA (sub_1A6A260) builds its memory SSA graph using AAResults at [this+0xB8] (retrieved via tag unk_4F9D3C0). When NVVM AA proves that a store to shared memory and a load from global memory are NoAlias, MemorySSA does not create a dependency edge between them, resulting in a sparser -- and more precise -- memory graph. This precision propagates to every consumer of MemorySSA.

GVN (sub_1900BB0) uses AA for load elimination and store forwarding. With NVVM AA, a load from %p_global can be forwarded past a store to %q_shared because they provably do not alias. Without NVVM AA, GVN would conservatively assume they might alias and abandon the forwarding. The GVN implementation queries sub_134CB50 indirectly through MemoryDependenceResults, which itself consults AAResults.

DSE (sub_19DD1D0 and related functions) eliminates dead stores by proving that no subsequent load reads the stored value. DSE requires AAResults at unk_4F9D3C0. The DSE report confirms: "The alias analysis that DSE consumes already handles address-space separation. CUDA address spaces (shared=3, global=1, local=5, constant=4) are handled by the underlying NVVM alias analysis which knows that different address spaces cannot alias." DSE does NOT implement its own address-space checks -- it relies entirely on NVVM AA.

LICM uses AA to determine whether a load inside a loop can be hoisted out. If NVVM AA proves a loop-invariant load from constant memory (AS 4, getModRefInfoMask returns NoModRef) cannot be modified by any store in the loop, LICM hoists it. This is especially impactful for __constant__ kernel arguments accessed repeatedly in hot loops.

noalias Metadata and __restrict__ Handling

cicc provides two mechanisms for marking kernel pointer parameters as non-aliasing:

1. -restrict / --kernel-params-are-restrict (frontend flag, offset +1096). When the user passes --restrict to nvcc (or --kernel-params-are-restrict to cicc), it routes to the LLVM knob -nvptx-kernel-params-restrict via the llc argument vector. This causes cicc to add the noalias attribute to all pointer-typed kernel parameters, asserting that the programmer guarantees no two kernel pointer arguments alias. The process-restrict pass (ProcessRestrictPass, registered as a function pass in the new PM at position 419 in the pipeline parser, with its parameter parser at sub_233A330) then propagates this attribute through the call graph. The propagate-only mode restricts the pass to propagation without inserting new restrict annotations.

2. -allow-restrict-in-struct (flag at offset +1128). Extends __restrict__ handling to pointer fields inside struct arguments. When enabled, the process-restrict pass annotates struct-member pointers with noalias scope metadata, enabling AA to disambiguate pointers extracted from different struct fields. This flag routes to both the opt and llc argument vectors as -allow-restrict-in-struct.

Supporting knobs:

  • apply-multi-level-restrict -- apply __restrict__ to all pointer indirection levels (not just the outermost pointer)
  • dump-process-restrict -- debug dump during restrict processing

The noalias attribute interacts with the AA chain through ScopedNoAliasAA, which reads !noalias and !alias.scope metadata attached to instructions. cicc's frontend emits these metadata nodes when __restrict__ qualifiers are present in the CUDA source.

The !noalias.addrspace metadata (kind 42, registered in sub_B6EEA0) is a separate mechanism specific to address-space disambiguation. It is attached by MemorySpaceOpt or IR generation when a pointer is known to not alias with pointers in specific address spaces, even if the pointer itself remains in generic AS 0. The AA evaluator detects this metadata and tags the pointer with bit 2 (OR with 4) for disambiguation during alias queries.

The ProcessRestrict Propagation Algorithm

ProcessRestrictPass is NVIDIA's interprocedural restrict propagation pass, registered as pipeline entry 419 with class name ProcessRestrictPass. It runs as a function pass but has interprocedural effects: it reads the noalias attribute from kernel entry points and propagates equivalent information to callees by attaching !noalias and !alias.scope metadata to memory instructions. The knobs controlling its behavior are grouped in ctor_534 (address range 0x560000--0x5CFFFF), alongside allow-restrict-in-struct and apply-multi-level-restrict, and independently in ctor_270 (address range 0x4F0000--0x51FFFF) alongside process-restrict.

Activation and Flag Routing

The restrict pipeline activates through a chain of flag translations:

User:   nvcc --restrict kernel.cu
          |
nvcc:   cicc -restrict             (offset +1096 in flag struct)
          |
cicc:   llc  -nvptx-kernel-params-restrict    (routes to llc args only)
        opt  -allow-restrict-in-struct        (if -allow-restrict-in-struct set)
        opt  -apply-multi-level-restrict      (if set)

The critical distinction: -restrict routes exclusively to the llc argument vector (not opt), meaning the noalias attribute injection happens during code generation, not during the optimization pipeline. The process-restrict pass in the opt pipeline then reads these attributes and propagates their implications as metadata. The -allow-restrict-in-struct flag routes to both opt and llc, enabling struct-member restrict handling on both sides.

Propagation Algorithm

The pass operates in two modes controlled by the propagate-only parameter:

Full mode (default). The pass performs both annotation and propagation:

ProcessRestrictPass::run(Function &F):

  // Phase 1: Identify restrict-qualified pointer arguments
  for each Argument &A in F:
    if A.hasNoAliasAttr() and A.getType()->isPointerTy():
      RestrictArgs.push_back(&A)

  // Phase 1b: Struct member extraction (if allow-restrict-in-struct)
  if AllowRestrictInStruct:
    for each Argument &A in F:
      if A.getType() is StructType containing pointer fields:
        for each pointer field P extracted via extractvalue/GEP:
          RestrictArgs.push_back(P)

  // Phase 1c: Multi-level restrict (if apply-multi-level-restrict)
  if ApplyMultiLevelRestrict:
    for each pointer in RestrictArgs:
      if pointer points to pointer (T**):
        add the inner pointer dereference to RestrictArgs

  if RestrictArgs.empty():
    return PreservedAnalyses::all()

  // Phase 2: Create alias scope domain and per-argument scopes
  MDNode *Domain = createAliasScopeDomain(F.getName())
  for each pointer P in RestrictArgs:
    MDNode *Scope = createAliasScope(Domain, P->getName())
    ScopeMap[P] = Scope

  // Phase 3: Attach !alias.scope and !noalias metadata to memory ops
  for each Instruction &I in F:
    if I is load, store, call, or memcpy/memmove/memset:
      Value *Ptr = getPointerOperand(I)
      Value *Underlying = getUnderlyingObject(Ptr)

      // Which restrict argument does this pointer derive from?
      if ScopeMap.count(Underlying):
        MDNode *MyScope = ScopeMap[Underlying]
        I.setMetadata(!alias.scope, MyScope)

        // Build noalias set: all OTHER restrict arguments
        SmallVector<Metadata*> NoAliasScopes
        for each (P, S) in ScopeMap:
          if P != Underlying:
            NoAliasScopes.push_back(S)
        I.setMetadata(!noalias, MDNode::get(NoAliasScopes))

  // Phase 4: Debug dump (if dump-process-restrict)
  if DumpProcessRestrict:
    print annotated IR to dbgs()

Propagate-only mode. Skips Phase 1 annotation -- does not create new noalias attributes or scopes. Instead, it only reads existing !alias.scope and !noalias metadata from callers and propagates them through inlined call chains. This mode is used in later pipeline stages where new restrict annotations would be unsound (the interprocedural calling context has changed due to inlining).

How ScopedNoAliasAA Consumes the Metadata

The ScopedNoAliasAA provider (registered as scoped-noalias-aa in sub_233BD40, enabled by default via enable-scoped-noalias at ctor_060, global at 0x4B0000) processes the metadata as follows:

ScopedNoAliasAA::alias(LocA, LocB):

  // Extract !noalias sets from the instructions that produced LocA and LocB
  MDNode *NoAliasA = InstA->getMetadata(!noalias)   // set of scopes A does NOT alias
  MDNode *ScopeB   = InstB->getMetadata(!alias.scope) // B's own scope

  MDNode *NoAliasB = InstB->getMetadata(!noalias)
  MDNode *ScopeA   = InstA->getMetadata(!alias.scope)

  // If A's noalias set contains B's scope, or vice versa: NoAlias
  if NoAliasA contains any scope in ScopeB:
    return NoAlias
  if NoAliasB contains any scope in ScopeA:
    return NoAlias

  return MayAlias  // fall through to next AA provider

This means that after ProcessRestrictPass annotates a load from __restrict__ float *a with !alias.scope !{!scope_a} and !noalias !{!scope_b, !scope_c}, any load from __restrict__ float *b (with !alias.scope !{!scope_b}) will be proven NoAlias by ScopedNoAliasAA because scope_b appears in the first instruction's !noalias set. This is the standard LLVM scoped-noalias mechanism; cicc's contribution is the ProcessRestrictPass that generates these metadata nodes from CUDA __restrict__ annotations.

Restrict and Struct Members

When -allow-restrict-in-struct is active, the pass handles a common CUDA pattern where kernel parameters are passed through a struct:

struct Args {
    float * __restrict__ a;
    float * __restrict__ b;
    int n;
};

__global__ void kernel(Args args) {
    // Without allow-restrict-in-struct: a and b are NOT marked noalias
    //   because the struct argument itself is not __restrict__
    // With allow-restrict-in-struct: process-restrict extracts the
    //   pointer fields and creates per-field alias scopes
    args.a[i] = args.b[i] * 2.0f;  // DSE/LICM can now prove no alias
}

The pass identifies pointer-typed fields within struct arguments by walking extractvalue and getelementptr chains from the struct argument. Each extracted pointer receives its own alias scope, identical to what a top-level __restrict__ parameter would receive.

Multi-Level Restrict

When -apply-multi-level-restrict is active, the pass handles pointer-to-pointer arguments:

__global__ void kernel(float ** __restrict__ ptrs) {
    // Level 0: ptrs itself is restrict (different ptrs args don't alias)
    // Level 1: *ptrs (the pointed-to pointer) is also restrict
    //   meaning ptrs[i] and ptrs[j] point to non-aliasing memory
    float *a = ptrs[0];
    float *b = ptrs[1];
    a[x] = b[x];  // Proven NoAlias with multi-level restrict
}

Without this flag, only the outermost pointer level receives noalias treatment. With it, the pass follows dereference chains and creates scopes for each indirection level.

NVVM AA Query Logic -- Internal Detail

The AA chain in cicc is queried through AAResults::alias() at sub_134CB50. This function dispatches through the registered AA providers in registration order. The chain ordering observed in cicc v13.0 is:

NVVM AA  ->  BasicAA  ->  TBAA  ->  ScopedNoAliasAA  ->  GlobalsAA

This ordering is confirmed by sub_233BD40 (the AA chain builder, 4.8KB) which constructs the pipeline from names: globals-aa, basic-aa, objc-arc-aa, scev-aa, scoped-noalias-aa, tbaa. NVVM AA is injected at the front via NVPTXExternalAAWrapper with RunEarly=true, so it executes before all others.

The Query Dispatch Path

User pass (GVN, DSE, LICM, MemorySSA)
  |
  v
AAResults::alias(MemoryLocation &A, MemoryLocation &B)   [sub_134CB50]
  |
  +-- (1) NVPTXAAResult::alias()
  |       Check address spaces: cross-space pairs -> NoAlias
  |       If NoAlias: short-circuit, return immediately
  |
  +-- (2) BasicAA
  |       GEP decomposition, alloca vs argument, capture analysis
  |       basic-aa-recphi (default true): recursive PHI analysis
  |       basic-aa-separate-storage (default true): separate underlying objects
  |
  +-- (3) TBAA (Type-Based Alias Analysis)
  |       !tbaa metadata tree comparison
  |       enable-tbaa (default true)
  |
  +-- (4) ScopedNoAliasAA
  |       !noalias / !alias.scope metadata (from ProcessRestrict or frontend)
  |       enable-scoped-noalias (default true, ctor_060 at ~0x494CC1)
  |
  +-- (5) GlobalsAA  [sub_13C7380, 35.7KB]
  |       Module-level: which globals escape?
  |       enable-unsafe-globalsmodref-alias-results (default false)
  |
  v
Final AliasResult (NoAlias / MayAlias / PartialAlias / MustAlias)

Any provider returning NoAlias short-circuits the chain -- subsequent providers are never consulted. This is why NVVM AA runs first: cross-address-space pairs are resolved with zero overhead from BasicAA's GEP decomposition.

ModRef Queries

Two additional entry points handle call-site interactions:

sub_134F0E0 -- AAResults::getModRefInfo(CallBase, MemoryLocation). Returns a ModRefInfo encoding that combines Mod/Ref bits with MustAlias information (8 values, 0--7). This is used by DSE and LICM to determine whether a call can read or write a specific memory location.

sub_134F530 -- AAResults::getModRefInfo(CallBase, CallBase). Same encoding but for two call sites. Used by MemorySSA to build dependencies between calls.

The getModRefInfoMask method in NVVM AA adds a key optimization: pointers into constant memory (AS 4) or parameter memory (AS 101) return NoModRef because these memories are read-only from the kernel's perspective. This lets DSE skip alias analysis entirely for constant/param loads and lets LICM hoist them unconditionally.
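Assuming the 8-value encoding matches the historical LLVM ModRefInfo layout (bit 0 = Ref, bit 1 = Mod, bit 2 = "no MustAlias"), the values decode as follows. The exact bit assignment inside cicc is unverified; this is a sketch of the standard scheme, not decompiled output:

```cpp
#include <cassert>

// Historical LLVM ModRefInfo layout: values 0..7.
// Bit 0 = Ref, bit 1 = Mod, bit 2 = the MustAlias guarantee is ABSENT.
enum class ModRefInfo : unsigned {
    Must = 0, MustRef = 1, MustMod = 2, MustModRef = 3,
    NoModRef = 4, Ref = 5, Mod = 6, ModRef = 7,
};

static bool isModSet(ModRefInfo MRI)  { return unsigned(MRI) & 2; }
static bool isRefSet(ModRefInfo MRI)  { return unsigned(MRI) & 1; }
static bool isMustSet(ModRefInfo MRI) { return !(unsigned(MRI) & 4); }
```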

getMemoryEffects for Inline Assembly

NVVM AA's getMemoryEffects method inspects PTX inline assembly blocks. An inline asm statement without the sideeffect flag and without a {memory} clobber constraint is classified as having no memory effects (MemoryEffects::none()). This prevents innocent inline asm (register manipulation, warp votes) from blocking load motion, store elimination, and CSE across the asm block.

Address-Space-Based NoAlias Rules -- Complete Matrix

The cross-address-space NoAlias decision is the cheapest and most impactful alias analysis in cicc. The full decision matrix for all pairs:

| | AS 0 (generic) | AS 1 (global) | AS 3 (shared) | AS 4 (const) | AS 5 (local) | AS 6 (tensor) | AS 7 (shmem cluster) | AS 101 (param) |
|---|---|---|---|---|---|---|---|---|
| AS 0 | MayAlias | MayAlias | MayAlias | MayAlias | MayAlias | MayAlias | MayAlias | MayAlias |
| AS 1 | MayAlias | MayAlias | NoAlias | NoAlias | NoAlias | NoAlias | NoAlias | MayAlias* |
| AS 3 | MayAlias | NoAlias | MayAlias | NoAlias | NoAlias | NoAlias | MayAlias | NoAlias |
| AS 4 | MayAlias | NoAlias | NoAlias | MayAlias | NoAlias | NoAlias | NoAlias | NoAlias |
| AS 5 | MayAlias | NoAlias | NoAlias | NoAlias | MayAlias | NoAlias | NoAlias | NoAlias |
| AS 6 | MayAlias | NoAlias | NoAlias | NoAlias | NoAlias | MayAlias | NoAlias | NoAlias |
| AS 7 | MayAlias | NoAlias | MayAlias | NoAlias | NoAlias | NoAlias | MayAlias | NoAlias |
| AS 101 | MayAlias | MayAlias* | NoAlias | NoAlias | NoAlias | NoAlias | NoAlias | MayAlias |

* AS 1 (global) vs AS 101 (param) returns MayAlias because cvta.param (SM 70+) converts parameter pointers to global-space addresses. A parameter-space pointer and a global-space pointer may reference the same physical byte after conversion. This is a conservative choice; upstream LLVM has a commented TODO noting that cvta.param support is not yet implemented, and cicc matches this conservatism.

The decision algorithm implemented in NVPTXAAResult::alias:

if AS1 == 0 or AS2 == 0:          -> MayAlias   (generic escapes all reasoning)
if AS1 == AS2:                     -> MayAlias   (same space, need deeper AA)
if {AS1,AS2} == {3,7}:            -> MayAlias   (shared/cluster overlap)
if {AS1,AS2} == {1,101}:          -> MayAlias   (global/param overlap via cvta.param)
otherwise:                         -> NoAlias    (hardware disjointness)

The !noalias.addrspace Metadata Mechanism

When MemorySpaceOpt or IR generation determines that a generic-space pointer provably does not alias with a specific address space, but cannot convert the pointer itself to that space (for example, because other uses require it to remain generic), cicc attaches !noalias.addrspace metadata (kind 42) to the instruction. This is registered in sub_B6EEA0 alongside the 41 standard LLVM metadata kinds (dbg=1, tbaa=2, prof=3, ..., noalias.addrspace=42).

The AA evaluator at sub_13549C0 detects this metadata during pointer collection (Phase 2 of the evaluator). When it encounters an instruction with opcode byte 0x4E (78, ASCII 'N'), it tags the pointer value with bit 2 set (OR with 4):

// At 0x1356170, 0x1356180, 0x1356190 in the AA evaluator:
if opcode_byte == 0x4E:          // noalias.addrspace annotation
    tagged_ptr = raw_ptr | 4     // set bit 2 as disambiguation flag

This tagged pointer propagates through to AAResults::alias() (sub_134CB50), AAResults::getModRefInfo(CallBase, MemoryLocation) (sub_134F0E0), and AAResults::getModRefInfo(CallBase, CallBase) (sub_134F530). The AA providers detect bit 2 and use the associated metadata to return NoAlias for the tagged pointer against pointers in the excluded address spaces.

Similarly, opcode byte 0x1D (29) identifies addrspacecast instructions. The evaluator captures the pre-cast value via cmovz, allowing the AA to trace back to the original non-generic address space even when the instruction itself operates on generic pointers.

The three opcode values that trigger special handling in the AA evaluator:

Opcode byte | Decimal | Meaning | AA evaluator action
0x4E | 78 ('N') | !noalias.addrspace annotated | OR pointer with 4 (set bit 2)
0x1D | 29 | addrspacecast | Capture pre-cast value for AS lookup
0x36, 0x37 | 54, 55 | llvm.noalias.scope.decl intrinsic results | Insert into separate scope pointer sets

Comparison with Upstream LLVM NVPTX

Upstream LLVM (as of LLVM 19/20) includes NVPTXAliasAnalysis.cpp in llvm/lib/Target/NVPTX/, which implements the same core address-space disjointness logic. cicc's version is functionally equivalent to upstream for the basic alias query but differs in several ways:

Aspect | Upstream LLVM | cicc v13.0
Core alias check | Cross-AS = NoAlias, generic = MayAlias | Same
Shared cluster handling | AS 3 vs AS 7 = MayAlias | Present (SM 90+ targets)
Param aliasing with global | Commented TODO: "cvta.param not yet supported" | Same conservative treatment
getModRefInfoMask | Const/param = NoModRef | Same
Inline asm analysis | Checks side-effects + {memory} clobber | Same
Traversal depth knob | nvptx-traverse-address-aliasing-limit (default 6) | Same knob present
!noalias.addrspace metadata | Not used upstream | cicc-specific extension (metadata kind 42)
strict-aliasing knob | Not in upstream NVPTX | cicc adds "Datatype based strict alias"
nvptxaa-relax-fences | Not in upstream | cicc-specific: ordering relaxation for fences
process-restrict pass | Not in upstream NVPTX backend | cicc-specific interprocedural restrict propagation
Integration with MemorySpaceOpt | No upstream equivalent | cicc's address space inference feeds NVVM AA

The most significant delta is the ecosystem: upstream NVPTX has the AA pass but lacks the interprocedural MemorySpaceOpt pipeline that resolves generic pointers, the process-restrict pass that propagates noalias, and the !noalias.addrspace metadata that bridges partial address-space knowledge into the AA chain. These three components working together give cicc far more NoAlias results than upstream LLVM achieves on the same IR.

Configuration Knobs

NVVM AA Knobs

Knob | Type | Default | Description
nvptx-traverse-address-aliasing-limit | unsigned | 6 | Maximum depth for getAddressSpace traversal through getUnderlyingObject
nvptxaa-relax-fences | bool | (unknown) | Enable ordering relaxation for fence instructions in AA
strict-aliasing | bool | (unknown) | "Datatype based strict alias" -- NVIDIA extension for type-based disambiguation
traverse-address-aliasing | bool | (unknown) | "Find address space through traversal" -- master enable for the traversal in getAddressSpace
assume-default-is-flat-addrspace | bool | false | Treat default address space (0) as flat/generic (testing knob)

Standard LLVM AA Knobs (present in cicc)

Knob | Type | Default | Description
disable-basic-aa / disable-basicaa | bool | false | Disable BasicAA entirely
basic-aa-recphi | bool | true | Enable recursive PHI analysis in BasicAA
basic-aa-separate-storage | bool | true | Enable separate-storage analysis in BasicAA
enable-tbaa | bool | true | Enable Type-Based Alias Analysis
enable-scoped-noalias | bool | true | Enable ScopedNoAlias AA (processes !noalias / !alias.scope)
enable-unsafe-globalsmodref-alias-results | bool | false | Enable GlobalsModRef (requires unsafe assumption about global escapes)
alias-set-saturation-threshold | int | (default) | Maximum pointers in an AliasSet before it saturates
aa-pipeline | string | (default) | Override the AA pipeline configuration

Restrict Processing Knobs

Knob | Type | Default | Description
nvptx-kernel-params-restrict | bool | false | Mark all kernel pointer params as noalias (activated by -restrict flag)
allow-restrict-in-struct | bool | false | Propagate __restrict__ into struct pointer members
apply-multi-level-restrict | bool | (unknown) | Apply __restrict__ through all pointer indirection levels
dump-process-restrict | bool | false | Debug dump during restrict processing

AA Evaluator Debug Flags

The aa-eval diagnostic pass (sub_13549C0) uses 14 independent boolean flags for selective output:

Address | Flag | Controls
byte_4F97AA0 | print-all-alias-modref-info | Master enable for all AA debug output
byte_4F979C0 | print-all-alias-no | Print NoAlias pointer pairs
byte_4F978E0 | print-all-alias-may | Print MayAlias pointer pairs
byte_4F97800 | print-all-alias-partial | Print PartialAlias pointer pairs
byte_4F97720 | print-all-alias-mustalias | Print MustAlias pointer pairs
byte_4F97640 | print-all-modref-none | Print NoModRef results
byte_4F97560 | print-all-modref-ref | Print JustRef results
byte_4F97480 | print-all-modref-mod | Print JustMod results
byte_4F973A0 | print-all-modref-both | Print BothModRef results
byte_4F96F40 | aa-eval-callsite-modref | Enable call-site ModRef evaluation (Phase 5)

Function Map

Function | Address | Size
AAResults::alias(MemoryLocation, MemoryLocation) -- main alias query entry | sub_134CB50 | --
AAResults::getModRefInfo(CallBase, MemoryLocation) | sub_134F0E0 | --
AAResults::getModRefInfo(CallBase, CallBase) | sub_134F530 | --
AAEvaluator::runOnFunction -- the aa-eval diagnostic pass | sub_13549C0 | 11,038 B
SmallPtrSet::insert (pointer collection in aa-eval) | sub_13540B0 | --
Pointer-pair result printer (aa-eval) | sub_1352080 | --
Call-site pair result printer (aa-eval) | sub_1351E00 | --
Formatted alias result printer (aa-eval) | sub_13523B0 | --
GlobalsAA main analysis function | sub_13C7380 | 35.7 KB
GlobalsAA helper (per-function analysis) | sub_13C5530 | 21 KB
GlobalsAA call-site analysis | sub_13C4410 | 6.7 KB
GlobalsAA alias query | sub_13C34D0 | 12.6 KB
AA iteration / chaining logic | sub_FD1250 | 23.4 KB
Dominator-tree-based AA query setup (used by MemorySSA) | sub_14A4050 | --
Metadata kind registration (including noalias.addrspace = kind 42) | sub_B6EEA0 | 9 KB
MemorySpaceOpt pass entry (IP-MSP worklist driver) | sub_1C70910 | ~2,427 lines
MemorySpaceOpt per-BB scanner + address-space bitmask builder | sub_1CA8CD0 | ~898 lines

Cross-References

  • MemorySpaceOpt -- the interprocedural pass that resolves generic pointers to specific address spaces, directly feeding NVVM AA
  • IP Memory Space Propagation -- the interprocedural wrapper around MemorySpaceOpt
  • GVN -- consumes AA for load elimination and store forwarding
  • DSE -- relies on AA for dead store detection; confirmed to have no internal address-space checks
  • LICM -- uses AA to hoist/sink memory operations across loops
  • Pipeline & Ordering -- where NVVM AA fits in the overall pass schedule
  • LLVM Knobs -- complete knob inventory including AA-related knobs
  • Optimization Levels -- how NVVMAliasAnalysis appears in the tier 2+ pipeline

MemorySSA Builder for GPU

MemorySSA constructs a sparse SSA form over memory operations, giving every instruction that reads or writes memory a position in a use-def chain that tracks the flow of memory state through a function. In upstream LLVM, MemorySSA already delivers significant speedups over the older MemoryDependenceResults analysis by avoiding per-query linear scans. In cicc v13.0, the payoff is amplified because the underlying alias analysis pipeline includes NVVM AA, which returns NoAlias for any cross-address-space pointer pair. A store to shared memory (addrspace(3)) and a load from global memory (addrspace(1)) will never produce a dependency edge in the MemorySSA graph, yielding a dramatically sparser representation than would be possible on a flat-memory architecture. Every pass that consumes MemorySSA -- LICM, EarlyCSE, DSE, GVN, SimpleLoopUnswitch -- benefits from this precision without containing any GPU-specific logic itself.

Key Facts

Property | Value
Builder entry wrapper | sub_1A6CAD0 (48 bytes -- skipFunction guard + tail call)
Builder core function | sub_1A6A260 (10,344 bytes)
MemoryAccess allocator | sub_1A69110 (1,245 bytes)
Pass registration string | "memoryssa" (analysis #179 in pipeline parser)
Pipeline parser entry | "print<memoryssa>" -> MemorySSAPrinterPass
Required analyses | AliasAnalysis (tag unk_4F9D3C0), DominatorTree (tag unk_4F9E06C), LoopInfo (tag unk_4F9A488)
Stack frame size | 0x3F8 = 1,016 bytes
MemoryAccess node size | 0x40 = 64 bytes (bump-allocated)
Walker check limit | memssa-check-limit = 100 (max stores/phis to walk past)
Verification flag | verify-memoryssa (off by default, on under EXPENSIVE_CHECKS)
DOT graph output | dot-cfg-mssa (filename for CFG + MemorySSA visualization)

MemorySSA Node Types

MemorySSA represents memory state with three node types, all stored in 64-byte heap-allocated objects:

MemoryDef (kind=2) -- Created for every instruction that may write memory: stores, calls with side effects, atomics, memcpy/memmove intrinsics. Each MemoryDef takes the previous memory state as its operand and produces a new version of memory state.

MemoryUse (kind=1) -- Created for every instruction that reads memory but does not modify it: loads, calls to readonly/readnone functions. A MemoryUse points to the MemoryDef (or MemoryPhi) that represents the most recent memory state it depends on.

MemoryPhi (kind=3) -- Inserted at control flow join points where predecessors have different reaching memory definitions, exactly like an SSA phi node for scalar values. A MemoryPhi merges the memory states from each predecessor into a single version.

All three types share a common layout:

Offset | Size | Field
+0x00 | 8 | vtable / next pointer (intrusive list)
+0x08 | 8 | prev pointer (intrusive list)
+0x10 | 4 | kind (1=MemoryUse, 2=MemoryDef, 3=MemoryPhi)
+0x14 | 4 | operand_count (bits 0-27)
+0x17 | 1 | flags byte (bit 6 = 0x40 = "has inline operands")
+0x18 | 8 | defining instruction / accessed Value*
+0x20 | 8 | type/size descriptor (APInt or pointer to APInt)
+0x28 | 8 | operand/predecessor pointer
+0x30 | 8 | current reaching definition (MemoryAccess*)
+0x38 | 8 | associated BasicBlock* (or null)

The sentinel value 1 stored in the reaching-definition field (+0x30) represents LiveOnEntry -- the implicit MemoryDef that dominates the entire function and represents the initial state of memory at function entry.

Construction Algorithm

The builder at sub_1A6A260 follows the standard LLVM MemorySSA construction algorithm, implemented as a dominator-tree DFS rename pass. The implementation is split into eight phases.

Phase 1 -- Prerequisite Retrieval (0x1A6A260 - 0x1A6A3A0)

The builder queries the analysis manager for three required results via a vtable-tagged vector. Each registered analysis is identified by a unique tag pointer:

  1. unk_4F9D3C0 -> calls virtual method [rax+0x68] -> sub_14A4050 -- retrieves AAResults, stored at [this+0xB8]
  2. unk_4F9E06C -> retrieves DominatorTree result, stored at [this+0xA8] (offset +0xA0 within the wrapper)
  3. unk_4F9A488 -> retrieves LoopInfo, stored at [this+0xB0]

If any tag is not found in the registered analysis vector, control jumps to terminal handlers at 0x1A6CAAF-0x1A6CABE (assertion / unreachable).

Phase 2 -- Worklist Initialization (0x1A6A3A0 - 0x1A6A6B0)

The builder allocates a 1,016-byte stack frame and initializes four layers of SmallVector-based renaming stacks:

  • Level 0: DFS traversal order over the dominator tree (computed by sub_13B8390)
  • Level 1: Per-block instruction iterator
  • Level 2: Per-block incoming MemoryPhi operand buffer (SmallVector at rbp-0x330, inline capacity 8)
  • Level 3: Memory state stack (current reaching definition per DFS depth)

Each layer is initialized by sub_16CCEE0 (SmallVector move-assign). Temporary intermediate buffers are freed before the main walk begins.

Phase 3 -- Dominator Tree Walk (0x1A6A88C - 0x1A6B070)

The main loop visits each basic block in DFS order over the dominator tree. For every instruction, the builder reads the opcode byte at [instruction-8] and classifies it:

opcode_tag = *(uint8_t*)(instr - 8);

switch (opcode_tag) {
    case 0x18..0x38:    // Memory instructions (load/store range)
        type_tag = *(uint8_t*)(*(instr - 0x18) + 8);
        if (type_tag == 0x10)  // PointerType result -> this is a Load
            createMemoryUse(instr);
        else
            createMemoryDef(instr);  // Store
        break;

    case 0x0B:          // CallInst
        classifyCall(instr);   // -> sub_1A69C30
        break;

    case 0x27:          // PHINode
        if (predecessors_disagree_on_memory_state())
            createMemoryPhi(block);
        break;
}

Type-size computation. For each memory access, a three-level nested switch computes the byte-size of the accessed region. The switch handles all LLVM Type IDs:

Type ID | Type | Size computation
1 | HalfTy | 16 bits
2 | FloatTy | 32 bits
3 | DoubleTy | 64 bits
4 | FP80 | 80 bits
5 | FP128 | 128 bits
6 | PPC_FP128 | 128 bits
7 | PointerTy | getPointerSizeInBits() via sub_15A9520
11 | IntegerTy | [type+8] >> 8 (raw bit width)
14 | StructTy | getStructLayout() via sub_15A9FE0
0, 8, 10, 12, 16 | Array/Vector | element_count * element_size

When the computed access size differs from the store size ([rax+8] >> 8), the builder routes through sub_1A69690 to create a partial-store MemoryDef, capturing the precise overlap region.

Phase 4 -- Call and Intrinsic Classification

Call instructions (opcode 0x0B) are dispatched through sub_1A69C30 (call-instruction MemoryDef handler), which classifies intrinsics by ID:

  • ID 0x0F (lifetime.start) and ID 0x17 (lifetime.end) -- no memory effect, skipped
  • ID 0x27 -- memcpy/memmove-like intrinsics, create MemoryDef
  • ID 0x2F -- atomic intrinsics (checks [rdx-0x30] for ordering)
  • ID 0x33 -- NVIDIA-specific intrinsics (surface/texture operations, NVVM builtins)

Phase 5 -- MemoryAccess Allocation (sub_1A69110)

The core allocator creates all three node types. Parameters:

Register | Meaning
rdi | MemorySSA this
esi | kind: 1=MemoryUse, 2=MemoryDef, 3=MemoryPhi
rdx | defining value / access value
rcx | type descriptor (APInt holding access size)
r8 | instruction pointer
r9 | predecessor block (for MemoryPhi)

Each allocation calls sub_22077B0 (BumpPtrAllocator::Allocate) for 0x40 bytes, populates all fields, inserts the node into the intrusive list via sub_2208C80, and increments the node counter at [this+0xD0].

For kind==1, sub_16A57B0 (countLeadingZeros) determines whether the access is a full or partial def. For kind==3 (MemoryPhi), the operand list is populated by iterating predecessor blocks through sub_146F1B0 (AA-driven reaching-definition lookup).

Phase 6 -- Trivial Phi Optimization (0x1A6B280 - 0x1A6B9BD)

After the DFS walk, the builder post-processes all MemoryPhi nodes. Any MemoryPhi whose operands all resolve to the same MemoryDef is trivial -- it can be replaced with that single reaching definition. The loop at 0x1A6B9DE iterates the result vector [this+0xD8..this+0xE0]:

for (auto *Phi : result_vector) {
    unsigned count = Phi->operand_count & 0x0FFFFFFF;
    if (all_operands_identical(Phi)) {
        Phi->replaceAllUsesWith(single_reaching_def);  // sub_164B780
        Phi->eraseFromParent();                         // sub_1AEB370
        destroy(Phi);                                   // sub_164BEC0
    }
}

This cleanup is critical for GPU code. Because NVVM AA proves so many memory operations are independent, many join points that would require MemoryPhis on a flat-memory machine will have all predecessors carrying the same memory state. The trivial-phi elimination pass removes these, reducing the graph to only the essential dependencies.

GPU-Specific Precision Gains

The MemorySSA builder itself contains no explicit GPU logic. The GPU awareness comes entirely through the AA pipeline at [this+0xB8], which chains BasicAA -> TBAA -> ScopedNoAliasAA -> NVVM AA. The critical interaction points are:

Cross-address-space independence. When sub_146F1B0 queries the AA for a (store to addrspace(3), load from addrspace(1)) pair, NVVM AA returns NoAlias before BasicAA or TBAA are even consulted. The MemorySSA builder then skips creating a dependency edge. This means a MemoryUse for a global load will not depend on a MemoryDef for a shared store -- they exist in parallel chains.

Partial-alias precision. The builder at 0x1A6AFB3 creates MemoryDefs even for partial overlaps, then calls sub_1A69690 to register the precise overlap region. Standard LLVM would conservatively treat partial alias as MayAlias and create a full dependency. cicc's more aggressive approach uses the partial overlap information downstream for finer-grained DSE and LICM decisions.

Address-space check on volatile access. The call to sub_15FA300 at 0x1A6B88E performs what appears to be a volatile-access or address-space check specific to CUDA memory spaces. This gate prevents the builder from creating false dependencies between volatile shared memory operations (used for inter-warp communication) and non-volatile global operations.

NVIDIA custom intrinsic handling. Type ID 0x33 in sub_1A69990 is not a standard LLVM type ID. It appears to be cicc's custom type for CUDA-specific memory operations (surface/texture references, NVVM-specific typed pointers). These are classified as memory-clobbering conservatively unless the AA can prove otherwise.

Practical effect. Consider a kernel that loads from global memory, operates on shared memory, and stores back to global memory:

__global__ void kernel(float *out, float *in) {
    __shared__ float smem[256];
    smem[threadIdx.x] = in[threadIdx.x];        // global load + shared store
    __syncthreads();
    float val = smem[threadIdx.x] * 2.0f;       // shared load
    out[threadIdx.x] = val;                      // global store
}

On a flat-memory machine, the MemorySSA graph would have a single linear chain: every memory operation depends on the previous one. With NVVM AA feeding MemorySSA, the graph splits into two parallel chains -- one for shared memory and one for global memory -- connected only at the __syncthreads() barrier (which is modeled as a MemoryDef that clobbers all address spaces).

The MemorySSA Walker

Passes do not directly traverse the MemorySSA def-use chains. Instead, they query the CachingWalker, which answers the fundamental question: "What is the nearest MemoryDef that actually clobbers this memory location?"

The walker performs an optimized upward walk along the def chain, testing each MemoryDef against the query location using the full AA pipeline. The walk terminates when:

  1. A MemoryDef that clobbers the query location is found (instructionClobbersQuery returns true)
  2. LiveOnEntry is reached (the location was never written in this function)
  3. The walk budget (memssa-check-limit = 100 steps) is exhausted, in which case the current MemoryDef is returned conservatively as a clobber

When a MemoryPhi is encountered, the walker splits into multiple paths (one per predecessor) and tracks them using a DefPath worklist. Each path records a (MemoryLocation, First, Last, Previous) tuple, enabling the walker to reconstruct the full path from any clobber back to the query origin.

Caching. The CachingWalker memoizes results per (MemoryAccess, MemoryLocation) pair. Once a clobber query is resolved, subsequent queries for the same access return the cached result immediately. The SkipSelfWalker variant (used by DSE) additionally skips the MemoryDef that is the query origin itself, answering "what did this store overwrite?" rather than "what clobbers this store?"

On GPU, the walker's budget is rarely exhausted for shared-memory operations because NVVM AA prunes so many false dependencies that the def chain is short. For global memory operations in loops with many stores, the 100-step limit can be hit; increasing memssa-check-limit trades compilation time for precision in these cases.

Consumer Passes

Five major passes consume MemorySSA in cicc:

Pass | How it uses MemorySSA
LICM | Queries the walker to determine whether a load inside a loop is clobbered by any store in the loop body. If no clobber is found, the load is hoisted. NVVM AA makes shared-memory loads trivially hoistable past global stores.
EarlyCSE (early-cse-memssa variant, sub_27783D0) | Uses MemorySSA to find redundant loads -- two loads from the same location with no intervening clobber are CSE'd. The MemorySSA variant avoids the O(n^2) scanning of the non-MSSA EarlyCSE.
DSE | Walks the MemorySSA graph backwards from a store to find earlier stores to the same location with no intervening loads. Dead stores are eliminated. DSE has its own extensive set of MemorySSA walk limits (see knobs below).
GVN | Can optionally use MemorySSA instead of MemoryDependenceResults (controlled by enable-gvn-memoryssa). When enabled, GVN uses the walker for load-value forwarding and PRE.
SimpleLoopUnswitch | Queries MemorySSA to determine whether a condition inside a loop depends on memory modified in the loop. The simple-loop-unswitch-memoryssa-threshold knob controls the walk limit.

Knobs and Thresholds

MemorySSA Core

Knob | Default | Effect
memssa-check-limit | 100 | Maximum stores/phis the walker will walk past before giving up. Higher values improve precision at the cost of compilation time.
verify-memoryssa | false | Enables expensive verification of MemorySSA invariants after every modification.
dot-cfg-mssa | "" | If set, dumps the CFG annotated with MemorySSA information to the named DOT file.

DSE MemorySSA Walk Limits

Knob | Default | Effect
dse-memoryssa | true | Master switch enabling MemorySSA-based DSE.
dse-memoryssa-scanlimit | 150 | Max memory accesses DSE will scan for a redundant store.
dse-memoryssa-walklimit | 90 | Max MemorySSA walk steps per DSE query.
dse-memoryssa-partial-store-limit | 5 | Max partial stores DSE will try to merge.
dse-memoryssa-defs-per-block-limit | 5000 | Skip blocks with more defs than this limit.
dse-memoryssa-samebb-cost | 1 | Walk cost weight for same-block MemoryDefs.
dse-memoryssa-otherbb-cost | 5 | Walk cost weight for cross-block MemoryDefs.
dse-memoryssa-path-check-limit | 50 | Max paths DSE will check for nontrivial reachability.
dse-optimize-memoryssa | true | Enables DSE's own MemorySSA optimization (trivial phi removal during DSE).

GVN / MemoryDependence

Knob | Default | Effect
enable-gvn-memoryssa | varies | Switches GVN from MemDep to MemorySSA.
memdep-block-scan-limit | 100 (legacy) | Legacy MemDep per-block scan limit.
memdep-block-number-limit | 200 (legacy) / 1000 (NewPM) | Max blocks MemDep will search. Note: the NewPM variant defaults to 1,000, a 5x increase.

Function Map

Function | Address | Size
Pass entry wrapper (skipFunction guard + tail call to builder) | sub_1A6CAD0 | 48
MemorySSA builder core (DFS rename walk) | sub_1A6A260 | 10,344
MemoryAccess node allocator (Def/Use/Phi) | sub_1A69110 | 1,245
MemoryDef creation dispatcher (routes to sub_1A69110) | sub_1A695F0 | --
Store-instruction MemoryDef handler (partial store support) | sub_1A69690 | 754
MemoryPhi operand insertion handler (bidirectional edge setup) | sub_1A69990 | 664
Call-instruction handler (intrinsic classification) | sub_1A69C30 | --
MemorySSA::getMemoryAccess or walker lookup | sub_1643330 | --
MemoryAccess::getDefiningAccess | sub_1643D30 | --
MemoryLocation::get or getForDest | sub_1644900 | --
Value::replaceAllUsesWith (def substitution during trivial phi removal) | sub_164B780 | --
MemoryAccess::~MemoryAccess (destructor) | sub_164BEC0 | --
MemoryAccess::eraseFromParent | sub_1AEB370 | --
BumpPtrAllocator::Allocate (64-byte node allocation) | sub_22077B0 | --
AA query: getModRefInfo / reaching-def resolution | sub_146F1B0 | --
AA query: may-alias check (two-pointer comparison) | sub_145CF80 | --
AA query: isNoAlias / clobber check | sub_1487400 | --
DominatorTree DFS order computation | sub_13B8390 | --
skipFunction guard (checks isDeclaration) | sub_1636880 | --

Diagnostic Strings

Diagnostic strings recovered from p2-J04-memoryssa.txt and the pipeline parser (p2c.1-01-pipeline-parser.txt). MemorySSA itself emits no optimization remarks; its diagnostics are configuration knobs and the verification/dump infrastructure.

String | Source | Category | Trigger
"memoryssa" | Pipeline parser analysis #179 | Registration | Analysis registration name in the pass pipeline
"print<memoryssa>" | Pipeline parser #406 | Registration | Printer pass registration; params: no-ensure-optimized-uses
"memssa-check-limit" | Knob (default 100) | Knob | Maximum stores/phis the CachingWalker will walk past before returning a conservative clobber
"verify-memoryssa" | Knob (default false) | Knob | Enables expensive verification of MemorySSA invariants after every modification; on under EXPENSIVE_CHECKS
"dot-cfg-mssa" | Knob (default "") | Knob | If set, dumps the CFG annotated with MemorySSA information to the named DOT file for visualization
"dse-memoryssa" | Knob (default true) | Knob | Master switch enabling MemorySSA-based DSE
"dse-memoryssa-scanlimit" | Knob (default 150) | Knob | Max memory accesses DSE will scan for a redundant store
"dse-memoryssa-walklimit" | Knob (default 90) | Knob | Max MemorySSA walk steps per DSE query
"dse-memoryssa-partial-store-limit" | Knob (default 5) | Knob | Max partial stores DSE will try to merge
"dse-memoryssa-defs-per-block-limit" | Knob (default 5000) | Knob | Skip blocks with more defs than this limit
"dse-memoryssa-samebb-cost" | Knob (default 1) | Knob | Walk cost weight for same-block MemoryDefs
"dse-memoryssa-otherbb-cost" | Knob (default 5) | Knob | Walk cost weight for cross-block MemoryDefs
"dse-memoryssa-path-check-limit" | Knob (default 50) | Knob | Max paths DSE will check for nontrivial reachability
"dse-optimize-memoryssa" | Knob (default true) | Knob | Enables DSE's own MemorySSA optimization (trivial phi removal during DSE)
"enable-gvn-memoryssa" | Knob (varies) | Knob | Switches GVN from MemDep to MemorySSA
"memdep-block-scan-limit" | Knob (default 100, legacy) | Knob | Legacy MemDep per-block scan limit
"memdep-block-number-limit" | Knob (default 200 legacy / 1000 NewPM) | Knob | Max blocks MemDep will search; NewPM variant defaults to 1,000 (5x increase)
"print<memoryssa-walker>" | Pipeline parser | Registration | MemorySSA walker printer pass
"early-cse-memssa" | Pipeline parser | Registration | EarlyCSE variant that uses MemorySSA

Cross-References

  • Alias Analysis & NVVM AA -- the AA pipeline that feeds MemorySSA with GPU-aware NoAlias results
  • LICM -- primary consumer; NVVM AA-enhanced MemorySSA enables aggressive hoisting of shared-memory loads past global stores
  • DSE -- walks MemorySSA backwards to find dead stores; extensive set of MemorySSA-specific knobs
  • GVN -- optional MemorySSA backend via enable-gvn-memoryssa
  • EarlyCSE -- EarlyCSE's memssa variant uses MemorySSA for redundant load elimination

LazyCallGraph & CGSCC Pass Manager

The LazyCallGraph (LCG) is the data structure that represents which functions call or reference which other functions, built on demand rather than up front. It drives the CGSCC (Call Graph Strongly Connected Components) pass manager, which walks the call graph in bottom-up order so that interprocedural passes -- the inliner, argument promotion, devirtualization, function attribute inference -- process callees before callers. This ordering is essential: the inliner must have finished optimizing a callee's body before it decides whether to inline that callee into a caller. cicc v13.0 uses LLVM's stock LazyCallGraph implementation without NVIDIA-specific modifications to the graph itself. The GPU-specific behavior comes entirely from how the pipeline configures the CGSCC framework: kernels serve as call graph roots, device functions are internal nodes, recursion is rare, and the inline cost model is radically different from any CPU target.

The LCG cluster occupies approximately 220KB of code at 0xD230A0--0xD2F8A0, containing the graph construction logic, Tarjan's SCC algorithm, incremental SCC mutation operations, and the DOT/text graph printers. A separate 69KB function at sub_2613930 implements the New PM CGSCC inliner that runs inside this framework.

Key Facts

Property | Value
Binary cluster | 0xD230A0 -- 0xD2F8A0 (~220KB, ~25 functions)
LLVM source | llvm/lib/Analysis/LazyCallGraph.cpp
CGSCC pass manager | sub_1A62BF0 (the InlinerWrapper/standard pipeline factory)
CGSCC pipeline parser | sub_2377300 (103KB)
CGSCC-to-function adaptor | sub_2362FB0 (6.7KB)
New PM CGSCC inliner | sub_2613930 (69KB)
NVIDIA custom inliner | sub_1864060 (75KB, the old CGSCC SCC-walk inliner)
Inliner core loop | sub_186CA00 (61KB, Inliner::inlineCallsImpl)
DevirtSCCRepeatedPass | sub_2284BC0 (16KB, "Max devirtualization iterations reached")
SCC object size | 136 bytes (0x88)
Edge encoding | Pointer with tag bits: bit 2 = call edge, bit 2 clear = ref edge
DenseMap hash | hash(ptr) = (ptr >> 4) ^ (ptr >> 9), bucket size = 16 bytes
DenseMap sentinels | Empty = 0xFFFFFFFFFFFFF000, Tombstone = 0xFFFFFFFFFFFFE000
CGSCC invocations per O1/O2/O3 | 4 passes of sub_1A62BF0(1,...), 1 iteration each
CGSCC invocations at tier 3 | sub_1A62BF0(5,...) -- 5 iterations
BumpPtrAllocator | [LCG+0x150] cursor, [LCG+0x158] slab end

Lazy Call Graph Construction

The graph is not built all at once. When the CGSCC pass manager begins, the LCG starts with just the module's externally visible functions and kernel entry points as root nodes. Each node's edges are populated only when first visited by the SCC traversal -- the Node::populateSlow() method (sub_D23BF0 returns the edge iterator range) scans all instructions in the function, recording two kinds of edges:

Call edges (bit 2 set in pointer tag): direct CallBase instructions whose callee resolves to a defined function. These form the strong connectivity that defines SCCs.

Ref edges (bit 2 clear): any other reference to a defined function -- a function pointer stored in a global, passed as a callback argument, taken address of. These contribute to RefSCC grouping but do not create call-graph cycles.

Node layout (deduced from binary):
  +0x00: Function*          (LLVM IR function)
  +0x08: Edge array pointer  (populated lazily)
  +0x10: Edge count / DFSNumber (int32, -1 = completed)
  +0x14: LowLink             (int32, repurposed as SCC index after Tarjan)
  +0x18: Callee edge list    (second array for call edges)
  +0x20: Callee edge count

Edge encoding (single qword):
  Bits 63..3: pointer to target Node
  Bit 2:      1 = call edge, 0 = ref edge
  Bits 1..0:  reserved (alignment)

Population is the only lazy step. Once a node is populated, its edges are cached. Subsequent visits reuse the cached edge list at [node+0x08]. The scan checks [rsi] != 0 to skip unresolvable edges (declaration-only functions with no body).

For a reimplementation: scan every instruction in the function. For each CallBase, if the callee is a defined function, add a call edge. Then walk all non-call operands recursively through constants (including BlockAddress, GlobalAlias, ConstantExpr) collecting any additional function references as ref edges. This matches upstream populateSlow() exactly.
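The tagged-pointer edge encoding above can be sketched in a few lines. This is a minimal illustration, assuming nodes are 8-byte aligned (which is what frees the three low bits for tags); `encode_edge` and `decode_edge` are hypothetical names, not symbols from the binary.

```rust
// Bit 2 distinguishes call edges from ref edges; bits 1..0 are free
// because Node objects are at least 8-byte aligned.
const CALL_BIT: u64 = 0x4;
const PTR_MASK: u64 = !0x7; // clear the three low tag bits

/// Pack a node address and edge kind into one qword.
fn encode_edge(node_addr: u64, is_call: bool) -> u64 {
    debug_assert_eq!(node_addr & 0x7, 0, "nodes are 8-byte aligned");
    node_addr | if is_call { CALL_BIT } else { 0 }
}

/// Recover (node address, is_call) from a packed edge qword.
fn decode_edge(raw: u64) -> (u64, bool) {
    (raw & PTR_MASK, raw & CALL_BIT != 0)
}

fn main() {
    let call = encode_edge(0x7f00_0000_1000, true);
    assert_eq!(decode_edge(call), (0x7f00_0000_1000, true));
    // Demoting a call edge to a ref edge is a single bit clear,
    // as in switchInternalEdgeToRef's Phase 0:
    assert_eq!(decode_edge(call & !CALL_BIT), (0x7f00_0000_1000, false));
}
```

The same masking appears in the Tarjan inner loop, where `edge_raw & 0xFFFFFFFFFFFFFFF8` recovers the target node and bit 2 selects whether the edge participates in SCC computation.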

SCC and RefSCC: The Two-Level Hierarchy

The LCG maintains a two-level SCC decomposition:

  1. SCC (Call SCC): a maximal set of functions connected by call edges such that every function is reachable from every other through calls. This is the unit of work for the CGSCC pass manager.

  2. RefSCC (Reference SCC): a maximal set of SCCs connected by ref edges. A RefSCC contains one or more SCCs. SCCs within a RefSCC can reference each other (e.g., mutually store each other's function pointers) but do not necessarily call each other.
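The two-level decomposition can be illustrated on a toy three-function graph. This sketch is purely illustrative -- the real LCG computes both levels with Tarjan's algorithm plus incremental updates, not pairwise reachability -- and `reachable`/`components` are hypothetical helper names.

```rust
// Toy graph: f0 <-> f1 are call edges (a genuine call cycle),
// f2 -> f0 is a call edge, and f0 -> f2 is only a ref edge
// (f0 stores f2's address but never calls it).

fn reachable(adj: &[Vec<usize>], from: usize, to: usize) -> bool {
    let mut seen = vec![false; adj.len()];
    let mut stack = vec![from];
    while let Some(n) = stack.pop() {
        if n == to { return true; }
        if seen[n] { continue; }
        seen[n] = true;
        stack.extend(&adj[n]);
    }
    false
}

/// Group nodes into components of pairwise mutual reachability
/// (quadratic, fine for an illustration). Returns a component id per node.
fn components(adj: &[Vec<usize>]) -> Vec<usize> {
    let n = adj.len();
    let mut id = vec![usize::MAX; n];
    let mut next = 0;
    for i in 0..n {
        if id[i] != usize::MAX { continue; }
        id[i] = next;
        for j in i + 1..n {
            if id[j] == usize::MAX && reachable(adj, i, j) && reachable(adj, j, i) {
                id[j] = next;
            }
        }
        next += 1;
    }
    id
}

fn main() {
    let call_edges = vec![vec![1], vec![0], vec![0]];    // f0<->f1, f2->f0
    let all_edges  = vec![vec![1, 2], vec![0], vec![0]]; // + ref edge f0->f2

    // Call-SCCs: only call edges count, so {f0, f1} and {f2} are separate.
    assert_eq!(components(&call_edges), vec![0, 0, 1]);
    // RefSCC: over call + ref edges all three are mutually reachable,
    // so both call-SCCs belong to a single RefSCC.
    assert_eq!(components(&all_edges), vec![0, 0, 0]);
}
```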

RefSCC layout (from [r15] in sub_D25FD0):
  +0x00: LazyCallGraph*     (parent graph)
  +0x08: SCC array pointer   (SmallVector data)
  +0x10: SCC array size
  +0x14: SCC array capacity
  +0x38: DenseMap #1         (SCC* -> index)
         +0x38: qword - bucket base pointer (or inline start)
         +0x40: byte  - flags (bit 0 = active map selector)
         +0x44: dword - tombstone count / generation
  +0x48: DenseMap #2         (alternate map for lazy rehashing)
         +0x48: qword - bucket base pointer
         +0x50: dword - bucket count

SCC layout (136 bytes = 0x88):
  +0x00: qword - parent pointer / metadata
  +0x08: qword - node member array pointer
  +0x10: dword - member count
  +0x14: dword - capacity
  +0x18: Edge list / callee info
  +0x38: DenseMap - node-to-index or similar

The bottom-up SCC ordering is computed using Tarjan's algorithm, implemented in sub_D2C610. The algorithm uses the standard DFS stack with 24-byte entries ({Node*, EdgeIter, EdgeEnd}) and the classic DFSNumber / LowLink fields at node offsets +0x10 and +0x14. When LowLink == DFSNumber, the node is an SCC root -- all nodes above it on the DFS result stack are popped into a new SCC, their DFSNumber set to -1 (completed), and the SCC index written into the LowLink field for reuse.

The Tarjan inner loop at 0xD2CD90--0xD2CEA4 and the SCC member popping at 0xD2CF61--0xD2CFD0 are both 4x unrolled, indicating these are hot paths in the CGSCC pipeline.

Tarjan's SCC Algorithm: Binary-Level Pseudocode

Complexity. Tarjan's SCC algorithm is O(V + E), where V is the number of nodes (functions) and E the number of call edges among them. The 4x-unrolled inner loop is a constant-factor optimization, not an algorithmic change. Breaking the costs down per operation:

  • Initial buildSCCs (sub_D2BEB0): one Tarjan run over the entire call graph, O(V_total + E_total).
  • Incremental switchInternalEdgeToRef: Tarjan only over the affected SCC's members, O(V_scc + E_scc) -- typically O(1), since most GPU SCCs contain a single function.
  • switchInternalEdgeToCall: O(V_scc + E_scc) for the same-SCC fast path (bit flip only), or O(M * V_scc) for the slow merge path, where M = number of SCCs being merged.
  • switchOutgoingEdgeToCall/Ref (sub_D27A10, 29KB): O(R * S), where R = number of RefSCCs involved and S = total SCCs in those RefSCCs.
  • DenseMap operations throughout: (ptr >> 4) ^ (ptr >> 9) hashing with O(1) amortized insert/lookup.
  • Graph verification (sub_D29180): O(V + E) for the entire graph.
  • CGSCC pass manager outer loop: each SCC once in post-order, re-visited at most max_devirt_iterations times (default 1, tier 3: 5), giving O(max_iter * V) passes over the SCCs.

The Tarjan implementation lives inside sub_D2C610 (switchInternalEdgeToRef) at address range 0xD2CC66--0xD2D0BC. It recomputes SCCs within a single RefSCC after a call edge is demoted to a ref edge, which may split the original SCC into multiple smaller SCCs.

The following pseudocode is reconstructed directly from the binary. Every variable name corresponds to a register or stack slot; every offset corresponds to a binary address.

// Address: 0xD2CC66 -- 0xD2D0BC (inside sub_D2C610)
// Input:  RefSCC containing one SCC whose internal call-edge structure changed
// Output: zero or more new SCCs replacing the original

struct StackEntry {           // 24 bytes (0x18)
    Node*       node;         // +0x00
    Edge*       edge_iter;    // +0x08
    Edge*       edge_end;     // +0x10
};

fn tarjan_recompute_scc(old_scc: &SCC, allocator: &BumpPtrAllocator) -> Vec<SCC> {
    // --- Phase 0: Initialize ---
    let mut dfs_counter: i32 = 1;                       // r13d, starts at 1
    let mut worklist: SmallVector<StackEntry, 4>;        // [rbp-0xA0], 24-byte entries
    let mut result_stack: SmallVector<*Node, 8>;         // [rbp-0x120]
    let mut new_scc_count: i32 = 0;                      // r14d, incremented per SCC found

    // --- Phase 1: Push all nodes of old_scc as unvisited roots ---
    for node in old_scc.members() {
        node.DFSNumber = 0;     // [node+0x10] = 0  (unvisited marker)
        node.LowLink   = 0;     // [node+0x14] = 0
    }

    // --- Phase 2: Outer loop -- pick next unvisited root (0xD2CCF7) ---
    for root in old_scc.members() {
        if root.DFSNumber != 0 { continue; }            // already visited

        // Assign DFS number and LowLink to root
        root.DFSNumber = dfs_counter;                    // [rbx+0x10] = r12d
        root.LowLink   = dfs_counter;                    // [rbx+0x14] = r12d
        dfs_counter += 1;                                // r13d++

        // Lazy-populate edges if not yet done
        let (edge_begin, edge_end) = sub_D23BF0(&root.edge_list);  // 0xD2CD0E
        worklist.push(StackEntry { node: root, edge_iter: edge_begin, edge_end });

        // --- Phase 3: DFS inner loop (0xD2CD90 -- 0xD2CEA4, 4x unrolled) ---
        while let Some(top) = worklist.last_mut() {
            if top.edge_iter == top.edge_end {
                // All edges of current node exhausted -- backtrack
                let finished = top.node;
                worklist.pop();                          // 0xD2CE80

                // LowLink propagation to parent
                if let Some(parent) = worklist.last_mut() {
                    // 0xD2CDF5: min(parent.LowLink, finished.LowLink)
                    let child_low = finished.LowLink;    // [rbx+0x14]
                    if child_low >= 0 && child_low < parent.node.LowLink {
                        parent.node.LowLink = child_low; // [r15+0x14] = edx
                    }
                }

                // --- Phase 4: SCC root detection (0xD2CF01) ---
                if finished.DFSNumber == finished.LowLink {
                    // This node is an SCC root. Pop members from result_stack.
                    // (0xD2CF30 -- 0xD2CFD2, 4x unrolled)
                    let scc_dfs = finished.DFSNumber;    // [r15+0x10]
                    loop {
                        // Unrolled: processes 4 nodes per iteration
                        let member = result_stack.pop();
                        if member.DFSNumber < scc_dfs { break; }  // 0xD2CF61

                        member.DFSNumber = -1;           // 0xFFFFFFFF = completed
                        member.LowLink = new_scc_count;  // assign SCC index
                    }
                    // The root itself
                    finished.DFSNumber = -1;
                    finished.LowLink = new_scc_count;
                    new_scc_count += 1;                  // r14d++
                } else {
                    // Not a root -- push onto result stack for later popping
                    result_stack.push(finished);
                }
                continue;
            }

            // Advance to next edge
            let edge_raw = *top.edge_iter;               // load qword
            top.edge_iter += 1;                          // advance by 8

            let target_node = edge_raw & 0xFFFFFFFFFFFFFFF8;  // mask off tag bits
            let is_call     = (edge_raw & 0x4) != 0;          // bit 2 = call edge

            // Only follow CALL edges for SCC computation (ref edges ignored)
            if !is_call { continue; }
            if target_node == 0 { continue; }            // skip null targets

            let target_dfs = target_node.DFSNumber;      // [target+0x10]

            if target_dfs == 0 {
                // Unvisited: assign DFS number, push onto worklist
                target_node.DFSNumber = dfs_counter;     // 0xD2CD78
                target_node.LowLink   = dfs_counter;
                dfs_counter += 1;

                let (eb, ee) = sub_D23BF0(&target_node.edge_list);
                worklist.push(StackEntry { node: target_node, edge_iter: eb, edge_end: ee });

            } else if target_dfs == -1 {
                // Already in a completed SCC -- skip entirely
                continue;

            } else {
                // On the stack (tree/back edge): update LowLink
                // 0xD2CDF5: min(current.LowLink, target.DFSNumber)
                if target_dfs < top.node.LowLink {
                    top.node.LowLink = target_dfs;
                }
            }
        }
    }
}

Key binary details:

  • The DFS counter is split between r12d and r13d, alternating roles. In practice r13d holds the next available DFS number, starting at 2 (the root gets 1 via the 0x100000001 packed initialization at 0xD2CD0E).
  • The 4x-unrolled inner loop at 0xD2CD90 processes four edge entries per iteration before branching back, reducing loop overhead on this hot path.
  • The SCC member popping at 0xD2CF61--0xD2CFD0 is likewise 4x unrolled: it pops at offsets -8, -0x10, -0x18, -0x20 relative to the result stack top, then subtracts 0x20 from the stack pointer per iteration.
  • The completed marker -1 (0xFFFFFFFF) is written to [node+0x10] (DFSNumber), and the SCC identifier (the r14d counter) is written to [node+0x14] (LowLink). After Tarjan completes, the LowLink field holds the SCC index for every node -- the DFSNumber/LowLink fields are repurposed, not preserved.
  • Only call edges (bit 2 set) are followed during Tarjan. Ref edges (bit 2 clear) are skipped. This is what makes the SCC decomposition "call-SCC" rather than "reference-SCC."

Complexity: O(V + E) where V = nodes in the old SCC and E = call edges among those nodes. The 4x unrolling is a constant-factor optimization, not an algorithmic change.
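The reconstructed pseudocode can be exercised as runnable code. The following is a sketch under the assumption of plain Vec-backed storage rather than tagged pointers (so the is_call filtering already happened when `call_edges` was built); `Node` and `tarjan` are illustrative names, but the field semantics mirror the binary: dfs 0 = unvisited, -1 = completed, and low is repurposed as the SCC index after the run.

```rust
struct Node {
    call_edges: Vec<usize>, // indices of directly called, defined functions
    dfs: i32,               // DFSNumber analog of [node+0x10]
    low: i32,               // LowLink analog of [node+0x14]
}

/// Iterative Tarjan over call edges only; returns the number of SCCs found.
fn tarjan(nodes: &mut [Node]) -> i32 {
    let mut counter = 1i32;   // r13d analog: next DFS number
    let mut scc_count = 0i32; // r14d analog: SCCs found so far
    let mut result_stack: Vec<usize> = Vec::new();
    for root in 0..nodes.len() {
        if nodes[root].dfs != 0 { continue; }
        nodes[root].dfs = counter;
        nodes[root].low = counter;
        counter += 1;
        // StackEntry analog: (node index, next edge index)
        let mut worklist: Vec<(usize, usize)> = vec![(root, 0)];
        while let Some(&(n, ei)) = worklist.last() {
            if ei == nodes[n].call_edges.len() {
                worklist.pop(); // edges exhausted: backtrack
                if let Some(&(parent, _)) = worklist.last() {
                    let child_low = nodes[n].low; // LowLink propagation
                    if child_low < nodes[parent].low {
                        nodes[parent].low = child_low;
                    }
                }
                if nodes[n].dfs == nodes[n].low {
                    // SCC root: pop members with deeper DFS numbers.
                    while let Some(&m) = result_stack.last() {
                        if nodes[m].dfs < nodes[n].dfs { break; }
                        result_stack.pop();
                        nodes[m].dfs = -1;        // completed marker
                        nodes[m].low = scc_count; // repurposed as SCC index
                    }
                    nodes[n].dfs = -1;
                    nodes[n].low = scc_count;
                    scc_count += 1;
                } else {
                    result_stack.push(n);
                }
                continue;
            }
            worklist.last_mut().unwrap().1 = ei + 1; // advance edge iterator
            let t = nodes[n].call_edges[ei];
            let tdfs = nodes[t].dfs;
            if tdfs == 0 {
                nodes[t].dfs = counter; // unvisited: assign DFS number, descend
                nodes[t].low = counter;
                counter += 1;
                worklist.push((t, 0));
            } else if tdfs != -1 && tdfs < nodes[n].low {
                nodes[n].low = tdfs;    // on-stack target: update LowLink
            }
        }
    }
    scc_count
}

fn main() {
    let mut nodes = vec![
        Node { call_edges: vec![1], dfs: 0, low: 0 },    // f0 calls f1
        Node { call_edges: vec![0, 2], dfs: 0, low: 0 }, // f1 calls f0 and f2
        Node { call_edges: vec![], dfs: 0, low: 0 },     // f2: leaf
    ];
    assert_eq!(tarjan(&mut nodes), 2);
    // Bottom-up: the leaf f2 gets SCC 0, the {f0, f1} cycle gets SCC 1.
    assert_eq!(nodes.iter().map(|x| x.low).collect::<Vec<_>>(), vec![1, 1, 0]);
    assert!(nodes.iter().all(|x| x.dfs == -1));
}
```

Note that LowLink propagation happens before the root check, so it always reads the child's DFS-based LowLink, never an already-converted SCC index.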

Incremental SCC Mutation Operations

When a pass modifies the call graph, the SCC structure must be updated without recomputing the entire graph. The LCG provides six mutation operations, each handling a specific kind of edge change. The two most complex are switchInternalEdgeToCall and switchInternalEdgeToRef; the others handle cross-RefSCC edges and bulk operations.

switchInternalEdgeToCall -- sub_D25FD0 (5,526 bytes)

Called when a ref edge within the same RefSCC becomes a call edge (the inliner or devirtualization resolves an indirect call to a direct call). This may merge previously separate SCCs into one.

// Address: 0xD25FD0 -- 0xD27566
// Signature (deduced):
//   RefSCC::switchInternalEdgeToCall(
//       Node& SourceN,             // rsi
//       Node& TargetN,             // rdx
//       function_ref<void(ArrayRef<SCC*>)> MergeCB  // rcx (nullable), r8 (data)
//   ) -> bool

fn switchInternalEdgeToCall(source: &Node, target: &Node, merge_cb: Option<Fn>) -> bool {
    let source_scc = sub_D23C40(lcg, source);   // lookupSCC at 0xD26025
    let target_scc = sub_D23C40(lcg, target);   // lookupSCC at 0xD2604E

    // FAST PATH 1: Same SCC -- edge type flip only, no structural change
    if source_scc == target_scc {                // 0xD26B5B
        // Mark the edge as a call edge (flip bit 2) via sub_D23E00
        return false;  // no SCC change
    }

    // Look up SCC indices within the RefSCC's ordered list
    let source_idx = sub_D25BD0(refscc.map, source_scc);  // 0xD26055
    let target_idx = sub_D25BD0(refscc.map, target_scc);  // 0xD260A0

    // FAST PATH 2: Source already appears after target in post-order
    // (the new call edge doesn't create a cycle in the SCC DAG)
    if source_idx > target_idx {                 // 0xD260B4
        // Mark edge as call, no SCC restructuring needed
        return false;
    }

    // SLOW PATH: The new call edge creates a cycle between SCCs.
    // Must merge all SCCs in the range [target_idx .. source_idx].

    // Phase A: DFS reachability within the RefSCC (0xD26C92 -- 0xD26DAB)
    // Walk call edges from target, collecting all SCCs reachable
    // back to source. Uses SmallVector worklist (cap 4) and
    // DenseMap visited set at [r15+0x48].
    let mut merge_set: SmallVector<SCC*, 4>;
    let mut visited: DenseSet<SCC*>;
    // ... DFS marks all SCCs on the cycle ...

    // Phase B: Merge SCCs (0xD26335 -- 0xD263E1)
    let merge_range = &refscc.scc_array[target_idx..=source_idx];
    let merge_count = merge_range.len();

    // Allocate temp buffer for std::rotate
    let tmp = sub_2207800(merge_count * 8);      // operator new
    // sub_D23910 rotates the SCC array to consolidate merged entries
    sub_D23910(refscc.scc_array, target_idx, source_idx);

    // Move all nodes from secondary SCCs into the primary SCC
    for scc in &merge_range[1..] {
        primary_scc.members.extend(scc.members);
        scc.members.clear();
    }

    // Update the SCC-to-index DenseMap with double-buffered rehashing
    // Toggle flags byte at [RefSCC+0x40], tombstone old entries,
    // insert new entries into the alternate map via sub_D24C50

    // Phase C: Invoke merge callback (0xD26480)
    if let Some(cb) = merge_cb {
        cb(ArrayRef { ptr: merge_range.as_ptr(), len: merge_count });
    }

    // Phase D: Reindex remaining SCCs (0xD267A2)
    for scc in &refscc.scc_array[target_idx + 1..] {
        scc_index_map[scc] -= merge_count - 1;  // "sub [rax], ebx" at 0xD267B9
    }

    // Notify the graph of structural change
    sub_D23D60(lcg, 1);                          // notifyRefSCCChange

    return true;  // SCC structure changed
}

Allocation fallback: The temporary buffer allocation at 0xD27447 has a halving fallback (sar rbx, 1): if operator new fails for the full size, it retries with half the size. This handles the case where the merge set is unexpectedly large.

DenseMap double-buffering: The RefSCC maintains two DenseMaps at offsets +0x38 and +0x48. The flags byte at +0x40 (bit 0) selects which map is "current." When entries are migrated during SCC merging, old entries are tombstoned (0xFFFFFFFFFFFFE000) in the departing map and inserted fresh into the other map via sub_D24C50. This avoids a full rehash on every merge -- the tombstone count at +0x44 is incremented, and the map is only rehashed (via sub_D25CB0) when the tombstone ratio crosses a threshold.
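The tombstone scheme can be sketched with a minimal open-addressing map using the sentinel keys and the pointer hash recovered from the binary. This is an illustration only -- it uses linear probing and omits the double-buffered rehash; `PtrMap` and its methods are hypothetical names, not the binary's layout.

```rust
// Reserved key values mark empty and tombstoned buckets, matching
// the sentinels recovered from the binary.
const EMPTY: u64 = 0xFFFF_FFFF_FFFF_F000;
const TOMBSTONE: u64 = 0xFFFF_FFFF_FFFF_E000;

struct PtrMap { buckets: Vec<(u64, u32)> } // (key, value), power-of-two size

impl PtrMap {
    fn new(cap: usize) -> Self {
        assert!(cap.is_power_of_two());
        PtrMap { buckets: vec![(EMPTY, 0); cap] }
    }
    /// The binary's pointer hash: (ptr >> 4) ^ (ptr >> 9).
    fn hash(p: u64) -> u64 { (p >> 4) ^ (p >> 9) }

    fn insert(&mut self, key: u64, val: u32) {
        let mask = self.buckets.len() as u64 - 1;
        let mut i = (Self::hash(key) & mask) as usize;
        loop {
            let k = self.buckets[i].0;
            if k == EMPTY || k == TOMBSTONE || k == key {
                self.buckets[i] = (key, val);
                return;
            }
            i = (i + 1) & mask as usize; // linear probe
        }
    }
    fn get(&self, key: u64) -> Option<u32> {
        let mask = self.buckets.len() as u64 - 1;
        let mut i = (Self::hash(key) & mask) as usize;
        loop {
            match self.buckets[i].0 {
                k if k == key => return Some(self.buckets[i].1),
                k if k == EMPTY => return None, // tombstones keep probing
                _ => i = (i + 1) & mask as usize,
            }
        }
    }
    fn remove(&mut self, key: u64) {
        let mask = self.buckets.len() as u64 - 1;
        let mut i = (Self::hash(key) & mask) as usize;
        while self.buckets[i].0 != EMPTY {
            if self.buckets[i].0 == key {
                self.buckets[i].0 = TOMBSTONE; // keep probe chains intact
                return;
            }
            i = (i + 1) & mask as usize;
        }
    }
}

fn main() {
    let mut m = PtrMap::new(16);
    m.insert(0x1000, 7);
    m.insert(0x2000, 9);
    m.remove(0x1000);
    assert_eq!(m.get(0x1000), None);
    assert_eq!(m.get(0x2000), Some(9));
}
```

The key design point tombstones buy: deletion never breaks a probe chain, so lookups of other keys keep working; the cost is that tombstones accumulate until a rehash, which is why the binary tracks a tombstone count at +0x44.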

switchInternalEdgeToRef -- sub_D2C610 (5,236 bytes)

Called when a call edge within a RefSCC is demoted to a ref edge (a direct call is deleted or replaced with an indirect reference). This may split a single SCC into multiple smaller SCCs.

// Address: 0xD2C610 -- 0xD2DA84
// Signature (deduced):
//   RefSCC::switchInternalEdgeToRef(
//       RefSCC& Result,                   // rdi (output -- new RefSCC or self)
//       ArrayRef<pair<Node*, Node*>> Pairs // rdx (edge mutations), rcx (byte count)
//   ) -> RefSCC&

fn switchInternalEdgeToRef(pairs: &[(Node, Node)]) -> Vec<SCC> {
    // Phase 0: Flip all edge types from call to ref (0xD2C6A2)
    for (source, target) in pairs {
        sub_D23E00(&source.edge_list, target);   // clear bit 2 in edge pointer
    }

    // Phase 1: Check which pairs actually cross SCC boundaries (0xD2C6A2 -- 0xD2CA2B)
    // Processes pairs 4 at a time (4x unrolled loop).
    // For each pair: DenseMap lookup of source's SCC and target's SCC.
    // If same SCC: the call-to-ref demotion might break the SCC.
    // If different SCCs: no structural impact (they were already separated).
    let mut needs_recompute = false;
    for (source, target) in pairs {     // 4x unrolled at 0xD2C6D0
        let src_scc = densemap_lookup(source);
        let tgt_scc = densemap_lookup(target);
        if src_scc == tgt_scc {
            needs_recompute = true;
        }
    }

    if !needs_recompute { return vec![old_scc]; }

    // Phase 2: Run Tarjan's algorithm on the affected SCC (0xD2CC66 -- 0xD2D0BC)
    // (See "Tarjan's SCC Algorithm" section above for full pseudocode.)
    let new_sccs = tarjan_recompute_scc(old_scc, &lcg.allocator);

    if new_sccs.len() == 1 {
        // The SCC survived intact -- no split occurred
        return vec![old_scc];
    }

    // Phase 3: Allocate new SCC objects (0xD2D0BC -- 0xD2D12E)
    for i in 1..new_sccs.len() {
        // BumpPtrAllocator at [LCG+0x150]:
        let cursor = lcg.alloc_cursor;           // [r12+0x150]
        let aligned = (cursor + 7) & !7;         // align to 8
        let new_end = aligned + 0x88;            // 0x88 = 136 bytes per SCC
        if new_end > lcg.alloc_slab_end {        // [r12+0x158]
            sub_9D1E70(allocator, 0x88, 8);      // slow path: allocate new slab
        }
        lcg.alloc_cursor = new_end;
        let scc = aligned as *mut SCC;
        sub_D23F30(scc, lcg);                    // SCC constructor
    }

    // Phase 4: Distribute nodes among new SCCs (0xD2D1F2 -- 0xD2D309)
    // Each node's LowLink field (set by Tarjan to its SCC index) determines
    // which new SCC it belongs to.
    for node in old_scc.members() {
        let scc_idx = node.LowLink;              // [node+0x14]
        new_sccs[scc_idx].members.push(node);
    }

    // Phase 5: Update ownership maps (0xD2D168 -- 0xD2D1DC)
    // Register new SCCs in the RefSCC's SCC list via sub_D248B0
    for scc in &new_sccs[1..] {
        sub_D248B0(lcg, refscc, scc);            // insertRefSCC
    }
    // Update Node -> SCC DenseMap entries
    // Update SCC -> RefSCC back-pointers via sub_D27750

    // Phase 6: Clean up old SCC (0xD2D3D6 -- 0xD2D49A)
    // Reset all DFS/LowLink fields to -1 (completed state)
    // Zero out old SCC's member list
    // Clear old SCC's internal DenseMap via sub_D24EE0

    return new_sccs;
}

Batch processing optimization: The pair-processing loop at 0xD2C6A2 is 4x unrolled: it processes four (Node*, Node*) pairs per iteration, with explicit remainder handling (1, 2, or 3 leftover pairs) at 0xD2CA2B. Each pair occupies 16 bytes (0x10), so the loop advances by 64 bytes per iteration.

SCC object allocation: New SCC objects (136 bytes each) are allocated from the LCG's BumpPtrAllocator at [LCG+0x150]. The allocator maintains a cursor/end pair for the current slab. When the slab is exhausted, sub_9D1E70 allocates a new slab (the slow path). The alignment requirement is 8 bytes, enforced by the (cursor + 7) & ~7 round-up at 0xD2D0F0.
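The bump-allocation fast path can be sketched as follows, assuming the cursor/end pair at [LCG+0x150]/[LCG+0x158]; `BumpAlloc` is an illustrative name, and the slab-refill slow path (sub_9D1E70) is reduced here to returning None.

```rust
struct BumpAlloc { slab: Vec<u8>, cursor: usize, end: usize }

impl BumpAlloc {
    fn new(slab_size: usize) -> Self {
        let slab = vec![0u8; slab_size];
        let base = slab.as_ptr() as usize;
        BumpAlloc { slab, cursor: base, end: base + slab_size }
    }
    /// Fast path only: round the cursor up to `align`, then bump by `size`.
    /// For align = 8 the round-up is exactly the binary's (cursor + 7) & ~7.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        let aligned = (self.cursor + align - 1) & !(align - 1);
        let new_end = aligned + size;
        if new_end > self.end {
            return None; // slow path would allocate a new slab here
        }
        self.cursor = new_end;
        Some(aligned)
    }
}

fn main() {
    let mut a = BumpAlloc::new(4096);
    let first = a.alloc(0x88, 8).unwrap();  // one 136-byte SCC object
    let second = a.alloc(0x88, 8).unwrap();
    assert_eq!(first % 8, 0);          // round-up enforced 8-byte alignment
    assert_eq!(second - first, 0x88);  // 0x88 is already 8-aligned, no padding
}
```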

switchOutgoingEdgeToCall / switchOutgoingEdgeToRef -- sub_D27A10 (29,179 bytes)

Handles edges that cross RefSCC boundaries. When a ref edge from one RefSCC to another becomes a call edge (or vice versa), the RefSCC structure may need updating. If the new call edge creates a cycle between previously separate RefSCCs, they merge into one. This is the RefSCC-level analog of switchInternalEdgeToCall. The function at sub_D27A10 is 29KB -- the largest single function in the LCG cluster -- because it must handle both directions (to-call and to-ref) and the full RefSCC merge/split logic.

insertInternalRefEdge -- sub_D2A080 (15,253 bytes)

Adds a new ref edge within a RefSCC. Called when optimization introduces a new reference between functions that are already in the same RefSCC (e.g., a new constant expression referencing a sibling function). This does not affect SCC structure (only call edges define SCCs), but it updates the RefSCC's internal edge tracking.

computeRefSCC -- sub_D2AD40 (12,495 bytes)

Computes the RefSCC decomposition from scratch for a set of nodes. Used during initial graph construction (sub_D2BEB0) and when incremental updates are insufficient (e.g., after bulk edge insertion). This runs a second level of Tarjan's algorithm over the ref-edge graph, grouping SCCs into RefSCCs.

mergeRefSCC -- sub_D2DA90 (17,930 bytes)

Merges two or more RefSCCs into one. Called when a new ref edge or promoted call edge connects previously separate RefSCCs that are now mutually reachable. This involves relocating all SCCs from the source RefSCC into the target, updating the graph's RefSCC list at [LCG+0x240], and fixing all back-pointers.

CGSCC Pass Manager: Bottom-Up Interprocedural Optimization

The CGSCC pass manager (sub_1A62BF0) wraps the LCG traversal and runs a pipeline of CGSCC passes over each SCC in bottom-up (post-order) order. The pass manager is invoked multiple times at different points in the optimization pipeline, controlled by a pipelineID parameter.

In the O1/O2/O3 pipeline, it is invoked four times, each with 1 devirtualization iteration:

sub_1A62BF0(1,0,0,1,0,0,1)  -- pass #2  (inliner framework, early)
sub_1A62BF0(1,0,0,1,0,0,1)  -- pass #17 (after DSE/GVN/MemCpyOpt)
sub_1A62BF0(1,0,0,1,0,0,1)  -- pass #21 (after ADCE/JumpThreading)
sub_1A62BF0(1,0,0,1,0,0,1)  -- pass #38 (late, after Sink)

At higher tier levels (tier 3 aggressive optimization), a 5-iteration variant appears: sub_1A62BF0(5,0,0,1,0,0,1). The first parameter controls the maximum number of SCC re-visitation iterations when the call graph is mutated during optimization.

The pipeline IDs observed across all optimization levels are 1, 2, 4, 5, 7, and 8, likely corresponding to LLVM's PassBuilder extension points:

Pipeline ID  Extension point                      Notes
1            EP_EarlyAsPossible / basic cleanup   Most common, 4x per O2
2            EP_LoopOptimizerEnd                  --
4            EP_ScalarOptimizerLate               Sometimes with optFlag=1
5            EP_VectorizerStart                   Used at tier 3 (5 iterations)
7            EP_OptimizerLast                     --
8            EP_CGSCCOptimizerLate                With optFlag=1 for inlining

The CGSCC Pass Manager Run Loop

The pass manager's run loop implements the DevirtSCCRepeatedPass pattern. For each SCC in post-order:

fn run_cgscc_pipeline(module: &Module, lcg: &mut LazyCallGraph, max_devirt_iterations: u32) {
    // Build initial SCC post-order via sub_D2BEB0 (buildSCCs)
    let post_order = lcg.build_sccs();           // sub_D2BEB0, 10KB

    for refscc in post_order.bottom_up() {       // sub_D2F8A0 / sub_D30800
        for scc in refscc.sccs() {               // sub_D2E510, 7KB
            let mut iteration = 0;
            let mut changed = true;

            while changed && iteration < max_devirt_iterations {
                changed = false;
                iteration += 1;

                // Run each registered CGSCC pass on this SCC
                for pass in &cgscc_pipeline {
                    let result = pass.run(scc, lcg);

                    if result.invalidated_call_graph {
                        // The pass mutated the call graph.
                        // Update SCC structure via switchInternal* operations.
                        // If SCCs were merged or split, re-queue affected SCCs.
                        changed = true;
                    }

                    // Run the CGSCC-to-function adaptor (sub_2362FB0)
                    // to apply function-level passes to newly modified functions
                    if result.invalidated_functions {
                        for func in scc.functions() {
                            run_function_pipeline(func);
                        }
                    }
                }
            }

            if iteration >= max_devirt_iterations && changed {
                // sub_2284BC0: "Max devirtualization iterations reached"
                // Controlled by abort-on-max-devirt-iterations-reached knob
            }
        }
    }
}

Iteration semantics: The max_devirt_iterations parameter (argument 1 to sub_1A62BF0) controls how many times the pass manager will re-run the CGSCC pipeline on an SCC after the call graph mutates. At O1/O2/O3, this is 1 (single pass, no re-visitation). At tier 3, this is 5 (up to 5 re-runs if devirtualization keeps revealing new direct calls). The devirt iteration check at sub_2284BC0 emits "Max devirtualization iterations reached" when the limit is hit and the graph is still changing.

CGSCC-to-Function Adaptor -- sub_2362FB0 (6,700 bytes)

The adaptor at sub_2362FB0 wraps a function-level pass for execution inside the CGSCC framework. When the inliner inlines a callee, the callee's body is absorbed into the caller. The caller must then be re-optimized with function-level passes (SimplifyCFG, InstCombine, etc.) before the next CGSCC pass runs. The adaptor handles this by running the function pipeline on each function in the current SCC after each CGSCC pass that reports a change.

The adaptor constructor at sub_230AC20 (5.4KB) creates the module-to-function or CGSCC-to-function wrappers. The adaptor itself stores the inner pass pipeline as a nested FunctionPassManager and forwards run() calls to each function in the SCC.

Registered CGSCC Passes

The registered CGSCC passes (from the pipeline parser at sub_2377300):

Pass name               Address/factory   Purpose
inline                  sub_2613930       New PM CGSCC inliner (69KB)
argpromotion            sub_2500970       Promote pointer args to by-value
attributor-cgscc        sub_2582AC0       CGSCC attribute deduction (39KB)
attributor-light-cgscc  --                Lightweight variant
function-attrs          sub_1841180       Infer readonly, nounwind, etc.
openmp-opt-cgscc        --                OpenMP kernel optimization
coro-annotation-elide   --                Coroutine elision
coro-split              --                Coroutine splitting
nv-early-inliner        via sub_2342850   NVIDIA early inliner (wraps InlinerWrapper)

CGSCC analyses (3 registered):

Analysis name         Purpose
no-op-cgscc           No-op analysis (placeholder)
fam-proxy             FunctionAnalysisManagerCGSCCProxy -- bridges function-level analyses into CGSCC
pass-instrumentation  Pass instrumentation callbacks (via sub_2342830)

How the CGSCC Inliner Uses the Call Graph

The inliner is the most important consumer of the LazyCallGraph. The New PM inliner at sub_2613930 (69KB) and the NVIDIA custom inliner at sub_1864060 (75KB) both interact with the LCG through a specific protocol.

The core inlining loop (implemented at sub_186CA00, 61KB, Inliner::inlineCallsImpl) runs within the CGSCC framework:

fn inline_calls_in_scc(scc: &mut SCC, lcg: &mut LazyCallGraph) {
    // Collect all call sites in the SCC
    let mut worklist: Vec<CallSite> = collect_call_sites(scc);

    for callsite in &worklist {
        let callee = callsite.callee();
        let caller = callsite.caller();

        // Compute inline cost
        let cost = compute_inline_cost(callee, caller);  // sub_1864060

        // Decision: inline if cost < threshold
        // (emits optimization remarks: "Inlined", "NotInlined", "AlwaysInline",
        //  "NeverInline", "TooBig", etc.)
        if should_inline(cost) {
            // Perform inlining transformation
            inline_function(callsite);

            // CRITICAL: Update the call graph after inlining.
            // The callee's body is now in the caller. New call edges
            // may have appeared (callee's callees are now caller's callees).
            // Old edges may have disappeared (the call to callee is gone).

            // For each new direct call discovered in the inlined body:
            //   lcg.switchInternalEdgeToCall(caller_node, new_callee_node)
            //     -> may merge SCCs, triggering re-visitation

            // For the removed call edge (caller -> callee):
            //   lcg.switchInternalEdgeToRef(caller_node, callee_node)
            //     -> may split SCCs, triggering re-visitation
            //   (or removeEdge entirely if callee has no other references)

            // Run function-level cleanup on the caller
            // via CGSCC-to-function adaptor (sub_2362FB0)
        }
    }
}

Call graph update protocol: After each inline transformation, the inliner must report all edge changes to the LazyCallGraph. The CGSCC pass manager provides an UpdateResult structure that the inliner fills in:

  1. New call edges: The inlined function body may contain direct calls that the caller did not previously have. Each creates a switchInternalEdgeToCall if target is in the same RefSCC, or switchOutgoingEdgeToCall (sub_D27A10) if target is in a different RefSCC.

  2. Removed call edges: The direct call from caller to callee is replaced by the inlined body. If the caller no longer references the callee at all, the edge is removed. If it still references the callee (e.g., another call site remains), the edge type may change.

  3. SCC merging: If the inlined body creates a new call cycle (e.g., A calls B, B's body contains a call to A), the affected SCCs merge. The merge callback re-queues the merged SCC for another pass of the CGSCC pipeline.

  4. SCC splitting: If removing the call edge from caller to callee breaks the only call-path cycle, the SCC splits. New SCCs are created and inserted into the post-order traversal at the correct position.
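The merging side of this protocol can be modeled minimally with union-find. This is a deliberate simplification -- the binary keeps ordered post-order SCC arrays plus DenseMaps, not union-find -- and `SccIds`, `switch_to_call`, and the `closes_cycle` flag (standing in for the post-order index comparison source_idx > target_idx) are all hypothetical.

```rust
struct SccIds { parent: Vec<usize> }

impl SccIds {
    fn new(n: usize) -> Self { SccIds { parent: (0..n).collect() } }

    /// Union-find lookup with path compression: the representative is the SCC id.
    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let r = self.find(self.parent[x]);
            self.parent[x] = r;
        }
        self.parent[x]
    }

    /// switchInternalEdgeToCall analog: returns true iff the SCC structure changed.
    fn switch_to_call(&mut self, src: usize, dst: usize, closes_cycle: bool) -> bool {
        let (a, b) = (self.find(src), self.find(dst));
        if a == b { return false; }        // fast path 1: same SCC, bit flip only
        if !closes_cycle { return false; } // fast path 2: DAG edge, post-order intact
        self.parent[a] = b;                // slow path: merge the SCC range
        true
    }
}

fn main() {
    // Inlining B into A reveals a direct call back to A: an A -> B -> A cycle.
    let mut sccs = SccIds::new(3); // functions A=0, B=1, C=2
    assert!(sccs.switch_to_call(0, 1, true));  // promotion closes a cycle: merge
    assert_eq!(sccs.find(0), sccs.find(1));    // A and B now share one SCC
    assert!(!sccs.switch_to_call(0, 1, true)); // already one SCC: no change
    assert_ne!(sccs.find(0), sccs.find(2));    // C untouched
}
```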

Initial Graph Construction: buildSCCs -- sub_D2BEB0 (9,782 bytes)

The initial call graph is built by sub_D2BEB0 when the CGSCC pass manager first runs. This function:

  1. Collects all module-level root functions (kernels, externally visible functions).
  2. For each root, lazily populates edges via sub_D23BF0.
  3. Runs Tarjan's algorithm to decompose the call graph into SCCs.
  4. Runs a second pass (sub_D2AD40, computeRefSCC) to group SCCs into RefSCCs based on ref edges.
  5. Stores the resulting post-order in the LCG's RefSCC list at [LCG+0x240].

The post-order traversal helpers (sub_D2F8A0 at 10KB, sub_D30800 at 8KB) implement the iterator that the CGSCC pass manager uses to walk RefSCCs and SCCs in bottom-up order. The SCC iteration logic at sub_D2E510 (7KB) handles advancing through SCCs within each RefSCC.

Graph Verification -- sub_D29180 (6,417 bytes)

The verifier at sub_D29180 checks the consistency of the entire LazyCallGraph after mutations. It validates:

  • Every node's SCC assignment is correct (no node belongs to the wrong SCC).
  • Every SCC's RefSCC assignment is correct.
  • Call edges connect nodes that are reachable via calls (SCC invariant).
  • Ref edges connect nodes within the same RefSCC.
  • The post-order is valid: for every call edge A -> B, B's SCC appears before A's SCC in the traversal order.
  • No dangling pointers (all edge targets are live nodes in the graph).

This verifier is expensive (O(V + E) for the whole graph) and is only enabled in debug builds or when explicitly requested.
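The post-order invariant in particular reduces to a one-line check. A minimal sketch, assuming functions are indexed and `scc_of` maps each function to its SCC's position in the bottom-up order; `post_order_valid` is a hypothetical name, not a recovered symbol.

```rust
/// For every call edge A -> B crossing SCCs, B's SCC must appear
/// before A's SCC in the bottom-up post-order.
fn post_order_valid(call_edges: &[(usize, usize)], scc_of: &[usize]) -> bool {
    call_edges
        .iter()
        .all(|&(a, b)| scc_of[a] == scc_of[b] || scc_of[b] < scc_of[a])
}

fn main() {
    // A kernel (f0) calls a helper (f1); the helper's singleton SCC
    // sits first in the post-order, the kernel's last.
    let scc_of = [1, 0];
    assert!(post_order_valid(&[(0, 1)], &scc_of));
    // A reversed edge would violate the invariant:
    assert!(!post_order_valid(&[(1, 0)], &scc_of));
}
```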

LazyCallGraph Data Structure Layout

LazyCallGraph (pointed to by [RefSCC+0]):
  +0x000: ...
  +0x130: DenseMap<Node*, SCC*>  (NodeToSCCMap)
           +0x130: qword - bucket count tracking
           +0x138: qword - bucket array pointer
           +0x140: dword - num entries
           +0x144: dword - num tombstones
           +0x148: dword - num buckets
  +0x150: BumpPtrAllocator
           +0x150: qword - current slab cursor
           +0x158: qword - current slab end
  +0x1A0: qword - total allocated bytes
  +0x1B0: SmallVector<SCC*> - SCC ownership list
           +0x1B0: qword - data pointer
           +0x1B8: dword - size
           +0x1BC: dword - capacity
  +0x240: SmallVector<RefSCC*> - RefSCC list (post-order)

GPU-Specific Call Graph Properties

The LCG implementation itself is GPU-agnostic, but the call graph shape on GPU differs fundamentally from CPU:

Kernels are roots. Functions annotated with nvvm.annotations kernel metadata are externally visible entry points. They are the roots of the call graph -- nothing calls a kernel (launches are host-side). In CGSCC ordering, kernels are processed last (they are the top of the bottom-up traversal).

Device functions are internal. Non-kernel __device__ functions are typically internal linkage. They appear in the call graph only as callees. This produces a characteristic tree-like (or DAG-like) call graph with very few cycles, meaning most SCCs contain a single function.

Recursion is rare. CUDA hardware historically did not support recursion (stack depth is bounded, and the compiler must statically allocate the call stack). Although modern architectures permit limited recursion, real-world CUDA code almost never uses it. This means SCC merging (switchInternalEdgeToCall) is rarely triggered -- most CGSCC processing is trivially single-function SCCs in a DAG.

Aggressive inlining collapses the graph. The NVIDIA inline budget (default 20,000, vs LLVM's 225) causes most device functions to be inlined into their callers. After the early inliner pass, the remaining call graph is typically flat: a handful of kernels with large bodies and very few un-inlined callees. Later CGSCC invocations mostly iterate over single-function SCCs.

ThinLTO Interaction

When ThinLTO imports functions from other modules, they appear in the call graph as available_externally definitions. The LCG treats them like any other defined function -- they get nodes, their edges are lazily populated, and they participate in SCC computation. The NVModuleSummary builder (sub_12E06D0) records call graph edges in the module summary, which the ThinLTO import pass uses to decide which cross-module functions to import. Once imported, those functions become candidates for inlining during the CGSCC traversal.

The function-inline-cost-multiplier knob (visible in sub_2613930's string table) penalizes recursive functions during ThinLTO inlining, since recursive inlining can explode code size without bound.

Knobs and Thresholds

Knob | Default | Effect
inline-budget | 20,000 | Per-caller NVIDIA inline cost budget (89x LLVM default)
inline-threshold | 225 | LLVM default cost threshold (used by New PM inliner)
nv-inline-all | off | Bypass cost analysis, force-inline everything
-aggressive-inline | -- | CLI flag, routes to inline-budget=40000
intra-scc-cost-multiplier | -- | Cost multiplier for inlining within the same SCC
function-inline-cost-multiplier | -- | Cost multiplier for recursive functions
abort-on-max-devirt-iterations-reached | false | Abort if devirt iteration limit is hit
cgscc-inline-replay | -- | Replay file for inline decisions (debugging)
cgscc-inline-replay-scope | Function | Replay scope: Function or Module
cgscc-inline-replay-fallback | Original | Fallback: Original, AlwaysInline, NeverInline
cgscc-inline-replay-format | Line | Replay format: Line, LineColumn, LineDiscriminator
CGSCC iteration count (arg 1 to sub_1A62BF0) | 1 (O1-O3), 5 (tier 3) | Max SCC re-visitation iterations after graph mutation

Sentinel Values and Constants

Value | Meaning
0xFFFFFFFFFFFFF000 | DenseMap empty bucket sentinel
0xFFFFFFFFFFFFE000 | DenseMap tombstone sentinel
0x100000000 | Packed {size=0, cap=1} for SmallVector initialization
0x100000001 | Packed {DFSNumber=1, LowLink=1} for Tarjan root init
0x400000000 | Packed {size=0, cap=4} for SmallVector initialization
0x800000000 | Packed {size=0, cap=8} for SmallVector initialization
0x88 (136) | SCC object size in bytes
0x18 (24) | Tarjan StackEntry size (Node*, EdgeIter, EdgeEnd)
0x10 (16) | Edge mutation pair size (Node*, Node*)
0xFFFFFFFF (-1) | DFSNumber value indicating "completed" / assigned to an SCC

Diagnostic Strings

The call graph printer at sub_D2B640 (12,287 bytes) emits these strings for debugging:

"Printing the call graph for module:"
"RefSCC with"
"SCC with"
"Edges in function:"
"call SCCs:"
"call"
"ref"
" -> "

The DOT dumper at sub_D29900 emits GraphViz format with "digraph", "[style=dashed" (for ref edges), and standard ";\n", "}\n" terminators.

The New PM inliner at sub_2613930 emits: "function-inline-cost-multiplier", "recursive", "recursive SCC split", "unavailable definition".

The devirtualization pass at sub_2284BC0 emits: "Max devirtualization iterations reached".

The old CGSCC inliner at sub_186CA00 emits: "inline", "NoDefinition", "NotInlined", "AlwaysInline", "Inlined", "Callee", "Caller", "cost=always", "cost=", "threshold=".

The call graph DOT writer cluster at 0x2280000--0x228A000 emits: "view-callgraph", "View call graph", "dot-callgraph", "Print call graph to 'dot' file", "Call graph: ", "external caller", "external callee", "external node", "Writing '", "error opening file for writing!".

Function Map

Function | Address | Size | Role
LazyCallGraph cluster start | sub_D230A0 | -- | --
std::rotate / SCC array reorder | sub_D23910 | -- | --
SCC array splitting helper | sub_D23A60 | -- | --
Node::populate() / edge iterator (lazy population point) | sub_D23BF0 | -- | --
LazyCallGraph::lookupSCC(Node&) | sub_D23C40 | -- | --
RefSCC::isAncestorOf() connectivity check | sub_D23CB0 | -- | --
LazyCallGraph::notifyRefSCCChange() | sub_D23D60 | -- | --
Edge::setKind() (flip call/ref tag bit) | sub_D23E00 | -- | --
SCC constructor | sub_D23F30 | -- | --
LazyCallGraph::insertRefSCC() | sub_D248B0 | -- | --
Node edge list cleanup | sub_D24960 | -- | --
DenseMap insert (Node-to-SCC) | sub_D24C50 | -- | --
RefSCC::isPartOfRefSCC() check | sub_D24D10 | -- | --
DenseMap clear (SCC internals) | sub_D24EE0 | -- | --
RefSCC::find() / updateSCCIndex | sub_D25AF0 | -- | --
RefSCC::SCCIndexMap::find() | sub_D25BD0 | -- | --
DenseMap grow/rehash | sub_D25CB0 | -- | --
switchInternalEdgeToCall() | sub_D25FD0 | 5,526 | --
Node::setRefSCC() | sub_D27750 | -- | --
switchOutgoingEdgeToCall/Ref() | sub_D27A10 | 29,179 | --
Call graph verification | sub_D29180 | 6,417 | --
DOT graph dumper | sub_D29900 | 8,235 | --
insertInternalRefEdge() | sub_D2A080 | 15,253 | --
computeRefSCC() | sub_D2AD40 | 12,495 | --
Call graph text printer | sub_D2B640 | 12,287 | --
buildSCCs() / initial construction | sub_D2BEB0 | 9,782 | --
switchInternalEdgeToRef() | sub_D2C610 | 5,236 | --
mergeRefSCC() | sub_D2DA90 | 17,930 | --
SCC iteration logic | sub_D2E510 | 6,890 | --
rebuildSCC() | sub_D2F240 | 6,141 | --
Post-order SCC traversal helper | sub_D2F8A0 | 10,451 | --
Post-order traversal | sub_D30800 | 7,796 | --
Edge management helper | sub_D301A0 | 5,148 | --
RefSCC-level operations | sub_D31270 | 7,696 | --
CGSCC pass manager / InlinerWrapper factory | sub_1A62BF0 | -- | --
NVIDIA custom inliner (old CGSCC) | sub_1864060 | 75,000 | --
Inliner::inlineCallsImpl() (CGSCC core loop) | sub_186CA00 | 61,117 | --
Call graph node visitor | sub_2280510 | 24,000 | --
Call graph builder | sub_2282680 | 33,000 | --
DevirtSCCRepeatedPass ("Max devirtualization iterations reached") | sub_2284BC0 | 16,000 | --
InlinerWrapper factory (nv-early-inliner, inliner-wrapper) | sub_2342850 | -- | --
CGSCC-to-function adaptor | sub_2362FB0 | 6,700 | --
CGSCC pipeline text parser | sub_2377300 | 103,000 | --
Attributor CGSCC pass | sub_2582AC0 | 39,000 | --
New PM CGSCC inliner | sub_2613930 | 69,000 | --

Cross-References

AsmPrinter & PTX Body Emission

The NVPTXAsmPrinter is cicc's final code-generation stage: the component that converts the machine-level IR (MachineFunction, MachineBasicBlock, MachineInstr) into the textual PTX that ptxas consumes. Unlike a conventional LLVM AsmPrinter, which emits real machine assembly for a physical ISA, the NVPTX variant emits PTX -- a virtual ISA with its own declarative syntax for registers, parameters, address spaces, textures, and kernel launch metadata.

The AsmPrinter is not merely "formatting instructions"; it is responsible for the entire PTX module structure: file header directives, global variable declarations with topological ordering, function signatures with .param space marshaling, register class declarations, the instruction body with debug annotations, and the convergence-control pseudo-instructions required by the warp execution model. In cicc v13.0 the printer spans two address clusters -- the NVPTX-specific emission layer at 0x2140000-0x21FFFFF and the LLVM AsmPrinter override at 0x31E0000-0x3240000.

Pass registration | sub_214ABE0 -- "NVPTX Assembly Printer"
emitFunctionBody | sub_31EC4F0 (12KB, 2565 asm lines)
Header emission (emitHeader) | sub_214F370 (7.2KB)
Function header orchestrator | sub_215A3C0 (10KB)
Kernel attribute emission | sub_214DA90 (8.7KB)
Parameter list emission | sub_21502D0 (22KB)
Stack frame + register decls | sub_2158E80 (17KB)
Global variable emission | sub_2156420 (20KB)
Call prototype emission | sub_21CF8D0 (29KB)
Inline asm handler | sub_31F26A0 / sub_397DF10 (30KB)
AsmPrinter::doFinalization | sub_3972F10 (24KB)

PTX Output Structure

A complete PTX module emitted by cicc follows this exact structure. Every element in this layout corresponds to a specific emitter function:

//                                          ← sub_214F370 (emitHeader)
// Generated by NVIDIA NVVM Compiler
// Compiler Build ID: ...
// Based on NVVM 7.0.1
//
.version 8.5                                ← PTXVersion / 10 . PTXVersion % 10
.target sm_90, texmode_independent          ← subtarget name + driver interface
.address_size 64                            ← 64 or 32 from subtarget

// Start of file scope inline assembly      ← sub_215ACD0 (doInitialization)
...inline asm...
// End of file scope inline assembly

.extern .func (.param .b32 _) _Z3foov      ← sub_2151550 (forward declarations)
.global .texref my_tex;                     ← sub_2156420 (module-level globals)
.global .surfref my_surf;
.global .samplerref my_samp = { ... };
.global .align 4 .b8 data[1024];

.visible .entry _Z6kernelPf(               ← sub_215A3C0 (function header)
    .param .u64 _Z6kernelPf_param_0
)
.reqntid 256, 1, 1                          ← sub_214DA90 (kernel attributes)
.maxnreg 32
{
    .local .align 16 .b8 __local_depot0[64];← sub_2158E80 (frame + registers)
    .reg .b64   %SP;
    .reg .b64   %SPL;
    .reg .pred  %p<5>;
    .reg .b32   %r<47>;
    .reg .b64   %rd<8>;
    .reg .f32   %f<20>;

    // .loc 1 42 0                          ← sub_31D55F0 (per-instruction debug)
    ld.param.u64 %rd1, [_Z6kernelPf_param_0];
    mov.u32 %r1, %tid.x;
    ...
}
// -- End function

Header Directive Emission -- sub_214F370

The header is emitted once during doInitialization (sub_215ACD0). The function builds the output into a SmallString<128> buffer then flushes via OutStreamer.EmitRawText. The emission order is fixed:

  1. Comment block. "// Generated by NVIDIA NVVM Compiler", followed by "// Compiler Build ID: " with the build identifier string, then "// Based on NVVM 7.0.1" (the version string is read from llvm.ident metadata via sub_216F7F0).

  2. .version X.Y -- the PTX ISA version. Computed as PTXVersion / 10 for major, PTXVersion % 10 for minor. In cicc v13.0 targeting SM 90, this is typically .version 8.5.

  3. .target sm_XX[, texmode_independent][, debug] -- the SM target name from NVPTXSubtarget::getTargetName(). The texmode_independent modifier is appended when the driver interface is NVCL (OpenCL). If the driver interface is CUDA and the subtarget lacks double-precision support, map_f64_to_f32 is appended instead. The , debug suffix is added when MCAsmInfo::doesSupportDebugInformation() returns true.

  4. .address_size 64 (or 32) -- from NVPTXSubtarget::is64Bit(). All modern CUDA compilation uses 64-bit.

The doInitialization function (sub_215ACD0) also performs two critical rejection checks: it looks up the llvm.global_ctors and llvm.global_dtors named metadata. If either is a non-empty array, compilation aborts with a fatal error (for constructors: "Module has a nontrivial global ctor, which NVPTX does not support."). GPU kernels have no program startup phase in which global constructors could execute.

Function Declaration: .entry vs .func

The function header orchestrator (sub_215A3C0) emits the complete prologue for each function definition. The emission sequence is:

Step (a): Coroutine pragma. Checks a linked list at this+792 for metadata nodes with type byte 'N' (0x4E) matching the current function. If found, emits .pragma "coroutine";.

Step (b): Linkage directive. Calls sub_214CAD0 which emits .visible, .extern, or .common depending on the function's linkage. CUDA kernel compilation mode is gated by *(this+232)->field_952 == 1.

Step (c): Entry vs function. Calls sub_1C2F070 (isKernelFunction). If the function is a kernel: emit .entry. Otherwise: emit .func.

Step (d): Return type. For .func only. Calls sub_1C2FA50 to check whether the function returns a value. If so, calls sub_214C940 to emit the return type specification (e.g., (.param .b32 retval0)). Kernels have no return values in PTX.

Step (e): Function name. sub_214D1D0 emits the mangled C++ name.

Step (f): Parameter list. sub_21502D0 (22KB) emits the complete .param declaration list. This is the most complex part of the header -- see the next section.

Step (g): Kernel attributes. Only for .entry functions. sub_214DA90 emits launch-bound and cluster directives.

Step (h): Additional attributes. sub_214E300 emits .local_maxnreg if set.

Step (i): Noreturn. If the function has metadata attribute 29 (noreturn) and is not a kernel, emits .noreturn.

Step (j): Open body. Emits {\n.

Step (k): Frame and registers. sub_2158E80 emits the local depot, stack pointer registers, and all virtual register declarations.

.param Space Marshaling

PTX uses .param space for all function arguments. The parameter emission function sub_21502D0 handles the full taxonomy of NVPTX parameter types. The emitted parameter name follows the pattern FUNCNAME_param_N where N is a monotonic index starting at 0.

Scalar parameters are emitted as .param .TYPE _param_N where TYPE is the PTX scalar type (.b32, .b64, .f32, .f64, .pred). Scalars are widened so that every .param scalar occupies at least 4 bytes, per the PTX rule. The widening logic: if the bit-width is <= 32, widen to .b32; if it falls strictly between 32 and 64, widen to .b64; otherwise keep the width as-is.

Aggregate / byval parameters are emitted as .param .align ALIGN .b8 _param_N[SIZE] -- a byte array with explicit alignment. The alignment comes from the function's DataLayout and the parameter attribute.

Texture / surface / sampler parameters get special treatment:

  • .param .texref _param_N -- texture reference (direct binding)
  • .param .surfref _param_N -- surface reference
  • .param .samplerref _param_N -- sampler reference
  • .param .u64 .ptr .texref _param_N -- pointer to texture (indirect)
  • .param .u64 .ptr .surfref _param_N -- pointer to surface
  • .param .u64 .ptr .samplerref _param_N -- pointer to sampler

The distinction between direct references and pointer-to-references reflects whether the texture/surface handle is passed by value or by indirection through a 64-bit pointer.

Call prototypes (sub_21CF8D0, 29KB) are emitted for indirect calls. When a function pointer call occurs, the AsmPrinter generates a .callprototype declaration: prototype_N : .callprototype (.param .b32 _) _ (.param .b64 _, .param .b32 _). The prototype index N is monotonically increasing.

Register Declarations

Inside the function body, sub_2158E80 emits register declarations for every virtual register class used. The nine register classes, their vtable addresses, PTX type suffixes, prefixes, and encoded IDs are documented in Register Classes. The encoding scheme, declaration emission format, and the internal-only tenth class are covered in Register Encoding Scheme and Register Declaration Emission.

The emitted text for each class follows the pattern:

.reg .pred  %p<5>;       ← 5 predicate registers needed
.reg .b16   %rs<12>;     ← 12 short integer registers
.reg .b32   %r<47>;      ← 47 general-purpose 32-bit
.reg .b64   %rd<8>;      ← 8 double-width integer
.reg .f32   %f<20>;      ← 20 single-precision float
.reg .f64   %fd<3>;      ← 3 double-precision float

The count for each class is max_register_index + 1. The emitter iterates the function's virtual register map at this+800, deduplicates register classes using a hash table at this+808..832, and tracks the maximum index per class.

The stack frame is emitted before registers when the function has a non-zero local frame:

.local .align 16 .b8 __local_depot0[512];   ← ALIGN from frame info, N = function index
.reg .b64   %SP;                             ← stack pointer (64-bit mode)
.reg .b64   %SPL;                            ← stack pointer local

The __local_depot name is a fixed prefix (#define DEPOTNAME "__local_depot" in the source). %SP is the global stack pointer; %SPL points into the local depot. In 32-bit mode these are .reg .b32.

Global Variable & Texture Emission -- sub_2156420

Module-level global variables are emitted by sub_2156420 (20KB), called from emitGlobals during doInitialization. Globals must be emitted in topological order because ptxas does not support forward references. The ordering is computed by sub_2157D50 which performs a DFS over global variable use-def chains, detecting circular dependencies (fatal: "Circular dependency found in global variable set").

Texture references: .global .texref NAME; -- emitted when sub_1C2E830 classifies the global as a texture. Surface references: .global .surfref NAME;. Sampler references get an optional initializer block:

.global .samplerref my_sampler = {
    addr_mode_0 = clamp_to_edge,
    addr_mode_1 = wrap,
    filter_mode = linear,
    force_unnormalized_coords = 1
};

Address mode values: wrap, clamp_to_border, clamp_to_edge, mirror. Filter mode values: nearest, linear. The force_unnormalized_coords field is boolean.

Data globals receive an address-space qualifier from sub_214FA80: .global (addrspace 1), .shared (addrspace 3), .const (addrspace 4), .local (addrspace 5). Managed-memory globals get .attribute(.managed). Unified addressing gets .attribute(.unified) or .attribute(.unified(N)).

Skipped globals: Variables whose names start with "llvm.metadata", "llvm.", or "nvvm." are silently skipped.

Demoted globals (shared memory demotion, addrspace 3) emit a comment: "// NAME has been demoted".

Instruction Emission -- sub_31EC4F0

The core emission loop emitFunctionBody at sub_31EC4F0 (12KB) overrides llvm::AsmPrinter::emitFunctionBody. It allocates a 0xF28-byte stack frame (holding SmallString buffers, a DenseMap for instruction-mix statistics, and tracking structures) and proceeds through three phases:

Phase 1: Per-MBB Outer Loop

Iterates the MachineFunction's MBB linked list. The iteration strips tagged-pointer bits (AND ~7) from the ilist node pointers. For each MBB:

  1. Calls emitBasicBlockStart(MBB) via vtable dispatch.
  2. Enters the instruction inner loop.
  3. Calls emitBasicBlockEnd(MBB).
  4. Collects instruction-mix statistics when debug counters are active.

Phase 2: Per-Instruction Inner Loop

For each MachineInstr, reads the opcode at MI+0x44 (uint16) and dispatches through a 46-case jump table:

Default path (real instructions): Calls emitInstruction(MI) via [vtable+0x128], which dispatches to the tablegen-generated printInstruction(). This function uses the NVPTXGenAsmWriter.inc tables to format each instruction: printInstruction() calls NVPTXInstPrinter::printOperand for each operand, producing text like mov.u32 %r0, %r1 or add.f32 %f2, %f0, %f1. After emission, the instruction counter is incremented and, if debug info is present, sub_31D55F0 emits a .loc directive.

Inline assembly (opcodes 1, 2): Routed to sub_31F26A0 / sub_397DF10 (30KB). The inline asm handler parses ${} operand references, handles .att_syntax / .intel_syntax mode switching, and emits // begin inline asm / // end inline asm comment markers. PTX inline assembly is passed through essentially verbatim, with operand substitution.

Meta-instructions (opcodes 3-7, 10-18): These include STACKMAP, PATCHPOINT, EH_LABEL, GC_LABEL, KILL, CFI_INSTRUCTION, DBG_VALUE, DBG_VALUE_LIST, and DBG_LABEL. Most emit labels or debug comments rather than PTX instructions. The KILL pseudo emits a "kill:" comment listing each killed register with sub_2FF6320 (printReg). DBG_LABEL emits "DEBUG_LABEL: <label>".

Convergence control (opcodes 24, 33): CONVERGENCECTRL_ENTRY calls sub_31DB9B0 to mark the entry point of a convergent region. CONVERGENCECTRL_LOOP calls sub_31DB950 to mark a loop-back convergence point. These pseudo-instructions are critical for the PTX assembler to correctly track warp divergence and reconvergence. See the dedicated Convergence Control Framework section below for the full lowering pipeline.

FAKE_USE (opcode 43): Debug-only. Emits "fake_use:" followed by register operands.

MEMBARRIER (opcode 44): Emits "MEMBARRIER" as a raw comment.

Pre- and post-instruction hooks: Before each instruction, the Handlers vector at this+0x240 is iterated, calling beginInstruction(MI) on each handler. After each instruction, endInstruction() is called. The AsmPrinter maintains two handler lists (at +0x240 and +0x228) supporting both debug-info handlers and exception/unwind handlers.

Phase 3: Post-Function Processing

After all MBBs are emitted:

  1. Zero-length function avoidance. If no real instructions were emitted (tracked by var_F30 and var_ED1), inserts a NOP via sub_31DCBB0 with comment "avoids zero-length function".
  2. Function-end label. Creates a "func_end" temp symbol via sub_31DCC50 and emits it for DWARF range tracking.
  3. DWARF line table finalization. Creates CIE/FDE symbols, binds them via emitAssignment, and inserts a debug-loc entry for the function-end symbol.
  4. Handler finalization. Calls endFunction(MF) on all handlers in both lists.
  5. PGO / BBAddrMap emission. If enabled via dword_50360A8, emits BB address maps for profile-guided optimization. Missing labels trigger diagnostic: "pgo-analysis-map is enabled for function... but it does not have labels".
  6. End comment. Emits "-- End function\n" as a raw comment.

Debug Info Emission

Debug information in PTX is emitted as .loc and .file directives embedded in the instruction stream, not as separate DWARF sections (the PTX assembler ptxas constructs the actual DWARF from these directives).

The debug emission is layered:

Layer | Function | Behavior
Per-instruction .loc | sub_31D55F0 | Emits .loc FileIndex Line Col for instructions with attached DebugLoc
Source-line comments | sub_31D89B0 | Emits source location as comments when asm-printer debug counter is active
Function-name + inlined-at | emitInlinedAtInfo (NVIDIA) | Appends , function_name LAB, inlined_at FILE LINE COL to .loc
Per-MBB boundary | sub_31E6100 | Maintains file/line-to-MCSymbol mapping for MBB boundaries
.file directives | emitDwarfFileEntries | Maps source filenames to file indices during doFinalization
DWARF line section | sub_E81A00 | Binds CIE/FDE symbols for line table construction

The NVIDIA extension to .loc is the function_name and inlined_at attributes. Upstream LLVM's .loc only has file line column. cicc appends inlining context so that ptxas can reconstruct the full inline call stack in DWARF. The InlinedAtLocs set tracks which inlined-at locations have already been emitted, preventing duplicates. A work list (SmallVector<DebugLoc, 8>) is built by walking the inlined-at chain, then emitted in reverse order so that outer locations appear before inner ones.

When InterleaveSrcInPtx is enabled, the AsmPrinter reads source file lines and emits them as comments interleaved with the PTX.

Module-Level Metadata Directives

Kernel launch-bound metadata directives are emitted by sub_214DA90 in this order:

Directive | Metadata Source | Notes
.blocksareclusters | nvvm.blocksareclusters | Fatal error if .reqntid not also set
.reqntid X, Y, Z | nvvm.reqntid (comma-separated strtol) | Unspecified dims default to 1
.maxntid X, Y, Z | Structured metadata readers | Unspecified dims default to 1
.minnctapersm N | sub_1C2EF70 | Min CTAs per SM
.explicitcluster | nvvm.cluster_dim | SM 90+ only (field_1212 > 0x59)
.reqnctapercluster X, Y, Z | Cluster dim readers | SM 90+ only
.maxclusterrank N | sub_1C2EF50 | SM 90+ only
.maxnreg N | sub_1C2EF90 | Register limit per thread

The .pragma "nounroll" directive is emitted at MBB level by sub_3970E40 when llvm.loop.unroll.disable metadata is detected on a loop header. This is an NVIDIA modification to the MBB printer.

The .abi_preserve family of directives is emitted by sub_3937240: .abi_preserve, .abi_preserve_after, .abi_preserve_uniform, .abi_preserve_control. These are NVIDIA-specific PTX directives for register ABI preservation across function calls.

Convergence Control Framework

CUDA's SIMT execution model requires the compiler to track which threads in a warp must execute the same instruction simultaneously. When a conditional branch causes warp divergence (some threads take one path, others take the other), the hardware needs to know where threads reconverge. The convergence control framework propagates this information from LLVM IR intrinsics through MachineInstr pseudo-instructions to the final PTX output, where ptxas uses it to emit correct convergence/reconvergence barriers in SASS.

Three-Layer Architecture

Convergence information flows through three representation layers during compilation:

LLVM IR                    MachineInstr                AsmPrinter
─────────────────────      ──────────────────────      ──────────────────
llvm.experimental          CONVERGENCECTRL_ENTRY       sub_31DB9B0
  .convergence.entry  →    (opcode 24)            →   (emitConvergenceEntry)

llvm.experimental          CONVERGENCECTRL_LOOP        sub_31DB950
  .convergence.loop   →    (opcode 33)            →   (emitConvergenceLoop)

llvm.experimental          CONVERGENCECTRL_ANCHOR      (no AsmPrinter case --
  .convergence.anchor →    (opcode 34)                  dropped before emission)

"convergencectrl"          (operand bundle tag          (verified at IR level,
 operand bundle      →      preserved through ISel)      consumed by pseudo-instrs)

Layer 1: IR intrinsics. Three llvm.experimental.convergence.* intrinsics define convergent regions at the LLVM IR level. Each returns an abstract "convergence token" (type token) that is consumed by calls carrying the convergencectrl operand bundle. The bundle ties a call to a specific convergence scope -- the verifier at sub_29ED7A0 enforces "convergent call needs convergencectrl operand" for any call marked with the convergent attribute (attribute kind 0x34 = 52).

Layer 2: MachineInstr pseudo-opcodes. During instruction selection (SelectionDAG lowering), the convergence intrinsics are lowered to target-independent MachineInstr pseudo-opcodes. These survive register allocation and all machine-level optimization passes unchanged -- they carry no register operands and produce no real instructions. Their sole purpose is to mark positions in the MBB instruction stream for the AsmPrinter.

Layer 3: AsmPrinter emission. The emitFunctionBody loop at sub_31EC4F0 dispatches opcodes 24 and 33 to dedicated emitter functions that translate the pseudo-instructions into whatever PTX annotation ptxas requires for reconvergence tracking. The CONVERGENCECTRL_ANCHOR pseudo (opcode 34) does not appear in the AsmPrinter's 46-case jump table, indicating it is either dropped during ISel or consumed by an earlier machine pass.

Convergence Token Semantics

The convergence token model enforces a strict dominance and nesting discipline:

  1. convergence.entry produces a token that represents the function's entry convergence scope. All threads that enter the function are converged at this point. The token must dominate all its uses.

  2. convergence.loop produces a token scoped to a natural loop. The token marks the point where loop-back-edge threads reconverge before the next iteration. The loop header must dominate all blocks in the cycle.

  3. convergence.anchor produces a token at an arbitrary program point, used for structured convergence within non-loop regions (e.g., structured if/else regions where reconvergence is needed at the join point).

  4. convergencectrl operand bundle attaches a convergence token to a call site. This tells the compiler "this call must execute with the set of threads defined by this token's scope." For example:

%tok = call token @llvm.experimental.convergence.entry()
%result = call float @__shfl_sync(i32 %mask, float %val, i32 %lane)
          [ "convergencectrl"(token %tok) ]

The LLVM verifier (sub_BFC6A0, 211KB) checks that convergent calls carry the bundle; the convergence verifier (sub_E35A10, 14KB) checks the structural invariants.

ConvergenceVerifier -- sub_E35A10

The standalone convergence verification pass at sub_E35A10 (14KB) enforces five invariants on convergence token usage:

Invariant | Diagnostic String
Token dominance | "Convergence control token must dominate all its uses."
Region nesting | "Convergence region is not well-nested."
Cycle heart dominance | "Cycle heart must dominate all blocks in the cycle."
Single token per cycle | "Two static convergence token uses in a cycle..."
Loop token type | Checks llvm.experimental.convergence.loop usage in cycles

The verifier calls sub_B19720 for domination checks, sub_E342D0 for cycle detection (using the generic cycle info infrastructure), sub_E45390 for diagnostic emission, and sub_E348A0 for error reporting. It runs as part of the IR verification pipeline, not as a separate pass -- the convergence invariants are checked alongside other LLVM IR well-formedness rules.

NVIDIA Convergent Branch Intrinsics

In addition to the upstream llvm.experimental.convergence.* intrinsics, cicc defines two NVIDIA-specific convergent branch intrinsics that interact with the convergence framework:

Intrinsic | Builtin ID | Minimum SM | Error on Violation
llvm.nvvm.branch.if.all.convergent | 3755 / 8282 | sm_70+ (Volta) | "not supported on pre-Volta Architectures"
llvm.nvvm.branch.if.convergent | 3754 / 8283 | sm_80+ (Ampere) | "not supported on pre-Ampere Architectures"

These intrinsics produce a boolean result that must be consumed by exactly one branch instruction (enforced by sub_2C7B6A0 with diagnostic: "result of llvm.nvvm.branch.if.convergent and llvm.nvvm.branch.if.all.convergent can only be used by exactly one branch instruction"). The .all variant tests whether all threads in the warp are converged (equivalent to a "uniform predicate" test); the non-.all variant tests whether the current execution context is convergent (the thread set matches the convergence token's scope).

SM version gating is checked in both the NVVM verifier (sub_1C36530) and the lowering pass (sub_2C7B6A0). The SM version is stored as SM * 10 internally (so sm_70 = 700, sm_80 = 800), compared against thresholds at unk_4D045E8.

The convergent Function Attribute (Kind 0x34)

The convergent function/call attribute (attribute kind 52, bit 0x20 at byte offset +33 in the function attribute flags) marks operations that have warp-synchronous semantics. This attribute affects multiple compilation stages:

Constant folding gate (sub_2C7B430). The NVIDIA intrinsic fold function checks hasAttribute(callee, -1, 0x34) before attempting any constant fold. If the callee is convergent, folding is rejected unconditionally -- even if all arguments are compile-time constants. This prevents __syncthreads(), __ballot_sync(), __shfl_sync(), and warp-vote operations from being eliminated.

Inline asm convergence flag. During SelectionDAG lowering of inline assembly (sub_1560260), the convergent attribute is tested via operand bundle or function attribute. If set, bit 5 of the inline asm flags word is set (isConvergent), encoding into the DAG node as: flags = hasSideEffects | (isAlignStack << 1) | (dialect << 2) | (convergent << 5).

Loop unrolling epilog forcing. When a loop body contains convergent calls (hasCallInLoop check), the unroller forces epilog remainder style rather than prolog, because epilog preserves the property that all threads participate in each full iteration of the unrolled body.

StructurizeCFG skip. Functions carrying the convergent attribute (attribute ID 56 in the attribute check at sub_B2D610) are skipped by the StructurizeCFG pass -- they are assumed to already have correct convergence structure.

Dead barrier elimination gate. The dead sync elimination engine (sub_2C83D20) identifies barrier intrinsics by checking bit 0x20 at byte +33 (the convergent attribute flag) on the callee, combined with opcode 85 (the internal barrier opcode) and a barrier intrinsic ID confirmation via sub_CEA1A0.

Operand Bundle Registration

The convergencectrl operand bundle tag is registered during LLVMContext initialization at sub_B6EEA0 (9KB), alongside the other standard bundle tags:

Operand bundle tags registered at context creation:
  "funclet"           -- EH funclet scope
  "gc-transition"     -- GC state transition
  "ptrauth"           -- pointer authentication
  "kcfi"              -- kernel control flow integrity
  "convergencectrl"   -- convergence token attachment

These tags are interned as string IDs in the context's operand bundle tag table. When the bitcode reader parses a call instruction with operand bundles (sub_14FCE40, 107KB), the convergencectrl bundle is reconstructed from the bitcode record and attached to the CallInst/InvokeInst. The inliner at sub_29ED7A0 (96KB) checks "convergent call needs convergencectrl operand" to verify that convergent calls in the callee carry appropriate bundles after inlining.

Pseudo-Instruction Lowering in emitFunctionBody

The emitFunctionBody loop at sub_31EC4F0 handles the two convergence pseudo-instructions as part of its 46-case opcode switch:

Case 24 -- CONVERGENCECTRL_ENTRY. Calls sub_31DB9B0 (emitConvergenceEntry), which sits immediately after sub_31DB950 in the binary layout (the two functions are adjacent, 0x60 bytes apart: 0x31DB950 to 0x31DB9B0). The entry pseudo marks the function entry convergence point. It does not emit visible PTX text -- instead it updates internal state that the OutStreamer uses for reconvergence tracking in the generated object.

Case 33 -- CONVERGENCECTRL_LOOP. Calls sub_31DB950 (emitConvergenceLoop). This marks loop-back convergence points. Like the entry pseudo, it produces no visible PTX output but influences ptxas's reconvergence analysis.

Both pseudo-instructions are "silent" -- they do not increment the instruction counter (var_F30), do not trigger .loc emission, and do not invoke the beginInstruction/endInstruction handler callbacks. They fall through the switch without reaching the default path's instruction-counting logic.

Post-Function Convergence Close-Out

After all MBBs in a function are emitted, the emitFunctionBody function performs convergence-related cleanup in Phase 3a (0x31ECFFD-0x31ED0FA):

Phase 3a: Convergence control close-out
  if (var_ED1 == true):                          // any real instructions seen?
      OutStreamer->emitAlignment(MF->getAlignment())
      for sym in MF->globalSymbolTable[0x48..0x50]:
          if (sym[-0x16] & 0x7FFF) != 0:         // visibility flags
              sub_31E1750(sym)                    // resolveBlockAddress
              if block_was_removed:
                  emit diagnostic "Address of block that was removed by Co..."
                  OutStreamer->emitLabel(fallback_sym)

The var_ED1 flag tracks whether any non-meta instructions appeared in the function body. When set, the close-out phase emits function alignment, resolves block-address symbols in the global symbol table (checking visibility flags at sym[-0x16] & 0x7FFF), and handles the edge case where a basic block was removed by CodeGen after a block-address was taken -- this would produce a dangling convergence reference, so a diagnostic is emitted and a fallback label is created.

Convergence and the StructurizeCFG Pass

The StructurizeCFG pass (documented in StructurizeCFG) is the primary consumer of convergence information during the CFG transformation phase. PTX requires reducible control flow: every back-edge must target a loop header that dominates all blocks in the cycle, and every divergent branch must reconverge at a post-dominator.

The pass performs a domtree-guided reconvergence insertion that stores head/tail pointers into function metadata at *(func_obj+672) and *(func_obj+680). These pointers are read by subsequent PTX emission passes to emit correct convergence annotations. Functions with the convergent attribute (or optnone) are skipped entirely -- they are assumed to already have correct structure.

When non-uniform divergent regions are identified, the pass creates new "reconvergence" basic blocks, copies phi entries, and reroutes edges so that all divergent paths merge at a single post-dominator. The sub_35CB4A0 uniformity check and sub_35C9ED0 NCA (nearest common ancestor) computation in the dominator tree determine where reconvergence points are inserted.

NVIDIA Extensions Beyond Upstream

cicc's AsmPrinter diverges from upstream LLVM's NVPTXAsmPrinter in several important ways:

Convergence control pseudo-instructions. Upstream LLVM (as of the LLVM 20 base) has llvm.experimental.convergence.* intrinsics, but the AsmPrinter handling of CONVERGENCECTRL_ENTRY and CONVERGENCECTRL_LOOP as dedicated opcode cases (24 and 33 in the jump table) with calls to sub_31DB9B0 / sub_31DB950 is cicc-specific. These ensure correct warp-level synchronization semantics in the emitted PTX. Additionally, cicc adds two NVIDIA-specific convergent branch intrinsics (llvm.nvvm.branch.if.convergent for sm_80+ and llvm.nvvm.branch.if.all.convergent for sm_70+) that have no upstream equivalent. See the Convergence Control Framework section for the full pipeline.

Enhanced .loc with inlined-at. The function_name and inlined_at extensions to .loc directives are NVIDIA additions. Upstream LLVM's NVPTX backend emits only standard .loc file line col. cicc's version walks the full inlining chain to produce richer debug information.

Cluster directives (SM 90+). The entire cluster attribute family (.blocksareclusters, .explicitcluster, .reqnctapercluster, .maxclusterrank) and the 15 cluster special registers are NVIDIA extensions to PTX not present in upstream LLVM's NVPTX backend.

.abi_preserve directives. The register ABI preservation annotations emitted by sub_3937240 have no upstream equivalent.

.pragma "coroutine". The coroutine pragma emission in the function header orchestrator is NVIDIA-specific, supporting CUDA coroutine execution.

PGO/BBAddrMap integration. The BBAddrMap and PGO analysis info structures (0x80 and 0x98 bytes respectively, dynamically allocated when analysis passes are absent) are LLVM 16+ features that cicc integrates into the PTX emission path.

Instruction-mix statistics. The per-MBB instruction-mix collection ("INST_<name>: <count>" format) under the "asm-printer" statistic group is significantly more elaborate than upstream's simple instruction counter.

Dual handler lists. cicc maintains two separate AsmPrinterHandler lists (at this+0x240 and this+0x228), iterated independently for beginInstruction/endInstruction/endFunction. Upstream uses a single handler list.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| NVPTXAsmPrinter pass registration | sub_214ABE0 | -- | -- |
| Return type / .attribute(.unified) emission | sub_214C940 | 1.9KB | -- |
| Linkage directive emission (.visible/.extern/.common) | sub_214CAD0 | 2.4KB | -- |
| Kernel attribute emission (.reqntid, .maxnreg, cluster) | sub_214DA90 | 8.7KB | -- |
| .local_maxnreg emission | sub_214E300 | 1.3KB | -- |
| emitHeader (.version, .target, .address_size) | sub_214F370 | 7.2KB | -- |
| Address space qualifier emission | sub_214FA80 | 1.9KB | -- |
| emitFunctionParamList (.param declarations) | sub_21502D0 | 22KB | -- |
| Parameter name generation (_param_N) | sub_2150230 | -- | -- |
| Function forward declaration emission | sub_2151550 | 3.9KB | -- |
| emitFunctionEntryLabel (.entry/.func) | sub_2151D30 | 7.0KB | -- |
| Function alias emission (.alias) | sub_21518E0 | 5.0KB | -- |
| Static initializer expression emission | sub_2153350 | 5.3KB | -- |
| Byte-level constant data emission | sub_2153AE0 | 9.9KB | -- |
| printModuleLevelGV (texref/surfref/samplerref/data) | sub_2156420 | 20KB | -- |
| Global variable topological sort | sub_2157D50 | 5.9KB | -- |
| Register class -> encoded ID | sub_21583D0 | 4.6KB | -- |
| Stack frame + register declaration emission | sub_2158E80 | 17KB | -- |
| Function header orchestrator | sub_215A3C0 | 10KB | -- |
| Module-level emission entry (ctor/dtor check, DWARF) | sub_215ACD0 | 8.1KB | -- |
| GenericToNVVM pass registration | sub_215DC20 | -- | -- |
| Register class -> PTX type suffix | sub_2163730 | 1.7KB | -- |
| Register class -> PTX register prefix | sub_21638D0 | 1.6KB | -- |
| llvm.ident / "Based on NVVM 7.0.1" reader | sub_216F7F0 | 5.7KB | -- |
| emitCallPrototype (.callprototype for indirect calls) | sub_21CF8D0 | 29KB | -- |
| Atomic opcode emission (13 operations) | sub_21E5E70 | -- | -- |
| L2-hinted atomic emission (SM 80+) | sub_21E6420 | -- | -- |
| Address space conversion (cvta) + MMA helpers | sub_21E7FE0 | -- | -- |
| Standard special register emission (%tid, %ctaid, etc.) | sub_21E86B0 | -- | -- |
| Cluster barrier emission (SM 90+) | sub_21E8EA0 | -- | -- |
| Cluster special register emission (SM 90+) | sub_21E9060 | -- | -- |
| Memory barrier emission (membar/fence) | sub_21E94F0 | -- | -- |
| printReg (register number -> %rN string) | sub_2FF6320 | -- | -- |
| Per-instruction .loc DWARF directive | sub_31D55F0 | -- | -- |
| Instruction-level debug comment emission | sub_31D89B0 | -- | -- |
| emitConvergenceEntry (CONVERGENCECTRL_ENTRY pseudo, opcode 24) | sub_31DB9B0 | -- | -- |
| emitConvergenceLoop (CONVERGENCECTRL_LOOP pseudo, opcode 33) | sub_31DB950 | -- | -- |
| ConvergenceVerifier::verify (token dominance/nesting checks) | sub_E35A10 | 14KB | -- |
| Cycle detection for convergence verification | sub_E342D0 | -- | -- |
| Convergence verification error reporting | sub_E348A0 | -- | -- |
| Inliner/verifier core ("convergent call needs convergencectrl operand") | sub_29ED7A0 | 96KB | -- |
| NVVM convergent branch intrinsic SM-version gating | sub_1C36530 | -- | -- |
| Convergent branch lowering + single-use enforcement | sub_2C7B6A0 | -- | -- |
| Metadata kind + operand bundle tag registration (incl. convergencectrl) | sub_B6EEA0 | 9KB | -- |
| emitNops (zero-length function avoidance) | sub_31DCBB0 | -- | -- |
| createTempSymbol ("func_end", "Ltmp") | sub_31DCC50 | -- | -- |
| emitFunctionBody (main loop) | sub_31EC4F0 | 12KB | -- |
| emitInlineAsm | sub_31F26A0 | -- | -- |
| .abi_preserve directive emission | sub_3937240 | 14KB | -- |
| MBB printer + .pragma "nounroll" | sub_3970E40 | 18KB | -- |
| doFinalization | sub_3972F10 | 24KB | -- |
| emitInlineAsm (parser/streamer) | sub_397DF10 | 30KB | -- |

Cross-References

  • PTX Emission -- hub page for the emission stage with additional detail on atomic/barrier/special-register emission
  • Code Generation -- the MachineInstr-producing stage that feeds the AsmPrinter
  • SelectionDAG -- instruction selection that creates the MachineInstrs
  • NVPTX Call ABI -- .param space calling convention detail
  • Register Allocation -- determines which virtual registers exist for the register declaration phase
  • Inliner Cost Model -- inlining decisions that create the inlined-at debug chains the AsmPrinter must emit
  • StructurizeCFG -- CFG restructuring pass that creates reconvergence basic blocks for divergent control flow
  • Dead Sync Elimination -- dead barrier elimination engine that uses the convergent attribute to identify barrier intrinsics
  • SM 70-89 Architecture -- SM version gating for convergent branch intrinsics
  • GPU Execution Model -- SIMT warp divergence/reconvergence background

Debug Info Verification

cicc includes a custom debug info verification pass (sub_29C8000) that validates DWARF-like debug metadata after each optimization pass in the pipeline. This is not the upstream LLVM IR Verifier (llvm::Verifier::verify(Module)); it is an NVIDIA-specific implementation derived from LLVM's CheckDebugInfoPass (in Debugify.cpp) with two significant extensions: a structured JSON reporting mechanism that tracks exactly which optimization passes degrade debug info quality, and a configurable verbosity system that allows the verification overhead to be tuned from silent to exhaustive. The pass lives in a self-contained module of approximately 93 functions in the 0x29C0000--0x29FFFFF address range, alongside the Debugify synthetic debug info injector and general pass infrastructure utilities. Its purpose is to ensure that when a developer compiles with -g or -generate-line-info, the debug metadata that cuda-gdb and Nsight Compute rely on survives the aggressive optimization pipeline intact.

Primary function: sub_29C8000 (12,480 bytes, 434 basic blocks)
Address range: 0x29C8000 -- 0x29CB0C0
Per-instruction verifier: sub_29C3AB0 (5,592 bytes)
Debugify injector: sub_29C1CB0
NewPM wrappers: sub_22702B0 (NewPMCheckDebugifyPass), sub_2270390 (NewPMDebugifyPass)
Pipeline parser names: "check-debugify" (pass #26), "debugify" (pass #35)
Verbose output flag: qword_5008FC8 (bool)
Depth threshold: qword_5008C88 (int32)
Stack frame: 0x4B8 bytes (eight tracking structures)
Upstream origin: llvm/lib/Transforms/Utils/Debugify.cpp -- CheckDebugInfoPass

Three Verification Modes

cicc supports three independent verification protocols, each activated by a different set of knobs. Understanding which protocol is active determines what diagnostic output to expect and how much overhead the verification adds.

Mode 1: Post-Pass Debug Info Verification (verify-each)

The default verification mode, activated by the verify-each (or its alias verify-after-all) LLVM knob. The pipeline runner invokes sub_29C8000 as a sandwich around each optimization pass:

// Pseudocode for the pipeline runner's verification protocol
// (entry: 0x29C8000, stack: 0x4B8 bytes)
snapshot_debug_metadata(M);
run_optimization_pass(M, "instcombine");
sub_29C8000(M, errs(), dbgCU, hashMap, "instcombine", 11, file, fileLen, jsonOut);

The pass name argument identifies which optimization just ran, so the JSON report can attribute any debug info degradation to the specific pass responsible. The verifier checks the full metadata inventory: subprograms, scopes, variables, types, labels, imported entities, and retained nodes. It produces ERROR diagnostics for dropped subprograms and WARNING diagnostics for dropped debug variable intrinsics.

Activation: -Xcicc -verify-each or -Xcicc -verify-after-all
Overhead: One full metadata snapshot + eight hash table constructions + per-function variable scan per optimization pass. Substantial for large modules.

Mode 2: Debugify Synthetic Injection + Verification (debugify-each)

The full Debugify cycle injects synthetic debug metadata before each pass, runs the pass, then verifies the synthetic metadata survived. This mode is more aggressive than Mode 1 because it tests every pass even on code compiled without -g.

// Debugify cycle pseudocode
sub_29C1CB0(M, "llvm.debugify");   // inject synthetic debug info
run_optimization_pass(M, "instcombine");
sub_29C8000(M, errs(), dbgCU, hashMap, "instcombine", ...);  // verify
strip_debugify_metadata(M, "llvm.debugify");  // cleanup

The injector (sub_29C1CB0) creates "llvm.debugify" / "llvm.mir.debugify" named metadata nodes that serve as watermarks. The checker looks for these watermarks to distinguish synthetic from genuine debug info.

Activation: -Xcicc -debugify-each
Sub-knobs: debugify-level (locations or location+variables), debugify-quiet, debugify-func-limit, debugify-export

Mode 3: Debug Info Preservation Checking (verify-debuginfo-preserve)

A lighter-weight mode that checks only whether existing debug info survives optimization, without injecting synthetic metadata. This mode is available through the New Pass Manager infrastructure and can export results via verify-di-preserve-export.

Activation: -Xcicc -verify-debuginfo-preserve
Sub-knobs: verify-each-debuginfo-preserve, verify-di-preserve-export

Mode Selection Matrix

| Knob | Scope | Injects synthetic? | Checks variables? | JSON output? |
|---|---|---|---|---|
| verify-each | All passes | No | Yes (if -g) | If jsonOutput != NULL |
| debugify-each | All passes | Yes | Configurable via debugify-level | Via debugify-export |
| verify-debuginfo-preserve | All passes | No | Yes | Via verify-di-preserve-export |
| (none, -g active) | -- | No | No per-pass check | No |

Pipeline Integration

The verifier operates as an interleaved "check" pass. The New Pass Manager registers it via two wrappers in the pipeline construction code at 0x2270000--0x227FFFF:

| Address | Registration string | Role |
|---|---|---|
| sub_22702B0 | "NewPMCheckDebugifyPass]" | Verification after each pass |
| sub_2270390 | "NewPMDebugifyPass]" | Synthetic injection before each pass |
| sub_2270470 | "VerifierPass]" | Standard IR verifier (separate) |

The pipeline text parser (sub_2272BE0, 14KB) recognizes these as named module passes:

| Slot | Pipeline name | Class | Level |
|---|---|---|---|
| #26 | "check-debugify" | NewPMCheckDebugifyPass | Module |
| #35 | "debugify" | NewPMDebugifyPass | Module |

When debugify-each is active, the pipeline builder (sub_2277440, 60KB -- buildDefaultPipeline() equivalent) wraps every optimization pass in a debugify/check-debugify pair. When verify-each is active, only the check-debugify wrapper is inserted.

Verification Function Signature

The function signature reconstructed from the binary:

bool sub_29C8000(
    Module*       module,       // rdi
    raw_ostream&  output,       // rsi -- diagnostic stream
    NamedMDNode*  dbgCU,        // rdx -- "llvm.dbg.cu" metadata
    DenseMap*     hashMap,      // rcx -- metadata identity table
    const char*   passName,     // r8
    size_t        passNameLen,  // stack+0x00
    const char*   fileName,     // stack+0x08
    size_t        fileNameLen,  // stack+0x10
    raw_ostream*  jsonOutput,   // stack+0x18 -- NULL if no JSON report
    ...
);
// Returns: true = all checks passed, false = any violation detected

Verification Algorithm

The pass proceeds through nine sequential phases within a single function call. The 0x4B8-byte stack frame holds eight separate tracking data structures.

Phase 1: Module-Level Guard (0x29C8000 -- 0x29C807A)

Looks up the "llvm.dbg.cu" named metadata node via sub_BA8DC0 (Module::getNamedMetadata). If absent or empty, prints ": Skipping module without debug info\n" and returns 0. This is the fast path for modules compiled without -g.

Phase 2: Pre-Pass Metadata Snapshot (0x29C8080 -- 0x29C8AE5)

Initializes eight SmallVector/DenseMap structures on the stack and walks the compile unit metadata tree:

| Stack offset | Purpose | Copy helper |
|---|---|---|
| var_1F0 | DISubprogram tracking set | sub_29C6AD0 |
| var_1D0 | Scope chain working set | sub_29C1190 |
| var_1A0 | DIVariable tracking | sub_29C1060 |
| var_170 | Scope-to-function mapping | -- |
| var_140 | DICompileUnit refs | -- |
| var_130 | Primary metadata node buffer | -- |

For each DICompileUnit operand, the pass walks the subprogram list and retained types, recording every metadata node in hash tables for O(1) identity comparison. The hash function is:

uint64_t hash = ((ptr >> 4) ^ (ptr >> 9)) & (bucket_count - 1);

This is the standard DenseMap pointer hash with LLVM-layer sentinels. See Hash Table and Collection Infrastructure for the complete specification.
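A sketch of the bucket computation (bucket_count must be a power of two, as DenseMap guarantees; the function name is ours):

```python
def densemap_bucket(ptr, bucket_count):
    """Pointer hash used for the metadata identity tables: shift out the
    low alignment bits, mix two shifts, mask into a power-of-two table."""
    assert bucket_count & (bucket_count - 1) == 0, "table size must be a power of two"
    return ((ptr >> 4) ^ (ptr >> 9)) & (bucket_count - 1)

# Bucket index for a 16-byte-aligned metadata node in a 64-entry table:
b = densemap_bucket(0x29C8510, 64)
assert 0 <= b < 64
```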

Phase 3: DISubprogram Iteration (0x29C82BE -- 0x29C84C8)

Walks the subprogram list attached to each compile unit via linked-list traversal ([node+8] = next pointer). For each subprogram, reads the metadata tag byte at [node-18h]:

| Tag byte | DWARF tag | Action |
|---|---|---|
| 0x54 ('T') | DW_TAG_template_parameter | Skip |
| 0x55 ('U') | Compile unit / subprogram variant | Special handling |
| 0x44 ('D') | DW_TAG_subprogram | Validate |
| 0x45 ('E') | DW_TAG_lexical_block | Validate scope chain |
| 0x46 ('F') | DW_TAG_lexical_block_file | Validate scope chain |
| 0x47 ('G') | DW_TAG_namespace | Validate scope chain |

The flag byte at [rdx+21h] & 0x20 tests the "definition" bit (only defined, non-declaration subprograms are tracked). Values outside 0x44--0x47 are flagged as invalid scope types.
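The dispatch can be modeled as a lookup table (a hand-written sketch of the logic above, not decompiled code):

```python
# Dispatch on the metadata tag byte at [node-0x18], mirroring the table above.
# Anything outside the handled set is an invalid scope type.
TAG_ACTIONS = {
    0x54: "skip",              # DW_TAG_template_parameter
    0x55: "special",           # compile unit / subprogram variant
    0x44: "validate",          # DW_TAG_subprogram
    0x45: "validate-scope",    # DW_TAG_lexical_block
    0x46: "validate-scope",    # DW_TAG_lexical_block_file
    0x47: "validate-scope",    # DW_TAG_namespace
}

DEFINITION_BIT = 0x20  # flag byte at [node+0x21]: only definitions are tracked

def classify(tag_byte, flag_byte):
    action = TAG_ACTIONS.get(tag_byte, "invalid-scope-type")
    if action == "validate" and not (flag_byte & DEFINITION_BIT):
        return "skip-declaration"  # declaration-only subprogram: not tracked
    return action

assert classify(0x44, 0x20) == "validate"
```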

Phase 4: Hash Table Construction (0x29C8508 -- 0x29C8AC2)

Allocates and populates eight sorted hash tables via sub_C7D670 (aligned_alloc, alignment=8), each holding 16-byte entries [pointer, secondary_key]:

| Object offset | Table contents | Purpose |
|---|---|---|
| +18h | DISubprogram | Function-level metadata |
| +28h | DIScope | Scope hierarchy |
| +48h | DIGlobalVariable | Module-level variables |
| +58h | DILocalVariable | Function-local variables |
| +78h | DIType | Type descriptions |
| +88h | DIImportedEntity | using declarations |
| +A8h | DILabel | Label metadata |
| +B8h | Retained nodes | Misc retained metadata |

The MDNode operand access pattern used during population:

// MDNode internal layout decoding (0x29C8508+)
byte flags = *(ptr - 0x10);
if (flags & 0x02) {          // distinct metadata
    operands = *(ptr - 0x20);  // operand array is before the node
} else {
    int count = (flags >> 2) & 0x0F;
    operands = ptr - 0x10 - (count * 8);  // inline operands
}

Phase 5: Per-Function Debug Variable Checking (0x29C8B3B -- 0x29C9060)

Iterates every function in the module. For each, looks up its DISubprogram in the hash table and cross-references dbg.value() / dbg.declare() intrinsics against the pre-snapshot. Two diagnostic levels:

ERROR (pass dropped a subprogram entirely):

ERROR: <pass> dropped DISubprogram of <function> from <file>
ERROR: <pass> did not generate DISubprogram for <function> from <file>

WARNING (pass dropped individual variable tracking):

WARNING: <pass> drops dbg.value()/dbg.declare() for <var> from function <func> (file <file>)

The distinction between "dropped" and "did not generate" is significant: "dropped" means metadata existed before the pass and was deleted; "not-generate" means the pass created new IR (e.g., from inlining or outlining) without attaching corresponding debug metadata. This taxonomy is important for GPU compilation because kernel outlining and device function inlining frequently create new IR nodes.

The variable name is resolved by:

  1. Getting DISubprogram from the metadata ref
  2. Calling sub_AF34D0 (DIScope::getScope()) to walk the scope chain upward
  3. Getting the file via operand [10h] of the scope's file ref
  4. Calling sub_B91420 (MDString::getString()) to convert MDString to StringRef

Phase 6: Per-Instruction Location Verification (0x29C8D42 -- 0x29C8D85)

Delegated to sub_29C3AB0 (5,592 bytes), which performs detailed checks:

  • Every instruction with a DebugLoc has a valid DILocation
  • DILocation scope chains resolve to a valid DISubprogram
  • No orphaned debug locations reference deleted subprograms
  • BB-level consistency: all instructions in a basic block share compatible scopes
  • Dropped location tracking: emits "dropped DILocation" diagnostics

The JSON output from this sub-pass uses structured field names: "DILocation", "bb-name", "fn-name", "action" (with values "drop" or "not-generate").

Phase 7: JSON Structured Output (0x29C90BC -- 0x29C94E2)

When a non-null JSON output stream is provided (the jsonOutput parameter), the pass serializes a structured report via sub_2241E40 (YAML/JSON serializer):

{"file":"kernel.cu", "pass":"instcombine", "bugs": [
  {"metadata":"DISubprogram", "name":"_Z6kernelPf", "fn-name":"_Z6kernelPf", "action":"drop"},
  {"metadata":"dbg-var-intrinsic", "name":"idx", "fn-name":"_Z6kernelPf", "action":"not-generate"}
]}

This JSON reporting mechanism is an NVIDIA extension with no upstream LLVM equivalent. It feeds into NVIDIA's internal CI infrastructure to track debug info quality regressions across compiler versions. The "no-name" string serves as fallback when the pass name pointer is NULL.

The serialization calls sub_CB7060 (YAML::IO constructor) and proceeds through sub_C6D380 (object emission), sub_C6C710 (array emission), and sub_C6B0E0 (key writer). After serialization, the stream is flushed via sub_CB7080 and freed via sub_CB5B00. If the file descriptor is valid (fd != -1), it is closed via sub_C837B0 (close(fd)).

Phase 8: Result Reporting and Metadata Reconstruction (0x29C94E2 -- 0x29C9A27)

Prints the summary line ("<pass>: PASS\n" or "<pass>: FAIL\n"), then reconstructs the module's metadata tables from the verified versions -- reallocating subprogram, type, variable, label, and global variable arrays and copying verified metadata back into the compile unit structures.

The result is a 3-way outcome in bit flags (combined at 0x29C9073--0x29C9080 via AND):

  • Bit 0: any verification failure (determines PASS/FAIL)
  • Bit 1: JSON report was requested and successfully written

The final result is PASS only if all sub-checks passed AND the JSON report (if requested) was successfully written.

Cleanup frees all eight temporary hash tables (each via sub_C7D6A0, a sized dealloc with alignment 8) and the linked-list nodes via j_j___libc_free_0; SmallVector inline buffers are detected by pointer comparison (if ptr == stack_addr, the free is skipped).

Phase 9: Return (0x29C9A12 -- 0x29C9A27)

Returns var_420 (bool) in the al register. Standard epilog restores rbx, r12--r15, rbp.

Complete Diagnostic Code Table

Every diagnostic string emitted by the debug verification subsystem, with exact provenance and trigger conditions.

Verification Pass Diagnostics (sub_29C8000)

| # | Severity | Diagnostic string | Trigger condition | Address range |
|---|---|---|---|---|
| D01 | INFO | ": Skipping module without debug info\n" | "llvm.dbg.cu" absent or empty | 0x29C8000--0x29C807A |
| D02 | ERROR | "ERROR: <pass> dropped DISubprogram of <func> from <file>\n" | DISubprogram existed pre-pass, absent post-pass | 0x29C8C08--0x29C8D2E |
| D03 | ERROR | "ERROR: <pass> did not generate DISubprogram for <func> from <file>\n" | New function has no DISubprogram | 0x29C8C08--0x29C8D2E |
| D04 | WARNING | "WARNING: <pass> drops dbg.value()/dbg.declare() for <var> from function <func> (file <file>)\n" | Variable intrinsics lost for a tracked variable | 0x29C8E4E--0x29C9060 |
| D05 | SUMMARY | "<pass>: PASS\n" | All checks passed | 0x29C94E2+ |
| D06 | SUMMARY | "<pass>: FAIL\n" | Any check failed | 0x29C94E2+ |
| D07 | ERROR | "Could not open file: <path>\n" | JSON report file I/O failure | 0x29C90BC--0x29C94E2 |

Per-Instruction Verifier Diagnostics (sub_29C3AB0)

| # | Severity | Diagnostic string | Trigger condition |
|---|---|---|---|
| D08 | ERROR | "<pass> dropped DILocation" | Instruction had DILocation pre-pass, absent post-pass |
| D09 | ERROR | "<pass> did not generate DISubprogram" | DILocation references nonexistent subprogram |
| D10 | ERROR | (scope chain invalid) | DILocation scope chain does not resolve to a valid DISubprogram |
| D11 | WARNING | (BB inconsistency) | Instructions within a basic block reference incompatible scopes |

JSON Report Field Schema

| # | Field key | Type | Values | Context |
|---|---|---|---|---|
| J01 | "file" | string | Source filename | Top-level report |
| J02 | "pass" | string | Pass name, or "no-name" if NULL | Top-level report |
| J03 | "bugs" | array | Array of bug objects | Top-level report |
| J04 | "metadata" | string | "DISubprogram", "dbg-var-intrinsic", "DILocation" | Per-bug object |
| J05 | "name" | string | Entity name (function or variable) | Per-bug object |
| J06 | "fn-name" | string | Containing function name | Per-bug object |
| J07 | "bb-name" | string | Basic block name | Per-bug object (location bugs) |
| J08 | "action" | string | "drop" or "not-generate" | Per-bug object |

Action Value Taxonomy

| Action | Meaning | Common cause in GPU compilation |
|---|---|---|
| "drop" | Pass explicitly or inadvertently deleted existing debug metadata | Dead code elimination removing a function with debug info |
| "not-generate" | Pass created new IR without attaching corresponding debug metadata | Kernel outlining, device function inlining, or loop transformation creating new BBs |

String Encoding Details

Several diagnostic strings are constructed inline using immediate mov instructions rather than string table references:

| String | Encoding | Instruction |
|---|---|---|
| "ERRO" | 0x4F525245 | mov dword [rsp+X], 0x4F525245 |
| "R:" | 0x3A52 | mov word [rsp+X+4], 0x3A52 |
| "WARNING:" | 0x3A474E494E524157 | mov qword [rsp+X], 0x3A474E494E524157 |

These inline immediate constructions avoid string table lookups and are a common LLVM raw_ostream optimization for short fixed strings.
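The immediates are simply the little-endian byte encodings of the strings, which is easy to confirm:

```python
# Each mov immediate above is the string's bytes read as a little-endian integer.
assert int.from_bytes(b"ERRO", "little") == 0x4F525245
assert int.from_bytes(b"R:", "little") == 0x3A52
assert int.from_bytes(b"WARNING:", "little") == 0x3A474E494E524157
```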

Compile Unit Descriptor Layout

The verification pass reads and reconstructs a per-CU descriptor object (referenced at [rbp+var_440]) with the following layout:

| Offset | Type | Contents | Copy helper |
|---|---|---|---|
| +08h | void** | Subprogram array data pointer | -- |
| +10h | void** | Subprogram array end pointer | -- |
| +18h | size_t | Subprogram count | -- |
| +20h | void* | Scope chain data | -- |
| +28h | size_t | Scope chain count | -- |
| +38h | void** | Global variable array data | -- |
| +40h | void** | Global variable array end | -- |
| +48h | size_t | Global variable count | -- |
| +50h | void* | Local variable list head | -- |
| +58h | size_t | Local variable count | -- |
| +68h | void** | Type array data | -- |
| +70h | void** | Type array end | -- |
| +78h | size_t | Type count | -- |
| +80h | void* | Imported entities list | sub_29C2230 (32-byte node deep copy) |
| +88h | size_t | Imported entities count | -- |
| +98h | void** | Label array data | -- |
| +A0h | void** | Label array end | -- |
| +A8h | size_t | Label count | -- |
| +B0h | void* | Retained nodes list | sub_29C0F30 |
| +B8h | size_t | Retained nodes count | -- |

DISubprogram Node Layout

Accessed during Phase 3 scope chain validation:

| Offset | Type | Contents |
|---|---|---|
| [node-38h] | void* | Pointer to compile unit / parent scope |
| [node-18h] | byte | Metadata tag byte (DWARF tag discriminator) |
| [node-14h] | uint32 | Flags field (lower 27 bits = operand index) |
| [node+08h] | void* | Next pointer in linked list |
| [node+18h] | void* | Linked list head for child scopes |
| [node+20h] | void* | Linked list tail for child scopes |
| [node+28h] | void* | Variable attachment (DIVariable list) |
| [node+38h] | void* | Additional metadata ref |
| [node+48h] | void* | Subprogram scope list head |
| [node+50h] | void* | Subprogram scope list tail |

Debugify Injector (sub_29C1CB0)

The Debugify injector creates synthetic debug metadata to test whether optimization passes preserve debug info correctly. It is the counterpart to the verifier -- the injector sets up the watermarks, and the verifier checks them.

Named metadata markers:

  • "llvm.debugify" -- marks the module as containing synthetic debug info (standard Debugify)
  • "llvm.mir.debugify" -- marks MIR-level synthetic debug info

Behavior controlled by debugify-level:

  • locations -- inject only DILocation on every instruction (cheaper, tests location preservation)
  • location+variables -- inject DILocation plus synthetic dbg.value()/dbg.declare() for every SSA value (full coverage, higher overhead)

The injector assigns monotonically increasing line numbers to every instruction and creates one DILocalVariable per SSA value that produces a result. The variable names follow the pattern "dbg_var_N" where N is the SSA value index. After injection, the module has guaranteed 100% debug coverage, making any coverage loss attributable to the subsequent optimization pass.
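A behavioral sketch of that injection scheme (our model; the real injector operates on LLVM IR and attaches DILocation/DILocalVariable metadata rather than dict entries):

```python
def debugify(instructions):
    """Assign a monotonically increasing line number to every instruction
    and a synthetic "dbg_var_N" variable to every value-producing one,
    as described above. Input: (name, produces_value) pairs."""
    locations, variables = {}, {}
    next_line, next_var = 1, 0
    for inst, produces_value in instructions:
        locations[inst] = next_line          # every instruction gets a line
        next_line += 1
        if produces_value:                   # one variable per SSA result
            variables[inst] = f"dbg_var_{next_var}"
            next_var += 1
    return locations, variables

locs, dbg_vars = debugify([("load", True), ("store", False), ("add", True)])
assert locs == {"load": 1, "store": 2, "add": 3}
assert dbg_vars == {"load": "dbg_var_0", "add": "dbg_var_1"}
```

After this, any instruction later found without a location (or any variable without intrinsics) is provably a loss caused by the pass under test.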

Verbosity Control

Two global flags provide fine-grained control over verification output:

qword_5008FC8 -- Verbose Diagnostic Output Enable

Boolean flag (byte). Controls the output stream selection:

  • When 0: uses sub_CB72A0 (null/discard stream constructor) -- diagnostics silently discarded
  • When non-zero: uses sub_CB7330 (stderr stream accessor) -- diagnostics printed to stderr

This flag gates the ERROR and WARNING messages. The JSON structured output is controlled separately by the jsonOutput parameter. Setting qword_5008FC8 = 0 suppresses text diagnostics while still producing JSON output.

qword_5008C88 -- Metadata Depth Threshold

Signed 32-bit integer, read at 0x29C8371. Controls how deep the scope chain walk goes:

  • When <= 0: the deep scope chain walk is skipped for non-subprogram metadata. Only top-level DISubprogram validation runs.
  • When > 0: full scope chain traversal validates every DILexicalBlock, DILexicalBlockFile, and DINamespace in the hierarchy.

This allows production builds to run lightweight verification (subprogram-only) while development builds run exhaustive scope chain checking.

Debugify-Specific Knobs

| Knob | Type | Default | Registration | Effect |
|---|---|---|---|---|
| debugify-quiet | bool | off | ctor_493 at 0x556960 | Suppress all debugify text output |
| debugify-func-limit | int | unlimited | ctor_493 at 0x556960 | Max functions to inject synthetic debug info into |
| debugify-level | enum | location+variables | ctor_493 at 0x556960 | locations or location+variables |
| debugify-function | string | -- | ctor_493 at 0x556960 | Restrict debugify to a single named function |
| check-debugify-function | string | -- | ctor_493 at 0x556960 | Restrict check-debugify to a single named function |
| debugify-each | bool | off | ctor_377 at 0x516190 | Wrap every pass in debugify/check-debugify |
| debugify-export | string | -- | ctor_377 at 0x516190 | Export debugify results to file |

GPU Debug Info: What PTX Needs

DWARF for PTX differs fundamentally from DWARF for x86. PTX is a virtual ISA -- there are no physical registers, no real stack, and no fixed instruction encoding. The debug metadata cicc emits serves two consumers: cuda-gdb (which maps PTX locations back to source) and ptxas (which carries debug info forward into SASS/ELF for the hardware debugger).

The .loc Directive

The AsmPrinter (sub_31D55F0) emits DWARF .loc directives before each PTX instruction that has a valid DebugLoc:

.loc 1 42 0          // file 1, line 42, column 0
ld.param.u64 %rd1, [_Z6kernelPf_param_0];
.loc 1 43 5
mul.wide.u32 %rd2, %r1, 4;

The .file directives (sub_31E4280) establish the file table, and sub_31E6100 maintains a file/line-to-MCSymbol mapping for line table construction.

The dwarf-extended-loc knob (enum: Default/Enable/Disable, registered in the 0x490000 range) controls whether extended flags appear in .loc directives. When disabled, cicc emits bare .loc file line column without the is_stmt, prologue_end, or discriminator extensions. This matters because older ptxas versions do not parse extended .loc flags.

The line-info-inlined-at Extension

The -line-info-inlined-at LLVM knob (registered at ctor_043 / 0x48D7F0, exposed as -no-lineinfo-inlined-at in the cicc CLI, which sets -line-info-inlined-at=0 on the backend) controls whether inlined-at chains are preserved in PTX line info. When enabled (the default), every .loc directive for inlined code carries the full inlining chain so cuda-gdb can reconstruct the call stack at any point in the inlined code. When disabled, only the immediate source location is emitted, losing the inlining context but producing smaller PTX.
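PTX's extended .loc syntax carries this chain explicitly. A hand-written illustration (the label name after function_name is hypothetical; real output references an entry in the debug string table):

```
.loc 1 12 0, function_name $L__info_string0, inlined_at 1 42 5
mul.f32 %f2, %f1, %f1;   // body of an inlined helper; call site at line 42, col 5
```

With the knob disabled, the same instruction would carry only `.loc 1 12 0`.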

The -show-src / nvptx-emit-src Feature

The -show-src CLI flag (stored at flag struct offset +808, routed to the backend as -nvptx-emit-src) enables source line interleaving in PTX output. When active, the AsmPrinter annotates each .loc directive with the corresponding source line as a PTX comment:

// kernel.cu:42    float val = input[idx];
.loc 1 42 0
ld.global.f32 %f1, [%rd2];
// kernel.cu:43    val = val * val;
.loc 1 43 0
mul.f32 %f2, %f1, %f1;

This is purely a readability feature for developers inspecting PTX output. It has no effect on cuda-gdb or debug quality -- the source text is embedded as comments that ptxas ignores.

NvvmDebugVersion

The NVVM container format includes a debug version field (NvvmDebugVersion, packed as {Major:uint16, Minor:uint16} at container offsets 0x08--0x0B). The current version is Major=3, Minor<=2. The reader (sub_CD41B0) validates that Major equals 3 and warns if Minor exceeds 2. If absent, the default {3, 2} is assumed. This version tracks the debug metadata schema independently of the NVVM IR version, allowing debug format evolution without breaking IR compatibility.

The standalone pipeline (sub_12BFF60) performs a consistency check: if the container declares debug_info_present (bit 4 of flags) AND the debug mode flag is set AND the debug version has not been validated, it returns error code 3 (incompatible).
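
Both checks can be sketched as follows (the type, enum, and function names are illustrative, not recovered symbols; the recovered logic lives in sub_CD41B0 and sub_12BFF60):

```cpp
#include <cstdint>

// Hypothetical names modeling the recovered checks.
struct NvvmDebugVersion { uint16_t Major; uint16_t Minor; };

enum class DebugVersionStatus { Ok, MinorTooNew, Incompatible };

// sub_CD41B0: Major must equal 3; Minor > 2 only warns.
DebugVersionStatus checkDebugVersion(NvvmDebugVersion V) {
    if (V.Major != 3)
        return DebugVersionStatus::Incompatible;
    if (V.Minor > 2)
        return DebugVersionStatus::MinorTooNew;  // warn, but accept
    return DebugVersionStatus::Ok;
}

// sub_12BFF60: error code 3 when debug info is declared but unvalidated.
int checkContainerConsistency(uint32_t flags, bool debugMode, bool versionValidated) {
    const bool debugInfoPresent = (flags >> 4) & 1;  // bit 4 of flags
    if (debugInfoPresent && debugMode && !versionValidated)
        return 3;  // incompatible
    return 0;
}
```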

DbgRecord Format (LLVM 20)

cicc v13.0 uses LLVM 20's DbgRecord format by default (write-experimental-debuginfo = true, registered at ctor_025). This replaces traditional dbg.value()/dbg.declare() intrinsics with non-intrinsic debug records attached directly to instructions. Related knobs:

Knob | Default | Registration | Effect
write-experimental-debuginfo | true | ctor_025 | Use DbgRecord format for new debug info
write-experimental-debuginfo-iterators-to-bitcode | true | ctor_018 | Serialize DbgRecords to bitcode
preserve-input-debuginfo-format | false | ctor_018 | When true, preserve whichever format the input uses

The verifier handles both formats: it checks for dbg.value()/dbg.declare() intrinsics AND for DbgRecord attachments.

Debug Info Stripping Passes

cicc registers five stripping passes in the pipeline parser (at sub_12C6910 and related); the four with recovered pipeline names are:

Pipeline name | Slot | LLVM pass | Effect
"strip-dead-debug-info" | #110 | StripDeadDebugInfoPass | Remove debug info for dead functions/globals
"strip-debug-declare" | #112 | StripDebugDeclarePass | Remove dbg.declare() intrinsics only
"strip-nondebug" | #113 | StripNonDebugSymbolsPass | Remove non-debug symbols (keep debug)
"strip-nonlinetable-debuginfo" | #114 | StripNonLineTableDebugInfoPass | Strip everything except line tables

The strip-nonlinetable-debuginfo pass is the key one for the -generate-line-info mode: it strips all debug metadata except .loc / .file directives, producing line-number-only debug info without variable locations, type descriptions, or scope trees. This is what nvcc's --generate-line-info flag triggers -- enough for profiler source correlation but not enough for stepping through code in cuda-gdb.

The core debug info stripping implementation lives at 0xAE0000 (Zone 3 of the type system module), which calls stripDebugInfo() to remove all llvm.dbg.* intrinsics from the module.

Debug Compilation Modes

cicc supports three debug info levels, controlled by CLI flags that route through the flag dispatch table:

CLI flag | Flag offset | Backend routing | Debug level
-g | +296 | -debug-compile to both linker and optimizer | Full debug info (FullDebug emission kind)
-generate-line-info | +328 | -generate-line-info to optimizer only | Line tables only (LineTablesOnly emission kind)
(neither) | -- | -- | No debug info (NoDebug)

When -g is active, cicc emits DICompileUnit with full emission kind, preserves all DISubprogram, DILocalVariable, DIType, and scope metadata through the pipeline, and the backend emits complete DWARF sections. The verifier runs at full depth.

When -generate-line-info is active, the StripNonLineTableDebugInfoPass runs early in the pipeline, leaving only line table metadata. The verifier still runs but only checks DILocation / DISubprogram consistency (variable checks are skipped because the variable metadata was intentionally stripped).

Key routing difference: -g routes to BOTH the linker (-debug-compile) and optimizer (-debug-compile), because libdevice linking needs the debug flag to preserve user debug info during merging. -generate-line-info routes to the optimizer only.
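
The routing rule can be expressed as a small dispatch. This is a sketch of the behavior described above; the struct and function names are invented:

```cpp
#include <string>
#include <vector>

// Hypothetical model of the flag dispatch table's debug routing.
struct BackendFlags {
    std::vector<std::string> linkerFlags;
    std::vector<std::string> optimizerFlags;
};

BackendFlags routeDebugFlags(bool debugCompile, bool generateLineInfo) {
    BackendFlags out;
    if (debugCompile) {
        // -g: libdevice linking must also see the debug flag so user
        // debug info survives module merging.
        out.linkerFlags.push_back("-debug-compile");
        out.optimizerFlags.push_back("-debug-compile");
    } else if (generateLineInfo) {
        // -generate-line-info: optimizer only.
        out.optimizerFlags.push_back("-generate-line-info");
    }
    return out;
}
```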

The frontend uses two independent guard mechanisms for debug emission:

  • dword_4D046B4 -- global flag checked at statement/parameter level by sub_9433F0 (per-param debug), sub_943430 (per-global debug)
  • [ctx+0x170] -- compile unit pointer checked at module finalization level by sub_915400

The NVVM container carries a dedicated DebugInfo enum (3 values: NONE, LINE_INFO, DWARF) at deserialized struct offset +12, separate from the module metadata.

Complete Knob Reference

Knob | Type | Default | Registration | Effect
-g / -debug-compile | bool | off | ctor_043 at 0x48D7F0 | Full debug compilation
-generate-line-info | bool | off | ctor_043 at 0x48D7F0 | Line tables only
-no-lineinfo-inlined-at | bool | off | CLI flag dispatch | Disable inlined-at tracking (sets -line-info-inlined-at=0)
-show-src / -nvptx-emit-src | bool | off | Flag offset +808 | Interleave source in PTX comments
dwarf-extended-loc | enum | Default | 0x490000 range | Default/Enable/Disable extended .loc flags
dwarf-version | unsigned | (platform) | LLVM default | DWARF version for debug sections
debugify-each | bool | off | ctor_377 at 0x516190 | Run Debugify+CheckDebugify around every pass
debugify-level | enum | location+variables | ctor_493 at 0x556960 | locations or location+variables
debugify-quiet | bool | off | ctor_493 at 0x556960 | Suppress debugify diagnostics
debugify-func-limit | int | unlimited | ctor_493 at 0x556960 | Max functions to debugify
debugify-function | string | -- | ctor_493 at 0x556960 | Restrict debugify to named function
check-debugify-function | string | -- | ctor_493 at 0x556960 | Restrict check-debugify to named function
debugify-export | string | -- | ctor_377 at 0x516190 | Export debugify results to file
verify-each | bool | off | ctor_043 at 0x48D7F0 | Run IR verifier after every pass
verify-after-all | alias | -- | ctor_043 at 0x48D7F0 | Alias for verify-each
verify-debuginfo-preserve | bool | off | ctor_376 at 0x512DF0 | Enable debug info preservation checking
verify-each-debuginfo-preserve | bool | off | ctor_377 at 0x516190 | Per-pass debug info preservation
verify-di-preserve-export | string | -- | ctor_377 at 0x516190 | Export preservation results to file
no-inline-line-tables | bool | off | sub_29E2B40 | Prevent inlining from merging line tables
write-experimental-debuginfo | bool | true | ctor_025 | Use DbgRecord format
preserve-input-debuginfo-format | bool/default | false | ctor_018 | Preserve input debug format
qword_5008FC8 | bool | off | -- | Verbose diagnostic output enable
qword_5008C88 | int32 | >0 | -- | Metadata depth threshold (<=0 skips deep scope walk)
CAN_FINALIZE_DEBUG | env var | -- | sub_60F290 et al. | Debug finalization control
NVVM_IR_VER_CHK | env var | enabled | sub_12BFF60 | Override debug version checking (set "0" to disable)

DWARF Emission Backend

The actual DWARF section emission lives in a separate module at 0x3990000--0x39DF000:

Address | Size | Function
sub_399B1E0 | 29 KB | DwarfDebug::beginModule() -- initializes from llvm.dbg.cu
sub_3997B50 | 33 KB | .debug_aranges emission
sub_399D1D0 | 12 KB | Range list emission (DW_RLE_*)
sub_399EB70 | 12 KB | Register location expressions
sub_39BDF60 | 38 KB | .debug_names accelerator table
sub_39B6390 | 33 KB | DWARF form size calculator
sub_215ACD0 | 8.1 KB | Module-level emission entry (NVPTX Debug Info Emission)

The module-level entry sub_215ACD0 checks *(a1+240)->field_344 to determine if DWARF is enabled, then looks up the "NVPTX DWARF Debug Writer" / "NVPTX Debug Info Emission" pass info. The NVPTX backend does not emit physical register locations (GPUs have no DWARF register numbering scheme that maps to hardware); instead, it emits virtual register references that cuda-gdb resolves through ptxas's SASS-level debug info.

Function Map

Function | Address | Size
"llvm.global_ctors" utility | sub_29C00F0 | --
errs() diagnostic output stream accessor | sub_29C0AE0 | --
PassManager / PassAdaptor infrastructure ("PassManager", "PassAdaptor") | sub_29C0DC0 | --
Copy retained-nodes list (SmallVector deep copy) | sub_29C0F30 | --
Copy local-variable list | sub_29C1060 | --
Copy scope-chain list | sub_29C1190 | --
Validate scope chain connectivity | sub_29C12C0 | --
Debugify synthetic debug info injector ("llvm.debugify", "llvm.mir.debugify") | sub_29C1CB0 | --
Merge/update tracking sets after verification | sub_29C1F00 | --
Serialize verification result to stream | sub_29C20D0 | --
Copy imported-entities list (32-byte node deep copy) | sub_29C2230 | --
Per-instruction DILocation verifier | sub_29C3AB0 | 5,592 B
DenseMap::FindAndConstruct for tracking map | sub_29C5270 | --
Set insert with metadata key normalization | sub_29C6AD0 | --
Set insert variant (different key extraction) | sub_29C6DE0 | --
Debug info verification pass (main entry) | sub_29C8000 | 12,480 B
no-inline-line-tables flag handler | sub_29E2B40 | --
NewPMCheckDebugifyPass wrapper | sub_22702B0 | --
NewPMDebugifyPass wrapper | sub_2270390 | --
VerifierPass wrapper (standard IR verifier) | sub_2270470 | --
Pass pipeline text parser | sub_2272BE0 | 14 KB
buildDefaultPipeline() equivalent | sub_2277440 | 60 KB
Flag filter (checks -debug-compile, -g, -generate-line-info) | sub_12C6910 | --
Emit per-instruction .loc DWARF directive | sub_31D55F0 | --
Emit .file/.loc directives (function scope) | sub_31E4280 | --
insertDebugLocEntry (file/line to symbol mapping) | sub_31E6100 | --
DwarfDebug::beginModule() | sub_399B1E0 | 29 KB
.debug_aranges emission | sub_3997B50 | 33 KB
Module-level emission entry / NVPTX Debug Info Emission | sub_215ACD0 | 8.1 KB
NVVM IR version + debug version validator | sub_12BFF60 | ~9 KB
NVVM container debug version check | sub_CD41B0 | --
Emit DILocalVariable for parameter (frontend) | sub_9433F0 | --
Emit debug info for GlobalVariable (frontend) | sub_943430 | --
Set DebugLoc from EDG source position (frontend) | sub_941230 | --
Finalize: "Debug Info Version" = 3 (frontend) | sub_915400 | --

LLVM Infrastructure Functions Used

Address | Identity | Called from
sub_BA8DC0 | Module::getNamedMetadata(StringRef) | Phase 1
sub_B2FC80 | isa<DISubprogram> or similar MDNode type check | Phase 3
sub_B2FC00 | MDNode type check (different metadata kind) | Phase 3
sub_B92180 | MDNode::getContext() | Phase 4
sub_B91420 | MDString::getString() | Phase 5
sub_B91A10 | MDNode::getOperand(unsigned) | Phase 4
sub_B14240 | MDNode operand range iterator | Phase 4
sub_AF34D0 | DIScope::getScope() -- walk scope chain upward | Phase 5
sub_AF4500 | DISubprogram::describes(Function) | Phase 5
sub_B58DC0 | DenseSet::insert | Phase 2
sub_B96E90 | DenseMap::insert_or_assign | Phase 4
sub_B91220 | DenseMap::erase | Phase 8
sub_C7D670 | aligned_alloc(size, alignment=8) | Phase 4
sub_C7D6A0 | aligned_free_sized(ptr, size, alignment=8) | Phase 8
sub_CB7330 | errs() -- get stderr raw_ostream | Phase 5
sub_CB72A0 | nulls() -- get null/discard raw_ostream | Phase 5 (quiet mode)
sub_CB6200 | raw_ostream::write(const char*, size_t) | Phase 5, 7
sub_CB5D20 | raw_ostream::write(char) | Phase 5
sub_CB5B00 | raw_ostream destructor / free | Phase 7
sub_CB7060 | YAML::IO output constructor | Phase 7
sub_CB7080 | raw_ostream::flush() | Phase 7

NVIDIA Modifications vs Stock LLVM

The key differences from upstream LLVM's CheckDebugInfoPass:

  1. JSON structured output -- Upstream only prints text diagnostics. NVIDIA added a YAML/JSON serializer (sub_2241E40, sub_CB7060) that produces machine-parseable bug reports with "file", "pass", "bugs" fields and per-bug "action" classification ("drop" vs "not-generate").

  2. Verbosity control -- Two global flags (qword_5008FC8 for output enable, qword_5008C88 for depth threshold) allow fine-grained control over verification overhead. Upstream has only the debugify-quiet knob.

  3. Eight-table metadata tracking -- Upstream CheckDebugInfoPass tracks DISubprograms and debug variable intrinsics. NVIDIA's version maintains eight separate hash tables covering subprograms, scopes, global variables, local variables, types, imported entities, labels, and retained nodes -- a much more comprehensive snapshot.

  4. Metadata reconstruction -- After verification, NVIDIA's pass reconstructs the module's metadata tables from the verified versions (Phase 8), which upstream does not do. This means the verifier can also serve as a "repair" pass that normalizes metadata after an optimization pass corrupts it.

  5. No kernel-specific handling -- The verifier treats __global__ and __device__ functions identically. CUDA-specific debug info (address space annotations, shared memory debug, warp-level location info) is validated elsewhere, likely during NVPTX backend emission.

  6. DbgRecord format support -- cicc v13.0 defaults to the LLVM 20 DbgRecord format (write-experimental-debuginfo = true), so the verifier handles both intrinsic-based and record-based debug info transparently.
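
The structured bug report from point 1 plausibly looks like the fragment below. Only the field names ("file", "pass", "bugs", "action") and the two action values come from recovered strings; the nesting and the per-bug keys are guesses:

```yaml
# Hypothetical layout; only the field names are recovered from the binary.
file: kernel.cu
pass: SROAPass
bugs:
  - action: drop          # a location existed before the pass and is gone after
    variable: val
  - action: not-generate  # the pass created an instruction with no location
    instruction: mul
```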

Cross-References

Bitcode Reader/Writer

CICC v13.0 contains the complete LLVM 20.0.0 bitcode serialization infrastructure -- reader, writer, metadata loader, module summary IO, and the full intrinsic upgrader -- spread across two address ranges. The 0x9F0000--0xA2FFFF range hosts a first copy of the bitcode reader/writer core used by the standalone libNVVM pipeline, while the 0x1500000--0x157FFFF range hosts the primary copy used by the two-phase compilation path. Both copies are structurally identical LLVM BitcodeReader.cpp and BitcodeWriter.cpp compiled at different link addresses. The reader is stock upstream LLVM 20.0.0 with no NVIDIA modifications to the deserialization logic itself. The writer, however, contains a single critical NVIDIA change: it stamps "LLVM7.0.1" as the bitcode producer identification string rather than the true "LLVM20.0.0", preserving backward compatibility with the NVVM IR ecosystem.

The bitcode subsystem sits at the boundary between all pipeline stages. The standalone pipeline validates magic bytes on entry, the module linker reads bitcode from separate compilation objects, the two-phase orchestrator serializes per-function bitcode blobs between Phase I and Phase II, and the NVVM container wraps bitcode payloads in a proprietary envelope. Every bitcode load also runs the intrinsic upgrader -- a 700+ KB AutoUpgrade subsystem that includes roughly 240 KB of effectively-dead x86 intrinsic renaming tables.

Key Facts

Property | Value
Reader (primary copy) | sub_151B070 (0x151B070, 123 KB) -- parseFunctionBody
Reader (standalone copy) | sub_9F2A40 (0x9F2A40, 185 KB) -- parseFunctionBody
Writer | sub_1538EC0 (0x1538EC0, 58 KB) -- writeModule
Metadata reader | sub_A09F80 (0xA09F80, 121 KB) -- MetadataLoader::parseOneMetadata
X86 AutoUpgrade (name) | sub_156E800 (0x156E800, 593 KB) -- UpgradeIntrinsicFunction
X86 AutoUpgrade (call) | sub_A939D0 (0xA939D0, 457 KB) -- UpgradeIntrinsicCall
NVVM version checker | sub_157E370 (0x157E370, 7 KB)
NVVM version checker (standalone) | sub_12BFF60 (0x12BFF60, 9 KB)
Producer init (ctor_036) | 0x48CC90 (544 bytes) -- reads LLVM_OVERRIDE_PRODUCER
Producer init (ctor_154) | 0x4CE640 (215 bytes) -- reads LLVM_OVERRIDE_PRODUCER
Address range (primary) | 0x1500000--0x157FFFF
Address range (standalone copy) | 0x9F0000--0xA2FFFF
Address range (AutoUpgrade) | 0xA80000--0xABFFFF
Hardcoded producer string | "LLVM7.0.1" (writer), "20.0.0" (internal fallback)
NVVM IR version gate | major == 3, minor <= 2
Upstream source | lib/Bitcode/Reader/BitcodeReader.cpp, lib/Bitcode/Writer/BitcodeWriter.cpp, lib/IR/AutoUpgrade.cpp

Bitcode Format Basics

LLVM bitcode uses two magic signatures. The pipeline validates both at module load time:

Magic Bytes | Meaning | Where Checked
0x42 0x43 0xC0 0xDE ('B' 'C' 0xC0 0xDE) | Raw LLVM bitcode stream | sub_12C06E0 (module linker)
0xDE 0xC0 0x17 0x0B (0x0B17C0DE little-endian) | Bitcode wrapper format (offset + size header around raw stream) | Same function

If neither signature matches, the pipeline sets *error_code = 9 ("invalid bitcode") and aborts. The wrapper format is more common in practice -- nvcc generates wrapper-format .bc files that embed the raw stream at an offset specified in the wrapper header. The wrapper header is 20 bytes:

struct BitcodeWrapperHeader {
    uint32_t magic;       // 0x0B17C0DE (bytes 0xDE 0xC0 0x17 0x0B on disk)
    uint32_t version;     // wrapper version (0)
    uint32_t offset;      // byte offset to raw bitcode within file
    uint32_t size;        // size of raw bitcode in bytes
    uint32_t cpu_type;    // target CPU type (0 for NVPTX)
};

After magic validation, the bitstream enters the block-structured reader. LLVM bitcode is organized into nested blocks, each identified by a block ID. The reader uses abbreviation tables (defined in BLOCKINFO blocks) to decode records within each block efficiently using variable-bit-rate (VBR) encoding.

An epoch check runs after magic validation: "Incompatible epoch: Bitcode '<X>' vs current: '<Y>'". This ensures the bitcode was produced by a compatible LLVM generation.
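
Upstream LLVM defines the raw-stream magic as the bytes 'B' 'C' 0xC0 0xDE and the wrapper magic as the 32-bit value 0x0B17C0DE. A sketch of the dispatch (the function and enum names are illustrative; the recovered check lives in sub_12C06E0):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative model of the magic-byte classification.
enum class BitcodeKind { Raw, Wrapper, Invalid };

BitcodeKind classifyBitcode(const uint8_t *buf, size_t len) {
    if (len < 4) return BitcodeKind::Invalid;
    // Raw stream: 'B' 'C' 0xC0 0xDE
    if (buf[0] == 0x42 && buf[1] == 0x43 && buf[2] == 0xC0 && buf[3] == 0xDE)
        return BitcodeKind::Raw;
    // Wrapper: 0x0B17C0DE stored little-endian
    if (buf[0] == 0xDE && buf[1] == 0xC0 && buf[2] == 0x17 && buf[3] == 0x0B)
        return BitcodeKind::Wrapper;
    return BitcodeKind::Invalid;  // pipeline sets *error_code = 9 here
}
```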

Bitcode Reader

Module Parser (sub_1505110, 60 KB)

The top-level entry reads MODULE_BLOCK records from the bitcode stream. It processes:

  • Global variable declarations and definitions
  • Function declarations (bodies are deferred for lazy materialization)
  • Calling conventions and comdat groups
  • Module-level metadata, type tables, and value symbol tables
  • Data layout and target triple strings

Error strings: "Invalid calling convention ID", "Invalid function comdat ID", "Invalid global variable comdat ID", "Invalid type for value".

parseFunctionBody (sub_151B070 / sub_9F2A40)

The function body parser is the largest single reader function. The standalone copy sub_9F2A40 is 185 KB (5,706 decompiled lines) with 174 error string references. The primary copy sub_151B070 is 123 KB. Both decode the same FUNCTION_BLOCK records:

  • 57 FUNC_CODE instruction record types (switch cases 1--65), covering every LLVM IR opcode: INST_BINOP, INST_CAST, INST_GEP, INST_SELECT, INST_CMP, INST_RET, INST_BR, INST_SWITCH, INST_INVOKE, INST_CALL, INST_UNREACHABLE, INST_PHI, INST_ALLOCA, INST_LOAD, INST_STORE, INST_ATOMICRMW, INST_CMPXCHG, INST_FENCE, INST_EXTRACTVAL, INST_INSERTVAL, INST_LANDINGPAD, INST_RESUME, INST_CLEANUPPAD, INST_CATCHPAD, INST_CATCHSWITCH, INST_CALLBR, INST_FREEZE, and others.
  • 4 nested sub-blocks: constants (0xB), metadata (0xE), use-list order (0x10), operand bundles (0x12).
  • 53 unique error strings including: "Alignment value is too large", "Invalid record", "Invalid record: Unsupported version of DISubrange", "METADATA_NAME not followed by METADATA_NAMED_NODE".

For each INST_CALL record, the reader calls into the AutoUpgrade machinery to rename deprecated intrinsics. This is the hook that triggers the 700+ KB x86 upgrader on every call instruction -- even though the upgrader's x86 branches are dead code for NVPTX targets.

Pseudocode for the top-level body parse loop:

Error parseFunctionBody(Function *F) {
    SmallVector<uint64_t, 64> Record;
    while (true) {
        BitstreamEntry Entry = Stream.advance();
        switch (Entry.Kind) {
        case BitstreamEntry::Error:
            return error("Malformed block");
        case BitstreamEntry::EndBlock:
            return resolveForwardRefs();
        case BitstreamEntry::SubBlock:
            switch (Entry.ID) {
            case CONSTANTS_BLOCK_ID:  // 0xB
                parseConstants(); break;
            case METADATA_BLOCK_ID:   // 0xE
                parseMetadataAttachment(); break;
            case USELIST_BLOCK_ID:    // 0x10
                parseUseListBlock(); break;
            case OPERAND_BUNDLE_TAGS_BLOCK_ID: // 0x12
                parseOperandBundleTags(); break;
            }
            break;
        case BitstreamEntry::Record:
            unsigned Code = Stream.readRecord(Entry.ID, Record);
            switch (Code) {
            case FUNC_CODE_INST_BINOP: /* ... */ break;
            case FUNC_CODE_INST_CAST:  /* ... */ break;
            // ... 55 more cases ...
            case FUNC_CODE_INST_CALL:
                // Parse callee, args, calling convention
                // If callee is intrinsic:
                //   UpgradeIntrinsicFunction(callee, &newCallee);
                //   if (newCallee) UpgradeIntrinsicCall(CI, newCallee);
                break;
            }
        }
    }
}

Lazy Materialization (sub_1503DC0, 13 KB)

Function bodies are not parsed eagerly. The module parser records each function's byte offset in the bitcode stream, and materializeFunctions seeks to that position on demand. Error strings: "Could not find function in stream", "Expect function block", "Expect SubBlock", "Trying to materialize functions before seeing function blocks". The two-phase compilation exploits this by materializing individual functions for per-function Phase II optimization.
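
The offset-recording scheme can be modeled in a few lines. This is a sketch of the mechanism, not the recovered data structure; the class and method names are invented, and the recovered error string is reused for the miss case:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical model of deferred function-body parsing.
class LazyModuleReader {
    std::unordered_map<std::string, uint64_t> DeferredBodies;  // name -> stream offset
public:
    // Module parse pass: record where each FUNCTION_BLOCK starts, skip its body.
    void recordFunctionBlock(const std::string &Name, uint64_t StreamOffset) {
        DeferredBodies[Name] = StreamOffset;
    }
    // On demand: look up the recorded offset; the caller seeks there and
    // runs parseFunctionBody on just this function.
    uint64_t materialize(const std::string &Name) const {
        auto It = DeferredBodies.find(Name);
        if (It == DeferredBodies.end())
            throw std::runtime_error("Could not find function in stream");
        return It->second;
    }
};
```

This is the shape the two-phase orchestrator exploits: Phase II can materialize one function at a time without re-parsing the whole module.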

Bitstream Infrastructure

Function | Address | Size | Role
readBlockInfoBlock | 0x150F8E0 | 42 KB | Reads BLOCKINFO block (abbreviation definitions)
readAbbreviatedField | 0x1510D70 | 38 KB | Expands abbreviated records (fixed, VBR, array, blob)
readAbbrevRecord | 0x1513230 | 20 KB | Reads one abbreviation-defined record
readRecord | 0x150E2B0 | 19 KB | Core BitstreamCursor::readRecord
parseMetadataBlock | 0x1518180 | 29 KB | Parses METADATA_BLOCK for function-level metadata
parseFunctionMetadata | 0x1520420 | 32 KB | Metadata/value-table builder during function parse
parseMetadataStrings | 0x1522160 | 13 KB | Reads metadata string table
parseTypeBlock / constants | 0x15083D0 | 26 KB | TYPE_BLOCK or CONSTANTS_BLOCK parser
parseValueRecord | 0x1515740 | 9 KB | Value record decoder
string table reader | 0x15140E0 | 13 KB | Bitcode string table entries
readBlobRecord | 0x1514C40 | 9 KB | Blob-type record reader
skipBlock | 0x15127D0 | 13 KB | Block skipping and cursor navigation
parseModuleSummaryIndex | 0x150B5F0 | 63 KB | ThinLTO summary parser
materializeFunctions | 0x1503DC0 | 13 KB | Lazy function body materialization
parseModule | 0x1505110 | 60 KB | Top-level MODULE_BLOCK parser
ThinLTO GUID lookup | 0x150A160 | 7 KB | GUID-based summary index lookup
parseGlobalInits | 0x1504A60 | 8 KB | Global variable initializer parser

Bitcode Writer

writeModule (sub_1538EC0, 58 KB)

The top-level writer serializes an entire Module to a bitcode stream. It orchestrates sub-writers in a fixed order:

  1. Enumerate all values via ValueEnumerator (sub_15467B0, 23 KB)
  2. Write identification block (with producer string -- see next section)
  3. Write MODULE_BLOCK header
  4. Write type table (sub_1530240, 12 KB)
  5. Write attribute groups (sub_152F610, 8 KB)
  6. Write global variables
  7. Write function declarations
  8. For each defined function: writeFunction (sub_1536CD0, 40 KB)
  9. Write metadata (sub_1531F90, 27 KB) + metadata records (sub_15334D0, 8 KB)
  10. Write value symbol table (sub_1533CF0, 16 KB)
  11. Write named metadata / comdat records (sub_15311A0, 14 KB)
  12. If ThinLTO: write module summary (sub_1535340, 26 KB)

writeFunction (sub_1536CD0, 40 KB)

Writes one FUNCTION_BLOCK containing all instructions, each encoded via writeInstruction (sub_1528720, 27 KB). Instructions are encoded as (opcode, operand_ids...) records where operand IDs are relative to the value table. The writer uses abbreviations for compact encoding of common instruction patterns.

Value Enumeration

Before writing, the ValueEnumerator assigns a dense numeric ID to every value in the module. This is the reverse of what the reader does (mapping IDs back to Values).
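
A minimal model of dense value numbering, assuming a string stands in for an llvm::Value* (the class name matches LLVM's, but this is a sketch, not the recovered implementation):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Writer-side inverse of the reader's ID -> Value table: every distinct
// value gets a dense, stable numeric ID in first-seen order.
class ValueEnumerator {
    std::unordered_map<std::string, uint32_t> IDs;
    std::vector<std::string> Order;
public:
    uint32_t enumerate(const std::string &V) {
        auto [It, Inserted] = IDs.try_emplace(V, static_cast<uint32_t>(Order.size()));
        if (Inserted) Order.push_back(V);
        return It->second;
    }
    const std::string &value(uint32_t ID) const { return Order.at(ID); }
};
```

Instruction records can then reference operands by these compact IDs instead of by name.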

Function | Address | Size | Role
enumerateModule | 0x15467B0 | 23 KB | Top-level module enumeration
enumerateValues | 0x1542B00 | 26 KB | Assigns numeric IDs to all values
optimizeConstants | 0x1548410 | 8 KB | Reorders constants for better compression
TypeFinder helper | 0x153E1D0 | 7 KB | Recursive type discovery

Writer Function Map

Function | Address | Size | Role
writeModule | 0x1538EC0 | 58 KB | Top-level module serializer
writeFunction | 0x1536CD0 | 40 KB | Per-function FUNCTION_BLOCK writer
writeMetadata | 0x1531F90 | 27 KB | METADATA_BLOCK writer
writeInstruction | 0x1528720 | 27 KB | Single instruction encoder
writeModuleSummary | 0x1535340 | 26 KB | ThinLTO summary serializer
writeValueSymbolTable | 0x1533CF0 | 16 KB | VALUE_SYMTAB_BLOCK writer
writeNamedMetadata | 0x15311A0 | 14 KB | Named metadata / comdat writer
writeType / globalVar | 0x1530240 | 12 KB | Type descriptors or global variable records
emitAbbreviation | 0x152AB40 | 11 KB | Abbreviation definition writer
emitRecord | 0x152A250 | 9 KB | Low-level record emission
writeConstants helper | 0x1527BB0 | 9 KB | Constant value encoder
writeMetadataRecords | 0x15334D0 | 8 KB | Dispatcher for 37 metadata node types
writeAttributeGroup | 0x152F610 | 8 KB | ATTRIBUTE_GROUP_BLOCK writer
emitVBR | 0x15271D0 | 7 KB | Variable bit-rate integer encoding
emitCode | 0x15263C0 | 7 KB | Core abbreviated/unabbreviated record emission
emitBlob | 0x1528330 | -- | Blob data emission

Producer String Hack

This is the single most important NVIDIA deviation in the bitcode subsystem. Two global constructors cooperate to set the producer identification string:

ctor_036 at 0x48CC90 (544 bytes): Reads LLVM_OVERRIDE_PRODUCER from the environment. If unset, falls back to the string "20.0.0" (the true LLVM version). Stores the result in the global qword_4F837E0. Also registers disable-bitcode-version-upgrade (cl::opt<bool>).

ctor_154 at 0x4CE640 (215 bytes): Also reads LLVM_OVERRIDE_PRODUCER. Falls back to "7.0.1". Stores into a separate global.

When writeModule (sub_1538EC0) writes the IDENTIFICATION_BLOCK, it emits the string "LLVM7.0.1" as the producer. This is assembled from the prefix "LLVM" plus the version string "7.0.1" loaded from the ctor_154 global.

The consequence is that any tool reading CICC's output bitcode (including older libNVVM, nvdisasm, or third-party NVVM IR consumers) sees producer "LLVM7.0.1" and interprets the bitcode as LLVM 7.x-era IR. Internally, the IR is LLVM 20.0.0 -- all modern instruction opcodes, metadata formats, and type encodings are present. The producer string is purely a compatibility marker that tells downstream tools which NVVM IR version spec to apply, not the actual LLVM version.

Why 7.0.1 specifically: NVVM IR 2.0 was defined against LLVM 7.0.1. The NVVM toolchain ecosystem (libNVVM, nvcc's device compilation pipeline) standardized on this version string as the "NVVM IR format identifier." Upgrading the producer string would require coordinated changes across the entire CUDA toolkit and all consumers.

// Pseudocode for producer string initialization
static const char *producer_version;

void ctor_036() {  // at 0x48CC90
    const char *env = getenv("LLVM_OVERRIDE_PRODUCER");
    if (!env) env = "20.0.0";  // true LLVM version
    global_4F837E0 = env;
    // Also registers: -disable-bitcode-version-upgrade (cl::opt<bool>)
}

void ctor_154() {  // at 0x4CE640
    const char *env = getenv("LLVM_OVERRIDE_PRODUCER");
    if (!env) env = "7.0.1";   // NVVM IR compat marker
    producer_version = env;
}

// In writeModule (sub_1538EC0):
void writeIdentificationBlock(BitstreamWriter &Stream) {
    Stream.EnterSubblock(IDENTIFICATION_BLOCK_ID);
    // Producer = "LLVM" + producer_version → "LLVM7.0.1"
    std::string Producer = std::string("LLVM") + producer_version;
    Stream.EmitRecord(IDENTIFICATION_CODE_STRING, Producer);
    Stream.EmitRecord(IDENTIFICATION_CODE_EPOCH, CurrentEpoch);
    Stream.ExitBlock();
}

Reimplementation note: A reimplementation must write "LLVM7.0.1" as the producer for compatibility with the existing NVVM ecosystem. Setting LLVM_OVERRIDE_PRODUCER to a different value will change the embedded string. The disable-bitcode-version-upgrade flag controls whether the reader's AutoUpgrade logic activates for version-mismatched bitcode.

X86 AutoUpgrade -- Why to Skip It

The intrinsic upgrader is the single largest code mass in the entire cicc binary. Two functions dominate:

Function | Address | Size | Role
UpgradeIntrinsicFunction | sub_156E800 | 593 KB | Name-based intrinsic rename lookup (271 string patterns)
UpgradeIntrinsicCall | sub_A939D0 | 457 KB | Call instruction rewriter
X86 intrinsic upgrade helper | sub_A8A170 | 195 KB | SSE/AVX/AVX-512 family tables
UpgradeIntrinsicCall (2nd copy) | sub_15644B0 | 89 KB | Companion call upgrader
NVVM upgrade dispatcher | sub_A8E250 | 52 KB | nvvm.atomic, nvvm.shfl, nvvm.cp.async, nvvm.tcgen05, nvvm.cluster, nvvm.ldg
NVVM call rewriting | sub_A91130 | 28 KB | NVVM-specific call rewriter
NVVM annotation metadata upgrade | sub_A84F90 | 14 KB | maxclusterrank, maxntid, etc.
UpgradeModuleFlags | 0x156C720 | 10 KB | Module flag upgrader
UpgradeLoopMetadata | 0x156A1F0 | 7 KB | llvm.loop.interleave.count, llvm.loop.vectorize.*

Total intrinsic upgrader code: approximately 1.4 MB across all copies and helpers.

The x86 portion (roughly 1.0 MB) handles SSE/SSE2/SSE4.1/SSE4.2/SSSE3, AVX2, AVX-512 (mask operations, conversions, FMA variants), and ARM NEON patterns (^arm\.neon\.vld, ^arm\.neon\.vst). These branches are functionally dead for NVPTX -- no CUDA program will ever contain an @llvm.x86.sse2.padds.b intrinsic. However, the code is NOT unreachable in the CFG sense: the reader calls UpgradeIntrinsicFunction on every intrinsic name, the function does a string-prefix match, and falls through the x86/ARM branches without matching. The x86 code paths simply never activate.
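
The string-prefix dispatch can be sketched as a classifier. The enum and function name are invented, and the prefix set is a tiny illustrative subset of the 271 recovered patterns:

```cpp
#include <string>

// Hypothetical model of the name-based dispatch in UpgradeIntrinsicFunction.
enum class UpgradeTable { NVVM, X86OrArm, None };

UpgradeTable classifyIntrinsic(const std::string &Name) {
    auto hasPrefix = [&](const char *P) { return Name.rfind(P, 0) == 0; };
    if (hasPrefix("llvm.nvvm."))
        return UpgradeTable::NVVM;      // the only tables live for NVPTX input
    if (hasPrefix("llvm.x86.") || hasPrefix("llvm.arm.neon."))
        return UpgradeTable::X86OrArm;  // reachable, but never matches CUDA code
    return UpgradeTable::None;          // current name, no upgrade needed
}
```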

Reimplementation guidance: You can safely exclude the x86 and ARM AutoUpgrade tables (sub_A8A170, the x86 portions of sub_A939D0, and the ARM patterns in sub_15644B0). The NVVM-relevant upgraders must be preserved:

Preserved | NVVM Intrinsic Families
sub_A8E250 | nvvm.atomic.*, nvvm.shfl.*, nvvm.cp.async.*, nvvm.tcgen05.*, nvvm.cluster.*, nvvm.ldg.*
sub_A91130 | NVVM-specific call rewrites
sub_A84F90 | NVVM annotation metadata (maxclusterrank, maxntid, etc.)
sub_156A1F0 | Loop vectorization metadata (llvm.loop.interleave.count)
sub_156C720 | Module flags

Stripping the x86 upgrader saves approximately 1.0 MB of binary size and significant reverse-engineering effort, with zero functional impact on GPU compilation.

Metadata Reader

MetadataLoader::parseOneMetadata (sub_A09F80, 121 KB)

The metadata reader handles 42 distinct metadata record types in a single switch statement. Each case constructs one metadata node:

  • DI metadata nodes: DISubprogram, DIFile, DICompileUnit, DIVariable, DILocation, DIType, DIExpression, DISubrange, DIEnumerator, DIGlobalVariableExpression, DIModule, DINamespace, DITemplateTypeParameter, DITemplateValueParameter, DICompositeType, DIDerivedType, DIBasicType, DILexicalBlock, DILexicalBlockFile, DILabel, DIImportedEntity, DIMacro, DIMacroFile, DICommonBlock, DIGenericSubrange, DIStringType, DIArgList
  • LLVM metadata nodes: MDTuple, MDString, named metadata
  • NVVM annotations: nvvm.annotations (parsed as named metadata carrying per-kernel attributes)

The function is called from parseMetadataBlock (sub_1518180, 29 KB), which reads the block structure, and parseFunctionMetadata (sub_1520420, 32 KB), which processes function-level metadata attachments.

Value materialization (sub_A10370, 33 KB) handles forward references in metadata. When a metadata node references a value that hasn't been parsed yet, the materializer resolves it once the value becomes available.

Module Summary Serialization

Two pairs of functions handle ThinLTO module summary IO:

Summary Writer (sub_1535340, 26 KB)

Writes the MODULE_STRTAB_BLOCK and GLOBALVAL_SUMMARY_BLOCK into the bitcode stream. For each function/alias/global:

  • Encodes the GUID hash (64-bit FNV-1a on the mangled name)
  • Writes call graph edges with hotness annotations
  • Writes reference edges (global value references)
  • For ThinLTO: writes module path strings, type test GUIDs

Error string: "Unexpected anonymous function when writing summary".

The NVIDIA-extended summary fields (import priority, complexity budget, kernel bit, CUDA attributes) are written by the NVModuleSummary builder into the standard summary records via additional flag bits and extended record fields.

Summary Reader (sub_150B5F0, 63 KB)

Reads the summary index from bitcode. Handles GUID hashes, function/alias summaries, module paths. Error strings: "Alias expects aliasee summary", "Invalid hash length", "Invalid Summary Block: version expected", "Malformed block".

Summary Writer (standalone copy) (sub_A2D2B0, 48 KB)

A second copy of the summary/metadata writer exists at 0xA2D2B0 in the standalone pipeline's address range.

NVVM IR Version Validation

CICC gates bitcode acceptance on two version checks:

Module-Level Version Gate (sub_157E370, 7 KB)

After parsing the module, this function reads the "nvvmir.version" named metadata node. The metadata contains a pair of integers (major, minor). The check enforces:

major == 3  AND  minor <= 2

If the check fails, the function calls sub_16BD130 which emits "Broken module found, compilation aborted!" and terminates compilation. If the module passes the version check, it proceeds to sub_166CBC0 (verifyModule [MEDIUM confidence] -- identification based on call position after bitcode parsing and before optimization, consistent with LLVM's standard verify-after-parse pattern, but no diagnostic string directly confirms the function name) for structural IR verification, then sub_15ACB40 for post-verification processing.

A second instance at sub_12BFF60 (9 KB) in the standalone pipeline performs the same check with additional llvm.dbg.cu debug info presence validation.

Environment Override (NVVM_IR_VER_CHK)

The NVVM_IR_VER_CHK environment variable controls whether version validation runs at all:

| Value | Effect |
|---|---|
| Unset or non-"0" | Version check enabled (default) |
| "0" | Version check bypassed, no version mismatch errors |

The check is: if (!env || strtol(env, NULL, 10) != 0) then enforce version. Any string that strtol parses to a nonzero integer keeps the check enabled, while any string parsing to 0 disables it -- including non-numeric strings, for which strtol returns 0, not just the literal "0".
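Putting the metadata gate and the environment override together, the acceptance decision reduces to the sketch below. The helper name is hypothetical; in the binary, the (major, minor) pair comes from the "nvvmir.version" named metadata node and the env value from getenv.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Sketch of the acceptance decision: enforce major == 3 && minor <= 2
   unless NVVM_IR_VER_CHK parses to 0. `env` is the value of
   NVVM_IR_VER_CHK, or NULL if unset. Function name is illustrative. */
static bool nvvm_ir_version_ok(int major, int minor, const char *env) {
    if (env && strtol(env, NULL, 10) == 0)
        return true;                  /* check bypassed */
    return major == 3 && minor <= 2;  /* else: "Broken module found..." */
}
```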

Two verifier instances exist:

  • sub_12BFF60 at 0x12BFF60 (standalone pipeline)
  • sub_2259720 at 0x2259720 (second instance, possibly duplicate link unit)

Configuration

Environment Variables

| Variable | Effect | Default |
|---|---|---|
| LLVM_OVERRIDE_PRODUCER | Overrides bitcode producer identification string | "7.0.1" (ctor_154) / "20.0.0" (ctor_036) |
| NVVM_IR_VER_CHK | Set to "0" to bypass NVVM IR version validation | Enabled |

cl::opt Flags

| Flag | Type | Default | Effect |
|---|---|---|---|
| disable-bitcode-version-upgrade | bool | false | Disable automatic bitcode upgrade for version mismatch |
| bitcode-mdindex-threshold | int | 25 | Number of metadata entries above which an index is emitted |
| disable-ondemand-mds-loading | bool | false | Disable lazy metadata loading |
| write-relbf-to-summary | bool | false | Write relative block frequency to ThinLTO function summary |
| print-summary-global-ids | bool | false | Print global IDs when reading module summary |
| import-full-type-definitions | bool | false | Import full type definitions in ThinLTO |

Differences from Upstream LLVM

| Aspect | Upstream LLVM 20.0.0 | CICC v13.0 |
|---|---|---|
| Producer string | "LLVM20.0.0" | "LLVM7.0.1" (hardcoded via ctor_154) |
| Producer override | LLVM_OVERRIDE_PRODUCER env var | Same mechanism, different default |
| Version upgrade disable | disable-bitcode-version-upgrade exists | Same, registered in ctor_036 |
| NVVM IR version gate | Does not exist | nvvmir.version metadata check (major==3, minor<=2) |
| NVVM IR version bypass | Does not exist | NVVM_IR_VER_CHK=0 environment variable |
| X86 AutoUpgrade | Active for x86 targets | Present but dead code (NVPTX only) |
| NVVM intrinsic upgrade | Does not exist | nvvm.atomic, nvvm.shfl, nvvm.cp.async, etc. upgraders added |
| NVVM annotation upgrade | Does not exist | maxclusterrank, maxntid metadata upgrader added |
| Module summary | Standard ModuleSummaryAnalysis | Extended with NVModuleSummary (import priority, kernel bit, complexity budget) |
| Binary copies | Single instance | Two copies (0x9F range, 0x150 range) at different link addresses |

Function Map

Reader (primary, 0x1500000--0x1522000)

| Address | Size | Function |
|---|---|---|
| 0x1503DC0 | 13 KB | materializeFunctions |
| 0x1504A60 | 8 KB | parseGlobalInits |
| 0x1505110 | 60 KB | parseModule |
| 0x15083D0 | 26 KB | parseTypeBlock / Constants |
| 0x150A160 | 7 KB | ThinLTO GUID lookup |
| 0x150B5F0 | 63 KB | parseModuleSummaryIndex |
| 0x150E2B0 | 19 KB | readRecord |
| 0x150F8E0 | 42 KB | readBlockInfoBlock |
| 0x1510D70 | 38 KB | readAbbreviatedField |
| 0x1513230 | 20 KB | readAbbrevRecord |
| 0x15127D0 | 13 KB | skipBlock |
| 0x15140E0 | 13 KB | string table reader |
| 0x1514C40 | 9 KB | readBlobRecord |
| 0x1515740 | 9 KB | parseValueRecord |
| 0x15177F0 | 7 KB | bitcode record helper |
| 0x1518180 | 29 KB | parseMetadataBlock |
| 0x1519820 | 7 KB | bitcode record helper |
| 0x1519BD0 | 7 KB | bitcode record helper |
| 0x151B070 | 123 KB | parseFunctionBody |
| 0x1520420 | 32 KB | parseFunctionMetadata |
| 0x1522160 | 13 KB | parseMetadataStrings |

Reader (standalone copy, 0x9F0000--0xA20000)

| Address | Size | Function |
|---|---|---|
| 0x9F2A40 | 185 KB | parseFunctionBody |
| 0xA09F80 | 121 KB | MetadataLoader::parseOneMetadata |
| 0xA10370 | 33 KB | value materialization |
| 0x9FF220 | 31 KB | writer helper |
| 0xA2D2B0 | 48 KB | module summary / metadata writer |

Writer (0x1525000--0x1549000)

| Address | Size | Function |
|---|---|---|
| 0x15263C0 | 7 KB | emitCode |
| 0x15271D0 | 7 KB | emitVBR |
| 0x1527BB0 | 9 KB | writeConstants helper |
| 0x1528720 | 27 KB | writeInstruction |
| 0x152A250 | 9 KB | emitRecord |
| 0x152AB40 | 11 KB | emitAbbreviation |
| 0x152F610 | 8 KB | writeAttributeGroup |
| 0x1530240 | 12 KB | writeType / GlobalVar |
| 0x15311A0 | 14 KB | writeNamedMetadata / comdat |
| 0x1531F90 | 27 KB | writeMetadata |
| 0x15334D0 | 8 KB | writeMetadataRecords (37 callees) |
| 0x1533CF0 | 16 KB | writeValueSymbolTable |
| 0x1535340 | 26 KB | writeModuleSummary (ThinLTO) |
| 0x1536CD0 | 40 KB | writeFunction |
| 0x1538EC0 | 58 KB | writeModule |

Intrinsic Upgrader (0xA80000--0xABFFFF + 0x1560000--0x1580000)

| Address | Size | Function |
|---|---|---|
| 0x156E800 | 593 KB | UpgradeIntrinsicFunction |
| 0xA939D0 | 457 KB | UpgradeIntrinsicCall |
| 0xA8A170 | 195 KB | X86 intrinsic upgrade helper |
| 0x15644B0 | 89 KB | UpgradeIntrinsicCall (2nd copy) |
| 0xA8E250 | 52 KB | NVVM upgrade dispatcher |
| 0xA91130 | 28 KB | NVVM call rewriting |
| 0xA84F90 | 14 KB | NVVM annotation metadata upgrade |
| 0xA7CD60 | 10 KB | UpgradeIntrinsicFunction (short, matches "nvvm.", "ftz.") |
| 0x156C720 | 10 KB | UpgradeModuleFlags |
| 0x156A1F0 | 7 KB | UpgradeLoopMetadata |

NVVM Version / Producer

| Address | Size | Function |
|---|---|---|
| 0x157E370 | 7 KB | NVVM version checker (primary) |
| 0x12BFF60 | 9 KB | NVVM version checker (standalone) |
| 0x2259720 | -- | NVVM version checker (duplicate instance) |
| 0x48CC90 | 544 B | ctor_036 -- producer init + disable-bitcode-version-upgrade |
| 0x4CE640 | 215 B | ctor_154 -- producer init ("7.0.1" default) |

Value Enumeration (0x1540000--0x1549000)

| Address | Size | Function |
|---|---|---|
| 0x1542B00 | 26 KB | enumerateValues |
| 0x15467B0 | 23 KB | enumerateModule |
| 0x1548410 | 8 KB | optimizeConstants |
| 0x15445A0 | 11 KB | metadata enumeration helper |
| 0x15450E0 | 9 KB | ValueEnumerator helper |
| 0x1547D80 | 9 KB | ValueEnumerator helper |
| 0x1543FA0 | 7 KB | ValueEnumerator helper |
| 0x1542750 | 7 KB | ValueEnumerator helper |
| 0x153E1D0 | 7 KB | TypeFinder helper |

Cross-References

Concurrent Compilation

CICC implements a two-phase concurrent compilation model that is entirely absent from upstream LLVM. The optimizer runs twice over the same module: Phase I performs whole-module analysis and early IR optimizations on a single thread, then Phase II runs per-function backend optimization in parallel across a thread pool. The design exploits the fact that most backend passes (instruction selection prep, register pressure reduction, peephole) are function-local and do not require cross-function information once Phase I has completed interprocedural analysis.

The two-phase protocol lives in sub_12E7E70 (9,405 bytes), which calls the same master pipeline function sub_12E54A0 twice, discriminated only by a TLS phase counter. The concurrency infrastructure spans the 0x12D4000--0x12EA000 address range and includes a GNU Make jobserver integration for build-system-aware parallelism throttling -- a feature that allows make -j8 to correctly limit total system load even when each cicc invocation itself wants to spawn threads.

| Role | Symbol |
|---|---|
| Phase I/II orchestrator | sub_12E7E70 (9,405 bytes) |
| Phase counter (TLS) | qword_4FBB3B0 -- values 1, 2, 3 |
| Concurrency eligibility | sub_12D4250 (626 bytes) |
| Function sorting | sub_12E0CA0 (23,422 bytes) |
| Concurrent entry | sub_12E1EF0 (51,325 bytes) |
| Worker entry | sub_12E7B90 (2,997 bytes) |
| Per-function callback | sub_12E8D50 |
| Per-function optimizer | sub_12E86C0 (7,687 bytes) |
| GNU jobserver init | sub_16832F0 |
| MAKEFLAGS parser | sub_1682BF0 |
| Thread pool create | sub_16D4AB0 |
| Thread pool enqueue | sub_16D5230 |
| Thread pool join | sub_16D4EC0 |
| Disable env var | LIBNVVM_DISABLE_CONCURRENT_API -- byte_4F92D70 |
| Pipeline function | sub_12E54A0 (49,800 bytes) -- called by both phases |

Two-Phase Architecture

Both phases call the same optimization pipeline function sub_12E54A0(context, input, output, opts, errCb). The only difference is the value stored in the TLS variable qword_4FBB3B0 before each call. Individual optimization passes read this TLS variable to decide whether to run: Phase I passes fire when the counter equals 1; Phase II passes fire when it equals 2. This avoids running codegen-oriented passes during analysis and vice versa.
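A phase-gated pass can be sketched as below. C11 `_Thread_local` stands in for the sub_16D40E0/sub_16D40F0 TLS accessors (the real slot holds a pointer to a heap-allocated int), and the pass predicate name is illustrative.

```c
#include <stdbool.h>

/* Stand-in for the qword_4FBB3B0 TLS slot. In the binary it holds a
   pointer to a heap-allocated 4-byte integer set via sub_16D40E0; a
   plain thread-local int is used here for illustration. */
static _Thread_local int g_phase;   /* 0 = un-phased single-function fast path */

/* Sketch of a Phase II-only (codegen-prep) pass gate: it fires on the
   un-phased fast path (counter never set, reads as 0) and during
   Phase II, but skips Phase I. */
static bool backend_pass_should_run(void) {
    return g_phase == 0 || g_phase == 2;
}
```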

Phase Counter Protocol

The phase counter qword_4FBB3B0 is a TLS variable accessed via sub_16D40E0 (set) and sub_16D40F0 (get). It stores a pointer to a heap-allocated 4-byte integer. Three values are defined:

| Value | Meaning | Set point |
|---|---|---|
| 1 | Phase I active -- analysis + early IR optimization | Before first sub_12E54A0 call |
| 2 | Phase II active -- backend optimization + codegen prep | Before second sub_12E54A0 call |
| 3 | Compilation complete for this module | After second sub_12E54A0 returns |

Sequential Path (sub_12E7E70)

When verbose logging is disabled and the module contains only one defined function, the orchestrator takes a fast path:

// Single-function fast path: no phase counter set at all
if (!verbose && num_defined_functions <= 1) {
    sub_12E54A0(ctx, input, output, opts, errCb);  // single un-phased call
    return;
}

This means the optimizer runs both phases in a single invocation -- passes see no phase counter and run unconditionally. For multi-function modules or when verbose logging is active, the full two-phase protocol engages:

// Phase I
int *phase = malloc(4);
*phase = 1;
tls_set(qword_4FBB3B0, phase);
sub_12E54A0(ctx, input, output, opts, errCb);

if (error_reported(errCb))
    return;  // abort on Phase I error

// Concurrency decision
bool concurrent = sub_12D4250(ctx, opts);
// Diagnostic: "Concurrent=Yes" or "Concurrent=No"

// Phase II
*phase = 2;
tls_set(qword_4FBB3B0, phase);
sub_12E54A0(ctx, input, output, opts, errCb);

// Done
*phase = 3;
tls_set(qword_4FBB3B0, phase);

The diagnostic string construction between phases is notable: v46 = 3LL - (v41 == 0) selects the annotation length -- 3 ("Yes") when v41 is nonzero, 2 ("No") when it is zero -- before logging "Phase II" with the "Concurrent=Yes/No" annotation appended.

Concurrent Path (sub_12E7B90)

When the thread count exceeds 1, the orchestrator dispatches to sub_12E7B90 instead of running Phase II sequentially:

sub_12E7B90(ctx, module_ptr, thread_count, opts, ...)
    |
    |-- Phase I: *phase=1, sub_12E54A0(...)        // whole-module, single thread
    |-- sub_12D4250(ctx, opts)                     // eligibility check
    |
    +-- if eligible (>1 defined function):
    |     sub_12E1EF0(...)                         // concurrent Phase II
    |     *phase = 3
    |
    +-- else (single defined function):
          *phase = 2, sub_12E54A0(...)             // sequential Phase II
          *phase = 3

Phase I always runs single-threaded on the whole module because interprocedural analyses (alias analysis, call graph construction, inlining decisions) require a consistent global view. Only after Phase I completes does the system split the module into per-function chunks for parallel Phase II processing.

Eligibility Check

sub_12D4250 (626 bytes) determines whether the module qualifies for concurrent compilation. The check is straightforward:

int sub_12D4250(Module *mod, Options *opts) {
    int defined_count = 0;
    for (Function &F : mod->functions()) {
        if (!sub_15E4F60(&F))    // !isDeclaration()
            defined_count++;
    }
    if (defined_count <= 1)
        return 0;                // not eligible: only 0 or 1 defined function

    byte force = *(byte*)(opts + 4064);   // NVVMPassOptions slot 201 (0xC9)
    if (force != 0)
        return force;            // user-forced concurrency setting

    return sub_12D3FC0(mod, opts);  // auto-determine thread count
}

The key gate is defined_count > 1. A module with a single kernel and no device functions will always compile sequentially regardless of thread count settings. The opts + 4064 byte (NVVMPassOptions slot 201, type BOOL_COMPACT, default 0) allows the user to force concurrent mode on or off. When zero (default), sub_12D3FC0 auto-determines the thread count based on module characteristics.

Function Priority Sorting

Before distributing functions to worker threads, sub_12E0CA0 (23,422 bytes) sorts them by compilation priority. This step is critical for load balancing: larger or more complex functions should start compiling first so they don't become tail stragglers.

Sorting Algorithm

The sort uses a hybrid strategy consistent with libstdc++ std::sort:

| Input size | Algorithm | Function |
|---|---|---|
| Small N | Insertion sort | sub_12D48A0 |
| Large N | Introsort (quicksort + heapsort fallback) | sub_12D57D0 |

The threshold between insertion sort and introsort is 256 bytes of element data (consistent with the libstdc++ template instantiation pattern observed elsewhere in the binary).

Priority Source

Priority values come from function attributes extracted by sub_12D3D20 (585 bytes). The sorted output is a vector of (name_ptr, name_len, priority) tuples with 32-byte stride, used directly by the per-function dispatch loop to determine compilation order. Functions with higher priority (likely larger or more critical kernels) are submitted to the thread pool first.
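The 32-byte tuple and the "highest priority first" dispatch order can be sketched as follows. The padding layout is an assumption chosen to reach the observed 32-byte stride; only the (name_ptr, name_len, priority) fields and the descending order are taken from the analysis.

```c
#include <stdlib.h>

/* Sketch of the sorted work-item tuple: pointer + length + priority,
   padded to the observed 32-byte stride (padding placement assumed). */
typedef struct {
    const char *name;     /* +0  mangled function name */
    size_t      name_len; /* +8  */
    int         priority; /* +16 from function attributes (sub_12D3D20) */
    char        pad[12];  /* pad to a 32-byte stride */
} WorkItem;

/* Higher priority sorts first, so expensive functions are submitted to
   the thread pool early and don't become tail stragglers. */
static int by_priority_desc(const void *a, const void *b) {
    const WorkItem *x = a, *y = b;
    return (y->priority > x->priority) - (y->priority < x->priority);
}
```

Usage: `qsort(items, n, sizeof(WorkItem), by_priority_desc);` (the binary's introsort/insertion-sort split replaces qsort in the real implementation).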

Enumeration Phase

Before sorting, sub_12E0CA0 enumerates all functions and globals via an iterator callback table:

| Callback | Address | Purpose |
|---|---|---|
| Next function | sub_12D3C60 | Advance to next function in module |
| Iterator advance | sub_12D3C80 | Step iterator forward |
| End check | sub_12D3CA0 | Test if iterator reached end |

For each function, the enumeration:

  1. Checks the node type discriminator at *(byte*)(node + 16) -- type 0 = Function, type 1 = GlobalVariable
  2. For functions: calls sub_15E4F60 (isDeclaration check), sub_12D3D20 (priority), sub_1649960 (name), inserts into v359 hash table (name to function) and v362 hash table (name to linkage type)
  3. For global variables: walks the parent/linked GlobalValue chain via sub_164A820, inserts callee references into v365 hash table for split-module tracking

GNU Jobserver Integration

When cicc is invoked by GNU Make with -j, it can participate in the make jobserver protocol to avoid oversubscribing the system. The jobserver flag is passed from nvcc via the -jobserver CLI flag, which sets opts + 3288 (NVVMPassOptions slot 163, type BOOL_COMPACT, default 0).

Initialization (sub_16832F0)

The jobserver init function allocates a 296-byte state structure and calls sub_1682BF0 to parse the MAKEFLAGS environment variable:

int sub_16832F0(JobserverState *state, int reserved) {
    memset(state, 0, 296);
    state->flags[8] = 1;                    // initialized marker

    int err = sub_1682BF0(state);            // parse MAKEFLAGS
    if (err) return err;

    pipe(state->local_pipe);                 // local token pipe
    // state+196 = read FD, state+200 = write FD

    pthread_create(&state->thread,           // state+208
                   NULL, token_manager, state);

    reserve_vector(state, token_count);
    return 0;  // success
}

MAKEFLAGS Parsing (sub_1682BF0)

The parser searches the MAKEFLAGS environment variable for --jobserver-auth= and supports two formats:

| Format | Example | Mechanism |
|---|---|---|
| Pipe FDs | --jobserver-auth=3,4 | Read FD = 3, Write FD = 4 (classic POSIX pipe) |
| FIFO path | --jobserver-auth=fifo:/tmp/gmake-jobserver-12345 | Named FIFO (GNU Make 4.4+) |

The pipe format uses comma-separated read/write file descriptors inherited from the parent make process. The FIFO format uses a named pipe in the filesystem. In both cases, the jobserver protocol works the same way: a thread reads tokens from the pipe/FIFO before starting each per-function compilation, and writes tokens back when the function completes. This ensures cicc never runs more concurrent compilations than make's -j level permits.
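The auth-string parsing for both formats can be sketched as below. The struct and function names are illustrative, not the 296-byte state layout recovered from the binary; at runtime, each worker then read()s one token byte from the read FD before compiling a function and write()s it back when done.

```c
#include <stdio.h>
#include <string.h>

/* Sketch of --jobserver-auth= parsing from MAKEFLAGS: fills either the
   inherited FD pair or the FIFO path (GNU Make 4.4+). Names are
   illustrative. */
typedef struct {
    int  rfd, wfd;        /* pipe format: inherited read/write FDs */
    char fifo[256];       /* FIFO format: named pipe path */
} JobserverAuth;

static int parse_jobserver_auth(const char *makeflags, JobserverAuth *a) {
    const char *p = strstr(makeflags, "--jobserver-auth=");
    if (!p) return -1;
    p += strlen("--jobserver-auth=");
    if (strncmp(p, "fifo:", 5) == 0) {        /* FIFO format */
        if (sscanf(p + 5, "%255[^ ]", a->fifo) != 1) return -1;
        a->rfd = a->wfd = -1;
        return 0;
    }
    if (sscanf(p, "%d,%d", &a->rfd, &a->wfd) == 2) {  /* pipe format */
        a->fifo[0] = '\0';
        return 0;
    }
    return -1;
}
```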

Error Handling

if (jobserver_init_error) {
    if (error_code == 5 || error_code == 6) {
        // Warning: jobserver pipe not accessible (probably not in make context)
        emit_warning(severity=1);
        // Fall through: continue without jobserver
    } else {
        // Fatal: "GNU Jobserver support requested, but an error occurred"
        sub_16BD130("GNU Jobserver support requested, but an error occurred", 1);
    }
}

Error codes 5 and 6 are non-fatal (the jobserver pipe may not be available if cicc is invoked outside a make context). All other errors are fatal.

Thread Pool Management

Creation (sub_16D4AB0)

The thread pool is LLVM's standard ThreadPool (the binary contains "llvm-worker-{0}" thread naming at sub_23CE0C0). Creation occurs at line 799 of sub_12E1EF0:

int actual_threads = min(requested_threads, num_functions);
sub_16D4AB0(thread_pool, actual_threads);

The thread count is clamped to the number of functions -- there is no point spawning more threads than there are work items.

Thread Count Resolution

Thread count is resolved through a fallback chain in sub_12E7E70:

int thread_count = opts[1026];    // NVVMPassOptions slot 203 (offset 4104), default -1
if (thread_count < 0)
    thread_count = opts[1036];    // NVVMPassOptions slot 205 (offset 4144), default -1
if (thread_count == 0)
    thread_count = sub_22420F0(); // get_nprocs() -- number of online CPUs

| Source | Slot | Offset | Default | Meaning |
|---|---|---|---|---|
| Primary | 203 (0xCB) | 4104 | -1 (auto) | Explicit thread count |
| Fallback | 205 (0xCD) | 4144 | -1 (auto) | Secondary thread count |
| System | -- | -- | -- | get_nprocs() return value |

The -1 sentinel means "auto-detect." When both slots are -1, the system falls back to the number of online CPUs. In practice, this means a 16-core machine will spawn 16 threads for concurrent Phase II compilation unless the user or calling tool overrides the count.

Per-Function Context Packaging

Before submitting each function to the thread pool, sub_12E1EF0 builds a self-contained compilation context (4,632 bytes) containing everything the worker needs:

  1. Bitcode extraction: sub_1AB9F40(&buffer, *module_bc, &iterators, sub_12D4BD0, &filter_state) -- extracts the bitcode for a single function using sub_12D4BD0 as a filter callback that checks the function name against the work hash table
  2. Bitcode parsing: sub_153BF40(module_data, &context, 0, 0, 0, 0) -- parses the extracted bitcode into an LLVM Module
  3. Context copy: Copies all iterator state (v391--v395), context/options/error callbacks (v405--v409), module name string, the full 4,480-byte NVVMPassOptions struct, and module bitcode vectors

Each function gets its own independent copy of the options struct and module -- there is no shared mutable state between worker threads during Phase II.

Worker Entry and Completion

Work Submission

Each function is submitted to the thread pool with two callbacks:

v373 = sub_12E8D50;    // completion callback (runs the optimizer)
v372 = sub_12D4D90;    // destructor callback (cleanup)
sub_16D5230(work_item, thread_pool, context);  // enqueue

Per-Function Callback (sub_12E8D50)

The completion callback extracts the module from the packaged context and calls the Phase II per-function optimizer:

void sub_12E8D50(Context *ctx) {
    Module *mod = extract_module(ctx);
    sub_12E86C0(ctx, function_index, opts, module_name);
}

Per-Function Phase II Optimizer (sub_12E86C0, 7,687 bytes)

This function sets the TLS phase counter to 2 and runs the pass pipeline on the individual function's module:

void sub_12E86C0(Context *ctx, int func_idx, Options *opts, StringRef name) {
    int *phase = malloc(4);
    *phase = 2;
    tls_set(qword_4FBB3B0, phase);
    // Run Phase II pass pipeline on this function's module
    sub_12E54A0(ctx, ...);
}

Because qword_4FBB3B0 is TLS, each worker thread has its own phase counter. All worker threads see phase=2 concurrently without interference.

Post-Compilation Merge

After all worker threads complete (sub_16D4EC0 joins the thread pool):

  1. Jobserver cleanup: sub_1682740 checks for jobserver errors and releases tokens
  2. Error check: If any per-function callback reported an error, the compilation fails
  3. Normal mode (opt_level >= 0): Appends a null byte to the output buffer (bitcode stream terminator)
  4. Split-compile mode (opt_level < 0): Re-reads each function's bitcode via sub_153BF40, links all per-function modules via sub_12F5610 (the LLVM module linker), and restores linkage attributes from the v362 hash table. Specifically:
    • Linkage values 7--8: set only low 6 bits (external linkage types)
    • Other values: set low 4 bits, then check (value & 0x30) != 0 for visibility bits
    • Sets byte+33 |= 0x40 (dso_local flag)
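The linkage-restore bit manipulation in step 4 can be sketched as follows. Which byte of the GlobalValue holds the linkage bits is an assumption here (offset 32, to sit next to the documented byte+33 dso_local flag); only the bit widths, the 0x30 visibility mask, and the 0x40 flag are taken from the analysis.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of restoring a saved linkage value onto a raw GlobalValue
   byte layout after the split-module link. The byte offset 32 for the
   linkage field is assumed; byte 33 |= 0x40 (dso_local) is recovered. */
static void restore_linkage(uint8_t *gv, unsigned saved) {
    uint8_t *b = gv + 32;                       /* assumed linkage byte */
    if (saved == 7 || saved == 8) {
        *b = (uint8_t)((*b & ~0x3Fu) | (saved & 0x3Fu));  /* low 6 bits only */
    } else {
        *b = (uint8_t)((*b & ~0x0Fu) | (saved & 0x0Fu));  /* low 4 bits */
        bool has_visibility = (saved & 0x30u) != 0;       /* visibility bits */
        (void)has_visibility;   /* handled by a separate store in the binary */
    }
    gv[33] |= 0x40;                             /* dso_local flag */
}
```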

Configuration

Environment Variables

| Variable | Check | Effect |
|---|---|---|
| LIBNVVM_DISABLE_CONCURRENT_API | getenv() != NULL | Sets byte_4F92D70 = 1. Disables concurrent/thread-safe LibNVVM API usage entirely. Any non-NULL value triggers it. Checked in global constructor ctor_104 at 0x4A5810. |
| MAKEFLAGS | Parsed by sub_1682BF0 | Searched for --jobserver-auth= to enable GNU Make jobserver integration |

NVVMPassOptions Slots

| Slot | Offset | Type | Default | Purpose |
|---|---|---|---|---|
| 163 (0xA3) | 3288 | BOOL_COMPACT | 0 | Jobserver integration requested (set by -jobserver flag) |
| 201 (0xC9) | 4064 | BOOL_COMPACT | 0 | Force concurrency on/off (0 = auto) |
| 203 (0xCB) | 4104 | INTEGER | -1 | Primary thread count (-1 = auto) |
| 205 (0xCD) | 4144 | INTEGER | -1 | Fallback thread count (-1 = auto) |

CLI Flags

| Flag | Route | Effect |
|---|---|---|
| -jobserver | opt "-jobserver" | Enables GNU jobserver integration (sets slot 163) |
| -split-compile=&lt;N&gt; | opt "-split-compile=&lt;N&gt;" | Enables split-module compilation (opt_level set to -1) |
| -split-compile-extended=&lt;N&gt; | opt "-split-compile-extended=&lt;N&gt;" | Extended split-compile (also sets +1644 = 1) |
| --sw2837879 | Internal | Concurrent ptxStaticLib workaround flag |

Phase State Machine

  START
    |
    v
  [phase=1] --> sub_12E54A0 (Phase I: whole-module analysis)
    |
    v
  error? --yes--> RETURN (abort)
    |no
    v
  count_defined_functions()
    |
    +--(1 func)--> [phase=2] --> sub_12E54A0 (Phase II sequential)
    |                                |
    |                                v
    |                            [phase=3] --> DONE
    |
    +--(N funcs, threads>1)--> sub_12E1EF0 (concurrent)
    |                             |
    |                             +-- sort functions by priority
    |                             +-- create thread pool
    |                             +-- init jobserver (if requested)
    |                             +-- for each function:
    |                             |     extract per-function bitcode
    |                             |     parse into independent Module
    |                             |     [phase=2] per-function (TLS)
    |                             |     submit to thread pool
    |                             +-- join all threads
    |                             +-- link split modules (if split-compile)
    |                             +-- [phase=3] --> DONE
    |
    +--(N funcs, threads<=1)--> [phase=2] --> sub_12E54A0 (sequential)
                                    |
                                    v
                                [phase=3] --> DONE

Differences from Upstream LLVM

Upstream LLVM has no two-phase compilation model. The standard LLVM pipeline runs all passes in a single invocation with no phase discrimination. CICC's approach is entirely custom:

  1. Phase counter TLS variable: Upstream LLVM passes have no concept of reading a global phase counter to decide whether to run. Every pass in CICC must check qword_4FBB3B0 and early-return if it belongs to the wrong phase.

  2. Per-function module splitting: Upstream LLVM's splitModule() (in llvm/Transforms/Utils/SplitModule.h) exists for ThinLTO and GPU offloading, but CICC's splitting at sub_1AB9F40 with the sub_12D4BD0 filter callback is a custom implementation integrated with the NVVMPassOptions system.

  3. GNU jobserver integration: No upstream LLVM tool participates in the GNU Make jobserver protocol. This is entirely NVIDIA-specific, implemented to play nicely with make -j in CUDA build systems.

  4. Function priority sorting: Upstream LLVM processes functions in module iteration order. CICC's priority-based sorting via sub_12E0CA0 ensures that expensive functions start compiling first, reducing tail latency in the thread pool.

Function Map

| Function | Address | Size (bytes) |
|---|---|---|
| Function iterator: next | sub_12D3C60 | ~200 |
| Function iterator: advance | sub_12D3C80 | ~230 |
| Function iterator: end check | sub_12D3CA0 | ~260 |
| Function attribute/priority query | sub_12D3D20 | 585 |
| Auto thread count determination | sub_12D3FC0 | 3,600 |
| Concurrency eligibility check | sub_12D4250 | 626 |
| Insertion sort (small N) | sub_12D48A0 | -- |
| Per-function bitcode filter callback | sub_12D4BD0 | 2,384 |
| Work item destructor callback | sub_12D4D90 | 2,742 |
| Introsort (large N) | sub_12D57D0 | -- |
| Function sorting and enumeration | sub_12E0CA0 | 23,422 |
| Concurrent compilation top-level entry | sub_12E1EF0 | 51,325 |
| Master pipeline assembly (both phases) | sub_12E54A0 | 49,800 |
| Concurrent worker entry | sub_12E7B90 | 2,997 |
| Phase I/II orchestrator | sub_12E7E70 | 9,405 |
| Per-function Phase II optimizer | sub_12E86C0 | 7,687 |
| Per-function completion callback | sub_12E8D50 | -- |
| LLVM module linker (post-merge) | sub_12F5610 | 7,339 |
| Bitcode reader/verifier | sub_153BF40 | -- |
| isDeclaration() check | sub_15E4F60 | -- |
| Get function name | sub_1649960 | -- |
| Walk to parent GlobalValue | sub_164A820 | -- |
| Jobserver error check/cleanup | sub_1682740 | -- |
| MAKEFLAGS --jobserver-auth= parser | sub_1682BF0 | -- |
| GNU jobserver init (296-byte state) | sub_16832F0 | -- |
| TLS set (qword_4FBB3B0) | sub_16D40E0 | -- |
| TLS get (qword_4FBB3B0) | sub_16D40F0 | -- |
| Thread pool create | sub_16D4AB0 | -- |
| Thread pool join | sub_16D4EC0 | -- |
| Thread pool enqueue work item | sub_16D5230 | -- |
| Per-function bitcode extraction | sub_1AB9F40 | -- |
| get_nprocs() wrapper | sub_22420F0 | -- |

Cross-References

  • Entry Point & CLI -- pipeline dispatch that leads to the optimizer, including -jobserver flag routing
  • Optimizer Pipeline -- sub_12E54A0, the pipeline function called by both phases
  • NVVMPassOptions -- the 222-slot options table including thread count and jobserver slots
  • Environment Variables -- LIBNVVM_DISABLE_CONCURRENT_API and MAKEFLAGS
  • CLI Flags -- -jobserver, -split-compile, -split-compile-extended
  • Bitcode I/O -- sub_153BF40 bitcode reader used for per-function module extraction

Diagnostics & Optimization Remarks

CICC v13.0 contains three independent diagnostic systems that operate at different phases of compilation and serve different audiences. The EDG frontend diagnostic engine handles C++/CUDA language-level errors and warnings with rich terminal formatting or SARIF JSON output. The LLVM optimization remark infrastructure reports pass-level decisions (what was optimized, what was missed, and why) through the standard DiagnosticInfo hierarchy. NVIDIA's custom "profuse" framework provides verbose per-pass diagnostic output that is entirely separate from both EDG diagnostics and LLVM remarks, controlled by dedicated knobs like profuseinline and profusegvn.

Understanding these three layers is essential for reimplementation because they share no code. EDG diagnostics live in the 0x670000-0x6FFFFF address range and operate on EDG's internal diagnostic record format. LLVM remarks use the stock OptimizationRemarkEmitter analysis pass and the DiagnosticInfoOptimizationBase class hierarchy. The profuse framework is a pure NVIDIA invention that writes directly to stderr through cl::opt<bool> guards with no connection to either of the other two systems.

| Item | Value |
|---|---|
| EDG terminal emitter | sub_681D20 (37 KB, 1,342 lines) at 0x681D20 |
| EDG dispatch/SARIF emitter | sub_6837D0 (20 KB) at 0x6837D0 |
| Diagnostic format selector | unk_4D04198: 0 = text, 1 = SARIF |
| Format CLI flag | --diagnostics_format=text\|sarif (case 0x125 in sub_617BD0) |
| EDG output mode CLI | --output_mode text\|sarif (case 293 in lgenfe_main) |
| LLVM remark registration | ctor_152 at 0x4CE3F0 (3 regex cl::opts) |
| LLVM remark YAML serializer | sub_15CAD70 (13 KB) at 0x15CAD70 |
| LLVM remark bitstream serializer | sub_F01350 (23 KB) at 0xF01350 |
| Profuse inlining knob | profuseinline at 0x4DBEC0 (ctor_186_0), default off |
| Profuse GVN knob | profusegvn at 0x4FAE7E0 (ctor_201), default true |
| Diagnostic output stream | qword_4F07510 (FILE*, typically stderr) |
| Terminal width | dword_4D039D0 (columns, for word-wrapping) |
| ANSI color enable | dword_4F073CC[0] (nonzero = enabled) |
| Upstream LLVM equivalent | llvm/include/llvm/IR/DiagnosticInfo.h, llvm/lib/Analysis/OptimizationRemarkEmitter.cpp |

EDG Frontend Diagnostics

Dispatch Architecture

Every EDG frontend diagnostic passes through sub_6837D0, which acts as the single dispatch point. This function performs filtering (severity threshold, duplicate suppression, pragma-based suppression), increments error/warning counters, and then routes to one of two renderers based on the global unk_4D04198:

sub_6837D0(diag_record)
  |
  +-- severity < byte_4F07481[0]?  --> suppress (return)
  +-- duplicate? (byte_4CFFE80[4*errnum+2] bit flags) --> count only
  +-- pragma disabled? (sub_67D520) --> suppress
  +-- error limit reached? (unk_4F074B0 + unk_4F074B8 >= unk_4F07478) --> error 1508, abort
  |
  +-- unk_4D04198 == 0  -->  sub_681D20(diag)   [terminal text renderer]
  +-- unk_4D04198 == 1  -->  inline SARIF JSON   [JSON renderer within sub_6837D0]

The format is selected by the --diagnostics_format flag (case 0x125 in sub_617BD0), which is surfaced as --output_mode text|sarif in the lgenfe CLI.

Diagnostic Record Layout

EDG diagnostic records are approximately 192-byte structures organized as a tree. Each record can have child diagnostics, notes, context diagnostics (include-stack annotations), and an extra child list, all stored as linked lists.

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 4 | type | 0 = top-level, 1 = unknown, 2 = child-with-parent, 3 = continuation |
| +8 | 8 | next_sibling | Linked list next pointer |
| +16 | 8 | parent_diag | Pointer to parent diagnostic node |
| +24 | 8 | child_list | Linked list of child diagnostics |
| +40 | 8 | extra_child_list | Secondary child list (always emitted) |
| +56 | 8 | note_list | Linked list of attached notes |
| +72 | 8 | context_list | Context diagnostics (include-stack annotations) |
| +96 | 4 | has_source_location | Nonzero if source info is present |
| +100 | 2 | column_number | Column in source line (unsigned short) |
| +120 | 8 | source_file_info | Passed to sub_723260 to get filename string |
| +128 | 4 | line_number | Source line number (unsigned int) |
| +136 | 4 | file_id | File table index (0 = no file) |
| +140 | 2 | column_end | End column for underlining range |
| +144 | 4 | is_command_line | Nonzero means "command line" prefix |
| +152 | 8 | source_entity | If nonzero, use sub_723640 for decorated location |
| +160 | 8 | display_name_ptr | Filename string pointer |
| +168 | 4 | display_line | Line number for display |
| +172 | 4 | tab_stop_width | Tab stop setting for source display |
| +176 | 4 | diagnostic_number | Numeric ID for -W flags, becomes SARIF ruleId |
| +180 | 1 | severity | Severity code (see severity enum below) |

Terminal Text Renderer (sub_681D20)

The 37KB terminal renderer is the larger and more complex of the two backends. It handles ANSI color output, word-wrapping to terminal width, source context display with caret underlining, and recursive child diagnostic emission.

Location prefix. The source location is formatted before the severity label. For file-based diagnostics, sub_722FC0 or sub_723640 produces the filename, followed by (line_number) in parentheses, wrapped in ANSI color code 5 (file path color). Command-line diagnostics use string ID 1490 ("command line"). Diagnostics with no file have no location prefix.

Severity label. The label string is looked up via sub_67C860(string_id) from a localized string table. The string table base v57 is offset by 0 for normal diagnostics, 1 for command-line diagnostics. When diagnostic numbering is enabled (unk_4D04728 set) and severity is 5 or below with a nonzero diagnostic number at +176, the renderer appends #<number> after the severity label, converted by sub_67D2D0.

ANSI color system. CICC does not emit standard ANSI escape sequences directly. Instead, it uses an internal 2-byte marker system where byte 0 is 0x1B (ESC) and byte 1 is a color code from 1 to 5. These internal markers are translated to real terminal escapes by the output layer.

| Internal Code | Semantic | Typical Terminal Mapping |
|---|---|---|
| 1 | Reset/default | \033[0m |
| 2 | Error | Red |
| 3 | Caution/severe-warning | Yellow/magenta |
| 4 | Location highlight | Bold/cyan |
| 5 | File path / remark | Dim/blue |

Color output is gated by dword_4F073CC[0] (nonzero = enabled) and dword_4F073C8 (nonzero = "rich" escape mode; zero = "simple" mode that skips escape bytes entirely).
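The marker-to-escape translation can be sketched as below. The concrete escape strings are illustrative "typical" mappings from the table above, not sequences recovered from the binary, and the function name is hypothetical.

```c
#include <string.h>

/* Sketch of translating the internal 2-byte color markers (0x1B + code
   1..5) into real terminal escapes. The escape strings chosen here are
   illustrative mappings, not recovered constants. */
static const char *color_escape[6] = {
    "", "\033[0m", "\033[31m", "\033[33m", "\033[36m", "\033[34m"
};

static void render_markers(const char *in, char *out, int colors_enabled) {
    char *o = out;
    for (const char *p = in; *p; ++p) {
        if (*p == '\x1b' && p[1] >= 1 && p[1] <= 5) {
            if (colors_enabled) {                 /* dword_4F073CC[0] analogue */
                const char *esc = color_escape[(int)p[1]];
                memcpy(o, esc, strlen(esc));
                o += strlen(esc);
            }
            ++p;                                  /* consume the 2-byte marker */
        } else {
            *o++ = *p;                            /* plain text passes through */
        }
    }
    *o = '\0';
}
```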

Word-wrapping. Two code paths exist depending on whether ANSI colors are active.

Without colors (Path A), the algorithm is straightforward: compute available width as dword_4D039D0 - left_margin, scan for the last space within that width, break there, and emit newline plus indent. The left margin and continuation indent depend on the diagnostic type:

| Type (+0) | Left Margin | Continuation Indent |
|---|---|---|
| 0 (top-level) | 0 | 10 |
| 1 | 12 | 22 |
| 2 (child) | 10 or 12 | 20 or 22 |
| 3 (continuation) | 1 | 11 |

For type 2, the margin is +2 if the current diagnostic is not the first child of its parent.
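Path A reduces to the sketch below: find the last space inside the window, break there, indent the continuation. This is a simplified model (single-byte characters, no ESC markers, fixed window on continuation lines); the helper names are illustrative.

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the colorless wrap path (Path A). The available width is
   dword_4D039D0 minus the left margin; break at the last space inside
   that window. Simplified: single-byte chars, no color markers. */
static int wrap_break(const char *msg, int avail) {
    int brk = -1;
    for (int i = 0; msg[i] && i <= avail; ++i)
        if (msg[i] == ' ') brk = i;   /* remember last space in window */
    return brk;                        /* -1: no break point found */
}

static void wrap_plain(const char *msg, int avail, int indent,
                       char *out, size_t outsz) {
    size_t o = 0;
    int brk;
    while ((int)strlen(msg) > avail && (brk = wrap_break(msg, avail)) > 0) {
        o += snprintf(out + o, outsz - o, "%.*s\n%*s", brk, msg, indent, "");
        msg += brk + 1;               /* skip the space we broke at */
    }
    snprintf(out + o, outsz - o, "%s", msg);
}
```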

With colors (Path B), the algorithm tracks character-by-character with color state (v40 = current color, v41 = at-start-of-line flag, v152 = remaining columns). On encountering an ESC marker, it consumes the 2-byte pair and updates color state via sub_67BBF0. When the column limit is hit, the algorithm attempts to break at the last recorded space position (with buffer rewind to v147), falling back to a forced break at the current position.

The global qword_4F07468 controls wrap behavior: the low 32 bits disable wrapping entirely when nonzero, and the high 32 bits suppress source context display when nonzero.

Source context display. After the message text, the renderer displays the source line with caret underlining. sub_729B10(file_id, ...) retrieves source line data. Each source position entry is a linked list node with a 24+ byte layout: +0 next pointer, +8 source text pointer, +16 entry type (0 = normal char, 1 = same-position, 2 = 2-byte char, 3 = tab), +24 replacement character. The display renders two lines: the source text and a caret/tilde underline line, where ^ marks the error column and ~ extends the range to column_end. Multi-byte character handling uses sub_721AB0 to determine byte counts.

Recursive emission. After the main diagnostic and source context, child diagnostics are emitted recursively in this order: child_list (+24), note_list (+56, skipped for severity 2 remarks), context_list (+72, with parent pointer set before recursion), extra_child_list (+40). After all children, a blank line separator is emitted (unless compact mode is active), the output buffer is null-terminated, and the result is written via fputs to qword_4F07510 followed by fflush.

Machine-readable log. When qword_4D04908 (log FILE*) is set and the diagnostic type is not 3 (continuation), the renderer writes a single-line record:

<severity-char> "<filename>" <line> <col> <message>\n

The severity character is indexed from the string "rwweeccccCli" by (severity - 4). For child diagnostics, the character is lowercased.

| Index | Character | Meaning |
|---|---|---|
| 0 (sev 4) | r | remark |
| 1 (sev 5) | w | warning |
| 2 (sev 6) | w | caution (displayed as warning) |
| 3 (sev 7) | e | error |
| 4 (sev 8) | e | error (promoted) |
| 5 (sev 9) | c | catastrophe |
| 6 (sev 10) | c | catastrophe |
| 7 (sev 11) | C | catastrophe (alternate) |
| 8 | l | unknown |
| 9 | i | internal error |
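The lookup is a direct index into the recovered string, with the documented lowercasing for child diagnostics; `log_char` is an invented name.

```python
# The machine-readable log picks its severity character by indexing
# the fixed string "rwweeccccCli" with (severity - 4).
SEVERITY_CHARS = "rwweeccccCli"

def log_char(severity: int, is_child: bool = False) -> str:
    c = SEVERITY_CHARS[severity - 4]
    return c.lower() if is_child else c   # child diagnostics are lowercased
```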

SARIF JSON Renderer

The SARIF backend is implemented inline within sub_6837D0. It does not emit a complete SARIF document (there is no $schema key and no runs[] envelope); instead it writes one JSON object per diagnostic as a comma-separated stream to qword_4F07510. The caller or a post-processing tool is expected to wrap the stream.

Each diagnostic object has this structure:

{
  "ruleId": "EC<number>",
  "level": "error"|"warning"|"remark"|"catastrophe"|"internal_error",
  "message": {"text": "<JSON-escaped message>"},
  "locations": [
    {
      "physicalLocation": {
        "artifactLocation": {"uri": "file://<path>"},
        "region": {"startLine": N, "startColumn": N}
      }
    }
  ],
  "relatedLocations": [
    {
      "message": {"text": "..."},
      "physicalLocation": { ... }
    }
  ]
}

The ruleId is constructed by sprintf("%lu", *(uint32*)(diag+176)) -- the decimal diagnostic number prefixed with "EC". The level string is mapped from the severity byte at +180 via a switch statement. The message.text is produced by sub_683690, which renders the diagnostic text into qword_4D039E8 via sub_681B50 and then copies it character-by-character into qword_4D039D8 with JSON escaping of " and \ characters. The locations array is present only when *(diag+136) != 0 (valid file ID). The physicalLocation is built by sub_67C120, which calls sub_729E00 to decompose the packed source position and sub_722DF0 to resolve the file ID to a path string. The relatedLocations array carries note sub-diagnostics from the linked list at diag+72.

Multiple diagnostics are comma-separated: a comma is prepended before { when unk_4F074B0 + unk_4F074B8 > 1 (more than one diagnostic emitted so far).
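The streaming behavior can be sketched as below. Field names follow the recovered object shape; `sarif_object` and `render_stream` are invented helpers, and the loop index stands in for the unk_4F074B0 + unk_4F074B8 running count.

```python
import json

def sarif_object(diag: dict) -> dict:
    # Shape follows the recovered structure; concrete values are illustrative.
    obj = {
        "ruleId": "EC%d" % diag["number"],
        "level": diag["level"],
        "message": {"text": diag["text"]},
    }
    if diag.get("file"):  # locations present only when the file ID is valid
        obj["locations"] = [{
            "physicalLocation": {
                "artifactLocation": {"uri": "file://" + diag["file"]},
                "region": {"startLine": diag["line"],
                           "startColumn": diag["col"]},
            }
        }]
    return obj

def render_stream(diags) -> str:
    """One JSON object per diagnostic, comma-prepended after the
    first -- no $schema, no runs[] envelope."""
    parts = []
    for i, d in enumerate(diags):
        if i > 0:
            parts.append(",")
        parts.append(json.dumps(sarif_object(d)))
    return "".join(parts)
```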

Include-stack annotations. When include depth (dword_4F04C64) is greater than zero, sub_6837D0 walks the include stack (776-byte records at qword_4F04C68) calling sub_67B7E0 to build #include context annotations. These are linked as children at diag+40/+48. Error 453 gives "in file included from ..." context, error 1150 gives ellipsis "..." when too many include levels exist, and errors 1063/1064 give file-reference footers.

Warning-as-error promotion. When a warning (severity 5) has been emitted and unk_4D04728 is set, the function creates a synthetic "warnings treated as errors" diagnostic via sub_67D610(0xE7D, ..., 4) with severity 4 (remark), then recursively calls sub_6837D0 on it.

Diagnostic Filtering and Suppression

Filtering happens in sub_6837D0 before either renderer is invoked:

  1. Severity threshold: byte_4F07481[0] stores the minimum severity. Diagnostics below this level are silently suppressed.
  2. Duplicate detection: byte_4CFFE80[4*errnum + 2] bit flags track "already seen" diagnostics. Bit 0 marks first occurrence, bit 1 marks already emitted. On second hit, the diagnostic is counted but not emitted.
  3. Pragma suppression: sub_67D520 checks whether the diagnostic is disabled via #pragma diag_suppress or similar EDG pragmas. sub_67D470 records the suppression.
  4. Error limit: When unk_4F074B0 + unk_4F074B8 >= unk_4F07478, error 1508 ("error limit reached") is emitted and sub_7235F0(9) aborts compilation.

Diagnostic Severity Enum

The severity byte at diag+180 encodes the following levels, used by both the terminal and SARIF renderers:

| Value | Name | Terminal Color | SARIF Level | Log Char | Label |
|---|---|---|---|---|---|
| 2 | remark | ESC 5 (blue) | "remark" | R | R |
| 4 | warning | ESC 5 (blue) | "warning" | r | W |
| 5 | caution | ESC 3 (yellow) | "warning" | w | W (lowercase) |
| 6 | severe-warning | ESC 3 (yellow) | (falls through to error) | w | E (lowercase) |
| 7 | error | ESC 2 (red) | "error" | e | E |
| 8 | error (promoted) | ESC 2 (red) | "error" | e | E |
| 9 | catastrophe | ESC 2 (red) | "catastrophe" | c | C |
| 10 | catastrophe | ESC 2 (red) | "catastrophe" | c | C |
| 11 | internal-error | ESC 2 (red) | "internal_error" | i | special |

Severity values 9, 10, and 11 are fatal: after emission, sub_7AFBD0 (longjmp / error propagation [LOW confidence] -- the function is called on fatal error paths and does not return to its caller, consistent with longjmp or exit, but could also be a custom abort-style handler; no setjmp/longjmp string evidence found) and sub_7235F0(severity) terminate compilation. Internal errors (11) additionally prepend "(internal error) " to the log output and use the prefix for error 3709.

Note: severity 2 (remark) is distinct from LLVM optimization remarks -- it is an EDG frontend remark (e.g., template instantiation notes). Remarks at severity 2 suppress their note_list children during recursive emission.

LLVM Optimization Remarks

Registration and CLI Surface

Three cl::opt<std::string> knobs are registered at ctor_152 (0x4CE3F0), each taking a regex pattern:

| Knob | Description | Filters |
|---|---|---|
| pass-remarks | Enable optimization remarks from passes whose name matches the pattern | Passed (successful) optimizations |
| pass-remarks-missed | Enable missed optimization remarks | Optimizations that were considered but not applied |
| pass-remarks-analysis | Enable analysis remarks | Intermediate analysis results and explanations |

These are stock LLVM cl::opt registrations. CICC exposes them through the flag catalog (sub_9624D0) via the -inline-info convenience flag, which routes to the opt phase as:

-Xopt -pass-remarks=inline
-Xopt -pass-remarks-missed=inline
-Xopt -pass-remarks-analysis=inline

Additional remark-related knobs registered at ctor_376_0 (0x512DF0):

| Knob | Purpose |
|---|---|
| pass-remarks-with-hotness | Include PGO hotness information in remarks |
| pass-remarks-hotness-threshold | Minimum hotness for remark emission |
| pass-remarks-output | File path for remark output (YAML or bitstream) |
| pass-remarks-filter | Additional filter for remark pass names |
| pass-remarks-format | Format: yaml or bitstream |

The -w flag (suppress warnings) routes to both opt and llc as -w. The -Werror flag routes to both as -Werror, promoting warnings to errors.

Remark Emission Protocol

LLVM passes emit remarks through a three-step protocol observed consistently across all analyzed passes:

Step 1: Construct the remark. The pass creates a DiagnosticInfoOptimizationBase subclass object via one of these constructors:

| Constructor | Address | Creates |
|---|---|---|
| sub_B17560 | 0xB17560 | OptimizationRemark (pass succeeded) |
| sub_15CA330 | 0x15CA330 | OptimizationRemark (alternative constructor) |
| sub_15CA540 | 0x15CA540 | OptimizationRemarkMissed (pass failed/skipped) |
| sub_B178C0 | 0xB178C0 | Warning-level DiagnosticInfo (non-remark warning) |

The constructor takes a pass name string (e.g., "coro-split", "wholeprogramdevirt", "loop-distribute") and a remark ID string (e.g., "Devirtualized", "Distribute", "CoroSplit").

Step 2: Build the message. The message is assembled through a builder pattern:

| Builder Function | Address | Purpose |
|---|---|---|
| sub_B18290 | 0xB18290 | Append raw string to remark message |
| sub_B16430 | 0xB16430 | Create named string attribute (e.g., "FunctionName") |
| sub_B16B10 | 0xB16B10 | Create named integer attribute (e.g., "frame_size") |
| sub_B16530 | 0xB16530 | Append named value (used in analysis remarks) |
| sub_B180C0 | 0xB180C0 | Finalize and prepare remark for emission |

A typical emission sequence (from CoroSplit at 0x24F05D1):

call sub_B17560("coro-split", "CoroSplit")      // create remark
call sub_B18290("Split '")                       // append prefix
call sub_B16430("function", fn_name)             // named attribute
call sub_B18290("' (frame_size=")                // literal text
call sub_B16B10("frame_size", N)                 // integer attribute
call sub_B18290(", align=")                      // literal text
call sub_B16B10("align", M)                      // integer attribute
call sub_B18290(")")                             // closing paren

Resulting remark text: Split '<function_name>' (frame_size=N, align=M)

Step 3: Publish. sub_1049740 publishes the remark to the diagnostic handler registered on the LLVMContext. The handler consults the pass-remarks / pass-remarks-missed / pass-remarks-analysis regex filters to decide whether to emit or suppress the remark.
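The handler-side decision can be modeled with one regex per remark kind. A sketch under two assumptions: an unset pattern suppresses remarks (LLVM's default), and matching is a regex search against the pass name; `should_emit` and the `filters` dict are invented names.

```python
import re

def should_emit(pass_name: str, kind: str, filters: dict) -> bool:
    """kind is one of "passed" | "missed" | "analysis", mirroring
    pass-remarks / pass-remarks-missed / pass-remarks-analysis."""
    pattern = filters.get(kind)
    if not pattern:
        return False          # no pattern registered: remark suppressed
    return re.search(pattern, pass_name) is not None
```

Under this model, `-pass-remarks=inline` enables passed remarks from the inliner while leaving missed and analysis remarks suppressed.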

After emission, remark objects are cleaned up: vtable-based destructors free the remark structure, and SSO string cleanup checks whether each temporary string pointer differs from its inline buffer address (indicating heap allocation that needs free).

Remark Categories

Standard LLVM categories:

| Category | YAML Tag | Meaning |
|---|---|---|
| Passed | !Passed | Optimization was successfully applied |
| Missed | !Missed | Optimization was considered but not applied |
| Analysis | !Analysis | Intermediate analysis information |
| Failure | !Failure | Internal failure during optimization |

NVIDIA-specific categories added to the remark framework:

| Category | YAML Tag | Purpose |
|---|---|---|
| AnalysisFPCommute | !AnalysisFPCommute | GPU floating-point commutativity analysis feedback |
| AnalysisAliasing | !AnalysisAliasing | GPU memory aliasing analysis feedback |

These NVIDIA-specific categories are registered in the YAML serializer at sub_15CAD70 and the YAML parser at sub_C30A00.

Serialization Backends

YAML serializer (sub_15CAD70, 13KB at 0x15CAD70): Emits structured YAML with fields Pass, Name, DebugLoc, and the remark type tag. Uses a vtable-based streaming API at offsets +96 (writeKey), +120 (beginMapping), +128 (endMapping).

Bitstream serializer (sub_F01350, 23KB at 0xF01350): Emits remarks in LLVM's binary bitstream format (used for -fsave-optimization-record). Record types include "Remark", "Remark header", "Remark debug location", "Remark hotness", "Argument with debug location", and "Argument". Uses sub_EFD2C0 for VBR-encoded record emission and sub_EFCCF0 for abbreviation definitions.

Remark serializer factory (sub_C2E790, 6KB at 0xC2E790): llvm::remarks::createRemarkSerializer dispatches to YAML or bitstream format based on configuration. Returns an error for unknown formats: "Unknown remark serializer format.".

OptimizationRemarkEmitter Analysis

Two analysis passes provide remark emission capability to function-level and machine-function-level passes:

| Pass | Pipeline Name | Level |
|---|---|---|
| OptimizationRemarkEmitterAnalysis | "opt-remark-emit" (pipeline ID 181) | Function analysis |
| MachineOptimizationRemarkEmitterAnalysis | "machine-opt-remark-emitter" (pipeline ID 467) | MachineFunction analysis |

Passes that emit remarks must request the appropriate analysis and store the resulting OptimizationRemarkEmitter*. For example, the TwoAddressInstruction pass stores it at this+272, obtained via analysis lookup unk_4FC4534.

Passes Known to Emit Remarks

This is a non-exhaustive list of passes observed emitting optimization remarks in the binary:

| Pass | Remark Name | Remark Examples |
|---|---|---|
| CoroSplit | "coro-split" | Split '<fn>' (frame_size=N, align=M) |
| WholeProgramDevirt | "wholeprogramdevirt" | Devirtualized '<fn>' |
| LoopDistribute | "loop-distribute" | Distribute, NoUnsafeDeps, TooManySCEVRuntimeChecks |
| LoopVectorize | "loop-vectorize" | Vectorization success/failure details |
| LoopUnroll | "loop-unroll" | Unroll factor and failure reasons |
| LoopInterchange | "loop-interchange" | Cannot interchange loops... |
| LICM | "licm" | Hoist success/failure reasons |
| SLPVectorizer | "slp-vectorizer" | SLP vectorization decisions |
| MachinePipeliner | "pipeliner" | Pipelined succesfully! [sic] |
| MachineOutliner | "machine-outliner" | Outlining decisions |
| OpenMP SPMD Transform | "openmp-opt" | OMP120 (remark), OMP121 (warning) |
| InstCombine | "instcombine" | Visit decisions (via instcombine-visit filter) |
| FastISel | "fastisel" | FastISel failure reports |
| IRCE | "irce" | Range check elimination decisions |
| TwoAddressInstruction | "twoaddressinstruction" | Two-address conversion decisions |

NVIDIA Profuse Framework

Design and Purpose

The "profuse" diagnostic framework is an NVIDIA-specific verbose output system that has no connection to the LLVM OptimizationRemark infrastructure. It predates LLVM's remark system and serves a different purpose: providing NVIDIA compiler engineers with extremely detailed, unstructured diagnostic output from specific optimization passes.

The name "profuse" is unfortunately overloaded in the cicc binary. Two completely unrelated systems use the word:

  • PGO profuse: The profuse knob registered at ctor_375 (0x512720) is a boolean that enables profile-guided optimization data consumption. It is set via -profile-instr-use <file> which routes to -Xopt -profuse=true -Xopt -proffile=<file>. This is a PGO control flag, not a diagnostic system.
  • Diagnostic profuse: The profuseinline and profusegvn knobs are NVIDIA diagnostic toggles that control verbose output from specific optimization passes. These are the "profuse framework" discussed here.

profuseinline

Registered at ctor_186_0 (0x4DBEC0) as a cl::opt<bool> with default value off (false).

When enabled, the NVIDIA custom inliner (sub_1864060, the shouldInline / inline cost computation) emits verbose diagnostic output for every inlining decision. This includes the computed cost, threshold comparison, argument type-size coercion details, and the final accept/reject decision.

The profuse inlining output goes directly to stderr through fprintf-style calls within the inliner code. It is not routed through OptimizationRemarkEmitter and does not appear in remark YAML/bitstream output. This is distinct from the LLVM inline-remark-attribute knob which annotates the IR with remark metadata.

The -inline-info CLI flag does not enable profuseinline. Instead, -inline-info routes to the three standard pass-remarks knobs filtered for "inline". To enable profuse output, one must pass -Xopt -profuseinline=true (or -Xcicc -opt -profuseinline=true through nvcc).

Comparison of the two diagnostic channels for inlining:

| Feature | profuseinline | -inline-info (pass-remarks) |
|---|---|---|
| Output format | Unstructured stderr text | Structured LLVM remark |
| Controlled by | cl::opt<bool> | Regex filter on pass name |
| Default | Off | Off |
| YAML/bitstream output | No | Yes (if -pass-remarks-output set) |
| Cost model details | Yes (full cost breakdown) | No (accept/reject only) |
| NVIDIA-specific metrics | Yes (GPU opcode bonus, struct analysis) | No |

profusegvn

Registered at ctor_201 (0x4E0990) as a cl::opt<bool> with default value true (enabled). Global address: 0x4FAE7E0. Description: "profuse for GVN".

When the knob is active (which it is by default), the GVN pass (sub_1900BB0, 83KB) emits verbose diagnostic output at the following decision points:

  • Value replacement decisions (when a leader is found in the value numbering table)
  • Store/load expression hash table matches
  • PRE (Partial Redundancy Elimination) insertion decisions

The output is written directly to stderr, bypassing the LLVM remark system entirely. The profuse GVN output is not captured by -pass-remarks-output and does not appear in remark YAML or bitstream files.

To disable the verbose output, pass -Xopt -profusegvn=false. The fact that this defaults to true (unlike profuseinline which defaults to false) suggests it may be gated by an additional runtime check (possibly wizard mode or an optimization level gate) to prevent user-visible noise in release builds.

Profuse vs. LLVM Remarks Summary

| Aspect | Profuse Framework | LLVM Optimization Remarks |
|---|---|---|
| Origin | NVIDIA custom | Upstream LLVM |
| Passes | Inliner, GVN only (observed) | Most optimization passes |
| Output | Raw stderr fprintf | Structured DiagnosticInfo |
| Format | Unstructured text | YAML, bitstream, or terminal |
| Filtering | Per-knob boolean | Regex on pass name |
| Serialization | None | YAML and bitstream serializers |
| IDE integration | None | SARIF (with post-processing) |
| Default | Off (inline) / On (GVN) | Off (requires -pass-remarks) |

Filtering and Configuration

CLI Flags for Diagnostic Control

EDG frontend diagnostics (Phase I):

| Flag | Route | Effect |
|---|---|---|
| --diagnostics_format=sarif | EDG direct | Switch output to SARIF JSON |
| --output_mode text\|sarif | EDG direct (case 293) | Same as above, alternative spelling |
| -w | opt -w, llc -w | Suppress all warnings |
| -Werror | opt -Werror, llc -Werror | Promote warnings to errors |
| --error_limit N | EDG direct | Maximum errors before abort (unk_4F07478) |
| #pragma diag_suppress N | EDG source | Suppress specific diagnostic by number |

LLVM optimization remarks (Phase II / opt):

| Flag | Route | Effect |
|---|---|---|
| -inline-info | opt: -pass-remarks=inline, -pass-remarks-missed=inline, -pass-remarks-analysis=inline | Enable inline-specific remarks |
| -Xopt -pass-remarks=<regex> | opt direct | Enable passed remarks matching pattern |
| -Xopt -pass-remarks-missed=<regex> | opt direct | Enable missed remarks matching pattern |
| -Xopt -pass-remarks-analysis=<regex> | opt direct | Enable analysis remarks matching pattern |
| -Xopt -pass-remarks-output=<file> | opt direct | Write remarks to file (YAML or bitstream) |
| -Xopt -pass-remarks-format=yaml\|bitstream | opt direct | Select output format |
| -Xopt -pass-remarks-with-hotness | opt direct | Include PGO hotness in remarks |
| -Xopt -pass-remarks-hotness-threshold=N | opt direct | Minimum hotness for emission |
| -Xopt -pass-remarks-filter=<regex> | opt direct | Additional pass name filter |

NVIDIA profuse diagnostics:

| Flag | Route | Effect |
|---|---|---|
| -Xopt -profuseinline=true | opt direct | Enable verbose inlining diagnostics |
| -Xopt -profusegvn=false | opt direct | Disable verbose GVN diagnostics (on by default) |

Debug and verbose output:

| Flag | Route | Effect |
|---|---|---|
| -enable-verbose-asm | llc -asm-verbose | Verbose assembly comments |
| -show-src | llc -nvptx-emit-src | Embed source in PTX output |
| -time-passes | special (must be only flag) | Time each LLVM pass |

Global Variables Controlling Diagnostic Behavior

| Address | Type | Name | Purpose |
|---|---|---|---|
| unk_4D04198 | int | diagnostic_format | 0 = text, 1 = SARIF |
| byte_4F07481[0] | byte | min_severity_threshold | Minimum severity for emission |
| unk_4F074B0 | uint | error_count | Running error counter |
| unk_4F074B8 | uint | warning_count | Running warning/non-error counter |
| unk_4F07478 | uint | error_limit | Maximum errors before abort |
| unk_4F07490 | flag | print_counters | Whether to print summary counters |
| unk_4D04728 | byte | diag_numbering | Diagnostic numbering enabled |
| unk_4D042B0 | byte | command_line_mode | Command-line diagnostic prefix |
| unk_4D042B8 | flag | werror_flag | Promote severity to 7 for warnings |
| dword_4D039D0 | int | terminal_width | Columns for word-wrapping |
| dword_4F073CC[0] | int | ansi_color_enabled | ANSI color output flag |
| dword_4F073C8 | int | rich_escape_mode | Rich (2-byte ESC) vs simple mode |
| qword_4F07468 | int64 | wrap_control | Low32: disable wrap. High32: suppress context |
| qword_4F07510 | FILE* | diag_output_stream | Output stream (stderr) |
| qword_4D04908 | FILE* | diag_log_file | Machine-readable log file |
| byte_4CFFE80 | array | diag_seen_flags | Per-diagnostic duplicate tracking |

Growable String Buffer Infrastructure

All three diagnostic systems share the same growable string buffer used for message formatting. The buffer structure appears at qword_4D039D8 (output buffer), qword_4D039E0 (prefix buffer), and qword_4D039E8 (header/message buffer):

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | (tag/type) | Unused or type discriminator |
| +8 | 8 | capacity | Maximum bytes before realloc |
| +16 | 8 | length | Current write position |
| +24 | 8 | (unused) | Padding |
| +32 | 8 | data | char* pointer to the actual buffer |

| Helper | Address | Operation |
|---|---|---|
| sub_823800 | 0x823800 | Reset/clear buffer (set length to 0) |
| sub_823810 | 0x823810 | Grow buffer capacity (realloc) |
| sub_8238B0 | 0x8238B0 | Append data: memcpy(buf->data + buf->length, str, len) |
| sub_8237A0 | 0x8237A0 | Allocate new buffer (initial capacity = 1024) |
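A minimal model of this buffer, mapping each method to the helper it mirrors. Only the initial capacity (1024) and the append-as-memcpy behavior are recovered; the doubling growth policy and the `GrowBuf` name are assumptions.

```python
class GrowBuf:
    """Sketch of the shared growable buffer: capacity at +8,
    length at +16, data pointer at +32."""

    def __init__(self, capacity: int = 1024):   # sub_8237A0: initial 1024
        self.capacity = capacity
        self.length = 0
        self.data = bytearray(capacity)

    def reset(self):                            # sub_823800
        self.length = 0

    def grow(self, needed: int):                # sub_823810 (doubling assumed)
        while self.capacity < needed:
            self.capacity *= 2
        self.data.extend(bytearray(self.capacity - len(self.data)))

    def append(self, s: bytes):                 # sub_8238B0: memcpy at length
        if self.length + len(s) > self.capacity:
            self.grow(self.length + len(s))
        self.data[self.length:self.length + len(s)] = s
        self.length += len(s)

    def text(self) -> bytes:
        return bytes(self.data[:self.length])
```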

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| sub_67B780 | 0x67B780 | -- | EDG: Increment error/warning counters |
| sub_67B7E0 | 0x67B7E0 | -- | EDG: Build include-stack annotation |
| sub_67B9F0 | 0x67B9F0 | -- | EDG: Diagnostic record pool allocator |
| sub_67BB20 | 0x67BB20 | -- | EDG: Argument node allocator |
| sub_67BBF0 | 0x67BBF0 | -- | EDG: Set ANSI color state for output |
| sub_67BD40 | 0x67BD40 | -- | EDG: Emit newline/flush for source context |
| sub_67BDC0 | 0x67BDC0 | -- | EDG: Load file metadata and tab stop width |
| sub_67C120 | 0x67C120 | -- | EDG/SARIF: Emit physicalLocation JSON |
| sub_67C860 | 0x67C860 | -- | EDG: Localized string lookup by ID |
| sub_67D2D0 | 0x67D2D0 | -- | EDG: Convert internal diag ID to user-visible number |
| sub_67D470 | 0x67D470 | -- | EDG: Record pragma-based suppression |
| sub_67D520 | 0x67D520 | -- | EDG: Check pragma-based suppression |
| sub_67D610 | 0x67D610 | -- | EDG: Create synthetic diagnostic (warnings-as-errors) |
| sub_681B50 | 0x681B50 | -- | EDG: Populate message text into header buffer |
| sub_681D20 | 0x681D20 | 37KB | EDG: Terminal text diagnostic renderer |
| sub_683690 | 0x683690 | -- | EDG/SARIF: Emit JSON-escaped message object |
| sub_6837D0 | 0x6837D0 | 20KB | EDG: Diagnostic dispatch and SARIF renderer |
| sub_721AB0 | 0x721AB0 | -- | EDG: Multi-byte character byte count |
| sub_722DF0 | 0x722DF0 | -- | EDG/SARIF: Resolve file-id to path string |
| sub_722FC0 | 0x722FC0 | -- | EDG: Format filename into buffer |
| sub_723260 | 0x723260 | -- | EDG: Get filename string from file info |
| sub_723640 | 0x723640 | -- | EDG: Get decorated source location string |
| sub_729B10 | 0x729B10 | -- | EDG: Retrieve file/line data for source context |
| sub_729E00 | 0x729E00 | -- | EDG/SARIF: Decompose packed source position |
| sub_729F80 | 0x729F80 | -- | EDG: Promote severity (hard error) |
| sub_7235F0 | 0x7235F0 | -- | EDG: Fatal exit with severity code |
| sub_7AF1D0 | 0x7AF1D0 | -- | EDG: Newline character mapping lookup |
| sub_823800 | 0x823800 | -- | Shared: Reset/clear growable string buffer |
| sub_823810 | 0x823810 | -- | Shared: Grow/realloc string buffer |
| sub_8237A0 | 0x8237A0 | -- | Shared: Allocate new growable buffer |
| sub_8238B0 | 0x8238B0 | -- | Shared: Append to string buffer |
| sub_B16430 | 0xB16430 | -- | LLVM Remark: Create named string attribute |
| sub_B16530 | 0xB16530 | -- | LLVM Remark: Append named value |
| sub_B16B10 | 0xB16B10 | -- | LLVM Remark: Create named integer attribute |
| sub_B157E0 | 0xB157E0 | -- | LLVM Remark: Get DebugLoc for remark source location |
| sub_B17560 | 0xB17560 | -- | LLVM Remark: Construct OptimizationRemark (passed) |
| sub_B178C0 | 0xB178C0 | -- | LLVM Remark: Construct warning-level DiagnosticInfo |
| sub_B180C0 | 0xB180C0 | -- | LLVM Remark: Finalize and prepare remark for emission |
| sub_B18290 | 0xB18290 | -- | LLVM Remark: Append raw string to remark message |
| sub_B2BE50 | 0xB2BE50 | -- | LLVM Remark: getRemarkStreamer |
| sub_B6EA50 | 0xB6EA50 | -- | LLVM Remark: isEnabled check |
| sub_B6F970 | 0xB6F970 | -- | LLVM Remark: getRemarkFilter |
| sub_B91220 | 0xB91220 | -- | LLVM Remark: Free remark string |
| sub_C2E790 | 0xC2E790 | 6KB | LLVM Remark: createRemarkSerializer factory |
| sub_C302C0 | 0xC302C0 | 4KB | LLVM Remark: YAML remark serializer emit |
| sub_C30A00 | 0xC30A00 | 6KB | LLVM Remark: YAML remark parser (6 type tags) |
| sub_C31010 | 0xC31010 | 8KB | LLVM Remark: YAML remark field parser |
| sub_EFCCF0 | 0xEFCCF0 | 9KB | LLVM Remark: Bitstream abbreviation emitter |
| sub_EFD2C0 | 0xEFD2C0 | 18KB | LLVM Remark: Bitstream record writer |
| sub_EFE900 | 0xEFE900 | 30KB | LLVM Remark: Bitstream remark parser |
| sub_F01350 | 0xF01350 | 23KB | LLVM Remark: Bitstream remark serializer |
| sub_1049740 | 0x1049740 | -- | LLVM Remark: Publish remark to diagnostic handler |
| sub_15CA330 | 0x15CA330 | -- | LLVM Remark: OptimizationRemark constructor |
| sub_15CA540 | 0x15CA540 | -- | LLVM Remark: OptimizationRemarkMissed constructor |
| sub_15CAB20 | 0x15CAB20 | -- | LLVM Remark: OptimizationRemark::operator<<(StringRef) |
| sub_15CAD70 | 0x15CAD70 | 13KB | LLVM Remark: YAML remark serializer (NVIDIA-extended) |
| sub_1DCCCA0 | 0x1DCCCA0 | -- | LLVM Remark: OptimizationRemarkEmitter::emit |

Cross-References

  • Entry Point & CLI -- flag routing for -w, -Werror, -inline-info, -Xopt pass-through
  • GVN -- profusegvn knob and GVN diagnostic output
  • Inliner Cost Model -- profuseinline knob and inline cost diagnostics
  • LLVM Pass Pipeline -- opt-remark-emit and machine-opt-remark-emitter analysis pass registration
  • EDG Frontend -- EDG option registration including --diagnostics_format
  • CLI Flags -- complete flag-to-pipeline routing table
  • Knobs -- profuseinline, profusegvn, and remark-related knobs
  • AsmPrinter -- remark emission during code generation

Hash Table and Collection Infrastructure

Every associative container in cicc v13.0 is built from the same handful of primitives: a pointer-hash DenseMap/DenseSet with quadratic probing, a wyhash-v4-family string hasher, and a SmallVector with inline buffer optimization. Before this page existed, the same hash table description was duplicated across 30+ wiki pages. This is the single source of truth. If you are reimplementing cicc's data structures, start here.

There are no NVIDIA-specific modifications to the DenseMap hashing or probing logic -- cicc links the LLVM 20.0.0 implementation unmodified. The only NVIDIA-original hash infrastructure is the wyhash-v4 string hasher used for the builtin name table.

DenseMap Layout

Two variants exist, distinguished by bucket stride. Both share the same 28-byte inline header, the same hash function, the same probing sequence, the same sentinel values, and the same growth policy. The header is always embedded directly inside a larger structure (context object, analysis result, pass state) -- never heap-allocated on its own.

Variant A -- DenseSet (8 bytes/bucket)

| Offset | Size | Type | Field |
|---|---|---|---|
| +0 | 8 | uint64_t | NumEntries |
| +8 | 8 | ptr | Buckets (heap-allocated array) |
| +16 | 4 | uint32_t | NumItems (live entries) |
| +20 | 4 | uint32_t | NumTombstones |
| +24 | 4 | uint32_t | NumBuckets (always power of 2) |

Bucket array size: NumBuckets * 8 bytes. Each bucket holds either a valid pointer, an empty sentinel, or a tombstone sentinel.

Variant B -- DenseMap (16 bytes/bucket)

Same 28-byte header. Each bucket holds a key-value pair at a 16-byte stride:

v30 = (_QWORD *)(buckets + 16LL * slot);   // sub_163D530 line 561
*v30 = key;                                  // +0: key
v30[1] = value;                              // +8: value

Variant B is used by the SelectionDAG builder (context offsets +120 and +152), the NVVM IR node uniquing tables, and any subsystem that maps pointers to pointers.

Where the Variants Appear

| Subsystem | Variant | Context offset | Purpose |
|---|---|---|---|
| NVVM IR uniquing (sub_162D4F0) | B (16B) | context qw[130..178] | Node deduplication per opcode |
| SelectionDAG builder (sub_163D530) | B (16B) | +120, +152 | Node mapping |
| SelectionDAG builder (sub_163D530) | A (8B) | +184 | Worklist set |
| Per-node analysis structures | A (8B) | +72 inside v381 | Visited set |
| CSSA PHI map (sub_3720740) | B (16B) | r15+0x60 | PHI-to-ID mapping |
| Coroutine spill tracking | B (16B) | +0x18 inline | Spill/reload tracking |
| Builtin name table | custom (12B stride) | context+480 | Name-to-ID with hash cache |

Pointer Hash Function

Every DenseMap/DenseSet instance in cicc that uses pointer keys employs the same hash:

hash(ptr) = (ptr >> 9) ^ (ptr >> 4)

This is LLVM's DenseMapInfo<void*>::getHashValue, unchanged. The right-shift by 4 discards the low bits that are always zero due to 8- or 16-byte alignment. The right-shift by 9 mixes in higher-order address bits to break up the stride patterns that arise from slab allocation (where consecutive objects are separated by a fixed power-of-two). The XOR combines these two views of the pointer into a single hash value that distributes well for both heap-allocated and slab-allocated objects.

Representative decompiled evidence (appears identically in dozens of functions):

v9 = (v12 - 1) & (((unsigned int)v11 >> 9) ^ ((unsigned int)v11 >> 4));
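The hash and initial-slot computation in executable form; `ptr_hash` and `initial_slot` are invented names, and the 64-bit mask stands in for native register wraparound.

```python
def ptr_hash(ptr: int) -> int:
    """LLVM's DenseMapInfo<void*> hash as recovered: >>4 discards
    alignment zero bits, >>9 breaks up slab-allocation strides."""
    return ((ptr >> 9) ^ (ptr >> 4)) & 0xFFFFFFFFFFFFFFFF

def initial_slot(ptr: int, num_buckets: int) -> int:
    # num_buckets is always a power of 2, so masking replaces modulo
    return ptr_hash(ptr) & (num_buckets - 1)
```

Eight consecutive 16-byte-aligned objects from one slab land in eight distinct initial slots of a 64-bucket table, illustrating why the low bits alone would be a poor hash.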

Integer-Key Hash Variant

A separate hash function is used for DenseMap<unsigned, T> instances (integer keys rather than pointers):

hash(key) = key * 37

This is LLVM's DenseMapInfo<unsigned>::getHashValue. It appears in the instruction emitter (sub_2E29BA0), the two-address pass (sub_1F4E3A0), the vector legalization tables, and the SelectionDAG instruction selection cost table (sub_3090F90). Integer-key maps use a different sentinel pair: 0xFFFFFFFF (empty) and 0xFFFFFFFE (tombstone).

wyhash v4 String Hasher -- sub_CBF760

The NVVM builtin name table uses a separate, NVIDIA-original hash function for string keys. sub_C92610 is a thin wrapper that tail-calls sub_CBF760. The function dispatches on input length into six code paths, each using different constant sets and mixing strategies:

Length Dispatch Table

| Length | Strategy | Constants |
|---|---|---|
| 0 | Return constant | 0x2D06800538D394C2 |
| 1--3 | 3-byte read + XOR + multiply | seed 0x87275A9B, mul 0xC2B2AE3D27D4EB4F, avalanche 0x165667B19E3779F9 |
| 4--8 | 2x uint32 + combine + rotate | XOR 0xC73AB174C5ECD5A2, mul 0x9FB21C651E98DF25 |
| 9--16 | 2x uint64 + 128-bit multiply | XOR 0x6782737BEA4239B9 / 0xAF56BC3B0996523A, avalanche 0x165667919E3779F9 |
| 17--128 | Paired 16B reads from both ends | Per-pair constants, 128-bit multiplies, length mixed with 0x61C8864E7A143579 |
| 129--240 | Extended mixing | Delegates to sub_CBF370 |
| 240+ | Bulk processing | Delegates to sub_CBF100 |

Pseudocode (length 1--3, the most common case for short builtins)

fn wyhash_short(data: &[u8], len: usize) -> u32 {
    let a = data[0] as u64;
    let b = data[len / 2] as u64;
    let c = data[len - 1] as u64;
    let combined = a | (b << 8) | (c << 16) | (len as u64) << 24;
    let mixed = combined ^ 0x87275A9B;
    let wide = mixed.wrapping_mul(0xC2B2AE3D27D4EB4F);
    let folded = wide ^ (wide >> 32);
    let result = folded.wrapping_mul(0x165667B19E3779F9);
    (result ^ (result >> 32)) as u32
}

Pseudocode (length 17--128, covering most __nvvm_* names)

fn wyhash_medium(data: &[u8], len: usize) -> u32 {
    let pairs = [
        (0x1CAD21F72C81017C, 0xBE4BA423396CFEB8),  // pair 0
        (0x1F67B3B7A4A44072, 0xDB979083E96DD4DE),  // pair 1
        (0x2172FFCC7DD05A82, 0x78E5C0CC4EE679CB),  // pair 2
        // ... additional pairs for 64/96/128 thresholds
    ];
    let (mut v8, mut v10) = (0u64, 0u64);
    // read 16 bytes from front, 16 from back, mix with pair constants
    for i in 0..((len + 15) / 32) {
        let front = read_u128(&data[i * 16..]);
        let back  = read_u128(&data[len - (i + 1) * 16..]);
        (v8, v10) = mix_128(v8, v10, front, back, pairs[i]);
    }
    let combined = v8 ^ v10 ^ (len as u64 ^ 0x61C8864E7A143579);
    let result = 0x165667919E3779F9u64.wrapping_mul(combined ^ (combined >> 37));
    (result ^ (result >> 32)) as u32
}

The final return value is always a uint32 -- the high dword of the 64-bit result XORed with the low dword. Most NVVM builtin names are 8--35 bytes, hitting the 4--8, 9--16, and 17--128 paths.

Probing Strategy

All DenseMap instances use quadratic probing with triangular-number increments:

slot = hash & (capacity - 1)      // initial probe
step = 1
loop:
    if bucket[slot] == key   -> found
    if bucket[slot] == EMPTY -> not found (insert here)
    if bucket[slot] == TOMBSTONE -> record for reuse
    slot = (slot + step) & (capacity - 1)
    step++

The probe sequence for initial position h visits:

h, h+1, h+3, h+6, h+10, h+15, h+21, ...
h + T(k) where T(k) = k*(k+1)/2   (triangular numbers)

This guarantees that for a power-of-2 table size n, all n slots are visited before any index repeats: the triangular numbers T(0), T(1), ..., T(n-1) are pairwise distinct modulo n whenever n is a power of 2, so the n probe positions h + T(k) cover every slot exactly once.
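The full-cycle property is easy to check empirically with a small model of the probe loop shown above; `probe_sequence` is an invented name.

```python
def probe_sequence(h: int, num_buckets: int):
    """Triangular-number probing: the step increments by 1 each
    iteration, so slot_k = (h + T(k)) mod num_buckets with
    T(k) = k*(k+1)/2. num_buckets must be a power of 2."""
    slot, step = h % num_buckets, 1
    for _ in range(num_buckets):
        yield slot
        slot = (slot + step) & (num_buckets - 1)
        step += 1
```

For an 8-bucket table starting at slot 0 the sequence is 0, 1, 3, 6, 2, 7, 5, 4 -- the offsets 0, 1, 3, 6, 10, 15, 21, 28 reduced mod 8, visiting every slot once.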

Comparison Guard (Builtin Table)

The builtin name hash table (sub_C92740, sub_C92860) adds a triple comparison guard before performing the expensive memcmp:

  1. Cached hash equality: hash_cache[slot] == search_hash
  2. Length equality: entry->length == search_length
  3. Content equality: memcmp(search_data, entry->string_data, length) == 0

The hash cache is stored in a separate array immediately after the bucket array and the end-of-table sentinel. This layout avoids polluting bucket cache lines with hash values that are only needed on collision.
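The guard order can be modeled as below. For clarity the sketch scans slots linearly instead of replaying the quadratic probe, and all names are invented; only the three-step guard order is taken from the binary.

```python
def lookup(table, hash_cache, key: bytes, key_hash: int):
    """Find key's slot, running the expensive content comparison
    (the memcmp) only after both cheap guards pass."""
    for slot, entry in enumerate(table):
        if entry is None:
            continue
        if hash_cache[slot] != key_hash:   # guard 1: cached hash equality
            continue
        if len(entry) != len(key):         # guard 2: length equality
            continue
        if entry == key:                   # guard 3: content (memcmp)
            return slot
    return None
```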

Probing Label: "Linear" vs "Quadratic"

Some analysis reports describe the probing as "linear" because the step variable increments by 1 each iteration. The actual probe position advances quadratically (by accumulating triangular numbers). Both descriptions refer to the same code. This page uses the technically precise term: quadratic probing with triangular numbers.

Growth Policy

Load Factor Threshold -- 75%

After every successful insertion, the map checks whether to grow:

if (4 * (NumItems + 1) >= 3 * NumBuckets)
    // load factor would reach 75% -> double capacity
    new_capacity = 2 * NumBuckets

Tombstone Compaction -- 12.5%

If the load factor is acceptable but tombstones have accumulated:

elif (NumBuckets - NumTombstones - NumItems <= NumBuckets >> 3)
    // fewer than 12.5% of slots are truly empty
    // rehash at same capacity to clear tombstones
    new_capacity = NumBuckets

Rehash Procedure -- sub_C929D0

  1. calloc(new_capacity + 1, bucket_stride) for the new array.
  2. Write the end-of-table sentinel at position new_capacity.
  3. For each live (non-empty, non-tombstone) entry in the old table, reinsert into the new table using quadratic probing.
  4. Copy the cached hash (if the table has a hash cache).
  5. Track the new position of a "current slot" pointer so the caller can continue using the entry it just inserted.
  6. Free the old array.
  7. Reset NumTombstones to 0.
  8. Update NumBuckets to new_capacity.
  9. Return the new position of the tracked slot.

Capacity Constraints

  • Power of 2: always. Enforced by the bit-smearing pattern: x |= x>>1; x |= x>>2; x |= x>>4; ...; x += 1.
  • Minimum: 64 buckets for standard DenseMap instances. The builtin name table starts at 16 and grows through 16 -> 32 -> 64 -> 128 -> 256 -> 512 -> 1024 as its 770 entries are inserted.
  • Allocation: sub_22077B0 (operator new[]), freed via j___libc_free_0.

Sentinel Values

Two sentinel families exist, distinguished by magnitude. Both are chosen to be impossible values for aligned pointers.

NVVM-Layer Sentinels (small magnitude)

Used by the NVVM IR uniquing tables, the SelectionDAG builder maps, and the builtin name table:

Role       Value  Hex                 Why safe
Empty      -8     0xFFFFFFFFFFFFFFF8  Low 3 bits = 0b000 after masking, but no 8-byte-aligned pointer is this close to (uint64_t)-1
Tombstone  -16    0xFFFFFFFFFFFFFFF0  Same reasoning, distinct from -8

The builtin name table also uses a value of 2 as an end-of-table sentinel placed at bucket_array[capacity].

LLVM-Layer Sentinels (large magnitude)

Used by the majority of LLVM pass infrastructure -- SCEV, register coalescing, block placement, SLP vectorizer, StructurizeCFG, machine pipeliner, prolog-epilog, and others:

Role       Hex                 Decimal
Empty      0xFFFFFFFFFFFFF000  -4096
Tombstone  0xFFFFFFFFFFFFE000  -8192

Integer-Key Sentinels

Used by DenseMap<unsigned, T> instances (instruction emitter, two-address pass):

Role       Value       Notes
Empty      0xFFFFFFFF  32-bit all-ones
Tombstone  0xFFFFFFFE  32-bit all-ones minus 1

Which Sentinel Set to Expect

Subsystem                                Sentinel pair
NVVM IR uniquing, SelectionDAG builder   -8 / -16
Builtin name table                       -8 (tombstone), 0 (empty), 2 (end marker)
SCEV, block placement, SLP vectorizer    -4096 / -8192
Register coalescing, machine pipeliner   -4096 / -8192
StructurizeCFG, prolog-epilog            -4096 / -8192
Instruction emitter, two-address         0xFFFFFFFF / 0xFFFFFFFE
Coroutine spill tracking                 0xFFFFFFFFF000 / 0xFFFFFFFFE000
CSSA PHI map                             0xFFFFFFFFF000 / 0xFFFFFFFFE000
Debug verify                             0xFFFFFFFFF000 / 0xFFFFFFFFE000
LazyCallGraph                            0xFFFFFFFFF000 / 0xFFFFFFFFE000

The -8/-16 pair appears exclusively in NVVM-layer (NVIDIA-original) code. The -4096/-8192 pair is the standard LLVM DenseMapInfo<void*> sentinel set. The difference is cosmetic -- both pairs are safe for the same reasons -- but it reveals code provenance: if you see -8/-16, the code was written or heavily modified by NVIDIA; if you see -4096/-8192, it is stock LLVM.

SmallVector Pattern

SmallVector is the universal dynamic array throughout cicc, with two growth implementations:

Layout

[BeginPtr, Size:Count:Capacity, InlineData...]

Offset  Size  Field
+0      8     data_ptr (points to inline buffer initially, heap after growth)
+8      4     size (live element count)
+12     4     capacity (allocated slots)
+16     N     Inline buffer (N = InlineCapacity * element_size)

When size == capacity on insertion, the vector grows.

Growth Functions

Function                      Address      Description
SmallVector::grow             sub_C8D5F0   Generic growth -- copies elements, used for non-POD types
SmallVectorBase::grow_pod     sub_C8D7D0   POD-optimized growth -- uses realloc when buffer is heap-allocated
SmallVector::grow (MIR)       sub_16CD150  Second copy in the MachineIR address range, identical logic
SmallVector::grow (extended)  sub_C8E1E0   Larger variant (11 KB), handles edge cases

Growth Policy

The standard LLVM SmallVector growth: double the current capacity, clamped below by the capacity actually required. If the current buffer is the inline buffer, malloc a new heap buffer and memcpy the contents. If the buffer is already on the heap, realloc it (for POD types) or malloc + copy + free (for non-POD types).

new_capacity = max(2 * old_capacity, required_capacity)
if (data_ptr == &inline_buffer)
    heap_buf = malloc(new_capacity * elem_size)
    memcpy(heap_buf, inline_buffer, size * elem_size)
else
    // POD: heap_buf = realloc(data_ptr, new_capacity * elem_size)
    // non-POD: heap_buf = malloc(...); copy; free(old)
data_ptr = heap_buf
capacity = new_capacity

Common Inline Capacities

Observed across the codebase:

Inline capacity  Element size  Total inline bytes  Typical use
2                8             16                  SCEV delinearization terms
4                8             32                  LazyCallGraph SCC lists, basic block worklists
8                8             64                  NVVMReflect call collection, PHI operand lists
16               8             128                 AA evaluation pointer sets
22               8             176                 Printf argument arrays (stack-allocated)
8                56            448                 SROA slice descriptors

Builtin Name Table -- Specialized Hash Table

The builtin name table at context+480 is a specialized variant that does not use the standard DenseMap layout. It stores string entries rather than pointers, includes a parallel hash cache, and uses the wyhash function instead of the pointer hash.

Table Structure (20 bytes)

Offset  Size  Field
+0      8     bucket_array_ptr
+8      4     capacity (power of 2)
+12     4     count (live entries)
+16     4     tombstone_count

Memory Layout

[0 .. 8*cap-1]                    bucket_array: cap QWORD pointers
[8*cap .. 8*cap+7]                sentinel: value 2 (end-of-table)
[8*cap+8 .. 8*cap+8+4*cap-1]     hash_cache: uint32 per slot

String Entry (heap-allocated via sub_C7D670)

Offset  Size  Field
+0      8     string_length
+8      4     builtin_id (set after insertion)
+16     N+1   Null-terminated string data

Total allocation: length + 17 bytes, 8-byte aligned. The string data offset (16) is stored at hashtable+20 for use during comparison.

See Builtins for the complete 770-entry builtin ID inventory.

Usage Across the Compiler

Subsystems Using DenseMap (pointer hash, -8/-16 sentinels)

  • NVVM IR uniquing (sub_162D4F0): 8+ DenseMap instances in the NVVM context object, one per opcode range (0x04--0x1F). Tables at fixed qword-indexed offsets, spaced 32 bytes apart.
  • SelectionDAG builder (sub_163D530): Three maps at context offsets +120, +152, +184. Map A and B are 16-byte-stride (key-value), Set C is 8-byte-stride (keys only).
  • Per-node analysis structures: Embedded DenseSet at +72 within analysis objects created during DAG construction.
  • Memory space optimization (sub_1C6A6C0): DenseMap-style tables for address space tracking.

Subsystems Using DenseMap (pointer hash, -4096/-8192 sentinels)

  • SCEV (sub_F03CD0 and family): Expression caching, range computation, back-edge taken count.
  • Register coalescing (sub_1F2F8F0): Already-coalesced set, equivalence class map.
  • Block placement (sub_2E3B720): Chain membership, tail-merge candidates.
  • SLP vectorizer (sub_1ACCE50): AllOps and Scalars hash tables (32-byte entries).
  • StructurizeCFG (sub_1B66CF0): Flow-block mapping, region membership.
  • Machine pipeliner (sub_20C40D0): Schedule stage tracking.
  • CSSA (sub_3720740): PHI-to-ID mapping.
  • Debug/verify (sub_265D050): Instruction validation tables.
  • LazyCallGraph (sub_D1A040): Edge membership, SCC identity.

Subsystems Using DenseMap (integer hash, key * 37)

  • Instruction emitter (sub_2E1F350): Opcode-to-constraint mapping. Sentinels: 0xFFFFFFFF / 0xFFFFFFFE.
  • Two-address pass (sub_1F4BFE0): TiedOperandMap (56-byte entries, 4 inline). EqClassMap.
  • Vector legalization (sub_3302A00): Type-split record mapping.
  • SelectionDAG isel (sub_3090F90): Argument cost table.

Subsystems Using wyhash (string keys)

  • Builtin name table (sub_90AEE0): 770 NVVM/CUDA builtin names. Uses the specialized 20-byte table header with hash cache.
  • This is the only known use of sub_CBF760 in cicc.

Key Functions

Function                     Address          Size    Role
DenseMap pointer hash        inline           --      (ptr >> 9) ^ (ptr >> 4), always inlined
DenseMap integer hash        inline           --      key * 37, always inlined
wyhash v4                    sub_CBF760       ~4 KB   String hash, length-dispatched
wyhash wrapper               sub_C92610       tiny    Tail-calls sub_CBF760
Builtin insert-or-find       sub_C92740       ~2 KB   Quadratic probe with hash cache
Builtin find-only            sub_C92860       ~1 KB   Read-only variant of sub_C92740
Builtin rehash               sub_C929D0       ~1 KB   75% load factor, tombstone compaction
Builtin table init           sub_C92620       tiny    Creates 16-bucket initial table
SmallVector::grow            sub_C8D5F0       ~2 KB   Generic element growth
SmallVectorBase::grow_pod    sub_C8D7D0       ~5 KB   POD-optimized realloc growth
SmallVector::grow (MIR)      sub_16CD150      ~2 KB   Duplicate in MachineIR range
SmallPtrSet::insertOrFind    sub_C9A3C0       ~16 KB  Small pointer set with growth
DenseMap grow (LLVM passes)  varies per pass  --      Each pass has its own inlined or outlined rehash

Cross-References

CoroSplit & CoroFrame: Coroutine Lowering on GPU

cicc v13.0 carries the complete LLVM coroutine lowering pipeline -- CoroEarly, CoroSplit, CoroElide, CoroAnnotationElide, and CoroCleanup -- largely unchanged from upstream LLVM 19. The pass infrastructure processes C++20 co_await/co_yield/co_return coroutines emitted by the EDG 6.6 frontend, splitting a single coroutine function into separate resume, destroy, and cleanup functions while computing a coroutine frame struct to carry live state across suspend points. NVIDIA adds one proprietary intrinsic (llvm.nvvm.coro.create.suspend) and emits a .pragma "coroutine" annotation in PTX, but the core splitting and frame layout algorithms are stock LLVM. The practical constraint is that coroutine frame allocation on GPU defaults to malloc in device heap -- extremely expensive on current architectures -- making CoroElide (which replaces heap allocation with a caller-stack alloca) the pass that determines whether GPU coroutines are viable or pathological.

Key Facts

Property                         Value
CoroSplit pass entry             sub_24EF980 (71 KB, address range 0x24EF980--0x24F2300)
CoroFrame layout computation     sub_24F6730 (11,249 bytes, stack frame 5,624 bytes)
Core frame layout workhorse      sub_24F5860 (called from CoroFrame)
createResumeFunction             sub_2284030
createDestroyFunction            sub_2284040
CoroEarly pass                   sub_24DCD10 (41 KB)
CoroElide pass                   sub_24DF350 (80 KB)
CoroAnnotationElide pass         sub_24E2340 (33 KB)
CoroSplit Cloner/Driver          sub_25CA370 (55 KB)
CoroFrame Materializer           sub_25C5C80 (49 KB, heap-to-stack frame layout)
CoroFrame Spill Analysis         sub_25C1030 (37 KB)
Pass name / debug type           "CoroSplit" / "coro-split" (at 0x4388A37 / 0x4387AC3)
Coroutine metadata table         unk_4F8FAE8
CoroSplit pipeline ID            #156 (CGSCC pass, param: reuse-storage)
CoroElide pipeline ID            #220 (Function pass)
CoroAnnotationElide pipeline ID  #155 (CGSCC pass)
CoroEarly pipeline ID            #29 (Module pass)
CoroCleanup pipeline ID          #28 (Module pass)
NVIDIA intrinsic                 llvm.nvvm.coro.create.suspend (single constant integer argument)
PTX annotation                   .pragma "coroutine";

The Coroutine Lowering Pipeline

Five passes run in a fixed sequence across the optimizer pipeline. The first and last are module-level bookends; the middle three do the real work inside the CGSCC (Call Graph SCC) pipeline where inlining decisions interact with coroutine splitting.

CoroEarly (module)         Lowers coroutine setup intrinsics.
                           Materializes the NoopCoro.Frame global.
                           Replaces llvm.coro.resume, llvm.coro.destroy,
                           llvm.coro.promise, llvm.coro.free with
                           concrete operations on the frame pointer.
        |
        v
CoroSplit (CGSCC)          Identifies coroutine functions by scanning for
                           llvm.coro.suspend / llvm.coro.end intrinsics.
                           Invokes CoroFrame to compute the frame layout.
                           Clones the function into resume + destroy variants.
                           Builds the state machine dispatch switch.
        |
        v
CoroAnnotationElide (CGSCC) Annotation-driven elision: when the callee is
                           marked "elide_safe_attr" and the call site has
                           ".noalloc", converts heap alloc to alloca in the
                           caller's frame. New in LLVM 19 / cicc v13.0.
        |
        v
CoroElide (function)       Classic elision: proves the coroutine frame
                           lifetime is bounded by the caller, replaces
                           coro.alloc with alloca. Emits optimization
                           remarks "'<name>' elided in '<caller>'" or
                           "'<name>' not elided in '<caller>'".
        |
        v
CoroCleanup (module)       Removes remaining coroutine intrinsic stubs
                           that survived lowering (e.g., coro.subfn.addr).
                           Final cleanup pass -- no coroutine intrinsics
                           survive past this point.

The coro-cond module analysis (registered in the pipeline parser at sub_2337E30) gates whether the coroutine passes activate at all. If no function in the module contains llvm.coro.id, the entire pipeline is skipped. This zero-cost guard is important because the vast majority of CUDA kernels contain no coroutines.

CoroSplit as a CGSCC Pass

CoroSplit is registered as CGSCC pass #156 with an optional reuse-storage parameter. When reuse-storage is active, the pass attempts to reuse the storage of coroutine frames that are provably dead -- relevant for generators where the frame is allocated once and resumed many times. In the CGSCC context, CoroSplit runs alongside the inliner (inline) and function-attrs, allowing newly split resume/destroy functions to be immediately considered for inlining into callers within the same SCC.

CoroSplit: Suspend Point Detection and Function Splitting

Detection Phase

sub_24EF980 iterates over every function in the module. For each function, it scans all instructions using a bitmask-based opcode test to identify coroutine suspension intrinsics:

// Suspend point detection (at 0x24F00E6)
// Stack frame: 0x860+ bytes, callee-saved: r15, r14, r13, r12, rbx
// Key locals:
//   [rbp-0x7F8] = outer iteration end pointer
//   [rbp-0x7E8] = current coroutine info
//   [rbp-0x7E0] = suspend point instruction
//   [rbp-0x740] = original coroutine function
//   [rbp-0x750] = resume function pointer
//   [rbp-0x748] = destroy function pointer

uint8_t opcode = inst->getOpcode();
unsigned normalized = opcode - 0x22;
if (normalized > 51) continue;  // not in range [0x22, 0x55]

uint64_t mask = 0x8000000000041ULL;
if (!((mask >> normalized) & 1)) continue;  // bit not set

The bitmask 0x8000000000041 encodes three intrinsic opcodes:

Bit position  Opcode  Intrinsic
0             0x22    llvm.coro.suspend -- normal suspend point
6             0x28    llvm.coro.suspend.retcon -- returned-continuation suspend
51            0x55    llvm.coro.end -- coroutine termination

This single 64-bit bt (bit-test) instruction replaces what would otherwise be a three-way comparison or switch -- a pattern upstream LLVM also uses in its Intrinsic::ID checks.

Validation

After finding a suspend point, CoroSplit validates the coroutine structure (at 0x24F010E):

// Coroutine validation pseudocode (0x24F010E-0x24F0179)
Value *coro_id_inst = ...;
if (coro_id_inst->getOpcode() != 0x55)    // must be 'U' = coro.id
    goto skip;

Function *parent = coro_id_inst->getParent();  // [rax-20h]
if (!parent || parent->getOpcode() != 0)        // entry block check
    goto skip;

Value *promise = coro_id_inst->getOperand(4);   // [rcx+50h]
if (parent->getContext() != promise)             // [rax+18h] == promise
    goto skip;

if (!(parent->getFlags() & 0x20))               // "has personality" bit 5 of +0x21
    goto skip;

if (parent->getIntrinsicID() != 59)             // 0x3B = coro.id
    goto skip;

This is a thorough validation ensuring:

  1. The instruction is indeed llvm.coro.id (opcode 0x55 = 'U', intrinsic ID 59 = 0x3B)
  2. It belongs to a valid function (parent pointer non-null, starts with opcode 0)
  3. The promise alloca matches between coro.id and function context
  4. The function has the correct personality (bit 5 of byte at offset +0x21)
  5. The intrinsic ID equals 59 (cmp dword [rax+24h], 0x3B)

Nested coroutines receive additional validation (at 0x24F017F): the pass checks that coro.begin (opcode range 0x1E--0x28, ID 57 = 0x39) references the correct parent function, preventing cross-coroutine confusion when one coroutine is nested inside another.

// Nested coroutine check (0x24F017F-0x24F01D6)
unsigned operand_count = inst->getNumOperands() & 0x7FFFFFF;  // mask out type bits
Value *parent_ref = inst->getOperand(-operand_count);         // computed offset
if (parent_ref != current_function)
    goto skip;  // different coroutine -- do not cross wires

uint8_t begin_opcode = begin_inst->getOpcode();
if (begin_opcode - 0x1E > 0x0A)   // must be in [0x1E, 0x28]
    goto skip;  // not a coro.begin-related instruction

Value *frame_ptr = begin_inst->getOperand(2);  // [rdx+28h]

Suspend Point Collection

Validated suspend points are collected into a deduplicated array. The dedup check at 0x24F02F9 scans existing entries, following def-use chains ([rbx+10h]) to avoid processing the same suspend point twice when multiple CFG paths reach it. For each suspend point, the pass extracts the value operand at instruction offset +0x28.

// Suspend point collection with dedup (0x24F02F9-0x24F040A)
unsigned count = suspend_array_size;
for (unsigned i = 0; i < count; i++) {
    if (suspend_array[i] == new_suspend)
        goto already_collected;  // follow chain: [rbx+10h]
}
// Extract value operand:
Value *value_operand = suspend_inst->getOperand(2);  // [rdx+28h]
suspend_array[count++] = new_suspend;

The Split Algorithm

After collecting all suspend points, the split proceeds in three phases:

Phase 1: Frame layout computation. CoroSplit invokes sub_24F6730 (CoroFrame) to determine which SSA values are live across suspend points and must be stored in the frame struct (see the CoroFrame section below).

Phase 2: Function cloning and specialization. The split mode field at [rbp-0x3F8] controls which function variants are created:

// Function splitting dispatch (at 0x24F0540)
int split_mode = frame_state->split_mode;  // [rbp-0x3F8]

if (split_mode == 0) {
    // Returned-continuation style: destroy function only
    Function *destroy = createDestroyFunction(state, orig_fn, suspends, ...);
} else if (split_mode >= 1 && split_mode <= 3) {
    // Standard C++20 coroutine: both resume and destroy
    Function *resume  = sub_2284030(state, orig_fn, suspends, coro_info,
                                     destroy_data, resume_data);
    Function *destroy = sub_2284040(state, orig_fn, suspends, coro_info,
                                     destroy_data, resume_data);
}

sub_2284030 (createResumeFunction) and sub_2284040 (createDestroyFunction) each:

  1. Clone the original coroutine function via sub_D2E510 (function cloner)
  2. Replace the coroutine frame parameter with a typed pointer to the frame struct
  3. Insert a switch statement at the entry block dispatching on the suspend index stored in the frame (__coro_index)
  4. Replace each llvm.coro.suspend with a return instruction
  5. Wire function pointers (__resume_fn, __destroy_fn) into the frame header at offsets +0x00 and +0x08

Phase 3: Metadata and remark emission. After splitting, the pass registers the new functions in the coroutine metadata table at unk_4F8FAE8 via sub_BC1CD0, then emits an optimization remark:

// Remark emission (0x24F05D1-0x24F06E8)
sub_B17560(remark, "CoroSplit", "coro-split");  // create remark
sub_B18290(remark, "Split '");                  // prefix
sub_BD5D20(orig_fn, name_buf);                  // get function name
sub_B16430(remark, "function", name_buf);       // named attribute
sub_B18290(remark, "' (frame_size=");
sub_B16B10(remark, "frame_size", frame_size);   // integer attribute
sub_B18290(remark, ", align=");
unsigned align = 1u << alignment_log2;
sub_B16B10(remark, "align", align);
sub_B18290(remark, ")");
sub_1049740(remark);                            // publish to diagnostic handler

The format is: Split '<function_name>' (frame_size=N, align=M) where N is the computed frame size in bytes and M is 1 << alignment_log2.

The .corodispatch Trampoline

The CoroSplit dispatcher at sub_3160A60 (48 KB, second code cluster) generates a .corodispatch function -- a lightweight trampoline that:

  1. Loads __coro_index from the coroutine frame at offset +0x10
  2. Switches on the index value to select the correct resume point
  3. Uses musttail call semantics to jump to the target without growing the stack

The string "MustTailCall.Before.CoroEnd" confirms it enforces musttail on the final resume-to-end transition. Additional strings in this function include ".from." (used to construct the dispatch label name), "CoroEnd", "CoroSave", and "CoroSuspend" (marking the IR structures being dispatched through).

For GPU targets, the musttail semantics are critical: stack space is per-thread local memory, and growing it across coroutine bounces would rapidly exhaust the limited local memory budget.

CoroFrame: Frame Layout Computation

sub_24F6730 is the largest and most complex function in the coroutine pipeline, with a 5,624-byte stack frame (0x15F8) -- one of the largest in the entire cicc binary. Its job: determine which SSA values are live across suspend points and must be "spilled" into the coroutine frame struct.

Algorithm Overview

The algorithm is a BFS-based cross-suspend-point liveness analysis:

  1. Initialize tracking structures. Two hash tables with 16-byte entries, sentinel 0xFFFFFFFFF000, hash function (val >> 4) ^ (val >> 9). Initial capacity 8 entries each.

  2. Iterate all instructions. Walk every basic block and instruction. A visitor callback ([visitor+18h], virtual call) classifies each instruction as relevant or not to the frame computation.

  3. BFS traversal. A deque with 512-byte blocks (64 pointer-sized entries per block) drives BFS over the CFG. The core computation at sub_24F5860 determines which values cross which suspend points.

  4. Spill set computation. Values that are defined before a suspend point and used after it must be stored in the frame. The result is a set of (value, suspend_point) pairs.

  5. Frame layout. The frame type builder (at sub_3169200 in the second code cluster) arranges spill slots into a struct.

Frame Struct Layout

The coroutine frame is a flat C struct with a fixed header followed by computed spill slots:

struct __coro_frame {                              // type name: ".coro_frame_ty"
    void (*__resume_fn)(struct __coro_frame *);    // +0x00  resume function pointer
    void (*__destroy_fn)(struct __coro_frame *);   // +0x08  destroy function pointer
    uint32_t __coro_index;                         // +0x10  suspend point state variable
    // --- header ends, spill slots begin ---
    // padding for alignment (computed per-coroutine)
    // spill slots ordered by descending alignment requirement
    // promise storage (if promise_type is non-trivial)
    // alloca copies (stack variables that survive suspend)
};

The frame variable is named "__coro_frame" and the type is ".coro_frame_ty". The suspend point index field "__coro_index" is the state variable for the resume switch dispatch: value 0 means "initial entry", value N means "resumed at suspend point N", and a poison/unreachable value means "coroutine has returned".

The frame type builder at sub_3169200 (46 KB) constructs the StructType using these rules:

  1. The two function pointers (__resume_fn, __destroy_fn) always occupy the first 16 bytes
  2. __coro_index occupies bytes 16--19 (i32)
  3. Remaining spill slots are sorted by alignment (largest first) to minimize padding
  4. The promise alloca (if present) is placed at a known offset so llvm.coro.promise can compute it
  5. Total frame size and alignment are recorded for the split remark

Spill/Reload Code Generation

The spill/reload generator at sub_31650D0 (47 KB) creates the actual load/store instructions that move values between SSA registers and the coroutine frame:

  • A basic block named "AllocaSpillBB" is inserted at the function entry. All alloca instructions that need to survive across suspend points are moved here and replaced with GEP+store into the frame.
  • A basic block named "PostSpill" follows, branching to the original entry logic.
  • At each suspend point, ".spill.addr" store instructions write live SSA values into their frame slots.
  • After each resume point, ".reload" load instructions fetch values back from frame slots into fresh SSA values.

The naming convention (.spill.addr, .reload) is important for debugging: these instructions appear in -print-after-all dumps and identify coroutine frame traffic distinctly from normal loads/stores.

Detailed BFS Liveness Algorithm

// Pseudocode for sub_24F5860 core frame computation

void computeFrameLayout(Function *F, SmallVector<SuspendPoint> &suspends) {
    // Step 1: Build definition map
    DenseMap<Value*, uint32_t> def_map;     // sentinel 0xFFFFFFFFF000
    DenseMap<Value*, uint32_t> cross_map;   // sentinel 0xFFFFFFFFF000

    // Step 2: Walk all basic blocks, identify definitions
    for (BasicBlock &BB : *F) {
        for (Instruction &I : BB) {
            if (visitor->isRelevant(&I))    // virtual call [visitor+18h]
                def_map.insert(&I, generation++);
        }
    }

    // Step 3: For each suspend point, BFS forward to find uses
    Deque<BasicBlock*> worklist;  // 512-byte blocks, 64 entries each
    DenseSet<BasicBlock*> visited;
    for (SuspendPoint &SP : suspends) {
        worklist.clear();
        visited.clear();
        worklist.push_back(SP.getParent());

        while (!worklist.empty()) {
            BasicBlock *BB = worklist.pop_front();
            if (!visited.insert(BB).second)
                continue;  // CFG back-edges: process each block once
            for (Instruction &I : *BB) {
                for (Value *Op : I.operands()) {
                    if (def_map.count(Op) && def_before_suspend(Op, SP)) {
                        // This value is defined before SP and used after it
                        cross_map.insert({Op, SP.getIndex()});
                        spill_set.add(Op);
                    }
                }
            }
            for (BasicBlock *Succ : successors(BB))
                worklist.push_back(Succ);
        }
    }

    // Step 4: Build frame struct from spill set
    // Sort spill slots by alignment (descending) then by size
    // Compute offsets, padding, total frame size
}

The complexity is O(instructions * suspend_points) per coroutine for the liveness phase, O(V+E) for each BFS where V = basic blocks and E = CFG edges.

Data Structures

Frame info (0x138 = 312 bytes, allocated via sub_22077B0):

Offset        Size  Description
+0x00         8     Spill array pointer
+0x08         8     Reserved (initially 0)
+0x10         8     Reference count (initially 1)
+0x18--+0x98  128   Embedded hash table for spill tracking (16-byte stride, sentinel-filled)
+0x98         8     Pointer to inner table (self-referential)
+0xA0         8     Capacity encoding (0x800000000)
+0x128        8     Back-reference to visitor context
+0x130        8     Back-reference to suspend point array

Spill entry (0x48 = 72 bytes):

Offset        Size  Description
+0x00         8     Coroutine function pointer
+0x08         8     Buffer pointer (inline or heap)
+0x10         8     Capacity encoding (6 entries inline)
+0x18--+0x48  48    Inline buffer for small spill sets

The inline buffer holds up to 6 spill entries without heap allocation. When exceeded, the buffer externalizes to the heap; cleanup at 0x24F6CB0 checks [entry+8] against [entry+18h] to determine if free() is needed.

BFS deque:

Parameter              Value
Block map allocation   0x40 bytes (8 pointers)
Data block allocation  0x200 bytes (512 bytes = 64 pointer entries)
Block pointers         [rbp-0x340] = front, [rbp-0x338] = count (8), [rbp-0x330] = begin

Hash Table Policy

Both hash tables in CoroFrame share identical parameters (see hash-infrastructure.md for the universal pattern):

  • Hash function: (val >> 4) ^ (val >> 9) -- same hash used throughout cicc
  • Entry size: 16 bytes (8-byte key + 8-byte metadata)
  • Empty sentinel: 0xFFFFFFFFF000
  • Load factor threshold: 75% (triggers growth when count * 4 >= capacity * 3)
  • Tombstone cleanup: 12.5% (rehash when tombstones > capacity >> 3)
  • Growth factor: 2x (capacity doubles on each growth)
  • Collision resolution: quadratic probing with triangular-number increments (the step grows by 1 each probe; sometimes misreported as linear)

GPU-Specific Constraints: The Heap Allocation Problem

Why Device Malloc Is Pathological

Standard LLVM coroutines allocate the frame on the heap via operator new (or a custom allocator returned by get_return_object_on_allocation_failure). On GPU, this calls into the device-side malloc, which has severe limitations:

Fixed-size heap. The device heap is controlled by cudaLimitMallocHeapSize (default 8 MB across the entire GPU). A kernel launching 65,536 threads, each with a 256-byte coroutine frame, requires 16 MB of heap -- already exceeding the default. Increasing the limit helps, but the heap must be pre-allocated before kernel launch, wasting memory for non-coroutine workloads.

Serialized allocation. The device malloc implementation uses a global free list protected by atomics. Threads within a warp that attempt simultaneous allocation serialize on this atomic, and warps on the same SM contend further through L2 cache-line bouncing on the free-list head pointer. Under heavy allocation pressure (hundreds of concurrent warps), the effective throughput of device malloc can drop to single-digit allocations per microsecond -- orders of magnitude worse than any on-chip access.

Fragmentation under concurrency. Thousands of threads allocating and freeing small frames (64--512 bytes) rapidly fragment the device heap. The device allocator does not perform compaction. Once fragmented, even a heap with sufficient total free space may fail individual allocations, causing malloc to return nullptr and triggering coroutine allocation failure paths (if the user provided get_return_object_on_allocation_failure) or program termination.

Memory latency hierarchy. The cost difference between frame locations is dramatic:

Location                        Latency          Bandwidth per SM  Notes
Registers                       0 cycles         N/A (direct)      Best case -- values that don't cross suspends
Local memory (L1 hit)           ~28 cycles       ~12 TB/s          Stack alloca destination after CoroElide
Local memory (L1 miss, L2 hit)  ~200 cycles      ~3 TB/s           Large frames that spill L1
Global memory (device heap)     ~400-800 cycles  ~1 TB/s           Default without CoroElide
Device malloc overhead          ~2000+ cycles    N/A               Free-list atomic contention

The combined overhead of malloc latency + global memory access latency makes un-elided coroutines 50--100x slower than elided ones on GPU. This is the fundamental reason CoroElide is the most performance-critical coroutine optimization for GPU targets.

CoroElide: The GPU Escape Analysis

sub_24DF350 (80 KB -- the largest coroutine pass) implements the classic heap allocation elision. It runs as a function-level pass (#220 in the pipeline parser), meaning it analyzes each caller individually after CoroSplit has already split the coroutine.

Elision Preconditions

For each llvm.coro.id call site in the caller, CoroElide attempts to prove that:

  1. No handle escape. The coroutine handle (pointer to __coro_frame) does not escape the caller's scope. Specifically, the handle is not stored to memory visible to other threads, not passed to functions that might store it, and not returned from the caller. On GPU, the "visible to other threads" criterion is complicated by shared memory (addrspace(3)) and generic address space (addrspace(0)) casts -- a handle stored through a generic pointer could be visible to any thread.

  2. No external aliases. No alias of the handle is created that could outlive the caller. This includes GEPs into the frame, bitcasts, and pointer arithmetic. The alias analysis at this stage uses the results from the function-level AA pipeline.

  3. Full consumption. All suspend/resume/destroy calls on this coroutine handle are within the caller function. If the handle is passed to a helper function that calls coroutine_handle::resume(), the coroutine is not fully consumed from CoroElide's perspective (unless that helper was inlined first by the CGSCC inliner running in the same SCC iteration).

  4. Callee identity known. The coroutine callee must be identifiable (not an indirect call through a function pointer). CoroElide needs to read the callee's frame size and alignment from the split remark metadata to size the alloca correctly.

The Elision Transformation

When all preconditions are satisfied, CoroElide performs this rewrite:

// BEFORE elision (caller code):
%id    = call token @llvm.coro.id(i32 0, ptr null, ptr null, ptr null)
%need  = call i1 @llvm.coro.alloc(token %id)
br i1 %need, label %alloc, label %begin

alloc:
  %mem = call ptr @operator_new(i64 FRAME_SIZE)   ; <-- heap allocation
  br label %begin

begin:
  %phi = phi ptr [ %mem, %alloc ], [ null, %entry ]
  %hdl = call ptr @llvm.coro.begin(token %id, ptr %phi)
  ; ... use coroutine ...
  call void @llvm.coro.resume(ptr %hdl)
  call void @llvm.coro.destroy(ptr %hdl)

// AFTER elision:
%frame = alloca [FRAME_SIZE x i8], align FRAME_ALIGN   ; <-- stack allocation
%hdl   = call ptr @llvm.coro.begin(token %id, ptr %frame)
; ... use coroutine ...
call void @llvm.coro.resume(ptr %hdl)
; destroy is elided (frame on stack, automatically freed)

The key changes:

  • llvm.coro.alloc is replaced with false (allocation not needed)
  • The operator new call is deleted
  • An alloca of the correct size and alignment is inserted in the caller's entry block
  • The coro.begin now points at the stack alloca
  • llvm.coro.free is replaced with a no-op (stack memory does not need explicit deallocation)
  • The destroy function call may be simplified since stack deallocation is automatic

On NVPTX, the alloca maps to per-thread local memory (address space 5). Local memory accesses go through the L1 cache and are dramatically faster than device malloc followed by global memory access.

Elision Failure Modes on GPU

Several GPU-specific patterns defeat CoroElide:

  1. Generic address space cast. If the coroutine handle is cast to addrspace(0) (generic), the compiler cannot prove it stays in local memory. Generic pointers are indistinguishable from shared or global pointers at the IR level, so the escape analysis conservatively assumes the handle escapes.

  2. Coroutine handle in shared memory. Storing the handle to addrspace(3) (shared memory) makes it visible to all threads in the CTA. Even if the programmer knows only one thread uses it, CoroElide cannot prove this.

  3. Cross-function resume. A common pattern where the coroutine is created in one device function and resumed in another (e.g., a scheduler loop calling resume on handles from a queue). The handle passed as a function argument escapes the creator.

  4. Opaque allocator. If the coroutine uses a custom allocator (via promise_type::operator new), CoroElide may not recognize the allocation/deallocation pattern.

Diagnostic Output

CoroElide emits remarks through the standard optimization remark infrastructure:

  • Success: '<coroutine_name>' elided in '<caller_name>' (via -Rpass=coro-elide)
  • Failure: '<coroutine_name>' not elided in '<caller_name>' (via -Rpass-missed=coro-elide)

For GPU developers, the failure remark is the most important diagnostic. An un-elided coroutine on GPU is a performance disaster. The recommended debugging workflow:

nvcc -Xptxas -v --compiler-options="-Rpass-missed=coro-elide" foo.cu

CoroAnnotationElide: Developer-Asserted Elision

sub_24E2340 (33 KB) is the newer annotation-driven elision from LLVM 19. It looks for the "elide_safe_attr" function attribute and ".noalloc" suffix on coroutine function names. When both are present, elision proceeds without the full escape analysis -- the developer has asserted safety.

This is particularly useful for GPU code where the developer knows the coroutine is single-thread-scoped but the compiler cannot prove it due to pointer-to-generic-address-space casts. The "caller_presplit" attribute marks the caller as needing coroutine lowering, enabling the annotation elide pass to fire during the CGSCC iteration before the caller itself is split.

CoroAnnotationElide runs as CGSCC pass #155, meaning it fires before CoroSplit (#156) in the same CGSCC iteration. This ordering allows the annotation-based elision to rewrite allocation sites before CoroSplit performs the split, avoiding the need for a second pass.

The llvm.nvvm.coro.create.suspend Intrinsic

This is the sole NVIDIA-proprietary coroutine intrinsic. The NVVM verifier enforces:

llvm.nvvm.coro.create.suspend must have exactly one argument,
which must be a constant integer

The constant integer argument likely encodes a suspend-point identifier or mode. This intrinsic appears in the NVVM intrinsic table alongside llvm.nvvm.stacksave and llvm.nvvm.stackrestore, suggesting it interacts with the local memory stack for frame placement. Its exact lowering is handled by the NVVM-specific intrinsic lowering pass rather than the standard CoroSplit pipeline.

PTX .pragma "coroutine"

The AsmPrinter (documented in asmprinter.md) optionally emits .pragma "coroutine"; in the function header. This is triggered by metadata nodes with type byte 'N' (0x4E) linked to the current function via the list at this+792. The pragma is the first thing emitted in the function prologue (step (a) in the PTX header emission sequence at sub_215A3C0), before even the .entry/.func keyword.

The pragma signals to ptxas that the function uses coroutine semantics, potentially affecting register allocation and scheduling decisions in the assembler. The exact ptxas behavior triggered by this pragma is not documented publicly, but it likely increases the local memory budget and adjusts the register allocation heuristics for the state-machine dispatch pattern.

Warp Divergence at Suspend Points

A fundamental tension exists between SIMT execution and coroutine suspend. When one thread in a warp suspends while others do not, the warp diverges. The resume dispatch switch (the __coro_index-based state machine) creates a divergence point: threads may be at different suspend indices, requiring the hardware to serialize execution paths. This is identical to how any data-dependent branch causes divergence, but the impact is amplified because coroutine state machines typically have many switch cases (one per suspend point).

The StructurizeCFG pass (see structurizecfg.md) runs after coroutine lowering and will structurize the resume switch, potentially introducing additional control flow to manage reconvergence. On SM 70+ architectures with Independent Thread Scheduling, diverged threads can reconverge at any point, but the switch still introduces warp-level serialization proportional to the number of distinct __coro_index values active within the warp.

The Second Code Cluster (0x3150000 Region)

The binary contains a second, independent cluster of coroutine functions, likely from a different compilation unit or LTO merge:

| Function | Address | Size |
|---|---|---|
| CoroFrame layout computation | 0x3171DA0 | 55 KB |
| CoroSplit splitting logic | 0x316D160 | 49 KB |
| CoroSplit dispatcher (.corodispatch, MustTailCall.Before.CoroEnd) | 0x3160A60 | 48 KB |
| Spill/reload generation (AllocaSpillBB, PostSpill, .reload, .spill.addr) | 0x31650D0 | 47 KB |
| Frame type builder (__coro_frame, .coro_frame_ty, __coro_index) | 0x3169200 | 46 KB |
| CoroElide heap allocation elision | 0x315A7B0 | 41 KB |
| Attributor analysis helper | 0x3150D70 | 43 KB |
| Attributor analysis helper | 0x314DBB0 | 40 KB |

These functions reference the same string literals and implement the same algorithms as the primary cluster. The primary cluster at 0x24D--0x25C and this cluster at 0x314--0x317 are structurally identical -- they differ only in binary address due to compilation unit or LTO merge ordering.

Additionally, three helper functions in the primary cluster's vicinity handle specialized aspects:

| Function | Address | Size |
|---|---|---|
| CoroSplit Cloner/Driver (calls CoroFrame helpers) | sub_25CA370 | 55 KB |
| CoroFrame Materializer (heap-to-stack frame layout) | sub_25C5C80 | 49 KB |
| CoroFrame Spill Analysis helper | sub_25C1030 | 37 KB |

sub_25C5C80 (CoroFrame Materializer) is particularly relevant: this is the function that actually rewrites the IR to replace heap allocation with stack-based frame placement after CoroElide has proven safety. It materializes the frame struct type, inserts the alloca, and rewires all frame access GEPs.

Error Conditions in the Second Cluster

The CoroSplit implementation at 0x316D160 emits two diagnostic errors:

  • "Coroutines cannot handle non static allocas yet" -- triggered when a coroutine body contains a VLA (variable-length array) or alloca() with a dynamic size. The frame layout computation requires compile-time-known sizes for all frame slots. Dynamic allocas would require a separate heap allocation per suspend-resume cycle.

  • "alignment requirement of frame variables" -- triggered when a spill slot requires alignment exceeding the frame's maximum supported alignment. This can occur with over-aligned types (e.g., alignas(256) variables that must survive across suspends).

The CoroFrame at 0x3171DA0 emits:

  • "token definition separated from use by suspend point" -- a fatal error when an LLVM token value (which cannot be stored to memory) crosses a suspend boundary. Tokens are used for exception handling state and musttail call tracking; they are inherently non-materializable.

  • "Unable to handle alias with unknown offset before CoroBegin" -- triggered when a GEP with a non-constant offset operates on a value computed before coro.begin. The frame layout computation needs constant offsets to compute spill slot positions.

EDG Frontend Support

The EDG 6.6 frontend fully implements C++20 coroutine semantics in two key functions:

  • sub_87AFA0 (14 KB) -- Coroutine body processor. Resolves promise_type methods: initial_suspend, final_suspend, unhandled_exception, get_return_object, get_return_object_on_allocation_failure. Generates the coroutine body scaffolding including the implicit try-catch around user code.

  • sub_87BD00 (6 KB) -- Coroutine trait resolver. Looks up std::coroutine_traits<R, Args...>::promise_type, std::coroutine_handle, return_value, return_void. The EDG IL walker maps these as IL node type 64 (il_coroutine), with expression sub-type 0x21 (coroutine_expr). The IL copier handles coroutine handles as entity type 72 (coroutine_handle).

The frontend does not restrict coroutines to host-side code. The EDG configuration sets COROUTINE_ENABLING_POSSIBLE = 1 globally, meaning __device__ functions can be coroutines. The full coroutine IR (with llvm.coro.id, llvm.coro.begin, llvm.coro.suspend, etc.) flows into the NVVM optimizer pipeline regardless of the function's execution space.

Diagnostic Strings

| String | Location | Meaning |
|---|---|---|
| "Split '<name>' (frame_size=N, align=M)" | CoroSplit remark | Successful coroutine split |
| "' elided in '" | CoroElide | Frame allocation replaced with alloca |
| "' not elided in '" | CoroElide | Elision failed, heap allocation remains |
| "Coroutines cannot handle non static allocas yet" | 0x316D160 | VLA or dynamic alloca inside coroutine body |
| "alignment requirement of frame variables" | 0x316D160 | Frame alignment constraint exceeded |
| "token definition separated from use by suspend point" | 0x3171DA0 | Token value crosses suspend boundary (error) |
| "Unable to handle alias with unknown offset before CoroBegin" | 0x3171DA0 | GEP with non-constant offset on pre-begin alias |
| "llvm.nvvm.coro.create.suspend must have exactly one argument, which must be a constant integer" | NVVM verifier | Malformed NVIDIA coroutine intrinsic |
| "AllocaSpillBB" | 0x31650D0 | Entry block for spill alloca instructions |
| "PostSpill" | 0x31650D0 | Block following spill setup |
| ".spill.addr" | 0x31650D0 | Store to coroutine frame slot |
| ".reload" | 0x31650D0 | Load from coroutine frame slot after resume |
| ".corodispatch" | 0x3160A60 | Dispatch trampoline function name |
| "MustTailCall.Before.CoroEnd" | 0x3160A60 | Musttail semantics on final transition |
| ".from." | 0x3160A60 | Dispatch label name construction |
| "NoopCoro.Frame" | 0x24DCD10 | Global no-op coroutine frame (CoroEarly) |
| "caller_presplit" | 0x24E2340 | Attribute marking pre-split caller |
| "elide_safe_attr" | 0x24E2340 | Attribute asserting elision safety |
| ".noalloc" | 0x24E2340 | Function name suffix for annotation elide |

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| CoroEarly pass entry | sub_24DCD10 | 41 KB | -- |
| CoroElide pass entry | sub_24DF350 | 80 KB | -- |
| CoroAnnotationElide pass entry | sub_24E2340 | 33 KB | -- |
| CoroSplit pass entry | sub_24EF980 | 71 KB | -- |
| Core frame layout computation | sub_24F5860 | -- | -- |
| CoroFrame layout entry | sub_24F6730 | 11 KB | -- |
| CoroFrame Spill Analysis helper | sub_25C1030 | 37 KB | -- |
| CoroFrame Materializer (heap-to-stack) | sub_25C5C80 | 49 KB | -- |
| CoroSplit Cloner/Driver | sub_25CA370 | 55 KB | -- |
| createResumeFunction | sub_2284030 | -- | -- |
| createDestroyFunction | sub_2284040 | -- | -- |
| Function cloner (used for resume/destroy) | sub_D2E510 | -- | -- |
| Frame-already-computed check | sub_B2D610 | -- | -- |
| Get function name string | sub_BD5D20 | -- | -- |
| Register in coroutine metadata table | sub_BC1CD0 | -- | -- |
| Create optimization remark | sub_B17560 | -- | -- |
| Publish remark to diagnostic handler | sub_1049740 | -- | -- |
| Allocator (frame info, spill entries, BFS deque) | sub_22077B0 | -- | -- |
| coro-cond module analysis checker | sub_2337E30 | 15 KB | -- |
| Attributor helper (coroutine attributes) | sub_314DBB0 | 40 KB | -- |
| Attributor helper (coroutine attributes) | sub_3150D70 | 43 KB | -- |
| CoroElide (second cluster) | sub_315A7B0 | 41 KB | -- |
| CoroSplit dispatcher (.corodispatch) | sub_3160A60 | 48 KB | -- |
| Spill/reload generation | sub_31650D0 | 47 KB | -- |
| Frame type builder | sub_3169200 | 46 KB | -- |
| CoroSplit splitting logic (second cluster) | sub_316D160 | 49 KB | -- |
| CoroFrame layout (second cluster) | sub_3171DA0 | 55 KB | -- |
| EDG coroutine body processor | sub_87AFA0 | 14 KB | -- |
| EDG coroutine trait resolver | sub_87BD00 | 6 KB | -- |

Cross-References

OpenMP Runtime Declaration Table

cicc embeds a 194-entry table of OpenMP runtime function declarations at sub_312CF50 (0x312CF50, 117 KB decompiled). This single function is the authoritative source for every __kmpc_*, omp_*, and __tgt_* device-runtime call the compiler can emit into NVPTX IR. It defines the complete ABI contract between compiler-generated GPU code and the OpenMP device runtime library (libomptarget / libomp). The function takes an integer case index (0--193), constructs the corresponding FunctionType, checks whether the symbol already exists in the module via Module::getNamedValue, and if absent, creates a Function::Create with ExternalLinkage. The result is registered into a context-local array so that any later codegen pass can reference a runtime function by its numeric index without reconstructing the type.

Upstream LLVM defines the same runtime function set declaratively in llvm/include/llvm/Frontend/OpenMP/OMPKinds.def using the __OMP_RTL macro, which the OMPIRBuilder expands at construction time. cicc's table is a procedural equivalent: a giant switch(a3) with 194 cases that does exactly what OMPKinds.def + OMPIRBuilder::initialize() do, but compiled into the binary rather than generated from a .def file. The ordering of cases 0--193 matches the upstream OMPRTL_ enum one-to-one, confirming that cicc v13.0 tracks LLVM 18.x's OpenMP runtime interface.

Key Facts

| Property | Value |
|---|---|
| Entry point | sub_312CF50 @ 0x312CF50 |
| Decompiled size | 117 KB |
| Total entries | 194 (indices 0--193) |
| Sentinel | index 193 = __last (void function, marks table end) |
| Varargs entries | 2: index 7 (__kmpc_fork_call), index 118 (__kmpc_fork_teams) |
| Linkage for all entries | ExternalLinkage (encoded as 0x103 = 259) |
| Special attribute | Attribute #26 applied to indices 7 and 118 post-creation |
| Registration helper | sub_3122A50(context, index, funcDecl) |
| Type construction | sub_BCF480 = FunctionType::get |
| Symbol lookup | sub_BA8CB0 = Module::getNamedValue |
| Function creation | sub_B2C660 = Function::Create |
| Upstream equivalent | OMPKinds.def __OMP_RTL entries + OMPIRBuilder::initialize() |

Context Object Type Cache

The first parameter a1 points to the OpenMP runtime context object. Starting at offset +2600, it contains a pre-allocated cache of LLVM types used to construct function signatures, avoiding redundant Type::get* calls:

| Offset | Type | LLVM equivalent |
|---|---|---|
| +2600 | void | Type::getVoidTy |
| +2608 | i1 | Type::getInt1Ty |
| +2616 | i8 | Type::getInt8Ty |
| +2624 | i16 | Type::getInt16Ty |
| +2632 | i32 | Type::getInt32Ty |
| +2640 | i64 | Type::getInt64Ty |
| +2648 | i8* | PointerType::get(i8, 0) |
| +2664 | i32* | PointerType::get(i32, 0) |
| +2672 | i64* | PointerType::get(i64, 0) |
| +2680 | double | Type::getDoubleTy |
| +2688 | i64 / size_t | DataLayout::getIntPtrType |
| +2704 | i8* (generic ptr) | PointerType::get(i8, 0) |
| +2712 | i8** | PointerType::get(i8*, 0) |
| +2720 | i8*** | PointerType::get(i8**, 0) |
| +2752 | kmp_critical_name* | [8 x i32]* |
| +2784 | ident_t* | {i32, i32, i32, i32, i8*}* |
| +2800 | __tgt_kernel_arguments* | 13-field struct pointer |
| +2816 | __tgt_async_info* | {i8*}* |
| +2896 | KernelEnvironmentTy* | {ConfigEnv, ident_t*, DynEnv*}* |
| +2912 | KernelLaunchEnvironmentTy* | {i32, i32}* |
| +2928 | kmpc_micro | void(i32*, i32*, ...)* (varargs microtask) |
| +2944 | kmp_reduce_func | void(i8*, i8*)* |
| +2960 | kmp_copy_func | void(i8*, i8*)* |
| +3008 | kmpc_ctor | i8*(i8*)* |
| +3024 | kmp_routine_entry_t | i32(i32, i8*)* |
| +3040 | kmp_ShuffleReductFctPtr | void(i8*, i16, i16, i16)* |
| +3056 | kmp_InterWarpCopyFctPtr | void(i8*, i32)* |
| +3072 | kmp_ListGlobalFctPtr | void(i8*, i32, i8*)* |

This layout mirrors the OMP_TYPE, OMP_STRUCT_TYPE, and OMP_FUNCTION_TYPE sections of upstream OMPKinds.def. The struct type definitions for ident_t, KernelEnvironmentTy, and __tgt_kernel_arguments match the upstream __OMP_STRUCT_TYPE declarations exactly.

Execution Modes: SPMD vs Generic

GPU OpenMP kernels operate in one of two execution modes, and the choice fundamentally determines which runtime functions the compiler emits:

| Mode | Value | Description | Worker threads |
|---|---|---|---|
| Generic | 1 | Master-worker state machine. Only thread 0 runs serial code; workers spin in a polling loop (__kmpc_barrier_simple_generic). Parallel regions are dispatched via __kmpc_kernel_prepare_parallel / __kmpc_kernel_parallel. | Idle until parallel region |
| SPMD | 2 | All threads execute the same code from kernel entry. Serial sections between parallel regions are guarded by tid == 0 checks with shared-memory output promotion and __kmpc_barrier_simple_spmd barriers. | Active from first instruction |
| Generic-SPMD | 3 | Transient state during the Generic-to-SPMD transformation. Never observed at runtime. | N/A |

The execution mode is encoded in a bit-vector attached to the kernel function's metadata. The runtime function __kmpc_target_init (index 155) reads the KernelEnvironmentTy struct which embeds the ConfigurationEnvironmentTy -- the first byte of that inner struct encodes the execution mode. __kmpc_is_spmd_exec_mode (index 186) queries it at runtime.

The SPMD-vs-Generic distinction affects which runtime calls appear in the generated IR:

  • Generic mode kernels call __kmpc_kernel_prepare_parallel, __kmpc_kernel_parallel, __kmpc_kernel_end_parallel, __kmpc_barrier_simple_generic, and the full __kmpc_fork_call microtask dispatch.
  • SPMD mode kernels call __kmpc_parallel_51 (index 158) for nested parallelism, __kmpc_barrier_simple_spmd for synchronization, and __kmpc_alloc_shared / __kmpc_free_shared for shared-memory output promotion between guarded and parallel sections.
  • Both modes call __kmpc_target_init / __kmpc_target_deinit for kernel lifecycle management.

Call Generation Infrastructure

When any codegen pass needs a runtime function, it calls sub_312CF50(omp_context + 400, existing_value, case_index). The omp_context object (typically at a2+208 in the pass state) contains both the type cache (+2600..+3072) and the runtime function array. If Module::getNamedValue finds the symbol already declared, it is returned immediately; otherwise a new declaration is created and registered.

Once a declaration is obtained, sub_921880 (create runtime library call instruction) builds the CallInst node with the argument list from current SSA values, attaches debug/source location metadata, and inserts it at the specified basic block position.

Primary Consumers

| Pass | Address | Size | Runtime Entries Used |
|---|---|---|---|
| Generic-to-SPMD transform | sub_26968A0 | 61 KB | 6 (thread ID), 180 (alloc_shared), 181 (free_shared), 187 (barrier_simple_spmd) |
| State machine generation | sub_2678420 | 41 KB | 155 (target_init), 156 (target_deinit), 171 (kernel_parallel), 172 (kernel_end_parallel), 188 (barrier_simple_generic) |
| Parallel region outliner | sub_313D1B0 | 47 KB | 7 (fork_call), 158 (parallel_51) |
| Parallel region merging | sub_2680940 | 52 KB | 180 (alloc_shared), 181 (free_shared), 187 (barrier_simple_spmd) |
| Attributor OpenMP driver | sub_269F530 | 63 KB | All -- identifies/folds known runtime calls by index |

Complete Runtime Function Table

All 194 entries, organized by functional category. The "Index" column is the switch case in sub_312CF50 and the slot in the context's runtime function array. Signatures use LLVM IR type syntax. The "Call Generation" column describes how and when cicc emits each call.

Standard OpenMP Runtime (0--13)

| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 0 | __kmpc_barrier | void(ident_t*, i32) | Explicit barrier | Emitted for #pragma omp barrier. On GPU compiles to __syncthreads(). OpenMPOpt may replace with index 187 (SPMD barrier) |
| 1 | __kmpc_cancel | i32(ident_t*, i32, i32) | Cancel construct | Third param: cancel kind (1=parallel, 2=sections, 3=for, 4=taskgroup). Returns nonzero if cancellation pending |
| 2 | __kmpc_cancel_barrier | void(ident_t*, i32) | Implicit barrier + cancel check | Generated at end of worksharing constructs when cancel is possible |
| 3 | __kmpc_error | void(ident_t*, i32, i8*) | Runtime error | Second param: severity (1=warning, 2=fatal). Third: message string pointer |
| 4 | __kmpc_flush | void(ident_t*) | Memory fence | #pragma omp flush. On GPU: __threadfence() or scope-specific fence |
| 5 | __kmpc_global_thread_num | i32(ident_t*) | Get global thread ID | On GPU: blockIdx*blockDim+threadIdx. Emitted at start of every region needing a thread identifier |
| 6 | __kmpc_get_hardware_thread_id_in_block | i32() | threadIdx.x equivalent | Direct PTX %tid.x wrapper. Used by SPMD transform (sub_26968A0) to build tid==0 guards. Lookup: sub_312CF50(..., 6) |
| 7 | __kmpc_fork_call | void(ident_t*, i32, kmpc_micro, ...) | Fork parallel region (varargs) | Second param: shared variable count. Third: outlined microtask pointer. Remaining: shared variables. On GPU Generic mode triggers worker state machine dispatch. Attribute #26 applied post-create |
| 8 | __kmpc_fork_call_if | void(ident_t*, i32, i32, i8*, i32) | Conditional fork | Third param: if-clause condition. If false, region executes serially |
| 9 | __kmpc_omp_taskwait | void(ident_t*, i32) | Taskwait | #pragma omp taskwait |
| 10 | __kmpc_omp_taskyield | i32(ident_t*, i32, i32) | Task yield point | Third param: end-of-task flag |
| 11 | __kmpc_push_num_threads | void(ident_t*, i32, i32) | Set thread count | num_threads(N) clause. Pushes count for next parallel region |
| 12 | __kmpc_push_proc_bind | void(ident_t*, i32, i32) | Set affinity | proc_bind(spread/close/master). Third param encodes binding policy |
| 13 | __kmpc_omp_reg_task_with_affinity | i32(ident_t*, i32, i8*, i32, i8*) | Register task with affinity info | OMP 5.0 affinity clause |

Index 7 (__kmpc_fork_call) and index 118 (__kmpc_fork_teams) are the only two varargs entries. Both receive special post-processing: sub_B994D0 sets function attribute #26 (likely the convergent attribute or a varargs-related marker), checked via sub_B91C10. This prevents the optimizer from incorrectly splitting, duplicating, or removing these calls.

Hardware Query (14--16)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 14 | __kmpc_get_hardware_num_blocks | i32() | gridDim.x equivalent |
| 15 | __kmpc_get_hardware_num_threads_in_block | i32() | blockDim.x equivalent |
| 16 | __kmpc_get_warp_size | i32() | Warp size (32 on NVIDIA) |

These three functions have no parameters -- they are direct wrappers around PTX special registers (%nctaid.x, %ntid.x, and a compile-time constant 32).

OMP Standard Library API (17--45)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 17 | omp_get_thread_num | i32() | Thread ID within team |
| 18 | omp_get_num_threads | i32() | Threads in current team |
| 19 | omp_get_max_threads | i32() | Max threads available |
| 20 | omp_in_parallel | i32() | Inside parallel region? |
| 21 | omp_get_dynamic | i32() | Dynamic adjustment enabled? |
| 22 | omp_get_cancellation | i32() | Cancellation enabled? |
| 23 | omp_get_nested | i32() | Nested parallelism enabled? |
| 24 | omp_get_schedule | void(i32*, i32*) | Query loop schedule |
| 25 | omp_get_thread_limit | i32() | Max total threads |
| 26 | omp_get_supported_active_levels | i32() | Max supported nesting |
| 27 | omp_get_max_active_levels | i32() | Current max nesting |
| 28 | omp_get_level | i32() | Current nesting depth |
| 29 | omp_get_ancestor_thread_num | i32(i32) | Ancestor thread ID |
| 30 | omp_get_team_size | i32(i32) | Team size at nesting level |
| 31 | omp_get_active_level | i32() | Active parallel nesting |
| 32 | omp_in_final | i32() | Inside final task? |
| 33 | omp_get_proc_bind | i32() | Current binding policy |
| 34 | omp_get_num_places | i32() | Number of places |
| 35 | omp_get_num_procs | i32() | Available processors |
| 36 | omp_get_place_proc_ids | void(i32, i32*) | Processor IDs in place |
| 37 | omp_get_place_num | i32() | Current place number |
| 38 | omp_get_partition_num_places | i32() | Places in partition |
| 39 | omp_get_partition_place_nums | void(i32*) | Place numbers in partition |
| 40 | omp_get_wtime | double() | Wall clock time |
| 41 | omp_set_num_threads | void(i32) | Set thread count |
| 42 | omp_set_dynamic | void(i32) | Enable/disable dynamic |
| 43 | omp_set_nested | void(i32) | Enable/disable nesting |
| 44 | omp_set_schedule | void(i32, i32) | Set loop schedule |
| 45 | omp_set_max_active_levels | void(i32) | Set max nesting |

These are the user-facing OpenMP API functions. On GPU, most return compile-time constants or trivial register reads. The Attributor-based OpenMP driver (sub_269F530) can fold many of these to constants when the execution mode and team configuration are statically known -- for example, omp_get_num_threads folds to the blockDim.x launch parameter.

Begin/End (53--54)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 53 | __kmpc_begin | void(ident_t*, i32) | Library initialization (rarely used on GPU) |
| 54 | __kmpc_end | void(ident_t*) | Library shutdown |

Master/Masked Constructs (46--49)

| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 46 | __kmpc_master | i32(ident_t*, i32) | Enter master region | Returns 1 for master thread (thread 0), 0 for all others. IRGen wraps user code in if(__kmpc_master(..)) {...} |
| 47 | __kmpc_end_master | void(ident_t*, i32) | Exit master region | Called at end of master block |
| 48 | __kmpc_masked | i32(ident_t*, i32, i32) | Enter masked region (OMP 5.1) | Third param is the filter ID (which specific thread executes). Replaces master in OMP 5.1 |
| 49 | __kmpc_end_masked | void(ident_t*, i32) | Exit masked region | Called at end of masked block |

Critical Sections (50--52)

| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 50 | __kmpc_critical | void(ident_t*, i32, kmp_critical*) | Enter critical section | On GPU: atomic spin-lock acquire on the 32-byte lock variable |
| 51 | __kmpc_critical_with_hint | void(ident_t*, i32, i32, kmp_critical*) | Enter with lock hint | Hint encodes contention strategy (uncontended, contended, speculative, non-speculative) |
| 52 | __kmpc_end_critical | void(ident_t*, i32, kmp_critical*) | Exit critical section | Atomic release on lock variable |

On GPU, critical sections use atomic operations on global memory. The kmp_critical_name type is [8 x i32] (32 bytes), used as an atomic lock variable. The _with_hint variant accepts a contention hint that the GPU runtime maps to different atomic strategies.

Reduction (55--58)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 55 | __kmpc_reduce | i32(ident_t*, i32, i32, i64, i8*, kmp_reduce_func, kmp_critical*) | Begin reduction (blocking) |
| 56 | __kmpc_reduce_nowait | i32(ident_t*, i32, i32, i64, i8*, kmp_reduce_func, kmp_critical*) | Begin reduction (non-blocking) |
| 57 | __kmpc_end_reduce | void(ident_t*, i32, kmp_critical*) | End reduction (blocking) |
| 58 | __kmpc_end_reduce_nowait | void(ident_t*, i32, kmp_critical*) | End reduction (non-blocking) |

These are the standard reduction protocol entries. On GPU, the compiler typically prefers the NVIDIA-specific shuffle-based reductions (indices 176--178) which are significantly faster.

Static Loop Scheduling (61--70)

| Index | Function | Signature |
|---|---|---|
| 61--64 | __kmpc_for_static_init_{4,4u,8,8u} | void(ident_t*, i32, i32, i32*, {i32,i64}*, {i32,i64}*, {i32,i64}*, {i32,i64}, {i32,i64}) |
| 65 | __kmpc_for_static_fini | void(ident_t*, i32) |
| 66--69 | __kmpc_distribute_static_init_{4,4u,8,8u} | Same 9-param shape as 61--64 |
| 70 | __kmpc_distribute_static_fini | void(ident_t*, i32) |

The _4 / _4u / _8 / _8u suffixes indicate signed-32, unsigned-32, signed-64, and unsigned-64 loop variable types respectively. All static_init functions take 9 parameters: location, thread ID, schedule type, a pointer to the is-last flag, pointers to the lower bound, upper bound, and stride, then the increment and chunk size by value.

Dynamic Dispatch (71--87)

Indices 71--74 handle distribute + dynamic dispatch initialization. Indices 75--82 handle standard dispatch_init and dispatch_next for the four integer widths. Indices 83--87 are dispatch finalization. Total: 17 entries covering the full dynamic loop scheduling interface.

Team Static & Combined Distribute-For (88--95)

Indices 88--91 (__kmpc_team_static_init_{4,4u,8,8u}) handle team-level static work distribution. Indices 92--95 (__kmpc_dist_for_static_init_{4,4u,8,8u}) are the combined distribute parallel for static init, taking 10 parameters (the extra parameter is the distribute upper bound pointer).

Tasking (98--116)

19 entries covering the full OpenMP tasking interface:

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 98 | __kmpc_omp_task_alloc | i8*(ident_t*, i32, i32, i64, i64, kmp_routine_entry_t) | Allocate task descriptor (6 params). Returns kmp_task_t*. Params: flags, sizeof_task, sizeof_shareds, task_entry |
| 99 | __kmpc_omp_task | i32(ident_t*, i32, i8*) | Submit allocated task for execution. Third param is the kmp_task_t* from task_alloc |
| 100 | __kmpc_end_taskgroup | void(ident_t*, i32) | End #pragma omp taskgroup |
| 101 | __kmpc_taskgroup | void(ident_t*, i32) | Begin taskgroup |
| 102 | __kmpc_omp_task_begin_if0 | void(ident_t*, i32, i8*) | Begin immediate task (when if clause evaluates to false) |
| 103 | __kmpc_omp_task_complete_if0 | void(ident_t*, i32, i8*) | Complete immediate task |
| 104 | __kmpc_omp_task_with_deps | i32(ident_t*, i32, i8*, i32, i8*, i32, i8*) | Task with dependency list (7 params). Params: task, ndeps, dep_list, ndeps_noalias, noalias_list |
| 105 | __kmpc_taskloop | void(ident_t*, i32, i8*, i32, i64*, i64*, i64, i32, i32, i64, i8*) | #pragma omp taskloop (11 params). Params: task, if_val, lb_p, ub_p, st, nogroup, sched, grainsize, task_dup |
| 106 | __kmpc_taskloop_5 | void(ident_t*, i32, i8*, i32, i64*, i64*, i64, i32, i32, i64, i8*, i32) | OMP 5.1 taskloop (12 params). Extra param: modifier |
| 107 | __kmpc_omp_target_task_alloc | i8*(ident_t*, i32, i32, i64, i64, kmp_routine_entry_t, i64) | Target-offload task allocation (7 params). Extra i64: device_id |
| 108 | __kmpc_taskred_modifier_init | i8*(ident_t*, i32, i32, i32, i8*) | Init task reduction with modifier (5 params). Params: is_ws, num, data |
| 109 | __kmpc_taskred_init | i8*(i32, i32, i8*) | Init task reduction (basic) |
| 110 | __kmpc_task_reduction_modifier_fini | void(ident_t*, i32, i32) | Finalize task reduction |
| 111 | __kmpc_task_reduction_get_th_data | i8*(i32, i8*, i8*) | Get thread-local reduction data |
| 112 | __kmpc_task_reduction_init | i8*(i32, i32, i8*) | Init task reduction (alternate path) |
| 113 | __kmpc_task_reduction_modifier_init | i8*(i8*, i32, i32, i32, i8*) | Init with full modifier (5 params) |
| 114 | __kmpc_proxy_task_completed_ooo | void(i8*) | Out-of-order proxy task completion. Used for detached tasks |
| 115 | __kmpc_omp_wait_deps | void(ident_t*, i32, i32, i8*, i32, i8*) | Wait on task dependencies (6 params) |
| 116 | __kmpc_omp_taskwait_deps_51 | void(ident_t*, i32, i32, i8*, i32, i8*, i32) | OMP 5.1 dependency wait (7 params). Extra param: nowait modifier |

Index 106 (__kmpc_taskloop_5) and index 116 (__kmpc_omp_taskwait_deps_51) are OMP 5.1 additions with an extra modifier parameter compared to their predecessors.

Teams and Cancellation (117--121)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 117 | __kmpc_cancellationpoint | i32(ident_t*, i32, i32) | Cancellation point check |
| 118 | __kmpc_fork_teams | void(ident_t*, i32, kmpc_micro, ...) | Fork teams region (varargs) |
| 119 | __kmpc_push_num_teams | void(ident_t*, i32, i32, i32) | Set team count |
| 120 | __kmpc_push_num_teams_51 | void(ident_t*, i32, i32, i32, i32) | Set team count (OMP 5.1, 5 params) |
| 121 | __kmpc_set_thread_limit | void(ident_t*, i32, i32) | Set per-team thread limit |

Copyprivate and Threadprivate (122--124)

Index | Function | Signature | Purpose
122 | __kmpc_copyprivate | void(ident_t*, i32, i64, i8*, kmp_copy_func, i32) | #pragma omp copyprivate. Broadcasts private data from single thread to all others. 6 params
123 | __kmpc_threadprivate_cached | i8*(ident_t*, i32, i8*, i64, i8***) | Get/allocate threadprivate variable data. 5 params
124 | __kmpc_threadprivate_register | void(ident_t*, i8*, kmpc_ctor, void*, void*) | Register threadprivate with ctor, copy-ctor, dtor callbacks

Doacross Synchronization (125--128)

Cross-iteration dependencies for #pragma omp ordered depend(source/sink).

Index | Function | Signature | Purpose
125 | __kmpc_doacross_init | void(ident_t*, i32, i32, i8*) | Init doacross tracking. Params: num_dims, dims_info
126 | __kmpc_doacross_post | void(ident_t*, i32, i64*) | Post (source): signal iteration completion
127 | __kmpc_doacross_wait | void(ident_t*, i32, i64*) | Wait (sink): wait for iteration to complete
128 | __kmpc_doacross_fini | void(ident_t*, i32) | Finalize doacross tracking

Memory Allocators (129--136)

Index | Function | Signature | Purpose
129 | __kmpc_alloc | i8*(i32, i64, i8*) | OpenMP allocator alloc. Params: gtid, size, allocator
130 | __kmpc_aligned_alloc | i8*(i32, i64, i64, i8*) | Aligned allocation. Params: gtid, align, size, allocator
131 | __kmpc_free | void(i32, i8*, i8*) | Free allocated memory. Params: gtid, ptr, allocator
132 | __tgt_interop_init | void(ident_t*, i32, i8**, i32, i32, i32, i8*, i32) | OMP 5.1 foreign runtime interop init (8 params)
133 | __tgt_interop_destroy | void(ident_t*, i32, i8**, i32, i32, i32, i8*) | Destroy interop object (7 params)
134 | __tgt_interop_use | void(ident_t*, i32, i8**, i32, i32, i32, i8*) | Use interop object (7 params)
135 | __kmpc_init_allocator | i8*(i32, i32, i8*, i8*) | Init OpenMP allocator. Params: gtid, memspace, num_traits, traits
136 | __kmpc_destroy_allocator | void(i32, i8*) | Destroy allocator

Target Offloading (137--153)

17 entries implementing the host-side target offloading protocol. These are primarily used when cicc compiles host code that launches GPU kernels, not within device code itself:

Index | Function | Signature | Params | Purpose
137 | __kmpc_push_target_tripcount_mapper | void(ident_t*, i64, i64) | 3 | Set iteration count for target region. Params: device_id, trip_count
138 | __tgt_target_mapper | i32(ident_t*, i64, i8*, i32, i8**, i8**, i64*, i64*, i8**, i8**) | 10 | Launch target region with data mapping
139 | __tgt_target_nowait_mapper | (14 params) | 14 | Async target launch. Adds depobj count/list, noalias count/list
140 | __tgt_target_teams_mapper | (12 params) | 12 | Target teams launch. Adds num_teams, thread_limit, mappers
141 | __tgt_target_teams_nowait_mapper | (16 params) | 16 | Async target teams. Most complex host-side offload call
142 | __tgt_target_kernel | i32(ident_t*, i64, i32, i32, i8*, __tgt_kernel_args*) | 6 | New-style kernel launch (takes __tgt_kernel_arguments*)
143 | __tgt_target_kernel_nowait | (10 params) | 10 | Async new-style launch. Adds depobj info
144 | __tgt_target_data_begin_mapper | (9 params) | 9 | Map data to device
145 | __tgt_target_data_begin_nowait_mapper | (13 params) | 13 | Async map-to
146 | __tgt_target_data_begin_mapper_issue | (10 params) | 10 | Split-phase issue for async map-to
147 | __tgt_target_data_begin_mapper_wait | void(i64, __tgt_async_info*) | 2 | Split-phase wait for async map-to
148 | __tgt_target_data_end_mapper | (9 params) | 9 | Map data from device
149 | __tgt_target_data_end_nowait_mapper | (13 params) | 13 | Async map-from
150 | __tgt_target_data_update_mapper | (9 params) | 9 | Data update (host-to-device or device-to-host)
151 | __tgt_target_data_update_nowait_mapper | (13 params) | 13 | Async data update
152 | __tgt_mapper_num_components | i64(i8*) | 1 | Query user-defined mapper component count
153 | __tgt_push_mapper_component | void(i8*, i8*, i8*, i64, i64, i8*) | 6 | Register mapper component. Params: handle, base, begin, size, type, name

Task Completion Event (154)

Index | Function | Signature | Purpose
154 | __kmpc_task_allow_completion_event | i8*(ident_t*, i32, i8*) | Allow completion event for detached tasks (OMP 5.0)

GPU Kernel Lifecycle (155--158)

These are the most important entries for device-side GPU OpenMP code.

Index | Function | Signature | Purpose | Call Generation
155 | __kmpc_target_init | i32(KernelEnvironmentTy*, KernelLaunchEnvironmentTy*) | Kernel entry | First call in every GPU OpenMP kernel. State machine generator (sub_2678420) emits this at entry. KernelEnvironmentTy carries ConfigurationEnvironmentTy (first byte = execution mode)
156 | __kmpc_target_deinit | void() | Kernel exit | Last call in every GPU OpenMP kernel. Emitted by state machine generator
157 | __kmpc_kernel_prepare_parallel | void(i8*) | Generic: signal workers | Master thread writes outlined function pointer to shared memory, then signals workers to execute it. Replaced by __kmpc_parallel_51 after SPMD conversion
158 | __kmpc_parallel_51 | void(ident_t*, i32, i32, i32, i32, i8*, i8*, i8**, i64) | OMP 5.1 GPU parallel dispatch | 9 params: loc, gtid, if_expr, num_threads, proc_bind, fn, wrapper_fn, shared_args, num_shared_args. Used by parallel region outliner (sub_313D1B0) on SPMD kernels. Replaces fork_call for GPU

__kmpc_target_init is the first runtime call in every GPU OpenMP kernel. In Generic mode, it returns -1 for worker threads (which should enter the polling loop) and 0 for the master thread. In SPMD mode, it returns 0 for all threads. The KernelEnvironmentTy struct carries the ConfigurationEnvironmentTy which encodes the execution mode, team sizes, and runtime configuration.

New-Style Static Loops, OMP 5.1+ (159--170)

12 entries implementing the callback-based loop interface introduced in OpenMP 5.1:

Index | Function | Signature
159--162 | __kmpc_for_static_loop_{4,4u,8,8u} | void(ident_t*, i8*, i8*, {i32,i64}, {i32,i64}, {i32,i64})
163--166 | __kmpc_distribute_static_loop_{4,4u,8,8u} | void(ident_t*, i8*, i8*, {i32,i64}, {i32,i64})
167--170 | __kmpc_distribute_for_static_loop_{4,4u,8,8u} | void(ident_t*, i8*, i8*, {i32,i64}, {i32,i64}, {i32,i64}, {i32,i64})

Unlike the old-style _init/_fini pairs, these new-style loops take function pointer callbacks (i8* for the loop body and data pointer) and handle initialization + execution + finalization in a single call.

Legacy Kernel-Mode Parallel (171--174)

Index | Function | Signature | Purpose
171 | __kmpc_kernel_parallel | i1(i8**) | Generic mode: worker checks if parallel work available
172 | __kmpc_kernel_end_parallel | void() | Generic mode: worker signals completion
173 | __kmpc_serialized_parallel | void(ident_t*, i32) | Execute parallel region serially (if(0) parallel)
174 | __kmpc_end_serialized_parallel | void(ident_t*, i32) | End serialized parallel

These are the Generic-mode worker-side functions. __kmpc_kernel_parallel returns true when the master thread has dispatched work via __kmpc_kernel_prepare_parallel, writing the outlined function pointer into the output parameter.

Warp-Level Primitives (175, 179, 189--190)

Index | Function | Signature | Purpose
175 | __kmpc_shuffle_int32 | i32(i32, i16, i16) | Warp shuffle for 32-bit value
179 | __kmpc_shuffle_int64 | i64(i64, i16, i16) | Warp shuffle for 64-bit value
189 | __kmpc_warp_active_thread_mask | i64() | Active lane mask (PTX activemask)
190 | __kmpc_syncwarp | void(i64) | Warp-level barrier with mask

The shuffle functions take (value, lane_offset, warp_size) and implement butterfly-pattern data exchange for intra-warp reductions. These compile down to PTX shfl.sync instructions.

NVIDIA Device Reduction (176--178)

Index | Function | Signature | Purpose
176 | __kmpc_nvptx_parallel_reduce_nowait_v2 | i32(ident_t*, i64, i8*, ShuffleReductFctPtr, InterWarpCopyFctPtr) | Intra-CTA parallel reduction
177 | __kmpc_nvptx_teams_reduce_nowait_v2 | i32(ident_t*, i32, i8*, i64, i8*, ShuffleReductFctPtr, InterWarpCopyFctPtr, ListGlobalFctPtr, ListGlobalFctPtr, ListGlobalFctPtr, ListGlobalFctPtr) | Cross-CTA team reduction (11 params)
178 | __kmpc_reduction_get_fixed_buffer | i8*() | Get global reduction scratch buffer

These are the GPU-specific reduction entries -- the most performance-critical runtime calls for OpenMP on NVIDIA GPUs. The parallel reduction (index 176) uses a two-phase approach: (1) intra-warp reduction via shuffle, then (2) inter-warp reduction via a shared-memory copy. The compiler generates the ShuffleReductFctPtr and InterWarpCopyFctPtr callbacks as outlined helpers that the runtime invokes during the reduction tree.

The teams reduction (index 177) adds four ListGlobalFctPtr callbacks for managing global memory buffers across CTAs, plus an extra size parameter. This is the most complex runtime call in the entire table, with 11 parameters.

Shared Memory Management (180--184)

Index | Function | Signature | Purpose
180 | __kmpc_alloc_shared | i8*(i64) | Dynamic shared memory allocation
181 | __kmpc_free_shared | void(i8*, i64) | Free shared memory
182 | __kmpc_begin_sharing_variables | void(i8***, i64) | Begin variable sharing protocol
183 | __kmpc_end_sharing_variables | void() | End sharing protocol
184 | __kmpc_get_shared_variables | i8**() | Get shared variable array

__kmpc_alloc_shared / __kmpc_free_shared are heavily used in the SPMD transformation's guarded output mechanism: values computed by the master thread that are needed by all threads are stored into dynamically-allocated shared memory, synchronized via barrier, then loaded by all threads.

SPMD Mode Detection (185--188)

Index | Function | Signature | Purpose
185 | __kmpc_parallel_level | i16(ident_t*, i32) | Current parallel nesting depth
186 | __kmpc_is_spmd_exec_mode | i8() | Returns 1 if SPMD, 0 if Generic
187 | __kmpc_barrier_simple_spmd | void(ident_t*, i32) | Lightweight barrier for SPMD mode (bar.sync)
188 | __kmpc_barrier_simple_generic | void(ident_t*, i32) | State-machine barrier for Generic mode

The two barrier variants reflect the fundamental mode difference. __kmpc_barrier_simple_spmd compiles to a single bar.sync instruction. __kmpc_barrier_simple_generic involves polling a shared-memory flag because workers are in a state-machine loop that must check for new work after each barrier.

Profiling (191--192) and Sentinel (193)

Index | Function | Signature | Purpose
191 | __llvm_profile_register_function | void(i8*) | PGO: register function for profiling
192 | __llvm_profile_register_names_function | void(i8*, i64) | PGO: register name table
193 | __last | void() | Sentinel marking table end

The two __llvm_profile_* entries support profile-guided optimization instrumentation on GPU. The __last sentinel at index 193 is a void-to-void function that marks the end of the table; it is never called at runtime.

Declaration Construction Protocol

For each runtime function, sub_312CF50 follows an identical protocol:

// Pseudocode for a typical case (e.g., case 0: __kmpc_barrier)
case 0: {
    // 1. Build parameter type array from cached types
    Type *params[] = { ctx->ident_t_ptr, ctx->i32_ty };  // a1+2784, a1+2632

    // 2. Construct FunctionType
    FunctionType *fty = FunctionType::get(
        ctx->void_ty,   // return type (a1+2600)
        params, 2,       // param array + count
        /*isVarArg=*/false
    );

    // 3. Check if symbol already exists in module
    Value *existing = Module::getNamedValue("__kmpc_barrier");
    if (existing == a2)  // a2 is the existing-check value
        return existing;

    // 4. Create new function declaration
    Function *decl = Function::Create(
        fty,
        259,             // linkage = ExternalLinkage (0x103)
        "__kmpc_barrier",
        module
    );

    // 5. Register in context table
    registerRuntimeFunction(a1, /*index=*/0, decl);  // sub_3122A50

    return decl;
}

The linkage value 259 (0x103) decodes as ExternalLinkage with the DLLImport storage class flag set. This is consistent across all 194 entries.

For the two varargs entries (indices 7 and 118), the FunctionType::get call passes isVarArg=true, and after Function::Create, the code calls sub_B994D0 to add attribute #26 and sub_B91C10 to verify it was applied. Attribute #26 likely corresponds to a convergent-or-varargs marker that prevents the optimizer from incorrectly transforming these calls.

Comparison with Upstream LLVM OMPKinds.def

cicc's table maps one-to-one with the __OMP_RTL entries in LLVM 18.x's OMPKinds.def. The ordering is identical: the enum OMPRTL___kmpc_barrier = 0 corresponds to cicc's case 0, and so on through OMPRTL___last = 193 at case 193.

Key differences from upstream:

  1. Procedural vs declarative. Upstream uses X-macros (__OMP_RTL) expanded by OMPIRBuilder::initialize() to lazily create declarations on first use. cicc's sub_312CF50 is a compiled switch statement that eagerly creates declarations when requested by case index.

  2. Type representation. Upstream uses opaque pointer types (PointerType::get(Ctx, 0)) throughout. cicc preserves typed pointers (i8*, i32*, i64*, struct pointers) in its type cache, consistent with LLVM's pre-opaque-pointer era. This is because cicc's internal IR (NVVM IR) still uses typed pointers even though upstream LLVM has migrated to opaque pointers.

  3. Missing entries. cicc lacks __kmpc_push_num_threads_strict (present in latest upstream) and uses __kmpc_parallel_51 where upstream LLVM 18.x defines __kmpc_parallel_60 with a slightly different signature. The _51 name indicates cicc v13.0 targets the OMP 5.1 runtime ABI, not the OMP 6.0 draft.

  4. Attribute handling. Upstream OMPKinds.def includes extensive attribute sets (GetterAttrs, SetterAttrs, etc.) that annotate runtime functions with nounwind, nosync, nofree, willreturn, and memory effect attributes for optimization. cicc applies only attribute #26 to the two varargs functions and otherwise relies on the OpenMPOpt pass to infer attributes.

  5. The __tgt_interop_* entries (indices 132--134) in cicc take a slightly different parameter list than upstream: cicc includes an extra i32 parameter at the end that upstream encodes differently, reflecting a minor ABI divergence in the interop interface.

Configuration Knobs

All LLVM cl::opt knobs related to OpenMP optimization, as found in the cicc binary:

Knob | Type | Default | Effect
openmp-opt-disable | bool | false | Disable all OpenMP optimizations
openmp-opt-enable-merging | bool | false | Enable parallel region merging
openmp-opt-disable-internalization | bool | false | Skip function internalization
openmp-opt-disable-deglobalization | bool | false | Skip global-to-local promotion
openmp-opt-disable-spmdization | bool | false | Skip Generic-to-SPMD transformation
openmp-opt-disable-folding | bool | false | Skip ICV folding
openmp-opt-disable-state-machine-rewrite | bool | false | Skip state machine optimization
openmp-opt-disable-barrier-elimination | bool | false | Skip redundant barrier removal
openmp-opt-inline-device | bool | varies | Inline device runtime calls
openmp-opt-verbose-remarks | bool | false | Emit detailed optimization remarks
openmp-opt-max-iterations | int | varies | Fixed-point iteration limit for analysis
openmp-opt-shared-limit | int | varies | Max shared memory for SPMD output promotion
openmp-opt-print-module-after | bool | false | Dump module IR after OpenMP optimization
openmp-opt-print-module-before | bool | false | Dump module IR before OpenMP optimization
openmp-deduce-icv-values | bool | varies | Deduce Internal Control Variable values
openmp-print-icv-values | bool | false | Print deduced ICV values
openmp-print-gpu-kernels | bool | false | Print identified GPU kernels
openmp-hide-memory-transfer-latency | bool | false | Overlap data transfers with computation

The openmp-opt-shared-limit knob is particularly relevant for the SPMD transformation: it caps the total amount of shared memory allocated for guarded output promotion. If the serial sections between parallel regions produce too many live-out values, the SPMD transformation may be abandoned when the shared memory budget is exceeded.

Diagnostic Strings

The OpenMP subsystem emits two diagnostics during SPMD transformation:

Code | Severity | Message
OMP120 | Remark | "Transformed generic-mode kernel to SPMD-mode."
OMP121 | Warning | "Value has potential side effects preventing SPMD-mode execution. Add [[omp::assume(\"ompx_spmd_amenable\")]] to the called function to override"

OMP120 is emitted by sub_26968A0 on successful Generic-to-SPMD conversion. OMP121 is emitted for each call instruction that references a function not in the SPMD-amenable set, explaining why the transformation failed and providing the user with the override attribute.

Pipeline Integration

The OpenMP passes are registered in the pipeline under three names:

Pipeline ID | Pass Name | Level | Description
75 | openmp-opt | Module | Pre-link OpenMP optimization
76 | openmp-opt-postlink | Module | Post-link OpenMP optimization
154 | openmp-opt-cgscc | CGSCC | Call-graph-level OpenMP optimization

The runtime declaration table (sub_312CF50) is invoked lazily from any of these passes when they need to emit a runtime call. The SPMD transformation is part of the module-level openmp-opt pass.

Execution Mode Call Patterns

The execution mode fundamentally determines which runtime functions appear in generated IR. These pseudocode patterns show the exact call sequences emitted by the state machine generator (sub_2678420) and the SPMD transformation (sub_26968A0).

Generic Mode Kernel (mode byte = 1)

entry:
    ret = __kmpc_target_init(KernelEnv, LaunchEnv)   // [155]
    if (ret == -1) goto worker_loop                   // worker threads
    // master thread: user code
    __kmpc_kernel_prepare_parallel(outlined_fn_ptr)   // [157]
    __kmpc_barrier_simple_generic(loc, gtid)          // [188]
    // ... more serial + parallel sections ...
    __kmpc_target_deinit()                            // [156]
worker_loop:
    while (true) {
        __kmpc_barrier_simple_generic(loc, gtid)      // [188]
        if (__kmpc_kernel_parallel(&fn)) {            // [171]
            fn(args);
            __kmpc_kernel_end_parallel()              // [172]
        }
        __kmpc_barrier_simple_generic(loc, gtid)      // [188]
    }

SPMD Mode Kernel -- Simple (mode byte = 2, single parallel region)

After successful Generic-to-SPMD transformation:

entry:
    __kmpc_target_init(KernelEnv, LaunchEnv)          // [155], returns 0 for all
    tid = __kmpc_get_hardware_thread_id_in_block()    // [6]
    is_main = (tid == 0)
    br is_main, user_code, exit.threads
user_code:
    // all threads: user code
    __kmpc_parallel_51(loc, gtid, ...)                // [158], for nested
    __kmpc_barrier_simple_spmd(loc, gtid)             // [187]
exit.threads:
    __kmpc_target_deinit()                            // [156]

SPMD Mode Kernel -- Complex (guarded regions, multiple parallel regions)

entry:
    __kmpc_target_init(...)                           // [155]
region.check.tid:
    tid = __kmpc_get_hardware_thread_id_in_block()    // [6]
    cmp = icmp eq tid, 0
    br cmp, region.guarded, region.barrier
region.guarded:
    ... master-only serial code ...
    shared_ptr = __kmpc_alloc_shared(sizeof(result))  // [180]
    store result -> shared_ptr
region.guarded.end:
    br region.barrier
region.barrier:
    __kmpc_barrier_simple_spmd(loc, gtid)             // [187]
    result = load from shared_ptr
    __kmpc_barrier_simple_spmd(loc, gtid)             // [187], post-load
    __kmpc_free_shared(shared_ptr, size)              // [181]
    ... all threads continue with result ...
exit:
    __kmpc_target_deinit()                            // [156]

The SPMD transformation eliminates the worker state machine entirely. Workers no longer idle-spin in a polling loop; they participate in computation from the kernel's first instruction. Serial sections between parallel regions are wrapped in tid==0 guards with shared-memory output promotion and barriers.

SPMD-Amenable Function Table

The SPMD transformation maintains a hash set of functions that are safe to call from all threads simultaneously, located at *(omp_context + 208) + 34952 (base pointer), +34968 (capacity).

Property | Value
Hash function | Open-addressing with linear probing
Slot computation | ((addr >> 9) ^ (addr >> 4)) & (capacity - 1)
Sentinel | -4096 (empty slot marker)
Contents | Functions pre-analyzed or annotated with [[omp::assume("ompx_spmd_amenable")]]

When a call instruction references a function not in this set, the SPMD transformation fails for that kernel and emits OMP121: "Value has potential side effects preventing SPMD-mode execution. Add [[omp::assume(\"ompx_spmd_amenable\")]] to the called function to override".

Functional Category Summary

Category | Count | Indices
Thread hierarchy and hardware query | 20 | 0--6, 14--16, 17--45
Work sharing / loop scheduling | 48 | 61--95, 159--170
Tasking | 19 | 98--116, 154
Synchronization | 12 | 0, 2, 4, 50--52, 59--60, 96--97, 187--188, 190
Target offloading / data mapping | 17 | 137--153
GPU execution mode | 10 | 155--158, 171--174, 185--186
Warp primitives | 4 | 175, 179, 189--190
NVIDIA device reduction | 3 | 176--178
Shared memory management | 5 | 180--184
Memory allocators | 8 | 129--136
Copyprivate / threadprivate | 3 | 122--124
Doacross synchronization | 4 | 125--128
Teams / cancellation | 5 | 117--121
Master / masked | 4 | 46--49
Reduction (standard) | 4 | 55--58
Begin / end | 2 | 53--54
Profiling | 2 | 191--192
Sentinel | 1 | 193
Total | 194 |

Function Map

Function | Address | Size | Role
sub_312CF50 | 0x312CF50 | -- | OpenMP runtime declaration factory (194-case switch)
sub_3122A50 | 0x3122A50 | -- | registerRuntimeFunction(context, index, funcDecl)
sub_2686D90 | 0x2686D90 | 215 KB | OpenMP runtime declaration table (outer wrapper)
sub_26968A0 | 0x26968A0 | 61 KB | Generic-to-SPMD transformation
sub_2680940 | 0x2680940 | 52 KB | Parallel region merging
sub_2678420 | 0x2678420 | 41 KB | State machine generation for Generic mode
sub_269F530 | 0x269F530 | 63 KB | Attributor-based OpenMP optimization driver
sub_313D1B0 | 0x313D1B0 | 47 KB | Parallel region outliner
sub_BCF480 | 0xBCF480 | -- | FunctionType::get(retTy, paramTys, count, isVarArg)
sub_BA8CB0 | 0xBA8CB0 | -- | Module::getNamedValue(name)
sub_B2C660 | 0xB2C660 | -- | Function::Create(funcTy, linkage, name, module)
sub_B994D0 | 0xB994D0 | -- | addAttribute(26, value) -- set function attribute
sub_B91C10 | 0xB91C10 | -- | hasAttribute(26) -- check function attribute
sub_B9C770 | 0xB9C770 | -- | Attribute construction (varargs attribute)
sub_B8C960 | 0xB8C960 | -- | Attribute kind construction
sub_B2BE50 | 0xB2BE50 | -- | Function::getContext()
sub_921880 | 0x921880 | -- | Create runtime library call instruction
sub_5FB5C0 | 0x5FB5C0 | -- | OpenMP variant processing (%s$$OMP_VARIANT%06d)

OpenMP Variant Processing

cicc also supports OpenMP variant dispatch during EDG front-end processing. The function sub_5FB5C0 at 0x5FB5C0 handles mangled names with the format %s$$OMP_VARIANT%06d, which the front-end generates for #pragma omp declare variant constructs. This is separate from the runtime declaration table and operates at the source-level AST rather than at the LLVM IR level.

Cross-References

  • Generic-to-SPMD Transformation -- the primary consumer of the runtime table, performing mode conversion using entries 6, 155, 156, 180, 181, 187, 188
  • Pipeline & Ordering -- where openmp-opt (ID 75), openmp-opt-postlink (ID 76), and openmp-opt-cgscc (ID 154) sit in the pass pipeline
  • CLI Flags -- compiler flags that control OpenMP code generation
  • LLVM Knobs -- the openmp-opt-* knobs listed above
  • Kernel Metadata -- how KernelEnvironmentTy and execution mode are set during IR generation
  • Hash Infrastructure -- the open-addressing hash table pattern used by the SPMD-amenable function set
  • GPU Execution Model -- broader context on SPMD vs Generic execution

Generic-to-SPMD Transformation

The Generic-to-SPMD transformation (sub_26968A0, 61 KB, ~1807 lines) is cicc's most impactful OpenMP target optimization. It converts GPU kernels from Generic execution mode -- where thread 0 acts as a master running serial code through a state machine while all other threads idle at a barrier -- into SPMD mode, where every thread in the block executes the same code from the first instruction. The transformation eliminates the worker state machine loop entirely, removes warp divergence at kernel entry, replaces heavyweight generic barriers with lightweight SPMD barriers (__syncthreads), and enables the hardware scheduler to fill warps from the very first cycle. On real workloads this routinely yields 2-4x speedups for simple target parallel for regions. The pass emits diagnostic OMP120 on success and OMP121 when a callee's side effects prevent conversion.

Key Facts

Property | Value
Function address | sub_26968A0
Decompiled size | 61 KB (~1807 lines)
Pass registration | openmp-opt (pipeline slot 75, Module pass)
Post-link variant | openmp-opt-postlink (slot 76)
CGSCC variant | openmp-opt-cgscc (slot 154)
Parameters | a1 = PassState, a2 = ModuleContext, a3 = OutputFlag
Eligibility flag | *(a1+241) -- boolean, set by prior analysis
Parallel region array | *(a1+280) base, *(a1+288) count
Diagnostic handler | *(a2+4392)
Success diagnostic | OMP120: "Transformed generic-mode kernel to SPMD-mode."
Failure diagnostic | OMP121: "Value has potential side effects preventing SPMD-mode execution"

Generic vs SPMD Execution Model

Understanding the two execution modes is essential before examining the transformation.

Aspect | Generic Mode | SPMD Mode
Thread roles | Thread 0 = master; threads 1..N-1 = workers | All threads execute same code
Kernel entry | __kmpc_target_init returns -1 for workers, 0 for the master | __kmpc_target_init returns 0 for all threads
Serial code | Master executes directly | Wrapped in if (tid == 0) guard
Parallel region | Master signals workers via parallel_level; workers wake, execute outlined fn, re-barrier | All threads already executing; outlined fn body inlined
Barrier type | __kmpc_barrier_simple_generic (poll-based state machine) | __kmpc_barrier_simple_spmd (maps to bar.sync / __syncthreads)
Worker idle loop | while(true) { barrier(); if(parallel_level) { exec(); barrier(); } } | No idle loop -- eliminated entirely
Warp divergence | Warps containing thread 0 diverge at entry gate | No divergence at entry
Occupancy | Lower -- workers consume registers/shared mem while idle | Higher -- all resources used productively
Execution mode constant | 1 (OMP_TGT_EXEC_MODE_GENERIC) | 2 (OMP_TGT_EXEC_MODE_SPMD)
Transition marker | -- | 3 (OMP_TGT_EXEC_MODE_GENERIC_SPMD, intermediate during transform)

In Generic mode the runtime creates a CTA (Cooperative Thread Array) where only thread 0 enters user code. The remaining N-1 threads enter a polling loop: they call __kmpc_barrier_simple_generic, check the parallel_level variable, and if a parallel region has been entered by the master, they wake up, execute the outlined parallel function, then return to polling. This "state machine" pattern is the primary performance bottleneck -- it wastes cycles on barrier polling, causes massive warp divergence on the first warp (which contains both the master and worker lanes), and prevents the scheduler from issuing useful work for idle threads.

SPMD mode eliminates all of this. Every thread begins executing user code at kernel entry. Serial code sections that cannot be parallelized are protected by lightweight tid == 0 guards, with results broadcast to all threads through shared memory and bar.sync barriers.

Legality Analysis

The transformation is gated by a boolean eligibility flag at *(a1+241), which is computed by a prior analysis pass (not sub_26968A0 itself). The analysis determines eligibility based on three conditions:

Condition 1: Kernel is Currently in Generic Mode

The execution mode bit-vector's low byte must equal 1 (Generic). This is checked at line 429 of the decompiled output:

// sub_2674090/sub_2674040 read the execution mode attribute
mode_bv = get_exec_mode(a1 + 304);
if (mode_bv.size <= 64)
    mode_val = mode_bv.inline_data;
else
    mode_val = *mode_bv.data_ptr;

if ((uint8_t)mode_val != 1)  // Not Generic mode
    return;

Condition 2: All Callees are SPMD-Amenable

Every call instruction reachable from the kernel's parallel regions must reference a function in the SPMD-amenable function set. This set lives at *(a2+208) + 34952 (base pointer) with capacity at offset +34968.

// SPMD-amenable lookup (open-addressing hash set)
bool is_spmd_amenable(void *func_ptr, void *table_base, uint64_t capacity) {
    uint64_t hash = ((uintptr_t)func_ptr >> 9) ^ ((uintptr_t)func_ptr >> 4);
    uint64_t slot = hash & (capacity - 1);
    while (true) {
        void *entry = table_base[slot];
        if (entry == func_ptr) return true;
        if (entry == (void*)-4096) return false;  // empty sentinel
        slot = (slot + 1) & (capacity - 1);       // linear probe
    }
}

Functions are pre-populated in this set if they have been analyzed as side-effect free (from the caller's perspective in SPMD context), or if the programmer annotated them with [[omp::assume("ompx_spmd_amenable")]]. When a callee fails this check, the pass takes Path A (non-SPMD candidate path, lines 1692-1806) and emits OMP121 for each offending call:

warning: Value has potential side effects preventing SPMD-mode execution.
         Add `[[omp::assume("ompx_spmd_amenable")]]` to the called function
         to override [OMP121]

The diagnostic is constructed via sub_B178C0 (warning constructor), message appended via sub_B18290, and emitted through sub_1049740 to the handler at *(a2+4392).

Condition 3: No Unresolvable Side Effects

The kernel must not contain operations that are inherently unsafe when executed by multiple threads simultaneously -- for example, I/O operations with ordering requirements, or accesses to thread-local storage that assumes single-thread access.

Legality Pseudocode

function is_spmd_eligible(kernel, module_ctx):
    // Check current execution mode
    mode = read_exec_mode(kernel.attributes)
    if mode != GENERIC:
        return false

    // Scan all parallel regions
    for region in kernel.parallel_regions:
        for inst in region.instructions:
            if is_call_like(inst):  // opcode 34, 52, or 86
                callee = get_callee(inst)
                if callee.is_declaration:
                    if callee not in module_ctx.spmd_amenable_set:
                        emit_diagnostic(OMP121, inst.location,
                            "Value has potential side effects...")
                        return false

    return true

The call-like instruction detection uses a bitmask test: (opcode - 34) <= 0x33 followed by bittest(0x8000000000041, opcode - 34). The mask constant has bits 0, 6, and 51 set, so it matches opcodes 34, 40, and 85 -- the three LLVM call-family instructions (call, callbr, and invoke in this binary's internal opcode numbering; the Phase 2 scan below likewise treats opcode 85 as invoke).

Transformation Algorithm

Once eligibility is confirmed, sub_26968A0 takes Path B (lines 407-1691). The path splits based on kernel complexity:

Simple Case: Single Parallel Region

When *(a1+160) == 0 and *(a1+224) == 0, the kernel has a single parallel region with no intervening serial code. This is the fast path (lines 432-672).

function transform_simple_spmd(kernel, module_ctx):
    entry_bb = get_entry_block(kernel)
    func_scope = get_function_scope(kernel)
    thread_config = get_thread_configuration(kernel, module_ctx)

    // 1. Create new basic blocks
    user_code_bb = create_region("main.thread.user_code")
    exit_bb = create_exit_block("exit.threads")
    register_in_worklist(user_code_bb)
    register_in_worklist(exit_bb)

    // 2. Insert thread-id check at entry
    tid = call __kmpc_get_hardware_thread_id_in_block()  // runtime call ID 6
    is_main = icmp eq tid, 0
    br is_main, user_code_bb, exit_bb

    // 3. Move original parallel body into user_code_bb
    //    (all threads execute this -- the parallel outlined fn
    //     is effectively inlined into the kernel)

    // 4. Update execution mode: Generic(1) -> SPMD(2)
    //    Intermediate: set mode 3 (GENERIC_SPMD) then overwrite to 2
    bv_entry = create_bitvector_entry(*(kernel+304+8), 3, 0)
    current = read_attribute(*(kernel+304))
    *(kernel+304) = insert_attribute(current, bv_entry, key=0, value=1)

    // 5. Emit success diagnostic
    if diagnostic_handler_registered(module_ctx+4392):
        emit_remark(OMP120, "Transformed generic-mode kernel to SPMD-mode.")

The resulting CFG is straightforward:

entry:
    %tid = call i32 @__kmpc_get_hardware_thread_id_in_block()
    %is_main = icmp eq i32 %tid, 0
    br i1 %is_main, label %user_code, label %exit.threads

user_code:                         ; all threads execute
    ... original parallel body ...
    br label %exit.threads

exit.threads:
    ret void

Complex Case: Multiple Parallel Regions

When the kernel contains multiple parallel regions with serial code between them, the pass executes a four-phase transformation (lines 720-1676).

Phase 1: Deduplicate Parallel Regions (lines 720-760)

Multiple parallel regions may call the same outlined function. The pass deduplicates by function pointer using an inline hash set:

function dedup_regions(parallel_regions):
    seen = HashSet()  // inline small-buffer optimization
    unique = []
    for region in parallel_regions:
        fn_ptr = region.outlined_function  // offset+40
        if fn_ptr not in seen:
            seen.insert(fn_ptr)
            unique.append(region)
    return unique
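A minimal C sketch of this pointer-dedup step. A linear scan stands in for the binary's inline hash set, and all names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Deduplicate parallel regions by outlined-function pointer,
 * preserving first-seen order. Unique pointers are compacted to the
 * front of the array; the return value is the unique count. */
size_t dedup_regions(const void **fn_ptrs, size_t n) {
    size_t unique = 0;
    for (size_t i = 0; i < n; i++) {
        int seen = 0;
        for (size_t j = 0; j < unique; j++) {
            if (fn_ptrs[j] == fn_ptrs[i]) { seen = 1; break; }
        }
        if (!seen)
            fn_ptrs[unique++] = fn_ptrs[i];
    }
    return unique;
}
```

The real pass keys on the outlined-function pointer at offset +40 of each region descriptor; deduplicating by pointer identity is sufficient because each outlined function has exactly one address in the module.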

Phase 2: Identify Non-SPMD-Safe Instructions (lines 768-873)

For each parallel region, the pass walks the CFG successor chain and identifies instructions with side effects that are not SPMD-compatible:

function find_guarded_ranges(region, module_ctx):
    ranges = []
    first_unsafe = null
    last_unsafe = null

    for inst in walk_cfg_successors(region):
        if is_side_effecting_call(inst):
            // Skip known-safe calls (global dtors at module_ctx+208+32432)
            if inst.callee == module_ctx.global_dtor_fn:
                continue
            // For invoke instructions: check if exception handler count is 0
            if inst.opcode == 85:  // invoke
                if get_eh_handler_count(inst) == 0:
                    continue  // can be simplified
            if first_unsafe == null:
                first_unsafe = inst
            last_unsafe = inst
        else:
            if first_unsafe != null:
                ranges.append((first_unsafe, last_unsafe))
                first_unsafe = null
                last_unsafe = null

    if first_unsafe != null:
        ranges.append((first_unsafe, last_unsafe))

    return ranges
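The range-coalescing logic above reduces to grouping maximal runs of unsafe instructions. A self-contained C sketch over a flag array (indices stand in for instruction pointers; this is an illustration, not recovered code):

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t first, last; } Range;

/* Coalesce consecutive non-SPMD-safe instructions (unsafe[i] != 0)
 * into maximal (first, last) ranges, mirroring find_guarded_ranges.
 * Returns the number of ranges written to out. */
size_t find_ranges(const int *unsafe, size_t n, Range *out) {
    size_t count = 0, first = 0;
    int in_run = 0;
    for (size_t i = 0; i < n; i++) {
        if (unsafe[i]) {
            if (!in_run) { first = i; in_run = 1; }
        } else if (in_run) {
            out[count].first = first;
            out[count].last = i - 1;
            count++;
            in_run = 0;
        }
    }
    if (in_run) {               /* flush a trailing run, like the     */
        out[count].first = first; /* final first_unsafe check above   */
        out[count].last = n - 1;
        count++;
    }
    return count;
}
```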

The pass then calls sub_B444E0 to insert guard instructions at each range boundary.

Phase 3: Build Guarded Region Descriptors (lines 876-1059)

Each parallel region is looked up in the function-to-region-tracker hash map at *(a2+144). This map uses a splitmix64-variant hash:

uint64_t hash_function_key(uint64_t name_hash, uint64_t addr_hash) {
    uint64_t raw = name_hash ^ (16 * addr_hash);
    uint64_t h = raw * 0xBF58476D1CE4E5B9ULL;
    h = (h >> 31) ^ (h * 0x1CE4E5B9ULL);
    return h;
}

The map stores 24-byte keys (module pointer, name pointer, auxiliary pointer) with a sentinel key of (-4096, qword_4FEE4D0, qword_4FEE4D8). Each entry's value (at +24) points to a guarded region tracker structure:

| Offset | Type | Description |
|--------|------|-------------|
| +472 | i32 | Work counter |
| +480 | ptr | Block pointer array base |
| +488 | i64 | Capacity |
| +492 | i32 | Current size |
| +500 | i8 | Initialized flag |

Phase 4: Split and Rewire CFG (lines 1060-1670)

For each (first_instr, last_instr) pair identified in Phase 2, the pass creates five new basic blocks and rewires the CFG:

function create_guarded_region(first_instr, last_instr, module_ctx):
    parent_bb = first_instr.parent

    // 1. Split into 5 blocks
    guarded_end_bb = split_block(parent_bb, after=last_instr, name="region.guarded.end")
    barrier_bb    = split_block(guarded_end_bb, at_start, name="region.barrier")
    exit_bb       = split_block(barrier_bb, at_start, name="region.exit")
    guarded_bb    = split_block(parent_bb, at=first_instr, name="region.guarded")
    check_tid_bb  = split_block(parent_bb, at=terminator, name="region.check.tid")

    // 2. Register all blocks in worklist
    for bb in [guarded_end_bb, barrier_bb, exit_bb, guarded_bb, check_tid_bb]:
        register_in_worklist(bb)

    // 3. Handle escaping values (shared memory promotion)
    has_broadcast = false
    for inst in guarded_bb:
        outside_uses = [u for u in inst.uses if u.parent != guarded_bb]
        if outside_uses:
            has_broadcast = true

            // Allocate shared memory for output
            alloc = create_alloca(
                type = inst.type,
                address_space = 7,  // shared memory
                name = sanitize(inst.name) + ".guarded.output.alloc"
            )

            // Store result from master thread (inside guarded block)
            create_store(inst, alloc, insert_in=guarded_bb)

            // Load from all threads (after barrier)
            load = create_load(
                type = inst.type,
                ptr = alloc,
                name = sanitize(inst.name) + ".guarded.output.load",
                insert_in = barrier_successor
            )

            // Rewrite all outside uses
            replace_all_uses_outside(inst, load, guarded_bb)

    // 4. Insert thread-id check
    tid = call __kmpc_get_hardware_thread_id_in_block()  // call ID 6
    cmp = icmp eq tid, 0
    br cmp, guarded_bb, barrier_bb

    // 5. Insert SPMD barrier
    call __kmpc_barrier_simple_spmd(ident, tid)  // call ID 187

    // 6. If broadcast values exist, insert second barrier after loads
    if has_broadcast:
        call __kmpc_barrier_simple_spmd(ident, tid)  // ensures loads complete

The resulting CFG for a complex kernel with serial code between two parallel regions:

entry:
    ...

region.check.tid:
    %tid = call i32 @__kmpc_get_hardware_thread_id_in_block()
    %cmp = icmp eq i32 %tid, 0
    br i1 %cmp, label %region.guarded, label %region.barrier

region.guarded:                    ; master thread only
    ... serial code ...
    store %result, %shared_mem     ; broadcast output
    br label %region.guarded.end

region.guarded.end:
    br label %region.barrier

region.barrier:
    call void @__kmpc_barrier_simple_spmd(%ident, %tid)
    %result = load %shared_mem     ; all threads read
    call void @__kmpc_barrier_simple_spmd(%ident, %tid)  ; if broadcast
    br label %region.exit

region.exit:
    ... next parallel region (all threads) ...

Name Sanitization

Output variable names are sanitized for use as global symbol names. Any character that is not alphanumeric or an underscore is replaced with a dot (.):

// Identical logic in both cicc and upstream LLVM
char sanitize_char(char c) {
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9') || c == '_')
        return c;
    return '.';
}
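Applied over a full value name, the per-character rule yields a valid symbol suffix. The driver loop here is an illustrative addition around the recovered sanitize_char:

```c
#include <assert.h>
#include <string.h>

/* Per-character rule recovered from the binary (identical to upstream). */
static char sanitize_char(char c) {
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9') || c == '_')
        return c;
    return '.';
}

/* Sanitize a whole name in place before appending a suffix such as
 * ".guarded.output.alloc". */
static void sanitize_name(char *name) {
    for (; *name; name++)
        *name = sanitize_char(*name);
}
```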

Shared Memory Output Promotion

When a value computed inside a guarded region (master-only code) is needed by all threads after the barrier, the pass promotes it through shared memory. This is the cicc implementation of what upstream LLVM calls "broadcast values." The sequence is:

  1. Allocate: sub_B30000 creates an address-space-7 (shared/local) allocation with suffix .guarded.output.alloc. The allocation node is 80 bytes, subtype 7.

  2. Store: sub_B4D460 emits a store from the master thread's computed value into shared memory. Placed inside the guarded block, before the branch to region.guarded.end.

  3. First barrier: __kmpc_barrier_simple_spmd (runtime call ID 187) ensures the store is globally visible to all threads in the CTA.

  4. Load: sub_B4D230 emits a load from shared memory with suffix .guarded.output.load. Placed in the barrier successor block so all threads read the broadcast value.

  5. Second barrier: If broadcast values exist, a second __kmpc_barrier_simple_spmd call ensures all threads have completed their loads before the shared memory is potentially reused.

  6. Use rewriting: sub_256E5A0 replaces every use of the original value outside the guarded block with the loaded value.
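The net effect of steps 1 through 6 corresponds to the familiar hand-written CUDA broadcast idiom. This is an illustration of the semantics only, not compiler output; compute_serial is a hypothetical stand-in for the guarded serial code:

```cuda
__device__ int compute_serial(void);

__global__ void guarded_region_pattern() {
    __shared__ int result;            // step 1: shared-memory slot
    if (threadIdx.x == 0)
        result = compute_serial();    // step 2: master-only store
    __syncthreads();                  // step 3: store visible to the CTA
    int v = result;                   // step 4: all threads load
    __syncthreads();                  // step 5: loads finish before reuse
    (void)v;                          // step 6: downstream uses read v
}
```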

State Machine Elimination

The state machine elimination is the core performance win of the SPMD transformation. Understanding the state machine that gets eliminated -- and its fallback generator -- is essential for reimplementation.

Generic-Mode Worker State Machine (What Gets Eliminated)

In Generic mode, __kmpc_target_init (runtime call ID 155) returns -1 for all threads except thread 0 (the master). The kernel entry code branches on this return value: thread 0 falls through to user code, while threads 1..N-1 jump to the worker state machine loop. This loop is the performance bottleneck that the SPMD transformation eliminates.

The complete Generic-mode kernel structure, as generated by the runtime and optionally customized by sub_2678420:

// Generic mode kernel entry (before SPMD transformation)
void __omp_offloading_kernel(KernelEnvironmentTy *env, KernelLaunchEnvironmentTy *launch_env) {
    int ret = __kmpc_target_init(env, launch_env);  // [155]
    if (ret == -1)
        goto worker_state_machine;

    // === MASTER THREAD (thread 0) ===
    // User code: serial sections + parallel dispatch
    ...
    __kmpc_kernel_prepare_parallel(outlined_fn_ptr);  // [157] signal workers
    __kmpc_barrier_simple_generic(loc, gtid);         // [188] wake workers
    // ... workers execute outlined_fn ...
    __kmpc_barrier_simple_generic(loc, gtid);         // [188] wait for workers
    // ... more serial code ...
    __kmpc_target_deinit();                           // [156]
    return;

worker_state_machine:
    // === WORKER THREADS (threads 1..N-1) ===
    // sub_2678420 generates this structure with these exact labels:
    worker_state_machine.begin:
        __kmpc_barrier_simple_generic(loc, gtid);     // [188] poll barrier
    .is_active.check:
        bool active = __kmpc_kernel_parallel(&fn);    // [171] check for work
        if (!active)
            goto .done.barrier;
    .parallel_region.check:
        if (fn == known_outlined_fn_1)
            goto .parallel_region.execute;
        // ... more checks for known outlined functions ...
        goto .fallback.execute;
    .parallel_region.execute:
        known_outlined_fn_1(args);                    // direct call (devirtualized)
        goto .done.barrier;
    .fallback.execute:
        fn(args);                                     // indirect call (generic)
    .done.barrier:
        __kmpc_kernel_end_parallel();                 // [172] signal completion
        __kmpc_barrier_simple_generic(loc, gtid);     // [188] sync barrier
        goto worker_state_machine.begin;
    .finished:
        return;
}

The state machine consumes five runtime calls per parallel-region invocation per worker thread: two __kmpc_barrier_simple_generic (ID 188) for poll/sync barriers, one __kmpc_kernel_parallel (ID 171) to check for dispatched work, one indirect or direct call to the outlined function, and one __kmpc_kernel_end_parallel (ID 172) to signal completion. Each __kmpc_barrier_simple_generic call compiles to a poll loop on a shared-memory flag -- not a hardware bar.sync -- because the generic barrier must handle the asymmetric wakeup protocol where the master thread signals workers through __kmpc_kernel_prepare_parallel.

Worker State Machine Generator: sub_2678420 (41 KB)

When the SPMD transformation fails (eligibility flag *(a1+241) == 0), cicc falls back to sub_2678420, which builds a customized state machine that is more efficient than the default runtime state machine. The customization replaces the indirect fn(args) call in .fallback.execute with a direct-call dispatch table when the set of outlined parallel functions is statically known.

| Property | Value |
|----------|-------|
| Function address | sub_2678420 |
| Decompiled size | 41 KB |
| Basic block labels | worker_state_machine.begin, .is_active.check, .parallel_region.check, .parallel_region.execute, .fallback.execute, .done.barrier, .finished |
| Diagnostics | OMP130, OMP131, OMP132, OMP133 |

The generator has two modes:

Mode 1: Remove unused state machine (OMP130). When the kernel has zero parallel regions (e.g., a #pragma omp target with no nested parallel), the state machine is dead code. sub_2678420 removes the entire worker loop and emits: "Removing unused state machine from generic-mode kernel." (OMP130).

Mode 2: Rewrite with customized dispatch (OMP131). When the kernel has N known parallel regions, the generator builds a switch/cascade of direct-call comparisons in .parallel_region.check and .parallel_region.execute, avoiding the overhead of indirect calls through __kmpc_kernel_parallel's function pointer. It emits: "Rewriting generic-mode kernel with a customized state machine." (OMP131).

// Customized state machine pseudocode (sub_2678420 output)
function build_custom_state_machine(kernel, parallel_regions):
    // Create the 7 basic blocks with the labels above
    begin_bb   = create_block("worker_state_machine.begin")
    active_bb  = create_block(".is_active.check")
    check_bb   = create_block(".parallel_region.check")
    exec_bb    = create_block(".parallel_region.execute")
    fallback_bb = create_block(".fallback.execute")
    barrier_bb = create_block(".done.barrier")
    finished_bb = create_block(".finished")

    // Entry: poll barrier
    in begin_bb:
        call __kmpc_barrier_simple_generic(loc, gtid)  // [188]
        br .is_active.check

    // Check if master dispatched work
    in active_bb:
        %active = call i1 @__kmpc_kernel_parallel(&fn)  // [171]
        br %active, .parallel_region.check, .done.barrier

    // Devirtualized dispatch: compare fn pointer against known functions
    in check_bb:
        for i, region in enumerate(parallel_regions):
            %cmp = icmp eq fn, @outlined_fn_i
            br %cmp, .parallel_region.execute.i, next_check
        br .fallback.execute  // no match -- use indirect call

    // Direct call to known function (avoids indirect branch penalty)
    in exec_bb:
        for each matched region:
            call @outlined_fn_i(args)
            br .done.barrier

    // Fallback: indirect call (should be unreachable if analysis is complete)
    in fallback_bb:
        call fn(args)  // indirect
        br .done.barrier

    // End parallel + sync barrier
    in barrier_bb:
        call __kmpc_kernel_end_parallel()  // [172]
        call __kmpc_barrier_simple_generic(loc, gtid)  // [188]
        br worker_state_machine.begin

    // Optional: exit (reached via __kmpc_target_deinit signaling)
    in finished_bb:
        ret void
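The devirtualized dispatch cascade in .parallel_region.check / .parallel_region.execute can be sketched in plain C, with function pointers standing in for the outlined kernels (all names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>

typedef void (*outlined_fn)(void *);

static int ran_a, ran_b, ran_c, fallback_count;

/* Stand-ins for outlined parallel-region functions. */
static void region_a(void *args) { (void)args; ran_a = 1; }
static void region_b(void *args) { (void)args; ran_b = 1; }
static void region_c(void *args) { (void)args; ran_c = 1; }

/* Compare the dispatched pointer against the statically known
 * outlined functions; call directly on a match, otherwise take the
 * indirect-call fallback (.fallback.execute). */
static void dispatch(outlined_fn fn, void *args,
                     const outlined_fn *known, int n_known) {
    for (int i = 0; i < n_known; i++) {
        if (fn == known[i]) {
            known[i](args);          /* direct (devirtualized) call */
            return;
        }
    }
    fallback_count++;
    fn(args);                        /* indirect fallback */
}
```

On real hardware the direct calls avoid the indirect-branch penalty and let later inlining eliminate the call entirely.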

The runtime calls consumed by sub_2678420:

| Call ID | Function | Role in State Machine |
|---------|----------|-----------------------|
| 155 | __kmpc_target_init | Kernel entry; returns -1 for workers |
| 156 | __kmpc_target_deinit | Kernel exit cleanup |
| 157 | __kmpc_kernel_prepare_parallel | Master signals workers with outlined fn pointer |
| 171 | __kmpc_kernel_parallel | Worker checks if work is dispatched; returns fn ptr |
| 172 | __kmpc_kernel_end_parallel | Worker signals completion of parallel region |
| 188 | __kmpc_barrier_simple_generic | Poll-based barrier (shared-memory flag loop) |

SPMD Amenability Analysis Pipeline

The eligibility flag at *(a1+241) -- which gates whether sub_26968A0 attempts the SPMD transformation -- is computed by the Attributor-based OpenMP optimization driver at sub_269F530 (63 KB). This driver orchestrates interprocedural fixed-point analysis using the standard LLVM Attributor framework.

The analysis pipeline:

sub_269F530 (OpenMP Attributor Driver, 63 KB)
  |
  +-- sub_251BBC0 (AbstractAttribute infrastructure)
  |     Creates abstract attributes for each kernel, including
  |     the SPMD-compatibility tracker that will become a1+241.
  |
  +-- sub_251CD10 (Attributor::runTillFixpoint, 53 KB)
  |     Iterates up to openmp-opt-max-iterations (default: 256)
  |     times, updating abstract attribute states until convergence.
  |
  +-- sub_26747F0 (OpenMP kernel info collector)
        Populates the PassState structure (a1) with:
          a1+72:   function handle
          a1+160:  serial-code-present flag
          a1+224:  multiple-region flag
          a1+241:  SPMD-eligible boolean  <-- the gate
          a1+280:  parallel region array base
          a1+288:  parallel region count
          a1+304:  execution mode attribute map

The fixed-point analysis in sub_251CD10 converges by iterating over all abstract attributes until none change state. For SPMD eligibility, the key attribute tracks three conditions that must all hold:

  1. Execution mode is Generic (mode byte == 1). Read via sub_2674090/sub_2674040 from the kernel's attribute map at *(a1+304). If the kernel is already SPMD or Bare, no transformation is needed.

  2. All reachable callees are SPMD-amenable. The analysis walks every call/invoke/callbr instruction in every parallel region of the kernel. Each callee is looked up in the SPMD-amenable function set at *(a2+208)+34952. This set is populated by two sources:

    • Automatic population: When sub_312CF50 (the 194-case runtime declaration factory) creates a runtime function declaration, that function is automatically added to the set if it is known to be thread-safe (most __kmpc_* functions, all omp_* query functions).
    • User annotation: Functions declared with [[omp::assume("ompx_spmd_amenable")]] are inserted into the set by the attribute parser.

    The set uses the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192); see Hash Table and Collection Infrastructure. If any callee fails the lookup, the analysis sets *(a1+241) = 0 and the transformation will emit OMP121 diagnostics instead.

  3. No unresolvable side effects. Operations that are inherently unsafe when executed by all threads simultaneously -- such as I/O with ordering requirements, thread-local storage accesses assuming single-thread semantics, or calls to external functions with unknown side-effect profiles -- prevent SPMDization.

The Attributor driver at sub_269F530 also feeds into sub_2678420 (state machine generator) for kernels that fail SPMD eligibility, and into sub_2680940 (parallel region merging) for kernels that pass. The decision tree:

sub_269F530 analysis complete
  |
  +-- a1+241 == 1 (SPMD-eligible)
  |     |
  |     +-- a1+160 == 0 && a1+224 == 0 --> sub_26968A0 simple path
  |     +-- otherwise                   --> sub_26968A0 complex path
  |
  +-- a1+241 == 0 (not SPMD-eligible)
        |
        +-- has parallel regions --> sub_2678420 (custom state machine)
        +-- no parallel regions  --> sub_2678420 (remove dead state machine)

How the SPMD Transform Eliminates the State Machine

The actual elimination happens in sub_26968A0 and proceeds differently for simple vs. complex kernels, but the core mechanism is the same: replace the asymmetric master/worker execution model with symmetric all-thread execution.

Step 1: Remove the __kmpc_target_init return-value gate. In Generic mode, __kmpc_target_init returns -1 for workers and the kernel branches workers to the state machine loop. In SPMD mode, the return value is not used as a gate -- all threads fall through to user code. The transformation does not literally delete the __kmpc_target_init call (it is still needed for runtime initialization), but changes the execution mode attribute so the runtime initializes all threads as active.

Step 2: Eliminate the worker loop entirely. The basic blocks worker_state_machine.begin, .is_active.check, .parallel_region.check, .parallel_region.execute, .fallback.execute, .done.barrier, and .finished become dead code once the execution mode flips to SPMD. They are not explicitly deleted by sub_26968A0; instead, setting mode=2 in the KernelEnvironmentTy means the runtime never creates the worker branch, so the dead blocks are eliminated by subsequent DCE passes.

Step 3: Replace barrier primitives. Every __kmpc_barrier_simple_generic (ID 188) in the kernel is replaced with __kmpc_barrier_simple_spmd (ID 187). The difference:

  • Generic barrier (ID 188): poll-based. Workers spin-check a shared-memory flag. The master writes the flag, then workers read it. This involves memory fences, cache-line bouncing, and potential bank conflicts. Compiles to a ld.volatile.shared + branch loop.
  • SPMD barrier (ID 187): hardware-based. Maps directly to PTX bar.sync / CUDA __syncthreads(). Single instruction, handled by the warp scheduler with zero polling overhead.

Step 4: Guard serial code. For the simple case (single parallel region), this is just:

%tid = call i32 @__kmpc_get_hardware_thread_id_in_block()  ; [6]
%is_main = icmp eq i32 %tid, 0
br i1 %is_main, label %user_code, label %exit.threads

For the complex case (multiple parallel regions with serial gaps), the 5-block guarded region structure is created for each serial section, with shared-memory output promotion and double-barrier synchronization as described in Phase 4 above.

Step 5: Update execution mode. The kernel attribute is rewritten from Generic (1) to SPMD (2) via the intermediate GENERIC_SPMD (3) marker. This is the final, irreversible step. Once the mode is set, __kmpc_target_init at runtime will launch all threads into user code instead of routing N-1 threads to a state machine.

Performance Impact of Elimination

The state machine elimination saves:

| Source of overhead | Generic mode | SPMD mode | Savings |
|---|---|---|---|
| Worker idle polling | N-1 threads spin in __kmpc_barrier_simple_generic | No idle threads | 100% of idle cycles |
| Barrier latency | Poll-based shared-memory loop (10s-100s of cycles) | Hardware bar.sync (single-cycle dispatch) | ~10-100x per barrier |
| Warp divergence at entry | Warp 0 diverges (thread 0 = master, threads 1-31 = workers) | No divergence | 1 warp fully utilized |
| Indirect calls | __kmpc_kernel_parallel returns fn ptr for indirect dispatch | No indirect calls; outlined fn body inlined/direct | Branch predictor pressure eliminated |
| Register pressure | Workers hold state machine registers while idle | No state machine registers | Improved occupancy |
| Shared memory | Generic barriers use shared-memory flags | Only guarded-output allocations use shared memory | Reduced shared memory pressure |

On a typical #pragma omp target parallel for kernel, the SPMD transformation eliminates 5 runtime calls per parallel-region per worker-thread per iteration of the state machine loop. For a 256-thread CTA with one parallel region, that is 255 threads x 5 calls = 1,275 eliminated runtime calls per kernel invocation.
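That back-of-envelope count generalizes to a simple formula; the five-calls-per-dispatch figure is taken from the state-machine walkthrough above:

```c
#include <assert.h>

/* Runtime calls eliminated per kernel invocation:
 * (cta_threads - 1) worker threads, 5 runtime calls per
 * parallel-region dispatch, per state-machine iteration. */
static long eliminated_calls(long cta_threads, long regions, long iters) {
    return (cta_threads - 1) * 5 * regions * iters;
}
```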

Execution Mode Update

When the transformation succeeds, the kernel's execution mode attribute is updated from Generic (1) to SPMD (2). The update goes through an intermediate GENERIC_SPMD (3) state:

// At LABEL_227 (shared success path)
bv_entry = sub_ACD640(*(a1+304+8), /*mode=*/3, /*aux=*/0);  // create mode-3 entry
current  = sub_2673FD0(*(a1+304));                           // read current attrs
*(a1+304) = sub_AAAE30(current, bv_entry, {key=0}, 1);      // write SPMD mode

The execution mode encoding matches upstream LLVM's OMPTgtExecModeFlags:

| Value | Name | Meaning |
|---|---|---|
| 0 | OMP_TGT_EXEC_MODE_BARE | Bare mode (no runtime) |
| 1 | OMP_TGT_EXEC_MODE_GENERIC | Generic (state machine) |
| 2 | OMP_TGT_EXEC_MODE_SPMD | SPMD (all threads active) |
| 3 | OMP_TGT_EXEC_MODE_GENERIC_SPMD | Intermediate marker: Generic kernel transformed to run in SPMD mode |

The mode is stored in the KernelEnvironmentTy global variable that __kmpc_target_init reads at kernel launch. Setting it to SPMD tells the runtime to skip the state machine setup and launch all threads directly into user code.
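The mode flags and the environment structure can be sketched as follows. The flag values match upstream LLVM's OMPTgtExecModeFlags (GENERIC_SPMD is the bitwise OR of the two base modes); the struct layout follows upstream's DeviceRTL headers and is an assumption for cicc, not recovered from the binary:

```c
#include <assert.h>
#include <stdint.h>

/* Execution-mode flags matching upstream OMPTgtExecModeFlags. */
enum {
    OMP_TGT_EXEC_MODE_BARE         = 0,
    OMP_TGT_EXEC_MODE_GENERIC      = 1 << 0,
    OMP_TGT_EXEC_MODE_SPMD         = 1 << 1,
    OMP_TGT_EXEC_MODE_GENERIC_SPMD =
        OMP_TGT_EXEC_MODE_GENERIC | OMP_TGT_EXEC_MODE_SPMD,
};

/* Assumed shape of the per-kernel environment read by
 * __kmpc_target_init at launch (field set follows upstream DeviceRTL). */
struct ConfigurationEnvironment {
    uint8_t use_generic_state_machine;  /* 0 once SPMDized */
    uint8_t exec_mode;                  /* one of the flags above */
};
struct KernelEnvironment {
    struct ConfigurationEnvironment config;
    void *ident;                        /* source-location ident_t* */
};
```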

Limitations: What Prevents SPMDization

The following constructs cause the pass to emit OMP121 and fall back to Generic mode:

  • Calls to non-SPMD-amenable functions: Any callee not in the SPMD-amenable set blocks transformation. The user override is [[omp::assume("ompx_spmd_amenable")]].
  • Nested parallelism: Kernels with nested #pragma omp parallel regions inside a target region cannot be SPMDized because the worker threads are already participating.
  • Tasking constructs: #pragma omp task, taskloop, and taskgroup create runtime-managed work units incompatible with the SPMD execution model.
  • Critical sections and ordered regions: These constructs require specific thread-identity semantics that conflict with SPMD guards.
  • Unresolvable side effects: Calls to external functions whose side-effect profile is unknown (no declaration with convergent or spmd_amenable annotations).
  • Exception handling with unresolvable handlers: Invoke instructions with non-zero exception handler counts that cannot be simplified block the transformation (checked via sub_BD2BC0).

Comparison with Upstream LLVM OpenMPOpt

The cicc SPMD transformation in sub_26968A0 is a proprietary reimplementation of upstream LLVM's SPMDization and differs from it in several significant ways:

| Aspect | Upstream LLVM OpenMPOpt | cicc sub_26968A0 |
|---|---|---|
| Framework | Attributor-based (AAKernelInfo) | Standalone pass, direct IR mutation |
| Analysis approach | Fixed-point iteration via SPMDCompatibilityTracker | Pre-computed boolean flag at a1+241 |
| Guarded regions | insertInstructionGuardsHelper using SplitBlock | Custom 5-block split with explicit worklist registration |
| Broadcast mechanism | GlobalVariable in shared memory (internal linkage, UndefValue init) | alloca in address space 7 (shared) via sub_B30000 |
| Barrier | __kmpc_barrier_simple_spmd | Same: __kmpc_barrier_simple_spmd (call ID 187) |
| Hash tables | LLVM DenseSet / SmallPtrSet | Custom open-addressing with -4096 sentinel (details) |
| Region merging | Separate openmp-opt-enable-merging flag (disabled by default) | Integrated into the complex path; always runs when needed |
| State machine fallback | buildCustomStateMachine in same AAKernelInfo::manifest | Separate function sub_2678420 (41 KB) |
| Diagnostic IDs | OMP120, OMP121 | OMP120, OMP121 (identical) |
| ompx_spmd_amenable override | Same attribute name | Same attribute name |

The key architectural difference is that upstream LLVM uses the Attributor framework's fixed-point iteration to converge on SPMD compatibility, while cicc separates the analysis (which sets a1+241) from the transformation (which is sub_26968A0). This separation allows cicc to make a single pass over the IR for the transformation rather than iterating to a fixpoint, at the cost of less flexibility in handling interdependent kernels.

Upstream's region merging is behind openmp-opt-enable-merging and disabled by default. cicc's complex path (Phases 1-4 above) performs region merging unconditionally when a kernel has multiple parallel regions with serial gaps, suggesting NVIDIA found merging beneficial enough for GPU targets to enable it by default.

Configuration Knobs

All knobs are standard LLVM cl::opt registrations present in the cicc binary. These match upstream LLVM options:

| Knob | Type | Default | Effect |
|---|---|---|---|
| openmp-opt-disable | bool | false | Disables all OpenMP optimizations |
| openmp-opt-disable-spmdization | bool | false | Disables SPMD transformation specifically |
| openmp-opt-disable-deglobalization | bool | false | Disables device memory deglobalization |
| openmp-opt-disable-folding | bool | false | Disables OpenMP folding optimizations |
| openmp-opt-disable-state-machine-rewrite | bool | false | Disables custom state machine generation |
| openmp-opt-disable-barrier-elimination | bool | false | Disables barrier elimination optimizations |
| openmp-opt-disable-internalization | bool | false | Disables function internalization |
| openmp-opt-enable-merging | bool | false | Enables parallel region merging (upstream default; cicc complex path always merges) |
| openmp-opt-inline-device | bool | false | Inlines all applicable device functions |
| openmp-opt-verbose-remarks | bool | false | Enables more verbose optimization remarks |
| openmp-opt-max-iterations | unsigned | 256 | Maximum attributor fixpoint iterations |
| openmp-opt-shared-limit | unsigned | UINT_MAX | Maximum shared memory usage for broadcast values |
| openmp-opt-print-module-before | bool | false | Dumps IR before OpenMP optimizations |
| openmp-opt-print-module-after | bool | false | Dumps IR after OpenMP optimizations |

Note: The openmp-opt-shared-limit knob controls how much shared memory can be consumed by broadcast value allocations in guarded regions. If the limit is exceeded, the transformation will not proceed for additional guarded outputs. The default of UINT_MAX effectively means no limit.

Diagnostic Strings

| Code | Severity | Message | Trigger |
|---|---|---|---|
| OMP120 | Remark | "Transformed generic-mode kernel to SPMD-mode." | Successful transformation (both simple and complex paths) |
| OMP121 | Warning | "Value has potential side effects preventing SPMD-mode execution. Add [[omp::assume(\"ompx_spmd_amenable\")]] to the called function to override" | Callee not in SPMD-amenable set |
| OMP130-OMP133 | Various | State machine diagnostics | sub_2678420 (fallback, not this pass) |
| OMP150 | Remark | Parallel region merging | sub_2697xxx (separate merging diagnostics) |

Diagnostics are emitted only when a handler is registered at *(a2+4392) and the handler's isEnabled virtual method (vtable offset +48) returns true. The construction follows the pattern: sub_B174A0 (remark) or sub_B178C0 (warning) builds a DiagnosticInfo, sub_B18290 appends the message text, and sub_1049740 emits to the handler.

Runtime Call Dependencies

The transformation uses these runtime functions from the OpenMP runtime declaration table:

| Call ID | Function | Signature | Usage |
|---|---|---|---|
| 6 | __kmpc_get_hardware_thread_id_in_block | i32() | Thread identification for tid == 0 guards |
| 180 | __kmpc_alloc_shared | i8*(i64) | Allocate shared memory for guarded output promotion (complex path) |
| 181 | __kmpc_free_shared | void(i8*, i64) | Free shared memory allocations at kernel exit (complex path) |
| 187 | __kmpc_barrier_simple_spmd | void(ident_t*, i32) | Lightweight SPMD barrier (maps to PTX bar.sync) |

The state machine fallback (sub_2678420) uses a different set of runtime calls, all of which become dead code after successful SPMD transformation:

| Call ID | Function | Signature | Eliminated by SPMD |
|---|---|---|---|
| 155 | __kmpc_target_init | i32(KernelEnvironmentTy*, KernelLaunchEnvironmentTy*) | Return value no longer gates workers |
| 156 | __kmpc_target_deinit | void() | Retained (still needed for cleanup) |
| 157 | __kmpc_kernel_prepare_parallel | void(i8*) | Eliminated; no worker dispatch needed |
| 171 | __kmpc_kernel_parallel | i1(i8**) | Eliminated; no worker polling loop |
| 172 | __kmpc_kernel_end_parallel | void() | Eliminated; no worker completion signal |
| 188 | __kmpc_barrier_simple_generic | void(ident_t*, i32) | Replaced with ID 187 (SPMD barrier) |

Additionally, the SPMD-amenable function set at *(a2+208)+34952 is populated by the runtime table builder (sub_312CF50) during module initialization. Functions declared via sub_312CF50 cases 0-193 are automatically considered, along with user-annotated functions.

Function Map

| Function | Role |
|----------|------|
| sub_26968A0 | Generic-to-SPMD transformation pass (this function, 61 KB) |
| sub_2678420 | Worker state machine generation (Generic fallback, 41 KB) |
| sub_269F530 | Attributor-based OpenMP optimization driver (63 KB, sets a1+241) |
| sub_2680940 | Parallel region merging (52 KB) |
| sub_251BBC0 | AbstractAttribute infrastructure (Attributor framework) |
| sub_251CD10 | Attributor::runTillFixpoint (53 KB, fixed-point iteration engine) |
| sub_26747F0 | OpenMP kernel info collector (populates PassState) |
| sub_2591C20 | Attributor Module Pass entry point (51 KB) |
| sub_2674090 | Read execution mode from attribute map |
| sub_2674040 | Read execution mode (alternate entry) |
| sub_250CBE0 | Get parallel region thread configuration |
| sub_2673FD0 | Read attribute from kernel attribute map |
| sub_2673A60 | Create secondary barrier call |
| sub_312CF50 | OpenMP runtime call table lookup by ID (194-case switch, 117 KB) |
| sub_3122A50 | registerRuntimeFunction (registers declaration in table) |
| sub_313D1B0 | Parallel region outliner (47 KB, creates .omp_par functions) |
| sub_25096F0 | Get function entry basic block |
| sub_BD5C60 | Get function scope / debug info |
| sub_AA8550 | Build CFG region (start/end blocks) |
| sub_AA4D50 | Build exit/cleanup block |
| sub_F36960 | Split basic block |
| sub_BD2C40 | Allocate IR instruction node |
| sub_B4A410 | Fill instruction as runtime-call value load |
| sub_AD64C0 | Create integer constant (zero for tid check) |
| sub_AD6530 | Create integer constant (alternate entry, used in complex path) |
| sub_B52500 | Create icmp instruction |
| sub_B4C9A0 | Create branch instruction (opcode 3) |
| sub_B30000 | Create shared-memory alloca (addr space 7) |
| sub_B4D460 | Create store instruction |
| sub_B4D230 | Create load instruction |
| sub_256E5A0 | Replace all uses of a value |
| sub_921880 | Create runtime library call instruction |
| sub_ACD640 | Create bit-vector entry |
| sub_AAAE30 | Insert into attribute map |
| sub_D695C0 | Register block in pass manager worklist |
| sub_B174A0 | Construct remark DiagnosticInfo |
| sub_B178C0 | Construct warning DiagnosticInfo |
| sub_B18290 | Append string to diagnostic message |
| sub_1049740 | Emit diagnostic to handler |
| sub_B46970 | Check if instruction is a call |
| sub_B46420 | Check if instruction is an invoke |
| sub_BD2BC0 | Get invoke exception handler count |
| sub_B444E0 | Insert guard instructions at range boundary |
| sub_AAB310 | Fast-path comparison instruction creation |
| sub_B523C0 | Full comparison instruction creation |
| sub_CA0F50 | Build name from debug info + suffix |
| sub_B96E90 | Ref-count increment on metadata/debug-info |
| sub_B91220 | Ref-count decrement on metadata/debug-info |
| sub_B976B0 | Transfer metadata ownership between blocks |
| sub_986580 | Get terminator's successor block pointer |
| sub_B99FD0 | Add operand bundle to instruction |
| sub_266EF50 | Duplicate metadata reference |
| sub_B491C0 | Process entry block terminator successor |
| sub_ACA8A0 | Get instruction value type |
| sub_BD5D20 | Get IR node name |
| sub_C8CC70 | Vector push_back (dynamic arrays) |
| sub_C8D5F0 | Vector reserve/grow |

Cross-References

  • OpenMP Runtime Declaration Table -- complete runtime function table (sub_312CF50), including __kmpc_barrier_simple_spmd (ID 187) and __kmpc_get_hardware_thread_id_in_block (ID 6)
  • Entry Point & CLI -- how OpenMP target offloading flags reach the optimizer
  • LLVM Optimizer -- pipeline slots 75/76/154 where openmp-opt runs
  • CLI Flags -- openmp-opt-* knob documentation

LTO & Module Optimization

CICC v13.0 implements Link-Time Optimization as a five-pass pipeline that exploits the GPU's closed-world compilation model for optimization opportunities unavailable to CPU compilers. In CPU LTO, the linker merges partially-optimized object files and runs a second round of optimization on the combined module. The fundamental constraint is that shared libraries, dynamic loading, and symbol interposition limit what the optimizer can assume about the complete program. On GPU, none of these constraints exist. Every __device__ function that can execute on the hardware must be statically visible at compile time -- there is no device-side dlopen, no .so files, no PLT/GOT, no symbol preemption. This closed-world guarantee means the LTO pipeline can inline aggressively across translation units, devirtualize every virtual call site against a complete class hierarchy, and promote or split global variables with full knowledge that no external observer will access the original symbols.

The LTO pipeline runs after the main LLVM optimizer (tier 0-3 passes) has performed per-module optimization. It is triggered when cicc processes bitcode from separate compilation (nvcc --device-c / -dc mode), where each .cu file compiles to a relocatable device object containing LLVM bitcode in the NVVM container. The device linker (nvlink) merges these objects and reinvokes cicc in LTO mode, passing the combined bitcode through the LTO pipeline before final PTX emission. In whole-program compilation (the default), the pipeline is still partially active -- GlobalOpt and the inliner run regardless, but the summary-based import machinery is skipped because there is only one module.

LTO pipeline entry | sub_12F5F30 (0x12F5F30, 37.8 KB)
NVModuleSummary driver | sub_D81040 (0xD81040, 56 KB)
Summary builder | sub_D7D4E0 (0xD7D4E0, 74 KB)
Address range (summary cluster) | 0xD60000--0xD82000
Address range (import/inline cluster) | 0x1850000--0x186CA00
NVVM container IRLevel for LTO | NVVM_IR_LEVEL_LTO (value 1)
Compile mode for separate compilation | NVVM_COMPILE_MODE_SEPARATE_ABI (value 2)
Module flags read | EnableSplitLTOUnit, UnifiedLTO, ThinLTO

Why LTO Matters for GPU

Three properties of GPU execution make LTO dramatically more valuable than on CPU:

Function calls are expensive. Every GPU function call marshals arguments through the .param calling convention via st.param / ld.param instruction sequences. A function with 8 struct arguments can generate hundreds of cycles of marshaling overhead that inlining eliminates entirely. Cross-module inlining -- which requires LTO -- is the primary mechanism for removing this cost for functions defined in separate translation units. See the inliner cost model for the full cost analysis.

Register pressure determines performance. Occupancy is bounded by per-thread register usage, with discrete cliff boundaries. Call boundaries force the backend to save and restore registers across the call site, often spilling to local memory (device DRAM, 200-800 cycle latency). LTO enables cross-module inlining, which in turn enables cross-function register allocation -- the single most impactful optimization for GPU code.

Indirect calls are catastrophic. An indirect call in PTX (call.uni through a register) prevents backend inlining, forces full register spills, destroys instruction scheduling freedom, and creates warp-divergence hazards. Whole-program devirtualization, which requires LTO-level visibility of the complete type hierarchy, converts indirect calls to direct calls and enables all downstream optimizations.

Regular LTO vs ThinLTO

CICC supports both regular (monolithic) LTO and ThinLTO. The LTO driver at sub_D81040 reads three module flags via sub_BA91D0 to determine which mode is active:

Module Flag | Effect
EnableSplitLTOUnit | Enables the split LTO unit mechanism for type metadata
UnifiedLTO | Enables LLVM's unified LTO pipeline (combined thin+regular)
ThinLTO | Activates summary-based import and the two-phase declaration merge in sub_D7D4E0

Regular LTO merges all translation units into a single LLVM module, then runs the full optimization pipeline on the merged result. This gives the optimizer complete visibility but has O(n) memory cost in the total program size and serializes compilation. For GPU programs this is often acceptable because device code is typically smaller than host code.

ThinLTO builds per-module summaries (via NVModuleSummary), uses the summaries to make import decisions without loading full bitcode, then imports selected functions and optimizes each module independently. The builder's a8 parameter (thinlto_mode flag) activates Phase 2 of the summary builder, which performs a second walk over declarations to merge forward-declared and defined symbol tables. This mode enables parallel per-module optimization at the cost of less global visibility.

In practice, NVIDIA's toolchain (nvcc + nvlink) uses regular LTO as the default for device code, because the closed-world model and relatively small code size (compared to CPU programs) make the memory and compile-time cost acceptable. ThinLTO is available for large CUDA programs where compile time is a concern, activated by passing -dlto to nvcc (device LTO) or -flto=thin through the driver.
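
Under ThinLTO, each import decision reduces to comparing a callee's summarized instruction count against a scaled threshold. The sketch below models that decision using the knob defaults documented later in this page (import-instr-limit = 100, hot multiplier 10.0, cold multiplier 0.0) and a simplified version of the priority-class rescue; the function and parameter names are ours, not cicc's.

```python
def should_import(summary, hotness, priority, base=100,
                  hot_mult=10.0, cold_mult=0.0):
    """Decide whether to import a callee across module boundaries.

    summary  -- dict with the callee's summarized instruction count
    hotness  -- callsite hotness class: "hot", "cold", or "none"
    priority -- 4-level import priority (0 = not importable, 3 = force)
    """
    if priority == 0:
        return False            # summary marked the symbol non-importable
    if priority == 3:
        return True             # force-import class bypasses the threshold
    mult = {"hot": hot_mult, "cold": cold_mult}.get(hotness, 1.0)
    # NVIDIA extension (simplified): priority >= 2 rescues cold callees
    # that CPU ThinLTO (cold multiplier 0.0) would always reject.
    if hotness == "cold" and priority >= 2:
        mult = 1.0
    return summary["instrs"] <= base * mult

should_import({"instrs": 50}, "cold", 1)   # False: 50 <= 100 * 0.0 fails
should_import({"instrs": 50}, "cold", 2)   # True: priority-2 rescue
```

The floating-point threshold (rather than upstream's integer comparison) matches the "Threshold comparison" row in the CPU-vs-GPU table below.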

LTO Pipeline

The LTO pipeline executes five major passes in a fixed order. Each pass consumes the output of its predecessor:

 ┌────────────────────────────────────────────────────────────────────────┐
 │                    NVVM Container (IRLevel=1)                         │
 │                    LLVM Bitcode + Module Flags                        │
 └────────────────────┬───────────────────────────────────────────────────┘
                      │
                      ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │  1. NVModuleSummary Builder  (sub_D7D4E0, 74 KB)              │
 │     Build per-function summaries with 4-level import priority, │
 │     complexity budget, CUDA attribute flags, call graph edges  │
 └────────────────────┬────────────────────────────────────────────┘
                      │  ModuleSummaryIndex
                      ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │  2. ThinLTO Function Import  (sub_1854A20, 4.3 KB)            │
 │     Summary-guided cross-module import with floating-point     │
 │     threshold computation, priority-class multipliers,         │
 │     global import budget cap                                   │
 └────────────────────┬────────────────────────────────────────────┘
                      │  Materialized functions + thinlto_src_module metadata
                      ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │  3. Inliner  (sub_1864060 + sub_2613930 + sub_38576C0)        │
 │     Four parallel cost models: NVIDIA custom (20K budget),     │
 │     LLVM standard (225), New PM CGSCC + ML, NVPTX target      │
 └────────────────────┬────────────────────────────────────────────┘
                      │  Inlined module
                      ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │  4. GlobalOpt  (sub_18612A0, 65 KB)                            │
 │     Small-constant promotion (≤2047 bits), SRA for structs     │
 │     (≤16 fields), malloc/free elimination, address-space-aware │
 └────────────────────┬────────────────────────────────────────────┘
                      │  Optimized globals
                      ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │  5. WholeProgramDevirtualization  (sub_2703170, 13 KB)         │
 │     Type-test metadata → vtable resolution → direct calls      │
 │     Red-black tree for type info lookup, 0x90-byte records     │
 └────────────────────┬────────────────────────────────────────────┘
                      │
                      ▼
              Dead Kernel Elimination + GlobalDCE
              → Standard optimizer pipeline (tier 0-3)
              → Code generation + PTX emission

The LTO pipeline entry at sub_12F5F30 (37.8 KB) orchestrates this sequence and also runs dead kernel elimination -- removing __global__ functions that are never referenced by host-side kernel launches. This is a GPU-specific optimization: on CPU, the linker preserves all externally-visible entry points, but in GPU LTO the compiler knows the complete set of kernel launch sites from the host code.

LTO Pipeline Entry -- sub_12F5F30 Algorithm

sub_12F5F30 (0x12F5F30, 37,797 bytes) is the top-level LTO orchestrator. It is called after the CLI parser (sub_12F7D90) has resolved the compilation mode bitmask and the LTO argument vector has been populated from the -Xlto forwarding meta-flag. The function operates in three distinct modes determined by the mode bitmask in a13:

Mode | Bitmask | CLI Flag | Behavior
gen-lto | 0x21 | -gen-lto | Emit partially-optimized bitcode for later linking. No dead-kernel pass.
full LTO | 0x23 | -lto | Full merge + optimize + dead-kernel elimination + emit PTX.
link-lto | 0x26 | -link-lto | Link pre-existing LTO bitcode modules, run full pipeline.

The function's argument list is reconstructed from the LTO output vector v330 (the fourth CLI routing vector, populated by -Xlto and the six -host-ref-* flags). It receives the merged LLVM module, the host reference tables, and the compilation options struct.

Pseudocode: sub_12F5F30 Top-Level

function sub_12F5F30(module, lto_args, options, error_cb):
    # ---- Phase A: Parse LTO-specific arguments ----
    mode = NONE
    trace_enabled = false
    optimize_unused_vars = false
    host_refs = HostRefTable{}      # 6-field table: ek, ik, ec, ic, eg, ig
    force_device_c = false

    for arg in lto_args:
        switch arg:
            case "-gen-lto":       mode = GEN_LTO
            case "-link-lto":      mode = LINK_LTO
            case "-olto":          lto_opt_level = next_arg()
            case "--device-c":     device_c = true
            case "--force-device-c": force_device_c = true
            case "--trace":        trace_enabled = true
            case "-optimize-unused-variables": optimize_unused_vars = true
            case "-has-global-host-info":      has_host_info = true
            case "-host-ref-ek=*": host_refs.ek = parse_symbol_list(value)
            case "-host-ref-ik=*": host_refs.ik = parse_symbol_list(value)
            case "-host-ref-ec=*": host_refs.ec = parse_symbol_list(value)
            case "-host-ref-ic=*": host_refs.ic = parse_symbol_list(value)
            case "-host-ref-eg=*": host_refs.eg = parse_symbol_list(value)
            case "-host-ref-ig=*": host_refs.ig = parse_symbol_list(value)

    # ---- Phase B: Build preserved-symbol sets ----
    # Collect symbols from llvm.used and llvm.metadata named metadata
    used_set = collect_named_metadata(module, "llvm.used")
    metadata_set = collect_named_metadata(module, "llvm.metadata")

    # Merge host reference tables into a unified "referenced from host" set.
    # The 6 host-ref flags encode three entity types x two reference modes:
    #   e = explicit reference (symbol name appears in host launch site)
    #   i = implicit reference (symbol address taken on host side)
    #   k = kernel (__global__),  c = constant (__constant__),  g = global (__device__)
    host_referenced_kernels  = host_refs.ek  UNION  host_refs.ik
    host_referenced_constants = host_refs.ec  UNION  host_refs.ic
    host_referenced_globals   = host_refs.eg  UNION  host_refs.ig

    # ---- Phase C: Decide what to preserve ----
    preserved = used_set  UNION  metadata_set  UNION  host_referenced_kernels

    if NOT optimize_unused_vars:
        preserved = preserved  UNION  host_referenced_constants
                               UNION  host_referenced_globals

    # ---- Phase D: Dead kernel/variable elimination ----
    if mode == GEN_LTO:
        # gen-lto: emit bitcode only, skip elimination
        return emit_lto_bitcode(module)

    if has_host_info:
        dead_kernel_elimination(module, preserved, trace_enabled)

        if optimize_unused_vars:
            dead_variable_elimination(module, preserved,
                                     host_referenced_constants,
                                     host_referenced_globals,
                                     trace_enabled)

    # ---- Phase E: Run the 5-pass LTO pipeline ----
    if mode == LINK_LTO or mode == FULL_LTO:
        run_module_summary_builder(module)      # sub_D7D4E0 via sub_D81040
        run_thinlto_import(module)              # sub_1854A20  (if ThinLTO)
        run_inliner(module)                     # sub_1864060 + sub_2613930
        run_globalopt(module)                   # sub_18612A0
        run_whole_program_devirt(module)        # sub_2703170
        run_global_dce(module)                  # final GlobalDCE sweep

    # ---- Phase F: Hand off to optimizer pipeline ----
    return module    # returned to sub_12E7E70 for tier 0-3 passes

Host Reference Flag Encoding

The six -host-ref-* flags are the mechanism by which nvlink communicates host-side symbol usage to cicc's LTO pass. nvlink inspects the host-side relocatable objects and emits a semicolon-separated list of device symbol names for each flag. The two-letter suffix encodes:

Suffix | Entity Type | Reference Kind
-host-ref-ek | __global__ kernel | Explicit (launch site in host code)
-host-ref-ik | __global__ kernel | Implicit (address taken, e.g. &myKernel)
-host-ref-ec | __constant__ variable | Explicit (cudaMemcpyToSymbol target)
-host-ref-ic | __constant__ variable | Implicit (address taken)
-host-ref-eg | __device__ global variable | Explicit (cudaMemcpyToSymbol target)
-host-ref-ig | __device__ global variable | Implicit (address taken)

The -has-global-host-info flag signals that nvlink has provided complete host reference information. When this flag is absent, sub_12F5F30 conservatively preserves all externally-visible symbols -- the dead kernel/variable elimination pass is skipped entirely.
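
The handshake is simple enough to model directly. The sketch below parses the semicolon-separated symbol lists nvlink emits and builds the preserved set exactly as Phases B-C of the pseudocode describe; the helper names are ours, and the flag spellings are the six documented above.

```python
def parse_symbol_list(value):
    """Split nvlink's semicolon-separated device symbol list into a set."""
    return {s for s in value.split(";") if s}

def build_preserved_set(flags, optimize_unused_vars=False):
    """Merge the six -host-ref-* lists into a 'referenced from host' set."""
    refs = {k: parse_symbol_list(flags.get("-host-ref-" + k, ""))
            for k in ("ek", "ik", "ec", "ic", "eg", "ig")}
    kernels   = refs["ek"] | refs["ik"]   # explicit launches + address-taken
    constants = refs["ec"] | refs["ic"]
    globals_  = refs["eg"] | refs["ig"]
    preserved = set(kernels)              # kernels are always preserved
    if not optimize_unused_vars:          # variables only preserved unless
        preserved |= constants | globals_ # -optimize-unused-variables is set
    return preserved

flags = {"-host-ref-ek": "vecAdd;reduce", "-host-ref-ec": "lut"}
build_preserved_set(flags)                             # == {"vecAdd", "reduce", "lut"}
build_preserved_set(flags, optimize_unused_vars=True)  # == {"vecAdd", "reduce"}
```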

Function Map

Function | Address | Size | Role
sub_12F5F30 | 0x12F5F30 | 37.8 KB | LTO pipeline entry and dead-symbol orchestrator
sub_12F5610 | 0x12F5610 | 7.3 KB | LLVM module linker wrapper (Linker::linkModules)
sub_12F7D90 | 0x12F7D90 | 14.3 KB | CLI argument parser (architecture, opt level, flags)
sub_12F4060 | 0x12F4060 | 15.7 KB | TargetMachine creation with NVIDIA options
sub_1C13840 | 0x1C13840 | -- | Global/function iterator used for dead-code sweep
sub_12F1650 | 0x12F1650 | 5.2 KB | Bitcode reader variant A
sub_12F11C0 | 0x12F11C0 | 5.2 KB | Bitcode reader variant B

Dead Kernel Elimination Algorithm

Dead kernel elimination is the most impactful GPU-specific optimization in the LTO pipeline. It exploits the closed-world model: every __global__ function that will ever execute must have a corresponding <<<>>> launch site (or cudaLaunchKernel call) in the host code that nvlink has already seen. Any kernel not in the host reference set is dead.

This pass cannot exist on CPU. A CPU linker must preserve all non-hidden external functions because shared libraries loaded at runtime via dlopen could call them. On GPU there is no dlopen, no dynamic symbol resolution, no PLT. The set of reachable kernels is completely determined at link time.

Pseudocode: dead_kernel_elimination

function dead_kernel_elimination(module, preserved_set, trace):
    # Walk all functions in the module via sub_1C13840 iterator
    worklist = []

    for func in module.functions():
        if func.isDeclaration():
            continue

        cc = func.getCallingConv()

        # PTX calling convention 71 = __global__ (kernel entry point)
        # PTX calling convention 72 = __device__ (device function)
        # PTX calling convention 95 = CUDA internal (managed init)
        if cc != 71:
            continue    # only eliminate kernels, not device functions

        name = func.getName()

        if name in preserved_set:
            continue    # referenced from host, or in llvm.used -- keep it

        # This kernel has no host launch site.
        if trace:
            emit_diagnostic("no reference to kernel " + name)

        worklist.append(func)

    # ---- Remove dead kernels ----
    for func in worklist:
        # Before erasing, check if any device-side indirect references exist.
        # On GPU, device-side function pointers (callback patterns) can reference
        # kernels via address-of. Check use_empty():
        if NOT func.use_empty():
            # Has device-side users -- cannot safely remove.
            # (This is rare: kernels are almost never called from device code.)
            continue

        func.replaceAllUsesWith(UndefValue)
        func.eraseFromParent()

    return len(worklist)
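
The sweep is easy to exercise on a toy module. In this executable model, calling conventions 71/72 mirror the PTX IDs from the pseudocode; the Function class and use counter are stand-ins for cicc's IR structures, not reconstructions of them.

```python
from dataclasses import dataclass

CC_KERNEL, CC_DEVICE = 71, 72   # PTX calling conventions from the pseudocode

@dataclass
class Function:
    name: str
    cc: int
    uses: int = 0               # device-side references (address-of, calls)

def dead_kernel_elimination(functions, preserved):
    """Partition functions into kept/removed per the closed-world rule."""
    kept, removed = [], []
    for f in functions:
        dead = (f.cc == CC_KERNEL         # only kernels are candidates
                and f.name not in preserved
                and f.uses == 0)          # no device-side users remain
        (removed if dead else kept).append(f)
    return kept, removed

funcs = [Function("launchedKernel", CC_KERNEL),
         Function("unusedKernel", CC_KERNEL),
         Function("helper", CC_DEVICE)]
kept, removed = dead_kernel_elimination(funcs, preserved={"launchedKernel"})
# [f.name for f in removed] == ["unusedKernel"]; helper survives because
# device functions (CC 72) are never candidates.
```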

Pseudocode: dead_variable_elimination

When -optimize-unused-variables is enabled, the same logic extends to __device__ and __constant__ global variables:

function dead_variable_elimination(module, preserved_set,
                                   host_constants, host_globals, trace):
    worklist = []

    for gv in module.globals():
        if gv.isDeclaration():
            continue

        name = gv.getName()

        if name in preserved_set:
            continue

        as = gv.getAddressSpace()

        # Address space 1 = global, address space 4 = constant
        if as == 4 and name NOT in host_constants:
            if trace:
                emit_diagnostic("no reference to variable " + name)
            worklist.append(gv)
        elif as == 1 and name NOT in host_globals:
            if trace:
                emit_diagnostic("no reference to variable " + name)
            worklist.append(gv)

    for gv in worklist:
        if NOT gv.use_empty():
            continue    # still referenced from device code
        gv.eraseFromParent()

    return len(worklist)

The --trace-lto CLI flag (which maps to --trace in the LTO argument vector via the flag catalog at line 2394) enables the diagnostic messages. When active, cicc prints one line per eliminated symbol to stderr, enabling build-system integration and debugging of unexpected kernel removal.

Module Merge Process

Before sub_12F5F30 can perform dead-kernel elimination or any LTO optimization, the separate-compilation bitcode modules must be merged into a single LLVM module. This merge happens in two layers: the NVIDIA module linker wrapper sub_12F5610 (7.3 KB) and the underlying LLVM IRLinker at sub_16786A0 (61 KB).

Two-Level Linking Architecture

nvlink extracts .nv_fatbin bitcode sections
         |
         v
┌─────────────────────────────────────────────────────────────┐
│  NVIDIA Module Loader  (sub_12C06E0, 63 KB)                │
│  - Validates LLVM bitcode magic (0x0B17C0DE or 0xDEC04342) │
│  - Checks IR version via sub_12BFF60                        │
│  - Validates target triple (must be "nvptx64-*")            │
│  - Single-module fast path: return directly if N=1          │
│  - Multi-module: normalize triples, set matching DataLayout │
└─────────────────────┬───────────────────────────────────────┘
                      │  N validated modules
                      v
┌─────────────────────────────────────────────────────────────┐
│  NVIDIA Module Linker Wrapper  (sub_12F5610, 7.3 KB)       │
│  - Selects primary module (typically the largest)           │
│  - For each secondary module:                               │
│      Copy triple from primary → secondary                   │
│      Call IRLinker to merge secondary into primary           │
│  - Post-link: restore linkage attributes from hash table    │
│      Values 7-8: external linkage (low 6 bits)              │
│      Other: set low 4 bits + visibility from bits 4-5       │
│      Set dso_local flag (byte+33 |= 0x40)                  │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      v
┌─────────────────────────────────────────────────────────────┐
│  LLVM IRLinker::run  (sub_16786A0, 61 KB)                   │
│  - Allocates 0x2000-byte DenseMap for symbol resolution     │
│  - Hash function: (addr >> 9) ^ (addr >> 4)                 │
│  - Resolves COMDAT groups (sub_167DAB0, 39 KB)              │
│  - Links global value prototypes (sub_1675980, 37 KB)       │
│  - Links function bodies (sub_143B970, 14 KB)               │
│  - Merges named metadata (llvm.dbg.cu, llvm.used, etc.)     │
│  - Resolves llvm.global_ctors / llvm.global_dtors ordering  │
│  - Maps values across modules via DenseMap<Value*, Value*>  │
│  - Tombstone sentinels: empty=-8, deleted=-16               │
└─────────────────────┬───────────────────────────────────────┘
                      │  single merged module
                      v
              sub_12F5F30 (LTO pipeline entry)

Pseudocode: Module Merge (sub_12F5610 + sub_12C06E0)

function module_merge(module_list, llvm_ctx, options):
    # ---- Step 1: Load and validate all modules (sub_12C06E0) ----
    modules = []
    for entry in module_list:
        buf = open_buffer(entry.data, entry.length, entry.name)

        # Validate bitcode magic
        magic = read_u32(buf, 0)
        if magic != 0x0B17C0DE and magic != 0xDEC04342:
            error("invalid bitcode: " + entry.name)
            return NULL

        module = parse_bitcode(buf, llvm_ctx)  # sub_15099C0

        # Check IR version compatibility (sub_12BFF60)
        if ir_version_check(module_list, module, flags) != 0:
            error(entry.name + ": error: incompatible IR detected. "
                  "Possible mix of compiler/IR from different releases.")
            return NULL

        # Validate target triple
        triple = module.getTargetTriple()
        if NOT triple.startswith("nvptx64-"):
            error("Module does not contain a triple, "
                  "should be 'nvptx64-'")
            return NULL

        modules.append(module)

    # ---- Step 2: Single-module fast path ----
    if len(modules) == 1:
        return modules[0]

    # ---- Step 3: Multi-module linking (sub_12F5610) ----
    # Save linkage attributes before linking (they get modified)
    linkage_map = DenseMap<StringRef, u8>{}
    for module in modules:
        for func in module.functions():
            linkage_map[func.getName()] = func.getLinkage()
        for gv in module.globals():
            linkage_map[gv.getName()] = gv.getLinkage()

    # Select primary module and link secondaries into it
    primary = modules[0]
    for i in range(1, len(modules)):
        secondary = modules[i]

        # Normalize: copy DataLayout from primary to secondary
        secondary.setDataLayout(primary.getDataLayout())
        secondary.setTargetTriple(primary.getTargetTriple())

        # IRLinker::run (sub_16786A0)
        # Resolves COMDATs, links globals, maps values, merges metadata
        err = Linker::linkModules(primary, secondary)
        if err:
            error("<module_name>: link error: <details>")
            return NULL

    # ---- Step 4: Restore linkage attributes ----
    # During linking, LLVM may promote linkage (e.g., internal -> external)
    # to resolve cross-module references. Restore the original linkage
    # where possible, preserving the correct visibility for PTX emission.
    for func in primary.functions():
        name = func.getName()
        if name in linkage_map:
            original = linkage_map[name]
            if original in [7, 8]:       # external linkage variants
                func.setLinkage(original & 0x3F)
            else:
                func.setLinkage(original & 0x0F)
                if (original & 0x30) != 0:
                    func.setVisibility(original >> 4)
            func.setDSOLocal(true)       # byte+33 |= 0x40

    for gv in primary.globals():
        # same linkage restoration logic
        ...

    return primary
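
The bit manipulation in Step 4 can be isolated into a small decoder. This sketch follows the recovered constants (values 7-8 = external linkage variants, low 4 bits = linkage, bits 4-5 = visibility); the field interpretation is ours, reconstructed from the decompilation, not a confirmed layout.

```python
def restore_linkage(original):
    """Decode (linkage, visibility) from a saved linkage byte.

    Returns visibility as None when bits 4-5 are clear, meaning the
    visibility recorded before linking is left untouched.
    """
    if original in (7, 8):                # external linkage variants
        return original & 0x3F, None      # keep low 6 bits as-is
    linkage = original & 0x0F             # low 4 bits: linkage kind
    visibility = (original >> 4) if original & 0x30 else None
    return linkage, visibility

restore_linkage(7)      # (7, None): external stays external
restore_linkage(0x23)   # (3, 2): linkage 3, visibility from bits 4-5
```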

Key Data Structures in the Merge

Structure | Location | Details
Value map DenseMap | Allocated in sub_16786A0 | 0x2000 bytes (8192), hash: (addr >> 9) ^ (addr >> 4), quadratic probing
Linkage hash table | Stack-allocated in sub_12E1EF0 (v362) | Maps StringRef name to original linkage byte
Function-to-module map | Stack-allocated in sub_12E1EF0 (v359) | Maps StringRef name to function pointer for split-module dispatch
COMDAT group map | Internal to sub_167DAB0 | Tracks COMDAT selection kinds: any / exact-match / largest / no-dup / same-size
Named metadata merge list | Internal to sub_1671B40 | Special handling for llvm.dbg.cu, llvm.used, llvm.compiler.used, llvm.global_ctors, llvm.global_dtors, llvm.global.annotations
Module config flag | dword_4F99BC0 | Controls linker behavior variant
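
A minimal model of the value-map lookup ties these constants together. The hash (addr >> 9) ^ (addr >> 4), quadratic probing, and the -8/-16 sentinels come from the decompilation; the table mechanics below are standard LLVM DenseMap behavior, simplified to a fixed-size list.

```python
# 64-bit sentinels observed in the binary: empty = -8, deleted = -16.
MASK64 = 0xFFFFFFFFFFFFFFFF
EMPTY, DELETED = -8 & MASK64, -16 & MASK64

def dense_hash(addr):
    """Pointer hash recovered from sub_16786A0."""
    return ((addr >> 9) ^ (addr >> 4)) & MASK64

def find_slot(keys, addr):
    """Quadratic probe: slot, slot+1, slot+3, slot+6, ... (mod capacity)."""
    mask = len(keys) - 1                  # capacity is a power of two
    slot = dense_hash(addr) & mask
    step = 1
    while keys[slot] not in (addr, EMPTY):
        slot = (slot + step) & mask       # DELETED slots are probed past
        step += 1
    return slot
```

Insertion writes the key into the returned slot; lookup of the same address then lands on the same slot without rehashing.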

Split-Module Compilation and Re-Linking

When concurrent compilation is active (thread count > 1 and multiple defined functions), the optimization pipeline uses a split-module strategy: each function is extracted into its own bitcode module, optimized independently in a thread pool, and then re-linked. The split/re-link cycle uses the same sub_12F5610 linker wrapper:

  1. Split (sub_1AB9F40): extracts per-function bitcode using a filter callback (sub_12D4BD0) that selects a single function by name from the function-to-module hash table.
  2. Optimize (thread pool via sub_16D5230): each worker runs sub_12E86C0 (Phase II optimizer) with qword_4FBB3B0 = 2.
  3. Re-link (sub_12F5610): merges all per-function bitcode modules back into a single module.
  4. Restore linkage (v362 hash table): the linkage attributes saved before the split are written back to prevent linkage promotion artifacts.

This cycle is orchestrated by sub_12E1EF0 (51 KB, the top-level concurrent compilation entry). The GNU Jobserver integration (sub_16832F0) throttles thread pool size to match the build system's -j level when cicc is invoked from make.
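
The split/optimize/re-link cycle can be sketched with a thread pool standing in for cicc's worker pool. Everything here is illustrative: the dict-of-functions "module" and the string transform are stand-ins, showing only the shape of steps 1-3.

```python
from concurrent.futures import ThreadPoolExecutor

def split(module):
    """Step 1: extract each function into its own single-entry 'module'."""
    return [{name: body} for name, body in module.items()]

def optimize(submodule):
    """Step 2: per-function optimization, run concurrently by workers."""
    return {name: body + " (optimized)" for name, body in submodule.items()}

def relink(submodules):
    """Step 3: merge the per-function modules back into one."""
    merged = {}
    for sm in submodules:
        merged.update(sm)
    return merged

module = {"kernelA": "ir-a", "kernelB": "ir-b"}
with ThreadPoolExecutor(max_workers=2) as pool:
    optimized = list(pool.map(optimize, split(module)))
relinked = relink(optimized)
# relinked == {"kernelA": "ir-a (optimized)", "kernelB": "ir-b (optimized)"}
```

Step 4 (linkage restoration) is omitted here; it applies the saved-byte decoding shown in the module-merge pseudocode.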

Separate Compilation and the NVVM Container

When nvcc --device-c compiles a .cu file, cicc produces an NVVM container with CompileMode = NVVM_COMPILE_MODE_SEPARATE_ABI (value 2) and IRLevel = NVVM_IR_LEVEL_LTO (value 1). This container wraps partially-optimized LLVM bitcode -- the per-module optimizer has run, but cross-module optimization has not. The bitcode is embedded in the ELF .nv_fatbin section of the relocatable object file.

At link time, nvlink extracts the bitcode sections from all input objects, concatenates them, and passes the result back to cicc in LTO mode. cicc deserializes each container, links the bitcode modules via LLVM's Linker::linkModules, and then runs the LTO pipeline described above on the merged module. The pipeline sees the complete device program for the first time at this point.

The IRLevel enum controls which optimizations have already been applied:

IRLevel | Value | Meaning
NVVM_IR_LEVEL_UNIFIED_AFTER_DCI | 0 | Default: fully optimized, no LTO needed
NVVM_IR_LEVEL_LTO | 1 | Partially optimized, awaiting LTO pipeline
NVVM_IR_LEVEL_OPTIX | 2 | OptiX pipeline IR (separate optimization model)

Pass Inventory

Pass | Entry Point | Size | Pipeline Slot | Type | Sub-page
NVModuleSummary Builder | sub_D7D4E0 | 74 KB | N/A (called from driver) | Analysis | module-summary.md
NVModuleSummary Driver | sub_D81040 | 56 KB | N/A (LTO entry) | Module | module-summary.md
ThinLTO Function Import | sub_1854A20 | 4.3 KB | Slot 43 ("function-import") | Module | thinlto-import.md
ThinLTO Threshold Engine | sub_1853180 | 5.1 KB | N/A (called from import driver) | Utility | thinlto-import.md
NVIDIA Custom Inliner | sub_1864060 | 75 KB | CGSCC pass | CGSCC | inliner-cost.md
LLVM Standard InlineCost | sub_30DC7E0 | 51 KB | N/A (library) | Analysis | inliner-cost.md
New PM CGSCC Inliner | sub_2613930 | 69 KB | CGSCC pass | CGSCC | inliner-cost.md
NVPTX Target Cost Modifier | sub_38576C0 | 58 KB | N/A (target hook) | Target | inliner-cost.md
GlobalOpt | sub_18612A0 | 65 KB | Slot 45 ("globalopt") | Module | globalopt.md
WholeProgramDevirt | sub_2703170 | 13 KB | Slot 121 ("wholeprogramdevirt") | Module | devirtualization.md

Key Differences from CPU LTO

Aspect | CPU LTO | CICC GPU LTO
Import threshold | 100 instructions (default) | Priority-class multipliers, global budget at dword_4FAB120
Cold import | 0.0 multiplier (never import cold) | Imports cold functions if priority >= 2
Inline budget | 225 (LLVM default) | 20,000 (NVIDIA custom), ~89x larger
Devirt conservatism | Must handle DSOs, hidden visibility | Full type hierarchy always visible
Code size concern | Bloats .text, impacts cache/pages | No shared libs; size is secondary to register pressure
Address spaces | Trivial (flat memory model) | 5+ address spaces; GlobalOpt must preserve AS through splits
Dead symbol elimination | Linker GC sections | Dead kernel elimination in sub_12F5F30
Threshold comparison | Integer instruction count | Floating-point threshold with hotness/linkage/priority multipliers
ML-guided inlining | Available upstream | Integrated via InlineAdvisor at sub_2609820 with model at sub_29B2CD0

LTO Knob Summary

NVModuleSummary Knobs

Knob | Default | Effect
dword_4F87C60 (global override) | 0 | When nonzero, forces all symbols to importable; value 2 = conservative comdat handling

ThinLTO Import Knobs

Registered in ctor_184_0 (0x4DA920) and ctor_029 (0x489C80):

Knob | Type | Default | Effect
import-instr-limit | int | 100 | Base instruction count threshold for import
import-hot-multiplier | float | 10.0 | Multiplier applied to threshold for hot callsites
import-cold-multiplier | float | 0.0 | Multiplier for cold callsites (0 = never import cold on CPU)
dword_4FAB120 | int | -1 | Global import budget; negative = unlimited
dword_4FAA770 | int | 0 | Current import count (runtime accumulator)
summary-file | string | -- | Path to external summary file for ThinLTO
function-import | -- | -- | Pipeline registration string (slot 43)
disable-thinlto-funcattrs | bool | false | Disable ThinLTO function attribute propagation
thinlto-workload-def | string | -- | Workload definition file for priority-guided import

Inliner Knobs

Registered in ctor_186_0 (0x4DBEC0):

Knob | Type | Default | Effect
inline-budget | int | 20,000 | Per-caller inlining cost budget (NVIDIA custom model)
inline-total-budget | int | -- | Global total budget across all callers
inline-adj-budget1 | int | -- | Adjusted per-caller budget (secondary)
nv-inline-all | bool | off | Force inline every function call
profuseinline | bool | off | Verbose inlining diagnostic output
inline-switchctrl | int | -- | Heuristic tuning for switch statements
inline-threshold | int | 225 | LLVM standard model threshold (separate from NVIDIA's 20K)
function-inline-cost-multiplier | float | -- | New PM: penalty multiplier for recursive functions

GlobalOpt Knobs

No dedicated cl::opt flags. All thresholds are hardcoded:

Parameter | Value | Description
Max bits for promotion | 2,047 (0x7FF) | Globals exceeding this fall through to SRA
Max struct fields for SRA | 16 | Structs with >16 fields are not split
Hash table load factor | 75% | Triggers rehash of processed-globals table
Pipeline position | Step 30 (tier 2/3) | After GlobalDCE, before LoopVectorize
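
The two hardcoded limits compose into a simple classification. This decision sketch uses only the 2,047-bit and 16-field constants from the table; the strategy names are descriptive labels of ours, not identifiers from the binary.

```python
MAX_PROMOTE_BITS = 2047   # 0x7FF: largest global eligible for promotion
MAX_SRA_FIELDS = 16       # structs with more fields are not split

def globalopt_strategy(bit_width, struct_fields=None):
    """Classify a global by the hardcoded GlobalOpt thresholds."""
    if bit_width <= MAX_PROMOTE_BITS:
        return "promote"              # small constant: fold into uses
    if struct_fields is not None and struct_fields <= MAX_SRA_FIELDS:
        return "sra"                  # too wide to promote: split fields
    return "keep"                     # neither transform applies

globalopt_strategy(64)          # 'promote'
globalopt_strategy(4096, 8)     # 'sra'
globalopt_strategy(4096, 32)    # 'keep'
```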

Devirtualization Knobs

Knob | Type | Default | Effect
wholeprogramdevirt | -- | -- | Pipeline registration string (slot 121)

The pass has no NVIDIA-specific tuning knobs. It relies entirely on the completeness of type_test metadata produced by the NVModuleSummary builder.

Cross-References

NVModuleSummary Builder

CICC replaces LLVM's ModuleSummaryAnalysis with a custom NVModuleSummary subsystem that extends the ModuleSummaryIndex with GPU-specific information. The builder at sub_D7D4E0 (74 KB, 2571 decompiled lines) walks every global value in a module, constructs per-function summaries with CUDA-aware call graph edges, assigns four-level import priorities using a custom priority table, tracks function complexity on a profile-guided budget, and records CUDA-specific attributes such as address-space linkage, kernel-vs-device classification, and device memory reference patterns. The summary is the data source for all downstream ThinLTO decisions -- the ThinLTO importer reads these summaries to decide which functions to pull across module boundaries, and the inliner cost model consumes the complexity budget to calibrate cross-module inline thresholds.

Upstream LLVM's computeFunctionSummary (in ModuleSummaryAnalysis.cpp) counts instructions, builds call graph edges from CallBase operands, collects reference edges by walking instruction operands, and records type test / devirtualization metadata. It produces a FunctionSummary with a flat instruction count and a call edge list annotated with CalleeInfo::HotnessType (Unknown/Cold/None/Hot). NVIDIA's replacement does all of this, then adds: a 4-level import priority classification per function, a 28-bit profile-scaled complexity budget, CUDA address-space tracking (filtering out device-memory-only declarations from import candidacy), kernel identification via first-instruction opcode probing, six separate CUDA-specific accumulator structures for device call context, and a two-phase declaration re-walk that merges forward-declared and defined symbol tables for ThinLTO.

Builder entry | sub_D7D4E0 (0xD7D4E0, 74 KB)
LTO driver | sub_D81040 (0xD81040, 56 KB)
Per-function analyzer | sub_D741C0 (0xD741C0, 19 KB)
Call graph analyzer | sub_D6EA70 (0xD6EA70, 19 KB)
Summary packer | sub_D77220 (0xD77220)
Summary serializer | sub_1535340 (0x1535340, 26 KB)
Summary parser | sub_150B5F0 (0x150B5F0, 63 KB)
Address range | 0xD60000--0xD82000 (full NVModuleSummary cluster)
Stack frame | 1,552 bytes (0x610)

Summary Fields Beyond Upstream

Upstream LLVM's FunctionSummary stores instruction count, call edges with hotness, reference edges, type test GUIDs, and a few flags (norecurse, returndoesnotalias, etc). NVIDIA extends this with the following per-function fields:

Field | Encoding | Width | Description
Import priority | *entry & 0x7 | 3 bits | 4-level priority: 0 = not importable, 1 = low, 2 = standard, 3 = force-import
Address-taken flag | *entry & 0x8 | 1 bit | Set if sub_B49220(GV) returns true (function has its address taken)
Complexity budget | *entry >> 4 | 28 bits | Profile-scaled importance, max 0xFFFFFFF (268,435,455)
Kernel bit | flags & (1 << 9) | 1 bit | Set if first instruction opcode is 36 (kernel entry point)
Has-unwind-info | flags & (1 << 0) | 1 bit | sub_B2DCC0(func) -- has personality function
Not-inline | flags & (1 << 1) | 1 bit | Function marked noinline
Read-none | flags & (1 << 2) | 1 bit | Attribute #34 readnone
No-unwind | flags & (1 << 3) | 1 bit | Attribute #22 nounwind
Will-return | flags & (1 << 4) | 1 bit | Attribute #31 willreturn
No-return | flags & (1 << 5) | 1 bit | Attribute #3 noreturn
Must-progress | flags & (1 << 6) | 1 bit | Attribute #41 mustprogress
Has-visible-alias | flags & (1 << 7) | 1 bit | Accumulated alias visibility flag
Has-non-importable-refs | flags & (1 << 8) | 1 bit | References symbols that cannot be imported
Has-any-import | module flag bit 6 | 1 bit | OR of device-ref, has-typed-symbol, has-non-importable

The per-entry summary record in the primary hash table is 16 bytes. The lower 32 bits pack the priority/address-taken/budget fields. The upper 64 bits hold a pointer to the full FunctionSummary record built by sub_D77220.
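The bit layout of that lower 32-bit word can be modeled in a few lines. This is a hedged Python sketch of the packing described above (function names are illustrative, not recovered symbols):

```python
BUDGET_MAX = 0xFFFFFFF  # 28-bit ceiling (268,435,455)

def pack_entry_word(priority: int, addr_taken: bool, budget: int) -> int:
    """Pack bits 0-2 = import priority, bit 3 = address-taken,
    bits 4-31 = complexity budget (clamped, as the builder does)."""
    assert 0 <= priority <= 3
    budget = min(budget, BUDGET_MAX)
    return (budget << 4) | (int(addr_taken) << 3) | priority

def unpack_entry_word(word: int):
    """Inverse: (priority, addr_taken, budget)."""
    return word & 0x7, bool(word & 0x8), word >> 4
```

Round-tripping a value reproduces the field extraction expressions (*entry & 0x7, *entry & 0x8, *entry >> 4) from the table above.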

Builder Algorithm

The builder executes in three phases within sub_D7D4E0. The LTO driver sub_D81040 calls the builder after reading module flags (EnableSplitLTOUnit, UnifiedLTO, ThinLTO) and iterating all functions via a callback iterator.

Phase 1: Global Value Walk (lines 559--1671)

The module's global value list is a linked list rooted at Module+72 (the GlobalList field). The sentinel node is at Module+72 itself; the first real element is at Module+80.

// Phase 1: iterate all GlobalValues in the module
GlobalValue *sentinel = (GlobalValue *)(module + 72);
GlobalValue *cur = *(GlobalValue **)(module + 80);

while (cur != sentinel) {
    uint8_t opcode = cur->ir_node[0];  // IR opcode byte
    switch (opcode) {
    case 61: /* '=' -- Function definition */
        process_function(cur);
        break;
    case 62: /* '>' -- GlobalVariable */
        process_global_variable(cur);
        break;
    case 34: /* '"' -- Alias (kind 1) */
    case 40: /* '(' -- Alias (kind 2) */
        process_alias(cur);
        break;
    case 85: /* 'U' -- Declaration/extern */
        process_declaration(cur);
        break;
    }
    cur = cur->next;
}

For each function (opcode 61), the builder performs:

1. Import priority assignment. Queries the ImportPriorityTable via sub_D84370(table, func, PSI, 0). If found and the table is non-null, the priority is determined:

entry = getImportKind(priority_table, func, PSI, 0);
if (entry.found) {
    if (isImported(priority_table, entry))          // sub_D84440
        priority = 3;  // force-import
    else if (isImportCandidate(priority_table, entry, 3) == 0)  // sub_D84450
        priority = 2;  // standard importable
    else
        priority = 1;  // low priority
} else {
    priority = 0;  // not importable
}

2. Complexity budget computation. When ProfileSummaryInfo is available and the function was found in the priority table, the builder computes a profile-scaled importance value:

uint64_t profile_count = getProfileCount(PSI, func);  // sub_FDD860
uint64_t threshold = getHotThreshold(PSI);              // sub_FDC4B0

if (profile_count exists) {
    APInt importance = computeScaledImportance(profile_count, threshold);  // sub_F04200
    normalizeImportPriority(&importance, 8);  // sub_D78C90: right-shift by 8
    budget += importance.getZExtValue();
    budget = min(budget, 0xFFFFFFF);  // clamp to 28-bit max
}

// Pack into entry: lower 4 bits = priority | address_taken, upper 28 bits = budget
*entry_word = (budget << 4) | (*entry_word & 0xF);

The 28-bit budget is consumed downstream by ThinLTO to decide how much inlining budget to allocate for functions imported from other modules. A budget of 0 means the function has no profile data and gets the baseline threshold; a budget near the 268M ceiling means the function is extremely hot and will receive aggressive cross-module inlining.

3. Call graph edge construction. For functions with call graph info (bit 5 of byte 7: func->ir_node[7] & 0x20), the builder extracts two kinds of edges:

  • Direct call edges from attribute group #35: the callee list. Each callee gets a GUID via sub_9E27D0, and edges are collected into a temporary vector (4-byte stride per GUID).
  • Reference edges with type info from attribute group #34: operand bundles encoding reference edges with type metadata. Each reference carries a CalleeType byte and parameter type pairs extracted from MDNode operands. The MDNode decoding walks: operand -> parent (opcode 1 = MDString) -> offset 136 (opcode 17 = MDTuple) -> string data at offset 24.

Call graph edge records are 136 bytes each (stride 136 in the edge vector) and contain source name, target name, and edge attributes. Type-metadata edges are 72 bytes each.

4. CUDA address-space filtering. When the CUDA-mode flag (a6) is set and a declaration has address space 25 in its type chain, the function sets the device-reference flag (v327). Functions whose type resolves to address space 25 are excluded from import candidacy -- device-memory-only declarations cannot be cross-module imported in ThinLTO. The check:

if (cuda_mode && is_declaration(func)) {
    Type *ty = func->type_at_offset_minus_2;
    if (getAddressSpace(ty) == 25) {
        has_device_ref = true;
        goto skip_import;  // do not mark as importable
    }
}

Address space 25 appears to be an internal NVVM encoding for device-side linkage. This differs from the standard NVPTX address spaces (0 = generic, 1 = global, 3 = shared, 4 = constant, 5 = local). The summary records this flag so the importer can avoid attempting to import device-side-only symbols, which would fail at link time.

5. CUDA call context collection. For functions with the device attribute bit (func[33] & 0x20), the builder calls sub_D7CF70 to populate six parallel accumulator structures:

Accumulator | Offset | Likely content
v408 | +0 | Direct device call targets
v415 | +1 | Shared memory references
v422 | +2 | Texture/surface references
v429 | +3 | Constant memory references
v436 | +4 | Kernel launch edges
a5 | +5 | Additional context (passed from caller)

These six vectors capture the GPU-specific dependency information that upstream LLVM's summary has no concept of. The ThinLTO importer uses this to make GPU-aware import decisions -- for example, a function that references shared memory in another module must also import the shared memory declaration.

Phase 2: ThinLTO Declaration Re-Walk (lines 1673--1911)

When thinlto_mode (parameter a8) is true, the builder performs a second pass over forward-declared symbols:

Step 1. Re-walk function declarations collected during Phase 1. For each, remove from the "seen" set and re-analyze via sub_D7B190 into a secondary hash table.

Step 2. Re-walk global variable declarations through a separate dedup mechanism using sub_C8CA60 for hash-based deduplication.

Step 3. Merge the secondary (forward-declared) and primary (defined) hash tables. On collision -- the same symbol appears as both declared and defined -- sub_D76140 removes the entry from the defined table and sub_D7AF10 re-inserts into the merged table with updated visibility. This merge ensures that the summary captures cross-module edges even for symbols that are only forward-declared in the current module.

The two-phase design is necessary because CUDA compilation units frequently contain forward declarations of device functions defined in other translation units. Without this re-walk, the summary would miss the cross-module edges for these declarations, and ThinLTO would fail to import them.
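The collision handling in the Phase 3 merge step can be sketched as follows, under the assumption that the primary and secondary tables behave like GUID-keyed dictionaries (the names here are illustrative, not the real structures):

```python
def merge_tables(defined: dict, declared: dict) -> dict:
    """Merge forward-declared summaries into the defined-symbol table.

    On collision (symbol both declared and defined), the defined entry is
    removed and re-inserted with updated visibility, mirroring the
    sub_D76140 / sub_D7AF10 pair described above."""
    merged = dict(defined)
    for guid, decl in declared.items():
        if guid in merged:
            entry = merged.pop(guid)                    # remove from defined table
            entry = {**entry, "externally_visible": True}
            merged[guid] = entry                        # re-insert, visibility updated
        else:
            merged[guid] = decl
    return merged
```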

Phase 3: Finalize and Emit (lines 1912--2569)

Module-level flag assembly. After processing all globals, the builder computes two flag words:

// v134: module-level attribute summary (bits 0-10)
v134 = (linkage & 0xF)               // bits 0-3
     | ((visibility & 0x3) << 4)     // bits 4-5
     | (has_any_import << 6)          // bit 6: OR of v327|v316|v358
     | (has_comdat << 7)              // bit 7
     | (has_comdat_attr << 8)         // bit 8
     | (dll_storage_class << 9);      // bits 9-10

// v143: per-function flags (bits 0-9)
v143 = has_unwind_info               // bit 0
     | (not_inline << 1)             // bit 1
     | (readnone << 2)               // bit 2
     | (nounwind << 3)               // bit 3
     | (willreturn << 4)             // bit 4
     | (noreturn << 5)               // bit 5
     | (mustprogress << 6)           // bit 6
     | (has_visible_alias << 7)      // bit 7
     | (has_non_importable_refs << 8) // bit 8
     | (is_kernel << 9);             // bit 9

The kernel detection walks to the function's first instruction via offset 24, verifies the opcode is in range 30--40 (basic block terminators), and checks specifically for opcode 36, which encodes a kernel entry point. This is how the summary distinguishes __global__ kernel functions from __device__ helper functions without relying on metadata -- it inspects the compiled IR structure directly.
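The probe reduces to a two-part check, sketched here in Python (opcode values taken from the analysis above):

```python
KERNEL_ENTRY_OPCODE = 36            # marks a __global__ entry point
TERMINATOR_LO, TERMINATOR_HI = 30, 40

def is_kernel(first_opcode: int) -> bool:
    """Model of the described probe: the first instruction's opcode must
    fall in the terminator range 30-40 and specifically equal 36."""
    if not (TERMINATOR_LO <= first_opcode <= TERMINATOR_HI):
        return False
    return first_opcode == KERNEL_ENTRY_OPCODE
```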

Summary record packing. All collected data is packed into the final FunctionSummary via sub_D77220, which takes 14 arguments:

sub_D77220(
    &result,             // output FunctionSummary*
    module_flags,        // v134
    instruction_count,   // v324
    function_flags,      // v143 (includes kernel bit)
    &priority_slice,     // import priority table slice
    guid_ref_list,       // GUID reference list
    &typed_refs,         // type-checked reference list (72-byte entries)
    &typed_edges,        // typed call graph edges (136-byte entries)
    &simple_edges,       // simple call graph edges (GUID array)
    device_context,      // CUDA device context edges
    additional_edges,    // extra edge data
    &bundle_refs,        // operand bundle references
    &cross_module_calls, // cross-module call records
    &param_types         // per-parameter type metadata
);

The result is stored via sub_D7A690(index, func, &result) which merges the summary into the module-level index.

Callback invocation. The a9 parameter is a callback object with vtable layout: a9+16 points to a shouldSkip() predicate; a9+24 points to a processFunction(a9, GlobalValue*) handler. When shouldSkip() returns null, the callback is invoked for each function. The callback result is processed by sub_D8D9B0 which extracts additional summary information (likely profile or LTO-specific metadata).

Serialization and the NVVM Container

The summary is serialized into bitcode by sub_1535340 (writeModuleSummary, 26 KB). This function writes a MODULE_STRTAB_BLOCK and GLOBALVAL_SUMMARY_BLOCK into the LLVM bitcode stream using standard bitcode encoding (VBR integers, abbreviation-driven records). The strings "ThinLTO" and "Unexpected anonymous function when writing summary" appear in this function.

On the reading side, sub_150B5F0 (parseModuleSummaryIndex, 63 KB) and sub_9EBD80 (parseGlobalSummaryBlock, 82 KB) deserialize the summary from bitcode back into the in-memory ModuleSummaryIndex. These parsers handle GUID hashes, function/alias/global summaries, and module paths.

The bitcode writer at sub_1538EC0 writes the producer string as "LLVM7.0.1" despite CICC being built on LLVM 20.0.0 internally -- this is the NVVM IR compatibility layer. The summary blocks are embedded in this bitcode stream alongside the IR, so the NVVM container format (see NVVM Container) carries both the IR and its summary in a single bitcode file.

Import Priority System

The 4-level priority system is the primary extension over upstream LLVM's binary importable/not-importable model. Upstream uses GlobalValueSummary::ImportKind which is essentially a boolean; NVIDIA introduces graduated priority levels that feed a floating-point threshold multiplier in the importer.

Level | Value | Meaning | Importer behavior
0 | 0b000 | Not importable | Never imported
1 | 0b001 | Low priority | Threshold multiplied by cold multiplier (dword_4FAACC0)
2 | 0b010 | Standard | Threshold multiplied by default multiplier (dword_4FAB040)
3 | 0b011 | Force-import | Threshold multiplied by hot multiplier (dword_4FAAE80)

The importer at sub_1853180 converts the integer base threshold to float, multiplies by the per-priority-level constant, converts back to integer, and compares against the function's cost from the summary (stored at offset 0x40 in the summary entry). A fourth multiplier (dword_4FAADA0) handles "critical" priority (priority class 4 in the importer's switch), though the summary builder only produces levels 0--3.

For comdat/linkonce symbols discovered during Phase 3, a special minimum priority applies:

min_priority = 3 * (dword_4F87C60 != 2) + 1;
// dword_4F87C60 == 2: min_priority = 1 (conservative)
// dword_4F87C60 != 2: min_priority = 4 (aggressive import)
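Putting the multipliers and the comdat floor together, a hedged Python model of the importer's decision (the multiplier values are placeholders; the real values live in the float globals dword_4FAACC0, dword_4FAB040, dword_4FAAE80, and dword_4FAADA0):

```python
# Placeholder multipliers: cold, default, hot, critical.
MULTIPLIER = {1: 0.5, 2: 1.0, 3: 2.0, 4: 4.0}

def should_import(base_threshold: int, priority: int, summary_cost: int) -> bool:
    """Model of sub_1853180: int threshold -> float, scale by priority
    level, compare against the function's cost from the summary."""
    if priority == 0:
        return False                    # level 0 is never importable
    scaled = int(float(base_threshold) * MULTIPLIER[priority])
    return summary_cost <= scaled

def comdat_min_priority(override: int) -> int:
    """min_priority = 3 * (dword_4F87C60 != 2) + 1."""
    return 3 * (override != 2) + 1
```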

Hash Table Infrastructure

The builder manages multiple open-addressing hash tables with different entry sizes. All use the standard DenseMap pointer hash and growth policy; see Hash Table and Collection Infrastructure for the common implementation.

Table | Entry size | Probe strategy | Purpose
Primary (v384--v387) | 16 bytes | Linear probing | Main summary entries (ptr + metadata)
Secondary (v388--v393) | 8 bytes | Linear probing | Forward-declared symbol GUIDs
GUID dedup (v406--v407) | 8 bytes | Linear scan + memmove | Deduplication during merge
Seen set (v451--v455) | Variable | Flat array or hash | Tracks processed GlobalValues

The "seen set" has two modes selected by v455: when v455 = 1, it uses a flat inline buffer at v456 with HIDWORD(v453) as the count; when v455 = 0, it switches to a hash table via sub_C8CA60. This dual-mode design optimizes for the common case of small modules (flat scan is faster when count is low) while scaling to large modules.

Rehash strategy: new_capacity = max(64, next_power_of_2(4 * current_count)). The power-of-2 is computed via _BitScanReverse. If the new capacity equals the old, the table is cleared in-place via memset to the empty sentinel (0xFF for 8-byte entries, 0xF8 for 16-byte entries). Otherwise the old buffer is freed and a new one allocated via sub_C7D670 (aligned_alloc(8, size)).
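The growth rule is easy to model; a minimal sketch, assuming _BitScanReverse is just used for a power-of-two round-up:

```python
def next_table_capacity(current_count: int) -> int:
    """new_capacity = max(64, next_power_of_2(4 * current_count))."""
    target = 4 * current_count
    cap = 1
    while cap < target:        # round up to the next power of two
        cap <<= 1
    return max(64, cap)
```

So a table holding 16 entries stays at the 64-slot floor, while one holding 17 jumps to 128.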

Knobs and Global Variables

Symbol | Type | Default | Effect
dword_4F87C60 | int | 0 | Import priority override: 0 = normal, 1 = force all importable, 2 = conservative mode
qword_4F878A8 | bool | false | When set in ThinLTO mode, forces re-analysis of all referenced-but-undefined symbols
byte_3F871B3 | byte | (varies) | Cross-module GUID namespace prefix, distinguishes same-named symbols across modules
dword_4FAB120 | int | -1 | Global import budget (-1 = unlimited)
dword_4FAA770 | int | 0 | Running count of imports performed
dword_4FAAE80 | float | (varies) | Hot function threshold multiplier
dword_4FAACC0 | float | (varies) | Cold function threshold multiplier
dword_4FAADA0 | float | (varies) | Critical section threshold multiplier
dword_4FAB040 | float | (varies) | Default threshold multiplier
byte_4FAAA20 | bool | false | Enable thinlto_src_module metadata annotation on imported functions

The dword_4F87C60 override is the most impactful knob. Setting it to 1 makes every function importable regardless of its linkage or visibility, which is useful for whole-program optimization but can cause link-time explosions. Setting it to 2 enables conservative mode where comdat symbols get minimal priority (level 1 instead of 4), preventing aggressive cross-module import of weakly-linked symbols.

Comparison with Upstream ModuleSummaryAnalysis

Aspect | Upstream LLVM | CICC NVModuleSummary
Entry point | computeFunctionSummary() | sub_D7D4E0 (2571 lines vs ~400)
Priority levels | Binary (importable or not) | 4 levels (0--3) with float multipliers
Complexity metric | Flat instruction count | 28-bit profile-scaled budget
Call edge annotation | CalleeInfo::HotnessType (4 values) | 136-byte records with full type metadata
Address space awareness | None | Filters device-only (AS 25) from import
Kernel detection | None | Opcode-36 probe for __global__ functions
Declaration re-walk | None | Two-phase merge of declared + defined
CUDA context | None | 6 accumulators for device call patterns
Hash table sizing | LLVM DenseMap | Custom open-addressing with dual-mode seen set
Profile integration | BFI-based hotness | ProfileSummaryInfo scaled budget
Serialization | Standard ModuleSummaryIndex bitcode | Same format, extended fields

The most architecturally significant difference is the priority system. Upstream LLVM makes a binary import/no-import decision based on a single threshold comparison. NVIDIA's 4-level system allows the importer to process functions in priority order (primary/secondary/tertiary passes in sub_1854A20) with different threshold multipliers per level, enabling much finer control over cross-module optimization aggressiveness.

Function Map

Function | Address | Size | Role
NVModuleSummary::buildModuleSummary() -- main builder | 0xD7D4E0 | 74 KB | --
NVModuleSummary::runOnModule() -- LTO driver | 0xD81040 | 56 KB | --
NVModuleSummary::analyzeFunction() | 0xD741C0 | 19 KB | --
NVModuleSummary::processGlobalRef() | 0xD6FF50 | 47 KB | --
NVModuleSummary::collectGlobalInfo() | 0xD6A180 | 21 KB | --
NVModuleSummary::analyzeCallGraph() | 0xD6EA70 | 19 KB | --
NVModuleSummary::visitInstruction() | 0xD7B190 | 9 KB | --
Alias processing helper | 0xD738B0 | 11 KB | --
NVModuleSummary::computeImportCost() | 0xD72D40 | 9 KB | --
NVModuleSummary::resolveReferences() | 0xD64DE0 | 16 KB | --
NVModuleSummary::getTypeMetadata() | 0xD669C0 | 11 KB | --
NVModuleSummary::processTypeId() | 0xD640E0 | 12 KB | --
NVModuleSummary::computeVisibility() | 0xD63080 | 11 KB | --
Summary serialization helper (recursive) | 0xD60CE0 | 15 KB | --
Summary serialization helper | 0xD61E90 | 10 KB | --
NVModuleSummary::packFunctionSummary() -- 14-arg final packer | 0xD77220 | -- | --
NVModuleSummary::addInlineSummary() -- CUDA context collector | 0xD7CF70 | -- | --
NVModuleSummary::addEdge() | 0xD76530 | -- | --
NVModuleSummary::addRef() | 0xD768F0 | -- | --
NVModuleSummary::addSpecialGlobal() (llvm.used etc.) | 0xD76CA0 | -- | --
NVModuleSummary::addTypeRef() | 0xD76D40 | -- | --
NVModuleSummary::computeNextPrime() -- hash table sizing | 0xD76FC0 | -- | --
NVModuleSummary::getModuleHash() | 0xD771D0 | -- | --
NVModuleSummary::destroyEdgeList() | 0xD77880 | -- | --
NVModuleSummary::destroyRefList() | 0xD786F0 | -- | --
NVModuleSummary::compareImportPriority() | 0xD788E0 | -- | --
NVModuleSummary::computeSymbolHash() | 0xD789D0 | -- | --
NVModuleSummary::resizeTable() | 0xD78B00 | -- | --
NVModuleSummary::normalizeImportPriority() | 0xD78C90 | -- | --
NVModuleSummary::addCallEdge() | 0xD793D0 | -- | --
Rehash/resize (next power-of-2, min 64) | 0xD79200 | -- | --
NVModuleSummary::copyTable() | 0xD7A410 | -- | --
NVModuleSummary::mergeSymbols() | 0xD7A690 | -- | --
NVModuleSummary::computeFinalOrder() | 0xD7AC80 | -- | --
NVModuleSummary::getOrInsertSummary() | 0xD7BAA0 | -- | --
NVModuleSummary::visitGlobalValue() | 0xD7BD50 | -- | --
NVModuleSummary::getImportKind() | 0xD84370 | -- | --
NVModuleSummary::isImported() | 0xD84440 | -- | --
NVModuleSummary::isImportCandidate() | 0xD84450 | -- | --
NVModuleSummary::processInliningDecisions() | 0xD8B020 | 21 KB | --
NVModuleSummary::computeInlineBenefit() | 0xD8C2B0 | 8 KB | --
NVModuleSummary::buildCalleeList() | 0xD8D9B0 | 9 KB | --
NVModuleSummary::cloneModuleSummary() | 0xD8E7E0 | 32 KB | --
GUID lookup/creation (namespace-aware) | 0x9CA390 | -- | --
Get attribute group by kind from GlobalValue | 0xB91C10 | -- | --
ProfileSummaryInfo::getProfileCount() | 0xFDD860 | -- | --
ProfileSummaryInfo::getHotThreshold() | 0xFDC4B0 | -- | --
writeModuleSummary() -- bitcode serializer | 0x1535340 | 26 KB | --
parseModuleSummaryIndex() -- bitcode deserializer | 0x150B5F0 | 63 KB | --

Cross-References

Inliner Cost Model

CICC v13.0 contains four parallel inliner cost models -- an architecturally unusual design that reflects both the historical evolution of NVIDIA's compiler and the fundamental differences between GPU and CPU inlining economics. The NVIDIA custom inliner at 0x1864060 (75 KB, 2135 decompiled lines) uses a 20,000-unit budget that is 89x the upstream LLVM default of 225. Roughly 60% of the custom inliner's code computes type-size comparisons for argument coercion cost, because on GPU the dominant cost of a function call is not instruction count but .param address-space marshaling. Alongside the custom model, CICC also links the standard LLVM InlineCostAnalysis at 0x30DC7E0 (51 KB), a New Pass Manager CGSCC inliner at 0x2613930 (69 KB) with ML-based advisory support, and an NVPTX target-specific cost modifier at 0x38576C0 (58 KB) that injects a +2000 bonus for GPU intrinsics.

Model A: NVIDIA custom | sub_1864060 (0x1864060, 75 KB, CGSCC)
Model B: LLVM standard | sub_30DC7E0 (0x30DC7E0, 51 KB, InlineCostAnalysis)
Model C: New PM CGSCC | sub_2613930 (0x2613930, 69 KB, recursive SCC)
Model D: NVPTX target | sub_38576C0 (0x38576C0, 58 KB, opcode-based)
Knob constructor | ctor_186_0 (0x4DBEC0, 14 KB)
LLVM knob constructor | ctor_625_0 / ctor_715_0 (0x58FAD0, 27 KB)

Why Four Inliner Models

The four models are not truly interchangeable alternatives -- they serve overlapping but distinct roles in the compilation pipeline:

Model A is the original NVIDIA inliner, predating the LLVM 14+ New Pass Manager. It operates on NVIDIA's internal NVVM IR node format (not LLVM IR), walks the callee body with bespoke type-size arithmetic, and is the only model that understands .param-space argument coercion costs. It runs inside the legacy CGSCC inliner framework via sub_186CA00 (Inliner::inlineCallsImpl). When CICC runs in its default optimization pipeline, this is the model that makes the bulk of inlining decisions.

Model B is upstream LLVM's InlineCostAnalysis::analyzeCall, compiled into CICC essentially unmodified. It uses LLVM's instruction-counting cost model with a 225-unit default threshold, the inline-threshold, inlinedefault-threshold, and PGO deferral knobs. It exists because CICC links the full LLVM codebase and certain LLVM passes (e.g., the always-inliner, sample-profile inliner) call into getInlineCost / analyzeCall directly.

Model C is the New Pass Manager's CGSCC inliner at 0x2613930. It handles recursive SCC splitting, carries the function-inline-cost-multiplier knob for penalizing recursive functions, and can delegate decisions to an InlineAdvisor (sub_2609820, 57 KB). The advisor supports three modes registered in the pipeline parser: default, development (training), and release (inference). The ML model inference path lives at sub_29B2CD0 / sub_29B4290. CICC registers the pipeline string "inliner-ml-advisor-release" for the release mode (parser slot 49).

Model D is an NVPTX target-specific cost modifier at 0x38576C0 that adjusts inline costs based on opcode analysis. Its primary contribution is a +2000 cost bonus for functions containing opcode tag 9 instructions (see Opcode Tag 9 Bonus below). This runs as a layer on top of whichever primary cost model is active, modifying the accumulated cost at offset+72 and comparing against the threshold at offset+76.

The historical layering is: NVIDIA built Model A first for their custom NVVM IR, then LLVM matured its own inliner (Model B), then the New PM arrived with ML advisory (Model C), and NVPTX target hooks added GPU-specific adjustments (Model D). Rather than consolidating, NVIDIA kept all four because each handles a different phase or code path in the pipeline.

The .param Address Space Problem

Understanding the NVIDIA inliner requires understanding why GPU function calls are so much more expensive than CPU calls. On x86, a function call requires placing arguments in registers or on the stack, a CALL instruction, and a RET; the overhead is typically 5-20 cycles.

On NVIDIA GPUs, there is no hardware call stack for registers. The PTX calling convention works through the .param address space:

  1. Caller declares .param variables via DeclareParam (opcode 505) or DeclareScalarParam (opcode 506) for each argument.
  2. Caller stores argument values into .param space via st.param instructions (opcodes 571-573 for StoreV1/V2/V4).
  3. Caller emits the call instruction referencing the .param declarations.
  4. Callee loads arguments from .param space via ld.param instructions.
  5. Return values come back through .param space via ld.param (opcodes 515-516, 568-570 for LoadRetParam / LoadV1/V2/V4).
  6. Byval arguments (structs passed by value) copy the entire struct to .param space field by field.

Each function call therefore generates O(n) st.param + O(n) ld.param instructions where n is the number of arguments, plus register save/restore if the callee needs more registers than are available (spills go to local memory, which is device DRAM -- hundreds of cycles). Additionally, call boundaries destroy instruction scheduling freedom, prevent cross-boundary register allocation, and create branch divergence hazards at the call/return sites.

This is why NVIDIA's default inline budget of 20,000 is not as aggressive as it sounds: inlining a function with 50 instructions but 8 struct arguments might save hundreds of cycles of .param marshaling overhead.
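As a back-of-the-envelope illustration (a toy model of the marshaling cost described above, not anything recovered from the binary): counting one st.param/ld.param pair per scalar argument, and one pair per field of a byval struct, gives a rough marshaling instruction count per call.

```python
def param_marshaling_instrs(arg_field_counts):
    """Toy cost model: arg_field_counts has one entry per argument --
    1 for a scalar, N for a byval struct with N fields. Each field costs
    a st.param in the caller plus an ld.param in the callee."""
    return sum(2 * n for n in arg_field_counts)
```

Three scalar arguments cost 6 marshaling instructions; eight 16-field structs cost 256, before counting register spills -- which is why inlining such a call can pay for hundreds of units of budget.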

Model A: NVIDIA Custom Inliner

Knob Inventory

All knobs are registered in ctor_186_0 at 0x4DBEC0:

Knob | Type | Default | Purpose
inline-budget | int | 20,000 | Per-caller inlining cost budget
inline-total-budget | int | (none) | Global total budget across all callers in the module
inline-adj-budget1 | int | (none) | Secondary per-caller budget, dynamically adjusted
nv-inline-all | bool | off | Force inline every function call unconditionally
profuseinline | bool | off | Verbose inlining diagnostics (NVIDIA profuse framework)
inline-switchctrl | int | (none) | Switch-statement inlining heuristic tuning
inline-numswitchfunc | int | (none) | Penalty based on number of switch stmts in callee
inline-maxswitchcases | int | (none) | Maximum switch cases before cost penalty applies
disable-inlined-alloca-merging | bool | off | Disable post-inline alloca merging

CLI surface mapping:

User Flag | Routed To
-aggressive-inline | -inline-budget=40000 (2x default)
-disable-inlining | -disable-inlining
-inline-budget=N | Sets per-caller budget directly
-inline-info | Diagnostic flag for inline decisions

Entry and Early Bail-Outs

The entry point sub_1864060 takes four arguments: a1 = function/callsite node, a2 = context, a3 = callback, a4 = data pointer. The function performs a series of eligibility checks before any cost computation:

Intrinsic name check. Calls sub_1649960(a1) to retrieve the function name. If the name starts with the 4-byte magic 0x6D6C6C6C (an LLVM intrinsic prefix) followed by '.', returns 0 immediately. LLVM intrinsics are never inlined through this path.

Pre-analysis walk. Initializes a 32-byte inline-analysis state struct via sub_1ACF5D0, then calls sub_1ACF600 which delegates to sub_1ACF0B0. This walks the callee body to collect basic metrics (instruction count, call count, basic block count). If the pre-analysis returns nonzero, the function is not analyzable.

Linkage check. Reads the byte at a1+32. The low nibble encodes linkage class: values 7 (linkonce_odr) and 8 (weak_odr) are eligible for inlining. Bits [7:6] encode visibility: 0x2 = hidden (OK), 0x1 = protected (bail). The function also requires byte at a1+16 == 3 (function definition, not declaration), bit 0 of byte at a1+80 == 0 (no noinline attribute), and sub_15E4F60(a1) returning false (no optnone).

function shouldInline(callsite):
    name = getName(callsite.callee)
    if name starts with LLVM_INTRINSIC_PREFIX:
        return NEVER_INLINE

    state = initAnalysisState()
    if preAnalyze(callsite.callee, state) != 0:
        return NEVER_INLINE

    linkage = callsite.callee.linkage
    if linkage not in {linkonce_odr, weak_odr}:
        return NEVER_INLINE
    if callsite.callee.isDeclaration:
        return NEVER_INLINE
    if callsite.callee.hasNoinline:
        return NEVER_INLINE
    if callsite.callee.hasOptnone:
        return NEVER_INLINE

    // ... proceed to cost computation

Callee Body Scan

After eligibility checks pass, the inliner walks the callee's operand/argument list (linked list at a1+8). Each argument node is classified by its type tag at byte offset +16 via sub_1648700:

Tag Range | Meaning | Action
<= 0x17 | Basic types or call-like | If tag == 5 (phi): recurse into operands, check all > 0x17; otherwise bail
0x36 (54) | Load-like instruction | Collect into loads vector
0x37 (55) | Store-like instruction | Collect into stores vector
0x47 (71, 'G') | Aggregate/GEP | Enter sub-operand scan

The loads and stores are accumulated into two SmallVectors (v357, v360) with initial inline capacity of 4 elements each. These vectors are the input to the argument coercion cost check.

Load-Store Combinatorial Bail-Out

Before proceeding to the expensive type-size computation, the function checks:

if (num_loads * num_stores > 100):
    return BAIL_OUT  // Too expensive argument copy pattern

This prevents inlining functions where argument materialization would create a quadratic load-store explosion. Consider a function taking 4 struct-by-value arguments, each with 30 fields: that is 120 loads times 120 stores = 14,400 combinations, far above the 100 threshold. Without this guard, the type-size computation engine below would take unreasonable time.
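The guard itself is a one-line product check; sketched with the worked example from the text:

```python
LOAD_STORE_LIMIT = 100

def bails_out(num_loads: int, num_stores: int) -> bool:
    """Bail out of cost analysis when the load x store product explodes."""
    return num_loads * num_stores > LOAD_STORE_LIMIT
```

The 4-structs-of-30-fields case (120 loads, 120 stores) trips the guard immediately, while ordinary scalar-argument functions stay far below it.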

Type-Size Computation Engine

The bulk of sub_1864060 -- lines 1140 through 2100, approximately 60% of the function -- is a type-size computation engine. This is the single most distinctive feature of the NVIDIA inliner: where LLVM counts instructions, NVIDIA computes byte-level argument coercion costs.

The engine walks NVVM IR type nodes and computes byte sizes for each argument at both the callsite (actual argument) and the callee (formal parameter). The type tag dispatch is repeated 8+ times across different contexts:

Type Tag | Type | Size Computation
0x01 | half | 16 bits
0x02 | float | 32 bits
0x03 | double | 64 bits
0x04 | fp80 | 80 bits
0x05 | fp128 | 128 bits
0x06 | ppc_fp128 | 128 bits
0x07 | pointer | sub_15A9520(module, 0) for target pointer size
0x08 | array | element_type_size * count (recursive)
0x09 | x86_mmx | 64 bits
0x0A | vector | element_type_size * count (recursive)
0x0B | integer | (dword >> 8) bits
0x0C | function | Recurse (unusual, but handled)
0x0D | struct | sub_15A9930 for layout size
0x0E | packed struct | Manual: 8 * count * align * ceil
0x0F | named type | sub_15A9520(module, type_id)
0x10 | opaque/token | element_type_size * count

The byte-size formula applied uniformly is:

byte_size = (multiplier * bit_width + 7) >> 3
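For example, in Python, with the rounding behavior made explicit:

```python
def byte_size(multiplier: int, bit_width: int) -> int:
    """(multiplier * bit_width + 7) >> 3 -- total bits rounded up to bytes."""
    return (multiplier * bit_width + 7) >> 3
```

A 4-element vector of 32-bit floats is 16 bytes, a lone i1 still occupies a full byte, and a half (16 bits) is 2 bytes.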

The core comparison at the heart of the cost model:

if callee_arg_size > callee_formal_size:
    // Argument is being widened at the call boundary
    // This costs extra st.param + ld.param instructions
    // Proceed to next comparison level (accumulate cost)
else:
    // Sizes match or shrink -- this argument pair is OK

Arguments are processed in groups of 4 (loop unrolled at line 2098: v142 += 4, --v306 where v306 = num_stores * 8 >> 5, i.e., groups of 4 store arguments). Remainder arguments (1-3 after the groups-of-4 loop) are handled by the type compatibility check function sub_185CCC0 which calls sub_15CCEE0 for type matching.

Struct Layout Walk

The helper sub_185B2A0 (3 KB) performs a stack-based DFS walk of struct type trees to count fields. It handles pointer types (tag 15), struct types (tag 13/14), and array types (tag 16). The walk has a hard depth limit of 20 levels, preventing runaway recursion on deeply nested struct definitions.
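The walk can be sketched as an iterative DFS with the recovered depth cap; the toy type representation and tag names here are illustrative, not the binary's actual node layout:

```python
MAX_DEPTH = 20  # hard limit observed in sub_185B2A0

def count_fields(ty) -> int:
    """Iterative (stack-based) DFS over a nested struct tree, counting leaf
    fields. Subtrees deeper than MAX_DEPTH are abandoned, mirroring the cap."""
    count = 0
    stack = [(ty, 0)]
    while stack:
        node, depth = stack.pop()
        if depth > MAX_DEPTH:
            continue  # runaway nesting: give up on this subtree
        kind, children = node
        if kind == "struct":
            for child in children:
                stack.append((child, depth + 1))
        else:  # scalar, pointer, or array leaf
            count += 1
    return count

inner = ("struct", [("int", []), ("float", [])])
outer = ("struct", [inner, ("ptr", [])])
assert count_fields(outer) == 3
```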

Argument Coercion Check

The helper sub_185D7C0 (9 KB) classifies each callee operand and determines whether argument coercion is needed at the inline callsite. For each operand in the callee's argument linked list at a1+8, it:

  1. Reads the instruction tag via sub_1648700.
  2. Computes the formal parameter type size.
  3. Computes the actual argument type size at the callsite.
  4. If sizes differ, flags this argument as requiring coercion (extra cost).
  5. If the argument is a struct, invokes the struct layout walk to count individual field copies.
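The classification in steps 3-4 reduces to a size comparison per (actual, formal) pair; this sketch charges a placeholder unit cost per coercion, since the real per-pair charge is internal to the cost engine:

```python
def coercion_cost(args, cost_per_coercion=1):
    """Sum a unit cost for every (actual, formal) byte-size pair where the
    argument must be widened through .param space at the call boundary.
    cost_per_coercion is an assumed placeholder, not a recovered constant."""
    return sum(cost_per_coercion
               for actual, formal in args
               if actual > formal)

# (8 -> 4) and (16 -> 8) widen and are charged; (4 -> 4) matches and is free.
assert coercion_cost([(8, 4), (4, 4), (16, 8)]) == 2
```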

Callsite Transformation

When the callee qualifies for "alias inline" (replacing a call with direct body substitution), the function:

  1. Allocates a new 88-byte IR node via sub_1648A60(88, 1).
  2. Builds a function reference node via sub_15F8BC0.
  3. Builds a call replacement node via sub_15F9660.
  4. Walks callee operands to collect phi nodes into a worklist.
  5. For each phi: copies via sub_1596970, updates operands via sub_15F2120, replaces references via sub_1648780.
  6. Deletes original phis via sub_159D850.
  7. Performs final callsite replacement via sub_164D160 + sub_15E55B0.

Switch Statement Heuristics

Three dedicated knobs control inlining of switch-heavy functions. On GPU, large switch statements are particularly costly because:

  • Branch divergence: Each thread in a warp may take a different case, serializing execution.
  • No branch prediction hardware: Every divergent branch pays full penalty.
  • Control flow reconvergence: The hardware must synchronize threads after the switch, wasting cycles.

The inline-switchctrl knob tunes the general heuristic sensitivity. inline-numswitchfunc penalizes functions containing many switch statements. inline-maxswitchcases sets a case-count ceiling beyond which a switch-heavy callee is considered too expensive to inline regardless of other factors.
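One plausible way the three knobs compose, sketched with assumed default values and assumed combining logic (only the knob names are recovered from the binary):

```python
INLINE_SWITCHCTRL = 1        # heuristic sensitivity (assumed scale)
INLINE_NUMSWITCHFUNC = 4     # assumed default: switch count before penalty
INLINE_MAXSWITCHCASES = 64   # assumed default: hard case-count ceiling

def switch_verdict(num_switches: int, max_cases_in_any_switch: int) -> str:
    """Gate a callee on its switch statistics before the cost model runs."""
    if max_cases_in_any_switch > INLINE_MAXSWITCHCASES:
        return "never"      # too divergence-heavy regardless of other factors
    if num_switches > INLINE_NUMSWITCHFUNC:
        return "penalize"   # add cost scaled by inline-switchctrl sensitivity
    return "neutral"

assert switch_verdict(1, 8) == "neutral"
assert switch_verdict(6, 8) == "penalize"
assert switch_verdict(1, 200) == "never"
```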

nv-inline-all: Force-All Mode

The nv-inline-all knob bypasses cost analysis entirely and forces inlining of every call. This is used for specific compilation modes where the call graph must be completely flattened:

  • OptiX ray tracing: The hardware intersection pipeline requires a single monolithic function. All user-defined intersection, closest-hit, any-hit, and miss programs must be inlined into a single continuation function.
  • Aggressive LTO: When doing whole-program optimization with small modules, flattening removes all call overhead.

Multi-Level Budget System

NVIDIA uses a three-level budget to control inlining granularity:

  • inline-budget (default 20,000): Per-caller limit. Caps how much code can be inlined into a single function, preventing any one function from becoming unreasonably large.
  • inline-total-budget: Module-wide limit. Caps the total amount of inlining across all callers in the compilation unit.
  • inline-adj-budget1: A secondary per-caller limit that may be dynamically adjusted based on context -- for example, kernel entry points (__global__ functions) may receive a higher adjusted budget because they are the outermost scope and benefit most from aggressive inlining.
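A minimal sketch of how the budget gate could work, assuming each level tracks cost already spent (the default for inline-budget is from the binary; the bookkeeping structure is assumed):

```python
INLINE_BUDGET = 20_000  # per-caller default; 40,000 under -aggressive-inline

def may_inline(cost, caller_spent, module_spent,
               caller_budget=INLINE_BUDGET, total_budget=None):
    """Every applicable budget level must have headroom for this callsite."""
    if caller_spent + cost > caller_budget:
        return False  # per-caller (inline-budget) exhausted
    if total_budget is not None and module_spent + cost > total_budget:
        return False  # module-wide (inline-total-budget) exhausted
    return True

assert may_inline(5_000, 0, 0)
assert not may_inline(5_000, 18_000, 0)                        # caller cap hit
assert not may_inline(5_000, 0, 98_000, total_budget=100_000)  # module cap hit
```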

The threshold adjustment helper at sub_1868880 (12 KB) modifies thresholds based on calling context through pure arithmetic on cost/threshold values (no string evidence, entirely numeric).

Alloca Merging

The disable-inlined-alloca-merging knob controls post-inline stack allocation merging. On GPU, "stack" means local memory, which is device DRAM (hundreds of cycles latency). Merging allocas from inlined callees with the caller's allocations reduces total local memory consumption. Lower local memory usage directly improves occupancy (more concurrent thread blocks per SM). The default is to enable merging.

Model B: LLVM Standard InlineCostAnalysis

The standard LLVM InlineCostAnalysis::analyzeCall at 0x30DC7E0 (51 KB) is compiled into CICC from upstream LLVM sources. Its knobs are registered in ctor_625_0 / ctor_715_0 at 0x58FAD0 (27 KB of option registration, an unusually large constructor due to the 40+ individual cost parameter registrations).

Key upstream LLVM knobs present in CICC:

Knob                                  Default  Purpose
inline-threshold                      225      Base inlining threshold
inlinedefault-threshold               225      Default when no hint/profile
inlinehint-threshold                  325      Threshold for functions with the inlinehint attribute
inline-cold-callsite-threshold        45       Threshold for cold callsites
inlinecold-threshold                  45       Threshold for functions with the cold attribute
hot-callsite-threshold                3000     Threshold for hot callsites (PGO)
locally-hot-callsite-threshold        525      Threshold for locally hot callsites
inline-instr-cost                     5        Cost per instruction
inline-call-penalty                   25       Penalty per callsite in callee
inline-memaccess-cost                 0        Cost per load/store
inline-savings-multiplier             8        Multiplier for cycle savings
inline-savings-profitable-multiplier  4        Multiplier for profitability check
inline-size-allowance                 100      Max callee size inlined without savings proof
inline-cost-full                      false    Compute full cost even when over threshold
inline-enable-cost-benefit-analysis   false    Enable cost-benefit analysis
inline-deferral                       (PGO)    Defer inlining in cold paths
inline-remark-attribute               (off)    Emit inline remarks

The LLVM model fundamentally counts instructions (at inline-instr-cost = 5 units each) and subtracts savings from constant propagation, dead code elimination after argument specialization, and simplified control flow. This instruction-counting approach is appropriate for CPUs where call overhead is small and code size is the primary concern. It is inadequate for GPUs where argument marshaling dominates.
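A back-of-envelope sketch of this instruction-counting model, using the defaults from the knob table above (the savings term stands in for LLVM's constant-propagation and DCE credits):

```python
INSTR_COST = 5      # inline-instr-cost
CALL_PENALTY = 25   # inline-call-penalty
THRESHOLD = 225     # inline-threshold

def llvm_should_inline(num_instrs: int, num_calls: int, savings: int) -> bool:
    """Price the callee by instruction count, penalize nested callsites,
    credit expected simplifications, and compare against the threshold."""
    cost = num_instrs * INSTR_COST + num_calls * CALL_PENALTY - savings
    return cost < THRESHOLD

assert llvm_should_inline(30, 1, 0)       # 175 < 225: inline
assert not llvm_should_inline(50, 0, 0)   # 250 >= 225: reject
assert llvm_should_inline(50, 0, 50)      # savings bring it back under budget
```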

Model C: New PM CGSCC Inliner

The New Pass Manager inliner at 0x2613930 (69 KB) handles recursive SCC processing and integrates with LLVM's InlineAdvisor framework. Its key differentiation is the function-inline-cost-multiplier knob that penalizes recursive function inlining -- a scenario the NVIDIA custom inliner (Model A) does not handle.

The InlineAdvisor at sub_2609820 (57 KB) supports three modes:

Mode         Pipeline String               Behavior
default      "inline-advisor"              Heuristic-based (uses Model B cost analysis)
development  (training path)               Feature extraction for ML model training
release      "inliner-ml-advisor-release"  ML model inference via sub_29B2CD0 / sub_29B4290

The ML inference path extracts features from the callsite and callee (instruction count, call depth, loop nesting, etc.) and feeds them through a model to produce an inline/no-inline decision. This is standard upstream LLVM ML inlining infrastructure compiled into CICC; there is no evidence of NVIDIA-custom ML model weights, though NVIDIA could supply custom weights via the enable-ml-inliner knob (registered as an enum: {default, development, release}).

NVPTX Opcode Tag 9 Bonus (+2000)

Model D at sub_38576C0 modifies inline costs based on NVPTX-specific opcode analysis. The key logic:

for each instruction in callee:
    tag = getOpcodeTag(instruction)
    if ((tag >> 4) & 0x3FF) == 9:
        inline_cost += 2000
    // ... accumulate other per-instruction costs
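The bitfield test above can be made concrete; the 10-bit class field starting at bit 4 matches the recovered `(tag >> 4) & 0x3FF` extraction, while the meaning of the low 4 bits is an assumption:

```python
TAG9_BONUS = 2000

def opcode_class(tag_word: int) -> int:
    """Extract the 10-bit opcode class field starting at bit 4."""
    return (tag_word >> 4) & 0x3FF

def per_instr_bonus(tag_word: int) -> int:
    return TAG9_BONUS if opcode_class(tag_word) == 9 else 0

# A word encoding class 9 in bits 4..13 (low 4 bits assumed to be flags):
assert per_instr_bonus((9 << 4) | 0x3) == 2000
assert per_instr_bonus(9) == 0  # here the class field is 0, not 9
```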

The state layout of the cost analyzer object:

Offset  Field                      Purpose
+72     Accumulated cost           Running sum of per-instruction costs
+76     Threshold                  Budget for this callsite
+120    Per-instruction cost (lo)  Cost array element (low)
+128    Per-instruction cost (hi)  Cost array element (high)

The +2000 bonus for tag 9 opcodes encourages inlining of functions containing specific GPU operations -- likely tensor core instructions, warp-level intrinsics, or other operations that benefit significantly from being visible to the register allocator and instruction scheduler within the caller's scope. The bonus is large enough (equivalent to inlining ~400 regular LLVM instructions at cost 5 each) to override most size-based objections.

NVIDIA vs. LLVM: Complete Comparison

Feature                    NVIDIA (Model A)                      LLVM (Model B)
Default threshold          20,000                                225
Aggressive threshold       40,000                                Varies by -O level
Primary cost metric        Argument type-size coercion           Instruction count
Cost per instruction       N/A (not instruction-based)           5 units
Struct handling            Field-by-field walk (depth limit 20)  Aggregate flat cost
GPU opcode bonus           +2000 for tag 9                       N/A
Load x store bail-out      > 100 combinations                    N/A
Switch heuristics          3 dedicated knobs                     1 (case-cluster-penalty)
Budget system              Per-caller + module total + adjusted  Per-callsite only
Diagnostic knob            profuseinline                         inline-remark-attribute
Force-all mode             nv-inline-all                         inline-all-viable-calls (hidden)
ML-based advisor           No (separate path via Model C)        Yes (InlineAdvisor)
Recursive cost multiplier  No                                    function-inline-cost-multiplier
Alloca merging control     disable-inlined-alloca-merging        N/A
Call penalty               Implicit (.param marshaling cost)     25 units per callsite
PGO integration            No evidence                           inline-deferral, hot-callsite-threshold

Decision Flowchart

The complete inlining decision flow through Model A:

                     CallSite arrives at sub_186CA00
                              |
                   sub_186B510: check remarks
                              |
                   sub_1864060: shouldInline
                              |
                     +--------+--------+
                     |                 |
              Name is LLVM       Name is user
              intrinsic?         function
                     |                 |
                NEVER INLINE     Init analysis state
                                 sub_1ACF5D0
                                      |
                                 Pre-analyze callee
                                 sub_1ACF600
                                      |
                              +-------+-------+
                              |               |
                         Returns 0       Returns != 0
                         (analyzable)    (cannot analyze)
                              |               |
                     Check linkage      NEVER INLINE
                     (7=linkonce_odr
                      8=weak_odr)
                              |
                  +-----------+-----------+
                  |                       |
            Eligible                Not eligible
                  |                  (wrong linkage,
             Check noinline,         declaration,
             optnone attrs           protected vis)
                  |                       |
            +-----+-----+          NEVER INLINE
            |           |
         Has attr    No attr
            |           |
       NEVER INLINE  Walk callee body
                     collect loads/stores
                              |
                     loads * stores > 100?
                        +-----+-----+
                        |           |
                       Yes         No
                        |           |
                   BAIL OUT    Type-size computation
                               (60% of function)
                                    |
                              Compute per-argument
                              coercion cost
                                    |
                              Total cost < inline-budget?
                                 +-----+-----+
                                 |           |
                                Yes         No
                                 |           |
                              INLINE     DO NOT INLINE
                              Transform callsite
                              sub_1648A60 / sub_15F8BC0

Call Graph

sub_186CA00  Inliner::inlineCallsImpl (CGSCC SCC walk)
  +-> sub_186B510  Inline decision with remarks
      +-> sub_1864060  shouldInline / cost computation (THIS)
          +-> sub_1ACF5D0  Inline analysis state init
          +-> sub_1ACF600  Pre-analysis callee walk
          |   +-> sub_1ACF0B0  Metric collection
          +-> sub_185FD30  Argument materialization cost (5 KB)
          +-> sub_185E850  Post-inline cleanup assessment (9 KB)
          +-> sub_185B2A0  Struct layout walk, depth limit 20 (3 KB)
          +-> sub_185D7C0  Argument matching / coercion (9 KB)
          +-> sub_185B9F0  Recursive operand simplification (5 KB)
          +-> sub_185CCC0  Type compatibility check (4 KB)
          +-> sub_18612A0  GlobalOpt integration (65 KB, conditional)
  +-> sub_1868880  Inline threshold adjustment (12 KB)
  +-> sub_1866840  Post-inline callsite update (42 KB)

Why 89x the LLVM Budget

The 20,000 vs. 225 ratio sounds extreme, but the economics are different:

CPU call overhead is approximately 5-20 cycles (push/pop registers, branch prediction handles the rest). A function with 50 instructions that is not inlined costs perhaps 60-70 cycles total. Inlining saves ~15 cycles. The savings must justify the I-cache pressure increase.

GPU call overhead includes: (1) declaring .param variables for every argument, (2) st.param for each argument value, (3) ld.param in the callee for each argument, (4) register save/restore to local memory (device DRAM, 200-800 cycle latency) if the callee's register demand exceeds what is available, (5) loss of instruction scheduling across the call boundary, (6) branch divergence at call/return. For a function with 8 arguments, the .param overhead alone is 16+ memory operations. With register spilling, a single function call can cost 1000+ cycles.

Furthermore, GPU functions tend to be small (typically 10-100 instructions for device helper functions). The NVIDIA cost model does not count instructions at all -- it counts the argument marshaling cost. A function with 200 instructions but 2 scalar arguments is cheap to call; a function with 10 instructions but 8 struct arguments is expensive. The 20,000 budget reflects this: it is not 89x more aggressive in inlining large functions; it is calibrated for a cost model where the per-argument coercion cost dominates rather than instruction count.

With -aggressive-inline (budget 40,000, i.e., 178x the LLVM default), NVIDIA targets workloads like OptiX where complete flattening is desired but nv-inline-all is too blunt (it ignores all cost analysis).

What Upstream LLVM Gets Wrong for GPU

Upstream LLVM's inliner cost model was built for x86/AArch64 where function call overhead is small and code size is the primary inlining constraint. On GPU, every assumption is wrong:

  • Upstream assumes a 225-instruction budget is sufficient. The default inline-threshold of 225 reflects CPU economics where a function call costs 5-20 cycles (register push/pop + branch). On GPU, a single function call with 8 struct arguments generates 16+ .param-space memory operations, potential register spills to device DRAM (200-800 cycle latency), loss of cross-boundary scheduling, and branch divergence hazards. NVIDIA's 20,000-unit budget (89x upstream) is calibrated for this reality, not because GPU code is more aggressive about inlining large functions.
  • Upstream counts instructions as the primary cost metric. LLVM prices each instruction at 5 units and subtracts savings from constant propagation and dead code elimination. NVIDIA's custom inliner (Model A) does not count instructions at all -- 60% of its 75KB body computes byte-level argument type-size coercion costs, because on GPU the dominant cost of a function call is .param address-space marshaling, not instruction count.
  • Upstream has no concept of .param-space argument passing cost. CPU calling conventions pass arguments in registers (nearly free) or via L1-cached stack (3-5 cycles). On GPU, every argument requires explicit DeclareParam + st.param (caller) + ld.param (callee) sequences. A function with 10 instructions but 8 struct arguments is more expensive to call than one with 200 instructions and 2 scalar arguments. Upstream's model gets this exactly backwards.
  • Upstream uses a single per-callsite budget. NVIDIA uses a three-level system: per-caller budget (inline-budget), module-wide total budget (inline-total-budget), and a dynamically adjusted secondary budget (inline-adj-budget1) that can give kernel entry points higher limits. This multi-level approach prevents any single caller from bloating while still allowing aggressive inlining where it matters most.
  • Upstream has no GPU intrinsic awareness. NVIDIA's Model D applies a +2000 cost bonus for functions containing opcode tag 9 instructions (likely tensor core or warp-level intrinsics), because these operations benefit enormously from being visible to the register allocator and scheduler within the caller's scope. Upstream LLVM has no mechanism to express "this function contains operations that are disproportionately valuable to inline."

Key Addresses

Address    Size   Function
0x1864060  75 KB  shouldInline / inline cost computation
0x186CA00  61 KB  Inliner::inlineCallsImpl (CGSCC core)
0x186B510  20 KB  Inline decision with remarks
0x1866840  42 KB  Post-inline callsite update
0x1868880  12 KB  Inline threshold adjustment
0x185FD30  5 KB   Argument materialization
0x185E850  9 KB   Post-inline cleanup
0x185B2A0  3 KB   Struct layout walk (depth 20)
0x185D7C0  9 KB   Argument coercion check
0x185B9F0  5 KB   Recursive operand simplification
0x185CCC0  4 KB   Type compatibility check
0x18612A0  65 KB  GlobalOpt integration
0x1ACF5D0  --     Inline analysis state init
0x1ACF600  --     Pre-analysis callee walk
0x30DC7E0  51 KB  InlineCostAnalysis::analyzeCall (LLVM)
0x2613930  69 KB  New PM CGSCC inliner
0x2609820  57 KB  Inline advisor / ML inliner
0x38576C0  58 KB  NVPTX target-specific cost modifier
0x4DBEC0   14 KB  NVIDIA inliner knob registration
0x58FAD0   27 KB  LLVM InlineCost option registration

Reimplementation Checklist

  1. Type-size-based cost model (60% of the inliner). Implement the argument coercion cost engine that walks NVVM IR type nodes (16 type tags: half through opaque/token) to compute byte-level sizes for both callsite actuals and callee formals, using the formula byte_size = (multiplier * bit_width + 7) >> 3. Flag arguments where callee_arg_size > callee_formal_size as requiring .param-space widening.
  2. 20,000-unit budget system. Implement the three-level budget: per-caller inline-budget (default 20,000), module-wide inline-total-budget, and dynamically adjusted inline-adj-budget1 (kernel entry points may receive higher limits). Include the -aggressive-inline mapping to budget 40,000 and nv-inline-all force-all mode.
  3. Early bail-out chain. Implement the eligibility checks in order: LLVM intrinsic name prefix rejection, pre-analysis callee walk (instruction/call/block counts), linkage check (linkonce_odr/weak_odr only), visibility check, noinline/optnone attribute rejection, and the loads * stores > 100 combinatorial bail-out.
  4. Struct layout walk (depth limit 20). Implement the stack-based DFS walk of struct type trees to count fields for coercion cost, handling pointer types (tag 15), struct types (tag 13/14), and array types (tag 16), with a hard depth limit of 20 levels.
  5. Switch statement heuristics. Implement the three GPU-specific switch knobs (inline-switchctrl, inline-numswitchfunc, inline-maxswitchcases) that penalize switch-heavy callees where branch divergence, absent branch prediction, and reconvergence overhead make inlining particularly costly.
  6. NVPTX opcode tag 9 bonus (+2000). Implement the target-specific cost modifier that scans callee instructions for opcode tag 9 (likely tensor core/warp intrinsics) and adds a +2000 bonus to encourage inlining functions containing GPU operations that benefit from cross-boundary register allocation and scheduling.

ThinLTO Function Import

CICC v13.0 implements LLVM's ThinLTO function import pipeline with GPU-specific modifications to the threshold computation, candidate filtering, and provenance tracking. The core of the system lives in two functions -- sub_1854A20 (the import driver, 4,326 bytes) and sub_1853180 (the threshold computation engine, 5,059 bytes) -- with an entry point at sub_1855B10 that parses the -summary-file / -function-import command line and orchestrates the whole-module import flow. The fundamental difference from CPU ThinLTO is that GPU compilation operates in a closed-world model: there are no shared libraries, no dynamic linking, and no PLT/GOT indirection. Every device function will be statically linked into the final PTX. This means CICC can afford far more aggressive import thresholds than CPU compilers, because the code size cost of importing is paid once per GPU binary rather than once per shared-object load.

The import subsystem reads NVModuleSummary data (built by sub_D7D4E0, see Module Summary) to make summary-guided decisions about which functions to pull from other translation units. Each candidate is evaluated against a floating-point threshold that incorporates callsite hotness, linkage type, and a per-priority-class multiplier. A global import budget caps the total number of imports to prevent compile-time explosion. After import, each materialized function receives thinlto_src_module metadata so downstream passes (particularly the inliner) know its origin module.

Import driver                     sub_1854A20 (0x1854A20, 4,326 B)
Threshold computation             sub_1853180 (0x1853180, 5,059 B)
Threshold comparison gate         sub_18518A0 (0x18518A0)
Import execution                  sub_15E4B20 (0x15E4B20)
Import candidate evaluator        sub_1852CC0 (0x1852CC0)
Entry point                       sub_1855B10 (0x1855B10, 10,503 B)
Whole-module processing           sub_1858B90 (0x1858B90, 31,344 B)
Type metadata propagation         sub_185E850 (0x185E850, 24,263 B)
Pipeline registration             "function-import" (slot 43, Module pass)
Knob constructor (primary)        ctor_184_0 (0x4DA920, 13,693 B)
Knob constructor (supplementary)  ctor_029 (0x489C80, 1,120 B)
Knob constructor (pass-level)     ctor_420_0 (0x532010, 11,787 B)

Why GPU ThinLTO Differs from CPU ThinLTO

Upstream LLVM's ThinLTO was designed for CPU executables and shared libraries where import decisions must balance code size (impacts disk, cache, page faults) against optimization opportunity (cross-module inlining, constant propagation). The default import-instr-limit is 100 instructions, the cold multiplier is 0, and the hot multiplier is 10x. These conservative defaults reflect a world where over-importing bloats .text sections shared across address spaces.

GPU compilation inverts these tradeoffs:

  1. No shared libraries. Device code is statically linked into a fatbinary. There is no dynamic linker, no GOT, no PLT. Importing a function costs compile time but has zero runtime overhead beyond instruction cache pressure.

  2. Function calls are expensive. As documented in the inliner cost model, every GPU function call marshals arguments through .param address space via st.param / ld.param sequences. Inlining (which requires importing first) eliminates this overhead entirely.

  3. Closed-world optimization. The compiler sees all device code. There are no opaque DSOs. This means aggressive import cannot break ABI contracts that don't exist.

  4. Register pressure is the real constraint. On GPU, the limiting factor is not code size but register count, which determines occupancy. Import + inline can actually reduce register pressure by enabling cross-function register allocation and eliminating .param-space spills.

These factors push CICC toward much more aggressive import thresholds. The priority-class multiplier system (section below) allows CICC to tune import aggressiveness per-callsite rather than using a single global threshold.
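The per-callsite tuning reduces to scaling the base import limit by a class multiplier; the multiplier defaults below are the globals recovered from sub_1853180, while the class labels are descriptive names for the 3-bit priority field:

```python
MULTIPLIERS = {
    "cold":     0.0,    # dword_4FAACC0
    "default":  1.0,    # dword_4FAB040
    "hot":      10.0,   # dword_4FAAE80
    "critical": 100.0,  # dword_4FAADA0
}

def import_threshold(base_limit: int, klass: str) -> float:
    """Scale the base import-instr-limit by the callsite's priority class."""
    return base_limit * MULTIPLIERS[klass]

assert import_threshold(100, "hot") == 1000.0
assert import_threshold(100, "cold") == 0.0  # cold callsites never qualify
```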

What Gets Imported and What Does Not

The NVModuleSummary builder (sub_D7D4E0) assigns a 4-level import priority to every global value when building the module summary index:

Priority  Meaning                    Import behavior
0         Not importable             Local/hidden linkage, never imported
1         Importable, not preferred  Will import only if threshold is generous
2         Standard importable        Normal import candidate
3         Force-import               Highest priority, always imported if budget allows

The priority is determined by querying the ImportPriorityTable (parameter a4 of sub_D7D4E0) via sub_D84370, sub_D84440 (force-import check), and sub_D84450 (importable check). A global override at dword_4F87C60 can force all symbols to priority 1 or higher.

Functions that are imported:

  • __device__ functions with internal or linkonce_odr linkage (template instantiations, inline functions)
  • Math library implementations (libdevice functions) called from device code
  • Helper functions from header-only libraries (Thrust, CUB, cutlass templates)
  • Constant global variables with initializers (import-constants-with-refs = true by default)

Functions that are NEVER imported:

  • Kernels (__global__ functions). These are entry points. They are never candidates for cross-module import because they represent the root of execution; they are called from host code, not from other device functions. The summary builder marks them as non-importable.
  • Host functions. Host code is handled by the host compiler (gcc/clang), not cicc. They never appear in the device module summary.
  • Functions in address space 25. The summary builder at lines 1388-1395 explicitly skips functions whose type resolves to address space 25, with a goto LABEL_495 that bypasses the import-eligible path. The raw report notes: "device functions can't be cross-module imported in ThinLTO" -- this refers specifically to functions that are declarations only with device-memory address space linkage, meaning they reference device-side symbols without a definition in the current TU.
  • Functions with the "not importable" flag. Bit 4 (0x10) of the linkage byte at offset +0x0C in the function summary entry. The import driver checks test byte [entry+0Ch], 0x10 and skips on set.
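The last filter is a single bit test; this sketch mirrors the recovered check (test byte [entry+0Ch], 0x10), with the summary-entry layout reduced to a bare linkage byte for illustration:

```python
NOT_IMPORTABLE = 0x10  # bit 4 of the linkage byte at entry offset +0x0C

def is_importable(linkage_byte: int) -> bool:
    """Candidates with the not-importable flag set are skipped outright;
    the low nibble (linkage type) is evaluated separately."""
    return (linkage_byte & NOT_IMPORTABLE) == 0

assert not is_importable(0x17)  # flag set (low nibble 7 = weak linkage)
assert is_importable(0x07)      # flag clear: candidate stays eligible
```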

Import Algorithm: Complete Pseudocode

Complexity. Let C = number of import candidates across all modules, G = number of unique GUIDs, and L = total number of name entries across all candidates.

  • Stage 1 (threshold computation, sub_1853180) iterates every candidate once: O(C). For each candidate, the GUID dedup hash table (slot = GUID * 37 & (size - 1)) provides O(1) amortized lookup with linear probing. The name array scan is up to 4-level unrolled, giving O(L) total across all candidates. The 11-case linkage dispatch via jump table is O(1) per entry, as are the priority-class threshold adjustment (a single float multiply) and the global budget check. Overall Stage 1: O(C + L).
  • Stage 2 (triple-pass driver, sub_1854A20) processes three priority-ordered linked lists, each in a single pass: O(C) total. Per-candidate import execution (sub_15E4B20) is O(I_f), where I_f = instructions in the imported function (bitcode materialization).
  • Whole-module processing (sub_1858B90, 31 KB) is O(F * I_avg), where F = total functions and I_avg = average instruction count.
  • The dedup hash table grows by doubling at a 75% load factor, maintaining O(1) amortized operations.

Total: O(C + L + sum(I_imported)).

The import process runs in two major stages. Stage 1 (sub_1853180) builds a prioritized list of qualifying candidates by evaluating each against a computed threshold. Stage 2 (sub_1854A20) materializes candidates via a triple-pass sweep over three priority-ordered linked lists, executing the actual cross-module function import.

Stage 1: Threshold Computation Engine (sub_1853180)

Address range: 0x1853180--0x1854543 (5,059 bytes). Six parameters, 0xB8-byte stack frame. Uses a jump table at dword_42BA140 for the 11-case linkage-type dispatch.

// sub_1853180 -- Threshold computation with GUID dedup and priority-class multipliers
//
// Evaluates every candidate in summary_ctx against base_threshold adjusted by
// priority class.  Emits qualifying candidates to result_array as 24-byte
// entries {GUID, threshold, import_record_ptr}.  Tracks already-evaluated
// GUIDs via guid_hash_table to prevent duplicate work.
//
// Binary: 0x1853180, 5059 bytes.  Stack: 0xB8.
// Jump table: dword_42BA140 (11 entries, linkage dispatch).
//
// Globals read:
//   dword_4FAAE80  hot_multiplier      (float, default 10.0)
//   dword_4FAACC0  cold_multiplier     (float, default 0.0)
//   dword_4FAADA0  critical_multiplier (float, default 100.0)
//   dword_4FAB040  default_multiplier  (float, default 1.0)
//   dword_4FAB120  global_import_budget (int, default -1 = unlimited)
//   dword_4FAA770  running_import_count (int, reset per module)

fn threshold_compute(
    summary_ctx,       // rdi -> [rbp-0x88]: candidate arrays and metadata
    module_info,       // rsi -> [rbp-0x58]: source module summary
    base_threshold,    // edx -> [rbp-0x7C]: integer base threshold (import-instr-limit)
    guid_hash_table,   // rcx -> [rbp-0x50]: DenseMap<uint64_t, metadata> for dedup
    result_array,      // r8  -> [rbp-0x60]: growable output array
    visited_set,       // r9  -> [rbp-0xA0]: tracks already-evaluated GUIDs
):
    candidate_begin = summary_ctx[+0x28]   // r12: start of candidate pointer array
    candidate_end   = summary_ctx[+0x30]   // r14: one-past-end

    // ---- Outer loop: iterate every candidate ----
    while candidate_begin != candidate_end:                 // 0x18531C4
        candidate_ptr = *candidate_begin
        guid = candidate_ptr & ~0x7                         // mask low 3 tag bits

        // ---- GUID dedup via multiplicative-hash table ----
        table_size = guid_hash_table[+0x18]
        if table_size > 0:                                  // 0x18531D0
            table_data = guid_hash_table[+0x00]
            raw_guid   = candidate_ptr[+0x00]               // 8-byte GUID

            // Hash: slot = (GUID * 37) & (table_size - 1)
            // Implemented as: lea edx,[rsi+rsi*8] -> edx=GUID*9
            //                  lea edx,[rsi+rdx*4] -> edx=GUID+GUID*36=GUID*37
            slot = (raw_guid * 37) & (table_size - 1)       // 0x18531E8

            // 16-byte slots: {GUID (8B), metadata (8B)}
            probe_ptr = table_data + slot * 16
            stored_guid = probe_ptr[+0x00]

            if stored_guid == raw_guid:
                goto next_candidate                         // already evaluated

            // Linear probing on collision
            probe_step = 1
            while stored_guid != 0xFFFFFFFFFFFFFFFF:        // -1 = empty sentinel
                slot = (slot + probe_step) & (table_size - 1)
                probe_step += 1
                probe_ptr = table_data + slot * 16
                stored_guid = probe_ptr[+0x00]
                if stored_guid == raw_guid:
                    goto next_candidate                     // found: already seen

            // GUID not in table -- fall through to evaluation

        // ---- Name array scan ----
        // When dedup table is absent, scan name components directly
        name_begin = candidate_ptr[+0x18]                   // 0x1853250
        name_end   = candidate_ptr[+0x20]

        // Up-to-4-level unrolled name comparison (0x1853670-0x18538BA):
        //   Level 1: entry = [name_ptr - 8]
        //   Level 2: entry = [name_ptr + 0]
        //   Level 3: entry = [name_ptr + 8]
        //   Level 4: entry = [name_ptr + 0x10]
        // Each level checks:
        //   visibility flag at [r14+0xB0] -> if set: test byte [entry+0Ch], 0x20
        //   entry type:  entry[+0x08] must == 2 (function summary)
        //   not-importable: test byte [entry+0Ch], 0x10 -> skip if set
        //   linkage:     entry[+0x0C] & 0x0F -> 11-case switch

        for each name_entry in name_begin..name_end:
            entry = *name_entry
            if entry[+0x08] != 2:                           // not a function summary
                continue
            linkage_byte = entry[+0x0C]
            if linkage_byte & 0x10:                         // "not importable" flag
                continue

            linkage = linkage_byte & 0x0F                   // 0x185324E

            // ---- Linkage-type dispatch (11 cases via jump table) ----
            switch linkage:                                 // dword_42BA140
                case 0:  // ExternalLinkage
                case 1:  // AvailableExternallyLinkage
                case 3:  // InternalLinkage
                case 5:  // ExternalWeakLinkage
                case 6:  // CommonLinkage
                    goto standard_threshold_path            // loc_18536E8

                case 7:  // WeakAnyLinkage
                case 8:  // WeakODRLinkage
                    // Weak linkage requires name verification via memcmp
                    // to confirm the candidate matches the expected symbol
                    // before allowing import.
                    expected_name = resolve_name(candidate_ptr)
                    actual_name   = resolve_name(entry)
                    if memcmp(expected_name, actual_name, name_len) != 0:
                        continue                            // 0x1853A71: name mismatch
                    goto standard_threshold_path

                case 2:  // AppendingLinkage
                case 4:  // PrivateLinkage
                case 9:  // LinkOnceAnyLinkage
                case 10: // LinkOnceODRLinkage
                    goto special_handling_path               // loc_1853928

            // ---- Standard threshold path ----
            standard_threshold_path:
                // Dereference alias chain for external linkage
                if entry.function_type == 0:                // external
                    entry = entry[+0x40]                    // follow alias pointer
                    linkage = entry[+0x0C] & 0x0F           // re-extract

                // ---- Priority-class threshold adjustment ----
                // 0x1853441: convert base_threshold to float
                threshold_f = (float)base_threshold         // cvtsi2ss xmm2, eax

                priority_class = entry[+0x08] & 0x7         // 3-bit field, al=[r15+8]&7

                switch priority_class:          // exactly one multiplier path is taken (no fall-through)
                    case 3:  // HOT callsite
                        threshold_f *= dword_4FAAE80        // hot_multiplier (10.0)
                                                            // mulss xmm0, cs:dword_4FAAE80
                    case 1:  // COLD callsite
                        threshold_f *= dword_4FAACC0        // cold_multiplier (0.0)
                                                            // mulss xmm0, cs:dword_4FAACC0
                    case 4:  // CRITICAL callsite
                        threshold_f *= dword_4FAADA0        // critical_multiplier (100.0)
                                                            // mulss xmm0, cs:dword_4FAADA0
                    default: // no priority match
                        threshold_f *= dword_4FAB040        // default_multiplier (1.0)
                                                            // mulss xmm0, cs:dword_4FAB040

                adjusted_threshold = (int)threshold_f       // cvttss2si rax, xmm0
                // Stored to [rbp-0x78] and r11d for comparison

                // ---- Cost comparison (0x1853AA8) ----
                function_cost = entry[+0x40]                // IR instruction count
                if adjusted_threshold < function_cost:      // cmp r11d, [rcx+40h]
                    continue                                // jb not_eligible

                // ---- "Not importable" double-check ----
                if entry[+0x0C] & 0x10:                     // test byte [rcx+0Ch], 0x10
                    continue

                // ---- Max-threshold-wins for duplicates (0x18534C2) ----
                if guid already in result_array:
                    existing_record = result_slot[+0x10]
                    if existing_record != NULL:
                        existing_threshold = result_slot[+0x08]
                        if (float)existing_threshold >= threshold_f:
                            continue                        // existing is better; skip
                        result_slot[+0x08] = adjusted_threshold  // update to higher
                        goto next_candidate

                // ---- Global budget check (0x185340A) ----
                budget = dword_4FAB120                      // global_import_budget
                if budget >= 0:                             // test eax,eax; js proceed
                    if dword_4FAA770 >= budget:             // cmp counter vs budget
                        continue                            // jge skip: budget exhausted

                // ---- Allocate dedup hash table node (0x1853953) ----
                node = malloc(16)                           // 0x22077B0: edi=0x10
                if node != NULL:
                    node[+0x00] = 0                         // clear forward pointer
                    node[+0x08] = guid
                    sub_1851560(                            // hash table insert
                        guid_hash_table[+0x08],             // insert point
                        bucket_index,                       // slot
                        guid,                               // key
                        1                                   // insert_mode
                    )

                // ---- Emit to result array (0x1853517) ----
                count    = result_array[+0x08]              // current count
                capacity = result_array[+0x0C]
                if count >= capacity:
                    grow_result_array(result_array)         // realloc path

                // 24-byte entry: offset = count * 24
                entry_ptr = result_array.base + count * 24  // lea rax,[rax+rax*2]; shl rax,3
                entry_ptr[+0x00] = guid                     // 8 bytes: function GUID
                entry_ptr[+0x08] = adjusted_threshold       // 4 bytes: threshold value
                entry_ptr[+0x10] = import_record_ptr        // 8 bytes: import record

                result_array[+0x08] = count + 1             // increment count

                // ---- Increment global counter (0x1853510) ----
                dword_4FAA770 += 1                          // add cs:dword_4FAA770, 1

    next_candidate:
        candidate_begin += 8                                // advance to next candidate
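
The max-threshold-wins duplicate handling above (0x18534C2) can be sketched as a small Python model. The dict layout and the name `record_candidate` are illustrative stand-ins for the 24-byte result-array entries, not the binary's structures:

```python
# Toy model of max-threshold-wins duplicate handling (0x18534C2).
# 'results' stands in for the result array keyed by GUID.
def record_candidate(results, guid, threshold, record):
    existing = results.get(guid)
    if existing is not None:
        if existing["threshold"] >= threshold:
            return False                     # existing entry is better; skip
        existing["threshold"] = threshold    # upgrade to the higher threshold
        return True
    results[guid] = {"threshold": threshold, "record": record}
    return True

results = {}
record_candidate(results, 0xABC, 100, "modA")
assert not record_candidate(results, 0xABC, 70, "modB")   # lower: rejected
assert record_candidate(results, 0xABC, 1000, "modC")     # higher: updated
assert results[0xABC]["threshold"] == 1000
```

Note that, as in the pseudocode, only the threshold field is updated on upgrade; the original import record pointer is kept.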

Threshold computation arithmetic in detail. The four multiplier constants live in .data as IEEE 754 single-precision floats. The SSE scalar path is:

; At 0x1853441 -- convert integer base threshold to float
pxor   xmm2, xmm2
cvtsi2ss xmm2, rax          ; xmm2 = (float)base_threshold

; Priority dispatch -- one of four paths selected:
; HOT (priority 3):
movss  xmm0, cs:dword_4FAAE80   ; xmm0 = 10.0f
mulss  xmm0, xmm2               ; xmm0 = 10.0 * base

; COLD (priority 1):
mulss  xmm0, cs:dword_4FAACC0   ; xmm0 = 0.0 * base = 0.0

; CRITICAL (priority 4):
mulss  xmm0, cs:dword_4FAADA0   ; xmm0 = 100.0 * base

; DEFAULT (all others):
mulss  xmm0, cs:dword_4FAB040   ; xmm0 = 1.0 * base

; Convert back to integer for comparison
cvttss2si rax, xmm0             ; rax = (int)threshold_f (truncation)

The cvttss2si truncation means threshold values are floored, not rounded. For base_threshold=100 and hot_multiplier=10.0, the adjusted threshold is exactly 1000. The cold path with multiplier 0.0 always produces threshold 0, meaning cold functions are never imported unless the multiplier is overridden.
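
The arithmetic can be reproduced with a small Python model. This is illustrative only: the binary works on float32 via SSE scalar ops, but the four multiplier values are exactly representable at either precision, so the results match.

```python
import math

# Model of the priority-class threshold adjustment at 0x1853441.
MULTIPLIERS = {
    3: 10.0,   # HOT       (dword_4FAAE80)
    1: 0.0,    # COLD      (dword_4FAACC0)
    4: 100.0,  # CRITICAL  (dword_4FAADA0)
}

def adjust_threshold(base_threshold, priority_class):
    m = MULTIPLIERS.get(priority_class, 1.0)   # default: dword_4FAB040
    return math.trunc(base_threshold * m)      # cvttss2si truncates toward zero

assert adjust_threshold(100, 3) == 1000    # hot
assert adjust_threshold(100, 4) == 10000   # critical
assert adjust_threshold(100, 1) == 0       # cold: never imported
assert adjust_threshold(100, 0) == 100     # default
```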

Stage 2: Triple-Pass Import Driver (sub_1854A20)

Address range: 0x1854A20--0x1855B06 (4,326 bytes). Four parameters, 0x278-byte stack frame. Callee-saved: r15, r14, r13, r12, rbx.

The driver processes candidates across three priority-ordered linked lists embedded in the guid_import_map structure. Each list covers a different import priority class. The three passes guarantee that high-priority candidates are imported (and consume budget) before lower-priority ones get a chance.

// sub_1854A20 -- Triple-pass import driver
//
// Materializes cross-module function bodies for candidates that pass
// threshold evaluation.  Processes three linked lists in priority order:
//   Pass 1: primary   list at [import_map + 0x00]  (highest priority)
//   Pass 2: secondary list at [import_map + 0x10]  (medium priority)
//   Pass 3: tertiary  list at [import_map + 0x30]  (lowest priority)
//
// For each candidate: check importable flag, evaluate threshold via
// sub_18518A0, execute import via sub_15E4B20, optionally attach
// thinlto_src_module metadata.
//
// Binary: 0x1854A20, 4326 bytes.  Stack: 0x278.
//
// Globals read:
//   byte_4FAAA20   enable_import_metadata (bool)

fn import_driver(
    import_ctx,          // rdi -> [rbp-0x258]: import state object
    module_summary_idx,  // rsi -> [rbp-0x260]: combined summary index
    source_module_info,  // rdx -> [rbp-0x278]: source module descriptor
    guid_import_map,     // rcx -> [rbp-0x268]: hash map of GUID -> import lists
                         //        also saved to rbx
):
    // ---- Initialize resolved-summary storage (0x1854A45) ----
    sub_1674380(
        &local_resolved_storage,   // rdi = [rbp-0x290]
        source_module_info         // rsi = rdx
    )

    // ---- Check if import map is empty (0x1854A6C) ----
    entry_count = guid_import_map[+0x08]
    if entry_count == 0:
        goto empty_import_path                              // 0x1854AB3

    // ======================================================================
    // PASS 1: PRIMARY CANDIDATE LIST  (0x1854B99 -- 0x1854F3B)
    // List head: [guid_import_map + 0x00]
    // Importable flag: byte [node - 0x21] & 0x20
    // Summary ptr:     [node - 0x38]
    // ======================================================================

    primary_list = guid_import_map[+0x00]                   // rsi = [rbx]

    // Scan to first valid entry (skip sentinels -8 and NULL)
    cursor = primary_list[+0x00]
    if cursor == 0xFFFFFFFFFFFFFFF8 || cursor == NULL:
        scan forward through primary_list[+0x08], [+0x10], ...
        // Inner scan: load qword, test for NULL, cmp against -8
        // Stop at first non-null, non-sentinel entry

    end_of_candidates = primary_list + entry_count * 8      // r12

    while cursor != end_of_candidates:                      // 0x1854BF0
        // ---- Load candidate descriptor ----
        desc = *cursor                                      // rax = [r14]
        summary_data = desc[+0x00]                          // rdx = [rax]
        cost_info    = desc + 0x40                          // threshold/cost at +0x40

        // ---- Evaluate candidate (0x1854C02) ----
        sub_1852CC0(&local_buf, guid_import_map)            // import candidate evaluator

        // ---- Locate next valid entry (cursor moves there once this
        //      candidate's nodes are processed) ----
        next = cursor + 8
        // Scan forward: skip slots holding NULL or the -8 sentinel
        while next[+0x00] == NULL || next[+0x00] == 0xFFFFFFFFFFFFFFF8:
            next += 8

        // ---- Per-node import decision loop (0x1854E39) ----
        for each node in candidate.linked_nodes:
            if node == NULL:
                continue                                    // test r15, r15

            // Importable flag check
            importable = node[-0x21] & 0x20                 // test byte [r15-0x21], 0x20
            if !importable:
                continue                                    // jz skip

            // Extract function summary (stored 0x38 bytes before node)
            func_summary = node[-0x38]                      // r13 = [r15-0x38]

            // Resolve function name/info
            sub_15E4EB0(cursor, func_summary)               // 0x1854E61

            // ---- Format import remark (diagnostic output) ----
            resolved_threshold = [rbp-0x1D8]
            resolved_info      = [rbp-0x1E0]
            sub_16C1840(guid_import_map, resolved_info, resolved_threshold)
                                                            // cost component remark
            sub_16C1A90(guid_import_map, resolved_info, resolved_threshold)
                                                            // threshold component remark
            sub_16C1AA0(guid_import_map, [rbp-0x210])       // finalize remark string
            free([rbp-0x1E0])                               // cleanup temp string

            // ---- Threshold comparison gate (0x1854EE3) ----
            cost      = cursor[+0x10]                       // estimated function cost
            hot_count = cursor[+0x08]                       // call frequency / hotness
            qualifies = sub_18518A0(hot_count, cost)        // THRESHOLD GATE
            if !qualifies:                                  // test rax,rax; jz skip
                continue

            // ---- Execute import (0x1854EF7) ----
            sub_15E4B20(import_ctx, func_summary)           // MATERIALIZE FUNCTION

            // Check abort signal
            status = [rbp-0xD0]
            if status & 0xFFFFFFFFFFFFFFFE:                 // caller requested abort
                goto early_return

            // ---- Attach provenance metadata (0x1854F0D) ----
            if byte_4FAAA20 != 0:                           // enable-import-metadata
                source_name = sub_161FF10(func_summary)     // resolve source module name
                // Create optimization remark
                sub_1627350(remark_ctx, 1)                  // edx=1: enabled

                // Attach metadata string (0x1855261):
                //   lea rsi, "thinlto_src_module"  ; 0x42BA2F8, length 0x12
                sub_1627100(
                    func_summary,                           // target function
                    "thinlto_src_module",                   // metadata key (18 chars)
                    source_name                             // metadata value
                )

    // ======================================================================
    // PASS 2: SECONDARY CANDIDATE LIST  (0x1854F41 -- 0x1855074)
    // List head: [guid_import_map + 0x10]
    // Same importable-flag check: byte [node - 0x21] & 0x20
    // Same summary extraction:    [node - 0x38]
    // ======================================================================

    secondary_list = guid_import_map[+0x10]                 // r15 = [rcx+10h]
    secondary_sentinel = guid_import_map[+0x08]

    // Identical processing pattern:
    //   - Iterate linked-list nodes
    //   - Check importable flag: byte [r15-0x21] & 0x20
    //   - Extract summary: [r15-0x38]
    //   - sub_18518A0 threshold gate
    //   - sub_15E4B20 import execution
    //   - Conditional thinlto_src_module metadata attachment

    for each node in secondary_list:
        if node[-0x21] & 0x20 == 0:
            continue
        summary = node[-0x38]
        if !sub_18518A0(node.hot_count, node.cost):
            continue
        sub_15E4B20(import_ctx, summary)
        if byte_4FAAA20:
            attach_provenance_metadata(summary)

    // ======================================================================
    // PASS 3: TERTIARY CANDIDATE LIST  (0x1855074 -- 0x1855190)
    // List head: [guid_import_map + 0x30]
    // Different offsets:
    //   Summary extraction: [node - 0x30]  (not -0x38)
    //   Importable flag:    byte [node - 0x19] & 0x20  (not -0x21)
    // ======================================================================

    tertiary_list = guid_import_map[+0x30]

    // Same processing pattern but with adjusted offsets:
    for each node in tertiary_list:
        if node[-0x19] & 0x20 == 0:                        // note: -0x19, not -0x21
            continue
        summary = node[-0x30]                               // note: -0x30, not -0x38
        if !sub_18518A0(node.hot_count, node.cost):
            continue
        sub_15E4B20(import_ctx, summary)
        if byte_4FAAA20:
            attach_provenance_metadata(summary)

    // ======================================================================
    // POST-IMPORT: Result materialization (0x1854B3C -- 0x1854B97)
    // ======================================================================

    result_count = [rbp-0x100]
    if result_count > 0:
        import_source = sub_16704E0()                       // r13: source module handle
        import_dest   = sub_16704F0()                       // r14: destination module handle

        result_base = [rbp-0x110]
        result_end  = result_base + result_count * 8

        for each result_entry in result_base..result_end:   // 0x1854B7D
            func = *result_entry

            // Skip if function already exists in source module
            if sub_1670560(func, import_source):            // test al,al; jnz next
                continue

            // Materialize into destination module
            sub_1670560(func, import_dest)

    // ======================================================================
    // CLEANUP (0x1854AE7 -- 0x1854B22)
    // ======================================================================

    // Release import list entries (16-byte stride)
    cleanup_base = [rbp-0xF0]
    cleanup_count = eax
    cleanup_end = cleanup_base + cleanup_count * 16

    for each entry in cleanup_base..cleanup_end (stride=16):
        value = entry[+0x00]
        if value == 0xFFFFFFFFFFFFFFF8:                     // sentinel -8: empty
            continue
        if value == 0xFFFFFFFFFFFFFFFC:                     // sentinel -4: deleted
            continue
        sub_161E7C0(entry[+0x08])                           // release associated data

    free(cleanup_base)                                      // j___libc_free_0

    // ---- Empty-import finalization ----
    empty_import_path:                                      // 0x1854AB3
        import_ctx.status = 0                               // clear status byte
        flags = import_ctx[+0x08]
        flags = (flags & 0xFC) | 0x02                       // set "import complete, no imports"
        import_ctx[+0x08] = flags
        sub_1851C60(&local_import_list)                     // finalize empty path cleanup

Why three passes with different offsets. The three linked lists represent three structural layers in the guid_import_map:

Pass | List head offset | Summary offset | Importable-flag offset | Interpretation
1 (primary) | [map+0x00] | node[-0x38] | node[-0x21] & 0x20 | Direct call targets from the current module -- highest priority because they are on the critical path
2 (secondary) | [map+0x10] | node[-0x38] | node[-0x21] & 0x20 | Transitively-reachable functions (callees of callees) -- import enables deeper inlining chains
3 (tertiary) | [map+0x30] | node[-0x30] | node[-0x19] & 0x20 | Speculative candidates (address-taken functions, indirect call targets inferred from devirtualization) -- lowest confidence

The different offsets in pass 3 (-0x30 instead of -0x38, -0x19 instead of -0x21) indicate a different node layout for speculative candidates. These nodes carry less metadata: there are 8 fewer bytes between the summary pointer and the node base, and the importable flag sits 8 bytes closer to the node base.
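
The pass-specific field reads can be sketched as a toy byte-buffer model. The buffer construction and the name `read_candidate` are illustrative; only the offsets come from the analysis above.

```python
import struct

# Passes 1-2: summary pointer at node-0x38, importable flag at node-0x21.
# Pass 3:     summary pointer at node-0x30, importable flag at node-0x19.
def read_candidate(buf, node, speculative=False):
    summary_off = 0x30 if speculative else 0x38
    flag_off = 0x19 if speculative else 0x21
    summary = struct.unpack_from("<Q", buf, node - summary_off)[0]
    importable = bool(buf[node - flag_off] & 0x20)   # bit 5 = importable
    return summary, importable

# Build a fake pass-1 node: summary qword 0x38 bytes before the node base,
# flag byte 0x21 bytes before it.
buf = bytearray(0x40)
node = 0x38
struct.pack_into("<Q", buf, node - 0x38, 0xDEADBEEF)
buf[node - 0x21] = 0x20
assert read_candidate(buf, node) == (0xDEADBEEF, True)
```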

Threshold Comparison Gate (sub_18518A0)

The gate function takes two arguments -- hot_count (rdi) and cost (rsi) -- and returns nonzero if the candidate qualifies for import. The driver calls it at three points (once per pass). This function encapsulates the final accept/reject decision after the per-priority-class threshold adjustment has already been applied by sub_1853180.

// sub_18518A0 -- Threshold comparison gate
// Returns: nonzero if candidate should be imported, zero otherwise
//
// rdi = hot_count (call frequency from profile or summary)
// rsi = cost      (adjusted threshold value from Stage 1)

fn threshold_gate(hot_count, cost) -> bool:
    // The exact comparison logic depends on whether profile data
    // is available.  With profile data, hot_count is a raw call
    // count; the gate compares the cost against a profile-weighted
    // threshold.  Without profile data, this degenerates to a
    // direct comparison: cost <= threshold.
    return hot_count > 0 || cost <= current_threshold

Threshold Multiplier Constants

The four floating-point multiplier constants are stored in the .data section and are set by the corresponding cl::opt registrations in ctor_184_0:

Address | Knob | Default | Purpose
dword_4FAAE80 | import-hot-multiplier | 10.0 | Multiplier for hot callsites
dword_4FAACC0 | import-cold-multiplier | 0.0 | Multiplier for cold callsites
dword_4FAADA0 | import-critical-multiplier | 100.0 | Multiplier for critical callsites
dword_4FAB040 | (default path) | 1.0 | Multiplier when no priority class matches

With the upstream default import-instr-limit of 100, a hot callsite gets threshold 1,000 instructions and a critical callsite gets threshold 10,000. The cold multiplier of 0.0 means cold functions are never imported by default -- the threshold evaluates to zero.

Effective threshold table (for import-instr-limit=100):

Priority class | Multiplier | Effective threshold | Typical candidates
Critical (4) | 100.0x | 10,000 instructions | Manually annotated hot paths, PGO-identified critical edges
Hot (3) | 10.0x | 1,000 instructions | Profile-guided hot callsites, frequently-called templates
Default (0,2) | 1.0x | 100 instructions | Standard callsites without profile data
Cold (1) | 0.0x | 0 instructions | Provably cold paths -- never imported at default settings

The evolution factors control how thresholds decay as imports cascade through the call graph:

Knob | Default | Effect
import-instr-evolution-factor | 0.7 | Each transitive import level reduces the threshold to 70% of the previous
import-hot-evolution-factor | 1.0 | Hot callsite chains do not decay (threshold stays constant through transitive imports)

The evolution factor is applied by the caller of sub_1853180 before passing base_threshold. For a chain A -> B -> C where A is the root module:

  • Import B into A: threshold = import-instr-limit (100)
  • Import C into A (transitively via B): threshold = 100 * 0.7 = 70
  • Import D into A (transitively via C via B): threshold = 100 * 0.7 * 0.7 = 49

For hot chains with import-hot-evolution-factor=1.0, the threshold remains 1,000 at every transitive level, enabling arbitrarily deep import chains for hot call paths.
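
The decay chain can be written out as a small helper (illustrative name; level 0 is the direct import into the root module):

```python
# Nominal per-level thresholds for a transitive import chain.
def chain_thresholds(base, factor, depth):
    return [base * factor ** level for level in range(depth + 1)]

# import-instr-limit=100 with the default evolution factor 0.7:
levels = chain_thresholds(100, 0.7, 2)
assert [round(t) for t in levels] == [100, 70, 49]

# Hot chains (import-hot-evolution-factor=1.0) never decay:
assert chain_thresholds(1000, 1.0, 3) == [1000, 1000, 1000, 1000]
```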

Global Import Budget

Two globals control the total import count:

Address | Role | Default
dword_4FAB120 | Maximum allowed imports | -1 (unlimited)
dword_4FAA770 | Running import counter | 0 (reset per module)

The budget check at 0x185340A:

mov  eax, cs:dword_4FAB120   ; load budget
test eax, eax
js   proceed                   ; negative = unlimited
cmp  cs:dword_4FAA770, eax   ; counter vs budget
jge  skip                     ; at or over budget -> skip

When the budget is -1 (the import-cutoff default), the sign bit is set and the js (jump-if-sign) branch is always taken, bypassing the budget check entirely. Setting -import-cutoff=N limits the total number of imported functions to N, which is useful for bisecting import-related miscompilations.
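
The gate's semantics reduce to a two-line predicate (function name illustrative):

```python
# Model of the budget check at 0x185340A.
# budget  = dword_4FAB120: negative means unlimited (-import-cutoff default -1)
# counter = dword_4FAA770: imports performed so far in this module
def budget_allows(budget, counter):
    if budget < 0:            # 'js proceed' -- sign bit set, skip the check
        return True
    return counter < budget   # 'jge skip' when counter >= budget

assert budget_allows(-1, 10**6)   # unlimited
assert budget_allows(10, 9)
assert not budget_allows(10, 10)  # budget exhausted
```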

The counter increment at 0x1853510:

add  cs:dword_4FAA770, 1     ; increment after successful import

This is a non-atomic add -- safe because ThinLTO import runs single-threaded per module in CICC (unlike CPU LLVM where the thin link runs in parallel). The counter resets to 0 at the start of each module's import phase.

Integration with the 20,000-Budget Inliner

The import + inline pipeline in CICC works as a two-phase system:

  1. Import phase (this page): ThinLTO brings cross-module function bodies into the current module based on summary-guided threshold decisions. The imported functions are marked with thinlto_src_module metadata.

  2. Inline phase (inliner cost model): The NVIDIA custom inliner at sub_1864060 runs with a 20,000-unit per-caller budget. Imported functions are prime inlining candidates: they were imported precisely because they are called from this module.

The inliner-function-import-stats knob (registered in ctor_186_0 at 0x4DBEC0, values: basic or verbose) tracks how many imported functions were actually inlined. This provides feedback on whether the import thresholds are well-calibrated: if functions are imported but then not inlined (because they exceed the inline budget), the import was wasted compile time.

The typical flow for a template-heavy CUDA library like CUB or cutlass:

  1. Each .cu file compiles to a ThinLTO bitcode module with a summary index
  2. The thin link step reads all summaries and builds a combined index
  3. For each module, sub_1853180 evaluates import candidates using the combined index
  4. Hot template instantiations (e.g., cub::DeviceReduce::Sum<float>) get threshold base * 10.0 (hot) or base * 100.0 (critical)
  5. The imported function bodies arrive in the module and are immediately available to the 20,000-budget inliner
  6. The inliner folds the imported template bodies into their callers, eliminating .param marshaling

Entry Point: sub_1855B10

Address: 0x1855B10, 10,503 bytes. This is the runOnModule entry for the "function-import" pass (pipeline slot 43). It orchestrates the entire import flow:

fn function_import_pass_entry(module):
    // Parse required options
    if summary_file_path is empty:
        error("error: -function-import requires -summary-file")
        return

    summary_index = load_summary_file(summary_file_path)
    if summary_index is error:
        error("Error loading file")
        return

    // Build GUID-to-import map from summary index
    guid_import_map = build_import_map(module, summary_index)

    // Stage 1: threshold computation
    sub_1853180(summary_ctx, module_info, import_instr_limit,
                guid_hash_table, result_array, visited_set)

    // Stage 2: triple-pass import
    sub_1854A20(import_ctx, summary_index, source_module, guid_import_map)

    // Post-import: attribute propagation (if enabled)
    if propagate_attrs:
        propagate_summary_attributes(module, summary_index)

Knob Inventory

All knobs are registered across four constructors:

ctor_184_0 at 0x4DA920 (13,693 B -- ThinLTO Function Import options):

Knob | Type | Default | Effect
import-instr-limit | unsigned | 100 | Base instruction count threshold
import-cutoff | int | -1 | Max total imports (-1 = unlimited)
import-instr-evolution-factor | float | 0.7 | Threshold decay per transitive level
import-hot-evolution-factor | float | 1.0 | Hot chain decay (1.0 = no decay)
import-hot-multiplier | float | 10.0 | Threshold multiplier for hot callsites
import-critical-multiplier | float | 100.0 | Threshold multiplier for critical callsites
import-cold-multiplier | float | 0.0 | Threshold multiplier for cold callsites
print-imports | bool | false | Print names of imported functions
print-import-failures | bool | false | Print rejected candidates with reasons
compute-dead | bool | true | Strip dead symbols from index
enable-import-metadata | bool | false | Attach thinlto_src_module / thinlto_src_file metadata
summary-file | string | (none) | Summary file path for -function-import
import-all-index | bool | false | Import every external function in the index

ctor_420_0 at 0x532010 (11,787 B -- pass-level ThinLTO options):

Knob | Type | Default | Effect
force-import-all | bool | false | Import even noinline functions
import-declaration | bool | false | Import function declarations as fallback
thinlto-workload-def | string | (none) | JSON file mapping root functions to import lists

ctor_029 at 0x489C80 (1,120 B -- supplementary ThinLTO options):

Knob | Type | Default | Effect
propagate-attrs | bool | true | Propagate attributes through the summary index
import-constants-with-refs | bool | true | Import constant globals that have references

ctor_419 at 0x531850 (6,358 B -- FunctionAttrs inference):

Knob | Type | Default | Effect
disable-thinlto-funcattrs | bool | false | Disable function attribute inference from ThinLTO summaries

Data Structures

Import Candidate Linked List

Each of the three priority lists in the guid_import_map is a singly-linked list with 8-byte node entries:

Offset | Content
[node+0x00] | Entry value (pointer to candidate descriptor, or GUID)
[node+0x08] | Next slot / next node pointer

Sentinels: 0xFFFFFFFFFFFFFFF8 (-8) = empty slot, 0xFFFFFFFFFFFFFFFC (-4) = deleted slot. These sentinel values are standard open-addressing hash map markers repurposed for the linked-list traversal.

GUID Import Map Layout

The guid_import_map structure (parameter rcx of sub_1854A20) contains the three priority lists:

Offset | Size | Content
+0x00 | 8 | Primary list head (direct call targets)
+0x08 | 8 | Entry count / secondary sentinel
+0x10 | 8 | Secondary list head (transitive callees)
+0x18 | 8 | (reserved / alignment)
+0x20 | 8 | (reserved / alignment)
+0x28 | 8 | (reserved / alignment)
+0x30 | 8 | Tertiary list head (speculative candidates)

GUID Dedup Hash Table

Field | Value | Description
Slot size | 16 bytes | {GUID (8B), metadata (8B)}
Hash function | multiplicative | slot = (GUID * 37) & (table_size - 1)
Collision resolution | linear probing | Increment slot by 1, wrap at table_size
Empty sentinel | -1 | 0xFFFFFFFFFFFFFFFF
Size field | offset +0x18 | Number of slots in table (always a power of 2)

The multiplication constant 37 produces reasonable distribution for GUIDs that are typically MD5 hashes of mangled names. The linear probing is adequate because the table is sized to maintain a low load factor.
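
A minimal model of the table's insert/lookup discipline, assuming the layout above. The class and method names are illustrative, and the metadata qword stored alongside each GUID is omitted since only membership matters for dedup:

```python
MASK64 = (1 << 64) - 1
EMPTY = 0xFFFFFFFFFFFFFFFF   # -1 sentinel marking an empty slot

class GuidDedupTable:
    def __init__(self, table_size=16):
        assert table_size & (table_size - 1) == 0, "size must be a power of 2"
        self.size = table_size
        self.slots = [EMPTY] * table_size

    def seen_or_insert(self, guid):
        """Return True if guid was already present (skip candidate),
        else insert it and return False (fall through to evaluation)."""
        slot = ((guid * 37) & MASK64) & (self.size - 1)   # multiplicative hash
        while self.slots[slot] != EMPTY:
            if self.slots[slot] == guid:
                return True                                # already evaluated
            slot = (slot + 1) & (self.size - 1)            # linear probe, wrap
        self.slots[slot] = guid
        return False
```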

Result Array

Growable array with 24-byte entries:

Offset | Size | Content
+0x00 | 8 | Function GUID
+0x08 | 4 | Adjusted threshold value
+0x10 | 8 | Import record pointer

Header: [+0x08] = current count, [+0x0C] = capacity. Growth is handled by a realloc path when count >= capacity.

Per-Function Summary Entry (import-relevant fields)

Offset | Size | Content
+0x08 | 4 | Entry type (2 = function summary)
+0x0C | 1 | Linkage byte: low 4 bits = linkage type, bit 4 = not-importable flag, bit 5 = importable flag
+0x40 | 4 | Function cost (IR instruction count, used for threshold comparison)
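
The linkage byte packs three fields; a decode helper (illustrative name) makes the bit layout concrete:

```python
# Decode the linkage byte at +0x0C into its documented bit fields.
def decode_linkage_byte(b):
    return {
        "linkage": b & 0x0F,               # low 4 bits: linkage type
        "not_importable": bool(b & 0x10),  # bit 4
        "importable": bool(b & 0x20),      # bit 5
    }

# 0x27 = importable WeakAnyLinkage (linkage 7):
assert decode_linkage_byte(0x27) == {
    "linkage": 7, "not_importable": False, "importable": True,
}
```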

Function Map

Role | Function | Size
ThinLTO import driver (triple-pass candidate processing) | sub_1854A20 | 4,326 B
Threshold computation with GUID dedup and priority-class multipliers | sub_1853180 | 5,059 B
Threshold comparison gate (returns nonzero if candidate qualifies) | sub_18518A0 | --
Import candidate evaluator (prepares candidate for threshold check) | sub_1852CC0 | --
Import list builder (called by sub_1853180) | sub_1852FB0 | --
Import list node allocator (called by sub_1853180) | sub_1852A30 | --
Import list initialization (called by sub_1853180) | sub_1851200 | --
Execute import decision (materialize function into destination) | sub_15E4B20 | --
Resolve function name/info from summary | sub_15E4EB0 | --
Entry point (parses -function-import / -summary-file) | sub_1855B10 | 10,503 B
Whole-module ThinLTO processing | sub_1858B90 | 31,344 B
Type metadata propagation during import | sub_185E850 | 24,263 B
Attach named metadata (used for thinlto_src_module) | sub_1627100 | --
Create optimization remark (import diagnostic) | sub_1627350 | --
Resolve source module name string | sub_161FF10 | --
Check if function exists in a given module | sub_1670560 | --
Get "import source" module handle | sub_16704E0 | --
Get "import destination" module handle | sub_16704F0 | --
Format import remark (cost component) | sub_16C1840 | --
Format import remark (threshold component) | sub_16C1A90 | --
Finalize import remark string | sub_16C1AA0 | --
Hash table insert (GUID dedup table) | sub_1851560 | --
Initialize resolved function summary storage | sub_1674380 | --
Finalize empty-import path cleanup | sub_1851C60 | --
Release import list entry data | sub_161E7C0 | --
malloc wrapper (used for 16-byte dedup node allocation) | sub_22077B0 | --

Cross-References

  • Inliner Cost Model -- the downstream consumer of imported functions. Import brings bodies into the module; the 20,000-budget inliner decides whether to fold them into callers.
  • Module Summary -- sub_D7D4E0 builds the NVModuleSummary that drives import decisions. The 4-level priority system, complexity budget, and CUDA-specific filtering all originate here.
  • Pipeline & Ordering -- function-import is registered as pipeline slot 43, a Module-level pass.
  • IP Memory Space Propagation -- after import, cross-module functions may carry address-space annotations that IPMSP must reconcile.
  • Hash Infrastructure -- the GUID dedup table uses the same DenseMap pattern documented there.

GlobalOpt for GPU

CICC implements a custom GlobalOpt pass (sub_18612A0, 65 KB, 2179 decompiled lines) that replaces LLVM's stock GlobalOptPass with GPU-aware global variable transformations. The pass operates on NVIDIA's internal IR representation rather than LLVM IR directly, and adds address-space-aware logic that stock LLVM lacks entirely: it extracts the CUDA address space from the global's flags byte ((flags >> 2) & 7), preserves the address space through all generated replacement globals, and applies promotion thresholds calibrated for the GPU memory hierarchy. The pass runs at pipeline position 30 in the tier-2 and tier-3 optimization sequences (via wrapper sub_196A2B0), immediately after GlobalDCE / ConstantProp (sub_1968390) and before LoopVectorize. It runs at -O2 and above; tier 1 does not include it. The inliner cost model at sub_1864060 also calls into GlobalOpt as a subroutine when evaluating whether a callee's globals can be folded after inlining, creating a tight coupling between inlining decisions and global optimization.

The pass implements four transformation strategies with decreasing priority: small-constant promotion for globals under 2047 bits, scalar replacement of aggregates (SRA) for struct globals with up to 16 fields, malloc/free elimination for heap-allocated globals with single-unit access, and a hash-table-driven deduplication cleanup pass. Each strategy preserves the original global's NVPTX address space, which is critical -- a __device__ global in address space 1 must remain in AS 1 after splitting, not silently migrate to AS 0 (generic). The generated IR uses distinctive suffixes (.body, .init, .val, .notinit, .f0...f15, .isneg, .isnull) that survive through to PTX emission and are visible in cuobjdump output.

Core transform            sub_18612A0 (0x18612A0, 65 KB, 2179 lines)
Pipeline wrapper          sub_196A2B0 (0x196A2B0)
Recursive re-application  sub_185B1D0 (0x185B1D0)
Pre-SRA setup             sub_185B7E0 (0x185B7E0)
Hash table rehash         sub_1860410 (0x1860410)
Per-user SRA rewrite      sub_1860BE0 (0x1860BE0)
Pipeline position         Step 30 (tier 2/3), after GlobalDCE, before LoopVectorize
Minimum opt level         -O2 (tier 2)
Pass registration         "globalopt" in pipeline parser at slot 45
IR node allocation        88 bytes per global, 64 bytes per basic block, 56 bytes per instruction

Address Space Handling

Every transformation in this pass must respect CUDA address spaces. The global's address space is extracted at line 577 of the decompilation:

uint8_t addr_space = (*(uint8_t*)(global + 33) >> 2) & 7;

The NVPTX address spaces relevant here are 0 (generic), 1 (global/__device__), 3 (shared/__shared__), 4 (constant/__constant__), and 5 (local). See Address Spaces for the complete table with hardware mapping, pointer widths, and latency numbers.

When sub_18612A0 creates replacement globals via sub_15E51E0, it passes the extracted address space to the constructor. The created global inherits the same address space, linkage (always internal, linkage code 7), and metadata (copied via sub_15E6480). This is the key delta from stock LLVM: upstream GlobalOpt does not consider address space when splitting globals because host-side address spaces are trivial. On GPU, promoting a __shared__ struct global to per-field __shared__ globals preserves the 10x latency advantage over DRAM, while accidentally demoting to generic would force the hardware to resolve address space at runtime via the generic-to-specific address resolution unit.

Entry Guard: Type Filtering

Before attempting any transformation, the pass filters on the global's type tag (byte at type + 8). The acceptance bitmask is 0x8A7E:

// Bits set: 1,2,3,4,5,6,9,11,15
uint16_t bitmask = 0x8A7E;
if ((1 << type_tag) & bitmask) {
    // accepted: i16/half, i32/float, i64, x86_fp80, i128, fp128, double, iN, opaque-ptr
}

Additionally, struct (tag 13), vector (tag 14), and array (tag 16) types are accepted if sub_16435F0(type, 0) returns true -- this is the isAnalyzableType predicate that recursively checks whether the type's leaf elements are all scalars or pointers.

After type filtering, the pass walks the global's use-list. Every user must be either a store (opcode tag 54) or a load (opcode tag 55). If any user is an arithmetic instruction (tag <= 23), a GEP used in a non-trivial way, or any other instruction kind, the global is rejected -- it cannot be optimized because its address escapes or is used in a way the pass cannot model.

Path A: Small-Constant Promotion

When the global's initializer is a struct constant and its total bit-size (including alignment padding) fits within 2047 bits (0x7FF), the pass promotes it into a function-local value with a separate initializer function. This threshold is NVIDIA-specific -- upstream LLVM uses different heuristics based on TargetData layout considerations. The 2047-bit ceiling corresponds roughly to 64 32-bit registers, aligning with the per-thread register budget on most SM architectures where promoting beyond that limit would spill to local memory and negate the benefit.

Size Computation

The pass walks the type tree recursively to compute total bit-size. The implementation at lines 499-570 of the decompilation uses a switch on the type tag byte at type + 8:

Type tag                  Type                  Bits
0x1                       i16 / half            16
0x2                       i32 / float           32
0x3                       i64                   64
0x4                       x86_fp80              80
0x5                       i128                  128
0x6                       fp128 / ppc_fp128     128
0x7                       pointer               sub_15A9520(target, 0) * 8
0x9                       double                64
0xB                       iN (custom width)     from type word >> 8
0xD                       struct                8 * field_count (via sub_15A9930)
0xE                       vector                8 * alignment * num_elements * padded_size
0xF                       opaque ptr            sub_15A9520(target, addr_space) * 8
0x0, 0x8, 0xA, 0xC, 0x10  array variants        element_size * array_length (recursive)

Note that opaque pointers (tag 0xF) use getPointerSizeInBits(target, addr_space) -- the pointer size varies by address space on NVPTX (64-bit for AS 0/1, potentially 32-bit for AS 3/5 on some targets). Tags 0x0, 0x8 (label/token), 0xA (metadata), and 0xC (bfloat) all fall into the array-multiplier path -- they extract an element count and recurse, which handles the case where these type wrappers contain inner array types.

The pseudocode for the size computation:

// sub_18612A0, lines 499-570
uint64_t compute_total_bits(Type *type, TargetInfo *target, uint8_t addr_space) {
    uint8_t tag = *(uint8_t *)(type + 8);
    switch (tag) {
    case 0x1:  return 16;                                     // i16 / half
    case 0x2:  return 32;                                     // i32 / float
    case 0x3:  return 64;                                     // i64
    case 0x4:  return 80;                                     // x86_fp80
    case 0x5:  return 128;                                    // i128
    case 0x6:  return 128;                                    // fp128 / ppc_fp128
    case 0x7:  return sub_15A9520(target, 0) * 8;             // generic pointer
    case 0x9:  return 64;                                     // double
    case 0xB:  return *(uint32_t *)(type + 8) >> 8;           // iN custom-width
    case 0xD: {                                               // struct
        uint64_t layout = sub_15A9930(target, type);          // getStructLayout
        return 8 * *(uint32_t *)(layout + 12);                // 8 * element_count
    }
    case 0xE: {                                               // vector
        uint64_t align = sub_15A9FE0(target, type);           // getAlignment
        uint64_t n_elts = *(uint32_t *)(type + 12);
        uint64_t elem_bits = compute_total_bits(
            sub_16463B0(type, 0), target, addr_space);        // getArrayElementType
        return 8 * align * n_elts * ((elem_bits + align - 1) / align);
    }
    case 0xF:  return sub_15A9520(target, addr_space) * 8;    // opaque ptr (AS-aware)
    default: {                                                 // 0x0,0x8,0xA,0xC,0x10: array
        uint64_t n_elts = *(uint32_t *)(type + 12);
        Type *elem = sub_16463B0(type, 0);                    // getArrayElementType
        return n_elts * compute_total_bits(elem, target, addr_space);
    }
    }
}

The acceptance check at line 570:

if (total_elements * alignment * ceil_div(total_bits, alignment) > 0x7FF)
    goto path_b;  // too large, try SRA instead

Generated IR Pattern

For a qualifying global, the pass generates three components:

; Original: @my_global = addrspace(1) global { i32, i32 } { i32 42, i32 7 }

; After promotion:
@my_global.body = internal addrspace(1) global { i32, i32 } { i32 42, i32 7 }

define internal void @my_global.init() {
  store { i32, i32 } { i32 42, i32 7 }, ptr addrspace(1) @my_global.body
  ret void
}

; All loads of @my_global replaced with: load ptr addrspace(1) @my_global.body
; ExtractValue users get ".val" accessors
; Uninitialized code paths get "notinit" sentinel via sub_15FB630

The .body global is created via sub_15E51E0 with the same address space and internal linkage (code 7). The .init function is created via sub_15E5070. The pass then walks all users of the original global: loads (tag 55) get redirected to the .body global, GEPs (tag 71) get RAUW'd via sub_164D160, and extractvalue instructions (tag 75) get specialized .val accessors. Sub-opcodes on the extractvalue determine further handling: codes 0x20/0x25/0x29 produce notinit sentinels, 0x24/0x28 extract terminal types via sub_159C540, and 0x21-0x23/0x26-0x27 pass through unchanged.

The full promotion pseudocode covering body creation, init creation, and use rewriting:

// sub_18612A0, lines 577-805 — Path A: small-constant promotion
void promote_small_constant(Global *global, Module *module, Value *init_val,
                            Type *type, TargetInfo *target) {
    // --- Extract address space from global flags ---
    uint8_t addr_space = (*(uint8_t *)(global + 33) >> 2) & 7;

    // --- Create ".body" global in same address space ---
    void *node = sub_1648A60(88, 1);                           // IRBuilder::create
    Global *body_gv = sub_15E51E0(
        get_scope(module), type, /*init=*/0, /*linkage=*/7,
        concat_name(global, ".body"), addr_space);             // createGlobalVar
    sub_15E6480(global, body_gv);                              // copyMetadata

    // --- Rewrite all users of original global ---
    Use *use = *(Use **)(global + 8);                          // use-list head
    while (use != NULL) {
        Instruction *inst = sub_1648700(use);                  // getInstruction
        uint8_t opcode = *(uint8_t *)(inst + 16);

        if (opcode == 71) {                                    // GEP
            // If GEP references old global, RAUW to body
            sub_164D160(inst, body_gv);                        // RAUW
            sub_15F20C0(inst);                                 // eraseFromParent
        } else {
            // Create local variable referencing body
            Value *local = sub_15FD590(inst, get_scope(module),
                                       "newgv", module);       // createLocalVar
            sub_1648780(use, local);                           // replaceUseWith
        }
        use = *(Use **)(use + 8);                              // next use
    }

    // --- Create ".init" function ---
    Function *init_fn = sub_15E5070(
        get_scope(module), type, /*linkage=*/7,
        init_val, concat_name(global, ".init"));               // createFunction
    int init_user_count = 0;

    // Walk users again for extractvalue and load rewriting
    use = *(Use **)(body_gv + 8);
    while (use != NULL) {
        Instruction *inst = sub_1648700(use);
        uint8_t opcode = *(uint8_t *)(inst + 16);

        if (opcode == 55) {                                    // load
            sub_15F9480(init_val, init_fn);                    // createStoreInit
            init_user_count++;
        } else if (opcode == 75) {                             // extractvalue
            Value *val_acc = sub_15F8F80(inst, type, init_fn,
                concat_name(global, ".val"));                  // createExtractValue
            uint8_t sub_opcode = *(uint8_t *)(inst + 24);
            switch (sub_opcode) {
            case 0x20: case 0x25: case 0x29:
                // Uninitialized path: create "notinit" sentinel
                sub_15FB630(val_acc, "notinit", inst);         // createNotInit
                break;
            case 0x24: case 0x28:
                // Terminal type extraction
                sub_159C540(val_acc);                          // getTerminalType
                break;
            default:                                           // 0x21-0x23, 0x26-0x27
                break;                                         // pass-through
            }
            sub_164D160(inst, val_acc);                        // RAUW
            sub_15F20C0(inst);                                 // eraseFromParent
            init_user_count++;
        }
        use = *(Use **)(use + 8);
    }

    // --- Finalize ---
    if (init_user_count > 0) {
        sub_1631BE0(module_fn_list, init_fn);                  // insertIntoFnList
        // Patch metadata chain at global+56
        *(void **)(global + 56) = init_fn;
    } else {
        // Dead init function: destroy
        sub_15E5530(init_fn);                                  // destroyFunctionBody
        sub_159D9E0(init_fn);                                  // destroyFunction
        sub_164BE60(init_fn);                                  // dropAllReferences
        sub_1648B90(init_fn);                                  // markDead (flags |= 1)
    }

    sub_15E55B0(global);                                       // erase original global
    sub_15F20C0(module_entry);                                 // erase module-level ref

    // --- Recursive re-application to newly created .body ---
    sub_185B1D0(body_gv, target);                              // recursiveGlobalOpt
}

After rewriting all uses, if the .init function has users, it is linked into the module's function list via sub_1631BE0. If it has zero users (the initializer was never needed), the function body is destroyed and marked dead. The original global is erased via sub_15E55B0. Finally, sub_185B1D0 recursively re-applies GlobalOpt to the newly created .body global, enabling cascaded optimizations.

Path B: Scalar Replacement of Aggregates (SRA)

When a global is too large for constant promotion, the pass attempts SRA -- exploding a struct global into per-field scalar globals. This path has stricter preconditions:

  1. The caller's flag parameter (a4) must be zero -- when set, SRA is disabled.
  2. The initializer must be the unique initializer for this global (verified via sub_15A0680).
  3. The type must be a struct (tag 13) with 1 to 16 fields: field_count - 1 <= 0xF.
  4. Every user must reference only this global -- no cross-global pointer arithmetic.

The 16-field limit is a hardcoded constant at line 822 of the decompilation. It prevents combinatorial explosion in the null-check and free chains that follow: each field generates one icmp eq (null check), one or, one conditional branch, one free_it block, and one next block. Beyond 16 fields the cost of the generated guard code would exceed the benefit of splitting.

Use Analysis: Store Value Collection

Before field explosion, the pass collects all stored values into a hash set to determine which initializers are live. For each store (tag 54) user of the global, sub_185CAF0 inserts the stored value into a hash/set structure at v432. The scratch buffer starts with capacity 32 and grows via sub_16CC920 when full. This collection serves two purposes: it validates that all stores write analyzable values (no opaque function pointers or computed addresses), and it builds the value set used later to initialize the per-field globals.

// sub_18612A0, lines 823-868 — Store value collection for SRA
void collect_store_values(Global *global, Module *module,
                          HashSet *store_set, Buffer *scratch) {
    Use *use = *(Use **)(global + 8);
    int store_count = 0;

    while (use != NULL) {
        Instruction *inst = sub_1648700(use);                  // getInstruction
        uint8_t opcode = *(uint8_t *)(inst + 16);

        if (opcode == 54) {                                    // store
            sub_185CAF0(use, store_set, scratch);              // collectStoredValue
            store_count++;

            // Grow scratch if full
            if (scratch->size >= scratch->capacity) {
                if (scratch->capacity < 64)
                    memset(scratch->data, 0xFF, scratch->capacity * 8);
                else
                    sub_16CC920(scratch);                      // growScratchBuffer
            }
        }
        use = *(Use **)(use + 8);
    }
}

Global-Only-Use Validation

After collection, lines 878-1017 validate that every user of every collected global references only the target global -- no cross-global pointer arithmetic is allowed. The validation walks the use chain of each collected global. For each operand slot (24-byte stride, count from *(uint32_t *)(global + 20) & 0xFFFFFFF):

  • If the operand is the module itself: accepted.
  • If the opcode tag is <= 0x17 (arithmetic/comparison): rejected -- the global's address is used in computation.
  • If the opcode is 77 (GEP): the pass calls sub_16CC9F0 (find in sorted set) to verify the GEP's base pointer is the same global being split.
  • If the opcode is 54 (store): the pass checks that the store's parent basic block (at offset -24 from the operand) belongs to the global being analyzed.

If any operand fails validation, a flag v17 is set to zero and the entire SRA path is abandoned for this global.

Field Explosion

For each field index 0 through field_count - 1, the pass creates a replacement global variable in the same address space with internal linkage. The full pseudocode at lines 1084-1476:

// sub_18612A0, lines 1084-1476 — SRA field explosion
typedef struct {
    Global **data;
    uint64_t size;
    uint64_t capacity;
} FieldVec;

void sra_explode_fields(Global *global, Module *module, Type *struct_type,
                        Value *init_val, TargetInfo *target, FieldVec *fields) {
    uint8_t addr_space = (*(uint8_t *)(global + 33) >> 2) & 7;
    const char *global_name = sub_1649960(global);             // getName
    uint32_t field_count = *(uint32_t *)(struct_type + 12);
    uint64_t ptr_bits = sub_15A9520(target, addr_space);       // getPointerSizeInBits

    for (uint32_t i = 0; i < field_count; i++) {
        // --- Extract field type and offset ---
        Type *field_type = sub_1646BA0(struct_type, ptr_bits); // getStructFieldType
        uint64_t field_offset = sub_15A06D0(struct_type, i);   // computeFieldOffset

        // --- Generate name: "my_global.f0", "my_global.f1", ... ---
        char name[256];
        snprintf(name, sizeof(name), "%s.f%d", global_name, i);

        // --- Extract field initializer from parent init ---
        Value *field_init = sub_15FEBE0(module, init_val, field_type); // createBitcast/GEP

        // --- Create field global in same address space, internal linkage ---
        Global *field_gv = sub_15E51E0(
            get_scope(module), field_type, field_init,
            /*linkage=*/7, name, addr_space);                  // createGlobalVar

        // --- Copy metadata from parent to field global ---
        sub_15E6480(global, field_gv);                         // copyMetadata

        // --- Store into dynamically-grown field vector ---
        if (fields->size >= fields->capacity) {
            // Realloc growth: double capacity (lines 1161-1220)
            uint64_t new_cap = fields->capacity * 2;
            if (new_cap < 8) new_cap = 8;
            fields->data = realloc(fields->data, new_cap * sizeof(Global *));
            fields->capacity = new_cap;
        }
        fields->data[fields->size++] = field_gv;

        // --- Compute field bit-size (same type switch as Path A) ---
        uint64_t field_bits = compute_total_bits(field_type, target, addr_space);
        uint64_t alignment;
        if (*(uint8_t *)(field_type + 8) == 0xD) {            // struct
            uint64_t layout = sub_15A9930(target, field_type);
            alignment = *(uint64_t *)(layout + 8);
        } else {
            alignment = sub_15A9FE0(target, field_type);       // getAlignment
        }
        uint64_t padded = alignment * ((field_bits + alignment - 1) / alignment);

        // --- Create GEP replacement and store initializer ---
        Value *gep = sub_15FEBE0(module, field_gv, field_type); // createBitcast/GEP
        sub_15F9660(field_offset, field_gv, global);            // createFieldStore
    }
}

The field globals are stored in a dynamically grown, std::vector-like array with a realloc-based growth strategy (lines 1161-1220 of the decompilation). The growth factor is 2x, with a minimum initial capacity of 8 entries.

Null/Negative Guards

After field explosion, the pass generates safety checks for the original global's pointer value. This pattern handles the case where the global was heap-allocated via malloc -- the original pointer might be null or negative (indicating allocation failure on some platforms). The guard chain is constructed at lines 1478-1535:

// sub_18612A0, lines 1478-1535 — Null/negative guard chain generation
Value *build_guard_chain(Global *global, FieldVec *fields,
                         Module *module, TargetInfo *target) {
    // --- Create %isneg = icmp slt <ptr>, 0 ---
    // Opcode 51 = ICmp, predicate 40 = SLT (signed less than zero)
    Value *isneg = sub_15FEC10(
        /*dest=*/NULL, /*type_id=*/1, /*opcode=*/51, /*pred=*/40,
        get_module_sym(module), /*offset=*/0,
        concat_name(global, ".isneg"), get_current_bb(module)); // createICmp

    Value *chain = isneg;

    // --- For each field: %isnullI = icmp eq <field_ptr>, null ---
    for (uint64_t i = 0; i < fields->size; i++) {
        Global *field_gv = fields->data[i];
        uint64_t field_offset = sub_15A06D0(
            get_type(global), i);                              // computeFieldOffset

        // Predicate 32 = EQ (equal to null)
        Value *isnull = sub_15FEC10(
            /*dest=*/NULL, /*type_id=*/1, /*opcode=*/51, /*pred=*/32,
            field_gv, field_offset,
            concat_name(global, ".isnull"), get_current_bb(module));

        // Chain with OR: %tmpI = or i1 %chain, %isnullI
        // Opcode 27 = OR
        char tmp_name[16];
        snprintf(tmp_name, sizeof(tmp_name), "tmp%lu", i);
        chain = sub_15FB440(/*opcode=*/27, chain, isnull,
                            tmp_name, module);                 // createBinOp(OR)
    }

    return chain;  // final chained predicate
}

The generated IR for a 3-field struct:

%isneg  = icmp slt ptr @my_global, null          ; predicate 40 = SLT
%isnull0 = icmp eq ptr @my_global.f0, null        ; predicate 32 = EQ
%tmp0   = or i1 %isneg, %isnull0
%isnull1 = icmp eq ptr @my_global.f1, null
%tmp1   = or i1 %tmp0, %isnull1
%isnull2 = icmp eq ptr @my_global.f2, null
%tmp2   = or i1 %tmp1, %isnull2
br i1 %tmp2, label %malloc_ret_null, label %malloc_cont

The .isneg guard is created by sub_15FEC10 with opcode 51 (ICmp), predicate 40 (SLT with zero). Per-field .isnull guards use predicate 32 (EQ with null). The guards are chained with OR instructions (opcode 27) via sub_15FB440. The chain evaluation is linear in the number of fields -- for the maximum 16 fields, this produces 17 icmp instructions and 16 or instructions, plus one terminal conditional branch.

Malloc/Free Decomposition Algorithm

This is the core of NVIDIA's per-field malloc/free elimination, covering lines 1537-1640 of the decompilation. When the chained null check indicates a valid allocation, the pass generates a multi-block control flow that replaces the original single malloc/free pair with per-field conditional frees. This is the key divergence from upstream LLVM: stock tryToOptimizeStoreOfMallocToGlobal treats the malloc/free as an atomic pair, replacing it with a single static allocation. NVIDIA decomposes to per-field granularity, generating 2N+2 basic blocks for an N-field struct (one malloc_ret_null, one malloc_cont, and for each field one free_it plus one next block).

The complete pseudocode:

// sub_18612A0, lines 1537-1640 — Malloc/free decomposition
void decompose_malloc_free(Global *global, Module *module, Function *fn,
                           FieldVec *fields, Value *guard_chain,
                           TargetInfo *target) {
    uint8_t addr_space = (*(uint8_t *)(global + 33) >> 2) & 7;

    // === Step 1: Create control flow skeleton ===

    // "malloc_cont" — continuation after successful allocation check
    BasicBlock *malloc_cont_bb = sub_157FBF0(
        fn, get_global_chain(module), "malloc_cont");          // createBB

    // "malloc_ret_null" — failure path returning null
    BasicBlock *ret_null_body = sub_157E9C0(fn);               // createReturnBB
    BasicBlock *malloc_ret_null_bb = sub_157FB60(
        NULL, ret_null_body, "malloc_ret_null", NULL);         // createBBWithPred

    // === Step 2: Emit conditional branch on guard chain ===
    // br i1 %guard_chain, label %malloc_ret_null, label %malloc_cont
    sub_15F8650(
        get_terminator(fn),                                    // insertion point
        malloc_ret_null_bb,                                    // true target (fail)
        malloc_cont_bb,                                        // false target (success)
        guard_chain,                                           // condition (isneg|isnull)
        fn);                                                   // createCondBr

    // === Step 3: Per-field conditional free and reinitialization ===
    BasicBlock *current_bb = malloc_cont_bb;

    for (uint64_t i = 0; i < fields->size; i++) {
        Global *field_gv = fields->data[i];
        uint64_t field_offset = sub_15A06D0(get_type(global), i);
        Type *field_type = sub_1646BA0(get_type(global),
                                       sub_15A9520(target, addr_space));

        // 3a. Create "tmp" alloca in current block
        Value *tmp_alloca = sub_15F9330(
            NULL, field_type, "tmp", current_bb);              // createAlloca

        // 3b. Create non-null check: %condI = icmp ne <field_ptr>, null
        // Opcode 51 = ICmp, predicate 33 = NE (not equal to null)
        char cond_name[64];
        snprintf(cond_name, sizeof(cond_name), "%s.f%lu.nonnull",
                 sub_1649960(global), i);
        Value *cond = sub_15FED60(
            NULL, /*type_id=*/1, /*opcode=*/51, /*pred=*/33,
            field_gv, field_offset, cond_name, current_bb);    // createICmpNE

        // 3c. Create "free_it" block — frees this field if non-null
        char free_name[64];
        snprintf(free_name, sizeof(free_name), "free_it%lu", i);
        BasicBlock *free_it_bb = sub_157FB60(
            NULL, NULL, free_name, NULL);                      // createBBWithPred

        // 3d. Create "next" block — fallthrough after conditional free
        char next_name[64];
        snprintf(next_name, sizeof(next_name), "next%lu", i);
        BasicBlock *next_bb = sub_157FB60(
            NULL, NULL, next_name, NULL);                      // createBBWithPred

        // 3e. Conditional branch: non-null → free, null → skip
        // br i1 %condI, label %free_itI, label %nextI
        sub_15F8650(
            get_terminator_of(current_bb),
            free_it_bb,                                        // true: free
            next_bb,                                           // false: skip
            cond, fn);                                         // createCondBr

        // 3f. In free_it block: wire field into use-def chain, then branch to next
        sub_15FDB00(field_gv, get_use_chain(free_it_bb),
                    i, free_it_bb);                            // wireDef

        // Unconditional branch: free_it → next
        sub_15F8590(NULL, next_bb, free_it_bb);                // createBr

        // 3g. In next block: store field initializer into the new field global
        sub_15F9850(field_offset, tmp_alloca, next_bb);        // createStoreToField

        current_bb = next_bb;
    }

    // === Step 4: Wire entry into malloc_cont, erase original ===
    // Unconditional branch from entry into malloc_cont
    sub_15F8590(NULL, malloc_cont_bb, get_entry_bb(fn));       // createBr

    // Erase the original global
    sub_15F20C0(get_module_entry(module));                      // eraseFromParent
}

The generated CFG for a 2-field struct { i32, float }:

entry:
  br i1 %tmp1, label %malloc_ret_null, label %malloc_cont

malloc_ret_null:
  ret null

malloc_cont:
  %cond0 = icmp ne ptr @g.f0, null
  br i1 %cond0, label %free_it0, label %next0

free_it0:
  ; free(@g.f0)   — conditional per-field deallocation
  br label %next0

next0:
  store i32 <init0>, ptr addrspace(1) @g.f0
  %cond1 = icmp ne ptr @g.f1, null
  br i1 %cond1, label %free_it1, label %next1

free_it1:
  ; free(@g.f1)
  br label %next1

next1:
  store float <init1>, ptr addrspace(1) @g.f1
  ; ... continuation

Each free_it block is conditionally entered only when the field pointer is non-null, preventing double-free on fields that were never successfully allocated. The next blocks store the field initializer after the conditional free, ensuring the field global is properly initialized regardless of whether freeing occurred. This per-field decomposition enables a critical optimization that upstream LLVM cannot perform: if a later pass (dead store elimination, constant propagation) determines that only some fields of the struct are actually used, the unused field globals and their associated free_it/next blocks become dead code and are trivially eliminated by GlobalDCE.

Address-Space-Aware Splitting

The address space preservation logic is woven throughout both the field explosion and the malloc/free decomposition. Every call to sub_15E51E0 (createGlobalVar) passes the extracted address space from the parent global. The extraction point is always the same: (*(uint8_t *)(global + 33) >> 2) & 7. This is critical for three reasons:

  1. Shared memory splitting: A __shared__ struct global (AS 3) split into per-field globals must keep each field in AS 3. If any field migrated to AS 0 (generic), the hardware would resolve the address at runtime via the generic-to-specific resolution unit, adding 10-20 cycles of latency per access and defeating the purpose of placing data in shared memory.

  2. Constant memory splitting: A __constant__ struct (AS 4) split into fields must remain in AS 4 to benefit from the constant cache's broadcast capability. A single warp reading the same constant field hits the cache once and broadcasts to all 32 threads. In AS 0 (generic), this broadcast would not occur.

  3. Pointer size consistency: On some NVPTX targets, pointers in AS 3 (shared) and AS 5 (local) are 32-bit, while AS 0 and AS 1 pointers are 64-bit. The size computation for opaque pointers (tag 0xF) calls sub_15A9520(target, addr_space) -- if the address space were lost during splitting, the pointer size calculation would be wrong, producing incorrect field offsets and corrupted stores.

The per-field null checks in the guard chain also respect address space: the icmp eq with null uses a null pointer of the correct address space width. A 32-bit null in AS 3 is not the same bit pattern as a 64-bit null in AS 1.

Hash Table for Processed Globals

After field explosion and malloc rewrite, the pass uses a custom hash table (open addressing, 32-byte entries) to track which globals and their transitive users have been processed. This is an instance of the NVIDIA-original hash table variant (sentinel pair -8/-16) as documented in the hash infrastructure page.

Offset  Field  Description
+0      key    Pointer to global (sentinel: -8 = empty, -16 = tombstone)
+8      data   Pointer to field-global vector
+16     size   Current vector size
+24     cap    Vector capacity

Hash function, quadratic probing with triangular numbers, and 75% load factor / 12.5% tombstone compaction thresholds all follow the standard DenseMap infrastructure; see Hash Table and Collection Infrastructure for details.

The processing loop (lines 1710-1812) iterates remaining users of the original global and rewrites them to reference the new field globals:

// sub_18612A0, lines 1710-1812 — Post-SRA user rewriting via hash table
void rewrite_remaining_users(Global *global, FieldVec *fields,
                             HashTable *table, Module *module,
                             TargetInfo *target) {
    Use *use = *(Use **)(global + 8);

    while (use != NULL) {
        Use *next_use = *(Use **)(use + 8);
        Instruction *inst = sub_1648700(use);
        uint8_t opcode = *(uint8_t *)(inst + 16);

        if (opcode == 54) {                                    // store
            // Walk the store's own use-chain
            Use *store_use = *(Use **)(inst + 8);
            while (store_use != NULL) {
                Use *next_su = *(Use **)(store_use + 8);

                // Per-user SRA rewrite: replaces GEP+store/load sequences
                // with direct accesses to the appropriate field global
                sub_1860BE0(store_use, table, fields, target); // rewriteUserForSRA

                store_use = next_su;
            }

            // If store has no remaining uses, erase it
            if (*(Use **)(inst + 8) == NULL) {
                sub_15F20C0(inst);                             // eraseFromParent
                // Remove from hash table (mark as tombstone)
                HashEntry *entry = sub_1860630(
                    inst, 0, table, NULL);                     // lookupInTable
                if (entry != NULL)
                    entry->key = (void *)(-16);                // tombstone
            }
        } else {
            // For non-store users (loads, etc.): create direct stores
            // to the appropriate field global
            for (uint64_t i = 0; i < fields->size; i++) {
                uint64_t offset = sub_15A06D0(
                    get_type(global), i);                      // computeFieldOffset
                sub_15F9660(offset, fields->data[i], inst);    // createFieldStore
            }
        }

        use = next_use;
    }
}

After all users are rewritten, cleanup proceeds in two phases: first, operand lists of dead GEP (tag 77) and store (tag 54) instructions are unlinked from the use chain (nulling out 24-byte-stride operand slots at lines 2004-2079); second, the dead instructions are erased via sub_15F20C0 at lines 2081-2117. Finally, the original global declaration is erased via sub_15E55B0, and all temporary data structures (hash table backing array, field vectors, scratch buffers) are freed at lines 2119-2161.
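A hypothetical sketch of the phase-one unlinking, assuming a 24-byte operand slot of {value, next-use, prev-link} form. The field layout is an illustration consistent with the 24-byte stride, not a recovered structure:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed 24-byte operand slot layout (illustrative only). */
typedef struct Operand {
    void *value;            /* +0:  the used Value */
    struct Operand *next;   /* +8:  next use of the same Value */
    struct Operand **prev;  /* +16: back-link into the use chain */
} Operand;

/* Phase one of cleanup: unlink each operand of a dead instruction
 * from its value's use chain and null the slot, walking the operand
 * array at the 24-byte stride. */
static void unlink_operands(Operand *ops, size_t count) {
    for (size_t i = 0; i < count; i++) {
        Operand *op = &ops[i];
        if (op->prev) *op->prev = op->next;
        if (op->next) op->next->prev = op->prev;
        op->value = NULL;
        op->next  = NULL;
        op->prev  = NULL;
    }
}
```

Only after every dead operand is detached does phase two erase the instructions themselves, so no erase ever observes a stale use edge.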

Top-Level Driver: sub_18612A0

The pseudocode below gives the complete control flow of the core transform function, integrating all four strategies; it corresponds to the entire 2179-line decompilation:

// sub_18612A0 — Core GlobalOpt transform for a single global variable
// Returns: 1 if transformed, 0 if no transformation applied
int globalopt_transform(Global *global, Module *module, Type *type,
                        int flag, TargetInfo *target, TargetInfo *target2) {
    // === Phase 1: Type filter (lines 444-451) ===
    uint8_t type_tag = *(uint8_t *)(type + 8);
    uint16_t bitmask = 0x8A7E;  // bits 1-6, 9, 11, 15 set
    if (!((1 << type_tag) & bitmask)) {
        // Additional acceptance for struct(13), vector(14), array(16)
        if (type_tag == 13 || type_tag == 14 || type_tag == 16) {
            if (!sub_16435F0(type, 0))                         // isAnalyzableType
                return 0;
        } else {
            return 0;
        }
    }

    // === Phase 2: Use validation — all users must be store/load (lines 452-481) ===
    Buffer scratch = { .data = alloca(8 * sizeof(void *)), .size = 0, .capacity = 8 };
    Use *use = *(Use **)(global + 8);
    while (use != NULL) {
        Instruction *inst = sub_1648700(use);                  // getInstruction
        uint8_t opcode = *(uint8_t *)(inst + 16);
        if (opcode <= 0x17) return 0;                          // arithmetic: reject
        if (opcode == 54) {                                    // store
            if (!sub_185C920(inst, &scratch))                  // analyzeStore
                return 0;
        } else if (opcode != 55) {                             // not load either
            return 0;
        }
        use = *(Use **)(use + 8);
    }

    // === Phase 3: Collect store values and evaluate initializer (lines 482-493) ===
    Buffer store_buf = { .data = calloc(32, sizeof(void *)), .size = 0, .capacity = 32 };
    sub_185C560(module, global, &store_buf);                   // collectStoreValues
    Value *init_val = sub_140B2F0(module, target, global, 1);  // evaluateInitializer

    // === Phase 4: Try Path A — small-constant promotion (lines 494-805) ===
    uint8_t init_tag = *(uint8_t *)(init_val + 16);
    if (init_tag == 13) {                                      // struct constant
        uint8_t addr_space = (*(uint8_t *)(global + 33) >> 2) & 7;
        uint64_t total_bits = compute_total_bits(type, target, addr_space);
        uint64_t alignment = sub_15A9FE0(target, type);
        uint64_t padded = alignment * ((total_bits + alignment - 1) / alignment);

        if (padded <= 0x7FF) {                                 // <= 2047 bits
            promote_small_constant(global, module, init_val, type, target);
            free(store_buf.data);
            return 1;
        }
    }

    // === Phase 5: Try Path B — SRA of struct globals (lines 807-2177) ===
    if (flag != 0) { free(store_buf.data); return 0; }        // SRA disabled by caller

    // Verify unique initializer
    if (init_val != sub_15A0680(get_module_sym(module), 1, 0)) {
        free(store_buf.data); return 0;
    }

    // Check struct with 1-16 fields
    if (type_tag == 14) type = unwrap_vector(type);            // vector peeling
    if (*(uint8_t *)(type + 8) != 13) { free(store_buf.data); return 0; }
    uint32_t field_count = *(uint32_t *)(type + 12);
    if (field_count - 1 > 0xF) { free(store_buf.data); return 0; }  // 0 or >16 fields

    // Collect stored values into hash set (lines 823-868)
    HashSet store_set;
    init_hashset(&store_set);
    collect_store_values(global, module, &store_set, &scratch);

    // Validate all users reference only this global (lines 878-1017)
    if (!validate_global_only_uses(global, &store_set)) {
        free(store_buf.data); return 0;
    }

    // Optional vector type peeling (lines 1026-1083)
    if (*(uint8_t *)(type + 8) == 14) {
        peel_vector_type(global, module, type, target);
    }

    // Field explosion (lines 1084-1476)
    FieldVec fields = { .data = NULL, .size = 0, .capacity = 0 };
    sra_explode_fields(global, module, type, init_val, target, &fields);

    // Null/negative guard chain (lines 1478-1535)
    Value *guard = build_guard_chain(global, &fields, module, target);

    // Malloc/free decomposition (lines 1537-1640)
    Function *fn = get_parent_function(global);
    decompose_malloc_free(global, module, fn, &fields, guard, target);

    // Hash-table-driven user rewriting (lines 1642-2161)
    HashTable processed;
    init_hashtable(&processed);
    rewrite_remaining_users(global, &fields, &processed, module, target);

    // Cleanup: unlink dead operands, erase dead instructions
    cleanup_dead_instructions(&processed);                     // lines 2004-2117

    // Erase original global and free temporaries
    sub_15E55B0(global);                                       // lines 2119-2161
    free(fields.data);
    free(store_buf.data);
    destroy_hashtable(&processed);
    destroy_hashset(&store_set);

    return 1;
}

LTO Interaction

GlobalOpt benefits significantly from LTO's whole-program visibility. In single-compilation mode, a __device__ global with external linkage cannot be optimized because the compiler cannot prove it is unused by other translation units. With ThinLTO, the NVModuleSummary builder records per-global reference edges, and the ThinLTO importer pulls definitions across module boundaries. After import, GlobalOpt can see all users of a global across the entire program and make decisions that are impossible in per-module compilation:

  • Internalization: A global referenced only within one module (after import) can be marked internal (linkage 7), enabling all four transformation paths.
  • Dead global elimination: A global with zero users after import is trivially dead and erased. The NVModuleSummary builder's address-space tracking ensures that __device__ globals referenced by kernels are not prematurely killed -- a kernel's reference counts as a use even when no host-side code touches the global.
  • Cross-module constant propagation: After import, if a __device__ global is stored exactly once (from a host-side cudaMemcpyToSymbol) and loaded many times across multiple device functions, the single-store can be propagated as a constant, unlocking Path A's small-constant promotion.
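The decision logic in these bullets can be condensed into a sketch. The classification below is an assumed simplification for illustration; the actual logic is distributed across the NVModuleSummary builder and the pass:

```c
#include <assert.h>

typedef enum { KEEP, INTERNALIZE, ERASE, PROPAGATE } Action;

/* Post-import triage of a __device__ global (illustrative).
 * Kernel references count as uses, so a global touched only by a
 * kernel is never erased even if no host code references it. */
static Action classify(unsigned uses, unsigned kernel_refs,
                       unsigned external_modules, unsigned store_count) {
    if (uses + kernel_refs == 0) return ERASE;       /* trivially dead  */
    if (external_modules > 0)    return KEEP;        /* still escapes   */
    if (store_count == 1)        return PROPAGATE;   /* Path A candidate */
    return INTERNALIZE;                              /* linkage 7       */
}
```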

GlobalOpt is also reachable from the inliner cost model: the inliner calls the core transform sub_18612A0 directly, to evaluate whether post-inline global folding would pay for the inline cost. This creates a feedback loop: inlining a caller that references a global may expose the global for optimization, which reduces code size, which makes further inlining cheaper.

Recursion

After completing either Path A or Path B, the pass recursively calls sub_185B1D0 on the newly created replacement globals. This handles cascading opportunities: splitting a struct global into fields may expose one of the field globals for further small-constant promotion (if a field is a small struct itself), or for dead elimination (if one field is never used). The recursion terminates when no further transformations apply -- each recursive call runs the same type filter and use validation, so it will return 0 for leaf scalars or globals with non-store/load users.

Knobs and Thresholds

Threshold                  Value          Source           Effect
Max bits for Path A        2047 (0x7FF)   Hardcoded        Globals exceeding this fall through to SRA
Max struct fields for SRA  16             Hardcoded        Structs with >16 fields are not split
Hash table load factor     75% (3/4)      Hardcoded        Triggers rehash of processed-globals table
Tombstone threshold        12.5% (1/8)    Hardcoded        Triggers compacting rehash
Initial scratch buffer     8 entries      Hardcoded        For use analysis; grows via sub_16CC920
Store collection buffer    32 entries     Hardcoded        For store value collection; grows dynamically
SRA disable flag (a4)      Caller-set     Runtime          When set, Path B is bypassed entirely
Pipeline gate              opts[1440]     Config array     When set, the sub_196A2B0 wrapper is skipped
Optimization tier          >= 2           Pipeline config  GlobalOpt not run at tier 1

The pipeline parser registers "globalopt" at slot 45 in the pass name table, mapping to llvm::GlobalOptPass. The NVIDIA wrapper sub_196A2B0 is gated by the config array at offset 1440 -- when opts[1440] is set, the wrapper skips the pass entirely. At tier 2, GlobalOpt runs unconditionally at pipeline position 30. At tier 3, it runs with the same parameters but benefits from more aggressive SCCP and GlobalDCE having run upstream.

There are no user-facing CLI flags that directly control the 2047-bit threshold or the 16-field SRA limit. These are compile-time constants in the binary. The only external control is the tier-level gate and the opts[1440] kill switch.
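The Path A gate itself is simple arithmetic: pad the total bit size up to the ABI alignment, then compare against the hardcoded 0x7FF ceiling. A direct restatement of the Phase 4 pseudocode above:

```c
#include <assert.h>
#include <stdint.h>

/* Path A admission test: round total_bits up to a multiple of the
 * ABI alignment (in bits), then gate at 2047 bits -- just under
 * 64 32-bit registers' worth of data. */
static int fits_path_a(uint64_t total_bits, uint64_t align_bits) {
    uint64_t padded =
        align_bits * ((total_bits + align_bits - 1) / align_bits);
    return padded <= 0x7FF;
}
```

Note that padding can push a global over the limit: a 2047-bit struct with 32-bit alignment pads to 2048 bits and falls through to SRA.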

Function Map

Function     Address    Size  Role
sub_18612A0  0x18612A0  --    Core transform: type filter, Path A, Path B
sub_196A2B0  0x196A2B0  --    Pipeline wrapper (calls core after GlobalDCE)
sub_185B1D0  0x185B1D0  --    Recursive re-application to split globals
sub_185B7E0  0x185B7E0  --    Pre-SRA setup
sub_1860410  0x1860410  --    Hash table rehash
sub_1860630  0x1860630  --    Hash table lookup
sub_1860BE0  0x1860BE0  --    Per-user SRA rewrite
sub_185C560  0x185C560  --    Collect all store values for a global
sub_185C920  0x185C920  --    Analyze single store for optimizability
sub_185CAF0  0x185CAF0  --    Collect stored value into hash set
sub_15E51E0  0x15E51E0  --    Create global variable (88 bytes, with AS)
sub_15E5070  0x15E5070  --    Create init function
sub_164D160  0x164D160  --    RAUW (Replace All Uses With)
sub_15F20C0  0x15F20C0  --    Erase instruction from parent
sub_15E55B0  0x15E55B0  --    Erase global declaration
sub_15A9520  0x15A9520  --    getPointerSizeInBits(target, addr_space)
sub_15A9930  0x15A9930  --    getStructLayout (field offsets)
sub_15A06D0  0x15A06D0  --    computeFieldOffset
sub_1646BA0  0x1646BA0  --    getStructFieldType
sub_16435F0  0x16435F0  --    isAnalyzableType(type, depth)
sub_140B2F0  0x140B2F0  --    evaluateInitializer(module, target, ..., 1)
sub_15FB630  0x15FB630  --    Create notinit sentinel
sub_15FB440  0x15FB440  --    Create binary OR (opcode 27)
sub_15FEC10  0x15FEC10  --    Create ICmp instruction
sub_15F8650  0x15F8650  --    Create conditional branch
sub_15F8590  0x15F8590  --    Create unconditional branch
sub_157FBF0  0x157FBF0  --    Create basic block
sub_15FED60  0x15FED60  --    Create ICmp NE (opcode 51, predicate 33)
sub_15F9330  0x15F9330  --    Create alloca ("tmp" variable in block)
sub_15FDB00  0x15FDB00  --    Wire def into use-def chain
sub_15F9850  0x15F9850  --    Create store-to-field-global
sub_157E9C0  0x157E9C0  --    Create return basic block (null-return)
sub_157FB60  0x157FB60  --    Create basic block with predecessor
sub_15F55D0  0x15F55D0  --    Grow operand list
sub_1648700  0x1648700  --    getInstruction(use) from use-chain
sub_1649960  0x1649960  --    getName(global/fn) returns C string
sub_1648A60  0x1648A60  --    IRBuilder::create(size, kind) allocates IR node
sub_15E5530  0x15E5530  --    Destroy function body
sub_159D9E0  0x159D9E0  --    Destroy function
sub_164BE60  0x164BE60  --    Drop all references
sub_1648B90  0x1648B90  --    Mark dead (flags or-equals 1)
sub_1631BE0  0x1631BE0  --    Insert into function list
sub_15A9FE0  0x15A9FE0  --    getAlignment(target, type) ABI alignment
sub_15A0680  0x15A0680  --    lookupSymbol(module_sym, idx, flags)
sub_16463B0  0x16463B0  --    getArrayElementType(ptr, idx)
sub_159C540  0x159C540  --    getTerminalType(type)
sub_1752100  0x1752100  --    Collect use-def chain
sub_15E6480  0x15E6480  --    Copy metadata from global to global
sub_15F8F80  0x15F8F80  --    Create extractvalue instruction
sub_15F9480  0x15F9480  --    Create store-init (initializer store)
sub_15F9660  0x15F9660  --    Create field store (offset + field global)
sub_15FD590  0x15FD590  --    Create local variable ("newgv")
sub_15FEBE0  0x15FEBE0  --    Create bitcast/GEP for field extraction
sub_1648780  0x1648780  --    Replace use with value
sub_16CC920  0x16CC920  --    Grow scratch buffer
sub_16CC9F0  0x16CC9F0  --    Find in sorted set
sub_1968390  0x1968390  --    GlobalDCE / ConstantProp (runs before GlobalOpt)

Differences from Upstream LLVM GlobalOpt

Stock LLVM's GlobalOptPass (in lib/Transforms/IPO/GlobalOpt.cpp) performs similar high-level transformations: SRA of globals, shrink-to-bool, constant marking, dead global elimination, malloc/free removal, static constructor evaluation, calling convention optimization (fastcc), and alias resolution. The NVIDIA implementation diverges in these concrete ways:

  1. Internal IR, not LLVM IR. The pass operates on NVIDIA's custom IR node format with 88-byte global nodes, 24-byte operand stride, and type tags at offset +8/+16 of type/instruction nodes. A reimplementation targeting upstream LLVM would use GlobalVariable, StoreInst, LoadInst, and GetElementPtrInst directly.

  2. 2047-bit constant promotion threshold. LLVM does not have a single bit-count gate for constant promotion. NVIDIA's threshold likely targets the GPU register file: 2047 bits is approximately 64 32-bit registers, close to the per-thread register budget on many SM architectures.

  3. Per-field malloc decomposition. Stock LLVM's tryToOptimizeStoreOfMallocToGlobal handles malloc/free as a single pair. NVIDIA generates per-field null checks, conditional frees, and continuation blocks -- a more aggressive decomposition.

  4. Custom hash table. LLVM uses DenseMap/SmallPtrSet. NVIDIA uses a hand-rolled open-addressing hash table with 32-byte entries (see Hash Table and Collection Infrastructure for the hash function and sentinel values).

  5. Address-space preservation. Every created global explicitly receives the source global's address space. Stock LLVM does not special-case address spaces in GlobalOpt.

  6. Recursive re-application. After splitting, NVIDIA calls sub_185B1D0 to re-run GlobalOpt on the results. Upstream LLVM relies on the pass manager to schedule re-runs via its invalidation mechanism.

  7. Inliner integration. The inliner cost model at the same address range calls into GlobalOpt to evaluate post-inline global folding benefit. This tight coupling does not exist in upstream LLVM where inlining and GlobalOpt are independent passes.

Cross-References

  • NVModuleSummary Builder -- builds the global reference edges that determine which globals are live across modules
  • Inliner Cost Model -- calls GlobalOpt's transform function to evaluate post-inline global optimization benefit
  • ThinLTO Function Import -- imports functions across module boundaries, exposing globals for cross-module optimization
  • Alias Analysis & NVVM AA -- address-space-aware alias analysis that informs which memory operations can alias globals in different address spaces
  • MemorySpaceOpt -- resolves generic pointers to specific address spaces; runs before GlobalOpt and may expose globals that were previously behind generic pointers
  • Pipeline & Ordering -- full pass ordering showing GlobalOpt's position at step 30
  • Type Translation, Globals & Special Vars -- how EDG frontend assigns address spaces to global variables during IR generation
  • Hash Infrastructure -- hash function, sentinel values, and probing strategy used by the processed-globals table
  • Struct Splitting -- the NewPM lower-aggr-copies pass that handles similar aggregate decomposition at a different pipeline stage
  • Address Spaces -- complete NVPTX address space reference including pointer sizes and latency characteristics

Whole-Program Devirtualization

CICC v13.0 includes LLVM's WholeProgramDevirtPass at sub_2703170 (13,077 bytes), which replaces indirect virtual calls with direct calls using whole-program type information. On GPU this optimization is far more consequential than on CPU. An indirect call in PTX compiles to a call.uni through a register, which prevents the backend from inlining the callee, forces all live registers across the call boundary into local-memory spills, destroys instruction-scheduling freedom, and creates a warp-divergence hazard if threads in the same warp resolve the function pointer to different targets. A single devirtualized call site in a hot kernel loop can therefore improve performance by an order of magnitude: the direct call enables inlining by the inliner cost model, which in turn eliminates .param-space marshaling, enables cross-boundary register allocation, and restores the instruction scheduler's ability to interleave memory and arithmetic operations.

CICC's devirtualization operates in a privileged position: GPU compilation is inherently a closed-world model. Every function that can be called on the device must be visible at link time -- there is no dynamic loading, no shared libraries, and no dlopen on GPU. This means the set of possible implementations for any virtual function is fully known, making single-implementation devirtualization almost always profitable and branch funnels rare. The pass runs as a module-level pass (pipeline parser slot 121, registered as "wholeprogramdevirt") during the LTO phase, after the NVModuleSummary builder has computed type test metadata and before GlobalDCE eliminates dead virtual methods.

Entry point      sub_2703170 (0x2703170, 13,077 bytes)
Address range    0x2703170--0x2706485
Stack frame      856 bytes (0x358)
Pass name        "wholeprogramdevirt" (pipeline slot 121)
Pass type        Module pass
Callee-saved     r15, r14, r13, r12, rbx
Return value     1 = module modified, 0 = no changes
Remark category  "wholeprogramdevirt" / "Devirtualized"
Helper range     sub_2700B00--sub_2708220 (branch funnel helpers, summary I/O)

The Closed-World GPU Advantage

Upstream LLVM's WholeProgramDevirt is designed primarily for LTO pipelines where some modules may not be visible (ThinLTO import/export split, shared libraries with hidden visibility). The pass must therefore be conservative: it can only devirtualize when !type metadata proves that the vtable set is complete. On GPU, this conservatism is unnecessary. All device code is statically linked into a single fatbinary -- there are no device-side shared libraries, no runtime code loading (the driver JIT compiles PTX, but does not add new device functions), and __device__ virtual functions cannot escape to host code. The entire class hierarchy is visible.

CICC exploits this by running WPD in regular LTO mode (not ThinLTO export/import split), where the pass directly resolves virtual calls against the merged module. The NVModuleSummary builder records type_test metadata for all device vtables, and the pass consumes this metadata to build a complete picture of every virtual call site and every possible target. In practice, GPU programs rarely have deep polymorphic hierarchies in device code (the hardware penalties discourage it), so most virtual call sites resolve to a single implementation.

The Formal Closed-World Argument

The closed-world guarantee on GPU rests on five architectural invariants, each of which eliminates a source of conservatism that forces upstream LLVM to leave calls indirect:

1. No device-side shared libraries
   Upstream concern: A .so loaded at runtime could add a new vtable entry for a class. LTO must mark !vcall_visibility metadata linkage-unit to prove the vtable set is closed within the link unit.
   Why GPU is immune: The CUDA driver loads PTX/SASS as a monolithic blob. cuModuleLoad does not support incremental symbol addition. There is no dl_iterate_phdr on device.

2. No dlopen on device
   Upstream concern: Host-side dlopen can inject new implementations of virtual functions. Upstream must check !vcall_visibility for translation-unit scope.
   Why GPU is immune: Device code has no equivalent of dlopen. The only way to add device code is to recompile and reload the entire module.

3. No device-side RTTI
   Upstream concern: dynamic_cast and typeid on host can defeat devirtualization by requiring the vtable to contain RTTI pointers that reference external type_info objects.
   Why GPU is immune: CUDA explicitly prohibits dynamic_cast and typeid in __device__ functions. Device vtables contain no RTTI pointers. The NVVM IR verifier (sub_12DD660) rejects code that attempts dynamic_cast in device context.

4. No exceptions on device
   Upstream concern: Virtual destructors in exception-handling code create additional vtable entries and __cxa_throw unwinding paths that must be considered.
   Why GPU is immune: CUDA does not support exceptions in device code. Virtual destructors are simple (no EH cleanup), and the compiler can see every destructor call site.

5. Complete link-time visibility
   Upstream concern: ThinLTO's import/export split means some modules may not be available during WPD. The pass must use summary-based resolution with wholeprogramdevirt-summary-action=import/export.
   Why GPU is immune: CICC uses wholeprogramdevirt-summary-action=none (direct resolution on the merged module). All device functions, including those from separate compilation units, are linked by nvlink into a single merged module before the LTO pipeline runs.

The practical consequence: CICC sets whole-program-visibility effectively to true for all device code. The !vcall_visibility metadata that upstream uses to distinguish "linkage-unit" from "translation-unit" scope becomes irrelevant -- every device vtable is within a single, complete, closed translation unit.

How NVModuleSummary Feeds WPD

The NVModuleSummary builder at sub_D7D4E0 (2,571 decompiled lines, 74KB) produces the type metadata that WPD consumes. The interaction is:

  1. NVModuleSummary walks every GlobalValue in the module (linked list at Module+72). For each function (opcode 0x3D), it extracts attribute groups #34 (reference edges with type metadata) and #35 (direct call targets) via sub_B91C10.

  2. For reference edges with type info (attribute #34), the builder decodes MDNode operands (lines 1193-1228 of the decompilation): each parameter position >= 2 yields a type node (opcode range 5-36), walked to a parent MDTuple (opcode 17) containing the type name string at offset 24 (indirect through pointer if length > 64).

  3. These type-metadata edges are packed into the FunctionSummary record by sub_D77220 as the v378 (type-checked references) argument. The resulting metadata lands in the module as llvm.type.test / type_test_assume named metadata nodes.

  4. WPD reads these nodes back via sub_B6AC80(module, 0x166) at its entry point, completing the producer-consumer chain.

DevirtSCCRepeatedPass: The Outer Loop

WPD at the module level is one of two devirtualization mechanisms. The other operates at CGSCC granularity: DevirtSCCRepeatedPass at sub_2284BC0 (16KB) wraps the CGSCC pipeline in a fixed-point iteration loop that re-runs until no new devirtualization opportunities are discovered or a maximum iteration count is reached. On reaching the limit, the pass emits "Max devirtualization iterations reached". The abort-on-max-devirt-iterations-reached knob (registered at constructor 378) controls whether this is a fatal error or a warning. The iteration count at O1-O3 is 1; at tier 3 (maximum optimization) it is 5, giving the inliner and devirtualizer multiple rounds to discover indirect-to-direct call conversions that expose further inlining opportunities.

The two mechanisms are complementary: module-level WPD resolves virtual calls using global type hierarchy information (vtable metadata), while CGSCC-level devirtualization catches cases where inlining reveals new constant function pointers that can be resolved without type metadata.

Algorithm

The pass executes in seven phases:

Phase 1: Metadata Extraction (0x2703170--0x27031CA)

The entry point fetches four named metadata nodes from the module using sub_B6AC80 (getNamedMetadata):

Enum ID      Metadata Node                      Purpose
0x166 (358)  llvm.type.test / type_test_assume  Records of @llvm.assume(@llvm.type.test(%ptr, %typeID)) intrinsic results
0x164 (356)  llvm.type.checked.load             Call sites using type-checked vtable loads
0x165 (357)  llvm.type.checked.load.relative    Relative vtable pointer variant (compact vtables)
0x0B (11)    Module-level type metadata         Type summaries describing vtable layouts

If neither type_test_assume nor module-level type metadata is present, the pass checks for type_checked_load and type_checked_load_relative as fallbacks. If none exist, the pass returns 0 immediately.

The assembly sequence at the entry point:

; 0x2703170: entry
mov  esi, 0x166              ; enum ID = 358 (type_test_assume)
call sub_B6AC80              ; rbx = getNamedMetadata(module, 0x166)

mov  esi, 0x164              ; enum ID = 356 (type_checked_load)
call sub_B6AC80              ; r13 = getNamedMetadata(module, 0x164)

mov  esi, 0x165              ; enum ID = 357 (type_checked_load_relative)
call sub_B6AC80              ; [rbp-0x338] = result

mov  esi, 0x0B               ; enum ID = 11 (module-level type metadata)
call sub_B6AC80              ; r12 = result

Phase 2: Type Test Record Iteration (0x2703296--0x2703383)

Type test records are stored in an array at offset +0xA0 of the metadata state, with count at +0xA8. Each record is 144 bytes (0x90):

struct TypeTestRecord {       // 0x90 = 144 bytes per record
    uint8_t *type_value;      // +0x00: pointer to type test value
    // ... call site references, metadata links ...
};

// Iteration pattern at 0x2703296:
TypeTestRecord *base = state->records;          // [state + 0xA0]
uint32_t count = state->record_count;           // [state + 0xA8]
TypeTestRecord *end = base + count;             // stride = 0x90
// Address computation in binary:
//   lea rax, [rax+rax*8]      ; count * 9
//   shl rax, 4                ; count * 144 = count * 0x90
//   add rax, rdx              ; end pointer

for (TypeTestRecord *rec = base; rec != end; rec++) {
    if (rec->type_value[0] != 0) continue;      // skip already-processed
    // ... look up type in hierarchy ...
}

For each record whose type byte is 0 (unprocessed), the pass computes a string hash of the type name via sub_B91420 (get type name) and sub_B2F650 (string hash), then looks up the type in a red-black tree rooted at offset +0xE0 of the module state.

Phase 3: Hash Table Construction (0x2703589--0x2703AE2)

Unique type test values are tracked in an open-addressed hash table with 56-byte entries. The hash function combines bit-shifted fields to reduce clustering:

uint32_t hash(uint32_t val, uint32_t mask) {
    return ((val >> 4) ^ (val >> 9)) & mask;
}

The table uses power-of-2 sizing with LLVM-layer sentinels (empty = 0xFFFFFFFFE000, deleted = 0xFFFFFFFFF000). See Hash Table and Collection Infrastructure for the probing and growth policy.
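As a quick sanity model of the hash and its sentinels (the probing and growth policy live on the hash infrastructure page; this sketch restates only the hash function and the two sentinel constants recovered above):

```c
#include <assert.h>
#include <stdint.h>

#define WPD_EMPTY   0xFFFFFFFFE000ULL   /* empty-slot sentinel */
#define WPD_DELETED 0xFFFFFFFFF000ULL   /* deleted-slot sentinel */

/* XORing the value shifted by 4 and by 9 mixes pointer-like keys
 * whose low bits are alignment zeros; the mask is table_size - 1
 * (power-of-2 sizing). */
static uint32_t wpd_hash(uint32_t val, uint32_t mask) {
    return ((val >> 4) ^ (val >> 9)) & mask;
}
```

On growth, sub_2702540 rehashes every live entry with the same function against the new, larger mask.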

Each 56-byte hash table entry stores:

OffsetSizeField
+0x008Type test value (key)
+0x088Flags / padding
+0x108Type info pointer
+0x188Associated data (resolution result)
+0x208Red-black tree node (self-referential on init)
+0x288Link pointer
+0x308Count / size

Slot addressing uses the identity slot_index * 7 * 8 = slot_index * 56:

; At 0x27035A0:
lea  rdx, ds:0[rsi*8]    ; rsi = slot index, rdx = slot*8
sub  rdx, rsi             ; rdx = slot*8 - slot = slot*7
mov  rsi, [rdi+rdx*8]    ; load from table base + slot*56
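Both this slot stride and the Phase 2 record stride are standard strength reductions of a constant multiply. A small C restatement of the two identities the assembly encodes:

```c
#include <assert.h>
#include <stdint.h>

/* Phase 2 record stride: lea rax,[rax+rax*8] computes n*9,
 * then shl rax,4 scales by 16: n*9*16 == n*144 (0x90). */
static uint64_t stride_144(uint64_t n) { return (n + n * 8) << 4; }

/* Phase 3 slot stride: lea rdx,[rsi*8] then sub rdx,rsi computes
 * n*7, and the *8 addressing scale yields n*7*8 == n*56 (0x38). */
static uint64_t stride_56(uint64_t n)  { return (n * 8 - n) * 8; }
```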

Table growth is handled by sub_2702540, which reallocates and rehashes all entries using the same (val >> 4) ^ (val >> 9) function against the new mask. Entry initialization at 0x2703A33:

; Insert new entry:
add  [rbp-0x2D0], 1      ; increment unique type count
call sub_2702540          ; grow table if needed (returns new entry ptr in rax)
mov  dword [rax+10h], 0  ; clear type info
mov  qword [rax+18h], 0  ; clear data
mov  [rax], rdx           ; store type test value
lea  rdx, [rax+10h]
mov  [rax+20h], rdx       ; self-referential link (RB tree node init)
mov  [rax+28h], rdx       ; self-referential link
mov  qword [rax+30h], 0  ; zero count

Phase 4: Type Hierarchy Lookup via Red-Black Tree (0x27032F7--0x2703362, 0x2704183--0x2704267)

For each unique type, the pass searches a red-black tree keyed by hashed type name. The tree is rooted at offset +0xE0 of the module state, with the sentinel node at +0xD8. The search is a two-phase process with a three-field comparison:

Phase 4a: Compute Type Name Hash

// At 0x27032F7:
char *name = sub_B91420(type_value);     // returns (name_ptr, name_len)
uint64_t hash = sub_B2F650(name, len);   // string hash

// Tree root and sentinel:
RBTreeNode *root = module_state[+0xE0];  // root pointer
RBTreeNode *sentinel = module_state + 0xD8; // sentinel node address

sub_B2F650 (stringHash) is LLVM's standard xxHash-style string hasher. It produces a 64-bit hash that is stored at node[+0x20] for each type in the tree.

Phase 4b: Descend Tree by Hash

// At 0x270330C:
RBTreeNode *current = root;
RBTreeNode *best = sentinel;    // rcx = sentinel initially

while (current != NULL) {
    uint64_t node_hash = current[+0x20];    // hash stored in node
    if (target_hash < node_hash) {
        best = current;                      // track nearest greater
        current = current[+0x10];            // left child
    } else if (target_hash > node_hash) {
        current = current[+0x18];            // right child
    } else {
        // hash matches -- proceed to Phase 4c
        break;
    }
}

if (current == NULL) goto not_found;

The binary encodes this as:

compare_node:
    cmp  rsi, [r15+20h]    ; compare target hash vs node hash
    ja   go_right           ; target > node -> right child
    jnb  hash_match         ; target == node -> verify
    mov  rcx, r15           ; track best (left-leaning)
    mov  r15, [r15+10h]    ; r15 = left child
    test r15, r15
    jnz  compare_node
    jmp  not_found

go_right:
    mov  r15, [r15+18h]    ; r15 = right child
    test r15, r15
    jnz  compare_node
    jmp  not_found

Phase 4c: Verify Full Match (Hash Collision Resolution)

On hash match, the pass performs a two-step verification to handle collisions:

// At 0x2704200:
// Step 1: compare string lengths
if (current[+0x30] != target_length) {
    // Length mismatch -- this is a hash collision, not a real match.
    // Continue tree traversal to the next candidate.
    goto next_candidate;
}

// Step 2: compare actual type name strings
char *node_name = (char *)current[+0x28];    // node's type name data
char *target_name = target_string;            // from sub_B91420
int cmp = memcmp(node_name, target_name, target_length);
if (cmp != 0) goto next_candidate;

// Verified match -- read vtable data

The binary at 0x2704200--0x2704240:

    cmp  r12, [r15+30h]    ; compare string length
    jnz  next_candidate    ; length mismatch

    mov  rdi, [r15+28h]    ; s1 = node's string data
    mov  rsi, [rbp-0x348]  ; s2 = target string data
    mov  rdx, r12           ; n = length
    call _memcmp
    test eax, eax
    jz   found_match
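The full match test can be restated as a self-contained sketch. The FNV-1a hash below is a stand-in for sub_B2F650 (which is xxHash-style), so only the hash-then-length-then-memcmp structure mirrors the binary:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the three fields the comparison touches in the RB-tree node:
 * hash at +0x20, name data at +0x28, name length at +0x30. */
typedef struct {
    uint64_t hash;
    const char *name;
    size_t len;
} TypeNode;

/* Stand-in string hash (FNV-1a) for demonstration purposes. */
static uint64_t toy_hash(const char *s, size_t n) {
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < n; i++)
        h = (h ^ (uint8_t)s[i]) * 1099511628211ULL;
    return h;
}

/* Hash equality alone is never trusted: length (step 1) and byte
 * content (step 2) must also agree before a match is declared. */
static int verified_match(const TypeNode *node,
                          const char *name, size_t len) {
    if (node->hash != toy_hash(name, len)) return 0;  /* Phase 4b  */
    if (node->len != len) return 0;                   /* 4c step 1 */
    return memcmp(node->name, name, len) == 0;        /* 4c step 2 */
}
```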

Phase 4d: Extract Vtable Data

After verifying the type match, the pass reads the vtable descriptor from the type node:

// At 0x2704248:
void *vtable_start = current[+0x68];    // vtable start address
void *vtable_data  = current[+0x70];    // vtable data pointer (function pointers)

if (vtable_data == NULL) goto skip_type; // no vtable -> nothing to devirtualize

The vtable_data pointer leads to an array of function pointers representing the virtual method implementations for this type. The pass iterates this array comparing each entry against call site signatures to identify devirtualization candidates.

Phase 5: Virtual Call Resolution (0x2703974--0x27039BA)

For each call site on a matched type, the pass calls sub_26FEE10 (resolveVirtualCall):

bool resolveVirtualCall(
    void *module_state,         // rdi: r15 (module/pass state)
    void *target_candidates,    // rsi: candidates vector from [rbp-0x230]
    void *hash_entry,           // rdx: r12 (pointer to hash table entry + 8)
    uint32_t candidate_count,   // rcx: from [rbp-0x228]
    void *call_site_info        // r8:  r13 (call site from [r15+0x28])
);
// Returns: al = 1 if unique resolution found, 0 otherwise

The resolution algorithm within sub_26FEE10 works by comparing the vtable offset encoded in each call site's llvm.type.test / llvm.type.checked.load intrinsic against the vtable slot offsets of all candidate implementations. When exactly one candidate matches, the resolution succeeds with strategy 1 (direct call). When multiple candidates exist but all return the same constant or can be distinguished by a single offset, strategy 2 (unique member) is chosen. When multiple distinct targets exist, strategy 3 (branch funnel) is produced.

The resolution result is written to hash_entry[+0x28] as a strategy selector:

Value | Strategy                            | Upstream LLVM counter
1     | Direct call (single implementation) | NumSingleImpl
2     | Unique member dispatch              | NumUniformRetVal / NumUniqueRetVal
3     | Branch funnel                       | NumBranchFunnel
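The selector logic condenses to a small decision function. A hedged sketch (the function and parameter names here are invented for illustration, not recovered from the binary):

```c
/* Strategy selector values as written to the hash entry at +0x28. */
enum wpd_strategy {
    WPD_NONE   = 0,   /* unresolved -- call stays indirect */
    WPD_DIRECT = 1,   /* single implementation */
    WPD_UNIQUE = 2,   /* unique member / uniform return value */
    WPD_FUNNEL = 3    /* branch funnel */
};

/* Hypothetical condensation of the resolution outcome described above. */
static enum wpd_strategy pick_strategy(int n_candidates,
                                       int uniform_retval,
                                       int funnel_threshold)
{
    if (n_candidates <= 0) return WPD_NONE;
    if (n_candidates == 1) return WPD_DIRECT;   /* NumSingleImpl */
    if (uniform_retval)    return WPD_UNIQUE;   /* NumUniformRetVal */
    if (n_candidates <= funnel_threshold)
        return WPD_FUNNEL;                      /* NumBranchFunnel */
    return WPD_NONE;   /* above the branch-funnel threshold */
}
```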

Before calling sub_26FEE10, the pass checks two preconditions:

// At 0x2703974:
void *call_site_list = module_state[+0x28];   // r13 = [r15+0x28]
if (call_site_list == NULL) goto skip;

if (type_value[0] != 0) goto skip;            // byte check: direct type info only

void *existing = hash_entry[+0x28];
if (existing != 0) goto already_resolved;     // skip if previously resolved

Phase 6: Strategy Application (0x2703BA3--0x27046F0)

Strategy 1 -- Direct Call Replacement (0x27044DA)

When only one class implements the virtual function (the common case on GPU), the indirect call is replaced with a direct call to the resolved function. This is handled by sub_26F9AB0 (rewriteCallToDirectCall):

// At 0x27044DA:
void rewriteCallToDirectCall(
    void *type_entry,           // rdi: r12
    void *call_site,            // rsi: [r15+0x38]
    uint64_t vtable_data,       // rdx: byte_3F871B3 (vtable offset data)
    uint32_t flags,             // ecx: 0
    void *resolved_function     // r8:  [rbx+0x40]
);

This is the simplest and most common optimization: the indirect call through a register becomes a direct call, enabling downstream inlining. On GPU this is by far the dominant strategy. Consider a CUDA kernel with a virtual method call inside a loop:

; Before devirtualization (PTX):
ld.global.u64  %rd1, [%rd0];     // load vtable ptr
ld.global.u64  %rd2, [%rd1+16];  // load function ptr at vtable slot 2
call.uni       %rd2, (%args);    // indirect call -- full scheduling barrier

; After devirtualization (PTX):
call.uni       _ZN7Derived4workEv, (%args);  // direct call -- inlinable

The direct call then becomes an inlining candidate with CICC's 20,000-unit budget (89x the upstream LLVM default of 225), and the inliner typically eliminates it entirely, producing fully-inlined code with no call overhead.

Strategy 2 -- Unique Member Dispatch (0x27045C9)

When multiple classes exist but the call can be dispatched through a unique member offset, the pass rewrites via sub_26F9080 (rewriteToUniqueMember), passing the diagnostic string "unique_member" (13 chars). The member offset is read from hash_entry[+0x60] and the base type from hash_entry[+0x00].

; At 0x27045D9:
mov  r11, [r12]              ; type info (base type)
mov  rsi, [r12+60h]          ; member offset
lea  rax, "unique_member"    ; diagnostic string (13 chars)
call sub_26F9080             ; rewriteToUniqueMember
;   rdx = r14 (type test record)
;   rcx = r13 (call site)
;   r9  = rdi (vtable byte offset / 8)
;   "unique_member" + length 0x0D pushed on stack

After the initial rewrite, sub_26FAF90 performs call-site-specific fixup, checking [rbx+0x40] to determine if additional adjustment is needed (e.g., adjusting this pointer offset for multiple inheritance).

Upstream LLVM's equivalent covers two sub-strategies: uniform return value optimization (all implementations return the same constant -- replace the call with that constant) and unique return value optimization (for i1 returns, compare the vptr against the one vtable that returns a different value). Both are folded under the "unique_member" label in CICC's implementation.

Strategy 3 -- Branch Funnel (0x27043B5)

When multiple possible targets exist and cannot be reduced to a single dispatch, the pass creates a branch funnel -- a compact conditional dispatch sequence that checks the vtable pointer and branches to the correct target. This is handled by three functions:

  1. sub_26F78E0 -- create branch funnel metadata (with diagnostic string "branch_funnel", 13 chars)
  2. sub_BCF480 -- build the conditional dispatch structure
  3. sub_BA8C10 -- emit the indirect branch sequence

; At 0x27043B5:
mov  r12, [rbx]               ; vtable pointer
mov  rdi, [r12]               ; function pointer from vtable
call sub_BCB120                ; get function declaration

; At 0x27043D3:
lea  rax, "branch_funnel"     ; 13 chars at 0x42BCB92
call sub_26F78E0               ; create branch funnel metadata
call sub_BCF480                ; build dispatch structure
call sub_BA8C10                ; emit indirect branch sequence

The branch funnel supports two dispatch granularities:

Granularity | String                            | Function    | Description
Byte        | "byte" (4 chars, at 0x3F8C256)    | sub_26F9120 | Check byte offset into vtable to select target
Bit         | "bit" (3 chars, at 0x43ADFE0+0xE) | sub_26F9120 | Check bit offset for single-bit discrimination

The emission sequence at 0x270450C--0x27045BF:

; Byte-granularity dispatch:
lea  rcx, "byte"              ; at 0x3F8C256
mov  [rbp-0x318], 4           ; string length
call sub_26F9120               ; emit byte-offset check

; Bit-granularity dispatch:
lea  rbx, "bit"               ; at 0x43ADFE0 + 0xE
mov  [rbp-0x328], 3           ; string length
call sub_26F9120               ; emit bit-offset check

; Finalize:
call sub_26FB610               ; r8=byte_result, r9=bit_result
                               ; rdi=r12, rdx=byte_3F871B3

The finalization call sub_26FB610 receives both byte and bit results and produces the final dispatch sequence. On GPU, branch funnels are rare because device code hierarchies are typically shallow, but the infrastructure exists for cases like thrust/CUB polymorphic iterators.

Upstream LLVM gates branch funnels behind the wholeprogramdevirt-branch-funnel-threshold knob (default: 10 targets per call site). CICC inherits this threshold.
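In source terms, a branch funnel is just a chain of vtable-pointer comparisons guarding direct calls, with the original indirect call as the fallback. A schematic C version (all names hypothetical):

```c
/* Two concrete implementations standing in for virtual overrides. */
static int work_a(void *self) { (void)self; return 1; }
static int work_b(void *self) { (void)self; return 2; }

typedef int (*vfunc)(void *);
static vfunc vtable_a[1] = { work_a };
static vfunc vtable_b[1] = { work_b };

struct object { vfunc *vtable; };

/* Funnel: test the vtable pointer against each known candidate and
 * branch to a direct (inlinable) call; otherwise fall back to the
 * original indirect call through the vtable slot. */
static int funnel_dispatch(struct object *o)
{
    if (o->vtable == vtable_a) return work_a(o);
    if (o->vtable == vtable_b) return work_b(o);
    return o->vtable[0](o);    /* residual indirect call */
}
```

Each comparison is uniform across a warp when all threads hold the same dynamic type, so the funnel avoids the divergence penalty of a fully indirect call in the common case.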

Phase 7: Cleanup (0x2704144--0x270342E)

After processing all types, the pass performs four cleanup operations:

  1. Function attribute cleanup (0x2704144): iterates the module's function list (red-black tree at [rax+10h]), calling sub_B98000 with parameter 0x1C (attribute cleanup enum) on each function.
  2. Import list cleanup (0x270416C): processes entries at module[+0x110..+0x118], calling sub_B43D60 to release function metadata for imported declarations.
  3. Type hierarchy destruction: sub_26F92C0 releases all type hierarchy data structures.
  4. Hash table deallocation (0x27033C3): iterates all non-sentinel entries, calls sub_26F75B0 to release per-entry resolution data, then sub_C7D6A0 to free the table buffer. Type test result vectors (0x70-byte elements with sub-vectors at offsets +0x10, +0x28, +0x40, +0x58) are freed element by element.

Hash table cleanup detail:

// At 0x27033C3:
uint32_t count = hash_table_entry_count;     // [rbp-0x2B8]
if (count == 0) goto skip_cleanup;

void *base = hash_table_base;                // [rbp-0x2C8]
void *end = base + count * 56;               // count * 7 * 8

for (void *entry = base; entry < end; entry += 56) {
    uint64_t key = *(uint64_t *)entry;
    if (key == 0xFFFFFFFFE000) continue;     // empty sentinel
    if (key == 0xFFFFFFFFF000) continue;     // deleted sentinel
    sub_26F75B0(entry[+0x18]);               // release resolution data
}
sub_C7D6A0(base);                            // free table buffer
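The sentinel-skipping teardown can be modeled as runnable C, with the 56-byte entry expressed as a struct (layout is a simplification; the sentinel constants are the recovered values, and free() stands in for sub_26F75B0):

```c
#include <stdlib.h>

#define EMPTY_SENTINEL   0xFFFFFFFFE000ULL
#define DELETED_SENTINEL 0xFFFFFFFFF000ULL

/* 7 * 8 = 56 bytes per entry, matching the stride in the cleanup loop. */
struct entry {
    unsigned long long key;        /* +0x00 */
    unsigned long long pad[2];
    void *resolution;              /* +0x18: per-entry resolution data */
    unsigned long long pad2[3];
};

/* Release per-entry data for every live slot; returns the number of
 * entries released. The caller then frees the table buffer itself. */
static int destroy_table(struct entry *base, unsigned count)
{
    int released = 0;
    for (unsigned i = 0; i < count; i++) {
        if (base[i].key == EMPTY_SENTINEL ||
            base[i].key == DELETED_SENTINEL)
            continue;              /* skip sentinel slots */
        free(base[i].resolution);  /* stands in for sub_26F75B0 */
        released++;
    }
    return released;
}
```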

GPU-Specific Constraints

Virtual Functions in Device Code

CUDA allows __device__ virtual functions, but with restrictions that simplify devirtualization:

  • No RTTI on device. There is no typeid or dynamic_cast on GPU. This means vtable layouts do not contain RTTI pointers, simplifying vtable reconstruction. The NVVM IR verifier rejects code that attempts dynamic_cast in device context.
  • No exceptions on device. Virtual destructors do not need to handle __cxa_throw unwinding paths.
  • Closed world. No device-side shared libraries, no dlopen, no runtime code generation. All virtual targets are known at compile time.
  • No separate compilation for virtual dispatch. Device linking (nvlink) resolves all symbols before PTX emission, so the merged module always has complete type information.
  • Simplified vtable layout. Without RTTI pointers and exception tables, device vtables are a flat array of function pointers at known offsets. This makes vtable slot arithmetic straightforward for the WPD pass.

Cost of Unresolved Indirect Calls

If devirtualization fails, the PTX backend must emit a call.uni or call through a register. This has several penalties:

  1. No inlining. The callee is unknown, so the inliner cannot evaluate it.
  2. Full .param marshaling. Every argument must be written to .param space; no copy elision is possible. The call ABI (opcodes 510-513: CallDirect, CallDirectNoProto, CallIndirect, CallIndirectNoProto) forces .param-space round-tripping.
  3. Register pressure spike. All live registers across the call must be spilled to local memory (device DRAM, ~400 cycle latency on SM 70-90).
  4. Scheduling barrier. The call is a full fence for instruction scheduling -- no operations can be reordered across it.
  5. Divergence hazard. If different threads in a warp resolve the pointer to different functions, execution serializes both paths. In the worst case (32 different targets), this is a 32x slowdown.
  6. Occupancy reduction. The register spills increase per-thread local memory usage, reducing occupancy and thus hiding less memory latency.

This is why CICC's default inlining budget of 20,000 (89x the upstream LLVM default) makes sense in combination with aggressive devirtualization: the pass converts expensive indirect calls into direct calls, and the inliner then eliminates them entirely.

Relationship to LowerTypeTests

The LowerTypeTests pass (sub_188C730, 96,984 bytes at 0x188C730; also sub_2638ED0 at 70KB) is the other half of the type-test infrastructure. While WPD consumes type test metadata to resolve virtual calls, LowerTypeTests produces the runtime type-checking implementation. The interaction:

Pass                             | Role                                                            | When
NVModuleSummary (sub_D7D4E0)     | Produces type metadata in function summaries                    | During summary construction
WholeProgramDevirt (sub_2703170) | Consumes type metadata, resolves virtual calls                  | LTO phase, after summary, before GlobalDCE
LowerTypeTests (sub_188C730)     | Lowers remaining @llvm.type.test intrinsics to runtime bit tests| After WPD, if CFI is active

On GPU, LowerTypeTests is largely dead code -- CUDA does not use Control-Flow Integrity (CFI), and WPD resolves most type tests statically. The sweep at 0x1880000 confirms: "WPD/CFI/LowerTypeTests cluster is also upstream-only; CUDA does not use CFI or type-based devirtualization" in the sense of runtime CFI checks. The type metadata is consumed entirely by WPD's compile-time resolution.

LowerTypeTests validates its input with: "Second argument of llvm.type.test must be metadata" and "Second argument of llvm.type.test must be a metadata string". These error paths are unreachable in normal CUDA compilation but exist because CICC links the full upstream LLVM IPO library.

Optimization Remarks

When a call site is successfully devirtualized, the pass emits an optimization remark through the diagnostic handler. The remark is constructed at 0x2703EDA from four string components:

Component     | String                          | Address
Remark name   | "Devirtualized" (13 chars)      | 0x42BCBEE
Pass name     | "wholeprogramdevirt" (18 chars) | 0x42BC950
Body prefix   | "devirtualized " (14 chars)     | 0x42BCBE2
Attribute key | "FunctionName" (12 chars)       | 0x42BC980

The remark construction sequence:

// At 0x2703EDA:
sub_B17560(&remark, "Devirtualized", 13, "wholeprogramdevirt", 18);
sub_B18290(&remark, "devirtualized ", 14);        // append body
sub_B16430(&remark, "FunctionName", 12);          // create named attribute
sub_26F69E0(&remark, resolved_function);          // attach target name
sub_B180C0(&remark);                              // finalize
sub_1049740(diag_handler, &remark);               // publish to handler

The remark is visible via -Rpass=wholeprogramdevirt and includes the name of the resolved target function (obtained from the function's name metadata or via sub_26F69E0 for unnamed functions).

After remark emission, extensive cleanup of small-string-optimized (SSO) std::string objects is performed -- each remark component checks if the string buffer was heap-allocated (compare pointer vs stack buffer address) and frees if necessary.

Knobs

Knob                                         | Type        | Default | Effect
wholeprogramdevirt-branch-funnel-threshold   | unsigned    | 10      | Maximum number of call targets per call site for branch funnel emission. Beyond this threshold, the call site is left indirect.
whole-program-visibility                     | bool        | false   | Force enable whole-program visibility even without !vcall_visibility metadata. On GPU this is effectively always true.
disable-whole-program-visibility             | bool        | false   | Force disable whole-program visibility for debugging.
wholeprogramdevirt-summary-action            | enum        | none    | Controls summary interaction: none, import, export. CICC uses none (direct resolution on merged module).
wholeprogramdevirt-read-summary              | string      | empty   | Read type resolutions from a bitcode/YAML file.
wholeprogramdevirt-write-summary             | string      | empty   | Write type resolutions to a bitcode/YAML file.
wholeprogramdevirt-skip                      | string list | empty   | Comma-separated list of function names to exclude from devirtualization.
wholeprogramdevirt-check                     | enum        | none    | Runtime checking mode: none, trap (abort on incorrect devirt), fallback (fall back to indirect call).
wholeprogramdevirt-keep-unreachable-function | bool        | true    | Keep unreachable functions as possible devirt targets (conservative default).
wholeprogramdevirt-print-index-based         | bool        | false   | Print index-based devirtualization messages for debugging.
wholeprogramdevirt-cutoff                    | signed      | -1      | Maximum number of devirtualization actions to perform. -1 = unlimited. Useful for bisecting devirtualization-induced miscompiles.
abort-on-max-devirt-iterations-reached       | bool        | false   | When DevirtSCCRepeatedPass at sub_2284BC0 hits its iteration limit, abort instead of warning. Registered at constructor 378.

Complexity

Operation                 | Complexity                      | Notes
Hash table insert/lookup  | O(1) amortized, O(n) worst case | Linear probing with sentinel-based open addressing
Type hierarchy lookup     | O(log n)                        | Red-black tree keyed by type name hash, with memcmp verification
Per-type call resolution  | O(call_sites * candidates)      | For each type, check every call site against every candidate target
Branch funnel emission    | O(vtable_entries) per site      | Linear in number of possible targets
String hash (sub_B2F650)  | O(name_length)                  | One-pass hash of the type name string
Total pass                | O(T * S * C * log T)            | T = types, S = call sites per type, C = candidates. Typically sparse on GPU.

Function Map

Function                    | Address     | Size    | Role
WholeProgramDevirtPass::run | sub_2703170 | 13,077  | Pass entry point
buildTypeTestInfo           | sub_2702830 | ~2,600  | Build type test records from metadata
growHashTable               | sub_2702540 | ~740    | Grow and rehash the type test hash table
resolveVirtualCall          | sub_26FEE10 | ~3,200  | Attempt single-target resolution for a call site
rewriteCallToDirectCall     | sub_26F9AB0 | ~1,600  | Strategy 1: replace indirect call with direct call
rewriteToUniqueMember       | sub_26F9080 | ~640    | Strategy 2: unique member dispatch rewrite
finalizeUniqueMember        | sub_26FAF90 | ~1,700  | Strategy 2: call-site-specific fixup
createBranchFunnelMeta      | sub_26F78E0 | ~1,100  | Strategy 3: create branch funnel metadata
buildBranchFunnel           | sub_BCF480  | ~6,400  | Strategy 3: build conditional dispatch structure
emitIndirectBranch          | sub_BA8C10  | ~8,200  | Strategy 3: emit indirect branch sequence
emitDispatchCheck           | sub_26F9120 | ~500    | Branch funnel byte/bit offset check
finalizeBranchFunnel        | sub_26FB610 | ~1,800  | Branch funnel finalization
destroyTypeHierarchy        | sub_26F92C0 | ~400    | Release type hierarchy data structures
releaseResolutionData       | sub_26F75B0 | ~300    | Free per-entry resolution data
attachFunctionName          | sub_26F69E0 | ~240    | Attach function name to optimization remark
branchFunnelHelper          | sub_2700B00 | ~9,800  | Branch funnel main helper (called from sub_2703170)
summaryIO                   | sub_2706490 | ~7,600  | WPD summary read/write (-wholeprogramdevirt-read-summary)
DevirtSCCRepeatedPass::run  | sub_2284BC0 | 16,000  | CGSCC devirtualization iteration loop
getNamedMetadata            | sub_B6AC80  | ~200    | Fetch named metadata node from module
getTypeInfoName             | sub_B91420  | ~300    | Compute type info name string
stringHash                  | sub_B2F650  | ~180    | Hash a type name string (xxHash-style)
createRemarkHeader          | sub_B17560  | ~250    | Create optimization remark header
appendRemarkBody            | sub_B18290  | ~200    | Append body text to remark
createNamedAttribute        | sub_B16430  | ~200    | Create named attribute for remark
publishRemark               | sub_1049740 | ~100    | Publish remark to diagnostic handler

Cross-References

  • NVModuleSummary Builder -- produces the type_test metadata consumed by this pass; records devirtualization-relevant type GUIDs in per-function summaries via sub_D7D4E0.
  • Inliner Cost Model -- devirtualized direct calls become inlining candidates with a 20,000-unit budget; the entire value of devirtualization on GPU depends on the inliner subsequently eliminating the call.
  • ThinLTO Function Import -- in ThinLTO mode the pass would operate in export/import phases, but CICC primarily uses regular LTO for device code.
  • Pipeline & Ordering -- WPD is registered at pipeline parser slot 121 as a module pass; it runs during the LTO phase after summary construction and before GlobalDCE.
  • NVPTX Call ABI -- describes the .param-space calling convention that makes indirect calls so expensive (opcodes 510-513: CallDirect, CallDirectNoProto, CallIndirect, CallIndirectNoProto).
  • LazyCallGraph & CGSCC -- devirtualization converts ref edges to call edges in the call graph, triggering SCC re-computation via switchInternalEdgeToCall. The DevirtSCCRepeatedPass at sub_2284BC0 wraps the CGSCC pipeline in a fixed-point loop.
  • GPU Execution Model -- explains why indirect calls are so expensive on GPU (warp divergence, scheduling barriers, register spilling to local memory).
  • Hash Infrastructure -- the type test hash table uses the same sentinel-based open-addressing pattern as CICC's universal DenseMap infrastructure.

GPU Execution Model

This page is the single authoritative reference for the GPU hardware properties that drive cicc's optimization decisions. Every other wiki page that mentions register pressure, occupancy cliffs, memory coalescing, warp divergence, or the .param calling convention should cross-reference this page rather than re-explaining the concepts inline. The page exists because these properties shape literally every pass in the compiler, from SROA (which exists to avoid .local memory) through register allocation (which trades register count for occupancy) to LTO inlining (which eliminates .param marshaling). Understanding the execution model is a prerequisite for understanding any cicc optimization decision that differs from upstream LLVM.

The material below describes the hardware model as cicc sees it -- the properties that are visible in the binary through TTI hooks, threshold constants, cost model comparisons, and diagnostic strings. Where specific numbers vary by SM generation, the sm_70+ (Volta through Blackwell) values are given unless otherwise noted.

SIMT Warp Execution

NVIDIA GPUs execute threads in groups of 32 called warps. All 32 threads in a warp share a single program counter under the SIMT (Single Instruction, Multiple Threads) model. The hardware issues one instruction per clock to all 32 threads simultaneously -- there is no per-thread instruction decode, fetch, or issue overhead. Each thread has its own register state and can execute a different data path, but they all advance through the program in lockstep.

This is not SIMD in the CPU sense. On a CPU with AVX-512, the programmer (or compiler) explicitly packs 16 floats into a vector register and issues a single vector instruction. On a GPU, the programmer writes scalar code for one thread, and the hardware transparently replicates it across 32 threads. The distinction matters for cicc because vectorization on GPU does not fill SIMD lanes -- it produces wide loads (ld.v2, ld.v4) within a single thread's scalar stream to improve memory transaction width and reduce instruction count. The vector register width reported by TTI::getRegisterBitWidth(Vector) is 32 bits (one scalar register), not 512 or 1024.

Divergence

When a branch condition evaluates differently across threads in a warp, the hardware serializes both paths. First the "taken" subset executes while the others are masked off; then the "not-taken" subset executes. The warp reconverges at a point determined by the hardware's reconvergence stack (pre-Volta) or independent thread scheduling (Volta+). Both paths execute regardless of how many threads take each side, so a divergent branch in a hot loop can halve throughput even if only one thread disagrees.

Divergence is the primary reason cicc includes the StructurizeCFG pass (which converts irreducible control flow to reducible form), the CSSA pass (which repairs SSA across divergent join points), the Loop Index Split pass (which eliminates index-dependent branches that cause per-iteration divergence), and the Branch Distribution pass (which separates uniform from divergent computation).

The constant warpSize = 32 is hardcoded in cicc's SCEV range analysis (intrinsic ID ~370, range [32, 33)) and is the architectural constant behind every power-of-two factor enforcement in the loop unroller and loop vectorizer.

Register Pressure and Occupancy

The register file is the single most constrained resource on an NVIDIA GPU and the single most important factor in cicc's optimization heuristics. Understanding the relationship between register count, occupancy, and performance is essential to understanding why cicc makes the decisions it does.

The Register Budget

Each Streaming Multiprocessor (SM) has a fixed 32-bit register file:

SM Generation        | Registers per SM | Max Registers per Thread
SM 70 (Volta)        | 65,536           | 255
SM 75 (Turing)       | 65,536           | 255
SM 80 (Ampere)       | 65,536           | 255
SM 86 (Ampere GA10x) | 65,536           | 255
SM 89 (Ada)          | 65,536           | 255
SM 90 (Hopper)       | 65,536           | 255
SM 100 (Blackwell)   | 65,536           | 255

These 65,536 registers are shared among all resident threads. The hardware partitions them at kernel launch time based on the per-thread register count reported by ptxas. The partition is coarse-grained -- registers are allocated in units of warp groups, not individual threads.

Occupancy Cliffs

The relationship between per-thread register count and achievable occupancy is a step function with sharp discontinuities:

Registers/thread    Max warps/SM    Max threads/SM    Occupancy
      32                64              2048            100%
      33-40             48              1536             75%
      41-48             32              1024             50%   <-- cliff
      49-64             32              1024             50%
      65-80             24               768            37.5%  <-- cliff
      81-96             20               640            31.3%
      97-128            16               512             25%   <-- cliff
     129-168            12               384            18.8%
     169-255             8               256            12.5%  <-- cliff

(Exact thresholds vary by SM generation and block size; these are representative for sm_70+ with standard block configurations.)

Adding a single register -- from 32 to 33 registers per thread -- drops maximum occupancy from 64 warps to 48 warps, a 25% reduction. These are the occupancy cliffs that cicc's heuristics are designed to avoid. The cost is asymmetric: the 33rd register provides trivial benefit (one fewer spill), but the occupancy loss costs 25% of the SM's latency-hiding capacity.

This is why:

  • The loop unroller uses conservative thresholds that balance ILP against register growth
  • The loop vectorizer limits VF to 2 or 4 even though wider vectors are legal
  • LSR has an lsr-rp-limit knob that hard-rejects formulae exceeding a register pressure ceiling
  • LICM runs twice -- once to hoist, once to sink back values whose extended live ranges hurt occupancy
  • The rematerialization pass recomputes values rather than keeping them live across long ranges
  • The register allocator uses -maxreg (default 70) as a pressure cap rather than a physical assignment constraint

The cicc binary contains no explicit occupancy table -- it delegates final register assignment and occupancy computation to ptxas. But the thresholds in the optimization passes (LSR's lsr-rp-limit, the unroller's PartialThreshold, the vectorizer's register-pressure-bounded interleave count) are all calibrated to stay below known cliff boundaries.
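A back-of-the-envelope model makes the step function concrete. The sketch below assumes a 65,536-register SM with a 64-warp cap and deliberately ignores the per-generation allocation granularity that shifts the intermediate thresholds:

```c
/* Simplified occupancy model: resident warps = register file size
 * divided by registers per warp, capped at the scheduler limit.
 * Real hardware rounds the per-warp allocation up to a granularity
 * that varies by SM generation, which produces the exact thresholds
 * in the cliff table; this model matches it at power-of-two points. */
static int max_warps(int regs_per_thread)
{
    const int regfile_per_sm = 65536;
    const int warp_cap       = 64;        /* warps per SM, sm_70+ */
    int warps = regfile_per_sm / (regs_per_thread * 32);
    return warps > warp_cap ? warp_cap : warps;
}
```

At 32 registers the model yields the full 64 warps; at 64 registers, 32 warps; at 128 registers, 16 warps. Between those points, allocation granularity pulls the real cliffs earlier than this simple division suggests.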

PTX Virtual Registers

PTX has no fixed physical register file from the compiler's perspective. cicc emits virtual registers in nine typed classes (%p, %rs, %r, %rd, %f, %fd, %h, %hh, %rq -- see Register Classes). The ptxas assembler performs the actual register allocation from virtual to physical registers, using the SM's register file as the constraint. cicc's job is to minimize the number of simultaneously live virtual registers so that ptxas can produce a low register-count assignment.

The typed register model means that a 32-bit integer (%r) and a 32-bit float (%f) occupy separate register namespaces -- they never alias. A 64-bit value (%rd, %fd) occupies two 32-bit register slots. An Int128Regs value (%rq) occupies four. This is why the type legalization pass aggressively scalarizes vector types and the IV demotion pass narrows 64-bit induction variables to 32-bit: every bit of width reduction directly saves register pressure.

Memory Hierarchy

GPU memory is organized into physically disjoint address spaces with radically different performance characteristics. On a CPU, the entire address space is a flat virtual memory with uniform-latency cache hierarchy. On a GPU, choosing the wrong address space for an access can cost 100x in latency. This section summarizes the performance-relevant properties; for complete address space encoding, aliasing rules, and data layout strings, see Address Spaces.

Latency Table

Memory          | LLVM AS | PTX Qualifier    | Latency (cycles)    | Scope                  | Capacity
Registers       | --      | %r, %f, etc.     | 0                   | Per-thread             | 255 per thread (SM 70+)
Shared          | 3       | .shared          | 20-30               | Per-CTA (block)        | 48-228 KB per SM
Constant cache  | 4       | .const           | 4-8 (hit)           | Read-only, device-wide | 64 KB per SM
Parameter       | 101     | .param           | 4-8                 | Per-kernel launch      | Mapped to constant bank
Local (L1 hit)  | 5       | .local           | ~30                 | Per-thread stack       | L1 partition
Local (L2 hit)  | 5       | .local           | ~200                | Per-thread stack       | L2 partition
Global (L2 hit) | 1       | .global          | 32-128              | Device-wide            | L2 cache
Global (DRAM)   | 1       | .global          | 200-800             | Device-wide            | Device DRAM
Generic         | 0       | .generic         | +4-8 over resolved  | Virtual                | Runtime-resolved
Shared cluster  | 7       | .shared::cluster | 30-50               | Cross-CTA (SM 90+)     | Cluster shared pool

The 200-800 cycle range for global DRAM access is the defining constraint of GPU performance. It means that a single cache-missing load stalls the executing warp for hundreds of cycles. The hardware hides this latency through warp-level multithreading (see next section), but only if enough warps are resident -- which brings us back to register pressure and occupancy.

Why Each Memory Matters for cicc

Registers vs. .local: Every alloca that SROA fails to promote becomes a .local allocation backed by DRAM. A .local access that misses L1 costs 200-400 cycles versus zero for a register. This is why SROA runs twice in the pipeline and why cicc's inline budget (20,000 vs upstream 225) is so aggressive -- inlining eliminates allocas from byval parameter copies.

Shared memory (AS 3): On-chip SRAM with 20-30 cycle latency, shared across all threads in a CTA (thread block). Uses 32-bit pointers (when +sharedmem32bitptr is active), saving one register per pointer compared to 64-bit global pointers. This is why LSR has disable-lsr-for-sharedmem32-ptr -- strength-reducing a 32-bit shared pointer can produce 64-bit intermediates that defeat the optimization.

Constant memory (AS 4): Hardware-cached read-only memory with 4-8 cycle latency on cache hit. The NVVM AA marks AS 4 as NoModRef, enabling LICM to hoist constant loads without checking for intervening stores.

.param space (AS 101): Used for function argument passing (see the calling convention section below). Read-only from device code. Mapped to the constant cache path, so reads are 4-8 cycles.

Generic (AS 0): The performance killer. A generic pointer forces a runtime address-space lookup (+4-8 cycles per access) and destroys alias analysis precision (every generic pointer MayAlias with everything). This is why MemorySpaceOpt exists -- resolving generic pointers to specific address spaces is one of the highest-impact optimizations in cicc.

Memory Coalescing

The GPU memory subsystem services warp-wide requests in 128-byte transactions (or 32-byte sectors on some architectures). When 32 threads in a warp access 32 consecutive 4-byte values (128 bytes total), the hardware coalesces the 32 individual requests into a single transaction. This is the stride-1 access pattern -- the ideal case.

Thread 0  loads addr+0    ┐
Thread 1  loads addr+4    │
Thread 2  loads addr+8    │  One 128-byte transaction
...                       │
Thread 31 loads addr+124  ┘

When threads access non-consecutive addresses (stride > 1, scattered, or misaligned), the hardware must issue multiple transactions to satisfy the warp's requests. In the worst case (32 threads accessing 32 different cache lines), a single warp load generates 32 separate transactions, reducing effective bandwidth by 32x.
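The transaction count for a given stride can be computed directly. A small sketch (4-byte accesses, 128-byte lines, warp base assumed line-aligned):

```c
/* Count distinct 128-byte lines touched when each of 32 threads loads
 * 4 bytes at base + tid * stride * 4, with base 128-byte aligned. */
static int warp_transactions(int stride)
{
    unsigned long long lines[32];
    int n = 0;
    for (int tid = 0; tid < 32; tid++) {
        unsigned long long line =
            ((unsigned long long)tid * stride * 4) / 128;
        int seen = 0;
        for (int i = 0; i < n; i++)
            if (lines[i] == line)
                seen = 1;
        if (!seen)
            lines[n++] = line;     /* new line -> one more transaction */
    }
    return n;
}
```

Stride 1 touches a single line (one transaction for the whole warp); stride 32 puts every thread on its own line, the 32x worst case described above.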

Coalescing is why the loop vectorizer targets VF=2 or VF=4 on GPU: vectorizing a per-thread loop with ld.v4.f32 loads four consecutive elements per thread in a single wide transaction, improving bytes-per-transaction. It is also why the loop unroller enforces power-of-two factors -- non-power-of-two unroll factors create asymmetric access patterns that interact poorly with the 128-byte transaction boundary.

The memory coalescing model also explains why cicc's SLP vectorizer pairs adjacent scalar loads into ld.v2 / ld.v4 instructions -- not for SIMD parallelism (there is none) but for transaction width optimization.

No Out-of-Order Execution

GPU warps execute instructions strictly in program order. There is no out-of-order execution, no speculative execution, no branch prediction, and no reorder buffer. A warp that encounters a long-latency operation (global memory load, texture fetch) simply stalls until the result is available.

The sole latency-hiding mechanism is warp-level multithreading. Each SM maintains multiple warps in flight simultaneously. When one warp stalls on a memory access, the hardware switches to another ready warp in the same clock cycle (zero-cost context switch, because each warp has its own register state). This is why occupancy matters -- more resident warps means more opportunities to hide latency through interleaving.
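The arithmetic behind "more resident warps hide more latency" can be sketched with a deliberately crude model (invented here for illustration, not recovered from the binary): if each warp issues some instructions and then stalls, the scheduler stays busy only when enough other warps have work queued.

```c
/* Crude latency-hiding estimate: a warp occupies the scheduler for
 * work_insts cycles (one instruction per cycle), then sleeps for
 * stall_cycles waiting on memory. Warps needed to keep the scheduler
 * busy = ceil((work_insts + stall_cycles) / work_insts). */
static int warps_to_hide(int stall_cycles, int work_insts)
{
    return (work_insts + stall_cycles + work_insts - 1) / work_insts;
}
```

With 10 instructions of independent work between 400-cycle DRAM stalls, the model asks for 41 resident warps, which is why a register budget that caps occupancy at 32 warps (49-64 registers/thread in the cliff table) can leave the SM memory-latency bound.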

The absence of OOO execution has profound implications for cicc:

ILP must be compiler-created. On a CPU, the hardware reorder buffer discovers and exploits instruction-level parallelism dynamically. On a GPU, the compiler (cicc + ptxas) must explicitly schedule independent instructions adjacent to each other so the hardware can overlap them. This is why loop unrolling is so valuable on GPU -- it creates independent instructions from different iterations that the scheduler can interleave -- and why the interleave count in the loop vectorizer exists (it replicates the vectorized body to expose more ILP).

Every stall is a stall. There is no store buffer to absorb write latency, no load queue to speculatively issue reads. The scheduling passes (instruction scheduling, block placement) must model this accurately.

Instruction issue width bounds throughput. Each SM has a fixed number of instruction schedulers (typically 4 per SM on sm_70+), each issuing one instruction per clock to one warp. The total instruction throughput of an SM is schedulers * clock_rate. The TTI scheduling info at TTI+56 (issue width at +32, latency at +36 within the sub-structure) encodes this model and feeds the vectorizer's interleave count cap.

The .param Calling Convention

Function calls on NVIDIA GPUs are expensive in a way that has no CPU equivalent. On x86, a function call pushes arguments to registers or the stack (a cached memory region), executes CALL, and the callee reads them back. Total overhead: 5-20 cycles. On GPU, there is no hardware call stack for registers. The PTX calling convention works through the .param address space:

Call Sequence

// Caller side:
.param .align 8 .b8 param0[16];           // DeclareParam
st.param.b64 [param0+0], %rd1;            // Store arg 0, field 0
st.param.b64 [param0+8], %rd2;            // Store arg 0, field 1
.param .b32 param1;                        // DeclareScalarParam
st.param.b32 [param1+0], %r5;             // Store arg 1
call.uni (retval0), callee, (param0, param1);  // The actual call

// Callee side:
ld.param.b64 %rd10, [param0+0];           // Load arg 0, field 0
ld.param.b64 %rd11, [param0+8];           // Load arg 0, field 1
ld.param.b32 %r20,  [param1+0];           // Load arg 1
// ... function body ...
st.param.b32 [retval0+0], %r30;           // Store return value
ret;

// Back in caller:
ld.param.b32 %r6, [retval0+0];            // Load return value

Each function call generates O(n) st.param + O(n) ld.param instructions where n is the total number of argument fields (not just argument count -- structs are marshaled field-by-field). A function with 8 struct arguments containing 4 fields each generates 32 stores + 32 loads + the call instruction itself. At shared/constant-cache latency (4-8 cycles per access), this is 256-512 cycles of pure marshaling overhead.
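The arithmetic in that worked example can be sketched as follows (an illustrative helper mirroring the cost description above, not cicc's actual cost function):

```python
# Hypothetical sketch of the .param marshaling cost: each argument field
# costs one st.param (caller) plus one ld.param (callee).
def param_marshaling_cycles(fields_per_arg, cycles_per_access=(4, 8)):
    """Return (low, high) cycle estimate for marshaling all argument fields."""
    n_fields = sum(fields_per_arg)
    accesses = 2 * n_fields          # st.param + ld.param per field
    lo, hi = cycles_per_access
    return accesses * lo, accesses * hi

# 8 struct arguments with 4 fields each -> 32 stores + 32 loads:
print(param_marshaling_cycles([4] * 8))  # -> (256, 512) cycles
```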

Additionally:

  • Call boundaries destroy scheduling freedom. The hardware cannot overlap instructions across a call/return boundary.
  • Call boundaries force register save/restore. If the callee needs more registers than are available in the caller's allocation, the hardware spills to .local memory (DRAM, 200-800 cycles).
  • Indirect calls are catastrophic. An indirect call (call.uni through a register) prevents all of the above from being optimized statically. No inlining, no cross-function register allocation, no dead argument elimination.

This is why:

  • cicc's custom inliner uses a 20,000-unit budget (89x upstream LLVM's 225) -- the .param marshaling cost for a typical function easily exceeds the 225-unit threshold
  • LTO is dramatically more valuable on GPU than on CPU -- cross-module inlining eliminates .param overhead for functions in separate translation units
  • Whole-program devirtualization is critical -- converting indirect calls to direct calls enables inlining and eliminates the worst-case register spill scenario
  • 60% of the NVIDIA custom inliner's code computes type-size comparisons for argument coercion cost, because the .param marshaling cost dominates the inlining decision

The SelectionDAG Encoding

The SelectionDAG backend uses opcodes DeclareParam (505), DeclareScalarParam (506), StoreV1/V2/V4 (571-573), and LoadRetParam / LoadV1/V2/V4 (515-516, 568-570) for the param passing convention. The .param space is encoded as SelectionDAG code 5 in sub_33B0210. For complete opcode details, see NVPTX Machine Opcodes.

Address Space Semantics

GPU memory is partitioned into physically disjoint hardware regions. Pointers in different non-generic address spaces can never reference the same byte -- a property that NVVM AA exploits for O(1) NoAlias determination. The generic address space (AS 0) is a virtual overlay resolved at runtime by the hardware's address translation unit, which tests whether the address falls in the shared, local, or global window.

The following properties have direct optimization impact:

| Property | Global (AS 1) | Shared (AS 3) | Local (AS 5) | Constant (AS 4) |
|---|---|---|---|---|
| Pointer width | 64-bit | 32-bit* | 32-bit (effective) | 64-bit |
| Read-only | No | No | No | Yes |
| Cross-CTA visible | Yes | No | No | Yes |
| Hardware addressing modes | Base + offset | Base + offset, banked | Frame pointer + offset | Indexed constant cache |
| Coalescing | 128-byte transactions | 32 banks, 4-byte stride | Per-thread (no coalescing) | Broadcast to warp |

* 32-bit when +sharedmem32bitptr target feature is active (the default for sm_70+).

The 32-bit pointer optimization for shared memory saves one register per shared-memory pointer and reduces all address arithmetic from 64-bit to 32-bit operations. This is encoded in the NVPTX data layout string as p3:32:32:32 and is the reason the IV Demotion pass exists -- it narrows 64-bit induction variables to 32-bit when the loop operates entirely in shared memory.

For the complete address space reference -- including aliasing rules, the MemorySpaceOpt bitmask encoding, cvta intrinsic mapping, isspacep folding, and per-SM shared memory sizes -- see Address Spaces.

Compiler Implications Summary

Every major cicc optimization decision traces back to one or more of the properties above. The following table maps each hardware property to the compiler passes it shapes:

| Hardware Property | Compiler Impact | Key Passes |
|---|---|---|
| Warp divergence serializes both paths | Minimize control flow in hot loops | StructurizeCFG, CSSA, Loop Index Split, Branch Distribution |
| Register count determines occupancy | All transforms must minimize live values | Register Allocation, LSR, LICM, Rematerialization, IV Demotion |
| Occupancy cliffs are discrete | Threshold-driven heuristics with cliff awareness | Loop Unroll, Loop Vectorize, LSR lsr-rp-limit |
| No OOO execution | Compiler must create ILP | Loop Unroll (ILP via body replication), Scheduling, vectorizer interleave count |
| .local spill costs 200-800 cycles | Aggressively promote allocas | SROA (runs twice), Inliner (20K budget eliminates byval copies) |
| .param marshaling is O(n) per call | Aggressively inline | Inliner, LTO, Devirtualization |
| 128-byte coalescing transactions | Optimize memory access stride | Loop Vectorize (VF=2/4 for ld.v2/ld.v4), SLP Vectorizer |
| Address spaces are disjoint | NoAlias for cross-space pairs | NVVM AA, MemorySpaceOpt |
| Generic pointers destroy alias precision | Resolve to specific space | MemorySpaceOpt, IPMSP |
| Shared memory uses 32-bit pointers | Narrow IV and address width | IV Demotion, LSR disable-lsr-for-sharedmem32-ptr |
| Closed-world compilation model | Full-program visibility | LTO, Dead Kernel Elimination, Devirtualization |
| Constant cache is 4-8 cycles | Hoist constant loads freely | LICM, NVVM AA NoModRef for AS 4 |

What Upstream LLVM Gets Wrong

Upstream LLVM's NVPTX backend correctly implements the PTX virtual register model and the basic address space numbering. But the optimization passes assume CPU-like economics:

  1. Inline threshold of 225 assumes function calls cost 5-20 cycles. GPU calls cost hundreds of cycles due to .param marshaling. NVIDIA overrides to 20,000.

  2. LSR cost model compares formulae by counting registers and instructions with equal weight. On GPU, one extra register can cost 25% occupancy; one extra instruction costs nearly nothing. NVIDIA replaces the formula solver entirely.

  3. LICM assumes hoisting is always profitable. On CPU, moving an operation from loop body to preheader is strictly beneficial. On GPU, it extends the live range of the hoisted value across the entire loop, consuming a register for all iterations. NVIDIA runs LICM twice (hoist then sink) and relies on rematerialization to undo unprofitable hoists.

  4. Vectorization targets SIMD lane width. TTI::getRegisterBitWidth(Vector) returns 256 (AVX2) or 512 (AVX-512) on CPU. NVPTX returns 32 -- there are no SIMD lanes. Vectorization targets memory transaction width, not ALU parallelism.

  5. No occupancy model exists in upstream. CPU register allocation minimizes spill cost. GPU register allocation must minimize total register count to maximize occupancy. These are different objective functions.

  6. Address spaces are an afterthought. Upstream LLVM treats address spaces as metadata annotations. On GPU, they are physically disjoint hardware memory partitions with different pointer widths, latencies, and aliasing properties. Every pass that touches pointers must be address-space-aware.

Cross-References

Address Spaces

This page is the single source of truth for NVPTX address space numbering, hardware mapping, pointer widths, aliasing rules, and the internal bitmask encoding used by MemorySpaceOpt. It supersedes all inline address space tables elsewhere in the wiki -- those pages should cross-reference this one rather than maintaining their own copies.

NVPTX defines eight address spaces in cicc v13.0, six of which correspond to physically disjoint hardware memory partitions. The generic (flat) address space is a virtual overlay resolved at runtime by the GPU's address translation unit. The eighth, tensor memory (AS 6), is a Blackwell-era addition accessible only through TMA intrinsics. A ninth, AS 25, is used internally within NVVM IR for device-linkage annotations and never reaches PTX emission. A tenth, AS 53, appears in MemorySpaceOpt initialization as an internal annotation space for global variable tracking.

Master Address Space Table

| LLVM AS | Name | PTX Qualifier | Hardware | Pointer Width | Typical Latency | CUDA Qualifier |
|---|---|---|---|---|---|---|
| 0 | Generic (flat) | .generic | Virtual -- address translation unit maps to physical space at runtime | 64-bit | +4-8 cycles over resolved (translation overhead) | Default for unresolved pointers |
| 1 | Global | .global | Device DRAM, L2 cached, optionally L1 cached | 64-bit | 200-800 cycles (DRAM); 32-128 cycles (L2 hit) | __device__, cudaMalloc |
| 3 | Shared | .shared | Per-CTA on-chip scratchpad SRAM (48-228 KB per SM) | 32-bit (when p3:32:32:32 active) or 64-bit | 20-30 cycles (bank-conflict-free) | __shared__ |
| 4 | Constant | .const | Read-only constant cache (64 KB per SM) | 64-bit | 4-8 cycles (cache hit); DRAM latency on miss | __constant__ |
| 5 | Local | .local | Per-thread private stack in DRAM, L1 cached | 32-bit (effective) or 64-bit | Same as global (backed by DRAM) | Stack allocations (alloca) |
| 6 | Tensor Memory | N/A (TMA intrinsics only) | Blackwell tensor memory (SM 100+) | 64-bit | Varies (TMA pipeline) | N/A -- accessed via cp.async.bulk intrinsics |
| 7 | Shared Cluster | .shared::cluster | Distributed shared memory across CTAs in a cluster (SM 90+) | 32-bit or 64-bit | ~30-50 cycles (cross-CTA penalty over AS 3) | __shared__ with cluster scope |
| 25 | Internal device linkage | N/A | Not a physical memory -- NVVM IR annotation for __device__ linkage | N/A | N/A | Used internally by module summary for extern device resolution |
| 53 | Internal annotation | N/A | Not a physical memory -- used by MemorySpaceOpt for global tracking | N/A | N/A | Internal to cicc pipeline |
| 101 | Param | .param | Kernel parameter window (mapped into constant bank or global memory) | 64-bit | 4-8 cycles (constant cache path) | Kernel parameters (__global__ function args) |

Address space 2 is not used by NVPTX. The numbering gap between shared (3) and constant (4) is inherited from upstream LLVM NVPTX conventions. The NVVM verifier's valid-AS check uses the formula ((AS + ~2) & 0xFFFFFF) > 2, which accepts AS values 0, 1, and 3 unconditionally; AS 2 is sometimes valid depending on context.

Aliasing Rules

The core property exploited by NVVM AA is hardware address space disjointness: pointers in different non-generic address spaces can never reference the same byte. NVVM AA (nvptx-aa) encodes this as a NoAlias rule for every cross-space pointer pair, with the following exceptions.

| Pointer A | Pointer B | Alias Result | Reason |
|---|---|---|---|
| AS 0 (generic) | Any | MayAlias | Generic can map to any physical space at runtime |
| AS X (same) | AS X (same) | MayAlias | Same space -- further analysis needed (BasicAA, TBAA) |
| AS 1 (global) | AS 101 (param) | MayAlias | cvta.param on SM 70+ makes param addressable as global |
| AS 3 (shared) | AS 7 (shared cluster) | MayAlias | Cluster shared memory overlaps with regular shared |
| Any other cross-space pair | -- | NoAlias | Physically disjoint hardware memory partitions |

The NVVM AA algorithm (pseudocode from NVPTXAAResult::alias in cicc):

AliasResult alias(Loc1, Loc2):
    AS1 = getAddressSpace(Loc1.Ptr, TraverseLimit)  // walk through casts
    AS2 = getAddressSpace(Loc2.Ptr, TraverseLimit)

    if AS1 == 0 or AS2 == 0:         return MayAlias  // generic kills precision
    if (AS1==3 and AS2==7) or (AS1==7 and AS2==3): return MayAlias
    if AS1 == AS2:                    return MayAlias  // same space, need deeper AA
    return NoAlias                                     // different non-generic spaces

The getAddressSpace helper walks backward through getUnderlyingObject (stripping GEPs, bitcasts, PHIs) up to nvptx-traverse-address-aliasing-limit (default 6) levels deep, resolving generic pointers that were produced by addrspacecast from a specific space.
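The rule table above can be restated as a small Python sketch (an illustrative reimplementation of the documented behavior, including the global/param exception from the table; names are not cicc symbols):

```python
# Sketch of the NVVM AA cross-space rules, assuming address spaces have
# already been resolved by the getAddressSpace traversal.
GENERIC, GLOBAL, SHARED, CONSTANT, LOCAL, CLUSTER, PARAM = 0, 1, 3, 4, 5, 7, 101

def nvvm_alias(as1, as2):
    if as1 == GENERIC or as2 == GENERIC:
        return "MayAlias"              # generic kills precision
    if {as1, as2} == {SHARED, CLUSTER}:
        return "MayAlias"              # cluster shared overlaps regular shared
    if {as1, as2} == {GLOBAL, PARAM}:
        return "MayAlias"              # cvta.param on SM 70+
    if as1 == as2:
        return "MayAlias"              # same space: defer to BasicAA / TBAA
    return "NoAlias"                   # disjoint hardware partitions

print(nvvm_alias(SHARED, GLOBAL))   # -> NoAlias
print(nvvm_alias(GENERIC, SHARED))  # -> MayAlias
print(nvvm_alias(PARAM, GLOBAL))    # -> MayAlias
```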

ModRef Rules

| Address Space | ModRef Mask | Meaning |
|---|---|---|
| AS 4 (constant) | NoModRef | Read-only -- never modified |
| AS 101 (param) | NoModRef | Kernel params are read-only from device code |
| All others | ModRef | May be both read and written |

These masks enable DSE to skip constant/param stores entirely, and LICM to hoist loads from constant memory without checking for intervening stores.

MemorySpaceOpt Internal Bitmask

MemorySpaceOpt (sub_1C70910) encodes address spaces as single-bit positions in a byte-wide bitmask for efficient dataflow computation. The mapping is performed in sub_1CA8CD0 via a switch on the LLVM address space ID:

| Bit | Value | LLVM AS | Name |
|---|---|---|---|
| 0 | 0x01 | 1 | Global |
| 1 | 0x02 | 3 | Shared |
| 2 | 0x04 | 4 | Constant |
| 3 | 0x08 | 5 | Local |
| 4 | 0x10 | 101 | Param |
| 0-3 | 0x0F | N/A | Unknown (union of global + shared + constant + local) |

// sub_1CA8CD0 — address space to bitmask
switch (addrspace) {
    case 1:   return 0x01;   // global
    case 3:   return 0x02;   // shared
    case 4:   return 0x04;   // constant
    case 5:   return 0x08;   // local
    case 101: return 0x10;   // param
    default:  return 0x0F;   // unknown = union of all non-param
}

When multiple pointer sources contribute different address spaces (e.g., through PHI nodes or function arguments receiving pointers from different call sites), the bitmask is OR'd. A singleton bit (popcount == 1) means the space is fully resolved; multiple bits set means the pointer is ambiguous and requires either runtime isspacep or a conservative default to global.

Resolution Decision

Once the bitmask is computed for a pointer:

  • Single bit set: Resolved. The pass inserts an addrspacecast from generic to the target space and replaces all uses.
  • Multiple bits set, param bit included: If param-always-point-to-global is true (default), resolve to global. The rationale: kernel parameters always point into global device memory.
  • Multiple bits set, no param: Ambiguous. Emit warning "Cannot tell what pointer points to, assuming global memory space" and default to global.
  • Zero bits: Unreachable code or analysis error.
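The join-then-resolve flow can be sketched as follows (an illustrative reimplementation of the documented behavior, not cicc code):

```python
# Sketch of the MemorySpaceOpt bitmask join and resolution decision.
BIT = {1: 0x01, 3: 0x02, 4: 0x04, 5: 0x08, 101: 0x10}  # LLVM AS -> mask bit

def join(masks):
    """OR together the masks contributed by each pointer source."""
    m = 0
    for x in masks:
        m |= x
    return m

def resolve(mask, param_always_global=True):
    """Return the resolved LLVM AS, or None for unreachable/error."""
    if mask == 0:
        return None                            # unreachable or analysis error
    if mask & (mask - 1) == 0:                 # popcount == 1: fully resolved
        return {v: k for k, v in BIT.items()}[mask]
    if (mask & BIT[101]) and param_always_global:
        return 1                               # param bit set -> global
    return 1                                   # ambiguous -> warn, default global

print(resolve(join([BIT[3]])))           # -> 3 (shared, fully resolved)
print(resolve(join([BIT[3], BIT[1]])))   # -> 1 (ambiguous, defaults to global)
```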

Relationship to EDG Frontend Encoding

The EDG frontend uses a separate encoding in the symbol table entry at offset +156/+157:

| EDG Bit | Value | Memory Space |
|---|---|---|
| +156 bit 0 | 0x01 | __device__ (any device placement) |
| +156 bit 1 | 0x02 | __shared__ |
| +156 bit 2 | 0x04 | __constant__ |
| +156 bit 4 | 0x10 | Read-only linkage flag |
| +157 bit 0 | 0x01 | __managed__ |

The EDG memory_space_code at offset +136 maps to LLVM address spaces during IR generation: code 1 (__device__) maps to AS 1, code 2 (__shared__) maps to AS 3, code 3 (__constant__) maps to AS 4.

The Generic Address Space Problem

The generic (flat, AS 0) address space is the fundamental obstacle to alias precision on GPUs. When the EDG frontend or NVVM IR generator cannot determine which physical memory a pointer targets, it emits the pointer in AS 0. The hardware resolves generic addresses at runtime by checking whether the address falls within the shared memory window, the local memory window, or defaults to global -- a process that adds 4-8 cycles of latency per access.

For NVVM AA, a generic pointer forces MayAlias against every other pointer, destroying the disjointness guarantee and blocking optimizations in DSE, LICM, GVN, and MemorySSA. Three mechanisms address this:

1. MemorySpaceOpt (compile-time conversion). The two-phase inter-procedural pass resolves generic pointers by tracing them back to their allocation sites through use-def chains. When a generic pointer always derives from a __shared__ variable, the pass inserts addrspacecast to AS 3 and rewrites all uses. When different call sites disagree on the address space for the same argument, the pass clones the function into space-specialized versions. Every generic pointer resolved gives NVVM AA an additional NoAlias edge. Disabling this pass (-disable-MemorySpaceOptPass) causes 2-20x performance regressions.

2. AA address-space traversal. Even without MemorySpaceOpt, NVVM AA's getAddressSpace helper walks through addrspacecast chains. If %p was produced by addrspacecast i8 addrspace(3)* %s to i8*, the traversal discovers AS 3 despite %p being in AS 0 at the use site.

3. !noalias.addrspace metadata (kind 42). cicc attaches this metadata to instructions when address space information is known but the pointer itself remains generic. The AA evaluator detects this via opcode byte 0x4E ('N') and sets bit 2 in a pointer-tagged value (OR with 4), propagating disambiguation information through to AAResults::alias. This is a cicc-specific extension not found in upstream LLVM.

Data Layout Strings

The NVPTX data layout string encodes pointer widths and alignment for each address space. cicc produces three variants based on pointer width and shared memory pointer mode.

64-bit with shared memory specialization (most common production mode)

e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64

64-bit without shared memory specialization

e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64

32-bit mode

e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64

Field-by-Field Breakdown

| Field | Meaning | NVIDIA Note |
|---|---|---|
| e | Little-endian | All NVIDIA GPUs |
| p:64:64:64 | Default pointer: 64-bit size, 64-bit ABI align, 64-bit preferred align | Applies to AS 0 (generic), AS 1 (global), AS 4 (constant), AS 101 (param) |
| p3:32:32:32 | AS 3 pointer: 32-bit size, 32-bit ABI align, 32-bit preferred align | Shared memory is on-chip, addressable with 32 bits even in 64-bit mode |
| i1:8:8 | Booleans stored as 8-bit | Standard |
| i128:128:128 | 128-bit integers: 128-bit aligned | Used by cmpxchg on global/shared |
| n16:32:64 | Native integer widths | PTX has 16-bit, 32-bit, and 64-bit register files |
| v16:16:16 / v32:32:32 | Vector alignment: natural | 16-bit vectors at 16-bit, 32-bit vectors at 32-bit |

Shared Memory 32-bit Pointer Optimization

The p3:32:32:32 entry is the most impactful NVIDIA delta in the data layout. Shared memory lives in 48-228 KB of on-chip SRAM per SM, addressable with 32-bit pointers even when the rest of the address space is 64-bit. Using 32-bit pointers for shared memory saves register pressure (one register instead of two for every shared pointer) and instruction count (32-bit arithmetic instead of 64-bit for every address calculation).

The optimization is controlled by three knobs that alias the same underlying global (unk_4D0461C):

| Knob | Source |
|---|---|
| nvptx-short-ptr | Backend option (ctor_609_0 at 0x585D30) |
| nvptx-32-bit-smem | Backend option (same constructor) |
| +sharedmem32bitptr | Target feature string (passed via -arch processing) |

When any of these is active, the data layout gains the p3:32:32:32 entry, and LLVM's type system treats all addrspace(3)* pointers as 32-bit. This is transparent to the rest of the compiler -- DataLayout queries like getPointerSizeInBits(3) return 32 automatically, and all pointer arithmetic in shared memory is lowered to 32-bit operations.
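How such a query could fall out of the layout string can be sketched with a minimal parser (for illustration only; LLVM's real DataLayout does far more):

```python
# Sketch: read per-address-space pointer width from an NVPTX-style data
# layout string, falling back to the default p: entry.
LAYOUT = ("e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-"
          "i64:64:64-i128:128:128-f16:16:16-f32:32:32-f64:64:64-"
          "v16:16:16-v32:32:32-n16:32:64")

def pointer_size_in_bits(layout, addrspace):
    default = 64
    for spec in layout.split("-"):
        if not spec.startswith("p"):
            continue                       # skip e, iN, fN, vN, n specs
        head, size, *_ = spec.split(":")
        as_id = int(head[1:]) if head[1:] else 0   # "p" means AS 0 default
        if as_id == 0:
            default = int(size)
        if as_id == addrspace:
            return int(size)
    return default

print(pointer_size_in_bits(LAYOUT, 3))  # -> 32 (shared memory)
print(pointer_size_in_bits(LAYOUT, 1))  # -> 64 (global uses the default)
```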

The same 32-bit treatment applies to local memory (AS 5) in practice: local stack addresses are within the per-thread frame and always fit in 32 bits. However, the data layout does not carry an explicit p5:32:32:32 entry -- the 32-bit treatment is enforced by the SelectionDAG lowering which uses AS 7 for stack operations.

Known-Bits Implications

The 32-bit address spaces have direct implications for the known-bits analysis (sub_BD5420):

| Address Space | Pointer Width | Known Bits Effect |
|---|---|---|
| AS 0 (generic) | 64-bit | Pointer alignment only |
| AS 1 (global) | 64-bit | Low 4 bits often known-zero (16-byte alignment typical) |
| AS 3 (shared) | 32-bit | Low 2 bits known-zero (4-byte minimum), bits [32,63] irrelevant |
| AS 4 (constant) | 64-bit | Low 2 bits known-zero (4-byte alignment) |
| AS 5 (local) | 32-bit effective | Low 2 bits known-zero (stack alignment), bits [32,63] irrelevant |

DemandedBits exploits the 32-bit address spaces to eliminate zero-extensions and truncations around shared/local address calculations, keeping all pointer arithmetic in 32-bit ALU operations. This interacts with IV Demotion (sub_18B1DE0), which narrows 64-bit induction variables to 32-bit where shared memory address calculations permit.

Data Layout Validation

The NVVM verifier (sub_2C80C90) validates the data layout string at multiple pipeline points:

  • If empty: "Empty target data layout, must exist"
  • If invalid: prints "Example valid data layout:" with reference strings from off_4C5D0A0 (32-bit) and off_4C5D0A8 (64-bit)
  • A shortened compatibility form e-i64:64-v16:16-v32:32-n16:32:64 is used in the IR linker (sub_106AB30) to verify that two modules being linked share the same NVPTX target data layout.

Address Space Casts

NVPTX has strict rules for addrspacecast instructions, enforced by the NVVM verifier:

  1. At least one side must be generic (AS 0). Casting between two non-generic address spaces is prohibited: "Cannot cast non-generic pointer to different non-generic pointer". You must go through generic: addrspace(3) -> addrspace(0) -> addrspace(1).

  2. Source and target must be valid. The verifier rejects invalid address space IDs with "Invalid target address space" / "Invalid source address space".

  3. Alloca must be in generic. "Allocas are not supported on address spaces except Generic" -- alloca produces AS 0 pointers; MemorySpaceOpt later promotes them to AS 5.

  4. Tensor memory (AS 6) rejects load/store. "Tensor Memory loads/stores are not supported" -- AS 6 memory must be accessed through TMA intrinsics (cp.async.bulk.*), not regular load/store instructions.

  5. cmpxchg is restricted. "cmpxchg pointer operand must point to generic, global, or shared address space" -- atomic compare-exchange only supports AS 0, AS 1, and AS 3, with i32/i64/i128 operand types.

cvta Intrinsic Mapping

The PTX cvta (Convert Virtual Address) instructions are lowered through intrinsic IDs in the EDG frontend (sub_94A030):

| Intrinsic ID | Direction | Address Space |
|---|---|---|
| 0xC1 (193) | Generic -> Specific | Shared (AS 3) |
| 0xC2 (194) | Generic -> Specific | Constant (AS 4) |
| 0xC3 (195) | Generic -> Specific | Local (AS 5) |
| 0xC4 (196) | Generic -> Specific | Global (AS 1) |
| 0xC5 (197) | Specific -> Generic | Shared (AS 3) |
| 0xC6 (198) | Specific -> Generic | Constant (AS 4) |
| 0xC7 (199) | Specific -> Generic | Local (AS 5) |
| 0xC8 (200) | Specific -> Generic | Global (AS 1) |

The specific-to-generic direction emits addrspacecast (opcode 0x30). The generic-to-specific direction uses a store-to-temp followed by a load with the target address space annotation.

SelectionDAG Address Space Encoding

The SelectionDAG backend uses a secondary address space encoding for the .param passing convention. In sub_33B0210 (intrinsic lowering within the SelectionDAG), pointer arguments use this mapping:

| SelectionDAG Code | LLVM AS | PTX Space |
|---|---|---|
| 1 | 1 (global) | .global |
| 2 | 3 (shared) | .shared |
| 3 | 4 (constant) | .const |
| 4 | 5 (local) | .local |
| 5 | -- | .param (not a real AS, lowered to param window) |
| 7 | 7 (shared cluster) | .shared::cluster |

Stack operations (SelectionDAG opcode 16, StackAlloc) explicitly use AS 7 for the .param-like space when lowering stack frames via sub_33FF780(dag, ..., 7, 0, 1, 0).

Internal Address Spaces (Non-Physical)

AS 25 -- Device Linkage Annotation

Address space 25 is used by the module summary pass (sub_1C28690 in p2-H01-nvmodule-summary.txt) to tag functions and variables with __device__ linkage during inter-module resolution. When a function's type resolves to AS 25, it indicates the symbol has device-side linkage and requires device-side extern resolution. This address space never appears in emitted PTX -- it is consumed during linking and stripped before codegen.

AS 53 -- MemorySpaceOpt Global Annotation

During pass initialization (sub_1CAB590), MemorySpaceOpt filters module globals that carry address space 53 and registers them into internal tracking structures. This appears to be an annotation mechanism for marking globals that require special address space analysis. Like AS 25, this address space is internal and does not survive to PTX emission.

Shared Memory Specializations by SM Generation

| SM | Shared Memory Size | Cluster Support | AS 7 Available | Shared Memory Pointer |
|---|---|---|---|---|
| SM 70 (Volta) | 96 KB configurable with L1 | No | No | 32-bit (when +sharedmem32bitptr) |
| SM 80 (Ampere) | 164 KB configurable | No | No | 32-bit |
| SM 86 (Ampere GA10x) | 100 KB configurable | No | No | 32-bit |
| SM 89 (Ada) | 100 KB configurable | No | No | 32-bit |
| SM 90 (Hopper) | 228 KB configurable | Yes | Yes | 32-bit |
| SM 100 (Blackwell) | 228 KB configurable | Yes | Yes | 32-bit |

With SM 90+, __shared__ variables accessed with cluster scope use .shared::cluster (AS 7), which provides cross-CTA access within a cooperative thread array cluster. Regular intra-CTA shared access remains on AS 3 (.shared). The EarlyCSE pass (sub_2781BB6) detects AS 7 stores and applies conservative aliasing to prevent CSE across shared cluster barriers.

isspacep Intrinsics

The PTX isspacep instruction tests at runtime whether a generic pointer points to a specific address space. cicc represents these as intrinsics with builtin IDs 0xFD0-0xFD5:

| Builtin ID | PTX | Tests for |
|---|---|---|
| 0xFD0 | isspacep.global | Global (AS 1) |
| 0xFD1 | isspacep.shared | Shared (AS 3) |
| 0xFD2 | isspacep.local | Local (AS 5) |
| 0xFD3 | isspacep.const | Constant (AS 4) |
| 0xFD4 | isspacep.shared::cta | Shared CTA-local (AS 3, SM 90+) |
| 0xFD5 | isspacep.shared::cluster | Shared cluster (AS 7, SM 90+) |

MemorySpaceOpt's second-time resolver (sub_1CA9E90) folds these to compile-time constants when the pointer's address space is already known: isspacep.shared(%p) where %p is proven to be AS 3 folds to true. This eliminates runtime address space checks from conditional code patterns like:

if (__isShared(p))
    atomicAdd_shared(p, val);
else
    atomicAdd(p, val);
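The fold itself can be sketched as follows (an illustrative reimplementation of the documented behavior; the real logic lives in sub_1CA9E90's hash-table-based resolver):

```python
# Sketch: fold an isspacep builtin to a constant when the pointer's
# address space is proven; keep the runtime check otherwise.
TESTS = {0xFD0: 1, 0xFD1: 3, 0xFD2: 5, 0xFD3: 4, 0xFD4: 3, 0xFD5: 7}

def fold_isspacep(builtin_id, known_as):
    """Return True/False when the space is proven, else None (no fold)."""
    if known_as in (None, 0):        # still generic / unresolved
        return None
    return TESTS[builtin_id] == known_as

print(fold_isspacep(0xFD1, 3))   # isspacep.shared on a proven AS-3 ptr -> True
print(fold_isspacep(0xFD1, 1))   # ...on a proven global ptr -> False
print(fold_isspacep(0xFD1, 0))   # still generic -> None (keep runtime check)
```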

Configuration Knobs Affecting Address Spaces

| Knob | Default | Effect |
|---|---|---|
| nvptx-short-ptr | -- | Enable 32-bit pointers for shared/const/local |
| nvptx-32-bit-smem | -- | Same effect as above (alias) |
| param-always-point-to-global | true | Resolve ambiguous param pointers to global |
| mem-space-alg | 2 | Algorithm selection for MemorySpaceOpt (2 = default, others select alternate impl at sub_2CBBE90) |
| track-indir-load | true | Track pointers loaded from memory during address space analysis |
| track-int2ptr | true | Track inttoptr casts during analysis |
| nvptx-traverse-address-aliasing-limit | 6 | Max depth for NVVM AA getAddressSpace traversal |
| do-clone-for-ip-msp | -1 (unlimited) | Max function clones for inter-procedural specialization |
| process-alloca-always | true | Treat alloca as definite local (AS 5) |

Function Map

| Function | Address | Role |
|---|---|---|
| MemorySpaceOpt pass entry | sub_1C70910 | Mode dispatch, IP-MSP worklist driver |
| Per-BB instruction scanner | sub_1CA8CD0 | AS-to-bitmask mapping switch |
| Use-def chain walker | sub_1CA5350 | Backward pointer origin tracking |
| First-time resolver | sub_1CA2920 | Conservative address space resolution |
| Second-time resolver | sub_1CA9E90 | Hash-table-based resolution, isspacep folding |
| MemorySpaceCloning engine | sub_2CBBE90 | Inter-procedural function cloning (71KB) |
| IPMSP module pass variant | sub_1C6A6C0 | LIBNVVM path (54KB) |
| EDG cvta lowering | sub_94A030 | Address space cast intrinsic generation |
| EDG decl-side memspace processing | sub_6582F0 | CUDA attribute to memory space code resolution |
| EDG def-side memspace processing | sub_65F400 | Definition validation and initializer handling |
| NVVMModuleVerifier | sub_2C80C90 | Data layout and address space validation |
| NVVMIntrinsicVerifier | sub_2C7B6A0 | Per-intrinsic address space constraint checking |
| SelectionDAG intrinsic lowering | sub_33B0210 | Backend AS mapping for param passing |
| getPointerAlignmentBits | sub_BD5420 | Known-bits for address space pointer widths |
| NVIDIA intrinsic known-bits oracle | sub_F0C4B0 | Special register ranges |

Cross-References

  • Memory Space Optimization -- Two-phase address space resolver, bitmask dataflow, function cloning
  • IPMSP -- Inter-procedural memory space propagation, worklist algorithm
  • Alias Analysis & NVVM AA -- Address space disjointness, AA chain, !noalias.addrspace
  • NVPTX Target Infrastructure -- Data layout strings, +sharedmem32bitptr feature, TTI hooks
  • KnownBits & DemandedBits -- Address space pointer width in known-bits, DemandedBits narrowing
  • NVVM Verifier -- addrspacecast rules, tensor memory restriction, cmpxchg constraints
  • EDG Frontend -- CUDA memory space attributes (__shared__, __constant__, __device__)
  • SelectionDAG -- Backend address space encoding for param passing
  • IV Demotion -- Exploits 32-bit shared memory pointers for induction variable narrowing
  • EarlyCSE -- Shared cluster (AS 7) store handling

NVPTX Register Classes

This page is the single authoritative reference for the nine NVPTX register classes used throughout cicc v13.0. Register class tables previously duplicated in Register Allocation, Register Coalescing, PTX Emission, and AsmPrinter are consolidated here. When those pages reference register classes, they should cross-reference this page rather than maintaining inline copies.

| Role | Symbol |
|---|---|
| Register encoding | sub_21583D0 (4.6KB) |
| PTX type suffix map | sub_2163730 (1.7KB) |
| PTX prefix map | sub_21638D0 (1.6KB) |
| Copy opcode dispatch | sub_2162350 (3.0KB) |
| Register info init (legacy) | sub_2163AB0 / sub_2149CD0 |
| Register info init (new PM) | sub_30590F0 / sub_301F0C0 |
| Register decl emission | sub_2158E80 (17KB) |
| Internal-only class vtable | off_4A026E0 |

The Nine Register Classes

NVPTX defines nine register classes that participate in PTX code generation. Each class is identified at runtime by its vtable pointer, which sub_2163730 and sub_21638D0 use as a switch key to produce the PTX type suffix and register prefix respectively. The encoding function sub_21583D0 maps each class to a 4-bit tag that occupies bits [31:28] of the 32-bit encoded register ID.

| Tag | Vtable | Class Name | PTX Type | Prefix | Encoded ID | Width (bits) | Description |
|---|---|---|---|---|---|---|---|
| 1 | off_4A027A0 | Int1Regs | .pred | %p | 0x10000000 | 1 | Predicate (boolean) |
| 2 | off_4A02720 | Int16Regs | .b16 | %rs | 0x20000000 | 16 | Short integer |
| 3 | off_4A025A0 | Int32Regs | .b32 | %r | 0x30000000 | 32 | General-purpose integer |
| 4 | off_4A024A0 | Int64Regs | .b64 | %rd | 0x40000000 | 64 | Double-width integer |
| 5 | off_4A02620 | Float32Regs | .f32 | %f | 0x50000000 | 32 | Single-precision float |
| 6 | off_4A02520 | Float64Regs | .f64 | %fd | 0x60000000 | 64 | Double-precision float |
| 7 | off_4A02760 | Int16HalfRegs | .b16 | %h | 0x70000000 | 16 | Half-precision float (f16, bf16) |
| 8 | off_4A026A0 | Int32HalfRegs | .b32 | %hh | 0x80000000 | 32 | Packed pair (v2f16, v2bf16, v2i16, v4i8) |
| 9 | off_4A02460 | Int128Regs | .b128 | %rq | 0x90000000 | 128 | 128-bit wide (tensor core) |

Naming Discrepancy

Two naming conventions exist in the codebase, depending on whether the name was recovered from the emission functions or from the register allocator context:

| Vtable | Emission name (sub_2163730/sub_21638D0) | RA-context name (sub_2162350) | Resolution |
|---|---|---|---|
| off_4A02760 | Int16HalfRegs | Float16Regs | Same class. The emission functions use the TableGen-derived name Int16HalfRegs; the RA raw report uses the semantic alias Float16Regs. Both refer to off_4A02760. |
| off_4A026A0 | Int32HalfRegs | Float16x2Regs | Same class. Int32HalfRegs is the TableGen name; Float16x2Regs is the semantic alias. Both refer to off_4A026A0. |
| off_4A02460 | Int128Regs | SpecialRegs | Different raw reports assigned different names to off_4A02460. The emission report identifies it as Int128Regs (based on the .b128 type and %rq prefix); the earlier RA sweep report labeled it SpecialRegs. The emission-derived name Int128Regs is more accurate: .b128 / %rq is used for 128-bit tensor-core values (i128 on SM 70+), not for special/environment registers. |

The tenth vtable off_4A026E0 is present in the binary but returns "!Special!" from both sub_2163730 and sub_21638D0. It is never assigned an encoded ID and never participates in register declaration emission. It is an internal-only sentinel class used within NVPTXRegisterInfo initialization (string "ENVREG10" at register info offset +72).

Throughout this wiki, the emission-derived names (Int16HalfRegs, Int32HalfRegs, Int128Regs) are canonical. Pages written before this consolidation may use the RA-context aliases.

Register Encoding Scheme -- sub_21583D0

Every virtual register in the NVPTX backend is encoded as a 32-bit value that packs the register class and a per-class index into a single integer. The encoding function at sub_21583D0 (4.6KB) implements this:

encoded_register = class_tag | (register_index & 0x0FFFFFFF)

The bit layout:

 31  28 27                             0
+------+-------------------------------+
| class|       register index          |
| tag  |       (28 bits)               |
+------+-------------------------------+
  • Bits [31:28] -- 4-bit class tag, values 0x1 through 0x9 as listed in the table above.
  • Bits [27:0] -- 28-bit register index within that class, supporting up to 268 million registers per class.

The function operates in two modes:

  1. Physical register (register_id >= 0): Returns the raw index directly (low 28 bits). Physical registers on NVPTX are a vestigial concept -- the target has no fixed register file -- but LLVM's infrastructure requires them for reserved registers like %SP and %SPL.

  2. Virtual register (register_id < 0, i.e., bit 31 set in LLVM's internal convention): Looks up the register class from the MachineRegisterInfo register map, matches the class vtable against the nine known vtable addresses, and returns class_encoded_id | (register_index & 0x0FFFFFFF).

If the vtable does not match any of the nine known classes, the function triggers a fatal error:

"Bad register class"

This is a hard abort, not a recoverable diagnostic. It indicates that either a new register class was added without updating the encoding function, or memory corruption produced an invalid vtable pointer.

Why Bits [31:28] and Not Bits [31:29]

LLVM's standard convention uses bit 31 (0x80000000) to distinguish physical from virtual registers internally. The NVPTX encoding reclaims this bit as part of the class tag because, after encoding, the distinction between physical and virtual is no longer meaningful -- all registers in emitted PTX are virtual. Tag values 0x8 (Int32HalfRegs) and 0x9 (Int128Regs) have bit 31 set, which would collide with LLVM's virtual-register marker. This works because the encoding is applied only during emission, after register allocation is complete and the physical/virtual distinction is irrelevant.

Complete Class Separation

The nine register classes are completely disjoint. There is no cross-class interference: an Int32Regs register (%r) never conflicts with a Float32Regs register (%f) even though both are 32 bits wide. This is a fundamental consequence of PTX's typed register model. In PTX, .reg .b32 %r0 and .reg .f32 %f0 are distinct storage locations from ptxas's perspective. Two implications follow:

  1. No cross-class coalescing. The register coalescer at sub_34AF4A0 enforces a same-class check on every coalescing candidate. Cross-class copies (e.g., a bitcast from i32 to f32) must survive as explicit mov instructions in the emitted PTX.

  2. Per-class pressure accounting. The greedy register allocator at sub_2F5A640 tracks register pressure per class independently. The -maxreg limit bounds total live registers across all classes combined, but interference within any single class never spills over to another.

This is unlike CPU targets (x86, AArch64) where integer and floating-point registers can alias through sub-register relationships, or where a single physical register appears in multiple register classes.

Copy Opcodes -- sub_2162350

The function sub_2162350 (3.0KB, "Copy one register into another with a different width") dispatches copy instruction emission based on the source and destination register classes. Each class has two opcodes: one for same-class copies (e.g., mov.b32 %r1, %r0) and one for cross-class copies (e.g., bitcasting between Int32Regs and Float32Regs):

| Class | Same-Class Opcode | Cross-Class Opcode | Notes |
|---|---|---|---|
| Int1Regs | 39424 | 39424 | No distinct cross-class path |
| Int16Regs | 39296 | 39296 | No distinct cross-class path |
| Int32Regs | 39552 | 10816 | Cross = mov.b32 bitcast to float |
| Int64Regs | 39680 | 11008 | Cross = mov.b64 bitcast to double |
| Float32Regs | 30656 | 10880 | Cross = mov.b32 bitcast to integer |
| Float64Regs | 30784 | 11072 | Cross = mov.b64 bitcast to integer |
| Int16HalfRegs | 30528 | 10688 | Cross = mov.b16 half-to-short |
| Int32HalfRegs | 39552 | 39552 | Uses same opcode as Int32Regs same-class |
| Int128Regs | 39168 | 39168 | No distinct cross-class path |

Classes where both opcodes are identical (Int1Regs, Int16Regs, Int32HalfRegs, Int128Regs) have no meaningful cross-class copy path. For predicates (Int1Regs), this is because there is no other 1-bit type. For 128-bit registers, tensor-core values have no peer class to bitcast into. The Int32HalfRegs class shares its same-class opcode (39552) with Int32Regs because both emit .b32 copies -- the packed v2f16 value is simply treated as a 32-bit bitpattern for copying.

The five classes with distinct cross-class opcodes (Int32Regs, Int64Regs, Float32Regs, Float64Regs, Int16HalfRegs) are exactly those that participate in bitcast operations between integer and floating-point interpretations of the same bit width.

Register Declaration Emission -- sub_2158E80

During function body emission, sub_2158E80 (17KB) emits .reg declarations for every register class used by the function. The process:

  1. Iterate the register map at this+800 in the AsmPrinter state.
  2. Deduplicate classes using a hash table at this+808..832.
  3. Track the maximum index per class across all virtual registers.
  4. Emit one declaration per class in the format:
.reg .pred  %p<5>;       // 5 predicate registers (indices 0..4)
.reg .b16   %rs<12>;     // 12 short integer registers
.reg .b32   %r<47>;      // 47 general-purpose 32-bit
.reg .b64   %rd<8>;      // 8 double-width integer
.reg .f32   %f<20>;      // 20 single-precision float
.reg .f64   %fd<3>;      // 3 double-precision float
.reg .b16   %h<4>;       // 4 half-precision float
.reg .b32   %hh<2>;      // 2 packed-pair registers
.reg .b128  %rq<1>;      // 1 tensor-core 128-bit register

The count for each class is max_register_index + 1. The PTX declaration syntax %prefix<N> declares registers %prefix0 through %prefix(N-1).

Note that Int16HalfRegs and Int16Regs share the same PTX type suffix (.b16) but have different prefixes (%h vs %rs). Similarly, Int32HalfRegs and Int32Regs share .b32 but use %hh vs %r. The PTX assembler ptxas treats these as completely separate register namespaces -- the prefix, not the type, determines the namespace.

Stack pointer registers (%SP, %SPL) are emitted before the class declarations when the function has a non-zero local frame. These use .b64 in 64-bit mode or .b32 in 32-bit mode.

Per-Class Detail

Int1Regs -- Predicates

| Property | Value |
|---|---|
| Vtable | off_4A027A0 |
| PTX type | .pred |
| Prefix | %p |
| Tag | 0x1 |
| Width | 1 bit |
| Legal MVTs | i1 |
| Same-class copy | 39424 |

Predicate registers hold boolean values used for conditional branches (@%p1 bra target), select instructions (selp), and set-predicate results (setp). They are the only 1-bit registers in PTX. There is no cross-class copy path because no other class holds 1-bit values. The coalescer excludes predicates from cross-class analysis entirely.

Int16Regs -- Short Integers

| Property | Value |
|---|---|
| Vtable | off_4A02720 |
| PTX type | .b16 |
| Prefix | %rs |
| Tag | 0x2 |
| Width | 16 bits |
| Legal MVTs | i16 |
| Same-class copy | 39296 |

Short integer registers hold 16-bit integer values. PTX .param space widens all scalars below 32 bits to .b32, so %rs registers appear primarily in computation, not in function signatures. The prefix %rs (register-short) distinguishes these from %h (Int16HalfRegs) even though both declare as .b16.

Int32Regs -- General-Purpose 32-bit

| Property | Value |
|---|---|
| Vtable | off_4A025A0 |
| PTX type | .b32 |
| Prefix | %r |
| Tag | 0x3 |
| Width | 32 bits |
| Legal MVTs | i32 |
| Same-class copy | 39552 |
| Cross-class copy | 10816 |

The workhorse register class. Holds 32-bit integers, addresses in 32-bit mode, loop indices, and general computation results. Cross-class copy opcode 10816 handles bitcast to Float32Regs (%f).

Int64Regs -- Double-Width Integer

| Property | Value |
|---|---|
| Vtable | off_4A024A0 |
| PTX type | .b64 |
| Prefix | %rd |
| Tag | 0x4 |
| Width | 64 bits |
| Legal MVTs | i64 |
| Same-class copy | 39680 |
| Cross-class copy | 11008 |

Holds 64-bit integers and device pointers in 64-bit mode (the common case). Cross-class copy opcode 11008 handles bitcast to Float64Regs (%fd).

Float32Regs -- Single-Precision Float

| Property | Value |
|---|---|
| Vtable | off_4A02620 |
| PTX type | .f32 |
| Prefix | %f |
| Tag | 0x5 |
| Width | 32 bits |
| Legal MVTs | f32 |
| Same-class copy | 30656 |
| Cross-class copy | 10880 |

Holds IEEE 754 single-precision floats. Note the .f32 type suffix rather than .b32 -- PTX distinguishes float from bitwise register types even at the same width. Cross-class copy opcode 10880 handles bitcast to Int32Regs (%r).

Float64Regs -- Double-Precision Float

| Property | Value |
|---|---|
| Vtable | off_4A02520 |
| PTX type | .f64 |
| Prefix | %fd |
| Tag | 0x6 |
| Width | 64 bits |
| Legal MVTs | f64 |
| Same-class copy | 30784 |
| Cross-class copy | 11072 |

Holds IEEE 754 double-precision floats. Cross-class copy opcode 11072 handles bitcast to Int64Regs (%rd).

Int16HalfRegs -- Half-Precision Float

| Property | Value |
|---|---|
| Vtable | off_4A02760 |
| PTX type | .b16 |
| Prefix | %h |
| Tag | 0x7 |
| Width | 16 bits |
| Legal MVTs | f16, bf16 |
| Same-class copy | 30528 |
| Cross-class copy | 10688 |

Despite the Int16 in the TableGen-derived name, this class holds half-precision floating-point values (f16 and bf16). The .b16 PTX type (bitwise 16-bit) is used rather than a hypothetical .f16 because PTX's type system uses .b16 for all 16-bit values that are not short integers. The %h prefix distinguishes these registers from %rs (Int16Regs). Cross-class copy opcode 10688 handles conversion to Int16Regs.

The semantic alias Float16Regs appears in some wiki pages and is equally valid.

Int32HalfRegs -- Packed Half-Precision Pairs

| Property | Value |
|---|---|
| Vtable | off_4A026A0 |
| PTX type | .b32 |
| Prefix | %hh |
| Tag | 0x8 |
| Width | 32 bits |
| Legal MVTs | v2f16, v2bf16, v2i16, v4i8 |
| Same-class copy | 39552 |
| Cross-class copy | 39552 |

This is the only register class for vector types on NVPTX. It holds exactly 32 bits of packed data: two f16 values, two bf16 values, two i16 values, or four i8 values. The %hh prefix distinguishes it from %r (Int32Regs). Both same-class and cross-class copy opcodes are 39552 (identical to Int32Regs same-class), because copies of packed values are simple 32-bit bitwise moves.

All vector types wider than 32 bits (v4f32, v2f64, v8i32, etc.) are illegal on NVPTX and must be split or scalarized during type legalization. See the vector legalization documentation for the split/scalarize dispatch.

The semantic alias Float16x2Regs appears in some wiki pages.

Int128Regs -- 128-bit Tensor Core Values

| Property | Value |
|---|---|
| Vtable | off_4A02460 |
| PTX type | .b128 |
| Prefix | %rq |
| Tag | 0x9 |
| Width | 128 bits |
| Legal MVTs | i128 (SM 70+) |
| Same-class copy | 39168 |
| Cross-class copy | 39168 |

The widest register class, introduced for tensor core operations on Volta (SM 70) and later architectures. Holds 128-bit values used as operands and accumulators in mma and wmma instructions. The %rq prefix stands for "register quad" (4x32 bits). There is no cross-class copy path because no other class holds 128-bit values.

During register coalescing, 128-bit values are tracked as wide register pairs (two 64-bit halves). The coalescer at sub_3497B40 handles paired-register decomposition: when coalescing the low half, the high half inherits corresponding constraints.

An earlier raw report (p2c.5-01-register-alloc.txt) labeled off_4A02460 as SpecialRegs. This was an error in that report's identification. The vtable off_4A02460 emits .b128 / %rq, which is the 128-bit class for tensor core values, not a class for special/environment registers.

The Internal-Only Class -- off_4A026E0

| Property | Value |
|---|---|
| Vtable | off_4A026E0 |
| PTX type | "!Special!" |
| Prefix | "!Special!" |
| Encoded ID | None |

A tenth vtable address appears in the register info initialization path (sub_2163AB0). Both sub_2163730 and sub_21638D0 return the sentinel string "!Special!" for this vtable. It has no encoded ID, no PTX declaration, and never produces emitted registers. The string "ENVREG10" at register info offset +72 (alongside "Int1Regs" at offset +80) suggests this class is associated with environment registers -- hardware-defined read-only registers like %tid, %ctaid, %ntid, etc. These are emitted by dedicated special-register emission functions (sub_21E86B0, sub_21E9060) rather than through the register class encoding path.

Register Info Initialization

NVPTXRegisterInfo objects are created by two factory functions corresponding to the two pass manager generations:

| | Legacy PM | New PM |
|---|---|---|
| Factory | sub_2149CD0 | sub_301F0C0 |
| Init | sub_2163AB0 | sub_30590F0 |
| Object size | 224 bytes | 248 bytes |

Both call sub_1F4A910 (TargetRegisterInfo::InitMCRegisterInfo) with the register descriptor table at off_49D26D0 and register unit data at unk_4327AF0. Key fields in the initialized structure:

| Offset | Content |
|---|---|
| +44 | NumRegs (total register count) |
| +72 | "ENVREG10" (environment register class name) |
| +80 | "Int1Regs" (first register class name) |
| +96 | numRegClasses (initially 1, expanded during init) |

Coalescing Constraints

The register coalescer imposes these constraints based on register class:

| Class | Coalesceable | Constraint Flag (offset +3, mask 0x10) |
|---|---|---|
| Int1Regs | Same class only | Set |
| Int16Regs | Same class only | Set |
| Int32Regs | Same class only | Set (type code 12) |
| Int64Regs | Same class only | Set (type code 13) |
| Float32Regs | Same class only | Set (type code 15) |
| Float64Regs | Same class only | Set |
| Int16HalfRegs | Same class only | Set |
| Int32HalfRegs | Same class only | Set |
| Int128Regs | Never coalesced | Cleared |

Int128Regs (the class at off_4A02460, previously mislabeled SpecialRegs in the coalescing page) has its constraint flag cleared, excluding it from the coalescing worklist entirely. This makes sense: tensor-core 128-bit values have specific register-pair relationships that the coalescer must not disturb.

Cross-class copies between Int32Regs/Float32Regs and between Int64Regs/Float64Regs are bitcasts that the coalescer never eliminates -- they must survive as explicit PTX mov instructions because the source and destination live in different register namespaces.

Differences from Upstream LLVM NVPTX

The upstream LLVM NVPTX backend (as of LLVM 20.0.0) defines these register classes in NVPTXRegisterInfo.td:

  • Int1Regs, Int16Regs, Int32Regs, Int64Regs -- identical.
  • Float16Regs, Float16x2Regs -- upstream names for cicc's Int16HalfRegs / Int32HalfRegs. The rename reflects NVIDIA's preference for the TableGen-derived integer-typed names.
  • Float32Regs, Float64Regs -- identical.
  • Int128Regs -- present in upstream, matches cicc.
  • No SpecialRegs class in upstream. Special registers are handled through dedicated physical registers, not a register class.
  • No off_4A026E0 internal-only class in upstream.

The encoding scheme (4-bit tag in [31:28], 28-bit index in [27:0]) and the fatal "Bad register class" error path are NVIDIA additions not present in upstream LLVM's NVPTX backend, which relies on standard MCRegisterInfo encoding.

Function Map

| Function | Address | Size |
|---|---|---|
| Register class encoding (class tag OR index) | sub_21583D0 | 4.6KB |
| Register class -> PTX type suffix (.pred, .b32, .f32, ...) | sub_2163730 | 1.7KB |
| Register class -> PTX prefix (%p, %r, %f, ...) | sub_21638D0 | 1.6KB |
| Copy opcode dispatch by register class | sub_2162350 | 3.0KB |
| Stack frame + register declaration emission | sub_2158E80 | 17KB |
| NVPTXRegisterInfo init (legacy PM) | sub_2163AB0 | 1.1KB |
| NVPTXRegisterInfo factory (legacy PM) | sub_2149CD0 | -- |
| NVPTXRegisterInfo init (new PM) | sub_30590F0 | -- |
| NVPTXRegisterInfo factory (new PM) | sub_301F0C0 | -- |
| TargetRegisterInfo::InitMCRegisterInfo | sub_1F4A910 | -- |
| Special register emission (%tid, %ctaid, %ntid, %nctaid) | sub_21E86B0 | -- |
| Cluster register emission (SM 90+) | sub_21E9060 | -- |

Cross-References

  • Register Allocation -- greedy RA that operates on these classes; pressure tracking and -maxreg constraint
  • Register Coalescing -- same-class-only coalescing policy, copy opcode classification
  • PTX Emission -- function header orchestrator that calls the register declaration emitter
  • AsmPrinter -- per-instruction emission that calls the encoding function
  • Type Legalization -- vector type legalization driven by the Int32HalfRegs-only vector model
  • NVPTX Target Infrastructure -- NVPTXTargetMachine that owns the register info objects

NVPTX Machine Opcode Reference

This page is the master reference for NVPTX MachineInstr opcodes as they exist in cicc v13.0. These are the target-specific opcode numbers assigned during instruction selection and consumed by register allocation, instruction scheduling, the AsmPrinter, and every other machine-level pass. They are distinct from both LLVM IR opcodes (which live in the Instruction hierarchy) and from ISD/NVPTXISD SelectionDAG node opcodes (which exist only during lowering and are erased by ISel). A MachineInstr's opcode field is the 16-bit value at MachineInstr offset +68, and it indexes into the MCInstrDesc table to obtain operand counts, constraint classes, implicit defs/uses, and scheduling information.

  • Constraint table -- word_3F3E6C0 (static .data array of 16-bit entries)
  • Constraint emitter -- sub_B612D0 (104KB, 179-case switch)
  • Copy-type mapper -- sub_3494EA0 (12.7KB, maps opcodes 1--0x12 to families 440--503)
  • Register class builder -- sub_B5BA00 (21KB, 111 cases)
  • Operand type classifier -- sub_34961A0 (26.6KB, reads byte_444C4A0)
  • ISel entry -- sub_3090F90 (91KB, NVPTXDAGToDAGISel::Select)
  • Intrinsic lowering switch -- sub_33B0210 (343KB, hundreds of NVVM intrinsics)

Opcode Numbering Scheme

Opcodes 0--approximately 430 correspond to generic LLVM TargetOpcode values and standard LLVM machine pseudo-instructions (COPY, PHI, IMPLICIT_DEF, INLINEASM, etc.). These are identical to upstream LLVM 20.0.0. NVPTX target-specific opcodes begin around opcode 440 and extend into the thousands. The highest confirmed opcode numbers are in the 4900+ range (tcgen05 tensor core instructions for Blackwell).

The opcode numbering is generated by TableGen from the NVPTX .td instruction definitions and compiled into the MCInstrDesc table. Since cicc is a stripped binary, the symbolic names are lost. The identifications below come from behavioral analysis: matching the constraint table patterns, AsmPrinter string emission, and SelectionDAG lowering code against known PTX instruction semantics.

The Constraint Table: word_3F3E6C0

Every NVPTX machine opcode has an entry in the global constraint table at word_3F3E6C0. This is a flat array of 16-bit words, indexed by (opcode - 1). Each word packs two fields:

| Bits | Field | Purpose |
|---|---|---|
| [7:0] (low byte) | constraint_class | Index into the 179-case switch in sub_B612D0 |
| [15:8] (high byte) | register_class_id | Target register class for the instruction's primary result |

The access pattern, decompiled from sub_B612D0:

uint16_t entry = word_3F3E6C0[opcode - 1];
uint8_t constraint_class = entry & 0xFF;         // low byte
uint8_t register_class   = (entry >> 8) & 0xFF;  // high byte

switch (constraint_class) {
    case 0x00: ...  // simple 2-input ALU
    case 0x01: ...  // 3-input FMA
    ...
    case 0xB2: ...  // maximum observed class
}

The constraint class determines how many operands the instruction has, what register class each operand belongs to, and which operands are tied. Each case in the switch constructs a stack-allocated array of 16-byte constraint descriptors (see Pattern Database for the full descriptor layout) and calls sub_A78010 to emit them.

179 Constraint Classes

The constraint classes range from 0x00 through 0xB2 (179 values). Each class represents a distinct operand signature. Representative patterns:

| Class Range | Pattern | Descriptor Count | Typical Instructions |
|---|---|---|---|
| 0x00--0x0F | Simple ALU (2 inputs, 1 output) | 3 | add, sub, mul, and, or, xor |
| 0x10--0x1F | Ternary (3 inputs, 1 output) | 4 | fma, madc, selp |
| 0x20--0x3F | Load/store variants | 2--5 | ld, st with address space and vector width |
| 0x40--0x5F | Conversion and move | 2--3 | cvt, mov, bitcast |
| 0x60--0x7F | Atomic and barrier | 3--6 | atom.*, membar, fence |
| 0x80--0x9F | Texture/surface | 4--12 | tex.*, sust.*, suld.* |
| 0xA0--0xAF | Tensor core (MMA) | 6--16 | hmma, imma, wmma, mma |
| 0xB0 | Maximum operand (17 inputs) | 18 | Complex intrinsic (opcode 176) |
| 0xB1--0xB2 | Miscellaneous high-operand-count | variable | Specialized instructions |

The maximum observed operand count is 17 (constraint class 0xB0, associated with opcode 176), requiring 18 descriptor entries (17 inputs + 1 output) and 288 bytes of stack space in the constraint emitter's frame.

Register Class IDs in the High Byte

The high byte of each word_3F3E6C0 entry identifies the register class for the instruction's result. These IDs map to NVPTX's typed virtual register files:

| ID | Register Class | PTX Type | PTX Prefix | Vtable Address |
|---|---|---|---|---|
| 14 | Int32Regs | .b32 | %r | off_4A025A0 |
| 22 | Int16Regs | .b16 | %rs | off_4A02720 |
| 40 | Float32Regs | .f32 | %f | off_4A02620 |
| 43 | Float16Regs | .b16 | %h | off_4A02760 |
| 50 | Int64Regs | .b64 | %rd | off_4A024A0 |
| 51 | Float64Regs | .f64 | %fd | off_4A02520 |
| 52 | Int128Regs | .b128 | %rq | off_4A02460 |
| 78 | PredRegs | .pred | %p | off_4A027A0 |
| 86 | SpecialRegs | (varies) | (varies) | off_4A026E0 |

Additional register class IDs observed in the constraint table (24, 27, 29, 32, 36, 39, 41, 67, 72, 76) likely correspond to sub-classes or aliased classes (e.g., Int32HalfRegs with ID related to 32 and prefix %hh), but their exact mappings have not been recovered. Instructions that produce no register result (stores, barriers, calls) have a zero or don't-care value in the high byte.

Identified Opcode Families

The following sections catalog every opcode range where the binary-to-PTX mapping has been confirmed. Opcodes are grouped by functional family. Where an opcode's identity is uncertain, it is marked with a question mark.

Copy and Move Family (440--503)

These are the NVPTX-specific copy instructions that the NVPTX register coalescer at sub_34AF4A0 processes. The standard LLVM RegisterCoalescer handles only the generic COPY pseudo (a generic TargetOpcode, not in this range); the NVPTX coalescer handles these target-specific copy families in a second pass.

The mapping function sub_3494EA0 contains a switch statement that classifies internal opcode IDs (1--0x12) into copy families:

| Opcode Range | Family | Description |
|---|---|---|
| 440--443 | Type-preserving moves | Same-class copies: i32-to-i32, i64-to-i64, f32-to-f32, f64-to-f64. These map from operand type codes 12, 13, 15 in the byte_444C4A0 classification table. |
| 444--470 (approx.) | Cross-class moves | Bitcasting copies between register classes (e.g., i32 to f32). These survive coalescing as explicit mov instructions in PTX because the source and destination register types differ. |
| 471--490 (approx.) | Paired/wide moves | 128-bit register pair copies for tensor core paths. The low and high halves are tracked jointly by sub_3497B40. |
| 491--503 (approx.) | ABI parameter copies | .param-related copies at call boundaries. These arise from the calling convention and are prime targets for coalescing. |

The byte_444C4A0 operand-type classification table (16-byte entries, indexed by MVT enum) feeds the coalescer's type check:

struct OperandTypeEntry {    // 16 bytes at byte_444C4A0[16 * mvt - 16]
    uint8_t type_code;       // +0: 12=i32, 13=i64, 15=f32, etc.
    uint8_t size_class;      // +1: size in register-width units
    uint8_t register_bank;   // +2: bank identifier
    uint8_t constraint_flags; // +3: bit 0x10 = participates in coalescing
    uint8_t reserved[12];    // +4: padding
};

The constraint flag at offset +3 (mask 0x10) gates whether an operand participates in coalescing. Operands without this bit set (e.g., SpecialRegs) are excluded from the coalescer's worklist entirely.

Call ABI Family (505--573)

These opcodes implement the PTX .param-space calling convention. They are emitted by NVPTXTargetLowering::LowerCall (sub_3040BF0, 88KB) and form the backbone of every device-function call sequence.

| Opcode | Name | PTX Equivalent | Operands |
|---|---|---|---|
| 315 | CallSeqBegin | (pseudo) call frame setup | chain, seq_id, zero |
| 316 | CallSeqEnd_Outer | (pseudo) outer call frame teardown | chain, glue, callee_ref, callee_ref_hi |
| 505 | DeclareParam | .param .align A .b8 param[N] | chain, alignment, param_index, byte_size |
| 506 | DeclareScalarParam | .param .bW paramN | chain, alignment, param_index, widened_size |
| 507 | DeclareRetParam | .param .align A .b8 retval[N] | chain, alignment, byte_size, zero |
| 508 | DeclareRetScalarParam | .param .bW retval | chain, 1, widened_size, zero |
| 510 | CallDirect | call (retval), func, (params) | chain, callee, params... |
| 511 | CallDirectNoProto | call func, (params) (old-style) | chain, callee, params... |
| 512 | CallIndirect | call (retval), %rd, (params) | chain, func_ptr, params... |
| 513 | CallIndirectNoProto | call %rd, (params) | chain, func_ptr, params... |
| 514 | CallStart | (pseudo) actual call emission point | CallProto result |
| 515 | LoadRetParam | ld.param.bW retvalN | call_result, 1, element_index |
| 516 | LoadRetParamLast | ld.param.bW retvalN (last) | call_result, 1, element_index |
| 517 | CallSeqEnd | (pseudo) inner call frame teardown | last_load, chain, flag |
| 518 | CallProto | .callprototype | chain, callee, proto_string |
| 521 | DeclareRetParam_Ext | .param for return (ext path) | CallSeqEnd result, seq_id |
| 527 | StoreCalleeRetAddr | (pseudo) callee return addr | chain, proto_symbol |
| 528 | StoreRetValToParam | st.param.bW retvalN (return) | chain, value, offset |

The call sequence follows a strict emission order:

CallSeqBegin(315)
  for each argument:
    DeclareParam(505) or DeclareScalarParam(506)
    StoreV1/V2/V4(571/572/573) — store argument values
  DeclareRetParam(507) or DeclareRetScalarParam(508)  [if callee returns]
  CallProto(518)
  CallStart(514)                                       [actual call point]
  for each return value:
    LoadRetParam(515) or LoadRetParamLast(516)
  CallSeqEnd(517)
  DeclareRetParam_Ext(521)                             [if prototype present]
CallSeqEnd_Outer(316)

Vector Load/Store Family (568--573)

These opcodes handle vectorized .param-space data movement, emitted during argument passing and return value extraction:

| Opcode | Name | PTX Equivalent | Vector Width |
|---|---|---|---|
| 568 | LoadV1 | ld.param.b32 / ld.param.b64 | 1 element |
| 569 | LoadV2 | ld.param.v2.b32 / ld.param.v2.b64 | 2 elements |
| 570 | LoadV4 | ld.param.v4.b32 / ld.param.v4.b64 | 4 elements |
| 571 | StoreV1 | st.param.b32 / st.param.b64 | 1 element |
| 572 | StoreV2 | st.param.v2.b32 / st.param.v2.b64 | 2 elements |
| 573 | StoreV4 | st.param.v4.b32 / st.param.v4.b64 | 4 elements |

The vector width selection logic in LowerCall (sub_3040BF0, lines 1429--1440):

accumulated_operand_count == 3  ->  StoreV1 (571), width=1
accumulated_operand_count == 4  ->  StoreV2 (572), width=2
accumulated_operand_count == 6  ->  StoreV4 (573), width=4
other                           ->  fatal error (unreachable)

The same pattern applies to LoadV1/V2/V4 on the return path. These opcodes are also used for by-value struct argument decomposition, where the struct is stored element-by-element into .param space using 8-byte chunks via StoreV1(571).

Atomic Family (294--317, 462)

Atomic opcodes are emitted by sub_20BED60 during DAG legalization and emitted as PTX by sub_21E5E70 (base) and sub_21E6420 (L2-hinted variant for SM 80+):

| Opcode Range | PTX Instruction | Types |
|---|---|---|
| 294--297 | atom.add | f32, f64, i32, i64 |
| 302--305 | atom.min | s32, s64, u32, u64 |
| 314--317 | atom.max | s32, s64, u32, u64 |
| 462 | atom.cas | generic (compare-and-swap) |

Within the PTX emission layer, the atomic operation is encoded in a packed operand word:

| Bits | Field | Values |
|---|---|---|
| [7:4] | scope | 0=gpu (default), 1=cta, 2=sys |
| [23:16] (BYTE2) | operation | 0x00=exch, 0x01=add.u, 0x03=and, 0x05=or, 0x06=xor, 0x07=max.s, 0x08=min.s, 0x09=max.u, 0x0A=min.u, 0x0B=add.f, 0x0C=inc, 0x0D=dec, 0x0E=cas |

Note that operation codes 0x02 and 0x04 are absent -- there is no signed atomic add or a second OR variant, matching the PTX ISA specification.

On Ampere (SM 80+), each atomic operation has an L2 cache-hinted variant emitted by sub_21E6420. The PTX format becomes atom[.scope].op.L2::cache_hint.type, instructing the GPU to retain or evict data in L2 after the atomic completes.

Barrier and Fence Family (287--290)

| Opcode | PTX Instruction | Scope |
|---|---|---|
| 287 | membar.gpu | GPU |
| 288 | membar.cta | CTA (thread block) |
| 289 | membar.sys | System |
| 290 | fence.sc.cluster | Cluster (SM 90+) |

The emission function sub_21E94F0 dispatches on the low 4 bits of the operand word. The fence.sc.cluster instruction requires SM 90 (Hopper) and provides sequentially-consistent fence semantics at cluster scope.

Cluster barrier instructions (SM 90+, emitted by sub_21E8EA0):

| Operand Encoding | PTX Instruction |
|---|---|
| bits[3:0]=0, bits[7:4]=0 | barrier.cluster.arrive |
| bits[3:0]=0, bits[7:4]=1 | barrier.cluster.arrive.relaxed |
| bits[3:0]=1, bits[7:4]=0 | barrier.cluster.wait |
| bits[3:0]=1, bits[7:4]=1 | barrier.cluster.wait.relaxed |

NVPTXISD Custom DAG Opcodes (22--499)

These are SelectionDAG-level opcodes used during lowering. After instruction selection, they are replaced by concrete MachineInstr opcodes. They are documented here because the DAG opcode numbers appear in the binary's lowering functions and serve as the conceptual identity of each instruction family:

| DAG Opcode | Identity | Notes |
|---|---|---|
| 22 | NVPTXISD::TargetAddr | Data pointer computation |
| 24 | NVPTXISD::Wrapper | Global address wrapping |
| 149 | NVPTXISD::ATOMIC_LOAD | Atomic load (lowered from IR atomic) |
| 152 | NVPTXISD::SELECT_CC | Conditional select (ternary) |
| 189 | NVPTXISD::MoveParam | Thread index and parameter moves |
| 193--196 | NVPTXISD::MIN/MAX | Min/max variants (2- and 3-source) |
| 197 | NVPTXISD::CTPOP | Population count |
| 198--204 | NVPTXISD::ConstantPool | Constant pool entry variants |
| 208 | NVPTXISD::CMPXCHG | Compare-and-exchange |
| 213--214 | NVPTXISD::STORE_SIGNED | Store with sign-extension flag |
| 215 | NVPTXISD::AddrSpaceCast | Address space conversion (within lowering) |
| 230 | NVPTXISD::DeclareLocal | Declare local variable / address of param |
| 233--234 | NVPTXISD::AddrSpaceCast pair | Two-step address space cast |
| 245--274 | NVPTXISD::MathOp_RN/RZ/RM/RP | Rounded math (add, mul, sqrt, div, fma) |
| 310 | NVPTXISD::Annotation | PTX .pragma annotation |
| 321 | NVPTXISD::StackRestore | Stack pointer restore |
| 322 | NVPTXISD::StackAlloc | Dynamic stack allocation |
| 330 | NVPTXISD::FunctionAddr | Function address (for indirect calls) |
| 335 | NVPTXISD::BinaryArith | Two-operand arithmetic |
| 371 | NVPTXISD::DynAreaOffset | Dynamic alloca offset |
| 499 | NVPTXISD::ConditionalBranch | Conditional branch with .param alloc |

The rounded math opcodes (245--274) follow a systematic pattern. The intrinsic lowering switch at sub_33B0210 maps NVVM intrinsic IDs to NVPTXISD opcodes:

| Intrinsic ID | NVPTXISD Opcode | PTX Operation |
|---|---|---|
| 63 | 249 | add.rz |
| 64 | 255 | mul.rz |
| 89 | 267 | fma.rz |
| 170 | 245 | add.rm |
| 172 | 274 | mul.rm |
| 250 | 271 | fma.rm |
| 308 | 270 | add.rp |
| 309 | 272 | mul.rp |
| 310 | 273 | fma.rp |
| 325 | 248 | sqrt.rz |
| 328 | 254 | sqrt.rm |
| 335 | 246 | sqrt.rp |
| 348 | 250 | div.rz |
| 349 | 256 | div.rm |
| 355 | 269 | div.rp |

MMA / Tensor Core Opcodes

Tensor core MachineInstr opcodes occupy a large range and are organized by generation. The central MMA instruction builder at sub_21E74C0 reads a packed 64-bit descriptor to determine the specific instruction variant.

Pre-Blackwell (SM 70--90) families:

| Function | Family | PTX Base | Min SM |
|---|---|---|---|
| sub_21E0360 | HMMA load A/B | wmma.load.a / wmma.load.b | 70 |
| sub_21E0630 | HMMA load C | wmma.load.c | 70 |
| sub_21DFBF0 | HMMA store C | wmma.store.c | 70 |
| sub_21E0870 | HMMA MMA | wmma.mma / mma | 70 |
| sub_21E1280 | IMMA load A/B | wmma.load.a (int) | 72 |
| sub_21E15D0 | IMMA load C | wmma.load.c (int) | 72 |
| sub_21E1830 | IMMA store C | wmma.store.c (int) | 72 |
| sub_21E1D20 | IMMA MMA | mma (integer, with saturation) | 72 |
| sub_21E2280 | BMMA MMA | mma (binary, b1.and.popc / b1.xor.popc) | 75 |

Each family exists in two copies: the AsmPrinter-side at 0x21Dxxxx--0x21Exxxx and the NVPTX backend-side at 0x36Exxxx.

Blackwell tcgen05 (SM 100+):

Opcodes 4905--4940 cover 10 shape variants of tcgen05.mma. The packed descriptor encodes:

| Bit | Field | Values |
|---|---|---|
| 0 | scaleD | 0 or 1 |
| 1 | negA | 0=positive, 1=negative |
| 2 | negB | 0=positive, 1=negative |
| 3 | transA | 0=normal, 1=transposed |
| 4 | transB | 0=normal, 1=transposed |
| 5 | sparsity | structured sparsity enable |
| [8:6] | type encoding | mxf4nvf4, i8, mxf8f6f4, f16, tf32, fp4, mxf4, bf16 |

Modifiers include block_scale, weight_stationary, and scaleInputAccumulator. The architecture gate is subtarget+340 >= 0x3E8 (1000 decimal, i.e. SM 100).
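The low-bit layout in the table can be decoded with straightforward shifts and masks. The struct and function names below are ours, not from the binary; only the bit positions follow the recovered layout:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decode of the low bits of the packed tcgen05 MMA
 * descriptor. Bit positions follow the table above. */
struct Tcgen05Fields {
    unsigned scaleD, negA, negB, transA, transB, sparsity, type_code;
};

static struct Tcgen05Fields tcgen05_decode(uint64_t desc) {
    struct Tcgen05Fields f;
    f.scaleD    = (unsigned)( desc       & 1);
    f.negA      = (unsigned)((desc >> 1) & 1);
    f.negB      = (unsigned)((desc >> 2) & 1);
    f.transA    = (unsigned)((desc >> 3) & 1);
    f.transB    = (unsigned)((desc >> 4) & 1);
    f.sparsity  = (unsigned)((desc >> 5) & 1);
    f.type_code = (unsigned)((desc >> 6) & 7);  /* bits [8:6] */
    return f;
}
```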

MMA Shape and Type Encoding

The MMA instruction builder uses enumerated shape and type codes embedded in the packed descriptor:

Shape codes (bits [39:32]):

| Code | Shape | PTX String | Min SM |
|---|---|---|---|
| 0x01 | m8n8k4 | "m8n8k4" | 70 |
| 0x02 | m8n8k16 | "m8n8k16" | 72 |
| 0x03 | m8n8k32 | "m8n8k32" | 75 |
| 0x04 | m8n8k64 | "m8n8k64" | 75 |
| 0x05 | m8n8k128 | "m8n8k128" | 75 |
| 0x10 | m16n8k4 | "m16n8k4" | 80 |
| 0x11 | m16n8k8 | "m16n8k8" | 75 |
| 0x12 | m16n8k16 | "m16n8k16" | 80 |
| 0x13 | m16n8k32 | "m16n8k32" | 75 |
| 0x14 | m16n8k64 | "m16n8k64" | 75 |
| 0x15 | m16n8k128 | "m16n8k128" | 75 |
| 0x16 | m16n8k256 | "m16n8k256" | 75 |
| 0x17 | m16n16k16 | "m16n16k16" | 90 |
| 0x18 | m32n8k16 | "m32n8k16" | 90? |
| 0x19 | m16n16k8 | "m16n16k8" | 70 |

Data type codes (in aty/bty fields):

| Code | Type | Bits | PTX |
|---|---|---|---|
| 1 | b1 | 1 | "b1" |
| 2 | s4 | 4 | "s4" |
| 3 | u4 | 4 | "u4" |
| 4 | s8 | 8 | "s8" |
| 5 | u8 | 8 | "u8" |
| 6 | f16 | 16 | "f16" |
| 7 | bf16 | 16 | "bf16" |
| 8 | tf32 | 19 | "tf32" |
| 9 | f64 | 64 | "f64" |
| 10 | f32 | 32 | "f32" |
| 11 | s32 | 32 | "s32" |
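The type-code table is small enough to express as a direct array lookup. This is a sketch mirroring the recovered table; the struct, array, and index-0 placeholder are ours:

```c
#include <assert.h>
#include <string.h>

/* MMA data-type codes (aty/bty fields) -> PTX string and bit width,
 * as recovered above. Index 0 is unused in the binary's encoding. */
struct MmaType { const char *ptx; int bits; };

static const struct MmaType mma_types[] = {
    {0, 0},                                        /* 0: unused */
    {"b1", 1},   {"s4", 4},   {"u4", 4},  {"s8", 8},
    {"u8", 8},   {"f16", 16}, {"bf16", 16},
    {"tf32", 19},{"f64", 64}, {"f32", 32},{"s32", 32},
};
```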

Special Register Access

Special register read instructions map to PTX special registers. The AsmPrinter function sub_21E86B0 dispatches on a single-byte operand:

| Operand | Register | Description |
|---|---|---|
| 0x26 | %tid.x | Thread ID, X |
| 0x27 | %tid.y | Thread ID, Y |
| 0x28 | %tid.z | Thread ID, Z |
| 0x29 | %ntid.x | Block dimension, X |
| 0x2A | %ntid.y | Block dimension, Y |
| 0x2B | %ntid.z | Block dimension, Z |
| 0x2C | %ctaid.x | Block ID, X |
| 0x2D | %ctaid.y | Block ID, Y |
| 0x2E | %ctaid.z | Block ID, Z |
| 0x2F | %nctaid.x | Grid dimension, X |
| 0x30 | %nctaid.y | Grid dimension, Y |
| 0x31 | %nctaid.z | Grid dimension, Z |
| 0x5E | (dynamic) | %warpid / %laneid (via sub_3958DA0) |
| 0x5F | (dynamic) | %nwarpid or similar (via sub_3958DA0) |

Cluster special registers (SM 90+, sub_21E9060) add 15 registers: %is_explicit_cluster, %cluster_ctarank, %cluster_nctarank, %cluster_ctaid.{x,y,z}, %cluster_nctaid.{x,y,z}, %clusterid.{x,y,z}, %nclusterid.{x,y,z}.
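Because the 0x26--0x31 range is contiguous, the dispatch reduces to a table indexed by `operand - 0x26`. This is a sketch of that idea, not the binary's actual code; the helper name is ours:

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Operand byte -> PTX special register name for the contiguous
 * 0x26..0x31 range handled by sub_21E86B0. */
static const char *sreg_name(unsigned char op) {
    static const char *table[] = {
        "%tid.x",    "%tid.y",    "%tid.z",
        "%ntid.x",   "%ntid.y",   "%ntid.z",
        "%ctaid.x",  "%ctaid.y",  "%ctaid.z",
        "%nctaid.x", "%nctaid.y", "%nctaid.z",
    };
    if (op >= 0x26 && op <= 0x31)
        return table[op - 0x26];
    return NULL;  /* 0x5E/0x5F take the dynamic path via sub_3958DA0 */
}
```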

Address Space Conversion

The cvta instruction family is emitted by sub_21E7FE0:

| Operand Value | Suffix | Full Instruction |
|---|---|---|
| 0 | (none) | cvta (generic) |
| 1 | .global | cvta.to.global / cvta.global |
| 3 | .shared | cvta.to.shared / cvta.shared |
| 4+ | .local | cvta.to.local / cvta.local |

Direction is determined by a separate operand: value 0 emits "a" (to-generic), value 1 emits "b" (to-specific).
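The two-operand scheme can be sketched as follows. This is our reading of the recovered behavior (space operand picks the suffix, direction operand picks `cvta.to.<space>` vs `cvta.<space>`); the function name and buffer handling are illustrative:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of sub_21E7FE0's cvta mnemonic construction. */
static void emit_cvta(int space, int to_specific, char *out, size_t n) {
    const char *suffix =
        (space == 1) ? ".global" :
        (space == 3) ? ".shared" :
        (space >= 4) ? ".local"  : "";
    if (to_specific)
        snprintf(out, n, "cvta.to%s", suffix);  /* generic -> specific */
    else
        snprintf(out, n, "cvta%s", suffix);     /* specific -> generic */
}
```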

Constraint Emission Pipeline

The full path from opcode to emitted constraint:

```
sub_B612D0(emitter_state, opcode):
    // Step 1: Table lookup
    entry = word_3F3E6C0[opcode - 1]
    reg_class = entry >> 8
    constraint_class = entry & 0xFF

    // Step 2: Build descriptor array on stack
    switch (constraint_class):
        case 0x00:
            // Simple 2-input ALU: {op0=RC, op1=RC, result=RC}
            desc[0] = {kind=0, value=sub_A778C0(state, reg_class, flags)}
            desc[1] = {kind=1, value=sub_A778C0(state, reg_class, flags)}
            desc[2] = {kind=-1, value=sub_B5BA00(state, reg_class)}
            sub_A78010(state, desc, 3)
        case 0x01:
            // Ternary FMA: {op0, op1, op2, result}
            desc[0..2] = three input constraints
            desc[3] = {kind=-1, value=sub_B5BA00(state, reg_class)}
            sub_A78010(state, desc, 4)
        ...
        case 0xB0:
            // 17-input complex: 17 input constraints + 1 output
            for i in 0..16:
                desc[i] = {kind=i, value=...}
            desc[17] = {kind=-1, value=sub_B5BA00(state, reg_class)}
            sub_A78010(state, desc, 18)
```

Key helper functions:

| Address | Function | Purpose |
|---|---|---|
| sub_A778C0 | createRegClassConstraint(state, regclass, flags) | Build input operand constraint for a specific register class |
| sub_A77AD0 | createAnyRegConstraint(state, flags) | Build an unconstrained ("any register") input constraint |
| sub_A79C90 | composeConstraints(state, desc, N) | Merge N descriptors into a single composite constraint |
| sub_B5BA00 | createOutputConstraint(state, regclass_id) | Build the output/result constraint |
| sub_A78010 | emitConstraint(state, desc_array, N) | Finalize and emit the constraint with N entries |
| sub_B612D0 | emitInstrConstraint(state, opcode) | Top-level entry: table lookup + switch + emit |

The constraint descriptors are purely stack-allocated within sub_B612D0's approximately 0x160-byte frame. No heap allocation occurs during constraint emission.

Complete Identified Opcode Summary

The following table consolidates every opcode where the binary-to-PTX mapping has been confirmed or strongly inferred. This represents a partial inventory -- the total opcode space extends to at least 4940, and many opcodes in the gaps (particularly in the load/store, texture, surface, and extended intrinsic ranges) remain unidentified.

| Opcode | Identity | Family | Evidence Source |
|---|---|---|---|
| 0--~430 | Generic LLVM TargetOpcode | LLVM standard | upstream LLVM 20.0.0 |
| 440--443 | Type-preserving moves | Copy | register coalescer (sub_3494EA0) |
| 444--503 | Cross-class / wide / ABI copies | Copy | register coalescer (sub_3494EA0) |
| 294--297 | atom.add (f32/f64/i32/i64) | Atomic | DAG legalization (sub_20BED60) |
| 302--305 | atom.min (s32/s64/u32/u64) | Atomic | DAG legalization (sub_20BED60) |
| 314--317 | atom.max (s32/s64/u32/u64) | Atomic | DAG legalization (sub_20BED60) |
| 315 | CallSeqBegin | Call ABI | LowerCall (sub_3040BF0) |
| 316 | CallSeqEnd_Outer | Call ABI | LowerCall |
| 462 | atom.cas | Atomic | DAG legalization |
| 499 | ConditionalBranch | Control | intrinsic lowering |
| 505 | DeclareParam | Call ABI | LowerCall |
| 506 | DeclareScalarParam | Call ABI | LowerCall |
| 507 | DeclareRetParam | Call ABI | LowerCall |
| 508 | DeclareRetScalarParam | Call ABI | LowerCall |
| 510 | CallDirect | Call ABI | LowerCall |
| 511 | CallDirectNoProto | Call ABI | LowerCall |
| 512 | CallIndirect | Call ABI | LowerCall |
| 513 | CallIndirectNoProto | Call ABI | LowerCall |
| 514 | CallStart | Call ABI | LowerCall |
| 515 | LoadRetParam | Call ABI | LowerCall |
| 516 | LoadRetParamLast | Call ABI | LowerCall |
| 517 | CallSeqEnd | Call ABI | LowerCall |
| 518 | CallProto | Call ABI | LowerCall |
| 521 | DeclareRetParam_Ext | Call ABI | LowerCall |
| 527 | StoreCalleeRetAddr | Call ABI | LowerCall |
| 528 | StoreRetValToParam | Call ABI | LowerCall |
| 568 | LoadV1 | Vector Param | LowerCall |
| 569 | LoadV2 | Vector Param | LowerCall |
| 570 | LoadV4 | Vector Param | LowerCall |
| 571 | StoreV1 | Vector Param | LowerCall |
| 572 | StoreV2 | Vector Param | LowerCall |
| 573 | StoreV4 | Vector Param | LowerCall |
| 4905--4940 | tcgen05.mma (10 shape variants) | Tensor Core | Blackwell emission (sub_21E8CD0) |

Gaps and Unknown Ranges

The following opcode ranges are known to contain NVPTX instructions but have not been fully mapped:

| Range | Likely Contents | Evidence |
|---|---|---|
| 430--439 | Transition zone (generic-to-target boundary) | Adjacent to copy family |
| 574--~800 | Global/shared/local loads and stores | Large gap between param-store and first identified general opcode |
| 800--~1500 | Texture and surface instructions | sub_33B0210 intrinsic switch references hundreds of tex/surf intrinsics |
| 1500--~3000 | Shuffle, vote, match, redux | Warp-level intrinsic families |
| 3000--~4000 | WGMMA, TMA, bulk operations | Hopper-era instruction families |
| 4000--4904 | Additional tensor/cluster instructions | Bridging pre-Blackwell and tcgen05 |

Recovering these ranges requires systematic analysis of the sub_33B0210 intrinsic lowering switch (343KB, the single largest function in the binary) and correlation with the AsmPrinter's printInstruction dispatch table.

Function Map

| Function | Address | Size |
|---|---|---|
| Constraint emission (179-case switch on word_3F3E6C0) | sub_B612D0 | 104KB |
| Register class set builder (111 cases) | sub_B5BA00 | 21KB |
| Operand type decoder (101 cases) | sub_B6B200 | 44KB |
| createRegClassConstraint(state, regclass, flags) | sub_A778C0 | -- |
| createAnyRegConstraint(state, flags) | sub_A77AD0 | -- |
| composeConstraints(state, desc, N) | sub_A79C90 | -- |
| emitConstraint(state, desc_array, N) | sub_A78010 | -- |
| Opcode-to-copy-type mapping (switch, families 440--503) | sub_3494EA0 | 12.7KB |
| Operand-type classification (reads byte_444C4A0) | sub_34961A0 | 26.6KB |
| Register-pair decomposition (wide/paired registers) | sub_3497B40 | 16.5KB |
| NVPTXTargetLowering::LowerCall (call ABI opcodes) | sub_3040BF0 | 88KB |
| Intrinsic lowering switch (NVVM intrinsic to opcode) | sub_33B0210 | 343KB |
| NVPTXDAGToDAGISel::Select (ISel entry) | sub_3090F90 | 91KB |
| MMA instruction builder (packed descriptor) | sub_21E74C0 | 17KB |
| Atomic operation PTX emission (base) | sub_21E5E70 | -- |
| L2 cache-hinted atomic PTX emission (SM 80+) | sub_21E6420 | -- |
| Memory barrier PTX emission | sub_21E94F0 | -- |
| Cluster barrier PTX emission (SM 90+) | sub_21E8EA0 | -- |
| Special register PTX emission | sub_21E86B0 | -- |
| Cluster special register PTX emission (SM 90+) | sub_21E9060 | -- |
| Address space conversion (cvta) PTX emission | sub_21E7FE0 | -- |
| tcgen05 Blackwell MMA emission (SM 100+) | sub_21E8CD0 | -- |
| Register class to encoded ID mapping | sub_21583D0 | -- |
| Register class to PTX type suffix | sub_2163730 | -- |
| Register class to PTX register prefix | sub_21638D0 | -- |

Global Data References

| Symbol | Address | Purpose |
|---|---|---|
| word_3F3E6C0 | 0x3F3E6C0 | Constraint table (16-bit entries, indexed by opcode-1) |
| byte_444C4A0 | 0x444C4A0 | MVT/operand type table (16-byte entries, indexed by MVT enum) |
| word_4456340 | 0x4456340 | MVT to vector element count (16-bit entries) |
| word_4456580 | 0x4456580 | MVT to scalarized MVT (16-bit entries) |
| byte_3F252E0 | 0x3F252E0 | Constraint type classification table |
| qword_502A920 | 0x502A920 | SM processor table (45 entries, stride-2) |

Cross-References

  • Pattern Database -- detailed constraint descriptor layout and emission sub-functions
  • Register Coalescing -- the NVPTX-specific coalescer that processes copy family opcodes 440--503
  • Code Generation -- pipeline overview including ISel, RA, and machine-level passes
  • InstrEmitter -- how SDNodes become MachineInstrs with these opcodes
  • Register Allocation -- greedy RA that consumes constraint table data
  • AsmPrinter -- the PTX emission layer that converts these opcodes to text

CLI Flag Inventory

cicc v13.0 accepts approximately 111 unique flag keys across five parsing sites, expanding to ~142 flag+value combinations when counting value variants, and ~169 when including all architecture triplets. Flags are parsed in sub_8F9C90 (real main), sub_900130 (LibNVVM path A), sub_12CC750/sub_9624D0 (LibNVVM option processors), and sub_12C8DD0 (flag catalog builder with 65 registered configurations).

The flag system is architecturally split into two layers: a hardcoded dispatch layer in the top-level parsers (sub_8F9C90, sub_900130, sub_12CC750/sub_9624D0) that handles mode selection, pass-through, LTO, and structural flags via strcmp/prefix-match chains; and a BST-backed catalog layer (sub_12C8DD0 + sub_95EB40/sub_12C8B40) that handles all flags whose effect is purely "store a value and forward strings to output vectors."

The Four Output Vectors

Every flag ultimately routes its effects into one or more of four output std::vector<std::string> buffers. These vectors are the sole interface between the CLI parser and the downstream pipeline stages:

| Vector | Seed | Output args | Downstream stage |
|---|---|---|---|
| v324 (lnk) | "lnk" | a5/a6 | Phase 1: Linker / IR-link (sub_906xxx) |
| v327 (opt) | "opt" | a7/a8 | Phase 2: Optimizer (LLVM opt / sub_12E54A0) |
| v330 (lto) | (none) | a9/a10 | Phase 3: LTO passes |
| v333 (llc) | "llc" | a11/a12 | Phase 4: LLC codegen |

Each vector element is a 32-byte std::string with SSO. At function exit (lines ~1462-1553 of sub_9624D0), each vector is serialized: count = (end - begin) >> 5, then malloc(8 * count) for the char** array, with each string individually malloc(len+1) + memcpy + null-terminated.
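The exit-time serialization can be sketched in C. We model the `std::string` vector as a plain array of C strings; the function name is ours, but the allocation pattern (`char**` of pointers, one `malloc(len+1)` + copy per string) follows the description above:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Serialize an array of strings into a freshly malloc'd char** with
 * individually malloc'd, NUL-terminated copies. */
static char **serialize(const char **vec, size_t count) {
    char **out = malloc(count * sizeof(char *)); /* malloc(8 * count) on x86-64 */
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(vec[i]);
        out[i] = malloc(len + 1);                /* malloc(len+1) */
        memcpy(out[i], vec[i], len + 1);         /* copy incl. NUL */
    }
    return out;
}
```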

The lto vector receives no seed string and is only populated by explicit LTO flags (-Xlto, -olto, -gen-lto, -link-lto, --device-c, --force-device-c, host-ref flags) and the architecture string.

Mode Selection

The top-level entry point sub_8F9C90 sets a mode variable v263 that selects the compilation pipeline:

| Flag | Mode | Description |
|---|---|---|
| -lgenfe | 1 | EDG C++ frontend (legacy genfe path) |
| -libnvvm | 2 | LibNVVM API path |
| -lnk | 3 | Linker path (forces keep=true) |
| -opt | 4 | Optimizer-only path (forces keep=true) |
| -llc | 6 | LLC backend-only path |

Within the LibNVVM option processors (sub_12CC750/sub_9624D0), the first argument is checked as a 4-byte or 8-byte integer for phase routing. Phase routing is stored at a1+240:

| argv[0] hex | String | Phase ID | a1+240 |
|---|---|---|---|
| 0x6B6E6C2D | -lnk | 1 | 1 |
| 0x74706F2D | -opt | 2 | 2 |
| 0x636C6C2D | -llc | 3 | 3 |
| 0x63766E2D | -nvc | 3 | 3 (alias) |
| 0x6D76766E62696C2D | -libnvvm | 4 | 4 |

When phase routing is active (a1+240 != 0), sub_95C880(phase_id, argc, argv, &count, &mode_flags) returns the allocated argv array for that single phase, stored directly into the corresponding output pair. When a1+240 == 0, mode flags default to 7 (all phases), and the full multi-phase option parsing loop runs.
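The integer comparison trick is easy to verify: on a little-endian machine, loading the first four bytes of `"-lnk"` yields exactly 0x6B6E6C2D. A sketch (helper name ours; the binary's exact length handling is an assumption):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* argv[0] -> phase ID, mirroring the 4-byte / 8-byte compares above.
 * Assumes a little-endian host, like the x86-64 binary. */
static int phase_id(const char *arg) {
    if (strlen(arg) == 4) {
        uint32_t w;
        memcpy(&w, arg, 4);
        switch (w) {
            case 0x6B6E6C2D: return 1;  /* "-lnk" */
            case 0x74706F2D: return 2;  /* "-opt" */
            case 0x636C6C2D: return 3;  /* "-llc" */
            case 0x63766E2D: return 3;  /* "-nvc" (alias) */
        }
    }
    if (strcmp(arg, "-libnvvm") == 0)   /* 8-byte compare in the binary */
        return 4;
    return 0;                           /* no phase routing */
}
```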

The BST-Backed Flag Catalog

Catalog construction: sub_95EB40 / sub_12C8DD0

The function sub_95EB40(a1, cl_mode_flag) (standalone path) or sub_12C8DD0 (LibNVVM path) builds a std::map<std::string, OptionEntry> at a1+248. The underlying data structure is a C++ red-black tree (the standard library std::map implementation), with the tree root at a1+248, the sentinel/end node at a1+256, and the node count at a1+288.

Registration is performed by 65 calls to sub_95E8B0 + sub_95BF90 (standalone) or sub_12C8B40 (LibNVVM). Each call inserts one BST node.

BST node layout (168 bytes)

Each node in the red-black tree has the following layout:

| Offset | Size | Content |
|---|---|---|
| +0 | 24 | RB-tree metadata (color, parent, left, right pointers) |
| +32 | 32 | Key: flag name string (std::string with SSO) |
| +64 | 32 | lnk forwards: space-separated flags for lnk vector |
| +96 | 32 | opt forwards: space-separated flags for opt vector |
| +128 | 32 | llc forwards: space-separated flags for llc vector |
| +160 | 8 | Value pointer: points to the offset in the options structure where the flag's current value is stored |

BST lookup: sub_95D600 / sub_12C8530

When the main parsing loop encounters a flag string, it calls sub_95D600 (standalone) or sub_12C8530 (LibNVVM) to perform a standard std::map::lower_bound-style traversal of the red-black tree. The lookup compares the input flag string against registered key strings at node offset +32 using strcmp semantics. On match, the node's three forwarding strings (lnk/opt/llc) are split on spaces and appended to their respective output vectors.

Duplicate detection

Each BST node's value pointer points into the options structure. If the value storage already has a non-zero sentinel (the QWORD immediately following the 32-byte STR32 slot), the flag was already set. On duplicate:

"libnvvm : error: <flag> defined more than once"

Flags NOT in the catalog

The following flag categories are handled by hardcoded strcmp/prefix-match chains in the main parsing loop BEFORE the catalog lookup, and therefore bypass the BST entirely:

  • Mode selection flags (-lnk, -opt, -llc, -nvc, -libnvvm)
  • -Ofast-compile=<level> (parsed at lines ~690-833)
  • Pass-through flags (-Xopt, -Xllc, -Xlnk, -Xlto)
  • LTO flags (-lto, -gen-lto, -gen-lto-and-llc, -link-lto, -olto, -gen-opt-lto, --trace-lto)
  • Device compilation flags (--device-c, --force-device-c, --partial-link)
  • Host reference flags (-host-ref-{ec,eg,ek,ic,ig,ik})
  • -maxreg=<N> (has its own duplicate-check logic at a1+1200)
  • -split-compile=<N>, -split-compile-extended=<N> (at a1+1480/a1+1488)
  • -opt-passes=<pipeline> (at a1+1512/a1+1520)
  • -discard-value-names=<0|1> (complex multi-phase interaction)
  • -time-passes (must be sole flag; unsupported in LibNVVM API path)
  • -cl-mode (sets v278=1, affects routing for -prec-div, -fast-math, -prec-sqrt)
  • -jump-table-density=<N> (forwarded directly to llc)
  • -jobserver (forwarded to opt)
  • --emit-optix-ir (disables ip-msp + licm, sets a13=0x43)
  • --nvvm-64, --nvvm-32 (handled in sub_95C230)

If none of the hardcoded checks match and the BST lookup also fails, the flag falls through to the catchall entry at options structure offset +1256, which triggers:

"libnvvm : error: <flag> is an unsupported option"

Complete Flag-to-Pipeline Vector Routing Table

The table below documents every flag's routing from user input to the four output vectors. "Store" indicates the options structure offset where the value is recorded. Flags marked with [BST] are registered in the catalog; flags marked with [HC] are hardcoded in the parsing loop.

Architecture Flags [BST]

All 23 architecture entries share options structure offset +552 and follow the same 3-column pattern:

| User flag | lnk vector | opt vector | llc vector |
|---|---|---|---|
| -arch=compute_75 | -R __CUDA_ARCH=750 | -opt-arch=sm_75 | -mcpu=sm_75 |
| -arch=compute_80 | -R __CUDA_ARCH=800 | -opt-arch=sm_80 | -mcpu=sm_80 |
| -arch=compute_86 | -R __CUDA_ARCH=860 | -opt-arch=sm_86 | -mcpu=sm_86 |
| -arch=compute_87 | -R __CUDA_ARCH=870 | -opt-arch=sm_87 | -mcpu=sm_87 |
| -arch=compute_88 | -R __CUDA_ARCH=880 | -opt-arch=sm_88 | -mcpu=sm_88 |
| -arch=compute_89 | -R __CUDA_ARCH=890 | -opt-arch=sm_89 | -mcpu=sm_89 |
| -arch=compute_90 | -R __CUDA_ARCH=900 | -opt-arch=sm_90 | -mcpu=sm_90 |
| -arch=compute_90a | -R __CUDA_ARCH=900 | -opt-arch=sm_90a | -mcpu=sm_90a |
| -arch=compute_100 | -R __CUDA_ARCH=1000 | -opt-arch=sm_100 | -mcpu=sm_100 |
| -arch=compute_100a | -R __CUDA_ARCH=1000 | -opt-arch=sm_100a | -mcpu=sm_100a |
| -arch=compute_100f | -R __CUDA_ARCH=1000 | -opt-arch=sm_100f | -mcpu=sm_100f |
| -arch=compute_103 | -R __CUDA_ARCH=1030 | -opt-arch=sm_103 | -mcpu=sm_103 |
| -arch=compute_103a | -R __CUDA_ARCH=1030 | -opt-arch=sm_103a | -mcpu=sm_103a |
| -arch=compute_103f | -R __CUDA_ARCH=1030 | -opt-arch=sm_103f | -mcpu=sm_103f |
| -arch=compute_110 | -R __CUDA_ARCH=1100 | -opt-arch=sm_110 | -mcpu=sm_110 |
| -arch=compute_110a | -R __CUDA_ARCH=1100 | -opt-arch=sm_110a | -mcpu=sm_110a |
| -arch=compute_110f | -R __CUDA_ARCH=1100 | -opt-arch=sm_110f | -mcpu=sm_110f |
| -arch=compute_120 | -R __CUDA_ARCH=1200 | -opt-arch=sm_120 | -mcpu=sm_120 |
| -arch=compute_120a | -R __CUDA_ARCH=1200 | -opt-arch=sm_120a | -mcpu=sm_120a |
| -arch=compute_120f | -R __CUDA_ARCH=1200 | -opt-arch=sm_120f | -mcpu=sm_120f |
| -arch=compute_121 | -R __CUDA_ARCH=1210 | -opt-arch=sm_121 | -mcpu=sm_121 |
| -arch=compute_121a | -R __CUDA_ARCH=1210 | -opt-arch=sm_121a | -mcpu=sm_121a |
| -arch=compute_121f | -R __CUDA_ARCH=1210 | -opt-arch=sm_121f | -mcpu=sm_121f |

Note: the a and f sub-variants share the base SM number for __CUDA_ARCH (e.g., sm_100a and sm_100f both emit __CUDA_ARCH=1000) but get distinct -opt-arch= and -mcpu= strings. The architecture string is also stored into the lto vector via sub_95D700, preserving the full -arch=compute_XX string.
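The fan-out pattern can be sketched as a single formatting step per vector. This is an illustrative reconstruction, not the binary's code: the function name is ours, and using `atoi` to strip the `a`/`f` suffix is our shortcut for the suffix-stripping described under "Architecture specification forms":

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One -arch=compute_<variant> flag -> one string per output vector.
 * Suffix variants share the numeric __CUDA_ARCH define (base * 10). */
static void arch_fanout(const char *variant,        /* e.g. "100a" */
                        char *lnk, char *opt, char *llc) {
    int base = atoi(variant);                       /* stops at 'a'/'f' */
    sprintf(lnk, "-R __CUDA_ARCH=%d0", base);       /* 100 -> 1000 */
    sprintf(opt, "-opt-arch=sm_%s", variant);
    sprintf(llc, "-mcpu=sm_%s", variant);
}
```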

Architecture validation bitmask

Architecture is validated at a1+8 using bitmask 0x60081200F821:

```
mask = 0x60081200F821
offset = SM_number - 75
if (offset > 0x2E || !_bittest64(&mask, offset))
    -> ERROR: "is an unsupported option"
```

Valid bit positions:

| Bit | SM | Generation |
|---|---|---|
| 0 | 75 | Turing |
| 5 | 80 | Ampere |
| 11 | 86 | Ampere |
| 12 | 87 | Jetson Orin |
| 13 | 88 | Ada |
| 14 | 89 | Ada Lovelace |
| 15 | 90 | Hopper |
| 25 | 100 | Blackwell |
| 28 | 103 | Blackwell+ |
| 35 | 110 | Post-Blackwell |
| 45 | 120 | Next-gen |
| 46 | 121 | Next-gen |

Maximum offset: 0x2E = 46 (SM 121). All pre-Turing architectures (SM 70 and below) are rejected.
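The gate is small enough to reproduce exactly; the function name is ours, but the mask and offset arithmetic are the recovered values:

```c
#include <assert.h>
#include <stdint.h>

/* SM-number validation against the 0x60081200F821 bitmask:
 * bit (SM - 75) must be set, and the offset must not exceed 0x2E. */
static int arch_is_valid(int sm) {
    const uint64_t mask = 0x60081200F821ULL;
    int offset = sm - 75;
    if (offset < 0 || offset > 0x2E)
        return 0;
    return (int)((mask >> offset) & 1);
}
```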

Architecture specification forms

Architecture can be specified in many forms, all converging to a numeric SM value. Trailing a or f suffixes are stripped before numeric parsing. On parse failure: "Unparseable architecture: <val>".

| Form | Example | Source |
|---|---|---|
| -arch <val> | -arch sm_90 | sub_8F9C90 |
| -arch<val> | -archsm_90 | sub_8F9C90 (compact) |
| --nv_arch <val> | --nv_arch sm_100a | sub_8F9C90 |
| -mcpu=sm_<N> | -mcpu=sm_90 | LLVM-style |
| -opt-arch=sm_<N> | -opt-arch=sm_90 | Optimizer |
| -arch=compute_<N> | -arch=compute_100 | Compute capability |
| __CUDA_ARCH=<N> | __CUDA_ARCH=900 | Raw define |

Hex-encoded flag checks in sub_8F9C90:

  • 0x6D733D7570636D2D = -mcpu=sm
  • 0x6372612D74706F2D = -opt-arc
  • 0x6F633D686372612D = -arch=co
  • 0x6372615F766E2D2D = --nv_arc

Optimization Level Flags

| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -opt=0 | [BST] | +392 | -- | -- | -- | |
| -opt=1 | [BST] | +392 | -- | -- | -- | |
| -opt=2 | [BST] | +392 | -- | -- | -- | |
| -opt=3 | [BST] | +392 | -- | -- | -- | default |
| -Osize | [BST] | +488 | -- | -Osize | -Osize | off |
| -Om | [BST] | +520 | -- | -Om | -Om | off |
| -disable-allopts | [BST] | +424 | -lnk-disable-allopts | -opt-disable-allopts | -llc-disable-allopts | off |
| -disable-llc-opts | [BST] | +840 | -- | -- | -- | off |

The -opt=<N> flags do not directly emit to any vector at registration time. Instead, at the routing stage (lines 1444-1563 of sub_9624D0), the optimization level drives one of three code paths:

  1. Custom pipeline set (a1+1520 != 0): emits -passes=<pipeline_string> to opt vector
  2. Normal mode (a1+1520 == 0, a1+1640 == 0): emits -O<level> to opt vector
  3. Fast-compile mode (a1+1640 != 0): emits -optO<level> + -llcO2 to llc vector
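The three-way routing above can be sketched as a single dispatch. This is illustrative only: the function name and state parameters are ours, standing in for `a1+1520` (custom pipeline string) and `a1+1640` (fast-compile level):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Route the optimization level into the opt or llc output string,
 * following the three recovered code paths. */
static void route_opt_level(int opt_level, const char *pipeline,
                            int fast_compile, char *opt_out, char *llc_out) {
    opt_out[0] = llc_out[0] = '\0';
    if (pipeline != NULL)                          /* a1+1520 != 0 */
        sprintf(opt_out, "-passes=%s", pipeline);
    else if (fast_compile == 0)                    /* normal mode */
        sprintf(opt_out, "-O%d", opt_level);
    else                                           /* fast-compile mode */
        sprintf(llc_out, "-optO%d -llcO2", opt_level);
}
```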

Floating Point Control Flags

| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -ftz=0 | [BST] | +584 | -- | -- | -- | default |
| -ftz=1 | [BST] | +584 | -R __CUDA_FTZ=1 | -nvptx-f32ftz | -nvptx-f32ftz | |
| -prec-sqrt=0 | [BST] | +616 | -- | -- | -nvptx-prec-sqrtf32=0 | CL default |
| -prec-sqrt=1 | [BST] | +616 | -R __CUDA_PREC_SQRT=1 | -- | -nvptx-prec-sqrtf32=1 | CUDA default |
| -prec-div=0 (CL) | [BST] | +648 | -- | -opt-use-prec-div=false | -nvptx-prec-divf32=0 | |
| -prec-div=0 (CUDA) | [BST] | +648 | -- | -opt-use-prec-div=false | -nvptx-prec-divf32=1 | |
| -prec-div=1 (CL) | [BST] | +648 | -- | -opt-use-prec-div=true | -nvptx-prec-divf32=1 | |
| -prec-div=1 (CUDA) | [BST] | +648 | -R __CUDA_PREC_DIV=1 | -opt-use-prec-div=true | -nvptx-prec-divf32=2 | default |
| -prec-div=2 | [BST] | +648 | -- | -- | -nvptx-prec-divf32=3 | |
| -fma=0 | [BST] | +680 | -- | -- | -nvptx-fma-level=0 | |
| -fma=1 | [BST] | +680 | -- | -- | -nvptx-fma-level=1 | default |
| -enable-mad | [BST] | +712 | -- | -- | -nvptx-fma-level=1 | off |
| -opt-fdiv=0 | [BST] | +456 | -- | -opt-fdiv=0 | -- | default |
| -opt-fdiv=1 | [BST] | +456 | -- | -opt-fdiv=1 | -- | |
| -no-signed-zeros | [BST] | +1160 | -- | -opt-no-signed-zeros | -- | off |

Note on -prec-div: the CUDA vs CL distinction is controlled by the magic cookie a4 (0xABBA = CUDA, 0xDEED = OpenCL). CUDA -prec-div=1 maps to -nvptx-prec-divf32=2 (IEEE-correct division), while CL maps to level 1 (software approximation). When -prec-div=0 is set under CUDA, it still maps to -nvptx-prec-divf32=1 (not 0), because CUDA never drops below software approximation.

Fast Math Aggregate Flags

| User flag | Type | Store | lnk | opt | llc |
|---|---|---|---|---|---|
| -unsafe-math | [BST] | +744 | -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 | -opt-use-fast-math -nvptx-f32ftz | -nvptx-fma-level=1 -nvptx-f32ftz |
| -fast-math (CL) | [BST] | +776 | -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 | -opt-use-fast-math -nvptx-f32ftz | -nvptx-f32ftz |
| -fast-math (CUDA) | [BST] | +776 | -R __CUDA_USE_FAST_MATH=1 | -opt-use-fast-math | -- |

-unsafe-math always sets FTZ in the backend (-nvptx-f32ftz), while CUDA -fast-math does not touch the backend FTZ flag -- it only sets the preprocessor define and the optimizer flag.

Debug and Diagnostic Flags

| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -g | [BST] | +296 | -debug-compile | -debug-compile | -- | off |
| -generate-line-info | [BST] | +328 | -- | -generate-line-info | -- | off |
| -no-lineinfo-inlined-at | [BST] | +360 | -- | -- | -line-info-inlined-at=0 | off |
| -show-src | [BST] | +808 | -- | -- | -nvptx-emit-src | off |
| -enable-verbose-asm | [BST] | +1224 | -- | -- | -asm-verbose | off |
| -w | [BST] | +872 | -- | -w | -w | off |
| -Werror | [BST] | +904 | -- | -Werror | -Werror | off |
| -debug-compile | [BST] | +296 | -- | -debug-compile | -- | off |
| -line-info-inlined-at=0 | alias | -- | -- | -- | -line-info-inlined-at=0 | off |
| -inline-info | [HC] | -- | -- | -pass-remarks=inline -pass-remarks-missed=inline -pass-remarks-analysis=inline | -- | off |

Inlining and Function Flags

| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -disable-inlining | [BST] | +1064 | -- | -disable-inlining | -- | off |
| -aggressive-inline | [BST] | +1608 | -- | -inline-budget=40000 | -- | off |
| -restrict | [BST] | +1096 | -- | -- | -nvptx-kernel-params-restrict | off |
| -allow-restrict-in-struct | [BST] | +1128 | -- | -allow-restrict-in-struct | -allow-restrict-in-struct | off |
| -enable-opt-byval | [BST] | +1032 | -- | -enable-opt-byval | -- | off |

Optimization Control Flags

| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -opt-disable-allopts | derived | -- | -- | -opt-disable-allopts | -- | off |
| -lnk-disable-allopts | derived | -- | -lnk-disable-allopts | -- | -- | off |
| -llc-disable-allopts | derived | -- | -- | -- | -llc-disable-allopts | off |

These three are emitted by -disable-allopts (see above); they do not exist as independent user flags.

Rematerialization Flags

| User flag | Type | Store | lnk | opt | llc |
|---|---|---|---|---|---|
| -vasp-fix | [BST] | +1352 | -- | -- | -vasp-fix1=true -vasp-fix2=true |
| -new-nvvm-remat | [BST] | +1384 | -- | -- | -enable-new-nvvm-remat=true -nv-disable-remat=true -rp-aware-mcse=true |
| -disable-new-nvvm-remat | [BST] | +1416 | -- | -- | -enable-new-nvvm-remat=false -nv-disable-remat=false -rp-aware-mcse=false |
| -disable-nvvm-remat | [BST] | +1448 | -- | -- | -enable-new-nvvm-remat=false -nv-disable-remat=true -rp-aware-mcse=false |

These are multi-flag compound emissions. Note the subtle difference: -disable-nvvm-remat disables both rematerializers (-enable-new-nvvm-remat=false turns off the new remat, -nv-disable-remat=true turns off classic remat), while -disable-new-nvvm-remat disables only the new remat and register-pressure-aware MCSE, leaving classic remat enabled (-nv-disable-remat=false).

Analysis and Transform Control Flags

| User flag | Type | Store | lnk | opt | llc |
|---|---|---|---|---|---|
| -no-aggressive-positive-stride-analysis | [BST] | +1544 | -- | -aggressive-positive-stride-analysis=false | -- |
| disable-load-select-transform | [BST] | +1576 | -- | -disable-load-select-transform=true | -- |

Note: disable-load-select-transform is registered WITHOUT a leading - in the catalog.

Pass-Through (Forwarding) Flags [HC]

| Flag | Target vector | Special handling |
|---|---|---|
| -Xopt <arg> | opt | If <arg> starts with -opt-discard-value-names=, extracts value; if "1", sets v276=false |
| -Xllc <arg> | llc | None |
| -Xlnk <arg> | lnk | If <arg> starts with -lnk-discard-value-names=, extracts value; if "1", sets v275=false |
| -Xlto <arg> | lto | If <arg> starts with -lto-discard-value-names=, extracts value; if "1", sets v282=false |

Each consumes the next argument from argv.

LTO Flags [HC]

| User flag | a13 bitmask effect | lto vector | Notes |
|---|---|---|---|
| -lto | (a13 & 0x300) \| 0x23 | -- | Full LTO mode |
| -gen-lto | (a13 & 0x300) \| 0x21 | -gen-lto | Emit LTO bitcode |
| -gen-lto-and-llc | a13 \|= 0x20 | -gen-lto | Emit LTO + run LLC |
| -link-lto | (a13 & 0x300) \| 0x26 | -link-lto | Link LTO modules |
| -olto | -- | -olto + argv[i+1] | Takes next arg as LTO opt level |
| -gen-opt-lto | sets v280=1 | -- | Affects lowering at end of parsing |
| --trace-lto | -- | --trace | LTO tracing |

Device Compilation Flags [HC]

| User flag | lto vector |
|---|---|
| --device-c | --device-c |
| --force-device-c | --force-device-c |
| --partial-link | (no-op, consumed but not forwarded) |

Host Reference Flags [HC]

| User flag | lto vector |
|---|---|
| -host-ref-ek=<val> | -host-ref-ek=<val> |
| -host-ref-ik=<val> | -host-ref-ik=<val> |
| -host-ref-ec=<val> | -host-ref-ec=<val> |
| -host-ref-ic=<val> | -host-ref-ic=<val> |
| -host-ref-eg=<val> | -host-ref-eg=<val> |
| -host-ref-ig=<val> | -host-ref-ig=<val> |
| -has-global-host-info | -has-global-host-info |

Pipeline Control Flags [HC]

| User flag | Store | Routing | Default |
|---|---|---|---|
| -opt-passes=<pipeline> | +1512 | opt: -passes=<pipeline> (overrides -O<N>) | unset |
| -passes=<pipeline> | -- | opt: -passes=<pipeline> (sub_9624D0 only) | unset |
| -lsa-opt=0 | -- | opt: -lsa-opt=0 | generated by -Ofast-compile=max or CL-mode |
| -memory-space-opt=0 | -- | opt: -memory-space-opt=0 | generated by -Ofast-compile=max |
| -memory-space-opt=1 | -- | opt: -memory-space-opt=1 | generated when opt level allows |
| -rox-opt=0 | -- | opt: -rox-opt=0 | generated when -prec-div=0 or -prec-sqrt=0 (non-CL) |
| -do-ip-msp=<0\|1> | -- | opt: -do-ip-msp=<val> | |
| -do-licm=<0\|1> | -- | opt: -do-licm=<val> | |
| -optimize-unused-variables | -- | lto: -optimize-unused-variables | off |

Ofast-compile Levels [HC]

Stored at a1+1640. Only ONE -Ofast-compile= is allowed; a second triggers "libnvvm : error: -Ofast-compile specified more than once".

| Level string | a1+1640 | Description | Side effects |
|---|---|---|---|
| "0" | 1 (then reset to 0) | Disabled | opt: fast-compile=off string |
| "min" | 4 | Minimal speedup | opt: -fast-compile=min |
| "mid" | 3 | Medium speedup | opt: -fast-compile=mid + second flag |
| "max" | 2 | Maximum speedup | opt: -fast-compile=max; forces -lsa-opt=0, -memory-space-opt=0 |

When -Ofast-compile is active (level >= 1), the -passes=/-O routing is bypassed. Instead: -optO<level> and -llcO2 are emitted to the llc vector (lines 1453-1460).

Miscellaneous Flags [HC]

| User flag | Store | Routing | Notes |
|---|---|---|---|
| -maxreg=<N> | +1192 | opt: -maxreg=<N>, llc: -maxreg=<N> | Error on duplicate |
| -split-compile=<N> | +1480 | opt: -split-compile=<N> | Error on duplicate |
| -split-compile-extended=<N> | +1480 | opt: -split-compile-extended=<N>, sets a1+1644=1 | Same storage as -split-compile |
| -jump-table-density=<N> | -- | llc: -jump-table-density=<N> | |
| -jobserver | -- | opt: -jobserver | |
| -cl-mode | -- | No forwarding; sets v278=1 | Affects -prec-div, -prec-sqrt, -fast-math routing |
| -time-passes | -- | Unsupported in LibNVVM API (error if a14 != NULL) | Must be sole flag |
| --emit-optix-ir | -- | opt: -do-ip-msp=0, opt: -do-licm=0; a13 = (a13 & 0x300) \| 0x43 | |
| --nvvm-64 | -- | a13 \|= 0x100 | 64-bit NVVM mode |
| --nvvm-32 | -- | a13 \|= 0x200 | 32-bit NVVM mode |

Discard-Value-Names [HC]

This flag has the most complex interaction logic in the parser. Seven boolean tracking variables control its behavior:

| Variable | Meaning |
|---|---|
| v275 | lnk-discard-value-names override (from -Xlnk) |
| v276 | opt-discard-value-names override (from -Xopt) |
| v277 | global discard-value-names flag was used |
| v278 | CL-mode detected |
| v279 | -Xlnk was used for discard-value-names |
| v281 | -Xlto was used for discard-value-names |
| v282 | lto-discard-value-names override (from -Xlto) |
| v283 | -Xopt was used for discard-value-names |

When a4 == 0xABBA (CUDA) and no explicit -discard-value-names:

  • Default: discard (a1+232 = 1)
  • Emits: -lnk-discard-value-names=1 to lnk, -opt-discard-value-names=1 to opt, -lto-discard-value-names=1 to lto
  • UNLESS overridden by per-phase -X flags

When a4 == 0xDEED (OpenCL): only applies if (a13 & 0x20) is set.

Error on conflicting definitions: "libnvvm : error: -discard-value-names defined more than once, or defined for both libnvvm and sub-phase".

I/O and General Flags

| Flag | Effect |
|---|---|
| -o <file> | Output file (fatal if missing) |
| -v | Verbose mode |
| -dryrun | Do not execute compilation |
| -keep | Keep intermediate files |
| -irversion | Print IR version and exit |
| -nvvmir-library <f> | NVVM IR library file (also = form) |
| -m64 | 64-bit mode flag (sets *a8 = 1) |
Recognized input extensions: .bc, .ci, .i, .cup, .optixir, .ii. The .cup extension triggers --orig_src_path_name / --orig_src_file_name handling.

Options Structure Layout

The options structure passed as a1 to sub_9624D0/sub_12CC750 is ~1,644 bytes. Key offsets:

| Offset | Size | Content | Default |
|---|---|---|---|
| +8 | DWORD | SM architecture number | 75 |
| +232 | BYTE | discard-value-names master (0=keep, 1=discard) | 0 |
| +240 | DWORD | Phase routing mode (0=full, 1-4=single) | 0 |
| +248 | PTR | BST root (std::map red-black tree) | |
| +256 | PTR | BST sentinel/end node | |
| +288 | QWORD | BST node count | |
| +296 | STR32 | -g / -debug-compile value | |
| +328 | STR32 | -generate-line-info value | |
| +360 | STR32 | -no-lineinfo-inlined-at value | |
| +392 | STR32 | Optimization level (0/1/2/3) | "3" |
| +400 | QWORD | opt-level already-set sentinel | |
| +424 | STR32 | -disable-allopts value | |
| +456 | STR32 | -opt-fdiv value | "0" |
| +464 | QWORD | opt-fdiv already-set sentinel | |
| +488 | STR32 | -Osize value | |
| +520 | STR32 | -Om value | |
| +552 | STR32 | Architecture defines | compute_75 |
| +560 | QWORD | arch already-set sentinel | |
| +584 | STR32 | -ftz value | "0" |
| +592 | QWORD | ftz already-set sentinel | |
| +616 | STR32 | -prec-sqrt value | "1" (CUDA) / "0" (CL) |
| +624 | QWORD | prec-sqrt already-set sentinel | |
| +648 | STR32 | -prec-div value | "1" |
| +656 | QWORD | prec-div already-set sentinel | |
| +680 | STR32 | -fma value | "1" |
| +688 | QWORD | fma already-set sentinel | |
| +712 | STR32 | -enable-mad value | |
| +744 | STR32 | -unsafe-math value | |
| +776 | STR32 | -fast-math value | |
| +808 | STR32 | -show-src value | |
| +840 | STR32 | -disable-llc-opts value | |
| +872 | STR32 | -w value | |
| +904 | STR32 | -Werror value | |
| +1032 | STR32 | -enable-opt-byval value | |
| +1064 | STR32 | -disable-inlining value | |
| +1096 | STR32 | -restrict value | |
| +1128 | STR32 | -allow-restrict-in-struct value | |
| +1160 | STR32 | -no-signed-zeros value | |
| +1192 | STR32 | -maxreg value string | |
| +1200 | QWORD | maxreg already-set sentinel | |
| +1224 | STR32 | -enable-verbose-asm value | |
| +1256 | STR32 | Catchall (unrecognized flag) | |
| +1352 | STR32 | -vasp-fix value | |
| +1384 | STR32 | -new-nvvm-remat value | |
| +1416 | STR32 | -disable-new-nvvm-remat value | |
| +1448 | STR32 | -disable-nvvm-remat value | |
| +1480 | STR32 | -split-compile value | |
| +1488 | QWORD | split-compile already-set sentinel | |
| +1512 | STR32 | -opt-passes pipeline string | |
| +1520 | QWORD | opt-passes already-set sentinel | |
| +1544 | STR32 | -no-aggressive-positive-stride-analysis | |
| +1576 | STR32 | disable-load-select-transform | |
| +1608 | STR32 | -aggressive-inline value | |
| +1640 | DWORD | Ofast-compile level (0-4) | 0 |
| +1644 | BYTE | split-compile-extended flag | 0 |

Each STR32 is a 32-byte std::string with SSO (small string optimization). The QWORD "already-set sentinel" fields serve as duplicate-detection guards.

Compilation Mode Bitmask (a13)

The a13 parameter is an in/out bitmask that controls which pipeline phases execute and what LTO mode is active:

| Bit/Mask | Meaning |
|---|---|
| 0x07 | Phase control (default = 7 = all phases) |
| 0x10 | Debug compile or line-info enabled |
| 0x20 | LTO generation enabled |
| 0x21 | gen-lto mode |
| 0x23 | Full LTO mode |
| 0x26 | link-lto mode |
| 0x43 | emit-optix-ir mode |
| 0x80 | gen-opt-lto lowering flag |
| 0x100 | --nvvm-64 (64-bit mode) |
| 0x200 | --nvvm-32 (32-bit mode) |
| 0x300 | Mask for 64/32-bit mode bits |

Separately, the a4 parameter is a magic cookie that selects the source-language personality:

| Value | Meaning | Effects |
|---|---|---|
| 0xABBA (43962) | CUDA compilation | -prec-div routing uses CUDA levels; -fast-math uses CUDA defines; discard-value-names defaults to on |
| 0xDEED (57069) | OpenCL compilation | -prec-sqrt defaults to 0; -fast-math/-prec-div use CL routing; -cl-mode scanning active |

Default Values When Flags Are Absent

When a registered flag is not found in the user's arguments, sub_9624D0 checks whether the stored-value sentinel is zero and applies defaults:

| Flag | Sentinel | Default applied |
|------|----------|-----------------|
| -opt= | a1+400 == 0 | -opt=3 (optimization level 3) |
| -arch=compute_ | a1+560 == 0 | -arch=compute_75 (SM 75 Turing) |
| -ftz= | a1+592 == 0 | -ftz=0 (no flush-to-zero) |
| -prec-sqrt= | a1+624 == 0 | -prec-sqrt=1 (CUDA) or -prec-sqrt=0 (CL) |
| -prec-div= | a1+656 == 0 | -prec-div=1 (precise division) |
| -fma= | a1+688 == 0 | -fma=1 (FMA enabled) |
| -opt-fdiv= | a1+464 == 0 | -opt-fdiv=0 |

Differences Between sub_12CC750 and sub_9624D0

The two option processors are near-identical. Key differences:

| Aspect | sub_12CC750 | sub_9624D0 |
|--------|-------------|------------|
| Binary size | 87 KB decompiled | 75 KB decompiled |
| -memory-space-opt default | 0 | 1 |
| -passes= flag | absent | present |
| -disable-struct-lowering | present | absent |
| -prec-sqrt CL default | 0 | 1 |
| Pipeline | LibNVVM entry path | Standalone/generic path |
| Companion builder | sub_12C8DD0 | sub_95EB40 |
| BST lookup | sub_12C8530 | sub_95D600 |

Error Handling

All error strings follow the pattern "libnvvm : error: <message>":

| Error | Trigger |
|-------|---------|
| <flag> is an unsupported option | Flag not matched by hardcoded checks or BST lookup |
| <flag> defined more than once | Duplicate -maxreg, or duplicate BST-registered flag |
| -arch=compute_<N> is an unsupported option | Architecture fails bitmask validation |
| -Ofast-compile specified more than once | Second -Ofast-compile= encountered |
| -Ofast-compile called with unsupported level, only supports 0, min, mid, or max | Invalid level string |
| split compilation defined more than once | Duplicate -split-compile or -split-compile-extended |
| -discard-value-names defined more than once, or defined for both libnvvm and sub-phase | Conflicting discard-value-names |
| <value> is an unsupported value for option: <flag> | From sub_95C230 extended parser |

Function Address Map

| Address | Function | Role |
|---------|----------|------|
| 0x8F9C90 | sub_8F9C90 | Real main entry point (argc/argv from OS) |
| 0x900130 | sub_900130 | LibNVVM Path A CLI parser |
| 0x9624D0 | sub_9624D0 | LibNVVM option processor (standalone variant) |
| 0x9685E0 | sub_9685E0 | Pipeline orchestrator (wraps sub_9624D0) |
| 0x967070 | sub_967070 | Post-option-parse pipeline setup |
| 0x95EB40 | sub_95EB40 | BST option map builder (standalone) |
| 0x95E8B0 | sub_95E8B0 | Flag template registration (standalone) |
| 0x95D600 | sub_95D600 | BST option map lookup (standalone) |
| 0x95CB50 | sub_95CB50 | Prefix-match string comparison |
| 0x95CA80 | sub_95CA80 | Value extraction after = |
| 0x95C880 | sub_95C880 | Single-phase delegator |
| 0x95C230 | sub_95C230 | Extended flag parser (--nvvm-64/--nvvm-32) |
| 0x95BF90 | sub_95BF90 | BST node insertion helper |
| 0x95BC80 | sub_95BC80 | String storage into options struct |
| 0x12CC750 | sub_12CC750 | LibNVVM option processor (LibNVVM variant) |
| 0x12C8DD0 | sub_12C8DD0 | BST option map builder (LibNVVM, 65 entries) |
| 0x12C8B40 | sub_12C8B40 | Individual flag registration (LibNVVM) |
| 0x12C8530 | sub_12C8530 | BST option map lookup (LibNVVM) |
| 0x12C7B30 | sub_12C7B30 | Pass name registration into pipeline ordering |
| 0x12C6E90 | sub_12C6E90 | Sub-argument splitter for mode flags |
| 0x12C6910 | sub_12C6910 | Flag filter (-debug-compile, -g, -generate-line-info) |
| 0x8FD0D0 | sub_8FD0D0 | Key-value parser (used by sub_900130) |
| 0x8FD6D0 | sub_8FD6D0 | String concatenation builder |

Optimization Levels

cicc v13.0 supports four standard optimization levels (O0 through O3) and three fast-compile tiers (Ofcmin, Ofcmid, Ofcmax). These are mutually exclusive with the custom --passes= interface. The pipeline name is selected in the new-PM driver sub_226C400 and assembled by sub_12E54A0. The full optimization pipeline builder is sub_12DE330, with tier-specific insertion handled by sub_12DE8F0.

Pipeline Name Selection

The new-PM driver at sub_226C400 selects a pipeline name string based on boolean flags in the config struct:

| Config Offset | Flag | Pipeline Name |
|---------------|------|---------------|
| byte[888] | O0 | nvopt<O0> |
| byte[928] | O1 | nvopt<O1> |
| byte[968] | O2 | nvopt<O2> |
| byte[1008] | O3 | nvopt<O3> |
| qw[131..132] | fc="max" | nvopt<Ofcmax> |
| qw[131..132] | fc="mid" | nvopt<Ofcmid> |
| qw[131..132] | fc="min" | nvopt<Ofcmin> |

Selection logic in sub_226C400 (lines 828--874):

if (O1_flag)       -> "nvopt<O1>"
else if (O2_flag)  -> "nvopt<O2>"
else if (O3_flag)  -> "nvopt<O3>"
else if (fc_len == 3) {
  if (fc == "max") -> "nvopt<Ofcmax>"
  if (fc == "mid") -> "nvopt<Ofcmid>"
  if (fc == "min") -> "nvopt<Ofcmin>"
}
else               -> "nvopt<O0>"

Combining -O# with --passes= is an error:

"Cannot specify -O#/-Ofast-compile=<min,mid,max> and --passes=/--foo-pass, use -passes='default<O#>,other-pass'"

The pipeline name is passed to sub_2277440 (new-PM text parser), which constructs the actual PassManager. The nvopt prefix is registered as a pipeline element in sub_225D540 (new PM) and sub_12C35D0 (legacy PM), with vtables at 0x4A08350 / 0x49E6A58.

Fast-Compile Level Encoding

The fast-compile level is stored as an integer at offset 1640 (or 1648 in the clone) of the compilation context:

| Value | CLI Source | Behavior |
|-------|-----------|----------|
| 0 | (no flag, or -Ofast-compile=0) | Normal O-level pipeline |
| 1 | -Ofast-compile=0 | Forwarded then reset to 0 |
| 2 | -Ofast-compile=max / -Ofc=max | Minimal pipeline, fastest compile |
| 3 | -Ofast-compile=mid / -Ofc=mid | Medium pipeline |
| 4 | -Ofast-compile=min / -Ofc=min | Close to full optimization |

Any other value produces: "libnvvm : error: -Ofast-compile called with unsupported level".

When level=1, the flag is forwarded to the optimizer phase as a pass argument and then the level is reset to 0 at offset 1640 (so it becomes normal O-level optimization). When level=2 (max), the optimizer arg string -Ofast-compile=max is appended. When level=3 (mid), -Ofast-compile=mid is appended. When level=4 (min), -Ofast-compile=min is appended.

Tier Summary

| Pipeline | Approx Passes | LSA-Opt | MemSpaceOpt | Compile Speed |
|----------|---------------|---------|-------------|---------------|
| nvopt<O0> | 5--8 | off | off | Fastest (no opt) |
| nvopt<Ofcmax> | 12--15 | forced 0 | forced 0 | Fast |
| nvopt<Ofcmid> | 25--30 | normal | enabled | Medium |
| nvopt<Ofcmin> | 30--35 | normal | enabled | Slower |
| nvopt<O1> | ~40 + tier-1 | normal | enabled | Normal |
| nvopt<O2> | ~40 + tier-1/2 | normal | enabled | Normal |
| nvopt<O3> | ~40 + tier-1/2/3 | normal | enabled | Slowest |

Pipeline Architecture: Tier 0 + Tiers 1/2/3

O1/O2/O3 share a common pipeline construction path. The key insight is that optimization happens in layers:

  1. Tier 0 (sub_12DE330): The full base pipeline of ~40 passes. Fires for ALL of O1, O2, and O3 when opts[4224] (optimization-enabled) is set.
  2. Tier 1 (sub_12DE8F0(PM, 1, opts)): Additional passes gated by opts[3528]. Fires for O1, O2, and O3.
  3. Tier 2 (sub_12DE8F0(PM, 2, opts)): Additional passes gated by opts[3568]. Fires for O2 and O3 only.
  4. Tier 3 (sub_12DE8F0(PM, 3, opts)): Additional passes gated by opts[3608]. Fires for O3 only.

The tier-control fields live in the 4,512-byte NVVMPassOptions struct:

| Offset | Type | Meaning |
|--------|------|---------|
| 3528 | bool | Tier 1 enable (O1+) |
| 3532 | int | Tier 1 phase threshold |
| 3568 | bool | Tier 2 enable (O2+) |
| 3572 | int | Tier 2 phase threshold |
| 3608 | bool | Tier 3 enable (O3+) |
| 3612 | int | Tier 3 phase threshold |
| 4224 | bool | Tier 0 enable (any O-level) |
| 4228 | int | Tier 0 phase threshold |

The assembler loop in sub_12E54A0 (lines 481--553) iterates over the plugin/external pass list at opts[4488]. Each entry has a phase_id; when the phase_id exceeds a tier's threshold, that tier fires:

for each entry in opts[4488..4496]:
  phase_id = entry[8..12]
  if (opts[4224] && phase_id > opts[4228]):
    sub_12DE330(PM, opts)   // Tier 0
    opts[4224] = 0          // one-shot
  if (opts[3528] && phase_id > opts[3532]):
    sub_12DE8F0(PM, 1, opts) // Tier 1
    opts[3528] = 0
  if (opts[3568] && phase_id > opts[3572]):
    sub_12DE8F0(PM, 2, opts) // Tier 2
    opts[3568] = 0
  if (opts[3608] && phase_id > opts[3612]):
    sub_12DE8F0(PM, 3, opts) // Tier 3
    opts[3608] = 0
  AddPass(PM, entry->createPass())

After the loop, any remaining unfired tiers fire unconditionally.

Tier 0: Full Base Pipeline (sub_12DE330)

sub_12DE330 at 0x12DE330 is called for all O1/O2/O3 compilations. It constructs the ~40-pass base pipeline:

| # | Factory | Pass | Guard | Notes |
|---|---------|------|-------|-------|
| 1 | sub_1654860(1) | VerifierPass | always | |
| 2 | sub_1A62BF0(1,0,0,1,0,0,1) | CGSCC/Inliner | always | Pipeline EP 1, 1 iteration |
| 3 | sub_1B26330() | NVVMReflect | always | |
| 4 | sub_185D600() | SROA | always | |
| 5 | sub_1C6E800() | NVVMLowerArgs | always | |
| 6 | sub_1C6E560() | NVVMLowerAlloca | always | |
| 7 | sub_1857160() | SimplifyCFG | always | |
| 8 | sub_1842BC0() | InstCombine | always | |
| 9 | sub_17060B0(1,0) | GVN | opts[3160] | Debug-dump enabled |
| 10 | sub_12D4560() | NVVMVerify | always | |
| 11 | sub_18A3090() | LoopRotate | always | |
| 12 | sub_184CD60() | LICM | always | |
| 13 | sub_1869C50(1,0,1) | IndVarSimplify | !opts[1040] | |
| 14 | sub_1833EB0(3) | LoopUnroll | always | Factor = 3 |
| 15 | sub_17060B0(1,0) | GVN | always | |
| 16 | sub_1952F90(-1) | LoopIndexSplit/SCCP | always | Threshold = -1 (unlimited) |
| 17 | sub_1A62BF0(1,0,0,1,0,0,1) | CGSCC/Inliner | always | |
| 18 | sub_1A223D0() | DSE | always | |
| 19 | sub_17060B0(1,0) | GVN | always | |
| 20 | sub_1A7A9F0() | MemCpyOpt | always | |
| 21 | sub_1A62BF0(1,0,0,1,0,0,1) | CGSCC/Inliner | always | |
| 22 | sub_1A02540() | ADCE | always | |
| 23 | sub_198DF00(-1) | JumpThreading/CVP | always | Threshold = -1 |
| 24 | sub_1C76260() | NVVMDivergenceLowering | !opts[1320] | |
| 25 | sub_195E880(0) | Reassociate | opts[2880] | Default on (slot 143) |
| 26 | sub_19C1680(0,1) | SpeculativeExecution | !opts[1360] | |
| 27 | sub_17060B0(1,0) | GVN | opts[3160] | Debug-dump enabled |
| 28 | sub_19401A0() | SCCP | always | |
| 29 | sub_1968390() | GlobalDCE/ConstantProp | always | |
| 30 | sub_196A2B0() | GlobalOpt | always | |
| 31 | sub_19B73C0(2,-1,-1,-1,-1,-1,-1) | LoopVectorize/SLP | always | Width=2, thresholds=-1 |
| 32 | sub_17060B0(1,0) | GVN | always | |
| 33 | sub_190BB10(0,0) | EarlyCSE | always | |
| 34 | sub_1A13320() | TailCallElim | always | |
| 35 | sub_17060B0(1,1) | GVN (verified) | opts[3160] | Verify mode |
| 36 | sub_18F5480() | NewGVN | always | |
| 37 | sub_18DEFF0() | Sink | always | |
| 38 | sub_1A62BF0(1,0,0,1,0,0,1) | CGSCC/Inliner | always | |
| 39 | sub_18B1DE0() | Sinking2 | always | NVIDIA custom |
| 40 | sub_1841180() | LoopSimplify/LCSSA | always | |

After sub_12DE330 returns, opts[4224] is cleared (one-shot).

Tiers 1/2/3: Phase-Specific Sub-Pipeline (sub_12DE8F0)

sub_12DE8F0 at 0x12DE8F0 is a single function called with tier in {1, 2, 3}. The tier value is stored into qword_4FBB410 (phase tracker). When tier==3 and qword_4FBB370 byte4 is 0, the feature flags are set to 6 (enabling advanced barrier opt + memory space opt gates).

The following table lists every pass in sub_12DE8F0 with its tier-dependent guard condition. A pass runs only when ALL conditions in its Guard column are satisfied.

| # | Factory | Pass | Guard | O1 | O2 | O3 |
|---|---------|------|-------|----|----|----|
| 1 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 2 | sub_1A223D0() | NVVMIRVerification | !opts[2600] | Y | Y | Y |
| 3 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 4 | sub_18E4A00() | NVVMBarrierAnalysis | opts[3488] | Y | Y | Y |
| 5 | sub_1C98160(0) | NVVMLowerBarriers | opts[3488] | Y | Y | Y |
| 6 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 7 | sub_12D4560() | NVVMVerifier | !opts[600] | Y | Y | Y |
| 8 | sub_185D600() | IPConstPropagation | opts[3200] && !opts[920] | Y | Y | Y |
| 9 | sub_1857160() | NVVMReflect | opts[3200] && !opts[880] | Y | Y | Y |
| 10 | sub_18A3430() | NVVMPredicateOpt | opts[3200] && !opts[1120] | Y | Y | Y |
| 11 | sub_1842BC0() | SCCP | opts[3200] && !opts[720] | Y | Y | Y |
| 12 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] | Y | Y | Y |
| 13 | sub_12D4560() | NVVMVerifier | !opts[600] | Y | Y | Y |
| 14 | sub_18A3090() | NVVMPredicateOpt variant | opts[3200] && !opts[2160] | Y | Y | Y |
| 15 | sub_184CD60() | ConstantMerge | opts[3200] && !opts[1960] | Y | Y | Y |
| 16 | sub_190BB10(1,0) | SimplifyCFG | tier!=1 && !opts[1040] && !opts[1200] | - | Y | Y |
| 17 | sub_1952F90(-1) | LoopIndexSplit | (same as #16) && !opts[1160] | - | Y | Y |
| 18 | sub_12D4560() | NVVMVerifier | (same as #16) && !opts[600] | - | Y | Y |
| 19 | sub_17060B0(1,0) | PrintModulePass | (same as #16) && !opts[1080] | - | Y | Y |
| 20 | sub_195E880(0) | LICM | opts[3704] && opts[2880] && !opts[1240] | Y | Y | Y |
| 21 | sub_1C8A4D0(v12) | EarlyCSE | always; v12=1 if opts[3704] | Y | Y | Y |
| 22 | sub_1869C50(1,0,1) | Sink | tier!=1 && !opts[1040] | - | Y | Y |
| 23 | sub_1833EB0(3) | TailCallElim | tier==3 && !opts[320] | - | - | Y |
| 24 | sub_1CC3990() | NVVMUnreachableBlockElim | !opts[2360] | Y | Y | Y |
| 25 | sub_18EEA90() | CorrelatedValuePropagation | opts[3040] | Y | Y | Y |
| 26 | sub_12D4560() | NVVMVerifier | !opts[600] | Y | Y | Y |
| 27 | sub_1A223D0() | NVVMIRVerification | !opts[2600] | Y | Y | Y |
| 28 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 29 | sub_1C4B6F0() | Inliner | !opts[440] && !opts[480] | Y | Y | Y |
| 30 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 31 | sub_1A7A9F0() | InstructionSimplify | !opts[2720] | Y | Y | Y |
| 32 | sub_12D4560() | NVVMVerifier | !opts[600] | Y | Y | Y |
| 33 | sub_1A02540() | GenericToNVVM | !opts[2200] | Y | Y | Y |
| 34 | sub_198DF00(-1) | LoopSimplify | !opts[1520] | Y | Y | Y |
| 35 | sub_1C76260() | ADCE | !opts[1320] && !opts[1480] | Y | Y | Y |
| 36 | sub_17060B0(1,0) | PrintModulePass | (same as #35) && !opts[1080] | Y | Y | Y |
| 37 | sub_12D4560() | NVVMVerifier | (same as #35) && !opts[600] | Y | Y | Y |
| 38 | sub_195E880(0) | LICM | opts[2880] && !opts[1240] | Y | Y | Y |
| 39 | sub_1C98160(0/1) | NVVMLowerBarriers | opts[3488] | Y | Y | Y |
| 40 | sub_19C1680(0,1) | LoopUnroll | !opts[1360] | Y | Y | Y |
| 41 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] | Y | Y | Y |
| 42 | sub_19401A0() | InstCombine | !opts[1000] | Y | Y | Y |
| 43 | sub_196A2B0() | EarlyCSE | !opts[1440] | Y | Y | Y |
| 44 | sub_1968390() | SROA | !opts[1400] | Y | Y | Y |
| 45 | sub_19B73C0(tier,...) | LoopVectorize/SLP (1st) | tier!=1; params vary by SM | - | Y | Y |
| 46 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 47 | sub_19B73C0(tier,...) | LoopVectorize/SLP (2nd) | !opts[2760] | Y | Y | Y |
| 48 | sub_1A62BF0(1,...) | LLVM standard pipeline | !opts[600] | Y | Y | Y |
| 49 | sub_1A223D0() | NVVMIRVerification | !opts[2600] | Y | Y | Y |
| 50 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 51 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] | Y | Y | Y |
| 52 | sub_190BB10(0,0) | SimplifyCFG | !opts[960] | Y | Y | Y |
| 53 | sub_1922F90() | NVIDIA loop pass | opts[3080] | Y | Y | Y |
| 54 | sub_195E880(0) | LICM | opts[2880] && !opts[1240] | Y | Y | Y |
| 55 | sub_1A13320() | NVVMRematerialization | !opts[2320] | Y | Y | Y |
| 56 | sub_1968390() | SROA | !opts[1400] | Y | Y | Y |
| 57 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 58 | sub_18EEA90() | CorrelatedValuePropagation | opts[3040] | Y | Y | Y |
| 59 | sub_18F5480() | DSE | !opts[760] | Y | Y | Y |
| 60 | sub_18DEFF0() | DCE | !opts[280] | Y | Y | Y |
| 61 | sub_1A62BF0(1,...) | LLVM standard pipeline | !opts[600] | Y | Y | Y |
| 62 | sub_1AAC510() | NVIDIA-specific pass | !opts[520] && !opts[560] | Y | Y | Y |
| 63 | sub_1A223D0() | NVVMIRVerification | !opts[2600] | Y | Y | Y |
| 64 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 65 | sub_1C8E680() | MemorySpaceOpt | !opts[2680]; param from opts[3120] | Y | Y | Y |
| 66 | sub_1A223D0() | NVVMIRVerification | opts[3120] && !opts[2600] | Y | Y | Y |
| 67 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] | Y | Y | Y |
| 68 | sub_1CC71E0() | NVVMGenericAddrOpt | !opts[2560] | Y | Y | Y |
| 69 | sub_1C98270(1,opts[2920]) | NVVMLowerBarriers variant | opts[3488] | Y | Y | Y |
| 70 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 71 | sub_1C6FCA0() | ADCE | opts[2840] && !opts[1840] | Y | Y | Y |
| 72 | sub_18B1DE0() | LoopOpt/BarrierOpt | opts[3200] && !opts[2640] | Y | Y | Y |
| 73 | sub_1857160() | NVVMReflect (late) | opts[3200] && tier==3 && !opts[880] | - | - | Y |
| 74 | sub_1841180() | FunctionAttrs | opts[3200] && !opts[680] | Y | Y | Y |
| 75 | sub_1C46000() | NVVMLateOpt | tier==3 && !opts[360] | - | - | Y |
| 76 | sub_1841180() | FunctionAttrs (2nd) | opts[3200] && !opts[680] | Y | Y | Y |
| 77 | sub_1CBC480() | NVVMLowerAlloca | !opts[2240] && !opts[2280] | Y | Y | Y |
| 78 | sub_1CB73C0() | NVVMBranchDist | !opts[2080] && !opts[2120] | Y | Y | Y |
| 79 | sub_1C7F370(1) | NVVMWarpShuffle | opts[3328] && !opts[1640] | Y | Y | Y |
| 80 | sub_1CC5E00() | NVVMReduction | opts[3328] && !opts[2400] | Y | Y | Y |
| 81 | sub_1CC60B0() | NVVMSinking2 | opts[3328] && !opts[2440] | Y | Y | Y |
| 82 | sub_1CB73C0() | NVVMBranchDist (2nd) | opts[3328] && !opts[2080] && !opts[2120] | Y | Y | Y |
| 83 | sub_17060B0(1,0) | PrintModulePass | opts[3328] && !opts[1080] | Y | Y | Y |
| 84 | sub_1B7FDF0(3) | Reassociate | opts[3328] && !opts[1280] | Y | Y | Y |
| 85 | sub_17060B0(1,0) | PrintModulePass (final) | opts[3160] && !opts[1080] | Y | Y | Y |

O1 vs O2 vs O3: Complete Diff

The three O-levels differ through exactly five mechanisms. Every pass that is NOT listed here runs identically at all three levels.

1. Tier guard: tier!=1 (O2/O3 only)

These passes are present in sub_12DE8F0 but skip when tier==1 (O1):

| Pass | Factory | Effect of skipping at O1 |
|------|---------|--------------------------|
| SimplifyCFG | sub_190BB10(1,0) | No inter-tier CFG cleanup |
| LoopIndexSplit | sub_1952F90(-1) | No inter-tier loop splitting |
| NVVMVerifier (post-split) | sub_12D4560() | No verification after split |
| Sink | sub_1869C50(1,0,1) | No inter-tier instruction sinking |
| LoopVectorize/SLP (1st call) | sub_19B73C0(tier,...) | No aggressive vectorization |

At O1, the base pipeline (Tier 0) already includes one instance of LoopVectorize with sub_19B73C0(2,-1,-1,-1,-1,-1,-1) -- width 2, all thresholds at -1 (unlimited). The tier!=1 guard blocks a SECOND, more aggressive vectorization pass with SM-dependent parameters.

2. Tier guard: tier==3 (O3 only)

These passes run exclusively at O3:

| Pass | Factory | Purpose |
|------|---------|---------|
| TailCallElim | sub_1833EB0(3) | Additional tail call optimization pass |
| NVVMReflect (late) | sub_1857160() | Second-round __nvvm_reflect resolution |
| NVVMLateOpt | sub_1C46000() | O3-exclusive NVIDIA custom late optimization |

sub_1C46000 (NVVMLateOpt) is the most significant O3-exclusive pass. It runs only when !opts[360] (not disabled) and only at tier==3. This is a dedicated NVIDIA optimization pass that performs additional transformations after the main pipeline is complete.

3. Feature flag qword_4FBB370 escalation

When tier==3 and qword_4FBB370 byte4 is 0, the function sets qword_4FBB370 = 6 (binary 110). This enables two feature gates:

  • Advanced barrier optimization (bit 1)
  • Memory space optimization extensions (bit 2)

These gates affect behavior in downstream passes that read qword_4FBB370, such as sub_12EC4F0 (the machine pass pipeline executor).

4. LoopVectorize/SLP parameter differences

sub_19B73C0 is called with different parameters depending on context:

| Call site | Parameters | Tier |
|-----------|------------|------|
| Tier 0 (sub_12DE330 #31) | (2, -1, -1, -1, -1, -1, -1) | All O1/O2/O3 |
| Tier 1/2/3, 1st call (#45) | (tier, ...) SM-dependent | O2/O3 only |
| Tier 1/2/3, 2nd call (#47) | (tier, ...) | All tiers |
| Ofcmid language path | (3, -1, -1, 0, 0, -1, 0) | Fast-compile |

The 7 parameters to sub_19B73C0 control:

  • arg1: Vector width factor (2 at Tier 0, tier at higher tiers)
  • arg2..arg7: Thresholds for cost model, trip count, and SLP width. Value -1 means unlimited/auto; value 0 means conservative/disabled.

At O2, sub_19B73C0(2, ...) provides moderate vectorization. At O3, sub_19B73C0(3, ...) increases the vector width factor, enabling wider SIMD exploration. The SM-architecture-dependent parameters are resolved at runtime based on the target GPU.

5. CGSCC iteration count

sub_1A62BF0 is the CGSCC (Call Graph SCC) pass manager factory. The first argument is the pipeline extension point / iteration count:

| Context | Call | Iterations |
|---------|------|------------|
| Tier 0 (all O-levels) | sub_1A62BF0(1,0,0,1,0,0,1) | 1 |
| Ofcmid path | sub_1A62BF0(5,0,0,1,0,0,1) | 5 |
| Language "mid" path | sub_1A62BF0(8,0,0,1,1,0,1) | 8, with extra opt flag |

O1/O2/O3 all use 1-iteration CGSCC in their shared Tier 0 pipeline. The iteration count differences appear in the fast-compile and language-specific paths, not between O-levels.

Complete O-Level Comparison Matrix

| Feature | O0 | O1 | O2 | O3 |
|---------|----|----|----|----|
| Tier 0 base pipeline (~40 passes) | - | Y | Y | Y |
| Tier 1 sub-pipeline | - | Y | Y | Y |
| Tier 2 sub-pipeline | - | - | Y | Y |
| Tier 3 sub-pipeline | - | - | - | Y |
| LoopVectorize (base, width=2) | - | Y | Y | Y |
| LoopVectorize (tier, SM-dependent) | - | - | Y | Y |
| SimplifyCFG (inter-tier) | - | - | Y | Y |
| LoopIndexSplit (inter-tier) | - | - | Y | Y |
| Sink (inter-tier) | - | - | Y | Y |
| TailCallElim (extra) | - | - | - | Y |
| NVVMReflect (late round) | - | - | - | Y |
| NVVMLateOpt (sub_1C46000) | - | - | - | Y |
| Feature flags escalation (6) | - | - | - | Y |
| NVVMDivergenceLowering | - | Y | Y | Y |
| SpeculativeExecution | - | Y | Y | Y |
| MemorySpaceOpt | - | Y | Y | Y |
| NVVMWarpShuffle | - | Y | Y | Y |
| NVVMReduction | - | Y | Y | Y |
| NVVMRematerialization | - | Y | Y | Y |
| NVVMBranchDist | - | Y | Y | Y |
| LSA optimization | off | on | on | on |

O0 Pipeline (Minimal)

When no O-level flag is set and no fast-compile level is active, the assembler falls through to LABEL_159 which calls:

sub_1C8A4D0(0)   -- NVVMFinalCleanup or similar minimal pass

Then the common tail at LABEL_84 adds:

  1. MemorySpaceOpt (conditional, skipped at O0 since opts[3488] is typically unset)
  2. sub_1CEBD10() -- NVVMFinal / cleanup
  3. sub_1654860(1) -- VerifierPass
  4. sub_12DFE00() -- Codegen pass setup

The O0 pipeline does NOT call sub_12DE330 or sub_12DE8F0. It runs only the infrastructure passes (TargetLibraryInfo, TargetTransformInfo, BasicAA, AssumptionCacheTracker, ProfileSummaryInfo) plus minimal canonicalization.

Ofcmax Pipeline (Fastest Compile)

Ofcmax bypasses the full pipeline entirely. It forces two optimizer flags:

  • -lsa-opt=0 (disables LSA optimization)
  • -memory-space-opt=0 (disables MemorySpaceOpt pass)

This forcing happens in BOTH sub_9624D0 (lines 1358--1361) and sub_12CC750 (lines 2025--2079). The condition is:

if (!compare(lsa_opt_flag, "0") || fc_level == 2):
  append("-lsa-opt=0")
  append("-memory-space-opt=0")

Additionally, when fc_level == 2 AND lsa_opt is NOT already "0", the libnvvm path also injects -lsa-opt=0, mem2reg, -memory-space-opt=0.

The minimal pass sequence:

| # | Factory | Pass |
|---|---------|------|
| 1 | sub_18B3080(1) | Sinking2Pass (fast mode, flag=1) |
| 2 | sub_1857160() | SimplifyCFG |
| 3 | sub_19CE990() | LoopStrengthReduce (if applicable) |
| 4 | sub_1B26330() | NVVMReflect |
| 5 | sub_12D4560() | NVVMVerify |
| 6 | sub_184CD60() | LICM |
| 7 | sub_1C4B6F0() | LowerSwitch |
| 8 | sub_12D4560() | NVVMVerify |

Ofcmid Pipeline (Medium)

Ofcmid runs ~25--30 passes without forcing LSA or MemorySpaceOpt off. The pass sequence from sub_12E54A0 (lines 814--861):

| # | Factory | Pass | Guard |
|---|---------|------|-------|
| 1 | sub_184CD60() | LICM | !opts[1960] |
| 2 | sub_1CB4E40(0) | AnnotationCleanup | always |
| 3 | sub_1B26330() | NVVMReflect | always |
| 4 | sub_198E2A0() | CorrelatedValuePropagation | always |
| 5 | sub_1CEF8F0() | NVVMPeephole | always |
| 6 | sub_215D9D0() | NVVMPeephole2/TcgenAnnotation | always |
| 7 | sub_17060B0(1,0) | GVN | !opts[1080] |
| 8 | sub_198DF00(-1) | JumpThreading/CVP | always |
| 9 | sub_17060B0(1,0) | GVN | !opts[1080] |
| 10 | sub_1C6E800() | NVVMLowerArgs | always |
| 11 | sub_1832270(1) | LoopSimplify | always |
| 12 | sub_1A62BF0(5,0,0,1,0,0,1) | CGSCC (5 iterations) | always |
| 13 | sub_1CB4E40(0) | AnnotationCleanup | always |
| 14 | sub_18FD350(0) | DCE | always |
| 15 | sub_1841180() | LCSSA | always |
| 16 | sub_18DEFF0() | Sink | always |
| 17 | sub_17060B0(1,0) | GVN | always |
| 18 | sub_184CD60() | LICM | always |
| 19 | sub_195E880(0) | Reassociate | always |
| 20 | sub_190BB10(0,0) | EarlyCSE | always |
| 21 | sub_19B73C0(3,-1,-1,0,0,-1,0) | LoopVectorize (conservative) | always |
| 22 | sub_1A223D0() | DSE | always |
| 23 | sub_1C98160(0) | MemorySpaceOpt | always |
| 24 | sub_1C8E680(0) | MemorySpaceOpt2 | always |
| 25 | sub_1B7FDF0(3) | BranchFolding/CFGSimplify | always |
| 26 | sub_18B1DE0() | Sinking2 | always |

Key differences from the O1+ pipeline: Ofcmid uses 5-iteration CGSCC (vs 1 at O1+), includes NVVMPeephole/Peephole2 early, uses conservative LoopVectorize parameters (3,-1,-1,0,0,-1,0) with some thresholds zeroed, and skips NVVMDivergenceLowering, SpeculativeExecution, NVVMBranchDist, NVVMRematerialization, and the entire tier sub-pipeline.

Ofcmin Pipeline (Closest to Full Optimization)

Ofcmin takes the same path as Ofcmid through LABEL_297 in sub_12E54A0 but with the v238 flag set differently, enabling more aggressive settings. The pipeline is essentially the Ofcmid sequence with:

  • More aggressive loop optimizer thresholds
  • Additional CGSCC framework passes
  • Closer parameter alignment to the O2 full pipeline

Ofcmin does NOT force -lsa-opt=0 or -memory-space-opt=0. Like Ofcmid, it still skips the tier 1/2/3 sub-pipeline entirely, keeping compile time lower than O1.

Post-Optimization Common Tail

Regardless of pipeline tier, sub_12E54A0 always appends at LABEL_84 (lines 640--653):

| # | Factory | Pass | Guard |
|---|---------|------|-------|
| 1 | sub_1C98160(opts[2920]!=0) | MemorySpaceOpt | !v244 && opts[3488] |
| 2 | sub_1CEBD10() | NVVMFinal / cleanup | always |
| 3 | sub_1654860(1) | VerifierPass | !opts[2800] && !opts[4464] |
| 4 | sub_12DFE00(PM, v253, opts) | Codegen pass dispatch | always |

sub_12DFE00 (codegen dispatch) reads the optimization level from opts[200] to determine codegen aggressiveness. When opts[200] > 1, full dependency tracking is enabled across all codegen passes.

Always-Added Analysis Passes

Before any optimization, the pipeline assembler inserts (lines 396--420):

| # | Factory | Pass |
|---|---------|------|
| 1 | sub_149CCE0 (368 bytes alloc) | TargetLibraryInfoWrapperPass |
| 2 | sub_1BFB520 (208 bytes alloc) | TargetTransformInfoWrapperPass |
| 3 | sub_14A7550() | VerifierPass / BasicAliasAnalysis |
| 4 | sub_1361950() | AssumptionCacheTracker |
| 5 | sub_1CB0F50() | ProfileSummaryInfoWrapperPass |

These five passes run at ALL optimization levels including O0.

NVVMPassOptions Offset-to-Guard Map

Passes gated by NVVMPassOptions boolean flags (the 4,512-byte opts struct). Slot defaults come from sub_12D6300:

| Offset | Slot | Default | Controls | Used By |
|--------|------|---------|----------|---------|
| 280 | 15 | off | DCE disable | Tier 0 #37, Tier 1/2/3 #60 |
| 320 | 17 | off | TailCallElim disable | Tier 1/2/3 #23 (O3 only) |
| 360 | 19 | on | NVVMLateOpt disable | Tier 1/2/3 #75 (O3 only) |
| 440 | 23 | off | Inliner flag A disable | Tier 1/2/3 #29 |
| 480 | 25 | on | Inliner flag B disable | Tier 1/2/3 #29 |
| 600 | 31 | off | NVVMVerifier disable | Tier 1/2/3 #7,13,18,26,32,37 |
| 680 | 35 | off | FunctionAttrs disable | Tier 1/2/3 #74,76 |
| 720 | 37 | off | SCCP disable | Tier 1/2/3 #11 |
| 760 | 39 | off | DSE disable | Tier 1/2/3 #59 |
| 880 | 45 | off | NVVMReflect disable | Tier 1/2/3 #9,73 |
| 920 | 47 | off | IPConstPropagation disable | Tier 1/2/3 #8 |
| 960 | 49 | off | SimplifyCFG disable | Tier 1/2/3 #52 |
| 1000 | 51 | off | InstCombine disable | Tier 1/2/3 #42 |
| 1040 | 53 | off | Sink/SimplifyCFG disable | Tier 0 #13, Tier 1/2/3 #16,22 |
| 1080 | 55 | off | PrintModulePass disable | many |
| 1120 | 57 | off | NVVMPredicateOpt disable | Tier 1/2/3 #10 |
| 1160 | 59 | off | LoopIndexSplit disable | Tier 1/2/3 #17 |
| 1200 | 61 | off | SimplifyCFG tier guard | Tier 1/2/3 #16 |
| 1240 | 63 | off | LICM disable | Tier 1/2/3 #20,38,54 |
| 1280 | 65 | off | Reassociate disable | Tier 1/2/3 #84 |
| 1320 | 65 | off | NVVMDivergenceLow disable | Tier 0 #24, Tier 1/2/3 #35 |
| 1360 | 67 | off | LoopUnroll disable | Tier 0 #26, Tier 1/2/3 #40 |
| 1400 | 69 | off | SROA disable | Tier 1/2/3 #44,56 |
| 1440 | 71 | off | EarlyCSE disable | Tier 1/2/3 #43 |
| 1480 | 73 | off | ADCE extra guard | Tier 1/2/3 #35 |
| 1520 | 75 | off | LoopSimplify disable | Tier 1/2/3 #34 |
| 1640 | 81 | off | NVVMWarpShuffle disable | Tier 1/2/3 #79 |
| 1760 | 87 | off | MemorySpaceOpt disable | Common tail, language paths |
| 1840 | 91 | off | ADCE variant disable | Tier 1/2/3 #71 |
| 1960 | 97 | off | ConstantMerge disable | Tier 1/2/3 #15 |
| 2000 | 101 | off | NVVMIntrinsicLowering disable | Tier 1/2/3 #1,3,28,50,64 |
| 2080 | 103 | off | NVVMBranchDist disable A | Tier 1/2/3 #78,82 |
| 2120 | 105 | off | NVVMBranchDist disable B | Tier 1/2/3 #78,82 |
| 2200 | 109 | off | GenericToNVVM disable | Tier 1/2/3 #33 |
| 2240 | 111 | off | NVVMLowerAlloca A disable | Tier 1/2/3 #77 |
| 2280 | 113 | off | NVVMLowerAlloca B disable | Tier 1/2/3 #77 |
| 2320 | 115 | off | NVVMRematerialization disable | Tier 1/2/3 #55 |
| 2360 | 117 | on | NVVMUnreachableBlockElim disable | Tier 1/2/3 #24 |
| 2400 | 119 | off | NVVMReduction disable | Tier 1/2/3 #80 |
| 2440 | 121 | off | NVVMSinking2 disable | Tier 1/2/3 #81 |
| 2560 | 127 | off | NVVMGenericAddrOpt disable | Tier 1/2/3 #68 |
| 2600 | 129 | off | NVVMIRVerification disable | Tier 1/2/3 #2,27,49,63,66 |
| 2640 | 131 | off | LoopOpt/BarrierOpt disable | Tier 1/2/3 #72 |
| 2680 | 133 | off | MemorySpaceOpt (2nd) disable | Tier 1/2/3 #65 |
| 2720 | 135 | off | InstructionSimplify disable | Tier 1/2/3 #31 |
| 2760 | 137 | off | LoopVectorize 2nd disable | Tier 1/2/3 #47 |
| 2840 | 141 | on | ADCE enable (reversed) | Tier 1/2/3 #71 |
| 2880 | 143 | on | LICM enable (reversed) | Tier 0 #25, Tier 1/2/3 #20,38,54 |
| 2920 | 145 | off | LowerBarriers parameter | Common tail |
| 3000 | 151 | on | Early pass guard | Pre-opt phase |
| 3040 | 153 | off | CorrelatedValueProp enable | Tier 1/2/3 #25,58 |
| 3080 | 155 | on | NVIDIA loop pass enable | Tier 1/2/3 #53 |
| 3120 | 155 | on | MemorySpaceOpt(2nd) enable | Tier 1/2/3 #65,66 |
| 3160 | 157 | on | PrintModulePass enable | Tier 0 #9,27,35; Tier 1/2/3 many |
| 3200 | 159 | on | Advanced NVIDIA passes group | Tier 1/2/3 #8-11,14-15,72-76 |
| 3328 | 165 | on | SM-specific late passes block | Tier 1/2/3 #79-84 |
| 3488 | 173 | off | NVVMBarrierAnalysis enable | Tier 1/2/3 #4,5,39,69 |
| 3528 | 175 | off | Tier 1 enable | Pipeline assembler |
| 3568 | 177 | off | Tier 2 enable | Pipeline assembler |
| 3608 | 179 | off | Tier 3 enable | Pipeline assembler |
| 3648 | 181 | "" | Language/fc-level string ptr | Pipeline name selection |
| 3704 | 183 | off | Late optimization flag | Tier 1/2/3 #20,21; Pipeline B |
| 3904 | 192 | off | Debug/naming mode flag | BB naming loop |
| 4064 | 201 | off | Concurrent compilation flag | Thread count decision |
| 4104 | 203 | -1 | Thread count (integer) | sub_12E7E70 |
| 4224 | 209 | off | Tier 0 enable (opt active) | Pipeline assembler loop |
| 4304 | 213 | off | Device-code / additional opt | Pipeline B; fc dispatch |
| 4384 | 217 | off | Fast-compile bypass flag | Pipeline A vs B branch |
| 4464 | 221 | off | Late CFG cleanup guard | Common tail #3 |

Codegen Optimization Level Propagation

The -optO and -llcO flags propagate the optimization level to the backend code generator. In sub_12E54A0 (lines 1451--1460):

if (lsa_opt == "0" && some_flag == "1"):
  append("-optO<level>")
  append("-llcO2")

The codegen dispatch sub_12DFE00 reads opts[200] (the integer optimization level):

  • opts[200] == 0: Minimal codegen (no dependency tracking)
  • opts[200] >= 1: Standard codegen
  • opts[200] >= 2: Full dependency tracking enabled (v121 = true)

NVVMPassOptions

NVVMPassOptions is NVIDIA's proprietary per-pass configuration system -- a 4,512-byte flat struct containing 221 option slots that controls every aspect of the NVVM optimization pipeline. It has no upstream LLVM equivalent. Where LLVM uses scattered cl::opt<T> globals that each pass reads independently, NVIDIA consolidates all pass configuration into a single contiguous struct that is allocated once and threaded through the entire pipeline assembler as a parameter. This design allows the pipeline to make pass-enable decisions through simple byte reads at known offsets rather than hash-table lookups, and it ensures that the complete configuration state can be copied between Phase I and Phase II of the two-phase compilation model.

The struct is populated by a single 125KB function (sub_12D6300) that reads from a PassOptionRegistry hash table and flattens the results into 221 typed slots. The pipeline assembler (sub_12E54A0) and its sub-pipeline builders (sub_12DE330, sub_12DE8F0) then read individual slots by offset to decide which passes to insert and how to configure them.

Initializer | sub_12D6300 (125 KB, 4,786 lines)
Struct size | 4,512 bytes (sub_22077B0(4512))
Slot count | 221 (1-based index: 1--221)
Slot types | 5: STRING (24B), BOOL_COMPACT (16B), BOOL_INLINE (16B), INTEGER (16B), STRING_PTR (28B)
Type breakdown | 114 string + 83 bool compact + 17 bool inline + 6 integer + 1 string pointer
Registry lookup | sub_12D6170 (hash table at registry+120)
PassDef resolver | sub_1691920 (64-byte stride table)
Bool parser | sub_12D6240 (triple: lookup + lowercase + char test)
Callers | sub_12E7E70 (Phase orchestrator), sub_12F4060 (TargetMachine creation)
Consumers | sub_12E54A0, sub_12DE330, sub_12DE8F0, sub_12DFE00

Struct Layout

The struct is heap-allocated as a single 4,512-byte block. The first 16 bytes contain header fields, followed by 221 option slots packed contiguously, and a 32-byte zero trailer:

Offset  Size   Field
──────  ────   ─────
0       4      int opt_level (copied from registry+112)
4       4      (padding)
8       8      qword ptr to PassOptionRegistry
16      ~4464  221 option slots (variable-size, packed)
4480    32     zero trailer (4 qwords, sentinel)

Slot offsets are deterministic -- they depend on the type sequence hard-coded into sub_12D6300. String slots consume 24 bytes, boolean and integer slots consume 16 bytes, and the unique string-pointer slot at index 181 consumes 28 bytes. The initializer writes each slot at a compile-time-constant offset; there is no dynamic layout calculation.

Slot Types

Type A: String Option (24 bytes) -- sub_12D6090

114 slots. Stores a string value (pass name or parametric value) along with flags, optimization level, and pass ID.

struct StringOption {       // 24 bytes, written by sub_12D6090
    char*    value;         // +0:  pointer to string data
    int32_t  option_index;  // +8:  1-based slot index
    int32_t  flags;         // +12: from PassDef byte 40
    int32_t  opt_level;     // +16: from header opt_level
    int32_t  pass_id;       // +20: resolved via sub_1691920
};

Type B: Boolean Compact (16 bytes) -- sub_12D6100

83 slots. The most common boolean representation. The helper encapsulates the lookup-parse-resolve sequence.

struct BoolCompactOption {  // 16 bytes, written by sub_12D6100
    uint8_t  value;         // +0:  0 or 1
    uint8_t  pad[3];        // +1:  padding
    int32_t  option_index;  // +4:  1-based slot index
    int32_t  flags;         // +8:  from PassDef byte 40
    int32_t  pass_id;       // +12: resolved via sub_1691920
};

Type C: Boolean Inline (16 bytes) -- direct write

17 slots. Identical layout to Type B, but written directly by sub_12D6300 rather than through the sub_12D6100 helper. These correspond to option pairs where the boolean resolution requires checking PassDef+36 (has_overrides byte) and resolving via sub_1691920 inline. The 17 inline boolean slots are: 7, 11, 13, 49, 53, 55, 59, 61, 95, 103, 119, 127, 151, 159, 169, 177, 211.

struct BoolInlineOption {   // 16 bytes, same layout as Type B
    uint8_t  value;         // +0:  0 or 1
    uint8_t  pad[3];        // +1
    int32_t  option_index;  // +4:  high 32 bits of sub_12D6240 return
    int32_t  opt_level;     // +8:  from header
    int32_t  pass_id;       // +12: resolved inline
};

Type D: Integer (16 bytes) -- direct write via sub_16D2BB0

6 slots. The integer value is parsed from the registry string by sub_16D2BB0 (string-to-int64). Layout is identical to boolean compact but the first 4 bytes store a full int32_t rather than a single byte.

struct IntegerOption {      // 16 bytes
    int32_t  value;         // +0:  parsed integer
    int32_t  option_index;  // +4:  1-based slot index
    int32_t  opt_level;     // +8
    int32_t  pass_id;       // +12
};

Type E: String Pointer (28 bytes) -- slot 181 only

Unique. Stores a raw char* plus length rather than a managed string. Likely a file path or regex pattern that requires direct C-string access.

struct StringPtrOption {    // 28 bytes, slot 181 only
    char*    data;          // +0:  raw char pointer
    uint64_t length;        // +8:  string length
    int32_t  option_index;  // +16: 1-based slot index
    int32_t  opt_level;     // +20
    int32_t  pass_id;       // +24
};

Pair Organization Pattern

The 221 slots follow a predominantly paired layout. Slots 1--6 are six standalone STRING options (likely the global compilation parameters: ftz, prec-div, prec-sqrt, fmad, opt-level, sm-arch). Starting at slot 7, slots are organized in (EVEN, ODD) pairs:

  • Even slot N: STRING option -- the pass's parameter value or name
  • Odd slot N+1: BOOLEAN or INTEGER option -- the enable/disable toggle

Each "pass knob" thus gets a string parameter slot and a boolean gate. The pipeline assembler reads the boolean to decide whether to insert the pass, and passes the string value as the pass's configuration parameter.

Exceptions to the pair pattern:

Region | Anomaly
Slots 160--162 | Three consecutive STRING slots with a single boolean at 163
Slots 191--193 | Slot 191 STRING, then two consecutive booleans at 192--193
Slot 181 | STRING_PTR type instead of normal STRING
Slots 196--207 | Alternating STRING + INTEGER instead of STRING + BOOL

Helper Functions

sub_12D6170 -- PassOptionRegistry::lookupOption

Looks up an option by its 1-based slot index in the hash table at registry+120. Returns a pointer to an OptionNode or 0 if the option was not set from the command line:

// Signature: int64 sub_12D6170(void* registry, int option_index)
// Returns: OptionNode* or 0
//
// OptionNode layout:
//   +40   int16   flags
//   +48   char**  value_array_ptr (array of string values)
//   +56   int     value_count

The hash table uses open addressing. The lookup computes hash(option_index) and probes linearly. When an option is not present in the registry (meaning the user did not supply a CLI override), the caller falls back to the hard-coded default in sub_12D6300.

sub_12D6240 -- PassOptionRegistry::getBoolOption

Resolves a boolean option with a default value. This is the critical function for all 100 boolean slots -- it performs a three-step resolution:

sub_12D6240(registry, option_index, default_string):
    1. Call sub_12D6170(registry, option_index)
    2. If found AND has value:
         lowercase the string via sub_16D2060
         result = (first_char == '1' || first_char == 't')  // "1" or "true"
    3. If not found OR no value:
         result = (default_string[0] == '1')  // "0" -> false, "1" -> true
    Return: packed(bool_value:8, flags:32) in low 40 bits

The packing convention is significant: the boolean value occupies the low 8 bits and the flags occupy bits 8--39. Callers unpack with (result & 0xFF) for the boolean and (result >> 8) for the flags.

sub_1691920 -- PassDefTable::getPassDef

Resolves a 1-based pass index to its PassDef entry in a table with 64-byte stride:

// sub_1691920(table_ptr, pass_index):
//   return table_ptr[0] + (pass_index - 1) * 64
//
// PassDef layout (64 bytes):
//   +32   int     pass_id
//   +36   byte    has_overrides
//   +40   int16   override_index

The pass_id field is written into every option slot and later used by the pipeline assembler to map configuration back to the pass factory that should receive it.

sub_16D2BB0 -- parseInt

Parses a string to a 64-bit integer. Used for the 6 integer-typed option slots (9, 197, 203, 205, 207, 215).

Default Values

Most boolean slots default to 0 (disabled). 14 slots default to 1 (enabled) -- these represent passes that run by default and must be explicitly disabled:

Confidence note: Pass associations marked [MEDIUM] are inferred from pipeline guard cross-references (a4[offset]). Associations marked [LOW] are based solely on offset proximity or default-value patterns.

Slot | Offset | Likely Pass | Confidence
19 | 400 | Inliner (AlwaysInliner gate) | MEDIUM
25 | 520 | NVIDIA-specific pass A | LOW
93 | 1880 | ConstantMerge | HIGH
95 | 1920 | NVVMIntrinsicLowering | HIGH
117 | 2360 | NVVMUnreachableBlockElim | HIGH
141 | 2840 | ADCE | HIGH
143 | 2880 | LICM | HIGH
151 | 3040 | CorrelatedValuePropagation | MEDIUM
155 | 3120 | MemorySpaceOpt (second pass) | MEDIUM
157 | 3160 | PrintModulePass (dump mode) | HIGH
159 | 3200 | Optimization-level gating | MEDIUM
165 | 3328 | Late-pipeline enable block | LOW
211 | 4264 | (inline bool, late pass) | LOW
219 | 4424 | (compact bool, late pass) | LOW

Integer slot defaults:

Slot | Offset | Default | Likely Meaning
9 | 200 | 1 | Optimization threshold / iteration count
197 | 3984 | 20 | Limit/threshold (e.g., unroll count)
203 | 4104 | -1 | Thread count (sentinel for auto-detect via get_nprocs())
205 | 4144 | -1 | Thread count fallback
207 | 4184 | -1 | Sentinel for unlimited/auto
215 | 4344 | 0 | Disabled counter

CLI Flag Routing

The path from a user-visible flag to an NVVMPassOptions slot traverses four stages:

nvcc -Xcicc -opt "-do-licm=0"          ← user invocation
    │
    ▼
sub_9624D0 (flag catalog, 75KB)        ← parses -opt flags into opt_argv vector
    │   pushes "-do-licm=0" into v327 (opt vector)
    ▼
PassOptionRegistry (hash table)         ← opt-phase parser populates registry
    │   key = slot_index, value = "0"
    ▼
sub_12D6300 (125KB initializer)         ← flattens registry into 4512-byte struct
    │   sub_12D6240(registry, LICM_SLOT, "1") → returns 0 (overridden)
    │   writes opts[2880] = 0
    ▼
sub_12E54A0 / sub_12DE8F0              ← pipeline assembler reads opts[2880]
    if (opts[2880]) AddPass(LICM);     ← skipped because opts[2880] == 0

The -opt flag prefix is critical: it routes the argument to the optimizer phase vector rather than to the linker, LTO, or codegen phases. The flag catalog (sub_9624D0) recognizes several shorthand patterns:

User Flag | Routes To | Effect
--emit-optix-ir | opt "-do-ip-msp=0", opt "-do-licm=0" | Disables IPMSP and LICM for OptiX
-Ofast-compile=max | opt "-fast-compile=max", opt "-memory-space-opt=0" | Disables MemorySpaceOpt
-memory-space-opt=0 | opt "-memory-space-opt=0" | Direct pass disable
-Xopt "-do-remat=0" | opt "-do-remat=0" | Direct pass-through to opt phase

Pipeline Consumer: How Passes Read NVVMPassOptions

The pipeline assembler and its sub-pipeline builders receive the NVVMPassOptions struct as parameter a4 (in sub_12E54A0) or opts (in sub_12DE330/sub_12DE8F0). They read individual boolean slots by dereferencing a byte at a known offset and branching:

// Pattern 1: simple disable guard
if (!*(uint8_t*)(opts + 1760))           // opts[1760] = MemorySpaceOpt disable
    AddPass(PM, sub_1C8E680(0), 1, 0);  // insert MemorySpaceOpt

// Pattern 2: enable guard (inverted logic)
if (*(uint8_t*)(opts + 2880))            // opts[2880] = LICM enabled (default=1)
    AddPass(PM, sub_195E880(0), 1, 0);  // insert LICM

// Pattern 3: combined guard with opt-level gating
if (*(uint8_t*)(opts + 3200) &&          // opts[3200] = opt-level sufficient
    !*(uint8_t*)(opts + 880))            // opts[880] = NVVMReflect not disabled
    AddPass(PM, sub_1857160(), 1, 0);   // insert NVVMReflect

// Pattern 4: integer parameter read
v12 = *(int32_t*)(opts + 200);           // opts[200] = opt threshold (default=1)
// used to configure codegen dispatch in sub_12DFE00

The key insight is that the pipeline assembler never performs string comparison or hash-table lookup at pass-insertion time -- it reads pre-resolved values from the flat struct. This makes the ~150 pass-insertion decisions in sub_12E54A0 essentially free in terms of runtime cost.

Offset-to-Pass Mapping

The following table maps struct offsets (as seen in pipeline assembler guards opts[OFFSET]) to the passes they control. Offsets are byte offsets from the struct base. "Guard sense" indicates whether the pass runs when the byte is 0 (!opts[X] -- most common, where the option is a disable flag) or when it is nonzero (opts[X] -- the option is an enable flag).

Offset | Slot | Guard Sense | Controlled Pass | Factory
200 | 9 | value | Optimization threshold (integer, read by sub_12DFE00) | --
280 | 14-15 | !opts | DCE (DeadCodeElimination) | sub_18DEFF0
320 | 16-17 | !opts | TailCallElim / JumpThreading | sub_1833EB0
360 | 18-19 | !opts | NVVMLateOpt | sub_1C46000
400 | 20-21 | !opts | AlwaysInliner gate A | sub_1C4B6F0
440 | 22-23 | !opts | AlwaysInliner gate B | sub_1C4B6F0
480 | 24-25 | !opts | Inliner gate C | sub_1C4B6F0
520 | 26-27 | !opts | NVIDIA-specific pass A | sub_1AAC510
560 | 28-29 | !opts | NVIDIA-specific pass B | sub_1AAC510
600 | 30-31 | !opts | NVVMVerifier | sub_12D4560
680 | 34-35 | !opts | FunctionAttrs | sub_1841180
720 | 36-37 | !opts | SCCP | sub_1842BC0
760 | 38-39 | !opts | DSE (DeadStoreElimination) | sub_18F5480
880 | 44-45 | !opts | NVVMReflect | sub_1857160
920 | 46-47 | !opts | IPConstantPropagation | sub_185D600
960 | 48-49 | !opts | SimplifyCFG | sub_190BB10
1000 | 50-51 | !opts | InstCombine | sub_19401A0
1040 | 52-53 | !opts | Sink / SimplifyCFG (early) | sub_1869C50
1080 | 54-55 | !opts | PrintModulePass (dump IR) | sub_17060B0
1120 | 56-57 | !opts | NVVMPredicateOpt | sub_18A3430
1160 | 58-59 | !opts | LoopIndexSplit | sub_1952F90
1200 | 60-61 | !opts | SimplifyCFG (tier guard) | sub_190BB10
1240 | 62-63 | !opts | LICM | sub_195E880
1280 | 64-65 | !opts | Reassociate / Sinking | sub_1B7FDF0
1320 | 66-67 | !opts | ADCE (AggressiveDeadCodeElimination) | sub_1C76260
1360 | 68-69 | !opts | LoopUnroll | sub_19C1680
1400 | 70-71 | !opts | SROA | sub_1968390
1440 | 72-73 | !opts | EarlyCSE | sub_196A2B0
1480 | 74-75 | !opts | ADCE extra guard | sub_1C76260
1520 | 76-77 | !opts | LoopSimplify | sub_198DF00
1640 | 82-83 | !opts | NVVMWarpShuffle | sub_1C7F370
1680 | 84-85 | !opts | NVIDIA pass (early) | sub_19CE990
1760 | 88-89 | !opts | MemorySpaceOpt (primary) | sub_1C8E680
1840 | 92-93 | !opts | ADCE variant | sub_1C6FCA0
1960 | 98-99 | !opts | ConstantMerge / GlobalDCE | sub_184CD60
2000 | 100-101 | !opts | NVVMIntrinsicLowering | sub_1CB4E40
2040 | 102-103 | !opts | MemCpyOpt | sub_1B26330
2080 | 104-105 | !opts | BranchDist gate A | sub_1CB73C0
2120 | 106-107 | !opts | BranchDist gate B | sub_1CB73C0
2160 | 108-109 | !opts | NVVMPredicateOpt variant | sub_18A3090
2200 | 110-111 | !opts | GenericToNVVM | sub_1A02540
2240 | 112-113 | !opts | NVVMLowerAlloca gate A | sub_1CBC480
2280 | 114-115 | !opts | NVVMLowerAlloca gate B | sub_1CBC480
2320 | 116-117 | !opts | NVVMRematerialization | sub_1A13320
2360 | 118-119 | !opts | NVVMUnreachableBlockElim | sub_1CC3990
2400 | 120-121 | !opts | NVVMReduction | sub_1CC5E00
2440 | 122-123 | !opts | NVVMSinking2 | sub_1CC60B0
2560 | 128-129 | !opts | NVVMGenericAddrOpt | sub_1CC71E0
2600 | 130-131 | !opts | NVVMIRVerification | sub_1A223D0
2640 | 132-133 | !opts | LoopOpt / BarrierOpt | sub_18B1DE0
2680 | 134-135 | !opts | MemorySpaceOpt (second invocation) | sub_1C8E680
2720 | 136-137 | !opts | InstructionSimplify | sub_1A7A9F0
2760 | 138-139 | !opts | LoopUnswitch variant | sub_19B73C0
2840 | 141 | opts | ADCE (enabled by default, default=1) | sub_1C6FCA0
2880 | 143 | opts | LICM (enabled by default, default=1) | sub_195E880
2920 | 145 | value | LowerBarriers parameter | sub_1C98270
3000 | 150-151 | opts | Early pass guard | sub_18FD350
3040 | 151 | opts | CorrelatedValuePropagation (default=1) | sub_18EEA90
3080 | 153 | opts | NVIDIA-specific loop pass | sub_1922F90
3120 | 155 | opts | MemorySpaceOpt second-pass enable (default=1) | sub_1C8E680
3160 | 157 | opts | PrintModulePass enable (default=1) | sub_17060B0
3200 | 159 | opts | Optimization-level gate (default=1) | --
3328 | 165 | opts | Late-pipeline enable block (default=1) | multiple
3488 | 174-175 | opts | NVVMBarrierAnalysis + LowerBarriers enable | sub_18E4A00
3648 | 181 | string | Language string ("ptx"/"mid") | path dispatch
3704 | 185 | opts | Late optimization flag | sub_1C8A4D0
3904 | 193 | opts | Debug / verification mode | sub_12D3E60
3944 | 195 | opts | Basic block naming ("F%d_B%d") | sprintf
3984 | 197 | value | Integer limit (default=20) | --
4064 | 201 | value | Concurrent compilation override | sub_12D4250
4104 | 203 | value | Thread count (default=-1, auto-detect) | sub_12E7E70
4144 | 205 | value | Thread count fallback (default=-1) | sub_12E7E70
4184 | 207 | value | Integer parameter (default=-1) | --
4224 | 209 | opts | Optimization enabled flag | tier dispatch
4304 | 213 | opts | Device-code flag | Pipeline B
4344 | 215 | value | Integer counter (default=0) | --
4384 | 217 | opts | Fast-compile bypass flag | Pipeline B dispatch
4464 | 221 | !opts | Late CFG cleanup guard | sub_1654860

Known Option Names

Option names are stored in the PassOptionRegistry hash table, not in sub_12D6300 itself. The following names are extracted from binary string references in global constructors and pass factories:

Boolean Toggles (do-X / no-X)

Name | Likely Slot Region | Default
do-ip-msp | MemorySpaceOpt area | enabled
do-clone-for-ip-msp | MemorySpaceOpt variant | --
do-licm | offset 2880 (slot 143) | 1 (enabled)
do-remat | offset 2320 (slot 117) | enabled
do-cssa | CSSA pass area | --
do-scev-cgp | SCEV-CGP area | --
do-function-scev-cgp | function-level SCEV-CGP | --
do-scev-cgp-aggresively [sic] | aggressive SCEV-CGP mode | --
do-base-address-strength-reduce | BaseAddrSR area | --
do-base-address-strength-reduce-chain | BaseAddrSR chain variant | --
do-comdat-renaming | COMDAT pass | --
do-counter-promotion | PGO counter promotion | --
do-lsr-64-bit | 64-bit loop strength reduction | --
do-sign-ext-expand | sign extension expansion | --
do-sign-ext-simplify | sign extension simplification | --

Dump/Debug Toggles

Name | Purpose
dump-ip-msp | Dump IR around MemorySpaceOpt
dump-ir-before-memory-space-opt | IR dump pre-MSP
dump-ir-after-memory-space-opt | IR dump post-MSP
dump-memory-space-warnings | MSP diagnostic warnings
dump-remat / dump-remat-add / dump-remat-iv / dump-remat-load | Rematerialization diagnostics
dump-branch-dist | Branch distribution diagnostics
dump-scev-cgp | SCEV-CGP diagnostics
dump-base-address-strength-reduce | BaseAddrSR diagnostics
dump-sink2 | Sinking2 diagnostics
dump-before-cssa | CSSA input dump
dump-phi-remove | PHI removal diagnostics
dump-normalize-gep | GEP normalization dump
dump-simplify-live-out | Live-out simplification dump
dump-process-restrict | Process-restrict dump
dump-process-builtin-assume | Builtin assume processing dump
dump-conv-dot / dump-conv-func / dump-conv-text | Convergence analysis dumps
dump-nvvmir | NVVM IR dump
dump-va | Value analysis dump

Parametric Knobs

Name | Default | Purpose
remat-for-occ | 120 | Occupancy target for rematerialization
remat-gep-cost | 6000 | GEP rematerialization cost threshold
remat-lli-factor | 10 | Long-latency instruction factor
remat-max-live-limit | 10 | Maximum live range limit for remat
remat-single-cost-limit | -- | Single-instruction remat cost limit
remat-loop-trip | -- | Loop trip count for remat decisions
remat-use-limit | -- | Use count limit for remat candidates
remat-maxreg-ceiling | -- | Register ceiling for remat
remat-move | -- | Remat move control
remat-load-param | -- | Parameter load remat control
remat-ignore-single-cost | -- | Ignore single-cost heuristic
branch-dist-block-limit | -1 | Max blocks for branch distribution (-1 = unlimited)
branch-dist-func-limit | -1 | Max functions for branch distribution
branch-dist-norm | 0 | Branch distribution normalization mode
scev-cgp-control | -- | SCEV-CGP mode selector
scev-cgp-norm | -- | SCEV-CGP normalization
scev-cgp-check-latency | -- | Latency check threshold
scev-cgp-cross-block-limit | -- | Cross-block limit
scev-cgp-idom-level-limit | -- | Immediate dominator level limit
scev-cgp-inst-limit | -- | Instruction count limit
scev-cgp-old-base | -- | Old base address mode
scev-cgp-tid-max-value | -- | Thread ID max value
base-address-strength-reduce-iv-limit | -- | IV limit for base addr SR
base-address-strength-reduce-max-iv | -- | Max IV count
cssa-coalesce | -- | CSSA coalescing mode
cssa-verbosity | -- | CSSA diagnostic verbosity
memory-space-opt-pass | -- | MSP pass variant selector
peephole-opt | -- | Peephole optimizer control
loop-index-split | -- | Loop index split control
va-use-scdg | -- | Value analysis SCDG mode
nvvm-peephole-optimizer | -- | NVVM peephole enable
nvvm-intr-range | -- | Intrinsic range analysis control

Differences from Upstream LLVM

Upstream LLVM has nothing resembling this system. The closest analogue is the cl::opt<T> flag mechanism, but that scatters configuration across hundreds of global variables that each pass reads independently. The differences are architectural:

Aspect | Upstream LLVM | cicc NVVMPassOptions
Storage | ~1,689 scattered cl::opt globals in BSS | Single 4,512-byte contiguous struct
Initialization | Global constructors register each flag | One 125KB function flattens all 221 slots
Access pattern | Each pass reads its own globals | Pipeline assembler reads all slots centrally
Copyability | Not designed for copying | Struct is trivially memcpy-able for Phase I/II
Thread safety | Global cl::opt requires careful coordination | Each thread gets its own struct copy
Override mechanism | cl::opt command-line parser | PassOptionRegistry hash table with fallback defaults
Pass gating | Pass decides internally whether to run | Pipeline assembler decides before constructing pass

The thread-safety property is crucial for the two-phase concurrent compilation model. When Phase II runs per-function compilation in parallel threads, each thread receives a copy of the NVVMPassOptions struct. If NVIDIA used upstream cl::opt globals for pass configuration, they would need global locks or TLS for every option read during pass execution -- an unacceptable overhead for a GPU compiler that may process hundreds of kernels in a single translation unit.

Interaction with Two-Phase Compilation

The NVVMPassOptions struct is allocated and populated before Phase I begins, in the orchestrator sub_12E7E70:

// sub_12E7E70, line ~128
void* opts = malloc(4512);              // allocate NVVMPassOptions
sub_12D6300(opts, registry);            // populate from CLI-parsed registry
// ... pass opts to sub_12E54A0 for Phase I ...
// ... pass same opts to sub_12E54A0 for Phase II ...

Both phases receive the same opts pointer. Individual passes within the pipeline assembler check qword_4FBB3B0 (the TLS phase counter) to skip themselves in the wrong phase -- but the NVVMPassOptions struct itself does not change between phases. This means a pass cannot be enabled in Phase I but disabled in Phase II through NVVMPassOptions alone; phase selection is handled by the separate TLS mechanism.

The second caller, sub_12F4060 (TargetMachine creation in the standalone path), performs an identical allocation and initialization sequence, confirming that every compilation path goes through the same NVVMPassOptions infrastructure.

Function Map

Function | Address | Size | Role
NVVMPassOptions::init | sub_12D6300 | 125KB | Populate 221 slots from registry
PassOptionRegistry::lookupOption | sub_12D6170 | ~200B | Hash-table lookup by slot index
PassOptionRegistry::getBoolOption | sub_12D6240 | ~300B | Boolean resolution with default
writeStringOption | sub_12D6090 | ~150B | Write 24-byte string slot
writeBoolOption | sub_12D6100 | ~120B | Write 16-byte boolean slot
PassDefTable::getPassDef | sub_1691920 | ~80B | 64-byte stride table lookup
parseInt | sub_16D2BB0 | ~100B | String-to-int64 parser
toLowercase | sub_16D2060 | ~80B | String lowercasing for bool parse

Cross-References

Configuration Knobs

Three independent knob systems control compiler behavior: LLVM cl::opt flags (~1,496 unique), NVVMPassOptions (222 slots), and NVIDIA codegen knobs (~70).

LLVM cl::opt | 1,496 unique flags across 353 constructor files
NVVMPassOptions | 222 slots, initialized by sub_12D6300 (125KB)
Codegen knobs | ~70, parsed by sub_1C20170 / sub_CD9990 from NVVM container
BSS storage | 0x4F7FEA0-0x4FA5xxx (cl::opt), a1+0 to a1+4464 (PassOptions)
Dual PM | Same options registered for both Legacy PM (sub_C53080) and New PM (sub_16B8280)
NVIDIA-specific | 172 of 1,496 cl::opt flags (11.5%) are NVIDIA-added

Knob System 1: LLVM cl::opt

Registration Pattern

Every cl::opt follows this initialization sequence in a global constructor:

// Legacy PM path
InterlockedExchangeAdd64(sub_C523C0(), 1);   // atomic option counter
sub_C53080(&option, "option-name", strlen);   // set name
sub_C53130(&option);                          // finalize registration
__cxa_atexit(destructor, &option, &dso_handle);

// New PM path (parallel registration)
InterlockedExchangeAdd64(&unk_4FA0230, 1);
sub_16B8280(&option, "option-name", strlen);
sub_16B88A0(&option);
__cxa_atexit(destructor, &option, &dso_handle);

Each cl::opt<T> occupies ~224 bytes (0xE0) in BSS. Top constructors by option count: ctor_600 (30), ctor_433 (25), ctor_472 (24), ctor_609 (22), ctor_392 (22).

Category 1: Scalar Optimization (InstCombine + FP)

Constructor: ctor_165_0 at 0x4D0500 (11,731 bytes). Registers 12 NVIDIA-specific flags plus 4 standard LLVM flags.

Flag | Type | Default | BSS Addr | Purpose
split-gep-chain | bool | false | 0x4F901A8 | Split GEP chains to independent GEPs for better address mode selection
Disable-Add-to-Or | bool | true | -- | Disable add-to-or transformation (NVIDIA blocks this LLVM combine)
opt-use-fast-math | bool | false | -- | Enable aggressive FP simplification (set by -unsafe-math / -fast-math)
opt-use-prec-div | bool | true | -- | Use precise division (set by -prec-div=1; cleared by -prec-div=0)
opt-no-signed-zeros | bool | false | -- | Ignore signed zero distinction (set by -no-signed-zeros)
disable-fp-cast-opt | bool | false | -- | Disable FP-to-int and int-to-FP cast optimizations
reorder-sext-before-cnst-add | bool | false | -- | sext(add(a,CI)) to add(sext(a),CI) rewrite; hidden flag
disable-sink | bool | false | -- | Disable instruction sinking in InstCombine
partial-sink | bool | false | -- | Enable partial sinking of instructions
nvptx-rsqrt-approx-opt | bool | false | -- | Enable reciprocal sqrt approximation optimization
disable-rsqrt-opt | bool | false | -- | Disable reciprocal sqrt optimization entirely
check-vn | bool | false | -- | Verify value numbers after transformations (debug)

Standard LLVM flags in same constructor: expensive-combines (bool), instcombine-maxarray-size (int, default 1024), instcombine-visit (int), instcombine-lower-dbg-declare (bool).

Category 2: Inliner Heuristics

Constructor: ctor_186_0 at 0x4DBEC0 (14,109 bytes). Nine NVIDIA-specific flags governing the custom CGSCC inliner at sub_1864060.

Flag | Type | Default | Purpose
profuseinline | bool | false | Verbose inlining diagnostics (NVIDIA profuse framework, not PGO profuse)
inline-total-budget | int | none | Global total budget across all callers; unset = unlimited
nv-inline-all | bool | false | Force inline ALL function calls (used by OptiX ray tracing)
inline-budget | int | 20000 | Per-caller inlining cost budget; -aggressive-inline sets to 40000
inline-adj-budget1 | int | none | Secondary adjusted per-caller budget
inline-switchctrl | int | none | Tune heuristic for switch-containing callees
inline-numswitchfunc | int | none | Threshold for switch-heavy function penalty
inline-maxswitchcases | int | none | Max switch cases before inlining penalty kicks in
disable-inlined-alloca-merging | bool | false | Disable post-inline alloca merging into single frame slot

"none" means the knob is unset by default and the heuristic falls back to internal logic.

Category 3: GVN (Global Value Numbering)

Constructor: ctor_201 at 0x4E0990. Eleven knobs (8 NVIDIA-specific + 3 upstream).

Flag | Type | Default | BSS Addr | Purpose
profusegvn | bool | true | 0x4FAE7E0 | Verbose GVN diagnostics (unusually, defaults on)
gvn-dom-cache | bool | true | 0x4FAE700 | Cache dominator tree nodes; cache size = 32
max-recurse-depth | int | 1000 | 0x4FAE620 | Max recursion during value numbering (safety valve for template-heavy code)
enable-phi-remove | int | 2 | 0x4FAEC40 | PHI removal aggressiveness: 0=off, 1=trivial only, 2=post-leader substitution
dump-phi-remove | int | 0 | 0x4FAEB60 | Dump PHI removal decisions (debug)
no-split-stores-below | int | -1 | 0x4FAEA80 | Min store width for splitting (bits); -1 = no limit
no-split-stores-above | int | -1 | 0x4FAE9A0 | Max store width for splitting (bits); -1 = no limit
split-stores | bool | true | 0x4FAE8C0 | Master enable for NVIDIA store-splitting in GVN
enable-pre | bool | true | 0x4FAEEE0 | Enable Partial Redundancy Elimination (upstream LLVM)
enable-load-pre | bool | true | 0x4FAEE00 | Enable load PRE across edges (upstream LLVM)
enable-split-backedge-in-load-pre | bool | false | 0x4FAED20 | Allow backedge splitting during load PRE (upstream LLVM)

Store splitting uses a custom NVIDIA registrar (sub_190BE40) that takes a default-value pointer. Both limit knobs default to -1 = all sizes eligible.

Category 4: Loop Strength Reduction

Constructor: ctor_214_0 at 0x4E4B00. Eleven NVIDIA-specific LSR flags (69% NVIDIA customization rate).

Flag | Type | Default | Purpose
disable-unknown-trip-lsr | bool | false | Disable LSR for loops with unknown trip count
lsr-check-rp | bool | true [MEDIUM] | Check register pressure before applying LSR
lsr-rp-limit | int | ~32-64 [LOW] | Skip LSR entirely when RP exceeds this limit (occupancy cliff)
filter-bad-formula | bool | true [MEDIUM] | Filter out poor-quality LSR formulae early
do-lsr-64-bit | bool | arch-dependent | Enable 64-bit loop strength reduction (false on sm_3x-5x, true on sm_70+)
count-sxt-opt-for-reg-pressure | bool | true [MEDIUM] | Factor sign-extension elimination savings into RP analysis
lsr-sxtopt | bool | true [MEDIUM] | Perform sign-extension elimination within LSR
lsr-loop-level | int | 0 | Apply LSR only at specific loop nesting level (0 = all levels)
lsr-skip-outer-loop | bool | false | Ignore outer-loop induction variables in LSR
disable-lsr-for-sharedmem32-ptr | bool | false | Disable LSR for 32-bit shared memory pointers (GPU-specific)
disable-lsr-complexity-discount | bool | false | Disable complexity estimation discount heuristic

Standard LLVM LSR flags in same constructor: enable-lsr-phielim, lsr-insns-cost, lsr-exp-narrow, lsr-filter-same-scaled-reg, lsr-fix-iv-inc.

Category 5: IndVarSimplify

Constructor: ctor_203_0 at 0x4E1CD0 (7,007 bytes).

Flag | Type | Default | Purpose
Disable-unknown-trip-iv | bool | false | Disable IV substitution for unknown-trip-count loops
iv-loop-level | int | none | Control which loop nesting levels get IV substitution

Category 6: SimplifyCFG

Constructor: ctor_243_0 at 0x4ED0C0.

Flag | Type | Default | Purpose
disable-jump-threading | bool | false | Disable jump threading (for OCG experiments)
fold-with-var-cond | bool | false | Fold branches with variance-based conditions

Category 7: NVPTX Backend Math/Scheduling

Constructor: ctor_607 at 0x584B60 (13,700 bytes). Core numeric precision and FMA controls. Defaults are set by the CLI flag routing in sub_9624D0, not by the cl::opt constructors.

Flag | Type | CLI Default | Purpose
nvptx-sched4reg | bool | false | Schedule for register pressure (key NVPTX strategy)
nvptx-fma-level | int | 1 | FMA contraction: 0=off, 1=on, 2=aggressive. CLI -fma=1 is default
nvptx-prec-divf32 | int | 1 | F32 div precision: 0=approx, 1=full, 2=IEEE rnd+ftz, 3=IEEE no-ftz
nvptx-prec-sqrtf32 | int | 1 | Sqrt precision: 0=approx, 1=rn. CLI -prec-sqrt=1 is default
nvptx-approx-log2f32 | bool | false | Use lg2.approx for log2 (only set by -unsafe-math)
nvptx-force-min-byval-param-align | bool | false | Force 4-byte minimum alignment for byval parameters
nvptx-normalize-select | bool | false | Override shouldNormalizeToSelectSequence in TLI
enable-bfi64 | bool | false | Enable 64-bit BFI (bit-field insert) instructions

Note: These cl::opt knobs have no explicit default in their constructor (they init to 0/false). The effective defaults come from the CLI flag catalog: -fma=1 routes -nvptx-fma-level=1, -prec-div=1 routes -nvptx-prec-divf32=1, -prec-sqrt=1 routes -nvptx-prec-sqrtf32=1.

Category 8: NVPTX Backend Passes/Features

Constructor: ctor_609_0 at 0x585D30 (22 options total, largest NVPTX constructor).

Flag | Type | Default | Purpose
disable-nvptx-load-store-vectorizer | bool | false | Disable load/store vectorizer
disable-nvptx-require-structured-cfg | bool | false | Turn off structured CFG requirement (transitional)
nvptx-short-ptr | bool | false | 32-bit pointers for const/local/shared address spaces
nvptx-enable-machine-sink | bool | false | Enable machine-level instruction sinking
enable-new-nvvm-remat | bool | true | Enable new NVVM rematerialization engine
nv-disable-remat | bool | false | Disable all rematerialization passes
nv-disable-mem2reg | bool | false | Disable machine-IR mem2reg promotion
nv-disable-scev-cgp | bool | true | Disable SCEV-based address mode optimization (on = disabled)
nvptx-32-bit-smem | bool | false | Use 32-bit pointers for shared address space
nvptx-exit-on-unreachable | bool | true | Lower unreachable as PTX exit instruction
nvptx-early-byval-copy | bool | false | Create copy of byval function args in entry block
enable-nvvm-peephole | bool | true | Enable NVVM peephole optimizer
no-reg-target-nvptxremat | bool | false | Only run old remat on kernels without register targets
lower-func-args | bool | true | Lower large aggregate function parameters to copies
enable-sink | bool | true | Enable LLVM sinking pass
disable-post-opt | bool | false | Disable IR optimizations in post-opt phase
usedessa | int | 2 | deSSA method: 0=off, 1=basic, 2=full
ldg | bool | true | Load-via-texture (ld.global.nc) constant transform

Category 9: NVPTX Backend Extended

Constructor: ctor_610 at 0x5888A0 (7,400 bytes).

Flag | Type | Default | Purpose
unroll-assumed-size | int | 4 | Assumed element count for unknown-size local arrays during unroll analysis
enable-loop-peeling | bool | false | Enable loop peeling transformation
enable-256-bit-load-store | bool | false | Enable 256-bit (32-byte) vector load/store generation
ias-param-always-point-to-global | bool | false | Assume function parameter pointers always point to global memory
ias-strong-global-assumptions | bool | false | Stronger assumption: constant-buffer pointers resolve to globals
ias-wmma-memory-space-opt | bool | false | Enable MemorySpaceOpt specialization for WMMA/tensor operations

Category 10: Memory Space Optimization

Scattered across ctor_264, ctor_267_0, ctor_528, ctor_531_0. See MemorySpaceOpt and IPMSP for the full algorithm.

Flag | Type | Default | Purpose
mem-space-alg | int | 2 | Switch between MSO algorithm variants
dump-ir-before-memory-space-opt | bool | false | Dump IR before MSO
dump-ir-after-memory-space-opt | bool | false | Dump IR after MSO
track-indir-load | bool | true | Track indirect loads during MSO dataflow
track-int2ptr | bool | true | Track IntToPtr casts in MSO
param-always-point-to-global | bool | true | Kernel parameter pointers always point to global memory
devicefn-param-always-local | bool | false | Treat parameter space as local in device functions
ignore-address-space-check | bool | false | Ignore address-space checks during branch distribution
sink-into-texture | int | 3 | Sink loads into texture blocks: 0=off, 1=cross-block, 2=+intra, 3=+outside-only. See also Category 14
ldg | bool | true | Load Global Constant Transform (ld.global.nc)
do-clone-for-ip-msp | int | -1 | Function cloning limit for IP-MSP (-1 = unlimited, 0 = disable)
dump-ip-msp | bool | false | Dump interprocedural MSP info
lower-read-only-devicefn-byval | bool | false | Handle byval attribute of args to read-only device functions
reuse-lmem-very-long-live-range | int | -- | Threshold for very-long live range in local memory reuse
hoist-load-param | bool | false | Generate all ld.param in entry basic block
sink-ld-param | bool | false | Sink one-use ld.param to use point
process-alloca-always | bool | true | Treat alloca as definite local (AS 5) regardless of context
wmma-memory-space-opt | bool | true | Enable memory space optimization for WMMA operations
strong-global-assumptions | bool | true | Assume const buffer pointers always point to globals
process-builtin-assume | bool | -- | Process __builtin_assume(__is*(p)) assertions for space deduction

Category 11: Rematerialization

Scattered across ctor_609_0, ctor_362, ctor_277_0, ctor_361_0, and others. See Rematerialization for full algorithm detail.

IR-Level Knobs (ctor_277_0 at 0x4F7BE0)

Flag | Type | Default | Global | Purpose
do-remat | int | 3 | dword_4FC05C0 | Master control. 0=off, 1=conservative, 2=normal, 3=full
no-remat | string | (empty) | qword_4FC0440 | Comma-separated function exclusion list
remat-iv | int | 4 | dword_4FBFB40 | IV demotion level. 0=off, 4=full
remat-load | int | 1 | dword_4FBFA60 | Load rematerialization. 0=off, 1=on
remat-add | int | 0 | dword_4FBF980 | Add/GEP factoring. 0=off
remat-single-cost-limit | int | 6000 | dword_4FC0080 | Max cost per single live-in reduction
remat-loop-trip | int | 20 | dword_4FBFFA0 | Default assumed loop trip count
remat-gep-cost | int | 6000 | dword_4FBFEC0 | Max cost for GEP rematerialization
remat-use-limit | int | 10 | dword_4FBFDE0 | Max number of uses for a candidate
remat-max-live-limit | int | 10 | dword_4FBFD00 | Max live-in limit for rematerialization
remat-maxreg-ceiling | int | 0 | dword_4FBF600 | Register ceiling (0 = uncapped)
remat-for-occ | int | 120 | dword_4FBF8A0 | Occupancy-driven rematerialization target
remat-lli-factor | int | 10 | dword_4FC0320 | Long-latency instruction cost factor
remat-ignore-single-cost | bool | false | byte_4FBFC20 | Bypass per-value cost filter
remat-move | bool | false | byte_4FC0400 | Remat move instructions
simplify-live-out | int | 2 | dword_4FBF520 | NLO level. 0=off, 2=full
dump-remat | int | 0 | dword_4FC0240 | Debug dump level (0-4+)
dump-remat-iv | int | 0 | dword_4FC0160 | IV remat debug dump
dump-remat-load | int | 0 | dword_4FBF720 | Load remat debug dump
dump-remat-add | int | 0 | dword_4FBF640 | Add remat debug dump
dump-simplify-live-out | bool | false | byte_4FBF400 | NLO debug dump

Machine-Level Knobs (ctor_361_0 at 0x5108E0)

Flag | Type | Default | Global | Purpose
nv-remat-block | int | 14 | dword_4FD3820 | Bitmask controlling remat modes (bits 0-3)
nv-remat-max-times | int | 10 | dword_4FD3740 | Max outer loop iterations
nv-remat-block-single-cost | int | 10 | dword_4FD3660 | Max cost per single live value pull-in
nv-remat-block-map-size-limit | int | 6 | dword_4FD3580 | Map size limit for single pull-in
nv-remat-block-max-cost | int | 100 | dword_4FD3040 | Max total clone cost per live value reduction
nv-remat-block-liveout-min-percentage | int | 70 | dword_4FD3120 | Min liveout % for special consideration
nv-remat-block-loop-cost-factor | int | 20 | unk_4FD3400 | Loop cost multiplier
nv-remat-default-max-reg | int | 70 | unk_4FD3320 | Default max register pressure target
nv-remat-block-load-cost | int | 10 | unk_4FD2EC0 | Cost assigned to load instructions
nv-remat-threshold-for-spec-reg | int | 20 | unk_4FD3860 | Threshold for special register remat
nv-dump-remat-block | bool | false | byte_4FD2E80 | Debug dump toggle
nv-remat-check-internal-live | bool | false | byte_4FD2DA0 | Check internal liveness during MaxLive
max-reg-kind | int | 0 | qword_4FD2C20 | Kind of max register pressure info
no-mi-remat | string | (empty) | qword_4FD2BE0 | Skip machine-level remat for named functions
load-remat | bool | true | word_4FD32F0 | Enable load rematerialization
vasp-fix1 | bool | false | word_4FD3210 | VASP fix for volatile/addsp

General Remat Knobs (ctor_609_0, ctor_362, and others)

| Flag | Type | Default | Purpose |
|---|---|---|---|
| nv-disable-remat | bool | false | Disable all remat passes |
| enable-new-nvvm-remat | bool | true | Enable new NVVM remat engine (disables old) |
| no-reg-target-nvptxremat | bool | false | Only old remat for kernels without register targets |
| fp-remat | bool | false | Allow rematerializing floating-point instructions |
| high-cost-remat | bool | false | Allow rematerializing high-cost instructions |
| cost-threshold-remat | int | | Cost threshold per remat action |
| block-freq-cap-remat | int | | Maximum raw block frequency value |
| block-freq-norm-range-remat | int | | Normalization range for block frequency in remat cost |
| collect-candidate-scale-remat | int | | Scaling ratio for high-RP candidate collection |
| incremental-update-remat | bool | false | Incrementally update RP analysis after each remat |
| verify-update-remat | bool | false | Debug: verify incremental update vs full analysis |
| print-verify-remat | bool | false | Debug: print problematic RP on verification failure |
| rp-remat | int | | Debug: set a target register pressure number |
| late-remat-update-threshold | int | | Threshold for copy with many other copy uses |
| remat-load-param | bool | false | Support rematerializing constant ld.param not in NVVM IR |

Category 12: SCEV-CGP (Address Mode Optimization)

Eleven NVIDIA-specific knobs for SCEV-based CodeGenPrepare. See CodeGenPrepare.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| do-scev-cgp | bool | false | Enable SCEV-based CodeGenPrepare |
| do-scev-cgp-aggresively | bool | false | Aggressive SCEV-CGP mode [sic] |
| do-function-scev-cgp | bool | false | Function-level SCEV-CGP |
| nv-disable-scev-cgp | bool | true | Disable SCEV address mode optimization (master kill switch, on by default) |
| scev-cgp-control | int | | Control max transformations applied |
| scev-cgp-cross-block-limit | int | | Max common-base expressions from a single block |
| scev-cgp-idom-level-limit | int | | Max dominator tree levels to walk |
| scev-cgp-inst-limit | int | | Max instructions for a single parameter |
| scev-cgp-old-base | bool | false | Force SCEV-CGP to create new base (vs reusing old) |
| scev-cgp-tid-max-value | int | | Max value of thread ID in SCEV expressions |
| print-after-scev-cgp | bool | false | Print function after SCEV-CGP phase |

Category 13: Branch Distribution

Seven NVIDIA-specific flags.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| branch-dist-block-limit | int | | Max blocks to apply branch distribution |
| branch-dist-func-limit | int | | Max functions to apply branch distribution |
| branch-dist-norm | int | | Normalization control |
| no-branch-dist | string | | Comma-separated list of functions to skip |
| disable-complex-branch-dist | bool | false | Disable complex branch distribution |
| dump-branch-dist | bool | false | Dump branch distribution info |

Category 14: Sinking / Code Motion

Thirteen knobs across multiple constructors. See Sinking2 for the NVIDIA-custom texture-aware sinking pass.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| sink-into-texture | int | 3 | Texture sinking aggressiveness: 0=off, 1=cross-block, 2=+intra, 3=+outside-only |
| sink-limit | int | 20 | Max instructions to sink per Sinking2 invocation (complexity limiter) |
| dump-sink2 | bool | false | Debug dump for Sinking2 pass |
| sink-check-sched | bool | true | Check scheduling effects of sinking (stock Sink) |
| sink-single-only | bool | true | Only sink single-use instructions (stock Sink) |
| enable-andcmp-sinking | bool | false | Sink and/cmp sequences into branches |
| aggressive-no-sink | bool | false | Sink all generated instructions |
| max-uses-for-sinking | int | | Don't sink instructions with too many uses |
| rp-aware-sink | bool | false | Consider register pressure impact when sinking |
| instcombine-code-sinking | bool | false | Enable code sinking within InstCombine |
| hoist-const-stores | bool | false | Hoist loop-invariant stores |

Category 15: Register Pressure / Allocation

NVIDIA-specific knobs plus LLVM greedy allocator knobs. See Register Allocation for the full algorithm.

NVIDIA RP Knobs

| Flag | Type | Default | Purpose |
|---|---|---|---|
| maxreg | int | none | Maximum register count (--maxrregcount equivalent) |
| register-usage-level | int | | Register usage level control |
| cta-reconfig-aware-mrpa | bool | false | CTA reconfiguration-aware machine RP analysis |
| cta-reconfig-aware-rpa | bool | false | CTA reconfiguration-aware RP analysis |
| pred-aware-mcse | bool | false | Predicate-aware MachineCSE |
| rp-aware-mcse | bool | false | Register-pressure-aware MachineCSE |
| verify-update-mcse | bool | false | Debug: verify incremental RP update in MachineCSE |
| incremental-update-mcse | bool | true | Incrementally update register pressure analysis in MachineCSE |
| print-verify | bool | false | Print problematic RP info if MCSE verification fails |
| pred-target-adjust | int | 0 | Predicate register target adjustment (-10 to +10) |
| donot-insert-dup-copies | bool | false | Skip duplicate copies to predecessor basic block |
| nv-disable-mem2reg | bool | false | Disable machine-level mem2reg |

LLVM Greedy Allocator Knobs

| Flag | Type | Default | Purpose |
|---|---|---|---|
| split-spill-mode | int | 1 | Spill mode: 0=default, 1=size, 2=speed |
| lcr-max-depth | int | 5 | Last chance recoloring max recursion depth |
| lcr-max-interf | int | 8 | Last chance recoloring max interferences |
| exhaustive-register-search | bool | false | Bypass LCR depth/interference cutoffs |
| enable-deferred-spilling | bool | false | Defer spill code to end of allocation |
| grow-region-complexity-budget | int | 10000 | growRegion() edge budget for live range splitting |
| split-threshold-for-reg-with-hint | int | 75 | Split threshold percentage for hinted registers |

Category 16: Restrict / Aliasing

Five NVIDIA-specific flags. See Alias Analysis.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| process-restrict | bool | false | Process __restrict__ keyword for alias analysis |
| allow-restrict-in-struct | bool | false | Allow __restrict__ inside struct members |
| apply-multi-level-restrict | bool | false | Apply restrict to all pointer levels |
| dump-process-restrict | bool | false | Debug dump during restrict processing |
| strict-aliasing | bool | false | Datatype-based strict aliasing |

Category 17: CSSA / deSSA

Four knobs. See CSSA.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| cssa-coalesce | int | | Control PHI operand coalescing strategy |
| cssa-verbosity | int | 0 | Verbosity level |
| dump-before-cssa | bool | false | Dump specific PHI operands being coalesced |
| usedessa | int | 2 | deSSA method: 0=off, 1=basic, 2=full |

Category 18: Loop / Unrolling

Eight NVIDIA-specific knobs (beyond the 20+ standard LLVM loop-unrolling flags).

| Flag | Type | Default | Purpose |
|---|---|---|---|
| nv-disable-loop-unrolling | bool | false | Disable loop unrolling in all passes |
| aggressive-runtime-unrolling | bool | false | OCG-style unrolling heuristics |
| aggressive-runtime-unrolling-fixed-factor | int | | Force fixed unroll factor |
| aggressive-runtime-unrolling-max-factor | int | | Maximum unroll factor |
| aggressive-runtime-unrolling-max-filler-instructions-per-batch | int | | Max filler instructions |
| unroll-runtime-nv-expensive | bool | false | NVIDIA heuristics for expensive loops |
| unroll-runtime-convergent | bool | false | Allow unrolling with convergent instructions |
| track-trip-count-more | bool | false | Track loop trip count more aggressively |

Category 19: GEP / Address Strength Reduction

Eight NVIDIA-specific knobs. See Base Address Strength Reduction.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| normalize-gep | bool | false | Normalize 64-bit GEP subscripts |
| dump-normalize-gep | bool | false | Debug dump for GEP normalization |
| do-base-address-strength-reduce | int | 0 | Two levels: 1=unconditional, 2=with conditions |
| dump-base-address-strength-reduce | bool | false | Debug dump |
| do-lsr-64-bit | bool | false | Loop strength reduction for 64-bit (shared with LSR) |
| do-sign-ext-expand | bool | false | Expand sign-extension during SCEV build |
| balance-dot-chain | bool | false | Balance chain of dot operations |
| special-reassociate-for-threadid | bool | false | Don't move back expressions containing thread ID |

Category 20: Aggregate / Byval Lowering

Ten knobs.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| aggressive-max-aggr-lower-size | int | | Threshold size for lowering aggregates |
| aggressive-lsv | bool | false | Merge smaller dtypes in aggregate before vectorization |
| vect-split-aggr | bool | false | Split aggregates before vectorization |
| lower-aggr-unrolled-stores-limit | int | | Limit stores in unrolled aggregate lowering |
| large-aggr-store-limit | int | | Create loops for aggregate stores exceeding the limit |
| lower-func-args | bool | true | Lower large aggregate function parameters |
| lsa-opt | bool | false | Optimize copying of struct args to local memory |
| skiploweraggcopysafechk | bool | false | Skip safety check in loweraggcopy |
| memdep-cache-byval-loads | bool | true | Preprocess byval loads to reduce compile time |
| ldstmemcpy-glue-max | int | | Limit for gluing ld/st of memcpy |

Category 21: Normalization / Canonicalization

Four knobs.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| norm-fold-all | bool | false | Fold all regular instructions |
| norm-preserve-order | bool | false | Preserve original instruction order |
| norm-rename-all | bool | false | Rename all instructions |
| norm-reorder-operands | bool | false | Sort/reorder operands in commutative operations |

Category 22: NVVM Infrastructure

Five knobs.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| nvvm-lower-printf | bool | false | Enable printf lowering |
| nvvm-reflect-enable | bool | true | NVVM reflection (reads __CUDA_FTZ, __CUDA_PREC_DIV, etc.) |
| nvvm-verify-show-info | bool | false | Info messages during NVVM verification |
| enable-nvvm-peephole | bool | true | NVVM peephole optimizer |
| nv-ocl | bool | false | Deprecated OpenCL compatibility flag |

Category 23: Compilation Control

Constructor: ctor_043_0 at 0x48D7F0 + ctor_028_0 at 0x489160.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| debug-compile | bool | false | Compile for debugging (set by -g) |
| generate-line-info | bool | false | Emit line info even without -G |
| nvptx-f32ftz | bool | false | Flush f32 subnormals to zero; hidden |
| w | bool | false | Disable warnings; hidden |
| Werror | bool | false | Treat warnings as errors; hidden |
| Osize | bool | false | Optimize for code size; hidden |
| Om | bool | false | Maximum optimization mode; hidden |
| maxreg | int | none | Maximum register count (no limit if unset) |
| nvptx-nan | bool | false | NaN handling control; hidden |
| jump-table-density | int | 10 | Minimum density (%) for jump table lowering |
| pass-control | int | -1 | Disable all optional passes after pass N; -1 = no limit |
| disable-passno | list | empty | Disable pass(es) by number (comma-separated) |
| sep-comp | bool | false | Separate compilation mode |
| proffile | string | | Filename for PGO profile information |
| R | string | | Resource constraint: name=<int> format |
| lnk-disable-allopts | bool | false | Disable all linker optimization passes |
| disable-peephole | bool | false | Disable peephole optimizer |
| disable-early-taildup | bool | false | Disable pre-regalloc tail duplication |

Category 24: Divergence / GPU Execution

Three flags.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| spec-exec-only-if-divergent-target | bool | false | Speculative execution only when target is divergent |
| prefer-predicated-reduction-select | bool | false | Prefer predicated reduction over after-loop select |
| openmp-opt-disable-barrier-elimination | bool | false | Disable OpenMP barrier elimination |

Category 25: MachinePipeliner (Swing Modulo Scheduling)

Eighteen LLVM-origin knobs for software pipelining. See Scheduling for the full algorithm.

| Flag | Type | Default | Global | Purpose |
|---|---|---|---|---|
| enable-pipeliner | bool | true | unk_503EE20 | Master switch for SMS |
| enable-pipeliner-opt-size | bool | false | qword_503ED40 | Enable SWP at -Os |
| pipeliner-max-mii | int | 27 | qword_503ECE8 | Maximum allowed MII |
| pipeliner-force-ii | int | 0 | qword_503EB80 | Force specific II (0 = auto) |
| pipeliner-max-stages | int | 3 | qword_503EB28 | Maximum pipeline stages |
| pipeliner-prune-deps | bool | true | qword_503E9C0 | Prune deps between unrelated Phi nodes |
| pipeliner-prune-loop-carried | bool | true | qword_503E8E0 | Prune loop-carried order deps |
| pipeliner-ignore-recmii | bool | false | qword_503E888 | Ignore RecMII; hidden |
| pipeliner-show-mask | bool | false | qword_503E720 | Debug: show scheduling mask |
| pipeliner-dbg-res | bool | false | qword_503E640 | Debug: resource usage |
| pipeliner-annotate-for-testing | bool | false | qword_503E5E8 | Annotate instead of codegen |
| pipeliner-experimental-cg | bool | false | qword_503E508 | Use peeling code generator |
| pipeliner-ii-search-range | int | 10 | qword_503E3A0 | Range to search for II |
| pipeliner-register-pressure | bool | false | qword_503E2C0 | Consider register pressure |
| pipeliner-register-pressure-margin | int | 5 | qword_503E1E0 | Margin % for reg pressure limit |
| pipeliner-mve-cg | bool | true | unk_503E100 | Use MVE code generator |
| pipeliner-enable-copytophi | bool | true | qword_503E020 | Enable CopyToPhi DAG Mutation |
| pipeliner-force-issue-width | int | 0 | qword_503DF40 | Force issue width (0 = auto) |

Category 26: LLVM Standard Inliner (Model B)

Seventeen LLVM-origin knobs from ctor_625_0 / ctor_715_0 at 0x58FAD0. These control the upstream InlineCostAnalysis::analyzeCall path; see Inliner Cost Model for why the NVIDIA custom model (Category 2) dominates in practice.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| inline-threshold | int | 225 | Base inlining threshold |
| inlinedefault-threshold | int | 225 | Default when no hint/profile |
| inlinehint-threshold | int | 325 | Threshold for functions with the inlinehint attribute |
| inline-cold-callsite-threshold | int | 45 | Threshold for cold callsites |
| inlinecold-threshold | int | 45 | Threshold for functions with cold attribute |
| hot-callsite-threshold | int | 3000 | Threshold for hot callsites (PGO) |
| locally-hot-callsite-threshold | int | 525 | Threshold for locally hot callsites |
| inline-instr-cost | int | 5 | Cost per instruction |
| inline-call-penalty | int | 25 | Penalty per callsite in callee |
| inline-memaccess-cost | int | 0 | Cost per load/store |
| inline-savings-multiplier | int | 8 | Multiplier for cycle savings |
| inline-savings-profitable-multiplier | int | 4 | Multiplier for profitability check |
| inline-size-allowance | int | 100 | Max callee size inlined without savings proof |
| inline-cost-full | bool | false | Compute full cost even when over threshold |
| inline-enable-cost-benefit-analysis | bool | false | Enable cost-benefit analysis |
| inline-deferral | bool | | Defer inlining in cold paths (PGO) |
| inline-remark-attribute | bool | false | Emit inline remarks |

Category 27: New PM CGSCC Inliner (Model C)

Two knobs for the New Pass Manager CGSCC inliner at 0x2613930. See Inliner Cost Model.

| Flag | Type | Default | Purpose |
|---|---|---|---|
| function-inline-cost-multiplier | int | | Penalize recursive function inlining |
| enable-ml-inliner | enum | default | ML advisory mode: default, development, release |

Knob System 2: NVVMPassOptions

222 pass option slots initialized by sub_12D6300 (125KB). Each slot is accessed by integer index (1--221) and stored in a ~4,480-byte struct.

Access Functions

| Function | Purpose |
|---|---|
| sub_12D6170(base+120, index) | Fetch pass option descriptor by index |
| sub_1691920(base+8, index) | Fetch pass option value from table |
| sub_12D6090(a1+offset, ...) | Store string-typed option |
| sub_12D6100(a1+offset, ...) | Store integer-typed option |
| sub_12D6240(a1, index, "0") | Get option with default value |

See NVVMPassOptions for the complete 222-slot inventory.

Knob System 3: NVIDIA Codegen Knobs

Parsed from the NVVM container format by sub_1C20170 and sub_CD9990. See NVIDIA Custom Passes for the complete inventory.

Hidden / Obfuscated Flags

Obfuscated Flag (ctor_043_0 at ~0x48EE80)

A 4-byte CLI flag name computed via XOR-based obfuscation from unk_3F6F7C7:

v40 = v37 ^ (-109 * ((offset + 97) ^ 0x811C9DC5));

Stored at qword_4F857C0 with flag bits 0x87 | 0x38 = hidden + really-hidden. NVIDIA deliberately hides this option from static analysis using FNV-1a-like constants.

Environment Variable Backdoors

| Variable | Purpose | Location |
|---|---|---|
| NVVMCCWIZ | Wizard mode (value 553282) -- unlocks -v, -keep, -dryrun, -lgenfe, -opt, -llc, -lnk, -libnvvm | sub_8F9C90 |
| bar | Extended debug pass registration | ctor_107_0 at 0x4A64D0 |
| NVVM_IR_VER_CHK | Override IR version check (set to "0" to disable) | sub_12BFF60 |
| LLVM_OVERRIDE_PRODUCER | Override bitcode producer string (default "7.0.1") | ctor_154 at 0x4CE640 |
| MALLOC_CONF | jemalloc allocator tuning | sub_12FCDB0 |
| LIBNVVM_DISABLE_CONCURRENT_API | Force single-threaded NVVM compilation | ctor_104 at 0x4A5810 |

CLI Defaults Set by Flag Routing

These are effective defaults applied by the flag catalog parser (sub_9624D0), not by cl::opt constructors. When no user flag is specified, the parser injects these:

| CLI flag | Default value | Routed cl::opt |
|---|---|---|
| -arch=compute_<N> | compute_75 (SM 75, Turing) | target architecture |
| -opt=<N> | 3 (O3) | optimization level |
| -ftz=<N> | 0 (no flush-to-zero) | nvptx-f32ftz |
| -prec-sqrt=<N> | 1 (precise) | nvptx-prec-sqrtf32=1 |
| -prec-div=<N> | 1 (precise) | nvptx-prec-divf32=1 (CUDA) / =0 (CL) |
| -fma=<N> | 1 (enabled) | nvptx-fma-level=1 |
| -opt-fdiv=<N> | 0 (off) | optimizer fast-div control |
| -Ofast-compile=<level> | 0 (off) | fast-compile pipeline |

NVIDIA Modification Density

| Subsystem | NVIDIA Knobs | LLVM Knobs | Customization Rate |
|---|---|---|---|
| LSR | 11 | 5 | 69% |
| InstCombine | 12 | 4 | 75% |
| Inliner (NVIDIA custom) | 9 | 0 | 100% |
| Inliner (LLVM standard) | 0 | 17 | 0% |
| GVN | 8 | 3 | 73% |
| NVPTX Backend | 30+ | 0 | 100% |
| SimplifyCFG | 2 | 8+ | 20% |
| Memory Space Opt | 20 | 0 | 100% |
| Rematerialization (IR) | 21 | 0 | 100% |
| Rematerialization (MI) | 16 | 0 | 100% |
| Rematerialization (General) | 15 | 0 | 100% |
| SCEV-CGP | 11 | 0 | 100% |
| Register Pressure | 12 | 7 | 63% |
| Sinking / Code Motion | 5 | 6 | 45% |
| MachinePipeliner | 0 | 18 | 0% |
| Vectorizer | 0 | 18+ | 0% |
| SCEV | 0 | 10+ | 0% |

Cross-References

Environment Variables

cicc v13.0 checks 22 distinct environment variables across 36 files containing getenv() calls. Six are NVIDIA-specific (two obfuscated), six come from the LLVM infrastructure, six from the EDG frontend, and the remainder from the build system, memory allocator, and shared ptxas/nvptxcompiler infrastructure. Two of the NVIDIA variables have their names encrypted in the .rodata section using an XOR+ROT13 cipher to prevent discovery through string scanning.

String Deobfuscation Engine

The deobfuscation function sub_8F98A0 at 0x8F98A0 decrypts variable names and option strings from .rodata ciphertext. The same engine is also used for hidden CLI option names (see CLI Flags).

Algorithm: sub_8F98A0

```c
// Reconstructed pseudocode — sub_8F98A0 (0x8F98A0)
// Inputs:
//   ciphertext — pointer to encrypted bytes in .rodata
//   base       — base address used as key seed (a2)
//   length     — number of bytes to decrypt
//   out        — caller-provided output buffer (the binary writes the
//                plaintext into the caller's stack frame)
// Output:
//   null-terminated plaintext string in out

#include <stdint.h>
#include <stddef.h>

void deobfuscate(const uint8_t *ciphertext, uintptr_t base,
                 size_t length, char *out) {
    for (size_t i = 0; i < length; i++) {
        uint8_t raw = ciphertext[i];

        // Phase 1: XOR with a key derived from the byte position
        uint32_t key = (uint32_t)(-109) *
                       (((uint32_t)i - (uint32_t)base + 97u) ^ 0xC5u);
        char ch = (char)(raw ^ (key & 0xFF));

        // Phase 2: ROT13 on alphabetic characters
        if (ch >= 'A' && ch <= 'Z')
            ch = (char)(((ch - 'A' + 13) % 26) + 'A');
        else if (ch >= 'a' && ch <= 'z')
            ch = (char)(((ch - 'a' + 13) % 26) + 'a');

        out[i] = ch;
    }
    out[length] = '\0';
}
```

Key constant: The multiplier -109 (signed, i.e. 0xFFFFFF93) and the XOR mask 0xC5 together form a position-dependent key stream. The ROT13 phase is applied after the XOR, meaning the plaintext must survive two transformations. This is a weak cipher by design -- it only needs to defeat strings(1) scanning, not serious cryptanalysis.
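Because ROT13 is its own inverse and XOR cancels, the cipher can be sanity-checked by composing the inverse transform (ROT13 first, then XOR with the same key stream) with the decryptor. A minimal self-contained sketch -- the helper names (`keystream`, `obfuscate`, `roundtrip_ok`) are ours, not recovered from the binary:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Position-dependent key byte: -109 * ((i - base + 97) ^ 0xC5), truncated. */
static uint8_t keystream(size_t i, uintptr_t base) {
    return (uint8_t)((uint32_t)(-109) *
                     (((uint32_t)i - (uint32_t)base + 97u) ^ 0xC5u));
}

static char rot13(char c) {
    if (c >= 'A' && c <= 'Z') return (char)('A' + (c - 'A' + 13) % 26);
    if (c >= 'a' && c <= 'z') return (char)('a' + (c - 'a' + 13) % 26);
    return c;  /* digits, '_' etc. pass through */
}

/* Inverse transform: ROT13 first, then XOR. The decryptor undoes these
 * in reverse order (XOR, then ROT13). */
static void obfuscate(const char *plain, uintptr_t base,
                      uint8_t *out, size_t len) {
    for (size_t i = 0; i < len; i++)
        out[i] = (uint8_t)rot13(plain[i]) ^ keystream(i, base);
}

static void deobfuscate(const uint8_t *cipher, uintptr_t base,
                        char *out, size_t len) {
    for (size_t i = 0; i < len; i++)
        out[i] = rot13((char)(cipher[i] ^ keystream(i, base)));
    out[len] = '\0';
}

/* Returns 1 if decrypt(encrypt(s)) == s. */
static int roundtrip_ok(const char *s, uintptr_t base) {
    uint8_t ct[64];
    char pt[65];
    size_t len = strlen(s);
    obfuscate(s, base, ct, len);
    deobfuscate(ct, base, pt, len);
    return strcmp(pt, s) == 0;
}
```

The round trip holds for any `base` value, which is why recovering the exact key seed only matters for decrypting the real .rodata bytes, not for understanding the scheme.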

Obfuscated String Table in .rodata

All obfuscated strings live in a contiguous region near 0x3C23A7B--0x3C23AD6. Each entry is referenced by its end byte address (the deobfuscator walks backward):

| End address | Length | Decrypted plaintext | Purpose |
|---|---|---|---|
| byte_3C23AD6 | 14 | (option prefix) | CLI option prefix for -nvvm-version matching |
| byte_3C23AC3 | 11 | "nvvm-latest" | Option suffix; sets v253 = 1 (Path A) |
| byte_3C23AB4 | 6 | "nvvm70" | Option suffix; sets v253 = 0 (Path B) |
| byte_3C23AAD | 13 | (option name) | Option name for error message display |
| byte_3C23A9F | 15 | "NV_NVVM_VERSION" | Environment variable name for getenv() |
| byte_3C23A82 | 6 | "nvvm70" | Env var value comparison string |
| byte_3C23A7B | 11 | "nvvm-latest" | Env var value comparison string |

Additional encrypted copies exist at 0x42812C0 and 0x42812F0 for the two env var names (NV_NVVM_VERSION and LIBNVVM_NVVM_VERSION) used by sub_12B9F70.

Obfuscated CLI Flag (ctor_043)

A separate obfuscation instance at ~0x48EE80 in ctor_043 (0x48D7F0) decrypts a 4-byte hidden cl::opt name from data at unk_3F6F7C7. The algorithm variant uses FNV-1a-like constants:

v40 = v37 ^ (-109 * ((offset + 97) ^ 0x811C9DC5));

The 0x811C9DC5 constant is the FNV-1a 32-bit offset basis. The resulting 4-character option is registered with flag bits 0x87 | 0x38 = hidden + really-hidden, making it invisible even to --help-hidden. This is stored at qword_4F857C0.

NVIDIA-Specific Variables

NVVMCCWIZ

| Property | Value |
|---|---|
| Checked in | sub_8F9C90 (real main) at 0x8F9C90, specifically 0x8F9D36 |
| Expected value | "553282" (magic number = 0x87142) |
| Effect | Sets byte_4F6D280 = 1 -- unlocks developer/wizard mode |

Mechanism: The value is parsed via strtol(v, 0, 10) and compared against the integer 553282. Any other value is silently ignored.

What wizard mode does: When byte_4F6D280 = 1:

  • -v flag actually enables verbose output (v259 = byte_4F6D280 instead of 0)
  • -keep flag actually preserves intermediate files (v262 = byte_4F6D280)
  • -dryrun flag enables verbose as a side effect (v259 = byte_4F6D280)
  • -lnk and -opt modes set v262 = byte_4F6D280 (keep temps)

Without wizard mode, -v and -keep are no-ops -- the flags are parsed but have no effect, because they copy their values from byte_4F6D280, which is 0.
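The gate itself is a single integer comparison. A minimal sketch of the recovered logic (the function name `wizard_mode_enabled` is ours):

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the wizard-mode gate in sub_8F9C90: the env value must
 * parse (base 10) to exactly 553282 (0x87142). A missing variable or
 * any other value leaves wizard mode off. */
int wizard_mode_enabled(const char *env) {
    if (env == NULL)
        return 0;
    return strtol(env, NULL, 10) == 553282;
}
```

Note that because the comparison happens after `strtol`, leading whitespace and a leading `+` sign would also be accepted -- only the parsed integer matters.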

NVVM_IR_VER_CHK

| Property | Value |
|---|---|
| Checked in | sub_12BFF60 at 0x12BFF60 (NVVM IR version verifier, instance 1); sub_2259720 at 0x2259720 (instance 2) |
| Expected value | "0" to disable version checking |
| Effect | Controls NVVM IR bitcode version metadata validation |

Detailed mechanism (from sub_12BFF60, 9KB):

  1. Reads getenv("NVVM_IR_VER_CHK").
  2. If NULL or strtol(env, 0, 10) != 0: version checking is enabled (default).
  3. If set to "0": version checking is disabled (bypass).

When enabled, the function:

  1. Looks up "nvvmir.version" named metadata via sub_1632310(module, &name).
  2. Also checks "llvm.dbg.cu" metadata (debug compile unit presence).
  3. Iterates metadata operands, deduplicating via an open-addressing hash table:
    • Hash function: (value >> 9) ^ (value >> 4) & mask
    • Tombstone: 0xFFFFFFFFFFFFFFF0 (-16)
    • Empty: 0xFFFFFFFFFFFFFFF8 (-8)
  4. For each unique 2-element version tuple (major, minor):
    • Calls sub_12BDA30(modules, major, minor) for IR compatibility check.
    • Special case: major==2, minor==0 always passes (sentinel for libdevice).
  5. For 4-element tuples (major, minor, debug_major, debug_minor):
    • Calls sub_12BD890(modules, debug_major, debug_minor) for debug version check.
    • Special case: debug_major==3, debug_minor<=2 always passes.

The env var is checked multiple times per invocation: before IR version validation, before debug IR version validation, and at each version tuple comparison. Return code 3 indicates incompatible version.

Current expected versions: nvvmir.version = {2, minor<=0x62}, debug version = {3, minor<=2}.
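The dedup table described in step 3 can be sketched as a fixed-capacity open-addressing set with the recovered sentinel values. The capacity, linear probing, and the operator grouping in the hash expression are assumptions from the decompiled code:

```c
#include <assert.h>
#include <stdint.h>

#define DEDUP_TOMBSTONE 0xFFFFFFFFFFFFFFF0ull  /* -16 */
#define DEDUP_EMPTY     0xFFFFFFFFFFFFFFF8ull  /* -8  */
#define DEDUP_CAP       64                     /* power of two (ours) */

static uint64_t dedup_slots[DEDUP_CAP];

void dedup_init(void) {
    for (int i = 0; i < DEDUP_CAP; i++)
        dedup_slots[i] = DEDUP_EMPTY;
}

/* Insert a metadata-operand value; returns 1 if newly inserted, 0 if
 * already present. Hash mixes two shifted copies of the value; the
 * grouping ((v >> 9) ^ (v >> 4)) & mask is assumed from the
 * decompiled expression (value >> 9) ^ (value >> 4) & mask. */
int dedup_insert(uint64_t value) {
    uint64_t mask = DEDUP_CAP - 1;
    uint64_t idx = ((value >> 9) ^ (value >> 4)) & mask;
    for (;;) {
        uint64_t s = dedup_slots[idx];
        if (s == value)
            return 0;  /* duplicate: already seen this operand */
        if (s == DEDUP_EMPTY || s == DEDUP_TOMBSTONE) {
            dedup_slots[idx] = value;  /* free slot: claim it */
            return 1;
        }
        idx = (idx + 1) & mask;  /* linear probe to the next slot */
    }
}
```

The sentinel choices (-8 for empty, -16 for tombstone) keep both far from plausible pointer values, so operand pointers never collide with control values.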

LIBNVVM_DISABLE_CONCURRENT_API

| Property | Value |
|---|---|
| Checked in | ctor_104 at 0x4A5810 (global constructor) |
| Expected value | Any non-NULL value |
| Effect | Sets byte_4F92D70 = 1 -- disables thread-safe libnvvm API usage |

Safety valve for environments where concurrent libnvvm compilation causes issues. Any non-NULL value triggers single-threaded API behavior. See Concurrent Compilation.

NV_NVVM_VERSION (Obfuscated)

| Property | Value |
|---|---|
| Checked in | sub_12B9F70 at 0x12B9F70, sub_12BB580 at 0x12BB580, sub_8F9C90 at 0x8F9C90 |
| Encrypted at | 0x3C23A90 and 0x42812C0 (two copies, same ciphertext) |
| Decryption | XOR with (-109 * ((byte_offset - base + 97) ^ 0xC5)) then ROT13 |
| Expected values | "nvvm70" (suppresses check), "nvvm-latest" (forces latest mode) |

How it controls compilation path selection in sub_8F9C90:

The dispatch variable v253 starts at 2 (default). When v253 is still 2 at post-parse time (lines 1590--1692):

  1. sub_8F98A0 decrypts the env var name from byte_3C23A9F[-15..0].
  2. Calls getenv(decrypted_name).
  3. Compares the result against two decrypted reference strings:
    • "nvvm70" (from byte_3C23A82): sets v253 = 0 (Path B -- NVVM/bitcode pipeline via sub_1262860 or sub_1265970)
    • "nvvm-latest" (from byte_3C23A7B): sets v253 = 1 (Path A -- PTX pipeline via sub_902D10 or sub_905EE0)
  4. If neither matches: uses (arch > 99) as the tiebreaker, with further modulation by -nvc and -optixir flags.

For multi-stage modes (v263 >= 3), the resolved path also determines which pipeline flag string is appended:

  • Path A (v253 == 1): xmmword_3C23BC0 + "vm-latest" (25 bytes total, decodes to "-nvvm-version=nvvm-latest")
  • Path B (v253 == 0): xmmword_3C23BC0 + "vm70" (20 bytes total, decodes to "-nvvm-version=nvvm70")

The variable name is encrypted in the binary's .rodata section because NVIDIA intended to keep this escape hatch undiscoverable through casual strings(1) scanning. It controls a fundamental compilation mode choice.
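The dispatch above can be condensed into a small decision function. This is a simplified sketch: `v253` is a decompiler temporary, `select_path` is our name, and the -nvc/-optixir modulation is omitted:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the path-selection logic in sub_8F9C90.
 * explicit_choice: 0 or 1 if -nvvm-version was given, 2 if not.
 * env_value: decrypted getenv(NV_NVVM_VERSION) result, or NULL.
 * arch: SM architecture number (e.g. 75, 120).
 * Returns 0 (Path B, NVVM/bitcode pipeline) or 1 (Path A, PTX pipeline). */
int select_path(const char *env_value, int arch, int explicit_choice) {
    int v253 = explicit_choice;          /* 2 = "still unset" */
    if (v253 == 2 && env_value != NULL) {
        if (strcmp(env_value, "nvvm70") == 0)
            v253 = 0;                    /* Path B */
        else if (strcmp(env_value, "nvvm-latest") == 0)
            v253 = 1;                    /* Path A */
    }
    if (v253 == 2)
        v253 = (arch > 99) ? 1 : 0;      /* tiebreaker on SM arch */
    return v253;
}
```

A CLI-provided choice wins over the environment variable, and the architecture tiebreaker only fires when both are absent -- matching the order in which sub_8F9C90 resolves v253.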

LIBNVVM_NVVM_VERSION (Obfuscated)

| Property | Value |
|---|---|
| Checked in | sub_12B9F70 at 0x12B9F70 |
| Encrypted at | 0x42812F0 |
| Expected values | Same as NV_NVVM_VERSION |

Functionally identical to NV_NVVM_VERSION. Both names are checked by the same function sub_12B9F70; this provides an alternative name for the same feature. Likely exists so that libnvvm API users can set LIBNVVM_NVVM_VERSION while standalone cicc users set NV_NVVM_VERSION.

LLVM_OVERRIDE_PRODUCER

| Property | Value |
|---|---|
| Checked in | ctor_036 at 0x48CC90, ctor_154 at 0x4CE640 |
| Expected value | Any string |
| Effect | Overrides the producer identification string in output bitcode metadata |

Dual constructor behavior:

  • ctor_036 (0x48CC90): Reads LLVM_OVERRIDE_PRODUCER, falls back to "20.0.0" (the true LLVM version). Stored in qword_4F837E0.
  • ctor_154 (0x4CE640): Reads LLVM_OVERRIDE_PRODUCER, falls back to "7.0.1" (the NVVM IR compatibility marker). Stored separately.

The bitcode writer (sub_1538EC0, 58KB) uses the ctor_154 value, producing "LLVM7.0.1" in the IDENTIFICATION_BLOCK. This means the output bitcode claims to be LLVM 7.0.1 format, even though cicc is built on LLVM 20.0.0 internally. Setting LLVM_OVERRIDE_PRODUCER overrides both constructors' values.

CAN_FINALIZE_DEBUG

| Property | Value |
|---|---|
| Checked in | sub_60F290 at 0x60F290, sub_4709E0 at 0x4709E0, sub_470DA0 at 0x470DA0 |
| Expected value | Controls debug finalization behavior |
| Effect | Gates debug information finalization passes |

Shared with ptxas and nvptxcompiler (same codebase origin). Controls whether debug information finalization passes execute. When unset, the default behavior applies. Three call sites confirmed.

LLVM Infrastructure Variables

AS_SECURE_LOG_FILE

Checked in ctor_720 at 0x5C0D60. Sets the secure log file path for the integrated assembler, registered as LLVM cl::opt "as-secure-log-file-name". Expected: a file path.

TMPDIR / TMP / TEMP / TEMPDIR

Checked in sub_16C5C30, sub_C843A0, and sub_721330. These are probed in priority order: TMPDIR first, then TMP, TEMP, TEMPDIR. The EDG frontend (sub_721330) only checks TMPDIR and falls back to "/tmp".

PATH

Checked in sub_16C5290, sub_16C7620, sub_C86E60. Standard PATH for findProgramByName lookups.

HOME

Checked in sub_C83840. Used by sys::path::home_directory with getpwuid_r() as fallback.

PWD

Checked in sub_16C56A0, sub_C82800. Used for fast current-directory resolution (faster than getcwd).

TERM

Checked in sub_7216D0 (EDG) and sub_16C6A40/sub_C86300 (LLVM). If TERM=="dumb", terminal colors are disabled. Otherwise, specific terminal type strings (ansi, xterm, screen, linux, cygwin, etc.) are matched by integer comparison to determine color capability.

EDG Frontend Variables

NOCOLOR

Checked in sub_67C750. Respects the no-color.org convention: if set to any value, all diagnostic coloring is disabled.

EDG_COLORS

Checked in sub_67C750. Custom color specification string for EDG diagnostics. Example: "error=01;31:warning=01;35:note=01;36:locus=01:quote=01".

GCC_COLORS

Checked in sub_67C750. Fallback if EDG_COLORS is not set. Default: "error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32". Provides GCC-compatible diagnostic coloring.

USR_INCLUDE

Checked in sub_720A60. Overrides the system include path (default: "/usr/include") for the EDG frontend.

EDG_BASE

Checked in sub_7239A0. Sets the EDG base directory for predefined configuration files. Stored in qword_4F07578.

EDG_MODULES_PATH

Checked in sub_723900. Adds an additional search path for C++ modules in the EDG frontend.

Build System / Parallelism

MAKEFLAGS

| Property | Value |
|---|---|
| Checked in | sub_1682BF0 at 0x1682BF0 |
| Effect | GNU Make jobserver integration for parallel compilation limiting |

Parses for --jobserver-auth= with either:

  • fifo: prefix: FIFO-based jobserver (modern GNU Make)
  • N,M format: pipe file descriptor pair (classic GNU Make)

When detected, cicc integrates with the jobserver to limit concurrent compilation passes.
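A minimal sketch of distinguishing the two auth forms -- the struct and function names are ours, and the binary's actual parser (sub_1682BF0) is more involved:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    int is_fifo;           /* 1 = fifo: form, 0 = fd-pair form */
    char fifo_path[256];
    int read_fd, write_fd;
} jobserver_auth;

/* Scan a MAKEFLAGS string for --jobserver-auth= and classify it.
 * Returns 1 on success, 0 if no usable jobserver spec is present. */
int parse_jobserver(const char *makeflags, jobserver_auth *out) {
    const char *p = strstr(makeflags, "--jobserver-auth=");
    if (p == NULL)
        return 0;
    p += strlen("--jobserver-auth=");
    if (strncmp(p, "fifo:", 5) == 0) {
        out->is_fifo = 1;
        sscanf(p + 5, "%255s", out->fifo_path);  /* path ends at whitespace */
        return 1;
    }
    if (sscanf(p, "%d,%d", &out->read_fd, &out->write_fd) == 2) {
        out->is_fifo = 0;  /* classic pipe fd pair */
        return 1;
    }
    return 0;
}
```

In real use the fd-pair form additionally requires verifying that the inherited descriptors are open, which GNU Make documents as the client's responsibility; that check is omitted here.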

Memory Allocator

MALLOC_CONF

Checked in sub_12FCDB0 (jemalloc initialization, 131,600 bytes -- the largest function in its range). One of five configuration sources for the bundled jemalloc allocator. Expected: jemalloc config string such as "narenas:2,dirty_decay_ms:0".

Dynamic / Generic Access

Two mechanisms allow runtime access to arbitrary environment variables:

  1. --trace-env=VARNAME CLI flag (in sub_125FB30 and sub_900130): reads the named variable and injects its value into the compilation trace. This is a pass-through mechanism for build system integration.
  2. sub_C86120 (LLVM sys::Process::GetEnv wrapper): generic getenv helper called with dynamic name parameters by LLVM's option processing infrastructure.

Complete Inventory

| # | Env Var Name | Origin | Obfuscated | Category | Key Function |
|---|---|---|---|---|---|
| 1 | NVVMCCWIZ | NVIDIA | no | Developer mode | sub_8F9C90 |
| 2 | NVVM_IR_VER_CHK | NVIDIA | no | Version check gate | sub_12BFF60, sub_2259720 |
| 3 | LIBNVVM_DISABLE_CONCURRENT_API | NVIDIA | no | Thread safety | ctor_104 |
| 4 | NV_NVVM_VERSION | NVIDIA | yes | Version compat / path select | sub_12B9F70, sub_12BB580 |
| 5 | LIBNVVM_NVVM_VERSION | NVIDIA | yes | Version compat (alias) | sub_12B9F70 |
| 6 | LLVM_OVERRIDE_PRODUCER | LLVM/NVIDIA | no | Bitcode metadata | ctor_036, ctor_154 |
| 7 | CAN_FINALIZE_DEBUG | NVIDIA | no | Debug finalization | sub_60F290, sub_4709E0, sub_470DA0 |
| 8 | AS_SECURE_LOG_FILE | LLVM | no | Assembler logging | ctor_720 |
| 9 | TMPDIR | LLVM/EDG | no | Temp directory | sub_16C5C30, sub_C843A0, sub_721330 |
| 10 | TMP | LLVM | no | Temp directory (fallback) | sub_16C5C30, sub_C843A0 |
| 11 | TEMP | LLVM | no | Temp directory (fallback) | sub_16C5C30, sub_C843A0 |
| 12 | TEMPDIR | LLVM | no | Temp directory (fallback) | sub_16C5C30, sub_C843A0 |
| 13 | PATH | LLVM | no | Executable lookup | sub_16C5290, sub_16C7620, sub_C86E60 |
| 14 | HOME | LLVM | no | Home directory | sub_C83840 |
| 15 | PWD | LLVM | no | Working directory | sub_16C56A0, sub_C82800 |
| 16 | TERM | LLVM/EDG | no | Terminal type | sub_7216D0, sub_16C6A40, sub_C86300 |
| 17 | NOCOLOR | EDG | no | Color disable | sub_67C750 |
| 18 | EDG_COLORS | EDG | no | Color scheme | sub_67C750 |
| 19 | GCC_COLORS | EDG | no | Color scheme (fallback) | sub_67C750 |
| 20 | USR_INCLUDE | EDG | no | Include path | sub_720A60 |
| 21 | EDG_BASE | EDG | no | EDG base dir | sub_7239A0 |
| 22 | EDG_MODULES_PATH | EDG | no | Module search path | sub_723900 |
| 23 | MAKEFLAGS | Build | no | Jobserver | sub_1682BF0 |
| 24 | MALLOC_CONF | jemalloc | no | Allocator config | sub_12FCDB0 |

Decompiler Artifacts

Several getenv("bar") calls appear in ctor_106, ctor_107, ctor_376, ctor_614. These are not real environment variable checks. The pattern getenv("bar") == (char*)-1 is jemalloc's initialization probe testing whether getenv is intercepted by a sanitizer. The string "bar" is a dummy.

"getenv" as a string in ctor_133 (qword_4F9B700[502]) is a function name in a libc symbol table used by the EDG frontend for tracking known standard library functions.

"fegetenv" in sub_E42970 is a math library function name in a builtin table.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| StringDeobfuscate | sub_8F98A0 | ~400B | XOR + ROT13 decryption engine |
| RealMain | sub_8F9C90 | 10,066B | Main entry; checks NVVMCCWIZ, dispatches via NV_NVVM_VERSION |
| NvvmVersionHelper | sub_12B9F70 | ~3KB | Reads NV_NVVM_VERSION / LIBNVVM_NVVM_VERSION; compares values |
| NvvmVersionHelper2 | sub_12BB580 | ~3KB | Second call site for NV_NVVM_VERSION |
| CheckIRVersion | sub_12BDA30 | ~1KB | IR major/minor compatibility check |
| CheckDebugVersion | sub_12BD890 | ~1KB | Debug IR major/minor compatibility check |
| NVVMIRVersionCheck | sub_12BFF60 | 9KB | Full NVVM IR version validator; reads NVVM_IR_VER_CHK |
| NVVMIRVersionCheck2 | sub_2259720 | 14KB | Second instance of version checker |
| JemallocInit | sub_12FCDB0 | 131,600B | jemalloc config parser; reads MALLOC_CONF |
| JobserverParser | sub_1682BF0 | ~2KB | MAKEFLAGS --jobserver-auth parser |
| GenericGetEnv | sub_C86120 | ~100B | LLVM sys::Process::GetEnv wrapper |
| EDGColorInit | sub_67C750 | ~2KB | NOCOLOR / EDG_COLORS / GCC_COLORS handler |

Cross-References