CICC v13.0 — Reverse Engineering Reference
CICC is NVIDIA's CUDA C-to-PTX compiler — the binary that transforms CUDA C++ source code (or LLVM bitcode) into PTX assembly for GPU execution. At 60 MB, it is one of the largest single compiler binaries in production use. This wiki documents its internal architecture, recovered from static analysis of the stripped x86-64 ELF binary using IDA Pro 8.x and Hex-Rays decompilation.
| Field | Value |
|---|---|
| Binary | cicc v13.0, 60,108,328 bytes, x86-64, stripped |
| Build | cuda_13.0.r13.0/compiler.36424714_0 |
| Decompilation | 80,562 functions, 80,281 recovered (99.65%), IDA Pro 8.x + Hex-Rays |
| Strings | 188,141 extracted |
| LLVM base | LLVM 20.0.0 (internal), bitcode producer ID "LLVM7.0.1" (NVVM compat) |
| LLVM pass classes | ~402 standard + 35 NVIDIA custom |
| CLI options | ~1,689 registered via cl::opt + 222 NVVMPassOptions slots |
| NVVM builtins | 770 (IDs 1–770, wyhash open-addressing table) |
| Default target | sm_75 (Turing) |
| Supported SMs | sm_75 (Turing) through sm_121f (Blackwell) |
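The builtin table noted above is described as a wyhash open-addressing structure mapping builtin names to IDs 1–770. As a rough illustration of that lookup scheme (not the recovered implementation — the hash function here is a stand-in for wyhash, and the names/IDs are hypothetical), an open-addressing probe looks like this:

```python
# Hypothetical sketch of an open-addressing builtin table: slots hold
# (name, id) pairs, linear probing resolves collisions, and a miss is
# signaled by reaching an empty slot. The real table uses wyhash; Python's
# built-in hash() stands in here purely to drive the probe sequence.

class BuiltinTable:
    def __init__(self, capacity=16):
        self.slots = [None] * capacity  # each slot: (name, builtin_id) or None

    def _probe(self, name):
        h = hash(name) % len(self.slots)   # stand-in for wyhash
        while True:
            yield h
            h = (h + 1) % len(self.slots)  # linear probing

    def insert(self, name, builtin_id):
        for i in self._probe(name):
            if self.slots[i] is None or self.slots[i][0] == name:
                self.slots[i] = (name, builtin_id)
                return

    def lookup(self, name):
        # terminates because the sketch keeps the table sparsely filled
        for i in self._probe(name):
            if self.slots[i] is None:
                return 0                   # 0 = not a builtin
            if self.slots[i][0] == name:
                return self.slots[i][1]

table = BuiltinTable()
table.insert("__nv_sinf", 101)   # IDs invented for illustration
table.insert("__nv_cosf", 102)
```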
Three Subsystems
CICC is not a monolithic compiler. It is composed of three largely independent subsystems, each with its own lineage, coding conventions, and internal data structures:
1. EDG 6.6 C++ Frontend (3.2 MB, 0x5D0000–0x8F0000) — A licensed commercial frontend from Edison Design Group that parses CUDA C++ source code and emits transformed C code. It operates as a source-to-source translator: CUDA kernel launch syntax (<<<>>>) is lowered to CUDA runtime API calls, memory space qualifiers (__shared__, __constant__) are resolved to address space annotations, and C++ templates/constexpr are fully evaluated. The output is not LLVM IR — it is C code that feeds into a second compilation phase. See EDG 6.6 Frontend.
2. NVVM Bridge (~4 MB, 0x8F0000–0x12CFFFF) — The glue layer between EDG and LLVM. It handles CLI parsing, architecture detection (23 SM variants with 3-column flag fan-out), the dual-path compilation dispatch (Path A via LibNVVM API, Path B standalone), the NVVMPassOptions knob system (222 per-pass configuration slots), and the 770-entry builtin resolution table. This layer is entirely NVIDIA-proprietary. See Entry Point & CLI and LLVM Optimizer.
3. LLVM 20.0.0 Backend (~45 MB, 0x12D0000–0x3BFFFFF) — A heavily modified LLVM fork that performs IR optimization and PTX code generation. NVIDIA has added 35 custom passes (MemorySpaceOpt, Rematerialization, BranchDist, LoopIndexSplit, Sinking2, etc.), a proprietary two-phase compilation model with per-function thread parallelism, and extensive modifications to the NVPTX backend for tensor core code generation across 5 GPU architecture generations. See Code Generation and PTX Emission.
Additionally, jemalloc 5.3.x (~400 functions at 0x12FC000) is statically linked, replacing the system allocator for improved memory allocation performance during compilation.
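To make the EDG source-to-source step concrete: kernel launch syntax is rewritten into ordinary C calls before any LLVM IR exists. The toy rewrite below illustrates the idea only — the real emitted C uses NVIDIA's internal stub conventions, and the `__cicc_launch` name is invented for this sketch:

```python
# Toy illustration of <<<>>> lowering: rewrite "kernel<<<grid, block>>>(args)"
# into a runtime-API-style call. The actual EDG output differs in form
# (stub functions, argument marshalling); this shows the source-to-source
# nature of the transformation, nothing more.
import re

LAUNCH = re.compile(r"(\w+)<<<(.+?),\s*(.+?)>>>\((.*)\)")

def lower_launch(line):
    return LAUNCH.sub(
        lambda m: f"__cicc_launch(\"{m.group(1)}\", {m.group(2)}, {m.group(3)}, {m.group(4)})",
        line)
```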
Dual-Path Architecture
A distinctive feature of cicc is its dual-path design — two complete copies of the compilation backend exist within the same binary, selected at runtime:
| | Path A (0x90xxxx) | Path B (0x126xxxx) |
|---|---|---|
| Purpose | LibNVVM API mode | Standalone mode |
| Simple compile | sub_902D10 | sub_1262860 |
| Multi-stage | sub_905EE0 (43KB) | sub_1265970 (48KB) |
| CLI parsing | sub_900130 | sub_125FB30 |
| Builtin table | sub_90AEE0 (109KB) | sub_126A910 (123KB) |
| Libdevice | unk_3EA0080 (455KB) | unk_420FD80 (455KB) |
| Version string | -nvvm-version=nvvm-latest | -nvvm-version=nvvm70 |
Runtime selection is controlled by v253 in sub_8F9C90 (the real main function). The default value (2) triggers an environment variable lookup through an obfuscated string comparison to determine which path to take. This design allows a single binary to serve both the nvcc driver toolchain and the LibNVVM runtime compilation API.
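The selector logic can be sketched as follows. This is a minimal illustration under stated assumptions, not the decompiled code: the environment variable name (`CICC_PATH_OVERRIDE`) and the value-to-path mapping are invented, since the binary's actual string comparison is obfuscated.

```python
# Hypothetical sketch of v253-style dispatch: explicit values force a path,
# while the default (2) defers to an environment lookup, mirroring the
# behavior described for sub_8F9C90. Variable and env names are invented.
import os

def select_path(v253, env=os.environ):
    if v253 == 0:
        return "A"   # LibNVVM API mode (assumed mapping)
    if v253 == 1:
        return "B"   # standalone mode (assumed mapping)
    # default value 2: consult the environment, as the obfuscated
    # string comparison is described as doing
    return "A" if env.get("CICC_PATH_OVERRIDE") == "libnvvm" else "B"
```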
Compilation Pipeline
Both paths converge on the same 5-stage pipeline:
CUDA C++ Source (.cu / .ci / .i)
│
├─ EDG 6.6 Frontend (sub_5D2A80)
│ ├─ lgenfe_main (sub_617BD0): 282-case CLI, 737 #defines
│ ├─ Parser: recursive-descent + declaration specifier state machine
│ ├─ Constexpr evaluator: 317KB tree-walking interpreter
│ └─ Backend: "Generating NVVM IR" → .int.c / .device.c / .stub.c
│
└─ NVVM/LLVM Pipeline
│
├─ IRGEN: EDG IL → LLVM IR translation (cicc's equivalent of Clang CodeGen)
│ Type translation (fixed-point iteration, address space mapping)
│ Expression/statement/function codegen (recursive AST walk)
│ CUDA semantic lowering (threadIdx→intrinsics, printf→vprintf, etc.)
│ Kernel metadata emission (nvvm.annotations)
│ Two copies: Path A (0x90xxxx) and Path B (0x126xxxx)
│
├─ LNK: Module linking + libdevice (455KB embedded bitcode)
│ Triple validation (must be nvptx64-)
│ IR version check (nvvmir.version metadata)
│
├─ OPT: Two-phase compilation (Phase I: whole-module, Phase II: per-function)
│ ~150 pass insertions via sub_12E54A0
│ Three language paths: "ptx" / "mid" / default
│ 35 NVIDIA custom passes interleaved with standard LLVM
│ Optional: concurrent per-function compilation (thread pool + jobserver)
│
├─ OPTIXIR: OptiX IR generation (optional, --emit-optix-ir)
│
└─ LLC: NVPTX backend code generation
SelectionDAG lowering (2.3 MB NVPTXTargetLowering)
19 MMA shapes × 11 data types for tensor core codegen
9 PTX register classes
StructurizeCFG (mandatory for PTX structured control flow)
→ .ptx output
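The stages above form a strictly linear driver, with OPTIXIR as an optional branch. A skeletal sketch (stage bodies are placeholders; only the ordering reflects the diagram):

```python
# Minimal driver sketch of the 5-stage pipeline: each stage consumes the
# previous stage's artifact. Artifact strings are placeholders standing in
# for real on-disk/in-memory products (.device.c, bitcode modules, PTX).

def run_pipeline(cu_source, emit_optix_ir=False):
    trace = []
    def stage(name, artifact):
        trace.append(name)
        return artifact
    c_src  = stage("EDG",   f"transformed-C({cu_source})")     # frontend
    ir     = stage("IRGEN", f"llvm-ir({c_src})")               # EDG IL -> LLVM IR
    linked = stage("LNK",   f"ir+libdevice({ir})")             # module linking
    optd   = stage("OPT",   f"optimized({linked})")            # two-phase opt
    if emit_optix_ir:
        stage("OPTIXIR", optd)                                 # optional branch
    ptx    = stage("LLC",   f"ptx({optd})")                    # NVPTX codegen
    return ptx, trace
```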
Subsystem Address Map
| Subsystem | Address Range | Size | Key Entry Point |
|---|---|---|---|
| jemalloc stats | 0x40D000–0x41FFFF | ~80KB | sub_40D5CA (vsnprintf) |
| Global constructors | 0x430000–0x5CFFFF | ~1.6 MB | cl::opt registration (~1,689 options) |
| EDG 6.6 Frontend | 0x5D0000–0x8EFFFF | 3.2 MB | sub_5D2A80 (orchestrator) |
| CLI / Real Main | 0x8F0000–0x96FFFF | 520 KB | sub_8F9C90 (real main) |
| Bitcode reader | 0x9F0000–0xAFFFFF | ~1 MB | sub_9F2A40 (parseFunctionBody) |
| LLVM verifier | 0xBF0000–0xC6FFFF | 500 KB | sub_BFC6A0 (visitCallInst) |
| LLVM passes | 0xC00000–0x12CFFFF | ~7 MB | InstCombine, GVN, DSE, LICM, etc. |
| PassManager / NVVM bridge | 0x12D0000–0x16FFFFF | 4.2 MB | sub_12E54A0 (pipeline assembly) |
| Backend / machine passes | 0x1700000–0x1EFFFFF | 8 MB | MRPA, Block Remat, Mem2Reg |
| SelectionDAG | 0x1F00000–0x20FFFFF | 2 MB | sub_20019C0 (LegalizeTypes, 348KB) |
| NVPTX emission | 0x2100000–0x21FFFFF | 1 MB | sub_215A3C0 (function headers) |
| New PM / pass registration | 0x2340000–0x23FFFFF | 768 KB | sub_2342890 (2,816-line registrar) |
| Loop passes | 0x2A00000–0x2DFFFFF | 4 MB | LoopVectorize, SLP, Unroll, etc. |
| NVPTX ISel + lowering | 0x3000000–0x36FFFFF | 7 MB | sub_33B0210 (intrinsic switch, 343KB) |
| Embedded libdevice | 0x3EA0080 / 0x420FD80 | 456 KB × 2 | LLVM bitcode (~400 math functions) |
Reading This Wiki
The wiki is organized around the compilation pipeline. Every page is written at reimplementation-grade depth for an audience of senior C++ developers with LLVM backend experience.
Section Index
- Pipeline Overview — End-to-end compilation flow diagram with links to every stage.
- Entry Point & CLI — CLI parsing, dual-path dispatch, architecture detection.
- EDG 6.6 Frontend — CUDA C++ to transformed C source-to-source translation.
- NVVM IR Generation — EDG IL tree to LLVM Module: types, expressions, statements, functions.
- LLVM Optimizer — Two-phase compilation, pipeline assembly, NVVMPassOptions.
- Code Generation — SelectionDAG, ISel, register allocation, scheduling.
- PTX Emission — AsmPrinter, directive emission, PTX body output.
- NVIDIA Custom Passes — 35 proprietary passes not in upstream LLVM.
- LLVM Pass Pipeline & Ordering — Complete pass registration, execution order per O-level, tier system.
- NVVM Builtins — 770-entry builtin table: hash structure, ID inventory, category breakdown.
- GPU Targets — SM feature gates, architecture detection, sm_75 through sm_121f.
- Data Structures — IR node layout, pattern database, DAG node, symbol table, NVVM container.
- Infrastructure — Alias analysis, MemorySSA, AsmPrinter, debug verification, NVPTX target.
- LTO & Module Optimization — Cross-TU inlining, devirtualization, GlobalOpt, ThinLTO import.
- Configuration — Three knob systems: ~1,689 cl::opt flags, 222 NVVMPassOptions slots, ~70 codegen knobs.
- Reference — Address spaces, register classes, NVPTX opcodes, GPU execution model.
- Function Map — Address-to-identity lookup for ~350 key functions with confidence levels.
- Binary Layout — Subsystem address map at pass granularity.
- Methodology — How this analysis was performed and how to assess confidence.
Reading Path 1: End-to-End Pipeline Understanding
Goal: understand how CUDA source becomes PTX, what each stage does, and how control flows between subsystems.
Read in this order:
- Pipeline Overview — The complete flow diagram. Establishes the 10 stages and their address ranges. Read this first to build the mental model that all other pages assume.
- Entry Point & CLI — How cicc is invoked, the 1,689-flag CLI, dual-path dispatch (Path A LibNVVM vs. Path B standalone), and the sub_8F9C90 real-main function.
- nvcc-to-cicc Interface — The flag translation layer between nvcc and cicc: the 40+ flag mappings and the 3-column architecture fan-out. Necessary context for understanding why certain flags exist.
- EDG 6.6 Frontend — The commercial C++ frontend. How CUDA syntax is lowered to C, the 737 configuration #defines, and the .int.c/.device.c/.stub.c output split.
- NVVM IR Generation — The EDG-to-LLVM bridge. Then follow the four sub-pages: Type Translation → Expressions → Statements → Functions.
- Libdevice Linking — The embedded 455KB bitcode library with 352 __nv_* math functions. Triple validation, version checking.
- LLVM Optimizer — The two-phase compilation model, the 49.8KB pipeline assembler (sub_12E54A0), pass ordering, and the NVVMPassOptions knob system. This is the longest and densest stage.
- Pipeline & Pass Ordering — The exact pass execution order at each O-level, the tier system, and the 526 registered passes.
- Code Generation — SelectionDAG lowering, instruction selection, register allocation, instruction scheduling. Hub page with links to deep dives.
- PTX Emission — AsmPrinter, directive headers, PTX body output, metadata emission.
Optional extensions after the core path:
- OptiX IR Generation — The alternative output mode for ray tracing workloads.
- Debug Info Pipeline — How -g debug metadata survives the optimizer.
- LTO & Module Optimization — Cross-module optimization when compiling multiple translation units.
- Concurrent Compilation — The Phase II thread pool and GNU Jobserver integration.
- GPU Execution Model — Background on warps, divergence, shared memory, and address spaces if you are new to GPU architecture.
Reading Path 2: Reimplementing a Specific Pass
Goal: reproduce the exact behavior of one NVIDIA custom pass or understand an LLVM pass modification deeply enough to write a compatible replacement.
For an NVIDIA custom pass (e.g., MemorySpaceOpt, Rematerialization, BranchDist):
- NVIDIA Custom Passes — Overview — Locate the pass in the inventory table. Note its category (module/function/loop/machine), its pipeline position, and its controlling knobs.
- The pass's dedicated page (e.g., MemorySpaceOpt, Rematerialization, Branch Distribution). Every dedicated page contains the function address, decompiled algorithm, data flow description, controlling knobs, and diagnostic strings.
- NVVMPassOptions — The 222-slot struct that controls per-pass enable/disable toggles and parametric thresholds. Find which slots your target pass reads.
- Pipeline & Pass Ordering — Determine exactly where the pass runs in the pipeline. Identify what analyses it depends on (must run before it) and what passes consume its results (run after it).
- Optimization Levels — Determine at which O-levels the pass is enabled, disabled, or parameterized differently.
- Function Map — Cross-reference the pass's internal function addresses with the master function map for confidence levels.
For a modified LLVM pass (e.g., InstCombine, GVN, DSE, LICM, LoopVectorize):
- The pass's dedicated page (e.g., InstCombine, GVN, DSE, LICM). These pages document NVIDIA's modifications relative to upstream LLVM 20.0.0.
- Alias Analysis & NVVM AA — The custom alias analysis chain. Nearly every optimization pass depends on AA, and NVIDIA's GPU-aware AA behaves differently from upstream (address-space-aware NoAlias for disjoint spaces, __restrict__ propagation).
- MemorySSA — The memory dependence representation used by DSE, LICM, and other memory-sensitive passes.
For a machine-level pass (e.g., Block Remat, MRPA, Machine Mem2Reg):
- Machine-Level Passes — The complete machine pass pipeline with per-pass algorithm descriptions.
- Register Allocation — The greedy RA algorithm with NVIDIA's occupancy-driven spill heuristics.
- Register Classes — The 9 PTX register classes and their constraints.
- NVPTX Machine Opcodes — The MachineInstr opcode reference.
Supporting references for any pass reimplementation:
- IR Node Layout — The internal IR data structures that passes operate on.
- Address Spaces — GPU address space semantics that many passes must respect.
- NVPTX Target Infrastructure — TargetMachine, TTI hooks, and target feature queries.
- Diagnostics — The three diagnostic systems (EDG, LLVM remarks, profuse framework) for reproducing pass-level reporting.
Reading Path 3: Debugging Correctness
Goal: diagnose a miscompilation, a crash, or incorrect PTX output by tracing the problem to a specific pass or pipeline stage.
Start with instrumentation and observability:
- Diagnostics & Optimization Remarks — The three independent diagnostic layers: EDG frontend errors, LLVM optimization remarks (-opt-bisect-limit, -Rpass=, -Rpass-missed=), and NVIDIA's profuse framework (profuseinline, profusegvn). This page tells you how to make cicc talk about what it is doing.
- Debug Info Verification — The three verification modes (verify-each, debugify-each, and JSON delta reporting). Use verify-each to detect the first pass that corrupts debug metadata.
- CLI Flags — Locate the flags for dumping IR at specific pipeline points: --print-after-all, --print-before-all, --filter-print-funcs=, --opt-bisect-limit=. Also the --passes= interface for running individual passes in isolation.
- Optimization Levels — Compare the pass pipeline at different O-levels. If a bug appears at -O2 but not -O1, the diff between their pipelines identifies the suspect passes.
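The -opt-bisect-limit workflow mentioned above is a binary search: find the smallest pass count at which the bug first reproduces. A driver sketch (here `compiles_incorrectly` is a placeholder for "invoke cicc with the given limit and test the resulting PTX"):

```python
# Binary search over the opt-bisect pass count. Invariant: the bug is
# absent at `lo` passes and present at `hi` passes; the loop narrows the
# window to the single pass that introduces the miscompilation.

def bisect_passes(max_passes, compiles_incorrectly):
    lo, hi = 0, max_passes
    if not compiles_incorrectly(hi):
        return None                  # bug does not depend on pass count
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if compiles_incorrectly(mid):
            hi = mid                 # culprit runs at or before `mid`
        else:
            lo = mid                 # culprit runs after `mid`
    return hi                        # first pass count that exhibits the bug
```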
Then isolate the pipeline stage:
- Pipeline Overview — Determine which stage produces the incorrect output. The pipeline is linear: EDG → IR Generation → Libdevice Linking → Optimizer → Codegen → Emission. The stage boundary where output first goes wrong narrows the search.
- NVVM IR Verifier — The 230KB three-layer verifier (module + function + intrinsic). It validates triples, address spaces, atomic restrictions, pointer cast rules, and architecture-gated intrinsic availability. A verification failure after a specific pass is a strong signal.
- Bitcode I/O — If the problem is in bitcode reading/writing (corrupted input, version mismatch), this page documents the reader at sub_9F2A40 and the writer.
Then investigate the suspect pass:
- NVIDIA Custom Passes or the relevant LLVM pass page — Read the algorithm description for the suspect pass. Look for documented edge cases, known limitations, and diagnostic strings that would appear in verbose output.
- NVVMPassOptions — Check whether the suspect pass has enable/disable knobs or threshold parameters that could be adjusted to confirm or rule it out.
- Environment Variables — Some passes are gated by environment variables (including obfuscated ones). Check whether any are influencing behavior.
For correctness issues specific to GPU semantics:
- Address Spaces — Incorrect address space resolution is a common source of silent miscompilation. Global vs. shared vs. local aliasing rules differ from CPU memory models.
- MemorySpaceOpt — This pass resolves generic pointers to specific address spaces. If it infers the wrong space, downstream code will access the wrong memory.
- Alias Analysis — If the alias analysis returns NoAlias for pointers that do alias, DSE/LICM/GVN will misoptimize. The process-restrict propagation is a known source of aggressive alias assumptions.
- StructurizeCFG — PTX requires structured control flow. If structurization produces incorrect flow blocks, the kernel will execute the wrong path.
- Dead Barrier Elimination and Dead Synchronization Elimination — Incorrect elimination of barriers or synchronization can cause race conditions that only manifest under specific warp configurations.
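The address-space inference that MemorySpaceOpt performs (resolving generic pointers to a specific space) can be summarized as a lattice join over reaching definitions: agreement keeps the specific space, disagreement falls back to generic. A minimal sketch under that assumption — the real pass is a fixed-point dataflow over LLVM IR, not this two-liner:

```python
# Lattice-join sketch of generic-pointer resolution: a pointer may be
# specialized only when every reaching definition agrees on its space.
# Space names follow the PTX model (global/shared/local/generic).

GENERIC = "generic"

def join(a, b):
    # meet of two inferred spaces
    return a if a == b else GENERIC

def infer_space(reaching_defs):
    space = None
    for s in reaching_defs:
        space = s if space is None else join(space, s)
    return space or GENERIC   # no information -> stay generic
```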
Reading Path 4: Tuning Performance
Goal: understand what cicc does at each optimization level, which passes are the performance-critical ones, and what knobs control their aggressiveness.
Start with the tuning infrastructure:
- Optimization Levels — The four standard levels (O0–O3) and three fast-compile tiers (Ofcmin/Ofcmid/Ofcmax). This page shows the exact pass pipeline diff between levels, including which passes are added, removed, or reparameterized at each step.
- NVVMPassOptions — The 222-slot per-pass configuration system. This is the primary tuning mechanism. The page documents every slot's type (boolean/integer/string), its default value, and which pass reads it.
- CLI Flags — The flag-to-pipeline routing tables. Locate flags that control pass thresholds (--inline-threshold=, --unroll-count=, etc.) and pass enable/disable toggles.
- LLVM Knobs — The ~1,689 cl::opt flags with their defaults, types, and controlling constructors.
- Environment Variables — Runtime environment overrides, including the obfuscated variables.
Then study the high-impact optimization passes:
- LLVM Optimizer — Understand the two-phase model. Phase I (whole-module) determines inlining decisions, inter-procedural memory space propagation, and global optimization. Phase II (per-function, potentially concurrent) does register-pressure-driven rematerialization and instruction scheduling. Tuning decisions in Phase I cascade into Phase II.
- Inliner Cost Model — Inlining is typically the single highest-impact optimization decision. This page documents the cost model thresholds, the caller/callee size heuristics, and NVIDIA's kernel-specific adjustments.
- LoopVectorize & VPlan — Loop vectorization for GPU SIMT. The VPlan infrastructure, cost model, and the NVIDIA TTI hooks that influence vectorization width decisions.
- Loop Unrolling — Unrolling thresholds, the NVIDIA-specific unroll heuristics, and the interaction with register pressure.
- Rematerialization — NVIDIA's IR-level rematerialization pass (67KB). Trades recomputation for register pressure reduction, which directly affects occupancy on GPU.
- Register Allocation — The greedy RA with occupancy-driven spill heuristics. Register count directly determines maximum occupancy.
- Instruction Scheduling — The scheduler subsystems and their interaction with hardware latency models.
For tensor core workloads specifically:
- Tensor / MMA Codegen — 19 MMA shapes across 11 data types. The instruction selection patterns, register allocation constraints, and WGMMA code generation for Hopper and Blackwell.
- Tensor / MMA Builtins — The builtin-to-intrinsic lowering for wmma, mma, and wgmma operations.
- SM 90 — Hopper — Hopper-specific features: TMA, WGMMA, asynchronous barriers, cluster launch.
- SM 100 — Blackwell — Blackwell-specific features: new MMA shapes, FP4/FP6 support, sparsity.
For understanding performance at the target level:
- GPU Targets — The SM feature gate matrix. Which features are enabled at each architecture level, and how architecture detection routes to different codegen paths.
- NVPTX Target Infrastructure — The TTI hooks that passes query for target-specific costs (memory latency, instruction throughput, register file size).
- Concurrent Compilation — If compile time itself is the bottleneck, understand the Phase II thread pool and GNU Jobserver integration to maximize parallelism.
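The two-phase split referenced throughout this path can be sketched as: Phase I runs once over the whole module, then Phase II optimizes each function independently, optionally on a thread pool. This is a structural illustration only; the pass bodies are placeholders and the real Phase II also coordinates with a GNU jobserver.

```python
# Skeleton of the two-phase model: whole-module decisions first, then
# per-function optimization that is trivially parallel because Phase I
# has already fixed the cross-function state.
from concurrent.futures import ThreadPoolExecutor

def compile_module(functions, jobs=1):
    # Phase I: whole-module decisions (inlining, memory space propagation);
    # sorting stands in for the documented function-priority ordering
    work_list = sorted(functions)

    def phase2(fn):
        # per-function work (rematerialization, scheduling) stand-in
        return f"optimized({fn})"

    if jobs > 1:
        with ThreadPoolExecutor(max_workers=jobs) as pool:
            return list(pool.map(phase2, work_list))   # order-preserving
    return [phase2(fn) for fn in work_list]
```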
Function Map
Address-to-identity lookup table. Confidence: VERY HIGH = string evidence, HIGH = strong structural evidence, MEDIUM = inferred from context/callgraph.
Top Functions by Size
| Function | Address | Size | Confidence |
|---|---|---|---|
| X86 AutoUpgrade (intrinsic rename, leftover from LLVM x86 target) | 0xA939D0 | 457KB | VERY HIGH |
| InstCombine::visitCallInst / visitIntrinsic | 0x10EE7A0 | 396KB | HIGH |
| SelectionDAG LegalizeTypes workhorse (ExpandOp/PromoteOp) | 0x20019C0 | 341KB | HIGH |
| New PassManager pipeline parser (function-level, 268 pass names) | 0x2368220 | 326KB | VERY HIGH |
| EDG constexpr expression evaluator core (124 operator opcodes, 9,075 lines) | 0x786210 | 317KB | VERY HIGH |
| SelectionDAG LegalizeOp main switch | 0x20ACAE0 | 295KB | HIGH |
| SelectionDAGBuilder::visit (IR → DAG) | 0x2081F00 | 261KB | HIGH |
| LLVM IR Verifier (visitCallInst), 298 verification messages | 0xBFC6A0 | 207KB | VERY HIGH |
| X86 Intrinsic Upgrade Helper (broadcastf32x4, compress, etc.) | 0xA8A170 | 195KB | HIGH |
| EDG IL tree walker #1 (297 self-recursive, 87 node types, 305 cases) | 0x7506E0 | 190KB | HIGH |
| EDG declaration specifier parser (393 LABEL_ gotos, NOT switch/case) | 0x7C0F00 | 184KB | HIGH |
| Bitcode Reader parseFunctionBody, 174 error strings | 0x9F2A40 | 182KB | VERY HIGH |
| EDG constexpr top-level dispatch (80 expression types + 62 intrinsics) | 0x77FCB0 | 150KB | HIGH |
| EDG IL tree copier/transformer (callback params a3/a4, template instantiation) | 0x766570 | 148KB | HIGH |
| SelectionDAG LegalizeTypes dispatch (967 case labels) | 0x1FFB890 | 137KB | HIGH |
| EDG declaration specifier state machine (80 token cases, 4,371 lines) | 0x672A20 | 132KB | VERY HIGH |
| je_malloc_conf_init (199 config strings) | 0x12FCDB0 | 129KB | VERY HIGH |
| computeKnownBits / SimplifyDemandedBits | 0x11A7600 | 125KB | VERY HIGH |
| EDG lgenfe_main (282-case CLI switch, 737 config macros, EDG 6.6) | 0x617BD0 | 123KB | VERY HIGH |
| NVVM Builtin Resolution table (post-opt, 770 entries) | 0x126A910 | 123KB | VERY HIGH |
| NVVMPassOptions init (4,786 lines, 221 slots in 4,512-byte struct) | 0x12D6300 | 125KB | VERY HIGH |
| PassOptionRegistry::lookupOption (hash table at registry+120) | 0x12D6170 | — | HIGH |
| PassOptionRegistry::getBoolOption (triple: '1'/true, 't'/true) | 0x12D6240 | — | HIGH |
| writeStringOption (24-byte entry to output struct) | 0x12D6090 | — | HIGH |
| writeBoolOption (16-byte entry to output struct) | 0x12D6100 | — | HIGH |
| 4-stage pipeline orchestrator (LNK/OPT/OPTIXIR/LLC), nvopt+nvllc objects | 0x12C35D0 | 41KB | VERY HIGH |
| Bitcode linker: triple validation, IR version check, symbol size matching | 0x12C06E0 | 63KB | VERY HIGH |
| NVVM IR version checker (nvvmir.version metadata, NVVM_IR_VER_CHK env) | 0x12BFF60 | 9KB | VERY HIGH |
| NVVM container format parser (arch, FTZ, IEEE, opt level extraction) | 0x12642A0 | — | HIGH |
| Concurrent worker entry (dispatches Phase I/II) | 0x12E7B90 | 3KB | HIGH |
| Concurrent compilation entry (jobserver, thread pool, split-module) | 0x12E1EF0 | 51KB | VERY HIGH |
| Function sorting by priority (insertion sort / introsort) | 0x12E0CA0 | — | HIGH |
| Per-function compilation callback (completion handler) | 0x12E8D50 | — | HIGH |
| Phase II per-function optimizer (sets qword_4FBB3B0=2) | 0x12E86C0 | — | HIGH |
| Concurrency eligibility check (counts defined functions) | 0x12D4250 | — | HIGH |
| GNU Jobserver init (parse MAKEFLAGS, create pipe, spawn pthread) | 0x16832F0 | — | HIGH |
| Bitcode Metadata Reader (parseMetadata) | 0xA09F80 | 121KB | VERY HIGH |
| EDG IL function body processor (14 params, scope stack management) | 0x627530 | 114KB | HIGH |
| EDG IL tree walker #2 (427 self-recursive, parallel traversal) | 0x760BD0 | 109KB | HIGH |
| EDG IL codegen (node type dispatch on byte+80, 2,589 lines) | 0x8BA620 | 108KB | HIGH |
| NVVM Builtin Resolution table (pre-opt, 770 entries) | 0x90AEE0 | 107KB | VERY HIGH |
| NVVM Builtin lowering engine (pre-opt, wgmma/tex/surf, 3571 lines) | 0x955A70 | 103KB | HIGH |
| New PassManager pipeline parser (CGSCC-level) | 0x2377300 | 103KB | HIGH |
Pipeline Functions
| Function | Address | Size | Confidence |
|---|---|---|---|
| main() thunk → sub_8F9C90 | 0x4396A0 | tiny | KNOWN |
| Real main: CLI parsing, wizard check, dispatch | 0x8F9C90 | 10KB | VERY HIGH |
| Simple compile entry (Path A) | 0x902D10 | — | HIGH |
| Simple compile entry (Path B) | 0x1262860 | — | HIGH |
| LibNVVM pipeline driver (Path A): 14-phase flow, libdevice linking, API dispatch | 0x905EE0 | 43KB | VERY HIGH |
| LibNVVM compilation entry (Path B): 4-stage pipeline, embedded builtins | 0x1265970 | 48KB | VERY HIGH |
| CUDA C++ Front-End stage (lgenfe): timer "CUDA C++ Front-End" | 0x905880 | 6KB | HIGH |
| NVVM IR Container → Module opt setup | 0x9047E0 | 10KB | HIGH |
| Backend SM config + EDG binding, triple construction | 0x908850 | 10KB | HIGH |
| LNK stage verbose callback | 0x903BA0 | 5KB | HIGH |
| LLC stage verbose callback | 0x903730 | 5KB | HIGH |
| CLI processing (Path A): -arch, -maxreg, -split-compile, -gen-lto | 0x900130 | — | HIGH |
| CLI processing (Path B) | 0x125FB30 | — | HIGH |
| EDG master orchestrator (setjmp recovery, timer callbacks) | 0x5D2A80 | 2KB | VERY HIGH |
| Backend entry: "Generating NVVM IR", file output (.int.c/.device.c/.stub.c), TileIR dlopen | 0x5E3AD0 | 11KB | VERY HIGH |
| Multi-stage orchestrator: .lnk.bc → .opt.bc → .ptx | 0x9685E0 | — | HIGH |
| Architecture detection: -arch → triple fan-out | 0x95EB40 | 15KB | VERY HIGH |
| NVVM option parsing (all -opt-, -llc-, -gen-*, -Xopt) | 0x9624D0 | — | HIGH |
| Flag mapping table (O0-O3, nvcc flag translation) | 0x8FE280 | — | HIGH |
| LLVM cl::opt bulk registration (~1500 options) | 0xB6EEA0 | — | HIGH |
| Timer/context creation ("CUDA C++ Front-End", "LibNVVM") | 0xC996C0 | — | HIGH |
EDG 6.6 Frontend
Core Orchestration
| Function | Address | Size | Confidence |
|---|---|---|---|
| EDG master orchestrator (setjmp recovery, timer callbacks) | 0x5D2A80 | 2KB | VERY HIGH |
| EDG lgenfe_main (282-case CLI switch, 737 config macros, EDG 6.6) | 0x617BD0 | 123KB | VERY HIGH |
| CLI option registration table (~300 options via sub_6101D0) | 0x610260 | 22KB | HIGH |
| Option fetcher (called in main loop of sub_617BD0) | 0x6140E0 | 6KB | HIGH |
| Backend entry: "Generating NVVM IR", file output (.int.c/.device.c/.stub.c), TileIR dlopen | 0x5E3AD0 | 11KB | VERY HIGH |
| Translation unit init (416-byte TU object, keyword init, parser entry) | 0x8D0BC0 | — | VERY HIGH |
| Semantic analysis init (zeroes 6 globals) | 0x8D0F00 | tiny | HIGH |
| Keyword table init (~350 keywords via sub_885C00) | 0x706250 | 30KB | VERY HIGH |
| TU finalization ("Generating Needed Template Instantiations") | 0x709330 | 5KB | HIGH |
| Register single keyword: (token_id, "keyword_string") | 0x885C00 | tiny | HIGH |
AST-to-Source Printer Cluster
| Function | Address | Size | Confidence |
|---|---|---|---|
| Main expression/statement emitter (61 self-references, recursive) | 0x5DBFC0 | 41KB | HIGH |
| Function declaration printer (__sti__, #pragma section, nv_linkonce_odr) | 0x5E13C0 | 44KB | HIGH |
| Statement printer (if/else/for/while/switch/case/return) | 0x5DFD00 | 26KB | HIGH |
| Declaration printer (linkage/storage, __builtin_va_alist) | 0x5D9330 | 12KB | HIGH |
| Scope/block printer (bit-fields, array dimensions) | 0x5DA0F0 | 13KB | HIGH |
| Struct/union/enum printer (#pragma pack) | 0x5DAD30 | 9KB | HIGH |
| Variable initializer printer (memcpy, aggregate init) | 0x5D80F0 | 17KB | HIGH |
| Inline asm printer (volatile, constraints, format specifiers) | 0x5DF1B0 | 11KB | HIGH |
| Identifier printer (keyword mangling: auto→__xauto) | 0x5D5A80 | 7KB | HIGH |
| Top-level declaration dispatcher | 0x5DB980 | 7KB | HIGH |
| Function parameter list printer (__text__/__surf__ annotations) | 0x5D7860 | 6KB | HIGH |
Parser & Declaration Processing
| Function | Address | Size | Confidence |
|---|---|---|---|
| Declaration specifier state machine (while/switch, 80 token cases) | 0x672A20 | 132KB | VERY HIGH |
| Declaration specifier parser (393 LABEL_ gotos, NOT switch/case) | 0x7C0F00 | 184KB | HIGH |
| Top-level declaration/declarator parser | 0x662DE0 | 61KB | HIGH |
| Overloaded function resolution (__builtin_ detection, OMP variants) | 0x6523A0 | 64KB | HIGH |
| Struct/union/class specifier processing | 0x66AC40 | 49KB | HIGH |
| Enum specifier processing | 0x66F9E0 | 39KB | HIGH |
| Block-level declaration/statement processor (largest in 0x630000 zone) | 0x63CAE0 | 67KB | HIGH |
| Declaration statement parsing (35 token refs, 14 diagnostics) | 0x661400 | 28KB | HIGH |
| Function declarator processing (parameter lists, return types) | 0x66DF40 | 24KB | HIGH |
| Declaration specifier combination validator | 0x668EE0 | 26KB | HIGH |
| Storage class specifier processor (_Thread_local validation) | 0x668230 | 9KB | HIGH |
| Primary declarator-to-IL conversion (type kind dispatch) | 0x6333F0 | 26KB | HIGH |
| Name/identifier processing | 0x64BAA0 | 46KB | HIGH |
| Builtin/intrinsic recognition (53 string refs, C++20/23 reflection) | 0x64A920 | 25KB | HIGH |
| IL function body processor (14 params, scope stack management) | 0x627530 | 114KB | HIGH |
| IL statement processing (16 params, IL walker/transformer) | 0x62C0A0 | 63KB | HIGH |
Type System
| Function | Address | Size | Confidence |
|---|---|---|---|
| Type conversion checker (recursive, vector type handling) | 0x713ED0 | 36KB | HIGH |
| Binary operation type checker (11 callers — very central) | 0x7115B0 | 17KB | HIGH |
| Usual arithmetic conversions (10 params) | 0x712770 | 12KB | HIGH |
| Type node comparator (parallel tree walk, canonicalization) | 0x7386E0 | 23KB | HIGH |
| Declaration-level type comparison | 0x739430 | 20KB | HIGH |
| Type-to-string emitter (19 callers, backbone of diagnostics) | 0x74A390 | 29KB | VERY HIGH |
| Constant expression emitter (alignof, sizeof, nullptr, zero-init) | 0x748000 | 45KB | HIGH |
| Declarator emitter (19 callers, paired with sub_74A390) | 0x74D110 | 10KB | HIGH |
| Type node deep-copy | 0x73A9D0 | 19KB | HIGH |
| Declaration node deep-copy (192 bytes = 12 x __m128i) | 0x73F780 | 6KB | HIGH |
| Operator overloadability checker | 0x73CC20 | 9KB | HIGH |
IL Tree Infrastructure
| Function | Address | Size | Confidence |
|---|---|---|---|
| IL tree walker #1 (297 self-recursive, 87 node types, 305 cases) | 0x7506E0 | 190KB | HIGH |
| IL tree walker #2 (427 self-recursive, parallel traversal) | 0x760BD0 | 109KB | HIGH |
| IL tree walker #3 (316 self-recursive) | 0x75C0C0 | 87KB | HIGH |
| IL tree copier/transformer (callback params a3/a4, template instantiation) | 0x766570 | 148KB | HIGH |
| Walker driver/setup (5 callbacks + flags) | 0x759B50 | 31KB | HIGH |
| Copier driver (parallel to sub_759B50) | 0x75B260 | 16KB | HIGH |
| Master walker driver (sets all 6 global callback pointers) | 0x75AFC0 | — | HIGH |
Constexpr Evaluator
| Function | Address | Size | Confidence |
|---|---|---|---|
| EDG constexpr expression evaluator core (124 operator opcodes, 9,075 lines) | 0x786210 | 317KB | VERY HIGH |
| Statement executor (declarations, loops, switch, compound blocks) | 0x795660 | 77KB | HIGH |
| Object member accessor (base classes, virtual bases, union tracking) | 0x79CCD0 | 67KB | HIGH |
| Aggregate initializer evaluator (arrays, structs, designated init) | 0x799B70 | 33KB | HIGH |
| Function call evaluator (argument binding, recursion limits) | 0x79B7D0 | 29KB | HIGH |
| EDG constexpr top-level dispatch (80 expression types + 62 intrinsics) | 0x77FCB0 | 150KB | HIGH |
| Type size calculator (Robin Hood hash memoization, 64MB cap) | 0x7764B0 | 18KB | HIGH |
| Loop/range-for evaluator | 0x7987E0 | 11KB | HIGH |
| Builtin call evaluator (dispatched from case 0x3D) | 0x77C870 | 18KB | HIGH |
| Aggregate initializer evaluator (struct/array/union at compile time) | 0x77D750 | 34KB | HIGH |
Preprocessor
| Function | Address | Size | Confidence |
|---|---|---|---|
| Main preprocessor token scanner (all C/C++ token kinds) | 0x7B8B50 | 59KB | HIGH |
| Macro expansion engine (99-entry predefined table, __VA_OPT__) | 0x81B8F0 | 77KB | HIGH |
| Numeric literal tokenizer (hex float, binary, digit separators) | 0x7B40D0 | 42KB | HIGH |
| Character classification / next-token dispatch (trigraphs, line splices) | 0x7BC390 | 29KB | HIGH |
| String literal scanner (escape processing, raw strings) | 0x7B6B00 | 13KB | HIGH |
| Macro body substitution (__VA_ARGS__, __VA_OPT__) | 0x8200E0 | 22KB | HIGH |
| Source character reader / tokenizer bootstrap | 0x7B2B10 | 16KB | HIGH |
| Preprocessing directive dispatcher | 0x7B8270 | 8KB | HIGH |
Template Engine
| Function | Address | Size | Confidence |
|---|---|---|---|
| Complete template instantiation engine (parameter lists, member iteration) | 0x7A9440 | 40KB | HIGH |
| Template argument type resolution/matching | 0x7410C0 | 42KB | HIGH |
| Template type instantiation handler | 0x743600 | 19KB | HIGH |
| Template instantiation engine (word_4F06418 SM-arch checks) | 0x5EBF70 | 30KB | HIGH |
| Template argument deduction engine (pattern matching, pack expansion) | 0x5FBCD0 | 38KB | HIGH |
Semantic Analysis
| Function | Address | Size | Confidence |
|---|---|---|---|
| Deep semantic analysis (29 SM-arch refs, 27 sub_8D* calls) | 0x6040F0 | 64KB | HIGH |
| Overload resolution main (43 SM-arch refs — highest) | 0x607B60 | 32KB | HIGH |
| Expression parsing/semantic ("Parsing Lambda", __nv_parent) | 0x609F00 | 58KB | HIGH |
| Declaration processing (9 SM version refs) | 0x5FE9C0 | 28KB | HIGH |
| Class hierarchy analysis (vtable layout, diamond inheritance) | 0x5F94C0 | 24KB | HIGH |
| Conversion function lookup (33 sub_8D* calls) | 0x5F4F20 | 21KB | HIGH |
| Operator overload resolution | 0x5F2920 | 23KB | HIGH |
| Declaration elaboration (type-spec strings "A;P", "O;F", "I", "B") | 0x84EC30 | 71KB | HIGH |
| Declaration semantic analysis (148 global refs, highest density) | 0x8708D0 | 63KB | HIGH |
CUDA-Specific Frontend
| Function | Address | Size | Confidence |
|---|---|---|---|
| Memory space attribute processing (__shared__, __constant__, __managed__) | 0x6582F0 | 22KB | HIGH |
| Declaration with memory space annotation (15 diagnostic calls) | 0x65F400 | 24KB | HIGH |
| Atomic builtin name generator (__nv_atomic_fetch_*) | 0x6BBC40 | 34KB | HIGH |
| CUDA device code generation master | 0x804B20 | 28KB | HIGH |
| CUDA registration stub (__cudaRegisterAll, __cudaRegisterEntry) | 0x806F60 | 8KB | VERY HIGH |
| Device stub generator ("__device_stub_%s", __cudaLaunch) | 0x808590 | 11KB | HIGH |
| CUDA kernel launch lowering (cudaGetParameterBufferV2) | 0x7F2B50 | 16KB | HIGH |
| Static init with CUDA memory space (__sti__, __constant__) | 0x801880 | 7KB | HIGH |
| Optimization flag configurator (109 flags from O-level) | 0x60D650 | 6KB | HIGH |
| SM-arch feature gate (56 qword_4F077A8 comparisons) | 0x60E7C0 | 12KB | HIGH |
Name Mangling (Itanium ABI)
| Function | Address | Size | Confidence |
|---|---|---|---|
| Primary mangling entry | 0x8E74B0 | 29KB | HIGH |
| Type mangling | 0x8E9FF0 | 26KB | HIGH |
| Type component mangling (__real__, __imag__) | 0x816460 | 24KB | HIGH |
| Builtin type mangling (DF16_, Cu6__bf16, u6__mfp8) | 0x80E340 | 23KB | HIGH |
| NVIDIA extension mangling (Unvdl, Unvdtl, Unvhdl) | 0x80FE00 | 8KB | HIGH |
| Special type mangling (basic_ostream, allocator substitution) | 0x80C5A0 | 11KB | HIGH |
| Expression mangling | 0x813790 | 13KB | HIGH |
Diagnostics & Support
| Function | Address | Size | Confidence |
|---|---|---|---|
| Diagnostic emitter (severity labels, ANSI color, word-wrap) | 0x681D20 | 37KB | VERY HIGH |
| SARIF JSON diagnostic output (ruleId, level, locations) | 0x6837D0 | 20KB | HIGH |
| Type name formatter (quoted type names for error messages) | 0x67FCF0 | 40KB | HIGH |
| EDG abort / __builtin_unreachable (478 callers!) | 0x721090 | tiny | VERY HIGH |
| Exit with status ("Compilation aborted/terminated") | 0x720FF0 | — | HIGH |
| IR node alloc with context (204 callers) | 0x724DC0 | — | HIGH |
| IR node free (196 callers) | 0x724E30 | — | HIGH |
| Get/create void type singleton at qword_4F07BA8 (145 callers) | 0x72C930 | — | HIGH |
| Arena allocator (63 callers) | 0x7247C0 | — | HIGH |
| IR node hash (polynomial: v10 += ch + 32*v10, 9 callers) | 0x72DB90 | 8KB | HIGH |
| Tracked heap allocation (linked list at qword_4F195F8) | 0x822B10 | — | HIGH |
| Hash table bucket chain finalizer | 0x823310 | — | HIGH |
| EDG heap pool allocator (152-byte, 416-byte, etc. entries) | 0x823970 | — | HIGH |
Class Layout & Vtable
| Function | Address | Size | Confidence |
|---|---|---|---|
| Class layout emitter (__vptr, __v_, __b_ prefixes) | 0x7E3EE0 | 7KB | HIGH |
| Virtual base offset calculator | 0x7E57B0 | 9KB | HIGH |
| Virtual call lowering (node_kind==103) | 0x7E88E0 | 11KB | HIGH |
| Class definition emitter (vtable, nested types, friends) | 0x7E9AF0 | 13KB | HIGH |
| Statement emission mega-function (largest in class layout zone) | 0x7EE560 | 45KB | HIGH |
| Class member emission (__cxa_atexit, __cxa_vec_cctor) | 0x7FEC50 | 48KB | HIGH |
| Function definition emission (ctor initializers, default args) | 0x7FCF80 | 17KB | HIGH |
LLVM cl::opt Registration Infrastructure
| Function | Address | Size | Confidence |
|---|---|---|---|
| Global option counter (atomic increment) | 0xC523C0 | — | HIGH |
| cl::Option::setArgStr(name, len) — Legacy PM | 0xC53080 | — | HIGH |
| cl::Option::addArgument() — Legacy PM | 0xC53130 | — | HIGH |
| cl::OptionCategory getter | 0xC57470 | — | HIGH |
| cl::opt name setter — New PM | 0x16B8280 | — | HIGH |
| cl::opt finalization — New PM | 0x16B88A0 | — | HIGH |
| SmallVector::grow() | 0xC8D5F0 | — | HIGH |
Key Constructors (cl::opt registration)
| Function | Address | Size | Confidence |
|---|---|---|---|
| ctor_010_0: TargetLibraryInfo VecFuncs table (9 vector math libs, 960 string xrefs, NOT decompiled) | 0x4397F0 | ~102KB | VERY HIGH |
| ctor_027: DOES NOT EXIST (phantom, no decompiled file) | 0x456120 | — | DISPROVED |
| ctor_036: LLVM version = "20.0.0" (via LLVM_OVERRIDE_PRODUCER fallback) | 0x48CC90 | 2KB | VERY HIGH |
| ctor_043_0: NVIDIA CICC-specific options (19 opts, XOR cipher hidden flag) | 0x48D7F0 | 30KB | VERY HIGH |
| MASTER pass/analysis registration (~172 init calls) | 0x4A5950 | 7KB | VERY HIGH |
| ctor_107_0: MC/Target options (131 opts, getenv("bar") backdoor) | 0x4A64D0 | 59KB | VERY HIGH |
| ctor_133_0: Known library function table (422 C/POSIX functions) | 0x4B0180 | 29KB | VERY HIGH |
| ctor_145: MISSING from decompilation (too large for Hex-Rays) | 0x4B4360 | ~99KB | HIGH |
| ctor_147_0: PassManager debug/print options | 0x4CC760 | 20KB | HIGH |
| ctor_156_0: CLI infrastructure (help, version, print-options) | 0x4CEB50 | 9KB | HIGH |
| ctor_186_0: Inliner heuristics (NVIDIA: profuseinline, inline-budget) | 0x4DBEC0 | 14KB | HIGH |
| ctor_201: GVN options (NVIDIA: profusegvn, gvn-dom-cache) | 0x4E0990 | 9KB | HIGH |
| ctor_214_0: LSR options (NVIDIA: disable-lsr-for-sharedmem32-ptr) | 0x4E4B00 | 8KB | HIGH |
| ctor_216_0: Loop Unrolling options (largest unroll ctor) | 0x4E5C30 | 21KB | HIGH |
| ctor_259_0: CICC core compiler options (debug-compile, maxreg) | 0x4F0FB0 | 17KB | HIGH |
| ctor_262_0: BranchDist pass options | 0x4F2830 | 10KB | HIGH |
| ctor_263_0: SCEV-CGP pass options (44 strings!) | 0x4F36F0 | 10KB | HIGH |
| ctor_264: IP-MSP knobs | 0x4F45B0 | — | HIGH |
| ctor_267_0: MemorySpaceOpt options (18 strings) | 0x4F54D0 | 10KB | HIGH |
| ctor_277_0: Rematerialization options (39 strings, remat-for-occ) | 0x4F7BE0 | 7KB | HIGH |
| ctor_335_0: MASTER codegen pass configuration (88 strings) | 0x507310 | 29KB | VERY HIGH |
| ctor_356_0: NVPTX SM enum + PTX version table (45 entries, sm_20–sm_121f) | 0x50C890 | 16KB | VERY HIGH |
| ctor_358_0: NVPTX pass enable/disable (43 strings, usedessa) | 0x50E8D0 | 21KB | HIGH |
| ctor_361_0: NV Remat Machine Block options (30 strings, nv-remat-*) | 0x5108E0 | 8KB | HIGH |
| ctor_376_0: LTO/bitcode/plugin options | 0x512DF0 | 39KB | HIGH |
| ctor_377_0: PassBuilder pipeline configuration (77 strings) | 0x516190 | 44KB | HIGH |
| ctor_388_0: Optimizer pipeline enables (enable-ml-inliner, etc.) | 0x51B710 | 15KB | HIGH |
| ctor_600_0: CodeGen/TargetMachine mega-options (118 strings) | 0x57F210 | 59KB | HIGH |
| ctor_605: SM processor table (45 entries, sm_20–sm_121f, PTX version map) | 0x584510 | 3KB | VERY HIGH |
| ctor_609_0: NVPTX backend options (25+ opts, usedessa, enable-nvvm-peephole) | 0x585D30 | 37KB | HIGH |
| ctor_637_0: disable-*Pass flag registration (48 flags) | 0x593380 | — | HIGH |
| ctor_701: MISSING data blob (likely instruction encoding tables) | 0x5A8850 | ~70KB | MEDIUM |
NVIDIA Custom Pass Implementations
| Function | Address | Size | Confidence |
|---|---|---|---|
| MemorySpaceOptPass registration | 0x2CDD6D0 | reg | HIGH |
| MemorySpaceOptPass factory | 0x2CDFF20 | factory | HIGH |
| MemorySpaceOpt core analysis | 0x2CDA660 | 10KB | HIGH |
| MemorySpaceOpt address space inference | 0x2CD7710 | 9KB | HIGH |
| IPMSPPass (interprocedural memory space) registration | 0x1C6FBC0 | reg | HIGH |
| RematerializationPass (IR-level) implementation | 0x1CE7DD0 | 13KB | HIGH |
| Machine Block Rematerialization | 0x2186D90 | 9KB | HIGH |
| BranchDistPass registration | 0x1C4B520 | reg | HIGH |
| LoopIndexSplitPass implementation | 0x1C7B2C0 | 11KB | HIGH |
| NVVMPeepholeOptimizerPass registration | 0x2CAF0F0 | reg | HIGH |
| ByValMem2RegPass | 0x2CD6510 | 350B | HIGH |
| BasicDeadBarrierEliminationPass | 0x2CD2690 | 366B | HIGH |
| CNPLaunchCheckPass (Dynamic Parallelism validation) | 0x1CEBC30 | reg | HIGH |
| PrintfLoweringPass | 0x1CB0B80 | name | HIGH |
| Pass registration master function (all 402+20 passes) | 0x2342890 | 32KB | VERY HIGH |
| Pass name listing (pipeline names for all passes) | 0x233C410 | — | HIGH |
MMA / Tensor Core Emission
| Function | Address | Size | Confidence |
|---|---|---|---|
| MMA instruction operand builder (shapes, types, rounding modes) | 0x21E74C0 | 17KB | VERY HIGH |
| tcgen05 Blackwell scaled MMA operands (scaleD, negA, negB, transA) | 0x21E8CD0 | 2KB | VERY HIGH |
| HMMA store-C (hmmastc), SM ≥ 70 | 0x21DFBF0 | 5KB | HIGH |
| HMMA load-A/B (hmmaldab), SM ≥ 70 | 0x21E0360 | 3KB | HIGH |
| HMMA load-C (hmmaldc), SM ≥ 70 | 0x21E0630 | 3KB | HIGH |
| HMMA MMA (hmmamma), SM ≥ 70 | 0x21E0870 | 4KB | HIGH |
| IMMA load-A/B (immaldab), SM ≥ 72 | 0x21E1280 | 4KB | HIGH |
| IMMA load-C (immaldc), SM ≥ 72 | 0x21E15D0 | 3KB | HIGH |
| IMMA store-C, SM ≥ 72 | 0x21E1830 | 5KB | HIGH |
| IMMA MMA w/ saturation (immamma), SM ≥ 72 | 0x21E1D20 | 6KB | HIGH |
| Binary MMA (bmmamma, b1 .and.popc/.xor.popc), SM ≥ 75 | 0x21E2280 | 6KB | HIGH |
| MMA address-space resolver (opcode → addrspace enum) | 0x21DEF90 | — | HIGH |
| tcgen05 scaled MMA operands (NVPTX backend copy) | 0x35F3E90 | — | HIGH |
| tcgen05.mma full instruction lowering (10 shape variants) | 0x36E9630 | — | HIGH |
| tcgen05.mma SelectionDAG lowering | 0x304E6C0 | — | HIGH |
| tcgen05 infrastructure ops (fence/wait/alloc/dealloc/cp/commit) | 0x30462A0 | — | HIGH |
PTX Emission
| Function | Address | Size | Confidence |
|---|---|---|---|
| Function header orchestrator (.entry/.func, params, attrs, pragmas) | 0x215A3C0 | — | VERY HIGH |
| Kernel attribute emission (.reqntid, .maxntid, cluster, .maxnreg) | 0x214DA90 | — | VERY HIGH |
| Stack frame emission (__local_depot, %SP, %SPL, register decls) | 0x2158E80 | 17KB | VERY HIGH |
| Register class → encoded ID (9 classes, 0x10000000–0x90000000) | 0x21583D0 | — | HIGH |
| Register class → PTX type suffix (.pred, .b16, .b32, .b64, .f32, .f64, .b128) | 0x2163730 | — | HIGH |
| Register class → PTX prefix (%p, %rs, %r, %rd, %f, %fd, %h, %hh, %rq) | 0x21638D0 | — | HIGH |
| GenericToNVVM pass registration ("generic-to-nvvm") | 0x215DC20 | — | VERY HIGH |
| GenericToNVVM pass body (addrspace 0→1 rewriting) | 0x215E100 | 36KB | HIGH |
| Module emission entry (global ctor rejection, DWARF init) | 0x215ACD0 | — | HIGH |
| Global variable emission (texref/surfref/samplerref/data) | 0x2156420 | — | HIGH |
| Atomic opcode emission (13 ops, scope prefix) | 0x21E5E70 | — | VERY HIGH |
| L2 cache-hinted atomic emission (Ampere+) | 0x21E6420 | — | HIGH |
| Memory barrier emission (membar.cta/gpu/sys, fence.sc.cluster) | 0x21E94F0 | — | HIGH |
| Cluster barrier emission (arrive/wait + relaxed) | 0x21E8EA0 | — | HIGH |
| Special register emission (%tid, %ctaid, %ntid, %nctaid) | 0x21E86B0 | — | VERY HIGH |
| Cluster special register emission (15 regs, SM 90+) | 0x21E9060 | — | HIGH |
| Address space conversion + MMA helpers (cvta, rowcol, abtype) | 0x21E7FE0 | — | HIGH |
Hash Infrastructure
| Function | Address | Size | Confidence |
|---|---|---|---|
| wyhash v4 hash function (multi-length dispatch) | 0xCBF760 | — | VERY HIGH |
| Thin wrapper → sub_CBF760 (hash for builtin names) | 0xC92610 | — | HIGH |
| Hash table insert-or-find (quadratic probing, triangular numbers) | 0xC92740 | — | VERY HIGH |
| Hash table find-only (same probing) | 0xC92860 | — | HIGH |
| Rehash at 75% load factor (double or tombstone cleanup) | 0xC929D0 | — | HIGH |
| String entry allocator (length+17, 8-byte aligned) | 0xC7D670 | — | HIGH |
NVVM Builtin Infrastructure
| Function | Address | Size | Confidence |
|---|---|---|---|
| Hash table insertion helper (pre-opt) | 0x90ADD0 | 56 lines | VERY HIGH |
| Builtin dispatcher (pre-opt): name → ID | 0x913450 | 27 lines | VERY HIGH |
| Builtin dispatcher (post-opt): name → ID | 0x12731E0 | 25 lines | VERY HIGH |
| Builtin lowering engine (pre-opt, wgmma/tex/surf, 3571 lines) | 0x955A70 | 103KB | HIGH |
| Builtin lowering engine (post-opt, 3408 lines) | 0x12B3FD0 | 101KB | HIGH |
Register Allocation
| Function | Address | Size | Confidence |
|---|---|---|---|
| Instruction constraint emission (180+ case opcode switch) | 0xB612D0 | 102KB | HIGH |
| SimplifyAndColor phase | 0x1081400 | 13KB | HIGH |
| SelectNodeForRemoval / Briggs criterion (K=15 at 3 locations) | 0x1090BD0 | 10KB | VERY HIGH |
| AssignColorsAndOptimize (address unverified, was erroneously listed as 0x12E1EF0) | 0x10841C0 | 11KB | MEDIUM |
| Operand constraint spec creator (type 14=GPR, 40=FP, 78=vec) | 0xA778C0 | — | HIGH |
| Final instruction emitter with allocated registers | 0xA78010 | — | HIGH |
jemalloc (Statically Linked, v5.3.x)
| Function | Address | Size | Confidence |
|---|---|---|---|
| je_stats_print_arena (per-arena stats, HPA shards) | 0x4134A7 | 83KB | HIGH |
| je_stats_print_bins (18 stat columns per bin) | 0x40F894 | 37KB | HIGH |
| je_stats_general (version, build config, runtime opts) | 0x411419 | 32KB | HIGH |
| je_stats_print (top-level: allocated, active, resident, mapped) | 0x417CBD | 14KB | HIGH |
| je_stats_print_large (large extent class stats) | 0x40EF06 | 13KB | HIGH |
| je_malloc_vsnprintf (custom format printer, avoids reentrancy) | 0x40D5CA | 21KB | HIGH |
| je_mutex_stats_read (mutex profiling counters) | 0x40E5B5 | 7KB | HIGH |
| je_malloc_conf_init (199 config strings) | 0x12FCDB0 | 129KB | VERY HIGH |
Optimizer Pipeline Assembly
Functions discovered during wiki writing (W101–W241). These assemble the LLVM optimization pipeline from NVVMPassOptions slots.
Pipeline Builders
| Function | Address | Size | Confidence |
|---|---|---|---|
| Master pipeline assembler (reads opts struct, ~150 pass-insertion decisions) | 0x12E54A0 | 50KB | VERY HIGH |
| Tier 0 full optimization sub-pipeline (~40 passes, base for O1/O2/O3) | 0x12DE330 | — | VERY HIGH |
| Tier 1/2/3 phase-specific sub-pipeline (phase-conditional pass insertion) | 0x12DE8F0 | — | VERY HIGH |
| Codegen pass dispatch (reads opts[200] optimization threshold) | 0x12DFE00 | 20.7KB | HIGH |
| OPT stage two-phase orchestrator (sets qword_4FBB3B0 to 1 or 2) | 0x12E7E70 | — | VERY HIGH |
| New-PM driver: pipeline name selector (O0/O1/O2/O3/Ofcmin/Ofcmid/Ofcmax) | 0x226C400 | — | HIGH |
| NVPTXTargetMachine creation (NVIDIA options, standalone path) | 0x12F4060 | 16KB | HIGH |
| OptiX IR generation core function | 0x12F9270 | ~6KB | HIGH |
Pass Factories (Pipeline Insertion Order)
Each factory creates a pass instance; referenced from sub_12E54A0, sub_12DE330, and sub_12DE8F0.
| Function | Address | Size | Confidence |
|---|---|---|---|
| NVVMReflect factory (~8 pipeline insertions) | 0x1857160 | — | HIGH |
| SCCP factory | 0x1842BC0 | — | HIGH |
| NVVMVerifier wrapper (creates context, invokes module verifier) | 0x12D4560 | — | HIGH |
| NVVMPredicateOpt factory (AggressiveInstCombine variant) | 0x18A3430 | — | HIGH |
| NVVMPredicateOpt variant / LoopRotate factory | 0x18A3090 | — | HIGH |
| ConstantMerge / GlobalDCE / LICM factory | 0x184CD60 | — | HIGH |
| FunctionAttrs factory (infers readonly, nounwind, etc.) | 0x1841180 | — | HIGH |
| LICM factory (parameter 0 = standard mode) | 0x195E880 | — | HIGH |
| LoopVectorize/SLP factory (7 params: width, thresholds) | 0x19B73C0 | — | HIGH |
| CGSCC standard pipeline factory (InlinerWrapper, 1–5 iterations) | 0x1A62BF0 | — | HIGH |
| PrintModulePass factory (debug dump, params: level, verbose) | 0x17060B0 | — | HIGH |
| JumpThreading / CVP factory (parameter: threshold) | 0x198DF00 | — | HIGH |
| EarlyCSE factory | 0x196A2B0 | — | HIGH |
| SROA factory | 0x1968390 | — | HIGH |
| DCE (DeadCodeElimination) factory | 0x18DEFF0 | — | HIGH |
| Sink/MemSSA factory (3 params: mode, flags) | 0x1869C50 | — | HIGH |
| NVVMLoopOpt/BarrierOpt / IV Demotion factory | 0x18B1DE0 | — | HIGH |
| NVVMIntrinsicLowering factory (level 0 = basic, level 1 = barrier) | 0x1CB4E40 | — | HIGH |
| MemCpyOpt factory | 0x1B26330 | — | HIGH |
| LoopUnroll / SpeculativeExecution factory (2 params) | 0x19C1680 | — | HIGH |
| ADCE (AggressiveDeadCodeElimination) factory | 0x1C76260 | — | HIGH |
| ADCE variant factory (separate pipeline position) | 0x1C6FCA0 | — | HIGH |
| SimplifyCFG factory (2 params: mode, flags) | 0x190BB10 | — | HIGH |
| InstructionSimplify factory | 0x1A7A9F0 | — | HIGH |
| NVVMRematerialization factory (IR-level) | 0x1A13320 | — | HIGH |
| Reassociate factory (parameter: tier) | 0x1B7FDF0 | — | HIGH |
| LoopStrengthReduce factory | 0x19CE990 | — | HIGH |
| NVVMBranchDist factory (two pipeline positions) | 0x1CB73C0 | — | HIGH |
| NVVMSinking2 factory (SM-specific late sinking) | 0x1CC60B0 | — | HIGH |
| NVVMGenericAddrOpt factory (generic address optimization) | 0x1CC71E0 | — | HIGH |
| NVVMReduction factory (SM-specific) | 0x1CC5E00 | — | HIGH |
| NVVMUnreachableBlockElim factory | 0x1CC3990 | — | HIGH |
| NVVMLateOpt factory (Tier 3 only) | 0x1C46000 | — | HIGH |
| NVVMLowerAlloca factory (dual gate: opts[2240] + opts[2280]) | 0x1CBC480 | — | HIGH |
| NVVMLowerBarriers factory (runs between LICM invocations) | 0x1C98160 | — | HIGH |
| Sinking2Pass fast-mode factory (flag=1, Ofcmin pipeline) | 0x18B3080 | — | HIGH |
| VerifierPass factory (late CFG cleanup guard at opts[4464]) | 0x1654860 | — | HIGH |
| NVIDIA loop pass factory (opts[3080] guard) | 0x1922F90 | — | MEDIUM |
| EarlyCSE MemorySSA variant / NVVMBarrierAnalysis factory | 0x18E4A00 | — | HIGH |
| EarlyCSE variant (v=1 if opts[3704]) | 0x1C8A4D0 | — | HIGH |
| NVVMAnnotationsProcessor factory | 0x215D9D0 | — | HIGH |
| NVIDIA Custom Inliner (CGSCC, 20,000-unit per-caller budget) | 0x1864060 | 75KB | VERY HIGH |
NVPTX Backend (SelectionDAG & ISel)
| Function | Address | Size | Confidence |
|---|---|---|---|
| NVPTXTargetLowering::LowerIntrinsicCall (among the largest functions in the binary) | 0x33B0210 | 343KB | VERY HIGH |
| NVPTXDAGToDAGISel::Select (ISel entry, hash-based cost table) | 0x3090F90 | 91KB | VERY HIGH |
| computeKnownBitsForTargetNode (112 opcodes, 399x sub_969240 calls) | 0x33D4EF0 | 114KB | HIGH |
| NVPTXTargetLowering::LowerCall (PTX .param calling convention) | 0x3040BF0 | 88KB | HIGH |
| LLVM standard InlineCostAnalysis (library function) | 0x30DC7E0 | 51KB | HIGH |
| Vector legalization type-split record mapping | 0x3302A00 | — | HIGH |
| Operand type classifier (reads byte_444C4A0) | 0x34961A0 | 26.6KB | HIGH |
NVVM Verifier Subsystem
| Function | Address | Size | Confidence |
|---|---|---|---|
| NVVMModuleVerifier (data layout, address space, triple validation) | 0x2C80C90 | 51KB | HIGH |
| NVVMIntrinsicVerifier (SM gates, types, MMA, atomics, tex/surf) | 0x2C7B6A0 | 143KB | VERY HIGH |
| Frontend verifier (convergent intrinsic SM-version gating) | 0x1C36530 | — | HIGH |
| NVVMIntrinsicLowering core engine (2,460 lines) | 0x2C63FB0 | 140KB | HIGH |
LTO Subsystem
| Function | Address | Size | Confidence |
|---|---|---|---|
| NVModuleSummary builder (ThinLTO, two-phase declaration merge) | 0xD7D4E0 | 74KB | HIGH |
| New PM CGSCC inliner (inside LazyCallGraph framework) | 0x2613930 | 69KB | HIGH |
| IP-MSP module-pass variant (LIBNVVM path, DenseMap-based) | 0x1C6A6C0 | 54KB | HIGH |
| LinkUserModules (wrapper around LLVM Linker::linkModules) | 0x12F5610 | ~4KB | HIGH |
LLVM IR Utility Functions
Common LLVM IR manipulation functions referenced across many passes.
| Function | Address | Size | Confidence |
|---|---|---|---|
| operator new / BumpPtrAllocator (SDNode, BasicBlock, pass objects) | 0x22077B0 | — | HIGH |
| Value::replaceAllUsesWith / salvageDebugInfo | 0xBD84D0 | — | HIGH |
| Instruction::eraseFromParent / SDUse remove from use list | 0xB43D60 | — | HIGH |
| getCalledFunction / BranchInst::getCondition | 0xB43CB0 | — | HIGH |
| Function::hasAttribute(N) (noimplicitfloat, optnone, convergent) | 0xB2D610 | — | HIGH |
| Function::getName / IR node name getter | 0xBD5D20 | — | HIGH |
| PHINode::Create / SDNode alloc variant (80 bytes) | 0xBD2DA0 | — | HIGH |
| hasAttribute(26) (convergent/varargs marker check) | 0xB91C10 | — | HIGH |
| TTI::getInstructionCost (IR-level) / MDString::getString | 0xB91420 | — | HIGH |
| Ref-count decrement on metadata/debug-info | 0xB91220 | — | HIGH |
| Ref-count increment on metadata/debug-info | 0xB96E90 | — | HIGH |
| Value::setName / SetValueName (assigns %name to IR value) | 0x164B780 | — | HIGH |
| IRBuilder::CreateBinOp / SCEV type extension (349x callers) | 0x1623A60 | — | HIGH |
| ReleaseDebugLoc / debug location list removal | 0x161E7C0 | — | HIGH |
| Fatal error emitter ("Broken module found, compilation aborted!") | 0x16BD130 | — | HIGH |
| Create binary OR instruction (opcode 27) | 0x15FB440 | — | HIGH |
| DataLayout::getPointerSizeInBits(addressSpace) | 0x15A9520 | — | HIGH |
| DataLayout::getStructLayout (struct size computation) | 0x15A9930 | — | HIGH |
| SCEV fold/normalize / NVVM AA address-space NoAlias query | 0x146F1B0 | — | HIGH |
| CombineTo / ReplaceAllUsesWith (DAG use-chain + worklist push) | 0xF162A0 | — | HIGH |
| Function cloner (coroutine resume/destroy) | 0xD2E510 | — | HIGH |
| Create runtime library call instruction (OpenMP, MMA, barriers) | 0x921880 | — | HIGH |
| Builtin function call emitter (pre-opt path, EDG builtins) | 0x1285290 | — | HIGH |
| Kernel metadata emitter (cluster_dim, blocksareclusters) | 0x93AE30 | ~5.6KB | HIGH |
| ExpandIntegerResult (type legalization, 632 case labels) | 0x201BB90 | 75KB | HIGH |
Machine-Level Infrastructure
| Function | Address | Size | Confidence |
|---|---|---|---|
| InstrEmitter DenseMap grow / rehash (hash: key*37) | 0x2E29BA0 | — | HIGH |
| TwoAddressInstruction DenseMap (SrcEqClassMap) | 0x1F4E3A0 | — | HIGH |
Binary Layout
This page is a visual guide to navigating the cicc v13.0 binary in IDA Pro. It covers the ELF structure, section layout, subsystem address ranges, embedded data payloads, and the statically linked jemalloc allocator. If you are opening this binary for the first time, start here to orient yourself before diving into individual subsystems.
ELF Overview
CICC is a statically linked, stripped x86-64 ELF binary. There are no dynamic symbol tables, no .dynsym, no DWARF debug info, and no export table. Every function name was removed at build time. IDA Pro recovers 80,562 functions; Hex-Rays successfully decompiles 80,281 of them (99.65%).
| Property | Value |
|---|---|
| File size | 60,108,328 bytes (57.3 MB) |
| Architecture | x86-64, little-endian |
| Linking | Fully static (no .interp, no PLT/GOT) |
| Stripped | Yes, all symbol tables removed |
| Build ID | cuda_13.0.r13.0/compiler.36424714_0 |
| Compiler | Built with GCC (inferred from CRT stubs and .init_array layout) |
| Allocator | jemalloc 5.3.x, statically linked (~400 functions) |
Because the binary is statically linked, libc, libpthread, and libm are all embedded. This inflates the raw function count but also means every call target resolves to a concrete address within the binary itself -- there are no external dependencies at runtime beyond the kernel syscall interface.
Address Space Map
The binary's .text section spans roughly 0x400000 to 0x3C00000. Within that 56 MB range, subsystems occupy contiguous, non-overlapping regions. The map below is the primary orientation tool for IDA Pro navigation.
0x400000 ┌─────────────────────────────────────────┐
│ CRT startup + libc stubs │ ~52 KB
0x40D000 ├─────────────────────────────────────────┤
│ jemalloc stats / vsnprintf │ ~80 KB
0x420000 ├─────────────────────────────────────────┤
│ (gap: misc libc, math, string ops) │ ~64 KB
0x430000 ├─────────────────────────────────────────┤
│ Global constructors (cl::opt reg) │ ~1.6 MB
│ ~1,689 LLVM command-line option objects │
0x5D0000 ├─────────────────────────────────────────┤
│ EDG 6.6 C++ Frontend │ 3.2 MB
│ Parser, constexpr evaluator, IL walker │
0x8F0000 ├─────────────────────────────────────────┤
│ CLI / Real Main / NVVM Bridge │ 520 KB
│ sub_8F9C90 (real main), dual-path dispatch│
0x960000 ├─────────────────────────────────────────┤
│ Architecture detection, NVVM options │ 576 KB
0x9F0000 ├─────────────────────────────────────────┤
│ Bitcode reader (parseFunctionBody) │ ~1 MB
0xAF0000 ├─────────────────────────────────────────┤
│ X86 AutoUpgrade (legacy, 457KB fn) │ ~1 MB
0xBF0000 ├─────────────────────────────────────────┤
│ LLVM IR Verifier │ 500 KB
0xC00000 ├─────────────────────────────────────────┤
│ LLVM Support / ADT library │ ~3.2 MB
│ (see detailed sub-map below) │
0x12D0000├─────────────────────────────────────────┤
│ PassManager / NVVM bridge │ 4.2 MB
│ Pipeline assembly (sub_12E54A0) │
0x12FC000├ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┤
│ jemalloc core (~400 functions) │ ~256 KB
0x1700000├─────────────────────────────────────────┤
│ Backend / machine passes │ 8 MB
│ RegAlloc, Block Remat, Mem2Reg │
0x1F00000├─────────────────────────────────────────┤
│ SelectionDAG │ 2 MB
│ LegalizeTypes (348KB), LegalizeOp │
0x2100000├─────────────────────────────────────────┤
│ NVPTX PTX emission │ 1 MB
0x2340000├─────────────────────────────────────────┤
│ New PM / pass registration │ 768 KB
│ 2,816-line registrar at sub_2342890 │
0x2A00000├─────────────────────────────────────────┤
│ Loop passes │ 4 MB
│ LoopVectorize, SLP, Unroll │
0x3000000├─────────────────────────────────────────┤
│ NVPTX ISel + lowering │ 7 MB
│ 343KB intrinsic switch (sub_33B0210) │
0x3700000├─────────────────────────────────────────┤
│ Machine-level passes (tail) │ ~3 MB
│ BlockPlacement, Outliner, StructurizeCFG │
0x3A00000├─────────────────────────────────────────┤
│ (trailing code, CRT finalization) │
└─────────────────────────────────────────┘
DATA SECTIONS:
0x3EA0080 Embedded libdevice bitcode (Path A) 456 KB
0x420FD80 Embedded libdevice bitcode (Path B) 456 KB
0x4F00000+ Global BSS (cl::opt storage, hash tables, state)
Detailed Subsystem Map at Pass Granularity
The coarse map above partitions the binary into ~18 zones. The following map refines every zone to individual-pass resolution, giving the factory address of each identified pass or subsystem entry point. Addresses prefixed with sub_ are IDA function names. Sizes in parentheses are decompiled C output; actual machine code is typically 2-3x smaller.
Zone 1: CRT, libc, jemalloc stats (0x400000 - 0x42FFFF)
0x400000 _start / CRT entry (ELF entry point)
0x40D5CA sub_40D5CA vsnprintf (jemalloc stats formatting)
0x420000 libc math/string helpers (memcpy, memset, strlen, etc.)
No LLVM or NVIDIA code lives here. Pure runtime support.
Zone 2: Global constructors (0x430000 - 0x5CFFFF)
The global constructors in this zone register ~1,689 cl::opt command-line option objects before main(). Each constructor registers an option string, description, default value, and storage pointer into the global option registry. The .init_array section holds the function pointers to these constructors.
Zone 3: EDG 6.6 C++ Frontend (0x5D0000 - 0x8EFFFF)
The complete Edison Design Group C++ frontend, version 6.6. Contains the lexer, parser, constexpr evaluator, template instantiator, overload resolver, IL walker/copier, diagnostic engine, SARIF output, and CUDA-specific extensions (kernel launch grammar, __shared__/__device__ memory space parsing, atomic builtin stubs).
| Function | Address | Size |
|---|---|---|
| EDG main entry (called from real main) | sub_5D2A80 | |
| Expression parser core | sub_610000-sub_62FFFF | 128 KB |
| Declaration processing | sub_750000-sub_76FFFF | 128 KB |
| Template / constexpr | sub_840000-sub_87FFFF | 256 KB |
| SARIF, diagnostics, keywords | sub_880000-sub_8EFFFF | 448 KB |
Zone 4: CLI / Real Main / Dual-Path Entry (0x8F0000 - 0x9EFFFF)
| Function | Address | Size |
|---|---|---|
| Real main (after CRT/jemalloc init) | sub_8F9C90 | |
| Path A CLI parsing (LibNVVM API mode) | sub_900130 | |
| Path A simple compile entry | sub_902D10 | |
| Path A multi-stage pipeline | sub_905EE0 | 43 KB |
| Path A builtin resolution table | sub_90AEE0 | 109 KB |
| Architecture detection, NVVM option parsing | sub_960000-sub_9EFFFF | 576 KB |
Zone 5: Bitcode Reader / X86 AutoUpgrade / Verifier (0x9F0000 - 0xBFFFFF)
| Sub-range | Contents |
|---|---|
| 0x9F0000-0xAEFFFF | Bitcode reader (sub_A24000 parseFunctionBody ~166KB) |
| 0xAF0000-0xBEFFFF | X86 AutoUpgrade (sub_A939D0 457KB -- legacy intrinsic upgrader) |
| 0xBF0000-0xBFFFFF | LLVM IR Verifier entry points |
Zone 6: LLVM Support Library (0xC00000 - 0xCAFFFF)
1,653 functions. Pure LLVM infrastructure -- no NVIDIA-specific modifications except a single !Flat address space annotation in the sample profile reader at sub_C29E70.
| Sub-range | Functions | Contents |
|---|---|---|
| 0xC00000-0xC0F000 | 65 | IR Verifier (sub_C05FA0 visitInstruction 75KB, sub_C0A940 verify 12KB) |
| 0xC0D4F0 | 1 | sub_C0D4F0 TargetRegistry::lookupTarget (8KB) |
| 0xC0F6D0 | 1 | sub_C0F6D0 IR module linker (48KB) |
| 0xC10000-0xC2FFFF | ~400 | InstrProf reader, Sample Profile reader/writer, hashing |
| 0xC30000-0xC3FFFF | 214 | ImmutableMap/Set, APInt printing |
| 0xC40000-0xC4FFFF | 197 | APInt core arithmetic (div, mul, shift) |
| 0xC50000-0xC5FFFF | 141 | CommandLine parser (cl::opt infrastructure) |
| 0xC60000-0xC6FFFF | 135 | JSON parser, debug counters, error handling |
| 0xC70000-0xC7FFFF | 114 | ConstantRange arithmetic |
| 0xC80000-0xC8FFFF | 194 | SHA-1 hash, regex, SmallVector, sorting |
| 0xC90000-0xC9FFFF | 139 | Timer/profiling, TimeTrace (Chrome trace) |
| 0xCA0000-0xCAFFFF | 186 | YAML lexer/parser, TypeSize, VFS |
Zone 7: NVVM Container, SCEV, DWARF, MC Layer (0xCB0000 - 0x10CFFFF)
This 4 MB zone contains LLVM mid-level infrastructure and the NVVM container format.
| Sub-range | Contents | Key functions |
|---|---|---|
| 0xCB0000-0xCBFA60 | YAML parser/emitter (libyaml) | sub_CB9640 main parser (26KB) |
| 0xCC0130-0xCCABA0 | LLVM Triple parsing | sub_CC0130 Triple_normalize (35KB) |
| 0xCCBB10-0xCDCA30 | NVVM container format | sub_CDD2D0 serialize, sub_CD1D80 deserialize, sub_CCD5F0 version validator (9KB) |
| 0xCD9990 | NVVM options parser (calls 60+ parse helpers) | |
| 0xD60000-0xD82000 | NV Module Summary / LTO | sub_D7D4E0 buildModuleSummary (74KB), sub_D81040 runOnModule (56KB) |
| 0xD83000-0xDFD000 | ScalarEvolution (SCEV) | SCEV framework, AddRecExpr, backedge analysis |
| 0xE00000-0xE0FFFF | DWARF debug info string/enum tables | |
| 0xE10000-0xE2FFFF | Itanium C++ name demangler | sub_E18BB0 parseExpr (47KB) |
| 0xE30000-0xEBFFFF | MC assembler layer | ELF/COFF/MachO section parsers, expression evaluator |
| 0xEC0000-0xED0000 | MC assembler directives | sub_ECB300 ELF section parser (40KB) |
| 0xED0000-0xEF8000 | InstrProf / MemProf reader | Profiling data infrastructure |
| 0xEF8000-0xF05000 | Bitstream remark serialization | |
| 0xF05000-0xF6FFFF | SelectionDAG infrastructure | DAG node creation, SDValue, EVT/MVT helpers |
| 0xF70000-0xF8FFFF | Loop vectorization runtime checks | sub_F77B70 vectorizeLoop (37KB), sub_F72730 canVectorizeMemory (29KB) |
| 0xF90000-0xFCFFFF | SimplifyCFG + code sinking | sub_FB0000 switch table gen, sub_FA0000 speculative exec |
| 0xFD0000-0xFEFFFF | AliasSet, register pressure tracking, CFG graphviz | |
| 0xFF0000-0x101FFFF | Block scheduling, RPO traversal, constant folding | |
| 0x1020000-0x103FFFF | Inline ASM + scheduling model | sub_1035170 CUTLASS kernel detection (41KB) |
| 0x1040000-0x106FFFF | Divergence analysis, DAG utilities, IR linker | |
| 0x1070000-0x10AFFFF | MC object emission, InstructionSimplify | sub_10ACA40 visitAdd (94KB) |
Zone 8: InstCombine Mega-Region (0x10D0000 - 0x122FFFF)
The single largest contiguous pass in the binary. NVIDIA's modified InstCombine spans 1.4 MB of code with three NVIDIA-custom opcodes (0x254D, 0x2551, 0x255F) for proprietary intrinsic folding.
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x10D0000-0x10EFFFF | InstCombine visitors (casts, shifts, memory) | Various visitXxx functions |
| 0x10EE7A0 | InstCombine main visitor | sub_10EE7A0 (405KB / 9,258 lines -- second-largest function in the binary, after sub_A939D0) |
| 0x10F0000-0x1100000 | Sub-visitors for specific opcodes | |
| 0x1100000-0x1170000 | Intrinsic folding, demanded bits | sub_1169C30 intrinsic folder (87KB), sub_11A7600 computeKnownBits (127KB) |
| 0x1180000-0x119FFFF | InstCombine core worklist | sub_1190310 main dispatch (88KB) |
| 0x11A0000-0x11AFFFF | ValueTracking / KnownBits | sub_11AE870 SimplifyDemandedBits |
| 0x11B0000-0x11BFFFF | InstCombine tail (vector, extract/insert) | |
| 0x11D0000-0x11FFFFF | SimplifyLibCalls | Math function optimization |
| 0x11FF000-0x122FFFF | LLVM textual IR parser (LLParser) | |
Zone 9: NVVM Bridge / Builtin System / IR Codegen (0x1230000 - 0x12CFFFF)
This zone is the core NVIDIA bridge between the EDG frontend AST and the LLVM IR optimizer.
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x1230000-0x125FFFF | LLVM IR codegen from AST | Expression, statement, type codegen |
| 0x125FB30 | Path B CLI parsing | sub_125FB30 (standalone/nvcc mode) |
| 0x1262860 | Path B simple compile | sub_1262860 |
| 0x1265970 | Path B multi-stage pipeline | sub_1265970 (48KB) |
| 0x126A7B0 | Builtin lookup helper | sub_126A7B0 |
| 0x126A910 | Builtin registration table | sub_126A910 (126KB) -- registers 717 builtins (IDs 1-770) |
| 0x12B3FD0 | Builtin resolution dispatch | sub_12B3FD0 (103KB) -- giant switch on builtin ID |
| 0x12C06E0 | Bitcode linker | sub_12C06E0 (libdevice linking) |
Zone 10: Pipeline Builder / Pass Options (0x12D0000 - 0x12FFFFF)
The pipeline assembler constructs the complete LLVM pass pipeline by calling factory functions whose addresses are scattered across the entire binary.
| Function | Address | Size |
|---|---|---|
| Module split-range helper | sub_12D3E60 | |
| Pass factory: creates NVIDIA custom pass | sub_12D4560 | 325 B |
| NVVMPassOptions initializer -- populates 222 pass option slots into 4,480-byte struct | sub_12D6300 | 125 KB |
| AddPass -- hash-table-based pass insertion into pipeline | sub_12DE0B0 | 3.5 KB |
| Tier 0 sub-pipeline builder (full optimization, 40 passes) | sub_12DE330 | 4.8 KB |
| Tier 1/2/3 sub-pipeline builder (85-pass superset, tier-gated) | sub_12DE8F0 | |
| Codegen dispatch -- routes to backend machine pass pipeline | sub_12DFE00 | |
| Master pipeline assembler -- 1,553 lines, two major pipelines (normal + fast) | sub_12E54A0 | 49.8 KB |
| Machine pass assembly (Pipeline B fast path) | sub_12EB010 | |
| Machine codegen execution | sub_12EC4F0 | |
| jemalloc core (~400 functions) | sub_12FC000+ | ~256 KB |
| malloc_conf_init (parses 199 config strings from MALLOC_CONF) | sub_12FCDB0 | 129 KB |
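The AddPass mechanism (sub_12DE0B0) pairs a name-keyed hash table with an ordered sequence, making duplicate detection O(1) while preserving execution order. A minimal Python model of that behavior -- the class and method names here are illustrative, not recovered from the binary:

```python
# Toy model of hash-table-based pass insertion (cf. sub_12DE0B0):
# a dict gives O(1) duplicate checks, a list keeps pipeline order.
class PipelineBuilder:
    def __init__(self):
        self._index = {}   # pass name -> first insertion position
        self._order = []   # execution order

    def add_pass(self, name, allow_duplicates=False):
        if name in self._index and not allow_duplicates:
            return False                      # refuse re-insertion
        self._index.setdefault(name, len(self._order))
        self._order.append(name)
        return True

    def pipeline(self):
        return list(self._order)

pb = PipelineBuilder()
pb.add_pass("nvvm-memspace-opt")
pb.add_pass("instcombine")
# NVVMIntrinsicLowering really is inserted multiple times in the pipeline:
pb.add_pass("nvvm-intrinsic-lower", allow_duplicates=True)
pb.add_pass("nvvm-intrinsic-lower", allow_duplicates=True)
print(pb.pipeline())
```

Logging the pass-name argument at each AddPass call (tip 7 under IDA Pro Navigation Tips) recovers exactly this kind of ordered sequence at runtime.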
Zone 11: IR Infrastructure / PassManager (0x1300000 - 0x16FFFFF)
Dense LLVM infrastructure: IR types, constants, instructions, metadata, use-lists, PassManager execution engine, IR linker, bitcode reader, regex, and DataLayout.
| Sub-range | Contents | Key functions |
|---|---|---|
0x1300000-0x135FFFF | IR constants, types, APInt, APFloat | |
0x1360000-0x13FFFFF | IR instructions, basic blocks, functions | sub_1361950 AssumptionCacheTracker |
0x1400000-0x14FFFFF | TargetLibraryInfo, pass scheduling | sub_149CCE0 TLI wrapper, sub_14A04B0 TLI creation, sub_14A3CD0 NVPTX TargetPassConfig |
0x1500000-0x15FFFFF | IR builder, GEP, PHI, branch creation | sub_15F83E0 conditional branch, sub_15F9210 load, sub_15F9650 store |
0x1600000-0x160FFFF | PassManager execution engine | sub_160FB70 PassManager::run, sub_1611EE0 PassManagerBuilder init |
0x1610000-0x162FFFF | Pass scheduling, metadata RAUW | sub_1619140 register target passes, sub_1619BD0 PassManager::finalize |
0x1630000-0x16FFFFF | IR Linker, bitcode reader, regex | sub_16786A0 IRLinker::run (61KB), sub_166A310 parseFunctionBody (60KB) |
Zone 12: InstCombine (NewPM) + Sanitizers + PGO (0x1700000 - 0x17FFFFF)
946 functions. Dominated by the New Pass Manager version of InstCombine (~600 functions, ~3.5 MB decompiled), with sanitizer instrumentation (MSan, TSan, coverage) and PGO/GCov infrastructure.
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x1700000-0x17B0000 | InstCombine (NewPM) | sub_1743DA0 main visitor (168KB), sub_17A9010 liveness (111KB) |
| 0x17B0000-0x17BFFFF | GCov instrumentation | sub_17BF860 coverage notes (53KB) |
| 0x17C0000-0x17CFFFF | PGO indirect-call promotion | sub_17C2DB0 (39KB) |
| 0x17D0000-0x17DFFFF | MemorySanitizer | sub_17DDCE0 shadow propagation (58KB) |
| 0x17E0000-0x17EFFFF | PGO instrumentation | sub_17EEF60 InstrProfiling reader (81KB) |
| 0x17F0000-0x17FFFFF | ThreadSanitizer, SanitizerCoverage | sub_17FF260 TSan entry (51KB), sub_17F91F0 SanCov (44KB) |
| 0x17060B0 | PrintModulePass (debug dump, inserted ~30x in pipeline) | sub_17060B0 |
Zone 13: GVN + Scalar Passes + NVIDIA Custom IR Passes (0x1800000 - 0x1CFFFFF)
This 5 MB zone contains the bulk of LLVM's scalar optimization passes and all of NVIDIA's custom IR-level passes.
GVN family (0x1900000 - 0x193FFFF):
| Function | Address | Size |
|---|---|---|
| GVN::runOnFunction (core fixed-point iteration) | sub_1900BB0 | 83 KB |
| GVN PRE (Partial Redundancy Elimination) | sub_1906720 | 26 KB |
| NewGVN expression printing | sub_1930810 | 3 KB |
| NewGVN core value numbering | sub_1933B40 | 43 KB |
Standard scalar passes (0x1830000 - 0x1AFFFFF):
| Function (pipeline factory call) | Address | Size |
|---|---|---|
| InstructionCombining (Old PM wrapper) | sub_1832270 | |
| TailCallElim / JumpThreading | sub_1833EB0 | |
| FunctionAttrs | sub_1841180 | |
| SCCP (Sparse Conditional Constant Propagation) | sub_1842BC0 | |
| ConstantMerge / GlobalDCE | sub_184CD60 | |
| NVVMReflect | sub_1857160 | |
| IPConstantPropagation / ArgumentPromotion | sub_185D600 | |
| Sink / MemorySSA | sub_1869C50 | |
| NVVMPredicateOpt / SelectionOpt | sub_18A3430 | |
| LoopPass (barrier optimization) | sub_18B1DE0 | |
| DCE (Dead Code Elimination) | sub_18DEFF0 | |
| CorrelatedValuePropagation | sub_18EEA90 | |
| DSE (Dead Store Elimination) | sub_18F5480 | |
| DeadArgumentElimination | sub_18FD350 | |
| SimplifyCFG | sub_190BB10 | |
| LICM / LoopRotate | sub_195E880 | |
| LoopIndexSplit | sub_1952F90 | |
| LoopUnroll / LoopVectorize | sub_197E720 | |
| LoopSimplify / IndVarSimplify | sub_198DF00 | |
| SROA (Scalar Replacement of Aggregates) | sub_198E2A0 | |
| InstCombine variant | sub_19401A0 | |
| SROA variant / LoopUnswitch | sub_19B73C0 | |
| NVIDIA pass (unknown) | sub_19CE990 | |
| NVVMRematerialization (IR-level remat) | sub_1A13320 | |
| NVVMIRVerification | sub_1A223D0 | |
| LLVM standard pass pipeline (parameterized, called ~8x with different configs) | sub_1A62BF0 | |
| LoopIdiomRecognize / IndVarSimplify | sub_1A68E70 | |
| InstructionSimplify / ValueTracking | sub_1A7A9F0 | |
Loop unrolling + switch lowering (0x1B00000 - 0x1B7FFFF):
| Function | Address | Size |
|---|---|---|
| LoopUnroll main driver | sub_1B01A40 | 68 KB |
| Unroll-and-Jam | sub_1B07290 | 55 KB |
| Loop peeling | sub_1B0BF10 | 39 KB |
| Unroll prologue/epilogue generation | sub_1B12B90 | 65 KB |
| Code sinking (".sink.split") | sub_1B51110 | 51 KB |
| SimplifyCFG condition combining | sub_1B5C580 | 30 KB |
| Switch-to-lookup-table transformation | sub_1B60700 | 83 KB |
Loop/SLP vectorizer (0x1B80000 - 0x1BFFFFF):
| Function | Address | Size |
|---|---|---|
| LoopVectorize main driver ("loop-vectorize") | sub_1BB6740 | 43 KB |
| VPlan builder | sub_1BAB460 | 32 KB |
| SLP horizontal reduction ("slp-vectorizer") | sub_1BDDB00 | 47 KB |
| SLP shuffle/reorder engine | sub_1BD0660 | 62 KB |
NVVM module validation + configuration (0x1C00000 - 0x1C3FFFF):
| Function | Address | Size |
|---|---|---|
| NVVM codegen config parser (70+ knobs: AdvancedRemat, CSSACoalescing, DoMMACoalescing, PGO, OCGKnobs) | sub_1C20170 | 33 KB |
| NVVM compile mode parser (WHOLE_PROGRAM_NOABI/ABI, SEPARATE_ABI, opt level, debug info) | sub_1C21CE0 | 28 KB |
| Kernel attribute validator (cluster launch, parameter size, Hopper constraints) | sub_1C32740 | 30 KB |
| NVVM intrinsic lowering (tex/surf/syncwarp/ISBE/MAP/ATTR validation) | sub_1C36530 | 112 KB |
| NVVM module validator (data layout, target triple, UnifiedNVVMIR) | sub_1C3BC10 | 48 KB |
NVIDIA custom IR passes (0x1C40000 - 0x1CFFFFF):
This 1 MB block contains the majority of NVIDIA's proprietary IR-level optimization passes. Most entries below have no upstream LLVM equivalent; the remainder are NVIDIA-modified variants of standard passes (ADCE, EarlyCSE, GVN/LICM).
| Function | Address | Size | Role |
|---|---|---|---|
| Dead Synchronization Elimination -- removes redundant __syncthreads() barriers via fixed-point R/W dataflow | sub_1C47810 | 63 KB | dead-sync-elim |
| Alloca cloning / PHI insertion (mem2reg extension) | sub_1C4D210 | 69 KB | |
| NVIDIA pass helper (dead-sync / common-base infrastructure) | sub_1C585C0 | 39 KB | |
| Common Base Elimination -- removes redundant base address computations | sub_1C5DFC0 | 39 KB | common-base-elim |
| Block-level analysis infrastructure ("Processing", "Block") | sub_1C5FDC0 | 26 KB | |
| Base address bitcast helper ("baseValue", "bitCastEnd") | sub_1C637F0 | 28 KB | |
| Base Address Strength Reduction ("BaseAddressStrengthReduce") | sub_1C67780 | 59 KB | base-addr-sr |
| MemorySpaceOpt loop index analysis ("phi maxLoopInd") | sub_1C6A6C0 | 54 KB | |
| GVN or LICM variant | sub_1C6E800 | | |
| ADCE (Aggressive DCE) | sub_1C6FCA0 | | |
| MemorySpaceOpt function cloning -- specializes generic pointers to global/shared/local | sub_1C70910 | 75 KB | memspace-opt (core) |
| LoopIndexSplit -- splits loops on index conditions (three modes: all-but-one, single-iter, range-split) | sub_1C7B2C0 | 84 KB | loop-index-split |
| Memmove Unrolling -- forward/reverse element copy loops | sub_1C82A50 | 40 KB | lower-aggr-copies |
| Struct/Aggregate Splitting -- element-wise memcpy decomposition | sub_1C86CA0 | 73 KB | lower-aggr-copies |
| EarlyCSE / GVN variant | sub_1C8A4D0 | | |
| FP128/I128 Emulation -- replaces 128-bit ops with __nv_* library calls | sub_1C8C170 | 26 KB | lower-ops |
| MemorySpaceOpt entry (pipeline factory address) | sub_1C8E680 | | nvvm-memspace-opt |
| NVVMLowerBarriers / BarrierLowering | sub_1C98160 | | |
| MemorySpaceOpt address space resolution (warnings for illegal atomics on const/local) | sub_1CA2920 | 32 KB | |
| MemorySpaceOpt secondary resolver | sub_1CA9E90 | 28 KB | |
| Printf Lowering -- lowers printf to vprintf + local buffer packing | sub_1CB1E60 | 31 KB | printf-lowering |
| NVVMIntrinsicLowering (most frequently inserted pass, ~10 occurrences in pipeline) | sub_1CB4E40 | | nvvm-intrinsic-lower |
| NVVMBranchDist | sub_1CB73C0 | | branch-dist |
| RLMCAST transformation (register-level multicast) | sub_1CBFA40 | 75 KB | |
| NVVMSinking2 (NVIDIA enhanced code sinking) | sub_1CC60B0 | | sinking2 |
| IV Demotion -- narrows 64-bit induction variables to 32-bit ("demoteIV", "newBaseIV") | sub_1CD74B0 | 75 KB | iv-demotion |
| NLO (NVIDIA Live Output) helper ("nloNewAdd", "nloNewBit") | sub_1CDC1F0 | 35 KB | |
| Instruction classification / cost model (NLO/remat) | sub_1CDE4D0 | 80 KB | |
| Simplify Live Output (NLO pass -- "nloNewBit") | sub_1CE10B0 | 48 KB | |
| Rematerialization pull-in cost analysis ("Total pull-in cost") | sub_1CE3AF0 | 56 KB | |
| Rematerialization block executor ("remat_", "uclone_" prefixes) | sub_1CE67D0 | 32 KB | |
| NVVMRematerialization main driver -- live-in/live-out pressure analysis per block | sub_1CE7DD0 | 67 KB | remat |
| Final NVVM lowering / intrinsic cleanup | sub_1CEBD10 | | |
| Formal parameter space overflow checker | sub_1CEE970 | 27 KB | |
| NVVMPeephole | sub_1CEF8F0 | | nvvm-peephole |
| Instruction scheduling helper (physical register constraints) | sub_1CFDD60 | 49 KB | |
Zone 14: SelectionDAG ISel / CodeGenPrepare / Backend (0x1D00000 - 0x1EFFFFF)
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x1D00000-0x1D60000 | SelectionDAG ISel core | sub_1D4BB00 bytecode interpreter (97KB, 131-case switch), sub_1D54C20 runOnMachineFunction (72KB, "sdagisel") |
| 0x1D1B0D0 | computeKnownBits | sub_1D1B0D0 (87KB, 62-case ISD switch) |
| 0x1D210A0 | SimplifyDemandedBits | sub_1D210A0 (46KB, 118-case switch, calls NVPTX hooks at sub_1F58D40) |
| 0x1D70000-0x1D7FFFF | CodeGenPrepare | sub_1D73760 address sinking (65KB, "sunkaddr") |
| 0x1D07BB0 | Pre-RA instruction scheduling | sub_1D07BB0 (57 KB) |
| 0x1D80000-0x1DFFFFF | Deque worklist, block splitting | sub_1D7AA30 (74KB, ".unlikely", ".cond.split") |
| 0x1E00000-0x1EFFFFF | Register allocation infrastructure | Greedy RA, live intervals, spill cost |
Zone 15: Backend CodeGen Infrastructure (0x1F00000 - 0x20FFFFF)
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x1F00000-0x1F0C000 | ScheduleDAG infrastructure | sub_1F0A020 DAG builder/emitter (41KB) |
| 0x1F0BF50-0x1F0EBC0 | Shrink Wrapping | sub_1F0DCB0 core analysis (27KB, "shrink-wrap") |
| 0x1F10000-0x1F15000 | SlotIndexes + SpillPlacement | sub_1F10320 "slotindexes", sub_1F12110 "spill-code-placement" |
| 0x1F15000-0x1F1F000 | LiveInterval utilities | sub_1F19E60 "Impossible to implement partial COPY" |
| 0x1F20000-0x1F5FFFF | Register coalescer, VirtRegRewriter | |
| 0x1F58D40 | NVPTX target hook for SimplifyDemandedBits | sub_1F58D40 |
| 0x1F60000-0x1FFFFFF | TwoAddressInstruction, stack protection | |
| 0x2000000-0x20FFFFF | LegalizeTypes | sub_20019C0 (341KB -- third largest function in binary) |
Zone 16: NVPTX Target Backend (0x2100000 - 0x21FFFFF)
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x2100000-0x210FFFF | Register allocation support | sub_210BC20 seedLiveRegs ("regalloc"), sub_210BE60 "ran out of registers" |
| 0x2110000-0x212FFFF | DAG type legalization/promotion | |
| 0x2130000-0x213FFFF | DAG combiners, ISel patterns | |
| 0x2140000-0x214FFFF | NVPTXAsmPrinter | PTX header/kernel emission |
| 0x2150000-0x215FFFF | PTX function/param emission | sub_215D9D0 NVVMAnnotationsProcessor / GenericToNVVM |
| 0x2160000-0x216FFFF | NVPTXTargetMachine | Pass pipeline, SubtargetInfo |
| 0x2170000-0x218AFFF | Atomics lowering, rematerialization (machine-level) | |
| 0x21BC000-0x21BFFFF | Alloca hoisting, image opt | |
| 0x21C0000-0x21CFFFF | MemorySpace lowering (machine-level) | |
| 0x21D0000-0x21DFFFF | DAG lowering mega-function, peephole, prolog/epilog | |
| 0x21E0000-0x21EFFFF | MMA/tensor codegen, atomics, special regs, cluster ops | |
| 0x21F0000-0x21FFFFF | Ldg transform, vec split, mem2reg, register pressure | |
Zone 17: New PM Pass Registration (0x2340000 - 0x23FFFFF)
| Function | Address | Size |
|---|---|---|
| Master pass registration -- registers all 526 passes (121 module + 174 function + 23 loop + 48 MF + analyses) into StringMap | sub_2342890 | ~2,816 lines |
| Print available passes (--print-pipeline-passes) | sub_233C410 | |
| Function pass pipeline text parser | sub_233F860 | |
| Module pipeline text parser | sub_2377300 | |
| Inner function/loop pipeline parser | sub_2368220 | |
| Alias analysis name resolver (globals-aa, basic-aa, scev-aa, tbaa) | sub_233BD40 | |
| Hash table insertion (pass_name -> constructor) | sub_E41FB0 | |
Zone 18: IPO / Attributor / OpenMP Optimization (0x2400000 - 0x29FFFFF)
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x2400000-0x25FFFFF | Attributor framework | sub_251CD10 runTillFixpoint (53KB) |
| 0x2590000-0x265FFFF | Sanitizer instrumentation (ASan, HWASan) | |
| 0x266E000-0x269FFFF | OpenMP target offloading | sub_2686D90 runtime table (215KB, ~160 __kmpc_* entries), sub_26968A0 Generic-to-SPMD transform (61KB, "OMP120") |
| 0x2678420 | OpenMP state machine for generic kernels | sub_2678420 (41 KB) |
| 0x2680940 | Parallel region merging | sub_2680940 (52 KB) |
| 0x26A0000-0x29FFFFF | Coroutine support, LTO infrastructure, PGO lowering | |
Zone 19: Loop Transforms (0x2A00000 - 0x2CFFFFF)
| Function | Address | Size |
|---|---|---|
| LoopPeeling ("llvm.loop.peeled.count") | sub_2A07DE0 | 76 KB |
| LoopRotation (".lr.ph", "h.rot") | sub_2A0CFD0 | 65 KB |
| UnrollLoop main ("loop-unroll", "UnrollCount") | sub_2A15A20 | 85 KB |
| UnrollAndJamLoop ("loop-unroll-and-jam") | sub_2A1CF00 | 58 KB |
| Runtime unrolling (".epil.preheader", ".prol.preheader") | sub_2A25260 | 91 KB |
| IndVarSimplify IV widening ("iv.rem", ".sext", ".zext") | sub_2A76A40 | 67 KB |
| WidenIV / IV transformation | sub_2A79EE0 | 82 KB |
| Dead Synchronization Elimination (island -- the larger copy; see also sub_1C47810) | sub_2C84BA0 | 94 KB |
Note: sub_2C84BA0 is a second copy of the dead synchronization elimination pass located outside the main NVIDIA custom pass zone. This is the 94KB variant analyzed in depth (p2b.6-01), with the four-category fixed-point R/W dataflow algorithm and red-black tree maps.
Zone 20: Codegen Target Options / SelectionDAG Lowering (0x2D00000 - 0x2FFFFFF)
5,217 functions. Contains LLVM TargetMachine option registration and the core SelectionDAG infrastructure used by the NVPTX backend.
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x2D00000-0x2D8FFFF | SelectionDAG core | DAG combine, node creation, legalization helpers |
| 0x2D97F20 | TargetOptions registration (all cl::opt for -march/-mcpu/-mattr/relocation/code model) | sub_2D97F20 (112 KB) |
| 0x2E00000-0x2FFFFFF | SelectionDAG continued | Type legalization, custom lowering, pattern matching |
Zone 21: NVPTX ISel + SelectionDAG Lowering (0x3000000 - 0x36FFFFF)
7 MB. The NVPTX instruction selection and target-specific DAG lowering.
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x3000000-0x328FFFF | DAG node construction, EVT/MVT helpers | |
| 0x3290000-0x32FFFFF | NVPTXTargetLowering | sub_32E3060 LowerOperation dispatcher (111KB), sub_32A1EF0 type legalization (109KB), sub_32D2680 load/store lowering (81KB) |
| 0x3300000-0x33AFFFF | Intrinsic lowering (DAG level) | sub_33B0210 intrinsic switch (343KB) |
| 0x33B0000-0x36FFFFF | ISel pattern helpers, register info | |
Zone 22: NVPTX Instruction Selector / Machine Tail (0x3700000 - 0x3BFFFFF)
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x3700000-0x37AFFFF | Table-driven instruction selector | sub_376DE90 main pattern matcher (138KB -- per-SM opcode legality gating via compressed table at offset 521536) |
| 0x372FEE0 | DAG operand tree copier (recursive) | sub_372FEE0 (104 KB) |
| 0x374DD20 | NVPTX custom lowering entry | sub_374DD20 (67 KB) |
| 0x3900000-0x396FFFF | NVIDIA register pressure / remat (machine-level) | sub_396A6C0 RP reporting ("Register Pressure: N"), sub_3964ED0 ".remat" naming |
| 0x3937240 | ABI Preserve directive emission | sub_3937240 (14 KB) |
| 0x395CFD0 | GEP Splitting pass | sub_395CFD0 (11 KB) |
| 0x395DD20 | DAG pattern computation | sub_395DD20 (66 KB) |
| 0x3970000-0x397FFFF | AsmPrinter / PTX emission | sub_3979400 emitFunctionBody (62KB), sub_397DF10 emitInlineAsm (30KB) |
| 0x3970E40 | BB print + .pragma "nounroll" | sub_3970E40 (18 KB) |
| 0x3980000-0x3BFFFFF | MC layer, DWARF, ELF emission | Object file writers, section management |
Pass Factory Address Summary
The pipeline assembler (sub_12E54A0) calls pass factory functions to construct the pipeline. Each factory address below is called directly from the pipeline builder and uniquely identifies a pass in the binary.
| Factory address | Pass identity | Type |
|---|---|---|
| sub_1654860 | BreakCriticalEdges | F |
| sub_17060B0 | PrintModulePass (debug dump) | M |
| sub_1832270 | InstructionCombining | F |
| sub_1833EB0 | TailCallElim / JumpThreading | F |
| sub_1841180 | FunctionAttrs | M |
| sub_1842BC0 | SCCP | F |
| sub_184CD60 | ConstantMerge / GlobalDCE | M |
| sub_1857160 | NVVMReflect | F |
| sub_185D600 | IPConstantPropagation | M |
| sub_1869C50 | Sink / MemorySSA | F |
| sub_18A3430 | NVVMPredicateOpt | F |
| sub_18B1DE0 | LoopPass (barrier opt) | F |
| sub_18DEFF0 | DCE | F |
| sub_18EEA90 | CorrelatedValuePropagation | F |
| sub_18F5480 | DSE | F |
| sub_18FD350 | DeadArgumentElimination | M |
| sub_190BB10 | SimplifyCFG | F |
| sub_195E880 | LICM / LoopRotate | F |
| sub_1952F90 | LoopIndexSplit | L |
| sub_197E720 | LoopUnroll / LoopVectorize | F |
| sub_198DF00 | LoopSimplify / IndVarSimplify | F |
| sub_198E2A0 | SROA | F |
| sub_19401A0 | InstCombine variant | F |
| sub_19B73C0 | SROA variant / LoopUnswitch | F |
| sub_19CE990 | NVIDIA pass (unknown) | F |
| sub_1A13320 | NVVMRematerialization (IR-level) | F |
| sub_1A223D0 | NVVMIRVerification | M |
| sub_1A62BF0 | LLVM standard pass pipeline (parameterized) | M |
| sub_1A68E70 | LoopIdiomRecognize | F |
| sub_1A7A9F0 | InstructionSimplify | F |
| sub_1B26330 | MemCpyOpt | F |
| sub_1B7FDF0 | Reassociate / Sinking | F |
| sub_1C4B6F0 | AlwaysInliner | M |
| sub_1C6FCA0 | ADCE | F |
| sub_1C8A4D0 | EarlyCSE | F |
| sub_1C8E680 | NVVMMemorySpaceOpt | M |
| sub_1C98160 | NVVMLowerBarriers | F |
| sub_1CB4E40 | NVVMIntrinsicLowering (~10 insertions) | F |
| sub_1CB73C0 | NVVMBranchDist | F |
| sub_1CC60B0 | NVVMSinking2 | F |
| sub_1CE7DD0 | NVVMRematerialization (main) | F |
| sub_1CEBD10 | Final NVVM lowering | F |
| sub_1CEF8F0 | NVVMPeephole | F |
| sub_1CB0F50 | ProfileSummaryInfoWrapper / NVVMModulePass | F |
| sub_12D4560 | NVVMVerifier / ModuleVerifier | M |
| sub_215D9D0 | NVVMAnnotationsProcessor | M |
| sub_149CCE0 | TargetLibraryInfoWrapperPass | M |
| sub_1BFB520 | TargetTransformInfoWrapperPass | F |
| sub_14A7550 | createVerifierPass / BasicAliasAnalysis | M |
| sub_1361950 | AssumptionCacheTracker | M |
Type: M = ModulePass, F = FunctionPass, L = LoopPass.
Embedded Data Payloads
Libdevice Bitcode
Two identical copies of NVIDIA's libdevice are embedded directly in the .rodata section as raw LLVM bitcode. Each copy is approximately 456 KB and contains around 400 math intrinsic implementations (__nv_sinf, __nv_expf, __nv_sqrtf, etc.). The duplication supports the dual-path architecture: Path A (LibNVVM API mode) references one copy at 0x3EA0080; Path B (standalone mode) references the other at 0x420FD80. The bitcode is linked into the user's module during the LNK phase via the bitcode linker at sub_12C06E0.
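Embedded bitcode blobs like these can be located without any symbol information by scanning for the raw LLVM bitcode magic `b"BC\xc0\xde"`. A minimal sketch, demonstrated on a synthetic byte string:

```python
# Locate embedded LLVM bitcode blobs by scanning for the raw bitcode
# magic b"BC\xc0\xde" (the first four bytes of every LLVM bitcode file).
def find_bitcode_offsets(data: bytes):
    magic = b"BC\xc0\xde"
    offsets, pos = [], data.find(magic)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(magic, pos + 1)
    return offsets

# Synthetic image with two "embedded" blobs, mimicking the dual copies
# in .rodata (real offsets differ, of course).
image = b"\x00" * 16 + b"BC\xc0\xde" + b"\x01" * 32 + b"BC\xc0\xde" + b"\x02" * 8
print(find_bitcode_offsets(image))  # -> [16, 52]
```

Applied to the full cicc image, the same scan surfaces both libdevice copies plus any other bitcode payloads.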
String Tables
IDA Pro extracts 188,141 strings from the binary. These fall into several categories:
| Category | Approximate count | Example |
|---|---|---|
| LLVM cl::opt descriptions | ~1,689 | "Enable aggressive reassociation" |
| LLVM error/diagnostic messages | ~5,000 | "Invalid bitcode signature" |
| EDG error messages | ~2,500 | "expected a declaration" |
| LLVM pass names | ~440 | "instcombine", "gvn", "nvvm-memspace-opt" |
| PTX instruction templates | ~800 | "mov.b32 %0, %1;" |
| NVVM builtin names | ~770 | "__nvvm_atom_cas_gen_i" |
| jemalloc config strings | ~200 | "background_thread", "dirty_decay_ms" |
| NVVM container field names | ~144 | "SmMajor", "FastMath.Ftz" |
| Miscellaneous (format strings, assertions) | ~170,000+ | "%s:%d: assertion failed" |
String cross-referencing is the single most productive technique for identifying functions in a stripped binary. The LLVM pass registration pattern is especially reliable: a string like "nvvm-memspace-opt" appears exactly once, in the constructor of that pass, which IDA locates via xref.
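The extraction step itself is simple; a minimal analogue of the string sweep IDA performs, finding printable-ASCII runs of a minimum length together with their offsets in a raw byte image:

```python
import re

# Minimal `strings`-style extractor: runs of printable ASCII (0x20-0x7e)
# of at least min_len bytes, returned with their file offsets.
def extract_strings(data: bytes, min_len: int = 5):
    return [(m.start(), m.group().decode("ascii"))
            for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data)]

blob = b"\x00\x01nvvm-memspace-opt\x00\xffinstcombine\x00"
for off, s in extract_strings(blob):
    print(hex(off), s)
```

Feeding each extracted offset into an xref query is then exactly the pass-name identification workflow described above.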
NVVM Container Format
The binary includes a proprietary container format for wrapping LLVM bitcode with compilation metadata. The container uses a 24-byte binary header with magic 0x7F4E5C7D, followed by delta-encoded tag/value pairs (only fields that differ from defaults are serialized). There are 144 distinct tag IDs spanning core options (tags 1-39), compression metadata (tag 99), extended target options (tags 101-173), blob data (tags 201-218), and structured hardware descriptors (tags 401-402 for TMA/TCGen05 configurations). Serialization and deserialization are handled by sub_CDD2D0 and sub_CD1D80 respectively.
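A hedged sketch of reading such a container follows. Only the magic value 0x7F4E5C7D, the 24-byte header size, and the delta-encoding principle (only non-default fields serialized) come from the analysis above; the header field layout and the fixed 32-bit tag/value encoding used here are illustrative assumptions, not the recovered on-disk format:

```python
import struct

MAGIC = 0x7F4E5C7D  # container magic recovered from the binary

def read_container(data: bytes):
    # Hypothetical reader: real tag/value encoding is not reproduced here.
    magic, = struct.unpack_from("<I", data, 0)
    if magic != MAGIC:
        raise ValueError("not an NVVM container")
    payload = data[24:]                       # skip the 24-byte header
    tags = {}
    for i in range(0, len(payload) - 7, 8):   # delta-encoded pairs:
        tag, val = struct.unpack_from("<II", payload, i)
        tags[tag] = val                       # only non-default fields appear
    return tags

blob = struct.pack("<I", MAGIC) + b"\x00" * 20 + struct.pack("<II", 1, 75)
print(read_container(blob))  # -> {1: 75}
```

Recovering the true per-tag encodings would mean diffing sub_CDD2D0's serializer output against known inputs.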
jemalloc Integration
NVIDIA statically links jemalloc 5.3.x as the process-wide memory allocator. The jemalloc functions cluster around 0x12FC000 (approximately 400 functions). The configuration initialization function sub_12FCDB0 (129 KB, one of the largest functions in the binary) parses 199 configuration strings from the MALLOC_CONF environment variable.
Key jemalloc entry points visible in the binary:
| Function | Address |
|---|---|
| malloc_conf_init (199 config strings) | 0x12FCDB0 |
| vsnprintf (jemalloc stats formatting) | 0x40D5CA |
| Core arena management, tcache, extent allocator | 0x12FC000 range |
The jemalloc integration is significant for reverse engineering because it means malloc/free calls throughout the binary resolve to jemalloc's arena-based allocator rather than glibc's ptmalloc2. When tracing memory allocation patterns in IDA, look for calls into the 0x12FC000 range.
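jemalloc's documented MALLOC_CONF syntax is a comma-separated list of option:value pairs; a minimal parser mirroring what malloc_conf_init does with each recognized option string (the dict-based model is a simplification -- the real function validates each of the 199 option names):

```python
# Parse a MALLOC_CONF-style string ("key:value,key:value,...") into a dict.
def parse_malloc_conf(conf: str):
    opts = {}
    for item in filter(None, conf.split(",")):  # skip empty segments
        key, _, value = item.partition(":")
        opts[key] = value
    return opts

print(parse_malloc_conf("background_thread:true,dirty_decay_ms:5000"))
```

Setting MALLOC_CONF before launching cicc under a debugger and breaking in sub_12FCDB0 is a quick way to confirm the function's identity.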
Global Constructors
The region from 0x430000 to 0x5CFFFF (~1.6 MB) is dominated by global constructors that execute before main(). The primary purpose of these constructors is LLVM cl::opt registration: approximately 1,689 command-line option objects are initialized, each registering a string name, description, default value, and storage location into LLVM's global option registry.
The .init_array section contains function pointers to these constructors. They execute in linker-determined order and populate a global hash table that sub_8F9C90 (the real main) later queries during CLI parsing. In IDA Pro, navigating to any cl::opt constructor reveals the option name string and its associated global variable, which is invaluable for understanding what flag controls what behavior.
Additional global constructors handle:
- LLVM pass registration (`RegisterPass<T>` and `PassInfo` objects)
- LLVM target initialization (NVPTX target machine factory)
- jemalloc allocator bootstrapping
- EDG frontend static initialization tables
Dual-Path Code Duplication
A distinctive structural feature of the binary is the presence of two near-complete copies of the NVVM bridge and backend entry points. Path A (LibNVVM API mode) lives around 0x90xxxx; Path B (standalone/nvcc mode) lives around 0x126xxxx. Each path has its own:
| Component | Path A | Path B |
|---|---|---|
| Simple compile entry | sub_902D10 | sub_1262860 |
| Multi-stage pipeline | sub_905EE0 (43 KB) | sub_1265970 (48 KB) |
| CLI parsing | sub_900130 | sub_125FB30 |
| Builtin resolution table | sub_90AEE0 (109 KB) | sub_126A910 (123 KB) |
| Embedded libdevice ref | unk_3EA0080 | unk_420FD80 |
| Version string | nvvm-latest | nvvm70 |
In IDA, if you have identified a function in one path, search for a structurally similar function at the corresponding offset in the other path. The code is not byte-identical -- Path B is generally slightly larger due to additional standalone-mode logic -- but the control flow graphs are nearly congruent.
IDA Pro Navigation Tips
When opening cicc in IDA Pro for the first time, the auto-analysis will take several minutes due to the 60 MB size. The following workflow accelerates orientation:
1. Start with strings. Open the Strings window (Shift+F12) and filter for known LLVM pass names (`instcombine`, `gvn`, `nvvm-`). Each xref leads directly to a pass constructor or registration site.
2. Use the address map above. If you are looking at an address in the `0xC00000`-`0x12CFFFF` range, you are in LLVM optimization passes. The `0x3000000`-`0x36FFFFF` range is NVPTX instruction selection. The `0x5D0000`-`0x8EFFFF` range is EDG. Context narrows the search space immediately.
3. Watch for vtable patterns. LLVM passes are C++ classes with virtual methods. IDA's vtable reconstruction reveals inheritance hierarchies. Every `FunctionPass`, `ModulePass`, and `LoopPass` subclass has a vtable with `runOnFunction`/`runOnModule` at a consistent slot offset.
4. Anchor on mega-functions. The largest functions are the easiest to locate and serve as landmarks: `sub_A939D0` (457 KB, X86 AutoUpgrade), `sub_10EE7A0` (396 KB, InstCombine), `sub_20019C0` (341 KB, LegalizeTypes). These anchors partition the address space.
5. Follow the pipeline. Entry at `sub_8F9C90` calls into EDG at `sub_5D2A80`, pipeline assembly at `sub_12E54A0`, and PTX emission starting at `0x2100000`. Tracing callgraph edges from these known entry points maps out the entire compilation flow.
6. Mark jemalloc early. Identifying and labeling the jemalloc cluster at `0x12FC000` prevents wasted time reverse-engineering well-known allocator internals. The 199-string `malloc_conf_init` function is an unmistakable fingerprint.
7. Locate NVIDIA passes via factory addresses. The Pass Factory Address Summary table above maps every pipeline-inserted pass to its constructor address. In IDA, setting a breakpoint at `sub_12DE0B0` (AddPass) and logging the second argument reveals the exact pass insertion order at runtime.
Master Address-Range Map
The definitive quick-reference for "what lives at address X?" Every major address range in the cicc v13.0 binary, sorted by start address, consolidated from all subsystem pages in this wiki.
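The map can also be queried programmatically. A minimal lookup sketch using a handful of ranges copied from the .text table (only a sample of the zones is included here):

```python
import bisect

# "What lives at address X?" -- a sample of (start, end, subsystem) ranges
# from the master map, queried by binary search on sorted start addresses.
ZONES = [
    (0x5D0000, 0x8EFFFF, "EDG 6.6 C++ Frontend"),
    (0xC00000, 0xCAFFFF, "LLVM Support/ADT"),
    (0x10D0000, 0x122FFFF, "InstCombine mega-region"),
    (0x12FC000, 0x133FFFF, "jemalloc core"),
]
STARTS = [z[0] for z in ZONES]

def what_lives_at(addr: int):
    i = bisect.bisect_right(STARTS, addr) - 1
    if i >= 0 and addr <= ZONES[i][1]:
        return ZONES[i][2]
    return "unmapped"

print(what_lives_at(0x10EE7A0))  # -> InstCombine mega-region
```

Extending ZONES with every row of the table below turns this into a one-call replacement for manually scanning the map.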
.text Section (0x400000 - 0x3BFFFFF)
| Start | End | Size | Subsystem | Zone |
|---|---|---|---|---|
| 0x400000 | 0x40CFFF | 52 KB | CRT startup (_start, libc stubs) | 1 |
| 0x40D000 | 0x41FFFF | 80 KB | jemalloc stats (vsnprintf at sub_40D5CA) | 1 |
| 0x420000 | 0x42FFFF | 64 KB | libc helpers (memcpy, memset, strlen, math) | 1 |
| 0x430000 | 0x5CFFFF | 1.6 MB | Global constructors (~1,689 cl::opt registrations, pass/target init) | 2 |
| 0x5D0000 | 0x8EFFFF | 3.2 MB | EDG 6.6 C++ Frontend (parser, constexpr, templates, IL walkers, SARIF, preprocessor) | 3 |
| 0x8F0000 | 0x8FFFFF | 64 KB | Real main / CLI (sub_8F9C90 entry, flag mapping, XOR deobfuscator) | 4 |
| 0x900000 | 0x92FFFF | 192 KB | Path A entry (LibNVVM API: CLI parse, pipeline driver, builtin tables) | 4 |
| 0x930000 | 0x95FFFF | 192 KB | Path A builtins (pre-opt builtin lowering, 770-entry resolution) | 4 |
| 0x960000 | 0x9EFFFF | 576 KB | Architecture detection (-arch fan-out, NVVM option parsing) | 4 |
| 0x9F0000 | 0xAEFFFF | 1 MB | Bitcode reader (parseFunctionBody 166KB, metadata reader 121KB) | 5 |
| 0xAF0000 | 0xBEFFFF | 1 MB | X86 AutoUpgrade (sub_A939D0 457KB -- legacy intrinsic upgrader) | 5 |
| 0xBF0000 | 0xBFFFFF | 64 KB | LLVM IR Verifier (entry points, visitCallInst 207KB) | 5 |
| 0xC00000 | 0xCAFFFF | 704 KB | LLVM Support/ADT (APInt, CommandLine, ConstantRange, JSON, Timer, YAML, VFS) | 6 |
| 0xCB0000 | 0xCBFFFF | 64 KB | YAML parser/emitter (libyaml) | 7 |
| 0xCC0000 | 0xCCFFFF | 64 KB | LLVM Triple parsing (Triple_normalize 35KB) | 7 |
| 0xCCD000 | 0xCDFFFF | 76 KB | NVVM container format (serialize sub_CDD2D0, deserialize sub_CD1D80, 144 tags) | 7 |
| 0xCE0000 | 0xD5FFFF | 512 KB | NVVM options (container validators, option parsers) | 7 |
| 0xD60000 | 0xD82FFF | 140 KB | NV Module Summary / LTO (buildModuleSummary 74KB, runOnModule 56KB) | 7 |
| 0xD83000 | 0xDFFFFF | 500 KB | ScalarEvolution (SCEV) (AddRecExpr, backedge analysis, trip counts) | 7 |
| 0xE00000 | 0xE0FFFF | 64 KB | DWARF debug info (string/enum tables) | 7 |
| 0xE10000 | 0xE2FFFF | 128 KB | Itanium name demangler (parseExpr 47KB) | 7 |
| 0xE30000 | 0xEBFFFF | 576 KB | MC assembler layer (ELF/COFF/MachO section parsers, expression evaluator) | 7 |
| 0xEC0000 | 0xED0000 | 64 KB | MC directives (sub_ECB300 ELF section parser 40KB) | 7 |
| 0xED0000 | 0xEF8000 | 160 KB | InstrProf / MemProf reader (profiling data infrastructure) | 7 |
| 0xEF8000 | 0xF05000 | 52 KB | Bitstream remark serialization | 7 |
| 0xF05000 | 0xF6FFFF | 428 KB | SelectionDAG infrastructure (DAG node creation, SDValue, EVT/MVT helpers) | 7 |
| 0xF70000 | 0xF8FFFF | 128 KB | Loop vectorization runtime checks (vectorizeLoop 37KB, canVectorizeMemory 29KB) | 7 |
| 0xF90000 | 0xFCFFFF | 256 KB | SimplifyCFG + code sinking (switch table gen, speculative exec) | 7 |
| 0xFD0000 | 0xFEFFFF | 128 KB | AliasSet / register pressure (CFG graphviz) | 7 |
| 0xFF0000 | 0x101FFFF | 192 KB | Block scheduling (RPO traversal, constant folding) | 7 |
| 0x1020000 | 0x103FFFF | 128 KB | Inline ASM + scheduling model (CUTLASS kernel detection 41KB) | 7 |
| 0x1040000 | 0x106FFFF | 192 KB | Divergence analysis (DAG utilities, IR linker) | 7 |
| 0x1070000 | 0x10CFFFF | 384 KB | MC object emission + InstructionSimplify (visitAdd 94KB) | 7 |
| 0x10D0000 | 0x122FFFF | 1.4 MB | InstCombine mega-region (main visitor 396KB, KnownBits 125KB, SimplifyLibCalls, LLParser) | 8 |
| 0x1230000 | 0x12CFFFF | 640 KB | NVVM Bridge / IR codegen (AST-to-IR, Path B entry, builtin tables, bitcode linker) | 9 |
| 0x12D0000 | 0x12FBFFF | 176 KB | Pipeline builder (NVVMPassOptions 125KB, AddPass, tier builders, master assembler 50KB) | 10 |
| 0x12FC000 | 0x133FFFF | 256 KB | jemalloc core (~400 functions, malloc_conf_init 129KB) | 10 |
| 0x1340000 | 0x16FFFFF | 3.8 MB | IR infrastructure / PassManager (IR types, constants, instructions, metadata, execution engine, IR linker) | 11 |
| 0x1700000 | 0x17FFFFF | 1 MB | InstCombine (NewPM) + Sanitizers + PGO (MSan, TSan, coverage, GCov) | 12 |
| 0x1800000 | 0x18DFFFF | 896 KB | Standard scalar passes (InstructionCombining, TailCallElim, FunctionAttrs, SCCP, Sink, MemorySSA) | 13 |
| 0x18E0000 | 0x18FFFFF | 128 KB | DCE / CVP / DSE (Dead Code Elimination, CorrelatedValuePropagation, Dead Store Elimination) | 13 |
| 0x1900000 | 0x193FFFF | 256 KB | GVN family (runOnFunction 83KB, PRE 26KB, NewGVN 43KB) | 13 |
0x1940000 | 0x19FFFFF | 768 KB | Scalar passes continued (LICM, LoopRotate, LoopIndexSplit, LoopUnroll, SROA) | 13 |
0x1A00000 | 0x1AFFFFF | 1 MB | NVVMRematerialization / LLVM standard pipeline / InstructionSimplify | 13 |
0x1B00000 | 0x1B7FFFF | 512 KB | Loop unrolling + switch lowering (main driver 68KB, Unroll-and-Jam 55KB, peeling 39KB) | 13 |
0x1B80000 | 0x1BFFFFF | 512 KB | Loop/SLP vectorizer (LoopVectorize 43KB, VPlan 32KB, SLP 47KB+62KB) | 13 |
0x1C00000 | 0x1C3FFFF | 256 KB | NVVM module validation + config (codegen config 33KB, compile mode 28KB, intrinsic lowering 112KB, module validator 48KB) | 13 |
0x1C40000 | 0x1CFFFFF | 768 KB | NVIDIA custom IR passes (dead-sync-elim, common-base-elim, base-addr-sr, memspace-opt, loop-index-split, printf-lowering, iv-demotion, remat, peephole, sinking2, NLO) | 13 |
0x1D00000 | 0x1DFFFFF | 1 MB | SelectionDAG ISel / CodeGenPrepare (bytecode interpreter 97KB, address sinking 65KB) | 14 |
0x1E00000 | 0x1EFFFFF | 1 MB | Register allocation infrastructure (Greedy RA, live intervals, spill cost) | 14 |
0x1F00000 | 0x1FFFFFF | 1 MB | Backend codegen infrastructure (ScheduleDAG, ShrinkWrapping, SpillPlacement, register coalescer, TwoAddressInstruction) | 15 |
0x2000000 | 0x20FFFFF | 1 MB | LegalizeTypes (sub_20019C0 341KB -- third largest function) | 15 |
0x2100000 | 0x21FFFFF | 1 MB | NVPTX target backend (AsmPrinter, PTX emission, MMA/tensor codegen, atomics, TargetMachine) | 16 |
0x2200000 | 0x233FFFF | 1.25 MB | (gap: misc codegen, late passes) | -- |
0x2340000 | 0x23FFFFF | 768 KB | New PM pass registration (master registrar 2,816 lines, 526 passes, pipeline text parser) | 17 |
0x2400000 | 0x258FFFF | 1.6 MB | Attributor framework (runTillFixpoint 53KB) | 18 |
0x2590000 | 0x265FFFF | 832 KB | Sanitizer instrumentation (ASan, HWASan) | 18 |
0x2660000 | 0x269FFFF | 256 KB | OpenMP target offloading (194-entry __kmpc_* table, Generic-to-SPMD 61KB, state machine 41KB) | 18 |
0x26A0000 | 0x29FFFFF | 3.5 MB | Coroutines / LTO infrastructure / PGO lowering / EarlyCSE / SROA (NewPM) | 18 |
0x2A00000 | 0x2CFFFFF | 3 MB | Loop transforms (LoopPeeling, LoopRotation, UnrollLoop, IndVarSimplify, dead-sync-elim island) | 19 |
0x2D00000 | 0x2FFFFFF | 3 MB | Codegen target options / SelectionDAG lowering (TargetOptions 112KB, DAG combine, type legalization) | 20 |
0x3000000 | 0x36FFFFF | 7 MB | NVPTX ISel + DAG lowering (NVPTXTargetLowering 111KB, intrinsic switch 343KB, register info) | 21 |
0x3700000 | 0x37AFFFF | 704 KB | Table-driven instruction selector (main matcher 138KB, per-SM opcode gating) | 22 |
0x37B0000 | 0x38FFFFF | 1.3 MB | Late machine passes (inliner cost model at 0x38576C0, pipeline helpers) | 22 |
0x3900000 | 0x397FFFF | 512 KB | NVIDIA machine-level passes (register pressure, remat, ABI preserve, GEP split, AsmPrinter/PTX emission) | 22 |
0x3980000 | 0x399FFFF | 128 KB | MC layer / DWARF emission (object file writers, DWARF sections at 0x3990000-0x39DF000) | 22 |
0x39A0000 | 0x3BFFFFF | 2.4 MB | Trailing codegen (section management, CRT finalization) | 22 |
.rodata / .data Sections (0x3C00000+)
| Start | End | Size | Contents |
|---|---|---|---|
| 0x3C00000 | 0x3EAFFFF | ~2.7 MB | Read-only data (strings, jump tables, XOR-encrypted env vars at 0x3C23A7B) |
| 0x3EA0080 | 0x3F1FFFF | 456 KB | Embedded libdevice bitcode (Path A) |
| 0x3F252E0 | 0x3F3E6C0+ | varies | NVPTX tables (constraint type table, constraint word table, MVT tables) |
| 0x420FD80 | 0x428FFFF | 456 KB | Embedded libdevice bitcode (Path B) |
| 0x42812C0 | -- | varies | Obfuscated version strings (XOR+ROT13 ciphertext) |
| 0x444C4A0 | 0x4456580+ | varies | MVT tables (operand type, vector element count, scalarized MVT) |
| 0x4F00000+ | -- | large | BSS (cl::opt storage, hash tables, global state) |
Usage
Given an IDA address, find the row whose Start <= address < End. The Subsystem column tells you which component of cicc you are looking at. For pass-level detail within a zone, jump to the corresponding Zone section above.
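The lookup rule above can be sketched with a bisect over sorted Start addresses. The rows here are a small abridged sample of the table, and End values are treated as inclusive (the table's ranges end at 0x...FFFF boundaries):

```python
import bisect

# Abridged sample of the memory map above: (Start, End, Subsystem).
ZONES = [
    (0x5D0000, 0x8EFFFF, "EDG 6.6 C++ Frontend"),
    (0x10D0000, 0x122FFFF, "InstCombine mega-region"),
    (0x2100000, 0x21FFFFF, "NVPTX target backend"),
    (0x3000000, 0x36FFFFF, "NVPTX ISel + DAG lowering"),
]
STARTS = [start for start, _, _ in ZONES]

def zone_for(addr):
    """Return the subsystem whose Start <= addr <= End, or 'unknown'."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i >= 0 and ZONES[i][0] <= addr <= ZONES[i][1]:
        return ZONES[i][2]
    return "unknown"

print(zone_for(0x786210))  # EDG 6.6 C++ Frontend
```

With the full table loaded, the same function answers any "which subsystem is this IDA address in" query in O(log n).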
Cross-References
- Pipeline Overview -- compilation flow from entry to PTX emission
- LLVM Pipeline -- 526-pass registration table and tier execution order
- Optimizer -- two-phase model, AddPass mechanism, tier system
- Pass Inventory -- complete pass catalog with dedicated deep-dive pages
- NVVMPassOptions -- 222-slot pass configuration system
- Function Map -- address-to-identity lookup table
- CLI Flags -- flag-to-pipeline routing
Methodology
This page documents how the reverse engineering of cicc v13.0 was performed. It serves as both a transparency record -- so readers can assess the confidence of any claim in this wiki -- and as a practical guide for anyone who wants to reproduce or extend the analysis.
Scope and Scale
CICC is a 60 MB stripped x86-64 ELF binary with no debug symbols, no export table, and no DWARF information. The scale of the analysis:
| Metric | Value |
|---|---|
| Total functions detected | 80,562 |
| Functions decompiled | 80,281 (99.65%) |
| Strings extracted | 188,141 |
| LLVM base version | 20.0.0 (internal fork) |
| LLVM pass classes identified | ~402 standard + 35 NVIDIA custom |
| CLI options registered | ~1,689 cl::opt + 222 NVVMPassOptions |
| NVVM builtins catalogued | 770 (IDs 1-770) |
The 281 functions that Hex-Rays could not decompile are predominantly very small thunks, computed-jump trampolines, or hand-written assembly stubs in the CRT startup and jemalloc fast paths. None are in critical compiler logic.
Toolchain
All analysis was performed with IDA Pro 8.x and the Hex-Rays x86-64 decompiler. No dynamic analysis (debugging, tracing, instrumentation) was used -- the entire effort is static analysis of the binary at rest. Supplementary tools:
| Tool | Purpose |
|---|---|
| IDA Pro 8.x | Disassembly, auto-analysis, cross-referencing, type reconstruction |
| Hex-Rays decompiler | Pseudocode generation for all 80,281 recovered functions |
| IDA Python scripting | Bulk string extraction, function size enumeration, xref graph walking |
| Custom Python scripts | Callgraph analysis, module taxonomy, evidence indexing, pipeline tracing |
No runtime instrumentation, no strace/ltrace, no gdb breakpoints. Every finding derives from static analysis of the binary's code and data sections.
Function Identification Strategies
Identifying functions in a stripped binary of this size requires multiple complementary strategies. They are listed below in order of reliability.
String Cross-References (Highest Confidence)
LLVM is a string-rich codebase. Error messages, pass names, option descriptions, and assertion text are compiled into the binary. A string like "Running pass 'NVVMMemorySpaceOpt'" appears at exactly one address in .rodata, and IDA's xref from that string leads directly to the function that prints it. This is the most reliable identification technique and produces VERY HIGH confidence identifications.
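A sketch of this technique over the exported string database. The record fields here are assumptions modeled on the export description later on this page, not the real cicc_strings.json schema:

```python
# Toy records; field names ("addr", "value", "xrefs") are assumed.
strings = [
    {"addr": 0x3C10000, "value": "Running pass 'NVVMMemorySpaceOpt'",
     "xrefs": [0x1C45A10]},
    {"addr": 0x3C10040, "value": "dirty_decay_ms", "xrefs": [0x1301000]},
]

def anchor_functions(needle):
    """Collect functions referencing any string that contains needle."""
    hits = set()
    for rec in strings:
        if needle in rec["value"]:
            hits.update(rec["xrefs"])
    return sorted(hits)

print([hex(a) for a in anchor_functions("NVVMMemorySpaceOpt")])  # ['0x1c45a10']
```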
Specific high-value string patterns:
- LLVM pass registration: `"instcombine"`, `"gvn"`, `"nvvm-memspace-opt"` -- each appears in exactly one `RegisterPass` constructor or `PassInfo` initializer.
- `cl::opt` names: `"-nvvm-enable-remat"`, `"-nvvm-branch-dist-threshold"` -- each names a global variable and its registration constructor.
- Error messages with context: `"parseFunctionBody: ..."` (174 unique error strings in the bitcode reader), `"visitCallInst: ..."` (298 verification messages in the verifier).
- Timer names: `"CUDA C++ Front-End"`, `"LibNVVM"`, `"Optimizer"` -- appear in timer-creation calls that bracket pipeline stages.
- EDG error templates: `"expected a %s"`, `"declaration not allowed here"` -- 2,500+ diagnostic strings anchoring the frontend parser.
LLVM Pass Registration Patterns (Very High Confidence)
Every LLVM pass follows a predictable structural pattern. A pass class has a vtable with virtual methods at fixed offsets (runOnFunction at slot N, getAnalysisUsage at slot M). The pass registers itself via a global constructor that stores a PassInfo object containing the pass name string, the pass ID address, and a factory function pointer. By enumerating all .init_array entries that write a PassInfo-shaped structure, all ~437 passes were catalogued systematically.
The New Pass Manager (at sub_2342890, a 2,816-line registrar function) contains a massive string-to-pass-factory dispatch table with ~268 pass name entries. Decompiling this single function yields the name-to-address mapping for every New PM pass in the binary.
Vtable Analysis (High Confidence)
LLVM's class hierarchy is deep and regular. Pass -> FunctionPass -> LoopPass, Pass -> ModulePass, etc. Each level adds virtual methods at predictable vtable slots. By reconstructing vtable layouts (finding pointers to __cxa_pure_virtual for abstract methods, then tracing concrete overrides), the class hierarchy was reconstructed without debug symbols.
For the NVPTX backend specifically, vtable analysis identified NVPTXTargetLowering (2.3 MB of lowering logic), NVPTXInstrInfo, NVPTXRegisterInfo, and NVPTXFrameLowering as distinct classes with their own method tables.
Callgraph Propagation (High Confidence)
Once a function is identified with high confidence, its callees and callers gain contextual identity. If sub_12E54A0 is the pipeline assembly function (confirmed by string refs to pass names it registers), then the functions it calls to create individual passes are the pass factory functions. This propagation is transitive: identifying a factory function identifies its return type's vtable, which identifies the pass's runOnFunction method.
The pipeline orchestrator at sub_12C35D0 (41 KB) is a particularly productive anchor: it calls into the LNK, OPT, OPTIXIR, and LLC stages in sequence, and each stage's entry point was identified by following its callgraph edges.
Size and Structural Fingerprinting (Medium Confidence)
Some functions are identifiable by their size and structural characteristics alone. LLVM's InstCombine::visitCallInst is famously enormous (396 KB in this binary) because it handles every LLVM intrinsic. The SelectionDAG type legalizer (sub_20019C0, 341 KB, the third-largest function in the binary) contains a switch with 967 case labels. These mega-functions have no structural equivalents and can be identified by size alone with reasonable confidence.
Similarly, the EDG frontend's constexpr evaluator (sub_786210, 317 KB) is identifiable by its 124 case labels corresponding to C++ operator opcodes -- a characteristic that matches the known EDG evaluator design.
Known Library Fingerprinting (Medium Confidence)
jemalloc was identified by its 199 configuration string names ("background_thread", "dirty_decay_ms", "narenas", etc.), which are unique to jemalloc's malloc_conf_init function. Once the allocator library was identified, its ~400 functions were bulk-labeled, removing them from the analysis scope.
The X86 AutoUpgrade function (sub_A939D0, 457 KB) is an LLVM artifact -- leftover x86 intrinsic renaming code that ships in every LLVM-based binary regardless of target. It was identified by its intrinsic name strings ("llvm.x86.sse2.*", "llvm.x86.avx.*") and excluded from NVPTX-specific analysis.
Confidence Levels
Every function identification in this wiki carries one of four confidence levels:
| Level | Meaning | Basis |
|---|---|---|
| KNOWN | Identity is certain | Direct string evidence naming the function, or the function is a trivial thunk to a known target |
| VERY HIGH | Effectively certain | Multiple corroborating string references, structural match to known LLVM code, consistent callgraph position |
| HIGH | Strong identification | Single strong indicator (vtable match, size fingerprint, callgraph position) corroborated by context |
| MEDIUM | Probable identification | Inferred from callgraph context, parameter patterns, or structural similarity without direct string evidence |
Approximately 60% of identified functions are VERY HIGH or KNOWN confidence. The remaining 40% are HIGH or MEDIUM, concentrated in areas with fewer string anchors (machine-level passes, register allocation internals, EDG IL tree walkers).
Analysis Pipeline and Scripts
The manual IDA Pro work was augmented by a systematic scripted pipeline that processed the exported IDA databases into structured evidence. The pipeline operates in two phases: L0 (foundation) builds indices and classifies all 80,562 functions automatically, and L1 (module analysis) organizes functions into per-module directories with metadata for human review.
All scripts live in cicc/scripts/. The pipeline requires four JSON databases exported from IDA: cicc_functions.json (80,562 function records), cicc_strings.json (188,141 string records), cicc_xrefs.json (cross-reference records), and cicc_callgraph.json (call edge records). These exports are stored in cicc/databases/.
L0 Foundation Pipeline
The L0 pipeline runs as a single sequential batch via scripts/run_foundation_analysis.sh. Each step depends on the output of the previous step.
Step 0: Extract Wiki Knowledge (foundation/00_extract_wiki_knowledge.py)
Scans all existing wiki markdown files for hex addresses (regex \b0x[0-9a-fA-F]{6,}\b) and builds a ground-truth mapping of address-to-module from prior manual analysis. This seed data provides the highest-confidence module assignments (100% confidence) used to bootstrap the automated classifier.
Output: foundation/taxonomy/modules/wiki_known_functions.json, wiki_module_addresses.json.
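The harvesting step reduces to the quoted regex; a toy illustration:

```python
import re

# The regex quoted above: hex constants with 6 or more digits.
ADDR_RE = re.compile(r"\b0x[0-9a-fA-F]{6,}\b")

page = "The pipeline orchestrator at sub_12C35D0 lives at 0x12C35D0; see 0x8F9C90."
print(ADDR_RE.findall(page))  # ['0x12C35D0', '0x8F9C90']
```

Note that `sub_12C35D0` is not matched: the pattern requires a literal `0x` prefix, so symbol names containing hex digits do not pollute the address map.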
Step 1: Build Fast Lookup Indices (foundation/01_build_indices.py)
Loads the three IDA JSON databases (functions, strings, xrefs) and builds four pickle-serialized indices for O(1) lookup in subsequent steps:
- `addr_to_func.pkl` -- address to function metadata (name, size, instruction count, library/thunk flags).
- `string_to_xrefs.pkl` -- string address to string value and xref list.
- `func_to_callers.pkl` -- function name to list of caller names.
- `func_to_callees.pkl` -- function name to list of callee names.
Output: foundation/indices/.
Step 2: Classify Strings (foundation/02_classify_strings.py)
Applies four regex-based pattern sets to all 188,141 strings, classifying each into one or more semantic categories:
- Error messages: strings matching `error`, `failed`, `invalid`, `unsupported`, `expected`, etc.
- Optimization passes: strings matching `pass`, `optimize`, `transform`, `inline`, `unroll`, `gvn`, `licm`, etc.
- Architecture features: strings matching `sm_\d+`, `tensor`, `warp`, `FP4`, `blackwell`, `hopper`, etc.
- Debug messages: strings matching `debug`, `trace`, `dump`, `verbose`.
Each classified string retains its address and xref list, so the classifier output doubles as a "which functions reference optimization-related strings" index.
Output: foundation/taxonomy/strings/error_messages.json, optimization_passes.json, architecture_features.json, debug_messages.json, extracted_pass_names.json.
Step 3: Build Module Taxonomy (foundation/03_build_module_taxonomy.py)
The core classification engine. Assigns each of the 80,562 functions to one of eight compiler subsystem modules (or unknown) using four strategies applied in decreasing confidence order:
- Wiki ground truth (100% confidence) -- addresses found in wiki pages in Step 0.
- String content analysis (80% confidence) -- functions whose string xrefs match module-specific keyword patterns (e.g., a function referencing `"tensor"`, `"mma"`, or `"tcgen"` strings is classified as `tensor_core_codegen`).
- Call proximity propagation (30-60% confidence, 3 iterations) -- unclassified functions are assigned to the module voted by their callers (weighted 2x) and callees. A minimum of 2 votes is required. Each iteration propagates classifications outward from already-classified functions.
- Code location heuristics (40% confidence) -- address range rules for known code regions (e.g., `0x2F00000`-`0x3000000` maps to `register_allocation`).
The eight modules are: optimization_framework, register_allocation, compilation_pipeline, ptx_emission, instruction_selection, error_handling, tensor_core_codegen, architecture_detection.
Output: foundation/taxonomy/modules/function_to_module_map.json, module_list.json.
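The call-proximity voting (strategy 3) can be sketched on a toy graph. The data is illustrative, and treating weights as votes is an assumption about the script's exact logic:

```python
from collections import Counter

# Toy call graph; seed labels come from strategies 1-2.
callers = {"f3": ["f1", "f2"], "f4": ["f3"]}
callees = {"f3": ["f4"], "f4": []}
seed = {"f1": "ptx_emission", "f2": "ptx_emission"}

def propagate(labels, iterations=3, min_votes=2):
    labels = dict(labels)
    for _ in range(iterations):
        for fn in set(callers) | set(callees):
            if fn in labels:
                continue
            votes = Counter()
            for c in callers.get(fn, []):   # caller votes weighted 2x
                if c in labels:
                    votes[labels[c]] += 2
            for c in callees.get(fn, []):   # callee votes weighted 1x
                if c in labels:
                    votes[labels[c]] += 1
            if votes:
                module, n = votes.most_common(1)[0]
                if n >= min_votes:          # minimum of 2 votes required
                    labels[fn] = module
    return labels

print(propagate(seed))
```

After the first iteration `f3` is labeled from its two known callers; a later iteration then reaches `f4` through the newly labeled `f3`, which is exactly the outward propagation the step describes.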
Step 4: Analyze Call Graph (foundation/04_analyze_callgraph.py)
Computes three structural properties of the call graph:
- Entry points -- functions with zero callers and nonzero callees (top 100 by callee count). These are pipeline entry points, API functions, or global constructors.
- Leaf functions -- functions with zero callees and nonzero callers (top 1,000 by caller count). These are utility functions, allocators, and assertion handlers.
- Hot paths -- functions ranked by caller count (top 1,000). The highest-traffic functions in the binary.
Output: foundation/callgraph/entry_points.json, leaf_functions.json, hot_paths.json.
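On a toy caller/callee map, the first two queries reduce to two comprehensions (illustrative data only; the real script reads the Step 1 pickles):

```python
# Toy caller/callee maps.
func_to_callers = {"main": [], "helper": ["main", "init"], "init": []}
func_to_callees = {"main": ["helper"], "helper": [], "init": ["helper"]}

# Entry points: zero callers, nonzero callees.
entry_points = [f for f, cs in func_to_callers.items()
                if not cs and func_to_callees.get(f)]
# Leaf functions: zero callees, nonzero callers.
leaf_functions = [f for f in func_to_callees
                  if not func_to_callees[f] and func_to_callers.get(f)]

print(sorted(entry_points))  # ['init', 'main']
print(leaf_functions)        # ['helper']
```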
Step 5: Assign Priorities (foundation/05_assign_priorities.py)
Computes a composite priority score for each function to guide analysis effort allocation. The scoring formula:
- Size component: 1000 points for functions over 10 KB, 700 for 5-10 KB, 400 for 2-5 KB, 200 for 1-2 KB, 100 for 500 B-1 KB.
- Call frequency component: 500 points for 1000+ callers, 300 for 500+, 150 for 100+, 75 for 50+.
- Named function bonus: 200 points if the function has a recovered name (not a `sub_` placeholder).
- Critical module bonus: 300 points if the function belongs to a critical module (compilation_pipeline, tensor_core_codegen, architecture_detection, register_allocation, instruction_selection, ptx_emission).
Functions scoring 1000+ are tier CRITICAL, 500+ are HIGH, 200+ are MEDIUM, below 200 are LOW.
Output: foundation/priorities/scoring_report.json, critical.json, high.json, medium.json, low.json.
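A sketch of the scoring formula. Treating 1 KB as 1,024 bytes is an assumption; the text gives only the KB bands:

```python
def priority_score(size_bytes, callers, named, critical_module):
    """Composite score per Step 5 (byte thresholds assumed, 1 KB = 1,024 B)."""
    score = 0
    if size_bytes > 10 * 1024:
        score += 1000
    elif size_bytes > 5 * 1024:
        score += 700
    elif size_bytes > 2 * 1024:
        score += 400
    elif size_bytes > 1024:
        score += 200
    elif size_bytes > 512:
        score += 100
    if callers >= 1000:
        score += 500
    elif callers >= 500:
        score += 300
    elif callers >= 100:
        score += 150
    elif callers >= 50:
        score += 75
    if named:
        score += 200
    if critical_module:
        score += 300
    return score

def tier(score):
    if score >= 1000: return "CRITICAL"
    if score >= 500:  return "HIGH"
    if score >= 200:  return "MEDIUM"
    return "LOW"

s = priority_score(12 * 1024, 600, named=True, critical_module=True)
print(s, tier(s))  # 1800 CRITICAL
```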
Step 6: Generate Coverage Tracker (foundation/06_generate_coverage_tracker.py)
Aggregates all prior outputs into a master JSON tracker that records, per module and per function, the analysis status (pending/in-progress/complete), the assigned analyst, and the evidence quality score. This tracker serves as the coordination database for the L1 phase.
Output: foundation/coverage/tracker.json.
L1 Module Analysis Pipeline
The L1 pipeline runs via scripts/run_l1_programmatic.sh and requires L0 completion. It organizes CRITICAL and HIGH priority functions into per-module directories for systematic human review.
Step 1: Create Module Structure (modules/01_create_module_structure.py)
Creates the directory tree modules/{module}/functions/{critical,high}/ for each of the eight modules. MEDIUM and LOW tiers are intentionally excluded from L1 to focus effort on the most important functions.
Step 2: Extract Function Metadata (modules/02_extract_function_metadata.py)
For each CRITICAL and HIGH function, creates a directory modules/{module}/functions/{tier}/{address}/ containing a metadata.json file with: address, name, module, priority score, size, call frequency, scoring reasons, top 50 callers, top 50 callees, and paths to decompiled/disassembly/CFG files if they exist on disk.
Step 3: Generate Module READMEs (modules/03_generate_module_readmes.py)
Generates a skeleton README.md for each module with function counts, analysis progress tracking fields, and section headings for purpose, key functions, integration points, and data structures. These serve as the starting point for human-written module documentation.
Standalone Analysis Scripts
Six additional scripts perform targeted analyses independent of the L0/L1 pipeline:
analyze_nvvm_pipeline.py -- Loads the NVVM call graph (nvvm_callgraph.json, exported from the LibNVVM shared object analysis) and traces the compilation flow from nvvmCompileProgram. Identifies NVVM API entry points, finds LLVM optimization pass function symbols, traces call paths to depth 10, identifies hub functions (nodes with in-degree or out-degree above 10), and extracts the optimization pass ordering reachable from the compile entry point.
deep_pipeline_trace.py -- Performs deep BFS traversal (up to depth 15, width 100 per level) from nvvmCompileProgram through the NVVM call graph. Annotates each function with structural characteristics (LEAF, HUB, FANOUT, FANIN) and groups results by call depth to reveal the pipeline's stage boundaries. Also traces from secondary API entry points (nvvmVerifyProgram, nvvmAddModuleToProgram, nvvmCreateProgram).
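The depth-grouped BFS can be sketched as follows (toy graph; the real trace caps depth at 15 and width at 100 per level):

```python
from collections import deque

# Toy call graph rooted at the compile entry point.
graph = {
    "nvvmCompileProgram": ["parse", "optimize"],
    "parse": ["lex"], "optimize": ["gvn", "licm"],
    "lex": [], "gvn": [], "licm": [],
}

def by_depth(root, max_depth=15):
    """Group reachable functions by call depth to expose stage boundaries."""
    depths, seen = {}, {root}
    queue = deque([(root, 0)])
    while queue:
        node, d = queue.popleft()
        depths.setdefault(d, []).append(node)
        if d < max_depth:
            for nxt in graph.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
    return depths

print(by_depth("nvvmCompileProgram"))
```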
extract_pipeline_structure.py -- Parses the 188,141 strings database for disable-*Pass patterns and Disable * description strings to extract the complete list of optimization passes by name. Categorizes passes into groups (Dead Code Elimination, Loop Optimizations, Inlining, Memory, NVVM-Specific, Lowering, etc.) and reconstructs the 13-stage compilation pipeline from NVVM module loading through PTX code generation. Also extracts compilation mode information (fast-compile, split-compile, partial-link).
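A minimal version of the disable-flag scan. The exact regexes the script uses were not recovered, so these patterns are illustrative:

```python
import re

# Toy string pool; the real scan covers all 188,141 extracted strings.
pool = [
    "disable-licm", "disable-loop-unroll",
    "Disable Loop Invariant Code Motion",
    "unrelated text",
]

flag_re = re.compile(r"^disable-([a-z0-9-]+)$")   # disable-* flag names
desc_re = re.compile(r"^Disable (.+)$")            # "Disable *" descriptions

passes = sorted({m.group(1) for s in pool if (m := flag_re.match(s))})
descriptions = [m.group(1) for s in pool if (m := desc_re.match(s))]

print(passes)        # ['licm', 'loop-unroll']
print(descriptions)  # ['Loop Invariant Code Motion']
```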
analyze_performance_hotspots.py -- Loads the full function database (cicc_functions.json) and computes: global hotspot ranking (top 100 most-called functions), hot path chains (BFS from top 50 hotspots through callees, tracking weighted call frequency), size-efficiency analysis (bytes per call for each function), loop depth estimation (regex-based nesting analysis of decompiled C files), bottleneck identification (functions with 500+ callers), and module-level hotspot distribution.
catalog_optimization_framework.py -- Specialized script for the optimization_framework module. Reads per-function metadata from the L1 module directories, builds a critical function registry sorted by size, extracts HIGH-tier statistics (size tier distribution, top 20 most-called), scans decompiled code for optimization-related string patterns (pass references, iteration patterns, technique keywords), and identifies entry points (functions with 2 or fewer callers).
validate_callgraph.py -- Comprehensive validation system that cross-checks the call graph data against module classifications. Performs six verification analyses: cross-module call matrix verification (counting inter-module edges and sampling for spot-checks), entry point validation (confirming claimed entry points have zero callers), reachability analysis (BFS from main to find dead code), module dependency cycle detection (DFS on the module dependency graph), integration hotspot verification (functions called by all 8 modules), and bridge function identification (functions that both call into and are called from 2+ other modules).
Evidence Index Builders
Two versions of the evidence aggregation engine synthesize all data sources into per-function quality scores:
build_evidence_index.py (v1) -- Loads the exported databases (functions, callgraph, strings, xrefs, names, comments) and the function-to-module map into memory. For each of the 80,562 functions, counts eight evidence types (metadata, callers, callees, strings, xrefs, name pattern, size, module consistency) and computes a weighted confidence score (string evidence weighted highest at 20 points, callers and callees at 15 each, xrefs at 15, metadata and name at 10 each, module at 10, size at 5). Produces nine output files including quality tier assignments (GOLD >= 80%, SILVER >= 50%, BRONZE < 50%), citation density analysis, cross-reference statistics, and prioritized recommendations for further analysis.
build_evidence_index_v2.py (v2, optimized) -- Memory-efficient reimplementation that avoids loading the full xref list into memory. Instead of building complete xref lookup tables, it streams the xref file line-by-line and counts only. The callgraph is preprocessed into a caller/callee count map rather than a full edge list. Produces the same nine analysis files as v1 with identical quality tier logic. Recommended for systems with less than 32 GB RAM.
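The tier logic shared by both versions can be sketched with the weights quoted above, which conveniently sum to 100 (the boolean evidence flags are a simplification of the script's counts):

```python
# Weights quoted above: strings 20; callers, callees, xrefs 15 each;
# metadata, name, module 10 each; size 5. Total: 100.
WEIGHTS = {"strings": 20, "callers": 15, "callees": 15, "xrefs": 15,
           "metadata": 10, "name": 10, "module": 10, "size": 5}

def quality_tier(evidence):
    """GOLD >= 80%, SILVER >= 50%, BRONZE otherwise."""
    score = sum(WEIGHTS[k] for k, present in evidence.items() if present)
    pct = 100 * score / sum(WEIGHTS.values())
    if pct >= 80:
        return "GOLD"
    if pct >= 50:
        return "SILVER"
    return "BRONZE"

print(quality_tier({k: True for k in WEIGHTS}))  # GOLD
print(quality_tier({"strings": True, "callers": True,
                    "callees": True, "xrefs": True}))  # SILVER
```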
Cross-Module Dependency Analysis
07_analyze_cross_module_dependencies.py -- The most complex standalone analysis. Streams the full call graph (using ijson for memory-efficient parsing) four times to compute:
- Inter-module call matrix -- for each pair of the 8 modules, the number of call edges crossing the boundary.
- Module dependency depth -- per-module statistics on how many other modules each function depends on, identifying isolated functions and hub functions.
- Critical bridges -- functions that call into 3 or more other modules (top 100 by bridge count).
- Integration hotspots -- functions called by 3 or more other modules (top 100 by fan-in).
- Module dependency graph -- a JSON graph structure with weighted edges suitable for visualization.
- Integration patterns -- entry point modules (highest out-degree), utility hub modules (highest in-degree), and linear dependency chains.
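The inter-module call matrix reduces to a Counter over boundary-crossing edges (toy data; the real script streams cicc_callgraph.json with ijson):

```python
from collections import Counter

# Toy (caller, callee) edges and function-to-module map.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
module = {"a": "compilation_pipeline", "b": "ptx_emission", "c": "ptx_emission"}

matrix = Counter()
for caller, callee in edges:
    m1, m2 = module[caller], module[callee]
    if m1 != m2:                 # count only boundary-crossing edges
        matrix[(m1, m2)] += 1

print(dict(matrix))
```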
Data Flow and Directory Structure
The complete analysis data is organized as follows:
cicc/
databases/ # IDA exports (input data)
cicc_functions.json # 80,562 function records
cicc_strings.json # 188,141 string records
cicc_xrefs.json # cross-reference records
cicc_callgraph.json # call edge records
cicc_names.json # recovered names
cicc_comments.json # IDA comments
foundation/ # L0 pipeline output
indices/ # pickle indices for fast lookup
taxonomy/
modules/ # function-to-module map, module list
strings/ # classified string databases
callgraph/ # entry points, leaf functions, hot paths
priorities/ # priority scoring and tier assignments
coverage/ # master progress tracker
analyses/ # evidence index, quality tiers, cross-module data
modules/ # L1 pipeline output
{module}/
functions/
critical/{addr}/ # metadata.json per critical function
high/{addr}/ # metadata.json per high function
analysis/ # module-level analysis files
README.md # module documentation skeleton
decompiled/ # Hex-Rays output (per-function C files)
disasm/ # IDA disassembly output (per-function ASM files)
graphs/ # Control flow graphs (JSON and DOT)
scripts/ # All analysis scripts
foundation/ # L0 pipeline scripts (00-07)
modules/ # L1 pipeline scripts (01-03)
run_foundation_analysis.sh # L0 batch runner
run_l1_programmatic.sh # L1 batch runner
Verification Approaches
To verify any specific finding in this wiki:
1. Open IDA at the stated address. Every function identification includes an address. Navigate to it, press F5 to decompile, and check whether the decompiled code matches the described behavior.
2. Check string xrefs. For VERY HIGH and KNOWN identifications, search for the quoted string in IDA's Strings window. The xref should lead to the stated function address or a function that directly calls it.
3. Compare with upstream LLVM. CICC is based on LLVM 20.0.0. The LLVM source tree at the corresponding git tag contains the original implementations of all standard passes. Structural comparison (switch case counts, parameter counts, error message text) between the decompiled code and the LLVM source is the gold standard for verification.
4. Cross-reference the dual paths. Path A and Path B contain near-duplicate code. If a function is identified in Path A, the corresponding Path B function should exhibit the same structure. Agreement between the two paths increases confidence.
5. Trace from known entry points. Start at `sub_8F9C90` (real main, KNOWN confidence) and follow the call chain. Every function reachable from main through a chain of identified functions has a verified callgraph path.
6. Run the validation script. Execute `scripts/validate_callgraph.py` to cross-check the call graph against module classifications. The script produces a `CALLGRAPH_VALIDATION_REPORT.json` with quantitative metrics: entry point accuracy, cross-module call counts, reachability percentage, bridge function inventory, and module dependency cycles. A healthy analysis should show entry point confidence above 90% and reachability above 80%.
7. Re-run the evidence index. Execute `scripts/foundation/build_evidence_index_v2.py` to regenerate quality tier assignments. Compare the GOLD/SILVER/BRONZE percentages against the expected distribution (majority SILVER or GOLD for classified functions). Functions that drop to BRONZE after a wiki edit indicate a regression in evidence consistency.
Reproducing the Full Analysis
To reproduce this analysis from scratch:
1. Obtain the binary. Install CUDA Toolkit 13.0. The binary is at `<cuda>/nvvm/bin/cicc`. SHA-256 and build string `cuda_13.0.r13.0/compiler.36424714_0` must match.
2. Run IDA auto-analysis. Open cicc in IDA Pro 8.x with default x86-64 analysis settings. Allow auto-analysis to complete (5-10 minutes for a binary of this size). Accept the detected compiler (GCC).
3. Batch decompile. Run the following IDA Python script to decompile all functions and export per-function C files:

   ```python
   import idautils, ida_hexrays, idc

   for func_ea in idautils.Functions():
       try:
           cfunc = ida_hexrays.decompile(func_ea)
           name = idc.get_func_name(func_ea)
           addr = f"0x{func_ea:X}"
           with open(f"decompiled/{name}_{addr}.c", "w") as f:
               f.write(str(cfunc))
       except:
           pass
   ```

4. Export databases. Use IDA Python to export the five JSON databases (functions, strings, xrefs, callgraph, names) to `cicc/databases/`. The function export should iterate `Functions()` and record address, name, size, instruction count, is_library, is_thunk, callers, and callees for each. The string export should iterate IDA's string list and record address, value, and xrefs.
5. Run L0 foundation pipeline.

   ```
   cd cicc/scripts
   bash run_foundation_analysis.sh
   ```

   This executes Steps 0-6 in sequence, producing all indices, classifications, and the coverage tracker. Expected runtime: 2-5 minutes on a modern machine.
6. Run L1 module setup.

   ```
   bash run_l1_programmatic.sh
   ```

   This creates the per-module directory structure, extracts metadata for CRITICAL and HIGH functions, and generates module README skeletons. Expected runtime: under 1 minute.
7. Run standalone analyses (optional, for deeper investigation):

   ```
   python3 analyze_nvvm_pipeline.py         # NVVM pipeline trace
   python3 deep_pipeline_trace.py           # Deep BFS from nvvmCompileProgram
   python3 extract_pipeline_structure.py    # Pass extraction from strings
   python3 analyze_performance_hotspots.py  # Hotspot ranking
   python3 validate_callgraph.py            # Validation report
   ```

8. Run evidence indexing (optional, for quality assessment):

   ```
   cd foundation
   python3 build_evidence_index_v2.py
   ```

9. Begin manual analysis. With the foundation data in place, start from the CRITICAL priority list and the string anchors described in the Function Identification Strategies section above. The Function Map page is the primary lookup table.
Dependencies
The analysis scripts require only the Python 3.8+ standard library with one exception: 07_analyze_cross_module_dependencies.py uses ijson for streaming JSON parsing of the large callgraph file. Install with pip install ijson. All other scripts use only json, pickle, re, collections, pathlib, statistics, dataclasses, and typing.
Binary Address Sweep Reports
In addition to the automated scripts, the analysis produced 90+ raw binary sweep reports stored in cicc/raw/. Each report covers a contiguous address range (typically 128 KB to 512 KB) and contains per-function identification notes, string evidence citations, structural observations, and confidence assessments. The reports are named by address range (e.g., p1.3-01-sweep-0x8F0000-0x90FFFF.txt covers the compilation pipeline entry region) and organized into 10 sweep phases corresponding to the binary's major sections. A second round of sweeps (p2-* and p2a-p2g) provides focused analyses of specific subsystems (EDG frontend, IR generation, optimization passes, SelectionDAG, register allocation, scheduling, configuration).
These raw reports are the primary source material from which the wiki pages were written. They are not cleaned or edited for presentation -- they contain working notes, false starts, and corrections made during the analysis process.
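The sweep-report naming convention is regular enough to parse mechanically. A minimal sketch (the regex and helper name are my own; only the example filename comes from the reports):

```python
import re

# Hypothetical parser for sweep-report names such as
# "p1.3-01-sweep-0x8F0000-0x90FFFF.txt" -> (phase, index, start, end).
SWEEP_RE = re.compile(
    r"^(?P<phase>p[\w.]+)-(?P<idx>\d+)-sweep-"
    r"0x(?P<start>[0-9A-Fa-f]+)-0x(?P<end>[0-9A-Fa-f]+)\.txt$"
)

def parse_sweep_name(name):
    m = SWEEP_RE.match(name)
    if not m:
        return None
    return (m["phase"], int(m["idx"]), int(m["start"], 16), int(m["end"], 16))

print(parse_sweep_name("p1.3-01-sweep-0x8F0000-0x90FFFF.txt"))
# -> ('p1.3', 1, 9371648, 9502719)
```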
Limitations and Known Gaps
This analysis has several inherent limitations:
- No dynamic validation. All findings are from static analysis. Runtime behavior under specific inputs (unusual SM targets, edge-case CUDA constructs) has not been verified.
- EDG internals are partially opaque. The EDG frontend is a licensed third-party component. Unlike the LLVM-derived portions of the binary, its internal data structures have no public source or literature to compare against, making identification harder. The IL tree format and scope management structures are identified at MEDIUM confidence.
- Inlined functions are invisible. If the compiler inlined a function during the build of cicc itself, that function has no standalone address and cannot be independently identified. Some small LLVM utility functions (SmallVector operations, StringRef comparisons) are likely inlined throughout.
- Proprietary NVIDIA code has no public reference. The 35 custom NVIDIA passes, the NVVM bridge layer, and the NVVMPassOptions system have no upstream source to compare against. These are identified purely from string evidence and structural analysis.
- Version-specific. All findings apply to cicc v13.0 (build cuda_13.0.r13.0/compiler.36424714_0). Addresses, function sizes, and pass counts will differ in other CUDA toolkit versions.
- Module classification accuracy degrades at the boundary. The automated taxonomy assigns ~60% of functions with high confidence (wiki ground truth or strong string evidence). The remaining functions are classified by call-proximity propagation or address-range heuristics at 30-60% confidence. Functions at module boundaries may be misclassified; the validate_callgraph.py script quantifies this.
- Callgraph completeness depends on IDA's xref analysis. Indirect calls through function pointers (vtable dispatch, callback registrations) are not fully captured by IDA's static analysis. The call graph is therefore a lower bound on the true call relationships. This primarily affects LLVM's pass manager dispatch and the EDG frontend's visitor-pattern implementations.
Version Tracking
This page documents the exact version identifiers embedded in the cicc v13.0 binary and the version relationships between its components. Every version listed here was recovered from string constants, constructor initializations, or binary header fields in the stripped ELF binary. This is the single source of truth for version-related questions across the wiki.
Version Summary
| Component | Version | Evidence |
|---|---|---|
| cicc binary | v13.0 | Build string cuda_13.0.r13.0/compiler.36424714_0 |
| CUDA Toolkit | 12.8 | Toolkit release that ships cicc v13.0 |
| LLVM base (internal) | 20.0.0 | ctor_036 at 0x48CC90 falls back to "20.0.0"; string "llvm-mc (based on LLVM 20.0.0)" at sub_E7A190 |
| Bitcode producer (emitted) | "LLVM7.0.1" | ctor_154 at 0x4CE640 writes "7.0.1" to producer global |
| EDG frontend | 6.6 | String "Based on Edison Design Group C/C++ Front End, version 6.6" |
| NVVM IR version (user code) | 3.2 | Metadata gate at sub_157E370: major == 3, minor <= 2 |
| NVVM IR version (libdevice) | 2.0 | !nvvmir.version = !{i32 2, i32 0} -- always-compatible sentinel |
| NVVM container format | 1.x | Header field version_major = 1, version_minor <= 0x41 |
| NVVM debug info version | 3.2 | Container header nvvm_debug_major = 3, nvvm_debug_minor <= 2 |
| Embedded libdevice | libdevice.10.bc | 455,876 bytes, 352 functions, triple nvptx64-nvidia-gpulibs |
| GCC emulation (EDG) | 8.1 | DEFAULT_GNU_VERSION = 80100 |
| Clang emulation (EDG) | 9.1 | DEFAULT_CLANG_VERSION = 90100 |
| jemalloc | 5.3.x | ~400 statically linked functions at 0x12FC000 |
| Default PTX ISA (sm_90) | 8.5 | .version 8.5 computed from PTXVersion / 10, PTXVersion % 10 |
| Default SM target | sm_75 | Hardcoded strcpy("compute_75") in sub_900130 and sub_125FB30 |
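The PTX ISA row's computation can be sketched directly; the helper name is mine, the /10 and %10 split comes from the table:

```python
def ptx_version_directive(ptx_version: int) -> str:
    # PTXVersion is stored as a single integer (e.g. 85 for PTX ISA 8.5);
    # the emitted directive splits it as major = v / 10, minor = v % 10.
    return f".version {ptx_version // 10}.{ptx_version % 10}"

assert ptx_version_directive(85) == ".version 8.5"  # sm_90 default per the table
```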
LLVM Version: The Dual Identity
CICC has two LLVM version identities. Internally, it is an LLVM 20.0.0 fork -- all modern instruction opcodes, metadata formats, type encodings, and pass infrastructure from LLVM 20 are present. Externally, the bitcode it emits identifies itself as "LLVM7.0.1" in the producer field.
The reason is historical: NVVM IR 2.0 was defined against LLVM 7.0.1. The entire NVVM toolchain ecosystem (libNVVM, nvcc's device pipeline, nvdisasm, third-party NVVM IR consumers) standardized on "LLVM7.0.1" as the format identifier. Changing the producer string would require a coordinated update across the entire CUDA toolkit and all downstream consumers.
Binary evidence:
- ctor_036 at 0x48CC90: reads the LLVM_OVERRIDE_PRODUCER environment variable, falls back to "20.0.0" (the true version).
- ctor_154 at 0x4CE640: reads LLVM_OVERRIDE_PRODUCER, falls back to "7.0.1" (the compatibility marker). This is the constructor that runs for the bitcode writer path.
- sub_E7A190: contains the string "llvm-mc (based on LLVM 20.0.0)".
- sub_1538EC0 (writeModule): emits "LLVM" + "7.0.1" = "LLVM7.0.1" as the IDENTIFICATION_BLOCK producer.
Both constructors accept the LLVM_OVERRIDE_PRODUCER environment variable to override the default. Setting it changes the embedded producer string in output bitcode.
See Bitcode Reader/Writer for the full dual-producer mechanism.
EDG 6.6 Frontend
The EDG (Edison Design Group) frontend is a licensed commercial C/C++ frontend. Version 6.6 occupies 3.2 MB of code at 0x5D0000--0x8F0000. The version string is embedded literally as "Based on Edison Design Group C/C++ Front End, version 6.6" and is accessible via the --version flag.
EDG 6.6 in cicc is configured to emulate GCC 8.1 (DEFAULT_GNU_VERSION = 80100) and Clang 9.1 (DEFAULT_CLANG_VERSION = 90100). It supports C++23 as the newest C++ standard and C23 as the newest C standard, with C++17 as the default mode.
See EDG 6.6 Frontend for the full frontend documentation.
NVVM IR Version
The NVVM IR version is a metadata tuple (major, minor) embedded in every NVVM bitcode module via the !nvvmir.version named metadata node. CICC v13.0 has two distinct version contexts:
User code: the IR generation phase (sub_9151E0) emits !nvvmir.version with the current version tuple. The version checker at sub_157E370 enforces major == 3 and minor <= 2, making 3.2 the current maximum accepted version. Modules with major != 3 or minor > 2 are rejected with "Broken module found, compilation aborted!".
Libdevice: the embedded libdevice.10.bc carries !nvvmir.version = !{i32 2, i32 0}. The version (2, 0) is hard-coded in the version checker (sub_12BDA30) as an always-compatible sentinel -- it passes the check regardless of the current NVVM IR version. This ensures the embedded math library is compatible with any user module.
Container format: the NVVM container binary header stores version fields separately at offsets 0x06--0x07 (nvvm_ir_major, nvvm_ir_minor). These track the container-level IR spec version and may differ from the bitcode-level metadata tuple.
Bypass: setting NVVM_IR_VER_CHK=0 in the environment disables version validation entirely, allowing any version tuple to pass.
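The three rules (gate, sentinel, env bypass) combine into a small predicate. A sketch with illustrative names; the constants come from the checkers described above:

```python
import os

def nvvm_version_accepted(major, minor, env=None):
    env = os.environ if env is None else env
    # NVVM_IR_VER_CHK=0 disables version validation entirely.
    if env.get("NVVM_IR_VER_CHK") == "0":
        return True
    # libdevice's (2, 0) tuple is hard-coded as an always-compatible sentinel.
    if (major, minor) == (2, 0):
        return True
    # User modules must satisfy major == 3 and minor <= 2.
    return major == 3 and minor <= 2

assert nvvm_version_accepted(3, 2, env={})      # current maximum accepted
assert nvvm_version_accepted(2, 0, env={})      # libdevice sentinel
assert not nvvm_version_accepted(3, 3, env={})  # "Broken module found..."
assert nvvm_version_accepted(9, 9, env={"NVVM_IR_VER_CHK": "0"})  # bypass
```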
See Bitcode Reader/Writer for the version gate implementation and NVVM Container for the container-level version fields.
Embedded Libdevice
The embedded libdevice is libdevice.10.bc, a 455,876-byte LLVM bitcode library containing 352 GPU-optimized math functions. Two identical copies are statically embedded in the binary:
| Copy | Address | Pipeline |
|---|---|---|
| A | unk_3EA0080 | LibNVVM mode (Path A) |
| B | unk_420FD80 | Standalone mode (Path B) |
Key properties:
- Target triple: nvptx64-nvidia-gpulibs
- Function count: 352 (all alwaysinline nounwind)
- NVVM IR version: (2, 0) -- always-compatible sentinel
- Producer: "clang version 3.8.0 (tags/RELEASE_380/final)" -- the Clang version that originally compiled libdevice (not indicative of cicc's own compiler version)
- NVVMReflect calls: uses __nvvm_reflect("__CUDA_FTZ"), __nvvm_reflect("__CUDA_ARCH"), and __nvvm_reflect("__CUDA_PREC_SQRT") for runtime specialization
The libdevice.10.bc naming convention carries forward from the CUDA 5.0 era. The 10 in the filename originally indicated "compute capability 1.0 and above" (i.e., universal), not a version number.
See Libdevice Linking for the linking algorithm, version validation, and NVVMReflect interaction.
NVVM Container Format Version
The NVVM container binary envelope uses its own versioning scheme, independent of the NVVM IR version:
| Field | Offset | Value | Meaning |
|---|---|---|---|
| version_major | 0x04 | 1 | Container format major |
| version_minor | 0x05 | <= 0x41 | Container format minor |
| nvvm_ir_major | 0x06 | 2 | NVVM IR spec major (container-level) |
| nvvm_ir_minor | 0x07 | <= 0x62 | NVVM IR spec minor (container-level) |
| nvvm_debug_major | 0x08 | 3 | Debug info format major |
| nvvm_debug_minor | 0x09 | <= 2 | Debug info format minor |
| llvm_major | 0x0A | encoded | LLVM version, combined encoding (see below) |
| llvm_minor | 0x0B | encoded | Combined with llvm_major: major * 100 + minor = 2000 |
The container's LLVM version encoding stores the combined value 20 * 100 + 0 = 2000, confirming the internal LLVM 20.0.0 base.
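A minimal reader for these version bytes might look like the following; the offsets and field names come from the table, while the little-endian pairing of the two LLVM bytes is an assumption (it matches the x86-64 host) and the magic/flag bytes before offset 0x04 are not modeled:

```python
import struct

def parse_container_versions(header: bytes) -> dict:
    # Version-related bytes of the NVVM container header, per the table above.
    names = ["version_major", "version_minor", "nvvm_ir_major", "nvvm_ir_minor",
             "nvvm_debug_major", "nvvm_debug_minor", "llvm_major", "llvm_minor"]
    out = dict(zip(names, struct.unpack_from("8B", header, 0x04)))
    # Assumed little-endian combination: 20 * 100 + 0 = 2000 for LLVM 20.0.0.
    out["llvm_combined"] = out["llvm_major"] | (out["llvm_minor"] << 8)
    return out

# Synthetic header carrying the v13.0 values from the table; the first four
# bytes stand in for the unmodeled magic/flag fields.
hdr = b"\x00\x00\x00\x00" + bytes([1, 0x41, 2, 0x62, 3, 2, 0xD0, 0x07])
assert parse_container_versions(hdr)["llvm_combined"] == 2000
```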
See NVVM Container for the full binary format specification.
Version Cross-Reference Matrix
How versions flow through the pipeline:
EDG 6.6 Frontend
|
v
NVVM IR Generation
(emits nvvmir.version = {3, 2})
|
+----+----+
| |
libdevice user IR
(version 2,0) (version 3,2)
| |
+----+----+
|
NVVM IR Version Check
(gate: major==3, minor<=2)
(sentinel: 2,0 always passes)
|
LLVM 20.0.0 Optimizer
|
Bitcode Writer
(producer: "LLVM7.0.1")
|
NVVM Container Serializer
(container version 1.x, LLVM encoded as 2000)
|
v
.ptx / .optixir output
Future Updates
This wiki documents cicc v13.0 from CUDA 12.8. When a new CUDA toolkit release ships a newer cicc binary, the following version fields are the most likely to change:
- LLVM base version: NVIDIA periodically rebases on newer LLVM releases. A jump from 20.0.0 to a later version would change the internal string, the container LLVM encoding, and potentially add new passes, opcodes, and metadata formats.
- EDG version: EDG releases track independently of LLVM. A bump from 6.6 to a later version would affect C++ standard support, keyword handling, and the frontend error catalog.
- NVVM IR version minor: the minor field (currently 2 in the major == 3 series) may increment to accommodate new metadata kinds or intrinsic conventions without breaking the major version.
- PTX ISA version: new SM targets require new PTX versions. sm_100 Blackwell already uses a higher PTX version than sm_90 Hopper.
- SM target range: new GPU architectures add new SM numbers. The sm_75--sm_121 range in v13.0 will expand in future releases.
The bitcode producer string ("LLVM7.0.1") is unlikely to change in the near term -- doing so would break backward compatibility with the entire NVVM IR ecosystem. The libdevice version sentinel (2, 0) is similarly stable because the version checker special-cases it.
To update this wiki for a new cicc version:
- Extract the build string (search for cuda_XX.Y.rXX.Y/compiler.).
- Check ctor_036 for the LLVM version fallback string.
- Check the EDG version string at sub_617BD0.
- Check the NVVM IR version gate constants at the version checker function.
- Measure the embedded libdevice size and function count.
- Verify the NVVM container header version fields.
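The build-string step can be automated with a byte-level regex over the raw binary image. A sketch; the pattern is derived from the v13.0 build string format and may need widening for other releases:

```python
import re

# Matches build identifiers shaped like "cuda_13.0.r13.0/compiler.36424714_0".
BUILD_RE = re.compile(rb"cuda_\d+\.\d+\.r\d+\.\d+/compiler\.\w+")

def find_build_strings(data: bytes):
    # Scan a raw binary image for embedded CUDA build identifiers.
    return [m.group().decode() for m in BUILD_RE.finditer(data)]

blob = b"\x00garbage\x00cuda_13.0.r13.0/compiler.36424714_0\x00more"
print(find_build_strings(blob))  # -> ['cuda_13.0.r13.0/compiler.36424714_0']
```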
Cross-References
- Bitcode Reader/Writer -- producer string mechanism, version gate implementation
- NVVM Container -- container binary format with version header fields
- Libdevice Linking -- embedded math library, version sentinel
- EDG 6.6 Frontend -- frontend version, GCC/Clang emulation modes
- Binary Layout -- build ID string, ELF properties
- Entry Point & CLI -- dual-path dispatch, version string arguments
- Environment Variables -- LLVM_OVERRIDE_PRODUCER, NVVM_IR_VER_CHK
- Debug Info Verification -- debug version field (3.2)
Compilation Pipeline Overview
This page maps the complete end-to-end flow of a CUDA compilation through cicc v13.0, from the initial CLI invocation to the final PTX text output. Each stage is a self-contained subsystem with its own address range, data structures, and failure modes. The links below lead to dedicated pages with reimplementation-grade detail for every stage.
Pipeline Diagram
nvcc
|
v
+===========================================================+
| cicc (60 MB, 80,562 functions) |
| |
| 1. CLI Parsing & Dispatch -----> [entry.md] |
| | argv/envp, flag translation, arch detection |
| | dual-path select: Path A (LibNVVM) / Path B |
| v |
| 2. nvcc-to-cicc Interface -----> [nvcc-interface.md] |
| | flag tree (40+ mappings), 3-column arch fan-out |
| | mode cookies: 0xABBA=CUDA, 0xDEED=OpenCL |
| v |
| 3. EDG 6.6 Frontend -----------> [edg.md] |
| | CUDA C++ --> transformed C (.int.c/.device.c) |
| | 737 config #defines, GCC 8.1 / Clang 9.1 emu |
| v |
| 4. NVVM IR Generation ---------> [ir-generation.md] |
| | EDG IL tree --> LLVM Module (NVVM IR) |
| | address spaces, kernel metadata, builtins |
| v |
| 5. Libdevice Linking ----------> [../infra/libdevice-linking.md]
| | embedded 455KB bitcode, 352 __nv_* math fns |
| | target triple validation, NVVM version check |
| v |
| 6. LLVM Optimizer -------------> [optimizer.md] |
| | two-phase model (analysis -> codegen-oriented) |
| | 49.8KB pipeline assembler, ~150 pass insertions |
| | concurrent per-function Phase II |
| v |
| 7. LTO Pipeline ---------------> [../lto/index.md] |
| | cross-TU inlining, devirt, GlobalOpt |
| | closed-world GPU model: no dlopen, no .so |
| v |
| 8. Code Generation ------------> [codegen.md] |
| | SelectionDAG, ISel, RegAlloc, MachineIR passes |
| | 37 MB of code, largest subsystem |
| v |
| 9. PTX Emission ---------------> [emission.md] |
| | .entry/.func headers, register decls, .loc/.file |
| | AsmPrinter, GenericToNVVM addrspace rewrite |
| v |
| OUTPUT: .ptx file (or NVVM bitcode, or OptiX IR) |
+===========================================================+
Side paths:
* OptiX IR (--emit-optix-ir) ----> [optix-ir.md]
* Debug info (all stages) -------> [debug-info-pipeline.md]
Stage Descriptions
1. Entry Point & CLI Parsing
The real main (sub_8F9C90, 10KB) parses argv, detects wizard mode via NVVMCCWIZ=553282, selects the target architecture (default sm_75), and dispatches into one of two compilation paths. Path A serves the LibNVVM API; Path B serves standalone nvcc invocations. Both paths are functionally identical but duplicated in the binary at different address ranges. See Entry Point & CLI.
2. nvcc-to-cicc Interface
The flag translation layer (sub_8FE280) rewrites nvcc-facing flags into cicc-facing flags through a std::map red-black tree, then a second stage (sub_95EB40) fans each flag out into three columns targeting EDG, OPT, and LLC separately. Mode cookies (0xABBA for CUDA, 0xDEED for OpenCL) select language-specific behavior. See nvcc-to-cicc Interface.
3. EDG 6.6 Frontend
A licensed commercial frontend (3.2 MB, 0x5D0000--0x8F0000) parses CUDA C++ source and emits transformed C code into .int.c, .device.c, and .stub.c files. CUDA syntax (<<<>>>, __shared__, __device__) is fully resolved in this stage. The output is C source, not LLVM IR. See EDG 6.6 Frontend.
4. NVVM IR Generation
Translates the EDG intermediate language (IL) tree into an LLVM Module with proper NVPTX address space annotations, nvvm.annotations kernel metadata, and lowered builtins. This is cicc's equivalent of Clang's lib/CodeGen, but operates on EDG's proprietary IL node format. See NVVM IR Generation and its sub-pages for expressions, statements, functions, and types.
5. Libdevice Linking
A 455,876-byte LLVM bitcode library containing 352 GPU-optimized math functions (__nv_sinf, __nv_expf, etc.) is embedded directly in the cicc binary. The linker validates the nvptx64- target triple, checks NVVM IR version metadata, and merges the library into the compilation module. No filesystem access is required. See Libdevice Linking.
6. LLVM Optimizer
A proprietary two-phase pipeline (sub_12E54A0, 49.8KB) runs ~150 passes: Phase I performs module-wide analysis, Phase II performs codegen-oriented transforms with optional per-function parallelism using a jobserver or thread pool. All behavior is controlled by the 222-slot NVVMPassOptions system. See LLVM Optimizer and Pipeline & Ordering.
7. LTO Pipeline
Exploits the GPU's closed-world compilation model (no dlopen, no shared libraries, no symbol interposition) for aggressive cross-TU inlining, whole-program devirtualization, and global variable promotion. Activated in separate compilation mode (nvcc -dc), but GlobalOpt and the inliner run even in single-TU mode. See LTO & Module Optimization.
8. Code Generation
The largest subsystem (37 MB, 0x1700000--0x35EFFFF) lowers optimized LLVM IR to NVPTX MachineInstr through SelectionDAG construction, type legalization, instruction selection via a three-level pattern match engine (900KB), pressure-driven greedy register allocation, and ~30 machine-level passes including tensor core codegen for HMMA/IMMA/WGMMA/tcgen05. See Code Generation.
9. PTX Emission
The AsmPrinter (sub_31EC4F0, 72KB) walks the final MachineFunction and emits PTX text: .entry/.func headers with kernel attributes, register declarations for 9 register classes, .loc/.file debug directives, and instruction mnemonics. A GenericToNVVM pass rewrites any remaining generic address space references before emission. See PTX Emission.
Side Paths
OptiX IR -- When --emit-optix-ir is passed, the pipeline replaces LLC with an OPTIXIR stage that serializes the optimized LLVM module for the OptiX ray tracing runtime's continuation-based execution model. See OptiX IR Generation.
Debug Info -- Debug metadata flows through all stages: generated in IR-gen, preserved or stripped in the optimizer (5 stripping passes), verified after each pass, and emitted as .loc/.file PTX directives. See Debug Info Pipeline.
Internal Pipeline Encoding
Internally, cicc represents the active pipeline stages as a bitmask:
| Stage | Internal Name | Bit | Description |
|---|---|---|---|
| LNK | Libdevice link | 0x01 | Merge embedded math library |
| OPT | Optimizer | 0x02 | LLVM IR optimization (Phase I + II) |
| OPTIXIR | OptiX IR | 0x40 | OptiX serialization (mutually exclusive with LLC) |
| LLC | Code generation | 0x04 | SelectionDAG through PTX emission |
The standard CUDA compilation bitmask is LNK | OPT | LLC = 0x07. OptiX mode uses 0x43.
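The bitmask arithmetic is easy to verify; a sketch using the constants from the table (the helper name is mine):

```python
# Stage bits from the table above.
LNK, OPT, LLC, OPTIXIR = 0x01, 0x02, 0x04, 0x40

def pipeline_mask(*stages: int) -> int:
    mask = 0
    for s in stages:
        mask |= s
    return mask

# Standard CUDA compilation: link libdevice, optimize, run codegen.
assert pipeline_mask(LNK, OPT, LLC) == 0x07
# OptiX mode swaps LLC for the OPTIXIR serializer.
assert pipeline_mask(LNK, OPT, OPTIXIR) == 0x43
```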
Cross-References
- Binary Layout -- address ranges for every subsystem
- Function Map -- master index of recovered function addresses
- CLI Flags -- complete flag catalog
- Optimization Levels -- what changes at -O0/-O1/-O2/-O3
- NVIDIA Custom Passes -- 35 proprietary passes inserted into the LLVM pipeline
- NVPTX Target Infrastructure -- TargetMachine, TTI, SubtargetFeatures
Entry Point & CLI
The cicc binary has a surprisingly complex entry point. Rather than a straightforward main → compile → exit flow, it implements a dual-path architecture where the same binary can operate as either a LibNVVM-based compiler (Path A) or a standalone compiler (Path B), selected at runtime through environment variables and obfuscated string comparisons. This design allows NVIDIA to ship a single binary that serves both the nvcc toolchain and the LibNVVM API.
The entry point region (0x8F0000–0x96FFFF, ~520 KB) handles CLI parsing, architecture detection with a 3-column flag fan-out system, and dispatch into one of several compilation pipelines. A hidden "wizard mode" gated behind an environment variable with a magic number enables developer diagnostics that are otherwise completely inaccessible.
| main() thunk | 0x4396A0 (16 bytes) — return sub_8F9C90(argc, argv, envp) |
| Real main | sub_8F9C90 (10,066 bytes, 1,990 lines) |
| Wizard mode | getenv("NVVMCCWIZ") == 553282 → byte_4F6D280 = 1 |
| Default arch | compute_75 / sm_75 (Turing) |
| Flag catalog | sub_9624D0 (75KB, 2,626 lines, 4 output vectors) |
| Architecture map | sub_95EB40 (38KB, 23 architectures, 3-column fan-out) |
| Flag translation | sub_8FE280 (red-black tree at qword_4F6D2A0, 40+ nvcc→cicc mappings) |
| Pipeline stages | LNK → OPT → [OPTIXIR] → LLC |
| Dual path | Path A (sub_905EE0) / Path B (sub_1265970) |
| Libdevice | Path A: unk_3EA0080 / Path B: unk_420FD80 (455,876 bytes each) |
| Arch bitmask | 0x60081200F821 (validates SM 75–121) |
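The arch bitmask decodes cleanly if one assumes that bit N of the mask represents SM 75+N; that mapping is my assumption, only the constant comes from the binary:

```python
ARCH_MASK = 0x60081200F821  # validation mask from the table above

def supported_sms(mask: int, base: int = 75):
    # Assumed encoding: bit N set means SM (base + N) passes validation.
    return [base + bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

assert supported_sms(ARCH_MASK) == [75, 80, 86, 87, 88, 89, 90,
                                    100, 103, 110, 120, 121]
```

Under this reading the mask accepts sm_75, sm_80, sm_86-sm_90, sm_100, sm_103, sm_110, sm_120, and sm_121 -- but it also sets a bit where sm_88 would sit and none for sm_101, so the one-bit-per-SM interpretation should be treated as approximate.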
Architecture
main (0x4396A0, 16B thunk)
│
└─ sub_8F9C90 (10KB, REAL MAIN)
│
├─ getenv("NVVMCCWIZ") == 553282 → wizard mode
├─ sub_16C5290: extract program name from argv[0]
│
├─ ARGUMENT LOOP (v15 = 1..argc)
│ ├─ -o <file> → v257 (output)
│ ├─ -nvvmir-library <path> → v256 (libdevice)
│ ├─ -lgenfe/-libnvvm/-lnk/-opt/-llc → v263 (mode)
│ ├─ -arch/-mcpu/--nv_arch → v242 (SM number)
│ ├─ --emit-optix-ir → v243=1, v258=1
│ ├─ -nvc → v258=1
│ ├─ -irversion → print IR version, exit
│ ├─ .bc/.ci/.i/.ii/.cup/.optixir → s (input file)
│ └─ obfuscated option → v253 (0 or 1)
│
├─ v253 RESOLUTION (if still == 2)
│ └─ getenv(obfuscated) → compare → set v253 = 0 or 1
│
├─ DISPATCH (v263 × v253)
│ ├─ v263==0, v253==1 → sub_902D10 (simple Path A)
│ ├─ v263==0, v253==0 → sub_1262860 (simple Path B)
│ ├─ v263==1 → sub_905E50 / sub_12658E0 (lgenfe)
│ ├─ v263≥2, v253==1 → sub_905EE0 (multi-stage Path A)
│ └─ v263≥2, v253==0 → sub_1265970 (multi-stage Path B)
│
└─ CLEANUP: free all vectors, strings, argv copy
Real Main — sub_8F9C90
The exported main() at 0x4396A0 is a 16-byte thunk that immediately tail-calls sub_8F9C90 — the actual entry point. This function is a monolithic CLI parser and dispatcher: it copies argv into a local buffer, checks for wizard mode, iterates over all arguments accumulating state in ~12 local variables, resolves the compilation path, and finally dispatches to the appropriate pipeline function. The entire function is a single 10KB basic-block-heavy control flow graph with ~80 branch targets.
| Field | Value |
|---|---|
| Address | 0x8F9C90–0x8FC3E2 |
| Size | 10,066 bytes |
| Stack frame | 0x978 (2,424) bytes |
| Local buffers | v284[2096] for argv copy (stack if argc ≤ 256, else heap) |
Argument Handling and Argv Copy
The function begins with a defensive copy of argv into a local buffer. When 8 * argc fits within 0x800 bytes (argc ≤ 256), the copy lives in v284[2096] on the stack. For larger argument lists -- which can occur during complex nvcc invocations with many pass-through flags -- it allocates heap memory via sub_16CD150. This copy is necessary because the argument loop modifies pointers (advancing i to skip flag values), and the caller's argv must not be disturbed.
if (8 * argc > 0x800)
v284 = sub_16CD150(8 * argc); // heap alloc for large argc
// else use stack buffer v284[2096]
memcpy(v284, argv, 8 * argc); // copy all pointers
After copying, sub_16C5290 extracts the base program name from argv[0] -- stripping directory prefixes -- and stores it in dest. This name appears in error messages and verbose output throughout the pipeline.
Key Local Variables
The function's behavior is controlled by two critical dispatch variables: v253 (which compilation backend to use) and v263 (which phase of the pipeline to invoke). These are accumulated during the argument loop and combined after parsing to select one of ~10 possible code paths. The interaction between them creates a matrix of behaviors that covers everything from simple single-file compilation to multi-stage LibNVVM pipeline processing.
| Variable | Init | Purpose |
|---|---|---|
| v253 | 2 | Dispatch mode: 0=Path B, 1=Path A, 2=default (needs env resolution) |
| v263 | 0 | Invocation mode: 0=default, 1=lgenfe, 2=libnvvm, 3=lnk, 4=opt, 6=llc |
| v242 | 0 | Target architecture (SM number) |
| v258 | 0 | NVC flag |
| v243 | 0 | OptiX IR flag |
| v259 | 0 | Verbose (only effective in wizard mode) |
| v261 | 0 | Dryrun |
| v262 | 0 | Keep intermediates (only effective in wizard mode) |
| s | NULL | Input file path |
| v257 | NULL | Output file path |
| v256 | NULL | NVVM IR library path |
| v266 | vector | Pass-through options vector |
Wizard Mode
v10 = getenv("NVVMCCWIZ"); // 0x8F9D36
if (v10 && strtol(v10, NULL, 10) == 553282) // 0x8F9D92
byte_4F6D280 = 1;
Global byte_4F6D280 gates the effectiveness of -v, -keep, -dryrun. Without wizard mode, these flags are silently ignored — v259 and v262 stay 0. This is a deliberate anti-reverse-engineering measure: even if someone discovers the -v flag, it does nothing without the magic environment variable. The magic number 553282 (0x87142) appears to be arbitrary.
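In Python terms the gate is simply the following; the variable name and magic value come from the decompiled check above, while the strict int() parse is slightly narrower than C's strtol (which tolerates trailing garbage):

```python
import os

def wizard_mode_enabled() -> bool:
    # Mirrors: getenv("NVVMCCWIZ") followed by strtol(value, NULL, 10) == 553282.
    value = os.environ.get("NVVMCCWIZ")
    if value is None:
        return False
    try:
        return int(value, 10) == 553282
    except ValueError:
        # strtol would return a partial parse here; int() rejects outright.
        return False
```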
Invocation Modes (v263)
The v263 variable determines which stage of the compilation pipeline cicc enters. When nvcc invokes cicc directly, v263 stays at 0 (default). But cicc can also be invoked in sub-pipeline mode — for example, -lnk runs only the linking phase, -opt runs only the optimizer, and -llc runs only code generation. This is how the multi-stage pipeline works: the outer driver calls cicc multiple times with different -lXXX flags, or a single invocation with -libnvvm runs all stages internally.
Each mode has its own format for the -discard-value-names flag, which tells the LLVM backend whether to strip IR value names (reducing memory usage). The different formats exist because each sub-pipeline stage has its own option namespace:
| v263 | Flag | Mode | discard-value-names format |
|---|---|---|---|
| 0 | (none) | Default (nvcc invocation) | -discard-value-names |
| 1 | -lgenfe | EDG frontend linkage | --discard_value_names=1 (underscores) |
| 2 | -libnvvm | LibNVVM API | -discard-value-names=1 (dashes) |
| 3 | -lnk | Linker | -lnk-discard-value-names=1 |
| 4 | -opt | Optimizer | -opt-discard-value-names=1 |
| 5 | (internal) | Undocumented (sets v278 high byte) | — |
| 6 | -llc | Standalone LLVM codegen | — |
Input File Extensions
Input files are identified by extension during the argument loop. The last matching file wins (s is overwritten each time). Unrecognized arguments are added to the v266 pass-through vector and forwarded to sub-pipelines. The .cup extension has a special restriction — it's only accepted when the preceding argument is --orig_src_path_name or --orig_src_file_name, which are metadata flags inserted by nvcc to track the original source file.
| Extension | Format | Condition |
|---|---|---|
| .bc | LLVM bitcode | Always accepted |
| .ci | CUDA intermediate (preprocessed) | Always accepted |
| .i | Preprocessed C/C++ | Always accepted |
| .ii | Preprocessed C++ | Always accepted |
| .cup | CUDA source | Only after --orig_src_path_name or --orig_src_file_name |
| .optixir | OptiX IR | Always accepted |
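The extension handling, including the .cup guard, can be sketched as follows; the names are mine, the rules come from the table:

```python
ALWAYS_ACCEPTED = (".bc", ".ci", ".i", ".ii", ".optixir")
CUP_GUARDS = ("--orig_src_path_name", "--orig_src_file_name")

def is_input_file(arg, prev_arg=None):
    # .cup is only treated as an input file when the preceding argument is one
    # of nvcc's source-tracking metadata flags.
    if arg.endswith(".cup"):
        return prev_arg in CUP_GUARDS
    return arg.endswith(ALWAYS_ACCEPTED)

assert is_input_file("kernel.bc")
assert not is_input_file("kernel.cup", "-o")
assert is_input_file("kernel.cup", "--orig_src_path_name")
```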
Obfuscated Strings
At 0x8F98A0, sub_8F98A0 decrypts strings using an XOR + ROT13-like cipher:
v40 = v37 ^ (-109 * ((offset + 97) ^ 0xC5));
// then ROT13 on alphabetic characters
This hides an environment variable name and option prefix from static analysis. The decrypted strings control the v253 (Path A vs Path B) resolution when no explicit mode is specified.
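A speculative Python rendering of the two-stage cipher. The keystream formula is transcribed from the decompilation, but the byte ordering and offset convention are assumptions; the sketch just renders both stages and verifies that they invert (using the NV_NVVM_VERSION name that later resolution logic decrypts):

```python
import codecs

def keystream_byte(offset: int) -> int:
    # Transcribed from: v37 ^ (-109 * ((offset + 97) ^ 0xC5)), truncated to a byte.
    return (-109 * ((offset + 97) ^ 0xC5)) & 0xFF

def decrypt(data: bytes) -> str:
    # Stage 1: XOR each byte with a position-dependent keystream byte.
    xored = bytes(b ^ keystream_byte(i) for i, b in enumerate(data))
    # Stage 2: ROT13 on alphabetic characters.
    return codecs.decode(xored.decode("latin-1"), "rot_13")

def encrypt(text: str) -> bytes:
    # Inverse, for round-trip testing: ROT13 first, then the same XOR stage.
    rotated = codecs.encode(text, "rot_13").encode("latin-1")
    return bytes(b ^ keystream_byte(i) for i, b in enumerate(rotated))

assert decrypt(encrypt("NV_NVVM_VERSION")) == "NV_NVVM_VERSION"
```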
Error Messages
| Message | Condition | Address |
|---|---|---|
| "Missing output file\n" | -o with no next argument | 0x8FA365 |
| "Missing NVVM IR library file\n" | -nvvmir-library with no next arg | 0x8FAB34 |
| "Unparseable architecture: " + value | Invalid arch string | Multiple |
| "Missing input file\n" | No recognized input file | 0x8FBEAD |
| "Recognized input file extensions are: .bc .ci .i .cup .optixir" | After missing input | 0x8FBE97 |
| "Error: Output file was not specified (See -o option).\n" | Multi-stage without -o | 0x8FB655 |
The v253 Dispatch Variable
The v253 variable is the single most important dispatch control in the entire entry point. It determines whether the compilation uses Path A (the EDG/PTX-producing pipeline) or Path B (the standalone LLVM-based pipeline). Understanding its resolution logic is essential to reproducing cicc's behavior.
Initialization and Explicit Setting
v253 begins at 2 (unresolved default). During the argument loop, obfuscated string matching can set it directly:
| Source | Value | Meaning |
|---|---|---|
| Initial default | 2 | Needs environment variable resolution |
| Obfuscated option suffix matches byte_3C23AC3 | 1 | Path A explicitly requested |
| Obfuscated option suffix matches byte_3C23AB4 | 0 | Path B explicitly requested |
Environment Variable Resolution
When v253 remains at 2 after argument parsing (the common case), cicc resolves it through the obfuscated environment variable NV_NVVM_VERSION (decrypted from byte_3C23A9F). The resolution has two sub-cases depending on the target architecture:
if (v253 == 2) {
env = getenv(decrypt(byte_3C23A9F)); // NV_NVVM_VERSION
if (env matches decrypt(byte_3C23A82)) // "nvvm-latest"
v253 = 1; // Path A
else if (env matches decrypt(byte_3C23A7B)) // "nvvm70"
v253 = 0; // Path B
else if (v242 > 99 && !v258) // SM >= 100, not -nvc
v253 = 0; // Path B (new architectures default to standalone)
else
v253 = 1; // Path A (legacy default)
}
The architectural threshold at SM 100 (Blackwell) is notable: for SM < 100, the default is Path A (the EDG frontend path). For SM >= 100, unless the -nvc flag is present, the default switches to Path B. This suggests NVIDIA is migrating newer architectures toward the standalone LLVM pipeline, possibly as a precursor to eventually deprecating the EDG-based path.
Version Strings Injected per Path
After v253 is resolved and for multi-stage modes (v263 >= 3), the entry point injects a version string into the pass-through options:
| v253 | Injected string | Semantics |
|---|---|---|
| 1 (Path A) | "-nvvm-version=nvvm-latest" (25 bytes from xmmword_3C23BC0) | Targets the latest NVVM IR specification |
| 0 (Path B) | "-nvvm-version=nvvm70" (20 bytes) | Targets NVVM 7.0 IR (frozen at LLVM 7.0.1 bitcode format) |
This version string propagates through the entire pipeline, controlling bitcode compatibility, intrinsic name resolution, and metadata format expectations.
Post-Parse Dispatch Logic
After the argument loop terminates, the dispatch logic combines v253 and v263 to select the target function. The combined keep-and-verbose flag v260 = v262 & v259 is also computed -- both wizard-mode flags must be active for intermediate file retention and verbose logging to function simultaneously.
Simple Dispatch (v263 == 0)
When cicc is invoked without any -lXXX mode flag (the standard nvcc invocation path):
if (v253 == 1)
v8 = sub_902D10(dest, 0, &v266, s, v257, v256, v260, v262, v261);
// Path A: CLI → lgenfe → LibNVVM pipeline
else
v8 = sub_1262860(dest, 0, &v266, s, v257, v256, v260, v262, v261);
// Path B: CLI → standalone LLVM pipeline
Both functions receive identical parameter signatures: program name, zero (unused), pass-through options, input file, output file, libdevice path, verbose+keep, keep, and dryrun. The return value becomes the process exit code.
lgenfe Dispatch (v263 == 1)
The -lgenfe mode builds a full argv-style array with the program name as the first entry, followed by all v266 pass-through options. This argv is then passed to one of two function pairs:
| v253 | Init function | Pipeline function |
|---|---|---|
| 1 (Path A) | sub_B6EEA0 (LLVMContext + metadata kind registration) | sub_905880 (EDG lgenfe) |
| 0 (Path B) | sub_1602D10 (standalone context initialization) | sub_1265340 (standalone lgenfe) |
The init functions create the LLVM context and register the 42+ metadata kinds used throughout the pipeline (dbg, tbaa, prof, noalias, etc.). These must be registered before any IR construction begins.
Multi-Stage Dispatch (v263 >= 2)
For -libnvvm, -lnk, -opt, and -llc modes, the dispatch constructs a CompilationState structure with input/output strings, extra arguments, and the v278 mode byte, then calls:
| v253 | Function | Size | Role |
|---|---|---|---|
| 1 | sub_905EE0 | 43 KB | Path A multi-stage pipeline driver |
| 0 | sub_1265970 | 48 KB | Path B multi-stage pipeline driver |
For -libnvvm (v263 == 2), the extra args are taken directly from v266 without prepending the program name. For -lnk/-opt/-llc (v263 >= 3), the appropriate version string (nvvm-latest or nvvm70) is appended to the pass-through options before dispatch.
Cleanup
After the pipeline function returns, sub_8F9C90 performs deterministic cleanup in reverse allocation order: the v281 extra-argument char** array and each entry, the v275 output string, the s2 input string, each element of the v266 pass-through vector, the vector's backing buffer, the dest program name, and the v282 argv copy buffer (if heap-allocated). The return value v8 is 0 on success, 1 on argument errors, or the pipeline function's return code (stored in v264).
Path A — EDG → LibNVVM Pipeline
Path A is the full CUDA C++ compilation path. It starts with the EDG 6.6 C++ frontend parsing CUDA source code into an IL tree, then converts that IL into LLVM IR via the lgenfe (LLVM Generation Front End) stage, and finally runs the LibNVVM pipeline to optimize and lower the IR to PTX. This is the path taken when cicc is invoked by nvcc for .cu file compilation, and it represents the standard CUDA compilation flow that most users encounter.
Path A Orchestrator — sub_902D10
The orchestrator is a 9 KB function that sequences the three major stages of Path A compilation. It acts as the conductor between the CLI processing layer, the EDG frontend, and the LibNVVM optimizer/codegen.
| Field | Value |
|---|---|
| Address | 0x902D10 |
| Size | ~9 KB |
| Timer | Creates 8-byte timer via sub_22077B0 → sub_B6EEA0 |
Execution flow:
1. Timer creation. Allocates and initializes an 8-byte timing context. The sub_B6EEA0 init function also registers the 42+ LLVM metadata kinds (dbg=1, tbaa=2, prof=3, ... noalias.addrspace=42) that all subsequent IR construction depends on. This is why timer creation happens first: the metadata registration is a side effect of context initialization.
2. CLI processing. Calls sub_900130 (39 KB) to parse the accumulated CLI flags into structured forms: command buffer v58, emit-llvm-bc flag v52, architecture compute/SM numbers v55/v56, and file paths. On failure: "Error processing command line: <cmd>\n".
3. Include path setup. If an input file is present (v64), calls sub_C98ED0 to configure system and user include paths for the EDG frontend.
4. EDG frontend (lgenfe). Calls sub_905880 with timer name "CUDA C++ Front-End". This stage:
   - Allocates an 880-byte module object via sub_BA8740
   - Processes lgenfe CLI options from the options struct
   - In dryrun mode: skips execution, frees the module, returns null
   - On success: returns a module pointer and sets the output path
5. LibNVVM pipeline. If lgenfe succeeds (module pointer is non-null), calls sub_905EE0 with the module for the full optimization and codegen pipeline.
6. Time profiler output. After pipeline completion, checks sub_C96F30() for active profiling. If profiling is enabled, writes timing data to the output file via sub_C9C600. Failure emits: "Error: Failed to write time profiler data.\n".
7. Cleanup. Frees the timer (sub_B6E710), option strings, and option arrays.
EDG Frontend Stage — sub_905880
The lgenfe stage bridges the EDG 6.6 C++ frontend to LLVM IR generation. This is where CUDA C++ source code becomes NVVM IR.
| Field | Value |
|---|---|
| Address | 0x905880 |
| Size | ~6 KB |
| Timer label | "CUDA C++ Front-End" |
| Module size | 880 bytes (allocated by sub_BA8740) |
The function reconstructs a verbose command line for diagnostic output (quoting paths for --orig_src_file_name, --orig_src_path_name, --compiler_bindir, --sdk_dir), builds an argument array, and calls sub_908750(numArgs, argArray, opt_level) to create the LLVM module. On success, it copies the output path into the module at offset 21*8 and, if the keep flag is set via a3->byte[66], calls sub_905860 to write intermediate files.
The actual EDG parsing and IL-to-IR conversion happens inside sub_908750, which eventually calls sub_617BD0 — the lgenfe_main function documented in the EDG Frontend page.
EDG Module Binding — sub_908850
After the EDG frontend produces its IL tree, sub_908850 (10 KB) bridges the output to the LLVM backend. This function performs the critical step of configuring the LLVM module's data layout and target triple based on the target architecture.
Data layout strings are selected based on unk_4F06A68 (address space width):
| Width | p3 flag | Data layout string |
|---|---|---|
| 8 (64-bit) | unk_4D0461C set | "e-p:64:64:64-p3:32:32:32-i1:8:8-..." (167 chars) |
| 8 (64-bit) | Not set | "e-p:64:64:64-i1:8:8-..." (155 chars) |
| 4 (32-bit) | — | "e-p:32:32:32-i1:8:8-..." (155 chars) |
The p3:32:32:32 component enables 32-bit pointers in address space 3 (shared memory), which is critical for SM architectures where shared memory accesses use 32-bit addressing even in 64-bit compilation mode.
Target triple is set to "nvptx64-nvidia-cuda" for 64-bit or "nvptx-nvidia-cuda" for 32-bit. The function also:
- Creates a 496-byte target info structure via sub_AE3F70
- Iterates global function declarations, marking device functions for compilation via sub_91CA00
- Iterates global variables, processing initializers for device-side storage via sub_9172F0
- Runs LLVM module verification via sub_B89FE0 -- on failure: "there was an error in verifying the lgenfe output!"
- Stores the module globally at unk_4F6D2F8
LibNVVM Pipeline Driver — sub_905EE0
This 43 KB function is the core of Path A. It orchestrates the full compilation through 14 sequential phases, using an interesting indirection mechanism: rather than calling LibNVVM API functions directly, it resolves them at runtime through sub_12BC0F0(id) — a dispatch function that takes a numeric ID and returns a function pointer.
| Field | Value |
|---|---|
| Address | 0x905EE0 |
| Size | 43 KB (1,268 lines) |
| Timer | "LibNVVM" |
| Orchestrator | sub_902D10 (simple mode) |
14-Phase Compilation Flow
The compilation proceeds through these phases sequentially. Phases 2.1–2.14 are the core compilation unit lifecycle: create, populate, configure, compile, extract results, destroy. The -keep flag (when wizard mode is active) causes intermediate .lnk.bc and .opt.bc files to be written to disk, which is invaluable for debugging the pipeline.
| Phase | Action |
|---|---|
| 0 | Verbose command-line reconstruction |
| 1 | Input file handling — NVVM container (sub_9047E0) or regular bitcode |
| 2.1 | Pipeline infrastructure: verbose stream, address space query, module prep |
| 2.2 | nvvmCreateCU (dispatch ID 2151) |
| 2.3 | nvvmCUAddModule (dispatch ID 4660) |
| 2.4 | Check for -opt/-llc direct stage invocation |
| 2.5 | Parse and add additional modules from extra args |
| 2.6 | Libdevice linking: external file or embedded unk_3EA0080 (455,876 bytes) |
| 2.7 | Intermediate file path construction (.lnk.bc, .opt.bc) |
| 2.8 | Option parsing via sub_9685E0 → 4 stage vectors (lnk/opt/lto/llc) |
| 2.9 | Verbose stage logging + callback registration |
| 2.10 | Load 37 LLVM options from off_4B90FE0 via nvvmSetOptionStrings (ID 0xFEED) |
| 2.11 | nvvmCUCompile (ID 0xBEAD) with phase code 57069 (0xDEED) |
| 2.12 | Get compilation log/warnings |
| 2.13 | Get PTX output |
| 2.14 | nvvmDestroyCU (ID 21257 = 0x5309) |
| 3 | Error/warning output to caller |
| 4 | Write output file (text or binary detection via ELF magic) |
| 5 | Timer stop |
Input File Handling — Phase 1
Phase 1 has two sub-paths based on the a3->byte[65] container flag:
Path 1A: NVVM IR Container. When the input is an NVVM container (a binary format wrapping IR plus compilation options), sub_9047E0 (10 KB) parses it. The container format encodes the target SM version, FTZ mode, precision settings, and IEEE mode. The parser extracts these and converts them to LLVM CLI flags:
// Pseudo-code for container option extraction
push("-march=nvptx");
push("-mcpu=sm_" + str(container->sm_version / 10));
if (container->flags[200] & 0x20) push("-nvptx-f32ftz");
if (container->flags[200] & 0x80) push("-nvptx-prec-sqrtf32=1");
else push("-nvptx-prec-sqrtf32=0");
push(container->flags[204] ? "-nvvm-ieee-mode=S" : "-nvvm-ieee-mode=T");
if (container->mode == 2) push("--device-c"); // relocatable compilation
If parsing fails, the error message is "Invalid NVVM IR Container" (error code 259).
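The extraction logic above can be sketched as a runnable C fragment. The struct below is an assumed simplification that mirrors the decompiled offsets (flags[200], flags[204], mode), not the container's verified binary layout:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container view -- field positions follow the decompiled
 * pseudo-code above, not a verified on-disk layout. */
typedef struct {
    unsigned sm_version;      /* e.g. 750 for sm_75 */
    unsigned char flags[256];
    int mode;                 /* 2 == relocatable (--device-c) */
} nvvm_container;

/* Emit the LLVM CLI flags derived from the container, as in Path 1A.
 * Returns the number of options written into out[]. */
static int container_to_flags(const nvvm_container *c, char out[8][64]) {
    int n = 0;
    strcpy(out[n++], "-march=nvptx");
    snprintf(out[n++], 64, "-mcpu=sm_%u", c->sm_version / 10);
    if (c->flags[200] & 0x20) strcpy(out[n++], "-nvptx-f32ftz");
    strcpy(out[n++], (c->flags[200] & 0x80) ? "-nvptx-prec-sqrtf32=1"
                                            : "-nvptx-prec-sqrtf32=0");
    strcpy(out[n++], c->flags[204] ? "-nvvm-ieee-mode=S"
                                   : "-nvvm-ieee-mode=T");
    if (c->mode == 2) strcpy(out[n++], "--device-c");
    return n;
}
```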
Path 1B: Regular LLVM bitcode. For raw .bc files, the function creates a timer object, configures the SM architecture via sub_B6F950, opens the file via sub_C7EAD0, and parses it into an LLVM module via sub_A01950.
LibNVVM API Dispatch IDs
Internal function sub_12BC0F0(id) returns API function pointers by numeric ID. This indirection exists because the LibNVVM API is implemented within the same binary — these aren't dynamically-linked external functions but rather internal call points resolved through a dispatch table. The hex IDs double as a form of internal documentation:
| ID | Hex | Function |
|---|---|---|
| 2151 | 0x0867 | nvvmCreateCU |
| 4111 | 0x100F | nvvmGetCompiledResult |
| 4660 | 0x1234 | nvvmCUAddModule |
| 17185 | 0x4321 | nvvmCUSetExtraArgs |
| 21257 | 0x5309 | nvvmDestroyCU |
| 41856 | 0xA380 | nvvmGetCompiledResultSize (returns the log size) |
| 46903 | 0xB737 | nvvmGetCompiledResultLog |
| 46967 | 0xB777 | nvvmGetErrorString |
| 48813 | 0xBEAD | nvvmCUCompile |
| 48879 | 0xBEEF | Callback registrar |
| 61451 | 0xF00B | nvvmGetCompiledResultPTXSize |
| 62298 | 0xF37A | nvvmCUAddModuleFromBuffer |
| 65261 | 0xFEED | nvvmSetOptionStrings |
The complete dispatch table in sub_12BC0F0 contains 26 ID entries (25 distinct targets, since ID 2167 aliases 2151) implemented as a binary search tree on the ID value:
| ID | Hex | Target | Semantic Name |
|---|---|---|---|
| 2151 | 0x0867 | sub_12BB090 | nvvmCreateCU |
| 2167 | 0x0877 | sub_12BB090 | (alias) |
| 3911 | 0x0F47 | sub_12BBF40 | nvvmCUSetProgressCallback |
| 4111 | 0x100F | sub_12BA8F0 | nvvmGetCompiledResult |
| 4606 | 0x11FE | sub_12BA330 | nvvmCULinkModule |
| 4660 | 0x1234 | sub_12BC650 | nvvmCUAddModule |
| 8320 | 0x2080 | sub_12BB400 | nvvmCUSetOption |
| 11245 | 0x2BED | sub_12BB290 | nvvmCUGetLog |
| 17185 | 0x4321 | sub_12BBD80 | nvvmCUSetExtraArgs |
| 21257 | 0x5309 | sub_12B9C40 | nvvmDestroyCU |
| 23294 | 0x5AFE | sub_12BAF10 | nvvmVerify |
| 41856 | 0xA380 | sub_12BA220 | nvvmGetCompiledResultSize |
| 45242 | 0xB0BA | sub_12BAB40 | nvvmCUGetWarnings |
| 46903 | 0xB737 | sub_12BA7C0 | nvvmGetCompiledResultLog |
| 46967 | 0xB777 | sub_12B9980 | nvvmGetErrorString |
| 48813 | 0xBEAD | sub_12BA110 | nvvmCUCompile |
| 48879 | 0xBEEF | sub_12BACF0 | nvvmCURegisterCallback |
| 49522 | 0xC172 | sub_12BA470 | nvvmCUGetIR |
| 51966 | 0xCAFE | sub_12B9A50 | nvvmGetVersion |
| 56495 | 0xDCEF | sub_12B9A40 | (unknown) |
| 57005 | 0xDEAD | sub_12B9C00 | nvvmInit |
| 61451 | 0xF00B | sub_12BA560 | nvvmGetCompiledResultPTXSize |
| 61453 | 0xF00D | sub_12BA6A0 | nvvmCURegisterLNKCallback |
| 61806 | 0xF16E | sub_12BAA30 | nvvmCUGetOptIR |
| 62298 | 0xF37A | sub_12BC8B0 | nvvmCUAddModuleFromBuffer |
| 65261 | 0xFEED | sub_12B9AB0 | nvvmSetOptionStrings |
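The hardcoded comparison tree is equivalent to a binary search over a sorted ID table. The sketch below models it as data rather than code, mapping each ID to its recovered semantic name (handler addresses omitted); this is a reconstruction for illustration, not the binary's actual implementation:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sorted (by ID) mirror of the sub_12BC0F0 dispatch table. */
typedef struct { unsigned id; const char *name; } dispatch_entry;

static const dispatch_entry kDispatch[] = {
    {2151, "nvvmCreateCU"},      {2167, "nvvmCreateCU"}, /* alias ID */
    {3911, "nvvmCUSetProgressCallback"}, {4111, "nvvmGetCompiledResult"},
    {4606, "nvvmCULinkModule"},  {4660, "nvvmCUAddModule"},
    {8320, "nvvmCUSetOption"},   {11245, "nvvmCUGetLog"},
    {17185, "nvvmCUSetExtraArgs"}, {21257, "nvvmDestroyCU"},
    {23294, "nvvmVerify"},       {41856, "nvvmGetCompiledResultSize"},
    {45242, "nvvmCUGetWarnings"}, {46903, "nvvmGetCompiledResultLog"},
    {46967, "nvvmGetErrorString"}, {48813, "nvvmCUCompile"},
    {48879, "nvvmCURegisterCallback"}, {49522, "nvvmCUGetIR"},
    {51966, "nvvmGetVersion"},   {56495, "(unknown)"},
    {57005, "nvvmInit"},         {61451, "nvvmGetCompiledResultPTXSize"},
    {61453, "nvvmCURegisterLNKCallback"}, {61806, "nvvmCUGetOptIR"},
    {62298, "nvvmCUAddModuleFromBuffer"}, {65261, "nvvmSetOptionStrings"},
};

static int cmp_id(const void *k, const void *e) {
    unsigned id = *(const unsigned *)k;
    unsigned eid = ((const dispatch_entry *)e)->id;
    return (id > eid) - (id < eid);
}

/* Binary search, equivalent to the hardcoded compare-and-branch tree;
 * returns NULL for an unknown ID, as sub_12BC0F0 does. */
static const char *nvvm_dispatch_name(unsigned id) {
    const dispatch_entry *e = bsearch(&id, kDispatch,
        sizeof kDispatch / sizeof kDispatch[0], sizeof kDispatch[0], cmp_id);
    return e ? e->name : NULL;
}
```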
Public LibNVVM API vs Internal CU API
The dispatch table above reveals a critical architectural detail: cicc's internal API uses compilation unit semantics (nvvmCreateCU, nvvmCUAddModule, nvvmCUCompile), while the public LibNVVM shared library (libnvvm.so) exports a different API surface using program semantics (nvvmCreateProgram, nvvmAddModuleToProgram, nvvmCompileProgram). The public API is documented in NVIDIA's nvvm.h header; the internal API exists only within cicc and is never exported.
Evidence for this mapping comes from nvlink's -dlto code path, which dynamically loads libnvvm.so via dlsym() and resolves symbols by their public names:
// nvlink sub_4BC290 — loads libnvvm.so for device LTO
dlsym(handle, "nvvmCreateProgram"); // → internally nvvmCreateCU
dlsym(handle, "nvvmCompileProgram"); // → internally nvvmCUCompile
dlsym(handle, "nvvmGetCompiledResultSize");
dlsym(handle, "nvvmGetCompiledResult");
dlsym(handle, "nvvmDestroyProgram"); // → internally nvvmDestroyCU
The complete mapping between the public libnvvm.so API (as used by external callers like nvlink and user programs) and cicc's internal CU dispatch IDs:
| Public API (libnvvm.so) | Internal Name | Dispatch ID | Hex | Target |
|---|---|---|---|---|
| nvvmCreateProgram | nvvmCreateCU | 2151 | 0x0867 | sub_12BB090 |
| nvvmAddModuleToProgram | nvvmCUAddModule | 4660 | 0x1234 | sub_12BC650 |
| nvvmLazyAddModuleToProgram | nvvmCUAddModuleFromBuffer | 62298 | 0xF37A | sub_12BC8B0 |
| nvvmCompileProgram | nvvmCUCompile | 48813 | 0xBEAD | sub_12BA110 |
| nvvmVerifyProgram | nvvmVerify | 23294 | 0x5AFE | sub_12BAF10 |
| nvvmGetCompiledResultSize | nvvmGetCompiledResultPTXSize | 61451 | 0xF00B | sub_12BA560 |
| nvvmGetCompiledResult | nvvmGetCompiledResult | 4111 | 0x100F | sub_12BA8F0 |
| nvvmGetProgramLogSize | nvvmGetCompiledResultSize | 41856 | 0xA380 | sub_12BA220 |
| nvvmGetProgramLog | nvvmGetCompiledResultLog | 46903 | 0xB737 | sub_12BA7C0 |
| nvvmDestroyProgram | nvvmDestroyCU | 21257 | 0x5309 | sub_12B9C40 |
Note the naming confusion in the internal API: nvvmGetCompiledResultSize (ID 0xA380) returns the log size, while nvvmGetCompiledResultPTXSize (ID 0xF00B) returns the actual PTX output size. The public API resolves this with clearer names (nvvmGetProgramLogSize vs nvvmGetCompiledResultSize).
The internal-only API entries have no public equivalents:
| Internal Name | Dispatch ID | Hex | Target | Purpose |
|---|---|---|---|---|
| nvvmInit | 57005 | 0xDEAD | sub_12B9C00 | One-time initialization of LLVM infrastructure |
| nvvmGetVersion | 51966 | 0xCAFE | sub_12B9A50 | Returns internal NVVM version tuple |
| nvvmGetErrorString | 46967 | 0xB777 | sub_12B9980 | Maps nvvmResult code to human-readable string |
| nvvmSetOptionStrings | 65261 | 0xFEED | sub_12B9AB0 | Bulk-loads LLVM CLI option table (37 entries) |
| nvvmCUSetExtraArgs | 17185 | 0x4321 | sub_12BBD80 | Passes additional argc/argv to compilation |
| nvvmCUSetOption | 8320 | 0x2080 | sub_12BB400 | Sets a single compilation option |
| nvvmCUSetProgressCallback | 3911 | 0x0F47 | sub_12BBF40 | Registers progress/cancellation callback |
| nvvmCURegisterCallback | 48879 | 0xBEEF | sub_12BACF0 | Registers stage-boundary callback (verbose output) |
| nvvmCURegisterLNKCallback | 61453 | 0xF00D | sub_12BA6A0 | Registers LNK-stage-specific callback |
| nvvmCUGetLog | 11245 | 0x2BED | sub_12BB290 | Alternative log retrieval interface |
| nvvmCUGetWarnings | 45242 | 0xB0BA | sub_12BAB40 | Retrieves warning-only messages |
| nvvmCUGetIR | 49522 | 0xC172 | sub_12BA470 | Retrieves intermediate LLVM IR after linking |
| nvvmCUGetOptIR | 61806 | 0xF16E | sub_12BAA30 | Retrieves optimized IR (post-OPT stage); also used by -irversion |
| nvvmCULinkModule | 4606 | 0x11FE | sub_12BA330 | Explicit module linking (separate from add-then-compile) |
| (unknown) | 56495 | 0xDCEF | sub_12B9A40 | Unknown (its target immediately precedes nvvmGetVersion's sub_12B9A50) |
| (alias) | 2167 | 0x0877 | sub_12BB090 | Alias for nvvmCreateCU (same target, different ID) |
The nvvmCUGetOptIR function at sub_12BAA30 serves double duty: it is both the post-optimization IR retrieval API and the target of sub_12BC0E0 (a thunk called from sub_8F9C90 for the -irversion flag). When the user passes -irversion, the real main calls sub_12BC0E0 which dispatches to sub_12BAA30, which returns the IR version tuple as major * 100 + minor. This value is printed to stdout and the process exits immediately.
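The tuple-to-integer collapse used for -irversion can be written out as a minimal sketch, assuming only the major * 100 + minor encoding described above:

```c
#include <assert.h>

/* -irversion reporting: the version tuple returned through sub_12BAA30 is
 * collapsed to a single integer as major * 100 + minor before printing. */
static int encode_ir_version(int major, int minor) {
    return major * 100 + minor;
}

/* Inverse, for reading the printed value back into a tuple. */
static void decode_ir_version(int code, int *major, int *minor) {
    *major = code / 100;
    *minor = code % 100;
}
```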
The sub_12BC0F0 Dispatch Mechanism
sub_12BC0F0 is a ~3 KB function at 0x12BC0F0 that implements a binary search tree over the 26 dispatch IDs. The function takes a single unsigned int argument (the ID) and returns a function pointer (void*). The tree is hardcoded as a series of comparison-and-branch instructions, not as a data-driven lookup table.
// Pseudocode for sub_12BC0F0(unsigned int id)
void* nvvm_dispatch(unsigned int id) {
// Binary search over 26 IDs
if (id < 17185) {
if (id < 4660) {
if (id == 2151 || id == 2167) return sub_12BB090;
if (id == 3911) return sub_12BBF40;
if (id == 4111) return sub_12BA8F0;
if (id == 4606) return sub_12BA330;
} else {
if (id == 4660) return sub_12BC650;
if (id == 8320) return sub_12BB400;
if (id == 11245) return sub_12BB290;
}
} else {
// ... upper half of the tree
if (id == 48813) return sub_12BA110; // 0xBEAD
if (id == 65261) return sub_12B9AB0; // 0xFEED
// etc.
}
return NULL; // unknown ID
}
The hex IDs are deliberately memorable patterns used as a form of internal documentation: 0xDEAD = init, 0xBEAD = compile, 0xBEEF = callback, 0xCAFE = version, 0xFEED = options, 0xF00D = LNK callback, 0xF00B = result size. The secondary ID 0x0877 (2167) is an alias for 0x0867 (2151) and dispatches to the same sub_12BB090 target, suggesting an internal API version migration where both old and new IDs must remain functional.
Dual-Path Initialization
The two compilation paths (Path A and Path B) use independent initialization sequences, creating a dual-path initialization architecture where the same underlying LLVM infrastructure is bootstrapped through different entry points. This is why two copies of libdevice, two LLVM options tables, and two sets of verbose callbacks exist.
Path A initialization (EDG → LibNVVM):
sub_B6EEA0 — Creates LLVMContext + registers 42+ metadata kinds
(dbg=1, tbaa=2, prof=3, ... noalias.addrspace=42)
sub_900130 — 39 KB CLI parser for Path A flags
sub_905880 — EDG frontend produces LLVM module (880-byte object)
sub_908850 — Binds module to target: data layout, triple, verification
→ sub_905EE0 enters LibNVVM pipeline with module
Path B initialization (Standalone):
sub_1602D10 — Creates standalone LLVMContext (no EDG metadata assumptions)
sub_125FB30 — 8 KB CLI parser for Path B flags
sub_1265340 — Pre-compilation setup (configure output path, timer)
→ sub_1265970 enters LibNVVM pipeline with bitcode input
The version resolver sub_12B9F70 at 0x12B9F70 is shared between both paths and determines which NVVM IR compatibility mode to use. It reads two obfuscated environment variables in sequence:
// Pseudocode for sub_12B9F70(unsigned int sm_version)
int nvvm_version_resolve(unsigned int sm_version) {
// Try NV_NVVM_VERSION first (decrypted from 0x3C23A90)
char *env = getenv(decrypt("NV_NVVM_VERSION"));
if (!env) {
// Fallback: try LIBNVVM_NVVM_VERSION (decrypted from 0x42812F0)
env = getenv(decrypt("LIBNVVM_NVVM_VERSION"));
}
if (env) {
if (strcmp(env, "nvvm70") == 0) return 0; // Path B mode
if (strcmp(env, "nvvm-latest") == 0) return 1; // Path A mode
}
// Default: SM >= 100 uses Path B, SM < 100 uses Path A
return (sm_version > 99) ? 0 : 1;
}
This function is called from both sub_8F9C90 (the real main, for v253 resolution) and sub_12BB580 (inside the LibNVVM compilation unit initialization). The dual call-site ensures that the version mode is consistent regardless of whether the compiler was invoked via CLI or via the LibNVVM API.
The nvvmInit function (ID 0xDEAD, sub_12B9C00) performs one-time LLVM infrastructure initialization. It is called implicitly during nvvmCreateCU (sub_12BB090) via a pthread_once guard at dword_4F92D9C. The initialization includes:
- Registering LLVM target triples (nvptx64-nvidia-cuda, nvptx-nvidia-cuda)
- Initializing the NVPTX target machine factory
- Setting up the LLVM pass registry
- Configuring thread-safety based on LIBNVVM_DISABLE_CONCURRENT_API (byte_4F92D70)
When byte_4F92D70 == 1 (concurrent API disabled), the pipeline operates in single-threaded mode — no pthread_mutex locks are acquired around compilation unit operations, and Phase II concurrent optimization is disabled regardless of the module's function count.
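The implicit-initialization pattern (nvvmCreateCU triggering nvvmInit exactly once through a pthread_once guard) can be sketched as follows, with a counter standing in for the real LLVM bootstrapping:

```c
#include <assert.h>
#include <pthread.h>

/* Sketch of the implicit one-time init: nvvmCreateCU runs nvvmInit through
 * a pthread_once guard (dword_4F92D9C in the binary), so the LLVM
 * infrastructure is bootstrapped exactly once no matter how many
 * compilation units are created. */
static pthread_once_t init_guard = PTHREAD_ONCE_INIT;
static int init_count = 0;   /* stands in for the real LLVM bootstrapping */

static void nvvm_init_once(void) { init_count++; }

static int nvvm_create_cu(void **handle) {
    pthread_once(&init_guard, nvvm_init_once);
    *handle = &init_count;   /* placeholder handle */
    return 0;                /* NVVM_SUCCESS */
}
```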
Internal API Usage Sequence
The complete sequence of dispatch table calls during a standard Path A compilation (from sub_905EE0):
1. sub_12BC0F0(2151) → nvvmCreateCU(&handle)
Creates compilation unit. Calls nvvmInit via pthread_once on first use.
2. sub_12BC0F0(46967) → nvvmGetErrorString
Saved for later error message formatting.
3. sub_12BC0F0(4660) → nvvmCUAddModule(handle, IR_data, IR_size, NULL)
Adds the user's LLVM bitcode module.
4. sub_12BC0F0(21257) → nvvmDestroyCU
Saved as cleanup function pointer (not called yet).
5. sub_12BCB00 [thunk] → nvvmCUAddModuleFromBuffer(handle, buf, size, NULL)
Called N times: once per additional module from extra args,
once for libdevice (embedded or external).
6. sub_12BC0F0(48879) → nvvmCURegisterCallback
Registers verbose stage callbacks:
sub_903BA0 with ID 61453 (LNK stage)
sub_903730 with ID 47710 (LLC stage)
When -keep mode active, also registers:
sub_9085A0 with ID 64222 (OPT output → .opt.bc file)
sub_908220 with ID 56993 (LLC output → final file)
7. sub_12BC0F0(65261) → nvvmSetOptionStrings(opts_table, 37)
Loads 37 LLVM backend configuration strings from off_4B90FE0.
Calls sub_1C31130() internally to register/reset LLVM options.
8. sub_12BC0F0(48813) → nvvmCUCompile(handle, 57069)
Main compilation. Phase code 57069 (0xDEED) triggers full
LNK → OPT → [OPTIXIR] → LLC pipeline in sub_12C35D0.
9. sub_12BC0F0(17185) → nvvmCUSetExtraArgs(handle, argc, argv)
Passes additional arguments collected from the CLI.
10. sub_12BC0F0(41856) → nvvmGetCompiledResultSize(handle, &log_size)
Queries the compilation log size.
11. sub_12BC0F0(46903) → nvvmGetCompiledResultLog(handle, log_buf)
Retrieves the compilation log (warnings/errors).
12. sub_12BC0F0(61451) → nvvmGetCompiledResultPTXSize(handle, &ptx_size)
Queries the PTX output size.
13. sub_12BC0F0(4111) → nvvmGetCompiledResult(handle, ptx_buf)
Copies the generated PTX into the caller's buffer.
14. sub_12BC0F0(21257) → nvvmDestroyCU(&handle)
Destroys the compilation unit, frees all internal resources.
Path B (sub_1265970) follows the identical sequence but uses off_4C6EEE0 for the options table (step 7), unk_420FD80 for the embedded libdevice (step 5), and appends "-nvvm-version=nvvm70" instead of "-nvvm-version=nvvm-latest" to the pipeline arguments.
nvvmResult Error Codes
The nvvmGetErrorString function (ID 0xB777, sub_12B9980) maps integer result codes from all API functions to descriptive strings:
| Code | Constant | Description |
|---|---|---|
| 0 | NVVM_SUCCESS | Operation completed successfully |
| 1 | NVVM_ERROR_OUT_OF_MEMORY | Memory allocation failed |
| 2 | NVVM_ERROR_PROGRAM_CREATION_FAILURE | Failed to create compilation unit |
| 3 | NVVM_ERROR_IR_VERSION_MISMATCH | Incompatible NVVM IR version detected |
| 4 | NVVM_ERROR_INVALID_INPUT | Malformed input (bad bitcode, wrong magic) |
| 5 | NVVM_ERROR_INVALID_PROGRAM | Null or invalid compilation unit handle |
| 6 | NVVM_ERROR_INVALID_IR | IR failed verification |
| 7 | NVVM_ERROR_INVALID_OPTION | Unrecognized compilation option |
| 8 | NVVM_ERROR_NO_MODULE_IN_PROGRAM | Compilation unit has no modules added |
| 9 | NVVM_ERROR_COMPILATION | Compilation failed (linker, optimizer, or codegen error) |
| 10 | NVVM_ERROR_CANCELLED | Compilation cancelled by user callback |
The pipeline orchestrator sub_12C35D0 maps its internal return codes to these: 0 → NVVM_SUCCESS, 7 → NVVM_ERROR_INVALID_OPTION, 9 → NVVM_ERROR_COMPILATION, 10 → NVVM_ERROR_CANCELLED, 100 → NVVM_ERROR_COMPILATION (post-pipeline verification failure).
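That return-code mapping can be written out as a runnable sketch. The nvvmResult names follow the table above; folding unknown internal codes into NVVM_ERROR_COMPILATION is an assumption for the default case:

```c
#include <assert.h>

enum nvvmResult {
    NVVM_SUCCESS = 0, NVVM_ERROR_OUT_OF_MEMORY = 1,
    NVVM_ERROR_PROGRAM_CREATION_FAILURE = 2, NVVM_ERROR_IR_VERSION_MISMATCH = 3,
    NVVM_ERROR_INVALID_INPUT = 4, NVVM_ERROR_INVALID_PROGRAM = 5,
    NVVM_ERROR_INVALID_IR = 6, NVVM_ERROR_INVALID_OPTION = 7,
    NVVM_ERROR_NO_MODULE_IN_PROGRAM = 8, NVVM_ERROR_COMPILATION = 9,
    NVVM_ERROR_CANCELLED = 10,
};

/* sub_12C35D0's internal-code -> nvvmResult mapping, as recovered above.
 * 100 (post-pipeline verification failure) folds into NVVM_ERROR_COMPILATION. */
static enum nvvmResult map_pipeline_result(int internal) {
    switch (internal) {
        case 0:   return NVVM_SUCCESS;
        case 7:   return NVVM_ERROR_INVALID_OPTION;
        case 9:   return NVVM_ERROR_COMPILATION;
        case 10:  return NVVM_ERROR_CANCELLED;
        case 100: return NVVM_ERROR_COMPILATION;
        default:  return NVVM_ERROR_COMPILATION; /* assumption: unknown -> generic */
    }
}
```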
37 LLVM Options from off_4B90FE0
Phase 2.10 loads a hardcoded table of 37 LLVM option strings from off_4B90FE0 (296 bytes = 37 pointers). These are static, compiled-in LLVM backend configuration flags that are injected into every compilation unit via nvvmSetOptionStrings (ID 0xFEED). The options include target architecture flags (-march=nvptx64, -mcpu=sm_XX), math precision controls (-nvptx-f32ftz, -nvptx-prec-sqrtf32=), optimization levels, debug info flags, and NVPTX-specific feature knobs. The sub_12B9AB0 target function calls sub_1C31130() -- the LLVM option registration/reset function -- to apply them.
Embedded Libdevice
A notable implementation artifact: two identical copies of the libdevice bitcode are statically embedded in the binary. Each is 455,876 bytes (~445 KB) of LLVM bitcode containing more than 400 math functions (__nv_sin, __nv_cos, __nv_exp, __nv_log, __nv_sqrt, etc.) plus atomic operation helpers and FP16/BF16 conversion routines. The duplication exists because Path A and Path B have separate initialization sequences and the linker did not deduplicate the .rodata sections.
When the user provides -nvvmir-library <path>, the external file is used instead. This allows overriding the built-in math library — useful for testing custom libdevice builds.
| Path | Address | Size | Purpose |
|---|---|---|---|
| Path A | unk_3EA0080 | 455,876 bytes | Default libdevice for LibNVVM mode |
| Path B | unk_420FD80 | 455,876 bytes | Default libdevice for standalone mode |
Verbose Callbacks and Intermediate Files
Phase 2.9 registers callback functions that fire at pipeline stage boundaries. When verbose mode is active, these callbacks produce reconstructed command-line output for each stage:
[ "<src>" -lnk -nvvmir-library "<path>" "<input>" -o "<file>.lnk.bc" <opts> -nvvm-version=nvvm-latest ]
[ "<src>" -llc "<llc_path>" -o "<output>" <opts> -nvvm-version=nvvm-latest ]
The callback registration uses sub_12BC0F0(48879) (ID 0xBEEF = nvvmCURegisterCallback) with stage-specific callback IDs:
| Callback | ID | Stage |
|---|---|---|
| sub_903BA0 | 61453 | LNK stage output |
| sub_903730 | 47710 | LLC stage output |
| sub_9085A0 | 64222 | OPT output (keep mode) |
| sub_908220 | 56993 | LLC output (keep mode) |
Intermediate file paths (.lnk.bc for linked-but-unoptimized, .opt.bc for optimized-but-not-yet-codegen'd) are always constructed as strings, but the actual files are only written to disk when the -keep flag is active in wizard mode.
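A sketch of the suffixing scheme, assuming the stage suffix is simply appended to the output stem (the exact base-name derivation inside cicc is unverified):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Phase 2.7 builds the intermediate names as <stem>.<stage>.bc; the strings
 * are always constructed, but written to disk only in -keep wizard mode. */
static void make_stage_path(const char *stem, const char *stage,
                            char *out, size_t cap) {
    snprintf(out, cap, "%s.%s.bc", stem, stage);
}
```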
Path A Error Messages
All errors from sub_905EE0 are written to stderr via sub_223E0D0. Error categories:
| Category | Prefix | Example |
|---|---|---|
| File I/O | "<src>: " | "error in open <file>", "input file <f> read error" |
| LibNVVM API | "libnvvm: error: " | "failed to create the libnvvm compilation unit" |
| Output | "<src>: " | "IO error: <system_error_msg>" |
| Fatal | (none) | "basic_string::append" (std::string overflow at 0x3FFFFFFFFFFFFFFF) |
The error code from LibNVVM API calls maps to nvvmResult: 0 = success, 1 = out of memory, 4 = invalid input, 5 = invalid compilation unit (null handle).
Path B — Standalone cicc Pipeline (sub_1265970)
Path B is the standalone compilation path used when cicc is invoked with LLVM bitcode input (.bc files), by the LibNVVM API directly, or as the default for SM >= 100 architectures. Despite the different entry point, it shares the same underlying LLVM infrastructure as Path A — the difference is in how modules are loaded and how the pipeline stages are orchestrated. Path B appends -nvvm-version=nvvm70 to the optimizer arguments, indicating it targets the NVVM 7.0 IR specification (corresponding to LLVM 7.0.1 bitcode format, the version NVIDIA froze their IR compatibility at).
The 4-stage pipeline (LNK → OPT → OPTIXIR → LLC) runs in-memory: each stage takes an LLVM Module, transforms it, and passes it to the next stage. The OPTIXIR stage is optional and only active when --emit-optix-ir is specified. A user-provided cancellation callback can abort compilation between stages (return code 10).
| Field | Value |
|---|---|
| Address | 0x1265970 |
| Size | ~48 KB (1,371 lines) |
| Timer | "LibNVVM" (same name as Path A) |
| Version string | -nvvm-version=nvvm70 |
Path B Entry — sub_1262860
sub_1262860 (418 lines) is the command-line entry point for Path B, analogous to sub_902D10 for Path A. It parses CLI flags, initializes the compilation context, and calls sub_1265970 for the actual compilation.
| Field | Value |
|---|---|
| Address | 0x1262860 |
| Timer init | sub_1602D10 (standalone context, contrasted with Path A's sub_B6EEA0) |
| CLI parser | sub_125FB30 (Path B's equivalent of Path A's sub_900130) |
The flow is: allocate timer handle → parse CLI via sub_125FB30 → configure output path → call sub_1265340 for pre-compilation setup → call sub_1265970 for compilation → write output. Output can go to stdout if the output path is "-", handled by sub_125C500. On failure: "\n Error processing command line: <details>".
Path B Compilation Orchestrator — sub_1265970
This 48 KB function mirrors sub_905EE0's role but with Path B's initialization and context. It handles both LibNVVM API invocations (when a11 = 1) and CLI invocations (when a11 = 0), with the same 14-phase structure as Path A but using Path B's context objects and the nvvm70 version string.
Key behavioral differences from Path A:
- Context initialization. Path B uses sub_1602D10 for context init (rather than sub_B6EEA0), which creates a standalone LLVM context without the EDG frontend's metadata registration assumptions.
- NVVM IR container handling. Container parsing is performed by sub_12642A0 (Path B's container parser) rather than sub_9047E0.
- Embedded libdevice address. Uses unk_420FD80 (the second copy) rather than unk_3EA0080.
- LLVM options table. Loads 37 options from off_4C6EEE0 (Path B's copy) rather than off_4B90FE0.
- Verbose callbacks. Registers sub_1263280 (ID 61453) and sub_12636E0 (ID 47710) for LNK and LLC stage output respectively, and sub_1268040/sub_1267CC0 for keep-mode output.
- Version string. Always appends "-nvvm-version=nvvm70" rather than "-nvvm-version=nvvm-latest".
4-Stage Pipeline Orchestrator — sub_12C35D0
The orchestrator creates two backend objects — nvopt (512 bytes, the optimizer) and nvllc (480 bytes, the code generator) — and wires them together with the stage dispatch structure. Each stage is controlled by a bit in a stage bitmask derived from sub_12D2AA0, which parses architecture and options into per-stage configuration.
| Field | Value |
|---|---|
| Address | 0x12C35D0 |
| Size | 41 KB (1,446 lines) |
| Backend objects | nvopt (512 bytes) + nvllc (480 bytes) |
| Stage | Bit | Timer String | Core Function |
|---|---|---|---|
| LNK | 0x01 | "LNK" / "LibNVVM module linking step." | sub_12C06E0 (63 KB, module linker) |
| OPT | 0x80 | "OPT" / "LibNVVM optimization step." | sub_12E7E70 (full LLVM pipeline) |
| OPTIXIR | 0x40 | "OPTIXIR" / "LibNVVM Optix IR step." | sub_12F9270 (OptiX IR gen) |
| LLC | 0x04 | "LLC" / "LibNVVM code-generation step." | sub_12F5100 (SelectionDAG codegen) |
Pipeline stage bitmask (from sub_12D2AA0): bit 0=LNK, bit 2=LLC, bit 5=verify, bit 6=OPTIXIR, bit 7=OPT.
Return codes: 0=success, 7=parse failure, 9=link/layout/verification error, 10=cancelled, 100=post-pipeline verification failure.
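The bitmask decoding can be sketched as follows; the bit assignments and the pipeline ordering (LNK → OPT → OPTIXIR → LLC) follow the stage table above:

```c
#include <assert.h>
#include <string.h>

/* Stage bits recovered from sub_12D2AA0's configuration output. */
enum {
    STAGE_LNK     = 1 << 0,  /* 0x01 */
    STAGE_LLC     = 1 << 2,  /* 0x04 */
    STAGE_VERIFY  = 1 << 5,  /* 0x20 */
    STAGE_OPTIXIR = 1 << 6,  /* 0x40 */
    STAGE_OPT     = 1 << 7,  /* 0x80 */
};

/* Render the enabled stages in pipeline order; returns how many are active. */
static int list_stages(unsigned mask, const char *out[4]) {
    int n = 0;
    if (mask & STAGE_LNK)     out[n++] = "LNK";
    if (mask & STAGE_OPT)     out[n++] = "OPT";
    if (mask & STAGE_OPTIXIR) out[n++] = "OPTIXIR";
    if (mask & STAGE_LLC)     out[n++] = "LLC";
    return n;
}
```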
Backend Object Initialization
The orchestrator allocates and initializes two backend objects with distinct vtables:
// nvllc — code generator backend (480 bytes)
v8 = sub_22077B0(480);
sub_12EC960(v8, "nvllc", 5);
v8->vtable = &unk_49E7FF0;
// nvopt — optimizer backend (512 bytes)
v10 = sub_22077B0(512);
sub_12EC960(v10, "nvopt", 5);
v10->vtable = &unk_49E6A58;
v10->sub_vtable = &unk_49E6B20; // at offset +60*8
v10->plugin_slots[0..2] = 0; // offsets 61-63 cleared
A stage dispatch structure (vtable &unk_49E6B38) links the OPT output to the LLC input and stores the cancellation callback pointer.
Cancellation Callback
Between every pipeline stage, the orchestrator checks an optional user-provided cancellation callback stored at state[26]:
cancellation_fn = state[26];
if (cancellation_fn && cancellation_fn(state[27], 0))
return 10; // CANCELLED
This mechanism allows the LibNVVM API caller to abort a long-running compilation. Return code 10 propagates up through the entire call chain, causing sub_8F9C90 to return 10 as the process exit code.
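A minimal sketch of the inter-stage check, assuming the decompiled layout above (in the binary the callback lives at state[26] and its user-data pointer at state[27]; the names and function shapes here are ours):

```c
#include <stddef.h>

#define RC_CANCELLED 10  /* orchestrator return code for a cancelled compile */

typedef int (*cancel_fn)(void *user_data, int reserved);

/* Check the optional cancellation callback between pipeline stages. */
int check_cancel(cancel_fn fn, void *user_data) {
    if (fn && fn(user_data, 0))
        return RC_CANCELLED;   /* abort: 10 propagates up the call chain */
    return 0;                  /* continue with the next stage */
}

/* Example callback that always requests cancellation. */
int always_cancel(void *user_data, int reserved) {
    (void)user_data; (void)reserved;
    return 1;
}
```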
Two-Phase Optimization (OPT Stage)
The OPT stage calls sub_12E7E70, which implements a two-phase optimization protocol. Both phases call the same underlying pipeline function sub_12E54A0, but a TLS variable qword_4FBB3B0 is set to 1 or 2 to indicate which phase is active:
| Phase | TLS value | Purpose |
|---|---|---|
| Phase I | 1 | Analysis + early IR optimization (module-level, CGSCC, function passes) |
| Phase II | 2 | Backend optimization + codegen preparation (lowering, legalization) |
| Complete | 3 | Compilation finished for this module |
Between phases, sub_12D4250 checks concurrency eligibility: if the module contains more than one defined function (non-declaration), and the options permit it, Phase II can run with multiple threads. Thread count is determined from opts[1026] or falls back to get_nprocs(). When concurrency is enabled, sub_12E7B90 is the concurrent worker entry point.
For single-function modules, the optimizer skips the two-phase protocol entirely and runs a single un-phased call to sub_12E54A0 -- no phase counter is set, and the optimizer executes both analysis and backend passes in one invocation.
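The thread-count decision described above can be sketched as follows. The opts[1026] slot and the get_nprocs() fallback are from the decompilation; the function shape and the exact "explicit value wins" ordering are our assumptions:

```c
#include <sys/sysinfo.h>  /* get_nprocs (glibc) */

/* Sketch of Phase II thread-count selection. */
int phase2_thread_count(int opt_value, int defined_functions) {
    if (defined_functions <= 1)
        return 1;             /* single-function modules stay serial */
    if (opt_value > 0)
        return opt_value;     /* explicit thread count from opts[1026] */
    return get_nprocs();      /* fall back to the online CPU count */
}
```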
Data Layout Validation
After the LLC stage but before returning, the orchestrator validates the module's data layout string. If the module has no data layout:
"DataLayoutError: Data Layout string is empty"
→ return 9
On layout mismatch, it produces a detailed diagnostic:
"<error details>\nExample valid data layout:\n64-bit: <reference_layout>"
The reference layout string is loaded from off_4CD4948[0].
Module Linker — sub_12C06E0
The LNK stage's core function (63KB) links multiple LLVM bitcode modules into a single module. This is where user code gets linked with the libdevice math library and any additional modules. The linker performs several validation steps to catch incompatible IR early — before the expensive optimization and codegen stages:
- Bitcode magic validation: checks for `0x42,0x43,0xC0,0xDE` (raw LLVM bitcode, "BC\xC0\xDE") or `0xDE,0xC0,0x17,0x0B` (bitcode wrapper header, 0x0B17C0DE little-endian). Anything else → error code 9.
- Triple validation: every module's target triple must start with `nvptx64-`. Modules without a triple get a clear error: `"Module does not contain a triple, should be 'nvptx64-'"`.
- IR version compatibility: `sub_12BFF60` reads `"nvvmir.version"` metadata (2- or 4-element tuples: major.minor or major.minor.debug_major.debug_minor). The `NVVM_IR_VER_CHK` environment variable can disable this check entirely (set to `"0"`), useful when mixing IR from different CUDA toolkit versions.
- Symbol size matching: for multi-module linking, compares the byte sizes of identically-named globals across modules. Size computation uses type codes (1=half(16b), 2=float(32b), 3=double(64b), 7=ptr, 0xB=integer, 0xD=struct, 0xE=array). A mismatch produces: `"Size does not match for <sym> in <mod> with size X specified in <other> with size Y."`
Single-module fast path: When only one module is present (after adding user code and libdevice), the linker returns it directly via sub_1C3DFC0 without invoking the full linking machinery.
Multi-module linking: For N > 1 modules, the linker copies the primary module's target triple to all secondary modules, then calls sub_12F5610 to perform the LLVM link. After user modules are linked, builtin modules (from a1[3..4]) are linked via sub_1CCEBE0, followed by target feature configuration via sub_1CB9110 and sub_1619140.
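The magic-byte check can be sketched as below. The two 4-byte sequences are from the decompilation; in LLVM's own terms, "BC\xC0\xDE" is the raw bitcode magic and 0xDE 0xC0 0x17 0x0B is the wrapper header magic (0x0B17C0DE stored little-endian). The function name is ours:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the buffer begins with a recognized bitcode magic;
   in the binary, anything else yields error code 9. */
int bitcode_magic_ok(const uint8_t *buf, size_t len) {
    static const uint8_t raw[4]     = { 0x42, 0x43, 0xC0, 0xDE };
    static const uint8_t wrapper[4] = { 0xDE, 0xC0, 0x17, 0x0B };
    if (len < 4)
        return 0;
    return memcmp(buf, raw, 4) == 0 || memcmp(buf, wrapper, 4) == 0;
}
```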
NVVM IR Version Checker — sub_12BFF60
The version checker reads "nvvmir.version" named metadata and validates it against the compiler's expected version range.
| Field | Value |
|---|---|
| Address | 0x12BFF60 |
| Size | ~9 KB (362 lines) |
| Metadata key | "nvvmir.version" |
| Debug metadata | "llvm.dbg.cu" |
Version tuples come in two forms:
- 2-element: `(major, minor)` — IR version only. Special case: `(2, 0)` always passes.
- 4-element: `(major, minor, debug_major, debug_minor)` — IR version plus debug info version. Special case: `debug_major == 3, debug_minor <= 2` always passes.
The NVVM_IR_VER_CHK environment variable is checked multiple times throughout the validation. When set to "0", all version checks are bypassed, returning 0 (compatible). This is a critical escape hatch for mixing bitcode from different CUDA toolkit versions.
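The two recovered always-pass special cases can be modeled as below; the full range comparison against the compiler's expected version window is omitted, and the function names are ours:

```c
/* (2, 0) IR version tuples always pass validation. */
int ir_version_always_passes(int major, int minor) {
    return major == 2 && minor == 0;
}

/* Debug info versions 3.0 through 3.2 always pass validation. */
int dbg_version_always_passes(int dbg_major, int dbg_minor) {
    return dbg_major == 3 && dbg_minor <= 2;
}
```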
Memory Management
jemalloc — The Global Allocator
cicc statically links a jemalloc 5.x allocator in the address range 0x12FC000–0x131FFFF (~400 functions). This replaces the system malloc/free entirely. The jemalloc configuration parser (sub_12FCDB0, 131,600 bytes -- the largest single function in this range) handles the MALLOC_CONF environment variable and /etc/malloc.conf symlink, supporting dozens of tuning options: abort, cache_oblivious, metadata_thp, trust_madvise, retain, dss, tcache, narenas, percpu_arena, background_thread, san_guard_small, san_guard_large, and more.
The choice of jemalloc over glibc's allocator is significant for compiler workloads. jemalloc's thread-local caching (tcache) and arena-per-CPU design (percpu_arena) reduce contention during the concurrent Phase II optimization, where multiple threads may be simultaneously allocating and freeing IR nodes, instruction objects, and analysis results.
The jemalloc stats subsystem (functions at 0x400000–0x42FFFF) provides comprehensive per-arena statistics including allocation counts, active/dirty/muzzy page tracking, mutex contention metrics, and HPA hugify counts. These can be triggered via MALLOC_CONF="stats_print:true".
EDG Memory Regions — sub_822260
The EDG 6.6 frontend uses a custom memory region system configured with USE_MMAP_FOR_MEMORY_REGIONS = 1. During post-parse validation in sub_617BD0 (lgenfe_main), sub_822260() is called 11 times to initialize memory regions 1 through 11. These regions serve as arena-style allocators for different categories of EDG internal data:
- Token buffers (preprocessor token storage)
- IL node pools (intermediate language tree nodes)
- Symbol tables (name→declaration mappings)
- Type representations (structural type information)
The mmap-backed regions grow by mapping additional pages on demand, avoiding the fragmentation problems that would occur with individual malloc calls for the millions of small, short-lived objects the frontend creates during parsing. Region cleanup happens in bulk when the frontend completes -- all pages for a region are unmapped at once rather than individually freed.
The EDG heap allocator cluster at 0x821000–0x823FFF includes tracked allocation (sub_822B10/sub_822B90) with a 1024-entry inline tracking array (unk_4F19620, 1024 * 24 bytes) that overflows to heap when exceeded. The tracking count is maintained in dword_4F19600. The finalization function sub_823310 walks bucket chains to free all tracked allocations.
Large Argument Lists
The argv copy in sub_8F9C90 uses a threshold-based allocation strategy:
if (8 * argc <= 0x800) // argc <= 256
v284 = stack_buffer; // 2096 bytes on stack
else
v284 = sub_16CD150(8 * argc); // heap allocation
This avoids heap allocation for the common case (most cicc invocations have fewer than 256 arguments) while handling the worst case gracefully. The heap path uses sub_16CD150 (a realloc-like wrapper), and the buffer is freed during cleanup if it was heap-allocated.
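The threshold decision reduces to a single comparison; this helper (name ours, threshold from the decompilation) just reports which path the argv copy takes:

```c
#include <stdint.h>

/* 8 bytes per argv pointer; more than 0x800 bytes (256 pointers)
   forces the heap path. */
int argv_copy_uses_heap(int argc) {
    return (uint64_t)8 * argc > 0x800;
}
```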
Signal Handling and Crash Recovery
EDG Signal Handler
The EDG frontend registers a signal handler at 0x723610 during initialization:
// signal handler (0x723610)
void handler(int sig) {
write(STDERR_FILENO, "\n", 1);
dword_4F0790C = 1; // set "interrupted" flag
sub_7235F0(9); // initiate orderly shutdown
}
This handler is registered for SIGINT, allowing the compiler to be interrupted gracefully during long frontend operations (template instantiation, constexpr evaluation). The global dword_4F0790C flag is checked periodically by the parser loop, enabling cooperative cancellation.
LLVM Crash Recovery
The LLVM infrastructure provides its own crash handling via the print-on-crash and print-on-crash-path CLI options (registered in the 0x4F0000–0x51FFFF range). When enabled, the LLVM pass manager dumps the current IR to a specified path on any unhandled signal (SIGSEGV, SIGABRT, etc.). This is separate from the EDG handler and covers the optimization and codegen phases.
Concurrent API Protection
The global constructor at 0x4A5810 checks LIBNVVM_DISABLE_CONCURRENT_API. When set (to any value), byte_4F92D70 = 1 disables thread-safe LibNVVM API usage. The pipeline orchestrator (sub_12C35D0) uses pthread_once(&dword_4F92D9C, init_routine) for one-time setup, and TLS at __readfsqword(0)-24 stores exception handling stack frames while __readfsqword(0)-32 stores the cleanup function sub_12BCC20. These TLS slots ensure that concurrent compilations in the same process do not corrupt each other's state.
Timer Infrastructure
Compilation timing is implemented through a hierarchical timer system. Timer creation (sub_C996C0) takes a label and context string; timer stop (sub_C9AF60) records the elapsed time. The timer hierarchy is:
"CUDA C++ Front-End" ← EDG parsing + IL-to-IR conversion (Path A only)
└─ "LibNVVM" ← Full optimization + codegen pipeline
├─ "LNK" ← Module linking (sub_12C06E0)
├─ "OPT" ← LLVM optimization (sub_12E7E70)
│ ├─ "Phase I" ← Analysis + early optimization
│ └─ "Phase II" ← Backend optimization + codegen prep
├─ "OPTIXIR" ← OptiX IR generation (optional)
└─ "LLC" ← SelectionDAG codegen (sub_12F5100)
The profiler is controlled by sub_C96F30() (returns nonzero when active). Timer data is written to the output file after compilation via sub_C9C600 (Path A) or sub_16DD960 (Path B). The -time flag or environment variable controls activation. The timer names appear in the profiler output, making them essential for identifying compilation bottlenecks.
Architecture Detection — sub_95EB40
One of the most important functions in cicc: the architecture detection system translates a single user-facing flag like -arch=compute_90a into three independent flag strings, one for each pipeline stage. This 3-column fan-out is necessary because the EDG frontend, the LLVM optimizer, and the LLVM backend each use different flag formats to specify the target architecture. The mapping is stored in a std::map<string, ArchTriple> in a red-black tree at a1+248.
| Column | Target | Example |
|---|---|---|
| Column 1 | EDG frontend | -R __CUDA_ARCH=750 |
| Column 2 | Optimizer | -opt-arch=sm_75 |
| Column 3 | LLC backend | -mcpu=sm_75 |
Architecture Validation Bitmask
Before the 3-column mapping is consulted, the architecture number is validated against a hardcoded 64-bit bitmask. This is a fast rejection filter: the SM number minus 75 gives a bit index, and if that bit isn't set in the constant 0x60081200F821, the architecture is rejected. This means cicc v13.0 has a fixed, compile-time-determined set of supported architectures — you cannot add new SM targets without rebuilding the binary.
offset = arch_number - 75;
if (offset > 0x2E || !_bittest64(&0x60081200F821, offset))
→ ERROR: "is an unsupported option"
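The rejection filter can be written as a plain shift-and-mask test. The constant and the 0x2E bound are from the decompilation; the function name is ours:

```c
#include <stdint.h>

/* Fast architecture-rejection filter: bit (sm - 75) of the recovered
   bitmask must be set for the target to be accepted. */
int sm_supported(int sm) {
    const uint64_t mask = 0x60081200F821ULL;
    int offset = sm - 75;
    if (offset < 0 || offset > 0x2E)   /* out of the 47-bit window */
        return 0;
    return (int)((mask >> offset) & 1);
}
```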
Valid architectures (bit positions in 0x60081200F821). Note the gaps — SM 76–79, 81–85, 91–99, 101–102, 104–109, and 111–119 are all absent:
| Bit | SM | Generation |
|---|---|---|
| 0 | 75 | Turing |
| 5 | 80 | Ampere |
| 11 | 86 | Ampere |
| 12 | 87 | Ampere (Jetson Orin) |
| 13 | 88 | Ada (undocumented) |
| 14 | 89 | Ada Lovelace |
| 15 | 90 | Hopper |
| 25 | 100 | Blackwell |
| 28 | 103 | Blackwell |
| 35 | 110 | Jetson Thor |
| 45 | 120 | Blackwell (sm120) — RTX 50xx / Pro |
| 46 | 121 | Blackwell (sm120) — DGX Spark |
Suffix handling: a and f variants share the base SM number for validation but get distinct -mcpu=sm_XXa/-mcpu=sm_XXf strings.
Architecture Parsing in the EDG Frontend
The EDG frontend (sub_617BD0, option ID 0x52 = --nv_arch) performs its own independent architecture parsing that produces three global variables:
| Global | Address | Purpose |
|---|---|---|
| unk_4D045E8 | 0x4D045E8 | SM compute version (integer: 75, 80, ..., 121) |
| unk_4D045E4 | 0x4D045E4 | Accelerated flag (1 if suffix a) |
| unk_4D045E0 | 0x4D045E0 | Fast flag (1 if suffix f; also sets accelerated=1) |
The f suffix (fast-mode) is new to SM >= 100 architectures. When present, it implies a forward-compatible feature set that may not exactly match the base SM version's capabilities.
Flag Catalog — sub_9624D0
The flag catalog is the second-largest function in the entry point range at 75KB. It takes the raw CLI arguments and sorts them into four output vectors — one per pipeline stage (lnk, opt, lto, llc). This is the translation layer between user-facing flags and the internal per-stage options that each pipeline component understands.
A clever detail: the function takes a "mode cookie" parameter (a4) that distinguishes CUDA compilation (0xABBA) from OpenCL compilation (0xDEED). Several flags behave differently depending on this cookie — for example, -prec-div=0 maps to -nvptx-prec-divf32=1 in CUDA mode but -nvptx-prec-divf32=0 in OpenCL mode, reflecting the different default precision expectations of the two languages.
| Field | Value |
|---|---|
| Address | 0x9624D0 |
| Size | 75KB (2,626 lines) |
| Mode cookie | a4: 0xABBA=CUDA, 0xDEED=OpenCL |
| Output vectors | lnk, opt, lto, llc (32-byte std::string elements with SSO) |
-Ofast-compile Levels
NVIDIA's -Ofast-compile is a compile-time vs runtime-performance tradeoff. At "max" level, it disables memory space optimization and LSA optimization entirely — these are expensive analysis passes that improve runtime performance but slow compilation significantly. The "mid" and "min" levels provide intermediate points. This feature is targeted at iterative development workflows where compile speed matters more than code quality.
| Level String | Internal Value | Effect |
|---|---|---|
| "max" | 2 | Most optimizations skipped, forces -lsa-opt=0 -memory-space-opt=0 |
| "mid" | 3 | Medium speedup |
| "min" | 4 | Minimal speedup |
| "0" | 1 → reset to 0 | Disabled |
Error: "libnvvm : error: -Ofast-compile specified more than once". Only one -Ofast-compile per compilation is allowed.
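The level-string decode can be sketched as a simple mapping. Internal values are from the table above; the decompilation stores "0" as 1 and immediately resets it to 0, modeled here as a direct 0. The function name and the -1 unknown-value convention are ours:

```c
#include <string.h>

/* Map an -Ofast-compile level string to its internal value. */
int ofast_compile_level(const char *s) {
    if (strcmp(s, "max") == 0) return 2;
    if (strcmp(s, "mid") == 0) return 3;
    if (strcmp(s, "min") == 0) return 4;
    if (strcmp(s, "0")   == 0) return 0;  /* disabled */
    return -1;                            /* unknown level string */
}
```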
Flag-to-Pipeline Routing (Selected)
This table shows how a single user-facing flag gets split into per-stage options. The pattern reveals NVIDIA's compilation architecture: the LNK stage communicates via -R macro definitions (these become #defines visible to the linker), the OPT stage uses NVIDIA-specific optimizer flags (-opt-use-*), and the LLC stage uses LLVM backend flags (-nvptx-*). Some flags like -ftz=1 propagate to all three stages, while others like -aggressive-inline only affect the optimizer.
| User Flag | LNK Forward | OPT Forward | LLC Forward |
|---|---|---|---|
| -ftz=1 | -R __CUDA_FTZ=1 | -nvptx-f32ftz | -nvptx-f32ftz |
| -prec-div=1 (CUDA) | -R __CUDA_PREC_DIV=1 | -opt-use-prec-div=true | -nvptx-prec-divf32=2 |
| -prec-div=0 (CUDA) | — | -opt-use-prec-div=false | -nvptx-prec-divf32=1 |
| -prec-sqrt=1 | -R __CUDA_PREC_SQRT=1 | — | -nvptx-prec-sqrtf32=1 |
| -fma=1 | — | — | -nvptx-fma-level=1 |
| -fast-math (CUDA) | -R __CUDA_USE_FAST_MATH=1 | -opt-use-fast-math | — |
| -unsafe-math | -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 | -opt-use-fast-math -nvptx-f32ftz | -nvptx-fma-level=1 -nvptx-f32ftz |
| -aggressive-inline | — | -inline-budget=40000 | — |
| -new-nvvm-remat | — | — | -enable-new-nvvm-remat=true -nv-disable-remat=true -rp-aware-mcse=true |
nvcc→cicc Flag Translation — sub_8FE280
When cicc is invoked by nvcc (the CUDA compiler driver), the flags arrive in nvcc's format and need to be translated to cicc's internal format. This translation happens through a red-black tree at qword_4F6D2A0, populated once on first use (guarded by qword_4F6D2C8). Each entry maps an nvcc flag to a pair: an EDG passthrough string and a cicc internal string. Some flags only affect one side — for example, -fmad=1 has no EDG equivalent (FMA is a backend concern) but maps to cicc's -fma=1. Others are dual-mapped: -O0 becomes both --device-O=0 for EDG and -opt=0 for cicc.
| nvcc Flag | EDG Passthrough | cicc Internal |
|---|---|---|
| -O0..-O3 | --device-O=N | -opt=N |
| -fmad=1 | — | -fma=1 |
| -prec_sqrt=1 | — | -prec-sqrt=1 |
| -Ofast-compile=max | — | -Ofast-compile=max |
| -Ofc=max | — | -Ofast-compile=max (alias) |
| --emit-optix-ir | --emit-lifetime-intrinsics | --emit-optix-ir |
| -discard-value-names | --discard_value_names=1 | -discard-value-names=1 |
Environment Variables
cicc checks 20 distinct environment variables across its subsystems. The six NVIDIA-specific variables are the most important for understanding and reimplementing the entry point behavior:
| Variable | Function | Effect |
|---|---|---|
| NVVMCCWIZ | sub_8F9C90 | Set to 553282 → enables wizard mode (byte_4F6D280 = 1) |
| NVVM_IR_VER_CHK | sub_12BFF60 | Set to "0" → disables NVVM IR version checking |
| LIBNVVM_DISABLE_CONCURRENT_API | ctor at 0x4A5810 | Any value → disables thread-safe API (byte_4F92D70 = 1) |
| NV_NVVM_VERSION | sub_8F9C90, sub_12B9F70 | "nvvm70" or "nvvm-latest" → controls Path A/B default and IR compat mode |
| LIBNVVM_NVVM_VERSION | sub_12B9F70 | Same as NV_NVVM_VERSION (checked as fallback) |
| LLVM_OVERRIDE_PRODUCER | ctors at 0x48CC90, 0x4CE640 | Overrides the producer string in output bitcode metadata |
The NV_NVVM_VERSION and LIBNVVM_NVVM_VERSION variables are obfuscated in the binary using the same XOR+ROT13 cipher as the CLI option strings. They are decrypted from 0x3C23A90 and 0x42812F0 respectively.
Key Global Variables
These globals persist across the entire compilation and are accessed from multiple subsystems. The wizard mode flag and flag mapping tree are set during CLI parsing and read throughout the pipeline. The embedded libdevice addresses are compile-time constants (.rodata), while the data model width is set during architecture configuration.
| Variable | Purpose |
|---|---|
| byte_4F6D280 | Wizard mode flag (gates -v, -keep) |
| qword_4F6D2A0 | Flag mapping red-black tree root |
| qword_4F6D2C8 | Tree initialization guard |
| byte_4F6D2D0 | --partial-link active flag |
| byte_4F6D2DC | --force-llp64 active flag |
| unk_3EA0080 | Embedded libdevice bitcode (Path A, 455,876 bytes) |
| unk_420FD80 | Embedded libdevice bitcode (Path B, 455,876 bytes) |
| off_4B90FE0 | LLVM options table (Path A, 37 entries) |
| off_4C6EEE0 | LLVM options table (Path B, 37 entries) |
| unk_4F06A68 | Data model width (8=64-bit, 4=32-bit) |
| unk_4D0461C | Enable p3:32:32:32 in data layout (shared mem 32-bit ptrs) |
| byte_4F92D70 | Concurrent API disabled flag |
| dword_4F92D9C | pthread_once guard for one-time pipeline setup |
| qword_4FBB3B0 | TLS: optimization phase counter (1=Phase I, 2=Phase II, 3=done) |
| unk_4F6D2F8 | Global module pointer (set by sub_908850 after EDG binding) |
Function Map — Entry Point Cluster
| Function | Address | Size | Role |
|---|---|---|---|
| main() thunk → sub_8F9C90 | 0x4396A0 | 16 B | -- |
| String deobfuscation (XOR + ROT13) | 0x8F98A0 | ~512 B | -- |
| Push string to std::vector<std::string> | 0x8F9C20 | ~128 B | -- |
| Real main — CLI parser + dispatcher | 0x8F9C90 | 10,066 B | -- |
| nvcc→cicc flag translation (red-black tree) | 0x8FE280 | ~4 KB | -- |
| Path A CLI processing | 0x900130 | 39 KB | -- |
| Path A orchestrator (simple mode) | 0x902D10 | ~9 KB | -- |
| LLC stage verbose callback | 0x903730 | ~5 KB | -- |
| LNK stage verbose callback | 0x903BA0 | ~5 KB | -- |
| NVVM IR container parser (Path A) | 0x9047E0 | 10 KB | -- |
| CUDA C++ Front-End (lgenfe stage) | 0x905880 | ~6 KB | -- |
| lgenfe single-stage wrapper (Path A) | 0x905E50 | ~256 B | -- |
| LibNVVM pipeline driver (Path A) | 0x905EE0 | 43 KB | -- |
| Backend SM config + EDG module binding | 0x908850 | 10 KB | -- |
| Architecture detection (3-column fan-out) | 0x95EB40 | 38 KB | -- |
| Flag catalog (4 output vectors) | 0x9624D0 | 75 KB | -- |
| Pipeline option parser (4 stage vectors) | 0x9685E0 | ~8 KB | -- |
| Path B CLI processing | 0x125FB30 | ~8 KB | -- |
| Path B entry (simple mode) | 0x1262860 | ~4 KB | -- |
| Path B LNK verbose callback | 0x1263280 | ~1 KB | -- |
| Path B OPT verbose callback | 0x12636E0 | ~1 KB | -- |
| NVVM container parser (Path B) | 0x12642A0 | ~3 KB | -- |
| Path B pre-compilation setup | 0x1265340 | ~4 KB | -- |
| lgenfe single-stage wrapper (Path B) | 0x12658E0 | ~256 B | -- |
| LibNVVM compilation entry (Path B) | 0x1265970 | 48 KB | -- |
| LibNVVM API dispatch table (25 entries) | 0x12BC0F0 | ~3 KB | -- |
| Thunk → sub_12BC8B0 (nvvmCUAddModuleFromBuffer) | 0x12BCB00 | ~64 B | -- |
| NVVM IR version checker | 0x12BFF60 | ~9 KB | -- |
| Module linker (LNK stage core) | 0x12C06E0 | 63 KB | -- |
| 4-stage pipeline orchestrator | 0x12C35D0 | 41 KB | -- |
| Stage bitmask parser | 0x12D2AA0 | ~4 KB | -- |
| Concurrency eligibility check | 0x12D4250 | ~2 KB | -- |
| Two-phase optimizer entry | 0x12E7E70 | ~8 KB | -- |
| Concurrent worker entry point | 0x12E7B90 | ~4 KB | -- |
| LLC core (SelectionDAG codegen) | 0x12F5100 | ~12 KB | -- |
| OptiX IR generator | 0x12F9270 | ~6 KB | -- |
| Path B context initialization | 0x1602D10 | ~2 KB | -- |
Cross-References
- EDG Frontend — `sub_617BD0` (lgenfe_main), the 282-case CLI dispatch inside the EDG 6.6 frontend
- NVVM Container Format — container parsing by `sub_9047E0` (Path A) and `sub_12642A0` (Path B)
- Optimizer Pipeline — the OPT stage driven by `sub_12E7E70` (two-phase optimization)
- IR Generation — module creation via `sub_908850` (EDG module binding)
- PTX Emission — the LLC stage's PTX output via `sub_12F5100`
nvcc-to-cicc Interface Contract
When nvcc compiles device code, it invokes cicc as an external process, passing the preprocessed CUDA source (or LLVM bitcode) along with a carefully translated set of flags. cicc never sees the raw -fmad=1 or -prec_sqrt=0 flags that the user typed on the nvcc command line -- those are rewritten through a flag translation table: a global std::map red-black tree at qword_4F6D2A0, populated by sub_8FE280. This page documents the complete interface contract: how nvcc invokes cicc, how flags are translated, how the mode cookie selects CUDA vs. OpenCL behavior, what input formats are accepted, and what output modes are available.
The flag translation is split into two stages. Stage 1 (sub_8FE280) translates nvcc-facing flags into cicc-facing flags, producing a dual-slot result with an EDG front-end flag and an internal cicc flag. Stage 2 (sub_95EB40) further expands each cicc-facing flag into a three-column architecture mapping, routing each flag to the EDG frontend, the NVVM optimizer, and the LLC backend. The composition of these two stages means a single nvcc flag like -fmad=1 silently becomes -fma=1 internally, contributes nothing to the EDG or OPT argument vectors, and reaches the LLC backend as -nvptx-fma-level=1 (with --emit-llvm-bc always injected separately).
| Flag translation tree | sub_8FE280 -- global std::map at qword_4F6D2A0, 40+ entries |
| Tree guard | qword_4F6D2C8 (set to 1 after first initialization) |
| Tree node size | 72+ bytes: key at +32, length at +40, FlagPair* at +64 |
| CLI parser (Path A) | sub_900130 (39 KB, 12 parameters) |
| Flag catalog (Path A/B) | sub_9624D0 (75 KB, 2,626 lines, 4 output vectors) |
| 3-column arch table | sub_95EB40 (38 KB, 23 architectures, 3-column fan-out) |
| Mode cookies | 0xABBA = CUDA, 0xDEED = OpenCL |
| Default architecture | compute_75 / sm_75 (Turing) |
| Input extensions | .bc, .ci, .i, .cup, .optixir, .ii |
| Default opt level | -opt=3 (O3) |
Invocation Contract
nvcc invokes cicc as a subprocess with a single input file and a set of translated flags. The general invocation form is:
cicc [mode-flags] [translated-flags] [pass-through-flags] -o <output> <input>
For the standard CUDA compilation path (no explicit -lXXX mode flag), cicc enters sub_8F9C90 (real main, 10,066 bytes at 0x8F9C90), parses all arguments into ~12 local variables, resolves the Path A / Path B dispatch variable v253, and calls one of:
- Path A (EDG pipeline): `sub_902D10` -- invokes `sub_900130` for CLI parsing, then the EDG frontend via `sub_905880`, then the LibNVVM pipeline via `sub_905EE0`.
- Path B (standalone LLVM pipeline): `sub_1262860` -- similar flow but through the standalone LLVM infrastructure at `0x1262860`.
Path selection is controlled by v253, which defaults to 2 (unresolved) and is resolved through the obfuscated environment variable NV_NVVM_VERSION. For SM >= 100 (Blackwell and later), the default is Path B unless the -nvc flag is present. For SM < 100, the default is Path A. See Entry Point for the full dispatch matrix.
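Ignoring the NV_NVVM_VERSION override, the default path resolution described above reduces to a small decision function. The SM threshold and the -nvc semantics are from the text; the enum and function shape are ours:

```c
typedef enum { PATH_A = 0, PATH_B = 1 } Path;

/* Default Path A/B selection when NV_NVVM_VERSION leaves v253 unresolved. */
Path default_path(int sm, int has_nvc_flag) {
    if (sm >= 100)                              /* Blackwell and later */
        return has_nvc_flag ? PATH_A : PATH_B;  /* -nvc forces the EDG path */
    return PATH_A;                              /* pre-Blackwell default */
}
```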
When cicc is invoked in multi-stage mode (-lnk, -opt, -llc, -libnvvm), the entry point dispatches to sub_905EE0 (Path A, 43 KB) or sub_1265970 (Path B, 48 KB), which orchestrate the LNK, OPT, and LLC sub-pipelines internally.
Parameter Passing to sub_900130
The Path A CLI parser sub_900130 receives 12 parameters and performs a two-pass argument scan:
unsigned int sub_900130(
const char *input_file, // a1: input filename
const char *opencl_src, // a2: OpenCL source path (NULL for CUDA)
const char *output_file, // a3: output filename
__int64 *arg_vector, // a4: pointer to std::vector<std::string>
char mode_flag, // a5: mode flag (0=normal, 1=special)
__int64 job_desc, // a6: output compilation job struct
__int64 error_out, // a7: error string output
_BYTE *m64_flag, // a8: output - set to 1 if -m64 seen
_BYTE *discard_names, // a9: output - set to 1 if -discard-value-names
__int64 trace_path, // a10: device time trace path
__int64 trace_pid, // a11: trace PID
__int64 trace_env // a12: trace env value
);
// Returns: 0 = success, 1 = error
Pass 1: Scans for -arch flag via sub_8FD0D0, extracts architecture string.
Pass 2: Iterates all arguments, looking each up in the red-black tree at qword_4F6D2A0. For tree hits, the EDG slot is pushed to the EDG argument vector (v145) and the cicc slot is pushed to the backend argument vector (v148). For tree misses, sequential string comparisons handle extended flags (-maxreg=N, -split-compile=N, --Xlgenfe, --Xlibnvvm, etc.).
Before any user flags, sub_900130 unconditionally injects:
- `--emit-llvm-bc` into the EDG argument vector
- `--emit-nvvm-latest` into the backend argument vector
After all arguments are processed, architecture strings are appended:
- `--nv_arch` + `sm_XX` to EDG arguments
- `-arch=compute_XX` to backend arguments
Mode Cookies
The sub_9624D0 flag catalog function takes a fourth parameter a4 that selects the language mode. This is not a user-visible flag -- it is passed internally by the pipeline orchestrator.
| Cookie | Hex | Decimal | Language |
|---|---|---|---|
| 0xABBA | 0xABBA | 43,962 | CUDA compilation |
| 0xDEED | 0xDEED | 57,069 | OpenCL compilation |
The cookie affects multiple behaviors:
Precision division routing. In CUDA mode (0xABBA), -prec-div=0 maps to -nvptx-prec-divf32=1 (not 0) at LLC, while -prec-div=1 maps to -nvptx-prec-divf32=2. In OpenCL mode (0xDEED), the mapping is straightforward: -prec-div=0 maps to -nvptx-prec-divf32=0, -prec-div=1 to -nvptx-prec-divf32=1, and OpenCL additionally supports -prec-div=2 mapping to -nvptx-prec-divf32=3.
Fast-math routing. In CUDA mode, -fast-math maps to -R __CUDA_USE_FAST_MATH=1 for EDG and -opt-use-fast-math for OPT, with no LLC flag. In OpenCL mode, -fast-math maps to -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 for EDG and -opt-use-fast-math -nvptx-f32ftz for OPT.
Default precision. -prec-sqrt defaults to 1 (precise) in CUDA mode, 0 (imprecise) in OpenCL mode.
Discard value names. In CUDA mode (0xABBA), without explicit override, value names are discarded by default (a1+232 = 1), generating -lnk-discard-value-names=1, -opt-discard-value-names=1, and -lto-discard-value-names=1. In OpenCL mode (0xDEED), this only applies when (a13 & 0x20) is set (LTO generation active).
OptiX IR emission. The --emit-optix-ir flag is only valid when the cookie is 0xABBA or 0xDEED.
Internal compile call. The LibNVVM compile function nvvmCUCompile (dispatch ID 0xBEAD) is called with phase code 57,069 (0xDEED) regardless of the outer cookie -- this is the internal LibNVVM compile phase code, not a language selector.
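The cookie-dependent -prec-div routing to the LLC flag -nvptx-prec-divf32=N can be sketched as below. The value mappings are from the text; the function shape and the -1 convention for rejected combinations are ours:

```c
#define COOKIE_CUDA   0xABBA
#define COOKIE_OPENCL 0xDEED

/* Return the -nvptx-prec-divf32 value for a given mode cookie and
   -prec-div setting, or -1 if the mode does not accept that setting. */
int prec_div_llc_value(int cookie, int prec_div) {
    if (cookie == COOKIE_CUDA) {
        if (prec_div == 0) return 1;   /* CUDA: 0 maps to 1, not 0 */
        if (prec_div == 1) return 2;
    } else if (cookie == COOKIE_OPENCL) {
        if (prec_div == 0) return 0;   /* OpenCL: straightforward mapping */
        if (prec_div == 1) return 1;
        if (prec_div == 2) return 3;   /* OpenCL-only level */
    }
    return -1;
}
```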
Flag Translation Table
sub_8FE280 populates a global std::map<std::string, FlagPair*> in the red-black tree at qword_4F6D2A0. Each FlagPair is a 16-byte struct with two slots: slot 0 for the EDG frontend passthrough, slot 1 for the internal cicc flag. The function is called exactly once, guarded by qword_4F6D2C8.
Red-Black Tree Structure
qword_4F6D2A0 -- tree root pointer (std::_Rb_tree)
dword_4F6D2A8 -- sentinel node (tree.end())
qword_4F6D2B0 -- root node pointer
qword_4F6D2B8 -- begin iterator (leftmost node)
qword_4F6D2C8 -- initialization guard (1 = already built)
Each node is 72+ bytes:
| Offset | Field |
|---|---|
| +0 | Color (0=red, 1=black) |
| +8 | Parent pointer |
| +16 | Left child pointer |
| +24 | Right child pointer |
| +32 | Key data pointer (std::string internals) |
| +40 | Key length |
| +48 | Key capacity |
| +64 | Value pointer (FlagPair*) |
Lookup is via sub_8FE150 (lower_bound + insert-if-not-found). Insert is via sub_8FDFD0 (allocate node + rebalance). Comparison uses standard std::string::compare.
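A minimal sketch of the dual-slot translation table: the real cicc uses a std::map red-black tree, but a linear scan over a few entries recovered from the mapping table below is enough to show the slot-0/slot-1 split. The struct and function names are ours; NULL models the `<null>` (no flag emitted) case:

```c
#include <string.h>
#include <stddef.h>

typedef struct {
    const char *nvcc_flag;
    const char *edg_slot;    /* slot 0: EDG frontend passthrough */
    const char *cicc_slot;   /* slot 1: internal cicc flag */
} FlagPair;

static const FlagPair flag_table[] = {
    { "-O3",          "--device-O=3", "-opt=3"       },  /* dual-mapped */
    { "-fmad=1",      NULL,           "-fma=1"       },  /* backend only */
    { "-prec_sqrt=1", NULL,           "-prec-sqrt=1" },  /* _ -> - */
};

/* Look up an nvcc flag; NULL on miss (caller falls through to the
   sequential string comparisons for extended flags). */
const FlagPair *flag_lookup(const char *flag) {
    for (size_t i = 0; i < sizeof flag_table / sizeof flag_table[0]; i++)
        if (strcmp(flag_table[i].nvcc_flag, flag) == 0)
            return &flag_table[i];
    return NULL;
}
```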
Complete nvcc-to-cicc Mapping
The table below shows every entry in the sub_8FE280 red-black tree. Slot 0 is forwarded to the EDG frontend; slot 1 is forwarded to the cicc backend pipeline. <null> means no flag is generated for that slot.
| nvcc flag | EDG passthrough (slot 0) | cicc internal (slot 1) | Notes |
|---|---|---|---|
| -m32 | --m32 | <null> | |
| -m64 | --m64 | <null> | Also sets *a8 = 1 |
| -fast-math | <null> | -fast-math | |
| -ftz=1 | <null> | -ftz=1 | |
| -ftz=0 | <null> | -ftz=0 | |
| -prec_sqrt=1 | <null> | -prec-sqrt=1 | Underscore to hyphen |
| -prec_sqrt=0 | <null> | -prec-sqrt=0 | Underscore to hyphen |
| -prec_div=1 | <null> | -prec-div=1 | Underscore to hyphen |
| -prec_div=0 | <null> | -prec-div=0 | Underscore to hyphen |
| -fmad=1 | <null> | -fma=1 | fmad renamed to fma |
| -fmad=0 | <null> | -fma=0 | fmad renamed to fma |
| -O0 | --device-O=0 | -opt=0 | Dual-mapped |
| -O1 | --device-O=1 | -opt=1 | Dual-mapped |
| -O2 | --device-O=2 | -opt=2 | Dual-mapped |
| -O3 | --device-O=3 | -opt=3 | Dual-mapped |
| -Osize | <null> | -Osize | |
| -Om | <null> | -Om | |
| -Ofast-compile=max | <null> | -Ofast-compile=max | |
| -Ofc=max | <null> | -Ofast-compile=max | Alias |
| -Ofast-compile=mid | <null> | -Ofast-compile=mid | |
| -Ofc=mid | <null> | -Ofast-compile=mid | Alias |
| -Ofast-compile=min | <null> | -Ofast-compile=min | |
| -Ofc=min | <null> | -Ofast-compile=min | Alias |
| -Ofast-compile=0 | <null> | <null> | No-op |
| -Ofc=0 | <null> | <null> | No-op alias |
| -g | --device-debug | -g | Dual-mapped |
| -show-src | <null> | -show-src | |
| -disable-allopts | <null> | -disable-allopts | |
| -disable-llc-opts | <null> | disable-llc-opts | |
| -w | -w | -w | Dual-mapped |
| -Wno-memory-space | <null> | -Wno-memory-space | |
| -disable-inlining | <null> | -disable-inlining | |
| -aggressive-inline | <null> | -aggressive-inline | |
| --kernel-params-are-restrict | --kernel-params-are-restrict | -restrict | Dual-mapped, renamed |
| -allow-restrict-in-struct | <null> | -allow-restrict-in-struct | |
| --device-c | --device-c | --device-c | Dual-mapped |
| --generate-line-info | --generate-line-info | -generate-line-info | Dual-mapped |
| --enable-opt-byval | --enable-opt-byval | -enable-opt-byval | Dual-mapped |
| --no-lineinfo-inlined-at | <null> | -no-lineinfo-inlined-at | |
| --keep-device-functions | --keep-device-functions | <null> | EDG only |
| --emit-optix-ir | --emit-lifetime-intrinsics | --emit-optix-ir | Triggers lifetime intrinsics in EDG |
| -opt-fdiv=0 | <null> | -opt-fdiv=0 | |
| -opt-fdiv=1 | <null> | -opt-fdiv=1 | |
| -new-nvvm-remat | <null> | -new-nvvm-remat | |
| -disable-new-nvvm-remat | <null> | -disable-new-nvvm-remat | |
| -disable-nvvm-remat | <null> | -disable-nvvm-remat | |
| -discard-value-names | --discard_value_names=1 | -discard-value-names=1 | Also sets *a9 = 1 |
| -gen-opt-lto | <null> | -gen-opt-lto | |
Key translation patterns:
- Underscore to hyphen: nvcc uses underscores (-prec_sqrt), cicc uses hyphens (-prec-sqrt).
- Rename: -fmad becomes -fma internally.
- Dual-mapping: -O0 through -O3 emit both an EDG flag (--device-O=N) and a cicc flag (-opt=N).
- Alias expansion: -Ofc=X is silently rewritten to -Ofast-compile=X.
- Implicit dependency: --emit-optix-ir adds --emit-lifetime-intrinsics to the EDG frontend, enabling lifetime intrinsic generation that the OptiX IR output path requires.
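These patterns can be modeled as a small lookup table. The sketch below is illustrative Python, not the binary's actual data structure (the real tree is a std::map built by sub_8FE280); the specific keys shown are simplified from the rows above:

```python
# Illustrative subset of the nvcc -> (EDG flag, cicc flag) translation tree.
# A <null> column in the table above is modeled as None.
TREE = {
    "-prec_sqrt=1":     (None, "-prec-sqrt=1"),        # underscore -> hyphen
    "-fmad=1":          (None, "-fma=1"),              # rename
    "-O3":              ("--device-O=3", "-opt=3"),    # dual-mapped
    "-Ofc=max":         (None, "-Ofast-compile=max"),  # alias expansion
    "-Ofast-compile=0": (None, None),                  # no-op
    "--emit-optix-ir":  ("--emit-lifetime-intrinsics", "--emit-optix-ir"),  # implicit dep
}

def translate(flag):
    """Return (edg_flag, cicc_flag), or None on a tree miss.

    On a miss, the real binary falls through to sub_900130's sequential
    string comparisons (see "Extended Flags" below in the source wiki).
    """
    return TREE.get(flag)
```

The point of the model: a single nvcc flag can fan out to zero, one, or two internal flags, and aliases collapse before any later phase sees them.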
Extended Flags (Not in Tree)
The following flags are handled by sequential string comparison in sub_900130 when a tree lookup misses:
| nvcc flag | Expansion | Notes |
|---|---|---|
-maxreg=N | -maxreg=<N> to backend | |
-split-compile=N | -split-compile=<N> to OPT | Error if specified twice |
-split-compile-extended=N | -split-compile-extended=<N> to OPT | Mutually exclusive with -split-compile |
--Xlgenfe <arg> | <arg> to EDG | |
--Xlibnvvm <arg> | <arg> to backend | |
--Xlnk <arg> / -Xlnk <arg> | -Xlnk + <arg> to backend | |
--Xopt <arg> / -Xopt <arg> | -Xopt + <arg> to backend | |
--Xllc <arg> / -Xllc <arg> | -Xllc + <arg> to backend | |
-Xlto <arg> | <arg> to LTO vector | |
-covinfo <file> | -Xopt -coverage=true -Xopt -covinfofile=<file> | |
-profinfo <file> | -Xopt -profgen=true -Xopt -profinfofile=<file> | |
-profile-instr-use <file> | -Xopt -profuse=true -Xopt -proffile=<file> | |
-lto | -gen-lto to backend; enables LTO | |
-olto <file> | -gen-lto-and-llc + flag + next arg | |
--promote_warnings | -Werror to backend; flag to EDG | |
-inline-info | -Xopt -pass-remarks=inline + missed + analysis | |
-jump-table-density=N | -jump-table-density=<N> to backend | |
-opt-passes=<val> | -opt-passes=<val> to backend | |
--orig_src_file_name <val> | --orig_src_file_name + <val> to EDG | |
--force-llp64 | Pass to EDG; sets byte_4F6D2DC = 1 | |
--partial-link | Complex: may add -memdep-cache-byval-loads=false to OPT and LLC | Sets byte_4F6D2D0 = 1 |
--tile-only | Pass to EDG + --tile_bc_file_name + output path | |
--device-time-trace | Pass to EDG; next arg becomes trace path | |
-jobserver | -jobserver to backend or pass to EDG |
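A few of these expansions can be sketched directly from the table. This is a hypothetical Python model of the sequential fallback, reproducing only the -covinfo, -profinfo, and -split-compile rows:

```python
def expand_extended(flag, arg=None):
    """Sketch of sub_900130's sequential fallback for a few table rows.

    Returns the argument list forwarded onward (simplified: the real code
    appends to per-phase vectors rather than returning a list).
    """
    if flag == "-covinfo":
        return ["-Xopt", "-coverage=true", "-Xopt", f"-covinfofile={arg}"]
    if flag == "-profinfo":
        return ["-Xopt", "-profgen=true", "-Xopt", f"-profinfofile={arg}"]
    if flag.startswith("-split-compile="):
        return [flag]          # forwarded verbatim to the OPT phase
    raise KeyError(flag)       # remaining rows omitted from this sketch
```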
Input Extensions
Input files are identified by extension during the argument loop in sub_8F9C90. The last matching file wins (the input variable s is overwritten each time). Extension matching proceeds by checking trailing characters: last 3 for .bc/.ci, last 2 for .i, last 3 for .ii, last 4 for .cup, last 8 for .optixir.
| Extension | Format | Condition | Address |
|---|---|---|---|
.bc | LLVM bitcode | Always accepted | 0x8FAA0A |
.ci | CUDA intermediate (preprocessed) | Always accepted | 0x8FAA29 |
.i | Preprocessed C/C++ | Always accepted | 0x8FA9xx |
.ii | Preprocessed C++ | Always accepted | 0x8FBF7E |
.cup | CUDA source | Only after --orig_src_path_name or --orig_src_file_name | 0x8FBFC4 |
.optixir | OptiX IR | Always accepted | 0x8FC001 |
Unrecognized arguments (those failing both tree lookup and sequential matching, and lacking a recognized extension) are silently appended to the v266 pass-through vector, which is forwarded to sub-pipelines.
If no input file is found after parsing all arguments:
Missing input file
Recognized input file extensions are: .bc .ci .i .cup .optixir
Note that .ii is not mentioned in the error message despite being accepted -- this appears to be a minor oversight in the error string.
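The extension gate can be summarized in a short sketch. This is a simplified Python model (the real code compares fixed trailing-character counts rather than calling endswith, and the function name is hypothetical):

```python
ACCEPTED = (".bc", ".ci", ".i", ".ii", ".optixir")

def classify_arg(arg, prev_arg=None):
    """Sketch of the input-file gate in sub_8F9C90."""
    if arg.endswith(ACCEPTED):
        return "input"
    # .cup is only accepted when preceded by an nvcc metadata flag
    if arg.endswith(".cup") and prev_arg in ("--orig_src_path_name",
                                             "--orig_src_file_name"):
        return "input"
    return "passthrough"   # lands in the v266 pass-through vector
```

Note how the model reproduces the gate: a bare .cup file is silently treated as pass-through, while .bc/.ci/.i/.ii/.optixir are always inputs.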
Output Modes
cicc can produce several output formats, controlled by the combination of flags in the a13 compilation mode bitmask. The bitmask is accumulated during flag parsing in sub_9624D0:
| a13 Value | Mode | Output Format |
|---|---|---|
0x07 | Default (all phases) | PTX text assembly |
0x10 | Debug/line-info | PTX with debug metadata |
0x21 | -gen-lto | LTO bitcode (.lto.bc) |
0x23 | -lto (full LTO) | LTO bitcode + link |
0x26 | -link-lto | Linked LTO output |
0x43 | --emit-optix-ir | OptiX IR (.optixir) |
0x80 | -gen-opt-lto | Optimized LTO bitcode |
0x100 | --nvvm-64 | 64-bit NVVM mode modifier |
0x200 | --nvvm-32 | 32-bit NVVM mode modifier |
The default output is PTX text, written through the LLC backend's PTX printer. The output file path is specified by -o <file> (fatal if missing in multi-stage modes). When no output path is provided in simple mode, sub_900130 constructs a .ptx filename from the input.
PTX Text Output (Default)
The standard path runs all four internal phases: LNK (IR linking), OPT (NVVM optimizer), optionally OptiX IR emission, then LLC (code generation). The LLC backend writes PTX assembly text to the output file. In sub_905EE0, the output writing (Phase 4) checks the first bytes of the result for ELF magic (0x7F, 0xED) to detect accidentally binary output; if the mode is text mode (0) and ELF headers are present, it indicates an internal error.
LTO Bitcode Output
When -lto or -gen-lto is active, cicc produces LLVM bitcode instead of PTX. The -gen-lto flag sets a13 = (a13 & 0x300) | 0x21 and adds -gen-lto to the LTO argument vector. The -gen-lto-and-llc variant additionally runs LLC after producing the LTO bitcode, generating both outputs. The -olto flag takes a next argument (the LTO optimization level) and combines LTO bitcode generation with LLC execution.
OptiX IR Output
The --emit-optix-ir flag sets a13 = (a13 & 0x300) | 0x43. In the flag translation tree, it also injects --emit-lifetime-intrinsics into the EDG frontend, enabling lifetime intrinsic emission that is required for the OptiX IR format. In the flag catalog (sub_9624D0), it additionally routes -do-ip-msp=0 and -do-licm=0 to the optimizer, disabling interprocedural memory space promotion and LICM for OptiX compatibility.
Split Compile
The -split-compile=N flag (or -split-compile-extended=N) routes to the optimizer as -split-compile=<N> (or -split-compile-extended=<N>). These are mutually exclusive and error if specified more than once ("split compilation defined more than once"). When -split-compile-extended is used, it also sets the flag at a1+1644 to 1. The split compile mechanism divides the compilation unit into N partitions for parallel processing.
Exit Codes
The process exit code is the return value of sub_8F9C90 (real main), stored in v8:
| Code | Meaning | Source |
|---|---|---|
| 0 | Success | Normal compilation; -irversion query |
| 1 | Argument error | Missing input file, missing output file, CLI parse failure |
| v264 | Pipeline error | Return code from sub_905EE0 / sub_1265970 / sub_905880 |
Within the pipeline, error codes from sub_905EE0 are set via *a8:
| *a8 Value | Meaning |
|---|---|
| 0 | Success (NVVM_SUCCESS) |
| -1 | File open/read error |
| 1 | NVVM_ERROR_OUT_OF_MEMORY |
| 4 | NVVM_ERROR_INVALID_INPUT |
| 5 | NVVM_ERROR_INVALID_CU (null compilation unit) |
Error messages are written to qword_4FD4BE0 (stderr stream) via sub_223E0D0. All LibNVVM-originated errors are prefixed with "libnvvm : error: ". Representative errors:
- "Error processing command line: <cmd>" (from sub_900130 failure)
- "Missing input file" / "Missing output file"
- "<src>: error in open <file>" (file I/O)
- "libnvvm: error: failed to create the libnvvm compilation unit"
- "libnvvm: error: failed to add the module to the libnvvm compilation unit"
- "libnvvm: error: failed to get the PTX output"
- "Invalid NVVM IR Container" (error code 259, from sub_C63EB0)
- "Error opening '<file>': file exists!" / "Use -f command line argument to force output"
- "Error: Failed to write time profiler data."
- "Unparseable architecture: <val>"
- "libnvvm : error: <flag> is an unsupported option"
- "libnvvm : error: <flag> defined more than once" (duplicate -maxreg, etc.)
Special Behaviors
.cup Extension Gate
The .cup extension (CUDA preprocessed source) is only accepted as an input file when the preceding argument is --orig_src_path_name or --orig_src_file_name. These are metadata flags inserted by nvcc to track the original source file path for diagnostic messages. The check is:
// At 0x8FBFC4 and 0x8FBFDE:
if (strcmp(argv[i-1], "--orig_src_path_name") == 0 ||
strcmp(argv[i-1], "--orig_src_file_name") == 0) {
s = argv[i]; // accept .cup as input
}
This means cicc will silently ignore a .cup file that appears without a preceding metadata flag. When accepted, the .cup extension triggers --orig_src_path_name / --orig_src_file_name handling in sub_900130, which forwards the original source path to the EDG frontend for accurate error location reporting.
-Ofc Alias Handling
The -Ofc=X form is a shorthand alias for -Ofast-compile=X, handled entirely within the sub_8FE280 flag translation tree. The tree contains six entries for fast-compile control:
| Tree Key | cicc Internal | Effect |
|---|---|---|
-Ofast-compile=max | -Ofast-compile=max | Identity |
-Ofc=max | -Ofast-compile=max | Alias |
-Ofast-compile=mid | -Ofast-compile=mid | Identity |
-Ofc=mid | -Ofast-compile=mid | Alias |
-Ofast-compile=min | -Ofast-compile=min | Identity |
-Ofc=min | -Ofast-compile=min | Alias |
-Ofast-compile=0 | <null> | No-op |
-Ofc=0 | <null> | No-op alias |
The aliasing happens at the tree level, before sub_9624D0 ever sees the flag. By the time the flag catalog processes the argument, -Ofc=max and -Ofast-compile=max are indistinguishable. See Optimization Levels for what each fast-compile tier actually does.
In sub_9624D0, -Ofast-compile is stored at offset a1+1640 as an integer:
| Level string | Integer value | Behavior |
|---|---|---|
"0" | 1 | Disabled (then reset to 0) |
"max" | 2 | Most optimizations skipped; forces -lsa-opt=0, -memory-space-opt=0 |
"mid" | 3 | Medium pipeline |
"min" | 4 | Close to full optimization |
Any other value produces: "libnvvm : error: -Ofast-compile called with unsupported level, only supports 0, min, mid, or max".
Only one -Ofast-compile is permitted per invocation. A second occurrence triggers: "libnvvm : error: -Ofast-compile specified more than once".
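The level parsing and duplicate check can be sketched as follows. This is an illustrative Python model: the dict stands in for the a1 structure, and the function name is an assumption, not a symbol from the binary:

```python
# Level-string -> stored integer, per the table above (stored at a1+1640).
LEVELS = {"0": 1, "max": 2, "mid": 3, "min": 4}

def set_fast_compile(state, level):
    """Sketch of -Ofast-compile handling in sub_9624D0."""
    if state.get("fast_compile"):      # slot already non-zero: duplicate flag
        raise ValueError("-Ofast-compile specified more than once")
    if level not in LEVELS:
        raise ValueError("only supports 0, min, mid, or max")
    state["fast_compile"] = LEVELS[level]
```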
Discard Value Names
The -discard-value-names flag has complex interaction semantics. In the tree, it dual-maps to --discard_value_names=1 (EDG, note underscores) and -discard-value-names=1 (cicc, note hyphens). Additionally, per-phase overrides are possible via -Xopt -opt-discard-value-names=0, -Xlnk -lnk-discard-value-names=0, or -Xlto -lto-discard-value-names=0.
In CUDA mode, without explicit flags, value names are discarded by default. In OpenCL mode, the default only applies when LTO generation is active (a13 & 0x20). This reflects the fact that value names are useful for debugging but waste memory in production builds.
Wizard Mode Interaction
The -v (verbose), -keep (keep intermediates), and -dryrun flags are parsed in sub_8F9C90 but are only effective when wizard mode is active. Wizard mode is gated on the NVVMCCWIZ environment variable: when getenv("NVVMCCWIZ") returns the string "553282", byte_4F6D280 is set to 1. Without wizard mode, these flags are silently accepted but have no effect -- v259 (verbose) and v262 (keep) remain 0. This appears to be a deliberate anti-reverse-engineering measure.
Default Values When Flags Are Absent
When a flag is not explicitly provided, sub_9624D0 applies these defaults (checking stored-value sentinels):
| Flag | Default Value | Sentinel Offset |
|---|---|---|
-opt= | -opt=3 (O3) | a1+400 |
-arch=compute_ | -arch=compute_75 (Turing) | a1+560 |
-ftz= | -ftz=0 (no flush-to-zero) | a1+592 |
-prec-sqrt= | -prec-sqrt=1 (CUDA) / -prec-sqrt=0 (OpenCL) | a1+624 |
-prec-div= | -prec-div=1 (precise) | a1+656 |
-fma= | -fma=1 (enabled) | a1+688 |
-opt-fdiv= | -opt-fdiv=0 | a1+464 |
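The sentinel mechanism can be sketched abstractly: each slot starts at an "unset" marker, and sub_9624D0 substitutes the default only if the marker survives parsing. The Python below is a simplified model with a subset of the table's flags; names and the sentinel value are assumptions:

```python
SENTINEL = -1   # stand-in for the unset marker in the a1+N slots

DEFAULTS = {"opt": "-opt=3", "arch": "-arch=compute_75",
            "ftz": "-ftz=0", "fma": "-fma=1"}

def finalize_flags(slots):
    """Any slot still holding the sentinel receives the table's default."""
    out = []
    for name, default in DEFAULTS.items():
        value = slots.get(name, SENTINEL)
        out.append(default if value == SENTINEL else value)
    return out
```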
Configuration
Four Output Vectors
sub_9624D0 builds four independent std::vector<std::string> that are serialized into char** arrays at function exit:
| Vector | Seed | Output | Pipeline Phase |
|---|---|---|---|
v324 (LNK) | "lnk" | a5/a6 | Phase 1: IR linker |
v327 (OPT) | "opt" | a7/a8 | Phase 2: NVVM optimizer |
v330 (LTO) | (none) | a9/a10 | Phase 3: LTO passes |
v333 (LLC) | "llc" | a11/a12 | Phase 4: Code generation |
Each vector element is a 32-byte std::string with SSO. At exit, elements are serialized via malloc(8 * count) for the pointer array and malloc(len+1) + memcpy for each string.
Architecture Bitmask Validation
Architecture validation in sub_9624D0 uses a 64-bit bitmask 0x60081200F821:
// The decompiler renders the mask inline; it must live in a local
// (or memory operand) for _bittest64 to take its address.
uint64_t mask = 0x60081200F821;
unsigned offset = arch_number - 75;
if (offset > 0x2E || !_bittest64((__int64 *)&mask, offset))
    // error: "is an unsupported option"
Valid architectures (bit positions): SM 75, 80, 86, 87, 88, 89, 90, 100, 103, 110, 120, 121. The a/f sub-variants share the base SM number for bitmask validation but receive distinct routing in sub_95EB40.
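The mask can be decoded mechanically to confirm that list -- each set bit at position b corresponds to SM 75+b:

```python
MASK = 0x60081200F821

# Recover the valid SM list from the bit positions (offset = SM - 75).
# 0x2F = 47 covers the highest valid offset (121 - 75 = 46).
valid_sms = [75 + bit for bit in range(0x2F) if (MASK >> bit) & 1]
assert valid_sms == [75, 80, 86, 87, 88, 89, 90, 100, 103, 110, 120, 121]
```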
Compilation Mode Flags Bitmask (a13)
The a13 parameter in sub_9624D0 is an IN/OUT bitmask tracking compilation mode:
| Bit/Mask | Source Flag | Meaning |
|---|---|---|
0x07 | (default) | Phase control: all phases active |
0x10 | -g, --generate-line-info | Debug/line-info enabled |
0x20 | -gen-lto, -gen-lto-and-llc | LTO generation enabled |
0x21 | -gen-lto | Gen-LTO mode |
0x23 | -lto | Full LTO mode |
0x26 | -link-lto | Link-LTO mode |
0x43 | --emit-optix-ir | OptiX IR emission mode |
0x80 | -gen-opt-lto | Optimized LTO lowering |
0x100 | --nvvm-64 | 64-bit NVVM mode |
0x200 | --nvvm-32 | 32-bit NVVM mode |
0x300 | (mask) | 64/32-bit mode bits mask |
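The update pattern the table implies -- mode-setting flags replace everything except the 64/32-bit bits -- can be shown in a short sketch (the helper name is hypothetical; the masking expression is taken from the flag descriptions above):

```python
def set_mode(a13, mode_bits):
    """Sketch of how mode-setting flags update a13: the 64/32-bit bits
    (mask 0x300) are preserved, all other mode bits are replaced."""
    return (a13 & 0x300) | mode_bits

a13 = 0x100                  # --nvvm-64 seen earlier in the argument loop
a13 = set_mode(a13, 0x43)    # --emit-optix-ir
assert a13 == 0x143          # OptiX mode set, 64-bit bit preserved
```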
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
sub_8F9C90 | 0x8F9C90 | 10,066 B | Real main entry point |
sub_8FE280 | 0x8FE280 | ~35 KB | Flag translation tree builder (nvcc -> cicc) |
sub_8FE150 | 0x8FE150 | -- | Tree lookup (lower_bound + insert) |
sub_8FDFD0 | 0x8FDFD0 | -- | Tree insert + rebalance |
sub_8FD0D0 | 0x8FD0D0 | -- | Architecture flag scanner (first pass) |
sub_900130 | 0x900130 | 39 KB | CLI processing Path A (12 params) |
sub_902D10 | 0x902D10 | ~9 KB | Path A orchestrator |
sub_904450 | 0x904450 | -- | Push flag to argument vector |
sub_905880 | 0x905880 | ~6 KB | EDG frontend stage |
sub_905EE0 | 0x905EE0 | 43 KB | Path A multi-stage pipeline driver |
sub_908220 | 0x908220 | -- | LLC output callback (ID 56993) |
sub_908850 | 0x908850 | -- | Triple construction (nvptx64-nvidia-cuda) |
sub_9085A0 | 0x9085A0 | -- | OPT output callback (ID 64222) |
sub_95EB40 | 0x95EB40 | 38 KB | 3-column architecture mapping table builder |
sub_9624D0 | 0x9624D0 | 75 KB | Flag catalog (4 output vectors, ~111 flags) |
sub_1262860 | 0x1262860 | -- | Path B simple dispatch |
sub_1265970 | 0x1265970 | 48 KB | Path B multi-stage pipeline driver |
Global Variables
| Address | Variable | Purpose |
|---|---|---|
qword_4F6D2A0 | Flag tree root | std::map root for sub_8FE280 |
dword_4F6D2A8 | Flag tree sentinel | tree.end() |
qword_4F6D2B0 | Flag tree root node | Root node pointer |
qword_4F6D2B8 | Flag tree begin | Leftmost node (begin iterator) |
qword_4F6D2C8 | Init guard | Set to 1 after sub_8FE280 first call |
byte_4F6D2D0 | Partial-link flag | Set by --partial-link |
byte_4F6D2DC | LLP64 flag | Set by --force-llp64 |
unk_4F06A68 | Data model width | 8 = 64-bit, 4 = 32-bit |
unk_4D0461C | Address space 3 flag | Enables p3:32:32:32 in datalayout |
byte_4F6D280 | Wizard mode | Set by NVVMCCWIZ=553282 |
Cross-References
- Entry Point & CLI -- full sub_8F9C90 analysis, Path A/B dispatch, wizard mode
- CLI Flag Inventory -- complete flag listing across all five parsing sites
- Optimization Levels -- O0-O3 and fast-compile tier pipeline details
- Environment Variables -- NVVMCCWIZ, NV_NVVM_VERSION
- EDG Frontend -- what happens after EDG flags are forwarded
- OptiX IR -- OptiX IR emission pipeline
- Optimizer -- how -opt=N and fast-compile flags affect the optimization pipeline
EDG 6.6 Frontend
NVIDIA licenses the Edison Design Group (EDG) C/C++ front end — a commercial compiler frontend used by several major compilers including Intel ICC. In cicc v13.0, EDG version 6.6 occupies 3.2 MB of code (0x5D0000–0x8F0000), making it the largest single subsystem in the binary. Unlike most modern compilers that parse directly to an SSA-based IR, EDG operates as a source-to-source translator: it parses CUDA C++ source code and emits transformed C code containing CUDA runtime API calls. This output is then fed into a second compilation phase that produces NVVM IR (LLVM bitcode). This two-stage design means the CUDA language extensions (kernel launch syntax, memory space qualifiers, device/host function annotations) are resolved entirely within EDG, and the LLVM-based backend never sees raw CUDA syntax.
The EDG frontend is configured at compile time through 737 #define macros, including GCC 8.1 emulation mode and Clang 9.1 emulation mode. Exceptions are disabled by default — CUDA device code cannot use C++ exceptions — while RTTI remains enabled for dynamic_cast support in host-side code that interacts with device objects.
| EDG version | 6.6 (string: "Based on Edison Design Group C/C++ Front End, version 6.6") |
| Entry symbol | lgenfe_main (string at sub_617BD0) |
| GCC emulation | 8.1 (DEFAULT_GNU_VERSION = 80100) |
| Clang emulation | 9.1 (DEFAULT_CLANG_VERSION = 90100) |
| C++ standards | C++98, C++11, C++14, C++17, C++20, C++23 (unk_4F07778 = year code) |
| C standards | C99, C11, C18, C23 |
| Exceptions | Disabled by default (DEFAULT_EXCEPTIONS_ENABLED = 0) |
| RTTI | Enabled by default (DEFAULT_RTTI_ENABLED = 1) |
| Target model | LP64 (TARG_SIZEOF_POINTER = 8, TARG_SIZEOF_LONG = 8) |
| Backend | C-codegen (BACK_END_IS_C_GEN_BE = 1) — emits C source, not LLVM IR directly |
| Functions | ~5,000 in range, 300+ above 5KB |
Architecture
The compilation flow through EDG has four major phases: CLI parsing (282-case switch), translation unit initialization (keyword tables, parser bootstrapping), parsing and semantic analysis (the bulk of the 3.2 MB), and backend code emission (generating three output files: .int.c for internal declarations, .device.c for device code, and .stub.c for host-side launch stubs). Error recovery uses setjmp/longjmp — any of the 478 call sites that invoke the abort handler (sub_721090) will unwind back to the orchestrator rather than crashing the process.
sub_5D2A80 (orchestrator, setjmp error recovery)
│
├─ sub_617BD0 (lgenfe_main: 282-case CLI switch, 737 config #defines)
│ ├─ sub_610260 (register 300+ CLI options)
│ └─ sub_6140E0 (option fetcher loop)
│
├─ sub_8D0BC0 (translation unit init)
│ ├─ sub_706250 (keyword table: ~350 keywords via sub_885C00)
│ ├─ sub_858C60 (parser entry)
│ └─ sub_709290 (finalize)
│
├─ sub_709330 ("Generating Needed Template Instantiations", "Wrapping up translation unit")
│
└─ sub_5E3AD0 (backend entry: "Generating NVVM IR")
├─ Opens .int.c / .device.c / .stub.c output files
├─ sub_5DB980 (top-level declaration dispatcher)
│ ├─ sub_5E13C0 (function declaration printer, 44KB)
│ ├─ sub_5DBFC0 (expression printer, 41KB, 61 self-references)
│ ├─ sub_5DFD00 (statement printer, 26KB)
│ ├─ sub_5D80F0 (initializer printer)
│ ├─ sub_5DAD30 (struct/union/enum printer)
│ └─ sub_5DF1B0 (inline asm printer)
│
└─ dlopen("libTileIRCompiler_shared.so") [optional, gated by dword_4D045A0]
└─ dlsym("cudacc_back_end") — 17-entry function pointer table
Timer callbacks record "Front end time", "Back end time", and "Total compilation time" via sub_7211D0.
Orchestrator — sub_5D2A80
The master entry point for the entire frontend. Uses setjmp for non-local error recovery — when any of the ~5,000 EDG functions detects an unrecoverable error (type system inconsistency, parser corruption, internal assertion failure), it calls sub_721090, which longjmps back to this function. The 478 call sites that reference the abort handler demonstrate just how pervasive error checking is throughout the frontend — roughly 10% of all functions in the EDG range can trigger a fatal abort.
| Global | Purpose |
|---|---|
unk_4D045D8 | Phase callback (prints "Generating NVVM IR" etc.) |
unk_4D04744 | Timer enable flag |
unk_4F074B0 | Error flag (frontend errors occurred) |
unk_4F074A8 | Warning count |
qword_4F076F0 | Input source filename |
Frontend Entry — sub_617BD0 (lgenfe_main)
At 123KB and 3,113 decompiled lines, lgenfe_main is the largest function in the EDG range. The name "lgenfe" stands for "LLVM-generating front end" — a hint that this function was originally designed for a different backend before NVIDIA adopted the EDG+LLVM architecture. The function is divided into three distinct regions: a massive 282-case switch for CLI option parsing (2,000 lines), a post-parse validation phase that checks for conflicting options and enforces CUDA-specific constraints, and a file I/O setup phase that installs 11 signal handlers and returns a pointer to the configured compilation context.
Signature: (int argc, __int64 argv).
Structure
| Region | Lines | Content |
|---|---|---|
| A | 164–2157 | 282-case switch on option ID (v6) |
| B | 2157–2700 | Post-parse validation and cross-option consistency |
| C | 2700–3113 | File I/O setup, 11 signal handlers, return &qword_4D046F0 |
Architecture Parsing (case 0x52)
compute_75, compute_80, compute_86, compute_87, compute_88, compute_89
compute_90, compute_90a
compute_100, compute_100a, compute_100f
compute_103, compute_103a, compute_103f
compute_110, compute_110a, compute_110f
compute_120, compute_120a, compute_120f
compute_121, compute_121a, compute_121f
Storage: unk_4D045E8 = SM number, unk_4D045E4 = a suffix flag, unk_4D045E0 = f suffix flag.
Configuration Emission (case 0xE1)
Emits 737 #define macros to configure the EDG compiler. Key defines:
| Define | Value | Meaning |
|---|---|---|
VERSION_NUMBER | "6.6" | EDG frontend version |
EDG_MAIN | "lgenfe_main" | Entry point symbol |
DEFAULT_GNU_VERSION | 80100 | Emulate GCC 8.1 |
DEFAULT_CLANG_VERSION | 90100 | Emulate Clang 9.1 |
DEFAULT_EXCEPTIONS_ENABLED | 0 | CUDA: no exceptions |
TARG_SIZEOF_POINTER | 8 | 64-bit pointers |
TARG_SIZEOF_LONG_DOUBLE | 16 | 128-bit long double |
TARG_LITTLE_ENDIAN | 1 | x86-64 host |
USE_SOFTFLOAT | 1 | Software FP for constexpr |
ABI_COMPATIBILITY_VERSION | 9999 | Maximum ABI compat |
MODULE_MAX_LINE_NUMBER | 250000 | Max lines per module |
CLI Option Registration — sub_610260
Registers ~300 options via sub_6101D0(id, name, flag, ...). CUDA-specific options include:
| ID | Name | Purpose |
|---|---|---|
| 51 | no-device-int128 | Disable __int128 on device |
| 59 | emit-llvm-bc | Emit LLVM bitcode directly |
| 60 | device-debug | Device-side debug info |
| 68 | force-volatile | Force volatile on memory space (global/shared/constant/local/generic/all) |
| 73 | kernel-params-are-restrict | All kernel pointer params are __restrict__ |
| 82 | nv_arch | compute_XX architecture selection |
| 93 | device-c | Separate compilation mode |
| 105 | tile-only | TileIR-only compilation |
| 124 | extended-lambda | Extended lambda support (--expt-extended-lambda) |
| 132 | emit-lifetime-intrinsics | LLVM lifetime intrinsics |
Translation Unit Processing
Translation unit processing is where EDG transitions from CLI configuration to actual compilation. The init function sets up the lexer, allocates the translation unit data structure (416 bytes), populates the keyword table with ~350 entries, and enters the recursive-descent parser. EDG uses a keyword-registration model where each keyword is individually registered with its token ID — this allows NVIDIA to add CUDA-specific keywords (like __shared__ or __nv_fp8_e4m3) without modifying the core parser grammar.
Init — sub_8D0BC0
- Reset token state (dword_4F063F8 = 0)
- Call sub_727950 (lexer init)
- Allocate 416-byte TU object via sub_823970
- Call sub_706250 — keyword table init (~350 keywords)
- Call parser entry (sub_858C60 or PCH path sub_852E40)
- Call sub_709290 — finalize
Keyword Registration — sub_706250
30KB. Calls sub_885C00(token_id, "keyword_string") ~350 times. Initializes 30+ subsystems before keyword registration. Categories:
- C89 keywords: auto, break, case, const, continue, default, do, double, else, ...
- C99 additions: _Bool, _Complex, _Generic, _Atomic, restrict, inline
- C11/C23: _Static_assert, _Thread_local, _Alignas, _Alignof, constexpr, typeof
- C++ keywords: class, template, virtual, namespace, using, try, catch, throw, ...
- C++20: co_yield, co_return, co_await, requires, concept
- Type traits (~80): __is_pod, __is_abstract, __is_trivially_copyable, __has_virtual_destructor, ...
- NVIDIA extensions: __nv_is_extended_device_lambda_closure_type, __nv_is_extended_host_device_lambda_closure_type, __nv_is_extended_device_lambda_with_preserved_return_type
- EDG internal: __edg_type__, __edg_vector_type__, __edg_neon_vector_type__, __edg_scalable_vector_type__
Version-gated by dword_4F077C4 (language mode), unk_4F07778 (standard year), qword_4F077B4 (feature flags).
Finalization — sub_709330
Strings: "Generating Needed Template Instantiations", "Wrapping up translation unit". Calls sub_8B18F0 for C++ template instantiation when dword_4F077C4 == 2.
Preprocessor
EDG includes its own preprocessor rather than relying on an external cpp. This is standard for EDG-based compilers — the preprocessor is tightly integrated with the parser to handle complex interactions between macros and C++ syntax (e.g., __VA_OPT__ in C++20, which requires the preprocessor to understand syntactic context). The preprocessor occupies ~250KB across four major functions and maintains a 99-entry predefined macro table plus a 25-entry feature-test macro table.
Token Scanner — sub_7B8B50 (59KB)
The main preprocessor tokenizer. Handles all C/C++ token kinds: identifiers, numbers (delegates to sub_7B40D0), string literals, operators, punctuators, UCN sequences. Detects C++20 module/import keywords via string comparison.
Numeric Literal Parser — sub_7B40D0 (42KB)
Second-largest preprocessor function. Handles: integer suffixes (u/U/l/L/ll/LL), float suffixes (f/F/l/L), hex floats (0x...p...), binary literals (0b...), C++14 digit separators (').
Macro Expander — sub_81B8F0 (77KB)
The central macro expansion engine. Features:
- __VA_ARGS__ (C99) and __VA_OPT__ (C++20) support
- 99-entry predefined macro table at off_4B7C440 (stride 40 bytes)
- 25-entry feature-test macro table at off_4B7C360
- Recursion limit: 300 expansions (error 0xE3)
- Intrinsic type-trait macros: __type_pack_element, __is_signed, __make_integer_seq, __is_pointer
Character Scanner — sub_7BC390 (29KB)
Giant switch on character value. Handles trigraph sequences, line splices, multi-byte characters, comment detection (// and /*).
Parser & Declaration Processing
The parser subsystem is the largest part of the EDG frontend — over 1 MB of code spread across dozens of functions. EDG uses a recursive-descent parser augmented with a declaration-specifier state machine. The state machine design is necessary because C/C++ declaration specifiers can appear in any order (const unsigned long long int and int long unsigned long const are identical), requiring the parser to accumulate specifiers into bitmasks and resolve the final type only after all specifiers have been consumed.
NVIDIA's major contribution to the parser is the CUDA type extension infrastructure: 19 new FP8/FP6/FP4/MX-format type tokens (339–354) for Blackwell's tensor core operations, 9 address-space qualifier tokens (272–280) for GPU memory spaces, and 4 memory-space declaration specifiers (133–136) that piggyback on the existing width-modifier field. These extensions are grafted onto EDG's type system in a way that minimizes changes to the core parser logic — CUDA qualifiers reuse existing state variables with previously-unused value ranges.
Declaration Specifier State Machine — sub_672A20 (132KB, 4,371 lines)
The central parser function and one of the most complex functions in the binary. A while(2)/switch dispatcher on token codes from word_4F06418[0] with ~80 case labels. It accumulates type specifiers, qualifiers, storage-class specifiers, and CUDA address-space qualifiers from the token stream into a set of bitmask variables, then constructs the final type node from the accumulated state.
State Variables
| Variable | Stack | Bits | Role |
|---|---|---|---|
v325 | [rsp+B8h] | uint | Type specifier kind (see table below) |
v327 | [rsp+C0h] | uint64 | Specifier category bitmask |
v307 | [rsp+90h] | int | CV-qualifier accumulation bits |
v302 | [rsp+78h] | uint | Long count (0=none, 1=long, 2=long long) |
v305 | [rsp+84h] | uint | Signedness/width — reused for CUDA (4–7) |
v299 | [rsp+68h] | int | _Complex (1) / _Imaginary (2) tracking |
Type Specifier Kind (v325)
| Value | Meaning | Token Case |
|---|---|---|
| 0 | None yet | — |
| 2 | char | 80 |
| 3 | wchar_t | 165 |
| 4 | bool / _Bool | 128 / 120 |
| 5 | float | 126 |
| 6 | double | 127 |
| 7 | void | 180 |
| 8 | signed / __int8 | 93 / 239 |
| 9 | __float128 | 331 |
| 12 | int (explicit) | 89 |
| 14 | __float16 | 332 |
| 15 | short / half | 85 |
| 16 | _Float16 | 333 |
| 17 | __bf16 | 334 |
| 19 | bfloat16 | 335 |
| 20 | Resolved typedef/CUDA type name | scope lookup |
| 21 | struct/union/enum tag | 101/104/151 |
| 23 | decltype() | 183 |
| 24 | auto (deduced) | 186 |
| 25 | Resolved identifier type | C++ lookup |
| 26 | Error recovery type | diagnostic |
Specifier Bitmask (v327)
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | Storage class (extern/static/etc.) |
| 1 | 0x2 | CV-qualifier seen |
| 2 | 0x4 | Type specifier seen |
| 3 | 0x8 | friend specifier |
| 4 | 0x10 | __declspec / attribute seen |
| 5 | 0x20 | explicit specifier |
| 6 | 0x40 | inline specifier |
| 7 | 0x80 | _Thread_local / thread_local |
| 10 | 0x400 | typeof / decltype |
| 12 | 0x1000 | __declspec() already processed |
| 13 | 0x2000 | explicit(bool) already processed |
| 14 | 0x4000 | _Noreturn / [[noreturn]] |
| 15 | 0x8000 | _Atomic |
CV-Qualifier Bits (v307)
| Bit | Mask | Qualifier |
|---|---|---|
| 0 | 0x01 | const (case 81) |
| 1 | 0x02 | volatile (case 107) |
| 2 | 0x04 | restrict / __restrict (cases 118/119) |
| 3 | 0x08 | __unaligned (case 263 with parens) |
| 4 | 0x10 | __ptr32 (case 264) |
| 5 | 0x20 | __ptr64 (case 265) |
| 6 | 0x40 | __sptr / __uptr (case 266) |
Duplicate CV qualifiers trigger diagnostic 83.
CUDA Memory Space Tokens (133–136)
These piggyback on the signedness/width field v305 with values 4–7:
| Token | Keyword | v305 | v325 | Formula |
|---|---|---|---|---|
| 133 | __shared__ | 4 | 2 | Special case |
| 134 | __device__ | 5 | 8 | token - 129 |
| 135 | __constant__ | 6 | 8 | token - 129 |
| 136 | __managed__ | 7 | 8 | token - 129 |
Clean separation: values 0–3 = standard C width modifiers, 4–7 = CUDA address-space qualifiers. The type-construction switch handles both ranges.
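The token-to-field mapping can be captured in a few lines. A sketch assuming the table above (the function name is hypothetical):

```python
def memory_space_fields(token):
    """Map CUDA memory-space tokens 133-136 to (v305, v325).

    __shared__ (133) is special-cased with a different type-specifier
    kind; the rest use the decompilation's "token - 129" formula.
    """
    assert 133 <= token <= 136
    if token == 133:                # __shared__
        return 4, 2
    return token - 129, 8           # 134 -> 5, 135 -> 6, 136 -> 7
```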
CUDA Extended Type Tokens (339–354)
| Token | Type | Format |
|---|---|---|
| 236 | __nv_fp8_e4m3 | FP8 |
| 339 | __nv_fp8_e5m2 | FP8 |
| 340–343 | __nv_fp8x{2,4}_e{4m3,5m2} | FP8 vector |
| 344–345 | __nv_fp6_e{2m3,3m2} | FP6 |
| 346–347 | __nv_fp6x2_e{2m3,3m2} | FP6 vector |
| 348–349 | __nv_mxfp8_e{4m3,5m2} | MX-format FP8 |
| 350–351 | __nv_mxfp6_e{2m3,3m2} | MX-format FP6 |
| 352 | __nv_mxfp4_e2m1 | MX-format FP4 |
| 353 | __nv_satfinite | Saturation type |
| 354 | __nv_e8m0 | Exponent-only E8M0 |
All resolve via sub_6911B0() → type node, then set v325=20, v327|=4.
CUDA Address Space Qualifier Tokens (272–280)
| Token | Keyword | Space ID | Handler |
|---|---|---|---|
| 272 | __attribute__((address_space(N))) | parsed int | sub_6210B0 |
| 273 | __global__ | 0 | sub_667B60(0,...) |
| 274 | __shared__ (addr space) | 2 | sub_667B60(2,...) |
| 275 | __constant__ (addr space) | 3 | sub_667B60(3,...) |
| 276 | __generic__ | — | sub_72B620(type, cv) |
| 277 | __nv_tex_surf_handle_t | — | sub_72BA30(unk_4F06A51) |
| 278 | __nv_buffer_handle_t | — | sub_72BA30(unk_4F06A60) |
| 279 | __nv_grid_constant | — | sub_72C390() |
| 280 | __nv_is_extended_device_lambda | — | sub_72C270() |
Type Construction Functions
| Function | Purpose | Trigger |
|---|---|---|
| sub_72BA30(code) | Fundamental signed integer type | int, short, long, long long |
| sub_72BC30(code) | CUDA extended-width integer | CUDA mode + v305 > 3 |
| sub_72BCF0(code) | Unsigned fundamental type | unsigned combos |
| sub_72BDB0(code) | CUDA unsigned extended type | CUDA mode + unsigned |
| sub_72BF70() | float type | v325 == 5 |
| sub_72C030() | double type | v325 == 6 |
| sub_72C0F0() | long double type | long + double |
| sub_72C1B0() | __float128 type | v325 == 9 |
| sub_72C610(kind) | Float-by-kind (mapped from v325) | FP8/FP6/BF16/etc. |
| sub_72C6F0(kind) | _Complex float variant | v299 == 1 |
| sub_72C7D0(kind) | _Imaginary float variant | v299 == 2 |
| sub_72C930(code) | Error/placeholder type | diagnostic issued |
| sub_72CBA0() | Dependent type | v325 == 25 |
| sub_72CBE0(...) | __int128 type | v325 == 1 |
| sub_73C570(type, cv, flags) | Apply CV-qualifiers to type | post-construction |
Accumulation Flow
- Initialize: all state variables to 0
- Loop: read word_4F06418[0], dispatch through switch — set bitmask bits, update kind/cv/width
- Exit: unrecognized token → LABEL_8 (default exit)
- Type construction: switch on v325 × v302 × v305 → call appropriate sub_72B*/sub_72C*
- CV application: sub_73C570 wraps the type with const/volatile/restrict
- Return: type stored at ds->field_272, CV bits at ds->field_120
Declaration Specifier Parser — sub_7C0F00 (184KB, 3,953 lines)
Uses goto-driven dispatch (393 LABEL_ references) — NOT a switch/case. This is a massive state machine for declaration specifier resolution. Self-recursive at line 2407 with flags=20 for nested declarator parsing.
Top-Level Declaration Parser — sub_662DE0 (61KB)
Declarator parsing — handles pointer (*), reference (&/&&), array ([]), and function (()) declarators. Uses SSE __m128i for bulk struct copying of 64-byte EDG type nodes.
Overload Resolution — sub_6523A0 (64KB)
The master overload resolution function. Given a declaration being introduced and a set of existing candidates from name lookup, it decides whether the declaration is a new overload, a redeclaration, or an error. At 2,448 decompiled lines with 39 diagnostic call sites, it is one of the heaviest diagnostic emitters in the frontend.
Candidate collection uses a 72-byte ranking context (v320 on stack) and dispatches to one of three collectors: sub_644100 for non-member/ADL candidates, sub_648CF0 for member + using-declaration candidates (chosen when C++ mode, prior declaration exists, and the class has base classes or is a template), or sub_6418E0 for C-linkage functions. The best candidate is selected by sub_641B60.
__builtin_ prefix forwarding (lines 2060-2162): after resolution, if the resolved symbol is a bodyless non-member function, the resolver checks if a compiler builtin equivalent exists. It hardcodes three function names by length: "abs" (3), "ceil" (4), "strlen" (6). For each, it constructs "__builtin_" + name in a scratch buffer at qword_4F06C50, looks it up via sub_878540, then compares parameter types via sub_8DED30(type1, type2, 0x100004) (exact match + qualification conversion). On match, the builtin's scope entry is linked into the user function's auxiliary data at offset +256 field 8.
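The forwarding check can be sketched as follows. This is a reconstruction, not the decompiled code: the helper name `builtin_equivalent` is hypothetical, and the scratch-buffer construction at qword_4F06C50 is modeled here with a plain `std::string`.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Sketch of the __builtin_ prefix forwarding check (helper name
// hypothetical). The resolver hardcodes three candidates and keys the
// comparison on length before comparing the bytes, mirroring the
// "by length" dispatch seen in the decompilation.
inline std::string builtin_equivalent(const char *name) {
    static const char *candidates[] = {"abs", "ceil", "strlen"};
    size_t n = std::strlen(name);
    for (const char *c : candidates) {
        if (std::strlen(c) == n && std::strcmp(c, name) == 0)
            return std::string("__builtin_") + name;  // scratch-buffer construction
    }
    return {};  // no compiler builtin equivalent; resolution stands as-is
}
```

In the binary, the constructed name is then looked up via sub_878540 and the parameter lists are compared before the builtin is linked in.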
OpenMP variant dispatch (lines 727-752): when unk_4D03A10 is set, the resolver renames the declaration to "<name>$$OMP_VARIANT%06d" using a monotonic counter unk_4D03A0C. This creates unique internal names for each device/host specialization.
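The renaming scheme is simple enough to sketch directly. The function name is hypothetical; the real counter is the global unk_4D03A0C, modeled here as a function-local static.

```cpp
#include <cstdio>
#include <string>

// Sketch of the $$OMP_VARIANT internal renaming (helper name
// hypothetical). Each call consumes one value of a monotonic counter,
// producing the zero-padded "%06d" suffix observed in the binary.
inline std::string omp_variant_name(const std::string &name) {
    static unsigned counter = 0;  // stands in for unk_4D03A0C; never reset
    char suffix[8];
    std::snprintf(suffix, sizeof suffix, "%06u", counter++);
    return name + "$$OMP_VARIANT" + suffix;
}
```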
Constexpr/consteval propagation (lines 2288-2301): gated by unk_4F07778 (C++ standard year). For C++11 and later, byte +204 of the scope entry is bit-packed with three globals: bits 5-6 = unk_4F06C58 (constexpr disposition), bits 1-2 = unk_4F06C5A (consteval disposition), bits 3-4 = unk_4F06C59 (immediate-function flag). Diagnostic 2383 fires on constexpr mismatch between declaration and definition.
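The bit-packing of byte +204 can be expressed as a small helper. This is a sketch with hypothetical parameter names; the bit positions are as recovered above.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the scope-entry byte +204 packing (parameter names
// hypothetical). Bits 5-6 carry the constexpr disposition, bits 1-2
// the consteval disposition, bits 3-4 the immediate-function flag.
inline uint8_t pack_constexpr_byte(unsigned constexpr_disp,
                                   unsigned consteval_disp,
                                   unsigned immediate_flag) {
    return static_cast<uint8_t>(((constexpr_disp & 3u) << 5) |
                                ((consteval_disp & 3u) << 1) |
                                ((immediate_flag & 3u) << 3));
}
```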
Device/host overload sets: CUDA allows the same function name to have both __host__ and __device__ overloads. EDG does not treat execution space as part of the function signature for overload resolution purposes -- the standard C++ overload rules apply first, and execution space filtering happens later during code generation. The $$OMP_VARIANT renaming mechanism is used for OpenMP dispatch variants that need distinct host/device specializations, but regular CUDA __host__/__device__ overloads rely on the backend's execution space filtering rather than frontend overload resolution. This means that if two functions have identical C++ signatures but differ only in __host__ vs __device__, they are treated as redeclarations (not overloads) at the EDG level, and the execution space annotation at scope entry offset +198 determines which version survives into device or host code.
CUDA Memory Space Processing — sub_6582F0 (22KB)
Validates __shared__, __constant__, __managed__ attributes on declarations. Emits a diagnostic for automatic variables declared in inappropriate memory spaces.
Type System
Type Node Layout (192 bytes = 12 x __m128i)
| Offset | Size | Field |
|---|---|---|
| +8 | 8 | Next pointer (linked lists) |
| +40 | 8 | Name pointer |
| +48 | 1 | Declaration kind byte |
| +80 | 1 | Entity kind byte |
| +140 | 1 | TYPE KIND DISCRIMINATOR (the central dispatch key) |
| +160 | 8 | Inner/child type pointer (typedef chains, pointer bases) |
| +168 | 8 | Member list / parameter chain |
| +173 | 1 | Specifier/node kind byte |
| +176 | 2 | Entity kind (uint16, dispatch key for constexpr evaluator) |
| +185 | 1 | CV-qualifier bits (bit 0=const, 1=volatile, 2=restrict) |
| +200 | 1 | Attribute flags |
Type kind discriminator values at offset +140:
| Value | Type | Notes |
|---|---|---|
| 0 | void | |
| 1 | error type | Sentinel |
| 2–4 | fundamental (char, int, ...) | |
| 5 | pointer | Follows +160 chain |
| 6 | pointer-to-member | |
| 7 | function type | Complex: 17 sub-kinds for calling conventions |
| 8 | array | Element count at +128, element type at +160 |
| 9–11 | class / struct / union | Members at +168 |
| 12 | typedef / cv-qualified | Follow +160 for underlying type (critical: skip in type-walk loops) |
| 13 | enum | |
| 14 | void (incomplete) | |
| 15 | vector | Element count at +128 |
| 19 | decltype | |
| 21 | placeholder / auto |
Scope Table Entry (776 bytes)
Indexed by dword_4F04C64 into base qword_4F04C68:
| Offset | Field |
|---|---|
| +0 | Scope identifier |
| +4 | Scope kind (5=namespace, 6=class, 7=function, 8=block, 9=enum, 12=template) |
| +6–10 | Flag bytes |
| +24 | Name list head |
| +32 | Name list tail |
| +208 | Class type pointer |
| +232 | Deferred list |
| +328 | Template info |
| +552 | Parent scope index |
| +624 | Declaration pointer |
| +680 | Linkage specification |
Type Comparison — sub_7386E0 (23KB)
The core type equivalence engine. Takes two type node pointers packed in an __int128 and a flags word, returns boolean equality. The flags word controls comparison mode: bits 0-1 select cv-qualifier strictness (0=strict, 1=relaxed, 2=overload), bit 2 enables template matching (class-equivalence shortcuts), and bit 5 enables anonymous-class structural comparison.
Entry sequence: both types are first canonicalized through sub_72EC50, which peels through chains of non-template typedef aliases. The canonicalizer checks three fields on the elaborated type node: +173 == 12 (typedef kind), +176 == 1 (single-member), and +170 bit 4 == 0 (no template specialization). If all hold, it unwraps one level via sub_72E9A0 and loops. This means typedef int MyInt; typedef MyInt YourInt; canonicalizes YourInt directly to int.
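The peeling loop can be sketched with a simplified node. The struct layout here is hypothetical; the named fields stand in for the byte offsets the real canonicalizer checks (+173, +176, +170 bit 4, and the +160 chain).

```cpp
#include <cassert>

// Minimal sketch of the typedef-peeling canonicalizer. Field names are
// hypothetical stand-ins for the recovered byte offsets.
struct TypeNode {
    int kind;              // stands in for byte +173 (12 = typedef)
    bool single_member;    // stands in for +176 == 1
    bool template_spec;    // stands in for +170 bit 4
    TypeNode *underlying;  // stands in for the +160 chain
};

inline TypeNode *canonicalize(TypeNode *t) {
    // Unwrap one alias level per iteration, as sub_72EC50 does via
    // sub_72E9A0, until a non-typedef (or template alias) is reached.
    while (t && t->kind == 12 && t->single_member && !t->template_spec)
        t = t->underlying;
    return t;
}
```

With `typedef int MyInt; typedef MyInt YourInt;`, the loop takes `YourInt` through `MyInt` straight to `int` in two iterations.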
After canonicalization, a quick-reject compares three header bytes without recursing: byte +24 (type kind) must match exactly, bytes +25 XOR must be zero for bits 0x03 (const/volatile) and 0x40 (restrict), and byte +26 XOR must be zero for bit 0x04. Any mismatch short-circuits to return 0.
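The quick-reject reduces to three masked byte comparisons. A sketch, treating the type node as a raw byte buffer (the function name is hypothetical; offsets and masks are as recovered):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the three-byte quick-reject. Byte +24 is the type kind,
// byte +25 carries cv bits, byte +26 extra flags; only the masked
// bits participate in the XOR comparison.
inline bool quick_reject(const uint8_t *a, const uint8_t *b) {
    if (a[24] != b[24]) return true;          // kind must match exactly
    if ((a[25] ^ b[25]) & 0x43) return true;  // const/volatile (0x03) + restrict (0x40)
    if ((a[26] ^ b[26]) & 0x04) return true;  // byte +26, bit 0x04
    return false;                             // survives to the main switch
}
```

Bits outside the masks (e.g. 0x04 in byte +25) are ignored, so two nodes differing only in unmasked bits proceed to the full recursive comparison.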
The main switch dispatches on 38 type kinds. Key cases for CUDA:
- Case 1 (fundamental): compares sub-kind at +56, extra flags at +58 (bits 0x3A), and the base type chain at +72. For integer sub-kind (sub_kind == 'i'), follows a resolution chain to find the underlying class scope. In template matching mode (flags bit 2), uses sub_8C7520 to check whether two class instantiations share the same primary template, then sub_89AB40 to compare template argument lists. This path handles CUDA's exotic numeric types (__nv_fp8_e4m3, __nv_fp8_e5m2, etc.), which are represented as fundamental types with distinct sub-kinds.
- Case 3 (class/struct/union): fast identity via scope pointer equality, then unique-ID shortcut via dword_4F07588. For anonymous classes with template matching, calls sub_740200 to extract canonical member lists and performs structural comparison. This is relevant for CUDA lambda closure types, which are anonymous classes.
- Case 33 (using-declaration/alias): in overload mode (flags bit 1), performs a hash table lookup via *qword_4D03BF8 to retrieve base class triples and compares them element-by-element. This ensures that two using declarations resolving to different base classes are treated as distinct for overload discrimination.
Overload mode specifics (flags & 2): the post-switch check additionally verifies that both types agree on the presence/absence of the +80 "extra declaration" pointer. Template parameters are forced unequal (never match for overload purposes without being identical). Scope pointer equivalence is verified via unique-ID for using-declaration discrimination.
CUDA type equivalence: the NVIDIA-specific float types (__nv_fp8_e4m3, __bf16, _Float16, etc.) each have distinct sub-kind values at type node +56 (see the type mangling table: sub-kind 0 = _Float16, 1 = __fp16, 9 = __bf16, 0xA = _Float16 alternate, 0xB = _Float32, 0xC = _Float64, 0xD = _Float128). The type comparison treats them as distinct fundamental types -- _Float16 and __fp16 are NOT equivalent despite both being 16-bit floats. The half type in CUDA maps to _Float16 (sub-kind 0 or 0xA depending on context), while __half in cuda_fp16.h is a wrapper struct (type kind 9, class/struct), so half and __half are never type-equivalent at the EDG level. User code relies on implicit conversions defined in the CUDA headers, not on type equivalence.
Type-to-String Emitter — sub_74A390 (29KB, 19 callers)
The backbone type printer. Walks type nodes recursively, emitting textual representation for diagnostics. Handles NVIDIA-specific types: __surface_type__, __texture_type__, __nv_bool.
IL Tree Infrastructure
EDG represents parsed code as an Intermediate Language (IL) tree — a rich AST that preserves full C++ semantic information including template instantiation state, scope chains, and type qualifiers. The IL is not LLVM IR; it is EDG's proprietary tree representation that predates the LLVM integration. All semantic analysis, template instantiation, and overload resolution operate on this tree.
The IL tree is traversed by four structurally identical walker functions that share the same 87 node-type dispatch table. The walkers are instantiated from a common template with different callback functions — a design pattern where the traversal logic is fixed but the action at each node is parameterized through function pointers stored in six global variables. This callback-driven walker system is central to EDG's architecture: template instantiation, type checking, code emission, and tree copying all use the same walker infrastructure with different callbacks.
| Function | Size | Self-recursive Calls | Purpose |
|---|---|---|---|
| sub_7506E0 | 190KB | 297 | Primary walker |
| sub_760BD0 | 109KB | 427 | Parallel walker (deeper traversal) |
| sub_75C0C0 | 87KB | 316 | Third-pass walker |
| sub_766570 | 148KB | 2 | Copier/transformer (takes callback params) |
Walker Callback System
Six global function pointers form the visitor dispatch table:
| Global | Role |
|---|---|
| qword_4F08028 | Node pointer remapper (called before recursion) |
| qword_4F08020 | Linked-list child remapper |
| qword_4F08038 | String field processor |
| qword_4F08030 | Pre-visit callback (return nonzero to skip) |
| qword_4F08040 | Post-visit callback |
| dword_4F08014 | Skip-shared-nodes flag |
| dword_4F08018 | Clear/detach mode (null out fields for ownership transfer) |
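The walker pattern itself can be sketched compactly. This is a reconstruction with hypothetical names: the two callbacks model qword_4F08030 (pre-visit, nonzero return skips the subtree) and qword_4F08040 (post-visit, runs after the children); the real walkers dispatch over 87 node types rather than a uniform child list.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Sketch of the callback-driven IL walker (all names hypothetical).
// The traversal logic is fixed; the action at each node is
// parameterized through the callbacks.
struct ILNode {
    int id;
    std::vector<ILNode *> children;
};

struct WalkerCallbacks {
    std::function<int(ILNode *)> pre;    // nonzero => skip this subtree
    std::function<void(ILNode *)> post;  // runs after the children
};

inline void walk(ILNode *n, const WalkerCallbacks &cb) {
    if (cb.pre && cb.pre(n)) return;
    for (ILNode *c : n->children) walk(c, cb);
    if (cb.post) cb.post(n);
}
```

Template instantiation, type checking, and tree copying would each supply a different callback pair over the same `walk`.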
IL Node Types (87 types, from walker case labels)
| ID | Type | ID | Type |
|---|---|---|---|
| 1 | source_file | 28 | integral_constant |
| 2 | scope (15 sub-kinds) | 29 | float_constant |
| 3 | type_qualifier | 30 | expression (generic) |
| 4 | simple_type | 41 | call_expression |
| 5 | pointer_type | 42 | cast_expression |
| 6 | function_type (17 sub-kinds) | 43 | conditional_expression |
| 7 | class_type | 44 | string_literal |
| 8 | enum_type | 48 | template_argument (4 sub-kinds) |
| 9 | array_type | 59 | concept_expression (10 sub-kinds) |
| 10 | bitfield_type | 65 | type_list (core linked list) |
| 13 | statement (30+ sub-kinds) | 75 | block/compound_statement |
| 23 | scope_entry (root) | 76 | access_specifier |
Deep Copy — sub_766570 with sub_8C2C50
sub_8C2C50 calls sub_766570 with copy callback sub_8C38E0 and list-copy callback sub_8C3810. Node size table at qword_4B6D500[node_type] provides memcpy sizes. Critical for template instantiation.
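The size-table-driven copy is straightforward to sketch. The table contents and node layout here are invented for illustration; the real table lives at qword_4B6D500 and is indexed by the node-type id.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Sketch of the node-size-table copy (table values hypothetical).
// The real copier additionally invokes per-field callbacks to fix up
// child pointers; this shows only the memcpy core.
static const size_t kNodeSize[] = {0, 24, 48, 16};  // indexed by node type

inline void *clone_node(const void *node) {
    int type;
    std::memcpy(&type, node, sizeof type);     // node type is the first field
    size_t n = kNodeSize[type];
    void *copy = std::malloc(n);
    std::memcpy(copy, node, n);                // shallow byte copy of the node
    return copy;                               // child fixup is the callback's job
}
```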
Constexpr Evaluator
The constexpr evaluator is arguably the most technically impressive subsystem in the EDG frontend. It is a complete tree-walking interpreter that can execute arbitrary C++ code at compile time, implementing the full C++20 constexpr specification including heap allocation (constexpr new), string literals, virtual function dispatch, and complex control flow. At 317KB for the expression evaluator alone, plus 77KB for the statement executor and ~200KB in supporting functions, it constitutes nearly 20% of the entire EDG frontend.
The evaluator operates on EDG's IL tree directly — it does not compile to bytecode or any intermediate form. Instead, it recursively walks expression and statement nodes, maintaining its own memory model (a 3-tier page arena), variable bindings (an open-addressing hash table), and lifetime tracking (scope epoch counters). This design trades execution speed for implementation simplicity and guaranteed semantic fidelity with the compiler's own type system.
Signature:
bool constexpr_eval_expr(
constexpr_ctx *ctx, // a1: evaluation context (hash table, arena, flags)
expr_node **expr, // a2: expression AST node
__m128i *result, // a3: output value slot (16 or 32 bytes)
char *frame_base // a4: stack frame base pointer for lifetime tracking
);
Expression Evaluator — sub_786210 (317KB, 9,075 lines)
The largest function in the entire EDG frontend. Two-level dispatch: outer switch on expression kind *(a2+24), inner switch on operator code *(a2+56) with 124 cases.
Outer Switch — Expression Kinds
| Kind | Hex | Meaning | Notes |
|---|---|---|---|
| 0 | 0x00 | Void/empty | Sets ctx+132 flag bits |
| 1 | 0x01 | Operator expression | → 124-case inner switch on *(a2+56) |
| 2 | 0x02 | Variable reference | Hash table lookup, kind==1(const) or kind==3(constexpr) |
| 3 | 0x03 | Function reference / enumerator | Subkind==5: has constexpr body → recurse |
| 4 | 0x04 | Literal (int/float constant) | Immediate return — value is in the node |
| 5–6 | 0x05–06 | String / compound literal | C++20 mode required (dword_4F077C4 == 2) |
| 7 | 0x07 | Function call | Most complex case (~1200 lines) |
| 10 | 0x0A | Parenthesized expression | Recurse on a2[7] |
| 11 | 0x0B | Member access (->) | Navigate member hierarchy via type-size table |
| 17 | 0x11 | Lambda expression | Save/restore ctx+72, execute body via sub_7987E0 |
| 18 | 0x12 | Capture variable | Hash table lookup by a2[7] |
| 20 | 0x14 | Address-of | Set flags a3+8 = 0x20 (IS_SYMBOLIC) |
| 23 | 0x17 | sizeof / alignof | Delegate to sub_620D80 |
| 24 | 0x18 | Subscript (array[index]) | Bounds check, compute elem_size * index |
| 27 | 0x1B | Implicit conversion | Navigate chain, recurse on inner |
| 31 | 0x1F | Requires expression (C++20) | Execute body via sub_79B7D0 |
| 32 | 0x20 | Type trait | sub_693DC0 → xmmword_4F08280/xmmword_4F08290 |
| 33 | 0x21 | SFINAE / substitution failure | Template context check, sub_6F2300 |
Inner Switch — Operator Codes (124 cases, selected)
| Cases | Category | Operations |
|---|---|---|
| 0–1 | Assignment | = / initialization (ref types: 32-byte memcpy) |
| 3–4 | Conversion | Lvalue-to-rvalue via sub_7A0070 |
| 5 | Type cast | static_cast — massive dispatch: int→int(sub_622780), float→float(sub_709EF0), int→float(sub_710280), ptr→ptr(sub_770010) |
| 14–15 | Member access | . and -> — offset via sub_8D5CF0, virtual base via sub_771030 |
| 16–17 | Pointer arithmetic | Subtraction, ptrdiff_t via sub_7764B0 |
| 20, 29 | Comparison | ==, != via sub_7759B0 |
| 26–28 | Unary | ++, --, unary minus (sub_621DB0) |
| 30–31 | Vector ops | Element-wise comparison loop, broadcast |
| 39–45 | Arithmetic | +(sub_621270), -(sub_6215F0), *(sub_621F20), /(sub_6220A0), %(sub_6220C0), <<(sub_70BBE0), >>(sub_70BCF0) — all with overflow/divzero checks |
| 46–49 | Bitwise | &, \|, ^, ~ |
| 50–57 | Logical | &&, \|\| with short-circuit evaluation |
| 58–59 | Detailed comparison | Integer(sub_621000), float(sub_70BE30), pointer(address+symbolic) |
| 64 | Spaceship | <=> → strong_ordering values at unk_4F06BD8–unk_4F06C30 |
| 73–84 | Compound assignment | += through ^= with lifetime validation, const-check (diag 0x1318) |
| 91–93 | Conditional | Ternary ?:, array subscript (bounds-checked, error 0xA84) |
| 94–95 | Virtual dispatch | Vtable lookup → sub_79CCD0 |
| 96–97 | Allocation | Placement new / operator new |
| 103 | Exception | throw (always fails in constexpr) |
| 105–108 | Delegated | → sub_77FCB0 (builtin operators) |
Value Slot Layout (16 bytes at a3)
| Offset | Size | Field |
|---|---|---|
| 0–7 | 8 | Primary value (integer, IEEE float, or arena pointer) |
| 8 | 1 | Flags byte (see below) |
| 9–11 | 3 | Alignment info, compound assignment tracking |
| 12–15 | 4 | Scope epoch ID (lifetime validation) |
Extended slot (32 bytes for reference types) adds secondary address at +16 and frame base at +24.
Flags Byte (offset +8)
| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | IS_POINTER | Value is an indirect pointer |
| 1 | 0x02 | IS_PAST_END | One-past-the-end pointer |
| 2 | 0x04 | HAS_CLEANUP | Destructor chain at +16 |
| 3 | 0x08 | HAS_SUBOBJECT | Refers to a subobject |
| 4 | 0x10 | HAS_BITFIELD | Bitfield offset in bits 8–31 |
| 5 | 0x20 | IS_SYMBOLIC | Unresolved symbolic reference |
| 6 | 0x40 | IS_CONST | From a const declaration |
| 7 | 0x80 | IS_ARRAY_MEMBER | Part of array storage |
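The flags byte maps onto a small enum. The constant names below are hypothetical (the masks are as recovered), and the bitfield-offset helper reflects one plausible reading of "bitfield offset in bits 8–31", treating the flags byte as the low byte of a 32-bit word.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the value-slot flags byte (names hypothetical, masks as
// recovered from the evaluator).
enum SlotFlags : uint8_t {
    IS_POINTER      = 0x01,  // value is an indirect pointer
    IS_PAST_END     = 0x02,  // one-past-the-end pointer
    HAS_CLEANUP     = 0x04,  // destructor chain at +16
    HAS_SUBOBJECT   = 0x08,  // refers to a subobject
    HAS_BITFIELD    = 0x10,  // bitfield offset in bits 8-31
    IS_SYMBOLIC     = 0x20,  // unresolved symbolic reference
    IS_CONST        = 0x40,  // from a const declaration
    IS_ARRAY_MEMBER = 0x80,  // part of array storage
};

// Assumed layout: flags in the low byte, bitfield offset above it.
inline uint32_t bitfield_offset(uint32_t flags_word) {
    return (flags_word & HAS_BITFIELD) ? (flags_word >> 8) : 0;
}
```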
Statement Executor — sub_795660 (77KB)
Dispatch on *(a2+40) — statement kind:
| Case | Kind | Notes |
|---|---|---|
| 0 | Declaration | Arena alloc → eval initializer → insert into scoped hash table |
| 1–4 | If / if-else / if-init / if-constexpr | Condition → bool via sub_620EE0 → branch |
| 5 | While loop | Step counter at ctx+120, limit from qword_4D042E0 (~1M). Error 0x97F on exceeded. |
| 6 | Jump (break/continue/goto) | Sets control flow bits: bit 1=continue, bit 2=break, bit 3=goto |
| 7,15,24 | Null/empty | Return success |
| 8 | Return | Walk call chain at ctx+72, store result, set "returned" flag |
| 11 | Expression statement | Evaluate for side effects via sub_7987E0 |
| 12 | For loop | Init → alloc → [condition → body → increment → cleanup] loop |
| 13 | Do-while | Delegates to sub_7A0E60 |
| 14 | Range-based for | 4 temp slots via sub_77A250, iterator advance via sub_7A0470 |
Memory Management — 3-Tier Page Arena
| Tier | Location | Page Size | Threshold | Purpose |
|---|---|---|---|---|
| Primary | ctx+16/ctx+24 | 64KB | default | Expression evaluation temporaries |
| Secondary | ctx+144/ctx+152 | 64KB | lazy init (ctx+132 & 8) | Variable declarations |
| Tertiary | ctx+80 | 64KB | nullable | String/compound literals |
Overflow: allocations >1024 bytes go to heap via sub_822B10(size+16), forming a singly-linked list from ctx+32. Freed by walking until scope epoch matches.
Value slot header: type pointer at offset -8 (8 bytes), lifetime bits at offset -9 (1 byte, bit 0 = "initialized").
Scope epoch: monotonic counter at ctx+128. Hash table at ctx+56/ctx+64 maps epoch → page state. Arena rewound on scope exit.
Hash Table (ctx+0/ctx+8)
Open-addressing with 16-byte entries [key, value]. Hash: key_pointer >> 3. Collision: linear probing. Doubles at 2 * count > capacity (via sub_7704A0). Secondary table at ctx+56/ctx+64/ctx+68 uses 4-byte integer keys (scope epoch IDs).
Diagnostic Codes
| Code | Hex | Meaning |
|---|---|---|
| 61 | 0x3D | Division by zero |
| 2431 | 0x97F | Step limit exceeded |
| 2692 | 0xA84 | Array index out of bounds |
| 2695 | 0xA87 | Unsupported jump in constexpr |
| 2698 | 0xA8A | Null pointer dereference |
| 2705 | 0xA91 | Negative shift count |
| 2707 | 0xA93 | Integer overflow/underflow |
| 2712 | 0xA98 | Use of uninitialized variable |
| 2721 | 0xAA1 | Not a constant expression (generic) |
| 2727 | 0xAA7 | Invalid type conversion |
| 2735 | 0xAAF | Pointer below array start |
| 2751 | 0xABF | Access outside lifetime |
| 2766 | 0xACE | Modification through null pointer |
| 2959 | 0xB8B | Missing return in constexpr function |
| 3007 | 0xBBF | reinterpret_cast in constexpr |
| 3022 | 0xBCE | Call to undefined constexpr function |
Silent mode: ctx+132 bit 5 (0x20) suppresses diagnostics (SFINAE contexts).
Constexpr and CUDA: Host-Side Evaluation of Device Code
A key architectural question for any CUDA compiler is whether constexpr functions annotated __device__ are evaluated at host compile time. In cicc v13.0, the answer is yes, conditionally. The constexpr evaluator operates entirely within the EDG frontend, which runs on the host. When a constexpr __device__ function is used in a context requiring a constant expression (template argument, array bound, static_assert, constexpr variable initializer), the evaluator executes it using its tree-walking interpreter regardless of the function's execution space annotation. The execution space attributes (__device__, __host__, __global__) are semantic annotations for code generation, not for the constexpr evaluator -- the evaluator sees only the IL tree and does not distinguish between host and device function bodies.
This works because EDG's constexpr evaluator uses software floating point (USE_SOFTFLOAT = 1 in the 737-define configuration block). All floating-point arithmetic in constexpr contexts goes through the softfloat library (sub_70B8D0 add, sub_70B9E0 sub, sub_70BBE0 mul, sub_70BCF0 div, sub_709EF0 convert) rather than the host CPU's FPU. This guarantees that constexpr evaluation of device code produces results consistent with IEEE 754 semantics regardless of the host platform's floating-point behavior. The softfloat library handles all precision levels including _Float16, __bf16, _Float32, _Float64, and __float128.
SM architecture gates influence constexpr relaxations. The global qword_4F077A8 (SM version) gates certain constexpr features:
- SM >= 89 (qword_4F077A8 > 0x15F8F): relaxed constexpr rules for variables with incomplete types
- dword_4F077C4 == 2: C++20 features including constexpr new, constexpr string literals, and constexpr member access (expression evaluator cases 5/6)
- dword_4D04880: C++14 relaxed constexpr (loops, local variable mutation, multiple return statements)
- C++23/26 extensions: constexpr try-catch (statement executor case 14), constexpr placement new (expression evaluator case 103), constexpr dynamic_cast (error 0xBB7)
The evaluator enforces a step limit (qword_4D042E0, default ~1M iterations) to prevent infinite loops in constexpr evaluation. This limit applies uniformly to both host and device constexpr functions. When exceeded, diagnostic 0x97F ("constexpr evaluation step limit exceeded") is emitted.
One important consequence: __global__ (kernel) functions cannot be constexpr because they have no return value in the conventional sense -- they are launched asynchronously. The parser enforces this at the declaration specifier level, not in the constexpr evaluator.
Supporting Functions
| Function | Size | Role |
|---|---|---|
| sub_79CCD0 | 67KB | Object member accessor (base classes, virtual bases, union tracking) |
| sub_799B70 | 33KB | Aggregate initializer (arrays, structs, designated init, brace elision) |
| sub_79B7D0 | 29KB | Function call evaluator (argument binding, body execution, recursion limits) |
| sub_7987E0 | 11KB | Statement list executor entry |
| sub_77FCB0 | 150KB | Top-level dispatch (80 expression types + 62-entry intrinsic table) |
| sub_7764B0 | 18KB | Type size calculator (Robin Hood hash memoization, 64MB cap) |
| sub_7707D0 | — | Clone constexpr object |
| sub_7790A0 | — | Trivial aggregate copy |
| sub_7A0070 | — | Lvalue-to-rvalue load |
| sub_77F5C0 | — | Bounds check (ptr, type → idx, err, size) |
| sub_76FFC0 | — | Run cleanup/destructor chain |
Bigint Library (sub_621*)
| Function | Operation |
|---|---|
| sub_621000 | compare(a, width_a, b, width_b) → {-1,0,1} |
| sub_621270 | add(dst, src, width, overflow_out) |
| sub_6215F0 | sub(dst, src, width, overflow_out) |
| sub_621F20 | mul(dst, src, width, overflow_out) |
| sub_6220A0 | div(dst, src, width, divzero_out) |
| sub_6220C0 | mod(dst, src, width, divzero_out) |
| sub_621DB0 | negate(dst) |
| sub_620EE0 | to_int(value, width, result_out) |
Float Library (sub_70B*)
| Function | Operation |
|---|---|
| sub_70B8D0 | add(type, lhs, rhs, dst, inexact, exception) |
| sub_70B9E0 | sub |
| sub_70BAF0 | negate |
| sub_70BBE0 | mul |
| sub_70BCF0 | div |
| sub_70BE30 | compare(type, lhs, rhs, nan_result) → {-1,0,1,NaN} |
| sub_709EF0 | convert(src, src_prec, dst, dst_prec, inexact) |
Key Globals
| Variable | Purpose |
|---|---|
| dword_4F077C4 | C++ standard version (2 = C++20, enables constexpr new/string) |
| dword_4D04880 | C++14 relaxed constexpr (enables loops, mutation) |
| qword_4D042E0 | Max constexpr evaluation steps (~1M) |
| xmmword_4F08280 | Canonical constexpr TRUE |
| xmmword_4F08290 | Canonical constexpr FALSE |
| qword_4F08380 | Global type-size hash table base |
| qword_4F08060 | Global allocator function pointer (constexpr new detection) |
CUDA-Specific Extensions
NVIDIA's extensions to the EDG frontend fall into four categories: memory space qualifiers that map to GPU address spaces, kernel launch syntax that gets lowered to CUDA runtime API calls, registration stubs that tell the CUDA runtime about compiled kernels, and atomic builtin generation for the C++11 atomics model on GPU. These extensions are concentrated in the 0x650000–0x810000 range and reference SM architecture version globals extensively — many features are gated by qword_4F077A8 comparisons against architecture thresholds.
CUDA Keyword Extensions
NVIDIA extends the EDG keyword table with execution space qualifiers, memory space qualifiers, and type intrinsics. These exist in four distinct layers -- registered keywords, declaration specifier tokens, address space attribute tokens, and extended type tokens -- each integrated differently into the EDG parser infrastructure.
The critical architectural fact: __device__, __host__, and __global__ are not keywords in the EDG keyword table. They are processed through the C/C++ attribute system, where EDG maps them to internal single-character codes. The declaration specifier state machine (sub_672A20) and the address space handler together resolve these attributes into symbol-table fields that downstream passes consume.
Token ID Inventory
NVIDIA uses four non-contiguous token ID ranges:
| Range | Category | Count | Registration |
|---|---|---|---|
| 133-136 | Memory space declaration specifiers | 4 | Hardcoded in sub_672A20 switch |
| 236, 339-354 | Extended numeric types (FP8/FP6/FP4/MX) | 17 | Resolved via sub_6911B0 |
| 272-280 | Address space qualifier / special type tokens | 9 | Hardcoded handlers in sub_672A20 |
| 328-330 | NVIDIA type trait intrinsics | 3 | Registered via sub_885C00 in sub_706250 |
Only tokens 328-330 use the standard sub_885C00(token_id, "keyword") registration path. All other CUDA tokens are wired directly into parser switch cases, bypassing the keyword table entirely.
Execution Space Qualifiers -- Attribute Path
__device__, __host__, and __global__ are recognized by the attribute parser, which stores them as single-character codes at declaration context offset +269. The complete internal attribute character map (sub_5C79F0 at 0x5C79F0):
| Char | Hex | Attribute | Scope Entry Bits |
|---|---|---|---|
| 'V' | 0x56 | __host__ | -- (host is the default) |
| 'W' | 0x57 | __device__ | +198 bit 4 (0x10) |
| 'X' | 0x58 | __global__ | +198 bit 4 (0x10) AND bit 5 (0x20) |
| 'Y' | 0x59 | __tile_global__ | -- |
| 'Z' | 0x5A | __shared__ | -- (stored in +136 as space code 3) |
| '[' | 0x5B | __constant__ | -- (stored in +136 as space code 2) |
| '\' | 0x5C | __launch_bounds__ | Arguments at decl+336 struct |
| ']' | 0x5D | __maxnreg__ | -- |
| '^' | 0x5E | __local_maxnreg__ | -- |
| '_' | 0x5F | __tile_builtin__ | -- |
| 'f' | 0x66 | __managed__ | -- (stored in +136 as space code 5) |
| 'k' | 0x6B | __cluster_dims__ | Arguments at cluster config struct |
| 'l' | 0x6C | __block_size__ | -- |
| 'r' | 0x72 | __nv_pure__ | -- |
The attribute character code at +269 is consumed by sub_6582F0 (declaration-side validation) and sub_65F400 (definition-side validation). These functions never see the CUDA qualifier as a keyword token -- they only see the resolved character code.
Execution space at scope entry offset +198 is the authoritative record of a function's execution space for all downstream passes:
- Bit 4 (0x10): function is __device__ or __global__ -- activates device-scope variable validation
- Bit 5 (0x20): function is __global__ (kernel entry point) -- triggers kernel metadata emission via sub_12735D0, which emits ("kernel", 1) to LLVM IR
- Bit 2 (0x04) at offset +199: full_custom_abi flag
When a function has bit 5 set, the attribute emitter also iterates the parameter array (40-byte entries at decl+16) and emits ("grid_constant", param_index) for each parameter where byte +33 is nonzero. The preserve-register struct at decl+336 (three int32 fields: data, control, after) is consumed and cleared (set to -1) after emission.
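Decoding the +198 byte reduces to a priority check, since __global__ sets both bits. A sketch with hypothetical enum names (bit masks as recovered):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of decoding the execution-space byte at scope entry +198
// (enum names hypothetical). __global__ sets bits 4 AND 5, so bit 5
// must be tested first.
enum class ExecSpace { Host, Device, Kernel };

inline ExecSpace exec_space(uint8_t byte198) {
    if (byte198 & 0x20) return ExecSpace::Kernel;  // bit 5: __global__
    if (byte198 & 0x10) return ExecSpace::Device;  // bit 4: __device__
    return ExecSpace::Host;                        // neither bit: host default
}
```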
Memory Space Declaration Specifiers (Tokens 133-136)
These piggyback on the signedness/width field v305 in the declaration specifier state machine with values 4-7, cleanly separated from the standard C width modifiers (0-3):
| Token | Keyword | v305 Value | v325 Value | Formula |
|---|---|---|---|---|
| 133 | __shared__ | 4 | 2 | Special case |
| 134 | __device__ | 5 | 8 | token - 129 |
| 135 | __constant__ | 6 | 8 | token - 129 |
| 136 | __managed__ | 7 | 8 | token - 129 |
The type construction switch in sub_672A20 branches on v305 > 3 to invoke CUDA-specific type constructors (sub_72BC30 for signed, sub_72BDB0 for unsigned) instead of the standard C type constructors used for v305 values 0-3.
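The fan-out from the table above can be sketched as follows. The control flow is reconstructed from the recovered formulas; `SpecState` and the helper names are hypothetical, while `v305`/`v325` mirror the decompiler's variable names.

```cpp
#include <cassert>

// Sketch of the token 133-136 fan-out (logic reconstructed). All four
// tokens share the "token - 129" formula for v305; __shared__ (133)
// gets the special-case v325 value.
struct SpecState { int v305 = 0; int v325 = 0; };

inline void apply_memspace_token(SpecState &s, int token) {
    s.v305 = token - 129;             // 133..136 -> 4..7
    s.v325 = (token == 133) ? 2 : 8;  // __shared__ special case
}

inline bool uses_cuda_type_ctor(const SpecState &s) {
    return s.v305 > 3;                // branch to sub_72BC30/sub_72BDB0
}
```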
Address Space Qualifier Tokens (272-280)
Processed by dedicated handlers in the declaration specifier parser:
| Token | Keyword | Handler | Argument |
|---|---|---|---|
| 272 | __attribute__((address_space(N))) | sub_6210B0 | Parses integer N |
| 273 | __global__ (addr space annotation) | sub_667B60(0, ...) | Space ID = 0 |
| 274 | __shared__ (addr space annotation) | sub_667B60(2, ...) | Space ID = 2 |
| 275 | __constant__ (addr space annotation) | sub_667B60(3, ...) | Space ID = 3 |
| 276 | __generic__ | sub_72B620(type, cv) | -- |
| 277 | __nv_tex_surf_handle_t | sub_72BA30(unk_4F06A51) | Texture/surface handle |
| 278 | __nv_buffer_handle_t | sub_72BA30(unk_4F06A60) | Buffer handle |
| 279 | __nv_grid_constant | sub_72C390() | Grid-constant marker |
| 280 | __nv_is_extended_device_lambda | sub_72C270() | Lambda closure check |
Note the dual role of __shared__ and __constant__: each appears both as a memory space declaration specifier (tokens 133 and 135) and as an address space qualifier (tokens 274-275); __global__ likewise has an address space qualifier form (token 273) alongside its execution space attribute path. The declaration specifier path stores the result in the symbol-table entry's memory_space_code at offset +136 and memory_space_flags at offset +156. The address space qualifier path stores the result in the EDG type node's qualifier word at offset +18 (values 1=global, 32=shared, 33=constant). Both representations flow downstream: the symbol-table code controls declaration validation, while the type qualifier controls LLVM pointer type construction in sub_911D10.
The __grid_constant__ qualifier (token 279, handler sub_72C390) marks kernel parameters as grid-constant -- the parameter is read-only across all thread blocks and may be placed in constant memory by the backend. This is an SM 70+ feature.
NVIDIA Type Trait Keywords (Tokens 328-330)
The only CUDA tokens registered through the standard sub_885C00 keyword registration path. Always registered -- not gated by any version, language mode, or feature flag:
| Token | Keyword | Registration |
|---|---|---|
| 328 | __nv_is_extended_device_lambda_closure_type | sub_885C00(328, ...) |
| 329 | __nv_is_extended_host_device_lambda_closure_type | sub_885C00(329, ...) |
| 330 | __nv_is_extended_device_lambda_with_preserved_return_type | sub_885C00(330, ...) |
These type traits are used by CUDA's extended lambda machinery to query whether a lambda closure type carries device or host-device execution space annotations. They participate in SFINAE and if constexpr contexts for compile-time dispatch between host and device lambda implementations.
The lambda mangling extensions in sub_80FE00 use the execution space information from these traits to choose between three proprietary Itanium ABI mangling prefixes: Unvdl (device lambda), Unvdtl (device template lambda), and Unvhdl (host-device lambda). The selection is based on flag byte +92 of the closure descriptor, where bit 5 (0x20) marks an extended CUDA lambda, bit 4 (0x10) marks host-device, and bit 2 (0x04) marks a template lambda.
Extended Numeric Type Tokens (236, 339-354)
Blackwell tensor core operations require exotic floating-point formats. Each of these tokens is resolved via sub_6911B0() to a type node; the declaration specifier state machine then sets v325=20 and v327 |= 4:
| Token | Type | Format | Width |
|---|---|---|---|
| 236 | __nv_fp8_e4m3 | FP8 | 8b |
| 339 | __nv_fp8_e5m2 | FP8 | 8b |
| 340-341 | __nv_fp8x2_e{4m3,5m2} | FP8 vector | 16b |
| 342-343 | __nv_fp8x4_e{4m3,5m2} | FP8 vector | 32b |
| 344-345 | __nv_fp6_e{2m3,3m2} | FP6 | 6b |
| 346-347 | __nv_fp6x2_e{2m3,3m2} | FP6 vector | 12b |
| 348-349 | __nv_mxfp8_e{4m3,5m2} | MX-format FP8 | 8b |
| 350-351 | __nv_mxfp6_e{2m3,3m2} | MX-format FP6 | 6b |
| 352 | __nv_mxfp4_e2m1 | MX-format FP4 | 4b |
| 353 | __nv_satfinite | Saturation modifier | -- |
| 354 | __nv_e8m0 | Exponent-only E8M0 | 8b |
These types are represented as fundamental types with distinct sub-kind values at type node +56 in the EDG type system. The type comparison engine (sub_7386E0, case 1) compares sub-kind, extra flags at +58 (bits 0x3A), and the base type chain at +72 to ensure each format is treated as a distinct type.
Attribute Processing Pipeline
The complete pipeline from CUDA source keyword to LLVM IR metadata:
CUDA source: __global__ void kernel() __launch_bounds__(256, 2)
|
v
Phase 1: Attribute parser → char code 'X' (0x58) at decl context +269
|
v
Phase 2: Declaration specifier state machine (sub_672A20)
→ scope entry +198 bit 5 set (kernel)
|
v
Phase 3: Post-parse fixup (sub_5D0FF0)
→ __launch_bounds__(256, 2) extracted to launch config struct
|
v
Phase 4: CUDA attribute validator (sub_826060)
→ validates __launch_bounds__ on __global__ function
→ diagnostic 0xDCE (3534) if __launch_bounds__ on non-kernel
→ diagnostic 0xE83 (3715) if values out of range
→ diagnostic 0xE87 (3719) if __launch_bounds__ + __maxnreg__ conflict
|
v
Phase 5: Attribute emission to LLVM IR (sub_12735D0)
→ emits ("kernel", 1) from bit 5 of decl+198
→ emits ("grid_constant", N) per qualifying parameter
|
v
Phase 6: Kernel metadata generation (sub_93AE30)
→ "nvvm.maxntid" = "256,1,1"
→ "nvvm.minctasm" = "2"
Memory Space Attributes
sub_6582F0 (22KB) and sub_65F400 (28KB) validate __shared__, __constant__, __managed__ on variable declarations and definitions respectively. Token cases 133-136 in the parser handle these as first-class declaration specifiers. The validation logic enforces CUDA semantics: __shared__ variables cannot have initializers (shared memory is not initialized on kernel launch), __constant__ variables must have static storage duration, and __managed__ variables require unified memory support on the target architecture.
Symbol Table Memory Space Encoding
Memory space is tracked in two locations within each symbol-table entry:
| Offset | Size | Field | Values |
|---|---|---|---|
| +136 | 1 byte | memory_space_code | 0=default, 1=__device__, 2=__constant__, 3=__shared__, 5=__managed__ |
| +156 | 1 byte | memory_space_flags | bit 0=device, bit 1=shared, bit 2=constant, bit 4=thread_local interaction |
| +157 | 1 byte | Extended flags | bit 0=managed |
The dual encoding exists because the flags are additive from parsed attributes (multiple attributes can be OR'd in) while the code is the single resolved value used by downstream passes. The code at +136 is set by sub_735FB0 (symbol entry constructor) and queried throughout the compiler.
Declaration-Side Validation -- sub_6582F0
The validation follows a ten-phase pipeline:
1. __managed__ pre-resolution: when dword_4F04C5C == dword_4F04C34 (host-only mode), managed variables are silently downgraded to __device__ (space code 1) and the extern flag is cleared.
2. Extern handling: sets bit 0 of decl context +122 and the is_extern tracking variables.
3. Type normalization: checks function-type declarations against CUDA criteria via sub_8D4C10; emits diagnostic 891 for function types with memory space.
4. Specifier processing: calls sub_6413B0 against the current compilation target.
5. Prior-declaration conflict detection: looks up the existing symbol and compares memory space codes. A mismatch with dword_4F077C4 == 2 (separate compilation) triggers diagnostic 172 (warning 4).
6. New symbol creation: sub_735FB0(type_ptr, space_code, target_id, is_new_decl).
7. __managed__ namespace binding: validates the namespace name via sub_703C10; checks class/struct type compatibility (diagnostic 1560 on failure).
8. Storage class adjustments: processes constant-space read-only flags.
9. Device-scope enforcement: when scope +198 bit 4 is set (inside a __device__/__global__ function), local variables cannot carry device memory qualifiers. Diagnostic 3484: "an automatic variable may not be declared as __device__". The memory space name in the message follows the priority cascade __constant__ > __managed__ > __shared__ > __device__.
10. Final fixup: type validation (sub_8D9350), attribute propagation (sub_8756F0), "main" function warnings (diagnostic 2948), thread-safety analysis (sub_826000).
Memory Space Mutual Exclusivity
The code consistently enforces these combinations:
| Combination | Diagnostic | Severity |
|---|---|---|
| __shared__ + __constant__ | 3481 | error |
| __constant__ + __managed__ | 3568 | error |
| __constant__ + __shared__ + __managed__ | 3568 | error |
| thread_local + __device__ | 892 | error |
| thread_local + any device space | 3578 | error |
| auto variable + __device__/__constant__/__managed__ | 3484 | error |
| __shared__ + initializer | 3510 | error |
| __constant__ in device-function scope | 3512 | error |
| register + device memory | 3485/3688 | error |
| volatile + __constant__ | 1378 | error |
| redeclaration with different memory space | 3499 | warning 5 |
The diagnostic name-string priority cascade (__constant__ > __managed__ > __shared__ > __device__) appears identically in six locations: sub_6582F0 lines 734-739, sub_65F400 lines 541-549 and 927-935, sub_5C6B80 lines 22-34, sub_667550 lines 87-98, and sub_5D9330 (the symbol printer).
Address Space Flow: EDG to LLVM to PTX
| CUDA Source | EDG Token | Symbol +136 | Type Qualifier +18 | LLVM AS | PTX Directive |
|---|---|---|---|---|---|
| __device__ int x; | 134 | 1 | 1 | 1 | .global |
| __shared__ int x; | 133 | 3 | 32 | 3 | .shared |
| __constant__ int x; | 135 | 2 | 33 | 4 | .const |
| __managed__ int x; | 136 | 5 | 1 | 1 | .global + runtime registration |
| (local in kernel) | -- | 0 | -- | 0/5 | .local/.param |
The EDG type node qualifier word (offset +18, masked to 0x7FFF) carries address space through the type system. During EDG-to-LLVM type translation, sub_911D10 reads this qualifier from pointer/reference types (kind 75/76) and maps to LLVM address space numbers via sub_5FFE90. __managed__ variables are compiled as __device__ (LLVM address space 1) with additional runtime registration calls generated by sub_806F60 for unified memory management.
Kernel Launch Lowering — sub_7F2B50 (16KB)
Transforms CUDA's <<<gridDim, blockDim, sharedMem, stream>>> kernel launch syntax into CUDA runtime API calls. The lowered sequence allocates a parameter buffer via cudaGetParameterBufferV2, copies kernel arguments into it, and launches with cudaLaunchDeviceV2. For the simpler launch path, it generates __cudaPushCallConfiguration followed by individual __cudaSetupArg/__cudaSetupArgSimple calls. This lowering happens entirely within EDG — by the time the code reaches the LLVM backend, kernel launches are ordinary function calls.
Registration Stub Generator — sub_806F60
Generates the __cudaRegisterAll function, which calls:
__cudaRegisterEntry, __cudaRegisterVariable, __cudaRegisterGlobalTexture, __cudaRegisterGlobalSurface, __cudaRegisterManagedVariable, __cudaRegisterBinary, ____cudaRegisterLinkedBinary.
Host-side stubs generated by sub_808590: "__device_stub_%s", "__cudaLaunch", "__cudaSetupArg", "__cudaSetupArgSimple".
Atomic Builtin Generator — sub_6BBC40 (34KB)
Constructs __nv_atomic_fetch_{add,sub,and,xor,or,max,min} names with type suffixes (_s, _f, _u) and width (_%u).
SM Architecture Gates
Two functions configure ~160 optimization/feature flags based on SM version:
| Function | Role | Thresholds |
|---|---|---|
| sub_60D650 | Optimization level → 109 unk_4D04* flags | Single integer parameter (O-level) |
| sub_60E7C0 | SM arch → 60 unk_4D04* feature flags | SM 75 (30399), SM 80 (40000), SM 90 (89999), SM 100 (109999), SM 120 (119999) |
Each flag is gated by a byte_4CF8* user-override check, preventing auto-configuration when the user explicitly sets a flag via CLI.
TileIR Backend
sub_5E3AD0 optionally loads libTileIRCompiler_shared.so via dlopen and looks up symbol "cudacc_back_end". A 17-entry function pointer table is passed. Gated by dword_4D045A0.
Diagnostic System
EDG's diagnostic system supports three output formats: human-readable terminal output (with ANSI color and word-wrapping), SARIF JSON for IDE integration, and a machine-readable log format for automated tooling. All three share the same diagnostic numbering scheme and severity classification. The terminal output handler alone is 37KB — it implements its own word-wrapping algorithm with configurable terminal width, recursive child diagnostic emission (for "note: see declaration of X" chains), and color coding by severity level.
Terminal Output — sub_681D20 (37KB)
Formats error/warning/remark messages with:
- Severity labels: remark (2), warning (4), caution (5), severe-warning (6), error (7–8), catastrophe (9–10), internal-error (11)
- Source location: file:line:col
- ANSI color escapes (gated by dword_4F073CC)
- Word-wrapping at dword_4D039D0 (terminal width)
- Recursive child diagnostic emission
SARIF JSON Output — sub_6837D0 (20KB)
Structured diagnostics for IDE integration, enabled by --diagnostics_format=sarif (CLI case 0x125, sets unk_4D04198 = 1). The output is a comma-separated stream of SARIF result objects -- NOT a complete SARIF envelope with $schema, runs[], etc. The caller or a post-processor is expected to wrap the stream in the standard SARIF container.
Each diagnostic emits one JSON object:
{
"ruleId": "EC<number>",
"level": "error",
"message": {"text": "<json-escaped message>"},
"locations": [
{
"physicalLocation": {
"artifactLocation": {"uri": "file://<path>"},
"region": {
"startLine": 42,
"startColumn": 17
}
}
}
],
"relatedLocations": [
{
"message": {"text": "see declaration of X"},
"physicalLocation": { ... }
}
]
}
Rule ID format: "EC" + decimal error number from the diagnostic record at offset +176. For example, EDG error 1234 becomes "EC1234".
Severity mapping (byte at diagnostic node +180):
| Severity | EDG Meaning | SARIF level |
|---|---|---|
| 4 | remark | "remark" |
| 5 | warning | "warning" |
| 7, 8 | error | "error" |
| 9 | catastrophe | "catastrophe" |
| 11 | internal error | "internal_error" |
Note that the SARIF specification only defines "none", "note", "warning", and "error" as standard levels. The "remark", "catastrophe", and "internal_error" values are EDG extensions -- consuming tools should treat unknown levels as "error".
Message text escaping: sub_683690 renders the diagnostic text into qword_4D039E8, then copies character-by-character into the output buffer, escaping " as \" and \ as \\. No other JSON escaping (e.g., control characters, Unicode) is applied.
Location resolution: sub_67C120 calls sub_729E00 to decompose the packed source location into (file-id, line, column), then sub_722DF0 to resolve the file-id to a filesystem path. The startColumn field is omitted when column is zero.
Related locations: the linked list at diagnostic node +72 chains "note" sub-diagnostics. Each is emitted as a relatedLocations array entry with its own message and physical location.
Filtering before emission: diagnostics pass through severity threshold check (byte_4F07481[0]), duplicate detection (byte_4CFFE80[4*errnum + 2] bit flags), pragma-based suppression (sub_67D520), and error limit check (unk_4F074B0 + unk_4F074B8 >= unk_4F07478). All filtering happens before the SARIF/text format branch.
Machine-Readable Log
Writes to qword_4D04908 in format: <severity-char> "<filename>" <line> <col> <message>\n. Severity chars from "RwweeccccCli": R=remark, w=warning, e=error, c=catastrophe.
Name Mangling (Itanium ABI)
EDG includes a complete implementation of the Itanium C++ ABI name mangling specification. NVIDIA extends the standard mangling with three proprietary prefixes (Unvdl, Unvdtl, Unvhdl) for device lambdas, device template lambdas, and host-device lambdas respectively. These extensions are necessary because CUDA's execution model requires distinguishing between host and device versions of the same lambda — they must have different mangled names to avoid linker collisions when both host and device code are linked into the same binary.
Address range 0x810000–0x8EFFFF:
| Function | Size | Role |
|---|---|---|
| sub_8E74B0 | 29KB | Primary mangling entry |
| sub_8E9FF0 | 26KB | Type mangling |
| sub_816460 | 24KB | Type component mangling |
| sub_813790 | 13KB | Expression mangling |
| sub_80E340 | 23KB | Builtin type mangling (incl. DF16_, DF16b, Cu6__bf16, u6__mfp8) |
| sub_80FE00 | 8KB | NVIDIA extension mangling (Unvdl, Unvdtl, Unvhdl) |
NVIDIA lambda mangling extensions (sub_80FE00): standard Itanium ABI uses Ul<params>E<index>_ for unnamed lambda types and Ut<index>_ for unnamed non-lambda types. NVIDIA adds three proprietary prefixes chosen based on flag byte +92 of the lambda's closure descriptor:
| Prefix | Meaning | Condition |
|---|---|---|
| Unvdl | __device__ lambda | flag_byte_92 & 0x20 set, not host-device, not template |
| Unvdtl | __device__ template lambda | flag_byte_92 & 0x20 set, flag_byte_92 & 4 set |
| Unvhdl | __host__ __device__ lambda | flag_byte_92 & 0x20 set, flag_byte_92 & 0x10 set |
The Unvhdl prefix carries three single-digit flags separated by underscores after the prefix: Unvhdl<index>_<has_explicit_return>_<is_host_device>_<has_template_params>_. Each flag is '0' or '1'. This is richer than the standard Ul which only encodes parameter types.
NVIDIA vendor type manglings (sub_80E340): the type mangler handles CUDA-specific types as Itanium vendor types (prefix u + length + name):
| Type | Mangling | Notes |
|---|---|---|
| __bf16 (bfloat16) | u6__bf16 or DF16b | ABI-gated: qword_4F077B4 lo32 selects vendor vs C++23 encoding |
| __mfp8 (FP8) | u6__mfp8 | NVIDIA micro-float 8-bit for transformer inference |
| __metainfo | U10__metainfo | Kernel parameter metadata type attribute |
| float80 | u7float80 | x87 extended precision (vendor type) |
The __bf16 mangling has a three-way gate reflecting the ongoing ABI transition: qword_4F077B4 lo32 != 0 selects "u6__bf16" (vendor type); hi32 == 0 selects "DF16b" (C++23 standardized P1467); otherwise qword_4F06A78 determines which encoding. The ABI version variable unk_4D04250 controls this and other encoding decisions, with known thresholds at 0x76BF (GCC 3.3 compat) and 0xC350 (GCC 12 compat).
Standard float types follow Itanium: _Float16 = "DF16_", __fp16 = "Dh", float = "f", double = "d", __float128 = "g", with the complex variants adding a 'C' prefix.
Key Global Variables
| Variable | Size | Role |
|---|---|---|
| dword_4F077C4 | 4 | Language mode: 0=neither, 1=C, 2=C++ |
| unk_4F07778 | 4 | C/C++ standard year (199711, 201103, 201402, 201703, 202002, 202310) |
| qword_4F077B4 | 8 | Dialect extension flags (lo=CUDA extensions, hi=GNU extensions) |
| dword_4F077BC | 4 | NVCC mode flag |
| dword_4F077C0 | 4 | GCC compatibility mode |
| qword_4F077A8 | 8 | SM architecture version (controls feature gates throughout) |
| word_4F06418 | 2 | Current parser token |
| qword_4F04C68 | 8 | Scope table base pointer (776-byte entries) |
| dword_4F04C64 | 4 | Current scope index |
| qword_4CF7CE0 | 8 | AST printer callback vtable |
| qword_4D03FF0 | 8 | Current translation unit pointer |
| qword_4D04908 | 8 | Machine-readable diagnostic log FILE* |
| qword_4F08028–qword_4F08040 | 48 | IL tree walker callback table |
| dword_4D045A0 | 4 | TileIR mode flag |
NVVM IR Generation
Between the EDG 6.6 frontend and the LLVM optimizer sits a layer that has no upstream LLVM equivalent: the NVVM IR generation subsystem. Its job is to translate the EDG intermediate language (IL) tree -- a C-level AST produced by EDG's source-to-source backend -- into LLVM IR suitable for the NVPTX target. This is cicc's equivalent of Clang's CodeGen library (lib/CodeGen/CGExpr.cpp, CGStmt.cpp, CGDecl.cpp, etc.), but it operates on EDG's proprietary IL node format rather than a Clang AST. Understanding this layer is essential because it determines every structural property of the LLVM IR that the optimizer and backend will see: address space annotations on pointers, alloca placement conventions, kernel metadata encoding, and the specific IR patterns used for CUDA-specific constructs like threadIdx.x or __shared__ memory.
The EDG frontend does not produce LLVM IR directly. Its backend mode (BACK_END_IS_C_GEN_BE = 1) emits transformed C code into .int.c, .device.c, and .stub.c files. A second compilation pass then parses these files back through EDG to produce an IL tree -- a typed, linked representation of every declaration, statement, and expression in the translation unit. The IR generation layer walks this IL tree recursively, creating LLVM BasicBlocks, Instructions, and GlobalVariables via a hand-rolled IR builder that directly manipulates LLVM's in-memory data structures. The result is a complete LLVM Module containing one function per device-side function definition, with kernel entry points annotated via nvvm.annotations metadata.
Dual-Path Architecture
One of the most distinctive features of cicc's IR generation is that two complete copies exist within the binary. This mirrors the dual-path design observed throughout cicc: Path A (LibNVVM API mode, 0x90xxxx) and Path B (standalone mode, 0x126xxxx).
| Component | Path A (LibNVVM) | Path B (Standalone) |
|---|---|---|
| Expression codegen | 0x91xxxx--0x94xxxx | 0x127xxxx--0x12Bxxxx |
| EmitExpr (master dispatch) | sub_91DF90 | sub_128D0F0 |
| EmitStmt (statement dispatch) | sub_9363D0 | (parallel at similar offset) |
| EmitFunction (entry block setup) | sub_946060 | (parallel) |
| GenerateFunctionProlog | sub_938240 | (parallel) |
| Builtin lowering mega-switch | sub_90AEE0 (109KB) | sub_12B3FD0 (103KB) |
| Bitfield load/store | sub_923780 / sub_925930 | sub_1282050 / sub_1284570 |
| Special variable codegen | sub_920430 / sub_922290 | sub_127F7A0 / sub_1285550 |
| Inline asm codegen | sub_932270 | sub_1292420 |
| Global variable codegen | sub_916430 | (parallel) |
| Type translation | sub_91AED0 | (parallel) |
| Kernel metadata emitter | sub_93AE30 | (parallel) |
These are not shared-library variations or template instantiations across different types. They are structurally identical copies of the same algorithms with the same string constants (e.g., "allocapt", "agg.result", "entry", "return", ".addr") and the same error messages (e.g., "unsupported expression!", "Argument mismatch in generation function prolog!"). The two copies use different calling conventions for their codegen context objects -- Path A passes codegen state through a flat struct with LLVM API vtable pointers, while Path B uses a pointer-to-pointer indirection scheme -- but the algorithmic logic and IR output are byte-for-byte identical.
The remainder of this page uses Path B addresses (the 0x12xxxxx range) as the primary reference because they correspond to the standalone compilation path that nvcc invokes, and because the B-series analysis reports provide the most detailed coverage of this path. Every function described here has a direct counterpart in Path A at the corresponding 0x9xxxxx address.
Address Map
| Address Range | Subsystem | Key Functions |
|---|---|---|
| 0x126A000--0x126BFFF | Volatile detection, alignment queries | sub_126A420 (IsVolatileAddress) |
| 0x1273000--0x1275FFF | Function attribute emission | sub_12735D0 (EmitFunctionAttrs), sub_1273F90 (AttributeReader) |
| 0x127A000--0x127CFFF | Type translation helpers | sub_127A030 (GetLLVMType), sub_127B390 (GetSMVersion), sub_127B420 (IsAddressOfExpr), sub_127B550 (FatalDiag) |
| 0x127D000--0x127FFFF | Constants, alloca creation, bool emission | sub_127D8B0 (EmitConstExpr), sub_127FC40 (CreateAlloca), sub_127FEC0 (EmitBoolExpr) |
| 0x1280000--0x1285FFF | Bitfield access, member loads, inline asm | sub_1282050 (EmitBitfieldStore), sub_1284570 (EmitBitfieldLoad), sub_1285290 (EmitAsmCall) |
| 0x1286000--0x128FFFF | L-value codegen, binary ops, expression dispatch | sub_1286D80 (EmitAddressOf), sub_128A450 (EmitCast), sub_128D0F0 (EmitExpr), sub_128F9F0 (EmitBinaryArithCmp) |
| 0x1290000--0x129AFFF | Control flow helpers, inline asm, printf lowering | sub_1290AF0 (SetInsertPoint), sub_1292420 (EmitInlineAsm), sub_12992B0 (LowerPrintfToVprintf) |
| 0x129B000--0x12AFFFF | Builtin helpers, atomic ops, surface/texture ops | sub_12A4D50 (CreateBasicBlock), sub_12A7DA0 (AtomicOps), sub_12ADE80 (SurfaceTexture) |
| 0x12B0000--0x12BFFFF | Builtin mega-switch | sub_12B3FD0 (BuiltinLowering, 103KB, 770 IDs) |
The IRGenState Object
Every codegen function receives a context object -- called IRGenState or CodeGenState in this wiki -- that carries all mutable state for the current function being compiled. Two distinct layouts exist depending on whether the context is accessed through the Path A flat struct or the Path B double-indirection pattern. Both layouts carry the same logical fields; the difference is structural.
Path B Layout (pointer-to-pointer pattern)
In Path B, the primary codegen context a1 is a CodeGenState** -- a pointer to a pointer. The outer pointer dereferences to a struct containing the core IR builder state, and sibling pointers at a1[1], a1[2], etc., reach related context objects:
| Access | Offset | Field | Purpose |
|---|---|---|---|
| *a1 | +0 | IRBuilder state | Current function, insert point, module |
| a1[1] | +8 | Insertion context | [0] = debug location, [1] = current BB, [2] = insertion sentinel |
| a1[2] | +16 | LLVM context/module | Module handle, LLVMContext |
| a1[4] | +32 | Module pointer | LLVM Module* |
| a1[5] | +40 | Type context | Type table for GetLLVMType, getIntNTy |
| a1[6] | +48 | Debug location | Current DebugLoc to attach to new instructions |
| a1[7] | +56 | Current BasicBlock | BB for instruction insertion |
| a1[8] | +64 | Insertion point | Iterator into BB's instruction list |
| a1[9] | +72 | Address space context | For alloca type creation |
| a1[19] | +152 | Cached printf alloca | Reused "tmp" alloca for vprintf buffer packing |
Path A Layout (flat struct, offsets from a1)
| Offset | Field | Purpose |
|---|---|---|
| +32 | Module pointer | LLVM Module* |
| +40 | IR builder | Current builder state |
| +48, +56 | Operand pair array | Base and count for metadata pairs |
| +96 | Current BasicBlock | Active BB |
| +104 | Insertion point | Iterator |
| +128 | Instruction creation vtable | Virtual dispatch for instruction emission |
| +136 | Emitter context | Vtable at [0], dispatch at vtable[2] |
| +192 | Current Function | LLVM Function* being populated |
| +200 | Return BB | The "return" basic block |
| +208 | Return value alloca | "retval" alloca or sret pointer |
| +240 | Has-cleanups flag | Nonzero when C++ destructors are pending |
| +344 | Module (kernel metadata) | Used by sub_93AE30 |
| +360/376 | In-kernel flag | Bit 0 set when compiling a __global__ function |
| +424 | Cleanup stack | Stack of pending destructor frames (24 bytes each) |
| +456 | Allocapt marker | The "allocapt" sentinel instruction |
The "allocapt" marker deserves special attention. When EmitFunction (sub_946060) creates the entry block, it inserts a dummy bitcast void to void instruction named "allocapt" as a sentinel. All subsequent alloca instructions created by CreateTmpAlloca (sub_921D70 / sub_127FC40) are inserted before this sentinel, ensuring that every alloca ends up clustered at the top of the entry block. This is a hard requirement for LLVM's mem2reg pass to promote stack slots to SSA registers. The allocapt marker is removed by a later cleanup pass.
EDG IL Node Layout
Every codegen function traverses EDG IL nodes -- linked structures that represent declarations, statements, and expressions from the parsed CUDA source. The node layout is consistent across all codegen paths:
Expression node (passed as a2 to EmitExpr):
| Offset | Field | Description |
|---|---|---|
| +0 | Type pointer | EDG type node (dereference for type info) |
| +18 | Qualifier word | 16-bit: bits 0--14 = qualifier ID, bit 15 = negation |
| +24 | Kind byte | Top-level expression category (1=operation, 2=literal, 3=member, 0x11=call, 0x14=decl-ref) |
| +25 | Flags byte | Bit 2 = assignment context (write-only) |
| +36 | Source location | Passed to debug info attachment |
| +56 | Sub-opcode / data | For kind=1: operator sub-opcode; for kind=2: literal data |
| +72 | Child/operand | Pointer to first child expression |
Type node (accessed via expression's type pointer):
| Offset | Field | Description |
|---|---|---|
| +8 | Type classification byte | 1--6 = float types, 11 = integer, 15 = pointer, 16 = vector |
| +128 | Byte size | Element count for arrays, byte size for scalars |
| +136 | Element size | Size in bits for non-typedef types |
| +140 | Type tag | 1=void, 8--11=aggregate (struct/union/class/array), 12=typedef alias, 16=__int128 |
| +144 | Flags | Bit 2 = is_bitfield, bit 3 = signed |
| +160 | Inner type / next | Followed when tag==12 (typedef stripping) |
| +176 | Element count | For array types |
The typedef-stripping idiom appears throughout every codegen function (15+ occurrences in EmitExpr alone):
for (t = *expr_type; *(BYTE*)(t + 140) == 12; t = *(QWORD*)(t + 160));
This walks through chains of typedef aliases (kind 12) until it reaches the canonical type.
Function Emission Pipeline
When cicc processes a device-side function, IR generation proceeds through a fixed sequence of stages. The entry point is EmitFunction (sub_946060), which sets up the function skeleton and then calls GenerateFunctionProlog (sub_938240) to emit parameter handling, followed by recursive statement emission.
Stage 1: Function skeleton (sub_946060).
Creates the LLVM Function* object, resolves the function type through the EDG typedef chain, and optionally sets a section name. Then creates two basic blocks: "entry" (the function entry point) and "return" (the single return block -- all return paths branch here). Inserts the "allocapt" sentinel into the entry block. For non-void functions, creates a "retval" alloca to hold the return value; for sret functions (returning aggregates), uses the first argument directly.
Stage 2: Function prolog (sub_938240).
Iterates the EDG parameter linked list (next pointer at offset +112, stride 40 bytes per LLVM argument slot) in lockstep with the LLVM function's argument list. For each parameter:
- If the first parameter has ABI kind 2 (sret), names it "agg.result" and advances.
- Unnamed parameters get the name "temp_param"; the implicit this parameter (flags bit 0 at offset +172) gets "this".
- Creates an alloca named <param_name>.addr via CreateTmpAlloca.
- Emits a store of the incoming SSA argument into the alloca.
- Registers the EDG declaration -> LLVM Value mapping in a hash table (open addressing, quadratic probing) for later lookup during expression codegen.
- Optionally emits "__val_param" temporaries for byval aggregate parameters.
Stage 3: Body emission (recursive emitStmt / EmitExpr).
Walks the IL tree for the function body, dispatching through the statement codegen switch and the expression codegen switch (detailed below).
Stage 4: Kernel metadata (sub_93AE30).
For __global__ functions, emits nvvm.annotations metadata: kernel flag, __launch_bounds__ parameters (nvvm.maxntid, nvvm.reqntid, nvvm.minctasm, nvvm.maxnreg), cluster dimensions (nvvm.cluster_dim, nvvm.blocksareclusters), and per-parameter metadata (alignment, grid_constant, hidden-parameter flags).
Stage 5: Function attributes (sub_12735D0).
Emits function-level metadata for CUDA-specific attributes: grid_constant (per-parameter), preserve_n_data / preserve_n_control / preserve_n_after (register preservation hints), and full_custom_abi (custom calling convention flag). These are later read back by sub_1273F90 and re-encoded as LLVM named metadata with MDString keys.
CUDA Semantic Mapping
The central task of this layer is mapping CUDA-specific semantics to LLVM IR constructs. The following table summarizes every CUDA concept and its IR representation:
| CUDA Concept | LLVM IR Representation | Codegen Function |
|---|---|---|
| threadIdx.x | call i32 @llvm.nvvm.read.ptx.sreg.tid.x() | sub_1286E40 (EmitSpecialVarMemberAccess) |
| blockIdx.y | call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y() | same, category 2, component 1 |
| blockDim.z | call i32 @llvm.nvvm.read.ptx.sreg.ntid.z() | same, category 1, component 2 |
| gridDim.x | call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() | same, category 3, component 0 |
| warpSize | call i32 @llvm.nvvm.read.ptx.sreg.warpsize() | sub_1285550 (EmitSpecialVarAccess) |
| __shared__ variable | @var = addrspace(3) global ... | sub_916430 (address space = 3) |
| __constant__ variable | @var = addrspace(4) global ... | same (address space = 4) |
| __device__ variable | @var = addrspace(1) global ... | same (address space = 1) |
| __global__ function | define void @kern() #0 + !{ptr @kern, !"kernel", i32 1} in nvvm.annotations | sub_93AE30 |
| __launch_bounds__(N, M) | !{!"nvvm.maxntid", !"N,1,1"} + !{!"nvvm.minctasm", !"M"} | same |
| __cluster_dims__(x,y,z) | !{!"nvvm.cluster_dim", !"x,y,z"} + !{!"nvvm.blocksareclusters"} | same |
| __syncthreads() | Builtin ID dispatch -> llvm.nvvm.barrier0 | sub_12B3FD0 (cases 0xB5--0xCC) |
| atomicAdd(ptr, val) | Builtin dispatch -> atomicrmw add or llvm.nvvm.atomic.* | same (cases 0xBA--0xCC) |
| printf(fmt, ...) | Rewritten to vprintf(fmt, packed_buf) | sub_12992B0 (LowerPrintfToVprintf) |
| __asm__("ptx" : ...) | call void asm sideeffect "ptx", "=r,..."(...) | sub_1292420 (EmitInlineAsm) |
| Texture/surface ops | call @llvm.nvvm.tex.* / @llvm.nvvm.suld.* | sub_12ADE80, sub_12AA9B0 |
| __nv_float2int_rz | call i32 @__nv_float2int_rz(float %v) | sub_128A450 (EmitCast, NVIDIA intrinsic path) |
The special variable recognition pipeline (sub_127F7A0) checks five preconditions before treating a variable as a hardware register read: (1) the in-kernel flag at IRGenState+376 must be set, (2) the symbol must not be extern, (3) it must not be template-dependent, (4) its element count must be 1, and (5) its name must be non-null. The intrinsic IDs are stored in a static 5x3 table (unk_427F760): 5 categories (threadIdx, blockDim, blockIdx, gridDim, warpSize) times 3 components (x, y, z), with warpSize using only the first slot.
Common IR Emission Patterns
Alloca-at-entry
Every local variable and parameter copy uses the same pattern:
sub_127FC40(ctx, type, name, alignment, addrspace)
-> sub_921B80(ctx, type, name, arraySize=0)
-> insert AllocaInst BEFORE the allocapt sentinel
-> set alignment bits
-> return alloca pointer
The critical detail: when arraySize == 0 (the common case), the alloca is inserted at IRGenState+456+24 -- the position just before the allocapt marker. This ensures all allocas land at the top of the entry block regardless of where in the function body they are created.
Instruction insertion and debug location
After creating any instruction, the same 15-line pattern inserts it into the current basic block and attaches debug metadata:
bb       = ctx[1][1];            // current BB
sentinel = ctx[1][2];            // insertion sentinel
sub_157E9D0(bb + 40, inst);      // update BB instruction list
                                 // (doubly-linked list pointer surgery with 3-bit tag in low bits)
sub_164B780(inst, &name);        // set instruction name (e.g., "arraydecay")
debugLoc = *ctx_debug;
if (debugLoc) {
    sub_1623A60(&loc, debugLoc, 2);    // clone debug location
    *(inst + 48) = loc;                // attach at instruction offset +48
    sub_1623210(&loc, loc, inst + 48); // register in debug info list
}
The low 3 bits of list pointers carry tag/flags (alignment guarantees those bits are zero for valid pointers). Offset +24 is prev, +32 is parent block, +48 is debug location on each instruction node.
Constant vs instruction dispatch
Throughout expression codegen, a consistent threshold check determines whether to constant-fold or create an IR instruction:
if (*(BYTE*)(value + 16) > 0x10u) {
    // Real IR instruction -> emit IR-level operation
    result = sub_15FDBD0(opcode, value, destTy, &out, 0); // CastInst
} else {
    // Constant value -> constant-fold
    result = sub_15A46C0(opcode, value, destTy, 0);       // ConstantExpr
}
The byte at value+16 encodes the LLVM Value subclass kind. Values <= 0x10 are constants (ConstantInt, ConstantFP, ConstantPointerNull); values > 0x10 are Instruction subclasses. This avoids creating unnecessary instructions when both operands are compile-time constants.
Short-circuit boolean evaluation
Logical AND (&&) and OR (||) use the same short-circuit pattern with PHI merge:
; Logical AND (a && b):
%lhs = icmp ne i32 %a, 0
br i1 %lhs, label %land.rhs, label %land.end
land.rhs:
%rhs = icmp ne i32 %b, 0
br label %land.end
land.end:
%0 = phi i1 [ false, %entry ], [ %rhs, %land.rhs ]
%land.ext = zext i1 %0 to i32
Logical OR inverts the branch sense: TRUE goes to the end block (result is true), FALSE falls through to evaluate the RHS. Both share the same ZExt epilogue code via a merged tail at LABEL_162, selecting the name "land.ext" or "lor.ext" through a variable.
Printf lowering
Device-side printf cannot use C varargs. The compiler rewrites it to CUDA's vprintf(fmt, packed_buffer) ABI:
1. Look up or create `@vprintf` in the module via `Module::getOrInsertFunction`.
2. Allocate a stack buffer (the `"tmp"` alloca, cached at IRGenState+152 for reuse across multiple printf calls in the same function).
3. For each vararg: compute its byte size, round the running offset up to the argument's natural alignment, GEP into the buffer (`"buf.indexed"`), bitcast if needed (`"casted"`), and store.
4. Promote `float` arguments to `double` per the C variadic convention (fpext).
5. If the total packed size exceeds the current alloca size, patch the alloca's size operand in-place by manipulating the use-def chain.
6. Emit `call i32 @vprintf(ptr %fmt, ptr %buf)`.
The alloca in-place resize (step 5) is unusual -- most LLVM passes would create a new alloca. NVIDIA's motivation is to maintain a single alloca that dominates all printf pack sites within a function.
Type Translation System
The EDG-to-LLVM type translation (sub_91AED0 and its callees) is a worklist-driven fixed-point computation that runs before per-function codegen. It translates every EDG type node into an LLVM type, handling:
- Primitive types: Direct mapping (EDG `int` -> LLVM `i32`, EDG `float` -> LLVM `float`).
- Pointer types: Carry qualifier words at node+18 that encode CUDA address spaces (qualifier 1 = global/addrspace 1, qualifier 32 = shared/addrspace 3, qualifier 33 = constant/addrspace 4).
- Struct/union/class types: Recursive member-by-member translation with reference counting to handle shared sub-types and diamond inheritance.
- Typedef chains: Stripped by the standard `for (t = type; tag == 12; t = *(t+160))` idiom.
- Template specializations: Two-pass approach -- syntactic substitution (`sub_908040`) followed by semantic matching (`sub_910920`), gated by optimization flags.
- Mutually recursive types: Handled by the fixed-point iteration `do { changed = process_all(); } while (changed)`.
All hash tables in the type system use the standard DenseMap infrastructure with NVVM-layer sentinels (-8 / -16). See Hash Table and Collection Infrastructure for the common implementation.
Global Variable Codegen
Device-side globals (__device__, __constant__, __shared__, __managed__) are emitted by sub_916430 (determineAddressSpaceAndCreate) which reads EDG IL node attributes at offsets +0x88 (storage class), +0x9C, +0xAE, and +0xB0 to determine the NVPTX address space:
| EDG Attribute | NVPTX Address Space | PTX Qualifier |
|---|---|---|
__device__ | 1 (global) | .global |
__constant__ | 4 (constant) | .const |
__shared__ | 3 (shared) | .shared |
| Generic (default) | 0 (generic) | (none) |
After creating the GlobalVariable, sub_915400 (finalizeGlobals) orchestrates module-level metadata emission: nvvmir.version (IR version metadata), nvvm.annotations (kernel and parameter annotations), llvm.used (prevents dead-global elimination), Debug Info Version module flag (value 3), and optionally llvm.ident.
Naming Conventions
The IR generation layer produces named IR values that match Clang's naming conventions almost exactly, confirming that NVVM's codegen was closely modeled on Clang's IRGen:
| IR Name | Context | Source |
|---|---|---|
"entry" | Function entry basic block | sub_946060 |
"return" | Return basic block | sub_946060 |
"allocapt" | Sentinel instruction for alloca grouping | sub_946060 |
"retval" | Return value alloca | sub_946060 |
"agg.result" | Sret argument | sub_938240 |
<name>.addr | Parameter alloca | sub_938240 / sub_9446C0 |
"temp_param" | Unnamed parameter | sub_938240 |
"this" | Implicit C++ this parameter | sub_938240 |
"__val_param"<name> | Byval parameter copy | sub_938240 |
"arraydecay" | Array-to-pointer decay GEP | sub_128D0F0 (opcode 0x15) |
"lnot" / "lnot.ext" | Logical NOT + ZExt | sub_128D0F0 (opcode 0x1D) |
"land.rhs" / "land.end" / "land.ext" | Logical AND blocks + result | sub_128D0F0 (opcode 0x57) |
"lor.rhs" / "lor.end" / "lor.ext" | Logical OR blocks + result | sub_128D0F0 (opcode 0x58) |
"cond.true" / "cond.false" / "cond.end" | Ternary operator blocks | sub_128D0F0 (opcode 0x67) |
"tobool" / "conv" | Cast results | sub_128A450 |
"sub.ptr.lhs.cast" / "sub.ptr.rhs.cast" / "sub.ptr.sub" / "sub.ptr.div" | Pointer subtraction | sub_128D0F0 (opcode 0x34) |
"if.then" / "if.else" / "if.end" | If statement blocks | sub_937020 |
"while.cond" / "while.body" / "while.end" | While loop blocks | sub_937180 |
"for.cond" / "for.body" / "for.inc" / "for.end" | For loop blocks | sub_936D30 |
"do.body" / "do.cond" / "do.end" | Do-while loop blocks | sub_936B50 |
"bf.*" | Bitfield access temporaries (30+ variants) | sub_1282050 / sub_1284570 |
"predef_tmp_comp" | Special register read result | sub_1286E40 |
"buf.indexed" / "casted" | Printf buffer GEP and cast | sub_12992B0 |
"asmresult" | Inline asm extractvalue result | sub_1292420 |
Sub-Page Navigation
The IR generation subsystem is documented in detail across four sub-pages, each covering a major functional area:
- Expression & Constant Codegen -- The `EmitExpr` master dispatch (sub_128D0F0), its 40-operator inner switch, compile-time constant emission (sub_127D8B0), and the cast/conversion codegen (sub_128A450). Covers every C/C++ expression type from array decay to pointer subtraction to logical short-circuit.
- Statement & Control Flow Codegen -- The `emitStmt` dispatcher (sub_9363D0), basic block creation for if/while/do-while/for/switch, cleanup scope management for C++ destructors, label and goto handling, and `#pragma unroll` metadata attachment.
- Function, Call & Inline Asm Codegen -- Function skeleton creation (sub_946060), the parameter prolog (sub_938240), call instruction emission with ABI classification (sub_93CB50), inline asm template parsing and constraint construction (sub_1292420), printf-to-vprintf lowering (sub_12992B0), and the 770-entry builtin dispatch table (sub_12B3FD0).
- Type Translation, Globals & Special Vars -- The fixed-point type translation system (sub_91AED0), address space mapping for CUDA memory qualifiers, global variable creation (sub_916430), kernel metadata emission (sub_93AE30), function attribute handling (sub_12735D0), and special variable codegen for threadIdx/blockIdx/blockDim/gridDim/warpSize.
Expression & Constant Codegen
The central expression emitter sub_128D0F0 (56 KB, 1751 decompiled lines) is the single function responsible for translating every C/C++ expression in the EDG AST into LLVM IR. It is a large recursive two-level switch: the outer switch classifies the expression node kind (operation, literal, member access, call, etc.), and the inner switch dispatches across 40+ C operators to emit the corresponding LLVM IR instruction sequences. Every named temporary in the output (%arraydecay, %land.ext, %sub.ptr.div, %cond, etc.) originates from explicit SetValueName calls within this function, closely mirroring Clang's IRGen naming conventions.
Two companion subsystems handle specialized expression domains: bitfield codegen (sub_1282050 store, sub_1284570 load) lowers C bitfield accesses to shift/mask/or sequences, and constant expression codegen (sub_127D8B0, 1273 lines) produces llvm::Constant* values for compile-time evaluable expressions. Cast codegen (sub_128A450, 669 lines) maps every C cast category to the appropriate LLVM cast opcode.
| Master dispatcher | sub_128D0F0 — EmitExpr (56 KB, address 0x128D0F0) |
| Bitfield store | sub_1282050 — EmitBitfieldStore (15 args, R-M-W sequence) |
| Bitfield load | sub_1284570 — EmitBitfieldLoad (12 args, extract sequence) |
| Constant expressions | sub_127D8B0 — EmitConstExpr (1273 lines, recursive) |
| Cast/conversion | sub_128A450 — EmitCast (669 lines, 11 LLVM opcodes) |
| Bool conversion | sub_127FEC0 — EmitBoolExpr (expr to i1) |
| Literal emission | sub_127F650 — EmitLiteral (numeric/string constants) |
Master Expression Dispatcher
Reconstructed signature
// sub_128D0F0
llvm::Value *EmitExpr(CodeGenState **ctx, EDGExprNode *expr,
llvm::Type *destTy, unsigned flags, unsigned flags2);
The ctx parameter is a pointer-to-pointer hierarchy:
| Offset | Field |
|---|---|
*ctx | IRBuilder state (current function, insert point) |
ctx[1] | Debug info context: [0] = debug scope, [1] = current BB, [2] = insertion sentinel |
ctx[2] | LLVM module/context handle |
EDG expression node layout
Every expression node passed as expr has a fixed layout:
| Offset | Size | Field |
|---|---|---|
| +0x00 | 8 | Type pointer (EDG type node) |
| +0x18 | 1 | Outer opcode (expression kind byte) |
| +0x19 | 1 | Flags byte |
| +0x24 | 12 | Source location info |
| +0x38 | 1 | Inner opcode (operator sub-kind, for kind=1) |
| +0x48 | 8 | Child/operand pointer |
Type nodes carry a tag at offset +140: 12 = typedef alias (follow +160 to unwrap), 1 = void. The typedef-stripping idiom appears 15+ times throughout the function:
// Type unwrapping — strips typedef aliases to canonical type
for (Type *t = expr->type; *(uint8_t*)(t + 140) == 12; t = *(Type**)(t + 160))
;
Outer switch — expression categories
The byte at expr+0x18 selects the top-level expression category:
| Kind | Category | Handler |
|---|---|---|
0x01 | Operation expression | Inner switch on expr+0x38 (40+ C operators) |
0x02 | Literal constant | EmitLiteral (sub_127F650) |
0x03 | Member/field access | EmitAddressOf + EmitLoadFromAddress |
0x11 | Call expression | EmitCall (sub_1296570) |
0x13 | Init expression | EmitInitExpr (sub_1281220) |
0x14 | Declaration reference | EmitAddressOf + EmitLoadFromAddress |
| default | Fatal: "unsupported expression!" |
Inner switch — complete opcode reference
When the outer kind is 0x01 (operation), the byte at expr+0x38 selects which C operator to emit. The complete dispatch table follows: every opcode the binary handles is listed, and any value not shown falls through to the default fatal-diagnostic case.
| Opcode | C operator | Handler / delegate | LLVM pattern |
|---|---|---|---|
0x00 | Constant subexpr | sub_72B0F0 (evaluate) + sub_1286D80 (load) | Constant materialization |
0x03 | Compound special A | EmitCompoundAssign (sub_1287ED0) | Read-modify-write |
0x05 | Dereference (*p) | Elide if child is &: IsAddressOfExpr (sub_127B420). Otherwise: recursive EmitExpr + EmitLoad (sub_128B370) | %val = load T, ptr %p |
0x06 | Compound special B | EmitCompoundAssign (sub_1287ED0) | Read-modify-write |
0x08 | Compound special C | EmitCompoundAssign (sub_1287ED0) | Read-modify-write |
0x15 | Array decay | See Array decay | %arraydecay = getelementptr inbounds ... |
0x19 | Parenthesized (x) | Tail-call optimization: a2 = child, restart loop | (no IR emitted) |
0x1A | sizeof / alignof | EmitSizeofAlignof (sub_128FDE0) | Constant integer |
0x1C | Bitwise NOT (~x) | sub_15FB630 (xor with -1) | %not = xor i32 %x, -1 |
0x1D | Logical NOT (!x) | Two-phase: EmitBoolExpr + zext | %lnot = icmp eq ..., 0 / %lnot.ext = zext i1 ... to i32 |
0x1E | Type-level const | ConstantFromType (sub_127D2C0) | Compile-time constant |
0x1F | Type-level const | ConstantFromType (sub_127D2C0) | Compile-time constant |
0x23 | Pre-increment ++x | EmitIncDec (sub_128C390): prefix=1, inc=1 | %inc = add ... / %ptrincdec = getelementptr ... |
0x24 | Pre-decrement --x | EmitIncDec (sub_128C390): prefix=0, inc=0 | %dec = sub ... / %ptrincdec = getelementptr ... |
0x25 | Post-increment x++ | EmitIncDec (sub_128C390): prefix=1, inc=0 | Returns old value; %inc = add ... |
0x26 | Post-decrement x-- | EmitIncDec (sub_128C390): prefix=0, inc=1 | Returns old value; %dec = sub ... |
0x27-0x2B | +, -, *, /, % | EmitBinaryArithCmp (sub_128F9F0) | add/sub/mul/sdiv/srem (or u/f variants) |
0x32 | Comma (a, b) | Emit both sides; return RHS | (LHS discarded) |
0x33 | Subscript a[i] | EmitSubscriptOp (sub_128B750): GEP + load | %arrayidx = getelementptr ... + load |
0x34 | Pointer subtraction | See Pointer subtraction | %sub.ptr.div = sdiv exact ... |
0x35-0x39 | ==, !=, <, >, <=, >= | EmitBinaryArithCmp (sub_128F9F0) | icmp eq/ne/slt/sgt/sle/sge (or u/f variants) |
0x3A | << | EmitShiftOrBitwise (sub_128F580): triple (1, 32, 32) | shl |
0x3B | >> | EmitShiftOrBitwise (sub_128F580): triple (14, 33, 33) | ashr (signed) / lshr (unsigned) |
0x3C | & | EmitShiftOrBitwise (sub_128F580): triple (2, 38, 34) | and |
0x3D | ^ | EmitShiftOrBitwise (sub_128F580): triple (4, 40, 36) | xor |
0x3E | | | EmitShiftOrBitwise (sub_128F580): triple (3, 39, 35) | or |
0x3F | Rotate | EmitShiftOrBitwise (sub_128F580): triple (5, 41, 37) | llvm.fshl / llvm.fshr |
0x41-0x46 | Type-level consts | ConstantFromType (sub_127D2C0) | Compile-time constant |
0x49 | Member access ./-> | See Member access | getelementptr + load (or bitfield path) |
0x4A | += | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288F60 | Load + add + store |
0x4B | -= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288370 | Load + sub + store |
0x4C | *= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288770 | Load + mul + store |
0x4D | /= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1289D20 | Load + div + store |
0x4E | %= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288DC0 | Load + rem + store |
0x4F | &= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288B70 | Load + and + store |
0x50 | |= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1289360 | Load + or + store |
0x51 | <<= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288090 | Load + shl + store |
0x52 | >>= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1287F30 | Load + ashr/lshr + store |
0x53 | ^= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288230 | Load + xor + store |
0x54 | ,= (rare) | EmitCompoundAssignWrapper (sub_12901D0) + sub_128BE50 | Comma-compound |
0x55 | []= (subscript compound) | EmitCompoundAssignWrapper (sub_12901D0) + sub_128B750 | GEP + R-M-W |
0x56 | Bitfield assign | See Bitfield Codegen | R-M-W sequence |
0x57 | Logical AND && | See Logical AND | land.rhs/land.end + PHI |
0x58 | Logical OR || | See Logical OR | lor.rhs/lor.end + PHI |
0x59, 0x5A, 0x5D | Type-level consts | ConstantFromType (sub_127D2C0) | Compile-time constant |
0x5B | Statement expression ({...}) | EmitStmtExpr (sub_127FF60); create empty BB if (*a1)[7] == 0 | Body emission |
0x5C, 0x5E, 0x5F | Compound special | EmitCompoundAssign (sub_1287ED0) | Read-modify-write |
0x67 | Ternary ?: | See Ternary operator | cond.true/cond.false/cond.end + PHI |
0x68 | Type-level const | ConstantFromType (sub_127D2C0) | Compile-time constant |
0x69 | Special const | EmitSpecialConst (sub_1281200) | Constant materialization |
0x6F | Label address &&label | GCC extension: sub_12A4D00 (lookup) + sub_1285E30(builder, label, 1) | blockaddress(@fn, %label) |
0x70 | Label value | sub_12A4D00 + sub_12812E0(builder, label, type) | Indirect goto target |
0x71 | Computed goto goto *p | sub_12A4D00 + sub_1285E30(builder, label, 0) | indirectbr |
0x72 | va_arg | sub_12A4D00 on va_list child + sub_1286000 | va_arg lowering |
| default | FatalDiag (sub_127B550) | "unsupported operation expression!" |
Shift and bitwise triple encoding
The EmitShiftOrBitwise (sub_128F580) triple (signedOp, intOp, fpOp) encodes three things: signedOp controls signed-vs-unsigned selection for right shift (14 selects ashr for signed, lshr for unsigned), intOp is the LLVM integer opcode number, and fpOp is the floating-point variant (unused for shift/bitwise but present for uniformity).
Increment / decrement detail
EmitIncDec (sub_128C390, 16 KB) handles integer, floating-point, and pointer types. It reads the expression type to select the arithmetic operation:
- Integer path: `add/sub nsw i32 %x, 1` with name `"inc"` or `"dec"`. For prefix variants, the incremented value is returned; for postfix, the original value is returned and the increment is stored.
- Floating-point path: `fadd/fsub float %x, 1.0` with the same return-value semantics.
- Pointer path: `getelementptr inbounds T, ptr %p, i64 1` (or `i64 -1` for decrement) with name `"ptrincdec"`. The element type comes from the pointed-to type.
All paths load the current value, compute the new value, store back, and return either old or new depending on prefix/postfix.
Compound assignment wrapper mechanics
EmitCompoundAssignWrapper (sub_12901D0) implements the common load-compute-store pattern for all compound assignment operators (+=, -=, etc.):
// sub_12901D0 pseudocode
Value *EmitCompoundAssignWrapper(ctx, expr, impl_fn, flags) {
    Value *addr    = EmitAddressOf(ctx, expr->lhs);      // sub_1286D80
    Value *old_val = EmitLoadFromAddress(ctx, addr);     // sub_1287CD0
    Value *rhs_val = EmitExpr(ctx, expr->rhs);           // sub_128D0F0 (recursive)
    Value *new_val = impl_fn(ctx, old_val, rhs_val);     // per-operator function
    EmitStore(ctx, new_val, addr);                       // store back
    return new_val;
}
Each impl_fn is a small function (typically 200-400 lines) that handles integer/float type dispatch and signedness. For example, sub_1288F60 (AddAssign) selects between add, fadd, and pointer-GEP addition.
Member access multi-path handler
Opcode 0x49 handles struct field access (. and ->) through a multi-path dispatcher:
- Simple scalar field (field count == 1): Computes the field address via `EmitAddressOf` (sub_1286D80), checks the volatile bit (v349 & 1), copies 12 DWORDs of field descriptor into the local frame, then loads via `EmitLoadFromAddress` (sub_1287CD0).
- Bitfield field: If the field descriptor indicates a bitfield, routes to `EmitBitfieldAccess` (sub_1282050), which emits the shift/mask extraction sequence.
- Nested/union access (field count > 1): Calls `ComputeCompositeMemberAddr` (sub_1289860) for multi-level GEP computation, then `EmitComplexMemberLoad` (sub_12843D0).
- Write-only context: If the assignment bit (a2+25, bit 2) is set, returns null -- the caller only needs the address, not the loaded value.
Statement expression, label address, and va_arg
Statement expression (0x5B): Emits the compound statement body via EmitStmtExpr (sub_127FF60). If no return basic block exists yet ((*a1)[7] == 0), creates an anonymous empty BB via CreateBasicBlock + SetInsertPoint to serve as the fall-through target. The value of the last expression in the block is the statement expression's result.
Label address (0x6F): Implements the GCC &&label extension. Looks up the label via LookupLabel (sub_12A4D00), then creates a blockaddress(@current_fn, %label) constant via sub_1285E30(builder, label, 1). The second argument 1 distinguishes "take address" from "goto to".
Computed goto (0x71): The goto *ptr extension. Same LookupLabel call, but sub_1285E30(builder, label, 0) with flag 0 emits an indirectbr instruction targeting the resolved label.
va_arg (0x72): Extracts the va_list child node at +72, its sub-child at +16, resolves both via sub_12A4D00, then calls EmitVaArg (sub_1286000) which lowers to a va_arg LLVM instruction with the appropriate type.
Constant vs. instruction dispatch
Throughout all operator emission, a consistent pattern selects between constant folding and IR instruction creation. The byte at Value+16 encodes the LLVM Value subclass kind: values <= 0x10 are constants (ConstantInt, ConstantFP, etc.) and values > 0x10 are instructions. This check appears 20+ times throughout the function, always with the same structure:
// Constant-fold or emit IR? Decision pattern (appears 20+ times)
if (*(uint8_t*)(value + 16) > 0x10) {
    // Real IR instruction -- create via IR builder
    result = CreateCast(opcode, value, destTy, &out, 0);  // sub_15FDBD0
    // ...or, for binary operators:
    result = CreateBinOp(opcode, lhs, rhs, &out, 0);      // sub_15FB440
} else {
    // Compile-time constant -- constant-fold at LLVM ConstantExpr level
    result = ConstantExprCast(opcode, value, destTy, 0);  // sub_15A46C0
    // ...or, for binary operators:
    result = ConstantFoldBinOp(lhs, rhs, 0, 0);           // sub_15A2B60
}
The dispatch table for the constant-fold vs IR-instruction paths:
| Operation | IR path (Value > 0x10) | Constant path (Value <= 0x10) |
|---|---|---|
| Binary op | CreateBinOp (sub_15FB440) | ConstantFoldBinOp (sub_15A2B60) |
| Unary NOT | CreateUnaryOp (sub_15FB630) | ConstantFoldUnary (sub_15A2B00) |
| Cast | CreateCast (sub_15FDBD0) | ConstantExprCast (sub_15A46C0) |
| Int compare | sub_15FEC10(op=51, pred) | sub_15A37B0(pred, lhs, rhs) |
| Float compare | sub_15FEC10(op=52, pred) | sub_15A37B0(pred, lhs, rhs) |
| Sub (constant) | CreateBinOp(13=Sub) | ConstantFoldSub (sub_15A2B60) |
| SDiv exact | CreateBinOp(18=SDiv) + SetExactFlag | ConstantFoldSDiv (sub_15A2C90) |
When the constant path is taken, no LLVM instruction is created and no BB insertion occurs -- the result is a pure llvm::Constant* that can be used directly. This is critical for expressions like sizeof(int) + 4 where no runtime code should be emitted.
Key Expression Patterns
Array decay
Opcode 0x15. Converts an array lvalue to a pointer to its first element.
When IsArrayType (sub_8D23B0) confirms the source is an array type, the emitter creates an inbounds GEP with two zero indices. The GEP instruction is constructed manually: allocate 72 bytes for 3 operands via AllocateInstruction, compute the result element type, propagate address space qualifiers from the source, then fill operands (base, i64 0, i64 0) and mark inbounds:
%arraydecay = getelementptr inbounds [N x T], ptr %arr, i64 0, i64 0
If the source is already a pointer type (not an array), the function either passes through directly or inserts a ptrtoint / zext if the types differ.
Pointer subtraction
Opcode 0x34. The classic 5-step Clang pattern for (p1 - p2):
%sub.ptr.lhs.cast = ptrtoint ptr %p1 to i64
%sub.ptr.rhs.cast = ptrtoint ptr %p2 to i64
%sub.ptr.sub = sub i64 %sub.ptr.lhs.cast, %sub.ptr.rhs.cast
%sub.ptr.div = sdiv exact i64 %sub.ptr.sub, 4 ; element_size=4 for int*
Step 5 (the sdiv exact) is skipped entirely when the element size is 1 (i.e., char* arithmetic), since division by 1 is a no-op. The element size comes from the pointed-to type at offset +128. The exact flag on sdiv tells the optimizer that the division is known to produce no remainder -- a critical optimization hint.
Logical AND (short-circuit)
Opcode 0x57. Creates two basic blocks and a PHI node for C's short-circuit && evaluation:
entry:
%lhs = icmp ne i32 %a, 0
br i1 %lhs, label %land.rhs, label %land.end
land.rhs:
%rhs = icmp ne i32 %b, 0
br label %land.end
land.end:
%0 = phi i1 [ false, %entry ], [ %rhs, %land.rhs ]
%land.ext = zext i1 %0 to i32
The construction sequence:
1. Create blocks `land.end` and `land.rhs` via `CreateBasicBlock` (sub_12A4D50).
2. Emit the LHS as a boolean via `EmitBoolExpr` (sub_127FEC0).
3. Conditional branch: `br i1 %lhs, label %land.rhs, label %land.end`.
4. Switch the insertion point to `%land.rhs`.
5. Emit the RHS as a boolean.
6. Unconditional branch to `%land.end`.
7. Switch to `%land.end` and construct a PHI with 2 incoming edges.
8. Zero-extend the `i1` PHI result to the expression's declared type (typically `i32`) with the name `land.ext`.
The PHI node is allocated as 64 bytes via AllocatePHI (sub_1648B60), initialized with opcode 53 (PHI), and given a capacity of 2. Incoming values are stored in a compact layout: [val0, val1, ..., bb0, bb1, ...] where each value slot occupies 24 bytes (value pointer + use-list doubly-linked-list pointers), and basic block pointers form a parallel array after all value slots.
Logical OR (short-circuit)
Opcode 0x58. Identical structure to logical AND but with inverted branch sense: the TRUE outcome of the LHS branches to lor.end (short-circuits to true), and FALSE falls through to evaluate the RHS:
entry:
%lhs = icmp ne i32 %a, 0
br i1 %lhs, label %lor.end, label %lor.rhs
lor.rhs:
%rhs = icmp ne i32 %b, 0
br label %lor.end
lor.end:
%0 = phi i1 [ true, %entry ], [ %rhs, %lor.rhs ]
%lor.ext = zext i1 %0 to i32
Internally, the AND and OR paths share a common tail (merging at a single code point with a variable holding either "lor.ext" or "land.ext").
Ternary / conditional operator
Opcode 0x67. Constructs a full three-block diamond with PHI merge for a ? b : c:
entry:
%cond.bool = icmp ne i32 %test, 0
br i1 %cond.bool, label %cond.true, label %cond.false
cond.true:
%v1 = <emit true expr>
br label %cond.end
cond.false:
%v2 = <emit false expr>
br label %cond.end
cond.end:
%cond = phi i32 [ %v1, %cond.true ], [ %v2, %cond.false ]
The function creates three blocks (cond.true, cond.false, cond.end), records which basic block each arm finishes in (since the true/false expression emission might create additional blocks), and builds the PHI from those recorded blocks. When one arm is void, the PHI is omitted and whichever arm produced a value is returned directly.
Logical NOT and bitwise NOT
Logical NOT (opcode 0x1D) is a two-phase emit:
%lnot = icmp eq i32 %x, 0 ; Phase 1: convert to bool
%lnot.ext = zext i1 %lnot to i32 ; Phase 2: extend back to declared type
Phase 1 calls EmitBoolExpr which produces the icmp eq ... 0 comparison. Phase 2 zero-extends the i1 back to the expression's target type. If the value is already a compile-time constant, the constant folder handles it directly.
Bitwise NOT (opcode 0x1C) produces xor with all-ones:
%not = xor i32 %x, -1
Created via CreateUnaryOp (sub_15FB630) which synthesizes xor with -1 (all bits set). Optional zext follows if the result needs widening.
Dereference with address-of elision
Opcode 0x05. Before emitting a load for unary *, the function checks if the child is an address-of expression via IsAddressOfExpr (sub_127B420). If so, the dereference and address-of cancel out -- no IR is emitted, only a debug annotation is attached. This handles the common pattern *&x becoming just x.
Bitfield Codegen
Bitfield loads and stores are lowered to shift/mask/or sequences by two dedicated functions. A path selector CanUseFastBitfieldPath (sub_127F680) determines whether the bitfield fits within a single naturally-aligned container element (fast path) or must be processed byte-by-byte (general path).
EDG bitfield descriptor
The bitfield metadata object carries:
| Offset | Type | Field |
|---|---|---|
| +120 | qword | Container type node |
| +128 | qword | Byte offset within struct |
| +136 | byte | Bit offset within containing byte |
| +137 | byte | Bit width of the field |
| +140 | byte | Type tag (12 = array wrapper, walk chain) |
| +144 | byte | Flags (bit 3 = signed bitfield) |
| +160 | qword | Next/inner type pointer |
Fast path (single-container load)
When the bitfield plus its bit range fits within one container element, the fast path loads the entire container and extracts the field with a single shift and mask:
// Example: struct { unsigned a:3; unsigned b:5; } s;
// s.b: byte_offset=0, bit_offset=3, bit_width=5, container=i8
Load s.b (fast path):
%container = load i8, ptr %s
%shifted = lshr i8 %container, 3 ; "highclear" -- position field at bit 0
%result = and i8 %shifted, 31 ; "zeroext" -- mask to 5 bits (0x1F)
The shift amount is computed as 8 * elem_size - bit_width - bit_offset - 8 * (byte_offset % elem_size). When this evaluates to zero, the lshr is constant-folded away.
For signed bitfields, the zero-extend is replaced with an arithmetic sign extension via shift-left then arithmetic-shift-right:
%shifted = lshr i8 %container, 3 ; "highclear"
%signext = ashr i8 %shifted, 5 ; "signext" -- propagates sign bit
Store s.b = val (fast path read-modify-write):
%container = load i8, ptr %s
%bf.value = and i8 %val, 31 ; mask to 5 bits
%cleared = and i8 %container, 7 ; "bf.prev.cleared" -- clear bits [3:7]
%positioned = shl i8 %bf.value, 3 ; "bf.newval.positioned"
%merged = or i8 %cleared, %positioned ; "bf.finalcontainerval"
store i8 %merged, ptr %s
The clear mask is ~(((1 << bit_width) - 1) << bit_position). For containers wider than 64 bits, both the clear mask and the value mask are computed via APInt operations (sub_16A5260 to set bit range, sub_16A8F40 to invert).
Byte-by-byte path (spanning load)
When the bitfield spans multiple container elements, it is processed one byte at a time. Each iteration loads a byte, extracts the relevant bits, zero-extends to the accumulator width, shifts into position, and ORs into the running accumulator.
For example, a 20-bit field starting at byte 0, bit 0:
; Byte 0: bits [0:7]
%bf.base.i8ptr = bitcast ptr %s to ptr ; pointer cast
%byte0.ptr = getelementptr i8, ptr %bf.base.i8ptr, i64 0
%bf.curbyte.0 = load i8, ptr %byte0.ptr
%bf.byte_zext.0 = zext i8 %bf.curbyte.0 to i32
; accumulator = %bf.byte_zext.0 (shift=0 for first byte)
; Byte 1: bits [8:15]
%byte1.ptr = getelementptr i8, ptr %bf.base.i8ptr, i64 1
%bf.curbyte.1 = load i8, ptr %byte1.ptr
%bf.byte_zext.1 = zext i8 %bf.curbyte.1 to i32
%bf.position.1 = shl i32 %bf.byte_zext.1, 8 ; "bf.position"
%bf.merge.1 = or i32 %bf.byte_zext.0, %bf.position.1 ; "bf.merge"
; Byte 2: only 4 bits remain (20 - 16 = 4)
%byte2.ptr = getelementptr i8, ptr %bf.base.i8ptr, i64 2
%bf.curbyte.2 = load i8, ptr %byte2.ptr
%bf.end.highclear = lshr i8 %bf.curbyte.2, 4 ; "bf.end.highclear" -- clear top 4 bits
%bf.byte_zext.2 = zext i8 %bf.end.highclear to i32
%bf.position.2 = shl i32 %bf.byte_zext.2, 16
%bf.merge.2 = or i32 %bf.merge.1, %bf.position.2
The byte-by-byte store path mirrors this in reverse: for boundary bytes (first and last), it loads the existing byte, masks out the target bits with AND, positions the new bits with SHL, and merges with OR. Middle bytes that are entirely overwritten skip the read-modify-write and store directly.
The bf.* naming vocabulary
All bitfield IR values use a consistent naming scheme:
| Name | Path | Meaning |
|---|---|---|
| bf.base.i8ptr | Both | Pointer cast to i8* |
| bf.curbyte | Load | Current byte in iteration loop |
| bf.end.highclear | Load | lshr to clear unused high bits in last byte |
| bf.byte_zext | Load | zext of byte to accumulator width |
| bf.position | Both | shl to position byte/value within accumulator/container |
| bf.merge | Load | or to merge byte into accumulator |
| bf.highclear | Load | lshr before sign extension |
| bf.finalval | Load | ashr for sign extension |
| highclear | Load fast | Fast-path lshr to clear high bits |
| zeroext | Load fast | Fast-path zero-extend result |
| signext | Load fast | Fast-path ashr sign extension |
| bf.value | Store | and(input, width_mask) -- isolated field bits |
| bf.prev.cleared | Store fast | Container with old field bits cleared |
| bf.newval.positioned | Store fast | New value shifted to field position |
| bf.finalcontainerval | Store fast | or(cleared, positioned) -- final container |
| bf.reload.val | Store | Truncated value for compound assignment reload |
| bf.reload.sext | Store | Sign-extended reload via shift pair |
| bassign.tmp | Store | Alloca for temporary during bitfield assignment |
Wide bitfield support (> 64 bits)
Both load and store functions handle bitfields wider than 64 bits through APInt operations. The threshold check width > 0x40 (64) appears throughout: values <= 64 bits use inline uint64_t masks computed as 0xFFFFFFFFFFFFFFFF >> (64 - width), while wider values allocate heap-backed APInt word arrays. Every code path carefully frees heap APInts after use. This supports __int128 bitfields in CUDA.
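The inline mask computation for the fast (<= 64-bit) path can be written out directly. A minimal sketch, with the hypothetical helper name `bf_width_mask`; note that width 0 would be undefined behavior (a shift by 64), consistent with zero-width fields never reaching this path:

```c
#include <assert.h>
#include <stdint.h>

/* The <= 64-bit mask described above: all-ones shifted right by the
 * unused bit count. Valid for width in 1..64; width 0 would shift
 * by 64, which is UB in C and never occurs in this code path. */
static uint64_t bf_width_mask(unsigned width) {
    return 0xFFFFFFFFFFFFFFFFull >> (64u - width);
}
```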
Volatile and alignment
Volatile detection uses a global flag at unk_4D0463C. When set, sub_126A420 queries whether the GEP target address is in volatile memory, propagating the volatile bit to load/store instructions. The alignment parameter for bitfield container loads must be 1; the function asserts on other values with "error generating code for loading from bitfield!".
Duplicate implementations
Two additional copies exist at sub_923780 (store) and sub_925930 (load) -- identical algorithms with the same string names, same opcodes, same control flow. These likely correspond to different template instantiations or address-space variants in the original NVIDIA source. The 0x92xxxx copies are in the main NVVM frontend region while the 0x128xxxx copies are in the codegen helper region.
Constant Expression Codegen
EmitConstExpr (sub_127D8B0) converts EDG constant expression AST nodes into llvm::Constant* values. It is recursive: aggregate initializers call it for each element.
// sub_127D8B0
llvm::Constant *EmitConstExpr(CodeGenState *ctx, EDGConstExprNode *expr,
llvm::Type *arrayElemTyOverride);
The constant kind byte at expr[10].byte[13] is the primary dispatch:
| Kind | Category | Output type |
|---|---|---|
| 1 | Integer constant | ConstantInt |
| 2 | String literal | ConstantDataArray |
| 3 | Floating-point constant | ConstantFP |
| 6 | Address-of constant | GlobalVariable*, Function*, or string global |
| 0xA | Aggregate initializer | ConstantStruct, ConstantArray, or ConstantAggregateZero |
| 0xE | Null/empty | Returns 0 (no constant) |
| default | -- | Fatal: "unsupported constant variant!" |
Integer constants
For normal integers (up to 64 bits), the value is extracted via edg::GetSignedIntValue or edg::GetUnsignedIntValue depending on signedness, masked to the actual bit width, and passed to ConstantInt::get(context, APInt).
For __int128 (type size == 16 bytes), the EDG IL stores the value as a decimal string. The path is: edg::GetIntConstAsString(expr) returns the decimal text, then APInt::fromString(128, str, len, radix=10) parses it into a 128-bit APInt. This string-based transfer suggests the EDG IL uses text encoding for portability of wide integers.
APInt memory management follows the standard pattern: values > 64 bits use heap-allocated word arrays (checked via width > 0x40). Every path frees heap APInts after consumption.
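What APInt::fromString computes for this path can be sketched with the GCC/Clang unsigned __int128 extension standing in for APInt's word arrays. The helper name `parse_u128` is hypothetical, and this handles only the unsigned-digits case, not sign or overflow checking:

```c
#include <assert.h>

/* Sketch of the __int128 decimal-string path: what
 * APInt::fromString(128, str, len, 10) computes, using the
 * GCC/Clang unsigned __int128 extension in place of APInt. */
static unsigned __int128 parse_u128(const char *s) {
    unsigned __int128 v = 0;
    for (; *s; ++s)
        v = v * 10 + (unsigned)(*s - '0');
    return v;
}
```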
When the target LLVM type is a pointer (tag 15), the integer constant is first created, then ConstantExpr::getIntToPtr converts it.
String literals
The character width is determined from a lookup table qword_4F06B40 indexed by the encoding enum at expr[10].byte[8] & 7:
| Index | Width | C type |
|---|---|---|
| 0 | 1 byte | char / UTF-8 |
| 1 | platform | wchar_t |
| 2 | 1 byte | char8_t |
| 3 | from global | platform-dependent |
| 4 | from global | platform-dependent |
The raw byte buffer is built by copying byte_count bytes from the EDG node, reading each character through edg::ReadIntFromBuffer(src, width) -- an endian-aware read function (the EDG IL may store string data in a platform-independent byte order). The buffer is then passed to ConstantDataArray::getRaw(data, byte_count) to create the LLVM constant.
For each character width, the LLVM element type is selected: i8 for 1-byte, i16 for 2-byte, i32 for 4-byte, i64 for 8-byte. Empty strings create zero-element arrays. If the array type override (the arrayElemTyOverride parameter, a3 in the decompilation) specifies a larger size than the literal, the remaining bytes are zero-filled.
Floating-point constants
Raw bit patterns are extracted via edg::ExtractFloatBits(kind, data_ptr), then reinterpreted into native float or double values:
| EDG kind | C type | Conversion path |
|---|---|---|
| 2 | float | BitsToFloat -> APFloat(float) -> IEEEsingle semantics |
| 4 | double | BitsToDouble -> APFloat(double) -> IEEEdouble semantics |
| 6 | long double | Truncated to double (with warning 0xE51) |
| 7 | __float80 | Truncated to double (with warning 0xE51) |
| 8, 13 | __float128 | Truncated to double (with warning 0xE51) |
All extended-precision types (long double, __float80, __float128) are silently lowered through the double path. NVPTX has no hardware support for 80-bit or 128-bit floats, so CICC truncates them to 64-bit IEEE 754. When the compilation context has the appropriate flag (bit 4 at offset +198), a diagnostic warning is emitted identifying the specific type being truncated.
Address-of constants
Sub-dispatched by a byte at expr[11].byte[0]:
- Byte 0 -- Variable/global reference: calls GetOrCreateGlobalVariable (sub_1276020), returning a GlobalVariable* as a constant pointer. Debug info is optionally attached.
- Byte 1 -- Function reference: calls GetOrCreateFunction (sub_1277140). For static-linkage functions, resolves through LookupFunctionStaticVar.
- Byte 2 -- String literal reference (&"..."): validates the node kind is 2 (string), then calls CreateStringGlobalConstant (sub_126A1B0).
Post-processing applies a constant GEP offset if expr[12].qword[0] is nonzero, and performs pointer type cast if the produced type differs from the expected type. Same-address-space mismatches use ConstantExpr::getBitCast; cross-address-space mismatches use ConstantExpr::getAddrSpaceCast. Pointer-to-integer mismatches use ConstantExpr::getPtrToInt with address-space normalization to addrspace(0) first.
Aggregate initializers
The largest case (630+ lines). After stripping typedefs, dispatches on the canonical type tag at +140:
| Tag | Type | Output |
|---|---|---|
| 10 | Struct | ConstantStruct or ConstantAggregateZero |
| 11 | Union | Anonymous {member_type, [N x i8]} |
| 8 | Array | ConstantArray |
| 12 | Typedef | Strip and re-dispatch |
| other | -- | Fatal: "unsupported aggregate constant!" |
Struct (tag 10): Walks the EDG field list and initializer list in parallel. The field chain is traversed via +112 pointers; the initializer list via +120 next pointers.
- Padding/zero-width fields are skipped (flag byte at +146, bit 3).
- For each non-bitfield field, GetFieldIndex (sub_1277B60) returns the LLVM struct element index. If gaps exist between the previous and current index, intermediate slots are filled with Constant::getNullValue (sub_15A06D0).
- Each field's initializer is processed by a recursive EmitConstExpr call.
- Packed struct fields (flag at +145, bit 4) have their sub-elements extracted individually via ConstantExpr::extractvalue (sub_15A0A60).
- Missing trailing fields are padded with null values.
- If the struct has no fields and the initializer list is empty, returns ConstantAggregateZero::get (sub_1598F00) as a shortcut.
- Final assembly: ConstantStruct::get (sub_159F090) with type compatibility check via Type::isLayoutIdentical (sub_1643C60). If packed, StructType::get(elts, n, true) (sub_15943F0).
Struct bitfield packing (post-processing)
When any bitfield field is detected during the main walk (flag bit 2, &4 at +144), the function re-enters a post-processing phase after the main field loop. This packs bitfield constant values byte-by-byte into the struct's byte array:
// Bitfield packing pseudocode — sub_127D8B0, case 0xA post-processing
StructLayout *layout = DataLayout::getStructLayout(structTy); // sub_15A9930
for (each bitfield field where flag &4 at +144 && name at +8 is non-null) {
uint32_t byte_offset = field->byte_offset;
uint32_t elem_idx = StructLayout::getElementContainingOffset(layout, byte_offset);
// sub_15A8020
// Validate the target byte is zero
assert(elements[elem_idx] == ConstantInt::get(i8, 0),
"unexpected error while initializing bitfield!");
// Evaluate bitfield initializer
Constant *val = EmitConstExpr(ctx, init_expr, 0); // recursive
assert(val != NULL, "bit-field constant must have a known value at compile time!");
APInt bits = extractAPInt(val); // at constant+24, width at constant+32
uint8_t bit_width = field->bit_width; // at +137
if (bits.width > bit_width)
bits = APInt::trunc(bits, bit_width); // sub_16A5A50
// Pack into struct bytes, one byte at a time
uint8_t bit_offset = field->bit_offset; // at +136 (within first byte)
while (remaining_bits > 0) {
uint8_t available = (first_byte ? 8 - bit_offset : 8);
uint8_t take = min(remaining_bits, available);
APInt slice = bits;
if (slice.width > take)
slice = APInt::trunc(slice, take); // sub_16A5A50
if (take < 8)
slice = APInt::zext(slice, 8); // sub_16A5C50
slice = slice << bit_offset; // shl
existing_byte |= slice; // sub_16A89F0
elements[byte_index] = ConstantInt::get(ctx, existing_byte);
bits = bits >> take; // sub_16A7DC0
remaining_bits -= take;
bit_offset = 0; // subsequent bytes start at bit 0
byte_index++;
}
}
This implements the C standard's bitfield byte-packing model: bits are inserted starting at the field's bit_offset within its containing byte, potentially spanning multiple bytes. Values wider than 64 bits use heap-backed APInt word arrays.
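The packing loop above can be sketched as a plain-C helper for fields up to 64 bits, with uint64_t standing in for APInt. The function name `bf_pack` is hypothetical; it assumes a zeroed destination, matching the "target byte must be zero" assertion in the recovered code:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the constant bitfield byte-packing loop: insert `width`
 * bits of `value` into a zeroed byte array, starting at `bit_offset`
 * within bytes[byte_index]. Mirrors the take/shift/merge steps of the
 * pseudocode with plain uint64_t instead of heap-backed APInt. */
static void bf_pack(uint8_t *bytes, unsigned byte_index,
                    unsigned bit_offset, unsigned width, uint64_t value) {
    uint64_t bits = width < 64 ? (value & ((1ull << width) - 1)) : value;
    unsigned remaining = width;
    while (remaining > 0) {
        unsigned avail = 8 - bit_offset;       /* 8 except possibly the first byte */
        unsigned take = remaining < avail ? remaining : avail;
        uint8_t slice = (uint8_t)(bits & ((1u << take) - 1));
        bytes[byte_index] |= (uint8_t)(slice << bit_offset);
        bits >>= take;
        remaining -= take;
        bit_offset = 0;                        /* subsequent bytes start at bit 0 */
        byte_index++;
    }
}
```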
Union (tag 11): Finds the initialized member via two paths:
- Designated initializer (kind 13): *(init+184) is the designated field, *(init+120) is the actual value expression.
- Implicit: walk the field chain (type+160) looking for the first non-skip, non-bitfield field. Named bitfield members are explicitly rejected: "initialization of bit-field in union not supported!". If no field is found: "cannot find initialized union member!".
The member value is emitted recursively. Padding to the full union byte size is added as [N x i8] zeroinitializer. The result is an anonymous {member_type, [N x i8]} struct via ConstantStruct::getAnon (sub_159F090).
Array (tag 8): Resolves element type via GetArrayElementType (sub_8D4050), walks the initializer linked list via +120 next pointers, calls EmitConstExpr recursively for each element. Designated initializers (kind 11) are supported: *(node+176) gives the designated element index, *(node+184) gives the range count. Type mismatches are handled by sub_127D000 (resize constant to target type).
When the declared dimension exceeds the initializer count, remaining elements are filled with Constant::getNullValue. The result uses ConstantArray::get (sub_159DFD0) when all elements have the same LLVM type (the common case), or falls back to an anonymous struct via StructType::get + ConstantStruct::get for heterogeneous cases (which should not occur in well-formed C but is handled defensively).
Cast / Conversion Codegen
EmitCast (sub_128A450) handles every C-level cast category. The function first checks for early exits (skip flag, identity cast where source type equals destination type), then dispatches by source and destination type tags.
// sub_128A450
llvm::Value *EmitCast(CodeGenState **ctx, EDGCastNode *expr,
uint8_t is_unsigned, llvm::Type *destTy,
uint8_t is_unsigned2, char skip_flag,
DiagContext *diag);
Type classification
Type tags at *(type+8):
| Tag | Type |
|---|---|
| 1-6 | Floating-point (1=half, 2=float, 3=double, 4=fp80, 5=fp128, 6=bf16) |
| 11 | Integer (bit-width encoded in upper bits) |
| 15 | Pointer |
| 16 | Vector/aggregate |
The test (tag - 1) > 5 is an unsigned range check meaning "NOT a float" (tags 1-6 are the float types).
Tobool patterns
When the destination type is i1 (bool), the codegen produces comparison-against-zero:
Integer/float source (tags 1-6, 11):
%tobool = icmp ne i32 %val, 0 ; integer source
%tobool = fcmp une float %val, 0.0 ; float source
Float-to-bool uses fcmp une (unordered not-equal), which returns true for any non-zero value including NaN. Integer-to-bool uses icmp ne with a zero constant of matching type.
Pointer source (tag 15):
%tobool = icmp ne ptr %val, null
A shortcut exists: if the source expression is already a comparison result (opcode 61) and the source is already the bool type, the comparison result is returned directly without creating a new instruction.
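The une choice matters for NaN: an unordered comparison is true when either operand is NaN, so converting NaN to bool yields true. C's `!=` on floating-point operands has exactly these semantics, which the hypothetical helpers below demonstrate:

```c
#include <assert.h>
#include <math.h>

/* fcmp une is true for unordered operands, so (bool)NaN is true.
 * C's != on doubles compiles to the same une-shaped comparison. */
static int tobool_f(double v) { return v != 0.0; } /* fcmp une shape */
static int tobool_i(int v)    { return v != 0; }   /* icmp ne shape  */
```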
Integer-to-integer (trunc / zext / sext)
The helper sub_15FE0A0 internally selects the operation based on relative widths:
- dest_width < src_width -> trunc
- dest_width > src_width AND unsigned -> zext
- dest_width > src_width AND signed -> sext
All produce a value named "conv".
Pointer casts
Pointer-to-pointer: In LLVM opaque-pointer mode (which CICC v13 uses for modern SMs), same-address-space casts hit the identity return path and produce no IR. Cross-address-space casts use addrspacecast (opcode 47).
Pointer-to-integer: ptrtoint (opcode 45). Asserts that the destination is actually an integer type.
Integer-to-pointer: A two-step process. First, the integer is widened or narrowed to the pointer bit-width (32 or 64, obtained via sub_127B390). Then inttoptr (opcode 46) converts the properly-sized integer to a pointer:
%conv1 = zext i32 %val to i64 ; step 1: widen to pointer width
%conv = inttoptr i64 %conv1 to ptr ; step 2: int -> ptr
Float-to-integer and integer-to-float
Two paths exist for these conversions:
Standard path: Uses LLVM's native cast opcodes. Triggered when the global flag unk_4D04630 is set (relaxed rounding mode), or when the destination is 128-bit, or when the source is fp128:
| Direction | Signed opcode | Unsigned opcode |
|---|---|---|
| int -> float | sitofp (39) | uitofp (40) |
| float -> int | fptosi (41) | fptoui (42) |
NVIDIA intrinsic path: For SM targets that require round-to-zero semantics on float-int conversions. Constructs an intrinsic function name dynamically and emits it as a plain function call:
// Name construction pseudocode
char buf[64];
if (src_is_double) strcpy(buf, "__nv_double");
else strcpy(buf, "__nv_float");
strcat(buf, is_unsigned ? "2u" : "2");
if (dest_bits == 64) strcat(buf, "ll_rz");
else strcat(buf, "int_rz");
Producing names like:
| Intrinsic | Conversion |
|---|---|
__nv_float2int_rz | f32 -> i32, signed, round-to-zero |
__nv_float2uint_rz | f32 -> u32, unsigned, round-to-zero |
__nv_double2ll_rz | f64 -> i64, signed, round-to-zero |
__nv_double2ull_rz | f64 -> u64, unsigned, round-to-zero |
__nv_float2ll_rz | f32 -> i64, signed, round-to-zero |
These are emitted as plain LLVM function calls (call i32 @__nv_float2int_rz(float %val)), not as LLVM intrinsics. The NVIDIA PTX backend later pattern-matches these __nv_ calls to cvt.rz.* PTX instructions. The intrinsic call is created by sub_128A3C0, which builds a function type, looks up or creates the declaration in the module, and emits a CallInst with one argument.
If the source integer is 32-bit but the target needs 64-bit conversion, the function first converts i32 to i64, then recursively calls itself to convert i64 to the target float type.
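The name-construction pseudocode above can be made runnable directly. This sketch (hypothetical helper name `nv_conv_name`) reproduces the three concatenation steps and nothing else:

```c
#include <assert.h>
#include <string.h>

/* Runnable version of the __nv_* name-construction pseudocode:
 * source type prefix, signedness infix, destination-width suffix. */
static void nv_conv_name(char *buf, int src_is_double,
                         int is_unsigned, int dest_bits) {
    strcpy(buf, src_is_double ? "__nv_double" : "__nv_float");
    strcat(buf, is_unsigned ? "2u" : "2");
    strcat(buf, dest_bits == 64 ? "ll_rz" : "int_rz");
}
```

Running it over the four signedness/width combinations reproduces the table above exactly.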
Float-to-float (fptrunc / fpext)
The source and destination type tags are compared directly. If the destination tag is larger (wider float), opcode 44 (fpext) is used. If smaller, opcode 43 (fptrunc).
%conv = fpext float %val to double ; float -> double
%conv = fptrunc double %val to float ; double -> float
Cast control flow summary
EmitCast(ctx, expr, is_unsigned, destTy, is_unsigned2, skip, diag)
|
+-- skip_flag set --> return 0
+-- destTy == BoolType?
| +-- src is float --> fcmp une %val, 0.0 "tobool"
| +-- src is ptr/int --> icmp ne %val, null/0 "tobool"
+-- srcTy == destTy --> return expr (identity)
+-- ptr -> ptr --> bitcast(47) "conv"
+-- ptr -> int --> ptrtoint(45) "conv"
+-- int -> ptr --> resize + inttoptr(46) "conv"
+-- int -> int --> trunc/zext/sext "conv"
+-- int -> float
| +-- standard --> sitofp(39)/uitofp(40) "conv"
| +-- nvidia --> __nv_*2*_rz call "call"
+-- float -> int
| +-- standard --> fptosi(41)/fptoui(42) "conv"
| +-- nvidia --> __nv_*2*_rz call "call"
+-- float -> float
+-- wider --> fpext(44) "conv"
+-- narrower --> fptrunc(43) "conv"
IR Instruction Infrastructure
BB insertion linked list
After creating any LLVM instruction, it must be inserted into the current basic block. This appears ~30 times across the expression codegen functions as a doubly-linked intrusive list manipulation. The low 3 bits of list pointers carry tag/flag bits (alignment guarantees valid pointers have zero in those positions):
// Repeated BB insertion pattern
Value *tail = ctx[1][1]; // current BB's instruction list tail
if (tail) {
Value *sentinel = ctx[1][2]; // sentinel node
InsertIntoBB(tail + 40, inst); // sub_157E9D0
// Linked list fixup (doubly-linked with 3-bit tag):
inst->prev = (*sentinel & ~7) | (inst->prev & 7); // preserve tag bits
inst->parent = sentinel;
((*sentinel & ~7) + 8) = inst + 24; // old_tail.next = inst
*sentinel = (*sentinel & 7) | (inst + 24); // sentinel.head = inst
}
Instruction offsets: +24 = prev pointer, +32 = parent block, +48 = debug location metadata slot.
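The low-3-bit tagging trick relies on 8-byte allocation alignment leaving pointer bits [0:2] free. A minimal standalone sketch of that scheme (helper names are hypothetical, not recovered from the binary):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of low-3-bit pointer tagging: 8-byte-aligned node pointers
 * leave bits [0:2] free for flags. Flags must be masked off before
 * dereferencing and preserved across link updates, as in the
 * (ptr & ~7) / (ptr & 7) pattern above. */
static uintptr_t tag_ptr(void *p, unsigned tag) {
    assert(((uintptr_t)p & 7) == 0 && tag < 8);
    return (uintptr_t)p | tag;
}
static void *untag_ptr(uintptr_t v)  { return (void *)(v & ~(uintptr_t)7); }
static unsigned ptr_tag(uintptr_t v) { return (unsigned)(v & 7); }
```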
Debug metadata attachment
After every BB insertion, debug location metadata is cloned and attached:
SetValueName(inst, &name); // sub_164B780: e.g. "lnot.ext"
Value *debugLoc = *ctx_debug;
if (debugLoc) {
Value *cloned = CloneDebugLoc(debugLoc, 2); // sub_1623A60
if (inst->debugLoc)
ReleaseDebugLoc(inst + 48); // sub_161E7C0: free old
inst->debugLoc = cloned;
if (cloned)
RegisterDebugLoc(cloned, inst + 48); // sub_1623210
}
Global flags
| Address | Purpose |
|---|---|
| dword_4D04720 + dword_4D04658 | Debug info emission control. When both zero, source location is forwarded before dispatch |
| dword_4D04810 | Bitfield optimization flag. When set, enables bassign.tmp alloca path for bitfield assignments |
| unk_4D04630 | When set, forces standard LLVM casts (sitofp/fptosi) instead of __nv_*_rz intrinsics |
| unk_4D04700 | When set, marks tobool results as "potentially inexact" via flag bit |
| unk_4D0463C | Volatile detection flag. When set, queries address volatility |
Helper Function Reference
| Address | Recovered name | Role |
|---|---|---|
| sub_128D0F0 | EmitExpr | Master expression dispatcher (this page) |
| sub_128A450 | EmitCast | All C-level casts |
| sub_127D8B0 | EmitConstExpr | Compile-time constant expressions |
| sub_1282050 | EmitBitfieldStore | Bitfield write (R-M-W) |
| sub_1284570 | EmitBitfieldLoad | Bitfield read (extract) |
| sub_127FEC0 | EmitBoolExpr | Expression to i1 conversion |
| sub_127F650 | EmitLiteral | Numeric/string literal emission |
| sub_1286D80 | EmitAddressOf | Compute pointer to lvalue |
| sub_1287CD0 | EmitLoadFromAddress | Load via computed address |
| sub_1287ED0 | EmitCompoundAssign | Generic compound assignment |
| sub_128C390 | EmitIncDec | Pre/post increment/decrement |
| sub_128F9F0 | EmitBinaryArithCmp | Binary arithmetic and comparison |
| sub_128F580 | EmitShiftOrBitwise | Shift and bitwise operators |
| sub_128B750 | EmitSubscriptOp | Array subscript (GEP + load) |
| sub_128FDE0 | EmitSizeofAlignof | sizeof and alignof operators |
| sub_12901D0 | EmitCompoundAssignWrapper | Wrapper dispatching to per-operator impl |
| sub_1296570 | EmitCall | Function call emission |
| sub_12897E0 | EmitBitfieldStore (inner) | Actual bitfield store logic |
| sub_127A030 | GetLLVMType | EDG type to LLVM type translation |
| sub_127F680 | CanUseFastBitfieldPath | Bitfield path selector |
| sub_128A3C0 | EmitIntrinsicConvCall | __nv_*_rz intrinsic call helper |
| sub_12A4D50 | CreateBasicBlock | Create named BB |
| sub_12A4DB0 | EmitCondBranch | Conditional branch emission |
| sub_12909B0 | EmitUnconditionalBranch | Unconditional branch emission |
| sub_1290AF0 | SetInsertPoint | Switch current BB |
| sub_15FB440 | CreateBinOp | Binary instruction creation |
| sub_15FDBD0 | CreateCast | Cast instruction creation (IR path) |
| sub_15A46C0 | ConstantExprCast | Cast (constant-fold path) |
| sub_15A0680 | ConstantInt::get | Integer constant creation |
| sub_159C0E0 | ConstantInt::get (APInt) | Wide integer constant creation |
| sub_159CCF0 | ConstantFP::get | Float constant creation |
| sub_128B370 | EmitLoad | Load with volatile/type/srcloc |
| sub_128BE50 | EmitCommaOp | Comma operator RHS extraction |
| sub_1289860 | ComputeCompositeMemberAddr | Multi-level GEP for nested fields |
| sub_12843D0 | EmitComplexMemberLoad | Nested struct/union field load |
| sub_127FF60 | EmitStmtExpr | Statement expression body emission |
| sub_1281200 | EmitSpecialConst | Special constant materialization |
| sub_1281220 | EmitInitExpr | Init expression emission |
| sub_1285E30 | EmitBlockAddress | blockaddress / indirect branch |
| sub_1286000 | EmitVaArg | va_arg lowering |
| sub_127FC40 | CreateAlloca | Alloca with name and alignment |
| sub_127B420 | IsAddressOfExpr | Check if child is & (for elision) |
| sub_127B3A0 | IsVolatile | Volatile type query |
| sub_127B390 | GetSMVersion | Returns current SM target |
| sub_127B460 | IsPacked | Packed struct type query |
| sub_127B550 | FatalDiag | Fatal diagnostic (never returns) |
| sub_127C5E0 | AttachDebugLoc | Debug location attachment |
| sub_127D2C0 | ConstantFromType | Type-level constant (sizeof, etc.) |
| sub_12A4D00 | LookupLabel | Label resolution for goto/address |
| sub_1648A60 | AllocateInstruction | Raw instruction memory allocation |
| sub_1648B60 | AllocatePHI | PHI node memory allocation |
| sub_164B780 | SetValueName | Assigns %name to IR value |
| sub_157E9D0 | InsertIntoBasicBlock | BB instruction list insertion |
| sub_1623A60 | CloneDebugLoc | Debug location cloning |
| sub_1623210 | RegisterDebugLoc | Debug location list registration |
| sub_161E7C0 | ReleaseDebugLoc | Debug location list removal |
| sub_15F1EA0 | InitInstruction | Instruction field initialization |
| sub_15F1F50 | InitPHINode | PHI node initialization (opcode 53) |
| sub_15F2350 | SetExactFlag | Mark sdiv/udiv as exact |
| sub_15F55D0 | GrowOperandList | Realloc PHI operand array |
| sub_15FEC10 | CreateCmpInst | ICmp/FCmp instruction creation |
| sub_15FE0A0 | CreateIntResize | Trunc/zext/sext helper |
| sub_15FB630 | CreateUnaryOp | Unary NOT (xor -1) |
| sub_15F9CE0 | SetGEPOperands | GEP operand filling |
| sub_15FA2E0 | SetInBoundsFlag | Mark GEP as inbounds |
| sub_8D23B0 | IsArrayType | Array type check |
| sub_72B0F0 | EvaluateConstantExpr | EDG constant evaluation |
| sub_731770 | NeedsBitfieldTemp | Bitfield temp alloca check |
Constant expression helper functions
| Address | Recovered name | Role |
|---|---|---|
| sub_127D8B0 | EmitConstExpr | Master constant expression emitter |
| sub_127D000 | ResizeConstant | Resize constant to target type |
| sub_127D120 | DestroyAPFloatElement | APFloat cleanup in aggregate loop |
| sub_127D2E0 | PushElementBulk | Bulk push to element vector |
| sub_127D5D0 | PushElement | Single push to element vector |
| sub_1277B60 | GetFieldIndex | Struct field index query |
| sub_1276020 | GetOrCreateGlobalVar | Global variable creation/lookup |
| sub_1277140 | GetOrCreateFunction | Function creation/lookup |
| sub_1280350 | LookupFunctionStaticVar | Static local variable resolution |
| sub_126A1B0 | CreateStringGlobalConst | Global string constant creation |
| sub_1598F00 | ConstantAggregateZero::get | Zero-initialized aggregate |
| sub_15991C0 | ConstantDataArray::getRaw | Raw byte array constant |
| sub_159DFD0 | ConstantArray::get | Typed array constant |
| sub_159F090 | ConstantStruct::get | Struct constant |
| sub_15943F0 | StructType::get | Anonymous struct type |
| sub_15A06D0 | Constant::getNullValue | Zero constant for any type |
| sub_15A0A60 | ConstantExpr::extractvalue | Sub-element extraction |
| sub_15A2E80 | ConstantExpr::getGEP | Constant GEP expression |
| sub_15A4510 | ConstantExpr::getBitCast | Constant bitcast |
| sub_15A4A70 | ConstantExpr::getAddrSpaceCast | Constant addrspacecast |
| sub_15A4180 | ConstantExpr::getPtrToInt | Constant ptrtoint |
| sub_15A8020 | StructLayout::getElemContainingOffset | Bitfield byte lookup |
| sub_15A9930 | DataLayout::getStructLayout | Struct layout query |
| sub_620E90 | edg::IsSignedIntConst | Signedness query |
| sub_620FA0 | edg::GetSignedIntValue | Signed integer extraction |
| sub_620FD0 | edg::GetUnsignedIntValue | Unsigned integer extraction |
| sub_622850 | edg::GetIntConstAsString | __int128 decimal string extraction |
| sub_622920 | edg::ExtractFieldOffset | Field offset extraction |
| sub_709B30 | edg::ExtractFloatBits | Float raw bits extraction |
| sub_722AB0 | edg::ReadIntFromBuffer | Endian-aware integer read |
| sub_8D4050 | edg::GetArrayElementType | Array element type query |
| sub_8D4490 | edg::GetArrayElementCount | Array dimension query |
LLVM Opcode Constants
Numeric opcode constants used in CreateBinOp, CreateCast, and instruction creation calls throughout the expression codegen:
| Number | LLVM instruction | Used by |
|---|---|---|
| 13 | sub | Pointer subtraction step 4 |
| 18 | sdiv | Pointer subtraction step 5 (with exact flag) |
| 32 | shl | Left shift (<<) |
| 33 | ashr / lshr | Right shift (>>, signedness-dependent) |
| 34 | and (FP variant) | Bitwise AND |
| 35 | or (FP variant) | Bitwise OR |
| 36 | xor (FP variant) | Bitwise XOR |
| 37 | zext | Zero-extend (bool-to-int, lnot.ext, land.ext) |
| 38 | and | Bitwise AND (integer) |
| 39 | sitofp / or | Signed int-to-float / bitwise OR (integer) |
| 40 | uitofp / xor | Unsigned int-to-float / bitwise XOR (integer) |
| 41 | fptosi / funnel shift | Signed float-to-int / rotate |
| 42 | fptoui | Unsigned float-to-int |
| 43 | fptrunc | Float-to-float truncation |
| 44 | fpext | Float-to-float extension |
| 45 | ptrtoint | Pointer-to-integer cast |
| 46 | inttoptr | Integer-to-pointer cast |
| 47 | bitcast / addrspacecast | Pointer casts |
| 51 | ICmp instruction kind | Integer comparison creation |
| 52 | FCmp instruction kind | Float comparison creation |
| 53 | PHI node kind | PHI creation for &&, ||, ?: |
PHI Node Construction Detail
PHI nodes are used by three expression types: logical AND (0x57), logical OR (0x58), and ternary (0x67). The construction sequence is identical across all three:
- Allocate: AllocatePHI (sub_1648B60) with 64 bytes.
- Initialize: InitPHINode (sub_15F1F50) with opcode 53 (PHI), type, and zero for parent/count/incoming.
- Set capacity: *(phi+56) = 2 -- two incoming edges.
- Set name: SetValueName (sub_164B780) with "land.ext", "lor.ext", or "cond".
- Reserve slots: sub_1648880(phi, 2, 1) -- reserve 2 incoming at initial capacity 1.
Adding each incoming value:
count = *(phi+20) & 0xFFFFFFF; // current operand count
if (count == *(phi+56)) // capacity full?
GrowOperandList(phi); // sub_15F55D0: realloc
new_idx = (count + 1) & 0xFFFFFFF;
*(phi+20) = new_idx | (*(phi+20) & 0xF0000000); // update count, preserve flags
// Large-mode flag at *(phi+23) & 0x40 selects operand array location:
base = (*(phi+23) & 0x40) ? *(phi-8) : phi_alloc_base - 24*new_idx;
// Value slot: base + 24*(new_idx-1) — 24 bytes per slot (value ptr + use-list pointers)
slot = base + 24*(new_idx - 1);
*slot = value; // incoming value
slot[1] = value.use_next; // link into value's use-list
slot[2] = &value.use_head | (slot[2] & 3);
value.use_head = slot;
// Basic block slot: stored after all value slots as parallel array
bb_offset = base + 8*(new_idx-1) + 24*num_incoming + 8;
*bb_offset = incoming_bb;
The PHI operand layout is [val0, val1, ..., bb0, bb1, ...] where each value slot occupies 24 bytes (value pointer + doubly-linked use-list pointers), and basic block pointers form a parallel 8-byte array after all value slots.
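The layout can be restated as offset arithmetic. These helpers are hypothetical (not recovered names), assuming N incoming edges and offsets relative to the start of the operand array:

```c
#include <assert.h>
#include <stddef.h>

/* The [val0..valN-1, bb0..bbN-1] layout above: 24-byte value slots
 * (value pointer + two use-list links) followed by a parallel array
 * of 8-byte basic-block pointers. */
static size_t phi_value_slot(size_t i)                 { return 24 * i; }
static size_t phi_bb_slot(size_t i, size_t n_incoming) { return 24 * n_incoming + 8 * i; }
```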
Diagnostic String Index
| String | Origin function | Trigger |
|---|---|---|
"unsupported expression!" | EmitExpr (sub_128D0F0) | Default case in outer switch |
"unsupported operation expression!" | EmitExpr (sub_128D0F0) | Default case in inner switch |
"constant expressions are not supported!" | EmitConstExpr (sub_127D8B0) | Unsupported context kind (sub_6E9180 returns true) |
"unsupported constant variant!" | EmitConstExpr (sub_127D8B0) | Unknown constant kind in main switch; also byte != 0/1/2 in address-of |
"unsupported float variant!" | EmitConstExpr (sub_127D8B0) | Float kind 5, or kind < 2 |
"long double" / "__float80" / "__float128" | EmitConstExpr (sub_127D8B0) | Warning 0xE51: extended precision truncated to double on CUDA target |
"failed to lookup function static variable" | EmitConstExpr (sub_127D8B0) | Function static address with type tag > 0x10 |
"taking address of non-string constant is not supported!" | EmitConstExpr (sub_127D8B0) | &literal where literal kind != 2 (non-string) |
"unsupported cast from address constant!" | EmitConstExpr (sub_127D8B0) | Type mismatch that is not ptr-to-ptr or ptr-to-int |
"unsupported aggregate constant!" | EmitConstExpr (sub_127D8B0) | Type tag not in {8, 10, 11, 12} for aggregate case |
"initialization of bit-field in union not supported!" | EmitConstExpr (sub_127D8B0) | Union initializer targeting a named bitfield |
"cannot find initialized union member!" | EmitConstExpr (sub_127D8B0) | Union field chain exhausted without finding target |
"bit-field constant must have a known value at compile time!" | EmitConstExpr (sub_127D8B0) | Bitfield initializer evaluates to NULL |
"unexpected error while initializing bitfield!" | EmitConstExpr (sub_127D8B0) | Pre-existing byte in struct is not zero when packing |
"unexpected non-integer type for cast from pointer type!" | EmitCast (sub_128A450) | ptrtoint destination is not integer |
"unexpected destination type for cast from pointer type" | EmitCast (sub_128A450) | inttoptr source is not integer |
"error generating code for loading from bitfield!" | EmitBitfieldLoad (sub_1284570) | Alignment assertion failure |
"expected result type of bassign to be void!" | EmitExpr (sub_128D0F0) | Bitfield assign result type validation |
Cross-References
- IRGen Types -- type translation from EDG to LLVM
- Statement Codegen -- statement-level emission that calls into EmitExpr
- Cast Codegen detail -- EmitCast subsystem
- Diagnostics -- diagnostic emission infrastructure
- Address Spaces -- NVPTX address space model affecting pointer casts
Statement & Control Flow Codegen
The statement code generator converts EDG IL statement nodes into LLVM IR basic blocks and terminators. It is the control flow backbone of NVVM IR generation: every if, while, for, switch, goto, return, and compound block passes through a single recursive dispatcher (sub_9363D0) that reads a statement-kind byte and fans out to 17 specialized handlers. Each handler creates named basic blocks following a fixed naming convention, connects them with conditional or unconditional branches, and attaches metadata for branch prediction and loop optimization. Understanding this subsystem means understanding exactly how C/CUDA source-level control flow maps to the LLVM IR that downstream optimization passes will transform.
Binary coordinates: Handlers span 0x930000--0x948000 (~96 KB). The dispatcher itself is at 0x9363D0; the most complex handler (try/catch at sub_932270) is 57 KB alone.
Statement Dispatcher -- sub_9363D0 (emitStmt)
void emitStmt(CGModule *cg, StmtNode *stmt);
The dispatcher is the only entry point for statement lowering. All control flow handlers, compound statements, and even the top-level function body driver call emitStmt recursively.
Entry logic:
1. If cg->currentBB (offset +96) is NULL, create an anonymous unreachable basic block via createBB("") and insert it. This is the "dead code after return" safety net -- it ensures the IR builder always has an insertion point, even for unreachable code that follows a return or goto.
2. Read stmt->stmtKind (byte at StmtNode offset +40).
3. Special fast path: if kind == 8 (return), call setDebugLoc + pushScope + emitReturnStmt and return immediately. Returns get priority handling because they terminate the current BB and may trigger cleanup scope unwinding.
4. General path: setDebugLoc + pushScope, then dispatch on kind through a switch table.
Kind Dispatch Table
| Kind | Statement type | Handler | Address |
|---|---|---|---|
| 0 | Expression statement | emitExprStmt | sub_921EA0 |
| 1 | if statement | emitIfStmt | sub_937020 |
| 2 | if constexpr (C++17) | emitConstexprIf | sub_936F80 |
| 5 | while loop | emitWhile | sub_937180 |
| 6 | goto | emitGoto | sub_931270 |
| 7 | Label statement | emitLabel | sub_930570 |
| 8 | return | emitReturn | sub_9313C0 |
| 11 | Compound { ... } | emitCompound | sub_9365F0 |
| 12 | do-while loop | emitDoWhile | sub_936B50 |
| 13 | for loop | emitFor | sub_936D30 |
| 15 | case label | emitCase | sub_935670 |
| 16 | switch statement | emitSwitch | sub_9359B0 |
| 17 | Variable declaration | emitDeclStmt | sub_9303A0 |
| 18 | try/catch | emitTryCatch | sub_932270 |
| 20 | Cleanup/destructor scope | emitCleanupScope | sub_931670 |
| 24 | Null/empty statement | (return immediately) | -- |
| 25 | Expression statement (alt) | emitExprStmt | sub_921EA0 |
Kinds 0 and 25 share the same handler. The split likely distinguishes C expression-statements from GNU statement-expressions or a similar EDG internal distinction. Any unrecognized kind triggers fatal("unsupported statement type").
Gaps in the numbering (3, 4, 9, 10, 14, 19, 21--23) either correspond to statement types handled entirely in the EDG frontend (lowered before codegen sees them) or are reserved for future use.
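The dispatch logic can be modeled directly from the kind table. The following is a minimal Python sketch, not the decompiled code: handler names mirror the recovered symbols above, while the function signature and return values are illustrative.

```python
# Model of the emitStmt dispatch (sub_9363D0): kind byte -> handler,
# with the return fast path and the "unsupported statement type" fatal.
HANDLERS = {
    0: "emitExprStmt", 1: "emitIfStmt", 2: "emitConstexprIf",
    5: "emitWhile", 6: "emitGoto", 7: "emitLabel", 8: "emitReturn",
    11: "emitCompound", 12: "emitDoWhile", 13: "emitFor",
    15: "emitCase", 16: "emitSwitch", 17: "emitDeclStmt",
    18: "emitTryCatch", 20: "emitCleanupScope", 25: "emitExprStmt",
}

def emit_stmt(kind: int) -> str:
    if kind == 24:                 # null/empty statement: no codegen at all
        return "(nop)"
    if kind == 8:                  # fast path: return terminates the current BB
        return "emitReturn"
    handler = HANDLERS.get(kind)
    if handler is None:            # gaps 3, 4, 9, 10, 14, 19, 21-23 land here
        raise ValueError("unsupported statement type")
    return handler
```

Note that kinds 0 and 25 resolve to the same handler entry, matching the shared sub_921EA0 target in the table.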
If Statement -- sub_937020
Reads from the StmtNode: condition expression at offset +48, then-body at +72, else-body at +80 (may be NULL).
BB Layout: if/else
┌─────────────────────┐
│ current BB │
│ %cond = ... │
│ br i1 %cond, │
│ label %if.then, │
│ label %if.else │
└──┬──────────────┬────┘
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ if.then │ │ if.else │
│ <then> │ │ <else> │
│ br %end │ │ br %end │
└────┬─────┘ └────┬─────┘
│ │
▼ ▼
┌─────────────────────┐
│ if.end │
└─────────────────────┘
BB Layout: if without else
┌─────────────────────┐
│ current BB │
│ %cond = ... │
│ br i1 %cond, │
│ label %if.then, │
│ label %if.end │
└──┬──────────────┬────┘
│ │
▼ │
┌──────────┐ │
│ if.then │ │
│ <then> │ │
│ br %end │ │
└────┬─────┘ │
│ │
▼ ▼
┌─────────────────────┐
│ if.end │
└─────────────────────┘
LLVM IR pseudocode:
%cond = icmp ne i32 %x, 0 ; evalCondition: convert to i1
br i1 %cond, label %if.then, label %if.else, !prof !0
if.then:
; ... then-body codegen ...
br label %if.end
if.else:
; ... else-body codegen ...
br label %if.end
if.end:
; continues here
!0 = !{!"branch_weights", i32 2000, i32 1} ; if __builtin_expect(x, 1)
Branch Weight Metadata
sub_92F9D0 examines __builtin_expect annotations on branch bodies by checking bit flags at StmtNode offset +41:
| Flag | Source annotation | Weight encoding | Metadata attached |
|---|---|---|---|
| bit 0x10 | __builtin_expect(x, 1) -- likely | weightHint = 1 | !{!"branch_weights", i32 2000, i32 1} |
| bit 0x20 | __builtin_expect(x, 0) -- unlikely | weightHint = 2 | !{!"branch_weights", i32 1, i32 2000} |
| neither | no annotation | weightHint = 0 | (no metadata) |
The 2000:1 ratio represents 99.95% prediction confidence. For compound statements (kind 11), the function recurses into the compound's first child statement to find the annotation.
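The flag decoding in sub_92F9D0 reduces to a small amount of bit testing. This sketch reproduces the weight table above; the constant names and return shape are illustrative, only the bit values and metadata strings come from the recovered binary.

```python
# Decode __builtin_expect flags (StmtNode offset +41) into the weight hint
# and the !prof metadata string attached to the conditional branch.
LIKELY, UNLIKELY = 0x10, 0x20

def branch_weights(flags: int):
    if flags & LIKELY:             # __builtin_expect(x, 1)
        return 1, '!{!"branch_weights", i32 2000, i32 1}'
    if flags & UNLIKELY:           # __builtin_expect(x, 0)
        return 2, '!{!"branch_weights", i32 1, i32 2000}'
    return 0, None                 # no annotation: no metadata emitted
```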
Constexpr If -- sub_936F80
C++17 if constexpr is fully resolved during EDG frontend semantic analysis. By the time the codegen sees it, only the taken branch body survives. The handler reads a selection record from offset +72: a bit at +24 determines which of two fields contains the surviving body pointer. If non-null, it creates constexpr_if.body and constexpr_if.end BBs and emits the body with an unconditional branch to .end. If null (dead branch entirely eliminated), no codegen occurs at all.
While Loop -- sub_937180
┌─────────────────────┐
│ current BB │
│ br label %while.cond│
└─────────┬───────────┘
│
▼
┌─────────────────────┐◄──────────────┐
│ while.cond │ │
│ %c = ... │ │
│ br i1 %c, │ │
│ label %while.body, │ │
│ label %while.end │ │
└──┬──────────────┬────┘ │
│ │ │
▼ │ │
┌──────────────┐ │ │
│ while.body │ │ │
│ <body> │ │ │
│ br %cond ────┼─────┼────────────────────┘
└──────────────┘ │ backedge with
│ !llvm.loop metadata
▼
┌──────────────┐
│ while.end │
└──────────────┘
LLVM IR pseudocode:
br label %while.cond
while.cond:
%c = icmp slt i32 %i, %n
br i1 %c, label %while.body, label %while.end
while.body:
; ... body codegen ...
br label %while.cond, !llvm.loop !1
while.end:
; continues here
!1 = !{!1, !2} ; self-referential loop ID
!2 = !{!"llvm.loop.mustprogress"}
The backedge branch (br label %while.cond from while.body) always receives !llvm.loop metadata via emitLoopMustProgress (sub_930810). If the loop carries #pragma unroll, additional unroll metadata is merged into the same MDNode (see Loop Metadata below).
Do-While Loop -- sub_936B50
The key structural difference from while: the body executes before the condition. The condition BB follows the body.
┌─────────────────────┐
│ current BB │
│ br label %do.body │
└─────────┬───────────┘
│
▼
┌─────────────────────┐◄──────────────┐
│ do.body │ │
│ <body> │ │
│ br label %do.cond │ │
└─────────┬───────────┘ │
│ │
▼ │
┌─────────────────────┐ │
│ do.cond │ │
│ %c = ... │ │
│ br i1 %c, │ │
│ label %do.body, ──┼───────────────┘
│ label %do.end │ backedge
└──────────────┬──────┘
│
▼
┌──────────────┐
│ do.end │
└──────────────┘
LLVM IR pseudocode:
br label %do.body
do.body:
; ... body codegen ...
br label %do.cond
do.cond:
%c = icmp ne i32 %x, 0
br i1 %c, label %do.body, label %do.end, !llvm.loop !1
do.end:
; continues here
The backedge is the conditional branch in do.cond (true edge back to do.body). Debug location is set separately for the condition expression using the condition node's own source location (offset +36 from the condition expression node).
For Loop -- sub_936D30
The most complex loop handler. Reads four components from the StmtNode: init statement at offset +80 field [0], condition at +48, increment expression at +80 field [1], and body at +72. Any of init, condition, and increment may be NULL.
┌─────────────────────┐
│ current BB │
│ <init statement> │ ← emitted in current BB if non-null
│ br label %for.cond │
└─────────┬───────────┘
│
▼
┌─────────────────────┐◄──────────────┐
│ for.cond │ │
│ %c = ... or true │ │
│ br i1 %c, │ │
│ label %for.body, │ │
│ label %for.end │ │
└──┬──────────────┬────┘ │
│ │ │
▼ │ │
┌──────────────┐ │ │
│ for.body │ │ │
│ <body> │ │ │
│ br %for.inc │ │ │
└──────┬───────┘ │ │
│ │ │
▼ │ │
┌──────────────┐ │ │
│ for.inc │ │ │
│ <increment> │ │ │
│ br %for.cond┼─────┼────────────────────┘
└──────────────┘ │ backedge
▼
┌──────────────┐
│ for.end │
└──────────────┘
LLVM IR pseudocode:
; init: i = 0
store i32 0, ptr %i.addr, align 4
br label %for.cond
for.cond:
%i = load i32, ptr %i.addr, align 4
%cmp = icmp slt i32 %i, %n
br i1 %cmp, label %for.body, label %for.end
for.body:
; ... body codegen ...
br label %for.inc
for.inc:
%i1 = load i32, ptr %i.addr, align 4
%inc = add nsw i32 %i1, 1
store i32 %inc, ptr %i.addr, align 4
br label %for.cond, !llvm.loop !1
for.end:
; continues here
Special cases:
- Null condition: If the condition expression is NULL (e.g., for(;;)), the handler calls ConstantInt::getTrue (sub_ACD6D0) to create an unconditionally-true condition, producing an infinite loop.
- Volatile increment: If the increment expression operates on a volatile pointer (type descriptor & 0xFB == 8 and isVolatile() returns true), the store is marked volatile.
- Scope tracking: Outside "fast codegen" mode (dword_4D04658 == 0), pushes a DW_TAG_lexical_block debug scope at for-loop entry via sub_941230/sub_9415C0 and pops it at exit via sub_93FF00. This generates correct DWARF scoping so debuggers see for-local variables in the right scope.
The for.inc BB is only created when an increment expression exists. If omitted, the body branches directly back to for.cond.
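The block-skeleton rules for the for-loop handler can be summarized in a few lines. This is a structural sketch only (no IR emission); the function name and return shape are invented for illustration, the BB names and the two special cases come from the analysis above.

```python
# Sketch of sub_936D30's BB skeleton: for.inc exists only when an increment
# expression is present, and a NULL condition becomes a constant true.
def for_loop_blocks(has_cond: bool, has_inc: bool):
    blocks = ["for.cond", "for.body"]
    if has_inc:
        blocks.append("for.inc")   # body branches here, then back to for.cond
    blocks.append("for.end")
    backedge_from = "for.inc" if has_inc else "for.body"
    # ConstantInt::getTrue (sub_ACD6D0) for for(;;): infinite loop
    cond = "%c" if has_cond else "i1 true"
    return blocks, backedge_from, cond
```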
Switch Statement -- sub_9359B0
The largest control flow handler after try/catch (~550 decompiled lines). Uses a three-phase approach with an internal open-addressing hash table.
Phase 1: Build case-to-BB mapping
Iterates the case list (linked list at stmt[10]+16, next pointer at +32). For each case label, creates a switch_case.target BB. Also creates one switch_case.default_target BB for the default case. Stores the mapping in an open-addressing hash table at CGModule offsets +496 through +520.
Hash table layout (32-byte entries):
| CGModule offset | Field |
|---|---|
| +496 | numEntries |
| +504 | bucket array pointer |
| +512 | numOccupied |
| +516 | numTombstones |
| +520 | capacity |
Uses the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function and growth policy.
Phase 2: Emit LLVM SwitchInst
Evaluates the switch condition via sub_92F410, then creates a SwitchInst via sub_B53A60 (SwitchInst::Create) with the case count, default target BB, and condition value. Each case constant is added via sub_B53E30 (SwitchInst::addCase).
Phase 3: Emit body
Creates a switch_child_entry BB, inserts it, and recursively emits the switch body. If the switch has no explicit default: case, emits a fallthrough to the switch_case.default_target BB.
LLVM IR pseudocode:
%val = load i32, ptr %x.addr
switch i32 %val, label %switch_case.default_target [
i32 0, label %switch_case.target
i32 1, label %switch_case.target1
i32 5, label %switch_case.target2
]
switch_case.target: ; case 0
; ...
br label %switch_child_entry ; fallthrough or break
switch_case.target1: ; case 1
; ...
switch_case.default_target: ; default
; ...
Note that cicc always emits an LLVM switch instruction. The decision to lower a switch into a jump table versus sequential comparisons is made later by the SelectionDAG backend (specifically NVPTXTargetLowering), not during IR generation. The codegen produces a clean, canonical switch and lets the backend optimize the dispatch strategy.
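The first two phases can be modeled compactly. This sketch uses a plain dict in place of the open-addressing table at CGModule +496..+520 and invents the function name and return shape; the switch_case.* BB naming and the always-emit-SwitchInst behavior follow the description above.

```python
# Phases 1-2 of switch lowering (sub_9359B0): pre-allocate one target BB per
# case label, then emit a canonical SwitchInst listing every case plus the
# default. The jump-table-vs-compare decision is left to the backend.
def lower_switch(case_values):
    # Phase 1: case-to-BB mapping (dict stands in for the hash table)
    mapping = {}
    for i, v in enumerate(case_values):
        suffix = "" if i == 0 else str(i)
        mapping[v] = f"switch_case.target{suffix}"
    default_bb = "switch_case.default_target"
    # Phase 2: canonical SwitchInst text, one entry per case
    cases = " ".join(f"i32 {v}, label %{bb}" for v, bb in mapping.items())
    ir = f"switch i32 %val, label %{default_bb} [ {cases} ]"
    return mapping, default_bb, ir
```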
Case Label -- sub_935670
When the recursive statement walk encounters a case label (kind 15), it looks up the parent switch node (asserts stmtKind == 16), finds the pre-allocated target BB from the hash table, and calls insertBB to make it the current insertion point. Fatal error "basic block for case statement not found!" if the hash table lookup fails.
For the default case (identified by a null value at +8), retrieves the last entry in the mapping vector.
Goto and Label Statements
Goto -- sub_931270
Reads the target label from stmt->auxData+128. Fatal error if null: "label for goto statement not found!".
Two code paths based on cleanup state:
Simple goto (no active cleanups, CGModule offset +240 == 0): Resolves the label to its BB via sub_946C80 and emits an unconditional branch.
Goto with cleanups (offset +240 != 0): Before branching, the handler must destroy all local variables whose scope is being exited. Calls sub_9310E0 to compute the destruction set, iterates each variable calling sub_9465D0 to emit destructor calls, resets the cleanup stack, then resolves and branches to the label BB.
; goto with cleanup: jumping out of scope with a std::string local
call void @_ZNSsD1Ev(ptr %str) ; ~string()
br label %label_target
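The two goto paths reduce to one branch on the cleanup state. A minimal sketch, assuming the destruction set has already been computed (the real handler derives it via sub_9310E0); the function name and the generic @dtor callee are illustrative.

```python
# Sketch of sub_931270: with no active cleanups, a bare branch; otherwise
# destructor calls for every local whose scope is exited, then the branch.
def emit_goto(label: str, locals_leaving_scope):
    ir = []
    for var in locals_leaving_scope:           # set computed by sub_9310E0
        ir.append(f"call void @dtor(ptr %{var})")  # emitted via sub_9465D0
    ir.append(f"br label %{label}")
    return ir
```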
Label -- sub_930570
Resolves the label to its BB via sub_946C80 and inserts it as the current basic block via insertBB. The BB name comes from the label's symbol name in the EDG IL.
Computed Goto (GCC &&label Extension)
Computed goto is handled in the expression codegen layer, not the statement dispatcher. Expression kind 0x71 at sub_921EA0 calls EmitBlockAddress (sub_1285E30) to produce an LLVM blockaddress constant, and expression kind 0x70 produces the label-as-value. The resulting indirectbr instruction is lowered later by IndirectBrExpandPass (pipeline parser index 247, "indirectbr-expand") because NVPTX does not natively support indirect branches -- they are expanded into a switch over all possible target labels.
Return Statement -- sub_9313C0
Reads the return expression from StmtNode offset +48. Dispatches on CGModule return-type information (offsets +208 and +216):
Path A -- Aggregate (struct) return: If the return type is aggregate (sub_91B770 returns true), emits a memcpy-like sequence into the sret pointer via sub_947E80. For multi-register returns (offset +216 > 0), uses bit-width analysis (_BitScanReverse64) to determine the return bit layout.
Path B -- Scalar return: Evaluates the expression, creates a ReturnInst via sub_B4D3C0, and may bitcast the value for ABI compliance via sub_AE5020.
Path C -- Void return with expression: Evaluates the expression for side effects only (calls emitExprStmt), then falls through to emit a void return.
Cleanup before return: If cg->hasCleanups (offset +240) is set, calls sub_9310E0 to compute the set of locals requiring destruction, emits destructor calls in reverse order, resets the cleanup stack, then emits an unconditional branch to the function's unified return block (offset +200).
; return with cleanup unwind
call void @_ZN3FooD1Ev(ptr %obj) ; ~Foo()
store i32 %retval, ptr %retval.addr
br label %return
return: ; unified return BB
%0 = load i32, ptr %retval.addr
ret i32 %0
The unified return block pattern means every return in a function branches to a single shared return BB rather than emitting ret directly. This is standard in compilers because it simplifies cleanup handling and produces cleaner IR for optimization.
Try/Catch -- sub_932270
The largest single statement handler at 2,225 decompiled lines (57 KB, 0x3B0 bytes of stack locals). Lowers C++ try/catch into LLVM's landingpad-based exception handling model.
High-level structure:
1. Collect catch handlers: Traverses the linked list at stmt->auxData+136 to build a vector of catch clause pointers.
2. Construct cleanup names: Builds a mangled cleanup function name from the function's symbol (reading the name range from symbol +184/+176). Single $ characters are doubled to $$ for LLVM compatibility.
3. Build dispatch mapping: Creates an outer dispatch vector mapping each catch clause to its target BB, stored in the same open-addressing hash table scheme used by switch.
4. Emit try body: Installs the landingpad/invoke mechanism so that throwing calls within the try body become invoke instructions rather than call instructions.
5. Emit catch handlers: For each catch clause, creates a BB, emits the handler body, and generates the cleanup/resume path.
Note that CUDA device code has exceptions disabled by default (EDG config DEFAULT_EXCEPTIONS_ENABLED = 0). This handler is exercised primarily for host-side code compiled through cicc, or for the rare case where exceptions are explicitly enabled via compiler flags. When exceptions are disabled, the EDG frontend strips try/catch entirely and the codegen never sees kind 18.
The NVVM IR verifier (sub_2C76F10) explicitly rejects landingpad, invoke, and resume instructions in device code, confirming that exception handling is a host-only feature.
Cleanup/Destructor Scope -- sub_931670
Handles statement kind 20. Only active when cg->hasCleanups (offset +240) is set.
Walks a linked list at StmtNode offset +72. For each entry where the byte at +8 equals 7 (indicating a variable with non-trivial destructor):
- Extracts the variable reference at entry[2] (offset +16).
- Checks visibility flags (bits 0x60 at +170, byte +177 != 5) to skip external and static symbols.
- Looks up the variable in the CGModule's var-lookup hash table (offsets +8 through +24) using the same hash function as the switch table.
- If the variable is already registered for cleanup (checked via sub_91CCF0), adds it to the pending cleanup list and emits an immediate destructor call via sub_9465D0.
- If not yet registered, just adds it to the pending list for later processing.
This mechanism ensures that C++ automatic variables with non-trivial destructors are properly destroyed when their scope exits -- whether by normal control flow, goto, return, or exception propagation.
Compound Statement -- sub_9365F0
Handles { ... } blocks (kind 11). This is the workhorse that ties everything together: the function body itself is a compound statement, and every block scope creates a nested compound.
Cleanup frame management: When cg->hasCleanups (offset +240) is set, pushes a new cleanup frame onto the cleanup stack (offset +424). Each frame is a 24-byte record: {pendingDestructors ptr, end, capacity}.
Variable declarations: Iterates local declarations at scope fields [14] and [15] (linked lists). For each local variable, emits an alloca or initializer as needed. If the variable has a non-trivial destructor, registers it in the cleanup set.
Statement iteration: Walks the child statement linked list starting at StmtNode offset +72, following nextStmt pointers at +16. For each child, calls emitStmt(cg, child) recursively. Between statements, checks whether pending cleanups need flushing (temporaries with non-trivial destructors). If new cleanup entries appeared since the last check, iterates them in reverse order and emits destructor calls.
Statement-expression support (GNU extension): For ({...}) expressions, if the last statement in the block is an expression (kind 0 or 25), treats its value as the compound's result. Fatal error: "unexpected: last statement in statement expression is not an expression!" if the last statement is not an expression type.
Scope tracking: Outside fast-codegen mode, pushes a DW_TAG_lexical_block debug scope at entry and pops at exit, so debuggers correctly associate variables with their lexical scope.
Variable Declaration -- sub_9303A0
Reads the variable descriptor from StmtNode offset +72, then the variable's symbol from descriptor +8.
Initialization dispatch (based on byte at symbol +177):
| Value | Meaning |
|---|---|
| 4 | Block-scope static -- fatal("block scope static variable initialization is not supported!") |
| 0, 3 | No dynamic init needed -- skip codegen |
| 2 | Dynamic initialization -- main path |
Dynamic init sub-dispatch (descriptor +48 byte):
| Sub-kind | Handler | Purpose |
|---|---|---|
| 1 | sub_91DAD0 | Load-address style init |
| 2 | sub_91FFE0 | Emit initializer expression |
| 3 | sub_92F410 | Direct expression evaluation |
| other | -- | fatal("unsupported dynamic initialization") |
After computing the initializer value, the handler checks for volatile store qualification, computes alignment via sub_91CB50, retrieves the alloca/global address via sub_9439D0, and emits the store via sub_923130.
%x = alloca i32, align 4 ; from function prologue
%init = call i32 @compute_value() ; dynamic initialization
store i32 %init, ptr %x, align 4 ; emitDeclStmt
Block-scope static variables (static int x = expr;) are explicitly unsupported and fatal. In CUDA device code, block-scope statics have no sensible semantics (no persistent thread-local storage across kernel invocations), so this restriction is intentional.
Loop Metadata: Pragma Unroll and Mustprogress
Pragma Unroll -- sub_9305A0
Called from while, do-while, and for handlers when StmtNode offset +64 (pragma annotation) is non-NULL. Parses "unroll %d" from the pragma string via sscanf.
| Count value | Metadata produced |
|---|---|
| 0x7FFFFFFF (INT_MAX) | !{!"llvm.loop.unroll.full"} |
| Specific N | !{!"llvm.loop.unroll.count", i32 N} |
| <= 0 | fatal("Unroll count must be positive.") |
| Parse failure | fatal("Parsing unroll count failed!") |
The metadata is wrapped in the standard LLVM loop-ID self-referential MDNode pattern:
br label %for.cond, !llvm.loop !3
!3 = !{!3, !4} ; self-ref loop ID
!4 = !{!"llvm.loop.unroll.count", i32 8}
Global flag dword_4D046B4 ("skip pragma" mode) gates this entirely -- when set, sub_9305A0 returns immediately.
Loop Mustprogress -- sub_930810
Called on every loop backedge (while, do-while, for). Creates !{!"llvm.loop.mustprogress"} and attaches it to the backedge branch. If the backedge already has !llvm.loop metadata (from pragma unroll), the existing operands are read and the mustprogress node is appended to create a combined MDNode:
br label %while.cond, !llvm.loop !5
!5 = !{!5, !6, !7} ; merged: self-ref + unroll + mustprogress
!6 = !{!"llvm.loop.unroll.count", i32 4}
!7 = !{!"llvm.loop.mustprogress"}
This metadata tells the LLVM optimizer that loops must make forward progress -- it is allowed to remove provably-infinite side-effect-free loops. This corresponds to the C++ forward progress guarantee required by the standard.
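The pragma parsing and metadata merge can be captured in one small function. This is a sketch of the combined behavior of sub_9305A0 and sub_930810, not a decompilation: the function name and return shape are invented, while the metadata strings, the INT_MAX full-unroll sentinel, and the fatal on non-positive counts come from the tables above.

```python
# Build the operand list of the self-referential loop-ID node attached to a
# loop backedge: optional unroll operand first, mustprogress always last.
def build_loop_md(unroll_count=None):
    ops = []                              # operands after the self-reference
    if unroll_count is not None:
        if unroll_count <= 0:             # fatal("Unroll count must be positive.")
            raise ValueError("Unroll count must be positive.")
        if unroll_count == 0x7FFFFFFF:    # INT_MAX => #pragma unroll (full)
            ops.append('!{!"llvm.loop.unroll.full"}')
        else:
            ops.append(f'!{{!"llvm.loop.unroll.count", i32 {unroll_count}}}')
    ops.append('!{!"llvm.loop.mustprogress"}')  # every backedge gets this
    return ops
```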
Infrastructure Functions
createBB -- sub_945CA0
Allocates an 80-byte BasicBlock object and initializes it with the LLVM context from CGModule offset +40. The name parameter produces the characteristic BB names visible throughout this page: "if.then", "while.cond", "for.inc", "switch_case.target", "constexpr_if.body", etc.
insertBB -- sub_92FEA0
void insertBB(CGModule *cg, BasicBlock *bb, int canDelete);
Finalizes the current BB (emits an implicit unconditional branch to bb if the current BB lacks a terminator), then inserts bb into the function's BB list. If canDelete is 1 and the BB has no predecessors, the BB is immediately freed -- this garbage-collects unreachable continuation blocks (e.g., if.end when both branches terminate, while.end when the loop is infinite).
The canDelete=1 flag is used for if.end, while.end, for.end, and do.end BBs.
finalizeBB / emitBr -- sub_92FD90
If the current BB exists and its last instruction is NOT a terminator (opcode check: opcode - 30 > 10 filters out br, ret, switch, etc.), creates a BranchInst to the target BB and inserts it. Then clears cg->currentBB and the insert point.
emitCondBr -- sub_945D00
Creates a conditional BranchInst with true/false targets and optional branch weight metadata. When weightHint != 0, attaches !prof branch_weights metadata via MDBuilder::createBranchWeights.
evalCondition -- sub_921E00
Evaluates a condition expression and converts the result to i1. Checks for aggregate types (fatal error if the condition is an aggregate), determines signedness, evaluates the expression, then emits icmp ne 0 (integer) or fcmp une 0.0 (floating point) to produce a boolean.
EDG StmtNode Layout
Reconstructed from usage patterns across all statement handlers:
| Offset | Size | Field |
|---|---|---|
| +0 | 4 | Source location: line number |
| +4 | 2 | Source location: column number |
| +16 | 8 | nextStmt -- linked list pointer |
| +40 | 1 | stmtKind -- enum value (0--25 observed) |
| +41 | 1 | Flags (bit 0x10 = likely, bit 0x20 = unlikely) |
| +48 | 8 | exprPayload / condition expression pointer |
| +64 | 8 | Pragma annotation (NULL or "unroll N" string) |
| +72 | 8 | auxData -- kind-specific (then-body, label, variable descriptor, etc.) |
| +80 | 8 | auxData2 -- kind-specific (else-body for if, init/increment for for, etc.) |
CGModule Offsets Used by Statement Codegen
| Offset | Size | Field |
|---|---|---|
| +8 | 8 | varLookupTable.buckets |
| +24 | 4 | varLookupTable.capacity |
| +40 | 8 | llvmContext |
| +96 | 8 | currentBB (BasicBlock pointer) |
| +104 | 8 | insertPoint |
| +192 | 8 | currentFunction (Function pointer) |
| +200 | 8 | returnBlock (unified return BB) |
| +208 | 8 | returnValue / sret pointer |
| +216 | 4 | returnAlignment |
| +240 | 1 | hasCleanups flag |
| +248 | -- | cleanupSet (DenseSet tracking which vars need cleanup) |
| +424 | 8 | cleanupStack pointer (24-byte frames) |
| +496 | 8 | switchHashTable.count |
| +504 | 8 | switchHashTable.buckets |
| +512 | 4 | switchHashTable.numOccupied |
| +516 | 4 | switchHashTable.numTombstones |
| +520 | 4 | switchHashTable.capacity |
| +528 | 8 | currentScope pointer |
Global Mode Flags
| Global | Purpose |
|---|---|
| dword_4D04658 | Fast codegen mode. Skips debug location emission, scope tracking, and some pragma processing. Corresponds to -G0 or an equivalent "no debug" mode. |
| dword_4D046B4 | Skip pragma mode. emitUnrollPragma returns immediately. Also gates some compound-statement declaration processing. |
| dword_4F077C4 | CUDA compilation mode. Value 2 triggers alternate volatile-qualification logic in for-loop increment and variable declaration codegen. |
Complete BB Naming Reference
Every basic block created by the statement codegen uses one of these exact names:
| Statement type | BB names created |
|---|---|
| if | if.then, if.else, if.end |
| if constexpr | constexpr_if.body, constexpr_if.end |
| while | while.cond, while.body, while.end |
| do-while | do.body, do.cond, do.end |
| for | for.cond, for.body, for.inc, for.end |
| switch | switch_case.target (per case), switch_case.default_target, switch_child_entry |
| goto / label | (named from label symbol) |
| return | (branch to unified return block) |
| compound { } | (no BBs unless cleanup) |
| dead code | "" (anonymous unreachable BB) |
These names survive into the final LLVM IR dump (-Xcuda-ptxas=-v) and are visible in optimization pass debug output. Recognizing them immediately tells you which source-level construct produced a given IR region.
Function, Call & Inline Asm Codegen
This page covers the four subsystems that together translate CUDA/C++ function definitions and call sites into LLVM IR: function prolog generation, call instruction emission, inline assembly compilation, and builtin lowering. The code lives in the 0x930000--0x960000 address range (Path A) with a parallel copy at 0x1270000--0x12D0000 (Path B).
| EmitFunction | sub_946060 (Path A) -- creates entry BB, allocapt sentinel, dispatches to prolog |
| GenerateFunctionProlog | sub_938240 (16 KB) -- parameter iteration, ABI dispatch, alloca emission |
| EmitCallExpr | sub_93CB50 (1,293 lines) -- type resolution, ABI classification, call emission |
| EmitInlineAsm | sub_1292420 (53 KB, 2,087 lines) -- 7-phase asm template-to-IR pipeline |
| BuiltinLowering | sub_12B3FD0 (103 KB, 3,409 lines) -- mega-switch over ~250 builtin IDs |
| EmitFunctionAttrs | sub_12735D0 / sub_1273F90 -- grid_constant, preserve_n, custom ABI metadata |
Function Prolog: Entry Block Setup
Every LLVM function produced by cicc starts with the same structural skeleton: an entry basic block containing a sentinel instruction, a cluster of alloca instructions for parameters and locals, and a return basic block for the unified exit path. The outer driver EmitFunction (sub_946060) builds this skeleton; the inner workhorse GenerateFunctionProlog (sub_938240) populates it with parameter handling code.
EmitFunction -- The Outer Driver
EmitFunction executes a fixed 10-step initialization sequence before tail-calling into the prolog generator:
EmitFunction(IRGenState *S, FunctionDecl *Decl, Function *F,
ParamList *Params, TypeInfoArray *TI, SourceLoc Loc, bool ByvalDemotion):
1. Resolve function type through typedef chain (kind==12 -> follow offset+160)
2. Call SetupFunctionMetadata(S, Decl)
3. Optionally set section name on F via Value::setSection
4. Create "entry" basic block:
entryBB = BasicBlock::Create(S, "entry", F, nullptr)
5. Create the "allocapt" sentinel instruction:
voidTy = Type::getVoidTy(ctx)
undef = UndefValue::get(voidTy)
allocapt = new BitCastInst(undef, voidTy) // void-to-void no-op
entryBB->getInstList().push_back(allocapt)
allocapt->setName("allocapt")
S->AllocaInsertPt = allocapt // stored at IRGenState+456
6. Create "return" basic block:
retBB = BasicBlock::Create(S, "return", nullptr, nullptr)
S->ReturnBlock = retBB // stored at IRGenState+200
7. Set up return value slot:
if returnType is void:
S->RetVal = nullptr
elif ABI kind == 2 (sret) AND isAggregate(returnType):
S->RetVal = F->arg_begin() // reuse the sret pointer
else:
S->RetVal = CreateTmpAlloca(S, returnType, "retval")
8. Store alignment of return type at S+216
9. Initialize insertion state: S->CurrentBB = entryBB
10. Tail-call GenerateFunctionProlog(S, Decl, F, Params, TI, Loc, ByvalDemotion)
The allocapt sentinel is the critical mechanism. It is a dead bitcast void undef to void instruction that serves as an insertion anchor. When CreateTmpAlloca (at sub_921D70) is called with no explicit array size -- the common case -- it inserts the new AllocaInst before the allocapt marker rather than at the current builder insertion point. This ensures that all alloca instructions cluster at the top of the entry block regardless of where in the function body they were requested, which is a hard requirement for LLVM's mem2reg pass to promote them to SSA registers.
The sentinel is eventually dead-code-eliminated in a later pass since it produces no usable value.
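The clustering trick is easy to demonstrate with a list-based model. This sketch is purely illustrative (the class and method names are invented); it shows why inserting before a fixed sentinel, rather than at the current builder point, keeps all allocas at the top of the entry block.

```python
# Model of the allocapt anchor: body instructions append at the end, but
# temp allocas insert before the sentinel, so they cluster at block top
# no matter when they are requested -- the property mem2reg requires.
class EntryBlock:
    def __init__(self):
        self.insns = ["allocapt"]   # the dead void-to-void bitcast sentinel

    def create_tmp_alloca(self, name: str):
        # insert before the sentinel, not at the current insertion point
        self.insns.insert(self.insns.index("allocapt"), f"%{name} = alloca")

    def append(self, insn: str):
        self.insns.append(insn)     # normal builder insertion at block end
```

Requesting an alloca after body code has already been emitted still places it above the sentinel, ahead of every non-alloca instruction.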
GenerateFunctionProlog -- Parameter Lowering
The prolog iterates four parallel data structures in lockstep:
| Cursor | Source | Stride | Termination |
|---|---|---|---|
| EDG parameter node | Linked list from Decl | next at offset +112 | nullptr |
| LLVM argument slot | F->arg_begin() | 40 bytes | F->arg_end() |
| Type info entry | From the ABI classifier | 40 bytes | (parallel with args) |
| Parameter index | 1-based counter | +1 | (parallel with params) |
A post-loop assertion validates that both cursors reached their end simultaneously: "Argument mismatch in generation function prolog!".
Struct Return: The agg.result Convention
Before entering the parameter loop, a helper (sub_938130) checks whether the first argument's ABI kind equals 2 (sret). When true, the prolog names the first LLVM argument "agg.result" and advances the argument cursor by one slot (+40 bytes), so that subsequent parameter processing starts at the second argument. This mirrors the standard LLVM sret convention where the caller pre-allocates space for a returned struct and passes a pointer as a hidden first parameter.
ABI Variant Dispatch
For each parameter, the ABI variant field at TypeInfo+12 selects one of four lowering paths:
Variant 0/1 -- Indirect/Aggregate Pass. The parameter arrives as a pointer to caller-allocated memory. If the type is an aggregate (struct/union/class/array -- type kinds 8--11 checked by IsAggregateType at sub_91B770), the prolog creates a local alloca named <param>.addr, stores the incoming argument into it, and registers the alloca in the declaration map via EmitParamDecl. If the type is a scalar, it goes directly to EmitParamDecl without an intermediate alloca.
Variant 2 -- Direct Pass (most common). The parameter is passed by value in a register or register pair. Two sub-paths exist:
- Byval demotion path. When the ByvalDemotion flag (parameter a7) is set and the parameter carries a byval attribute (TypeInfo+16 nonzero), the prolog consults a global name-set (dword_4D04688) to decide whether to create a __val_param temporary. If selected, it allocates a "tmp" alloca via CreateTmpAlloca, stores the argument into it, names the alloca "__val_param" + param_name, and falls through to EmitParamDecl. The __val_param prefix is NVIDIA-specific and marks parameters that have been demoted from byval to local copy for downstream optimization passes.
- Normal path. For non-byval scalars, calls EmitParamDecl directly. A guard validates that non-aggregate arguments are not marked indirect: "Non-aggregate arguments passed indirectly are not supported!".
Variant 3 -- Coercion. The parameter's LLVM type does not match the source type and requires a coercion cast. For aggregates, a "tmp" alloca is created. For scalars, the declaration is looked up and wrapped with a bitcast. The result is forwarded to EmitParamDecl.
EmitParamDecl -- Registration
EmitParamDecl (sub_9446C0) performs the final steps for each parameter:
- For scalar (non-aggregate, non-indirect) parameters: creates an alloca named <param>.addr, stores the incoming argument into it, and names the argument with the original parameter name.
- Inserts the mapping (EDG decl pointer -> LLVM Value*) into a hash map with open-addressing/quadratic-probing collision resolution. A duplicate check guards against re-declaration: "unexpected: declaration for variable already exists!".
- If debug info is enabled (dword_4D046B4), emits debug metadata for the parameter via sub_9433F0.
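The decl-to-Value map's collision scheme can be sketched as follows. This is a minimal illustration of open addressing with quadratic probing and the re-declaration guard, not the recovered implementation; table size, hash function, and probe sequence are assumptions:

```python
class DeclValueMap:
    """Open-addressed hash map with quadratic probing (illustrative sketch)."""

    def __init__(self, capacity=64):
        self.keys = [None] * capacity
        self.vals = [None] * capacity
        self.capacity = capacity

    def insert(self, decl_ptr, llvm_value):
        i = hash(decl_ptr) % self.capacity
        for step in range(self.capacity):
            slot = (i + step * step) % self.capacity  # quadratic probe
            if self.keys[slot] is None:
                self.keys[slot] = decl_ptr
                self.vals[slot] = llvm_value
                return
            if self.keys[slot] == decl_ptr:
                # mirrors the duplicate guard in EmitParamDecl
                raise RuntimeError(
                    "unexpected: declaration for variable already exists!")
        raise RuntimeError("table full")

    def lookup(self, decl_ptr):
        i = hash(decl_ptr) % self.capacity
        for step in range(self.capacity):
            slot = (i + step * step) % self.capacity
            if self.keys[slot] is None:
                return None
            if self.keys[slot] == decl_ptr:
                return self.vals[slot]
        return None
```
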
Naming Convention Table
| IR Value | Name Assigned |
|---|---|
| sret argument | "agg.result" |
| Unnamed parameter | "temp_param" |
| C++ this parameter | "this" (detected by bit 0 at EDG node offset +172) |
| Parameter alloca | <param_name> + ".addr" |
| Byval temp alloca | "__val_param" + <param_name> |
| Return value alloca | "retval" |
| Entry basic block | "entry" |
| Return basic block | "return" |
| Alloca sentinel | "allocapt" |
CreateTmpAlloca Internals
CreateTmpAlloca (sub_921D70) computes alignment from the type size using _BitScanReverse64 (effectively log2(size)), looks up or creates the pointer-to-type in the module's type system, then delegates to CreateAllocaInst (sub_921B80). The key detail: when no explicit array size is provided, the alloca is inserted at the allocapt marker position (IRGenState+456+24), not at the current builder insertion point.
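The alignment computation can be sketched as a highest-set-bit scan over the type size (the `_BitScanReverse64` equivalent). This is an illustrative model; whether the recovered code clamps the result to a maximum alignment is not established here:

```python
def alloca_alignment(type_size: int) -> int:
    """Alignment = 2**floor(log2(size)), via a highest-set-bit scan.

    Models the _BitScanReverse64-based computation described for
    CreateTmpAlloca; behavior for size 0 is an assumption.
    """
    if type_size <= 0:
        return 1
    msb = type_size.bit_length() - 1   # _BitScanReverse64 equivalent
    return 1 << msb
```

Note that non-power-of-two sizes round down: a 12-byte struct gets 8-byte alignment under this model.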
Call Codegen
Call emission (sub_93CB50) is a 1,293-line function that handles direct calls, indirect calls, builtins, special intrinsics, and printf interception. It receives the caller's codegen context, the EDG call expression node, and an optional pre-allocated destination for aggregate returns.
Phase 1: Type Resolution
The callee operand is extracted from the call node's first operand slot (offset +72). The function resolves the callee's declaration via sub_72B0F0, then peels through the type chain -- stripping typedef aliases (kind 12) by following offset +160 -- until it reaches a pointer-to-function type (kind 6) wrapping a function type (kind 7). Fatal assertions guard both steps: "Expected pointer to function!" and "unexpected: Callee does not have routine type!".
Phase 2: Builtin Dispatch
For direct calls (opcode 20), the resolved callee declaration is checked for the builtin flag: byte[199] & 2. When set, the entire normal call path is bypassed. Control transfers to sub_955A70 (or sub_12B3FD0 on Path B), the builtin lowering mega-switch described in a later section. If the builtin returns an aggregate, the call codegen allocates an "agg.tmp" stack slot and emits a store of the result into it.
Phase 3: Intrinsic Special Cases
If the callee is not a builtin but carries an intrinsic ID (word[176] != 0), a handful of intrinsic IDs receive special treatment:
| Intrinsic ID | Description |
|---|---|
| 10214 | Surface/texture primitive |
| 10219, 10227 | Warp-level primitives (detected via (id - 10219) & 0xFFF7 == 0) |
| 15752 | Special return convention intrinsic |
These dispatch to sub_939370, a dedicated handler that bypasses the normal ABI classification entirely.
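The warp-primitive detection mask can be verified mechanically: `(id - 10219) & 0xFFF7` clears only bit 3 of the offset, so exactly offsets 0 and 8 (IDs 10219 and 10227) pass. A small check, under the assumption that IDs stay within 16-bit range:

```python
def is_warp_primitive(intrinsic_id: int) -> bool:
    """True for the two warp-level primitive IDs 10219 and 10227.

    (id - 10219) & 0xFFF7 == 0 accepts only offsets 0 and 8, because
    0xFFF7 masks out bit 3 and keeps every other low bit.
    """
    return (intrinsic_id - 10219) & 0xFFF7 == 0
```
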
Phase 4: Argument Processing
Arguments are codegen'd by walking the argument linked list and calling sub_921F50 on each expression. Results are collected into a dynamically-growing array (24 bytes per entry, managed by sub_C8D5F0).
When bit 1 of the call node's flags byte (offset +60) is set -- indicating variadic or reversed-evaluation convention -- arguments are first collected into a temporary linked list and then written into the array in reverse order. This preserves the C right-to-left evaluation order for variadic calls.
Phase 5: ABI Classification
The ABI classifier (sub_9378E0) receives the return type, parameter types, and byval flags, and produces a calling-convention descriptor. Each parameter gets an ABI kind:
| ABI Kind | Meaning | Codegen Action |
|---|---|---|
| 0 | Direct (register) | Push value directly if scalar; alloca + store if byval aggregate |
| 1 | Indirect (pointer) | Push pointer directly (only valid for aggregates) |
| 2 | Indirect + byval | Push value directly (callee copies) |
| 3 | Coercion/expand | Multi-register split, handled by sub_923000 |
For the return value, ABI kind 2 means sret: a hidden first parameter is prepended to the argument list, pointing to a caller-allocated "tmp" alloca.
Phase 6: Callee Bitcast Folding
If the callee operand is a bitcast (byte[0] == 5), the optimizer walks back to the original function pointer and compares return types and parameter counts. If the signature matches exactly (pointer equality on type nodes, parameter-by-parameter comparison), the bitcast is folded out. This removes unnecessary bitcast wrappers that arise from C-style casts between compatible function pointer types.
Phase 7: Pre-Call Hooks and printf Interception
Debug location metadata is emitted via sub_92FD10. Then a special case: if the call is direct (opcode 20) and the callee name is literally "printf", control transfers to sub_939F40 which performs GPU printf lowering -- converting the printf call into a vprintf-style call that writes formatted output through the GPU's printf buffer mechanism.
Phase 8: preserve_n Operand Bundles
If the call node's preserve_data field (offset +64) is non-null, up to three operand bundles are attached to the call instruction:
preserve_data[0] >= 0 => "preserve_n_data" = ConstantInt(value)
preserve_data[1] >= 0 => "preserve_n_control" = ConstantInt(value)
preserve_data[2] >= 0 => "preserve_n_after" = ConstantInt(value)
These NVPTX-specific operand bundles are register-pressure hints consumed by the instruction scheduler and register allocator. The value -1 means "not specified" and suppresses the bundle.
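The bundle-construction rule above (emit a bundle per slot, suppress on -1) can be sketched directly:

```python
def build_preserve_bundles(preserve_data):
    """Build preserve_n operand bundles from the 3-slot record.

    preserve_data holds (data, control, after) values; -1 means
    'not specified' and suppresses that bundle. Returns the
    (tag, value) pairs to attach to the call instruction.
    """
    tags = ("preserve_n_data", "preserve_n_control", "preserve_n_after")
    return [(tag, val) for tag, val in zip(tags, preserve_data) if val >= 0]
```
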
Phase 9: Call Emission and Attribute Attachment
The LLVM CallInst is created by sub_921880, which takes the callee, the argument array, return type, and the optional operand bundle. Calling-convention attributes (sret, byval, alignment) are collected by sub_93AE30 and attached to the call. For indirect calls, the instruction is named "call" for readability; direct calls inherit the callee's name.
Phase 10: Return Value Handling
| Return ABI Kind | Handling |
|---|---|
| 0 or 1 (direct scalar) | Return the CallInst result directly |
| 0 or 1 (direct aggregate) | Allocate "agg.tmp", store the result, return the alloca |
| 2 (sret) | Return the sret pointer (aggregate) or load from it (scalar) |
| 3 (expanded/multi-register) | Call sub_923000 to split across multiple extracts |
For indirect calls, callalign metadata is constructed by querying the alignment requirement of the return type and each argument type, wrapping them in an MDTuple, and attaching it to the call instruction. This metadata is consumed by the NVPTX backend to generate correct alignment annotations in PTX.
Call Emission Pseudocode
EmitCallExpr(Result *Out, CodegenCtx *Ctx, CallNode *Call, u64 DestFlags, u32 Align):
callee_decl = ResolveCallee(Call->operand[0])
func_type = PeelTypedefs(callee_decl->type) // kind 6 -> kind 7
// ---- Builtin fast path ----
if Call->opcode == CALL_DIRECT AND callee_decl->flags[199] & 2:
result = BuiltinLowering(Ctx, Call)
if isAggregate(func_type->returnType):
dest = DestFlags.ptr OR CreateTmpAlloca("agg.tmp")
Store(result, dest, ComputeAlign(returnType))
Out = {dest, INDIRECT, sizeof(returnType)}
else:
Out = result
return
// ---- Special intrinsics ----
if callee_decl->intrinsicID in {10214, 10219, 10227, 15752}:
return SpecialIntrinsicHandler(Out, Ctx, callee_decl->intrinsicID, Call)
// ---- Normal call path ----
callee_val = CodegenCallee(Ctx, Call->operand[0])
args[] = CodegenArguments(Ctx, Call->argList)
if Call->flags & REVERSED_EVAL:
Reverse(args)
abi_desc = ClassifyABI(func_type->returnType, paramTypes, byvalFlags)
if abi_desc.returnIsSRet:
sret_ptr = DestFlags.ptr OR CreateTmpAlloca("tmp")
PrependArg(args, sret_ptr)
for each (arg, abi_entry) in zip(args, abi_desc.params):
if abi_entry.kind == DIRECT AND abi_entry.isByval:
tmp = CreateAllocaForAggregate(arg)
Store(arg, tmp)
arg = tmp
elif abi_entry.kind == INDIRECT:
assert isAggregate(arg.type)
callee_val = FoldCalleeBitcast(callee_val, func_type)
EmitDebugLoc(Ctx, Call->srcLoc)
if Call->opcode == CALL_DIRECT AND callee_name == "printf":
return PrintfExpansion(Ctx, abi_desc, args, Call->srcLoc)
bundle = BuildPreserveNBundle(Call->preserveData)
call_inst = EmitCall(func_type, callee_val, args, bundle)
AttachCCAttrs(call_inst, abi_desc)
Out = HandleReturnValue(call_inst, abi_desc, func_type->returnType)
Inline Assembly Codegen
The inline asm handler (sub_1292420, 53 KB) translates a CUDA __asm__() statement into an LLVM InlineAsm call instruction through a strict 7-phase pipeline. A nearly-identical duplicate exists at sub_932270 for the Path A codegen context -- same parsing logic, same constraint table, different diagnostic function pointers.
Phase 1: Template String Parsing
The raw PTX template string from the EDG AST is scanned character-by-character into a fragment array. Each fragment (48 bytes) is either a literal text chunk (kind=0) or an operand substitution reference (kind=1 with an operand index at offset +0x28).
The parser handles the CUDA-to-LLVM syntax translation:
| CUDA Syntax | LLVM IR Output | Parser Action |
|---|---|---|
| $ (literal dollar) | $$ | Escape doubling |
| %% | % | Literal percent |
| %N (operand ref) | Fragment kind=1, index=N | Multi-digit decimal parse |
| %= (unique ID) | ${:uid} | LLVM unique-identifier modifier |
| %[name] | -- | Fatal: "symbolic operand reference not supported!" |
| %cN (modifier+operand) | Fragment kind=1, modifier=c, index=N | Alpha char + decimal parse |
For operands referencing string literal constants (the C constraint), the parser resolves the constant through the EDG value chain, validates the type is array of char, extracts each byte, escapes any $ characters, strips the trailing NUL, and emits the entire string as a literal fragment.
Phase 2: Template Reconstruction
The fragment array is serialized into the final LLVM inline-asm template string:
- Literal fragments: appended verbatim.
- Operand references without modifier: converted to $N (e.g., operand 3 becomes $3).
- Operand references with modifier: converted to ${N:c} (e.g., operand 0 with modifier h becomes ${0:h}).
This is where the CUDA %N convention is translated to LLVM's $N convention. Literal % characters in PTX (like %tid.x) pass through unchanged because they were never parsed as operand references.
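The Phase 1/2 translation can be sketched end to end as a single string rewrite. This is an illustrative model covering the table above ($ doubling, %%, %N, %=, literal-% pass-through); modifier forms (%cN) and the C string-constraint path are omitted:

```python
def translate_asm_template(cuda_tmpl: str) -> str:
    """Rewrite a CUDA inline-asm template into LLVM's $N convention."""
    out, i = [], 0
    while i < len(cuda_tmpl):
        c = cuda_tmpl[i]
        if c == '$':                       # literal dollar: escape-double
            out.append('$$'); i += 1
        elif c == '%':
            nxt = cuda_tmpl[i + 1] if i + 1 < len(cuda_tmpl) else ''
            if nxt == '%':                 # %% -> literal percent
                out.append('%'); i += 2
            elif nxt == '=':               # %= -> unique-ID modifier
                out.append('${:uid}'); i += 2
            elif nxt.isdigit():            # %N -> $N (multi-digit parse)
                j = i + 1
                while j < len(cuda_tmpl) and cuda_tmpl[j].isdigit():
                    j += 1
                out.append('$' + cuda_tmpl[i + 1:j]); i = j
            else:                          # bare % (e.g. %tid.x) passes through
                out.append('%'); i += 1
        else:
            out.append(c); i += 1
    return ''.join(out)
```

Running it on the end-to-end example's input reproduces the reconstructed template: `"mov.u32 %0, %tid.x"` becomes `"mov.u32 $0, %tid.x"`.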
Phase 3: Constraint String Construction
The parser iterates the EDG operand linked list, building a comma-separated LLVM constraint string. Each EDG operand carries a constraint type-chain -- a linked list of tag bytes that map through a 256-byte global lookup table (aXg0123456789rh[]) to produce LLVM constraint letters.
Output operands (flags & 2 != 0):
- Pointer types: constraint prefix "=*" + letters (indirect output).
- Non-pointer types: constraint prefix "=" + letters (direct output).
- Read-write operands (byte at +24 == 3): a tied input operand is generated with the output's index as the constraint, linking them as a two-address pair.
Input operands:
- Same tag-to-letter mapping.
- Tags 10--19 are prohibited: "tied input/output operands not supported!" (GCC-style matching-digit constraints are not implemented).
- Tag 23 (the C constraint on inputs) creates an undef value -- the constant's value was already inlined into the template string during Phase 1.
Special tag handling:
| Tag | Effect |
|---|---|
| 8, 9 | Sets is_address + is_memory flags; tag 9 also emits "imr" composite constraint |
| 0x14, 0x15, 0x16, 0x18, 0x26, 0x2A | Pointer-through types: follow type chain, set is_address |
| 0x19, 0x1B, 0x1C | Memory constraints |
| 23 | Remapped to tag 20 before table lookup |
Phase 4: Clobber List
The EDG clobber linked list (at asmInfo+144) is iterated. Each clobber node has a tag byte selecting the clobber type:
- Tag 1: Memory clobber. Appends ",~{memory}" to the constraint string.
- Tag 58: Named register clobber. Uses the name string from the node. Appends ",~{<name>}".
- Other tags: Looks up the register name from a global table (off_4B6DCE0[tag]). Appends ",~{<name>}".
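The clobber-appending step can be sketched as follows. The register-name table `off_4B6DCE0` is stood in for by a small placeholder dict, and the tag values other than 1 and 58 shown in the test are illustrative:

```python
def append_clobbers(constraints: str, clobbers) -> str:
    """Append clobber entries to an LLVM constraint string (sketch).

    clobbers is a list of (tag, name) pairs. Tag 1 is a memory clobber,
    tag 58 a named register clobber; other tags index a register-name
    table (modeled here by REG_TABLE, a placeholder for off_4B6DCE0).
    """
    REG_TABLE = {2: "r0", 3: "r1"}        # illustrative stand-in
    parts = [constraints] if constraints else []
    for tag, name in clobbers:
        if tag == 1:
            parts.append("~{memory}")
        elif tag == 58:
            parts.append("~{%s}" % name)
        else:
            parts.append("~{%s}" % REG_TABLE[tag])
    return ",".join(parts)
```
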
Phase 5: InlineAsm Object Creation
The LLVM function type for the asm is constructed based on the output count:
- Zero outputs: void return type.
- One output: scalar return type matching the output operand.
- Multiple outputs: anonymous struct return type.
The volatile/sideeffect flag is read from asmInfo+128 (bit 2). A diagnostic (0xE9F) warns when outputs exist but the asm is not marked volatile, as this risks miscompilation.
The InlineAsm object is created via InlineAsm::get(funcType, asmString, constraintString, hasSideEffects, isAlignStack=0, dialect=0) and a CallInst is emitted to invoke it.
Phase 6: Result Extraction
For single-output asm, the CallInst result is used directly. For multiple outputs, each result is extracted with extractvalue instructions:
- Results with type size <= 16 bytes: a compact extractvalue path.
- Results with type size > 16 bytes: a full instruction node (88 bytes) is allocated, the extractvalue is constructed with explicit index arrays, linked into the basic block's instruction list, and named "asmresult".
Each extracted value is then stored into its output destination via sub_12843D0, which reads the output codegen-info records built during Phase 3.
Phase 7: Cleanup
All temporary vectors and strings are freed: the fragment array (with per-element string cleanup), constraint strings, operand/type/destination vectors, and tied-operand tracking arrays.
End-to-End Example
CUDA source: __asm__("mov.u32 %0, %tid.x" : "=r"(result));
Phase 1 parse: [literal("mov.u32 "), operand(idx=0), literal(", %tid.x")]
Phase 2 recon: "mov.u32 $0, %tid.x"
Phase 3 constr: "=r"
Phase 4 clobber: ""
Phase 5 create: InlineAsm::get("mov.u32 $0, %tid.x", "=r", sideeffects=true)
call i32 asm sideeffect "mov.u32 $0, %tid.x", "=r"()
Phase 6 extract: (single output -- use call result directly)
store i32 %asm_result, i32* %result.addr
Builtin Lowering
The builtin lowering mega-switch (sub_12B3FD0, 103 KB) is one of the largest single functions in the binary. It handles ~250 builtin IDs across ~130 case labels, dispatching CUDA intrinsic functions like __syncthreads(), __shfl_sync(), and __hmma_m16n16k16_mma_f16f16 into LLVM IR.
Entry Logic
The function extracts the callee from the call expression, validates the builtin bit (flags byte[199] & 2), then looks up the builtin ID by name via sub_12731E0. If the ID is 0 (name not in the builtin table), execution falls through to the LLVM intrinsic fallback path at line 3154.
Five Lowering Strategies
| Strategy | Usage (%) | Mechanism |
|---|---|---|
| Sub-handler delegation | 66% (~165 IDs) | Calls a specialized function for a family of builtins |
| Intrinsic call emission | 12% (~30 IDs) | 1:1 mapping to a single llvm.nvvm.* intrinsic via sub_1285290 |
| Inline IR generation | 10% (~25 IDs) | Builds IR nodes directly (alloca, load, store, cast, insertvalue) |
| Table-driven selection | 10% (~25 IDs) | Selects intrinsic ID from a table keyed by operand type/size |
| SM-gated conditional | 2% (~5 IDs) | Different lowering depending on target SM version |
Per-Category Dispatch
Atomics and synchronization (IDs 0xB5--0xCC, 181--204). Atomic operations delegate to sub_12A7DA0; fences and barriers to sub_12AB550. Cases 0xBA--0xBC map directly to LLVM intrinsic 6 (likely llvm.nvvm.atomic.*) with type-overloaded arguments. Case 0xCB is SM-gated: on SM <= 63 it emits an inline constant; on SM >= 70 it emits intrinsic 3769.
Warp shuffle (IDs 0x15F--0x166, 351--358). All eight variants delegate to sub_12ABB90 parameterized by shuffle mode (0=idx, 1=up, 2=down, 3=butterfly) and sync flag (0=legacy, 1=__shfl_sync_*). The clamp flag distinguishes butterfly from other modes.
Warp vote/ballot (IDs 0x12E--0x135, 0x152--0x159, 0x18B--0x192). Three groups of 8 IDs each, all delegating to sub_12B3540 with the builtin ID as a discriminator. This covers __ballot_sync, __all_sync, __any_sync across integer/float/predicate operand types.
Surface and texture operations (IDs 0xCF--0x113, 0x287--0x2A5, 207--275 + 647--677). The largest category at ~95 IDs (38%). Organized into pairs using two sub-handlers: sub_12ADE80(ctx, intrinsic_base, surface_type, variant, args) for individual load/store operations, and sub_12AA9B0(ctx, surface_type, expr) for combined operations. Surface types are encoded as integers (0=generic, 1=1D, 5=2D, 7=3D, 8=cubemap, 10=1D array, 11=2D array, 14=buffer). Intrinsic bases 3701/3702 are primary read/write; 3698/3699 are 2D-array variants.
The texture handler (case 0x287) is the most complex single case at ~230 lines. It walks the AST to extract the texture name string and return element type, constructs an intrinsic name as "<texname>_<typename>" using a type-name resolution switch (mapping integer subtypes 0--10 to strings like "uchar", "int", "ulonglong"), and emits the call. A global flag (dword_4F06B98) controls whether plain char maps to uchar or schar.
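The "<texname>_<typename>" name construction can be sketched as a table lookup. The specific subtype indices in this partial name table are illustrative assumptions (the source establishes only that integer subtypes 0--10 map to strings like "uchar", "int", "ulonglong"); the dword_4F06B98 flag is modeled as a boolean:

```python
def texture_intrinsic_name(tex_name: str, subtype: int,
                           plain_char_is_unsigned: bool = True) -> str:
    """Construct "<texname>_<typename>" as the texture case does (sketch).

    TYPE_NAMES is a partial, assumed mapping; plain_char_is_unsigned
    models the global flag that selects uchar vs schar for plain char.
    """
    TYPE_NAMES = {
        0: "uchar" if plain_char_is_unsigned else "schar",  # assumed index
        4: "int",                                           # assumed index
        10: "ulonglong",                                    # assumed index
    }
    return "%s_%s" % (tex_name, TYPE_NAMES[subtype])
```
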
Tensor core / WMMA (IDs 0x16E--0x1D9, 0x2A6--0x2E8, 366--473 + 678--744). The second-largest category at ~85 IDs (34%). Three sub-handlers partition the work: sub_12AC1A0 handles wmma::mma_sync with bias/scale flags (has_bias, has_scale) encoding four accumulator modes; sub_12AC5F0 handles store_matrix_sync; sub_12ACA80 handles load_matrix_sync. IDs group into triplets by matrix shape: m16n16k16, m32n8k16, m8n32k16, m16n16k8 (TF32), bf16, and fp8 (SM 89+) families.
WGMMA (IDs 0x2E9--0x302, 745--770). SM 90+ warpgroup MMA operations. Cases 0x2E9--0x2EE handle fence/commit/wait. Cases 0x2F1--0x2FC implement __wgmma_mma_async through a massive ~800-line handler that selects from a 144-entry intrinsic table spanning IDs 5304--5447. The table is indexed by a 5-dimensional grid: N-size (16/32/64/128), B-operand source (shared vs register), element type (s64 vs other), scale/negate flags, and case variant. Mode bits are packed into a single integer: bit0=accumulate | bit1=transpose | bit2=negate-C | bit4=negate-A.
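The WGMMA mode-bit packing described above is straightforward to express directly:

```python
def pack_wgmma_mode(accumulate=False, transpose=False,
                    negate_c=False, negate_a=False) -> int:
    """Pack WGMMA mode flags into one integer:
    bit0=accumulate | bit1=transpose | bit2=negate-C | bit4=negate-A.
    (Bit 3 is unused in the recovered layout.)"""
    return (int(accumulate)
            | int(transpose) << 1
            | int(negate_c) << 2
            | int(negate_a) << 4)
```
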
Memory copy (IDs 0x199, 0x291--0x299, 409 + 657--665). Memcpy variants encode alignment directly in the builtin ID: ID 658 = align 2, ID 659 = align 4, ID 660 = align 8, ID 661 = align 16. The actual emission delegates to sub_12897A0. Memset operations (IDs 410, 663, 665) delegate to sub_12A6DF0.
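The four listed alignment pairs (658 to 2, 659 to 4, 660 to 8, 661 to 16) follow the closed form align = 2^(ID - 657). Treating this as a formula is an inference from the table, not a recovered constant:

```python
def memcpy_builtin_alignment(builtin_id: int) -> int:
    """Alignment encoded by memcpy builtin IDs 658-661 (inferred formula)."""
    assert 658 <= builtin_id <= 661
    return 1 << (builtin_id - 657)
```
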
TMA bulk operations (IDs 0x19B--0x1A0, 411--416). Cases 0x19B and 0x19C are the largest individual handlers (~300 and ~450 lines respectively) for SM 90+ tensor memory access bulk copy/scatter operations. They build operand vectors iteratively and select from intrinsic tables indexed by element count (IDs 4218--4223 for stores, 4244--4250 for loads).
LLVM Intrinsic Fallback Path
When the builtin ID is 0, the default path (lines 3154--3407) looks up the LLVM intrinsic by name via sub_15E2770. If the intrinsic is type-overloaded, argument types are used to resolve the declaration. Each argument is lowered via sub_128F980, with type-mismatch bitcasts (opcode 47) and vector zexts (opcode 33) inserted as needed. Struct-return intrinsics are handled by iterating the return struct's fields with extractvalue.
Function Attributes
CUDA function attributes are lowered through a three-stage pipeline: EDG frontend parsing, attribute emission during IR generation, and a final metadata-attachment pass.
Stage 1: Frontend Parsing (sub_64F1A0)
The EDG parser scans the token stream for preserve_n_data, preserve_n_control, and preserve_n_after identifiers, parses each as an integer, and stores them in a 12-byte struct at offset +336 of the function declaration node:
struct preserve_reg_info {
int32_t preserve_n_data; // +0, -1 = not specified
int32_t preserve_n_control; // +4, -1 = not specified
int32_t preserve_n_after; // +8, -1 = not specified
};
Stage 2: Attribute Emission (sub_12735D0)
During IR generation, the attribute emitter checks declaration flags and writes attribute bundles:
- Bit 0x20 at decl+198 (kernel function): emits ("kernel", 1). Then iterates the parameter array (40-byte entries); for each parameter with byte[+33] != 0, emits ("grid_constant", param_index) where param_index is 1-based. This marks individual kernel parameters as grid-constant, enabling the backend to place them in constant memory.
- Bit 0x04 at decl+199 (custom ABI): emits ("full_custom_abi", 0xFFFFFFFF).
- Preserve-reg struct at decl+336: for each of the three fields, if the value is >= 0, emits the corresponding attribute and then writes -1 back (consumed pattern) to prevent double-emission.
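The consumed pattern for the preserve-reg struct can be sketched as follows, with a dict standing in for the declaration node:

```python
def emit_preserve_attrs(decl):
    """Emit preserve_n_* attributes with the consumed pattern (sketch).

    decl is a dict modeling the 12-byte struct at decl+336. Each field
    is emitted once if >= 0, then overwritten with -1 so a second pass
    over the same declaration emits nothing.
    """
    emitted = []
    for key in ("preserve_n_data", "preserve_n_control", "preserve_n_after"):
        if decl[key] >= 0:
            emitted.append((key, decl[key]))
            decl[key] = -1                 # consumed: prevent double-emission
    return emitted
```
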
Stage 3: Metadata Attachment (sub_1273F90)
The reader pass iterates all functions' attribute bundles and re-encodes them as LLVM named metadata:
grid_constant. Per-parameter type values are collected into a vector, then bundled under the MDString key "grid_constant" as an MDTuple. The downstream consumer sub_CE8660 queries this metadata to determine aliasing/readonly semantics for kernel parameters.
preserve_reg_abi. The three preserve_n values are collected with their MDString keys ("preserve_n_data", "preserve_n_control") into a vector, then bundled under the composite key "preserve_reg_abi" as an MDTuple. The register allocator and prologue-epilogue inserter query this via sub_314D260.
full_custom_abi. Emitted as a simple (MDString, MDNode(i32 0xFFFFFFFF)) pair. When a function has this attribute but NOT the full_custom_abi flag, the alternative "numParams" key records the explicit parameter count as a nested MDTuple.
Final Metadata Layout
For a __global__ kernel with grid_constant parameters and register preservation:
!kernel_attrs = !{
!MDString("kernel"), !MDNode(i32 1),
!MDString("grid_constant"), !MDTuple(
!MDNode(i32 <param1_type>), !MDNode(i32 <param2_type>), ...
),
!MDString("preserve_reg_abi"), !MDTuple(
!MDString("preserve_n_data"), !MDNode(i32 N),
!MDString("preserve_n_control"), !MDNode(i32 M),
!MDString("preserve_n_after"), !MDNode(i32 K)
)
}
Attribute Semantics
| Attribute | Meaning | Backend Effect |
|---|---|---|
| grid_constant | Kernel parameter is immutable across the grid | Place in constant memory; optimize loads |
| preserve_n_data | N data registers must be preserved across calls | Register allocator reserves R0--RN |
| preserve_n_control | N predicate registers to preserve | Prologue/epilogue saves predicates |
| preserve_n_after | N registers preserved after a call (callee-save count) | Adjusts spill/restore boundaries |
| full_custom_abi | Function bypasses standard CUDA calling convention | Parameter passing determined by explicit annotations |
| numParams | Explicit parameter count for non-full_custom_abi functions | Custom ABI parameter setup |
Cross-Reference
| Address | Function | Role |
|---|---|---|
| sub_946060 | EmitFunction | Creates entry BB, allocapt, return BB, dispatches to prolog |
| sub_938240 | GenerateFunctionProlog | Iterates parameters, ABI dispatch, alloca emission |
| sub_9446C0 | EmitParamDecl | Creates alloca+store, registers decl->Value mapping |
| sub_921D70 | CreateTmpAlloca | Alloca creation with alignment, inserted at allocapt |
| sub_921B80 | CreateAllocaInst | Low-level alloca IR emission |
| sub_938130 | IsSRetReturn | Checks ABI kind == 2 |
| sub_91B770 | IsAggregateType | Type kinds 8--11 (struct/union/class/array) |
| sub_93CB50 | EmitCallExpr | Full call instruction emission (1,293 lines) |
| sub_9378E0 | ClassifyABI | Return + parameter ABI classification |
| sub_939F40 | PrintfExpansion | GPU vprintf lowering for printf calls |
| sub_93AE30 | CollectCCAttrs | Builds sret/byval/align attribute list |
| sub_955A70 / sub_12B3FD0 | BuiltinLowering | Mega-switch over ~250 builtin IDs |
| sub_1292420 / sub_932270 | EmitInlineAsm | 7-phase asm template-to-IR pipeline |
| sub_12735D0 | EmitFunctionAttrs | Writes attribute bundles during IR gen |
| sub_1273F90 | ReadFunctionAttrs | Attaches LLVM named metadata from bundles |
| sub_64F1A0 | ParsePreserveAttrs | EDG parser for preserve_n_* tokens |
Type Translation, Globals & Special Variables
The type translation subsystem is one of the most algorithmically complex parts of NVVM IR generation. It converts the Edison Design Group (EDG) intermediate language type graph --- which can contain arbitrary mutual recursion, template-dependent types, and CUDA address-space qualifiers --- into a well-formed LLVM type system. The same IR generation phase also handles global variable materialization (with CUDA memory-space assignment), kernel metadata emission, and the translation of CUDA built-in variables (threadIdx, blockIdx, etc.) into LLVM intrinsic calls.
| Type translation entry | sub_91AED0 (640 bytes) |
| Fixed-point driver | sub_91AB30 (896 bytes) |
| Topological sort | sub_919CD0 (896 bytes, 10-level BFS) |
| Type-kind dispatch | sub_918E50 (2,400 bytes, 11+ categories) |
| Type-pair comparator | sub_911D10 (1,024 bytes) |
| Global var creation | sub_915C40 (2,018 bytes) |
| Address space logic | sub_916430 (482 bytes) |
| Annotation emitter | sub_914410 (3,524 bytes) |
| Kernel metadata | sub_93AE30 (~5,600 bytes) |
| Special var classifier | sub_920430 (old) / sub_127F7A0 (new) |
| Special var codegen | sub_922290 (old) / sub_1285550 (new) |
EDG-to-LLVM Type Translation
The Problem
EDG represents C++ types as a graph of IL nodes linked through child/parent pointers, member chains, and scope references. This graph can be arbitrarily cyclic: consider struct A { B* b; }; struct B { A* a; }; where translating A requires translating the pointee type B, which requires translating the pointee type A. Template instantiations add another dimension --- a template class body may reference types that cannot be resolved until the template arguments themselves are translated. The type translator must produce valid LLVM types from this graph without infinite recursion or stale mappings.
NVIDIA solves this with a fixed-point iteration scheme: translate every type, detect whether any translation changed a previously-emitted LLVM type, and if so, repeat the entire pass. The iteration terminates when a full pass produces no changes.
Context Object Layout
The type translation pass operates on a context structure initialized by sub_91AB30 and threaded through every function in the subsystem:
| Offset | Size | Field |
|---|---|---|
| +0x000 | 8 | debug_logger --- nullable, enables trace output when non-null |
| +0x008 | 8 | pass_list_ptr --- vector of (vtable_ptr, pass_instance) pairs |
| +0x010 | 8 | target_info |
| +0x018 | 8 | address_space_map --- qualifier-to-LLVM-AS translation table |
| +0x020 | 8 | llvm_context --- the LLVMContext* |
| +0x028 | 8 | module_ptr |
| +0x038 | 8 | edg_node_map --- hash table: EDG nodes to LLVM values |
| +0x038 | 16 | visited_set --- open-addressed hash set for dedup (at +0x38..+0x48) |
| +0x050 | 4 | iteration_counter |
| +0x060 | 12 | visited_set control (count, capacity, bucket_count) |
| +0x078 | 8 | processed_list --- vector of completed types |
| +0x090 | 16 | type_cache --- hash table: EDG type pointer to LLVM Type* |
| +0x0A0 | 8 | remap_list --- vector of type-remapping entries |
| +0x150 | 8 | alignment_table --- target-specific alignment data |
| +0x168 | 4 | threshold --- type index below which scope lookups are attempted |
| +0x2A0 | 16 | pending_replacements --- vector of (old_type, new_type) pairs |
| +0x310 | 1 | flags --- bit-packed control flags |
Fixed-Point Iteration Algorithm
The entry point sub_91AED0 recovers pass infrastructure objects by iterating a vector<pair<void*, void*>> at context+8. Each element is 16 bytes: a vtable pointer identifying the pass, and a pass instance pointer. The function compares vtable pointers against 8 known globals to extract the data layout, reflect pass, target transform info, module context, dominator tree, and alias analysis results. It then calls sub_91AB30, the actual iteration driver.
// sub_91AB30: TypeTranslationPass driver
fn translate_all_types(ctx: &mut TypeTransCtx, module: &EDGModule) {
// Optional pre-processing (gated by byte_3C34E60)
if PRE_PROCESS_FLAG {
pre_process_types(ctx, module); // sub_90F800
}
// Gather initial flags from all module members
for member in module.members() { // linked list from module+80
gather_initial_flags(member); // sub_AA3700
}
// MAIN FIXED-POINT LOOP
loop {
let changed = single_iteration(ctx, module); // sub_91AA50
if !changed { break; }
}
// Optional late fixup pass (gated by byte_3C35480)
if OPTIMIZATION_FLAG {
finalize_late_types(ctx, module); // sub_90F750
loop {
let changed = late_fixup(ctx, module); // sub_917E30
if !changed { break; }
}
}
// Optional cleanup (gated by dword_3C351E0)
if CLEANUP_FLAG {
cleanup_stale_types(ctx); // sub_90EB40
}
flush_and_finalize(ctx); // sub_909590
}
Each single iteration (sub_91AA50) performs three steps:
- Topological sort (sub_919CD0): Build a dependency ordering of all EDG type nodes reachable from the module root.
- Invalidate (sub_913880, for each type in reverse order): Remove stale cache entries for types whose dependencies have changed.
- Process (sub_9197C0, for each type in reverse order): Translate each type, returning whether any LLVM type was modified.
The iteration returns the logical OR of all sub_9197C0 results. If any type replacement occurred, the outer loop repeats.
10-Level Topological Sort
The function sub_919CD0 produces a dependency-ordered list of EDG types. Rather than a standard DFS-based topological sort, it uses a 10-level iterative BFS implemented with sorted sets at each level. This unusual depth accommodates deeply nested C++ class hierarchies with multiple inheritance, where types at depth N must be resolved before types at depth N+1 can be translated.
Each level maintains a sorted set (vector-backed, managed by sub_6CDA50 for initialization and sub_6CDC80 for merge/sort). Starting from the module's member list, the algorithm:
- Inserts root-level type declarations into level 0.
- For each level 0..9, discovers type dependencies and inserts them into the next level.
- After all 10 levels, concatenates the sets in reverse (leaf types first, composite types last).
The output is a vector of EDG type node pointers ordered so that leaf types precede the composite types that reference them.
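The leveled scheme can be sketched as follows. This is an illustrative model of the 10-level BFS with per-level sorted sets, not the recovered code; dependency discovery is abstracted into a plain mapping:

```python
def leveled_topo_sort(roots, deps, max_levels=10):
    """10-level BFS type ordering (sketch of sub_919CD0's scheme).

    deps maps a type to the types it references. Each level collects
    the dependencies of the previous level; concatenating the levels
    in reverse yields leaf types before the composites that use them.
    """
    levels = [sorted(set(roots))]
    for _ in range(max_levels - 1):
        nxt = sorted({d for t in levels[-1] for d in deps.get(t, ())})
        levels.append(nxt)
    order, seen = [], set()
    for level in reversed(levels):         # leaf-most level first
        for t in level:
            if t not in seen:
                seen.add(t)
                order.append(t)
    return order
```

For example, with struct S containing a P* whose pointee is i32, the ordering places i32 before P before S, matching the leaf-first contract.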
EDG Type Kind Dispatch
The core dispatcher sub_918E50 (2,400 bytes) reads the type-kind byte at edg_node+16 and routes to specialized handlers:
| Kind Byte | Value | Handler | Description |
|---|---|---|---|
| 0x00--0x10 | 0--16 | Primitive dispatch | void, bool, char, int, float, double, etc. |
| 0x11 | 17 | Void special | Void type with swap handling in comparator |
| 0x05 | 5 | sub_5FFE90 | Qualified type (const/volatile/restrict) --- carries address-space info |
| 0x0D | 13 | Enum path | Enum type bridging C/C++ enum constants to LLVM integers |
| 0x0E | 14 | Function path | Function type with parameter chain traversal |
| 0x1A | 26 | sub_915850 | Array type (subscript form with enumeration base) |
| 0x1B | 27 | Inline handler | Compound type (struct/union/class) --- multi-child with dedup hash |
| 0x32--0x33 | 50--51 | Union variants | Union type (two internal representations) |
| 0x36 | 54 | sub_918C40 | Typedef / using declaration --- chains through EDG resolution |
| 0x37 | 55 | Using variant | Using declaration variant |
| 0x4B--0x4C | 75--76 | Pointer/ref | Pointer and reference types --- carry qualifier words for address spaces |
| 0x4D | 77 | Member pointer | Pointer-to-member type |
| 0x4E | 78 | sub_914070 | Dependent/nested type --- requires scope resolution |
For types with kind > 23 that are not special-cased, a default handler applies a bitmask test: 0x100000100003FF >> (kind - 25). If the low bit is set, the type requires scope tracking; the mask's set bits (positions 0--9, 28, and 52) select kinds 25--34, plus kinds 53 and 77. The handler then looks up any existing LLVM type for this EDG type via the scope table, and if the mapping has changed, triggers a replacement plus metadata propagation.
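The bitmask test can be checked mechanically; this sketch enumerates which kinds pass it:

```python
def needs_scope_tracking(kind: int) -> bool:
    """Default-handler bitmask test for type kinds above 23.

    0x100000100003FF has bits 0-9, 28, and 52 set, so the kinds that
    pass are 25-34 plus 53 and 77 (kind = 25 + bit position).
    """
    MASK = 0x100000100003FF
    if kind < 25 or kind - 25 > 63:
        return False
    return (MASK >> (kind - 25)) & 1 == 1
```
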
Compound Type (Struct/Class) Translation
When kind 0x1B (27) is encountered, the dispatcher uses an inline handler that:
- Reads the child count from node+20 & 0xFFFFFFF and divides by 2 (children come in pairs: type descriptor + offset/alignment info).
- Builds a reference-counting hash table to detect shared sub-types. If a child type appears exactly once, it can be translated independently. If it appears multiple times, it indicates a shared base class or diamond inheritance pattern.
- For unique children, calls sub_911D10 (the type-pair comparator) with the parent scope to translate.
Diamond inheritance is detected by the reference count exceeding 1, which prevents the comparator from making conflicting replacements for the same sub-type.
Type-Pair Comparison Engine
The function sub_911D10 is the core workhorse for comparing and replacing type pairs. It takes (context, type_a, type_b, scope_pair, is_recursive_flag) and maintains a local worklist of (type_a, type_b) pairs:
fn compare_and_replace(ctx, type_a, type_b, scope, is_recursive) {
let mut worklist = vec![(type_a, type_b)];
while let Some((a, b)) = worklist.pop() {
if a == b { continue; }
// Normalize: larger type index = v15, smaller = v14
let (mut v14, mut v15) = if type_index(a) < type_index(b) { (a, b) } else { (b, a) };
// Primitive vs compound: record scope mapping
if v14.kind <= 0x17 && v15.kind > 0x17 {
record_scope_mapping(ctx, v14, v15);
}
// Check for UINT_MAX sentinel (incomplete type) -> swap
if scope_table_lookup(v15) == UINT_MAX {
swap(&mut v14, &mut v15);
}
// Perform actual replacement
replace_type(ctx, v14, v15, is_recursive);
// For pointer/reference types: propagate through children
if v15.kind == 75 || v15.kind == 76 {
let qualifier = v15.qualifier_word & 0x7FFF;
// Address space qualifiers trigger child propagation
if qualifier == 1 || qualifier == 32 || qualifier == 33 || qualifier == 14 {
worklist.push((v14.child, v15.child));
}
}
// For union types: push all variant children
if v15.kind == 50 || v15.kind == 51 {
for child in v15.children() { worklist.push((v14, child)); }
}
}
}
This worklist-based approach avoids stack overflow on deeply nested types while correctly propagating address-space information through pointer chains.
CUDA Address Space Propagation
CUDA memory-space qualifiers flow through the EDG type system via a 15-bit qualifier word stored at edg_node+18: the low 15 bits encode the qualifier ID, and bit 15 is a negation flag. During type translation, when the type-pair comparator encounters pointer or reference types (kinds 75/76), it reads the qualifier word and maps it to an LLVM address space. The conversion is performed by sub_5FFE90 (qualifier to LLVM address space number) and sub_5A3140 (which creates the appropriately qualified LLVM pointer type); sub_911CB0 combines the conversion with a type-index computation, taking (type_kind - 24) as a base and combining it with the qualifier to produce a unique index for the scope table.
Address-space propagation is transitive: if struct S contains a __shared__ int* field, the shared qualifier must be reflected in the LLVM type of the pointer field within S. The type-pair comparator achieves this by pushing child pairs onto its worklist whenever a pointer/reference type carries a non-zero qualifier. The full qualifier-to-address-space table appears under Address Space Annotations on Types below.
Five Caching Layers
To avoid redundant work, the translator maintains five distinct caches:
| Cache | Location | Key | Value | Purpose |
|---|---|---|---|---|
| Visited set | ctx+0x38..+0x48 | EDG node ptr | (presence only) | Prevents re-processing the same declaration |
| Type cache | ctx+0x70..+0x94 | EDG decl ptr | child type ptr | Tracks which LLVM type a declaration was previously translated to |
| Type-value map | Per-call in sub_913E90 | EDG type ptr | LLVM Type* | Caches enum/struct translations; supports inline mode (up to 4 entries) |
| Scope table | ctx+0x10, hash at +8/+24 | scope ID | type info | Maps scope identifiers to type information for type-pair comparison |
| Type index table | ctx+0x98+ | compound key | monotonic index | Linear ordering of processed types; Jenkins-like hash for compound keys |
All hash tables use the standard DenseMap infrastructure with NVVM-layer sentinels (-8 / -16). See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth policy.
Cache invalidation is handled by sub_913880, which walks a type's member list and removes stale entries. Invalidation cascades: if a struct type is invalidated, all member types that are non-trivial (not kind 54/55 typedef/using) are also removed from the cache.
Template Specialization
Template types are handled by sub_918790 (struct/class type translation with template instantiation support):
- sub_41F0F0 extracts template argument descriptions from the EDG IL into a 1,536-byte stack buffer (heap fallback for > 50 arguments).
- sub_908040 performs syntactic template argument substitution, producing two lists: substituted types and original types.
- If both lists are non-empty and the optimization flags byte_3C35480 + byte_3C353A0 are both set, sub_910920 performs semantic type matching using the full optimization infrastructure.
- Otherwise, sub_906590 creates the LLVM type directly from the substitution result.
The two-pass approach (syntactic substitution then semantic matching) handles cases like template<typename T> struct Wrapper { T* data; } where Wrapper<__shared__ int> must produce a pointer in address space 3 --- the syntactic pass substitutes T = __shared__ int, and the semantic pass verifies the LLVM type is correct.
Template specialization support is entirely optional and gated behind configuration flags, allowing it to be disabled for faster compilation when not needed.
Primitive Type Translation Table
The dispatcher sub_918E50 handles kinds 0x00--0x10 (values 0--16) as primitive/scalar types. These map directly from EDG internal type representation to LLVM IR types. The correspondence between the three type-tag namespaces used across cicc is:
| EDG Type Kind | EDG Printer type_kind | Cast Codegen Tag (*(type+8)) | LLVM IR Type | Width |
|---|---|---|---|---|
| 0x00 | 0x00 error | --- | <error> | --- |
| 0x01 | 0x01 void | 3 | void | 0 |
| 0x02 | 0x02 scalar/integer | 17 | iN | N bits |
| 0x03 | 0x03 float | 1 (half), 2 (float), 3 (double), 4 (fp80), 5 (fp128), 6 (bf16) | see FP table | varies |
| 0x04 | 0x04 imaginary | --- | emulated | varies |
| 0x05 | 0x05 complex | --- | { fN, fN } struct | 2x float |
| 0x06 | 0x06 pointer/ref | 18 | ptr (opaque) or ptr addrspace(N) | 32/64 |
| 0x07 | 0x07 function | 15 (function), 16 (ptr-to-fn) | function type | --- |
| 0x08 | 0x08 array | 20 | [N x elem] | N * elem |
| 0x09--0x0B | 0x09--0x0B class/struct/union/enum | 21 (struct) | %struct.Name = type { ... } | layout |
| 0x0C | 0x0C elaborated/typedef | --- | resolved target | --- |
| 0x0D | 0x0D pointer-to-member | --- | { ptr, i64 } or i64 | 64/128 |
| 0x0E | 0x0E template param | --- | deduced | --- |
| 0x0F | 0x0F vector | 16 | <N x elem> | N * elem |
| 0x10 | 0x10 scalable vector | 16 | <vscale x N x elem> | runtime |
The integer type (EDG kind 0x02) carries its bit-width in the upper bytes of the type word. The cast codegen subsystem (sub_128A450) classifies types by the tag byte at *(type+8): tags 1--6 are floating-point (see next section), tag 11 is integer, tag 15 is pointer, and tag 16 is vector/aggregate. The key dispatch idiom (tag - 1) > 5u tests "is NOT a float"; (tag & 0xFB) != 0xB tests "is NOT integer-like" (the 0xFB mask folds the pointer tag 15 onto the integer tag 11, so the predicate matches exactly those two tags).
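The two dispatch idioms can be verified with a few lines. The helper names are ours, and the 0xFB mask is the value (an assumption from the analysis) that makes the integer-like predicate match exactly tags 11 and 15:

```rust
// Sketch of the cast-codegen tag classification idioms described above.
fn is_float(tag: u32) -> bool {
    // (tag - 1) > 5 in unsigned arithmetic matches everything EXCEPT 1..=6,
    // so the float test is its negation.
    tag.wrapping_sub(1) <= 5
}

fn is_integer_like(tag: u32) -> bool {
    // (tag & 0xFB) == 0xB matches tag 11 (0b1011, integer) and
    // tag 15 (0b1111, pointer) -- the mask clears bit 2.
    (tag & 0xFB) == 0xB
}

fn main() {
    assert!(is_float(1) && is_float(6));
    assert!(!is_float(0) && !is_float(11));
    assert!(is_integer_like(11) && is_integer_like(15));
    assert!(!is_integer_like(16));
    println!("tag classification idioms verified");
}
```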
Floating-Point Type Encoding
Floating-point types use a sub-kind byte stored in the EDG type node at v3[10].m128i_i8[0] (type printer) or equivalently the cast codegen tag at *(type+8). The complete mapping including all NVIDIA-extended formats:
| Cast Tag | EDG FP Sub-kind | Mangling | C++ Type | LLVM Type | Width | SM Minimum |
|---|---|---|---|---|---|---|
| 1 | 0 / 0xA | DF16_ | _Float16 / __half | half | 16 | SM 53 (scalar), SM 70 (packed) |
| 1 | 1 | Dh | __fp16 | half | 16 | SM 53 |
| 2 | 2 | f | float | float | 32 | all |
| --- | 3 | DF32x | _Float32x | double (promoted) | 64 | all |
| 3 | 4 | d | double | double | 64 | all |
| --- | 5 | DF64x | _Float64x | fp128 (emulated) | 128 | all |
| --- | 6 | (single) | long double | platform-dependent | arch | --- |
| --- | 7 | u7float80 | float80 | x86_fp80 | 80 | N/A on GPU |
| --- | 8 | g | __float128 | fp128 | 128 | emulated |
| 6 | 9 | u6__bf16 or DF16b | __bf16 / __nv_bfloat16 | bfloat | 16 | SM 80 |
| --- | 0xB | DF32_ | _Float32 | float | 32 | all |
| --- | 0xC | DF64_ | _Float64 | double | 64 | all |
| --- | 0xD | DF128_ | _Float128 | fp128 | 128 | emulated |
The bf16 mangling has a three-way ABI gate controlled by qword_4F077B4 (low 32 = use_new_bf16_mangling, high 32 = bf16_abi_version) and qword_4F06A78 (secondary selector). Old ABI emits u6__bf16 (Itanium vendor-extended); C++23 ABI emits DF16b (P1467 standard). The __nv_bool type (EDG printer case 0x02, bit 4 of +162) is a CUDA-specific boolean that emits "__nv_bool" when sub_5D76E0 (CUDA mode check) returns true, or "_Bool" / "bool" otherwise.
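The mangling gate reduces to a flag test. This is a simplified two-way sketch (the binary's gate also consults the secondary selector qword_4F06A78, which is omitted here); the function name and the struct of the check are ours:

```rust
// Minimal sketch of the bf16 mangling selection described above: the low 32
// bits of the gate qword (use_new_bf16_mangling) choose between the Itanium
// vendor-extended form and the C++23 P1467 form.
fn bf16_mangling(gate_qword: u64) -> &'static str {
    let use_new_mangling = (gate_qword & 0xFFFF_FFFF) != 0; // low 32 bits
    if use_new_mangling {
        "DF16b" // C++23 (P1467) standard mangling
    } else {
        "u6__bf16" // Itanium vendor-extended mangling
    }
}

fn main() {
    assert_eq!(bf16_mangling(0), "u6__bf16");
    assert_eq!(bf16_mangling(1), "DF16b");
    println!("old ABI: {}, new ABI: {}", bf16_mangling(0), bf16_mangling(1));
}
```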
Two additional NVIDIA-specific types have dedicated mangling:
| EDG Type Code | Mangling | C++ Type | Purpose |
|---|---|---|---|
| 17 | u11__SVCount_t | __SVCount_t | ARM SVE predicate count |
| 18 | u6__mfp8 | __mfp8 | 8-bit minifloat (FP8 E4M3/E5M2 base) |
On the LLVM side, the __mfp8 type maps to i8 storage with metadata annotations indicating the floating-point interpretation.
CUDA FP8/FP6/FP4 Extended Type Keywords
CUDA 12.x+ introduces narrow floating-point types for transformer inference and tensor core operations. The EDG parser (sub_691320) recognizes these as token values 236 and 339--354, all resolved through sub_6911B0 (CUDA type-token resolver):
| Token | Keyword | Format | Width | Packed Variant | SM Requirement |
|---|---|---|---|---|---|
| 236 | __nv_fp8_e4m3 | E4M3 (4-bit exponent, 3-bit mantissa) | 8 | --- | SM 89 |
| 339 | __nv_fp8_e5m2 | E5M2 (5-bit exponent, 2-bit mantissa) | 8 | --- | SM 89 |
| 340 | __nv_fp8x2_e4m3 | E4M3 packed pair | 16 | 2 elements | SM 89 |
| 341 | __nv_fp8x2_e5m2 | E5M2 packed pair | 16 | 2 elements | SM 89 |
| 342 | __nv_fp8x4_e4m3 | E4M3 packed quad | 32 | 4 elements | SM 89 |
| 343 | __nv_fp8x4_e5m2 | E5M2 packed quad | 32 | 4 elements | SM 89 |
| 344 | __nv_fp6_e2m3 | E2M3 (2-bit exponent, 3-bit mantissa) | 6 | --- | SM 100 |
| 345 | __nv_fp6_e3m2 | E3M2 (3-bit exponent, 2-bit mantissa) | 6 | --- | SM 100 |
| 346 | __nv_fp6x2_e2m3 | E2M3 packed pair | 12 | 2 elements | SM 100 |
| 347 | __nv_fp6x2_e3m2 | E3M2 packed pair | 12 | 2 elements | SM 100 |
| 348 | __nv_mxfp8_e4m3 | MX-format E4M3 | 8 | --- | SM 100 |
| 349 | __nv_mxfp8_e5m2 | MX-format E5M2 | 8 | --- | SM 100 |
| 350 | __nv_mxfp6_e2m3 | MX-format E2M3 | 6 | --- | SM 100 |
| 351 | __nv_mxfp6_e3m2 | MX-format E3M2 | 6 | --- | SM 100 |
| 352 | __nv_mxfp4_e2m1 | MX-format E2M1 (FP4) | 4 | --- | SM 100 |
| 353 | __nv_satfinite | Saturation-to-finite modifier | --- | --- | SM 89 |
| 354 | __nv_e8m0 | E8M0 exponent-only scale format | 8 | --- | SM 100 |
The resolver sub_6911B0 follows the field_140 == 12 (qualified/elaborated type) chain to find the base type node, then sets v325 = 20 (typename). At the LLVM level, these narrow types are lowered to integer storage types (i8, i16, i32) with type metadata or intrinsic-based interpretation. The cvt_packfloat intrinsic family handles conversion to and from these formats with explicit format specifiers:
| cvt_packfloat Case | PTX Suffix | Format |
|---|---|---|
| 2 | .e4m3x2 | FP8 E4M3 pair |
| 3 | .e5m2x2 | FP8 E5M2 pair |
| 4 | .bf16x2 | BFloat16 pair |
| 5 | .e2m1x2 | FP4 E2M1 pair (SM 100+) |
| 6 | .e2m3x2 | FP6 E2M3 pair (SM 100+) |
| 7 | .e3m2x2 | FP6 E3M2 pair (SM 100+) |
| 8 | .ue8m0x2 | UE8M0 scale pair (SM 100+) |
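The case-to-suffix dispatch above is a straight table lookup; a checkable transcription (the function name is ours):

```rust
// The cvt_packfloat case -> PTX suffix mapping from the table above.
fn pack_suffix(case_id: u32) -> Option<&'static str> {
    Some(match case_id {
        2 => ".e4m3x2",  // FP8 E4M3 pair
        3 => ".e5m2x2",  // FP8 E5M2 pair
        4 => ".bf16x2",  // BFloat16 pair
        5 => ".e2m1x2",  // FP4 E2M1 pair (SM 100+)
        6 => ".e2m3x2",  // FP6 E2M3 pair (SM 100+)
        7 => ".e3m2x2",  // FP6 E3M2 pair (SM 100+)
        8 => ".ue8m0x2", // UE8M0 scale pair (SM 100+)
        _ => return None,
    })
}

fn main() {
    assert_eq!(pack_suffix(2), Some(".e4m3x2"));
    assert_eq!(pack_suffix(8), Some(".ue8m0x2"));
    assert_eq!(pack_suffix(1), None);
    println!("cvt_packfloat suffix table covered");
}
```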
Address Space Annotations on Types
CUDA memory-space qualifiers propagate through the EDG type system via a 15-bit qualifier word at edg_node+18. The low 15 bits encode a qualifier ID; bit 15 is a negation flag. The qualifier word is the single mechanism through which __device__, __shared__, __constant__, and __managed__ semantics reach the LLVM type system.
EDG qualifier word to LLVM address space mapping (performed by sub_5FFE90):
| Qualifier Word (node+18 & 0x7FFF) | LLVM Address Space | CUDA Source | Notes |
|---|---|---|---|
| 0 | 0 | (default/generic) | Unqualified pointers |
| 1 | 1 | __device__ / global | Explicit global annotation |
| 9 | 0 (with flag check via sub_5F3280) | (generic variant) | Conditional on context |
| 14 | --- | __host__ / method qualifier | Not an address space --- function qualifier |
| 26 | --- | (array subscript context A) | Internal, not an address space |
| 27 | --- | (array subscript context B) | Internal, not an address space |
| 32 | 3 | __shared__ | Per-block shared memory |
| 33 | 4 | __constant__ | Read-only constant memory |
The function sub_5A3140 creates the appropriately address-space-qualified LLVM pointer type given the qualifier output from sub_5FFE90. The helper sub_911CB0 combines address space information with the type kind to produce a unique scope-table index: it computes (type_kind - 24) as a base and combines it with the qualifier to produce a monotonic key.
EDG frontend encoding (from sub_691320 parser, tokens 133--136, and sub_667B60):
| Parser Token | CUDA Keyword | v305 Value | EDG memory_space_code | Target AS |
|---|---|---|---|---|
| 133 | __shared__ | 4 | 2 | 3 |
| 134 | __device__ | 5 | 1 | 1 |
| 135 | __constant__ | 6 | 3 | 4 |
| 136 | __managed__ | 7 | (special) | 0 + "managed" annotation |
| 273 | __global__ (addr-space attr) | --- | 0 | 0 |
| 274 | __shared__ (addr-space attr) | --- | 2 | 3 |
| 275 | __constant__ (addr-space attr) | --- | 3 | 4 |
| 276 | __generic__ (addr-space attr) | --- | (parsed) | (parsed) |
Address-space propagation through types is transitive: if struct S contains a __shared__ int* field, the shared qualifier flows through the pointer type and is preserved in the LLVM ptr addrspace(3) type of that field. The type-pair comparator sub_911D10 achieves this by pushing child pairs onto its worklist whenever a pointer/reference type (kinds 75/76) carries a non-zero qualifier. The qualifier-word masks 1, 14, 32, and 33 are the four values that trigger this child propagation.
For a full cross-reference of all 10 address spaces (including AS 5 local, AS 6 tensor memory, AS 7 shared cluster, AS 25 internal device, AS 53 MemorySpaceOpt annotation, AS 101 param), see Address Spaces.
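The mapping table above can be expressed as a small checkable function. The name is ours; qualifier 9 (conditional on context via sub_5F3280) is folded into the None arm for simplicity:

```rust
// The EDG qualifier-word to LLVM address-space mapping from the table above.
// Qualifiers 14, 26, and 27 are internal markers, not address spaces; the
// context-dependent qualifier 9 is omitted from this sketch.
fn qualifier_to_addrspace(qual_word: u16) -> Option<u32> {
    match qual_word & 0x7FFF { // bit 15 is the negation flag
        0 => Some(0),  // generic (default)
        1 => Some(1),  // __device__ / global
        32 => Some(3), // __shared__
        33 => Some(4), // __constant__
        _ => None,     // 14 (method qualifier), 26/27 (array contexts), 9, ...
    }
}

fn main() {
    assert_eq!(qualifier_to_addrspace(32), Some(3));
    assert_eq!(qualifier_to_addrspace(33), Some(4));
    assert_eq!(qualifier_to_addrspace(14), None);
    // Bit 15 (negation) is masked off before the lookup.
    assert_eq!(qualifier_to_addrspace(0x8000 | 32), Some(3));
    println!("qualifier mapping verified");
}
```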
Vector Type Handling
NVPTX has a highly constrained vector type model. Only four vector types are legal --- all packed into 32-bit Int32HalfRegs (%hh prefix in PTX):
| Legal Vector Type | LLVM MVT | PTX Register Class | PTX Suffix | SM Minimum |
|---|---|---|---|---|
| v2f16 | v2f16 | Int32HalfRegs | .f16x2 | SM 70 (arith), SM 53 (ld/st) |
| v2bf16 | v2bf16 | Int32HalfRegs | .bf16x2 | SM 80 |
| v2i16 | v2i16 | Int32HalfRegs | .s16x2 | SM 70 |
| v4i8 | v4i8 | Int32HalfRegs | (packed bytes) | SM 70 |
All wider vector types are illegal and undergo recursive split/scalarize during type legalization. The split depth for common CUDA vector types:
| CUDA Type | LLVM Type | Split Chain | Final Form |
|---|---|---|---|
| float4 | v4f32 | v4f32 -> 2x v2f32 -> 4x f32 | 4 scalar float ops |
| float2 | v2f32 | v2f32 -> 2x f32 | 2 scalar float ops |
| int4 | v4i32 | v4i32 -> 2x v2i32 -> 4x i32 | 4 scalar i32 ops |
| double2 | v2f64 | v2f64 -> 2x f64 | 2 scalar double ops |
| half2 | v2f16 | legal (no split) | single .f16x2 packed op |
| __nv_bfloat162 | v2bf16 | legal (no split, SM 80+) | single .bf16x2 packed op |
| short2 | v2i16 | legal (no split) | single .s16x2 packed op |
| char4 / uchar4 | v4i8 | legal (no split) | single packed-byte op |
| half (4 elements) | v4f16 | v4f16 -> 2x v2f16 | 2 packed .f16x2 ops |
| half (8 elements) | v8f16 | v8f16 -> v4f16 -> 2x v2f16 | 4 packed .f16x2 ops |
The critical architectural insight: v2f32 is NOT legal on NVPTX (no 64-bit packed float register class exists), so float4 always fully scalarizes to four independent f32 operations. In contrast, half2 stays packed throughout the pipeline, delivering 2x throughput via add.f16x2, mul.f16x2, and fma.rn.f16x2 PTX instructions.
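The split chains in the table follow one rule: halve until the type is legal or scalar. A sketch under the legal set above (function and type-name spellings are ours, not the legalizer's):

```rust
// Sketch of recursive split/scalarize type legalization: the four legal
// NVPTX vector types stay packed; everything else halves until legal or
// fully scalar.
fn legalize(lanes: u32, elem: &'static str, out: &mut Vec<String>) {
    let legal = matches!(
        (lanes, elem),
        (2, "f16") | (2, "bf16") | (2, "i16") | (4, "i8")
    );
    if lanes == 1 {
        out.push(elem.to_string()); // fully scalarized
    } else if legal {
        out.push(format!("v{lanes}{elem}")); // stays packed
    } else {
        // illegal vector: split in half and recurse on both halves
        legalize(lanes / 2, elem, out);
        legalize(lanes / 2, elem, out);
    }
}

fn main() {
    let mut float4 = Vec::new();
    legalize(4, "f32", &mut float4); // v4f32 -> 2x v2f32 -> 4x f32
    assert_eq!(float4, vec!["f32"; 4]);

    let mut half8 = Vec::new();
    legalize(8, "f16", &mut half8); // v8f16 -> v4f16 -> 2x v2f16
    assert_eq!(half8, vec!["v2f16"; 4]);
    println!("float4 scalarizes, half8 stays packed as .f16x2 ops");
}
```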
SM-version gating affects which types are legal at which pipeline stage:
- SM < 53: No legal vector types; v2f16 must be scalarized, and scalar f16 is promoted to f32.
- SM 53--69: Scalar f16 is legal; v2f16 is legal for load/store, but packed arithmetic may be Custom or Expand.
- SM 70+: v2f16 fully legal with packed arithmetic. i128 scalar register class added.
- SM 80+: v2bf16 added as a legal vector type.
- SM 100+: Additional packed FP types for cvt_packfloat --- e2m1x2, e2m3x2, e3m2x2, ue8m0x2.
Tensor core matrix fragments bypass vector legalization entirely. WMMA and WGMMA intrinsics represent matrix data as individual scalar registers or {f16, f16, ...} struct aggregates, not as LLVM vector types. See MMA Codegen for the tensor-core lowering path.
Cast Codegen Type Tags
The cast emission function sub_128A450 uses a distinct type-tag namespace at *(type+8). This tag drives all cast instruction selection and must be clearly distinguished from the EDG type-kind byte at edg_node+16:
| Tag | LLVM Type | Cast Behavior |
|---|---|---|
| 1 | half (f16) | Float family; float-to-float casts use fpext/fptrunc |
| 2 | float (f32) | Float family |
| 3 | double (f64) | Float family |
| 4 | x86_fp80 | Float family (not used on GPU) |
| 5 | fp128 | Float family; triggers standard LLVM cast path (no __nv_*_rz intrinsic) |
| 6 | bfloat (bf16) | Float family |
| 11 | iN (integer) | Integer family; width at *(type+8) >> 8 |
| 15 | ptr | Pointer family |
| 16 | <N x elem> (vector) | Vector/aggregate; address-space extraction via sub_16463B0 |
Integer-to-float conversions (tags 11 -> 1..6) default to sitofp/uitofp but can route through NVIDIA-specific __nv_*_rz round-to-zero intrinsics when unk_4D04630 is clear. These intrinsics (__nv_float2int_rz, __nv_double2ll_rz, etc.) are emitted as plain function calls and later pattern-matched by the PTX backend to cvt.rz.* instructions. The fp128 path always uses standard LLVM casts because 128-bit floating point is emulated via FP128/I128 library calls.
SelectionDAG SimpleVT Encoding
After IR generation, types enter the SelectionDAG type system where they are encoded as single-byte SimpleVT values for the legality table lookup at NVPTXTargetLowering + 2422:
| SimpleVT | LLVM Type | Bitwidth |
|---|---|---|
| 0 | extended/custom | computed via sub_1F58D40 |
| 1 | i1 | 1 |
| 2 | i2 | 2 |
| 3 | i8 | 8 |
| 4 | i16 | 16 |
| 5 | i32 | 32 |
| 6 | i64 | 64 |
| 7 | i128 | 128 |
| 8 | f16 / bf16 | 16 |
| 9 | f32 | 32 |
| 10 | f64 | 64 |
| 14--55 | fixed-width vector types | vector of above |
| 56--109 | scalable vector types | scalable vector of above |
The bitwidth-to-SimpleVT conversion pattern appears 11 times in the 348KB DAGTypeLegalizer::run monolith (sub_20019C0), and the vector-to-scalar-element switch table (cases 14--109 mapping back to scalar VT 2--10) appears 6 times. This redundancy is an artifact of the monolithic inlining --- upstream LLVM factors these into per-category files (LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, etc.).
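The repeated scalar-width conversion pattern amounts to the following lookup (function name is ours; the table is transcribed from above):

```rust
// The scalar integer bitwidth -> SimpleVT mapping from the table above,
// mirroring the conversion pattern inlined 11 times in the legalizer.
fn int_bitwidth_to_simplevt(bits: u32) -> Option<u8> {
    Some(match bits {
        1 => 1,   // i1
        2 => 2,   // i2
        8 => 3,   // i8
        16 => 4,  // i16
        32 => 5,  // i32
        64 => 6,  // i64
        128 => 7, // i128
        _ => return None, // odd widths fall back to extended VT (0)
    })
}

fn main() {
    assert_eq!(int_bitwidth_to_simplevt(32), Some(5));
    assert_eq!(int_bitwidth_to_simplevt(128), Some(7));
    assert_eq!(int_bitwidth_to_simplevt(24), None);
    println!("SimpleVT mapping verified");
}
```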
Global Variable Code Generation
Module-Level Driver
Global variable codegen is driven by sub_915990 (~2,700 bytes), which iterates all EDG IL global declarations and categorizes them into sorted sets:
- Regular device globals
__constant__globals__shared__globals__managed__globals- Texture references
- Surface references
- Grid constants
After categorization, a topological sort (using the same sub_3FEBB0/sub_3FED60 graph primitives as the type translator) determines the order in which globals must be materialized. If global A's initializer references global B, then B must be code-generated first. The transitive dependency discovery is performed by sub_914960, a BFS that walks EDG IL linkage chains, filtering nodes with kind byte in range [25..34] (variable, function, and template declarations).
Address Space Determination
The function sub_916430 (482 bytes) examines EDG IL node attribute bytes to determine the NVPTX address space for a global variable:
fn determine_address_space(edg_node: &EDGNode) -> u32 {
let storage_class = edg_node[0x88];
let flags_9c = edg_node[0x9C];
let flags_b0 = edg_node[0xB0];
let flags_ae = edg_node[0xAE];
let flags_a8 = edg_node[0xA8] as u64;
// __constant__: storage class 2
if storage_class == 2 {
return 4; // constant address space
}
// __shared__: bit 7 of flags_9c
if flags_9c & 0x80 != 0 {
if flags_ae & 1 != 0 {
return 3; // extern __shared__
}
if flags_b0 & 0x20 != 0 {
return 5; // local memory (stack-local shared variant)
}
return 3; // __shared__
}
// Bit 6 of flags_9c: device-side memory
if flags_9c & 0x40 != 0 {
if edg_node[0xF0] != 0 {
return 3; // template-instantiated shared variable
}
return 0; // generic device
}
// Extended attribute flags
if flags_a8 & 0x2000100000 != 0 {
return 3; // shared-like semantics
}
if storage_class > 2 {
emit_diagnostic("unsupported storage class!");
}
return 0; // default: generic device memory
}
NVPTX Address Space Assignment
See Address Spaces for the complete master table mapping LLVM AS numbers to PTX qualifiers, hardware, and pointer widths.
In the IR generation context: address space 0 (generic) is the default for __device__ variables. Address space 1 (global) appears in pointer types when the global qualifier is explicit in the type annotation (as opposed to being inferred from the variable declaration). __managed__ variables use address space 0 (same as regular device globals) but receive a "managed" annotation in nvvm.annotations that the runtime uses to set up Unified Virtual Memory mappings.
GlobalVariable Object Creation
The function sub_915C40 (2,018 bytes) materializes an LLVM GlobalVariable:
- Hash table lookup: Checks whether the EDG node has already been materialized. The table at ctx+0x178..0x190 maps EDG node pointers to GlobalVariable*. If found with a different type, calls GlobalVariable::mutateType to reconcile.
- Allocation: Allocates 88 bytes (0x58) via operator new, then calls the GlobalVariable constructor with module, type, isConstant flag, linkage, initializer (null for declarations), name, and address space.
- Alignment: Computes alignment via sub_91CB50 (a DataLayout wrapper), then converts to log2 via BSR (bit-scan-reverse) for LLVM's MaybeAlign representation. Alignment is always explicitly set, even for naturally-aligned types.
- Initializer: If edg_node[0xB0] & 0x20 is set and the variable is not extern (edg_node[0x88] != 1), calls sub_916690 to generate the initializer IR. The initializer handler dispatches on a variant byte: variant 0/3 for constant expressions, variant 1/2 for aggregate initializers.
- __managed__ annotation: If edg_node[0x9D] & 1 is set, emits ("managed", 1) to the annotation list via sub_913680.
- Texture/surface detection: If the mode flag at ctx+0x168 has bit 0 set, calls sub_91C2A0 (isTextureType) and sub_91C2D0 (isSurfaceType). Matching variables get "texture" or "surface" annotations and are inserted into a red-black tree at ctx+0x200 for ordered tracking during annotation emission.
- Registration: The new GlobalVariable* is stored into the hash table for future lookups.
Finalization: Metadata and @llvm.used
After all globals are materialized, sub_915400 calls four finalization functions in sequence:
sub_9151E0 --- emit nvvmir.version: Creates a named metadata node "nvvmir.version" containing version operands as ConstantInt values wrapped in ConstantAsMetadata. When debug info is present (ctx+0x170 non-null), the tuple has 4 operands including address-space-qualified indices; otherwise 2 operands.
sub_914410 --- emit nvvm.annotations: Iterates the annotation list at ctx+0x1B0..0x1B8 and creates MDTuple entries under the named metadata "nvvm.annotations". Each annotation record produces a {GlobalValue*, MDString-key, ConstantInt-value} triple. Three annotation categories receive special batching: "grid_constant", "preserve_n_data", and "preserve_reg_abi" --- these are collected into compound MDTuples rather than emitting one per parameter, reducing metadata size in kernels with many annotated parameters.
sub_90A560 --- emit @llvm.used: Builds the @llvm.used global array that prevents LLVM from dead-stripping texture references, surface references, and managed variables. The function iterates the registered global triples at ctx+0x198..0x1A0 (24-byte records, hence the 0xAAAAAAAAAAAAAAAB magic divisor for dividing by 3), bitcasts each GlobalValue* to i8*, constructs a ConstantArray of type [N x i8*], and creates a global with name "llvm.used", appending linkage, and section "llvm.metadata".
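The 0xAAAAAAAAAAAAAAAB constant is the standard compiler idiom for exact division by 3; a short demonstration of why it works (the helper name is ours):

```rust
// 0xAAAAAAAAAAAAAAAB is the multiplicative inverse of 3 modulo 2^64, so
// for any n that is an exact multiple of 3, the wrapping product equals
// n / 3. Compilers emit this (plus a shift for the power-of-two factor of
// the 24-byte stride) instead of a divide instruction.
const INV3: u64 = 0xAAAA_AAAA_AAAA_AAAB;

fn record_count(byte_span: u64) -> u64 {
    // 24-byte records: divide by 8 with a shift, then by 3 via the inverse.
    (byte_span >> 3).wrapping_mul(INV3)
}

fn main() {
    // The defining property: 3 * INV3 == 1 (mod 2^64).
    assert_eq!(3u64.wrapping_mul(INV3), 1);
    assert_eq!(record_count(7 * 24), 7);
    assert_eq!(record_count(0), 0);
    println!("magic divide-by-3 verified");
}
```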
Conditional: If debug info is present, emits a "Debug Info Version" module flag with value 3 via Module::addModuleFlag. If enabled, also emits "llvm.ident" metadata identifying the compiler.
Kernel Metadata
Annotation Emitter (sub_93AE30)
After a kernel's function body has been code-generated, sub_93AE30 translates EDG-level kernel attributes (__launch_bounds__, __cluster_dims__) into LLVM named metadata under "nvvm.annotations". The function signature:
void emitKernelAnnotationMetadata(
NVVMContext *ctx, // ctx->module at offset +344
FuncDecl *funcDecl, // EDG function declaration, params at +16, count at +8
LaunchAttr *launch, // __launch_bounds__/cluster attrs, NULL if none
MDNodeVec *out // output vector of metadata nodes
);
Parameter Metadata
For each function parameter (stride 40 bytes, iterated from funcDecl+16):
- Visibility check: If launch attributes exist and bit 0x20 of launch+198 is clear, or param+32 != 0, emits opcode 22 (hidden/implicit parameter). If dword_4D04628 is set and the launch bit is set, calls sub_8D2E30 to check for special types and emits opcode 40.
- Type dispatch:
  - Type 1 (pointer): Checks sub_91B6F0 for read-only image/sampler (opcode 54) and sub_91B730 for surface reference (opcode 79).
  - Type 2 (value): Computes alignment metadata via sub_91A390, then log2 via BSR, and emits a packed (log2, hasValue) pair. Checks for alignment attribute tag 92 via sub_A74D20.
- MDNode creation: sub_A7B020(module, paramIndex, &attrAccum) creates the MDNode for each parameter.
Cluster Metadata
Triggered when launch is non-null and *(launch+328) points to a valid cluster config. The cluster config struct:
| Offset | Field | Used As |
|---|---|---|
| +20 | [5] | reqntid.x (cluster) |
| +24 | [6] | reqntid.y (cluster) |
| +28 | [7] | reqntid.z (cluster) |
| +40 | [10] | cluster_dim.z (also presence flag: > 0 triggers emission) |
| +44 | [11] | cluster_dim.y |
| +48 | [12] | cluster_dim.x |
When cluster_config[10] > 0, three metadata entries are emitted in order:
- nvvm.blocksareclusters --- boolean flag, no value string. Emitted unconditionally.
- nvvm.reqntid --- the three cluster dimension fields [12], [11], [10] are converted to decimal strings and concatenated with commas: "{x},{y},{z}". Uses SSO std::string objects with a two-digit lookup table ("00", "01", ..., "99") for fast integer-to-string conversion. A 0x3FFFFFFFFFFFFFFF sentinel triggers a fatal "basic_string::append" error on overflow.
- nvvm.cluster_dim --- the three fields [7], [6], [5] are similarly concatenated.
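The two-digit lookup-table conversion is a standard itoa optimization; a sketch assuming that layout (the function names are ours, and the table is built at startup rather than baked into the binary):

```rust
// Two-digit lookup-table integer-to-string conversion, as used for the
// "x,y,z" metadata strings: each division step peels off two decimal
// digits via one table lookup instead of two modulo operations.
fn two_digit_table() -> Vec<u8> {
    // "000102...9899" -- 100 two-character entries
    (0..100u8).flat_map(|n| [b'0' + n / 10, b'0' + n % 10]).collect()
}

fn u32_to_decimal(mut n: u32, table: &[u8]) -> String {
    let mut out = Vec::new();
    while n >= 100 {
        let r = (n % 100) as usize;
        out.push(table[2 * r + 1]); // low digit first; reversed at the end
        out.push(table[2 * r]);
        n /= 100;
    }
    if n >= 10 {
        out.push(table[2 * n as usize + 1]);
        out.push(table[2 * n as usize]);
    } else {
        out.push(b'0' + n as u8);
    }
    out.reverse();
    String::from_utf8(out).unwrap()
}

fn main() {
    let t = two_digit_table();
    let s = format!(
        "{},{},{}",
        u32_to_decimal(256, &t),
        u32_to_decimal(1, &t),
        u32_to_decimal(1, &t)
    );
    assert_eq!(s, "256,1,1"); // the nvvm.reqntid string from the example
    println!("{s}");
}
```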
Function-Level Metadata Node
After all per-parameter and cluster metadata is accumulated, if the accumulator is non-empty, sub_A7B020(module, 0xFFFFFFFF, &attrAccum) creates a function-level MDNode with parameter index -1 (sentinel). This node carries all function-level annotations combined.
Annotation Reader (sub_A84F90)
The inverse of the emitter. Reads "nvvm.annotations" named metadata from an LLVM Module and populates internal structures. For each {function_ref, key_string, value} operand tuple, the key is matched via raw integer comparisons (not strcmp):
| Key String | Match Method | Handler |
|---|---|---|
| "kernel" | 6-byte i32+i16 compare | sub_CE8040: set/clear nvvm.kernel flag |
| "maxntidx/y/z" | 7-byte prefix + suffix char | sub_A7C1C0 with "nvvm.maxntid" |
| "reqntidx/y/z" | 7-byte prefix + suffix char | sub_A7C1C0 with "nvvm.reqntid" |
| "cluster_dimx/y/z" | 12-byte qword+i32 + suffix | sub_A7C1C0 with "nvvm.cluster_dim" |
| "maxnreg" | 7-byte qword + byte 'g' | sub_B2CD60 with "nvvm.maxnreg" |
| "minctasm" | 8-byte single qword compare | sub_B2CD60 with "nvvm.minctasm" |
| "maxclusterrank" | 14-byte multi-width compare | sub_B2CD60 with "nvvm.maxclusterrank" |
| "cluster_max_blocks" | 18 bytes | Same handler as maxclusterrank |
| "align" | 5 bytes | sub_B2CCF0: BSR-based log2 alignment |
The raw integer comparison technique avoids strcmp overhead by loading the key bytes as i32/i64 values and comparing in a single instruction. For example, "kernel" is checked as two loads: *(uint32_t*)key == 0x6E72656B and *(uint16_t*)(key+4) == 0x6C65.
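The "kernel" check can be reproduced exactly from the constants given above (the safe-Rust slice form stands in for the binary's raw pointer loads; the function name is ours):

```rust
// Raw-integer key comparison: load "kernel" as a 4-byte word plus a 2-byte
// word and compare against little-endian constants, instead of strcmp.
fn is_kernel_key(key: &[u8]) -> bool {
    key.len() == 6
        && u32::from_le_bytes([key[0], key[1], key[2], key[3]]) == 0x6E72656B // "kern"
        && u16::from_le_bytes([key[4], key[5]]) == 0x6C65 // "el"
}

fn main() {
    assert!(is_kernel_key(b"kernel"));
    assert!(!is_kernel_key(b"kernub"));
    assert!(!is_kernel_key(b"maxnreg"));
    println!("raw-integer compare matches \"kernel\"");
}
```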
Complete Metadata String Catalog
Module-level named metadata:
| Key | Purpose |
|---|---|
| nvvm.annotations | Container for all kernel and global annotations |
| nvvm.annotations_transplanted | Flag: annotations already migrated to function-level |
| nvvm.reflection | Compile-time reflection constants |
| nvvmir.version | NVVM IR version (2 or 4 operands) |
| llvm.used | Array preventing dead-stripping of annotated globals |
| llvm.ident | Compiler identification string |
Function-level metadata keys:
| Key | Value Format | Source |
|---|---|---|
| nvvm.kernel | (boolean presence) | __global__ qualifier or calling convention 0x47 |
| nvvm.maxntid | "x,y,z" | __launch_bounds__(maxThreads) |
| nvvm.reqntid | "x,y,z" | __launch_bounds__ or cluster config |
| nvvm.maxnreg | decimal string | __launch_bounds__(..., ..., maxRegs) |
| nvvm.minctasm | decimal string | __launch_bounds__(..., minCTAs) |
| nvvm.maxclusterrank | decimal string | SM >= 90 cluster rank limit |
| nvvm.blocksareclusters | (boolean presence) | __cluster_dims__ present |
| nvvm.cluster_dim | "x,y,z" | __cluster_dims__(x,y,z) |
Global variable annotations (emitted as {GlobalValue*, MDString, i32} triples in nvvm.annotations):
| Annotation | Value | Trigger |
|---|---|---|
"managed" | 1 | __managed__ qualifier |
"texture" | 1 | Texture reference type detected |
"surface" | 1 | Surface reference type detected |
"grid_constant" | (batched) | __grid_constant__ parameter attribute |
"preserve_n_data" | (batched) | NVIDIA-internal preservation hint |
"preserve_reg_abi" | (batched) | NVIDIA-internal register ABI hint |
Metadata Accessor Functions
The backend reads metadata through typed accessor functions in the 0xCE7xxx--0xCE9xxx range:
| Address | Reconstructed Name | Returns |
|---|---|---|
| sub_CE9220 | isKernel(func) | true if calling convention == 0x47 OR nvvm.kernel present |
| sub_CE8D40 | getMaxNtid(out, func) | Parses "nvvm.maxntid" as (x,y,z) triple |
| sub_CE8DF0 | getReqNtid(out, func) | Parses "nvvm.reqntid" as (x,y,z) triple |
| sub_CE8EA0 | getClusterDim(out, func) | Parses "nvvm.cluster_dim" as (x,y,z) triple |
| sub_CE9030 | getMaxClusterRank(func) | Checks "cluster_max_blocks" then "nvvm.maxclusterrank" |
| sub_CE90E0 | getMinCtaSM(func) | Checks "minctasm" then "nvvm.minctasm" |
| sub_CE9180 | getMaxNReg(func) | Checks "maxnreg" then "nvvm.maxnreg" |
Each accessor first checks the function-level metadata (post-transplant), then falls back to the raw nvvm.annotations tuples (pre-transplant). The isKernel check is especially important: it recognizes kernels either by calling convention 0x47 or by the nvvm.kernel metadata presence, ensuring compatibility with both the EDG frontend path and bitcode loaded through LibNVVM.
Metadata Lifecycle
The complete flow from CUDA source to PTX directives:
CUDA: __global__ void kern() __launch_bounds__(256, 2) __cluster_dims__(2, 1, 1)
EDG: LaunchAttr { cluster_config[12]=256, [11]=1, [10]=1, [7]=1, [6]=1, [5]=2 }
sub_93AE30:
-> nvvm.blocksareclusters (presence flag)
-> nvvm.reqntid = "256,1,1"
-> nvvm.cluster_dim = "2,1,1"
-> function-level MDNode (index -1)
sub_A84F90: reads back on bitcode load
Backend accessors (CE8xxx): typed access
PTX emitter (sub_3022E70):
.blocksareclusters
.reqntid 256, 1, 1
.reqnctapercluster 2, 1, 1
Special Variables: threadIdx, blockIdx, blockDim, gridDim, warpSize
Recognition Pipeline
CUDA built-in variables (threadIdx, blockIdx, blockDim, gridDim, warpSize) are not stored in memory --- they map directly to PTX special registers accessed via LLVM intrinsics. Two parallel codegen paths exist: an older one in the 0x920xxx range and a newer one in the 0x1285xxx range. Both share the same logic structure.
The classifier function isSpecialRegisterVar (sub_920430 / sub_127F7A0) checks five preconditions before recognizing a variable:
- Inside kernel: (ctx->flags_at_360 & 1) != 0 --- only valid in __global__ function context.
- Not extern: (sym->byte_89 & 1) == 0.
- Not template-dependent: *(signed char*)(sym+169) >= 0.
- Element count == 1: sym->elem_count_at_136 == 1.
- Name non-null: sym->name_at_8 != NULL.
If all five pass, the name is compared via strcmp against the five known strings to determine the output category:
| Category | Name | Type |
|---|---|---|
| 0 | threadIdx | dim3 (3-component struct) |
| 1 | blockDim | dim3 |
| 2 | blockIdx | dim3 |
| 3 | gridDim | dim3 |
| 4 | warpSize | scalar int |
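A minimal sketch of that final name comparison, assuming the five preconditions have already passed (the category numbering follows the table above):

```c
#include <string.h>

/* Sketch of the isSpecialRegisterVar category lookup. The real classifier
 * also tests kernel context, extern linkage, template dependence, and
 * element count before reaching this strcmp stage. */
int special_var_category(const char *name)
{
    static const char *names[5] = {
        "threadIdx", "blockDim", "blockIdx", "gridDim", "warpSize"
    };
    for (int i = 0; i < 5; i++)
        if (name && strcmp(name, names[i]) == 0)
            return i;        /* category 0-4 as in the table above */
    return -1;               /* not a special register variable */
}
```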
Intrinsic ID Table
A static 2D array int intrinsicIDs[5][3] maps (category, component) to LLVM intrinsic IDs:
| CUDA Variable | .x | .y | .z |
|---|---|---|---|
| threadIdx | @llvm.nvvm.read.ptx.sreg.tid.x | .tid.y | .tid.z |
| blockDim | @llvm.nvvm.read.ptx.sreg.ntid.x | .ntid.y | .ntid.z |
| blockIdx | @llvm.nvvm.read.ptx.sreg.ctaid.x | .ctaid.y | .ctaid.z |
| gridDim | @llvm.nvvm.read.ptx.sreg.nctaid.x | .nctaid.y | .nctaid.z |
| warpSize | @llvm.nvvm.read.ptx.sreg.warpsize | --- | --- |
Each intrinsic is a zero-argument call returning i32. The old codegen path uses intrinsic ID 9374 for warpSize; the new path uses 4348.
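The table lookup can be sketched with intrinsic names standing in for the numeric IDs (which, as noted, differ between the two codegen paths):

```c
#include <stddef.h>

/* Sketch of the (category, component) -> intrinsic lookup. Strings stand
 * in for the int intrinsicIDs[5][3] table in the binary. */
static const char *sreg_intrinsic[5][3] = {
    { "llvm.nvvm.read.ptx.sreg.tid.x",
      "llvm.nvvm.read.ptx.sreg.tid.y",
      "llvm.nvvm.read.ptx.sreg.tid.z" },
    { "llvm.nvvm.read.ptx.sreg.ntid.x",
      "llvm.nvvm.read.ptx.sreg.ntid.y",
      "llvm.nvvm.read.ptx.sreg.ntid.z" },
    { "llvm.nvvm.read.ptx.sreg.ctaid.x",
      "llvm.nvvm.read.ptx.sreg.ctaid.y",
      "llvm.nvvm.read.ptx.sreg.ctaid.z" },
    { "llvm.nvvm.read.ptx.sreg.nctaid.x",
      "llvm.nvvm.read.ptx.sreg.nctaid.y",
      "llvm.nvvm.read.ptx.sreg.nctaid.z" },
    { "llvm.nvvm.read.ptx.sreg.warpsize", NULL, NULL },
};

const char *lookup_sreg(int category, int component)
{
    if (category < 0 || category > 4 || component < 0 || component > 2)
        return NULL;
    return sreg_intrinsic[category][component]; /* NULL for warpSize .y/.z */
}
```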
dim3 Member Access Codegen
Two functions handle the code generation, depending on whether the access is a full dim3 struct or a single component:
Full struct access (sub_922290 / sub_1285550): For threadIdx as a whole (all three components), loops 3 times:
for (component = 0; component < 3; component++) {
intrinsicID = intrinsicIDs[category][component];
decl = Module::getOrInsertIntrinsic(intrinsicID);
callInst = CallInst::Create(decl); // zero-arg, returns i32
// Insert into struct via InsertValue
}
The three call results are composed into the struct type via CreateInsertValue. The IR value is named "predef_tmp".
Single component access (sub_9268C0 / sub_1286E40): For threadIdx.x specifically, the member name's first character is extracted from member_symbol+56+8:
- 'x' (0x78), with null terminator '\0' at the next byte -> component 0
- 'y' (0x79) -> component 1
- 'z' (0x7A) -> component 2
The null-terminator check prevents false matches on member names like "xy". A single intrinsic call is emitted, named "predef_tmp_comp":
%predef_tmp_comp = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
Both paths compute alignment from the return type's bit-width via BSR and handle sign extension: if the type tag byte at +140 satisfies (tag & 0xFB) == 8 (signed int), the result is marked as signed.
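The first-character-plus-null-terminator check can be sketched as follows (the member_symbol+56+8 field offsets are omitted; a plain string stands in for the symbol):

```c
/* Sketch of the dim3 member classifier: the first character selects the
 * component, and the next byte must be NUL to reject names like "xy". */
int dim3_component(const char *member_name)
{
    if (!member_name || member_name[0] == '\0' || member_name[1] != '\0')
        return -1;            /* empty or longer than one character */
    switch (member_name[0]) {
    case 'x': return 0;       /* 0x78 */
    case 'y': return 1;       /* 0x79 */
    case 'z': return 2;       /* 0x7A */
    default:  return -1;
    }
}
```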
PTX Backend Mapping
The NVPTX backend (sub_21E86B0) maps internal register encodings (single-byte case labels using ASCII character codes) to PTX special register names:
| Code | ASCII | PTX Register |
|---|---|---|
| 0x26 | & | %tid.x |
| 0x27 | ' | %tid.y |
| 0x28 | ( | %tid.z |
| 0x29 | ) | %ntid.x |
| 0x2A | * | %ntid.y |
| 0x2B | + | %ntid.z |
| 0x2C | , | %ctaid.x |
| 0x2D | - | %ctaid.y |
| 0x2E | . | %ctaid.z |
| 0x2F | / | %nctaid.x |
| 0x30 | 0 | %nctaid.y |
| 0x31 | 1 | %nctaid.z |
Codes 0x5E (^) and 0x5F (_) are delegated to sub_3958DA0 for cluster and warp-level registers. Any unhandled code triggers a fatal "Unhandled special register" error. Register names are written via optimized memcpy of 6--9 bytes directly to the output stream.
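A sketch of this case-label dispatch; the delegated (0x5E/0x5F) and error paths are collapsed to a NULL return here, whereas the real emitter calls sub_3958DA0 or raises the fatal error:

```c
#include <stddef.h>

/* Sketch of the single-byte code -> PTX special register mapping from
 * the table above. Codes form a dense run, so an array indexed by
 * (code - 0x26) is equivalent to the switch in sub_21E86B0. */
const char *ptx_sreg_name(unsigned char code)
{
    static const char *names[] = {
        "%tid.x",    "%tid.y",    "%tid.z",     /* 0x26-0x28 */
        "%ntid.x",   "%ntid.y",   "%ntid.z",    /* 0x29-0x2B */
        "%ctaid.x",  "%ctaid.y",  "%ctaid.z",   /* 0x2C-0x2E */
        "%nctaid.x", "%nctaid.y", "%nctaid.z",  /* 0x2F-0x31 */
    };
    if (code >= 0x26 && code <= 0x31)
        return names[code - 0x26];
    return NULL;  /* 0x5E/0x5F: cluster/warp handler; otherwise fatal */
}
```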
ISel Lowering
The instruction selector (sub_36E4040) validates that the intrinsic declaration returns i32 (type code 7 at offset +48 of the overload descriptor). If the type does not match, it emits a fatal error: "Unsupported overloaded declaration of llvm.nvvm.read.sreg intrinsic". It then creates a MachineSDNode with NVPTX target opcode 3457.
EDG Frontend Diagnostic
The EDG frontend includes a diagnostic at sub_6A49A0 that detects writes to predefined read-only variables. When a store target matches any of the five built-in names, it emits diagnostic 0xDD0:
error: cannot assign to variable 'threadIdx' with predefined meaning in CUDA
This diagnostic fires during semantic analysis, long before IR generation. It ensures that CUDA programs cannot accidentally (or intentionally) write to hardware register proxies.
Libdevice Linking
NVIDIA embeds a complete copy of the libdevice math library -- 455,876 bytes of LLVM bitcode -- directly inside the cicc binary. This library provides GPU-optimized implementations of ~350 mathematical intrinsics (trigonometric, exponential, rounding, Bessel functions, error functions, type conversions, and integer utilities) that are linked into every CUDA compilation during the LNK pipeline stage. The linker (sub_12C06E0, 63KB) validates bitcode magic bytes, enforces the nvptx64- target triple prefix, checks NVVM IR version metadata for cross-release compatibility, and performs symbol-size matching across all modules before producing a single merged module. Two identical copies of the embedded bitcode exist in the binary -- one for each compilation path -- ensuring the library is always available without filesystem access.
Upstream LLVM has no equivalent of this embedded-library mechanism. Clang relies on external libdevice.10.bc files discovered through --cuda-path at driver level. NVIDIA's approach eliminates the file-lookup step entirely, making cicc self-contained: the entire math library ships inside the compiler binary itself.
| Embedded size | 455,876 bytes (445 KB) per copy |
| Copies in binary | 2: unk_3EA0080 (Path A), unk_420FD80 (Path B) |
| Function count | 352 defined (349 __nv_* public + 3 __internal_* helper) |
| __nvvm_reflect calls | 2,016 (architecture/precision dispatch) |
| Target triple | nvptx64-nvidia-gpulibs |
| NVVM IR version | !nvvmir.version = !{i32 2, i32 0} (always-compatible sentinel) |
| Attribute group | #0 = { alwaysinline nounwind } on all public functions |
| Module linker | sub_12C06E0 (63KB, 2,154 lines) |
| Version checker | sub_12BFF60 (9KB, 362 lines) |
| Pipeline stage | LNK (first stage, before OPT) |
| Override | -nvvmir-library <path> CLI flag substitutes an external file |
| Version bypass | NVVM_IR_VER_CHK=0 disables IR version validation |
Embedded Bitcode Layout
The cicc binary contains two byte-identical copies of the libdevice bitcode at different virtual addresses. Each compilation path uses its own copy, avoiding any shared-state coordination between Path A (nvcc-invoked) and Path B (standalone/LibNVVM):
Binary offset Path Referenced by Size
─────────────────────────────────────────────────────────────
unk_3EA0080 A sub_905EE0 (43KB) 455,876 bytes
unk_420FD80 B sub_1265970 (48KB) 455,876 bytes
Both copies contain identical LLVM bitcode with:
- Data layout: e-i64:64-v16:16-v32:32-n16:32:64
- Target triple: nvptx64-nvidia-gpulibs (note: gpulibs, not cuda)
- Producer: clang version 3.8.0 (tags/RELEASE_380/final) -- the bitcode was originally compiled with an ancient Clang but has been maintained through bitcode format upgrades across CUDA toolkit releases
- Version metadata: !nvvmir.version = !{i32 2, i32 0} -- this specific version tuple (2, 0) is hard-coded in the version checker as an always-compatible sentinel
The duplication exists because the two compilation paths (sub_905EE0 for Path A, sub_1265970 for Path B) are entirely independent code paths with no shared module state. Deduplicating the data would require introducing a shared pointer, which NVIDIA apparently considered not worth the ~445KB savings in a 60MB binary.
Loading the Embedded Bitcode
In both paths, the embedded bitcode is passed to sub_12BCB00 (the nvvmCUAddModuleFromBuffer API wrapper) with a hardcoded size constant:
// Path A (sub_905EE0, line ~167):
v19 = sub_12BCB00(compilation_unit, &unk_3EA0080, 455876, 0);
// Path B (sub_1265970, line ~448):
v19 = sub_12BCB00(compilation_unit, &unk_420FD80, 455876, 0);
When the -nvvmir-library <path> flag is provided, the corresponding path opens the file, reads its contents into memory, and passes that buffer to sub_12BCB00 instead of the embedded pointer. This override is used primarily for testing custom libdevice builds.
Libdevice Function Inventory
The library defines 352 functions across 10 categories. All 349 public functions carry alwaysinline nounwind attributes, meaning they will be unconditionally inlined during the OPT stage after linking. Three internal helper functions (__internal_trig_reduction_slowpathd, __internal_accurate_pow, __internal_lgamma_pos) use noinline nounwind to avoid code size explosion in their callers.
| Category | Count | Examples |
|---|---|---|
| Type conversions | 75 | __nv_float2int_rn, __nv_double2ull_rz, __nv_int2float_rd, __nv_half2float |
| Rounded arithmetic | 74 | __nv_fmaf_rn, __nv_fdiv_rz, __nv_dsqrt_rd, __nv_dadd_ru, __nv_fmul_rn |
| Trigonometric | 34 | __nv_sinf, __nv_cos, __nv_tanf, __nv_asinf, __nv_atan2, __nv_sincospi |
| Special functions | 30 | __nv_erff, __nv_lgamma, __nv_j0, __nv_y1, __nv_cyl_bessel_i0, __nv_normcdf |
| Roots and norms | 28 | __nv_sqrtf, __nv_rsqrt, __nv_cbrt, __nv_hypot, __nv_norm3d, __nv_rnorm4d |
| Exponential/logarithmic | 28 | __nv_expf, __nv_log2, __nv_exp10, __nv_log1p, __nv_ldexp, __nv_frexp |
| Integer utilities | 27 | __nv_clz, __nv_popc, __nv_brev, __nv_mulhi, __nv_abs, __nv_byte_perm |
| Float utilities | 20 | __nv_fabsf, __nv_fminf, __nv_copysign, __nv_fmod, __nv_nextafter, __nv_nan |
| Rounding | 14 | __nv_floorf, __nv_ceil, __nv_truncf, __nv_roundf, __nv_nearbyintf, __nv_rint |
| Classification | 11 | __nv_isinff, __nv_isnand, __nv_isfinited, __nv_signbitf, __nv_ilogb, __nv_logb |
| Internal helpers | 3 | __internal_trig_reduction_slowpathd, __internal_accurate_pow, __internal_lgamma_pos |
Every public function body contains calls to @__nvvm_reflect with query strings (__CUDA_FTZ, __CUDA_ARCH, __CUDA_PREC_SQRT) that are resolved by the NVVMReflect pass during optimization. This is how the same bitcode adapts to different precision modes and SM architectures -- see NVVMReflect for details on the reflection mechanism. The 2,016 reflect calls across 352 functions mean an average of ~5.7 architecture/precision branch points per function.
Struct Types
The bitcode defines five aggregate types used by multi-return functions:
%struct.uint2 = type { i32, i32 }
%struct.float2 = type { float, float }
%struct.trig_reduction_return = type { double, i32 }
%struct.ulonglong2 = type { i64, i64 }
%struct.double2 = type { double, double }
trig_reduction_return is used by the internal trigonometric range reduction helper. The float2/double2 types appear in sincos/sincospi which return both sine and cosine through output pointers.
Constant Tables
The bitcode contains precomputed coefficient tables in address space 1 (global memory):
| Global | Type | Purpose |
|---|---|---|
@__cudart_i2opi_f | [6 x i32] | Float-precision inverse-of-pi table for trig reduction |
@__cudart_i2opi_d | [18 x i64] | Double-precision inverse-of-pi table for trig reduction |
@__cudart_sin_cos_coeffs | [16 x double] | Chebyshev coefficients for sin/cos polynomial approximation |
Module Linker Algorithm
sub_12C06E0 (63KB) is the central module linker that operates during the LNK pipeline stage. It receives a list of user modules and a list of builtin modules (which includes libdevice), validates them, and produces a single merged LLVM module. The algorithm proceeds in seven phases:
Phase A: Module Iteration and Bitcode Validation
For each module in the input list (from a1[0] to a1[1], stepping by 4 qwords per entry), the linker:
- Opens and reads the module data via sub_16C2450
- Validates LLVM bitcode magic bytes -- accepts two formats:
  - Raw bitcode: bytes 0xDE 0xC0 0x17 0x0B (little-endian 0x0B17C0DE)
  - Bitcode wrapper: bytes 0x42 0x43 0xC0 0xDE (ASCII "BC" prefix)
- Determines the buffer name (falls back to "Unknown buffer" if the vtable function is sub_12BCB10)
- Parses bitcode into an LLVM Module via sub_15099C0
for each entry in modules[a1[0] .. a1[1]]:
buffer = open_and_read(entry.data, entry.size, entry.name)
magic = read_4_bytes(buffer)
if magic != 0x0B17C0DE and magic != 0xDEC04342:
*error_code = 9 // invalid bitcode
return NULL
name = (entry.vtable_func == sub_12BCB10)
? "Unknown buffer"
: entry.vtable_func(entry)
module = parse_bitcode(buffer, llvm_ctx, name)
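The magic-byte test in the pseudocode above can be written concretely; note that a little-endian 4-byte read of the on-disk bytes 0xDE 0xC0 0x17 0x0B yields 0x0B17C0DE:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the Phase A magic-byte check: raw bitcode or the "BC\xC0\xDE"
 * wrapper are accepted; anything else is rejected (error code 9 in cicc). */
int valid_bitcode_magic(const unsigned char *buf, size_t len)
{
    if (len < 4)
        return 0;
    uint32_t magic;
    memcpy(&magic, buf, 4);          /* little-endian read, as on x86-64 */
    return magic == 0x0B17C0DEu      /* raw: DE C0 17 0B on disk */
        || magic == 0xDEC04342u;     /* wrapper: 42 43 C0 DE ("BC" prefix) */
}
```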
Phase B: Triple Validation
After parsing all modules, the linker enforces that every module's target triple starts with nvptx64-. The comparison uses a prefix match against the global string at off_4CD49B0:
for each parsed_module:
triple = get_triple(parsed_module) // offset +240
if triple.length == 0:
error: "Module does not contain a triple, should be 'nvptx64-'"
*error_code = 9
else if !starts_with(triple, "nvptx64-"):
error: "<module_name>: Module does not contain a triple, should be 'nvptx64-'"
*error_code = 9
The libdevice bitcode has triple nvptx64-nvidia-gpulibs, which passes this prefix check. User modules typically have nvptx64-nvidia-cuda.
Phase C: IR Version Check
For each module, the linker calls sub_12BFF60 (the version checker -- see next section). If the check fails, the linker emits a diagnostic and returns error code 3:
for each parsed_module:
result = NVVMIRVersionCheck(modules, parsed_module, flags)
if result != 0:
error: "<name>: error: incompatible IR detected. "
"Possible mix of compiler/IR from different releases."
*error_code = 3
return NULL
Phase D: Single-Module Fast Path
When only one module exists (no linking needed), the linker returns it directly via sub_1C3DFC0 without invoking any linking machinery. This fast path avoids the overhead of LLVM's Linker::linkModules for the common case of a single translation unit without libdevice.
Phase E: Multi-Module User Linking
For N > 1 user modules, the linker:
- Selects one module as the "primary" (index v57)
- Copies the primary module's triple and data layout to all secondary modules (ensuring consistency)
- Calls sub_12F5610 -- NVIDIA's wrapper around LLVM's Linker::linkModules -- to merge all user modules into a single module
if module_count > 1:
primary = modules[v57]
for each secondary in modules where index != v57:
set_triple(secondary, get_triple(primary))
set_data_layout(secondary, get_data_layout(primary))
result = LinkModules(&modules, linking_state, &error_str, &warnings, options)
if result != 0:
error: "<module_name>: link error: <details>"
*error_code = 9
Phase F: Builtin Linking
After user modules are merged, the linker processes builtin modules from a1[3] to a1[4] (this is where libdevice lives). Each builtin module goes through the same bitcode validation and parsing as user modules, then is linked into the main module using sub_1CCEBE0 -- a different linking function than the user-module linker, likely Linker::linkModules with Linker::OverrideFromSrc flags for builtin definitions:
for each builtin in modules[a1[3] .. a1[4]]:
validate_and_parse(builtin)
set_triple(builtin, get_triple(main_module))
result = LinkBuiltinModule(main_module, builtin, &error_string)
if result != 0:
error: "builtins: link error: <details>"
// continues -- does not abort on builtin link failure
post_link_cleanup(main_module, target_features)
The post-link cleanup sequence (sub_1611EE0 through sub_160FE50) configures target features on the merged module and finalizes symbol resolution.
Phase G: Symbol Size Matching
The final validation phase walks every global symbol in the linked module and checks that declarations and definitions agree on type sizes. The linker maintains a binary search tree keyed by symbol name and computes type sizes using a recursive size calculator:
| Type code | Type | Size formula |
|---|---|---|
| 1 | half | 16 bits |
| 2 | float | 32 bits |
| 3, 9 | double, i64 | 64 bits |
| 4 | fp80 | 80 bits |
| 5, 6 | fp128 | 128 bits |
| 7 | pointer | 8 * pointer_size |
| 0xB | integer | bits >> 8 |
| 0xD | struct | sum of member sizes |
| 0xE | array | alignment * count * ceil(element_bits / (8 * alignment)) |
| 0xF | named type | resolved recursively |
| 0x10 | vector | element_size * count |
for each global_symbol in linked_module:
name = get_name(global_symbol)
if name in size_tree:
existing_size = size_tree[name].size
new_size = compute_type_size(global_symbol.type)
if existing_size != new_size:
error: "Size does not match for <name> in <module_A> "
"with size X specified in <module_B> with size Y."
size_mismatch = true
else:
size_tree.insert(name, compute_type_size(global_symbol.type))
if size_mismatch:
*error_code = 9
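A simplified sketch of the recursive size calculator from the table above (the tagged-union layout is invented; the fp80/fp128 cases are direct, while the array and named-type cases, plus alignment handling, are elided):

```c
#include <stdint.h>

/* Hypothetical type node: code follows the table above; for integers the
 * bit width is stored shifted left by 8, matching the "bits >> 8" rule. */
typedef struct Type Type;
struct Type {
    int          code;      /* 1=half, 2=float, 3=double, 0xB=integer, ... */
    uint64_t     bits;      /* integer width << 8 */
    const Type **members;   /* struct members, or element type for vectors */
    uint64_t     count;     /* member count / vector lane count */
};

uint64_t type_size_bits(const Type *t)
{
    switch (t->code) {
    case 1:  return 16;                     /* half */
    case 2:  return 32;                     /* float */
    case 3: case 9: return 64;              /* double, i64 */
    case 4:  return 80;                     /* fp80 */
    case 5: case 6: return 128;             /* fp128 */
    case 0xB: return t->bits >> 8;          /* integer */
    case 0xD: {                             /* struct: sum of member sizes */
        uint64_t sum = 0;
        for (uint64_t i = 0; i < t->count; i++)
            sum += type_size_bits(t->members[i]);
        return sum;
    }
    case 0x10:                              /* vector: element_size * count */
        return type_size_bits(t->members[0]) * t->count;
    default: return 0;                      /* array/named cases elided */
    }
}
```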
Triple and Version Validation
NVVM IR Version Checker (sub_12BFF60)
The version checker validates the nvvmir.version metadata node that every NVVM-produced bitcode module carries. It ensures that modules compiled by different CUDA toolkit versions are not accidentally mixed.
Metadata lookup: The checker searches for two named metadata nodes:
"nvvmir.version"-- the IR version tuple"llvm.dbg.cu"-- debug compile unit (presence indicates debug info exists)
Both are looked up via sub_1632310 (named metadata search on the module).
Version tuple format: The metadata node contains either 2 or 4 constant integer operands:
| Format | Operands | Meaning |
|---|---|---|
| 2-element | {major, minor} | IR version only |
| 4-element | {major, minor, dbg_major, dbg_minor} | IR version + debug IR version |
Compatibility check: For the IR version, sub_12BDA30 performs the actual comparison. The special case (major=2, minor=0) always passes -- this is exactly the version carried by the embedded libdevice, ensuring it is compatible with any user module regardless of toolkit version.
For the debug version, sub_12BD890 checks compatibility with a similar special case: (debug_major=3, debug_minor<=2) always passes.
Unique node deduplication: The checker builds a hash set of unique metadata nodes using the standard DenseMap infrastructure with NVVM-layer sentinels (-8 / -16). See Hash Table and Collection Infrastructure for the hash function and probing strategy. This deduplication handles the case where multiple source files within a compilation unit carry identical version metadata -- each unique version is checked exactly once.
Final gate: if debug info is present in the module (setting the debug mode flag) but no debug version was validated (because the metadata lacked elements 2-3), the checker returns 3 (incompatible). This catches the case where a debug-compiled user module is linked against a non-debug library that lacks debug version metadata.
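The two compatibility predicates can be sketched as follows. The sentinel cases ({2,0} and {3, <=2}) are documented above; the non-sentinel comparison shown here (same major, minor no newer than the current toolkit's) is an assumption, not confirmed from the binary:

```c
/* Sketch of sub_12BDA30-style IR version compatibility. The {2,0} sentinel
 * is the tuple carried by the embedded libdevice. cur_major/cur_minor stand
 * in for whatever version this toolkit emits. */
int ir_version_compatible(int major, int minor, int cur_major, int cur_minor)
{
    if (major == 2 && minor == 0)
        return 1;                    /* always-compatible sentinel */
    return major == cur_major && minor <= cur_minor;  /* assumed rule */
}

/* Sketch of the sub_12BD890-style debug IR version check. */
int dbg_version_compatible(int major, int minor, int cur_major, int cur_minor)
{
    if (major == 3 && minor <= 2)
        return 1;                    /* debug sentinel */
    return major == cur_major && minor <= cur_minor;  /* assumed rule */
}
```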
Symbol Resolution During LNK
The LNK stage processes libdevice functions through LLVM's standard symbol resolution mechanism. Because all 349 public libdevice functions carry the alwaysinline attribute, the resolution and inlining follow a specific sequence:
1. Declaration matching: User code that calls __nv_sinf(x) contains an external declaration declare float @__nv_sinf(float). The linker resolves this declaration against the define float @__nv_sinf(float) in libdevice.
2. __nvvm_reflect remains unresolved: After linking, libdevice function bodies contain calls to @__nvvm_reflect which are still unresolved declarations. These are handled during the OPT stage by the NVVMReflect pass, not during linking.
3. Dead function elimination: Functions from libdevice that are never called by user code are eliminated by GlobalDCE during the OPT stage. Since libdevice provides 352 functions but a typical kernel uses only a handful, the vast majority are stripped.
4. alwaysinline enforcement: During the OPT stage, the AlwaysInliner pass processes all libdevice functions. After inlining, the original function bodies become dead (no remaining callers) and are removed by subsequent DCE.
The net effect: a kernel calling __nv_sinf ends up with the sinf implementation inlined directly into the kernel body, with __nvvm_reflect calls already resolved to constants by NVVMReflect, and all unused branches from precision/architecture dispatch eliminated by SimplifyCFG.
Constant Folding Interaction
The constant folding engine (sub_14D90D0, 27KB) has special knowledge of libdevice functions. When a libdevice intrinsic is called with constant arguments, the fold eligibility checker determines whether the call can be evaluated at compile time -- before the libdevice function is inlined.
This creates an important ordering constraint:
LNK stage: link libdevice → user module now has __nv_sinf definitions
OPT stage: NVVMReflect → resolve __CUDA_FTZ, __CUDA_ARCH queries
ConstantFold → fold __nv_sinf(0.0) → 0.0 (if eligible)
AlwaysInline → inline remaining __nv_sinf calls
SimplifyCFG → remove dead reflect branches
GlobalDCE → remove unused libdevice functions
The fold eligibility checker (sub_14D90D0) uses three dispatch mechanisms to identify foldable functions:
LLVM intrinsic ID switch (IDs 0-211): Covers standard LLVM intrinsics like llvm.sin, llvm.cos, llvm.sqrt, llvm.fma, llvm.floor, llvm.ceil, llvm.exp, llvm.log, llvm.pow, llvm.fabs, llvm.bswap, llvm.ctlz, llvm.ctpop, and overflow arithmetic.
NVVM intrinsic ID ranges (IDs > 211): Covers NVIDIA-specific intrinsics organized as binary-search ranges with bitmask dispatch:
| Range | IDs | Examples |
|---|---|---|
| 0xEB4-0xEE3 | 3764-3811 | nvvm.ceil.f, nvvm.ctlz.i, nvvm.cos.approx.ftz.f |
| 0xF1E-0xF72 | 3870-3954 | nvvm.exp2.approx, nvvm.fabs.f, nvvm.floor.f, nvvm.sqrt.f |
| 0xFE8-0xFEA | 4072-4074 | nvvm.sin.approx.ftz.f and similar |
| 0x1012-0x104C | 4114-4172 | nvvm.max.i, nvvm.min.ui, nvvm.min.ll |
| 0x1086-0x1087 | 4230-4231 | nvvm.mul.hi.* |
| 0x117B-0x1184 | 4475-4484 | nvvm.sqrt.rn.d, nvvm.sqrt.approx.ftz.f |
| 0x1C80-0x1CAC | 7296-7340 | nvvm.fmax.f, nvvm.fmin.ftz.nan.f |
Name-based matching (ID = 0): When the call target is not a recognized LLVM or NVVM intrinsic, the checker falls back to string matching on the function name. It dispatches on the first character, then uses DWORD integer comparisons for 4-byte names and memcmp for longer names:
Foldable C library names:
sin, sinf, cos, cosf, tan, tanf, acos, acosf, asin, asinf,
atan, atanf, atan2, atan2f, ceil, ceilf, cosh, coshf,
exp, expf, exp2, exp2f, fabs, fabsf, floor, floorf,
fmod, fmodf, log, logf, log10, log10f, pow, powf,
round, roundf, sinh, sinhf, sqrt, sqrtf, tanh, tanhf
Convergent gate: Before any folding, the checker verifies that the callee does not carry the convergent attribute (kind 0x34). Convergent functions have warp-synchronous semantics and must not be speculatively constant-folded, even if all arguments are constants.
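A sketch of the name-based fallback, with a truncated name list; plain strcmp stands in for the first-character dispatch plus DWORD/memcmp comparison in the binary, which is an optimization rather than a semantic difference:

```c
#include <string.h>

/* Subset of the foldable C library names listed above. */
static const char *foldable_names[] = {
    "sin", "sinf", "cos", "cosf", "sqrt", "sqrtf", "pow", "powf",
    "exp", "expf", "log", "logf", "floor", "floorf", "fabs", "fabsf",
};

/* Sketch of the ID == 0 path: no recognized LLVM/NVVM intrinsic ID, so
 * fall back to matching the callee name. The first-character test mirrors
 * the dispatch structure; strcmp does the rest. */
int is_foldable_libm_name(const char *name)
{
    for (size_t i = 0; i < sizeof foldable_names / sizeof *foldable_names; i++) {
        const char *cand = foldable_names[i];
        if (name[0] == cand[0] && strcmp(name, cand) == 0)
            return 1;
    }
    return 0;
}
```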
Configuration
Environment Variables
| Variable | Effect |
|---|---|
| NVVM_IR_VER_CHK | Set to "0" to disable IR version validation. Any other value or unset = enabled (default). Checked in sub_12BFF60 at 0x12BFF60 and in the duplicate verifier at 0x2259720. |
CLI Flags
| Flag | Effect |
|---|---|
| -nvvmir-library <path> | Override the embedded libdevice with an external bitcode file. The file is opened, read into memory, and passed to the linker in place of the embedded unk_3EA0080/unk_420FD80 pointer. |
| -opt / -llc | When passed as the first extra argument, skips builtin linking entirely (jumps past the libdevice linking code to direct pipeline stage invocation). |
| -keep | Preserves the .lnk.bc intermediate file showing the linked module (user + libdevice) before optimization. |
Intermediate Files
When -keep is active, the LNK stage serializes its output to a .lnk.bc file alongside the input:
input.cu → input.lnk.bc (linked: user + libdevice)
→ input.opt.bc (optimized: after OPT stage)
→ input.ptx (final: after LLC stage)
The .lnk.bc file is useful for verifying which libdevice functions survived linking and how __nvvm_reflect calls appear before the NVVMReflect pass resolves them.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| ModuleLinker | sub_12C06E0 | 63KB | Main bitcode linker: validates magic, triple, version; links user modules, then builtins |
| NVVMIRVersionCheck | sub_12BFF60 | 9KB | Reads nvvmir.version metadata, checks compatibility via sub_12BDA30/sub_12BD890 |
| CheckIRVersion | sub_12BDA30 | ~2KB | IR version compatibility predicate (special-cases {2,0} as always-compatible) |
| CheckDebugVersion | sub_12BD890 | ~2KB | Debug IR version compatibility predicate (special-cases {3, <=2}) |
| PipelineOrchestrator | sub_12C35D0 | 41KB | 4-stage pipeline driver; calls sub_12C06E0 during LNK stage |
| LibNVVMPipelineA | sub_905EE0 | 43KB | Path A pipeline driver; references unk_3EA0080 for embedded libdevice |
| LibNVVMPipelineB | sub_1265970 | 48KB | Path B pipeline driver; references unk_420FD80 for embedded libdevice |
| nvvmCUAddModuleFromBuffer | sub_12BCB00 | ~1KB | API wrapper that adds a bitcode buffer to the compilation unit |
| LibNVVM API dispatch | sub_12BC0F0 | 3KB | Resolves LibNVVM API function pointers by hash ID |
| ParseBitcodeFile | sub_15099C0 | ~8KB | LLVM bitcode parser entry point |
| LinkBuiltinModule | sub_1CCEBE0 | ~4KB | Links a single builtin module into the main module (Linker::linkModules with OverrideFromSrc [MEDIUM confidence] -- inferred from the override-from-source semantics of builtin linking and the 4KB size matching a thin wrapper around LLVM's linker API, but no diagnostic string confirms the exact LLVM API call) |
| LinkUserModules | sub_12F5610 | ~4KB | Links multiple user modules (Linker::linkModules [MEDIUM confidence] -- same reasoning as above; wrapper size and call pattern match, but unconfirmed by string evidence) |
| CanFoldIntrinsic | sub_14D90D0 | 27KB | Constant-fold eligibility checker for math intrinsics |
| embedded libdevice (Path A) | unk_3EA0080 | 455,876B | Raw LLVM bitcode blob |
| embedded libdevice (Path B) | unk_420FD80 | 455,876B | Raw LLVM bitcode blob (identical copy) |
Reimplementation Checklist
- Embedded bitcode storage and loading. Embed the libdevice bitcode blob (455,876 bytes) directly in the compiler binary, provide two independent copies for dual-path compilation (Path A / Path B), and implement the nvvmCUAddModuleFromBuffer API wrapper to load the embedded blob or an external override file via -nvvmir-library.
- Bitcode magic validation. Accept two bitcode formats: raw bitcode (0xDE 0xC0 0x17 0x0B, little-endian 0x0B17C0DE) and bitcode wrapper (0x42 0x43 0xC0 0xDE, ASCII "BC" prefix). Reject anything else with error code 9.
- Target triple and IR version validation. Enforce the nvptx64- prefix on all module triples. Implement the NVVM IR version checker that reads nvvmir.version metadata (2-element or 4-element tuples), special-cases version {2,0} as always-compatible (the libdevice sentinel), and checks debug IR version compatibility for {3, <=2}.
- Multi-module linking pipeline. Implement the phased linker: (A) module iteration with bitcode validation, (B) triple validation, (C) IR version check, (D) single-module fast path, (E) multi-module user linking with primary module selection and triple/data-layout propagation, (F) builtin linking with OverrideFromSrc semantics.
- Symbol size matching. Walk all global symbols in the linked module, compute type sizes recursively (handling half/float/double/pointer/integer/struct/array/vector types), and verify that declarations and definitions agree on type sizes using a binary search tree keyed by symbol name.
- Constant folding integration. Implement the fold eligibility checker for libdevice functions with three dispatch mechanisms (LLVM intrinsic ID switch for IDs 0-211, NVVM intrinsic ID ranges for IDs >211, name-based matching for C library names), gated by the convergent attribute check to prevent folding warp-synchronous functions.
Cross-References
- Entry Point & CLI -- dual-path architecture, -nvvmir-library flag handling
- NVVMReflect -- resolution of __nvvm_reflect calls embedded in libdevice functions
- Optimizer Pipeline -- OPT stage where inlining and DCE process linked libdevice
- Environment Variables -- NVVM_IR_VER_CHK documentation
- Bitcode I/O -- bitcode reader/writer infrastructure used by the linker
LLVM Optimizer
NVIDIA's LLVM optimizer in cicc v13.0 is not a straightforward invocation of the upstream LLVM opt pipeline. Instead, it implements a proprietary two-phase compilation model where the same 49.8KB pipeline assembly function (sub_12E54A0) is called twice with different phase counters, allowing analysis passes to run in Phase I and codegen-oriented passes in Phase II. Individual passes read a TLS variable (qword_4FBB3B0) to determine which phase is active and skip themselves accordingly.
The optimizer also supports concurrent per-function compilation: after Phase I completes on the whole module, Phase II can be parallelized across functions using a thread pool sized to get_nprocs() or a GNU Jobserver token count. This is a significant departure from upstream LLVM, which processes functions sequentially within a single pass manager invocation.
The entire optimization behavior is controlled by the NVVMPassOptions system — a 4,512-byte struct with 221 option slots (114 string + 100 boolean + 6 integer + 1 string-pointer) that provides per-pass enable/disable toggles and parametric knobs. This system is completely proprietary and has no upstream equivalent.
Address range 0x12D0000–0x16FFFFF (~4.2 MB of code).
| Pipeline assembler | sub_12E54A0 (49.8KB, 1,553 lines, ~150 pass insertions) |
| Phase orchestrator | sub_12E7E70 (9.4KB, Phase I / Phase II) |
| Concurrent entry | sub_12E1EF0 (51.3KB, jobserver + split-module + thread pool) |
| PassOptions init | sub_12D6300 (125KB, 4,786 lines, 221 option slots) |
| New PM registration | sub_2342890 (2,816 lines, 35 NVIDIA + ~350 LLVM passes) |
| Target creation | sub_12EA530 (4.1KB, "nvptx" / "nvptx64") |
| AddPass | sub_12DE0B0 (3.5KB, hash-table-based pass insertion) |
| Tier 0 sub-pipeline | sub_12DE330 (4.8KB, ~40 passes) |
| Tier 1/2/3 sub-pipeline | sub_12DE8F0 (17.9KB, phase-conditional) |
| Codegen dispatch | sub_12DFE00 (20.7KB) |
| LTO pipeline | sub_12F5F30 (37.8KB, dead kernel elimination) |
| jemalloc | 5.3.x statically linked (~400 functions at 0x12FC000) |
Architecture
sub_12E1EF0 (51KB, concurrent compilation entry)
│
├─ GNU Jobserver init (sub_16832F0, --jobserver-auth=R,W from MAKEFLAGS)
├─ Bitcode reading + verification (sub_153BF40)
├─ Function sorting by priority (sub_12E0CA0)
├─ Thread pool creation (sub_16D4AB0, min(requested, num_functions) threads)
│
└─ sub_12E7E70 (9.4KB, two-phase orchestrator)
│
├─ Phase I: qword_4FBB3B0 = 1
│ └─ sub_12E54A0 (whole-module analysis + early optimization)
│
├─ Concurrency check: sub_12D4250 (>1 defined function?)
│ ├─ Yes, threads>1 → per-function Phase II via thread pool
│ │ └─ sub_12E86C0 per function (qword_4FBB3B0 = 2)
│ └─ No → sequential Phase II
│ └─ sub_12E54A0 (qword_4FBB3B0 = 2)
│
└─ qword_4FBB3B0 = 3 (done)
sub_12E54A0 (49.8KB, MASTER PIPELINE ASSEMBLY)
│
├─ Top branch: a4[4384] → Pipeline B (fast/codegen-only)
│ else → Pipeline A (normal LLVM)
│
├─ Target machine setup
│ ├─ Triple: "nvptx64" or "nvptx" (based on pointer size)
│ ├─ sub_16D3AC0 → TargetRegistry::lookupTarget()
│ ├─ TargetOptions: PIC=1, CodeModel=8, OptLevel=1, ThreadModel=1
│ └─ DataLayout from qword_4FBB430
│
├─ Phase 0: Infrastructure (TLI, TTI, Verifier, AssumptionCache, ProfileSummary)
├─ Phase 1: Language dispatch (a4[3648]: "ptx"/"mid"/default)
├─ Phase 2: Pre-optimization passes
├─ Phase 3: Main optimization loop (tier threshold dispatch)
├─ Phase 4: Post-opt language-specific pipelines
├─ Phase 5: Finalization (NVVMLowerBarriers, BreakCriticalEdges, codegen)
├─ Phase 6: Phase 2 codegen check (qword_4FBB3B0 == 2)
├─ Phase 7: PassManager::run
└─ Phase 8: Basic block naming ("F%d_B%d" for debug)
Two-Phase Compilation — sub_12E7E70
| Field | Value |
|---|---|
| Address | 0x12E7E70 |
| Size | 9.4KB |
| Strings | "Phase I", "Phase II", "Concurrent=Yes/No" |
The two-phase model exists because certain optimization passes (e.g., inter-procedural memory space propagation, global inlining decisions) require whole-module visibility, while others (register pressure-driven rematerialization, instruction scheduling) operate per-function and benefit from parallelization. Phase I runs the whole-module analysis and early optimization passes; Phase II runs the per-function backend-oriented passes.
Both phases call the same sub_12E54A0. The difference: qword_4FBB3B0 (TLS variable) is set to 1 or 2 before each call. Individual passes read this counter and skip themselves if the current phase doesn't match their intended execution phase. When the module contains only a single defined function, the phase mechanism is bypassed entirely — a single unphased call handles everything.
Phase State Machine:
START → [phase=1] → sub_12E54A0 (Phase I)
│
error? → RETURN
│
count_functions()
├─ 1 func → [phase=2] → sub_12E54A0 → [phase=3] → DONE
├─ N funcs, threads>1 → per-function Phase II (thread pool) → [phase=3] → DONE
└─ N funcs, threads≤1 → [phase=2] → sub_12E54A0 → [phase=3] → DONE
GNU Jobserver Integration
When cicc is invoked from a parallel make -jN build, it can participate in the GNU Jobserver protocol, limiting its own thread count to the available parallelism tokens. This prevents oversubscription — without it, a -j16 build could spawn 16 cicc processes, each creating its own thread pool, resulting in hundreds of threads competing for CPU time. cicc reads the --jobserver-auth=R,W pipe file descriptors from the MAKEFLAGS environment variable.
In sub_12E1EF0 (lines 833–866), when a4+3288 is set:
v184 = sub_16832F0(&state, 0);   // parse MAKEFLAGS for --jobserver-auth=R,W
if (v184 == 5 || v184 == 6)      // pipe issues
    warning("jobserver pipe problem");
else if (v184 != 0)
    fatal("GNU Jobserver support requested, but an error occurred");
sub_16832F0 allocates a 296-byte state structure, parses MAKEFLAGS, creates a pipe for token management, and spawns a pthread to manage tokens. This throttles concurrent per-function compilations to match the build's -j level.
Split-Module Compilation
Split-module compilation is NVIDIA's mechanism for the -split-compile=N flag. It decomposes a multi-function module into individual per-function bitcode blobs, compiles each independently (potentially in parallel), then re-links the results. This trades away inter-procedural optimization opportunities for compilation speed and reduced peak memory usage — a worthwhile tradeoff for large CUDA kernels during development iteration.
When optimization level (a4+4104) is negative, enters split-module mode:
- Each function's bitcode is extracted via `sub_1AB9F40` with filter callback `sub_12D4BD0`
- Module name: `"<split-module>"` (14 chars)
- After the thread pool completes, split modules are re-linked via `sub_12F5610`
- Linkage attributes are restored from a hash table (external linkage types: bits 0–5, dso_local: bit 6 of byte+33)
Pipeline Assembly — sub_12E54A0
The pipeline assembly function is the heart of the optimizer. At 49.8KB with ~150 AddPass calls, it constructs the complete LLVM pass pipeline at runtime rather than using a static pipeline description. The function first sets up target machine infrastructure (triple, data layout, subtarget features), then dispatches into one of three language-specific paths that determine which passes run and in what order. After the language-specific path completes, a shared finalization phase runs barriers, critical edge breaking, and codegen preparation.
A distinguishing feature of NVIDIA's pipeline is the tier system: passes are organized into Tiers 0–3, each gated by a threshold counter. As compilation progresses through the main loop (which iterates over external plugin/extension pass entries), tiers fire when the accumulated pass count exceeds their threshold. This allows NVIDIA to precisely control where in the pipeline their custom passes interleave with standard LLVM passes.
Language-Specific Paths
The pipeline branches based on a4[3648] (language string). The three paths represent different optimization strategies for different IR maturity levels:
| String | Path | Pass Count | Key Difference |
|---|---|---|---|
"ptx" | Path A | ~15 | Light: NVVMPeephole → LLVM standard → DCE → MemorySpaceOpt |
"mid" | Path B | ~45 | Full: SROA → GVN → LICM → LoopIndexSplit → Remat → all NVIDIA passes |
| (default) | Path C | ~40 | General: 4 LLVM standard passes + NVIDIA interleaving |
Tier System
The main loop iterates over entries at a4[4488] (16-byte stride: vtable + phase_id):
if (opt_enabled && phase_id > opt_threshold) → sub_12DE330 // Tier 0 (full)
if (tier1_flag && phase_id > tier1_threshold) → sub_12DE8F0(1) // Tier 1
if (tier2_flag && phase_id > tier2_threshold) → sub_12DE8F0(2) // Tier 2
if (tier3_flag && phase_id > tier3_threshold) → sub_12DE8F0(3) // Tier 3
Each tier fires once (flag cleared after execution). Remaining tiers fire unconditionally after the loop.
Tier 0 — Full Optimization (sub_12DE330)
Tier 0 is the most aggressive optimization sub-pipeline. It runs ~40 passes in a carefully ordered sequence that interleaves standard LLVM passes with NVIDIA-specific ones. The ordering reveals NVIDIA's optimization strategy: start with GVN and SCCP for value simplification, then run NVIDIA's custom NVVMReflect and NVVMVerifier to clean up NVVM-specific constructs, followed by aggressive loop transformations (LoopIndexSplit, LoopUnroll, LoopUnswitch), and finally register-pressure-sensitive passes (Rematerialization, DSE, DCE) to prepare for codegen.
~40 passes in order:
Confidence note: Pass identifications are based on diagnostic strings, factory signatures, and pipeline ordering. Most are HIGH confidence. Entries marked `[MEDIUM confidence]` are inferred from code structure rather than direct string evidence.
| # | Factory | Likely Pass | Guarded By |
|---|---|---|---|
| 1 | sub_1654860(1) | BreakCriticalEdges | — |
| 2 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | — |
| 3 | sub_1B26330 | MemCpyOpt | — |
| 4 | sub_185D600 | IPConstantPropagation | — |
| 5 | sub_1C6E800 | GVN | — |
| 6 | sub_1C6E560 | NewGVN/GVNHoist [MEDIUM confidence] | — |
| 7 | sub_1857160 | NVVMReflect | — |
| 8 | sub_1842BC0 | SCCP | — |
| 9 | sub_12D4560 | NVVMVerifier | — |
| 10 | sub_18A3090 | NVVMPredicateOpt | — |
| 11 | sub_184CD60 | ConstantMerge | — |
| 12 | sub_1869C50(1,0,1) | Sink/MemSSA [MEDIUM confidence] | !opts[1040] |
| 13 | sub_1833EB0(3) | TailCallElim/JumpThreading [MEDIUM confidence] | — |
| 14 | sub_1952F90(-1) | LoopIndexSplit | — |
| 15 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | — |
| 16 | sub_1A223D0 | NVVMIRVerification | — |
| 17 | sub_1A7A9F0 | InstructionSimplify | — |
| 18 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | — |
| 19 | sub_1A02540 | GenericToNVVM | — |
| 20 | sub_198DF00(-1) | LoopSimplify | — |
| 21 | sub_1C76260 | ADCE | !opts[1320] |
| 22 | sub_195E880(0) | LICM | opts[2880] |
| 23 | sub_19C1680(0,1) | LoopUnroll | !opts[1360] |
| 24 | sub_19401A0 | InstCombine | — |
| 25 | sub_1968390 | SROA | — |
| 26 | sub_196A2B0 | EarlyCSE | — |
| 27 | sub_19B73C0(2,...) | LoopUnswitch | — |
| 28 | sub_190BB10(0,0) | SimplifyCFG | — |
| 29 | sub_1A13320 | NVVMRematerialization | — |
| 30 | sub_18F5480 | DSE | — |
| 31 | sub_18DEFF0 | DCE | — |
| 32 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | — |
| 33 | sub_18B1DE0 | NVVMLoopPass [MEDIUM confidence] | — |
| 34 | sub_1841180 | FunctionAttrs | — |
"mid" Path — Complete Pass Ordering
The "mid" path is the primary optimization pipeline for standard CUDA compilation. At ~45 passes, it is the most comprehensive of the three paths. The key pattern is repeated interleaving of NVIDIA custom passes with standard LLVM passes: NVVMIntrinsicLowering runs 4 times at different points, NVVMReflect runs 3 times, and NVVMIRVerification runs after each major transformation to catch correctness regressions early. The MemorySpaceOpt pass appears once in this sequence (gated by !opts[1760]) — it runs again later via the parameterized <second-time> invocation in Tier 1/2/3.
ConstantMerge → NVVMIntrinsicLowering → MemCpyOpt → SROA → NVVMPeephole → NVVMAnnotations → LoopSimplify → GVN → NVVMIRVerification → SimplifyCFG → InstCombine → LLVM standard #5 → NVVMIntrinsicLowering → DeadArgElim → FunctionAttrs → DCE → ConstantMerge → LICM → NVVMLowerBarriers → MemorySpaceOpt → Reassociate → LLVM standard #8 → NVVMReflect → ADCE → InstructionSimplify → DeadArgElim → TailCallElim → DeadArgElim → CVP → Sink → SimplifyCFG → DSE → NVVMSinking2 → NVVMIRVerification → EarlyCSE → NVVMReflect → LLVM standard #8 → NVVMIntrinsicLowering → IPConstProp → LICM → NVVMIntrinsicLowering → NVVMBranchDist → NVVMRemat
NVVMPassOptions — sub_12D6300
NVVMPassOptions is NVIDIA's proprietary mechanism for fine-grained control over every optimization pass. Unlike LLVM's cl::opt system (which uses global command-line options), NVVMPassOptions stores per-pass configuration in a flat struct that is allocated once and passed through the pipeline by pointer. This design avoids the global-state problems of cl::opt and allows different compilation units to have different pass configurations within the same process — critical for the concurrent per-function compilation model.
The 125KB initialization function is the largest in the optimizer range. Its size comes from the sheer number of option slots: each of the 221 slots requires a hash-table lookup, a default-value resolution, and a type-specific store, with most slots organized in pairs (a string parameter + a boolean enable flag).
| Field | Value |
|---|---|
| Address | 0x12D6300 |
| Size | 125KB (4,786 lines) |
| Output struct | 4,512 bytes (allocated via sub_22077B0(4512)) |
| Slot count | 221 (indices 1–221) |
| Slot types | 114 string + 100 boolean + 6 integer + 1 string-pointer |
Struct Layout
| Region | Offset | Content |
|---|---|---|
| Header | 0–7 | int opt_level (from a2+112) |
| Registry ptr | 8–15 | Pointer to PassOptionRegistry |
| Slot pairs | 16–4479 | 221 option slots (string/bool/int pairs) |
| Sentinel | 4480–4511 | 4 qwords zeroed |
Option Slot Types
| Type | Size | Writer | Count |
|---|---|---|---|
| String | 24B | sub_12D6090 | 114 |
| Bool (compact) | 16B | sub_12D6100 | 83 |
| Bool (inline) | 16B | direct byte write | 17 |
| Integer | 16B | sub_16D2BB0 (parseInt) | 6 |
| String pointer | 28B | direct qword write (slot 181 only) | 1 |
Pair Organization
Slots are organized in pairs: even = string parameter (the pass's configuration value or name), odd = boolean enable/disable toggle (the do-X flag). This consistent pairing means each "pass knob" has both a parametric value and an on/off switch, allowing passes to be individually disabled without removing their configuration — useful for A/B testing optimizations.
Exceptions to the pair pattern: slots 160–162 (3 consecutive strings — a pass with 3 string parameters), slots 192–193 (2 consecutive bools — a pair of binary flags), slot 181 (the only string-pointer type, storing a char* + length directly — likely a file path or regex pattern).
Defaults Enabled (14 of 100 booleans)
Slots: 19, 25, 93, 95, 117, 141, 143, 151, 155, 157, 159, 165, 211, 219. These are passes that run by default and must be explicitly disabled.
Integer Defaults
| Slot | Default | Likely Purpose |
|---|---|---|
| 9 | 1 | Iteration count / threshold |
| 197 | 20 | Limit (e.g., unroll count) |
| 203 | -1 | Sentinel (unlimited/auto) |
| 205 | -1 | Sentinel |
| 207 | -1 | Sentinel |
| 215 | 0 | Disabled counter |
Known Option Names
Boolean toggles (do-X / no-X):
do-ip-msp, do-licm, do-remat, do-clone-for-ip-msp, do-cssa, do-scev-cgp, do-function-scev-cgp, do-scev-cgp-aggresively, do-base-address-strength-reduce, do-base-address-strength-reduce-chain, do-comdat-renaming, do-counter-promotion, do-lsr-64-bit, do-sign-ext-expand, do-sign-ext-simplify
Parametric knobs:
remat-for-occ, remat-gep-cost, remat-max-live-limit, remat-maxreg-ceiling, remat-move, remat-single-cost-limit, remat-use-limit, branch-dist-block-limit, branch-dist-func-limit, branch-dist-norm, scev-cgp-check-latency, scev-cgp-control, scev-cgp-cross-block-limit, scev-cgp-idom-level-limit, scev-cgp-inst-limit, scev-cgp-norm, cssa-coalesce, cssa-verbosity, base-address-strength-reduce-iv-limit
Dump flags:
dump-ip-msp, dump-remat, dump-branch-dist, dump-scev-cgp, dump-sink2, dump-before-cssa, dump-normalize-gep, dump-simplify-live-out
New PM Pass Registration — sub_2342890
NVIDIA maintains both the Legacy Pass Manager and the New Pass Manager in cicc v13.0. The New PM registration lives in a single 2,816-line function that registers every analysis, pass, and printer by calling sub_E41FB0(pm, class_name, len, pass_name, len) for each. Standard LLVM passes use the llvm:: prefix (stripped during registration), while NVIDIA custom passes use their own class names.
The registration function also handles parameterized pass parsing: when the pipeline text parser encounters a pass name with angle-bracket parameters (e.g., memory-space-opt<first-time;warnings>), it calls a registered parameter-parsing callback that returns a configured pass options struct. This is how MemorySpaceOpt can run twice with different configurations in the same pipeline.
NVIDIA Custom Passes (35 total)
Module passes (12): check-gep-index, check-kernel-functions, cnp-launch-check, ipmsp, nv-early-inliner, nv-inline-must, nvvm-pretreat, nvvm-verify, printf-lowering, select-kernels, lower-ops*, set-global-array-alignment*
Function passes (20): basic-dbe, branch-dist, byval-mem2reg, bypass-slow-division, normalize-gep, nvvm-reflect-pp, nvvm-peephole-optimizer, old-load-store-vectorizer, remat, propagate-alignment, reuse-local-memory, set-local-array-alignment, sinking2, d2ir-scalarizer, sink<rp-aware>, memory-space-opt*, lower-aggr-copies*, lower-struct-args*, process-restrict*
Loop pass (1): loop-index-split
Analyses (2): rpa (RegisterPressureAnalysis), merge-sets (MergeSetsAnalysis)
* = parameterized
Key Discoveries
- `nvvm-reflect-pp` is actually `SimplifyConstantConditionalsPass`, not a reflection pass. It runs after NVVMReflect resolves `__nvvm_reflect()` calls to constants, cleaning up the resulting dead branches and unreachable code. The misleading name ("pp" = post-processing) obscures what is essentially a targeted dead-code-elimination pass.
- `memory-space-opt` runs twice in the pipeline with different parameterizations: `<first-time>` early in optimization (conservative, uses available alias information) and `<second-time>` late (aggressive, benefits from earlier optimizations having simplified the IR). This two-pass approach is necessary because address space resolution depends on pointer analysis quality, which improves as other passes simplify the code.
- `d2ir-scalarizer` reuses LLVM's `ScalarizerPass` class under a different name, suggesting NVIDIA added a custom registration point to control when scalarization happens in the NVPTX pipeline without modifying the upstream pass.
- Legacy PM co-existence: both Legacy PM and New PM registrations exist for the same passes, with slightly different names (e.g., `"memory-space-opt-pass"` vs `"memory-space-opt"`). This dual registration is necessary during the LLVM Legacy→New PM migration — cicc v13.0 appears to be in the middle of this transition.
Key Global Variables
| Variable | Purpose |
|---|---|
| qword_4FBB3B0 | Phase counter TLS: 1=Phase I, 2=Phase II, 3=done |
| qword_4FBB370 | Feature flag register (value 6 = barrier opt + memspace opt) |
| qword_4FBB410 | Tier execution tracker |
| qword_4FBB430 | Optimization level store |
| qword_4FBB510 | Debug/trace verbosity level |
| byte_3F871B3 | NVIDIA global flag byte (empty/null string in .rodata) |
| byte_4F99740 | CUTLASS optimization enable flag |
NVVMPassOptions Deep Dive
Memory Layout
The 4,512-byte NVVMPassOptions struct is allocated on the heap via sub_22077B0(4512) at the start of each compilation. The layout divides into four regions:
Offset 0x000 [8B] : int32 opt_level (from config+112) + 4B padding
Offset 0x008 [8B] : qword ptr to PassOptionRegistry (hash table source)
Offset 0x010 [4464B]: 221 option slots (indices 1-221)
Offset 0x1180[32B] : 4 qwords zeroed (sentinel/trailer)
The slots start at offset 16 and are packed contiguously. Each slot occupies a fixed size depending on its type, but the stride varies: string options take 24 bytes, boolean options take 16 bytes, integer options take 16 bytes, and the single string-pointer option (slot 181) takes 28 bytes. The overall packing is not uniform-stride; the offset of each slot must be computed from the cumulative widths of all preceding slots.
Slot Type Formats
Five distinct slot types exist, each written by a dedicated helper:
// TYPE A: String option (114 instances)
// Written by sub_12D6090 (writeStringOption)
struct StringSlot { // 24 bytes
char* value_ptr; // +0: pointer to string value
int32_t option_index; // +8: 1-based slot index
int32_t flags; // +12: from PassDef byte+40
int32_t opt_level; // +16: optimization level context
int32_t pass_id; // +20: resolved via sub_1691920
};
// TYPE B: Boolean compact (83 instances)
// Written by sub_12D6100 (writeBoolOption)
struct BoolCompactSlot { // 16 bytes
uint8_t value; // +0: 0 or 1
uint8_t pad[3]; // +1: padding
int32_t option_index; // +4
int32_t flags; // +8
int32_t pass_id; // +12
};
// TYPE C: Boolean inline (17 instances)
// Written directly as byte + int32 fields
struct BoolInlineSlot { // 16 bytes
uint8_t value; // +0: 0 or 1
uint8_t pad[3]; // +1
int32_t option_index; // +4: from sub_12D6240 return hi32
int32_t opt_level; // +8
int32_t pass_id; // +12: resolved inline
};
// TYPE D: Integer (6 instances)
// Value parsed by sub_16D2BB0 (parseInt)
struct IntegerSlot { // 16 bytes
int32_t value; // +0: parsed integer
int32_t option_index; // +4
int32_t opt_level; // +8
int32_t pass_id; // +12
};
// TYPE E: String pointer (1 instance, slot 181 only)
struct StringPtrSlot { // 28 bytes
char* char_ptr; // +0: raw string data pointer
int64_t str_length; // +8: length of string
int32_t option_index; // +16
int32_t opt_level; // +20
int32_t pass_id; // +24
};
Helper Function Chain
The initialization function sub_12D6300 populates the struct by iterating all 221 slot indices and calling a chain of helpers for each:
- `sub_12D6170` (PassOptionRegistry::lookupOption) -- looks up a slot index in the hash table at `registry+120`. Returns a pointer to an `OptionNode` struct: `[+40] int16 flags`, `[+48] qword* value_array_ptr`, `[+56] int value_count`. Returns null if the option was not set on the command line.
- `sub_12D6240` (getBoolOption) -- resolves a boolean option. Calls `sub_12D6170` to find the option, then, if a string value exists, lowercases it via `sub_16D2060` and tests whether the first char is `'1'` (0x31) or `'t'` (0x74). If the option was not found, it defaults to true (enabled). Returns the boolean packed with the flags in the low 40 bits.
- `sub_1691920` (PassDefTable::getPassDef) -- looks up a PassDef entry in a table where each entry is 64 bytes. Computes: `table[0] + (index - 1) * 64`. The PassDef holds the pass_id at `[+32]`, a `has_overrides` flag at `[+36]`, and an override index at `[+40]`.
Initial Slots (1-6): Global Configuration
The first six slots are all string types at a uniform 24-byte stride, starting at offset 16. They do not follow the pair pattern and represent global pipeline parameters rather than per-pass knobs:
| Slot | Offset | Likely Content |
|---|---|---|
| 1 | 16 | ftz (flush-to-zero mode string) |
| 2 | 40 | prec-div (precise division setting) |
| 3 | 64 | prec-sqrt (precise square root setting) |
| 4 | 88 | fmad (fused multiply-add policy) |
| 5 | 112 | opt-level (optimization level string) |
| 6 | 136 | sm-arch (target SM architecture string) |
CLI Interface
Users interact with NVVMPassOptions via the -opt flag, which appends key=value pairs to the PassOptionRegistry before sub_12D6300 flattens them:
cicc -opt "-do-ip-msp=0" # disable memory space propagation
cicc -opt "-do-licm=0" # disable LICM
cicc -opt "-remat-max-live-limit=50" # set rematerialization threshold
cicc -opt "-dump-remat" # enable remat dump output
The registry is a hash table populated from these CLI strings. Each -opt argument is parsed into a key (the option name) and value (the string after =). When sub_12D6300 runs, it queries the registry for each of the 221 slot indices. If a CLI override exists, it takes precedence; otherwise the compiled-in default is used.
Option Anomalies
Several regions break the standard string/boolean pair pattern:
- Slots 160-162: Three consecutive string slots with no interleaved boolean. [LOW confidence] This represents a pass (likely MemorySpaceOpt or the CSSA pass) that takes three string configuration parameters followed by a single boolean enable flag at slot 163. The pass identity is uncertain because neither MemorySpaceOpt nor CSSA has been confirmed to consume three string parameters; the association is based on pipeline-position proximity only.
- Slots 192-193: Two consecutive boolean slots. One is the main enable toggle; the other appears to be a sub-feature flag (both default to disabled).
- Slot 181 (offset 3648): The only `STRING_PTR` type. Its default is `byte_3F871B3` (an empty string in `.rodata`). The raw pointer + length storage suggests this holds a file path or regex pattern for pass filtering.
- Slots 196-207: Alternating string + integer slots instead of string + boolean. [LOW confidence] This high-numbered region contains all six integer options, likely controlling late-pipeline passes with numeric thresholds (unroll counts, live-variable limits, iteration bounds). The specific pass-to-slot associations are unconfirmed; this interpretation is based on typical LLVM integer-valued pass options, not direct evidence.
Complete Slot-to-Offset Map with Known Consumers
The following table maps NVVMPassOptions slot indices to struct byte offsets, types, defaults, and -- where the cross-reference to the pipeline assembler's a4[offset] guards could be established -- the consuming pass(es). Offsets marked with * are confirmed by cross-referencing a4[offset] guards in sub_12E54A0 and sub_12DE8F0.
| Slot | Offset | Type | Default | Known Knob Name | Consuming Pass |
|---|---|---|---|---|---|
| 1 | 16 | STRING | | ftz | Global: flush-to-zero mode |
| 2 | 40 | STRING | | prec-div | Global: precise division |
| 3 | 64 | STRING | | prec-sqrt | Global: precise sqrt |
| 4 | 88 | STRING | | fmad | Global: fused multiply-add |
| 5 | 112 | STRING | | opt-level | Global: optimization level |
| 6 | 136 | STRING | | sm-arch | Global: target SM architecture |
| 7 | 160 | BOOL | 0 | ||
| 8 | 176 | STRING | |||
| 9 | 200* | INTEGER | 1 | Opt level for sub_12DFE00 codegen | |
| 10 | 216 | STRING | |||
| 11 | 240 | BOOL | 0 | ||
| 13 | 280* | BOOL | 0 | no-dce | sub_18DEFF0 (DCE) |
| 15 | 320* | BOOL | 0 | no-tailcallelim | sub_1833EB0 (TailCallElim) |
| 17 | 360* | BOOL | 0 | no-late-opt | sub_1C46000 (NVVMLateOpt) |
| 19 | 400* | BOOL | 1 | no-inline-a | Inlining variant A |
| 21 | 440* | BOOL | 0 | no-inline-b | sub_1C4B6F0 (AlwaysInliner) |
| 23 | 480* | BOOL | 0 | no-inline-c | sub_1C4B6F0 in sub_12DE8F0 |
| 25 | 520* | BOOL | 1 | sub_1AAC510 (NVIDIA pass A) | |
| 27 | 560* | BOOL | 0 | sub_1AAC510 (NVIDIA pass B) | |
| 29 | 600* | BOOL | 0 | no-nvvm-verify | sub_12D4560 (NVVMVerifier) |
| 33 | 680* | BOOL | 0 | no-func-attrs | sub_1841180 (FunctionAttrs) |
| 35 | 720* | BOOL | 0 | no-sccp | sub_1842BC0 (SCCP) |
| 37 | 760* | BOOL | 0 | no-dse | sub_18F5480 (DSE) |
| 43 | 880* | BOOL | 0 | no-nvvm-reflect | sub_1857160 (NVVMReflect) |
| 45 | 920* | BOOL | 0 | no-ipconst | sub_185D600 (IPConstProp) |
| 47 | 960* | BOOL | 0 | no-simplifycfg | sub_190BB10 (SimplifyCFG) |
| 49 | 1000* | BOOL | 0 | no-instcombine | sub_19401A0 (InstCombine) |
| 51 | 1040* | BOOL | 0 | no-sink | sub_1869C50 (Sink/MemSSA) |
| 53 | 1080* | BOOL | 0 | no-dump | sub_17060B0 (PrintModulePass) |
| 55 | 1120* | BOOL | 0 | no-predopt | sub_18A3430 (NVVMPredicateOpt) |
| 57 | 1160* | BOOL | 0 | no-loopindexsplit | sub_1952F90 (LoopIndexSplit) |
| 59 | 1200* | BOOL | 0 | no-simplifycfg-b | SimplifyCFG variant B |
| 61 | 1240* | BOOL | 0 | do-licm (inverted) | sub_195E880 (LICM) |
| 63 | 1280* | BOOL | 0 | no-reassoc | sub_1B7FDF0 (Reassociate) |
| 65 | 1320* | BOOL | 0 | no-adce-a | sub_1C76260 (ADCE variant) |
| 67 | 1360* | BOOL | 0 | no-loopunroll | sub_19C1680 (LoopUnroll) |
| 69 | 1400* | BOOL | 0 | no-sroa | sub_1968390 (SROA) |
| 71 | 1440* | BOOL | 0 | no-earlycse | sub_196A2B0 (EarlyCSE) |
| 73 | 1480* | BOOL | 0 | no-adce-b | ADCE variant B |
| 75 | 1520* | BOOL | 0 | no-loopsimplify | sub_198DF00 (LoopSimplify) |
| 83 | 1680* | BOOL | 0 | sub_19CE990 (NVIDIA pass) | |
| 87 | 1760* | BOOL | 0 | do-ip-msp (inverted) | sub_1C8E680 (MemorySpaceOpt) |
| 91 | 1840* | BOOL | 0 | no-adce-c | sub_1C6FCA0 (ADCE) |
| 93 | 1880 | BOOL | 1 | NVVMReduction param A | |
| 95 | 1920 | BOOL | 1 | NVVMReduction param B | |
| 97 | 1960* | BOOL | 0 | no-constmerge | sub_184CD60 (ConstantMerge) |
| 99 | 2000* | BOOL | 0 | no-intrin-lower | sub_1CB4E40 (NVVMIntrinsicLowering) |
| 101 | 2040* | BOOL | 0 | no-memcpyopt | sub_1B26330 (MemCpyOpt) |
| 105 | 2120* | BOOL | 0 | no-branchdist-b | sub_1CB73C0 (NVVMBranchDist B) |
| 109 | 2200* | BOOL | 0 | no-generic2nvvm | sub_1A02540 (GenericToNVVM) |
| 113 | 2280* | BOOL | 0 | no-loweralloca-b | NVVMLowerAlloca B |
| 115 | 2320* | BOOL | 0 | do-remat (inverted) | sub_1A13320 (NVVMRemat) |
| 117 | 2360 | BOOL | 1 | sub_1CC3990 (NVVMUnreachBlockElim) | |
| 121 | 2440* | BOOL | 0 | no-sinking2 | sub_1CC60B0 (NVVMSinking2) |
| 127 | 2560* | BOOL | 0 | no-genericaddropt | sub_1CC71E0 (NVVMGenericAddrOpt) |
| 129 | 2600* | BOOL | 0 | no-irverify | sub_1A223D0 (NVVMIRVerification) |
| 131 | 2640* | BOOL | 0 | no-loopopt | sub_18B1DE0 (NVVMLoopOpt) |
| 133 | 2680* | BOOL | 0 | no-memspaceopt-b | MemorySpaceOpt in sub_12DE8F0 |
| 135 | 2720* | BOOL | 0 | no-instsimplify | sub_1A7A9F0 (InstructionSimplify) |
| 141 | 2840* | BOOL | 1 | Enable ADCE (sub_1C6FCA0, reversed) | |
| 143 | 2880* | BOOL | 1 | do-licm | Enable LICM (reversed logic) |
| 149 | 3000* | BOOL | 0 | Extra DeadArgElim trigger | |
| 151 | 3040 | BOOL | 1 | Enable CorrelatedValuePropagation | |
| 155 | 3120* | BOOL | 1 | Address space optimization flag | |
| 157 | 3160* | BOOL | 1 | dump-* master | Debug dump mode (PrintModulePass) |
| 159 | 3200* | BOOL | 1 | Enable advanced NVIDIA passes group | |
| 165 | 3328* | BOOL | 1 | Enable SM-specific warp/reduction/sinking | |
| 173 | 3488* | BOOL | 0 | Enable barrier optimization | |
| 175 | 3528* | BOOL | 0 | Tier 1 optimization enable | |
| 177 | 3568* | BOOL | 0 | Tier 2 optimization enable | |
| 179 | 3608* | BOOL | 0 | Tier 3 optimization enable | |
| 181 | 3648* | STR_PTR | "" | Language string ("ptx"/"mid"/"idn") | |
| 183 | 3704* | BOOL | 0 | Late optimization / address-space mode | |
| 193 | 3904* | BOOL | 0 | Debug: verify after each plugin pass | |
| 195 | 3944* | BOOL | 0 | Debug: rename BBs to "F%d_B%d" | |
| 197 | 3984 | INTEGER | 20 | Limit/threshold (e.g., unroll count) | |
| 203 | 4104 | INTEGER | -1 | Sentinel: unlimited/auto | |
| 205 | 4144 | INTEGER | -1 | Sentinel: unlimited/auto | |
| 207 | 4184 | INTEGER | -1 | Sentinel: unlimited/auto | |
| 209 | 4224* | BOOL | 0 | Master optimization switch | |
| 211 | 4264 | BOOL | 1 | ||
| 213 | 4304* | BOOL | 0 | Device-code / separate-compilation | |
| 215 | 4344 | INTEGER | 0 | Disabled counter | |
| 217 | 4384* | BOOL | 0 | Fast-compile / bypass LLVM pipeline | |
| 219 | 4424 | BOOL | 1 | ||
| 221 | 4464* | BOOL | 0 | Disable late CFG cleanup variant B |
Slots not listed have no confirmed cross-reference to pipeline assembler guards. The full 221-slot table is in the NVVMPassOptions Reference.
Complete Option Name Inventory
The following option names were extracted from binary string references in .rodata. They are set via -opt "-name=value" on the cicc command line (requires NVVMCCWIZ=553282 in non-release builds).
Boolean toggles (do-X / no-X):
| Name | Effect |
|---|---|
| do-ip-msp | Enable inter-procedural memory space propagation |
| do-licm | Enable LICM (loop-invariant code motion) |
| do-remat | Enable NVVMRematerialization |
| do-clone-for-ip-msp | Enable function cloning for IPMSP |
| do-cssa | Enable Conventional SSA construction |
| do-scev-cgp | Enable SCEV-based CodeGenPrepare |
| do-function-scev-cgp | Enable function-level SCEV-CGP |
| do-scev-cgp-aggresively | Aggressive SCEV-CGP mode [sic] |
| do-base-address-strength-reduce | Enable base address strength reduction |
| do-base-address-strength-reduce-chain | Enable chained base address SR |
| do-comdat-renaming | Enable COMDAT group renaming |
| do-counter-promotion | Enable counter promotion |
| do-lsr-64-bit | Enable 64-bit loop strength reduction |
| do-sign-ext-expand | Enable sign extension expansion |
| do-sign-ext-simplify | Enable sign extension simplification |
Parametric knobs:
| Name | Type | Default | Purpose |
|---|---|---|---|
| remat-for-occ | string | | Rematerialization occupancy target |
| remat-gep-cost | string | | GEP rematerialization cost |
| remat-ignore-single-cost | string | | Skip single-use cost analysis |
| remat-lli-factor | string | | Live-interval factor |
| remat-load-param | string | | Parameter load remat policy |
| remat-loop-trip | string | | Loop trip count for remat decisions |
| remat-max-live-limit | string | | Maximum live variable count |
| remat-maxreg-ceiling | string | | Register ceiling for remat |
| remat-move | string | | Rematerialization move policy |
| remat-single-cost-limit | string | | Single-value cost limit |
| remat-use-limit | string | | Use count limit for remat |
| branch-dist-block-limit | string | | Block count limit for branch distribution |
| branch-dist-func-limit | string | | Function-level branch dist limit |
| branch-dist-norm | string | | Normalization factor |
| scev-cgp-check-latency | string | | Latency check threshold |
| scev-cgp-control | string | | CGP control mode |
| scev-cgp-cross-block-limit | string | | Cross-block analysis limit |
| scev-cgp-idom-level-limit | string | | Immediate dominator depth limit |
| scev-cgp-inst-limit | string | | Instruction count limit |
| scev-cgp-norm | string | | Normalization factor |
| scev-cgp-old-base | string | | Legacy base address mode |
| scev-cgp-tid-max-value | string | | Thread ID maximum value |
| base-address-strength-reduce-iv-limit | string | | IV count limit for base addr SR |
| base-address-strength-reduce-max-iv | string | | Maximum IV for base addr SR |
| cssa-coalesce | string | | CSSA coalescing mode |
| cssa-verbosity | string | | CSSA debug verbosity |
Dump/debug flags:
| Name | Purpose |
|---|---|
| dump-ip-msp | Dump IPMSP analysis results |
| dump-ir-before-memory-space-opt | Dump IR before MemorySpaceOpt |
| dump-ir-after-memory-space-opt | Dump IR after MemorySpaceOpt |
| dump-memory-space-warnings | Dump address space warnings |
| dump-remat | Dump rematerialization decisions |
| dump-remat-add | Dump remat additions |
| dump-remat-iv | Dump remat induction variables |
| dump-remat-load | Dump remat load decisions |
| dump-branch-dist | Dump branch distribution analysis |
| dump-scev-cgp | Dump SCEV-CGP analysis |
| dump-base-address-strength-reduce | Dump base address SR |
| dump-sink2 | Dump Sinking2 pass output |
| dump-before-cssa | Dump IR before CSSA |
| dump-phi-remove | Dump PHI node removal |
| dump-normalize-gep | Dump GEP normalization |
| dump-simplify-live-out | Dump live-out simplification |
| dump-process-restrict | Dump restrict processing |
| dump-process-builtin-assume | Dump builtin assume processing |
| dump-conv-dot | Dump convergence as DOT graph |
| dump-conv-func | Dump convergence per function |
| dump-conv-text | Dump convergence as text |
| dump-nvvmir | Dump NVVM IR |
| dump-va | Dump value analysis |
Tier-Based Pass Ordering
The Threshold Dispatch Mechanism
NVIDIA's tier system is a priority-driven scheduling mechanism that interleaves optimization sub-pipelines with external plugin passes. The master pipeline function sub_12E54A0 iterates over a pass registration array at a4[4488] (16-byte stride entries: [+0] vtable_ptr, [+8] phase_id). As it processes each entry, it checks whether the entry's phase_id exceeds a threshold. When it does, the corresponding tier sub-pipeline fires once:
// Pseudocode for the main loop in sub_12E54A0
for (entry = a4[4488]; entry < a4[4496]; entry += 16) {
int phase_id = *(int*)(entry + 8);
if (opt_enabled && phase_id > opt_threshold) {
sub_12DE330(PM, opts); // Tier 0: full optimization
opt_enabled = 0; // fire once
}
if (tier1_flag && phase_id > tier1_threshold) {
sub_12DE8F0(PM, 1, opts); // Tier 1
tier1_flag = 0;
}
if (tier2_flag && phase_id > tier2_threshold) {
sub_12DE8F0(PM, 2, opts); // Tier 2
tier2_flag = 0;
}
if (tier3_flag && phase_id > tier3_threshold) {
sub_12DE8F0(PM, 3, opts); // Tier 3
tier3_flag = 0;
}
// Insert the plugin/external pass itself
pass = vtable_call(entry, +72); // entry->createPass()
AddPass(PM, pass, 1, 0);
}
// Any tier that didn't fire during the loop fires now
if (opt_enabled) sub_12DE330(PM, opts);
if (tier1_flag) sub_12DE8F0(PM, 1, opts);
if (tier2_flag) sub_12DE8F0(PM, 2, opts);
if (tier3_flag) sub_12DE8F0(PM, 3, opts);
This design means tier placement is data-driven: the thresholds stored at config offsets 4224/4228 (Tier 0), 3528/3532 (Tier 1), 3568/3572 (Tier 2), and 3608/3612 (Tier 3) determine exactly where in the plugin pass sequence each tier's sub-pipeline gets inserted. Changing the threshold shifts an entire tier of ~40 passes to a different position relative to the external passes. After each tier fires, its flag is cleared so it cannot fire again.
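The fire-once threshold mechanism can be modeled in isolation. The sketch below is a simplified, hypothetical model (names and phase_id values are illustrative, not recovered data); it shows how moving the threshold shifts the point in the plugin sequence at which a tier sub-pipeline is inserted, and how a never-crossed threshold defers the tier to the post-loop flush:

```c
#include <assert.h>

/* Hypothetical model of the fire-once tier dispatch in sub_12E54A0.
 * Each entry stands in for one 16-byte plugin registration record. */
typedef struct { int phase_id; } Entry;

/* Returns the 0-based plugin index before which the tier sub-pipeline
 * fires, or n if it only fires in the post-loop flush. */
int tier_fire_position(const Entry *entries, int n, int threshold) {
    int armed = 1;                        /* the tier's enable flag */
    for (int i = 0; i < n; i++) {
        if (armed && entries[i].phase_id > threshold) {
            armed = 0;                    /* fire once, then stay cleared */
            return i;
        }
    }
    return n;                             /* unconditional post-loop fire */
}
```

With plugin phase_ids {10, 20, 30}, a threshold of 5 fires the tier before the first plugin, 15 fires it between the first and second, and 99 defers it to the flush after the loop.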
Tier 0 Ordering Strategy
Tier 0 (sub_12DE330) is the most comprehensive sub-pipeline at ~40 passes. Its ordering reflects NVIDIA's optimization philosophy for GPU code:
Phase A -- Value Simplification (passes 1-8): BreakCriticalEdges normalizes the CFG, then the CGSCC inliner framework runs first to create optimization opportunities. NVVMReflect resolves __nvvm_reflect() calls to compile-time constants (GPU architecture queries), and SCCP propagates those constants. GVN and NewGVN/GVNHoist eliminate redundant computations.
Phase B -- NVIDIA-Specific Cleanup (passes 9-12): NVVMVerifier catches NVVM-specific IR errors early. NVVMPredicateOpt optimizes predicate expressions. ConstantMerge reduces module size.
Phase C -- Loop Transformations (passes 13-27): This is the core loop optimization sequence. Sink/MemSSA moves code out of hot paths. LoopIndexSplit divides loops at index boundaries. LICM hoists invariants. LoopUnroll with factor 3 expands small loops. LoopUnswitch moves conditionals out of loops. ADCE removes dead code exposed by loop transformations.
Phase D -- Register Pressure Management (passes 28-40): InstCombine and SROA simplify the IR further. NVVMRematerialization recomputes values to reduce register pressure -- critical for GPU occupancy. DSE and DCE clean up dead stores and code. The final CGSCC pass and FunctionAttrs prepare for per-function Phase II processing.
Tier 1/2/3 Incremental Additions -- sub_12DE8F0
| Address | 0x12DE8F0 |
| Size | 17,904 bytes |
| Signature | int64 sub_12DE8F0(int64 passMgr, int tier, int64 opts) |
sub_12DE8F0 adds passes incrementally based on the tier value (1, 2, or 3). Its first action stores the tier into qword_4FBB410 (the tier tracker global), then checks qword_4FBB3B0 (phase counter) for phase-dependent behavior. Nearly every pass insertion is gated by a boolean in the NVVMPassOptions struct.
The full pass list for sub_12DE8F0 (all tiers combined, with tier-specific gates):
sub_1CB4E40(1) [!opts[2000]] NVVMIntrinsicLowering (level=1)
sub_1A223D0() [!opts[2600]] NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]] NVVMIntrinsicLowering (barrier=1)
sub_18E4A00() [opts[3488]] NVVMBarrierAnalysis
sub_1C98160(0) [opts[3488]] NVVMLowerBarriers
sub_12D4560() [!opts[600]] NVVMVerifier
sub_185D600() [opts[3200]&&!opts[920]] IPConstPropagation [advanced group]
sub_1857160() [opts[3200]&&!opts[880]] NVVMReflect [advanced group]
sub_18A3430() [opts[3200]&&!opts[1120]] NVVMPredicateOpt [advanced group]
sub_1842BC0() [opts[3200]&&!opts[720]] SCCP [advanced group]
sub_12D4560() [!opts[600]] NVVMVerifier
sub_18A3090() [opts[3200]&&!opts[2160]] NVVMPredicateOpt variant [advanced group]
sub_184CD60() [opts[3200]&&!opts[1960]] ConstantMerge [advanced group]
sub_190BB10(1,0)[tier!=1 && guards] SimplifyCFG [TIER 2/3 ONLY]
sub_1952F90(-1)[tier!=1 && guards] LoopIndexSplit [TIER 2/3 ONLY]
sub_12D4560() [tier!=1 && !opts[600]] NVVMVerifier [TIER 2/3 ONLY]
sub_195E880(0) [opts[3704]&&opts[2880]] LICM
sub_1C8A4D0(v) [v=1 if opts[3704]] EarlyCSE
sub_1869C50(1,0,1)[tier!=1&&!opts[1040]] Sink [TIER 2/3 ONLY]
sub_1833EB0(3) [tier==3 && !opts[320]] TailCallElim [TIER 3 ONLY]
sub_1CC3990() [!opts[2360]] NVVMUnreachableBlockElim
sub_18EEA90() [opts[3040]] CorrelatedValuePropagation
sub_12D4560() [!opts[600]] NVVMVerifier
sub_1A223D0() [!opts[2600]] NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]] NVVMIntrinsicLowering
sub_1C4B6F0() [!opts[440]&&!opts[480]] Inliner
sub_1A7A9F0() [!opts[2720]] InstructionSimplify
sub_12D4560() [!opts[600]] NVVMVerifier
sub_1A02540() [!opts[2200]] GenericToNVVM
sub_198DF00(-1)[!opts[1520]] LoopSimplify
sub_1C76260() [!opts[1320]&&!opts[1480]] ADCE
sub_195E880(0) [opts[2880]&&!opts[1240]] LICM
sub_1C98160(v) [opts[3488]] NVVMLowerBarriers
sub_19C1680(0,1)[!opts[1360]] LoopUnroll
sub_19401A0() [!opts[1000]] InstCombine
sub_196A2B0() [!opts[1440]] EarlyCSE
sub_1968390() [!opts[1400]] SROA
sub_19B73C0(t,...)[tier!=1] LoopUnswitch (SM-dependent) [TIER 2/3 ONLY]
sub_1A62BF0(1,...)[!opts[600]] LLVM standard pipeline #1
sub_1A223D0() [!opts[2600]] NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]] NVVMIntrinsicLowering
sub_190BB10(0,0)[!opts[960]] SimplifyCFG
sub_1922F90() [opts[3080]] NVIDIA-specific loop pass
sub_195E880(0) [opts[2880]&&!opts[1240]] LICM
sub_1A13320() [!opts[2320]] NVVMRematerialization
sub_1968390() [!opts[1400]] SROA
sub_18EEA90() [opts[3040]] CorrelatedValuePropagation
sub_18F5480() [!opts[760]] DSE
sub_18DEFF0() [!opts[280]] DCE
sub_1A62BF0(1,...)[!opts[600]] LLVM standard pipeline #1
sub_1AAC510() [!opts[520]&&!opts[560]] NVIDIA-specific pass
sub_1A223D0() [!opts[2600]] NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]] NVVMIntrinsicLowering
sub_1C8E680() [!opts[2680]] MemorySpaceOpt (from opts[3120])
sub_1CC71E0() [!opts[2560]] NVVMGenericAddrOpt
sub_1C98270(1,v)[opts[3488]] NVVMLowerBarriers variant
sub_1C6FCA0() [opts[2840]&&!opts[1840]] ADCE
sub_18B1DE0() [opts[3200]&&!opts[2640]] LoopOpt/BarrierOpt [advanced group]
sub_1857160() [opts[3200]&&tier==3] NVVMReflect [TIER 3 ONLY]
sub_1841180() [opts[3200]&&!opts[680]] FunctionAttrs [advanced group]
sub_1C46000() [tier==3&&!opts[360]] NVVMLateOpt [TIER 3 ONLY]
sub_1841180() [opts[3200]&&!opts[680]] FunctionAttrs (2nd call) [advanced group]
sub_1CBC480() [!opts[2240]&&!opts[2280]] NVVMLowerAlloca
sub_1CB73C0() [!opts[2080]&&!opts[2120]] NVVMBranchDist
sub_1C7F370(1) [opts[3328]&&!opts[1640]] NVVMWarpShuffle [SM-specific]
sub_1CC5E00() [opts[3328]&&!opts[2400]] NVVMReduction [SM-specific]
sub_1CC60B0() [opts[3328]&&!opts[2440]] NVVMSinking2 [SM-specific]
sub_1CB73C0() [opts[3328]&&guards] BranchDist (2nd call) [SM-specific]
sub_1B7FDF0(3) [opts[3328]&&!opts[1280]] Reassociate [SM-specific]
Tier 1 (baseline) adds the passes above EXCEPT those gated by tier!=1: SimplifyCFG, LoopIndexSplit, Sink, and LoopUnswitch are all skipped. This is a conservative set focused on NVIDIA-specific cleanup without expensive LLVM optimization.
Tier 2 adds everything Tier 1 has plus the tier!=1-gated passes. The LoopUnswitch parameters are SM-architecture-dependent: sub_19B73C0 receives different vector widths based on the target subtarget.
Tier 3 adds TailCallElim (gated tier==3), NVVMReflect at a late position (gated tier==3), and NVVMLateOpt (gated tier==3). Critically, it also triggers feature flag escalation (see below).
Feature Flag Escalation
A notable pattern occurs only in Tier 3: if BYTE4(qword_4FBB370[2]) is zero (no advanced features enabled), the tier handler allocates a new integer with value 6 and stores it via sub_16D40E0. The value 6 (binary 110) enables two feature gates used by later passes: barrier optimization and memory-space optimization. This means Tier 3 (O3) automatically enables optimization features that lower tiers leave disabled, without requiring explicit CLI flags.
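A minimal sketch of the two-bit gate encoding, assuming the two feature bits occupy positions 1 and 2 (inferred from "6 (binary 110) enables two feature gates"; the enum names are hypothetical, not recovered symbols):

```c
#include <assert.h>

/* Value 6 = 0b110: two gate bits set, bit 0 left clear.
 * Bit assignments are assumptions based on the escalation description. */
enum {
    FEAT_BARRIER_OPT  = 1u << 1,   /* 0b010: barrier optimization      */
    FEAT_MEMSPACE_OPT = 1u << 2,   /* 0b100: memory-space optimization */
};

int barrier_opt_enabled(unsigned flags)  { return (flags & FEAT_BARRIER_OPT)  != 0; }
int memspace_opt_enabled(unsigned flags) { return (flags & FEAT_MEMSPACE_OPT) != 0; }
```

Under this reading, a Tier 3 store of 6 enables both gates at once, while lower tiers leave the field at 0 and both gates closed.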
O-Level Pipeline Comparison
Pipeline Selection
The new-PM driver sub_226C400 selects pipeline name strings based on config flags:
byte[888] set → "nvopt<O0>"
byte[928] set → "nvopt<O1>"
byte[968] set → "nvopt<O2>"
byte[1008] set → "nvopt<O3>"
These strings are passed to sub_2277440 (the new-PM text pipeline parser). The nvopt prefix is registered as a pipeline element in both sub_225D540 (new PM) and sub_12C35D0 (legacy PM), with vtables at 0x4A08350 and 0x49E6A58 respectively.
O0: No Optimization
O0 skips the full pipeline entirely. The code falls through to LABEL_159 which calls only sub_1C8A4D0(0) (NVVMFinalCleanup), then proceeds directly to finalization. No Tier 0/1/2/3 sub-pipelines fire. The result is ~5-8 passes total: TargetLibraryInfo, TargetTransformInfo, Verifier, AssumptionCache, ProfileSummary, NVVMFinalCleanup, and codegen setup.
O1/O2/O3: Full Pipeline with Tier Differentiation
All three levels call sub_12DE330 for the same ~40-pass Tier 0 sub-pipeline. The differences manifest through four mechanisms:
1. Tier sub-pipeline gating. sub_12DE8F0 is called with the tier number corresponding to the O-level. O1 gets tier=1 (conservative, skips several passes). O2 gets tier=2 (full set). O3 gets tier=3 (aggressive + feature flag escalation).
2. CGSCC iteration counts. The CGSCC pass manager wrapper sub_1A62BF0 takes an iteration count as its first argument. In the O1/O2/O3 base pipeline, it is called with 1 (single inliner pass). In the "mid" fast-compile path, it is called with 5 iterations. In the default path, it varies from 1 to 8 depending on pipeline position, allowing more aggressive devirtualization and inlining at higher optimization levels.
3. Loop unroll factor. sub_1833EB0 is called with factor 3 in the standard pipeline. Tier 3 adds an additional call to TailCallElim and more aggressive LoopUnswitch parameters (the sub_19B73C0 call receives SM-arch-dependent vector widths at Tier 2/3).
4. Vectorizer parameters. sub_19B73C0 receives different arguments based on tier:
- Tier 0: (2, -1, -1, -1, -1, -1, -1) -- conservative vector width 2, all thresholds unlimited
- "mid" path: (3, -1, -1, 0, 0, -1, 0) -- vector width 3, some thresholds zeroed (disabled)
- Tier 2/3: parameters vary by SM architecture via config struct lookups
Fast-Compile Levels vs O-Levels
| Pipeline | Entry Path | Passes | LSA | MemSpaceOpt | Key Difference |
|---|---|---|---|---|---|
| nvopt<O0> | LABEL_159 | ~5-8 | off | off | No optimization |
| nvopt<Ofcmax> | LABEL_196 | ~12-15 | forced 0 | forced 0 | Sinking2(fast) + minimal canonicalization |
| nvopt<Ofcmid> | LABEL_297 | ~25-30 | normal | enabled | CGSCC(5), LoopVectorize(conservative) |
| nvopt<Ofcmin> | LABEL_297 | ~30-35 | normal | enabled | Like Ofcmid but more aggressive loop settings |
| nvopt<O1> | sub_12DE330 | ~35 | normal | enabled | Tier 1: conservative set |
| nvopt<O2> | sub_12DE330 | ~35+ | normal | enabled | Tier 2: full optimization set |
| nvopt<O3> | sub_12DE330 | ~35+ | normal | enabled | Tier 3: aggressive + feature escalation |
Ofcmax is architecturally distinct: it forces -lsa-opt=0 and -memory-space-opt=0 in the optimizer flags (confirmed in both sub_9624D0 line 1358 and sub_12CC750 line 2025). This means two of NVIDIA's most important proprietary passes -- LSA optimization and MemorySpaceOpt -- are unconditionally disabled regardless of what the user requests.
Pipeline Text Strings and nvopt<> Dispatch
The nvopt<> Naming Convention
NVIDIA replaces LLVM's standard default<O2> pipeline naming with a proprietary nvopt<> prefix. The new-PM driver sub_226C400 (35KB, at 0x226C400) selects one of exactly seven pipeline name strings based on optimization level and fast-compile flags. These strings are passed verbatim to sub_2277440 (60KB, at 0x2277440) -- NVIDIA's equivalent of LLVM's PassBuilder::buildDefaultPipeline().
nvopt<O0> Optimization disabled. ~5-8 infrastructure passes only.
nvopt<O1> Standard optimization, Tier 1 (conservative).
nvopt<O2> Standard optimization, Tier 2 (full).
nvopt<O3> Standard optimization, Tier 3 (aggressive + feature escalation).
nvopt<Ofcmax> Fast-compile maximum speed. Forces -lsa-opt=0, -memory-space-opt=0.
nvopt<Ofcmid> Fast-compile medium. MemorySpaceOpt enabled, CGSCC(5) iterations.
nvopt<Ofcmin> Fast-compile minimum. Like Ofcmid but more aggressive loop settings.
Selection Algorithm (sub_226C400)
The config struct encodes O-level flags at fixed byte offsets. The fast-compile level string (if present) is at qwords 131/132 (offset 1048/1056), encoded as a 3-byte sequence compared via 2-byte word + 1-byte suffix:
// sub_226C400, lines 828-874 (pseudocode)
char* select_pipeline_name(Config* cfg) {
if (cfg->byte[928]) return "nvopt<O1>"; // 9 chars
if (cfg->byte[968]) return "nvopt<O2>"; // 9 chars
if (cfg->byte[1008]) return "nvopt<O3>"; // 9 chars
char* fc = cfg->qword[131];
int fc_len = cfg->qword[132];
if (fc_len == 3) {
// Word comparison: *(uint16_t*)fc, then byte fc[2]
if (*(uint16_t*)fc == 24941 && fc[2] == 120) // 0x616D = "ma" (little-endian), 'x'
return "nvopt<Ofcmax>"; // 14 chars
if (*(uint16_t*)fc == 26989 && fc[2] == 100) // 0x696D = "mi" (little-endian), 'd'
return "nvopt<Ofcmid>"; // 14 chars
if (*(uint16_t*)fc == 26989 && fc[2] == 110) // 0x696D = "mi" (little-endian), 'n'
return "nvopt<Ofcmin>"; // 14 chars
}
return "nvopt<O0>"; // 9 chars
}
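The word constants in the decompiled comparison can be verified directly on a little-endian host (as the x86-64 binary assumes). This small check recomputes them; `word_of` is a helper written for this illustration, not a recovered function:

```c
#include <stdint.h>
#include <string.h>

/* Reads the first two bytes of a string as a little-endian uint16_t,
 * mirroring the *(uint16_t*)fc access in the decompilation.
 * memcpy avoids unaligned-access and strict-aliasing pitfalls. */
static uint16_t word_of(const char *s) {
    uint16_t w;
    memcpy(&w, s, 2);
    return w;
}
```

On x86-64, `word_of("max")` is 0x616D = 24941 and `word_of("mid")` and `word_of("min")` are both 0x696D = 26989, which is why the third byte (120 = 'x', 100 = 'd', 110 = 'n') is needed to disambiguate.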
The nvopt prefix is registered as a pipeline element in sub_225D540 (new PM, vtable 0x4A08350) and sub_12C35D0 (legacy PM, vtable 0x49E6A58). Both route into an nvopt pipeline builder class that creates a 512-byte pipeline object via sub_12EC960.
Mutual Exclusion
Combining -O# with --passes= or --foo-pass is an error:
Cannot specify -O#/-Ofast-compile=<min,mid,max> and --passes=/--foo-pass,
use -passes='default<O#>,other-pass' or -passes='default<Ofcmax>,other-pass'
Pipeline Text Parser (sub_2277440)
sub_2277440 (60KB) is the new-PM buildDefaultPipeline() equivalent. It tokenizes the pipeline name string via sub_2352D90, then dispatches to the appropriate pipeline builder based on the nvopt<> parameter. NVIDIA custom passes are injected via extension point callbacks at [PassBuilder+2208] (stride 32 bytes per entry, count at [PassBuilder+2216]). Each callback entry has a guard pointer at [+16] and a callback function at [+24].
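The recovered offsets imply a fixed layout for each extension-point callback entry. The struct below is an assumed reconstruction from those offsets alone (32-byte stride, guard at +16, callback at +24); the field names and the contents of the first 16 bytes are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one callback entry in the array at [PassBuilder+2208],
 * count at [PassBuilder+2216]. Only the +16/+24 slots are documented by
 * the analysis; the leading 16 bytes are opaque here. */
typedef struct {
    uint64_t reserved0;           /* +0:  unknown contents */
    uint64_t reserved1;           /* +8:  unknown contents */
    void    *guard;               /* +16: guard pointer */
    void   (*callback)(void *);   /* +24: callback function */
} ExtensionEntry;                 /* sizeof == 32, matching the stride */
```

Iterating the array is then pointer arithmetic over `ExtensionEntry[count]`, invoking `callback` only when `guard` permits.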
Fast-Compile Level Encoding
In the libnvvm config struct, offset 1640 holds an integer encoding:
| Value | CLI Source | Pipeline Name | Notes |
|---|---|---|---|
| 0 | (no -Ofast-compile) | normal O-level | Default |
| 1 | -Ofast-compile=0 | reset to 0 | Treated as "off" |
| 2 | -Ofc=max | nvopt<Ofcmax> | Forces -lsa-opt=0, -memory-space-opt=0 |
| 3 | -Ofc=mid | nvopt<Ofcmid> | MemorySpaceOpt enabled |
| 4 | -Ofc=min | nvopt<Ofcmin> | Closest to full optimization |
Any other value produces: "libnvvm : error: -Ofast-compile called with unsupported level, only supports 0, min, mid, or max".
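The table above reduces to a small switch. This is a sketch of that mapping under the documented encoding; the function name is invented for illustration and does not exist in the binary:

```c
#include <string.h>

/* Maps the integer at libnvvm config offset 1640 to its pipeline name.
 * Values 0 and 1 fall back to normal O-level selection (returned as NULL);
 * anything above 4 is the documented unsupported-level error. */
const char *fast_compile_pipeline(int level) {
    switch (level) {
        case 0:                              /* no -Ofast-compile */
        case 1:  return 0;                   /* -Ofast-compile=0, treated as off */
        case 2:  return "nvopt<Ofcmax>";
        case 3:  return "nvopt<Ofcmid>";
        case 4:  return "nvopt<Ofcmin>";
        default: return "error";             /* unsupported level */
    }
}
```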
Pass Registration Architecture
Dual Pass Manager Support
cicc v13.0 maintains registrations for both the Legacy Pass Manager and the New Pass Manager simultaneously. This dual support is necessary during the LLVM Legacy-to-New PM migration. The Legacy PM path is taken when a4[4384] != 0 (the fast-compile/bypass flag), while the New PM path handles normal compilation.
Legacy PM registration occurs in pass constructor functions scattered throughout the binary. For example, MemorySpaceOpt registers as "memory-space-opt-pass" via sub_1C97F80. Each Legacy PM pass calls RegisterPass<> with a pass ID and description string.
New PM registration is centralized in sub_2342890 -- a single 2,816-line function that registers every analysis, pass, and printer. It calls sub_E41FB0(pm, class_name, len, pass_name, len) for each pass, inserting into a StringMap with open-addressing and linear probing.
New PM Registration Structure
sub_2342890 registers passes in a strict ordering by pipeline level:
| Section | Lines | Count | Content |
|---|---|---|---|
| Module analyses | 514-596 | ~18 | CallGraph, ProfileSummary, LazyCallGraph, etc. |
| Module passes | 599-1153 | ~95 | AlwaysInline, GlobalOpt, NVIDIA module passes |
| CGSCC analyses | 1155-1163 | ~5 | FunctionAnalysisManagerCGSCC, etc. |
| CGSCC passes | 1170-1206 | ~15 | Inliner, Attributor, ArgumentPromotion |
| Function analyses | 1208-1415 | ~65 | DominatorTree, LoopInfo, MemorySSA, rpa, merge-sets |
| Function passes | 1420-2319 | ~185 | SROA, GVN, LICM, all NVIDIA function passes |
| LoopNest passes | 2320-2339 | ~8 | LoopInterchange, LoopFlatten |
| Loop analyses | 2340-2362 | ~10 | LoopAccessAnalysis, IVUsers |
| Loop passes | 2367-2482 | ~40 | IndVarSimplify, LICM, LoopUnroll, loop-index-split |
| Machine analyses | 2483-2580 | ~30 | LiveIntervals, SlotIndexes |
| Machine passes | 2581-2815 | ~80 | ExpandPostRAPseudos, BranchFolding |
Parameterized Pass Parsing
When the pipeline text parser encounters a pass name with angle-bracket parameters (e.g., memory-space-opt<first-time;warnings>), a registered callback parses the parameter string. The parsing flow:
1. sub_2337DE0 matches the pass name via a starts_with comparison
2. sub_234CEE0 extracts the <...> parameter string
3. The parameter-parsing callback (e.g., sub_23331A0 for MemorySpaceOpt) is invoked
4. The parser splits on ; and matches each token against known parameter names
5. A configured pass options struct is returned and used to construct the pass
For MemorySpaceOpt, the parameter parser (sub_23331A0) recognizes four tokens:
| Token | Length | Effect |
|---|---|---|
| first-time | 10 | Sets first_time = true (default) |
| second-time | 11 | Sets first_time = false |
| warnings | 8 | Enables address-space warnings |
| no-warnings | 11 | Disables warnings |
Invalid parameters produce: "invalid MemorySpaceOpt pass parameter '{0}'".
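A minimal sketch of the four-token parser, assuming a simple options struct (the struct and function names are hypothetical; the real sub_23331A0 returns its options through LLVM's Expected machinery):

```c
#include <string.h>

/* Hypothetical options mirroring the four recognized tokens. */
typedef struct { int first_time; int warnings; int ok; } MSOOpts;

/* Splits a "first-time;warnings"-style parameter string on ';' and
 * matches each token, as the MemorySpaceOpt parser is described to do. */
MSOOpts parse_mso_params(const char *params) {
    MSOOpts o = { 1, 0, 1 };              /* first-time is the default */
    char buf[128];
    strncpy(buf, params, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *tok = strtok(buf, ";"); tok; tok = strtok(NULL, ";")) {
        if      (strcmp(tok, "first-time")  == 0) o.first_time = 1;
        else if (strcmp(tok, "second-time") == 0) o.first_time = 0;
        else if (strcmp(tok, "warnings")    == 0) o.warnings   = 1;
        else if (strcmp(tok, "no-warnings") == 0) o.warnings   = 0;
        else o.ok = 0;                    /* -> "invalid ... pass parameter" */
    }
    return o;
}
```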
Pass Serialization
Each parameterized NVIDIA pass also registers a serializer for pipeline text output (used by --print-pipeline-passes). The serializers write the pass class name followed by the current parameter state:
| Pass | Serializer | Output Format |
|---|---|---|
| MemorySpaceOpt | sub_2CE0440 | MemorySpaceOptPass]<first-time;...> |
| BranchDist | sub_2311040 | BranchDistPass] |
| Sinking2 | sub_2315E20 | llvm::Sinking2Pass] |
| Remat | sub_2311820 | RematerializationPass] |
| NVVMPeephole | sub_2314DA0 | NVVMPeepholeOptimizerPass] |
| LoopIndexSplit | sub_2312380 | LoopIndexSplitPass] |
Pipeline Construction Flow
The AddPass Mechanism -- sub_12DE0B0
| Address | 0x12DE0B0 |
| Size | 3,458 bytes |
| Signature | int64 sub_12DE0B0(int64 passMgr, int64 passObj, uint8 flags, char barrier) |
| Call count | ~137 direct calls from sub_12E54A0, ~40 from sub_12DE330, ~50+ per tier |
sub_12DE0B0 is the sole entry point for adding passes to the pipeline. Every pass factory call in the entire pipeline assembler funnels through this function. It performs three operations atomically: hash-table insertion for O(1) lookup, flag encoding for the pass scheduler, and append to the ordered pass array.
// Detailed pseudocode for sub_12DE0B0
int64 AddPass(PassManager* PM, Pass* pass, uint8_t flags, char barrier) {
// --- Step 1: Hash the pass pointer ---
// Uses a custom shift-XOR hash, NOT a standard hash function.
// The two shifts (9 and 4) spread pointer bits across the table.
uint64_t hash = ((uint64_t)pass >> 9) ^ ((uint64_t)pass >> 4);
// --- Step 2: Open-addressing insert into hash table at PM+80 ---
// The hash table is a flat array of 16-byte entries at PM+80:
// [+0] uint64 pass_pointer (0 = empty slot)
// [+8] uint8 combined_flags
// Table capacity is stored at PM+72 (initial: derived from 0x800000000 mask).
// Collision resolution: linear probing with step 1.
uint8_t combined = flags | (barrier ? 2 : 0);
// Bit 0 (0x01): 1 = FunctionPass, 0 = ModulePass/AnalysisPass
// Bit 1 (0x02): 1 = barrier (scheduling fence)
// Remaining bits: reserved
size_t capacity = PM->ht_capacity; // at PM+72
size_t idx = hash & (capacity - 1); // power-of-2 masking
Entry* table = (Entry*)(PM + 80);
while (table[idx].pass != 0) {
if (table[idx].pass == pass) {
// Pass already inserted -- update flags only
table[idx].flags = combined;
return 0; // dedup: no second insertion
}
idx = (idx + 1) & (capacity - 1); // linear probe
}
table[idx].pass = pass;
table[idx].flags = combined;
// --- Step 3: Append to ordered pass array at PM[0] ---
// PM[0] = pointer to dynamic array of 8-byte pass pointers
// PM[1] = count of passes (PM+8)
// Growth: geometric reallocation (not shown here)
uint64_t* array = (uint64_t*)PM->passes; // PM[0]
array[PM->count] = (uint64_t)pass;
PM->count++; // PM+8
return 0;
}
The flags parameter encodes the pass type: 0 for module/analysis passes, 1 for function passes. The barrier parameter (bit 1) is a scheduling fence that tells the pass manager all preceding passes must complete before this pass runs -- used for passes that require the module in a globally consistent state (e.g., after whole-module inlining).
The hash table serves two purposes: (a) deduplication -- if the same pass factory is called twice (which happens for NVVMReflect, NVVMIntrinsicLowering, etc.), the second call updates flags rather than inserting a duplicate; and (b) O(1) flag lookup during the codegen dispatch phase (sub_12DFE00), where each pass's type and barrier status must be queried efficiently.
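The dedup behavior can be demonstrated with a standalone model of the table. This sketch uses the recovered shift-XOR hash and linear probing; the capacity of 16 is illustrative (the real table starts from inline stack storage, as noted below):

```c
#include <stdint.h>
#include <stddef.h>

#define CAP 16   /* illustrative power-of-2 capacity */
typedef struct { uint64_t pass; uint8_t flags; } Slot;

/* Returns 1 on a fresh insert, 0 when an existing entry's flags were
 * updated instead -- the dedup path taken when a pass factory address
 * is added a second time. Table slots must start zeroed. */
int add_pass(Slot table[CAP], uint64_t pass, uint8_t flags) {
    uint64_t h = (pass >> 9) ^ (pass >> 4);   /* recovered shift-XOR hash */
    size_t idx = h & (CAP - 1);               /* power-of-2 masking */
    while (table[idx].pass != 0) {
        if (table[idx].pass == pass) {
            table[idx].flags = flags;         /* update, no duplicate */
            return 0;
        }
        idx = (idx + 1) & (CAP - 1);          /* linear probe, step 1 */
    }
    table[idx].pass = pass;
    table[idx].flags = flags;
    return 1;
}
```

Inserting the same factory address twice (as happens for NVVMReflect or NVVMIntrinsicLowering) takes the update path on the second call, so the ordered pass array would only be appended once per unique pass object.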
The pass manager container is initialized at line 390 of sub_12E54A0 with inline storage: v270 = v272 (stack buffer), v271 = 0x800000000 (capacity/flags encoding with 33-bit sentinel).
Complete 8-Phase Construction Algorithm
The full pipeline construction in sub_12E54A0 proceeds through eight phases. The pseudocode below is reconstructed from the decompiled 49.8KB function at lines 300-757 of the decompilation output. All a4 offsets refer to the CompilerOptions struct (parameter 4, ~4500 bytes).
Phase 0: Infrastructure (lines 396-420, always runs)
// Phase 0: Analysis infrastructure required by all subsequent passes
#01 TLI = sub_149CCE0(malloc(368), sub_14A04B0(triple));
AddPass(PM, TLI, 0, 0); // TargetLibraryInfoWrapperPass [Module]
#02 TTI = sub_1BFB520(malloc(208), sub_1BFB9A0(dataLayout));
AddPass(PM, TTI, 1, 0); // TargetTransformInfoWrapperPass [Function]
#03 verifier = sub_14A7550();
AddPass(PM, verifier, 0, 0); // VerifierPass / BasicAliasAnalysis [Module]
#04 assumptions = sub_1361950();
AddPass(PM, assumptions, 0, 0); // AssumptionCacheTracker [Module]
#05 profile = sub_1CB0F50();
AddPass(PM, profile, 1, 0); // ProfileSummaryInfoWrapperPass [Function]
NVIDIA always adds these five analysis passes first, in this exact order, regardless of optimization level, language, or fast-compile mode; upstream LLVM has no equivalent fixed initialization ordering.
Phase 1: Language Dispatch (lines 421-488)
Phase 1 reads the language string at a4[3648] (pointer) with length at a4[3656]. Three language paths exist; each produces a fundamentally different pass sequence. See the Language Path Differences section below for the complete per-path pass lists.
// Phase 1: Language-based pipeline branching
char* lang = *(char**)(a4 + 3648);
int lang_len = *(int*)(a4 + 3656);
bool opt_enabled = *(bool*)(a4 + 4224);
bool fc_max = false, fc_mid = false;
int v238 = *(int*)(a4 + 4304); // device-code / additional-opt flag
if (lang_len == 3) {
uint16_t w = *(uint16_t*)lang;
if (w == 0x7470 && lang[2] == 0x78) { // "ptx"
goto PATH_A_PTX;
}
if (w == 0x696D && lang[2] == 0x64) { // "mid"
goto PATH_B_MID;
}
// "min" (w == 0x696D && lang[2] == 0x6E) shares the default path
}
// Fall through to PATH_C_DEFAULT
// Fast-compile dispatch (within the language check):
// fc="max" AND !v238 → v244=1, v238=1, goto LABEL_191 (minimal + O0)
// fc="max" AND v238 → goto LABEL_196 → LABEL_188 (Sinking2 + common)
// fc="mid" → goto LABEL_297 (mid pipeline)
// fc="min" → goto LABEL_297 (min pipeline, differs via v238)
// no fc, no O-level → LABEL_159 (O0 minimal pipeline)
// O-level set → LABEL_38 → LABEL_39 (process pass list + tiers)
Phase 2: Pre-Optimization (lines 442-480)
Only when optimization is not completely skipped. Each pass is gated by a per-pass disable flag in the NVVMPassOptions struct.
// Phase 2: Early passes before the main optimization loop
if (!a4[1960] || a4[3000]) // not disabled OR extra trigger
AddPass(PM, sub_1857160(), 1, 0); // NVVMReflect
if (a4[3000]) // extra DeadArgElim trigger
AddPass(PM, sub_18FD350(0), 1, 0); // DeadArgElimination
if (!a4[1680]) // NVIDIA pass not disabled
AddPass(PM, sub_19CE990(), 1, 0); // LoopStrengthReduce (NVIDIA)
AddPass(PM, sub_1CB4E40(0), 1, 0); // NVVMIntrinsicLowering(level=0)
if (!a4[2040])
AddPass(PM, sub_1B26330(), 1, 0); // MemCpyOpt
AddPass(PM, sub_12D4560(), 1, 0); // NVVMVerifier
if (!a4[1960])
AddPass(PM, sub_184CD60(), 1, 0); // ConstantMerge
if (!a4[440] && !a4[400])
AddPass(PM, sub_1C4B6F0(), 1, 0); // AlwaysInliner
if (a4[3160]) // debug dump enabled
AddPass(PM, sub_17060B0(1, 0), 1, 0); // PrintModulePass
Phase 3: Main Optimization Loop (lines 481-553)
The tier-threshold-driven loop iterates over the plugin/external pass array at a4[4488]. Each entry is 16 bytes (vtable pointer + phase_id). When a threshold is crossed, the corresponding tier sub-pipeline fires once and never again.
// Phase 3: Tier dispatch within the main plugin pass loop
uint64_t* entry = *(uint64_t**)(a4 + 4488);
uint64_t* end = *(uint64_t**)(a4 + 4496);
while (entry < end) {
int phase_id = *(int*)((char*)entry + 8);
// Tier 0: full optimization sub-pipeline
if (*(bool*)(a4+4224) && phase_id > *(int*)(a4+4228)) {
sub_12DE330(PM, opts); // ~40 passes
*(bool*)(a4+4224) = false; // fire once
}
// Tier 1: conservative
if (*(bool*)(a4+3528) && phase_id > *(int*)(a4+3532)) {
sub_12DE8F0(PM, 1, opts);
*(bool*)(a4+3528) = false;
}
// Tier 2: full
if (*(bool*)(a4+3568) && phase_id > *(int*)(a4+3572)) {
sub_12DE8F0(PM, 2, opts);
*(bool*)(a4+3568) = false;
}
// Tier 3: aggressive
if (*(bool*)(a4+3608) && phase_id > *(int*)(a4+3612)) {
sub_12DE8F0(PM, 3, opts);
*(bool*)(a4+3608) = false;
}
// Insert the plugin/external pass itself
Pass* plugin = vtable_call(entry, +72); // entry->createPass()
AddPass(PM, plugin, 1, 0);
// Optional debug verification after each plugin pass
if (a4[3904]) {
sub_12D3E60(); // insert verification/print pass
sub_16E8CB0();
sub_15E9F00();
}
entry = (uint64_t*)((char*)entry + 16); // next entry (16-byte stride)
}
// Any tier that didn't fire during the loop fires unconditionally now
if (*(bool*)(a4+4224)) sub_12DE330(PM, opts);
if (*(bool*)(a4+3528)) sub_12DE8F0(PM, 1, opts);
if (*(bool*)(a4+3568)) sub_12DE8F0(PM, 2, opts);
if (*(bool*)(a4+3608)) sub_12DE8F0(PM, 3, opts);
Phase 4: Post-Optimization Language Paths (lines 580-1371)
After the main loop, language-specific post-optimization runs. This is where the three paths diverge most significantly. Each path ends by falling through to LABEL_84 (Phase 5). See Language Path Differences for complete pass lists per path.
Phase 5: Finalization (LABEL_84, lines 640-653)
Always runs after the language-specific optimization path completes.
// Phase 5: Finalization -- barriers, cleanup, codegen
if (!v244 && a4[3488]) // barrier optimization enabled
AddPass(PM, sub_1C98160(a4[2920]!=0), 1, 0); // NVVMLowerBarriers
AddPass(PM, sub_1CEBD10(), 1, 0); // NVVMFinalLowering (cleanup)
if (!a4[2800] && !a4[4464]) // late CFG cleanup not disabled
AddPass(PM, sub_1654860(1), 1, 0); // BreakCriticalEdges
sub_12DFE00(PM, subtargetInfo, a4); // Codegen dispatch (see below)
Phase 6: Phase II Codegen Check (lines 654-693)
Reads the compilation phase counter and conditionally enters a special codegen extension block for multi-phase compilation.
// Phase 6: Phase-II-specific codegen extensions
int phase = sub_16D40F0(qword_4FBB3B0); // read cl::opt<int> phase counter
if (phase == 2 && (*(int*)(a4 + 4480) & 4)) {
// Enter special Phase II codegen block
// Calls vtable at v245+56 (TargetMachine::addPhaseIIPasses)
// Passes SubtargetInfo (v253) and CodeGenOpt config (v262)
target_machine->addPhaseIIPasses(subtarget, codegen_config);
}
Phase 7: Pipeline Execution (lines 694-698)
// Phase 7: Run the assembled pipeline
sub_160FB70(PM, *output, output[1]); // PassManager::run(Module, outputs)
sub_1619BD0(PM, module); // PassManager::finalize(Module)
free(v274); // cleanup allocations
sub_160FE50(PM); // PassManager::destroy()
Phase 8: Basic Block Naming (lines 700-757)
Only when a4[3944] (debug/naming mode) is set. Produces deterministic block names for debugging.
// Phase 8: Debug block naming for IR dump readability
if (a4[3944]) {
int funcIdx = 0;
for (Function* F = module->functions; F; F = F->next) {
if (sub_15E4F60(F)) continue; // skip declarations
funcIdx++;
int blockIdx = 0;
for (BasicBlock* BB = F->blocks; BB; BB = BB->next) {
blockIdx++;
char name[32];
sprintf(name, "F%d_B%d", funcIdx, blockIdx);
sub_164B780(BB, &name); // BB->setName()
}
}
}
Language Path Differences
The three language paths in Phase 1/4 represent fundamentally different IR maturity levels. The a4[3648] string pointer determines which path is taken, with length at a4[3656].
Path A: "ptx" -- Light Pipeline (~15 passes)
PTX text input has already been lowered by an earlier compilation stage. This path applies only light cleanup and canonicalization:
sub_1CEF8F0() NVVMPeephole
sub_215D9D0() NVVMAnnotationsProcessor
sub_1857160() [!a4[880]] NVVMReflect
sub_1A62BF0(1,0,0,1,0,0,1) LLVM standard pipeline #1
sub_1B26330() [!a4[2040]] MemCpyOpt
sub_17060B0(0,0) PrintModulePass (debug)
sub_18DEFF0() [!a4[280]] DCE
sub_1A62BF0(1,0,0,1,0,0,1) LLVM standard pipeline #1 (repeat)
sub_18B1DE0() [!a4[2640]] LoopPass / BarrierOpt
sub_1C8E680(0) [!a4[1760]] MemorySpaceOptimization
--> LABEL_84 (finalization)
Key difference: no SROA, no GVN, no loop transformations, no CGSCC inlining. The PTX path trusts that the earlier compilation already optimized the code.
Path B: "mid" -- Full Optimization (~45 passes)
The primary path for standard CUDA compilation. The IR comes from the EDG frontend through IR generation and is at "mid-level" maturity (high-level constructs lowered, but not yet optimized).
sub_184CD60() [!a4[1960]] ConstantMerge
sub_1CB4E40(0) [!a4[2000]] NVVMIntrinsicLowering (1st of 4)
sub_1B26330() [!a4[2040]] MemCpyOpt
sub_198E2A0() SROA
sub_1CEF8F0() NVVMPeephole
sub_215D9D0() NVVMAnnotationsProcessor
sub_198DF00(-1)[!a4[1520]] LoopSimplify
sub_1C6E800() GVN
sub_1A223D0() [!a4[2600]] NVVMIRVerification (1st of 5+)
sub_190BB10(0,0) SimplifyCFG
sub_1832270(1) InstructionCombining
sub_1A62BF0(5,0,0,1,0,0,1) CGSCC pipeline (5 iterations)
sub_1CB4E40(0) [!a4[2000]] NVVMIntrinsicLowering (2nd)
sub_18FD350(0) DeadArgElim
sub_1841180() [!a4[680]] FunctionAttrs
sub_18DEFF0() [!a4[280]] DCE
sub_184CD60() [!a4[1960]] ConstantMerge
sub_195E880(0) [!a4[1240]] LICM
sub_1C98160(0) NVVMLowerBarriers
sub_1C8E680(0) [!a4[1760]] MemorySpaceOpt (1st invocation)
sub_1B7FDF0(3) [!a4[1280]] Reassociate
sub_1A62BF0(8,0,0,1,1,0,1) CGSCC pipeline (8 iterations)
sub_1857160() [!a4[880]] NVVMReflect (2nd of 3)
sub_1C6FCA0() [!a4[1840]] ADCE
sub_1A7A9F0() [!a4[2720]] InstructionSimplify
sub_18FD350(0) DeadArgElim
sub_1833EB0(3) [!a4[320]] TailCallElim
sub_18FD350(0) DeadArgElim
sub_18EEA90() CorrelatedValuePropagation
sub_1869C50(1,0,1) Sink (MemorySSA-based)
sub_190BB10(0,0)[!a4[960]] SimplifyCFG
sub_18F5480() [!a4[760]] DSE
sub_1CC60B0() [!a4[2440]] NVVMSinking2
sub_1A223D0() [!a4[2600]] NVVMIRVerification
sub_1C8A4D0(0) EarlyCSE
sub_1857160() [!a4[880]] NVVMReflect (3rd)
sub_1A62BF0(8,0,0,1,1,0,1) CGSCC pipeline (8 iterations)
sub_1CB4E40(0) [!a4[2000]] NVVMIntrinsicLowering (3rd)
sub_185D600() [!a4[920]] IPConstPropagation
sub_195E880(0) [!a4[1240]] LICM
sub_1CB4E40(0) [!a4[2000]] NVVMIntrinsicLowering (4th)
sub_1CB73C0() [!a4[2120]] NVVMBranchDist
sub_1A13320() [!a4[2320]] NVVMRematerialization
--> LABEL_84 (finalization)
Key pattern: NVVMIntrinsicLowering runs 4 times, NVVMReflect runs 3 times, NVVMIRVerification runs 5+ times. The CGSCC pipeline is called with 5 and 8 iteration counts (aggressive devirtualization).
Path C: Default -- General Pipeline (~40 passes)
Used for bitcode from external sources (not marked as "ptx" or "mid"). Balances optimization breadth with conservative assumptions about IR maturity.
sub_1A62BF0(4,0,0,1,0,0,1) LLVM standard pipeline #4
sub_1857160() [!a4[880]] NVVMReflect (1st)
sub_1CB4E40(0) [!a4[2000]] NVVMIntrinsicLowering
sub_1857160() [!a4[880]] NVVMReflect (2nd)
sub_1CEF8F0() NVVMPeephole
sub_215D9D0() NVVMAnnotationsProcessor
sub_1A7A9F0() [!a4[2720]] InstructionSimplify
sub_1A62BF0(5,0,0,1,0,0,1) LLVM standard pipeline #5
sub_185D600() [!a4[920]] IPConstPropagation
sub_1B26330() [!a4[2040]] MemCpyOpt
sub_184CD60() [!a4[1960]] ConstantMerge
sub_1A13320() [!a4[2320]] NVVMRematerialization
sub_1833EB0(3) [!a4[320]] TailCallElim
sub_1C6E800() GVN
sub_1842BC0() [!a4[720]] SCCP
sub_18DEFF0() [!a4[280]] DCE
sub_184CD60() [!a4[1960]] ConstantMerge
sub_18FD350(0) DeadArgElim
sub_18EEA90() CorrelatedValuePropagation
sub_1A62BF0(1,0,0,1,0,0,1) LLVM standard pipeline #1
sub_197E720() LoopUnroll
sub_19401A0() [!a4[1000]] InstCombine
sub_1857160() [!a4[880]] NVVMReflect (3rd)
sub_1A62BF0(7,0,0,1,0,0,1) LLVM standard pipeline #7
sub_1C8A4D0(0) EarlyCSE
sub_1A223D0() [!a4[2600]] NVVMIRVerification
sub_1832270(1) InstructionCombining
sub_1869C50(1,0,1) Sink
sub_1A68E70() LoopIdiomRecognize
sub_198DF00(-1)[!a4[1520]] LoopSimplify
sub_195E880(0) [!a4[1240]] LICM
sub_190BB10(0,0)[!a4[960]] SimplifyCFG
sub_19B73C0(3,-1,-1,0,0,-1,0) LoopUnswitch
sub_1A223D0() [!a4[2600]] NVVMIRVerification
sub_1C98160(0) NVVMLowerBarriers
sub_1C8E680(0) [!a4[1760]] MemorySpaceOpt
sub_1B7FDF0(3) [!a4[1280]] Reassociate
sub_18B1DE0() [!a4[2640]] LoopPass
sub_1952F90(-1)[!a4[1160]] LoopIndexSplit
sub_18FD350(0) DeadArgElim
sub_1CC60B0() [!a4[2440]] NVVMSinking2
sub_1A62BF0(2,0,0,1,0,0,1) LLVM standard pipeline #2
sub_1A223D0() [!a4[2600]] NVVMIRVerification
sub_18A3430() [!a4[1120]] NVVMPredicateOpt
sub_1A62BF0(4,0,0,1,1,0,1) LLVM standard pipeline #4 (inlining)
--> LABEL_84 (finalization)
Key difference from "mid": default path uses LLVM standard pipeline wrappers (IDs 1,2,4,5,7) more heavily, runs SCCP explicitly, includes LoopIdiomRecognize, and uses a conservative LoopUnswitch with zeroed thresholds (3,-1,-1,0,0,-1,0).
Codegen Dispatch -- sub_12DFE00
| Address | 0x12DFE00 |
| Size | 20,729 bytes |
| Signature | int64 sub_12DFE00(int64 passMgr, int64 subtargetInfo, int64 opts) |
| Called from | Phase 5 of sub_12E54A0 (LABEL_84, line 640) |
The codegen dispatch does not simply append passes to the pipeline. It performs a full dependency analysis over every pass already inserted, constructs an ordering graph, and then emits codegen passes in topologically-sorted order. This is necessary because machine-level passes (register allocation, instruction scheduling, frame lowering) have strict ordering dependencies that the flat AddPass model cannot express.
// Pseudocode for sub_12DFE00 (codegen dispatch with dependency analysis)
void CodegenDispatch(PassManager* PM, SubtargetInfo* STI, CompilerOpts* opts) {
// Step 1: Read optimization level to determine analysis depth
int opt_level = *(int*)(opts + 200); // opts[200] = optimization level
bool do_deps = (opt_level > 1); // dependency tracking for O2+
// Step 2: Classify existing passes
// Iterates PM->passes[0..PM->count], calling two vtable methods per pass
HashTable dep_graph; // secondary hash table for dependencies (v134..v137)
init_hashtable(&dep_graph);
for (int i = 0; i < PM->count; i++) {
Pass* p = PM->passes[i];
// 2a. Check if pass is codegen-only (vtable+112)
bool is_codegen = p->vtable->isCodeGenOnly(p); // vtable offset +112
if (is_codegen)
continue; // already classified, skip
// 2b. Check registration status
int status = sub_163A1D0(p); // pass registry check
sub_163A340(p, &status); // update status
// 2c. If pass needs codegen support, mark it in the hash table
if (pass_needs_codegen(p)) {
// Set flag |= 2 in the AddPass hash table entry
// This marks the pass as "codegen-interacting"
Entry* e = hashtable_find(PM + 80, p);
if (e) e->flags |= 2;
}
// 2d. Build dependency edges (getAnalysisUsage)
if (do_deps) {
AnalysisUsage AU;
p->vtable->getAnalysisUsage(p, &AU); // vtable offset +16
// For each required analysis, create an ordering edge
// in the dependency hash table
for (AnalysisID* req = AU.required; req; req = req->next) {
dep_graph_add_edge(&dep_graph, p, req->pass);
}
}
}
// Step 3: Emit codegen passes in dependency-respecting order
// Calls the SubtargetInfo hook to get the ordered codegen pass list
// vtable+16 at STI -> STI->emitCodeGenPasses(PM, dep_graph)
STI->vtable->emitCodeGenPasses(STI, PM, &dep_graph);
// Each emitted pass gets a flag:
// 0 = normal pass (no special ordering)
// 1 = pass with codegen requirement (flag bit 0 from AddPass)
}
The dependency graph construction is what makes this function 20KB: it must handle the full LLVM analysis dependency model, including transitive dependencies and analysis preservation. The getAnalysisUsage calls return Required, RequiredTransitive, and Preserved sets that define the ordering constraints between passes.
For O0 compilation (opt_level == 0), the dependency tracking is skipped entirely -- codegen passes are emitted in a fixed default order since no optimization passes exist that could create ordering conflicts.
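The dependency-respecting emission order described above amounts to a topological sort over the pass/analysis edges collected from getAnalysisUsage. A minimal Python sketch of that ordering step, using the standard library's graph utilities (the pass names and the `requires` map are invented for illustration and do not correspond to cicc internals):

```python
from graphlib import TopologicalSorter

def emit_codegen_order(passes, requires):
    """Order passes so each runs after the analyses it requires.

    `passes` is a list of pass names; `requires` maps a pass name to the
    set of passes/analyses that must be scheduled before it (the edges
    a getAnalysisUsage-style query would produce).
    """
    ts = TopologicalSorter()
    for p in passes:
        ts.add(p, *requires.get(p, ()))
    return list(ts.static_order())

# Hypothetical machine-level ordering constraints:
passes = ["RegAlloc", "LiveIntervals", "SlotIndexes", "PrologEpilog"]
requires = {
    "RegAlloc": {"LiveIntervals"},
    "LiveIntervals": {"SlotIndexes"},
    "PrologEpilog": {"RegAlloc"},
}
order = emit_codegen_order(passes, requires)
# Every dependency precedes its dependent:
assert order.index("SlotIndexes") < order.index("LiveIntervals")
assert order.index("LiveIntervals") < order.index("RegAlloc")
assert order.index("RegAlloc") < order.index("PrologEpilog")
```

The real function additionally handles transitive dependencies and preservation sets, which is where most of its 20 KB goes; the sketch captures only the core ordering idea.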
Pass Iteration and Convergence
CGSCC Fixed-Point Iteration
The CGSCC (Call Graph Strongly Connected Component) pass manager sub_1A62BF0 wraps a standard LLVM InlinerWrapper with a configurable iteration count. The first parameter controls how many times the CGSCC pipeline iterates over the call graph:
| Pipeline Position | Iteration Count | Context |
|---|---|---|
| O1/O2/O3 base (sub_12DE330) | 1 | Standard inlining: one pass over the call graph |
| "mid" path (Ofcmid/Ofcmin) | 5 | Aggressive: 5 iterations to resolve indirect calls |
| Default path (general IR) | 1, 2, 4, 5, 7, or 8 | Varies by position in pipeline |
Higher iteration counts allow the CGSCC framework to resolve more indirect calls through devirtualization. After each iteration, newly-inlined code may expose new call targets, which the next iteration can inline. The diminishing returns typically plateau after 3-5 iterations, which explains NVIDIA's choice of 5 for the "mid" fast-compile path (balancing compile time against code quality).
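The effect of the iteration count can be modeled with a toy simulation: treat each kernel as having a chain of indirect calls, where each CGSCC iteration inlines one level and thereby exposes the next call as a direct, inlinable one. Everything below (function names, depths) is invented to illustrate why deeper chains need more iterations; it is not cicc code:

```python
def run_cgscc(call_chains, iterations):
    """Simulate fixed-count CGSCC iteration.

    `call_chains` maps a function name to the depth of its
    indirect-call chain. Each iteration inlines one level per
    function; a chain is 'resolved' (fully devirtualized) once
    its depth reaches zero. Returns the number resolved.
    """
    resolved = 0
    depth = dict(call_chains)
    for _ in range(iterations):
        for fn in depth:
            if depth[fn] > 0:
                depth[fn] -= 1          # one level inlined this round
                if depth[fn] == 0:
                    resolved += 1
    return resolved

chains = {"kernel_a": 2, "kernel_b": 5, "kernel_c": 7}
assert run_cgscc(chains, 1) == 0    # nothing fully resolved yet
assert run_cgscc(chains, 5) == 2    # the "mid" path's 5 iterations
assert run_cgscc(chains, 8) == 3    # 8 iterations catch the deepest chain
```

The diminishing-returns shape falls out directly: each extra iteration only helps chains deeper than the iterations already spent.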
NVVMReflect Multi-Run Pattern
NVVMReflect (sub_1857160) runs multiple times in the pipeline because NVVM IR may contain __nvvm_reflect("__CUDA_ARCH") calls at different nesting depths. The first run resolves top-level reflect calls to constants. Subsequent optimization passes (inlining, constant propagation, loop unrolling) may expose new reflect calls that were hidden inside inlined functions or unrolled loop bodies. Running NVVMReflect again after these transformations catches these newly-exposed calls.
In the "mid" path, NVVMReflect appears at three distinct positions:
- Early (before GVN) -- resolves top-level architecture queries
- Mid (after CGSCC inlining and DeadArgElim) -- catches reflect calls exposed by inlining
- Late (after LoopSimplify and second CGSCC) -- catches reflect calls exposed by loop transformations
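The exposure mechanism can be shown with a small simulation: a reflect run only rewrites the calls it can currently see, and inlining splices hidden reflect calls into view for the next run. The list-of-instructions model, the constant 750, and the helper names below are all invented for illustration:

```python
ARCH = 750  # stand-in constant for a resolved __CUDA_ARCH query (sm_75)

def nvvm_reflect(body):
    """Replace every currently visible reflect query with a constant."""
    return [ARCH if instr == "reflect(__CUDA_ARCH)" else instr
            for instr in body]

def inline(body, functions):
    """Splice callee bodies into the caller, exposing hidden calls."""
    out = []
    for instr in body:
        out.extend(functions[instr] if instr in functions else [instr])
    return out

functions = {"helper": ["reflect(__CUDA_ARCH)", "use"]}
kernel = ["reflect(__CUDA_ARCH)", "helper"]

kernel = nvvm_reflect(kernel)        # 1st run: resolves the top-level query
assert kernel == [750, "helper"]     # the query inside helper is untouched
kernel = inline(kernel, functions)   # inlining exposes the hidden query
kernel = nvvm_reflect(kernel)        # 2nd run: catches the exposed query
assert kernel == [750, 750, "use"]
```

A single early run would have left the query inside `helper` unresolved, which is exactly why the pass is re-run after each transformation group that can expose new calls.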
NVVMIntrinsicLowering Repetition
Similarly, NVVMIntrinsicLowering (sub_1CB4E40) runs 4 times in the "mid" path. Each invocation lowers a different subset of NVVM intrinsics based on what the preceding optimization passes have simplified. The pass takes a level parameter (0 or 1) that controls which lowering rules are active. Level 0 handles basic intrinsic lowering; level 1 handles barrier-related lowering that only becomes safe after certain control flow transformations.
NVVMIRVerification as a Convergence Check
NVVMIRVerification (sub_1A223D0) runs after every major transformation group -- not for optimization, but as a correctness invariant check. In the "mid" path it appears at 5+ positions. In the tier 1/2/3 sub-pipeline it appears 4 times (after NVVMIntrinsicLowering, after barrier lowering, after GenericToNVVM, and after the late optimization sequence). If any transformation violates NVVM IR constraints (invalid address space usage, malformed intrinsic signatures, broken metadata), this pass reports the error immediately rather than allowing it to propagate to codegen where diagnosis would be much harder.
The Repeat-Until-Clean Philosophy
NVIDIA's pipeline does not use explicit fixed-point loops (run passes until IR stops changing). Instead, it achieves convergence through strategic repetition: the same pass appears at multiple carefully-chosen pipeline positions, with different optimization passes running between repetitions. This is more predictable than a true fixed-point approach because compilation time is bounded by the static pipeline length rather than by how many iterations are needed for convergence. The tradeoff is that the pipeline may not reach a true fixed point -- some optimization opportunities exposed by late passes may not be caught -- but in practice, the multi-position placement catches the vast majority of cases.
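The compile-time tradeoff can be made concrete with a toy pass run both ways: a statically repeated pipeline costs exactly its scheduled length, while a true fixed-point driver costs however many rounds convergence takes (including a final no-change round). The `fold` pass and the driver functions are illustrative inventions, not cicc internals:

```python
def run_static(pipeline, ir):
    """Strategic repetition: cost bounded by the static pipeline length."""
    steps = 0
    for p in pipeline:
        ir = p(ir)
        steps += 1
    return ir, steps

def run_fixed_point(passes, ir):
    """True fixed point: iterate until the IR stops changing."""
    steps = 0
    changed = True
    while changed:
        changed = False
        for p in passes:
            new = p(ir)
            steps += 1
            if new != ir:
                ir, changed = new, True
    return ir, steps

def fold(ir):
    """Toy pass: fold one pair of adjacent constants per run."""
    for i in range(len(ir) - 1):
        if isinstance(ir[i], int) and isinstance(ir[i + 1], int):
            return ir[:i] + [ir[i] + ir[i + 1]] + ir[i + 2:]
    return ir

ir = [1, 2, 3, "use"]
static_ir, static_steps = run_static([fold, fold], ir)  # two scheduled repeats
fp_ir, fp_steps = run_fixed_point([fold], ir)
assert static_ir == fp_ir == [6, "use"]   # same result here...
assert static_steps == 2                  # ...but the cost is known up front
assert fp_steps == 3                      # fixed point pays a no-change round
```

If the input needed three folds, the static schedule of two repeats would miss one — the "may not reach a true fixed point" tradeoff described above.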
LLVM Standard Pass Pipeline Factory -- sub_1A62BF0
The LLVM standard pass pipeline is invoked multiple times throughout the optimizer via sub_1A62BF0. The first parameter is a pipeline ID that selects which LLVM extension point to inject passes at:
| Pipeline ID | LLVM Extension Point | Usage Context |
|---|---|---|
| 1 | EP_EarlyAsPossible / basic cleanup | Tier 0, default path |
| 2 | EP_LoopOptimizerEnd | Default path late |
| 4 | EP_ScalarOptimizerLate | Default path, Tier sub-pipeline |
| 5 | EP_VectorizerStart | "mid" path, default path |
| 7 | EP_OptimizerLast | Default path |
| 8 | EP_CGSCCOptimizerLate | "mid" path (with opt flag = 1 for inlining) |
The signature is sub_1A62BF0(pipelineID, 0, 0, 1, optFlag, 0, 1, outBuf), where optFlag at position 5 enables inlining within the CGSCC sub-pipeline (observed as 1 for pipeline IDs 4 and 8 in the "mid" path: sub_1A62BF0(8,0,0,1,1,0,1)).
Each call potentially returns a cleanup callback stored in v298, invoked as v298[0](s, s, 3) for destructor/finalization. The factory is called 9+ times across the three language paths.
CompilerOptions Struct Flag Map
The a4 parameter to sub_12E54A0 is the ~4,512-byte CompilerOptions struct. The following offsets have been confirmed by cross-referencing guards in the pipeline assembler and tier sub-pipelines.
| Offset | Type | Purpose | Cross-Reference |
|---|---|---|---|
| +200 | int | Optimization level (0-3) | sub_12DFE00 codegen depth |
| +280 | bool | Disable DCE | sub_18DEFF0 guard |
| +320 | bool | Disable TailCallElim | sub_1833EB0 guard |
| +360 | bool | Disable NVVMLateOpt | sub_1C46000 guard |
| +400 | bool | Disable inlining variant A | |
| +440 | bool | Disable inlining variant B | sub_1C4B6F0 guard |
| +480 | bool | Disable inlining variant C | sub_12DE8F0 guard |
| +520 | bool | Disable NVIDIA pass A | sub_1AAC510 guard |
| +560 | bool | Disable NVIDIA pass B | sub_1AAC510 guard |
| +600 | bool | Disable NVVMVerifier | sub_12D4560 guard |
| +680 | bool | Disable FunctionAttrs | sub_1841180 guard |
| +720 | bool | Disable SCCP | sub_1842BC0 guard |
| +760 | bool | Disable DSE | sub_18F5480 guard |
| +880 | bool | Disable NVVMReflect | sub_1857160 guard |
| +920 | bool | Disable IPConstPropagation | sub_185D600 guard |
| +960 | bool | Disable SimplifyCFG | sub_190BB10 guard |
| +1000 | bool | Disable InstCombine | sub_19401A0 guard |
| +1040 | bool | Disable Sink/MemSSA | sub_1869C50 guard |
| +1080 | bool | Disable PrintModulePass | sub_17060B0 guard |
| +1120 | bool | Disable NVVMPredicateOpt | sub_18A3430 guard |
| +1160 | bool | Disable LoopIndexSplit | sub_1952F90 guard |
| +1240 | bool | Disable LICM | sub_195E880 guard |
| +1280 | bool | Disable Reassociate | sub_1B7FDF0 guard |
| +1320 | bool | Disable ADCE variant A | sub_1C76260 guard |
| +1360 | bool | Disable LoopUnroll | sub_19C1680 guard |
| +1400 | bool | Disable SROA | sub_1968390 guard |
| +1440 | bool | Disable EarlyCSE | sub_196A2B0 guard |
| +1520 | bool | Disable LoopSimplify | sub_198DF00 guard |
| +1680 | bool | Disable NVIDIA pass | sub_19CE990 guard |
| +1760 | bool | Disable MemorySpaceOpt | sub_1C8E680 guard |
| +1840 | bool | Disable ADCE C | sub_1C6FCA0 guard |
| +1960 | bool | Disable ConstantMerge | sub_184CD60 guard |
| +2000 | bool | Disable NVVMIntrinsicLowering | sub_1CB4E40 guard |
| +2040 | bool | Disable MemCpyOpt | sub_1B26330 guard |
| +2120 | bool | Disable NVVMBranchDist B | sub_1CB73C0 guard |
| +2200 | bool | Disable GenericToNVVM | sub_1A02540 guard |
| +2320 | bool | Disable NVVMRematerialization | sub_1A13320 guard |
| +2440 | bool | Disable NVVMSinking2 | sub_1CC60B0 guard |
| +2560 | bool | Disable NVVMGenericAddrOpt | sub_1CC71E0 guard |
| +2600 | bool | Disable NVVMIRVerification | sub_1A223D0 guard |
| +2640 | bool | Disable NVVMLoopOpt | sub_18B1DE0 guard |
| +2720 | bool | Disable InstructionSimplify | sub_1A7A9F0 guard |
| +2840 | bool | Enable ADCE (reversed logic) | sub_1C6FCA0 |
| +2880 | bool | Enable LICM (reversed logic) | sub_195E880 |
| +2920 | bool | NVVMLowerBarriers param | sub_1C98160 |
| +3000 | bool | Extra DeadArgElim trigger | sub_18FD350 |
| +3040 | bool | Enable CVP | sub_18EEA90 |
| +3080 | bool | Enable NVIDIA loop pass | sub_1922F90 |
| +3120 | bool | Address space optimization flag | sub_1C8E680 param |
| +3160 | bool | Debug dump mode | sub_17060B0 enable |
| +3200 | bool | Enable advanced NVIDIA group | IPConst/Reflect/SCCP/etc. |
| +3328 | bool | Enable SM-specific passes | Warp/Reduction/Sinking2 |
| +3488 | bool | Enable barrier optimization | sub_1C98160, sub_18E4A00 |
| +3528 | bool | Tier 1 enable | Phase 3 loop |
| +3532 | int | Tier 1 phase threshold | Phase 3 loop |
| +3568 | bool | Tier 2 enable | Phase 3 loop |
| +3572 | int | Tier 2 phase threshold | Phase 3 loop |
| +3608 | bool | Tier 3 enable | Phase 3 loop |
| +3612 | int | Tier 3 phase threshold | Phase 3 loop |
| +3648 | ptr | Language string ("ptx"/"mid"/"idn") | Phase 1 dispatch |
| +3656 | int | Language string length | Phase 1 dispatch |
| +3704 | bool | Late optimization mode | sub_195E880, sub_1C8A4D0 |
| +3904 | bool | Debug: verify after plugins | Phase 3 loop |
| +3944 | bool | Debug: BB naming "F%d_B%d" | Phase 8 |
| +4224 | bool | Optimization master switch | Tier 0 gate |
| +4228 | int | Optimization phase threshold | Tier 0 gate |
| +4304 | bool | Device-code flag | Phase 1 v238 |
| +4384 | bool | Fast-compile / bypass pipeline | Top branch Pipeline A vs B |
| +4464 | bool | Disable late CFG cleanup B | Phase 5 sub_1654860 |
| +4480 | ptr | SM feature capability | Phase 6: & 4 = codegen ext |
| +4488 | ptr | Plugin pass array start | Phase 3 loop |
| +4496 | ptr | Plugin pass array end | Phase 3 loop |
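The guard pattern this table implies — a pipeline assembler testing a byte at a fixed offset before inserting a pass, as in the `[!a4[2000]]` annotations above — can be sketched with a packed options blob. The offsets below come from the table; the helper names (`make_options`, `should_add_pass`) and the default values are invented for illustration:

```python
import struct

OPT_LEVEL_OFF   = 200    # int  : optimization level (0-3)
DISABLE_DCE_OFF = 280    # bool : disable DCE (sub_18DEFF0 guard)
DISABLE_IL_OFF  = 2000   # bool : disable NVVMIntrinsicLowering (sub_1CB4E40 guard)

def make_options(opt_level=3, disable_dce=False, disable_intrinsic_lowering=False):
    """Build a ~4,512-byte CompilerOptions-style blob with a few fields set."""
    buf = bytearray(4512)
    struct.pack_into("<i", buf, OPT_LEVEL_OFF, opt_level)
    buf[DISABLE_DCE_OFF] = int(disable_dce)
    buf[DISABLE_IL_OFF] = int(disable_intrinsic_lowering)
    return bytes(buf)

def should_add_pass(opts, disable_off):
    """Mirror of the decompiled guard shape: if (!a4[off]) AddPass(...)."""
    return opts[disable_off] == 0

opts = make_options(disable_dce=True)
assert struct.unpack_from("<i", opts, OPT_LEVEL_OFF)[0] == 3
assert not should_add_pass(opts, DISABLE_DCE_OFF)   # DCE insertion skipped
assert should_add_pass(opts, DISABLE_IL_OFF)        # intrinsic lowering still runs
```

Note the mixed polarity recorded in the table: most slots are "disable" flags tested with `!a4[off]`, while a few late slots (+2840, +2880, ...) use reversed "enable" logic.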
Pass Factory Address Inventory
All unique pass factory addresses called from the pipeline assembler and tier sub-pipelines:
| Pass | Factory Address | Call Sites |
|---|---|---|
| NVVMVerifier | sub_12D4560 | many (tiers) |
| AssumptionCacheTracker | sub_1361950 | 1 |
| TargetLibraryInfoWrapperPass | sub_149CCE0 | 1 |
| VerifierPass / BasicAA | sub_14A7550 | 1 |
| BreakCriticalEdges | sub_1654860 | 2 |
| PrintModulePass (debug dump) | sub_17060B0 | ~30+ |
| InstructionCombining | sub_1832270 | 2 |
| TailCallElim / JumpThreading | sub_1833EB0 | 3 |
| FunctionAttrs | sub_1841180 | 3 |
| SCCP | sub_1842BC0 | 2 |
| NVVMReflect | sub_1857160 | ~8 |
| IPConstantPropagation | sub_185D600 | 3 |
| Sink (MemorySSA-based) | sub_1869C50 | 3 |
| NVVMPredicateOpt variant | sub_18A3090 | 2 |
| NVVMPredicateOpt / SelectionOpt | sub_18A3430 | 2 |
| NVVMLoopOpt / BarrierOpt | sub_18B1DE0 | 3 |
| Sinking2Pass (fast=1 for fc mode) | sub_18B3080 | 1 |
| DCE | sub_18DEFF0 | 4 |
| NVVMBarrierAnalysis | sub_18E4A00 | 1 |
| CorrelatedValuePropagation | sub_18EEA90 | 3 |
| DSE | sub_18F5480 | 2 |
| DeadArgElimination | sub_18FD350 | 5 |
| SimplifyCFG | sub_190BB10 | 4 |
| NVIDIA-specific loop pass | sub_1922F90 | 1 |
| LoopIndexSplit | sub_1952F90 | 3 |
| LICM / LoopRotate | sub_195E880 | 4 |
| SROA | sub_1968390 | 2 |
| EarlyCSE | sub_196A2B0 | 2 |
| LoopUnroll | sub_197E720 | 1 |
| LoopSimplify | sub_198DF00 | 3 |
| SROA (variant) | sub_198E2A0 | 1 |
| InstCombine | sub_19401A0 | 2 |
| LoopUnswitch (7 params) | sub_19B73C0 | 3 |
| LoopUnroll variant | sub_19C1680 | 2 |
| NVIDIA custom pass | sub_19CE990 | 1 |
| GenericToNVVM | sub_1A02540 | 1 |
| NVVMRematerialization | sub_1A13320 | 3 |
| NVVMIRVerification | sub_1A223D0 | 5+ |
| LLVM StandardPassPipeline | sub_1A62BF0 | ~9 |
| LoopIdiomRecognize | sub_1A68E70 | 1 |
| InstructionSimplify | sub_1A7A9F0 | 3 |
| NVIDIA-specific pass | sub_1AAC510 | 1 |
| MemCpyOpt | sub_1B26330 | 4 |
| Reassociate | sub_1B7FDF0 | 3 |
| TTIWrapperPass | sub_1BFB520 | 1 |
| NVVMLateOpt | sub_1C46000 | 1 |
| Inliner / AlwaysInline | sub_1C4B6F0 | 2 |
| NewGVN / GVNHoist | sub_1C6E560 | 1 |
| GVN | sub_1C6E800 | 2 |
| ADCE | sub_1C6FCA0 | 2 |
| ADCE variant | sub_1C76260 | 2 |
| NVVMWarpShuffle | sub_1C7F370 | 1 |
| EarlyCSE / GVN variant | sub_1C8A4D0 | 3 |
| MemorySpaceOptimization | sub_1C8E680 | 4 |
| NVVMLowerBarriers | sub_1C98160 | 4 |
| NVVMLowerBarriers variant | sub_1C98270 | 1 |
| ProfileSummaryInfo | sub_1CB0F50 | 1 |
| NVVMIntrinsicLowering | sub_1CB4E40 | ~10 |
| NVVMBranchDist | sub_1CB73C0 | 3 |
| NVVMLowerAlloca | sub_1CBC480 | 1 |
| NVVMUnreachableBlockElim | sub_1CC3990 | 1 |
| NVVMReduction | sub_1CC5E00 | 1 |
| NVVMSinking2 | sub_1CC60B0 | 3 |
| NVVMGenericAddrOpt | sub_1CC71E0 | 1 |
| NVVMFinalLowering | sub_1CEBD10 | 1 |
| NVVMPeephole | sub_1CEF8F0 | 2 |
| NVVMAnnotationsProcessor | sub_215D9D0 | 2 |
Total unique pass factory addresses: 67.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVVMPassOptions::init | sub_12D6300 | 125KB | Populates 4,512-byte options struct |
| writeStringOption | sub_12D6090 | ~100B | Writes 24-byte string slot |
| writeBoolOption | sub_12D6100 | ~80B | Writes 16-byte boolean slot |
| PassOptionRegistry::lookupOption | sub_12D6170 | ~200B | Hash table lookup |
| getBoolOption | sub_12D6240 | ~300B | Boolean resolution with default |
| PassDefTable::getPassDef | sub_1691920 | ~50B | 64-byte stride table lookup |
| parseInt | sub_16D2BB0 | ~100B | String to int64 |
| Pipeline assembler (master) | sub_12E54A0 | 49.8KB | 8-phase pipeline construction |
| AddPass | sub_12DE0B0 | 3.5KB | Hash-table-based insertion |
| Tier 0 sub-pipeline | sub_12DE330 | 4.8KB | ~40 passes, full optimization |
| Tier 1/2/3 sub-pipeline | sub_12DE8F0 | 17.9KB | Phase-conditional, incremental |
| Codegen dispatch | sub_12DFE00 | 20.7KB | Dependency-ordered codegen |
| Phase I/II orchestrator | sub_12E7E70 | 9.4KB | Two-phase state machine |
| New PM registration | sub_2342890 | ~50KB | 2,816 lines, 35 NVIDIA + ~350 LLVM |
| registerPass (hash insert) | sub_E41FB0 | ~300B | StringMap insertion |
| Pass name prefix matcher | sub_2337DE0 | ~100B | starts_with comparison |
| Parameterized pass parser | sub_234CEE0 | ~200B | Extracts <params> |
| MemorySpaceOpt param parser | sub_23331A0 | ~300B | first-time/second-time/warnings |
| New PM pipeline driver | sub_226C400 | 35KB | nvopt<O0/O1/O2/O3/Ofcmax/Ofcmid/Ofcmin> selection |
| New PM text parser (buildDefaultPipeline) | sub_2277440 | 60KB | Parses pipeline name strings |
| nvopt registration (new PM) | sub_225D540 | ~32KB | Pipeline element vtable at 0x4A08350 |
| nvopt registration (legacy PM) | sub_12C35D0 | ~500B | Pipeline element vtable at 0x49E6A58 |
| nvopt object initializer | sub_12EC960 | ~100B | Creates 512-byte pipeline object |
| LLVM standard pipeline factory | sub_1A62BF0 | varies | Pipeline IDs 1,2,4,5,7,8 |
| Pass registry check | sub_163A1D0 | ~100B | Pass registration status |
| Pass status update | sub_163A340 | ~100B | Used in codegen dispatch |
| Pipeline text tokenizer | sub_2352D90 | ~200B | Tokenizes nvopt<> strings |
Reimplementation Checklist
- Two-phase compilation model. Implement a TLS phase variable (values 1=Phase I, 2=Phase II, 3=done) read by individual passes to skip themselves when the current phase does not match their intended execution phase. Phase I runs whole-module analysis; Phase II runs per-function codegen-oriented passes.
- Pipeline assembly function (~150 AddPass calls). Build the master pipeline at runtime using hash-table-based pass insertion (AddPass), with language-specific dispatch (paths for "ptx", "mid", and default), tier-based interleaving (Tiers 0--3 fired by accumulated pass-count thresholds), and phase-conditional pass inclusion.
- NVVMPassOptions system (4,512-byte struct, 221 slots). Implement the proprietary per-pass enable/disable and parametric knob system with 114 string + 100 boolean + 6 integer + 1 string-pointer option slots, parsed from CLI flags and routed to individual passes.
- Concurrent per-function compilation. After Phase I completes on the whole module, split Phase II across a thread pool sized to get_nprocs() or the GNU Jobserver token count, with per-function bitcode extraction, independent compilation, and re-linking of results.
- GNU Jobserver integration. Parse --jobserver-auth=R,W from the MAKEFLAGS environment variable, create a token management pipe, and spawn a pthread to throttle concurrent compilations to the build system's -j level.
- Split-module compilation. Implement the -split-compile=N mechanism: decompose multi-function modules into per-function bitcode blobs via filter callbacks, compile each independently (potentially in parallel), re-link results, and restore linkage attributes from a hash table.
- Tier 0 full optimization sub-pipeline. Assemble the ~40-pass Tier 0 sequence: BreakCriticalEdges, GVN, NVVMReflect, SCCP, NVVMVerifier, LoopIndexSplit, ADCE, LICM, LoopUnroll, InstCombine, SROA, EarlyCSE, LoopUnswitch, SimplifyCFG, NVVMRematerialization, DSE, DCE, with per-pass NVVMPassOptions gating.
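The hash-table-backed AddPass model from the checklist can be sketched as follows. This is a minimal Python sketch under stated assumptions: pass names stand in for Pass* pointers, and flag bit 2 mirrors the "codegen-interacting" mark described for the codegen dispatch. None of these Python names exist in cicc:

```python
class PassManager:
    """Sketch of the AddPass model: the pipeline itself is an ordered
    list (repeats are legal -- the same pass can run at several
    positions), while a side hash table keeps per-pass flag bits."""

    def __init__(self):
        self.pipeline = []
        self.flags = {}            # pass name -> flag bits

    def add_pass(self, name, flag=0):
        self.pipeline.append(name)
        self.flags[name] = self.flags.get(name, 0) | flag

    def mark_codegen_interacting(self, name):
        """Mirror of the `flags |= 2` update in the codegen dispatch."""
        self.flags[name] = self.flags.get(name, 0) | 2

pm = PassManager()
for p in ["NVVMReflect", "GVN", "NVVMReflect"]:   # repetition is intentional
    pm.add_pass(p)
pm.mark_codegen_interacting("GVN")

assert pm.pipeline == ["NVVMReflect", "GVN", "NVVMReflect"]
assert pm.flags["GVN"] & 2                        # codegen-interacting
assert pm.flags["NVVMReflect"] == 0
```

The key design point the sketch preserves: pipeline order and per-pass metadata live in separate structures, which is what lets the codegen dispatch later annotate passes without disturbing their positions.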
Cross-References
- Optimization Levels -- detailed O0/O1/O2/O3 and fast-compile pipeline construction
- Memory Space Optimization -- the MemorySpaceOpt pass (first-time/second-time parameterization)
- Rematerialization -- NVVMRematerialization pass and its register-pressure knobs
- Loop Strength Reduction -- NVIDIA's custom LSR overlay with 11 GPU-specific knobs
- Sinking2 -- NVIDIA's enhanced sinking pass
- CGSCC & LazyCallGraph -- the inliner framework and iteration model
- Pipeline Entry -- top-level compilation entry and two-phase orchestration
- SROA, EarlyCSE, JumpThreading -- scalar pass details (hub: scalar-passes)
OptiX IR Generation
When cicc receives the --emit-optix-ir flag, it activates an alternate compilation path that produces OptiX IR instead of PTX. OptiX IR is the intermediate representation consumed by NVIDIA's OptiX ray tracing engine, which uses a continuation-based execution model fundamentally different from the standard CUDA kernel launch model. Rather than compiling all the way down to PTX machine code, the OPTIXIR pipeline stage serializes the optimized LLVM module in a form that the OptiX runtime can later JIT-compile, link with ray tracing shaders, and schedule across the RT cores' hardware intersection pipeline.
The OptiX path is the third of four stages in cicc's internal pipeline (LNK -> OPT -> OPTIXIR -> LLC), but it is mutually exclusive with LLC in practice: when OptiX mode is active, the pipeline bitmask enables OPTIXIR (0x40) and disables certain optimizations that would be incorrect for continuation-based code. The flag also forces the EDG frontend to emit lifetime intrinsics (--emit-lifetime-intrinsics, EDG option id 132), which mark the live ranges of local variables -- essential information for the OptiX runtime's continuation frame layout.
| Pipeline stage | OPTIXIR (stage 3 of 4) |
| Stage bit | Bit 6 (0x40) in pipeline bitmask |
| Mode bitmask | 0x43, set via a13 = (a13 & 0x300) OR 0x43 |
| Core function | sub_12F9270 (~6 KB) |
| Timer name | "OPTIXIR" / "LibNVVM Optix IR step." |
| Container IR level | NVVM_IR_LEVEL_OPTIX (value 2) |
| CLI flag | --emit-optix-ir (15 bytes, inline-matched) |
| Input extension | .optixir (recognized at 0x8FC001) |
| Callback slot | CompilationState+144 (function), +152 (user data) |
| Availability | CUDA (0xABBA) and OpenCL (0xDEED) modes only |
Flag Processing
--emit-optix-ir in Real Main (sub_8F9C90)
In the standalone entry point, --emit-optix-ir is matched at 0x8FAD00 by a 15-byte inline comparison (split across three immediate compares: "--emit-o" + "ptix" + "-ir"). When matched, it performs three actions:
1. Pushes three strings to the v266 pass-through vector:
   - "--emit-optix-ir" (literal, 15 bytes via explicit strcpy)
   - An 18-byte target string from xmmword_3C23B30 + "28" (likely target-related configuration)
   - A 20-byte GPU name string from xmmword_3C23B40 + "t128" (likely target capability)
2. Sets v243 = 1 (the OptiX IR mode flag)
3. Sets v258 = 1 (the NVC flag, also set by -nvc)
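The chunked immediate-compare idiom (8 + 4 + 3 = 15 bytes, the pattern compilers emit instead of calling strcmp) can be reproduced directly. The function name below is invented; the chunk boundaries match the ones recovered from the binary:

```python
def match_emit_optix_ir(arg):
    """Chunked comparison mirroring the inlined immediate compares:
    an 8-byte, a 4-byte, and a 3-byte chunk ("--emit-o" + "ptix" + "-ir")."""
    b = arg.encode()
    return (len(b) == 15
            and b[0:8]   == b"--emit-o"
            and b[8:12]  == b"ptix"
            and b[12:15] == b"-ir")

assert match_emit_optix_ir("--emit-optix-ir")
assert not match_emit_optix_ir("--emit-optix")      # too short
assert not match_emit_optix_ir("--emit-optix-xx")   # last chunk differs
```

In the binary, each chunk is a single immediate compare against a 64-, 32-, or mixed-width constant, which is why the flag shows up in the disassembly as three magic numbers rather than a string reference.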
--emit-optix-ir in Flag Catalog (sub_9624D0)
In the 3-column flag fan-out system, --emit-optix-ir is processed at line 2415 of the decompiled flag catalog. Its behavior:
// Only valid when a4 == 0xDEED (OpenCL) or a4 == 0xABBA (CUDA)
if (a4 == 0xDEED || a4 == 0xABBA) {
// Route to optimizer: disable IP-MSP and LICM
append_to_opt_vector("-do-ip-msp=0");
append_to_opt_vector("-do-licm=0");
// Set mode bitmask: preserve 64/32-bit mode bits, set OptiX mode
a13 = (a13 & 0x300) | 0x43;
}
The 0x43 value decomposes to:
- Bits [1:0] = 0x03 -- all standard phases enabled (LNK + LLC)
- Bit 6 = 0x40 -- OPTIXIR stage enabled
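The mask-and-merge in `a13 = (a13 & 0x300) | 0x43` preserves the 64/32-bit mode bits while overwriting every stage bit. A small sketch of that update (constant names are illustrative; only the numeric values come from the decompilation):

```python
OPTIXIR_BIT = 0x40    # bit 6: OPTIXIR stage
PHASE_BITS  = 0x03    # bits [1:0]: standard phases
MODE_MASK   = 0x300   # 64/32-bit mode bits to preserve

def enable_optix(a13):
    """Mirror of the decompiled update: a13 = (a13 & 0x300) | 0x43."""
    return (a13 & MODE_MASK) | 0x43

a13 = 0x105                          # one mode bit set, plus stale stage bits
new = enable_optix(a13)
assert new == 0x143
assert new & OPTIXIR_BIT             # OPTIXIR stage enabled
assert new & PHASE_BITS == 0x03      # standard phase bits set
assert new & MODE_MASK == 0x100      # mode bits carried over unchanged
```

Any stage bit that was previously set outside 0x300 (e.g. a stale 0x80 OPT bit) is cleared by the mask before the new stage configuration is ORed in.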
3-Column Fan-Out
The flag translation table maps --emit-optix-ir across all three compilation columns:
| Column | Forwarded As |
|---|---|
| nvcc -> EDG | --emit-lifetime-intrinsics |
| nvcc -> cicc (optimizer) | --emit-optix-ir + -do-ip-msp=0 + -do-licm=0 |
| cicc internal | Mode bitmask 0x43 |
This is notable because a single user-facing flag triggers a different flag in the EDG frontend (--emit-lifetime-intrinsics, EDG option id 132) while also routing the OptiX flag itself to the cicc optimizer. The EDG side-effect ensures that lifetime markers (llvm.lifetime.start / llvm.lifetime.end) are present in the generated LLVM IR, which the OptiX runtime needs to compute continuation frame sizes.
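The fan-out is essentially a per-flag translation table keyed by user-facing flag, with one output per column. A hypothetical mirror of that table for --emit-optix-ir (the dict layout and function name are inventions; the flag strings and the 0x43 mask come from the text above):

```python
# Hypothetical mirror of the 3-column flag translation: one user-facing
# flag fans out to the EDG frontend, the optimizer, and internal state.
FANOUT = {
    "--emit-optix-ir": {
        "edg":  ["--emit-lifetime-intrinsics"],
        "opt":  ["--emit-optix-ir", "-do-ip-msp=0", "-do-licm=0"],
        "mode": 0x43,
    },
}

def translate(flag, a13=0):
    """Return (EDG flags, optimizer flags, updated mode bitmask)."""
    cols = FANOUT[flag]
    return cols["edg"], cols["opt"], (a13 & 0x300) | cols["mode"]

edg, opt, mode = translate("--emit-optix-ir")
assert "--emit-lifetime-intrinsics" in edg   # EDG side-effect
assert "-do-licm=0" in opt                   # optimizer suppression
assert mode & 0x40                           # OPTIXIR stage bit set
```

The point the sketch makes explicit: one lookup produces three independent outputs, so the EDG lifetime-intrinsics side-effect cannot be forgotten when the flag is handled.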
Pipeline Stage
Bitmask and Gating
The pipeline orchestrator sub_12C35D0 (41 KB, the nvvmCompileProgram internal) reads the pipeline stage bitmask from sub_12D2AA0 during initialization. This function parses the architecture code and options into four stage descriptors:
| Stage | Descriptor Pair | Bitmask Bit |
|---|---|---|
| LNK | (&v195, &v200) | Bit 0 (0x01) |
| OPT | (&v196, &v201) | Bit 7 (0x80) |
| OPTIXIR | (&v197, &v202) | Bit 6 (0x40) |
| LLC | (&v198, &v203) | Bit 2 (0x04) |
The OPTIXIR stage executes at lines 1093--1150 of the decompiled orchestrator, after OPT and before LLC:
// STAGE 3 -- OPTIXIR
if (v87 & 0x40) {
// Start timer
sub_16D8B50(timer_ctx, "OPTIXIR", 7,
"LibNVVM Optix IR step.", 22, ...);
// Generate OptiX IR from the optimized LLVM module
err = sub_12F9270(arch_code, // a3: SM architecture code
llvm_ctx, // a4: LLVM context
module, // current LLVM Module*
state + 6, // output buffer for OptiX IR
&error_str); // error string out
if (err) {
// Append error to state[10] error log
...
}
// Close timer
sub_16D7950(timer_ctx);
}
Callback Mechanism
Like the other three stages, OPTIXIR has a callback slot in the CompilationState structure:
| Offset | Field |
|---|---|
| +112 | LNK callback function pointer |
| +120 | LNK callback user data |
| +128 | OPT callback function pointer |
| +136 | OPT callback user data |
| +144 | OPTIXIR callback function pointer |
| +152 | OPTIXIR callback user data |
| +160 | LLC callback function pointer |
| +168 | LLC callback user data |
In the standalone pipeline entry (sub_1265970), the OPTIXIR callback is registered when both verbose and keep-temps modes are active (the logical AND of -v and -keep, which requires wizard mode). The callback ID is 64222, registered via sub_1268040 through sub_12BC0F0.
sub_12F9270 -- OptiX IR Generator
| Field | Value |
|---|---|
| Address | 0x12F9270 |
| Size | ~6 KB |
| Parameters | (uint arch_code, LLVMContext *ctx, Module *module, OutputBuffer *out, char **error_str) |
| Return | unsigned int (0 = success) |
This function takes the fully optimized LLVM module and serializes it into OptiX IR format. The output goes into the state+6 output buffer in the CompilationState, not into the PTX output buffer at state+80. The architecture code and LLVM context are passed through from the pipeline orchestrator's arguments.
The function is relatively small (~6 KB) compared to the LLC stage (sub_12F5100, ~12 KB), consistent with it being primarily a serialization step rather than a full code generation pipeline. It does not run SelectionDAG, register allocation, or instruction scheduling -- those are the domain of the LLC stage, which is typically skipped when OptiX mode is active.
IR Level and Container Marking
When the NVVM container format wraps an OptiX IR payload, the IRLevel field in the binary header is set to NVVM_IR_LEVEL_OPTIX (value 2):
| IRLevel Value | Enum Name | Meaning |
|---|---|---|
| 0 | NVVM_IR_LEVEL_UNIFIED_AFTER_DCI | Default: IR after Device-Code-Interface unification |
| 1 | NVVM_IR_LEVEL_LTO | Link-Time Optimization IR (partially optimized) |
| 2 | NVVM_IR_LEVEL_OPTIX | OptiX pipeline IR |
In the binary header, this is stored as a uint16_t at offset 0x0C:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| IRLevel = 0x0002 (OPTIX) | 0x0C in NvvmContainerBinaryHeader
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
In the XML serialization path (used for debugging), this appears as the "IRLevel" element with the symbolic name "NVVM_IR_LEVEL_OPTIX".
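Reading and writing that header field is a fixed-offset little-endian uint16 access. A sketch of both directions (the enum values and offset 0x0C come from the tables above; the 32-byte blob size and helper names are illustrative, not the real NvvmContainerBinaryHeader layout):

```python
import struct

NVVM_IR_LEVEL_UNIFIED_AFTER_DCI = 0
NVVM_IR_LEVEL_LTO = 1
NVVM_IR_LEVEL_OPTIX = 2
IR_LEVEL_OFFSET = 0x0C          # uint16_t field in the binary header

def read_ir_level(header):
    """Read the IRLevel field from a header blob."""
    return struct.unpack_from("<H", header, IR_LEVEL_OFFSET)[0]

header = bytearray(32)          # hypothetical minimal header blob
struct.pack_into("<H", header, IR_LEVEL_OFFSET, NVVM_IR_LEVEL_OPTIX)
assert read_ir_level(header) == NVVM_IR_LEVEL_OPTIX   # == 2
```

A zero-initialized header reads back as NVVM_IR_LEVEL_UNIFIED_AFTER_DCI, consistent with that value being the default.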
The .optixir file extension is recognized as an input format by cicc's argument parser (matched at 0x8FC001 by comparing the last 8 characters of the filename). This allows round-tripping: cicc can both produce and consume OptiX IR files.
Optimization Pipeline Differences
When OptiX mode is active, the flag catalog forces two critical optimizer changes via the pass-through vector to the OPT stage:
LICM Disabled (-do-licm=0)
Loop Invariant Code Motion is completely disabled when compiling for OptiX. The do-licm NVVMPassOption (at a known offset in the 4,512-byte options struct) gates the LICM pass insertion in the pipeline assembler sub_12E54A0. When set to 0, the sub_195E880(0) LICM pass at position 22 of the Tier 0 pipeline is skipped entirely.
The rationale is that OptiX uses a continuation-based execution model where functions can be suspended and resumed at hardware-defined continuation points (ray-surface intersection, any-hit shader invocation, etc.). LICM hoisting moves computations out of loops and into dominating blocks, which can move them across implicit continuation boundaries. If a hoisted value is live across a continuation point, the OptiX runtime must save it to the continuation frame -- potentially increasing frame size and reducing performance. Worse, the hoisting may move side-effecting operations across points where the program could be suspended, violating the continuation semantics. Disabling LICM avoids these correctness and performance hazards entirely.
IP-MSP Disabled (-do-ip-msp=0)
Interprocedural Memory Space Propagation is also disabled. IP-MSP (sub_12E6160, the NVVMMemorySpacePropagation pass) propagates memory space annotations (generic -> shared/local/global) across function boundaries. This optimization is meaningless for OptiX IR because the OptiX runtime performs its own memory space analysis during JIT compilation, and the intermediate representation must remain generic to allow runtime binding of hit attributes, payload data, and SBT (Shader Binding Table) records to their final memory spaces.
Forced Inlining (nv-inline-all)
The nv-inline-all knob (registered by constructor ctor_186_0 at 0x4DBEC0 in the NVIDIA custom inliner) bypasses cost analysis entirely and forces inlining of every call. This mode is used for OptiX compilation, where the entire call graph must be flattened for the hardware intersection pipeline. The OptiX runtime requires monolithic shader functions because the RT core hardware executes individual ray tracing programs as atomic units -- there is no call stack during hardware intersection traversal.
From the inliner cost model (sub_1864060, 75 KB):
The nv-inline-all knob bypasses cost analysis entirely and forces inlining of every call. This is used for specific compilation modes (e.g., OptiX ray tracing, where the entire call graph must be flattened for the hardware intersection pipeline).
The standard inline-budget (default 20,000) and inline-total-budget are irrelevant when nv-inline-all is active -- every call site is inlined unconditionally regardless of cost.
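The gating logic described above reduces to a short-circuit ahead of the cost computation. A minimal sketch, assuming hypothetical names for the helper and its parameters (the recovered binary uses no such symbols):

```cpp
#include <cassert>

// Hedged sketch of the nv-inline-all short-circuit: when the knob is set,
// the cost/budget comparison is never consulted. Names are illustrative.
bool shouldInline(long cost, long budget, bool nvInlineAll) {
    if (nvInlineAll)
        return true;          // bypass: every call site is inlined
    return cost <= budget;    // normal path; inline-budget defaults to 20,000
}
```

Under nv-inline-all, even a call site whose cost dwarfs the 20,000 default budget is inlined unconditionally.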
Continuation-Based Execution Model
OptiX IR exists because NVIDIA's ray tracing hardware uses a fundamentally different execution model than standard CUDA kernels. Understanding this model explains every design decision in the OPTIXIR pipeline stage.
Standard CUDA vs. OptiX Execution
In standard CUDA, a kernel is a single function that runs to completion on an SM. The compiler produces PTX, which ptxas assembles into SASS machine code. The entire call graph is resolved at compile time, and the GPU executes instructions sequentially (modulo warp divergence and memory latency hiding).
In OptiX, a ray tracing pipeline consists of multiple programs (ray generation, closest-hit, any-hit, miss, intersection, callable) that are compiled separately and linked at runtime by the OptiX driver. When a ray-surface intersection occurs, the hardware suspends the current program, saves its live state to a continuation frame in device memory, and launches the appropriate hit shader. When the hit shader completes, execution resumes from the continuation point.
This model has several consequences for compilation:
- No cross-function calls during intersection. The RT core hardware does not support a general call stack. All function calls within a single program must be fully inlined before the OptiX runtime receives the IR -- hence nv-inline-all.
- Lifetime intrinsics are critical. The OptiX runtime uses llvm.lifetime.start/llvm.lifetime.end markers to determine which local variables are live at each potential continuation point. Variables that are provably dead at a continuation point do not need to be saved to the continuation frame. Without these markers, the runtime must conservatively assume all locals are live, inflating frame sizes and reducing performance.
- LICM is unsafe. Hoisting computations out of loops can move them across implicit continuation points, creating live ranges that span suspension/resumption boundaries. The OptiX runtime cannot reconstruct the hoisted value after resumption unless it is saved, but the compiler does not know where the continuation points will be (they are determined at runtime by the ray tracing pipeline topology).
- Memory space must remain generic. OptiX IR is JIT-compiled at runtime with knowledge of the full pipeline configuration. Memory space decisions that depend on the pipeline topology (shared memory for hit attributes, global memory for payload) cannot be made at cicc compile time.
- The output is IR, not machine code. Unlike the LLC stage, which produces PTX text, the OPTIXIR stage serializes the LLVM module in a form suitable for the OptiX JIT. This is why sub_12F9270 is only ~6 KB -- it is a serializer, not a code generator.
Configuration
CLI Activation
# Standard OptiX compilation via nvcc
nvcc --emit-optix-ir -arch=sm_89 -o kernel.optixir kernel.cu
# Direct cicc invocation
cicc --emit-optix-ir -arch sm_89 -o kernel.optixir kernel.bc
# The flag also accepts .optixir input files for round-tripping
cicc -arch sm_89 -o kernel.ptx kernel.optixir
Effective Configuration When Active
When --emit-optix-ir is specified, the following configuration is implicitly applied:
| Setting | Value | Source |
|---|---|---|
| v243 (OptiX flag) | 1 | Real main sub_8F9C90 |
| v258 (NVC flag) | 1 | Real main sub_8F9C90 |
| Pipeline bitmask | 0x43 | Flag catalog sub_9624D0 |
| do-licm | 0 | Flag catalog, routed to OPT |
| do-ip-msp | 0 | Flag catalog, routed to OPT |
| EDG: emit-lifetime-intrinsics (id 132) | enabled | 3-column fan-out |
| Container IRLevel | 2 (NVVM_IR_LEVEL_OPTIX) | Container serializer |
| nv-inline-all | true | OptiX mode forces all inlining |
Bitmask Decomposition
The 0x43 mode value preserves the 64/32-bit mode bits (mask 0x300) from any previously-set a13 value:
a13 = (a13 & 0x300) | 0x43
Bit field:
[9:8] = preserved (0x100 = 64-bit, 0x200 = 32-bit)
[7] = 0 (OPT stage -- controlled separately)
[6] = 1 (OPTIXIR stage enabled)
[5:3] = 0 (no LTO, no verification override)
[2] = 0 (LLC stage -- typically not run in OptiX mode)
[1:0] = 11 (LNK + base phase control)
Note that bit 2 (LLC) is 0 in the 0x43 bitmask, confirming that the LLC stage is not activated when OptiX mode is the primary output. The pipeline runs LNK -> OPT -> OPTIXIR and stops.
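The mask-and-merge above can be checked directly. A small sketch (the helper name is ours, not a recovered symbol):

```cpp
#include <cassert>

// Sketch of the OptiX mode-flag update: preserve bits [9:8] (the 64/32-bit
// address-width mode), then force the 0x43 OPTIXIR pipeline value.
unsigned applyOptixMode(unsigned a13) {
    return (a13 & 0x300u) | 0x43u;
}
```

For a prior a13 of 0x107 (64-bit mode plus default phases), the result is 0x143: bit 6 (OPTIXIR) set, bit 2 (LLC) clear, bits [1:0] both set.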
Diagnostic Strings
| String | Length | Context |
|---|---|---|
| "OPTIXIR" | 7 | Timer phase name (passed to sub_16D8B50) |
| "LibNVVM Optix IR step." | 22 | Timer description string |
| "--emit-optix-ir" | 15 | CLI flag literal (inline-matched in real main) |
| "--emit-lifetime-intrinsics" | 26 | EDG flag routed from --emit-optix-ir |
| ".optixir" | 8 | Input file extension (matched at 0x8FC001) |
| "-do-ip-msp=0" | 12 | Optimizer option routed when OptiX active |
| "-do-licm=0" | 10 | Optimizer option routed when OptiX active |
Function Map
| Role | Function | Size |
|---|---|---|
| OptiX IR generator (core OPTIXIR stage) | sub_12F9270 | ~6 KB |
| Pipeline orchestrator (nvvmCompileProgram internal) | sub_12C35D0 | ~41 KB |
| Bitmask / stage descriptor parser | sub_12D2AA0 | — |
| Flag catalog (routes --emit-optix-ir) | sub_9624D0 | ~75 KB |
| Real main (matches --emit-optix-ir at 0x8FAD00) | sub_8F9C90 | ~10 KB |
| OPTIXIR callback registration (callback ID 64222) | sub_1268040 | — |
| Pipeline callback dispatcher | sub_12BC0F0 | — |
| Inliner cost model (nv-inline-all bypass) | sub_1864060 | ~75 KB |
| CGSCC inliner core (inlineCallsImpl) | sub_186CA00 | ~61 KB |
| Timer start (receives "OPTIXIR" phase name) | sub_16D8B50 | — |
| Timer close | sub_16D7950 | — |
| Pipeline assembler (skips LICM when do-licm=0) | sub_12E54A0 | ~49.8 KB |
Cross-References
- Entry Point & CLI -- --emit-optix-ir flag parsing and v243 variable
- LLVM Optimizer -- do-licm and do-ip-msp NVVMPassOptions, pipeline assembler
- NVVM Container Binary Format -- NVVM_IR_LEVEL_OPTIX (value 2) in IRLevel enum
- EDG 6.6 Frontend -- --emit-lifetime-intrinsics (EDG option id 132)
- Code Generation -- LLC stage that is skipped in OptiX mode
- LICM -- the pass disabled by OptiX mode
Code Generation
NVPTX backend: SelectionDAG lowering, instruction selection, register allocation, and machine-level passes. Address range 0x1700000–0x35EFFFF (~37 MB of code) -- the largest address range in the binary. This page is the hub for the entire code generation pipeline; each stage has a dedicated deep-dive page linked below.
| SelectionDAG pipeline | SelectionDAG & ISel — build, legalize, combine, select |
| Type legalization | Type Legalization — 348KB monolithic dispatch |
| ISel patterns | ISel Pattern Matching — three-level dispatch, 900KB |
| Register allocation | Register Allocation — pressure-driven greedy RA |
| Register classes | NVPTX Register Classes — nine classes, ID map |
| Scheduling | Instruction Scheduling — MRPA, pipeliner, post-RA |
| Machine passes | Machine-Level Passes — MRPA, remat, LDG, peephole |
| StructurizeCFG | StructurizeCFG — mandatory structured control flow |
| CodeGenPrepare | CodeGenPrepare & SCEV-CGP — IR-level backend prep |
| KnownBits | KnownBits & DemandedBits — fused analysis with GPU SR oracle |
| Tensor core codegen | MMA Code Generation — HMMA/IMMA/WGMMA/tcgen05 lowering pipeline |
| Tensor core builtins | Tensor / MMA Builtins — per-ID reference, validation rules |
| Atomics | Atomic Builtins — scope-aware atom lowering |
| Target infrastructure | NVPTX Target Infrastructure — TargetMachine, TTI, SubtargetFeatures |
| Live range calc | LiveRangeCalc — dual-bitvector liveness |
| Rematerialization | Rematerialization — IR-level + machine-level remat |
| InstrEmitter | InstrEmitter — DAG-to-MachineInstr conversion |
| DAG node layout | SelectionDAG Node Structure — 104-byte SDNode |
Architecture
The code generation pipeline runs after the LLVM optimizer and produces MachineIR that the PTX emission stage serializes to text. The pipeline follows upstream LLVM's SelectionDAG architecture with NVIDIA-specific passes inserted at key points.
LLVM IR
│
├─ CodeGenPrepare (IR-level backend prep)
│ sub_1D70000-1D7FFFF: sunkaddr, sunk_phi, block splitting
│
├─ SelectionDAG Build
│ sub_2065D30 (visit dispatcher)
│ sub_2056920 (major worker, 69KB)
│ sub_2077400 (NVVM tex/surf handle lowering) ★ NVIDIA
│ sub_2072590 (NVPTX argument passing, 38KB) ★ NVIDIA
│
├─ LegalizeTypes
│ sub_20019C0 (348KB main loop)
│ sub_201E5F0 (opcode dispatch, 81KB)
│ sub_201BB90 (expand integer, 75KB)
│
├─ LegalizeOp
│ sub_1FFB890 (169KB, type action dispatch)
│ sub_1FF6F70 (43KB, atomic target-specific lowering) ★ NVIDIA
│
├─ DAG Combining
│ sub_F681E0 (65KB, top-level orchestrator)
│ sub_F20C20 (64KB, visitNode main)
│
├─ Instruction Selection
│ sub_3090F90 (91KB, NVPTXDAGToDAGISel::Select) ★ NVIDIA
│ sub_33D4EF0 (complex addressing, calls sub_969240 399×)
│
├─ Instruction Scheduling
│ sub_355F610 (64KB, ScheduleDAGMILive post-RA)
│ sub_3563190 (58KB, MachinePipeliner)
│
├─ Register Allocation
│ sub_2F49070 (82KB, RAGreedy::selectOrSplit)
│ sub_2F2D9F0 (93KB, LiveRangeSplitter)
│
├─ Machine-Level Passes
│ MRPA, Block Remat, Mem2Reg, LDG, Peephole, etc.
│
└─ StructurizeCFG
sub_35CC920 (95KB, mandatory for PTX structured control flow)
Items marked ★ NVIDIA are NVIDIA-proprietary additions not present in upstream LLVM.
Stage Overview
CodeGenPrepare (detail) -- last IR-level pass before ISel. Sinks address computations, creates PHI nodes for sunk values, and splits critical edges. NVIDIA adds an optional SCEV-CGP extension.
SelectionDAG Build (detail) -- converts LLVM IR into a target-independent DAG. NVPTX intercepts for .param-space argument passing and texture/surface handle lowering.
Type Legalization (detail) -- rewrites every illegal type into legal equivalents via promote, expand, soften, or split-vector actions.
Operation Legalization -- processes nodes whose opcodes are illegal for the target. Atomic operations receive NVIDIA-specific scope-aware lowering (CTA/GPU/SYS) with per-SM feature gates.
DAG Combining -- folds redundant operations, canonicalizes patterns, and reduces the DAG before instruction selection. The KnownBits analysis feeds into combining decisions.
Instruction Selection (detail) -- matches DAG nodes against PTX instruction patterns via a three-level dispatch hierarchy. A compressed per-SM-variant legality table gates which opcodes exist on which GPU architecture.
Instruction Scheduling (detail) -- post-RA scheduling plus an optional software pipeliner. NVIDIA's custom MRPA provides incremental register pressure tracking.
Register Allocation (detail) -- pressure-driven greedy allocator adapted for PTX's virtual register model. Works with nine typed register classes; live range splitting and rematerialization reduce spill pressure.
Machine-Level Passes (detail) -- NVIDIA-proprietary and stock LLVM passes that optimize register pressure, promote stack objects back to registers, and prepare clean PTX for ptxas.
StructurizeCFG (detail) -- mandatory pass that converts arbitrary CFGs into the structured form PTX requires, rejecting irreducible CFGs and EH funclets.
Two-Stage Compilation: cicc + ptxas
CUDA compilation is a two-stage process. cicc (this binary) compiles CUDA/NVVM IR down to PTX assembly text -- a virtual ISA with unlimited registers and structured control flow. ptxas then compiles the PTX into SASS machine code for a specific SM target. This split means that many of cicc's code generation decisions (register allocation, instruction scheduling, peephole optimization) are revisited by ptxas with full hardware knowledge. cicc's code generation pipeline therefore optimizes for two audiences simultaneously: (1) reducing register pressure and producing clean PTX that gives ptxas maximum optimization freedom, and (2) performing target-aware lowering (type legalization, instruction selection, structured CFG) that ptxas cannot undo. The practical consequence is that cicc's backend is pressure-driven rather than latency-driven -- scheduling for low register count matters more than scheduling for pipeline throughput, because ptxas will re-schedule for the hardware but cannot reduce register demand below what cicc emitted.
Cross-References
- NVPTX Subtarget & feature flags -- SM processor table, type legality offsets
- GPU target feature gates -- per-SM architecture feature matrix
- DAG node structure -- SDNode 104-byte layout, operand stride
- Pattern database -- ISel pattern table format
- NVPTX machine opcodes -- opcode reference
- Address spaces -- global, shared, local, param encoding
- PTX emission -- downstream consumer of machine-level output
- Register coalescing -- pre-RA copy elimination
- PrologEpilogInserter -- .local frame layout
PTX Emission
PTX assembly output, function headers, stack frames, register declarations, special registers, atomic instructions, barriers, debug info, and output modes. Address range 0x2140000--0x21FFFFF for NVPTX-specific emission, 0x31E0000--0x3240000 for AsmPrinter.
| AsmPrinter::emitFunctionBody | sub_31EC4F0 (72KB) |
| Function header orchestrator | sub_215A3C0 (.entry/.func, .param, kernel attrs, .pragma) |
| Kernel attribute emission | sub_214DA90 (.reqntid, .maxntid, .minnctapersm, cluster) |
| Stack frame setup | sub_2158E80 (17KB, .local, .reg, __local_depot) |
| Register class map | sub_2163730 + sub_21638D0 (9 classes) |
| GenericToNVVM | sub_215DC20 / sub_215E100 (36KB, addrspace rewriting) |
| Special registers | sub_21E86B0 (%tid, %ctaid, %ntid, %nctaid) |
| Cluster registers | sub_21E9060 (15 registers, SM 90+) |
| Atomic emission | sub_21E5E70 (13 opcodes) + sub_21E6420 (L2 cache hints) |
| Memory barriers | sub_21E94F0 (membar.cta/gpu/sys, fence.sc.cluster) |
| Cluster barriers | sub_21E8EA0 (barrier.cluster.arrive/wait) |
| Global variable emission | sub_2156420 (texref/surfref/samplerref/data) |
| Global variable ordering | sub_2157D50 (5.9KB, topological sort with circular dependency detection) |
| Bitcode producer | "LLVM7.0.1" (NVVM IR compat marker, despite LLVM 20.0.0) |
Function Header Emission -- sub_215A3C0
Emits a complete PTX function prologue in this exact order:
| Step | Output | Condition |
|---|---|---|
| (a) | .pragma "coroutine";\n | Metadata node type 'N' linked to current function |
| (b) | CUDA-specific attributes | *(a1+232)->field_952 == 1 |
| (c) | .entry or .func | sub_1C2F070 (isKernelFunction) |
| (d) | Return type spec | .func only, via sub_214C940 |
| (e) | Mangled function name | sub_214D1D0 |
| (f) | .param declarations | sub_21502D0 (monotonic counter _param_0, _param_1, ...) |
| (g) | Kernel attributes | .entry only, via sub_214DA90 |
| (h) | Additional attributes | sub_214E300 |
| (i) | .noreturn | Non-kernel with noreturn attribute (metadata attr 29) |
| (j) | {\n | Open function body |
| (k) | Stack frame + registers | sub_2158E80 |
| (l) | DWARF debug info | If enabled |
Kernel Attributes -- sub_214DA90
Reads NVVM metadata and emits performance-tuning directives. Attribute emission order:
| Order | Attribute | Source Metadata | Condition |
|---|---|---|---|
| 1 | .blocksareclusters | nvvm.blocksareclusters | Fatal if reqntid not set |
| 2 | .reqntid X, Y, Z | nvvm.reqntid + sub_1C2EDB0 | Comma-separated strtol parse |
| 3 | .maxntid X, Y, Z | sub_1C2EC00 / structured | Unspecified dims default to 1 |
| 4 | .minnctapersm N | sub_1C2EF70 | -- |
| 5 | .explicitcluster | nvvm.cluster_dim | SM > 89 only |
| 6 | .reqnctapercluster X, Y, Z | Cluster dim readers | SM > 89 only |
| 7 | .maxclusterrank N | sub_1C2EF50 | SM > 89 only |
| 8 | .maxnreg N | sub_1C2EF90 | -- |
Cluster attributes (5--7) gated by *(a1+232)->field_1212 > 0x59 (SM > 89, i.e., SM 90+).
Stack Frame -- sub_2158E80
| Field | Value |
|---|---|
| Address | 0x2158E80 |
| Size | 17KB |
Emission Steps
- Local depot (if *(frame_info+48) != 0):
.local .align 16 .b8 __local_depot0[256];
Where alignment = *(frame_info+60), index = function index, size = frame size.
- Stack pointer registers:
.reg .b64 %SP; // stack pointer
.reg .b64 %SPL; // stack pointer local
Uses .b32 in 32-bit mode (checked via *(a2+8)->field_936).
- Virtual register declarations -- iterates the register map at *(a1+800), deduplicating via the hash table at a1+808:
.reg .pred %p<5>;
.reg .b16 %rs<12>;
.reg .b32 %r<47>;
.reg .b64 %rd<8>;
.reg .f32 %f<20>;
.reg .f64 %fd<3>;
Register Class Map
The complete 9-class register table (vtable addresses, PTX type suffixes, prefixes, encoded IDs, copy opcodes, and coalescing constraints) is in Register Classes. The encoding scheme (sub_21583D0: class_encoded_id | (register_index & 0x0FFFFFFF), fatal "Bad register class" on unrecognized vtable) is documented in Register Encoding Scheme.
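The encoding documented for sub_21583D0 packs the class tag into the bits above a 28-bit index mask. A sketch (helper names and the example class IDs are illustrative; the real class-encoded IDs are listed on the Register Classes page):

```cpp
#include <cassert>

// Sketch of the register encoding scheme:
//   encoded = class_encoded_id | (register_index & 0x0FFFFFFF)
unsigned encodeReg(unsigned classEncodedId, unsigned index) {
    return classEncodedId | (index & 0x0FFFFFFF);
}
unsigned regIndex(unsigned encoded) { return encoded & 0x0FFFFFFF; }
unsigned regClass(unsigned encoded) { return encoded & ~0x0FFFFFFFu; }
```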
Special Registers -- sub_21E86B0
Switch on operand value (ASCII-encoded):
| Opcode | Char | Register | Description |
|---|---|---|---|
| 0x26 | & | %tid.x | Thread ID, X |
| 0x27 | ' | %tid.y | Thread ID, Y |
| 0x28 | ( | %tid.z | Thread ID, Z |
| 0x29 | ) | %ntid.x | Block dim, X |
| 0x2A | * | %ntid.y | Block dim, Y |
| 0x2B | + | %ntid.z | Block dim, Z |
| 0x2C | , | %ctaid.x | Block ID, X |
| 0x2D | - | %ctaid.y | Block ID, Y |
| 0x2E | . | %ctaid.z | Block ID, Z |
| 0x2F | / | %nctaid.x | Grid dim, X |
| 0x30 | 0 | %nctaid.y | Grid dim, Y |
| 0x31 | 1 | %nctaid.z | Grid dim, Z |
| 0x5E | ^ | (dynamic) | Via sub_3958DA0(0, ...) -- %warpid/%laneid |
| 0x5F | _ | (dynamic) | Via sub_3958DA0(1, ...) |
Cluster Registers -- sub_21E9060 (SM 90+)
| Value | Register | Description |
|---|---|---|
| 0 | %is_explicit_cluster | Explicit cluster flag |
| 1 | %cluster_ctarank | CTA rank within cluster |
| 2 | %cluster_nctarank | CTAs in cluster |
| 3--5 | %cluster_nctaid.{x,y,z} | Cluster grid dimensions |
| 6--8 | %cluster_ctaid.{x,y,z} | CTA ID within cluster |
| 9--11 | %nclusterid.{x,y,z} | Number of clusters |
| 12--14 | %clusterid.{x,y,z} | Cluster ID |
Fatal: "Unhandled cluster info operand" on invalid value.
Atomic Instruction Emission
Operand Encoding
The atomic instruction word packs scope and operation into a single integer read from the operand array at *(operand_array + 16*a2 + 8):
Bit layout:
[3:0] — reserved
[7:4] — scope: 0=gpu (implicit), 1=cta, 2=sys
[15:8] — reserved
[23:16] — atomic opcode (BYTE2)
The scope field emits a prefix before the atomic suffix: scope 0 produces no prefix (implicit .gpu), scope 1 emits ".cta", scope 2 emits ".sys". The complete PTX instruction format is atom[.scope].op.type.
Base Atomics -- sub_21E5E70
13-operation dispatch table. The switch on BYTE2(v4) selects both the operation suffix and its type class:
| Opcode | Suffix | Type Class | PTX Semantics |
|---|---|---|---|
| 0x00 | .exch.b | bitwise | Exchange -- atomically swap value |
| 0x01 | .add.u | unsigned | Unsigned integer addition |
| 0x03 | .and.b | bitwise | Bitwise AND |
| 0x05 | .or.b | bitwise | Bitwise OR |
| 0x06 | .xor.b | bitwise | Bitwise XOR |
| 0x07 | .max.s | signed | Signed integer maximum |
| 0x08 | .min.s | signed | Signed integer minimum |
| 0x09 | .max.u | unsigned | Unsigned integer maximum |
| 0x0A | .min.u | unsigned | Unsigned integer minimum |
| 0x0B | .add.f | float | Floating-point addition |
| 0x0C | .inc.u | unsigned | Unsigned increment (wrapping) |
| 0x0D | .dec.u | unsigned | Unsigned decrement (wrapping) |
| 0x0E | .cas.b | bitwise | Compare-and-swap |
Opcodes 0x02 and 0x04 are intentionally absent -- the PTX ISA has no signed atomic add at that slot, and no bitwise operation occupies slot 4. The 13 operations exactly match the PTX atom instruction repertoire.
The type width suffix (.b32, .b64, .u32, .u64, .s32, .s64, .f32, .f64) is appended separately by the instruction printer after the operation suffix, based on the register class of the destination operand.
L2 Cache-Hinted Atomics -- sub_21E6420 (Ampere+)
A parallel emission function that inserts L2::cache_hint between the operation and type suffix to produce the extended format:
atom[.scope].op.L2::cache_hint.type
All 13 atomic operations are supported with L2 hints. The hint instructs the GPU L2 cache controller to retain (or evict) the target cache line after the atomic completes -- a data-locality optimization introduced with Ampere (SM 80).
The function uses SSE xmmword loads from precomputed string constants at addresses xmmword_435F590 through xmmword_435F620 to fast-copy 16-byte prefixes of each suffix string. This avoids per-character string construction: each atomic variant's complete suffix (e.g., .exch.L2::cache_hint.b at 22 bytes) is assembled from a 16-byte SSE load of the prefix plus a patched tail. The compiler optimized this into aligned vector moves rather than memcpy calls.
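The effect of the xmmword fast path can be reproduced with two plain copies: one 16-byte block for the shared prefix (one aligned vector move in the binary), one short tail. The addresses above are real; the byte counts here are just the string lengths:

```cpp
#include <cassert>
#include <cstring>

// Sketch: assemble the 22-byte .exch L2-hinted suffix the way the
// vectorized emitter does -- a 16-byte prefix copy plus a patched tail.
void buildExchL2Suffix(char out[32]) {
    std::memcpy(out, ".exch.L2::cache_", 16); // one 16-byte SSE move
    std::memcpy(out + 16, "hint.b", 7);       // tail + NUL terminator
}
```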
Atomic Emission Pseudocode
void emitAtomicOp(raw_ostream &OS, unsigned operand) {
unsigned scope = (operand >> 4) & 0xF;
unsigned opcode = (operand >> 16) & 0xFF; // BYTE2
OS << "atom";
if (scope == 1) OS << ".cta";
else if (scope == 2) OS << ".sys";
// scope 0 = implicit .gpu, no suffix
switch (opcode) {
case 0x00: OS << ".exch.b"; break;
case 0x01: OS << ".add.u"; break;
// 0x02 absent
case 0x03: OS << ".and.b"; break;
// 0x04 absent
case 0x05: OS << ".or.b"; break;
case 0x06: OS << ".xor.b"; break;
case 0x07: OS << ".max.s"; break;
case 0x08: OS << ".min.s"; break;
case 0x09: OS << ".max.u"; break;
case 0x0A: OS << ".min.u"; break;
case 0x0B: OS << ".add.f"; break;
case 0x0C: OS << ".inc.u"; break;
case 0x0D: OS << ".dec.u"; break;
case 0x0E: OS << ".cas.b"; break;
}
// Type width appended by caller
}
The L2-hinted variant (sub_21E6420) follows identical dispatch logic but emits .op.L2::cache_hint.type instead of .op.type.
Memory Barriers -- sub_21E94F0
| Value | Instruction | Scope |
|---|---|---|
| 0 | membar.gpu | Device |
| 1 | membar.cta | Block |
| 2 | membar.sys | System |
| 4 | fence.sc.cluster | Cluster (SM 90+) |
| 3 | -- | Fatal: "Bad membar op" |
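The dispatch is a four-way switch with a hole at value 3. A sketch (the function name is ours; the fatal path is represented by an empty string):

```cpp
#include <cassert>
#include <string>

// Sketch of the membar operand dispatch in sub_21E94F0. Value 3 is the
// "Bad membar op" fatal path in the real code.
std::string membarFor(unsigned v) {
    switch (v) {
    case 0: return "membar.gpu";
    case 1: return "membar.cta";
    case 2: return "membar.sys";
    case 4: return "fence.sc.cluster"; // cluster scope, SM 90+
    default: return "";                // real code: fatal "Bad membar op"
    }
}
```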
Cluster Barriers -- sub_21E8EA0 (SM 90+)
Encoding: bits[3:0] = operation (0=arrive, 1=wait), bits[7:4] = ordering (0=default, 1=relaxed).
| Instruction | Meaning |
|---|---|
| barrier.cluster.arrive | Signal arrival |
| barrier.cluster.arrive.relaxed | Relaxed-memory arrival |
| barrier.cluster.wait | Wait for all CTAs |
| barrier.cluster.wait.relaxed | Relaxed-memory wait |
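The nibble encoding composes the mnemonic from the two fields. A sketch (the helper name is ours):

```cpp
#include <cassert>
#include <string>

// Sketch of the cluster-barrier operand decode: bits[3:0] select
// arrive (0) vs. wait (1); bits[7:4] select default (0) vs. relaxed (1).
std::string clusterBarrier(unsigned operand) {
    std::string s = "barrier.cluster.";
    s += ((operand & 0xF) == 0) ? "arrive" : "wait";
    if (((operand >> 4) & 0xF) == 1)
        s += ".relaxed";
    return s;
}
```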
GenericToNVVM -- sub_215DC20 / sub_215E100
Pass Registration
| Field | Value |
|---|---|
| Pass name | "generic-to-nvvm" |
| Description | "Ensure that the global variables are in the global address space" |
| Pass ID | unk_4FD155C |
| Factory | sub_215D530 (allocates 320-byte state) |
| Disable knob | NVVMPassOptions[2200] (bool) |
| Pipeline position | After InstructionSimplify, before LoopSimplify (position ~22 in optimizer) |
Registration uses a once-init pattern guarded by dword_4FD1558. The 80-byte pass descriptor stores the description at offset 0, pass kind 64 (ModulePass) at offset 8, the name string at offset 16, its length 15 at offset 24, the pass ID pointer at offset 32, flags 0 at offset 40, and the factory function pointer at offset 72. Registration dispatches through sub_163A800 (the LLVM pass registration infrastructure).
A new-pass-manager version also exists: GenericToNVVMPass, registered at sub_305ED20 / sub_305E2C0 with CLI name "generic-to-nvvm".
Algorithm -- sub_215E100 (36KB)
The pass body at sub_215E100 is 36KB because it must rewrite every address-space-dependent use of every affected global. The factory function sub_215D530 allocates a 320-byte state object containing two DenseMap-like hash tables:
| Table | Offset | Purpose | Initial Capacity |
|---|---|---|---|
| GVMap | +168 | Old GlobalVariable -> New GlobalVariable | 128 buckets, 48 bytes/bucket |
| ConstMap | +248 | Old Constant -> New Constant (for constant expressions) | 128 buckets, 48 bytes/bucket |
The algorithm proceeds in three phases:
Phase 1 -- Clone globals. Iterate over all GlobalVariable objects in the module. For each global in addrspace(0) (the LLVM generic address space):
- Create a new GlobalVariable in addrspace(1) (NVPTX global memory) with identical initializer, linkage, alignment, and section attributes.
- Store the old-to-new mapping in GVMap.
Phase 2 -- Rewrite uses. For each cloned global:
- Create an addrspacecast instruction from the new global (addrspace(1)*) back to the original pointer type (addrspace(0)*). This preserves type compatibility with all existing uses.
- Call RAUW (replaceAllUsesWith) on the original global, substituting the addrspacecast value. All instructions, constant expressions, and metadata references that pointed to the original global now point through the cast.
- The ConstMap table handles the tricky case of constant expressions that embed a global reference: ConstantExpr::getAddrSpaceCast, ConstantExpr::getGetElementPtr, and similar must be reconstructed with the new global. This is the bulk of the 36KB function body -- a recursive walk over the constant expression tree, rebuilding each node.
Phase 3 -- Erase originals. Iterate GVMap and erase each original global from the module. The cleanup helper sub_215D780 iterates the map, properly managing LLVM Value reference counts during deletion.
The destructor at sub_215D1A0 / sub_215CE20 frees both hash tables and all stored Value references.
// Pseudocode for GenericToNVVM::runOnModule
bool runOnModule(Module &M) {
for (GlobalVariable &GV : M.globals()) {
if (GV.getAddressSpace() != 0) continue; // skip non-generic
if (GV.isDeclaration()) continue;
// Phase 1: Clone to addrspace(1)
GlobalVariable *NewGV = new GlobalVariable(
M, GV.getValueType(), GV.isConstant(),
GV.getLinkage(), GV.getInitializer(),
GV.getName(), /*InsertBefore=*/nullptr,
GV.getThreadLocalMode(), /*AddressSpace=*/1);
NewGV->copyAttributesFrom(&GV);
GVMap[&GV] = NewGV;
}
for (auto &[OldGV, NewGV] : GVMap) {
// Phase 2: addrspacecast + RAUW
Constant *Cast = ConstantExpr::getAddrSpaceCast(NewGV,
OldGV->getType());
OldGV->replaceAllUsesWith(Cast);
}
for (auto &[OldGV, NewGV] : GVMap) {
// Phase 3: Erase originals
OldGV->eraseFromParent();
}
return !GVMap.empty();
}
Why this exists. The CUDA frontend (EDG) generates globals in addrspace(0) (LLVM's generic/default address space). The NVPTX backend requires device globals to reside in addrspace(1) (GPU global memory) for correct PTX emission. GenericToNVVM bridges this mismatch. Upstream LLVM has an equivalent NVPTXGenericToNVVM pass, but cicc's version carries the additional ConstMap machinery for handling nested constant expression trees that reference relocated globals -- a case that upstream handles differently through its GenericToNVVM + NVPTXAssignValidGlobalAddresses split.
Global Constructor Rejection -- sub_215ACD0
if (lookup("llvm.global_ctors") && type_tag == ArrayType && count != 0)
fatal("Module has a nontrivial global ctor, which NVPTX does not support.");
if (lookup("llvm.global_dtors") && type_tag == ArrayType && count != 0)
fatal("Module has a nontrivial global dtor, which NVPTX does not support.");
GPU kernels have no "program startup" phase -- no __crt_init equivalent. Static initialization with non-trivial constructors is incompatible with the GPU execution model.
Global Variable Emission -- sub_2156420
Overview
The function sub_2156420 (20KB, printModuleLevelGV) handles PTX emission for individual global variables. It processes each global in the module, categorizing it by type (texture reference, surface reference, sampler reference, or data variable) and emitting the appropriate PTX declaration.
Skipped globals: "llvm.metadata", "llvm.*", "nvvm.*".
| Global Type | PTX Output |
|---|---|
| Texture reference | .global .texref NAME; |
| Surface reference | .global .surfref NAME; |
| Sampler reference | .global .samplerref NAME = { ... } |
| Managed memory | .attribute(.managed) |
| Demoted (addrspace 3) | // NAME has been demoted (comment only) |
Sampler Reference Initializer
Sampler references receive a structured initializer block with addressing mode, filter mode, and normalization settings. The emission format:
.global .samplerref my_sampler = {
addr_mode_0 = clamp_to_edge,
addr_mode_1 = wrap,
addr_mode_2 = mirror,
filter_mode = linear,
force_unnormalized_coords = 1
};
The addressing mode values are selected from four string literals:
| Value | String |
|---|---|
| 0 | "wrap" |
| 1 | "clamp_to_border" |
| 2 | "clamp_to_edge" |
| 3 | "mirror" |
Filter mode selects between "nearest" and "linear". The force_unnormalized_coords field is emitted only when the sampler uses unnormalized texture coordinates (integer addressing).
Address Space Qualifiers
sub_214FA80 maps NVPTX address space numbers to PTX qualifier strings (0=no qualifier, 1=.global, 3=.shared, 4=.const, 5+=.local). See Address Spaces for the complete mapping including tensor memory, shared cluster, and param spaces.
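A sketch of the number-to-qualifier mapping, covering only the values quoted here (the full table, including tensor memory, shared cluster, and param spaces, is on the Address Spaces page):

```cpp
#include <assert.h>
#include <string>

// Sketch of sub_214FA80's address-space-to-qualifier mapping, restricted
// to the rows listed in the text: 0=none, 1=.global, 3=.shared, 4=.const,
// 5+=.local. Other values are left unqualified in this sketch.
std::string asQualifier(unsigned addrSpace) {
    switch (addrSpace) {
    case 0: return "";         // generic: no qualifier
    case 1: return ".global";
    case 3: return ".shared";
    case 4: return ".const";
    default: return addrSpace >= 5 ? ".local" : "";
    }
}
```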
Additional attributes emitted by sub_214FEE0:
.attribute(.managed)for CUDA managed memory globals.attribute(.unified)or.attribute(.unified(N))for unified addressing
Data Type Emission
For aggregate or large types, the emitter uses .b8 NAME[SIZE] (byte array). For pointer types with initializers, it selects .u32 or .u64 arrays depending on the pointer width flag at *(a1+232)->field_936. Simple scalar types use the type from sub_214FBF0 (.u32, .u64, .f32, .f64, etc.).
Invalid Address Space Detection
If a global has an initializer in an address space that does not support static initialization:
fatal("initial value of 'NAME' is not allowed in addrspace(N)");
This diagnostic is emitted via sub_1C3F040.
Global Variable Ordering -- sub_2157D50 (Topological Sort)
Problem
Global variables with initializers can reference other globals. If global A's initializer contains a reference to global B, then B must be emitted before A in the PTX output. Circular dependencies are illegal and must be detected.
Algorithm -- DFS Topological Sort
sub_2157D50 (5.9KB) implements a depth-first topological sort over the global use-def chains. The algorithm:
- Build dependency graph. For each global variable in the emission set, walk its initializer constant expression tree. Every GlobalVariable reference found in the initializer creates a directed edge from the referencing global to the referenced global.
- DFS with three-color marking. Each global is in one of three states:
  - White (unvisited): not yet processed.
  - Gray (in progress): currently on the DFS stack -- its subtree is being explored.
  - Black (finished): all of its dependencies have been emitted.
- Visit procedure. For each white global, mark it gray and recurse into its dependencies. When all dependencies return, mark it black and push it onto the output ordering (post-order).
- Cycle detection. If the DFS encounters a gray node, a back-edge has been found, which means a circular dependency. The pass emits the fatal diagnostic:
"Circular dependency found in global variable set"
This is a hard error -- cicc cannot emit globals with mutual references. The PTX format requires a linear declaration order, and there is no forward-declaration mechanism for global variable initializers.
Pseudocode
// sub_2157D50 — topological sort of globals for PTX emission
void orderGlobals(SmallVectorImpl<GlobalVariable *> &Ordered,
                  ArrayRef<GlobalVariable *> Globals) {
  enum Color { White, Gray, Black };
  DenseMap<GlobalVariable *, Color> color;
  for (GlobalVariable *GV : Globals)
    color[GV] = White;
  std::function<void(GlobalVariable *)> visit =
      [&](GlobalVariable *GV) {
        if (color[GV] == Black) return;
        if (color[GV] == Gray)
          fatal("Circular dependency found in global variable set");
        color[GV] = Gray;
        // Walk the initializer for GlobalVariable references.
        if (GV->hasInitializer())
          for (GlobalVariable *Dep : globalsReferencedBy(GV->getInitializer()))
            if (color.count(Dep))
              visit(Dep);
        color[GV] = Black;
        Ordered.push_back(GV);
      };
  for (GlobalVariable *GV : Globals)
    if (color[GV] == White)
      visit(GV);
}
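The pseudocode above can be exercised as a standalone model. The following sketch uses plain ints for globals and an explicit dependency map — both stand-ins for the real LLVM types — and returns false instead of calling fatal() on a cycle; names here are invented for illustration:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <vector>

// Standalone model of the three-color ordering: deps[g] lists the globals
// that g's initializer references. Returns false when a cycle is found.
bool orderGlobals(const std::map<int, std::vector<int>> &deps,
                  std::vector<int> &ordered) {
  enum Color { White, Gray, Black };
  std::map<int, Color> color;
  for (auto &kv : deps) color[kv.first] = White;
  bool ok = true;
  std::function<void(int)> visit = [&](int g) {
    if (!ok || color[g] == Black) return;
    if (color[g] == Gray) { ok = false; return; }  // back-edge: cycle
    color[g] = Gray;
    auto it = deps.find(g);
    if (it != deps.end())
      for (int d : it->second)
        if (color.count(d)) visit(d);
    color[g] = Black;
    ordered.push_back(g);  // post-order: dependencies land first
  };
  for (auto &kv : deps)
    if (color[kv.first] == White) visit(kv.first);
  return ok;
}
```

Because emission is post-order, a global whose initializer references another always appears after its dependency in the output list, matching the PTX declaration-order requirement.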
Interaction with Sampler References
Sampler reference globals can have structured initializers that reference other sampler state. These initializers are walked by the same DFS traversal. The topological sort ensures that any sampler whose initializer references another sampler or texture object appears after its dependencies in the PTX output.
Call Context
sub_2157D50 is called from the module-level emission entry (sub_215ACD0 -> sub_214F370) after all globals have been collected but before any global PTX text is written. The ordered list is then iterated by sub_2156420 to emit each global in dependency order.
Output Mode Selection
Compilation output mode is controlled by a bitmask in the a13 mode flags parameter, passed through the pipeline from the CLI flag parser (sub_95C880). The low bits encode the output format, while bits 8--9 encode the address width (32/64-bit).
Mode Flag Bitmask
| Bits | Value | Mode | Description |
|---|---|---|---|
| [2:0] | 0x07 | Phase control | Default = 7 (all phases: lnk + opt + llc) |
| [4] | 0x10 | Debug | Debug compile or line-info enabled |
| [5] | 0x20 | LTO gen | LTO generation enabled |
| combined | 0x21 | gen-lto | Generate LTO bitcode for later linking |
| combined | 0x23 | full LTO | Complete LTO compilation (lnk + opt + lto) |
| combined | 0x26 | link-lto | Link-time LTO phase (consume LTO bitcode) |
| combined | 0x43 | OptiX IR | Emit .optixir format |
| [7] | 0x80 | gen-opt-lto | Lowering flag for LTO |
| [8] | 0x100 | nvvm-64 | 64-bit pointer mode |
| [9] | 0x200 | nvvm-32 | 32-bit pointer mode |
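As a reading aid, the bit assignments above can be modeled as a small decoder. The struct and function names below are invented for illustration, not recovered symbols:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the a13 mode-flag bit layout described above.
struct ModeInfo {
  unsigned phases;  // bits [2:0]: 7 = all phases (lnk + opt + llc)
  bool debug;       // bit 4 (0x10): debug compile or line-info
  bool ltoGen;      // bit 5 (0x20): LTO generation enabled
  bool is64Bit;     // bit 8 (0x100): nvvm-64 pointer mode
};

inline ModeInfo decodeMode(uint32_t a13) {
  ModeInfo m;
  m.phases  = a13 & 0x07;
  m.debug   = (a13 & 0x10) != 0;
  m.ltoGen  = (a13 & 0x20) != 0;
  m.is64Bit = (a13 & 0x100) != 0;
  return m;
}
```

For example, the full-LTO value 0x23 decodes to phases = 3 with the LTO-gen bit set, consistent with the "lnk + opt + lto" description in the table.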
CLI Flag to Mode Mapping
| CLI Flag | Mode Bits Set | Pipeline Effect |
|---|---|---|
| (default) | 0x07 | All phases run, PTX text output |
| --emit-llvm-bc | (EDG flag id=59) | Emit raw LLVM bitcode .bc after optimization |
| --emit-optix-ir | (a13 & 0x300) \| 0x43 | Disables IP-MSP and LICM, emits .optixir |
| -gen-lto | (a13 & 0x300) \| 0x21 | Generates LTO-compatible bitcode |
| -gen-lto-and-llc | a13 \| 0x20 | LTO generation plus LLC codegen |
| -link-lto | (a13 & 0x300) \| 0x26 | Consumes LTO bitcode for final compilation |
| -lto | (a13 & 0x300) \| 0x23 | Full LTO mode (all phases) |
| -split-compile=N | (stored at offset+1480) | Per-function compilation, F%d_B%d output naming |
OptiX IR Mode
The --emit-optix-ir flag is valid only when the compilation mode is CUDA (a4 == 0xABBA) or OpenCL (a4 == 0xDEED). It forces two optimizer passes to be disabled by routing "-do-ip-msp=0" and "-do-licm=0" to the opt phase. The output is an .optixir file containing NVVM IR in a format consumable by the OptiX ray-tracing runtime for JIT compilation. See OptiX IR for the full format details.
Split Compilation
The -split-compile=N flag (stored at options offset +1480, with a sentinel at +1488 to detect double-definition) enables per-function or per-block compilation for large kernels. The pipeline assembler at sub_12E54A0 generates output identifiers using the "F%d_B%d" format string (function index, block index). Each split unit is compiled independently and the results are linked back together. An extended variant -split-compile-extended=N sets the additional flag at offset +1644.
When split-compile is active, the optimization level is set to negative (typically -1), triggering special handling in sub_12E1EF0: each compiled function's bitcode is re-read via sub_153BF40, validated against the "<split-module>" identifier, and linked back through sub_12F5610 with linkage attributes restored from a hash table.
LTO Modes
Three LTO modes interact with emission:
- gen-lto (0x21): Runs optimization but skips LLC. Output is optimized LLVM bitcode suitable for later link-time optimization. The -gen-lto string is forwarded to the LTO phase.
- link-lto (0x26): Consumes bitcode produced by gen-lto. Runs the LTO linker and optimizer, then proceeds to LLC for final codegen. The -link-lto string is forwarded.
- full LTO (0x23): Single-invocation LTO that runs all phases including linking and codegen.
Bitcode Producer ID
The bitcode writer at sub_1538EC0 (58KB, writeModule) stamps "LLVM7.0.1" as the producer identification string in the IDENTIFICATION_BLOCK of every output bitcode file. This is despite cicc being built on LLVM 20.0.0 internally.
Dual-Constructor Mechanism
Two separate global constructors manage producer version strings, both reading the same environment variable but with different defaults:
| Constructor | Address | Default | Stored At | Purpose |
|---|---|---|---|---|
| ctor_036 | 0x48CC90 | "20.0.0" | qword_4F837E0 | True LLVM version (internal use) |
| ctor_154 | 0x4CE640 | "7.0.1" | (separate global) | NVVM IR compatibility marker |
Both constructors execute this logic:
char *result = getenv("LLVM_OVERRIDE_PRODUCER");
if (!result) result = default_string; // "20.0.0" or "7.0.1"
producer_global = result;
The bitcode writer uses the ctor_154 value, producing "LLVM" + "7.0.1" = "LLVM7.0.1" in the output. Setting LLVM_OVERRIDE_PRODUCER in the environment overrides both constructors to the same value.
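Both constructors follow the common getenv-with-fallback idiom; a minimal sketch (the function name here is hypothetical) is:

```cpp
#include <cstdlib>
#include <string>

// Model of the shared ctor_036 / ctor_154 logic: both read the
// LLVM_OVERRIDE_PRODUCER environment variable, but each falls back
// to its own default string ("20.0.0" vs "7.0.1").
std::string producerVersion(const char *fallback) {
  const char *v = std::getenv("LLVM_OVERRIDE_PRODUCER");
  return v ? std::string(v) : std::string(fallback);
}
```

This also illustrates why setting LLVM_OVERRIDE_PRODUCER forces both globals to the same value: the override takes precedence over either default.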
Why "LLVM7.0.1"
The "LLVM7.0.1" string is the NVVM IR compatibility marker. It signals that the bitcode format conforms to the NVVM IR specification originally based on LLVM 7.0.1's bitcode structure. Even though cicc's internal passes operate at LLVM 20.0.0 capability, the output bitcode format (record encoding, metadata layout, type table) is constrained to be readable by older NVVM toolchain components (libNVVM, nvdisasm, Nsight) that expect LLVM 7.x-era bitcode. The writer achieves this by:
- Using the `IDENTIFICATION_BLOCK` producer string to declare compatibility.
- Constraining the `MODULE_BLOCK` record types to the LLVM 7.x repertoire.
- Enforcing `nvvmir.version` metadata with major == 3, minor <= 2.
The disable-bitcode-version-upgrade cl::opt (registered in ctor_036) controls whether the bitcode reader accepts version mismatches during ingestion.
Related Environment Variable
NVVM_IR_VER_CHK=0 bypasses the NVVM IR version validation at sub_157E370 and sub_12BFF60, which normally enforces major == 3, minor <= 2 and fatals with "Broken module found, compilation aborted!" on mismatch.
Address Space Operations -- sub_21E7FE0
Multi-purpose helper for cvta, MMA operands, and address space qualifiers:
| Query | Values | Output |
|---|---|---|
| "addsp" | 0=generic, 1=.global, 3=.shared, 4+=.local | cvta address space suffix |
| "ab" | 0="a", 1="b" | cvta direction |
| "rowcol" | 0="row", 1="col" | MMA layout |
| "mmarowcol" | 0--3 | "row.row"/"row.col"/"col.row"/"col.col" |
| "satf" | 0=(none), 1=".satfinite" | MMA saturation |
| "abtype" | 0--6 | "u8"/"s8"/"u4"/"s4"/"b1"/"bf16"/"tf32" |
| "trans" | 0=(none), 1=".trans" | WGMMA transpose |
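A sketch of how such a string-keyed query dispatch might look, covering a subset of the table above (the function name and structure are assumptions, not the recovered decompilation):

```cpp
#include <cassert>
#include <string>

// Hypothetical model of a sub_21E7FE0-style helper: a query name plus a
// small integer operand selects a PTX modifier/suffix string.
std::string asQuery(const std::string &query, unsigned v) {
  if (query == "addsp") {
    switch (v) {
      case 0:  return "";         // generic: no suffix
      case 1:  return ".global";
      case 3:  return ".shared";
      default: return ".local";   // 4+ per the table
    }
  }
  if (query == "rowcol") return v ? "col" : "row";
  if (query == "mmarowcol") {
    static const char *tbl[] = {"row.row", "row.col", "col.row", "col.col"};
    return tbl[v & 3];
  }
  if (query == "satf")  return v ? ".satfinite" : "";
  if (query == "trans") return v ? ".trans" : "";
  return "";
}
```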
Architecture-Gated Features
| Feature | Min Architecture | Evidence |
|---|---|---|
| Basic atomics (all 13 ops) | SM 20+ (all) | sub_21E5E70, no arch check |
| Atomic scopes (.cta/.sys) | SM 60+ (Pascal) | Scope bits in operand |
| L2 cache-hinted atomics | SM 80+ (Ampere) | sub_21E6420 separate function |
| membar.cta/gpu/sys | SM 20+ (all) | sub_21E94F0, no arch check |
| fence.sc.cluster | SM 90+ (Hopper) | Opcode 4 in membar handler |
| barrier.cluster.arrive/wait | SM 90+ (Hopper) | sub_21E8EA0 entire function |
| Cluster special registers (15) | SM 90+ (Hopper) | sub_21E9060 entire function |
| MMA row/col layout | SM 70+ (Volta) | mmarowcol in sub_21E7FE0 |
| MMA abtype: bf16/tf32 | SM 80+ (Ampere) | Ampere-class MMA formats |
| .trans modifier (WGMMA) | SM 90+ (Hopper) | WGMMA transpose |
Key Global Variables
| Variable | Purpose |
|---|---|
| byte_4FD17C0 | Pass configuration flag |
| byte_4FD16E0 | ISel dump enable |
| byte_4FD2160 | Extra ISel pass enable |
| dword_4FD26A0 | Scheduling mode (1=simple, else=full pipeline) |
| unk_4FD155C | GenericToNVVM pass ID |
| dword_4FD1558 | GenericToNVVM once-init guard |
| qword_4F837E0 | True LLVM producer version ("20.0.0") |
ptxas Interaction
The PTX text emitted by cicc is not executed directly -- it is consumed by ptxas, which parses the PTX back into an internal IR, applies its own optimization and scheduling passes (195+ knobs), performs hardware register allocation, and emits SASS machine code. Every formatting decision in emission (register naming with %r<N> angle-bracket counts, .pragma annotations, kernel attribute placement) must conform to what ptxas's PTX parser expects. The "LLVM7.0.1" producer string exists specifically because ptxas gates certain parsing behaviors on the declared producer version. Emission quality directly affects ptxas optimization scope: cleaner PTX with fewer redundant moves gives ptxas more freedom to schedule and allocate efficiently.
Cross-References
- OptiX IR -- OptiX IR output format details
- Bitcode I/O -- Bitcode reader/writer and "LLVM7.0.1" producer
- Register Classes -- Consolidated register class reference
- Address Spaces -- Consolidated address space reference
- AsmPrinter -- AsmPrinter infrastructure
- nvcc Interface -- CLI flag routing from nvcc to cicc
Debug Info Pipeline
Debug information in cicc follows a four-stage lifecycle: generation in the EDG/IR-generation frontend, preservation and selective stripping in the optimizer, verification after each pass, and emission as .loc/.file directives in the PTX backend. This page traces the full journey of debug metadata from CUDA source to PTX output, covering the three compilation modes (-g, -generate-line-info, neither), the five stripping passes, the NVIDIA-custom verification infrastructure, and the backend emission format with its non-standard inlined-at extension. Understanding this flow is essential for anyone reimplementing cicc's debug info contract, because the NVPTX target's debug model is fundamentally different from x86 DWARF: PTX is a virtual ISA with no physical registers, no real stack, and no fixed instruction encoding, so the debug metadata cicc emits is consumed by ptxas rather than directly by a debugger.
| Debug info generation | sub_9433F0 (per-parameter), sub_943430 (per-global), sub_941230 (source location) |
| Debug version module flag | sub_915400 -- emits "Debug Info Version" = 3 |
| Flag filter | sub_12C6910 -- checks -debug-compile, -g, -generate-line-info |
| Verification pass | sub_29C8000 (12,480B, 434 BBs) -- runs after each optimization pass |
| Per-instruction verifier | sub_29C3AB0 (5,592B) |
| Debugify injector | sub_29C1CB0 |
| Stripping passes | #110--#114 in the pipeline parser |
| .loc emission | sub_31D55F0 (per-instruction), sub_31E4280 (function-scope .file/.loc) |
| DWARF section emission | sub_399B1E0 (29KB, DwarfDebug::beginModule) |
| NVVM container field | DebugInfo at container offset +12 (enum: NONE/LINE_INFO/DWARF) |
| cl::opt registration | ctor_043 at 0x48D7F0 -- debug-compile, generate-line-info, line-info-inlined-at |
Three Compilation Modes
cicc supports three debug info levels. The mode is selected at the CLI layer and propagated through the flag dispatch table into both the optimizer and the backend. The flag filter function sub_12C6910 reads the CLI flags and routes them to the appropriate pipeline stages.
| CLI flag | Flag struct offset | Routing | NVVM container DebugInfo | DICompileUnit emission kind |
|---|---|---|---|---|
| -g | +296 | -debug-compile to LNK and OPT stages | NVVM_DEBUG_INFO_DWARF (2) | FullDebug |
| -generate-line-info | +328 | -generate-line-info to OPT stage only | NVVM_DEBUG_INFO_LINE_INFO (1) | LineTablesOnly |
| (neither) | -- | -- | NVVM_DEBUG_INFO_NONE (0) | NoDebug |
The distinction between -g and -generate-line-info is critical and non-obvious:
- `-g` routes as `-debug-compile` to both the linker (LNK) and optimizer (OPT) stages. The linker stage needs the flag because libdevice linking must preserve debug info from the user module when merging with the stripped libdevice bitcode. The optimizer preserves all metadata: DICompileUnit, DISubprogram, DILocalVariable, DIType, scope chains, dbg.value()/dbg.declare() intrinsics -- everything. The backend emits complete DWARF sections. cuda-gdb can step through source, inspect variables, and reconstruct inlined call stacks.
- `-generate-line-info` routes only to the OPT stage (not the linker). Early in the optimizer, StripNonLineTableDebugInfoPass strips all metadata except DILocation/DISubprogram/DICompileUnit with LineTablesOnly emission kind. This is enough for profiler source correlation (Nsight Compute maps .loc directives back to source lines) but not enough for variable inspection or source-level debugging in cuda-gdb.
- Neither flag: no debug metadata is generated. The IR-generation frontend skips all debug calls (the dword_4D046B4 / [ctx+0x170] guards prevent emission), and the module has no llvm.dbg.cu named metadata. The verification pass detects this in Phase 1 and returns immediately.
Stage 1: Frontend Debug Metadata Generation
EDG IL-to-IR Layer
The IR generation frontend creates debug metadata when the debug info flag is active. Two independent guards control this:
- dword_4D046B4: a global flag checked at parameter and statement codegen entry points. When set, the function prolog emitter (sub_938240 / Path B equivalent) calls sub_9433F0 to emit DILocalVariable metadata for each parameter, and the statement emitter (sub_9363D0) calls sub_941230 to set the IR builder's debug location from the EDG source position.
- [ctx+0x170]: a pointer to the DICompileUnit object in the codegen context. When non-null, the global variable emitter (sub_916430 and friends) calls sub_943430 to attach debug metadata to each GlobalVariable, and the module finalizer (sub_915400) emits the "Debug Info Version" module flag with value 3.
The metadata hierarchy created during IR generation:
DICompileUnit
[ctx+0x170], emission kind: FullDebug or LineTablesOnly
├── DIFile (per source file)
├── DISubprogram (per __global__ / __device__ function)
│ ├── DILocalVariable (per parameter, via sub_9433F0)
│ │ arg: 1-based index from v10 in the parameter iteration loop
│ │ scope: parent DISubprogram
│ │ file, line, type: from EDG declaration node
│ ├── DILocalVariable (per auto variable, via statement codegen)
│ └── DILocation (per instruction, via sub_941230)
│ line, column: from EDG source position
│ scope: nearest enclosing DILexicalBlock or DISubprogram
└── DIGlobalVariable (per device-side global, via sub_943430)
[gv+0xAD] < 0 indicates debug info present on the GlobalVariable
The module finalizer sub_915400 runs after all globals and functions have been code-generated. Its debug-relevant actions:
- Calls sub_9151E0 to emit nvvmir.version metadata. When [ctx+0x170] is non-null, the version tuple has 4 operands instead of 2, including address-space-qualified indices.
- Calls sub_914410 to emit nvvm.annotations metadata.
- If [ctx+0x170] != 0: calls sub_BA93D0 (Module::addModuleFlag) with ("Debug Info Version", 3). This module flag is mandatory -- without it, LLVM's DWARF backend refuses to emit debug sections.
DIBuilder Infrastructure
The actual metadata node creation uses LLVM's DIBuilder infrastructure at 0xAD0000--0xAF0000 (Zone 2 of the type system module). This includes DIBasicType / DIDerivedType / DICompositeType uniquing, scope chain construction, and the standard LLVM !dbg attachment API. cicc uses the standard LLVM DIBuilder without modifications -- the NVIDIA-specific aspects are in the calling patterns (which EDG nodes map to which DI metadata), not in the metadata creation API itself.
Stage 2: Optimizer Preservation and Stripping
The StripNonLineTableDebugInfoPass
When -generate-line-info is active (but not -g), the optimizer runs StripNonLineTableDebugInfoPass ("strip-nonlinetable-debuginfo", pipeline parser slot #114) early in the pipeline. This pass:
- Strips all DILocalVariable and DIGlobalVariable metadata
- Removes all dbg.value() and dbg.declare() intrinsics
- Strips DIType nodes, imported entities, and retained nodes
- Downgrades DICompileUnit emission kind from FullDebug to LineTablesOnly
- Preserves DISubprogram, DILocation, DIFile, and DICompileUnit (the minimum needed for .loc directives)
After this pass, the module has enough metadata for line-table-based profiling but not for source-level debugging.
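The keep/strip split can be summarized as a predicate over metadata kinds. This is a deliberate simplification for illustration -- the real pass operates on the metadata graph, not on kind names:

```cpp
#include <cassert>
#include <string>

// Simplified predicate: which metadata kinds survive
// StripNonLineTableDebugInfoPass in -generate-line-info mode.
bool survivesLineTableStrip(const std::string &kind) {
  return kind == "DILocation" || kind == "DISubprogram" ||
         kind == "DIFile" || kind == "DICompileUnit";
}
```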
The Five Stripping Passes
cicc registers five debug stripping passes in the pipeline parser, all standard LLVM passes:
| Pipeline name | Slot | LLVM pass class | What it strips | What survives |
|---|---|---|---|---|
| "strip-dead-debug-info" | #110 | StripDeadDebugInfoPass | Debug info for dead functions/globals | Everything for live code |
| "strip-debug-declare" | #112 | StripDebugDeclarePass | dbg.declare() intrinsics only | dbg.value(), all metadata |
| "strip-nondebug" | #113 | StripNonDebugSymbolsPass | Non-debug symbols | All debug metadata |
| "strip-nonlinetable-debuginfo" | #114 | StripNonLineTableDebugInfoPass | Everything except line tables | DILocation, DISubprogram, DIFile |
| (core stripping at 0xAE0000) | -- | stripDebugInfo() | All llvm.dbg.* intrinsics | Nothing |
The core debug stripping implementation at 0xAE0000 (Zone 3 of the type system module) is the nuclear option -- it calls stripDebugInfo() to remove everything. The four named passes provide finer granularity.
Optimizer Pass Behavior with Debug Info
Every standard LLVM optimization pass is expected to preserve debug metadata it does not intentionally modify. In practice, some passes degrade debug info quality:
Passes that preserve debug info well:
- InstCombine: updates dbg.value() when simplifying instructions, uses replaceAllDbgUsesWith
- SROA: splits dbg.declare() into multiple dbg.value() fragments when decomposing allocas
- GVN: preserves debug locations on replacement instructions
- SimplifyCFG: maintains DILocation through block merging
Passes that commonly degrade debug info:
- Inlining: creates new DISubprogram for inlined functions, must maintain inlined-at chains. Failure to do so triggers the verifier's "did not generate DISubprogram" diagnostic.
- LoopUnroll: duplicates instructions without always duplicating DILocation scope context
- LICM: moves instructions out of loops, potentially detaching them from their original scope
- Dead code elimination: removes instructions along with their dbg.value() references
- Tail merging / BranchFolding: merges basic blocks from different source scopes
The verification pass (sub_29C8000) runs after each optimization pass and tracks exactly which passes degrade debug info. When the debugify-each knob is active, the full Debugify-then-CheckDebugify cycle runs around every pass, injecting synthetic debug metadata before the pass and verifying it survived afterward.
Stage 3: Debug Info Verification
The verification pass sub_29C8000 is documented in detail on the Debug Info Verification page. Here we summarize its role in the pipeline.
Pipeline Integration Protocol
The pipeline runner invokes the verifier as a sandwich around each optimization pass:
// Pseudocode for the verification protocol
snapshot_debug_metadata(M); // Phase 2 of sub_29C8000: 8 hash tables
run_optimization_pass(M, "instcombine");
sub_29C8000(M, errs(), dbgCU, hashMap, "instcombine", 11, file, fileLen, jsonOut);
// Returns: true = PASS, false = FAIL (debug info degraded)
The pass name argument lets the JSON report attribute degradation to the specific pass responsible. The eight-table metadata snapshot captures DISubprogram, DIScope, DIGlobalVariable, DILocalVariable, DIType, DIImportedEntity, DILabel, and retained nodes -- far more comprehensive than upstream LLVM's CheckDebugInfoPass, which only tracks subprograms and debug variable intrinsics.
Verification Modes
Three modes of debug verification exist, controlled by LLVM knobs:
| Mode | Knob | What runs |
|---|---|---|
| Standard | verify-each or verify-after-all | sub_29C8000 after every pass |
| Debugify | debugify-each | sub_29C1CB0 (inject) + pass + sub_29C8000 (check) |
| Selective | verify-debuginfo-preserve | Lighter-weight preservation checking |
The Debugify mode is especially powerful: it first injects synthetic debug metadata via sub_29C1CB0 (ensuring every instruction has a DILocation and every variable has dbg.value()), then runs the optimization pass, then checks whether the synthetic metadata survived. This detects passes that drop debug info even when the original module had sparse or no debug metadata.
Behavior in -generate-line-info Mode
When the module is in LineTablesOnly mode (after StripNonLineTableDebugInfoPass has run), the verifier still executes but its scope is narrower. Phase 5 (per-function debug variable checking) skips variable intrinsic validation because dbg.value()/dbg.declare() were intentionally stripped. Only Phase 6 (per-instruction DILocation verification via sub_29C3AB0) remains fully active, checking that:
- Every instruction with a DebugLoc has a valid DILocation
- DILocation scope chains resolve to a valid DISubprogram
- No orphaned debug locations reference deleted subprograms
- BB-level consistency is maintained
Stage 4: Backend Emission
The .loc Directive
The AsmPrinter emits DWARF .loc directives as inline annotations in the PTX instruction stream. The per-instruction emitter sub_31D55F0 runs after each real (non-meta) instruction when HasDebugInfo (r15+0x1E8) is set. It reads the DebugLoc attached to each MachineInstr and emits:
.loc 1 42 0
ld.param.u64 %rd1, [_Z6kernelPf_param_0];
.loc 1 43 5
mul.wide.u32 %rd2, %r1, 4;
The function-scope emitter sub_31E4280 handles .file directives that establish the file index table, and sub_31E6100 (insertDebugLocEntry) maintains a file/line-to-MCSymbol mapping for MBB boundaries used in DWARF line table construction.
The NVIDIA Inlined-At Extension
Standard LLVM .loc emits only file line column. cicc extends .loc with function_name and inlined_at attributes that encode the full inlining chain:
.loc 1 42 0, function_name _Z6kernelPf, inlined_at 2 15 3
This allows ptxas to reconstruct the complete call stack at any point in inlined code, so cuda-gdb can show the user which function was inlined and where. The implementation in the AsmPrinter:
- Reads the DebugLoc from the MachineInstr
- Walks the inlined-at chain via DebugLoc::getInlinedAt()
- Builds a work list (SmallVector<DebugLoc, 8>) of the full chain
- Emits in reverse order (outer locations before inner) so ptxas sees the outermost caller first
- Tracks already-emitted inlined-at locations in an InlinedAtLocs set to prevent duplicates
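Under the simplifying assumption that a location is just file/line/column plus an optional inlined-at pointer (a stand-in for LLVM's DILocation), the chain walk and outermost-first reversal look like:

```cpp
#include <cassert>
#include <vector>

// Minimal model of an inlined-at chain: each location optionally points
// at the location it was inlined at; null marks the outermost frame.
struct Loc {
  int file, line, col;
  const Loc *inlinedAt;
};

// Walk innermost -> outermost, then reverse so the outermost caller is
// emitted first, as the AsmPrinter does for ptxas.
std::vector<const Loc *> chainOutermostFirst(const Loc *l) {
  std::vector<const Loc *> work;  // innermost first
  for (; l; l = l->inlinedAt)
    work.push_back(l);
  return {work.rbegin(), work.rend()};  // reversed: outermost first
}
```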
The line-info-inlined-at LLVM knob (registered at 0x48D7F0, cl::opt<bool>) controls whether this extension is active. The CLI flag -no-lineinfo-inlined-at disables it by setting -line-info-inlined-at=0 on the backend command line. When disabled, only the immediate source location is emitted, losing inlining context but producing smaller PTX.
The dwarf-extended-loc Knob
The dwarf-extended-loc knob (enum: Default/Enable/Disable, registered at 0x490000 area) controls whether extended flags appear in .loc directives:
| Value | Effect |
|---|---|
| Default (0) | Platform-dependent behavior |
| Enable (1) | Emit is_stmt, prologue_end, discriminator extensions |
| Disable (2) | Bare .loc file line column only |
The Disable mode exists for compatibility with older ptxas versions that do not parse extended .loc flags. When enabled, the extended flags allow cuda-gdb to identify statement boundaries (is_stmt), function entry points (prologue_end), and distinguish between multiple code paths at the same source line (discriminator).
Source Interleaving
The -show-src CLI flag (flag struct offset +808, routed to the backend as -nvptx-emit-src) enables the InterleaveSrcInPtx mode. When active, the AsmPrinter reads source file lines and emits them as comments interleaved with the PTX:
// kernel.cu:42 float val = input[idx];
.loc 1 42 0
ld.global.f32 %f1, [%rd2];
// kernel.cu:43 val = val * val;
.loc 1 43 0
mul.f32 %f2, %f1, %f1;
This is purely a readability feature -- the comments are ignored by ptxas and have no effect on debug quality. The nvptx-emit-src LLVM knob description string is "Emit source line in ptx file".
.file Directive Emission
The .file directives are emitted by emitDwarfFileEntries during doFinalization (sub_3972F10, 24KB). They map source filenames to numeric file indices referenced by .loc:
.file 1 "/path/to/kernel.cu"
.file 2 "/usr/local/cuda/include/cuda_runtime.h"
The file table is built incrementally as .loc directives reference new files during instruction emission. The DWARF line section symbols are created via sub_E808D0 (createTempSymbol for DwarfLineSection) and bound via sub_E81A00 (emitDwarfLineSection).
DWARF Section Emission
When full debug info (-g) is active, a separate DWARF emission module at 0x3990000--0x39DF000 generates complete DWARF debug sections. This is standard LLVM DWARF emission with no significant NVIDIA modifications to the section format:
| Address | Size | Function |
|---|---|---|
| sub_399B1E0 | 29KB | DwarfDebug::beginModule() -- initializes from llvm.dbg.cu, strings: "DWARF Debug Writer", "DWARF Emission" |
| sub_3997B50 | 33KB | .debug_aranges emission -- address range tables |
| sub_399D1D0 | 12KB | Range list emission (DW_RLE_base_address, DW_RLE_offset_pair, DW_RLE_start_length) |
| sub_399EB70 | 12KB | Register location expressions -- strings: "no DWARF register encoding", "sub-register" |
| sub_39BDF60 | 38KB | .debug_names accelerator table -- bucket count, name count, augmentation string |
| sub_39B6390 | 33KB | DWARF form size calculator -- switch on DW_FORM_* codes |
| sub_215ACD0 | 8.1KB | Module-level emission entry (NVPTX Debug Info Emission) |
The module-level entry sub_215ACD0 checks *(a1+240)->field_344 to determine if DWARF is enabled, then looks up the "NVPTX DWARF Debug Writer" / "NVPTX Debug Info Emission" pass info. The NVPTX backend does not emit physical register locations -- GPUs have no DWARF register numbering scheme that maps to hardware. Instead, it emits virtual register references that ptxas resolves through SASS-level debug info.
The DWARF string/enum tables at 0xE00000--0xE0FFFF (tag-to-string conversion, attribute-to-string, operation encoding) are stock LLVM 20 BinaryFormat/Dwarf.cpp utilities with no visible NVIDIA modifications.
.target Debug Suffix
The header emission function sub_214F370 appends , debug to the .target directive when MCAsmInfo::doesSupportDebugInformation() returns true:
.target sm_90, texmode_independent, debug
This suffix tells ptxas that the PTX contains debug information and should be processed accordingly. Without it, ptxas ignores .loc and .file directives.
NvvmDebugVersion
The NVVM container format includes a debug version field at header bytes 0x08--0x09:
| Offset | Size | Field |
|---|---|---|
| 0x08 | 1 byte | NvvmDebugVersion.Major |
| 0x09 | 1 byte | NvvmDebugVersion.Minor |
Current version: Major=3, Minor<=2. The version check logic in sub_CD41B0:
- `Major` must equal 3 (hard fail on mismatch: "not compatible" error, returns NULL)
- `Minor` > 2: warning printed, parse continues
- If absent: the default `{3, 2}` is assumed

This version tracks the debug metadata schema independently of the NVVM IR version (NvvmIRVersion at 0x06--0x07, current Major=2, Minor<=0x62). The separation allows debug format evolution without breaking IR compatibility -- NVIDIA can add new debug metadata fields (e.g., for new SM features) without requiring a full IR version bump.
The container's DebugInfo field (at deserialized struct offset +12) also encodes the debug level as an enum that must be consistent with the module metadata:
enum NvvmDebugInfo {
NVVM_DEBUG_INFO_NONE = 0, // no debug info
NVVM_DEBUG_INFO_LINE_INFO = 1, // -generate-line-info
NVVM_DEBUG_INFO_DWARF = 2 // -g
};
The standalone pipeline validates this at IR intake: if debug_info_present AND debug_mode_flag AND NOT debug_version_validated, the function returns error code 3 (incompatible).
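The three-way outcome of the version check can be modeled directly (the enum and function names below are invented; the behavior follows the sub_CD41B0 rules described above):

```cpp
#include <cassert>
#include <cstdint>

// Model of the NvvmDebugVersion check: Major must be 3 (hard fail),
// Minor > 2 only warns, and {3, 2} is the assumed default when absent.
enum class VerCheck { Ok, Warn, Fail };

VerCheck checkDebugVersion(uint8_t major, uint8_t minor) {
  if (major != 3) return VerCheck::Fail;  // "not compatible", returns NULL
  if (minor > 2)  return VerCheck::Warn;  // warning printed, parse continues
  return VerCheck::Ok;
}
```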
Debug Records Format
cicc v13.0 inherits LLVM 20's support for the new debug records format (DbgRecord) as an alternative to the traditional dbg.value() / dbg.declare() intrinsics. Three knobs control this:
| Knob | Type | Default | Effect |
|---|---|---|---|
| write-experimental-debuginfo | bool | true | Write debug info in new non-intrinsic format |
| write-experimental-debuginfo-iterators-to-bitcode | bool | true | Serialize debug records to bitcode |
| preserve-input-debuginfo-format | boolOrDefault | false | When true, preserve whatever format the input uses |
The write-experimental-debuginfo default of true means cicc v13.0 uses the new DbgRecord format internally by default. This is an LLVM 20 feature where debug info is stored as DbgVariableRecord and DbgLabelRecord objects attached directly to instructions rather than as separate dbg.value() intrinsic calls. The format change is transparent to the optimizer and backend -- the verification pass and AsmPrinter handle both formats identically.
End-to-End Flow Diagram
CUDA Source (.cu / .cup)
│
▼
EDG 6.6 Frontend (IL tree)
│ dword_4D046B4 / [ctx+0x170] guards debug emission
│ sub_9433F0: per-parameter DILocalVariable
│ sub_943430: per-global DIGlobalVariable
│ sub_941230: per-instruction DILocation
│ sub_915400: "Debug Info Version" = 3 module flag
▼
LLVM Module with debug metadata
│ llvm.dbg.cu → DICompileUnit → DISubprogram → ...
│
├─ If -generate-line-info:
│ StripNonLineTableDebugInfoPass (#114)
│ strips variables, types, scopes; keeps DILocation/DISubprogram
│
▼
LLVM Optimizer (sub_12E54A0)
│ ┌─────────────────────────────────────────────┐
│ │ For each pass: │
│ │ snapshot = sub_29C8000 Phase 2 (8 tables) │
│ │ run_pass(M); │
│ │ sub_29C8000(M, ..., passName, ...); │
│ │ if FAIL: JSON report + diagnostic │
│ └─────────────────────────────────────────────┘
▼
Optimized LLVM Module
│
▼
NVPTX Backend (SelectionDAG → MachineInstr)
│ DebugLoc attached to each MachineInstr
│
▼
AsmPrinter (sub_31EC4F0)
│ sub_31D55F0: per-instruction .loc emission
│ sub_31E4280: .file/.loc at function scope
│ inlined-at chain walking → function_name, inlined_at extensions
│ InterleaveSrcInPtx: source line comments
│
├─ If -g:
│ sub_399B1E0: DwarfDebug::beginModule()
│ sub_3997B50: .debug_aranges
│ sub_39BDF60: .debug_names
│
▼
PTX Output
.target sm_90, texmode_independent, debug
.file 1 "kernel.cu"
.loc 1 42 0, function_name _Z6kernelPf
ld.param.u64 %rd1, [_Z6kernelPf_param_0];
Knobs Reference
| Knob | Type | Default | Scope | Effect |
|---|---|---|---|---|
| -g / -debug-compile | bool | off | CLI | Full debug compilation (FullDebug emission) |
| -generate-line-info | bool | off | CLI | Line tables only (LineTablesOnly emission) |
| -no-lineinfo-inlined-at | bool | off | CLI | Disable inlined-at tracking (sets -line-info-inlined-at=0) |
| -show-src / -nvptx-emit-src | bool | off | CLI | Interleave source lines as PTX comments |
| dwarf-extended-loc | enum | Default | LLVM | Default/Enable/Disable extended .loc flags |
| dwarf-version | unsigned | (platform) | LLVM | DWARF version for debug sections |
| line-info-inlined-at | bool | true | LLVM | Emit inlined-at chains in .loc directives |
| debugify-each | bool | off | LLVM | Debugify + CheckDebugify around every pass |
| debugify-level | enum | location+variables | LLVM | locations or location+variables |
| debugify-quiet | bool | off | LLVM | Suppress debugify diagnostics |
| debugify-func-limit | int | unlimited | LLVM | Max functions to debugify |
| debugify-export | string | -- | LLVM | Export debugify results to file |
| verify-each | bool | off | LLVM | Run IR verifier after every pass |
| verify-debuginfo-preserve | bool | off | LLVM | Enable debug info preservation checking |
| no-inline-line-tables | bool | off | LLVM | Prevent inlining from merging line tables |
| write-experimental-debuginfo | bool | true | LLVM | Use DbgRecord format instead of intrinsics |
| preserve-input-debuginfo-format | boolOrDefault | false | LLVM | Preserve input debug info format as-is |
| NvvmDebugVersion | {u8,u8} | {3,2} | Container | Debug metadata schema version |
| qword_5008FC8 | bool | off | Global | Verbose diagnostic output enable |
| qword_5008C88 | int32 | >0 | Global | Metadata depth threshold (<=0 skips deep scope walk) |
NVIDIA Modifications vs Stock LLVM
- Inlined-at .loc extension. Upstream LLVM's NVPTX AsmPrinter emits standard .loc file line column. cicc appends function_name and inlined_at attributes that encode the full inlining chain for cuda-gdb call stack reconstruction.
- Eight-table verification. Upstream CheckDebugInfoPass tracks DISubprogram and debug variable intrinsics. NVIDIA's sub_29C8000 maintains eight separate hash tables covering subprograms, scopes, global variables, local variables, types, imported entities, labels, and retained nodes.
- JSON structured reporting. NVIDIA added a YAML/JSON serializer to the verification pass that produces machine-parseable bug reports with per-pass attribution -- no upstream equivalent.
- Metadata reconstruction. After verification, NVIDIA's pass reconstructs the module's metadata tables from verified versions (Phase 8), effectively serving as a "repair" pass that normalizes metadata after corruption.
- Container debug versioning. The NvvmDebugVersion field in the NVVM container header tracks the debug metadata schema independently of the IR version -- a concept that does not exist in upstream LLVM.
- Three-level debug info enum. The NVVM_DEBUG_INFO_NONE/LINE_INFO/DWARF enum in the container provides a compile-unit-level debug mode indicator that ptxas and libNVVM can check without parsing the full module metadata.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Emit DILocalVariable for function parameter | sub_9433F0 | -- | -- |
| Emit debug info for GlobalVariable (conditional on [ctx+0x170]) | sub_943430 | -- | -- |
| Set IR builder DebugLoc from EDG source position | sub_941230 | -- | -- |
| Module finalizer: emit "Debug Info Version" = 3 module flag | sub_915400 | 133B | -- |
| Flag filter: checks -debug-compile, -g, -generate-line-info | sub_12C6910 | -- | -- |
| Debug info verification pass (main entry) | sub_29C8000 | 12,480B | -- |
| Per-instruction DILocation verifier | sub_29C3AB0 | 5,592B | -- |
| Debugify synthetic debug info injector | sub_29C1CB0 | -- | -- |
| NewPMCheckDebugifyPass wrapper | sub_22702B0 | -- | -- |
| NewPMDebugifyPass wrapper | sub_2270390 | -- | -- |
| Per-instruction .loc emission | sub_31D55F0 | -- | -- |
| Function-scope .file/.loc emission | sub_31E4280 | -- | -- |
| insertDebugLocEntry (file/line to MCSymbol mapping) | sub_31E6100 | -- | -- |
| Instruction-level debug comment emission | sub_31D89B0 | -- | -- |
| emitHeader (.version, .target ... , debug) | sub_214F370 | 7.2KB | -- |
| Module-level emission entry / NVPTX Debug Info Emission | sub_215ACD0 | 8.1KB | -- |
| DwarfDebug::beginModule() | sub_399B1E0 | 29KB | -- |
| .debug_aranges emission | sub_3997B50 | 33KB | -- |
| Range list emission (DW_RLE_*) | sub_399D1D0 | 12KB | -- |
| Register location expressions | sub_399EB70 | 12KB | -- |
| .debug_names accelerator table | sub_39BDF60 | 38KB | -- |
| DWARF form size calculator | sub_39B6390 | 33KB | -- |
| DIBuilder / debug metadata helper | sub_ADCDB0 | -- | -- |
| cl::opt registration: debug-compile, generate-line-info, line-info-inlined-at | sub_48D7F0 | -- | -- |
| NVVM container version check (validates NvvmDebugVersion.Major == 3) | sub_CD41B0 | -- | -- |
Cross-References
- Debug Info Verification -- detailed sub_29C8000 algorithm, 9-phase walk, JSON output format
- AsmPrinter & PTX Body Emission -- .loc/.file directive emission, per-instruction debug annotation, inlined-at chain
- PTX Emission -- module-level emission, .target ... , debug suffix
- Entry Point & CLI -- -g, -generate-line-info flag parsing in sub_8F9C90
- NVVM IR Generation -- dual-path architecture, codegen context
- CLI Flags -- flag routing through the 3-column dispatch table
- LLVM Knobs -- debugify-*, verify-each, dwarf-* knobs
- Pipeline & Ordering -- where debug verification fits in the pass ordering
- NVVM Container -- NvvmDebugVersion field in the binary header
- Inliner Cost Model -- inlining decisions that create the inlined-at chains
NVIDIA Custom Passes
25+ proprietary optimization passes not found in upstream LLVM. Registered into the New PM pipeline at sub_2342890 and into the pipeline assembler at sub_12E54A0.
| Module-level custom | 16 passes |
| Function-level custom | 9 passes |
| Loop-level custom | 1 pass |
| Custom analyses | 2 analyses |
| Machine-level custom | 13 passes |
| Registration | sub_2342890 (New PM) + sub_12E54A0 (pipeline builder) |
| Dedicated deep-dive pages | 22 |
IR-Level Module Passes
| Pass Name | Class / Function | Size | Description |
|---|---|---|---|
memory-space-opt | sub_1C70910 / sub_1CA2920 | cluster | Resolves generic pointers to specific address spaces (global/shared/local/const). Warns on illegal ops: atomics on constant mem, wmma on wrong space. Parameterized: first-time, second-time, no-warnings, warnings |
printf-lowering | sub_1CB1E60 | 31KB | Lowers printf → vprintf + local buffer. Validates format string is a literal. "vprintfBuffer.local", "bufIndexed" |
nvvm-verify | sub_2C80C90 | 230KB | Three-layer NVVM IR verifier (module + function + intrinsic). Validates triples, address spaces, atomic restrictions, pointer cast rules, architecture-gated intrinsic availability |
nvvm-pretreat | PretreatPass | — | IR pre-treatment before optimization |
check-kernel-functions | NVPTXSetFunctionLinkagesPass | — | Kernel function linkage validation |
check-gep-index | — | — | GEP index validation |
cnp-launch-check | CNPLaunchCheckPass | — | Cooperative launch validation |
ipmsp | IPMSPPass | — | Inter-procedural memory space propagation |
nv-early-inliner | — | — | NVIDIA early inlining pass |
nv-inline-must | InlineMustPass | — | Force-inline functions marked __forceinline__ |
select-kernels | SelectKernelsPass | — | Kernel selection for compilation |
set-global-array-alignment | — | — | Parameterized: modify-shared-mem, skip-shared-mem, modify-global-mem, skip-global-mem |
lower-aggr-copies | — | 72KB+58KB | Lower aggregate copies: struct splitting, memmove unrolling. Param: lower-aggr-func-args |
lower-struct-args | — | — | Lower structure arguments. Param: opt-byval |
process-restrict | — | — | Process __restrict__ annotations. Param: propagate-only |
lower-ops | LowerOpsPass | — | Lower special operations. Includes FP128/I128 emulation via 48 __nv_* library calls |
IR-Level Function Passes
| Pass Name | Function | Size | Description |
|---|---|---|---|
branch-dist | sub_1C47810 cluster | — | Branch distribution optimization. Knobs: branch-dist-block-limit, branch-dist-func-limit, branch-dist-norm |
nvvm-reflect | sub_1857160 | — | Resolves __nvvm_reflect() calls to integer constants based on target SM and FTZ mode. Runs multiple times as inlining exposes new calls |
nvvm-reflect-pp | — | — | NVVM reflect preprocessor |
nvvm-intrinsic-lowering | sub_2C63FB0 | 140KB | Lowers llvm.nvvm.* intrinsics to standard LLVM IR. Two levels: 0 = basic, 1 = barrier-aware. Runs up to 10 times in mid pipeline |
nvvm-peephole-optimizer | — | — | NVVM-specific peephole optimizations |
remat | sub_1CE7DD0 | 67KB | IR-level rematerialization. Analyzes live-in/live-out register pressure per BB. Contains IV demotion sub-pass (75KB) |
reuse-local-memory | — | — | Local memory reuse optimization |
set-local-array-alignment | — | — | Set alignment for local arrays |
sinking2 | — | — | NVIDIA-specific instruction sinking (distinct from LLVM's Sink pass) |
IR-Level Loop Pass
| Pass Name | Function | Size | Description |
|---|---|---|---|
loop-index-split | sub_2CC5900 / sub_1C7B2C0 | 69KB | Split loops on index conditions. NVIDIA-preserved pass (removed from upstream LLVM) |
Custom Analyses
| Analysis Name | Purpose |
|---|---|
rpa | Register Pressure Analysis — feeds into scheduling and rematerialization decisions |
merge-sets | Merge set computation — used by coalescing and allocation |
Machine-Level Passes
| Pass Name | Function | Pass ID | Size | Description |
|---|---|---|---|---|
| Block Remat | sub_2186D90 | nvptx-remat-block | 47KB | Two-phase candidate selection + iterative "pull-in" for register pressure reduction. "Max-Live-Function(", "Really Final Pull-in:" |
| Machine Mem2Reg | sub_21F9920 | nvptx-mem2reg | — | Promotes __local_depot stack objects back to registers post-regalloc |
| MRPA | sub_2E5A4E0 | machine-rpa | 48KB | Machine Register Pressure Analysis — incremental tracking, not in upstream LLVM |
| LDG Transform | sub_21F2780 | ldgxform | — | Transforms global loads to ldg.* (texture cache) for read-only data |
| GenericToNVVM | sub_215DC20 | generic-to-nvvm | 36KB | Moves globals from generic to global address space |
| Alloca Hoisting | sub_21BC7D0 | alloca-hoisting | — | Ensures all allocas are in entry block (PTX requirement) |
| Image Optimizer | sub_21BCF10 | — | — | Optimizes texture/surface access patterns |
| NVPTX Peephole | sub_21DB090 | nvptx-peephole | — | NVPTX-specific peephole optimization |
| Prolog/Epilog | sub_21DB5F0 | — | — | Custom frame management (PTX has no traditional prolog/epilog) |
| Replace Image Handles | sub_21DBEA0 | — | — | Replaces IR-level image handles with PTX texture/surface references |
| Extra MI Printer | sub_21E9E80 | extra-machineinstr-printer | — | Register pressure statistics reporting |
| Valid Global Names | sub_21BCD80 | nvptx-assign-valid-global-names | — | Sanitizes global names to valid PTX identifiers |
| NVVMIntrRange | sub_216F4B0 | nvvm-intr-range | — | Adds !range metadata to NVVM intrinsics (e.g., tid.x bounds) |
Major Proprietary Subsystems
Dead Synchronization Elimination — sub_2C84BA0
| Field | Value |
|---|---|
| Size | 96KB |
| Purpose | Removes redundant __syncthreads() barriers |
Bidirectional fixed-point dataflow analysis across the CFG, tracking four memory access categories per basic block through eight red-black tree maps. Each barrier deletion triggers a full restart of the analysis. Distinct from the lightweight basic-dbe. See dedicated page for full algorithm.
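The core redundancy test can be sketched with a toy straight-line model (my simplification, not CICC's algorithm: the real pass runs a bidirectional fixed-point dataflow over the whole CFG with four access categories; a single basic block with one "traffic since last barrier" bit is enough to show when a barrier is droppable):

```python
# Toy model: a __syncthreads() is redundant if no shared-memory access
# separates it from the previous barrier (or from function entry).
def eliminate_dead_barriers(ops):
    out = []
    dirty = False  # shared-memory traffic seen since the last kept barrier
    for op in ops:
        if op == "syncthreads":
            if dirty:          # barrier orders real traffic: keep it
                out.append(op)
                dirty = False
            # else: redundant barrier, drop it
        else:
            out.append(op)
            if op.startswith("shared_"):
                dirty = True
    return out

ops = ["shared_write", "syncthreads", "syncthreads", "local_add", "syncthreads"]
print(eliminate_dead_barriers(ops))  # only the first barrier survives
```

The back-to-back second barrier and the trailing barrier (preceded only by non-shared work) are both dropped, mirroring the pass's goal of removing barriers that order nothing.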
MemorySpaceOpt — Multi-Function Cluster
| Function | Size | Purpose |
|---|---|---|
sub_1C70910 | — | Pass entry point |
sub_1C6A6C0 | — | Pass variant |
sub_1CA2920 | 32KB | Address space resolution — "Cannot tell what pointer points to, assuming global memory space" |
sub_1CA9E90 | 28KB | Secondary resolver |
sub_1CA5350 | 45KB | Infrastructure |
sub_2CBBE90 | 71KB | Memory-space-specialized function cloning |
NV Rematerialization Cluster
| Function | Size | Role |
|---|---|---|
sub_1CE7DD0 | 67KB | Main driver — live-in/live-out analysis, skip decisions |
sub_1CE67D0 | 32KB | Block-level executor — "remat_", "uclone_" prefixes |
sub_1CE3AF0 | 56KB | Pull-in cost analysis — "Total pull-in cost = %d" |
NLO — Simplify Live Output
| Function | Size | Strings |
|---|---|---|
sub_1CE10B0 | 48KB | "Simplify Live Output", "nloNewBit", "newBit" |
sub_1CDC1F0 | 35KB | "nloNewAdd", "nloNewBit" |
Creates new add/bit operations to simplify live-out values at block boundaries.
IV Demotion — sub_1CD74B0
| Field | Value |
|---|---|
| Size | 75KB |
| Strings | "phiNode", "demoteIV", "newInit", "newInc", "argBaseIV", "newBaseIV", "iv_base_clone_", "substIV" |
Demotes induction variables (e.g., 64-bit to 32-bit), creates new base IVs, clones IV chains for register pressure reduction. Sub-pass of rematerialization. See dedicated page for full algorithm.
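A minimal sketch of the demotion legality check (the function name and structure are mine, not recovered symbols): a 64-bit IV can safely become 32-bit only if every value it takes, and every byte offset derived from it, fits in 32 bits.

```python
U32_MAX = 2**32 - 1

def can_demote(trip_count, stride, elem_size):
    # Largest value the induction variable reaches, and the largest
    # byte offset derived from it for addressing.
    max_iv = (trip_count - 1) * stride
    max_off = max_iv * elem_size
    return max_iv <= U32_MAX and max_off <= U32_MAX

print(can_demote(1 << 20, 1, 4))  # 1M elements: offsets fit in 32 bits
print(can_demote(1 << 31, 1, 4))  # derived byte offset overflows 32 bits
```

The payoff is register pressure: a demoted IV chain uses one 32-bit register where the original used a 64-bit pair.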
RLMCAST — sub_2D13E90
| Field | Value |
|---|---|
| Size | 67KB |
| Purpose | Register-level multicast instruction lowering |
Broadcasts a value to multiple register destinations. Uses 216-byte and 160-byte node structures.
Texture Group Merge (.Tgm) — sub_2DDE8C0
Groups texture load operations to hide latency. Uses a .Tgm suffix in scheduling and a function-pointer table (3 predicates) for grouping decisions.
NVVM Intrinsic Verifier — sub_2C7B6A0
| Field | Value |
|---|---|
| Size | 143KB |
| Purpose | Validates ALL NVVM intrinsics against SM capabilities |
Architecture-gated validation for every intrinsic call. Part of the three-layer NVVM verifier (230KB total).
NVVM Intrinsic Lowering — sub_2C63FB0
| Field | Value |
|---|---|
| Size | 140KB |
| Purpose | Lowers NVVM intrinsics to concrete operations |
Pattern-matching rewrite engine for llvm.nvvm.* intrinsics. Two levels (basic + barrier-aware), runs up to 10 times. See dedicated page for full dispatch table.
Base Address Strength Reduction — sub_2CA4A10
| Field | Value |
|---|---|
| Size | 58KB |
| Knobs | do-base-address-strength-reduce (two levels: 1 = no conditions, 2 = with conditions) |
Scans loop bodies for memory ops sharing a common base pointer, hoists the anchor computation, rewrites remaining addresses as (anchor + relative_offset). See dedicated page for the anchor selection algorithm.
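The rewrite can be illustrated on plain integer addresses (an assumption-heavy toy: the real pass operates on LLVM IR address computations, and its anchor-selection heuristic is more involved than taking the minimum):

```python
# Rewrite a set of absolute addresses used in a loop body as a single
# hoisted anchor plus small relative offsets.
def strength_reduce(addresses):
    anchor = min(addresses)                 # toy anchor choice
    return anchor, [a - anchor for a in addresses]

anchor, offsets = strength_reduce([0x1000, 0x1010, 0x1040])
print(hex(anchor), offsets)  # 0x1000 [0, 16, 64]
```

Only the anchor computation remains inside address generation; each memory op then needs just a cheap add of a compile-time offset.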
Common Base Elimination — sub_2CA8B00
| Field | Value |
|---|---|
| Size | 39KB |
| Purpose | Hoists shared base address expressions to dominating CFG points |
Operates at inter-block level (vs BASR intra-loop). The two passes form a complementary pair for comprehensive GPU address computation reduction. See dedicated page.
CSSA Transformation — sub_3720740
| Field | Value |
|---|---|
| Size | 22KB |
| Purpose | Conventional-SSA for GPU divergent control flow |
| Knobs | do-cssa, cssa-coalesce, cssa-verbosity, dump-before-cssa |
| Debug | "IR Module before CSSA" |
Rewrites PHI nodes to be safe under warp-divergent execution by inserting explicit copy instructions at reconvergence points. See dedicated page for the divergence model.
NVIDIA Codegen Knobs — sub_1C20170
70+ knobs parsed from the NVVM container format:
Graphics Pipeline
VSIsVREnabled, VSIsLastVTGStage, EnableZeroCoverageKill, AllowComputeDerivatives, AllowDerivatives, EnableNonUniformQuadDerivatives, UsePIXBAR, ManageAPICallDepth
Compute / Memory
DisableSAMRAM, DoMMACoalescing, DisablePartialHalfVectorWrites, AssumeConvertMemoryToRegProfitable, MSTSForceOneCTAPerSMForSmemEmu, AddDepFromGlobalMembarToCB
Register Allocation / Scheduling
AdvancedRemat, CSSACoalescing, DisablePredication, DisableXBlockSched, ReorderCSE, ScheduleKils, NumNopsAtStart, DisableERRBARAfterMEMBAR
Type Promotion
PromoteHalf, PromoteFixed, FP16Mode, IgnoreRndFtzOnF32F16Conv, DisableLegalizeIntegers
PGO
PGOProfileKind, PGOEpoch, PGOBatchSize, PGOCounterMemBaseVAIndex
Knob Forwarding
OCGKnobs, OCGKnobsFile, NVVMKnobsString, OmegaKnobs, FinalizerKnobs
Compile Modes — sub_1C21CE0
| Mode | Constant |
|---|---|
| Whole-program no-ABI | NVVM_COMPILE_MODE_WHOLE_PROGRAM_NOABI |
| Whole-program ABI | NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI |
| Separate ABI | NVVM_COMPILE_MODE_SEPARATE_ABI |
| Extensible WP ABI | NVVM_COMPILE_MODE_EXTENSIBLE_WHOLE_PROGRAM_ABI |
| Opt Level | Constant |
|---|---|
| None | NVVM_OPT_LEVEL_NONE |
| 1 | NVVM_OPT_LEVEL_1 |
| 2 | NVVM_OPT_LEVEL_2 |
| 3 | NVVM_OPT_LEVEL_3 |
| Debug Info | Constant |
|---|---|
| None | NVVM_DEBUG_INFO_NONE |
| Line info | NVVM_DEBUG_INFO_LINE_INFO |
| Full DWARF | NVVM_DEBUG_INFO_DWARF |
NVVMReflect
The NVVMReflect pass resolves calls to __nvvm_reflect() -- a compile-time introspection mechanism that lets CUDA device code query compilation parameters such as the target GPU architecture, flush-to-zero mode, and precision settings. Each __nvvm_reflect("__CUDA_ARCH") call is replaced with an integer constant derived from the target SM version, and each __nvvm_reflect("__CUDA_FTZ") is replaced with 0 or 1 depending on the -ftz flag. After replacement, the constant result feeds into conditional branches that standard LLVM passes (SimplifyCFG, SCCP, ADCE) can fold away, eliminating dead architecture-specific code paths at compile time. This is NVIDIA's primary mechanism for producing architecture-specialized code from a single portable source: libdevice alone contains hundreds of __nvvm_reflect calls that select between FTZ and non-FTZ instruction variants.
The pass is relatively small in code size but architecturally critical -- it runs multiple times at different pipeline positions because inlining, loop unrolling, and other transformations continuously expose new __nvvm_reflect calls that were previously hidden inside un-inlined function bodies.
Key Facts
| Property | Value |
|---|---|
| Pass factory | sub_1857160 |
| Pass level | Function pass (runs per-function) |
| Registration | Legacy PM only (not separately registered in New PM); post-processor nvvm-reflect-pp is New PM #381 at line 2237 |
| Runtime positions | Tier 0 #7; Tier 1/2/3 #9, #73 (see Pipeline) |
| Pipeline disable flag | NVVMPassOptions offset +880 |
| Knob | nvvm-reflect-enable (boolean, default: true) |
| Global knob constructor | ctor_271 |
| Vtable (likely) | unk_3C2026C |
| Post-processing pass | nvvm-reflect-pp = SimplifyConstantConditionalsPass |
| New PM registration | Not separately registered -- NVVMReflect is a legacy-PM pass invoked from the pipeline assembler; nvvm-reflect-pp is the New PM companion at registration line 2237 of sub_2342890 |
| Upstream equivalent | NVVMReflect in llvm/lib/Target/NVPTX/NVVMReflect.cpp |
| Occurrences in pipeline | ~8 invocations across all paths (see Multi-Run Pattern) |
Reflect Query Names
The __nvvm_reflect mechanism supports a fixed set of query strings. These are embedded as global string constants in NVVM IR (typically from libdevice bitcode) and matched by the pass:
| Query String | Meaning | Value Source |
|---|---|---|
__CUDA_ARCH | Target GPU compute capability | -arch=compute_XX flag, encoded as major*100 + minor*10 |
__CUDA_FTZ | Flush-to-zero mode for single-precision | -ftz=1 sets to 1; default 0 |
__CUDA_PREC_DIV | Precise division mode | -prec-div=1 sets to 1; default 0 |
__CUDA_PREC_SQRT | Precise square root mode | -prec-sqrt=1 sets to 1; default 0 |
__CUDA_ARCH Values
The __CUDA_ARCH value is an integer encoding SM_major * 100 + SM_minor * 10, propagated from the CLI through the EDG frontend as -R __CUDA_ARCH=NNN:
| Architecture | __CUDA_ARCH | SM Variants |
|---|---|---|
| Turing | 750 | sm_75 |
| Ampere | 800, 860, 870, 880 | sm_80, sm_86, sm_87, sm_88 |
| Ada Lovelace | 890 | sm_89 |
| Hopper | 900 | sm_90, sm_90a (both share 900) |
| Blackwell | 1000, 1030 | sm_100/100a/100f, sm_103/103a/103f |
| (SM 11.x) | 1100 | sm_110/110a/110f |
| (SM 12.x) | 1200, 1210 | sm_120/120a/120f, sm_121/121a/121f |
Note: Architecture variants with an a (accelerated) or f (forward-compatible) suffix share the same __CUDA_ARCH value as their base. They differ only in -opt-arch and -mcpu flags, which affect instruction selection and scheduling but not reflect queries.
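The encoding rule above can be captured in a few lines (a sketch; the suffix stripping mirrors the note that a/f variants share their base value):

```python
# __CUDA_ARCH = SM_major * 100 + SM_minor * 10, as documented above.
def cuda_arch(sm):
    digits = sm.removeprefix("sm_").rstrip("af")  # "sm_90a" -> "90"
    major, minor = int(digits[:-1]), int(digits[-1])
    return major * 100 + minor * 10

for sm in ("sm_75", "sm_90", "sm_90a", "sm_100f", "sm_121"):
    print(sm, cuda_arch(sm))
```

Note how the minor digit is scaled by 10, which is why sm_103 maps to 1030 rather than 1003.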
Algorithm
The NVVMReflect pass implements a straightforward pattern-matching replacement. In pseudocode:
bool NVVMReflectPass::runOnFunction(Function &F) {
  bool changed = false;
  if (!nvvm_reflect_enable)  // controlled by 'nvvm-reflect-enable' knob
    return false;

  SmallVector<CallInst *, 8> reflect_calls;

  // Phase 1: Collect all __nvvm_reflect call sites
  for (BasicBlock &BB : F) {
    for (Instruction &I : BB) {
      if (auto *CI = dyn_cast<CallInst>(&I)) {
        Function *callee = CI->getCalledFunction();
        if (callee && callee->getName() == "__nvvm_reflect")
          reflect_calls.push_back(CI);
      }
    }
  }

  // Phase 2: Resolve each call to a constant
  for (CallInst *CI : reflect_calls) {
    // Extract the query string from the first argument.
    // The argument is a pointer to a global constant string:
    //   @.str = private constant [12 x i8] c"__CUDA_ARCH\00"
    // The pass traces through the GEP/bitcast to find the
    // ConstantDataArray initializer.
    StringRef query = extractStringArgument(CI->getArgOperand(0));

    int result = 0;
    if (query == "__CUDA_ARCH")
      result = sm_version;            // e.g., 900 for sm_90
    else if (query == "__CUDA_FTZ")
      result = ftz_enabled ? 1 : 0;
    else if (query == "__CUDA_PREC_DIV")
      result = prec_div ? 1 : 0;
    else if (query == "__CUDA_PREC_SQRT")
      result = prec_sqrt ? 1 : 0;
    else
      result = 0;                     // unknown query => 0

    // Replace the call with the constant integer
    CI->replaceAllUsesWith(ConstantInt::get(CI->getType(), result));
    CI->eraseFromParent();
    changed = true;
  }
  return changed;
}
The string extraction logic must handle the IR pattern produced by the CUDA frontend and libdevice linking:
@.str = private unnamed_addr constant [12 x i8] c"__CUDA_ARCH\00", align 1
%1 = call i32 @__nvvm_reflect(ptr @.str)
The pass walks through the argument operand, stripping ConstantExpr GEPs and bitcasts, to reach the ConstantDataArray containing the query string. If the argument is not a resolvable constant string, the call is left unmodified as a safety fallback -- in practice, all reflect calls use literal string arguments.
Interaction with Constant Propagation and Dead Code Elimination
The reflect replacement produces a constant integer that feeds directly into an icmp and conditional branch. This is the canonical pattern in libdevice:
Before NVVMReflect (from libdevice.10.ll, function __nv_floorf):
define float @__nv_floorf(float %f) {
  %1 = call i32 @__nvvm_reflect(ptr @.str)   ; @.str = "__CUDA_FTZ"
  %2 = icmp ne i32 %1, 0
  br i1 %2, label %ftz_path, label %precise_path
ftz_path:
  %3 = call float @llvm.nvvm.floor.ftz.f(float %f)
  br label %merge
precise_path:
  %4 = call float @llvm.nvvm.floor.f(float %f)
  br label %merge
merge:
  %.0 = phi float [ %3, %ftz_path ], [ %4, %precise_path ]
  ret float %.0
}
After NVVMReflect (with -ftz=1):
define float @__nv_floorf(float %f) {
  %2 = icmp ne i32 1, 0                      ; constant 1 replaces the call
  br i1 %2, label %ftz_path, label %precise_path
ftz_path:
  %3 = call float @llvm.nvvm.floor.ftz.f(float %f)
  br label %merge
precise_path:                                ; now unreachable
  %4 = call float @llvm.nvvm.floor.f(float %f)
  br label %merge
merge:
  %.0 = phi float [ %3, %ftz_path ], [ %4, %precise_path ]
  ret float %.0
}
After SimplifyCFG / SCCP / ADCE (subsequent passes):
define float @__nv_floorf(float %f) {
  %1 = call float @llvm.nvvm.floor.ftz.f(float %f)
  ret float %1
}
The icmp ne i32 1, 0 folds to true, SimplifyCFG eliminates the dead branch, and ADCE removes the unused llvm.nvvm.floor.f call. The function collapses from 4 basic blocks to 1.
This pattern repeats for every libdevice math function: __nv_fabsf, __nv_fminf, __nv_fmaxf, __nv_rsqrtf, __nv_exp2f, and dozens more all contain the same __nvvm_reflect("__CUDA_FTZ") branch. After reflect resolution, each function specializes to either FTZ or precise mode.
__CUDA_ARCH branching pattern
For architecture-dependent code, the pattern uses inequality comparisons:
  %arch = call i32 @__nvvm_reflect(ptr @.str.1)   ; "__CUDA_ARCH"
  %is_sm80_plus = icmp sge i32 %arch, 800
  br i1 %is_sm80_plus, label %sm80_path, label %legacy_path
sm80_path:
  ; use SM 8.0+ specific intrinsics (e.g., async copy, cp.async)
  ...
legacy_path:
  ; fallback path for older architectures
  ...
After NVVMReflect replaces %arch with (e.g.) 900 for Hopper, the comparison icmp sge i32 900, 800 folds to true, and the legacy path is eliminated.
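The fold is simple enough to model directly (a toy, not CICC code): once the reflect result is a known constant, the comparison is decidable at compile time and exactly one path is selected.

```python
# Model of the folded __CUDA_ARCH branch: with %arch constant, the
# icmp sge comparison reduces to a plain Python comparison and the
# compiler keeps only the winning path.
def select_path(arch, threshold=800):
    return "sm80_path" if arch >= threshold else "legacy_path"

print(select_path(900))  # Hopper takes the SM 8.0+ path
print(select_path(750))  # Turing takes the legacy fallback
```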
Multi-Run Pattern
NVVMReflect (sub_1857160) is invoked multiple times across the pipeline because optimization passes continuously expose new reflect calls. The key insight is that __nvvm_reflect calls originate primarily from libdevice functions, which are linked as bitcode and initially exist as un-inlined function calls. Each inlining pass expands these functions inline, exposing their internal __nvvm_reflect calls to the containing function.
Tier 0 Pipeline (Full Optimization via sub_12DE330)
In the Tier 0 (O1/O2/O3) full optimization pipeline, NVVMReflect appears once:
| Position | Factory | Context |
|---|---|---|
| #7 | sub_1857160() | After CGSCC inliner (#2), GVN (#5-6). Catches reflect calls exposed by first-round inlining |
"mid" Path Pipeline (Ofcmid/Ofcmin via sub_12E54A0 PATH B)
In the "mid" fast-compile path, NVVMReflect appears at three distinct positions:
| Position | Factory | Guard | Context |
|---|---|---|---|
| After CGSCC pipeline #8 | sub_1857160() | !opts[880] | After aggressive CGSCC inlining (8 iterations). Catches reflect calls from freshly inlined libdevice bodies |
| After Sinking2 + EarlyCSE | sub_1857160() | !opts[880] | After loop transformations and code motion. Catches reflect calls in loop bodies after unrolling |
| (appears once more in late position) | sub_1857160() | !opts[880] | Final cleanup after late CGSCC pass and NVVMIntrinsicLowering |
Default/General Path Pipeline (PATH C)
In the default path (external bitcode input), NVVMReflect appears at three positions:
| Position | Factory | Context |
|---|---|---|
| After CGSCC pipeline #4 | sub_1857160() | First resolution after initial inlining |
| After NVVMIntrinsicLowering | sub_1857160() | Intrinsic lowering may expose new reflect patterns |
| After LoopUnroll + InstCombine | sub_1857160() | Loop unrolling duplicates loop bodies containing reflect calls |
Tiered Pipeline Insertions (sub_12DE8F0)
Within the tiered sub-pipeline, NVVMReflect appears with additional gating:
| Tier | Guard | Position |
|---|---|---|
| 1, 2, 3 | opts[3200] && !opts[880] | Mid-tier, after NVVMVerifier and IPConstPropagation |
| 3 only | opts[3200] && tier==3 && !opts[880] | Late-tier, after ADCE and LoopOpt/BarrierOpt. This extra run at O3 catches reflect calls exposed by the most aggressive transformations |
Why Multiple Runs Are Necessary
Consider this scenario:
1. User code calls __nv_sinf(x) (a libdevice function).
2. Initially, __nv_sinf is an external function call -- its body contains __nvvm_reflect("__CUDA_FTZ"), but the reflect call is not visible to the optimizer.
3. First NVVMReflect run: no-op for this function (the reflect is inside __nv_sinf's body, which has not been inlined yet).
4. CGSCC Inliner runs: inlines __nv_sinf into the caller, expanding its body with the __nvvm_reflect call.
5. Second NVVMReflect run: now sees the freshly-inlined __nvvm_reflect call and resolves it to a constant.
6. Loop Unrolling runs: if the __nv_sinf call was inside a loop, unrolling duplicates the call site. If the loop body was too complex to inline before unrolling simplified it, a third inlining opportunity may arise.
7. Third NVVMReflect run: resolves any remaining reflect calls exposed by unrolling + re-inlining.
Without multiple runs, libdevice functions inlined late in the pipeline would retain their reflect-based branching, defeating the specialization mechanism and leaving dead code paths in the final binary.
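The scenario above can be simulated with a toy inliner and reflect pass (all names and the op encoding are mine): a single reflect run before inlining resolves nothing, while interleaved runs reach the fixed point.

```python
FTZ = 1  # assume -ftz=1 for this sketch

def reflect_run(body):
    # Resolve every visible __nvvm_reflect("__CUDA_FTZ") to a constant.
    return [("const", FTZ) if op == ("reflect", "__CUDA_FTZ") else op
            for op in body]

def inline_run(body, lib):
    # Splice known library function bodies into the caller.
    out = []
    for op in body:
        if op[0] == "call" and op[1] in lib:
            out.extend(lib[op[1]])
        else:
            out.append(op)
    return out

lib = {"__nv_sinf": [("reflect", "__CUDA_FTZ"), ("math", "sin")]}
kernel = [("call", "__nv_sinf")]

kernel = reflect_run(kernel)      # run 1: reflect hidden in callee, no-op
kernel = inline_run(kernel, lib)  # inliner exposes the reflect call
kernel = reflect_run(kernel)      # run 2: now it resolves to a constant
print(kernel)
```

Dropping the second reflect_run leaves the ("reflect", ...) op in the kernel, which is exactly the failure mode the multi-run pipeline prevents.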
The nvvm-reflect-pp Post-Processing Pass
After NVVMReflect replaces calls with constants, the resulting IR contains trivially-foldable comparisons and dead branches. While standard LLVM passes (SimplifyCFG, ADCE) handle most of this, NVIDIA registers a dedicated post-processing pass under the misleading name nvvm-reflect-pp.
Despite its name, nvvm-reflect-pp is SimplifyConstantConditionalsPass (class llvm::SimplifyConstantConditionalsPass), not a reflection pass. It is a targeted dead-branch elimination pass that:
- Finds conditional branches where the condition is a constant (icmp with both operands constant).
- Replaces the branch with an unconditional branch to the taken target.
- Marks the not-taken successor as potentially unreachable.
- Cleans up resulting dead phi nodes and empty blocks.
This pass is registered in the New PM at sub_2342890 line 2237 as a function-level pass. It runs immediately after NVVMReflect in some pipeline configurations to ensure that reflected constants are cleaned up before subsequent optimization passes see the IR.
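A minimal model of this constant-branch cleanup (a sketch under my own toy CFG encoding, not the recovered implementation):

```python
def simplify(cfg, entry):
    # cfg maps block -> terminator:
    #   ("cbr", cond, taken, nottaken) | ("br", target) | ("ret",)
    # Step 1: rewrite constant conditional branches to unconditional ones.
    for blk, term in list(cfg.items()):
        if term[0] == "cbr" and isinstance(term[1], int):
            cfg[blk] = ("br", term[2] if term[1] else term[3])
    # Step 2: drop blocks no longer reachable from entry.
    reachable, work = set(), [entry]
    while work:
        b = work.pop()
        if b in reachable:
            continue
        reachable.add(b)
        term = cfg[b]
        if term[0] == "br":
            work.append(term[1])
        elif term[0] == "cbr":
            work.extend(term[2:])
    return {b: t for b, t in cfg.items() if b in reachable}

cfg = {"entry": ("cbr", 1, "ftz", "precise"),
       "ftz": ("br", "merge"), "precise": ("br", "merge"), "merge": ("ret",)}
print(simplify(cfg, "entry"))  # the precise block is gone
```

This mirrors the libdevice example above: the constant condition selects the FTZ path, and the precise path becomes unreachable and is removed.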
Configuration
| Knob | Type | Default | Effect |
|---|---|---|---|
nvvm-reflect-enable | bool | true | Master enable for NVVMReflect. When false, all __nvvm_reflect calls are left unresolved (they default to 0 at link time, selecting the non-FTZ/non-precise/lowest-arch path). |
Pipeline Disable Flag
NVVMPassOptions offset +880 is the per-compilation disable flag for NVVMReflect. When set (e.g., by an internal debugging mechanism), all pipeline insertion points skip the pass via the !opts[880] guard. This flag is distinct from the nvvm-reflect-enable knob: the knob controls the pass's internal behavior, while the pipeline flag prevents the pass from being added to the pipeline at all.
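The two-level gating can be sketched as follows (structure inferred from the text above; the dict-based opts table and function-list pipeline are my stand-ins for NVVMPassOptions and the pipeline assembler):

```python
def build_pipeline(opts, nvvm_reflect_enable=True):
    pipeline = []
    if not opts.get(880, 0):                    # pipeline disable flag
        def nvvm_reflect(module):
            if not nvvm_reflect_enable:         # cl::opt knob, checked at run time
                return module
            return [op for op in module if op != "reflect_call"] + ["const"]
        pipeline.append(nvvm_reflect)
    return pipeline

module = ["reflect_call", "other"]
for p in build_pipeline({880: 0}):
    module = p(module)
print(module)                                   # reflect resolved
print(build_pipeline({880: 1}))                 # pass never inserted: []
```

The distinction matters for debugging: flipping opts[880] removes the pass from the pipeline entirely, while disabling the knob leaves the pass scheduled but inert.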
Reflect Value Propagation Path
The reflect query values flow from the CLI through three layers:
- CLI: -arch=compute_90 is parsed by sub_95EB40 / sub_12C8DD0
- EDG frontend: receives -R __CUDA_ARCH=900 and defines the preprocessor macro
- Optimizer: receives -opt-arch=sm_90. The NVVMReflect pass reads the SM version from the target machine configuration (not from -R flags -- those are for the preprocessor)
For FTZ/precision flags, the path is:
- -ftz=1 maps to -R __CUDA_FTZ=1 (EDG) and -nvptx-f32ftz (optimizer/backend)
- The NVVMReflect pass reads the FTZ setting from the NVPTX subtarget or a global variable set during pipeline configuration
Differences from Upstream LLVM
Upstream LLVM's NVVMReflect pass (in llvm/lib/Target/NVPTX/NVVMReflect.cpp) is functionally similar, but CICC v13.0 differs from it in several respects:
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Pipeline placement | Runs once, typically early | Runs ~8 times at strategic positions throughout the pipeline |
| Post-processing | Relies on standard SimplifyCFG | Has dedicated nvvm-reflect-pp (SimplifyConstantConditionalsPass) |
| Pipeline integration | New PM function pass | Legacy PM function pass invoked from the pipeline assembler (sub_12E54A0), with the pipeline disable flag at NVVMPassOptions[880] |
| Tier 3 extra run | Not applicable | Extra late-pipeline run gated by tier==3 for O3-only cleanup |
| Query string set | __CUDA_ARCH, __CUDA_FTZ | Same set plus __CUDA_PREC_DIV, __CUDA_PREC_SQRT |
The multi-run strategy is the most significant difference. Upstream LLVM assumes that NVVMReflect runs once before optimization, resolving all reflect calls in the linked libdevice bitcode. CICC's pipeline accounts for the reality that aggressive inlining and loop transformations in a GPU-focused compiler expose reflect calls at many different pipeline stages.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVVMReflect pass factory | sub_1857160 | -- | Creates and returns a new NVVMReflect pass instance |
| NVVMReflect constructor knob | ctor_271 | -- | Registers nvvm-reflect-enable cl::opt |
| SimplifyConstantConditionalsPass (nvvm-reflect-pp) | registered at line 2237 of sub_2342890 | -- | Post-reflect dead branch cleanup |
| Pipeline assembler | sub_12E54A0 | -- | Inserts NVVMReflect at multiple positions |
| Tier 0 pipeline builder | sub_12DE330 | -- | Inserts NVVMReflect as pass #7 |
| Tiered sub-pipeline | sub_12DE8F0 | -- | Inserts NVVMReflect at tier-gated positions |
| Architecture detection table | sub_95EB40 | -- | Maps -arch=compute_XX to __CUDA_ARCH values |
| Architecture detection (libnvvm) | sub_12C8DD0 | -- | Parallel mapping table for the libnvvm path |
Test This
The following kernel calls a libdevice math function whose implementation branches on __CUDA_FTZ and __CUDA_ARCH. Compile for two configurations and compare the PTX to see NVVMReflect in action.
#include <math.h>
__global__ void reflect_test(float* out, const float* in, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < n) {
out[tid] = sinf(in[tid]);
}
}
Compile twice:
nvcc -ptx -arch=sm_90 -ftz=true reflect_test.cu -o reflect_ftz.ptx
nvcc -ptx -arch=sm_90 -ftz=false reflect_test.cu -o reflect_noftz.ptx
What to look for in PTX:
- With -ftz=true: the PTX should contain flush-to-zero math instructions (e.g., sin.approx.ftz.f32). The NVVMReflect pass resolved __nvvm_reflect("__CUDA_FTZ") to 1, SimplifyCFG folded the branch, and only the FTZ code path survived.
- With -ftz=false: the PTX should contain precise math instructions without the .ftz suffix. The reflect resolved to 0, selecting the non-FTZ path.
- The key evidence is that the PTX contains only one code path -- no conditional branch choosing between FTZ and non-FTZ variants. If both paths survive, NVVMReflect or its downstream cleanup passes failed.
- Comparing -arch=sm_75 vs. -arch=sm_90 exercises the __CUDA_ARCH reflect. Functions like __nv_dsqrt_rn use architecture comparisons (icmp sge i32 %arch, 800) to select between SM 8.0+ instruction sequences and legacy fallbacks.
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent compile-time reflection mechanism.
1. Returning the wrong __CUDA_ARCH encoding. The __CUDA_ARCH value is major * 100 + minor * 10, not major * 10 + minor. For SM 9.0, the correct value is 900, not 90. For SM 10.0, the correct value is 1000, not 100. A reimplementation that uses the wrong encoding will select the wrong code paths in libdevice, potentially enabling instructions not supported by the target architecture (e.g., SM 7.0 paths on an SM 9.0 target) or disabling instructions that should be available. This encoding is also used by the CUDA preprocessor (__CUDA_ARCH__), so consistency between the frontend macro and the reflect value is critical.
2. Running NVVMReflect only once in the pipeline. The pass must run multiple times (approximately 8 invocations across the full pipeline) because __nvvm_reflect calls are hidden inside un-inlined libdevice function bodies. The first run resolves calls visible at the top level, but each subsequent inlining pass exposes new reflect calls from freshly inlined libdevice functions. A reimplementation with a single early invocation will leave reflected branches unresolved in all functions inlined after that point, resulting in both FTZ and non-FTZ code paths surviving to the final binary -- doubling code size and defeating the entire specialization mechanism.
3. Not running SimplifyConstantConditionalsPass (nvvm-reflect-pp) after reflect resolution. After NVVMReflect replaces __nvvm_reflect("__CUDA_FTZ") with the constant 1, the IR contains icmp ne i32 1, 0 feeding a conditional branch. If no pass simplifies this to an unconditional branch, the dead code path survives through the rest of the pipeline, consuming compile time in every subsequent pass and inflating the final binary. While standard LLVM SimplifyCFG will eventually handle it, the dedicated nvvm-reflect-pp pass provides immediate cleanup at the point where it matters most.
4. Returning 0 for unknown query strings instead of propagating a diagnostic. The pass returns 0 for any unrecognized __nvvm_reflect query string. This is the correct behavior (documented default), but a reimplementation that raises an error or leaves the call unresolved will break forward compatibility: future CUDA toolkit versions may introduce new query strings that libdevice checks. The value 0 is the safe default because libdevice code always treats 0 as "feature not available" and falls back to the conservative code path.
5. Reading the SM version from the wrong source. The reflect query values flow through three layers: CLI (-arch=compute_90), EDG frontend (-R __CUDA_ARCH=900), and optimizer (-opt-arch=sm_90). The NVVMReflect pass must read the SM version from the target machine configuration (the optimizer-level value), not from the -R preprocessor flags. A reimplementation that reads from the wrong layer may get a stale or mismatched value, especially in LTO scenarios where the preprocessor flags were consumed during an earlier compilation phase.
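Pitfalls 1 and 4 can be folded into a tiny model of the reflect resolver. The following sketch captures the arch encoding and the default-0 fallback described above; the function and table names are illustrative, not recovered from the binary:

```python
# Minimal model of __nvvm_reflect query resolution.
# Encoding rule: __CUDA_ARCH = major * 100 + minor * 10 (pitfall 1);
# unknown queries resolve to 0, never an error (pitfall 4).
def cuda_arch_value(major: int, minor: int) -> int:
    return major * 100 + minor * 10

def resolve_reflect(query: str, sm_major: int, sm_minor: int, ftz: bool) -> int:
    table = {
        "__CUDA_ARCH": cuda_arch_value(sm_major, sm_minor),
        "__CUDA_FTZ": 1 if ftz else 0,
    }
    # 0 means "feature not available" -- libdevice falls back to the
    # conservative code path, preserving forward compatibility.
    return table.get(query, 0)
```

The same encoding is what the frontend exposes as `__CUDA_ARCH__`, which is the consistency requirement pitfall 1 describes.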
Cross-References
- Optimizer Pipeline -- NVVMReflect pipeline positions and the NVVMPassOptions system
- NVIDIA Custom Passes -- registry of all NVIDIA-proprietary passes
- NVVM Intrinsic Constant-Fold Eligibility (K02) -- sub_14D90D0, the companion pass that checks whether an intrinsic can be constant-folded (NVVMReflect calls are resolved before K02 runs)
- Architecture Detection -- the sub_95EB40 table that maps CLI flags to __CUDA_ARCH values
- Optimization Levels -- how NVVMReflect placement varies across O0/O1/O2/O3 and fast-compile tiers
NVVM IR Verifier (Deep Dive)
The NVVM IR Verifier (nvvm-verify) is NVIDIA's three-layer correctness gate that runs between optimization passes throughout the CICC pipeline. Unlike LLVM's generic Verifier pass, which validates structural IR invariants, this pass enforces the complete NVVM IR contract: valid target triples, legal address space usage, architecture-gated intrinsic availability, MMA dimension/type constraints, function attribute restrictions, and atomic operation rules. It is the single largest verification subsystem in CICC at approximately 230KB across three cooperating functions. The verifier is inserted at roughly a dozen points in every optimization tier, guarded only by NVVMPassOptions[600] (disable). Every NVVM intrinsic call, every address space cast, and every unsupported CPU-oriented feature triggers a check here; failure produces a diagnostic message and sets the module error flag, but compilation continues to collect as many errors as possible in a single run.
Key Facts
| Property | Value |
|---|---|
| Pass name | nvvm-verify |
| Pass class | llvm::NVVMIRVerifierPass |
| Registration | sub_2342890 (New PM), sub_12E54A0 (pipeline builder) |
| Entry point | sub_12D4560 |
| Module verifier | sub_2C80C90 (51KB, ~1671 lines) |
| Function verifier | sub_2C771D0 (36KB, ~1165 lines) |
| Intrinsic verifier | sub_2C7B6A0 (143KB, ~4139 lines) |
| Binary size | ~230KB decompiled |
| Pipeline slot | ~12 per tier (O1-O3), after GVN, after DSE, after LICM, etc. |
| Disable flag | NVVMPassOptions[600] (bool) |
| Primary knobs | nvvm-verify-show-info |
| Error model | Accumulate-and-continue (no early abort) |
| SM encoding | Internal SM * 10 (e.g., sm_90 = 900) at context offset +8 |
| Upstream equivalent | None -- fully proprietary |
Three-Layer Verification Architecture
The pass operates as three nested verification functions. The module verifier is the entry point; it calls the function verifier once per function, and the function verifier dispatches to the intrinsic verifier for every intrinsic call instruction.
sub_2C80C90 (NVVMModuleVerifier)
|
+-- Validate data layout string
+-- Validate target triple against whitelist
+-- sub_2C797D0() for each global variable
+-- sub_2C7A130() for each function declaration
+-- sub_2C7AA20() for each named metadata node
|
+-- For each function:
| |
| +-- sub_2C771D0 (NVVMFunctionVerifier)
| | +-- Cluster dimension validation (Hopper+ gate)
| | +-- Parameter width validation (>=32-bit or sext/zext)
| | +-- Function attribute rejection (17 attributes)
| | +-- Entry/exit handler constraints
| |
| +-- For each instruction in each basic block:
| |
| +-- Switch on opcode 0x1E..0x60
| +-- Opcode 0x55 (intrinsic call) --> sub_2C7B6A0
| (NVVMIntrinsicVerifier, 143KB)
| +-- Switch on intrinsic ID
| +-- SM version gate checks
| +-- Type, address space, constant arg validation
| +-- MMA shape/type cross-validation
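The nesting above can be modeled as three functions sharing one context. A schematic sketch -- the data shapes, names, and the two example checks are invented for illustration, only the call structure mirrors the recovered design:

```python
# Schematic of the three-layer dispatch: module -> function -> intrinsic.
# Instructions are modeled as dicts; opcode 0x55 marks an intrinsic call.
def verify_module(ctx, module):
    for func in module["functions"]:
        verify_function(ctx, func)
        for inst in func["instructions"]:
            if inst["opcode"] == 0x55:          # intrinsic call
                verify_intrinsic(ctx, inst)

def verify_function(ctx, func):
    # Example function-level check: cluster dims gated on Hopper+ (sm >= 900).
    if func.get("cluster_dims") and ctx["sm"] <= 899:
        ctx["errors"].append("Cluster dimensions ... pre-Hopper Architectures")

def verify_intrinsic(ctx, inst):
    # Example intrinsic-level check: SM version gate.
    if inst.get("min_sm", 0) > ctx["sm"]:
        ctx["errors"].append(f"{inst['name']}: not supported on this SM")
```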
Context Object Layout
All three verifiers share a context object passed as the first argument:
| Offset | Type | Field |
|---|---|---|
| 0 | int32 | Mode (0 = standard, 1 = UnifiedNVVMIR) |
| 4 | int32 | Flags |
| 8 | int32 | SM version (SM * 10, e.g., 900 for sm_90) |
| 16 | ptr | Error flag pointer (set on any validation failure) |
| 24 | ptr | Error output stream |
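The recovered layout can be written down as an explicit struct. A sketch using Python's ctypes to confirm the offsets -- field names are descriptive guesses (the binary is stripped), and the 4-byte pad at offset 12 is implied by the jump from the int32 at offset 8 to the pointer at offset 16:

```python
import ctypes

# Mirrors the recovered 32-byte context layout on x86-64.
class VerifierContext(ctypes.Structure):
    _fields_ = [
        ("mode",       ctypes.c_int32),   # 0 = standard, 1 = UnifiedNVVMIR
        ("flags",      ctypes.c_int32),
        ("sm_version", ctypes.c_int32),   # SM * 10, e.g. 900 for sm_90
        ("_pad",       ctypes.c_int32),   # alignment padding (inferred)
        ("error_flag", ctypes.c_void_p),  # set on any validation failure
        ("error_out",  ctypes.c_void_p),  # error output stream
    ]
```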
Target Triple Whitelist
The module verifier validates the module's target triple against two whitelists depending on mode.
UnifiedNVVMIR Mode (mode == 1) -- Exact Match
Eight triples are accepted:
| Triple | Arch | API |
|---|---|---|
| nvptx-nvidia-cuda | 32-bit PTX | CUDA |
| nvptx64-nvidia-cuda | 64-bit PTX | CUDA |
| nvptx-nvidia-nvcl | 32-bit PTX | OpenCL |
| nvptx64-nvidia-nvcl | 64-bit PTX | OpenCL |
| nvsass-nvidia-cuda | SASS direct | CUDA |
| nvsass-nvidia-nvcl | SASS direct | OpenCL |
| nvsass-nvidia-directx | SASS direct | DirectX |
| nvsass-nvidia-spirv | SASS direct | SPIR-V |
The nvsass triples confirm that CICC can compile directly to native GPU assembly (SASS) without the PTX intermediate step, and can do so for DirectX shader and SPIR-V/Vulkan shader pipelines. This reveals CICC's role in NVIDIA's shader compiler toolchain beyond CUDA.
Failure message: "Invalid target triple".
Standard Mode (mode != 1) -- Prefix + Suffix Match
The triple must begin with "nvptx-" or "nvptx64-" and end with "-cuda". The middle component is wildcarded.
Failure message: "Invalid target triple (<actual>), must be one of:" followed by "nvptx-*-cuda" and "nvptx64-*-cuda".
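The two-mode check reduces to an exact-match set plus a prefix/suffix test. A sketch (set contents come from the table above; the function name is illustrative):

```python
# Sketch of the two-mode triple validation described above.
UNIFIED_TRIPLES = {
    "nvptx-nvidia-cuda", "nvptx64-nvidia-cuda",
    "nvptx-nvidia-nvcl", "nvptx64-nvidia-nvcl",
    "nvsass-nvidia-cuda", "nvsass-nvidia-nvcl",
    "nvsass-nvidia-directx", "nvsass-nvidia-spirv",
}

def triple_is_valid(triple: str, mode: int) -> bool:
    if mode == 1:                       # UnifiedNVVMIR: exact match
        return triple in UNIFIED_TRIPLES
    # Standard mode: nvptx[64]- prefix, -cuda suffix, vendor wildcarded
    return (triple.startswith(("nvptx-", "nvptx64-"))
            and triple.endswith("-cuda"))
```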
Data Layout Validation
If the module's data layout string is empty: "Empty target data layout, must exist".
Otherwise, sub_2C74F70 parses and validates the layout. On failure, the verifier prints "Example valid data layout:" with reference strings from:
| Global | Description |
|---|---|
| off_4C5D0A0 | 32-bit layout example |
| off_4C5D0A8 | 64-bit layout example |
| off_4C5D070 | 64-bit with mixed pointer widths (p3:32:32:32) |
Per-Instruction Validation (Module Verifier)
After calling sub_2C771D0 for function-level checks, the module verifier iterates every instruction in every basic block and dispatches on the LLVM IR opcode. The opcode range is 0x1E through 0x60:
| Opcode | IR Instruction | Validation |
|---|---|---|
| 0x1F | call (non-intrinsic) | Calls sub_2C795F0. Checks for "pragma" metadata; rejects "unroll" pragma with: "pragma unroll is not supported. Please use llvm.loop.unroll.count instead". Validates branch pragma operand count. |
| 0x21 | indirectbr | Rejected via sub_2C76F10(ctx, "indirectbr", instr) |
| 0x22 | invoke | Rejected via sub_2C76F10(ctx, "invoke", instr) |
| 0x23 | resume | Rejected via sub_2C76F10(ctx, "resume", instr) |
| 0x3C | alloca | Alignment must be <= 2^23. Address space must be Generic (AS 0): "Allocas are not supported on address spaces except Generic" |
| 0x3D | load | Rejects atomic loads: "Atomic loads/stores are not supported". Rejects tensor memory (AS 6): "Tensor Memory loads/stores are not supported" |
| 0x3E | store | Same atomic and tensor memory checks as load |
| 0x40 | fence | In UnifiedNVVMIR mode: only acq_rel and seq_cst allowed. Otherwise: rejected entirely via sub_2C76F10 |
| 0x41 | cmpxchg | Only i32/i64/i128 types. Pointer must be in generic, global, or shared AS |
| 0x42 | (GEP/addrspacecast helper) | Calls sub_2C7AF00 |
| 0x4F | addrspacecast | Validates source and target AS are in range. "Cannot cast non-generic pointer to different non-generic pointer" -- at least one side must be AS 0 (generic) |
| 0x55 | call (intrinsic) | Dispatches to sub_2C7B6A0 (NVVMIntrinsicVerifier) |
| 0x5F | landingpad | Rejected: "landingpad" unsupported |
The unsupported instructions -- indirectbr, invoke, resume, landingpad -- are CPU exception-handling features with no GPU equivalent. Their rejection at the IR level prevents downstream passes from encountering them.
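A sketch of this dispatch for a few representative opcodes. The opcode constants and error strings come from the table above; the context shape and function name are invented for illustration:

```python
# Opcodes rejected outright, per the per-instruction dispatch table.
REJECTED = {0x21: "indirectbr", 0x22: "invoke",
            0x23: "resume", 0x5F: "landingpad"}

def check_instruction(ctx, opcode, props):
    if opcode in REJECTED:
        ctx["errors"].append(f"{REJECTED[opcode]} is not supported")
    elif opcode == 0x3C:                       # alloca
        if props.get("addrspace", 0) != 0:
            ctx["errors"].append(
                "Allocas are not supported on address spaces except Generic")
    elif opcode in (0x3D, 0x3E):               # load / store
        if props.get("atomic"):
            ctx["errors"].append("Atomic loads/stores are not supported")
        if props.get("addrspace") == 6:        # tensor memory
            ctx["errors"].append(
                "Tensor Memory loads/stores are not supported")
```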
Address Space Casting Rules
The addrspacecast validation enforces NVIDIA's GPU address space model:
Rule: At least one operand of addrspacecast must be AS 0 (generic).
Non-generic-to-non-generic casts are illegal.
Legal:   addrspacecast i32 addrspace(0)* %p to i32 addrspace(1)*  ; generic -> global
Legal:   addrspacecast i32 addrspace(3)* %p to i32 addrspace(0)*  ; shared -> generic
Illegal: addrspacecast i32 addrspace(3)* %p to i32 addrspace(1)*  ; shared -> global
The valid address space range check uses the expression ((AS + ~2) & 0xFFFFFF) > 2, which means AS values 0 (generic), 1 (global), and 3 (shared) are always valid for atomic and cast operations. AS 2 (constant) and higher values have restricted usage contexts.
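The core casting rule reduces to a one-line predicate. A sketch (constant and function names are illustrative):

```python
# Sketch of the addrspacecast rule: at least one side must be generic.
GENERIC, GLOBAL, CONSTANT, SHARED = 0, 1, 2, 3

def addrspacecast_is_legal(src_as: int, dst_as: int) -> bool:
    # Non-generic-to-non-generic casts are illegal.
    return src_as == GENERIC or dst_as == GENERIC
```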
Function Attribute Rejection
The function verifier (sub_2C771D0) rejects 17 LLVM function attributes that have no GPU meaning. Each is identified by its LLVM attribute kind ID:
| Attr ID | Attribute Name | Error Message |
|---|---|---|
| 4 | builtin | "builtin function attribute is not supported." |
| 17 | jumptable | "jumptable function attribute is not supported." |
| 20 | naked | "naked function attribute is not supported." |
| 23 | nobuiltin | "nobuiltin function attribute is not supported." |
| 30 | noimplicitfloat | "noimplicitfloat function attribute is not supported." |
| 35 | noredzone | "noredzone function attribute is not supported." |
| 42 | nonlazybind | "nonlazybind function attribute is not supported." |
| 53 | returns_twice | "returns_twice function attribute is not supported." |
| 55 | safestack | "safestack function attribute is not supported." |
| 56 | sanitize_address | "sanitize_address function attribute is not supported." |
| 59 | sanitize_memory | "sanitize_memory function attribute is not supported." |
| 63 | sanitize_thread | "sanitize_thread function attribute is not supported." |
| 69 | ssp | "ssp function attribute is not supported." |
| 70 | sspreq | "sspreq function attribute is not supported." |
| 71 | sspstrong | "sspstrong function attribute is not supported." |
| 86 | alignstack | "alignstack function attribute is not supported." |
| 95 | uwtable | "uwtable function attribute is not supported." |
These attributes fall into four categories: (1) CPU ABI (naked, alignstack, noredzone), (2) security hardening (ssp/sspreq/sspstrong, safestack, sanitizers), (3) EH-related (uwtable, returns_twice, personality), and (4) linker features (jumptable, nonlazybind, builtin, nobuiltin). None have GPU equivalents.
Additional Function-Level Checks
| Check | Error Message | Notes |
|---|---|---|
| Cluster dimensions on pre-Hopper | "Cluster dimensions and cluster maximum blocks are not supported on pre-Hopper Architectures" | SM version <= 899 (i.e., before sm_90) |
| Cluster dims on non-kernel | "Cluster dimensions and cluster maximum blocks are only allowed for kernel functions" | Checked via sub_CE9220 |
| Partial zero cluster dims | "If any cluster dimension is specified as 0 then all other dimensions must be specified as 0" | |
| Zero max cluster blocks | "Cluster maximum blocks must be non-zero" | |
| Narrow int param without sign attr | "Integer parameter less than 32-bits without sext/zext flag" | PTX requires >=32-bit params |
| Narrow int return without sign attr | "Integer return less than 32-bits without sext/zext flag" | |
| InReg attribute | "InReg attribute on parameter will be ignored" | Warning only |
| Nest attribute | "Nest attribute on parameter will be ignored" | Warning only |
| Explicit section | "Explicit section marker <name> is not allowed." | |
| Explicit alignment | "Explicit alignment is not allowed." | |
| Prefix data | "Prefix data is not allowed." | CPU feature |
| Prologue data | "Prologue data is not allowed." | CPU feature |
| Personality function | "Personality function is not allowed." | EH feature |
| GC names | "GC names are not supported." | |
| Non-void kernel/entry | "non-void entry function." | Return type must be void |
| Entry with params | "entry function with parameters." | Non-kernel entries only |
| Non-void exit handler | "non-void exit handler function." | |
| Exit handler with params | "exit handler function with parameters." | |
Architecture Gates (SM-Gated Features)
The intrinsic verifier (sub_2C7B6A0) uses the SM version stored at context offset +8 (encoded as SM*10) to gate feature availability. The threshold checks use <=, so e.g. <= 899 means "below sm_90".
| SM Gate | Threshold | Intrinsics / Features | Error Message |
|---|---|---|---|
| sm_70 (Volta) | <= 699 | llvm.nvvm.branch.if.all.convergent (ID 0x205A) | "...not supported on pre-Volta Architectures" |
| sm_72 (Volta+) | <= 719 | llvm.nvvm.cvt base conversion (ID 0x2106) | "this instrinsic is only supported for Volta (sm_72)+" |
| sm_75 (Turing) | <= 749 | cvt extended types -- BF16, TF32 conversions (within ID 0x2106) | "conversion type only supported for Turing (sm_75)+" |
| sm_80 (Ampere) | <= 799 | llvm.nvvm.branch.if.convergent (ID 0x205B) | "...not supported on pre-Ampere Architectures" |
| sm_89 (Ada) | <= 889 | Extended type conversion intrinsic (ID 0x2107) | "this instrinsic is only supported for Ada (sm_89)+" |
| sm_90 (Hopper) | <= 899 | TMA, async copy (IDs 0x2279, 0x232D), cluster dims, bulk async (IDs 0x244D-0x2459, 0x2487-0x2489) | "this intrinsic is only supported for Hopper+" |
| sm_90 (Hopper) | <= 899 | 64-bit pointer requirement for TMA | "this intrinsic is only supported when pointer size is >= 64 bits" |
| sm_100+ (Blackwell) | <= 1199 | .offset.bindless intrinsics (checked via sub_CEA320) | ".offset.bindless intrinsics are not supported on pre-Blackwell architectures" |
Note the typo "instrinsic" in the Volta and Ada messages -- this is present in the binary. The Blackwell gate threshold of 1199 means the .offset.bindless intrinsics are available on sm_120 (value 1200) and above, covering all Blackwell-generation architectures including consumer (sm_120/121) and datacenter (sm_100/103).
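Under the SM*10 encoding, each gate is an inclusive upper bound on rejected versions. A sketch of the gate logic (thresholds from the table above; the "tma.bulk.async" key is a placeholder label, not a recovered intrinsic name):

```python
# SM gates compare the SM*10 encoding against inclusive rejection bounds:
# a gate of 899 rejects everything below sm_90 (value 900).
GATES = {
    "llvm.nvvm.branch.if.all.convergent": 699,   # pre-Volta rejected
    "llvm.nvvm.branch.if.convergent":     799,   # pre-Ampere rejected
    "tma.bulk.async":                     899,   # pre-Hopper rejected (placeholder name)
}

def gate_ok(intrinsic: str, sm_times_10: int) -> bool:
    # Ungated intrinsics (threshold 0) pass on every architecture.
    return sm_times_10 > GATES.get(intrinsic, 0)
```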
Intrinsic Verification Categories
The intrinsic verifier is a single monolithic switch on the NVVM internal intrinsic ID (stored at function value offset +36). The 143KB function covers 26+ validation categories:
A. Constant Argument Validation
Many NVVM intrinsics require one or more arguments to be compile-time constants (typically mode selectors, masks, or task IDs):
- "arg0 of intrinsic not constant"
- "op0 of intrinsic not constant" / "op1 of intrinsic not constant"
- "Flag argument must be an immediate."
- "the task_id parameter must be constant"
- "the mask parameter must be constant"
- "Mode operand must be constant"
B. Rounding Mode Validation
Rounding mode encoding: bits[2:0] of the mode word
Valid range: 1..4 (round-to-nearest-even, round-down, round-up, round-to-zero)
Reject: value == 0 or value > 4
Message: "rounding mode not a valid value"
C. Subword Mode Validation
For conversion intrinsics that operate on sub-word portions:
Source subword mode: bits[9:7], valid range 0..2
Dest subword mode: bits[12:10], valid range 0..2
Messages: "src subword mode not a valid value"
"dest subword mode not a valid value"
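The rounding and subword bit ranges above can be checked with straightforward masking. A sketch (function names are illustrative):

```python
# Decode of the conversion mode word per the recovered bit ranges:
# rounding mode in bits[2:0], source subword in bits[9:7],
# destination subword in bits[12:10].
def decode_mode_word(mode: int):
    return {
        "rounding":    mode         & 0x7,   # valid 1..4
        "src_subword": (mode >> 7)  & 0x7,   # valid 0..2
        "dst_subword": (mode >> 10) & 0x7,   # valid 0..2
    }

def mode_word_is_valid(mode: int) -> bool:
    f = decode_mode_word(mode)
    return (1 <= f["rounding"] <= 4
            and f["src_subword"] <= 2
            and f["dst_subword"] <= 2)
```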
D. Reserved Bits Checking
Multiple locations verify that high/reserved bits in mode words are zero:
"reserved flag bits used"
This prevents future-proofing conflicts if NVIDIA later assigns meaning to currently reserved fields.
E. Address Space Validation
Intrinsics that access memory enforce specific address space requirements:
| Check | Message |
|---|---|
| Global pointer required | "pointer address space not global" |
| Invalid arg1 address space | "arg1 invalid addrspace" |
| Arg0 must be pointer | "arg0 of intrinsic not pointer" |
| Constant AS required | "Operand must be in constant address space" |
| Memcpy/memmove targets constant AS | "memmove/memcpy cannot target constant address space" |
| Memset targets constant AS | "memset cannot point to constant address space" |
| Stack ops require local AS (5) | "llvm.nvvm.stackrestore is only supported with local address space pointers" |
| Stack ops require local AS (5) | "llvm.nvvm.stacksave is only supported with local address space pointers" |
F. Type Validation
| Check | Message |
|---|---|
| bswap operand | "Invalid type for bswap, need i16, i32, or i64" |
| ctpop/ctlz/cttz operand | "Invalid type for ctpop/ctlz/cttz, need i8, i16, i32, ..." (i64) |
| Arithmetic overflow | "Invalid type for arithmetic overflow intrinsic, need i16, i32, or i64" |
| Inline asm type | "Invalid type in inline assembly, must be i1, i8, i16, i32, i64, float, or double" |
| MMA element | "op1 of intrinsic not containing f32 or i32 element" |
Inline assembly type validation uses a bitmask check: valid bit widths are 1, 8, 16, 32, 64 (encoded as 0x1000000010001 for fast lookup).
G. Atomic Intrinsic Validation
| Check | Message |
|---|---|
| CAS opcode mismatch | "the opcode of atomic_cas must be CAS" |
| RMW opcode error | "the opcode of atomic_rmw must not be CAS, CAST or CAST_SPIN" |
| CAST opcode error | "the opcode of atomic_cast must be CAST or CAST_SPIN" |
| CAST type restriction | "atomic.cast only overloads on i32 and i64" |
| CAST pointer restriction | "atomic.cast is only allowed on shared pointers" |
| CAST ordering restriction | "atomic.cast works on shared memory, so cannot be ordered" |
| Global ordering scope | "Global ordering on atomics is only allowed on generic/global pointers" |
| Ordering mode | "ordering mode not a valid value" |
| Scope mode | "scope mode not a valid value" |
| Cache hint | "Cache operation hint not a valid value" |
| Operation mode | "operation mode not a valid value" |
H. Texture/Surface Validation
| Check | Message |
|---|---|
| Texture dimensionality | "dimensionality not a valid value" |
| LOD adjust | "LOD Adjust mode not a valid value" |
| Binding mode | "Binding Mode is not a valid value" |
| Border mode | "border mode not a valid value" |
| Address mode | "address mode not a valid value" |
| Scope | "scope not a valid value" |
| Semantic mode | "semantic mode not a valid value" |
| Query mode | "query mode is not a valid value" |
| Handle source | "Op0 of nvvm.texsurf.handle must be a metadata wrapper around a tex/surf GlobalVariable" |
| Deprecated desc | "Desc parameter is deprecated and should be undef." (IDs 8937, 9549) |
I. SATF (Saturate-to-Float) Validation
For math intrinsics with saturation control (IDs 0x2281-0x229C, covering fma/mul/add variants):
Message: "satf operand must be a constant zero"
The satf parameter was deprecated but the intrinsic signatures retain it for ABI compatibility. The verifier enforces it must be zero.
J. Constant Load Validation
For ID 0x2310 (constant bank load):
| Check | Message |
|---|---|
| Load kind | "Invalid constant load kind" |
| Bound bank type | "Bound bank must be i32" |
| Bindless bank type | "Bindless bank must be i64" |
K. TMA/Shared Memory Validation
For IDs 0x2319-0x231B:
| Check | Message |
|---|---|
| Column-major restriction | "ColMajor is not supported for this size" |
| Size encoding | "Invalid size" (bits[3:1] > 4) |
L. Load Bounds Check
For ID 0x231C:
Validation: (value & 7) must be <= 2
Message: "invalid load bounds check type"
Also: "pointer address space not global"
M. Convergent Branch Result Validation
For IDs 8282 (llvm.nvvm.branch.if.all.convergent) and 8283 (llvm.nvvm.branch.if.convergent):
Message: "result of llvm.nvvm.branch.if.convergent and
llvm.nvvm.branch.if.all.convergent can only be
used by exactly one branch instruction"
This enforces that the convergent branch intrinsic's boolean result flows directly to a single terminator branch, preventing misuse that would break convergence guarantees.
N. MMA (Matrix Multiply-Accumulate) Validation
The most complex validation category (ID 0x2366 = 9062). Validates WMMA/MMA intrinsics against a multidimensional constraint space:
Opcode byte encoding:
| Byte | Bits | Field |
|---|---|---|
| byte0 | [2:0] | Rounding mode |
| byte0 | [7:4] | MMA opcode |
| byte1 | all | A matrix element type (1-13, lookup via dword_43A2620) |
| byte2 | all | B matrix element type |
| byte4 | all | MNK dimension encoding (cases 1-0x19) |
| byte5 | all | Additional type info |
MNK dimension decoding (selected cases):
| Encoding | M | N | K | Notes |
|---|---|---|---|---|
| 1 | 8 | 8 | 8 | Legacy HMMA |
| 0x10 | 16 | 8 | 8 | |
| 0x17 | 16 | 8 | 16 | |
| 0x18 | 32 | 8 | 8 | |
| 0x19 | 16 | 8 | 16 | |
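The recovered byte4 cases can be expressed as a lookup. A sketch, assuming byte4 occupies bits [39:32] of the opcode word (the byte-to-bit mapping beyond the table above is an assumption):

```python
# Lookup for the byte4 MNK encodings recovered above (selected cases only).
MNK_TABLE = {
    0x01: (8, 8, 8),      # legacy HMMA
    0x10: (16, 8, 8),
    0x17: (16, 8, 16),
    0x18: (32, 8, 8),
    0x19: (16, 8, 16),
}

def decode_mnk(opcode_word: int):
    byte4 = (opcode_word >> 32) & 0xFF     # assumed position of byte4
    if byte4 not in MNK_TABLE:
        raise ValueError("Invalid MMA MNK")
    return MNK_TABLE[byte4]
```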
Validation checks:
| Check | Message |
|---|---|
| MNK dimensions | "Invalid MMA MNK" |
| A element type | "Invalid MMA AType" |
| Fragment A bit width | "Invalid MMA FragASize" |
| Fragment B bit width | "Invalid MMA FragBSize" |
| Fragment C bit width | "Invalid MMA FragCSize" |
| Fragment A IR type | "Invalid fragA type" |
| Rounding mode | "Invalid MMA Rounding Mode" |
| MMA opcode | "Invalid MMA Opcode" |
| A/B type match | "Mismatched MMA A B Type" |
| Fragment element consistency | "Mismatched fragA, fragB and fragC element type" |
O. Type Conversion Validation
For IDs 0x2106 and 0x2107:
Conversion type: bits[3:1], must be 1..4
Messages: "conversion type not a valid value"
"Invalid dst type" / "Invalid src type"
"Src and dst type must be different types"
"Src and dst type must be different bit widths"
P. Other Validation Categories
| Category | IDs | Key Messages |
|---|---|---|
| Coroutine | -- | "llvm.nvvm.coro.create.suspend must have exactly one argument, which must be a constant integer" |
| Subop mode | 9383-9384 | "Invalid subop mode" (bits[3:1] > 5) |
| Geometry output | -- | "geometry out mode not a valid value", "op1 of GeometryOut intrinsic must be constant when CUT mode", "op1 of GeometryOut intrinsic must be 0 when CUT mode" |
| Syncwarp | -- | "syncwarp mode not a valid value" |
| Cache operations | -- | "invalid cache type", "invalid cache op" |
| Wait intrinsic | -- | "Invalid wait mode" |
| ISBE | 0x2BC1 (11201) | "Only writes to MAP or ATTR are supported", "Cannot write to input ISBE" |
| Unsupported fallback | -- | "Unsupported intrinsic: <name>" |
Cmpxchg Restrictions
The module verifier enforces strict constraints on cmpxchg:
Allowed types: i32, i64, i128
Allowed spaces: generic (AS 0), global (AS 1), shared (AS 3)
Messages:
"Atomic operations on non-i32/i64/i128 types are not supported"
"cmpxchg pointer operand must point to generic, global, or shared address space"
This rules out i8/i16 atomics (hardware does not support sub-word CAS) and atomics on constant/local address spaces.
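A sketch of the two cmpxchg checks (type widths and address space numbers from the rules above; the accumulating return style mirrors the verifier's error model):

```python
# Sketch of the cmpxchg constraints: i32/i64/i128 only, and the pointer
# must live in generic (0), global (1), or shared (3) address space.
ALLOWED_WIDTHS = {32, 64, 128}
ALLOWED_SPACES = {0, 1, 3}

def check_cmpxchg(bit_width: int, addrspace: int):
    errors = []
    if bit_width not in ALLOWED_WIDTHS:
        errors.append(
            "Atomic operations on non-i32/i64/i128 types are not supported")
    if addrspace not in ALLOWED_SPACES:
        errors.append("cmpxchg pointer operand must point to "
                      "generic, global, or shared address space")
    return errors
```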
Tensor Memory Restrictions
Load and store instructions targeting address space 6 (tensor memory) are rejected at the IR level:
Message: "Tensor Memory loads/stores are not supported"
Tensor memory access is handled through dedicated intrinsics (TMA/cp.async) rather than generic load/store instructions. The verifier enforces this indirection.
Pipeline Placement
The NVVMVerifier is inserted repeatedly throughout the optimization pipeline, not just once. In the pipeline assembler (sub_12E54A0), it appears after nearly every major optimization pass, gated by !NVVMPassOptions[600]:
| Position | After Pass | Notes |
|---|---|---|
| 10 (O1 tier) | GVN | Verify IR after value numbering |
| After DSE | Dead Store Elimination | Verify after store removal |
| After EarlyCSE | Early CSE | O2+ only |
| After LoopIndexSplit | Loop Index Split | O2+ only |
| After NVVMReflect | NVVM Reflect | Common tail |
| After LICM | Loop-Invariant Code Motion | Common tail |
| After LowerSwitch | Switch lowering | Final position in common tail |
This aggressive re-verification catches bugs introduced by any optimization pass. In debug/development builds, this is the primary mechanism for detecting optimizer-introduced IR invalidity.
Configuration
| Knob | Storage | Type | Default | Description |
|---|---|---|---|---|
| NVVMPassOptions[600] | opts array | bool | false | When true, disables ALL NVVMVerifier insertions in the pipeline |
| nvvm-verify-show-info | ctor_257 | bool | false | Enables informational messages (e.g., "IR Kind is UnifiedNVVMIR") |
Diagnostic Infrastructure
Error messages are produced through a chain of helper functions:
| Function | Role |
|---|---|
| sub_2C764C0 | Create diagnostic message with severity level |
| sub_2C76A00 | Create error diagnostic for a specific instruction |
| sub_2C76240 | Flush diagnostic to error stream |
| sub_2C76F10 | Report an unsupported instruction by name (takes a string literal like "indirectbr") |
| sub_904010 | Append string to diagnostic buffer |
| sub_CB6200 | Write raw bytes to output buffer |
| sub_CB5AE0 | Flush buffer |
The error model is accumulate-and-continue: the verifier sets the error flag at context offset +16 and writes the diagnostic, but does not abort. This allows a single verification run to report all errors in the module.
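A sketch of this accumulate-and-continue model (class and function names are illustrative; the flag-plus-stream pairing mirrors the context fields at offsets +16 and +24):

```python
# Every failure sets the error flag and records a diagnostic, but
# verification never aborts, so one run reports every problem.
class DiagSink:
    def __init__(self):
        self.error_flag = False
        self.messages = []

    def report(self, msg: str):
        self.error_flag = True
        self.messages.append(msg)

def run_checks(sink, checks):
    for ok, msg in checks:
        if not ok:
            sink.report(msg)      # record and keep going
    return not sink.error_flag
```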
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVVMModuleVerifier | sub_2C80C90 | 51KB | Module entry: triples, data layout, per-instruction dispatch |
| NVVMFunctionVerifier | sub_2C771D0 | 36KB | Function-level: attributes, params, cluster dims, entry funcs |
| NVVMIntrinsicVerifier | sub_2C7B6A0 | 143KB | Intrinsic-level: SM gates, types, MMA, atomics, tex/surf |
| NVVMVerifier pass wrapper | sub_12D4560 | small | Pipeline entry point, creates context, invokes module verifier |
| Verify global variable | sub_2C797D0 | -- | Per-global validation |
| Verify function declaration | sub_2C7A130 | -- | Checks function declarations (not definitions) |
| Verify named metadata | sub_2C7AA20 | -- | Named metadata validation |
| Verify address space cast | sub_2C7AF00 | -- | addrspacecast / GEP rule checker |
| Verify generic call | sub_2C795F0 | -- | Non-intrinsic call validation, pragma check |
| Report unsupported instruction | sub_2C76F10 | -- | Produces "<name> is not supported" diagnostics |
| Is kernel function? | sub_CE9220 | -- | Checks kernel calling convention |
| Extract cluster dimensions | sub_CE8EA0 | -- | Reads cluster dims from function metadata |
| Extract cluster max blocks | sub_CE9030 | -- | Reads max cluster blocks from metadata |
| Check function attribute | sub_A73ED0 | -- | Tests presence of attribute by ID |
| Is .offset.bindless? | sub_CEA320 | -- | Blackwell gate predicate |
| Get intrinsic name string | sub_BD5D20 | -- | Returns intrinsic name for error messages |
| Get integer bit width | sub_BCAE30 | -- | Type query helper |
| Compute total bit width | sub_CA1930 | -- | Aggregate/vector width computation |
Cross-References
- GPU Target Architecture -- SM table and architecture gating
- Hopper (sm_90) -- TMA, cluster operations, WGMMA
- Blackwell (sm_100) -- tcgen05, .offset.bindless
- Memory Space Optimization -- address space enforcement and resolution
- NVIDIA Custom Passes index -- pass inventory
- IP Memory Space Propagation -- inter-procedural address space analysis
NVVM Intrinsic Lowering
The NVVMIntrinsicLowering pass is a pattern-matching rewrite engine that transforms NVVM intrinsic calls into equivalent sequences of standard LLVM IR operations. NVVM IR uses hundreds of target-specific intrinsics (llvm.nvvm.*) for GPU-specific operations -- texture/surface access, warp shuffles, type conversions, wide vector manipulations, barrier synchronization, and tensor core primitives. These intrinsics encode NVIDIA-specific semantics that have no direct LLVM IR equivalent. This pass bridges the gap: it matches each intrinsic call against a database of lowering rules and, when a match is found, replaces the call with a combination of standard LLVM instructions (shufflevector, extractelement, insertelement, bitcast, arithmetic) that express the same semantics in a form amenable to standard LLVM optimization passes.
The pass runs repeatedly throughout the pipeline -- up to 10 times in the "mid" compilation path -- because other optimization passes (NVVMReflect, InstCombine, inlining) can expose new intrinsic calls or simplify existing ones into forms that become lowerable. Two distinct invocation levels exist: level 0 for basic intrinsic lowering, and level 1 for barrier-related intrinsic lowering that must happen after barrier analysis infrastructure is in place.
| Property | Value |
|---|---|
| Pass factory | sub_1CB4E40 (creates pass instance with level parameter) |
| Core engine | sub_2C63FB0 (140KB, 2,460 lines) |
| Pass type | FunctionPass (Legacy PM) |
| Registration | Legacy PM only (not separately registered in New PM); invoked from pipeline assembler |
| Runtime positions | Tier 1/2/3 #1, #3, #28, #50, #64 (level 1); "mid" path has 4 level-0 invocations (see Pipeline) |
| NVVMPassOptions slot | 99 (offset 2000, BOOL_COMPACT, default = 0 = enabled) |
| Disable flag | opts[2000] = 1 disables all invocations |
| Level parameter | 0 = basic lowering, 1 = barrier-aware lowering |
| Iteration limit | 30 (global qword_5010AC8) |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
| Address range | 0x2C4D000--0x2C66000 (lowering engine cluster) |
Algorithm
Entry and Dispatch
The pass factory sub_1CB4E40 takes a single integer parameter -- the lowering level. Level 0 performs basic intrinsic lowering (type conversions, vector decomposition, shuffle lowering). Level 1 adds barrier-related intrinsic lowering that depends on barrier analysis having already run. The factory allocates a pass object and stores the level in it; the pass entry point reads this level to filter which intrinsics are candidates for lowering.
At the core engine (sub_2C63FB0), the entry check validates that the instruction is an intrinsic call: the byte at call->arg_chain->offset_8 must equal 17 (intrinsic call marker), and call->offset_16 must be non-null (the callee exists). If either check fails, the function returns 0 (no lowering performed).
Pattern-Matching Rewrite Loop
The algorithm operates as a worklist-driven rewrite system:
function lowerIntrinsic(ctx, call, level):
if not isIntrinsicCall(call): return 0
if not hasCallee(call): return 0
operands = collectOperands(call) // v285/v286 arrays
worklist_direct = [] // v288: direct operand replacements
worklist_typed = [] // v294: type-changed operands
worklist_shuf = [] // v300: shuffle/reorganized operands
iterations = 0
while iterations < qword_5010AC8: // default 30
iterations++
// Phase 1: build candidate lowerings
candidates = buildCandidates(operands) // sub_2C4D470
for each candidate in candidates:
pattern = extractPattern(candidate) // sub_2C4D5A0
// Phase 2: type compatibility check
if not checkTypeCompat(pattern, operands): // sub_AD7630
continue
// Phase 3: operand matching
if not matchOperands(pattern, operands):
continue
// Phase 4: additional pattern checks
if not additionalChecks(pattern): // sub_2C50020
continue
// Phase 5: core lowering -- create replacement
replacement = buildReplacement( // sub_2C515C0
ctx, operands,
worklist_direct, worklist_typed, worklist_shuf)
// Phase 6: substitute
replaceAllUses(call, replacement) // sub_BD84D0
transferMetadata(call, replacement) // sub_BD6B90
queueForDeletion(call) // sub_F15FC0
return 1
return 0 // no lowering found within iteration limit
The iteration limit of 30 (stored in qword_5010AC8) exists because lowering one intrinsic can produce new intrinsic calls that themselves need lowering. For example, lowering a wide vector intrinsic into narrower operations may produce calls to narrower intrinsics. Without the limit, pathological patterns could cause infinite expansion. In practice, most intrinsics lower in a single iteration; the limit is a safety net.
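The bounded fixpoint behavior can be sketched in Python. The rule table and operation names below are invented stand-ins for NVVM intrinsics; only the iteration-limit structure mirrors the recovered engine:

```python
# Minimal sketch of the bounded rewrite loop. Each rule maps one
# "intrinsic" to replacement operations that may themselves be
# intrinsics, mirroring how lowering a wide vector op can emit
# narrower intrinsic calls that need another iteration.
MAX_ITERATIONS = 30  # mirrors qword_5010AC8

RULES = {
    "wide.op.v4": ["narrow.op", "narrow.op", "narrow.op", "narrow.op"],
    "narrow.op": ["scalar.op"],  # lowers again on a later iteration
}

def lower_all(worklist):
    iterations = 0
    while any(op in RULES for op in worklist):
        if iterations >= MAX_ITERATIONS:
            break  # safety net against pathological expansion
        iterations += 1
        next_list = []
        for op in worklist:
            next_list.extend(RULES.get(op, [op]))
        worklist = next_list
    return worklist, iterations

ops, n = lower_all(["wide.op.v4"])
# ops is now four copies of "scalar.op", reached in 2 iterations
```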
Three Worklist Structures
The rewrite engine maintains three parallel worklist structures that categorize how operands are transformed:
| Worklist | Variable | Purpose |
|---|---|---|
| Direct | v288 | Operands that pass through unchanged -- same value, same type |
| Type-changed | v294 | Operands that need a type conversion (e.g., NVVM-specific type to standard LLVM type) |
| Shuffle/reorganized | v300 | Operands that need positional rearrangement (vector lane reordering, element extraction) |
When sub_2C515C0 builds the replacement instruction, it reads all three worklists to assemble the final operand list: direct operands are copied verbatim, type-changed operands go through a bitcast or type conversion, and shuffle operands are processed through a shufflevector or extractelement/insertelement sequence.
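A minimal sketch of that operand assembly, assuming a simplified model where each worklist entry maps to one operand-producing action; the tuple encoding is illustrative, not the recovered data layout:

```python
# Sketch of how the three worklists could drive operand construction:
# direct entries pass through, type-changed entries get a conversion
# wrapper, shuffle entries get a positional extraction.
def build_replacement_operands(direct, typed, shuffled):
    ops = []
    for v in direct:
        ops.append(v)                            # copied verbatim
    for v, new_ty in typed:
        ops.append(("bitcast", v, new_ty))       # type conversion
    for v, lane in shuffled:
        ops.append(("extractelement", v, lane))  # lane rearrangement
    return ops

ops = build_replacement_operands(
    direct=["%a"],
    typed=[("%b", "float")],
    shuffled=[("%v", 2)],
)
# -> ['%a', ('bitcast', '%b', 'float'), ('extractelement', '%v', 2)]
```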
Lowering Categories
Vector Operation Decomposition
Wide vector NVVM intrinsics (operating on v4f32, v2f64, v4i32, etc.) are decomposed into sequences of narrower operations. The NVVM IR frontend emits vector intrinsics to express data-parallel GPU operations, but the NVPTX backend's instruction selector handles scalar or narrow-vector operations more efficiently.
The decomposition pattern:
// Before: single wide-vector intrinsic call
%result = call <4 x float> @llvm.nvvm.wide.op(<4 x float> %a, <4 x float> %b)
// After: four scalar operations + vector reconstruction
%a0 = extractelement <4 x float> %a, i32 0
%a1 = extractelement <4 x float> %a, i32 1
%a2 = extractelement <4 x float> %a, i32 2
%a3 = extractelement <4 x float> %a, i32 3
%b0 = extractelement <4 x float> %b, i32 0
...
%r0 = call float @llvm.nvvm.narrow.op(float %a0, float %b0)
%r1 = call float @llvm.nvvm.narrow.op(float %a1, float %b1)
...
%v0 = insertelement <4 x float> undef, float %r0, i32 0
%v1 = insertelement <4 x float> %v0, float %r1, i32 1
...
This decomposition enables scalar optimizations (constant folding, CSE) to work on individual lanes, and the narrower intrinsics may themselves lower in subsequent iterations -- hence the iteration limit.
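The semantics of the decomposition can be checked with a small simulation. The wide/narrow operation names are placeholders; the real pass emits extractelement/insertelement IR rather than Python lists:

```python
# Simulation of the decomposition's semantics: applying a scalar op
# per lane and reassembling reproduces the wide vector op.
def wide_op(a, b):
    return [x + y for x, y in zip(a, b)]   # stand-in for the v4f32 intrinsic

def narrow_op(x, y):
    return x + y                           # stand-in for the scalar intrinsic

def decomposed(a, b):
    lanes = [narrow_op(a[i], b[i]) for i in range(4)]  # extract + scalar call
    result = [0.0] * 4                                 # "undef" vector
    for i, r in enumerate(lanes):
        result[i] = r                                  # insertelement chain
    return result

a, b = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]
assert decomposed(a, b) == wide_op(a, b)
```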
Shuffle Vector Lowering
When an NVVM intrinsic performs pure data reorganization -- lane permutation, broadcast, or subvector extraction -- without any arithmetic, the pass replaces it with an LLVM shufflevector instruction. The core lowering for this path goes through sub_DFBC30, which takes:
sub_DFBC30(context, operation=6, type_info, shuffle_indices, count, flags)
The operation=6 constant identifies this as a shufflevector creation. The shuffle_indices array encodes the lane mapping: for a warp shuffle that broadcasts lane 0 to all lanes, the mask would be <0, 0, 0, 0, ...>. For a rotation, it might be <1, 2, 3, 0>.
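A shufflevector mask is just an index map into the source lanes. A sketch covering the two masks mentioned above:

```python
# A shuffle mask selects, per output lane, which input lane to read.
def shufflevector(vec, mask):
    return [vec[i] for i in mask]

v = [10, 20, 30, 40]
assert shufflevector(v, [0, 0, 0, 0]) == [10, 10, 10, 10]  # broadcast lane 0
assert shufflevector(v, [1, 2, 3, 0]) == [20, 30, 40, 10]  # rotation
```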
Shuffle lowering handles several NVVM intrinsic families:
- Warp-level shuffle operations (__shfl_sync, __shfl_up_sync, etc.) when the shuffle amount is a compile-time constant
- Subvector extraction from wider types (e.g., extracting the low v2f16 from a v4f16)
- Lane broadcast patterns used in matrix fragment loading
Type Conversion Lowering
NVVM defines intrinsic-based type conversions for types that LLVM's standard type system does not directly support, such as:
- BF16 (bfloat16) to/from FP32 -- intrinsic ID 0x2106, gated by sm_72+
- TF32 (tensorfloat32) conversions -- intrinsic ID 0x2106 with conversion type 3+, gated by sm_75+
- FP8 (E4M3/E5M2) conversions -- intrinsic ID 0x2107, gated by sm_89+ (Ada)
- Extended type conversions with saturate, rounding mode control
The lowering replaces these intrinsic calls with sequences of:
- bitcast for reinterpretation between same-width types
- fptrunc/fpext for standard floating-point width changes
- trunc/zext/sext for integer width changes
- Arithmetic sequences for rounding mode emulation when the hardware rounding mode is not directly expressible
The sub_2C52B30 helper ("get canonical type") resolves NVVM-specific type encodings to their standard LLVM Type* equivalents during this process.
Multi-Run Pattern
NVVMIntrinsicLowering appears more times in the compilation pipeline than any other NVIDIA custom pass. In the "mid" path (standard CUDA compilation), it runs approximately 10 times across the main path and the Tier 1/2/3 sub-pipelines. The pattern reveals a deliberate interleaving strategy.
"mid" Path Invocations (Level 0)
All four invocations in the main "mid" path use level 0 and are guarded by !opts[2000]:
| Position | Context | Preceding Pass | Following Pass | Purpose |
|---|---|---|---|---|
| 1st | Early pipeline | ConstantMerge | MemCpyOpt | Lower intrinsics from the original IR before SROA/GVN operate |
| 2nd | After InstCombine + standard pipeline #5 | LLVM standard #5 | DeadArgElim | Re-lower intrinsics that InstCombine may have simplified or inlining may have exposed |
| 3rd | After NVVMReflect + standard pipeline #8 | LLVM standard #8 | IPConstProp | Lower intrinsics whose arguments became constant after NVVMReflect resolved __nvvm_reflect() calls |
| 4th | Late pipeline | LICM | NVVMBranchDist | Final cleanup of any remaining lowerable intrinsics before register-pressure-sensitive passes |
Tier 1/2/3 Invocations (Level 1)
Within the sub_12DE8F0 tier sub-pipeline, the pass runs with level 1 at five distinct points:
| Position | Context | Notes |
|---|---|---|
| 1st | Tier entry | Immediately at tier start -- lower barrier-related intrinsics before barrier analysis |
| 2nd | After 1st NVVMIRVerification | Re-lower after verification may have canonicalized IR |
| 3rd | After CVP + NVVMVerifier + NVVMIRVerification | Post-optimization cleanup of barrier intrinsics |
| 4th | After LoopUnswitch + standard pipeline #1 | Re-lower intrinsics exposed by loop transformations |
| 5th | After DSE + DCE + standard pipeline #1 | Final tier cleanup before MemorySpaceOpt |
Each tier (1, 2, and 3) runs this same sequence independently, so in a full compilation with all three tiers active, level-1 lowering executes up to 15 times total.
Level Parameter Semantics
The level parameter partitions the intrinsic lowering rules into two sets:
Level 0 -- Basic lowering. Handles intrinsics whose lowering depends only on the intrinsic's operands and types. This includes vector decomposition, shuffle lowering, and standard type conversions. These are safe to run at any point in the pipeline because they have no dependencies on analysis results. The "mid" path runs level 0 exclusively.
Level 1 -- Barrier-aware lowering. Handles intrinsics related to synchronization barriers (__syncthreads, __syncwarp, barrier-guarded memory operations) whose lowering must coordinate with the barrier analysis infrastructure. In the tier sub-pipeline, level 1 runs at the entry point before NVVMBarrierAnalysis and NVVMLowerBarriers, and again after those passes have run. This two-phase pattern within the tier ensures that:
- Barrier intrinsics are lowered to a canonical form that the barrier analysis can recognize
- After barrier analysis and lowering, any residual barrier-related intrinsics are cleaned up
The reason level 1 is restricted to tiers rather than the main "mid" path: the tier sub-pipeline (sub_12DE8F0) sets up the barrier analysis state (via sub_18E4A00 / sub_1C98160) that level 1 lowering depends on. Running level 1 in the main path before this state exists would produce incorrect results.
Interaction with NVVMReflect
NVVMReflect resolves compile-time queries about the target GPU architecture:
%arch = call i32 @llvm.nvvm.reflect(metadata !"__CUDA_ARCH__")
; After NVVMReflect: %arch = i32 900 (for sm_90)
This resolution has a cascading effect on intrinsic lowering. Many NVVM intrinsics are conditionally emitted by the frontend behind architecture checks:
if (__CUDA_ARCH__ >= 900) {
// Hopper-specific intrinsic
__nvvm_tma_load_async(...);
} else {
// Fallback path using standard loads
}
After NVVMReflect replaces the architecture query with a constant, and nvvm-reflect-pp (SimplifyConstantConditionalsPass) eliminates the dead branch, the surviving path may contain intrinsics that were previously unreachable. The pipeline runs NVVMIntrinsicLowering after NVVMReflect specifically to catch these newly-exposed intrinsics. This is why the 3rd invocation in the "mid" path immediately follows NVVMReflect + LLVM standard pipeline #8.
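The cascade can be modeled in a few lines. This is a toy model; the function and intrinsic names below are illustrative, not recovered code:

```python
# Toy model of the NVVMReflect cascade: once the arch query folds to a
# constant, one branch becomes dead, and the surviving branch's
# intrinsics become visible to the next NVVMIntrinsicLowering run.
def nvvm_reflect(query, sm_arch):
    return sm_arch if query == "__CUDA_ARCH__" else 0

def resolve_branch(sm_arch):
    arch = nvvm_reflect("__CUDA_ARCH__", sm_arch)  # NVVMReflect resolution
    if arch >= 900:                                # now a constant condition;
        return ["llvm.nvvm.tma.load.async"]        # dead-branch elimination
    return ["standard.load"]                       # keeps only one side

assert resolve_branch(900) == ["llvm.nvvm.tma.load.async"]
assert resolve_branch(750) == ["standard.load"]
```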
Configuration
NVVMPassOptions
| Slot | Offset | Type | Default | Semantics |
|---|---|---|---|---|
| 98 | 1976 | STRING | (empty) | Paired string parameter for the pass (unused or reserved) |
| 99 | 2000 | BOOL_COMPACT | 0 | Disable flag: 0 = enabled, 1 = disabled |
Setting slot 99 to 1 disables all invocations of NVVMIntrinsicLowering across the entire pipeline -- both level 0 and level 1. There is no mechanism to disable one level independently.
Global Variables
| Variable | Default | Purpose |
|---|---|---|
| qword_5010AC8 | 30 | Maximum iterations per invocation of the rewrite loop |
This global is not exposed as a user-facing knob. It is initialized at program startup and is constant for the lifetime of the process.
Key Helper Functions
Pattern Matching
| Function | Role |
|---|---|
| sub_2C4D470 | Build candidate lowering list from intrinsic operands |
| sub_2C4D5A0 | Extract pattern from candidate -- returns the lowering rule |
| sub_2C50020 | Additional pattern compatibility checks beyond type matching |
| sub_2C52B30 | Get canonical LLVM type for an NVVM-specific type encoding |
| sub_AD7630 | Type-lowering query -- checks if source type can lower to target type |
Instruction Construction
| Function | Role |
|---|---|
| sub_2C515C0 | Build replacement instruction from three worklist structures |
| sub_2C4FB60 | Opcode dispatch -- selects the LLVM opcode for the lowered operation |
| sub_DFBC30 | Create shufflevector or similar vector IR construct (operation=6) |
IR Mutation
| Function | Role |
|---|---|
| sub_BD84D0 | Replace all uses of old instruction with new value |
| sub_BD6B90 | Transfer metadata from old instruction to replacement |
| sub_F15FC0 | Queue old instruction for deletion |
Pass Infrastructure
| Function | Address | Role |
|---|---|---|
| sub_1CB4E40 | 0x1CB4E40 | Pass factory -- creates pass with level parameter |
| sub_2C63FB0 | 0x2C63FB0 | Core lowering engine (140KB, 2,460 lines) |
Diagnostic Strings
The core engine at sub_2C63FB0 contains no user-visible diagnostic strings. This is unusual for a 140KB function and reflects the fact that intrinsic lowering is a mechanical pattern-matching operation: either a lowering rule matches (silently applied) or it does not (silently skipped). Failures are not reported because an unlowered intrinsic is not necessarily an error -- it may be handled by a later pass (NVVMLowerBarriers, GenericToNVVM) or by the NVPTX instruction selector directly.
The pass factory sub_1CB4E40 similarly contains no diagnostic strings.
Pipeline Position Summary
sub_12E54A0 (Master Pipeline Assembly)
│
├─ "mid" path (level 0, 4 invocations):
│ ├─ #1: After ConstantMerge, before MemCpyOpt/SROA
│ ├─ #2: After InstCombine + LLVM standard #5
│ ├─ #3: After NVVMReflect + LLVM standard #8
│ └─ #4: After LICM, before NVVMBranchDist/Remat
│
├─ "ptx" path (level 0, 0 invocations):
│ └─ (not present -- PTX input already has intrinsics lowered)
│
├─ default path (level 0, 1 invocation):
│ └─ #1: After NVVMReflect, before NVVMPeephole
│
└─ Tier 1/2/3 sub-pipeline (level 1, 5 invocations per tier):
├─ #1: Tier entry
├─ #2: After NVVMIRVerification
├─ #3: After CVP + NVVMVerifier
├─ #4: After LoopUnswitch + LLVM standard #1
└─ #5: After DSE + DCE + LLVM standard #1
Cross-References
- LLVM Optimizer -- master pipeline assembly and tier system
- NVIDIA Custom Passes -- Inventory -- pass registry and classification
- Rematerialization -- runs after intrinsic lowering in "mid" path
- NVVM Peephole -- peephole patterns that may expose new lowerable intrinsics
- MemorySpaceOpt -- runs after level-1 lowering in tier sub-pipeline
FP128/I128 Emulation
No NVIDIA GPU in any SM generation has native 128-bit arithmetic hardware. Neither fp128 (IEEE 754 binary128) nor i128 (128-bit integer) operations can be lowered to PTX instructions directly. CICC handles this by replacing every fp128 and i128 operation in LLVM IR with a call to one of 48 distinct NVIDIA runtime library functions whose implementations live in a separate bitcode module. The pass at sub_1C8C170 walks each function in the module, inspects every instruction, dispatches on the LLVM opcode byte, and emits the appropriate __nv_* call in place of the original operation. This is a correctness-critical legalization pass -- if any fp128/i128 operation survives past it, instruction selection will abort because NVPTX has no patterns for 128-bit types.
The pass is structurally part of lower-ops (LowerOpsPass), NVIDIA's umbrella module pass for lowering operations that the NVPTX backend cannot handle natively. Within the lower-ops framework, sub_1C8C170 is the dedicated handler for 128-bit types. It runs as a module-level pass early in the pipeline, after libdevice linking and before the main optimization sequence, so that the generated calls can be inlined and optimized by subsequent passes.
| Entry point | sub_1C8C170 |
| Size | 25 KB (~960 lines decompiled) |
| Pass framework | Part of lower-ops / LowerOpsPass (module pass) |
| Registration | New PM slot 144 at sub_2342890; param enable-optimization |
| Runtime functions | 48 distinct __nv_* library calls |
| Upstream equivalent | None. Upstream LLVM lowers fp128 through SoftenFloat in type legalization. CICC replaces this with explicit call insertion at the IR level. |
Opcode Dispatch
The pass reads the LLVM instruction opcode from the byte at offset +16 of the instruction node and dispatches through a dense switch. The following table lists every handled opcode and the corresponding lowering action. All unlisted opcodes in the range 0x18--0x58 produce an early return (no 128-bit type involvement, or handled elsewhere).
| Opcode | LLVM Instruction | Lowering Target | Handler |
|---|---|---|---|
| 0x24 | fadd | __nv_add_fp128 | sub_1C8A5C0 |
| 0x26 | fsub | __nv_sub_fp128 | sub_1C8A5C0 |
| 0x28 | fmul | __nv_mul_fp128 | sub_1C8A5C0 |
| 0x29 | udiv | __nv_udiv128 | sub_1C8BD70 |
| 0x2A | sdiv | __nv_idiv128 | sub_1C8BD70 |
| 0x2B | fdiv | __nv_div_fp128 | sub_1C8A5C0 |
| 0x2C | urem | __nv_urem128 | sub_1C8BD70 |
| 0x2D | srem | __nv_irem128 | sub_1C8BD70 |
| 0x2E | frem | __nv_rem_fp128 | sub_1C8A5C0 |
| 0x36 | trunc/ext | Type-based conversion | sub_1C8ADC0 |
| 0x3F | fptoui | __nv_fp128_to_uint* or __nv_cvt_f*_u128_rz | sub_1C8ADC0 / sub_1C8BF90 |
| 0x40 | fptosi | __nv_fp128_to_int* or __nv_cvt_f*_i128_rz | sub_1C8ADC0 / sub_1C8BF90 |
| 0x41 | uitofp | __nv_uint*_to_fp128 or __nv_cvt_u128_f*_rn | sub_1C8ADC0 / sub_1C8BF90 |
| 0x42 | sitofp | __nv_int*_to_fp128 or __nv_cvt_i128_f*_rn | sub_1C8ADC0 / sub_1C8BF90 |
| 0x43 | fptrunc | __nv_fp128_to_float or __nv_fp128_to_double | sub_1C8ADC0 |
| 0x44 | fpext | __nv_float_to_fp128 or __nv_double_to_fp128 | sub_1C8ADC0 |
| 0x4C | fcmp | __nv_fcmp_* (predicate-selected) | dedicated |
Ignored opcode ranges: 0x18--0x23, 0x25, 0x27, 0x2F--0x35, 0x37--0x3E, 0x45--0x4B, 0x4D--0x58. Opcode 0x37 (store) receives a type check similar to 0x36's, but applied to the store target type.
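The dispatch amounts to a lookup from opcode byte to runtime function name. Below is a sketch restricted to the binary arithmetic cases from the table above; the dict-based encoding is illustrative, while the binary uses a dense switch:

```python
# Opcode-to-runtime-function mapping for the binary fp128/i128 cases,
# taken from the dispatch table recovered above.
DISPATCH = {
    0x24: "__nv_add_fp128", 0x26: "__nv_sub_fp128",
    0x28: "__nv_mul_fp128", 0x2B: "__nv_div_fp128",
    0x2E: "__nv_rem_fp128",
    0x29: "__nv_udiv128",   0x2A: "__nv_idiv128",
    0x2C: "__nv_urem128",   0x2D: "__nv_irem128",
}

def lower_binop(opcode):
    fn = DISPATCH.get(opcode)
    if fn is None:
        return None  # unhandled opcode: early return, no 128-bit lowering
    return fn

assert lower_binop(0x24) == "__nv_add_fp128"
assert lower_binop(0x2A) == "__nv_idiv128"
assert lower_binop(0x25) is None  # ignored range
```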
Library Function Inventory
FP128 Arithmetic (5 functions)
Binary operations on IEEE 754 binary128. Each takes two fp128 operands, returns fp128.
| Function | Operation | String Length |
|---|---|---|
| __nv_add_fp128 | fp128 addition | 14 |
| __nv_sub_fp128 | fp128 subtraction | 14 |
| __nv_mul_fp128 | fp128 multiplication | 14 |
| __nv_div_fp128 | fp128 division | 14 |
| __nv_rem_fp128 | fp128 remainder | 14 |
All five are lowered through sub_1C8A5C0, which constructs the call with a fixed string length of 14 characters.
I128 Division and Remainder (4 functions)
Integer division and remainder for i128. No native PTX instruction exists for 128-bit integer divide.
| Function | Operation | Signedness | String Length |
|---|---|---|---|
| __nv_udiv128 | i128 division | unsigned | 12 |
| __nv_idiv128 | i128 division | signed | 12 |
| __nv_urem128 | i128 remainder | unsigned | 12 |
| __nv_irem128 | i128 remainder | signed | 12 |
Lowered through sub_1C8BD70 with string length 12. Note: i128 add/sub/mul are NOT lowered here -- those can be decomposed into pairs of 64-bit operations by standard LLVM legalization. Only division and remainder require the runtime call path because they involve complex multi-word algorithms.
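Why the asymmetry: a 128-bit add has a fixed two-word decomposition (two 64-bit adds plus a carry), which standard legalization can emit inline, while division requires a variable-length multi-word algorithm. A sketch of the add case:

```python
# A 128-bit add decomposed into two 64-bit halves with carry propagation,
# the decomposition standard LLVM legalization emits inline.
MASK64 = (1 << 64) - 1

def add128(a_lo, a_hi, b_lo, b_hi):
    lo = (a_lo + b_lo) & MASK64
    carry = 1 if a_lo + b_lo > MASK64 else 0
    hi = (a_hi + b_hi + carry) & MASK64
    return lo, hi

# 2^64 - 1 plus 1 carries into the high word
lo, hi = add128(MASK64, 0, 1, 0)
assert (lo, hi) == (0, 1)
```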
FP128-to-Integer Conversions (10 functions)
Convert fp128 to integer types of various widths. The target width is determined via the bit-width test sub_1642F90 and the type's bit-width field (type_id >> 8).
| Function | Conversion |
|---|---|
| __nv_fp128_to_uint8 | fp128 -> i8 (unsigned) |
| __nv_fp128_to_uint16 | fp128 -> i16 (unsigned) |
| __nv_fp128_to_uint32 | fp128 -> i32 (unsigned) |
| __nv_fp128_to_uint64 | fp128 -> i64 (unsigned) |
| __nv_fp128_to_uint128 | fp128 -> i128 (unsigned) |
| __nv_fp128_to_int8 | fp128 -> i8 (signed) |
| __nv_fp128_to_int16 | fp128 -> i16 (signed) |
| __nv_fp128_to_int32 | fp128 -> i32 (signed) |
| __nv_fp128_to_int64 | fp128 -> i64 (signed) |
| __nv_fp128_to_int128 | fp128 -> i128 (signed) |
Integer-to-FP128 Conversions (10 functions)
Convert integer types to fp128.
| Function | Conversion |
|---|---|
| __nv_uint8_to_fp128 | i8 (unsigned) -> fp128 |
| __nv_uint16_to_fp128 | i16 (unsigned) -> fp128 |
| __nv_uint32_to_fp128 | i32 (unsigned) -> fp128 |
| __nv_uint64_to_fp128 | i64 (unsigned) -> fp128 |
| __nv_uint128_to_fp128 | i128 (unsigned) -> fp128 |
| __nv_int8_to_fp128 | i8 (signed) -> fp128 |
| __nv_int16_to_fp128 | i16 (signed) -> fp128 |
| __nv_int32_to_fp128 | i32 (signed) -> fp128 |
| __nv_int64_to_fp128 | i64 (signed) -> fp128 |
| __nv_int128_to_fp128 | i128 (signed) -> fp128 |
String lengths for both fp128-to-integer and integer-to-fp128 conversions vary from 18 to 21 characters depending on the function name. Lowered through sub_1C8ADC0.
FP128-to-Float/Double Conversions (4 functions)
Truncation and extension between fp128 and the native floating-point types.
| Function | Conversion | Opcode |
|---|---|---|
| __nv_fp128_to_float | fp128 -> float | 0x43 (fptrunc) |
| __nv_fp128_to_double | fp128 -> double | 0x43 (fptrunc) |
| __nv_float_to_fp128 | float -> fp128 | 0x44 (fpext) |
| __nv_double_to_fp128 | double -> fp128 | 0x44 (fpext) |
I128-to-Float/Double Conversions (8 functions)
These handle the non-fp128 path: converting i128 directly to/from float/double without going through fp128 as an intermediate. The _rz suffix denotes round-toward-zero mode; _rn denotes round-to-nearest-even.
| Function | Conversion | Rounding | String Length |
|---|---|---|---|
| __nv_cvt_f32_u128_rz | i128 (unsigned) -> float | toward zero | 20 |
| __nv_cvt_f32_i128_rz | i128 (signed) -> float | toward zero | 20 |
| __nv_cvt_f64_u128_rz | i128 (unsigned) -> double | toward zero | 20 |
| __nv_cvt_f64_i128_rz | i128 (signed) -> double | toward zero | 20 |
| __nv_cvt_u128_f32_rn | float -> i128 (unsigned) | to nearest | 20 |
| __nv_cvt_i128_f32_rn | float -> i128 (signed) | to nearest | 20 |
| __nv_cvt_u128_f64_rn | double -> i128 (unsigned) | to nearest | 20 |
| __nv_cvt_i128_f64_rn | double -> i128 (signed) | to nearest | 20 |
All eight are lowered through sub_1C8BF90 with a fixed string length of 20 characters. The rounding mode choice is deliberate: _rz for integer-from-float (truncation semantics matching C/C++ cast behavior) and _rn for float-from-integer (IEEE 754 default rounding for conversions).
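The two rounding behaviors can be illustrated directly. Python's math.trunc and round() model _rz and _rn respectively for values where the arithmetic is exact:

```python
import math

def cvt_rz(x):
    # round toward zero: matches C/C++ float-to-int cast truncation
    return math.trunc(x)

def cvt_rn(x):
    # round to nearest, ties to even: the IEEE 754 default;
    # Python's round() implements exactly this tie-breaking rule
    return round(x)

assert cvt_rz(2.9) == 2
assert cvt_rz(-2.9) == -2   # toward zero, not toward negative infinity
assert cvt_rn(2.5) == 2     # ties go to even
assert cvt_rn(3.5) == 4
```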
The dispatch logic selects between the __nv_fp128_to_* / __nv_*_to_fp128 family and the __nv_cvt_* family based on whether the source or destination type is fp128 (type_id == 5). If neither operand is fp128 but one is i128, the __nv_cvt_* path is taken.
FP128 Comparison Predicates
The fcmp instruction (opcode 0x4C) is dispatched by extracting the comparison predicate from bits 0--14 of the halfword at instruction offset +18. Each LLVM fcmp predicate maps to a dedicated runtime function.
Ordered Comparisons (7 functions)
Ordered comparisons return false if either operand is NaN.
| Function | Predicate | Semantics |
|---|---|---|
| __nv_fcmp_oeq | oeq | ordered equal |
| __nv_fcmp_ogt | ogt | ordered greater-than |
| __nv_fcmp_oge | oge | ordered greater-or-equal |
| __nv_fcmp_olt | olt | ordered less-than |
| __nv_fcmp_ole | ole | ordered less-or-equal |
| __nv_fcmp_one | one | ordered not-equal |
| __nv_fcmp_ord | ord | ordered (neither is NaN) |
Unordered Comparisons (7 functions)
Unordered comparisons return true if either operand is NaN.
| Function | Predicate | Semantics |
|---|---|---|
| __nv_fcmp_uno | uno | unordered (either is NaN) |
| __nv_fcmp_ueq | ueq | unordered or equal |
| __nv_fcmp_ugt | ugt | unordered or greater-than |
| __nv_fcmp_uge | uge | unordered or greater-or-equal |
| __nv_fcmp_ult | ult | unordered or less-than |
| __nv_fcmp_ule | ule | unordered or less-or-equal |
| __nv_fcmp_une | une | unordered or not-equal |
The predicate naming follows the standard LLVM fcmp convention: o prefix = ordered, u prefix = unordered. The 14 predicates cover the complete set of IEEE 754 comparison semantics excluding true and false (which are constant-folded before reaching this pass). Each function takes two fp128 operands and returns i1.
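The ordered/unordered split reduces to how NaN operands are treated. A sketch for the eq pair; the other predicates follow the same pattern:

```python
import math

# Ordered: False if either operand is NaN; unordered: True if either is.
def fcmp_oeq(a, b):
    return not (math.isnan(a) or math.isnan(b)) and a == b

def fcmp_ueq(a, b):
    return math.isnan(a) or math.isnan(b) or a == b

nan = float("nan")
assert fcmp_oeq(1.0, 1.0) and not fcmp_oeq(nan, 1.0)
assert fcmp_ueq(nan, 1.0) and fcmp_ueq(1.0, 1.0)
assert not fcmp_ueq(1.0, 2.0)
```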
Trunc/Ext Handling (Opcode 0x36)
The trunc/zext/sext opcode path requires special logic because it must distinguish between genuine 128-bit truncation/extension and other type conversions that happen to use the same opcode.
sub_1C8C170::handle_trunc_ext(inst):
if sub_1642F90(*operand, 128): // Is the operand type 128-bit?
// Determine source and dest bit-widths from DataLayout
src_bits = type_id >> 8 // Bit-width encoded in high byte
dst_bits = target_type_id >> 8
if src_bits > dst_bits:
emit_truncation(inst, src_bits, dst_bits)
else:
emit_extension(inst, src_bits, dst_bits, is_signed)
elif type_id == 5: // fp128 type marker
emit_fp128_conversion(inst)
else:
return // Not a 128-bit operation
The type_id value 5 is the LLVM type tag for fp128 in CICC's internal representation (consistent with the type code table: 1=half, 2=float, 3=double, 4=fp80, 5=fp128, 6=bf16, 0xB=integer with bit-width at type_id >> 8).
Lowering Helpers
Four internal helper functions perform the actual call construction. Each creates a new CallInst with the library function name, replaces all uses of the original instruction with the call result, and erases the original instruction.
| Helper | Address | Purpose | Name Length |
|---|---|---|---|
| sub_1C8A5C0 | 0x1C8A5C0 | Binary fp128 arithmetic (add/sub/mul/div/rem) | 14 |
| sub_1C8BD70 | 0x1C8BD70 | Binary i128 division (udiv/idiv/urem/irem) | 12 |
| sub_1C8ADC0 | 0x1C8ADC0 | FP128 conversions (to/from all integer widths, to/from float/double) | 18--21 (varies) |
| sub_1C8BF90 | 0x1C8BF90 | I128-to/from-float/double conversions | 20 |
The "name length" column refers to the string length passed to the call construction routine. This is a fixed constant in each helper, not computed at runtime, which means the function name strings are embedded as literals in the binary (confirmed by string sweep at 0x1C8C170).
Each helper follows the same pattern:
helper(module, instruction, name_string, name_length):
// 1. Get or create function declaration in module
func = module.getOrInsertFunction(name_string, return_type, param_types...)
// 2. Build argument list from instruction operands
args = extract_operands(instruction)
// 3. Create CallInst
call = IRBuilder.CreateCall(func, args)
// 4. Replace uses and erase
instruction.replaceAllUsesWith(call)
instruction.eraseFromParent()
Libdevice Resolution
The 48 __nv_* functions emitted by this pass are not present in the standard libdevice.10.bc. The standard libdevice (455,876 bytes embedded at unk_3EA0080 / unk_420FD80) contains ~400+ math functions (__nv_sinf, __nv_expf, etc.) but does not include any fp128 or i128 emulation routines.
Instead, these functions are resolved through one of two mechanisms:
- Separate bitcode library: A dedicated 128-bit emulation bitcode module linked after lower-ops runs. This module contains the actual multi-word software implementations of 128-bit arithmetic using 64-bit operations.
- Late synthesis during type legalization: The SelectionDAG type legalization pass (SoftenFloat action) can also handle fp128 operations, but CICC's IR-level lowering preempts this by replacing operations before they reach the backend. The __nv_* functions, once declared in the module, must be resolvable at link time.
The call declarations emitted by the pass use external linkage, meaning the linker must supply definitions. If a definition is missing, the compilation will fail at the NVPTX link stage with an unresolved symbol error. The benefit of performing this lowering at the IR level rather than in SelectionDAG is that the resulting calls are visible to the LLVM optimizer: the inliner can inline the emulation routines, SROA can decompose the intermediate values, and the loop optimizers can hoist invariant 128-bit computations.
Configuration
The pass has no dedicated knobs. It is controlled indirectly through the lower-ops pass framework:
| Parameter | Effect |
|---|---|
enable-optimization | Parameter to LowerOpsPass registration (slot 144). When enabled, the lowered calls may be marked with optimization attributes. |
There are no knobs in knobs.txt specific to fp128 or i128 lowering. The pass runs unconditionally whenever lower-ops is in the pipeline -- there is no way to disable 128-bit emulation because leaving fp128/i128 operations in the IR would cause a fatal error in the NVPTX backend.
Diagnostic Strings
The pass itself emits no diagnostic messages or debug prints. All diagnostic information comes from the embedded function name strings:
"__nv_add_fp128" "__nv_sub_fp128" "__nv_mul_fp128"
"__nv_div_fp128" "__nv_rem_fp128"
"__nv_udiv128" "__nv_idiv128"
"__nv_urem128" "__nv_irem128"
"__nv_fp128_to_uint8" "__nv_fp128_to_int8"
"__nv_fp128_to_uint16" "__nv_fp128_to_int16"
"__nv_fp128_to_uint32" "__nv_fp128_to_int32"
"__nv_fp128_to_uint64" "__nv_fp128_to_int64"
"__nv_fp128_to_uint128" "__nv_fp128_to_int128"
"__nv_uint8_to_fp128" "__nv_int8_to_fp128"
"__nv_uint16_to_fp128" "__nv_int16_to_fp128"
"__nv_uint32_to_fp128" "__nv_int32_to_fp128"
"__nv_uint64_to_fp128" "__nv_int64_to_fp128"
"__nv_uint128_to_fp128" "__nv_int128_to_fp128"
"__nv_fp128_to_float" "__nv_fp128_to_double"
"__nv_float_to_fp128" "__nv_double_to_fp128"
"__nv_cvt_f32_u128_rz" "__nv_cvt_f32_i128_rz"
"__nv_cvt_f64_u128_rz" "__nv_cvt_f64_i128_rz"
"__nv_cvt_u128_f32_rn" "__nv_cvt_i128_f32_rn"
"__nv_cvt_u128_f64_rn" "__nv_cvt_i128_f64_rn"
"__nv_fcmp_oeq" "__nv_fcmp_ogt" "__nv_fcmp_oge"
"__nv_fcmp_olt" "__nv_fcmp_ole" "__nv_fcmp_one"
"__nv_fcmp_ord" "__nv_fcmp_uno" "__nv_fcmp_ueq"
"__nv_fcmp_ugt" "__nv_fcmp_uge" "__nv_fcmp_ult"
"__nv_fcmp_ule" "__nv_fcmp_une"
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Main entry | sub_1C8C170 | 25 KB | Opcode dispatch, instruction walk, type checks |
| FP128 binary lowering | sub_1C8A5C0 | -- | Emits __nv_{add,sub,mul,div,rem}_fp128 calls |
| FP128 conversion lowering | sub_1C8ADC0 | -- | Emits __nv_fp128_to_* / __nv_*_to_fp128 calls |
| I128 division lowering | sub_1C8BD70 | -- | Emits __nv_{u,i}div128 / __nv_{u,i}rem128 calls |
| I128-float lowering | sub_1C8BF90 | -- | Emits __nv_cvt_* calls (rz/rn variants) |
| Type width check | sub_1642F90 | -- | Tests whether a type has a given bit-width (e.g., 128) |
Cross-References
- NVIDIA Custom Passes -- pass registry including lower-ops
- Other NVIDIA Passes -- summary entry for this pass
- Type Legalization -- SelectionDAG SoftenFloat path for fp128 (preempted by this pass)
- Libdevice Linking -- how the embedded libdevice is linked (standard math, not fp128)
- Cast Codegen -- EDG frontend cast generation, type tag 5 = fp128
- Struct Splitting -- sibling pass within the same address cluster
Struct/Aggregate Splitting
GPU register files are typed and scalar. An SM has no concept of loading a struct, storing a struct, or passing a struct through a register -- every value that survives past IR lowering must reduce to a set of individually-named scalar registers. LLVM's standard SROA pass handles alloca-based aggregates by promoting them to scalars, but a large class of aggregate operations never touch an alloca: return values, call arguments, PHI nodes carrying struct types, and aggregate load/store patterns from memcpy lowering. NVIDIA's struct-splitting pass operates on these non-alloca aggregate operations at the NVVM IR level, decomposing every struct-typed value into its constituent scalar fields so that downstream register allocation sees only scalar types.
The pass exists in two binary instances. The primary implementation at sub_1C86CA0 (72KB, ~1,200 lines, 500+ locals) lives in the aggregate-splitting cluster at 0x1C80000--0x1CBFFFF and operates on NVVM IR using NVIDIA-proprietary type IDs. A second, closely related implementation at sub_2CCF450 (58KB) handles the lower-aggr-copies pipeline pass and shares the same string constants ("splitStruct", "srcptr", "dstptr", "remsrc", "remdst", "split", "vld"). Both instances produce the same fundamental transformation: aggregate operations become sequences of scalar operations on individual struct elements.
Key Facts
| Property | Value |
|---|---|
| Entry point | sub_1C86CA0 |
| Size | 72KB (~1,200 lines decompiled), 500+ local variables |
| Binary cluster | 0x1C80000--0x1CBFFFF (Aggregate Splitting + Memory Ops) |
| Second instance | sub_2CCF450 (58KB, lower-aggr-copies pass) |
| Pipeline pass name | lower-aggr-copies (parameterized: lower-aggr-func-args) |
| Related pass | lower-struct-args (parameterized: opt-byval) |
| IR level | NVVM IR (NVIDIA-proprietary type IDs, not LLVM Type::TypeID) |
| Key opcode | 32 (splitStruct instruction) |
| Use replacement | sub_164D160 (RAUW -- Replace All Uses With) |
| LLVM upstream | No equivalent -- this is entirely NVIDIA-proprietary |
Algorithm
The pass walks every instruction in a function, looking for operations whose result type or operand type is an aggregate (struct or array). For each such operation, it decomposes the aggregate into its scalar elements, creates a splitStruct multi-output instruction, and rewires all uses to reference individual element extractions.
Step 1: Type Decomposition
For each struct type encountered, the pass retrieves the struct layout from the DataLayout and enumerates its elements:
function decomposeStructType(struct_type, data_layout):
layout = sub_1643350(data_layout, struct_type) // GetStructLayout
element_types = []
for each element in struct_type.elements:
scalar_ty = sub_159C470(element) // getScalarType
element_types.append(scalar_ty)
return element_types
sub_1643350 retrieves the StructLayout from the DataLayout, giving byte offsets and sizes for each field. sub_159C470 maps each element to its scalar type -- for nested structs, this recurses; for arrays, it yields the element type; for scalars, it returns the type directly.
The element types accumulate in a local array v505[] with the count tracked in v506. This flattened type list drives all subsequent instruction creation.
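The recursive flattening can be modeled as a small sketch. The tuple encoding and function name here are illustrative -- they mirror the behavior described for sub_159C470, not the binary's actual data structures:

```python
# Illustrative model of the recursive flattening done via sub_159C470:
# nested structs flatten in place, arrays yield their element type,
# scalars pass through (the v505[]/v506 accumulation).
def flatten_type(ty, out):
    kind = ty[0]
    if kind == "struct":
        for field in ty[1]:       # recurse into each field in order
            flatten_type(field, out)
    elif kind == "array":
        flatten_type(ty[1], out)  # arrays contribute their element type
    else:
        out.append(ty)            # primitive scalar or pointer
    return out

# struct {i32, {f32, f64}, i16} flattens to four scalars, not three elements
nested = ("struct", [("i32",), ("struct", [("f32",), ("f64",)]), ("i16",)])
print(flatten_type(nested, []))  # [('i32',), ('f32',), ('f64',), ('i16',)]
```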
Step 2: splitStruct Instruction Creation (Opcode 32)
The pass creates a new multi-output instruction with NVVM opcode 32:
function createSplitStruct(original_inst, element_types, count):
composite_ty = sub_15F9F50(element_types, count) // ComputeCompositeType
aligned_ty = sub_1646BA0(composite_ty, data_layout) // SetAlignmentFromDL
// If original was a vector type (type_id == 16), wrap in vector
if getTypeId(original_inst.type) == 16:
aligned_ty = sub_16463B0(aligned_ty) // WrapInVectorType
split_inst = sub_15F1EA0(aligned_ty, 32, parent, nops, flags)
// InitInstruction(opcode=32)
// Store original type info at inst+56, composite at inst+64
split_inst[+56] = original_type_info
split_inst[+64] = sub_15F9F50(composite_ty)
return split_inst
The splitStruct instruction is the NVVM-specific multi-result node that represents the decomposition. It produces N outputs, one per struct element. The instruction stores both the original aggregate type (at offset +56) and the composite element type (at offset +64) for later phases that may need to reconstruct type information.
Step 3: Element Pointer Extraction
For each element of the decomposed struct, the pass creates an indexed load from the splitStruct result:
for i in 0..count:
ptr = sub_15FD590(split_inst, element_types[i],
operand=i, name="ptr", insertion_point)
// Creates opcode 56 (extractvalue-like) with type=1
sub_15FD590 creates an instruction with opcode 56 that extracts the i-th element from the multi-output splitStruct node. The "ptr" name prefix appears in debug output. Each extraction yields a scalar-typed value that downstream passes can assign to an individual PTX register.
Step 4: Split Load with Alignment Preservation
For the actual memory access that feeds the splitStruct, the pass creates a split load instruction:
function createSplitLoad(original_load, element_types):
alignment = computeAlignment(original_load)
split_load = sub_15F90A0(element_types, alignment, ...)
additional_align = sub_1CCB4A0(data_layout, element_types)
final_align = alignment & (-additional_align) // min power-of-2
return split_load
The resulting instruction carries the "split" name prefix. The alignment computation is described in detail in the next section.
Step 5: Use Replacement
After creating all scalar operations, sub_164D160 (RAUW -- Replace All Uses With) replaces every use of the original aggregate operation with the corresponding scalar element extraction:
sub_164D160(original_aggregate_inst, split_inst)
This is the same RAUW infrastructure used across CICC (also called from GlobalOpt, DSE, the inliner, and other passes). After replacement, the original aggregate instruction has zero uses and is eligible for dead code elimination.
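A toy def-use model of the replacement step; the classes are illustrative, not the binary's structures:

```python
# Toy RAUW: every operand slot pointing at the old aggregate result is
# redirected to the replacement node, after which the original is dead.
class Node:
    def __init__(self, op, operands=()):
        self.op = op
        self.operands = list(operands)

def replace_all_uses(users, old, new):
    """Rewire operand slots pointing at `old` to `new`; return the count."""
    count = 0
    for user in users:
        for i, v in enumerate(user.operands):
            if v is old:
                user.operands[i] = new
                count += 1
    return count

agg = Node("load_struct")            # original aggregate operation
ret = Node("ret", [agg])             # a use of the aggregate
split = Node("splitStruct", [agg])   # replacement (opcode-32 analogue)
print(replace_all_uses([ret], agg, split), ret.operands[0].op)  # 1 splitStruct
```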
Alignment Preservation
The pass must preserve memory alignment when splitting aggregate loads/stores into per-element accesses. GPU memory transactions have strict alignment requirements: a misaligned access can silently produce wrong results or trap, depending on the address space and SM architecture.
The Alignment Formula
The decompiled alignment calculation is:
aligned_value = (1 << (alignment_field >> 1)) >> 1
Breaking this down:
- alignment_field >> 1 -- the alignment is stored in a compressed encoding where the field value is approximately 2 * log2(alignment) + bias; halving the field recovers log2(alignment) + 1.
- 1 << (result) -- converts back to a power of two, which at this point is twice the target alignment.
- >> 1 -- the final shift divides out that factor of two, yielding the alignment itself.
For example, if alignment_field = 9, then 9 >> 1 = 4, 1 << 4 = 16, 16 >> 1 = 8, yielding 8-byte alignment. This encoding is compact and used throughout NVVM's type system to store alignment in a single byte.
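A quick sketch of the decode, plus a guessed inverse. The exact bias and the meaning of the field's low bit are inferences from the worked example, not recovered constants:

```python
# Decode per the decompiled formula: (1 << (field >> 1)) >> 1.
def decode_alignment(field):
    return (1 << (field >> 1)) >> 1

# Assumed inverse: for a power-of-two alignment, field >> 1 must equal
# log2(align) + 1, so field = 2 * (log2(align) + 1); the low bit is
# ignored by the decode and may carry a separate flag.
def encode_alignment(align):
    return 2 * align.bit_length()

print(decode_alignment(9))  # 8, matching the worked example
for a in (1, 2, 4, 8, 16, 32, 64):
    assert decode_alignment(encode_alignment(a)) == a
```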
Additional Alignment Computation
sub_1CCB4A0 provides a DataLayout-aware alignment computation for the element type. The final alignment is the minimum of the original alignment and the element's natural alignment, computed via:
final_align = original_align & (-element_natural_align)
The bitwise AND with the negation of the element alignment selects the largest power-of-two that divides both values, ensuring the per-element access is always naturally aligned for its type without exceeding the original aggregate's alignment guarantee.
NVVM Type ID System
The pass operates on NVVM's proprietary type ID system, not LLVM's Type::TypeID. The size classification logic (decompiled lines 997--1030) reveals the mapping:
| NVVM Type ID | Type | Bit Width |
|---|---|---|
| 1 | BFloat16 (i8 pair with padding) | 16 |
| 2 | Float | 32 |
| 3 | Double / i64 (context-dependent) | 64 |
| 4 | x86_fp80 | 80 (with padding to 10 bytes) |
| 5, 6 | FP128 / PPC FP128 | 128 |
| 7 | Pointer | 8 * DataLayout::getPointerSizeInBits(0) |
| 9 | Float (alternate, possibly metadata) | 64 |
| 0xB (11) | Integer (arbitrary width) | element_encoding >> 8 |
| 0xD (13) | Array | 8 * DataLayout::getStructLayout(type) total size |
| 0xE (14) | Struct | Recursive sum of element sizes |
| 16 | Vector | Triggers vector-type wrapping via sub_16463B0 |
For struct types (ID 0xE), the size computation is recursive: the pass sums the sizes of all elements, each resolved through the same type-ID dispatch table. Array types (ID 0xD) use sub_15A9930 to look up the total allocation size from the DataLayout's StructLayout cache (which also handles arrays despite the name).
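The size dispatch can be sketched for the unambiguous rows of the table; the tuple encoding and parameterized pointer width are assumptions for illustration:

```python
# Sketch of the size dispatch over NVVM type IDs (per the table above).
def type_size_bits(ty, ptr_bits=64):
    tid = ty[0]
    if tid == 2:
        return 32                      # float
    if tid in (5, 6):
        return 128                     # fp128 / ppc_fp128
    if tid == 7:
        return ptr_bits                # pointer: DataLayout-dependent
    if tid == 0xB:
        return ty[1] >> 8              # integer: width encoded above bit 8
    if tid == 0xE:                     # struct: recursive sum of elements
        return sum(type_size_bits(e, ptr_bits) for e in ty[1])
    raise ValueError(f"unmodeled type ID {tid}")

i32 = (0xB, 32 << 8)
pair = (0xE, [i32, (2,)])   # struct {i32, float}
print(type_size_bits(pair))  # 64
```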
Nested Struct and Array Handling
When a struct element is itself a struct or an array, the pass recurses. The sub_159C470 (getScalarType) call during type decomposition flattens nested aggregates: a struct {i32, {f32, f64}, i16} decomposes not into three elements but into four scalars: i32, f32, f64, i16. The flattening continues until every element is a primitive scalar or a pointer.
Arrays within structs are handled differently depending on their size. Small arrays may be fully unrolled into individual element accesses. The size threshold is governed by the max-aggr-copy-size and large-aggr-store-limit knobs. Arrays that exceed the threshold are not decomposed into per-element loads but instead lowered to byte-copy loops (the "remsrc" / "remdst" / "i8dst" paths correspond to this remainder-byte handling when the aggregate cannot be evenly split into typed elements).
The remainder path:
- Computes the number of whole elements that can be extracted as typed loads.
- For any trailing bytes that do not fill a complete element, generates an i8 byte loop: "remsrc" is the source pointer for the remainder, "remdst" is the destination, and "i8dst" is the byte-typed destination pointer.
Relationship with SROA
LLVM's SROA (Scalar Replacement of Aggregates) and NVIDIA's struct splitting are complementary, not overlapping:
| Aspect | LLVM SROA | NVIDIA Struct Splitting |
|---|---|---|
| Target | alloca instructions in entry block | Non-alloca aggregate operations |
| Scope | Stack-allocated structs | Return values, call args, PHI nodes, memcpy results |
| IR level | LLVM IR (standard Type::TypeID) | NVVM IR (proprietary type IDs) |
| Pipeline position | Early scalar optimization passes | After LLVM optimization, NVVM lowering phase |
| Output | SSA scalars replacing alloca uses | splitStruct (opcode 32) multi-output nodes |
| Upstream | Standard LLVM pass | No upstream equivalent |
SROA runs during the standard LLVM optimization pipeline and eliminates alloca-based aggregates. By the time struct splitting runs, all remaining aggregate operations are those SROA could not handle: function return values carrying struct types, call sites passing or receiving struct-typed parameters, and aggregate-typed PHI nodes at control flow merges. Struct splitting is the final lowering step that ensures no aggregate-typed values survive into register allocation.
PTX Register Mapping
After struct splitting, every value in the IR is scalar-typed. During instruction selection and register allocation, each scalar maps to a PTX virtual register of the corresponding type:
// Before struct splitting:
%result = load {i32, f32, i64}, ptr %p, align 8
// After struct splitting:
%split = splitStruct {i32, f32, i64} // opcode 32, multi-output
%r0 = extractelement %split, 0 // i32 -> %r1 (32-bit register)
%r1 = extractelement %split, 1 // f32 -> %f1 (32-bit FP register)
%r2 = extractelement %split, 2 // i64 -> %rd1 (64-bit register)
In PTX, register types are explicit:
- %r registers: 32-bit integers
- %rd registers: 64-bit integers
- %f registers: 32-bit floats
- %fd registers: 64-bit floats
- %h registers: 16-bit values (half/bfloat)
- %p registers: predicates (1-bit)
Without struct splitting, the register allocator would need to handle aggregate-typed live ranges, which is impossible on GPU hardware where the register file has no concept of a "struct register." The pass is therefore a hard prerequisite for correct register allocation.
Pipeline Position
The pass runs as part of the NVVM lowering phase, after the main LLVM optimization pipeline has completed. It is registered as lower-aggr-copies in the New PM pipeline parser at index 417 (sub_2342890), with parameter lower-aggr-func-args controlling whether function argument aggregates are also lowered.
Pipeline position:
LLVM Optimizer (SROA, GVN, DSE, etc.)
-> NVIDIA NVVM Lowering Phase
-> lower-struct-args (opt-byval) [lower struct function args]
-> lower-aggr-copies (lower-aggr-func-args) [struct splitting]
-> memory-space-opt [address space resolution]
-> register allocation preparation
The companion pass lower-struct-args (pass index 418) handles byval-attributed function parameters specifically, converting struct-typed byval parameters into explicit copy + scalar access patterns. It runs before lower-aggr-copies to ensure that byval struct arguments are already decomposed when the main splitting pass encounters them.
Configuration
Knobs (ctor_265 at 0x4F48E0)
| Knob | Default | Description |
|---|---|---|
| devicefn-param-always-local | -- | Treat parameter space as local in device functions |
| skiploweraggcopysafechk | false | Skip safety check in aggregate copy lowering |
| large-aggr-store-limit | -- | Threshold for large aggregate store unrolling |
| max-aggr-copy-size | -- | Maximum aggregate size for full decomposition |
| lower-aggr-unrolled-stores-limit | -- | Limit on unrolled stores per aggregate copy |
InstCombine Aggregate Knobs (ctor_086 at 0x49E670)
| Knob | Default | Description |
|---|---|---|
| max-aggr-lower-size | 128 | Size threshold (bytes) below which InstCombine lowers aggregates |
| aggressive-max-aggr-lower-size | 256 | Aggressive threshold for aggregate lowering |
| instcombine-merge-stores-from-aggr | true | Merge stores originating from aggregate decomposition |
Related Passes
| Knob | Scope | Description |
|---|---|---|
| lsa-opt | lower-struct-args | Controls struct argument lowering |
| lower-read-only-devicefn-byval | lower-struct-args | Lower read-only device function byval params |
| hoist-load-param | lower-struct-args | Hoist parameter loads |
| nvptx-force-min-byval-param-align | backend | Force 4-byte minimum alignment for byval params |
| nvptx-early-byval-copy | backend | Copy byval arguments early in the pipeline |
Diagnostic Strings
"splitStruct" -- Name prefix for the opcode-32 multi-output node
"srcptr" -- Source pointer in aggregate copy lowering
"dstptr" -- Destination pointer in aggregate copy lowering
"remsrc" -- Remainder source pointer (byte-copy tail loop)
"remdst" -- Remainder destination pointer (byte-copy tail loop)
"i8dst" -- Byte-typed destination for remainder copies
"split" -- Name prefix for the per-element split load
"ptr" -- Name prefix for element pointer extractions
"vld" -- Vector load variant in the second instance
Function Map
Primary Instance (sub_1C86CA0, 72KB)
| Function | Address | Role |
|---|---|---|
| Main driver | sub_1C86CA0 | Top-level struct splitting pass |
| StructLayout query | sub_1643350 | DataLayout::getStructLayout |
| Scalar type query | sub_159C470 | Get scalar element type (recursive for nested structs) |
| Composite type creation | sub_15F9F50 | Build composite type from element array |
| Alignment from DL | sub_1646BA0 | Set type alignment from DataLayout |
| Vector type wrapping | sub_16463B0 | Wrap in vector type if original was vector |
| Instruction creation | sub_15F1EA0 | InitInstruction(type, opcode=32, parent, nops, flags) |
| Element extraction | sub_15FD590 | Create indexed load from multi-output node |
| Split load creation | sub_15F90A0 | Create load with alignment preservation |
| Alignment computation | sub_1CCB4A0 | DataLayout-aware alignment for element type |
| Use replacement | sub_164D160 | RAUW (Replace All Uses With) |
| Pointer size query | sub_15A9520 | DataLayout::getPointerSizeInBits(AS) |
| Struct size query | sub_15A9930 | DataLayout::getStructLayout for size lookup |
Second Instance (sub_2CCF450, 58KB)
| Function | Address | Role |
|---|---|---|
| Aggregate lowering | sub_2CCF450 | lower-aggr-copies pass implementation |
Pipeline Registration
| Function | Address | Role |
|---|---|---|
| New PM registration | sub_2342890 | Pass index 417 (lower-aggr-copies) |
| Parameter parser | sub_233A3B0 | Parses lower-aggr-func-args parameter |
| lower-struct-args parser | sub_233A370 | Parses opt-byval parameter |
Test This
The following kernel returns a struct from a device function. Struct splitting should decompose the aggregate return value into individual scalar registers.
struct Result {
float value;
int index;
float confidence;
};
__device__ Result compute(const float* data, int tid) {
Result r;
r.value = data[tid] * 2.0f;
r.index = tid;
r.confidence = 0.95f;
return r;
}
__global__ void struct_split_test(const float* in, float* out_val,
int* out_idx, float* out_conf, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= n) return;
Result r = compute(in, tid);
out_val[tid] = r.value;
out_idx[tid] = r.index;
out_conf[tid] = r.confidence;
}
What to look for in PTX:
- The compute function should be inlined, but even if it is not, the struct return should be decomposed. Look for the absence of .local memory for the Result struct -- all three fields (value, index, confidence) should live in individual PTX registers (%f for floats, %r for int).
- No ld.local/st.local pairs for passing the struct between compute and the kernel. If the struct survives unsplit, the caller allocates local memory for the return value, the callee stores into it, and the caller loads from it -- a 200+ cycle penalty per field.
- In the PTX, the three stores to out_val, out_idx, out_conf should use values directly from registers without any intermediate local memory traffic. Look for st.global.f32 and st.global.u32 with register operands, not loaded-from-local operands.
- To see the unsplit case, make compute a __noinline__ function and compile at -O0. The struct will be passed through .param space with explicit st.param/ld.param sequences, showing the overhead that struct splitting eliminates.
Cross-References
- SROA -- upstream SROA handles alloca-based aggregates; complements StructSplitting's inter-procedural splitting
- Rematerialization -- struct splitting reduces aggregate live ranges before remat
- Memmove Unrolling -- companion pass at sub_1C82A50 that unrolls memmove/memcpy loops
- FP128/I128 Emulation -- companion pass at sub_1C8C170 in the same binary cluster
- Pipeline & Ordering -- pass ordering in the New PM pipeline
- NVIDIA Custom Passes Overview -- master inventory of all NVIDIA passes
- Code Generation -- register allocation that consumes split scalars
Memmove Unrolling
CUDA GPUs have no hardware instruction for bulk memory copy. On a CPU, memcpy and memmove compile down to optimized microcode sequences (REP MOVSB, AVX-512 scatter/gather, or libc hand-tuned SIMD loops). On an SM, every byte of a copy must pass through explicit load and store instructions executed by individual threads. LLVM's standard memcpy lowering in SelectionDAG produces reasonable load/store sequences, but it operates late in the pipeline and cannot reason about NVVM IR semantics -- address spaces, alignment guarantees from the CUDA memory model, or the interaction between copy direction and overlapping shared-memory buffers. NVIDIA's memmove unrolling pass replaces llvm.memmove and llvm.memcpy intrinsic calls at the NVVM IR level with explicit element-wise copy loops, generating both forward and reverse copy paths to handle overlapping memory correctly.
The pass lives in the aggregate-lowering cluster at 0x1C80000--0x1CBFFFF, adjacent to struct splitting (sub_1C86CA0) and FP128/I128 emulation (sub_1C8C170). It is part of the lower-aggr-copies pipeline pass (pass index 417), which coordinates memmove unrolling, struct splitting, and aggregate store lowering as a single pipeline unit. Upstream LLVM has no equivalent IR-level memmove unroller -- this is entirely NVIDIA-proprietary.
Key Facts
| Property | Value |
|---|---|
| Entry point | sub_1C82A50 |
| Size | 39KB (~1,200 lines decompiled) |
| Binary cluster | 0x1C80000--0x1CBFFFF (Aggregate Splitting + Memory Ops) |
| Pipeline pass | lower-aggr-copies (pass index 417, parameterized: lower-aggr-func-args) |
| Pass registration | sub_233A3B0 (parameter parser for LowerAggrCopiesPass) |
| IR level | NVVM IR (pre-instruction-selection) |
| Unroll threshold global | dword_4FBD560 |
| Knob constructor | ctor_265 at 0x4F48E0 |
| LLVM upstream | No equivalent -- NVIDIA-proprietary |
| Neighbor passes | Struct splitting (sub_1C86CA0), FP128 emulation (sub_1C8C170) |
Why This Pass Exists
On a CPU, memmove(dst, src, n) is a single function call that the runtime library implements with architecture-specific optimized loops, often using SIMD instructions that move 32 or 64 bytes per cycle. On a GPU:
- No bulk copy instruction. PTX and SASS have ld and st but no memcpy or rep movsb equivalent. Every byte must be an explicit load followed by an explicit store.
- Per-thread execution model. Each thread in a warp copies its own portion of data. A 128-byte struct copy in a kernel with 1024 threads means 1024 independent 128-byte copy sequences, all of which must resolve to individual load/store pairs.
- Address space semantics. The source and destination may live in different address spaces (global, shared, local, constant). Generic-pointer memmove requires runtime address-space resolution, but if the compiler can resolve the spaces at IR time, it can emit space-qualified loads and stores that map directly to the correct PTX instructions.
- Overlap semantics. memmove guarantees correct behavior when source and destination overlap. The pass must emit both a forward path (for dst < src) and a reverse path (for dst >= src) to preserve this guarantee. memcpy is also routed through this pass because the NVVM verifier enforces overlap-safety uniformly.
Algorithm
The pass scans each function for llvm.memmove and llvm.memcpy intrinsic calls. For each call, it replaces the intrinsic with a 4-block CFG that implements element-wise copying. The generated code has two paths: one for when the element count is statically known and small enough to fully unroll, and one for dynamic or large counts that use a loop with a PHI induction variable.
Step 1: Basic Block Structure Creation
The pass creates four new basic blocks, splitting the block containing the memmove call:
           +-------+
           | split |                 (direction comparison)
           +---+---+
              / \
 +-------------+ +-------------+
 | forward.for | | reverse.for |
 +------+------+ +------+------+
         \             /
        +-------------+
        | nonzerotrip |             (exit / continuation)
        +-------------+
| Block | Name string | Purpose |
|---|---|---|
| Entry | "split" | Compares src and dst addresses to choose copy direction |
| Forward | "forward.for" | Copies elements from index 0 upward |
| Reverse | "reverse.for" | Copies elements from index count-1 downward |
| Exit | "nonzerotrip" | Continuation after the copy completes |
Step 2: Forward vs. Reverse Decision
The split block determines copy direction by comparing the source and destination base addresses:
; Pseudocode for the split block
%cmp = icmp ult ptr %dst, ptr %src ; sub_12AA0C0, opcode 0x22 (34)
br i1 %cmp, label %forward.for, label %reverse.for ; sub_15F83E0
The ICMP instruction is created via sub_12AA0C0 with opcode 0x22 (34 decimal, corresponding to an unsigned-less-than integer comparison). The conditional branch is created via sub_15F83E0. When dst < src, memory does not overlap in the forward direction, so the forward path is safe. When dst >= src, copying forward would overwrite source bytes before they are read, so the reverse path is required.
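The direction choice can be demonstrated with a toy byte-copy model (plain Python, no recovered code): a forward copy corrupts data when the destination overlaps the source at a higher address, while the reverse loop preserves memmove semantics:

```python
# Toy model of forward.for vs reverse.for over one overlapping buffer.
def copy_bytes(buf, dst, src, n, forward):
    idxs = range(n) if forward else range(n - 1, -1, -1)
    for i in idxs:
        buf[dst + i] = buf[src + i]

data = list(b"abcdef")
copy_bytes(data, 2, 0, 4, forward=True)    # dst >= src: forward clobbers
print(bytes(data))                          # b'ababab' -- wrong

data = list(b"abcdef")
copy_bytes(data, 2, 0, 4, forward=False)   # reverse path is required here
print(bytes(data))                          # b'ababcd' -- correct memmove
```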
Step 3: Copy Generation -- Small/Static Path
When the copy size is statically known and satisfies size <= dword_4FBD560 (the compile-time unroll threshold), the pass generates fully unrolled element-by-element copies with no loop overhead.
Reverse copy (decompiled lines 606--690):
; Fully unrolled reverse copy, count elements
; For i = count-1 downto 0:
%src.gep.N = getelementptr i8, ptr %src, i64 N ; named "src.memmove.gep.unroll"
%val.N = load i8, ptr %src.gep.N, align A ; sub_15F9210 (InitLoadInstruction)
%dst.gep.N = getelementptr i8, ptr %dst, i64 N ; named "dst.memmove.gep,unroll" [sic]
store i8 %val.N, ptr %dst.gep.N, align A ; sub_15F9650 (InitStoreInstruction)
; ... repeated for each index from count-1 down to 0
Forward copy (decompiled lines 1036--1123):
; Fully unrolled forward copy, count elements
; For i = 0 to count-1:
%src.gep.N = getelementptr i8, ptr %src, i64 N ; "src.memmove.gep.unroll"
%val.N = load i8, ptr %src.gep.N, align A
%dst.gep.N = getelementptr i8, ptr %dst, i64 N ; "dst.memmove.gep,unroll" [sic]
store i8 %val.N, ptr %dst.gep.N, align A
; ... repeated for each index from 0 up to count-1
Each load is created via sub_15F9210 (InitLoadInstruction, opcode 64 type 1) and each store via sub_15F9650 (InitStoreInstruction, opcode 64 type 2). Alignment is set on both loads and stores via sub_15F8F50 / sub_15F9450, preserving the alignment from the original memmove intrinsic call (passed as parameter a15). Memory attributes (volatile flags, etc.) are propagated through parameters a16 and a17.
Step 4: Copy Generation -- Large/Dynamic Path
When the copy size exceeds the threshold or is not statically known, the pass generates a single-iteration loop body with a PHI induction variable:
Forward loop:
forward.for:
%iv = phi i64 [ 0, %split ], [ %iv.next, %forward.for ] ; sub_15F1EA0, opcode 53
%src.gep = getelementptr i8, ptr %src, i64 %iv
%val = load i8, ptr %src.gep, align A
%dst.gep = getelementptr i8, ptr %dst, i64 %iv
store i8 %val, ptr %dst.gep, align A
%iv.next = add i64 %iv, 1 ; sub_15A0680 (constant 1) + sub_15FB440 (ADD, opcode 13)
%done = icmp eq i64 %iv.next, %count
br i1 %done, label %nonzerotrip, label %forward.for ; sub_15F83E0
Reverse loop:
reverse.for:
%iv = phi i64 [ %count.minus1, %split ], [ %iv.next, %reverse.for ]
%src.gep = getelementptr i8, ptr %src, i64 %iv
%val = load i8, ptr %src.gep, align A
%dst.gep = getelementptr i8, ptr %dst, i64 %iv
store i8 %val, ptr %dst.gep, align A
%iv.next = sub i64 %iv, 1
%done = icmp eq i64 %iv.next, -1 ; or icmp slt i64 %iv.next, 0
br i1 %done, label %nonzerotrip, label %reverse.for
The PHI node is created via sub_15F1EA0 with opcode 53. The constant 1 for the increment is created via sub_15A0680. The addition/subtraction uses sub_15A2B60 or sub_15FB440 (the 5-argument node constructor, opcode 13 for ADD). The nonzerotrip block serves as the exit target for both loop directions.
Step 5: Alignment Propagation
The pass preserves the alignment annotation from the original memmove/memcpy intrinsic call. The alignment value is passed through the internal parameter a15 to the load/store alignment setter functions sub_15F8F50 (SetLoadAlignment) and sub_15F9450 (SetStoreAlignment). This matters because downstream PTX emission can generate wider loads (e.g., ld.global.v4.b32 for 16-byte aligned accesses) if the alignment permits it.
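The payoff can be sketched as a width-selection rule; the width table here is an assumption about what the backend could emit given an alignment, not recovered logic:

```python
# Illustrative mapping from preserved alignment (power of two) to the
# widest access the backend could emit, e.g. 16 bytes -> ld.global.v4.b32.
def widest_access_bytes(align):
    for width in (16, 8, 4, 2, 1):   # v4.b32, b64, b32, b16, b8
        if align % width == 0:
            return width
    return 1

print(widest_access_bytes(16))  # 16
print(widest_access_bytes(4))   # 4
```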
Step 6: Cleanup
After generating the replacement CFG, the original memmove/memcpy intrinsic call is erased. The pass uses sub_164D160 (RAUW -- Replace All Uses With) to rewire any remaining references.
Unroll Threshold
The global variable dword_4FBD560 controls the boundary between full unrolling and loop generation. This value is registered at ctor_265 (0x4F48E0) as part of the aggregate copy lowering knob group.
| Condition | Code generation |
|---|---|
| count statically known AND count <= dword_4FBD560 | Fully unrolled: N load/store pairs with no loop overhead |
| count statically known AND count > dword_4FBD560 | Dynamic loop with PHI induction variable |
| count not statically known | Dynamic loop with PHI induction variable |
The tradeoff is straightforward: full unrolling eliminates loop overhead (branch, PHI, compare) but increases code size linearly. For GPU kernels where instruction cache pressure is rarely the bottleneck, unrolling small copies is almost always profitable. The threshold prevents pathological code size explosion for large static copies (e.g., a 4KB struct assignment would generate 4,096 load/store pairs without the limit).
The related knob lower-aggr-unrolled-stores-limit provides an additional limit on the number of stores generated in unrolled mode, and large-aggr-store-limit controls when aggregate stores transition from unrolled sequences to loops.
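The decision reduces to a sketch like the following; the threshold value used here is illustrative, since the binary's default in dword_4FBD560 is not documented above:

```python
# Sketch of the unroll-vs-loop decision driven by the dword_4FBD560
# threshold; count is None when the byte count is not a compile-time
# constant.
def lower_copy_strategy(count, threshold):
    if count is not None and count <= threshold:
        return "unrolled"    # N load/store pairs, no loop overhead
    return "loop"            # PHI induction variable + compare + branch

print(lower_copy_strategy(8, 16))      # unrolled
print(lower_copy_strategy(4096, 16))   # loop
print(lower_copy_strategy(None, 16))   # loop (dynamic count)
```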
Naming Conventions
The pass names its generated GEP instructions with distinctive prefixes that are visible in IR dumps and useful for debugging:
| Instruction | Name string | Notes |
|---|---|---|
| Source GEP | "src.memmove.gep.unroll" | Period-separated |
| Destination GEP | "dst.memmove.gep,unroll" | Comma before unroll -- a typo in the binary [sic] |
The comma in "dst.memmove.gep,unroll" (where a period would be expected by analogy with the source GEP name) is a benign naming inconsistency baked into the binary string table. It has no semantic effect -- LLVM IR value names are arbitrary strings -- but it serves as a reliable fingerprint for identifying output from this specific pass. A reimplementation should preserve this exact string if binary-identical IR output is desired, or normalize it to "dst.memmove.gep.unroll" if not.
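Because the string is baked into the binary, a literal match on the comma distinguishes this pass's GEPs from any normalized name; a minimal check:

```python
import re

# The comma is the fingerprint: it never matches the "corrected" name.
FINGERPRINT = re.compile(r"dst\.memmove\.gep,unroll")

ir_line = '%dst.gep = getelementptr i8, ptr %d, i64 3 ; "dst.memmove.gep,unroll"'
print(bool(FINGERPRINT.search(ir_line)))                     # True
print(bool(FINGERPRINT.search("dst.memmove.gep.unroll")))    # False
```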
Configuration
Knobs registered at ctor_265 (0x4F48E0), applicable to the lower-aggr-copies pass cluster:
| Knob | Global | Description |
|---|---|---|
| lower-aggr-unrolled-stores-limit | -- | Maximum number of stores in unrolled mode |
| large-aggr-store-limit | -- | Element count above which aggregate stores use a loop |
| max-aggr-copy-size | -- | Maximum aggregate copy size the pass will handle |
| skiploweraggcopysafechk | -- | Skip safety check in aggregate copy lowering |
| devicefn-param-always-local | -- | Treat device function parameter space as local |
The pass can be invoked via the pipeline text interface:
-Xcicc "-passes=lower-aggr-copies"
-Xcicc "-passes=lower-aggr-copies<lower-aggr-func-args>"
Related aggregate lowering knobs from ctor_089 (0x4A0D60):
| Knob | Default | Description |
|---|---|---|
| max-aggr-lower-size | 128 | Threshold size (bytes) below which aggregates are lowered |
| aggressive-max-aggr-lower-size | 256 | Aggressive threshold for aggregate lowering |
Diagnostic Strings
"split"
"forward.for"
"reverse.for"
"nonzerotrip"
"src.memmove.gep.unroll"
"dst.memmove.gep,unroll"
"memmove/memcpy cannot target constant address space" (from nvvm-verify)
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Memmove unroller | sub_1C82A50 | 39KB | Main pass: CFG construction, copy generation |
| ICMP creation | sub_12AA0C0 | -- | Creates integer comparison (opcode 0x22) |
| Conditional branch | sub_15F83E0 | -- | Creates br i1 |
| InitLoadInstruction | sub_15F9210 | -- | Creates load instruction (opcode 64, type 1) |
| InitStoreInstruction | sub_15F9650 | -- | Creates store instruction (opcode 64, type 2) |
| SetLoadAlignment | sub_15F8F50 | -- | Sets alignment on load |
| SetStoreAlignment | sub_15F9450 | -- | Sets alignment on store |
| InitInstruction (PHI) | sub_15F1EA0 | -- | Creates PHI node (opcode 53) |
| CreateConstant | sub_15A0680 | -- | Creates integer constant (e.g., 1 for increment) |
| CreateBinaryOp | sub_15FB440 | -- | Creates binary operation node (5-arg constructor) |
| CreateBinaryOp (variant) | sub_15A2B60 | -- | Alternative binary op constructor |
| RAUW | sub_164D160 | -- | Replace All Uses With |
| Pipeline param parser | sub_233A3B0 | -- | Parses lower-aggr-func-args parameter |
Cross-References
- Struct/Aggregate Splitting -- sibling pass in the same lower-aggr-copies pipeline unit; decomposes struct-typed operations into scalar field operations
- FP128/I128 Emulation -- neighbor in the 0x1C80000 cluster; replaces wide arithmetic with runtime library calls
- NVVM Verifier -- validates that memmove/memcpy targets are not in constant address space
- NVIDIA Custom Passes -- master index of all proprietary passes
- SROA -- upstream LLVM pass that splits alloca-based aggregates; handles memcpy/memmove during alloca rewriting
printf-lowering
The printf lowering pass rewrites device-side printf() calls into CUDA's runtime vprintf() ABI. GPU hardware does not support C variadic function calls, so the compiler must pack all arguments into a stack buffer and emit a two-argument call to vprintf(format_string, arg_buffer_ptr). CICC implements this transformation at two levels: a module-level IR pass and an AST-level lowering function.
| Property | Value |
|---|---|
| Pass name | printf-lowering |
| Class | llvm::PrintfLoweringPass |
| Scope | Module pass |
| Registration | New PM slot 130, sub_2342890 |
| Module-level entry | sub_1CB1E60 (31 KB) |
| AST-level lowering | sub_12992B0 (24 KB) |
| Enable knob | nvvm-lower-printf (registered at ctor_269) |
Two Lowering Stages
Printf lowering happens at two points in the compilation pipeline:
Stage 1 -- AST-level (sub_12992B0): During initial IR generation from the EDG frontend output, when the code generator encounters a direct call to printf, it intercepts the call and emits the vprintf rewrite inline. This is the earlier, more detailed pass that handles type promotion, buffer packing, and alloca management.
Stage 2 -- Module-level (sub_1CB1E60): A cleanup pass that runs during the LLVM optimization pipeline. It catches any remaining printf calls that survived the AST lowering (e.g., from linked bitcode modules or inlined functions) and applies the same transformation. This pass validates that the format string is a string literal: "The first argument for printf must be a string literal!".
AST-Level Lowering Algorithm (sub_12992B0)
The AST-level lowering is the more thoroughly analyzed implementation. It operates in six phases:
Phase 1: Resolve the vprintf Symbol
The pass looks up or creates the "vprintf" function declaration in the module:
- Build the vprintf parameter type list: (i8*, i8*)
- Create the FunctionType via sub_1644EA0
- Call sub_1632190(Module*, "vprintf", 7, funcType) -- this is Module::getOrInsertFunction

The literal string "vprintf" with length 7 is stored in a local variable.
Phase 2: Set Up Argument List
- The format string (**a3) becomes the first argument
- The remaining varargs (a3[1..]) are collected into a dynamic argument array
- A 22-QWORD (176-byte) stack small-buffer optimization avoids heap allocation for typical printf calls with fewer than ~16 arguments
Fast path: if argCount <= 1 (format string only, no varargs), the pass skips buffer creation entirely and emits vprintf(fmt, undef) using sub_15A06D0 (UndefValue::get).
Phase 3: Allocate Packed Argument Buffer
For the varargs case, a stack buffer named "tmp" is allocated:
- sub_127FC40(context, type, "tmp", alignment=8, addrspace=0) creates an alloca
- The alloca is cached at a1[19] and reused across multiple printf calls within the same function
- If a cached alloca exists, its size is reused (and potentially grown in Phase 5)
Phase 4: Per-Argument Processing
For each vararg, the pass:
- Float promotion: per the C variadic calling convention, float arguments are promoted to double via an fpext instruction. Detected when type_info[+12] == 2 and type_info[+16] != 0.
- Type size calculation: a multi-level switch on the LLVM type tag computes the byte width:

| Type tag | Size (bits) | Notes |
|---|---|---|
| 1 | 16 | half / i16 |
| 2 | 32 | float / i32 |
| 3, 9 | 64 | double / i64 |
| 4 | 80 | x86_fp80 |
| 5, 6 | 128 | fp128 / ppc_fp128 |
| 7 | target-dependent | Pointer size from DataLayout |
| 11 | custom | dword >> 8 (arbitrary-width integer) |
| 13 | aggregate | Struct size from DataLayout |
| 14 | packed struct | Complex alignment calculation, up to 3 levels of nesting |

- Alignment and offset: each argument is placed at the next naturally-aligned offset in the buffer. If offset % argSize != 0, the offset is rounded up.
- GEP creation: a GetElementPtr named "buf.indexed" indexes into the packed buffer at the computed byte offset.
- Bitcast: if the GEP result type differs from the argument type, a bitcast instruction named "casted" (opcode 47) is emitted.
- Store: the argument value is stored into the buffer slot via a StoreInst.
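The promotion and placement rules above amount to a small packing loop. The following is an illustrative Python model of that loop, not CICC's code; pack_args and the (type, size) tuples are names invented for the sketch:

```python
def pack_args(arg_types):
    """Model of printf vararg packing: promote float to double,
    then place each argument at its next naturally aligned offset."""
    offsets = []
    offset = 0
    for ty, size in arg_types:
        if ty == "float":                  # C variadic promotion: float -> double
            ty, size = "double", 8
        if offset % size != 0:             # round up to natural alignment
            offset += size - (offset % size)
        offsets.append((ty, offset))
        offset += size
    return offsets, offset                 # per-arg offsets, total buffer size

# float is promoted to 8 bytes, i32 lands at 8, double rounds 12 up to 16
offsets, total = pack_args([("float", 4), ("i32", 4), ("double", 8)])
```

Under these rules the example buffer is 24 bytes, matching the "next naturally-aligned offset" behavior described for Phase 4.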
Phase 5: Alloca Resize
After processing all arguments, the pass checks whether the total packed size exceeds the current alloca size. If so, it patches the alloca's size operand in-place by manipulating the use-def chain directly -- unlinking the old size constant and linking a new one. This unusual technique avoids creating a second alloca while ensuring a single allocation dominates all printf pack sites.
Phase 6: Emit vprintf Call
sub_1285290 emits the final call: vprintf(format_string, arg_buffer_ptr).
Cleanup frees any heap-allocated argument arrays (from the small-buffer overflow path).
Module-Level Pass (sub_1CB1E60)
The module-level pass at 0x1CB1E60 (31 KB) performs a similar transformation but operates on already-lowered LLVM IR rather than AST nodes. Key recovered strings:
| String | Purpose |
|---|---|
"DataLayout must be available for lowering printf!" | Guard: DataLayout required |
"vprintf" | Target function name |
"The first argument for printf must be a string literal!" | Format string validation |
"vprintfBuffer.local" | Name of the packed argument buffer alloca |
"bufIndexed" | Name of GEP instructions into the buffer |
The module-level pass uses "vprintfBuffer.local" as the alloca name (versus "tmp" in the AST-level lowering), and "bufIndexed" for the GEP instructions (versus "buf.indexed"). These naming differences confirm the two implementations are distinct codepaths.
Implementation Details
Small-buffer optimization: the argument array uses a 22-QWORD (176-byte) stack buffer. Only if more than ~16 arguments overflow does it heap-allocate via the SmallVector grow path (sub_16CD150). This avoids malloc for typical printf calls.
Alloca caching: a1[19] in the IRGenState caches the "tmp" alloca across multiple printf calls within the same function. This reduces alloca instruction count in functions with many printf calls.
Struct nesting limit: the type-size calculation handles up to 3 levels of nested struct packing (three nested switch statements in the decompilation). Deeper nesting hits a JUMPOUT at 0x129A22F -- likely an assertion for structs nested more than 3 levels in printf arguments.
Pointer tag bits: the basic block instruction list uses an intrusive doubly-linked list where the low 3 bits of next/prev pointers carry metadata tags (masked with 0xFFFFFFFFFFFFFFF8). This is consistent with LLVM's ilist implementation using pointer-int pairs.
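The pointer-int pair trick is easy to model: because list nodes are at least 8-byte aligned, the low 3 bits of a link pointer are always zero and can carry metadata. A minimal sketch (addresses and tag values are illustrative):

```python
MASK = 0xFFFFFFFFFFFFFFF8            # clears the 3 low tag bits

def pack(addr, tag):
    """Store a small tag in the unused low bits of an aligned pointer."""
    assert addr & 0x7 == 0 and tag < 8   # pointer must be 8-byte aligned
    return addr | tag

def unpack(word):
    """Recover (pointer, tag) from a tagged word."""
    return word & MASK, word & 0x7

ptr, tag = unpack(pack(0x7F00A440, 0b101))
```

This is the same idea as LLVM's PointerIntPair, which ilist-based containers use to avoid a separate metadata field per node.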
Diagnostic Strings
Diagnostic strings recovered from p2-B08-printf-lowering.txt and p1.7-04-sweep-0x1B00000-0x1CFFFFF.txt.
| String | Source | Category | Trigger |
|---|---|---|---|
"DataLayout must be available for lowering printf!" | sub_1CB1E60 (module-level pass) | Assertion/Error | Module lacks DataLayout; fatal guard at module pass entry |
"The first argument for printf must be a string literal!" | sub_1CB1E60 (module-level pass) | Error | Format string argument is not a constant string; validation failure |
"vprintf" | sub_1632190 / sub_12992B0 | Symbol | Target function name looked up or created in the module (literal string, length 7) |
"vprintfBuffer.local" | sub_1CB1E60 (module-level pass) | IR name | Name of the packed argument buffer alloca in the module-level pass |
"bufIndexed" | sub_1CB1E60 (module-level pass) | IR name | Name of GEP instructions into the argument buffer in the module-level pass |
"tmp" | sub_12992B0 (AST-level lowering) | IR name | Name of the packed argument buffer alloca in the AST-level lowering; cached at a1[19] |
"buf.indexed" | sub_12992B0 (AST-level lowering) | IR name | Name of GEP instructions into the argument buffer in the AST-level lowering |
"casted" | sub_12992B0 (AST-level lowering) | IR name | Name of bitcast instructions when GEP result type differs from argument type (opcode 47) |
"nvvm-lower-printf" | ctor_269 | Knob | Enable knob for the printf lowering pass |
The two lowering stages produce different IR names for the same conceptual entities ("vprintfBuffer.local" vs "tmp" for the alloca, "bufIndexed" vs "buf.indexed" for the GEPs), confirming they are distinct codepaths.
IRGenState Layout
The codegen context object used by the AST-level lowering:
| Offset | Field | Purpose |
|---|---|---|
| a1[4] | Module* | The LLVM module |
| a1[5] | Return type | Function return type / type context |
| a1[6] | DebugLoc | Current debug location |
| a1[7] | BasicBlock* | Current insertion block |
| a1[8] | Iterator | Insertion point in BB's instruction list |
| a1[9] | AS context | Address space context for alloca type creation |
| a1[19] | AllocaInst* | Cached "tmp" alloca (reused across printf calls) |
ipmsp -- Inter-Procedural Memory Space Propagation
The IPMSP pass resolves generic (address space 0) pointer arguments to concrete NVIDIA address spaces by analyzing call sites across the entire module. When all callers of a function agree that a pointer argument points to a specific memory space (global, shared, local, constant), the pass either specializes the function in place or clones it with narrowed pointer types. This enables downstream passes to emit space-specific load/store instructions (e.g., ld.shared instead of generic ld) and eliminates addrspacecast overhead.
Disabling this pass (-disable-MemorySpaceOptPass) causes 2--20x performance regressions on real workloads. The pass is automatically disabled in OptiX IR mode (--emit-optix-ir routes -do-ip-msp=0).
| Pass name | ipmsp |
| Class | llvm::IPMSPPass |
| Scope | Module pass |
| Registration | New PM slot 125, line 1111 in sub_2342890 |
| Main function | sub_2CBBE90 (71 KB) -- MemorySpaceCloning worklist driver |
| LIBNVVM variant | sub_1C6A6C0 (54 KB) |
| Inference engine | sub_2CE96D0 -> sub_2CE8530 |
| Cloning engine | sub_F4BFF0 (CloneFunction) |
| Callee matching | sub_2CE7410 |
| Propagation | sub_2CF5840 -> sub_2CF51E0 |
| Pipeline control | do-ip-msp NVVMPassOption (default: enabled) |
NVPTX Address Spaces
The pass resolves generic (AS 0) pointers to specific address spaces: global (AS 1), shared (AS 3), constant (AS 4), local (AS 5), or param (AS 101). Generic pointers require a runtime address space check on every access; resolving them statically eliminates this overhead. See Address Spaces for the complete table with hardware mapping, pointer widths, aliasing rules, and the MemorySpaceOpt bitmask encoding.
Algorithm Overview
The pass operates as a worklist-driven inter-procedural fixed-point analysis. The top-level loop:
function IPMSP_Run(Module M):
worklist = deque<Function*>{}
argSpaceMap = map<Value*, int>{} // formal arg -> resolved AS
returnSpaceMap = map<Function*, int>{} // function -> return AS
calleeInfoMap = map<Function*, set<Function*>>{} // reverse call graph
// Phase 1: seed
for each F in M.functions():
if shouldProcess(F):
worklist.push_back(F)
for each caller of F:
calleeInfoMap[F].insert(caller)
debug("Initial work list size : %d", worklist.size())
// Phase 2: fixed-point iteration
while worklist not empty:
F = worklist.pop_back()
// Analyze and specialize F's callee arguments
changed = analyzeAndSpecialize(F, argSpaceMap, calleeInfoMap)
if changed:
// Propagate to F's callees
propagateSpacesToCallees(F, argSpaceMap)
for each callee C of F in calleeInfoMap:
if shouldProcess(C):
worklist.push_back(C)
debug("%d callees are affected")
// Check return space
if resolveReturnSpace(F, returnSpaceMap):
debug("%s : return memory space is resolved : %d")
// propagate to callers and push them onto worklist
Phase 1: Build Worklist
The pass iterates all functions in the module. A function enters the worklist if sub_2CBA650 returns true, meaning:
- The function is not a declaration or available_externally
- Its linkage is not extern_weak or common
- It is not an intrinsic (sub_B2DDD0 filter)
- It has at least one formal argument that is a generic pointer not yet in the resolved-space map
Specifically, sub_2CBA650 checks:
function shouldProcess(this, F):
if F has no users (F[16] == 0): return false
linkage = F.linkage & 0xF
if (linkage + 14) & 0xF <= 3: return false // available_externally, appending
if (linkage + 7) & 0xF <= 1: return false // common, extern_weak
if isIntrinsic(F): return false
retType = F.getReturnType()
if retType is pointer with AS 0 and not in returnSpaceMap:
return true
return hasUnresolvedPointerArgs(this, F)
sub_2CBA520 (hasUnresolvedPointerArgs) walks the formal arg list (stride 40 bytes) and returns true if any arg has type byte 14 (pointer) and is not already in the arg-space map.
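The two masked comparisons in shouldProcess are a classic branch-free range test: (x + k) & 0xF <= n checks whether x falls in a contiguous window modulo 16 with a single compare. A small sketch verifying which numeric linkage values each check rejects (the mapping of those numbers to named linkage kinds is the decompiler annotation above, not something the arithmetic itself proves):

```python
def rejected_by_first_check(linkage):
    # (linkage + 14) mod 16 <= 3 selects a 4-value window
    return ((linkage + 14) & 0xF) <= 3

def rejected_by_second_check(linkage):
    # (linkage + 7) mod 16 <= 1 selects a 2-value window
    return ((linkage + 7) & 0xF) <= 1

first = [l for l in range(16) if rejected_by_first_check(l)]    # {2, 3, 4, 5}
second = [l for l in range(16) if rejected_by_second_check(l)]  # {9, 10}
```

The same trick appears later in the specialization decision, where (linkage & 0xF) - 7 <= 1 selects linkage values 7 and 8 (internal/private).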
A reverse call graph is also constructed: for each callee, the pass records which callers invoke it.
Debug output (when dump-ip-msp is enabled): "Initial work list size : N"
Phase 2: Per-Function Analysis
For each function popped from the worklist:
- Classify arguments: allocate a per-arg array initialized to 1000 ("unresolved"). Non-pointer args and already-resolved args are marked 2000 ("skip").
- Walk call sites: for each call instruction, examine each actual argument:
  - If the actual's address space is non-zero (already specific), record it.
  - If the actual is generic (AS 0), first check the callee-space map for a cached result. If not found, invoke the dataflow inference engine sub_2CE96D0 to trace the pointer's provenance.
  - If this is the first call site for this arg, record the space. If a subsequent call site disagrees, mark 2000 ("conflicting -- give up").
- Count resolved arguments: any arg where all call sites agree on a single address space is a candidate for specialization.
function analyzeArgSpaces(F, argSpaceMap, calleeSpaceMap):
numArgs = F.arg_size()
spaces[numArgs] = {1000, ...} // 1000 = unresolved
for i in 0..numArgs:
arg = F.getArg(i)
if arg.type != pointer:
spaces[i] = 2000 // not a pointer, skip
else if arg in argSpaceMap:
spaces[i] = 2000 // already resolved
for each CallInst CI using F:
calledFn = CI.getCalledFunction()
for i in 0..numArgs:
if spaces[i] == 2000: continue
actual = CI.getOperand(i)
if actual == F.getArg(i): continue // passthrough
as = actual.type.addrspace
if as == 0:
// Check cache first
if actual in calleeSpaceMap:
as = calleeSpaceMap[actual]
else:
ok = inferAddressSpace(calledFn, actual, &as, ...)
if !ok:
spaces[i] = 2000
continue
if spaces[i] == 1000:
spaces[i] = as // first call site
else if spaces[i] != as:
spaces[i] = 2000 // conflict
return count(s for s in spaces if s != 1000 and s != 2000)
Debug output: "funcname : changed in argument memory space (N arguments)"
Phase 3: Specialization Decision
The pass chooses between two strategies based on linkage:
| Linkage | Strategy | Mechanism |
|---|---|---|
| Internal / Private (7, 8) | In-place specialization | Modify the function's arg types directly. No clone needed since all callers are visible. |
| External / Linkonce / Weak | Clone | Create a new function with specialized arg types and internal linkage. Rewrite matching call sites to target the clone. Keep the original for external callers. |
The decision at line 1114 in sub_2CBBE90:
if (F.linkage & 0xF) - 7 <= 1:
// Internal/Private: specialize in place
for each resolved arg:
argSpaceMap[arg] = resolvedAS
else:
// External: must clone
if resultsTree is empty:
debug("avoid cloning of %s")
else:
createClone(F, resolvedArgs)
The clone is created by sub_F4BFF0 (CloneFunction):
- Builds a new FunctionType with specific-space pointer arg types
- Allocates a new Function object (136 bytes via sub_BD2DA0)
- Copies the body via a ValueMap-based cloner (sub_F4BB00)
- For each specialized arg, inserts an addrspacecast from specific back to generic at the clone's entry (these fold away in later optimization)
- Sets clone linkage to internal (0x4007)
Debug output: "funcname is cloned"
Phase 4: Transitive Propagation
After specializing a function, the pass propagates resolved spaces to its callees via sub_2CF5840. This function:
- Creates an analysis context similar to sub_2CE96D0
- Calls sub_2CF51E0, which walks F's body
- For each call instruction in F that targets a known function, determines if the called function's args now have resolved spaces
- Updates the arg-space map accordingly
Affected callees are pushed back onto the worklist. This enables bottom-up resolution through call chains: if A -> B -> C, specializing A's args may resolve B's args, which in turn resolves C's args.
Debug output: "N callees are affected"
Phase 5: Return Space Resolution
After argument processing, the pass checks return values:
- If the function returns a generic pointer, walk all ret instructions.
- Follow the def chain through GEPs to the base pointer.
- If all returns agree on a single address space, record it in the return-space map and propagate to callers.
Debug output: "funcname : return memory space is resolved : N"
The Dataflow Inference Engine
The inference engine is the core analysis that determines what address space a generic pointer actually points to. It is invoked when a call-site argument has address space 0 (generic) and the pass needs to determine the concrete space.
Entry Point: sub_2CE96D0
function inferAddressSpace(calledFn, actualArg, &result, module, symtab, argSpaceMap):
as = actualArg.type.addrspace
if as != 0:
*result = as
return true // trivially resolved
// Generic pointer: need full analysis
context = alloca(608) // 608-byte stack context
// Initialize 6 tracking sets:
// [0] visited set (bitset for cycle detection in PHI chains)
// [1] user-list collector
// [2] callee mapping
// [3] load tracking (when track-indir-load)
// [4] inttoptr tracking (when track-int2ptr)
// [5] alloca tracking
return coreDataflowWalker(context, calledFn, actualArg,
&loadsVec, &callsVec, result)
The 608-byte context is allocated on the stack and contains all working state for the backward dataflow walk.
Core Backward Dataflow Walker: sub_2CE8530
The walker traces the pointer's provenance backward through the SSA def chain. It uses a worklist plus visited-set to handle cycles (primarily PHI nodes).
IR nodes handled:
| IR node | Action |
|---|---|
| getelementptr | Transparent: follow the base pointer operand |
| bitcast | Transparent: follow the source operand |
| addrspacecast | Extract target address space, record it |
| phi | Add all incoming values to the worklist |
| select | Add both arms to the worklist (result = OR of both) |
| call / invoke | Look up callee in return-space map; if found, use that |
| load | If track-indir-load enabled: follow the loaded pointer; otherwise opaque |
| inttoptr | If track-int2ptr enabled: follow the integer source; otherwise opaque |
| alloca | If process-alloca-always: immediately resolve to AS 5 (local) |
| argument | If in arg-space map: use the recorded space |
Inference rules (lattice):
The engine collects candidate address spaces from all reachable definitions. The resolution follows these rules:
// All sources agree: resolved to that space
// Sources disagree: unresolvable (return false)
// param bit set + param-always-point-to-global: resolve to global (AS 1)
// alloca found + process-alloca-always: resolve to local (AS 5)
// __builtin_assume(__isGlobal(p)) + process-builtin-assume: resolve to global
The walker collects three separate vectors during traversal:
- loads: pointers loaded from memory (indirect provenance)
- GEPs: getelementptr instructions encountered along the chain
- calls: function calls whose return values contribute to the pointer
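A minimal model of the backward walk: transparent nodes forward to their base operand, phi/select fan out through a worklist guarded by a visited set, and addrspacecast leaves contribute candidate spaces that must all agree. This is an illustrative Python sketch; the def-use graph encoding is invented for the example:

```python
def infer_space(defs, start):
    """defs maps a value name to (kind, operands). Returns the address
    space if all reachable sources agree, else None (unresolvable)."""
    worklist, visited, candidates = [start], set(), set()
    while worklist:
        v = worklist.pop()
        if v in visited:
            continue                     # cycle guard for phi loops
        visited.add(v)
        kind, ops = defs[v]
        if kind in ("gep", "bitcast"):   # transparent: follow the base
            worklist.append(ops[0])
        elif kind in ("phi", "select"):  # fan out to all incoming values
            worklist.extend(ops)
        elif kind == "addrspacecast":    # concrete source space found
            candidates.add(ops[0])
        else:                            # opaque definition: give up
            return None
    return candidates.pop() if len(candidates) == 1 else None

defs = {
    "p":  ("phi", ["g1", "g2"]),
    "g1": ("gep", ["c1"]),
    "g2": ("bitcast", ["c2"]),
    "c1": ("addrspacecast", [3]),
    "c2": ("addrspacecast", [3]),
}
space = infer_space(defs, "p")           # both phi arms trace to AS 3
```

Changing either leaf to a different space makes the candidate set ambiguous and the walk reports failure, matching the "sources disagree: unresolvable" rule.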
Per-Callee Space Propagation: sub_2CE8CB0
This function is the heavyweight driver called from the worklist loop for each function. It processes a function's call graph entries and determines concrete address spaces for callees by examining actual arguments at all call sites.
Architecture:
- A global limit at qword_3CE3528 caps maximum analysis depth to prevent explosion on large call graphs.
- The function iterates the BB instruction list (offset +328, linked list). For each callee encountered:
  - Check the visited set. The set has two representations:
    - Small set: flat array at object offsets +32..+52 (checked when the flag at +52 is set)
    - Large set: hash-based DenseSet at offset +24 (checked via sub_18363E0)
  - If the callee has no body (*(_DWORD *)(callee + 120) == 0): collect it as a leaf and record its argument address spaces via sub_2CE80A0
  - Otherwise: skip (it will be processed when popped from the worklist)
- For each collected callee, a DenseMap cache at offset +160 is checked:
  - Hash function: (ptr >> 9) ^ (ptr >> 4), linear probing
  - Empty sentinel: -4096 (0xFFFFFFFFFFFFF000)
  - If found in cache: skip re-analysis (use the cached result)
- After collecting all callees: invoke sub_2CE88B0 for merge/commit.
- For single-entry results (exactly one callee entry in the vector): special fast path via sub_2CE2F10 that commits directly through a vtable dispatch.
function perCalleePropagate(this, F):
if this.firstVisit:
// Reset tracking vectors
clearUserVectors()
// Walk BB instruction list
for each BB in F.body():
if BB in visitedSet: continue
if BB.isDeclaration(): continue
collectCalleeInfo(BB) // -> sub_2CE80A0
addToVisitedSet(BB)
// Check depth limit
if userVector.size() > depthLimit:
return false
// Merge phase
if userVector.size() > 1:
return mergeAndCommit(this, F) // sub_2CE88B0
elif userVector.size() == 1:
commitSingleResult(this) // fast path
return false
Callee Matching Engine: sub_2CE7410
When multiple call instructions target the same callee, this function determines the best pair to use for space inference. This is critical for correctness -- the pass must ensure that the inferred space is valid for all uses.
Algorithm:
- Parallel operand walk: for each pair of call instructions to the same callee, walk their operand use-chains in parallel. Compare the instructions at each position via the instruction equivalence DenseMap at offset +80.
- Coverage scoring: count the number of matching operands (variable v95). Higher coverage means more confidence in the match.
- Dominance check: call sub_2403DE0(A, B) to test if BB A dominates BB B. Both directions are checked:
  - If A dominates B and B dominates A (same BB or trivial loop): strong match.
  - If only one direction holds: check whether the non-dominating one is the entry BB's first instruction.
- Loop membership gate: sub_24B89F0 checks whether both call instructions are in the same loop. If both are in the same loop and the coverage score > 1, the match is accepted even without strict dominance (loops create natural fixed-point convergence).
- Attribute check: for each matched pair, sub_245A9B0 verifies metadata flags (at instruction offset +44) to ensure the transformation is legal.
- Output: the best-scoring pair is written into the results vector for subsequent instruction rewriting.
Post-Inference Merge: sub_2CE88B0
After the per-callee analysis produces a list of (instruction, resolved_space) entries:
function mergeAndCommit(this, F):
entries = this.resultVector
if entries.size() > 1:
qsort(entries, comparator=sub_2CE2BD0) // sort by callee ID
changed = false
while entries.size() > 1:
entry = entries.back()
calleeId = entry.calleeId
// Find best match for this callee
matchScore = sub_2CE7410(this, calleeId, ...)
if matchScore > 0:
// Commit via instruction specialization
sub_2CE4830(this, matchedCallee) // edge weight
sub_2CE3B60(this, bestMatchIdx) // commit space
// Propagate to other entries sharing this callee
for each other entry with same callee:
if other != bestMatch:
sub_2CE3780(this, other.users, matchedCallee)
// Compact the entries vector
changed = true
else:
// No match: fallback propagation
sub_2CE3A70(this, calleeId, ...)
return changed
Instruction Specialization: sub_2CE8120
Once a callee's address space is determined, this function creates a specialized copy of the instruction:
- Legality check: vtable dispatch at offset +408 (sub_25AE460 default). Returns false if the instruction cannot be legally specialized (e.g., volatile operations, intrinsics with fixed types).
- Create specialized instruction: sub_244CA00 creates a new instruction with the modified pointer type (generic -> specific address space).
- Insert into BB: sub_24056C0 places the new instruction in the basic block's instruction list.
- Rewrite use chain: all uses of the old instruction are updated to reference the new specialized version.
- Update DenseMap caches:
  - Instruction-to-space map at offset +80: insert mapping from new instruction to resolved space
  - Edge count at offset +72: update via sub_24D8EE0
  - If nested clone tracking is enabled (offset +131 flag): update debug info via sub_2D2DBE0
Handling Recursion and Clone Limits
- Transitive: clones are pushed back onto the worklist, so chains A -> B -> C are handled iteratively.
- Mutual recursion: already-resolved args are detected via the map (marked 2000), preventing infinite re-processing.
- Self-recursion: after the first pass resolves args, re-processing finds agreement and applies specialization.
- Clone limit: do-clone-for-ip-msp (default -1 = unlimited) caps the total number of clones. Each clone increments a counter at this[200]. When the limit is exceeded, cloning stops but in-place specialization continues for internal functions.
- Analysis depth limit: qword_3CE3528 limits the per-function callee analysis depth to prevent explosion on large modules.
The LIBNVVM Variant
A second implementation at sub_1C6A6C0 (54 KB) serves the LIBNVVM/module-pass path. Key differences:
- Uses DenseMap-style hash tables (empty sentinel = -8, tombstone = -16, 16-byte entries)
- Includes loop-induction analysis via sub_1BF8310 with maxLoopInd tracking (debug: "phi maxLoopInd = N: Function name")
- Three processing phases controlled by globals:
  - Phase A (dword_4FBD1E0, default=4): call-site collection, threshold dword_4FBC300 = 500
  - Phase B (dword_4FBD2C0, default=2): address space resolution. If dword_4FBCAE0 is set (special mode), picks the callee with the smallest constant value (minimum address space ID).
  - Phase C (dword_4FBCD80, default=2): WMMA-specific sub-pass via sub_1C5FDC0, called with wmma_mode=1 first (WMMA-specific), then wmma_mode=0
- Threshold: v302 > 5 triggers sub_1C67780 for deeper analysis
- Pre/post analysis toggle: byte_4FBC840 controls calls to sub_1C5A4D0
Interaction with memory-space-opt
The ipmsp and memory-space-opt passes are complementary:
- ipmsp is inter-procedural: it analyzes call graphs, infers address spaces across function boundaries, and specializes function signatures via cloning.
- memory-space-opt is intra-procedural: it resolves generic pointers within a single function body using backward dataflow analysis and bitmask accumulation.
The typical pipeline flow:
- ipmsp runs first (module pass) to propagate address spaces across function boundaries
- memory-space-opt runs in first-time mode to resolve obvious intra-procedural cases
- Further optimization passes run (these may create new generic pointers via inlining, SROA, etc.)
- memory-space-opt runs in second-time mode to clean up remaining generic pointers and fold isspacep intrinsics to constants
Both passes share the same set of knobs (with ias- prefixed mirrors for the IAS variant). The inference engine sub_2CE96D0 is shared between IPMSP and the alternate algorithm selected by mem-space-alg.
Knobs
IPMSP-Specific Knobs
| Knob | Default | Storage | Description |
|---|---|---|---|
| dump-ip-msp | 0 | qword_5013548 | Enable debug tracing |
| do-clone-for-ip-msp | -1 (unlimited) | qword_5013468 | Max clones allowed |
| do-ip-msp | 1 (enabled) | NVVMPassOption | Enable/disable the entire pass |
Shared Inference Knobs (MemorySpaceOpt variant)
| Knob | Default | Storage | Description |
|---|---|---|---|
| param-always-point-to-global | true | unk_4FBE1ED | Parameter pointers always resolve to global (AS 1) |
| strong-global-assumptions | true | (adjacent) | Assume constant buffer pointers always point to globals |
| process-alloca-always | true | unk_4FBE4A0 | Treat alloca-derived pointers as local (AS 5) unconditionally |
| wmma-memory-space-opt | true | unk_4FBE3C0 | Specialize WMMA call args to shared memory (AS 3) |
| track-indir-load | true | byte_4FBDE40 | Track indirect loads during inference |
| track-int2ptr | true | byte_4FBDC80 | Track inttoptr in inference |
| mem-space-alg | 2 | dword_4FBDD60 | Algorithm selection for address space optimization |
| process-builtin-assume | -- | (ctor_531_0) | Process __builtin_assume(__is*(p)) for space deduction |
IAS Variant Knobs (IPMSPPass path, ctor_610)
Each shared knob has an ias- prefixed mirror that controls the InferAddressSpaces-based code path (sub_2CBBE90):
| Knob | Mirrors |
|---|---|
| ias-param-always-point-to-global | param-always-point-to-global |
| ias-strong-global-assumptions | strong-global-assumptions |
| ias-wmma-memory-space-opt | wmma-memory-space-opt |
| ias-track-indir-load | track-indir-load |
| ias-track-int2ptr | track-int2ptr |
The unprefixed versions control the LIBNVVM variant (sub_1C6A6C0). The ias- prefixed versions control the New PM / IAS variant (sub_2CBBE90).
LIBNVVM Variant Globals
| Global | Default | Description |
|---|---|---|
| dword_4FBD1E0 | 4 | Phase A call-site collection level |
| dword_4FBD2C0 | 2 | Phase B resolution level |
| dword_4FBCD80 | 2 | Phase C WMMA sub-pass level |
| dword_4FBC300 | 500 | Max analysis depth threshold |
| dword_4FBCAE0 | -- | Special minimum-selection mode |
| byte_4FBC840 | -- | Pre/post analysis toggle |
| dword_4FBD020 | -- | Debug: maxLoopInd dump |
Debug Dump Knobs
| Knob | Description |
|---|---|
| dump-ir-before-memory-space-opt | Dump IR before MemorySpaceOpt runs |
| dump-ir-after-memory-space-opt | Dump IR after MemorySpaceOpt completes |
| dump-process-builtin-assume | Dump __builtin_assume processing |
| msp-for-wmma | Enable Memory Space Optimization for WMMA (tensor core) |
Data Structures
Worklist
The worklist is a std::deque<Function*> with 512-byte pages (64 pointers per page). Push-back via sub_2CBB610 (extends the deque when the current page is full). Pop-back from the last page.
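The paged push/pop behavior can be sketched as follows. This is an illustrative model, not std::deque's actual layout (real implementations differ in bookkeeping): the worklist keeps a list of fixed-capacity pages of 64 pointer slots and allocates a fresh page only when the last one fills.

```python
PAGE_SLOTS = 64                      # 512-byte page / 8-byte pointer

class PagedDeque:
    """Minimal model of a paged deque: push_back grows by whole pages."""
    def __init__(self):
        self.pages = [[]]            # list of fixed-capacity pages

    def push_back(self, fn):
        if len(self.pages[-1]) == PAGE_SLOTS:
            self.pages.append([])    # current page full: extend the deque
        self.pages[-1].append(fn)

    def pop_back(self):
        fn = self.pages[-1].pop()
        if not self.pages[-1] and len(self.pages) > 1:
            self.pages.pop()         # drop the emptied trailing page
        return fn

wl = PagedDeque()
for i in range(65):                  # the 65th push spills onto a second page
    wl.push_back(i)
num_pages = len(wl.pages)
```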
Red-Black Tree Maps
The cloning engine uses red-black trees (std::map) for four separate maps:
| Map | Key | Value | Purpose |
|---|---|---|---|
| Return-space | Function* | Resolved AS | Return value address space |
| Arg-space | Value* | Resolved AS | Per-argument address space |
| Callee-space | Value* | Resolved AS | Callee pointer spaces (cached inference results) |
| Callee-info | Function* | Sub-tree | Reverse call graph (which callers invoke this callee) |
Red-black tree nodes are 0x58 bytes with the standard {left, right, parent, color, key} layout at offsets 16, 24, 8, 0, 32.
DenseMap Caches
The inference engine and per-callee propagation use DenseMap hash tables with LLVM-layer sentinels (-4096 / -8192) and 16-byte entries (key + value). Growth is handled by sub_240C8E0. See Hash Table and Collection Infrastructure for the hash function, probing, and growth policy.
Three independent DenseMaps are used:
- Offset +80: instruction -> resolved space (per-function analysis cache)
- Offset +160: callee -> inference result (cross-function cache)
- Offset +232: edge weight tracking (call graph weights for profitability)
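The probing scheme for these caches can be modeled as a small open-addressing table. This sketch uses the recovered hash and sentinel values with linear probing as described for the offset +160 cache; the capacity, helper names, and example keys are invented, and LLVM's stock DenseMap actually probes quadratically, so treat this strictly as a model:

```python
EMPTY, TOMBSTONE = -4096, -8192      # sentinel keys recovered from the binary
CAP = 64                             # power-of-two capacity (assumed)

def bucket_hash(ptr):
    """Recovered hash: mixes page and alignment bits of the pointer."""
    return ((ptr >> 9) ^ (ptr >> 4)) & (CAP - 1)

def insert(table, key, value):
    i = bucket_hash(key)
    while table[i][0] not in (EMPTY, TOMBSTONE, key):
        i = (i + 1) & (CAP - 1)      # linear probe with wraparound
    table[i] = (key, value)

def lookup(table, key):
    """Probe until the key or an EMPTY slot; tombstones must not stop
    the probe, or entries inserted past a deletion would be lost."""
    i = bucket_hash(key)
    while True:
        k, v = table[i]
        if k == key:
            return v
        if k == EMPTY:
            return None              # never inserted
        i = (i + 1) & (CAP - 1)

table = [(EMPTY, None)] * CAP
insert(table, 0x7F1200, 3)           # callee ptr -> resolved AS 3
found = lookup(table, 0x7F1200)
```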
Visited Sets
Two representations depending on set size:
- Small set (flag at offset +52): flat array at offsets +32..+44, capacity at +40, count at +44. Linear scan for membership test.
- Large set (default): hash-based DenseSet at offset +24, accessed via sub_18363E0 for both insert and membership test.
Inference Context
The 608-byte stack-allocated context for sub_2CE8530 contains:
| Offset range | Content |
|---|---|
| 0--23 | Result vector (pointer, size, capacity) |
| 24--47 | Loads vector (indirect pointer sources) |
| 48--71 | GEPs vector (getelementptr chains) |
| 72--95 | Calls vector (call instructions returning pointers) |
| 96--127 | Worklist for PHI traversal |
| 128--607 | Visited bitset, callee tracking, metadata |
Sentinel Values
| Value | Meaning | Used in |
|---|---|---|
| 1000 | Unresolved pointer argument (not yet seen at any call site) | Per-arg analysis array |
| 2000 | Non-pointer, already resolved, or conflicting (skip) | Per-arg analysis array |
| -4096 | DenseMap empty slot | All DenseMap caches |
| -8192 | DenseMap tombstone (deleted entry) | All DenseMap caches |
Diagnostic Messages
| Message | Source | Condition |
|---|---|---|
"Initial work list size : %d" | sub_2CBBE90 | Always (when dump-ip-msp) |
"funcname : changed in argument memory space (N arguments)" | sub_2CBBE90 | Args resolved |
"funcname is cloned" | sub_2CBBE90 | Clone created |
"avoid cloning of funcname" | sub_2CBBE90 | External linkage, empty results |
"N callees are affected" | sub_2CBBE90 | After propagation |
"funcname : return memory space is resolved : N" | sub_2CBBE90 | Return space resolved |
"phi maxLoopInd = N: Function name" | sub_1C6A6C0 | LIBNVVM loop-ind analysis |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| MemorySpaceCloning | sub_2CBBE90 | 71 KB | Worklist driver (New PM variant) |
| IPMSPPass | sub_1C6A6C0 | 54 KB | LIBNVVM variant |
| inferAddressSpace | sub_2CE96D0 | -- | Inference entry point |
| coreDataflowWalker | sub_2CE8530 | -- | Backward dataflow analysis |
| perCalleePropagate | sub_2CE8CB0 | -- | Per-callee space propagation |
| mergeAndCommit | sub_2CE88B0 | -- | Post-inference merge (qsort) |
| rewriteCalleePair | sub_2CE85D0 | -- | Instruction rewriting for matched pairs |
| calleeMatchingEngine | sub_2CE7410 | -- | Dominance + coverage scoring |
| pushInferenceResult | sub_2CE80A0 | -- | Append to result vector |
| vectorRealloc | sub_2CE7E60 | -- | Grow inference result vector |
| computeEdgeWeight | sub_2CE4830 | -- | Call graph edge weight |
| commitSpace | sub_2CE3B60 | -- | Commit resolved space to callee |
| fallbackPropagate | sub_2CE3A70 | -- | Propagate unmatched entries |
| propagateToAlternate | sub_2CE3780 | -- | Propagate to alternate callee users |
| commitSingleCallee | sub_2CE2F10 | -- | Single-callee commit via vtable |
| singlePredecessorCheck | sub_2CE2DE0 | -- | Check single-predecessor property |
| qsortComparator | sub_2CE2BD0 | -- | Compare callee entries for sorting |
| mergeSmallVectors | sub_2CE2A70 | -- | Merge small vector pairs |
| extractAddressSpace | sub_2CE27A0 | -- | Extract AS from Value's type |
| cloneInstruction | sub_2CE8120 | -- | Clone instruction + DenseMap update |
| populateUserSet | sub_2CE97F0 | -- | Build per-arg user list |
| propagateSpacesToCallees | sub_2CF5840 | -- | Post-specialization propagation |
| bodyWalker | sub_2CF51E0 | -- | Walk function body for propagation |
| shouldProcessFunction | sub_2CBA650 | -- | Worklist eligibility predicate |
| hasUnresolvedPointerArgs | sub_2CBA520 | -- | Check for unresolved generic ptr args |
| CloneFunction | sub_F4BFF0 | -- | Full function clone with arg rewriting |
| ValueMapCloner | sub_F4BB00 | -- | ValueMap-based body cloner |
| replaceAllUsesWith | sub_BD84D0 | -- | Redirect call sites to clone |
| mapInsertOrFind | sub_2CBB230 | -- | Red-black tree insert |
| mapLookup | sub_2CBB490 | -- | Red-black tree search |
| dequeGrow | sub_2CBB610 | -- | Worklist deque push_back |
| checkAttributeBundle | sub_245A9B0 | -- | Attribute flag membership test |
| instructionEquivalence | sub_245AA10 | -- | Test instruction equivalence |
| bbDominates | sub_2403DE0 | -- | BasicBlock dominance test |
| loopMembership | sub_24B89F0 | -- | Check if two instructions share a loop |
| createSpecializedInst | sub_244CA00 | -- | Create instruction with modified types |
| insertIntoBlock | sub_24056C0 | -- | Insert instruction into BB |
| updateDebugInfo | sub_2D2DBE0 | -- | Debug info update for cloned inst |
Cross-References
- memory-space-opt -- intra-procedural complement
- reference/address-spaces -- consolidated AS reference
- config/knobs -- complete knob inventory
- pipeline/optimizer -- pipeline position and do-ip-msp option
- pipeline/optix-ir -- OptiX disables IPMSP
- infra/alias-analysis -- cross-space NoAlias rules
Memory Space Optimization
The Memory Space Optimization pass (memory-space-opt) is NVIDIA's inter-procedural address space resolution engine. Its job is to convert generic (flat) pointers into specific address spaces -- global, shared, local, constant, or parameter -- so that the backend can emit specialized memory instructions (ld.shared, st.global, etc.) instead of generic ones (ld, st) that require address translation hardware at runtime. On NVIDIA GPUs, generic memory accesses go through an address translation unit that adds latency; resolving pointer provenance at compile time eliminates this overhead entirely and is one of the most impactful optimizations in the CUDA compilation pipeline.
The pass is implemented as a multi-function cluster totaling roughly 250KB of decompiled code, with two cooperating systems: an intra-procedural address space resolver and an inter-procedural function cloning engine.
Key Facts
| Property | Value |
|---|---|
| Pass name (pipeline) | memory-space-opt |
| Class | MemorySpaceOptPass |
| Pass type | Parameterized FunctionPass (NVIDIA-custom) |
| Registration | New PM #416, parameterized: first-time;second-time;no-warnings;warnings |
| Runtime positions | Tier 1/2/3 #65 (after DSE + DCE + LLVM standard pipeline); also runs early in "mid" path (see Pipeline) |
| Pass entry point | sub_1C70910 (2,427 lines) |
| Pass factory | sub_1C8E680 |
| NVVMPassOptions slot | Offset +2680 (disable), offset +3120 (mode parameter) |
| Binary size | ~250 KB total (multi-function cluster) |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
NVPTX Address Space Numbering
The pass operates on the standard NVPTX address spaces (0=generic, 1=global, 3=shared, 4=constant, 5=local, 101=param). See Address Spaces for the complete table with hardware mapping, pointer widths, and aliasing rules.
Internally, the pass encodes address spaces as a single-bit bitmask for efficient dataflow computation (0x01=global, 0x02=shared, 0x04=constant, 0x08=local, 0x10=param, 0x0F=unknown). When multiple pointer sources contribute different spaces, the bitmask is OR'd together. A singleton bit (popcount == 1) means the space is fully resolved; multiple bits set means ambiguous. See the MemorySpaceOpt Internal Bitmask section for the complete mapping and resolution algorithm.
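The bitmask encoding above can be sketched as a tiny lattice. This is an illustrative reconstruction (the enum and helper names are ours, not symbols recovered from the binary), assuming the bit assignments listed in the text:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reconstruction of the per-value address space bitmask
// described above; constants follow the documented bit assignments.
enum : uint8_t {
    MS_GLOBAL = 0x01,
    MS_SHARED = 0x02,
    MS_CONST  = 0x04,
    MS_LOCAL  = 0x08,
    MS_PARAM  = 0x10,
};

// Meet operation of the lattice: contributions from multiple pointer
// sources are OR'd together, so a mask can only grow.
inline uint8_t mergeSpaces(uint8_t a, uint8_t b) { return uint8_t(a | b); }

// A mask is fully resolved exactly when one bit is set (popcount == 1).
inline bool isResolved(uint8_t mask) {
    return mask != 0 && (mask & (mask - 1)) == 0;
}
```

Because the merge is a monotone OR over a 5-bit domain, any fixed-point iteration over these masks terminates.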
IR Before/After Example
The following illustrates the core transformation: generic-pointer loads/stores are resolved to specific address spaces, enabling specialized PTX memory instructions.
Before (generic pointers, AS 0):
define void @kernel(ptr addrspace(0) %shared_buf, ptr addrspace(0) %global_out) {
%val = load float, ptr addrspace(0) %shared_buf, align 4
%add = fadd float %val, 1.0
store float %add, ptr addrspace(0) %global_out, align 4
%check = call i1 @llvm.nvvm.isspacep.shared(ptr %shared_buf)
br i1 %check, label %fast, label %slow
fast:
ret void
slow:
ret void
}
After (resolved address spaces):
define void @kernel(ptr addrspace(3) %shared_buf, ptr addrspace(1) %global_out) {
%val = load float, ptr addrspace(3) %shared_buf, align 4 ; -> ld.shared.f32
%add = fadd float %val, 1.0
store float %add, ptr addrspace(1) %global_out, align 4 ; -> st.global.f32
; isspacep.shared folded to true (phase 2), branch simplified by later DCE
br label %fast
fast:
ret void
}
The addrspacecast instructions are inserted during resolution and consumed by downstream passes. The isspacep folding (phase 2 only) eliminates runtime address space checks when the space is statically known.
Two-Phase Architecture
The pass entry point (sub_1C70910) accepts a mode parameter controlling execution:
| Mode | Name | Behavior |
|---|---|---|
| 0 | First-time | Conservative resolution via sub_1CA2920. Called early in the pipeline. |
| 1 | Second-time | Hash-table-based resolution via sub_1CA9E90. Called after IP-MSP propagation. |
| 2 | First-time, no warnings | Same as mode 0 but suppresses "Cannot tell what pointer points to" messages. |
| 3 | Second-time, no warnings | Same as mode 1 but silent. Used on re-runs where repeated warnings would be noise. |
Both phases share the same instruction dispatch structure, handling loads (opcode 0x36), stores (0x37), calls (0x4E), atomic loads (0x3A), and atomic stores (0x3B).
Phase 1 (first-time) resolves obvious cases where pointer origin is statically known. It uses sub_1C9F820 for dataflow analysis and sub_1C98370 for annotation-based resolution.
Phase 2 (second-time) runs after inter-procedural propagation has enriched the analysis context. It uses hash-table lookups (sub_1CA8350) and can fold isspacep intrinsics (builtins 0xFD0-0xFD5) to constants when the address space is already known, eliminating runtime space checks.
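The four mode values could plausibly decode into two independent flags, as the table suggests. This is a hedged sketch of that decoding; the struct, function name, and bit layout are assumptions, not recovered code:

```cpp
#include <cassert>

// Hypothetical decoding of the 4-valued mode parameter accepted by the
// pass entry point (sub_1C70910), matching the mode table above.
struct MsoMode {
    bool secondTime;       // modes 1/3: hash-table resolver after IP-MSP
    bool suppressWarnings; // modes 2/3: silence "Cannot tell what pointer..."
};

inline MsoMode decodeMode(int mode) {
    return MsoMode{(mode & 1) != 0, mode >= 2};
}
```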
Inter-Procedural Memory Space Propagation (IP-MSP)
Complexity. Let F = number of functions in the module, A = total number of pointer-typed arguments across all functions, E = total call-graph edges, and I = total instructions. The intra-procedural use-def chain walk is O(I) per function (bounded by visited-set to avoid cycles through PHI nodes). The IP-MSP worklist iterates until no argument's bitmask changes; since each of the A arguments has a 5-bit bitmask that can only grow (OR of incoming values), the worklist converges in at most O(A) rounds. Each round re-analyzes at most O(F) functions, and adding callers back to the worklist costs O(E) in total across all rounds. Worst-case: O(A * (F * I_avg + E)) where I_avg is average instructions per function. Function cloning adds at most O(F) clones (bounded by do-clone-for-ip-msp), each clone being O(I_f) to create. In practice, GPU modules have small call graphs (F < 200 after inlining) and the worklist converges in 2--4 rounds, making the pass effectively O(F * I_avg + E).
The IP-MSP driver in sub_1C70910 implements a fixed-point worklist algorithm that propagates address space information across function boundaries:
- Build a worklist of all functions in the module. Debug: "Initial work list size: %d".
- Pop a function from the worklist.
- Run intra-procedural resolution (phase 1 or 2).
- If argument memory spaces changed ("changed in argument memory space"), add all callers back to the worklist ("callees are affected").
- If the return memory space is resolved ("return memory space is resolved"), propagate to callers.
- Repeat until the worklist is empty.
A second IP-MSP implementation exists at sub_1C6A6C0 (54KB), which appears to be the LIBNVVM/module-pass variant. It uses DenseMap-style hash tables (sentinel -8 for empty, -16 for tombstone), has explicit loop-induction analysis (sub_1BF8310), and runs three sub-phases: call-site collection (level controlled by dword_4FBD1E0, default 4), address space resolution (level dword_4FBD2C0, default 2), and a WMMA-specific pass (sub_1C5FDC0).
Function Cloning for Specialization
When different call sites pass pointers from different address spaces to the same function argument, the pass clones the function so that each clone can be specialized for a single address space. This is the key mechanism that eliminates generic pointers at call boundaries.
The cloning engine (sub_2CBBE90, 71KB) uses two distinct strategies based on function linkage:
Strategy 1 -- In-place specialization (internal/private linkage): All call sites are visible within the module, so the function is modified directly. Pointer argument types are changed from generic (AS 0) to the resolved specific space. No clone is created. This is the cheaper path.
Strategy 2 -- Clone and specialize (external/linkonce/weak linkage): The function might have callers outside the module, so the original must be preserved. A clone is created with internal linkage (0x4007), its argument types are specialized, and internal call sites are rewritten to target the clone. The original remains for any remaining generic-pointer callers.
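The strategy selection reduces to a linkage predicate. A minimal sketch, assuming the linkage categories named above (the enum and function names are illustrative):

```cpp
#include <cassert>
#include <string>

enum class Linkage { Internal, Private, External, LinkOnce, Weak };

// True when every call site is visible inside the module, so the
// function can be specialized in place without creating a clone.
inline bool canSpecializeInPlace(Linkage l) {
    return l == Linkage::Internal || l == Linkage::Private;
}

inline std::string strategyFor(Linkage l) {
    return canSpecializeInPlace(l) ? "in-place" : "clone-and-specialize";
}
```

Inverting this predicate is pitfall 3 below: cloning internal functions wastes compile time, while mutating external ones breaks out-of-module callers.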
The cloning process (sub_F4BFF0):
- Iterate all formal args of the original function.
- For each arg whose address space was resolved, create a new function type with the specific address space.
- Allocate a new Function object via sub_BD2DA0(136).
- Copy linkage, attributes, and calling convention.
- Clone the body via sub_F4BB00 (ValueMap-based cloner).
- For specialized args, insert addrspacecast instructions at the clone's entry.
- Rewrite matching call sites via sub_BD84D0.
After cloning, the clone is pushed back onto the worklist, enabling recursive specialization through call chains: if A calls B calls C, each level's arguments resolve bottom-up as the worklist iterates.
Intra-Procedural Resolution Algorithm
Use-Def Chain Walking (sub_1CA5350)
The core resolver walks backward through use-def chains to find the original allocation a pointer derives from:
| IR Node | Behavior |
|---|---|
| GEP (H) | Transparent -- follow pointer operand |
| Bitcast (G) | Transparent -- follow source operand |
| PHI (O) | Follow all incoming values (adds all to worklist) |
| Call (M) | Check if returns a known-space pointer |
| Load (subcode 32) | Tracked if track-indir-load is enabled |
| inttoptr (subcode 47) | Tracked if track-int2ptr is enabled |
| ptrtoint (subcode 48) | Transparent |
| Alloca (8) | Resolves to local (AS 5) |
The walker uses a worklist with a visited bitset to handle cycles through phi nodes. It collects three separate vectors: loads (indirect pointers), GEPs, and calls returning pointers.
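The table and worklist behavior can be sketched on a toy def graph. This is an assumption-laden illustration (node layout, kinds, and masks are ours): GEP/bitcast are transparent, PHIs fan out to all incoming values, allocas contribute the local bit, and a visited set breaks PHI cycles.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

enum Kind { Alloca, GEP, Bitcast, Phi, Unknown };

struct Node {
    Kind kind;
    std::vector<int> operands; // indices of pointer operands to follow
};

uint8_t walkOrigin(std::vector<Node> const& g, int root) {
    uint8_t mask = 0;
    std::vector<int> work{root};
    std::set<int> visited; // breaks cycles through PHI nodes
    while (!work.empty()) {
        int n = work.back();
        work.pop_back();
        if (!visited.insert(n).second) continue;
        switch (g[n].kind) {
        case Alloca:  mask |= 0x08; break; // resolves to local (AS 5)
        case GEP:
        case Bitcast: // transparent: follow the pointer operand
        case Phi:     // follow every incoming value
            for (int op : g[n].operands) work.push_back(op);
            break;
        default:      mask |= 0x0F; break; // unknown contribution
        }
    }
    return mask;
}
```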
Resolution Decision
Once the bitmask is computed:
- Single bit set: resolved. Insert addrspacecast to the target space.
- Multiple bits set: ambiguous. If param-always-point-to-global is true and the param bit is set, resolve to global. Otherwise emit a warning and default to global.
- Zero bits: unreachable or error.
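The decision reduces to a mask-to-address-space mapping with global as the conservative fallback. A minimal sketch, assuming the NVPTX numbering from the table above (1=global, 3=shared, 4=constant, 5=local, 101=param); the function name is ours:

```cpp
#include <cassert>
#include <cstdint>

// Map a resolution bitmask to an NVPTX address space. Ambiguous
// (multi-bit) or unknown masks default to global -- the safe
// conservative choice per the text.
int resolveSpace(uint8_t mask) {
    switch (mask) {
    case 0x01: return 1;   // global
    case 0x02: return 3;   // shared
    case 0x04: return 4;   // constant
    case 0x08: return 5;   // local
    case 0x10: return 101; // param
    default:   return 1;   // ambiguous/unknown: warn, default to global
    }
}
```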
Address Space Inference Engine (sub_2CE96D0)
For generic-pointer arguments at call sites, the inference engine creates a 608-byte analysis context on the stack, sets up six independent tracking sets, and calls sub_2CE8530 for deep dataflow analysis tracing pointer provenance through GEPs, bitcasts, PHI nodes, and loads from known-space pointers.
Post-Resolution Optimizations
After resolving a pointer's address space, the pass performs several follow-up transformations:
- addrspacecast insertion: sub_1CA1B70 (first-time) / sub_1CA28F0 (second-time) inserts a cast from generic to the resolved space and replaces all uses of the generic pointer.
- Instruction rewriting: Loads and stores on generic pointers are rewritten to use the specific space, enabling the backend to emit ld.shared, st.global, etc.
- isspacep folding (second-time only): If a pointer's space is known, isspacep.shared(%p) folds to true or false.
- Dead cast elimination: Redundant addrspacecast chains (e.g., generic-to-shared followed by shared-to-generic) are simplified.
- Call site specialization: After cloning, call sites are rewritten to call the specialized version with casted arguments.
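The isspacep folding step can be modeled as a three-valued decision: fold to true, fold to false, or leave the runtime check in place when the space is still unknown. A hedged sketch (names are ours):

```cpp
#include <cassert>
#include <optional>

// Fold an isspacep.<space> query once the pointer's address space is
// statically resolved. nullopt models "leave the intrinsic call in the
// IR" when the pointer is still generic.
std::optional<bool> foldIsSpacep(int queriedAS, std::optional<int> resolvedAS) {
    if (!resolvedAS) return std::nullopt; // still generic: cannot fold
    return *resolvedAS == queriedAS;      // fold to a boolean constant
}
```

In the before/after IR example earlier, this is exactly what turns `@llvm.nvvm.isspacep.shared(ptr %shared_buf)` into `true` once `%shared_buf` is known to be AS 3.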
Error Handling for Illegal Operations
The pass detects and reports illegal address-space/operation combinations as soft warnings (compilation continues):
| Operation | Illegal Space | Warning Message |
|---|---|---|
| Atomic load/store | Constant | "Cannot do atomic operation on const memory" |
| Atomic load/store | Local | "Cannot do atomic on local memory" |
| WMMA | Constant | "Cannot do WMMA on constant memory" |
| WMMA | Local | "Cannot do WMMA on local memory" |
| Vector atomic | Shared | "Cannot to vector atomic on shared memory" |
| Vector atomic | Local | "Cannot to vector atomic on local memory" |
| Vector atomic | Constant | "Cannot to vector atomic on const memory" |
Note: The vector atomic messages contain a typo in NVIDIA's source -- "Cannot to" should read "Cannot do". This typo is present in all three vector atomic warning strings.
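The check reduces to a small (operation, address space) lookup. The sketch below keeps NVIDIA's messages verbatim, including the "Cannot to" typo; the enum and function name are illustrative:

```cpp
#include <cassert>
#include <string>

enum class Op { Atomic, WMMA, VectorAtomic };

// Returns the soft-warning text for an illegal combination, or "" when
// the combination is legal (compilation continues either way).
std::string illegalOpWarning(Op op, int addrSpace) {
    bool isShared = addrSpace == 3, isConst = addrSpace == 4, isLocal = addrSpace == 5;
    switch (op) {
    case Op::Atomic:
        if (isConst) return "Cannot do atomic operation on const memory";
        if (isLocal) return "Cannot do atomic on local memory";
        break;
    case Op::WMMA:
        if (isConst) return "Cannot do WMMA on constant memory";
        if (isLocal) return "Cannot do WMMA on local memory";
        break;
    case Op::VectorAtomic: // note: "Cannot to" typo preserved from the binary
        if (isShared) return "Cannot to vector atomic on shared memory";
        if (isLocal)  return "Cannot to vector atomic on local memory";
        if (isConst)  return "Cannot to vector atomic on const memory";
        break;
    }
    return "";
}
```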
Key Functions
| Function | Address | Size | Role |
|---|---|---|---|
| Pass entry / IP-MSP driver | sub_1C70910 | 2427 lines | Main entry point, worklist iteration, mode dispatch |
| First-time resolver | sub_1CA2920 | 1119 lines | Conservative address space resolution |
| Second-time resolver | sub_1CA9E90 | 933 lines | Hash-table-based resolution with isspacep folding |
| Use-def chain walker | sub_1CA5350 | 1641 lines | Backward pointer origin tracking |
| Per-BB scanner | sub_1CA8CD0 | 898 lines | Instruction scan, bitmask builder |
| Pass initialization | sub_1CAB590 | 1040 lines | Global registration, data structure setup |
| MemorySpaceCloning engine | sub_2CBBE90 | 71KB | Inter-procedural function cloning |
| IPMSPPass variant | sub_1C6A6C0 | 54KB | LIBNVVM module-pass variant |
| Address space inference | sub_2CE96D0 | -- | Dataflow analysis for single argument |
| CloneFunction | sub_F4BFF0 | -- | Full function clone with type rewriting |
| shouldProcessFunction | sub_2CBA650 | -- | Multi-condition filter for worklist eligibility |
| hasUnresolvedPointerArgs | sub_2CBA520 | -- | Checks if any arg is an unresolved generic pointer |
| replaceAllUsesWith | sub_BD84D0 | -- | Rewrites call sites to target the clone |
| propagateSpacesToCallees | sub_2CF5840 | -- | Propagates resolved spaces through call graph |
Alternate Algorithm
A parallel implementation exists at sub_2CBBE90 / sub_2CEAC10 / sub_2CF2C20, selected when mem-space-alg != 2. The default algorithm (value 2) is the one documented above; the alternate may be a simpler or older version optimized for different patterns.
Configuration Knobs
Primary Knobs (ctor_264 / ctor_267_0)
| Knob | Global | Type | Default | Description |
|---|---|---|---|---|
| dump-ip-msp | dword_4FBD480 | bool | false | Dump inter-procedural memory space propagation debug info |
| do-clone-for-ip-msp | dword_4FBD3A0 | int | -1 | Max number of clones (-1 = unlimited). Set to 0 to disable cloning. |
| param-always-point-to-global | unk_4FBE1ED | bool | true | Assume kernel parameters always point to global memory |
| dump-ir-before-memory-space-opt | byte_4FBE000 | bool | false | Dump IR before the pass runs |
| dump-ir-after-memory-space-opt | byte_4FBDF20 | bool | false | Dump IR after the pass completes |
| track-indir-load | byte_4FBDE40 | bool | true | Track pointers loaded from memory during use-def walking |
| mem-space-alg | dword_4FBDD60 | int | 2 | Algorithm selection for address space optimization |
| track-int2ptr | byte_4FBDC80 | bool | true | Track inttoptr casts during analysis |
Additional Knobs (ctor_267_0 / ctor_531_0)
| Knob | Default | Description |
|---|---|---|
| process-alloca-always | true | Treat alloca instructions as definite local (AS 5) regardless of context |
| wmma-memory-space-opt | true | Enable memory space optimization for WMMA operations |
| strong-global-assumptions | true | Assume const buffer pointers always point to globals |
| process-builtin-assume | -- | Process __builtin_assume(__is*(p)) assertions for space deduction |
IP-MSP Pass Knobs (ctor_528)
| Knob | Global | Default | Description |
|---|---|---|---|
| dump-ip-msp | qword_5013548 | 0 | Debug tracing for IPMSP variant |
| do-clone-for-ip-msp | qword_5013468 | -1 | Clone limit for IPMSP variant |
Optimization Level Behavior
| Level | Phase 1 (first-time) | Phase 2 (second-time) | IP-MSP Cloning |
|---|---|---|---|
| O0 | Runs (mode 0) -- address space resolution is required for correct PTX emission | Not run | Not run |
| Ofcmax | Runs (mode 0); LSA-Opt forced to 0, limiting resolution depth | Not run | Not run |
| Ofcmid | Runs (mode 0) | Runs (mode 1) after IP-MSP propagation | Enabled (do-clone-for-ip-msp=-1) |
| O1+ | Runs (mode 0) early in pipeline | Runs (mode 1) after IP-MSP propagation | Enabled; iterates to fixed point |
This pass is unusual in that it runs even at O0 -- address space resolution is a correctness requirement, not purely an optimization. Without it, all memory accesses would use generic (flat) addressing, which is functionally correct but significantly slower due to the address translation hardware penalty. At Ofcmax, the pass runs in a reduced mode with LSA-Opt disabled. See Optimization Levels for the complete pipeline structure.
Diagnostic Strings
"Initial work list size: %d"
"changed in argument memory space"
"is cloned"
"avoid cloning of"
"callees are affected"
"return memory space is resolved"
"Cannot tell what pointer points to, assuming global memory space"
"Cannot do atomic operation on const memory"
"Cannot do atomic on local memory"
"Cannot do WMMA on constant memory"
"Cannot do WMMA on local memory"
"Cannot to vector atomic on shared memory"
"Cannot to vector atomic on local memory"
"Cannot to vector atomic on const memory"
Multi-Pass Data Flow: MemorySpaceOpt / IP-MSP / Alias Analysis
The following diagram shows how three cooperating subsystems exchange data to resolve generic pointers into specific address spaces. The left column is MemorySpaceOpt (per-function), the center is IP-MSP (module-level), and the right is NVVM Alias Analysis (query service). Arrows show data produced (-->) and consumed (<--).
MemorySpaceOpt (per-function) IP-MSP (module-level) NVVM Alias Analysis
============================== ========================== ======================
1. EARLY RUN (mode 0)
+----------------------------+
| Use-def chain walker |
| (sub_1CA5350) |
| Walk: GEP, bitcast, PHI, |
| alloca, call returns |
| |
| Produces: |
| - per-arg bitmask |
| (0x01=global,0x02=shr, |
| 0x04=const,0x08=local, |
| 0x10=param) |
| - unresolved arg list |
+---+------------------------+
| +----------------------+
| per-arg bitmasks | Address space |
| (singleton bit = resolved, | disjointness table: |
| multi-bit = ambiguous) | |
v | AS 1 vs AS 3: NoAlias|
+---+------------------------+ | AS 1 vs AS 5: NoAlias|
| addrspacecast insertion | | AS 3 vs AS 5: NoAlias|
| (sub_1CA1B70) | | AS 0 vs any: MayAlias|
| Rewrites loads/stores to | | (stateless, trivial) |
| ld.shared / st.global etc. | +----------+-----------+
+---+------------------------+ |
| |
| Resolved pointer types on |
| function args + return values |
v |
+---+-----------------------------+ +--------------------------+ |
| Unresolved args remain generic | ---> | IP-MSP worklist driver | |
| Need cross-function evidence | | (sub_1C70910 / 2CBBE90) | |
+---+-----------------------------+ | | |
^ | For each function F: | |
| | 1. Collect all callers | |
| | 2. Intersect arg AS | |
| | across call sites | |
| | 3. If unanimous: | |
| | specialize or clone | |
| propagated arg spaces | | |
| (from callers) | Produces: | |
+------------------------------------+ - cloned functions | |
| with AS-specific args | |
| - updated call sites | |
| - "changed in argument | |
| memory space" events | |
+---+----------------------+ |
| |
2. LATE RUN (mode 1) | Enriched module with |
+----------------------------+ | resolved pointer types |
| Hash-table resolver | v |
| (sub_1CA9E90) | <--- cloned functions re-enter worklist |
| | |
| Additional capabilities: | Each resolved addrspacecast |
| - isspacep folding | feeds into... |
| (builtins 0xFD0-0xFD5) | |
| - Dead cast elimination | +----------v-----------+
| | | NVVM AA (nvptx-aa) |
| Consumes: | | |
| - IP-MSP propagated | | With resolved AS on |
| address spaces | | pointers, queries |
| - hash table of known | | return NoAlias for |
| pointer->space mappings | | cross-space pairs |
+---+------------------------+ | |
| | Enables downstream: |
| Fully resolved IR | - GVN load forward |
| (minimal generic ptrs) | - DSE elimination |
v | - LICM hoisting |
+---+------------------------+ | - MemorySSA queries |
| Downstream consumers: | +----------------------+
| - Instruction selection |
| (ld.shared, st.global) |
| - Backend PTX emission |
| - Register allocation |
| (no generic-ptr spills) |
+----------------------------+
Data flow summary:
| Producer | Data | Consumer |
|---|---|---|
| MemorySpaceOpt phase 1 | Per-arg address space bitmask | IP-MSP worklist |
| IP-MSP worklist | Cloned functions with specialized arg types | MemorySpaceOpt phase 2 |
| IP-MSP worklist | Call-site rewriting (addrspacecast at boundaries) | All downstream passes |
| MemorySpaceOpt phase 2 | isspacep folded to true/false | Dead code elimination |
| Both phases | Resolved pointer address spaces on all IR values | NVVM AA (nvptx-aa) |
| NVVM AA | NoAlias for cross-space pointer pairs | GVN, DSE, LICM, MemorySSA |
The feedback loop between MemorySpaceOpt and IP-MSP is the critical insight: phase 1 resolves locally-obvious cases, IP-MSP propagates those resolutions across call boundaries (cloning when necessary), and phase 2 picks up the newly-available information to resolve cases that were previously ambiguous. The worklist iterates until no more argument spaces change, guaranteeing a fixed point. NVVM AA is the downstream beneficiary -- every resolved pointer pair that previously required a conservative MayAlias answer can now return NoAlias, enabling more aggressive optimization in GVN, DSE, LICM, and scheduling.
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent address space resolution engine.
1. Resolving ambiguous pointers to the wrong default space. When the bitmask has multiple bits set (e.g., 0x03 = global OR shared), the pass defaults to global if param-always-point-to-global is true. A reimplementation that defaults to shared instead will silently produce ld.shared instructions for what is actually global memory, causing out-of-bounds accesses on the shared memory aperture. The correct behavior is: ambiguous always resolves to global (the safe conservative choice), never to a more restrictive space.
2. Forgetting to re-run after inter-procedural propagation. The pass must run twice: once before IP-MSP to resolve locally-obvious cases, and again after IP-MSP to consume propagated information. A single-pass reimplementation will miss every case where a callee's argument space is only known from the caller's context. The second run (mode 1) is not optional -- it catches the majority of inter-procedural resolutions and performs isspacep folding that the first run cannot do.
3. Cloning functions with external linkage instead of specializing in-place. The pass uses two strategies: in-place specialization for internal/private functions (all call sites visible) and clone-and-specialize for external/weak linkage. Reversing this logic -- cloning internal functions or modifying external ones -- either wastes compile time on unnecessary clones or breaks callers outside the module who still pass generic pointers. The linkage check (0x4007 for internal) is the discriminator and must not be inverted.
4. Failing to handle the addrspacecast chain correctly. After resolving a pointer's space, the pass inserts addrspacecast from generic to the specific space and replaces all uses. A reimplementation that replaces the pointer type directly (without the cast) will break LLVM's type system invariants, causing assertion failures in downstream passes. The cast must exist in the IR even though it is semantically a no-op -- LLVM's type-based alias analysis and GEP arithmetic depend on it.
5. Not iterating the IP-MSP worklist to a fixed point. The worklist must iterate until no argument bitmask changes. A reimplementation that runs one pass over all functions and stops will miss transitive resolutions through call chains (A calls B calls C). The bitmask OR is monotone (can only grow), so convergence is guaranteed, but early termination produces incomplete resolutions that leave generic pointers in the IR and forfeit the performance benefit of specialized memory instructions.
Test This
The following minimal kernel exercises address space resolution. Compile with nvcc -ptx -arch=sm_90 and inspect the PTX output.
__global__ void memspace_test(float *global_out, int n) {
__shared__ float smem[64];
smem[threadIdx.x] = (float)threadIdx.x;
__syncthreads();
float val = smem[threadIdx.x];
global_out[threadIdx.x] = val + 1.0f;
}
What to look for in PTX:
- ld.shared.f32 for the read from smem -- confirms the pass resolved the shared pointer from generic (AS 0) to shared (AS 3). If you see a plain ld.f32 without the .shared qualifier, the access goes through the generic address translation unit at runtime.
- st.global.f32 for the write to global_out -- confirms global pointer resolution (AS 1).
- Absence of cvta.to.shared / cvta.to.global instructions. These cvta (convert address) instructions indicate the backend is converting generic pointers at runtime instead of using resolved address spaces at compile time. Their absence means the pass succeeded fully.
- Compare with -O0 to see the unresolved version where generic ld/st instructions dominate.
Reimplementation Checklist
- Address space bitmask dataflow engine. Implement the per-value bitmask lattice (0x01=global, 0x02=shared, 0x04=constant, 0x08=local, 0x10=param) with OR-based meet, use-def chain walking through GEP/bitcast/PHI/alloca/inttoptr, and a visited-set to handle cycles through PHI nodes.
- Two-phase resolution with mode dispatch. Build a mode-parameterized entry point: mode 0 (conservative first-time), mode 1 (hash-table-based second-time with isspacep folding), and warning-suppression variants (modes 2/3).
- Inter-procedural fixed-point worklist (IP-MSP). Implement the module-level worklist that propagates per-argument address space bitmasks across call boundaries, re-adding callers when an argument's bitmask changes, iterating until no bitmask grows.
- Function cloning for specialization. Implement two strategies: in-place specialization for internal-linkage functions (modify arg types directly) and clone-and-specialize for external-linkage functions (create internal clone, rewrite call sites, insert addrspacecast at clone entry).
- isspacep intrinsic folding (phase 2). When a pointer's address space is resolved, fold isspacep.shared/.global/etc. builtins (IDs 0xFD0--0xFD5) to true or false constants.
- Post-resolution cleanup. Insert addrspacecast instructions, rewrite loads/stores to specific address spaces, eliminate dead cast chains (generic-to-shared followed by shared-to-generic), and rewrite call sites to target specialized clones.
- Illegal operation detection. Check and warn on illegal address-space/operation combinations (atomics on constant/local, WMMA on constant/local, vector atomics on shared/local/constant) without aborting compilation.
Pipeline Interaction
The pass runs at two points in the CICC pipeline: once early (first-time, mode 0) to resolve obvious cases before optimization, and again after inter-procedural propagation (second-time, mode 1) to catch cases that became resolvable after inlining and constant propagation. The no-warnings variants (modes 2/3) suppress repeated diagnostics on re-runs. The pass feeds directly into instruction selection, where resolved address spaces determine which PTX memory instructions are emitted. It also interacts with the ipmsp module pass, which drives the inter-procedural cloning engine separately from the per-function resolver.
nvvm-peephole-optimizer
The NVVM Peephole Optimizer is an NVIDIA-proprietary function-level IR pass that performs NVVM-specific pattern matching and instruction simplification. It is distinct from both LLVM's standard InstCombine pass (which handles general-purpose peephole optimization across ~600 functions in the 0x1700000--0x17B0000 range) and the machine-level nvptx-peephole pass (sub_21DB090) that operates on MachineInstrs after instruction selection.
This page documents all three peephole layers in cicc, their pipeline positions, their transformations, and the satellite machine-level peephole passes that complement them.
| Pass name | nvvm-peephole-optimizer |
| Class | llvm::NVVMPeepholeOptimizerPass |
| Scope | Function pass (IR level) |
| Registration | New PM slot 382 in sub_2342890 |
| Serializer | sub_2314DA0 |
| Factory (legacy PM) | sub_1CEF8F0 |
| Pipeline parser line | 3534 in sub_233C410 |
| Enable knob | enable-nvvm-peephole (bool, default = true) |
| Knob address | ctor_358_0 @ 0x50E8D0 |
| NVVMPassOptions slot | nvvm-peephole-optimizer in 4512-byte options struct |
| Pipeline position | Function-level, runs after NVVMReflect + NVVMIntrinsicLowering |
Purpose
CUDA programs produce IR patterns that standard LLVM optimizations do not recognize or cannot legally transform. The NVVM peephole pass fills this gap by matching NVVM-specific idioms -- address space casts, intrinsic call sequences, convergent operation patterns, and GPU-specific type conversions -- and rewriting them into simpler, cheaper forms. It operates at the LLVM IR level before code generation, complementing the machine-level nvptx-peephole pass that runs later in the pipeline.
The pass is always paired with sub_215D9D0 (NVVMAnnotationsProcessor), which runs immediately after the peephole in every pipeline path. This companion pass processes NVVM annotations (e.g., tcgen05 tensor annotation metadata) on the IR that the peephole has just simplified.
Three Peephole Layers
CICC contains three distinct peephole optimization layers, each operating at a different abstraction level and targeting different pattern classes.
| Layer | Pass | Level | Address / Slot | Targets |
|---|---|---|---|---|
| LLVM InstCombine | instcombine | IR | 0x1700000+ (~600 funcs) | General-purpose: algebraic simplification, constant folding, dead instruction removal |
| NVVM Peephole | nvvm-peephole-optimizer | IR | slot 382, factory sub_1CEF8F0 | NVVM-specific: address space casts, intrinsic sequences, GPU type conversions |
| NVPTX Peephole | nvptx-peephole | MachineInstr | sub_21DB090 | PTX-specific: redundant cvta folding, predicate optimization, move elimination |
The NVVM peephole pass handles transformations that require knowledge of NVVM's address space model, intrinsic semantics, or GPU-specific type system -- patterns that InstCombine cannot match because they depend on NVPTX target information not available to target-independent passes. The machine-level NVPTX peephole then handles patterns that only emerge after instruction selection has lowered IR to MachineInstrs.
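One pattern class the text attributes to this layer, dead addrspacecast chains, can be illustrated on a toy cast list. This is our illustration of the pattern, not recovered code; casts are modeled as (source AS, destination AS) pairs and a round trip through generic (AS 0) cancels:

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Cast = std::pair<int, int>; // (source AS, destination AS)

// Collapse adjacent X->Y followed by Y->X cast pairs (e.g. shared ->
// generic -> shared), which are semantic no-ops on the pointer value.
std::vector<Cast> foldCastChain(std::vector<Cast> casts) {
    std::vector<Cast> out;
    for (Cast c : casts) {
        if (!out.empty() && out.back().second == c.first &&
            out.back().first == c.second)
            out.pop_back(); // round trip: drop both casts
        else
            out.push_back(c);
    }
    return out;
}
```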
Pipeline Positions
IR-Level: nvvm-peephole-optimizer
The IR-level peephole (sub_1CEF8F0) is invoked from the legacy pipeline assembler (sub_12E54A0) in all three language-specific code paths. Its companion sub_215D9D0 always follows immediately.
Path A -- "ptx" language (lines 580--638 in sub_12E54A0):
sub_1CEF8F0() NVVMPeephole
sub_215D9D0() NVVMAnnotationsProcessor
sub_1857160() NVVMReflect (conditional)
sub_1A62BF0(1) LLVM standard pipeline #1
sub_1B26330() MemCpyOpt
sub_18DEFF0() DCE
...
Path B -- "mid" language (Ofcmid, lines 814--1075):
sub_184CD60() ConstantMerge / GlobalDCE
sub_1CB4E40(0) NVVMIntrinsicLowering
sub_1B26330() MemCpyOpt
sub_198E2A0() SROA / CorrelatedValuePropagation
sub_1CEF8F0() NVVMPeephole <<<
sub_215D9D0() NVVMAnnotationsProcessor
sub_17060B0(1,0) PrintModulePass
sub_198DF00(-1) JumpThreading / CVP
sub_1C6E800() GVN / LICM
...
Path C -- default/general (O2/O3, lines 1077--1371):
sub_1A62BF0(4) LLVM standard pipeline #4
sub_1857160() NVVMReflect
sub_1CB4E40(0) NVVMIntrinsicLowering
sub_1857160() NVVMReflect (second pass)
sub_1CEF8F0() NVVMPeephole <<<
sub_215D9D0() NVVMAnnotationsProcessor
sub_1A7A9F0() InstructionSimplify
sub_1A62BF0(5) LLVM standard pipeline #5
...
Late position (O3 tier finalization):
sub_1B7FDF0(n) BranchFolding / CFGSimplify
sub_1CEF8F0() NVVMPeephole <<<
sub_215D9D0() NVVMAnnotationsProcessor
sub_18B3080(f) Sinking2Pass (fast mode)
sub_1CC60B0() NVVMSinking
sub_18A3430() AggressiveInstCombine
...
In every path, the peephole runs after NVVMIntrinsicLowering (sub_1CB4E40) and NVVMReflect (sub_1857160) have resolved intrinsics and reflect calls. This ensures the peephole sees simplified IR where previously-opaque intrinsic call patterns have been reduced to simpler forms amenable to pattern matching.
Machine-Level: nvptx-peephole
The machine-level peephole (sub_21DB090) runs in addPreRegAlloc() (sub_2166ED0):
EarlyTailDuplicate
codegen DCE
Machine LICM + CSE + Sinking (conditional on enable-mlicm, enable-mcse)
PeepholeOptimizerPass (stock LLVM, slot 492, disable-peephole)
NVPTXPeephole (sub_21DB090) <<<
DeadMachineInstrElim
MachineCopyPropagation
The string "After codegen peephole optimization pass" in sub_2166ED0 marks the checkpoint after both the stock LLVM peephole and the NVPTX peephole have completed.
New PM Registration
The pass is registered as a function-level pass in the New Pass Manager at registration line 2242 in sub_2342890. It sits in the mid-optimization phase alongside other NVIDIA function passes:
| Slot | Pass | Class |
|---|---|---|
| 376 | basic-dbe | BasicDeadBarrierEliminationPass |
| 377 | branch-dist | BranchDistPass |
| 378 | byval-mem2reg | ByValMem2RegPass |
| 379 | bypass-slow-division | BypassSlowDivisionPass |
| 380 | normalize-gep | NormalizeGepPass |
| 381 | nvvm-reflect-pp | SimplifyConstantConditionalsPass |
| 382 | nvvm-peephole-optimizer | NVVMPeepholeOptimizerPass |
| 383 | old-load-store-vectorizer | OldLoadStoreVectorizerPass |
| 384 | print<merge-sets> | MergeSetsAnalysisPrinterPass |
| 385 | remat | RematerializationPass |
IR-Level Transformation Categories
Based on pipeline position (after NVVMReflect + NVVMIntrinsicLowering, before sinking and rematerialization) and the patterns visible in NVVM IR, the peephole optimizer targets several categories.
Address Space Cast Simplification
After memory-space-opt and ipmsp resolve generic pointers to specific address spaces, redundant addrspacecast chains remain in the IR. The peephole rewrites these:
; Before:
%p1 = addrspacecast ptr addrspace(3) %src to ptr ; shared -> generic
%p2 = addrspacecast ptr %p1 to ptr addrspace(3) ; generic -> shared
store i32 %val, ptr addrspace(3) %p2
; After:
store i32 %val, ptr addrspace(3) %src ; chain eliminated
; Before:
%p = addrspacecast ptr addrspace(1) %src to ptr addrspace(1) ; identity cast
; After:
; (use %src directly — identity addrspacecast removed)
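The two rewrites above can be sketched as a recursive fold over a toy cast representation. This is a minimal sketch of the described transform, not recovered code; the node shapes and names are illustrative.

```python
# Toy IR: ("ptr", addrspace, name) leaves and ("cast", dst_addrspace, inner) nodes.
def space_of(v):
    return v[1]  # both node shapes keep the address space at index 1

def fold_addrspacecast(v):
    """Collapse identity casts and A -> generic -> A round-trips."""
    if v[0] != "cast":
        return v
    dst, inner = v[1], fold_addrspacecast(v[2])
    if space_of(inner) == dst:
        return inner              # identity cast: already in the target space
    if inner[0] == "cast" and space_of(inner[2]) == dst:
        return inner[2]           # round-trip through generic collapsed
    return ("cast", dst, inner)
```

Folding the shared -> generic -> shared chain from the first example yields the original shared-space pointer directly.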
The validation function sub_21BEE70 ("Bad address space in addrspacecast", 4.1KB) ensures the peephole does not create illegal address space transitions. NVPTX address spaces are:
| AS | Name | Legal cast targets |
|---|---|---|
| 0 | Generic | All |
| 1 | Global | Generic |
| 3 | Shared | Generic |
| 4 | Constant | Generic |
| 5 | Local | Generic |
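The legality rule in the table reduces to a one-line predicate: specific spaces may only cast to or from generic. A sketch of that rule follows; the real check is the 4.1KB validator sub_21BEE70, which presumably handles far more detail.

```python
# NVPTX address space numbers, as listed in the table above.
GENERIC, GLOBAL, SHARED, CONSTANT, LOCAL = 0, 1, 3, 4, 5

def is_legal_addrspacecast(src: int, dst: int) -> bool:
    """Generic casts to/from everything; specific spaces only via generic.
    Identity casts appear pre-peephole and are accepted (then removed)."""
    return src == dst or src == GENERIC or dst == GENERIC
```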
Intrinsic Call Folding
After NVVMIntrinsicLowering has expanded NVVM intrinsics, some expansion sequences can be further simplified:
; Before (after intrinsic lowering, launch_bounds known):
%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%cmp = icmp ult i32 %tid, 256 ; blockDim.x known = 256
; After (when nvvm-intr-range has set !range {0, 256}):
; all uses of %cmp replaced by constant i1 true -- always true for valid threads
; Before:
call void @llvm.nvvm.barrier0()
; (no shared memory operations between barriers)
call void @llvm.nvvm.barrier0()
; After:
call void @llvm.nvvm.barrier0() ; redundant barrier removed
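The barrier-deduplication rule above can be modeled as a single sweep that tracks whether any shared-memory access has occurred since the last barrier. This is a conservative toy model over illustrative instruction strings, not the pass's actual representation.

```python
def drop_redundant_barriers(insts):
    """Drop a barrier0 when no shared-memory access happened since the last one."""
    out, shared_touched = [], True   # keep the first barrier unconditionally
    for inst in insts:
        if inst == "barrier0":
            if not shared_touched:
                continue             # nothing to synchronize: drop it
            out.append(inst)
            shared_touched = False
        else:
            out.append(inst)
            if inst.startswith("shared."):
                shared_touched = True
    return out
```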
Type Conversion Cleanup
GPU-specific type representations (bf16, tf32, fp8) produce conversion chains not present in standard LLVM IR:
; Before (roundtrip through wider type):
%wide = fpext half %x to float
%back = fptrunc float %wide to half
; After:
; (use %x directly — roundtrip eliminated when no precision loss)
; Before (bf16 roundtrip):
%f32 = call float @llvm.nvvm.bf16.to.f32(i16 %bf)
%bf2 = call i16 @llvm.nvvm.f32.to.bf16(float %f32)
; After:
; (use %bf directly)
Post-Reflect Dead Code Cleanup
The companion pass nvvm-reflect-pp (SimplifyConstantConditionalsPass) runs immediately before the peephole in the pipeline. It resolves __nvvm_reflect() calls and simplifies constant conditionals:
; Before (after nvvm-reflect-pp resolves __nvvm_reflect("__CUDA_FTZ") = 1):
%ftz = call i32 @__nvvm_reflect(ptr @"__CUDA_FTZ") ; resolved to 1
%cmp = icmp ne i32 %ftz, 0 ; always true
br i1 %cmp, label %ftz_path, label %no_ftz_path
; After nvvm-reflect-pp:
br label %ftz_path ; unconditional
; The peephole then cleans up dead instructions in %no_ftz_path
; and simplifies any resulting phi nodes at merge points
Convergent Operation Canonicalization
CUDA's convergent operations (__syncwarp, __ballot_sync, etc.) have specific semantic constraints that standard InstCombine cannot reason about because it must treat convergent calls as opaque. The peephole, with knowledge of NVVM semantics, can simplify convergent call sequences when the mask or participating threads can be determined at compile time.
Machine-Level NVPTXPeephole (sub_21DB090)
The machine-level peephole operates on MachineInstr objects after instruction selection has converted LLVM IR to PTX pseudo-instructions. It targets patterns specific to the PTX instruction set.
Redundant cvta Folding
The cvta (convert address) instruction converts between generic and specific address spaces. Address space lowering often inserts redundant conversions:
// Before:
cvta.to.global %rd1, %rd2 ; convert generic -> global
cvta.global %rd3, %rd1 ; convert global -> generic (redundant pair)
// After:
mov.b64 %rd3, %rd2 ; direct copy, cvta pair eliminated
The companion pass sub_21DA810 ("NVPTX optimize redundant cvta.to.local instruction") handles the remaining cvta.to.local instructions that survive to late post-RA:
// Before (late pipeline):
cvta.to.local %rd1, %rd2 ; redundant when %rd2 is already local-space
// After (sub_21DA810 removes it):
; (use %rd2 directly)
Predicate Pattern Optimization
PTX uses predicate registers for conditional execution. The peephole simplifies predicate sequences:
// Before:
setp.ne.s32 %p1, %r1, 0;
@%p1 bra target;
// After (folds setp into branch when pattern is recognized):
// Combined compare-and-branch
PTX Move Elimination
sub_2204E60 ("Remove redundant moves") eliminates identity moves:
// Before:
mov.b32 %r5, %r5; ; identity move
// After:
; (deleted)
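The identity-move rule is simple enough to sketch over textual PTX lines. The real pass operates on MachineInstr objects; this only mirrors the pattern it matches.

```python
import re

# Matches e.g. "mov.b32 %r5, %r5;" and captures destination and source.
_MOV = re.compile(r"mov\.\w+\s+(%\w+),\s*(%\w+);")

def drop_identity_moves(lines):
    kept = []
    for line in lines:
        m = _MOV.match(line.strip())
        if m and m.group(1) == m.group(2):
            continue                 # mov %r, %r -- delete
        kept.append(line)
    return kept
```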
Satellite Machine Peephole Passes
Three additional machine-level passes perform specialized peephole transformations adjacent to the main NVPTXPeephole:
param-opt (sub_2203290)
| Pass name | param-opt |
| Entry point | sub_2203290 |
| Description | "Optimize NVPTX ld.param" |
Optimizes parameter load patterns. In PTX, kernel parameters are loaded via ld.param instructions into registers. When the same parameter is loaded multiple times (e.g., after inlining or loop unrolling), param-opt consolidates them:
// Before:
ld.param.u32 %r1, [_param_0];
...
ld.param.u32 %r7, [_param_0]; ; redundant reload of same parameter
// After:
ld.param.u32 %r1, [_param_0];
...
mov.b32 %r7, %r1; ; reuse previous load
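The consolidation shown above can be sketched as a cache of the first register loaded per parameter. This assumes straight-line code where the first destination register is never clobbered, and it hard-codes mov.b32 for simplicity; the real pass must verify both on the machine IR.

```python
import re

# Matches e.g. "ld.param.u32 %r1, [_param_0];".
_LDPARAM = re.compile(r"ld\.param\.(\w+)\s+(%\w+),\s*\[(\w+)\];")

def consolidate_ld_param(lines):
    first_reg, out = {}, []
    for line in lines:
        m = _LDPARAM.match(line.strip())
        if m:
            _, reg, param = m.groups()
            if param in first_reg:
                # redundant reload: reuse the register from the first load
                out.append(f"mov.b32 {reg}, {first_reg[param]};")
                continue
            first_reg[param] = reg
        out.append(line)
    return out
```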
nvptx-trunc-opts (sub_22058E0)
| Pass name | nvptx-trunc-opts |
| Entry point | sub_22058E0 |
| Description | "Optimize redundant ANDb16ri instrunctions" [sic] |
Eliminates redundant AND operations on b16 (16-bit) registers. When type legalization widens a sub-16-bit value to 16 bits, it inserts an AND with a mask to preserve the original width. If the value is already correctly masked (e.g., from a load that zero-extends), the AND is redundant:
// Before:
ld.u8 %rs1, [%rd1]; ; loads 8-bit, zero-extended to 16
and.b16 %rs2, %rs1, 0xFF; ; redundant mask — already 8-bit clean
// After:
ld.u8 %rs1, [%rd1];
// (AND deleted, use %rs1 directly)
The binary contains the string with a typo: "instrunctions" instead of "instructions".
Remove Redundant Moves (sub_2204E60)
| Entry point | sub_2204E60 |
| Description | "Remove redundant moves" |
Eliminates move instructions where source and destination are the same register, or where the move is immediately dead. This complements the stock LLVM MachineCopyPropagation pass with PTX-specific move patterns.
Knobs
| Knob | Type | Default | Scope | Effect |
|---|---|---|---|---|
| enable-nvvm-peephole | bool | true | IR + Machine | Master switch for both the IR-level nvvm-peephole-optimizer and the machine-level nvptx-peephole. Registered at ctor_358_0 (0x50E8D0). |
| disable-peephole | bool | false | Machine only | Disables the stock LLVM PeepholeOptimizerPass (slot 492). Does not affect the NVIDIA-specific passes. Registered at ctor_314 (0x502360). |
| aggressive-ext-opt | bool | (varies) | Machine only | Controls aggressive extension optimization in stock LLVM peephole. |
| disable-adv-copy-opt | bool | false | Machine only | Disables advanced copy optimization in stock LLVM peephole. |
| rewrite-phi-limit | int | (varies) | Machine only | Limits PHI rewriting in stock LLVM peephole. |
| recurrence-chain-limit | int | (varies) | Machine only | Limits recurrence chain analysis in stock LLVM peephole. |
The enable-nvvm-peephole description string recovered from the binary is "Enable NVVM Peephole Optimizer". Its default-on status suggests the pass is considered mature enough to run without opt-in.
Optimization Level Behavior
The IR-level peephole runs in all optimization paths except -O0:
| Level | Path | NVVMPeephole invocations |
|---|---|---|
| Ofcmin | "ptx" path | 1 (early) |
| Ofcmid | "mid" path | 1 (after SROA/CVP) |
| O2/O3 | "default" path | 1 (after NVVMReflect + IntrinsicLowering) |
| O3 (late) | Tier finalization | 1 (after BranchFolding/CFGSimplify) |
At -O0, the peephole is likely skipped along with most optimization passes. The factory function sub_1CEF8F0 appears only in code paths that are active at O1 and above.
End-to-End Peephole Pipeline
The complete peephole optimization flow through cicc, from IR to PTX:
Source CUDA
|
v
[LLVM IR after clang/EDG frontend]
|
v
InstCombine (0x1700000+) General algebraic simplification
| ~600 functions, target-independent
v
NVVMReflect (sub_1857160) Resolve __nvvm_reflect() calls
|
v
nvvm-reflect-pp Simplify constant conditionals from reflect
|
v
NVVMIntrinsicLowering (sub_1CB4E40) Expand NVVM intrinsics
|
v
nvvm-peephole-optimizer NVVM-specific IR patterns:
(sub_1CEF8F0 factory) - addrspacecast chain folding
| - intrinsic sequence simplification
v - type conversion roundtrip elimination
NVVMAnnotationsProcessor - post-reflect dead code cleanup
(sub_215D9D0 companion)
|
v
[Further IR optimization: GVN, LICM, Sinking2, etc.]
|
v
[Instruction Selection: DAGToDAG (sub_2200150, 78KB)]
| Hash-table pattern matching: hash = (37*idx) & (tableSize-1)
v
PeepholeOptimizerPass (slot 492) Stock LLVM machine peephole:
| - redundant copy folding
v - compare-and-branch simplification
NVPTXPeephole (sub_21DB090) PTX-specific machine peephole:
| - cvta pair elimination
v - predicate folding
param-opt (sub_2203290) - ld.param consolidation
|
v
nvptx-trunc-opts (sub_22058E0) - ANDb16ri elimination
|
v
Remove Redundant Moves (sub_2204E60) - identity move deletion
|
v
[Register Allocation]
|
v
ProxyRegErasure (sub_21DA810) Late cvta.to.local removal
|
v
[PTX Emission]
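The instruction-selection stage in the diagram uses hash-table pattern matching with the recovered probe formula hash = (37*idx) & (tableSize-1). A minimal sketch of that probe (the function name and table contents are illustrative, not recovered):

```python
def dag_pattern_slot(idx: int, table_size: int) -> int:
    """Recovered probe formula; the masking trick requires a power-of-two table."""
    assert table_size & (table_size - 1) == 0, "table size must be a power of two"
    return (37 * idx) & (table_size - 1)
```

Multiplying by the odd constant 37 scatters consecutive pattern indices before the power-of-two mask truncates them, a common cheap alternative to modulo by a prime.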
Function Map
| Name | Function | Size | Role |
|---|---|---|---|
| -- | sub_1CEF8F0 | small | NVVMPeephole factory (legacy PM) |
| -- | sub_215D9D0 | -- | NVVMAnnotationsProcessor (companion, always paired) |
| -- | sub_2314DA0 | small | NVVMPeepholeOptimizerPass serializer (New PM) |
| -- | sub_2342890 | -- | New PM registration function (slot 382) |
| -- | sub_233C410 | -- | Pipeline text parser (line 3534) |
| -- | sub_21DB090 | small | NVPTXPeephole machine pass registration |
| -- | sub_2166ED0 | 1.6KB | addPreRegAlloc() -- hosts NVPTXPeephole |
| -- | sub_21DA810 | -- | ProxyRegErasure (cvta.to.local removal) |
| -- | sub_2203290 | small | param-opt (ld.param optimization) |
| -- | sub_2204E60 | small | Remove Redundant Moves |
| -- | sub_22058E0 | small | nvptx-trunc-opts (ANDb16ri elimination) |
| -- | sub_21BEE70 | 4.1KB | "Bad address space in addrspacecast" validation |
| -- | sub_20DA7F0 | 30KB | DAG combine / peephole on MachineInstrs |
| -- | sub_37E1AE0 | 18KB | Late-stage machine optimization (peephole or copy prop) |
Differences from Upstream LLVM
Upstream LLVM (as of LLVM 17/18) contains NVPTXPeephole.cpp in llvm/lib/Target/NVPTX/, which implements a small machine-level pass that:
- Folds `cvta` address-space-conversion pseudo-instructions
- Removes `NVPTX::PROXY_REG` pseudo-instructions (now split into a separate `NVPTXProxyRegErasure` pass in cicc)
CICC v13.0 extends this significantly:
- The IR-level pass (`nvvm-peephole-optimizer`) has no upstream counterpart. It is entirely NVIDIA-proprietary, filling a gap between target-independent InstCombine and target-specific machine peephole.
- Three satellite machine passes (`param-opt`, `nvptx-trunc-opts`, `Remove Redundant Moves`) have no upstream equivalents.
- The machine-level `nvptx-peephole` is larger than upstream, likely incorporating additional pattern rules for newer PTX features (tensor core operations, cluster operations, etc.).
- ProxyRegErasure is separated from NVPTXPeephole into its own pass (`sub_21DA810`) and runs late post-RA rather than inline with the peephole.
Evidence Summary
The pass's existence and classification are confirmed through multiple independent sources:
| Source | Address / Location | Evidence |
|---|---|---|
| Pipeline parser | sub_233C410 line 3534 | Registers "nvvm-peephole-optimizer" as function-level NVIDIA custom pass |
| New PM registration | sub_2342890 slot 382 | Maps string to llvm::NVVMPeepholeOptimizerPass |
| Serializer | sub_2314DA0 | Produces "nvvm-peephole-optimizer" text for pipeline printing |
| Legacy PM factory | sub_1CEF8F0 | Called 2x from sub_12E54A0 (pipeline assembler) |
| Companion pairing | sub_215D9D0 | Always immediately follows sub_1CEF8F0 in all paths |
| Knob sweep | 0x50E8D0 (ctor_358_0) | enable-nvvm-peephole = "Enable NVVM Peephole Optimizer", default true |
| Knob duplicate | 0x560000 sweep line 292 | Confirmed with identical description |
| NVVMPassOptions | p2a.3-03-passoptions.txt | Listed as nvvm-peephole-optimizer in option table |
| Machine pass | sub_21DB090 | "NVPTX Peephole" / "nvptx-peephole" registration string |
| Machine pipeline | sub_2166ED0 | "After codegen peephole optimization pass" checkpoint string |
Confidence note. The pass registration, knobs, pipeline position, and factory function are confirmed at HIGH confidence from binary evidence. The specific transformation patterns described above are at MEDIUM confidence -- inferred from pipeline position (runs after NVVMReflect + NVVMIntrinsicLowering), NVVM IR semantics, and address space validation code, but the actual NVVMPeepholeOptimizerPass::run() body has not been individually decompiled. The factory sub_1CEF8F0 creates the pass object; the run method is dispatched through the object's vtable.
Cross-References
- Scalar Passes (InstCombine) -- stock LLVM InstCombine that handles general-purpose peephole
- NVVM Intrinsic Lowering -- runs before the peephole, expands intrinsics
- NVVMReflect -- resolves `__nvvm_reflect()` before the peephole cleans up
- Machine-Level Passes -- documents the full pre-RA / post-RA machine pass pipeline
- Minor NVIDIA Passes -- brief entries for `nvptx-peephole`, `proxy-reg-erasure`, and other small passes
- Address Spaces -- NVPTX address space numbering and cast rules
- NVVMPassOptions -- the 4512-byte options struct that gates this pass
- Optimization Levels -- which paths invoke the peephole at each -O level
- Pipeline Assembler -- the master `sub_12E54A0` function that builds the pass pipeline
Sinking2 (NVIDIA Code Sinking)
sinking2 is an NVIDIA-proprietary instruction sinking pass that moves instructions closer to their uses, with specific awareness of GPU texture and surface memory operations. It is entirely distinct from LLVM's stock sink pass: while both perform code sinking, Sinking2 is tailored for NVIDIA's memory hierarchy and iterates to a fixed point rather than making a single pass. The primary motivation is reducing register pressure by deferring computation of values until just before they are consumed, which is especially impactful on GPUs where register files are shared across hundreds of concurrent threads.
The pass is particularly focused on sinking instructions into texture load blocks. Texture operations on NVIDIA GPUs have high latency but are served by a dedicated cache; by sinking the address computation and other operands into the block that performs the texture fetch, the compiler reduces the live range of those values and frees registers for other warps. This directly improves occupancy -- the number of warps that can execute simultaneously on an SM.
Pipeline Position
| Field | Value |
|---|---|
| Pass name (pipeline) | sinking2 |
| Pass ID | sink2 |
| Display name | Code sinking |
| Pass type | FunctionPass (NVIDIA-custom) |
| Class | llvm::Sinking2Pass |
| Registration | New PM #390, line 2282 in sub_2342890 |
| Runtime positions | Tier 1/2/3 #81 (NVVMSinking2 via sub_1CC60B0, gated by opts[3328] && !opts[2440]); see Pipeline |
| Legacy PM entry | sub_1CCA270 |
| New PM entry | sub_2D1C160 (19KB) |
| Legacy PM registration | sub_1CC7010 |
| New PM registration | sub_2D1B410 |
| Knob constructor | ctor_275 at 0x4F7750 |
| Vtable (Legacy) | off_49F8BC0 |
| Vtable (New PM) | off_4A260F0 |
Relationship to All Sink Passes in cicc
CICC v13.0 contains five distinct sinking mechanisms. Understanding which is which is essential when reading the pipeline or debugging register pressure issues:
| Pass ID / Factory | Class | Origin | Key Difference |
|---|---|---|---|
| sink / sub_1A634D0 | LLVM SinkingPass | Upstream LLVM | Stock single-pass sinking, uses MemorySSA for alias safety |
| sink2 / sub_1CCA270 | llvm::Sinking2Pass | NVIDIA | Texture-aware, iterative fixpoint, custom AA layer |
| sink<rp-aware> | Parameterized variant | LLVM + NVIDIA | Register-pressure-aware sinking (stock sink with rp-aware-sink=true) |
| NVVMSinking2 / sub_1CC60B0 | NVIDIA late sinking | NVIDIA | Late-pipeline SM-specific sinking, gated by opts[3328] |
| MachineSink | LLVM MachineSinking | LLVM | MIR-level sinking, opt-in for NVPTX via nvptx-enable-machine-sink |
The stock LLVM sink (sub_1869C50, called with params (1,0,1)) uses MemorySSA for alias queries and makes a single pass. Sinking2 uses its own alias analysis layer routed through sub_13575E0 and iterates to convergence. NVVMSinking2 (sub_1CC60B0) is a separate NVIDIA pass that runs late in the pipeline after barrier lowering and warp-level optimizations, gated by the SM-specific pass group flag opts[3328].
IR Before/After Example
The pass sinks address computation closer to texture/surface use sites, reducing register pressure by shortening live ranges.
Before (address computation in preheader, live across loop body):
preheader:
%base = getelementptr float, ptr addrspace(1) %tex_ptr, i64 %offset
%addr = getelementptr float, ptr addrspace(1) %base, i64 %stride
br label %loop
loop:
%i = phi i64 [ 0, %preheader ], [ %i.next, %loop ]
; ... many instructions using registers, %base and %addr are live ...
%tex_addr = getelementptr float, ptr addrspace(1) %addr, i64 %i
%h = ptrtoint ptr addrspace(1) %tex_addr to i64
%val = call float @llvm.nvvm.tex.unified.1d.v4f32.f32(i64 %h)
%i.next = add i64 %i, 1
%cmp = icmp slt i64 %i.next, %n
br i1 %cmp, label %loop, label %exit
After (address computation sunk into loop, next to texture use):
preheader:
br label %loop
loop:
%i = phi i64 [ 0, %preheader ], [ %i.next, %loop ]
; ... many instructions, but %base and %addr are no longer live here ...
%base = getelementptr float, ptr addrspace(1) %tex_ptr, i64 %offset
%addr = getelementptr float, ptr addrspace(1) %base, i64 %stride
%tex_addr = getelementptr float, ptr addrspace(1) %addr, i64 %i
%h = ptrtoint ptr addrspace(1) %tex_addr to i64
%val = call float @llvm.nvvm.tex.unified.1d.v4f32.f32(i64 %h)
%i.next = add i64 %i, 1
%cmp = icmp slt i64 %i.next, %n
br i1 %cmp, label %loop, label %exit
The GEP instructions now execute inside the loop (higher execution count) but free registers in the rest of the loop body. This is a deliberate tradeoff: extra ALU work for reduced register pressure, which typically improves occupancy and net throughput.
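The register-pressure effect can be quantified with a simple live-interval sweep. This is a toy model (positions and names are illustrative): each value contributes a live interval from its definition to its last use, and the peak overlap approximates register demand.

```python
def max_live(intervals):
    """intervals: list of (name, def_pos, last_use_pos).
    Returns the peak number of simultaneously live values."""
    events = []
    for _, d, u in intervals:
        events.append((d, 1))        # value becomes live at its definition
        events.append((u + 1, -1))   # value dies after its last use
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak
```

Before sinking, %base and %addr defined in the preheader stay live across the whole body; after sinking, their intervals shrink to a few instructions next to the texture fetch, lowering the peak.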
Algorithm
Entry Point
The legacy PM entry sub_1CCA270 performs these steps:
- Fetches `DominatorTree` analysis (via `DominatorTreeWrapperPass` at `unk_4F9E06C`)
- Fetches `LoopInfo` analysis (via `LoopInfoWrapperPass` at `unk_4F96DB4`)
- Reads the `sink-into-texture` knob (`qword_4FBF2C0[20]`) -- must be non-zero (enabled)
- Reads the `sink-limit` knob (`qword_4FBF1E0[20]`) -- must be greater than zero
- Calls the main worklist driver `sub_1CC9110`
The New PM entry sub_2D1C160 (19KB) performs the same logic using AnalysisManager to fetch analyses, then dispatches to sub_2D1CFB0 (13KB).
The pass does not require ScalarEvolution (SCEV), MemorySSA, or PostDominatorTree, keeping it simpler and cheaper than loop-oriented or MemorySSA-dependent passes.
Main Worklist Driver (sub_1CC9110, 22KB)
The core algorithm is a fixpoint iteration over the dominator tree:
function SinkingWorklist(F, DT, LI, textureLevel, sinkLimit):
changed = false
do:
roundChanged = false
sinkCount = 0
// Walk dominator tree in DFS preorder
for BB in DT.dfs_preorder():
// Skip loop headers to avoid creating loop-carried deps
if LI.isLoopHeader(BB):
continue
// Process instructions bottom-up within each block
for I in reverse(BB.instructions()):
if sinkCount >= sinkLimit:
break // complexity limiter
if I.mayHaveSideEffects() or I.isTerminator():
continue // unsinkable
if I.use_empty():
continue // dead, leave for DCE
// Level 3: consider instructions used only outside BB
if textureLevel < 3 and allUsesInSameBlock(I, BB):
continue
targetBB = findBestSinkTarget(I, DT, LI) // sub_1CC7510
if targetBB == BB:
continue // already in best position
// Profitability: prefer texture/surface blocks
if textureLevel >= 1:
if not blockContainsTextureOps(targetBB):
if not dominatesTextureBlock(targetBB, DT):
continue // not profitable
// Safety: alias analysis check
if not isSafeToSink(I, BB, targetBB): // sub_1CC8920
continue
// Safety: memory dependency check
if not checkMemDep(I, BB, targetBB): // sub_1CC8CA0
continue
I.moveBefore(targetBB.firstNonPHI())
roundChanged = true
sinkCount++
changed |= roundChanged
while roundChanged // iterate until no more changes
return changed
Key design points:
- DFS preorder ensures parent blocks are processed before children. Instructions sunk from a parent into a child on one iteration may expose further sinking opportunities for grandchild blocks on the next iteration -- hence the fixpoint loop.
- Bottom-up within each block processes the last instruction first. This is important because sinking an instruction may make an earlier instruction's operands dead, which DCE will clean up later.
- Loop headers are skipped to prevent creating loop-carried dependencies (a value defined in the header, consumed in the latch, sunk into the latch would create a cycle).
Instruction Processing (sub_1CC7510, 16KB)
For each candidate instruction, this function:
- Walks the use chain to find all consumers (via `sub_15F4D60`, multi-use check)
- For each user, determines the containing basic block
- Computes the lowest common dominator (LCD) of all user blocks using the dominator tree
- If LCD == current block, no benefit from sinking -- the instruction is already as close to its uses as possible while dominating all of them
- Builds a sink mapping: instruction to target block
- Checks memory safety via alias analysis (`sub_13575E0`)
- Validates that sinking does not violate memory ordering constraints
- Respects PHI nodes (LLVM opcode `PHI`) as sink boundaries -- an instruction cannot be sunk past a PHI insertion point
The target block selection algorithm effectively finds the nearest common dominator of all uses that is strictly dominated by the current block. If the instruction has a single use, the target is trivially the use's block (or its immediate dominator if the use is a PHI operand).
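The nearest-common-dominator step can be sketched by walking immediate-dominator links; a fold of this function over all use blocks yields the sink target. This is a standard textbook formulation, not the binary's recovered code; `idom` maps each block to its immediate dominator (None for the entry block).

```python
def nearest_common_dominator(idom, a, b):
    """Nearest block dominating both a and b, via idom-chain walking."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = idom.get(a)
    while b not in ancestors:
        b = idom[b]   # climb b's dominator chain until it meets a's
    return b
```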
Dominance Ordering (sub_1CC8170, 13KB)
Implements a hash-based ordering of basic blocks for comparing sink profitability. Uses DFS numbering from the dominator tree to determine which block comes "earlier" in the program. This ordering ensures:
- Instructions are only sunk toward uses, never away from them
- When multiple sink targets exist (multi-use instruction), the lowest common dominator is chosen
- The ordering is consistent across iterations of the fixpoint loop
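DFS numbering over the dominator tree supports both the ordering and O(1) dominance queries via entry/exit timestamps. This is the standard technique; the binary's exact encoding is not recovered, so names here are illustrative.

```python
def dfs_intervals(tree, root):
    """Assign entry/exit timestamps over a tree given as {node: [children]}."""
    tin, tout, t = {}, {}, 0
    stack = [(root, False)]
    while stack:
        node, done = stack.pop()
        if done:
            tout[node] = t; t += 1
            continue
        tin[node] = t; t += 1
        stack.append((node, True))          # close node after its subtree
        for child in tree.get(node, []):
            stack.append((child, False))
    return tin, tout

def dominates(tin, tout, a, b):
    """a dominates b iff b's interval nests inside a's."""
    return tin[a] <= tin[b] and tout[b] <= tout[a]
```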
Alias Checking (sub_1CC8920, 4KB)
Validates that moving instruction I from block From to block To does not reorder I past any conflicting memory access:
function isSafeToSink(I, From, To):
if not I.mayReadOrWriteMemory():
return true // pure computation, always safe
// Walk all instructions on the domtree path From -> To
for BB in pathBlocks(From, To):
for J in BB.instructions():
if J == I: continue
if AA.getModRefInfo(I, J) != NoModRef:
return false // conflict: I aliases with J
return true
This is not MemorySSA-based (unlike stock LLVM sink). The pass invokes the traditional AliasAnalysis query interface through sub_13575E0. This is less precise than MemorySSA but avoids the cost of building and maintaining the MemorySSA graph, which matters because Sinking2 iterates to fixpoint and would need to update MemorySSA on every move.
Memory Dependency Checking (sub_1CC8CA0, 6KB)
Additional memory safety layer beyond alias checking:
- Store-load forwarding: if `I` is a load and there is a store between `From` and `To` that may alias the loaded location, sinking would change the value loaded
- Store ordering: if `I` is a store, moving it past another store to a potentially-aliasing location changes program semantics
- Volatile/atomic barrier: volatile loads/stores and atomic operations are never sunk (treated as having side effects)
- Synchronization intrinsics: barrier calls (`__syncthreads`, `bar.sync`) are treated as memory fences; no instruction may be sunk past them
Texture/Surface Awareness
The pass identifies "texture blocks" -- basic blocks containing calls to texture/surface intrinsics (the tex.*, suld.*, sust.* family). Address computations that feed these intrinsic calls are the primary sink candidates, because texture address computation chains (GEP + index arithmetic) produce intermediate values that are consumed only at the texture fetch site. Without sinking, these intermediates occupy registers across potentially many instructions.
The sink-into-texture knob controls aggressiveness:
| Level | Behavior |
|---|---|
| 0 | Disabled -- no texture-aware sinking |
| 1 | Cross-block only: move instructions across block boundaries into texture blocks |
| 2 | Cross-block + intra-block: also reorder instructions within a block to position them immediately before their texture use |
| 3 (default) | All of the above + outside-only: consider instructions whose only uses are in blocks other than where the instruction is defined |
Level 3 catches the important case where a GEP in a preheader feeds a texture load inside a loop -- the GEP has no uses in its own block, only "outside" uses.
Address space checks for NVPTX (see reference/address-spaces):
- AS 1 (global): may alias with texture reads in some configurations
- AS 3 (shared): texture operations never access shared memory, so shared-space stores are not barriers to texture sinking
- AS 4 (const): texture/surface descriptors typically live in constant memory
- AS 5 (local): thread-local, no cross-thread interference
Loop Considerations
Sinking2 is loop-aware but conservative:
- Never sinks OUT of a loop: moving an instruction from a loop body to an exit block would change its execution count. The pass skips this entirely.
- May sink INTO loop bodies: when an instruction in a loop preheader feeds only uses inside the loop (particularly texture fetches), sinking it into the loop is profitable despite increasing execution count -- the register pressure reduction from shorter live ranges outweighs the extra computation.
- Skips loop headers: prevents creating loop-carried dependencies.
- Runs after LoopSimplify: the early instance (`sub_18B1DE0`) runs after LoopSimplify/LCSSA have canonicalized loop structure, so preheaders, latches, and exit blocks are well-formed.
This creates a deliberate tension with LICM:
- LICM hoists loop-invariant code into the preheader (reducing execution count)
- Sinking2 sinks non-invariant address computation out of the preheader and into the loop body (reducing register pressure)
The two passes run at different pipeline positions and balance each other. LICM runs first; Sinking2 runs after GVN and CGSCC inlining, when texture patterns are fully exposed.
Barrier Awareness
Sinking2 itself does not contain explicit __syncthreads / bar.sync detection logic. Instead, it relies on the LLVM side-effect model:
- Barrier intrinsics are marked as having side effects, so they are never sunk
- Barrier intrinsics are treated as memory fences by alias analysis, so no memory instruction may be sunk past them
The late NVVMSinking2 (sub_1CC60B0) runs after barrier lowering (sub_1CB73C0) and warp-level optimization passes. By that point, barriers have been lowered to their final form. The pipeline ordering is:
NVVMBranchDist -> NVVMWarpShuffle -> NVVMReduction -> NVVMSinking2
This sequence ensures NVVMSinking2 can sink past warp-level operations that are no longer opaque barriers, while still respecting the lowered barrier representation.
Multi-Run Pipeline Pattern
Sinking2 appears at three to four pipeline positions. Each run has different context and different opportunities:
| Position | Factory | Mode | Context |
|---|---|---|---|
| Early (pass ~39) | sub_18B1DE0() | Standard | After stock Sink, GVN, and CGSCC inlining. Texture patterns are exposed. |
| Post-peephole | sub_18B3080(1) | Fast (flag=1) | After NVVMPeephole. Peephole may create new sinking opportunities. Reduced iteration budget. |
| Late SM-specific | sub_1CC60B0() | SM-gated | After barrier lowering and warp shuffle. Gated by opts[3328] && !opts[2440]. |
For fast-compile mode (Ofcmax), only sub_18B3080(1) runs -- the single Sinking2 in fast mode with reduced iteration budget. No stock Sink, no NVVMSinking2.
The rationale for multiple runs:
- Run 1 (stock Sink) handles straightforward cases using MemorySSA's precise alias information
- Run 2 (Sinking2 early) performs texture-aware sinking now that GVN/CGSCC have simplified the IR
- Run 3 (Sinking2 fast) cleans up opportunities created by peephole optimization
- Run 4 (NVVMSinking2) performs SM-specific late sinking after barrier and warp-level transforms
NVVMPassOptions Gating
| Offset | Type | Effect |
|---|---|---|
| opts[1040] | bool | Disable stock Sink/MemSSA |
| opts[2440] | bool | Disable NVVMSinking2 (sub_1CC60B0) |
| opts[3328] | bool | Enable SM-specific warp/reduction/sinking pass group (gates NVVMSinking2) |
Cost Model (New PM)
The New PM object (176 bytes) contains floating-point thresholds at offsets +88 and +144, both initialized to 1065353216 (IEEE 754 1.0f). These thresholds suggest the New PM implementation has a more sophisticated cost model than the Legacy PM version:
- Profitability threshold (`+88`): minimum benefit score for a sink to be accepted. A value of 1.0 means the benefit must at least equal the cost.
- Cost threshold (`+144`): maximum acceptable cost for the sinking motion itself. A value of 1.0 means the movement cost must not exceed the baseline.
The Legacy PM version uses a simpler boolean profitability model (is the target a texture block? yes/no).
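The raw constant 1065353216 decodes to 1.0f under IEEE 754 single precision, which can be verified directly:

```python
import struct

# Decode the 32-bit pattern found at offsets +88 and +144 of the pass object.
bits = 1065353216
value = struct.unpack("<f", struct.pack("<I", bits))[0]
assert value == 1.0   # IEEE 754 single-precision 1.0f
```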
Configuration Knobs
Sinking2-Specific (ctor_275 at 0x4F7750)
| Knob | Type | Default | Storage | Description |
|---|---|---|---|---|
| sink-into-texture | int | 3 | qword_4FBF2C0 | Texture sinking aggressiveness (0=off, 1=cross-block, 2=+intra, 3=+outside-only) |
| sink-limit | int | 20 | qword_4FBF1E0 | Max instructions to sink per invocation (complexity limiter) |
| dump-sink2 | bool | false | qword_4FBF100 | Dump debug information during sinking |
Related Sinking Knobs (other passes, NOT Sinking2)
| Knob | Type | Default | Owner | Description |
|---|---|---|---|---|
| sink-check-sched | bool | true | stock Sink | Check scheduling effects of sinking |
| sink-single-only | bool | true | stock Sink | Only sink single-use instructions |
| rp-aware-sink | bool | false | stock Sink | Consider register pressure (controls sink<rp-aware> variant) |
| max-uses-for-sinking | int | (default) | stock Sink | Don't sink insts with too many uses |
| sink-ld-param | bool | (default) | NVPTX backend | Sink one-use ld.param to use point |
| hoist-load-param | bool | (default) | NVPTX backend | Hoist all ld.param to entry block (counterpart to sink-ld-param) |
| enable-andcmp-sinking | bool | (default) | CodeGenPrepare | Sink and/cmp into branches |
| aggressive-no-sink | bool | (default) | (unknown) | Sink all generated instructions |
| instcombine-code-sinking | bool | (default) | InstCombine | Enable code sinking within instcombine |
| nvptx-enable-machine-sink | bool | (default) | NVPTX backend | Enable MIR-level MachineSink |
| SinkRematEnable | bool | (default) | ptxas | Enable sink+rematerialization in ptxas |
Analysis Dependencies
| Legacy PM | New PM | Purpose |
|---|---|---|
| DominatorTreeWrapperPass (unk_4F9E06C) | DominatorTreeAnalysis (sub_CF6DB0) | Dominator tree for sink legality and ordering |
| LoopInfoWrapperPass (unk_4F96DB4) | LoopAnalysis (sub_B1A2E0) | Avoid sinking out of loops; skip loop headers |
Does not require: SCEV, MemorySSA, PostDominatorTree, BranchProbabilityInfo.
This is a key difference from stock LLVM SinkingPass, which requires MemorySSAAnalysis. Sinking2 uses its own alias analysis queries through helpers sub_1CC8920 and sub_1CC8CA0, routed through the traditional AA interface at sub_13575E0. This avoids the overhead of building/maintaining MemorySSA across fixpoint iterations.
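The legality question that sub_1CC8920/sub_1CC8CA0 answer can be sketched abstractly as follows. This is our abstraction, not the recovered code: the `Inst` record and the integer alias-set stand-in replace the real AA queries routed through sub_13575E0.

```cpp
#include <vector>

// Sketch: a load may be sunk past the instructions between its current
// position and its use only if none of them is a store that may alias it,
// a volatile access, or a call with unknown side effects (the recovered
// helpers additionally handle store-load forwarding and store ordering).
struct Inst {
    bool isStore = false;
    bool isVolatile = false;
    bool isCall = false;
    int aliasSet = -1;  // same aliasSet => may alias (toy stand-in for AA)
};

bool canSinkLoadPast(const std::vector<Inst> &path, int loadAliasSet) {
    for (const Inst &I : path) {
        if (I.isVolatile || I.isCall) return false;   // ordering hazards
        if (I.isStore && I.aliasSet == loadAliasSet)  // may-alias store
            return false;
    }
    return true;
}
```

Because this check is re-run on demand inside the fixpoint loop, no MemorySSA needs to be kept up to date across iterations, which is exactly the overhead the custom AA path avoids.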
Pass Object Layout
Legacy PM (160 bytes):
| Offset | Type | Content |
|---|---|---|
| +0 | ptr | Vtable pointer (off_49F8BC0) |
| +8 | ptr | Pass link (next pass in chain) |
| +16 | ptr | Pass ID pointer (&unk_4FBF0F4) |
| +24 | int32 | Mode (default=3, from sink-into-texture) |
| +28 | int32 | Sink limit (default=20, from sink-limit) |
| +32--48 | ptr[3] | Worklist data (head, tail, size) |
| +56 | ptr | DominatorTree* (set during runOnFunction) |
| +64 | ptr | List head 1 (self-referential sentinel) |
| +72--80 | ptr[2] | List next/prev 1 |
| +96 | int64 | Counter (sink count for current iteration) |
| +104 | ptr | LoopInfo* (set during runOnFunction) |
| +112 | ptr | List head 2 (self-referential sentinel) |
| +120--128 | ptr[2] | List next/prev 2 |
| +144 | int64 | Data field |
| +152 | byte | Changed flag (for fixpoint termination) |
New PM (176 bytes): two embedded worklists and float thresholds at offsets +88 and +144 (value 1065353216 = 1.0f IEEE 754).
Differences from Upstream LLVM
| Aspect | Upstream LLVM sink | NVIDIA sinking2 |
|---|---|---|
| Alias analysis backend | MemorySSA | Custom AA layer (sub_13575E0) |
| Iteration strategy | Single pass | Fixpoint iteration |
| Texture awareness | None | 3-level configurable |
| Address space awareness | Generic | NVPTX-specific (AS 1,3,4,5) |
| Complexity limiter | None | sink-limit knob (default=20) |
| Intra-block reordering | No | Level >= 2 |
| Outside-only pattern | No | Level == 3 |
| Debug dump | Standard LLVM debug | dump-sink2 knob |
| Cost model | Boolean (profitable or not) | Float thresholds in New PM |
| Pipeline occurrences | 1 | 3--4 (multi-run strategy) |
| Fast-compile variant | Same pass | Dedicated fast=1 mode |
Diagnostic Strings
| String | Context |
|---|---|
| "llvm::Sinking2Pass]" | RTTI name at sub_2315E20 |
| "sink2" | Pipeline parser ID |
| "Code sinking" | Display name (shared with stock LLVM sink) |
| "sinking2" | New PM pipeline string match |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_1CC7010 | -- | Legacy PM pass registration |
| -- | sub_1CC7100 | -- | Legacy PM factory |
| -- | sub_1CC71E0 | -- | Legacy PM alternate factory |
| -- | sub_1CC7510 | 16KB | processInstruction: sink candidate evaluation, use-chain walk, LCD computation |
| -- | sub_1CC8170 | 13KB | Dominance ordering: DFS numbering for block comparison |
| -- | sub_1CC8920 | 4KB | Alias checking helper: validates no conflicting memory accesses on path |
| -- | sub_1CC8CA0 | 6KB | Memory dependency helper: store-load forwarding, store ordering, volatile |
| -- | sub_1CC9110 | 22KB | Main worklist driver: fixpoint iteration over dominator tree |
| -- | sub_1CCA270 | -- | Legacy PM runOnFunction entry |
| -- | sub_2D1B410 | -- | New PM pass registration |
| -- | sub_2D1BC50 | -- | New PM factory |
| -- | sub_2D1C160 | 19KB | New PM run() entry |
| -- | sub_2D1CFB0 | 13KB | New PM core logic |
| -- | sub_2D1D770 | 7KB | New PM helper |
| -- | sub_2D1DCF0 | 7KB | New PM helper |
| -- | sub_2315E20 | -- | RTTI name printer |
| -- | 0x4F7750 | -- | Knob constructor (ctor_275) |
Related pipeline factories:
| Address | Role |
|---|---|
| sub_18B1DE0 | Sinking2 early-pipeline factory |
| sub_18B3080 | Sinking2 fast-mode factory (accepts fast flag parameter) |
| sub_1CC60B0 | NVVMSinking2 late-pipeline factory |
| sub_1A634D0 | Stock LLVM Sink legacy PM registration |
| sub_29776B0 | Stock LLVM Sink New PM registration |
| sub_1B51110 | Stock Sink core (51KB, creates .sink.split / .sink blocks) |
| sub_1869C50 | Stock Sink pipeline factory (called with params 1,0,1) |
Total code size: ~80KB (Legacy PM) + ~65KB (New PM) = ~145KB
GPU-Specific Motivation
Register pressure directly determines occupancy -- each additional live register per thread reduces the number of warps available for latency hiding, with discrete cliff boundaries where a single register can drop an entire warp group.
Sinking instructions closer to their uses shortens live ranges and reduces the peak number of simultaneously live registers. This is especially valuable for texture load sequences, which typically involve address computation (GEP chains, index arithmetic) that produces values consumed only at the texture fetch site. Without sinking, these intermediate values occupy registers across potentially many instructions, bloating register pressure unnecessarily.
The three-level sink-into-texture design reflects a graduated approach to this optimization: level 1 handles the common case (cross-block sinking), level 2 adds intra-block reordering for tighter packing, and level 3 (the default) handles the edge case where an instruction's only uses are in blocks other than where it is defined, enabling more aggressive motion.
The multi-run pattern (early Sinking2, post-peephole fast Sinking2, late NVVMSinking2) ensures that sinking opportunities created by other optimization passes are captured throughout the pipeline, rather than relying on a single sinking point that may miss opportunities not yet exposed.
Cross-References
- Dead Synchronization Elimination -- runs earlier, removes barriers that Sinking2 would otherwise treat as memory fences
- LICM -- counterpart: hoists loop-invariant code into preheaders; Sinking2 sinks address computation out of preheaders
- NVVMPeephole -- runs before late Sinking2, may create new sinking opportunities
- Rematerialization -- runs after all sinking; rematerialization + sinking together minimize register pressure (ptxas SinkRematEnable knob)
- MemorySpaceOpt -- changes address spaces, which affects sinking profitability
- NVVMPassOptions -- opts[1040] disables stock Sink; opts[2440] disables NVVMSinking2
- Register Allocation -- ultimate consumer of the register pressure reduction that sinking provides
- Optimization Levels -- Ofcmax runs only fast-mode Sinking2; O2/O3 run full multi-run pattern
Loop Index Split
loop-index-split is a loop transformation pass that splits or peels loops when a condition inside the loop body depends on the loop induction variable. The pass was originally part of upstream LLVM 2.x (circa 2008--2009) but was removed around LLVM 3.0 due to correctness concerns and limited applicability. NVIDIA revived and heavily modified it for CUDA workloads, where loops with index-dependent conditionals are extremely common -- boundary handling in stencil computations, tile edge processing, and index-based predication are pervasive GPU kernel patterns. The NVIDIA version is substantially more sophisticated than the original, implementing three distinct transformation modes with full SCEV-based analysis.
By eliminating index-dependent branches from loop bodies, the pass reduces warp divergence on NVIDIA GPUs. When threads in a warp take different paths through a branch, the GPU must serialize both paths (predicated execution or divergent branch), wasting throughput. Splitting the loop so that each resulting loop has a uniform body eliminates this divergence entirely within the split regions, restoring full SIMT efficiency.
Pipeline Position
| Field | Value |
|---|---|
| Pass name (pipeline) | loop-index-split |
| Display name | Index Split Loops |
| Pass type | LoopPass (NVIDIA-custom, revived from LLVM 2.x) |
| Class | llvm::LoopIndexSplitPass |
| Legacy PM registration | sub_1C76080 |
| New PM registration | sub_2CBEC60 |
| Pass ID | dword_4FBD4A8 / unk_4FBD4AC |
| New PM vtable | off_4A25510 |
Transformation Modes
The pass implements three transformation strategies, attempted in priority order. When the first applicable transformation is found, it is applied and the pass moves on.
Mode A: All-But-One Iteration Peel (processAllButOneIterationLoop)
When: The loop body contains a condition that is true for all iterations except exactly one (typically i == K for a constant K).
What: The pass peels the single exceptional iteration out of the loop and removes the condition from the remaining iterations.
Before:
for (i = 0; i < N; i++) {
if (i == K) special();
else normal();
}
After:
for (i = 0; i < K; i++) normal();
special();
for (i = K+1; i < N; i++) normal();
This eliminates the branch from both resulting loops entirely. On a GPU, this means warps executing the pre-K or post-K loops never diverge on this condition.
Implementation: sub_2CC3FF0 (13KB, New PM) / part of sub_1C77080 (46KB, Legacy PM).
Mode B: Only-One-Iteration Collapse (processOnlyOneIterationLoop)
When: The condition is true for exactly one iteration, and the loop body does nothing useful on other iterations.
What: The pass replaces the entire loop with a guarded single execution of the body.
Before:
for (i = 0; i < N; i++) {
if (i == K) doWork();
}
After:
if (K >= 0 && K < N) doWork();
This transforms an O(N) loop into O(1) code -- a dramatic optimization when the original loop's only purpose was to find and execute a single iteration.
Implementation: sub_2CC4A70 (19KB, New PM) / part of sub_1C77080 (46KB, Legacy PM).
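The Mode B rewrite is easy to check for semantic equivalence. The function names below are ours; `loopForm` counts how many times the guarded body would execute, which the guarded form computes directly:

```cpp
// Illustration of the Mode-B collapse: both forms agree for any K and N.
int loopForm(int K, int N) {
    int calls = 0;
    for (int i = 0; i < N; i++)
        if (i == K) calls++;              // stand-in for doWork()
    return calls;
}

int guardedForm(int K, int N) {
    return (K >= 0 && K < N) ? 1 : 0;     // O(1) replacement
}
```

The guard `K >= 0 && K < N` is what makes the replacement safe when K lies outside the iteration range, which is why the pass must prove or emit exactly this range check.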
Mode C: Range Split (processSplitRangeLoop)
When: The condition splits the iteration space into two contiguous ranges (e.g., i < M vs i >= M).
What: The pass splits the loop at the boundary point so each resulting loop has a simpler, branch-free body.
Before:
for (i = 0; i < N; i++) {
if (i < M) a(); else b();
}
After:
for (i = 0; i < min(M, N); i++) a();
for (i = M; i < N; i++) b();
This is the most common transformation for GPU boundary handling code, where the first/last few iterations of a tile perform padding or clamping.
Implementation: sub_2CC5900 (68KB, New PM) / sub_1C7B2C0 (84KB, Legacy PM). The loop cloning and rewiring logic is in sub_2CC1B10 (42KB), with split point computation in sub_2CC0040 and sub_2CC0CC0 (7KB each).
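The Mode C rewrite can likewise be checked for equivalence. This sketch is ours: the clamping via `max(M, 0)` models the bounds adjustment the pass must establish (the pseudocode above elides it for the common case `0 <= M`):

```cpp
#include <algorithm>
#include <utility>

// Illustration of the Mode-C range split: the fused loop and the two split
// loops perform the same number of a() and b() executions for any M, N.
std::pair<int,int> fused(int M, int N) {
    int a = 0, b = 0;
    for (int i = 0; i < N; i++) (i < M ? a : b)++;
    return {a, b};
}

std::pair<int,int> split(int M, int N) {
    int a = 0, b = 0;
    int lo = std::max(M, 0);                      // clamp split point at 0
    for (int i = 0; i < std::min(lo, N); i++) a++; // first range: i < M
    for (int i = lo; i < N; i++) b++;              // second range: i >= M
    return {a, b};
}
```
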
Algorithm Detail
The main driver (sub_2CC5900, 68KB) proceeds as follows:
- Verify loop structure: The loop must have exactly one exit, a preheader, a latch block, and an identifiable header.
- Initialize SCEV analysis: Obtains the ScalarEvolution result for the loop to identify the induction variable and compute trip counts.
- Find the induction variable and exit condition from the loop's back-edge.
- Scan the loop body for ICmp or Select instructions that compare the IV against a loop-invariant value.
- Validate the comparison uses constant integer bounds (checked via APInt extraction at multiple points).
- Safety checks (lines 760--830 of sub_2CC5900) -- iterate all loop BBs, checking each instruction:
  - Opcode 85 (Call): reject if callee may have side effects
  - Opcodes 34--85: checked against bitmask 0x8000000000041 for safe operations
  - Store instructions: checked for non-interference with the split
  - No volatile loads permitted
  - No memory operations that prevent reordering
- Determine which transformation applies:
  - Try processAllButOneIterationLoop first
  - Try processOnlyOneIterationLoop second
  - Fall back to processSplitRangeLoop
- For range splits: Compute the split point, clone the loop (including all basic blocks, PHI nodes, and branch conditions), adjust iteration bounds, and rewire predecessors/successors.
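The bitmask screen decodes mechanically: 0x8000000000041 has bits 0, 6, and 51 set, corresponding to opcodes 34, 40, and 85 in the 34--85 range. A sketch of the test as we read it from the decompilation (the opcode numbering is LLVM-internal, and Call -- opcode 85 -- additionally undergoes the callee side-effect check noted above):

```cpp
#include <cstdint>

// Opcodes 34..85 are screened by testing bit (opcode - 34) of the mask.
constexpr uint64_t kSafeOpcodeMask = 0x8000000000041ULL; // bits 0, 6, 51

bool isSafeOpcode(int opcode) {
    if (opcode < 34 || opcode > 85) return false;  // outside screened range
    return (kSafeOpcodeMask >> (opcode - 34)) & 1; // bit set => safe
}
```
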
Comparison Classifiers
Four small functions classify how the ICmp operands relate to the induction variable:
| Function | Purpose |
|---|---|
sub_2CBED80 | Determine which operand is the IV |
sub_2CBED00 | Determine which operand is the bound |
sub_2CBEE00 | Classify comparison direction (ascending/descending) |
sub_2CBEE80 | Extended classification for range splits |
Legality Validation
| Function | Size | Purpose |
|---|---|---|
sub_2CBFC80 | — | Validate split is legal (check exit conditions) |
sub_2CBF770 | — | Validate loop structure for splitting |
sub_2CBF180 | — | Create new loop preheader for split result |
Diagnostic Strings
Diagnostic strings recovered from p2b.4-5-sinking2-loopindexsplit.txt. The pass emits optimization remarks via the standard LLVM OptimizationRemark system.
| String | Source | Category | Trigger |
|---|---|---|---|
"LoopIndexSplit: performed processAllButOneIterationLoop" | sub_2CC3FF0 (New PM) / sub_1C77080 (Legacy PM) | Remark | Mode A transformation applied: single exceptional iteration peeled |
"LoopIndexSplit: performed processOnlyOneIterationLoop" | sub_2CC4A70 (New PM) / sub_1C77080 (Legacy PM) | Remark | Mode B transformation applied: entire loop replaced with guarded single body |
"LoopIndexSplit: performed processSplitRangeLoop" | sub_2CC5900 (New PM) / sub_1C7B2C0 (Legacy PM) | Remark | Mode C transformation applied: loop split at range boundary |
"Index Split Loops" | sub_1C76080 / sub_2CBEC60 | Registration | Display name used in both Legacy PM and New PM pass registration |
"loop-index-split" | Pipeline parser (sub_2377300 line 3768, sub_2368220 line 5081) | Registration | Pipeline ID string (16 characters) |
"LoopSplitIndex" / "LoopIndexSplit" | Remark infrastructure | Remark tag | Optimization remark tag names (both variants observed in binary) |
Configuration Knobs
No dedicated cl::opt knobs were found for LoopIndexSplit. The pass is enabled or disabled at the pipeline level via the pass name loop-index-split in the pipeline string or by including/excluding it during pipeline assembly. It can also be controlled by the global pass-control and disable-passno mechanisms.
Analysis Dependencies
| Legacy PM | New PM | Purpose |
|---|---|---|
| DominatorTreeWrapperPass (sub_15CD350) | DominatorTreeAnalysis (sub_D4AA90) | Dominance checks for loop cloning |
| LoopInfoWrapperPass (sub_13FBE20) | LoopAnalysis (sub_B1A2E0) | Loop structure and nesting |
| ScalarEvolutionWrapperPass (sub_1AE1AE0) | ScalarEvolutionAnalysis (sub_11CDF60) | IV identification, trip count, range proofs |
| LoopAccessAnalysis (sub_1AF93A0) | LoopAccessAnalysis (sub_F67EE0) | Memory dependence in loops |
SCEV is the critical dependency: it provides induction variable identification, trip count computation, and the mathematical proofs needed to establish that split points are correct and that bounds do not overflow.
Pass Object Layout
Legacy PM: 80-byte pass descriptor.
New PM: 176-byte pass object with embedded worklists and float thresholds. Key fields during execution:
| Offset (QWORDs) | Content |
|---|---|
| 0 | Vtable / loop pointer |
| 1--3 | Sub-loop tracking |
| 4 | Sinkable instruction count |
| 5 | Exit condition block |
| 6 | Split condition (ICmp/FCmp instruction) |
| 7 | Loop bound (lower) |
| 8 | Loop bound (upper) |
| 9 | Split instruction |
| 10 | Instruction counter / worklist |
| 11--13 | DenseSet for tracking visited blocks |
| 14 | Iteration counter |
| 18--24 | Computed values (preheader, header, latch, exitBB, etc.) |
| 25 | SCEV analysis result pointer |
| 26 | New loop blocks array (for split range) |
Function Map
New PM Implementation
| Function | Address | Size | Role |
|---|---|---|---|
| -- | 0x2CBEC60 | — | New PM pass registration |
| -- | 0x2CBFF20 | — | New PM factory |
| -- | 0x2CC3FF0 | 13KB | processAllButOneIterationLoop (Mode A) |
| -- | 0x2CC4A70 | 19KB | processOnlyOneIterationLoop (Mode B) |
| -- | 0x2CC5900 | 68KB | Main driver + processSplitRangeLoop (Mode C) |
| -- | 0x2CC1B10 | 42KB | Loop cloning and CFG rewiring |
| -- | 0x2CC0040 | 7KB | Split boundary computation |
| -- | 0x2CC0CC0 | 7KB | Alternate split boundary computation |
| -- | 0x2CC9AA0 | 18KB | Helper |
| -- | 0x2CCB3B0 | 25KB | Helper |
| -- | 0x2CCCE20 | 13KB | Helper |
| -- | 0x2CCDD70 | 15KB | Helper |
| -- | 0x2CCED30 | 8KB | Helper |
| -- | 0x2CCF450 | 57KB | Large helper / alternate path |
| -- | 0x2CBED80 | — | Comparison classifier (IV operand) |
| -- | 0x2CBED00 | — | Comparison classifier (bound operand) |
| -- | 0x2CBEE00 | — | Comparison direction classifier |
| -- | 0x2CBEE80 | — | Extended comparison classifier |
| -- | 0x2CBFC80 | — | Split legality validation |
| -- | 0x2CBF770 | — | Loop structure validation |
| -- | 0x2CBF180 | — | Create new preheader |
Legacy PM Implementation
| Function | Address | Size | Role |
|---|---|---|---|
| -- | 0x1C76080 | — | Legacy PM pass registration |
| -- | 0x1C76180 | — | Legacy PM factory |
| -- | 0x1C76260 | — | Alternate factory |
| -- | 0x1C76340 | 7KB | Hash table management for visited set |
| -- | 0x1C768C0 | 4KB | Helper |
| -- | 0x1C76B50 | 4KB | Block cloning helper |
| -- | 0x1C76EB0 | 2.5KB | Recursive loop tree walker |
| -- | 0x1C77080 | 46KB | processAllButOneIterationLoop + processOnlyOneIterationLoop |
| -- | 0x1C797A0 | 15KB | Split legality checking |
| -- | 0x1C7A300 | 21KB | Loop body cloning |
| -- | 0x1C7B2C0 | 84KB | processSplitRangeLoop + main driver |
Total code size: ~180KB (Legacy PM) + ~260KB (New PM) = ~440KB. This is one of the largest individual passes in cicc.
GPU-Specific Motivation
Index-dependent conditionals inside loops are ubiquitous in GPU kernels:
- Boundary handling: Threads at tile edges must check whether their index falls within the valid data range, leading to if (threadIdx.x + blockIdx.x * blockDim.x < N) patterns inside processing loops.
- Stencil codes: Halo region processing requires different behavior for the first and last few iterations of a tile.
- Reduction patterns: The final iteration of a reduction loop often has special aggregation logic.
- Predicated execution: CUDA warp-level programming frequently uses index-based predicates to assign work to specific lanes.
Each of these patterns introduces a branch that causes warp divergence: threads in the same warp take different paths, forcing the GPU to serialize both sides. By splitting the loop at the index boundary, the pass ensures that within each resulting loop, all threads in a warp execute the same path. This eliminates divergence entirely within the split regions, recovering full SIMT throughput.
The pass's large code size (~440KB) reflects the complexity of correct loop cloning on GPU IR, where PHI nodes, memory dependencies, and SCEV invariants must all be preserved across the transformation.
Branch Distribution (Dead Synchronization Elimination)
Despite its name, the branch-dist pass does not distribute or restructure branches. It is a GPU-specific dead synchronization elimination pass that removes __syncthreads() barriers and fence intrinsics when no actual memory hazard exists across the barrier boundary. In CUDA kernels, programmers often insert barriers conservatively to guarantee correctness, but many of these barriers protect code regions that have no conflicting read/write patterns on shared or global memory. Removing them eliminates warp serialization points and reduces the latency cost of unnecessary thread coordination.
The pass works by classifying every instruction in the function as a shared/global memory read, a write, or neither. It then propagates this information through the control flow graph using a standard dataflow fixed-point iteration. For each synchronization instruction, it examines the memory access patterns above and below the barrier; if no read-after-write, write-after-read, or write-after-write hazard exists, the barrier is dead and is deleted. Because removing one barrier may expose others as redundant, the entire analysis restarts after each deletion until no more dead barriers remain.
Pipeline Position
| Field | Value |
|---|---|
| Pass name | branch-dist |
| Pass type | FunctionPass (NVIDIA-custom, not in upstream LLVM) |
| Registration | New PM #377, line 2217 in sub_2342890 |
| Runtime positions | Tier 1/2/3 #78, #82 (NVVMBranchDist via sub_1CB73C0, gated by !opts[2080] && !opts[2120]); see Pipeline |
| Core function | sub_1C47810 (2357 lines) |
| Pass wrapper | sub_1C49D10 (179 lines) |
| Knob constructor | ctor_525_0 at 0x563730 (493 lines) |
| Global enable flag | byte_4FBB6C0 (initialized to 0 in ctor_261) |
The pass runs during the NVIDIA IR optimization pipeline. The global enable flag at byte_4FBB6C0 is set by the pipeline setup when appropriate for the current optimization level.
IR Before/After Example
The pass removes __syncthreads() barriers that protect no actual shared/global memory hazard.
Before (conservative barrier placement):
define void @kernel(ptr addrspace(3) %smem) {
entry:
%x = add i32 %tid, 1 ; pure register computation
%y = mul i32 %x, 42 ; pure register computation
call void @llvm.nvvm.barrier0() ; __syncthreads() -- no shared/global R/W above
%z = add i32 %y, %x ; pure register computation
ret void
}
After (dead barrier removed):
define void @kernel(ptr addrspace(3) %smem) {
entry:
%x = add i32 %tid, 1
%y = mul i32 %x, 42
; barrier removed: no shared/global reads or writes above or below
%z = add i32 %y, %x
ret void
}
When the dataflow analysis determines that neither side of the barrier accesses shared or global memory, the barrier is dead and removed. The pass restarts after each removal since deleting one barrier may expose another as redundant.
Algorithm
Phase 1: Instruction Classification (sub_1C46330)
The classifier (sub_1C45690, 117 lines) examines each instruction's opcode byte at offset +16 and determines whether it reads or writes shared/global memory:
| Opcode | Hex | Meaning | Action |
|---|---|---|---|
| 0x36 | '6' | Load | Check address space; mark as read if shared/global |
| 0x37 | '7' | Store | Check address space; mark as write |
| 0x3A | ':' | Memory op | Check address space |
| 0x3B | ';' | Memory op | Check address space |
| 0x4E | 'N' (78) | Call | Complex analysis: filter sync intrinsics, check callee attributes |
The classifier is invoked twice per basic block:
- Forward scan (a3=1): iterates from the last instruction backward to the first sync instruction. Everything after the sync is classified as "above" the barrier.
- Backward scan (a3=0): iterates from the first instruction forward to the first sync instruction. Everything before the sync is classified as "below" the barrier.
This produces four boolean flags per block, stored in red-black tree maps: reads_above, writes_above, reads_below, writes_below.
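The opcode dispatch can be sketched as a simple classifier. This is our reconstruction: the `RW` record is invented, and treating the 0x3A/0x3B memory ops as both read and write is an assumption (the table above only says their address space is checked); calls are handled by the separate complex-analysis path.

```cpp
// Map an instruction's opcode byte (offset +16) to its read/write effect
// on shared/global memory; local/private accesses never count.
struct RW { bool read = false, write = false; };

RW classify(unsigned char opcode, bool sharedOrGlobal) {
    RW rw;
    if (!sharedOrGlobal) return rw;          // per-thread memory: no effect
    switch (opcode) {
    case 0x36: rw.read = true; break;        // Load
    case 0x37: rw.write = true; break;       // Store
    case 0x3A:                               // other memory ops: assumed
    case 0x3B: rw.read = rw.write = true; break; // conservative (both)
    default: break;                          // calls handled separately
    }
    return rw;
}
```
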
Phase 2: CFG Propagation (sub_1C46620)
A classic dataflow fixed-point iteration propagates memory access information through successor edges. For each basic block, the read/write flags from its successors' "below" maps are OR-combined into the current block's "above" maps. The iteration repeats until no flags change (convergence). This ensures that a barrier's necessity accounts for memory accesses reachable through any control flow path, not just the local block.
The branch-dist-norm knob modifies the dataflow meet operator: the default (0) uses OR-propagation (conservative), while a non-zero value likely switches to AND-normalization (more aggressive, requiring all paths to access memory before considering a sync necessary).
Phase 3: Dead Sync Identification and Removal
After propagation, the main function (sub_1C47810) iterates over all blocks and instructions. For each synchronization intrinsic, it looks up the four per-instruction flags:
ra = inst_read_above[I] wa = inst_write_above[I]
rb = inst_read_below[I] wb = inst_write_below[I]
A sync is dead (removable) when any of these conditions holds:
| Condition | Meaning |
|---|---|
| !ra && !wa | Nothing above the barrier accesses shared/global memory |
| !rb && !wb | Nothing below the barrier accesses shared/global memory |
| !ra && !wb | No read-after-write or write-after-write hazard |
| !wa && !rb | No write-after-read or write-after-write hazard |
When a sync is removed, the pass calls sub_15F20C0 to delete it from the IR, then restarts the entire algorithm (goto LABEL_2). This restart is necessary because removing one barrier may cause another to become dead.
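The removability test transcribes directly from the four conditions in the table above (only the function name is ours):

```cpp
// A sync is dead when any one of the four hazard-absence conditions holds.
bool syncIsDead(bool ra, bool wa, bool rb, bool wb) {
    return (!ra && !wa)    // nothing above touches shared/global memory
        || (!rb && !wb)    // nothing below touches shared/global memory
        || (!ra && !wb)    // no read-after-write / write-after-write hazard
        || (!wa && !rb);   // no write-after-read / write-after-write hazard
}
```
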
Special Cases
Barrier variants that carry data -- __syncthreads_count, __syncthreads_and, __syncthreads_or (intrinsic IDs 3734--3736) -- are explicitly excluded from removal. Their return values encode lane participation information, so they cannot be elided even when no memory hazard exists.
Address Space Filtering
The pass only considers memory accesses to shared and global address spaces as relevant for synchronization. The address space check in sub_1C45690:
- Address space IDs <= 0x1FF (511) or in the 0x300 range: considered local/private -- do not require synchronization.
- Address space IDs > 511 and not in the 0x3xx range: considered shared/global -- these are the accesses that justify keeping a barrier.
This distinction is critical: local memory is per-thread and never visible to other threads in the warp, so barriers protecting only local accesses are always dead.
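The screen reduces to a small predicate. This is a sketch under our reading of the decompilation: we interpret "the 0x3xx range" as 0x300--0x3FF, which is an assumption about the exact mask used.

```cpp
// True iff an access in this address space can justify keeping a barrier.
bool requiresSync(unsigned addrSpace) {
    if (addrSpace <= 0x1FF) return false;             // local/private
    if ((addrSpace & ~0xFFu) == 0x300) return false;  // 0x300..0x3FF: local
    return true;                                      // shared/global
}
```
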
Intrinsic Classification
Two predicates classify synchronization-related intrinsics:
sub_1C301F0 (is-sync-intrinsic): Returns true for intrinsic IDs representing barrier operations:
| ID | Likely Mapping |
|---|---|
| 34 | llvm.nvvm.barrier0 (basic __syncthreads) |
| 3718--3720 | barrier.sync / bar.warp.sync variants |
| 3731--3736 | __syncthreads_count/and/or, bar.arrive |
sub_1C30240 (is-fence-intrinsic): Returns true for IDs 4046 and 4242, which are memory fence/membar intrinsics. These are excluded from the sync test -- they impose memory ordering but are not full barriers that can be elided by this pass.
Configuration Knobs
All registered in ctor_525_0 at 0x563730. All are cl::opt<> with hidden visibility.
| Knob | Type | Default | Description |
|---|---|---|---|
| dump-branch-dist | bool | false | Emit diagnostic output on each removed sync |
| ignore-call-safety | bool | true | Treat function calls as non-memory-accessing (aggressive) |
| ignore-variance-cond | int | 0 | Ignore warp divergence on branch conditions |
| ignore-address-space-check | int | 0 | Treat all memory accesses as requiring sync (conservative) |
| ignore-phi-overhead | int | 0 | Ignore PHI node overhead from sync removal in cost model |
| disable-complex-branch-dist | int | 0 | Disable inter-block CFG propagation (Phase 2) |
| no-branch-dist | string | (empty) | Comma-separated list of function names to skip |
| branch-dist-func-limit | int | -1 | Max functions to process (-1 = unlimited) |
| branch-dist-block-limit | int | -1 | Max blocks per function (-1 = unlimited) |
| branch-dist-norm | int | 0 | Dataflow meet operator mode (0 = OR, non-zero = AND) |
The default for ignore-call-safety is notably true (aggressive): device function calls are assumed not to access shared/global memory unless proven otherwise. This is reasonable for typical CUDA kernels where helper functions operate on registers and local memory.
Diagnostic Strings
Diagnostic strings recovered from p2b.3-01-branchdist.txt. All runtime diagnostics are gated by the dump-branch-dist knob (default false).
| String | Source | Category | Trigger |
|---|---|---|---|
| "[filename:line] Removed dead synch: Read above: X, Write above: Y, Read below: Z, Write below: W in function NAME" | sub_1C47810 phase 3 | Debug | dump-branch-dist enabled and a barrier is removed; prints the four read/write flags and the function name |
| "Dump information from Branch Distribution" | ctor_525_0 at 0x563730 | Knob | dump-branch-dist knob description |
| "Ignore calls safety in branch Distribution" | ctor_525_0 | Knob | ignore-call-safety knob description |
| "Ignore variance condition in branch Distribution" | ctor_525_0 | Knob | ignore-variance-cond knob description |
| "Ignore address-space checks in branch Distribution" | ctor_525_0 | Knob | ignore-address-space-check knob description |
| "Ignore the overhead due to phis" | ctor_525_0 | Knob | ignore-phi-overhead knob description |
| "Disable more complex branch Distribution" | ctor_525_0 | Knob | disable-complex-branch-dist knob description |
| "Do not do Branch Distribution on some functions" | ctor_525_0 | Knob | no-branch-dist knob description (value format: "function1,function2,...") |
| "Control number of functions to apply" | ctor_525_0 | Knob | branch-dist-func-limit knob description |
| "Control number of blocks to apply" | ctor_525_0 | Knob | branch-dist-block-limit knob description |
| "Control normalization for branch dist" | ctor_525_0 | Knob | branch-dist-norm knob description |
Data Structures
The pass allocates a large state object (~696 bytes, 87 QWORDs) containing 13 red-black tree maps organized in three tiers:
| Maps | Keys | Values | Purpose |
|---|---|---|---|
| a1[3..14] (2 maps) | Block pointer | bool | Has-sync-above/below per block |
| a1[15..38] (4 maps) | Block pointer | bool | Propagated read/write above/below (Phase 2 output) |
| a1[39..62] (4 maps) | Block pointer | bool | Initial read/write above/below (Phase 1 output) |
| a1[63..86] (4 maps) | Instruction pointer | bool | Per-instruction read/write above/below (Phase 3) |
All maps are std::map-like red-black trees with 48-byte nodes (left/right/parent pointers + key + 1-byte boolean value at offset 40). Tree operations are implemented in sub_1C46280 (insert-or-find for block maps), sub_1C47760 (insert-or-find for instruction maps), sub_1C45B10 (erase), and sub_1C45C70/sub_1C45940 (recursive destructors).
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| -- | 0x1C47810 | 2357L | Core algorithm: classify + propagate + remove |
| -- | 0x1C49D10 | 179L | Pass wrapper: init state, call core, cleanup |
| -- | 0x1C46330 | 197L | Phase 1: forward/backward instruction scan |
| -- | 0x1C46620 | 1157L | Phase 2: CFG successor propagation (fixed-point) |
| -- | 0x1C45690 | 117L | Instruction classifier: determines R/W flags |
| -- | 0x1C458C0 | 28L | Helper: classify all instructions in a block |
| -- | 0x1C46280 | 38L | Map insert-or-find (block-level maps) |
| -- | 0x1C47760 | 37L | Map insert-or-find (instruction-level maps) |
| -- | 0x1C475C0 | 43L | Map lower_bound lookup |
| -- | 0x1C47660 | 50L | Map find with hint |
| -- | 0x1C45B10 | 113L | Map erase operation |
| -- | 0x1C45C70 | 133L | Tree destructor (recursive free) |
| -- | 0x1C45940 | 133L | Tree destructor (recursive free, alt type) |
| -- | 0x1C301F0 | 15L | Is-sync-intrinsic predicate |
| -- | 0x1C30240 | 13L | Is-fence-intrinsic predicate |
| -- | 0x563730 | 493L | CLI knob registration (ctor_525_0) |
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent dead barrier elimination pass using CFG dataflow.
1. Using address-level tracking instead of boolean per-category flags. The pass tracks four boolean flags per block (reads_above, writes_above, reads_below, writes_below) for shared/global memory, not specific addresses. A reimplementation that attempts to track precise addresses ("smem[0] is only written above, smem[1] is only read below") will appear to find more dead barriers but is fundamentally unsound for GPU execution. Different threads access different addresses through the same pointer expression (smem[tid] vs smem[tid-1]), making address-based alias analysis across threads impossible at compile time. The boolean-per-category approach is the correct conservative abstraction.
2. Not excluding __syncthreads_count/and/or (IDs 3734--3736) from removal. These barrier variants return a value that encodes lane participation information (__syncthreads_count returns the number of threads that passed a non-zero predicate). Even when no memory hazard exists across the barrier, the return value carries data that the program depends on. A reimplementation that removes these barriers based solely on memory analysis will break programs that use the return value for algorithmic purposes (e.g., warp-level voting patterns, early-exit counting).
3. Treating the ignore-call-safety default as conservative. The default for ignore-call-safety is true (aggressive): function calls are assumed not to access shared/global memory. This is correct for typical CUDA helper functions that operate on registers and local memory, but a reimplementation that uses false as the default will retain nearly all barriers in code that calls device functions, defeating the optimization. Conversely, a reimplementation that uses true but does not also check the callee's isSharedMemoryAccess attribute when available will miss cases where a called function does access shared memory through a pointer argument.
4. Not restarting the analysis after removing a barrier. The pass restarts from Phase 1 (goto LABEL_2) after each barrier deletion because removing one barrier merges the regions it separated, potentially exposing adjacent barriers as dead. A reimplementation that collects all dead barriers in one pass and removes them simultaneously will miss cascading redundancies. Worse, it may remove barriers in the wrong order: if barrier B2 is dead only because barrier B1 separates it from a hazard, removing both simultaneously removes B1's protection while the hazard still exists.
5. Conflating address space filtering with memory visibility. The pass considers only shared and global memory accesses (address spaces > 511 and not in the 0x3xx range) as relevant for barrier justification. Local/private memory (per-thread, invisible to other threads) is correctly excluded. A reimplementation that includes local memory accesses in the analysis will never remove any barrier in code that uses local arrays, since every function with local variables would show "read+write above and below." The address space filter is essential for the optimization to have any effect.
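The filter described in pitfall 5 can be written as a one-line predicate. This is a hedged reconstruction from the text above; the exact boundary checks in the binary may differ.

```python
def is_hazard_relevant(addrspace: int) -> bool:
    """Sketch of the branch-dist address-space filter: only accesses in
    spaces numbered above 511, excluding the 0x3xx band, can justify a
    barrier (shared/global); everything else is thread-private."""
    return addrspace > 511 and not (0x300 <= addrspace <= 0x3FF)

assert not is_hazard_relevant(0)      # generic/local: never justifies a barrier
assert not is_hazard_relevant(0x345)  # 0x3xx band: excluded by the filter
assert is_hazard_relevant(0x400)      # above 511 and outside the band: relevant
```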
GPU-Specific Motivation
On NVIDIA GPUs, __syncthreads() forces all threads in a thread block to reach the barrier before any can proceed. This is one of the most expensive control flow operations in CUDA -- it serializes warp execution and creates a pipeline stall. In practice, CUDA programmers insert barriers conservatively (every shared memory access pattern gets a barrier "just in case"), leading to significant over-synchronization. This pass recovers the performance lost to unnecessary barriers by proving, through static dataflow analysis, that specific barriers protect no actual memory hazard.
The ignore-variance-cond knob connects to warp divergence analysis: when a branch condition is provably uniform (all lanes take the same path), synchronization across that branch is trivially unnecessary regardless of memory access patterns. This is a common case in well-structured CUDA code where control flow depends on blockIdx or compile-time constants.
Dead Barrier Elimination
CICC contains three independent passes that eliminate redundant __syncthreads() barriers from CUDA kernels. This page documents the lightweight basic-dbe pass -- a single-pass, intra-block pattern matcher that removes trivially dead barriers without dataflow analysis. The two heavyweight engines are covered on their own pages: Dead Synchronization Elimination (sub_2C84BA0, 96KB, full bidirectional fixed-point dataflow) and Branch Distribution (sub_1C47810, 63KB, NVVM-IR-level fixed-point with restart). All three target the same goal -- eliminating barriers that provably do not order any memory hazard -- but at different cost/precision tradeoffs.
Key Facts: basic-dbe
| Property | Value |
|---|---|
| Pass name | basic-dbe |
| Class | llvm::BasicDeadBarrierEliminationPass |
| Scope | Function pass (LLVM IR level) |
| Registration | New PM #376, line 2212 in sub_2342890 (first NVIDIA function pass registered) |
| Runtime positions | Inserted via pipeline extension callbacks; not in the Tier 0/1/2/3 tables (see Pipeline) |
| Parameters | None (non-parameterized pass) |
| Knob constructor | ctor_261 (below 5KB, in 0x4F0000--0x51FFFF range) |
| Enable global | byte_4FBB6C0 (initialized to 0 in ctor_261, set to 1 by pipeline setup) |
| Binary size | Small (< 5KB compiled) |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
Why a Lightweight Pass Exists
The full dead synchronization elimination engine at sub_2C84BA0 is 96KB of code implementing bidirectional fixed-point dataflow with complete restart after each removal. That is expensive. For the common cases -- consecutive barriers with no intervening memory operations, barriers at function entry/exit with no shared memory traffic in the block, or barriers immediately followed by another barrier -- the heavyweight engine is overkill.
basic-dbe exists as a cheap pre-filter: it handles the trivially dead cases in a single linear scan per function, eliminating the low-hanging fruit before the full engine (if scheduled) performs its expensive inter-block analysis. By removing obvious dead barriers early, basic-dbe also reduces the iteration count of the heavyweight pass, since fewer barriers remain for it to analyze.
Algorithm
basic-dbe operates as a single-pass function pass with no dataflow propagation, no fixed-point iteration, and no restart-on-removal. It scans each basic block once and applies local pattern matching to identify barriers that are trivially dead.
Barrier Identification
The pass reuses the same barrier predicate logic as the full engine. An instruction is a synchronization barrier if all of the following hold:
- Opcode == 85 (internal call opcode for intrinsics)
- The callee pointer at offset -32 is non-null
- The callee's byte at offset 0 == 0 (intrinsic, not user-defined function)
- The `convergent` attribute flag (bit `0x20` at byte +33) is set
- `sub_CEA1A0(callee.field[36])` confirms the intrinsic ID falls within the known barrier ID range
This is the same check implemented by sub_2C83D20 in the full engine.
Elimination Patterns
basic-dbe identifies four categories of trivially dead barriers, all detectable without inter-block analysis:
Pattern 1: Consecutive Barriers
Two or more __syncthreads() calls with no intervening instructions (or only non-memory instructions between them). The second and subsequent barriers are redundant because the first already forces all threads to synchronize.
; Before basic-dbe:
call void @llvm.nvvm.barrier0() ; barrier A
call void @llvm.nvvm.barrier0() ; barrier B -- DEAD (consecutive)
; After basic-dbe:
call void @llvm.nvvm.barrier0() ; barrier A retained
Pattern 2: Barrier in Empty Block
A basic block whose only non-terminator instructions are barriers and non-memory operations (debug info, metadata). If no instruction in the block reads or writes shared/global memory, every barrier in the block is dead -- there is nothing to order.
; Before basic-dbe:
bb_empty:
call void @llvm.nvvm.barrier0() ; DEAD -- no memory ops in block
br label %bb_next
; After basic-dbe:
bb_empty:
br label %bb_next
Pattern 3: Barrier at Function Entry
A barrier at the start of a kernel (or device function) with no memory operations between function entry and the barrier. Since no thread has performed any shared memory access yet, the barrier orders nothing.
Pattern 4: Barrier Before Return
A barrier immediately before a return with no memory operations between the barrier and the function exit. The barrier would order accesses that have already been performed, but since no subsequent access follows, no hazard exists in the forward direction.
Pseudocode
function BasicDeadBarrierEliminationPass::run(F):
if not byte_4FBB6C0: // global enable flag
return PreservedAnalyses::all()
changed = false
for each BB in F:
barriers = []
has_memory_op = false
for each inst in BB:
if isSyncBarrier(inst):
if not has_memory_op:
// Pattern 2/3: barrier with no preceding memory op
// Also handles Pattern 1: consecutive barriers
// (first barrier is not a memory op, so second is dead)
mark inst for deletion
changed = true
else:
barriers.append(inst)
has_memory_op = false // reset for next segment
else if classifyMemoryAccess(inst) has read or write:
has_memory_op = true
// Pattern 4: check trailing barrier before terminator
if not barriers.empty() and not has_memory_op:
mark barriers.back() for deletion
changed = true
// Delete all marked instructions
for each marked inst:
inst.eraseFromParent()
if changed:
return PreservedAnalyses::none() // IR modified
else:
return PreservedAnalyses::all()
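The pseudocode above can be exercised as a small runnable sketch. This is a toy model, not the recovered code: `Inst.kind` stands in for the opcode classification, and the enable-flag check is omitted.

```python
from dataclasses import dataclass

@dataclass
class Inst:
    kind: str  # "barrier", "mem", or "other" -- toy stand-in for opcodes

def basic_dbe_block(block):
    """Single-block scan mirroring the pseudocode: a barrier with no
    memory op since the previous barrier (or block entry) is dead, and
    a kept barrier with no memory op before block exit is dead too."""
    dead, kept_barriers = set(), []
    has_memory_op = False
    for i, inst in enumerate(block):
        if inst.kind == "barrier":
            if not has_memory_op:
                dead.add(i)              # Patterns 1/2/3
            else:
                kept_barriers.append(i)
            has_memory_op = False        # a new segment starts after any barrier
        elif inst.kind == "mem":
            has_memory_op = True
    if kept_barriers and not has_memory_op:
        dead.add(kept_barriers[-1])      # Pattern 4: barrier before exit
    return [inst for i, inst in enumerate(block) if i not in dead]

# Consecutive barriers: only the first survives.
bb = [Inst("mem"), Inst("barrier"), Inst("barrier"), Inst("mem")]
out = basic_dbe_block(bb)
assert sum(i.kind == "barrier" for i in out) == 1
```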
The key design choice: basic-dbe treats each basic block as an isolated unit. It does not look at predecessor or successor blocks. This means it will miss cases where a barrier is dead because all reaching paths lack memory accesses -- those cases require the full inter-block dataflow of sub_2C84BA0 or sub_1C47810.
Memory Access Classification
Within the basic block scan, basic-dbe must determine which instructions constitute memory operations that could create cross-thread hazards. The classification mirrors the logic in sub_2C83AE0 (the full engine's classifier):
| Opcode | Value | Instruction | Classification |
|---|---|---|---|
| 61 | 0x3D | Store | Memory write |
| 62 | 0x3E | Load | Memory read |
| 65 | 0x41 | Atomic | Memory read + write |
| 66 | 0x42 | AtomicCmpXchg | Memory write |
| 85 | 0x55 | Call/Intrinsic | Read+Write if callee accesses shared/global memory |
Non-memory instructions (arithmetic, comparisons, PHI nodes, debug info, branches) do not set the has_memory_op flag.
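The classification can be captured as a small lookup. This is a sketch: the call case is reduced to a single callee-accesses-shared flag, whereas the binary consults `sub_B49E00` and callee attributes.

```python
# Opcode -> (reads, writes), per the table above.
CLASSIFY = {
    61: (False, True),   # Store
    62: (True,  False),  # Load
    65: (True,  True),   # Atomic
    66: (False, True),   # AtomicCmpXchg
}

def classify_memory_access(opcode, callee_touches_shared=False):
    if opcode == 85:  # Call/Intrinsic: depends on the callee
        return (callee_touches_shared, callee_touches_shared)
    return CLASSIFY.get(opcode, (False, False))  # non-memory: no flags set

assert classify_memory_access(62) == (True, False)       # Load reads
assert classify_memory_access(85, True) == (True, True)  # call touching shared mem
assert classify_memory_access(17) == (False, False)      # e.g. arithmetic
```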
The byte_4FBB6C0 Enable Flag
The global byte at byte_4FBB6C0 serves as a shared enable flag initialized to 0 in ctor_261. The pipeline setup code sets it to 1 when the optimization level and target configuration warrant running barrier elimination. This same flag gates branch-dist (sub_1C49D10 checks it before invoking sub_1C47810), confirming that ctor_261 initializes shared state for the barrier elimination subsystem as a whole, not just basic-dbe.
Relationship to Other Dead-Sync Passes
CICC's three barrier elimination passes form a layered strategy:
| Property | basic-dbe | branch-dist | Dead Sync Elimination |
|---|---|---|---|
| Entry point | llvm::BasicDeadBarrierEliminationPass | sub_1C47810 | sub_2C84BA0 |
| PM slot | 376 (New PM function pass) | 377 (New PM function pass) | None (module-level caller) |
| Scope | Intra-block only | Inter-block (CFG propagation) | Inter-block (full restart) |
| Dataflow | None (pattern match) | Fixed-point, 13 RB-tree maps | Fixed-point, 12 RB-tree maps |
| Restart on removal | No | Yes (goto LABEL_2) | Yes (goto LABEL_2) |
| IR level | LLVM IR (opcodes 61/62/65/66/85) | NVVM IR (opcodes 0x36/0x37/0x3A/0x3B/0x4E) | LLVM IR (opcodes 61/62/65/66/85) |
| Binary size | < 5KB | 63KB core + helpers | 96KB core + helpers |
| Knobs | byte_4FBB6C0 enable flag | 10 knobs (ctor_525) | None known (controlled by caller) |
| Complexity | O(n_instructions) | O(B * F * C) | O(B * F * C) |
| Typical runtime | Microseconds | Milliseconds | Milliseconds |
The intended execution order:
1. `basic-dbe` runs first in the function pass pipeline, eliminating trivially dead barriers in O(n) time.
2. `branch-dist` runs next (slot 377, immediately after basic-dbe at slot 376), performing full inter-block analysis on the reduced barrier set using NVVM IR opcodes.
3. Dead Sync Elimination (`sub_2C84BA0`) runs later from module-level callers (`sub_2C88020`, `sub_2C883F0`), performing the most aggressive analysis using LLVM IR opcodes with the element-size gate and special intrinsic ID handling.
Configuration
| Knob | Type | Default | Effect |
|---|---|---|---|
| byte_4FBB6C0 | bool (global) | 0 (disabled) | Master enable for basic-dbe and branch-dist |
No dedicated per-pass knobs (threshold, dump flags, or limits) have been identified for basic-dbe itself. The pass is controlled entirely by its enable flag. This is consistent with its role as a lightweight pre-filter -- there is nothing to tune.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_2342890 line 2212 | -- | New PM registration: maps "basic-dbe" to llvm::BasicDeadBarrierEliminationPass |
| -- | ctor_261 (0x4F range) | -- | Global constructor: initializes byte_4FBB6C0 to 0, registers basic-dbe knob string |
| -- | byte_4FBB6C0 | -- | Global enable flag (shared with branch-dist) |
| -- | sub_2C83D20 | -- | isSyncBarrier predicate (shared with full engine) |
| -- | sub_2C83AE0 | -- | classifyMemoryAccess (shared with full engine) |
| -- | sub_CEA1A0 | -- | Barrier intrinsic ID confirmation |
| -- | sub_B49E00 | -- | isSharedMemoryAccess -- CUDA address space check |
| -- | sub_B43D60 | -- | Instruction::eraseFromParent -- barrier deletion |
Cross-References
- Dead Synchronization Elimination -- the full 96KB bidirectional dataflow engine
- Branch Distribution -- the NVVM-IR-level dead-sync pass (63KB, 13 RB-tree maps)
- NVIDIA Custom Passes: Inventory -- registry entry
- LLVM Optimizer: Pipeline -- pipeline context showing `basic-dbe` at slot 376
- GPU Execution Model -- why `__syncthreads()` exists and when it matters
Dead Synchronization Elimination
The dead synchronization elimination engine at sub_2C84BA0 is the largest NVIDIA-custom pass in cicc at 96KB (~3,400 decompiled lines). It removes __syncthreads() barriers that provably do not order any memory hazard, reducing warp stall cycles in CUDA kernels without affecting correctness. The algorithm performs a bidirectional fixed-point dataflow analysis across the entire function's CFG, tracking four memory access categories per basic block through eight red-black tree maps. After convergence, it evaluates every barrier against the computed access sets and deletes those that protect no actual hazard. Each deletion triggers a full restart of the analysis, handling cascading redundancies at the cost of quadratic worst-case complexity.
This pass is distinct from the lightweight basic-dbe pass (slot 376, llvm::BasicDeadBarrierEliminationPass) and from the branch-dist pass. All three target dead barriers, but only this engine performs full inter-block dataflow with complete restart -- the other two handle simpler local or single-pass cases.
Key Facts
| Property | Value |
|---|---|
| Entry point | sub_2C84BA0 |
| Binary size | 96KB (~3,400 decompiled lines) |
| Pass type | Module-level NVIDIA custom (not registered in New PM) |
| Callers | sub_2C88020, sub_2C883F0, self-recursive |
| Barrier predicate | sub_2C83D20 |
| Access classifier | sub_2C83AE0 |
| Per-BB analysis | sub_2C84640 (bidirectional, parameterized by direction) |
| State object | 12 red-black tree maps at known offsets in a1 |
| Diagnostic | " Removed dead synch: " with per-category read/write counts |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
Five-Phase Algorithm
Phase 1: Barrier Identification (sub_2C83D20)
The helper sub_2C83D20 classifies whether a given instruction is a synchronization barrier. The check is a conjunction of five conditions:
function isSyncBarrier(inst) -> bool:
if inst.opcode != 85: // internal call opcode
return false
callee = inst.field[-32] // callee pointer at offset -32
if callee == null:
return false
if callee.byte[0] != 0: // byte 0 == 0 means intrinsic (not user-defined)
return false
if callee.field[24] != inst.field[80]: // scope match
return false
if !(callee.byte[33] & 0x20): // convergent attribute flag
return false
return CEA1A0(callee.field[36]) // confirm barrier intrinsic ID
The convergent attribute flag (bit 0x20 at byte+33) is the key discriminator. LLVM marks barrier intrinsics as convergent to prevent optimizations from moving them across control flow boundaries. The final sub_CEA1A0 call validates that the intrinsic ID falls within the known barrier ID range, distinguishing barriers from other convergent intrinsics (e.g., warp vote operations).
Phase 2: Memory Access Classification (sub_2C83AE0)
For every non-barrier instruction, sub_2C83AE0 determines whether it reads from or writes to memory that could create a hazard across a barrier. It outputs two boolean flags via pointer parameters a2 (read) and a3 (write).
| Opcode | Value | Instruction | Classification |
|---|---|---|---|
| 61 | 0x3D | Store | Write, if element size > 0x1FF bits |
| 62 | 0x3E | Load | Read, with same large-type gate |
| 65 | 0x41 | Atomic | Read + Write |
| 66 | 0x42 | AtomicCmpXchg | Write |
| 85 | 0x55 | Call/Intrinsic | Context-dependent (see below) |
For call instructions (opcode 85), the classifier applies recursive analysis:
- Check if the callee has intrinsic flag `0x20` set.
- For barrier-like intrinsics with opcode 25 and `field+96 == 0`: classify as Read only.
- For general calls: invoke `sub_B49E00` (isSharedMemoryAccess) to determine whether the callee accesses shared/global memory. If yes: Read + Write.
The element size gate (> 0x1FF bits, i.e., > 511 bits) filters out trivially small memory operations that target scalar types in registers rather than actual memory-backed storage. Loads and stores of types narrower than 512 bits are assumed to operate on register-promoted values and do not participate in cross-thread hazards.
Phase 3: Bidirectional Fixed-Point Dataflow
Complexity. Let B = number of basic blocks, S = number of barrier instructions, and I = total instructions across all blocks. Phase 1 (barrier identification) is O(S). Phase 2 (access classification) is O(I). The dataflow fixed-point iterates until no boolean in the 4 * B * 2 lattice positions flips from 0 to 1; since the lattice has height 1, convergence is bounded by O(B) iterations, each costing O(B + I) for the forward and backward scans, giving O(B * (B + I)) per convergence cycle. Phase 4 (elimination decision) is O(S). Phase 5 restarts the entire analysis from Phase 3 on each removal, yielding a worst-case total of O(S * B * (B + I)). In practice, CUDA kernels have B < 100, S < 20, and convergence in 2--3 iterations, so the pass behaves as near-linear in typical use. The red-black tree maps contribute O(log B) per insert/lookup, but this is dominated by the iteration cost.
This is the core of the pass and accounts for the majority of its 96KB size. The algorithm maintains eight red-black tree maps organized into forward and backward analysis sets, plus four bridge maps for the final elimination decision.
Map Layout
| Offset range | Direction | Contents |
|---|---|---|
| a1[15..20] | Forward | ReadAbove per basic block |
| a1[21..26] | Forward | WriteAbove per basic block |
| a1[27..32] | Forward | ReadBelow per basic block |
| a1[33..38] | Forward | WriteBelow per basic block |
| a1[39..44] | Backward | ReadAbove per basic block |
| a1[45..50] | Backward | WriteAbove per basic block |
| a1[51..56] | Backward | ReadBelow per basic block |
| a1[57..62] | Backward | WriteBelow per basic block |
| a1[63..68] | Bridge | ReadAbove crossing barrier |
| a1[69..74] | Bridge | WriteAbove crossing barrier |
| a1[75..80] | Bridge | ReadBelow crossing barrier |
| a1[81..86] | Bridge | WriteBelow crossing barrier |
Each map is a std::map-style red-black tree (48-byte nodes: left/right/parent pointers, key = basic block pointer, value = 1-byte boolean at offset 40). The helper sub_2C84590 performs map insertion; sub_2C84AF0 is a variant for a different node type used in the bridge maps.
Iteration Algorithm
The analysis loop is implemented as a goto-based iteration between labels LABEL_2 and LABEL_178 in the decompiled output:
function analyzeBarriers(F, state):
LABEL_2: // restart point after barrier removal
// --- Forward pass ---
for each BB in F:
sub_2C84640(state, BB, direction=1) // scan BB forward
// For each instruction from BB start toward first barrier:
// classify as read/write via sub_2C83AE0
// OR the flags into forward maps [15..38]
// Propagate successor BBs' flags backward if they
// contain already-analyzed barriers
// --- Forward convergence check ---
changed_fwd = false
for each BB in F:
if forward_maps[BB] != previous_forward_maps[BB]:
changed_fwd = true
break
// --- Backward pass ---
for each BB in F:
sub_2C84640(state, BB, direction=0) // scan BB backward
// For each instruction from BB end toward last barrier:
// classify as read/write
// OR into backward maps [39..62]
// Propagate predecessor BBs' flags forward
// --- Backward convergence check ---
changed_bwd = false
for each BB in F:
if backward_maps[BB] != previous_backward_maps[BB]:
changed_bwd = true
break
// If either direction changed, iterate
if changed_fwd or changed_bwd:
goto LABEL_2_inner // re-run dataflow (not full restart)
// Both converged -- proceed to Phase 4
goto elimination_phase
The sub_2C84640 helper is the per-BB analysis workhorse. It takes a direction parameter:
- direction=1 (forward): scans from block entry toward the first barrier, accumulating ReadAbove/WriteAbove. Propagates read/write information from successor blocks.
- direction=0 (backward): scans from block exit toward the last barrier, accumulating ReadBelow/WriteBelow. Propagates information from predecessor blocks.
The convergence check compares the entire map contents (all four categories for every BB) against their values from the previous iteration. If any single boolean flipped from 0 to 1, the changed flag is set. Since the analysis is monotone (booleans can only transition from 0 to 1, never back), convergence is guaranteed in at most O(|BB|) iterations, though in practice it converges in 2--3 iterations for typical CUDA kernels.
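The monotone OR-propagation can be demonstrated on a three-block toy CFG. This is illustrative only; the block names and the single ReadAbove category are invented for the example.

```python
# Toy CFG A -> B -> C; only C locally reads memory above its barrier.
cfg = {"A": ["B"], "B": ["C"], "C": []}           # successor lists
read_above = {"A": False, "B": False, "C": True}  # local classification

iterations = 0
changed = True
while changed:            # fixed-point loop: booleans only flip 0 -> 1
    changed = False
    iterations += 1
    for bb, succs in cfg.items():
        merged = read_above[bb] or any(read_above[s] for s in succs)
        if merged and not read_above[bb]:
            read_above[bb] = merged
            changed = True

# The read in C propagates to every block that reaches it.
assert read_above == {"A": True, "B": True, "C": True}
assert iterations <= 4    # height-1 lattice: at most |BB| + 1 rounds here
```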
Phase 4: Elimination Decision
After the dataflow converges, the pass examines every barrier instruction and checks the bridge maps (a1[63..86]) which represent the combined read/write sets crossing barrier boundaries.
A barrier is redundant (dead) if any of the following holds:
| Condition | Interpretation |
|---|---|
| ReadAbove == 0 AND WriteAbove == 0 | No shared-memory accesses reach this barrier from above; the barrier orders nothing |
| ReadBelow == 0 AND WriteBelow == 0 | No accesses reach from below |
| ReadAbove == 0 AND WriteBelow == 0 | No RAW or WAW hazard across the barrier |
| WriteAbove == 0 AND ReadBelow == 0 | No WAR or WAW hazard across the barrier |
The first two conditions capture the case where one side of the barrier has no memory traffic at all. The latter two capture the case where both sides access memory, but the access patterns cannot conflict.
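The four conditions collapse into a single predicate over the per-barrier bridge flags. This is a direct transcription of the table above; the hazard names in the comments follow the document's own convention.

```python
def barrier_is_dead(ra, wa, rb, wb):
    """ra/wa/rb/wb are the ReadAbove/WriteAbove/ReadBelow/WriteBelow
    booleans from the bridge maps for one barrier."""
    return ((not ra and not wa) or   # no traffic above at all
            (not rb and not wb) or   # no traffic below at all
            (not ra and not wb) or   # no RAW or WAW hazard across
            (not wa and not rb))     # no WAR or WAW hazard across

assert barrier_is_dead(ra=False, wa=False, rb=True, wb=True)    # nothing above
assert not barrier_is_dead(ra=True, wa=True, rb=True, wb=True)  # full traffic: keep
# Reads on both sides match none of the four conditions, so such a
# barrier is (conservatively) kept:
assert not barrier_is_dead(ra=True, wa=False, rb=True, wb=False)
```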
Special Case: Intrinsic IDs 8260--8262
For call instructions (opcode 85) where the callee's intrinsic ID satisfies (ID - 8260) <= 2 (i.e., IDs 8260, 8261, or 8262), the pass applies an additional test via sub_BD3660 (hasOneUse). If the barrier-like intrinsic has only a single use, it is considered removable even if the standard dataflow check would keep it. These IDs likely correspond to specialized barrier variants (__syncthreads_count, __syncthreads_and, __syncthreads_or) where the return value is used as data. When the return value has only one use, the compiler can reason that the data-carrying aspect is trivially handled and the barrier itself may still be dead from a memory ordering perspective.
Phase 5: Removal and Complete Restart
When a barrier is identified as dead, the pass:
- Emits a diagnostic string (if the controlling dump flag is enabled): `Removed dead synch: [filename:line] in function <name> Read above: N, Write above: N, Read below: N, Write below: N`, where N is 0 or 1 for each category.
- Calls `sub_B43D60` (Instruction::eraseFromParent) to delete the barrier instruction from the IR.
- Restarts from Phase 3 (goto LABEL_2) -- a complete re-analysis of the entire function.
The restart is not optional. Removing a barrier changes the memory access pattern visible between adjacent barriers: what was previously two separate "above/below" regions separated by a barrier now becomes a single merged region. This merging may cause an adjacent barrier to lose its hazard justification, making it dead as well. The cascading effect can propagate through a chain of barriers.
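The removal/restart structure has this shape. This is illustrative Python: `run_dataflow` and `find_dead_barrier` stand in for Phases 3 and 4 and are not recovered function names.

```python
def eliminate_dead_barriers(func, run_dataflow, find_dead_barrier):
    """Phase 3-5 driver shape: after every single removal, restart the
    whole analysis (the decompiled goto back to LABEL_2)."""
    removed = 0
    while True:
        flags = run_dataflow(func)               # Phase 3: fixed-point dataflow
        victim = find_dead_barrier(func, flags)  # Phase 4: elimination decision
        if victim is None:
            return removed                       # no dead barrier left
        func.remove(victim)                      # Phase 5: erase the barrier ...
        removed += 1                             # ... then restart from Phase 3

# Toy cascade: barrier 3 only becomes dead once barrier 2 is gone.
class ToyFunc:
    def __init__(self, barriers): self.barriers = list(barriers)
    def remove(self, b): self.barriers.remove(b)

def toy_find(f, _flags):
    if 2 in f.barriers:
        return 2
    if 3 in f.barriers:   # only reachable after 2 has been removed
        return 3
    return None

f = ToyFunc([1, 2, 3])
assert eliminate_dead_barriers(f, lambda f: None, toy_find) == 2
assert f.barriers == [1]
```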
Dataflow Analysis Detail
The Eight Red-Black Tree Maps
The forward and backward map sets track identical categories but are populated from different CFG traversal directions. This bidirectional approach ensures that the analysis captures memory accesses reachable from any control flow path, not just straight-line code.
Forward maps (a1[15..38]):
- Populated by scanning each BB from entry toward the first barrier.
- ReadAbove[BB] = 1 if any instruction before the first barrier in BB reads shared/global memory, OR if any successor BB contributes a read.
- WriteAbove[BB] = same for writes.
- ReadBelow/WriteBelow[BB] = propagated from successor blocks' analysis.
Backward maps (a1[39..62]):
- Populated by scanning each BB from exit toward the last barrier.
- ReadBelow[BB] = 1 if any instruction after the last barrier in BB reads memory, OR if any predecessor BB contributes a read.
- WriteBelow[BB] = same for writes.
- ReadAbove/WriteAbove[BB] = propagated from predecessor blocks.
Bridge maps (a1[63..86]):
- Keyed by barrier instruction pointer (not BB pointer).
- Represent the combined access sets that cross the specific barrier boundary.
- Populated during the final pass over barrier instructions after dataflow convergence.
Monotone Dataflow Framework
The analysis is a classic monotone dataflow problem on a Boolean lattice:
- Domain: {0, 1} per (basic-block, category) pair.
- Transfer function: OR of local classification with propagated values.
- Meet operator: OR (any path contributing an access sets the flag).
- Direction: Bidirectional (forward pass propagates from successors, backward pass propagates from predecessors).
- Convergence: Guaranteed because the lattice has height 1 (a value can only change from 0 to 1, never back). The fixed point is reached when no additional propagation changes any value.
In the worst case, each iteration may set one new bit, and there are 4 * |BB| bits per direction, so convergence takes at most 4 * |BB| iterations per direction. In practice, CUDA kernels have shallow CFGs and the iteration converges in 2--3 rounds.
Cascading Restart Logic
The most expensive aspect of the algorithm is the complete restart after each barrier removal. Consider a function with N barriers:
B0 -- barrier_1 -- B1 -- barrier_2 -- B2 -- barrier_3 -- B3
If barrier_2 is removed first, blocks B1 and B2 merge into a single region. If B1 contained only writes and B2 contained only reads, barrier_1 was previously justified by the WAR hazard between B0's writes and B1's reads. But after merging, B1+B2 now contains both reads and writes, and barrier_3 might become dead if B3 has no memory accesses. This cascading effect requires full re-analysis.
Worst-case complexity: O(N_barriers * N_BBs * convergence_iterations), where convergence_iterations is bounded by 4 * |BB| but is typically 2--3. For a kernel that removes B barriers in sequence, the total work is O(B * F * C), where B is the number of barriers removed, F is the per-iteration cost of the dataflow, and C is the convergence bound.
In practice, CUDA kernels rarely have more than 10--20 barriers, and cascading removals are uncommon (typically 0--3 restarts), so the theoretical quadratic cost is not a bottleneck.
Relationship to basic-dbe and branch-dist
CICC contains three passes that eliminate dead synchronization barriers. They differ in scope, cost, and the cases they handle:
| Property | basic-dbe | branch-dist | Dead Sync Elimination |
|---|---|---|---|
| Pass name | basic-dbe | branch-dist | (unnamed, called from module pass) |
| Entry point | llvm::BasicDeadBarrierEliminationPass | sub_1C47810 | sub_2C84BA0 |
| Registration | New PM slot 376 | New PM slot (function pass) | Module-level caller |
| Scope | Single BB / local | Function-level with CFG propagation | Function-level with full restart |
| Dataflow | None (pattern match) | Fixed-point, 13 rb-tree maps | Fixed-point, 12 rb-tree maps |
| Restart on removal | No | Yes (goto LABEL_2) | Yes (goto LABEL_2) |
| Binary size | Small (ctor_261) | 63KB core + helpers | 96KB core + helpers |
| Knobs | byte_4FBB6C0 enable flag | 10 knobs (ctor_525) | None known (controlled by caller) |
basic-dbe handles trivially dead barriers detectable without dataflow analysis -- cases where the barrier is immediately adjacent to another barrier, or where the enclosing block contains no memory operations at all. It runs in the standard function pass pipeline and is cheap.
branch-dist performs full CFG propagation with 13 red-black tree maps and restart-on-removal, but it uses NVVM IR opcodes (0x36/0x37/0x3A/0x3B/0x4E) rather than the generic LLVM IR opcodes (61/62/65/66/85) used by the full engine. It also has its own address space filtering logic and 10 configurable knobs.
The full dead synchronization elimination engine (sub_2C84BA0) is the most aggressive of the three. It uses the LLVM IR opcode set, applies the element-size gate for loads/stores, and handles the special intrinsic IDs 8260--8262. It runs separately from the New PM function pass pipeline, invoked from module-level callers sub_2C88020 and sub_2C883F0.
Configuration
No dedicated knobs have been identified for the full engine at sub_2C84BA0. Its behavior is controlled entirely by its callers (sub_2C88020, sub_2C883F0), which determine when and whether the engine runs. This is in contrast to branch-dist, which has 10 knobs, and basic-dbe, which has at least an enable flag.
The diagnostic output is gated by an internal condition in the caller, not by a standalone dump knob.
Diagnostic Strings
" Removed dead synch: "
"Read above: "
", Write above: "
", Read below: "
", Write below: "
" in function "
"dbg"
The complete diagnostic message, assembled from these fragments:
Removed dead synch: [filename:line] in function <name>
Read above: 0, Write above: 0, Read below: 1, Write below: 1
The numeric values are the boolean (0/1) access flags for each category. When the pass removes a barrier, the diagnostic shows exactly why it was safe: which of the four access categories was absent.
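A quick reconstruction of how the fragments concatenate. The filename, line, and function name here are invented placeholders; the real pass pulls them from debug info via `sub_B91420`.

```python
# Recovered string fragments, in the order they appear above.
FRAGS = [" Removed dead synch: ", "Read above: ", ", Write above: ",
         ", Read below: ", ", Write below: ", " in function "]

def format_diag(loc, func, ra, wa, rb, wb):
    # loc/func are illustrative placeholders, not recovered values.
    return (FRAGS[0] + "[" + loc + "]" + FRAGS[5] + func + "\n"
            + FRAGS[1] + str(ra) + FRAGS[2] + str(wa)
            + FRAGS[3] + str(rb) + FRAGS[4] + str(wb))

msg = format_diag("kernel.cu:42", "my_kernel", 0, 0, 1, 1)
assert "Removed dead synch: [kernel.cu:42] in function my_kernel" in msg
assert "Read above: 0, Write above: 0, Read below: 1, Write below: 1" in msg
```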
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_2C84BA0 | 96KB (3,400 lines) | Main engine: 5-phase algorithm |
| -- | sub_2C83D20 | small | isSyncBarrier predicate |
| -- | sub_2C83AE0 | small | classifyMemoryAccess (read/write classification) |
| -- | sub_2C84640 | medium | Per-BB analysis (bidirectional, direction parameter) |
| -- | sub_2C84590 | small | Red-black tree insert (forward/backward maps) |
| -- | sub_2C84AF0 | small | Red-black tree insert (bridge maps, different node type) |
| -- | sub_2C84080 | small | Map lookup / convergence check helper |
| -- | sub_2C83F20 | small | Map initialization / clear helper |
| -- | sub_2C83D50 | small | Map destructor / cleanup |
| -- | sub_BD3660 | small | hasOneUse -- used for intrinsic IDs 8260--8262 special case |
| -- | sub_CEA1A0 | small | Barrier intrinsic ID confirmation |
| -- | sub_B49E00 | small | isSharedMemoryAccess -- CUDA address space check |
| -- | sub_B43D60 | small | Instruction::eraseFromParent -- barrier deletion |
| -- | sub_B46E30 | small | getNumSuccessors -- CFG successor count |
| -- | sub_B46EC0 | small | getSuccessor(i) -- i-th successor retrieval |
| -- | sub_CB6200 | small | raw_ostream::write -- diagnostic string output |
| -- | sub_B91420 | small | Debug location extraction (filename/line) |
| -- | sub_B91F50 | small | Debug info accessor |
| -- | sub_BD5D20 | small | Type/value accessor |
| -- | sub_22409D0 | small | IR utility (instruction manipulation) |
| -- | sub_CB59D0 | small | raw_ostream integer write |
| -- | sub_CB59F0 | small | raw_ostream integer write (variant) |
| -- | sub_2C88020 | -- | Caller: module-level pass invoking the engine |
| -- | sub_2C883F0 | -- | Caller: module-level pass invoking the engine (variant) |
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent dead synchronization elimination engine.
1. Removing a barrier that protects a cross-thread shared memory hazard invisible to single-thread analysis. The most dangerous mistake is treating the analysis as a single-thread dataflow problem. The pass classifies memory accesses as read/write per thread, but the barrier's purpose is to order accesses across threads. If thread A writes to smem[tid] above the barrier and thread B reads smem[tid-1] below it, a single-thread view sees no RAW hazard (different addresses). The correct analysis must conservatively assume that any shared memory write above and any shared memory read below constitutes a hazard -- the pass uses boolean flags (not address tracking) precisely because aliasing across threads is unknowable at compile time. A reimplementation that attempts to be "smarter" by tracking addresses will remove barriers that are needed.
2. Not restarting the full analysis after each barrier removal. When a barrier is deleted, the two regions it separated merge into one. This merged region may expose an adjacent barrier as dead (it no longer has memory accesses on one side). A reimplementation that removes all identified dead barriers in a single pass and then stops will miss these cascading redundancies. The restart is mandatory: the pass deliberately uses a goto back to Phase 3 after each removal, re-analyzing the entire function from scratch.
3. Incorrectly classifying call instructions as non-memory-accessing. The access classifier (sub_2C83AE0) must recursively analyze callees to determine if they access shared/global memory. A reimplementation that conservatively marks all calls as read+write will be correct but will retain too many barriers (poor optimization). Conversely, one that ignores calls entirely will remove barriers protecting memory accesses hidden inside called functions. The correct behavior checks the isSharedMemoryAccess predicate on the callee and falls back to read+write if the callee is opaque.
4. Treating __syncthreads_count/and/or (IDs 8260--8262) the same as plain __syncthreads. These barrier variants return a value (lane participation count/and/or). Even when the barrier is dead from a memory-ordering perspective, the return value may be used as data by the program. The pass applies a special hasOneUse check for these IDs. A reimplementation that blindly removes them when the dataflow says "no hazard" will break programs that depend on the return value for algorithmic purposes.
5. Applying the element-size gate too aggressively. The pass filters out loads/stores of types narrower than 512 bits (> 0x1FF), assuming they are register-promoted scalars. A reimplementation that raises this threshold (e.g., to 1024 bits) will miss legitimate memory operations that should keep a barrier alive. Conversely, lowering it to 0 will make the analysis overly conservative, retaining dead barriers for trivial register operations.
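Pitfalls 1 and 2 can be illustrated with a deliberately simplified model (assumptions: a straight-line function body, boolean flags with no address tracking, and only the two "one side has no accesses" clauses of the recovered deadness test; the cross-category clauses and the per-block bridge maps are omitted):

```python
# Simplified sketch of the removal loop: boolean access flags per side
# (pitfall 1 -- no address tracking) and a full restart after every
# removal (pitfall 2). Events model one straight-line function body.
def barrier_is_dead(above, below):
    no_above = "read" not in above and "write" not in above
    no_below = "read" not in below and "write" not in below
    return no_above or no_below

def eliminate_dead_barriers(events):
    events = list(events)
    restart = True
    while restart:              # full restart after each removal
        restart = False
        for i, ev in enumerate(events):
            if ev == "barrier" and barrier_is_dead(events[:i], events[i+1:]):
                del events[i]   # back to Phase 3: re-analyze from scratch
                restart = True
                break
    return events

# write ; barrier ; read ; barrier ; barrier -> only the first barrier
# has accesses on both sides, so only it survives.
print(eliminate_dead_barriers(
    ["write", "barrier", "read", "barrier", "barrier"]))
```

Note how the second barrier only becomes removable, and then the third, through the restart: each deletion merges regions and exposes the next dead barrier.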
Test This
The following kernel contains consecutive __syncthreads() barriers with no shared memory accesses between them. The dead synchronization elimination pass should remove the redundant barriers.
```cuda
__global__ void dead_sync_test(float* out, int n) {
    __shared__ float smem[256];
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();  // barrier 1: needed (write above, read below)
    float val = smem[threadIdx.x ^ 1];
    __syncthreads();  // barrier 2: dead -- no smem access between barrier 1 and 2's "below"
    __syncthreads();  // barrier 3: consecutive with barrier 2 -- trivially dead
    out[threadIdx.x] = val;
}
```
What to look for in PTX:
- Count the number of `bar.sync 0;` instructions. The kernel has three `__syncthreads()` calls in source, but only one should survive: barrier 1 (which orders the write to `smem` against the read from `smem[tid^1]`). Barriers 2 and 3 have no shared memory hazard to protect.
- The diagnostic `"Removed dead synch:"` (visible with internal dump flags) shows the per-category access flags that justified removal: `Read above: 0, Write above: 0` means no memory accesses reach the barrier from above.
- To verify the pass preserves necessary barriers, move the `float val = smem[...]` read to between barriers 2 and 3. Now barrier 2 orders the write against this read and must survive -- expect two `bar.sync` instructions.
- The cascading restart behavior is observable with 5 consecutive `__syncthreads()` with no memory accesses between them. The pass removes one, restarts the analysis, removes the next, and repeats until only one remains.
Reimplementation Checklist
- Barrier identification predicate. Implement the five-condition conjunction: opcode == 85 (internal call), non-null callee, byte[0] == 0 (intrinsic flag), scope match (callee.field[24] == inst.field[80]), convergent attribute (bit 0x20 at byte+33), and barrier intrinsic ID confirmation.
- Memory access classifier. Classify every non-barrier instruction as read/write/both/neither based on opcode (store=0x3D, load=0x3E, atomic=0x41, cmpxchg=0x42, call=0x55), with the element-size gate (>511 bits) for loads/stores and recursive analysis for call instructions including shared-memory-access checks.
- Bidirectional fixed-point dataflow. Maintain eight red-black tree maps (forward ReadAbove/WriteAbove/ReadBelow/WriteBelow per BB, backward same) populated by scanning each BB in both directions, propagating from successors (forward) and predecessors (backward), iterating until no boolean flips from 0 to 1.
- Bridge map construction. After dataflow convergence, populate four bridge maps keyed by barrier instruction pointer, representing the combined read/write access sets crossing each specific barrier boundary.
- Elimination decision logic. A barrier is dead if: (ReadAbove==0 AND WriteAbove==0), OR (ReadBelow==0 AND WriteBelow==0), OR (ReadAbove==0 AND WriteBelow==0), OR (WriteAbove==0 AND ReadBelow==0). Handle the special case for intrinsic IDs 8260--8262 (`__syncthreads_count`/`and`/`or`) where single-use return values allow additional removal.
- Complete restart after removal. After each barrier deletion, restart the entire dataflow analysis from scratch to handle cascading redundancies where removing one barrier makes adjacent barriers dead.
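The bidirectional fixed-point dataflow from the checklist can be sketched as a boolean flag propagation over the CFG (a simplified model: plain dicts stand in for the eight red-black tree maps, and the block access summaries and CFG shape are illustrative):

```python
# Sketch (not recovered code): propagate ReadAbove/WriteAbove flags from
# predecessors and ReadBelow/WriteBelow flags from successors, iterating
# until no flag flips from 0 to 1.
def solve_flags(blocks, preds, succs):
    # blocks: {name: {"read": bool, "write": bool}} local access summary
    flags = {b: dict(ra=False, wa=False, rb=False, wb=False) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            new = dict(flags[b])
            for p in preds.get(b, []):
                # anything accessed in or above a predecessor is "above"
                new["ra"] |= blocks[p]["read"] or flags[p]["ra"]
                new["wa"] |= blocks[p]["write"] or flags[p]["wa"]
            for s in succs.get(b, []):
                new["rb"] |= blocks[s]["read"] or flags[s]["rb"]
                new["wb"] |= blocks[s]["write"] or flags[s]["wb"]
            if new != flags[b]:
                flags[b] = new
                changed = True
    return flags

# Diamond CFG: entry writes smem, exit reads it; both middle blocks see
# the write above and the read below.
blocks = {"entry": {"read": False, "write": True},
          "left":  {"read": False, "write": False},
          "right": {"read": False, "write": False},
          "exit":  {"read": True,  "write": False}}
preds = {"left": ["entry"], "right": ["entry"], "exit": ["left", "right"]}
succs = {"entry": ["left", "right"], "left": ["exit"], "right": ["exit"]}
f = solve_flags(blocks, preds, succs)
print(f["left"])
```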
Cross-References
- Dead Barrier Elimination -- overview page covering both `basic-dbe` and this engine
- Branch Distribution -- the other full dead-sync pass using NVVM IR opcodes
- NVIDIA Custom Passes: Inventory -- registry entry for Dead Synchronization Elimination
- LLVM Optimizer: Pipeline -- pipeline context and Phase I/II interaction
Rematerialization
NVIDIA's rematerialization infrastructure in CICC operates at two levels: an IR-level pass (nvvmrematerialize / "Legacy IR Remat") that reduces register pressure before instruction selection, and a machine-level pass (nv-remat-block / "Do Remat Machine Block") that performs the same transformation on MachineIR after register allocation decisions have been made. Both passes share the same fundamental strategy -- recompute cheap values at their use sites rather than keeping them live across long spans -- but they differ significantly in their cost models, candidate selection criteria, and interaction with the surrounding pipeline.
On NVIDIA GPUs, register pressure directly determines occupancy -- the number of concurrent warps per SM -- with discrete cliff boundaries where a single additional register can drop an entire warp group. Rematerialization trades extra ALU work for reduced register count, a tradeoff that is almost always profitable on GPUs where compute throughput vastly exceeds register file bandwidth.
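The cliff effect can be made concrete with a toy occupancy model (the register file size, warp size, and allocation granularity below are illustrative assumptions, not values recovered from the binary):

```python
# Toy occupancy model (assumed parameters, for illustration only):
# a 64K-register SM register file, 32 threads per warp, warps allocated
# at a granularity of 4.
REGFILE = 65536
WARP = 32
GROUP = 4

def warps_per_sm(regs_per_thread, max_warps=48):
    warps = REGFILE // (regs_per_thread * WARP)
    warps -= warps % GROUP          # allocation granularity
    return min(warps, max_warps)

for r in (42, 43):                  # one extra register crosses a cliff
    print(r, warps_per_sm(r))
```

In this model a single extra register per thread (42 to 43) drops the SM from 48 resident warps to 44, which is exactly the tradeoff rematerialization targets.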
Key Facts
| Property | Value |
|---|---|
| Pass name (New PM) | remat |
| Pass name (Legacy PM) | nvvmrematerialize / "Legacy IR Remat" |
| Class | RematerializationPass |
| Registration | New PM #385, line 2257 in sub_2342890 |
| Runtime positions | Tier 0 #34 (NVVMRematerialization via sub_1A13320); Tier 1/2/3 #55 (gated by !opts[2320]); see Pipeline |
| Pass factory | sub_1A13320 |
| Machine-level companion | nv-remat-block / "Do Remat Machine Block" at sub_2186D90 |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
IR-Level Rematerialization (nvvmrematerialize)
Registration and Dependencies
The pass is registered at sub_1CD0BE0 with pass ID "nvvmrematerialize" and entry point sub_1CD0CE0. Before running, it initializes five analysis passes:
| Analysis | Function | Purpose |
|---|---|---|
| Dominator tree | sub_15CD350 | Dominance queries for instruction placement |
| Loop info | sub_1440EE0 | Loop nest structure for cost scaling |
| Unknown | sub_13FBE20 | Possibly alias analysis |
| Live variable analysis | sub_1BFC830 | Builds live-in/live-out bitvector sets |
| Unknown | sub_1BFB430 | Possibly register pressure estimation |
Main Algorithm (sub_1CE7DD0, 67KB)
Complexity (IR level). Let B = number of basic blocks, I = total instructions, and L = number of live-in values. The live-in analysis uses hardware popcnt on bitvectors of size ceil(I / 64) per block, giving O(B * I / 64) per iteration; the intersection of live-in sets (bitwise AND) is likewise O(B * I / 64). The rematerializability check for each candidate walks its def chain in O(D), where D is the def-chain depth (bounded by max-recurse-depth). The pull-in cost model (sub_1CE3AF0) scores each candidate in O(U * D), where U = uses per candidate. Candidate sorting is O(K^2) via selection sort, where K = candidates selected, and the block executor clones instructions in O(K * B). With the outer loop capped at 5 iterations, the overall IR-level cost is O(5 * (B * I / 64 + K * U * D + K * B)).
Complexity (machine level, sub_2186D90). Max-live computation is O(I) per block via a reverse walk, O(I) total across blocks. Candidate classification is O(I) for the initial scan, plus O(K * 50) for recursive pullability checks (depth bounded at 50). The second-chance heuristic iterates until convergence, bounded by the candidate count K. With the outer loop capped at nv-remat-max-times (default 10) iterations, the overall machine-level cost is O(10 * (I + K^2)).
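For scale, a quick back-of-the-envelope calculation of the dominant bitvector term (the block and instruction counts here are illustrative, not measured):

```python
import math

# Illustrative scale estimate for the IR-level live-in analysis:
# B blocks, I instructions, bitvectors stored as 64-bit words.
B, I = 200, 12_000
words_per_block = math.ceil(I / 64)        # bitvector words per block
and_popcnt_ops = B * words_per_block       # one intersection + count pass
print(words_per_block, and_popcnt_ops)
```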
The driver implements an iterative register pressure reduction loop with up to 5 iterations. The high-level flow:
- Function exclusion check: The `no-remat` knob stores a comma-separated list of function names. If the current function matches, the pass prints `"Skip rematerialization on <funcname>"` and bails.
- Master gate: If all three sub-passes are disabled (`do-remat`, `remat-iv`, `remat-load` all zero), return immediately.
- Live-in/live-out analysis: For each basic block, the pass looks up the block's live-in bitvector from the analysis (`sub_1BFDF20`), counts live-in values via hardware `popcnt` (`sub_39FAC40`), and stores per-block counts in a hash map. The maximum live-in across all blocks becomes the pressure target baseline. At `dump-remat >= 2`, the pass prints `"Block %s: live-in = %d"`.
- Register target computation: The algorithm computes how many registers it wants to reduce to:
  - If `remat-maxreg-ceiling` is set and lower than the actual register count, cap at that value.
  - If `remat-for-occ` is non-zero (default 120): call `sub_1BFBA30` for register usage, then `sub_1C01730` for an occupancy-based target. Apply heuristic adjustments based on occupancy level.
  - Otherwise: target = 80% of the current register count.
- Iterative loop (up to 5 iterations):
  - If max live-in is already at or below the target, skip to the IV/load phases.
  - Compute the intersection of live-in bitvectors across blocks (bitwise AND). Values that are live-in everywhere are the best rematerialization candidates because pulling them in at each use site eliminates a register everywhere.
  - Walk the intersection bitvector. For each candidate, check rematerializability via `sub_1CD06C0`. Partition into rematerializable and non-rematerializable sets.
  - Call `sub_1CE3AF0` (pull-in cost analysis) to rank candidates by cost.
  - Build a per-block rematerialization plan and execute via `sub_1CE67D0`.
  - Recompute max live-in. If it decreased, continue iterating.
- Post-remat phases: After the main loop, run IV demotion (`sub_1CD74B0`) if `remat-iv` is enabled, then load rematerialization (`sub_1CDE4D0`) if `remat-load` is enabled, then cleanup (`sub_1CD2540`).
- Expression factoring: When `remat-add` is non-zero, the pass also performs strength reduction on chains of `add`/`mul`/`GEP` instructions, factoring common sub-expressions into `"factor"`-named values. This is a mini-pass embedded within rematerialization.
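The shape of the iterative loop can be sketched as follows (`reduce_round` is a hypothetical stand-in for one analyze/plan/execute round, i.e. the work done by `sub_1CE3AF0` plus `sub_1CE67D0`; the cost model here is a toy):

```python
# Sketch of the iterative pressure-reduction driver: loop up to 5 times,
# stop early when the target is met or no progress is made.
def remat_driver(max_live_in, target, reduce_round, max_iters=5):
    """reduce_round(pressure) -> new pressure after one remat round."""
    history = [max_live_in]
    for _ in range(max_iters):
        if max_live_in <= target:
            break
        new_pressure = reduce_round(max_live_in)
        if new_pressure >= max_live_in:   # no progress: stop iterating
            break
        max_live_in = new_pressure
        history.append(max_live_in)
    return history

# Toy model: each round removes roughly a third of the remaining gap.
trace = remat_driver(100, 64, lambda p: p - max(1, (p - 64) // 3))
print(trace)
```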
Block-Level Executor (sub_1CE67D0, 32KB)
This function processes one basic block at a time, creating two kinds of instruction clones distinguished by their name prefixes:
remat_ prefix: The value was live-in to the block and is being recomputed from scratch. The defining instruction is duplicated via sub_15F4880, named with the "remat_" prefix via sub_164B780, and inserted at the use site. This is full rematerialization.
uclone_ prefix: The value already has a definition in the block's dominance chain, but a local copy is needed to shorten the live range. The instruction is cloned and named "uclone_". This is a use-level clone for live range splitting, not pure rematerialization.
After cloning, both variants update use-def chains via sub_1648780 and set debug locations via sub_15F22F0.
Pull-In Cost Model (sub_1CE3AF0, 56KB)
The cost model evaluates each candidate for rematerialization by computing:
pull_in_cost = base_cost * use_factor
Where base_cost is the sum of per-instruction costs along the value's def chain (sub_1CD0460), and use_factor is accumulated from per-use costs (sub_1CD3A10), with different cost tables for uses in different loop nests.
Candidates are filtered by three thresholds:
| Filter | Condition | Default |
|---|---|---|
| Use limit | use_count > remat-use-limit AND use_factor >= remat-loop-trip | 10 uses, 20 trips |
| GEP cost | cost > remat-gep-cost AND opcode is GEP | 6000 |
| Single cost | cost > remat-single-cost-limit (unless remat-ignore-single-cost) | 6000 |
After scoring, candidates are sorted by cost (cheapest first via selection sort), and the cheapest N are selected where N is the target reduction count. At dump-remat >= 4, the pass prints "Total pull-in cost = %d".
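The filter-and-sort stage can be sketched like this (a hedged model: the candidate fields and the `score` shorthand are assumptions; the real model walks def chains and per-use cost tables rather than reading precomputed numbers):

```python
# Sketch of pull-in candidate filtering and selection. Threshold defaults
# are taken from the knob table; everything else is illustrative.
REMAT_USE_LIMIT = 10
REMAT_LOOP_TRIP = 20
REMAT_SINGLE_COST_LIMIT = 6000

def score(c):
    return c["base_cost"] * c["use_factor"]

def select_candidates(cands, n):
    kept = []
    for c in cands:
        if c["uses"] > REMAT_USE_LIMIT and c["use_factor"] >= REMAT_LOOP_TRIP:
            continue                      # too many uses in hot loops
        if score(c) > REMAT_SINGLE_COST_LIMIT:
            continue                      # too expensive to recompute
        kept.append(c)
    kept.sort(key=score)                  # cheapest first
    return kept[:n]

cands = [
    {"name": "a", "base_cost": 100, "use_factor": 4,  "uses": 3},
    {"name": "b", "base_cost": 300, "use_factor": 25, "uses": 12},  # filtered
    {"name": "c", "base_cost": 50,  "use_factor": 2,  "uses": 2},
]
print([c["name"] for c in select_candidates(cands, 2)])
```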
NLO -- Simplify Live Output (sub_1CE10B0 + sub_1CDC1F0)
The NLO sub-pass normalizes live-out values at block boundaries to reduce register pressure. Controlled by simplify-live-out (default 2):
- Level 1: Basic normalization only.
- Level 2 (default): Full normalization. Walks each block's live-out set and replaces values with simpler expressions.
- Level 3+: Extended patterns.
NLO creates two kinds of synthetic instructions:
- `nloNewBit`: A bit-level operation (AND, extract, truncation) to reduce a live-out value to its actually-used bit width.
- `nloNewAdd`: A local add instruction to recompute an address/offset that was previously live-out, replacing it with a local computation.
IV Demotion (sub_1CD74B0, 75KB)
The induction variable demotion sub-pass reduces register pressure by narrowing wide IVs (typically 64-bit to 32-bit). Controlled by remat-iv (default 4, meaning full demotion):
| Level | Behavior |
|---|---|
| 0 | Disabled |
| 1-2 | Basic IV demotion |
| 3 | Extended IV demotion |
| 4 | Full demotion including complex patterns (default) |
| 5+ | Aggressive mode |
The algorithm identifies PHI nodes at loop headers, checks whether the IV's value range fits in a smaller type (for 64-bit IVs: (val + 0x80000000) <= 0xFFFFFFFF), and creates narrower replacements:
- `demoteIV`: A truncation of the original IV to a narrower type.
- `newBaseIV`: A new narrow PHI node to replace the wide loop IV.
- `iv_base_clone_`: A clone of the IV's base value for use in comparisons that need the original width.
- `substIV`: Replaces uses of the old IV with the demoted version.
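The range test is a standard signed-32-bit containment check; in Python it can be reproduced by emulating the unsigned 64-bit addition with a mask:

```python
# Sketch of the 64-bit -> 32-bit demotion range test: a value fits in a
# signed 32-bit IV iff (val + 0x80000000) <= 0xFFFFFFFF in unsigned
# 64-bit arithmetic (the mask emulates C's unsigned wraparound).
def fits_in_i32(val):
    return ((val + 0x80000000) & 0xFFFFFFFFFFFFFFFF) <= 0xFFFFFFFF

print(fits_in_i32(2**31 - 1), fits_in_i32(2**31), fits_in_i32(-2**31))
```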
Multi-Pass Data Flow: Rematerialization / IV Demotion / NLO
The IR-level rematerialization pass (nvvmrematerialize) contains three cooperating sub-passes that execute in a fixed sequence within a single pass invocation. The following diagram shows the data each sub-pass produces and consumes, and the feedback loop that drives iterative pressure reduction.
Live Variable Analysis (prerequisite)
+------------------------------------+
| Builds per-block live-in/live-out |
| bitvector sets via sub_1BFDF20 |
| Produces: |
| - live-in bitvector per BB |
| - live-out bitvector per BB |
| - max live-in count (pressure) |
+------------------+-----------------+
|
v
+===============================================================+
| MAIN REMATERIALIZATION LOOP (sub_1CE7DD0, up to 5 iterations)|
| |
| Inputs: |
| - live-in bitvectors (from analysis above) |
| - register target (from occupancy model or 80% heuristic) |
| - remat cost thresholds (knobs) |
| |
| +----------------------------------------------------------+ |
| | Step 1: Compute intersection of live-in sets | |
| | (bitwise AND across all blocks) | |
| | --> Values live everywhere = best candidates | |
| +---------------------------+------------------------------+ |
| | |
| | candidate value set |
| v |
| +---------------------------+------------------------------+ |
| | Step 2: Pull-In Cost Analysis (sub_1CE3AF0) | |
| | For each candidate: | |
| | cost = base_cost(def chain) * use_factor(loop nesting) | |
| | Filter by: remat-use-limit, remat-gep-cost, | |
| | remat-single-cost-limit | |
| | Sort by cost (cheapest first) | |
| | Produces: ranked list of N cheapest candidates | |
| +---------------------------+------------------------------+ |
| | |
| | remat plan per block |
| v |
| +---------------------------+------------------------------+ |
| | Step 3: Block Executor (sub_1CE67D0) | |
| | For each selected candidate in each block: | |
| | "remat_" clone: full rematerialization at use site | |
| | "uclone_" clone: live range split within dom chain | |
| | Produces: | |
| | - cloned instructions at use sites | |
| | - reduced live-in counts per block | |
| +---------------------------+------------------------------+ |
| | |
| | updated IR |
| v |
| Recompute max live-in. If decreased and < 5 iters, loop. |
+=======================+=====================================+
|
| IR with reduced register pressure
v
+=======================+=====================================+
| IV DEMOTION (sub_1CD74B0, controlled by remat-iv) |
| |
| Consumes: |
| - Loop header PHI nodes (from LoopInfo) |
| - Type widths (from DataLayout) |
| - post-remat IR (live ranges already shortened) |
| |
| Algorithm: |
| for each loop L: |
| for each 64-bit PHI in L.header: |
| if (val + 0x80000000) <= 0xFFFFFFFF: |
| create "demoteIV" (trunc to i32) |
| create "newBaseIV" (narrow PHI replacement) |
| rewrite uses with "substIV" |
| |
| Produces: |
| - narrowed IVs (64->32 bit, halving register cost) |
| - "iv_base_clone_" values for comparisons needing |
| original width |
| - updated loop exit conditions |
+=======================+=====================================+
|
| IR with narrowed IVs
v
+=======================+=====================================+
| NLO -- SIMPLIFY LIVE OUTPUT (sub_1CE10B0, simplify-live-out)|
| |
| Consumes: |
| - per-block live-out bitvector sets |
| - post-IV-demotion IR |
| |
| For each block's live-out set: |
| - If a value is live-out but only its low bits are used |
| downstream: create "nloNewBit" (AND/extract/trunc) |
| - If a value is an address live-out that can be recomputed |
| locally in successors: create "nloNewAdd" (local add) |
| |
| Produces: |
| - "nloNewBit" bit-narrowing instructions |
| - "nloNewAdd" local recomputation instructions |
| - reduced live-out register count at block boundaries |
+=======================+=====================================+
|
| Final IR: pressure-reduced,
| IVs narrowed, live-outs simplified
v
+-------------------------------------------------------+
| Downstream consumers: |
| - Instruction selection (register model now concrete) |
| - Machine-level remat (nv-remat-block, second pass) |
| - Register allocation (lower pressure = higher occ.) |
+-------------------------------------------------------+
Data flow summary:
| Producer | Data | Consumer |
|---|---|---|
| Live Variable Analysis | Per-block live-in/live-out bitvectors | Main remat loop |
| Occupancy model (sub_1C01730) | Register pressure target | Main remat loop |
| Main remat loop | remat_/uclone_ cloned instructions | Updated IR for IV demotion |
| IV Demotion | demoteIV, newBaseIV, substIV narrowed values | NLO and downstream |
| NLO | nloNewBit, nloNewAdd local recomputations | Final IR for instruction selection |
| All three sub-passes | Cumulative register pressure reduction | Machine-level remat (nv-remat-block) |
The sequencing is important: the main loop reduces cross-block live-in pressure first (the broadest and cheapest wins), IV demotion then halves the cost of loop induction variables (converting two registers to one), and NLO cleans up block-boundary live-out values that survived both earlier phases. The machine-level nv-remat-block pass runs much later in the pipeline (after instruction selection and register allocation) as a final safety net, operating on concrete register assignments rather than abstract SSA values.
Machine-Level Block Rematerialization (nv-remat-block)
Registration
Registered at ctor_361_0 (address 0x5108E0) with pass name "nv-remat-block" and description "Do Remat Machine Block". Main entry point: sub_2186D90 (47KB, ~1742 lines).
Algorithm Overview
The machine-level pass implements a sophisticated iterative pull-in algorithm operating on MachineIR after instruction selection:
- Measure: Compute max-live register pressure across all blocks via `sub_2186590`. Prints `"Max-Live-Function(<num_blocks>) = <max_live>"`.
- Identify: For each block where pressure exceeds the target, enumerate live-out registers.
- Classify: For each live-out register, determine pullability:
  - MULTIDEF check (`sub_217E810`): The register must have exactly one non-dead, non-debug definition. Registers with multiple definitions print `"MULTIDEF"` and are rejected.
  - Opcode exclusion: A large switch/comparison tree excludes memory ops, atomics, barriers, texture ops, surface ops, and other side-effecting instructions. Specific exclusions exist for sm_62 (opcodes 380-396).
  - Operand safety: Instructions that define additional tied registers beyond the target are rejected.
  - Recursive verification (`sub_2181550`): All operands of the defining instruction must themselves be pullable, checked recursively up to depth 50.
- Second-chance heuristic (`sub_2181870`): Registers initially rejected because one of their operands was non-pullable are re-evaluated when those operands become pullable. This iterates until convergence, using a visit-count mechanism to prevent infinite loops. The hash function throughout is `h(regID) = 37 * regID`. Debug: `"After pre-check, <N> good candidates, <N> given second-chance"`, `"ADD <N> candidates from second-chance"`.
- Cost analysis (`sub_2183E30`): Each candidate receives a clone cost. Candidates with cost 0 are non-rematerializable.
- Selection: Sort candidates by cost (ascending). Greedily select the cheapest candidates until pressure is reduced to target. Double-wide register classes (size > 32) count as 2 for pressure purposes and have their cost doubled. Debug: `"Really Final Pull-in: <count> (<total_cost>)"`.
- Execute: For each selected register:
  - Clear it from the live-out bitmap via `sub_217F620`.
  - Propagate backward through predecessors via `sub_2185250`.
  - Clone the defining instruction at use sites via `sub_217E1F0`.
  - Replace register references via `sub_21810D0`.
  - Remove now-dead original definitions.
- Iterate: Repeat up to `nv-remat-max-times` (default 10) iterations until max pressure is at or below target, or no further progress is made.
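The cost-sorted greedy selection step, including the double-wide counting rule, can be sketched as follows (candidate tuples and the pressure bookkeeping are simplified assumptions, not the recovered data structures):

```python
# Sketch of cost-sorted greedy pull-in selection: double-wide register
# classes count as 2 toward pressure and have their cost doubled;
# cost 0 means "not rematerializable" and is skipped.
def select_pull_ins(cands, pressure, target):
    # cands: list of (regID, clone_cost, is_double_wide)
    scored = [(c * (2 if wide else 1), r, wide)
              for (r, c, wide) in cands if c > 0]
    scored.sort()                         # cheapest effective cost first
    chosen = []
    for cost, reg, wide in scored:
        if pressure <= target:
            break
        chosen.append(reg)
        pressure -= 2 if wide else 1      # double-wide frees 2 units
    return chosen, pressure

chosen, p = select_pull_ins(
    [(5, 3, False), (9, 0, False), (7, 2, True), (3, 8, False)],
    pressure=74, target=70)
print(chosen, p)
```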
Instruction Replacement (sub_21810D0)
When replacing a rematerialized register:
- Create a new virtual register of the same class via `sub_1E6B9A0`.
- Call the target's `replaceRegWith` method (vtable offset 152).
- Walk all uses of the original register ID and rewrite operands via `sub_1E310D0`.
- Handle special cases: `DBG_VALUE` (opcode 45) and NOP/PHI (opcode 0) instructions use stride-2 operand scanning.
Register Pressure Computation (sub_2186590)
Per-block pressure is computed by starting with the live-out set size, walking instructions in reverse, tracking register births (defs) and deaths (last uses), and recording the peak pressure point. The maximum across all blocks is returned.
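That reverse walk can be sketched in a few lines (register-ID sets stand in for concrete MachineOperands; illustrative only):

```python
# Sketch of per-block max-live computation: start from the live-out set,
# walk instructions in reverse; a def removes the register from the live
# set going upward, a use adds it; the peak set size is the pressure.
def block_max_live(insts, live_out):
    # insts: list of (defs, uses) register-ID sets, in program order
    live = set(live_out)
    peak = len(live)
    for defs, uses in reversed(insts):
        live -= set(defs)        # above its def, a register is not live
        live |= set(uses)        # a use extends liveness upward
        peak = max(peak, len(live))
    return peak

insts = [({1}, set()),       # r1 = ...
         ({2}, {1}),         # r2 = f(r1)
         ({3}, {1, 2})]      # r3 = g(r1, r2)
print(block_max_live(insts, live_out={3}))
```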
Key Functions
IR-Level
| Function | Address | Size | Role |
|---|---|---|---|
| Pass registration | sub_1CD0BE0 | -- | Registers "nvvmrematerialize" |
| Main driver | sub_1CE7DD0 | 67KB | Iterative live-in reduction loop |
| Block executor | sub_1CE67D0 | 32KB | "remat_" / "uclone_" creation |
| Pull-in cost | sub_1CE3AF0 | 56KB | Cost model and candidate selection |
| NLO main | sub_1CE10B0 | 48KB | Live-out normalization |
| NLO helper | sub_1CDC1F0 | 35KB | Inter-block NLO propagation |
| IV demotion | sub_1CD74B0 | 75KB | Induction variable narrowing |
| Load remat | sub_1CDE4D0 | -- | Load rematerialization sub-pass |
| Per-function init | sub_1CDA600 | -- | Data structure initialization |
| Rematerializability check | sub_1CD06C0 | -- | Determines if a value can be recomputed |
Machine-Level
| Function | Address | Size | Role |
|---|---|---|---|
| Main engine | sub_2186D90 | 47KB | Iterative pull-in algorithm |
| Max-live computation | sub_2186590 | -- | Per-block pressure analysis |
| MULTIDEF check | sub_217E810 | ~230 lines | Single-definition verification |
| Recursive pullability | sub_2181550 | ~110 lines | Operand chain verification (depth 50) |
| Second-chance | sub_2181870 | ~800 lines | Re-evaluation of rejected candidates |
| Cost evaluator | sub_2183E30 | -- | Clone cost computation |
| Liveness propagation | sub_2185250 | ~650 lines | Backward propagation + cloning |
| Instruction replacement | sub_21810D0 | ~290 lines | Register use rewriting |
| Remat allocation helper | sub_2184890 | ~477 lines | Pressure simulation |
Configuration Knobs
IR-Level Knobs (ctor_277_0 at 0x4F7BE0)
| Knob | Global | Default | Description |
|---|---|---|---|
| do-remat | dword_4FC05C0 | 3 | Master control. 0=off, 1=conservative, 2=normal, 3=full. |
| no-remat | qword_4FC0440 | (empty) | Comma-separated function exclusion list |
| remat-iv | dword_4FBFB40 | 4 | IV demotion level. 0=off, 4=full. |
| remat-load | dword_4FBFA60 | 1 | Load rematerialization. 0=off, 1=on. |
| remat-add | dword_4FBF980 | 0 | Add/GEP factoring. 0=off. |
| remat-single-cost-limit | dword_4FC0080 | 6000 | Max cost per single live-in reduction |
| remat-loop-trip | dword_4FBFFA0 | 20 | Default assumed loop trip count |
| remat-gep-cost | dword_4FBFEC0 | 6000 | Max cost for GEP rematerialization |
| remat-use-limit | dword_4FBFDE0 | 10 | Max number of uses for a candidate |
| remat-max-live-limit | dword_4FBFD00 | 10 | Max live-in limit for rematerialization |
| remat-maxreg-ceiling | dword_4FBF600 | 0 | Register ceiling (0 = uncapped) |
| remat-for-occ | dword_4FBF8A0 | 120 | Occupancy-driven rematerialization target |
| remat-lli-factor | dword_4FC0320 | 10 | Long-latency instruction cost factor |
| remat-ignore-single-cost | byte_4FBFC20 | false | Bypass per-value cost filter |
| remat-move | byte_4FC0400 | false | Remat move instructions |
| simplify-live-out | dword_4FBF520 | 2 | NLO level. 0=off, 2=full. |
| dump-remat | dword_4FC0240 | 0 | Debug dump level (0-4+) |
| dump-remat-iv | dword_4FC0160 | 0 | IV remat debug dump |
| dump-remat-load | dword_4FBF720 | 0 | Load remat debug dump |
| dump-remat-add | dword_4FBF640 | 0 | Add remat debug dump |
| dump-simplify-live-out | byte_4FBF400 | false | NLO debug dump |
Machine-Level Knobs (ctor_361_0 at 0x5108E0)
| Knob | Global | Default | Description |
|---|---|---|---|
| nv-remat-block | dword_4FD3820 | 14 | Bitmask controlling remat modes (bits 0-3) |
| nv-remat-max-times | dword_4FD3740 | 10 | Max outer loop iterations |
| nv-remat-block-single-cost | dword_4FD3660 | 10 | Max cost per single live value pull-in |
| nv-remat-block-map-size-limit | dword_4FD3580 | 6 | Map size limit for single pull-in |
| nv-remat-block-max-cost | dword_4FD3040 | 100 | Max total clone cost per live value reduction |
| nv-remat-block-liveout-min-percentage | dword_4FD3120 | 70 | Min liveout % for special consideration |
| nv-remat-block-loop-cost-factor | unk_4FD3400 | 20 | Loop cost multiplier |
| nv-remat-default-max-reg | unk_4FD3320 | 70 | Default max register pressure target |
| nv-remat-block-load-cost | unk_4FD2EC0 | 10 | Cost assigned to load instructions |
| nv-remat-threshold-for-spec-reg | unk_4FD3860 | 20 | Threshold for special register remat |
| nv-dump-remat-block | byte_4FD2E80 | false | Debug dump toggle |
| nv-remat-check-internal-live | byte_4FD2DA0 | false | Check internal liveness during MaxLive |
| max-reg-kind | qword_4FD2C20 | 0 | Kind of max register pressure info |
| no-mi-remat | qword_4FD2BE0 | (empty) | Skip remat for named functions |
| load-remat | word_4FD32F0 | true | Enable load rematerialization |
| vasp-fix1 | word_4FD3210 | false | VASP fix for volatile/addsp |
Complementary ptxas-side Knobs
The assembler (ptxas) has its own rematerialization controls that complement the CICC passes:
- `RegAllocRematEnable=1`
- `RegAllocEnableOptimizedRemat=1`
- `RematEnable=1`
- `SinkRematEnable=1`
- `RematBackOffRegTargetFactor=N`
Optimization Level Behavior
| Level | IR-Level Remat (nvvmrematerialize) | Machine-Level Remat (nv-remat-block) |
|---|---|---|
| O0 | Not run | Not run |
| Ofcmax | Not run | Not run |
| Ofcmid | Runs with do-remat=3 (full) | Not run |
| O1 | Runs with do-remat=3, remat-iv=4, remat-load=1 | Runs with nv-remat-block=14 (default bitmask) |
| O2 | Same as O1 | Same as O1 |
| O3 | Same as O1; may see more candidates due to additional inlining/unrolling | Same as O1; operates on more aggressively optimized MIR |
The do-remat master control (default 3) enables all rematerialization sub-phases at O1+. The machine-level pass is gated by its own NVVMPassOptions slot and runs only when the codegen pipeline includes the full register allocation sequence. At Ofcmax, neither pass runs because the fast-compile pipeline skips the full optimization and codegen stack. See Optimization Levels for the complete pipeline tier structure.
Diagnostic Strings
"Skip rematerialization on <funcname>"
"Block %s: live-in = %d"
"Total pull-in cost = %d"
"remat_"
"uclone_"
"nloNewBit"
"nloNewAdd"
"demoteIV"
"newBaseIV"
"iv_base_clone_"
"substIV"
"factor"
"Max-Live-Function(<num_blocks>) = <max_live>"
"Really Final Pull-in: <count> (<total_cost>)"
"MULTIDEF"
"Skip machine-instruction rematerialization on <name>"
"After pre-check, <N> good candidates, <N> given second-chance"
"ADD <N> candidates from second-chance"
"Pullable: <count>"
"live-out = <count>"
"Total Pullable before considering cost: <count>"
Reimplementation Checklist
- Live-in/live-out bitvector analysis. Build per-basic-block bitvector sets tracking which values are live-in and live-out, compute max live-in via hardware `popcnt`, and maintain a hash map of per-block counts.
- Occupancy-driven register target. Query the occupancy model to compute a target register count (default: `remat-for-occ=120`), apply heuristic adjustments based on occupancy cliff boundaries, and cap at `remat-maxreg-ceiling` when set.
- Candidate selection and cost model. Compute the live-in intersection across all blocks (bitwise AND), check rematerializability of each candidate via def-chain walking (bounded by `max-recurse-depth`), score candidates as `base_cost * use_factor` with loop-nesting scaling, filter by `remat-use-limit` / `remat-gep-cost` / `remat-single-cost-limit`, and sort cheapest-first.
- Block-level instruction cloning. Implement two clone types: `remat_`-prefix clones (full rematerialization of live-in values at use sites) and `uclone_`-prefix clones (use-level copies for live range splitting within the dominance chain), with proper use-def chain and debug location updates.
- IV demotion sub-pass. Identify 64-bit loop-header PHI nodes whose value range fits in 32 bits (`(val + 0x80000000) <= 0xFFFFFFFF`), create narrowed PHI replacements (`demoteIV`/`newBaseIV`/`substIV`), and rewrite loop exit conditions.
- NLO live-out simplification. Walk each block's live-out set, create `nloNewBit` instructions (AND/extract/trunc to actual used bit-width) and `nloNewAdd` instructions (local address recomputations) to reduce live-out register count at block boundaries.
- Machine-level pull-in algorithm (`nv-remat-block`). Implement the iterative MachineIR rematerialization engine: max-live computation via reverse instruction walk, MULTIDEF verification, recursive pullability checking (depth 50), second-chance heuristic for re-evaluating rejected candidates, cost-sorted greedy selection, and liveness propagation with instruction cloning at use sites.
- Iterative convergence loop. Wrap the IR-level pass in an up-to-5-iteration loop (recompute max live-in after each round, stop when the target is met) and the machine-level pass in an up-to-`nv-remat-max-times` loop.
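The first checklist item can be sketched in isolation. This is a minimal illustration assuming GCC/Clang builtins; the `BlockLiveness` type and function names are ours, not recovered from the binary -- it only shows the packed-bitvector shape that makes the `popcnt`-based max-live computation and the bitwise-AND intersection cheap:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One bitvector per basic block: bit v set => value v is live-in to the
// block. Packing into 64-bit words reduces the live-in count to one
// popcount per word, matching the checklist's popcnt-based max-live step.
struct BlockLiveness {
    std::vector<uint64_t> live_in;

    void set(unsigned value_id) {
        size_t word = value_id / 64;
        if (word >= live_in.size()) live_in.resize(word + 1, 0);
        live_in[word] |= uint64_t{1} << (value_id % 64);
    }
    unsigned count() const {
        unsigned n = 0;
        for (uint64_t w : live_in) n += __builtin_popcountll(w);
        return n;
    }
};

// Max live-in over all blocks: the quantity compared against the
// occupancy-derived register target on each iteration of the driver loop.
unsigned max_live_in(const std::vector<BlockLiveness>& blocks) {
    unsigned mx = 0;
    for (const auto& b : blocks) mx = std::max(mx, b.count());
    return mx;
}

// Live-in intersection (bitwise AND) across blocks, used to find values
// that are live everywhere and therefore worth rematerializing.
std::vector<uint64_t>
live_in_intersection(const std::vector<BlockLiveness>& blocks) {
    if (blocks.empty()) return {};
    std::vector<uint64_t> acc = blocks[0].live_in;
    for (size_t i = 1; i < blocks.size(); ++i) {
        acc.resize(std::min(acc.size(), blocks[i].live_in.size()));
        for (size_t w = 0; w < acc.size(); ++w) acc[w] &= blocks[i].live_in[w];
    }
    return acc;
}
```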
Architecture-Specific Behavior
The machine-level MULTIDEF checker (sub_217E810) contains architecture-specific opcode exclusions: opcodes 380-396 are rejected only when the target SM is sm_62 (GP10B, embedded Pascal), suggesting these instructions have rematerialization hazards specific to that microarchitecture. All other opcode exclusions apply uniformly across SM targets.
Test This
The following kernel creates high register pressure by keeping many independent values alive simultaneously. Compile with nvcc -ptx -arch=sm_90 -maxrregcount=32 to force a low register cap and observe rematerialization in action.
__global__ void remat_test(const float* __restrict__ in, float* __restrict__ out, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= n) return;
float a = in[tid];
float b = in[tid + n];
float c = in[tid + 2*n];
float d = in[tid + 3*n];
float e = in[tid + 4*n];
float f = in[tid + 5*n];
float g = in[tid + 6*n];
float h = in[tid + 7*n];
float r0 = a * b + c;
float r1 = d * e + f;
float r2 = g * h + a;
float r3 = b * c + d;
float r4 = e * f + g;
out[tid] = r0 + r1;
out[tid + n] = r2 + r3;
out[tid + 2*n] = r4 + r0;
}
What to look for in PTX:
- Address recomputation: the expressions `tid + k*n` are cheap to recompute. With `-maxrregcount=32`, the pass should rematerialize these address calculations at use sites rather than keeping them in registers. Look for repeated `mad.lo.s32` or `add.s32` instructions computing the same offset near each `ld.global` instead of a single computation early on.
- Compare the `.nreg` directive value between `-maxrregcount=32` and the default. The rematerialization pass trades extra ALU instructions for fewer registers to hit the lower target.
- With `-Xcicc -dump-remat=4`, cicc prints `"Total pull-in cost = %d"` for each candidate, showing the cost/benefit analysis.
- The `remat_` prefix on SSA names in LLVM IR dumps identifies rematerialized values.
Pipeline Interaction
The IR-level pass runs after live variable analysis has been computed and before instruction selection. Its register pressure reduction directly influences the occupancy achievable by the final kernel. The machine-level pass runs later, after instruction selection and register allocation, providing a second opportunity to reduce pressure on MachineIR where the register model is concrete rather than abstract. Together, the two passes form a layered rematerialization strategy: the IR pass makes broad, cost-effective reductions early, and the machine pass performs precise, targeted reductions late. Both passes interact with the register pressure analysis (rpa / machine-rpa) that feeds pressure estimates into scheduling and allocation decisions throughout the pipeline.
IV Demotion
IV demotion is NVIDIA's proprietary induction variable narrowing sub-pass, embedded within the IR-level rematerialization pass (nvvmrematerialize). It reduces register pressure by converting wide induction variables -- typically 64-bit -- to narrower types, typically 32-bit. On NVIDIA GPUs this is a high-impact optimization: the NVPTX ISA provides native 32-bit integer arithmetic in a single instruction, while 64-bit operations require multi-instruction sequences (add.cc + addc for a single 64-bit add, for example). A 64-bit loop induction variable that provably fits in 32 bits wastes two registers where one would suffice, and every arithmetic operation on it costs roughly twice the instruction count.
The sub-pass is large -- 75KB of compiled code, larger than the main rematerialization driver itself -- reflecting the complexity of proving that narrowing is safe across all uses of an IV, rewriting PHI nodes, adjusting comparisons, and handling edge cases where some uses require the original width while others can consume the narrowed version.
Key Facts
| Property | Value |
|---|---|
| Entry point | sub_1CD74B0 (75KB, ~2500 lines) |
| Parent pass | nvvmrematerialize (IR-level rematerialization) |
| Invocation site | sub_1CE7DD0 line ~2276 (post-remat phase) |
| Primary knob | remat-iv (default 4 = full demotion) |
| Debug knob | dump-remat-iv (default 0) |
| Gate condition | dword_4FBFB40 != 0 (non-zero enables the sub-pass) |
| Helper: IV analysis | sub_1CD5F30 |
| Helper: IV base lookup | sub_1CD5400 |
| Helper: cleanup | sub_1CD0600 |
| IR builder | sub_15FB440 (opcode, type, operand, name, insertpt) |
| Width query | sub_127FA20 (DataLayout::getTypeStoreSize) |
Demotion Levels
The remat-iv knob controls five demotion aggressiveness levels:
| Level | Behavior | Gate in binary |
|---|---|---|
| 0 | Disabled -- IV demotion entirely skipped | dword_4FBFB40 == 0 |
| 1--2 | Basic IV demotion. Only simple induction variables with constant step and all uses in the same loop body. | Default path |
| 3 | Extended IV demotion. Enables demotion of IVs whose uses extend to loop-exit comparisons and address computations outside the innermost loop. | line 1380: if (dword > 3) |
| 4 | Full demotion (default). Includes complex patterns: IVs used in GEP chains, IVs with multiple PHI consumers, and IVs that feed into both narrow and wide downstream computations. | line 1546: if (dword <= 4) |
| 5+ | Aggressive mode. Relaxes safety margins on range proofs, allowing demotion when the range check is tight (no headroom). | -- |
Level 4 is the default because it captures the vast majority of profitable demotion opportunities in real CUDA kernels without the correctness risk of aggressive mode.
Algorithm
Phase 1: Loop Iteration and PHI Identification
The algorithm iterates over every loop in the function (obtained from LoopInfo, sub_1440EE0). For each loop, it examines the loop header block's PHI nodes. Each PHI node is a candidate induction variable. The pass checks the PHI's type width via sub_127FA20 (DataLayout::getTypeStoreSize).
for each loop L in function:
header = L.getHeader()
for each PHI in header:
width = getTypeStoreSize(PHI.getType()) // sub_127FA20
if width != 64:
continue // only demote 64-bit IVs to 32-bit
Phase 2: Increment Pattern Analysis
For each 64-bit PHI, the pass identifies the increment pattern -- the value feeding back from the latch block. It verifies the pattern is a simple add/sub by a constant. The helper sub_1CD5F30 (IV analysis helper) walks the def-use chain of the PHI's backedge operand to extract the step value and verify linearity.
backedge_val = PHI.getIncomingValueForBlock(latch)
if backedge_val is not (PHI + constant) and
backedge_val is not (PHI - constant):
skip this PHI // non-linear IV, cannot demote
step = extract_constant(backedge_val)
Phase 3: Value Range Fitting
The critical safety check. The pass must prove that the IV's value never exceeds the 32-bit signed range throughout the loop's execution. The check uses an unsigned comparison trick:
(val + 0x80000000) <= 0xFFFFFFFF
This is equivalent to checking -2^31 <= val <= 2^31 - 1 (the signed i32 range). Adding 0x80000000 shifts the signed range to [0, 0xFFFFFFFF], which can be checked with a single unsigned comparison. The pass evaluates this condition on:
- The initial value (from the preheader incoming edge of the PHI).
- The final value (derived from the loop trip count and step).
- Conservatively, any intermediate values if the step is not +1/-1.
The initial value and trip count information come from the loop analysis infrastructure. The pass does not directly invoke SCEV (ScalarEvolution) -- it operates on NVIDIA's own IR-level live variable analysis and loop info passes (sub_1440EE0 for loop structure, sub_1BFC830 for live variable analysis). However, upstream LLVM's IndVarSimplify (sub_1945A50) may have already widened or simplified IVs using SCEV before this pass runs. The IV demotion pass operates on whatever IV structure remains after the main optimization pipeline.
If the range check fails, the IV is skipped. There is no speculative demotion with runtime guards.
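The shifted-range comparison is easy to verify directly. A hedged sketch (the helper name `fits_in_i32` is ours, not the binary's) of the exact check the pass applies:

```cpp
#include <cassert>
#include <cstdint>

// The binary's range proof: (val + 0x80000000) <= 0xFFFFFFFF, evaluated
// in 64-bit unsigned arithmetic. Adding 2^31 maps the signed i32 range
// [-2^31, 2^31 - 1] onto [0, 0xFFFFFFFF], so a single unsigned compare
// suffices. (Assumes val is far enough from INT64_MAX that the 64-bit
// addition itself cannot overflow -- true for loop bounds in practice.)
bool fits_in_i32(int64_t val) {
    return (uint64_t)(val + 0x80000000LL) <= 0xFFFFFFFFULL;
}
```

The pass applies this to the initial value, the final value, and (when the step is not unit) conservatively to intermediate values, exactly as listed above.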
Phase 4: Use Analysis and Classification
Before rewriting, the pass classifies every use of the original 64-bit IV:
- Narrow-safe uses: Arithmetic (add, sub, mul, shift), array indexing within the loop body. These can consume the 32-bit value directly.
- Comparison uses: Loop exit conditions (`icmp`). These need a narrow comparison instruction (`newICmp`).
- Address uses: GEP instructions that use the IV as an index. At level 4+, these are handled by cloning the base address computation (`iv_base_clone_`).
- Escape uses: Uses outside the loop (LCSSA PHIs, return values). These require sign/zero extension back to 64-bit.
The level knob gates which use categories are eligible:
| Use category | Minimum level |
|---|---|
| Same-block arithmetic | 1 |
| Loop exit comparisons | 2 |
| Cross-block GEP indexing | 3 |
| Multi-PHI consumers | 4 |
| Tight-range speculation | 5 |
Phase 5: Instruction Generation
Once an IV is approved for demotion, the pass generates four types of synthetic instructions:
demoteIV -- Truncation
v475 = "demoteIV";
v366 = sub_15FB440(11, destg, v401, &v475, v115);
// opcode 11 = trunc
Creates a trunc i64 %iv to i32 instruction, inserted at the point where the original IV was defined. This is the primary demotion: the new 32-bit value replaces the old 64-bit value for all narrow-safe uses.
IR before:
%iv = phi i64 [ %init, %preheader ], [ %iv.next, %latch ]
%iv.next = add i64 %iv, 1
IR after (demoteIV inserted):
%iv = phi i64 [ %init, %preheader ], [ %iv.next, %latch ]
%demoteIV = trunc i64 %iv to i32
newBaseIV -- Narrow PHI Replacement
v475 = "newBaseIV";
desth = sub_15FB440(11, v289, v427, &v475, destd);
When the entire loop can use a 32-bit IV, the pass creates a completely new PHI node with i32 type in the loop header. The old 64-bit PHI is not simply truncated -- a new narrow induction cycle is constructed:
- A narrow initial value: `%newInit = trunc i64 %init to i32`
- A narrow PHI: `%newBaseIV = phi i32 [ %newInit, %preheader ], [ %newInc, %latch ]`
- A narrow increment: `%newInc = add i32 %newBaseIV, <step32>`
The old 64-bit IV becomes dead if all uses are successfully rewritten.
IR after (full base IV replacement):
%newInit = trunc i64 %init to i32
%newBaseIV = phi i32 [ %newInit, %preheader ], [ %newInc, %latch ]
%newInc = add i32 %newBaseIV, 1
iv_base_clone_ -- Comparison Clone
v475 = "iv_base_clone_";
v214 = sub_15F4880(v210); // clone instruction
sub_164B780(v214, &v475); // set name
sub_15F2120(v395, v198); // insert into block
When some uses of the IV require the original 64-bit width -- typically the loop exit comparison or an address computation that cannot be narrowed -- the pass clones the IV's base value. The clone instruction preserves the original semantics while allowing the primary loop computation to proceed with the narrow type. The clone is placed at the specific use site rather than at the loop header, avoiding the register pressure cost of keeping the wide value live across the entire loop body.
substIV -- Use Replacement
After generating the narrow IV infrastructure, the pass walks all uses of the original wide IV and replaces them with the demoted version. This is the final rewriting step:
- Arithmetic uses: replaced with uses of `%newBaseIV` or `%demoteIV`.
- Comparison uses: replaced with narrow comparisons (`newICmp`) on the demoted value.
- PHI uses at LCSSA boundaries: a `sext`/`zext` is inserted to restore 64-bit width for consumers outside the loop.
The pass also creates newICmp instructions -- narrower comparison instructions that compare i32 values instead of i64 values, rewriting the loop exit condition to match the demoted IV.
After all use replacement, sub_1CD0600 performs dead code cleanup: if the original 64-bit IV has no remaining uses, the wide PHI and its increment chain are deleted.
GPU Motivation: 32-bit vs. 64-bit Performance
The performance gap between 32-bit and 64-bit integer operations on NVIDIA GPUs is substantial and architectural, not merely a throughput difference:
Instruction count. 64-bit integer addition on PTX compiles to two machine instructions (add.cc.u32 + addc.u32) because the hardware ALU is 32-bit wide. A 64-bit multiply is even worse: it decomposes into multiple 32-bit multiplies and adds. Every loop iteration with a 64-bit IV pays this tax on the increment alone.
Register pressure. A single i64 value occupies a pair of 32-bit registers. In a loop with 3 IVs, demoting all three frees 3 registers -- enough to cross an occupancy cliff and gain an entire warp group on many kernels.
Address arithmetic. CUDA uses 64-bit pointers (nvptx64 target), so loop index computations are promoted to i64 by default during LLVM IR generation. But most CUDA kernels operate on arrays smaller than 4 GB, making the upper 32 bits of the index perpetually zero. The IV demotion pass recovers this wasted precision.
Pipeline utilization. GPU SM pipelines have limited integer execution units. Halving the instruction count for IV arithmetic directly translates to higher utilization of other functional units (FP, memory) in the same warp cycle.
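The two-instruction tax on 64-bit adds can be made concrete in scalar code. A hedged sketch (names ours) of the decomposition that the `add.cc.u32` + `addc.u32` pair performs in hardware:

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit add decomposed onto a 32-bit ALU, mirroring the PTX pair
// add.cc.u32 (add low halves, set carry) + addc.u32 (add high halves
// with carry-in): two instructions and two register pairs where a
// demoted 32-bit IV would need one of each.
uint64_t add64_via_32(uint64_t a, uint64_t b) {
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);
    uint32_t lo = alo + blo;         // add.cc.u32: low halves, carry out
    uint32_t carry = (lo < alo);     // the carry flag the hardware tracks
    uint32_t hi = ahi + bhi + carry; // addc.u32: high halves plus carry
    return ((uint64_t)hi << 32) | lo;
}
```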
Configuration
Knobs (registered at ctor_277_0, address 0x4F7BE0)
| Knob | Global | Default | Description |
|---|---|---|---|
| `remat-iv` | dword_4FBFB40 | 4 | IV demotion level. 0=off, 1-2=basic, 3=extended, 4=full, 5+=aggressive. |
| `dump-remat-iv` | dword_4FC0160 | 0 | Debug dump verbosity for IV demotion. Non-zero enables diagnostic output. |
The remat-iv knob is read by the main rematerialization driver (sub_1CE7DD0) at the post-remat phase gate. When non-zero, sub_1CD74B0 is invoked. The level value is then read inside sub_1CD74B0 to control which demotion patterns are attempted.
Interaction with ptxas
The ptxas assembler has its own rematerialization controls (--knob RegAllocRematEnable, RematEnable, etc.) but does not have an IV demotion equivalent. IV demotion is purely an IR-level transformation -- by the time ptxas sees the code, the IVs are already narrow. The ptxas knob --advanced-remat (0/1/2) controls machine-level rematerialization but does not perform type narrowing.
Diagnostic Strings
All strings emitted by sub_1CD74B0:
"phiNode" -- PHI node identification during loop header scan
"demoteIV" -- Truncation instruction creation
"newInit" -- Narrow initial value for new base IV
"newInc" -- Narrow increment for new base IV
"argBaseIV" -- Base IV argument lookup
"newBaseIV" -- New narrow PHI node creation
"newICmp" -- Narrow comparison instruction creation
"iv_base_clone_" -- Clone of IV base for original-width uses
"substIV" -- Use replacement pass
These strings are set as instruction name prefixes via sub_164B780 (for cloned instructions) or passed directly to the IR builder sub_15FB440. They appear in IR dumps when dump-remat-iv is non-zero or when the module is printed after the rematerialization pass.
Differences from Upstream LLVM
Upstream LLVM's IndVarSimplify pass (indvars) performs IV widening and narrowing through SCEV-based analysis. NVIDIA's IV demotion sub-pass is a completely separate implementation with several key differences:
| Aspect | Upstream IndVarSimplify | NVIDIA IV Demotion |
|---|---|---|
| Analysis framework | SCEV (ScalarEvolution) | NVIDIA live variable analysis + LoopInfo |
| Direction | Primarily widens narrow IVs to canonical form | Narrows wide IVs to reduce register pressure |
| Motivation | Canonical form for other optimizations | Register pressure reduction for GPU occupancy |
| Placement | Early in optimization pipeline | Late, inside rematerialization (post-optimization) |
| Range proof | SCEV range analysis | Direct (val + 0x80000000) <= 0xFFFFFFFF check |
| IV creation | SCEV expander | Direct IR builder calls (sub_15FB440) |
| Configuration | indvars-widen-indvars (bool) | remat-iv (6-level integer knob) |
The two passes are complementary. IndVarSimplify runs early and may widen IVs for canonical form. Later, IV demotion runs inside rematerialization and narrows them back when the wide form causes excessive register pressure. This is not redundant work -- the early widening enables other optimizations (loop vectorization, strength reduction), and the late narrowing recovers the register cost after those optimizations have completed.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| IV demotion entry | sub_1CD74B0 | 75KB | Main algorithm: PHI scan, range check, rewrite |
| IV analysis helper | sub_1CD5F30 | -- | Walks def-use chain to extract step/linearity |
| IV base lookup | sub_1CD5400 | -- | Finds base value of induction variable |
| Dead IV cleanup | sub_1CD0600 | -- | Removes unreferenced wide IVs after demotion |
| IR builder | sub_15FB440 | -- | Creates instruction (opcode, type, operand, name, insertpt) |
| Clone instruction | sub_15F4880 | -- | Clones an IR instruction (for iv_base_clone_) |
| Set name prefix | sub_164B780 | -- | Sets name string on cloned instruction |
| Insert into block | sub_15F2120 | -- | Inserts instruction at specified position |
| Replace uses | sub_1648780 | -- | Rewrites all uses of a value to a new value |
| Delete dead instr | sub_15F20C0 | -- | Erases instruction from parent block |
| Type store size | sub_127FA20 | -- | DataLayout::getTypeStoreSize -- returns bit width |
Cross-References
- Rematerialization -- parent pass; IV demotion is invoked in the post-remat phase
- ScalarEvolution -- upstream SCEV framework; not used directly by IV demotion but related
- IndVarSimplify -- upstream IV canonicalization pass
- LLVM Optimizer -- pipeline context showing where rematerialization runs
- Knobs -- central knob inventory
Base Address Strength Reduction
Address computation is a disproportionately expensive category of work on NVIDIA GPUs. The integer ALU units that compute memory addresses are a scarce resource relative to the FP/tensor throughput the hardware is designed to maximize. A typical unrolled loop body touching four arrays at A[tid + i], B[tid + i], C[tid + i], D[tid + i] -- where tid is a function of threadIdx.x, blockIdx.x, and blockDim.x -- may emit four independent 64-bit multiply-add chains per iteration, each recomputing the same base expression base_ptr + tid * element_size. Reducing those four chains to one base computation plus three cheap constant-offset additions can halve the integer instruction count in the loop body and free address registers that would otherwise stay live across the entire loop.
Base Address Strength Reduction (BASR) is an NVIDIA-proprietary IR-level pass that performs exactly this transformation. It scans loop bodies for memory operations that share a common base pointer expression, finds the one with the minimum constant offset (the "anchor"), hoists the anchor computation, and rewrites all remaining addresses as (anchor + relative_offset). The pass is confirmed by the string "BaseAddressStrengthReduce" at decompiled line 457 of sub_1C67780.
Key Facts
| Property | Value |
|---|---|
| Pass name | BaseAddressStrengthReduce |
| Entry point | sub_1C67780 (Legacy PM), sub_2CA4A10 (New PM) |
| Binary size | 58 KB (~1,400 decompiled lines) |
| Pass type | NVIDIA-proprietary, IR-level, loop body transform |
| Primary knobs | do-base-address-strength-reduce (two levels: 1 = no conditions, 2 = with conditions) |
| Chain variant | do-base-address-strength-reduce-chain (separate boolean toggle) |
| Negative offset control | dword_4FBCAE0 (aggressiveness for negative-offset patterns) |
| IV limit | base-address-strength-reduce-iv-limit (parametric) |
| Max IV | base-address-strength-reduce-max-iv (parametric) |
| Debug dump | dump-base-address-strength-reduce |
| Required analyses | LoopInfo (sub_1632FA0), DataLayout |
| Option registration | ctor_263_0 at 0x4F36F0 (shared with SCEV-CGP, 44 strings total) |
| Companion pass | Common Base Elimination (sub_1C5DFC0) |
| Helper | Bitcast helper at sub_1C637F0 (28 KB, strings "baseValue", "bitCastEnd") |
Algorithm
The pass operates in six phases, executing once per function. It processes all loop bodies simultaneously using worklists seeded from LoopInfo.
Phase 1 -- Initialization (lines 452-497)
The entry function retrieves LoopInfo via sub_1632FA0 and extracts the module's DataLayout from the function object (path: (a1+184)->field+24->field+40). It then allocates bookkeeping state:
- Eight hash maps at stack offsets `v374`-`v399`, keyed by `Value*` (the base pointer). Each map entry holds a linked list of memory instructions that share that base.
- Multiple worklists for basic blocks containing loads vs. stores.
- Threshold: `v429 = 2` -- the minimum number of uses of the same base before the pass considers strength reduction worthwhile.
- Pass counter: `v438 = 1` -- the initial pass number (the pass may iterate).
Phase 2 -- Address Pattern Collection (lines 518-600)
For each instruction in the target basic blocks (drawn from the a4 worklist):
- `sub_1C57390` classifies the address expression, extracting its structural form.
- `sub_1CCB2B0` computes alignment information from the DataLayout.
- `sub_1456040` extracts the base pointer from the address expression.
The base pointer is then categorized into one of two buckets:
| Category | Condition | Hash map | Worklist | Description |
|---|---|---|---|---|
| Non-pointer-type base | type_id != 15 | v382 | v363 | Integer/GEP-derived base addresses |
| Pointer-type base | type_id == 15 | v378 | v360 | Bases that are raw pointers to globals |
For pointer-type bases, sub_1CCDC20 further extracts the underlying global variable, allowing grouping of addresses to the same global even when accessed through different local pointer variables.
Hash map insertion uses sub_1C50900. If the base pointer is new (not yet in the map), the instruction list is initialized and the base is appended to the corresponding worklist. Otherwise, the instruction is appended to the existing list for that base.
for each instruction I in target BBs:
addr_info = classify_address(I) // sub_1C57390
alignment = compute_alignment(addr_info) // sub_1CCB2B0
base_ptr = extract_base(addr_info) // sub_1456040
if type_of(base_ptr) != POINTER_TYPE:
map_insert(hash_map_v382, base_ptr, I) // sub_1C50900
if is_new_entry:
worklist_v363.append(base_ptr)
else:
global = extract_global(base_ptr) // sub_1CCDC20
map_insert(hash_map_v378, global, I)
if is_new_entry:
worklist_v360.append(global)
Phase 3 -- Anchor Finding (lines 430-470)
For each base pointer that has accumulated at least v429 (2) uses, the pass determines the "anchor" -- the use with the minimum constant offset. This is the instruction whose address computation will be hoisted and shared.
For each candidate base:
- `sub_1C53170` decomposes each address expression into a `(base, constant_offset)` pair.
- The pass iterates over all uses and finds the one with the smallest constant offset:
  - For offsets that fit in 64 bits: direct integer comparison via sign-extended values.
  - For offsets wider than 64 bits: reads from extended-precision word arrays and compares word-by-word.
- The minimum-offset use becomes the anchor.
function find_anchor(base_ptr, use_list):
min_offset = +INF
anchor = null
for each use U in use_list:
(base, offset) = decompose_address(U) // sub_1C53170
if bit_width(offset) <= 64:
val = sign_extend_64(offset)
else:
val = read_extended_precision(offset)
if val < min_offset:
min_offset = val
anchor = U
return (anchor, min_offset)
Phase 4 -- Address Rewriting (lines 578-600)
Once the anchor is identified:
- `sub_13A5B00` creates a new base address instruction from the anchor's address computation. This instruction is placed at the loop preheader or the dominating point of all uses.
- For every other instruction sharing the same base, the pass computes the relative offset: `relative_offset = original_offset - anchor_offset`.
- `sub_14806B0` creates a new address expression `(new_base + relative_offset)` and replaces the original address operand.
function rewrite_addresses(anchor, anchor_offset, use_list):
new_base = create_base_instruction(anchor) // sub_13A5B00
for each use U in use_list:
if U == anchor:
replace_address(U, new_base)
else:
(_, orig_offset) = decompose_address(U)
rel_offset = orig_offset - anchor_offset
new_addr = create_offset_add(new_base, rel_offset) // sub_14806B0
replace_address(U, new_addr)
After this transformation, a loop body that previously contained:
load (base + tid*stride + 0) // original: full GEP chain
load (base + tid*stride + 16) // original: full GEP chain
store (base + tid*stride + 32) // original: full GEP chain
store (base + tid*stride + 48) // original: full GEP chain
Becomes:
anchor = base + tid*stride + 0 // hoisted once
load anchor // offset 0: use anchor directly
load (anchor + 16) // cheap add
store (anchor + 32) // cheap add
store (anchor + 48) // cheap add
The three 64-bit multiply-add chains are replaced by three 64-bit immediate additions.
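The anchor-and-rewrite scheme of Phases 3-4 can be modeled on plain offset data. A hedged sketch under assumed types -- the `Use` struct and function name are ours; the binary operates on IR instructions via `sub_1C53170`/`sub_14806B0`, not integers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Each memory op sharing a base decomposes to a constant offset.
// The anchor is the minimum-offset use; every other use is rewritten
// as (anchor + relative_offset).
struct Use {
    int64_t offset;         // constant offset from the shared base
    int64_t rel = 0;        // offset relative to the anchor, after rewrite
    bool is_anchor = false;
};

void anchor_and_rewrite(std::vector<Use>& uses) {
    if (uses.size() < 2) return; // below the v429 = 2 use threshold
    auto it = std::min_element(uses.begin(), uses.end(),
        [](const Use& a, const Use& b) { return a.offset < b.offset; });
    it->is_anchor = true;        // this use's address gets hoisted once
    for (auto& u : uses) u.rel = u.offset - it->offset;
}
```

Running this on the four offsets from the example (0, 16, 32, 48) marks the offset-0 use as the anchor and leaves relative offsets 0/16/32/48, matching the rewritten loop body shown above.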
Phase 5 -- Negative Offset Handling (lines 512-520)
When dword_4FBCAE0 > 1 (the aggressiveness knob is set above default), the pass also considers address groups where the maximum offset has a negative sign bit. These represent patterns like:
load (base + tid*stride - 32)
load (base + tid*stride + 0)
load (base + tid*stride + 32)
Without this phase, the anchor would be the instruction at offset -32, producing negative relative offsets for the first use. Some hardware addressing modes handle negative offsets less efficiently, so this phase is gated separately.
For negative-offset candidates, the pass:
- Checks whether the base is loop-invariant via `sub_1C51340`.
- If loop-invariant, creates a separate common base via `sub_1C55CE0` that absorbs the negative component.
Phase 6 -- Red-Black Tree Tracking
The pass uses a red-black tree infrastructure (sub_220F040 for insertion, sub_220EF80 for lookup) shared with other NVIDIA passes. This provides O(log n) sorted-set operations for maintaining collections of instruction pointers and efficiently checking membership during the rewriting phase.
Hash Map Implementation
The address pattern hash maps use the standard DenseMap growth policy (75% load factor, 12.5% tombstone compaction) with NVVM-layer sentinels (-8 / -16). The resize/rehash logic lives in sub_1C54050 -- the same function used by Common Base Elimination. Hash keys are Value* pointers with linear probing. See Hash Table and Collection Infrastructure for the hash function and probing strategy.
Relationship with Common Base Elimination
BASR and Common Base Elimination (sub_1C5DFC0) attack the same problem -- redundant address computation -- but at different scopes and with different strategies:
| Dimension | Base Address Strength Reduction | Common Base Elimination |
|---|---|---|
| Scope | Intra-loop: operates within a single loop body | Inter-block: operates across the CFG using dominance |
| Grouping | Groups addresses by shared induction-variable-based base | Groups addresses by shared base pointer to the same global |
| Placement | Anchor placed at loop preheader | Anchor placed at common dominator of all uses |
| Offset model | Constant offsets relative to IV-derived base | Constant offsets relative to global-derived base |
| Entry point | sub_1C67780 | sub_1C5DFC0 |
| Size | 58 KB | 38 KB |
The two-pass approach is deliberate. Common Base Elimination runs first at the IR level, hoisting shared base expressions across control flow boundaries. BASR then runs within loop bodies, strength-reducing the IV-dependent address chains that CBE cannot handle because the IV changes each iteration.
Both passes share the same address decomposition helper (sub_1C53170), the same hash map infrastructure (sub_1C50900, sub_1C54050), and the same instruction creation routines (sub_13A5B00, sub_14806B0).
Relationship with SCEV-CGP
The BASR knobs are registered together with SCEV-CGP (Scalar-Evolution-based CodeGenPrepare) in ctor_263_0 at 0x4F36F0. This constructor registers 44 option strings total, covering both SCEV-CGP and BASR. The do-base-address-strength-reduce and do-scev-cgp knobs are stored in the same ctor_526_0 option block.
SCEV-CGP is a broader pass that performs SCEV-based address optimization using thread ID as an induction variable (scev-cgp-tid-max-value controls the maximum thread ID value for analysis). BASR is a sub-transformation within this address optimization framework -- it handles the specific case of multiple memory operations sharing a base, while SCEV-CGP handles the broader case of rewriting address expressions using scalar evolution.
Related SCEV-CGP knobs that interact with BASR:
| Knob | Purpose |
|---|---|
| `scev-cgp-old-base` | Controls whether SCEV-CGP creates new base expressions |
| `ignore-bad-base` | Bypasses validity checks on base pointer classification |
| `ignore-32-bit-overflow` | Skips 32-bit overflow checks in address arithmetic |
| `ignore-signed-32-bit-overflow` | Skips signed 32-bit overflow checks |
| `topo-sort-begin` | Controls topological sort start point for address chains |
| `special-reassociate-for-threadid` | Prevents reassociation from moving threadId-dependent expressions |
Configuration
Boolean Knobs
| Knob | Default | Description |
|---|---|---|
| `do-base-address-strength-reduce` | Enabled (level 2) | Master enable. Level 1 = unconditional; level 2 = with conditions (default); 0 = disabled. |
| `do-base-address-strength-reduce-chain` | Enabled | Enables the chain variant, which strength-reduces chains of dependent address computations |
| `dump-base-address-strength-reduce` | false | Prints diagnostic output when set |
Parametric Knobs
| Knob | Description |
|---|---|
| `base-address-strength-reduce-iv-limit` | Maximum number of induction variables to consider per loop |
| `base-address-strength-reduce-max-iv` | Maximum IV value for strength reduction eligibility |
Global Variables
| Global | Purpose |
|---|---|
| `dword_4FBCAE0` | Negative offset aggressiveness. When > 1, enables strength reduction of address groups with negative offsets. Also used as a special minimum-selection mode in MemorySpaceOpt. |
Diagnostic Strings
"BaseAddressStrengthReduce" -- Pass identification (line 457)
"baseValue" -- Bitcast helper: base value operand name (sub_1C637F0)
"bitCastEnd" -- Bitcast helper: end-of-chain marker (sub_1C637F0)
When dump-base-address-strength-reduce is enabled, the pass emits additional diagnostic output showing which base pointers were grouped, which anchor was selected, and which addresses were rewritten.
Key Functions
| Function | Address (Legacy) | Size | Role |
|---|---|---|---|
| Main entry | sub_1C67780 | 58 KB | Pass driver: initialization, collection, anchor finding, rewriting |
| Main entry (New PM) | sub_2CA4A10 | 62 KB | New Pass Manager variant |
| Address classifier | sub_1C57390 | -- | Classifies address expression structure |
| Address decomposer | sub_1C53170 | -- | Decomposes address into (base, constant_offset) pairs |
| Hash map insert | sub_1C50900 | -- | Inserts base pointer into pattern hash map |
| Hash map resize | sub_1C54050 | -- | Load-factor-based resize/rehash |
| Loop invariance check | sub_1C51340 | -- | Tests whether a value is loop-invariant |
| Negative offset handler | sub_1C55CE0 | -- | Creates common base for negative-offset patterns |
| Base instruction creation | sub_13A5B00 | -- | Creates the hoisted anchor address instruction |
| Offset rewriting | sub_14806B0 | -- | Creates (base + relative_offset) replacement |
| Base extraction | sub_1456040 | -- | Extracts base pointer from address expression |
| Global extraction | sub_1CCDC20 | -- | Extracts underlying global variable from pointer chains |
| Alignment computation | sub_1CCB2B0 | -- | Computes alignment from DataLayout |
| Bitcast helper | sub_1C637F0 | 28 KB | Handles bitcast chains in base address expressions |
| RB-tree insert | sub_220F040 | -- | Red-black tree insertion (shared infrastructure) |
| RB-tree lookup | sub_220EF80 | -- | Red-black tree membership check |
| LoopInfo retrieval | sub_1632FA0 | -- | Gets LoopInfo analysis for the function |
Cross-References
- Common Base Elimination -- the complementary inter-block pass
- Pass Overview & Inventory -- master pass listing
- Optimizer Pipeline -- pipeline position and option registration
- Rematerialization -- another pass trading computation for register pressure
- SCEV -- the scalar evolution analysis that SCEV-CGP (and indirectly BASR) depends on
Common Base Elimination
The Common Base Elimination pass hoists shared base address expressions to dominating points in the control flow graph, eliminating redundant recomputations of the same base pointer across multiple basic blocks. Where Base Address Strength Reduction targets intra-loop patterns driven by induction variables, Common Base Elimination operates at the inter-block level: it groups memory operations that share the same base pointer regardless of loop structure, finds their common dominator, and creates a single base computation at that dominator. Every original address is then rewritten as (hoisted_base + relative_offset).
This is a strictly GPU-motivated optimization. NVIDIA GPUs have limited integer ALU throughput relative to their floating-point pipelines, so any reduction in address arithmetic directly translates to freed execution slots for other work. On a typical CUDA kernel performing strided accesses across multiple branches (e.g., different cases of a switch over tile indices), the pass can eliminate dozens of redundant GEP chains that independently recompute the same base address.
The two-pass approach -- Common Base Elimination first at the IR level for inter-block redundancies, then Base Address Strength Reduction for intra-loop induction-variable patterns -- ensures comprehensive coverage of GPU address computation overhead.
Key Facts
| Property | Value |
|---|---|
| Pass name | "Common Base Elimination" |
| Entry point | sub_1C5DFC0 |
| Binary offset | 0x1C5DFC0 |
| Binary size | 38 KB (~850 decompiled lines) |
| Scope | Function-level |
| IR level | LLVM IR (pre-codegen) |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
| Complementary pass | Base Address Strength Reduction (sub_1C67780) |
| Primary knobs | scev-cgp-cross-block-limit -- limits common bases from a single block |
| Required analysis | Dominator tree (a1[23]), DataLayout |
Algorithm
The pass has four major phases: address decomposition, base pointer grouping, dominator-based hoisting, and address rewriting.
Phase 1 -- Address Expression Decomposition
For every memory operation (load, store, GEP-based address) in the function, the pass calls sub_1C53170 to decompose the address into a structured form:
struct AddressExpr {
Value *base_ptr; // The root pointer (alloca, global, argument)
Operand operands[]; // List of (index, constant_offset) pairs
unsigned operand_count; // Number of index terms
};
The result is stored as a (base_ptr, operand_list, operand_count) tuple. The decomposition strips away GEP chains to expose the underlying base pointer and accumulates constant offset terms separately from variable index terms. This is the same decomposition helper used by BASR (sub_1C67780), ensuring both passes reason about addresses in a compatible representation.
Phase 2 -- Base Pointer Grouping
The pass maintains two hash maps for grouping addresses:
Non-pointer-type bases (hash map at v382, keyed by base pointer value):
- Each memory operation whose decomposed base is not a pointer type (type_id != 15) is inserted via sub_1C50900.
- The hash entry accumulates a list of all instructions sharing that base.
- New bases are appended to worklist v363.
Pointer-to-global bases (hash map at v378, keyed by underlying global variable):
- For pointer-type bases, sub_1CCDC20 extracts the underlying global variable by walking through bitcast and GEP chains.
- This allows grouping addresses to the same global even when accessed through different local pointer variables.
- New globals are appended to worklist v360.
The hash maps use the standard DenseMap growth policy (75% load factor, 12.5% tombstone compaction) with NVVM-layer sentinels (-8 / -16). sub_1C54050 handles both resize and in-place rehash. See Hash Table and Collection Infrastructure for the complete specification.
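The described growth policy can be modeled compactly. The following Python sketch is illustrative only: the probing and slot layout are simplified, and the sentinel constants merely stand in for the NVVM-layer -8/-16 markers mentioned above.

```python
EMPTY, TOMB = -8, -16  # stand-ins for the NVVM-layer empty/tombstone sentinels

class BaseMap:
    """Toy open-addressing map modeling the described growth policy:
    resize at the 75% load factor; if tombstones exceed 12.5% of
    capacity, rehash in place instead of doubling."""

    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity   # capacity is a power of two
        self.live = 0
        self.tomb = 0

    def _find(self, key):
        mask = len(self.slots) - 1
        i = hash(key) & mask
        while self.slots[i] != EMPTY and self.slots[i] != key:
            i = (i + 1) & mask            # linear probe
        return i

    def _rehash(self, new_cap):
        keys = [s for s in self.slots if s != EMPTY and s != TOMB]
        self.slots = [EMPTY] * new_cap
        self.tomb = 0
        for k in keys:
            self.slots[self._find(k)] = k

    def insert(self, key):
        cap = len(self.slots)
        if 4 * (self.live + self.tomb + 1) > 3 * cap:      # 75% threshold
            # mostly tombstones -> compact in place, else double capacity
            self._rehash(cap if 8 * self.tomb > cap else cap * 2)
        i = self._find(key)
        if self.slots[i] == EMPTY:
            self.slots[i] = key
            self.live += 1

    def remove(self, key):
        i = self._find(key)
        if self.slots[i] == key:
            self.slots[i] = TOMB
            self.live -= 1
            self.tomb += 1
```

The in-place rehash branch models why sub_1C54050 handles "both resize and in-place rehash": when most occupied slots are tombstones, compaction at the same capacity restores probe-chain quality without growing the table.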
Phase 3 -- Dominator Walk and Base Hoisting
For each base pointer group containing two or more uses, the pass:
1. Finds the anchor. Among all constant offsets in the group, the operand with the minimum constant offset becomes the anchor. For offsets up to 64 bits, the constant is extracted directly from the GEP operand. For wider offsets (> 64 bits), the pass reads from extended-precision word arrays. Sign-extended comparisons determine the minimum.
2. Computes the common dominator. The pass reads the function's dominator tree from a1[23] and walks it to find the nearest block that dominates all use sites. This is the standard findNearestCommonDominator operation -- iteratively walk both paths toward the root until they meet.
3. Inserts the hoisted base. sub_13A5B00 creates a new base address computation (a GEP or add instruction) at the terminator insertion point of the common dominator block. The hoisted instruction computes base_ptr + min_offset, which is the anchor's address.
4. Rewrites all uses. For each original memory operation in the group, sub_14806B0 rewrites the address as (hoisted_base + (original_offset - min_offset)). Since the anchor's own relative offset is zero, it becomes a direct use of the hoisted base.
In pseudocode:
fn run_common_base_elimination(F: &Function):
let dom_tree = F.dominator_tree // a1[23]
let data_layout = F.module.data_layout
// Phase 1+2: decompose and group
let base_groups: HashMap<Value*, Vec<(Instruction*, ConstantOffset)>> = {}
let global_groups: HashMap<GlobalVariable*, Vec<(Instruction*, ConstantOffset)>> = {}
for bb in F.basic_blocks():
for inst in bb.instructions():
if !is_memory_op(inst): continue
let (base, offsets, count) = sub_1C53170(inst)
if base.type_id != POINTER_TYPE:
base_groups[base].push((inst, offsets))
else:
let gv = sub_1CCDC20(base) // extract global
global_groups[gv].push((inst, offsets))
// Phase 3+4: hoist and rewrite
for (base, uses) in chain(base_groups, global_groups):
if uses.len() < 2: continue
let min_offset = uses.iter().map(|u| u.offset).min()
let anchor_inst = uses.find(|u| u.offset == min_offset).inst
// Find common dominator of all use blocks
let dom_block = uses[0].inst.parent
for use in uses[1..]:
dom_block = dom_tree.find_nearest_common_dominator(
dom_block, use.inst.parent)
// Hoist: create base+min_offset at dominator
let hoisted = sub_13A5B00(dom_block, base, min_offset)
// Rewrite all uses
for (inst, offset) in uses:
let relative = offset - min_offset
sub_14806B0(inst, hoisted, relative)
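The anchor-selection arithmetic in steps 1 and 4 can be checked with a tiny runnable model (Python; the function and operand names here are illustrative, not recovered symbols):

```python
def rewrite_group(uses):
    """uses: [(name, constant_offset)] for memory ops sharing one base.
    Returns the hoisted offset (base + min_offset becomes the anchor
    address) and each op's relative offset against that anchor."""
    min_offset = min(off for _, off in uses)
    return min_offset, {name: off - min_offset for name, off in uses}

# Three loads at byte offsets 16, 4, 40 from the same base pointer:
hoisted, rel = rewrite_group([("ld0", 16), ("ld1", 4), ("ld2", 40)])
# hoisted == 4: ld1 is the anchor, its relative offset is 0, so it
# becomes a direct use of the hoisted base; ld0 and ld2 pay one add each.
```

Choosing the minimum offset as the anchor guarantees every relative offset is non-negative, which keeps the rewritten addresses within the same overflow regime as the originals.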
Phase 4 -- Pointer-to-Global Grouping
The global-variable grouping deserves special attention. Consider two local pointers p and q that both derive from the same global array g:
%p = getelementptr [1024 x float], ptr @g, i64 0, i64 %tid
%q = getelementptr [1024 x float], ptr @g, i64 0, i64 %tid2
Without the global extraction step, these would be in different groups (keyed by %p vs %q). The sub_1CCDC20 helper walks through the pointer chain to find the underlying @g, allowing the pass to recognize that both addresses target the same global and can share a hoisted base.
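A minimal model of that extraction step (Python; the node classes are hypothetical stand-ins for LLVM value kinds, not recovered structures):

```python
# Strip bitcast/GEP wrappers until a GlobalVariable (or a non-strippable
# base) is reached -- the grouping key described for sub_1CCDC20.
class GlobalVariable:
    def __init__(self, name):
        self.name = name

class BitCast:
    def __init__(self, src):
        self.src = src

class GEP:
    def __init__(self, src):
        self.src = src

def extract_global(ptr):
    while isinstance(ptr, (BitCast, GEP)):
        ptr = ptr.src                      # walk down the pointer chain
    return ptr if isinstance(ptr, GlobalVariable) else None

g = GlobalVariable("g")
p = GEP(BitCast(g))    # %p derived from @g through a cast and a GEP
q = GEP(g)             # %q derived directly from @g
# Both resolve to the same underlying global, so both memory operations
# land in the same group despite having different immediate base values.
```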
Cost-Benefit Analysis
The pass trades register pressure at the dominator for reduced address computation at use sites. This trade-off is particularly favorable on GPUs for two reasons:
Benefit -- Reduced integer ALU pressure. Each eliminated GEP chain frees integer ALU slots. On SM architectures, integer instructions compete for the same warp scheduler slots as floating-point instructions. A kernel with N memory operations sharing the same base saves up to (N-1) complete base address recomputations. For a kernel doing 8 loads from the same struct through different control-flow paths, this eliminates 7 redundant address computations.
Cost -- Extended live range at the dominator. The hoisted base must remain live from the dominator block down to every use site. On GPUs, each additional live register reduces occupancy (the number of concurrent warps per SM). The pass implicitly relies on the subsequent rematerialization pass (sub_1CE7DD0) to undo any hoisting decisions that prove too costly for register pressure -- if the hoisted value's live range crosses too many basic blocks, rematerialization will re-derive it closer to the use point.
The SCEV-CGP knob scev-cgp-cross-block-limit provides an explicit limit on how many common bases can be created from a single block, acting as a safety valve against excessive register pressure growth. The related scev-cgp-idom-level-limit constrains how far up the dominator tree the pass is willing to hoist.
Relationship with Base Address Strength Reduction
The two passes operate at different granularities and are intentionally complementary:
| Aspect | Common Base Elimination | Base Address Strength Reduction |
|---|---|---|
| Scope | Inter-block (dominator-based) | Intra-loop (induction-variable-based) |
| Target pattern | Multiple BBs accessing the same base | Loop body with base + stride * iv |
| Mechanism | Hoist to common dominator | Factor out common base, use incremented pointer |
| Key helper | sub_1C53170 (address decomposition) | sub_1C53170 (same decomposition) |
| Offset handling | Minimum-offset anchor | Minimum-offset anchor (same strategy) |
| Pipeline order | Runs first | Runs after CBE |
The shared address decomposition helper (sub_1C53170) and the shared rewriting infrastructure (sub_13A5B00 for creating new base computations, sub_14806B0 for rewriting addresses) confirm that these passes were designed as a coordinated pair. Common Base Elimination runs first to eliminate inter-block redundancies, leaving BASR to focus on the remaining intra-loop stride patterns. Without CBE running first, BASR would encounter more diverse base expressions in loop bodies, reducing its grouping effectiveness.
Both passes share the same 0x1C50000-0x1CCFFFF address range in the binary, and BASR's helper functions (e.g., sub_1C637F0 -- base address bitcast helper, strings "baseValue", "bitCastEnd") are directly adjacent to CBE's entry point.
Configuration
Direct Knobs
No CBE-specific enable/disable knob has been identified in the binary. The pass appears to be unconditionally enabled when the SCEV-CGP subsystem is active.
Related SCEV-CGP Knobs
| Knob | Type | Description |
|---|---|---|
| scev-cgp-cross-block-limit | int | Maximum number of common bases that can be created from a single block. Limits the register pressure increase from hoisting. |
| scev-cgp-idom-level-limit | int | Maximum dominator tree depth for hoisting. Prevents hoisting too far from use sites. |
| do-scev-cgp | bool | Master enable for the SCEV-CGP subsystem. Disabling this may also disable CBE. |
| do-base-address-strength-reduce | int | Two levels: 1 = basic, 2 = with conditions. Controls the companion BASR pass. |
| do-base-address-strength-reduce-chain | bool | Enables chained strength reduction in BASR. |
| base-address-strength-reduce-iv-limit | int | IV limit for BASR. |
| base-address-strength-reduce-max-iv | int | Maximum IVs considered by BASR. |
BASR Aggressiveness Knob
The global dword_4FBCAE0 controls aggressiveness for negative-offset handling in the BASR companion pass. When dword_4FBCAE0 > 1, BASR also considers base groups where the maximum offset has a negative sign bit, checking via sub_1C51340 whether the base is loop-invariant before creating a separate common base via sub_1C55CE0. This knob does not directly affect CBE but influences how much address redundancy remains for CBE to handle.
Diagnostic Strings
"Common Base Elimination"
The pass registers a single diagnostic string (its name). No additional debug/dump strings have been identified. The pass does not appear to have a dedicated dump knob analogous to dump-base-address-strength-reduce for BASR.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| CommonBaseElimination::run | sub_1C5DFC0 | 38 KB | Main entry point -- orchestrates all four phases |
| decomposeAddress | sub_1C53170 | -- | Decomposes a memory address into (base, offset_list, count) tuple. Shared with BASR. |
| hashMapGrowOrRehash | sub_1C54050 | -- | Hash map resize/rehash with load-factor policy |
| hashMapInsertOrLookup | sub_1C50900 | -- | Insert into or look up in the base-pointer hash map |
| extractGlobalFromPointerChain | sub_1CCDC20 | -- | Walks bitcast/GEP chains to find the underlying GlobalVariable |
| createCommonBaseForNegativeOffsets | sub_1C55CE0 | -- | Creates a separate common base when the max offset is negative. Used by BASR, available to CBE. |
| isBaseLoopInvariant | sub_1C51340 | -- | Checks whether a base address is loop-invariant |
| classifyAddressExpression | sub_1C57390 | -- | Classifies an instruction's address expression type |
| createNewBaseInstruction | sub_13A5B00 | -- | Creates a new base address computation at the insertion point |
| rewriteAddressAsBaseOffset | sub_14806B0 | -- | Rewrites an address as (new_base + relative_offset) |
| extractBasePointer (SCEV helper) | sub_1456040 | -- | Extracts the base pointer from an address expression (SCEV getStart/getOperand(0)) |
Cross-References
- Base Address Strength Reduction -- the companion intra-loop pass
- SCEV-CGP knobs -- knobs controlling cross-block limits and IDOM depth
- NVIDIA Custom Passes Overview -- pass inventory and registration
- Rematerialization -- downstream pass that can undo costly hoisting by re-deriving values closer to use sites
- Other NVIDIA Passes -- summary entries for CBE and BASR
- LLVM Optimizer -- two-phase pipeline where CBE runs
CSSA -- Conventional SSA for GPU Divergence
Standard SSA form assumes that a PHI node selects its incoming value based solely on the control flow edge along which execution arrived. On a scalar CPU, exactly one predecessor edge is taken per dynamic execution of the PHI, so this assumption holds trivially. On an NVIDIA GPU, it does not. A warp of 32 threads executes in lockstep, and when control flow diverges -- different threads take different branches -- all paths are eventually serialized and the warp reconverges. At the reconvergence point, a standard PHI node cannot correctly select a single incoming value because the warp carries live values from multiple predecessors simultaneously. The wrong thread could see the wrong value.
CSSA (Conventional SSA) is NVIDIA's transformation that rewrites the IR so that every PHI node is safe under warp-divergent execution. It does this by inserting explicit copy instructions at points where threads reconverge, ensuring that each thread's value is materialized into its own copy before the PHI merges anything. The name "Conventional SSA" comes from the SSA literature: a program is in CSSA form when every PHI node's operands can be simultaneously live without interfering -- the PHI web has no overlapping live ranges. This property is exactly what GPU divergence demands.
| Pass location | sub_3720740 (22KB, ~800 lines decompiled) |
| Address range | 0x3720740--0x3721501 (3521 bytes) |
| Gate knob | do-cssa (NVVMPassOptions boolean toggle) |
| Coalesce knob | cssa-coalesce (controls copy coalescing aggressiveness) |
| Debug knobs | cssa-verbosity, dump-before-cssa |
| Container knob | CSSACoalescing (NVVM container format, parsed at sub_CD9990) |
| Debug string | "IR Module before CSSA:\n" |
| Helper cluster | sub_371F790 (27KB), sub_371F160, sub_371EDF0, sub_371CDC0 |
| Pass-option slot | One of the 221 NVVMPassOptions slots (boolean do/don't pair) |
| Pipeline position | Late IR, after optimization, before SelectionDAG lowering |
| Upstream equivalent | None. LLVM has no concept of warp-divergent PHI semantics. |
GPU Divergence Background
The Warp Execution Model
NVIDIA GPUs execute threads in warps of 32 under the SIMT model: all threads share a program counter, and divergent branches serialize both paths before the warp reconverges. The full warp execution model and its implications for cicc are documented in the GPU Execution Model.
Why Standard SSA Breaks
Consider a diamond CFG:
entry
/ \
then else
\ /
join <-- PHI(%x = [then: %a], [else: %b])
On a CPU, the PHI at join works correctly: execution came from exactly one predecessor, so the PHI selects the corresponding value. On a GPU warp where threads 0-15 took then and threads 16-31 took else, both paths executed sequentially. When the warp reconverges at join, the PHI must produce %a for threads 0-15 and %b for threads 16-31 simultaneously in the same register. A naive lowering of the PHI to a simple register copy is incorrect -- whichever path executed last would overwrite the value from the first path.
The CSSA Solution
CSSA transforms the IR so that the PHI web has non-interfering live ranges. Concretely, it inserts copy instructions at the end of each predecessor block so that each thread's value is written into a dedicated copy before the warp reconverges:
entry
/ \
then else
%a_copy %b_copy <-- inserted copies (one per predecessor)
\ /
join
%x = PHI [then: %a_copy], [else: %b_copy]
Now the PHI's operands occupy distinct virtual registers. During later register allocation, the allocator can assign them the same physical register only when their live ranges truly do not overlap -- which is the correct condition for divergent execution. The copies give the allocator the freedom to keep the values separate when divergence requires it.
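The failure mode and the fix can be demonstrated with a toy lockstep simulation in Python (illustrative only -- real divergence is handled by hardware active masks, not lists):

```python
# Toy SIMT model: 8 threads in one warp; threads 0-3 take 'then',
# threads 4-7 take 'else'. Divergent paths execute serially.
THREADS = 8
took_then = [t < 4 for t in range(THREADS)]

# Scalar-style PHI lowering: the PHI becomes a whole-register move at the
# end of each path. The paths are serialized, so the second move clobbers
# the first for ALL lanes -- threads 0-3 wrongly observe "b".
x = ["a"] * THREADS          # 'then' path executes first
x = ["b"] * THREADS          # 'else' path executes second, overwriting all
scalar_result = x

# CSSA lowering: each predecessor materializes its value into a dedicated
# copy under its own active mask; the copies occupy distinct registers and
# never interfere, so the merge selects the right value per lane.
a_copy = ["a" if took_then[t] else None for t in range(THREADS)]
b_copy = [None if took_then[t] else "b" for t in range(THREADS)]
cssa_result = [a_copy[t] if took_then[t] else b_copy[t]
               for t in range(THREADS)]
```

The scalar lowering is exactly what a CPU-oriented PHI elimination would emit; the per-predecessor copies are what CSSA's non-interference property buys.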
Algorithm
The sub_3720740 function implements CSSA in several phases:
Phase 1: Basic Block Ordering and Numbering
The function begins by iterating over all basic blocks in the LLVM function (accessed via [r15], the LLVM Module/Function pointer) and assigning sequential numbering. Each basic block receives an ordinal stored at offset +0x48 (preorder index) and +0x4C (reverse postorder index). These indices are used later for dominance and reconvergence queries. The block list is walked via the standard LLVM doubly-linked list at function offsets +0x48/+0x50 (begin/end sentinels), with a secondary worklist stored in a dynamic array at [rbp-0x240] that grows via the standard SmallVector growth function sub_C8D5F0.
After ordering, the function sets byte [r8+0x70] = 1 and dword [r8+0x74] = 0 on the pass state object (at [r15+8]), marking the ordering phase as complete. If the ordering was already done (byte [r8+0x70] is non-zero on entry), the function skips directly to phase 2.
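A sketch of that numbering step on a diamond CFG (Python; the traversal is standard DFS, while the block offsets and worklist mechanics described above are not modeled):

```python
# Assign preorder and reverse-postorder indices over a toy CFG given as
# an adjacency dict -- the two indices CSSA stores per basic block.
def number_blocks(cfg, entry):
    pre, post, seen = {}, [], set()

    def dfs(block):
        seen.add(block)
        pre[block] = len(pre)            # preorder: numbered on entry
        for succ in cfg[block]:
            if succ not in seen:
                dfs(succ)
        post.append(block)               # postorder: recorded on exit

    dfs(entry)
    rpo = {b: i for i, b in enumerate(reversed(post))}
    return pre, rpo

cfg = {"entry": ["then", "else"], "then": ["join"],
       "else": ["join"], "join": []}
pre, rpo = number_blocks(cfg, "entry")
# In RPO, 'join' is numbered after both 'then' and 'else', which is the
# property dominance and reconvergence queries rely on.
```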
Phase 2: PHI Node Scanning and Hash Map Population
The function iterates over every basic block (outer loop at 0x37208C0) and within each block walks the instruction use-list (inner loop at 0x3720930). Instructions are identified by checking byte [rbx-0x18] (the LLVM Value tag / opcode byte) against 0x54 (decimal 84), which is the LLVM PHI node opcode. Non-PHI instructions are skipped.
For each PHI node found, the function:
- Increments a monotonic counter at
[r15+0x78]to assign a unique PHI ID. - Computes a hash of the PHI's pointer value using the standard NVIDIA hash:
h = (ptr >> 4) ^ (ptr >> 9), masked by(table_size - 1). This is the same hash function used across CICC's DenseMap infrastructure. - Inserts the PHI (or looks it up) in the hash map at
[r15+0x60]with metadata fields: key at[slot+0], PHI ID at[slot+8]. The hash table uses LLVM-layer sentinels (-4096 / -8192); see Hash Table and Collection Infrastructure for the probing and growth policy. - Calls
sub_A41E30to resize the hash table when the load factor exceeds the 75% threshold.
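The recovered hash is easy to reproduce; the Python sketch below illustrates why the shift amounts make sense for 16-byte-aligned heap pointers (the example addresses are arbitrary, not taken from the binary):

```python
def nvvm_ptr_hash(ptr, table_size):
    """h = (ptr >> 4) ^ (ptr >> 9), masked to a power-of-two table size."""
    assert table_size & (table_size - 1) == 0
    return ((ptr >> 4) ^ (ptr >> 9)) & (table_size - 1)

# 16-byte-aligned pointers carry no entropy in their low 4 bits; shifting
# by 4 discards them, and xoring in higher bits (>> 9) mixes allocations
# that differ only in mid-order bits.
slots = [nvvm_ptr_hash(0x7F00001000 + 16 * i, 64) for i in range(4)]
# Four consecutive 16-byte-spaced allocations land in distinct buckets.
```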
Phase 3: Copy Insertion at Reconvergence Points
After populating the PHI map, the function enters the copy-insertion phase. For each basic block that contains PHI nodes, it:
- Walks the PHI's incoming values (the use-list at offset +0x18 through the instruction's operand chain at 32-byte stride).
- For each incoming value, calls sub_371F160 with r8d=1 (the "insert copy" flag). This helper creates a copy instruction at the end of the predecessor block, before the terminator. The copy is named with the "pcp" (PHI copy propagation) prefix string, as evidenced by the lea rax, aPcp instruction at 0x3720D34.
- Calls sub_ACA8A0 to set the name on the newly created copy instruction.
- Calls sub_371CDC0 with an instruction builder struct to create the actual copy/move IR instruction. The call passes opcode 0x22D7 (8919 decimal) as the first argument via edi -- this is likely an NVVM-internal opcode for a divergence-safe copy.
- Calls sub_371EDF0 to insert the new copy instruction into the predecessor block's instruction list. This is followed by sub_BD84D0 (the standard LLVM insertBefore/insertAfter) to splice the instruction into position.
- Updates the PHI node's use chain: the operand that previously pointed to the original value now points to the copy. This rewiring is done at 0x3720C87--0x3720CDD by manipulating the 32-byte use-def chain entries (pointer at [use+0], predecessor at [use+8], backlink at [use+0x10]).
Phase 4: Instruction-Level Copy Propagation
After copy insertion, the function iterates over all basic blocks a second time (0x3720A2F--0x3720A62). For each instruction in each block, it calls sub_371F790 (27KB, the "NVPTX intrinsic operand builder" / copy propagation helper). This function propagates the "pcp" copies through the instruction graph, replacing uses of the original values with uses of the copies where appropriate, and eliminating redundant copies where the original value and the copy provably carry the same value for all threads.
Phase 5: Dead Copy Cleanup
The final phase walks a linked list at [r15+0x28] (a cleanup worklist). For each entry, it checks whether the instruction at [entry+8] has zero remaining uses ([rdi+0x10] == 0). If so, it calls sub_B43D60 to erase the dead instruction. This removes copies that were rendered unnecessary by the propagation phase.
Copy Coalescing
The cssa-coalesce knob controls how aggressively the pass coalesces the inserted copies back together. Without coalescing, CSSA inserts one copy per PHI operand per predecessor -- potentially a large number of copies in control flow with many branches. Coalescing identifies cases where two or more copies carry the same value and can share a single register, reducing the copy overhead.
The CSSACoalescing knob in the NVVM container format (parsed by sub_CD9990 from the finalizer knobs structure) provides a separate control path for the same behavior. The container knob is categorized alongside register allocation and scheduling controls (AdvancedRemat, DisablePredication, DisableXBlockSched, ReorderCSE), confirming that CSSA coalescing is considered part of the register allocation subsystem.
deSSA Alternative
The usedessa knob (default value 2, registered at ctor_358_0 at 0x50E8D0, stored in dword_4FD26A0) selects an alternative path for PHI elimination during the transition from SSA to machine code. Despite its name suggesting "de-Static Single Assignment", analysis of the dispatch functions shows it controls the scheduling and PHI elimination pipeline:
| Mode | Pre-RA Scheduling | Post-RA Scheduling | Behavior |
|---|---|---|---|
| 1 | Skipped | Minimal (single pass) | Simple mode -- no pre-RA scheduling |
| 2 (default) | Full (&unk_4FC8A0C) | Three passes + StackSlotColoring | Full mode -- complete scheduling pipeline |
The deSSA mode and CSSA transformation are complementary. CSSA operates at the LLVM IR level, converting PHI nodes into a form safe for GPU divergence before instruction selection. The usedessa mode controls how PHI nodes are ultimately eliminated during the MachineIR lowering, after SelectionDAG has already consumed the CSSA-transformed IR. When usedessa=2 (default), the full scheduling pipeline runs, giving the register allocator maximum flexibility to handle the extra copies that CSSA introduced. When usedessa=1, the minimal scheduling mode may be appropriate for debugging or for kernels where scheduling causes regressions.
Configuration Knobs
NVVMPassOptions Knob
| Knob | Type | Description |
|---|---|---|
| do-cssa | bool | Master enable/disable for the CSSA pass |
Set via -opt "-do-cssa=0" to disable the pass entirely.
cl::opt Knobs (ctor_705 at 0x5BD430)
| Knob | Type | Default | Global | Description |
|---|---|---|---|---|
| cssa-coalesce | int | (unknown) | (ctor_705 data) | Controls PHI operand coalescing aggressiveness. Higher values = more aggressive coalescing = fewer copies but higher risk of incorrect merging under divergence. |
| cssa-verbosity | int | 0 | (ctor_705 data) | Verbosity level for diagnostic output during the CSSA transformation. |
| dump-before-cssa | bool | false | qword_5050A28 | When non-zero, dumps the entire IR module before CSSA runs. Triggers the "IR Module before CSSA:\n" output followed by sub_A69980 (Module::print). |
Container-Format Knob
| Knob | Parsed At | Category | Description |
|---|---|---|---|
| CSSACoalescing | sub_CD9990 | Register allocation / scheduling | Controls CSSA coalescing from the NVVM container format. Parsed alongside AdvancedRemat, DisablePredication, DisableXBlockSched. |
Related Knob
| Knob | Type | Default | Global | Description |
|---|---|---|---|---|
| usedessa | int | 2 | dword_4FD26A0 | Selects deSSA method / scheduling pipeline mode. Mode 1 = simple (no pre-RA scheduling), mode 2 = full. |
Diagnostic Strings
"IR Module before CSSA:\n" -- Module dump header (dump-before-cssa)
"pcp" -- PHI copy propagation instruction name prefix
The "pcp" prefix is assigned to all copy instructions created by the CSSA pass. These copies can be identified in IR dumps by their %pcp naming. After register allocation, these copies may be eliminated (coalesced into the same physical register) or materialized as actual move instructions in the final PTX.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| CSSA main | sub_3720740 | 22KB | BB ordering, PHI scanning, copy insertion, cleanup |
| PCP builder | sub_371F790 | 27KB | PHI copy propagation / intrinsic operand builder |
| Copy insertion helper | sub_371F160 | -- | Creates copy instruction in predecessor block |
| Copy instruction creator | sub_371EDF0 | -- | Inserts copy into instruction list |
| Copy IR builder | sub_371CDC0 | -- | Builds the copy instruction IR node |
| Hash table grow | sub_A41E30 | -- | DenseMap resize for PHI hash table |
| Module printer | sub_A69980 | -- | Module::print (for dump-before-cssa) |
| raw_ostream::write | sub_CB6200 | -- | String output for debug dump |
| Debug stream getter | sub_C5F790 | -- | Returns current debug output stream |
| Instruction eraser | sub_B43D60 | -- | Erases dead instruction from parent block |
| Instruction insert | sub_BD84D0 | -- | BasicBlock::insert (instruction splice) |
| Name setter | sub_ACA8A0 | -- | Value::setName for "pcp" prefix |
| Use chain rewrite | sub_B96E90 | -- | replaceAllUsesWith on operand |
| Use helper | sub_B91220 | -- | Use-list manipulation |
| DenseMap grow helper | sub_C8D5F0 | -- | SmallVector/DenseMap capacity growth |
| Knob registration | ctor_705 (0x5BD430) | 5.4KB | Registers cssa-coalesce, cssa-verbosity, dump-before-cssa |
| Container knob parser | sub_CD9990 | 31KB | Parses CSSACoalescing from NVVM container |
| deSSA dispatch (post-RA) | sub_21668D0 | -- | Scheduling pipeline mode selector |
| deSSA dispatch (pre-RA) | sub_2165850 | -- | Pre-RA scheduling mode selector |
Differences from Upstream LLVM
LLVM's standard PHI elimination pass (llvm::PHIEliminationPass, registered as "phi-node-elimination" at pipeline slot 493 in CICC's pass parser) lowers PHI nodes to machine copies during the SelectionDAG-to-MachineIR transition. It operates under the assumption that PHI semantics follow scalar control flow -- exactly one predecessor contributes a value at each dynamic execution.
NVIDIA's CSSA pass runs before instruction selection, at the LLVM IR level, and transforms the IR into a form where PHI elimination can proceed safely even when the underlying execution model is SIMT. The two passes are not alternatives -- CSSA runs first to prepare the IR, then standard PHI elimination runs later to lower the CSSA-safe PHI nodes to machine copies.
This is one of the fundamental semantic gaps between LLVM's CPU-centric IR model and GPU reality. LLVM assumes sequential scalar semantics; NVIDIA's CSSA pass bridges that gap by making the implicit thread-level parallelism explicit in the copy structure of the IR.
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent CSSA transformation for GPU targets.
1. Inserting copies only at the merge block instead of at the end of each predecessor. The entire point of CSSA is that copies must be placed before the warp reconverges, not at the reconvergence point. If you insert the copy instruction at the beginning of the merge block (after the PHI), the warp has already reconverged and whichever path executed last has overwritten the register value for all threads. Copies must be at the terminator position of each predecessor block, before control leaves that block. This is the fundamental GPU-vs-CPU distinction: on a CPU, only one predecessor executes so placement does not matter; on a GPU, all predecessors may execute sequentially within the same warp.
2. Coalescing copies that have divergent live ranges. The cssa-coalesce knob controls how aggressively copies are merged back together. Over-aggressive coalescing can assign two copies to the same physical register when their live ranges overlap under divergence -- threads from different predecessor paths would see each other's values. The coalescer must verify that live ranges are truly non-interfering under the SIMT execution model, not just under the sequential CFG model. A reimplementation that reuses a standard LLVM register coalescer without divergence-aware interference checking will produce silent miscompilation on any kernel with divergent control flow.
3. Failing to insert copies for uniform PHI nodes that become divergent after later transformations. CSSA runs before instruction selection, but divergence analysis at that point may be imprecise. A PHI node classified as uniform (all threads agree on the incoming edge) may become effectively divergent after subsequent loop transformations or predication changes the control flow. The safe approach is to insert copies for all PHI nodes and let the coalescing phase remove unnecessary ones. A reimplementation that skips "uniform" PHI nodes based on divergence analysis risks correctness if that analysis is later invalidated.
4. Using a standard LLVM PHIElimination pass without the CSSA preprocessing step. LLVM's built-in PHI elimination assumes scalar control flow semantics (exactly one predecessor contributes at runtime). Running it directly on GPU IR without first converting to CSSA form will produce incorrect register assignments whenever a warp diverges at a branch leading to a PHI merge point. CSSA is not a replacement for PHI elimination -- it is a prerequisite that transforms PHI semantics into a form safe for the standard lowering.
5. Not propagating the "pcp" copy through the instruction graph after insertion. Phase 4 of the algorithm (copy propagation via sub_371F790) replaces uses of original values with uses of the inserted copies. A reimplementation that inserts copies but skips this propagation step will leave the PHI node still referencing the original value, making the copies dead. The subsequent dead-copy cleanup (Phase 5) will then erase them, and the transformation has no effect -- the original divergence-unsafe PHI remains.
Reimplementation Checklist
- Basic block ordering and numbering. Assign preorder and reverse-postorder indices to every basic block (stored at block offsets +0x48/+0x4C), used later for dominance and reconvergence queries.
- PHI node scanning and hash map population. Walk all instructions across all basic blocks, identify PHI nodes (opcode 0x54), assign monotonic IDs, and insert into a DenseMap using the hash `(ptr >> 4) ^ (ptr >> 9)` with LLVM-layer sentinels (-4096/-8192) and 75% load-factor growth.
- Copy insertion at reconvergence points. For each PHI node's incoming value, insert a `"pcp"`-prefixed copy instruction at the end of the predecessor block (before the terminator) using opcode 0x22D7 (divergence-safe copy), then rewire the PHI's use chain so the operand points to the copy instead of the original value.
- Copy propagation. Iterate all blocks a second time, invoking the PCP builder on each instruction to propagate inserted copies through the instruction graph, replacing uses of original values with uses of copies where appropriate and eliminating redundant copies where original and copy provably carry the same value for all threads.
- Dead copy cleanup. Walk the cleanup worklist, check each entry for zero remaining uses, and erase dead copy instructions via `eraseFromParent`.
- Copy coalescing (`cssa-coalesce`). Implement configurable coalescing that identifies cases where multiple `"pcp"` copies carry the same value and can share a single register, reducing copy overhead while preserving correctness under warp divergence.
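The copy-insertion step above can be sketched on a toy IR. This is a minimal illustration, not CICC's implementation: the `Block`/`Phi` structures, string-based instructions, and `insertCssaCopies` name are all hypothetical stand-ins for the binary's internal IR, and only the placement rule is modeled — copies land at the end of each predecessor (pitfall #1: never in the merge block), and the PHI is rewired to read the copy.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy IR: a block holds instruction strings; a PHI maps predecessor -> value.
// Hypothetical structures -- the real pass operates on cicc's internal IR.
struct Phi   { std::string name; std::map<std::string, std::string> incoming; };
struct Block { std::string name; std::vector<std::string> body; std::vector<Phi> phis; };

// Phases 2-3 of the CSSA transform: for every PHI operand, append a
// "pcp"-prefixed copy at the END of the predecessor (before its terminator
// in real IR, opcode 0x22D7 in cicc), then rewire the PHI so the operand
// points at the copy instead of the original value.
void insertCssaCopies(std::map<std::string, Block>& cfg, Block& merge) {
    for (Phi& phi : merge.phis) {
        for (auto& [pred, value] : phi.incoming) {
            std::string copy = "pcp." + phi.name + "." + pred;
            cfg[pred].body.push_back(copy + " = copy " + value);
            value = copy;  // rewire the use chain (Phase 4 would propagate further)
        }
    }
}
```

After the call, each predecessor carries its own copy, so whichever path a diverged warp executes last cannot clobber the value the other path's threads will read at the merge.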
Cross-References
- NVIDIA Custom Passes -- CSSA listed as `sub_3720740` with knobs `cssa-coalesce`, `cssa-verbosity`, `dump-before-cssa`
- Register Allocation -- greedy RA consumes CSSA-prepared IR
- Scheduling -- `usedessa` knob controls pre-RA/post-RA scheduling mode
- Code Generation Pipeline -- CSSA's position in the overall compilation flow
- StructurizeCFG -- related pass that ensures structured control flow for PTX
- Rematerialization -- CSSA copies may interact with remat decisions
- Configuration Knobs -- full knob inventory
Minor NVIDIA Passes
This page indexes NVIDIA-proprietary passes that are too small or insufficiently decompiled for dedicated pages. For the ten passes that were previously documented here and now have full pages, see the links below.
Passes with Dedicated Pages
| Pass | Page |
|---|---|
| NVVM IR Verifier | nvvm-verify (Deep Dive) |
| NVVM Intrinsic Lowering | nvvm-intrinsic-lowering |
| Dead Synchronization Elimination | dead-sync-elimination |
| IV Demotion | iv-demotion |
| Struct/Aggregate Splitting | struct-splitting |
| Base Address Strength Reduction | base-address-sr |
| Common Base Elimination | common-base-elim |
| CSSA (Conventional SSA) | cssa |
| FP128/I128 Emulation | fp128-emulation |
| Memmove Unrolling | memmove-unroll |
alloca-hoisting -- Entry Block Alloca Consolidation
| Field | Value |
|---|---|
| Pass ID | alloca-hoisting |
| Entry point | sub_21BC7D0 |
| Scope | Machine-level pass |
PTX requires all stack allocations to reside in the entry block. This pass moves alloca instructions inserted by inlining or loop transforms into the entry block, preserving order and alignment. Without it, non-entry-block allocas produce invalid PTX.
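The hoisting itself is mechanically simple. The sketch below works on a hypothetical flat instruction list (strings tagged by an `alloca` prefix) rather than CICC's machine IR, and models only the invariant the pass enforces: all allocas move to the front in their original relative order, so alignment and layout decisions remain stable.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy representation: one instruction per string. Hypothetical -- the real
// pass walks machine instructions across blocks and moves allocas into the
// entry block; here a single list stands in for the whole function.
using Inst = std::string;

// Partition allocas to the front, preserving relative order on both sides.
std::vector<Inst> hoistAllocas(const std::vector<Inst>& body) {
    std::vector<Inst> allocas, rest;
    for (const Inst& i : body)
        (i.rfind("alloca", 0) == 0 ? allocas : rest).push_back(i);
    allocas.insert(allocas.end(), rest.begin(), rest.end());
    return allocas;
}
```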
image-optimizer -- Texture/Surface Access Optimization
| Field | Value |
|---|---|
| Pass ID | nvptx-image-optimizer |
| Entry point | sub_21BCF10 |
| Scope | Machine-level pass (pre-emission) |
Groups related texture loads for cache utilization and merges redundant surface operations. Works in coordination with Replace Image Handles (below). See also Machine-Level Passes.
nvptx-peephole -- Machine-Level Peephole
| Field | Value |
|---|---|
| Pass ID | nvptx-peephole |
| Entry point | sub_21DB090 |
| Scope | Machine-level pass (pre-RA) |
| Knob | enable-nvvm-peephole (default: on) |
PTX-specific peephole that folds redundant cvta address space conversions, optimizes predicate patterns, and simplifies PTX-specific instruction sequences. Distinct from the IR-level NVVM Peephole. See Machine-Level Passes for pipeline position.
proxy-reg-erasure -- Redundant cvta.to.local Removal
| Field | Value |
|---|---|
| Pass ID | nvptx-proxy-reg-erasure |
| Entry point | sub_21DA810 |
| Scope | Machine-level pass (late post-RA) |
Removes redundant cvta.to.local instructions left by address space lowering. Runs late in the pipeline after register allocation. See Machine-Level Passes.
valid-global-names -- PTX Identifier Sanitization
| Field | Value |
|---|---|
| Pass ID | nvptx-assign-valid-global-names |
| Entry point | sub_21BCD80 |
| Scope | Machine-level pass (pre-emission) |
Rewrites global symbol names to comply with PTX naming rules, removing characters illegal in PTX identifiers (@, $, etc.). Runs immediately before PTX emission.
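A plausible sketch of the sanitization rule, assuming a conservative character set: keep `[A-Za-z0-9_]`, replace everything else with `_`, and prepend `_` if the result would start with a digit. The exact substitution scheme cicc uses has not been recovered; `sanitizePtxName` is a hypothetical name.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Rewrite a symbol name so it satisfies PTX identifier rules.
// Assumption: illegal characters are replaced by '_' rather than dropped.
std::string sanitizePtxName(const std::string& name) {
    std::string out;
    for (char c : name)
        out += (std::isalnum(static_cast<unsigned char>(c)) || c == '_') ? c : '_';
    if (out.empty() || std::isdigit(static_cast<unsigned char>(out[0])))
        out.insert(out.begin(), '_');  // identifiers may not start with a digit
    return out;
}
```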
replace-image-handles -- Texture/Surface Handle Substitution
| Field | Value |
|---|---|
| Pass ID | nvptx-replace-image-handles |
| Entry point | sub_21DBEA0 |
| Scope | Machine-level pass (pre-emission) |
Replaces IR-level texture/surface handle references with PTX-level .tex / .surf declarations. Paired with image-optimizer above. See Machine-Level Passes.
extra-mi-printer -- Register Pressure Diagnostics
| Field | Value |
|---|---|
| Pass ID | extra-machineinstr-printer |
| Entry point | sub_21E9E80 |
| Scope | Diagnostic (debug-only) |
Prints per-function register pressure statistics. Used for tuning pressure heuristics during development. Not active in release builds.
nvvm-intr-range -- Intrinsic Range Metadata
| Field | Value |
|---|---|
| Pass ID | nvvm-intr-range |
| Entry point | sub_216F4B0 |
| Scope | Function pass (IR level) |
| Knob | nvvm-intr-range-sm (ctor_359) |
Attaches !range metadata to NVVM intrinsics that return hardware-bounded values (threadIdx.x, blockIdx.x, etc.), enabling downstream known-bits analysis and range-based dead code elimination. Tightens ranges when __launch_bounds__ metadata is present. Documented in detail in KnownBits & DemandedBits.
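The range computation reduces to clamping an architectural cap by any `__launch_bounds__` annotation. A minimal sketch, assuming the sm_75+ limit of 1024 threads per block; `threadIdxRange` and the `Range` struct are illustrative names, not recovered symbols.

```cpp
#include <cassert>
#include <cstdint>

// Half-open interval [lo, hi), matching LLVM !range metadata semantics.
struct Range { uint64_t lo, hi; };

// Range for threadIdx.x: bounded by the hardware block-size limit, tightened
// when __launch_bounds__(maxntid) metadata is present (0 = annotation absent).
Range threadIdxRange(unsigned launchBoundsMaxNtid) {
    uint64_t cap = 1024;  // max threads per block on sm_75 and later
    if (launchBoundsMaxNtid && launchBoundsMaxNtid < cap)
        cap = launchBoundsMaxNtid;
    return {0, cap};
}
```

Downstream known-bits analysis can then prove, e.g., that `threadIdx.x < 256` comparisons are always true under `__launch_bounds__(256)`.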
GenericToNVVM -- Global Address Space Migration
| Field | Value |
|---|---|
| Pass ID | generic-to-nvvm |
| Entry point | sub_215DC20 |
| Size | 36 KB |
Moves global variables from generic address space (AS 0) to global address space (AS 1), inserting addrspacecast at use sites. Required because PTX globals must reside in .global memory. Documented in detail in PTX Emission.
Other Passes Documented Elsewhere
These passes appear in the NVPTX backend but have primary documentation on other pages:
| Pass | Entry | Primary Page |
|---|---|---|
| nvvm-pretreat | PretreatPass (New PM slot 128) | Optimizer Pipeline |
| NLO (Simplify Live Output) | sub_1CE10B0, sub_1CDC1F0 | Rematerialization |
| Prolog/Epilog | sub_21DB5F0 | Machine-Level Passes, PrologEpilogInserter |
| LDG Transform | sub_21F2780 (ldgxform) | Machine-Level Passes, Code Generation |
| Machine Mem2Reg | sub_21F9920 (nvptx-mem2reg) | Machine-Level Passes, Code Generation |
Pipeline & Pass Ordering
CICC v13.0 implements the LLVM New Pass Manager pipeline infrastructure, with NVIDIA injecting 33 custom passes into the registration table alongside approximately 493 standard LLVM passes. The master registration function at sub_2342890 populates a StringMap<PassInfo> hash table with every known pass name at startup, and a text-based pipeline parser allows the full pass ordering to be specified as a parenthesized string (e.g., module(function(instcombine,dse))). This page documents the complete pass inventory, the registration mechanism, the NVIDIA-specific additions, and — critically — the runtime pass execution order for each optimization level including the tier system and pass factory addresses.
| Master registration | sub_2342890 (0x2342890, ~2,816 lines) |
| Hash table insert | sub_E41FB0 (0xE41FB0) -- open-addressing, 48-byte entries |
| String equality | sub_9691B0 (0x9691B0) -- len==len && memcmp==0 |
| AA name resolver | sub_233BD40 (0x233BD40) -- chain of string comparisons |
| AA pipeline parser | sub_233C0C0 (0x233C0C0) -- splits on `,`, special-cases "default" |
| Extension callback | sub_233C300 (0x233C300) -- iterates [PassBuilder+2208], stride 32 |
| Option parser | sub_233A120 (0x233A120) -- splits on `;`, validates tokens |
| Help/listing | sub_233C410 (0x233C410) -- --print-pipeline-passes handler |
| Pipeline assembler | sub_12E54A0 (0x12E54A0, 49.8KB, 1,553 lines) |
| AddPass | sub_12DE0B0 (0x12DE0B0, hash-based pass insertion) |
| Tier 0 sub-pipeline | sub_12DE330 (0x12DE330, ~40 passes) |
| Tier 1/2/3 sub-pipeline | sub_12DE8F0 (0x12DE8F0, phase-conditional) |
| Codegen dispatch | sub_12DFE00 (0x12DFE00, 20.7KB) |
| Total passes | ~526 unique registrations |
| NVIDIA additions | 33 passes (12 module, 20 function, 1 loop) |
Registration Architecture
The pipeline infrastructure follows the standard LLVM New Pass Manager design. At startup, sub_2342890 is called once and inserts every known pass into a StringMap living at [PassBuilder+8]. The insertion function sub_E41FB0 uses open-addressing with linear probing; each entry occupies 48 bytes containing the key pointer, key length, value pointer, value length, and 16 bytes of inline storage for short class names.
Pass lookup during pipeline parsing uses the hash function at sub_C94890 (likely DJB/FNV-family). Parameterized passes are detected by the presence of <...> angle brackets after the pass name; the parameter string is extracted and forwarded to a pass-specific callback. The generic parameter validator sub_233A120 splits option strings on semicolons and compares each token to expected values, emitting "invalid {PassName} pass parameter '{token}'" on mismatch.
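The validator's behavior can be replayed in a few lines. This is a sketch of the recovered logic, not a decompilation: `validateParams` is a hypothetical name, and only the split-on-`;` / whitelist-check / error-format behavior is modeled.

```cpp
#include <cassert>
#include <set>
#include <string>

// sub_233A120-style validation: split the <...> option string on ';' and
// check each token against the pass's whitelist. The error text mirrors the
// recovered format string "invalid {PassName} pass parameter '{token}'".
// Returns the error message, or "" if every token is recognized.
std::string validateParams(const std::string& pass, const std::string& opts,
                           const std::set<std::string>& allowed) {
    size_t start = 0;
    while (start <= opts.size()) {
        size_t semi = opts.find(';', start);
        std::string tok = (semi == std::string::npos)
                              ? opts.substr(start)
                              : opts.substr(start, semi - start);
        if (!tok.empty() && !allowed.count(tok))
            return "invalid " + pass + " pass parameter '" + tok + "'";
        if (semi == std::string::npos) break;
        start = semi + 1;
    }
    return "";
}
```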
The alias analysis pipeline has its own parser at sub_233C0C0. It special-cases the string "default" (which calls sub_23A1380 then sub_23038C0 to build the default AA stack), and otherwise splits on commas, resolving each name through sub_233BD40:
| AA Name | Constructor |
|---|---|
| globals-aa | sub_2396EC0 |
| basic-aa | sub_2361CE0 |
| objc-arc-aa | sub_2361F60 |
| scev-aa | sub_2362040 |
| scoped-noalias-aa | sub_2362120 |
| tbaa | sub_2362200 |
Extension callbacks for target-specific pipeline customization are stored at [PassBuilder+2208] with a count at [PassBuilder+2216]. Each entry is 32 bytes with a guard at offset +16 (must be non-null) and the callback function pointer at offset +24. The string "all" in extension context triggers invalidate<all>.
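The 32-byte-entry layout can be exercised directly over a raw byte buffer. A sketch of the iteration loop attributed to sub_233C300, under the recovered offsets (guard at +16, callback pointer at +24, stride 32); the callback signature and `runExtensions` name are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed callback shape: mutates some pipeline-builder state.
using Callback = void (*)(int&);

// Sample callback for the test below (hypothetical).
void bumpExt(int& s) { ++s; }

// Iterate the extension table: entries are 32 bytes, the guard at offset
// +16 must be non-null, and the function pointer lives at offset +24.
int runExtensions(uint8_t* table, uint64_t count, int& state) {
    int fired = 0;
    for (uint64_t i = 0; i < count; ++i) {
        uint8_t* entry = table + i * 32;           // stride 32
        uint64_t guard;
        std::memcpy(&guard, entry + 16, 8);
        if (!guard) continue;                      // guard must be non-null
        Callback cb;
        std::memcpy(&cb, entry + 24, 8);           // callback at +24
        cb(state);
        ++fired;
    }
    return fired;
}
```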
Pipeline Text Parser
The pipeline text parser accepts a nesting grammar where each level specifies the pass manager scope:
module(
function(
instcombine<max-iterations=1>,
dse,
loop(indvars, loop-deletion)
),
globalopt
)
The parser splits on commas and parentheses, recognizing module(...), cgscc(...), function(...), and loop(...) as scope wrappers. Bare names are looked up in the StringMap built by sub_2342890. For parameterized passes, the <...> suffix is extracted and dispatched to per-pass option parsers. Several NVIDIA-specific parameter parsers are thin wrappers around sub_233A120:
| Parser | Pass | Recognized Options |
|---|---|---|
| sub_233A330 | process-restrict | propagate-only |
| sub_233A370 | lower-struct-args | opt-byval |
| sub_233A3B0 | lower-aggr-copies | lower-aggr-func-args |
More complex passes (GVN, SimplifyCFG, InstCombine) use chained sub_9691B0 string comparisons for multi-option parsing.
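The nesting grammar can be handled by a small depth-tracking splitter. This is a hypothetical sketch of the parsing strategy, not the binary's code: it splits on top-level commas, recurses into `module(...)`/`function(...)`-style scope wrappers, and strips `<...>` parameter suffixes to yield the flat list of leaf pass names.

```cpp
#include <cassert>
#include <string>
#include <vector>

void collectLeaves(const std::string& s, std::vector<std::string>& out);

// One top-level element: either a scope wrapper to recurse into, or a leaf
// pass name whose optional <params> suffix is dropped.
static void emitElement(const std::string& elem, std::vector<std::string>& out) {
    size_t paren = elem.find('(');
    if (paren != std::string::npos && !elem.empty() && elem.back() == ')') {
        // module(...), cgscc(...), function(...), loop(...): recurse into body
        collectLeaves(elem.substr(paren + 1, elem.size() - paren - 2), out);
    } else {
        out.push_back(elem.substr(0, elem.find('<')));
    }
}

// Split on commas at nesting depth 0; '(' and '<' open nesting, ')' and '>'
// close it, so commas inside scopes or parameter lists are not split points.
void collectLeaves(const std::string& s, std::vector<std::string>& out) {
    int depth = 0;
    size_t start = 0;
    for (size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '(' || s[i] == '<') ++depth;
        else if (s[i] == ')' || s[i] == '>') --depth;
        else if (s[i] == ',' && depth == 0) {
            emitElement(s.substr(start, i - start), out);
            start = i + 1;
        }
    }
    if (start < s.size()) emitElement(s.substr(start), out);
}
```

Running it on the grammar example from above, `module(function(instcombine<max-iterations=1>,dse),globalopt)` yields the leaves `instcombine`, `dse`, `globalopt`; the real parser additionally resolves each name through the StringMap built by sub_2342890.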
The pipeline name strings recognized by the nvopt<> dispatch table are:
| Pipeline Name | CLI Source | Pass Count |
|---|---|---|
| nvopt<O0> | (no -O flag, no -Ofc) | ~5--8 |
| nvopt<O1> | -O1 | ~35 |
| nvopt<O2> | -O2 | ~35+ |
| nvopt<O3> | -O3 | ~35+ |
| nvopt<Ofcmax> | -Ofast-compile=max / -Ofc=max | ~12--15 |
| nvopt<Ofcmid> | -Ofast-compile=mid / -Ofc=mid | ~25--30 |
| nvopt<Ofcmin> | -Ofast-compile=min / -Ofc=min | ~30--35 |
Key addresses for pipeline name dispatch: sub_226C400 selects the pipeline name string, which is passed to sub_2277440 (pipeline text parser). The nvopt prefix is registered in sub_225D540 (new PM) and sub_12C35D0 (legacy PM), both calling into a pipeline builder class at vtable unk_4A08350.
Mutual exclusion: combining -O# with --passes= is an error: "Cannot specify -O#/-Ofast-compile=<min,mid,max> and --passes=/--foo-pass, use -passes='default<O#>,other-pass' or -passes='default<Ofcmax>,other-pass'".
Complete Pass Inventory
The following tables list every pass in exact registration order within sub_2342890. NVIDIA-specific passes are marked with bold names. Registration line numbers are from the decompiled output.
Module Analyses (18)
| # | Pass Name | LLVM Class | Reg. Line |
|---|---|---|---|
| 1 | callgraph | CallGraphAnalysis | 514 |
| 2 | collector-metadata | CollectorMetadataAnalysis | — |
| 3 | ctx-prof-analysis | CtxProfAnalysis | — |
| 4 | dxil-metadata | DXILMetadataAnalysis | — |
| 5 | dxil-resource-binding | DXILResourceBindingAnalysis | — |
| 6 | dxil-resource-type | DXILResourceTypeAnalysis | — |
| 7 | inline-advisor | InlineAdvisorAnalysis | — |
| 8 | ir-similarity | IRSimilarityAnalysis | — |
| 9 | last-run-tracking | via sub_2342820 | — |
| 10 | lcg | LazyCallGraphAnalysis | — |
| 11 | module-summary | ModuleSummaryIndexAnalysis | — |
| 12 | no-op-module | NoOpModuleAnalysis | — |
| 13 | pass-instrumentation | via sub_2342830 | — |
| 14 | profile-summary | ProfileSummaryAnalysis | — |
| 15 | reg-usage | PhysicalRegisterUsageAnalysis | — |
| 16 | stack-safety | StackSafetyGlobalAnalysis | — |
| 17 | verify | via sub_2342840 | 596 |
| 18 | globals-aa | GlobalsAA | — |
Module Passes (131)
Registration lines 599--1153 in sub_2342890. Entries 19--121 are standard LLVM; entries 122--131 are NVIDIA custom passes registered at lines 1096--1153.
Standard LLVM Module Passes (entries 19--121)
| # | Pass Name | LLVM Class |
|---|---|---|
| 19 | always-inline | AlwaysInlinerPass |
| 20 | annotation2metadata | Annotation2MetadataPass |
| 21 | assign-guid | AssignGUIDPass |
| 22 | attributor | AttributorPass |
| 23 | attributor-light | AttributorLightPass |
| 24 | called-value-propagation | CalledValuePropagationPass |
| 25 | canonicalize-aliases | CanonicalizeAliasesPass |
| 26 | check-debugify | NewPMCheckDebugifyPass |
| 27 | constmerge | ConstantMergePass |
| 28 | coro-cleanup | CoroCleanupPass |
| 29 | coro-early | CoroEarlyPass |
| 30 | cross-dso-cfi | CrossDSOCFIPass |
| 31 | ctx-instr-gen | PGOInstrumentationGen |
| 32 | ctx-prof-flatten | PGOCtxProfFlatteningPass |
| 33 | noinline-nonprevailing | NoinlineNonPrevailing |
| 34 | deadargelim | DeadArgumentEliminationPass |
| 35 | debugify | NewPMDebugifyPass |
| 36 | dfsan | DataFlowSanitizerPass |
| 37 | dot-callgraph | CallGraphDOTPrinterPass |
| 38 | dxil-upgrade | DXILUpgradePass |
| 39 | elim-avail-extern | EliminateAvailableExternallyPass |
| 40 | extract-blocks | BlockExtractorPass |
| 41 | expand-variadics | ExpandVariadicsPass |
| 42 | forceattrs | ForceFunctionAttrsPass |
| 43 | function-import | FunctionImportPass |
| 44 | global-merge-func | GlobalMergeFuncPass |
| 45 | globalopt | GlobalOptPass |
| 46 | globalsplit | GlobalSplitPass |
| 47 | hotcoldsplit | HotColdSplittingPass |
| 48 | inferattrs | InferFunctionAttrsPass |
| 49 | inliner-ml-advisor-release | via sub_2342850 (InlinerWrapper) |
| 50 | inliner-wrapper | via sub_2342850 (InlinerWrapper) |
| 51 | inliner-wrapper-no-mandatory-first | via sub_2342850 |
| 52 | insert-gcov-profiling | GCOVProfilerPass |
| 53 | instrorderfile | InstrOrderFilePass |
| 54 | instrprof | InstrProfilingLoweringPass |
| 55 | ctx-instr-lower | PGOCtxProfLoweringPass |
| 56 | print<ctx-prof-analysis> | CtxProfAnalysisPrinterPass |
| 57 | invalidate<all> | via sub_2342860 |
| 58 | iroutliner | IROutlinerPass |
| 59 | jmc-instrumenter | JMCInstrumenterPass |
| 60 | lower-emutls | LowerEmuTLSPass |
| 61 | lower-global-dtors | LowerGlobalDtorsPass |
| 62 | lower-ifunc | LowerIFuncPass |
| 63 | lowertypetests | LowerTypeTestsPass |
| 64 | fatlto-cleanup | FatLtoCleanup |
| 65 | pgo-force-function-attrs | PGOForceFunctionAttrsPass |
| 66 | memprof-context-disambiguation | MemProfContextDisambiguation |
| 67 | memprof-module | ModuleMemProfilerPass |
| 68 | mergefunc | MergeFunctionsPass |
| 69 | metarenamer | MetaRenamerPass |
| 70 | module-inline | ModuleInlinerPass |
| 71 | name-anon-globals | NameAnonGlobalPass |
| 72 | no-op-module | NoOpModulePass |
| 73 | nsan | NumericalStabilitySanitizerPass |
| 74 | objc-arc-apelim | ObjCARCAPElimPass |
| 75 | openmp-opt | OpenMPOptPass |
| 76 | openmp-opt-postlink | OpenMPOptPass |
| 77 | partial-inliner | PartialInlinerPass |
| 78 | pgo-icall-prom | PGOIndirectCallPromotion |
| 79 | pgo-instr-gen | PGOInstrumentationGen |
| 80 | pgo-instr-use | PGOInstrumentationUse |
| 81 | pre-isel-intrinsic-lowering | PreISelIntrinsicLoweringPass |
| 82 | print | PrintModulePass |
| 83 | print-callgraph | CallGraphPrinterPass |
| 84 | print-callgraph-sccs | CallGraphSCCsPrinterPass |
| 85 | print-ir-similarity | IRSimilarityAnalysisPrinterPass |
| 86 | print-lcg | LazyCallGraphPrinterPass |
| 87 | print-lcg-dot | LazyCallGraphDOTPrinterPass |
| 88 | print-must-be-executed-contexts | MustBeExecutedContextPrinterPass |
| 89 | print-profile-summary | ProfileSummaryPrinterPass |
| 90 | print-stack-safety | StackSafetyGlobalPrinterPass |
| 91 | print<dxil-metadata> | DXILMetadataAnalysisPrinterPass |
| 92 | print<dxil-resource-binding> | DXILResourceBindingPrinterPass |
| 93 | print<inline-advisor> | InlineAdvisorAnalysisPrinterPass |
| 94 | print<module-debuginfo> | ModuleDebugInfoPrinterPass |
| 95 | print<reg-usage> | PhysicalRegisterUsageInfoPrinterPass |
| 96 | pseudo-probe | SampleProfileProbePass |
| 97 | pseudo-probe-update | PseudoProbeUpdatePass |
| 98 | recompute-globalsaa | RecomputeGlobalsAAPass |
| 99 | rel-lookup-table-converter | RelLookupTableConverterPass |
| 100 | rewrite-statepoints-for-gc | RewriteStatepointsForGC |
| 101 | rewrite-symbols | RewriteSymbolPass |
| 102 | rpo-function-attrs | ReversePostOrderFunctionAttrsPass |
| 103 | rtsan | RealtimeSanitizerPass |
| 104 | sample-profile | SampleProfileLoaderPass |
| 105 | sancov-module | SanitizerCoveragePass |
| 106 | sanmd-module | SanitizerBinaryMetadataPass |
| 107 | scc-oz-module-inliner | via sub_2342850 (InlinerWrapper) |
| 108 | shadow-stack-gc-lowering | ShadowStackGCLoweringPass |
| 109 | strip | StripSymbolsPass |
| 110 | strip-dead-debug-info | StripDeadDebugInfoPass |
| 111 | strip-dead-prototypes | StripDeadPrototypesPass |
| 112 | strip-debug-declare | StripDebugDeclarePass |
| 113 | strip-nondebug | StripNonDebugSymbolsPass |
| 114 | strip-nonlinetable-debuginfo | StripNonLineTableDebugInfoPass |
| 115 | trigger-crash-module | TriggerCrashModulePass |
| 116 | trigger-verifier-error | TriggerVerifierErrorPass |
| 117 | tsan-module | ModuleThreadSanitizerPass |
| 118 | tysan | TypeSanitizerPass |
| 119 | verify | via sub_2342870 |
| 120 | view-callgraph | CallGraphViewerPass |
| 121 | wholeprogramdevirt | WholeProgramDevirtPass |
NVIDIA Module Passes (entries 122--131)
| # | Pass Name | LLVM Class | Reg. Line | Purpose |
|---|---|---|---|---|
| 122 | check-gep-index | CheckGepIndexPass | 1096 | Validates GEP index bounds |
| 123 | check-kernel-functions | NVPTXSetFunctionLinkagesPass | 1101 | Enforces kernel linkage |
| 124 | cnp-launch-check | CNPLaunchCheckPass | 1106 | Cooperative launch validation |
| 125 | ipmsp | IPMSPPass | 1111 | Inter-procedural memory space propagation |
| 126 | nv-early-inliner | via sub_2342850 | 1114 | NVIDIA early inlining heuristic |
| 127 | nv-inline-must | InlineMustPass | 1119 | Force-inlines __forceinline__ functions |
| 128 | nvvm-pretreat | PretreatPass | 1124 | IR canonicalization before optimization |
| 129 | nvvm-verify | NVVMIRVerifierPass | 1129 | NVVM IR constraint validation |
| 130 | printf-lowering | PrintfLoweringPass | 1134 | Lowers printf to vprintf ABI |
| 131 | select-kernels | SelectKernelsPass | 1139 | Selects kernels for compilation |
Parameterized Module Passes (entries 132--145)
| # | Pass Name | Class | Parameters |
|---|---|---|---|
| 132 | asan | AddressSanitizerPass | kernel |
| 133 | cg-profile | CGProfilePass | in-lto-post-link |
| 134 | global-merge | GlobalMergePass | group-by-use;ignore-single-use;max-offset=N |
| 135 | embed-bitcode | EmbedBitcodePass | thinlto;emit-summary |
| 136 | globaldce | GlobalDCEPass | in-lto-post-link |
| 137 | hwasan | HWAddressSanitizerPass | kernel;recover |
| 138 | internalize | InternalizePass | preserve-gv=GV |
| 139 | ipsccp | IPSCCPPass | no-func-spec;func-spec |
| 140 | loop-extract | LoopExtractorPass | single |
| 141 | memprof-use | MemProfUsePass | profile-filename=S |
| 142 | msan | MemorySanitizerPass | recover;kernel;eager-checks;track-origins=N |
| 143 | print<structural-hash> | StructuralHashPrinterPass | detailed;call-target-ignored |
| 144 | lower-ops | LowerOpsPass | enable-optimization |
| 145 | set-global-array-alignment | SetGlobalArrayAlignmentPass | modify-shared-mem;skip-shared-mem;modify-global-mem;skip-global-mem |
CGSCC Analyses and Passes (entries 146--158)
| # | Pass Name | LLVM Class | Level |
|---|---|---|---|
| 146 | no-op-cgscc | NoOpCGSCCAnalysis | Analysis |
| 147 | fam-proxy | FunctionAnalysisManagerCGSCCProxy | Analysis |
| 148 | pass-instrumentation | via sub_2342830 | Analysis |
| 149 | argpromotion | ArgumentPromotionPass | Pass |
| 150 | attributor-cgscc | AttributorCGSCCPass | Pass |
| 151 | attributor-light-cgscc | AttributorLightCGSCCPass | Pass |
| 152 | invalidate<all> | via sub_2342860 | Pass |
| 153 | no-op-cgscc | NoOpCGSCCPass | Pass |
| 154 | openmp-opt-cgscc | OpenMPOptCGSCCPass | Pass |
| 155 | coro-annotation-elide | CoroAnnotationElidePass | Pass |
| 156 | coro-split | CoroSplitPass | Param: reuse-storage |
| 157 | function-attrs | PostOrderFunctionAttrsPass | Param: skip-non-recursive-function-attrs |
| 158 | inline | InlinerPass | Param: only-mandatory |
Function Analyses (entries 159--201)
Registration lines 1208--1415 in sub_2342890.
| # | Pass Name | LLVM Class |
|---|---|---|
| 159 | aa | AAManager |
| 160 | access-info | LoopAccessAnalysis |
| 161 | assumptions | AssumptionAnalysis |
| 162 | bb-sections-profile-reader | BasicBlockSectionsProfileReaderAnalysis |
| 163 | block-freq | BlockFrequencyAnalysis |
| 164 | branch-prob | BranchProbabilityAnalysis |
| 165 | cycles | CycleAnalysis |
| 166 | da | DependenceAnalysis |
| 167 | debug-ata | DebugAssignmentTrackingAnalysis |
| 168 | demanded-bits | DemandedBitsAnalysis |
| 169 | domfrontier | DominanceFrontierAnalysis |
| 170 | domtree | DominatorTreeAnalysis |
| 171 | func-properties | FunctionPropertiesAnalysis |
| 172 | machine-function-info | MachineFunctionAnalysis |
| 173 | gc-function | GCFunctionAnalysis |
| 174 | inliner-size-estimator | InlineSizeEstimatorAnalysis |
| 175 | last-run-tracking | via sub_2342820 |
| 176 | lazy-value-info | LazyValueAnalysis |
| 177 | loops | LoopAnalysis |
| 178 | memdep | MemoryDependenceAnalysis |
| 179 | memoryssa | MemorySSAAnalysis |
| 180 | no-op-function | NoOpFunctionAnalysis |
| 181 | opt-remark-emit | OptimizationRemarkEmitterAnalysis |
| 182 | pass-instrumentation | via sub_2342830 |
| 183 | phi-values | PhiValuesAnalysis |
| 184 | postdomtree | PostDominatorTreeAnalysis |
| 185 | regions | RegionInfoAnalysis |
| 186 | scalar-evolution | ScalarEvolutionAnalysis |
| 187 | should-not-run-function-passes | ShouldNotRunFunctionPassesAnalysis |
| 188 | should-run-extra-vector-passes | ShouldRunExtraVectorPasses |
| 189 | ssp-layout | SSPLayoutAnalysis |
| 190 | stack-safety-local | StackSafetyAnalysis |
| 191 | target-ir | TargetIRAnalysis |
| 192 | target-lib-info | TargetLibraryAnalysis |
| 193 | uniformity | UniformityInfoAnalysis |
| 194 | verify | via sub_2342840 |
| 195 | rpa | RegisterPressureAnalysis |
| 196 | merge-sets | MergeSetsAnalysis |
Function AA Analyses (entries 197--201)
| # | Pass Name | LLVM Class |
|---|---|---|
| 197 | basic-aa | BasicAA |
| 198 | objc-arc-aa | objcarc::ObjCARCAA |
| 199 | scev-aa | SCEVAA |
| 200 | scoped-noalias-aa | ScopedNoAliasAA |
| 201 | tbaa | TypeBasedAA |
Function Passes (entries 202--419)
Registration lines 1420--2319 in sub_2342890. Entries 202--375 are standard LLVM; entries 376--392 are NVIDIA-specific; entries 393--419 are parameterized passes (both standard and NVIDIA).
Standard LLVM Function Passes (entries 202--375)
| # | Pass Name | LLVM Class |
|---|---|---|
| 202 | aa-eval | AAEvaluator |
| 203 | adce | ADCEPass |
| 204 | add-discriminators | AddDiscriminatorsPass |
| 205 | aggressive-instcombine | AggressiveInstCombinePass |
| 206 | alignment-from-assumptions | AlignmentFromAssumptionsPass |
| 207 | annotation-remarks | AnnotationRemarksPass |
| 208 | assume-builder | AssumeBuilderPass |
| 209 | assume-simplify | AssumeSimplifyPass |
| 210 | atomic-expand | AtomicExpandPass |
| 211 | bdce | BDCEPass |
| 212 | break-crit-edges | BreakCriticalEdgesPass |
| 213 | callbr-prepare | CallBrPreparePass |
| 214 | callsite-splitting | CallSiteSplittingPass |
| 215 | chr | ControlHeightReductionPass |
| 216 | codegenprepare | CodeGenPreparePass |
| 217 | complex-deinterleaving | ComplexDeinterleavingPass |
| 218 | consthoist | ConstantHoistingPass |
| 219 | constraint-elimination | ConstraintEliminationPass |
| 220 | coro-elide | CoroElidePass |
| 221 | correlated-propagation | CorrelatedValuePropagationPass |
| 222 | count-visits | CountVisitsPass |
| 223 | dce | DCEPass |
| 224 | declare-to-assign | AssignmentTrackingPass |
| 225 | dfa-jump-threading | DFAJumpThreadingPass |
| 226 | div-rem-pairs | DivRemPairsPass |
| 227 | dot-cfg | CFGPrinterPass |
| 228 | dot-cfg-only | CFGOnlyPrinterPass |
| 229 | dot-dom | DOTGraphTraitsPrinter<DominatorTree, false> |
| 230 | dot-dom-only | DOTGraphTraitsPrinter<DominatorTree, true> |
| 231 | dot-post-dom | DOTGraphTraitsPrinter<PostDominatorTree, false> |
| 232 | dot-post-dom-only | DOTGraphTraitsPrinter<PostDominatorTree, true> |
| 233 | dse | DSEPass |
| 234 | dwarf-eh-prepare | DwarfEHPreparePass |
| 235 | expand-large-div-rem | ExpandLargeDivRemPass |
| 236 | expand-large-fp-convert | ExpandLargeFpConvertPass |
| 237 | expand-memcmp | ExpandMemCmpPass |
| 238 | extra-vector-passes | ExtraFunctionPassManager<ShouldRunExtraVectorPasses> |
| 239 | fix-irreducible | FixIrreduciblePass |
| 240 | flatten-cfg | FlattenCFGPass |
| 241 | float2int | Float2IntPass |
| 242 | gc-lowering | GCLoweringPass |
| 243 | guard-widening | via sub_2342880 |
| 244 | gvn-hoist | GVNHoistPass |
| 245 | gvn-sink | GVNSinkPass |
| 246 | helloworld | HelloWorldPass |
| 247 | indirectbr-expand | IndirectBrExpandPass |
| 248 | infer-address-spaces | InferAddressSpacesPass |
| 249 | infer-alignment | InferAlignmentPass |
| 250 | inject-tli-mappings | InjectTLIMappings |
| 251 | instcount | InstCountPass |
| 252 | instnamer | InstructionNamerPass |
| 253 | instsimplify | InstSimplifyPass |
| 254 | interleaved-access | InterleavedAccessPass |
| 255 | interleaved-load-combine | InterleavedLoadCombinePass |
| 256 | invalidate<all> | via sub_2342860 |
| 257 | irce | IRCEPass |
| 258 | jump-threading | JumpThreadingPass |
| 259 | jump-table-to-switch | JumpTableToSwitchPass |
| 260 | kcfi | KCFIPass |
| 261 | kernel-info | KernelInfoPrinter |
| 262 | lcssa | LCSSAPass |
| 263 | libcalls-shrinkwrap | LibCallsShrinkWrapPass |
| 264 | lint | LintPass |
| 265 | load-store-vectorizer | LoadStoreVectorizerPass |
| 266 | loop-data-prefetch | LoopDataPrefetchPass |
| 267 | loop-distribute | LoopDistributePass |
| 268 | loop-fusion | LoopFusePass |
| 269 | loop-load-elim | LoopLoadEliminationPass |
| 270 | loop-simplify | LoopSimplifyPass |
| 271 | loop-sink | LoopSinkPass |
| 272 | loop-versioning | LoopVersioningPass |
| 273 | lower-atomic | LowerAtomicPass |
| 274 | lower-constant-intrinsics | LowerConstantIntrinsicsPass |
| 275 | lower-expect | LowerExpectIntrinsicPass |
| 276 | lower-guard-intrinsic | LowerGuardIntrinsicPass |
| 277 | lower-invoke | LowerInvokePass |
| 278 | lower-widenable-condition | LowerWidenableConditionPass |
| 279 | make-guards-explicit | MakeGuardsExplicitPass |
| 280 | mem2reg | PromotePass |
| 281 | memcpyopt | MemCpyOptPass |
| 282 | memprof | MemProfilerPass |
| 283 | mergeicmps | MergeICmpsPass |
| 284 | mergereturn | UnifyFunctionExitNodesPass |
| 285 | move-auto-init | MoveAutoInitPass |
| 286 | nary-reassociate | NaryReassociatePass |
| 287 | newgvn | NewGVNPass |
| 288 | no-op-function | NoOpFunctionPass |
| 289 | normalize | IRNormalizerPass |
| 290 | objc-arc | ObjCARCOptPass |
| 291 | objc-arc-contract | ObjCARCContractPass |
| 292 | objc-arc-expand | ObjCARCExpandPass |
| 293 | pa-eval | PAEvalPass |
| 294 | partially-inline-libcalls | PartiallyInlineLibCallsPass |
| 295 | pgo-memop-opt | PGOMemOPSizeOpt |
| 296 | place-safepoints | PlaceSafepointsPass |
| 297 | print | PrintFunctionPass |
| 298--338 | print<access-info> ... print-predicateinfo | (41 printer passes) |
| 339 | reassociate | ReassociatePass |
| 340 | redundant-dbg-inst-elim | RedundantDbgInstEliminationPass |
| 341 | reg2mem | RegToMemPass |
| 342 | safe-stack | SafeStackPass |
| 343 | sandbox-vectorizer | SandboxVectorizerPass |
| 344 | scalarize-masked-mem-intrin | ScalarizeMaskedMemIntrinPass |
| 345 | sccp | SCCPPass |
| 346 | select-optimize | SelectOptimizePass |
| 347 | separate-const-offset-from-gep | SeparateConstOffsetFromGEPPass |
| 348 | sink | SinkingPass |
| 349 | sjlj-eh-prepare | SjLjEHPreparePass |
| 350 | slp-vectorizer | SLPVectorizerPass |
| 351 | slsr | StraightLineStrengthReducePass |
| 352 | stack-protector | StackProtectorPass |
| 353 | strip-gc-relocates | StripGCRelocates |
| 354 | tailcallelim | TailCallElimPass |
| 355 | transform-warning | WarnMissedTransformationsPass |
| 356 | trigger-crash-function | TriggerCrashFunctionPass |
| 357 | trigger-verifier-error | TriggerVerifierErrorPass |
| 358 | tsan | ThreadSanitizerPass |
| 359 | unify-loop-exits | UnifyLoopExitsPass |
| 360 | vector-combine | VectorCombinePass |
| 361 | verify | via sub_2342870 |
| 362--368 | verify<cycles> ... verify<scalar-evolution> | (7 verifiers) |
| 369--374 | view-cfg ... view-post-dom-only | (6 viewers) |
| 375 | wasm-eh-prepare | WasmEHPreparePass |
NVIDIA Function Passes (entries 376--392)
Registered at lines 2212--2292 of sub_2342890.
| # | Pass Name | LLVM Class | Reg. Line | Purpose |
|---|---|---|---|---|
| 376 | basic-dbe | BasicDeadBarrierEliminationPass | 2212 | Removes dead bar.sync instructions |
| 377 | branch-dist | BranchDistPass | 2217 | Branch distribution for divergence control |
| 378 | byval-mem2reg | ByValMem2RegPass | 2222 | Promotes byval arguments to registers |
| 379 | bypass-slow-division | BypassSlowDivisionPass | 2227 | Fast-path for small-operand division |
| 380 | normalize-gep | NormalizeGepPass | 2232 | GEP canonicalization for address arithmetic |
| 381 | nvvm-reflect-pp | SimplifyConstantConditionalsPass | 2237 | Folds __nvvm_reflect results (post-processing) |
| 382 | nvvm-peephole-optimizer | NVVMPeepholeOptimizerPass | 2242 | NVVM-specific peephole rewrites |
| 383 | old-load-store-vectorizer | OldLoadStoreVectorizerPass | 2247 | Legacy load/store vectorization |
| 384 | print<merge-sets> | MergeSetsAnalysisPrinterPass | 2252 | Printer for merge-sets analysis |
| 385 | remat | RematerializationPass | 2257 | Register-pressure-aware rematerialization |
| 386 | print<rpa> | RegisterPressurePrinterPass | 2262 | Printer for register pressure analysis |
| 387 | propagate-alignment | PropagateAlignmentPass | 2267 | Propagates alignment through pointer chains |
| 388 | reuse-local-memory | ReuseLocalMemoryPass | 2272 | Shares local memory across kernels |
| 389 | set-local-array-alignment | SetLocalArrayAlignmentPass | 2277 | Aligns stack arrays for coalescing |
| 390 | sinking2 | Sinking2Pass | 2282 | Enhanced instruction sinking |
| 391 | d2ir-scalarizer | ScalarizerPass (NVIDIA alias) | 2287 | NVIDIA-branded scalarization |
| 392 | sink<rp-aware> | SinkingPass (variant) | 2292 | Register-pressure-aware sinking |
Parameterized Function Passes (entries 393--419)
| # | Pass Name | Class | Parameters |
|---|---|---|---|
| 393 | cfguard | CFGuardPass | check;dispatch |
| 394 | early-cse | EarlyCSEPass | memssa |
| 395 | ee-instrument | EntryExitInstrumenterPass | post-inline |
| 396 | function-simplification | (byte_3F871B3) | O1;O2;O3;Os;Oz |
| 397 | gvn | GVNPass | no-pre;pre;no-load-pre;load-pre;... |
| 398 | instcombine | InstCombinePass | no-aggressive-aggregate-splitting;...;max-iterations=N |
| 399 | loop-unroll | LoopUnrollPass | O0;O1;O2;O3;full-unroll-max=N;... |
| 400 | loop-vectorize | LoopVectorizePass | no-interleave-forced-only;... |
| 401 | lower-allow-check | LowerAllowCheckPass | (empty) |
| 402 | lower-matrix-intrinsics | LowerMatrixIntrinsicsPass | minimal |
| 403 | lower-switch | LowerSwitchPass | enable-jump-table |
| 404 | mldst-motion | MergedLoadStoreMotionPass | no-split-footer-bb;split-footer-bb |
| 405 | print<da> | DependenceAnalysisPrinterPass | normalized-results |
| 406 | print<memoryssa> | MemorySSAPrinterPass | no-ensure-optimized-uses |
| 407 | print<stack-lifetime> | StackLifetimePrinterPass | may;must |
| 408 | scalarizer | ScalarizerPass | load-store;no-load-store;variable-insert-extract;... |
| 409 | separate-const-offset-from-gep | SeparateConstOffsetFromGEPPass | lower-gep |
| 410 | simplifycfg | SimplifyCFGPass | simplify-unreachable;...;bonus-inst-threshold=N |
| 411 | speculative-execution | SpeculativeExecutionPass | only-if-divergent-target |
| 412 | sroa | SROAPass | preserve-cfg;modify-cfg |
| 413 | structurizecfg | StructurizeCFG | skip-uniform-regions |
| 414 | win-eh-prepare | WinEHPreparePass | demote-catchswitch-only |
| 415 | bounds-checking | BoundsCheckingPass (modified) | trap |
| 416 | memory-space-opt | MemorySpaceOptPass | first-time;second-time;no-warnings;warnings |
| 417 | lower-aggr-copies | LowerAggrCopiesPass | lower-aggr-func-args |
| 418 | lower-struct-args | LowerStructArgsPass | opt-byval |
| 419 | process-restrict | ProcessRestrictPass | propagate-only |
LoopNest Passes (entries 420--423)
| # | Pass Name | LLVM Class |
|---|---|---|
| 420 | loop-flatten | LoopFlattenPass |
| 421 | loop-interchange | LoopInterchangePass |
| 422 | loop-unroll-and-jam | LoopUnrollAndJamPass |
| 423 | no-op-loopnest | NoOpLoopNestPass |
Loop Analyses (entries 424--428)
| # | Pass Name | LLVM Class |
|---|---|---|
| 424 | ddg | DDGAnalysis |
| 425 | iv-users | IVUsersAnalysis |
| 426 | no-op-loop | NoOpLoopAnalysis |
| 427 | pass-instrumentation | via sub_2342830 |
| 428 | should-run-extra-simple-loop-unswitch | ShouldRunExtraSimpleLoopUnswitch |
Loop Passes (entries 429--455)
| # | Pass Name | LLVM Class |
|---|---|---|
| 429 | canon-freeze | CanonicalizeFreezeInLoopsPass |
| 430 | dot-ddg | DDGDotPrinterPass |
| 431 | guard-widening | via sub_2342880 |
| 432 | extra-simple-loop-unswitch-passes | ExtraLoopPassManager<...> |
| 433 | indvars | IndVarSimplifyPass |
| 434 | invalidate<all> | via sub_2342860 |
| 435 | loop-bound-split | LoopBoundSplitPass |
| 436 | loop-deletion | LoopDeletionPass |
| 437 | loop-idiom | LoopIdiomRecognizePass |
| 438 | loop-idiom-vectorize | LoopIdiomVectorizePass |
| 439 | loop-instsimplify | LoopInstSimplifyPass |
| 440 | loop-predication | LoopPredicationPass |
| 441 | loop-reduce | LoopStrengthReducePass |
| 442 | loop-term-fold | LoopTermFoldPass |
| 443 | loop-simplifycfg | LoopSimplifyCFGPass |
| 444 | loop-unroll-full | LoopFullUnrollPass |
| 445 | loop-versioning-licm | LoopVersioningLICMPass |
| 446 | no-op-loop | NoOpLoopPass |
| 447 | print | PrintLoopPass |
| 448--450 | print<ddg>, print<iv-users>, print<loop-cache-cost>, print<loopnest> | (printers) |
| 451 | loop-index-split | LoopIndexSplitPass |
Parameterized Loop Passes (entries 452--455)
| # | Pass Name | Class | Parameters |
|---|---|---|---|
| 452 | licm | LICMPass | allowspeculation;conservative-calls |
| 453 | lnicm | LNICMPass | allowspeculation |
| 454 | loop-rotate | LoopRotatePass | no-header-duplication;header-duplication;... |
| 455 | simple-loop-unswitch | SimpleLoopUnswitchPass | nontrivial;no-nontrivial;trivial;no-trivial |
Machine Function Analyses (entries 456--475)
| # | Pass Name | LLVM Class |
|---|---|---|
| 456 | edge-bundles | EdgeBundlesAnalysis |
| 457 | livedebugvars | LiveDebugVariablesAnalysis |
| 458 | live-intervals | LiveIntervalsAnalysis |
| 459 | live-reg-matrix | LiveRegMatrixAnalysis |
| 460 | live-stacks | LiveStacksAnalysis |
| 461 | live-vars | LiveVariablesAnalysis |
| 462 | machine-block-freq | MachineBlockFrequencyAnalysis |
| 463 | machine-branch-prob | MachineBranchProbabilityAnalysis |
| 464 | machine-cycles | MachineCycleAnalysis |
| 465 | machine-dom-tree | MachineDominatorTreeAnalysis |
| 466 | machine-loops | MachineLoopAnalysis |
| 467 | machine-opt-remark-emitter | MachineOptimizationRemarkEmitterAnalysis |
| 468 | machine-post-dom-tree | MachinePostDominatorTreeAnalysis |
| 469 | machine-trace-metrics | MachineTraceMetricsAnalysis |
| 470 | pass-instrumentation | via sub_2342830 |
| 471 | regalloc-evict | RegAllocEvictionAdvisorAnalysis |
| 472 | regalloc-priority | RegAllocPriorityAdvisorAnalysis |
| 473 | slot-indexes | SlotIndexesAnalysis |
| 474 | spill-code-placement | SpillPlacementAnalysis |
| 475 | virtregmap | VirtRegMapAnalysis |
Machine Function Passes (entries 476--526)
| # | Pass Name | LLVM Class |
|---|---|---|
| 476 | dead-mi-elimination | DeadMachineInstructionElimPass |
| 477 | detect-dead-lanes | DetectDeadLanesPass |
| 478 | early-ifcvt | EarlyIfConverterPass |
| 479 | early-machinelicm | EarlyMachineLICMPass |
| 480 | early-tailduplication | EarlyTailDuplicatePass |
| 481 | finalize-isel | FinalizeISelPass |
| 482 | fixup-statepoint-caller-saved | FixupStatepointCallerSavedPass |
| 483 | localstackalloc | LocalStackSlotAllocationPass |
| 484 | machine-cp | MachineCopyPropagationPass |
| 485 | machine-cse | MachineCSEPass |
| 486 | machine-latecleanup | MachineLateInstrsCleanupPass |
| 487 | machine-scheduler | MachineSchedulerPass |
| 488 | machinelicm | MachineLICMPass |
| 489 | no-op-machine-function | NoOpMachineFunctionPass |
| 490 | opt-phis | OptimizePHIsPass |
| 491 | patchable-function | PatchableFunctionPass |
| 492 | peephole-opt | PeepholeOptimizerPass |
| 493 | phi-node-elimination | PHIEliminationPass |
| 494 | post-RA-sched | PostRASchedulerPass |
| 495 | postmisched | PostMachineSchedulerPass |
| 496 | post-ra-pseudos | ExpandPostRAPseudosPass |
| 497 | print | PrintMIRPass |
| 498--510 | print<livedebugvars> ... print<virtregmap> | (13 MF printers) |
| 511 | reg-usage-collector | RegUsageInfoCollectorPass |
| 512 | reg-usage-propagation | RegUsageInfoPropagationPass |
| 513 | register-coalescer | RegisterCoalescerPass |
| 514 | rename-independent-subregs | RenameIndependentSubregsPass |
| 515 | remove-redundant-debug-values | RemoveRedundantDebugValuesPass |
| 516 | require-all-machine-function-properties | RequireAllMachineFunctionPropertiesPass |
| 517 | stack-coloring | StackColoringPass |
| 518 | stack-slot-coloring | StackSlotColoringPass |
| 519 | tailduplication | TailDuplicatePass |
| 520 | trigger-verifier-error | TriggerVerifierErrorPass |
| 521 | two-address-instruction | TwoAddressInstructionPass |
| 522 | verify | MachineVerifierPass |
| 523 | verify<machine-trace-metrics> | MachineTraceMetricsVerifierPass |
| 524 | machine-sink | MachineSinkingPass (parameterized) |
| 525 | regallocfast | RegAllocFastPass (parameterized) |
| 526 | greedy | RAGreedyPass (parameterized, LAST registered) |
No NVIDIA-specific machine function passes were identified in the registration table; NVIDIA's machine-level customizations are implemented through target hooks in the NVPTX backend rather than as separately registered passes.
Runtime Pass Execution Order
Registration order (above) describes what is known to the pipeline parser. Runtime execution order is determined by sub_12E54A0 (the pipeline assembler) and controlled by the tier system. The execution order varies dramatically depending on: (1) optimization level, (2) fast-compile mode, (3) language string, and (4) individual pass enable/disable flags in NVVMPassOptions.
The AddPass Mechanism -- sub_12DE0B0
All runtime pass insertion uses sub_12DE0B0 (0x12DE0B0), a hash-table-based function that:
- Hashes the pass pointer: `(pass >> 9) ^ (pass >> 4)`
- Probes an open-addressed hash table at `passMgr+80`
- Stores the pass pointer and a flags byte (`flags | 2` if barrier set)
- Appends the pass pointer to a dynamic array at `passMgr[0]`
- Increments the counter at `passMgr+8`
The third parameter encodes pass type: 0 = ModulePass/AnalysisPass, 1 = FunctionPass. The fourth parameter is a scheduling barrier hint.
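The recovered behavior can be modeled compactly. The sketch below is a hedged C++ reconstruction of the AddPass scheme -- the hash, linear probe, flags byte, and ordered side array mirror the description above, but `PassTable` and all member names are illustrative inventions, not recovered symbols:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hedged model of the recovered AddPass scheme (sub_12DE0B0): a pointer-hashed
// open-addressed table plus an ordered side array. All names are illustrative.
struct PassTable {
    static const std::size_t kSlots = 1024;   // real capacity not recovered
    const void* slots[kSlots] = {};
    std::uint8_t flags[kSlots] = {};
    std::vector<const void*> ordered;         // models the append array at passMgr[0]
    std::size_t count = 0;                    // models the counter at passMgr+8

    void addPass(const void* pass, bool isFunctionPass, bool barrier) {
        std::uintptr_t p = reinterpret_cast<std::uintptr_t>(pass);
        std::size_t idx = ((p >> 9) ^ (p >> 4)) % kSlots;  // recovered hash
        while (slots[idx] != nullptr && slots[idx] != pass)
            idx = (idx + 1) % kSlots;                      // linear probe
        slots[idx] = pass;
        std::uint8_t f = isFunctionPass ? 1 : 0;           // third parameter: pass type
        if (barrier) f |= 2;                               // recovered "flags | 2" barrier bit
        flags[idx] = f;
        ordered.push_back(pass);                           // execution order kept separately
        ++count;
    }
};
```

The separate `ordered` array is what lets the hash table answer "is this pass already registered?" in O(1) while still preserving insertion order for execution.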
Tier System Architecture
The tier system is NVIDIA's mechanism for interleaving custom passes with standard LLVM passes at precise points. The main optimization loop in sub_12E54A0 iterates over a plugin/extension pass array at opts[4488..4496] (16-byte stride: vtable + phase_id), and fires tier sub-pipelines when the accumulated phase counter exceeds their thresholds:
```c
// Pseudocode from sub_12E54A0, lines 481-553
for (entry = opts[4488]; entry < opts[4496]; entry += 16) {
    phase_id = entry[8];
    if (opts[4224] && phase_id > opts[4228]) {   // Tier 0
        sub_12DE330(PM, opts);                   // Full optimization
        opts[4224] = 0;                          // Fire once
    }
    if (opts[3528] && phase_id > opts[3532]) {   // Tier 1
        sub_12DE8F0(PM, 1, opts);
        opts[3528] = 0;
    }
    if (opts[3568] && phase_id > opts[3572]) {   // Tier 2
        sub_12DE8F0(PM, 2, opts);
        opts[3568] = 0;
    }
    if (opts[3608] && phase_id > opts[3612]) {   // Tier 3
        sub_12DE8F0(PM, 3, opts);
        opts[3608] = 0;
    }
    pass = entry->vtable[72]();                  // Plugin pass factory call
    sub_12DE0B0(PM, pass, 1, 0);                 // Insert plugin pass
    if (opts[3904])                              // Debug mode
        insert_verifier_after_each();
}
// Remaining unfired tiers fire unconditionally after loop
```
The tier control fields in the NVVMPassOptions struct:
| Offset | Type | Field |
|---|---|---|
| +3528 | bool | Tier 1 enable |
| +3532 | int | Tier 1 phase threshold |
| +3568 | bool | Tier 2 enable |
| +3572 | int | Tier 2 phase threshold |
| +3608 | bool | Tier 3 enable |
| +3612 | int | Tier 3 phase threshold |
| +4224 | bool | Tier 0 (full optimization) enable |
| +4228 | int | Tier 0 phase threshold |
Infrastructure Setup (Always Runs)
These five passes are always inserted first, regardless of optimization level:
| Pos | Factory | Identity | AddPass Flags |
|---|---|---|---|
| 1 | sub_149CCE0 (alloc 368B) | TargetLibraryInfoWrapperPass | (PM, TLI, 0, 0) Module |
| 2 | sub_1BFB520 (alloc 208B) | TargetTransformInfoWrapperPass | (PM, TTI, 1, 0) Function |
| 3 | sub_14A7550 | VerifierPass / BasicAliasAnalysis | (PM, _, 0, 0) Module |
| 4 | sub_1361950 | AssumptionCacheTracker | (PM, _, 0, 0) Module |
| 5 | sub_1CB0F50 | ProfileSummaryInfoWrapperPass | (PM, _, 1, 0) Function |
Tier 0 -- Full Optimization (sub_12DE330)
Called when opts[4224] (optimization enabled) is set and the phase threshold is exceeded. This is the primary optimization sub-pipeline for O1/O2/O3, adding ~40 passes. Address: 0x12DE330.
Confidence note: Pass identifications are based on diagnostic strings, factory-function signatures, and pipeline ordering. Most identifications are HIGH confidence (confirmed by unique string literals). Entries marked `[MEDIUM confidence]` are inferred from code structure, argument patterns, or address proximity rather than direct string evidence.
| Pos | Factory Address | Likely Pass | Guard Condition |
|---|---|---|---|
| 1 | sub_1654860(1) | BreakCriticalEdges | always |
| 2 | sub_1A62BF0(1,0,0,1,0,0,1) | LLVM standard pipeline #1 | always |
| 3 | sub_1B26330 | MemCpyOpt | always |
| 4 | sub_185D600 | IPConstantPropagation | always |
| 5 | sub_1C6E800 | GVN | always |
| 6 | sub_1C6E560 | NewGVN/GVNHoist [MEDIUM confidence] | always |
| 7 | sub_1857160 | NVVMReflect | always |
| 8 | sub_1842BC0 | SCCP | always |
| 9 | sub_17060B0(1,0) | PrintModulePass | opts[3160] |
| 10 | sub_12D4560 | NVVMVerifier | always |
| 11 | sub_18A3090 | NVVMPredicateOpt | always |
| 12 | sub_184CD60 | ConstantMerge | always |
| 13 | sub_1869C50(1,0,1) | Sink/MemSSA [MEDIUM confidence] -- three-arg factory matches Sink with MemSSA parameters, but could also be a custom sinking variant | !opts[1040] |
| 14 | sub_1833EB0(3) | TailCallElim/JumpThreading [MEDIUM confidence] -- integer arg=3 could be JumpThreading threshold or TailCallElim mode; no disambiguating string | always |
| 15 | sub_17060B0(1,0) | PrintModulePass | opts[3160] |
| 16 | sub_1952F90(-1) | LoopIndexSplit | always |
| 17 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | always |
| 18 | sub_1A223D0 | NVVMIRVerification | always |
| 19 | sub_17060B0(1,0) | PrintModulePass | opts[3160] |
| 20 | sub_1A7A9F0 | InstructionSimplify | always |
| 21 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | always |
| 22 | sub_1A02540 | GenericToNVVM | always |
| 23 | sub_198DF00(-1) | LoopSimplify | always |
| 24 | sub_1C76260 | ADCE | !opts[1320] |
| 25 | sub_195E880(0) | LICM | opts[2880] |
| 26 | sub_19C1680(0,1) | LoopUnroll | !opts[1360] |
| 27 | sub_17060B0(1,0) | PrintModulePass | opts[3160] |
| 28 | sub_19401A0 | InstCombine | always |
| 29 | sub_1968390 | SROA | always |
| 30 | sub_196A2B0 | EarlyCSE | always |
| 31 | sub_19B73C0(2,-1,-1,-1,-1,-1,-1) | LoopUnswitch | always |
| 32 | sub_17060B0(1,0) | PrintModulePass | opts[3160] |
| 33 | sub_190BB10(0,0) | SimplifyCFG | always |
| 34 | sub_1A13320 | NVVMRematerialization | always |
| 35 | sub_17060B0(1,1) | PrintModulePass (verified) | opts[3160] |
| 36 | sub_18F5480 | DSE | always |
| 37 | sub_18DEFF0 | DCE | always |
| 38 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | always |
| 39 | sub_18B1DE0 | NVVMLoopPass/BarrierOpt [MEDIUM confidence] -- address is in NVVM pass range, but dual name reflects ambiguity between loop optimization and barrier optimization roles | always |
| 40 | sub_1841180 | FunctionAttrs | always |
Tier 1/2/3 -- Phase-Specific Sub-pipeline (sub_12DE8F0)
Called with tier number (1, 2, or 3). Address: 0x12DE8F0. Stores the tier value to qword_4FBB410. When tier==3, sets qword_4FBB370 = 6 if BYTE4 was 0 (enables advanced barrier and memory space optimization features).
The pass sequence is significantly longer than Tier 0 and varies by tier. The following shows the superset of all passes that can be inserted; tier-based guards are annotated.
Confidence note: Same methodology as Tier 0 table above. Most identifications confirmed by diagnostic strings or NVVMPassOptions slot cross-references.
| Pos | Factory Address | Likely Pass | Guard |
|---|---|---|---|
| 1 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] |
| 2 | sub_1A223D0 | NVVMIRVerification | !opts[2600] |
| 3 | sub_1CB4E40(1) | NVVMIntrinsicLowering (barrier) | !opts[2000] |
| 4 | sub_18E4A00 | NVVMBarrierAnalysis | opts[3488] |
| 5 | sub_1C98160(0) | NVVMLowerBarriers | opts[3488] |
| 6 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] |
| 7 | sub_12D4560 | NVVMVerifier | !opts[600] |
| 8 | sub_185D600 | IPConstPropagation | opts[3200] && !opts[920] |
| 9 | sub_1857160 | NVVMReflect | opts[3200] && !opts[880] |
| 10 | sub_18A3430 | NVVMPredicateOpt | opts[3200] && !opts[1120] |
| 11 | sub_1842BC0 | SCCP | opts[3200] && !opts[720] |
| 12 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] |
| 13 | sub_12D4560 | NVVMVerifier | !opts[600] |
| 14 | sub_18A3090 | NVVMPredicateOpt variant | opts[3200] && !opts[2160] |
| 15 | sub_184CD60 | ConstantMerge | opts[3200] && !opts[1960] |
| 16 | sub_190BB10(1,0) | SimplifyCFG | tier!=1 && !opts[1040] && !opts[1200] |
| 17 | sub_1952F90(-1) | LoopIndexSplit | (same guard) && !opts[1160] |
| 18 | sub_12D4560 | NVVMVerifier | (same guard) && !opts[600] |
| 19 | sub_17060B0(1,0) | PrintModulePass | (same guard) && !opts[1080] |
| 20 | sub_195E880(0) | LICM | opts[3704] && opts[2880] && !opts[1240] |
| 21 | sub_1C8A4D0(v) | EarlyCSE | v=1 if opts[3704] |
| 22 | sub_1869C50(1,0,1) | Sink | tier!=1 && !opts[1040] |
| 23 | sub_1833EB0(3) | TailCallElim | tier==3 && !opts[320] |
| 24 | sub_1CC3990 | NVVMUnreachableBlockElim | !opts[2360] |
| 25 | sub_18EEA90 | CorrelatedValuePropagation | opts[3040] |
| 26 | sub_12D4560 | NVVMVerifier | !opts[600] |
| 27 | sub_1A223D0 | NVVMIRVerification | !opts[2600] |
| 28 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] |
| 29 | sub_1C4B6F0 | Inliner | !opts[440] && !opts[480] |
| 30 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] |
| 31 | sub_1A7A9F0 | InstructionSimplify | !opts[2720] |
| 32 | sub_12D4560 | NVVMVerifier | !opts[600] |
| 33 | sub_1A02540 | GenericToNVVM | !opts[2200] |
| 34 | sub_198DF00(-1) | LoopSimplify | !opts[1520] |
| 35 | sub_1C76260 | ADCE | !opts[1320] && !opts[1480] |
| 36 | sub_17060B0(1,0) | PrintModulePass | (same guard) |
| 37 | sub_12D4560 | NVVMVerifier | (same guard) |
| 38 | sub_195E880(0) | LICM | opts[2880] && !opts[1240] |
| 39 | sub_1C98160(0/1) | NVVMLowerBarriers | opts[3488] |
| 40 | sub_19C1680(0,1) | LoopUnroll | !opts[1360] |
| 41 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] |
| 42 | sub_19401A0 | InstCombine | !opts[1000] |
| 43 | sub_196A2B0 | EarlyCSE | !opts[1440] |
| 44 | sub_1968390 | SROA | !opts[1400] |
| 45 | sub_19B73C0(tier,...) | LoopUnswitch | tier!=1, SM-arch-dependent params |
| 46 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] |
| 47 | sub_19B73C0(tier,...) | LoopUnswitch (2nd) | !opts[2760] |
| 48 | sub_1A62BF0(1,...) | LLVM standard pipeline | !opts[600] |
| 49 | sub_1A223D0 | NVVMIRVerification | !opts[2600] |
| 50 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] |
| 51 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] |
| 52 | sub_190BB10(0,0) | SimplifyCFG | !opts[960] |
| 53 | sub_1922F90 | NVIDIA loop pass | opts[3080] |
| 54 | sub_195E880(0) | LICM | opts[2880] && !opts[1240] |
| 55 | sub_1A13320 | NVVMRematerialization | !opts[2320] |
| 56 | sub_1968390 | SROA | !opts[1400] |
| 57 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] |
| 58 | sub_18EEA90 | CorrelatedValuePropagation | opts[3040] |
| 59 | sub_18F5480 | DSE | !opts[760] |
| 60 | sub_18DEFF0 | DCE | !opts[280] |
| 61 | sub_1A62BF0(1,...) | LLVM standard pipeline | !opts[600] |
| 62 | sub_1AAC510 | NVIDIA-specific pass | !opts[520] && !opts[560] |
| 63 | sub_1A223D0 | NVVMIRVerification | !opts[2600] |
| 64 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] |
| 65 | sub_1C8E680 | MemorySpaceOpt | !opts[2680], param from opts[3120] |
| 66 | sub_1A223D0 | NVVMIRVerification | opts[3120] && !opts[2600] |
| 67 | sub_17060B0(1,0) | PrintModulePass (barrier) | !opts[1080] |
| 68 | sub_1CC71E0 | NVVMGenericAddrOpt | !opts[2560] |
| 69 | sub_1C98270(1,opts[2920]) | NVVMLowerBarriers variant | opts[3488] |
| 70 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] |
| 71 | sub_1C6FCA0 | ADCE | opts[2840] && !opts[1840] |
| 72 | sub_18B1DE0 | LoopOpt/BarrierOpt | opts[3200] && !opts[2640] |
| 73 | sub_1857160 | NVVMReflect | opts[3200] && tier==3 && !opts[880] |
| 74 | sub_1841180 | FunctionAttrs | opts[3200] && !opts[680] |
| 75 | sub_1C46000 | NVVMLateOpt | tier==3 && !opts[360] |
| 76 | sub_1841180 | FunctionAttrs (2nd) | opts[3200] && !opts[680] |
| 77 | sub_1CBC480 | NVVMLowerAlloca | !opts[2240] && !opts[2280] |
| 78 | sub_1CB73C0 | NVVMBranchDist | !opts[2080] && !opts[2120] |
| 79 | sub_1C7F370(1) | NVVMWarpShuffle | opts[3328] && !opts[1640] |
| 80 | sub_1CC5E00 | NVVMReduction | opts[3328] && !opts[2400] |
| 81 | sub_1CC60B0 | NVVMSinking2 | opts[3328] && !opts[2440] |
| 82 | sub_1CB73C0 | NVVMBranchDist (2nd) | opts[3328] && !opts[2080] && !opts[2120] |
| 83 | sub_17060B0(1,0) | PrintModulePass | opts[3328] && !opts[1080] |
| 84 | sub_1B7FDF0(3) | Reassociate | opts[3328] && !opts[1280] |
| 85 | sub_17060B0(1,0) | PrintModulePass (final) | opts[3160] && !opts[1080] |
Optimization Level Summary
| Pipeline | Sub-pipeline called | lsa-opt | mem-space-opt | Approx. passes |
|---|---|---|---|---|
| nvopt<O0> | (minimal, sub_1C8A4D0(0) only) | off | off | ~5--8 |
| nvopt<Ofcmax> | Sinking2 + common tail only | forced 0 | forced 0 | ~12--15 |
| nvopt<Ofcmid> | mid-level pipeline | normal | enabled | ~25--30 |
| nvopt<Ofcmin> | close to full pipeline | normal | enabled | ~30--35 |
| nvopt<O1> | sub_12DE330 (Tier 0) | normal | enabled | ~35 |
| nvopt<O2> | sub_12DE330 + Tier 1/2 | normal | enabled | ~35+ |
| nvopt<O3> | sub_12DE330 + Tier 1/2/3 | normal | enabled | ~35+ |
O1/O2/O3 all route through the same sub_12DE330 (Tier 0). The difference manifests through the tiered pass inserter sub_12DE8F0: O1 only fires Tier 1, O2 fires Tiers 1--2, O3 fires all three tiers. Within the tiers, passes additionally vary by: loop unroll factor (parameter to sub_1833EB0), vectorizer width (parameters to sub_19B73C0), CGSCC iteration count (first parameter to sub_1A62BF0), and the SM-architecture-dependent late passes gated by opts[3328].
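That fan-out can be stated as a small function. This is a hedged model only -- the binary encodes the behavior through the tier enable/threshold fields in NVVMPassOptions, not through a function like `tiersFor` (an illustrative name):

```cpp
#include <vector>

// Hedged model of the O-level -> tier fan-out: O1/O2/O3 all run the Tier 0
// sub-pipeline (sub_12DE330); the tiered inserter (sub_12DE8F0) then fires
// tiers 1..level. Illustrative, not recovered code.
std::vector<int> tiersFor(int optLevel) {
    std::vector<int> fired;
    if (optLevel < 1)
        return fired;                        // O0: minimal pipeline, no tiers fire
    fired.push_back(0);                      // Tier 0 full optimization (sub_12DE330)
    for (int t = 1; t <= optLevel && t <= 3; ++t)
        fired.push_back(t);                  // O1 -> tier 1, O2 -> 1..2, O3 -> 1..3
    return fired;
}
```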
Ofcmax critical behavior: when fast-compile level == 2 (max), the libnvvm pipeline builder forces -lsa-opt=0 and -memory-space-opt=0 even if the user explicitly enables them. This is confirmed in both sub_9624D0 (line 1358) and sub_12CC750 (line 2025).
Codegen Dispatch -- sub_12DFE00
After all optimization tiers complete, sub_12DFE00 (0x12DFE00) performs codegen pass scheduling. This is NOT a simple pass adder -- it performs a full dependency graph construction:
- Reads optimization level from `opts[200]` (0 = minimal, >1 = enable dependency tracking)
- Iterates all passes already in the pass manager
- For each pass, calls `vtable+112` (isCodeGenOnly()) to filter
- Calls `vtable+16` (getAnalysisUsage()) to extract dependencies
- Builds a secondary hash table of ordering constraints
- Dispatches each pass to the codegen subsystem in topological order via the subtarget hook at `vtable+16`
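The dependency-ordered dispatch amounts to a topological sort over the constraint edges. The sketch below is a generic Kahn's-algorithm model under that assumption -- `topoDispatch` and the integer pass IDs are illustrative, not recovered structures:

```cpp
#include <queue>
#include <utility>
#include <vector>

// Generic model of dependency-ordered dispatch: constraint edges extracted via
// getAnalysisUsage() feed a Kahn topological sort. Illustrative, not recovered.
std::vector<int> topoDispatch(int numPasses,
                              const std::vector<std::pair<int, int>>& deps) {
    std::vector<std::vector<int>> succ(numPasses);
    std::vector<int> indeg(numPasses, 0);
    for (const auto& d : deps) {            // d.second must run after d.first
        succ[d.first].push_back(d.second);
        ++indeg[d.second];
    }
    std::queue<int> ready;
    for (int i = 0; i < numPasses; ++i)
        if (indeg[i] == 0) ready.push(i);   // passes with no unmet dependencies
    std::vector<int> order;
    while (!ready.empty()) {
        int p = ready.front(); ready.pop();
        order.push_back(p);                 // dispatch pass p to the codegen subsystem
        for (int s : succ[p])
            if (--indeg[s] == 0) ready.push(s);
    }
    return order;                           // shorter than numPasses => cycle detected
}
```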
Pass Classification Statistics
| Category | Count |
|---|---|
| Module analyses | 18 |
| Module passes | ~131 |
| CGSCC analyses | 3 |
| CGSCC passes | ~10 |
| Function analyses | ~39 |
| Function AA analyses | 5 |
| Function passes | ~219 |
| LoopNest passes | 4 |
| Loop analyses | 5 |
| Loop passes | ~26 |
| MachineFunction analyses | 20 |
| MachineFunction passes | ~50 |
| Total | ~526 |
| NVIDIA additions | 33 |
| Standard LLVM | ~493 |
Complete Pass Factory Address Map
Every unique pass factory address observed in sub_12E54A0, sub_12DE330, and sub_12DE8F0:
| Pass | Factory | Call Sites |
|---|---|---|
| NVVMVerifier | sub_12D4560 | many (tiers) |
| AssumptionCacheTracker | sub_1361950 | 1 |
| TargetLibraryInfoWrapperPass | sub_149CCE0 | 1 |
| VerifierPass/BasicAA | sub_14A7550 | 1 |
| BreakCriticalEdges | sub_1654860 | 2 |
| PrintModulePass (debug dump) | sub_17060B0 | ~30+ |
| InstructionCombining | sub_1832270 | 2 |
| TailCallElim/JumpThreading | sub_1833EB0 | 3 |
| FunctionAttrs | sub_1841180 | 3 |
| SCCP | sub_1842BC0 | 2 |
| NVVMReflect | sub_1857160 | ~8 |
| IPConstantPropagation | sub_185D600 | 3 |
| Sink (MemorySSA-based) | sub_1869C50 | 3 |
| NVVMPredicateOpt | sub_18A3090 | 2 |
| AggressiveInstCombine | sub_18A3430 | 2 |
| NVVMLoopOpt/BarrierOpt | sub_18B1DE0 | 3 |
| Sinking2Pass (fast-mode) | sub_18B3080 | 1 |
| DCE | sub_18DEFF0 | 4 |
| NVVMBarrierAnalysis | sub_18E4A00 | 1 |
| CorrelatedValuePropagation | sub_18EEA90 | 3 |
| DSE | sub_18F5480 | 2 |
| DeadArgElimination | sub_18FD350 | 5 |
| SimplifyCFG | sub_190BB10 | 4 |
| NVIDIA loop pass | sub_1922F90 | 1 |
| LoopIndexSplit | sub_1952F90 | 3 |
| LICM | sub_195E880 | 4 |
| SROA | sub_1968390 | 2 |
| EarlyCSE | sub_196A2B0 | 2 |
| LoopUnroll/Vectorize | sub_197E720 | 1 |
| LoopSimplify/IndVarSimplify | sub_198DF00 | 3 |
| CorrelatedValuePropagation | sub_198E2A0 | 1 |
| InstCombine | sub_19401A0 | 2 |
| LoopUnswitch | sub_19B73C0 | 3 |
| LoopUnroll | sub_19C1680 | 2 |
| NVIDIA pass (unknown) | sub_19CE990 | 1 |
| GenericToNVVM | sub_1A02540 | 1 |
| NVVMRematerialization | sub_1A13320 | 3 |
| NVVMIRVerification | sub_1A223D0 | 5+ |
| LLVM StandardPassPipeline | sub_1A62BF0 | ~9 |
| LoopIdiomRecognize | sub_1A68E70 | 1 |
| InstructionSimplify | sub_1A7A9F0 | 3 |
| NVIDIA-specific pass | sub_1AAC510 | 1 |
| MemCpyOpt | sub_1B26330 | 4 |
| Reassociate/Sinking | sub_1B7FDF0 | 3 |
| TTIWrapperPass | sub_1BFB520 | 1 |
| NVVMLateOpt | sub_1C46000 | 1 |
| Inliner/AlwaysInline | sub_1C4B6F0 | 2 |
| NewGVN/GVNHoist | sub_1C6E560 | 1 |
| GVN | sub_1C6E800 | 2 |
| ADCE (AggressiveDCE) | sub_1C6FCA0 | 2 |
| ADCE variant | sub_1C76260 | 2 |
| NVVMWarpShuffle | sub_1C7F370 | 1 |
| EarlyCSE/GVN variant | sub_1C8A4D0 | 3 |
| MemorySpaceOpt | sub_1C8E680 | 4 |
| NVVMLowerBarriers | sub_1C98160 | 4 |
| NVVMLowerBarriers variant | sub_1C98270 | 1 |
| ProfileSummaryInfo | sub_1CB0F50 | 1 |
| NVVMIntrinsicLowering | sub_1CB4E40 | ~10 |
| NVVMBranchDist | sub_1CB73C0 | 3 |
| NVVMLowerAlloca | sub_1CBC480 | 1 |
| NVVMUnreachableBlockElim | sub_1CC3990 | 1 |
| NVVMReduction | sub_1CC5E00 | 1 |
| NVVMSinking2 | sub_1CC60B0 | 3 |
| NVVMGenericAddrOpt | sub_1CC71E0 | 1 |
| NVVMFinalLowering | sub_1CEBD10 | 1 |
| NVVMPeephole | sub_1CEF8F0 | 2 |
| NVVMAnnotationsProcessor | sub_215D9D0 | 2 |
Total unique pass factories: ~65.
NVVMPassOptions Offset-to-Pass Guard Map
The NVVMPassOptions struct (4,512 bytes, 221 slots) controls which passes execute. The pipeline assembler reads boolean flags at specific offsets to gate pass insertion. See NVVMPassOptions for the full slot layout. Key offset-to-pass mappings:
| Offset | Slot | Type | Controls |
|---|---|---|---|
| +200 | 9 | int | Optimization level (0/1/2/3) |
| +280 | 15 | bool | DCE disable |
| +320 | 17 | bool | TailCallElim/JumpThreading disable |
| +360 | 19 | bool (default=1) | NVVMLateOpt disable |
| +600 | 31 | bool | NVVMVerifier disable |
| +720 | 37 | bool | SCCP disable |
| +760 | 39 | bool | DSE disable |
| +880 | 45 | bool | NVVMReflect disable |
| +920 | 47 | bool | IPConstantPropagation disable |
| +960 | 49 | bool | SimplifyCFG disable |
| +1000 | 51 | bool | InstCombine disable |
| +1040 | 53 | bool | Sink/MemSSA disable |
| +1080 | 55 | bool | PrintModulePass disable |
| +1160 | 59 | bool | LoopIndexSplit disable |
| +1240 | 63 | bool | LICM disable |
| +1280 | 65 | bool | Reassociate disable |
| +1320 | 67 | bool | ADCE disable |
| +1360 | 69 | bool | LoopUnroll disable |
| +1400 | 71 | bool | SROA disable |
| +1440 | 73 | bool | EarlyCSE disable |
| +1760 | 89 | bool | MemorySpaceOpt disable |
| +2000 | 101 | bool | NVVMIntrinsicLowering disable |
| +2320 | 117 | bool (default=1) | NVVMRematerialization disable |
| +2440 | 123 | bool | NVVMSinking2 disable |
| +2600 | 131 | bool | NVVMIRVerification disable |
| +2840 | 141 | bool (default=1) | ADCE enable (reversed logic) |
| +2880 | 143 | bool (default=1) | LICM enable (reversed logic) |
| +3120 | 155 | bool (default=1) | MemorySpaceOpt (2nd pass) enable |
| +3160 | 157 | bool (default=1) | PrintModulePass/debug dump enable |
| +3200 | 159 | bool (default=1) | Advanced NVIDIA passes group enable |
| +3328 | 165 | bool (default=1) | SM-specific late passes enable |
| +3488 | 175 | bool | Barrier optimization enable |
| +3648 | 181 | ptr | Language string ("ptx"/"mid"/"idn") |
| +3656 | — | int | Language string length |
| +3704 | 185 | bool | Late optimization / address-space flag |
| +4064 | 201 | bool | Concurrent compilation enable |
| +4104 | 203 | int (default=-1) | Thread count |
| +4224 | 211 | bool (default=1) | Master optimization enable |
| +4304 | 213 | bool | Device-code / separate-compilation flag |
| +4384 | 217 | bool | Fast-compile bypass (skip LLVM pipeline) |
| +4464 | 219 | bool (default=1) | Late CFG cleanup guard |
Infrastructure Functions
| Address | Function | Role |
|---|---|---|
| 0x2342890 | sub_2342890 | Master pass registration (~2,816 lines) |
| 0xE41FB0 | sub_E41FB0 | StringMap::insert (48-byte entries, open-addressing) |
| 0xE41C70 | sub_E41C70 | StringMap::grow (hash table resize) |
| 0xC94890 | sub_C94890 | String hash function (DJB/FNV-family) |
| 0x9691B0 | sub_9691B0 | String equality (len + memcmp) |
| 0xC931B0 | sub_C931B0 | StringRef::find_first_of (delimiter search) |
| 0x95CB50 | sub_95CB50 | StringRef::consume_front (strip llvm:: prefix) |
| 0x233C410 | sub_233C410 | Help listing (--print-pipeline-passes) |
| 0x233BD40 | sub_233BD40 | AA name resolver (chain of comparisons) |
| 0x233C0C0 | sub_233C0C0 | AA pipeline parser |
| 0x233C300 | sub_233C300 | Extension callback dispatch |
| 0x233A120 | sub_233A120 | Generic parameterized option parser |
| 0x12E54A0 | sub_12E54A0 | Master pipeline assembler (49.8KB) |
| 0x12DE0B0 | sub_12DE0B0 | AddPass (hash-table-based insertion) |
| 0x12DE330 | sub_12DE330 | Tier 0 full optimization sub-pipeline |
| 0x12DE8F0 | sub_12DE8F0 | Tier 1/2/3 phase-specific sub-pipeline |
| 0x12DFE00 | sub_12DFE00 | Codegen dispatch (dependency-ordered) |
| 0x226C400 | sub_226C400 | Pipeline name selector (nvopt<O#>) |
| 0x2277440 | sub_2277440 | Pipeline text parser entry |
| 0x225D540 | sub_225D540 | New PM nvopt registration |
| 0x12C35D0 | sub_12C35D0 | Legacy PM pipeline orchestrator |
| 0x2342820 | sub_2342820 | LastRunTrackingAnalysis factory |
| 0x2342830 | sub_2342830 | PassInstrumentationAnalysis factory |
| 0x2342840 | sub_2342840 | VerifierAnalysis factory |
| 0x2342850 | sub_2342850 | InlinerWrapper factory (shared by 4 inliner variants) |
| 0x2342860 | sub_2342860 | InvalidateAllAnalysesPass factory |
| 0x2342870 | sub_2342870 | VerifierPass factory |
| 0x2342880 | sub_2342880 | GuardWideningPass factory |
| 0x2339850 | sub_2339850 | PassBuilder destructor |
| 0x233B610 | sub_233B610 | PassBuilder::~PassBuilder cleanup |
Cross-References
- Optimizer -- runtime pipeline assembly, two-phase model, concurrent compilation
- NVVMPassOptions -- 221-slot option struct controlling pass enablement
- Optimization Levels -- O0/O1/O2/O3 and Ofcmin/Ofcmid/Ofcmax
- Concurrent Compilation -- Phase I/II, thread pool, GNU Jobserver
Scalar Passes: SROA, EarlyCSE & JumpThreading
Three LLVM scalar optimization passes play outsized roles in cicc's GPU pipeline. Each is a stock LLVM implementation with NVIDIA configuration overrides (and in EarlyCSE's case, binary-level modifications). Each appears multiple times in the pipeline at different tier levels, and each can be independently disabled via NVVMPassOptions flags.
SROA (Scalar Replacement of Aggregates)
SROA eliminates alloca instructions by decomposing aggregates into individual SSA values that the register allocator can place in registers. On a GPU this is existential: every surviving alloca becomes a spill to .local memory (DRAM-backed, 200-800 cycle latency on cache miss versus zero for a register). A single un-promoted alloca in a hot loop can degrade kernel throughput by 10-50x. SROA also eliminates the .param space copies generated for byval struct parameters, preventing round-trips through local memory.
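SROA's effect can be shown in source form. In this hedged C++ analogue (illustrative, not recovered code), `withAggregate` mirrors the pre-SROA shape -- the aggregate temporary is an alloca, i.e. a `.local` spill candidate on the GPU -- while `scalarReplaced` mirrors the result after decomposition into independent scalars:

```cpp
// SROA modeled in source form. withAggregate: aggregate alloca before the
// pass; scalarReplaced: each field promoted to an independent SSA value that
// the register allocator can keep in registers. Illustrative C++ only.
struct Pair {
    float x, y;
};

float withAggregate(float a, float b) {
    Pair p;                 // aggregate alloca before SROA
    p.x = a * 2.0f;
    p.y = b + 1.0f;
    return p.x + p.y;
}

float scalarReplaced(float a, float b) {
    float px = a * 2.0f;    // field x promoted to a scalar
    float py = b + 1.0f;    // field y promoted to a scalar
    return px + py;         // no alloca survives
}
```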
EarlyCSE (Early Common Subexpression Elimination)
Cicc's EarlyCSE is not stock LLVM. The binary contains four CUDA-specific extensions: barrier-aware memory versioning that prevents CSE across __syncthreads() and other synchronization points, shared memory address space 7 protection against unsafe store-to-load forwarding between threads, a dedicated NVVM intrinsic call CSE handler with fast-path recognition for thread-invariant special register reads (threadIdx.x, etc.), and a PHI operand limit of 5 for compile-time control. It also adds a fourth scoped hash table (store-forwarding) that upstream LLVM lacks.
JumpThreading
JumpThreading duplicates basic blocks so that predecessors with statically-determinable branch conditions jump directly to the correct successor, eliminating warp divergence. The pass is fundamentally at odds with PTX's requirement for reducible control flow: block duplication can create irreducible cycles. Cicc addresses this through loop header protection (jump-threading-across-loop-headers defaults to false), conservative duplication thresholds (6-instruction block limit), and a late-pipeline StructurizeCFG safety net that catches any irreducibility that slips through. NVIDIA provides a separate "disable-jump-threading" kill switch (distinct from upstream's "disable-JumpThreadingPass"), with an OCG experiment annotation suggesting architecture-specific cases where the CFG disruption outweighs the benefit.
Full JumpThreading analysis >>>
Cross-References
- Pipeline & Ordering -- tier-dependent scheduling of all three passes
- Register Allocation -- surviving allocas after SROA become register pressure; failed promotion leads to .local memory spills
- StructurizeCFG -- the safety net that catches irreducible CFG created by JumpThreading or other passes
- GVN -- GVN performs load CSE and redundancy elimination complementary to EarlyCSE, running later in the pipeline with more expensive analysis
- MemorySpaceOpt -- resolves generic pointers to specific address spaces; interacts with EarlyCSE's address-space-aware load forwarding
- DSE -- Dead Store Elimination complements EarlyCSE's within-block store-to-load forwarding with cross-block dead store detection
SROA (Scalar Replacement of Aggregates)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Based on LLVM 20.0.0's SROA.cpp. Evidence: the preserve-cfg / modify-cfg pipeline parser parameters match LLVM 16+ new PM integration, and the two-pass analysis mode (qword_50055E8) matches the LLVM 17+ pre-analysis path. The core splitting algorithm is stock LLVM with no CUDA-specific modifications detected.
SROA is the single most important early-pipeline optimization for NVIDIA GPU compilation. Every alloca instruction that survives into code generation is lowered to .local memory (NVPTX address space 5) -- physically backed by device DRAM and accessed through the L1/L2 cache hierarchy. A .local access that misses L1 costs 200-400 cycles; a register read costs zero. A single un-promoted alloca in a hot loop can degrade kernel throughput by 10-50x. SROA's job is to decompose aggregate allocas (structs, arrays, unions) into individual scalar SSA values that the register allocator can place in registers, eliminating the memory traffic entirely.
| Property | Value |
|---|---|
| Pass name | "sroa" |
| Pipeline parser params | preserve-cfg, modify-cfg |
| Entry function | sub_2935C30 (runOnAlloca) |
| Core function | sub_2930B90 (splitAlloca) |
| Binary footprint | ~138 KB primary (80 KB + 58 KB), ~200 KB secondary (legacy PM) |
| Binary address range | 0x2910000-0x293FFFF (178 functions) |
| Pipeline positions | Position 4 (early, after NVVMReflect) and post-sinking (late) |
| Disable flag | NVVMPassOptions offset +1400 |
| Size threshold knob | qword_50056C8 (max alloca size in bits) |
| Two-pass flag | qword_50055E8 (enables pre-analysis for new PM) |
| NVIDIA modifications | None to core algorithm |
| Upstream source | llvm/lib/Transforms/Scalar/SROA.cpp |
Why SROA Is Existential on GPU
On a CPU, an alloca that cannot be promoted to a register lives on the stack -- a cached, low-latency memory region with typical access times of 1-4 cycles. On an NVIDIA GPU there is no hardware stack cache: every surviving alloca becomes a .local allocation backed by DRAM with 200-800 cycle latency on cache miss versus zero for a register. See the GPU Execution Model memory hierarchy table for per-tier latencies.
Every alloca that survives SROA becomes a .local allocation. The NVPTX backend emits these as frame objects in the NVPTXFrameLowering::emitPrologue path, and ptxas maps them to per-thread local memory. Because occupancy is bounded by register count per SM, and .local spills effectively consume both registers (for the address) and memory bandwidth, the performance impact compounds.
The pipeline runs SROA twice: once early (position 4, immediately after NVVMReflect) to eliminate allocas before any other transform sees them, and once late (after NVVMCustomSinking2 and BreakCriticalEdges) to catch allocas created or exposed by loop unrolling, inlining, and other mid-pipeline transforms. The early invocation handles the common case (byval parameter copies, local struct variables); the late invocation cleans up whatever the loop optimizer and sinking passes left behind.
The isAllocaPromotable Fast Path
Before performing any splitting, runOnAlloca checks whether the alloca is trivially promotable via sub_B4CE70 (isAllocaPromotable). An alloca is promotable if every use is a simple load or store with no address-taken escape -- the same criterion as mem2reg. When this returns true, SROA marks the alloca for mem2reg and returns without performing any slice analysis or splitting. This fast path avoids the O(n) slice-building cost for the vast majority of CUDA local variables (scalar int, float, simple pointers), which are already simple enough for mem2reg to handle directly.
Algorithm: runOnAlloca (sub_2935C30)
The top-level per-alloca entry point. Validates the alloca as a candidate, builds the partition/slice table, and delegates to splitAlloca for the actual transformation.
Phase 1: Candidate Validation
runOnAlloca(state, alloca):
    if alloca has no users:
        eraseFromParent(alloca)
        return
    if isAllocaPromotable(alloca):
        defer to mem2reg
        return
    type = getAllocatedType(alloca)
    type_byte = getTypeID(type)
    // Accept: integers(3), half(4), bfloat(5), float(6),
    //         pointers(10), vectors(11), arrays(12), structs(15-18, 20)
    // Reject structs/composites unless isVectorType returns true
    if type_byte not in {3,4,5,6,10,11,12,15,16,17,18,20}:
        return
    if type_byte in {15,16,17,18,20} and not isVectorType(type):
        return                          // function types, labels, etc.
    size = getTypeSizeInBits(type)      // sub_BDB740
    if size > qword_50056C8:            // SROA size threshold
        return                          // alloca too large, leave for backend
The size threshold at qword_50056C8 is a global tuning knob, likely controlled by the sroa<preserve-cfg> / sroa<modify-cfg> pipeline parameter. Allocas larger than this threshold are left untouched; the backend will lower them to .local memory. The exact default is not exposed in the binary's constructor initializers, but upstream LLVM uses a default of 128 bytes (1024 bits) for the sroa-threshold flag.
Phase 2: Use Analysis and Slice Building
metadata = buildMetadataTable(alloca)   // sub_D5F1F0
if qword_50055E8:                       // two-pass mode
    buildSlices(state, alloca, 1)       // sub_2927160 — pre-analysis
    slices = buildPartitions(state)     // sub_2924690
else:
    slices = buildPartitions(state)     // single-pass
buildSlices (sub_2927160) walks all users of the alloca, classifying each use as a "slice" -- a byte range [start, end) with associated flags. Each slice is a 24-byte entry:
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | start (byte offset into alloca) |
| +8 | 8 | end (byte offset, exclusive) |
| +16 | 8 | flags -- bit 2 = splittable, bits [63:3] = user instruction metadata pointer |
buildPartitions (sub_2924690) groups non-overlapping slices into partitions. Each partition represents a contiguous byte range that can be replaced by a single sub-alloca. Overlapping slices are merged; slices that cross partition boundaries are marked as "unsplittable."
The two-pass flag (qword_50055E8) enables a pre-analysis pass that runs buildSlices first with a "dry-run" mode to count slices and pre-allocate arrays, then runs the actual partition builder. This is the new PM (PassManager) style -- the legacy PM code path at 0x1A10000 does a single pass.
Phase 3: Contiguous Slice Merging
After building slices, runOnAlloca scans for contiguous ranges that share the same base type and can be merged:
for each group of contiguous slices:
    if all loads/stores in group use the same type:
        if none are volatile (isVolatile check via sub_B46500):
            if all are in-bounds (byte +2, bit 0):
                mergeSlices(group)  // sub_11D2BF0 + sub_11D3120 + sub_11D7E80
This merging step reduces redundant slices before the splitting phase. For example, if a 16-byte struct is read by four contiguous 4-byte i32 loads, the merger can combine them into a single slice covering the full struct, which may then map to a single <4 x i32> vector value rather than four separate scalar registers.
Phase 4: Dead Instruction Processing
for each dead instruction found during analysis:
    for each operand:
        addToWorklist(operand)       // sub_29220F0
    replaceAllUsesWith(undef)        // sub_BD84D0 + sub_ACADE0
    eraseFromParent(instruction)     // sub_BD60C0
Dead instructions identified during slice building (stores to never-loaded ranges, loads of write-only ranges) are removed immediately, before the splitting phase begins.
Phase 5: Recursive Splitting
if slices is non-empty:
    splitAlloca(state, alloca, slices)  // sub_2930B90 — recursive
This is the key: splitAlloca may create new sub-allocas that are themselves candidates for further splitting. The newly created sub-allocas are added to the worklist and processed in stack order (LIFO).
Phase 6-8: Post-Split Processing
After splitting, runOnAlloca processes newly created sub-allocas (56-byte records stored in a SmallVector with 2-element inline buffer), rewrites per-sub-alloca slice lists, and returns a two-byte result: byte 0 = changed flag, byte 1 = re-run needed flag.
Algorithm: splitAlloca (sub_2930B90)
The core splitting function. Given a partitioned alloca and its use-slices, it creates new sub-allocas and rewrites all users.
Phase 1: Pre-Filter Slices
Iterates the 24-byte slice array. For slices whose instruction is a load (opcode 61) or store (opcode 62) of a simple scalar type that fits entirely within the alloca boundary, clears the "splittable" bit (flag & 4). This prevents unnecessary splitting of trivial accesses -- a scalar i32 load from an i32 alloca does not need splitting. If any slices were de-flagged, calls sortSlices (sub_2912200) and compactSlices (sub_2915A90 / sub_2914CE0) to remove the now-redundant entries.
Phase 2: Partition Iteration
buildPartitionTable (sub_2913C40) produces a partition list from the sorted slices. Each partition is a local tuple [start, end, first_slice_ptr, last_slice_ptr]. The main loop advances through partitions via sub_2912870 (advancePartitionIterator).
Phase 3: Find Rewrite Target
For each partition [start, end):
- Get the DataLayout via sub_B43CC0 (getDL).
- If the partition contains only unsplittable slices, call findExistingValue (sub_291A860) to search for an existing SSA value that already covers [start, end). If found, reuse it instead of creating a new alloca.
- Otherwise, scan slices for a single dominating load or store. Dispatch on opcode:
  - 61 (load): extract the loaded type.
  - 62 (store): extract the stored value type from the store's value operand.
  - 85 (intrinsic): memcpy/memset/memmove -- follow the pointer chain to determine the affected type.
- Compare type sizes via getTypeSizeInBits (sub_BDB740).
- If no suitable existing value, create a new alloca via CreateAlloca (sub_BCD420) or CreateBitCast (sub_BCD140).
Phase 4: Size and Alignment Check
alloc_size = getTypeAllocSize(partition_type)  // sub_9208B0
if alloc_size > 0x800000:                      // 8 MB sanity limit
    skip partition
// Verify rewrite target matches partition size (8-byte aligned)
if match:
    checkTypeCompatibility(both_directions)    // sub_29191E0
    validateUnsplittableSlices(partition)      // sub_291A4D0
The 8 MB sanity limit prevents SROA from creating absurdly large sub-allocas from pathological input.
Phase 5: Slice Classification
For each slice in the partition, classifySlice (sub_29280E0) sorts it into one of two lists:
| List | Variable | Contents |
|---|---|---|
| splittable-inside | v446 | Slices fully contained within [start, end) |
| splittable-outside | v452 | Slices that reference bytes outside the partition (integer widening) |
The classification also tracks:
- v413 (sameType flag): whether all slices in the partition use the same LLVM type.
- v415 (common type): the shared type if sameType is true.
- v412 (hasPointerType): whether any slice involves a pointer type.
- Integer types (type byte == 14) are routed to the outside list for special handling (widening/narrowing may be needed).
Then rewritePartition (sub_29197E0) is called twice: first for inside slices with callback sub_2919EF0, then for outside slices if the first call produced nothing.
Phase 6: New Sub-Alloca Creation
// Compute alignment
align_log2 = _BitScanReverse64(alloca_alignment)
abi_align = getABITypeAlignment(type) // sub_AE5020
pref_align = getPrefTypeAlignment(type) // sub_AE5260
// Build name: original_name + ".sroa." + index
name = getName(alloca) + ".sroa." // sub_BD5D20
// Create the new alloca (80-byte AllocaInst object)
new_alloca = AllocaInst::Create(type, size, alignment, name)
// sub_BD2C40 + sub_B4CCA0
// Insert before the original alloca
insertBefore(new_alloca, alloca)
// Copy debug metadata
copyDebugInfo(alloca, new_alloca) // sub_B96E90 + sub_B976B0
Each sub-alloca is an 80-byte AllocaInst object with the .sroa. name prefix. The insertion point is always directly before the original alloca in the entry block, maintaining the invariant that all allocas are grouped at the function entry.
Phase 7: Instruction Rewriting
The visitUse function (sub_292A4F0) rewrites each user of the original alloca to reference the appropriate sub-alloca:
- GEP chains: retargeted to the new sub-alloca with adjusted offsets (sub_29348F0).
- Loads: rewritten with type-casts if the sub-alloca type differs from the original load type (sub_F38250).
- Stores: same treatment as loads (sub_F38250).
- Memcpy/memset: split into smaller operations covering only the sub-alloca's byte range (sub_F38330).
Each rewritten instruction is validated via sub_291F660 (validateRewrite).
Phase 8: Worklist Management
Dead instructions are removed from the pass's open-addressing hash table (at pass state offset +432, mask at +896). New sub-allocas are added to the worklist (sub_2928360) for re-processing. Allocas that cannot be split are recorded via sub_2916C30 (recordNonSplitAlloca).
Phase 9: Result Recording
For each partition that produced a new alloca, the result is stored as a 24-byte entry [new_alloca, bit_offset, bit_size] in the output array. Hash table capacity is computed using the classic 4n/3 + 1 formula (next power of 2), and entries are stored via open-addressing with linear probing (sub_29222D0 handles resizing).
Phase 10: Post-Split Use Rewriting
The most complex phase. For every use of the original alloca:
- getOperandNo (sub_B59530) determines which operand references the alloca.
- getAccessRange (sub_AF47B0) computes the byte range [begin, end) within the alloca that this use touches.
- For each new sub-alloca in the result array, checkSubAllocaOverlap (sub_AF4D30) tests whether the sub-alloca's range overlaps the use's range.
- If overlap: computeRewrittenValue (sub_2916270) produces the replacement value by combining reads from multiple sub-allocas if the original use spans a partition boundary.
- Dead uses are identified by isDeadUse (sub_291D8F0) and erased.
The use-list implementation uses a tagged-pointer scheme: bit 2 indicates "heap-allocated list" vs. "inline single element," bits [63:3] are the actual pointer. Lists are freed via _libc_free after extracting the data pointer.
Phase 11-12: Lifetime and Debug Info
Lifetime markers (llvm.lifetime.start / llvm.lifetime.end) are rewritten via sub_291E540 to cover only the sub-alloca's byte range. Debug declarations (dbg.declare, dbg.value) are similarly rewritten: each debug-info entry pointing to the original alloca is retargeted to the sub-alloca whose byte range covers the relevant fragment, using the debug expression's DW_OP_LLVM_fragment to indicate the piece.
Speculative Loads Through Select
When a load reaches its pointer through a select instruction, SROA hoists the load into both branches:
; Before SROA:
%p = select i1 %cond, ptr %a, ptr %b
%v = load float, ptr %p, align 4
; After SROA:
%vt = load float, ptr %a, align 4 ; .sroa.speculate.load.true
%vf = load float, ptr %b, align 4 ; .sroa.speculate.load.false
%v = select i1 %cond, float %vt, float %vf ; .sroa.speculated
This is significant on GPU for two reasons:
- SIMT execution model. A select on a GPU maps to a predicated move, which executes in a single cycle without divergence. The two speculative loads execute unconditionally and in parallel (both issue to the memory pipeline regardless of the predicate). This is cheaper than a control-dependent load that would require branch divergence handling.
- Alloca elimination. The original pattern requires the select to produce a pointer, which means the alloca must remain in memory (the pointer must be materializable). After speculation, both pointers are consumed directly by loads, and if %a and %b are themselves sub-allocas that can be promoted to registers, the entire chain collapses to register-only operations.
The implementation (Kind 3, lines 1024-1235 of splitAlloca) creates:
- Two BitCastInst with names .sroa.speculate.cast.true and .sroa.speculate.cast.false.
- Two LoadInst with names .sroa.speculate.load.true and .sroa.speculate.load.false, preserving alignment from the original load.
- One SelectInst with name .sroa.speculated via sub_B36550 (SelectInst::Create).
- Metadata copied from the original load via sub_B91FC0 (copyMetadata).
Interaction with .param Space
Function parameters passed by value in CUDA/PTX use the .param address space (NVPTX address space 101). The EDG frontend generates an alloca to hold a copy of each byval parameter, then loads fields from it. Consider:
struct Vec3 { float x, y, z; };
__device__ float sum(Vec3 v) {
return v.x + v.y + v.z;
}
The IR before SROA contains:
define float @sum(%struct.Vec3* byval(%struct.Vec3) align 4 %v) {
%v.addr = alloca %struct.Vec3, align 4 ; byval copy
%x = getelementptr %struct.Vec3, ptr %v.addr, i32 0, i32 0
%0 = load float, ptr %x, align 4
%y = getelementptr %struct.Vec3, ptr %v.addr, i32 0, i32 1
%1 = load float, ptr %y, align 4
%z = getelementptr %struct.Vec3, ptr %v.addr, i32 0, i32 2
%2 = load float, ptr %z, align 4
%add = fadd float %0, %1
%add1 = fadd float %add, %2
ret float %add1
}
SROA splits %v.addr into three scalar allocas (%v.addr.sroa.0, .sroa.1, .sroa.2), each holding a single float. Because each sub-alloca has only simple loads and stores, mem2reg (which runs in the next pipeline iteration) promotes all three to SSA registers. The final IR has no allocas and no memory traffic -- the three float values live entirely in registers.
Without SROA, the byval copy would persist as a .local allocation, and every field access would be a .local load. For a kernel that calls sum() in a tight loop, this difference is the difference between register-speed and DRAM-speed execution.
The NVPTXTargetLowering::LowerCall function (sub_3040BF0) emits DeclareParam (opcode 505) and StoreV1/V2/V4 (opcodes 571-573) for the .param writes on the caller side; SROA's job is to ensure the callee's reads never touch memory.
Auxiliary SROA Functions (Secondary Instance)
The binary contains a second SROA instance at 0x1A10000-0x1A3FFFF (~200 KB), corresponding to the legacy pass manager code path. This instance contains additional rewriting functions not visible in the primary (new PM) instance:
| Function | Size | Role | Key strings |
|---|---|---|---|
sub_1A3B290 | 58 KB | rewritePartition (memcpy/memset) | "memcpy.load.fca", "memcpy.store.fca", "memset.store.fca", ".fca" |
sub_1A2D070 | 35 KB | presplitLoadsAndStores | "select.gep.sroa", "select.sroa", "phi.sroa", "phi.gep.sroa" |
sub_1A2C2F0 | 9 KB | Select speculation | ".sroa.speculate.load.true", ".sroa.speculate.load.false" |
sub_1A2FFA0 | 12 KB | Vector splat handling | "vsplat", ".splatinsert", ".splat" |
sub_1A30D10 | 16 KB | Load rewriting | "copyload", "oldload" |
sub_1A31B60 | 9 KB | Extract/load patterns | "extract", "load.ext", "endian_shift", "load.trunc" |
sub_1A23B30 | 11 KB | Type casting | "sroa_raw_cast", "sroa_raw_idx", "sroa_cast" |
sub_1A3A670 | 13 KB | Speculative load promotion | ".sroa.speculated", ".sroa.speculate.load." |
sub_1A13B30 | 36 KB | Alloca analysis / slice building | -- |
sub_1A15E70 | 34 KB | Partition computation | -- |
sub_1A18770 | 38 KB | Use analysis | -- |
sub_1A3DCD0 | 15 KB | Cleanup | -- |
The .fca suffix stands for "first-class aggregate" -- LLVM's term for structs and arrays passed by value. The presplitLoadsAndStores function handles a special case where loads and stores of aggregates can be split before the main SROA algorithm runs, decomposing load { i32, i32 } into separate load i32 instructions and store { i32, i32 } into separate store i32 instructions. The select.gep.sroa and phi.gep.sroa strings indicate that this pre-split phase also handles GEP chains through PHI nodes and selects, a pattern common in CUDA code after inlining.
Data Structures
Slice Entry (24 bytes)
struct SROASlice {
uint64_t start; // +0: byte offset into alloca (inclusive)
uint64_t end; // +8: byte offset into alloca (exclusive)
uint64_t flags; // +16: bit 2 = splittable, bits [63:3] = user metadata ptr
};
The splittable bit indicates whether the slice can be split across partition boundaries. Loads and stores of simple scalars that fit entirely within the alloca have this bit cleared in Phase 1 of splitAlloca.
Sub-Alloca Record (56 bytes)
struct SubAllocaRecord {
void* alloca_ptr; // +0: pointer to the new AllocaInst
void* slice_list; // +8: pointer to slice list for this sub-alloca
uint64_t slice_list_cap; // +16: capacity of slice list
// ... additional fields through +55
};
Stored in a SmallVector<SubAllocaRecord, 2> -- the inline buffer holds two elements (common case: a struct with two fields), spilling to heap for larger aggregates.
Pass State Hash Table
The SROA pass state object (parameter a1 to both main functions) contains an open-addressing hash table at offsets +432 through +896. It uses LLVM-layer sentinels (-4096 / -8192) with instruction pointer keys. This table tracks which instructions have already been processed or are pending in the worklist. See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth policy.
Tagged Pointer Scheme
Use-lists and debug-info lists use a tagged-pointer encoding for memory efficiency:
- Bit 2 clear: the "pointer" field directly contains a single element (inline storage for the common case of one use).
- Bit 2 set: bits [63:3] are a heap-allocated pointer to a variable-length list. Freed via _libc_free after masking off the tag bits.
This avoids heap allocation for the overwhelmingly common case where an alloca field has exactly one load or one store.
IR Before/After Example
Consider a CUDA kernel that uses a local struct:
__global__ void kernel(float* out, int n) {
struct { float a; int b; float c; } local;
local.a = 1.0f;
local.b = n;
local.c = 2.0f;
out[0] = local.a + local.c;
out[1] = (float)local.b;
}
Before SROA:
define void @kernel(ptr %out, i32 %n) {
entry:
%local = alloca { float, i32, float }, align 4
%a = getelementptr { float, i32, float }, ptr %local, i32 0, i32 0
store float 1.0, ptr %a, align 4
%b = getelementptr { float, i32, float }, ptr %local, i32 0, i32 1
store i32 %n, ptr %b, align 4
%c = getelementptr { float, i32, float }, ptr %local, i32 0, i32 2
store float 2.0, ptr %c, align 4
%v0 = load float, ptr %a, align 4
%v2 = load float, ptr %c, align 4
%sum = fadd float %v0, %v2
store float %sum, ptr %out, align 4
%v1 = load i32, ptr %b, align 4
%conv = sitofp i32 %v1 to float
%idx = getelementptr float, ptr %out, i64 1
store float %conv, ptr %idx, align 4
ret void
}
After SROA (three sub-allocas, then mem2reg promotes to registers):
define void @kernel(ptr %out, i32 %n) {
entry:
; No allocas remain -- all promoted to SSA values
%sum = fadd float 1.0, 2.0 ; constant-folded later by InstCombine
store float %sum, ptr %out, align 4
%conv = sitofp i32 %n to float
%idx = getelementptr float, ptr %out, i64 1
store float %conv, ptr %idx, align 4
ret void
}
SROA splits %local into %local.sroa.0 (float), %local.sroa.1 (i32), %local.sroa.2 (float). Each sub-alloca has trivial load/store patterns, so mem2reg promotes all three. The stores and loads collapse, GEPs disappear, and the kernel runs entirely from registers.
Name Suffixes Created During Splitting
| Suffix | Purpose |
|---|---|
.sroa. | New sub-alloca name prefix |
.sroa.speculate.cast.true | Bitcast for true branch of select |
.sroa.speculate.cast.false | Bitcast for false branch of select |
.sroa.speculate.load.true | Speculative load from true branch |
.sroa.speculate.load.false | Speculative load from false branch |
.sroa.speculated | Final select combining speculative loads |
.cont | Continuation block (after branch splitting) |
.then | Then-branch block |
.else | Else-branch block |
.val | Value extracted from split load/store |
.fca | First-class aggregate decomposition |
select.gep.sroa | GEP through select, pre-split |
select.sroa | Select pointer, pre-split |
phi.sroa | PHI pointer, pre-split |
phi.gep.sroa | GEP through PHI, pre-split |
sroa_raw_cast | Raw bitcast during type rewriting |
sroa_raw_idx | Raw index computation during rewriting |
sroa_cast | Generic SROA type cast |
vsplat | Vector splat element |
.splatinsert | Splat insert element |
.splat | Splat shuffle |
copyload | Copy of a load during rewriting |
oldload | Original load being replaced |
extract | Extracted sub-value |
load.ext | Load with extension |
endian_shift | Endianness-adjustment shift |
load.trunc | Load with truncation |
memcpy.load.fca | Memcpy load of first-class aggregate |
memcpy.store.fca | Memcpy store of first-class aggregate |
memset.store.fca | Memset store of first-class aggregate |
Differences from Upstream LLVM
The core SROA algorithm in cicc v13.0 is stock LLVM SROA. No CUDA-specific modifications to the splitting logic, slice building, or partition computation were detected. The NVIDIA-specific elements are limited to:
- Pass state object layout. The offsets within the pass state structure (worklist at +432, hash table at +824-+864, sub-alloca records at +1080-+1096) reflect NVIDIA's PassManager integration, not upstream's.
- IR node encoding. Opcode numbers (61 = load, 62 = store, 85 = intrinsic, 55 = phi) and operand layout (32-byte basic blocks, tagged pointers) follow NVIDIA's modified IR format.
- Debug metadata system. The metadata kind for debug info uses MD_dbg = 38 (NVIDIA assignment), queried via sub_B91C10.
- Global threshold knob. The value at qword_50056C8 may have an NVIDIA-specific default different from upstream's 128-byte / 1024-bit default. The knob is likely settable via the pipeline text sroa<preserve-cfg> or sroa<modify-cfg>.
- Pipeline positioning. The early-pipeline placement (position 4, before NVVMLowerArgs and NVVMLowerAlloca) is NVIDIA-specific. Upstream LLVM typically places SROA after InstCombine and SimplifyCFG; cicc places it before those passes to eliminate byval parameter copies as early as possible.
Configuration
| Global | Knob | Description |
|---|---|---|
qword_50056C8 | SROA size threshold | Maximum alloca size (in bits) that SROA will attempt to split. Allocas exceeding this are left for the backend. |
qword_50055E8 | Two-pass analysis flag | When set, enables a pre-analysis pass before slice building (new PM integration). |
NVVMPassOptions offset +1400 | Disable flag | Setting this byte disables SROA entirely. |
Pipeline param preserve-cfg | -- | Runs SROA without modifying the CFG (no block splitting for speculative loads across PHIs). |
Pipeline param modify-cfg | -- | Allows SROA to modify the CFG (enables full speculative load hoisting including PHI/select decomposition). |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Primary instance (new PM) | -- | ||
SROAPass::runOnAlloca | sub_2935C30 | 58 KB | -- |
SROAPass::splitAlloca | sub_2930B90 | 80 KB | -- |
buildSlices (use analysis) | sub_2927160 | -- | -- |
buildPartitions (group slices) | sub_2924690 | -- | -- |
buildPartitionTable | sub_2913C40 | -- | -- |
sortSlices | sub_2912200 | -- | -- |
compactSlices (with filter) | sub_2915A90 | -- | -- |
compactSlices (simple) | sub_2914CE0 | -- | -- |
findExistingValue | sub_291A860 | -- | -- |
rewritePartition | sub_29197E0 | -- | -- |
rewriteCallback | sub_2919EF0 | -- | -- |
visitUse (rewrite one use) | sub_292A4F0 | 54 KB | -- |
validateRewrite | sub_291F660 | -- | -- |
analyzeSlice | sub_29150D0 | -- | -- |
addToNewAllocaWorklist | sub_2929FB0 | -- | -- |
addToWorklist | sub_2928360 | -- | -- |
addOperandToWorklist | sub_29220F0 | -- | -- |
clearPendingQueue | sub_2921860 | -- | -- |
classifySlice | sub_29280E0 | -- | -- |
recordNonSplitAlloca | sub_2916C30 | -- | -- |
computeRewrittenValue | sub_2916270 | -- | -- |
advancePartitionIterator | sub_2912870 | -- | -- |
rewriteGEPChain | sub_29348F0 | -- | -- |
replaceAndErase | sub_2914800 | -- | -- |
collectUsesForRewrite (variant) | sub_2914380 | -- | -- |
collectUsesForRewrite (original) | sub_2914550 | -- | -- |
| Hash table resize | sub_29222D0 | -- | -- |
| Alloca rewriting helper | sub_292D810 | 67 KB | -- |
| SROA pass metadata | sub_2912100 | -- | -- |
SROA pass registration ("Scalar Replacement Of Aggregates", "sroa") | sub_2912340 | -- | -- |
| Secondary instance (legacy PM) | -- | ||
SROAPass::runOnAlloca (legacy) | sub_1A33E80 | 61 KB | -- |
SROAPass::splitAlloca (legacy) | sub_1A37040 | 46 KB | -- |
rewritePartition (memcpy/memset) | sub_1A3B290 | 58 KB | -- |
presplitLoadsAndStores | sub_1A2D070 | 35 KB | -- |
| Select speculation | sub_1A2C2F0 | 9 KB | -- |
| Vector splat handling | sub_1A2FFA0 | 12 KB | -- |
| Load rewriting | sub_1A30D10 | 16 KB | -- |
| Extract/load patterns | sub_1A31B60 | 9 KB | -- |
| Type casting | sub_1A23B30 | 11 KB | -- |
| Speculative load promotion | sub_1A3A670 | 13 KB | -- |
| Alloca analysis / slice building | sub_1A13B30 | 36 KB | -- |
| Partition computation | sub_1A15E70 | 34 KB | -- |
| Use analysis | sub_1A18770 | 38 KB | -- |
| Cleanup | sub_1A3DCD0 | 15 KB | -- |
| Shared helpers | -- | -- | -- |
| isAllocaPromotable | sub_B4CE70 | -- | -- |
| getDL (DataLayout) | sub_B43CC0 | -- | -- |
| getTypeSizeInBits | sub_BDB740 | -- | -- |
| getTypeAllocSize | sub_9208B0 | -- | -- |
| getType | sub_BD5C60 | -- | -- |
| getName | sub_BD5D20 | -- | -- |
| AllocaInst::Create | sub_BD2C40 | -- | -- |
| PHINode::Create | sub_BD2DA0 | -- | -- |
| AllocaInst constructor | sub_B4CCA0 | -- | -- |
| CreateBitCast | sub_BCD140 | -- | -- |
| CreateAlloca | sub_BCD420 | -- | -- |
| replaceAllUsesWith | sub_BD84D0 | -- | -- |
| eraseFromParent | sub_B43D60 | -- | -- |
| SelectInst::Create | sub_B36550 | -- | -- |
| UndefValue::get | sub_ACADE0 | -- | -- |
| getABITypeAlignment | sub_AE5020 | -- | -- |
| getPrefTypeAlignment | sub_AE5260 | -- | -- |
| copyMetadata | sub_B91FC0 | -- | -- |
| isVolatile | sub_B46500 | -- | -- |
| isVectorType | sub_BCEBA0 | -- | -- |
| rewriteLoadStoreOfSlice | sub_F38250 | -- | -- |
| rewriteMemTransferOfSlice | sub_F38330 | -- | -- |
| collectAllUses | sub_AE74C0 | -- | -- |
| getAccessRange | sub_AF47B0 | -- | -- |
| checkSubAllocaOverlap | sub_AF4D30 | -- | -- |
| buildMetadataTable | sub_D5F1F0 | -- | -- |
| addToErasedSet | sub_D6B260 | -- | -- |
| Slice optimizer init | sub_11D2BF0 | -- | -- |
| Slice optimizer run | sub_11D3120 | -- | -- |
| Slice optimizer finalize | sub_11D7E80 | -- | -- |
Test This
The following kernel allocates a local struct and accesses its fields. SROA should completely eliminate the alloca, promoting all fields to registers.
struct Particle {
    float x, y, z;
    float vx, vy, vz;
};

__global__ void sroa_test(float* out, int n) {
    Particle p;
    p.x = (float)threadIdx.x;
    p.y = (float)threadIdx.y;
    p.z = 0.0f;
    p.vx = 1.0f;
    p.vy = 2.0f;
    p.vz = 3.0f;
    float energy = 0.5f * (p.vx*p.vx + p.vy*p.vy + p.vz*p.vz);
    out[threadIdx.x] = p.x + p.y + p.z + energy;
}
What to look for in PTX:
- Absence of .local memory declarations. If SROA succeeds, there should be no .local .align directives in the PTX for the Particle struct. All six fields (x, y, z, vx, vy, vz) should live in %f (float) registers.
- No st.local or ld.local instructions. These indicate that the struct survived into .local memory -- a 200-400 cycle penalty per access versus zero cycles for a register.
- The PTX should show direct register arithmetic: mov.f32, fma.rn.f32, add.f32 -- no memory traffic at all for the struct fields.
- To see the failure case, add volatile to the struct declaration (volatile Particle p;). This prevents SROA from promoting the alloca, and ld.local/st.local instructions will appear in the PTX, demonstrating the performance cliff that SROA normally prevents.
- At -O0, SROA still runs (it is correctness-relevant for address space resolution), but with a more conservative threshold. Compare the .local frame size between -O0 and -O2.
Cross-References
- Scalar Passes Hub -- hub page linking SROA, EarlyCSE, and JumpThreading with GPU-context summaries
- Pipeline & Ordering -- pipeline positions 4 and post-sinking
- Register Allocation -- surviving allocas become .local spills, directly increasing register pressure
- Rematerialization -- recomputes cheap values to reduce register pressure; operates downstream of SROA
- StructSplitting -- NVIDIA custom pass that splits struct arguments at the call boundary; complements SROA's intra-procedural splitting
- MemorySpaceOpt -- resolves generic pointers to specific address spaces; runs after SROA
- Hash Infrastructure -- the open-addressing hash table used by the SROA pass state
EarlyCSE (Early Common Subexpression Elimination)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Based on LLVM 20.0.0 EarlyCSE.cpp. Evidence: iterative (non-recursive) dominator-tree walk matches the LLVM 16+ refactoring; MemorySSA-backed variant with early-cse-memssa pipeline parameter matches LLVM 14+. NVIDIA adds four GPU extensions (barrier-aware versioning, AS 7 handling, NVVM call CSE, PHI limit) and a fourth scoped hash table not present in any upstream version.
EarlyCSE is a fast dominator-tree-walk pass that eliminates redundant computations, loads, and calls within a function. Cicc's version is not stock LLVM 20.0.0 -- the binary contains four CUDA-specific extensions that handle GPU memory model semantics: barrier-aware memory versioning with hardcoded NVVM intrinsic ID checks, shared memory address space 7 protection against unsafe store-to-load forwarding, a dedicated NVVM intrinsic call CSE handler with a fast-path for thread-invariant special register reads, and a PHI operand limit of 5 for compile-time control. It also adds a fourth scoped hash table (store-forwarding) that upstream lacks.
Key Facts
| Property | Value |
|---|---|
| Pass name | "early-cse" (standard), "early-cse-memssa" (MemorySSA variant) |
| Pipeline parser params | memssa (selects MemorySSA-backed variant) |
| Entry point (standard) | sub_2778270 |
| Entry point (MemorySSA) | sub_27783D0 |
| Core function | sub_2780B00 (12,350 bytes) |
| NVVM call CSE handler | sub_2780450 (1,142 bytes, ~263 decompiled lines) |
| Pipeline slot | 525, 593 (tier 2); 245, 291, ~370 (tier 3); skipped at tier 1 |
| Disable flag | NVVMPassOptions offset +1440 |
| Pipeline assembler | sub_18E4A00 (MemorySSA variant), sub_196A2B0 (standard) |
| Upstream LLVM file | llvm/lib/Transforms/Scalar/EarlyCSE.cpp |
| NVIDIA modifications | Barrier generation tracking, AS 7 handling, NVVM call CSE, PHI limit, store-fwd table |
Algorithm Overview
The pass performs a stack-driven iterative DFS over the dominator tree. At each basic block it scans instructions linearly, attempting three forms of elimination:
1. Expression CSE -- arithmetic, casts, comparisons, GEPs with identical operands are looked up in a scoped hash table. If a matching canonical instruction exists, the redundant one is replaced via RAUW and erased.
2. Load CSE and store-to-load forwarding -- loads from the same address and type as a prior load (or a prior store) are replaced with the already-available value. This is gated by a CurrentGeneration counter that invalidates stale entries whenever a memory-writing instruction or barrier intrinsic is encountered.
3. Call CSE -- readonly/readnone calls with identical targets and arguments are deduplicated. The NVVM-specific handler sub_2780450 provides a fast-path for thread-invariant NVVM intrinsics (llvm.nvvm.read.ptx.sreg.*).
The dominator tree walk is not recursive. It uses an explicit growable stack (initial capacity 8 entries, 64 bytes) with DomTreeScope nodes that record per-scope hash table insertions. On scope exit all insertions are tombstoned. This matters for deeply-nested GPU kernel CFGs where stack overflow from recursion is a real risk.
function EarlyCSE(ctx):
    root = ctx.Function.DomTree.root
    stack.push(DomTreeScope(root))
    while stack is not empty:
        scope = stack.top()
        ctx.CurrentGeneration = scope.generation_begin
        if not scope.visited:
            for inst in scope.bb.instructions:
                processNode(ctx, inst)      // CSE logic below
            scope.visited = true
            scope.generation_end = ctx.CurrentGeneration
        else:
            if scope has unvisited children:
                child = scope.children.pop_front()
                stack.push(DomTreeScope(child))
                continue
            else:
                unwindScope(ctx, scope)     // tombstone entries, free node
                stack.pop()
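The scope discipline above -- reuse expressions from dominating blocks, undo each scope's insertions on exit so siblings never see each other's entries -- can be modeled in a few dozen lines. This is an illustrative sketch (names like ScopedTable and cse_walk are invented here), not a reconstruction of the binary's data layout:

```python
class ScopedTable:
    """Scoped hash table model: per-scope insertions are recorded and
    undone ("tombstoned") when the scope exits, mirroring the
    DomTreeScope insertion lists."""
    def __init__(self):
        self.table = {}      # expression key -> canonical defining block
        self.scopes = []     # stack of per-scope insertion records

    def enter_scope(self):
        self.scopes.append([])

    def insert(self, key, value):
        self.scopes[-1].append((key, self.table.get(key)))
        self.table[key] = value

    def exit_scope(self):
        for key, shadowed in reversed(self.scopes.pop()):
            if shadowed is None:
                del self.table[key]          # tombstone the entry
            else:
                self.table[key] = shadowed   # restore dominating entry

def cse_walk(children, block_exprs, root):
    """Explicit-stack DFS over a dominator tree: expressions available in
    a dominating block are reused; sibling scopes are isolated."""
    ht, eliminated = ScopedTable(), []

    def process(block):
        for expr in block_exprs.get(block, ()):
            if expr in ht.table:
                eliminated.append((block, expr))  # redundant: RAUW + erase
            else:
                ht.insert(expr, block)

    stack = [(root, iter(children.get(root, ())))]
    ht.enter_scope()
    process(root)
    while stack:
        block, kids = stack[-1]
        child = next(kids, None)
        if child is None:
            ht.exit_scope()
            stack.pop()
        else:
            stack.append((child, iter(children.get(child, ()))))
            ht.enter_scope()
            process(child)
    return eliminated
```

With root A dominating siblings B and C, an expression repeated in A and B is eliminated in B, while an expression shared only by B and C survives in both -- neither block dominates the other.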
DomTreeScope Structure
Each scope node is 160 bytes (0xA0), allocated via sub_22077B0:
| Offset | Type | Field |
|---|---|---|
| +0x00 | u32 | generation_begin -- snapshot of CurrentGeneration at scope entry |
| +0x04 | u32 | generation_end -- value at scope exit (after processing all instructions) |
| +0x08 | BasicBlock* | The basic block for this domtree node |
| +0x10 | DomTreeNode** | children_begin |
| +0x18 | DomTreeNode** | children_end |
| +0x20 | scope link | Expression ScopedHT chain -> ctx+0x78 |
| +0x38 | scope link | Load ScopedHT chain -> ctx+0x108 |
| +0x50 | scope link | Call ScopedHT chain -> ctx+0x198 |
| +0x68 | scope link | Call-values ScopedHT chain -> ctx+0x228 |
| +0x80 | scope link | Store-fwd ScopedHT chain -> ctx+0x250 |
| +0x98 | u8 | visited flag (0 = not yet processed, 1 = instructions scanned) |
Each chain entry is a triplet [link_fwd, link_back, insertion_list_head] occupying 24 bytes. On scope exit, the pass walks each insertion list and tombstones the corresponding hash table entries, then frees the scope node.
Four Scoped Hash Tables
Upstream LLVM EarlyCSE has three scoped hash tables (expression, load, call). Cicc adds a fourth dedicated to store-to-load forwarding.
| Table | Context offset | Hash function | Equality | Key | Value |
|---|---|---|---|---|---|
| Expression | +0xE8 / +0xF8 | sub_277F590 | sub_277AC50 | Opcode + operand value-numbers | Canonical instruction pointer |
| Load | +0x178 / +0x188 | sub_277CF80 | sub_27792F0 | Load address + type | Previously loaded value |
| Call | +0x230 / +0x240 | sub_277CF80 | sub_27792F0 | Call target + arguments | Return value |
| Store-fwd | +0x2C0 / +0x2D0 | sub_277C800 | sub_27781D0 | Store address + type | Stored value |
All four use open-addressing with linear probing. Sentinel values: 0xFFFFFFFFFFFFF000 = empty, 0xFFFFFFFFFFFFE000 = tombstone. Resize triggers at 75% load factor (4 * (count + 1) >= 3 * bucket_count) or when tombstones exceed 12.5% of capacity. Bucket counts are always a power of two.
The store-forwarding table is the NVIDIA addition. Upstream EarlyCSE performs store-to-load forwarding through the load table by inserting the stored value when a store is processed. Cicc separates this into a dedicated table, which enables more aggressive dead-store detection within the early pipeline -- two stores to the same address with no intervening load or barrier can be recognized without polluting the load table's namespace.
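The probing and resize mechanics shared by all four tables can be sketched as follows. This is a Python model of the recovered behavior (class and method names are invented; the real tables store raw 64-bit keys, not Python objects), using the documented sentinels and the 75% / 12.5% resize rule:

```python
EMPTY     = 0xFFFFFFFFFFFFF000   # sentinel key values from the binary
TOMBSTONE = 0xFFFFFFFFFFFFE000

class OpenAddressingTable:
    """Linear-probing table with power-of-two bucket counts."""
    def __init__(self, buckets=8):
        self.keys = [EMPTY] * buckets
        self.vals = [None] * buckets
        self.count = 0
        self.tombstones = 0

    def _needs_grow(self):
        n = len(self.keys)
        # resize at 75% load, or when tombstones exceed 12.5% of capacity
        return 4 * (self.count + 1) >= 3 * n or 8 * self.tombstones > n

    def _grow(self):
        live = [(k, v) for k, v in zip(self.keys, self.vals)
                if k not in (EMPTY, TOMBSTONE)]
        self.keys = [EMPTY] * (2 * len(self.keys))
        self.vals = [None] * len(self.keys)
        self.count = self.tombstones = 0
        for k, v in live:
            self.insert(k, v)

    def insert(self, key, value):
        if self._needs_grow():
            self._grow()
        i = hash(key) & (len(self.keys) - 1)
        while self.keys[i] not in (EMPTY, key):   # skip tombstones too
            i = (i + 1) & (len(self.keys) - 1)
        if self.keys[i] == EMPTY:
            self.count += 1
        self.keys[i] = key
        self.vals[i] = value

    def lookup(self, key):
        i = hash(key) & (len(self.keys) - 1)
        while self.keys[i] != EMPTY:              # probe past tombstones
            if self.keys[i] == key:
                return self.vals[i]
            i = (i + 1) & (len(self.keys) - 1)
        return None

    def remove(self, key):
        i = hash(key) & (len(self.keys) - 1)
        while self.keys[i] != EMPTY:
            if self.keys[i] == key:
                self.keys[i] = TOMBSTONE          # mark, never shift
                self.count -= 1
                self.tombstones += 1
                return
            i = (i + 1) & (len(self.keys) - 1)
```

Tombstoning instead of back-shifting is what makes the scoped-exit cleanup cheap: scope unwinding only has to overwrite keys, never rehash.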
CUDA Extension 1: Barrier-Aware Memory Versioning
The context structure holds a CurrentGeneration counter at offset +0x2E0 (type u32). This counter acts as a memory version number. Every load and call CSE lookup checks whether the cached entry's generation matches the current generation -- a mismatch means an intervening memory-modifying operation invalidated the entry.
Generation is incremented when:
- A trivially dead instruction is skipped (minor bump at 0x2781950)
- sub_B46490 (hasMemoryWriteSideEffects) returns true for a call instruction
- Any of four hardcoded NVVM barrier intrinsic IDs is encountered
The barrier intrinsic checks are explicit cmp dword ptr [rax+24h], IMM instructions at specific addresses in the binary:
| Address | Encoding | Intrinsic ID | Decimal | Identity |
|---|---|---|---|---|
| 0x2781B30 | cmp ..., 9Bh | 0x9B | 155 | llvm.nvvm.barrier0 (__syncthreads) |
| 0x27812AF | cmp ..., CDh | 0xCD | 205 | llvm.nvvm.membar.* (device/system memory barrier) |
| 0x2781F4D | cmp ..., 123h | 0x123 | 291 | llvm.nvvm.bar.sync (named barrier sync) |
| 0x2781F40 | cmp ..., 144h | 0x144 | 324 | NVVM cluster barrier (SM 90+ cluster-scope fence) |
These checks are a safety net on top of the intrinsics' declared memory-effect attributes. Upstream LLVM relies solely on the memory-effect modeling to determine whether a call clobbers memory. Cicc adds the explicit ID checks because the barrier intrinsics' memory effects, as declared in the NVVM tablegen files, may not fully capture the GPU-specific semantics: a bar.sync does not just write memory from the perspective of one thread -- it makes writes from other threads visible. The LLVM memory model has no native concept of inter-thread visibility guarantees at the IR level, so the explicit ID checks are the correctness backstop.
When any of these four intrinsics appears between two memory operations, EarlyCSE refuses to forward the earlier value. This prevents optimizations like:
;; INCORRECT optimization that barriers prevent:
%v1 = load i32, ptr addrspace(3) %p ;; load from shared memory
call void @llvm.nvvm.barrier0() ;; __syncthreads()
%v2 = load i32, ptr addrspace(3) %p ;; CANNOT be replaced with %v1
;; Another thread may have written to %p between the barrier and this load
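The generation discipline that prevents this forwarding is simple to model: cached entries carry the generation at which they were recorded and are invalidated lazily by a mismatch, never eagerly deleted. The sketch below is a hypothetical Python model of that mechanism (class and method names invented), not decompiled code:

```python
BARRIER_IDS = {155, 205, 291, 324}  # barrier0, membar.*, bar.sync, cluster

class MemoryVersioner:
    """Model of the CurrentGeneration counter (ctx+0x2E0)."""
    def __init__(self):
        self.generation = 0
        self.loads = {}  # address -> (value, generation when recorded)

    def visit(self, inst):
        """inst: ('load', addr, value) | ('store', addr) | ('call', id)."""
        kind = inst[0]
        if kind == "load":
            _, addr, value = inst
            hit = self.loads.get(addr)
            if hit is not None and hit[1] == self.generation:
                return hit[0]            # CSE: reuse the earlier value
            self.loads[addr] = (value, self.generation)
        elif kind == "store":
            self.generation += 1         # clobber invalidates all entries
        elif kind == "call" and inst[1] in BARRIER_IDS:
            self.generation += 1         # barrier: other threads' writes land
        return None

mv = MemoryVersioner()
mv.visit(("load", "p", "v1"))
mv.visit(("call", 155))                          # llvm.nvvm.barrier0
assert mv.visit(("load", "p", "v2")) is None     # v1 NOT forwarded past barrier
```

Without the intervening barrier the second load of the same address would return the cached value, which is exactly the forwarding the barrier check must suppress.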
CUDA Extension 2: Shared Memory Address Space 7 Handling
Stores targeting NVPTX address space 7 (the internal representation for __shared__ memory) receive special treatment that prevents unsafe store-to-load forwarding.
At address 0x2781BB6, the pass checks byte [rdx+8] == 7 on the store's pointer operand type. When this matches, the store is routed through sub_B49E20 (isSharedMemoryStore), which calls sub_B43CB0 (getCalledFunction) and sub_B2D610 (hasIntrinsicID) to confirm the target is a shared memory variable (string ID 0x31 = "shared").
The motivation: shared memory is written by one thread and potentially read by a different thread after a barrier. Forwarding a stored value to a subsequent load in the same thread is only safe if no barrier intervenes -- but even then, a reimplementor must be careful because the CUDA memory model permits a thread to read its own store without a barrier, while other threads cannot. The shared-memory path in EarlyCSE conservatively disables forwarding for shared-memory stores to avoid the case where a load is CSE'd to the stored value, but the actual runtime value has been modified by another thread's post-barrier store to the same location.
processStore(ctx, store_inst):
    ptr_type = store_inst.pointer_operand.type
    if ptr_type.address_space == 7:          // NVPTX shared memory
        if isSharedMemoryStore(store_inst):  // sub_B49E20
            ctx.CurrentGeneration++          // invalidate load/call tables
            return                           // do NOT insert into store-fwd table
    // Normal path: insert stored value into store-fwd table for later forwarding
    insertStoreForwarding(ctx, store_inst)
CUDA Extension 3: NVVM Intrinsic Call CSE (sub_2780450)
The dedicated function sub_2780450 (1,142 bytes, ~263 decompiled lines) handles CSE for calls to NVVM builtin intrinsics. It is entered when the main instruction loop detects a single-use-by-call pattern: the instruction's result has exactly one user, that user is a CallInst (opcode 0x1F), and the operand index is 3.
The function provides a fast-path for thread-invariant special register reads. Many NVVM intrinsics return values that are constant for the lifetime of a kernel invocation from a given thread's perspective:
- llvm.nvvm.read.ptx.sreg.tid.x/y/z -- threadIdx.x/y/z
- llvm.nvvm.read.ptx.sreg.ntid.x/y/z -- blockDim.x/y/z
- llvm.nvvm.read.ptx.sreg.ctaid.x/y/z -- blockIdx.x/y/z
- llvm.nvvm.read.ptx.sreg.nctaid.x/y/z -- gridDim.x/y/z
- llvm.nvvm.read.ptx.sreg.warpsize
- llvm.nvvm.read.ptx.sreg.laneid
Upstream LLVM would model these as readnone and CSE them through the generic call table. The NVVM-specific handler recognizes these intrinsic IDs directly via sub_987FE0 (getIntrinsicID), avoiding the overhead of the general readonly-call analysis. For a kernel that references threadIdx.x twenty times, the fast-path eliminates nineteen redundant intrinsic calls in a single pass.
The function also handles two additional NVVM intrinsic IDs:
| ID | Decimal | Identity | CSE behavior |
|---|---|---|---|
| 0xE4 | 228 | NVVM load intrinsic | CSE-able if same address and no intervening clobber |
| 0xE6 | 230 | NVVM store intrinsic | Blocks CSE (generation bump) |
The check at 0x2783890 tests for intrinsic ID 228 and at 0x27839BC for intrinsic ID 230. The store intrinsic (230) triggers a generation bump, while the load intrinsic (228) is treated as a CSE candidate.
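The thread-invariant fast-path amounts to unconditional deduplication: because these special registers cannot change during a kernel invocation, no clobber or generation analysis is required. The following is a rough model (the real handler dispatches on intrinsic IDs via sub_987FE0, not on name strings, and the function names here are invented):

```python
# Thread-invariant special registers: constant for a given thread for the
# whole kernel invocation, so reads can be deduplicated unconditionally.
THREAD_INVARIANT = {
    "llvm.nvvm.read.ptx.sreg.tid.x", "llvm.nvvm.read.ptx.sreg.tid.y",
    "llvm.nvvm.read.ptx.sreg.ntid.x", "llvm.nvvm.read.ptx.sreg.ctaid.x",
    "llvm.nvvm.read.ptx.sreg.nctaid.x", "llvm.nvvm.read.ptx.sreg.warpsize",
    "llvm.nvvm.read.ptx.sreg.laneid",
}

def cse_sreg_reads(calls):
    """Keep the first read of each thread-invariant sreg; later reads are
    redundant even across stores and barriers."""
    seen, kept = set(), []
    for name in calls:
        if name in THREAD_INVARIANT and name in seen:
            continue  # fast-path elimination, no clobber analysis needed
        seen.add(name)
        kept.append(name)
    return kept

# A kernel referencing threadIdx.x twenty times keeps a single read:
reads = ["llvm.nvvm.read.ptx.sreg.tid.x"] * 20
assert len(cse_sreg_reads(reads)) == 1
```

Calls outside the invariant set fall back to the generic readonly-call analysis and are not touched by this path.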
CUDA Extension 4: PHI Operand Limit
At address 0x2781BED, the pass checks:
if PHINode.getNumIncomingValues() > 5:
    skip CSE analysis for this PHI
This is a compile-time heuristic absent from upstream LLVM. GPU kernel code after loop unrolling and predication commonly produces PHI nodes with dozens of operands. Comparing all incoming values for CSE equivalence becomes quadratic in the operand count (each pair of values must be checked for dominance and equivalence), and the benefit for wide PHIs is marginal -- they rarely represent true common subexpressions.
The threshold of 5 is hardcoded with no cl::opt override.
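The compile-time effect of the guard is easy to quantify: pairwise equivalence checking of n incoming values costs n(n-1)/2 comparisons. A small model (function names are invented for illustration):

```python
PHI_CSE_LIMIT = 5  # hardcoded in the binary; no cl::opt override

def phi_cse_comparisons(phis, limit=PHI_CSE_LIMIT):
    """Model of the guard: PHIs wider than the limit are skipped.
    Returns (PHIs analyzed, pairwise equivalence checks performed)."""
    analyzed, checks = [], 0
    for name, num_incoming in phis:
        if num_incoming > limit:
            continue  # compile-time control: skip wide PHIs
        checks += num_incoming * (num_incoming - 1) // 2
        analyzed.append(name)
    return analyzed, checks
```

A 64-way PHI left over from unrolling would alone cost 2016 pairwise checks; the limit caps the per-PHI worst case at C(5,2) = 10.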
Instruction Classification
The inner processing loop at 0x2780EB5--0x2781110 classifies each instruction by its opcode byte at [instr-0x18]:
| Hex | Opcode | Instruction | EarlyCSE action |
|---|---|---|---|
| 0x55 | Store | StoreInst | Store-to-load forwarding path; shared memory check |
| 0x3D | Call | CallInst | Call CSE or generation bump (if memory effects) |
| 0x3E | Invoke | InvokeInst | Same as CallInst |
| 0x3F | Select | SelectInst | Expression CSE with type-size check |
| 0x40 | PHI | PHINode | Expression CSE if operand count <= 5 |
| <= 0x1C | -- | Constants/args | Skip (not instructions) |
| 0x29 | Return | ReturnInst | Skip |
| 0x43--0x4F | Casts | Cast instructions | Expression CSE |
The classification dispatches to these helper predicates:
| Helper | Address | Purpose |
|---|---|---|
| sub_AA54C0 | 0x2780EC6 | isTriviallyDead -- if true, bump generation and skip |
| sub_D222C0 | 0x2780F97 | isSimpleExpression -- arithmetic, casts, comparisons, GEPs |
| sub_F50EE0 | 0x2780F7A | canCSE / doesNotAccessMemory |
| sub_1020E10 | 0x2781967 | getCallCSEValue -- readonly/readnone call check |
| sub_B46420 | 0x2781B95 | isLoadCSECandidate |
| sub_B46490 | 0x2781CC6 | hasMemoryWriteSideEffects -- triggers generation bump |
Load-Store Forwarding Detailed Flow
The most complex code path (0x2781B48--0x2781F32) handles load CSE and store-to-load forwarding:
processLoad(ctx, load_inst):
    key = computeLoadCSEKey(load_inst, ctx.DataLayout)       // sub_2779A20
    if key.status != 0:
        // Cannot form clean key -- check if call/invoke returns equivalent value
        if load_inst is CallInst (0x3D) or InvokeInst (0x3E):
            tryCallValueForwarding(ctx, load_inst)
        return
    // Check for preceding store to same address
    store_entry = lookupStoreTable(ctx, key)
    if store_entry and store_entry.generation == ctx.CurrentGeneration:
        // Forward stored value to this load
        salvageDebugInfo(load_inst, store_entry.value)       // sub_BD84D0
        replaceAllUsesWith(load_inst, store_entry.value)     // sub_11C4E30
        eraseInstruction(load_inst)                          // sub_B43D60
        return CHANGED
    // Check for preceding load from same address
    load_entry = lookupLoadTable(ctx, key)
    if load_entry and load_entry.generation == ctx.CurrentGeneration:
        // Replace with previously loaded value
        replaceAllUsesWith(load_inst, load_entry.value)
        eraseInstruction(load_inst)
        return CHANGED
    // Not found -- insert into load table for future lookups
    insertLoadTable(ctx, key, load_inst, ctx.CurrentGeneration)
For stores, the pass also performs dead-store detection within the same scope: if two stores target the same address with no intervening load or barrier, the earlier store is dead. The barrier check uses the same four intrinsic ID comparisons described above.
Type Compatibility and Bitwidth Handling
At 0x27829C3--0x2782B87, for expression CSE of SelectInst and PHINode:
- sub_AE43F0 computes type size in bits via the DataLayout
- If size <= 64 bits: use a u64 bitmask as the CSE key
- If size > 64 bits: allocate a BitVector via sub_C43690 and use bit-level comparison
At 0x2782F72--0x2782FD5, integer constant range analysis computes leading zeros/ones to determine effective bit-width. If the value fits in fewer bits, EarlyCSE allows CSE across different integer types (e.g., an i32 value zero-extended to i64 against a native i64). This is an NVIDIA extension that upstream LLVM does not perform -- upstream requires exact type matches for expression CSE.
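The width test reduces to comparing the constant's effective bit-width against the narrower of the two types. A minimal model for the non-negative case (leading-ones handling for negative values is omitted; function names are invented):

```python
def effective_width(value):
    """Effective bit-width of a non-negative constant after dropping
    leading zeros (a zero constant still occupies one bit)."""
    return max(value.bit_length(), 1)

def can_cse_across_int_types(value, width_a, width_b):
    """Model of the NVIDIA extension: expressions yielding the same
    constant in different integer types may CSE when the value fits in
    the narrower type. Upstream demands exact type equality."""
    return effective_width(value) <= min(width_a, width_b)

assert can_cse_across_int_types(255, 32, 64)          # fits in 8 bits
assert not can_cse_across_int_types(1 << 40, 32, 64)  # needs 41 bits
```

This is why an i32 computation zero-extended to i64 can share a CSE entry with a native i64 computation of the same small value.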
Context Structure Layout
The EarlyCSEContext structure passed to sub_2780B00 in rdi:
| Offset | Field | Size |
|---|---|---|
| +0x00 | Current instruction pointer | 8 |
| +0x08 | DataLayout* / TargetData* | 8 |
| +0x10 | Function* (-> [+0x60] = DomTree root) | 8 |
| +0x18 | TargetLibraryInfo* | 8 |
| +0x20 | AssumptionCache* | 8 |
| +0x68 | MemDep result tracking | 8 |
| +0x70 | MemDep analysis reference | 8 |
| +0xE8--+0x110 | Expression hash table (buckets, count, ScopedHT, free list, allocator) | 40 |
| +0x170--+0x198 | Load hash table + ScopedHT | 40 |
| +0x200--+0x258 | Call hash table + ScopedHT | 88 |
| +0x2B8--+0x2D8 | Store-fwd hash table + ScopedHT | 32 |
| +0x2E0 | CurrentGeneration (u32) | 4 |
Stack frame: 0x1D0 bytes (sub rsp, 0x1A8 + 5 callee-saved pushes).
Scope Page Management
The scoped hash tables use 512-byte (0x200) scope pages chained together. When a page fills:
- At 0x2781328: fetch previous page via [stack.end - 8], advance by 0x200 to the next chained page.
- At 0x2782260: when reclaiming, free the current page and pop from the page pointer array.
The initial worklist stack is 64 bytes (8 entries of 8 bytes each). The scope-page-pointer array is 8-byte aligned via lea rbx, [rdx*4 - 4]; and rbx, ~7; add rbx, rax.
memssa Pipeline Parameter
The pipeline parser registers "early-cse" at slot 394 with the parameter keyword memssa. When memssa is specified, the pass uses the MemorySSA-backed variant (sub_27783D0, pass name "Early CSE w/ MemorySSA") instead of the standard variant (sub_2778270, pass name "Early CSE"). Both variants call the same core function sub_2780B00; the difference is that the MemorySSA variant receives a pre-built MemorySSA graph in the context structure and uses it for more precise clobber queries, avoiding the O(n^2) scanning that the non-MSSA path falls back to for load CSE.
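Parameter parsing follows the LLVM new-PM convention of `name<param;param>` specs. The sketch below models variant selection only (parser function names are invented; slot numbering is cicc-internal and not reproduced here):

```python
def parse_pass_spec(spec):
    """Parse 'name<p1;p2>' pipeline syntax into (name, [params])."""
    if "<" in spec and spec.endswith(">"):
        name, _, params = spec[:-1].partition("<")
        return name, [p for p in params.split(";") if p]
    return spec, []

def earlycse_variant(spec):
    """Pick the pass variant the way the registration at slot 394 does."""
    name, params = parse_pass_spec(spec)
    if name != "early-cse":
        raise ValueError("not an early-cse spec: " + spec)
    # 'memssa' selects the MemorySSA-backed entry (sub_27783D0);
    # otherwise the standard entry (sub_2778270) runs.
    return "Early CSE w/ MemorySSA" if "memssa" in params else "Early CSE"
```

Both spellings reach the same core (sub_2780B00); only the clobber-query machinery handed to it differs.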
Knobs
| Knob | Default | Description |
|---|---|---|
| enable-earlycse-memoryssa | true | Master switch for MemorySSA integration |
| earlycse-debug-hash | false | Debug: log hash function inputs/outputs |
| earlycse-mssa-optimization-cap | 500 | Max MemorySSA queries per block before falling back to conservative |
| enable-earlycse-imprecision | false | Allow approximate analysis in pathological cases (huge blocks, deep PHI nests) |
No dedicated cl::opt flags exist for any of the four NVIDIA extensions. The PHI operand limit of 5, the four barrier intrinsic IDs, and the shared-memory address space 7 check are all hardcoded in the binary.
Pipeline Positions and Tier Gating
| Tier | Position(s) | Notes |
|---|---|---|
| Tier 1 (O1) | Skipped | sub_12DE8F0 explicitly gates EarlyCSE with tier != 1 |
| Tier 2 (O2) | 525, 593 | Two invocations: early function simplification and post-loop-optimization |
| Tier 3 (O3) | 245, 291, ~370 | Three invocations; additional late-pipeline run |
| Ofcmid | After Sinking2 | Single invocation in the moderate-optimization path |
The pass is independently disableable via NVVMPassOptions at offset +1440. The same offset gates the standard and MemorySSA variants identically.
Key Constants
| Value | Hex | Meaning |
|---|---|---|
| 160 | 0xA0 | DomTreeScope node size |
| 512 | 0x200 | Scope page size |
| 64 | 0x40 | Initial stack capacity (8 entries) |
| 48 | 0x30 | Hash table entry node size |
| 40 | 0x28 | Insertion record size |
| 0xFFFFFFFFFFFFF000 | -- | Hash table EMPTY sentinel |
| 0xFFFFFFFFFFFFE000 | -- | Hash table TOMBSTONE sentinel |
| 155 | 0x9B | llvm.nvvm.barrier0 intrinsic ID |
| 205 | 0xCD | llvm.nvvm.membar.* intrinsic ID |
| 291 | 0x123 | NVVM bar.sync intrinsic ID |
| 324 | 0x144 | NVVM cluster barrier intrinsic ID |
| 228 | 0xE4 | NVVM load intrinsic ID |
| 230 | 0xE6 | NVVM store intrinsic ID |
| 5 | -- | PHI operand limit for CSE |
Differences from Upstream LLVM 20.0.0
| Feature | Upstream | Cicc |
|---|---|---|
| Scoped hash tables | 3 (expression, load, call) | 4 (+ store-forwarding) |
| Barrier intrinsic checks | Relies on memory-effect attributes only | Explicit ID checks for IDs 155, 205, 291, 324 |
| Shared memory handling | No address-space-specific logic | AS 7 stores skip store-fwd insertion, bump generation |
| NVVM intrinsic call CSE | Generic readonly-call path | Dedicated sub_2780450 with fast-path for sreg.* reads |
| PHI operand limit | None | Skip CSE for PHI nodes with >5 incoming values |
| Cross-type expression CSE | Exact type match required | Allows CSE across integer widths when value range fits |
| Dominator tree walk | Recursive in many LLVM builds | Always iterative (explicit stack) |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| EarlyCSEPass::run (standard variant entry) | sub_2778270 | -- | -- |
| EarlyCSEPass::run (MemorySSA variant entry) | sub_27783D0 | -- | -- |
| Core pass body (domtree walk + instruction processing) | sub_2780B00 | 12,350 | -- |
| handleNVVMCallCSE (NVVM intrinsic call CSE) | sub_2780450 | 1,142 | -- |
| Expression hash function | sub_277F590 | -- | -- |
| Expression equality check | sub_277AC50 | -- | -- |
| Load/call key hash | sub_277CF80 | -- | -- |
| Load/call key equality | sub_27792F0 | -- | -- |
| Store key hash | sub_277C800 | -- | -- |
| Store key equality | sub_27781D0 | -- | -- |
| isSimpleExpression | sub_D222C0 | -- | -- |
| canCSE / doesNotAccessMemory | sub_F50EE0 | -- | -- |
| isSharedMemoryStore (AS 7 check) | sub_B49E20 | -- | -- |
| isSharedMemoryAccess | sub_B49E00 | -- | -- |
| getCallCSEValue (readonly/readnone check) | sub_1020E10 | -- | -- |
| isLoadCSECandidate | sub_B46420 | -- | -- |
| hasMemoryWriteSideEffects | sub_B46490 | -- | -- |
| computeCSEHash / isVolatile | sub_B46500 | -- | -- |
| getIntrinsicID (NVVM intrinsic ID from call) | sub_987FE0 | -- | -- |
| isTriviallyDead | sub_AA54C0 | -- | -- |
| replaceAllUsesWith (RAUW) | sub_11C4E30 | -- | -- |
| salvageDebugInfo | sub_BD84D0 | -- | -- |
| eraseInstruction | sub_B43D60 | -- | -- |
| removeFromParent | sub_27793B0 | -- | -- |
| computeLoadCSEKey | sub_2779A20 | -- | -- |
| insertStoreForwarding | sub_27808D0 | -- | -- |
| insertExprIntoScopedHT | sub_27801B0 | -- | -- |
| lookupScope (find value by generation) | sub_277D510 | -- | -- |
| lookupCallTable | sub_277D3C0 | -- | -- |
| lookupInScopedHT | sub_2778110 | -- | -- |
| shouldInsertIntoTable | sub_27785B0 | -- | -- |
| growTable (double hash table size) | sub_277C980 | -- | -- |
| insertIntoTable (post-grow insert) | sub_277C8A0 | -- | -- |
| cleanupLoadTable (compact after scope exit) | sub_277FFC0 | -- | -- |
| cleanupCallTable (compact after scope exit) | sub_277A110 | -- | -- |
| compareLoadTypes (type compatibility) | sub_277A9A0 | -- | -- |
| TargetData::getTypeSizeInBits | sub_AE43F0 | -- | -- |
| getCalledFunction | sub_B43CB0 | -- | -- |
| hasIntrinsicID | sub_B2D610 | -- | -- |
Common Pitfalls
These are mistakes a reimplementor is likely to make when extending EarlyCSE for a GPU target with barrier semantics.
1. Relying solely on LLVM memory-effect attributes to model barrier semantics. Upstream LLVM models barrier intrinsics as memory-writing calls, which triggers a generation bump through the standard hasMemoryWriteSideEffects path. This is insufficient for GPU barriers: a bar.sync does not just write memory from one thread's perspective -- it makes writes from other threads visible. The LLVM memory model has no native concept of inter-thread visibility guarantees. Cicc adds explicit hardcoded checks for four intrinsic IDs (155, 205, 291, 324) as a safety net. A reimplementation that trusts the declared memory effects alone will forward values across barriers, producing load CSE that reads stale pre-barrier data written by a different thread.
2. Forwarding stores to loads across barriers in shared memory (AS 7). When thread T0 stores to smem[0], a barrier fires, and thread T1 loads from smem[0], the load must see T1's own value (if it wrote) or the value written by whichever thread last stored before the barrier. Forwarding T0's stored value to T0's subsequent load is only safe if no barrier intervenes and no other thread could have written to the same location. Cicc's AS 7 handling conservatively disables store-to-load forwarding for all shared memory stores by bumping the generation counter. A reimplementation that allows shared memory store forwarding without barrier awareness will produce reads that return the local thread's stale value instead of the globally-visible post-barrier value.
3. Missing one or more of the four barrier intrinsic IDs. Cicc checks for IDs 155 (barrier0 / __syncthreads), 205 (membar.*), 291 (bar.sync), and 324 (cluster barrier for SM 90+). A reimplementation that only handles __syncthreads (ID 155) will fail to invalidate the load/call tables when a bar.sync or cluster barrier is encountered. The result: loads before and after a named barrier or cluster-scope fence are incorrectly CSE'd, producing silent data corruption in multi-CTA cooperative kernels.
4. Applying expression CSE to PHI nodes with more than 5 incoming values. Cicc hardcodes a PHI operand limit of 5 for CSE analysis. GPU kernel code after loop unrolling and predication commonly produces PHI nodes with dozens of operands. Comparing all incoming values for CSE equivalence is quadratic in operand count, and the benefit for wide PHIs is negligible -- they rarely represent true common subexpressions. A reimplementation without this threshold will experience severe compile-time regressions on heavily unrolled GPU kernels.
5. Not adding a dedicated store-forwarding hash table. Upstream LLVM uses three scoped hash tables (expression, load, call). Cicc adds a fourth table dedicated to store-to-load forwarding. Without this separation, inserting stored values into the load table pollutes the load namespace, making dead-store detection within the same scope unreliable. Two stores to the same address with no intervening load or barrier should trigger dead-store elimination of the earlier store; mixing stores into the load table obscures this pattern.
Cross-References
- Scalar Passes Hub -- hub page linking SROA, EarlyCSE, and JumpThreading with GPU-context summaries
- MemorySSA Builder for GPU -- the MemorySSA infrastructure consumed by the early-cse-memssa variant
- Hash Infrastructure -- the universal DenseMap mechanics shared by all four hash tables
- Barriers & Sync -- the barrier builtins whose intrinsic IDs trigger generation bumps
- Dead Synchronization Elimination -- the 96KB pass that removes dead barriers; interacts with EarlyCSE's barrier-aware generation tracking
- GVN -- the more expensive redundancy elimination pass that complements EarlyCSE later in the pipeline
- DSE -- Dead Store Elimination, which complements EarlyCSE's within-scope store-to-load forwarding with cross-block analysis
- Pipeline & Ordering -- tier-dependent scheduling and NVVMPassOptions gating
- Alias Analysis & NVVM AA -- address-space-aware alias analysis that feeds into MemorySSA clobber queries
InstCombine
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/Transforms/InstCombine/InstructionCombining.cpp, llvm/lib/Transforms/InstCombine/InstCombine*.cpp (LLVM 20.0.0). The upstream is split across ~15 files by instruction category; cicc inlines them into a single monolithic visitor.
NVIDIA's InstCombine in CICC v13.0 is approximately twice the size of upstream LLVM's, weighing in at roughly 405 KB for the main visitor alone. The monolithic visitor function at sub_10EE7A0 dispatches across 80 unique opcode cases through a three-level switch structure, handling standard LLVM instructions, NVIDIA-extended vector and FMA operations, and three high-opcode NVVM intrinsic dead-code elimination patterns. A separate 87 KB intrinsic folding function (sub_1169C30) handles NVVM-specific canonicalization, and a 127 KB computeKnownBits implementation (sub_11A7600) provides the dataflow backbone. This page covers the visitor architecture, the per-instruction-type visitors recovered from the binary, and the NVIDIA-specific extensions that distinguish this implementation from upstream.
| Registration | New PM #398, parameterized: no-aggressive-aggregate-splitting;...;max-iterations=N |
| Runtime positions | Tier 0 #28 (via sub_19401A0); Tier 1/2/3 #42 (gated by !opts[1000]); see Pipeline |
| Main visitor | sub_10EE7A0 (0x10EE7A0, ~405 KB, 9,258 lines) |
| Intrinsic folding | sub_1169C30 (0x1169C30, ~87 KB, 2,268 lines) |
| computeKnownBits | sub_11A7600 (0x11A7600, ~127 KB, 4,156 lines) |
| SimplifyDemandedBits | sub_11AE870 / sub_11AE3E0 (wrapper + hash table) |
| Opcode cases | 80 unique case labels across 3 switch blocks |
| NVIDIA extra size | ~200 KB beyond upstream (~87 KB intrinsic fold + ~113 KB expanded cases) |
Visitor Architecture
The main visitor sub_10EE7A0 receives an NVVM IR node pointer (__m128i* a2) and attempts to simplify it. A persistent local v1612 aliases the instruction being visited. The function has four structural regions:
Preamble (lines ~1760--2000) performs pre-dispatch checks: validating call-site attributes (opcode 41 for bitwise-assert), handling ternary FMA instructions (opcodes 238--245), checking for constant-foldable select patterns, canonicalizing operand ordering (constant to RHS), and running SimplifyDemandedBits via sub_11A3F30 on the result type.
Opcode dispatch reads the NVVM opcode via sub_987FE0 (getOpcode) and uses a three-level switch:
| Switch Level | Opcode Range | Description |
|---|---|---|
| Level 1 | 0x99--0x2A5 (main) | Standard LLVM instructions (GEP, select, stores, casts, compares, calls, vectors) |
| Level 2 | 0x01--0x42 (low) | Binary operations, casts, early comparisons |
| Level 3 | > 0x13CF (high) | NVIDIA proprietary intrinsic IDs (9549, 9553, 9567) |
Additional if-else chains handle intermediate ranges: opcodes 0xC7E (3198), 0x2E2 (738), 0x827 (2087), 0x2CC (716), 0xE07--0xE08, 0xE4F--0xE51, 0x13C6--0x13C7, and 0x13CD--0x13CE.
The fallback path at LABEL_95 calls sub_F0C430 for generic simplification. The no-change return path at LABEL_155 is referenced 101 times throughout the function.
Per-Instruction Visitors
Each major instruction type is handled by a dedicated visitor function called from the main dispatch. The following table summarizes the recovered visitors with their sizes and key characteristics.
visitBinaryOperator -- sub_10D8BB0
| Address | 0x10D8BB0 (102 KB, 2,078 lines) |
| Dispatch case | 0x3A in the master dispatcher |
| Sibling cases | 0x39 (NSW/NUW-focused), 0x3B (associative/commutative) |
This is the second-largest visitor. It implements approximately 25 cascading simplification phases for all binary arithmetic (Add, Sub, Mul, Div, Rem, Shl, LShr, AShr, And, Or, Xor, and their floating-point counterparts). The phases execute in a strict try-and-return order:
Phase 0 runs quick exits: pattern-matched constant fold (sub_101E960), SimplifyBinOp (sub_F29CA0), algebraic identities (sub_F0F270), NSW/NUW simplification (sub_F11DB0), and critically the NVIDIA-specific intrinsic handler sub_11AE870 which runs before any standard LLVM folds.
Phases 1--9 handle associative/commutative factoring, cross-operand Mul-of-Add matching, delegated simplification, overflow detection, and multiply-shift strength reduction. Phase 5 detects multiply-by-power-of-2 and converts to shift; sub_10BA120 builds the full strength reduction for patterns like x * (2^n + 1) into (x << n) + x.
Phases 10--25 cover Add-of-Mul factoring, shift chains, linear expression folding, subtraction of multiplied constants, demanded-bits masking, reciprocal elimination, overflow intrinsic decomposition, and division/remainder folding. The division constant folder uses sub_C46BD0 (APInt::udiv), sub_C499A0 (APInt::urem), sub_C45F70 (APInt::sdiv), and sub_C49AB0 (APInt::srem).
Four template-instantiated helpers at sub_10D2680--sub_10D2D70 (2,767 bytes each, identical structure) implement matchBinOpReduction parameterized by NVVM intrinsic ID (329, 330, 365, 366) and acceptable opcode range. These detect NVVM horizontal reduction intrinsics (e.g., horizontal add/mul across vector lanes) and simplify them to scalar binary operations.
visitICmpInst -- sub_1136650 + sub_113CA70
| Comprehensive folder | sub_1136650 (0x1136650, 149 KB, 3,697 lines) |
| Per-opcode dispatch | sub_113CA70 (0x113CA70) -- 12 case labels |
The ICmp folder is the single largest function in InstCombine. It runs before the per-opcode dispatch table and handles 15 major fold categories: all-ones/sign-bit constant folds, Mul-with-constant strength reduction (NUW-gated), nested Mul decomposition, common sub-operand cancellation, NUW/NSW flag-gated predicate conversion, known-nonnegativity folds, ConstantRange intersection, shared sub-operand elimination, Sub sign-bit analysis, min/max pattern recognition, computeKnownBits sign-bit analysis, power-of-2 optimizations, remainder pattern matching, XOR/shift decomposition, and Or/And decomposition with type width folding.
NVVM uses a custom predicate encoding stored at ICmpInst+2 as a 6-bit field (*(_WORD*)(inst+2) & 0x3F):
| Value | Predicate | Value | Predicate |
|---|---|---|---|
| 32 | EQ | 33 | NE |
| 34 | UGT | 35 | UGE |
| 36 | ULT | 37 | ULE |
| 38 | SGT | 39 | SGE |
| 40 | SLT | 41 | SLE |
The per-opcode dispatch at sub_113CA70 routes based on the non-constant operand's opcode tag:
| Tag | Instruction | Handler | Size |
|---|---|---|---|
| * (42) | Mul | sub_1128290 | 1,178 lines |
| , (44) | Add | sub_1119FB0 | 413 lines |
| . (46) | Trunc | sub_1115510 | -- |
| 0 (48) | SExt | sub_11164F0 | -- |
| 1 (49) | ZExt | sub_1122A30 | -- |
| 4 (52) | Select | sub_1115C10 | 428 lines |
| 6 (54) | And | sub_1120680 | 911 lines |
| 7 (55) | Or | sub_1126B10 | 786 lines |
| 8 (56) | Xor | sub_1126B10 | shared with Or |
| 9 (57) | Shl | sub_112C930 | 664 lines |
| : (58) | LShr | sub_1133500 | -- |
| ; (59) | Sub | sub_111CED0 + sub_113BFE0 | 519 lines |
visitCastInst -- sub_110CA10
| Address | 0x110CA10 (93 KB, 2,411 lines) |
| Cast chain helper | sub_110B960 (22 KB, 833 lines) |
Handles all cast simplification: same-type identity elimination, bool-to-float chains, integer-to-integer narrowing/widening, FP-to-int special cases, FP narrowing, cast-through-select/PHI, and the major cast-of-cast chain folding. The helper sub_110B960 implements deep cast chain folding for aggregate types using a worklist with a DenseMap for O(1) deduplication, preventing exponential blowup on diamond-shaped use-def graphs. The function is conservative about side effects: sub_B46500 (isVolatile) is called before every fold.
visitSelectInst -- sub_1012FB0
| Address | 0x1012FB0 (74 KB, 1,801 lines) |
| Local variables | 190 total |
Implements 18 prioritized select simplifications: constant fold, undef arm elimination, both-same identity, PHI-through-select, KnownBits sign analysis, ConstantRange analysis, full-range analysis, KnownBits cross-validation, ICmpInst arm synthesis, ExtractValue decomposition, implied condition, canonicalization (delegated to sub_1015760, 27 KB), min/max pattern detection (smin/smax/umin/umax/abs/nabs via four helpers), select-in-comparison chains, PHI-select worklist scan (DenseMap with hash (ptr >> 9) ^ (ptr >> 4)), ValueTracking classification, pointer-null folding, and load/trunc delegation.
visitPHINode -- sub_1175E90
| Address | 0x1175E90 (~57 KB, ~2,130 lines) |
Implements 16 PHI optimization strategies tried in sequence: SimplifyInstruction constant fold, foldPHIArgOpIntoPHI (binary/cast with one varying operand), foldPHIArgConstantOp, typed opcode dispatch (GEP via sub_1172510, InsertValue, ExtractValue, CmpInst, BinOp/Cast), GEP incoming deduplication with loop back-edge analysis, single-use PHI user check, GEP-of-PHI transform (sub_1174BB0, 1,033 lines), phi-cycle escape detection, trivial PHI elimination (all-same non-PHI value), recursive PHI cycle resolution (sub_116D410), operand reordering canonicalization, identical-PHI-in-block deduplication, pointer-type struct GEP optimization, all-undef incoming check, and dominator-tree GEP index hoisting using two DenseMaps.
visitCallInst -- sub_1162F40
| Address | 0x1162F40 (50 KB, 1,647 lines) |
Processes calls through a 15-step cascade: LibCall simplification (sub_100A740), standard intrinsic folding (sub_F0F270), return attribute analysis (sub_F11DB0), overflow/saturating arithmetic (sub_115C220), inline mul-by-constant folding, generic call combining (sub_115A080), FMA/fneg/fsub canonicalization (the largest block, requiring all of nnan+ninf+nsz+arcp+reassoc on both call and function), constant-argument intrinsic folding, unary intrinsic constant folding, exp/log pair detection (IDs 325 and 63), sqrt/rsqrt folding (IDs 284, 285), min/max folding (IDs 88, 90), nested intrinsic composition, division-to-reciprocal-multiply, and finally the NVIDIA-specific sub_115A4C0 which dispatches to the 87 KB intrinsic folding table.
visitLoadInst -- sub_1152CF0
| Address | 0x1152CF0 (~68 KB, ~1,680 lines) |
| Stack frame | 0x4F0 (1,264) bytes |
Four major paths: constant-address fold (loads from known constant pointers with types <= 64 bits are replaced via symbol table lookup using sub_BCD420), address-space-based elimination (loads from non-AS(32) pointers are replaced with constants, exploiting CUDA's read-only address spaces), the main store-to-load forwarding worklist (BFS over the def-use graph following GEPs, PHIs, and bitcasts, depth-limited by global qword_4F90528), and dominator-based forwarding for non-pointer loads. Alignment is propagated as the maximum of source and destination, with the volatile bit carefully preserved through the *(node+2) 16-bit field (bits [5:0] = log2(alignment), bit [6] = volatile flag).
NVIDIA-Specific Extensions
NVVM Intrinsic Folding -- sub_1169C30
This 87 KB function is the core of NVIDIA's additions to InstCombine. Called from the main visitor when the instruction is an NVIDIA intrinsic, it uses a two-layer dispatch:
Layer 1 (primary switch, entered when the uses-list is empty or the "fast" flag at a1+336 is set) dispatches on the node's byte-tag:
| Tag | Char | Fold Type |
|---|---|---|
| 42 | * | FNeg/negation -- pushes negation through arithmetic via the "Negator" chain |
| 55 | 7 | Vector extract from intrinsic result (full-width extract becomes identity) |
| 56 | 8 | Vector insert into intrinsic result (full-width insert becomes And mask) |
| 59 | ; | Multiply-like symmetric intrinsic (folds when one operand is known non-negative) |
| 68 | D | ZExt of i1 intrinsic result (bypasses intrinsic wrapper) |
| 69 | E | SExt of i1 intrinsic result (bypasses intrinsic wrapper) |
| 85 | U | Call-site fold for llvm.nvvm.* with specific IDs (313, 362) |
| 86 | V | Select-like intrinsic fold (dead select elimination) |
Layer 2 (depth-gated by qword_4F908A8 = instcombine-negator-max-depth) adds aggressive cases:
| Tag | Char | Fold Type |
|---|---|---|
| 46 | . | Dot product fold |
| 54 | 6 | Indexed access / extract with fold |
| 58 | : | Comparison intrinsic fold |
| 67 | C | Type conversion intrinsic fold |
| 84 | T | Tensor / multi-operand intrinsic fold |
| 90 | Z | Zero-extend intrinsic fold |
| 91 | [ | Three-operand fold (e.g., fma) |
| 92 | \ | Four-operand fold (e.g., dp4a) |
| 96 | ` | Unary special intrinsic fold |
The FNeg case (tag 42) is the most complex. It first attempts constant folding: if the operand is all-ones (-1), it creates sub(0, operand) via CSE lookup with opcode 30. When the simple fold fails, it falls through to the Negator chain at LABEL_163: sub_1168D40 collects all negatable sub-expressions, sub_1169800 attempts to fold negation into each operand, and the results are combined with sub_929C50 or sub_929DE0. This pushes negation through chains of arithmetic to find a cheaper representation, depth-gated to prevent exponential blowup. Created replacement instructions carry .neg modifier metadata for PTX emission.
Three High-Opcode NVIDIA Intrinsics
Opcodes 0x254D (9549), 0x2551 (9553), and 0x255F (9567) are NVIDIA-proprietary intrinsic IDs handled directly in the main visitor. All three share the same pattern: extract the commuted-operand index via v1612->m128i_i32[1] & 0x7FFFFFF, verify the other operand has byte-tag 12 or 13 (ConstantInt/ConstantFP), query metadata via sub_10E0080 with mask 0xFFFFFFFFFFFFFFFF, and test specific bit patterns:
| Opcode | Test | Fold Condition |
|---|---|---|
| 0x2551 (9553) | ((result >> 40) & 0x1E) == 0x10 | Fold when bit pattern mismatches |
| 0x255F (9567) | (result & 0x10) != 0 | Fold when bit 4 is clear |
| 0x254D (9549) | (result & 0x200) != 0 | Fold when bit 9 is clear |
When the fold condition holds (that is, when the decompiled test in the table evaluates false), the shared epilogue calls sub_F207A0(v6, v1612->m128i_i64) (eraseInstFromFunction), deleting the instruction entirely. These implement dead-code elimination for NVIDIA intrinsics with constant arguments matching known-safe-to-remove criteria.
Separate Storage Assume Bundles
At lines 6557--6567 of the main visitor, the code iterates over operand bundles on llvm.assume calls (opcode 0x0B). For each bundle with a tag of exactly 16 bytes matching "separate_storage" (verified by memcmp), it calls sub_10EA360 on both bundle operands. This implements NVIDIA's separate_storage alias analysis hint, allowing InstCombine to exploit non-aliasing assumptions for pairs of pointers declared to reside in separate memory spaces.
Expanded GEP Handling
The GEP case (opcode 0x99 = 153) is significantly expanded compared to upstream. The global dword_4F901A8 controls a depth-limited chain walk for nested GEP simplification:
v729 = getOperand(0) of GEP
if (dword_4F901A8) {
v730 = 0;
do {
if (!isConstantGEP(v729)) break;
++v730;
v729 = getOperand(0, v729); // walk up
} while (v730 < dword_4F901A8);
}
if (*(_BYTE*)v729 != 85) // 85 = CallInst
goto LABEL_155; // bail
This walks backward through constant-index GEP chains up to dword_4F901A8 steps, looking for a CallInst base pointer. The knob controls how many GEP levels to look through when simplifying GEP(GEP(GEP(..., call_result))).
Ternary/FMA Support
The preamble handles 3-operand instructions (opcodes 238--245) representing fused multiply-add variants. This includes checking whether the third operand is a zero-constant, converting between FMA opcode variants (238 vs. 242), and handling address space mismatches on FMA operand types -- entirely NVIDIA-specific for CUDA's FMA intrinsics.
computeKnownBits -- sub_11A7600
The 127 KB computeKnownBits implementation dispatches on the first byte of the NVVM IR node (the type tag):
| Tag | Char | Node Type |
|---|---|---|
| 42 | * | Truncation (extracts low bits) |
| 44 | , | GEP (computes known bits through pointer arithmetic) |
| 46 | . | Comparison (known result bits) |
| 48 | 0 | Select (intersection of known bits from both arms) |
| 52 | 4 | Branch-related |
| 54 | 6 | Vector shuffle |
| 55 | 7 | Vector extract |
| 56 | 8 | Vector insert |
| 57 | 9 | PHI node (intersection across incoming values) |
| 58 | : | Comparison variant |
| 59 | ; | Invoke / call |
| 67 | C | Cast chain |
| 68 | D | Binary op path 1 |
| 69 | E | Binary op path 2 |
| 85 | U | CallInst (sub-dispatch: 0x0F=abs, 0x42=ctpop, 0x01=bitreverse) |
| 86 | V | LoadInst |
A debug assertion at lines 2204--2212 fires when computeKnownBits and SimplifyDemandedBits produce inconsistent results, printing both APInt values and calling abort(). This invariant check (known_zero & known_one == 0, plus consistency with the demanded mask) is compiled in for debug/checked builds.
SimplifyDemandedBits -- sub_11AE870
The wrapper sub_11AE870 gets the bit-width via sub_BCB060 (or sub_AE43A0 for non-integer types), allocates two APInts sized to the width, delegates to sub_11AE3E0, and frees any heap-allocated storage. The core implementation at sub_11AE3E0 (235 lines) calls computeKnownBits, then if the instruction was simplified, walks the use-chain and inserts each user into a hash table (open-addressing with quadratic probing, hash = (ptr >> 9) ^ (ptr >> 4)) at offset +2064 from the InstCombiner context. This "seen instructions" set prevents infinite recursion during demanded-bits propagation.
Configuration Knobs
| Global | CLI Flag | Default | Used In |
|---|---|---|---|
| dword_4F901A8 | (GEP chain look-through depth) | unknown | GEP handler (case 0x99) |
| qword_4F908A8 | instcombine-negator-max-depth | -1 | sub_1169C30 (depth gate) |
| qword_4F90988 | instcombine-negator-enabled | 1 | ctor_090 |
| qword_4F8B4C0 | instcombine-split-gep-chain | -- | ctor_068 |
| qword_4F8B340 | instcombine-canonicalize-geps-i8 | -- | ctor_068 |
| qword_4F909E0 | instcombine-max-num-phis | -- | ctor_091 |
| qword_4F90120 | instcombine-guard-widening-window | 3 | ctor_087 |
| qword_4F90528 | (load forwarding search depth) | -- | sub_1152CF0 |
Key Helper Functions
| Address | Recovered Name | Purpose |
|---|---|---|
| sub_987FE0 | getOpcode() | Reads NVVM opcode from IR node |
| sub_B46B10 | getOperand(idx) | Operand access |
| sub_B44E20 | eraseFromParent() | Unlink instruction |
| sub_F207A0 | eraseInstFromFunction() | Delete instruction from worklist |
| sub_F162A0 | replaceInstUsesWith() | RAUW and return replacement |
| sub_F20660 | setOperand(i, val) | Replace operand in-place |
| sub_B33BC0 | CreateBinOp() | IRBuilder binary op creation |
| sub_B504D0 | CreateBinOp(no-flags) | Binary op without flags |
| sub_B51D30 | CreateCast() | Cast instruction creation |
| sub_AD8D80 | ConstantInt::get(type, APInt) | Constant integer factory |
| sub_AD64C0 | ConstantInt::get(type, val, signed) | Constant integer factory (scalar) |
| sub_BCB060 | getScalarSizeInBits() | Type bit-width query |
| sub_10E0080 | getKnownBitsProperty() | Metadata property query |
| sub_B43CB0 | getFunction() | Get parent function |
| sub_B43CA0 | getParent() | Get parent basic block |
| sub_10A0170 | extractFlags() | Read fast-math, exact, etc. |
| sub_B44900 | isCommutative() | Check commutativity |
| sub_C444A0 | APInt::countLeadingZeros() | Bit analysis |
| sub_986760 | APInt::isZero() | Zero test |
| sub_10EA360 | recordSeparateStorageOperand() | Separate storage alias hint |
Diagnostic Strings
Diagnostic strings recovered from the InstCombine binary region. InstCombine uses assertion-style diagnostics rather than optimization remarks; the computeKnownBits consistency check is the primary runtime diagnostic.
| String | Source | Category | Trigger |
|---|---|---|---|
"computeKnownBits(): " | sub_904010 in sub_11A7600 line ~2204 | Assertion | Debug build: computeKnownBits and SimplifyDemandedBits produce inconsistent results (prints both APInt values, then calls abort()) |
"SimplifyDemandedBits(): " | sub_904010 in sub_11A7600 line ~2212 | Assertion | Debug build: paired with computeKnownBits() inconsistency diagnostic above |
"separate_storage" | Main visitor lines 6557--6567 | Bundle tag | Matched via memcmp (16 bytes) on llvm.assume operand bundles; not a user-visible diagnostic |
"instcombine-negator-max-depth" | ctor_090 at 0x4F908A8 | Knob | Knob registration (default -1, unlimited) |
"instcombine-negator-enabled" | ctor_090 at 0x4F90988 | Knob | Knob registration (default 1, enabled) |
"instcombine-split-gep-chain" | ctor_068 at 0x4F8B4C0 | Knob | Knob registration |
"instcombine-canonicalize-geps-i8" | ctor_068 at 0x4F8B340 | Knob | Knob registration |
"instcombine-max-num-phis" | ctor_091 at 0x4F909E0 | Knob | Knob registration |
"instcombine-guard-widening-window" | ctor_087 at 0x4F90120 | Knob | Knob registration (default 3) |
InstCombine does not emit OptimizationRemark diagnostics. The only runtime-visible diagnostic is the debug assertion that fires when computeKnownBits and SimplifyDemandedBits produce inconsistent results (known_zero & known_one != 0, or results disagree with the demanded mask). This check is compiled into debug/checked builds only and calls abort() after printing both APInt values.
Size Contribution Estimate
| Component | Size | Description |
|---|---|---|
| Upstream visitor baseline | ~200 KB | Standard LLVM visiting ~50 instruction types |
| sub_1169C30 intrinsic folding | ~87 KB | NVVM-specific intrinsic canonicalization |
| NVVM GEP/FMA/vector cases | ~40 KB | Expanded GEP chains, ternary FMA, vector width-changing |
| separate_storage + assume | ~10 KB | Operand bundle handling for alias hints |
| High-opcode NVIDIA intrinsics | ~15 KB | DCE for opcodes 0x254D/0x2551/0x255F |
| Expanded comparator/cast | ~50 KB | Extended ICmp, cast chain, select handling |
| NVIDIA total addition | ~200 KB | Roughly doubles upstream InstCombine |
Optimization Level Behavior
| Level | Scheduled | Instances | Notes |
|---|---|---|---|
| O0 | Not run | 0 | No optimization passes |
| Ofcmax | Runs | 1 | Single instance in fast-compile pipeline |
| Ofcmid | Runs | 2 | Early + post-GVN cleanup |
| O1 | Runs | 3-4 | Early, post-SROA, post-GVN, late cleanup |
| O2 | Runs | 4-5 | Same as O1 + additional Tier 2 instance after loop passes |
| O3 | Runs | 5-6 | Same as O2 + Tier 3 instance; benefits from more aggressive inlining/unrolling |
InstCombine is the most frequently scheduled pass in the CICC pipeline. Each instance runs the full 405 KB visitor but benefits from different preceding transformations: the post-SROA instance cleans up cast chains from aggregate decomposition, the post-GVN instance simplifies expressions exposed by redundancy elimination, and the late instance performs final canonicalization before codegen. The instcombine-negator-max-depth and instcombine-negator-enabled knobs apply uniformly across all instances. Even at Ofcmax, at least one InstCombine run is considered essential for basic IR canonicalization. See Optimization Levels for pipeline tier details.
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Binary size | ~200 KB main visitor | ~405 KB main visitor + 87 KB intrinsic folding (~2x upstream) |
| NVVM intrinsic folding | No NVVM-specific intrinsic canonicalization | Dedicated 87 KB function (sub_1169C30) with two-layer dispatch for negation, vector extract/insert, FMA, tensor, dot product, and 15+ fold types |
| High-opcode DCE | Not present | Three NVIDIA proprietary intrinsic IDs (9549, 9553, 9567) with constant-argument dead-code elimination |
| separate_storage bundles | No separate_storage operand bundle handling | Iterates llvm.assume bundles, extracting "separate_storage" hints for alias-based optimization |
| Ternary FMA opcodes | Standard llvm.fma / llvm.fmuladd folding | Extended preamble handles opcodes 238--245 for CUDA FMA variants with address-space mismatch handling |
| GEP chain look-through | Single-level GEP simplification | Depth-limited chain walk (dword_4F901A8 steps) backward through constant-index GEP chains to find CallInst base pointers |
| Horizontal reduction | Standard intrinsic-based reduction fold | Four template-instantiated matchBinOpReduction helpers for NVVM horizontal reduction intrinsics (IDs 329, 330, 365, 366) |
| KnownBits integration | Separate computeKnownBits in ValueTracking | Fused 127 KB computeKnownBits + SimplifyDemandedBits with GPU special-register range oracle |
GVN (Global Value Numbering)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source:
llvm/lib/Transforms/Scalar/GVN.cpp, llvm/lib/Transforms/Scalar/NewGVN.cpp (LLVM 20.0.0)
CICC v13.0 ships two GVN implementations: the classic GVN pass at 0x1900BB0 (83 KB, ~2,314 decompiled lines) and a NewGVN pass at 0x19F99A0 (68 KB, ~2,460 decompiled lines). Both are derived from upstream LLVM but carry substantial NVIDIA modifications for GPU-specific value numbering, store splitting, and intrinsic-aware CSE. The knob constructor at ctor_201 (0x4E0990) registers eleven tunables that control PRE, store splitting, PHI removal, dominator caching, and recursion depth.
Key Facts
| Property | Value |
|---|---|
| Pass name (pipeline) | gvn (parameterized) |
| Registration | New PM #397, parameterized: no-pre;pre;no-load-pre;load-pre;... |
| Runtime positions | Tier 0 #5 (via sub_1C6E800); also appears at NewGVN/GVNHoist position #6; see Pipeline |
| Classic GVN entry | sub_1900BB0 (83 KB, 2,314 lines) |
| NewGVN entry | sub_19F99A0 (68 KB, 2,460 lines) |
| Knob constructor | ctor_201 at 0x4E0990 |
| Upstream source | llvm/lib/Transforms/Scalar/GVN.cpp, NewGVN.cpp (LLVM 20.0.0) |
Knob Inventory
Knobs are registered in ctor_201 at 0x4E0990. Bool knobs use cl::opt<bool> (vtable 0x49EEC70); int knobs use cl::opt<int> (vtable 0x49EEB70). The store-split limit knobs route through a custom NVIDIA registrar at sub_190BE40 that accepts an int** default initializer.
| Knob | Type | Default | Global Address | Purpose |
|---|---|---|---|---|
| enable-pre | bool | true | 0x4FAEEE0 | Enable Partial Redundancy Elimination |
| enable-load-pre | bool | true | 0x4FAEE00 | Enable load PRE (load sinking across edges) |
| enable-split-backedge-in-load-pre | bool | false | 0x4FAED20 | Allow splitting backedges during load PRE |
| enable-phi-remove | int | 2 | 0x4FAEC40 | PHI removal aggressiveness (0=off, 2=aggressive) |
| dump-phi-remove | int | 0 | 0x4FAEB60 | Dump PHI removal decisions (debug) |
| no-split-stores-below | int | -1 | 0x4FAEA80 | Minimum store width in bits for splitting (-1 = no limit) |
| no-split-stores-above | int | -1 | 0x4FAE9A0 | Maximum store width in bits for splitting (-1 = no limit) |
| split-stores | bool | true | 0x4FAE8C0 | Master enable for store splitting |
| profusegvn | bool | true | 0x4FAE7E0 | Verbose diagnostics via NVIDIA profuse framework |
| gvn-dom-cache | bool | true | 0x4FAE700 | Cache dominator tree query results (cache size 32) |
| max-recurse-depth | int | 1000 | 0x4FAE620 | Maximum recursion depth during simplification |
IR Before/After Example
GVN eliminates redundant computations and forwards store values to loads. The following shows a common GPU pattern: a redundant load eliminated via value numbering, and a store-to-load forward.
Before:
define void @f(ptr addrspace(1) %p, ptr addrspace(1) %q) {
%a = load float, ptr addrspace(1) %p, align 4
%b = fmul float %a, 2.0
%c = load float, ptr addrspace(1) %p, align 4 ; redundant load (same %p, no intervening store)
%d = fadd float %b, %c
store float 42.0, ptr addrspace(1) %q, align 4
%e = load float, ptr addrspace(1) %q, align 4 ; load from location just stored to
ret void
}
After:
define void @f(ptr addrspace(1) %p, ptr addrspace(1) %q) {
%a = load float, ptr addrspace(1) %p, align 4
%b = fmul float %a, 2.0
; %c eliminated -- replaced with %a (same value number)
%d = fadd float %b, %a
store float 42.0, ptr addrspace(1) %q, align 4
; %e eliminated -- forwarded from store (value 42.0)
ret void
}
The second load from %p is eliminated because GVN assigns it the same value number as %a. The load from %q after the store is forwarded directly from the stored constant. On GPU, eliminating memory loads is especially valuable because each avoided ld.global saves hundreds of cycles of memory latency.
Classic GVN Algorithm
The main entry point is GVN::runOnFunction at sub_1900BB0. The pass object is approximately 600 bytes and carries four scoped hash tables plus a dominator tree reference.
Pass Object Layout
| Offset | Field | Purpose |
|---|---|---|
| +0 | vtable | Pass vtable pointer |
| +16 | Function* | Current function being processed |
| +72 | MemoryDependenceResults* | MemDep analysis handle |
| +88 | DominatorTree* | Dominator tree |
| +240 | LeaderTable | Hash: value number to canonical leader |
| +392 | StoreExprTable | Hash: store expressions |
| +544 | LoadExprTable | Hash: load expressions |
| +592 | RPO counter | Current block's RPO number |
Complexity
Let N = number of instructions, B = number of basic blocks, and D = depth of the dominator tree.
- RPO walk: the classic GVN traversal visits every instruction exactly once, O(N); each instruction is hashed and looked up in the leader table in O(1) amortized time via the scoped hash tables.
- Memory dependence: getDependency queries are O(D) per load in the worst case, cached by MemDep to amortize across the function.
- PRE: insertion adds at most O(N) new instructions.
- Store splitting: bounded by the number of stores times the split factor (controlled by no-split-stores-below/above).
- Dominance queries: the gvn-dom-cache (size 32) converts repeated queries from O(D) to O(1).
- PHI removal: replaceAndErase is O(U) per replaced value, where U = number of uses.

Overall: O(N * D) in the worst case due to dominance queries; O(N) in practice with the dominator cache enabled (default). NewGVN's partition-based algorithm is O(N * alpha(N)) amortized, where alpha is the inverse Ackermann function from union-find, though the fixpoint iteration can degrade to O(N^2) on pathological inputs.
Traversal Strategy
The pass walks the dominator tree in reverse post-order using an explicit segmented stack rather than recursion. The initial allocation is an 8-slot array of segment pointers (sub_22077B0(64)), each segment holding 64 pointers (512 bytes). The stack grows by allocating new segments and shrinks by freeing segments when popping past a boundary.
Each dominator tree node is a 136-byte structure (sub_22077B0(136)) containing RPO in/out numbers, basic block pointer, child pointers, scope chain links for all four hash tables, an undo list for backtracking, and a visited flag at offset +128.
Main Processing Loop
For each dominator tree node popped from the stack, the pass:
- Sets the RPO number from the node's RPO_in field.
- Skips already-visited nodes (checked via the byte at offset +128).
- Iterates every instruction in the basic block.
- Attempts SimplifyInstruction (sub_1AE9990) first; if it succeeds, replaces all uses and erases via sub_19003A0.
- Dispatches on the instruction opcode byte at offset +16:
  - Case 4 (call/intrinsic): Classifies purity via bitmask 0x1F133FFE23FFFF, checks volatility through sub_1560260 (flag 36), looks up in the LeaderTable via sub_18FDEE0 (hash) + sub_18FB980 (compare), and inserts new leaders via sub_18FEF10.
  - Case 79 (load): Queries memory dependence, checks four NVIDIA intrinsic IDs for special pointer extraction, then attempts store-to-load forwarding or PRE.
  - Case 114 (store): Inserts into the StoreExprTable using a 5-element hash key (opcode, type, pointer, value, alignment) via sub_18FEB70 / sub_18FFC60.
  - Default: General expression numbering through sub_13E3350, with sub-dispatch for branches (opcode 57), loads (54/55), and call-like instructions (78).
NVIDIA Intrinsic-Aware Value Numbering
The classic GVN recognizes four NVIDIA-specific LLVM intrinsic IDs and extracts their pointer operands with non-standard indices:
| Intrinsic ID | Name | Pointer Operand Index | Semantics |
|---|---|---|---|
| 4057 | llvm.nvvm.ldu | 1 - numOperands | Load from uniform memory; aggressively CSE-able |
| 4085 | llvm.nvvm.ldg | 1 - numOperands | Load via texture/global cache; CSE if same address |
| 4492 | (NVIDIA-specific) | 2 - numOperands | Variant load with 2-operand pointer extraction |
| 4503 | (NVIDIA-specific) | 2 - numOperands | Variant load with 2-operand pointer extraction |
These intrinsics bypass the standard volatility checks and use custom operand extraction, allowing CSE of texture and surface loads that upstream LLVM GVN would not touch.
Scoped Hash Tables
GVN maintains four ScopedHashTable instances, pushed on dominator tree entry and popped on exit. The scope teardown at lines 1858-2101 restores the LoadExprTable via the undo list at offset +120, restores the StoreExprTable via the undo list at offset +72, frees the MemDepTable scope through sub_18FE3A0, and deallocates the 136-byte dom node.
The hash function (sub_18FDEE0, approximately 140 lines) is NVIDIA-modified. For binary ops (opcodes 35-52), it hashes the opcode and operand pointers with canonicalization (smaller pointer first for commutative operations). For comparisons, it includes the predicate. For GEPs (opcodes 86/87), it hashes the entire index sequence via sub_1597510. Hash mixing uses the formula (ptr >> 9) ^ (ptr >> 4) with XOR combining. The 5-element store expression variant (sub_18FEB70) computes:
hash = (v12>>9)^(v12>>4) ^ (v11>>9)^(v11>>4) ^ (v10>>9)^(v10>>4) ^ (37*v13) ^ (v9>>9)^(v9>>4)
Store Splitting
Three knobs control this NVIDIA-specific extension: split-stores (master enable), no-split-stores-below and no-split-stores-above (bit-width bounds, both default -1 meaning unlimited). The custom registrar at sub_190BE40 handles the limit knobs.
When GVN discovers a store that partially overlaps with a load, it attempts to split the store into sub-stores that individually satisfy dependence constraints. This is critical for GPU code where vector stores (float4, int4) partially overlap with subsequent scalar loads, texture/surface stores have alignment constraints, and shared memory bank conflicts may favor different store granularities.
The function sub_18FECC0 classifies store expressions by instruction type: store (54), atomic store (55), shufflevector (58), extractelement (59), and insertelement (82). The shufflevector/extract/insert handling reflects NVIDIA's lowering of vector operations into intermediate forms before GVN runs.
Dominator Cache
The gvn-dom-cache knob (default true, cache size 32) addresses a known performance bottleneck. GVN's dominance queries are O(n * depth) and can become expensive on deeply nested GPU kernels with many divergent branches. The cache stores recent dominates(A, B) results keyed by basic block pointer, converting repeated queries to O(1). The working set size of 32 was chosen empirically: GPU kernels typically have moderate dominator tree depth because shared memory parallelism keeps CFGs relatively flat.
PHI Removal
After GVN identifies equivalent values, some PHI nodes become trivial. The enable-phi-remove knob controls aggressiveness: level 0 disables removal, level 1 removes only trivially redundant PHIs, and level 2 (default) removes PHIs that become trivial after leader substitution.
The core replaceAndErase routine (sub_19003A0, 11 KB) iterates all uses of a replaced value, checks each PHI-node use for trivial foldability using a SmallDenseSet (opcode 23), and employs a 4-way unrolled loop (lines 301-317) for use scanning. This micro-optimization targets the common case of PHIs with many incoming edges after switch lowering or loop unrolling.
NewGVN
The NewGVN implementation at sub_19F99A0 (68 KB) uses congruence classes instead of simple leader tables, following the partition-based algorithm from Karthik Gargi (2002). The pass object stores a congruence class hash table at offset +1400 with count, bucket array, entry count, tombstone count, and bucket count fields.
The algorithm:
- Builds initial partitions from the RPO-ordered instruction list.
- For each worklist instruction, queries the current congruence class and computes the new value expression.
- If the expression maps to a different class, splits the partition.
- Repeats until fixpoint (no more splits).
Hash table growth is handled by sub_19F5120; insert-or-find by sub_19E6B80. Congruence class members are sorted (sub_19F5A00 + sub_19F6B20) for efficient merge operations.
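The partition-refinement loop can be illustrated with a minimal model: start with coarse classes keyed by opcode, then repeatedly split any class whose members disagree on their operands' classes, until nothing changes. This is a generic Gargi-style refinement over toy instructions, not CICC's data structures; all types and names are invented:

```cpp
#include <cassert>
#include <map>
#include <tuple>
#include <vector>

// Toy instruction: an opcode plus operand references. Non-negative
// operands index other instructions; negative values stand for distinct
// leaf inputs (e.g. -1 = %a, -2 = %b).
struct Inst { int opcode; std::vector<int> operands; };

// Returns a congruence-class id per instruction. Including the current
// class in the key makes each round a pure split (never a merge), which
// guarantees termination at a fixpoint.
std::vector<int> congruence_classes(const std::vector<Inst>& insts) {
    std::vector<int> cls(insts.size());
    for (std::size_t i = 0; i < insts.size(); ++i)
        cls[i] = insts[i].opcode;                      // initial partition by opcode
    for (bool changed = true; changed; ) {
        changed = false;
        std::map<std::tuple<int, std::vector<int>>, int> key2cls;
        std::vector<int> next(insts.size());
        for (std::size_t i = 0; i < insts.size(); ++i) {
            std::vector<int> opcls;
            for (int op : insts[i].operands)
                opcls.push_back(op < 0 ? op : cls[op]); // leaves keep their own id
            auto key = std::make_tuple(cls[i], opcls);
            next[i] = key2cls.emplace(key, (int)key2cls.size()).first->second;
        }
        if (next != cls) { cls = next; changed = true; } // repeat until fixpoint
    }
    return cls;
}
```

Two `add %a, %b` instructions land in the same class; an `add %a, %c` does not. The real pass does this sparsely with a worklist rather than recomputing every class each round.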
Memory Dependence Integration
GVN interacts with MemoryDependenceResults at offset +72 through three key functions:
| Function | Address | Role |
|---|---|---|
| getDependency | sub_1422850 | Returns the memory instruction this load depends on |
| getDominatorTree | sub_1423BA0 | Extracts the DomTree from MemDep for dominance queries |
| properlyDominates | sub_1428550 | Tests strict dominance through the MemDep tree |
The replacement safety check (sub_18FBB40) returns true immediately when RPO numbers match, and otherwise chains through getDependency -> getIDom -> dominates().
Profuse Diagnostics
The profusegvn knob (default true) enables verbose output through NVIDIA's custom profuse diagnostic framework, not the standard LLVM OptimizationRemark system. When active, diagnostics are emitted at value replacement decisions, store/load expression matches, and PRE insertion decisions. The framework is likely controlled by environment variables such as CICC_PROFUSE_DIAGNOSTICS.
Key Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| GVN::runOnFunction | 0x1900BB0 | 83 KB | Main classic GVN pass |
| replaceAndErase | 0x19003A0 | 11 KB | Replace uses + erase instruction |
| NewGVN::run | 0x19F99A0 | 68 KB | NewGVN algorithm |
| ctor_201 | 0x4E0990 | 9 KB | GVN knob registration |
| hashExpression | 0x18FDEE0 | ~5 KB | Expression hash function |
| compareExpression | 0x18FB980 | ~2 KB | Expression equality test |
| lookupExpr5 | 0x18FEB70 | ~3 KB | 5-key store expression lookup |
| insertExpr5 | 0x18FFC60 | ~3 KB | 5-key insert with scoped undo |
| insertLeader | 0x18FEF10 | ~5 KB | Leader table insert |
| checkStoreSplit | 0x18FECC0 | ~3 KB | Store expression for splitting |
| canReplace | 0x18FBB40 | <1 KB | Dominance-based replacement check |
| preAvailCheck | 0x18FC460 | ~3 KB | PRE availability analysis |
| performPRE | 0x18FF290 | 10 KB | PRE insertion |
| largeGVNHelper | 0x18F6D00 | 60 KB | PRE / load forwarding helper |
| phiGVNHelper | 0x18FAA90 | 20 KB | PHI-related GVN helper |
| storeSplitHelper | 0x1906720 | 26 KB | Store splitting implementation |
| storeSplitVisit | 0x1905CD0 | 16 KB | Store-split worklist visitor |
| postGVNCleanup | 0x1908A00 | 10 KB | Post-GVN cleanup |
| gvnFinalCleanup | 0x190C3B0 | 8 KB | Final cleanup after GVN |
Expression Classification Bitmask
The bitmask 0x1F133FFE23FFFF classifies opcodes that are safe for value numbering (pure, side-effect-free). It appears eight times in the main function. Bit positions correspond to (opcode - 35), covering standard arithmetic, logical, comparison, and cast operations, plus NVIDIA-specific opcodes in the extended range.
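A minimal sketch of how such a classification mask is consulted, assuming the `(opcode - 35)` bit offset stated above; which concrete opcodes land on which bits is not claimed here, and the helper names are invented:

```cpp
#include <cassert>
#include <cstdint>

// The recovered 64-bit classification mask.
constexpr std::uint64_t kPureOpcodeMask = 0x1F133FFE23FFFFull;

// Membership test as the decompiled code appears to perform it:
// shift by (opcode - 35) and examine bit 0.
constexpr bool safeForValueNumbering(unsigned opcode) {
    if (opcode < 35 || opcode >= 35 + 64) return false;  // outside mask range
    return (kPureOpcodeMask >> (opcode - 35)) & 1u;
}

// Counting set bits gives the number of opcode slots marked pure.
constexpr int popcount64(std::uint64_t v) {
    int n = 0;
    for (; v; v &= v - 1) ++n;   // clear lowest set bit each step
    return n;
}
```

The mask has 40 bits set, i.e. 40 opcode slots in the 35..98 range are treated as side-effect-free and eligible for value numbering.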
Multi-Pass Data Flow: SROA / InstCombine / GVN / DSE
These four passes form the core scalar optimization chain in CICC's mid-pipeline. They execute in sequence (often multiple times through the pipeline), with each pass producing IR transformations that create opportunities for the next. The following diagram traces data flow through a single iteration of the chain, showing what each pass produces and what the next pass consumes.
SROA (Scalar Replacement of Aggregates)
========================================
Input: IR with aggregate alloca instructions (structs, arrays)
Example: %s = alloca %struct.float4 --> lives in .local memory (AS 5)
+--------------------------------------------------------------+
| Phase 1: Slice analysis |
| Walk all uses of each alloca, build byte-range slices |
| Group non-overlapping slices into partitions |
| |
| Phase 2: Partition splitting |
| Replace each partition with a scalar alloca or SSA value |
| Insert extractvalue/insertvalue for partial accesses |
| Defer trivially-promotable allocas to mem2reg |
| |
| Produces: |
| - Scalar SSA values replacing aggregate members |
| - Inserted bitcasts, trunc, zext for type mismatches |
| - Dead aggregate allocas (erased) |
| - GEP chains pointing at sub-fields (now redundant) |
+------------------------------+-------------------------------+
|
| Scalar SSA values with redundant
| casts, dead GEPs, identity ops
v
InstCombine (Instruction Combining)
========================================
Input: Post-SROA IR with redundant instructions
+--------------------------------------------------------------+
| 405KB visitor dispatches across 80 opcode cases: |
| |
| Consumes from SROA: |
| - Redundant bitcasts from type-punned accesses |
| - trunc(zext(x)) chains from width mismatches |
| - Dead GEP arithmetic (base + 0) |
| - Identity selects from conditional stores |
| |
| Canonicalization: |
| - Constant folding (sub_101E960) |
| - Algebraic identities: x+0, x*1, x&-1 (sub_F0F270) |
| - Strength reduction: x*2^n -> x<<n (sub_10BA120) |
| - Cast chain collapse: trunc(zext(x)) -> x or smaller |
| - NVIDIA intrinsic folding (sub_1169C30, 87KB) |
| - computeKnownBits propagation (sub_11A7600, 127KB) |
| |
| Produces: |
| - Canonical instruction forms (const on RHS, etc.) |
| - Simplified expressions (fewer instructions) |
| - Known-bits metadata on values |
| - Opportunities for value numbering (same expression |
| in different blocks now looks identical) |
+------------------------------+-------------------------------+
|
| Canonical IR with duplicate
| expressions across blocks
v
GVN (Global Value Numbering)
========================================
Input: Canonicalized IR from InstCombine
+--------------------------------------------------------------+
| Traverses dominator tree in RPO with scoped hash tables: |
| |
| Consumes from InstCombine: |
| - Canonical expression forms (enables hash-table matching) |
| - Known-bits info (used in SimplifyInstruction) |
| - Folded NVIDIA intrinsics (enables ldu/ldg CSE) |
| |
| Value numbering: |
| - Hash expression: (opcode, type, operands) -> leader |
| - Scoped tables: LeaderTable, StoreExprTable, LoadExprTable|
| - NVIDIA ldu/ldg CSE (intrinsics 4057, 4085, 4492, 4503) |
| |
| Load forwarding: |
| - Query MemoryDependenceResults for store->load forwarding |
| - Store splitting: float4 store -> scalar float load |
| (NVIDIA extension, controlled by split-stores knob) |
| |
| PRE (Partial Redundancy Elimination): |
| - Insert computations at merge points to enable CSE |
| - Load PRE across edges (enable-load-pre) |
| |
| Consumes from alias analysis: |
| - MemoryDependence results (which store feeds which load?) |
| - NVVM AA NoAlias answers for cross-address-space pairs |
| |
| Produces: |
| - Eliminated redundant computations (replaced with leader) |
| - Forwarded loads (replaced with stored value) |
| - Trivial PHIs (from leader substitution) |
| - Dead stores exposed (stored value is never loaded) |
+------------------------------+-------------------------------+
|
| IR with eliminated redundancies,
| forwarded loads, exposed dead stores
v
DSE (Dead Store Elimination)
========================================
Input: Post-GVN IR with dead stores exposed
+--------------------------------------------------------------+
| 91KB across three major functions: |
| |
| Consumes from GVN: |
| - Stores whose values were forwarded to loads (now dead) |
| - Stores to locations that GVN proved are overwritten |
| - Simplified store patterns from PRE insertion |
| |
| Consumes from alias analysis: |
| - MemorySSA graph (which stores are visible to which loads)|
| - NVVM AA NoAlias (cross-space stores never conflict) |
| - TBAA metadata (type-based aliasing for struct fields) |
| |
| Dead store detection: |
| - Complete overwrite: later store covers same location |
| - Partial overwrite: float4 store then float4 store with |
| overlapping range (72-byte hash table tracking) |
| - Store chain decomposition: aggregate stores decomposed |
| via GEP into element-level dead-store checks |
| |
| NVIDIA extensions: |
| - Partial store forwarding with type conversion |
| (float4 -> float via GEP + load extraction) |
| - Cross-store 6-element dependency records |
| - CUDA vector type-aware size computation |
| |
| Produces: |
| - Eliminated dead stores (fewer memory writes) |
| - Replacement loads for partial forwards |
| - Reduced memory traffic (critical for GPU bandwidth) |
+--------------------------------------------------------------+
Cross-pass data dependency table:
| Pass | Consumes from predecessor | Produces for successor |
|---|---|---|
| SROA | Aggregate allocas from frontend/inliner | Scalar SSA values, redundant casts/GEPs |
| InstCombine | Redundant casts, identity ops from SROA | Canonical expressions, known-bits metadata |
| GVN | Canonical forms from InstCombine, MemDep/AA results | Forwarded loads, eliminated redundancies, exposed dead stores |
| DSE | Dead stores exposed by GVN, MemorySSA/AA results | Eliminated stores, reduced memory traffic |
Why this ordering matters for GPU code: SROA is existential because un-promoted allocas become .local memory (200-400 cycle penalty). InstCombine must run before GVN because GVN's hash-table matching requires canonical expression forms -- without InstCombine, (a + 0) and a would hash differently and miss the CSE opportunity. GVN must run before DSE because GVN's load forwarding is what exposes dead stores: once GVN proves that a load reads a value already available as an SSA register, the store that was keeping that value alive becomes dead. DSE then removes it, reducing the memory write traffic that is the primary bandwidth bottleneck on GPU architectures.
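The InstCombine-before-GVN dependency can be made concrete with a toy expression hash. The scheme below is illustrative only (GVN's real hash covers opcode, type, and operands over LLVM IR); it shows why `a + 0` and `a` miss each other in the table until an identity fold canonicalizes them:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy expression node: opcode plus operand names.
struct Expr { std::string opcode; std::vector<std::string> operands; };

// Structural hash over (opcode, operands) -- equal structures hash equal.
std::size_t exprHash(const Expr& e) {
    std::size_t h = std::hash<std::string>{}(e.opcode);
    for (const auto& op : e.operands)
        h = h * 1099511628211ull ^ std::hash<std::string>{}(op);
    return h;
}

// InstCombine-style identity fold: x + 0 -> x.
Expr canonicalize(Expr e) {
    if (e.opcode == "add" && e.operands.size() == 2 && e.operands[1] == "0")
        return Expr{"value", {e.operands[0]}};
    return e;
}
```

Before canonicalization, `add a, 0` and the plain value `a` are structurally different and occupy different hash buckets; after the fold they are identical, so GVN's table lookup finds the leader.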
Optimization Level Behavior
| Level | Classic GVN | NewGVN | PRE | Store Splitting |
|---|---|---|---|---|
| O0 | Not run | Not run | N/A | N/A |
| Ofcmax | Not run | Not run | N/A | N/A |
| Ofcmid | Runs (1 instance) | Not run | Enabled (enable-pre=true) | Enabled (split-stores=true) |
| O1 | Runs (1-2 instances in Tier 0/1) | Not run | Enabled | Enabled |
| O2 | Runs (2-3 instances across Tier 0/1/2) | Not run | Enabled | Enabled |
| O3 | Runs (2-3 instances, most aggressive inlining exposes more CSE) | Not run | Enabled | Enabled |
GVN is a core mid-pipeline pass that runs at O1 and above. It appears multiple times in the pipeline -- typically once after CGSCC inlining and once in the late scalar cleanup. Each instance benefits from different preceding transformations (inlining, SROA, InstCombine). NewGVN is compiled into the binary but not scheduled in any standard pipeline tier. The enable-pre and enable-load-pre knobs are both true by default across all levels. See Optimization Levels for the complete tier structure.
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Store splitting | Not present; GVN handles stores only for forwarding | Three knobs (split-stores, no-split-stores-below, no-split-stores-above) enable splitting wide vector stores into sub-stores matching load granularity |
| NVIDIA intrinsic CSE | No awareness of nvvm.ldu, nvvm.ldg | Four NVIDIA intrinsic IDs (4057, 4085, 4492, 4503) with custom pointer operand extraction, enabling CSE of texture/global cache loads |
| Dominator cache | No caching; dominance queries are O(n * depth) | gvn-dom-cache (default true, size 32) caches recent dominates(A, B) results for O(1) repeated queries |
| PHI removal aggressiveness | Basic trivial PHI cleanup | Three-level enable-phi-remove knob (0=off, 1=trivial, 2=aggressive); 4-way unrolled use-scanning loop for PHI-heavy IR |
| Knob count | ~4 knobs (enable-pre, enable-load-pre, enable-split-backedge-in-load-pre, max-recurse-depth) | 11 knobs including store splitting limits, dominator caching, profuse diagnostics, and PHI removal depth |
| Diagnostic framework | Standard OptimizationRemark system | profusegvn knob (default true) uses NVIDIA's custom profuse diagnostic framework, not LLVM's ORE |
| NewGVN | Standard partition-based NewGVN | Same algorithm, ships alongside classic GVN at separate address; both carry NVIDIA modifications |
Diagnostic Strings
All diagnostic strings recovered from the binary. GVN uses NVIDIA's custom profuse diagnostic framework rather than LLVM's OptimizationRemark system.
| String | Source | Category | Trigger |
|---|---|---|---|
| "profuse for GVN" | 0x4FAE7E0 (profusegvn knob description) | Knob | Knob registration |
| "enable caching of dom tree nodes" | 0x4FAE700 (gvn-dom-cache knob description) | Knob | Knob registration |
| "Max recurse depth (default = 1000)" | 0x4FAE620 (max-recurse-depth knob description) | Knob | Knob registration |
| (profuse GVN diagnostic output) | sub_1909530 (~5 KB) | Debug | profusegvn knob enabled (default true); emits at value replacement, store/load match, and PRE insertion decisions |
| (PHI removal diagnostic output) | sub_19003A0 region | Debug | dump-phi-remove > 0; dumps which PHI nodes are being removed and why |
The profusegvn framework follows the same pattern as profuseinline -- it is a custom NVIDIA diagnostic channel likely controlled by environment variables such as CICC_PROFUSE_DIAGNOSTICS, not the standard LLVM OptimizationRemark / ORE system. The dump-phi-remove knob (default 0) separately enables diagnostic output during PHI removal.
Allocation Strategy
The 136-byte domtree nodes and 48-byte expression entries use sub_145CBF0 (BumpPtrAllocator) and sub_22077B0 (malloc wrapper). This careful memory management addresses the potentially large number of expressions produced by heavily unrolled GPU kernels.
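The BumpPtrAllocator pattern named above can be sketched as follows. Slab size, alignment, and the class shape are illustrative, not recovered from the binary; the point is that fixed-size nodes (136-byte domtree nodes, 48-byte expression entries) are carved out of large slabs with a single pointer bump, so the many expressions of a heavily unrolled kernel pay near-zero per-allocation cost:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bump allocator: allocations only bump a cursor within
// the current slab; a new slab is appended when the current one fills.
// Assumes individual requests fit in one slab (true for small nodes).
class BumpAllocator {
    static constexpr std::size_t kSlab = 4096;
    std::vector<std::vector<unsigned char>> slabs_;
    std::size_t used_ = kSlab;                       // forces a slab on first alloc
public:
    void* allocate(std::size_t bytes, std::size_t align = 16) {
        used_ = (used_ + align - 1) & ~(align - 1);  // bump cursor to alignment
        if (slabs_.empty() || used_ + bytes > kSlab) {
            slabs_.emplace_back(kSlab);              // old slabs stay live until reset
            used_ = 0;
        }
        void* p = slabs_.back().data() + used_;
        used_ += bytes;
        return p;
    }
    std::size_t slabCount() const { return slabs_.size(); }
};
```

There is no per-object free: everything is released at once when the pass tears down, which matches the lifetime of GVN's expression tables.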
Test This
The following kernel contains redundant loads from the same global address. GVN should eliminate the second load by recognizing it has the same value number as the first.
__global__ void gvn_test(const float* __restrict__ in, float* __restrict__ out, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= n) return;
float a = in[tid]; // first load
float b = a * 2.0f;
float c = in[tid]; // redundant -- same address, no intervening store
float d = c * 3.0f;
out[tid] = b + d;
}
What to look for in PTX:
- Only one `ld.global.f32` instruction for `in[tid]`, not two. GVN assigns the same value number to both loads (same pointer, no intervening aliasing store thanks to `__restrict__`) and replaces the second with the first's result.
- The arithmetic should reduce to something equivalent to `in[tid] * 5.0f`. After GVN eliminates the redundant load, InstCombine or the backend may simplify `a*2 + a*3` into `a*5`.
- Remove `__restrict__` and add an intervening store (`out[tid] = b;` between the two loads). Without `__restrict__`, GVN cannot prove the second load is redundant (the store to `out` might alias `in`), so both `ld.global.f32` instructions survive. This demonstrates how alias analysis feeds GVN.
- For store-to-load forwarding: insert `out[tid] = 42.0f;` followed by `float e = out[tid];`. GVN should replace the load with the constant `42.0f` -- no `ld.global` emitted for `e`.
JumpThreading
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Based on LLVM 20.0.0 `JumpThreading.cpp`. Evidence: the DFA JumpThreading variant (`dfa-jump-threading`), present as a separate pass, matches LLVM 14+; the `early-exit-heuristic` knob matches LLVM 16+. Core algorithm is unmodified; NVIDIA changes are configuration-level (adjusted thresholds, three pipeline positions, OCG disable flag).
CICC v13.0 ships LLVM's JumpThreadingPass at sub_2DC4260 (12,932 bytes, address range 0x2DC4260--0x2DC74E4). The pass duplicates basic blocks so that predecessors whose branch conditions can be statically resolved jump directly to the correct successor, eliminating a conditional branch from the critical path. On a GPU, this directly reduces warp divergence: a branch that was previously data-dependent becomes unconditional along each incoming edge, so the warp scheduler never needs to serialize the two paths.
The pass is fundamentally at odds with PTX's requirement for reducible control flow. Block duplication can create multi-entry loops (irreducible cycles) when the duplicated block is a loop header or when the threading target sits inside a loop whose header is not the threading source. CICC addresses this through three layered mechanisms -- loop header protection, conservative duplication thresholds, and a late-pipeline StructurizeCFG safety net -- that collectively keep the CFG reducible without sacrificing the pass's optimization value.
| Property | Value |
|---|---|
| Pass name (pipeline parser) | "jump-threading" |
| Pass class | llvm::JumpThreadingPass |
| Entry function | sub_2DC4260 |
| Binary size | 12,932 bytes |
| Stack frame | 0x748 (1,864) bytes |
| Block duplication helper | sub_2DC22F0 (2,797 bytes) |
| CFG finalization | sub_2DC30A0 (1,094 bytes) |
| Single-instruction threading | sub_2DC37C0 (2,288 bytes) |
| Select unfolding | sub_2DC40B0 (420 bytes) |
| Pipeline positions | Three invocations: ~position 234, ~278, and a late tier-3 position (~239) |
| NVVMPassOptions disable offset | +320 |
| Upstream LLVM source | lib/Transforms/Scalar/JumpThreading.cpp |
Why JumpThreading Matters on GPU
Consider a CUDA kernel containing:
if (threadIdx.x < threshold)
val = computeA();
else
val = computeB();
if (val > 0)
result = pathX(val);
else
result = pathY(val);
The second branch depends on val, which is a PHI of computeA() and computeB(). If JumpThreading can determine that computeA() always returns a positive value, it duplicates the second if block and wires the computeA predecessor directly to pathX. Threads that took the first branch path never execute the second conditional at all.
On a CPU this saves a branch misprediction. On a GPU the payoff is larger: eliminating the second branch prevents a second point of warp divergence. If both branches would diverge on different thread subsets, removing one cuts the total serialization overhead in half. The threads that took computeA proceed straight to pathX without waiting for the computeB threads to rejoin.
Knob Inventory
Six cl::opt globals control the pass, registered in ctor_456 at 0x544220:
| Knob | Default | Global | Description |
|---|---|---|---|
| jump-threading-threshold | 6 | qword_4FFDBA0 | Max instructions in a block eligible for duplication |
| jump-threading-implication-search-threshold | 3 | qword_4FFDAC0 | Max predecessors to search for condition implications |
| jump-threading-phi-threshold | 76 (0x4C) | qword_4FFD9E0 | Max PHI nodes in a block eligible for duplication |
| jump-threading-across-loop-headers | false | qword_4FFD900 | Allow threading across loop headers (testing only) |
| jump-threading-disable-select-unfolding | false | qword_4FFDC80 | Disable unfolding select instructions into branches |
| print-lvi-after-jump-threading | false | -- | Debug: print LazyValueInfo cache after pass completes |
The block-size threshold of 6 matches upstream LLVM. The PHI threshold of 76 is significantly higher than upstream's default (which is typically lower), reflecting GPU kernels' tendency toward wider PHI nodes due to predication and convergence patterns. The implication search depth of 3 is conservative, limiting compile-time cost from predecessor chain analysis in the typically shorter basic-block chains of GPU code.
Two Disable Flags
CICC registers two independent cl::opt flags that suppress jump threading behavior. They live in different subsystems and control different things:
| Flag | Registration | Subsystem | Effect |
|---|---|---|---|
| "disable-JumpThreadingPass" | ctor_637 @ 0x5934A7 | JumpThreading pass itself | Disables the standalone JumpThreadingPass invocations in the pipeline |
| "disable-jump-threading" | ctor_073 @ 0x49A91E (also ctor_243 @ 0x4ED0C0) | SimplifyCFG | Disables jump threading logic within SimplifyCFG -- the per-block branch-through-PHI threading that SimplifyCFG performs as part of its CFG simplification |
The "disable-jump-threading" flag carries the annotation "Disable jump threading for OCG experiments", where OCG is NVIDIA's Optimizing Code Generation research infrastructure. This is a SimplifyCFG option, not a JumpThreadingPass option -- SimplifyCFG has its own internal implementation of branch threading through PHI nodes that is separate from the standalone pass. NVIDIA engineers can disable either or both independently.
The "fold-with-var-cond" flag is registered alongside "disable-jump-threading" in the same SimplifyCFG constructor group, controlling a related NVIDIA-specific extension for folding branches with variance conditions.
Interaction with StructurizeCFG
The fundamental tension: JumpThreading duplicates blocks to bypass conditionals, which can transform a reducible loop into an irreducible cycle. PTX requires all loops to be natural (single-entry, reducible). An irreducible CFG causes StructurizeCFG to emit "UnsupportedIrreducibleCFG" and bail out, leaving the function in a state that ptxas will likely reject.
CICC addresses this through three layered mechanisms:
1. Loop Header Protection via LoopInfo
The jump-threading-across-loop-headers flag defaults to false. Before threading any block, the pass queries LoopInfo through a red-black tree lookup at 0x2DC4781 using dword_501D5A8 as the analysis key. If the target block is a loop header (the LoopInfo query returns a non-null loop containing the block as its header), the pass skips it entirely.
A parallel DominatorTree query at 0x2DC4839 (using dword_501D4C8) verifies loop membership and nesting depth. If the block is found within a loop, a threshold override is loaded from qword_501D628, replacing the standard duplication threshold with a loop-specific one. A second override from qword_501D548 applies to blocks found via the DominatorTree-based lookup.
This double check -- LoopInfo for header identification, DominatorTree for membership -- prevents the most common source of irreducibility: duplicating a loop header creates a second entry into the loop body.
2. Conservative Duplication Thresholds
The three thresholds (6 instructions, 3 predecessors, 76 PHIs) restrict duplication to small, simple blocks where the CFG outcome is highly predictable and the duplication cost is bounded. The limits are conjunctive -- a block must satisfy all three simultaneously: even a 6-instruction block with 4 predecessors would exceed the implication search depth and be rejected, while a 5-instruction block with 100 PHIs would exceed the PHI threshold.
3. StructurizeCFG Safety Net
StructurizeCFG (sub_35CC920) runs late in the pipeline, after all IR-level scalar and loop transforms. Its irreducibility detector (sub_35CA2C0) checks every back-edge: if the target does not dominate the source, the loop has multiple entries and is irreducible. If JumpThreading or any other pass creates an irreducible cycle that slipped past the loop header protection, StructurizeCFG will catch it.
This is defense-in-depth: the threading constraints prevent most irreducible cases, and structurization catches the rest. The design deliberately tolerates a small number of "false acceptances" at the JumpThreading level because the cost of occasionally running StructurizeCFG's rejection path is far lower than the cost of being too conservative and missing profitable threading opportunities.
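The back-edge test described above can be modeled on a toy CFG: a cycle-closing edge (src -> dst) is natural only if dst dominates src. The sketch below computes dominators with the simple iterative bit-set dataflow rather than LLVM's algorithm, and all block ids and the example graphs are invented:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using BitSet = std::uint32_t;   // one bit per block; sketch supports < 32 blocks

// Classic iterative dominator computation: dom(b) = {b} U (intersection
// of dom(p) over predecessors p), seeded with "everything" and refined
// to a fixpoint.
std::vector<BitSet> dominators(const std::vector<std::vector<int>>& preds, int entry) {
    int n = (int)preds.size();
    BitSet all = (1u << n) - 1;
    std::vector<BitSet> dom(n, all);
    dom[entry] = 1u << entry;
    for (bool changed = true; changed; ) {
        changed = false;
        for (int b = 0; b < n; ++b) {
            if (b == entry) continue;
            BitSet meet = all;
            for (int p : preds[b]) meet &= dom[p];
            BitSet nd = meet | (1u << b);
            if (nd != dom[b]) { dom[b] = nd; changed = true; }
        }
    }
    return dom;
}

// The irreducibility check: a cycle-closing edge src -> dst is a natural
// back-edge only if dst dominates src.
bool backEdgeIsNatural(const std::vector<BitSet>& dom, int src, int dst) {
    return (dom[src] >> dst) & 1u;
}
```

On a single-entry loop (0 -> 1 -> 2 -> 1) the edge 2 -> 1 passes the check; give the cycle a second entry (0 -> 2 as well) and the header no longer dominates the latch, which is exactly the shape StructurizeCFG rejects.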
Cost Model
The pass enforces a multi-level cost model that bounds total code growth per function.
Global Budget
At 0x2DC4887, the pass initializes a global instruction budget:
mov ebx, 200h ; 512 instructions total budget
Each block duplication charges the duplicated block's instruction count against this budget. The budget is tracked in var_460 and checked before each duplication. Once exhausted, no further threading occurs in that invocation regardless of how profitable individual candidates might be.
Per-Predecessor Cost Division
When threading involves multiple predecessors, the per-predecessor cost is the block instruction count divided by the number of predecessors being threaded, with ceiling rounding:
cost_per_pred = block_instr_count / num_predecessors
; ceiling via: sbb eax, -1 (adds 1 if remainder was nonzero)
This division at 0x2DC4D78--0x2DC4D8E means a 6-instruction block being threaded for 3 predecessors costs only 2 instructions per predecessor against the global budget. The logic recognizes that multi-predecessor threading amortizes the code growth across more eliminated branches.
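In C terms, the division plus the `sbb` adjustment is plain ceiling division. A sketch with hypothetical names (the binary works in registers, not a named helper):

```cpp
#include <cassert>

// Per-predecessor cost charged against the 512-instruction budget:
// block instruction count divided by threaded predecessors, rounded up
// (the sbb idiom adds 1 exactly when the remainder was nonzero).
unsigned costPerPred(unsigned blockInstrCount, unsigned numPreds) {
    unsigned q = blockInstrCount / numPreds;
    unsigned r = blockInstrCount % numPreds;
    return q + (r != 0);   // ceiling division
}
```

So a 6-instruction block threaded for 3 predecessors charges 2 per predecessor, while a 7-instruction block charges 3.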
Special Cases
- Single-instruction blocks (checked at `0x2DC4D94`): Always eligible, regardless of budget. A block containing only a terminator instruction costs nothing to duplicate.
- Empty blocks (checked at `0x2DC4D70`): Skipped entirely.
- Blocks with <=1 effective instructions (`0x2DC4BF1`): The comparison `cmp edx, 1; jbe` gates a fast path where the pass bypasses the full cost analysis.
LazyValueInfo Integration
The pass accepts a LazyValueInfo pointer as its third parameter (rdx). When non-null (checked at 0x2DC42BD), LVI provides range-based condition evaluation that enables threading even when the branch condition is not a simple constant comparison.
LVI State
The LVI cache occupies approximately 600 bytes (0x258) of local state:
| Field | Offset | Purpose |
|---|---|---|
| Cache structure | var_2F0 through var_98 | LVI range cache local state |
| Valid flag | var_C0 | Set to 1 when LVI is initialized |
| Cached ranges | var_B0 | SmallVector-like structure |
| Initial capacity | var_A8 | 8 entries |
Range-Based Threading
For ICMP_NE conditions (opcode 0xBA = 186), the pass calls sub_11F3070 (LVI::getPredicateAt) with the ICmp operand and a comparison predicate of 2, followed by sub_DFABC0 (evaluateConditionOnEdge) to resolve the branch direction along a specific incoming edge.
For alternate opcode paths (opcode 0x165 = 357), the pass uses sub_988330 (getConstantOnEdge) instead, which returns a concrete constant value if LVI can prove the condition evaluates to a known value along that edge.
The virtual dispatch at 0x2DC67D6 (call qword ptr [rax+78h]) invokes LVI::getPredicateOnEdge. If the vtable matches sub_920130 (the default implementation), a fallback path calls sub_AC4810 (isImpliedCondition) with predicate 0x27 (39), and if that also fails, sub_AA93C0 (SimplifyICmpInst).
Cleanup
On exit, if LVI was used, three cleanup calls occur:
- `sub_FFCE90` -- LVI::eraseBlock (invalidation)
- `sub_FFD870` -- LVI::clear
- `sub_FFBC40` -- LVI::releaseMemory
Main Algorithm
Outer Loop
The pass iterates over the function's basic block list via a linked-list traversal (BB->next chain at [BB+8]):
run(result_ptr, function, lvi_ptr, tli, ...):
if lvi_ptr != null:
initialize_lvi_cache(lvi_ptr)
budget = 512
changed = false
loop:
current_bb = function.entry_block // sub_B2BEC0
end = function + 0x48 // end sentinel
while current_bb != end:
if try_thread_block(current_bb, budget):
changed = true
current_bb = current_bb.next // [current_bb + 8]
if changed:
changed = false
goto loop // restart: threading may expose new opportunities
cleanup_lvi()
return results
The restart-on-change behavior means threading is iterative: eliminating one branch can expose a new statically-determinable branch downstream.
Per-Block Classification
For each basic block, the pass examines the terminator instruction:
- Opcode check (`0x2DC443E`): The instruction opcode byte is compared against `0x55` (85), which is LLVM's `BranchInst` opcode. Only conditional branches are considered.
- Metadata check (`0x2DC4449`--`0x2DC446E`): Two calls to `sub_A73ED0` check for metadata kinds `0x17` (23, `"prof"` branch weights) and `0x04` (debug). Then `sub_B49560` (hasMetadataOtherThanDebugLoc) is called on the branch instruction.
- Condition extraction (`0x2DC45F8`--`0x2DC4636`): `sub_981210` (getBranchCondition) returns a success flag and a condition code. Two condition codes are handled: `0x165` (357), likely `CmpInst::ICMP_EQ` or a switch opcode, and `0x0BA` (186), likely `CmpInst::ICMP_NE`. Other condition codes cause the block to be skipped.
- Operand analysis (`0x2DC465F`--`0x2DC467C`): The operand count is extracted (AND with `0x7FFFFFF` mask -- the use-count field in LLVM's `Value` layout). If the branch condition is an ICmp with a constant operand (type byte `0x11` = 17 = `ConstantInt`), threading is potentially profitable.
Condition-Specific Threading Paths
The pass contains four specialized threading strategies:
Constant-value threading (0x2DC66B7): When a predecessor can determine the branch outcome via a constant PHI incoming value, the simplest path. Creates a direct unconditional branch.
Single-instruction threading (sub_2DC37C0, 2,288 bytes): For blocks containing exactly one instruction (the terminator), called at 0x2DC6704. Creates a direct branch bypass.
Switch threading (0x2DC6A76--0x2DC6B0C): When the terminator is a SwitchInst (opcode byte 0x37 = 55), calls sub_2DC40B0 (tryToUnfoldSelect). This checks for SelectInst (opcode 0x52 = 82) and unfolds the select into explicit branches that can be individually threaded.
Implication-based threading (0x2DC6E71--0x2DC6EB3): For ICmpInst variants (opcode 0x28 = 40), the pass checks whether the predicate implies the branch condition via sub_B532B0, creates the threaded edge via sub_B52EF0, and wires the new block via sub_92B530.
All-Ones Constant Detection
Four sites (0x2DC71B0, 0x2DC71CA, 0x2DC7380, 0x2DC74DA) check for all-ones constants as PHI incoming values:
or rax, -1 ; create all-ones mask
shr rax, cl ; cl = 64 - bitwidth, shift to match width
cmp [rdx+18h], rax ; compare against actual constant value
setz al ; true if constant is all-ones
For an i1 type, all-ones means true. This handles the common pattern where a PHI incoming value from one predecessor is the constant true (all bits set), allowing the pass to resolve the branch direction for that predecessor.
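The assembly idiom corresponds to this C check: build a mask of `bitwidth` ones (the `or rax, -1` / `shr rax, cl` pair with `cl = 64 - bitwidth`) and compare it against the constant's value. The helper name is invented:

```cpp
#include <cassert>
#include <cstdint>

// True iff `value` is the all-ones constant at the given bit width.
// Widths above 64 are rejected, matching the bail-out on wide integers
// described in the PHI operand iteration below.
bool isAllOnes(std::uint64_t value, unsigned bitwidth) {
    if (bitwidth < 1 || bitwidth > 64) return false;
    std::uint64_t mask = ~0ull >> (64 - bitwidth);  // bitwidth low bits set
    return value == mask;
}
```

For `i1`, the all-ones value 1 is the constant `true`, which is exactly the case the four detection sites are looking for.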
PHI Operand Iteration
Two nearly identical loops at 0x2DC7206--0x2DC726E and 0x2DC7456--0x2DC74CD iterate PHI operands to determine if all incoming values from relevant predecessors resolve to the same constant:
for pred_idx in range(phi.num_operands): // var_668
incoming = phi.getIncomingValueForBlock(pred) // sub_AD69F0
type_tag = incoming.type_byte
if type_tag == 0x0D: // ConstantInt::getTrue()
continue
if type_tag == 0x11: // ConstantInt with bitwidth check
if bitwidth <= 64:
if value == all_ones_for_width:
continue // resolves to true
else:
skip // wide integers, bail out
// If any incoming value is non-constant, threading is unprofitable
bail_out()
If every relevant predecessor provides the same constant value, the branch direction is fully determined and threading proceeds.
Created Block Names
When threading occurs, the pass creates new basic blocks with diagnostic names:
| Name | String address | Purpose |
|---|---|---|
| "endblock" | 0x42E9094 | Terminal block of the threaded path; created via sub_F36990 (SplitBlockAndInsertIfThen) |
| "phi.res" | 0x42E90C0 | PHI resolution node for merged values; created via sub_D5C860 (PHINode::Create) |
| "res_block" | 0x42E909D | Result block for the threaded path; allocated as 0x50-byte BasicBlock via sub_22077B0 |
| "loadbb" | 0x42E90B9 | Load basic block for load-bearing threading; created in a loop at 0x2DC4F05--0x2DC4FFB |
| "phi.src1" | 0x42E90A7 | First PHI source block |
| "phi.src2" | 0x42E90B0 | Second PHI source block |
The "loadbb" blocks are created in a dynamic loop for multi-way threading, where each iteration allocates a 0x50-byte (sizeof(BasicBlock)) object and wires it into the CFG via sub_AA4D50 (BasicBlock::insertInto).
Block Duplication Engine: sub_2DC22F0
The 2,797-byte helper performs actual block cloning. Parameters:
| Register | Role |
|---|---|
| rdi | Duplication context structure (at var_490) |
| rsi | Source block's value table |
| rdx | Destination hash table |
| rcx | PHI operand map |
| r8d | Instruction count for the source block |
The cloning process:
- Clone each instruction from the source block
- Insert cloned instructions into use-def chains (0x2DC59A1--0x2DC59E7: linked-list surgery on LLVM's Value use-list)
- Update PHI operands to reference the new predecessor (0x2DC5E1E onward)
- Update branch targets in the predecessor blocks
CFG Finalization: sub_2DC30A0
The 1,094-byte helper, called at 0x2DC5015 and 0x2DC6408 after threading completes for a block, performs:
- Successor edge updates
- Dead block elimination for blocks made unreachable by the threading
- DominatorTree updates if available (via sub_FFB3D0, DominatorTree::changeImmediateDominator)
Pipeline Positions
JumpThreading appears three times in the CICC pipeline, at different stages with different surrounding context:
| Position | Pipeline context | Parameter | Purpose |
|---|---|---|---|
| ~234 | After ADCE, within the main function simplification loop | sub_198DF00(-1) | First opportunity: thread branches exposed by dead code elimination |
| ~278 | After NVVMPeephole2 and optionally GVN, in the NVIDIA-specific tier-2 sequence | sub_198DF00(-1) | Second opportunity: thread branches exposed by value numbering and peephole |
| Late tier-3 | Within the ADCE/MemCpyOpt/DSE sequence | sub_198DF00(t) | Final opportunity: catch any remaining threadable branches before StructurizeCFG |
The sub_198DF00 function is the combined CorrelatedValuePropagation/JumpThreading registration wrapper. The -1 parameter likely selects the default mode; the t parameter in the third position may be an optimization-level-dependent configuration.
All three positions are conditional on NVVMPassOptions offset +320 not being set to disable. Each invocation resets the 512-instruction global budget, so the total code growth across all three invocations can reach up to 1,536 instructions per function.
DFA JumpThreading
A separate DFA-based JumpThreading variant exists at sub_276AF50, registered as "dfa-jump-threading" (llvm::DFAJumpThreadingPass). This pass is controlled by:
| Knob | Registration | Description |
|---|---|---|
| enable-dfa-jump-thread | ctor_445 @ 0x53F5C0 | Enable/disable the DFA variant |
| dfa-jump-view-cfg-before | ctor_445 | Debug: dump the CFG before DFA threading |
| dfa-early-exit-heuristic | ctor_445 | Early-exit heuristic to bound compile time |
DFA JumpThreading handles state-machine patterns (switch statements in loops with predictable transitions between cases) that the standard JumpThreading cannot resolve. It is a separate pass with its own pipeline registration and does not share the budget or thresholds of the standard JumpThreading pass.
Before/After IR Example
Consider a kernel with a two-branch diamond:
Before JumpThreading:
entry:
%cond1 = icmp sgt i32 %x, 0
br i1 %cond1, label %positive, label %negative
positive:
%a = call i32 @computeA()
br label %merge
negative:
%b = call i32 @computeB()
br label %merge
merge:
%val = phi i32 [ %a, %positive ], [ %b, %negative ]
%cond2 = icmp eq i32 %val, 42
br i1 %cond2, label %match, label %nomatch
match:
...
nomatch:
...
If LVI can prove that computeA() always returns 42 (e.g., it is a known constant), JumpThreading duplicates the merge block for the %positive predecessor:
After JumpThreading:
entry:
%cond1 = icmp sgt i32 %x, 0
br i1 %cond1, label %positive, label %negative
positive:
%a = call i32 @computeA()
br label %match ; threaded: skip %merge entirely
negative:
%b = call i32 @computeB()
br label %merge
merge: ; now has only one predecessor
%val = phi i32 [ %b, %negative ]
%cond2 = icmp eq i32 %val, 42
br i1 %cond2, label %match, label %nomatch
match:
...
nomatch:
...
The %positive path no longer passes through merge; the second branch is eliminated for any execution that takes the first branch's true edge.
Differences from Upstream LLVM
| Aspect | CICC v13.0 | Upstream LLVM 20 |
|---|---|---|
| PHI threshold default | 76 | Lower (typically ~32 or similar) |
| disable-jump-threading in SimplifyCFG | Present, annotated for OCG experiments | Present (standard LLVM flag) |
| Annotation | "Disable jump threading for OCG experiments" | No OCG reference |
| Pipeline invocations | Three positions, combined with CVP via sub_198DF00 | Typically two (early and late in the function simplification pipeline) |
| NVVMPassOptions disable | Offset +320 | N/A |
| Loop header override thresholds | qword_501D628, qword_501D548 | Standard LoopInfo check only |
| fold-with-var-cond | NVIDIA-specific SimplifyCFG companion flag | Not present |
The core algorithm is unmodified from upstream. NVIDIA's changes are configuration-level: adjusted thresholds, additional pipeline positions, the OCG disable flag, and integration with the NVVMPassOptions system.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| JumpThreadingPass::run (main pass body) | sub_2DC4260 | 12,932 bytes | -- |
| Block cloning engine (duplicateBlock) | sub_2DC22F0 | 2,797 bytes | -- |
| CFG finalization after threading | sub_2DC30A0 | 1,094 bytes | -- |
| Single-instruction threading | sub_2DC37C0 | 2,288 bytes | -- |
| tryToUnfoldSelect | sub_2DC40B0 | 420 bytes | -- |
| SmallVector append/copy for instruction map | sub_2DC1F40 | 349 bytes | -- |
| LVI::getPredicateAt | sub_11F3070 | -- | -- |
| evaluateConditionOnEdge | sub_DFABC0 | -- | -- |
| getConstantOnEdge | sub_988330 | -- | -- |
| isImpliedCondition | sub_AC4810 | -- | -- |
| SimplifyICmpInst | sub_AA93C0 | -- | -- |
| getBranchCondition | sub_981210 | -- | -- |
| BranchInst::getCondition | sub_B43CB0 | -- | -- |
| BranchInst::Create (conditional) | sub_B4C9A0 | -- | -- |
| BranchInst::Create (unconditional) | sub_B4C8F0 | -- | -- |
| PHINode::addIncoming | sub_B99FD0 | -- | -- |
| PHINode::Create | sub_D5C860 | -- | -- |
| SplitBlockAndInsertIfThen | sub_F36990 | -- | -- |
| BasicBlock::getContext | sub_BD5C60 | -- | -- |
| operator new(0x50) (allocate BasicBlock) | sub_22077B0 | -- | -- |
| BasicBlock::insertInto | sub_AA4D50 | -- | -- |
| Value::replaceAllUsesWith | sub_BD84D0 | -- | -- |
| Instruction::eraseFromParent | sub_B43D60 | -- | -- |
| DominatorTree::changeImmediateDominator | sub_FFB3D0 | -- | -- |
| PHINode::getIncomingValueForBlock | sub_AD69F0 | -- | -- |
| LoopInfo pass lookup | sub_C959E0 | -- | -- |
| Predicate implies branch check | sub_B532B0 | -- | -- |
| ConstantExpr::getICmp or create threaded edge | sub_B52EF0 | -- | -- |
| CloneBasicBlock or wire new block | sub_92B530 | -- | -- |
| CloneBasicBlock (alternate path) | sub_929DE0 | -- | -- |
Cross-References
- StructurizeCFG -- the late-pipeline safety net that catches irreducible CFG created by threading or other passes
- Scalar Passes Hub -- hub page linking SROA, EarlyCSE, and JumpThreading with GPU-context summaries
- GVN -- runs between JumpThreading invocations in the tier-2 sequence; can expose new threadable branches
- Pipeline & Ordering -- tier-dependent scheduling of all three invocations
- Knobs -- master knob inventory including all six JumpThreading knobs
LICM (Loop-Invariant Code Motion)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Loop-Invariant Code Motion in cicc v13.0 operates at three distinct levels: an IR-level pass ("licm", backed by MemorySSA), a pre-RA machine pass ("early-machinelicm"), and a post-RA machine pass ("machinelicm"). The IR-level pass runs in two modes within the same pipeline: a hoist invocation early in the optimization sequence that pulls invariant computations and loads out of loops into preheaders, and a sink invocation via LoopSinkPass (or implicit re-processing) later that pushes unprofitable hoists back into cold loop blocks.

On a CPU, hoisting is almost universally profitable because the preheader executes once per loop entry rather than once per iteration. On a GPU, the calculus is different: every value hoisted into the preheader extends its live range across the entire loop body, consuming a register for all iterations. If that extra register pushes the kernel past an occupancy cliff -- the threshold where the SM can fit one fewer warp -- the net effect is a slowdown, not a speedup.

NVIDIA addresses this tension through the interplay of the two invocations, the NVVM alias analysis pipeline that makes cross-address-space loads trivially hoistable, and the downstream rematerialization passes that can undo hoists that turned out to be unprofitable after register allocation.
Key Facts
| Property | Value |
|---|---|
| IR pass name | "licm" (new PM), "LICMPass" (legacy) |
| IR pass factory | sub_195E880(0) -- creates LICM with AllowSpeculation=false |
| IR pass factory (alt) | sub_184CD60() -- creates LICM (also identified as ConstantMerge in some sweeps; identity ambiguous -- see Analysis Notes) |
| Machine pass (pre-RA) | "early-machinelicm" / EarlyMachineLICMPass |
| Machine pass (post-RA) | "machinelicm" / MachineLICMPass |
| Knob registration | ctor_457_0 at 0x544C40 (18,398 bytes -- 11 knobs) |
| MachineLICM knob registration | ctor_305 (4 knobs) |
| Disable flag | -disable-LICMPass via -Xcicc |
| PassOptions disable | -opt "-do-licm=0" (also forced by --emit-optix-ir) |
| NVVMPassOptions slot | opts[1240] (disable), opts[2880] (enable, reversed logic) |
| Upstream LLVM source | llvm/lib/Transforms/Scalar/LICM.cpp, llvm/lib/CodeGen/MachineLICM.cpp |
Pipeline Positions
LICM appears at multiple pipeline positions depending on the optimization tier and compilation mode. The pass uses two distinct factory functions, and because the binary is stripped, it is uncertain in some cases which factory is definitively LICM rather than another pass. The following table lists all confirmed appearances.
IR-Level LICM
| Position | Call site | Factory | Guard condition | Context |
|---|---|---|---|---|
| O1 baseline, position 12 | sub_12DE330 | sub_184CD60() | none | After LoopRotate, before IndVarSimplify. First hoist invocation. |
| Main optimizer, mid-pipeline | sub_12DE8F0 | sub_195E880(0) | !opts[1240] | Guarded by the LICM disable flag. Runs after DCE and before NVVMLowerBarriers. |
| Main optimizer, late | sub_12DE8F0 | sub_195E880(0) | opts[2880] && !opts[1240] | Second invocation, guarded by both enable and disable flags. Runs after ADCE, before LoopUnroll. |
| Extended pipeline | sub_12E54A0 | sub_195E880(0) | opts[2880] && !opts[1240] | After NVVMLowerBarriers, before LoopUnroll. |
| Late pipeline | sub_12E54A0 | sub_195E880(0) | !opts[1240] | After LoopIdiomRecognize and LoopSimplify, before SimplifyCFG. Late cleanup invocation. |
| Aggressive (O3, "mid" path) | sub_12E54A0 | sub_184CD60() | none | Position 1 and position 18 of the aggressive pipeline. Second invocation follows GVN. |
Machine-Level LICM
| Position | Pass | Guard | Context |
|---|---|---|---|
| Pre-RA | early-machinelicm | enable-mlicm | After EarlyTailDuplicate, before MachineCSE. Controlled by the NVPTX target. |
| Post-RA | machinelicm | !disable-postra-machine-licm | After ExpandPostRAPseudos, before post-RA MachineSink. |
Algorithm
IR-Level: Hoist Mode
LICM's hoist mode is the upstream LLVM 20.0.0 algorithm with no visible NVIDIA patches to the core logic. The NVIDIA delta is entirely in the analysis results that LICM consumes (NVVM AA, MemorySSA precision, convergent-call handling) and in the pipeline orchestration (multiple invocations, register-pressure-aware sink mode).
The algorithm processes each loop from innermost to outermost:
for each loop L in post-order (innermost first):
preheader = L.getLoopPreheader()
if preheader is null: skip
// 1. Collect candidates
for each basic block BB in L:
for each instruction I in BB:
if isLoopInvariant(I, L) and isSafeToHoist(I, L):
candidates.push(I)
// 2. Hoist each candidate
for I in candidates:
if I is a load:
// Query MemorySSA walker for clobbering stores
clobber = MSSA.getClobberingMemoryAccess(I)
if clobber is outside L:
hoist(I, preheader)
else if I is a pure computation (no side effects):
hoist(I, preheader)
else if I is a store and hoist-const-stores is enabled:
if store address is loop-invariant and
no other store in L aliases this address:
hoist(I, preheader)
The isLoopInvariant check verifies that all operands of the instruction are either defined outside the loop or are themselves loop-invariant. The isSafeToHoist check queries MemorySSA to determine whether the instruction's memory behavior is loop-invariant -- for loads, this means no store inside the loop may alias the load's address.
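The operand recursion behind isLoopInvariant can be sketched with a toy model. Names and the dict-based representation are ours, not CICC's; `loop_insts` is the set of instructions inside the loop, and `operands` maps each instruction to the instructions defining its operands:

```python
def is_loop_invariant(inst, loop_insts, operands, visiting=None):
    """Doc's rule: invariant iff every operand is defined outside the
    loop or is itself invariant. A cycle through in-loop definitions
    (e.g. a loop-header PHI) is never invariant."""
    visiting = visiting or set()
    if inst in visiting:
        return False                       # hit a cycle: not invariant
    visiting = visiting | {inst}
    return all(
        op not in loop_insts               # defined outside the loop
        or is_loop_invariant(op, loop_insts, operands, visiting)
        for op in operands.get(inst, ())
    )

# %mul = mul %a, %b with %a, %b defined before the loop: invariant.
assert is_loop_invariant("mul", {"mul"}, {"mul": ["a", "b"]})
# A header PHI feeding an add that feeds the PHI: a cycle, not invariant.
assert not is_loop_invariant("phi", {"phi", "add"},
                             {"phi": ["add"], "add": ["phi"]})
```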
MemorySSA walker interaction. When LICM calls getClobberingMemoryAccess(load_in_loop), the MemorySSA walker walks upward from the load's MemoryUse through the MemorySSA graph. If the walk reaches the loop's entry MemoryPhi without encountering a MemoryDef that may-alias the load, the load is hoistable. The walk is bounded by licm-mssa-optimization-cap to prevent compile-time explosion on functions with dense memory SSA graphs.
The licm-mssa-max-acc-promotion knob limits how many MemoryAccesses LICM will attempt to promote (scalar-replace loads from loop-invariant addresses with SSA values held in registers across iterations). This is the LICM variant of store-to-load forwarding within a loop.
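The bounded clobber walk can be illustrated over a toy MemorySSA chain. This is an assumption-laden sketch (dict-based accesses, a `may_alias` callback, a `cap` parameter modeling licm-mssa-optimization-cap), not the recovered implementation:

```python
def get_clobbering_access(load, entry_phi, may_alias, cap=100):
    """Walk upward from the load's defining access toward the loop-entry
    MemoryPhi. Return the first may-aliasing MemoryDef, or entry_phi if
    the walk is clean (the load is hoistable). When the cap is
    exhausted, conservatively report the current access as a clobber."""
    access = load["defining"]
    steps = 0
    while access is not entry_phi:
        if steps >= cap or may_alias(access, load):
            return access
        access = access["defining"]
        steps += 1
    return entry_phi

# One in-loop store between the load and the entry phi:
entry_phi = {"defining": None}
store_def = {"defining": entry_phi}
load = {"defining": store_def}

# NVVM AA says NoAlias (e.g. shared store vs. global load): hoistable.
assert get_clobbering_access(load, entry_phi, lambda a, b: False) is entry_phi
# Same address space, may-alias: the store clobbers, no hoist.
assert get_clobbering_access(load, entry_phi, lambda a, b: True) is store_def
```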
IR-Level: Sink Mode
The LoopSink pass ("loop-sink", registered at pipeline parser entry 271) is the inverse of hoist mode. It runs late in the pipeline and pushes instructions that were hoisted to the preheader back into the loop body, specifically into cold blocks that execute infrequently relative to the loop header.
The decision to sink is driven by block frequency analysis:
for each instruction I in preheader:
if I has uses only in cold blocks of the loop:
coldest_block = argmin(blockFreq(B) for B where I is used in B)
if blockFreq(preheader) / blockFreq(coldest_block) > threshold:
sink(I, coldest_block)
On GPUs, the sink mode is particularly important because:
- Occupancy recovery. A hoist that added one live register at the preheader may have pushed the kernel from 8 to 7 warps per SM. Sinking that value back undoes the damage.
- Divergent control flow. If the hoisted value is only used in a branch taken by some threads (divergent execution), hoisting forces all threads to compute it. Sinking limits the computation to the threads that actually take the branch.
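The frequency comparison in the sink pseudocode above can be made concrete. A sketch with an illustrative threshold (the real default is a cl::opt value not documented here):

```python
def choose_sink_block(preheader_freq, use_block_freqs, threshold=2.0):
    """Return the coldest loop block using the value if sinking there is
    profitable, else None. Sinking pays off when the preheader executes
    `threshold`x more often than the coldest use block -- e.g. a value
    used only on a rarely taken in-loop branch."""
    coldest_block = min(use_block_freqs, key=use_block_freqs.get)
    if preheader_freq / use_block_freqs[coldest_block] > threshold:
        return coldest_block
    return None

# Hot preheader (relative freq 100), value used only in a cold guard
# block (freq 10): sink it into the guard block.
assert choose_sink_block(100, {"guard": 10, "body": 90}) == "guard"
# Value used in the hot body itself: keep the hoist.
assert choose_sink_block(100, {"body": 90}) is None
```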
Machine-Level: MachineLICM
MachineLICM operates on MachineInstr after instruction selection. The pre-RA variant (early-machinelicm) is gated by the enable-mlicm knob, which is controlled by the NVPTX target. The post-RA variant (machinelicm) runs unconditionally unless disable-postra-machine-licm is set.
The machine-level algorithm differs from the IR level in that it has concrete register pressure information:
for each machine loop ML (innermost first):
preheader = ML.getLoopPreheader()
for each MachineInstr MI in ML:
if isLoopInvariant(MI) and isSafeToHoist(MI):
// Compute pressure impact
pressure_delta = estimatePressureIncrease(MI, preheader)
if sink-insts-to-avoid-spills and
pressure_delta would cause spills:
skip MI // Do not hoist
else:
hoist(MI, preheader)
The sink-insts-to-avoid-spills knob (registered at ctor_305) is the critical GPU-specific control: it tells MachineLICM to abandon a hoist when the resulting register pressure in the preheader would exceed the spill threshold. This directly prevents the occupancy-cliff problem at the machine level.
GPU-Specific Considerations
Register Pressure and Occupancy Cliffs
Each SM's register file is shared among all resident warps, creating discrete occupancy cliffs where a single additional register per thread can drop maximum occupancy by an entire warp group.
Hoisting one additional value into the preheader extends its live range across the entire loop body, increasing peak register pressure by one. If that increase crosses an occupancy cliff boundary, the kernel loses an entire warp's worth of parallelism per SM. This is why cicc invokes LICM early (to expose optimization opportunities for GVN, DSE, and InstCombine) and then relies on the downstream rematerialization infrastructure to undo hoists that became unprofitable after the register allocator made its decisions.
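The cliff arithmetic is easy to illustrate. A sketch assuming a 64K-entry register file, 32-thread warps, and a 48-warp hardware limit (common SM parameters; per-warp allocation granularity is simplified away):

```python
def max_warps_per_sm(regs_per_thread, regfile=65536, warp_size=32,
                     hw_max_warps=48):
    """Warps resident on one SM, as limited by the register file."""
    regs_per_warp = regs_per_thread * warp_size
    return min(regfile // regs_per_warp, hw_max_warps)

# One extra live register per thread can cost a whole warp of occupancy:
assert max_warps_per_sm(42) == 48   # at the cliff edge
assert max_warps_per_sm(43) == 47   # one hoisted value past the cliff
```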
NVVM AA and Cross-Address-Space Independence
The single most impactful NVIDIA-specific behavior in LICM is not a patch to LICM itself but the NVVM alias analysis (nvptx-aa) that feeds into MemorySSA. When LICM queries whether a load from addrspace(1) (global memory) is clobbered by a store to addrspace(3) (shared memory), NVVM AA returns NoAlias immediately. This means:
- A load from global memory inside a loop is trivially hoistable past any number of shared memory stores.
- A shared memory load is hoistable past global stores.
- Only stores to the same address space (or to addrspace(0) / generic) prevent hoisting.
This dramatically increases the set of hoistable instructions compared to a flat-memory architecture. Without NVVM AA, a conservative alias analysis would assume any store could clobber any load, making most loads inside GPU kernels non-hoistable.
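The rule reduces to a few lines. A sketch of the pairwise decision (address space numbers follow the NVVM convention quoted above: 0 = generic, 1 = global, 3 = shared; the function name is ours):

```python
GENERIC, GLOBAL, SHARED = 0, 1, 3

def may_alias(as_a: int, as_b: int) -> bool:
    """Two accesses may alias only within the same address space, or
    when either side is generic (a generic pointer can target any
    memory space)."""
    if as_a == GENERIC or as_b == GENERIC:
        return True
    return as_a == as_b

# Global load vs. shared store: NoAlias, so the load is hoistable.
assert not may_alias(GLOBAL, SHARED)
# Generic accesses conservatively alias everything.
assert may_alias(GENERIC, SHARED)
assert may_alias(GLOBAL, GLOBAL)
```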
Barrier-Aware Motion Constraints
CUDA __syncthreads() barriers are lowered to llvm.nvvm.barrier0 intrinsic calls, which are marked convergent and have memory side effects on shared memory. The convergent attribute prevents LICM from hoisting any instruction that depends (directly or transitively through the call graph) on a convergent call. The memory side effect on the barrier prevents hoisting loads across it even when the load does not depend on the barrier's value, because the barrier's MemoryDef in MemorySSA clobbers all shared-memory accesses.
This means LICM correctly refuses to hoist a shared memory load from below a __syncthreads() to above it -- doing so would read a value that the barrier was supposed to synchronize.
The NVVMLowerBarriers pass (sub_1C98160) runs between LICM invocations in the pipeline. Its position matters: barriers are still at the intrinsic level during the first LICM invocation, providing the convergent/memory-effect constraint. After lowering, the barrier semantics are encoded differently, which could affect what a later LICM invocation can move.
Interaction with Downstream Passes
LICM's hoist decisions feed into several downstream passes that can undo or refine them:
- Rematerialization (nvvmrematerialize, nv-remat-block): If hoisting increased register pressure past the target, the rematerialization pass will clone the hoisted instruction back to each use site, effectively undoing the hoist while keeping the optimization benefits at the IR level. See Rematerialization.
- Sinking2 (sub_1CC60B0): NVIDIA's custom sinking pass runs after LICM and can push instructions back toward their uses. The rp-aware-sink and max-uses-for-sinking knobs control whether the sink considers register pressure impact. See Sinking2.
- Base Address Strength Reduction: Hoisted address computations are candidates for strength reduction. The sub_1C51340 function checks whether a base address is loop-invariant, which is trivially true after LICM has hoisted it.
Configuration
IR-Level LICM Knobs (ctor_457_0 at 0x544C40)
These are standard LLVM knobs present in the cicc binary. No NVIDIA-specific knobs were found in the IR-level LICM registration.
| Knob | Type | Default | Effect |
|---|---|---|---|
| disable-licm-promotion | bool | false | Disable scalar promotion of memory locations (store-to-load forwarding within loops). When set, LICM will not replace repeated loads from a loop-invariant address with a register-held value. |
| licm-control-flow-hoisting | bool | false | Enable hoisting of instructions with control-flow-dependent execution. When disabled, only instructions that dominate the loop latch can be hoisted. |
| licm-force-thread-model-single | bool | false | Override the thread model to single-threaded, allowing LICM to hoist atomic operations. Not useful on GPU. |
| licm-max-num-uses-traversed | int | 8 | Maximum number of uses to traverse when checking whether all uses of a hoisted value are inside the loop. Limits compile time on values with many uses. |
| licm-max-num-fp-reassociations | int | (default) | Maximum FP reassociation chains LICM will attempt to hoist as a group. |
| licm-hoist-bo-association-user-limit | int | (default) | User count limit for binary operator association hoisting. |
| licm-skip-unrolled-loops | bool | false | Skip LICM on loops that have been unrolled (identified by metadata). Avoids re-hoisting values that were deliberately placed by the unroller. |
| licm-insn-limit | int | (default) | Maximum number of instructions LICM will process per loop. Compile-time safety valve. |
| licm-max-num-int-reassociations | int | (default) | Maximum integer reassociation chains for group hoisting. |
| licm-mssa-optimization-cap | int | (default) | Maximum number of MemorySSA accesses the walker will visit per query. Prevents pathological compile times on functions with dense memory access patterns. |
| licm-mssa-max-acc-promotion | int | (default) | Maximum number of MemoryAccesses LICM will attempt to promote (scalar-replace) per loop. |
IR-Level LICM Pipeline Parameters
The pass text-pipeline parser accepts two parameters for the "licm" pass:
| Parameter | Effect |
|---|---|
| allowspeculation | Allow speculative execution of hoisted instructions (loads that might trap). |
| conservative-calls | Use conservative call analysis -- treat all calls as potentially clobbering. |
The factory function sub_195E880(0) creates LICM with AllowSpeculation=false, which is the safe default for GPU code where speculative loads from unmapped memory would fault the entire kernel.
Machine-Level MachineLICM Knobs (ctor_305)
| Knob | Type | Default | Effect |
|---|---|---|---|
| avoid-speculation | bool | (default) | Avoid hoisting instructions that could speculatively execute and trap. |
| hoist-cheap-insts | bool | (default) | Hoist instructions with very low cost even when register pressure is high. |
| sink-insts-to-avoid-spills | bool | (default) | Critical GPU knob. When enabled, MachineLICM will sink (not hoist) instructions when hoisting would increase register pressure past the spill threshold. This directly trades code motion for spill avoidance. |
| hoist-const-stores | bool | (default) | Hoist stores of constant values out of loops. Enabled at the NVIDIA sinking/code-motion category level. |
NVPTX Target Gating Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
| enable-mlicm | bool | opt-level dependent | Master enable for pre-RA EarlyMachineLICM on NVPTX. |
| disable-machine-licm | bool | false | Disable pre-RA MachineLICM (stock LLVM knob). |
| disable-postra-machine-licm | bool | false | Disable post-RA MachineLICM (stock LLVM knob). |
Global Pipeline Controls
| Control | Mechanism | Effect |
|---|---|---|
| do-licm=0 | PassOptions (-opt flag) | Disables IR-level LICM entirely. Automatically set by --emit-optix-ir. |
| disable-LICMPass | -Xcicc flag | Disables IR-level LICM via the pass-disable mechanism. |
| opts[1240] | NVVMPassOptions bit | Per-invocation disable flag for IR LICM. |
| opts[2880] | NVVMPassOptions bit | Per-invocation enable flag for IR LICM (reversed logic). |
Diagnostic Strings
The IR-level LICM pass emits optimization remarks via the standard LLVM remark infrastructure. The following remark identifiers are present in upstream LLVM 20 and apply unchanged in cicc:
| Remark | Condition |
|---|---|
| "hoisted" | Instruction was successfully hoisted to preheader. |
| "sunk" | Instruction was sunk from preheader into a loop block. |
| "promoted" | Memory location was scalar-promoted (repeated load replaced with register). |
| "licm" | General LICM diagnostic (pass name in remark metadata). |
MachineLICM emits its own set:
| String | Condition |
|---|---|
| "Hoisting to BB#%d" | Machine instruction hoisted to the specified preheader block. |
| "Won't hoist cheap instruction" | Instruction deemed too cheap to justify the pressure increase. |
| "Can't hoist due to spill pressure" | sink-insts-to-avoid-spills vetoed the hoist. |
Analysis Notes
Identity Ambiguity: sub_184CD60 and sub_195E880
The pipeline analysis identified two factory functions as LICM candidates:
- sub_195E880(0): Called with explicit LICM disable guards (!opts[1240], opts[2880]). Present in the main optimizer and extended pipeline. This is the higher-confidence identification as the IR-level LICM factory.
- sub_184CD60(): Called in the O1 baseline pipeline at position 12 (after LoopRotate) and in the aggressive pipeline. Some sweeps identify this as ConstantMerge or GlobalDCE. The O1 pipeline context (LoopRotate -> sub_184CD60 -> IndVarSimplify) strongly suggests LICM, as this is the canonical upstream LLVM loop optimization sequence. However, the aggressive pipeline uses it in a position where ConstantMerge would also make sense. Without symbols, the definitive identification relies on structural context.
Both functions likely create the same underlying LICMPass -- the difference may be in the parameters (e.g., AllowSpeculation, ConservativeCalls) or the analysis dependencies they request.
No Visible NVIDIA Patches to IR-Level LICM
Unlike DSE, GVN, and InstCombine, the IR-level LICM code does not appear to contain NVIDIA-specific modifications. The 11 knobs registered at ctor_457_0 are all standard upstream LLVM options. The NVIDIA delta for LICM is architectural:
- Analysis precision: NVVM AA and enhanced MemorySSA provide better aliasing information, making LICM more aggressive without code changes.
- Pipeline orchestration: Multiple invocations at different pipeline stages with different guard conditions.
- Machine-level integration: sink-insts-to-avoid-spills and enable-mlicm provide GPU-specific pressure management.
- Downstream safety net: Rematerialization undoes unprofitable hoists after register allocation.
LICM Disabled for OptiX IR
The --emit-optix-ir mode (triggered by OptiX runtime compilation with device type 0xDEED or 0xABBA) automatically sets do-licm=0, disabling LICM entirely. This suggests that OptiX IR is intended to be consumed by a downstream optimizer (the OptiX JIT compiler) that performs its own code motion decisions, and pre-hoisting at the cicc level would interfere with those decisions.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| LICMPass::create | sub_195E880 | -- | IR-level LICM factory (AllowSpeculation=false) |
| LICMPass::create (alt) | sub_184CD60 | -- | IR-level LICM factory (identity ambiguous; may be ConstantMerge) |
| LICM knob registration | ctor_457_0 (0x544C40) | -- | 11 cl::opt registrations for IR LICM |
| MachineLICM knob registration | ctor_305 | -- | 4 cl::opt registrations for MachineLICM |
| EarlyMachineLICMPass | (in codegen pipeline) | -- | Pre-RA machine-level LICM |
| MachineLICMPass | (in codegen pipeline) | -- | Post-RA machine-level LICM |
| LoopSinkPass | pipeline parser entry 271 | -- | Inverse of LICM hoist -- sinks unprofitable hoists |
| NVVMLowerBarriers | sub_1C98160 | -- | Runs between LICM invocations; lowers barrier intrinsics |
| NVVM AA query | sub_146F1B0 | -- | Address-space-based NoAlias determination used by MemorySSA |
| MemorySSA clobber walk | sub_1A6AFB3 | -- | Walker that LICM uses to determine load hoistability |
| Loop-invariant check | sub_1C51340 | -- | Utility for checking whether a value is loop-invariant |
Differences from Upstream LLVM
| Aspect | Upstream LLVM 20 | cicc v13.0 |
|---|---|---|
| Pipeline invocations | Typically one LICM invocation in the function pipeline, plus LoopSink. | 4-6 invocations at different pipeline stages with conditional guards. |
| Alias analysis precision | BasicAA + TBAA. Cross-address-space aliasing not exploited (all code shares one address space). | NVVM AA returns NoAlias for cross-address-space pairs, dramatically increasing hoistable instruction count. |
| MemorySSA sparsity | Dense graphs on flat-memory architectures. | Sparse graphs due to NVVM AA, reducing walker overhead and improving LICM precision. |
| Register pressure feedback | MachineLICM has sink-insts-to-avoid-spills but no GPU occupancy model. | sink-insts-to-avoid-spills interacts with NVPTX's occupancy-based register targets. enable-mlicm provides target-level gating. |
| Speculative hoisting | Allowed by default on most targets. | Disabled (AllowSpeculation=false) because GPU kernels fault on speculative loads from unmapped memory. |
| OptiX mode | N/A. | LICM entirely disabled for OptiX IR emission. |
| Downstream undo | No systematic mechanism to undo unprofitable hoists. | Rematerialization (nvvmrematerialize, nv-remat-block) systematically undoes hoists that increase pressure past the occupancy target. |
Cross-References
- MemorySSA Builder for GPU -- how MemorySSA exploits NVVM AA for sparse dependency graphs
- Alias Analysis & NVVM AA -- the cross-address-space NoAlias analysis that enables aggressive hoisting
- Rematerialization -- the safety net that undoes unprofitable hoists
- Sinking2 -- NVIDIA's custom sinking pass that complements LICM sink mode
- LLVM Optimizer -- pipeline assembly and two-phase compilation
- Optimization Levels -- per-tier pipeline configuration
- Machine-Level Passes -- MachineLICM pre-RA and post-RA placement
- Loop Passes (Standard) -- LoopRotate, LCSSA, LoopSimplify that canonicalize before LICM
- Loop Unrolling -- runs after LICM in the pipeline; the LoopUnroll pass factory at sub_19B73C0 was previously mislabeled as LICM
LICM (Loop-Invariant Code Motion) -- Redirect
This page previously contained LoopUnroll content due to a sweep misidentification. The LoopUnroll pass factory at sub_19B73C0 was incorrectly labeled as LICM because the two passes are adjacent in the binary. All LoopUnroll content has been merged into the Loop Unrolling page.
For the actual LICM documentation, see: LICM (Loop-Invariant Code Motion)
The LICM page covers:
- IR-level LICM ("licm", backed by MemorySSA) -- hoist and sink modes
- Machine-level LICM ("early-machinelicm", "machinelicm") -- pre-RA and post-RA
- GPU-specific considerations: register pressure, occupancy cliffs, NVVM AA cross-address-space independence
- All pipeline positions, knobs, and diagnostic strings
- Interaction with downstream passes (rematerialization, Sinking2)
DSE (Dead Store Elimination)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source:
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp(LLVM 20.0.0)
CICC v13.0 contains a heavily modified Dead Store Elimination pass totaling approximately 91 KB of decompiled code across three major functions: the core DSE::runOnFunction at sub_19DA750 (33 KB), the overwrite detection engine at sub_19DDCB0 (28 KB), and the partial overwrite tracking system at sub_19DF5F0 (30 KB). This substantially exceeds the size of upstream LLVM DSE, primarily due to NVIDIA's additions for partial store forwarding with type conversion, cross-store dependency tracking, store-chain decomposition for aggregates, and native CUDA vector type awareness.
IR Before/After Example
DSE removes stores that are overwritten before any load reads them. The NVIDIA extension handles partial overwrites common in CUDA vector code.
Before (dead store followed by overwrite):
define void @f(ptr addrspace(1) %p, float %x, float %y) {
store float %x, ptr addrspace(1) %p, align 4 ; dead: overwritten below before any load
%other = fadd float %x, %y
store float %other, ptr addrspace(1) %p, align 4 ; overwrites the first store completely
ret void
}
After:
define void @f(ptr addrspace(1) %p, float %x, float %y) {
; first store removed -- overwritten by second store, no intervening load
%other = fadd float %x, %y
store float %other, ptr addrspace(1) %p, align 4
ret void
}
NVIDIA's DSE also handles partial overwrite patterns with CUDA vector types. When a float4 store partially overwrites a previous float4 store, the pass decomposes via GEP to determine which elements are dead. This is a key GPU extension that upstream LLVM DSE does not handle.
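The element-level dead-store question reduces to byte-interval overlap. A hedged sketch of the idea (CICC's sub_19DF5F0 tracks far more state; stores here are plain (offset, size) tuples within one object):

```python
def dead_bytes(earlier, later):
    """Byte range of `earlier` that `later` kills. Each store is
    (offset, size_in_bytes); returns (start, end) of the overwritten
    interval, or None when the stores do not overlap."""
    e_off, e_size = earlier
    l_off, l_size = later
    start = max(e_off, l_off)
    end = min(e_off + e_size, l_off + l_size)
    return (start, end) if start < end else None

# A float4 store at offset 0 (16 bytes) partially overwritten by a
# float2 store at offset 8: the .z/.w elements of the first store die.
assert dead_bytes((0, 16), (8, 8)) == (8, 16)
# Complete overwrite (the example above): the whole earlier store dies.
assert dead_bytes((0, 4), (0, 4)) == (0, 4)
# Disjoint stores: nothing is dead.
assert dead_bytes((0, 4), (8, 4)) is None
```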
Analysis Dependencies
DSE requires five analysis passes, resolved through the pass manager at registration time (sub_19DD1D0):
| Analysis | Global Address | Pass ID |
|---|---|---|
| MemorySSA | unk_4F9E06C | Memory SSA graph |
| DominatorTree | unk_4F9A488 | Dominator tree |
| MemoryDependence | unk_4F9B6E8 | Memory dependence queries |
| PostDominatorTree | unk_4F9D764 | Post-dominator tree |
| AliasAnalysis | unk_4F9D3C0 | NVVM-aware alias analysis |
Core Algorithm
The main entry point DSE::runOnFunction (sub_19DA750) processes a function by iterating over store instructions and checking whether each store is dead (fully or partially overwritten by a later store to the same location before any intervening load).
Early Exit and Setup
The pass begins with an early exit check via sub_1636880() to determine whether the function should be skipped entirely. It then retrieves MemoryDependence and AliasAnalysis from the pass manager and calls sub_14A4050 / sub_14A2F00 to verify the function contains stores worth analyzing. If no stores are present, the pass returns immediately.
Store Instruction Identification
Store instructions are identified by checking byte +16 of the instruction structure for value 77. The operand count is read from offset +20 (masked with 0xFFFFFFF), and the "has-operand-list-pointer" flag at byte +23, bit 0x40, indicates indirect operand storage for instructions with many operands.
Type Size Computation
DSE computes store sizes through a type-walker switch on byte +8 of the type structure. This logic is shared between the core pass and the overwrite detector:
| Type Code | Size | Notes |
|---|---|---|
| 1 | 16 bits | Half-precision float |
| 2 | 32 bits | Float / int32 |
| 3, 9 | 64 bits | Double / int64 |
| 4 | 80 bits | x86 long double / PTX f80 |
| 5, 6 | 128 bits | Quad precision / int128 |
| 7 | pointer-sized | Resolved via sub_15A9520 |
| 0xB | immediate | Size from upper bits of type word |
| 0xD | struct | Layout computed by sub_15A9930 |
| 0xE | vector | element_size * num_elements with alignment |
| 0xF | integer | Arbitrary-width integer |
| 0x10 | array | Recurses into element type, multiplies by count |
| 0, 8, A, C | array-like | Follows pointer chain |
The vector type formula (case 0xE) accounts for element alignment: 8 * num_elements * element_alignment * floor((element_alignment + ceil(element_bits/8) - 1) / element_alignment), i.e. each element is padded up to its alignment before multiplying by the lane count. This handles CUDA native vector types (float2, float4, int4).
Overwrite Detection
The overwrite analysis engine at sub_19DDCB0 (28 KB) determines whether one store completely or partially covers another. It receives the instruction, an operand index, alias analysis results, and address-space information.
Alias Queries
The function calls sub_14C2730 to perform alias queries with full parameters: (target_ptr, data_layout, 0, instruction, store_address, alias_analysis). This returns whether two memory locations may alias. The alias analysis already incorporates CUDA address-space separation (shared=3, global=1, local=5, constant=4), so DSE itself does not need explicit address-space checks.
Partial Store Forwarding
When store sizes do not match, NVIDIA's DSE creates truncation or extension casts to extract the relevant portion. This is a critical GPU-specific extension:
- If the source is smaller than the destination: creates an extension (opcode 36 = zext).
- If the source is larger than the destination: creates a truncation (opcode 38 = trunc).
- Alignment requirements are verified through sub_16431D0.
- Complex types use sub_15FDBD0 for cast creation; simple types use sub_15A46C0.
Standard LLVM DSE bails on size mismatches. NVIDIA's version handles the common CUDA pattern of a float4 store followed by a scalar float load by extracting the relevant component via GEP + load.
Store Size Ratio Check
At labels LABEL_25 / LABEL_29 in the core function, DSE performs a ratio check:
- Computes v159 = aligned size of the destination type.
- Computes v48 = aligned size of the source type.
- Calculates v148 = v48 / v159 (how many destination-sized elements fit in the source).
- If v48 % v159 != 0, bails (partial overlap that cannot be forwarded).
- If sizes differ, creates a GEP + load to extract the relevant portion.
Metadata Preservation
After creating a replacement instruction, the pass preserves metadata:
- Debug location via sub_157E9D0.
- Use-chain linkage by updating prev/next pointers at offsets +24/+32.
- Basic block insertion via sub_164B780.
- TBAA metadata propagation through sub_1623A60 / sub_1623210.
- nonnull attribute copying via sub_15FA300 / sub_15FA2E0.
- Use replacement via sub_164B7C0.
Partial Overwrite Tracking
The function-level partial overwrite pass at sub_19DF5F0 (30 KB) maintains a hash table of all stores in a function and tracks which stores partially overwrite each other.
Hash Table Structure
Each hash table entry is 72 bytes:
| Offset | Content |
|---|---|
| +0 | Key (store instruction pointer; -8 = empty, -16 = tombstone) |
| +8 | Operand list pointer |
| +16 | Operand count |
| +24 | Inline storage (when count <= small threshold) |
| +48 | Additional metadata |
The hash function, probing strategy, and growth/compaction thresholds follow the standard DenseMap infrastructure; see Hash Table and Collection Infrastructure. This instance uses NVVM-layer sentinels (-8 / -16) and a minimum table size of 64 entries.
Cross-Store Dependency Records
When a new store aliases an existing entry, DSE records both stores in a 6-element record: {store1, store2, operand1, operand2, ptr1, ptr2}. This enables tracking stores that partially overwrite each other even when the overwritten value has been modified between stores. Reference counting is managed through sub_1649AC0 / sub_1649B30, and per-entry operand lists grow via sub_170B450.
Store-Chain Decomposition
In the LABEL_47 region of the core function, DSE walks store chains through struct/array GEPs and decomposes aggregate stores into element-level dead store checks. sub_19D94E0 handles chain-level elimination, while sub_19D91E0 builds the comparison set for overlap detection.
Address-Space Handling
DSE does not contain explicit CUDA address-space comparisons. Address-space separation is handled entirely by the underlying NVVM alias analysis (unk_4F9D3C0), which knows that different address spaces cannot alias. The alias query function sub_14C2730 receives the full instruction context including address space, so query results already incorporate this constraint.
Store Forwarding to Loads
The function sub_19DBD20 (20 KB) attempts store-to-load forwarding. When sub_19DD7C0 finds a store feeding into a load, it constructs a replacement using sub_12815B0. Sign/zero extension matching uses type byte 15 (float types) and type byte 11 (integer types), with opcodes 45 (float-to-int truncation), 46 (int-to-float), and 47 (generic cast).
Related Passes
Two related passes are registered alongside DSE in the same code region:
- MergedLoadStoreMotion (sub_19DCD20, pass name mldst-motion): Shares the same alias infrastructure and is registered with the same analysis dependencies.
- NaryReassociate (sub_19DD420 / sub_19DD530): N-ary reassociation pass factory, registered at sub_19DD1D0 with its own analysis set.
Key Function Map
| Function | Address | Size | Role |
|---|---|---|---|
DSE::runOnFunction | 0x19DA750 | 33 KB | Main dead store elimination |
DSE::analyzeOverwrite | 0x19DDCB0 | 28 KB | Complete/partial overwrite detection |
DSE::runPartialOverwritePass | 0x19DF5F0 | 30 KB | Function-level partial tracking |
DSE::tryForwardStoresToLoad | 0x19DBD20 | 20 KB | Store-to-load forwarding |
DSE::buildOverwriteRecord | 0x19D8AF0 | -- | Overlap record construction |
DSE::buildComparisonSet | 0x19D91E0 | -- | Set of stores to compare |
DSE::eliminateStoreChain | 0x19D94E0 | -- | Chain-level elimination |
DSE::scanLoopForDeadStores | 0x19DCB70 | -- | Loop-level DSE |
DSE::runOnBasicBlock | 0x19DCC90 | -- | Block-level entry point |
DSE::extractStoreOperands | 0x19DD690 | -- | Get base pointer and stored value |
DSE::lookupDeadStoreCandidate | 0x19DD7C0 | -- | Hash table lookup |
DSE::decomposeGEPStore | 0x19DD950 | -- | GEP-based store decomposition |
DSE::collectPartialOperands | 0x19DEFC0 | -- | Partial overwrite operand collection |
DSE::checkPartialOverwrite | 0x19DEE70 | -- | Individual partial overwrite check |
DSE::tryEliminateStore | 0x19DF200 | -- | Attempt store elimination |
DSE::rehashStoreTable | 0x19DF220 | -- | Hash table resize |
Differences from Upstream LLVM
- Partial store forwarding with type conversion. Standard LLVM DSE bails when store and load sizes differ. NVIDIA's version creates GEP + load sequences to extract relevant portions, handling float4 -> float patterns.
- 72-byte hash table entries with cross-store tracking. Upstream uses simpler data structures. NVIDIA tracks which stores partially overwrite each other through 6-element dependency records.
- Store-chain decomposition. Aggregate stores are decomposed through struct/array GEPs into element-level checks, enabling elimination of stores that are collectively dead.
- Vector type awareness. The type walker includes a dedicated case for CUDA vector types with proper alignment computation.
- Total code size. At ~91 KB across three functions, NVIDIA's DSE is roughly 3x the size of upstream LLVM's equivalent.
Constant Folding: Math & Intrinsics
NVIDIA-modified pass. GPU-specific changes (110+ math name variants, 60+ NVVM intrinsic IDs, exception-safe host evaluation) are documented throughout this page.
Upstream source:
llvm/lib/Analysis/ConstantFolding.cpp (LLVM 20.0.0). The upstream ConstantFoldCall function handles standard llvm.* intrinsics; NVIDIA's extensions (sub_14D90D0 eligibility checker, sub_14D1BC0 evaluator) are layered on top.
LLVM version note: The upstream ConstantFolding.cpp in LLVM 20 handles approximately 30 standard math intrinsics (llvm.sin, llvm.cos, llvm.sqrt, etc.) and a small set of NVPTX-specific intrinsics (ceil, floor, fabs, sqrt in nvvm.* form). CICC extends this to 110+ math name variants (C, glibc __*_finite, C++ mangled _Z*) and 60+ NVVM intrinsic IDs. The upstream disable-fp-call-folding knob (cl::Hidden, default false) is preserved; NVIDIA adds a separate FPFoldDisable CiccOption for independent control.
CICC v13.0 extends LLVM's ConstantFolding analysis with two large custom functions that together enable compile-time evaluation of over 110 distinct math function name variants and 60+ NVVM intrinsic IDs. Upstream LLVM's ConstantFoldCall handles standard llvm.sin, llvm.cos, llvm.sqrt, and a handful of NVPTX-specific intrinsics (ceil, floor, fabs, sqrt in their nvvm.* forms, plus FP-to-integer conversion intrinsics). CICC goes far beyond this: it recognizes every C math library name (sin, sinf), every glibc __*_finite internal variant, every C++ mangled form (_Z3cosf, _Z4acosd), and the full set of NVVM approximate/FTZ math intrinsics -- then evaluates them using the host C math library with an exception-safe wrapper that refuses to produce results when the host FPU signals domain errors, overflow, or underflow.
The system is split into two cooperating functions. The eligibility checker sub_14D90D0 (27 KB, called nvvmIntrinsicConstantFold in the sweep analysis) is a fast predicate that answers "can this call be constant-folded?" without touching operand values. The evaluator sub_14D1BC0 (54 KB, called nvvmConstantFoldLibCall) performs the actual computation when all operands are constant. A third function, the NVVM InstCombine intrinsic folder sub_1169C30 (87 KB), handles algebraic simplification of NVVM intrinsics and is documented separately on the InstCombine page.
| Eligibility checker | sub_14D90D0 (0x14D90D0, 27 KB, 282 basic blocks, 489 edges) |
| Math evaluator | sub_14D1BC0 (0x14D1BC0, 54 KB) |
| Constant extractor | sub_14D1620 (0x14D1620) |
| Safe unary eval wrapper | sub_14D19F0 (0x14D19F0) |
| Safe binary eval wrapper | sub_14D1A80 (0x14D1A80) |
| ConstantFP builder | sub_14D17B0 (0x14D17B0) |
| Custom fabs | sub_14D1280 (0x14D1280) -- SSE2 sign-bit mask |
| Custom floor | sub_14D13B0 (0x14D13B0) -- truncation + sign correction |
| Custom ceil | sub_14D1410 (0x14D1410) -- truncation + sign correction |
| Custom sqrt | sub_14D1470 (0x14D1470) -- thin wrapper around libc sqrt |
| Vector math mapping | sub_149E420 (0x149E420, 26 KB) |
| LLVM knob | disable-fp-call-folding (upstream, cl::Hidden, default false) |
| NVIDIA knob | FPFoldDisable (NVIDIA CiccOption, disables FP constant folding) |
Two-Tier Architecture: Eligibility vs. Evaluation
The constant folding system operates as a two-phase protocol. The caller (from the ConstantFolding pass or InstCombine visitCallInst path) first invokes the eligibility checker to determine whether a call instruction is a candidate, then invokes the evaluator to produce the folded constant. This split exists for performance: the eligibility check is cheap (no operand extraction, no FP computation), while the evaluator is expensive (extracts APFloat values, calls host math library, checks FP exceptions).
Eligibility Checker: sub_14D90D0
The function takes a tagged IR node pointer and a context (intrinsic descriptor). The node pointer carries a 3-bit tag in its low bits; the function masks with ~7 to recover the aligned base. Before examining intrinsic IDs, it performs three attribute pre-filter checks on the callee:
- Speculatable/ReadNone (attribute kind 0x15 = 21): The callee must be safe to speculatively execute. If the direct callee lacks this attribute, the function follows one level of indirection through the resolved function target at [callee + 0x70] and re-checks.
- NoUnwind (attribute kind 5): The callee must not throw. Same indirection chain.
- Convergent gate (attribute kind 0x34 = 52): If the callee is marked convergent, the function returns 0 immediately. This is the critical safety check for GPU code -- convergent intrinsics like __syncthreads(), __ballot_sync(), and warp shuffle operations have warp-synchronous semantics that would be violated by folding them away, even when all arguments happen to be constant.
After attribute filtering, the function reads the intrinsic ID from [context + 0x24] (offset +36, unsigned 32-bit enum) and dispatches through a two-level scheme.
Evaluation: sub_14D1BC0
The evaluator receives the function name string, its length, an opcode/intrinsic-ID enum, a return type descriptor, an array of constant operand IR nodes, the operand count (1, 2, or 3), a flag enabling name-based matching, and a context pointer. It returns a ConstantFP or ConstantInt IR node on success, or null on failure.
The top-level dispatch is on operand count:
- Unary (count = 1): Trigonometric, exponential, logarithmic, rounding, and absolute value functions.
- Binary (count = 2): pow, fmod, atan2, copysign, fmin, fmax.
- Ternary (count = 3): FMA / fused multiply-add (opcodes 99 and 100 only).
Foldable Intrinsics Master Table
Standard LLVM Intrinsic IDs (0--211)
These are dispatched via a jump table at jpt_14D91F0 in the eligibility checker. The evaluator handles them via cascading opcode comparisons.
| ID | Hex | Intrinsic | Category |
|---|---|---|---|
| 5 | 0x05 | llvm.bswap | Bitwise |
| 6 | 0x06 | llvm.ceil | Rounding |
| 8 | 0x08 | llvm.copysign | Sign |
| 11 | 0x0B | llvm.cos | Trig |
| 12 | 0x0C | llvm.ctlz | Bitwise |
| 13 | 0x0D | llvm.ctpop | Bitwise |
| 30 | 0x1E | llvm.exp | Exponential |
| 31 | 0x1F | llvm.exp2 | Exponential |
| 32 | 0x20 | llvm.fabs | Absolute |
| 33 | 0x21 | llvm.floor | Rounding |
| 54 | 0x36 | llvm.fma | Ternary |
| 55 | 0x37 | llvm.fmuladd | Ternary |
| 96 | 0x60 | llvm.log | Logarithmic |
| 97 | 0x61 | llvm.log10 | Logarithmic |
| 99 | 0x63 | llvm.log2 | Logarithmic |
| 100 | 0x64 | llvm.lround | Rounding |
| 115 | 0x73 | llvm.maxnum | MinMax |
| 122 | 0x7A | llvm.minnum | MinMax |
| 123 | 0x7B | llvm.nearbyint | Rounding |
| 124 | 0x7C | llvm.pow | Power |
| 129 | 0x81 | llvm.powi | Power |
| 132 | 0x84 | llvm.rint | Rounding |
| 139 | 0x8B | llvm.round | Rounding |
| 140 | 0x8C | llvm.roundeven | Rounding |
| 146 | 0x92 | llvm.sin | Trig |
| 147 | 0x93 | llvm.tan | Trig |
| 187 | 0xBB | llvm.sqrt | Root |
| 188 | 0xBC | llvm.trunc | Rounding |
| 189--211 | 0xBD--0xD3 | Integer ops (umax, sadd.with.overflow, etc.) | Integer |
NVVM-Specific Intrinsic IDs (>211)
These are dispatched via cascading range checks with bitmask tests in the eligibility checker.
| ID Range | Hex | Intrinsic | Category |
|---|---|---|---|
| 3637--3639 | 0xE35--0xE37 | nvvm.bitcast.* / nvvm.move.* | Bitwise |
| 3660 | 0xE4C | nvvm.ptr.gen.to.* | Pointer |
| 3764--3765 | 0xEB4--0xEB5 | nvvm.ceil.f / nvvm.ceil.d | Rounding |
| 3778--3779 | 0xEC2--0xEC3 | nvvm.ctlz.i / nvvm.ctlz.ll | Bitwise |
| 3787 | 0xECB | nvvm.cos.approx.ftz.f | Trig |
| 3811 | 0xEE3 | nvvm.div.* / nvvm.fabs variant | Arith |
| 3870--3871 | 0xF1E--0xF1F | nvvm.exp2.approx.ftz.f / .d | Exponential |
| 3911--3912 | 0xF47--0xF48 | nvvm.fabs.f / .d | Absolute |
| 3924--3925 | 0xF54--0xF55 | nvvm.floor.f / .d | Rounding |
| 3944 | 0xF68 | nvvm.log.approx.ftz.f | Logarithmic |
| 3946 | 0xF6A | nvvm.log2.approx.ftz.f | Logarithmic |
| 3948 | 0xF6C | nvvm.log10.approx.ftz.f | Logarithmic |
| 3950 | 0xF6E | nvvm.rcp.approx.ftz.d | Reciprocal |
| 3952 | 0xF70 | nvvm.rsqrt.approx.ftz.f | Root |
| 3954 | 0xF72 | nvvm.sqrt.f / .approx.ftz.f | Root |
| 4072--4074 | 0xFE8--0xFEA | nvvm.sin/cos.approx.ftz variants | Trig |
| 4114--4115 | 0x1012--0x1013 | nvvm.max.i / .ui | MinMax |
| 4118--4119 | 0x1016--0x1017 | nvvm.min.i / .ui | MinMax |
| 4167--4168 | 0x1047--0x1048 | nvvm.max.ll / .ull | MinMax |
| 4170--4172 | 0x104A--0x104C | nvvm.min.ll / .ull | MinMax |
| 4230--4231 | 0x1086--0x1087 | nvvm.mul.hi.* | Multiply |
| 4413 | 0x113D | nvvm.sin.approx.ftz.f | Trig |
| 4475, 4478 | 0x117B, 0x117E | nvvm.sqrt.f / .rn.d | Root |
| 4483--4484 | 0x1183--0x1184 | nvvm.sqrt.approx.f / .ftz.f | Root |
| 5293 | 0x14AD | nvvm.f2i / nvvm.d2i | Conversion |
| 5300 | 0x14B4 | nvvm.i2f / nvvm.i2d | Conversion |
| 7297--7298 | 0x1C81--0x1C82 | nvvm.fmax.f / .d | MinMax |
| 7301--7302 | 0x1C85--0x1C86 | nvvm.fmin.f / .d | MinMax |
| 7334--7335 | 0x1CA6--0x1CA7 | nvvm.fmax.ftz.f / .ftz.nan.f | MinMax |
| 7339--7340 | 0x1CAB--0x1CAC | nvvm.fmin.ftz.f / .ftz.nan.f | MinMax |
Name-Based Foldable Functions (Case 0 Fallthrough)
When the intrinsic ID is 0 (unrecognized LLVM intrinsic), both the eligibility checker and the evaluator fall through to string-based matching. The evaluator uses a two-tier name matching system: fast-path intrinsic ID dispatch, then slow-path name comparison when the name-based-matching flag (decompiler argument a7) is set.
Plain C library names (44 entries):
| Category | Functions |
|---|---|
| Trigonometric | sin, sinf, cos, cosf, tan, tanf |
| Inverse trig | acos, acosf, asin, asinf, atan, atanf, atan2, atan2f |
| Hyperbolic | sinh, sinhf, cosh, coshf, tanh, tanhf |
| Exponential | exp, expf, exp2, exp2f |
| Logarithmic | log, logf, log10, log10f |
| Rounding | ceil, ceilf, floor, floorf, round, roundf |
| Absolute / Root | fabs, fabsf, sqrt, sqrtf |
| Binary | pow, powf, fmod, fmodf, atan2, atan2f |
Glibc __*_finite variants (20 entries):
__acos_finite, __acosf_finite, __asin_finite, __asinf_finite, __atan2_finite, __atan2f_finite, __cosh_finite, __coshf_finite, __exp_finite, __expf_finite, __exp2_finite, __exp2f_finite, __log_finite, __logf_finite, __log10_finite, __log10f_finite, __pow_finite, __powf_finite, __sinh_finite, __sinhf_finite
C++ mangled names (~48 entries): _Z3cosf, _Z3cosd, _Z3sinf, _Z3sind, _Z3tanf, _Z3tand, _Z3expf, _Z3expd, _Z3logf, _Z3logd, _Z4acosf, _Z4acosd, _Z4asinf, _Z4asind, _Z4atanf, _Z4atand, _Z4ceilf, _Z4ceild, _Z4coshf, _Z4coshd, _Z4exp2f, _Z4exp2d, _Z4fabsf, _Z4fabsd, _Z4sinhf, _Z4sinhd, _Z4sqrtf, _Z4sqrtd, _Z4tanhf, _Z4tanhd, _Z4fmodff, _Z4fmoddd, _Z5floorf, _Z5floord, _Z5log10f, _Z5log10d, _Z5atan2ff, _Z5atan2dd, _Z5powff, _Z5powdd, _Z5roundf, _Z5roundd
Total across all three name forms: approximately 112 distinct recognized strings.
Name Matching Algorithm
The evaluator's name matching is a hand-tuned trie-like dispatch optimized for the specific set of math function names. It avoids hash tables or sorted arrays in favor of cascading character comparisons:
nameMatch(name, length):
// Strip C++ mangling prefix
if name[0] == '_' and name[1] == 'Z':
dispatch on name[2]: // length digit
'3' -> match 3-char base: cos, sin, tan, exp, log
'4' -> match 4-char base: acos, asin, atan, ceil, cosh, exp2, fabs, sinh, sqrt, tanh, fmod
'5' -> match 5-char base: floor, log10, atan2, pow, round
verify trailing type suffix: 'f' = float, 'd' = double
return FOUND
// Strip glibc __finite prefix
if name[0] == '_' and name[1] == '_':
dispatch on name[2]:
'a' -> __acos_finite, __acosf_finite, __asin_finite, __asinf_finite,
__atan2_finite, __atan2f_finite
'c' -> __cosh_finite, __coshf_finite
'e' -> __exp_finite, __expf_finite, __exp2_finite, __exp2f_finite
'l' -> __log_finite, __logf_finite, __log10_finite, __log10f_finite
'p' -> __pow_finite, __powf_finite
's' -> __sinh_finite, __sinhf_finite
verify with memcmp against string constant
return FOUND
// Plain C library name
dispatch on name[0]:
'a' -> acos, asin, atan + 'f' variants
'c' -> cos, cosf, ceil, ceilf, cosh, coshf
'e' -> exp, expf, exp2, exp2f
'f' -> fabs, fabsf, floor, floorf
'l' -> log, logf, log10, log10f
'p' -> pow, powf
'r' -> round, roundf
's' -> sin, sinf, sinh, sinhf, sqrt, sqrtf
't' -> tan, tanf, tanh, tanhf
// Within each group, dispatch on name length:
length 3: direct 3-byte compare ("sin", "cos", "tan", "exp", "log", "pow")
length 4: DWORD compare (4-byte integer, little-endian):
0x736F6361 = "acos" 0x6E697361 = "asin"
0x6E617461 = "atan" 0x6C696563 = "ceil"
0x68736F63 = "cosh" 0x73626166 = "fabs"
0x66736F63 = "cosf" 0x686E6973 = "sinh"
0x74727173 = "sqrt" 0x686E6174 = "tanh"
0x32707865 = "exp2" 0x66707865 = "expf"
...
length 5+: memcmp against literal string constant
return FOUND or NOT_FOUND
The 4-byte integer comparison trick deserves attention: instead of calling memcmp for 4-character names, the code loads the name as a uint32_t and compares against a pre-computed little-endian constant. For example, *(uint32_t*)name == 0x736F6361 checks for "acos" ('a'=0x61, 'c'=0x63, 'o'=0x6F, 's'=0x73). This micro-optimization eliminates function call overhead for the most common name lengths.
Exception-Safe Host Evaluation
The core safety mechanism is the FP exception wrapper used for all transcendental evaluation. Both the unary wrapper (sub_14D19F0) and binary wrapper (sub_14D1A80) follow the same protocol:
Value* safeMathEval(double (*mathFunc)(double), Type* resultType, double arg) {
feclearexcept(FE_ALL_EXCEPT); // clear all FP exception flags
*__errno_location() = 0; // clear errno
double result = mathFunc(arg); // call host C library
// Check errno for domain/range error
int e = *__errno_location();
if (e == EDOM || e == ERANGE) { // errno 33 or 34
feclearexcept(FE_ALL_EXCEPT);
*__errno_location() = 0;
return nullptr; // refuse to fold
}
// Check FP exception flags (mask = 0x1D = 29)
// FE_INVALID(1) | FE_DIVBYZERO(4) | FE_OVERFLOW(8) | FE_UNDERFLOW(16)
if (fetestexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW)) {
feclearexcept(FE_ALL_EXCEPT);
*__errno_location() = 0;
return nullptr; // refuse to fold
}
// FE_INEXACT (32) is intentionally NOT checked --
// most transcendentals produce inexact results and that is acceptable.
return createConstantFP(resultType, result);
}
This design means the folder refuses to produce a result whenever the host FPU signals any exceptional condition other than inexact. The implications:
- exp(1e308) overflows on the host -- not folded, left in IR for runtime evaluation.
- log(-1.0) produces a domain error -- not folded.
- sqrt(-0.01) triggers FE_INVALID -- not folded.
- sin(0.5) produces an inexact result (sin(0.5) is irrational) -- folded normally.
Domain Pre-Checks
In addition to the post-evaluation exception check, certain functions have explicit domain guards before calling the host math library:
| Function | Precondition | Rationale |
|---|---|---|
log, logf, log10, log10f | argument > 0.0 | Negative inputs produce NaN |
sqrt, sqrtf | argument >= 0.0 | Negative inputs produce NaN |
acos, asin | no pre-check | Relies on FP exception mechanism |
The asymmetry is deliberate: log/sqrt get explicit checks because their domain violations are common and cheap to detect, while acos/asin rely on the post-evaluation FE_INVALID check.
Host FPU vs. GPU Precision
The constant folder evaluates using the host CPU's math library (j_sin, j_cos, j_exp, etc. -- PLT stubs to glibc). This creates a potential precision mismatch: the folded constant may not be bit-identical to what the GPU hardware would compute. NVIDIA mitigates this through several mechanisms:
- Custom implementations for exact functions. fabs, floor, ceil, and round have custom host-side implementations that match GPU rounding semantics exactly:
  - fabs (sub_14D1280): Pure SSE2 bitwise AND with 0x7FFFFFFFFFFFFFFF (clear sign bit). Bit-exact regardless of platform.
  - floor (sub_14D13B0): Custom truncation: for |x| < 2^52, truncate to integer, subtract 1.0 if truncation rounded toward zero for negative values, preserve sign bit. For |x| >= 2^52, return unchanged (already integral).
  - ceil (sub_14D1410): Mirror of floor: truncate to integer, add 1.0 if truncation rounded toward zero for positive values.
  - round (j__round): Uses libc round() directly (round-half-away-from-zero, matching PTX round.rni).
- Exception rejection for transcendentals. For sin, cos, exp, log, and other transcendentals, CICC accepts the host result because IEEE-754 guarantees these are correctly rounded within 1 ULP on both host and device. The exception wrapper catches cases where host and device behavior might diverge (denormals, overflow boundary).
- exp2(x) folded as pow(2.0, x). Rather than calling exp2() directly (which might differ between host and device implementations), the evaluator computes pow(2.0, x) through the binary wrapper, ensuring consistent behavior.
- No half-precision transcendental folding. The type check at the evaluator's entry rejects type byte 1 (half) for all trig/exp/log functions. Only basic operations (convert, compare) work on fp16. This is safe because half-precision math functions are implemented as promote-to-float, compute, demote-to-half -- by the time the constant folder runs, the promotion has already been inlined.
FTZ and Approximate Intrinsics
NVVM intrinsics like nvvm.exp2.approx.ftz.f and nvvm.sin.approx.ftz.f carry .approx (reduced precision) and .ftz (flush-to-zero for denormals) modifiers. These are present in the foldable ID list, which may seem surprising -- folding an "approximate" intrinsic with exact host math could produce a different value than the hardware.
The rationale: constant folding evaluates the mathematical function, not the hardware instruction. If the input is a normal float and the result is a normal float, the folded value is correct regardless of FTZ or approximation quality. The FTZ modifier only affects denormal inputs (which the exception wrapper would catch via FE_UNDERFLOW), and the .approx modifier only matters for runtime execution speed. For compile-time constants, exact evaluation is strictly better.
Comparison with Upstream LLVM
Upstream LLVM's ConstantFolding.cpp (as of LLVM 19.x) handles NVPTX intrinsics in canConstantFoldCallTo and ConstantFoldCall. The overlap and gaps:
| Capability | Upstream LLVM | CICC v13.0 |
|---|---|---|
llvm.sin, llvm.cos, llvm.exp, llvm.log, etc. | Yes | Yes |
nvvm.ceil.f, nvvm.floor.f, nvvm.fabs, nvvm.sqrt.* | Yes | Yes |
nvvm.fmax.*, nvvm.fmin.* (all variants) | Yes (including .xorsign_abs) | Yes (subset: .f, .d, .ftz, .ftz.nan) |
nvvm.f2i_*, nvvm.d2i_* (FP-to-int with rounding modes) | Yes (all 32 variants) | Partial (IDs 5293, 5300 only) |
Plain C math names (sin, cosf, exp2f, etc.) | Via TargetLibraryInfo | Direct name matching (44 entries) |
Glibc __*_finite variants | No | Yes (20 entries) |
C++ mangled _Z3cosf, _Z4acosd, etc. | No | Yes (~48 entries) |
nvvm.cos.approx.ftz.f, nvvm.exp2.approx.ftz.f, etc. | No | Yes |
nvvm.rcp.approx.ftz.d, nvvm.rsqrt.approx.ftz.f | No | Yes |
nvvm.mul.hi.* | No | Yes |
| Convergent intrinsic rejection | Implicit (no fold path) | Explicit attribute check |
| FMA constant fold | Yes (via APFloat) | Yes (opcodes 99/100, APFloat fma) |
| Integer min/max/ctlz/cttz | Partial | Yes (full NVVM ID coverage) |
The critical CICC-only capabilities are the __*_finite variants (needed when code is compiled with -ffinite-math-only), the C++ mangled names (emitted by device-side C++ math overloads), and the .approx.ftz intrinsic family.
Integer Constant Folding
The evaluator also handles integer-domain operations when operands have type tag 13 (ConstantInt) or when FP operands encode integer comparisons:
Binary integer ops (operand count = 2, both ConstantInt):
- Opcodes 189, 195, 198, 209, 210, 211: APInt binary operations (add, sub, mul, sdiv, udiv, srem) via sub_16A7290 and related APInt helpers.
- Opcodes 0xEC2 / 0xEC3 (3778/3779): ctlz (count leading zeros).
- Opcodes 0x1014/0x1015 and 0x1016/0x1017: signed/unsigned min/max via APInt comparison.
- Opcodes 0x104B/0x104C and 0x1087/0x1088: additional signed/unsigned min/max encodings.
- Opcode 3811: division where the divisor is known zero -- returns UndefValue.
Integer comparison fold (type tag 14 with integer-domain opcodes):
- Opcodes 0xBB (187) and 0x8C (140): icmp eq/ne -- predicate 0.
- Opcode 0x61 (97): icmp slt -- predicate 2.
- Opcode 0xBC (188): icmp sgt -- predicate 4.
- Opcode 0xCE (206): icmp uge -- predicate 3.
- Opcode 0x08 (8): icmp ult -- predicate 1.
These produce ConstantInt 0 or 1 via sub_169EBA0/sub_169D440.
Libdevice Integration
NVIDIA's libdevice (libdevice.10.bc) provides optimized LLVM bitcode implementations of math functions. After linking libdevice, calls like __nv_sinf are typically inlined and disappear before constant folding runs. However, if inlining fails or is disabled, residual __nv_* calls may survive.
The constant folder does not recognize __nv_* prefixed names directly. The __ name-matching path only handles glibc __*_finite patterns, not NVIDIA's __nv_* convention. Un-inlined libdevice residuals are handled upstream by the NVVM InstCombine intrinsic canonicalizer (sub_1169C30), which recognizes __nv_* prefixes and may convert them to standard LLVM intrinsics that the constant folder can then process.
The __nvvm_reflect mechanism (used for __CUDA_ARCH queries) is resolved by a separate earlier pass (NVVMReflect) that replaces __nvvm_reflect("__CUDA_ARCH") with a constant integer based on the target SM. By the time the constant folder runs, all __nvvm_reflect calls have been eliminated.
Configuration Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
disable-fp-call-folding | cl::opt<bool> | false | Upstream LLVM hidden flag. When true, prevents constant folding of any function returning or accepting floating-point types. Checked in canConstantFoldCallTo. |
FPFoldDisable | NVIDIA CiccOption | false | NVIDIA-specific flag that disables FP constant folding at the NVVM level. |
instcombine-negator-enabled | cl::opt<bool> | true | Controls the negation propagation system in sub_1169C30 (InstCombine intrinsic folder). |
instcombine-negator-max-depth | cl::opt<int> | platform-dependent | Depth limit for the negator chain in InstCombine intrinsic folding. Prevents exponential blowup when pushing negation through deep arithmetic chains. |
The FPFoldDisable knob is significant for debugging precision issues: when a kernel produces different results with -O0 vs -O2, disabling FP folding isolates whether constant-folded values are the source of the discrepancy.
ConstantFP Result Creation
The result builder sub_14D17B0 creates the final LLVM ConstantFP IR node from the evaluated double result. It dispatches on the return type byte at *(type + 8):
| Type byte | Precision | Behavior |
|---|---|---|
| 1 | half | Not reached from math folder (filtered at entry). Infrastructure exists: converts through APFloat semantics. |
| 2 | float | Truncates double to float via C cast, then converts float to APFloat via sub_169D3B0. |
| 3 | double | Stores full double precision via sub_169D3F0 (double to APFloat). |
Both paths finish with sub_159CCF0(*type, &storage) which constructs the ConstantFP node from the APFloat storage. The float path's truncation via C cast means the folded float value matches what (float)host_result produces -- this is IEEE-754 correct because the cast performs round-to-nearest-even.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
nvvmIntrinsicConstantFold | 0x14D90D0 | 27 KB | Eligibility predicate: can this intrinsic be constant-folded? |
nvvmConstantFoldLibCall | 0x14D1BC0 | 54 KB | Math evaluator: compute constant result from constant args |
extractDoubleFromConstantFP | 0x14D1620 | -- | Extract double from ConstantFP IR node |
safeMathEvalUnary | 0x14D19F0 | -- | Exception-safe unary evaluation wrapper |
safeMathEvalBinary | 0x14D1A80 | -- | Exception-safe binary evaluation wrapper |
createConstantFPResult | 0x14D17B0 | -- | Build ConstantFP from evaluated double |
customFabs | 0x14D1280 | -- | SSE2 sign-bit clear |
customFloor | 0x14D13B0 | -- | Truncation + sign correction |
customCeil | 0x14D1410 | -- | Truncation + sign correction |
customSqrt | 0x14D1470 | -- | Thin wrapper around libc sqrt |
fptoui_fptosi_fold | 0x14D1500 | -- | FP-to-integer conversion fold |
apintMoveTransfer | 0x14D15E0 | -- | APInt move/transfer helper |
vectorMathLibMapping | 0x149E420 | 26 KB | Scalar-to-vectorized math mapping table |
platformFuncCanonicalize | 0x149FA60 | 15 KB | Platform-specific name canonicalization |
constantExprFoldSCEV | 0x14D44C0 | 20 KB | ConstantExpr fold / SCEV integration |
constantFoldAggregate | 0x14D5510 | 16 KB | ConstantFold for aggregate types |
constantFoldGEPExtract | 0x14D66F0 | 17 KB | ConstantFold for GEP and extract |
constantExprSCEVBuild | 0x14DBA90 | 22 KB | ConstantExpr + SCEV builder |
AttributeList::hasAttribute | 0x1560260 | -- | Attribute query (used 8 times in eligibility checker) |
Value::getName | 0x1649960 | -- | Name string extraction (case 0 path) |
| NVVM InstCombine intrinsic fold | 0x1169C30 | 87 KB | Algebraic simplification of NVVM intrinsics (see InstCombine) |
Cross-References
- InstCombine -- The NVVM intrinsic canonicalizer (sub_1169C30) handles algebraic simplification, negation propagation, and operand folding for NVVM intrinsics. It calls constant folding as a sub-step.
- Pipeline & Ordering -- Where constant folding sits in the optimization pipeline (runs within InstCombine and as a standalone analysis).
- Builtin Table: Math Functions -- The complete list of CUDA math builtins and their mapping to NVVM intrinsics.
- CLI Flags -- FPFoldDisable and other optimization control flags.
- LLVM Knobs -- The disable-fp-call-folding flag and related InstCombine depth limits.
KnownBits & DemandedBits for GPU
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
NVIDIA's KnownBits and DemandedBits infrastructure in cicc v13.0 diverges from upstream LLVM in three structural ways. First, the two analyses are fused into a single 127 KB function (sub_11A7600) that simultaneously computes known-zero/known-one bitmasks and simplifies instructions whose demanded bits allow constant folding or narrowing -- upstream LLVM separates computeKnownBits (in ValueTracking) from SimplifyDemandedBits (in InstCombine). Second, a dedicated GPU-specific known-bits oracle (sub_F0C4B0) provides range constraints for NVIDIA special registers (%tid, %ntid, %ctaid, %nctaid, %warpsize, %laneid) that have no CPU equivalent. Third, an early NVVM pipeline pass (nvvm-intr-range at sub_216F4B0) attaches !range metadata to every special-register read intrinsic, giving downstream analyses the same bounded-range information that CPU targets only get from profile data or programmer assertions. Together these form the primary dataflow backbone for address calculation optimization, type narrowing, and dead-bit elimination in GPU kernels.
| Merged computeKnownBits + SimplifyDemandedBits | sub_11A7600 (0x11A7600, 127 KB, 4,156 lines) |
| Secondary SimplifyDemandedBits helper | sub_11A1430 (0x11A1430, 6.3 KB, 6 opcodes) |
| Per-operand demand propagation trampoline | sub_11AE940 (0x11AE940) |
| Generic computeKnownBits (reference) | sub_9AC0E0 (fallback for unhandled opcodes) |
| Debug-only reference computeKnownBits | sub_9AC330 (cross-validation oracle) |
| computeKnownBitsFromOperator | sub_11A3F30 (0x11A3F30, 50 KB) |
| computeKnownBitsFromAssume | sub_11A6910 (0x11A6910, 12.5 KB) |
| computeKnownBitsFromRangeMetadata | sub_11A68C0 |
| Post-analysis NVIDIA fixup | sub_99B5E0 (alignment + range refinement) |
| NVIDIA intrinsic known-bits oracle | sub_F0C4B0 (special register ranges) |
| Intrinsic return range analysis | sub_10CA790 + sub_11A1390 |
| NVVMIntrRange pass | sub_216F4B0 (nvvm-intr-range) |
| SelectionDAG computeKnownBits | sub_33D4EF0 (0x33D4EF0, 114 KB, 3,286 lines) |
| Pointer alignment known-bits | sub_BD5420 (getPointerAlignmentBits) |
| Debug cross-validation flag | qword_4F90C28 (enables abort-on-mismatch) |
| Max recursion depth | 6 (checked in sub_11AE940) |
GPU-Specific Known-Bits Sources
The key difference from CPU targets: GPU code has dozens of values with statically knowable ranges that never exist on a CPU. Every CUDA thread reads its identity from special registers whose values are bounded by hardware launch parameters. NVIDIA exploits this in two places: the nvvm-intr-range pass adds !range metadata at the IR level, and the target-specific known-bits oracle sub_F0C4B0 provides bitmask information directly to computeKnownBits.
Special Register Range Table
The following ranges apply to every NVVM intrinsic that reads a PTX special register. The !range metadata attached by nvvm-intr-range (sub_216F4B0) encodes [lo, hi) as an LLVM MDNode. The known-bits column shows which bits are guaranteed zero given the maximum value.
| Register | PTX | NVVM Intrinsic ID Range | Value Range | i32 Known Zero (upper bits) |
|---|---|---|---|---|
%tid.x/y/z | %tid.x | 350--352 | [0, maxntid-1] | bits [ceil(log2(maxntid)), 31] |
%ntid.x/y/z | %ntid.x | 353--355 | [1, 1024] | bits [11, 31] (at most 1024) |
%ctaid.x/y/z | %ctaid.x | 356--358 | [0, gridDim-1] | bits [ceil(log2(gridDim)), 31] |
%nctaid.x/y/z | %nctaid.x | 359--361 | [1, 2^31-1] | bit 31 (always non-negative) |
%warpsize | %WARP_SZ | ~370 | {32} (constant) | bits [0,4] = 00000, bit 5 = 1, bits [6,31] = 0 |
%laneid | %laneid | ~371 | [0, 31] | bits [5, 31] |
%warpid | %warpid | ~372 | [0, maxWarpsPerSM-1] | SM-dependent upper bits |
%smid | %smid | ~375 | [0, numSMs-1] | architecture-dependent |
%nsmid | %nsmid | ~376 | [1, numSMs] | architecture-dependent |
%gridid | %gridid | ~378 | [0, 2^32-1] | none (full range) |
%clock | %clock | ~380 | [0, 2^32-1] | none |
%lanemask_eq/lt/le/gt/ge | %lanemask_* | ~382--386 | [0, 2^32-1] | none |
When __launch_bounds__(maxThreadsPerBlock, minBlocksPerMP) is present on a kernel, nvvm-intr-range tightens the %tid ranges to [0, maxThreadsPerBlock-1] and %ntid to [1, maxThreadsPerBlock]. Similarly, nvvm.reqntid metadata (from __launch_bounds__ with exact dimensions or reqntid pragmas) can constrain each dimension independently to an exact value.
The knob nvvm-intr-range-sm (constructor ctor_359) selects the SM variant used to determine architectural limits for registers like %warpid, %smid, and %nsmid.
Address Space Known Bits
CUDA uses separate address spaces with distinct pointer bit-widths and alignment properties. These feed directly into sub_BD5420 (getPointerAlignmentBits), which ORs known-zero low bits into the KnownBits result for any pointer-typed value:
| Address Space | PTX | Pointer Width | Known Alignment | Known Bits Effect |
|---|---|---|---|---|
| 0 (generic) | default | 64 bits | none guaranteed | pointer alignment only |
| 1 (global) | .global | 64 bits | >= 16 bytes (typical) | low 4 bits often known-zero |
| 3 (shared) | .shared | 32 bits | >= 4 bytes (minimum) | low 2 bits known-zero, bits [32,63] irrelevant |
| 4 (constant) | .const | 64 bits | >= 4 bytes | low 2 bits known-zero |
| 5 (local) | .local | 32 bits | >= 4 bytes (stack) | low 2 bits known-zero, bits [32,63] irrelevant |
The 32-bit address spaces (shared and local) are critical: any value known to be a shared-memory pointer has bits [32, 63] entirely dead. The DemandedBits analysis exploits this to eliminate zero-extensions and truncations around shared-memory address calculations, keeping everything in 32-bit arithmetic.
Launch Parameter Integration
The __launch_bounds__ attribute, __maxnreg__ pragma, and nvvm.reqntid / nvvm.maxntid metadata all flow into the known-bits infrastructure:
- nvvm-intr-range pass (sub_216F4B0): Runs early in the pipeline. Reads kernel metadata (nvvm.reqntid, nvvm.maxntid) via sub_93AE30. Attaches !range metadata to every llvm.nvvm.read.ptx.sreg.* intrinsic call. The metadata format is !{i32 lo, i32 hi} where hi is exclusive.
- computeKnownBitsFromRangeMetadata (sub_11A68C0): Called during standard computeKnownBits traversal. Reads !range metadata from any value and derives known-zero/known-one masks. For a range [0, 1024), this yields knownZero = 0xFFFFFC00 (bits 10--31 known zero).
- Intrinsic return range analysis (sub_10CA790 + sub_11A1390): A separate path used when the merged computeKnownBits + SimplifyDemandedBits processes ZExt/SExt of intrinsic calls. Computes [lo, hi] bounds for the intrinsic's return value and checks whether the extension can be eliminated because the return range fits within the demanded bits.
The Merged Analysis: Algorithm and Pseudocode
Unlike upstream LLVM where InstCombiner::SimplifyDemandedBits calls computeKnownBits as a subroutine, cicc fuses them. The entry point sub_11AE870 wraps sub_11AE3E0, which calls the core sub_11A7600. A hash table at InstCombiner + 2064 tracks visited instructions to prevent infinite recursion.
Core Algorithm
// sub_11A7600 — merged computeKnownBits + SimplifyDemandedBits
// Returns: replacement instruction pointer, or NULL if no simplification
Instruction* computeKnownBitsAndSimplify(
    AnalysisCtx *ctx,    // a1 — holds IR module, pass info
    IRNode *inst,        // a2 — instruction to analyze
    APInt *demanded,     // a3 — which output bits the consumer needs
    KnownBits *result,   // a4 — output {knownZero, knownOne}
    unsigned depth,      // a5 — recursion depth (checked in caller)
    QueryState *state    // a6 — worklist context
) {
  uint8_t opcode = inst->opcode_tag;  // single-byte opcode at offset 0
  unsigned width = demanded->getBitWidth();

  // Stack-allocate 4 APInt accumulators for operand known bits
  APInt kz0(width, 0), ko0(width, 0);  // operand 0
  APInt kz1(width, 0), ko1(width, 0);  // operand 1

  switch (opcode) {
  case '*': {  // Mul — lines 654-1037
    // Pattern: if one operand is known power-of-2 from intrinsic call,
    // replace Mul with Shl (critical for threadIdx * stride)
    if (auto *rhs = matchConstantPow2Call(inst->getOperand(1))) {
      if (inst->getOperand(0)->hasOneUse())
        return createShl(inst->getOperand(0), log2(rhs));
    }
    // Generic: narrow demanded mask by leading zeros, propagate to operands
    unsigned effectiveBits = width - demanded->countLeadingZeros();
    APInt narrowDemand = APInt::getLowBitsSet(width, effectiveBits);
    propagateDemandToOperand(ctx, inst, 0, narrowDemand, &kz0, &ko0, depth+1, state);
    propagateDemandToOperand(ctx, inst, 1, narrowDemand, &kz1, &ko1, depth+1, state);
    KnownBits::computeForMul(result, {kz0,ko0}, {kz1,ko1}, inst->hasNUW(), inst->hasNSW());
    break;
  }
  case '6': {  // ZExt — lines 1677-1919
    // Check if source is intrinsic call with known return range
    if (auto range = getIntrinsicReturnRange(inst->getOperand(0))) {
      if (range.fitsBitWidth(demanded->getActiveBits()))
        return inst->getOperand(0);  // eliminate extension
    }
    // Standard: shift demanded bits down, propagate to source, zext result
    propagateDemandToOperand(ctx, inst, 0, demanded->trunc(srcWidth), ...);
    KnownBits::zext(result, srcWidth);
    break;
  }
  case 'U': {  // NVIDIA Intrinsic — lines 3521-4085
    unsigned intrinsicID = getIntrinsicID(inst);
    switch (intrinsicID) {
    case 0x0F:  handleBFE_BFI(inst, demanded, result); break;
    case 0x42:  handlePopcount(inst, demanded, result); break;
    case 0x01:  handleAbs(inst, demanded, result); break;
    case 0xB4:  handleFSHL(inst, demanded, result); break;
    case 0xB5:  handleFSHR(inst, demanded, result); break;
    case 0x12B: handleBswap(inst, demanded, result); break;
    default:
      // Fall through to NVIDIA intrinsic known-bits oracle
      sub_F0C4B0(inst, result, depth, state);
      break;
    }
    break;
  }
  // ... 13 more opcode cases (Add, Sub, Xor, PHI, Trunc, SExt, etc.)
  default:
    sub_9AC0E0(inst, result, depth, state);  // generic fallback
    break;
  }

  // POST-ANALYSIS REFINEMENT (lines 2134-2281)
  // 1. Pointer alignment: if type is pointer, OR alignment bits into knownZero
  if (inst->getType()->isPointerTy()) {
    unsigned alignBits = getPointerAlignmentBits(inst);  // sub_BD5420
    result->knownZero |= APInt::getLowBitsSet(width, alignBits);
  }

  // 2. Debug cross-validation (when qword_4F90C28 is set)
  if (DEBUG_FLAG) {
    KnownBits reference;
    sub_9AC330(inst, &reference, depth, state);  // independent computation
    if (reference != *result) {
      print("computeKnownBits(): ", reference);
      print("SimplifyDemandedBits(): ", *result);
      abort();
    }
  }

  // 3. Demand-covers-known check: can we replace with constant?
  if (demanded->isSubsetOf(result->knownZero | result->knownOne))
    return ConstantInt::get(inst->getType(), result->knownOne);

  return nullptr;
}
Demand Propagation Per Operand
The trampoline sub_11AE940 is the per-operand demand propagation entry point. It increments depth, checks the depth limit (depth > 6 returns all-unknown), and dispatches between the big handler (sub_11A7600) and the binary-arithmetic-specific helper (sub_11A1430) based on opcode class:
// sub_11AE940 — per-operand demand propagation trampoline
Instruction* propagateDemandToOperand(
    AnalysisCtx *ctx, IRNode *parent, unsigned opIdx,
    APInt *demand, KnownBits *out, unsigned depth, QueryState *state
) {
  if (depth > 6)
    return nullptr;  // MaxAnalysisRecursionDepth reached

  IRNode *operand = parent->getOperand(opIdx);
  uint8_t opcode = operand->opcode_tag;

  // Binary arithmetic subset goes to the helper
  if (opcode == '*' || opcode == '9' || opcode == ':' ||
      opcode == ';' || opcode == ',' || opcode == '8')
    return sub_11A1430(ctx, operand, demand, out, depth, state);

  // Everything else goes to the big merged handler
  return sub_11A7600(ctx, operand, demand, out, depth, state);
}
The secondary helper sub_11A1430 handles Add/Sub/Xor/Mul/BitCast/ExtractElement with a tighter structure: it uses a four-accumulator cascade with three successive isSubsetOf checks per operation, which is more aggressive than upstream LLVM's single post-merge check.
The Four-Accumulator Cascade
For binary operators (Add, Sub, Xor), cicc maintains four APInt accumulators (two per operand) and performs a three-tier check:
// Three-tier demand satisfaction check (sub_11A1430 pattern)
// More aggressive than upstream single-check approach
KnownBits kb0, kb1;
computeKnownBits(op0, &kb0, depth+1, state);
computeKnownBits(op1, &kb1, depth+1, state);
KnownBits merged = mergeForOpcode(kb0, kb1, opcode);
sub_99B5E0(inst, &merged, depth, state);  // NVIDIA post-fixup

// Check 1: merged result covers demand?
if (demanded.isSubsetOf(merged.knownZero | merged.knownOne))
  return ConstantInt::get(merged.knownOne);

// Check 2: union of operand known-bits covers demand?
if (demanded.isSubsetOf((kb0.knownZero | kb1.knownZero) |
                        (kb0.knownOne | kb1.knownOne)))
  return ConstantInt::get(...);

// Check 3: all accumulated zero|one covers demand?
if (demanded.isSubsetOf(allAccumulatedZero | allAccumulatedOne))
  return followUseDef(...);
The post-analysis fixup sub_99B5E0 is NVIDIA-specific and does not exist in upstream LLVM. It applies additional refinements from thread index range constraints, warp-level uniformity, and shared memory alignment guarantees.
DemandedBits for GPU: Narrowing Optimizations
The DemandedBits analysis is the backward complement to KnownBits' forward analysis. When a consumer only needs the low N bits of a value, the producer can be narrowed or eliminated. On GPU, this interaction is dramatically more productive than on CPU because of three factors:
- 32-bit address spaces: Shared memory (AS 3) and local memory (AS 5) use 32-bit pointers. When address calculations are performed in i64 (as the generic address space requires), the upper 32 bits are entirely undemanded for shared/local accesses. DemandedBits proves this and enables truncation to i32.
- Bounded thread indices: threadIdx.x * stride + offset patterns produce values that fit in far fewer bits than i32. If threadIdx.x < 256 (from __launch_bounds__) and stride < 4096, the product fits in 20 bits. DemandedBits propagates this, enabling downstream shifts and masks to operate on narrower types.
- Type demotion to i16/fp16: When DemandedBits proves only the low 16 bits of an i32 computation matter, cicc can demote to 16-bit operations. The function at sub_1185740 (InstCombine's visitTrunc) inserts narrowing truncations. This is particularly valuable for texture coordinate calculations and index arithmetic in tensor core operations.
Dead Bit Elimination
The core optimization check appears approximately 15 times across the analysis functions:
// Inline version (width <= 64):
uint64_t unknown = ~(knownZero | knownOne);
if ((demanded & unknown) == 0) {
  // All demanded bits are determined -> replace with constant
  return ConstantInt::get(type, knownOne);
}

// Wide version (width > 64):
if (demanded.isSubsetOf(knownZero | knownOne)) {
  return ConstantInt::get(type, knownOne);  // sub_AD6220
}
This is the heart of the analysis: backward-propagated demand meets forward-propagated known-bits. When they cover every bit the consumer needs, the entire instruction is dead and can be replaced with a compile-time constant.
GPU Patterns Enabled by Known Bits
The following simplifications are GPU-specific and do not have CPU equivalents:
Mul to Shl for threadIdx arithmetic (lines 714--861): When both operands of a multiply originate from intrinsic calls with known power-of-2 returns (e.g., threadIdx.x * blockDim.x where blockDim is a power-of-2 from __launch_bounds__), the multiply is replaced with a left shift. The pattern matcher checks sub_BCAC40 (hasOneUse) and sub_10A0620 (createShl replacement).
Bswap + BFE fusion (lines 3959--4007): Detects a byte-swap feeding into a bit-field extract and replaces with a direct byte read at the swapped offset. Common in endianness conversion code for shared memory operations.
ZExt/SExt elimination via intrinsic return range (sub_10CA790 path): When a ZExt or SExt extends the result of an NVVM intrinsic call, and the intrinsic's annotated return range fits entirely within the demanded bits, the extension is eliminated. This fires frequently for threadIdx.x reads extended to i64 for address calculations.
BitCast-through-ZExt folding (sub_11A1430 at 0x11A2360): When a BitCast's source is a ZExt and the demanded bits fit within the original narrow type, the bitcast+zext chain collapses to the original value. Common in CUDA address calculations involving zero-extension followed by pointer reinterpretation.
SelectionDAG computeKnownBits
The DAG-level known-bits analysis at sub_33D4EF0 (114 KB, 3,286 lines) mirrors the IR-level analysis but operates on SDNode opcodes. It handles 112 opcode cases organized into 14 groups.
NVPTX Target Node Known Bits
For NVPTX-specific DAG opcodes (above ISD::BUILTIN_OP_END = 499), the function delegates to NVPTXTargetLowering::computeKnownBitsForTargetNode via vtable slot 254 at offset 2032. The key NVPTX-specific cases:
| Opcode Range | NVPTX DAG Node | Known-Bits Behavior |
|---|---|---|
| 0x152--0x161 (338--353) | TEX, SULD, surface ops | Result width known: bits above element size set to zero |
| 0x12A (298) | LoadV2, LoadParam | Extension mode from flags byte bits[2:3]: zext/sext/none |
| 0x16A, 0x16C (362, 364) | StoreParam, StoreRetval | When flags bits[2:3] == 0b11: element type width known |
| 0x175 (373) | ConstantPool | Uses ConstantRange::fromKnownBits intersection |
| 0xCA (202) | INTRINSIC_WO_CHAIN | Boolean-like: bit 0 unknown, bits [1..width] known zero |
| >= 499 | All target-specific | Delegates to vtable[254] computeKnownBitsForTargetNode |
The DAG-level analysis uses the same recursion depth cap of 6 (a6 > 5 returns all-unknown), matching LLVM's MaxAnalysisRecursionDepth.
Texture/Surface Fetch Result Width
Cases 0x152--0x161 encode the known bit-width of texture and surface fetch results. For an 8-bit texture fetch zero-extended to i32, the analysis sets bits [8, 31] as known-zero in the result. This enables downstream shift and mask elimination in texture sampling code.
KnownBits Data Structure Layout
Both the IR-level and DAG-level implementations use the same 32-byte struct:
struct KnownBits {            // 32 bytes total
  union {
    uint64_t  val;            // +0x00: inline storage (width <= 64)
    uint64_t *ptr;            // +0x00: heap pointer (width > 64)
  } knownZero;
  uint32_t knownZero_width;   // +0x08: bit-width
  uint32_t _pad0;             // +0x0C: padding
  union {
    uint64_t  val;            // +0x10: inline storage (width <= 64)
    uint64_t *ptr;            // +0x10: heap pointer (width > 64)
  } knownOne;
  uint32_t knownOne_width;    // +0x18: bit-width
  uint32_t _pad1;             // +0x1C: padding
};
// Invariant: (knownZero & knownOne) == 0 (no bit both 0 and 1)
// Threshold: width > 64 triggers heap allocation via sub_C43690
Roughly 43% of sub_11A1430's binary size consists of APInt destructor sequences (cmp [rbp+var], 0x40; jbe skip; call free) for the width > 64 cleanup paths.
Configuration
| Knob | Source | Default | Effect |
|---|---|---|---|
nvvm-intr-range-sm | ctor_359 | Current target SM | SM variant used to compute special register ranges for nvvm-intr-range pass |
scev-cgp-tid-max-value | ctor_XXX | Architecture limit | Maximum value of thread ID used in SCEV-based CodeGenPrep address calculations |
nv-remat-threshold-for-spec-reg | unk_4FD3860 | 20 | Threshold controlling when special register reads are rematerialized instead of spilled (interacts with known-bits because remat preserves range metadata) |
qword_4F90C28 | internal debug flag | 0 (disabled) | Enables cross-validation abort: runs independent reference computeKnownBits (sub_9AC330) and aborts if results disagree with merged analysis |
| Max recursion depth | hardcoded | 6 | Matches LLVM's MaxAnalysisRecursionDepth; checked in sub_11AE940 |
| APInt inline threshold | hardcoded | 64 bits | Values <= 64 bits use inline uint64 storage; wider values heap-allocate |
Diagnostic Strings
The merged analysis emits the following diagnostics (only in debug/assert builds when qword_4F90C28 is set):
| String | Location | Trigger |
|---|---|---|
"computeKnownBits(): " | sub_11A7600 line ~2204 | Cross-validation mismatch: prints the reference implementation's result |
"SimplifyDemandedBits(): " | sub_11A7600 line ~2208 | Cross-validation mismatch: prints the merged analysis result |
"Mismatched known bits for <inst> in <func>" | sub_11A7600 line ~2200 | Precedes the two values above; followed by abort() |
The nvvm-intr-range pass emits:
| String | Location |
|---|---|
"Add !range metadata to NVVM intrinsics." | sub_216F4B0 (pass registration) |
NVVM IR Node Layout
The KnownBits analysis traverses IR nodes using cicc's internal representation. Each node is 32 bytes:
struct IRNode {           // 32 bytes (0x20)
  uint8_t  opcode;        // +0x00: single-byte opcode tag (ASCII-based)
  uint8_t  flags;         // +0x01: bit 1, bit 2 = nsw/nuw flags
  uint16_t _reserved;     // +0x02
  uint32_t operand_idx;   // +0x04: 27-bit operand index + 5-bit flags
                          //        byte 7 bit 6 (0x40) = use-list vs indexed
  // ... remaining 24 bytes: use-list pointers, type info, metadata
};
// Operand resolution:
//   If byte[7] & 0x40 (use-list flag set):
//     operand = *(node - 8) -> *(ptr + 0x20)
//   If byte[7] & 0x40 == 0 (indexed):
//     idx = (node[4..7] & 0x7FFFFFF)
//     operand = node - (idx << 5)   // 27-bit index * 32 bytes
The 27-bit index allows up to 134 million nodes (4 GB theoretical IR size).
Function Map
IR-Level Known-Bits
| Function | Address | Size |
|---|---|---|
computeKnownBitsAndSimplify -- merged main analysis | sub_11A7600 | 127 KB |
SimplifyDemandedBitsHelper -- binary arithmetic subset | sub_11A1430 | 6.3 KB |
| Per-operand demand propagation trampoline (depth check) | sub_11AE940 | varies |
| SimplifyDemandedBits entry wrapper (allocates APInts) | sub_11AE870 | thin |
| SimplifyDemandedBits result caching (hash table at IC+2064) | sub_11AE3E0 | 235 lines |
computeKnownBitsFromOperator / PHI merge | sub_11A3F30 | 50 KB |
computeKnownBitsFromAssume (processes @llvm.assume) | sub_11A6910 | 12.5 KB |
computeKnownBitsFromRangeMetadata (reads !range) | sub_11A68C0 | varies |
Generic computeKnownBits (fallback, no simplification) | sub_9AC0E0 | varies |
Reference computeKnownBits (debug cross-validation only) | sub_9AC330 | varies |
| NVIDIA post-analysis fixup (alignment + range refinement) | sub_99B5E0 | varies |
| NVIDIA intrinsic known-bits oracle (special registers) | sub_F0C4B0 | varies |
isNVVMFunction check (NVIDIA-specific flag) | sub_F0C3D0 | varies |
Intrinsic return range analysis (computes [lo, hi]) | sub_10CA790 | 11.2 KB |
| Extract return range bounds from range analysis result | sub_11A1390 | varies |
getPointerAlignmentBits (alignment-derived known zeros) | sub_BD5420 | varies |
isDemandedBitsFullyKnown (demand subset-of known) | sub_10024C0 | varies |
NVVMIntrRange pass -- attaches !range metadata | sub_216F4B0 | varies |
SelectionDAG-Level Known-Bits
| Function | Address | Size |
|---|---|---|
SelectionDAG::computeKnownBits (recursive, 112 opcode cases) | sub_33D4EF0 | 114 KB |
Creates all-demanded mask, delegates to sub_33D4EF0 | sub_33DD090 | wrapper |
computeMinLeadingZeros (calls sub_33D25A0 + returns) | sub_33D4D80 | wrapper |
computeNumSignBits (parallel switch structure) | sub_33D25A0 | 49 KB |
computeOverflowForAdd / computeOverflowForSub | sub_33DCF10 | varies |
KnownBits Arithmetic Helpers
| Function | Address |
|---|---|
KnownBits::computeForMul(result, nuw, nsw, kb0, kb1) | sub_C70430 |
KnownBits::add(a, b, nsw, nuw, carry) | sub_C74E10 |
KnownBits::sub(a, b, nsw, nuw) | sub_C75B70 |
KnownBits::computeForAddSub(isSub, nsw, nuw, a, b) | sub_C76560 |
KnownBits::shl(a, shamt) | sub_C73220 |
KnownBits::lshr(a, b) | sub_C738B0 |
KnownBits::ashr(a, b) | sub_C73E40 |
KnownBits::and(a, b, commutative) | sub_C787D0 |
KnownBits::or(a, b) | sub_C78F20 |
KnownBits::xor(a, b) | sub_C790F0 |
KnownBits::mergeForPHI / smax(a, b) | sub_C79480 |
KnownBits::truncate / smulh(a, b) | sub_C7B4D0 |
KnownBits::cttz(a, shift) | sub_C7BCF0 |
KnownBits::ctpop(a) | sub_C7BD50 |
KnownBits::bswap(a) | sub_C7BDB0 |
KnownBits::abs(a, known_shift) | sub_C746C0 |
KnownBits::umin(a, b) | sub_C740A0 |
KnownBits::umax(a, b) | sub_C74180 |
KnownBits::ctlz(a, poisonAtZero) | sub_C778B0 |
APInt Utilities
| Function | Address |
|---|---|
APInt(width, 0) -- zero-init constructor (heap for width > 64) | sub_C43690 |
APInt copy constructor | sub_C43780 |
APInt::operator&= | sub_C43B90 |
APInt::operator\|= | sub_C43BD0
APInt::setBits(lo, hi) | sub_C43C90 |
APInt::flipAllBits | sub_C43D10 |
APInt::trunc(width) | sub_C44740 |
APInt::zext(width) | sub_C449B0 |
APInt::sext(width) | sub_C44830 |
APInt::countTrailingZeros | sub_C44590 |
APInt::countLeadingZeros | sub_C444A0 |
APInt::countPopulation | sub_C44630 |
APInt::isSubsetOf(other) | sub_C446F0 |
APInt::reverseBits / byteSwap | sub_C44AB0 |
ConstantInt::get(type, APInt) -- creates constant replacement | sub_AD6220 |
ConstantInt::get(type, value, isSigned) | sub_AD64C0 |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Analysis architecture | Separate computeKnownBits (ValueTracking) and SimplifyDemandedBits (InstCombine) | Fused into single 127 KB function (sub_11A7600) that simultaneously computes bitmasks and simplifies instructions |
| GPU register ranges | No special register concept; all values have full-width range | Dedicated oracle (sub_F0C4B0) provides known-zero bits for %tid, %ntid, %ctaid, %warpsize, %laneid, and 10+ PTX special registers |
| Range metadata injection | No equivalent pass; range info comes from profile data or programmer annotations | nvvm-intr-range pass (sub_216F4B0) attaches !range metadata to every special-register read; tightened by __launch_bounds__ |
| Warp size | Not a concept; no constant is known | %warpsize is statically known to be exactly 32 (known-zero bits [0,4] and [6,31], bit 5 = 1) |
| Cross-validation | No cross-validation in release builds | Debug flag qword_4F90C28 enables abort-on-mismatch between computeKnownBits and SimplifyDemandedBits results |
| SelectionDAG integration | Separate DAG-level computeKnownBits (~60 KB) | Extended DAG-level version at sub_33D4EF0 (114 KB, 3,286 lines) with GPU-specific value tracking |
| Max recursion depth | 6 (configurable) | Same default 6, checked in sub_11AE940 with identical semantics |
Cross-References
- InstCombine -- The primary consumer of KnownBits analysis; sub_11AE870 is called from the binary operator visitor's Phase 0.
- SelectionDAG -- DAG-level known-bits at sub_33D4EF0 feeds into DAGCombine and instruction selection pattern matching.
- Loop Strength Reduction -- LSR interacts with shared-memory known-bits through the lsr-no-ptr-address-space3 knob that disables LSR for 32-bit shared memory pointers.
- GVN -- sub_9AC330 (reference computeKnownBits) is also called from GVN to validate value numbering decisions.
- LICM -- Loop-invariant code motion uses known-bits to prove that hoisted expressions are safe (no integer overflow when known-bits constrain the range).
CodeGenPrepare and SCEV-CGP
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Upstream CodeGenPrepare is stock LLVM 20.0.0 CodeGenPrepare.cpp with all 20+ cl::opt knobs unchanged. SCEV-CGP is a fully proprietary NVIDIA pass with no upstream equivalent; it is disabled by default (nv-disable-scev-cgp = true).
cicc v13.0 contains two distinct passes that prepare LLVM IR for the NVPTX backend's instruction selection. The first is upstream LLVM's CodeGenPreparePass, registered as "codegenprepare" in the New PM pipeline (line 216 of sub_2342890), which sinks address computations, creates PHI nodes for sunk values, and splits critical edges. The second is NVIDIA's proprietary SCEV-CGP (Scalar-Evolution-based Code Generation Preparation), a fully custom pass that uses SCEV analysis to rewrite address expressions with GPU thread ID as an induction variable.
Both passes operate at the LLVM IR level, immediately before SelectionDAG construction. They share the goal of making address expressions cheap for the backend to lower, but they work at different abstraction levels: CodeGenPrepare operates syntactically on individual memory instructions; SCEV-CGP operates semantically on entire address expression families using scalar evolution. NVIDIA disables SCEV-CGP by default (nv-disable-scev-cgp defaults to true), relying on upstream CodeGenPrepare plus the downstream Base Address Strength Reduction and Common Base Elimination passes to handle GPU address optimization.
Key Facts
| Property | Value |
|---|---|
| Pass name (upstream) | codegenprepare (New PM) |
| Pass name (NVIDIA) | SCEV-CGP (no formal New PM pass name found in binary) |
| Binary range (v12.x) | 0x1D60000--0x1D7FFFF (helpers + main transforms) |
| Binary range (v13.0) | 0x2D75700--0x2D88660 (Cluster 6 in 0x2D sweep) |
| Address sinking | sub_1D73760 / sub_2D75700 (65--72 KB), string "sunkaddr" |
| PHI sinking | sub_1D706F0 / sub_2D784F0 (64--68 KB), string "sunk_phi" |
| Block splitting | sub_1D7AA30 / sub_2D88660 (54--74 KB), strings ".unlikely", ".cond.split" |
| Main transform | sub_2D80050 (54 KB) -- orchestrates address mode lowering |
| SCEV-CGP knob ctor | ctor_263_0 at 0x4F36F0 (9.9 KB, 44 option strings) |
| CGP knob ctor | ctor_288_0 at 0x4FA950 (8.6 KB, 44 option strings) |
| Master disable | nv-disable-scev-cgp (default: true -- SCEV-CGP is disabled) |
| Upstream source | llvm/lib/CodeGen/CodeGenPrepare.cpp |
| Pipeline position | Late IR, immediately before SelectionDAG ISel |
Upstream CodeGenPrepare
Purpose
CodeGenPrepare is the last IR-level pass before instruction selection. Its job is to transform the IR into a form that the SelectionDAG builder can lower efficiently: address computations should be adjacent to their memory uses (reducing live ranges), complex addressing modes should be materialized as GEP chains that ISel can pattern-match, and unlikely branches should be split into cold blocks so that block placement can isolate them.
On NVPTX this pass is less critical than on x86 because PTX has simpler addressing modes (base + offset, no scaled index), but it still performs three important transforms.
Transform 1: Address Sinking (sunkaddr)
The address sinking logic lives in sub_1D73760 (v12.x) / sub_2D75700 (v13.0). It identifies memory instructions whose address operand is computed in a dominating block, then sinks the computation to the block containing the memory instruction. The sunk address is named "sunkaddr" in the IR, appearing as a GEP, inttoptr, or bitcast chain:
Before:
entry:
%addr = getelementptr float, ptr %base, i64 %idx
br label %loop
loop:
%val = load float, ptr %addr ; addr live across loop
After:
entry:
br label %loop
loop:
%sunkaddr0 = getelementptr float, ptr %base, i64 %idx
%val = load float, ptr %sunkaddr0 ; addr local to use
The naming convention "sunkaddr" with a numeric suffix (20+ occurrences in binary string references) is the standard LLVM naming. Each sunk address gets a unique suffix: sunkaddr0, sunkaddr1, etc.
The sinking decision is controlled by a cache called ValueToSunkAddr (a DenseMap at sub_2CE7CF0 in the v13.0 build). Before sinking a value, the pass checks whether the same address expression has already been sunk into the target block. If so, it reuses the existing sunk copy rather than creating a duplicate.
The core sinking algorithm:
for each basic block BB in function:
for each instruction I in BB:
if I is a memory instruction (load/store/atomic):
addr = I.getPointerOperand()
if addr.getParent() != BB:
// addr defined in a dominating block
addr_mode = matchAddressMode(addr) // sub_2D67BB0
if addr_mode.isFoldable():
sunk = materializeAddrMode(addr_mode, BB) // sub_2D68450
I.setPointerOperand(sunk)
mark changed
Key helpers in the v13.0 build:
| Function | Role |
|---|---|
| sub_2D749D0 | Address mode cache lookup |
| sub_2D67BB0 | Address mode legality test / matching |
| sub_2D6E640 | Address mode cache insert |
| sub_2D68450 | Address mode materialization |
| sub_2CE7CF0 | ValueToSunkAddr DenseMap handling |
Transform 2: PHI Sinking (sunk_phi)
When an address computation has multiple uses in successor blocks of a conditional branch, the pass creates a PHI node in the merge block rather than sinking independent copies into each successor. The resulting PHI is named "sunk_phi":
Before:
entry:
%addr = getelementptr float, ptr %base, i64 %idx
br i1 %cond, label %then, label %else
then:
%v1 = load float, ptr %addr
br label %merge
else:
%v2 = load float, ptr %addr
br label %merge
After (conceptual):
then:
%sunkaddr0 = getelementptr float, ptr %base, i64 %idx
%v1 = load float, ptr %sunkaddr0
br label %merge
else:
%sunkaddr1 = getelementptr float, ptr %base, i64 %idx
%v2 = load float, ptr %sunkaddr1
br label %merge
When the two sunk copies would be identical and the value is needed in the merge block for other uses, the pass instead creates:
merge:
%sunk_phi = phi ptr [ %sunkaddr0, %then ], [ %sunkaddr1, %else ]
The PHI creation calls sub_B44260 (PHI node setup), with naming via sub_BD6B50. The addr-sink-new-phis cl::opt knob (registered at ctor_288_0) controls whether the pass is allowed to create new PHIs during address sinking. The addr-sink-new-select knob similarly controls creation of new select instructions.
Transform 3: Block Splitting
sub_1D7AA30 (v12.x) / sub_2D88660 (v13.0) splits basic blocks to isolate unlikely paths. The pass creates blocks with suffixes ".unlikely" and ".cond.split", allowing MachineBlockPlacement to push cold code away from the hot path. This is driven by branch probability metadata and profile-guided section prefix hints.
On NVPTX, block splitting interacts with StructurizeCFG: the split blocks must still form reducible control flow, otherwise StructurizeCFG will have to insert additional flow blocks to restore structure. The profile-guided-section-prefix knob controls whether section prefix metadata (.hot, .unlikely, .unknown) is attached to split blocks.
Upstream CodeGenPrepare Knobs
All registered at ctor_288_0 (0x4FA950, 8.6 KB, 44 strings). These are standard LLVM cl::opt knobs, unchanged from upstream:
| Knob | Type | Effect |
|---|---|---|
| disable-cgp-branch-opts | bool | Disable CodeGenPrepare branch optimizations |
| disable-cgp-gc-opts | bool | Disable CodeGenPrepare GC optimizations |
| disable-cgp-select2branch | bool | Disable select-to-branch conversion |
| addr-sink-using-gep | bool | Use GEP instructions for address sinking (vs. inttoptr) |
| enable-andcmp-sinking | bool | Sink and/cmp instruction pairs into branches |
| disable-cgp-store-extract | bool | Disable store-extractvalue optimization |
| stress-cgp-store-extract | bool | Stress test store-extractvalue path |
| disable-cgp-ext-ld-promotion | bool | Disable extension-load promotion |
| disable-preheader-prot | bool | Disable loop preheader protection |
| profile-guided-section-prefix | bool | Attach section prefix based on profile data |
| cgp-freq-ratio-to-skip-merge | int | Block frequency ratio threshold to skip block merging |
| force-split-store | bool | Force store splitting |
| cgp-type-promotion-merge | bool | Merge type promotions |
| disable-complex-addr-modes | bool | Disable complex addressing mode optimization |
| addr-sink-new-phis | bool | Allow creating new PHIs during address sinking |
| addr-sink-new-select | bool | Allow creating new select during address sinking |
| addr-sink-combine-base-reg | bool | Combine base register in address sink |
| addr-sink-combine-gv | bool | Combine global value in address sink |
| addr-sink-combine-offs | bool | Combine offset in address sink |
| addr-sink-combine-scaled-reg | bool | Combine scaled register in address sink |
| cgp-split-large-offset-gep | bool | Split GEPs with large offsets |
GPU Relevance of Upstream Knobs
Most of these knobs are effectively no-ops on NVPTX because the target's addressing modes are simple (base + immediate offset, no scaled index register). However, a few matter:
- addr-sink-using-gep: Controls whether sunk addresses use GEP or inttoptr chains. On NVPTX, GEP chains are preferred because they preserve address space information through lowering. The inttoptr path strips address space, forcing the backend to re-derive it.
- cgp-split-large-offset-gep: Relevant for large array accesses where the constant offset exceeds the PTX immediate encoding width (±2^31 for 64-bit addressing). Splitting the GEP allows the backend to use a base register plus a small offset rather than a 64-bit constant.
- addr-sink-new-phis: On GPU, creating new PHIs can increase divergent live ranges. If the condition driving the PHI is thread-divergent, the PHI result will be divergent, potentially requiring a wider (per-lane) register allocation.
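The large-offset case can be made concrete with a sketch of the split arithmetic. This illustrates the idea behind cgp-split-large-offset-gep only; the threshold constant and struct/function names are assumptions, not recovered code:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: a constant byte offset too large for the immediate
// encoding is split into a base adjustment (materialized once into a
// register) plus a small residual offset foldable into the load/store.
constexpr int64_t IMM_LIMIT = (1LL << 31) - 1;  // signed 32-bit immediate range

struct SplitOffset {
    int64_t base_adjust;  // added to the base register once
    int64_t residual;     // small offset encodable as an immediate
};

inline SplitOffset splitLargeOffset(int64_t offset) {
    if (offset >= -IMM_LIMIT - 1 && offset <= IMM_LIMIT)
        return {0, offset};                  // already encodable as-is
    int64_t residual = offset % IMM_LIMIT;   // keep a small encodable remainder
    return {offset - residual, residual};
}
```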
NVIDIA SCEV-CGP
What Is It?
SCEV-CGP is a fully custom NVIDIA pass that uses LLVM's ScalarEvolution analysis to optimize address mode expressions at the function level, with specific awareness of GPU thread ID as an induction variable. Where upstream CodeGenPrepare operates syntactically (pattern-matching individual instructions), SCEV-CGP operates semantically: it analyzes address expressions as SCEV recurrences, factors out common base computations, and rewrites them to minimize register pressure.
The pass is registered in ctor_263_0 at 0x4F36F0 alongside Base Address Strength Reduction knobs. The 44 strings registered in this single constructor cover both SCEV-CGP and BASR, confirming they are part of the same address optimization subsystem.
Why NVIDIA Disables It By Default
The nv-disable-scev-cgp knob defaults to true (the description string reads "Disable optimize addr mode with SCEV pass", and the raw option data at ctor_609_0 marks it def=on, i.e. SCEV-CGP is disabled). This is a deliberate choice:
- Redundancy with BASR/CBE. NVIDIA has invested heavily in Base Address Strength Reduction (62 KB) and Common Base Elimination (39 KB), which handle the most profitable GPU address optimizations (sharing base computations across array accesses in loop bodies). These passes are simpler, more predictable, and better-tested than the general SCEV-CGP framework.
- Interaction with LSR. Both SCEV-CGP and Loop Strength Reduction operate on SCEV expressions. If both are active, they can fight over the same address expressions: LSR rewrites IVs for loop-carried efficiency, then SCEV-CGP undoes part of that work to optimize address modes. The result can be worse than either pass alone. By disabling SCEV-CGP, NVIDIA lets LSR (with its full GPU-aware formula solver) handle SCEV-based address optimization without interference.
- Compile-time cost. SCEV-CGP with aggressive mode (do-scev-cgp-aggresively [sic]) is expensive. The scev-cgp-inst-limit and scev-cgp-control knobs exist precisely because uncontrolled SCEV-CGP can balloon compile times on large kernels with many address expressions.
- Overflow hazards. The ignore-32-bit-overflow and ignore-signed-32-bit-overflow knobs in ctor_263_0 indicate that SCEV-CGP can produce address arithmetic that overflows 32-bit intermediates. On GPU where 32-bit addressing is common (shared memory, constant memory), this is a correctness risk that NVIDIA mitigates by keeping the pass off by default.
When SCEV-CGP Would Be Beneficial
Despite being disabled by default, the pass has 11 dedicated knobs -- NVIDIA clearly uses it selectively:
- Kernels with complex strided access patterns where thread ID participates in multi-dimensional address calculations (e.g., base + tid.x * stride_x + tid.y * stride_y + tid.z * stride_z). BASR handles the case where multiple accesses share a base, but it does not factor thread ID expressions across dimensions.
- Register-pressure-critical kernels at occupancy cliffs where SCEV-based address strength reduction can save enough registers to cross an occupancy boundary. The scev-cgp-tid-max-value knob lets the pass reason about the bounded range of thread IDs, enabling tighter value range analysis.
- Function-level address optimization (enabled by do-function-scev-cgp) where cross-loop base sharing matters more than per-loop IV optimization.
Thread ID Max Value Knob
The scev-cgp-tid-max-value knob deserves special attention. It provides SCEV analysis with the maximum possible value of a GPU thread ID, which is architecture-dependent:
- threadIdx.x: max 1024 (all architectures sm_70+)
- threadIdx.y: max 1024
- threadIdx.z: max 64
- blockIdx.x: max 2^31 - 1
By telling SCEV that threadIdx.x is bounded by 1024, the analysis can prove that threadIdx.x * element_size fits in 32 bits for element sizes up to ~2 million bytes. This enables 32-bit address arithmetic where the expression would otherwise be widened to 64 bits. The knob links to the Known Bits analysis documented in Known Bits, where the nvvm-intr-range pass provides similar bounded-range information for special registers.
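The bound reasoning can be checked with a small sketch. The helper name is illustrative (not from the binary); it only restates the arithmetic above: with threadIdx.x < 1024, tid * element_size stays in signed 32-bit range whenever the element size is at most INT32_MAX / 1023, roughly 2 MB.

```cpp
#include <cstdint>

// Sketch of the range reasoning enabled by scev-cgp-tid-max-value:
// threadIdx.x is bounded by 1024, so tid is at most 1023.
constexpr int64_t TID_MAX      = 1024;        // scev-cgp-tid-max-value for threadIdx.x
constexpr int64_t I32_MAX_VAL  = 2147483647;  // 2^31 - 1

// True if tid * element_size provably fits in i32 for every tid in [0, TID_MAX).
constexpr bool fits_in_i32(int64_t element_size) {
    return (TID_MAX - 1) * element_size <= I32_MAX_VAL;
}
```

With this bound the widening to 64-bit arithmetic can be skipped for element sizes up to ~2 million bytes, matching the figure quoted above.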
SCEV-CGP Knobs (Complete Reference)
All registered in ctor_263_0 at 0x4F36F0. These are NVVMPassOptions values, stored in the 222-slot pass option registry.
| Knob | Type | Default | Effect |
|---|---|---|---|
do-scev-cgp | bool | true [MEDIUM confidence] | Master enable for SCEV-based CodeGenPrepare transforms. Default inferred from the fact that nv-disable-scev-cgp exists as an override, implying this defaults to enabled. |
do-scev-cgp-aggresively [sic] | bool | false [MEDIUM confidence] | Enable aggressive SCEV-CGP mode with expanded search. Default inferred from naming convention (aggressive modes typically off by default). |
do-function-scev-cgp | bool | false [MEDIUM confidence] | Enable function-level (cross-loop) SCEV-CGP. Default inferred from naming convention. |
nv-disable-scev-cgp | bool | true | Master disable switch in NVPTX backend (overrides do-scev-cgp) |
scev-cgp-control | int | unknown | Limit the total number of SCEV-CGP transformations per function |
scev-cgp-cross-block-limit | int | unknown | Max number of common base expressions from a single block |
scev-cgp-idom-level-limit | int | unknown | Max dominator tree depth for hoisting base computations |
scev-cgp-inst-limit | int | unknown | Max instructions analyzed per parameter expression |
scev-cgp-old-base | bool | unknown | Use old (legacy) base computation method instead of new |
scev-cgp-tid-max-value | int | arch-dependent | Maximum value of thread ID for address range analysis |
scev-cgp-check-latency | int | unknown | Latency threshold for address computation profitability |
scev-cgp-norm | int | unknown | Normalization control for SCEV expression canonicalization |
print-after-scev-cgp | bool | false | Dump function IR after SCEV-CGP completes |
dump-scev-cgp | bool | false | Debug dump during SCEV-CGP execution |
Additional ctor_263_0 Knobs (BASR/CBE Related)
The same constructor also registers these knobs, documented in their respective pages:
| Knob | See |
|---|---|
do-base-address-strength-reduce | Base Address Strength Reduction |
do-base-address-strength-reduce-chain | Base Address Strength Reduction |
base-address-strength-reduce-iv-limit | Base Address Strength Reduction |
base-address-strength-reduce-max-iv | Base Address Strength Reduction |
topo-sort-begin | Topological sort starting point for address expression graph |
ignore-bad-base | Bypass validity checks on base pointer classification |
ignore-32-bit-overflow | Skip 32-bit overflow checks in address arithmetic |
ignore-signed-32-bit-overflow | Skip signed 32-bit overflow checks |
Interaction with LSR
CodeGenPrepare/SCEV-CGP and Loop Strength Reduction both optimize address expressions, but at different pipeline stages and granularities.
| Aspect | LSR | CodeGenPrepare | SCEV-CGP |
|---|---|---|---|
| Pipeline position | Late IR optimization (loop passes) | Pre-ISel (after all IR opts) | Pre-ISel (NVIDIA custom position) |
| Scope | Per-loop IV rewriting | Per-instruction address sinking | Per-function address expression rewriting |
| SCEV usage | Full: formula generation, stride factoring, chain construction | None (syntactic pattern matching) | Full: base decomposition, range analysis |
| Register pressure | Explicit RP tracking with occupancy ceiling | Implicit (sinking reduces live ranges) | Implicit via scev-cgp-cross-block-limit |
| Address space | Full awareness (shared memory protection, 64-bit IV gating) | No special GPU handling | Thread ID aware (scev-cgp-tid-max-value) |
| Default status | Enabled (with GPU-custom formula solver) | Enabled (standard upstream) | Disabled (nv-disable-scev-cgp = true) |
The key insight is the pipeline ordering: LSR runs first during the optimization phase, rewriting IVs across the loop. CodeGenPrepare runs later, sinking the results into individual use sites. If SCEV-CGP were also enabled, it would run between these two, potentially undoing LSR's IV choices to create "better" address modes -- which may conflict with LSR's register-pressure-informed formula selection.
NVIDIA's solution is pragmatic: keep SCEV-CGP off, let LSR handle SCEV-level optimization, let BASR/CBE handle GPU-specific base sharing, and let upstream CodeGenPrepare handle the final address sinking.
Differences from Upstream LLVM
| Area | Upstream LLVM | cicc v13.0 |
|---|---|---|
| CodeGenPrepare pass | Standard, used as-is | Retained unchanged from LLVM 20.0.0 |
| SCEV-CGP | Does not exist | NVIDIA proprietary, disabled by default |
| Address sinking | Always uses TTI::getAddrModeType | Same, but NVPTX TTI returns simple modes (base+offset only) |
| Block splitting | Hot/cold based on PGO | Same, but must preserve reducibility for StructurizeCFG |
| BASR/CBE | Do not exist | NVIDIA proprietary alternatives to SCEV-CGP for GPU |
| Knob count | ~20 cl::opt for CGP | 20 upstream CGP + 14 SCEV-CGP + 8 BASR = 42 total |
Function Map
CodeGenPrepare (v12.x Addresses)
| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_1D73760 | 65 KB | optimizeMemoryInst -- address sinking, creates "sunkaddr" |
| -- | sub_1D706F0 | 68 KB | PHI optimization, creates "sunk_phi" |
| -- | sub_1D7AA30 | 74 KB | Block splitting, creates ".unlikely", ".cond.split" |
| -- | sub_1D779D0 | 71 KB | IR transform (DAG combine-level, possibly optimizeInst) |
| -- | sub_1D765D0 | 34 KB | Select lowering ("cond.false", "cond.end") |
| -- | sub_1D7F9D0 | 31 KB | Deque-based worklist processor |
CodeGenPrepare (v13.0 Addresses)
| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_2D75700 | 72 KB | Address sinking with "sunk_phi", ValueToSunkAddr DenseMap |
| -- | sub_2D784F0 | 64 KB | Address mode lowering orchestrator, calls sub_2D75700 |
| -- | sub_2D80050 | 54 KB | Main CodeGenPrepare transform, calls TTI and address mode logic |
| -- | sub_2D82850 | 62 KB | Late lowering/expansion (type widening, custom lowering) |
| -- | sub_2D88660 | 70 KB | Block splitting with branch weights ("hot", "unlikely", "unknown") |
| -- | sub_2D749D0 | -- | Address mode cache lookup |
| -- | sub_2D67BB0 | -- | Address mode legality test |
| -- | sub_2D6E640 | -- | Address mode cache insert |
| -- | sub_2D68450 | -- | Address mode materialization |
| -- | sub_2D6DEE0 | -- | Address mode matching |
| -- | sub_2D69E90 | -- | Cleanup/init |
Helper Range (0x1D60000--0x1D6FFFF)
This 64 KB sub-range contains CodeGenPrepare helper functions. The sweep identifies it as "CodeGenPrepare helpers" but no individual functions are called out with string evidence. These likely include address mode computation utilities, operand analysis, and GEP canonicalization.
SCEV-CGP Option Registration
| Function | Address | Size | Role |
|---|---|---|---|
| -- | ctor_263_0 (0x4F36F0) | 9.9 KB | Registers 44 cl::opt strings for SCEV-CGP + BASR |
| -- | ctor_288_0 (0x4FA950) | 8.6 KB | Registers 44 cl::opt strings for upstream CodeGenPrepare |
| -- | ctor_591 (0x57C1A0) | 9.3 KB | Additional CodeGenPrepare sink/split options |
| -- | ctor_544_0 (0x56C190) | 13.1 KB | CodeGenPrepare options (v13.0 duplicate registration) |
| -- | ctor_609_0 (0x585D30) | 37.3 KB | NVPTX backend mega-block, includes nv-disable-scev-cgp |
Cross-References
- Loop Strength Reduction -- SCEV-based IV rewriting, runs before CGP
- Base Address Strength Reduction -- NVIDIA's preferred GPU address optimization
- Common Base Elimination -- inter-block complement to BASR
- SCEV Analysis -- the scalar evolution infrastructure both LSR and SCEV-CGP depend on
- Known Bits -- thread ID range analysis that scev-cgp-tid-max-value feeds into
- Code Generation Overview -- pipeline position context
- NVPTX Target & TTI -- the nv-disable-scev-cgp registration in ctor_609_0
- Optimizer Pipeline -- do-scev-cgp in the NVVMPassOptions system
ScalarEvolution Overview & Construction
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/Analysis/ScalarEvolution.cpp, llvm/include/llvm/Analysis/ScalarEvolution.h (LLVM 20.0.0)
LLVM version note: CICC v13.0 is based on LLVM 20.0.0 ScalarEvolution.cpp. Evidence: the non-recursive worklist-based createSCEV driver (sub_DD8130) matches the LLVM 16+ refactoring that replaced the recursive createNodeForValue, and the getSmallConstantTripCount / getSmallConstantMaxTripCount API matches LLVM 17+ signatures. NVIDIA's three extension categories -- simple_mode complexity control, GPU-specific SCEV sources (thread index bounds), and CUDA loop idiom recognition (warp-stride, grid-stride) -- are layered on top of the stock LLVM 20 analysis with no modifications to the core SCEV algebra.
ScalarEvolution (SCEV) is the foundational analysis that models how values change across loop iterations. Every loop optimization in cicc -- vectorization, unrolling, strength reduction, interchange, distribution -- depends on SCEV to answer three questions: "what is the trip count?", "what is the stride?", and "what is the value range?" NVIDIA's cicc v13.0 ships an LLVM 20.0.0-based ScalarEvolution with three categories of proprietary extensions: a complexity control system (simple_mode) that prevents SCEV from spending unbounded time on GPU kernels with hundreds of induction variables, GPU-specific SCEV sources that inject thread index bounds and launch configuration constraints into the analysis, and recognition of CUDA-specific loop idioms (warp-stride and grid-stride patterns) that have no analog in CPU code. This page documents SCEV expression construction -- the core getSCEV / createSCEV / createNodeForInstruction call chain. Range computation and trip count analysis are covered in SCEV Range Analysis & Trip Counts; cache invalidation and delinearization in SCEV Invalidation & Delinearization.
Key Facts
| Property | Value |
|---|---|
| LLVM base version | 20.0.0 ScalarEvolution.cpp |
| Top-level entry | sub_DD8400 (getSCEV) |
| Core builder | sub_DD65B0 (createNodeForInstruction, 1103 lines) |
| Worklist driver | sub_DD8130 (non-recursive worklist createSCEV, 154 lines) |
| Instruction decomposer | sub_D94080 (452 lines) |
| PHI handler | sub_DD92B0 (createNodeForPHI) |
| GEP handler | sub_DD3A70 (getGEPExpr) |
| Cache lookup | sub_D98300 (lookupSCEV) |
| Cache store | sub_DB77A0 (insertSCEV) |
| NVIDIA complexity scorer | sub_DB3670 (expression size estimator) |
| SE object size | >1572 bytes (fields documented through offset +1572) |
| Calling conventions bypassing budget | CC 42, CC 43 (PTX kernel entry points) |
ScalarEvolution Object Layout
The ScalarEvolution context (SE) is a large heap-allocated object. The fields relevant to SCEV construction:
| Offset | Type | Field | Notes |
|---|---|---|---|
| +0 | Module* | LLVM module / context pointer | |
| +8 | TargetLibraryInfo* | TLI | Used for intrinsic recognition |
| +32 | DominatorTree* | Dominator tree | Required for PHI analysis |
| +40 | LoopInfo* | Loop analysis | AddRec construction needs this |
| +48 | void* | Analysis pointer | Used by complexity scorer |
| +320 | SmallDenseSet | PHI visited set | Prevents infinite recursion |
| +976 | void* | Unsigned range cache table | 40-byte entries, open addressing |
| +992 | uint32_t | Unsigned range cache capacity | Power-of-two |
| +1008 | void* | Signed range cache table | Same structure |
| +1024 | uint32_t | Signed range cache capacity | |
| +1560 | uint8_t | simple_mode flag | 0 = normal, 1 = NVIDIA complexity control |
| +1564 | uint32_t | failure_count | Simple mode: bailed instructions |
| +1568 | uint32_t | recursion_count | Normal mode: depth counter |
| +1572 | uint8_t | Complexity config bits | Tuning for the scorer |
The SE object also contains the ValueExprMap (primary SCEV cache mapping Value* to SCEV*), the backedge-taken count cache at offset +648/+656/+672, and the per-exit BTC cache at +1168/+1184. These are documented in the range/BTC page.
The getSCEV Entry Point
sub_DD8400 (getSCEV) is the single entry point for obtaining a SCEV expression for any LLVM Value*. Every consumer -- LoopVectorize, LoopUnroll, LSR, IndVarSimplify, LoopInterchange -- calls this function. The algorithm:
SCEV* getSCEV(SE *se, Value *V) {
// 1. Memo-table check
SCEV *cached = lookupSCEV(se, V); // sub_D98300
if (cached) return cached;
// 2. Dispatch based on mode
if (se->simple_mode == 0) {
// NORMAL PATH
CallingConv cc = V->getParent()->getParent()->getCallingConv();
if (cc == 42 || cc == 43) {
// PTX kernel entry: bypass budget entirely
return createSCEV(se, V);
}
se->recursion_count++;
if (se->recursion_count <= MaxRecursionDepth) {
return createSCEV(se, V);
}
return getUnknown(se, V); // budget exceeded
}
// NVIDIA SIMPLE MODE (complexity control)
if (se->failure_count > MaxExprFailures) {
SCEV *u = getUnknown(se, V);
insertSCEV(se, V, u); // cache the Unknown
return u;
}
uint64_t complexity = computeExprSize(se, V); // sub_DB3670
if (complexity > MaxExprSize) {
se->failure_count++;
SCEV *u = getUnknown(se, V);
insertSCEV(se, V, u);
return u;
}
// Expression is small enough: run normal path with mode toggled off
se->simple_mode = 0;
se->recursion_count = 0;
SCEV *result = createSCEV(se, V);
se->simple_mode = 1;
return result;
}
The PTX kernel bypass (calling conventions 42 and 43) is significant: kernel functions always receive full SCEV analysis regardless of budget. NVIDIA considers kernels important enough that truncating their analysis would lose more performance than the extra compile time costs. Device helper functions, by contrast, are subject to the budget.
NVIDIA Simple Mode (Complexity Control)
Upstream LLVM uses a single recursion counter to bound getSCEV. NVIDIA replaces this with a two-stage gating system called simple_mode (enabled by the scalar-evolution-complexity-control flag, default true). The system is stored entirely in four bytes of the SE object:
| Offset | Type | Field | Role |
|---|---|---|---|
| +1560 | uint8 | simple_mode | 0 = normal (upstream-style), 1 = NVIDIA complexity control |
| +1564 | uint32 | failure_count | Running count of instructions classified as SCEVUnknown by the size gate |
| +1568 | uint32 | recursion_count | Upstream-style depth counter, only active when simple_mode == 0 |
| +1572 | uint8 | complexity_config | Tuning bits read by the expression size scorer |
When scalar-evolution-complexity-control is true (the default), the SE constructor initializes simple_mode to 1. The gating operates in three stages:
Stage 1 -- Failure gate. Before scoring anything, getSCEV checks failure_count > scalar-evolution-max-expr-failures (global qword_4F88348, default 100). If the function has already exceeded the failure budget, the instruction is classified as SCEVUnknown, the result is cached via sub_DB77A0 (insertSCEV), and control returns immediately. This prevents a single pathological function from burning O(N^2) time trying to score thousands of instructions that will all fail.
Stage 2 -- Expression size scoring. The scorer sub_DB3670 (expressionComplexity, 35 KB in the binary, self-recursive) estimates how large the resulting SCEV expression tree would be. It walks the instruction's def-use chain bottom-up, counting nodes and weighting by expression kind:
uint64_t expressionComplexity(SE *se, Value *V) {
// sub_DB3670 -- self-recursive, calls sub_CF4090 for SCEV node size
if (V is Constant) return 1;
if (V is Argument) return 1;
if (!isSCEVable(V)) return 0; // non-integer/pointer: free
// Look up V in the SCEV cache; if already a SCEV node,
// delegate to the node-size estimator
SCEV *cached = lookupSCEV(se, V);
if (cached)
return sub_CF4090(cached); // count nodes in SCEV tree
// Not yet in cache: estimate from instruction structure
Instruction *I = dyn_cast<Instruction>(V);
if (!I) return 1;
uint64_t score = 1; // 1 for this node
uint32_t depth = 0;
Loop *L = LoopInfo->getLoopFor(I);
if (L) {
depth = L->getLoopDepth();
score += depth; // loop nesting multiplier
}
// Walk operands, accumulating recursively
for (unsigned i = 0; i < I->getNumOperands(); i++) {
score += expressionComplexity(se, I->getOperand(i));
}
// Apply configuration scaling from SE+1572
if (se->complexity_config & 0x1)
score = score * 3 / 2; // 50% penalty for aggressive mode
if (se->complexity_config & 0x2)
score += depth * 2; // extra loop nesting weight
return score;
}
The helper sub_CF4090 counts nodes in an existing SCEV expression tree: it returns 1 for SCEVConstant and SCEVUnknown, recurses into operands for SCEVAddExpr/SCEVMulExpr/SCEVAddRecExpr (summing child sizes + 1), and handles casts (Truncate/ZeroExtend/SignExtend) as 1 + child size. The node-size estimate is precise because SCEV expressions are uniqued -- the same sub-expression pointer is never double-counted within a single scoring call.
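The uniqued-pointer property can be sketched as a memoized node count. The node layout and names below are illustrative (sub_CF4090's real input is the SCEV class hierarchy); the point is that a visited set over uniqued pointers guarantees each shared sub-expression contributes once per scoring call:

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Illustrative stand-in for a uniqued SCEV expression node:
// leaves (Constant/Unknown) have no operands.
struct Node {
    std::vector<const Node*> ops;
};

// Counts each pointer-distinct node exactly once, mirroring the
// "never double-counted" behavior described above.
static uint64_t countNodes(const Node *n, std::unordered_set<const Node*> &seen) {
    if (!seen.insert(n).second)
        return 0;  // shared sub-expression: already counted
    uint64_t total = 1;
    for (const Node *op : n->ops)
        total += countNodes(op, seen);
    return total;
}
```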
If the total score exceeds scalar-evolution-max-expr-size (global dword_4F88428, default 384), the instruction is classified as SCEVUnknown and failure_count is incremented. The SCEVUnknown result is cached immediately so that later queries from different loop passes return instantly rather than re-running the scorer.
Stage 3 -- Mode toggle. When an instruction passes the size check (score <= 384), simple_mode is temporarily set to 0 and the recursion counter reset to 0 before calling createSCEV:
se->simple_mode = 0; // disable complexity gating
se->recursion_count = 0; // reset upstream counter for this sub-tree
SCEV *result = createSCEV(se, V);
se->simple_mode = 1; // restore
This prevents double budget-checking: the upstream recursion counter inside createSCEV starts from 0 for the sub-expression tree rather than inheriting a parent depth. Each createSCEV call thus gets a fresh budget of scalar-evolution-max-recursion-depth (default 100) for its own sub-tree.
Practical effect: GPU kernels with hundreds of address computations (common in tiled matrix multiply, convolution stencils) hit the complexity wall early for outer variables, but the important inner loop induction variables -- which have simple affine structure -- always get analyzed. The two-stage gate (score first, then depth-limit) avoids the upstream problem where a single deep operand chain exhausts the entire recursion budget for the function.
Why not just raise the upstream recursion limit? The upstream counter is a global depth counter -- raising it means every instruction in the function gets more budget, including ones that will never produce useful SCEV expressions. The NVIDIA approach is per-instruction: each instruction is independently scored, and only instructions with manageable complexity get the full treatment. This keeps total SCEV compile time bounded at O(N * max_expr_size) rather than O(N * max_recursion_depth^2).
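A minimal model of the two-stage gate, using the recovered defaults (failure budget 100, size limit 384). The struct and method names are illustrative stand-ins for the logic described above; the score argument models the result of sub_DB3670, and a false return models classification as SCEVUnknown:

```cpp
#include <cstdint>

// Sketch of the NVIDIA simple_mode gate, per the recovered defaults.
struct SimpleModeGate {
    uint32_t failure_count = 0;
    static constexpr uint32_t MaxExprFailures = 100;  // qword_4F88348 default
    static constexpr uint64_t MaxExprSize     = 384;  // dword_4F88428 default

    // Returns true if the instruction gets full createSCEV analysis,
    // false if it is classified (and cached) as SCEVUnknown.
    bool admit(uint64_t score) {
        if (failure_count > MaxExprFailures)
            return false;          // stage 1: failure gate, no scoring at all
        if (score > MaxExprSize) { // stage 2: expression size gate
            failure_count++;
            return false;
        }
        return true;               // stage 3: toggle mode off, run createSCEV
    }
};
```

Once the failure budget is burned, every later query short-circuits at stage 1, which is what keeps pathological functions from paying the scorer repeatedly.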
Worklist-Driven createSCEV
sub_DD8130 implements a non-recursive worklist to avoid deep stack frames. The iterative design matches the upstream LLVM 16+ createSCEV refactoring noted above; cicc relies on it to handle GPU kernels that can have extremely deep expression trees (deeply nested address computations involving multiple grid dimensions).
The worklist stores Value* pointers with tag bits in the low 3 bits:
| Bit | Meaning |
|---|---|
| Bit 2 (0x4) | First visit: needs full createNodeForInstruction |
| Bits 0-1 clear | Post-processing: operands have been evaluated, collect results |
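The tagging scheme relies on Value* being at least 8-byte aligned, which leaves the low 3 bits of the pointer free for state. A sketch of the encoding (helper names are illustrative, not recovered symbols):

```cpp
#include <cstdint>

// Low-bit pointer tagging as described above: bit 2 marks a first visit.
constexpr uintptr_t TAG_FIRST_VISIT = 0x4;  // needs createNodeForInstruction
constexpr uintptr_t TAG_MASK        = 0x7;  // low 3 bits reserved for tags

inline uintptr_t tag(void *v, uintptr_t t) {
    return reinterpret_cast<uintptr_t>(v) | t;
}
inline void *untag(uintptr_t entry) {
    return reinterpret_cast<void *>(entry & ~TAG_MASK);
}
inline bool isFirstVisit(uintptr_t entry) {
    return (entry & TAG_FIRST_VISIT) != 0;
}
```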
Algorithm:
- Push initial value with bit 2 set.
- Pop top entry.
- If bit 2 set: call sub_DD80F0 (createSCEV wrapper), which checks isSCEVable(V->getType()) via sub_D97040, then delegates to sub_DD65B0 (createNodeForInstruction).
- If the result is immediately available: cache it via sub_DB77A0 and continue.
- If operands are needed: push operands (without bit 2) for deferred processing.
- Repeat until worklist empty.
- Return lookupSCEV(initial_value).
The isSCEVable check (sub_D97040) accepts integer types and pointer types. Floating-point values and aggregate types produce SCEVUnknown.
Instruction Decomposer
Before the main opcode dispatch, sub_D94080 (decomposeIRInstruction) analyzes each instruction and fills a 48-byte decomposition struct:
struct SCEVDecomp { // 48 bytes
uint32_t kind; // +0 decomposition opcode
void *operandL; // +8 left operand (Value*)
void *operandR; // +16 right operand (Value*)
bool hasNUW; // +24 no-unsigned-wrap flag
bool hasNSW; // +25 no-signed-wrap flag
void *extra; // +32 third operand / loop variable
bool valid; // +40 decomposition succeeded
};
The decomposer extracts NUW/NSW flags from inst->byte[1] (bit 2 = NUW, bit 1 = NSW), and these flags are only captured for opcodes matching the bitmask 0x40540000000000 -- covering add, sub, mul, shl, and related flag-bearing arithmetic. The kind field values:
| Kind | Decimal | SCEV Construction |
|---|---|---|
| 0x0D | 13 | Add/Sub -- iterative addend collection |
| 0x0F | 15 | MulRec -- multiply-recurrence (loop-carried) |
| 0x11 | 17 | Multiply -- iterative multiplicand collection |
| 0x13 | 19 | UDiv |
| 0x16 | 22 | UMax select pattern |
| 0x19 | 25 | Shl -- converted to multiply by 2^N |
| 0x1A | 26 | Generic shift/bitop fallback |
| 0x1B | 27 | LShr -- complex truncate+extend chain |
| 0x1C | 28 | AShr -- sign-extend analysis |
| 0x1D | 29 | ICmp / comparison |
| 0x1E | 30 | And (bitwise) -- pointer truncation patterns |
The decomposer includes a GPU-specific PHI detection path (kind 64): when a PHI node's incoming value chain traces through a comparison instruction (byte == 0x55) whose operand is a function-entry value (byte == 0) that resolves to one of the recognized NVIDIA builtins (intrinsic IDs 312, 333, 339, 360, 369, 372), the decomposer creates a specialized recurrence form. This is how threadIdx.x-bounded loop variables become proper AddRec expressions.
createNodeForInstruction: The Core Builder
sub_DD65B0 (1103 lines) is the largest function in the SCEV subsystem. It operates in three phases:
Phase 1: Fast Path (lines 300-312)
Checks the instruction's type byte. Constants (byte 17) go directly to getConstant. Non-instruction values go to getUnknown. Real instructions check loop depth via LoopInfo -- if the instruction's loop nesting exceeds the maximum tracked depth, it bails to getUnknown with a simplified operand from sub_ACADE0.
Phase 2: Decomposition-Based Dispatch (lines 336-933)
After calling the instruction decomposer, dispatches on decomp.kind:
Add/Sub (kind 13): Iteratively collects addends into a SmallVector. For each operand with a non-zero extra field (the loop iteration variable), checks the SCEV cache, and if the operand has a known loop context (from sub_DD86E0 / getLoopForExpr), builds an SCEVAddRecExpr. Otherwise recursively calls getSCEV and optionally negates (for subtraction via getNegativeSCEV). Final result: getAddExpr(collected_operands).
Multiply (kind 17): Same iterative structure as Add but builds getMulExpr. For loop-carried chains, constructs getAddRecExpr(start, step, flags).
Shl (kind 25): Converts shift-left to multiplication by a power of two. When the shift amount is a constant: extracts the shift amount, verifies it fits in the type width (sub_986EE0), then builds getMulExpr(getSCEV(base), getConstant(1 << shamt), flags). Handles nested shl-of-shl by re-decomposing.
LShr (kind 27): When shifting right by a constant amount, builds a chain of getMulExpr + getTruncateExpr + getZeroExtendExpr to represent the bit extraction pattern. Falls back for non-constant shifts.
AShr (kind 28): Complex bit-extraction logic. For constant shifts, analyzes known bits to determine whether the shift extracts only zeros from the sign position. If provable, builds getSignExtendExpr(getTruncateExpr(getSCEV(base), intermediate_type), original_type). For non-constant shifts, tries SMin/SMax pattern matching.
And (kind 30): Handles pointer truncation patterns. When the mask equals (1 << ptr_bits) - 1 (a ptrtoint-then-mask pattern), builds getPtrToIntExpr + getSignExtendExpr. Otherwise bails.
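The kind-25 rewrite is the simplest to verify concretely: shifting left by a constant is multiplication by a power of two, guarded by the width check. A hedged sketch (shlAsMul is an illustrative name, not a recovered symbol; the width check stands in for sub_986EE0):

```cpp
#include <cassert>
#include <cstdint>

// x << shamt  ==  x * (1 << shamt), valid only when shamt < bit width.
// Mirrors getMulExpr(getSCEV(base), getConstant(1 << shamt)) from the text.
bool shlAsMul(uint32_t base, uint32_t shamt, uint32_t &out) {
    if (shamt >= 32) return false;         // would bail to SCEVUnknown
    out = base * (uint32_t(1) << shamt);   // multiply form of the shift
    return true;
}
```

Representing the shift as a multiply matters because SCEVMulExpr composes with AddRec folding and constant canonicalization, while a raw shift would be opaque to them.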
Phase 3: Opcode-Based Dispatch (lines 936-1101)
Handles instructions not captured by the decomposer. The normalized opcode maps raw instruction bytes to semantic categories:
Call/Intrinsic (cases 5, 56): First tries the intrinsic SCEV lookup table (sub_B494D0). For known intrinsics, dispatches on intrinsic ID:
| ID | Hex | SCEV Construction | Likely Intrinsic |
|---|---|---|---|
| 1 | 0x001 | getNotSCEV(op0) | bitwise NOT |
| 7 | 0x007 | getSCEV(op0) (identity) | llvm.assume |
| 292 | 0x124 | getSCEV(op0) (identity) | PTX intrinsic passthrough |
| 329 | 0x149 | getUMinExpr(op0, op1) | llvm.umin |
| 330 | 0x14A | getSMinExpr(op0, op1) | llvm.smin |
| 344 | 0x158 | getSCEV(op0) (identity) | passthrough |
| 359 | 0x167 | getSMinExpr + getUDivExpr + getAddExpr | complex min/div |
| 365 | 0x16D | getSMaxExpr(op0, op1) | llvm.smax |
| 366 | 0x16E | getSMinExpr(op0, op1) | llvm.smin variant |
| 371 | 0x173 | getAddRecExpr(op0, getUDivExpr(op0, op1)) | recurrence with division |
| 493 | 0x1ED | getConstant(inst->qword[1]) | constant from intrinsic metadata |
PHI Node (case 34): Dispatches to sub_DD92B0 (createNodeForPHI). Walks PHI incoming values, checks for loop recurrence. If the PHI forms a recurrence: builds {start, +, step} as an SCEVAddRecExpr. Otherwise returns SCEVUnknown.
GEP (case 47): Calls sub_DD3A70 (getGEPExpr). Computes the SCEV of the base pointer, then adds the SCEV of each index scaled by the element size. If the result is SCEVUnknown, bails.
Casts (cases 38-40): Trunc produces getTruncateExpr. SExt produces getSignExtendExpr. ZExt has a special optimization: if the source decomposes as a multiply-recurrence (kind 15), it builds separate zero-extensions of start and step, then constructs getAddRecExpr(zext(start), zext(step), NUW) -- preserving the recurrence structure across the extension.
BitCast/AddrSpaceCast (case 49): If both source and target types are SCEV-able, returns getSCEV(source) (transparent). Otherwise getUnknown.
Select (cases 20, 23): If condition and true-value are loop-invariant (sub_DBED40), builds getUDivExpr (case 20) or getUMaxExpr (case 23) of the branches.
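The ZExt special case above is sound because of the NUW flag: for a no-unsigned-wrap recurrence {start, +, step}, the value start + i*step never wraps in the narrow type, so zero-extending the whole recurrence equals recombining zero-extensions of start and step. A minimal numeric check in 8-to-16-bit widening (function names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// zext({start, +, step}) evaluated at iteration i, narrow arithmetic first.
uint16_t zextOfRec(uint8_t start, uint8_t step, unsigned i) {
    return uint16_t(uint8_t(start + i * step));
}
// {zext(start), +, zext(step)} evaluated at iteration i, wide arithmetic.
uint16_t recOfZext(uint8_t start, uint8_t step, unsigned i) {
    return uint16_t(start) + uint16_t(i) * uint16_t(step);
}
```

With start=3, step=5, i=10 the narrow value 53 stays below 256 (the NUW precondition), and both forms agree; without NUW, narrow wraparound would break the equality.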
GPU-Specific SCEV Sources
Thread and Block Index Builtins
When the instruction decomposer encounters a PHI whose incoming value chain traces to one of NVIDIA's special register intrinsics, it recognizes it as a bounded induction variable. The recognized intrinsic IDs and their SCEV significance:
| Intrinsic ID | CUDA Variable | SCEV Range Bound |
|---|---|---|
| 312 | blockDim.x / gridDim.x | Dimension query -- provides trip count upper bound |
| 333 | threadIdx.x | Range: [0, blockDim.x) |
| 339 | threadIdx.y / blockIdx.x | Range: [0, blockDim.y) or [0, gridDim.x) |
| 360 | threadIdx.z / blockIdx.y | Range: [0, blockDim.z) or [0, gridDim.y) |
| 369 | blockIdx.z | Range: [0, gridDim.z) |
| 372 | warpSize / laneid | Range: [0, 32) (constant on all architectures) |
These ranges are injected during SCEV construction, not during range analysis. When a PHI node tests a value against threadIdx.x (for example, a loop for (int i = threadIdx.x; i < N; i += blockDim.x)), the decomposer produces an SCEVAddRecExpr whose start value carries the constraint [0, blockDim.x). This propagates through all downstream SCEV consumers.
The CUDA variable to LLVM intrinsic mapping is:
| CUDA | LLVM Intrinsic | PTX Register |
|---|---|---|
| threadIdx.x | @llvm.nvvm.read.ptx.sreg.tid.x | %tid.x |
| threadIdx.y | @llvm.nvvm.read.ptx.sreg.tid.y | %tid.y |
| threadIdx.z | @llvm.nvvm.read.ptx.sreg.tid.z | %tid.z |
| blockDim.x | @llvm.nvvm.read.ptx.sreg.ntid.x | %ntid.x |
| blockIdx.x | @llvm.nvvm.read.ptx.sreg.ctaid.x | %ctaid.x |
| gridDim.x | @llvm.nvvm.read.ptx.sreg.nctaid.x | %nctaid.x |
PTX Kernel Calling Convention Bypass
Functions with calling convention 42 or 43 (PTX __global__ kernels) bypass the SCEV recursion budget entirely. The rationale: kernels are the units of work the programmer explicitly marked for GPU execution. Spending extra compile time to fully analyze their loop structure always pays off because:
- Kernels are where vectorization decisions have the highest payoff.
- GPU hardware constraints (occupancy, shared memory) demand precise trip count knowledge.
- Kernel functions are few per compilation unit, so the budget bypass does not cause compile-time explosion.
Device functions (__device__, conventions other than 42/43) remain subject to the standard budget.
Warp-Stride and Grid-Stride Loop Patterns
Two CUDA-specific loop idioms produce distinctive SCEV expressions. Neither has an analog in CPU code, and cicc's SCEV subsystem recognizes both at construction time -- not as a post-hoc pattern match.
Warp-Stride Loop
for (int i = threadIdx.x; i < N; i += warpSize) { ... }
The PHI decomposer (sub_D94080) recognizes the increment value as the constant 32 (warpSize is a compile-time constant on all NVIDIA architectures). The resulting SCEV:
{threadIdx.x, +, 32}<nuw><loop>
- Start: SCEVUnknown(@llvm.nvvm.read.ptx.sreg.tid.x), range [0, blockDim.x) (injected from the builtin table, intrinsic ID 333).
- Step: SCEVConstant(32).
- Flags: NUW (no-unsigned-wrap) is set because the start is non-negative and the step is positive. The PHI decomposer sets this flag when the incoming value (intrinsic ID 372 = warpSize) resolves to a constant and the start range has a non-negative lower bound.
- Trip count: the backedge-taken count (sub_DB9E00) computes BTC = udiv(N - threadIdx.x + 31, 32) = udiv(sext(N) - sext(start) + step - 1, step). This is the standard SCEV computeExitCountFromICmpUN path for i < N with stride 32.
The NUW flag is critical: it allows the loop vectorizer to prove that the induction variable never wraps, enabling vectorization without a runtime overflow check. Without the warp-stride recognition, the vectorizer would see SCEVUnknown(threadIdx.x) as an opaque value and conservatively assume wrapping is possible.
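The udiv expression can be sanity-checked against a direct simulation of the loop. This sketch uses illustrative helper names; note that in upstream LLVM terminology the backedge-taken count is one less than the iteration count, while the formula as quoted counts iterations:

```cpp
#include <cassert>
#include <cstdint>

// The quoted count formula: udiv(N - start + step - 1, step).
uint64_t countFormula(uint64_t N, uint64_t start, uint64_t step) {
    return start < N ? (N - start + step - 1) / step : 0;
}

// Ground truth: simulate `for (i = start; i < N; i += step)`.
uint64_t countSim(uint64_t N, uint64_t start, uint64_t step) {
    uint64_t c = 0;
    for (uint64_t i = start; i < N; i += step) ++c;
    return c;
}
```

For a warp-stride loop with N=100, start=threadIdx.x=5, step=32, the formula gives udiv(126, 32) = 3, matching the simulated iterations i = 5, 37, 69.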
Grid-Stride Loop
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { ... }
The instruction decomposer traces through the PHI's increment chain. The addition blockDim.x * gridDim.x is recognized as two calls to special register intrinsics (IDs 312 for blockDim.x and 312 again for gridDim.x) combined in a multiply. The resulting SCEV:
{blockIdx.x * blockDim.x + threadIdx.x, +, blockDim.x * gridDim.x}<loop>
Decomposition detail:
- Start: SCEVAddExpr(SCEVMulExpr(SCEVUnknown(blockIdx.x), SCEVUnknown(blockDim.x)), SCEVUnknown(threadIdx.x)).
  - blockIdx.x (ID 339): range [0, gridDim.x).
  - blockDim.x (ID 312): range [1, 1024] (hardware limit).
  - threadIdx.x (ID 333): range [0, blockDim.x).
  - The combined start range is [0, gridDim.x * blockDim.x) = [0, total_threads).
- Step: SCEVMulExpr(SCEVUnknown(blockDim.x), SCEVUnknown(gridDim.x)) -- this is the total grid size. Both operands are SCEVUnknown values with ranges from the builtin table.
- Trip count: computeBackedgeTakenCount (sub_DB9E00) produces BTC = udiv(N - start + step - 1, step), where start and step are symbolic. The trip count itself is SCEVUnknown (the exact value depends on runtime launch configuration), but the maximum trip count can be bounded using the range constraints.
Delinearization of Grid-Stride Patterns
The delinearization system (sub_DE9D10, documented in SCEV Invalidation & Delinearization) specifically recognizes the grid-stride pattern. In the ZeroExtend/SignExtend handlers (cases 3 and 4 of the delinearizer), when an AddRecExpr's step matches the delinearization context's step_recurrence field (ctx+0x68):
- The delinearizer checks whether step == blockDim.x * gridDim.x by comparing the step SCEV pointer against ctx[+0x68].
- If matched and the AddRec has exactly 2 operands (start + step), the delinearizer treats this as a dimension boundary -- the step represents the stride of the outer dimension in a multi-dimensional array access.
- The dimension size is extracted and added to the term collector at ctx[+0x58]. The element count is obtained via sub_D33D80 (getElementSize) and sub_DA4270 (getConstant).
- The delinearizer reconstructs the multi-dimensional subscript by applying getZeroExtendExpr (or getSignExtendExpr) to the start and step separately, preserving the recurrence structure across the extension.
This is how cicc recovers the original multi-dimensional array indices from grid-stride loops over flattened arrays -- essential for dependence analysis in LoopVectorize and LoopInterchange.
Block-Stride Loop (Variant)
A less common but recognized pattern:
for (int i = threadIdx.x; i < N; i += blockDim.x) { ... }
Produces: {threadIdx.x, +, blockDim.x}<loop>. The step is SCEVUnknown(blockDim.x) with range [1, 1024]. The trip count is udiv(N - threadIdx.x + blockDim.x - 1, blockDim.x) -- symbolic but bounded. This pattern is common in reduction kernels and shared-memory tiling.
Aggressive Positive Stride Analysis
The NVIDIA-specific knob aggressive-positive-stride-analysis (see nvbug 3972412) enables additional reasoning about stride signs. When enabled, the SCEV range analysis assumes that strides derived from blockDim.x, gridDim.x, and warpSize are always positive (range [1, ...) rather than [0, ...)). This allows the loop vectorizer and LSR to prove monotonic increase of induction variables, eliminating runtime overflow checks. The knob is registered in ctor_131_0 (constructor at 0x4E1CD0 area) and can be disabled via -no-aggressive-positive-stride-analysis.
The special-reassociate-for-threadid knob (description: "Don't move back expressions with threadid") prevents SCEV-based reassociation from hoisting threadIdx.x expressions out of their canonical position. Without this guard, the reassociator might combine threadIdx.x + offset into a form that obscures the warp/grid-stride pattern for downstream consumers.
SCEV Expression Types and the FoldingSet
SCEV expressions are uniqued in a FoldingSet (LLVM's hash-based deduplication container). Each expression type is identified by a uint16 opcode at scev_expr+24:
| Opcode | Type | Operands | Notes |
|---|---|---|---|
| 0 | SCEVConstant | 1 (APInt) | Leaf: integer constant |
| 1 | SCEVUnknown | 1 (Value*) | Leaf: opaque value, possibly with range info |
| 2 | SCEVTruncateExpr | 1 + type | Truncation cast |
| 3 | SCEVZeroExtendExpr | 1 + type | Zero extension |
| 4 | SCEVSignExtendExpr | 1 + type | Sign extension |
| 5 | SCEVAddExpr | N-ary | Commutative sum |
| 6 | SCEVMulExpr | N-ary | Commutative product |
| 7 | SCEVUDivExpr | 2 | Unsigned division |
| 8 | SCEVAddRecExpr | 2+ (start, step, ...) | {start, +, step}<loop> recurrence |
| 9 | SCEVSMaxExpr | N-ary | Signed maximum |
| 10 | SCEVUMaxExpr | N-ary | Unsigned maximum |
| 11 | SCEVSMinExpr | N-ary | Signed minimum |
| 12 | SCEVUMinExpr | N-ary | Unsigned minimum |
| 13 | (variant min/max) | N-ary | Additional min/max form |
| 14 | SCEVCouldNotCompute | 0 | Sentinel: analysis failed |
| 15 | SCEVSequentialUMinExpr | N-ary | Short-circuit unsigned min |
The expression node layout:
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Vtable / tag |
| +24 | 2 | Opcode (SCEV kind) |
| +28 | 2 | Flags: NUW=0x2, NSW=0x4 |
| +32 | 8 | Operand array pointer or first operand |
| +40 | varies | Operand count (for N-ary) or second operand |
Pointer comparisons suffice for SCEV equality because of the uniquing: two SCEV* values are equal if and only if they point to the same node.
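The reason pointer equality suffices is hash-consing: every expression is interned through a single table keyed on (opcode, operands), so two structurally equal expressions always resolve to the same node. A minimal sketch under stated assumptions (std::map stands in for LLVM's FoldingSet; Expr and Interner are illustrative names):

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>
#include <vector>

struct Expr { int opcode; std::vector<const Expr*> ops; };

class Interner {
    // Keyed on (opcode, operand pointers): structural identity of the node.
    std::map<std::pair<int, std::vector<const Expr*>>,
             std::unique_ptr<Expr>> table;
public:
    const Expr *get(int opcode, std::vector<const Expr*> ops) {
        auto key = std::make_pair(opcode, ops);
        auto &slot = table[key];
        if (!slot)  // first request for this shape: allocate the unique node
            slot = std::unique_ptr<Expr>(new Expr{opcode, std::move(ops)});
        return slot.get();
    }
};
```

Because operands are themselves interned pointers, structural equality of whole trees reduces to one pointer comparison at the root, which is what makes the 75%-load hash caches described below viable.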
SCEV Constructor Functions
Each expression type has a dedicated constructor that canonicalizes and deduplicates:
| Address | Function | Signature |
|---|---|---|
| sub_DC8BD0 | getAddExpr | (SmallVector &operands, flags, depth) |
| sub_DC7ED0 | getAddExpr | (SCEV *a, SCEV *b, flags, depth) |
| sub_DCA690 | getMulExpr | (SCEV *a, SCEV *b, flags, depth) |
| sub_DCC810 | getAddRecExpr | (SCEV *start, SCEV *step, flags, depth) |
| sub_DCB270 | getUDivExpr | (SCEV *lhs, SCEV *rhs) |
| sub_DCFA50 | getUMaxExpr | (SCEV *a, SCEV *b) |
| sub_DCEE80 | getSMinExpr | (SCEV *a, SCEV *b) |
| sub_DCE050 | getSMaxExpr | (SCEV *a, SCEV *b) |
| sub_DCDFA0 | getUMinExpr | (SCEV *a, SCEV *b) |
| sub_DC5200 | getTruncateExpr | (SCEV *op, Type *ty, depth) |
| sub_DC5000 | getZeroExtendExpr | (SCEV *op, Type *ty, depth) |
| sub_DC2B70 | getSignExtendExpr | (SCEV *op, Type *ty, depth) |
| sub_DD1D00 | getPtrToIntExpr | (SCEV *ptr) |
| sub_DA26C0 | getConstant | (APInt val) |
| sub_DA3860 | getUnknown | (Value *V) |
| sub_DCAF50 | getNegativeSCEV | (SCEV *expr, flags) |
| sub_DCE000 | getNotSCEV | (SCEV *expr, bool isNSW) -- -1 - x |
The N-ary constructors (getAddExpr, getMulExpr, min/max) canonicalize operand order and fold constants. For example, getAddExpr({5, x, 3}) folds to getAddExpr({8, x}) and orders the constant first.
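The fold-constants-and-order-first behavior can be sketched directly. Operands are modeled as (isConst, value) pairs; the zero-dropping rule and the ordering key for non-constant operands are assumptions for illustration, not recovered details:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Op { bool isConst; long v; };

// Canonicalize an n-ary add: merge all constant addends into one,
// place it first, and sort the remaining operands by a stable key.
std::vector<Op> canonAdd(std::vector<Op> ops) {
    long k = 0;
    std::vector<Op> rest;
    for (const Op &o : ops) {
        if (o.isConst) k += o.v;          // constant folding
        else rest.push_back(o);
    }
    std::sort(rest.begin(), rest.end(),
              [](const Op &a, const Op &b) { return a.v < b.v; });
    std::vector<Op> out;
    if (k != 0 || rest.empty())           // assumed: a zero constant is dropped
        out.push_back({true, k});         // constant ordered first
    out.insert(out.end(), rest.begin(), rest.end());
    return out;
}
```

Applied to the text's example {5, x, 3}, this yields {8, x}. Canonical operand order is what lets the FoldingSet dedupe commutative sums written in different source orders.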
The SCEV Cache
The primary SCEV cache (ValueExprMap) maps Value* to SCEV* using an open-addressed hash table with the standard hash function used throughout cicc's SCEV subsystem:
slot = ((uint32_t)key >> 9) ^ ((uint32_t)key >> 4)
slot &= (capacity - 1)
Sentinels: EMPTY = 0xFFFFFFFFFFFFF000 (-4096), TOMBSTONE = 0xFFFFFFFFFFFFE000 (-8192). Capacity is always a power of two. Growth occurs at 75% load factor (doubling), and in-place rehashing (tombstone cleanup) triggers when fewer than 1/8 of slots are truly empty.
Cache lookup (sub_D98300) is called at the top of every getSCEV invocation. Cache store (sub_DB77A0) is called after every successful SCEV construction, and also when the complexity control bails to SCEVUnknown (caching the Unknown result prevents re-scoring the same instruction).
The simple mode's failure caching is critical for performance: once an instruction is classified as SCEVUnknown, the result is cached so that subsequent queries (from different loop analysis passes) return instantly rather than re-running the complexity scorer.
How SCEV Feeds Loop Optimizations
SCEV is consumed by every loop optimization in cicc. The key interfaces:
LoopVectorize (sub_DFAE00 and callers): Calls getBackedgeTakenCount (sub_DCF980) to determine whether the loop has a computable trip count. If not, vectorization is abandoned. Uses getSmallBestKnownTC (sub_2AA7EC0) for the trip count upper bound, which is compared against -vectorizer-min-trip-count. SCEV range analysis (sub_DBB9F0) proves that the epilogue trip count is sufficient for the minimum vector factor. Runtime SCEV overflow checks generate scev.check basic blocks.
LoopUnroll (sub_19B6690): The unroll factor selection function extracts MaxTripCount from SCEV. Runtime trip counts below flat-loop-tripcount-threshold (default 5) mark the loop as "flat" and skip unrolling. Partial unrolling requires BackedgeCount % UnrollCount computation. After unrolling, sub_2A13F00 reconciles SCEV and LoopInfo for the modified loop.
Loop Strength Reduction (sub_19A87A0): The NVIDIA custom LSR reads SCEV expressions for each loop use (base SCEV at +0, stride SCEV at +8, loop bounds at +712/+720). The formula solver generates alternatives by factoring common strides out of SCEV expressions. SCEV normalization (sub_199D980) provides canonical forms for hash-table keying.
IndVarSimplify (sub_1945A50): Uses SCEV to compute exit values, rewrite loop exit conditions, and perform LFTR (Linear Function Test Replace). NVIDIA adds two guards:
- Disable-unknown-trip-iv (registered in ctor_203 at 0x4E1CD0, global qword_4FAF520): when set, the pass is skipped entirely for loops whose trip count is SCEVCouldNotCompute. The check in the run() wrapper (sub_19489B0, lines 119-122) calls sub_1CED350 (trip count query) and sub_1CED620 (trip count for header). This protects GPU-specific loops with divergent control flow from incorrect IV transforms.
- iv-loop-level (default 1, global qword_4FAF440): limits IndVarSimplify to loops at nesting depth <= the configured level. sub_193DD90 (getLoopDepth) returns 1 for outermost loops. The default restricts IV simplification to outermost loops only, avoiding compile-time explosion on deeply-nested GPU kernels (stencil, tensor code).
Loop Strength Reduction guards: NVIDIA adds disable-unknown-trip-lsr to skip LSR entirely for unknown-trip-count loops, plus lsr-check-rp / lsr-rp-limit to gate LSR on register pressure.
LoopInterchange (sub_E05-loop-interchange): Uses SCEV stride analysis to determine which loops carry memory strides. If a subscript has stride in both inner and outer loops, it is marked "ambiguous" and interchange is blocked. For grid-stride loops, the step blockDim.x * gridDim.x is recognized as an outer-loop stride, allowing interchange when the array subscript depends on a single loop dimension.
Configuration: All SCEV Knobs
NVIDIA-Specific Knobs
| Knob | Default | Effect |
|---|---|---|
| scalar-evolution-complexity-control | true | Enables the simple_mode system |
| scalar-evolution-max-expr-size | 384 | Max SCEV expression complexity score before bailing to Unknown |
| scalar-evolution-max-expr-failures | 100 | Max bailed instructions before giving up on entire function |
| scalar-evolution-max-add-items | 500 | Max addends in a single SCEVAddExpr |
| do-sign-ext-expand | false | Expand sign-extensions during SCEV construction |
| do-sign-ext-simplify | (bool) | Simplify SCEV on sign-extend expressions |
| track-trip-count-more | true | More aggressive trip count tracking |
| common-factor-with-mr265 | true | SCEV common factor optimization (internal MR reference) |
| scalar-evolution-classify-expressions | true | Enable SCEV expression classification |
| aggressive-positive-stride-analysis | (bool) | Aggressive stride sign reasoning for blockDim/gridDim/warpSize (see nvbug 3972412) |
| special-reassociate-for-threadid | (bool) | Prevent hoisting threadIdx expressions out of canonical position |
| Disable-unknown-trip-iv | (bool) | Skip IndVarSimplify for loops with SCEVCouldNotCompute trip count |
| disable-unknown-trip-lsr | (bool) | Skip Loop Strength Reduction for unknown-trip-count loops |
| iv-loop-level | 1 | Max loop nesting depth for IndVarSimplify (1 = outermost only) |
| scev-cgp-tid-max-value | (int) | Max value of thread ID for SCEV-CGP address mode optimization |
Upstream LLVM Knobs (Preserved in cicc)
| Knob | Default | Effect |
|---|---|---|
| scalar-evolution-max-recursion-depth | 100 | Hard counter for getSCEV depth in normal mode |
| scalar-evolution-max-iterations | 100 | Max iterations for constant evolution |
| scalar-evolution-max-arith-depth | 32 | Max arithmetic simplification depth |
| scalar-evolution-max-cast-depth | 8 | Max cast folding depth |
| scalar-evolution-max-ext-depth | 8 | Max extension analysis depth |
| scalar-evolution-max-constant-evolving-depth | 32 | Max depth for constant evolving analysis |
| scalar-evolution-max-scev-compare-depth | 32 | Max depth for SCEV comparison |
| scalar-evolution-max-scev-operations-implication-depth | 2 | Max depth for implication reasoning |
| scalar-evolution-max-value-compare-depth | 2 | Max depth for value comparison |
| scev-mulops-inline-threshold | 32 | Max multiply operands before outline |
| scev-addops-inline-threshold | 500 | Max add operands before outline |
| verify-scev | false | Enable SCEV verification |
| verify-scev-strict | false | Stricter SCEV verification |
| verify-scev-maps | false | Verify SCEV map consistency |
SCEV Global Variables (Binary Addresses)
| Global | Knob String | Default | Used By |
|---|---|---|---|
| dword_4F88268 | scalar-evolution-max-recursion-depth | 100 | getSCEV normal mode depth counter |
| qword_4F88348 | scalar-evolution-max-expr-failures | 100 | getSCEV simple mode failure gate |
| dword_4F88428 | scalar-evolution-max-expr-size | 384 | expressionComplexity size threshold |
| qword_4F88DC8 | (loop iteration bound) | -- | Exit analysis iteration limit |
| qword_4F88EA8 | (range recursion limit) | -- | getRangeRef max recursion depth |
SCEV-CGP Knobs (Address Mode Optimization)
| Knob | Effect |
|---|---|
| do-scev-cgp | Enable SCEV-based CodeGenPrepare |
| do-scev-cgp-aggresively | Aggressive mode (sic -- typo preserved in binary) |
| do-function-scev-cgp | Function-level SCEV-CGP |
| nv-disable-scev-cgp | Disable the SCEV-CGP pass entirely |
| scev-cgp-control | Control number of transformations |
| scev-cgp-cross-block-limit | Max common bases from a single block |
| scev-cgp-idom-level-limit | Limit IDOM traversal level |
| scev-cgp-inst-limit | Max instructions considered per parameter |
| scev-cgp-old-base | Use old base computation method |
| scev-cgp-tid-max-value | Max thread ID value for address mode analysis |
| print-after-scev-cgp | Print function IR after SCEV-CGP |
Differences from Upstream LLVM
The cicc v13.0 SCEV subsystem diverges from upstream LLVM 20.0.0 ScalarEvolution.cpp in the following ways:
| Feature | Upstream LLVM | cicc v13.0 |
|---|---|---|
| Budget system | Single recursion_count depth counter | Two-stage: expression size scoring (sub_DB3670) + failure counting, toggled via simple_mode flag |
| Kernel bypass | No concept of calling convention bypass | CC 42/43 (PTX __global__) bypass all SCEV budgets |
| createSCEV | Recursive | Non-recursive worklist (sub_DD8130) to handle deep GPU expression trees |
| GPU builtin ranges | No thread/block index knowledge | Intrinsic IDs 312/333/339/360/369/372 inject ranges at SCEV construction time |
| PHI decomposition | Standard recurrence detection | GPU-specific path (kind 64) traces PHI chains through NVIDIA special register intrinsics |
| Delinearization | Standard dimension recovery | Polymorphic predicate collector recognizes grid-stride patterns; step_recurrence field enables GPU memory coalescing analysis |
| Trip count tracking | Standard | track-trip-count-more (default true) enables more aggressive BTC computation |
| Stride sign reasoning | Standard | aggressive-positive-stride-analysis assumes blockDim/gridDim/warpSize are always positive |
| Expression canonicalization | Standard | special-reassociate-for-threadid prevents moving threadIdx expressions |
| SCEV-CGP | Not present | Complete NVIDIA SCEV-based CodeGenPrepare pass with 11 dedicated knobs |
| Knob count | ~15 standard knobs | 15 upstream + 15 NVIDIA-specific + 11 SCEV-CGP = ~41 total SCEV knobs |
The most consequential divergence is the simple_mode system: it changes the compile-time complexity class of SCEV analysis from O(N * D^2) (where D is recursion depth) to O(N * S) (where S is the per-instruction size limit), making SCEV analysis tractable on large GPU kernels without sacrificing accuracy on the important inner-loop induction variables.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| getSCEV | sub_DD8400 | -- | Top-level entry; cache + mode dispatch |
| Worklist createSCEV | sub_DD8130 | -- | Non-recursive worklist driver |
| createSCEV wrapper | sub_DD80F0 | -- | Type check + delegate |
| createNodeForInstruction | sub_DD65B0 | -- | Core 3-phase opcode dispatch |
| decomposeIRInstruction | sub_D94080 | -- | Instruction to decomposition struct |
| createNodeForPHI | sub_DD92B0 | -- | PHI to AddRec conversion |
| createNodeForSelectOrPHI | sub_DD99C0 | -- | Select/PHI combined handler |
| getExistingExpr | sub_DD6410 | -- | Fast path for PHI recurrence |
| getGEPExpr | sub_DD3A70 | -- | GEP to SCEV conversion |
| getLoopForExpr | sub_DD86E0 | -- | Determine loop context for expression |
| lookupSCEV | sub_D98300 | -- | Cache lookup (ValueExprMap) |
| insertSCEV | sub_DB77A0 | -- | Cache store |
| expressionComplexity | sub_DB3670 | -- | NVIDIA expression size scorer; self-recursive, uses sub_CF4090 |
| SCEV node size counter | sub_CF4090 | -- | Counts nodes in existing SCEV tree for complexity scoring |
| getSmallConstantTripCount | sub_DB04E0 | -- | Extract small constant trip count |
| classifyExpressions / print | sub_1495EB0 | -- | Debug: "Classifying expressions for: " |
| isSCEVable | sub_D97040 | -- | Type is integer or pointer |
| isUnknown / isFailedSCEV | sub_D96A50 | -- | Check SCEVUnknown |
| getSCEVType | sub_D95540 | -- | Extract LLVM Type from SCEV expr |
| getTypeBitWidth | sub_D97050 | -- | Bit width of a type |
| lookupIntrinsicSCEV | sub_B494D0 | -- | Intrinsic fast-path table |
| isIntrinsicCall | sub_988010 | -- | Intrinsic detection |
| isLoopInvariant | sub_DBED40 | -- | Loop invariance check |
| isIntegerTy | sub_BCAC40 | -- | Integer type check |
| getRangeRef | sub_DBB9F0 | -- | ConstantRange evaluator (see range page) |
| computeBackedgeTakenCount | sub_DB9E00 | -- | BTC computation (see range page) |
| forgetLoop | sub_DE2750 | -- | Cache invalidation (see invalidation page) |
| delinearize | sub_DE9D10 | -- | Array delinearization (see invalidation page) |
Cross-References
- LoopVectorize & VPlan -- primary consumer of trip counts and SCEV ranges
- Loop Unrolling -- uses SCEV for unroll factor selection and trip count analysis
- Loop Strength Reduction (NVIDIA) -- uses SCEV expressions for formula generation
- SCEV Range Analysis & Trip Counts -- ConstantRange computation and backedge-taken count
- SCEV Invalidation & Delinearization -- cache eviction and multi-dimensional array recovery
- Builtin Table Structure -- intrinsic ID assignments for threadIdx/blockIdx/etc.
- IndVarSimplify -- SCEV-dependent IV transforms with the Disable-unknown-trip-iv guard
- SCEV-CGP (CodeGenPrepare) -- NVIDIA SCEV-based address mode optimization
- LLVM Knobs (1,689) -- full knob catalog including all SCEV knobs
- GPU Execution Model -- why GPU kernels need special SCEV treatment
SCEV Range Analysis & Backedge-Taken Counts
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Every loop optimization in cicc ultimately depends on two questions: "what values can this expression take?" and "how many times does this loop iterate?" The SCEV range analysis (sub_DBB9F0, corresponding to ScalarEvolution::getRangeRef) answers the first by propagating ConstantRange intervals through SCEV expression trees. The backedge-taken count (BTC) machinery (sub_DB9E00 / sub_DB9040, corresponding to computeBackedgeTakenCount / computeExitCountForBranch) answers the second by solving loop exit conditions algebraically. The two systems feed each other: range analysis uses trip counts to bound AddRec expressions, and trip count computation uses ranges to prove overflow behavior. On GPU targets, these analyses gain additional precision from NVIDIA-specific range sources -- thread indices are bounded by block dimensions, warpSize is the constant 32, and __launch_bounds__ metadata constrains block dimensions -- all of which flow into tighter ranges and more computable trip counts.
Key Facts
| Property | Value |
|---|---|
| Range evaluator | sub_DBB9F0 (0xDBB9F0), 31 KB |
| BTC dispatcher | sub_DCF3A0 (0xDCF3A0), mode 0=exact, 1=constant-max, 2=symbolic-max |
| BTC cache builder | sub_DB9E00 (0xDB9E00), 2,265 bytes |
| Exit count engine | sub_DB9040 (0xDB9040), 18 KB |
| howFarToZero | sub_DBA850 (0xDBA850), 8 KB |
| howManyLessThans | sub_DCE310 (0xDCE310), 317 lines |
| Range cache (unsigned) | scev_ctx+976, 40-byte entries, open-addressing |
| Range cache (signed) | scev_ctx+1008, 40-byte entries, open-addressing |
| BTC cache | scev_ctx+656, 168-byte entries, open-addressing |
| Per-exit BTC cache | scev_ctx+1168, 56-byte entries |
| Max range recursion depth | qword_4F88EA8 (global, configurable) |
| Extended exit analysis flag | qword_4F88C08 (global, enables Phase D) |
| NVIDIA knobs | track-trip-count-more, aggressive-positive-stride-analysis, do-sign-ext-simplify, do-sign-ext-expand |
ConstantRange Propagation Algorithm
The range evaluator sub_DBB9F0 takes a SCEV expression, a signedness flag (is_signed: 0=unsigned, 1=signed), and a recursion depth counter. It returns a pointer to a cached 32-byte ConstantRange representing the half-open interval [lower, upper) with wrapping semantics. The algorithm is a recursive descent over the SCEV expression tree with aggressive caching.
Cache Structure
Two separate hash tables store signed and unsigned ranges:
if (is_signed) {
table = scev_ctx[+1008]; // signed range cache
capacity = scev_ctx[+1024];
} else {
table = scev_ctx[+976]; // unsigned range cache
capacity = scev_ctx[+992];
}
Each entry is 40 bytes: an 8-byte key (SCEV pointer, with 0xFFFFFFFFFFFFF000 as the empty sentinel) followed by a 32-byte ConstantRange value. The hash function is:
slot = ((uint32_t)scev_ptr >> 9) ^ ((uint32_t)scev_ptr >> 4);
slot &= (capacity - 1); // capacity is always a power of two
Linear probing resolves collisions. On a cache hit, the function returns immediately without recomputation.
Dispatch by SCEV Kind
After a cache miss, the evaluator dispatches on the SCEV opcode at scev_expr+24 (uint16):
| Opcode | Kind | Range Computation |
|---|---|---|
| 0 | SCEVConstant | Single-value range from the constant's APInt |
| 1 | SCEVUnknown | sub_988CD0: range from ValueTracking / instruction semantics |
| 2 | SCEVTruncate | Recurse on operand, apply ConstantRange::truncate |
| 3 | SCEVZeroExtend | Recurse on operand, apply ConstantRange::zeroExtend |
| 4 | SCEVSignExtend | Recurse on operand, apply ConstantRange::signExtend |
| 5 | SCEVAddExpr | Fold operand ranges with addWithNoWrap, respecting NUW/NSW |
| 6 | SCEVMulExpr | Fold operand ranges with ConstantRange::multiply |
| 7 | SCEVUDivExpr | ConstantRange::udiv of LHS and RHS ranges |
| 8 | SCEVAddRecExpr | Multi-phase analysis (see below) |
| 9-13 | SMax/UMax/SMin/UMin | Fold via lookup table dword_3F74E60[opcode-9] + sub_ABD750 |
| 14 | SCEVCouldNotCompute | Passthrough (identity range) |
| 15 | SCEVSequentialUMin | Complex instruction-level analysis (PHI, intrinsics, metadata) |
Every computed range is intersected with an initial range derived from the type's bit width and any known-bits / sign-bits information before being stored in the cache. This intersection can only narrow the range, never widen it.
Initial Range Narrowing
Before the SCEV-kind dispatch, the evaluator computes an initial range from type information:
- Unsigned mode: calls sub_DB5510 (getKnownBits) to extract known high zero bits, constructs a range [0, 2^(bitwidth - leading_zeros)), and intersects it with the full-set range.
- Signed mode: calls sub_DB55F0 (getNumSignBits) and constructs a symmetric signed range from the sign-bit count, e.g., if 3 sign bits are known, the range is [-2^(bw-3), 2^(bw-3)).
This pre-narrowing ensures that even when the SCEV-kind dispatch returns a full-set (e.g., for complex expressions at the depth limit), the result still reflects type-level constraints.
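The two bounds above can be sketched directly; a minimal illustration assuming 32-bit values, with helper names standing in for sub_DB5510 / sub_DB55F0:

```cpp
#include <cstdint>

// Unsigned pre-narrowing: with `lz` known leading zero bits, every value
// fits in [0, 2^(32 - lz)).
uint64_t unsigned_upper_bound(unsigned known_leading_zeros) {
    return 1ULL << (32 - known_leading_zeros);
}

// Signed pre-narrowing: with `sb` known sign bits, every value fits in
// the symmetric range [-2^(32 - sb), 2^(32 - sb)).
int64_t signed_magnitude_bound(unsigned num_sign_bits) {
    return 1LL << (32 - num_sign_bits);
}
```

For example, 24 known leading zeros narrow an i32 to [0, 256), and 3 known sign bits narrow it to [-2^29, 2^29), matching the text above.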
AddRec Range Analysis (The Core)
The SCEVAddRecExpr case (opcode 8) is the most complex, executing up to five phases that progressively narrow the range of a loop induction variable {start, +, step}:
Phase A -- NoWrap Start Refinement. If the AddRec has NUW or NSW flags (bits at scev_expr+28), the unsigned range of the start value is computed and intersected. This ensures that the IV's initial value constrains the overall range even before considering the step.
Phase B -- Step Monotonicity. If the NSW flag (bit 2, value 0x4) is set:
- sub_DBED40 checks if all step operands are non-negative (monotone up). If so, the signed minimum of start becomes the lower bound: range [smin(start), SMAX].
- sub_DBEC80 checks if all steps are non-positive (monotone down). If so, the signed maximum of start becomes the upper bound: range [SMIN, smax(start)+1].
Phase C -- Trip Count Refinement. For simple two-operand recurrences ({start, +, step} with operand count == 2):
- Call sub_DCF3A0(ctx, loop, 1) to get the max backedge-taken count.
- If the trip count is computable, compute range(start + step * [0, trip_count]) for both unsigned (sub_DBEFC0) and signed (sub_DBF480) domains.
- Intersect both results into the accumulated range.
This is where range analysis and BTC computation form their feedback loop: the BTC is used to bound the AddRec's range.
Phase D -- Exit Value Analysis (NVIDIA-gated). Enabled only when global qword_4F88C08 is set. Gets the exact backedge-taken count (mode=2 via sub_DCF3A0), and if the trip count bit width fits within the AddRec's bit width and NSW is set, calls sub_DE4FD0 to compute the exit value range. This provides the tightest possible bound but is more expensive.
Phase E -- Cache and Return. The final accumulated range (from all intersections) is stored in the cache.
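For the easiest Phase C case, the fold reduces to interval arithmetic. A sketch under simplifying assumptions (constant positive step, closed start interval, no unsigned wrap -- the real sub_DBEFC0/sub_DBF480 also handle wrapping and symbolic steps):

```cpp
#include <cstdint>
#include <utility>

// Phase C sketch: for {start, +, step} with start in [start_lo, start_hi],
// a constant step > 0, and trip_count iterations without wrap, the IV stays
// within [start_lo, start_hi + step * trip_count] over the whole loop.
std::pair<uint64_t, uint64_t> addrec_range(uint64_t start_lo, uint64_t start_hi,
                                           uint64_t step, uint64_t trip_count) {
    return {start_lo, start_hi + step * trip_count};
}
```

E.g., start in [0, 3], step 4, at most 10 backedges gives the closed interval [0, 43], which the evaluator would then intersect with the accumulated range.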
SCEVUnknown and Instruction-Level Analysis
For SCEVUnknown (opcode 1) and the complex instruction-level path (opcode 15), the range evaluator performs several specialized analyses:
- !range metadata: if the underlying instruction carries !range metadata (kind=4), sub_B91C10 extracts it and sub_ABEA30 builds a ConstantRange directly.
- Predecessor merging: sub_DBB110 computes ranges by analyzing incoming values from predecessor basic blocks, intersecting the results.
- PHI node analysis: for PHI nodes (instruction opcode 84), the evaluator iterates all incoming values, computes each one's SCEV range, and unions them. A visited-PHI set at scev_ctx+320 prevents infinite recursion through cyclic PHIs.
- Intrinsic ranges: sub_988010 identifies specific intrinsics (e.g., ctpop, ctlz, cttz) and constrains their ranges to non-negative values via sub_ABB6C0.
- Stride alignment: sub_BD4FF0 computes stride/alignment information for loads and stores, narrowing the range to multiples of the known stride.
Signed/Unsigned Cross-Pollination
A critical detail: the AddRec analysis explicitly recurses with the opposite signedness flag in certain sub-analyses. Phase A always computes the start in unsigned mode (is_signed=0), while Phase B always uses signed mode (is_signed=1). This cross-referencing allows information from one domain to constrain the other, producing tighter bounds than either domain alone.
GPU-Specific Range Sources
Three categories of NVIDIA-specific range information feed into SCEV range analysis, all derived from the CUDA execution model:
Thread and Block Index Ranges
The intrinsics @llvm.nvvm.read.ptx.sreg.tid.{x,y,z} (threadIdx) produce values in [0, blockDim-1]. The intrinsics @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z} (blockIdx) produce values in [0, gridDim-1]. When these intrinsics appear as SCEVUnknown nodes, the range evaluator propagates their constrained ranges through the expression tree.
The block dimension intrinsics @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z} are bounded by __launch_bounds__ metadata when present. Specifically, nvvm.maxntid (from __launch_bounds__(maxThreads)) provides an upper bound on ntid.x * ntid.y * ntid.z, and nvvm.reqntid provides an exact value. These bounds are read by sub_CE8D40 (NvvmMeta_getMaxNTID) and sub_CE8DF0 (NvvmMeta_getReqNTID).
warpSize (@llvm.nvvm.read.ptx.sreg.warpsize) is the constant 32 on all architectures from sm_70 onward, producing the singleton range [32, 33).
Grid-Stride Loop Patterns
SCEV delinearization (sub_DE9D10) specifically recognizes the grid-stride pattern:
// CUDA: for (int i = tid + bid * bdim; i < N; i += bdim * gdim)
// SCEV: {threadIdx.x + blockIdx.x * blockDim.x, +, blockDim.x * gridDim.x}
The step blockDim.x * gridDim.x inherits known-positive range from both operands, enabling the monotonicity analysis in Phase B to prove the IV is non-decreasing. Combined with the bounded start value (tid.x + bid.x * bdim.x is non-negative), the range of the entire AddRec is [0, N) rather than full-set.
KnownBits and DemandedBits Integration
The sub_99B5E0 post-analysis in SimplifyDemandedBits applies NVIDIA-specific refinements including thread index range constraints (threadIdx.x < blockDim.x) and warp-level uniformity assumptions. These propagate through SCEV's getKnownBits (sub_DB5510) to tighten the initial unsigned range of expressions involving GPU special registers.
Backedge-Taken Count Computation
The BTC machinery computes how many times a loop's backedge executes before any exit is taken. The result has three variants:
- Exact count: the precise number of iterations, or SCEVCouldNotCompute if unknown.
- Constant max: a constant upper bound on the iteration count.
- Symbolic max: a SCEV expression bounding the iteration count (may involve loop-invariant values).
BTC Cache Layout
The primary BTC cache at scev_ctx+656 uses 168-byte entries:
| Offset | Size | Field |
|---|---|---|
| +0x00 | 8 | Key: SCEV pointer (sentinels: empty=-4096, tombstone=-8192) |
| +0x08 | 128 | Per-exit count data (SmallVector of {BasicBlock*, SCEV* count, flags}) |
| +0x88 | 8 | Exact backedge-taken count (SCEV pointer or null) |
| +0x90 | 1 | Flag: exact count is valid |
| +0x98 | 8 | Max backedge-taken count (SCEV pointer or null) |
| +0xA0 | 1 | Flag: max count is valid |
The hash function is identical to the range cache: ((key >> 9) ^ (key >> 4)) & (capacity - 1). Load factor threshold is 75% for capacity doubling (via sub_DB6980) and 87.5% (only capacity/8 truly empty slots remaining) for in-place rehash to reclaim tombstones.
A secondary per-exit table at scev_ctx+1168 stores 56-byte entries indexing individual exit block trip counts, avoiding linear scans through the main entry's embedded exit array.
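The two maintenance thresholds can be sketched as simple occupancy tests. A minimal illustration assuming the exact cutoffs stated above (grow at 75% occupancy via sub_DB6980, compact in place when fewer than capacity/8 slots remain truly empty); the boundary conditions (< vs. <=) are assumptions:

```cpp
#include <cstdint>

// Grow (double capacity) once live entries plus tombstones reach 75%.
bool should_grow(uint32_t entries, uint32_t tombstones, uint32_t capacity) {
    return (entries + tombstones) * 4 >= capacity * 3;
}

// Rehash in place (reclaiming tombstones) once fewer than capacity/8
// slots are truly empty, i.e. occupancy passes 87.5%.
bool should_rehash_in_place(uint32_t entries, uint32_t tombstones, uint32_t capacity) {
    uint32_t empty = capacity - entries - tombstones;
    return empty < capacity / 8;
}
```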
Exit Count Computation Pipeline
sub_DB9040 (computeExitCountForBranch) is the heavy lifter. For each exiting block, it:
- Extracts the branch condition's ICmp instruction.
- Identifies the comparison operands as SCEV expressions.
- Classifies the exit condition into one of the standard shapes.
- Dispatches to the appropriate solver.
The two primary solvers are:
howFarToZero (sub_DBA850, 8 KB) -- handles x != 0 exit conditions. The exit condition is normalized to V = LHS - RHS, so the loop exits when V == 0. For affine AddRec {Start, +, Step}:
// The loop exits when: Start + Step * N = 0 (mod 2^BW)
// Solving: N = -Start / Step (mod 2^BW)
// For positive step (counting up to overflow): N = -Start / Step
// For negative step (counting down to zero): N = Start / (-Step)
For quadratic AddRec {L, +, M, +, N}, it solves the quadratic equation via SolveQuadraticAddRecExact. If the expression is not affine or quadratic, it returns CouldNotCompute.
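The modular division in the affine case can be made concrete for an odd Step, which is invertible mod 2^BW. A sketch at 64 bits (not the recovered implementation, which operates on APInt and also handles even strides by factoring out powers of two):

```cpp
#include <cstdint>

// Newton-Hensel iteration for the inverse of an odd value mod 2^64:
// inv = step is already correct to 3 bits (odd^2 == 1 mod 8), and each
// round doubles the number of correct low bits (3->6->12->24->48->96).
uint64_t modular_inverse(uint64_t step /* odd */) {
    uint64_t inv = step;
    for (int i = 0; i < 5; ++i)
        inv *= 2 - step * inv;
    return inv;
}

// Solve Start + Step * N == 0 (mod 2^64): N = -Start / Step.
uint64_t how_far_to_zero(uint64_t start, uint64_t step /* odd */) {
    return (0 - start) * modular_inverse(step);
}
```

For a count-down IV {100, +, -1}, this yields N = 100, matching the "counting down to zero" comment above.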
howManyLessThans (sub_DCE310, 317 lines) -- handles x < bound (signed or unsigned) exit conditions. For affine IV = {Start, +, Step} with loop-invariant Bound:
// Unsigned: count = ceil_div(max(Bound, Start) - Start, Step)
// Signed: count = ceil_div(max_signed(Bound, Start) - Start, Step)
// With overflow checks based on NUW/NSW flags
This function also contains special logic for zero-extended IVs: if the comparison involves zext(IV) < Bound, it can infer NUW on the inner AddRec by proving that the bound is small enough that unsigned overflow cannot occur before the exit.
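The unsigned formula above reduces to a ceiling division once NUW rules out wrap. A sketch of that easy case (the real solver also handles the signed variant and the overflow guards):

```cpp
#include <cstdint>

// howManyLessThans sketch for unsigned {Start, +, Step} with NUW and a
// loop-invariant Bound: the body executes ceil((Bound - Start) / Step)
// times when Start < Bound, else zero.
uint64_t how_many_less_thans(uint64_t start, uint64_t bound, uint64_t step) {
    if (start >= bound) return 0;
    return (bound - start + step - 1) / step; // ceiling division
}
```

E.g., for (i = 0; i < 10; i += 3) executes at i = 0, 3, 6, 9: four iterations, as the formula gives.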
Loop Shape Handling
The BTC computation handles several loop shapes through the exit condition classification:
- Countable (for-style): for (i = 0; i < N; i++) produces {0, +, 1} < N, solved by howManyLessThans as N - 0 = N iterations.
- While-do: the exit test precedes the body. Trip count equals the number of backedge traversals, which is one less than the number of condition evaluations.
- Do-while: the exit test follows the body. The backedge is taken at least once if the loop is entered. Trip count comes directly from the exit condition solver.
- Multiple exits: computeBackedgeTakenCount (sub_DB9E00) iterates all exiting blocks, computes per-exit counts, and takes the minimum. If any exit is not computable, the exact count is CouldNotCompute but the max count may still be known from the computable exits.
- Exhaustive evaluation: sub_DCFD50 (computeExitCountExhaustively) brute-force iterates small constant-evolving loops (up to scalar-evolution-max-iterations = 100 iterations) to find exit counts that algebraic methods cannot handle.
Overflow Handling and NoWrap Flags
Trip count precision depends critically on the NoWrap flags (NUW = bit 1, NSW = bit 2) stored at scev_expr+28:
- NUW (No Unsigned Wrap): if an AddRec {Start, +, Step} has NUW, unsigned arithmetic cannot wrap, so Start + Step * N is monotonically increasing in the unsigned domain. This allows howManyLessThans to compute an exact count without overflow guards.
- NSW (No Signed Wrap): similarly for signed arithmetic. Enables signed comparison trip counts and the Phase B monotonicity analysis in range computation.
- Neither flag: the solver must account for wrapping. howFarToZero solves modular arithmetic; howManyLessThans may fall back to constant-max estimates or CouldNotCompute.
The NVIDIA-specific knob aggressive-positive-stride-analysis (documented as "See nvbug 3972412") enables more aggressive inference of NUW flags on AddRec expressions with positive strides, particularly for GPU loop patterns where the step is a known-positive grid dimension.
How BTC Feeds Loop Optimizations
Loop Unrolling
The unroll decision engine (sub_19BB5C0) queries getSmallBestKnownTC (sub_2AA7EC0) which calls the BTC machinery. The result determines the unroll strategy:
- Exact trip count known and small: enables full unrolling -- the loop body is replicated exactly N times with no remainder loop. This is the most profitable case for GPU code since it eliminates all loop overhead.
- Exact trip count known but large: enables partial unrolling with an exact remainder. The unroll factor is chosen to divide the trip count, avoiding a remainder loop entirely.
- Only max trip count known: enables partial unrolling with a runtime remainder check. The unroll factor is bounded by the max trip count.
- Trip count unknown: unrolling is gated by the NVIDIA knob Disable-unknown-trip-iv -- when set, IndVarSimplify (sub_19489B0) skips loops entirely if the trip count is not computable.
Loop Vectorization
The vectorizer (sub_2AE3460) uses BTC in two ways:
- Minimum trip count threshold: getSmallBestKnownTC is compared against dword_500EAE8 (-vectorizer-min-trip-count). If the known trip count is below this threshold, vectorization bails with "LowTripCount" (note the preserved typo in the binary's message: "The trip count is below the minial threshold value.").
- Divisibility for epilogue: when the exact trip count is known, the vectorizer checks if it is divisible by the vectorization factor. If so, no scalar epilogue is needed. If not, it generates an epilogue loop. The exact trip count from SCEV enables eliminating the runtime divisibility check.
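Both checks are simple predicates over the BTC result. A sketch with illustrative names (`tc` is the known exact trip count, `vf` the vectorization factor):

```cpp
#include <cstdint>

// "LowTripCount" bailout: too few iterations to amortize vector setup.
bool bails_low_trip_count(uint64_t tc, uint64_t min_trip_count) {
    return tc < min_trip_count;
}

// Epilogue decision: a remainder after dividing by the vectorization
// factor forces a scalar (or narrower-vector) epilogue loop.
bool needs_scalar_epilogue(uint64_t tc, uint64_t vf) {
    return tc % vf != 0;
}
```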
IRCE (Inductive Range Check Elimination)
IRCE (sub_194D450) uses SCEV ranges to split loops into pre-loop / main-loop / post-loop regions. The BTC determines the main loop's iteration space, and the range checks within the loop body define the boundaries for the pre/post loops. Tighter SCEV ranges mean tighter pre/post loops (fewer wasted iterations), which is significant for GPU kernels where every wasted iteration occupies a warp lane.
IndVarSimplify
IndVarSimplify (sub_1945A50) uses the exact BTC for Linear Function Test Replacement (LFTR): replacing the original loop exit test with a comparison against the trip count. This is gated by three NVIDIA knobs: disable-lftr, Disable-unknown-trip-iv, and iv-loop-level (default 1, restricting IV simplification to outermost loops only to limit compile-time on deeply nested GPU kernels).
GPU-Specific Trip Count Patterns
Grid-Stride Loops
for (int i = threadIdx.x + blockIdx.x * blockDim.x;
i < N;
i += blockDim.x * gridDim.x)
SCEV representation: {tid.x + ctaid.x * ntid.x, +, ntid.x * nctaid.x}. The start is bounded by [0, ntid.x * nctaid.x) and the step is provably positive (product of two positive values). Trip count: ceil((N - start) / step). With __launch_bounds__, the step's range can be computed precisely, enabling exact trip count computation when N is loop-invariant.
Warp-Stride Loops
for (int i = threadIdx.x % 32; i < N; i += 32)
SCEV representation: {tid.x urem 32, +, 32}. The start is [0, 31] (since warpSize=32), and the step is the constant 32. Trip count: ceil((N - (tid.x % 32)) / 32). This is always computable when N is loop-invariant.
Block-Bounded Loops
for (int i = 0; i < blockDim.x; i++)
When nvvm.reqntid metadata is present, blockDim.x has a known constant value, and the loop has a compile-time-known trip count. This enables full unrolling -- common for shared memory initialization and reduction loops.
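The trip-count formulas for these patterns can be sketched directly, assuming N is loop-invariant and the arithmetic does not wrap (the actual solver works on SCEV expressions, not concrete values):

```cpp
#include <cstdint>

// Grid-stride loop {tid + ctaid*ntid, +, ntid*nctaid}:
// iterations = ceil((N - start) / step), or 0 if the thread starts past N.
uint64_t grid_stride_trips(uint64_t tid, uint64_t ctaid, uint64_t ntid,
                           uint64_t nctaid, uint64_t n) {
    uint64_t start = tid + ctaid * ntid;
    uint64_t step = ntid * nctaid; // provably positive
    return start >= n ? 0 : (n - start + step - 1) / step;
}

// Warp-stride loop {tid % 32, +, 32}: start in [0, 31], constant step 32.
uint64_t warp_stride_trips(uint64_t tid, uint64_t n) {
    uint64_t start = tid % 32;
    return start >= n ? 0 : (n - start + 31) / 32;
}
```

For instance, a warp-stride loop with tid = 37 and N = 100 starts at i = 5 and executes at 5, 37, 69: three iterations.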
Configuration Knobs
| Knob | Default | Effect |
|---|---|---|
| scalar-evolution-max-iterations | 100 | Max iterations for exhaustive BTC evaluation |
| scalar-evolution-max-scev-compare-depth | 32 | Recursion limit for SCEV comparison |
| scalar-evolution-max-arith-depth | 32 | Recursion limit for arithmetic simplification |
| scalar-evolution-max-cast-depth | 8 | Recursion limit for ext/trunc handling |
| scalar-evolution-max-ext-depth | 8 | Recursion limit for extension expressions |
| scalar-evolution-max-constant-evolving-depth | 32 | Depth limit for constant evolution |
| scalar-evolution-max-expr-size | 384 | Expression complexity budget (NVIDIA simple mode) |
| scalar-evolution-max-expr-failures | 100 | Max failures before all expressions bail to Unknown |
| scev-addops-inline-threshold | 500 | Max add operands before bailing |
| scev-mulops-inline-threshold | 32 | Max mul operands before bailing |
| scev-cheap-expansion-budget | (default) | Cost budget for SCEVExpander materialization |
| track-trip-count-more | false | "Track loop trip count more aggressively" (NVIDIA-specific) |
| aggressive-positive-stride-analysis | true | More aggressive NUW inference for positive strides (nvbug 3972412) |
| do-sign-ext-simplify | (default) | Simplify sign-extension SCEV expressions |
| do-sign-ext-expand | (default) | Expand sign-extensions during SCEV construction |
| qword_4F88EA8 | (global) | Max recursion depth for range computation |
| qword_4F88C08 | (global) | Enable extended exit-value analysis (Phase D) |
The NVIDIA-specific knobs are particularly important. track-trip-count-more enables additional effort in BTC computation that upstream LLVM does not attempt -- the exact mechanism is not fully reversed, but the typo in its description string ("aggresively") matches the binary. aggressive-positive-stride-analysis is tied to a specific NVIDIA bug (nvbug 3972412) and enables proving NUW on AddRec expressions whose step is known positive from range analysis, creating a positive feedback loop between range computation and NoWrap inference.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| ScalarEvolution::getRangeRef() -- core range evaluator | sub_DBB9F0 | -- | -- |
| getRangeForAffineARViaRange() -- predecessor-based range | sub_DBB110 | -- | -- |
| computeUnsignedRangeFromAddRecTripCount() | sub_DBEFC0 | -- | -- |
| computeSignedRangeFromAddRecTripCount() | sub_DBF480 | -- | -- |
| computeExitValueRange() -- Phase D exit value analysis | sub_DE4FD0 | -- | -- |
| getFullRangeFallback() -- depth-exceeded fallback | sub_DDFBD0 | -- | -- |
| cacheRange() -- insert range into hash table | sub_DB0AC0 | -- | -- |
| getKnownBits() for SCEV (unsigned known bits) | sub_DB5510 | -- | -- |
| getNumSignBits() for SCEV (signed known bits) | sub_DB55F0 | -- | -- |
| isKnownNonNegative(step) | sub_DBED40 | -- | -- |
| isKnownNonPositive(step) | sub_DBEC80 | -- | -- |
| getBackedgeTakenCount(loop, mode) -- BTC dispatcher | sub_DCF3A0 | -- | -- |
| computeBackedgeTakenCount() -- per-loop BTC with caching | sub_DB9E00 | -- | -- |
| computeExitCountForBranch() -- exit condition analysis | sub_DB9040 | -- | -- |
| howFarToZero() -- "reaches zero" trip count | sub_DBA850 | -- | -- |
| howManyLessThans() -- "less than" trip count | sub_DCE310 | -- | -- |
| computeExitCountExhaustively() -- brute-force small loops | sub_DCFD50 | -- | -- |
| computeExitLimit() -- exit limit from condition | sub_DCB270 | -- | -- |
| getSmallConstantTripCount() | sub_DB04E0 | -- | -- |
| getSmallConstantMaxTripCount() | sub_DB06C0 | -- | -- |
| BTC hash table growth / rehash | sub_DB6980 | -- | -- |
| BTC hash table rehash-in-place (tombstone cleanup) | sub_DE0180 | -- | -- |
| getRangeFromUnknownSCEV() -- range for SCEVUnknown | sub_988CD0 | -- | -- |
| ConstantRange::intersectWith() | sub_AB2160 | -- | -- |
| ConstantRange::unionWith() | sub_AB3510 | -- | -- |
| ConstantRange::addWithNoWrap() | sub_ABA0E0 | -- | -- |
| ConstantRange::multiply() | sub_AB5480 | -- | -- |
| ConstantRange::udiv() | sub_AB6A50 | -- | -- |
| ConstantRange::minmax_combine() | sub_ABD750 | -- | -- |
| ConstantRange from !range metadata | sub_ABEA30 | -- | -- |
| ConstantRange from KnownBits | sub_C4B490 | -- | -- |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Range sources | Profile data, __builtin_assume, !range metadata from user annotations | Additional GPU-specific sources: nvvm-intr-range pass injects !range on all special register reads; __launch_bounds__ constrains %tid/%ntid ranges; warpSize = 32 constant |
| Thread index bounds | No concept of bounded thread indices | %tid.x/y/z bounded by [0, maxntid-1], %ntid.x/y/z by [1, 1024], %laneid by [0, 31]; these tighten trip count computation for thread-indexed loops |
| Trip count precision | Depends on programmer-visible range annotations | Substantially higher precision on GPU due to statically known hardware launch bounds; most CUDA loops have computable trip counts |
| Range feedback loop | Range analysis and BTC computation feed each other | Same mutual feeding, but GPU-specific ranges make the feedback loop converge faster and more precisely |
| Warp-stride loops | No concept; stride analysis treats all strides equally | NVIDIA SCEV recognizes warp-stride patterns (stride = warpSize or stride = blockDim.x), enabling specialized BTC computation for cooperative thread loops |
| Overflow analysis | Standard NSW/NUW flag analysis | Same flags, plus GPU-specific insight: 32-bit IVs with %tid or %ctaid bases are often provably non-wrapping given launch dimension bounds |
Cross-References
- SCEV Overview & Construction -- expression creation, caching, simple mode
- Loop Unrolling -- how trip counts drive unroll factor selection
- LoopVectorize & VPlan -- min trip count threshold, epilogue generation
- Loop Strength Reduction -- IV manipulation driven by SCEV ranges
- KnownBits & DemandedBits -- GPU-specific known-bits feeding into range analysis
- LLVM Knobs -- all SCEV-related knob values
SCEV Invalidation & Delinearization
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
SCEV analysis results are expensive to compute and are cached aggressively. When the IR mutates -- a loop is unrolled, a value is replaced, a block is deleted -- cached SCEV expressions, range information, and backedge-taken counts can become stale. The invalidation subsystem (forgetLoop, forgetValue, forgetAllLoops) determines exactly which cache entries must be discarded after each transformation. Get it wrong in either direction and the compiler either produces incorrect code (stale data) or wastes time recomputing everything (over-invalidation).
Delinearization is the complementary recovery problem: given a flat pointer expression like base + i*N*M + j*M + k, recover the original multi-dimensional subscripts [i][j][k]. This is critical for GPU code because memory coalescing analysis needs to know whether adjacent threads in a warp are accessing adjacent addresses -- a question that can only be answered by examining per-dimension subscripts against the thread index structure.
In cicc v13.0, both subsystems carry NVIDIA-specific modifications. The invalidation engine has an extended exit-analysis depth threshold and an early-out for simple two-operand AddRec expressions common in GPU loops. The delinearization engine has a polymorphic predicate collector that supports GPU-aware strategies for shared memory bank conflict detection and coalescing analysis, plus at least 9 configuration knobs not present in upstream LLVM.
| Property | Value |
|---|---|
| forgetLoop address | sub_DE2750 (0xDE2750) |
| forgetLoop size | 10,051 bytes (~2,271 asm lines) |
| forgetValue address | sub_D9EE30 (0xD9EE30) |
| forgetValue size | ~9 KB |
| forgetAllLoops address | sub_D9D700 (0xD9D700) |
| forgetAllLoops size | ~8 KB |
| delinearize address | sub_DE9D10 (0xDE9D10) |
| delinearize size | 3,614 bytes (~849 asm lines) |
| collectParametricTerms address | sub_DE8D20 (0xDE8D20) |
| Hash function | ((key >> 9) ^ (key >> 4)) & (capacity - 1) |
| Empty sentinel | 0xFFFFFFFFFFFFF000 |
| Tombstone sentinel | 0xFFFFFFFFFFFFE000 |
Cache Invalidation
The Seven Caches
SCEV maintains seven distinct cache tables that must be kept consistent. Each has its own eviction path inside forgetLoop:
| # | Cache | Entry size | Key | Value | Context offset |
|---|---|---|---|---|---|
| 1 | ValueExprMap (primary) | 16 bytes | Value* | SCEV* | main SE object |
| 2 | Unsigned range cache | 40 bytes | SCEV* | ConstantRange | +976 |
| 3 | Signed range cache | 40 bytes | SCEV* | ConstantRange | +1008 |
| 4 | BTC cache | 168 bytes (0xA8) | loop SCEV* | BackedgeTakenInfo | +0x290 |
| 5 | Per-exit BTC cache | 16 bytes | exit SCEV* | exit count | +0x490 |
| 6 | AddRec folding cache | per-expression | AddRec pair | folded form | per-expression |
| 7 | Predicated BTC cache | 16 bytes | loop SCEV* | predicated count | secondary table |
All hash tables use the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth/compaction thresholds.
forgetLoop: The 8-Phase Algorithm
sub_DE2750 is the largest invalidation function -- 10 KB of machine code organized into eight sequential phases. It is called after every loop transformation that might invalidate SCEV data.
Signature:
void forgetLoop(
ScalarEvolution *SE, // rdi -- the SE context
Loop *L, // rsi -- loop being forgotten
BasicBlock *Header, // rdx -- loop header block
ExitInfo *Exits, // rcx -- exit block info (nullable)
int DepthFlag, // r8d -- 0=shallow, 1=deep, >1=bounded
int ExtraFlag, // r9d -- controls AddRec early-out
SmallDenseSet *Visited // stack -- prevents cycles in nested loops
);
Phase 1 -- Block value collection (0xDE27C9). Iterates the loop's basic blocks and collects all Values that have cached SCEV entries. The block array is at loop[+0x20] -> [+0x10] (pointer) / [+0x18] (count), stored as 32-byte entries. For each Value, a dominance check (sub_B19D00) confirms it belongs to the loop, then the SCEV index is extracted from a 27-bit field at value[+4] & 0x7FFFFFF. Collected pointers are stored in a SmallVector (inline capacity 6) with bit 2 set as a tag.
Phase 1B -- Scope chain collection (0xDE28A7). Walks a scope chain obtained via sub_B6AC80(SE[0][+0x28], 0x99), where 0x99 is the SCEV scope identifier. Filters to SCEVUnknown entries (type byte 0x55) with specific flag conditions (byte [+0x21] & 0x20), verifying loop membership and dominance. This captures values not directly in the loop's blocks but semantically part of its analysis scope.
Phase 2 -- Exit block processing (0xDE29D9). Enumerates exit blocks via sub_AE6EC0 and processes their AddRec chains. For each exit, reads the chain at [exit+0x30] & ~7 (stripping tag bits), checks expression kind byte (range 0x1E--0x28), and extracts operands. For the common case of simple {start, +, step} two-operand recurrences, an early-out stops after processing 2 operands when ExtraFlag != 0. If the loop has exactly 2 exits and ExtraFlag >= qword_4F88DC8 (a global threshold for maximum exit analysis depth), deep exit analysis is skipped entirely.
Phase 3 -- Expression dependency analysis (0xDE2BC5). The core invalidation loop. Iterates the collected values in reverse order and builds a transitive closure of all dependent SCEV expressions. Uses a stack-based worklist (SmallVector, inline capacity 8) and a SmallDenseSet for visited tracking. The dependency walk dispatches on expression type:
Type 0x52 ('R' = AddRec): Follow Start and Step operands via getSCEV,
compare ranges with getRangeRef
Type 0x56 ('V' = variant): Check function pointer equality at [-0x60]
and [+0x08], follow if simple recurrence
Type 0x39/0x3A (flagged): Check bit 6 of flags byte, follow base pointer
or compute canonical form from 27-bit index
General: Follow underlying object, check for pointer
types (0x11/0x12), verify integer type
Phase 4 -- Primary cache eviction (0xDE2DFF). For each expression identified by Phase 3, looks it up in the ValueExprMap, computes both unsigned and signed ranges via getRangeRef (sub_DBB9F0), compares old and new ranges via ConstantRange::contains (sub_AB1BB0), and clears validity bits in the range cache ([entry+0x20] for unsigned, [entry+0x21] for signed). Wide APInt buffers (>64 bits) are freed through __libc_free.
Phase 5 -- BTC eviction (0xDE3D2F). For each collected value, looks it up in the BTC hash table. On hit: writes TOMBSTONE, decrements entry count, increments tombstone count, then calls forgetMemoizedResults (sub_DE2690) to recursively invalidate any expressions that depended on this backedge-taken count. Also evicts the corresponding predicated BTC entry from the secondary table.
Phase 6 -- AddRec folding cache cleanup (0xDE3230). For AddRec expressions (type 0x52), invalidates pre-computed folding results. Extracts the 6-bit opcode from [expr+2] & 0x3F and dispatches:
- Opcode 0x20 (shift/power-of-two multiply): checks via countPopulation whether the step is a power of two, then calls tryFoldAddRecWithStep (sub_DCFD50)
- Opcodes 0x22--0x29 (binary operations): constructs the appropriate folded expression per operation type and marks it for invalidation
- Opcode 0x24 with pointer type (0x0E): skips pointer-integer cast invalidation
Phase 7 -- Predicate and assumption cleanup (0xDE3856). Processes the predicate hash table via the loop object's fields. Performs range intersection (sub_AB0910), union (sub_AB0A00), and emptiness/fullness checks (sub_AAFBB0, sub_AAF760). If the resulting range is neither empty nor full, stores the updated BTC in the loop's entry.
Phase 8 -- Final output (0xDE3CCD). Writes 0x0101 to loop->flags[+0x20], marking the loop as SCEV-forgotten (bit 0 = primary cache invalidated, bit 8 = secondary cache invalidated). Frees heap-allocated collection and output buffers.
forgetValue and forgetAllLoops
forgetValue (sub_D9EE30, ~9 KB) performs single-value eviction. It removes the value's entry from the ValueExprMap, then walks all expressions that transitively depend on it and evicts those as well. Used when a single instruction is replaced (RAUW) or deleted.
forgetAllLoops (sub_D9D700, ~8 KB) iterates every loop in the function's LoopInfo and calls forgetLoop for each one. Used when the entire function's loop structure changes (e.g., after inlining or full function cloning).
Which Passes Trigger Invalidation
forgetLoop is called after these loop transformations:
| Pass | Why invalidation is needed |
|---|---|
| LoopUnroll | Trip counts change; unrolled body has different IVs |
| LoopVectorize | Widened types; vector IVs replace scalar ones |
| LoopPeeling | Peeled iterations change the start value of recurrences |
| LoopUnswitching | Exit conditions change; control flow restructured |
| LICM | Hoisted values have new SCEV forms outside the loop |
| LoopSimplify | Preheader/exit block insertion changes loop structure |
| LoopRotate | Header/latch swap requires BTC recomputation |
| LoopDistribute | Original loop split into multiple loops |
| LoopIdiomRecognize | Pattern replacement changes loop body |
| LoopIndexSplit (NVIDIA) | IV range split into subranges |
| MemorySpaceOpt (NVIDIA) | Address space changes invalidate pointer SCEVs |
The DepthFlag parameter controls the aggressiveness of invalidation: 0 does shallow invalidation (only direct loop values), 1 follows all dependency chains, and values >1 impose a bounded depth useful for performance in deeply nested loops. The Visited parameter (a SmallDenseSet*) prevents infinite cycles when nested loops have mutual SCEV dependencies.
The forget-scev-loop-unroll knob (boolean) controls whether SCEV cache is invalidated after unrolling -- disabling it is unsound but can be used for compile-time experimentation.
Delinearization
The Problem
CUDA kernels routinely access multi-dimensional arrays:
float val = A[blockIdx.x * BLOCK_H + threadIdx.y][blockIdx.y * BLOCK_W + threadIdx.x];
By the time this reaches LLVM IR, the address computation has been flattened:
%addr = getelementptr float, ptr %A, i64 %flat_idx
; where %flat_idx = (blockIdx.x * BLOCK_H + threadIdx.y) * N + (blockIdx.y * BLOCK_W + threadIdx.x)
SCEV sees this as a single polynomial. Delinearization recovers the per-dimension subscripts, which are essential for:
- Coalescing analysis: determining whether adjacent threads (threadIdx.x, threadIdx.x+1, ...) access adjacent memory addresses (coalesced) or strided addresses (uncoalesced). This requires isolating the dimension where threadIdx.x appears.
- Shared memory bank conflict detection: 32 banks, 4-byte stride. Knowing whether the innermost subscript is threadIdx.x (conflict-free) vs. threadIdx.x * stride (potential conflicts) requires dimensional decomposition.
- Dependence analysis: per-dimension dependence tests (Banerjee, GCD, MIV) are more precise than whole-expression tests. Delinearized subscripts feed DependenceInfo for vectorization legality.
Delinearization Context
The delinearizer (sub_DE9D10) operates on a context object:
| Offset | Type | Field | Purpose |
|---|---|---|---|
+0x00 | ScalarEvolution* | SE | Parent SCEV context |
+0x08 | SCEV* | ElementSize | Innermost element size |
+0x10 | uint8_t | Flags | Bit 0: inline cache mode |
+0x18 | 64 bytes | InlineCache | 4-slot direct-mapped table (inline mode) |
+0x20 | uint32_t | Capacity | Heap table capacity (heap mode) |
+0x58 | SCEV* | TargetArrayPtr | Array being delinearized |
+0x60 | void* | PredicateCollector | Nullable; collects validity predicates |
+0x68 | SCEV* | StepRecurrence | AddRec step for innermost dimension |
The inline cache (4 slots of 16 bytes at +0x18) is a small-buffer optimization sized for the overwhelmingly common GPU case of 1D or 2D array accesses. Cache entries use the same (key >> 9) ^ (key >> 4) hash as all other SCEV tables.
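A sketch of how the recovered hash could drive a 4-slot direct-mapped table (the `InlineCache` layout here is a simplified model, not the exact 16-byte slot format; keys are SCEV pointers, so 0 can mark an empty slot):

```cpp
#include <cassert>
#include <cstdint>

// The recurring SCEV table hash recovered from the binary: the key
// (a pointer value) XOR-mixed with two shifted copies of itself.
inline uint64_t scevHash(uint64_t key) {
    return (key >> 9) ^ (key >> 4);
}

// Simplified 4-slot direct-mapped inline cache: the low hash bits pick
// one of four slots, and a colliding insert simply overwrites.
struct InlineCache {
    uint64_t keys[4] = {0, 0, 0, 0};   // 0 = empty (keys are non-null pointers)
    uint64_t vals[4] = {0, 0, 0, 0};
    void put(uint64_t k, uint64_t v) {
        unsigned slot = scevHash(k) & 3;   // 4 slots -> low 2 hash bits
        keys[slot] = k;
        vals[slot] = v;
    }
    bool get(uint64_t k, uint64_t &v) const {
        unsigned slot = scevHash(k) & 3;
        if (keys[slot] != k)
            return false;
        v = vals[slot];
        return true;
    }
};
```

A direct-mapped table with no probing is the right shape for a small-buffer optimization: lookups are two loads and a compare, and the 1D/2D-access common case rarely collides.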
The Recursive Delinearization Algorithm
sub_DE9D10 is a recursive function dispatching on 17 SCEV expression kinds via a jump table:
| Kind | Expression type | Handling |
|---|---|---|
| 0, 1, 16 | Constant, TruncateExpr (ident), Unknown | Leaf -- return unchanged |
| 2 | TruncateExpr | Recurse into inner, rebuild with getTruncateExpr |
| 3 | SignExtendExpr | Recurse; dimension discovery on AddRec step match |
| 4 | ZeroExtendExpr | Recurse; dimension discovery on AddRec step match |
| 5 | AddExpr | N-ary: delinearize each operand, rebuild with getAddExpr |
| 6 | MulExpr | N-ary: delinearize each factor, rebuild with getMulExpr |
| 7 | UDivExpr | Delinearize both operands, rebuild with getUDivExpr |
| 8 | AddRecExpr | N-ary with wrap flag preservation; critical path |
| 9--13 | SMax/UMax/SMin/UMin/SeqUMin | N-ary: delinearize operands, rebuild |
| 14 | PtrToIntExpr | Recurse into pointer, rebuild |
| 15 | GEP | Primary dimension discovery entry point |
The N-ary pattern. Cases 5, 6, 8--13 share a common template:
SmallVector<const SCEV*, 2> NewOps; // inline capacity 2
bool Changed = false;
for (auto *Op : Expr->operands()) {
const SCEV *NewOp = delinearize(Ctx, Op); // recursive
NewOps.push_back(NewOp);
if (NewOp != Op) Changed = true;
}
if (!Changed) return Expr; // pointer identity optimization
return rebuildExpr(SE, Kind, NewOps);
The "changed" flag enables pointer identity short-circuiting: if no operand was modified during recursion, the original expression pointer is returned without allocation.
AddRecExpr (case 8) is the most critical case for GPU code. Multi-dimensional array accesses manifest as nested AddRec expressions: {A[0][0], +, dim1}<outer_loop> wrapping {init, +, 1}<inner_loop>. The delinearizer preserves wrap flags (NSW/NUW/NW from bits [+0x1C] & 7) and the step value ([+0x30]) when reconstructing via getAddRecExpr (sub_DBFF60).
ZeroExtend/SignExtend (cases 3, 4) are secondary dimension discovery points. When the inner operand is an AddRec whose step matches Ctx->StepRecurrence (+0x68) and the AddRec has exactly 2 operands (the common {start, +, step} form), the handler extracts dimension information: it calls getElementSize (sub_D33D80) and getConstant (sub_DA4270) to compute the element count, then pushes a new term into the term collector at Ctx[+0x58]. This identifies a dimension boundary -- the extend operation wrapping a matching-step AddRec indicates the point where one array dimension ends and another begins.
GEP (case 15) is the primary entry for actual dimension discovery. It first checks the predicate collector (Ctx[+0x60]). If present, it searches the collector's table for a matching GEP index entry (type field == 1, matching scev_expr, operation == 0x20). If no predicate collector or no match, it falls back to structural delinearization via sub_DE97B0, which analyzes the GEP's index computation structure, iterates discovered terms, and classifies them by dimension type. Terms matching Ctx->StepRecurrence go to the direct collector; others go through the predicate collector's virtual dispatch (vtable[+0x10]).
Fixed-Point Iteration
The function itself is a single recursive pass, but its callers implement a fixed-point loop:
1. Initialize the context with an initial guess for dimension sizes
2. Call sub_DE9D10 to delinearize using those dimensions
3. During recursion, the GEP and extend handlers collect new dimension information into Ctx[+0x58] (term collector) and Ctx[+0x60] (predicate collector)
4. If collected dimensions differ from the initial guess, update and repeat from step 2
5. Terminate when dimensions stabilize or a maximum iteration count is exceeded
The memoization cache ensures unchanged sub-expressions are not recomputed across iterations.
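The caller-side fixed-point loop can be sketched as follows; `refineOnce` is a stand-in for one call into the delinearizer (here it just discovers divisors of known terms), and all names are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

using Dims = std::set<int64_t>;

// Stand-in for one delinearization pass: each call may discover new
// dimension terms (modeled here as divisors of already-known terms).
Dims refineOnce(const Dims &in) {
    Dims out = in;
    for (int64_t t : in)
        for (int64_t d = 2; d < t; ++d)
            if (t % d == 0)
                out.insert(d);
    return out;
}

// The fixed-point driver: repeat until the dimension set stabilizes or
// the iteration budget runs out.
Dims delinearizeFixedPoint(Dims dims, int maxIter = 8) {
    for (int i = 0; i < maxIter; ++i) {
        Dims next = refineOnce(dims);
        if (next == dims)
            break;                  // dimensions stabilized
        dims = std::move(next);
    }
    return dims;
}
```

The explicit iteration cap mirrors step 5 above: without it, pathological inputs where each pass keeps discovering terms would never converge.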
Parametric vs Fixed-Size Arrays
Upstream LLVM has the delinearize-use-fixed-size-array-heuristic knob (default: false). When the standard parametric delinearization fails -- typically because dimension sizes are runtime values with no SCEV relationship -- the fixed-size heuristic uses compile-time-known array dimensions from type metadata to guide decomposition.
cicc extends this with an alternative delinearization entry point at sub_147EE30 (25 KB), which applies additional heuristics controlled by at least 3 of the delinearization config globals (dword_4F9AB60, dword_4F9AE00, dword_4F9B340). This second path is likely NVIDIA-enhanced for cases common in GPU code, such as dynamically-allocated shared memory with dimensions derived from kernel launch parameters.
The dependence analysis subsystem has its own entry points into delinearization (sub_146F1B0 at 40 KB for delinearizeAccess, sub_146B5E0 at 18 KB for tryDelinearize) that combine delinearization with per-dimension dependence testing in a single pass.
GPU-Specific Delinearization Patterns
Thread grid indexing. The canonical GPU pattern threadIdx.x + blockIdx.x * blockDim.x produces an AddRec with step = blockDim.x (grid stride). The delinearizer recognizes this by matching the step recurrence against Ctx[+0x68]. When the step corresponds to a grid dimension, the subscript identifies which dimension of a multi-dimensional array is parallelized across the thread grid.
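A sketch of the coalescing test this enables (helper names are illustrative): once delinearization isolates the innermost subscript's coefficient on threadIdx.x, the check is whether adjacent lanes touch adjacent 4-byte elements.

```cpp
#include <cassert>
#include <cstdint>

// Byte address touched by lane `tid` for an access A[tid * elemStride],
// with 4-byte elements; elemStride is the threadIdx.x coefficient the
// delinearizer isolated in the innermost subscript.
inline int64_t addrOf(int tid, int64_t elemStride) {
    return int64_t(tid) * elemStride * 4;
}

// Coalesced iff every adjacent pair of lanes in the warp is exactly one
// element (4 bytes) apart.
bool isCoalesced(int64_t elemStride) {
    for (int lane = 0; lane + 1 < 32; ++lane)
        if (addrOf(lane + 1, elemStride) - addrOf(lane, elemStride) != 4)
            return false;
    return true;
}
```

This is exactly why the flat polynomial is useless on its own: the same byte addresses arise from many subscript decompositions, and only the per-dimension view exposes the threadIdx.x coefficient.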
Shared memory bank conflicts. For shared memory arrays, the delinearizer feeds into bank conflict analysis. Shared memory has 32 banks with 4-byte interleaving. If delinearization reveals A[threadIdx.y][threadIdx.x] with row stride 32 (or any multiple of 32), every thread in a warp hits the same bank -- a 32-way conflict. If the stride is relatively prime to 32, accesses are conflict-free. This analysis requires knowing per-dimension subscripts, which only delinearization can provide from the flat pointer arithmetic.
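The bank conflict rule above can be made concrete with the 32-bank, 4-byte-interleaved model (function names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Shared memory bank model: 32 banks, 4-byte interleaving.
inline int bankOf(int64_t byteAddr) {
    return static_cast<int>((byteAddr / 4) % 32);
}

// Worst-case conflict degree across one warp for A[threadIdx.x * stride]
// (stride in 4-byte elements): 1 = conflict-free, 32 = fully serialized.
int conflictDegree(int64_t stride) {
    int hits[32] = {0};
    int worst = 0;
    for (int lane = 0; lane < 32; ++lane) {
        int b = bankOf(int64_t(lane) * stride * 4);
        worst = std::max(worst, ++hits[b]);
    }
    return worst;
}
```

A row stride of 32 puts every lane in bank 0 (32-way conflict), while 33 is relatively prime to 32 and is conflict-free, which is why padding a shared array from `[32]` to `[33]` columns is a classic CUDA fix.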
Predicate collector polymorphism. The PredicateCollector at Ctx[+0x60] uses virtual dispatch (vtable[+0x10]), allowing different delinearization strategies to be plugged in:
- Standard delinearization for host code
- GPU-aware delinearization that considers shared memory bank geometry
- Coalescing-aware delinearization that checks whether the innermost subscript varies with
threadIdx.x
High-dimensional tensors. The term collector at Ctx[+0x58] is a growable SmallVector, supporting arrays with arbitrary dimensionality. This matters for tensor operations in CUDA (e.g., CUTLASS library patterns, which cicc special-cases elsewhere -- see the cutlass substring check in the dependence analysis region).
SCEV Term Collection
Before delinearization runs, collectParametricTerms (sub_DE8D20) walks the SCEV expression tree to extract candidate terms:
- SCEVAddRecExpr operands yield stride candidates (the step of each AddRec)
- SCEVUnknown and SCEVMulExpr nodes yield dimension-size candidates
- SCEVSignExtendExpr nodes are also collected (they often wrap dimension-related terms)
These candidates are passed to findArrayDimensions (sub_147B0D0) which uses product decomposition to determine which terms correspond to array dimensions. The resulting dimension list seeds the delinearization context before sub_DE9D10 is invoked.
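A simplified model of the product decomposition (the function name and stride encoding are illustrative): sorting the collected stride terms and taking ratios of adjacent strides yields the inner dimension sizes. Note the outermost extent never appears in any stride, so it is unrecoverable from the access alone.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Given collected stride terms (in elements), infer dimension sizes by
// product decomposition: each size is the ratio of adjacent strides,
// with an implicit innermost element stride of 1.
std::vector<int64_t> dimsFromStrides(std::vector<int64_t> strides) {
    std::sort(strides.begin(), strides.end(), std::greater<int64_t>());
    strides.push_back(1);                    // innermost element stride
    std::vector<int64_t> dims;
    for (size_t i = 0; i + 1 < strides.size(); ++i) {
        if (strides[i] % strides[i + 1] != 0)
            return {};                       // not a clean product: bail out
        dims.push_back(strides[i] / strides[i + 1]);
    }
    return dims;
}
```

For an access into a hypothetical `A[?][8][16]` the collected strides would be {128, 16}, and the decomposition recovers the inner extents 8 and 16 while the outermost `?` stays unbounded.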
Configuration
SCEV Invalidation Knobs
| Knob | Default | Effect |
|---|---|---|
forget-scev-loop-unroll | true | Enable SCEV invalidation after loop unrolling |
verify-scev | false | Verify SCEV consistency after transformations |
verify-scev-strict | false | Stricter verification (compare old/new trip counts) |
verify-scev-maps | false | Verify SCEV map consistency |
qword_4F88DC8 (max exit analysis depth) | unknown | Threshold beyond which deep exit analysis is skipped |
SCEV Analysis Depth Limits (shared with invalidation)
| Knob | Default | Effect |
|---|---|---|
scalar-evolution-max-iterations | 100 | Maximum loop iterations for constant evaluation |
scalar-evolution-max-scev-compare-depth | 32 | Maximum SCEV comparison recursion depth |
scalar-evolution-max-arith-depth | 32 | Maximum SCEV arithmetic simplification depth |
scalar-evolution-max-ext-depth | 8 | Maximum sign/zero-extend nesting depth |
scalar-evolution-max-cast-depth | 8 | Maximum cast chain depth |
scalar-evolution-max-constant-evolving-depth | 32 | Maximum constant evolution depth |
scalar-evolution-max-expr-size | 384 | Maximum expression node count |
scalar-evolution-max-expr-failures | 100 | Maximum SCEV creation failures before bailout |
scalar-evolution-max-scev-operations-implication-depth | 2 | Maximum depth for implications |
scalar-evolution-max-value-compare-depth | 2 | Maximum value comparison depth |
NVIDIA-Specific SCEV Knobs
| Knob | Effect |
|---|---|
aggressive-positive-stride-analysis | More aggressive positive-stride IV analysis (nvbug 3972412) |
do-sign-ext-simplify | Simplify SCEV sign-extend expressions |
do-sign-ext-expand | Expand sign-extends during SCEV construction |
track-trip-count-more | Track loop trip counts more aggressively |
scev-mulops-inline-threshold (32) | Max MulExpr operands before out-of-line |
scev-addops-inline-threshold (500) | Max AddExpr operands before out-of-line |
Delinearization Knobs
| Global | Likely identity | Notes |
|---|---|---|
byte_4F9A8C0 | Delinearization enable flag | Master enable for the delinearization subsystem |
dword_4F9A620 | Config 1 | Referenced by combined delinearize-and-test |
dword_4F9A700 | Config 2 | Referenced by delinearizeAccess core |
dword_4F9A7E0 | Config 3 | Referenced by delinearizeAccess core |
dword_4F9AB60 | Config 4 | Referenced by alternative delinearization v2 |
dword_4F9AC40 | Config 5 | Referenced by dependence distance with delinearization |
dword_4F9AE00 | Config 6 (shared) | Referenced by both combined-test and v2 paths |
dword_4F9B260 | Config 7 | Referenced by combined delinearize-and-test |
dword_4F9B340 | Config 8 | Referenced by alternative delinearization v2 |
da-delinearize | Try to delinearize array references | DependenceAnalysis pass knob (upstream LLVM) |
da-miv-max-level-threshold | MIV test depth limit | DependenceAnalysis pass knob (upstream LLVM) |
Function Map
Invalidation Functions
| Function | Address | Size | Role |
|---|---|---|---|
ScalarEvolution::forgetLoop | sub_DE2750 | 10,051 B | 8-phase loop invalidation |
ScalarEvolution::forgetValue | sub_D9EE30 | ~9 KB | Single-value eviction |
ScalarEvolution::forgetAllLoops | sub_D9D700 | ~8 KB | Invalidate all loops |
forgetMemoizedResults | sub_DE2690 | small | Recursive BTC invalidation helper |
ScalarEvolution::verify | sub_DE5FA0 | ~52 KB | Debug verification (old/new trip count comparison) |
| Loop invalidation helper | sub_DE5640 | ~178 lines | Helper for forgetLoop |
| SCEV expression invalidator | sub_DCE1C0 | small | Callback for AddRec folding cleanup |
Delinearization Functions
| Function | Address | Size | Role |
|---|---|---|---|
ScalarEvolution::delinearize | sub_DE9D10 | 3,614 B | Recursive delinearizer (17-case switch) |
collectParametricTerms | sub_DE8D20 | ~521 lines | Term extraction before delinearization |
| Structural GEP delinearization | sub_DE97B0 | small | Sub-analysis called from GEP case |
canonicalizeExpr | sub_D9ABD0 | small | SCEV normalization |
computeAccessFunctions | sub_D94080 | ~12 KB | Access function computation |
SCEV_delinearize (dependence region) | sub_CF5550 | 6,276 B | Alternate copy in dependence analysis |
Dependence Analysis Delinearization
| Function | Address | Size | Role |
|---|---|---|---|
delinearizeAccess | sub_146F1B0 | 40 KB | Core delinearization for dependence analysis |
tryDelinearize | sub_146B5E0 | 18 KB | Delinearization attempt with fallback |
| Delinearize subscript | sub_1472640 | 10 KB | Per-subscript extraction |
| Array dimension inference | sub_1473850 | 12 KB | Infers dimensions from access patterns |
collectSubscripts | sub_1476060 | 22 KB | Multi-dimensional GEP subscript collection |
| Dependence distance with delinearization | sub_14747F0 | 15 KB | Computes dependence vectors using delinearized subscripts |
findArrayDimensions | sub_147B0D0 | 11 KB | Dimension sizes from SCEV product decomposition |
| Combined delinearize-and-test | sub_147C070 | 34 KB | Delinearize + per-dimension dependence test |
| Alternative delinearization v2 | sub_147EE30 | 25 KB | NVIDIA-enhanced heuristics |
| Partial result combiner | sub_147DF40 | 11 KB | Combines partial delinearization results |
Key SCEV Callees (shared by both subsystems)
| Function | Address |
|---|---|
getRangeRef -- range computation | sub_DBB9F0 |
ConstantRange::contains | sub_AB1BB0 |
ConstantRange::intersectWith | sub_AB0910 |
ConstantRange::unionWith | sub_AB0A00 |
ConstantRange::isEmptySet | sub_AAFBB0 |
ConstantRange::isFullSet | sub_AAF760 |
getSCEV -- expression resolution | sub_DD8400 |
tryFoldAddRecWithStep | sub_DCFD50 |
getAddExpr (N-ary) | sub_DC7EB0 |
getMulExpr (N-ary) | sub_DC8BD0 |
getAddRecExpr | sub_DBFF60 |
getUDivExpr | sub_DCB270 |
getZeroExtendExpr | sub_DC5000 |
getSignExtendExpr | sub_DC2B70 |
getTruncateExpr | sub_DC5200 |
getPtrToIntExpr | sub_DD3A70 |
DominatorTree::dominates | sub_B19D00 |
SmallDenseSet::insert | sub_C8CC70 |
| Cache insert (delinearization result memoization) | sub_DB11F0 |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Delinearization purpose | Optimize for cache locality; multi-dimensional subscript recovery for polyhedral analysis | Optimize for memory coalescing: recover subscripts to determine whether adjacent warp threads access adjacent addresses |
| Invalidation triggers | Standard loop transformations (unroll, vectorize, simplify) | Additional triggers from NVIDIA-specific passes: MemorySpaceOpt (address space transformations), IV Demotion (narrowing changes SCEV types), NVLoopStrengthReduce |
| Delinearization result caching | No explicit memoization in upstream | Memoization cache via sub_DB11F0 prevents redundant delinearization of the same GEP across multiple consumers |
| Thread index awareness | No concept of thread-index-based access patterns | Delinearized subscripts are analyzed against threadIdx dimensions to determine coalescing quality; feeds into vectorization and LSR decisions |
forget-scev-loop-unroll knob | Present in upstream LLVM | Same knob, but more critical on GPU because over-invalidation forces expensive SCEV recomputation on deeply nested kernel loops |
| Range source diversity | Profile data, programmer assertions (__builtin_assume) | Additional sources: !range metadata from nvvm-intr-range, __launch_bounds__, warpSize constant, special register bounded ranges |
Cross-References
- ScalarEvolution Overview & Construction -- SCEV expression creation, the ValueExprMap, and the expression DAG structure that invalidation walks
- SCEV Range Analysis & Trip Counts -- range caches and BTC caches that invalidation must clear; the getRangeRef and BTC computation functions called during eviction
- LoopVectorize & VPlan -- primary consumer of delinearization results for vectorization legality; calls forgetLoop after vectorizing
- Loop Unrolling -- calls forgetLoop after unrolling; the forget-scev-loop-unroll knob controls this
- Loop Strength Reduction (NVIDIA) -- uses SCEV for IV analysis; its transformations trigger forgetValue calls
- MemorySpaceOpt -- NVIDIA-specific pass that triggers SCEV invalidation after address space transformations
- Alias Analysis & NVVM AA -- delinearization results feed into alias analysis for disambiguating multi-dimensional array accesses
Loop Optimization Passes
Loop optimization is the single most performance-sensitive area of the cicc pipeline. On an NVIDIA GPU, the constraints are fundamentally different from CPU: register pressure dominates (every additional register per thread reduces SM occupancy), memory coalescing replaces cache locality as the primary memory optimization target, and warp divergence caused by loop-carried control flow destroys SIMT efficiency. NVIDIA's cicc v13.0 addresses these constraints by shipping a mix of stock LLVM loop passes, LLVM passes with GPU-specific threshold overrides, and fully proprietary loop transformations -- all orchestrated through a carefully ordered pipeline where the position of each pass reflects hard-won engineering tradeoffs between register pressure, instruction count, and memory access patterns.
This page provides the big-picture view of loop optimization in cicc: what passes exist, how they are ordered, what analyses they share, and why the ordering matters for GPU targets. Each pass links to a dedicated sub-page with full algorithmic detail.
Why Loop Optimization Is Different on GPU
Four properties of the GPU execution model distinguish GPU loop optimization from the CPU case that upstream LLVM targets:
Register pressure is the primary constraint. Every loop transformation that increases live values (unrolling, vectorization, LICM hoisting) must be evaluated against the SM's register budget and its discrete occupancy cliffs -- adding one register can drop occupancy by a full warp group. CPU compilers never face this tradeoff.
Memory coalescing replaces cache line optimization. Loop transformations that improve stride-1 access patterns (interchange, vectorization) improve coalescing; transformations that increase the number of live pointers (unrolling, distribution) may degrade it by interleaving access streams.
No out-of-order execution. Warps execute instructions in program order; the only latency-hiding mechanism is warp-level multithreading. Unrolling creates ILP within a single warp by exposing independent instructions that the ptxas backend can interleave, but the benefit is bounded by the register pressure cost.
Address space semantics. GPU memory is partitioned into address spaces with different pointer widths, hardware addressing modes, and performance characteristics. Loop passes that rewrite address computations (LSR, IndVarSimplify) must respect these distinctions -- strength-reducing a 32-bit shared memory pointer into 64-bit generic form defeats the backend's ability to emit efficient .shared:: instructions.
Pipeline Ordering
The loop passes execute within the main optimization pipeline assembled by sub_12E54A0. The ordering below reflects the Tier 1/2/3 optimization path (the normal path for -O1 and above). Passes marked with (N) are NVIDIA-specific or have significant NVIDIA modifications; unmarked passes are stock LLVM with at most threshold overrides.
LoopSimplify + LCSSA (canonicalization)
|
v
LoopRotate (do-while canonical form)
|
v
LICM (hoist) (move invariants out)
|
v
LoopIndexSplit **(N)** (split index-dependent branches)
|
v
IndVarSimplify **(N)** (canonicalize IVs, LFTR)
|
v
LoopIdiomRecognize (memcpy/memset/mismatch idioms)
|
v
LoopDistribute (fission for vectorization)
|
v
LoopVectorize **(N)** (widen scalar loops to v2/v4)
|
v
LoopUnroll **(N)** (replicate body, GPU-tuned)
|
v
LoopInterchange (swap nest levels for coalescing)
|
v
IRCE (range check elimination)
|
v
NVLoopStrengthReduce **(N)** (NVIDIA custom LSR solver)
|
v
LoopDeletion (remove dead loops)
|
v
LoopSink / LICM (sink) (demote unprofitable hoists)
Several passes appear more than once. LICM runs in both hoist and sink mode. LoopUnroll has an early invocation in the main pipeline and a late invocation gated by opts[1360] (nv-disable-loop-unrolling). IndVarSimplify runs before vectorization to canonicalize induction variables, then again after unrolling to clean up newly exposed IVs. LoopSimplify and LCSSA are implicit -- they run as required analyses whenever any loop pass requests them, ensuring loops remain in canonical form throughout.
The ordering reflects a deliberate strategy: canonicalize first (LoopSimplify, LoopRotate, IndVarSimplify), transform for parallelism (LoopDistribute, LoopVectorize, LoopInterchange), replicate for ILP (LoopUnroll), and clean up addressing (LSR, LoopDeletion, LoopSink). Reordering these passes produces measurably different code: running LSR before LoopVectorize would pollute the cost model with strength-reduced IVs that confuse SCEV; running LoopUnroll before LoopVectorize would prevent vectorization of unrolled-but-still-vectorizable loops.
LoopPassManager Structure
cicc uses the LLVM New Pass Manager's LoopPassManager infrastructure. Loop passes are grouped inside a FunctionPassManager that contains a LoopToFunctionPassAdaptor wrapping the LoopPassManager. The adaptor iterates over all loops in the function in reverse post-order of the loop forest (innermost first), running the full sequence of loop passes on each loop before moving to the next.
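The innermost-first visit order can be modeled with a recursive walk over a toy loop forest (the `Loop` struct here is a stand-in, not LLVM's `Loop` class): children are fully processed before their parent.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal model of the adaptor's visit order over a loop forest:
// inner loops are processed before the loops that contain them.
struct Loop {
    std::string name;
    std::vector<Loop> subLoops;
};

void visitInnermostFirst(const Loop &L, std::vector<std::string> &order) {
    for (const Loop &Sub : L.subLoops)
        visitInnermostFirst(Sub, order);
    order.push_back(L.name);        // parent only after all of its children
}
```

Innermost-first ordering matters because transforming an inner loop (e.g. vectorizing it) changes the trip counts and structure that passes on the enclosing loop will observe.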
The LoopStandardAnalysisResults struct is threaded through all loop passes, providing shared access to:
| Analysis | Typical Accessor | Purpose |
|---|---|---|
ScalarEvolution | AR.SE | Trip counts, strides, value ranges |
LoopInfo | AR.LI | Loop structure, nesting depth |
DominatorTree | AR.DT | Dominance queries for code motion |
AssumptionCache | AR.AC | __builtin_assume facts |
TargetTransformInfo | AR.TTI | Cost model, addressing modes |
MemorySSA | AR.MSSA | Memory alias queries for LICM/DSE |
AAResults | AR.AA | Alias analysis chain |
Passes that structurally modify loops (LoopUnroll, LoopDistribute, IRCE) call LPMUpdater::markLoopAsDeleted() or LPMUpdater::addSiblingLoops() to inform the pass manager of changes. SCEV is invalidated per-loop via SE.forgetLoop() after any transformation that changes the loop's backedge-taken count.
Complete Pass Inventory
The table below lists every loop pass present in cicc v13.0 with its pipeline position, NVIDIA modification status, and primary function address.
| Pass Name | Pipeline Position | NVIDIA Modified | Entry Address | Status |
|---|---|---|---|---|
loop-simplify | Infrastructure (on demand) | No | stock LLVM | Canonicalizes loop form |
lcssa | Infrastructure (on demand) | No | stock LLVM | Ensures loop-closed SSA |
loop-rotate | Early, before LICM | No | stock LLVM | Converts to do-while form |
licm | Early (hoist) + Late (sink) | Threshold only | stock LLVM | Invariant code motion |
loop-index-split | After LICM, before IndVars | Yes (proprietary) | sub_2CBEC60 (New PM) | Splits index-dependent branches |
indvars | Before vectorize | Yes (3 knobs) | sub_19489B0 | IV canonicalization + LFTR |
loop-idiom | Before distribute | No | stock LLVM | Memcpy/memset/mismatch recognition |
loop-distribute | Before vectorize | Threshold only | sub_1A8CD80 | Loop fission for vectorization |
loop-vectorize | Main loop slot | Yes (cost model) | sub_2AF1970 | Vectorize inner loops to v2/v4 |
loop-unroll | After vectorize (x2) | Yes (decision engine) | sub_19BE360 | Replicate loop body |
loop-interchange | After unroll | Threshold only | sub_1979A90 | Swap loop nest levels |
irce | After interchange | No | sub_194D450 | Range check elimination |
loop-reduce | Late, after unroll | Yes (complete rewrite) | sub_19CE990 (NV wrapper) | Strength reduction for GPU |
loop-deletion | Late | No | stock LLVM | Remove dead/empty loops |
loop-sink | Late | No | stock LLVM | Sink invariants back into loops |
loop-instsimplify | Utility | No | stock LLVM | Simplify instructions in loops |
loop-flatten | Utility | No | stock LLVM | Flatten nested counted loops |
loop-guard-widening | Utility | No | stock LLVM | Widen loop guards |
loop-predication | Utility | No | stock LLVM | Predicate unswitched loops |
loop-reroll | Utility | No | stock LLVM | Reverse unrolling (rarely used) |
Passes marked "Utility" are registered in the pipeline infrastructure but are not part of the default optimization sequence -- they are available for explicit pipeline specification via -mllvm -passes=....
Pass Descriptions and Sub-Page Links
Canonicalization Passes
LoopSimplify and LCSSA run on demand before any loop transformation pass executes. LoopSimplify ensures each loop has a single preheader, a single backedge (latch), and dedicated exit blocks. LCSSA (Loop-Closed SSA) ensures that values defined inside a loop and used outside it pass through PHI nodes at loop exit blocks. These are stock LLVM utilities with no NVIDIA modifications. Together they establish the invariants that all subsequent loop passes depend on.
LoopRotate converts a loop from while-form (while (cond) { body }) to do-while form (do { body } while (cond)). This creates a single-entry loop body and moves the exit test to the latch, which is the canonical form expected by SCEV, LoopVectorize, and LoopUnroll. Stock LLVM, no NVIDIA modifications.
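The rotation is easiest to see in source form; this sketch shows the two equivalent shapes, with the guard LoopRotate inserts to protect the first iteration:

```cpp
#include <cassert>

// While-form: the exit test sits at the top of the loop.
int sumWhile(int n) {
    int i = 0, s = 0;
    while (i < n) { s += i; ++i; }
    return s;
}

// Rotated do-while form: a guard (hoisted copy of the exit test)
// protects the first iteration, and the test moves to the latch.
int sumRotated(int n) {
    int i = 0, s = 0;
    if (i < n) {
        do { s += i; ++i; } while (i < n);
    }
    return s;
}
```

With the test at the latch, the loop body becomes single-entry and the backedge condition is directly analyzable, which is what SCEV and the downstream loop passes expect.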
NVIDIA-Custom Loop Passes
Loop Index Split is a revived and heavily reworked version of a pass removed from upstream LLVM 3.0. It splits loops when the loop body contains a condition that depends on the induction variable (e.g., if (i == K)), producing two or three loops where each has a uniform body. On GPU, this eliminates warp divergence caused by index-dependent branches. The pass implements three transformation modes: all-but-one peel (for i == K), only-one collapse (for nearly-empty special iterations), and full range split (for i < K vs i >= K). Proprietary, no upstream equivalent.
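A source-level sketch of the all-but-one-peel mode (the bodies and constants are illustrative): the index-dependent branch is replaced by two branch-free loops plus one peeled iteration.

```cpp
#include <cassert>

// Divergent form: every iteration carries an index-dependent branch,
// which on GPU costs warp divergence on each trip.
int divergent(int n, int K) {
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (i == K) ? 100 : 1;
    return acc;
}

// After an "all-but-one peel" style split: two uniform loops plus one
// peeled special iteration -- no branch left inside either loop body.
int split(int n, int K) {
    int acc = 0;
    for (int i = 0; i < K && i < n; ++i)
        acc += 1;                              // uniform body, i < K
    if (K >= 0 && K < n)
        acc += 100;                            // peeled iteration i == K
    for (int i = (K < 0 ? 0 : K + 1); i < n; ++i)
        acc += 1;                              // uniform body, i > K
    return acc;
}
```

The guards on the peeled iteration and the second loop's start point are what make the split safe when K falls outside [0, n).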
IndVarSimplify (NVIDIA) is upstream LLVM's induction variable canonicalization pass with three NVIDIA-specific extensions: Disable-unknown-trip-iv (bool, qword_4FAF520) -- bypasses the pass entirely when SCEV cannot compute the trip count, preventing aggressive IV transforms on warp-divergent loops; iv-loop-level (int, default 1, qword_4FAF440) -- restricts the pass to loops at a maximum nesting depth to control compile time on deeply nested stencil kernels; and disable-lftr (bool, byte_4FAF6A0) -- disables Linear Function Test Replace when the IV canonicalization would increase register pressure.
LoopVectorize (GPU-Adapted) is the largest single pass in the cicc loop pipeline (88 KB). On GPU, vectorization means generating ld.v2/ld.v4 wide loads rather than filling SIMD lanes. The pass builds VPlans, selects VF through a GPU-aware cost model that penalizes register pressure, and caps VF at 4 for most GPU targets. Scalable vectors are always disabled. The pass includes an outer-loop vectorization path (rarely triggered on GPU) and an inner-loop path (the main code path).
Loop Unrolling (GPU-Tuned) ships a substantially reworked computeUnrollCount decision engine with GPU heuristics: a local-array threshold multiplier that aggressively unrolls loops over __shared__ arrays, power-of-two factor enforcement, a pragma threshold 200x larger than stock LLVM, and a register-pressure-aware cost model. The transformation engine is lightly modified upstream UnrollLoop. The pass runs twice: once in the main pipeline, once as a late cleanup.
NVLoopStrengthReduce (NVIDIA Custom) is the most GPU-specific LLVM pass in cicc. NVIDIA ships a complete replacement formula solver (160 KB, 2688 lines) with 11 custom knobs controlling register pressure checking, address-space-aware formula selection, sign-extension optimization, and 64-bit IV handling. The stock LLVM LSR remains in the binary but the NVIDIA overlay replaces the formula generation and selection phases.
Standard Loop Passes (Threshold Overrides Only)
LICM (Loop-Invariant Code Motion) hoists loop-invariant computations above the loop and sinks them below it. On GPU, LICM's hoist mode must be conservative: hoisting increases register pressure in the loop preheader, which may push past occupancy cliffs. The sink mode (running later) undoes unprofitable hoists. Stock LLVM with NVIDIA-tuned thresholds.
LoopInterchange swaps the nesting order of a perfectly-nested loop pair when doing so improves memory access locality. In cicc, the threshold loop-interchange-threshold (dword_4FB07E0) defaults to 0, meaning interchange is only performed when the net locality benefit is non-negative AND parallelism improves. The pass has a 100-pair dependence limit (0x960 bytes) as a compile-time safety valve. There is no visible CUDA-specific memory space awareness -- the standard LLVM stride-1 locality model applies uniformly. See the standard loop passes page for details.
IRCE (Inductive Range Check Elimination) splits a loop into preloop/mainloop/postloop regions, eliminating range checks from the mainloop where the induction variable is provably within bounds. The implementation is stock LLVM with no visible NVIDIA modifications. Configuration globals include a block count threshold (dword_4FB0000), a debug flag (byte_4FAFE40), and a "constrained" relaxation mode (byte_4FAFBA0) that handles slightly non-canonical range checks common in GPU thread-coarsened loops.
LoopDistribute (loop fission) splits a single loop into multiple loops to separate unsafe memory dependences from safe ones, enabling LoopVectorize to vectorize the safe partition. Stock LLVM algorithm. The SCEV runtime check threshold (qword_4FB5480) is likely GPU-tuned. The pass runs before LoopVectorize in the pipeline.
LoopIdiomRecognize detects loops that implement common patterns (byte-by-byte copy, memset, mismatch search, string search) and replaces them with optimized multi-block IR or library calls. The expansion routines generate vectorized mismatch detection (sub_2AA00B0, 48 KB) and vectorized first-occurrence string search (sub_2AA3190, 40 KB), both with page-boundary-safe masked vector loads. Stock LLVM pass; the expansion quality benefits GPU targets where wide loads are profitable.
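The simplest of these idioms in source form (a sketch of the transformation's effect, not the pass's actual emission): a stride-1 store of a loop-invariant byte collapses to a guarded memset.

```cpp
#include <cassert>
#include <cstring>

// The canonical fill loop the pass detects: a stride-1 store of a
// loop-invariant byte value.
void fillLoop(unsigned char *p, unsigned char v, int n) {
    for (int i = 0; i < n; ++i)
        p[i] = v;
}

// Its idiom replacement: one memset, guarded by the trip count so the
// zero-trip case stays a no-op.
void fillIdiom(unsigned char *p, unsigned char v, int n) {
    if (n > 0)
        std::memset(p, v, static_cast<std::size_t>(n));
}
```

The trip-count guard is essential: the original loop executes zero stores when n <= 0, and the replacement must preserve that.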
LoopDeletion removes loops proven dead (no observable side effects). Stock LLVM. LoopSink moves loop-invariant operations that were hoisted by LICM back into the loop body when doing so reduces register pressure -- particularly valuable on GPU where the register pressure tradeoff is acute.
Loop Analysis Infrastructure
All loop passes share three core analysis frameworks.
ScalarEvolution (SCEV)
SCEV models how values evolve across loop iterations. Every loop pass depends on it for trip count computation, stride analysis, and value range queries. cicc ships an LLVM 20.0.0-based SCEV with three NVIDIA extensions: a complexity control system (simple_mode) that prevents unbounded analysis time, GPU-specific SCEV sources that inject thread index bounds, and recognition of CUDA loop idioms (warp-stride, grid-stride). See ScalarEvolution Overview, Range Analysis & Trip Counts, and Invalidation & Delinearization.
LoopInfo
LoopInfo provides the loop forest structure: which basic blocks belong to which loops, nesting depth, header/latch/exit identification. It is the primary structural query interface for all loop passes. Stock LLVM, no NVIDIA modifications.
DependenceInfo
DependenceInfo computes memory dependence direction vectors between instruction pairs across loop iterations. LoopInterchange and LoopDistribute are its primary consumers. The analysis uses SCEV to classify dependences as forward (<), backward (>), equal (=), scalar (S), independent (I), or unknown (*). Direction vectors drive the legality checks for loop interchange (no reversed backward-carried dependences after swap) and loop distribution (which instructions must stay in the same partition).
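A simplified sketch of how distances map to direction symbols and how a direction vector gates interchange (the legality rule shown -- leftmost non-'=' must be '<' -- is the standard textbook condition, condensed here):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Per-dimension direction from the dependence distance: the sink reads
// what the source wrote `dist` iterations earlier at that loop level.
char directionOf(long long dist) {
    if (dist > 0) return '<';   // forward: carried to a later iteration
    if (dist < 0) return '>';   // backward
    return '=';                 // loop-independent at this level
}

// One symbol per loop level, outermost first.
std::string directionVector(const std::vector<long long> &dists) {
    std::string dv;
    for (long long d : dists)
        dv += directionOf(d);
    return dv;
}

// A dependence is legal iff the leftmost non-'=' direction is '<'.
// Interchange legality: the swapped vector must still be legal.
bool legalVector(const std::string &dv) {
    for (char c : dv) {
        if (c == '<') return true;
        if (c == '>') return false;
    }
    return true;   // all '=': loop-independent
}
```

For example, a dependence with distances (1, -1) has direction vector "<>"; swapping the two levels yields "><", which is illegal, so interchange is rejected.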
Loop-Related Knobs Summary
The following table consolidates all loop-pass-specific configuration knobs discovered in cicc v13.0. These are controllable via -mllvm -<knob>=<value>.
| Knob | Pass | Type | Default | Effect |
|---|---|---|---|---|
| Disable-unknown-trip-iv | IndVarSimplify | bool | false | Skip IV canonicalization for unknown-trip loops |
| iv-loop-level | IndVarSimplify | int | 1 | Max nesting depth for IV simplification |
| disable-lftr | IndVarSimplify | bool | false | Disable Linear Function Test Replace |
| replexitval | IndVarSimplify | enum | 1 (cheap) | Exit value replacement strategy: 0=never, 1=cheap, 2=always |
| indvars-widen-indvars | IndVarSimplify | bool | true | Allow IV widening to eliminate sign/zero extension |
| loop-interchange-threshold | LoopInterchange | int | 0 | Minimum net locality improvement for interchange |
| vectorize-loops | LoopVectorize | bool | true | Master vectorization enable |
| enable-early-exit-vectorization | LoopVectorize | bool | false | Allow vectorization of early-exit loops |
| force-vector-width-outer | LoopVectorize | bool | false | Force VF=4 for outer loops |
| nv-disable-loop-unrolling | LoopUnroll | bool | false | Disable the late unroll invocation |
| disable-unknown-trip-lsr | NV LSR | bool | false | Skip LSR for unknown-trip loops |
| lsr-check-rp | NV LSR | bool | true | Enable register pressure checking in LSR |
| lsr-rp-limit | NV LSR | int | ~32-64 | Register pressure ceiling for LSR |
| filter-bad-formula | NV LSR | bool | true | NVIDIA custom formula filtering |
| do-lsr-64-bit | NV LSR | bool | arch-dep | Enable LSR for 64-bit IVs (false on sm_3x-5x) |
| count-sxt-opt-for-reg-pressure | NV LSR | bool | true | Credit sign-ext savings in cost model |
| lsr-sxtopt | NV LSR | bool | true | Fold sign-extensions into IV expressions |
| lsr-loop-level | NV LSR | int | 0 (all) | Restrict LSR to specific loop nesting depth |
| lsr-skip-outer-loop | NV LSR | bool | false | Skip outer loop IVs in nested loops |
| disable-lsr-for-sharedmem32-ptr | NV LSR | bool | false | Disable LSR for addrspace(3) pointers |
| disable-lsr-complexity-discount | NV LSR | bool | false | Disable complexity discount in cost model |
| irce-block-threshold | IRCE | int | varies | Max basic blocks before IRCE bails |
| enable-loop-distribute | LoopDistribute | bool | false | Force-enable distribution |
| loop-distribute-scev-check-threshold | LoopDistribute | int | varies | Max SCEV runtime checks allowed |
Cross-References
- Pipeline context: LLVM Optimizer -- two-phase compilation, tier dispatch, NVVMPassOptions
- Pipeline ordering: Pipeline & Pass Ordering -- complete pass registration table
- Vectorization: LoopVectorize & VPlan -- GPU-adapted vectorizer with full cost model
- Unrolling: Loop Unrolling -- decision cascade with GPU-specific heuristics
- Strength reduction: Loop Strength Reduction (NVIDIA) -- the most GPU-specific pass in cicc
- NVIDIA custom passes: Loop Index Split, NVVM Peephole
- SCEV infrastructure: ScalarEvolution Overview, Range Analysis & Trip Counts, SCEV Invalidation
- Standard loop passes: Standard Loop Passes -- IndVarSimplify, LoopInterchange, IRCE, LoopDistribute, LoopIdiomRecognize details
Standard Loop Passes
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
CICC v13.0 includes a full complement of LLVM loop transformation passes beyond the major ones (LoopVectorize, LoopUnroll, LICM, LSR) that have their own pages. This page covers the remaining loop passes: LoopInterchange, IRCE, IndVarSimplify, LoopDistribute, LoopIdiom, LoopRotate, LoopSimplify, and LCSSA. Most are stock LLVM with default thresholds, but IndVarSimplify carries three NVIDIA-specific knobs that materially change behavior on GPU code. LoopRotate appears multiple times in the pipeline as a canonicalization prerequisite for LICM and unrolling. The canonicalization trio -- LoopSimplify, LCSSA, and LoopRotate -- runs so frequently that it constitutes the backbone of loop pass infrastructure in cicc.
Barrier awareness. None of these 8 passes have explicit barrier (__syncthreads()) awareness. Barrier handling in cicc occurs through dedicated NVIDIA passes: Dead Barrier Elimination (sub_2C83D20) and convergence control token verification (sub_E35A10). The structural passes (LoopRotate, LoopSimplify, LCSSA) do not move instructions across basic blocks in ways that could reorder barriers. LoopInterchange and LoopDistribute could theoretically reorder barriers, but barriers in CUDA kernels typically occur outside perfectly-nested loop bodies (interchange) or create non-distributable loop bodies (distribution).
Occupancy interaction. None of the 8 passes interact with occupancy or register pressure directly. Occupancy-aware loop optimization occurs in LSR (register pressure tracking at a1+32128 with occupancy ceiling), LoopUnroll (TTI-based register pressure estimation), and register allocation. These 8 passes are IR-level transforms that run before register allocation.
Address space awareness. None of the 8 passes distinguish between addrspace(0) (generic), addrspace(1) (global), addrspace(3) (shared), or addrspace(5) (local). Only LSR has address space awareness via the disable-lsr-for-sharedmem32-ptr knob. This is a notable gap: LoopInterchange's cost model should ideally weight global memory coalescing higher than shared memory locality, and LoopDistribute could benefit from knowing that shared-memory and global-memory partitions have different cost characteristics.
LoopInterchange
Swaps the iteration order of a perfectly-nested loop pair to improve memory access locality. On GPUs, interchange can convert non-coalesced global memory accesses (strided across warps) into coalesced ones (consecutive addresses per warp), which is often the single largest performance lever for memory-bound kernels.
| Property | Value |
|---|---|
| Entry point | sub_1979A90 (69 KB) -- processLoopList |
| Legality checker | sub_1975210 (45 KB) |
| Dependence helper | sub_1978000 (37 KB) |
| Pass name | "loop-interchange" |
| Knob | loop-interchange-threshold at dword_4FB07E0, default 0 |
| Knob constructor | ctor_208 at 0x4E39E0 |
| NVIDIA delta | None -- stock LLVM algorithm and threshold |
Required analyses (from sub_19743F0): ScalarEvolution (unk_4F9A488), LoopInfoWrapperPass (unk_4F96DB4), DominatorTreeWrapperPass (unk_4F9E06C), AAResultsWrapperPass (unk_4F9920C), DependenceAnalysisWrapperPass (unk_4F98D2D), OptimizationRemarkEmitter (unk_4FB66D8), TargetTransformInfoWrapperPass (unk_4FB65F4), LoopAccessLegacyAnalysis (unk_4F99CB0). The pass preserves both DominatorTree and LoopInfo.
Algorithm. The pass collects the loop nest as a SmallVector by walking the single-subloop chain (enforcing the "perfectly nested" constraint -- each loop must have exactly one child). For nests with fewer than two levels, it returns immediately. It then builds direction vectors for every memory-dependence pair via DependenceInfo (sub_13B1040), encoding each dimension as one of < (forward), > (backward), = (equal), S (scalar), I (independent), or * (unknown). A hard bail-out fires if the number of dependence pairs exceeds 100 (0x960 bytes at 24 bytes per entry) -- a compile-time safety valve.
For each candidate pair from outermost inward, the decision pipeline runs five checks in sequence:
- Dependence safety -- any * or backward-carried dependence that would be reversed by interchange bails with remark "Dependence". The safety check uses two bitmasks: 0x803003 for valid direction combinations and 0x400801 for the "all equal-like before inner" pattern. A special case allows an inner > when all preceding levels are = or S (zero distance in those dimensions).
- Call instructions -- calls in the inner body that are not provably readonly intrinsics bail with "CallInst". The intrinsic check calls sub_1560260(callee, -1, 36) and sub_1560260(callee, -1, 57) for two classes of safe intrinsics.
- Tight nesting -- extra computation between the loops (non-PHI, non-terminator instructions) bails with "NotTightlyNested". Checks sub_15F3040 (extra computation), sub_15F3330 (volatile/atomic operations), and sub_15F2ED0 (calls with side effects).
- Exit PHI validation -- complex PHI nodes at the loop exit bail with "UnsupportedExitPHI". For each exit PHI, the pass walks the use chain checking operand count via (v287 & 0xFFFFFFF), verifying that each operand references the latch block and that sub_157F120 (hasLoopInvariantOperands) returns true.
- Cost model -- counts memory subscripts with stride in the inner vs. outer loop. Net cost = benefit - penalty. Interchange proceeds only if cost >= -threshold (default: >= 0) AND all direction vectors show a parallelism improvement (outer dimension becomes scalar/independent while inner becomes equal).
Cost model details. For each memory instruction (opcode byte 0x38 at offset -8), the pass extracts the subscript count via (*(_DWORD*)(instr-4) & 0xFFFFFFF) and calls sub_146F1B0(ScalarEvolution, operand) to get the SCEV expression. Strides are classified per-loop. Subscripts with stride in both loops are counted as penalties (ambiguous). The net cost is locality_benefit - locality_penalty. The parallelism override requires ALL direction vectors to have the outer dimension as S (83) or I (73) and the inner dimension as = (61) -- even a non-negative cost is rejected if this pattern fails, with remark "InterchangeNotProfitable".
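The profitability decision can be modeled as below. This is a hedged sketch under assumptions: the `'inner'`/`'both'` stride tags and the assignment of benefit vs. penalty are illustrative stand-ins for the SCEV stride classification; only the decision shape (net cost plus the all-vectors parallelism override) follows the text.

```python
# Hypothetical model of the interchange decision: net locality cost
# plus the parallelism override described above. Not decompiled logic.
def interchange_profitable(stride_tags, dir_vectors, threshold=0):
    benefit = stride_tags.count('inner')  # subscripts improved by the swap (assumed)
    penalty = stride_tags.count('both')   # ambiguous: stride in both loops
    cost = benefit - penalty
    # Override: every direction vector must have the outer dimension as
    # S or I and the inner dimension as '=' -- otherwise reject even a
    # non-negative cost ("InterchangeNotProfitable").
    parallel = all(v[0] in 'SI' and v[-1] == '=' for v in dir_vectors)
    return cost >= -threshold and parallel
```

With the default threshold of 0, any non-negative net cost passes the first gate, so the parallelism pattern is usually the binding constraint.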
Post-interchange bookkeeping. After transformation, the pass: (a) calls sub_1AF8F90 to update LCSSA form for inner loop first, then outer; (b) reruns legality check via sub_1975210 as a safety recheck after LCSSA updates; (c) swaps direction-vector columns and loop-list positions; (d) decrements indices to try the next pair inward. The TTI availability boolean at a1+192 (checked via sub_1636850) is passed to the LCSSA updater as its 4th argument, controlling rewrite aggressiveness.
GPU considerations. The cost model counts memory accesses generically via SCEV stride analysis. There is no visible special handling for address spaces (shared vs. global vs. texture). The standard "stride-1 is good" locality model applies uniformly. For a reimplementation targeting GPUs, you would want to weight global-memory accesses (addrspace 1) far more heavily than shared-memory accesses (addrspace 3), since shared memory has no coalescing requirement. The 100-pair dependence limit prevents the pass from even being considered for CUDA kernels with massive shared-memory access patterns (e.g., tiled matrix multiplication). The pass does not check for barriers -- perfectly-nested loops with __syncthreads() in the inner body would be blocked by the call-instruction check unless the barrier is lowered to an intrinsic classified as safe (which it is not).
IRCE (Inductive Range Check Elimination)
Splits a loop into pre/main/post regions so that inductive range checks (bounds checks on the induction variable) can be eliminated from the main loop body, which executes the vast majority of iterations.
| Property | Value |
|---|---|
| Entry point | sub_194D450 (71 KB) -- InductiveRangeCheckElimination::run |
| Pass name | "irce" |
| Block threshold | dword_4FB0000 -- max basic blocks before bail-out |
| Debug flag | byte_4FAFE40 -- prints "irce: looking at loop" |
| Constrained mode | byte_4FAFBA0 -- relaxes canonical-form requirements |
| SCEV verify | byte_4FAFC80 -- post-transform range verification |
| Metadata flag | byte_4FAFF20 -- propagate "irce.loop.clone" metadata |
| NVIDIA delta | Minimal -- stock algorithm, "constrained" mode may help GPU strided patterns |
Stack frame and signature. The function allocates ~0x960 bytes (2400 bytes) of local state. Signature: sub_194D450(void *this_pass, void *Loop, void *LoopAnalysisManager, void *LoopStandardAnalysisResults, void *LPMUpdater). Returns PreservedAnalyses by value.
Algorithm (8 phases).
Phase 1 -- Early validation. Extracts ScalarEvolution, DominatorTree, LoopInfo, and BranchProbabilityInfo from LoopStandardAnalysisResults. Loads block count threshold from dword_4FB0000 and bails if the loop exceeds it. Checks simplify form (single latch, single exit, proper preheader).
Phase 2 -- Range check discovery. IRCE scans conditional branches in the loop body for ICmp instructions comparing the induction variable against loop-invariant bounds. The ICmp predicate dispatch table:
| Predicate value | LLVM predicate | Range check kind |
|---|---|---|
| 0x20 (32) | SLT (signed less-than) | UPPER |
| 0x22 (34) | SGT (signed greater-than) | LOWER (swapped operands) |
| 0x24 (36) | SGE (signed greater-equal) | LOWER |
| 0x26 (38) | UGE (unsigned greater-equal) | LOWER |
| 0x28 (40) | ULT (unsigned less-than) | UPPER |
Each candidate is classified into one of four kinds:
- RANGE_CHECK_UNKNOWN = 0 (skip)
- RANGE_CHECK_LOWER = 1 (indvar >= lower_bound)
- RANGE_CHECK_UPPER = 2 (indvar < upper_bound)
- RANGE_CHECK_BOTH = 3 (lower <= indvar < upper)
The InductiveRangeCheck structure is 40 bytes (0x28), iterated with stride 0x28: Begin (SCEV, +0x00), Step (SCEV, +0x08), End (SCEV, +0x10), CheckUse (Use*, +0x18), Operand (Value*, +0x20), Kind (uint32, +0x24).
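The predicate dispatch can be sketched directly from the table above. The predicate byte values come from the decompilation; the Python mapping itself is an illustrative model, not cicc's code:

```python
# Sketch of phase 2's ICmp predicate -> range-check-kind dispatch.
UNKNOWN, LOWER, UPPER, BOTH = 0, 1, 2, 3   # Kind field values

PRED_TO_KIND = {
    0x20: UPPER,   # SLT: indvar <  bound
    0x22: LOWER,   # SGT: lower bound, swapped operands
    0x24: LOWER,   # SGE: indvar >= bound
    0x26: LOWER,   # UGE
    0x28: UPPER,   # ULT
}

def classify(pred_byte):
    """Return the range-check kind; unrecognized predicates are skipped."""
    return PRED_TO_KIND.get(pred_byte, UNKNOWN)
```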
Phase 3 -- Filtering and validation. Calls sub_1949EA0 (classifyRangeCheckICmp) to validate each candidate. A bitvector (allocated at [rbp+var_460]) tracks valid checks. The "constrained" relaxation flag (byte_4FAFBA0) routes to sub_1949670 (canHandleRangeCheckExtended), allowing range checks where the induction variable relationship is slightly non-canonical -- useful for GPU thread-coarsened loops with strided access patterns. Validation requires: constant step (+1 or -1), loop-invariant bounds, simplify form, and SCEV-computable trip count.
Phase 4 -- SCEV-based bound computation. For each valid check, computes the safe iteration range [safe_begin, safe_end) using SCEV. Calls sub_145CF80 (SCEV getConstant), sub_147DD40 (SCEV getAddRecExpr / max/min), and sub_3870CB0 (isSafeToExpandAt). If expansion safety fails, the check is abandoned.
Phase 5 -- Preloop creation. Calls sub_194C320 (createPreLoop, ~1200 bytes) to clone the loop for iterations [0, safe_begin). Creates basic blocks named "preloop" and "exit.preloop.at". The clone remaps instructions and PHI nodes, creates the branch from preloop exit to mainloop entry, and updates dominator tree and loop info.
Phase 6 -- Postloop creation. Calls sub_194AE30 (createPostLoop, ~1300 bytes) for iterations [safe_end, trip_count). Calls sub_1949270 (adjustSCEVAfterCloning) to refresh SCEV expressions invalidated by cloning.
Phase 7 -- Two-path splitting for BOTH checks. When kind=3, IRCE creates TWO separate cloning operations, producing three loop clones total. Both sub_194C320 and a second call produce pre/main/post regions with BOTH range checks eliminated from the center.
Phase 8 -- Cleanup. Cleans up InductiveRangeCheck entries (stride 0x40 after alignment). If metadata flag byte_4FAFF20 is set, propagates "irce.loop.clone" metadata to cloned loops via red-black tree manipulation. Releases SCEV expression references via sub_1649B30.
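The net effect of phases 4-7 on the iteration space can be summarized in a small model. This is a sketch of the assumed semantics (the clamping of the safe range to the trip count is an assumption for the sketch, and the helper name is hypothetical):

```python
# Illustrative model of IRCE's split: iterations [0, n) are partitioned
# around the SCEV-computed safe range; only the main region may drop
# its range checks.
def irce_partition(trip_count, safe_begin, safe_end):
    b = max(0, min(safe_begin, trip_count))
    e = max(b, min(safe_end, trip_count))
    pre = (0, b)              # preloop: checks kept
    main = (b, e)             # main loop: checks eliminated
    post = (e, trip_count)    # postloop: checks kept
    return pre, main, post
```

For a hot loop, `main` covers the vast majority of iterations, which is where the eliminated checks pay off.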
GPU considerations. The block count threshold (dword_4FB0000) protects against pathologically large GPU kernel loops from unrolled or tiled computations. The constrained relaxation mode helps with range checks in GPU kernels where induction variables use non-canonical strides (common after thread coarsening). IRCE has no barrier awareness -- if a loop body contains __syncthreads(), the loop cloning would duplicate the barrier into all three clones (pre/main/post), which is correct but increases code size and instruction cache pressure. The pass does not check for convergent calls, so it could clone a loop containing warp-level primitives; this is safe because all three clones execute the same iterations as the original (just partitioned differently).
Pipeline position. IRCE runs after LoopSimplify and before LoopUnroll. It consumes canonicalized induction variables produced by IndVarSimplify and feeds into vectorization by removing bounds checks that would otherwise prevent LoopVectorize.
IndVarSimplify
Canonicalizes induction variables: simplifies IV users, performs Linear Function Test Replace (LFTR), replaces exit values with closed-form SCEV expressions, and sinks dead IV computations. This is the pass with the most significant NVIDIA modifications in this group.
| Property | Value |
|---|---|
| Core function | sub_1945A50 (65 KB) -- IndVarSimplify::run |
| NewPM wrapper | sub_19489B0 -- applies NVIDIA guards before core |
| Pass name | "indvars" |
| NVIDIA knob 1 | Disable-unknown-trip-iv at qword_4FAF520 -- skip pass for unknown-trip loops |
| NVIDIA knob 2 | iv-loop-level at qword_4FAF440, default 1 -- max nesting depth |
| NVIDIA knob 3 | disable-lftr at byte_4FAF6A0 -- disable LFTR entirely |
| Upstream knob | replexitval at dword_4FAF860 -- {never=0, cheap=1, always=2} |
| All knobs registered | ctor_203 at 0x4E1CD0 |
| NVIDIA delta | Significant -- two custom guard knobs plus depth limiter |
NVIDIA guards. Before the core algorithm runs, sub_19489B0 checks two NVIDIA-specific conditions:
- Loop depth gate (iv-loop-level): if sub_193DD90(loop) > qword_4FAF440[20], the pass is skipped entirely. sub_193DD90 is a recursive getLoopDepth() returning 1 for outermost loops. The default of 1 means only outermost loops receive IV simplification. This controls compile time on deeply nested stencil and tensor kernels that commonly have 3-5 nested loops.
- Unknown trip count gate (Disable-unknown-trip-iv): if LOBYTE(qword_4FAF520[20]) is set AND (sub_1CED350(loop) <= 1 OR !sub_1CED620(loop, header)), the pass is skipped. sub_1CED350 returns the SCEV-computed trip count; values <= 1 indicate unknown or trivial loops. This protects GPU kernels with divergent or dynamic bounds (where the trip count depends on threadIdx or blockIdx) from aggressive IV transforms that can cause correctness issues with warp-level scheduling assumptions.
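The two gates compose as a simple predicate. A minimal sketch, assuming boolean/integer inputs stand in for the decompiled globals and helpers named above (the function and parameter names are illustrative):

```python
# Paraphrase of the NVIDIA guard logic in sub_19489B0 (sketch).
def should_run_indvars(loop_depth, scev_trip_count,
                       iv_loop_level=1, disable_unknown_trip_iv=False):
    if loop_depth > iv_loop_level:        # depth gate: default = outermost only
        return False
    if disable_unknown_trip_iv and scev_trip_count <= 1:
        return False                      # unknown/trivial trip count gate
    return True
```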
Core algorithm (five phases):
- Header PHI collection -- walks the loop header's instruction list via **(a2+32)+48, collecting all PHI nodes (opcode 77) as candidate induction variables into worklist v342.
- Per-IV rewriting -- for each PHI, calls sub_1B649E0 (SimplifyIndVar::simplifyIVUsers, via vtable at off_49F3848) to fold truncs/sexts/zexts, fold comparisons with known ranges, and eliminate redundant increment chains. Sets the changed flag at a1+448. Then calls sub_1943460 (rewriteLoopExitValues) to replace uses of the IV outside the loop with closed-form SCEV expressions. New PHIs discovered during rewriting are pushed back to the worklist for fixpoint iteration.
- LFTR (Linear Function Test Replace) -- gated by four conditions: dword_4FAF860 != 0 (replexitval not "never") AND trip count not constant (!sub_14562D0), !byte_4FAF6A0 (disable-lftr not set), hasCongruousExitingBlock (sub_193E1A0), and exitValueSafeToExpand (sub_193F280). Selects the best IV via sub_193E640 (isBetterIV), preferring non-sign-extending, wider IVs with higher SCEV complexity (sub_1456C90). Computes a wide trip count via sub_1940670 (computeWideTripCount). Three rewriting strategies:
  - Strategy A: integer IV with matching types -- computes the exact exit value via APInt arithmetic and materializes it as a constant.
  - Strategy B: type mismatch -- expands the SCEV expression via sub_14835F0 (SCEVExpander::expandCodeFor), creates a "wide.trip.count" instruction using ZExt (opcode 37) or SExt (opcode 38).
  - Strategy C: direction check failure -- creates "lftr.wideiv" as a truncation (opcode 36, Trunc) down to the exit condition type.
  - Finally creates the "exitcond" ICmp instruction (opcode 51) with computed predicate v309 = 32 - depth_in_loop_set.
- Exit value replacement -- materializes closed-form exit values via SCEVExpander. The "cheap" mode (replexitval=1) adds a cost gate at sub_1941790 where dword_4FAF860 == 1 && !v136 && v31[24] skips expensive expansions (v136 = simple loop flag, v31[24] = per-candidate "expensive" flag from sub_3872990, the SCEV expansion cost model).
- Cleanup -- dead instruction removal (drains the worklist at a1+48..a1+56, using an opcode check: type <= 0x17 = LLVM scalar type), IV computation sinking (walks the latch block backwards, tracks the live set in a red-black tree via sub_220EF30/sub_220EF80/sub_220F040, sinks dead IVs past the loop exit via sub_15F2240), PHI predecessor fixup (handles Switch opcode 27 and Branch opcode 26 terminators), and sub_1AA7010 (deleteDeadPhis) on the loop header.
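What exit-value replacement materializes for an affine IV is just the closed form of the add-rec. A minimal sketch of Strategy A's constant case (the helper name is illustrative; the formula is standard SCEV algebra, not decompiled code):

```python
# Closed form of the affine SCEV add-rec {start,+,step} after n
# backedge-taken iterations -- the value exit-value replacement
# substitutes for out-of-loop uses of the IV.
def addrec_exit_value(start, step, backedge_taken_count):
    return start + step * backedge_taken_count
```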
Additional upstream knobs present: indvars-post-increment-ranges (bool, default true), indvars-predicate-loops (bool, default true), indvars-widen-indvars (bool, default true), verify-indvars (bool, default false).
Pass state object layout:
| Offset | Type | Content |
|---|---|---|
| +0 | ptr | TargetTransformInfo |
| +8 | ptr | DataLayout / Module |
| +16 | ptr | DominatorTree |
| +24 | ptr | LoopInfo |
| +32 | ptr | DeadInstVector |
| +40 | ptr | ScalarEvolution |
| +48 | ptr | DeadInstWorklist array |
| +56 | u32 | DeadInstWorklist count |
| +60 | u32 | DeadInstWorklist capacity |
| +448 | byte | Changed flag |
GPU relevance. The depth limiter is important because CUDA stencil codes often have 3-5 nested loops, and running IndVarSimplify on inner loops can blow up compile time without meaningful benefit (inner loops typically have simple IVs already). The unknown-trip guard prevents miscompiles on kernels where the trip count depends on threadIdx or blockIdx. The interaction with IV Demotion (sub_1CD74B0) is notable: IndVarSimplify runs first and may widen IVs to 64-bit, then IV Demotion (a separate NVIDIA pass) narrows them back to 32-bit where the value range permits, reducing register pressure -- a critical factor for GPU occupancy.
LoopDistribute
Splits a single loop into multiple loops (loop fission), each containing a subset of the original instructions. The primary motivation is separating memory accesses with unsafe dependences from safe ones, enabling LoopVectorize to vectorize the safe partition.
| Property | Value |
|---|---|
| Entry point | sub_1A8CD80 (63 KB) -- LoopDistributePass::run |
| Pass name | "loop-distribute" |
| Force flag | byte_4FB5360 -- force distribution ignoring metadata |
| SCEV check threshold | qword_4FB5480 -- max runtime checks before bail-out |
| Secondary limit | qword_4FB53A0 -- max dependence checks per partition |
| Verify flag | byte_4FB56E0 -- post-distribution verification |
| NVIDIA delta | None -- stock LLVM algorithm |
Stack frame. ~0x780 bytes (1920 bytes). Signature: sub_1A8CD80(void *this_pass, void *Function, void *FunctionAnalysisManager).
Algorithm. The pass runs a gauntlet of six bail-out conditions per loop:
"NotLoopSimplifyForm"--sub_157F0D0(Loop::isLoopSimplifyForm) fails."MultipleExitBlocks"--sub_157F0B0(Loop::getUniqueExitBlock) returns null.- Metadata
"llvm.loop.distribute.enable"disabled (checked viasub_15E0530MDNode lookup).byte_4FB5360(force flag) overrides this. "NoUnsafeDeps"-- LAI flag at+0xDAh(HasUnsafeDependences) is zero."MemOpsCanBeVectorized"-- all memory operations already vectorizable."TooManySCEVRuntimeChecks"-- SCEV check count at LAI+0x118exceedsqword_4FB5480.
LoopAccessInfo (LAI) structure (0x130 = 304 bytes):
| Offset | Content |
|---|---|
| +0x00 | Loop* TheLoop |
| +0x08 | PredicatedScalarEvolution* PSE |
| +0x10 | RuntimeCheckingPtrGroup* PtrRtChecks |
| +0x90 | SmallVector buffer (16-byte aligned) |
| +0xDA | bool HasUnsafeDependences |
| +0xE0 | MemoryDepChecker::Dependence* DepArray |
| +0xE8 | uint32 NumDependences |
| +0x108 | SCEVUnionPredicate* Predicates |
| +0x110 | SCEVCheck* SCEVChecks |
| +0x118 | uint32 NumSCEVChecks |
Dependence entry (0x40 = 64 bytes per entry): source instruction (+0x00), destination instruction (+0x08), dep type info (+0x10), SCEV distance (+0x18), DependenceType byte (+0x28). Stride confirmed at shl rax, 6 (0x1A8E6B9).
If validation passes, the core phase builds a partition graph. Each instruction starts in its own partition. The partition hash set uses 16-byte slots with NVVM-layer sentinels (-8 / -16) and an additional -2 value for "unassigned" partitions. See Hash Table and Collection Infrastructure for the hash function, probing, and growth policy.
For each unsafe memory dependence pair, the pass either merges source and destination partitions (if the dependence cannot be broken) or marks it as cross-partition. A union-find structure tracks merged partitions. After merging, if at least two distinct partitions remain, sub_1B1E040 (distributeLoopBody, ~2000 bytes) clones the loop body once per partition, removes instructions not belonging to each partition, and wires the clones in dependence order. Optional runtime dependence checks (loop versioning) are added. Post-distribution: sub_1B1DC30 updates the dominator tree, sub_197E390 registers new loops, sub_143AA50 (ScalarEvolution::forgetLoop) invalidates SCEV cache. Metadata "distributed loop" (16 chars) is attached to prevent future re-distribution.
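The partition-merging step can be modeled with a small union-find. This is an illustrative sketch only -- the real pass uses its own hash-set and union-find machinery described above, and the class/function names here are hypothetical:

```python
# Union-find model of LoopDistribute's partition merging: instructions
# start in singleton partitions; each unbreakable dependence merges the
# partitions of its source and destination.
class Partitions:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def merge(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def count(self):
        return len({self.find(i) for i in range(len(self.parent))})

def can_distribute(num_insts, unbreakable_deps):
    """Distribution only proceeds if >= 2 distinct partitions remain."""
    p = Partitions(num_insts)
    for src, dst in unbreakable_deps:
        p.merge(src, dst)
    return p.count() >= 2
```

If every instruction ends up in one partition, fission has nothing to separate and the pass bails.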
GPU relevance. Distribution is valuable for CUDA kernels that mix shared-memory and global-memory accesses in the same loop -- the shared-memory partition can often be vectorized independently. The "llvm.loop.distribute.enable" metadata is controllable via #pragma clang loop distribute(enable). The SCEV runtime check threshold (qword_4FB5480) balances runtime check overhead against distribution benefit -- GPU kernels often have simple loop structures but complex pointer arithmetic from tiled access patterns.
LoopIdiom
Recognizes loop patterns that correspond to standard library calls (memset, memcpy, memcmp, strstr) and replaces them with optimized implementations. CICC includes both the standard LoopIdiomRecognize pass and the newer LoopIdiomVectorize pass.
| Property | Value |
|---|---|
| Recognizer core | sub_196FF90 (51 KB) -- LoopIdiomRecognize::run |
| Memset detection | sub_196B740 (10 KB) -- detects memset_pattern16 |
| Memcpy/memmove | sub_196E000 (43 KB) |
| Mismatch expansion | sub_2AA00B0 (48 KB) -- expandMemCmpMismatch |
| String search expansion | sub_2AA3190 (40 KB) -- expandFindFirst |
| Pass name | "loop-idiom" (recognizer), "loop-idiom-vectorize" (vectorizer) |
| Vectorize knobs | disable-loop-idiom-vectorize-all, loop-idiom-vectorize-style (masked/predicated), loop-idiom-vectorize-bytecmp-vf, etc. |
| NVIDIA delta | None visible -- stock LLVM |
Standard idioms. The recognizer scans loops for store patterns that correspond to memset (constant value stored on every iteration) and memcpy/memmove (load-store pairs with matching strides). It also detects trip-count-decrement patterns ("tcphi", "tcdec") used in hand-written copy loops. Recognized patterns are lowered to @llvm.memset / @llvm.memcpy / @llvm.memmove intrinsics.
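The shape of the memset/memcpy checks can be sketched with a toy model. This is an assumption-laden illustration (a loop summarized as per-iteration store/load facts; the real recognizer works on SCEV strides and the helper name is hypothetical):

```python
# Toy idiom classifier: one store per iteration, stride equal to the
# access width (consecutive memory), constant value -> memset; a
# matching-stride load feeding the store -> memcpy.
def classify_idiom(stores, loads):
    """stores/loads: list of (stride_bytes, access_bytes, source) tuples."""
    if len(stores) != 1:
        return None
    s_stride, s_size, s_src = stores[0]
    if s_stride != s_size:
        return None                      # non-consecutive: no idiom
    if s_src == 'const':
        return 'memset'                  # same constant stored each iteration
    if len(loads) == 1 and loads[0][:2] == (s_stride, s_size):
        return 'memcpy'                  # load-store pair with matching strides
    return None
```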
Vectorized idiom expansion -- MemCmpMismatch (sub_2AA00B0). The expansion generates a two-tier multi-block IR structure:
- LoopIdiomExpansionState structure (80+ bytes): idiom type at +0 (0=byte, 1=word), loop info at +8, DataLayout at +16, alloc context at +24, target info at +32, output blocks at +48 through +80.
- 11 basic blocks created in sequence: "mismatch_end", "mismatch_min_it_check", "mismatch_mem_check", "mismatch_vec_loop_preheader", "mismatch_vec_loop", "mismatch_vec_loop_inc", "mismatch_vec_loop_found", "mismatch_loop_pre", "mismatch_loop", "mismatch_loop_inc", "byte.compare".
- Page-boundary safety protocol (shared with the string search expansion): PtrToInt -> LShr by log2(pagesize) (from sub_DFB4D0 via DataLayout) -> ICmpNE of start/end page numbers. If both pointers stay within a single page, wider-than-element vector loads are safe; otherwise @llvm.masked.load provides the fallback. The page size is retrieved via sub_DFB4D0(*a1[32]) from the target DataLayout.
- Vector loop body: dispatches to sub_2A9D690 (byte granularity) or sub_2A9EC20 (word granularity) based on the *a1 idiom type. Generates vector load + compare + cttz (count trailing zeros via sub_B34870).
- Scalar fallback: byte-by-byte comparison with a "mismatch_index" PHI node, induction variable add (sub_929C50), and ICmpULT (sub_92B530(0x20)) loop bound check.
- LCSSA verification: explicit assertion "Loops must remain in LCSSA form!" via sub_D48E00. SE/LI/DT are invalidated/recalculated on exit (sub_FFCE90, sub_FFD870, sub_FFBC40).
Vectorized idiom expansion -- FindFirst (sub_2AA3190). Implements vectorized first-occurrence search (strstr-like):
- 7 basic blocks: "scalar_preheader", "mem_check", "find_first_vec_header", "match_check_vec", "calculate_match", "needle_check_vec", "search_check_vec".
- Needle splatting: needle[0] is extracted via ExtractElement (sub_B4DE80) with index 0, frozen via sub_B37620, then splatted across all vector lanes via ShuffleVector (sub_B36550). The splat enables parallel comparison of the haystack against the needle's first character.
- Masked loads: @llvm.masked.load (sub_B34C20) provides page-boundary-safe vectorized reads, using the same page-boundary protocol as the mismatch expansion.
- Two nested loops: the outer scans the haystack, the inner verifies a full needle match at each candidate position. PHI nodes: "psearch" (haystack), "pneedle" (needle position), "match_start", "match_vec".
GPU considerations. LoopIdiom is present in cicc but its value on GPU code is limited. GPU memset/memcpy are typically handled by device runtime calls or specialized PTX instructions (st.global, ld.global with vectorized widths) rather than loop-based patterns. The vectorized mismatch/search expansions target CPU-style byte-level operations that are rare in GPU kernels. The page-boundary safety protocol is irrelevant on GPU (virtual memory page faults work differently -- GPU global memory is always accessible within the allocation). The pass runs but likely fires infrequently. When it does fire, the generated @llvm.memset/@llvm.memcpy intrinsics are later lowered to PTX-specific sequences by the NVPTX backend.
LoopRotate
Transforms loops so that the latch block (back-edge source) becomes the exiting block (where the exit condition is tested). This converts "while" loops into "do-while" form, which is a prerequisite for LICM (the loop body is guaranteed to execute at least once, enabling unconditional hoisting) and simplifies trip count computation for SCEV.
| Property | Value |
|---|---|
| Entry point (legacy) | sub_18A3090 -- called directly in O1/O2/O3 pipeline |
| Entry point (new PM) | sub_28448D0 -- LoopRotatePass with "header-duplication;" param |
| Core implementation | sub_2A0CFD0 (65 KB) -- LoopRotation::runOnLoop |
| String markers | ".lr.ph" (preheader), "h.rot", "pre.rot" |
| Pass name | "loop-rotate" |
| Params | no-header-duplication / header-duplication |
| Pipeline knob | enable-loop-header-duplication (bool) -- controls default param |
| NVIDIA delta | None -- stock LLVM, but appears multiple times in pipeline |
Pipeline placement. LoopRotate appears at least four times in the cicc pipeline across different tiers:
- Full O1+ pipeline, position 11: sub_18A3090() in sub_12DE330 -- runs before LICM (sub_184CD60) and IndVarSimplify.
- Tier 1 passes: appears alongside SimplifyCFG and InstCombine as part of the canonicalization loop.
- Tier 2 passes: appears again in the LoopRotate+LICM pair.
- Pipeline assembler: sub_195E880 appears 4 times (labeled "LICM/LoopRotate"), conditional on opts[1240] and opts[2880].
This multiple invocation is standard LLVM practice -- rotation may be needed again after other transforms invalidate the rotated form. In the Ofcmid fast-compile pipeline, LoopRotate does not appear as a standalone pass; LICM (which internally depends on rotation) handles it.
Algorithm. The pass duplicates the loop header into the preheader (creating a "rotated" header named "h.rot" or "pre.rot"), then rewires the CFG so the original header becomes the latch. The header-duplication parameter controls whether the header is actually duplicated (which increases code size) or only the branch is restructured. After rotation, SCEV's backedge-taken count computation becomes straightforward because the exit test is at the latch.
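The rotation is easiest to see as a source-level equivalence. A behavioral sketch (Python stand-ins, not IR): both forms compute the same thing, but in the rotated form the exit test sits at the latch and the guard plays the role of the duplicated header test:

```python
def while_form(n):
    i, visits = 0, []
    while i < n:              # exit test in the header
        visits.append(i)
        i += 1
    return visits

def rotated_form(n):
    i, visits = 0, []
    if i < n:                 # guard: the duplicated header test
        while True:
            visits.append(i)  # body executes at least once per entry
            i += 1
            if not (i < n):   # exit test now at the latch
                break
    return visits
```

The "at least once per entry" guarantee inside the rotated loop is what lets LICM hoist unconditionally into the preheader.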
SCEV interaction. LoopRotate requires BTC (backedge-taken count) recomputation after the header/latch swap. This is handled by ScalarEvolution::forgetLoop being called by downstream passes that depend on fresh SCEV data.
GPU considerations. LoopRotate is purely a structural transformation that does not examine instruction semantics. It has no barrier awareness -- if a barrier (__syncthreads()) is in the loop header, it will be duplicated into the preheader during rotation. In practice, barriers in CUDA kernels are rarely in loop headers (they are typically in loop bodies or between loops). The header duplication can increase code size, which affects instruction cache utilization on GPU -- SM instruction caches (L0/L1 I-cache) are small (typically 12-48 KB per SM depending on architecture), so excessive duplication of large loop headers across many loops in a kernel could cause I-cache pressure. The pass does not have a size threshold to prevent this.
LoopSimplify
Enforces LLVM's canonical loop form: single preheader, single latch, single dedicated exit block, and no abnormal edges. Nearly every loop optimization pass requires simplify form as a precondition.
| Property | Value |
|---|---|
| Canonicalization core | sub_1A5B3D0 (62 KB) |
| DomTree update helper | sub_1A593E0 (47 KB) |
| Preheader insertion | sub_1A5E350 (25 KB) |
| Exit block normalization | sub_1A5F590 (42 KB) |
| Pass name | "loop-simplify" |
| String markers | ".backedge", "llvm.loop" |
| Pipeline wrapper (standalone) | sub_1832270(n) where n = verify flag |
| Pipeline wrapper (bundled) | sub_1841180() -- LoopSimplify + LCSSA combined |
| NVIDIA delta | None -- stock LLVM |
Pipeline placement. LoopSimplify is the most frequently invoked loop pass in the cicc pipeline:
| Context | Call site | Position |
|---|---|---|
| Full O1+ pipeline | sub_1841180() | Position 40 (bundled with LCSSA) |
| Ofcmid pipeline | sub_1832270(1) | Position 11 (standalone) |
| Ofcmid pipeline | sub_1841180() | Position 15 (bundled with LCSSA) |
| Post-tier insertion | sub_1841180() | Tier 2/3 additional invocations |
| As precondition | sub_157F0D0 (check) | Called by LoopInterchange, LoopDistribute, IRCE, LoopVectorize |
The pass appears at least 5 times across different pipeline tiers. It also runs as a utility called by other loop passes -- LoopInterchange, LoopDistribute, IRCE, and LoopVectorize all check isLoopSimplifyForm() (sub_157F0D0) and bail out if it fails.
What it does. If a loop lacks a single preheader, LoopSimplify creates one by inserting a new basic block on the entry edge (named with .lr.ph suffix via sub_1A5E350). If multiple latch blocks exist, it merges them into one (inserting .backedge blocks). If exit blocks are shared with other loops, it creates dedicated exit blocks via sub_1A5F590 (42 KB normalization function). After transformation, loop metadata ("llvm.loop" nodes) is preserved on the new latch terminator.
GPU considerations. LoopSimplify is purely structural and has no GPU-specific implications. However, it is worth noting that StructurizeCFG (which runs after all loop optimizations, during NVPTX code generation) re-canonicalizes the CFG for GPU divergence handling. Loop structures created by LoopSimplify may be further modified by StructurizeCFG when the loop contains divergent branches. The two passes do not interfere because they run in different pipeline phases (IR optimization vs. code generation).
LCSSA (Loop-Closed SSA)
Ensures that every value defined inside a loop and used outside it passes through a PHI node at the loop exit. This invariant simplifies SSA-based transformations: passes can modify loop internals without worrying about breaking uses outside the loop.
| Property | Value |
|---|---|
| Formation pass | sub_1AE2630 (49 KB) |
| Lightweight form | sub_1961B00 (13 KB) -- creates .lcssa PHI nodes |
| LCSSA updater | sub_1AF8F90 -- used by LoopInterchange post-transformation |
| Pass name | "lcssa" |
| Verify knob | verify-loop-lcssa registered at ctor_094 (~0x4A2491) |
| String markers | ".lcssa" suffix on PHI node names |
| NVIDIA delta | None -- stock LLVM |
Pipeline placement. LCSSA runs bundled with LoopSimplify via sub_1841180() at position 40 in the full pipeline. In the Ofcmid fast-compile pipeline, it appears at position 15 via the same bundled wrapper. It is also maintained incrementally by every pass that modifies loop structure:
- LoopInterchange calls `sub_1AF8F90` to update LCSSA form for both inner and outer loops after transformation. The inner loop is updated first. The TTI availability boolean from `a1+192` is passed as the 4th argument to the updater.
- LoopUnroll checks LCSSA form via `sub_D49210` and generates `.unr-lcssa` blocks for unrolled iterations.
- LoopIdiom expansions (`sub_2AA00B0`, `sub_2AA3190`) end with an explicit `verifyLoopLCSSA` assertion ("Loops must remain in LCSSA form!").
What it does. For each instruction defined inside the loop, LCSSA checks all uses outside the loop's exit blocks. For each such use, it inserts a PHI node in the exit block with the defined value as the incoming value from the latch. The PHI node is named with a .lcssa suffix. After LCSSA formation, all external uses of loop-internal values go through these PHI nodes, and loop transforms only need to update the PHI nodes rather than chasing all external uses.
GPU considerations. LCSSA is purely structural and has no GPU-specific behavior. However, LCSSA PHI nodes interact with the NVPTX backend's divergence analysis: when a loop exit depends on a divergent condition (different threads take different exit iterations), the .lcssa PHI node at the exit carries a divergent value. The divergence analysis pass (NVVMDivergenceLowering, sub_1C76260) must handle these PHIs correctly to avoid generating incorrect predication. This is not an issue with LCSSA itself but with downstream consumers.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| IndVarSimplify::run (core) | sub_1945A50 | 65 KB | -- |
| IndVarSimplifyPass::run (NewPM wrapper with NVIDIA guards) | sub_19489B0 | -- | -- |
| rewriteLoopExitValues | sub_1943460 | -- | -- |
| replaceExitValuesWithCompute (LFTR commit) | sub_1941790 | -- | -- |
| computeWideTripCount | sub_1940670 | -- | -- |
| hasCongruousExitingBlock | sub_193E1A0 | -- | -- |
| getLoopDepth (recursive, 1 for outermost) | sub_193DD90 | -- | -- |
| isBetterIV (candidate comparison for LFTR) | sub_193E640 | -- | -- |
| exitValueSafeToExpand (SCEV expandability check) | sub_193F280 | -- | -- |
| findFinalIVValue (trace IV to exit value) | sub_193F190 | -- | -- |
| hasSafeExitBlock (exit block LFTR safety) | sub_193F750 | -- | -- |
| initPassState (initialize pass-level state) | sub_1940CE0 | -- | -- |
| clearPassState (cleanup per-iteration state) | sub_1940B30 | -- | -- |
| SimplifyIndVar::simplifyIVUsers | sub_1B649E0 | -- | -- |
| LoopInterchange::processLoopList | sub_1979A90 | 69 KB | -- |
| LoopInterchange legality checker | sub_1975210 | 45 KB | -- |
| LoopInterchange dependence analysis helper | sub_1978000 | 37 KB | -- |
| LoopInterchange::getAnalysisUsage | sub_19743F0 | -- | -- |
| SmallVector copy helper (dep vector / loop list) | sub_19742B0 | -- | -- |
| vector<DepVector> push_back | sub_1974CB0 | -- | -- |
| Swap loop bounds / trip count metadata | sub_1973F90 | -- | -- |
| InductiveRangeCheckElimination::run | sub_194D450 | 71 KB | -- |
| createPreLoop / cloneLoopForRange (~1200 bytes) | sub_194C320 | -- | -- |
| createPostLoop / wirePostLoop (~1300 bytes) | sub_194AE30 | -- | -- |
| classifyRangeCheckICmp (~800 bytes) | sub_1949EA0 | -- | -- |
| canHandleRangeCheck (~400 bytes) | sub_1949540 | -- | -- |
| canHandleRangeCheckExtended (~300 bytes, constrained mode) | sub_1949670 | -- | -- |
| buildInductiveRangeCheck (~500 bytes) | sub_1949C30 | -- | -- |
| adjustSCEVAfterCloning | sub_1949270 | -- | -- |
| simplifyLoopAfterCloning (~200 bytes) | sub_1948FD0 | -- | -- |
| verifyLoopStructure (~200 bytes) | sub_1948D70 | -- | -- |
| LoopDistributePass::run | sub_1A8CD80 | 63 KB | -- |
| distributeLoopBody (core fission engine, ~2000 bytes) | sub_1B1E040 | -- | -- |
| updateDominatorTree (post-distribution, ~400 bytes) | sub_1B1DC30 | -- | -- |
| updateLoopInfo (post-distribution, ~300 bytes) | sub_1B1DDA0 | -- | -- |
| cleanupPartitions (~400 bytes) | sub_1B1F0F0 | -- | -- |
| verifyDistribution (~300 bytes) | sub_1B216C0 | -- | -- |
| cleanupAfterDistribution (~200 bytes) | sub_1A8C510 | -- | -- |
| lookupPartitionForInstruction (hash table lookup) | sub_3860240 | -- | -- |
| hasDirectDependence(partA, partB) | sub_385DBB0 | -- | -- |
| alreadyMerged(partA, partB) | sub_385DB90 | -- | -- |
| isSafeToDistribute (final safety check) | sub_1452CB0 | -- | -- |
| LoopIdiomRecognize::run | sub_196FF90 | 51 KB | -- |
| LoopIdiom memset pattern detection | sub_196B740 | 10 KB | -- |
| LoopIdiom memcpy/memmove patterns | sub_196E000 | 43 KB | -- |
| expandMemCmpMismatch | sub_2AA00B0 | 48 KB | -- |
| expandFindFirst (string search vectorization) | sub_2AA3190 | 40 KB | -- |
| expandByteMismatchLoopBody (type 0) | sub_2A9D690 | -- | -- |
| expandWordMismatchLoopBody (type 1) | sub_2A9EC20 | -- | -- |
| replaceUsesOfPhiInSuccessors (LCSSA fixup) | sub_2A9D330 | -- | -- |
| LoopRotation::runOnLoop | sub_2A0CFD0 | 65 KB | -- |
| LoopRotatePass (NewPM, "header-duplication;") | sub_28448D0 | -- | -- |
| LoopRotate (legacy pipeline call) | sub_18A3090 | -- | -- |
| LoopSimplify canonical form enforcement | sub_1A5B3D0 | 62 KB | -- |
| LoopSimplify DomTree update helper | sub_1A593E0 | 47 KB | -- |
| LoopSimplify preheader insertion | sub_1A5E350 | 25 KB | -- |
| LoopSimplify exit block normalization | sub_1A5F590 | 42 KB | -- |
| LoopSimplify pipeline wrapper (with verify flag) | sub_1832270 | -- | -- |
| LoopSimplify + LCSSA bundled pass | sub_1841180 | -- | -- |
| LCSSA formation pass | sub_1AE2630 | 49 KB | -- |
| LCSSA lightweight .lcssa PHI insertion | sub_1961B00 | 13 KB | -- |
| LCSSA form updater (used post-interchange) | sub_1AF8F90 | -- | -- |
| verifyLoopLCSSA (assertion: "Loops must remain in LCSSA form!") | sub_D48E00 | -- | -- |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| IndVarSimplify knobs | Stock LLVM defaults; no GPU-specific configuration | Three NVIDIA-specific knobs that change IV widening/narrowing behavior for GPU register pressure management |
| Barrier awareness | No concept of GPU barriers or synchronization primitives | None of the 8 standard passes have explicit barrier awareness; barrier handling deferred to dedicated NVIDIA passes (Dead Barrier Elimination, convergence token verification) |
| LoopRotate frequency | Runs once or twice in pipeline | Appears multiple times as canonicalization prerequisite for LICM and unrolling; forms the backbone of loop pass infrastructure |
| LoopIdiom patterns | memset, memcpy recognition for CPU targets | Same patterns; GPU-specific expansion handled downstream by MemmoveUnroll pass |
| IRCE | Range check elimination for deoptimization-safe targets | Present but effectiveness limited on GPU: no deoptimization support, relies on SCEV range analysis for bound proofs |
| LoopInterchange | Cost model driven by cache locality | Same legality checks; profitability analysis implicitly favors stride-1 access (coalescing) over cache line optimization |
| IV Demotion | Not present | Downstream NVIDIA pass (IV Demotion) narrows IVs widened by IndVarSimplify back to 32-bit where GPU value ranges permit |
Cross-References
- LoopVectorize & VPlan -- LoopDistribute feeds vectorization; IRCE removes bounds checks that block it.
- Loop Unrolling -- Runs after IndVarSimplify canonicalizes IVs; requires LoopSimplify form. The `unroll-runtime-convergent` knob forces epilogue mode when convergent calls (warp-level primitives) are present -- an interaction with GPU barrier semantics that these 8 standard passes do not handle.
- LICM -- Requires LoopRotate and LoopSimplify as prerequisites.
- ScalarEvolution -- IndVarSimplify and IRCE are among the heaviest SCEV consumers; LoopInterchange uses SCEV for stride analysis. LoopRotate and LoopDistribute call `ScalarEvolution::forgetLoop` after transformation.
- SCEV Invalidation -- LoopRotate requires BTC recomputation after the header/latch swap; LoopDistribute calls forgetLoop after fission.
- Loop Strength Reduction -- Runs after IndVarSimplify; consumes the canonicalized IV forms it produces. LSR has address-space-aware chain construction for shared memory (addrspace 3) that these 8 passes lack.
- IV Demotion -- NVIDIA's custom pass that narrows IVs widened by IndVarSimplify back to 32-bit where value ranges permit, reducing register pressure for GPU occupancy.
- Dead Barrier Elimination -- Handles barrier optimization that these standard loop passes do not address.
- Pipeline & Ordering -- LoopRotate at position 11, LoopSimplify/LCSSA at position 40 in the full O1+ pipeline.
- NVVMDivergenceLowering -- Handles divergent LCSSA PHI nodes at loop exits when different threads take different exit iterations.
Loop Unrolling
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: `llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp` (decision engine), `llvm/lib/Transforms/Utils/LoopUnroll.cpp` (transformation engine), `llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp` (runtime unrolling) (LLVM 20.0.0)
Loop unrolling in cicc is one of the most heavily tuned transformations in the entire pipeline. On a GPU, unrolling directly trades register pressure against instruction-level parallelism: every additional copy of the loop body increases live register count, which reduces SM occupancy and the number of concurrent warps available to hide memory latency. Conversely, too little unrolling leaves performance on the table by failing to expose independent instructions that the hardware scheduler can overlap. NVIDIA's unroller resolves this tension through a priority-based decision cascade with GPU-specific heuristics that have no upstream equivalent -- most notably a local-array threshold multiplier, power-of-two factor enforcement, and a pragma threshold twice that of stock LLVM. The transformation engine itself is a lightly modified version of upstream llvm::UnrollLoop, but the decision engine (computeUnrollCount) is substantially reworked.
The pass appears twice in the cicc pipeline. The first invocation (sub_197E720) runs early, interleaved with loop vectorization in the main optimization sequence. The second invocation (sub_19C1680) runs later as a cleanup pass, gated by opts[1360] (the nv-disable-loop-unrolling flag). Both share the same decision engine; the second invocation operates on loops that were created or exposed by intervening passes (InstCombine, SROA, EarlyCSE).
| Property | Value |
|---|---|
| Decision engine | sub_19BB5C0 / computeUnrollCount (50 KB, ~1681 lines) |
| Transformation engine | sub_2A15A20 / UnrollLoop (85 KB, ~2434 lines) |
| Top-level driver | sub_19BE360 / tryToUnrollLoop |
| Runtime-check unroller | sub_2A25260 / UnrollLoopWithRuntimeChecks (91 KB) |
| Pipeline slot (early) | sub_197E720 -- runs once in main opt pipeline |
| Pipeline slot (late) | sub_19C1680 -- conditional on !opts[1360] |
| Disable knob | -Xcicc "-disable-LoopUnrollPass" or opts[1360] |
| LLVM base | LoopUnrollPass from LLVM 20.0.0 |
Why Unrolling Matters More on GPU
On a CPU, the primary benefit of unrolling is reducing branch overhead and enabling wider SIMD scheduling. On a GPU, the calculus is different in three ways that all trace back to the GPU execution model:
First, unrolling increases register pressure, and register pressure determines occupancy. If unrolling pushes a kernel from 64 to 96 registers per thread, the SM drops from 32 to 21 resident warps -- a 34% reduction. Fewer warps means less latency hiding, so the unroll factor selection must be conservative in ways that a CPU unroller never needs to be.
Second, there is no out-of-order execution within a warp; the hardware issues instructions in program order. Unrolling creates independent instructions that the compiler (ptxas) can interleave, particularly independent loads that can overlap with arithmetic. This is the ILP benefit, and it is the primary argument for aggressive unrolling.
Third, GPU loops often access shared memory (__shared__) or local memory arrays indexed by threadIdx. Unrolling these loops enables the backend to promote array elements to registers and to rearrange memory accesses to avoid bank conflicts. NVIDIA's local-array heuristic (see below) exists specifically to exploit this opportunity.
The unroller's job is to find the sweet spot: enough copies to saturate the instruction pipeline, few enough to keep register pressure within occupancy targets.
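The occupancy arithmetic behind the first point can be sketched directly. This is a simplified model (an assumption, not cicc code): it considers only the register file, taking a 65,536-register file per SM and 32 threads per warp, which are typical of recent architectures; real occupancy also depends on shared memory, block size, and the per-SM warp cap.

```python
# Simplified register-limited occupancy model (illustrative assumption,
# not cicc's actual calculation). Registers are allocated per thread;
# a warp of 32 threads consumes regs_per_thread * 32 registers.

REGFILE_PER_SM = 65536   # assumed 64K 32-bit registers per SM
WARP_SIZE = 32

def resident_warps(regs_per_thread, max_warps=32):
    by_registers = REGFILE_PER_SM // (regs_per_thread * WARP_SIZE)
    return min(by_registers, max_warps)

# 64 regs/thread -> 32 warps; 96 regs/thread -> 21 warps,
# the ~34% reduction cited above.
print(resident_warps(64), resident_warps(96))
```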
The Decision Engine: computeUnrollCount
The decision engine at sub_19BB5C0 implements a strict six-level priority cascade. Each level is tried in order; the first level that produces a valid unroll factor wins. Every decision is logged through optimization remarks, making the logic traceable from -Rpass-analysis=loop-unroll.
UnrollParams Struct Layout
The decision communicates its result through a struct passed by pointer (a12 / v14):
| Offset | Field | Type | Description |
|---|---|---|---|
| +0 | Threshold | u32 | Cost budget for full unroll |
| +4 | MaxPercentThresholdBoost | u32 | Max boost percentage (default 400) |
| +12 | PartialThreshold | u32 | Cost budget for partial unroll |
| +20 | Count | u32 | Chosen unroll factor (primary output) |
| +24 | PeelCount | u32 | Loop peel iteration count |
| +28 | DefaultUnrollCount | u32 | Fallback count when no factor found |
| +32 | MaxCount | u32 | Hard cap on unroll factor |
| +36 | FullUnrollMaxCount | u32 | Max trip count for full unroll |
| +40 | FixedCost | u32 | Non-scaling cost (IV increments, branches) |
| +44 | AllowPartial | u8 | Partial unrolling permitted |
| +45 | AllowRemainder | u8 | Remainder loop generation permitted |
| +46 | UserProvidedCount | u8 | True when pragma supplies count |
| +48 | (reserved) | u8 | -- |
| +49 | AllowUpperBound | u8 | Use max-trip-count when exact unknown |
The Cost Model
Every decision in the cascade uses the same linear cost model to estimate unrolled loop size:
estimated_size = FixedCost + Count * (LoopBodySize - FixedCost)
LoopBodySize is the instruction cost of one iteration (parameter a11, computed by LLVM's CodeMetrics). FixedCost captures instructions that do not replicate with unrolling -- induction variable increments, the backedge branch, loop overhead. The difference (LoopBodySize - FixedCost) is the per-copy marginal cost.
For full unrolls, an additional dynamic cost simulation (sub_19B9A90) constant-folds through the unrolled body. If the loop contains iteration-dependent simplifications (constant array indices, strength-reduced expressions), the simulation reports a cost lower than worst-case. The effective budget for this check is boosted:
dynamic_budget = Threshold * MaxPercentThresholdBoost / 100
With the default boost of 400%, this means a loop whose body simplifies substantially after unrolling gets 4x the normal cost budget.
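The two formulas above can be written out as a small sketch (field names mirror the UnrollParams struct; the numbers in the example are hypothetical):

```python
# Linear size model and boosted dynamic budget, as described above.

def estimated_size(fixed_cost, body_size, count):
    # estimated_size = FixedCost + Count * (LoopBodySize - FixedCost)
    return fixed_cost + count * (body_size - fixed_cost)

def dynamic_budget(threshold, max_percent_threshold_boost=400):
    # dynamic_budget = Threshold * MaxPercentThresholdBoost / 100
    return threshold * max_percent_threshold_boost // 100

# Hypothetical loop: body costs 40 units, 10 of which are fixed overhead
# (IV increment, backedge). Fully unrolling 8x costs 10 + 8*30 = 250 units.
# With the default 400% boost, a Threshold of 300 yields a 1200-unit
# budget for the constant-folding simulation.
print(estimated_size(10, 40, 8), dynamic_budget(300))
```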
Priority Cascade (Pseudocode)
int computeUnrollCount(Loop *L, SE, TTI, TripCount, MaxTripCount,
BodySize, UnrollParams *UP, bool *AllowRuntime) {
// PRIORITY 1: Local array threshold multiplier (NVIDIA-specific)
int localSize = computeLocalArraySize(L); // scans for AS5 allocas
int multiplier = min(max(localSize, 1), 6);
int effectiveThreshold = multiplier * UP->Threshold;
// PRIORITY 2: #pragma unroll N
int pragmaCount = getMetadataCount(L, "llvm.loop.unroll.count");
if (pragmaCount != 0) {
if (pragmaCount == 1) {
UP->Count = 1; // disable unrolling
return UNROLL_DISABLED;
}
UP->Count = pragmaCount;
int estSize = UP->FixedCost + pragmaCount * (BodySize - UP->FixedCost);
if (estSize > multiplier * PragmaUnrollThreshold) {
// too large -- try to find smaller factor
searchSmallerDivisibleFactor(UP, TripCount);
}
if (TripMultiple % pragmaCount != 0)
emitRemark("remainder loops not allowed");
return UNROLL_PRAGMA;
}
// PRIORITY 3: #pragma unroll (full, no count)
if (hasMetadata(L, "llvm.loop.unroll.full")) {
if (TripCount > 0 && TripCount <= UP->FullUnrollMaxCount) {
int estSize = UP->FixedCost + TripCount * (BodySize - UP->FixedCost);
if (estSize <= effectiveThreshold) {
if (simulateLoopBody(L, TripCount, dynamicBudget))
{ UP->Count = TripCount; return FULL_UNROLL; }
}
}
// fallthrough to lower priorities
}
// PRIORITY 4: Loop peeling
int peelCount = computePeelCount(L, SE, UP);
if (peelCount > 0) {
UP->PeelCount = peelCount;
UP->Count = 1;
return PEEL;
}
// PRIORITY 5: Static partial unrolling (known trip count)
if (TripCount > 0 && (UP->AllowPartial || pragmaOversize) && isInnermost(L)) {
int count = UP->Count ? UP->Count : UP->DefaultUnrollCount;
// Size clamp
if (UP->PartialThreshold < UP->FixedCost + count * (BodySize - UP->FixedCost))
count = (UP->PartialThreshold - UP->FixedCost) / (BodySize - UP->FixedCost);
count = min(count, UP->MaxCount);
// Power-of-two + trip-divisible search
while (count > 0) {
if (TripCount % count == 0 && isPowerOfTwo(count))
break;
count--;
}
// Fallback: halve DefaultUnrollCount until it fits
if (count == 0 && UP->UserProvidedCount) {
count = UP->DefaultUnrollCount;
while (UP->PartialThreshold <
UP->FixedCost + count * (BodySize - UP->FixedCost))
count >>= 1;
}
if (count > 1) { UP->Count = count; return PARTIAL_UNROLL; }
}
// PRIORITY 6: Runtime unrolling (unknown trip count)
if (!hasMetadata(L, "llvm.loop.unroll.runtime.disable")
&& RuntimeUnrollThreshold >= BodySize
&& isInnermost(L)) {
int rtTripCount = computeRuntimeTripCount(L, SE);
if (rtTripCount < FlatLoopTripCountThreshold) return NO_UNROLL;
int count = UP->Count ? UP->Count : UP->DefaultUnrollCount;
// same halving + threshold logic as Priority 5
while (UP->PartialThreshold <
UP->FixedCost + count * (BodySize - UP->FixedCost))
count >>= 1;
count = min(count, UP->MaxCount);
if (count > 1) {
UP->Count = count;
*AllowRuntime = true;
return RUNTIME_UNROLL;
}
}
// Small-function override (tiny kernels get aggressive unrolling)
if (functionInstructionCount < SmallFunctionThreshold)
return handleSmallFunction(L, UP, BodySize);
return NO_UNROLL;
}
Local Array Heuristic
The function sub_19B5DD0 (computeLocalArraySize) is entirely NVIDIA-specific. It scans every basic block in the loop for load/store instructions that access address space 5 (GPU local memory). For each such access, it traces back to the underlying alloca, determines the array type, and computes the product of array dimensions. If any dimension is unknown at compile time, it substitutes the unroll-assumed-size knob (default 4). The returned value is the maximum local-array size found across all accesses.
This value becomes a threshold multiplier, capped at 6:
int computeLocalArraySize(Loop *L) {
int maxSize = 0;
for (BasicBlock *BB : L->blocks()) {
for (Instruction &I : *BB) {
if (!isLoadOrStore(I) || getAddressSpace(I) != 5) continue;
Value *base = getUnderlyingAlloca(I);
if (!base || !isArrayType(base->getType())) continue;
int size = 1;
for (int dim : getArrayDimensions(base))
size *= (dim > 0) ? dim : UnrollAssumedSize; // default 4
maxSize = max(maxSize, size);
}
}
return maxSize;
}
The rationale: GPU kernels frequently use __shared__ or local arrays indexed by threadIdx. Unrolling such loops by a factor proportional to the array size enables register promotion of individual array elements and eliminates bank-conflict-prone access patterns. The cap at 6 prevents pathological explosion when arrays are large.
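A worked example of the multiplier clamp (the kernel and array shapes here are hypothetical):

```python
# Worked example of the local-array threshold multiplier described above.
# A thread-local float buf[8][4] gives a dimension product of 32, which
# the clamp reduces to the cap of 6; an unknown dimension substitutes
# unroll-assumed-size (default 4).

UNROLL_ASSUMED_SIZE = 4

def local_array_multiplier(dims):
    size = 1
    for d in dims:
        size *= d if d > 0 else UNROLL_ASSUMED_SIZE
    return min(max(size, 1), 6)

print(local_array_multiplier([8, 4]))   # large array, capped at 6
print(local_array_multiplier([2]))      # small array -> multiplier 2
print(local_array_multiplier([-1]))     # unknown dim -> assumed size 4
# With a base Threshold of 300, the 6x multiplier yields an effective
# full-unroll budget of 1800 cost units.
```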
Power-of-Two Factor Enforcement
The partial-unroll factor search at Priority 5 requires the chosen count to satisfy two constraints simultaneously: it must evenly divide the trip count and must be a power of two. The implementation uses the classic bitmask test:
while (count > 0) {
if (tripCount % count == 0 && (count & (count - 1)) == 0)
break;
count--;
}
This is a GPU-specific requirement. Warp size is 32 (a power of two), and many GPU memory access patterns, shared-memory bank calculations, and reduction operations assume power-of-two alignment. An unroll factor of, say, 6 would create asymmetric loop bodies that interact poorly with warp-level execution.
Pragma Handling
The frontend (sub_9305A0 / emitUnrollPragma) translates CUDA pragmas to LLVM metadata during codegen:
| CUDA Source | LLVM Metadata |
|---|---|
| #pragma unroll (bare) | !{!"llvm.loop.unroll.full"} |
| #pragma unroll N (N > 1) | !{!"llvm.loop.unroll.count", i32 N} |
| #pragma unroll 1 | Disables unrolling at Priority 2 |
The metadata is attached to the backedge branch as a self-referential !llvm.loop node. A guard flag (dword_4D046B4) skips pragma processing entirely in fast-codegen mode.
The pragma threshold is 32768 (0x8000), compared to upstream LLVM's 16384 (0x4000). This means #pragma unroll succeeds on loop bodies up to approximately 32K cost units -- covering virtually any realistic GPU kernel loop. When even this generous budget is exceeded, the decision engine falls through to lower priorities and attempts partial unrolling.
The __launch_bounds__ attribute does not directly feed the unroll decision. Instead, it constrains register allocation downstream, which indirectly limits the benefit of aggressive unrolling. There is no feedback loop from register pressure estimation back into the unroll factor at this stage of the pipeline; that coordination happens implicitly through the PartialThreshold provided by TTI.
Runtime Unrolling
Runtime unrolling (Priority 6) handles loops whose trip count is unknown at compile time. cicc enables it by default (unroll-runtime = true), with several GPU-specific twists:
Convergent instruction support. The knob unroll-runtime-convergent (default true, NVIDIA-specific) allows unrolling loops that contain convergent operations like warp-level primitives (__shfl_sync, __ballot_sync). Upstream LLVM refuses to unroll such loops because it cannot guarantee all threads in the warp execute the same iterations. cicc overrides this, relying on the waterfall-epilogue mechanism to preserve convergence.
Epilog vs. prolog remainder. The choice is controlled by a cascade:
- If `waterfall-unrolling-force-epilogue` is `true` (default, NVIDIA-specific) and the loop has a runtime trip count: epilog mode is selected.
- If the loop body contains function calls (`hasCallInLoop` / `sub_2A10B40` checks for opcode 17): epilog mode is forced. This preserves the property that all threads in a warp participate in calls, which matters for convergent operations.
- Otherwise, `unroll-runtime-epilog` (default `false`) determines the mode.
In practice, GPU loops almost always use epilog-style remainders.
Flat-loop exclusion. If the estimated runtime trip count is below flat-loop-tripcount-threshold (default 5), runtime unrolling is skipped. The overhead of generating the modulo check and epilog loop is not worth it for loops that iterate fewer than 5 times.
Body size gate. Runtime unrolling only proceeds if runtime-unroll-threshold (default 95) is greater than or equal to the loop body size. This is more conservative than the static partial-unroll threshold, preventing code explosion for large loop bodies when the trip count is unknown.
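The two gates combine into a simple predicate (a sketch of the documented behavior, using the stated defaults):

```python
# The two runtime-unrolling gates described above, with cicc's defaults.

RUNTIME_UNROLL_THRESHOLD = 95       # runtime-unroll-threshold
FLAT_LOOP_TRIPCOUNT_THRESHOLD = 5   # flat-loop-tripcount-threshold

def runtime_unroll_allowed(body_size, est_trip_count):
    if body_size > RUNTIME_UNROLL_THRESHOLD:
        return False   # body size gate: too large for an unknown trip count
    if est_trip_count < FLAT_LOOP_TRIPCOUNT_THRESHOLD:
        return False   # flat-loop exclusion: modulo check not worth it
    return True

print(runtime_unroll_allowed(40, 100))   # passes both gates
print(runtime_unroll_allowed(120, 100))  # rejected by the body size gate
print(runtime_unroll_allowed(40, 3))     # rejected as a flat loop
```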
Thresholds: NVIDIA vs. Upstream LLVM
| Parameter | Upstream LLVM (O3) | Upstream LLVM (NVPTX TTI) | cicc v13.0 |
|---|---|---|---|
| Threshold | 300 | 300 | From TTI (300), then multiplied by local-array factor (1-6x) |
| PartialThreshold | 150 | 75 (Threshold / 4) | From TTI (75), plus local-array scaling |
| MaxPercentThresholdBoost | 400% | 400% | 400% (same) |
| PragmaUnrollThreshold | 16384 | 16384 | 32768 |
| RuntimeUnrollThreshold | -- | -- | 95 (NVIDIA addition) |
| FlatLoopTripCountThreshold | 5 | 5 | 5 (same) |
| MaxUpperBound | 8 | 8 | 8 (same) |
| MaxPragmaUpperBound | -- | -- | 64 (NVIDIA addition) |
| DefaultUnrollRuntimeCount | 8 | 8 | From TTI |
| AllowPartial | false | true | true (from TTI) |
| Runtime | false | true | true (from TTI) |
| AllowRemainder | true | true | true |
| MaxIterationsCountToAnalyze | 10 | 10 | 10 (same) |
| UnrollAssumedSize | -- | -- | 4 (NVIDIA addition) |
The critical differences: cicc doubles the pragma threshold, introduces a body-size gate for runtime unrolling (95), adds the local-array multiplier (up to 6x on base thresholds), and enforces power-of-two partial factors. The upstream NVPTX TTI enables partial and runtime unrolling but leaves thresholds at modest CPU-oriented values; cicc's decision engine applies substantial additional logic on top.
Interaction with Loop Vectorization
In the cicc pipeline, loop vectorization (LoopVectorizePass) runs before the first unroll invocation. Specifically, sub_197E720 combines both vectorization and unrolling decisions in the early pipeline slot. The vectorizer decides the vector width first (VF), and if it applies a transformation, the resulting loop (possibly with a scalar epilog) is then presented to the unroller.
This means vectorization and unrolling do not "coordinate" in the planning sense -- the vectorizer runs to completion before the unroller sees the loop. However, the vectorizer's interleave count (IC) serves a similar role to unrolling: it replicates the vectorized loop body to increase ILP. When the vectorizer chooses IC > 1, the subsequent unroller typically finds the loop body too large to unroll further, producing a de facto coordination through cost thresholds.
The second unroll invocation (sub_19C1680) runs much later, after InstCombine, SROA, and EarlyCSE have had a chance to simplify the vectorized code. Loops that were too large to unroll earlier may become eligible after dead code elimination within the unrolled-and-vectorized body.
The Transformation Engine: UnrollLoop
The transformation at sub_2A15A20 takes a loop and an unroll factor and physically duplicates the loop body. It is structurally close to upstream llvm::UnrollLoop with the following entry guards:
- Loop must have a preheader (`sub_D4B130`)
- Loop must have a single latch (`sub_D47930`)
- Loop must be in LCSSA form (`sub_D49210`)
- Header flags must be clean (no special bits set)
The duplication proceeds by iterating Count - 1 times, each iteration cloning every basic block in the loop body, remapping instructions through a value map, and rewiring PHI nodes so that iteration i's latch feeds iteration i+1's header. After all copies, the backedge of the last copy is reconnected to the first copy's header (for partial unroll) or removed entirely (for full unroll).
For partial unrolls where TripCount % Count != 0, a remainder loop is generated by sub_2A23640. If remainder generation fails (e.g., multi-exit loops), the engine delegates to sub_2A25260 which generates the runtime-check variant with prologue/epilogue.
The return value encodes the result: 0 = no change, 1 = partial unroll, 2 = full unroll.
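The semantics of a partial unroll with an epilog remainder can be sketched as follows (an illustrative equivalence, not the cicc cloning code): the main loop executes trip_count // Count unrolled bodies and the remainder loop picks up the leftover trip_count % Count iterations, visiting every index exactly once.

```python
# Illustrative 4x partial unroll with an epilog remainder loop.

def unrolled_sum(values):
    count = 4                       # unroll factor
    total, i, n = 0, 0, len(values)
    main_trip = n - n % count       # iterations handled by the main loop
    while i < main_trip:            # unrolled main loop: 4 body copies
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += count
    while i < n:                    # epilog remainder loop
        total += values[i]
        i += 1
    return total

data = list(range(10))  # trip count 10 = 2 unrolled trips + 2 remainder
assert unrolled_sum(data) == sum(data)
```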
Configuration Knobs
Standard LLVM Knobs (with NVIDIA defaults)
| Knob | Default | Global | Effect |
|---|---|---|---|
| unroll-threshold | From TTI | sub_19B7760 struct | Base cost budget for full unroll |
| unroll-partial-threshold | From TTI | 0x4FB3140 area | Cost budget for partial unroll |
| unroll-max-percent-threshold-boost | 400 | dword_4FB3100 | Max dynamic cost boost (%) |
| unroll-max-iteration-count-to-analyze | 10 | dword_4FB3020 | Max iterations for cost simulation |
| unroll-count | Unset | dword_4FB2EA8 | Force specific unroll factor |
| unroll-max-count | Unset | sub_19B7760 struct | Hard cap on unroll factor |
| unroll-full-max-count | Unset | 0x4FB2CE0 area | Max trip count for full unroll |
| unroll-peel-count | Unset | 0x4FB2C00 area | Force specific peel count |
| unroll-allow-partial | false | 0x4FB2B20 area | Enable partial unrolling override |
| unroll-allow-remainder | false | 0x4FB2A40 area | Enable remainder loop generation |
| unroll-runtime | true | 0x4FB2960 area | Enable runtime (dynamic TC) unrolling |
| unroll-max-upperbound | 8 | dword_4FB2920 | Max trip count for upper-bound unroll |
| pragma-unroll-threshold | 32768 | dword_4FB2760 | Cost budget for pragma-directed unrolls |
| flat-loop-tripcount-threshold | 5 | 0x4FB2680 area | Min estimated TC for runtime unroll |
| runtime-unroll-threshold | 95 | dword_4FB3560 | Max body size for runtime unroll |
| max-pragma-upperbound-unroll | 64 | dword_4FB2840 | Max upper-bound factor for pragma |
| unroll-assumed-size | 4 | dword_4FB33A0 | Assumed array size for unknown dims |
NVIDIA-Specific Knobs
| Knob | Default | Global | Effect |
|---|---|---|---|
| unroll-runtime-convergent | true | 0x500A440 area | Allow unrolling loops with convergent ops |
| unroll-runtime-epilog | false | qword_500A3E8 | Force epilog-style remainder (override) |
| waterfall-unrolling-force-epilogue | true | qword_500A148 | Force epilog for waterfall patterns |
Knobs are registered in two constructors: standard LLVM knobs in ctor_216_0 at 0x4E5C30, NVIDIA-specific knobs in ctor_501 at 0x559890.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| emitUnrollPragma | 0x09305A0 | -- | Frontend: #pragma unroll to metadata |
| parseUnrollMetadata | 0x19B4C50 | -- | Reads llvm.loop.unroll.* metadata |
| computeLocalArraySize | 0x19B5DD0 | -- | NVIDIA: local array threshold heuristic |
| handleSmallFunction | 0x19B6500 | -- | Special aggressive unroll for tiny kernels |
| selectUnrollFactor | 0x19B6690 | -- | Trip count analysis helper |
| emitRemainderNotAllowedRemark | 0x19B78B0 | -- | Diagnostic emission |
| simulateLoopBody | 0x19B9A90 | -- | Dynamic cost simulation with constant folding |
| computeUnrollCount | 0x19BB5C0 | -- | Main decision engine |
| tryToUnrollLoop | 0x19BE360 | -- | Top-level driver |
| computePeelCount | 0x1B0B080 | -- | Loop peeling logic |
| computeRuntimeTripCount | 0x1B18810 | -- | Runtime trip count estimation |
| hasCallInLoop | 0x2A10B40 | -- | Checks for call/invoke in loop body |
| createSideExitPHI | 0x2A10DD0 | -- | PHI nodes for side-exit unrolled loops |
| cloneInstructionsInBlock | 0x2A12AD0 | -- | Instruction-level cloning |
| reconcileLoopAfterUnroll | 0x2A13F00 | -- | Post-unroll SCEV/LoopInfo fixup |
| UnrollLoop | 0x2A15A20 | -- | Main transformation engine |
| unrollCostModel | 0x2A1AA10 | -- | Cost estimation helper |
| UnrollAndJamLoop | 0x2A1CF00 | -- | Unroll-and-jam variant |
| generateRemainderLoop | 0x2A23640 | -- | Remainder loop construction |
| UnrollLoopWithRuntimeChecks | 0x2A25260 | -- | Prologue/epilogue generation |
Pass Factory and Object Layout
The following section documents the LoopUnroll pass factory at sub_19B73C0, which was originally misidentified as LICM in the P2C.3 sweep due to binary adjacency with the actual LICM pass. The pass ID unk_4FB224C, the 7-parameter constructor signature, and diagnostic function strings all confirm LoopUnroll identity.
The pass factory at sub_19B73C0 allocates a 184-byte pass object and accepts seven parameters that control unroll behavior. When a parameter is -1, the pass uses its compiled-in default.
Constructor Parameters
| Parameter | Offset | Enable Flag | Semantics |
|---|---|---|---|
| a1 (optimization level) | +156 | -- | 2 = standard, 3 = aggressive |
| a2 (unroll threshold) | +168 | +172 | Trip count threshold; -1 = use default |
| a3 (unroll count) | +160 | +164 | Explicit unroll factor; -1 = use default |
| a4 (allow partial) | +176 | +177 | 0 = disable partial unroll, 1 = enable |
| a5 (runtime unroll) | +178 | +179 | 0 = disable runtime unroll, 1 = enable |
| a6 (upper bound) | +180 | +181 | 0 = disable upper-bound unroll, 1 = enable |
| a7 (profile-based) | +182 | +183 | 0 = disable profile-guided unroll, 1 = enable |
Object Construction
The factory allocates 184 bytes via sub_22077B0, sets the vtable to off_49F45F0 (loop-unroll pass vtable), stores pass ID unk_4FB224C at offset +16, initializes self-referential linked-list pointers at offsets +80/+88 and +128/+136, sets pass type 2 (FunctionPass) at offset +24, and calls sub_163A1D0 / sub_19B71A0 for pass registration.
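The -1 sentinel convention and the (value, enable-flag) field pairs can be modeled as writes into a plain byte buffer. This is a sketch using the offsets from the constructor-parameter table; the struct, helper names, and zero-initialization are my own labels, not recovered symbols:

```c
#include <assert.h>
#include <string.h>

/* Byte-buffer model of the recovered 184-byte LoopUnroll pass object. */
typedef struct { unsigned char raw[184]; } UnrollPassObj;

/* 4-byte value plus 1-byte enable flag; -1 leaves the default in force. */
static void set_int_knob(UnrollPassObj *p, int val_off, int flag_off, int v) {
    if (v == -1) return;
    memcpy(&p->raw[val_off], &v, sizeof v);
    p->raw[flag_off] = 1;
}

/* 1-byte value plus adjacent 1-byte enable flag. */
static void set_bool_knob(UnrollPassObj *p, int val_off, int flag_off, int v) {
    if (v == -1) return;
    p->raw[val_off] = (unsigned char)v;
    p->raw[flag_off] = 1;
}

static UnrollPassObj make_unroll_pass(int opt, int thresh, int count,
                                      int partial, int runtime,
                                      int upper, int profile) {
    UnrollPassObj p;
    memset(&p, 0, sizeof p);
    memcpy(&p.raw[156], &opt, sizeof opt); /* +156: optimization level */
    set_int_knob(&p, 168, 172, thresh);    /* +168/+172: threshold */
    set_int_knob(&p, 160, 164, count);     /* +160/+164: unroll count */
    set_bool_knob(&p, 176, 177, partial);  /* +176/+177: allow partial */
    set_bool_knob(&p, 178, 179, runtime);  /* +178/+179: runtime unroll */
    set_bool_knob(&p, 180, 181, upper);    /* +180/+181: upper bound */
    set_bool_knob(&p, 182, 183, profile);  /* +182/+183: profile-based */
    return p;
}
```

Under this model, passing -1 leaves both the value slot and the enable flag untouched, so the decision engine later falls back to its compiled-in defaults; passing an explicit 0 sets the enable flag with a zero value, which is a different state.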
Pipeline Invocation Configurations
CICC invokes LoopUnroll with six distinct configurations at different pipeline stages, reflecting NVIDIA's careful tuning of unroll aggressiveness per compilation phase. These are the factory-level parameter sets passed to sub_19B73C0; see also the decision engine's per-invocation behavior in The Decision Engine above.
Configuration A: Standard Pipeline (O1/O2)
Call site: sub_12DE330
LoopUnroll(2, -1, -1, -1, -1, -1, -1)
All parameters at defaults. Standard unrolling with default thresholds at optimization level 2.
Configuration B: Code-Size Mode
Call site: sub_12DE8F0, when *(a3+4480) < 0 (NVIDIA code-size flag set)
LoopUnroll(a2, -1, -1, 0, 0, 0, 0)
All unrolling features disabled: partial, runtime, upper-bound, and profile-based are all zeroed. The pass only unrolls when the trip count is statically known and the benefit is certain. This reflects the constraint that GPU register pressure makes speculative unrolling expensive when code size matters.
Configuration C: Normal Optimizer
Call site: sub_12DE8F0, when *(a3+4480) >= 0 (normal mode)
LoopUnroll(a2, -1, -1, -1, -1, -1, -1)
Fully aggressive unrolling with all defaults. The optimization level is passed through from the caller.
Configuration D: Late Pipeline (Conservative)
Call site: sub_12DE8F0, late pipeline position
LoopUnroll(a2, -1, -1, 0, 0, -1, -1)
Partial and runtime unrolling disabled, but upper-bound and profile-based unrolling retain their defaults. This conservative late-pipeline configuration avoids creating new runtime overhead in code that has already been substantially optimized.
Configuration E: Aggressive Pipeline (O3)
Call site: sub_12E54A0
LoopUnroll(3, -1, -1, 0, 0, -1, 0)
Optimization level 3 with aggressive thresholds, but partial, runtime, and profile-based unrolling are disabled. Only upper-bound unrolling retains its default. The rationale is that at O3, the higher thresholds already capture most profitable unrolling opportunities without needing speculative runtime checks.
Configuration F: User-Configured
Call site: sub_12EA3A0
LoopUnroll(a1[4], a1[5], a1[6], a1[7], a1[8], a1[9], a1[10])
All seven parameters are read from a stored configuration object, enabling user-specified unroll behavior via command-line flags or pragmas.
Threshold Initialization (Pass-Level)
The function sub_19B6690 (17 KB) configures unroll thresholds based on optimization level and LLVM knobs at pass construction time. These values feed into the UnrollParams struct consumed by the decision engine.
Default Threshold Values
| Offset | Field | Default (O2+) | Default (O1) |
|---|---|---|---|
| +0 | OptThreshold | 405 | 150 |
| +4 | Threshold | 400 | 400 |
| +12 | SmallTripCountThreshold | 150 | 150 |
| +56 | MaxIterationsCountToAnalyze | 60 | 60 |
Function-Attribute-Aware Override
The threshold initializer queries function attributes via sub_1560180:
- Attribute ID 34 (minsize): reduces OptThreshold to SmallTripCountThreshold (150).
- Attribute ID 17 (optsize): same reduction.
This means kernels annotated with size constraints get conservative unroll thresholds regardless of the global optimization level.
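A minimal sketch of that clamping rule, using the default values from the table above (the function name and boolean parameters are hypothetical, not recovered symbols):

```c
#include <assert.h>
#include <stdbool.h>

/* Attribute IDs as recovered from the sub_1560180 queries. */
enum { ATTR_OPTSIZE = 17, ATTR_MINSIZE = 34 };

/* minsize or optsize clamps OptThreshold down to
   SmallTripCountThreshold, regardless of optimization level. */
static int effective_opt_threshold(int opt_level, bool has_minsize,
                                   bool has_optsize) {
    int opt_threshold = (opt_level >= 2) ? 405 : 150; /* table defaults */
    const int small_tc_threshold = 150;
    if (has_minsize || has_optsize)
        opt_threshold = small_tc_threshold;
    return opt_threshold;
}
```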
Per-Function Knob Override via BST
The function queries the LLVM option registry (dword_4FA0208 BST) ten times, each time looking up a different knob address. For each knob, it searches the BST rooted at dword_4FA0208[2], compares the current function hash (sub_16D5D50) against node ranges, and applies the override if the knob value meets the threshold. The knob-to-field mapping:
| Knob Address | Override Address | Field |
|---|---|---|
| dword_4FB3228 | dword_4FB32C0 | OptThreshold (+0) |
| dword_4FB3148 | dword_4FB31E0 | SmallTripCountThreshold (+12) |
| dword_4FB3068 | dword_4FB3100 | Threshold (+4) |
| dword_4FB2DC8 | dword_4FB2E60 | field +32 |
| dword_4FB2CE8 | dword_4FB2D80 | field +36 |
| dword_4FB2C08 | dword_4FB2CA0 | field +24 |
| dword_4FB2B28 | (next value) | field +40 |
The per-function BST lookup keyed by function hash enables fine-grained tuning of unroll behavior per kernel, a capability not present in upstream LLVM.
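The range query can be modeled with a sorted interval array and binary search; the real code walks a BST rooted at dword_4FA0208[2], but the lookup semantics are the same. All names and sample data here are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* One BST node: a function-hash range and the override value it carries. */
typedef struct { uint64_t lo, hi; int value; } OverrideNode;

/* Binary search over non-overlapping, sorted ranges; returns the
   per-function override if fn_hash falls inside a node's range,
   otherwise the global default. */
static int lookup_override(const OverrideNode *nodes, size_t n,
                           uint64_t fn_hash, int fallback) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (fn_hash < nodes[mid].lo)      hi = mid;
        else if (fn_hash > nodes[mid].hi) lo = mid + 1;
        else return nodes[mid].value;     /* hash in node's range */
    }
    return fallback;                      /* no per-function override */
}
```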
Diagnostic Functions
Three diagnostic emission functions produce optimization remarks:
| Function | Address | Diagnostic |
|---|---|---|
| emitPragmaCountDiag | sub_19B78B0 | Reports when pragma unroll count conflicts with trip multiple |
| emitThresholdDiag | sub_19B7B10 | Reports when unrolled size exceeds threshold |
| emitLoopSizeDiag | sub_19B7D80 | Reports when loop body is too large to unroll |
Main Loop Processing and Hash Infrastructure
The primary analysis function sub_19B7FA0 (11 KB) analyzes each candidate loop. The pass uses hash table infrastructure shared with other CICC LLVM passes:
| Function | Address | Size | Role |
|---|---|---|---|
| rehashSmallTable | sub_19B60B0 | 5 KB | Small hash table resize |
| rehashTable | sub_19B8820 | 4 KB | Key-value hash table resize |
| rehashSet | sub_19B89E0 | 7 KB | Set hash table resize |
| insertIntoSet | sub_19B8DA0 | -- | Set insert with growth |
All hash tables use the same (value >> 9) ^ (value >> 4) hash function and linear probing strategy found throughout CICC's LLVM passes. See Hash Infrastructure for the common implementation.
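A self-contained model of that shared scheme (power-of-two table size, linear probing, and zero used here as the empty sentinel; only the hash function itself is recovered from the binary):

```c
#include <assert.h>
#include <stdint.h>

#define TBL_SIZE 64 /* must be a power of two for the mask to work */

/* The (value >> 9) ^ (value >> 4) hash seen across CICC's passes,
   masked down to a table index. */
static unsigned cicc_hash(uint64_t v) {
    return (unsigned)(((v >> 9) ^ (v >> 4)) & (TBL_SIZE - 1));
}

/* Linear-probe insert; returns 1 if inserted, 0 if already present.
   Caller must keep the load factor below 1 (the real tables rehash). */
static int probe_insert(uint64_t *slots, uint64_t key) {
    unsigned i = cicc_hash(key);
    while (slots[i] != 0) {               /* 0 = empty sentinel */
        if (slots[i] == key) return 0;    /* duplicate */
        i = (i + 1) & (TBL_SIZE - 1);     /* linear probing step */
    }
    slots[i] = key;
    return 1;
}
```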
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Pragma threshold | UnrollThreshold default 150; pragma multiplier ~8x | Pragma threshold ~200x larger than stock (pragma-unroll-threshold = 32768); enables aggressive pragma-directed unrolling for GPU kernels |
| Power-of-two enforcement | No power-of-two requirement; any profitable factor accepted | Enforces power-of-two unroll factors; non-power-of-two factors are rounded down to avoid irregular loop tails |
| Local array multiplier | No concept of local array bonus | Dedicated local-array threshold multiplier boosts unroll budget when loop body accesses alloca/.local arrays indexed by IV, enabling register promotion |
| Decision engine | ~20 KB computeUnrollCount | Substantially reworked 50 KB computeUnrollCount (sub_19BB5C0) with 6-level priority cascade and GPU-specific occupancy heuristics |
| Register pressure model | Generic TTI-based unroll cost; no occupancy concept | Occupancy-aware cost model considers register pressure cliffs where one additional register per thread drops warp occupancy |
| Pipeline invocations | Single invocation in optimization pipeline | Two invocations: early (interleaved with vectorization) and late (cleanup, gated by opts[1360] / nv-disable-loop-unrolling) |
| Transformation engine | Stock llvm::UnrollLoop | Lightly modified UnrollLoop (sub_2A15A20, 85 KB); decision engine is where the changes concentrate |
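The power-of-two enforcement row above amounts to a bit_floor on the candidate factor. A minimal sketch of the rounding (my own helper, not recovered code):

```c
#include <assert.h>

/* Round a candidate unroll factor down to the nearest power of two,
   avoiding the irregular loop tails a factor like 6 would create. */
static unsigned pow2_floor(unsigned x) {
    unsigned p = 1;
    while (p * 2 != 0 && p * 2 <= x) /* guard against unsigned overflow */
        p *= 2;
    return x ? p : 0;
}
```

With a trip count of 6 and enough budget for factor 6, this model unrolls by 4 instead, matching the behavior the "Test This" section below probes.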
Test This
The following kernel contains a simple counted loop that is a prime candidate for full unrolling. Compile and compare PTX output with and without #pragma unroll.
__global__ void unroll_test(float* out, const float* in) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0.0f;
#pragma unroll
for (int i = 0; i < 8; i++) {
sum += in[tid + i * 128];
}
out[tid] = sum;
}
What to look for in PTX:
- With #pragma unroll: the loop should be fully unrolled into 8 sequential ld.global.f32 + add.f32 sequences with no backedge branch. Look for the absence of bra instructions targeting a loop header and the presence of 8 distinct ld.global.f32 instructions with addresses offset by 128*sizeof(float).
- Without #pragma unroll (remove the pragma): the compiler may still unroll if the trip count (8) times body size fits within the threshold (default 300). Check whether the PTX has a loop or is fully unrolled -- this exercises the automatic decision engine.
- With #pragma unroll 1: the loop must remain as a counted loop with a backedge branch. This tests that pragma disabling works.
- Compare register usage across the three variants (the %f<N>/%r<N> virtual-register declaration bounds in the PTX, or the register counts reported by ptxas -v). Full unrolling increases register pressure (8 loads live simultaneously); the partial or no-unroll variant uses fewer registers at the cost of loop overhead.
- The power-of-two enforcement is visible when the trip count is not a power of two: change the loop bound to 6 and check whether the compiler partially unrolls by 4 (the highest power of two within the body-size budget) rather than 6.
Cross-References
- Loop Optimization Passes -- pipeline context and pass ordering
- LICM -- runs before second unroll invocation, feeds hoisted invariants
- Loop Strength Reduction -- runs after unrolling, reduces IV expressions
- Register Allocation -- occupancy-driven allocation consumes what unrolling produces
- StructurizeCFG -- runs after all loop transforms, restructures divergent control flow
- InstCombine -- simplifies unrolled loop bodies between invocations
LoopVectorize and VPlan (GPU-Adapted)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp, llvm/lib/Transforms/Vectorize/VPlan*.cpp (LLVM 20.0.0). VPlan infrastructure lives in llvm/lib/Transforms/Vectorize/VPlan.cpp, VPlanRecipes.cpp, VPlanTransforms.cpp, and related files.
LLVM version note: CICC v13.0 is based on LLVM 20.0.0 trunk. Evidence includes histogram-pattern support (merged in LLVM 19), early-exit vectorization (an LLVM 20 experimental feature, gated by byte_500CDA8), and the VPlan-native path. The VPlan object size (656 bytes) is consistent with LLVM 17/18+ layout. Scalable vectors are always disabled for NVPTX.
NVIDIA's cicc ships a heavily modified copy of LLVM's LoopVectorizePass, the single largest pass in the vectorization pipeline at 88 KB of decompiled output (2,612 lines in sub_2AF1970). The modifications do not change the pass's fundamental architecture -- it still builds VPlans, selects a vectorization factor (VF) through cost modeling, and transforms IR through VPlan execution -- but the cost model, VF selection heuristics, interleave count logic, and legality checker are all tuned for a target where "vectorization" means something fundamentally different than on a CPU. On a CPU, loop vectorization fills SIMD lanes: a VF of 4 on SSE processes four float elements per vector instruction. On an NVIDIA GPU, there are no SIMD lanes in the CPU sense -- each thread already executes scalar code, and the warp executes 32 threads in lockstep.
The reasons to vectorize on GPU are: (1) memory coalescing -- adjacent threads issuing adjacent loads produce 128-byte cache line transactions, and vectorizing a per-thread loop body with VF=2 or VF=4 produces ld.v2/ld.v4 wide loads that maximize bytes-per-transaction; (2) reducing instruction count -- a single ld.global.v4.f32 replaces four ld.global.f32 instructions, saving fetch/decode/issue bandwidth; (3) register-to-memory width matching -- PTX supports 32-, 64-, and 128-bit load/store widths, and vectorization widens narrow scalar accesses to fill these naturally.
Key Facts
| Property | Value |
|---|---|
| Registration | New PM #400, parameterized: no-interleave-forced-only;... |
| Runtime positions | Not in Tier 0/1/2/3 tables; invoked via LLVM standard sub-pipeline sub_1A62BF0 when vectorization is enabled (see Pipeline) |
| Main entry point | sub_2AF1970 (0x2AF1970) -- LoopVectorizePass::processLoop() |
| Binary size | 88 KB decompiled, 2,612 lines |
| VPlan builder | sub_2AEE460 (0x2AEE460) -- tryToBuildVPlanWithVPRecipes(), 56 KB |
| VPlan object size | 656 bytes (0x290), consistent with LLVM 17/18 layout |
| LLVM base | LLVM 20 trunk (evidence: histogram-pattern support, early-exit vectorization, VPlan-native path) |
| Scalable vectors | Always disabled -- sub_DFE610 returns false for NVPTX |
| Register bit width (TTI) | 32 bits fixed (TypeSize::getFixed(32) in upstream NVPTXTTIImpl) |
| Pass name string | "vectorize-loops" at 0x439F095 |
| Address cluster | 0x2AA0000--0x2C20000 (loop vectorizer + VPlan infrastructure) |
Why Vectorize on GPU
GPU vectorization is not about filling SIMD lanes -- the SIMT model already replicates scalar code across 32 threads. Vectorization targets three orthogonal benefits related to memory coalescing and instruction throughput:
Memory coalescing width. The GPU memory subsystem services requests in 128-byte transactions. If a single thread's inner loop accesses 4 consecutive floats in sequence, those 4 accesses become 4 separate scalar loads issued over 4 iterations. Vectorizing with VF=4 converts them into one ld.global.v4.f32, which the memory subsystem can service in a single wider transaction per thread. Across the warp, this multiplies the effective memory bandwidth.
Instruction count reduction. PTX's ld.v2 and ld.v4 instructions load 2 or 4 elements with a single instruction. The instruction issue pipeline has finite throughput (typically 1-2 instructions per clock per scheduler), so halving instruction count directly improves throughput-bound kernels.
Register-width matching. PTX has 32-bit typed registers. A 128-bit ld.v4.f32 loads directly into four consecutive registers via a single instruction, which is strictly better than four separate 32-bit loads (each requiring its own address computation).
These benefits are bounded by register pressure -- the primary constraint that does not exist on CPU. On a GPU, every additional register per thread can cross an occupancy cliff, potentially losing an entire warp group. A VF=4 vectorization that quadruples the live register count may halve occupancy and lose net throughput.
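The occupancy cliff can be illustrated with back-of-the-envelope arithmetic (not recovered code; the numbers assume a 64K-entry per-SM register file, 32 threads per warp, and a 32-resident-warp cap, as on a Turing-class SM):

```c
#include <assert.h>

/* Resident warps per SM as a function of registers per thread.
   Illustrative occupancy arithmetic only. */
static int warps_per_sm(int regs_per_thread) {
    const int regfile = 65536;  /* 32-bit registers per SM (assumed) */
    const int warp = 32;        /* threads per warp */
    int warps = regfile / (regs_per_thread * warp);
    if (warps > 32) warps = 32; /* hardware cap on resident warps */
    return warps;
}
```

At 64 registers per thread the SM still holds 32 warps; one more register drops it to 31, and a VF=4 vectorization that pushes a kernel from 32 to 128 registers halves-and-halves again the resident warp count.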
The 8-Phase Pipeline
The main function sub_2AF1970 implements eight phases, closely following upstream LLVM's structure but with GPU-specific decision points at each stage.
Phase 1: Legality Pre-Check
sub_31A4FD0(legalityCtx, Loop, Function, ORE, SE) // init legality scratch
TTI = *(**(Loop+32) + 72) // Loop->getHeader()->getParent()->getTTI()
if (!sub_31A91F0(legalityCtx, TTI, Loop, LoopInfo)) // canVectorize() quick check
return false
sub_31AF060(costCtx, ForceVectorization) // canVectorize() full check
// ForceVectorization = qword_500D340[17] ("vectorize-loops" knob)
The legality checker (sub_31AF060) performs standard LLVM legality analysis: loop simplify form, single exit, computable backedge-taken count, no irreducible control flow. The NVIDIA-specific addition is early-exit loop handling:
if (hasUncountableEarlyExit && !byte_500CDA8) // -enable-early-exit-vectorization
emit "UncountableEarlyExitLoopsDisabled"
return false
This knob (byte_500CDA8) gates an LLVM 20 feature that NVIDIA includes but disables by default. Early-exit vectorization requires predicated execution, which on GPU means divergent warps -- typically unprofitable.
Phase 2: Outer vs Inner Loop Dispatch
if (Loop->getSubLoops().size() > 0)
goto outerLoopPath // PATH A (rarely taken on GPU)
else
goto innerLoopPath // PATH B (the main path)
Outer loop vectorization is controlled by byte_500D208 (-force-vector-width-outer). When enabled and TTI-based VF selection returns VF <= 1, the pass forces VF=4 -- a hardcoded NVIDIA override for kernel patterns where outer-loop vectorization benefits warp-level memory access patterns. In practice, inner loop vectorization (Path B) handles the vast majority of GPU kernels.
Phase 3: Trip Count and Safety Checks
tripCount = getSmallBestKnownTC(PSE, Loop) // sub_2AA7EC0
if (tripCount < VectorizerMinTripCount // dword_500EAE8
&& !isForceVectorize(legalCtx)
&& !(exactTC >= userVF))
emit "LowTripCount"
reduce hint to interleave-only
if (hasAttribute(TTI, NoImplicitFloat)) // attribute 30
bail "NoImplicitFloat"
if (hasUnsafeFPOps && !canReorderFP(override))
bail "UnsafeFP" / "CantReorderFPOps"
The FP reorder safety check has an override mechanism: dword_500D508 selects whether the override is active, and byte_500D588 provides the override value. This lets NVIDIA force-allow FP reordering for specific compilation modes (e.g., -ffast-math propagated from nvcc).
Phase 4: VF Selection
This is where NVIDIA diverges most from upstream. The upstream algorithm queries TTI::getRegisterBitWidth() which returns the vector register width (256 for AVX2, 512 for AVX-512), then computes VF = registerWidth / elementSize. On NVPTX, getRegisterBitWidth() returns 32 -- a single scalar register width. This means the upstream formula would always produce VF=1 for 32-bit types.
NVIDIA's VF selection (sub_2AB8AC0 for outer loops, sub_2AE08E0 for inner loops via VPlan cost) works differently:
// sub_2AB8AC0 — outer loop VF selection (simplified)
elementBits = getWideningElementSize(CostModel) // sub_2AB4370: top 32 bits
regWidth = TTI.getRegisterBitWidth(Vector) // sub_DFE640: returns 32
VF = regWidth / (elementBits / 8)
if (!isScalable && VF <= 1 && forceOuterMode) // byte_500D208
VF = 4 // NVIDIA hardcoded override
For inner loops (the common path), VF selection goes through the full VPlan cost model:
// sub_2AE08E0 — selectBestVF() from VPlan candidates
bestCost = INT64_MAX
for each VPlan in candidatePlans:
for each VF in VPlan.VFRange:
cost = computeCostForVF(VPlan, VF) // sub_2AE0750
if isBetterThan(cost, bestCost): // sub_2AB3FE0
bestVF = VF
bestCost = cost
return {bestVF, isScalable, bestIC}
The cost accumulation uses saturating arithmetic -- __OFADD__ overflow detection clamping to INT64_MAX/INT64_MIN -- preventing wrap-around in cost comparisons. This is defensive engineering for GPU kernels with very large loop bodies where naive summation could overflow.
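The clamping pattern can be reproduced with the GCC/Clang overflow builtins; a sketch of the same saturating behavior (helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Saturating signed add, mirroring the __OFADD__ pattern in the
   decompiled cost loop: clamp on overflow instead of wrapping. */
static int64_t sat_add(int64_t a, int64_t b) {
    int64_t r;
    if (__builtin_add_overflow(a, b, &r))
        return (b > 0) ? INT64_MAX : INT64_MIN;
    return r;
}
```

Because comparisons like isBetterThan() run on these sums, clamping at INT64_MAX keeps an overflowing candidate "infinitely expensive" rather than wrapping around to look cheap.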
Phase 5: Cost Model Construction
The cost model object (sub_2AB2780, 16 parameters) assembles all analysis results into a single context:
CostModel = {
Loop*, DominatorTree*, LoopBlocksRPO*, ScalarEvolution*,
TargetLibraryInfo*, AssumptionCache*, PredicatedScalarEvolution*,
ValuesToIgnore=0, ORE*,
/* additional context fields */
}
The VPlan planner (sub_2AF13F0) generates VPlans for all candidate VFs, then sub_2AE08E0 selects the best one. Each VPlan recipe provides its own cost through the virtual getVPCost(VF, CostCtx) method, which delegates to NVPTXTargetTransformInfo for GPU-specific instruction costs.
Phase 6: Profitability Decision and Interleave Selection
After VF selection, the pass evaluates a decision matrix:
| Condition | Result |
|---|---|
| VF=1, not scalable | VectorizationNotBeneficial -- bail |
| IC=1 but user wanted more | InterleavingNotBeneficial |
| IC>1 but user disabled | InterleavingBeneficialButDisabled |
| Histogram loop + scalar interleave | HistogramPreventsScalarInterleaving -- bail |
| VF=1, IC>1 | Interleave-only path: executeVPlan(VF=1, IC) |
| VF>1 | Full vectorization path |
The histogram diagnostic (HistogramPreventsScalarInterleaving) is an NVIDIA addition not present in upstream LLVM. It blocks scalar interleaving of histogram-pattern loops where reduction ordering constraints make interleaving incorrect without vectorization.
Interleave count selection (sub_2AED330) is register-pressure-bounded on GPU:
// sub_2AED330 — selectInterleaveCount() (simplified)
maxIC = TTI.getMaxInterleaveFactor(VF) // sub_DFB120(TTI+448)
// Override knobs:
if (VF.isScalar() && ForceTargetMaxScalarInterleave) // dword_500E148
maxIC = ForceTargetMaxScalarInterleave
if (VF.isVector() && ForceTargetMaxVectorInterleave) // dword_500E068
maxIC = ForceTargetMaxVectorInterleave
tripCount = getSmallBestKnownTC(PSE, Loop)
IC = bit_floor(tripCount / (VF * 2)) // conservative: vector loop runs >= 2x
IC = min(IC, maxIC)
// Small loop boost
if (loopCost < SmallLoopCost) // qword_500DC88
smallIC = min(IC, bit_floor(SmallLoopCost / loopCost))
IC = max(IC, smallIC)
// Scheduling-based cap (NVIDIA-specific TTI path)
issueWidth = *(TTI + 56 + 32) // scheduling info at TTI+88
latency = *(TTI + 56 + 36) // scheduling info at TTI+92
IC = IC / max(issueWidth, latency) // cap by scheduling model
// Aggressive interleave mode
if (byte_500D908) // AggressiveInterleave
IC = maxIC // bypass all heuristics
IC = clamp(IC, 1, maxIC)
return powerOf2Floor(IC)
On CPU, the interleave count is bounded by vector register count (e.g., 16 YMM registers / registers per iteration). On GPU, it is bounded by register pressure impact on occupancy -- the TTI scheduling info encodes this constraint. The AggressiveInterleave knob (byte_500D908) bypasses all heuristics and sets IC to the maximum, useful for benchmarking or known-good kernels.
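The selectInterleaveCount() pseudocode above, folded into one self-contained function. Knob globals (SmallLoopCost, AggressiveInterleave, the TTI scheduling fields) become parameters here, and the helper names are mine:

```c
#include <assert.h>

/* bit_floor: largest power of two <= x (returns 1 for x == 0 so the
   final clamp-to-[1, maxIC] behavior is preserved). */
static unsigned ic_pow2_floor(unsigned x) {
    unsigned p = 1;
    while (p <= x / 2) p *= 2;
    return x ? p : 1;
}

static unsigned select_ic(unsigned trip_count, unsigned vf, unsigned max_ic,
                          unsigned loop_cost, unsigned small_loop_cost,
                          unsigned issue_width, unsigned latency,
                          int aggressive) {
    /* conservative: the vector loop should run at least twice */
    unsigned ic = ic_pow2_floor(trip_count / (vf * 2));
    if (ic > max_ic) ic = max_ic;
    /* small-loop boost */
    if (loop_cost && loop_cost < small_loop_cost) {
        unsigned boost = ic_pow2_floor(small_loop_cost / loop_cost);
        if (boost > max_ic) boost = max_ic;
        if (boost > ic) ic = boost;
    }
    /* scheduling-model cap from the TTI issue-width/latency fields */
    unsigned cap = issue_width > latency ? issue_width : latency;
    if (cap > 1) ic /= cap;
    /* AggressiveInterleave bypasses every heuristic */
    if (aggressive) ic = max_ic;
    if (ic < 1) ic = 1;
    if (ic > max_ic) ic = max_ic;
    return ic_pow2_floor(ic);
}
```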
Phase 7: VPlan Execution and Epilogue Vectorization
mainVPlan = getBestPlanFor(bestVF) // sub_2BF1320
executeVPlan(mainVPlan, bestVF, IC) // sub_2AE3460
// Epilogue vectorization (when byte_500ED88 is set)
epilogueVF = selectEpilogueVectorizationFactor() // sub_2ABBD40
if (epilogueVF > 1):
clonedPlan = cloneVPlan(mainVPlan) // sub_2BF7CB0
epiloguePlan = getBestPlanFor(epilogueVF)
mergeVPlans(clonedPlan, epiloguePlan) // sub_2AB0350
// Remap operands between main and epilogue plans:
// recipe types 29 (load/store), 36 (phi), 17 (GEP)
// types 19-20 (inttoptr/ptrtoint casts)
executeVPlan(merged, epilogueVF, epilogueIC, isEpilogue=true)
Epilogue vectorization is particularly relevant on GPU: the scalar remainder loop after vectorization forces warp divergence (some threads in the warp execute the epilogue while others are masked off), which is expensive. A vectorized epilogue with a smaller VF reduces the scalar remainder to fewer iterations, minimizing divergence overhead.
The epilogue VF selection (sub_2ABBD40) can be forced via qword_500ECA8 (-epilogue-vectorization-force-VF). When not forced, it uses SCEV range analysis (sub_DC3A60, sub_DBB9F0) to prove the epilogue trip count is sufficient for the candidate VF.
Phase 8: Post-Vectorization Metadata
The pass applies follow-up loop metadata (llvm.loop.vectorize.followup_all, llvm.loop.vectorize.followup_epilogue) and emits optimization remarks through sub_2AC2B40. Generated basic blocks use naming conventions vec.epilog.middle.block and vec.epilog.vector.body.
VPlan Construction (sub_2AEE460)
The VPlan builder allocates a 656-byte VPlan object and iterates over candidate VFs in powers of 2 (VF *= 2 each iteration, visible as add r15d, r15d in the binary). For each VF, it calls sub_2AA9E60 (tryToBuildRecipesForVF).
Recipe type tags observed in the binary:
| Tag | Recipe Type |
|---|---|
| 0x04 | VPWidenMemoryInstructionRecipe |
| 0x0F | VPWidenRecipe |
| 0x1D | VPReplicateRecipe |
| 0x21 | VPWidenSelectRecipe |
| 0x43 | VPWidenCallRecipe |
Interleave group recipes are built from LoopAccessInfo at [Planner+0x28]+0x150. The builder removes individual load/store recipes and replaces them with interleave group recipes via sub_2AB9570 (replaceAllUsesWith), using a hash map with the pointer-hash function (ptr >> 4) ^ (ptr >> 9) & mask -- identical to LLVM's DenseMap hash.
Cost annotation happens in Phase 6 of VPlan construction via sub_2C2E3C0, which walks all recipes and annotates them with TTI-derived costs. This is where NVPTXTargetTransformInfo shapes the cost model: it prices ld.v4 cheaper than 4x ld.f32, making vectorization profitable even with register pressure increase.
The VPlan verification flag at 0x500D2E8 enables VPlan dump/verify paths -- useful for debugging vectorization decisions with -mllvm -vplan-verify-or-dont.
NVPTXTargetTransformInfo Hooks
The loop vectorizer reaches NVIDIA's TTI through Loop->getHeader()->getParent()->getTTI() (recovered as *(**(Loop+32)+72)). Key hooks:
| TTI Method | Address | GPU Behavior |
|---|---|---|
getRegisterBitWidth(Vector) | sub_DFE640 | Returns 32 (fixed) -- single scalar register width |
supportsScalableVectors() | sub_DFE610 | Returns false -- no SVE/RVV equivalent |
getMaxInterleaveFactor() | sub_DFB120 | Queried at TTI+448; register-pressure-bounded |
getMaxInterleaveFactor(vectorized) | sub_DFB730 | Separate limit for vectorized loops |
hasAttribute(47) | sub_B2D610 | "alwaysvectorize" check |
hasAttribute(30) | sub_B2D610 | "noimplicitfloat" check |
The 32-bit register width return is the critical difference from CPU targets. It means the standard VF formula (regWidth / elemSize) produces VF=1 for 32-bit types, VF=2 for 16-bit types, and VF=4 for 8-bit types. Wider vectorization (VF=4 for float) must come from the cost model determining that ld.v4.f32 is profitable despite the VF exceeding the "register width."
The scheduling info at TTI+56 (with issue width at offset +32 and latency at +36 within that sub-structure) feeds interleave count capping. This models the SM's instruction issue pipeline: even if register pressure allows IC=8, the issue pipeline may saturate at IC=4.
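Plugging NVPTX's fixed 32-bit register width into the standard VF formula makes the contrast with CPU targets concrete (a sketch of the arithmetic only, helper name mine):

```c
#include <assert.h>

/* The upstream formula: VF = register bit width / element bit width.
   On NVPTX, reg_bits is always 32 (sub_DFE640). */
static unsigned naive_vf(unsigned reg_bits, unsigned elem_bits) {
    unsigned vf = reg_bits / elem_bits;
    return vf ? vf : 1;
}
```

So the formula alone yields VF=1 for float on NVPTX, versus VF=16 on an AVX-512 target; any wider GPU vectorization has to come from the VPlan cost model pricing ld.v4 below four scalar loads.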
Knobs and Thresholds
| Knob | Global Address | CLI Name | Default | Effect |
|---|---|---|---|---|
| ForceVectorization | qword_500D340[17] | vectorize-loops | true | Master switch for loop vectorization |
| EnableEarlyExitVectorization | byte_500CDA8 | -enable-early-exit-vectorization | false | Gates LLVM 20 early-exit loop vectorization |
| ForceOuterLoopVectorization | byte_500D208 | -force-vector-width-outer | false | Forces VF=4 for outer loops when TTI returns VF<=1 |
| ForceCanReorderFP (selector) | dword_500D508 | -- | 0 | Whether FP reorder override is active |
| ForceCanReorderFP (value) | byte_500D588 | -- | -- | FP reorder override value |
| ForceScalarEpilogue (selector) | dword_500E308 | -- | 0 | Whether scalar epilogue is forced |
| ForceScalarEpilogue (value) | byte_500E388 | -- | -- | Scalar epilogue override value |
| VectorizerMinTripCount | dword_500EAE8 | vectorizer-min-trip-count | 16 (upstream) | Minimum trip count to attempt vectorization |
| CostThreshold | qword_500EA08 | -- | -- | Maximum cost for memory reorder safety check |
| EnableEpilogueVectorization | byte_500ED88 | -enable-epilogue-vectorization | true (upstream) | Enables vectorized epilogue loop |
| EpilogueVectorizationForceVF | qword_500ECA8 | -epilogue-vectorization-force-VF | 0 | Forces specific epilogue VF |
| AggressiveInterleave | byte_500D908 | -- | false | Bypasses IC heuristics, sets IC=max |
| PreferPredicateOverEpilogue | byte_500DAC8 | prefer-predicate-over-epilogue | -- | Uses predication instead of scalar epilogue |
| SmallLoopCost | qword_500DC88 | small-loop-cost | 20 (upstream) | Threshold below which loops get boosted IC |
| ForceTargetMaxScalarInterleave | dword_500E148 | force-target-max-scalar-interleave | 0 | Overrides max IC for scalar loops |
| ForceTargetMaxVectorInterleave | dword_500E068 | force-target-max-vector-interleave | 0 | Overrides max IC for vectorized loops |
NVIDIA vs upstream defaults: The upstream vectorizer-min-trip-count default is 16. The upstream small-loop-cost default is 20. The upstream enable-epilogue-vectorization default is true. NVIDIA preserves these defaults from the knob registration code, but the TTI hooks (particularly getRegisterBitWidth returning 32 and getMaxInterleaveFactor being register-pressure-bounded) shift the effective behavior dramatically. Where a CPU target with AVX-512 might select VF=16 for float, NVPTX typically selects VF=2 or VF=4 -- just enough to use ld.v2/ld.v4 instructions without excessive register pressure.
Diagnostic Strings
All diagnostic strings are embedded in the binary with OptimizationRemarkAnalysis tags. Source: p2-E01-loop-vectorize.txt.
| Tag | Message | Trigger |
|---|---|---|
| UncountableEarlyExitLoopsDisabled | "Auto-vectorization of loops with uncountable early exit is not enabled" | Early-exit loop + byte_500CDA8 knob off |
| LowTripCount | "The trip count is below the minial threshold value." | TC < dword_500EAE8 min threshold (note: "minial" is a typo [sic] in the NVIDIA binary) |
| NoImplicitFloat | "Can't vectorize when the NoImplicitFloat attribute is used" | Function attribute 30 check |
| UnsafeFP | "Potentially unsafe FP op prevents vectorization" | FP safety check failure |
| CantReorderFPOps | "loop not vectorized: cannot prove it is safe to reorder floating-point operations" | FP reorder proof failure |
| CantReorderMemOps | "loop not vectorized: cannot prove it is safe to reorder memory operations" | Memory reorder proof failure |
| VectorizationNotBeneficial | "the cost-model indicates that vectorization is not beneficial" | Cost model: VF=1 wins |
| InterleavingNotBeneficial | "the cost-model indicates that interleaving is not beneficial" | Cost model: IC=1 wins |
| InterleavingNotBeneficialAndDisabled | (appended: " and is explicitly disabled or interleave count is set to 1") | IC=1 + explicitly disabled |
| InterleavingBeneficialButDisabled | (tag only, no message body recovered) | IC>1 but user disabled interleaving |
| InterleavingAvoided | "Ignoring UserIC, because interleaving was avoided up front" | User-specified IC overridden |
| HistogramPreventsScalarInterleaving | "Unable to interleave without vectorization due to constraints on the order of histogram operations" | NVIDIA-specific: histogram loop + scalar IC |
| ScalableVFUnfeasible | "Scalable vectorization requested but not supported by the target" | Scalable VF on NVPTX |
| UncountableEarlyExitUnsupported | "Auto-vectorization of early exit loops requiring a scalar epilogue is unsupported" | Early-exit + epilogue |
| (success remark) | "interleaved loop (interleaved count: N)" | Vectorization/interleaving succeeded via sub_2AC2B40 |
| (metadata) | "llvm.loop.vectorize.followup_all" | Post-vectorization loop metadata tag |
| (metadata) | "llvm.loop.vectorize.followup_epilogue" | Post-vectorization epilogue metadata tag |
| (block name) | "vec.epilog.middle.block" | Epilogue vectorization middle block |
| (block name) | "vec.epilog.vector.body" | Epilogue vectorization body block |
| (block name) | "scev.check" | Runtime SCEV overflow check block (sub_27C1C30) |
| (VPlan debug) | "Initial VPlan" | VPlan builder debug output at 0x2AEFC7B |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
LoopVectorizePass::processLoop() | sub_2AF1970 | 88 KB | -- |
tryToBuildVPlanWithVPRecipes() | sub_2AEE460 | 56 KB | -- |
Planner::plan() -- generate VPlans for candidate VFs | sub_2AF13F0 | -- | -- |
selectBestVF() -- iterate VPlans, pick lowest cost | sub_2AE08E0 | -- | -- |
computeCostForVF() -- per-VF cost query | sub_2AE0750 | -- | -- |
isBetterThan() -- VF cost comparator | sub_2AB3FE0 | -- | -- |
executeVPlan() -- IR transformation from VPlan | sub_2AE3460 | -- | -- |
selectInterleaveCount() -- IC heuristic | sub_2AED330 | -- | -- |
selectEpilogueVectorizationFactor() | sub_2ABBD40 | -- | -- |
LoopVectorizationCostModel constructor (16 params) | sub_2AB2780 | -- | -- |
selectVectorizationFactor() -- outer loop path | sub_2AB8AC0 | -- | -- |
selectVectorizationFactor() -- hint/pre-check | sub_2AAEAB0 | -- | -- |
computeExpectedScalarCost() | sub_2AAD640 | -- | -- |
LoopVectorizationLegality::init() | sub_31A4FD0 | -- | -- |
canVectorize() -- pre-check | sub_31A91F0 | -- | -- |
canVectorize() -- full check | sub_31AF060 | -- | -- |
getBestPlanFor(VF) -- VPlan lookup | sub_2BF1320 | -- | -- |
cloneVPlan() | sub_2BF7CB0 | -- | -- |
mergeVPlans() -- main + epilogue merge | sub_2AB0350 | -- | -- |
buildInterleaveGroupRecipes() | sub_2C06CE0 | -- | -- |
| VPlan cost annotation pass | sub_2C2E3C0 | -- | -- |
| VPlan simplification / recipe combining | sub_2C32950 | -- | -- |
| VPlan legality re-verification | sub_2C2A390 | -- | -- |
getSmallBestKnownTC() -- trip count upper bound | sub_2AA7EC0 | -- | -- |
tryToBuildRecipesForVF() -- per-VF body builder | sub_2AA9E60 | -- | -- |
finalizeRecipesForVF() -- scaling/widening | sub_2AD9850 | -- | -- |
TTI::getMaxInterleaveFactor() | sub_DFB120 | -- | -- |
TTI::getRegisterBitWidth(Vector) | sub_DFE640 | -- | -- |
TTI::supportsScalableVectors() | sub_DFE610 | -- | -- |
| Emit vectorization success remarks | sub_2AC2B40 | -- | -- |
| VPlan fixup/finalize | sub_ABDAE0 | -- | -- |
Related Pages
- Loop Strength Reduction -- LSR runs after vectorization and must handle the wider induction variables and address expressions that vectorization introduces. NVIDIA's custom LSR is occupancy-aware and interacts with the same register pressure model.
- Register Allocation -- The register pressure that bounds VF and IC decisions is ultimately resolved by the register allocator. VF=4 with IC=2 may request 8x the base register count; the allocator must either accommodate this or spill to local memory.
- Scheduling -- The TTI scheduling info (issue width and latency at TTI+56) that caps interleave count comes from the same target model used by instruction scheduling.
- SelectionDAG -- Vectorized IR produces vector types (<4 x float>) that SelectionDAG must lower to PTX ld.v4/st.v4 instructions.
- SLP Vectorizer -- SLP vectorization (sub_2BD1C50) handles straight-line code and horizontal reductions; loop vectorization handles loop bodies. Both share the same TTI cost model.
What Upstream LLVM Gets Wrong for GPU
Upstream LLVM's LoopVectorize pass was built for CPU SIMD: fill wider vector registers to process more data elements per instruction. On a GPU, every foundational assumption is inverted:
- Upstream assumes SIMD lanes need filling. The CPU vectorizer exists to pack 4/8/16 scalar operations into one vector instruction (SSE/AVX/NEON). On GPU, there are no SIMD lanes in the CPU sense -- the SIMT model already executes 32 threads in lockstep per warp. "Vectorization" on GPU means widening per-thread memory accesses to ld.v2/ld.v4 for coalescing, not filling SIMD lanes.
- Upstream computes VF from vector register width. The standard formula is VF = registerWidth / elementSize (e.g., AVX-512 gives VF=16 for float). NVPTX's getRegisterBitWidth() returns 32 bits -- a single scalar register width -- so this formula always produces VF=1 for 32-bit types. Wider VFs must come entirely from the cost model deciding that ld.v4.f32 is profitable, bypassing the standard VF selection path.
- Upstream ignores register pressure when selecting VF. On CPU, VF=16 using 16 ZMM registers has no throughput penalty -- there is no occupancy concept. On GPU, VF=4 that quadruples live registers can cross an occupancy cliff, losing an entire warp group and halving net throughput. Every VF and IC decision must be bounded by register pressure impact on occupancy.
- Upstream assumes scalable vectors are desirable. LLVM supports SVE/RISC-V V scalable vector types. NVPTX disables them entirely (supportsScalableVectors() = false) because PTX has no scalable vector model -- only fixed-width ld.v2/ld.v4 instructions exist.
- Upstream's interleave count is bounded by CPU port pressure. CPU IC selection considers execution port contention and register file depth (e.g., 16 YMM registers). GPU IC selection is capped by the TTI scheduling model's issue width and latency at TTI+56, reflecting the SM's instruction issue pipeline saturation -- a completely different bottleneck.
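The register-width point can be made concrete. A minimal sketch (our own code, not a decompiled symbol) of the upstream VF = registerWidth / elementSize formula, showing why a 32-bit register width pins VF to 1 for 32-bit elements:

```c
#include <stdint.h>

/* Sketch of the upstream VF formula described above; the function
 * name is ours, not a symbol from the binary. */
static uint32_t upstream_vf(uint32_t register_bits, uint32_t element_bits) {
    uint32_t vf = register_bits / element_bits;
    return vf > 0 ? vf : 1;  /* VF never drops below 1 */
}
```

With NVPTX's 32-bit return, upstream_vf(32, 32) is 1, while an AVX-512 target's upstream_vf(512, 32) is 16 -- which is why CICC's wider VFs must come from the cost model instead.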
Optimization Level Behavior
| Level | Scheduled | Max VF | Interleave | Notes |
|---|---|---|---|---|
| O0 | Not run | N/A | N/A | No optimization passes |
| Ofcmax | Not run | N/A | N/A | Fast-compile skips vectorization entirely |
| Ofcmid | Not run | N/A | N/A | Vectorization not in medium fast-compile tier |
| O1 | Runs (Tier 1) | 4 | Enabled | Single instance after loop canonicalization |
| O2 | Runs (Tier 1) | 4 | Enabled | Same scheduling as O1; benefits from more aggressive scalar optimization preceding it |
| O3 | Runs (Tier 1) | 4 | Enabled | Same as O2; additional Tier 3 loop passes (interchange, distribution) may create more vectorization opportunities |
Loop vectorization is a Tier 1 pass, meaning it runs at O1 and above but not in any fast-compile tier. The maximum VF is effectively capped at 4 by the GPU register pressure constraint -- higher VFs would multiply live registers past occupancy cliffs. The vectorize-loops knob (qword_500D340[17]) can force vectorization even when the cost model says it is unprofitable; this knob defaults to off and is typically used only for debugging. Early-exit vectorization (byte_500CDA8) is gated separately and defaults to disabled. See Optimization Levels for the complete tier structure.
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Vectorization purpose | Fill SIMD lanes (SSE/AVX/NEON) for data parallelism | Memory coalescing (ld.v2/ld.v4), instruction count reduction, and register-to-memory width matching; no SIMD lanes on GPU |
| Scalable vectors | Supported (SVE, RISC-V V) | Always disabled -- sub_DFE610 returns false for NVPTX; only fixed-width VF=2/4 |
| Register bit width (TTI) | Target-dependent (128/256/512 for x86) | Fixed 32 bits (TypeSize::getFixed(32)) reflecting PTX's 32-bit register model |
| VF selection cost model | SIMD-width-driven: higher VF fills wider vector registers | Occupancy-bounded: VF must not increase register pressure past warp occupancy cliffs; VF=4 is typically the maximum |
| Interleave count | Profile-guided or port-pressure-based (2--8 typical) | Capped by TTI scheduling info at TTI+56; conservative due to register pressure cost per interleaved iteration |
| Early-exit vectorization | Experimental (behind flag) | Present, gated by byte_500CDA8 (-enable-early-exit-vectorization) |
| Convergent call handling | Standard legality rejection | Additional barrier-aware legality: convergent intrinsics (__syncthreads, warp shuffles) block vectorization of the containing loop body |
SLP Vectorizer
NVIDIA-modified pass. See Key Behavioral Differences from Upstream for GPU-specific changes.
The SLP (Superword-Level Parallelism) vectorizer packs independent scalar operations on adjacent data into vector operations. Unlike the loop vectorizer, SLP operates on straight-line code within a single basic block --- it does not require a loop. On NVPTX, the practical payoff is combining two or four scalar loads/stores into ld.v2/ld.v4 (or st.v2/st.v4), and folding arithmetic on adjacent elements into a single wider instruction. CICC runs the SLP vectorizer as part of the combined LoopVectorize / SLPVectorize pass group at step 31 of the O2 pipeline (sub_19B73C0), after SCCP/GlobalOpt and before the post-vectorization GVN cleanup. The pass is registered under the name slp-vectorizer (pipeline slot 350, llvm::SLPVectorizerPass).
| Property | Value |
|---|---|
| Pass name | slp-vectorizer |
| Pipeline slot | 350 (llvm::SLPVectorizerPass) |
| Constructor registration | ctor_517 at 0x560FD0 (12,410 bytes) |
| Option constructor | ctor_248 at 0x4EEF30 (8,219 bytes) |
| Horizontal reduction entry | sub_2BD1C50 (~85 KB, ~3,005 decompiled lines) |
| Straight-line SLP entry | sub_2BCE070 |
| Store-SLP entry | sub_2BCA110 |
| SLP tree code cluster | 0x1BC0000--0x1BFFFFF (~1,353 KB across ~266 files) |
| Key diagnostic strings | "slp-vectorizer", "HorSLPNotBeneficial", "VectorizedHorizontalReduction", "const.rdx", "SLP vectorized with cost", "Cannot SLP vectorize list:", "Stores SLP vectorized with cost" |
SLP vs Loop Vectorization on GPU
The loop vectorizer (see LoopVectorize & VPlan) transforms counted loops by widening the loop body to process multiple iterations per step, driven by VPlan. SLP vectorization is fundamentally different: it searches a single basic block for groups of isomorphic scalar instructions that operate on adjacent memory or independent data, then replaces them with a single vector instruction. No loop structure is required.
On a GPU, SLP opportunities arise in three main patterns:
- Adjacent memory operations. Two consecutive f32 loads from addresses p and p+4 become a single ld.v2.f32. Four consecutive i32 stores become st.v4.b32. This is the highest-value SLP transformation on NVPTX because coalesced memory transactions are critical for throughput.
- Same-typed arithmetic on independent operands. Two fadd instructions with no data dependency between them can become a single vector fadd on <2 x float>. The PTX backend later lowers this back to scalar instructions if the target has no native wide ALU, but the combined form enables better scheduling and may survive to the load/store vectorizer's benefit.
- Texture coordinate packing. Texture/surface sampling requires coordinate tuples (u, v) or (u, v, w). When the scalar coordinates are computed independently, SLP can pack them into a <2 x float> or <4 x float> bundle that feeds directly into the sampling intrinsic, avoiding per-element extract/insert overhead.
NVPTX TTI Hooks Affecting SLP
The SLP vectorizer consults TargetTransformInfo at several decision points. NVIDIA's proprietary TTI implementation differs significantly from the upstream open-source NVPTX backend.
Upstream Open-Source NVPTX TTI (for reference)
| Hook | Return Value | Comment |
|---|---|---|
getRegisterBitWidth(Vector) | 32 bits | "Only <2 x half> should be vectorized" |
getMinVectorRegisterBitWidth() | 32 bits | Matches 32-bit register file |
getNumberOfRegisters() | 1 (all classes) | FIXME in source: "this is conservative" |
getArithmeticInstrCost(i64) | 2x base for ADD/MUL/XOR/OR/AND | Reflects 32-bit ALU emulation |
supportsScalableVectors() | false | No SVE/RVV equivalent in PTX |
With these returns, the standard LLVM VF formula (registerBitWidth / elementBitWidth) produces VF = 1 for f32 and VF = 2 for f16. The open-source backend effectively limits SLP to <2 x half> bundles only.
CICC v13.0 Proprietary TTI
CICC overrides the upstream returns at three levels: the TTI wrapper pass, the SLP tree's internal scheduling-width parameter, and several SLP-specific helper functions that query TTI indirectly.
TTI hooks queried by SLP (directly or via the cost model):
| Hook | Address | Return / Behavior | SLP Impact |
|---|---|---|---|
getRegisterBitWidth(Vector) | sub_DFE640 | TypeSize::getFixed(32) | Formal register width --- same as upstream. But see a2+840 override below. |
getRegisterBitWidth(Scalar) | sub_DFB1B0 | 32 | Confirms 32-bit register file for scalar cost comparison. |
supportsScalableVectors() | sub_DFE610 | false | Scalable VF never attempted. |
getInstructionCost() | sub_20E14F0 (33KB) | Per-opcode latency from scheduling model | Called indirectly through getTreeCost() (sub_2B94A80) for each tree node. |
getInstructionCost() (IR-level) | sub_B91420 | Per-instruction cost estimate | Called 7 times per instruction during per-node SLP cost evaluation. |
hasAttribute(47) | sub_B2D610 | Checks alwaysvectorize | When set, SLP skips profitability check and vectorizes unconditionally. |
hasAttribute(18) | sub_B2D610 | Checks optnone | When set, SLP is entirely disabled. |
The a2+840 scheduling-width override:
The SLP tree object (BoUpSLP, parameter a2 in the horizontal reduction entry sub_2BD1C50) stores a max register pressure / scheduling width at offset +840. This value does NOT come from getRegisterBitWidth(Vector) directly. Instead, it is computed during SLP tree initialization from a combination of the target's scheduling model and available register budget. In the decompiled code, the VF derivation at lines 1354-1578 reads this value and clamps the resulting bit width to [128, 512]:
// VF derivation from a2+840 (decompiled sub_2BD1C50)
uint64_t max_sched_width = *(a2 + 840); // NOT from TTI.getRegisterBitWidth()
uint64_t scalar_width = sub_2B49BC0(a2, first_scalar); // getScalarTypeWidth()
uint64_t vf;
if (scalar_width <= max_sched_width) {
vf = 1 << bsr(max_sched_width / scalar_width); // round-down power-of-2
vf = clamp(vf, 128, 512); // clamp to [128, 512] BITS
} else {
vf = 128;
}
// For f32 (32 bits) with max_sched_width=256: vf = 256/32 = 8 elements
// For f64 (64 bits) with max_sched_width=256: vf = 256/64 = 4 elements
This is the single most important NVIDIA divergence from upstream for SLP: the 32-bit getRegisterBitWidth(Vector) return would produce VF=1 for f32 operations and kill SLP entirely for 32-bit types, but the a2+840 scheduling width allows VF=4 or VF=8 for f32. The result is that CICC's SLP can produce <4 x float> bundles (later lowered to ld.v4.f32 / st.v4.f32) that the open-source backend would never attempt.
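Read literally, the decompiled derivation above mixes element counts and bit widths in the clamp; one consistent reading (an assumption on our part) is that the [128, 512] clamp applies to the bundle bit width, vf * scalar_width. A runnable sketch under that reading, with helper names of our own:

```c
#include <stdint.h>

/* Hedged reconstruction of the a2+840 VF derivation above. Assumes
 * the [128, 512] clamp constrains total bundle bits, not elements. */
static uint64_t round_down_pow2(uint64_t x) {
    uint64_t p = 1;
    while (p * 2 <= x) p *= 2;
    return p;
}

static uint64_t slp_vf_elements(uint64_t max_sched_width, uint64_t scalar_width) {
    if (scalar_width > max_sched_width) {
        uint64_t v = 128 / scalar_width;       /* fall back to a 128-bit bundle */
        return v ? v : 1;
    }
    uint64_t vf = round_down_pow2(max_sched_width / scalar_width);
    if (vf * scalar_width < 128) vf = 128 / scalar_width;  /* clamp low */
    if (vf * scalar_width > 512) vf = 512 / scalar_width;  /* clamp high */
    return vf;
}
```

Under this reading the decompiled examples reproduce: f32 with a 256-bit scheduling width gives 8 elements, f64 gives 4.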
SLP-specific TTI helper functions:
| Function | Address | Upstream Equivalent | Behavior |
|---|---|---|---|
getScalarTypeWidth() | sub_2B49BC0 | DL.getTypeSizeInBits() | Returns bit width of a scalar type for VF computation. |
getNextLegalVF() | sub_2B1E190 | No direct equivalent | Steps down through legal vector factors when current VF is unprofitable. Takes (TTI, type, currentVF), returns next smaller legal VF >= minimum VF. Respects PTX v2/v4 legality constraints. |
adjustVF() | sub_2B1FA70 | Partial in BoUpSLP::buildTree | When SLPMaxVF (qword_500F628) is non-zero and operand_count+1 is a power of 2, returns operand_count directly (non-power-of-2 VF). Otherwise computes a power-of-2 VF. |
isTreeNotBeneficialForArch() | sub_2B2DA40 | Not in upstream | NVIDIA-specific early rejection based on SM reduction type (a1+1576). Rejects trees whose structure is known to be unprofitable on the current GPU architecture. |
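The adjustVF() rule in the table can be sketched as follows (our reconstruction of the described behavior; names are ours, not decompiled symbols):

```c
#include <stdint.h>

/* Sketch of the adjustVF() rule described above: with a non-zero
 * SLPMaxVF, an operand count one below a power of two is used directly
 * as a non-power-of-2 VF; otherwise VF is rounded down to a power of 2. */
static int is_pow2(uint64_t x) { return x != 0 && (x & (x - 1)) == 0; }

static uint64_t adjust_vf(uint64_t operand_count, uint64_t slp_max_vf) {
    if (slp_max_vf != 0 && is_pow2(operand_count + 1))
        return operand_count;                  /* e.g. 3 or 7 lanes */
    uint64_t vf = 1;
    while (vf * 2 <= operand_count) vf *= 2;   /* round down to power of 2 */
    return vf;
}
```

So with SLPMaxVF set, a 3-operand bundle vectorizes at VF=3 rather than being truncated to VF=2.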
Arithmetic Cost Impact on SLP Trees
The TTI cost model for i64 operations directly affects SLP profitability. Since NVPTX GPUs emulate all 64-bit integer arithmetic through pairs of 32-bit operations, the cost differential inflates the scalar cost baseline, making i64 SLP trees more profitable in relative terms:
| Operation | i32 Scalar Cost | i64 Scalar Cost | i64 Vector Cost (v2) | SLP Delta |
|---|---|---|---|---|
| ADD/SUB | 1 | 2 (add.cc + addc) | 4 (two add.cc + addc pairs) | Neutral (2x scalar = 2x vector) |
| MUL | 1 | ~4 (mul.lo + mul.hi + add chain) | ~8 | Neutral |
| Loads | 1 | 1 (ld.b64) | 1 (ld.v2.b64) | Profitable --- single wide load |
| Stores | 1 | 1 (st.b64) | 1 (st.v2.b64) | Profitable --- single wide store |
The asymmetry is clear: SLP profit on NVPTX comes almost entirely from memory coalescing (loads and stores), not from arithmetic. The arithmetic cost for a v2 bundle is roughly 2x the scalar cost for all types, providing no ALU benefit. But a ld.v2.f32 replaces two separate load instructions with one, reducing instruction count and improving coalescing. This is why Store-SLP (sub_2BCA110) and the load/store adjacency heuristics dominate profitable SLP on GPU.
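Plugging the table's costs into a simple delta (scalar total minus vector total; the numbers are the table's estimates, not measurements) makes the asymmetry explicit:

```c
#include <stdint.h>

/* Cost delta for a v2 bundle: two scalar ops replaced by one vector op.
 * Inputs are the per-op costs from the table above. Positive = profit. */
static int64_t slp_v2_delta(int64_t scalar_cost, int64_t vector_cost) {
    return 2 * scalar_cost - vector_cost;
}
```

For i64 ADD, slp_v2_delta(2, 4) is 0 (neutral), while for a load slp_v2_delta(1, 1) is 1: the entire v2 profit comes from collapsing two memory operations into one.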
Maximum Vector Width on NVPTX
PTX supports vector types up to .v4 for most data types, but the actual hardware constraint is tighter:
- v2: Supported for all types (.b8 through .b64, .f16, .f32, .f64). This is the sweet spot for SLP.
- v4: Supported for .b8, .b16, .b32, .f16, .f32. NOT supported for .b64/.f64.
- v8/v16: Not supported in PTX at all. CPU-style AVX-width vectorization is never legal.
The SLP vectorizer's VF selection logic at sub_2BD1C50 lines 1354--1578 computes:
// VF selection pseudocode (from decompiled sub_2BD1C50)
uint64_t max_sched_width = *(a2 + 840); // NOT from TTI.getRegisterBitWidth()
uint64_t scalar_width = getScalarTypeWidth(a2, first_scalar);
uint64_t vf;
if (scalar_width <= max_sched_width) {
vf = 1 << bsr(max_sched_width / scalar_width); // round-down power-of-2
vf = clamp(vf, 128, 512); // clamp to [128, 512] bits
} else {
vf = 128;
}
For f32 (32 bits) with a max scheduling width of 256 bits, this yields VF = 8 elements. However, PTX legalization later splits anything wider than v4 into multiple instructions, so the effective maximum is v4 for 32-bit types and v2 for 64-bit types. The SLP cost model accounts for this split cost.
GPU-Specific Vectorization Constraints
Legal Vector Types on NVPTX
The NVPTX target has exactly ONE vector register class --- Int32HalfRegs (.b32, prefix %hh) --- which holds 32 bits of packed data. The only legal vector types at the SelectionDAG level are:
| Type | Packing | Register | Legal Since |
|---|---|---|---|
v2f16 | Two f16 in 32 bits | %hh | SM 53+ |
v2bf16 | Two bf16 in 32 bits | %hh | SM 80+ |
v2i16 | Two i16 in 32 bits | %hh | SM 53+ |
v4i8 | Four i8 in 32 bits | %hh | SM 70+ |
Every other vector type is illegal and must be split or scalarized during type legalization (sub_2029C10 / sub_202E5A0). This includes <2 x float>, <4 x float>, <2 x i32>, and <2 x double> --- the very types SLP produces for 32-bit and 64-bit operations.
How SLP Vectors Survive to PTX
SLP-produced vector types such as <4 x float> are not killed by type legalization. Instead, the path is:
- SLP vectorizer (IR level) produces <4 x float> loads, stores, and arithmetic in LLVM IR.
- SelectionDAG type legalization splits <4 x float> into four scalar f32 values for arithmetic operations. However, load and store nodes are intercepted by NVPTX's custom lowering (NVPTXTargetLowering::LowerOperation), which converts them to target-specific NVPTX::LD_v4_f32 / NVPTX::ST_v4_f32 pseudo-instructions.
- Instruction selection maps these pseudo-instructions to PTX ld.v4.f32 / st.v4.f32.
- Arithmetic on the vector elements becomes four independent scalar instructions, which the scheduler can interleave with memory operations.
The net effect: SLP's primary benefit on NVPTX is vectorized memory access, while vectorized arithmetic is a wash. The cost model at sub_2B94A80 (getTreeCost) accounts for this by assigning low cost to vector loads/stores and high scalarization overhead to vector arithmetic.
PTX Vector Width Ceiling
PTX .v2 and .v4 load/store support imposes hard ceilings:
| Element Type | Max .vN | Max Bits | SLP VF Ceiling |
|---|---|---|---|
.b8 / .u8 | .v4 | 32 | 4 |
.b16 / .f16 | .v4 | 64 | 4 |
.b32 / .f32 | .v4 | 128 | 4 |
.b64 / .f64 | .v2 | 128 | 2 |
.b128 | .v1 only | 128 | 1 (no vectorization) |
When the SLP VF exceeds the PTX ceiling (e.g., VF=8 for f32 from the [128,512] bit-width clamping), the backend splits the single wide operation into multiple legal operations. The SLP cost model at sub_2B889C0 factors this split cost into the tree evaluation, ensuring that overly wide VFs are rejected if the split overhead eliminates the coalescing benefit.
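The split arithmetic is a ceiling-divide of the SLP VF by the per-type .vN limit (a sketch of the arithmetic described above, not decompiled code):

```c
#include <stdint.h>

/* Number of legal PTX memory instructions after splitting an SLP vector
 * of `vf` elements against the per-type .vN ceiling from the table. */
static uint32_t ptx_split_count(uint32_t vf, uint32_t vn_ceiling) {
    return (vf + vn_ceiling - 1) / vn_ceiling;  /* ceiling division */
}
```

A VF=8 f32 bundle against the .v4 ceiling gives ptx_split_count(8, 4) = 2, i.e. two ld.v4.f32 instructions; when that split overhead cancels the coalescing win, the tree is rejected.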
Algorithm Overview
CICC's SLP vectorizer has three entry points that collectively implement the upstream BoUpSLP / SLPVectorizerPass:
Straight-Line SLP (sub_2BCE070)
Scans each basic block for groups of isomorphic instructions (same opcode, adjacent or compatible operands). Builds a bottom-up SLP tree using sub_2BAACB0 (buildTree), evaluates cost via sub_2B94A80 (getTreeCost), and emits vector code via sub_2BC6BE0 (vectorizeTree) when profitable. Diagnostic: "SLP vectorized with cost N" on success, "Cannot SLP vectorize list:" on failure.
Store-SLP (sub_2BCA110)
Seeds the SLP tree from consecutive stores to adjacent memory addresses. This is the primary entry point for memory coalescing. Diagnostic: "Stores SLP vectorized with cost N".
Horizontal Reduction SLP (sub_2BD1C50)
The most complex path. Handles horizontal reductions (e.g., summing all elements of a vector). Proceeds in six phases:
Phase 0 -- Scalar chain scan. Reads the reduction operand array at a1+304 (pointer) and a1+312 (count). Each bundle entry is 64 bytes. Classifies operands by opcode: values <= 0x1C are simple scalars (add/sub/mul/etc.), values > 0x1C are complex (fcmp, icmp variants). Calls sub_2B0D8B0 (isReductionOp) to validate each operation as a legal reduction (add, fadd, mul, fmul, and, or, xor, smin/smax/umin/umax, fmin/fmax).
Phase 1 -- Hash table construction. Builds two open-addressing hash tables. The "AllOps" table uses 32-byte entries with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth/compaction thresholds.
Phase 2 -- Bundle pair extraction. Calls sub_2B5F980 per bundle to classify reduction opcode pairs. When two consecutive bundles both contain fadd reductions (opcode 90), NVIDIA attempts a paired fadd bundle merge via sub_2B3C030/sub_2B25EA0/sub_2B38BA0. This is an NVIDIA-specific optimization for warp-level fadd reductions not present in upstream LLVM.
Phase 3 -- Main vectorization loop. For each bundle, builds candidate operand lists, selects a VF, and tries vectorization with progressively smaller VFs on failure. The VF trial loop uses memoization (sub_2B3C060) to avoid re-trying the same (offset, VF) pair. Key substeps: canVectorize (legality), buildTree, isTreeTinyAndNotFullyVectorizable / isTreeNotBeneficialForArch (early rejection), scheduleBlock, getTreeCost + getReductionCost (profitability).
Phase 4 -- Final reduction codegen. Produces the final horizontal reduction instruction via sub_2B21C80 (createFinalReduction), chaining multiple entries with sub_2B34820 when multiple sub-trees were vectorized.
Phase 5 -- Multi-tree scheduling and cleanup. Builds a multi-tree reduction schedule, iteratively calling sub_2B2F4A0 (reduceTreeLevel) until a single root value remains, then replaceAllUsesWith + eraseFromParent.
Paired fadd Bundle Merging (NVIDIA-Specific)
This optimization is absent from upstream LLVM and targets warp-level floating-point reduction patterns common in CUDA kernels (e.g., block-level sum reductions, dot products, softmax denominators). When two consecutive reduction bundles both contain fadd operations, CICC attempts to merge them into a single wider bundle, doubling the effective vectorization width for the reduction.
Trigger Condition
During Phase 2 of the horizontal reduction path (sub_2BD1C50, lines 921-1098), sub_2B5F980 (classifyReductionPair) is called per bundle and returns a pair of reduction opcodes (reductionOpcodeA, reductionOpcodeB). The merge path activates when:
- Both opcodes in the current bundle equal 90 (0x5A), which is the internal opcode for
faddreduction. - The next consecutive bundle also has both opcodes equal to 90.
- The two bundles are adjacent in the reduction operand array (no intervening non-fadd bundles).
// Trigger check (decompiled from Phase 2, sub_2BD1C50)
if (v83 == v84 && v83 == 90) { // both opcodes in bundle[i] are fadd
if (v83_next == v84_next && v83_next == 90) { // bundle[i+1] also all-fadd
// Try paired merge
sub_2B3C030(bundle_i, bundle_i_plus_1, ...); // tryMergeFaddBundles
}
}
Three-Function Pipeline
The merge proceeds through three functions in sequence:
| Step | Function | Address | Role |
|---|---|---|---|
| 1. Try | tryMergeFaddBundles() | sub_2B3C030 | Checks whether the two bundles' operand lists can be concatenated without violating data dependencies. Verifies that no operand in bundle B depends on the result of bundle A (or vice versa). Returns a candidate merged-bundle descriptor or null on failure. |
| 2. Validate | validateMergedBundle() | sub_2B25EA0 | Confirms that the merged bundle satisfies SLP legality: all operands are isomorphic (same opcode), the combined operand count does not exceed SLPMaxVF limits, and the merged bundle's scheduling pressure stays within a2+840. Also checks that external uses of intermediate reduction values are compatible with the wider bundle. |
| 3. Rewrite | rewriteMergedBundle() | sub_2B38BA0 | Physically merges the two bundle entries in the reduction operand array. The combined bundle gets double the operand count, and the second bundle slot is marked as consumed (skipped in Phase 3). Updates the AllOps hash table entries to point to the new merged bundle. |
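The control flow through the three functions can be sketched as a try/validate/rewrite pipeline. The types and function bodies below are illustrative stand-ins for the behavior the table describes, not decompiled logic:

```c
#include <stdint.h>

/* Stand-in bundle descriptor; the real 64-byte bundle layout is richer. */
typedef struct { uint32_t count; int consumed; } bundle;

/* tryMergeFaddBundles stand-in: concatenate unless a cross-bundle data
 * dependency exists (modeled as a caller-supplied flag). 0 = failure. */
static uint32_t try_merge(const bundle *a, const bundle *b, int cross_dep) {
    return cross_dep ? 0 : a->count + b->count;
}

/* validateMergedBundle stand-in: merged width must fit the budget. */
static int validate_merged(uint32_t merged_count, uint32_t max_width) {
    return merged_count != 0 && merged_count <= max_width;
}

/* rewriteMergedBundle stand-in: bundle A absorbs B; B is marked
 * consumed so the Phase 3 loop skips it. */
static int merge_fadd_bundles(bundle *a, bundle *b,
                              int cross_dep, uint32_t max_width) {
    uint32_t merged = try_merge(a, b, cross_dep);
    if (!validate_merged(merged, max_width))
        return 0;
    a->count = merged;
    b->count = 0;
    b->consumed = 1;
    return 1;
}
```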
Why This Matters on GPU
Consider a warp-level sum reduction of 64 f32 values, structured as two consecutive 32-element fadd reduction trees. Without merging, the SLP vectorizer processes each 32-element tree independently, producing two separate vectorized reduction chains. With merging, the combined 64-element tree exposes a wider VF window, allowing the vectorizer to produce wider v4 bundles and reduce the total number of reduction shuffle steps.
The merged bundle also benefits the final reduction codegen (sub_2B21C80, createFinalReduction): instead of producing two separate reduction results and combining them with a scalar fadd, the merged tree produces a single reduction result directly.
Commutativity Classification
The SM reduction type at a1+1576 drives commutativity via bitmask 0x10804:
bool is_commutative;
if (reduction_type <= 0x10) {
is_commutative = !((1 << reduction_type) & 0x10804);
// Non-commutative types: 2, 11, 16 (the set bits of 0x10804; likely fsub and cmp variants)
} else {
is_commutative = true;
}
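Expanding the bitmask: 0x10804 = (1<<2) | (1<<11) | (1<<16), so reduction types 2, 11, and 16 are the ones the check treats as non-commutative. A runnable version of the snippet above:

```c
#include <stdint.h>

/* Commutativity check from the decompiled snippet above: bitmask
 * 0x10804 marks reduction types 2, 11, and 16 as non-commutative;
 * everything above 0x10 is treated as commutative. */
static int is_commutative(uint32_t reduction_type) {
    if (reduction_type <= 0x10)
        return ((1u << reduction_type) & 0x10804u) == 0;
    return 1;
}
```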
SLP and the Load/Store Vectorizer
CICC runs two distinct passes that vectorize memory operations, and their scopes partially overlap:
| SLP Vectorizer | OldLoadStoreVectorizerPass | |
|---|---|---|
| Pass name | slp-vectorizer | old-load-store-vectorizer |
| Scope | Isomorphic ops in a BB | Adjacent loads/stores only |
| Seed | Any instruction group | Store/load chains |
| Handles arithmetic | Yes | No |
| Handles reductions | Yes (horizontal) | No |
| Pipeline position | Step 31 (with LoopVectorize) | Post-optimization (NVIDIA-specific) |
| Disable flag | vectorize-slp | disable-nvptx-load-store-vectorizer |
The NVIDIA-proprietary old-load-store-vectorizer (llvm::OldLoadStoreVectorizerPass) is a separate pass distinct from LLVM's LoadStoreVectorizerPass. It runs later in the pipeline and handles NVVM-specific intrinsic vectorization (nvvm_load/nvvm_ld, nvvm_store/nvvm_st) via the vect-intrinsics knob. SLP may vectorize the same load/store chains if they also contain arithmetic; the load/store vectorizer catches whatever SLP missed.
Register Pressure Impact
SLP vectorization increases register pressure because vector values occupy wider registers. On NVPTX, a <2 x float> consumes two 32-bit registers (PTX has no native 64-bit float register file for packed types --- the backend lowers <2 x f32> to a pair of .f32 registers). The benefit comes from reduced instruction count and improved memory coalescing, not from register savings.
The SLP cost model accounts for register pressure through a2+840 (max scheduling width), and the profitability check rejects vectorization when the combined cost (tree cost + reduction cost) exceeds the threshold. When register pressure is already high, the TTI cost model inflates the scalarization overhead, making SLP less likely to fire.
SLP Cost Model and TTI Callouts
The SLP profitability decision is the product of two cost functions that both delegate to TTI: getTreeCost() (sub_2B94A80, 71KB) and getReductionCost() (sub_2B28940). Understanding exactly how these call into TTI is essential for predicting when SLP will fire on a given kernel.
getTreeCost() (sub_2B94A80)
This 71KB function walks every node in the SLP tree and accumulates the cost difference between the vectorized form and the original scalar form. For each tree node, it:
- Calls sub_2B889C0 (45KB, the inner cost computation), which dispatches to TTI via sub_B91420 (TTI::getInstructionCost() at the IR level) --- called approximately 7 times per instruction to query costs for the scalar original, the vector alternative, and scalarization overhead (insert/extract elements).
- For load/store nodes, queries the memory cost model, which returns favorable costs for adjacent accesses (reflecting ld.v2/ld.v4 coalescing benefit) and high costs for gather/scatter patterns.
- For shuffle nodes (operand reordering), queries TTI::getShuffleCost(), which on NVPTX returns high cost for any non-identity shuffle --- GPU has no native shuffle-within-register instruction for packed 32-bit values.
- Returns a pair: (vectorCost : i64, isExact : i32). When isExact == 1, the cost is a precise measurement from the scheduling model; the profitability check accepts it unconditionally regardless of the threshold.
getReductionCost() (sub_2B28940)
Called with the TTI pointer (a4) as the second parameter, this function computes the cost of the horizontal reduction itself --- the shuffle-and-reduce tree that turns a vector into a scalar. Parameters:
sub_2B28940(
a1, // HorizontalReduction object
a4, // TargetTransformInfo*
v478, // operand window start
v479, // operand window end
v432, // hasExternalUses flag
v433, // common opcode mask from Phase 1
a2 // BoUpSLP tree
)
// Returns: (reductionCost : i64, costKind : i32)
The reduction cost on NVPTX is typically high because the GPU has no native horizontal reduction instruction for arbitrary vector widths. A <4 x float> fadd reduction requires 2 shuffle-and-add steps (log2(4) = 2), each involving an extractelement and a scalar fadd. The TTI cost model at sub_20E14F0 (33KB) provides the per-step latency from the scheduling model.
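The shuffle-and-reduce step count grows logarithmically with the vector width, which is why wide reductions stay expensive without a native horizontal instruction. A sketch of the arithmetic (not CICC code):

```c
#include <stdint.h>

/* log2(VF) shuffle+op steps for a horizontal reduction, matching the
 * <4 x float> fadd example above (VF=4 -> 2 steps). */
static uint32_t reduction_steps(uint32_t vf) {
    uint32_t steps = 0;
    while (vf > 1) {
        vf >>= 1;
        steps++;
    }
    return steps;
}
```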
Combined Profitability Decision
// Profitability check (decompiled from sub_2BD1C50, lines 2062-2163)
int64_t treeCost = sub_2B94A80(tree, ...); // vector tree cost
int64_t reducCost = sub_2B28940(rd, TTI, ...); // reduction overhead
int64_t combined = treeCost + reducCost; // overflow-checked via __OFADD__
int64_t threshold = -(int64_t)qword_5010428; // SLPCostThreshold, default 0
if (costKind == 1) {
// Exact cost from scheduling model: always accept
goto vectorize;
}
if (combined > threshold) {
// Not profitable: emit "HorSLPNotBeneficial" diagnostic
// Try smaller VFs via getNextLegalVF() loop
goto try_smaller_vf;
}
// Profitable: proceed to vectorizeTree()
The costKind == 1 fast path is notable: when the cost model can determine the exact scheduling benefit (rather than a heuristic estimate), it bypasses the threshold entirely. This typically fires for small, fully-analyzable SLP trees where every instruction's latency is known from the TTI scheduling tables at TTI+56.
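Stripped of the IR plumbing, the decision reduces to a small predicate (our paraphrase of the decompiled logic above; the __OFADD__ overflow check is omitted):

```c
#include <stdint.h>

/* Combined profitability predicate from the pseudocode above.
 * slp_cost_threshold models qword_5010428 (default 0); cost_kind == 1
 * is the exact-cost fast path that bypasses the threshold. */
static int should_vectorize(int64_t tree_cost, int64_t reduc_cost,
                            int32_t cost_kind, int64_t slp_cost_threshold) {
    if (cost_kind == 1)
        return 1;                              /* exact cost: always accept */
    int64_t combined = tree_cost + reduc_cost;
    return combined <= -slp_cost_threshold;    /* cost <= -threshold wins */
}
```

With the default threshold of 0, any non-positive combined cost vectorizes; a positive cost falls through to the getNextLegalVF() step-down loop.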
VF Stepping on Failure
When vectorization at the current VF is unprofitable, the horizontal reduction path does not immediately give up. Instead, it calls sub_2B1E190 (getNextLegalVF) to step down to the next smaller legal VF, then re-tries the entire build-tree / get-cost cycle:
// VF step-down loop (decompiled from sub_2BD1C50, lines 2097-2163)
while (currentVF > minVF) {
currentVF = sub_2B1E190(TTI, elementType, currentVF);
if (sub_2B3C060(&memoSet, {offset, currentVF})) // alreadyTried?
continue;
// Re-try vectorization at new VF
sub_2BAACB0(tree, ops, currentVF, ...); // buildTree
treeCost = sub_2B94A80(tree, ...); // getTreeCost
reducCost = sub_2B28940(rd, TTI, ...); // getReductionCost
combined = treeCost + reducCost;
if (combined <= threshold)
goto vectorize;
}
// All VFs exhausted: emit "HorSLPNotBeneficial"
The memoization set (sub_2B3C060) prevents re-evaluating the same (offset, VF) pair, which is essential because the VF step-down loop can iterate many times for large operand counts.
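The memoization described above can be sketched as a set keyed on (offset, VF) pairs; the struct and method names here are ours, not recovered symbols:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

// Minimal sketch of the alreadyTried() memoization (sub_2B3C060):
// each (offset, VF) pair is evaluated at most once across the VF
// step-down loop.
struct VFMemo {
    std::set<std::pair<uint64_t, unsigned>> tried;

    // Returns true if this (offset, VF) pair was already attempted;
    // records it either way.
    bool alreadyTried(uint64_t offset, unsigned vf) {
        return !tried.insert({offset, vf}).second;
    }
};
```

The first query for a pair returns false and records it; every later query for the same pair short-circuits the expensive build-tree / get-cost cycle.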
Configuration Knobs
Upstream LLVM Knobs (present in CICC)
| Knob | Type | LLVM Default | CICC Default | Effect |
|---|---|---|---|---|
| slp-threshold | int | 0 | 0 | Profitability threshold. Vectorize when cost <= -threshold. Default 0 means any non-positive cost is profitable. |
| slp-vectorize-hor | bool | true | true | Enable horizontal reduction vectorization. |
| slp-vectorize-hor-store | bool | false | false | Seed horizontal reduction from stores. |
| slp-max-reg-size | int | 128 | 128 | Maximum vector register size in bits for SLP scheduling. |
| slp-min-reg-size | int | 128 | 128 | Minimum vector register size. |
| slp-schedule-budget | int | 100000 | 100000 | Maximum scheduling region size per block. |
| slp-recursion-max-depth | int | 12 | 12 | Maximum recursion depth for tree building. |
| slp-min-tree-size | int | 3 | 3 | Minimum tree size for full vectorization. |
| vectorize-slp | bool | true | true | Master switch for the SLP pass. |
| view-slp-tree | bool | false | false | Display SLP trees with Graphviz (debug). |
| slp-max-vf | int | 0 | 0 | Maximum vector factor override (0 = unlimited). |
NVIDIA-Specific Globals
| Global | Address | Default | Effect |
|---|---|---|---|
| SLPMaxVF | qword_500F628 | 0 | When zero: minimum VF = 4 elements. When non-zero: minimum VF = 3, and the value caps the maximum VF. Also bypasses the power-of-2 VF requirement. |
| SLPCostThreshold | qword_5010428 | 0 | Cost threshold for horizontal reduction profitability. Test is cost > -(int)threshold. Default 0: any non-positive cost is profitable. |
| Straight-line max VF | qword_500FEE8 | unknown | Maximum VF override for straight-line SLP (sub_2BCE070), separate from horizontal reduction. |
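The SLPMaxVF policy in the table above can be sketched as follows; the struct and function names are ours, and the "no cap" encoding is an assumption:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the VF policy implied by the SLPMaxVF global (qword_500F628).
struct VFBounds {
    unsigned minVF;       // smallest operand count worth attempting
    unsigned maxVF;       // 0 means "no cap"
    bool requirePow2;     // must VF be a power of two?
};

VFBounds slpVFBounds(uint64_t slpMaxVF) {
    if (slpMaxVF == 0)
        return {4, 0, true};   // default: >= 4 operands, power-of-2 VFs only
    // Knob set: lower the minimum to 3, cap the maximum, allow any VF.
    return {3, static_cast<unsigned>(slpMaxVF), false};
}
```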
Key Behavioral Differences from Upstream
- Minimum VF default. When SLPMaxVF is zero (default), CICC requires at least 4 scalar operands to attempt horizontal reduction vectorization. Upstream LLVM has no such global minimum; it relies on slp-min-tree-size (default 3) instead.
- VF clamping. CICC clamps VF to [128, 512] bits based on the a2+840 scheduling width, then steps down via getNextLegalVF() (sub_2B1E190). Upstream computes VF from TTI::getMaximumVF() or slp-max-reg-size without the explicit bit-width clamping. The [128, 512] range allows VF=4 through VF=16 for f32 types, whereas upstream NVPTX (32-bit register width) would produce VF=1.
- Paired fadd merging. CICC merges consecutive fadd reduction bundles into wider bundles via sub_2B3C030 / sub_2B25EA0 / sub_2B38BA0. This is absent from upstream and is targeted at GPU warp-level reduction patterns. See the dedicated section above.
- Scheduling-width-driven VF (not register-width-driven). The upstream SLP vectorizer derives VF from TTI::getRegisterBitWidth(Vector). CICC stores a separate scheduling width at a2+840 that reflects the available register budget after accounting for live-in pressure. This decouples SLP VF from the register file width, allowing profitable vectorization even though getRegisterBitWidth(Vector) returns 32.
- isTreeNotBeneficialForArch(). CICC adds a GPU-architecture-specific early rejection filter (sub_2B2DA40) that takes the SM reduction type as a parameter. It rejects tree shapes known to be unprofitable on the target SM variant (e.g., trees that would produce reduction patterns not supported by the SM's warp-level primitives).
- O-level gating. SLP vectorization is gated by tier != 1 in the pipeline assembler: it is disabled at O1 and enabled at O2 and O3. At O2/O3, the LoopVectorize/SLP parameter width is set to tier (2 at O2, 3 at O3), affecting the scheduling width multiplier. SM-architecture-dependent thresholds are resolved at runtime via the a2+840 value.
- Non-power-of-2 VF support. When SLPMaxVF (qword_500F628) is non-zero and operand_count + 1 is a power of 2, adjustVF() (sub_2B1FA70) returns operand_count directly, enabling VFs like 3, 5, 7. Upstream LLVM requires power-of-2 VFs except in specific recent patches for fixed-length non-power-of-2 vectorization.
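The non-power-of-2 rule can be sketched as follows. The `operand_count + 1` test comes from the analysis above; the power-of-2 round-down fallback is our assumption, not recovered logic:

```cpp
#include <cassert>
#include <cstdint>

static bool isPow2(uint64_t x) { return x != 0 && (x & (x - 1)) == 0; }

// Sketch of the adjustVF() (sub_2B1FA70) selection rule: with the
// SLPMaxVF knob active, an operand count one short of a power of two
// is used as the VF directly; otherwise round down to a power of two
// (the fallback is hypothetical).
uint64_t adjustVF(uint64_t operandCount, bool slpMaxVFActive) {
    if (operandCount == 0)
        return 0;
    if (slpMaxVFActive && isPow2(operandCount + 1))
        return operandCount;           // e.g. 3 or 7 survive unrounded
    uint64_t vf = 1;
    while (vf * 2 <= operandCount)
        vf *= 2;
    return vf;
}
```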
Diagnostic Strings
| String | Function | Meaning |
|---|---|---|
| "SLP vectorized with cost N" | sub_2BCE070 | Straight-line SLP succeeded |
| "Cannot SLP vectorize list:" | sub_2BCE070 | Straight-line SLP failed legality/cost |
| "Stores SLP vectorized with cost N" | sub_2BCA110 | Store-seeded SLP succeeded |
| "HorSLPNotBeneficial" | sub_2BD1C50 | Horizontal reduction not profitable |
| "Vectorizing horizontal reduction is possible but not beneficial with cost C and threshold T" | sub_2BD1C50 | Full rejection diagnostic with cost details |
| "VectorizedHorizontalReduction" / "Vectorized horizontal reduction with cost C and with tree size N" | sub_2BD1C50 | Horizontal reduction succeeded |
| "const.rdx" | sub_2B21B90 | Intermediate reduction variable name |
| "rdx.shuf.l", "rdx.shuf.r" | (cluster 0x1BDDB00) | Left/right reduction shuffle names |
| "op.rdx", "op.extra" | (cluster 0x1BDDB00) | Reduction operation and extra operation names |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| HorizontalReduction::tryToReduce() (main horizontal reduction entry) | sub_2BD1C50 | 85 KB | -- |
| Straight-line SLP vectorizer entry | sub_2BCE070 | -- | -- |
| Store-SLP vectorizer entry | sub_2BCA110 | -- | -- |
| BoUpSLP::buildTree() | sub_2BAACB0 | -- | -- |
| BoUpSLP::getTreeCost() | sub_2B94A80 | 71 KB | -- |
| BoUpSLP::vectorizeTree() (codegen) | sub_2BC6BE0 | 71 KB | -- |
| BoUpSLP::computeScheduleData() | sub_2BBDBE0 | 40 KB | -- |
| BoUpSLP::scheduleBlock() | sub_2BBFB60 | 71 KB | -- |
| BoUpSLP::optimizeGatherSequence() | sub_2BB3590 | -- | -- |
| BoUpSLP::reorderInputsIfNecessary() | sub_2BB0460 | -- | -- |
| BoUpSLP::buildExternalUses() | sub_2B4F3D0 | -- | -- |
| getReductionCost() | sub_2B28940 | -- | -- |
| createFinalReduction() | sub_2B21C80 | -- | -- |
| createReductionOp() ("const.rdx") | sub_2B21B90 | -- | -- |
| buildReductionResult() | sub_2B2FE10 | -- | -- |
| reduceTreeLevel() | sub_2B2F4A0 | -- | -- |
| isReductionOp() | sub_2B0D8B0 | -- | -- |
| isHomogeneous() (all ops satisfy predicate) | sub_2B0D880 | -- | -- |
| canVectorize() (legality check) | sub_2B4B450 | -- | -- |
| isTreeTinyAndNotFullyVectorizable() | sub_2B2DB00 | -- | -- |
| isTreeNotBeneficialForArch() | sub_2B2DA40 | -- | -- |
| adjustVF() (vectorization factor selection) | sub_2B1FA70 | -- | -- |
| getNextLegalVF() | sub_2B1E190 | -- | -- |
| getScalarTypeWidth() | sub_2B49BC0 | -- | -- |
| hasVectorizableReductions() | sub_2B6E610 | -- | -- |
| tryMergeFaddBundles() (NVIDIA-specific) | sub_2B3C030 | -- | -- |
| validateMergedBundle() (NVIDIA-specific) | sub_2B25EA0 | -- | -- |
| rewriteMergedBundle() (NVIDIA-specific) | sub_2B38BA0 | -- | -- |
| perBundleVectorize() | sub_2B77B90 | -- | -- |
| emitVectorizedReductionDiagnostic() | sub_2B44ED0 | -- | -- |
| reorderForCanonical() | sub_2B33D00 | -- | -- |
| SLP tree scheduling | sub_2BD7F70 | 46 KB | -- |
| SLP tree cost computation | sub_2B889C0 | 45 KB | -- |
| SLP value rewriting (scalar-to-vector) | sub_2BCFB90 | 44 KB | -- |
| SLP node creation (tree construction) | sub_2BCAEC0 | 42 KB | -- |
| deleteTree() (cleanup on failure) | sub_2B5C350 | -- | -- |
| alreadyTried() (VF memoization) | sub_2B3C060 | -- | -- |
| tryNextVF() (advance or fail) | sub_2B399C0 | -- | -- |
| classifyReductionPair() (per-bundle opcode pair extraction) | sub_2B5F980 | -- | -- |
| hasExternalUses() (external use check for bundles) | sub_2B27F10 | -- | -- |
| getTargetInfo() (TTI accessor) | sub_BD5C60 | -- | -- |
| initDominatorContext() | sub_D5F1F0 | -- | -- |
| hashOperandSlice() (operand slice hash for scheduling cache) | sub_27B0000 | -- | -- |
| Extended opcode classifier (opcodes > 0x1C) | sub_2B15E10 | -- | -- |
| buildOperandOrder() (commutative reorder table) | sub_2B3D4E0 | -- | -- |
| isInScheduledSet() (scheduling membership test) | sub_2B3D560 | -- | -- |
| Reduction use counter (per-operand) | sub_2B54920 | -- | -- |
| TTI::getRegisterBitWidth(Vector) (returns 32) | sub_DFE640 | -- | -- |
| TTI::supportsScalableVectors() (returns false) | sub_DFE610 | -- | -- |
| TTI::getRegisterBitWidth(Scalar) (returns 32) | sub_DFB1B0 | -- | -- |
| TTI::getInstructionCost() (scheduling cost model) | sub_20E14F0 | 33 KB | -- |
| TTI::getInstructionCost() (IR-level variant) | sub_B91420 | -- | -- |
| TTI::hasAttribute(N) (function attribute query) | sub_B2D610 | -- | -- |
Data Structure: HorizontalReduction Object
| Offset | Type | Field |
|---|---|---|
| +0 | ReductionBundle* | Array of reduction bundle structs |
| +8 | u32 | Bundle count |
| +304 | Value** | Pointer to operand arrays (each bundle = 64 bytes) |
| +312 | u32 | Operand array count |
| +384 | void* | Auxiliary dependency table |
| +392 | void* | useDef map (bit 0 = inline/external flag) |
| +400 | void* | useDef map pointer |
| +408 | u32 | useDef map capacity |
| +1568 | Value* | Root function / reduction entry value |
| +1576 | u32 | SM reduction type (arch-specific opcode) |
| +1580 | u8 | Commutative flag |
| +1584 | char* | Output result array |
| +1592 | u32 | Output result count |
| +1596 | u32 | Output result capacity |
| +1600 | char[16] | Inline result storage |
Cross-References
- LoopVectorize & VPlan -- loop-based vectorization, runs alongside SLP in the same pipeline step
- Loop Unrolling -- unrolling exposes more straight-line code for SLP
- Pipeline & Ordering -- SLP placement at pipeline step 31
- GVN -- runs after SLP to clean up redundancies introduced by vectorization
- Optimization Levels -- SLP enabled at tier 2+; width parameter varies by tier
- NVPTX Target Infrastructure -- TTI hook return values that drive SLP VF selection and cost model
- Type Legalization -- vector split/scalarize rules that constrain SLP output legality
- SelectionDAG & NVPTX Lowering -- custom lowering of SLP-produced vector loads/stores to ld.vN/st.vN
- GPU Execution Model -- memory coalescing requirements that motivate SLP on GPU
Loop Strength Reduction (NVIDIA Custom LSR)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp (LLVM 20.0.0). CICC ships the stock LLVM LSR at 0x284F650--0x287C150 alongside a completely separate NVIDIA custom formula solver at 0x199A--0x19BF. The custom solver replaces the formula generation and selection phases while reusing LLVM's SCEV infrastructure, IV rewriting, and chain construction.
NVIDIA ships two entirely separate LSR implementations inside cicc v13.0. The first is upstream LLVM's LoopStrengthReducePass (approximately 200 helpers across 0x284F650--0x287C150, compiled from llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp). The second is a custom 160KB formula solver (sub_19A87A0, 2688 decompiled lines) sitting at 0x199A--0x19BF, wrapped by NVLoopStrengthReduce at sub_19CE990. Both are invoked through the "loop-reduce" pass name in the LLVM new pass manager pipeline, but NVIDIA's overlay replaces the formula generation and selection phases with GPU-aware logic while reusing LLVM's SCEV infrastructure, IV rewriting, and chain construction.
This page documents the NVIDIA overlay -- the most GPU-specific LLVM pass in cicc. If you are reimplementing cicc's optimizer, this is the pass you cannot skip.
Why NVIDIA Rebuilt LSR
The root motivation is a single equation that does not exist on CPUs: register count determines occupancy, and occupancy determines performance. On a GPU, each additional register per thread can cross a discrete occupancy cliff, dropping warp-level parallelism by an entire warp group -- see the GPU Execution Model for the register budget and cliff table.
On a CPU, LSR's primary concern is minimizing the number of live induction variables to reduce register pressure, with a secondary goal of producing address expressions that fold into hardware addressing modes. The cost model compares formulae by counting registers, base additions, immediate encoding costs, and setup instructions. This works because a CPU's register file is fixed (16 GPRs on x86-64) and the cost of spilling to cache is relatively uniform.
On an NVIDIA GPU, four properties break this model:
- Discrete occupancy cliffs. A formula that saves one instruction but adds one register might push the kernel past a cliff and lose 50% throughput. The cliff boundaries and their impact are documented in the occupancy cliff table.
- No equivalent of L1 spill cost. When a GPU "spills," values go to local memory (DRAM, 200-800 cycles), which is orders of magnitude slower than CPU L1 cache.
- Address space semantics. GPU memory is partitioned into address spaces with different widths and hardware addressing modes. Shared memory (addrspace(3)) uses 32-bit pointers with specialized .shared:: load/store instructions. Generic pointers are 64-bit. Strength-reducing a 32-bit shared-memory pointer can produce 64-bit intermediate values that force truncation, defeating the optimization.
- Typed registers. PTX uses typed virtual registers (%r for 32-bit, %rd for 64-bit, %f for float). A 64-bit induction variable costs two 32-bit register slots. On older architectures (sm_3x through sm_5x), 64-bit integer operations are emulated and expensive; on sm_70+, native 64-bit addressing makes them acceptable.
LLVM's stock cost model knows none of this. It calls TTI::isLSRCostLess which compares an 8-field cost tuple ({Insns, NumRegs, AddRecCost, NumIVMuls, NumBaseAdds, ImmCost, SetupCost, ScaleCost}), but the NVPTX TTI implementation cannot express occupancy cliffs, address space constraints, or the sign-extension register savings that matter on GPU. NVIDIA's solution: replace the formula solver entirely, with 11 knobs for fine-grained control.
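For reference, the 8-field tuple compare can be sketched as a lexicographic comparison; the field ordering below is illustrative and not verified against LLVM 20's actual isLSRCostLess implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Illustrative sketch of the 8-field LSR cost tuple named above, with a
// lexicographic comparison in the spirit of TTI::isLSRCostLess. Field
// order is an assumption for illustration.
struct LSRCost {
    unsigned Insns, NumRegs, AddRecCost, NumIVMuls;
    unsigned NumBaseAdds, ImmCost, SetupCost, ScaleCost;
};

bool isCostLess(const LSRCost &a, const LSRCost &b) {
    return std::tie(a.Insns, a.NumRegs, a.AddRecCost, a.NumIVMuls,
                    a.NumBaseAdds, a.ImmCost, a.SetupCost, a.ScaleCost) <
           std::tie(b.Insns, b.NumRegs, b.AddRecCost, b.NumIVMuls,
                    b.NumBaseAdds, b.ImmCost, b.SetupCost, b.ScaleCost);
}
```

The point of the sketch is the shape of the problem: a flat scalar tuple like this has no field in which an occupancy cliff or an address-space constraint could be expressed.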
Architecture Overview
The NVIDIA LSR overlay is structured as a 7-phase formula solver pipeline. The main entry point is sub_19A87A0, which takes a single argument: a pointer to an LSR state object (referred to as a1 throughout). The state object is large -- relevant fields span from offset 0 through offset 32160.
LSR State Object Layout
| Offset | Type | Field |
|---|---|---|
| +8 | ScalarEvolution* | SCEV analysis handle |
| +32 | LoopInfo* | Loop analysis handle |
| +40 | uint64_t | Target address space identifier |
| +192 | int64_t** | Stride factor table (array of stride values) |
| +200 | uint32_t | Stride factor count |
| +320 | void* | Reuse chain table base |
| +328 | uint32_t | Reuse chain count |
| +368 | LoopRecord* | Loop use-groups array base |
| +376 | uint32_t | Loop use-groups count |
| +32128 | RPTracker | Register pressure tracking structure |
| +32136 | void* | Formula hash table base |
| +32152 | uint32_t | Formula hash table bucket count |
| +32160 | void* | Working formula set |
Each loop record is 1984 bytes, and each use record within a loop is 96 bytes. The loop's use array starts at loop record offset +744, with the use count at +752. These strides -- 1984 bytes per loop record, 96 bytes per use record -- are constant across all 7 phases. The solver iterates loop_count * uses_per_loop in every phase, making the algorithm O(L * U * S) overall, where S is the stride factor table size.
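The fixed-stride record layout can be illustrated with raw offset arithmetic. Only the sizes and the +744 offset come from the analysis; the buffer setup and names are ours:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustration of the recovered record strides: 1984-byte loop records,
// a use-array base pointer stored at loop offset +744, and 96-byte use
// records inside that array.
constexpr size_t kLoopRecSize = 1984;
constexpr size_t kUseRecSize  = 96;
constexpr size_t kUseArrayOff = 744;

const uint8_t *useRecord(const uint8_t *loops, size_t loopIdx, size_t useIdx) {
    const uint8_t *loop = loops + loopIdx * kLoopRecSize;
    const uint8_t *uses = nullptr;
    // Load the use-array base pointer stored at loop offset +744.
    std::memcpy(&uses, loop + kUseArrayOff, sizeof(uses));
    return uses + useIdx * kUseRecSize;
}
```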
The 7-Phase Formula Solver Pipeline
Phase 1: Initial Use Setup (lines 471--537)
The solver iterates all loop use-groups and, within each, all individual uses. For each 96-byte use record, it:
- Copies the use record to stack locals (the record contains base SCEV, stride SCEV, flags, formula kind, scaled register array, offset expression, and secondary immediate).
- Calls sub_19930D0 to expand the scaled register array into a working formula operand list.
- Calls sub_19A22F0 (per-register formula generation): iterates over the scaled register count and calls sub_19A1B20 for each register operand to generate one initial formula candidate per operand. If formula_kind == 1 (with-offset mode), it also calls sub_19A1B20 with operand_idx = -1 to generate a formula for the offset expression itself.
- Calls sub_19A23A0 (alternative formula generation): a second pass with different addressing-mode generation logic, likely producing formulae with combined base+offset or folded immediates.
The output of Phase 1 is a set of initial formula candidates, one per (use, scaled register) pair, covering the basic addressing modes.
Phase 2: Expression Folding with Unfolded Offsets (lines 548--662)
This phase targets uses where base == NULL (pure IV uses, no base pointer). It performs two sub-passes:
Sub-pass A (unfold offset into base): For each pure-IV use, calls sub_19A2680 per scaled register to generate candidate formulae that move the offset expression into the base register field. This is the inverse of LLVM's stock "fold offset into immediate" transform -- NVIDIA sometimes wants the offset in a register because GPU addressing modes have limited immediate widths.
Sub-pass B (factor loop bounds into formula): Builds an iterator set from the loop's start bound (+712) and end bound (+720), then calls sub_19A2820 per (scaled register, iterator bound) pair. This generates formulae that factor common strides out of the loop bounds. For example, if the loop runs i = 0..N and a use computes 4*i + base, this phase can factor out the stride 4 and produce a formula with a single-step IV.
Phase 3: Stride Factor Expansion (lines 662--862)
Runs only for loops where the use type is 3 (immediate-only addressing). This is the phase that explores alternative stride factors from the stride factor table (a1+192).
For each use:
- Extract the representative SCEV via sub_1456040.
- Verify bit width is at most 64 via sub_1456C90.
- Verify a single loop bound (start == end, meaning unit stride).
- Reject floating-point constant offsets (type 15).
- For each stride factor S in the table:
  - Compute scaled_stride = S * use_stride.
  - Overflow check: verify S * stride / S == stride and that INT64_MIN is not involved (avoiding signed overflow UB).
  - Validate SCEV representability via sub_1594790.
  - Also validate S * loop_start and S * loop_end.
  - Construct a candidate formula with the factored stride.
  - Run the formula legality check via sub_1995490 (validates against TTI target legality, loop dominance, and SCEV overflow).
  - If legal: rewrite all scaled operands via sub_13A5B60, check value equivalence via sub_1999100, then commit via sub_19A1660.
The overflow guards in this phase are critical. The multiplication overflow check (v94 * v455 / v94 == v455) prevents generating formulae whose stride values cannot be represented in 64-bit arithmetic, which would silently produce wrong results.
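The guard can be sketched as follows; the function name is ours, but the divide-back check mirrors the decompiled v94 * v455 / v94 == v455 pattern:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the Phase 3 multiplication overflow guard: reject any
// scaled stride whose 64-bit product wrapped, and refuse INT64_MIN
// outright (its negation is unrepresentable).
bool computeScaledStride(int64_t factor, int64_t stride, int64_t *out) {
    if (factor == 0)
        return false;                        // no factor to apply
    if (factor == INT64_MIN || stride == INT64_MIN)
        return false;                        // negating INT64_MIN overflows
    // Multiply in unsigned space (wraps instead of UB), then verify by
    // dividing the product back out.
    int64_t prod = (int64_t)((uint64_t)factor * (uint64_t)stride);
    if (prod / factor != stride)
        return false;                        // wrapped: not representable
    *out = prod;
    return true;
}
```

The unsigned multiply avoids the very signed-overflow UB the original check is guarding against.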
Phase 4: Chain-Based Formula Generation (lines 872--1082)
For each use, the solver attempts chain-based strength reduction: building formulae where one IV feeds the next use through a simple increment rather than a full recomputation.
Key logic:
- Extracts the representative SCEV for the use.
- If formula_kind != 1 (not with-offset) or the formula has a single element, iterates stride factors and builds chained formulae.
- For immediate-type uses (type == 3), also considers promoting to with-offset mode (type == 1).
- Each candidate is validated through sub_1995490.
- Operands are rewritten via sub_145CF80 (SCEV multiply by stride factor).
- The flag at loop record +728 controls address-space-aware chain construction. When set, chains respect address space constraints -- critical for shared memory (see the disable-lsr-for-sharedmem32-ptr knob section).
Phase 5: Reuse Chain Matching (lines 1093--1256)
For uses where base == NULL, the solver attempts to match existing IV chains for reuse rather than creating new ones.
- Extract the representative SCEV and compute its "extent" (value range) via sub_1456E10.
- Iterate the reuse chain table (a1+320 through a1+328).
- For each chain entry, check if the use's extent matches the chain target.
- Validate legality via sub_14A2CF0.
- If matched: rewrite the use's offset via sub_147BE70 (SCEV rebase).
- Register pressure check: validate via sub_19955B0(rp_tracker, scev_value, loop_idx) that the register pressure after adding this formula stays under the limit.
- If RP passes: tag with address space via sub_19932F0, commit via sub_19A1660.
This is where the lsr-check-rp and lsr-rp-limit knobs have direct effect. The sub_19955B0 function compares projected register pressure against the configured ceiling and returns false if the formula would exceed it.
Phase 6: Formula-to-Use Hash Table Construction (lines 1260--1940)
The most complex phase. It builds two hash tables and uses them to identify which formulae serve multiple uses (shared IV expressions).
Hash Table 1 (7-QWORD entries per slot): maps SCEV expression to a linked list of formula candidates.
| Field Offset | Size | Content |
|---|---|---|
| +0 | 8 | Key: SCEV pointer, or sentinel (-8 = empty, -16 = tombstone) |
| +8 | 8 | Formula candidate linked list head |
| +16 | 4 | Candidate count |
| +24 | 8 | Linked list tail |
| +32 | 8 | Previous pointer (for median walk) |
| +40 | 8 | Next pointer (for median walk) |
| +48 | 8 | Total cost accumulator |
Hash Table 2 (2-QWORD entries): maps SCEV expression to a use-count bitmap tracking how many uses reference the expression.
Both tables use the same hash function: (val >> 9) ^ (val >> 4) masked to the bucket count. Probing is quadratic with tombstone support. Resize triggers at 75% load factor via sub_19A82C0.
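The recovered hash and probing scheme can be sketched as a small map; the class is ours, and the sentinel encoding and 75% resize policy are simplified away (this map never resizes):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the Phase 6 hash scheme: key hash (val >> 9) ^ (val >> 4)
// masked to a power-of-two bucket count, with triangular-step
// (quadratic) probing.
class ScevCountMap {
    static constexpr uint64_t kEmpty = ~0ull;   // stand-in for the -8 sentinel
    std::vector<uint64_t> keys;
    std::vector<unsigned> counts;
public:
    explicit ScevCountMap(size_t pow2Buckets)
        : keys(pow2Buckets, kEmpty), counts(pow2Buckets, 0) {}

    static size_t hashKey(uint64_t val, size_t mask) {
        return ((val >> 9) ^ (val >> 4)) & mask;
    }
    unsigned &operator[](uint64_t key) {
        size_t mask = keys.size() - 1;
        size_t slot = hashKey(key, mask);
        // Probe offsets 0, 1, 3, 6, 10, ... visit every slot of a
        // power-of-two table, so the loop terminates while space remains.
        for (size_t step = 1;; slot = (slot + step++) & mask) {
            if (keys[slot] == key)
                return counts[slot];
            if (keys[slot] == kEmpty) {
                keys[slot] = key;
                return counts[slot];
            }
        }
    }
};
```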
The phase then:
- Inserts every formula from the working set into Hash Table 1 with SCEV normalization via sub_199D980.
- Cross-references into Hash Table 2 for use counting, merging bitmaps via sub_1998630.
- Iterates the formula set again and, for each formula, traverses the linked list of referencing uses.
- Computes combined cost using sub_220EFE0 (reads cost from a binary tree node at +32).
- Finds the median-cost insertion point (threshold at total_cost / 2) -- this is a key difference from upstream LLVM, which always picks the cheapest formula. NVIDIA's median heuristic avoids both extremes: the cheapest formula might use too many registers, while the most register-efficient formula might use too many instructions.
- Builds (register_id, distance) candidate pairs for each (formula, use) combination. If the candidate set exceeds 31 entries, it migrates from an inline SmallVector to a balanced tree set (sub_19A5C50).
The use-count bitmap uses a compact inline representation: if (value & 1), the high bits encode max_reg_id and the remaining bits form the bitmap directly; otherwise, the value is a pointer to a heap-allocated BitVector (size at +16, data at +0). The popcount check at line 1927 (popcount != 1) filters out expressions used by only one use -- they cannot benefit from strength reduction.
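A simplified sketch of that tagged word follows. The low-bit tag and the inline-vs-heap split come from the analysis; the inline max_reg_id field is omitted, and the spill details are ours (heap pointers from new are even, so the tag is unambiguous):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified sketch of the tagged use-count word: low bit set means the
// remaining bits are an inline bitmap; low bit clear means the word is
// a pointer to a heap bit vector. (Spilled storage is leaked here for
// brevity.)
struct UseBitmap {
    uintptr_t word = 1;   // tag set, no bits recorded: inline and empty

    std::vector<bool> *vec() const { return (std::vector<bool> *)word; }

    void set(unsigned bit) {
        const unsigned inlineBits = 8 * sizeof(uintptr_t) - 1;
        if ((word & 1) && bit < inlineBits) {
            word |= (uintptr_t)1 << (bit + 1);   // skip the tag bit
            return;
        }
        if (word & 1) {                           // spill inline -> heap
            auto *v = new std::vector<bool>(1024, false);
            for (unsigned i = 0; i < inlineBits; ++i)
                if (word & ((uintptr_t)1 << (i + 1)))
                    (*v)[i] = true;
            word = (uintptr_t)v;                  // even pointer: tag clear
        }
        (*vec())[bit] = true;
    }
    unsigned popcount() const {
        if (word & 1) {
            unsigned n = 0;
            for (uintptr_t w = word >> 1; w; w >>= 1) n += w & 1;
            return n;
        }
        unsigned n = 0;
        for (bool b : *vec()) n += b;
        return n;
    }
};
```

A popcount() == 1 result is exactly the single-use case the Phase 6 filter discards.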
Phase 7: Final Formula Selection and Commitment (lines 2042--2686)
After hash table cleanup, the solver iterates the candidate triples (register_id, distance, scev_use) and performs the final selection:
for each candidate (reg_id, distance, scev_use):
loop_record = a1->loops[loop_idx]
repr_scev = getStart(scev_use) // sub_1456040
extent = getExtent(repr_scev) // sub_1456E10
offset_expr = getAddExpr(extent, -distance, 0) // sub_15A0680
offset_norm = foldNormalize(scev_ctx, offset_expr) // sub_146F1B0
bit_budget = getBitWidth(offset_norm) // sub_1456C90
for each use in loop_record:
copy 96-byte use record to stack
if formula_kind == 1: // with-offset mode
fold offset into scaled_regs
set formula_kind = 0 // demote to normal
if candidate IV appears in use's scaled_regs:
// Direct replacement path
validate via sub_1995490 (formula legality)
build replacement AddRecExpr: sub_147DD40(scev_ctx, [target_iv], 0, 0)
replace matching operand in formula
// Sign-extension / width-fit check:
value_range = computeRange(replacement)
if abs(distance) < value_range:
tag with address space (sub_19932F0)
commit formula (sub_19A1660)
else if use references a different IV:
// Cross-IV replacement path
alt_offset = stride + num_uses * distance
alt_formula = getAddExpr(extent, -alt_offset, 0)
validate via sub_1995490
// Sign-bit check: if sign(replacement) == sign(distance),
// the formula may wrap -- reject
if sign_bit_matches: continue
// Width-fit check via APInt:
if countLeadingZeros(result) confirms fit in register width:
commit formula
The width-fit checks use full APInt arithmetic (sub_16A4FD0 for copy, sub_16A7490 for shift/add, sub_16A8F40 for negate, sub_16A7400 for absolute value, sub_16A7B50 for bitwise AND, sub_16A57B0 for leading zero count) to determine whether the replacement formula's value range fits in the bit budget. This is essential for correctness: a formula that overflows its register width produces wrong results silently.
Register Pressure Integration
The integration between LSR and register pressure is the single most important difference from upstream LLVM. It works at three levels:
Level 1: Hard Gate (lsr-check-rp + lsr-rp-limit)
Before committing any reuse chain formula (Phase 5) and internally within the legality check sub_1995490, the solver calls sub_19955B0(rp_tracker, scev_value, loop_idx). This function reads the pre-computed per-loop register pressure estimate from offset a1+32128 and compares the projected post-formula RP against lsr-rp-limit. If the projected RP exceeds the limit, the formula is rejected outright -- it does not even enter the candidate set.
This prevents the pathological case where LSR produces a formula that requires one less instruction per iteration but needs two more live registers, pushing the kernel past an occupancy cliff. On GPU, that one extra instruction is vastly cheaper than the occupancy loss.
Level 2: Bit Budget Proxy (Phase 7)
The "bit budget" computed in Phase 7 (v325 = sub_1456C90(offset_norm)) acts as an indirect register pressure proxy. Wider values need more register slots: a 64-bit value occupies two 32-bit register slots on NVPTX. By enforcing that replacement formulae fit within the bit budget, the solver prevents needless register widening.
Level 3: Sign-Extension Credit (count-sxt-opt-for-reg-pressure + lsr-sxtopt)
When lsr-sxtopt is enabled, LSR attempts to fold sign-extension operations into IV expressions, producing narrower IVs. When count-sxt-opt-for-reg-pressure is also enabled, the cost model credits the register pressure savings from eliminated sign-extensions. A formula that requires one more base register but eliminates a sign-extension might be net-neutral or even beneficial in RP terms.
Level 4: Median-Cost Heuristic (Phase 6)
Rather than always selecting the cheapest formula (as upstream LLVM does), NVIDIA uses a median-cost heuristic. The total cost is summed across all uses of a formula, and the selection threshold is total_cost / 2. This balances instruction cost against register pressure: the cheapest formula often has the highest register pressure, while the formula nearest the median typically represents a balanced tradeoff.
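The selection rule can be sketched as a walk over the cost-sorted candidates that stops once the running total crosses half the overall total; the function name, walk direction, and tie-breaking are our simplification:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the median-cost selection: instead of taking the cheapest
// candidate outright, pick the first candidate (in ascending cost
// order) at which the running cost total reaches total_cost / 2.
size_t pickMedianCostCandidate(const std::vector<int64_t> &costs) {
    std::vector<size_t> order(costs.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return costs[a] < costs[b]; });
    int64_t total = 0;
    for (int64_t c : costs) total += c;
    int64_t running = 0;
    for (size_t idx : order) {
        running += costs[idx];
        if (2 * running >= total)    // threshold: total_cost / 2
            return idx;
    }
    return order.empty() ? 0 : order.back();
}
```

With costs {4, 5, 6} this picks the 5-cost candidate, not the cheapest one, which is the balancing behavior described above.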
GPU-Specific Knobs
All 11 knobs are registered at ctor_214_0 (0x4E4B00). They are LLVM cl::opt command-line options injected through NVIDIA's option registration infrastructure.
Complete Knob Reference Table
| Knob | Type | Default | Category | Effect |
|---|---|---|---|---|
disable-unknown-trip-lsr | bool | false | Scope Control | Skips LSR entirely for loops where SCEV cannot determine the trip count. Unknown-trip loops on GPU may be warp-divergent; applying LSR without trip count knowledge can increase register pressure with no loop-count-informed gain. |
lsr-check-rp | bool | true [MEDIUM confidence] | Register Pressure | Master switch for register pressure checking. When disabled, LSR ignores occupancy constraints and behaves more like upstream LLVM. Default inferred from observed RP-aware behavior in O2 compilations; constructor default not directly confirmed. |
lsr-rp-limit | int | ~32-64 [LOW confidence] | Register Pressure | Register pressure ceiling. If current RP for the loop meets or exceeds this value, LSR is skipped for that loop. The threshold is set to coincide with occupancy cliff boundaries. Range estimated from SM occupancy math; actual compiled-in default not extracted from binary. |
filter-bad-formula | bool | true [MEDIUM confidence] | Formula Quality | Enables NVIDIA's custom formula pruning pass. "Bad" formulae are those requiring too many registers or producing address modes unsupported by SASS (for example, formulae that require scaled-index modes that only exist on CPU). Default inferred from observed pruning behavior; constructor value unconfirmed. |
do-lsr-64-bit | bool | arch-dependent | IV Width | Enables LSR for 64-bit induction variables. Default is false on sm_3x through sm_5x (where 64-bit integer ops are emulated), true on sm_70+ (native 64-bit datapath). |
count-sxt-opt-for-reg-pressure | bool | true [MEDIUM confidence] | Register Pressure | When calculating RP cost, credits the register savings from sign-extension eliminations that LSR enables. Default inferred from observed behavior; constructor value unconfirmed. |
lsr-sxtopt | bool | true [MEDIUM confidence] | Sign Extension | Master switch for sign-extension folding within LSR. Folds sign-extension operations into IV expressions to produce narrower IVs, reducing register file consumption. Default inferred from observed behavior; constructor value unconfirmed. |
lsr-loop-level | int | 0 (all) | Scope Control | Restricts LSR to loops at a specific nesting depth. 0 = all levels. 1 = innermost loops only (where address arithmetic is hottest). |
lsr-skip-outer-loop | bool | false | Scope Control | Skips the outer loop's IV when processing nested loops. Prevents strength-reducing the outer IV when the inner loop is the performance bottleneck. |
disable-lsr-for-sharedmem32-ptr | bool | false | Address Space | Disables LSR for pointers into 32-bit shared memory (addrspace(3)). Protects efficient .shared:: addressing modes and bank-conflict-free access patterns. |
disable-lsr-complexity-discount | bool | false | Cost Model | Disables the complexity estimation discount. When the discount is active (this knob is false), the cost model gives a bonus to formulae that reduce addressing complexity even if they use more registers. Disabling forces strict register-count-based comparison. |
Knob Grouping by Function
Register pressure control (4 knobs): lsr-check-rp, lsr-rp-limit, count-sxt-opt-for-reg-pressure, lsr-sxtopt. These collectively determine whether and how aggressively the solver factors occupancy into formula selection. With all four active, NVIDIA's LSR is deeply occupancy-aware. With all four disabled, it degrades toward upstream LLVM behavior.
Scope control (3 knobs): disable-unknown-trip-lsr, lsr-loop-level, lsr-skip-outer-loop. These restrict which loops LSR operates on. They are safety valves: if LSR is hurting a specific kernel, these allow narrowing its scope without disabling it entirely.
Address space control (2 knobs): disable-lsr-for-sharedmem32-ptr, do-lsr-64-bit. These control how LSR interacts with GPU memory semantics. The shared-memory knob protects 32-bit pointer optimality; the 64-bit knob controls IV width policy.
Cost model control (2 knobs): filter-bad-formula, disable-lsr-complexity-discount. These tune the formula evaluation heuristics. The bad-formula filter removes candidates early; the complexity discount adjusts the tradeoff between instruction count and register count.
Address-Space Awareness
Shared Memory 32-Bit Pointer Protection
Shared memory on NVIDIA GPUs uses addrspace(3) with 32-bit addressing. The hardware provides dedicated .shared:: load/store instructions with efficient addressing modes, including bank-conflict-free access patterns tied to pointer alignment.
NVIDIA's LSR overlay tracks address spaces at two levels:
- Loop level: the address space identifier at loop record +40.
- Use level: the alignment constraint at use record +48.
In Phase 4 (chain-based formula generation), line 983 checks use+48 == a1+40 || flag_at_+728. If the use's address space matches the target or the address-space-crossing flag is set, the solver uses address-space-aware chain construction. The sub_19932F0 helper tags committed formulae with the correct address space.
When disable-lsr-for-sharedmem32-ptr is enabled, the solver skips all formulae targeting addrspace(3) pointers. The rationale: strength-reducing a 32-bit shared memory pointer can create 64-bit intermediate values (the IV increment may be computed in 64-bit before truncation to 32-bit). This defeats the optimization and can prevent the backend from using efficient 32-bit .shared:: addressing modes.
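The gate can be pictured as a simple candidate filter. This is a hypothetical Python sketch: the function and field names are invented, and only the skip-addrspace(3) behavior is taken from the analysis above.

```python
# Hypothetical sketch of the disable-lsr-for-sharedmem32-ptr gate.
# Names are invented; the actual solver operates on the binary record
# layouts documented later on this page.

SHARED_ADDRSPACE = 3  # NVPTX shared memory, 32-bit addressing

def keep_formula(formula, disable_sharedmem32_lsr):
    """Return True if the solver may consider this candidate formula."""
    if disable_sharedmem32_lsr and formula["addrspace"] == SHARED_ADDRSPACE:
        # Skipping avoids widening a 32-bit shared pointer into 64-bit
        # intermediates, which would block .shared:: addressing modes.
        return False
    return True

candidates = [
    {"addrspace": 0, "cost": 3},   # generic pointer
    {"addrspace": 3, "cost": 2},   # shared-memory pointer
]
kept = [f for f in candidates if keep_formula(f, disable_sharedmem32_lsr=True)]
```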
64-Bit IV Control
The do-lsr-64-bit knob controls whether LSR generates formulae using 64-bit induction variables. The architecture-dependent default reflects hardware reality:
- sm_30 through sm_52: 64-bit integer operations are emulated (two 32-bit ops + carry). A 64-bit IV costs roughly 2x the register pressure and 2x the instruction cost. LSR is disabled for 64-bit IVs.
- sm_60 through sm_62: Partial native 64-bit support for address computation.
- sm_70 and above: Full native 64-bit addressing and arithmetic. 64-bit IVs become acceptable.
Phase 3 (stride factor expansion) checks the bit width of the representative SCEV (sub_1456C90 must return at most 64). Phase 7's bit budget check ensures replacement formulae fit within the available register width. Together, these prevent 64-bit IV generation on architectures where it is disabled.
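The architecture split above can be sketched as a tiny default-policy predicate. The real decision logic in the binary has not been recovered; in particular, whether the sm_6x range defaults on or off is an assumption here (taken conservatively as off, matching its "partial support" description).

```python
# Illustrative default policy for do-lsr-64-bit, following the sm ranges
# described above. The sm_6x default is an assumption, not recovered fact.

def default_do_lsr_64bit(sm: int) -> bool:
    """True when 64-bit IV formulae are worth considering by default."""
    if sm < 60:
        return False   # sm_3x-5x: 64-bit ops emulated, ~2x cost
    if sm < 70:
        return False   # sm_6x: partial native support (assumed off)
    return True        # sm_70+: full native 64-bit addressing
```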
Sign-Extension Optimization
When lsr-sxtopt is enabled, the solver actively seeks to fold sign-extension operations into IV expressions. On NVPTX, this is important because:
- PTX uses typed registers. A sext i32 %x to i64 creates a new 64-bit value occupying a separate register pair.
- If LSR can express the IV in a narrower type from the start, the sign-extension becomes dead code.
- When count-sxt-opt-for-reg-pressure is also enabled, the cost model credits this saving.
The sign-extension check appears in Phase 7's width-fit verification. After constructing a replacement formula, the solver computes the value range using APInt arithmetic and checks whether abs(distance) < value_range. If the replacement fits, the sign-extension can be eliminated. An additional sign-bit check (line 2545) rejects replacements where the sign bit of the result matches the sign of the distance -- this would cause the formula to wrap, producing incorrect values.
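The width-fit predicate reduces to a range comparison. A minimal Python sketch, emulating the APInt range reasoning in plain integers (the function name is invented; only the abs(distance) < value_range test comes from the analysis above):

```python
# Sketch of Phase 7's width-fit test. Only the abs(distance) < value_range
# predicate is from the recovered code; the name is a hypothetical stand-in.

def fits_without_sext(distance: int, iv_bits: int) -> bool:
    """True if a replacement IV of iv_bits width can absorb `distance`
    without requiring a sign-extension to a wider type."""
    value_range = 1 << (iv_bits - 1)   # magnitude representable in signed iv_bits
    return abs(distance) < value_range
```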
Complexity Discount Heuristic
When disable-lsr-complexity-discount is false (the default), the cost model applies a discount to formulae that reduce addressing complexity, even if they use more registers. "Addressing complexity" here means the number of operations required to compute the effective address for a memory operation.
Consider two formulae for a memory access inside a loop:
- Formula A: base + 4*i -- one multiplication, one addition. Requires a scaled index register.
- Formula B: ptr += 4 each iteration -- one addition per iteration, no multiplication. Requires one increment register.
Formula B is "simpler" in addressing complexity but might use one more register (the incrementing pointer) alongside the existing base. The complexity discount gives Formula B a bonus in the cost model, reflecting the GPU reality that address computation instructions compete with arithmetic instructions for issue slots, while an extra register has low cost when the kernel is not at an occupancy cliff.
When the discount is disabled (the knob is set to true), the cost model falls back to strict register-count comparison, similar to upstream LLVM behavior.
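The effect of the discount on the Formula A vs B comparison can be sketched numerically. The weight below is an invented illustrative constant; the recovered code only establishes that fewer address-computation ops earn a bonus that can outweigh one extra register.

```python
# Sketch of the complexity-discount comparison (lower score wins).
# addr_weight is invented; only the shape of the tradeoff is from the text.

def formula_score(n_regs, n_addr_ops, discount_active, addr_weight=1.5):
    if discount_active:
        # Address complexity counts: penalize per-iteration address ops.
        return n_regs + addr_weight * n_addr_ops
    return n_regs  # strict register-count comparison, upstream-like

# Formula A: base + 4*i -> 1 register, 2 address ops (mul + add)
# Formula B: ptr += 4   -> 2 registers, 1 address op
a_with, b_with = formula_score(1, 2, True), formula_score(2, 1, True)
a_without, b_without = formula_score(1, 2, False), formula_score(2, 1, False)
```

With the discount active B wins (fewer address ops per iteration); with it disabled A wins (fewer registers), matching the described fallback behavior.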
Comparison: NVIDIA LSR vs Upstream LLVM LSR
| Aspect | Upstream LLVM LSR | NVIDIA Custom LSR |
|---|---|---|
| Code size | ~180KB compiled (500+ helpers, 4 mega-functions) | ~160KB compiled (30 functions, main solver 83KB) |
| Binary location | 0x284F650--0x287C150 | 0x199A--0x19BF overlay |
| Cost model | 8-field tuple: {Insns, NumRegs, AddRecCost, NumIVMuls, NumBaseAdds, ImmCost, SetupCost, ScaleCost}. Compared via TTI::isLSRCostLess. | Register-pressure-aware with occupancy ceiling. Median-cost heuristic. Complexity discount. Sign-extension credit. |
| Formula selection | Always picks cheapest formula per cost tuple ordering | Median-cost heuristic: picks near cost midpoint to balance instructions vs registers |
| Register pressure | Counted but not capped. No occupancy awareness | Hard-gated: lsr-check-rp + lsr-rp-limit reject formulae that exceed RP ceiling |
| Address spaces | Single flat address space assumed | Full address-space tracking. Shared memory (addrspace 3) gets special 32-bit protection |
| 64-bit IVs | Always considered if legal | Gated by do-lsr-64-bit with architecture-dependent defaults |
| Sign-extension | Not a first-class concern | Dedicated optimization path with RP credit (lsr-sxtopt, count-sxt-opt-for-reg-pressure) |
| Loop scope | All loops | Filterable by nesting depth (lsr-loop-level) and outer-loop exclusion (lsr-skip-outer-loop) |
| Trip count requirement | Attempts all loops | Can skip unknown-trip loops (disable-unknown-trip-lsr) |
| Hash table | DenseSet<SmallVector<SCEV*>> for uniquification | Custom 7-QWORD-per-entry hash table with quadratic probing, tombstones, 75% load factor resize, linked-list formula chains, and use-count bitmaps |
| Formula phases | Single-pass candidate generation followed by cost-based pruning | 7 sequential phases: initial setup, expression folding, stride expansion, chain generation, reuse matching, hash table construction, final selection |
| SCEV infrastructure | Native | Reused from LLVM (shared SCEV, IV rewriting, chain construction) |
| Tuning knobs | 7 cl::opt knobs (general-purpose: lsr-insns-cost, lsr-filter-same-scaled-reg, lsr-complexity-limit, etc.) | 11 GPU-specific knobs (register pressure, address space, loop scope, cost model) |
What NVIDIA Reuses From Upstream
The NVIDIA overlay does not replace everything. It reuses:
- SCEV infrastructure (0xDB--0xDF range): ScalarEvolution analysis, AddRecExpr construction, range analysis, and trip count computation.
- IV rewriting (sub_1997F10): creates the replacement IV values with the naming convention "IV.S." and "IV.S.next.".
- Chain construction (sub_199EAC0): builds IV chains with the "lsr.chain" naming prefix.
- Formula cost model base (sub_1995010): the underlying cost computation, which NVIDIA then wraps with RP checking and sign-extension credit.
- Terminator folding (sub_287C150): the "lsr_fold_term_cond" transform that folds loop exit comparisons.
What NVIDIA Replaces
- Formula generation (Phases 1--5): entirely custom, with address-space awareness, stride factor expansion, and reuse chain matching with RP validation.
- Formula-to-use mapping (Phase 6): custom hash tables replacing LLVM's DenseSet-based uniquification with a design optimized for linked-list traversal and median-cost computation.
- Final selection (Phase 7): custom selection with width-fit checks, sign-extension validation, and cross-IV replacement -- none of which exist in upstream LLVM.
Key Helper Function Map
For reimplementation reference, the critical helpers and their roles:
| Address | Function | Role |
|---|---|---|
| sub_19A87A0 | Main 7-phase solver | Entry point (83KB, 2688 lines) |
| sub_19CE990 | NVLoopStrengthReduce::run() | Pass wrapper |
| sub_1995490 | Formula legality validator | TTI + SCEV + loop constraint check |
| sub_19955B0 | Register pressure check | Compares projected RP vs limit |
| sub_19932F0 | Address space tagger | Sets addrspace on formula |
| sub_19A1660 | Formula commit | Sorts, deduplicates, inserts into candidate set |
| sub_19A22F0 | Per-register formula gen (Phase 1) | Loops sub_19A1B20 per operand |
| sub_19A2680 | Unfolded-offset formula gen (Phase 2a) | Offset-to-base transform |
| sub_19A2820 | Loop-bound-factored formula gen (Phase 2b) | Stride factoring |
| sub_19A82C0 | Hash table resize | Power-of-two bucket growth |
| sub_199D980 | SCEV normalization | Canonical form for hashing |
| sub_1998630 | Use-count bitmap merge | Inline bitmap + heap fallback |
| sub_1456040 | SCEV getStart() | Extract base from AddRecExpr |
| sub_1456C90 | SCEV getBitWidth() | Register width determination |
| sub_1456E10 | SCEV extent computation | Value range of IV |
| sub_145CF80 | SCEV getMulExpr() | Multiply SCEV by stride factor |
| sub_147BE70 | SCEV rebase | Rewrite base in AddRecExpr |
| sub_147DD40 | AddRecExpr constructor | Build replacement IV chain |
| sub_15A0680 | SCEV getAddExpr() | Add constant offset |
| sub_146F1B0 | SCEV fold/normalize | Simplify expression |
Data Structure Reference
Use Record (96 bytes)
+0 [8] base_scev : SCEV* (NULL for pure-IV uses)
+8 [8] stride_scev : SCEV* (loop stride expression)
+16 [1] flags : bit 0 = is_address, bit 1 = has_offset
+24 [8] formula_kind : 0 = normal, 1 = with-offset, 3 = immediate-only
+32 [8] scaled_regs_ptr : pointer to SmallVector<SCEV*>
+40 [4] scaled_regs_cnt : number of scaled register operands
+48 [32] padding / alignment / additional fields
+80 [8] offset_scev : SCEV* (offset expression)
+88 [8] secondary_imm : secondary immediate value
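As a working aid, the use record layout above can be decoded from raw bytes. This is a hypothetical helper for tooling around the RE effort: the offsets come from the table, but treating pointer fields as little-endian u64s is an assumption about the x86-64 binary.

```python
import struct

# Hypothetical decoder for the 96-byte LSR use record described above.
# Offsets are from the layout table; endianness/width are assumptions.

def parse_use_record(buf: bytes) -> dict:
    assert len(buf) >= 96
    u64 = lambda off: struct.unpack_from("<Q", buf, off)[0]
    flags = buf[16]
    return {
        "base_scev":       u64(0),    # NULL for pure-IV uses
        "stride_scev":     u64(8),
        "is_address":      bool(flags & 0b01),
        "has_offset":      bool(flags & 0b10),
        "formula_kind":    u64(24),   # 0 normal, 1 with-offset, 3 imm-only
        "scaled_regs_ptr": u64(32),
        "scaled_regs_cnt": struct.unpack_from("<I", buf, 40)[0],
        "offset_scev":     u64(80),
        "secondary_imm":   u64(88),
    }

raw = bytearray(96)
raw[16] = 0b11                       # is_address | has_offset
struct.pack_into("<Q", raw, 24, 1)   # formula_kind = with-offset
rec = parse_use_record(bytes(raw))
```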
Loop Record (1984 bytes)
+32 [4] use_type : 0 = generic, 1 = address-check, 3 = immediate
+40 [8] addr_space : address space identifier
+48 [4] alignment : alignment constraint (bytes)
+712 [8] loop_start : SCEV* (loop start bound)
+720 [8] loop_end : SCEV* (loop end bound)
+728 [1] as_aware_flag : address-space-aware LSR active
+729 [1] dead_guard_flag : if set && use_count > 0, skip loop
+744 [8] use_array_ptr : pointer to array of use records
+752 [4] use_count : number of uses in this loop
Reimplementation Notes
- Start with the knob infrastructure. Register all 11 cl::opt knobs before anything else. The pass wrapper (sub_19CE990) reads these early and uses them to gate entire phases.
- The RP tracker must exist before the solver runs. The register pressure estimate at a1+32128 is computed by an earlier pass (likely during loop analysis). The NVIDIA LSR does not compute RP itself -- it only reads and compares.
- The hash function is deterministic. (val >> 9) ^ (val >> 4), masked to the bucket count. Quadratic probing with tombstone support. If you are reimplementing the hash tables, use the same scheme or your formula deduplication will differ.
- The median-cost heuristic is the secret sauce. Upstream LLVM always picks the cheapest formula. NVIDIA picks near the median. This single difference is responsible for most of the occupancy improvements. If you must simplify, keep this heuristic.
- The overflow checks in Phase 3 are load-bearing. The S * stride / S == stride check and the INT64_MIN guard prevent generating formulae with wrapped arithmetic. Removing these checks will produce silently wrong code on kernels with large strides.
- Address space tagging (sub_19932F0) must happen before commit. Every formula committed via sub_19A1660 must carry the correct address space tag. Forgetting this will produce PTX that uses generic loads/stores where shared-memory instructions are required, breaking both performance and correctness.
- The use-count bitmap has two representations. Inline (when value & 1) and heap-allocated. The inline form is fast but limited to small register ID ranges. The heap form uses a BitVector with the size at +16. Both must be supported.
- Phase ordering is strict. The 7 phases must run in order. Later phases depend on candidates generated by earlier ones, and the hash tables in Phase 6 assume all candidates have been generated by Phases 1--5.
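The hash-table scheme described in the notes above can be modeled compactly. A minimal sketch, assuming the recovered parameters (hash (val >> 9) ^ (val >> 4) masked to a power-of-two bucket count, quadratic probing, tombstones, resize at 75% load); the 7-QWORD entry payload and linked-list formula chains are omitted, so this models only the probing discipline:

```python
# Sketch of the solver's formula hash table probing scheme.
# Only the hash, probe sequence, tombstones, and load factor are from
# the analysis; the class shape is an invented stand-in.

EMPTY, TOMBSTONE = object(), object()

class FormulaTable:
    def __init__(self, buckets=16):
        self.slots = [EMPTY] * buckets     # power-of-two bucket count
        self.count = 0

    def _hash(self, val):
        return ((val >> 9) ^ (val >> 4)) & (len(self.slots) - 1)

    def insert(self, val):
        if (self.count + 1) * 4 > len(self.slots) * 3:   # 75% load factor
            self._resize()
        i, step = self._hash(val), 1
        while self.slots[i] not in (EMPTY, TOMBSTONE):
            if self.slots[i] == val:
                return False               # duplicate formula, deduplicated
            i = (i + step) & (len(self.slots) - 1)
            step += 1                      # triangular/quadratic probe
        self.slots[i] = val
        self.count += 1
        return True

    def contains(self, val):
        i, step = self._hash(val), 1
        while self.slots[i] is not EMPTY:  # tombstones do not stop the probe
            if self.slots[i] == val:
                return True
            i = (i + step) & (len(self.slots) - 1)
            step += 1
        return False

    def _resize(self):
        live = [s for s in self.slots if s not in (EMPTY, TOMBSTONE)]
        self.slots = [EMPTY] * (len(self.slots) * 2)   # power-of-two growth
        self.count = 0
        for v in live:
            self.insert(v)
```

Triangular probing (step offsets 1, 2, 3, ...) visits every slot of a power-of-two table, which is why the power-of-two growth in sub_19A82C0 matters.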
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Formula solver | Single LLVM LoopStrengthReduce with TTI-based cost model | Two implementations: stock LLVM LSR + custom 160 KB NVIDIA formula solver (sub_19A87A0, 2688 lines) that replaces formula generation/selection |
| Cost model | 8-field cost tuple ({Insns, NumRegs, AddRecCost, ...}), no occupancy concept | Occupancy-aware cost: register count evaluated against discrete warp occupancy cliffs where +1 register can halve throughput |
| Address space awareness | No address space semantics in formula selection | Address space tagging (sub_19932F0) ensures formulae preserve shared memory (addrspace 3) 32-bit pointer width; prevents strength-reducing 32-bit pointers into 64-bit generic form |
| Knob count | ~7 general-purpose cl::opt knobs for cost tuning | 11 GPU-specific knobs for fine-grained control (register pressure, address space, loop scope, cost model; e.g. disable-lsr-for-sharedmem32-ptr) |
| Algorithm structure | Monolithic formula generator + greedy selector | 7-phase formula solver pipeline: candidate generation, stride-based filtering, use-group analysis, formula selection, commit, rewrite |
| State object | Modest state for formula tracking | 32,160-byte state object with embedded register pressure tracker, formula hash table, and per-use-group formula arrays |
| Typed register cost | All registers weigh the same | 64-bit IVs cost two 32-bit register slots; emulated on sm_3x--5x; native on sm_70+ but still double the pressure |
StructurizeCFG
Prerequisites: Familiarity with GPU execution model (warp divergence, reconvergence), LLVM dominator tree and post-dominator tree concepts, and the PTX emission pipeline. Understanding of reducible vs. irreducible control flow is assumed.
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/Transforms/Scalar/StructurizeCFG.cpp (LLVM 20.0.0). The upstream version was originally written for AMDGPU; cicc ships both the stock AMDGPU copy at sub_1F0EBC0 and a separate NVPTX-customized copy at sub_35CC920.
CICC v13.0 ships two copies of the StructurizeCFG pass: an NVPTX-specific version at sub_35CC920 (95 KB, 2,397 decompiled lines) and a stock LLVM/AMDGPU version at sub_1F0EBC0. Both exist because the binary links both the NVPTX backend and the generic LLVM Scalar library; only the NVPTX instance is scheduled in the CUDA compilation pipeline. This page documents the NVPTX version exclusively.
The pass is mandatory for PTX emission. It is registered as "structurizecfg" in the pipeline parser (sub_2377300, sub_233F860) and listed as a required late pass by sub_29882C0 and sub_1A6D600.
Why PTX Requires Structured Control Flow
PTX is a structured instruction set. Unlike x86 or ARM, where a branch can target any address and the hardware resolves control flow at retirement, the NVIDIA GPU execution model imposes three hard constraints:
-
Reconvergence at post-dominators. When a warp diverges (threads take different sides of a branch), the hardware needs a defined reconvergence point where all threads synchronize before continuing. This reconvergence point must be the immediate post-dominator of the branch. An unstructured CFG has no guarantee that such a point exists or is reachable from both sides.
-
No multi-entry loops. A loop header must dominate every block in the loop body. If two distinct blocks serve as loop entries (an irreducible cycle), the hardware has no single point to insert the loop counter logic and the warp-level loop exit barrier. PTX therefore requires all loops to be natural (single-entry, reducible).
-
No exception handling funclets. CUDA device code has no runtime support for stack unwinding, personality routines, or catch dispatch. The funclet-based EH model (Windows SEH, C++ landing pads) produces control flow patterns that cannot be expressed in PTX.
The StructurizeCFG pass converts reducible-but-unstructured flow into structured form by inserting "Flow" blocks that serve as explicit reconvergence points. It rejects irreducible flow and EH funclets with diagnostic remarks rather than attempting to restructure them.
Binary Layout
| Function | Address | Size | Role |
|---|---|---|---|
| sub_35CC920 | 0x35CC920 | 95 KB | Main pass body |
| sub_35CF930 | 0x35CF930 | ~2 KB | Entry gate / dispatch wrapper |
| sub_35CA2C0 | 0x35CA2C0 | ~4 KB | Irreducibility detector |
| sub_35CB4A0 | 0x35CB4A0 | ~8 KB | Uniform branch classifier |
| sub_35CBCD0 | 0x35CBCD0 | ~6 KB | Region structurizer core |
| sub_35CA580 | 0x35CA580 | ~1 KB | Diagnostic emitter |
| sub_35CA9C0 | 0x35CA9C0 | ~1 KB | Hash-set insert for BB tracking |
| sub_35C9CD0 | 0x35C9CD0 | ~2 KB | Edge reroute through new block |
| sub_35C9ED0 | 0x35C9ED0 | ~1 KB | Domtree NCA (nearest common ancestor) walk |
| sub_35C9B40 | 0x35C9B40 | trivial | Successor array offset (return a1 + 8*a3) |
Entry Gate: sub_35CF930
sub_35CF930 is the runOnFunction entry. It implements a multi-stage filter before committing to the expensive structurization:
sub_35CF930(pass, function):
// 1. Early-out for trivially uninteresting functions
if sub_BB98D0(pass, function) fails:
return 0
// 2. Single-block functions need no structurization
bb_list = function + 40
if bb_list points to itself (single block):
return 0
// 3. Query target machine for a structurizer strategy object
strategy = target_machine->vtable[136](...)
// 4. Check enable-shrink-wrap override
switch qword_50400C8:
case 1: goto force_structurize // always run
case 2: return 0 // always skip
case 0: // ask strategy object
if not strategy->vtable[72](function):
return 0 // strategy says skip
// 5. Check function attributes for safe-to-skip markers
for attr_id in [56, 63, 59, 64, 57]:
if sub_B2D610(function, attr_id):
return 0
// 6. Run the actual structurizer
force_structurize:
return sub_35CC920(pass, function)
The attribute IDs likely map to: 56 = convergent, 63 = nodivergencesource, 59 = nounwind, 64 = alwaysinline, 57 = optnone. [MEDIUM confidence] These numeric-to-name associations are inferred from LLVM attribute enumeration ordering in the upstream source and the semantic context of their usage (skip-structurize guard), not from string evidence in the binary. The attribute enum may differ in NVIDIA's fork. Functions carrying any of these are either already guaranteed to have uniform control flow or are explicitly marked as not-to-be-optimized.
CLI Knobs
| Knob | Registration | Type | Default | Effect |
|---|---|---|---|---|
| structurizecfg-skip-uniform-regions | ctor_227 @ 0x4E9E40, ctor_489 @ 0x553F30 | bool | false | When true, regions with only uniform (warp-coherent) branches are left unstructured, avoiding unnecessary code bloat |
| structurizecfg-relaxed-uniform-regions | ctor_489 @ 0x553F30 | bool | true | Allows treating a region as uniform even if sub-regions contain non-uniform branches, provided there is at most one conditional direct child |
| enable-shrink-wrap (qword_50400C8) | ctor_688 @ 0x5A6520 | int (0/1/2) | 0 | 0 = ask TargetRegisterInfo (vtable+72) whether to structurize; 1 = force structurize unconditionally; 2 = skip structurize entirely |
The enable-shrink-wrap knob is stored as a global at qword_50400C8. Despite its name (borrowed from the generic LLVM shrink-wrapping pass infrastructure), it serves as a master override for the structurization decision. Mode 2 effectively disables the pass, which would produce miscompilation for any function with divergent branches -- it exists purely as a debugging/override mechanism.
Irreducibility Detection: sub_35CA2C0
Called early in sub_35CC920 (line ~743 of the decompiled output), this function determines whether the CFG contains irreducible cycles. It detects irreducibility but does not restructure it.
Algorithm
The function receives the RPO-ordered basic block list from the SCC decomposition phase and iterates backwards:
sub_35CA2C0(result, domtree_data, bb_list, bb_count):
for each BB in reverse(bb_list):
for each successor S of BB:
// Probe dominator tree hash table
// Hash: ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1)
dom_node = lookup(domtree_data, S)
// If S does NOT dominate BB, but there is a back-edge
// from BB to S, this is an irreducible cycle
if back_edge(BB, S) and not dominates(S, BB):
return 1 // irreducible
return 0 // reducible
The core invariant: in a reducible CFG, every back-edge target dominates its source. If a back-edge exists where the target does not dominate the source, the loop has multiple entries and is irreducible.
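The invariant can be checked directly on a small CFG model. The sketch below replaces the binary's dominator-tree hash-table probe with plain set lookups and uses a textbook iterative dominator computation; it is illustrative, not a transcription of sub_35CA2C0.

```python
# Sketch of the reducibility invariant: every back-edge target must
# dominate its source. CFG is a dict: node -> list of successors.

def dominators(succs, entry):
    """Map each node to the set of nodes that dominate it (iterative dataflow)."""
    nodes = set(succs)
    preds = {n: set() for n in nodes}
    for n, ss in succs.items():
        for s in ss:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = ({n} | set.intersection(*(dom[p] for p in preds[n]))
                   if preds[n] else {n})
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def is_reducible(succs, entry):
    dom = dominators(succs, entry)
    state = {entry: "active"}            # DFS colors: active = on stack
    stack = [(entry, iter(succs[entry]))]
    while stack:
        node, it = stack[-1]
        s = next(it, None)
        if s is None:
            state[node] = "done"
            stack.pop()
        elif state.get(s) == "active":   # back-edge node -> s
            if s not in dom[node]:       # target doesn't dominate source
                return False             # multiple-entry cycle: irreducible
        elif s not in state:
            state[s] = "active"
            stack.append((s, iter(succs[s])))
    return True
```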
Rejection behavior
When sub_35CA2C0 returns 1 (irreducible detected), the main pass emits:
remark: UnsupportedIrreducibleCFG
"Irreducible CFGs are not supported yet."
via sub_35CA580 and returns without modifying the function. The return value is forced to 0 (no modification made).
This is a critical design choice. LLVM upstream provides a separate FixIrreduciblePass (sub_29D33E0, registered as "fix-irreducible") that performs node-splitting to convert irreducible cycles into reducible ones. However, the NVPTX pipeline in CICC v13.0 does not schedule FixIrreduciblePass before StructurizeCFG. The assumption is that well-formed CUDA C++ source never produces irreducible flow. If it does (extreme goto abuse, or a prior optimization pass introducing an irreducible pattern), the compilation emits the diagnostic and the resulting PTX will likely be rejected by ptxas.
EH Funclet Rejection
During the per-block iteration in the main loop, each basic block is checked for funclet status at offset BB+235 (a boolean flag indicating the block is a catchpad, cleanuppad, or catchret target):
if BB->isEHFunclet(): // *(BB + 235) != 0
emit_diagnostic("UnsupportedEHFunclets",
"EH Funclets are not supported yet.")
clear visited bitvector
bail out
The funclet model (Windows x64, ARM64) structures exception handling into mini-functions that require personality routines and unwind tables. None of this exists in the GPU runtime. If a funclet block appears, it means the frontend erroneously lowered exception handling into device code.
After emitting the diagnostic, the pass checks qword_503FFE8 (a global flag, possibly a debug override). If nonzero, it attempts to find a single-entry point and process the rest of the function; if zero, it bails out entirely.
Uniform Branch Classification: sub_35CB4A0
This function (~500 decompiled lines) classifies whether a branch instruction is warp-uniform (all threads in the warp take the same direction) or divergent. The classification determines whether the region under that branch needs structurization.
Classification logic
sub_35CB4A0(pass_state, BB, ...):
terminator_opcode = BB->opcode_category // BB + 68, unsigned short
// Non-conditional terminators (ret, unreachable, switch) skip analysis
if (terminator_opcode - 1) > 1:
return 0 // not a conditional branch, no structurization needed
// Check function-level flags
func_flags = BB->parent->flags // BB + 32 + 64
// bit 3 (0x08) = hasConvergentCalls
// bit 4 (0x10) = hasDivergentBranches
// Check block-level properties
block_flags = BB->properties // BB + 44
// bit 2 (0x04) = already classified
// bit 3 (0x08) = uses profile data
// Query DivergenceAnalysis
uniformity = sub_2E88A90(divergence_info, BB, mask_bits)
// mask_bits: 0x80000 = uniform, 0x100000 = divergent, 0x80 = other
// Additional uniformity check
is_uniform = sub_2E8B090(divergence_info, BB)
if is_uniform and skip_uniform_regions_enabled:
return 0 // uniform, can skip structurization
return 1 // divergent, needs structurization
When the structurizecfg-skip-uniform-regions knob is active, regions with all-uniform branches are left unmodified. This is sound because uniform branches do not cause warp divergence and therefore do not require explicit reconvergence points. Skipping these regions reduces code bloat from the insertion of unnecessary Flow blocks.
The structurizecfg-relaxed-uniform-regions knob relaxes the uniformity check for sub-regions. In upstream LLVM, hasOnlyUniformBranches refuses to treat a region as uniform if any sub-region contains a non-uniform branch. The relaxed mode allows this if there is at most one conditional direct child, under the reasoning that a single divergent sub-region can be handled by an inner structurization pass invocation.
Region Structurizer Core: sub_35CBCD0
This is the heart of the transformation. When a non-uniform, non-EH block is identified, sub_35CBCD0 processes its region:
sub_35CBCD0(pass_state, BB, context):
// 1. Manage region boundaries
head = pass_state[67] // current region head
tail = pass_state[68] // current region tail
// 2. Iterate successors
for each successor S of BB (via sub_2E313E0):
// 3. Check uniformity of successor edge
if sub_35CB4A0(pass_state, S, ...) returns 0:
continue // uniform edge, skip
// 4. Compute reconvergence point via NCA
nca = sub_35C9ED0(domtree, BB, S)
// NCA = nearest common ancestor in dominator tree
// This is where threads from both sides of the branch
// must reconverge
// 5. Update region boundaries
pass_state[67] = update_head(head, nca)
pass_state[68] = update_tail(tail, nca)
// 6. Update visited-BB bitvector
set_bit(pass_state[91], BB->ordinal)
The NCA computation (sub_35C9ED0) walks the dominator tree upward from both the current block and its successor until finding their nearest common ancestor. This NCA becomes the reconvergence point: the block where the hardware must synchronize all threads before continuing.
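The upward walk is a standard depth-guided two-pointer climb. A minimal sketch, with `parent` and `depth` maps standing in for the binary's domtree node fields (those field names are assumptions):

```python
# Sketch of the sub_35C9ED0-style NCA walk: step the deeper node toward
# the root until both walks meet. parent/depth are illustrative stand-ins.

def domtree_nca(parent, depth, a, b):
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]   # climb the deeper (or equal-depth) node
        else:
            b = parent[b]
    return a                # reconvergence point for the divergent branch

# Dominator tree for a simple if-then-else: entry -> cond -> {then, else}
parent = {"entry": None, "cond": "entry", "then": "cond", "else": "cond"}
depth  = {"entry": 0, "cond": 1, "then": 2, "else": 2}
```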
Main Structurization Loop: sub_35CC920
The main pass body executes in four phases.
Phase 1: Initialization (lines 433-648)
// Store analysis results in pass object fields
pass[65] = DivergenceAnalysis + 200
pass[66] = LoopInfo + 200
pass[67] = 0 // current head
pass[68] = 0 // current tail
pass[69] = DomTree + 200
pass[70] = PostDomTree + 200
pass[71] = loop_depth_info
// Compute RPO (reverse post-order)
rpo = sub_2EA7130() -> sub_2EA7B20()
// Build SCC ordering (cross-references RPO with SCC decomposition)
scc_order = sub_357E170(rpo)
// Check for irreducible cycles
if sub_35CA2C0(scc_order, domtree, ...):
emit "UnsupportedIrreducibleCFG"
return 0
Phase 2: Per-block classification (lines 816-2253)
Iterates blocks in reverse RPO order (bottom-to-top):
for each BB in reverse_rpo(scc_order):
// (a) Reject EH funclets
if BB->isEHFunclet:
emit "UnsupportedEHFunclets"
clear bitvector, bail out
// (b) Already marked for structurization
if BB->structurize_flag (BB+216) or BB->flag_262 (BB+262):
sub_35CBCD0(pass, BB, ...) // structurize this region
continue
// (c) Check successors for back-edges to visited blocks
has_loop = false
for each successor S of BB:
if bitvector_test(S->ordinal):
has_loop = true // back-edge detected = loop header
// (d) Classify uniformity of predecessors
needs_structurize = false
for each predecessor P of BB:
if sub_35CB4A0(pass, P, ...):
needs_structurize = true
break
// (e) Apply structurization
if needs_structurize:
sub_35CBCD0(pass, BB, ...)
// (f) Update bitvector
bitvector_set_or_clear(BB->ordinal, needs_structurize)
Phase 3: Domtree-guided reconvergence (lines 2255-2396)
After the per-block loop, if a split point was identified (pass[67] != 0 and pass[68] != 0):
// Walk domtree from split point upward
current = split_point
while current != null:
// Query strategy object for split decisions
if strategy->shouldSplit(current): // vtable+312
sub_35CBCD0(pass, current, ...)
if strategy->shouldSplitChild(current): // vtable+320
// second round for child regions
...
current = domtree_parent(current)
// Store results in function metadata for PTX emission
function_obj[672] = head // reconvergence head
function_obj[680] = tail // reconvergence tail
These stored head/tail values are read by subsequent PTX emission passes to emit the correct convergence/reconvergence annotations in the output PTX.
Phase 4: Cleanup (lines 2383-2396)
Frees the helper object allocated at line 771 (0xA8 bytes), the SCC ordering buffer, and returns the modification flag (0 = no changes, 1 = modified).
Reconvergence Insertion Path
When a non-uniform divergent region is identified between a head block and a tail block, the pass performs the actual CFG transformation:
Step 1: Dominance validation
// Head must dominate tail
if not sub_2E6D360(domtree, head, tail):
skip // invalid region, cannot structurize
// Tail must post-dominate head
if not sub_2EB3EB0(postdomtree, tail, head):
skip
Step 2: Edge classification
Collect successors of the tail into two sets:
- External edges: successors pointing outside the region (into v395/v396)
- Internal edges: successors pointing back inside the region (into v404/v405)
The strategy object (vtable+344) classifies each edge to determine if restructuring is needed.
Step 3: Flow block creation
// Create new "Flow" basic block
new_block = sub_2E7AAE0(function, 0, ...) // BasicBlock::Create
sub_2E33BD0(new_block, insert_point) // insert into BB list
// Copy phi-node entries from original target
for each phi in original_target:
sub_2E33140(phi, ...) // copy incoming value
sub_2E341F0(phi, ...) // update predecessor
Step 4: Edge rerouting
// Reroute edges from old target to new Flow block
sub_2E337A0(old_target, new_block) // replaceAllUsesWith
sub_2E33F80(new_block) // finalize successors
// For each stale edge, update divergence info
for each stale_edge:
sub_35C9CD0(stale_edge, ...)
strategy->updateDivergence(...) // vtable+368
Step 5: Recursive child splitting
If the strategy's shouldSplitChild (vtable+320) returns true, the newly created Flow block itself may need further splitting. This creates another block, reroutes edges again, and recurses. This handles deeply nested divergent regions where a single Flow block is insufficient.
Before/After CFG Example
Consider a function with a divergent if-then-else:
Before structurization:
Entry
/ \
Then Else
\ /
Merge
|
Exit
If the branch at Entry is divergent (some threads go to Then, others to Else), the hardware needs an explicit reconvergence point. After structurization:
After structurization:
Entry
/ T
| \
| Then
| /
Flow1 <- new block: reconvergence for Then
| F \
| Else
| /
Flow2 <- new block: reconvergence for Else
|
Merge
|
Exit
The Flow1 and Flow2 blocks are inserted with conditional branches controlled by PHI networks. Flow1 is the reconvergence point after the Then path: threads that executed Then and threads that skipped it meet there. Flow1 then branches on a PHI condition -- threads that must execute Else proceed to it, while the rest fall through directly to Flow2, the final reconvergence point before Merge.
For a divergent loop:
Before:
Entry
|
Header <--+
/ \ |
Body | |
\ / |
Latch -----+
|
Exit
After:
Entry
|
Header <------+
| |
Body |
| |
FlowLoop |
/ (back) \ |
| +----+
| (exit)
Exit
FlowLoop is a new block whose branch condition is a PHI: true incoming from Body means exit the loop, false means take the back-edge. This inverted convention (true = break, false = continue) matches upstream LLVM's structurization invariant.
Flow Block Insertion Algorithm
The previous sections describe the pass at the function-dispatch level. This section provides the complete algorithmic detail of how Flow blocks are actually created, wired, and how PHI networks are maintained -- the core transformation that converts a reducible-but-unstructured CFG into a fully structured CFG suitable for PTX emission.
Complexity
Let B = number of basic blocks, E = number of CFG edges, and D = depth of the dominator tree.
- Irreducibility detection (sub_35CA2C0): O(B * E) -- for each block in reverse RPO, it probes successors against the dominator tree hash table (O(1) per probe).
- Per-block classification loop: O(B * (P_avg + S_avg)), where P_avg and S_avg are average predecessor and successor counts -- effectively O(B + E).
- Uniform branch classifier (sub_35CB4A0): O(1) per block (a few flag checks and one DivergenceAnalysis query).
- NCA computation (sub_35C9ED0): walks the domtree upward from two nodes until convergence -- O(D) per call.
- Flow block insertion: O(D + PHI_count) each, where PHI_count is the number of PHI nodes at the original merge point (each needs entry copying). Recursive child splitting adds at most O(B) new blocks total across the entire function.
- Bitvector tracking: O(B / 64) per test/set operation.
Overall: O(B * D + E + F * PHI_total), where F = number of Flow blocks created. Since F <= B (one Flow per divergent region) and D = O(B) in the worst case, the theoretical worst case is O(B^2 + E). In practice, CUDA CFGs are shallow (D < 20) and sparsely divergent, making the pass effectively O(B + E).
Conceptual Model
A "Flow block" is a synthetic basic block that serves as an explicit thread reconvergence point. In an unstructured CFG, divergent branches may merge at a common successor without any indication of which predecessor each thread arrived from. The hardware's reconvergence mechanism needs a single merge point where it can resume lockstep execution. Flow blocks provide this by:
- Interposing between the divergent region and its exit.
- Carrying a PHI node whose value encodes the path taken by each thread.
- Branching conditionally on that PHI to either enter the next region body or skip to the next Flow block.
The algorithm processes the function bottom-to-top (reverse RPO), which ensures that inner regions are structurized before outer ones. Each region is defined by a head (dominator) and tail (post-dominator). The output is a function where every conditional branch leads to at most one "then" block followed by a Flow block, guaranteeing single-entry single-exit regions.
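The bottom-to-top ordering can be sketched with a plain post-order DFS. This is a hypothetical stand-in for the RPO computation (sub_2EA7130/sub_2EA7B20); the real pass additionally cross-references the ordering with SCC decomposition.

```python
def reverse_post_order(cfg, entry):
    """Return blocks of a CFG ({block: [successors]}) in reverse
    post-order (RPO). Iterating the RPO list backwards yields the
    bottom-to-top order the structurizer uses: a block is processed
    only after everything reachable from it."""
    visited, post = set(), []

    def dfs(bb):
        visited.add(bb)
        for succ in cfg.get(bb, []):
            if succ not in visited:
                dfs(succ)
        post.append(bb)  # appended after all successors: post-order

    dfs(entry)
    return post[::-1]

# Diamond CFG: Entry -> {A, B} -> Merge
cfg = {"Entry": ["A", "B"], "A": ["Merge"], "B": ["Merge"], "Merge": []}
rpo = reverse_post_order(cfg, "Entry")
# Entry comes first and Merge last; reversed(rpo) is the bottom-up order
```

Processing `reversed(rpo)` guarantees that Merge is classified before A/B, which is exactly why inner regions are structurized before the regions that contain them.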
Top-Level Algorithm: sub_35CC920
This is the complete algorithm for the main pass body, including the Flow block insertion logic interleaved with the classification phases already described above.
sub_35CC920(pass, function):
// ---- Phase 1: Analysis setup ----
div_info = getAnalysis<DivergenceAnalysis>(function) + 200
loop_info = getAnalysis<LoopInfo>(function) + 200
dom_tree = getAnalysis<DominatorTree>(function) + 200
post_dom = getAnalysis<PostDominatorTree>(function) + 200
pass[65] = div_info
pass[66] = loop_info
pass[67] = NULL // region_head
pass[68] = NULL // region_tail
pass[69] = dom_tree
pass[70] = post_dom
// Compute RPO via sub_2EA7130 -> sub_2EA7B20
rpo_list = computeRPO(function)
// Cross-reference RPO with SCC decomposition (sub_357E170)
scc_order = buildSCCOrdering(rpo_list)
// ---- Phase 1b: Reject irreducible ----
if sub_35CA2C0(scc_order, dom_tree) == 1: // irreducible detected
sub_35CA580(pass, "UnsupportedIrreducibleCFG",
"Irreducible CFGs are not supported yet.")
return 0
// ---- Phase 1c: Initialize bitvector ----
bb_count = countBasicBlocks(function)
word_count = (bb_count + 63) >> 6
bitvector = allocate(word_count * 8)
memset(bitvector, 0, word_count * 8)
pass[91] = bitvector // at offset +728
// ---- Phase 2: Bottom-up region identification and Flow insertion ----
modified = false
order = reverse(scc_order) // process bottom-to-top
for each BB in order:
// 2a. Reject EH funclets
if *(BB + 235) != 0: // isEHFunclet flag
sub_35CA580(pass, "UnsupportedEHFunclets",
"EH Funclets are not supported yet.")
resetBitvector(pass)
return 0
// 2b. Already marked for structurization (from prior inner-region pass)
if *(BB + 216) != 0 or *(BB + 262) != 0:
sub_35CBCD0(pass, BB, context)
continue
// 2c. Detect back-edges to already-visited blocks (loop detection)
has_loop_backedge = false
for each successor S of BB:
if bitvectorTest(pass[91], S->ordinal):
has_loop_backedge = true
// 2d. Classify predecessors for divergence
needs_structurize = false
for each predecessor P of BB:
if sub_35CB4A0(pass, P, ...) == 1: // divergent branch
needs_structurize = true
break
// 2e. Structurize the region rooted at BB
if needs_structurize:
sub_35CBCD0(pass, BB, context) // collect region bounds
// If region bounds are valid, insert Flow blocks
head = pass[67]
tail = pass[68]
if head != NULL and tail != NULL:
modified |= insertFlowBlocks(pass, head, tail, function)
// 2f. Update bitvector
if needs_structurize:
bitvectorSet(pass[91], BB->ordinal)
else:
bitvectorClear(pass[91], BB->ordinal)
// ---- Phase 3: Domtree-guided outer-region finalization ----
if pass[67] != NULL and pass[68] != NULL:
current = pass[67] // split_point
while current != NULL:
if strategy->shouldSplit(current): // vtable+312
sub_35CBCD0(pass, current, context)
modified |= insertFlowBlocks(pass, pass[67], pass[68], function)
if strategy->shouldSplitChild(current): // vtable+320
// recurse into child regions
modified |= insertFlowBlocksForChildren(pass, current, function)
current = domtreeParent(dom_tree, current)
// Store reconvergence metadata for PTX emission
*(function + 672) = pass[67] // reconvergence head
*(function + 680) = pass[68] // reconvergence tail
// ---- Phase 4: Cleanup ----
free(scc_order)
free(bitvector)
return modified ? 1 : 0
Flow Block Insertion Detail: insertFlowBlocks
This function (inlined within the Phase 2/Phase 3 loops of sub_35CC920, approximately decompiled lines 980--2027) performs the actual CFG transformation for a single region.
insertFlowBlocks(pass, head, tail, function):
// Step 1: Validate region boundaries via dominator/post-dominator trees
if not dominates(pass[69], head, tail):
return false // head does not dominate tail => not a valid region
if not postDominates(pass[70], tail, head):
return false // tail does not post-dominate head => not a valid region
// Step 2: Classify edges leaving the tail block
external_edges = [] // edges pointing outside the region
internal_edges = [] // edges pointing back inside the region
for each successor S of tail:
if not dominatedBy(S, head) or S == tail:
external_edges.append((tail, S))
else:
internal_edges.append((tail, S))
// Step 3: Query strategy object for each edge
for each edge E in (external_edges + internal_edges):
classification = strategy->classifyEdge(E) // vtable+344
if classification == SKIP:
continue
// else: edge needs restructuring
// Step 4: Create the Flow block
// sub_2E7AAE0 = BasicBlock::Create(context, name_hint, function)
flow_bb = sub_2E7AAE0(function->getContext(), "Flow", function)
// sub_2E33BD0 = insert into function's BB list after tail
sub_2E33BD0(flow_bb, tail->getNextNode())
// Step 5: Build PHI node in the Flow block
// The PHI encodes "which path did threads arrive from?"
// Convention: true (i1 1) = came from the "then" body
// false (i1 0) = skipped the body (fell through)
phi = createPHINode(Type::i1, flow_bb)
phi.addIncoming(ConstantInt::getTrue(), body_block) // threads that executed body
phi.addIncoming(ConstantInt::getFalse(), head_block) // threads that skipped body
// Step 6: Create conditional branch in the Flow block
// Branch on PHI: true -> next_region_or_exit, false -> next_flow_or_exit
createCondBranch(flow_bb, phi, next_target_true, next_target_false)
// Step 7: Reroute original edges through the Flow block
// For each predecessor that previously branched to the original merge:
for each edge (P, original_merge) that should go through flow_bb:
// sub_2E337A0 = replaceAllUsesWith for the branch target
P->getTerminator()->replaceSuccessor(original_merge, flow_bb)
// Step 8: Copy PHI entries from original merge to Flow block
// If the original merge had PHI nodes, their incoming values from
// rerouted predecessors must be transferred.
for each phi_node in original_merge->phis():
value = phi_node->getIncomingValueForBlock(rerouted_pred)
// sub_2E33140 = addIncoming to new PHI at flow_bb
// sub_2E341F0 = removeIncomingValue from original PHI
flow_bb_phi.addIncoming(value, rerouted_pred)
phi_node.removeIncomingBlock(rerouted_pred)
phi_node.addIncoming(flow_bb_phi, flow_bb)
// Step 9: Update dominator tree
// The new Flow block is immediately dominated by head.
// It immediately dominates the original merge (if flow_bb is its
// only predecessor now).
dom_tree->addNewBlock(flow_bb, head)
// Step 10: Update divergence analysis
// sub_35C9CD0 = edge reroute handler
for each rerouted_edge:
sub_35C9CD0(pass, rerouted_edge)
strategy->updateDivergence(rerouted_edge) // vtable+368
// Step 11: Recursive child-split (if needed)
// The strategy may determine that the Flow block itself needs
// further splitting (deeply nested divergent regions).
if strategy->shouldSplitChild(flow_bb): // vtable+320
child_flow = sub_2E7AAE0(function->getContext(), "Flow", function)
sub_2E33BD0(child_flow, flow_bb->getNextNode())
// ... repeat Steps 5-10 for the child Flow block ...
// This recursion terminates when shouldSplitChild returns false.
// Step 12: Expand bitvector if function grew
new_bb_count = countBasicBlocks(function)
if new_bb_count > pass[bb_count_field]:
sub_C8D5F0(pass[91], new_bb_count) // SmallVector::grow
// Initialize new words to 0xFF...FF (conservatively "visited")
// Then clear trailing bits beyond actual block count
return true
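The edge-rerouting and PHI-transfer bookkeeping of Steps 7-8 can be modeled on a toy CFG representation. All names here are hypothetical; the binary manipulates llvm::PHINode objects through the sub_2E33140/sub_2E341F0 helpers rather than dictionaries.

```python
def reroute_through_flow(cfg, phis, merge, rerouted_preds, flow="Flow"):
    """Reroute each predecessor in rerouted_preds from `merge` to a new
    Flow block, moving its PHI incoming entries onto the Flow block.
    cfg:  {block: [successors]}
    phis: {merge_block: {incoming_pred: value}} -- one PHI per merge."""
    cfg[flow] = [merge]  # the Flow block falls through to the merge
    flow_phi = {}
    for pred in rerouted_preds:
        # Step 7: retarget the predecessor's branch at the Flow block
        cfg[pred] = [flow if s == merge else s for s in cfg[pred]]
        # Step 8: transfer this predecessor's incoming value to the Flow PHI
        flow_phi[pred] = phis[merge].pop(pred)
    # The merge PHI now receives its value through the Flow block instead
    phis[merge][flow] = flow_phi
    phis[flow] = flow_phi
    return cfg, phis

cfg = {"Head": ["Body", "Merge"], "Body": ["Merge"], "Merge": []}
phis = {"Merge": {"Head": 0, "Body": 1}}
cfg, phis = reroute_through_flow(cfg, phis, "Merge", ["Head", "Body"])
# Both edges now pass through "Flow"; Merge's PHI has one incoming (Flow)
```

The key invariant the sketch preserves is the one Steps 7-8 must preserve in the real pass: every (predecessor, value) pair that the original merge PHI knew about survives, just routed through one extra hop.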
PHI Network Construction for Nested Regions
When multiple Flow blocks are created for a chain of if-then-else regions, the PHI networks form a cascade. Each Flow block's PHI determines whether threads should enter the next body or skip to the subsequent Flow block.
Consider a three-way branch (implemented as nested if-then-else):
Before:

      Entry
     /  |  \
    A   B   C
     \  |  /
      Merge

After:

      Entry
     T/    \F
     A      |
      \    /
      Flow1
     T/    \F
  cond_B?   |
     |      |
     B      |
      \    /
      Flow2
     T/    \F
     C      |
      \    /
      Flow3
        |
      Merge
The PHI cascade at each Flow block:
Flow1:
%path_A = phi i1 [ true, %A ], [ false, %Entry ]
br i1 %path_A, <continue to cond_B>, <skip to Merge via Flow3>
Flow2:
%path_B = phi i1 [ true, %B ], [ false, %Flow1 ]
br i1 %path_B, <continue to C>, <skip to Merge via Flow3>
Flow3:
%path_C = phi i1 [ true, %C ], [ false, %Flow2 ]
br i1 %path_C, <Merge>, <Merge>
// Flow3's branch is unconditional to Merge (both sides converge)
// but the PHI values propagated through the chain ensure each
// thread sees the correct value at Merge's PHI nodes.
Each Flow block carries exactly one i1 PHI and one conditional branch. The chain length equals the number of divergent exits from the region minus one. The final Flow block has an unconditional branch (or a branch where both targets are the same) because all paths must converge at the region exit.
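The per-thread semantics of the cascade can be checked with a small simulation. This is a hypothetical helper, not decompiled logic: it only evaluates which i1 PHI in the chain is true for a thread that executed a given body (true iff the thread arrived from that Flow's body, false if it arrived on a skip edge).

```python
def flow_chain_values(body_taken, bodies=("A", "B", "C")):
    """For a thread that executed `body_taken`, return the i1 value
    each Flow block's PHI would carry for that thread. Exactly one
    Flow in the chain sees true, since the bodies are exclusive."""
    return {f"Flow{i + 1}": body_taken == b for i, b in enumerate(bodies)}

vals = flow_chain_values("B")
# Only Flow2 (guarding body B) sees true for this thread
```

This mirrors the text above: the chain of one-PHI-one-branch Flow blocks encodes the taken path as a one-hot pattern, which is what lets Merge's PHIs recover the correct per-thread value.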
Loop Flow Block Insertion
For divergent loops, Flow blocks serve double duty: they both gate the loop body and control the back-edge. The algorithm handles loops specially:
insertLoopFlowBlock(pass, header, latch, exit, function):
// The loop has structure: header -> body -> latch -> {header, exit}
// After structurization:
// header -> body -> FlowLoop -> {header (back-edge), exit}
// Step 1: Create FlowLoop block between latch and exit
flow_loop = sub_2E7AAE0(context, "Flow", function)
sub_2E33BD0(flow_loop, latch->getNextNode())
// Step 2: PHI in FlowLoop encodes continue/break decision
// Convention: true = exit the loop, false = take back-edge
// This is INVERTED from what you might expect.
// Rationale: the "default" path (false) continues the loop,
// and the "exception" path (true) exits. This matches
// upstream LLVM's structurization invariant and simplifies
// the PHI lowering in CSSA.
phi_loop = createPHINode(Type::i1, flow_loop)
phi_loop.addIncoming(ConstantInt::getTrue(), exit_pred) // threads exiting
phi_loop.addIncoming(ConstantInt::getFalse(), body_block) // threads continuing
// Step 3: Conditional branch
createCondBranch(flow_loop, phi_loop, exit, header)
// true -> exit, false -> header (back-edge)
// Step 4: Reroute latch
latch->getTerminator()->replaceSuccessor(header, flow_loop)
latch->getTerminator()->replaceSuccessor(exit, flow_loop)
// Step 5: Update loop info
// FlowLoop is inside the loop (it has the back-edge to header).
// LoopInfo must be updated so that FlowLoop is recognized as
// a loop block, otherwise subsequent passes (LICM, LSR) may
// misclassify it.
loop_info->addBlockToLoop(flow_loop, loop)
// Step 6: Domtree update
// FlowLoop is dominated by latch (or by header if the latch
// was the only block between header and exit).
dom_tree->addNewBlock(flow_loop, latch)
The inverted convention (true = break) is critical. It ensures that the "natural" loop iteration (the common case) follows the fall-through path, which maps to the hardware's predicted branch direction. The PTX assembler uses this hint to generate the @p bra instruction with the back-edge as the taken path, minimizing branch misprediction overhead on the GPU.
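The inverted break/continue convention can be sketched in a few lines (hypothetical names, mirroring the phi_loop wiring above):

```python
def flow_loop_phi(came_from, exit_pred="ExitingBlock", body="Body"):
    """Model FlowLoop's i1 PHI: true = break (exit the loop),
    false = continue (take the back-edge). Inverted on purpose so the
    common case -- continuing the loop -- is the false/fall-through path."""
    if came_from == exit_pred:
        return True   # thread is leaving the loop
    if came_from == body:
        return False  # thread takes the back-edge to the header
    raise ValueError("unknown predecessor")

def flow_loop_branch(phi_value, header="Header", exit_bb="Exit"):
    """FlowLoop's conditional branch: true -> exit, false -> header."""
    return exit_bb if phi_value else header
```

A thread arriving from the loop body gets a false PHI value and is routed back to the header; only threads arriving from the exiting predecessor see true and leave the loop.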
Irreducible CFG Rejection: Why FixIrreducible is Not Scheduled
The pass rejects irreducible CFGs rather than attempting to restructure them. This section documents the design rationale and the consequences.
What Makes a CFG Irreducible
A CFG is irreducible if it contains a cycle with multiple entry points -- that is, there exist two blocks A and B in the cycle such that neither dominates the other, yet both can be reached from outside the cycle. The classic example is a goto into the middle of a loop:
Irreducible:

    Entry
    /   \
   v     v
   A --> B
   ^    /
    \  v
     C
Both A and B are reachable from Entry, and both are in the cycle A->B->C->A.
Neither A dominates B nor B dominates A.
In a reducible CFG, every back-edge target dominates its source. This is the invariant that sub_35CA2C0 checks: it iterates blocks in reverse RPO and, for each back-edge (successor that was already visited), verifies that the target dominates the source via the dominator tree hash table.
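The invariant sub_35CA2C0 checks can be sketched directly. This is an illustrative reimplementation, not the decompiled code: it computes dominator sets by naive iteration (the binary queries a hash table over the real dominator tree) and finds back-edges with a DFS.

```python
def dominators(cfg, entry):
    """Naive iterative dominator sets: dom(b) = {b} U intersection of
    dom(p) over all predecessors p."""
    blocks = list(cfg)
    preds = {b: [p for p in blocks if b in cfg[p]] for b in blocks}
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry or not preds[b]:
                continue
            new = set.intersection(*(dom[p] for p in preds[b])) | {b}
            if new != dom[b]:
                dom[b], changed = new, True
    return dom

def is_reducible(cfg, entry):
    """Reducible iff every back-edge target dominates its source.
    A back-edge is an edge to a block still on the DFS stack."""
    dom = dominators(cfg, entry)
    ok, state = True, {}

    def dfs(b):
        nonlocal ok
        state[b] = "active"
        for s in cfg[b]:
            if state.get(s) == "active":  # back-edge b -> s
                if s not in dom[b]:       # target must dominate source
                    ok = False
            elif s not in state:
                dfs(s)
        state[b] = "done"

    dfs(entry)
    return ok

# Natural loop: Entry -> H -> B -> {H, Exit}  => reducible
loop = {"Entry": ["H"], "H": ["B"], "B": ["H", "Exit"], "Exit": []}
# The classic irreducible example above: cycle A->B->C->A, two entries
irr = {"Entry": ["A", "B"], "A": ["B"], "B": ["C"], "C": ["A"]}
```

On the irreducible example, the DFS finds the back-edge C -> A, but A does not dominate C (C is reachable via Entry -> B), so the check fails -- exactly the condition that triggers the UnsupportedIrreducibleCFG diagnostic.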
The FixIrreducible Pass Exists But Is Not Used
CICC v13.0 links FixIrreduciblePass at sub_29D33E0 (registered as "fix-irreducible" at pipeline-parser index 239). Its core implementation at sub_29D3E80 (60KB) performs controlled node splitting: it duplicates blocks to create a single-entry version of each irreducible cycle. This is the standard compiler technique (T1-T2 node splitting from Hecht and Ullman).
However, the NVPTX pipeline in CICC v13.0 does not schedule FixIrreduciblePass before StructurizeCFG. The pipeline ordering is:
... -> SimplifyCFG -> Sink -> StructurizeCFG -> CSSA -> ISel -> ...
                           ^
                           |
               fix-irreducible is NOT here
Design Rationale
Three factors explain this decision:
1. CUDA source language guarantee. Well-formed CUDA C++ does not produce irreducible control flow. The language has no goto across loop boundaries (the EDG frontend rejects it), and structured constructs (if/for/while/do/switch) always produce reducible CFGs. The only way to get irreducible flow is through extreme goto abuse in C mode or through a buggy optimization pass that introduces one.
2. Code size explosion. Node splitting can exponentially increase code size in pathological cases. For a cycle with N entry points, splitting may duplicate up to 2^N blocks. On a GPU where register pressure is the primary performance limiter, this expansion would be catastrophic -- more blocks means more live ranges, more register pressure, and lower occupancy.
3. Correctness risk. FixIrreduciblePass transforms the CFG before divergence analysis has finalized. If the splitting creates new blocks with divergent branches, those branches would need re-analysis. The interaction between FixIrreducible, DivergenceAnalysis, and StructurizeCFG is not validated in the NVPTX pipeline.
Consequence: Silent Miscompilation Risk
When sub_35CA2C0 detects irreducibility, it emits a diagnostic remark:
remark: UnsupportedIrreducibleCFG
"Irreducible CFGs are not supported yet."
The pass then returns 0 (no modification). The function proceeds through the rest of the pipeline with its irreducible CFG intact. Downstream, one of two things happens:
1. ptxas rejects the PTX. If the irreducible pattern produces a branch target that violates PTX's structured control flow rules, ptxas will emit an error. This is the safe outcome.
2. ptxas silently accepts malformed PTX. If the irreducible pattern happens to look like valid PTX (perhaps it only involves uniform branches), the resulting code may execute with undefined reconvergence behavior. Threads may reconverge at the wrong point, producing silent data corruption. This is the dangerous outcome.
The Stock LLVM Version Has the Same Limitation
The stock LLVM StructurizeCFG at sub_1F0EBC0 (linked from llvm/lib/Transforms/Scalar/StructurizeCFG.cpp) contains identical rejection logic. The AMDGPU backend, which also requires structured control flow, schedules FixIrreduciblePass explicitly before StructurizeCFG. NVIDIA chose not to do this.
| Instance | Address | Size | Irreducible handling |
|---|---|---|---|
| NVPTX custom | sub_35CC920 | 95 KB | Reject with diagnostic |
| Stock LLVM | sub_1F0EBC0 | ~58 KB | Reject with diagnostic |
| FixIrreducible | sub_29D33E0 / sub_29D3E80 | 60 KB | Node splitting (not scheduled) |
The Stock StructurizeCFG Entry Block Handling
The stock LLVM version also includes explicit entry block handling at sub_1A74020 (13KB). When the function's entry block has predecessors (which can happen if the function is a loop body extracted by a prior pass), this function creates a new entry block named "entry" and renames the original to "entry.orig". The NVPTX version at sub_35CC920 handles this inline in Phase 1.
PTX Structured Control Flow Contract
This section documents the precise contract that StructurizeCFG must satisfy for downstream passes to emit correct PTX.
What "Structured" Means for PTX
After StructurizeCFG completes, the function's CFG must satisfy these five invariants:
1. Single-entry regions. Every natural loop has exactly one entry (the loop header dominates all loop blocks). No irreducible cycles exist.
2. Post-dominator reconvergence. For every divergent conditional branch at block B, there exists a block P that post-dominates B and dominates all merge points of the two branch targets. A Flow block is inserted at P if one does not already exist.
3. Linear Flow chain. Between any divergent branch and its reconvergence point, the CFG forms a chain of Flow blocks with single-entry single-exit semantics. Each Flow block has exactly two predecessors (the "then" body exit and the "skip" path) and two successors (the next body entry or the final merge).
4. PHI-encodable path selection. Every Flow block contains an i1 PHI that encodes which path was taken. This PHI is the sole branch condition of the Flow block's terminator. No other computation occurs in Flow blocks.
5. Metadata tagging. Uniform branches are tagged with !structurizecfg.uniform metadata (metadata kind registered at sub_298D780). This prevents CSSA from inserting unnecessary copies at reconvergence points for branches where all threads agree.
Downstream Consumer: CSSA
The CSSA pass (sub_3720740) consumes the structured CFG and inserts explicit copy instructions at every reconvergence point. It relies on:
- The Flow block chain to identify where reconvergence happens.
- The i1 PHI in each Flow block to determine which threads took which path.
- The !structurizecfg.uniform metadata to skip copy insertion for uniform regions.
Without StructurizeCFG, CSSA would not know where to insert copies, and the resulting register allocation would be unsound under warp divergence.
Downstream Consumer: Convergence Control in AsmPrinter
The reconvergence head/tail stored at function offsets +672 and +680 are consumed by the AsmPrinter's convergence control framework (see AsmPrinter). The AsmPrinter emits CONVERGENCECTRL_ENTRY (opcode 24) and CONVERGENCECTRL_LOOP (opcode 33) pseudo-instructions at the boundaries defined by these metadata values. The hardware uses these to program the convergence barrier stack.
Interaction with SIAnnotateControlFlow (AMDGPU Comparison)
AMDGPU uses a different approach: SIAnnotateControlFlow inserts explicit if/else/end_cf intrinsics after StructurizeCFG. NVPTX does not use this -- instead, the convergence information flows through:
- StructurizeCFG (Flow blocks + function metadata)
- CSSA (copy insertion at reconvergence)
- SelectionDAG / ISel (structured branch patterns)
- AsmPrinter (convergence pseudo-instructions)
This four-stage pipeline is NVIDIA-specific. Upstream LLVM for AMDGPU collapses stages 1-2 into StructurizeCFG + SIAnnotateControlFlow and has no equivalent of stage 4.
The Two Binary Instances
CICC v13.0 contains two complete copies of the StructurizeCFG pass because the binary links both the NVPTX backend (custom) and the generic LLVM Scalar library (stock). Only the NVPTX version is scheduled in the pipeline.
| | NVPTX Custom | Stock LLVM |
|---|---|---|
| Main body | sub_35CC920 (95 KB) | sub_1F0EBC0 (~58 KB) |
| Entry gate | sub_35CF930 | (inlined) |
| Region processing | sub_35CBCD0 | sub_1A761E0 (28 KB) |
| Entry block handler | (inlined in Phase 1) | sub_1A74020 (13 KB, strings "entry.orig", "entry") |
| Region-based | Operates on entire function | Operates on individual Region objects |
| Uniform metadata | sub_298D780 ("structurizecfg.uniform") | Same string, different address |
| Registration | sub_29882C0 ("Structurize the CFG") | sub_2988270 ("Structurize control flow") |
| Pipeline parser | Index 413: "structurizecfg" with skip-uniform-regions param | Same index, same params |
The NVPTX version is 37 KB larger because it inlines the entry-block handler and region-processing logic (avoiding virtual dispatch overhead) and adds the CUDA-specific attribute checks (IDs 56, 63, 59, 64, 57) and the convergence metadata writes at offsets +672/+680.
Bitvector Tracking for Region Membership
The pass tracks which basic blocks have been visited using a dynamically sized bitvector stored in the pass object:
| Field | Offset | Meaning |
|---|---|---|
| uint64_t *array | pass + 728 | Pointer to the word array |
| uint64_t word_count | pass + 736 | Current number of 64-bit words |
| uint64_t capacity | pass + 740 | Allocated capacity in words |
| uint64_t bb_count | pass + 792 | Total number of basic blocks |
Index computation for a block with ordinal idx:
word_offset = idx >> 6; // idx / 64
bit_mask = 1ULL << (idx & 63); // idx % 64
// Test
is_visited = (array[word_offset] & bit_mask) != 0;
// Set
array[word_offset] |= bit_mask;
// Clear
array[word_offset] &= ~bit_mask;
When new basic blocks are created during structurization (the function grows), the bitvector is expanded via sub_C8D5F0 (the SmallVector::grow equivalent). New words are initialized to 0xFFFFFFFFFFFFFFFF (all bits set = "visited"), then trailing bits beyond the actual block count are cleared. This ensures newly created blocks are conservatively marked as visited until explicitly processed.
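The grow-and-clear behavior can be mirrored in a small model. This is a sketch of the semantics described above (word-granular all-ones initialization, then clearing trailing bits), not a transcription of sub_C8D5F0.

```python
class BlockBitvector:
    """64-bit-word bitvector with the pass's grow semantics: newly
    allocated words start all-ones ("conservatively visited"), then
    bits at or beyond the real block count are cleared."""

    def __init__(self, bb_count):
        self.bb_count = bb_count
        self.words = [0] * ((bb_count + 63) >> 6)

    def test(self, idx):
        return (self.words[idx >> 6] >> (idx & 63)) & 1 == 1

    def set(self, idx):
        self.words[idx >> 6] |= 1 << (idx & 63)

    def clear(self, idx):
        self.words[idx >> 6] &= ~(1 << (idx & 63))

    def grow(self, new_bb_count):
        new_words = (new_bb_count + 63) >> 6
        # New words start as 0xFFFFFFFFFFFFFFFF: visited until processed
        self.words += [(1 << 64) - 1] * (new_words - len(self.words))
        # Clear trailing bits beyond the actual block count
        for idx in range(new_bb_count, new_words * 64):
            self.clear(idx)
        self.bb_count = new_bb_count

bv = BlockBitvector(10)
bv.set(3)
bv.grow(70)  # second word appended all-ones, bits 70..127 cleared
```

Note that the all-ones initialization is word-granular: blocks added within an already-allocated word keep their previous (zero) bits, matching the description above.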
Hash Table Implementation
The pass uses two DenseSet-style hash tables with LLVM-layer sentinels (-4096 / -8192); see Hash Table and Collection Infrastructure for the hash function, probing, and growth policy. The resize function for this pass is sub_2E61F50. Table v394 tracks BBs already processed during the BFS expansion, and v417 serves as a scratch set for child-split deduplication.
Comparison with Upstream LLVM StructurizeCFG
The NVIDIA version and upstream LLVM share the same fundamental algorithm. Both are derived from the same codebase (confirmed by identical diagnostic strings and strategy-object vtable layouts). The differences are:
Architectural differences
| Aspect | NVIDIA (sub_35CC920) | Upstream LLVM |
|---|---|---|
| Granularity | Operates on entire function, iterating blocks in SCC/RPO order | Operates on individual Region objects, one region per invocation |
| Region discovery | Inline SCC decomposition + domtree walk | Relies on RegionInfo analysis pass |
| Object layout | Pass fields at a1[65..91]; BB flags at +216, +235, +262 | Different offsets reflecting different BasicBlock subclass |
| SCC ordering | sub_357E170 computes RPO/SCC cross-product | Uses scc_iterator from llvm/ADT/SCCIterator.h |
| Strategy object | Queried via vtable+312/320/344/368 | Uses TargetTransformInfo for cost decisions |
Functional differences
1. Irreducibility handling. Both reject irreducible CFGs with the same diagnostic. Neither performs restructuring. Upstream LLVM relies on FixIrreduciblePass being scheduled separately (AMDGPU does this). NVIDIA does not schedule it.
2. EH funclet handling. Both reject funclets. The NVIDIA version checks BB+235 (a wider BasicBlock struct with CUDA-specific fields). Upstream checks via isa<FuncletPadInst>.
3. Uniform region skipping. Both support structurizecfg-skip-uniform-regions. The NVIDIA version integrates DivergenceAnalysis queries inline (sub_2E88A90, sub_2E8B090). Upstream uses UniformityInfo::isUniform(BranchInst*).
4. Metadata tagging. Both use the "structurizecfg.uniform" metadata kind to mark branches that have been classified as uniform, preventing re-analysis in nested region processing.
5. Zero-cost hoisting. Upstream LLVM (recent versions) includes hoistZeroCostElseBlockPhiValues to reduce VGPR pressure from structurization-induced phi nodes. The NVIDIA version may or may not include this optimization; the decompiled code at the corresponding offset shows similar phi-manipulation logic but uses different register-pressure heuristics.
6. Reconvergence metadata. The NVIDIA version writes reconvergence head/tail to function metadata at offsets +672 and +680. This is consumed by downstream PTX emission passes (AsmPrinter, convergence barrier insertion). Upstream LLVM has no equivalent because AMDGPU uses SIAnnotateControlFlow instead.
What NVIDIA did NOT change
The core structurization algorithm is identical: topological ordering of region nodes, iterative flow-block insertion, PHI-node reconstruction via SSAUpdater, and domtree maintenance. The strategy-object interface (shouldSplit, shouldSplitChild, classifyEdge, updateDivergence) has the same vtable layout in both versions. The FlowBlock naming convention ("Flow") is preserved.
Pipeline Position
StructurizeCFG runs late in the NVPTX backend pipeline, after most IR-level optimizations and before machine code generation:
... -> SimplifyCFG -> Sink -> StructurizeCFG -> CSSA -> ISel -> ...
It must run after divergence analysis (so it can query which branches are uniform) and before instruction selection (which assumes structured control flow). The CSSA (Convergent SSA) pass that follows converts phi nodes to respect warp divergence semantics at the reconvergence points that StructurizeCFG inserted.
Summary of Pass Decisions
| Input condition | Action | Diagnostic |
|---|---|---|
| Single-block function | Skip | None |
| Function with convergent/optnone attributes | Skip | None |
| enable-shrink-wrap = 2 | Skip | None |
| Strategy object declines | Skip | None |
| All-uniform branches (with skip-uniform knob) | Skip | None |
| Irreducible CFG detected | Reject | "UnsupportedIrreducibleCFG" |
| EH funclet block detected | Reject | "UnsupportedEHFunclets" |
| Reducible, divergent regions | Restructure | None (new Flow blocks inserted, edges rerouted) |
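The decision table above can be folded into a single dispatch sketch (hypothetical names; it models only the skip/reject/restructure outcomes and the order in which the gate checks them):

```python
def structurize_decision(f):
    """Map input-function properties (a dict of booleans) to the
    pass's (action, diagnostic) pair, per the summary table."""
    # Entry-gate skips are checked before any CFG analysis runs
    skips = ["single_block", "convergent_or_optnone",
             "shrink_wrap_2", "strategy_declines", "all_uniform"]
    if any(f.get(k) for k in skips):
        return ("skip", None)
    # Rejections emit a diagnostic remark and leave the CFG untouched
    if f.get("irreducible"):
        return ("reject", "UnsupportedIrreducibleCFG")
    if f.get("eh_funclet"):
        return ("reject", "UnsupportedEHFunclets")
    # Otherwise: insert Flow blocks and reroute edges
    return ("restructure", None)
```

Placing the skip checks first matches the structure described earlier: the entry gate declines cheaply before irreducibility detection ever runs.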
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent CFG structurization pass for a GPU target.
1. Attempting to restructure irreducible CFGs instead of rejecting them. The LLVM codebase includes FixIrreduciblePass (sub_29D33E0) which performs T1-T2 node splitting, but NVIDIA deliberately does not schedule it before StructurizeCFG. A reimplementation that adds node splitting to "handle" irreducible CFGs risks exponential code size blowup (2^N blocks for N entry points), catastrophic register pressure increases from the duplicated live ranges, and untested interaction with divergence analysis. The correct approach for an NVPTX target is to reject irreducible CFGs with a diagnostic and rely on the CUDA language guarantee that well-formed source never produces them.
2. Forgetting to update LoopInfo when inserting Flow blocks inside loops. When insertLoopFlowBlock creates a new block between the latch and the exit, that block carries the back-edge to the header and is therefore inside the loop. If LoopInfo is not updated (loop_info->addBlockToLoop), subsequent passes (LICM, LSR, LoopUnroll) will not recognize the Flow block as a loop member and may hoist or sink code across it incorrectly. This is a silent miscompilation: the kernel produces wrong results only for inputs that exercise the divergent loop path.
3. Inverting the Flow block PHI convention. The pass uses true = exit loop (break) and false = continue loop (back-edge) for loop Flow blocks. This is counterintuitive -- most programmers expect true to mean "condition is met, continue." Reversing this convention causes the back-edge to be the taken path for true, which not only produces wrong control flow but also defeats the branch prediction hint that maps the fall-through (false) path to the common-case loop continuation. A reimplementation must match the exact convention documented in the upstream LLVM structurization invariant.
4. Not writing reconvergence metadata to function offsets +672/+680. The AsmPrinter's convergence control framework reads the head and tail stored at these offsets to emit CONVERGENCECTRL_ENTRY and CONVERGENCECTRL_LOOP pseudo-instructions. A reimplementation that structures the CFG correctly but does not write these metadata values will cause the AsmPrinter to emit PTX without convergence barriers. On architectures with hardware convergence tracking (SM 7.0+), this can lead to threads reconverging at incorrect points, producing silent data corruption.
5. Skipping structurization for regions where all branches appear uniform but sub-regions contain divergent branches. The structurizecfg-relaxed-uniform-regions knob allows skipping outer regions when they have at most one conditional direct child. A reimplementation that skips any region marked "uniform" without checking sub-region divergence will fail to insert Flow blocks for inner divergent branches, leaving the PTX with unstructured control flow that ptxas may reject or (worse) silently miscompile.
Cross-References
- CSSA -- the Conventional SSA pass that consumes Flow blocks to insert warp-safe copies
- AsmPrinter -- convergence control pseudo-instruction emission consuming the +672/+680 metadata
- GPU Execution Model -- warp divergence and reconvergence fundamentals
- Branch Folding -- may eliminate redundant Flow blocks after code generation
- Hash Infrastructure -- details on the DenseSet implementation used by the BB tracking tables
- Pipeline -- exact position of structurizecfg in the pass ordering
- Knobs -- structurizecfg-skip-uniform-regions, structurizecfg-relaxed-uniform-regions, enable-shrink-wrap
- Upstream LLVM source: llvm/lib/Transforms/Scalar/StructurizeCFG.cpp
Differences from Upstream LLVM
| Aspect | Upstream LLVM (AMDGPU) | CICC v13.0 (NVPTX) |
|---|---|---|
| Binary copies | Single StructurizeCFG in LLVM Scalar library | Two copies: NVPTX-specific at sub_35CC920 (95 KB) and stock LLVM/AMDGPU at sub_1F0EBC0; only NVPTX instance scheduled |
| Divergence query | Queries AMDGPU divergence analysis | Queries NVPTX warp divergence analysis; uniform branch skip via structurizecfg-skip-uniform-regions knob |
| Flow block metadata | Flow blocks inserted without convergence metadata | Inserts convergence control metadata at offsets +672/+680 on Flow blocks, consumed by AsmPrinter for warp reconvergence pseudo-instructions |
| Relaxed uniform regions | Not present | structurizecfg-relaxed-uniform-regions knob allows less aggressive structurization when all branches in a region are provably uniform |
| Irreducible CFG handling | Rejects; AMDGPU schedules FixIrreduciblePass (T1/T2 node splitting) before StructurizeCFG | Same rejection logic, but the "UnsupportedIrreducibleCFG" diagnostic is NVPTX-specific and FixIrreducible is never scheduled |
| Skip conditions | Skip for single-block functions | Extended skip: single-block, convergent/optnone attributes, enable-shrink-wrap = 2, and strategy object decline |
| Mandatory status | Required for AMDGPU but can be skipped via flag | Mandatory for PTX emission: registered as required late pass by both sub_29882C0 and sub_1A6D600 |
Machine-Level Passes
Machine-level passes in CICC v13.0 operate on MachineFunction / MachineBasicBlock / MachineInstr representations after SelectionDAG instruction selection has converted LLVM IR into target-specific pseudo-instructions. On a conventional CPU target, these passes ultimately produce native machine code; on NVPTX, they produce PTX assembly -- a virtual ISA with unlimited virtual registers and a structured instruction set. This distinction is fundamental: NVPTX's "machine code" still uses virtual registers (%r0, %f1, %p3), and the final PTX text is consumed by ptxas which performs the actual register allocation against the hardware register file. The machine-level passes in CICC therefore serve a different purpose than on CPU: they optimize register pressure (to maximize occupancy), structure control flow (PTX requires structured CFG), compute .local memory frame layouts, and prepare clean PTX for ptxas to finish.
| Pass pipeline parser (MF) | sub_235E150 (53KB) |
| Master pass registry | sub_2342890 (102KB) |
| Codegen pass config | ctor_335_0 at 0x507310 (88 strings) |
| NVPTX target pass config | ctor_358_0 at 0x50E8D0 (43 strings) |
| Total registered MF passes | 51 (stock LLVM) + 13 (NVIDIA custom) |
| Total MF analyses | 14 registered |
| Pipeline configuration | sub_2166D20 (addISelPasses), sub_2166ED0 (addPreRegAlloc), sub_21668D0 (addPostRegAlloc) |
Why Machine Passes Matter on GPU
In upstream LLVM for x86 or AArch64, the machine pass pipeline assigns physical registers, inserts spill code, schedules instructions for pipeline hazards, and emits relocatable object code. On NVPTX, none of this maps directly:
- No physical register file. PTX registers are virtual. The greedy register allocator in CICC does not assign physical registers -- it tracks register pressure per class and enforces the -maxreglimit (default 70) that controls SM occupancy. When the allocator "spills," it moves values to .local memory rather than to stack slots addressed by %rsp.
- No prolog/epilog in the traditional sense. There is no call stack with push/pop sequences. PrologEpilogInserter in CICC computes .local frame offsets for spilled virtual registers and inserts ld.local / st.local pairs.
- Structured control flow is mandatory. PTX requires structured control flow (bra, @p bra, bra.uni). The StructurizeCFG pass runs before instruction selection, and BranchFolding must preserve the structured property.
- Instruction scheduling targets ptxas, not hardware. Machine scheduling optimizes the instruction stream that ptxas will consume. Since ptxas performs its own scheduling against the actual hardware pipeline, CICC's scheduling focuses on register pressure reduction (nvptx-sched4reg) and exposing parallelism that ptxas can exploit.
- Two peephole levels. CICC runs both the stock LLVM PeepholeOptimizer (which operates on generic MachineInstr patterns) and the NVIDIA-specific NVPTXPeephole (sub_21DB090), which handles PTX-specific patterns such as redundant cvta instructions, predicate folding, and address space conversions.
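The occupancy consequence of the -maxreglimit cap can be illustrated with a back-of-the-envelope model. The 64K-entry register file and 32-warp ceiling below are typical Turing-class figures used for illustration, not values recovered from the binary:

```python
def occupancy_from_reg_limit(regs_per_thread, regfile=65536, warp_size=32, max_warps=32):
    """Illustrative model: a per-thread register cap bounds resident warps on one SM."""
    if regs_per_thread == 0:
        return max_warps, 1.0
    # Each resident warp consumes warp_size * regs_per_thread registers.
    warps = min(max_warps, regfile // (regs_per_thread * warp_size))
    return warps, warps / max_warps

# At CICC's default cap of 70 registers/thread:
warps, occ = occupancy_from_reg_limit(70)   # -> 29 warps (~91% occupancy)
```

Lowering the cap to 32 registers/thread lets the warp ceiling, not the register file, become the limiter -- which is exactly the trade-off the remat and pressure-aware passes below are negotiating.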
Pipeline Flow
SelectionDAG ISel
│
▼
FinalizeISel ─── expand pseudo-instructions from ISel
│
▼
┌─────────────────────────────────────┐
│ Pre-RA Optimization │
│ ┌─ EarlyTailDuplicate │
│ ├─ EarlyMachineLICM │
│ ├─ MachineCSE (RP-aware) │
│ ├─ MachineSink (gated by knob) │
│ ├─ PeepholeOptimizer │
│ ├─ NVPTXPeephole ★ │
│ ├─ DeadMachineInstrElim │
│ └─ MachineCopyPropagation │
└─────────────────────────────────────┘
│
▼
TwoAddressInstruction ─── convert 3-addr to 2-addr form
│
▼
PHIElimination (CSSA/deSSA) ─── lower MachineInstr PHIs to copies
│
▼
┌─────────────────────────────────────┐
│ Register Allocation │
│ ┌─ LiveIntervals + SlotIndexes │
│ ├─ RegisterCoalescing │
│ ├─ RAGreedy (pressure-driven) │
│ ├─ NVPTXBlockRemat ★ │
│ └─ StackSlotColoring │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Post-RA Optimization │
│ ┌─ ExpandPostRAPseudos │
│ ├─ MachineLICM (post-RA) │
│ ├─ MachineSink (post-RA, gated) │
│ ├─ MachineCopyPropagation │
│ ├─ BranchFolding / TailMerge │
│ ├─ MachineBlockPlacement │
│ └─ MachinePipeliner (SMS) │
└─────────────────────────────────────┘
│
▼
PrologEpilogInserter ─── .local frame layout
│
▼
MachineOutliner ─── OUTLINED_FUNCTION_ stub creation
│
▼
NVPTXProxyRegErasure ★ ─── remove redundant cvta.to.local
│
▼
AsmPrinter ─── PTX text emission
Passes marked with ★ are NVIDIA-custom. The exact ordering varies by optimization level; at -O0, most pre-RA and post-RA optimization passes are skipped and RegAllocFast replaces RAGreedy.
Pipeline Configuration Functions
The NVPTX backend configures the machine pass pipeline through three key functions:
sub_2166D20 -- addISelPasses(): Configures passes before instruction selection. Diagnostic string: "\n\n*** Final LLVM Code input to ISel ***\n". Adds: alloca hoisting, ISel DAG printer (conditional), NVPTXProxyRegErasure, NVPTXLowerArgs, NVPTX-specific ISel.
sub_2166ED0 -- addPreRegAlloc(): Configures machine passes before register allocation. Diagnostic strings: "After Pre-RegAlloc TailDuplicate", "After codegen DCE pass", "After Machine LICM, CSE and Sinking passes", "After codegen peephole optimization pass". Adds: TailDuplicate, codegen DCE, Machine LICM + CSE + Sinking (conditional on byte_4FD1980, byte_4FD18A0, byte_4FD1A60), codegen peephole.
sub_21668D0 -- addPostRegAlloc(): Configures post-register-allocation passes. Diagnostic strings: "After Machine Scheduling", "After StackSlotColoring". Adds: Machine scheduling (2 modes controlled by dword_4FD26A0 -- value 1 selects simple scheduling, otherwise full pipeline), Stack slot coloring, nvptx-mem2reg (conditional on byte_4FD25C0).
Machine Pass Inventory
NVIDIA-Custom Machine Passes
| Pass ID | Class / Address | Pipeline Position | Description |
|---|---|---|---|
nvptx-peephole | sub_21DB090 | Pre-RA | PTX-specific peephole: folds redundant address space conversions (cvta), optimizes predicate patterns, simplifies PTX-specific instruction sequences. Controlled by enable-nvvm-peephole (default: on). |
nvptx-remat-block | sub_217DBF0 | During RA | Machine-level block rematerialization. Iterative "pull-in" algorithm that recomputes values near their use rather than loading from spill slots. Two-phase candidate selection with a "second-chance" heuristic. See Rematerialization. |
machine-rpa | sub_21EAA00 | Analysis (pre-RA) | Machine Register Pressure Analysis. Provides per-basic-block pressure data consumed by MachineCSE, scheduling, and rematerialization. |
extra-machineinstr-printer | sub_21E9E80 | Diagnostic | Prints per-function register pressure statistics. Debug-only pass for tuning pressure heuristics. |
nvptx-mem2reg | sub_21F9920 | Pre-RA | Machine-level mem2reg: promotes .local memory loads/stores back to virtual registers when profitable. Conditional on byte_4FD25C0 (nv-disable-mem2reg inverts). |
ldgxform | sub_21F2780 | Pre-RA | Transforms qualifying global memory loads into ld.global.nc (LDG -- load through read-only data cache). Splits wide vector loads for hardware constraints. |
nvptx-prolog-epilog | sub_21DB5F0 | Post-RA | NVPTX-specific PrologEpilog pass. Works alongside or replaces the stock PEI to handle PTX frame semantics where there is no traditional stack pointer. |
nvptx-proxy-reg-erasure | sub_21DA810 | Late post-RA | Removes redundant cvta.to.local instructions left by address space lowering. |
nvptx-assign-valid-global-names | sub_21BCD80 | Pre-emission | Sanitizes symbol names to comply with PTX naming rules (no @, $, or other characters illegal in PTX identifiers). |
nvptx-replace-image-handles | sub_21DBEA0 | Pre-emission | Replaces IR-level texture/surface handle references with PTX-level .tex / .surf declarations. |
nvptx-image-optimizer | sub_21BCF10 | Pre-emission | Texture/surface instruction optimization: coalesces related texture operations, validates image type consistency for tex, suld, sust, suq. |
alloca-hoisting | sub_21BC7D0 | Early post-ISel | Hoists alloca instructions to the entry basic block, enabling the frame layout pass to assign fixed offsets. |
generic-to-nvvm | sub_215DC20 | Early post-ISel | Converts generic address space (0) references to global address space (1). Runs before instruction selection on some pipelines, but also present as a machine-level fixup. |
param-opt | sub_2203290 | Post-ISel | Optimizes ld.param instructions. NVIDIA-custom pass for parameter load coalescing and redundant parameter load elimination. |
nvptx-trunc-opts | sub_22058E0 | Post-ISel | Optimizes redundant ANDb16ri instructions [sic: binary string reads "instrunctions"] generated during i16 truncation patterns. |
redundant-move-elim | sub_2204E60 | Post-ISel | Removes redundant register-to-register moves left by instruction selection. |
Stock LLVM Machine Passes (NVPTX Configuration)
| Pass ID | Class | NVIDIA Modification | Notes |
|---|---|---|---|
finalize-isel | FinalizeISelPass | None | Expands ISel pseudo-instructions; mandatory first MF pass. |
early-tailduplication | EarlyTailDuplicatePass | None | Pre-RA tail duplication. Can be disabled via disable-early-taildup. |
early-machinelicm | EarlyMachineLICMPass | Gated | Controlled by enable-mlicm. Hoists loop-invariant machine instructions before register allocation. |
machine-cse | MachineCSEPass | Modified | NVIDIA adds register-pressure-aware CSE (rp-aware-mcse, pred-aware-mcse, copy-prop-mcse). Uses MRPA (sub_2E5A4E0) for incremental pressure tracking. See Instruction Scheduling. |
machine-sink | MachineSinkingPass | Gated | Disabled by default on NVPTX; enabled via nvptx-enable-machine-sink. When active, sinks instructions closer to uses to reduce register pressure. |
peephole-opt | PeepholeOptimizerPass | None | Stock LLVM peephole: folds redundant copies, simplifies compare-and-branch patterns, optimizes sub-register operations. Can be disabled via disable-peephole. |
dead-mi-elimination | DeadMachineInstrElimPass | None | Eliminates dead machine instructions. Can be disabled via disable-machine-dce. |
machine-cp | MachineCopyPropagationPass | None | Propagates copies to reduce move instructions. Can be disabled via disable-copyprop. |
machinelicm | MachineLICMPass | Gated | Post-RA variant. Controlled by disable-postra-machine-licm. NVIDIA adds sink-insts-to-avoid-spills to trade hoisting for spill reduction. |
two-address-instruction | TwoAddressInstructionPass | None (stock) | Converts three-address instructions to two-address form by inserting copies. sub_1F53550 (79KB, 2470 lines). Shared between cicc and libNVVM (twin at sub_F4EA80). |
phi-node-elimination | PHIEliminationPass | Modified | NVIDIA's CSSA/deSSA method selection via usedessa (default 2). Controls how machine-level PHI nodes are lowered to copies; affects register allocation quality. See cssa-coalesce, cssa-verbosity. |
register-coalescer | RegisterCoalescerPass | Custom NVPTX variant | The NVPTX backend has its own register coalescing framework at 0x349--0x34B (separate from LLVM's stock coalescer at 0xB40000). Uses interference oracle sub_349D6E0, open-addressing hash with (reg >> 9) ^ (reg >> 4). See Register Coalescing. |
greedy | RAGreedyPass | Modified | Pressure-driven rather than assignment-driven. Dual instances (legacy + new PM). Core at sub_2F49070 (82KB). See Register Allocation. |
stack-coloring | StackColoringPass | None | Colors stack slots to reduce .local memory usage by sharing slots with non-overlapping lifetimes. |
stack-slot-coloring | StackSlotColoringPass | None | Secondary stack slot optimization. Can be disabled via disable-ssc. |
post-ra-pseudos | ExpandPostRAPseudosPass | None | Expands post-RA pseudo-instructions (e.g., COPY to actual move). |
post-RA-sched | PostRASchedulerPass | Gated | Post-RA instruction scheduling. Controlled by disable-post-ra. |
machine-scheduler | MachineSchedulerPass | Modified | NVIDIA adds nvptx-sched4reg mode for register-pressure-driven scheduling. Pre-RA scheduling variant. |
postmisched | PostMachineSchedulerPass | None | Post-RA machine scheduling with ScheduleDAGMILive (sub_355F610, 64KB). Controlled by misched-postra. |
early-ifcvt | EarlyIfConverterPass | None | If-conversion before register allocation. Can be disabled via disable-early-ifcvt. |
machine-combiner | MachineCombinerPass | None | Combines machine instructions using target-defined patterns. Knob: machine-combiner-inc-threshold. |
block-placement | MachineBlockPlacement | None (stock) | Profile-guided basic block ordering. sub_3521FF0 (82KB). Uses ext-TSP and chain-based algorithms. See Block Placement. |
machine-outliner | MachineOutliner | None | Creates OUTLINED_FUNCTION_ stubs for repeated instruction sequences. sub_3537010 (77KB). See MachineOutliner. |
prologepilog | PrologEpilogInserter | Modified | NVIDIA's PEI (sub_35B1110, 68KB) computes .local memory frame offsets. Frame objects are 40-byte records with offset, size, alignment, and spill-slot flags. See PrologEpilogInserter. |
opt-phis | OptimizePHIsPass | None | Optimizes machine-level PHI nodes (removes trivially dead or redundant PHIs). |
tailduplication | TailDuplicatePass | None | Post-RA tail duplication. Controlled by disable-tail-duplicate. |
detect-dead-lanes | DetectDeadLanesPass | None | Detects unused sub-register lanes; minimal impact on NVPTX since register classes are fully disjoint. |
rename-independent-subregs | RenameIndependentSubregsPass | None | Splits sub-register live ranges into independent virtual registers. |
localstackalloc | LocalStackSlotAllocationPass | None | Allocates local frame indices for large stack objects. |
machine-latecleanup | MachineLateInstrsCleanupPass | None | Late-stage dead instruction cleanup. |
machine-pipeliner | MachinePipeliner | None (stock) | Swing Modulo Scheduling for loop bodies. sub_3563190 (58KB). See below. |
Per-Pass Algorithm Descriptions
NVPTXPeephole (sub_21DB090) -- PTX-Specific Peephole Optimizer
Registration: sub_21DB090 at 0x21DB090, pass ID "nvptx-peephole". Enabled by default; controlled by enable-nvvm-peephole.
This pass runs pre-RA and performs pattern-matching rewrites on MachineInstr sequences that are specific to the NVPTX target. Unlike the stock LLVM PeepholeOptimizer (which operates on generic copy/compare patterns), NVPTXPeephole handles PTX address space semantics and predicate register idioms.
Patterns handled:
- Redundant cvta elimination. When address space lowering inserts cvta.to.global or cvta.to.shared followed by an operation that already operates in the correct address space, the cvta is dead. The pass scans for cvta instructions whose result is used only by instructions with matching address space qualifiers, and deletes the cvta.
- Predicate folding. PTX predicates (%p0, %p1, ...) are first-class. The pass identifies patterns where a setp instruction produces a predicate that is consumed by exactly one @p bra, and folds them into a conditional branch with an embedded comparison.
- Address space conversion simplification. When generic-to-nvvm inserts an addrspacecast and the consuming instruction directly emits the correct address qualifier (.global, .shared, .local, .const), the intermediate cast is redundant.
// Pseudocode: NVPTXPeephole main loop
fn nvptx_peephole(MF: &mut MachineFunction) -> bool {
let mut changed = false;
for mbb in MF.basic_blocks() {
let mut dead_list = vec![];
for mi in mbb.instrs() {
match mi.opcode() {
NVPTX::CVTAToGeneric | NVPTX::CVTAToGlobal
| NVPTX::CVTAToShared | NVPTX::CVTAToLocal => {
if single_user_in_matching_addrspace(mi) {
propagate_operand_and_kill(mi);
dead_list.push(mi);
changed = true;
}
}
NVPTX::SETP_* => {
if let Some(bra) = single_predicate_consumer(mi) {
fold_setp_into_branch(mi, bra);
dead_list.push(mi);
changed = true;
}
}
_ => {}
}
}
for mi in dead_list { mi.erase_from_parent(); }
}
changed
}
NVPTXBlockRemat (sub_217DBF0) -- Machine-Level Block Rematerialization
Registration: sub_217DBF0 at 0x217DBF0, pass name "NVPTX Specific Block Remat", pass ID "nvptx-remat-block". Knob constructor at ctor_361_0 (0x5108E0). Main engine: sub_2186D90 (47KB, ~1742 decompiled lines).
This is NVIDIA's custom register-pressure-reduction pass. It re-computes values at their use sites instead of keeping them live across long spans. The algorithm is iterative with a two-phase candidate selection including a "second-chance" heuristic for marginal candidates.
Knobs (16 total):
| Global Variable | CLI Flag | Default | Description |
|---|---|---|---|
dword_4FD3820 | nv-remat-block | 14 | Bitmask controlling remat modes (bits 0-3) |
dword_4FD3740 | nv-remat-max-times | 10 | Max iterations of the outer remat loop |
dword_4FD3660 | nv-remat-block-single-cost | 10 | Max cost per single live value pull-in |
dword_4FD3580 | nv-remat-block-map-size-limit | 6 | Map size limit for single pull-in |
dword_4FD3040 | nv-remat-block-max-cost | 100 | Max total clone cost per live value reduction |
dword_4FD3120 | nv-remat-block-liveout-min-percentage | 70 | Min liveout % for special consideration |
unk_4FD3400 | nv-remat-block-loop-cost-factor | 20 | Loop cost multiplier |
unk_4FD3320 | nv-remat-default-max-reg | 70 | Default max register pressure target |
unk_4FD2EC0 | nv-remat-block-load-cost | 10 | Cost assigned to load instructions |
unk_4FD3860 | nv-remat-threshold-for-spec-reg | 20 | Threshold for special register remat |
byte_4FD2E80 | nv-dump-remat-block | off | Debug dump toggle |
byte_4FD2DA0 | nv-remat-check-internal-live | off | Check internal liveness during MaxLive |
qword_4FD2C20 | max-reg-kind | 0 | Kind of max register pressure info |
qword_4FD2BE0 | no-mi-remat | (list) | Skip remat for named functions |
word_4FD32F0 | load-remat | on | Enable load rematerialization |
word_4FD3210 | vasp-fix1 | off | VASP fix (volatile/addsp) |
Algorithm pseudocode (sub_2186D90):
fn nvptx_block_remat(MF: &mut MachineFunction) -> bool {
// (A) INITIALIZATION
let target = max_reg_override.unwrap_or(nv_remat_default_max_reg); // default 70
if MF.block_count() == 1 { return false; }
if function_name in no_mi_remat_list {
log("Skip machine-instruction rematerialization on {name}");
return false;
}
// (B) LIVEOUT FREQUENCY COUNTING
for bb in MF.blocks() {
for reg in bb.live_out() {
freq_map[reg] += 1;
}
}
// Normalize: freq_pct = (100 * count) / num_blocks
// (C) OUTER ITERATIVE LOOP
let mut iteration = 0;
let mut overall_changed = false;
loop {
iteration += 1;
if iteration > nv_remat_max_times { break; } // default 10
// Phase 1: COMPUTE MAX-LIVE
let max_live = sub_2186590(MF); // scan all blocks
log("Max-Live-Function({num_blocks}) = {max_live}");
if target >= max_live { break; } // no pressure problem
let mut changed = false;
// Phase 2: FOR EACH OVER-PRESSURE BLOCK
for bb in blocks_where(pressure > target) {
let excess = bb.pressure - target;
// Phase 3: CLASSIFY LIVE-OUT REGISTERS
let (pullable, non_pullable) = classify_liveout(bb);
// sub_217E810 (MULTIDEF check) -- must have single unique def
// sub_2181550 (recursive pullability, depth <= 50)
log("Pullable: {pullable.len()}");
// Phase 4: SECOND-CHANCE HEURISTIC (sub_2181870)
if excess > pullable.len() && second_chance_list.not_empty() {
second_chance_promote(&mut pullable, &mut non_pullable);
// Re-evaluates rejected candidates with relaxed criteria
// Uses visit-count mechanism to prevent infinite loops
// Hash: h(regID) = 37 * regID, open-addressing
log("ADD {n} candidates from second-chance");
}
log("Total Pullable before considering cost: {pullable.len()}");
// Phase 5: COST ANALYSIS (sub_2183E30)
let candidates = pullable.filter_map(|reg| {
let cost = compute_remat_cost(reg); // 0 = cannot remat
(cost > 0).then(|| (reg, cost))
});
// Phase 6: SELECT BY COST-BENEFIT (cheapest first)
candidates.sort_by_key(|(_, cost)| *cost); // selection sort
let mut final_list = vec![];
for (reg, cost) in candidates {
if cost > nv_remat_block_single_cost { break; } // default 10
let width = if reg_class_size(reg) > 32 { 2 } else { 1 };
final_list.push(reg);
if final_list.len() >= excess { break; }
}
log("Really Final Pull-in: {final_list.len()} ({total_cost})");
// Phase 7: EXECUTE REMATERIALIZATION
for reg in &final_list {
clear_from_liveout(bb, reg); // sub_217F620
}
bb.pressure -= final_list.len();
propagate_backward(bb, &final_list); // sub_2185250
// Clone defining instructions at use sites
// sub_21810D0 replaces register references
changed = true;
}
overall_changed |= changed;
if !changed { break; }
}
// (D) DEAD INSTRUCTION REMOVAL -- cascading deletion
remove_dead_instructions(); // sub_217DA10
overall_changed
}
MULTIDEF detection (sub_217E810): Returns the defining instruction if the register has exactly one non-dead, non-debug definition. Rejects instructions with hazardous descriptor flags (desc->flags & 0x3F80), opcodes in the non-rematerializable set (memory ops 534-609, texture ops 680-681, atomics 817-832, barriers 2913-2918, surface ops 3281-3287, 3449-3454, large MMA blocks 4423-4447), and instructions with tied extra defs.
Recursive pullability (sub_2181550): Walks the operand chain up to depth 50, checking each operand register against the non-pullable set and the MULTIDEF oracle. All operands in the chain must be single-def, safe-opcode, and themselves pullable.
Cost model: sub_2183E30 computes the clone cost of rematerializing a register. Load instructions cost nv-remat-block-load-cost (default 10). Instructions in loops are penalized by nv-remat-block-loop-cost-factor (default 20x). Double-wide registers (class size > 32) count as 2 for pressure and have 2x cost.
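The cost model can be sketched with the documented knob defaults. This is a simplified reconstruction: the is_load / in_loop classification stands in for the real MachineInstr and loop-info queries inside sub_2183E30:

```python
LOAD_COST = 10        # nv-remat-block-load-cost default
LOOP_FACTOR = 20      # nv-remat-block-loop-cost-factor default
SINGLE_COST_CAP = 10  # nv-remat-block-single-cost default

def remat_cost(is_load, in_loop, reg_class_bits, base_cost=1):
    """Clone cost of rematerializing one value (0 would mean 'cannot remat')."""
    cost = LOAD_COST if is_load else base_cost
    if in_loop:
        cost *= LOOP_FACTOR       # cloning into a loop repeats the work
    if reg_class_bits > 32:
        cost *= 2                 # double-wide registers count and cost double
    return cost

def accepted(cost):
    # Phase 6 gate: only candidates at or below the per-value cap are pulled in.
    return 0 < cost <= SINGLE_COST_CAP
```

Under these defaults a straight-line load sits exactly at the cap (cost 10, accepted), while the same load inside a loop costs 200 and is rejected -- which matches the pass's bias toward cheap, loop-free recomputation.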
Machine Register Pressure Analysis (sub_21EAA00) -- MRPA
Registration: sub_21EAA00 at 0x21EAA00, pass name "Register pressure analysis on Machine IRs", pass ID "machine-rpa". Main analysis body: sub_21EEB40 (68KB). Incremental updater: sub_2E5A4E0 (48KB). Backend variant: sub_1E00370 (78KB).
MRPA is NVIDIA's custom analysis pass that provides per-basic-block register pressure data. Unlike LLVM's stock RegisterPressure tracking (which is tightly coupled to the scheduler), MRPA is consumed by multiple clients: RP-aware MachineCSE, instruction scheduling, and the block rematerialization pass.
Architecture:
The MRPA system has two modes:
- Full recomputation (sub_21EEB40): walks every instruction in every basic block, tracking register births (defs) and deaths (last uses), and recording the peak pressure per register class per block.
- Incremental update (sub_2E5A4E0): when a single instruction is moved or deleted (e.g., by MachineCSE), MRPA updates the affected blocks' pressure without rescanning the entire function.
Incremental update algorithm (sub_2E5A4E0):
fn mrpa_incremental_update(context, bb, instruction_delta) {
// DenseMap hash: (ptr >> 9) ^ (ptr >> 4)
// Empty sentinel: -8, Tombstone: -16
// Minimum 64 buckets, always power-of-2
// 1. Build worklist of affected BBs via DFS
let worklist = dfs_from(bb, context.visited_set);
// 2. For each BB: create/update tracking entry
for bb in worklist {
let entry = context.pressure_map.get_or_insert(bb);
// 3. Filter schedulable instructions via sub_2E501D0
for mi in bb.instrs().filter(schedulable) {
// 4. For each virtual register operand (40-byte entries):
for operand in mi.operands() {
sub_2EBEF70(operand); // find existing rename mapping
sub_2EBEE10(operand); // query register info
sub_2EBE820(operand); // attempt rename if profitable
sub_2EBF120(operand); // free old register after rename
}
// 5. Check register class constraints via sub_E922F0
// 6. Validate pressure feasibility via sub_2E4F9C0
}
// 7. Erase unprofitable instructions via sub_2E88E20
}
}
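The DenseMap layout described in the comments -- hash (ptr >> 9) ^ (ptr >> 4), empty/tombstone sentinels -8/-16, power-of-two bucket counts with a 64-bucket floor -- behaves like this minimal sketch. The quadratic probing sequence is an assumption based on stock LLVM DenseMap behavior, not recovered from the binary:

```python
EMPTY, TOMBSTONE = -8, -16

def dm_hash(ptr):
    # Recovered hash: mixes the pointer at two shift widths.
    return ((ptr >> 9) ^ (ptr >> 4)) & 0xFFFFFFFF

class DenseMapSketch:
    def __init__(self, nbuckets=64):
        # Minimum 64 buckets, always a power of two (per the recovered comments).
        assert nbuckets >= 64 and nbuckets & (nbuckets - 1) == 0
        self.keys = [EMPTY] * nbuckets
        self.vals = [None] * nbuckets
        self.mask = nbuckets - 1

    def _probe(self, key):
        # Quadratic probing: with power-of-two sizes this visits every bucket.
        i, step = dm_hash(key) & self.mask, 1
        while self.keys[i] not in (EMPTY, TOMBSTONE, key):
            i = (i + step) & self.mask
            step += 1
        return i

    def insert(self, key, val):
        i = self._probe(key)
        self.keys[i], self.vals[i] = key, val

    def get(self, key):
        i = self._probe(key)
        return self.vals[i] if self.keys[i] == key else None
```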
Verification: When verify-update-mcse is enabled (qword_501F8A8, default OFF), MRPA runs a full recomputation after every incremental update and compares results. Mismatch triggers: "Incorrect RP info from incremental MRPA update" via sub_C64ED0. The print-verify knob (qword_501F7C8) controls whether detailed per-register-class diagnostic output is printed on mismatch.
Diagnostic output (sub_21E9A60): The companion pass extra-machineinstr-printer at sub_21E9E80 prints: "Max Live RRegs: {n}\tPRegs: {m}\nFunction Size: {s}" for each function, providing per-function register pressure statistics for tuning.
LDG Transform (sub_21F2780) -- Read-Only Data Cache Load Transformation
Registration: sub_21F2780 at 0x21F2780, pass name "Ldg Transformation", pass ID "ldgxform". Transformation body: sub_21F2C80 (19KB). Vector splitting engine: sub_21F3A20 (44KB).
This pass transforms qualifying global memory loads into ld.global.nc (LDG) instructions, routing them through the read-only texture cache (L1 on Kepler+, unified L1/tex on Maxwell+). The transformation is profitable for read-only data because the texture cache has separate bandwidth from the L1 data cache, effectively doubling memory throughput for qualifying loads.
Algorithm:
fn ldgxform(MF: &mut MachineFunction) -> bool {
let mut changed = false;
for mi in MF.all_instrs() {
if !is_global_load(mi) { continue; }
if is_volatile(mi) { continue; }
if !pointer_is_readonly(mi.address_operand()) { continue; }
// Replace ld.global with ld.global.nc (LDG)
mi.set_opcode(ldg_variant(mi.opcode()));
// Split wide loads if necessary
if load_width(mi) > hardware_max_ldg_width() {
// sub_21F2C80: LDG split transformation
// Tags: ".ldgsplit", ".load", ".ldgsplitinsert"
let (lo, hi) = split_wide_load(mi);
// Insert: lo = ldg.64 [addr]
// hi = ldg.64 [addr + 8]
// result = INSERT_SUBREG lo, hi
changed = true;
}
changed = true;
}
changed
}
Vector splitting (sub_21F3A20, 44KB): This is the third-largest function in the 0x21F range. NVPTX supports limited native vector widths (typically .v2 and .v4 of 32-bit elements). When wider vectors (e.g., v8f32, v16f16) appear, this engine splits them into legal widths. Operations handled:
- vecBitCast: bitcast between vector types
- splitVec: split a vector into sub-vectors
- extractSplitVec / insertSplitVec: element access on split vectors
- splitVecGEP: GEP computation on split vector elements
The split width depends on TargetOpt.HasLDG (stored at target options offset 5, extracted from p2h-01 analysis). When LDG is available, 128-bit loads (LDG.128) are preferred, resulting in .v4.b32 patterns.
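The chunking can be sketched as follows. The 16-byte chunk (LDG.128, emitted as .v4.b32) matches the description above; the 8-byte fallback when LDG is unavailable is an assumption for illustration:

```python
def split_wide_load(total_bytes, has_ldg=True, elem_bytes=4):
    """Split one wide vector load into legal-width pieces: (offset, size, ptx_suffix)."""
    chunk = 16 if has_ldg else 8   # LDG.128 preferred when TargetOpt.HasLDG is set
    pieces, off = [], 0
    while off < total_bytes:
        size = min(chunk, total_bytes - off)
        suffix = f".v{size // elem_bytes}.b32" if size % elem_bytes == 0 else ".b8"
        pieces.append((off, size, suffix))
        off += size
    return pieces

# A v8f32 load (32 bytes) splits into two .v4.b32 (LDG.128) pieces:
pieces = split_wide_load(32)
```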
NVPTXMem2Reg (sub_21F9920) -- Machine-Level Mem2Reg
Registration: sub_21F9920 at 0x21F9920, pass name "Mem2Reg on Machine Instructions to remove local stack objects", pass ID "nvptx-mem2reg". Main body: sub_21FA880 (22KB), engine: sub_21FC920 (33KB). Controlled by byte_4FD25C0 (inverted by nv-disable-mem2reg, default: enabled).
Standard LLVM mem2reg operates on LLVM IR alloca instructions. This NVIDIA-custom pass operates on MachineInstr -- specifically on ld.local / st.local pairs that access __local_depot frame slots. After register allocation, some values that were spilled to .local memory can be promoted back to virtual registers if their access pattern is simple enough (single def, multiple uses, no aliasing stores).
Algorithm:
fn nvptx_machine_mem2reg(MF: &mut MachineFunction) -> bool {
if nv_disable_mem2reg { return false; } // byte_4FD25C0
let mut changed = false;
for frame_idx in MF.frame_info().stack_objects() {
if !is_local_depot_slot(frame_idx) { continue; }
// Collect all loads and stores to this frame slot
let stores = find_stores_to(MF, frame_idx);
let loads = find_loads_from(MF, frame_idx);
if stores.len() != 1 { continue; } // must be single-def
let store = stores[0];
let src_reg = store.source_register();
// Check: no aliasing stores between def and uses
// Check: store dominates all loads
if !dominates_all(store, &loads) { continue; }
// Promote: replace all ld.local with the source register
for load in &loads {
replace_load_with_reg(load, src_reg);
load.erase_from_parent();
}
store.erase_from_parent();
MF.frame_info().remove_object(frame_idx);
changed = true;
}
changed
}
This pass is positioned in addPostRegAlloc(), meaning it runs after the greedy register allocator has already assigned slots. It acts as a cleanup: register allocation may have conservatively spilled values that turn out to be unnecessary after coalescing and copy propagation eliminate intermediate uses.
GenericToNVVM (sub_215DC20) -- Address Space Normalization
Registration: sub_215DC20 at 0x215DC20, pass name "Ensure that the global variables are in the global address space", pass ID "generic-to-nvvm". Pass descriptor: 80-byte allocation. Factory: sub_215D530 (allocates 320-byte state with two 128-bucket DenseMaps). New PM variant: sub_305ED20.
CUDA and LLVM IR use address space 0 (generic) as the default for globals, but NVPTX requires globals in address space 1. This pass rewrites every GlobalVariable in address space 0 to address space 1, inserting addrspacecast instructions at all use sites.
Algorithm:
fn generic_to_nvvm(M: &mut Module) -> bool {
let mut gv_map = DenseMap::new(128); // old -> new Value mapping
let mut const_map = DenseMap::new(128); // old -> new Constant mapping
for gv in M.globals().filter(|g| g.address_space() == 0) {
// 1. Clone to address space 1
let new_gv = GlobalVariable::new(
gv.value_type(), gv.is_constant(), gv.linkage(),
gv.initializer(), gv.name(), /*addrspace=*/ 1
);
new_gv.set_alignment(gv.alignment());
// 2. Insert addrspacecast(1 -> 0) at each use
let cast = ConstantExpr::addrspace_cast(new_gv, gv.type());
// 3. Replace all uses
gv.replace_all_uses_with(cast);
// 4. Track in map and erase original
gv_map.insert(gv, new_gv);
gv.erase_from_parent();
}
// Cleanup: sub_215D780 iterates gv_map, properly ref-counting Values
cleanup_gv_map(&gv_map);
!gv_map.is_empty()
}
NVPTXProxyRegErasure (sub_21DA810) -- Redundant cvta.to.local Removal
Registration: sub_21DA810 at 0x21DA810, pass name "NVPTX optimize redundant cvta.to.local instruction".
This late post-RA pass removes cvta.to.local instructions that are left over from address space lowering. After frame layout is complete, local memory addresses are known, and cvta.to.local (which converts a generic pointer to a .local pointer) is redundant when the address is already known to be in .local space. The pass is simple: scan for cvta.to.local MachineInstrs, verify the source is already a .local address, replace uses with the source operand, delete the cvta.
NVPTXAssignValidGlobalNames (sub_21BCD80) -- PTX Name Sanitization
Registration: sub_21BCD80 at 0x21BCD80, pass name "Assign valid PTX names to globals", pass ID "nvptx-assign-valid-global-names".
PTX has stricter naming rules than LLVM IR. Characters like @, $, . (in certain positions), and Unicode are illegal in PTX identifiers. This pass walks all GlobalValues in the module and replaces illegal characters with safe alternatives (typically _). It also handles name demangling artifacts and ensures the final names are unique after sanitization.
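The described behavior can be sketched as a replace-then-uniquify pass. The exact legal-character set and the uniquing scheme used by cicc are assumptions; this follows the rules stated above ( @, $, etc. become _ , and sanitized names must stay unique):

```python
import re

def sanitize_ptx_name(name, taken):
    """Map a symbol name to a PTX-legal identifier, uniquifying on collision."""
    safe = re.sub(r"[^A-Za-z0-9_]", "_", name)   # replace illegal characters
    if safe and safe[0].isdigit():
        safe = "_" + safe                        # identifiers must not start with a digit
    candidate, n = safe, 0
    while candidate in taken:                    # collision after sanitization
        n += 1
        candidate = f"{safe}_{n}"
    taken.add(candidate)
    return candidate
```

Two distinct IR names that collapse to the same sanitized string (e.g. foo@bar and foo$bar) must not collide in the PTX module, hence the suffix loop.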
NVPTXImageOptimizer (sub_21BCF10) -- Texture/Surface Optimization
Registration: sub_21BCF10 at 0x21BCF10, pass name "NVPTX Image Optimizer". Type validation helper: sub_21DD1A0 (16KB).
This pre-emission pass optimizes texture and surface access patterns. It validates image type consistency for tex, suld, sust, and suq operations, emitting errors for mismatches: "Invalid image type in .tex", "Invalid image type in .suld", "Invalid image type in suq.", "Invalid image type in .sust". The pass coalesces related texture operations when they access the same texture handle with compatible coordinates and can be merged into wider vector fetches.
NVPTXReplaceImageHandles (sub_21DBEA0) -- Image Handle Lowering
Registration: sub_21DBEA0 at 0x21DBEA0, pass name "NVPTX Replace Image Handles".
Replaces IR-level texture/surface handle references (which are LLVM Value pointers to @texture_handle globals) with PTX-level .tex / .surf declarations and integer handle indices. This is a pre-emission pass that bridges the gap between LLVM IR's opaque handle model and PTX's explicit texture declaration model.
AllocaHoisting (sub_21BC7D0) -- Entry Block Alloca Hoisting
Registration: sub_21BC7D0 at 0x21BC7D0, pass name "Hoisting alloca instructions in non-entry blocks to the entry block", pass ID "alloca-hoisting". Registration helper: sub_21BC5A0.
PTX requires that all local memory declarations be hoisted to the function entry. This pass scans all basic blocks for alloca instructions and moves them to the entry block. This enables the frame layout pass (PrologEpilogInserter) to assign fixed offsets to all stack objects -- a requirement because PTX emits .local .align N .b8 __local_depotX[SIZE] at the function prologue and all local accesses are indexed from this single base.
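The fixed-offset layout that hoisting enables can be sketched as a simple bump allocator over the depot. Declaration-order packing and the align-to-max-alignment total are simplifying assumptions about the frame layout pass:

```python
def layout_local_depot(objects):
    """Assign offsets in __local_depot for (size, align) stack objects.
    Returns (per-object offsets, total depot size)."""
    offsets, cursor, max_align = [], 0, 1
    for size, align in objects:
        cursor = (cursor + align - 1) & ~(align - 1)  # align cursor up
        offsets.append(cursor)
        cursor += size
        max_align = max(max_align, align)
    depot_size = (cursor + max_align - 1) & ~(max_align - 1)
    return offsets, depot_size

# e.g. an i32, an f64 slot, and an i16 -> offsets 0, 8, 16 in a 24-byte depot
layout = layout_local_depot([(4, 4), (8, 8), (2, 2)])
```

All subsequent ld.local / st.local accesses are then emitted as fixed indices from the single __local_depotX base.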
ParamOpt (sub_2203290) -- Parameter Load Optimization
Registration: sub_2203290 at 0x2203290, pass name "Optimize NVPTX ld.param", pass ID "param-opt".
NVPTX-custom pass that optimizes ld.param instructions generated during kernel argument passing. When a kernel parameter is loaded multiple times (common when the same argument is used in different basic blocks), this pass eliminates redundant loads by propagating the first load's result to subsequent uses. Related knob: remat-load-param ("Support remating const ld.param that are not exposed in NVVM IR").
NVPTXTruncOpts (sub_22058E0) -- i16 Truncation Optimization
Registration: sub_22058E0 at 0x22058E0, pass name "Optimize redundant ANDb16ri instrunctions" [sic], pass ID "nvptx-trunc-opts".
When LLVM lowers trunc i32 to i16 operations, the NVPTX backend emits an AND.b16 with mask 0xFFFF to ensure the high bits are zero. In many cases this AND is redundant -- the producing instruction already guarantees a 16-bit result. This pass pattern-matches ANDb16ri instructions with the 0xFFFF immediate and removes them when the source provably fits in 16 bits.
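The redundancy test can be modeled with known-bits information. A minimal sketch, assuming a 32-bit known-zero mask as the input; this is illustrative, not the decompiled predicate:

```rust
// Sketch of the redundancy check behind nvptx-trunc-opts: an AND with mask
// 0xFFFF is removable when the producing instruction already guarantees the
// high bits are zero.
fn and_ffff_redundant(known_zero_mask: u32) -> bool {
    // known_zero_mask has a bit set for every bit proven to be 0.
    // If all bits above bit 15 are proven zero, masking with 0xFFFF
    // changes nothing and the AND can be deleted.
    known_zero_mask & 0xFFFF_0000 == 0xFFFF_0000
}

fn main() {
    // An 8-bit zero-extended value: bits 8..31 known zero -> AND removable.
    assert!(and_ffff_redundant(0xFFFF_FF00));
    // Nothing known about the value: the AND must stay.
    assert!(!and_ffff_redundant(0));
}
```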
RP-Aware MachineCSE (NVIDIA-Modified machine-cse)
Stock LLVM MachineCSE eliminates redundant machine instructions by matching instruction patterns within dominance regions. NVIDIA adds three extensions via ctor_302_0 (0x4FEB70, 7.8KB, 14 strings):
RP-aware CSE (rp-aware-mcse): Before eliminating a common subexpression, queries MRPA (sub_2E5A4E0) for the current register pressure. If eliminating the CSE candidate would increase pressure beyond the target (because the shared result must stay live longer), the CSE is suppressed. This prevents the classic GPU problem where CSE reduces instruction count but increases register pressure, reducing occupancy.
Predicate-aware CSE (pred-aware-mcse): Extends RP awareness to predicate registers (PTX %p class). Predicate registers are a scarce resource (maximum 7 per thread on most architectures), so predicate pressure is tracked separately from general-purpose register pressure.
Copy-prop CSE (copy-prop-mcse): Embeds copy propagation within the CSE framework. When CSE eliminates an instruction, the resulting COPY instructions can often be propagated immediately rather than waiting for the separate MachineCopyPropagation pass.
Incremental MRPA integration: The MCSE pass uses qword_501F988 (incremental-update-mcse, default ON) to incrementally update MRPA as CSE decisions are made, avoiding full recomputation per CSE candidate.
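The RP-aware gating decision reduces to a simple comparison against the pressure target. A minimal sketch with hypothetical field names and numbers; the real pass queries MRPA per region:

```rust
// Minimal model of the rp-aware-mcse gate: a CSE candidate is accepted only
// if keeping the shared result live does not push register pressure past
// the target.
struct PressureState {
    current: u32, // current max pressure in the region (from MRPA)
    target: u32,  // pressure target (e.g. nv-remat-default-max-reg = 70)
}

fn allow_cse(state: &PressureState, extra_live: u32) -> bool {
    state.current + extra_live <= state.target
}

fn main() {
    let s = PressureState { current: 68, target: 70 };
    assert!(allow_cse(&s, 2));  // fits within the target: CSE proceeds
    assert!(!allow_cse(&s, 3)); // would exceed the target: CSE suppressed
}
```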
MachinePipeliner (SMS) Detail
The Swing Modulo Scheduler at sub_3563190 performs software pipelining -- overlapping successive loop iterations to hide latency. It operates on a single loop body at the MachineInstr level:
- DAG construction: builds a data dependency graph with sub_2F97F60, computes latencies via sub_3559990, adds edges via sub_3542B20.
- MII computation: RecMII (recurrence-based) via sub_354CBB0, ResMII (resource-based) via sub_35449F0. MII = max(RecMII, ResMII).
- Early exits: MII == 0 is invalid; MII > SwpMaxMii (default 27, -pipeliner-max-mii) aborts.
- II search: starts at MII, tries up to pipeliner-ii-search-range (default 10, qword_503E428) consecutive II values. First valid schedule wins.
- Schedule construction: ASAP via sub_354BFF0, ALAP via sub_354BFF0, topological sort, core SMS node placement via sub_354C3A0, then finalization.
- Kernel generation: three code generation backends selected by priority -- annotation-only (pipeliner-annotate-for-testing), MVE-based (pipeliner-mve-cg, default enabled), and experimental peeling (pipeliner-experimental-cg).
The pipeliner stores its schedule context as a 616-byte (0x268) structure with four SmallVectors and per-BB data at 256-byte stride. Maximum pipeline stages: SwpMaxStages (default 3, -pipeliner-max-stages).
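The MII computation and II search described above can be sketched as follows; `schedule_at_ii` stands in for the real SMS scheduling attempt and is an assumption of this sketch:

```rust
// MII = max(RecMII, ResMII), then try consecutive II values starting at MII.
fn min_initiation_interval(rec_mii: u32, res_mii: u32) -> u32 {
    rec_mii.max(res_mii)
}

fn find_ii(mii: u32, max_mii: u32, search_range: u32,
           schedule_at_ii: impl Fn(u32) -> bool) -> Option<u32> {
    if mii == 0 || mii > max_mii {
        return None; // MII == 0 is invalid; MII > SwpMaxMii aborts
    }
    // Try up to `search_range` consecutive II values; first valid wins.
    (mii..mii + search_range).find(|&ii| schedule_at_ii(ii))
}

fn main() {
    assert_eq!(min_initiation_interval(4, 6), 6);
    // Pretend only II >= 8 admits a valid schedule.
    assert_eq!(find_ii(6, 27, 10, |ii| ii >= 8), Some(8));
    // MII beyond SwpMaxMii (default 27) aborts immediately.
    assert_eq!(find_ii(30, 27, 10, |_| true), None);
}
```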
Core scheduling pipeline (10 sequential calls):
| Step | Function | Purpose |
|---|---|---|
| 1 | sub_35476E0 | DAG construction / dependency analysis |
| 2 | sub_35523F0 | Recurrence detection / RecMII computation |
| 3 | sub_35546F0 | Resource usage / ResMII computation |
| 4 | sub_3543340 | MII = max(RecMII, ResMII) finalization |
| 5 | sub_35630A0 | Node ordering / priority assignment |
| 6 | sub_35568E0 | Schedule table initialization |
| 7 | sub_35433F0 | Pre-scheduling transforms |
| 8 | sub_3557A10 | Instruction ordering/selection (heuristic) |
| 9 | sub_354A760 | Schedule finalization / modulo expansion |
| 10 | sub_355F610 | ScheduleDAGMILive integration (64KB) |
Instruction selection heuristic (sub_3557A10):
Priority ordering: (1) deeper instructions first (offset 240 = latency/depth), (2) target priority table at a1+3944 (16-byte entries: [start, end, priority, window_width]), (3) narrower schedule windows first. Latency recomputation via sub_2F8F5D0 during comparison.
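The three-level comparison can be sketched as a comparator; field names are illustrative stand-ins for the offsets described above:

```rust
use std::cmp::Ordering;

// Sketch of the SMS selection heuristic: deeper instructions first, then the
// target priority table value, then narrower schedule windows.
struct Candidate {
    depth: u32,          // offset 240: latency/depth
    table_priority: u32, // from the 16-byte priority table entries
    window_width: u32,   // schedule window width (ALAP - ASAP)
}

fn pick_first(a: &Candidate, b: &Candidate) -> Ordering {
    b.depth.cmp(&a.depth)                              // deeper first
        .then(b.table_priority.cmp(&a.table_priority)) // higher table priority
        .then(a.window_width.cmp(&b.window_width))     // narrower window first
}

fn main() {
    let wide = Candidate { depth: 5, table_priority: 1, window_width: 3 };
    let narrow = Candidate { depth: 5, table_priority: 1, window_width: 2 };
    // Equal depth and priority: the narrower window schedules first.
    assert_eq!(pick_first(&narrow, &wide), Ordering::Less);
}
```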
Error messages:
- "Invalid Minimal Initiation Interval: 0" -- MII computation returned zero
- "Minimal Initiation Interval too large: MII > SwpMaxMii. Refer to -pipeliner-max-mii." -- loop is too complex
- "Unable to find schedule" -- no valid II found within search range
- "No need to pipeline - no overlapped iterations in schedule." -- numStages == 0
- "Too many stages in schedule: numStages > SwpMaxStages. Refer to -pipeliner-max-stages." -- pipeline depth exceeded
PrologEpilogInserter (sub_35B1110) -- .local Frame Layout
Address: sub_35B1110 (68KB, 2388 decompiled lines). Stack frame: 0x490 bytes of local state. This is NVIDIA's monolithic PEI for PTX. Unlike a traditional PEI that emits push/pop sequences and adjusts %rsp, this one computes .local memory frame offsets.
10-phase structure:
| Phase | Lines | Description |
|---|---|---|
| 1 | 443-490 | Target/subtarget retrieval, initial setup |
| 2 | 491-566 | Callee-saved register determination |
| 3 | 567-730 | Pre-pass: collect fixed objects from frame info |
| 4 | 733-1070 | Stack object offset assignment (main layout engine) |
| 5 | 1078-1600 | General local variable layout |
| 6 | 1688-1795 | Frame-pointer stack area |
| 7 | 1803-1872 | Prolog/epilog instruction insertion per BB |
| 8 | 1873-2132 | Scavenger / frame-index elimination |
| 9 | 2270-2304 | Stack-size warning & diagnostic reporting |
| 10 | 2305-2388 | Cleanup & deallocation |
Frame object record (40 bytes):
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Byte offset in .local memory (assigned by PEI) |
| +8 | 8 | Object size in bytes |
| +16 | 1 | Alignment (log2) |
| +20 | 1 | isDead flag (skip if set) |
| +32 | 1 | isSpillSlot flag |
| +36 | 1 | Category byte (0/1/2/3) |
Stack layout algorithm (Phase 4):
fn assign_frame_offsets(MF: &MachineFunction, frame: &mut FrameInfo) {
let grows_neg = frame.stack_direction == 1;
let mut offset = frame.initial_offset;
let mut max_align = frame.max_alignment;
// Fixed objects first
for obj in frame.fixed_objects() {
if obj.is_dead { continue; }
let align = 1 << obj.log2_align;
offset = align_to(offset, align);
obj.offset = if grows_neg { -offset } else { offset };
offset += obj.size;
max_align = max(max_align, align);
}
// Callee-saved register region
for csr in frame.callee_saved_range() {
if csr.is_dead || csr.size == -1 { continue; }
let align = 1 << csr.log2_align;
offset = align_to(offset, align);
csr.offset = if grows_neg { -offset } else { offset };
offset += csr.size;
}
// General locals: three category buckets, each via sub_35B0830
for category in [1, 2, 3] {
for obj in frame.objects_of_category(category) {
let align = 1 << obj.log2_align;
offset = align_to(offset, align);
obj.offset = if grows_neg { -offset } else { offset };
offset += obj.size;
}
}
frame.stack_size = offset;
}
The final PTX emission (sub_2158E80) uses these offsets to emit: .local .align N .b8 __local_depotX[SIZE]; at the function prologue, and ld.local / st.local instructions reference [%SPL + offset] where %SPL is the local stack pointer register.
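The bump-and-align layout from the pseudocode above can be exercised standalone. A self-contained sketch (general-locals path only, no categories or fixed objects):

```rust
// Each object is aligned up, placed at the running offset, and the offset
// advances by the object's size; the final offset is the depot size.
fn align_to(offset: u64, align: u64) -> u64 {
    (offset + align - 1) & !(align - 1) // align must be a power of two
}

fn layout(objects: &[(u64, u64)]) -> (Vec<u64>, u64) {
    // objects: (size, align) pairs; returns (per-object offsets, frame size)
    let mut offset = 0u64;
    let mut offsets = Vec::new();
    for &(size, align) in objects {
        offset = align_to(offset, align);
        offsets.push(offset);
        offset += size;
    }
    (offsets, offset)
}

fn main() {
    // A 1-byte flag, an 8-byte-aligned i64, then a 4-byte i32.
    let (offs, total) = layout(&[(1, 1), (8, 8), (4, 4)]);
    assert_eq!(offs, vec![0, 8, 16]);
    assert_eq!(total, 20); // -> .local .align 8 .b8 __local_depot0[20]
}
```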
ScheduleDAGMILive (sub_355F610) -- Post-RA Instruction Ordering
Address: sub_355F610 (64KB). This is the post-RA machine instruction scheduler, consuming either the pipeliner's output or standalone scheduling regions.
Data structures:
- SUnit (Scheduling Unit): 88 bytes per instruction
- Instruction-to-node hash map: 632-byte entries
- RP tracking structure: 112 bytes (offsets 32-48: per-class pressure current, offsets 56-72: per-class pressure limits)
Scheduling flow:
- Initialize RP tracking via sub_3551AB0 (if pipeliner-register-pressure is set)
- Set per-class pressure defaults via sub_2F60A40
- Walk BB instruction list, build instruction-to-node hash map (632-byte entries)
- Compute ASAP via sub_354BFF0 -> earliest cycle per instruction
- Compute ALAP via sub_354BFF0 -> latest cycle per instruction
- Place instructions via sub_354C3A0 (returns success/failure)
- Calculate stage count: (lastCycle - firstCycle) / II
- Verify placement via sub_355C7C0
- Build stage descriptors via sub_355D7E0 (80 bytes per stage)
Machine-Level Analysis Infrastructure
Machine passes depend on a set of analysis passes that compute liveness, dominance, and frequency information over the MachineFunction representation.
| Analysis ID | Class | Description |
|---|---|---|
slot-indexes | SlotIndexesAnalysis | Assigns a dense integer index to every instruction slot in the function. All liveness computations reference slot indexes rather than instruction pointers, enabling O(log n) interval queries. |
live-intervals | LiveIntervalsAnalysis | Computes live ranges for every virtual register as a set of [start, end) slot-index intervals. The LiveRangeCalc engine (sub_2FC4FC0, 12.9KB) manages 296-byte segment entries with inline small-object buffers for endpoint, register mask, kill-set, and use-def chain data. See LiveRangeCalc. |
live-reg-matrix | LiveRegMatrixAnalysis | Tracks physical register unit interference. On NVPTX, used primarily for register-class-level pressure tracking rather than physical unit assignment. |
machine-dom-tree | MachineDominatorTreeAnalysis | Dominance tree over MachineBasicBlock graph. Required by LICM, CSE, sinking, and register allocation. |
machine-post-dom-tree | MachinePostDominatorTreeAnalysis | Post-dominance tree. Used by block placement (sub_3521FF0 stores at this+544). |
machine-loops | MachineLoopAnalysis | Loop detection on the machine CFG. Used by LICM, block placement, and the pipeliner. |
machine-block-freq | MachineBlockFrequencyAnalysis | Block frequency estimates (profile-guided or static). Block placement uses this at this+528 to drive chain construction. |
machine-branch-prob | MachineBranchProbabilityAnalysis | Branch probability data. Block placement stores at this+536. |
machine-trace-metrics | MachineTraceMetricsAnalysis | Trace-based metrics (critical path length, resource depth). Used by MachineCombiner and if-conversion. |
machine-opt-remark-emitter | MachineOptRemarkEmitterAnalysis | Optimization remark emission for machine passes. |
edge-bundles | EdgeBundlesAnalysis | Groups CFG edges into bundles for spill placement. |
spill-code-placement | SpillPlacementAnalysis | Determines optimal spill/reload points using edge bundles and frequency data. |
regalloc-evict | RegAllocEvictionAdvisorAnalysis | Advises the greedy allocator on which live range to evict. |
regalloc-priority | RegAllocPriorityAdvisorAnalysis | Assigns allocation priority to live ranges. |
virtregmap | VirtRegMapAnalysis | Maps virtual registers to their assigned physical registers (or spill slots). |
machine-rpa ★ | sub_21EAA00 | NVIDIA-custom machine register pressure analysis. Provides per-BB pressure data consumed by RP-aware MCSE, scheduling, and rematerialization. |
Machine Pass Knobs Summary
NVIDIA Target Pass Enable/Disable
| Knob | Type | Default | Effect |
|---|---|---|---|
enable-nvvm-peephole | bool | true | Enable NVPTX-specific peephole optimizer |
nvptx-enable-machine-sink | bool | false | Enable MachineSink on NVPTX (off by default due to pressure concerns) |
enable-mlicm | bool | (opt-level dependent) | Enable MachineLICM on NVPTX |
enable-mcse | bool | (opt-level dependent) | Enable MachineCSE on NVPTX |
nv-disable-mem2reg | bool | false | Disable machine-level mem2reg |
nv-disable-remat | bool | false | Disable all NVIDIA rematerialization passes |
enable-new-nvvm-remat | bool | (varies) | Enable new NVVM remat, disable old |
usedessa | int | 2 | Select deSSA method for PHI elimination |
cssa-coalesce | int | (varies) | Controls PHI operand coalescing aggressiveness |
Stock LLVM Codegen Controls
| Knob | Type | Default | Effect |
|---|---|---|---|
disable-machine-dce | bool | false | Disable dead machine instruction elimination |
disable-machine-licm | bool | false | Disable pre-RA MachineLICM |
disable-postra-machine-licm | bool | false | Disable post-RA MachineLICM |
disable-machine-cse | bool | false | Disable MachineCSE |
disable-machine-sink | bool | false | Disable MachineSink (NVPTX also gates via nvptx-enable-machine-sink) |
disable-postra-machine-sink | bool | false | Disable post-RA MachineSink |
disable-branch-fold | bool | false | Disable BranchFolding / tail merge |
disable-tail-duplicate | bool | false | Disable post-RA tail duplication |
disable-early-taildup | bool | false | Disable pre-RA tail duplication |
disable-block-placement | bool | false | Disable MachineBlockPlacement |
disable-copyprop | bool | false | Disable MachineCopyPropagation |
disable-ssc | bool | false | Disable Stack Slot Coloring |
disable-post-ra | bool | false | Disable post-RA scheduler |
disable-early-ifcvt | bool | false | Disable early if-conversion |
disable-peephole | bool | false | Disable stock LLVM peephole optimizer |
enable-machine-outliner | enum | (varies) | disable / enable / guaranteed beneficial |
misched-postra | bool | false | Run MachineScheduler post-RA |
optimize-regalloc | bool | true | Enable optimized register allocation path |
verify-machineinstrs | bool | false | Run MachineVerifier after each pass |
NVIDIA RP-Aware MachineCSE Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
rp-aware-mcse | bool | (varies) | Enable register-pressure-aware MachineCSE |
pred-aware-mcse | bool | (varies) | Enable predicate-register-pressure-aware MCSE |
copy-prop-mcse | bool | (varies) | Enable copy propagation within MachineCSE |
incremental-update-mcse | bool | true | Incrementally update MRPA during MCSE |
verify-update-mcse | bool | false | Debug: verify incremental MRPA updates against full recomputation |
print-verify | bool | false | Debug: print detailed RP mismatch diagnostic |
cta-reconfig-aware-mrpa | bool | (varies) | CTA reconfiguration aware machine RP analysis |
NVPTXBlockRemat Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
nv-remat-block | int | 14 | Bitmask controlling remat modes (bits 0-3) |
nv-remat-max-times | int | 10 | Max iterations of the outer remat loop |
nv-remat-block-single-cost | int | 10 | Max cost per single live value pull-in |
nv-remat-block-map-size-limit | int | 6 | Map size limit for single pull-in |
nv-remat-block-max-cost | int | 100 | Max total clone cost per live value reduction |
nv-remat-block-liveout-min-percentage | int | 70 | Min liveout % for special consideration |
nv-remat-block-loop-cost-factor | int | 20 | Loop cost multiplier |
nv-remat-default-max-reg | int | 70 | Default max register pressure target |
nv-remat-block-load-cost | int | 10 | Cost assigned to load instructions |
nv-remat-threshold-for-spec-reg | int | 20 | Threshold for special register remat |
nv-dump-remat-block | bool | false | Debug dump toggle |
load-remat | bool | true | Enable load rematerialization |
Pipeliner Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
enable-pipeliner | bool | true | Enable the MachinePipeliner pass |
pipeliner-max-mii | int | 27 | Maximum Minimal Initiation Interval before abort |
pipeliner-max-stages | int | 3 | Maximum pipeline stages |
pipeliner-ii-search-range | int | 10 | Number of consecutive II values to try |
pipeliner-register-pressure | bool | false | Enable RP tracking during pipelining |
pipeliner-register-pressure-margin | int | 5 | RP margin before pipeliner backs off |
pipeliner-ignore-recmii | bool | false | Zero out RecMII, use only ResMII |
pipeliner-annotate-for-testing | bool | false | Annotate schedule without modifying code |
pipeliner-experimental-cg | bool | false | Use experimental peeling code generator |
pipeliner-mve-cg | bool | true | Use MVE code generator (default path) |
outliner-benefit-threshold | int | 1 | Minimum size in bytes for outlining candidate |
Register Pressure Target Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
reg-target-adjust | int | 0 | Adjust register pressure target (-10 to +10) |
pred-target-adjust | int | 0 | Adjust predicate register pressure target (-10 to +10) |
fca-size | int | 8 | Max size of first-class aggregates in bytes |
remat-load-param | bool | (varies) | Support remating const ld.param not exposed in NVVM IR |
cta-reconfig-aware-rpa | bool | (varies) | CTA reconfiguration aware register pressure analysis |
Function Address Map
| Address | Size | Function | Role |
|---|---|---|---|
sub_215DC20 | -- | GenericToNVVM registration | Address space normalization |
sub_215D530 | 320B state | GenericToNVVM factory | Allocates pass state with 2 DenseMaps |
sub_215D780 | -- | GenericToNVVM cleanup | GVMap iteration and Value ref-counting |
sub_2166D20 | 1.5KB | addISelPasses | Pre-ISel pass configuration |
sub_2166ED0 | 1.6KB | addPreRegAlloc | Pre-RA pass configuration |
sub_21668D0 | 1.2KB | addPostRegAlloc | Post-RA pass configuration |
sub_217D300 | -- | BlockRemat pass name | "NVPTX Machine Block Level Rematerialization" |
sub_217DBF0 | -- | BlockRemat registration | "nvptx-remat-block" |
sub_217E810 | 5.2KB | MULTIDEF detection | Single-def checker with opcode exclusion table |
sub_2181550 | ~3KB | Recursive pullability | Depth-limited chain validation (depth <= 50) |
sub_2181870 | 19KB | Second-chance heuristic | Re-evaluates rejected remat candidates |
sub_2183E30 | -- | Cost evaluator | Computes clone cost for rematerialization |
sub_2184890 | 12KB | Remat allocation helper | Simulates pressure after remat |
sub_2185250 | 17KB | Liveness propagation | Core instruction cloning/replacement engine |
sub_2186590 | -- | Max-live computation | Per-block pressure scan |
sub_2186D90 | 47KB | BlockRemat main engine | Iterative pull-in algorithm (1742 lines) |
sub_21810D0 | 9.4KB | Instruction replacement | Replaces register uses after remat |
sub_21BC5A0 | -- | AllocaHoisting name | Pass name registration |
sub_21BC7D0 | -- | AllocaHoisting registration | "alloca-hoisting" |
sub_21BCD80 | -- | ValidGlobalNames registration | "nvptx-assign-valid-global-names" |
sub_21BCF10 | -- | ImageOptimizer registration | "NVPTX Image Optimizer" |
sub_21DA810 | -- | ProxyRegErasure | Redundant cvta.to.local removal |
sub_21DB090 | -- | NVPTXPeephole registration | "nvptx-peephole" |
sub_21DB5F0 | -- | NVPTXPrologEpilog registration | "NVPTX Prolog Epilog Pass" |
sub_21DBEA0 | -- | ReplaceImageHandles registration | "NVPTX Replace Image Handles" |
sub_21DD1A0 | 16KB | Image type validation | tex/suld/sust/suq type checking |
sub_21E9A60 | 4.9KB | RP stats printer | "Max Live RRegs: " / "PRegs: " |
sub_21E9E80 | -- | ExtraMachineInstrPrinter registration | "extra-machineinstr-printer" |
sub_21EAA00 | -- | MRPA registration | "machine-rpa" |
sub_21EEB40 | 68KB | MRPA full recomputation | Per-BB pressure computation |
sub_21F2780 | -- | LdgXform registration | "ldgxform" |
sub_21F2C80 | 19KB | LDG split body | .ldgsplit / .ldgsplitinsert |
sub_21F3A20 | 44KB | Vector splitting engine | splitVec / vecBitCast / extractSplitVec |
sub_21F9920 | -- | NVPTXMem2Reg registration | "nvptx-mem2reg" |
sub_21FA880 | 22KB | Mem2Reg body | Machine-level mem2reg driver |
sub_21FC920 | 33KB | Mem2Reg engine | Promotion/replacement logic |
sub_2200150 | 78KB | DAGToDAG ISel main | Hash-table pattern matching (h = (37*idx) & (size-1)) |
sub_2203290 | -- | ParamOpt registration | "param-opt" |
sub_2204E60 | -- | Redundant move elim | "Remove redundant moves" |
sub_22058E0 | -- | TruncOpts registration | "nvptx-trunc-opts" |
sub_2E5A4E0 | 48KB | MRPA incremental updater | Incremental RP tracking for MCSE |
sub_1E00370 | 78KB | MRPA backend variant | Alternative RP tracker |
sub_35B1110 | 68KB | PrologEpilogInserter | .local frame layout (2388 lines) |
sub_3563190 | 58KB | MachinePipeliner | Swing Modulo Scheduling |
sub_355F610 | 64KB | ScheduleDAGMILive | Post-RA instruction ordering |
sub_3557A10 | -- | SMS instruction selection | Scheduling heuristic |
Global Variable Reference
| Variable | Type | Default | Role |
|---|---|---|---|
byte_4FD1980 | byte | (opt-level) | MachineLICM enable flag |
byte_4FD18A0 | byte | (opt-level) | MachineCSE enable flag |
byte_4FD1A60 | byte | (opt-level) | MachineSink enable flag |
byte_4FD25C0 | byte | (opt-level) | nvptx-mem2reg enable |
byte_4FD2160 | byte | -- | Extra ISel pass enable |
byte_4FD2E80 | byte | off | nv-dump-remat-block |
dword_4FD26A0 | dword | -- | Scheduling mode (1 = simple, else = full) |
dword_4FD3740 | dword | 10 | nv-remat-max-times |
dword_4FD3820 | dword | 14 | nv-remat-block mode bitmask |
dword_4FD33C0 | dword | 70 | nv-remat-default-max-reg (global) |
qword_501F988 | qword | 1 | incremental-update-mcse |
qword_501F8A8 | qword | 0 | verify-update-mcse |
qword_501F7C8 | qword | 0 | print-verify |
Cross-References
- SelectionDAG -- the ISel pass that produces MachineInstrs consumed by machine passes
- Register Allocation -- pressure-driven greedy allocator with NVPTX register classes
- Register Coalescing -- NVPTX-custom copy elimination framework
- PrologEpilogInserter & Frame Layout -- .local memory frame computation
- MachineOutliner -- suffix-tree-based code size reduction
- Block Placement -- profile-guided basic block ordering
- Instruction Scheduling -- MRPA, MachinePipeliner, ScheduleDAGMILive
- Rematerialization -- NVIDIA's custom machine-level remat
- NVVM Peephole -- IR-level NVVM peephole (distinct from machine-level nvptx-peephole)
- AsmPrinter & PTX Emission -- final pass: MachineInstr to PTX text
- Code Generation -- pipeline overview including ISel and DAG infrastructure
- StructurizeCFG -- mandatory CFG structurization (runs before ISel, feeds machine passes)
- Hash Infrastructure -- DenseMap hash function (ptr >> 9) ^ (ptr >> 4) used throughout MRPA
- Register Classes -- NVPTX register class definitions consumed by all machine passes
SelectionDAG & Instruction Selection
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: Target-independent DAG infrastructure: llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp, DAGCombiner.cpp, LegalizeDAG.cpp, LegalizeTypes.cpp, SelectionDAGBuilder.cpp, SelectionDAGISel.cpp. NVPTX target: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp, NVPTXISelDAGToDAG.cpp, NVPTXInstrInfo.td (LLVM 20.0.0).
LLVM version note: The target-independent SelectionDAG infrastructure at 0xF05000--0xF70000 appears to be stock LLVM 20 with no detectable NVIDIA modifications. All NVIDIA customization lives in the NVPTX target range (0x3290000--0x35FFFFF) via virtual dispatch through NVPTXTargetLowering and NVPTXDAGToDAGISel. The intrinsic lowering switch covers IDs up to 14196 (0x3774), far exceeding upstream NVPTX, which covers approximately IDs 0--300.
CICC v13.0 contains a complete NVPTX SelectionDAG backend derived from LLVM 20.0.0, with substantial NVIDIA customizations for GPU-specific lowering, the PTX .param-space calling convention, tensor core intrinsic selection, and a 343KB intrinsic lowering mega-switch covering over 200 CUDA intrinsic IDs. The SelectionDAG pipeline converts LLVM IR into machine-level PTX instructions through four major phases: type legalization, operation legalization, DAG combining, and pattern-based instruction selection.
The NVPTX SelectionDAG backend spans roughly 4MB of code across two address ranges: 0xF05000--0xF70000 for the target-independent DAG infrastructure (combining, known-bits, node management) and 0x3290000--0x35FFFFF for the NVPTX-specific lowering, instruction selection, and register allocation. The infrastructure range is stock LLVM with no detectable NVIDIA modifications; all NVIDIA customization lives in the latter range via target hooks and virtual dispatch.
| Component | Location |
|---|---|
| LowerOperation dispatcher | sub_32E3060 (111KB, 3,626 lines) |
| LowerCall (.param ABI) | sub_3040BF0 (88KB, 2,909 lines) |
| Intrinsic lowering switch | sub_33B0210 (343KB, 9,518 lines) |
| ISel::Select driver | sub_3090F90 (91KB, 2,828 lines) |
| LegalizeTypes | sub_20019C0 (348KB, 10,739 lines) |
| LegalizeOp dispatcher | sub_1FCE100 (91KB, ~100 opcodes) |
| LegalizeOp action dispatch | sub_1FFB890 (137KB, 967 cases) |
| DAG combiner visitor | sub_F20C20 (64KB) |
| DAG combiner orchestrator | sub_F681E0 (65KB) |
| DAGCombiner::combine (NVPTX) | sub_3425710 (142KB, "COVERED"/"INCLUDED" tracing) |
| PerformDAGCombine (NVPTX) | sub_33C0CA0 (62KB) |
| DAG combine: post-legalize | sub_32EC4F0 (92KB) |
| computeKnownBits (NVPTX) | sub_33D4EF0 (114KB, 3,286 lines) |
| Inline asm lowering | sub_2079C70 (83KB, 2,797 lines) |
| Inline asm constraints (NVPTX) | sub_338BA40 (79KB) |
| NVPTXTargetLowering init | sub_3056320 (45KB, constructor) |
| Type legalization setup | sub_3314670 (73KB, table population) |
| Upstream | lib/CodeGen/SelectionDAG/, lib/Target/NVPTX/NVPTXISelLowering.cpp |
Complexity
Let N = number of DAG nodes and E = number of edges (use-def relationships). The SelectionDAG pipeline runs eight sequential phases. SelectionDAGBuilder converts IR instructions to DAG nodes in O(I) where I = LLVM IR instruction count. Each DAG Combiner pass is worklist-driven: O(N) nodes are visited, each matched against pattern rules in O(1) via opcode dispatch; ReplaceAllUsesWith is O(U) per node where U = uses. The three combiner passes total O(3 * N * U_avg).

Type legalization (sub_20019C0, 348KB) iterates until all types are legal -- each iteration processes O(N) nodes, and convergence is guaranteed in O(T) iterations where T = max type-promotion depth (typically 2--3 for GPU types). Operation legalization (sub_1FFB890, 137KB) visits each node once: O(N). The action table lookup is O(1) via the 2D array at TLI + 259 * VT + opcode + 2422.

ISel pattern matching (sub_3090F90, 91KB) visits each node once in topological order: O(N). Per-node matching is O(P) where P = number of patterns for that opcode, but NVPTX patterns are organized by opcode-indexed tables, making this effectively O(1) for common opcodes. The DAG worklist uses ((addr >> 9) ^ (addr >> 4)) & (cap - 1) hashing for O(1) amortized membership tests. Overall: O(I + N * U_avg * 3 + N * T + N), which simplifies to O(N * U_avg) in practice. The intrinsic lowering mega-switch (343KB, 200+ IDs) adds O(1) per intrinsic call via the jump table, not O(200).
Pipeline Position
The SelectionDAG phases execute in a fixed sequence after SelectionDAGBuilder (sub_2081F00) converts LLVM IR into an initial DAG:
- SelectionDAGBuilder -- IR-to-DAG lowering, visitor dispatch at sub_2065D30
- DAG Combiner (sub_F681E0 / sub_F20C20) -- initial algebraic simplification
- DAGTypeLegalizer (sub_20019C0) -- iterates to fixpoint until all types are legal; see Type Legalization
- DAG Combiner -- second pass after type legalization
- LegalizeDAG (sub_1FCE100 dispatcher, sub_1FFB890 action engine) -- legalizes operations on legal types
- DAG Combiner -- third pass after operation legalization
- NVPTXTargetLowering::PerformDAGCombine (sub_33C0CA0) -- NVPTX-specific post-legalize combines
- Instruction Selection (sub_3090F90) -- see ISel Patterns
Type Legalization
Type legalization (sub_20019C0) is the largest single function in the SelectionDAG pipeline at 348KB. Unlike upstream LLVM, which splits legalization across LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, and LegalizeVectorTypes.cpp, NVIDIA ships all type-legalization logic inlined into a single monolithic dispatch. This may be an LTO artifact or a deliberate choice for branch-prediction locality.
The master switch dispatches on approximately 50 ISD opcodes. Type legalization actions follow the standard LLVM model:
- Promote -- widen small types to register width (e.g., i8 to i32) via ANY_EXTEND/ZERO_EXTEND, perform the operation, then TRUNCATE the result.
- Expand -- split wide types into halves (e.g., i128 into two i64 values) using shift-and-OR sequences.
- Soften -- emulate unsupported FP types through integer libcall sequences.
- Scalarize/Split Vector -- decompose illegal vector types into scalar element operations.
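The Promote and Expand strategies can be modeled on plain integers. A sketch, not the binary's code: an i8 add promoted through i32, and an i128 split into two i64 halves with shift-and-OR reassembly:

```rust
// Promote: ANY_EXTEND both operands, operate at register width, TRUNCATE back.
fn promote_add_i8(a: u8, b: u8) -> u8 {
    let wide = (a as u32).wrapping_add(b as u32);
    wide as u8 // truncation keeps the low 8 bits
}

// Expand: split a wide value into (hi, lo) halves...
fn expand_i128(v: u128) -> (u64, u64) {
    ((v >> 64) as u64, v as u64)
}

// ...and reassemble with a shift-and-OR sequence.
fn rebuild_i128(hi: u64, lo: u64) -> u128 {
    ((hi as u128) << 64) | lo as u128
}

fn main() {
    assert_eq!(promote_add_i8(200, 100), 44); // 300 mod 256
    let v = 0x0123_4567_89AB_CDEF_0011_2233_4455_6677u128;
    let (hi, lo) = expand_i128(v);
    assert_eq!(rebuild_i128(hi, lo), v);
}
```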
The legality table lives inside NVPTXTargetLowering at offset +2422, organized as a 2D array indexed by 259 * VT + opcode. The 259-byte row stride accommodates LLVM's ~250 generic opcodes plus approximately 10 NVPTX target-specific opcodes. A secondary condition-code action table at offset +18112 uses 4-bit packed nibbles indexed by (VT_row + 15 * CC).
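The flat-array indexing scheme can be sketched as follows; the table contents and the `Action` encoding below are illustrative, not dumped from the binary (only the 259-byte row stride and the action codes 0--4 come from the analysis above):

```rust
// Legality-table lookup: a flat byte array indexed by 259 * VT + opcode.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Action { Legal, Custom, Expand, LibCall, Promote }

const ROW_STRIDE: usize = 259; // ~250 generic + ~10 target opcodes per VT row

fn op_action(table: &[u8], vt: usize, opcode: usize) -> Action {
    match table[ROW_STRIDE * vt + opcode] {
        0 => Action::Legal,
        1 => Action::Custom,
        2 => Action::Expand,
        3 => Action::LibCall,
        _ => Action::Promote,
    }
}

fn main() {
    // Tiny two-VT table, all-legal except one Custom entry.
    let mut table = vec![0u8; ROW_STRIDE * 2];
    table[ROW_STRIDE * 1 + 0x4C] = 1; // pretend SELECT on VT 1 is Custom
    assert_eq!(op_action(&table, 1, 0x4C), Action::Custom);
    assert_eq!(op_action(&table, 0, 0x4C), Action::Legal);
}
```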
The SimpleVT type encoding appears as a recurring pattern throughout the function (at least 11 instances of the same bitwidth-to-VT mapping):
| SimpleVT | Type | SimpleVT | Type |
|---|---|---|---|
| 1 | i1 | 7 | i128 |
| 3 | i8 | 8 | f16 |
| 4 | i16 | 9 | f32 |
| 5 | i32 | 10 | f64 |
| 6 | i64 | 14--109 | vector types |
The vector type range 14--109 maps fixed-width (14--55) and scalable (56--109) vector MVTs to their scalar element types through a ~100-case switch block that appears six times in the function body. The definitive MVT::getSizeInBits() mapping (confirmed at sub_1FDDC20) is:
| MVT Range | Bits | Description |
|---|---|---|
| 0, 1 | 0 | Other, Glue |
| 2 | 1 | i1 |
| 3 | 8 | i8 |
| 4, 8 | 16 | i16, f16 |
| 5, 9 | 32 | i32, f32 |
| 6, 10 | 64 | i64, f64 |
| 7 | 128 | i128 |
| 11 | 80 | ppcf128 / x87 f80 |
| 14--23 | varies | 2-element vectors |
| 24--109 | varies | 3+ element vectors |
| 111--114 | 0 | token, metadata, untyped |
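The scalar rows of the table above transcribe directly into a match; the vector rows (14--109) vary by element count and are omitted here:

```rust
// MVT::getSizeInBits() for scalar types, per the mapping confirmed at
// sub_1FDDC20. Vector MVTs are element-count dependent and return None.
fn mvt_size_in_bits(vt: u32) -> Option<u32> {
    match vt {
        0 | 1 => Some(0),     // Other, Glue
        2 => Some(1),         // i1
        3 => Some(8),         // i8
        4 | 8 => Some(16),    // i16, f16
        5 | 9 => Some(32),    // i32, f32
        6 | 10 => Some(64),   // i64, f64
        7 => Some(128),       // i128
        11 => Some(80),       // ppcf128 / x87 f80
        111..=114 => Some(0), // token, metadata, untyped
        _ => None,            // vector types: size depends on element count
    }
}

fn main() {
    assert_eq!(mvt_size_in_bits(7), Some(128));
    assert_eq!(mvt_size_in_bits(9), Some(32));
    assert_eq!(mvt_size_in_bits(20), None); // a vector MVT
}
```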
Type legalization workers fan out from several dispatch functions:
| Dispatcher | Role | Size | Cases |
|---|---|---|---|
sub_201E5F0 | Promote/expand secondary dispatch | 81KB | 441 case labels, 6 switches |
sub_201BB90 | ExpandIntegerResult | 75KB | 632 case labels |
sub_2000100 | PromoteIntegerResult | 45KB | recursive self-calls |
sub_2029C10 | SplitVectorResult | 5KB (dispatcher) | ~190 cases |
sub_202E5A0 | SplitVectorOperand | 6KB (dispatcher) | ~157 cases |
sub_2036110 | ScalarizeVectorResult | dispatch | "Do not know how to scalarize..." |
sub_2035F80 | ScalarizeVectorOperand | dispatch | "Do not know how to scalarize..." |
For complete detail, see Type Legalization.
Operation Legalization
LegalizeOp Dispatcher: sub_1FCE100
The top-level operation legalizer (sub_1FCE100, 91KB) is a massive switch on SDNode::getOpcode() (read as *(uint16_t*)(node + 24)) that dispatches approximately 100 ISD opcodes to dedicated per-opcode handler functions. The switch covers all major categories:
| Opcode | ISD Name | Handler | Size |
|---|---|---|---|
| 0x02 | EntryToken | sub_1F823C0 | |
| 0x03--0x04 | TokenFactor | sub_1F73660 | |
| 0x32 | CopyFromReg | sub_1F78510 | |
| 0x33 | CopyToReg | sub_1F987D0 | |
| 0x34 | MERGE_VALUES | sub_1FC08F0 | |
| 0x35 | ADD | sub_1FA8F90 | 31KB |
| 0x36 | SUB | sub_1FAA420 | 26KB |
| 0x37 | MUL | sub_1FAB9E0 | |
| 0x38 | SDIV/UDIV | sub_1FABFF0 | |
| 0x39--0x3A | SREM/UREM | sub_1F99DA0 | |
| 0x3B | AND | sub_1FD2F20 | |
| 0x3C | OR | sub_1FD2A20 | |
| 0x40 | SHL | sub_1FA27D0 | |
| 0x41 | SRA | sub_1FA2510 | |
| 0x42 | SRL | sub_1F71080 | |
| 0x43 | ROTL | inline | builds opcode 65 target node |
| 0x44 | ROTR | sub_1FA2D60 | |
| 0x47 | CTLZ | sub_1FA7370 | |
| 0x49 | CTPOP | sub_1FA2A00 | |
| 0x4A | BSWAP | inline | 16-bit width check |
| 0x4B | BITREVERSE | inline | |
| 0x4C | SELECT | sub_1FAC480 | 78KB |
| 0x4D | SELECT_CC | sub_1FAE680 | 87KB |
| 0x4E | SETCC | sub_1FB04B0 | 26KB |
| 0x4F | VSELECT | sub_1FCC170 | |
| 0x63 | SIGN_EXTEND | sub_1F8D440 | 22KB |
| 0x65 | ZERO_EXTEND | sub_1F74E80 | |
| 0x68 | TRUNCATE | sub_1F912F0 | 77KB |
| 0x69 | FP_ROUND | sub_1F97850 | 27KB |
| 0x6A | FP_EXTEND | sub_1FC15C0 | 36KB |
| 0x6C | BITCAST | sub_1F94350 | 22KB |
| 0x6D | LOAD | inline | alignment+memtype checks |
| 0x70 | STORE | sub_1F766E0 | |
| 0x72--0x75 | ATOMIC_FENCE..LOAD | sub_1FAA010 | |
| 0x76 | ATOMIC_STORE | sub_1FBDC00 | 76KB |
| 0x77 | ATOMIC_LOAD_ADD | sub_1FB1F30 | 37KB |
| 0x78 | ATOMIC_LOAD_SUB | sub_1FBB600 | 44KB |
| 0x7A | ATOMIC_LOAD_AND | sub_1FB8710 | 47KB |
| 0x7B | ATOMIC_LOAD_OR | sub_1FBA730 | 24KB |
| 0x7C | ATOMIC_LOAD_XOR | sub_1FB6C10 | 39KB |
| 0x86 | INTRINSIC_WO_CHAIN | sub_1F9E480 | 47KB |
| 0x87 | INTRINSIC_W_CHAIN | sub_1F9D3D0 | 26KB |
| 0x88 | INTRINSIC_VOID | sub_1F9CFD0 | |
| 0x8E | BUILD_VECTOR | sub_1FA3B00 | 26KB |
| 0x8F | INSERT_VECTOR_ELT | sub_1FA4AC0 | 67KB |
| 0x90 | EXTRACT_VECTOR_ELT | sub_1FA0CA0 | 20KB |
| 0x91 | CONCAT_VECTORS | sub_1FB3BB0 | 65KB |
| 0x94 | EXTRACT_SUBVECTOR | sub_1FB5FC0 | 19KB |
| 0x9A | DYNAMIC_STACKALLOC | sub_1F8F600 | |
| 0x9E | BR_CC | sub_1F8B6C0 | |
Opcodes not listed (0x00--0x01, 0x05--0x31, 0x3D--0x3F, 0x46, 0x48, 0x51--0x62, etc.) return immediately with code 0 (legal, no transformation needed).
Action Dispatch Engine: sub_1FFB890
The operation legalization action engine (sub_1FFB890, 137KB) determines what to do for each DAG node based on the target's action table, then executes the chosen strategy. It reads the per-opcode action byte from NVPTXTargetLowering + 2422 using the formula *(uint8_t*)(TLI + 259 * VT + opcode + 2422):
| Action | Code | Behavior |
|---|---|---|
| Legal | 0 | Return immediately -- node is natively supported |
| Custom | 1 | Call NVPTXTargetLowering::LowerOperation (vtable slot #164, offset +1312); if NULL returned, fall through to expand |
| Expand | 2 | Try LegalizeTypes, then ExpandNode (sub_1FF6F70) as fallback |
| LibCall | 3 | Call ExpandNode directly for libcall substitution |
| Promote | 4 | Find a larger legal type and rebuild the node |
The function contains 967 case labels dispatching on opcode. When LowerOperation returns NULL (the custom lowering cannot handle the node), the framework falls through to the expansion path. When it returns a different node, ReplaceAllUsesWith (sub_1D44C70) splices the replacement into the DAG and marks the old node as dead (tombstone value -2 in the worklist hash set).
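The action lookup itself is a flat byte-table index. A minimal sketch of the recovered formula follows — the layout constants come from the decompilation, but the bytearray model, helper name, and usage values are illustrative, not recovered data:

```python
# Sketch of the recovered action-table lookup: the per-{VT, opcode} action
# byte lives at TLI + 259*VT + opcode + 2422. The TLI object is modeled
# here as a flat bytearray.
LEGAL, CUSTOM, EXPAND, LIBCALL, PROMOTE = range(5)

ACTION_TABLE_OFFSET = 2422   # recovered base offset inside NVPTXTargetLowering
OPCODES_PER_VT = 259         # recovered row stride (one action byte per opcode)

def op_action(tli: bytearray, vt: int, opcode: int) -> int:
    """Mirror of *(uint8_t*)(TLI + 259 * VT + opcode + 2422)."""
    return tli[ACTION_TABLE_OFFSET + OPCODES_PER_VT * vt + opcode]

# Hypothetical usage: mark {VT=7, opcode=0x4C (SELECT)} as Custom; every
# other slot defaults to 0 (Legal), matching the fall-through behavior.
tli = bytearray(ACTION_TABLE_OFFSET + OPCODES_PER_VT * 16)
tli[ACTION_TABLE_OFFSET + OPCODES_PER_VT * 7 + 0x4C] = CUSTOM
```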
The promote path contains approximately 30 opcode-specific expansion strategies covering integer arithmetic, FP operations, vector operations, bitcasts, shifts, and NVPTX-specific operations. For FP promotion, the pattern is: FP_EXTEND both operands to the promoted type, apply the original operation, then FP_ROUND the result back.
Worklist management uses sub_1FF5010 with a DenseSet-like structure. The hash function for SDNode pointers follows the standard LLVM pattern: ((addr >> 9) ^ (addr >> 4)) & (capacity - 1).
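A minimal model of that pointer hash, assuming (as in LLVM's DenseSet) that the table capacity is a power of two so the mask acts as a modulo:

```python
def sdnode_hash(addr: int, capacity: int) -> int:
    """Recovered SDNode pointer hash: ((addr >> 9) ^ (addr >> 4)) & (capacity - 1).
    The two shifts discard low alignment bits and fold higher address bits
    into the bucket index; capacity must be a power of two."""
    assert capacity & (capacity - 1) == 0 and capacity != 0
    return ((addr >> 9) ^ (addr >> 4)) & (capacity - 1)
```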
Load/Store Legalization
The largest individual per-opcode handlers deal with memory operations:
| Handler | Opcode | Size | Behavior |
|---|---|---|---|
| sub_1FC2C30 | LOAD (complex) | 70KB | Extending loads, vector loads, memory type conversion |
| sub_1FC66B0 | Load/Store vectorization | 68KB | Offset-based coalescing with introsort (sub_1F6CA30) |
| sub_1FC9570 | STORE legalization | 60KB | Alignment checks, store splitting, scatter sequences |
The load/store vectorization helper sorts operands by memory offset to detect coalescing opportunities, then creates vector load/store sequences when contiguous accesses are found. This is important for NVPTX because PTX supports ld.v2/ld.v4/st.v2/st.v4 instructions that load/store 2 or 4 elements in a single transaction.
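The coalescing step can be sketched roughly as follows. This is a simplified model of the sort-then-group idea only — the real legalizer also checks alignment, address space, and chain dependencies, and the function name is illustrative:

```python
def coalesce(offsets, elem_size=4):
    """Greedy coalescing sketch: sort accesses by byte offset, then group
    runs of contiguous elements into v4, v2, or scalar memory ops,
    mirroring the ld.v4/ld.v2 preference."""
    offs = sorted(offsets)
    groups, i = [], 0
    while i < len(offs):
        for width in (4, 2, 1):   # prefer v4, then v2, then scalar
            run = offs[i:i + width]
            if len(run) == width and all(
                    run[k] == run[0] + k * elem_size for k in range(width)):
                groups.append((run[0], width))   # (base offset, lane count)
                i += width
                break
    return groups
```

For six contiguous 4-byte accesses, this yields one v4 group and one v2 group rather than six scalar transactions.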
Atomic Legalization
All atomic operations (ATOMIC_STORE through ATOMIC_LOAD_XOR, opcodes 0x72--0x7C) follow a shared structural pattern:
- Check operation legality via sub_1D16620 (isAtomicStoreLegal / isOperationLegalOrCustom)
- If legal, emit the operation directly
- If custom, call NVPTXTargetLowering::LowerOperation for scope-aware NVPTX atomics
- Build atomic fence pairs around the operation when needed
- Lower to target-specific NVPTX atomic operations with CTA/GPU/SYS scope
The ATOMIC_LOAD_SUB handler at sub_1FBB600 converts subtraction to atom.add of the negated operand when the target lacks native atom.sub.
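In two's-complement arithmetic this rewrite is exact. A sketch, with illustrative names and register width (the handler itself builds DAG nodes, not strings):

```python
def lower_atomic_sub(operand: int, bits: int = 32):
    """Rewrite atom.sub(x) as atom.add(-x): two's-complement negation
    makes the add modularly equivalent to the subtraction."""
    neg = (-operand) & ((1 << bits) - 1)
    return ("atom.add", neg)
```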
NVPTX Custom Lowering: sub_32E3060
The LowerOperation dispatcher (sub_32E3060, 111KB) handles NVPTX-specific ISD opcode lowering. This is the second-largest function in the 0x32XXXXX range. It operates through a multi-phase approach rather than a clean switch-on-opcode, with approximately 620 local variables and a 0x430-byte stack frame.
The dispatcher is reached via vtable slot #164 (offset +1312) of the NVPTXTargetLowering object whenever the operation legalizer encounters action code 1 (Custom).
Supported Opcodes
| Opcode | ISD Node | Lowering Strategy |
|---|---|---|
| 51 | UNDEF | Direct pass-through via getNode(UNDEF) |
| 156 | BUILD_VECTOR | Iterates operands, detects all-same, calls dedicated handler |
| 186 | VECTOR_SHUFFLE | Three-level approach by result count (1, 2, 3+) |
| 234 | EXTRACT_VECTOR_ELT | Three sub-paths: predicate check, direct sub-register, general extract |
Additionally, the function handles load/store lowering (sub_32D2680, 81KB companion), integer/FP operation legalization (sub_32983B0, 79KB), address space casts (sub_32C3760, 54KB), bitcast/conversion (sub_32C7250, 57KB), and conditional/select patterns (sub_32BE8D0, 54KB). These large helper functions are called from within sub_32E3060's dispatch logic.
BUILD_VECTOR Lowering
BUILD_VECTOR (opcode 156) lowering begins by iterating all operands to detect the all-same (splat) case. When all elements are the same value, the lowering produces a single scalar load followed by register-class-appropriate replication. When elements differ, it falls through to a per-element insert chain.
For NVPTX, BUILD_VECTOR is significant because PTX has no native vector construction instruction -- vectors are built by storing elements into .param space and reloading as a vector type, or through register-pair packing for 2-element vectors.
VECTOR_SHUFFLE Three-Level Lowering
Vector shuffle lowering (lines 2665--3055 of the decompilation) implements a three-level strategy based on the result element count:
Level 1 -- Single-result shuffle. When the shuffle produces a single element, the lowering extracts the source element directly via EXTRACT_VECTOR_ELT and wraps it in a BUILD_VECTOR if needed. This avoids any actual shuffle machinery.
Level 2 -- Two-result shuffle. The handler uses a two-phase identity/extract detection with BitVector tracking. Phase A scans the shuffle mask to identify which source elements map to which result positions. Phase B determines whether each result position is an identity (element already in the correct position in one of the source vectors) or requires extraction. Results that are identities are left in place; non-identity elements are extracted and inserted.
Level 3 -- General shuffle (3+ results). Falls back to a BUILD_VECTOR-based reconstruction. Each result element is individually extracted from the appropriate source vector using EXTRACT_VECTOR_ELT, then all elements are combined via BUILD_VECTOR. For certain mask patterns, pairwise shuffle via sub_32B2430 is attempted first as an optimization.
EXTRACT_VECTOR_ELT Three Sub-Paths
EXTRACT_VECTOR_ELT (opcode 234) lowering takes one of three paths based on the extraction context:
- Predicate extraction. When extracting from a vector of i1 (predicates), the lowering produces a bitwise test on the packed predicate register. This is NVPTX-specific: PTX stores predicate vectors packed into integer registers.
- Direct sub-register extraction. When the element index is a compile-time constant and the element aligns with a register boundary, the lowering generates a direct sub-register reference. This maps to PTX's mov.b32 or mov.b64 for extracting elements from packed register pairs.
- General extraction. For non-constant indices or non-aligned elements, the lowering stores the entire vector to local memory, computes the byte offset from the index, and loads the element back. This generates st.local + ld.local sequences, which is expensive but handles all cases.
Supporting NVPTX Lowering Functions
The custom lowering infrastructure at 0x3290000--0x32FFFFF consists of 13 large functions totaling ~850KB:
| Function | Size | Role |
|---|---|---|
| sub_32E3060 | 111KB | Master LowerOperation dispatcher |
| sub_32A1EF0 | 109KB | Custom type promotion for NVPTX types |
| sub_32EC4F0 | 92KB | Post-legalize DAG combine |
| sub_32FE970 | 88KB | Vector operation splitting/scalarization |
| sub_32D2680 | 81KB | Load/store DAG lowering (address space, alignment) |
| sub_32983B0 | 79KB | Integer/FP operation legalization |
| sub_32B8A20 | 71KB | NVVM intrinsic lowering (tex/surf/special) |
| sub_32CBCB0 | 57KB | Extended type legalization |
| sub_32C7250 | 57KB | Bitcast/conversion lowering |
| sub_32A9030 | 55KB | Vector operation lowering |
| sub_32C3760 | 54KB | Address space cast / pointer lowering |
| sub_32BE8D0 | 54KB | Conditional/select lowering |
| sub_32B6540 | 50KB | Special register / intrinsic lowering |
Common helpers shared across all functions in this cluster:
| Range | Role |
|---|---|
| sub_325Fxxx | EVT/MVT type utilities |
| sub_326xxxx | DAG node creation (getNode variants) |
| sub_327xxxx | DAG memory node creation |
| sub_328xxxx | Target-specific node creation |
| sub_33Exxxx | NVPTX-specific node builders |
| sub_33Fxxxx | NVPTX instruction node helpers |
| sub_340xxxx | NVPTX constant/register node helpers |
| sub_341xxxx | NVPTX chain/glue node construction |
The .param-Space Calling Convention
PTX does not use registers for argument passing. Instead, all arguments flow through .param memory space, a compiler-managed address space specifically for call sites. LowerCall (sub_3040BF0, 88KB) implements this convention by emitting a structured sequence of NVPTXISD custom DAG nodes.
Call Sequence DAG Structure
```
CallSeqBegin(315, seq_id, 0)
DeclareScalarParam(506, align=4, idx=0, size=32)   // scalar arg
DeclareParam(505, align=4, idx=1, size=N)          // struct arg (byval)
StoreV1(571, ...)                                  // 8 bytes at a time
StoreV2(572, ...)                                  // or 2-element vector
DeclareRetScalarParam(508, 1, 32, 0)               // return decl
CallProto(518, callee, ...)
CallStart(514, ...)                                // actual call
LoadRetParam(515, 1, 0, ...)                       // load return value
CallSeqEnd(517, ...)
CallSeqEnd_Outer(316, ...)
```
Each call increments a monotonic sequence counter at NVPTXTargetLowering + 537024 (offset 134256 * 4), used to match CallSeqBegin/CallSeqEnd pairs and generate unique .param variable names (e.g., __param_0, __param_1, etc.).
Scalar Widening Rules
Scalar arguments narrower than 32 bits are widened to 32 bits; values between 32 and 64 bits are widened to 64 bits. This matches the PTX ABI requirement that .param scalars have a minimum 32-bit size:
| Source Width | Widened To | PTX Type |
|---|---|---|
i1 (1 bit) | i32 (32 bit) | .param .b32 |
i8 (8 bit) | i32 (32 bit) | .param .b32 |
i16 (16 bit) | i32 (32 bit) | .param .b32 |
i32 (32 bit) | i32 (no change) | .param .b32 |
i64 (64 bit) | i64 (no change) | .param .b64 |
f16 (16 bit) | i32 (32 bit) | .param .b32 |
f32 (32 bit) | f32 (no change) | .param .f32 |
f64 (64 bit) | f64 (no change) | .param .f64 |
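A compact model of the widening rules above (an illustrative helper that reproduces the table, not a recovered function; mid-size widths like i48 are omitted for brevity):

```python
def widen_param(width: int, is_float: bool):
    """Widen a scalar for .param passing: anything under 32 bits (including
    f16) becomes a .b32 slot; 32/64-bit values keep their width and type."""
    if width < 32:
        return 32, ".b32"
    if is_float:
        return width, f".f{width}"
    return width, f".b{width}"
```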
Vector Parameter Passing
Vector arguments use StoreV1/StoreV2/StoreV4 (opcodes 571--573) mapping to PTX st.param.b32, st.param.v2.b32, st.param.v4.b32 and their 64-bit variants. The element count determines the opcode:
| Opcode | Name | PTX | Description |
|---|---|---|---|
| 571 | StoreV1 | st.param.b32 / .b64 | Single element store |
| 572 | StoreV2 | st.param.v2.b32 / .v2.b64 | 2-element vector store |
| 573 | StoreV4 | st.param.v4.b32 / .v4.b64 | 4-element vector store |
For byval struct arguments, the lowering decomposes the aggregate into chunks that fit the largest available vector store. An 80-byte struct, for example, might be lowered as five StoreV4.b32 operations (5 x 4 x 4 = 80 bytes).
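The greedy chunking can be sketched as follows — a simplified model assuming 4-byte elements and ignoring alignment padding and sub-element tails; the function name is illustrative:

```python
def decompose_byval(size_bytes: int, elem: int = 4):
    """Decompose a byval aggregate into .param stores, preferring the
    widest available vector store: StoreV4 (16B), StoreV2 (8B), StoreV1 (4B).
    Returns (opcode name, byte offset) pairs in ascending offset order."""
    ops, off = [], 0
    for lanes in (4, 2, 1):
        chunk = lanes * elem
        while size_bytes - off >= chunk:
            ops.append((f"StoreV{lanes}", off))
            off += chunk
    return ops
```

An 80-byte struct decomposes into exactly five StoreV4 operations, matching the 5 x 4 x 4 = 80 example above.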
NVPTXISD DAG Node Opcodes
The complete set of NVPTXISD opcodes used in call lowering:
| Opcode | Name | Role |
|---|---|---|
| 315 | CallSeqBegin | Marks start of call parameter setup (maps to ISD opcode) |
| 316 | CallSeqEnd | Outer end-of-call marker (maps to ISD opcode) |
| 505 | DeclareParam | Declares a byval .param aggregate parameter |
| 506 | DeclareScalarParam | Declares a scalar .param parameter with width+alignment |
| 508 | DeclareRetScalarParam | Declares the return value .param parameter |
| 510 | CallDirect | Direct call with prototype |
| 511 | CallDirectNoProto | Direct call without prototype (old-style C) |
| 512 | CallIndirect | Indirect call (function pointer) with prototype |
| 513 | CallIndirectNoProto | Indirect call without prototype |
| 514 | CallStart | The actual call instruction |
| 515 | LoadRetParam | Loads return value from .param space |
| 517 | CallSeqEnd (inner) | Inner end-of-call marker |
| 518 | CallProto | Call prototype declaration (type signature) |
| 571--573 | StoreV1/V2/V4 | Stores to .param space |
Four Call Flavors
Call dispatch is selected by prototype availability and call directness:
| Opcode | Name | When Used |
|---|---|---|
| 510 | CallDirect | Direct call to a named function with a known prototype |
| 511 | CallDirectNoProto | Direct call without prototype (K&R C style, rare in CUDA) |
| 512 | CallIndirect | Function pointer call with known prototype |
| 513 | CallIndirectNoProto | Function pointer call without prototype |
In CUDA code, CallDirect (510) dominates because the vast majority of device function calls are direct with full prototypes. CallIndirect (512) appears when calling through __device__ function pointers. The no-prototype variants are legacy paths that may not be exercisable from CUDA C++ but are retained for C compatibility.
Libcall Generation
When the lowering needs to synthesize a library call (e.g., for __divdi3 software division), it attaches "nvptx-libcall-callee" metadata set to "true" on the callee. This metadata string was extracted from the binary at sub_3040BF0. The metadata tells later passes that the callee is a compiler-generated runtime helper rather than user code.
The primary helpers called from LowerCall:
| Helper | Role |
|---|---|
| sub_302F170 | Parameter marshaling setup |
| sub_3031480 | Argument type coercion |
| sub_3031850 | Scalar widening |
| sub_30351C0 | Struct decomposition for byval args |
| sub_303E700 | Return value handling |
DAG Combining
The DAG combiner runs three times during the SelectionDAG pipeline: once after initial DAG construction, once after type legalization, and once after operation legalization. The combiner consists of a target-independent framework and NVPTX-specific target hooks.
Target-Independent Combiner Framework
The combiner orchestrator (sub_F681E0, 65KB) manages the worklist-driven iteration over all DAG nodes:
```
function DAGCombine(dag):
    worklist = dag.allNodes()              // linked list iteration
    visited = SmallPtrSet()
    while worklist not empty:
        node = worklist.pop()
        if visited.count(node): continue
        visited.insert(node)               // sub_C8CA60 / sub_C8CC70
        result = visitNode(node)           // sub_F20C20
        if result != node:
            ReplaceAllUsesWith(node, result)   // sub_F162A0
            add users of result to worklist
            mark node dead
```
The worklist operates on the SDNode linked list. Nodes are processed via sub_C8CA60 (SmallPtrSet::count for visited check) and sub_C8CC70 (SmallPtrSet::insert with vector growth for worklist membership). The exclusion list at this + 64 (with count at this + 76) prevents certain nodes from being visited.
Global flag byte_4F8F8E8 enables verbose/debug tracing of the combining process.
Visitor: sub_F20C20
The per-node combine visitor (sub_F20C20, 64KB) implements six sequential optimization phases for each node:
Phase 1: Opcode-specific combine. Calls sub_100E380, the target-independent combine dispatcher, which switches on the node's opcode and applies algebraic simplifications (e.g., x + 0 -> x, x & -1 -> x, x * 1 -> x). For NVPTX, this also invokes the target-specific combine hook via vtable dispatch.
Phase 2: Known-bits narrowing. For nodes with constant operands, the combiner builds APInt masks and calls sub_11A3F30 (computeKnownBits / SimplifyDemandedBits) to narrow constants. When all high bits of a result are known-zero, the operation can be narrowed to a smaller type. Two global cl::opt flags gate this phase: qword_4F8B3C8 controls strict-FP known-bits combining, and qword_4F8B548 controls 2-operand reassociation.
Phase 3: Operand type-narrowing loop. For each operand, the combiner computes the legalized type, skips zero-constant operands, creates legalized replacements, and inserts SIGN_EXTEND/TRUNCATE cast nodes as needed. This handles the common case where an operation was originally on i64 but only uses the low 32 bits.
Phase 4: All-constant-operand fold. Detects when every operand is a ConstantSDNode (opcode 17) and calls sub_1028510 for full constant-fold evaluation. The constant check uses a 4x-unrolled loop for performance. The operand count is extracted via the 0x7FFFFFF mask from the packed SDNode header.
Phase 5: Division-by-constant strength reduction. Replaces division by power-of-two constants with shift+mask sequences via APInt shift/mask computation. Division by non-power-of-two constants uses the magic-number reciprocal multiplication technique: x / C becomes (x * M) >> shift where M is the multiplicative inverse.
Phase 6: Vector stride / reassociation patterns. Attempts associative FP decomposition via sub_F15980, with fast-math flag propagation when both sub-results are known non-negative. This handles patterns like (a + b) + c -> a + (b + c) when nsz and arcp flags permit.
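The Phase 5 magic-number rewrite for unsigned division can be sketched with the standard round-up construction. This is a simplified illustration, not the recovered code: LLVM's actual UnsignedDivisionByConstantInfo handles additional adjustment cases for divisors where the round-up multiplier alone is insufficient.

```python
def magic_udiv(d: int, bits: int):
    """Choose (m, s) so that x // d == (x * m) >> (bits + s) for 0 <= x < 2**bits.
    Round-up variant: s = ceil(log2 d), m = floor(2**(bits+s) / d) + 1.
    Exact when m*d - 2**(bits+s) <= 2**s; other divisors need an extra
    adjustment step that this sketch omits."""
    s = max(1, (d - 1).bit_length())       # ceil(log2 d) for d >= 2
    m = (1 << (bits + s)) // d + 1
    return m, s

def udiv_by_magic(x: int, m: int, s: int, bits: int) -> int:
    # Hardware realizes this as a widening multiply plus a right shift.
    return (x * m) >> (bits + s)
```

Power-of-two divisors skip this entirely and become plain shifts, as described above.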
ReplaceAllUsesWith: sub_F162A0
The combiner's RAUW implementation walks the use-list and hashes each user into a worklist map using the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function and growth policy.
Supporting Combine Functions
| Function | Size | Role |
|---|---|---|
| sub_F0F270 | 25.5KB | Pattern matcher (STORE/BITCAST/CONSTANT) |
| sub_F24210 | 34.6KB | DAG simplification pass |
| sub_F2B940 | 29.8KB | Truncation/extension chain combines |
| sub_F29CA0 | 26.9KB | Node morphing / operand updating |
| sub_F27020 | 25KB | Specific operation combines |
| sub_F2D1B0 | 22.2KB | Comparison combines |
| sub_F2DD30 | 11.5KB | Shift combines |
| sub_F62E00 | 46.7KB | Address/memory operation combines |
| sub_F657D0 | 26.1KB | Vector operation combines |
| sub_F6C1B0 | 15.7KB | TokenFactor chain management |
SDNode Data Structure
The combiner manipulates SDNodes using these field offsets (reconstructed from access patterns throughout the combining code):
| Offset | Size | Field |
|---|---|---|
| -8 | 8 | Operand list pointer (when bit 6 of byte +7 is set) |
| 0 | 8 | First operand / use chain linked list |
| +4 | 4 | Packed: NumOperands (bits 0--26), Flags (bits 27--31) |
| +7 | 1 | Extra flags (bit 6 = has operand pointer at -8) |
| +8 | 8 | ValueType / MVT |
| +16 | 8 | Use chain (next user pointer, 0 if none) |
| +24 | 2 | Opcode (uint16_t) |
| +32 | 4 | Result type info |
| +36 | 4 | DebugLoc / location ID |
| +40 | 8 | Chain operand |
| +48 | 8 | Value pointer / type info |
| +72 | 4 | NumResults |
| +80 | 4 | Additional operand count / mask index |
Operand stride is 32 bytes. Access pattern: node - 32 * (node[+4] & 0x7FFFFFF) yields the first operand.
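A sketch of that address computation — a model of the recovered layout operating on plain integers, since the real code walks raw memory:

```python
OPERAND_STRIDE = 32          # recovered per-operand record size
NUM_OPS_MASK = 0x7FFFFFF     # low 27 bits of the packed word at node+4

def first_operand_addr(node_addr: int, packed_word: int) -> int:
    """Operands precede the node in memory:
    first operand = node - 32 * NumOperands."""
    return node_addr - OPERAND_STRIDE * (packed_word & NUM_OPS_MASK)
```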
NVPTX Target-Specific Combines: sub_33C0CA0
NVPTXTargetLowering::PerformDAGCombine (sub_33C0CA0, 62KB) provides NVPTX-specific algebraic optimizations. This function is called from the target-independent combiner framework via vtable dispatch. It receives an SDNode and returns either NULL (no transformation) or a replacement node.
The function calls sub_2FE8D10 (13x), sub_2FE6CC0 (12x), sub_30070B0 (14x), and sub_2D56A50 (9x), with 27 calls into sub_B2D*/B2C* for debug value builders.
A secondary NVPTX DAG combine function at sub_32EC4F0 (92KB) handles post-legalize optimization, operating after the main legalization pass. It calls into the same shared DAG construction helpers (sub_2FE3480, sub_2FE6750, sub_325F5D0, sub_3262090).
The NVIDIA-side DAGCombiner at sub_3425710 (142KB) includes debug tracing with "COVERED: " and "INCLUDED: " prefix strings, confirming it was built with NVIDIA's internal debug infrastructure. This function calls sub_C8D5F0 (31x for type action checks), sub_2E79000 (14x for value type access), and sub_3423E80 (8x for combine helper dispatch).
NVPTX Address Spaces
Address space constants appear throughout the SelectionDAG lowering. See Address Spaces for the master table and SelectionDAG Address Space Encoding for the backend-specific secondary encoding used in .param passing conventions.
In LowerCall, pointer arguments undergo addrspacecast to generic (AS 0) via sub_33F2D30. The pointer size for AS 5 follows a power-of-two encoding: sizes 1, 2, 4, 8, 16, 32, 64, 128 bytes map to codes 2, 3, 4, 5, 6, 7, 8, 9.
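That size encoding is simply log2(size) + 2. A sketch (the function name is illustrative):

```python
def as5_size_code(size_bytes: int) -> int:
    """Power-of-two sizes 1..128 bytes map to codes 2..9:
    code = log2(size) + 2. bit_length() of 2**k is k+1, so we add 1."""
    assert size_bytes != 0 and size_bytes & (size_bytes - 1) == 0
    return size_bytes.bit_length() + 1
```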
Address space handling permeates the entire lowering infrastructure. Functions sub_33067C0 (74KB), sub_331F6A0 (62KB), sub_331C5B0 (60KB), and sub_33D4EF0 (114KB) all contain address-space-aware logic for NVPTX memory operations, global address lowering, argument handling, and complex pattern matching respectively.
Intrinsic Lowering
The intrinsic lowering mega-switch (sub_33B0210, 343KB) dispatches over 200 distinct NVPTX intrinsic IDs into DAG node construction. The switch covers intrinsic IDs 0--0x310 in the main body, with high-ID ranges for texture/surface operations extending to ID 14196 (0x3774). The function contains approximately 1,000 local variables and calls sub_338B750 (getValue helper) 195 times, sub_3406EB0 (getNode) 116 times, and sub_337DC20 (setValue) 100 times.
Key intrinsic categories:
| Category | ID Range | Handler | Count |
|---|---|---|---|
| Math ops (rounding modes) | 2, 10, 12, 20, 21, 63, ... | sub_33FA050 | ~20 |
| WMMA / MMA (tensor core) | 0xA4--0xA8, 0x194--0x1EC | sub_33A64B0 | 95 |
| Texture sampling | 0x5D--0x8D | sub_33A4350 | 50 |
| Surface read/write | 0x8E--0x90 | sub_33A3180 | 3 |
| Warp shuffle | 0xD4, 0xD5, 0xDF, 0xE0 | sub_33FAF80 | 4 |
| Vote intrinsics | 0xE1--0xE6 | sub_339CDA0 / sub_339E310 | 6 |
| Atomics | 0xEB--0xF8 | sub_3405C90 / sub_340AD50 | ~14 |
| cp.async / TMA | 0x175--0x17C | sub_33AD3D0 | ~8 |
| MMA sm90+ (Hopper wgmma) | 0x183--0x191 | sub_33AC8F0 | 15 |
| Texture/surface handle | 10578 | inline | nvvm_texsurf_handle |
The WMMA/MMA block is the largest single-handler group: 95 consecutive case labels (intrinsic IDs 404--492) all delegate to sub_33A64B0, covering wmma.load, wmma.store, wmma.mma, mma.sync (sm70+), mma.sp (sm80+), and mma.f64 (sm90+). The warp shuffle intrinsics map to specific NVPTXISD opcodes: __shfl_down_sync to 277, __shfl_up_sync to 275, __shfl_xor_sync to 278, and __shfl_sync to 276.
Math intrinsics encode explicit rounding modes via an inner opcode table. For example, ADD_RN (round-to-nearest) maps to opcode 252, ADD_RZ (round-toward-zero) to 249, ADD_RM (round-toward-minus-infinity) to 245, and ADD_RP (round-toward-plus-infinity) to 270.
NVIDIA-specific intrinsic IDs include high-value entries: ID 10578 handles nvvm_texsurf_handle, IDs 8920/8937--8938 handle texture/surface operations. The overflow path at sub_33A1E80 handles intrinsic IDs that fall outside the main switch range.
NVPTX computeKnownBits
The NVPTX target provides a custom computeKnownBitsForTargetNode implementation (sub_33D4EF0, 114KB) that propagates bit-level information through 112 opcode cases in the SelectionDAG. This function calls sub_969240 (SDNode accessor) 399 times and itself recursively 99 times. It supports demanded-bits pruning via an APInt mask parameter and caps recursion at depth 6 (matching LLVM's default MaxRecursionDepth).
Notable NVPTX-specific known-bits behaviors:
- Memory operation type inference (opcode 0x12A): Propagates known bits through load operations based on extension mode (zero-extend, sign-extend, any-extend) encoded in the node flags byte at bits [2:3]. Handles ld.global.u32 vs ld.global.s32 vs ld.global.b32 distinctions.
- Texture/surface fetch results (opcodes 0x152--0x161): Sets known bits in the range [elementSize..width] based on the result type, encoding the known bit-width of texture fetch results.
- Constant pool integration (opcode 0x175): Uses LLVM's ConstantRange class to derive known bits from constant pool values, chaining fromKnownBits through intersect to toKnownBits.
- Target fence at opcode 499 (ISD::BUILTIN_OP_END): All opcodes above 499 delegate to the TargetLowering virtual method; below that, the generic ISD switch handles everything.
APInt values with width at most 64 bits use inline storage; wider values trigger heap allocation. The constant 0x40 (64) appears hundreds of times as the inline/heap branch condition.
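The load-extension inference described above reduces to a simple mask computation: for a zero-extending load, every bit above the loaded width is known zero. A sketch in the spirit of LLVM's KnownBits pair (names and register width are illustrative):

```python
def known_bits_zext_load(load_width: int, reg_width: int = 32):
    """ld.global.u8/u16 style: bits [load_width, reg_width) are known zero.
    Returns (known_zero, known_one) masks; nothing is known one for a
    zero-extending load of unknown data."""
    full = (1 << reg_width) - 1
    low = (1 << load_width) - 1
    return full & ~low, 0
```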
The target-independent known-bits infrastructure at 0xF50000--0xF60000 includes:
| Function | Size | Role |
|---|---|---|
| sub_F5A610 | 36.7KB | computeKnownBits for generic ISD opcodes (depth limit at a4 == 48) |
| sub_F5F040 | 52.4KB | Extended known-bits with recursive expansion limit: (v74-1)*v77 > qword_4F8BF28 |
| sub_F5CD10 | 26.6KB | DAG combine using known-bits results |
| sub_F54050 | 17.8KB | Known-bits for multi-result nodes |
| sub_F54F50 | 10.7KB | Known-bits for vector operations |
Global qword_4F8BF28 is a threshold that limits recursive known-bits expansion to prevent combinatorial blowup.
Inline Assembly Lowering
Inline assembly lowering spans two locations in the binary: the target-independent SelectionDAGBuilder::visitInlineAsm at sub_2079C70 (83KB) and the NVPTX-specific constraint handler at sub_338BA40 (79KB).
Target-Independent Framework: sub_2079C70
The inline assembly visitor (sub_2079C70, 83KB, 2,797 lines) lowers LLVM IR asm statements into ISD::INLINEASM (opcode 193) or ISD::INLINEASM_BR (opcode 51) DAG nodes. The function allocates an 8.4KB stack frame and processes operands in five phases:
1. Initialization. Parses the asm string and metadata. Looks up "srcloc" metadata on the asm instruction for error location reporting.
2. Constraint pre-processing. Each constraint string is parsed into a 248-byte record. Constraints are classified as: immediate ('i', flag 0x20000), memory ('m', flag 0x30000), or register (determined by target).
3. Tied operand resolution. Input operands tied to output operands (e.g., "=r" and "0") are matched and validated for type compatibility. Diagnostic: "inline asm not supported yet: don't know how to handle tied indirect register inputs".
4. Per-operand lowering. Each operand is lowered to an SDValue. Register operands go through TargetLowering::getRegForInlineAsmConstraint() (virtual dispatch). Diagnostics: "couldn't allocate output register for constraint '", "couldn't allocate input reg for constraint '".
5. DAG node finalization. All operands are assembled into an INLINEASM SDNode with chain and flag operands.
The function uses a 16-entry inline operand buffer (7,088 bytes on stack), reflecting the assumption that CUDA inline asm rarely exceeds 16 operands. Each operand working structure is 440 bytes. Overflow triggers heap reallocation via sub_205BBA0.
Diagnostic strings found in the binary:
| String | Condition |
|---|---|
| "couldn't allocate output register for constraint '" | Register constraint unsatisfiable |
| "couldn't allocate input reg for constraint '" | Input constraint unsatisfiable |
| "Don't know how to handle indirect register inputs yet..." | Indirect tied operand |
| "inline asm error: This value type register class is not natively supported!" | Unsupported type for register |
| "invalid operand for inline asm constraint '" | Generic operand mismatch |
| "Indirect operand for inline asm not a pointer!" | Non-pointer indirect operand |
NVPTX Constraint Handler: sub_338BA40
The NVPTX-specific inline asm constraint handler (sub_338BA40, 79KB) is part of the NVPTXTargetLowering class. It processes constraint strings specific to the NVPTX backend:
- Simplified constraint model. NVPTX recognizes single-character 'i' (immediate) and 'm' (memory) constraints through sub_2043C80, avoiding the complex multi-character constraint tables used by x86/ARM backends.
- Register class mapping. The function maps MVT values to NVPTX register classes using a 544-case switch (confirmed at sub_204AFD0, 60KB): MVTs 0x18--0x20 map to Int32Regs, 0x21--0x28 to Int64Regs, 0x29--0x30 to Float32Regs, 0x31--0x36 to Float64Regs, 0x37 to Int128Regs, 0x56--0x64 to 2-element vector registers.
- Convergent flag handling (bit 5). Ensures barrier semantics are preserved for inline asm, checked via operand bundle attribute or function-level convergent.
- Scalar-to-vector conversion. The string "non-trivial scalar-to-vector conversion" indicates that the handler attempts to pack scalar inline-asm results into vector register classes when the output constraint specifies a vector type.
Additional support at sub_2046E60 emits ", possible invalid constraint for vector type" when a vector type is used with an incompatible constraint.
ISel Pattern Matching Driver
The instruction selection driver (sub_3090F90) manages the top-level selection loop rather than performing pattern matching directly. It builds a cost table for function arguments using a hash table with hash function key * 37, processes the topological worklist using a min-heap priority queue, and calls the actual pattern matcher (sub_308FEE0) for each node.
The driver maintains an iteration budget of 4 * numInstructions * maxBlockSize to guard against infinite loops. When the budget is exceeded, selection terminates for the current function.
For complete ISel detail, see ISel Pattern Matching & Instruction Selection.
NVPTXTargetLowering Initialization
The NVPTXTargetLowering constructor (sub_3056320, 45KB + sub_3314670, 73KB) populates the legalization action tables that drive all subsequent SelectionDAG processing. It calls sub_302E500, sub_302F030, sub_3030230, and sub_3034720 to register legal/custom/expand actions for each {ISD_opcode, MVT} pair.
Key aspects of the initialization:
- Subtarget-gated feature checks. Offsets +2843, +2584, and +2498 in the subtarget object encode SM-version-dependent feature availability. These control which types and operations are marked Legal vs. Custom vs. Expand.
- Vector support. NVPTX has limited native vector support. Most vector operations are marked Custom or Expand, forcing them through the custom lowering at sub_32E3060.
- Atomic support. The string "vector atomics not supported on this architecture!" at sub_3048C30 confirms SM-version-gated vector atomic support, likely SM 90+ (Hopper) or SM 100+ (Blackwell).
- Address space assertions. AS values (generic=0, global=1, shared=3, const=4, local=5) are encoded directly into the legalization tables, with different legal operation sets per address space.
What Upstream LLVM Gets Wrong for GPU
Upstream LLVM's SelectionDAG framework was designed for CPU ISAs where register classes overlap and share a unified physical register file. The NVPTX target breaks these assumptions at every level:
- Upstream assumes register classes interfere with each other. On x86, GR32 is a sub-register of GR64; allocating eax constrains rax. The interference graph, coalescing, and copy elimination infrastructure all assume overlapping classes. NVPTX has nine completely disjoint classes (%r, %f, %fd, %p, etc.) with zero cross-class interference. The DAG's register pressure tracking, copy coalescing hints, and class constraint propagation solve a problem that does not exist on this target.
- Upstream assumes function calls are cheap register shuffles. CPU calling conventions move arguments through registers (rdi, rsi, etc.) or a stack backed by L1 cache. NVPTX function calls go through the .param address space with explicit DeclareParam/st.param/ld.param sequences -- O(n) memory operations per argument. The LowerCall function in cicc is 88KB (vs. upstream's few KB) because it must handle four call flavors, monotonic .param naming, and "nvptx-libcall-callee" metadata for synthesized calls.
- Upstream assumes a small set of intrinsics. Upstream NVPTX intrinsic lowering covers approximately IDs 0--300. CICC's intrinsic mega-switch at sub_33B0210 (343KB) handles IDs up to 14196, covering cp.async, TMA, WGMMA, and the full SM 90/100 tensor operation set. The upstream framework's assumption that intrinsic lowering is a small switch case is off by two orders of magnitude.
- Upstream assumes vector types are natively supported. CPU targets have native vector registers (XMM/YMM/ZMM, NEON Q-registers). NVPTX has no native vector registers -- most vector operations are marked Custom or Expand, forcing them through 111KB of custom lowering at sub_32E3060. The "legalize then select" pipeline spends most of its time decomposing vectors that never should have been formed.
- Upstream assumes known-bits propagation is a small target hook. Upstream NVPTX's computeKnownBitsForTargetNode handles fewer than 20 opcodes. CICC's version at sub_33D4EF0 (114KB, 112 opcode cases) propagates bits through texture fetches, address space loads, and NVPTX-specific operations -- a 50x expansion that upstream's hook interface was never designed to support cleanly.
Differences from Upstream LLVM
The NVPTX SelectionDAG backend in cicc v13.0 diverges from upstream LLVM NVPTX in several structural and behavioral ways. This section catalogs the known differences.
Structural Divergences
Monolithic type legalizer. Upstream LLVM splits type legalization across four source files (LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, LegalizeVectorTypes.cpp, LegalizeTypes.cpp). In cicc, all four are collapsed into a single 348KB function (sub_20019C0), likely an LTO artifact. The behavioral result is identical, but the code layout makes the function nearly impossible to patch incrementally.
Dual-address ISel infrastructure. The NVPTX lowering code exists at two address ranges (0x32XXXXX and 0x33XXXXX), with functions at sub_32E3060 (LowerOperation) and sub_3377410 (secondary dispatch) forming a two-level dispatch. Upstream NVPTX uses a single LowerOperation method. The binary has a secondary overflow path for intrinsic IDs that fall outside the main switch range.
142KB NVPTX DAGCombiner. The function sub_3425710 includes "COVERED:" and "INCLUDED:" debug trace strings not present in any upstream LLVM release. This is NVIDIA internal instrumentation for tracking combine coverage during development.
Two inline asm subsystems. The target-independent visitInlineAsm at sub_2079C70 (83KB) and the NVPTX-specific constraint handler at sub_338BA40 (79KB) total 162KB. The upstream NVPTX inline asm support is approximately 200 lines of code. The cicc version is vastly more complex, likely handling NVIDIA-internal PTX inline asm patterns.
Behavioral Divergences
Calling convention. Upstream LLVM NVPTX uses a simplified LowerCall that handles only the standard .param space protocol. CICC's sub_3040BF0 (88KB) adds "nvptx-libcall-callee" metadata for synthesized libcalls, monotonic sequence counters for unique .param names, and four call flavors (with/without prototype x direct/indirect). The upstream has two flavors.
Intrinsic count. The cicc intrinsic lowering switch (sub_33B0210, 343KB) handles intrinsic IDs up to 14196 (0x3774), with dedicated handlers for cp.async/TMA and WGMMA instructions. Upstream LLVM's NVPTX intrinsic lowering covers approximately IDs 0--300. The extended range covers SM 90 (Hopper) and SM 100 (Blackwell) tensor operations.
Vector shuffle lowering. The three-level shuffle lowering (identity detection, BitVector tracking, BUILD_VECTOR fallback) is more sophisticated than upstream NVPTX, which typically scalarizes all shuffles unconditionally.
Atomic scope awareness. CICC's atomic lowering at sub_3048C30 (86KB) supports CTA/GPU/SYS scope atomics with SM-version gating. Upstream LLVM NVPTX handles basic atomics but lacks the full scope hierarchy.
Known-bits propagation. The NVPTX computeKnownBitsForTargetNode at sub_33D4EF0 (114KB, 112 opcode cases, 399 SDNode accesses, 99 recursive calls) is far more extensive than the upstream version, which typically handles fewer than 20 target-specific opcodes. The cicc version propagates bits through texture fetches, address space loads, and NVPTX-specific operations.
PerformDAGCombine depth. The NVPTX-specific combine at sub_33C0CA0 (62KB) plus the post-legalize combine at sub_32EC4F0 (92KB) total 154KB. Upstream NVPTXISelLowering::PerformDAGCombine is approximately 2KB.
Address space 101. CICC uses address space 101 as an alternative .param encoding (seen in sub_33067C0), which does not exist in upstream LLVM NVPTX. This may be an internal convention for distinguishing kernel .param from device-function .param.
Unchanged from Upstream
The following components appear to be stock LLVM with no NVIDIA modifications:
- SelectionDAG core infrastructure at `0xF05000`--`0xF70000` (combining, known-bits, node management)
- DAG node hashing with `((a3 >> 4) ^ (a3 >> 9)) & (capacity - 1)` at `sub_F4CEE0`
- Constrained FP intrinsic lowering at `sub_F47010` (36KB, `"round.tonearest"`, `"fpexcept.ignore"`)
- `ReplaceAllUsesWith` implementation at `sub_F162A0`
- All SDNode creation, deduplication, and lifecycle management
Function Map
| Function | Address | Size |
|---|---|---|
| SelectionDAGLegalize::LegalizeOp dispatcher (~100 opcodes) | sub_1FCE100 | 91KB |
| SelectionDAGLegalize action dispatch (967 cases) | sub_1FFB890 | 137KB |
| Legalization worklist management | sub_1FF5010 | -- |
| ExpandNode fallback | sub_1FF6F70 | -- |
| DAGCombiner::visitNode (6-phase per-node combine) | sub_F20C20 | 64KB |
| DAGCombiner::combine orchestrator (worklist management) | sub_F681E0 | 65KB |
| ReplaceAllUsesWith (hash: ((id >> 9) ^ (id >> 4))) | sub_F162A0 | -- |
| Combine pattern matcher (STORE/BITCAST/CONSTANT) | sub_F0F270 | 25.5KB |
| Target-independent opcode-specific combine dispatcher | sub_100E380 | -- |
| All-constant-operand fold evaluation | sub_1028510 | -- |
| Vector stride / reassociation combine | sub_F15980 | -- |
| Generic computeKnownBits | sub_F5A610 | 36.7KB |
| Extended known-bits (recursive expansion limit) | sub_F5F040 | 52.4KB |
| SelectionDAG::getNode / CSE hash table | sub_F4CEE0 | 41.3KB |
| DAG node builder (operand/result setup) | sub_F49030 | 38.2KB |
| Constrained FP intrinsic lowering | sub_F47010 | 36.4KB |
| NVPTXTargetLowering::LowerOperation dispatcher | sub_32E3060 | 111KB |
| LowerOperation secondary dispatch (overflow) | sub_3377410 | 75KB |
| NVPTX custom type promotion | sub_32A1EF0 | 109KB |
| NVPTX post-legalize DAG combine | sub_32EC4F0 | 92KB |
| NVPTX vector operation splitting | sub_32FE970 | 88KB |
| NVPTX load/store lowering | sub_32D2680 | 81KB |
| NVPTX integer/FP legalization | sub_32983B0 | 79KB |
| NVPTX intrinsic lowering (tex/surf) | sub_32B8A20 | 71KB |
| NVPTX vector operation lowering | sub_32A9030 | 55KB |
| NVPTX addrspacecast / pointer lowering | sub_32C3760 | 54KB |
| NVPTX conditional/select lowering | sub_32BE8D0 | 54KB |
| NVPTX special register lowering | sub_32B6540 | 50KB |
| NVPTXTargetLowering::PerformDAGCombine | sub_33C0CA0 | 62KB |
| NVPTX DAGCombiner with "COVERED"/"INCLUDED" tracing | sub_3425710 | 142KB |
| NVPTXTargetLowering::LowerCall | sub_3040BF0 | 88KB |
| NVPTX atomic operation lowering | sub_3048C30 | 86KB |
| NVPTXTargetLowering constructor (action setup) | sub_3056320 | 45KB |
| Type legalization table population | sub_3314670 | 73KB |
| Intrinsic lowering mega-switch | sub_33B0210 | 343KB |
| NVPTX computeKnownBitsForTargetNode | sub_33D4EF0 | 114KB |
| NVPTX inline asm constraint handler | sub_338BA40 | 79KB |
| SelectionDAGBuilder::visitInlineAsm | sub_2079C70 | 83KB |
| NVPTX visitNVVMTexSurf handler | sub_2077400 | 20KB |
| NVPTX argument passing / type coercion | sub_2072590 | 38KB |
| NVPTXDAGToDAGISel::Select driver | sub_3090F90 | 91KB |
| Address space / memory operation support | sub_33067C0 | 74KB |
| Global address lowering | sub_331F6A0 | 62KB |
| Formal arguments / return lowering | sub_3349730 | 82KB |
| Call lowering (visitCall / LowerCallTo) | sub_332FEA0 | 79KB |
Reimplementation Checklist
- NVPTXTargetLowering with legality tables. Populate the 2D action table at offset +2422 (259-byte row stride, indexed by `259 * VT + opcode`) with per-SM-version legal/custom/expand/promote actions for all ISD opcodes and NVPTX-specific opcodes. Include the condition-code action table at offset +18112 and the SM-gated type legality rules (f16 on SM 53+, v2f16 on SM 70+, bf16 on SM 80+).
- LowerOperation dispatcher (111KB equivalent). Implement the master `LowerOperation` switch dispatching ~3,626 lines of GPU-specific lowering for loads, stores, calls, atomics, vector operations, and address space casts, including the `.param`-space calling convention with DeclareParam/StoreV1-V4/LoadRetParam sequences.
- Intrinsic lowering mega-switch (343KB equivalent). Build the intrinsic lowering function covering 200+ CUDA intrinsic IDs (up to ID 14196/0x3774), organized as a jump table with per-intrinsic lowering handlers for tensor core, warp, surface/texture, and math intrinsics.
- PerformDAGCombine for NVPTX. Implement the NVPTX-specific DAG combines (62KB) that run after operation legalization, including load/store vectorization (offset-based coalescing with sorting for `ld.v2`/`ld.v4`/`st.v2`/`st.v4` detection), NVPTX-specific algebraic simplifications, and the "COVERED"/"INCLUDED" tracing infrastructure.
- ISel::Select pattern matching (91KB equivalent). Implement the top-down instruction selection driver that visits DAG nodes in topological order, matching against NVPTX-specific patterns via opcode-indexed tables, with special handling for tensor core instructions, inline assembly constraints, and multi-result nodes.
- computeKnownBits for NVPTX (114KB). Implement the NVPTX-specific known-bits analysis covering `ctaid`, `tid`, `ntid`, address space pointer width constraints, and GPU-specific intrinsic range information to enable downstream optimization.
Cross-References
- Type Legalization -- detailed 348KB monolith documentation
- ISel Pattern Matching -- instruction selection patterns and matching
- Register Allocation -- follows ISel in the pipeline
- Address Spaces -- consolidated AS reference
- Register Classes -- NVPTX register class definitions
- NVPTX Opcodes -- MachineInstr opcode reference
- NVPTXTargetMachine -- target machine and TTI hooks
- Emission -- PTX emission from MachineInstrs
- Tensor Core Intrinsics -- WMMA/MMA intrinsic detail
- Surface/Texture Intrinsics -- tex/surf lowering
Type Legalization
Prerequisites: Familiarity with SelectionDAG, NVPTX register classes, and LLVM type system basics. Understanding of the compilation pipeline up to instruction selection is assumed.
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Type legalization is the SelectionDAG phase that rewrites every DAG node whose result or operand type is illegal for the target into equivalent sequences of legal-type operations. In upstream LLVM this logic spans four source files (LegalizeTypes.cpp, LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, LegalizeVectorTypes.cpp) totaling roughly 16,000 lines. In CICC v13.0, NVIDIA ships all of it as a single 348KB monolithic function -- sub_20019C0 -- the largest function in the SelectionDAG address range and among the largest in the entire binary. Operation legalization follows in a separate 169KB function (sub_1FFB890), and vector split/scalarize dispatchers fan out into an additional 25+ worker functions.
The monolithic structure is either an LTO inlining artifact (all four upstream .cpp files collapsed by link-time optimization) or a deliberate choice for branch-prediction locality. The functional behavior is a faithful reproduction of upstream LLVM's DAGTypeLegalizer, but the legality tables, legal-type set, and vector legalization rules are heavily NVPTX-specific.
| Type legalizer monolith | sub_20019C0 (348KB, 10,739 lines) |
| Operation legalizer | sub_1FFB890 (169KB) |
| SplitVectorResult | sub_2029C10 (dispatcher, 190 cases) |
| SplitVectorOperand | sub_202E5A0 (dispatcher, 157 cases) |
| ScalarizeVectorResult | sub_2036110 |
| ScalarizeVectorOperand | sub_2035F80 |
| WidenVector | sub_2036AE0 (31KB, limited NVPTX usage) |
| ExpandIntegerResult | sub_201BB90 (75KB, 632 case labels) |
| PromoteIntegerResult | sub_2000100 (45KB) |
| PerformExpensiveChecks | sub_2010FB0 (62KB, debug verifier) |
| NVPTXTargetLowering init | sub_3314670 (73KB, table population) |
| Upstream | LegalizeTypes.cpp, LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, LegalizeVectorTypes.cpp |
Pipeline Position
Type legalization runs as the first major SelectionDAG transformation after the initial DAG is built by SelectionDAGBuilder (sub_2081F00). The full sequence:
- SelectionDAGBuilder converts LLVM IR to an initial DAG with potentially illegal types
- DAG Combiner (`sub_F20C20`) runs initial combines
- DAGTypeLegalizer (`sub_20019C0`) iterates until all types are legal -- this page
- LegalizeDAG (`sub_1FFB890`) legalizes operations on now-legal types
- DAG Combiner runs again to clean up
- Instruction selection (`sub_3090F90`) pattern-matches the final legal DAG
The type legalizer iterates to a fixpoint: each pass may create new nodes with illegal types (e.g., splitting a vector creates two half-width vectors that may themselves be illegal), so the worklist loops until every node in the DAG has only legal result and operand types.
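The fixpoint shape can be modeled with a tiny worklist: processing one item may enqueue new, still-illegal items, and the loop runs until the worklist drains. A minimal sketch -- the bit widths and the "legal if at most 64 bits" rule here are illustrative stand-ins, not CICC's actual criteria:

```c
#include <assert.h>

/* Toy fixpoint worklist: an item is a bit width; "legal" means <= 64 bits.
   Splitting a too-wide item enqueues two halves, which may themselves be
   illegal and get re-processed -- the same shape as the type legalizer's
   iterate-until-all-legal loop. Returns the number of split steps taken. */
static unsigned legalize_widths(unsigned *work, unsigned n, unsigned cap) {
    unsigned steps = 0;
    while (n > 0) {
        unsigned w = work[--n];        /* pop one pending value */
        if (w <= 64)
            continue;                  /* already legal: nothing to do */
        ++steps;
        if (n + 2 <= cap) {            /* split: push both halves back */
            work[n++] = w / 2;
            work[n++] = w / 2;
        }
    }
    return steps;
}
```

Starting from a single 256-bit item, the loop performs three splits (256 into 2x128, then each 128 into 2x64) before every queued width is legal.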
NVPTX Legal Type Model
The legal type set is defined in the NVPTXTargetLowering constructor (sub_3314670, 73KB) which populates the action table at offset +2422. NVPTX has a narrow set of legal types dictated by the PTX register file:
| Register Class | Legal MVTs |
|---|---|
| Int1Regs (%p) | i1 |
| Int16Regs (%rs) | i16 |
| Int32Regs (%r) | i32 |
| Int64Regs (%rd) | i64 |
| Float32Regs (%f) | f32 |
| Float64Regs (%fd) | f64 |
| Int16HalfRegs (%h) | f16, bf16 |
| Int32HalfRegs (%hh) | v2f16, v2bf16, v2i16, v4i8 |
| Int128Regs (%rq) | i128 (SM 70+) |
For the complete register class table (vtable addresses, PTX types, encoded IDs, copy opcodes) see Register Classes.
The critical constraint: Int32HalfRegs is the only vector register class. It holds exactly 32 bits of packed data. The only legal vector types are those that pack into 32 bits:
- `v2f16` -- two `f16` values in one 32-bit register
- `v2bf16` -- two `bf16` values (SM 80+)
- `v2i16` -- two `i16` values in one 32-bit register
- `v4i8` -- four `i8` values in one 32-bit register
Every other vector type (v4f32, v2f32, v8i32, v4f16, v2f64, etc.) is illegal and must be split, scalarized, or expanded during type legalization. There is no packed float32 SIMD on NVPTX -- this is a fundamental architectural constraint.
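The "exactly 32 bits of packed data" constraint is easiest to see with integer lanes: a v2i16 value is one 32-bit register, and a packed add is two independent 16-bit adds on its halves -- the i16 analogue of what `add.f16x2` provides for f16 lanes. A sketch of the lane model (not CICC code):

```c
#include <assert.h>
#include <stdint.h>

/* Two 16-bit lanes packed into one 32-bit register, Int32HalfRegs-style. */
static uint32_t pack2x16(uint16_t lo, uint16_t hi) {
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Lane-wise add on the packed representation: each half wraps
   independently; no carry crosses the lane boundary. */
static uint32_t add_v2i16(uint32_t a, uint32_t b) {
    uint16_t lo = (uint16_t)(a + b);                  /* low lane  */
    uint16_t hi = (uint16_t)((a >> 16) + (b >> 16));  /* high lane */
    return pack2x16(lo, hi);
}
```

Note the high-lane sum deliberately excludes any carry out of the low lane; that independence is what makes the packed form equivalent to two scalar operations.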
SM-Gated Type Legality
The legal type set changes with the SM version. The constructor at sub_3314670 queries subtarget features and conditionally marks types legal or illegal:
| SM Range | Legal Types Added | Legalization Change |
|---|---|---|
| SM < 53 | (base: i1, i16, i32, i64, f32, f64) | f16 ops promoted to f32; no legal vectors |
| SM 53--69 | Scalar f16 | v2f16 legal for ld/st but packed arithmetic is Custom/Expand |
| SM 70+ | v2f16 packed arithmetic, i128 | f16x2 PTX instructions (add.f16x2, mul.f16x2, fma.rn.f16x2) |
| SM 80+ | v2bf16 | bf16x2 PTX instructions |
| SM 100+ | e2m1x2 (FP4), e2m3x2 (FP6), e3m2x2 (FP6), ue8m0x2 | Additional packed narrow FP types for tensor core feeders |
On SM 70+, v2f16 operations marked Legal or Custom in the action table map directly to packed PTX instructions, delivering 2x throughput versus scalarized f16. This is why CUDA __half2 operations are efficient: the type stays packed through the entire pipeline. In contrast, float4 is always fully scalarized to four independent f32 operations on every SM generation.
The Legality Table
Primary Action Table (offset +2422)
The core data structure is a 2D array inside NVPTXTargetLowering:
action = *(uint8_t *)(TLI + 259 * VT + opcode + 2422)
Where:
- TLI = pointer to the `NVPTXTargetLowering` object (loaded from `this->TLI` at `a1[1]`)
- VT = `SimpleVT` enum value (1--10 for scalar types, 14--109 for vector types)
- opcode = ISD opcode (0--258), capped at `0x102` by a guard check
- 259 = row stride (256 generic opcodes + 3 metadata bytes per VT row)
The action byte encodes:
| Value | Action | Meaning |
|---|---|---|
| 0 | Legal | Node is natively supported -- return immediately |
| 1 | Custom | Call NVPTXTargetLowering::LowerOperation (vtable slot #164, offset +1312) |
| 2 | Expand | Call LegalizeTypes, then ExpandNode (sub_1FF6F70) as fallback |
| 3 | LibCall | Call ExpandNode directly for library-call substitution |
| 4 | Promote | Find a larger legal type and rebuild the node at that type |
The legality check uses (action & 0xFB) == 0 as the "legal" predicate. This means bit 2 is a don't-care -- a node with action byte 0x04 is still treated as legal in certain fast-path checks, which is the standard LLVM encoding where bit 2 flags "custom-but-legal" operations.
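The lookup and the masked legality predicate can be restated directly. In this sketch the +2422 base is absorbed into a plain 2D array and the table contents are illustrative (nothing here is dumped from the binary); the 259-byte stride and the `0xFB` mask match the recovered formula:

```c
#include <assert.h>
#include <stdint.h>

enum { Legal = 0, Custom = 1, Expand = 2, LibCall = 3, Promote = 4 };

/* [VT][opcode] action bytes; zero-initialized entries read as Legal.
   The (9, 56) entry below is a made-up example. */
static uint8_t action_table[110][259] = {
    [9] = { [56] = Custom },
};

/* Equivalent of *(uint8_t *)(TLI + 259 * VT + opcode + 2422). */
static uint8_t get_operation_action(unsigned vt, unsigned opcode) {
    return action_table[vt][opcode];
}

/* Fast-path legality predicate: bit 2 (0x04) is a don't-care, so a
   Promote byte (0x04) still reads as "legal" here. */
static int is_legal_action(uint8_t action) {
    return (action & 0xFB) == 0;
}
```

The last assertion below is the interesting one: `0x04` passes the fast-path check even though it is not literally `Legal`, exactly as described above.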
Type-Supported Flag Array (offset +120)
A second structure at TLI + 8*VT + 120 is a pointer array: non-null means the type VT is natively supported by the target. This provides a fast "is this type legal at all?" check before the per-opcode lookup.
Promotion Action Table (offset +2681)
A 1D table indexed by opcode only (no VT dimension):
action = *(uint8_t *)(TLI + opcode + 2681)
Used for four specific opcodes: BSWAP (43), CTLZ (44), CTTZ (45), and BITREVERSE (199). Also used for opcode 204 (CONCAT_VECTORS) when the operand type is zero. This table encodes whether these operations should be promoted regardless of operand type.
FSINCOS Action Table (offset +3976)
Another 1D table for FSINCOS (opcode 211):
action = *(uint8_t *)(TLI + opcode + 3976)
FSINCOS has unique legalization requirements because it produces two results (sin and cos simultaneously).
Condition Code Action Table (offset +18112)
A packed 4-bit nibble table for condition-code-dependent operations (FP_TO_SINT, FP_TO_UINT, SELECT_CC, BR_CC):
base = (VT_id >> 3) + 15 * condcode_type + 18112
action = (*(uint32_t *)(TLI + base * 4 + 12) >> (4 * (VT_id & 7))) & 0xF
The 15-entry stride per condition code allows per-CC/per-VT legalization decisions. Each nibble stores a 4-bit action code, so two VT actions pack into one byte. This is the standard LLVM condition-code action encoding, but the table is populated with NVPTX-specific rules (e.g., PTX's limited set of comparison predicates determines which CCs are legal for which types).
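The nibble extraction can be sketched as a standalone decoder. The word/nibble arithmetic follows the recovered formula (with the +18112/+12 byte offsets absorbed into a plain array base); the table contents in the test are invented:

```c
#include <assert.h>
#include <stdint.h>

/* Each 32-bit word packs eight 4-bit actions. (vt_id >> 3) selects the
   word within a 15-word-per-condition-code stride; (vt_id & 7) selects
   the nibble inside that word. */
static uint8_t cc_action(const uint32_t *table, unsigned vt_id, unsigned cc) {
    unsigned word = (vt_id >> 3) + 15 * cc;           /* 15-entry CC stride */
    return (table[word] >> (4 * (vt_id & 7))) & 0xF;  /* extract one nibble */
}
```

Two VT actions per byte, eight per word: a dense encoding for a table indexed by both condition code and value type.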
SimpleVT Type Encoding
Types throughout the legalizer are encoded as a single byte, the SimpleVT enum:
| SimpleVT | Type | SimpleVT | Type |
|---|---|---|---|
| 0 | extended/custom | 7 | i128 |
| 1 | i1 | 8 | f16 |
| 2 | i2 (rare) | 9 | f32 |
| 3 | i8 | 10 | f64 |
| 4 | i16 | 14--55 | fixed-width vectors |
| 5 | i32 | 56--109 | scalable vectors |
| 6 | i64 | | |
The bitwidth-to-SimpleVT conversion pattern appears as a recurring code fragment at least 11 times in sub_20019C0:
```c
// Reconstructed from decompilation -- 11 instances in the function
if (bits == 32) VT = 5;                    // i32
else if (bits > 32) {
    VT = 6;                                // i64 tentative
    if (bits != 64) {
        VT = 0;                            // extended type
        if (bits == 128) VT = 7;           // i128
    }
} else {
    VT = 3;                                // i8 tentative
    if (bits != 8) VT = 4 * (bits == 16);  // i16 or 0
}
```
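Flattened into a function, the same mapping reads straight-line (this restates the fragment above; widths outside {8, 16, 32, 64, 128} fall to 0, the extended-type marker, just as in the decompilation):

```c
#include <assert.h>

/* Bit width -> SimpleVT id, per the recurring fragment:
   i8=3, i16=4, i32=5, i64=6, i128=7, anything else = 0 (extended). */
static unsigned simple_vt_for_bits(unsigned bits) {
    if (bits == 32) return 5;        /* i32 */
    if (bits > 32) {
        if (bits == 64)  return 6;   /* i64 */
        if (bits == 128) return 7;   /* i128 */
        return 0;                    /* extended type */
    }
    if (bits == 8) return 3;         /* i8 */
    return bits == 16 ? 4 : 0;       /* i16, or extended */
}
```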
The vector type range 14--109 maps to scalar element types through a ~100-case switch block that also appears six times in the function body:
| MVT Range | Scalar Element | Description |
|---|---|---|
| 14--23 | i2 (VT 2) | Fixed-width v2i2..v1024i2 |
| 24--32 | i8 (VT 3) | Fixed-width v2i8..v256i8 |
| 33--40 | i16 (VT 4) | Fixed-width v2i16..v64i16 |
| 41--48 | i32 (VT 5) | Fixed-width v2i32..v64i32 |
| 49--54 | i64 (VT 6) | Fixed-width v2i64..v32i64 |
| 55 | i128 (VT 7) | Fixed-width v2i128 |
| 56--61 | i2 (VT 2) | Scalable nxv2i2..nxv64i2 |
| 62--67 | i8 (VT 3) | Scalable nxv2i8..nxv64i8 |
| 68--73 | i16 (VT 4) | Scalable nxv2i16..nxv64i16 |
| 74--79 | i32 (VT 5) | Scalable nxv2i32..nxv64i32 |
| 80--85 | i64 (VT 6) | Scalable nxv2i64..nxv64i64 |
| 86--88 | f16 (VT 8) | Scalable nxv2f16..nxv8f16 |
| 89--93 | f32 (VT 9) | Scalable nxv2f32..nxv32f32 |
| 94--97 | f64 (VT 10) | Scalable nxv2f64..nxv16f64 |
| 98--100 | f16 (VT 8) | Fixed-width v2f16..v8f16 (additional) |
| 101--105 | f32 (VT 9) | Fixed-width v2f32..v32f32 (additional) |
| 106--109 | f64 (VT 10) | Fixed-width v2f64..v16f64 (additional) |
This switch implements getVectorElementType() on the decompiled SimpleVT enum. Its six-fold repetition in the monolith accounts for a significant fraction of the function's 348KB size.
The Four Legalization Actions
Promote (Type Widening)
Promotion widens a narrow type to the nearest legal register width. The pattern is consistent across integer and FP promotion:
```
promoted_vt = TLI.getTypeToPromoteTo(opcode, VT)              // sub_1F40B60
extended    = DAG.getNode(ANY_EXTEND, DL, promoted_vt, input) // opcode 143
result      = DAG.getNode(original_op, DL, promoted_vt, extended, ...)
truncated   = DAG.getNode(TRUNCATE, DL, original_vt, result)  // opcode 145
```
For integer promotion, ANY_EXTEND (opcode 143) or ZERO_EXTEND (opcode 144) widens the input depending on whether the high bits need defined values (unsigned operations use ZERO_EXTEND). For FP promotion, the pattern uses FP_EXTEND/FP_ROUND instead:
```
ext0 = DAG.getNode(FP_EXTEND, DL, promoted_vt, op0)
ext1 = DAG.getNode(FP_EXTEND, DL, promoted_vt, op1)
res  = DAG.getNode(FADD, DL, promoted_vt, ext0, ext1)
out  = DAG.getNode(FP_ROUND, DL, original_vt, res)
```
The promote path in sub_1FFB890 contains approximately 30 opcode-specific expansion strategies. The custom-promotion BST (red-black tree at TLI + 9257/9258) stores (opcode, VT) pairs that override the default promotion target. When no BST entry exists, a linear scan walks upward from the current VT until it finds a type where the action is not Custom (i.e., Legal or Expand).
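The extend/operate/truncate pattern can be checked with plain integers: for wrap-around operations like add, performing the operation at a wider width and truncating back is bit-identical, which is why ANY_EXTEND's undefined high bits are acceptable. A small model (the junk constant simulates ANY_EXTEND's unspecified upper bits):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the promote path: an i8 add rebuilt at i32 and truncated back.
   The low 8 bits of the wide sum equal the wrapped i8 sum regardless of
   what the high 24 bits contain, because addition carries only upward. */
static uint8_t promoted_add_i8(uint8_t a, uint8_t b) {
    uint32_t ext_a = (uint32_t)a | 0xABCD0000u; /* ANY_EXTEND: garbage high bits */
    uint32_t ext_b = (uint32_t)b;
    uint32_t wide  = ext_a + ext_b;             /* op at the promoted type */
    return (uint8_t)wide;                       /* TRUNCATE back to i8 */
}
```

For operations where high bits do matter (unsigned compares, shifts right), ZERO_EXTEND is required instead, which is exactly the distinction drawn above.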
Expand (Type Splitting)
Expansion splits a wide type into two halves and reassembles the result:
```
// i128 ADD expansion (simplified)
lo_a = DAG.getNode(EXTRACT_ELEMENT, DL, i64, a, 0)  // low half
hi_a = DAG.getNode(EXTRACT_ELEMENT, DL, i64, a, 1)  // high half
lo_b = DAG.getNode(EXTRACT_ELEMENT, DL, i64, b, 0)
hi_b = DAG.getNode(EXTRACT_ELEMENT, DL, i64, b, 1)
lo_r = DAG.getNode(ADD, DL, i64, lo_a, lo_b)
carry = ...                                         // carry detection via SETCC
hi_r = DAG.getNode(ADD, DL, i64, hi_a, hi_b)
hi_r = DAG.getNode(ADD, DL, i64, hi_r, carry)
result = DAG.getNode(BUILD_PAIR, DL, i128, lo_r, hi_r)
```
For CTLZ (case 53), expansion builds an all-ones mask, AND chain, and shift sequence. For SINT_TO_FP/UINT_TO_FP (cases 59/60), the helper sub_20B5C20 performs iterative two-way splitting: it finds the half-type, builds the pair, and recursively legalizes each half.
The ExpandIntegerResult handler at sub_201BB90 (75KB, 632 case labels) is itself a major function that dispatches expansion for specific opcodes including STORE (case 77), shifts (81--93), and atomics.
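One conventional realization of the SETCC carry in the expansion sketch above is an unsigned-less-than test on the low-half sum. A concrete model of the whole EXTRACT_ELEMENT/ADD/BUILD_PAIR sequence (one standard carry idiom, not necessarily the binary's exact node order):

```c
#include <assert.h>
#include <stdint.h>

/* i128 modeled as a BUILD_PAIR of two i64 halves. */
typedef struct { uint64_t lo, hi; } u128;

static u128 add_u128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;            /* lo_r = ADD(lo_a, lo_b) */
    uint64_t carry = r.lo < a.lo;  /* SETCC ult: wrapped iff sum < addend */
    r.hi = a.hi + b.hi + carry;    /* hi_r = ADD(ADD(hi_a, hi_b), carry) */
    return r;
}
```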
Soften (Float-to-Integer Emulation)
Softening converts unsupported FP operations to integer-based library call sequences. On NVPTX this primarily affects f128 (which has no hardware support on any SM) and f16 on SM < 53. The softened path at sub_2019DA0 (18KB) dispatches via the SoftenedFloats DenseMap.
The FADD/FMUL cases (74/75 in the main switch) compute twice the bit width, find the promoted FP type, and build SUB (opcode 54) / SRL (opcode 123) chains that implement the FP operation in integer arithmetic.
Scalarize and Split Vector
Vector legalization proceeds through recursive halving:
```
v8f32 -> split     -> 2x v4f32
v4f32 -> split     -> 2x v2f32
v2f32 -> scalarize -> 2x f32    (v2f32 is NOT legal on NVPTX)
v4f16 -> split     -> 2x v2f16  (LEGAL on SM 70+ -- stops here)
v8f16 -> split     -> 2x v4f16 -> 4x v2f16
v4i8  -> LEGAL (packed in Int32HalfRegs, no split needed)
v8i8  -> split     -> 2x v4i8   (one split, then legal)
```
The splitting strategy follows LLVM's standard approach:
- Determine half type: `v4f32` splits to `v2f32` via `EVT::getVectorVT(scalar_element, count/2)` (`sub_1F58CC0`)
- Split operands: look up the `SplitVectors` DenseMap to get `{Lo, Hi}` halves from the input's own legalization
- Apply operation: `Lo_result = DAG.getNode(opcode, DL, half_type, Lo_op1, Lo_op2)`, and similarly for `Hi`
- Record result: store `{Lo_result, Hi_result}` in the `SplitVectors` DenseMap via `sub_20167D0`
The critical observation for NVPTX: v2f32 is not legal (no 64-bit packed float register class), so v4f32 ends up fully scalarized to 4x f32. In contrast, v4f16 on SM 70+ splits to 2x v2f16 which is legal, enabling the f16x2 packed instruction path.
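The halving policy condenses into a recursive count of how many final operations a vector type legalizes into, using this page's NVPTX rules (legal packed forms: v2f16 on SM 70+, v2i16, v4i8; v2bf16 on SM 80+ is omitted for brevity). A sketch under those assumptions -- real legalization tracks DAG nodes, not counts, and treats SM 53--69 v2f16 arithmetic as Custom/Expand:

```c
#include <assert.h>

/* Is v<n x elt_bits> a legal packed NVPTX type? Only 32-bit-total packs
   qualify; f16 pairs need SM 70+ for packed arithmetic. */
static int is_legal_vec(unsigned elt_bits, unsigned n, int is_float, unsigned sm) {
    if (elt_bits * n != 32) return 0;
    if (is_float) return elt_bits == 16 && sm >= 70;   /* v2f16 */
    return (elt_bits == 16 && n == 2) || (elt_bits == 8 && n == 4);
}

/* How many final operations does one vector op legalize into? */
static unsigned final_op_count(unsigned elt_bits, unsigned n, int is_float, unsigned sm) {
    if (n == 1) return 1;                              /* scalar: done */
    if (is_legal_vec(elt_bits, n, is_float, sm)) return 1;
    if (n == 2) return 2;                              /* scalarize */
    return 2 * final_op_count(elt_bits, n / 2, is_float, sm); /* split */
}
```

This reproduces the traces above: v4f32 costs 4 operations on every SM, while v4f16 costs 2 on SM 70+ and 4 when packed f16 arithmetic is unavailable.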
Master Opcode Dispatch (sub_20019C0)
The main body of sub_20019C0 is a switch on *(int16_t *)(node + 24) -- the ISD opcode of the current SDNode. Approximately 50 cases are handled:
| Case | ISD Opcode | Action |
|---|---|---|
| 10 | LOAD | legalizeLoad -- type-aware load splitting |
| 11 | STORE | Iterative type demotion loop (see below) |
| 20--21, 26 | Generic arithmetic | Promote via sub_1D38BB0 (getConstant) |
| 27 | EXTRACT_ELEMENT | Split + re-extract |
| 29 | BUILD_PAIR | Promote to i32 |
| 48 | BITCAST | Promote or expand depending on isSimple() |
| 49 | EXTRACT_SUBVECTOR | Extract + rebuild via TRUNCATE (opcode 145) |
| 50 | INSERT_SUBVECTOR | Low/upper split via ANY_EXTEND (143) / ZERO_EXTEND_INREG (144) |
| 51 | CONCAT_VECTORS | Iterate operands, copy each to result list |
| 53 | CTLZ / CTPOP | Expand via mask-then-shift (AND=120, ADD=52) |
| 54 | ATOMIC_CMP_SWAP | Full promote path: check legality table, fallback to libcall |
| 55--56 | SIGN_EXTEND_INREG / SMIN | Legality check via TLI + 259*VT + opcode + 2422 |
| 57--58 | FP_TO_SINT / FP_TO_UINT | Chain of promote + expand nodes |
| 59--60 | SINT_TO_FP / UINT_TO_FP | Iterative split via sub_20B5C20 |
| 70, 72 | FMINNUM / FMAXNUM | BUILD_PAIR (opcode 0x89) reassembly |
| 74--75 | FADD / FMUL | Promote to wider FP type |
| 77 | FMA | Extend operands, FMA at wider type, round back |
| 105 | BUILD_VECTOR | Delegate to sub_1FEC5F0 |
| 106 | EXTRACT_VECTOR_ELT | Check vector element count, dispatch |
| 108 | MGATHER / MSCATTER | Load/store with alignment fixup via sub_20BD400 |
| 110 | VSELECT | Element-by-element type demotion loop |
| 112--113 | SETCC | Legality check with swapped-direction fallback |
| 114--117 | VECREDUCE_* | Opcode lookup in dword_42FEAE0, chain to VECREDUCE |
| 122--124 | SHL / SRL / SRA | Iterative width expansion |
| 125--126 | ROTL / ROTR | 4-way split: shift + mask + OR |
| 136 | BR_CC | Uses CC action table at offset +18112 |
| 152 | ATOMIC_LOAD_* | Delegate to sub_20B7F50 (atomic promote) |
| 153 | ATOMIC_CMP_SWAP_WITH_SUCCESS | Full CAS expansion with APInt mask |
| 199--200 | INTRINSIC_W_CHAIN / INTRINSIC_WO_CHAIN | TLI+112 check, intrinsic lowering dispatch |
| 211 | UNDEF | Replicate zero-constant to fill operand count |
| 243 | TOKEN_FACTOR | Duplicate single operand to all slots |
Cases not listed fall through to LABEL_25 (node already legal or handled by a different legalization category).
Store Iterative Demotion (Case 11)
The STORE case contains an explicit type-walking loop that searches downward for a legal store type:
```c
// Reconstructed from case 11, lines ~2077-2095
while ((vt_byte - 8) > 1) {              // while VT is not f16(8) or f32(9)
    --vt_byte;                           // try next smaller type
    if (TLI.getTypeAction(VT))           // sub_1D16180
        if (TLI.isOperationLegal(STORE, VT))
            break;                       // found a legal store type
}
```
This walks i64 -> i32 -> i16 -> i8 (or f64 -> f32 -> f16) until it finds a type the target can store natively, then emits a truncating store sequence via sub_1D3C080 (getTruncStore).
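The demotion loop amounts to walking SimpleVT ids downward until a predicate accepts the type. A minimal sketch with a stand-in legality predicate (the real loop consults getTypeAction and the action table; the bitmask here is purely illustrative):

```c
#include <assert.h>

/* Walk downward through SimpleVT ids (i64=6 -> i32=5 -> i16=4 -> i8=3)
   until the target reports a legal store type. legal_mask is a stand-in:
   bit k set means SimpleVT k is storable natively. */
static unsigned demote_store_vt(unsigned vt, unsigned legal_mask) {
    while (vt > 1 && !((legal_mask >> vt) & 1u))
        --vt;                 /* try the next smaller type */
    return vt;
}
```

The chosen type then drives a truncating store: the value is stored at the demoted width, matching the getTruncStore call described above.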
Atomic CAS Expansion (Cases 54, 153)
Atomic operations receive extensive legalization because PTX has limited atomic type support. The CAS expansion at case 153 (ATOMIC_CMP_SWAP_WITH_SUCCESS) builds APInt masks via sub_16A4EF0, constructs compare-and-swap loops, and handles the success flag as a separate result. The helper sub_20B7E10 decides whether to use a CAS loop or a direct atomic based on the target SM's capabilities.
Vector Legalization Workers
SplitVectorResult (sub_2029C10)
This thin dispatcher reads the opcode from *(uint16_t *)(node + 0x18), subtracts base 0x30 (48), and dispatches across 190 cases (opcodes 48--237) to SplitVecRes_XXX workers. Key handler categories:
| Handler | Cases | Description |
|---|---|---|
| sub_20230C0 | FADD--FREM, SHL/SRA/SRL, int arith | Generic binary op split: split both inputs, apply op to each half |
| sub_2028A10 | CONCAT, INSERT_ELT, load/store variants | Unary/multi-input split with reassembly |
| sub_2025910 | Strict FP (cases 81--98) | Strict FP split with exception chain propagation |
| sub_2023B70 | BUILD_VECTOR (case 104) | Split BUILD_VECTOR into two half-width constructs |
| sub_2023F80 | CONCAT inner (case 107) | Trivial: return two operands as Lo and Hi |
| sub_20293A0 | VECTOR_SHUFFLE (case 110, 10KB) | Decompose shuffle into sub-shuffles on half-width vectors |
| sub_20251A0 | VSELECT, EXTRACT_ELT | Split condition mask along with operands |
| sub_2025380 | Extending loads (cases 149--151) | Split load into two half-width loads |
Four handlers in the 0x214xxxx range are NVPTX-specific split workers not present in upstream:
| Handler | Opcode | NVPTX-Specific Behavior |
|---|---|---|
| sub_2146BB0 | CONCAT_VECTORS | Checks VT range 0x0E--0x6D for packed-type dispatch |
| sub_2146C90 | SELECT_CC / BR_CC (2.7KB) | Multi-operand split with per-operand type classification |
| sub_2147770 | FP_ROUND-like | NVPTX-specific FP rounding split |
| sub_2147AE0 | BITCAST | NVPTX-specific bitcast split for packed registers |
After a handler returns, the dispatcher stores the {Lo, Hi} result pair in the SplitVectors DenseMap via sub_20167D0 (hash = 37 * key, quadratic probing, rehash at 75% load).
Fatal error on unhandled opcode: "Do not know how to split the result of this operator!" via sub_16BD130.
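The map's probing scheme (hash = 37 * key, quadratic probing, rehash at 75% load) can be sketched as a flat open-addressing table. Key 0 serves as the empty sentinel here for simplicity; LLVM's real DenseMap uses dedicated empty/tombstone keys and differs in detail:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uintptr_t key, lo, hi; } Entry;      /* key -> {Lo, Hi} */
typedef struct { Entry *slots; unsigned cap, count; } SplitMap;

static void map_init(SplitMap *m, unsigned cap) {     /* cap: power of two */
    m->slots = calloc(cap, sizeof(Entry));
    m->cap = cap;
    m->count = 0;
}

static Entry *find_slot(SplitMap *m, uintptr_t key) {
    unsigned i = (37u * (unsigned)key) & (m->cap - 1); /* hash = 37 * key */
    for (unsigned probe = 1; ; ++probe) {
        Entry *e = &m->slots[i];
        if (e->key == 0 || e->key == key)
            return e;
        i = (i + probe) & (m->cap - 1);                /* quadratic probing */
    }
}

static void map_put(SplitMap *m, uintptr_t key, uintptr_t lo, uintptr_t hi) {
    if (4 * (m->count + 1) > 3 * m->cap) {             /* rehash at 75% load */
        SplitMap grown;
        map_init(&grown, m->cap * 2);
        for (unsigned j = 0; j < m->cap; ++j)
            if (m->slots[j].key != 0)
                *find_slot(&grown, m->slots[j].key) = m->slots[j];
        grown.count = m->count;
        free(m->slots);
        *m = grown;
    }
    Entry *e = find_slot(m, key);
    if (e->key == 0)
        m->count++;
    e->key = key;
    e->lo = lo;
    e->hi = hi;
}
```

With a power-of-two capacity, the triangular probe increments (1, 2, 3, ...) visit every slot, and the 75% cap guarantees the probe loop terminates.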
SplitVectorOperand (sub_202E5A0)
Same dispatch pattern as SplitVectorResult but for operand-side legalization. Base opcode 0x65 (101), range 157 (opcodes 101--258). Notable inline handling for FP_EXTEND/FP_ROUND (cases 146--147, 152--153) that compares source and destination type sizes to choose the correct split strategy:
```c
// Inline in SplitVectorOperand, cases 146-147
src_size = getSizeInBits(src_vt);   // sub_2021900
dst_size = getSizeInBits(dst_vt);
if (dst_size < src_size)
    SplitVecOp_VSELECT(...)         // sub_202D8A0 -- shrinking
else
    SplitVecOp_Generic(...)         // sub_202A670 -- standard split
```
After the handler, ReplaceAllUsesOfValueWith (sub_2013400) substitutes the old node with the split result.
Scalarize and Widen
ScalarizeVectorResult (sub_2036110) handles vector types that reduce to scalar. ScalarizeVectorOperand (sub_2035F80) has 80 cases starting from base opcode 106. These cover the final step when splitting has reduced a vector to width 1 or 2 elements, and those elements must become individual scalars.
WidenVector (sub_2036AE0, 31KB) sees limited use on NVPTX. Widening is only useful when the wider type is legal:
- Widening `v1f16` to `v2f16` is useful (promotes to legal packed type)
- Widening `v3i8` to `v4i8` is useful (promotes to legal packed type)
- Widening `v3f32` to `v4f32` is not useful (`v4f32` is still illegal)
The WidenVector path uses the MVT lookup table at word_4305480 to determine element counts and find the nearest wider legal vector type.
Operation Legalization (sub_1FFB890)
After type legalization, operation legalization processes each node through a per-opcode action lookup. The same primary action table is used:
action = *(uint8_t *)(TLI + 259 * VT + opcode + 2422)
The dispatch:
| Action | Code | Path |
|---|---|---|
| Legal | 0 | Return immediately |
| Custom | 1 | TLI->LowerOperation(node, DAG) via vtable slot #164 (offset +1312) |
| Expand | 2 | sub_20019C0 (LegalizeTypes), then sub_1FF6F70 (ExpandNode) as fallback |
| LibCall | 3 | sub_1FF6F70 (ExpandNode) directly |
| Promote | 4 | Find larger legal type, rebuild node |
| Special | 5+ | sub_1FF9780 (ExpandLoad) or sub_1FF5310 (LegalizeLoadOps) for load/store variants |
When Custom lowering returns NULL, the framework falls through to expansion. When it returns a different node, ReplaceAllUsesWith splices the replacement into the DAG and marks the old node dead (tombstone value -2 in the worklist hash set).
The operation legalizer also contains an outer switch on the ISD opcode (v11 = *(uint16_t *)(node + 24)) for opcode-specific handling before the table lookup. Shift/rotate opcodes (81--98) are remapped to internal opcode numbers before the table lookup (e.g., case 81 maps to internal opcode 76, case 82 to 77). The opcode-specific dispatch covers approximately 30 opcode groups.
How CUDA Vector Types Get Legalized
Tracing common CUDA types through the full legalization pipeline:
float4 (v4f32) -- fully scalarized on every SM:
- SplitVectorResult: v4f32 -> 2x v2f32
- ScalarizeVectorResult: v2f32 -> 2x f32 (no packed f32 register class)
- Final: 4 independent f32 scalar operations
- PTX: 4 separate add.f32/mul.f32 instructions
half2 (__half2 / v2f16) -- stays packed on SM 70+:
- Legal type, no splitting needed
- Final: single v2f16 packed operation
- PTX: add.f16x2, mul.f16x2, fma.rn.f16x2
__nv_bfloat162 (v2bf16) -- legal on SM 80+:
- Same as half2 but with bf16x2 PTX instructions
float2 (v2f32) -- scalarized, not packed:
- ScalarizeVectorResult: v2f32 -> 2x f32
- No 64-bit packed float register class exists
v4f16 on SM 70+:
- SplitVectorResult: v4f16 -> 2x v2f16 (legal -- stops here)
- Final: 2x f16x2 packed operations (2x throughput vs scalarized)
v4f16 on SM < 53:
- Split: v4f16 -> 2x v2f16
- Scalarize: each v2f16 -> 2x f16
- Promote: each f16 -> FP_EXTEND -> f32
- Final: 4x f32 operations with FP_EXTEND/FP_ROUND wrappers
double2 (v2f64):
- Scalarize: v2f64 -> 2x f64 (splitting would give v1f64, which is scalar)
Tensor core fragments bypass vector legalization entirely. WMMA/MMA intrinsics represent matrix fragments as individual scalar registers, not LLVM vector types. However, packed conversion types used with tensor cores (e4m3x2, e5m2x2, e2m1x2, etc.) do pass through legalization and map to Int32HalfRegs.
Verification Infrastructure
sub_2010FB0 (62KB) implements DAGTypeLegalizer::PerformExpensiveChecks, gated by the enable-legalize-types-checking flag (registered at ctor_341). It validates nine DenseMap categories that track the state of every legalized value:
| Map | Content |
|---|---|
| PromotedIntegers | Values widened to a larger integer type |
| ExpandedIntegers | Values split into two halves |
| SoftenedFloats | FP values converted to integer representation |
| PromotedFloats | FP values widened to a larger FP type |
| ExpandedFloats | FP values split into halves |
| ScalarizedVectors | Vectors reduced to scalar elements |
| SplitVectors | Vectors split into {Lo, Hi} pairs |
| WidenedVectors | Vectors widened to a larger legal type |
| ReplacedValues | Values replaced by RAUW |
Diagnostic strings on verification failure: "Processed value not in any map!", "Value in multiple maps!", "Value with legal type was transformed!".
DAG Node Builder Subroutines
Key subroutines called from the type legalizer for constructing replacement DAG nodes:
| Function | Upstream Equivalent | Notes |
|---|---|---|
| sub_1D309E0 | DAG.getNode(opc, DL, VT, op) | 1-operand (TRUNCATE, ANY_EXTEND, etc.) |
| sub_1D332F0 | DAG.getNode(opc, DL, VT, op1, op2) | 2-operand |
| sub_1D3A900 | DAG.getNode(opc, DL, VT, op1, op2, op3) | 3-operand (FMA) |
| sub_1D38BB0 | DAG.getConstant(val, DL, VT) | Integer constant creation |
| sub_1D38970 | DAG.getConstant(APInt) | Wide constant / all-ones mask |
| sub_1D364E0 | DAG.getUNDEF(VT) | Undefined value |
| sub_1D37440 | DAG.getSetCC(DL, VT, LHS, RHS, CC) | Comparison node |
| sub_1D36A20 | DAG.getSelectCC(DL, VT, ..., CC) | Select-on-comparison |
| sub_1D3BC50 | DAG.getExtLoad(opc, DL, VT, ...) | Extending load |
| sub_1D3C080 | DAG.getTruncStore(...) | Truncating store |
| sub_1D23890 | DAG.ReplaceAllUsesWith(old, new) | RAUW for result replacement |
| sub_1FEB8F0 | MVT::getSizeInBits(SimpleVT) | Bit width from SimpleVT |
| sub_1F58D40 | EVT::getSizeInBits() | Bit width from extended VT |
| sub_1F58D30 | EVT::getVectorNumElements() | Vector element count |
| sub_1F40B60 | TLI.getTypeToPromoteTo(opc, VT) | Promotion target lookup |
| sub_1D16180 | TLI.getTypeAction(VT) | Action for type |
| sub_1D16EF0 | TLI.getCondCodeAction(CC, VT) | Condition code legality |
Result Accumulation and Worklist
Results from each legalization step are accumulated into a SmallVector of {SDValue, SDValue} pairs (node pointer + result index). The vector grows via sub_16CD150 (SmallVector::grow()) when count exceeds capacity. After each pass, new nodes feed back into the worklist for iterative re-legalization until fixpoint -- all types are legal.
The worklist hash set uses open addressing with hash function ((id >> 9) ^ (id >> 4)) & (size - 1) and grows at 75% load factor. Dead nodes are marked with sentinel -2 (tombstone). The DenseMap instances used by the split/scalarize infrastructure use hash 37 * key with quadratic probing.
Differences from Upstream LLVM
| Aspect | Upstream LLVM 20 | CICC v13.0 |
|---|---|---|
| Source organization | 4 files, ~16,000 lines total | 1 monolithic function, 10,739 lines (348KB) |
| Vector legal types | Target-dependent, often includes v4f32, v2f64 | Only v2f16, v2bf16, v2i16, v4i8 (32-bit packed) |
| v2f32 | Legal on most targets (x86, ARM) | Illegal -- scalarized |
| Scalable vectors | Actively used (AArch64 SVE) | Encoded in tables but no SM target uses them |
| i128 | Expanded on most targets | Legal on SM 70+ (Int128Regs / .b128 / %rq) |
| NVPTX-specific split handlers | N/A | 4 functions in 0x214xxxx range for packed-type dispatch |
| Custom-promotion BST | Standard red-black tree | Same, at TLI offsets +9257/+9258 |
| Type-supported flag array | Pointer array at known offset | At TLI + 8*VT + 120 |
| CC action table | 4-bit packed nibbles | Same encoding, NVPTX-specific CC legal set |
The monolithic structure means that code changes to any legalization category (integer promote, float soften, vector split) require recompilation of the entire 348KB function. In upstream LLVM, these are independent compilation units.
Configuration
| Knob | Location | Default | Description |
|---|---|---|---|
| enable-legalize-types-checking | ctor_341 | false | Enables PerformExpensiveChecks debug verifier |
No CICC-specific legalization knobs beyond the standard LLVM flag were found. The ptxas assembler has a related knob MercuryDisableLegalizationOfTexToURBound for texture-to-uniform-register legalization, but this operates at the assembler level, not in CICC.
Key Functions
| Function | Address | Size | Role |
|---|---|---|---|
| Type legalizer monolith | sub_20019C0 | 348KB | DAGTypeLegalizer::run() master dispatch |
| PromoteIntegerResult | sub_2000100 | 45KB | Integer type promotion |
| PromoteFloatResult | sub_2019DA0 | 18KB | Float type promotion / softening |
| ExpandFloatResult | sub_201B410 | 11KB | Float type expansion |
| ExpandIntegerResult | sub_201BB90 | 75KB | Integer type expansion (632 case labels) |
| Promote+expand dispatch | sub_201E5F0 | 81KB | Secondary dispatch (441 case labels) |
| PerformExpensiveChecks | sub_2010FB0 | 62KB | Debug verifier for 9 DenseMap categories |
| SplitVectorResult | sub_2029C10 | 5KB | Dispatcher for 190 opcode cases |
| SplitVectorOperand | sub_202E5A0 | 6KB | Dispatcher for 157 opcode cases |
| SplitVecRes_BinOp | sub_20230C0 | -- | Generic binary op split |
| SplitVecRes_VECTOR_SHUFFLE | sub_20293A0 | 10KB | Shuffle decomposition |
| ScalarizeVectorResult | sub_2036110 | -- | Vector-to-scalar reduction |
| ScalarizeVectorOperand | sub_2035F80 | -- | Operand scalarization (80 cases) |
| WidenVector | sub_2036AE0 | 31KB | Vector widening (limited NVPTX use) |
| Operation legalizer | sub_1FFB890 | 169KB | LegalizeOp per-node action dispatch |
| ExpandNode | sub_1FF6F70 | 43KB | Full node expansion fallback |
| ExpandLoad | sub_1FF9780 | 55KB | Load legalization |
| LegalizeLoadOps | sub_1FF5310 | 41KB | Store splitting/coalescing |
| NVPTX split: CONCAT | sub_2146BB0 | 219B | NVPTX-specific CONCAT_VECTORS split |
| NVPTX split: SELECT_CC | sub_2146C90 | 2.7KB | NVPTX-specific SELECT_CC split |
| NVPTX split: FP_ROUND | sub_2147770 | -- | NVPTX-specific FP rounding split |
| NVPTX split: BITCAST | sub_2147AE0 | -- | NVPTX-specific bitcast split |
| NVPTXTargetLowering init | sub_3314670 | 73KB | Populates legality tables |
| FP conversion split helper | sub_20B5C20 | -- | Iterative SINT_TO_FP/UINT_TO_FP |
| Atomic promote helper | sub_20B7F50 | -- | ATOMIC_LOAD promotion |
| CAS expansion decision | sub_20B7E10 | -- | CAS loop vs direct atomic |
| Gather/scatter alignment | sub_20BD400 | -- | MGATHER/MSCATTER alignment fixup |
Reimplementation Checklist
- NVPTX legal type model. Define the narrow set of legal types dictated by PTX register classes (i1, i16, i32, i64, f32, f64, f16, bf16, v2f16, v2bf16, v2i16, v4i8, i128), with SM-gated legality: f16 arithmetic on SM 53+, v2f16 packed ops on SM 70+, v2bf16 on SM 80+, FP4/FP6 packed types on SM 100+.
- Primary legality table population. Build the 2D action table at TLI + 259 * VT + opcode + 2422 with per-opcode-per-type action bytes (0=Legal, 1=Custom, 2=Expand, 3=LibCall, 4=Promote), plus the type-supported flag array at offset +120, the promotion action table at offset +2681, and the condition-code action table at offset +18112 with 4-bit packed nibbles.
- Four legalization actions. Implement Promote (widen via ANY_EXTEND/ZERO_EXTEND, operate, TRUNCATE), Expand (split via shift-and-OR for integers, libcall for floats), Soften (integer emulation of unsupported FP types), and Scalarize/Split-Vector (decompose illegal vectors into scalar or half-width vector operations).
- Iterative fixpoint loop. Run the type legalizer worklist until every node in the DAG has only legal result and operand types, since each pass may create new nodes with illegal types (e.g., splitting a vector creates half-width vectors that may themselves require further splitting).
- Vector legalization for NVPTX. Handle the critical constraint that Int32HalfRegs is the only vector class (32 bits total): scalarize all vectors wider than 32 bits (v4f32, v2f32, v8i32, etc.) while keeping v2f16/v2bf16/v2i16/v4i8 legal. Implement the SplitVectorResult/SplitVectorOperand/ScalarizeVector dispatchers with their 190+/157+/~100 case switches.
- SimpleVT type encoding. Implement the bitwidth-to-SimpleVT conversion (11 instances in NVIDIA's monolith) and the ~100-case vector-element-type switch (6 instances) mapping MVT ranges 14--109 to their scalar element types.
Cross-References
- SelectionDAG & Instruction Selection -- parent page covering the full SelectionDAG pipeline
- NVPTX Target Infrastructure -- NVPTXTargetLowering constructor and TTI hooks
- SM 70--89, SM 90, SM 100 -- per-SM legal type details
- DAG Node -- SDNode layout (opcode at +24, operands at +32, type at +40)
- Hash Infrastructure -- DenseMap mechanics used throughout legalization
ISel Pattern Matching & Instruction Selection
Prerequisites: Familiarity with SelectionDAG, Type Legalization, and DAG Node Layout. Understanding of the Pattern Database structure and NVPTX opcodes is recommended.
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
The NVPTX instruction selector in cicc v13.0 translates legal SelectionDAG nodes into target MachineInstr opcodes through a three-level dispatch hierarchy totaling approximately 900KB of code. At the top sits NVPTXDAGToDAGISel::Select (sub_3090F90, 91KB), which builds a per-function cost table, manages a priority-queue-driven topological worklist, and calls the pattern matcher (sub_308FEE0) for every node. The pattern matcher fans out to a hand-written NVPTX-specific select switch (sub_347A8D0, 309KB) and a TableGen-generated SelectCode function (sub_348D3E0, 256KB). Surrounding this core are six NVPTX-specific sub-selectors covering memory operations, texture/surface fetches, complex addressing modes, vector patterns, and atomics. NVIDIA's key delta from upstream LLVM is (1) a compressed per-SM-variant legality table that gates which target opcodes exist on which GPU architecture, (2) a secondary 4-bit packed bitfield for fine-grained operand-class legality, and (3) the iteration budget that prevents the selector from looping indefinitely on pathological DAGs.
| ISel driver | sub_3090F90 (91KB, 2,828 lines) |
| Pattern matcher entry | sub_308FEE0 |
| NVPTX Select switch | sub_347A8D0 (309KB -- largest ISel function) |
| SelectCode (TableGen) | sub_348D3E0 (256KB -- auto-generated) |
| Vector/SIMD patterns | sub_3475BB0 (89KB) |
| Memory operation patterns | sub_306D850 (77KB) |
| Complex addressing modes | sub_30811D0 (77KB) |
| Addressing mode helper | sub_30783B0 (39KB) |
| Texture/surface ISel | sub_306A930 (52KB) |
| Atomic lowering | sub_3048C30 (86KB) |
| Constraint table | word_3F3E6C0 (see Pattern Database) |
| Compressed legality table | Base + 6414, 500-byte stride per SM variant |
| Secondary 4-bit bitfield | Base + 521536 |
| Legalize action table | Object + 72760, 4-bit packed |
| Knob registration | ctor_286 at 0x4FA0C0 (5KB) |
| Upstream LLVM source | lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp, lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp |
ISel Driver: sub_3090F90
The top-level driver is not the pattern matcher itself; it is the orchestration loop that feeds nodes to the matcher in the right order and maintains shared state. It breaks into three phases.
Phase 1: Function Argument Cost Table
Before selecting any instructions, the driver builds a DenseMap-style hash table at this + 408 that maps function argument indices to their byte sizes. The hash table uses LLVM's standard integer-key hash function key * 37, open addressing with linear probing, and the tombstone sentinel -2. Growth triggers at 75% load factor (4 * (count + 1) >= 3 * capacity).
// Phase 1: build argument cost table
hash_table = this->arg_cost_map; // at this + 408
for each argument A in function->args():
byte_size = alignTo(getSizeInBits(A.type) / 8, A.alignment)
key = A.index
slot = (key * 37) & (capacity - 1)
while hash_table[slot] is occupied and != key:
slot = (slot + 1) & (capacity - 1)
hash_table[slot] = { key, byte_size }
if load_factor > 0.75: rehash()
The table layout:
| Field | Offset from this | Description |
|---|---|---|
| data | +416 | Pointer to hash bucket array |
| count | +424 | Number of live entries |
| tombstone_count | +428 | Number of tombstone slots |
| capacity | +432 | Total bucket count (power of 2) |
If the function has a non-void return type, the driver also inserts the return value sizes into the same table, computing aligned_size = ((((size + 7) >> 3) + (1 << align) - 1) >> align) << align for each return element (bits to bytes, then round up to the 2^align boundary; the parenthesization is our reading of the decompiled expression). The return-type attribute check uses attribute kind 81 (likely sret).
Phase 2: Return Value Processing
For non-void functions, the driver iterates each return value element via:
- sub_A74710(attribute, 81) -- checks for sret attribute
- sub_A748A0(index) -- gets return type at given index
- sub_AE5020(dataLayout, type) -- computes ABI alignment
- sub_9208B0(dataLayout, type) -- computes size in bits
Each return value's aligned byte size is inserted into the argument cost table, so the pattern matcher can look up the cost of materializing any function parameter or return value during instruction selection.
Phase 3: Topological Selection Loop
The main selection loop processes DAG nodes in topological order using a min-heap priority queue where priority equals topological order (lower number = earlier in the DAG, processed first). The iteration is bounded by an explicit budget.
// Phase 3: main ISel loop
sub_308B6F0(this); // initialize worklist from DAG
budget = 4 * numInstructions * maxBlockSize
iteration = 0
while heap is not empty:
node = heap.extractMin() // sub_3089BD0: heap-sift-down
sub_308FEE0(this, node, &tmp) // pattern matcher dispatch
if this->selectionChanged: // byte at this + 400
re-scan affected nodes
iteration++
if iteration > budget:
break // anti-infinite-loop guard
sub_308AB30(this) // cleanup
sub_264E600(this) // deallocate worklist
sub_308B100(this) // destroy hash table
The min-heap stores (SDNode*, priority) pairs at 16-byte stride. The heap-sift-down operation (sub_3089BD0) maintains the heap invariant after extraction. The selectionChanged flag at this + 400 is set by the pattern matcher when it replaces a node, signaling the driver to re-examine downstream users.
The iteration budget formula 4 * numInstructions * maxBlockSize is an NVIDIA addition -- upstream LLVM's SelectionDAGISel does not have this guard. It prevents pathological DAGs (for example, from heavily-inlined device functions with thousands of parameters) from causing the selector to spin indefinitely when combine/legalize/select cycles interact.
Pattern Matcher Dispatch: sub_308FEE0
The pattern matcher is called once per SDNode. It reads the node's opcode at *(node + 24) and dispatches through a multi-level decision tree:
- Quick-reject filter. If the node is already selected (machine opcode bit set in flags), return immediately.
- NVPTX-specific hand-written patterns. Calls sub_347A8D0 for NVPTX custom opcodes (NVPTXISD range >= 499). This handles texture loads, MMA instructions, atomic operations, .param-space loads/stores, and other GPU-specific patterns.
- TableGen auto-generated matcher. Calls sub_348D3E0 (SelectCode) for standard ISD opcodes. This function is mechanically generated from the .td pattern files in the NVPTX backend and contains a massive switch table mapping DAG patterns to MachineInstr opcodes.
- Complex pattern matching. For load/store addressing modes, calls sub_30811D0 (77KB) and sub_30783B0 (39KB), which match base + offset, base + scaled_index, and address-space-qualified patterns.
- Fallback. If no pattern matches, the node is marked as "failed ISel" and the driver may retry after DAG combining.
NVPTX Select Switch: sub_347A8D0 (309KB)
This is the largest single ISel function, containing the hand-written pattern matching for all NVIDIA-specific DAG nodes. It calls sub_969240 263 times (SDNode accessor), is self-recursive 42 times, and dispatches to:
| Sub-selector | Size | Coverage |
|---|---|---|
| sub_3447D70 | 32KB | Specific pattern sub-dispatch |
| sub_3441190 | -- | Pattern helpers |
| sub_343FD60 | -- | Type-aware matching |
| sub_3475BB0 | 89KB | Vector/SIMD patterns (v2, v4 packed types) |
The function switches on the SDNode opcode to handle:
- Load/store with address spaces -- selects between ld.global, ld.shared, ld.local, ld.param, ld.const, and generic-space loads, each requiring different PTX instructions.
- Texture/surface operations -- dispatches to sub_306A930 for tex, suld, sust instruction patterns.
- MMA/WMMA/tensor ops -- selects the correct mma.sync, wmma.mma, wgmma variant based on operand types and SM architecture.
- Atomic operations -- selects between atom.global.add, atom.shared.cas, red.global.add, etc., with scope qualifiers (.cta, .gpu, .sys).
- Barrier/fence operations -- selects bar.sync, bar.warp.sync, membar.cta, membar.gl, membar.sys.
SelectCode (TableGen): sub_348D3E0 (256KB)
This auto-generated function implements the standard LLVM TableGen pattern matching algorithm. It is a giant switch-table compiled from the .td instruction pattern files in lib/Target/NVPTX/*.td. The function:
- Calls sub_969240 45 times and sub_32889F0 38 times (opcode/type checkers).
- Contains no string literals (purely mechanical code).
- Works in tandem with sub_347A8D0: the hand-written selector handles NVPTX custom nodes first, and anything that falls through goes to SelectCode.
The auto-generated matcher encodes patterns as a sequence of opcode checks, type checks, and operand recursive matches. When a full pattern matches, it calls MorphNodeTo to convert the SDNode into a MachineSDNode with the target opcode and register operands.
Compressed Instruction Legality Table
NVIDIA's instruction selector uses a per-SM-variant legality table to determine whether a given target opcode is legal on the current GPU architecture. This table is checked during instruction selection to gate SM-specific instructions (for example, wgmma instructions are illegal on SM 70 but legal on SM 90+).
The table lives at a fixed offset from the base of the ISel object, accessed by sub_376DE90:
legality = *(uint8_t*)(base + 500 * arch_variant + opcode + 6414)
| Field | Encoding |
|---|---|
| Base offset | 6414 bytes from object base |
| Row stride | 500 bytes per architecture variant |
| Index | 500 * arch_variant + opcode |
| Value 0 | Illegal -- this opcode does not exist on this SM |
| Value 1 | Custom -- requires custom lowering before emission |
| Value 2 | Legal -- can be emitted directly |
The arch_variant value selects which row of the table to consult. Each row contains 500 entries, one per target opcode. The table is read-only after initialization and occupies approximately num_variants * 500 bytes in the .data section.
Secondary 4-bit Packed Bitfield
A second legality table at base + 521536 provides fine-grained operand-class legality using 4-bit packed nibbles:
byte_offset = (opcode_class >> 3) + 36 * arch_id - arch_id
nibble = (*(uint8_t*)(base + 521536 + byte_offset) >> (4 * (opcode_class & 7))) & 0xF
The offset simplification 36 * arch_id - arch_id equals 35 * arch_id, giving a 35-byte stride per architecture variant. The shift count 4 * (opcode_class & 7) runs up to 28 bits, which implies the load is at least 32 bits wide despite the byte-typed pointer in the decompiled expression: each 32-bit group packs eight 4-bit legality fields, selected by the low three bits of opcode_class. The 4-bit values encode a richer set of actions than the primary table's 3-value encoding.
Legalize Action Table
The operation legalization subsystem (separate from the ISel legality table above) uses a 4-bit packed action table at object offset 72760 to determine how to legalize each (opcode, type) pair:
index = type_bits + 15 * opcode + 18112
action = (*(uint32_t*)(object + 4 * index + 72760) >> (4 * (type & 7))) & 0xF
| Action | Value | Behavior |
|---|---|---|
| Legal | 0 | Node is natively supported |
| Promote | 1 | Widen to a larger legal type |
| Custom | 5 | Call NVPTXTargetLowering::LowerOperation via vtable slot 164 |
| ExpandInteger | 9 | Split wide integers into halves |
| ExpandFloat | 13 | Emulate unsupported FP via libcalls |
| SplitVector | 14 | Decompose illegal vector into legal sub-vectors |
This table is distinct from the type-legality table at TLI + 2422 (described in SelectionDAG), which uses a 259-byte stride and encodes the simpler 5-action set (Legal/Custom/Expand/LibCall/Promote). The table at +72760 is the operation-level action table used during the LegalizeOp phase, while the +2422 table is the type-level action table used during LegalizeTypes.
NVPTX-Specific Pattern Categories
Memory Operations: sub_306D850 (77KB)
Selects PTX load/store instructions with the correct address space qualifier, vector width, and volatility. The function handles the full matrix of {ld,st} x {.global,.shared,.local,.param,.const,.gen} x {.b8,.b16,.b32,.b64,.b128} x {.v1,.v2,.v4} x {.volatile,.relaxed,.acquire,.release} instruction variants. Address space is determined by querying the pointer operand's address space attribute through the DAG.
The memory pattern matching also covers:
- Vector loads/stores -- ld.global.v2.b32, ld.global.v4.b32, and their 64-bit variants, selected based on the vector element count (1, 2, or 4).
- Parameter loads -- ld.param.b32 and st.param.b32 for call ABI (see SelectionDAG: .param ABI).
- Generic-space loads with addrspacecast -- when the address space is generic (AS 0), the selector checks whether the source can be proven to be in a specific space and emits a non-generic load if so.
Texture/Surface Instructions: sub_306A930 (52KB)
Selects tex, suld, and sust instructions from DAG nodes produced by the intrinsic lowering mega-switch. The selector dispatches through helper functions:
| Helper | Purpose |
|---|---|
| sub_2FE5F00 | Texture fetch type selection |
| sub_2FE5F30 | Surface read type selection |
| sub_2FE5F60 | Surface write type selection |
| sub_2FE69A0 | Texture sampler mode selection |
| sub_2FE6CC0 | Unified texture/surface dispatch |
Texture instructions have complex operand requirements: sampler reference, texture reference, coordinate type (1D/2D/3D/cube), data type (f32/i32/f16), and optional LOD/gradient parameters. The selector maps each combination to a specific PTX tex.1d.v4.f32.f32 (or similar) opcode.
Complex Addressing Modes: sub_30811D0 (77KB)
Matches addressing patterns for load/store operands. NVPTX supports a limited set of addressing modes compared to x86:
- Register + immediate offset -- [%r1 + 16], the most common PTX addressing mode.
- Register -- [%r1], zero-offset variant.
- Immediate -- [0x1000], absolute address (rare on GPU).
- Register + register -- not directly supported in PTX; decomposed into add + register addressing.
The complex pattern matcher at sub_30811D0 calls seven helper functions (sub_307B990 through sub_307FEF0) to decompose DAG address expressions into base-register + offset pairs. When the offset is a constant that fits in the PTX immediate field, it folds into the instruction encoding. When the offset is too large or non-constant, it generates a separate add instruction and uses register addressing.
MMA / Tensor Core Instructions
Tensor core instruction selection is split across the intrinsic lowering stage (which generates NVPTXISD nodes from wmma.load, wmma.mma, mma.sync, wgmma intrinsics) and the ISel stage (which selects the specific PTX opcode). The ISel switch in sub_347A8D0 handles these by checking:
- SM architecture -- wmma requires SM 70+, mma.sync requires SM 75+, wgmma requires SM 90+.
- Matrix dimensions -- m16n16k16, m8n8k4, m16n8k8, etc.
- Data types -- f16, bf16, tf32, f64, i8, i4, b1, fp8 (SM 90+), fp4 (SM 100+).
- Accumulator type -- f16 or f32 for half-precision MMA.
The architecture check consults the compressed legality table to determine whether a given MMA variant is legal on the target SM.
Atomic Operations: sub_3048C30 (86KB)
Atomic instruction selection generates atom.{scope}.{op}.{type} instructions. The selector handles:
| Operation | PTX | NVPTXISD opcodes |
|---|---|---|
| Compare-and-swap | atom.cas | 462 |
| Add (int) | atom.add | 294--297 |
| Min (signed) | atom.min | 302--305 |
| Max (signed) | atom.max | 314--317 |
| Exchange | atom.exch | (via generic path) |
| AND/OR/XOR | atom.and / atom.or / atom.xor | (via generic path) |
The selector checks "vector atomics not supported on this architecture!" for vector-width atomics and gates them behind an SM version check (likely SM 90+). Scope qualifiers (.cta, .gpu, .sys) are determined from the memory ordering of the LLVM atomic instruction.
Vector / SIMD Patterns: sub_3475BB0 (89KB)
Handles vector-type instruction selection for NVPTX's limited vector support (v2 and v4 packed types). The function calls sub_969240 121 times and is self-recursive 28 times. It selects between:
- Packed register operations -- add.v2.f32, mul.v2.f32 when the SM supports native vector operations.
- Scalarized fallback -- decomposes vector operations into per-element scalar operations when the vector type is not natively supported.
- mov.v2 / mov.v4 -- register-to-register vector moves for shuffles and extracts.
Knobs
The ISel subsystem registers its knobs at ctor_286 (0x4FA0C0, 5KB):
| Knob | Type | Description |
|---|---|---|
| fast-isel-abort | int | Abort mode for FastISel failures (0=silent, 1=warn, 2=abort) |
| fast-isel-report-on-fallback | bool | Report when FastISel falls back to SelectionDAG |
| use-mbpi | bool | Use Machine Branch Probability Info during ISel |
| dag-disable-combine | bool | Disable DAG combining entirely |
| pre-RA-sched | enum | Pre-RA scheduler variant: "default", "list-burr", "source", "list-hybrid", "list-ilp" |
Note that cicc does not use FastISel for GPU code generation. The fast-isel-* knobs exist because the upstream LLVM SelectionDAGISel framework registers them unconditionally, but the NVPTX backend always takes the full SelectionDAG path. The dag-disable-combine flag is the only ISel-phase knob that has a meaningful effect on NVPTX code generation; setting it skips the DAG combiner entirely, which produces worse code but can be useful for debugging.
Differences from Upstream LLVM
| Aspect | Upstream LLVM 20.0 | NVIDIA cicc v13.0 |
|---|---|---|
| Iteration budget | No explicit budget; relies on DAG invariants to terminate | Budget = 4 * numInstructions * maxBlockSize |
| Argument cost table | Not present in SelectionDAGISel | Hash table with key * 37 hash for argument byte sizes |
| Legality table | Simple isLegal() callback per target | Compressed 500-stride table + 4-bit packed secondary table |
| FastISel | Used for -O0 on most targets | Never used; always full SelectionDAG |
| ISel function size | Typical NVPTX Select() is ~50KB upstream | 309KB hand-written + 256KB TableGen = 565KB total |
| Memory patterns | Standard load/store | 5 address spaces, each with distinct PTX encoding |
| Texture/surface | Not present in upstream NVPTX (handled by intrinsics only) | 52KB dedicated sub-selector for tex/suld/sust |
| Atomic patterns | Standard expansion via AtomicExpandPass | 86KB custom selector with scope qualifiers and architecture gating |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVPTXDAGToDAGISel::Select -- ISel driver | sub_3090F90 | 91KB | -- |
| Pattern matcher entry (dispatches to Select switch and SelectCode) | sub_308FEE0 | -- | -- |
| NVPTX hand-written Select switch | sub_347A8D0 | 309KB | -- |
| TableGen-generated SelectCode | sub_348D3E0 | 256KB | -- |
| Vector/SIMD pattern selection | sub_3475BB0 | 89KB | -- |
| Memory operation patterns (ld/st with address spaces) | sub_306D850 | 77KB | -- |
| Complex addressing mode matching | sub_30811D0 | 77KB | -- |
| Addressing mode helper (base + offset extraction) | sub_30783B0 | 39KB | -- |
| Texture/surface instruction selection | sub_306A930 | 52KB | -- |
| Atomic operation selection | sub_3048C30 | 86KB | -- |
| Sub-selector for specific NVPTX patterns | sub_3447D70 | 32KB | -- |
| Pattern matching helpers | sub_3472970 | 36KB | -- |
| Operand matching | sub_343A2E0 | 49KB | -- |
| Compressed legality table lookup | sub_376DE90 | -- | -- |
| Initialize topological worklist | sub_308B6F0 | -- | -- |
| Min-heap sift-down (priority queue) | sub_3089BD0 | -- | -- |
| ISel cleanup | sub_308AB30 | -- | -- |
| Hash table destruction | sub_308B100 | -- | -- |
Cross-References
- SelectionDAG & Instruction Selection -- parent page covering the full SelectionDAG pipeline (type legalization, operation legalization, DAG combining, and the ISel overview)
- Pattern Database / Constraint Table -- the per-instruction operand constraint table at word_3F3E6C0
- DAG Node Layout -- SDNode structure definition
- NVPTX Target Infrastructure -- target machine, subtarget features, and register classes
- Hash Infrastructure -- the key * 37 integer hash used throughout cicc
- Tensor / MMA Builtins -- intrinsic lowering for MMA operations that feed into ISel
- Surface & Texture Builtins -- intrinsic lowering for texture/surface operations
- Atomics Builtins -- intrinsic lowering for atomic operations
InstrEmitter
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: SDNode field layout matches LLVM 20.0.0 base. NVIDIA merges the upstream EmitNode/EmitSpecialNode split into a single monolithic function, adds a dedicated CopyToReg handler, an extended MachineInstr flag at bit 36, and a triple vtable dispatch for GPU pseudo-expansion.
InstrEmitter is the final translation layer between LLVM's SelectionDAG representation and the machine-level MachineInstr pipeline. After instruction selection has converted LLVM IR into a DAG of target-specific SDNodes, and after scheduling has linearized those nodes into a sequence, InstrEmitter walks the scheduled sequence and converts each SDNode into one or more MachineInstrs inserted into the current MachineBasicBlock. In CICC v13.0, the emitter lives at sub_2EDDF20 (11,722 bytes) and is called by ScheduleDAGSDNodes::EmitSchedule (sub_2EE0CF0). NVIDIA's build contains three key modifications relative to upstream LLVM: a dedicated CopyToReg handler factored out for NVPTX's physical-register-heavy parameter ABI, a triple vtable dispatch pattern that gates custom pseudo-expansion for GPU-specific instructions, and an extended MachineInstr flag at bit 36 (0x1000000000) not present in stock LLVM.
| EmitNode / EmitMachineNode | sub_2EDDF20 (11,722 bytes, 872-byte stack frame) |
| EmitSchedule (top-level driver) | sub_2EE0CF0 (59KB) |
| EmitCopyToReg handler | sub_2ED95B0 |
| EmitSubregNode | sub_2EDB7A0 |
| EmitCopyToRegClassOp | sub_2EDD7E0 |
| ProcessOperands / EmitMachineNode core | sub_2ED3660 |
| getRegForValue | sub_2E8B400 |
| isDeadNode predicate | sub_2DADC00 |
| MinRCSize threshold | 4 (upstream default, unchanged) |
| VReg hash load factor | 3/4 (rehash when count * 4 >= capacity * 3) |
| Hash function | key * 37, masked by capacity - 1 |
| SDOperand stride | 40 bytes (0x28) per entry |
Emission Architecture
In upstream LLVM, InstrEmitter::EmitNode is a trivial dispatcher: if the SDNode carries a target-specific (machine) opcode, it calls EmitMachineNode; otherwise it calls EmitSpecialNode for ISD-level pseudo-operations. CICC merges both paths into a single monolithic function (sub_2EDDF20) that dispatches on the raw 16-bit opcode at SDNode offset +0x44. The entry point performs a bit-table test against a 64-bit immediate (0x80001078000) to classify opcodes <= 0x2B as "special" ISD nodes requiring dedicated handling; everything above falls through to the generic machine emission path.
The driver, ScheduleDAGSDNodes::EmitSchedule (sub_2EE0CF0), iterates the scheduled SUnit sequence. For each SUnit, it first walks the glue chain backwards (via SDNode::getGluedNode) and emits each glued predecessor before emitting the SUnit's own node. This guarantees that glued instructions appear as a contiguous sequence in the MachineBasicBlock, which is critical for NVPTX where texture sampling sequences must remain bundled with their address computation.
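The glue-first emission order can be sketched with a toy model. This is a minimal illustration, not the binary's code: the Node struct and emitWithGlue are invented stand-ins for SDNode and the driver's backward glue walk (real SDNodes reach their glued predecessor via SDNode::getGluedNode).

```cpp
#include <vector>

// Hypothetical stand-in for an SDNode that may be glued to a predecessor.
struct Node {
    int id;
    const Node *glued; // glued predecessor, or nullptr
};

// Emit glued predecessors before the node itself, so a glue bundle
// lands as a contiguous run in the basic block.
void emitWithGlue(const Node *n, std::vector<int> &order) {
    if (!n) return;
    emitWithGlue(n->glued, order); // predecessors first
    order.push_back(n->id);        // then the node itself
}
```

Walking a three-node glue chain from its final node yields the predecessors in program order, which is exactly the contiguity NVPTX texture sequences rely on.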
The Emission Algorithm
The combined EmitNode function proceeds through fourteen phases in the binary; the condensed flow below groups them into seven logical steps:
EmitNode(InstrEmitter *self, SDNode *node):
// Phase 1: Early exit for dead nodes
if !self->forceEmit && node->useCount <= 1:
return false // single-use folded into consumer
// Phase 2: Glue chain traversal
root = node
while root->predecessor has chain/glue bit set:
root = strip_tag(root->predecessor)
if root->hasChainResult:
walk further to data-producing node
// Phase 3: Opcode dispatch
opc = node->opcode // uint16 at +0x44
switch opc:
0x0E (CopyToReg): call EmitCopyToReg(self, node)
0x13 (TokenFactor): skip entirely
0x14 (CopyFromReg): goto copyfromreg_path
0x0F, 0x10, 0x1C, 0x2B: special ISD handling
default: goto generic_emission
// Phase 4: Generic machine emission
desc = TII->get(opc)
MI = BuildMI(MBB, node->debugLoc, desc)
CreateVirtualRegisters(node, MI, desc)
for each operand in node->operands:
AddOperand(MI, operand)
MI.setMemRefs(node->memoperands)
MBB->insert(InsertPos, MI)
// Phase 5: Custom inserter check (triple vtable dispatch)
if TII->vtable[0xB8] != sub_2ED11C0: // not default
call custom inserter for NVPTX pseudos
if TII->vtable[0x348] != sub_2ED11F0:
call expandPostRAPseudo
if TII->vtable[0x160] != sub_2ED11E0:
call sub-register inserter
// Phase 6: Implicit physreg defs
collect UsedRegs from glue chain (CopyFromReg, RegisterSDNode)
mark unused implicit defs as dead
// Phase 7: Post-emission dead copy elimination
for each emitted copy:
if copy result has no remaining uses:
eraseFromParent(copy MI)
Opcode Dispatch Details
The bit-table dispatch uses a 64-bit immediate as a compressed lookup: bt 0x80001078000, opcode. The bits that are set correspond to ISD opcodes that need special (non-generic) handling:
| Opcode | ISD Value | Handler |
|---|---|---|
| 0x0E | ISD::CopyToReg | sub_2ED95B0 -- dedicated handler |
| 0x0F | ISD::EH_LABEL / special | Label emission path |
| 0x10 | ISD::INLINEASM | Inline assembly emission |
| 0x13 | ISD::TokenFactor | Skipped (ordering-only, no MI) |
| 0x14 | ISD::CopyFromReg | Physical-to-virtual register copy |
| 0x1C | ISD::LIFETIME_START/END | Frame index annotation |
| 0x2B | ISD::PSEUDO_PROBE | Profiling probe emission |
For opcodes above 0x2B, the emitter falls through to the generic path that calls TII->get(opc) to obtain the MCInstrDesc and builds a MachineInstr from its operand descriptors.
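The bit-table classification can be sketched in a few lines. This is illustrative only: the mask here is rebuilt from the opcode list in the table above rather than taken as the binary's literal 0x80001078000, and isSpecialOpcode is an invented name for the inlined test.

```cpp
#include <cstdint>

// Special ISD opcodes from the dispatch table above (illustrative subset).
constexpr uint16_t kSpecialOpcodes[] = {0x0E, 0x0F, 0x10, 0x13, 0x14, 0x1C, 0x2B};

// Build a 64-bit membership mask, mirroring how a `bt imm64, opcode`
// test classifies opcodes that fit in one machine word.
constexpr uint64_t buildMask() {
    uint64_t m = 0;
    for (uint16_t opc : kSpecialOpcodes)
        m |= (1ULL << opc);
    return m;
}
constexpr uint64_t kSpecialMask = buildMask();

// true -> special ISD handling; false -> generic machine emission path.
inline bool isSpecialOpcode(uint16_t opc) {
    return opc <= 0x2B && ((kSpecialMask >> opc) & 1);
}
```

A single 64-bit immediate replaces a jump table for the low opcode range; everything above 0x2B falls through without touching memory.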
CopyToReg Emission
CopyToReg (sub_2ED95B0) handles the common case of copying a value from a virtual register into a physical register. Upstream LLVM handles this inline within EmitSpecialNode; NVIDIA factors it into a separate function, likely for code size reasons given how frequently CopyToReg appears in NVPTX code. NVPTX's parameter-passing convention maps kernel parameters to fixed physical registers %r1--%r255, which generates large CopyToReg cascades at function entry and before calls.
The handler:
- Reads the destination register from SDNode->operand(1) (a RegisterSDNode).
- If the destination is virtual and the source is an IMPLICIT_DEF, emits IMPLICIT_DEF dest directly instead of a COPY.
- Otherwise resolves the source value to a virtual register via getVR (which consults the VRBaseMap).
- If source and destination are the same register, does nothing (copy coalesced away).
- Emits COPY dest, src.
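The handler's decision logic reduces to a small function. This is a simplified model of sub_2ED95B0's control flow as described above; the Action enum, parameter names, and flat register numbers are invented for illustration.

```cpp
#include <cstdint>

enum class Action { Nothing, EmitImplicitDef, EmitCopy };

// Hypothetical model of the CopyToReg lowering decisions.
Action lowerCopyToReg(uint32_t dstReg, uint32_t srcReg,
                      bool dstIsVirtual, bool srcIsImplicitDef) {
    // Virtual dest fed by IMPLICIT_DEF: emit IMPLICIT_DEF dest directly.
    if (dstIsVirtual && srcIsImplicitDef)
        return Action::EmitImplicitDef;
    // Same register: the copy has already been coalesced away.
    if (dstReg == srcReg)
        return Action::Nothing;
    // Otherwise: COPY dest, src.
    return Action::EmitCopy;
}
```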
CopyFromReg Emission
CopyFromReg (opcode 0x14) is the reverse: it copies a physical register into the virtual register domain. The CICC implementation at sub_2EDDF20 offset 0x2EDF423 follows a multi-step process:
- Extract the source register from SDNode->operand(1). If virtual, insert the SDValue-to-VReg mapping directly into VRBaseMap and return.
- If physical, determine the correct register class:
  - Query all users of this CopyFromReg. If the sole user is a CopyToReg to a virtual register in the same class, reuse that destination register.
  - Otherwise compute UseRC as the intersection of all user register class constraints via TRI->getCommonSubClass.
  - Fall back to TRI->getMinimalPhysRegClass(SrcReg, VT).
- If copying the physical register is impossible or expensive (RC->expensiveOrImpossibleToCopy()), use the physical register directly.
- Otherwise emit COPY VRBase, SrcReg where VRBase is a new virtual register in DstRC.
The register class membership test at 0x2EDF4C2 uses LLVM's compressed bit-vector representation:
bool RegisterClass::contains(unsigned Reg) {
unsigned class_idx = Reg >> 3;
if (class_idx >= desc->num_classes)
return false;
return (desc->class_table[class_idx] >> (Reg & 7)) & 1;
}
NVPTX Custom Pseudo-Expansion
The triple vtable dispatch pattern is the emitter's most distinctive NVIDIA modification. After inserting a MachineInstr for a target-specific opcode, the emitter checks three separate vtable slots to determine whether the instruction requires custom expansion:
Vtable slot 0xB8: EmitInstrWithCustomInserter
Default stub: sub_2ED11C0 (returns false). When the NVPTX target overrides this for a given opcode, the custom inserter replaces the pseudo MachineInstr with an expanded sequence. Approximately 15--20 NVPTX pseudo-instructions use this path:
- Texture load operations (tex.1d, tex.2d, tex.3d) -- these expand into address register setup, sampler state configuration, and the actual texture fetch instruction.
- Surface operations (sust, suld) -- surface load/store instructions that need coordinate clamping and format conversion.
- Warp-level intrinsics (shfl, vote, match) -- instructions that require lane mask setup and predicate register manipulation.
- Atomic operations -- certain atomics expand into compare-and-swap loops on older architectures.
Vtable slot 0x348: expandPostRAPseudo
Default stub: sub_2ED11F0. This handles pseudo-instructions that can only be expanded after register allocation has assigned physical registers. In NVPTX this is less common since the PTX virtual register model defers most allocation to ptxas.
Vtable slot 0x160: sub-register insertion
Default stub: sub_2ED11E0. Handles INSERT_SUBREG and related patterns that need target-specific lowering.
All three stubs are adjacent in memory (within 48 bytes of each other), confirming they are trivial return-false implementations in the NVPTXInstrInfo class.
Register Class Assignment During Emission
When creating virtual registers for SDNode results, CreateVirtualRegisters (sub_2E8B400 path) performs:
- For each result value of the SDNode, obtain the register class from TII->getRegClass(II, i).
- Refine based on the value type: if the type is legal, compute TLI->getRegClassFor(VT, isDivergent) and intersect with the instruction constraint via TRI->getCommonSubClass.
- The divergence flag (SDNode::isDivergent) is critical in NVPTX: divergent values must go into general-purpose registers (not uniform/constant registers), which affects class selection.
- If a result's sole consumer is a CopyToReg to a virtual register in a compatible class, reuse the CopyToReg destination directly to avoid a redundant copy.
- Create the virtual register via MRI->createVirtualRegister(RC) and add it as a def operand on the MachineInstr.
The MinRCSize threshold (4, unchanged from upstream) prevents over-constraining: if the intersection of all register class constraints would yield a class with fewer than 4 registers, the emitter inserts a COPY to a less-constrained virtual register instead.
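The MinRCSize guard can be modeled with register classes as bitsets. A minimal sketch under invented types -- commonSubClass and the 64-register universe are illustrative, not the binary's representation:

```cpp
#include <bitset>
#include <cstddef>
#include <optional>

// Toy register class = set of register numbers.
using RegClass = std::bitset<64>;

// Upstream threshold quoted above: never constrain below 4 registers.
constexpr std::size_t MinRCSize = 4;

// Intersect an instruction constraint with a type-derived class. If the
// result would have fewer than MinRCSize registers, give up -- the
// emitter then inserts a COPY to a less-constrained vreg instead.
std::optional<RegClass> commonSubClass(const RegClass &a, const RegClass &b) {
    RegClass inter = a & b;
    if (inter.count() < MinRCSize)
        return std::nullopt;
    return inter;
}
```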
Implicit Def/Use Handling
After inserting a MachineInstr, the emitter processes implicit physical register definitions. This is essential for GPU instructions that clobber status registers or have side effects beyond their explicit operands.
The flow collects UsedRegs by scanning:
- Implicit defs beyond explicit results: if NumResults > NumDefs, the extra results correspond to implicit physical register definitions from MCInstrDesc::implicit_defs(). For each such def that has at least one use, a CopyFromReg is emitted to capture the value.
- Glue chain uses: the emitter walks the glue chain upward from the current node, collecting physical registers referenced by CopyFromReg nodes and RegisterSDNode operands.
- Dead marking: MachineInstr::setPhysRegsDeadExcept(UsedRegs) marks any implicit def that is NOT in UsedRegs as dead, allowing the register allocator and later passes to ignore it.
NVIDIA Extended Flag: Bit 36 (0x1000000000)
Standard LLVM MachineInstr flags occupy bits 0--31 of the flags word (is_def, is_implicit, is_dead, is_kill, is_undef, is_early_clobber, etc.). CICC extends this to a 64-bit flags field and reserves bit 36 (0x1000000000) for an NVIDIA-specific purpose. The flag is queried via sub_2E88A90 (hasProperty) with argument rsi = 0x1000000000, edx = operand_index.
Where Bit 36 Is Checked
There are exactly two call sites within sub_2EDDF20:
Site 1 -- Generic emission path (0x2EDE50A--0x2EDE523)
0x2EDE4EF: mov eax, [r13+2Ch] ; load SDNode property flags
0x2EDE4F3: test eax, 0x20000 ; bit 17 = hasDebugValue?
0x2EDE4F8: jnz skip_flag_check ; if set, skip the bit-36 test
0x2EDE4FA: test al, 4 ; bit 2 = isTied
0x2EDE4FC: jnz loc_2EDF064 ; tied operand -> different path
0x2EDE502: test al, 8 ; bit 3 = hasGlue
0x2EDE504: jz loc_2EDF064 ; no glue -> different path
0x2EDE50A: mov edx, 1 ; operand index = 1
0x2EDE50F: mov rdi, r13 ; SDNode*
0x2EDE512: mov rsi, 0x1000000000 ; bit 36 flag mask
0x2EDE51C: call sub_2E88A90 ; hasProperty(node, flag, idx)
0x2EDE521: test al, al
0x2EDE523: jnz loc_2EDE086 ; if set -> skip emission entirely
Site 2 -- CopyFromReg-adjacent path (0x2EDEE5D--0x2EDEE86)
0x2EDEE5D: test al, 4 ; bit 2 = isTied
0x2EDEE5F: jnz loc_2EDEFA2 ; tied -> sub-register path
0x2EDEE65: test al, 8 ; bit 3 = hasGlue
0x2EDEE67: jz loc_2EDEFA2 ; no glue -> sub-register path
0x2EDEE6D: mov edx, 1 ; operand index = 1
0x2EDEE72: mov rdi, r13 ; SDNode*
0x2EDEE75: mov rsi, 0x1000000000 ; bit 36 flag mask
0x2EDEE7F: call sub_2E88A90 ; hasProperty(node, flag, idx)
0x2EDEE84: test al, al
0x2EDEE86: jnz loc_2EDE100 ; if set -> skip (no MI emitted)
Guard Conditions and Semantics
Both sites share the same guard pattern: the flag is only checked when the SDNode's property byte at +0x2C satisfies bit_3_set AND NOT bit_2_set -- i.e., the node has a glue result chain but is not a tied operand. This narrows the check to nodes that participate in glue chains: typically multi-instruction sequences like texture fetches, surface operations, and warp-level intrinsics where a chain of SDNodes must emit as a contiguous bundle.
When hasProperty(node, 0x1000000000, 1) returns true, the emitter skips the node entirely. The operand index of 1 means the flag is checked on the first data operand (operand 0 is typically the chain input). The effect is that nodes carrying bit 36 on operand 1 are treated as "already materialized" -- their value has been produced by a preceding glued instruction and does not require a separate MachineInstr.
The most likely interpretation of bit 36 is "implicit glue consumer already emitted": when a glued predecessor has already produced the value as a side effect (e.g., a texture fetch that writes both the result and a predicate), the glue consumer SDNode carries bit 36 to tell the emitter that no additional COPY or MI is needed. This is consistent with the check position immediately after getRegForValue succeeds -- the VReg mapping exists, the glue chain has been walked, and the emitter is about to create a potentially redundant MI.
sub_2E88A90 Calling Convention
The function serves as a universal property query across the emitter and other codegen passes. Observed flag values and their meanings:
| Flag Value | Bit | Meaning | Call Sites |
|---|---|---|---|
| 0x80 | 7 | isCall | Instruction scheduler (sub_2EE40E0) |
| 0x200 | 9 | isReservedReg | Branch folding (sub_2F33DD0) |
| 0x80000 | 19 | isImplicit | InstrEmitter generic path, StructurizeCFG |
| 0x100000 | 20 | isSimple / isMachineReg | InstrEmitter CopyFromReg, dead copy pass |
| 0x400000 | 22 | isSubRegister | InstrEmitter sub-register resolution |
| 0x40000000 | 30 | isAllocatable | InstrEmitter CopyFromReg class check |
| 0x1000000000 | 36 | NVIDIA: implicit glue consumer | InstrEmitter only (2 sites) |
The function signature is bool hasProperty(SDNode *node, uint64_t flag_mask, unsigned operand_idx). It reads the MCInstrDesc via [node+10h] -> [desc+18h], extracts a bit field by shifting right by the appropriate amount, and ANDs with 1 to produce a boolean result.
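The extraction described above -- shift right to the queried bit, AND with 1 -- can be sketched directly. This assumes a single-bit query mask and omits the pointer chasing through [node+10h] -> [desc+18h]; hasProperty here takes the already-loaded descriptor word.

```cpp
#include <cstdint>

// Sketch of the flag test in sub_2E88A90, given the 64-bit descriptor
// flags word. Precondition: flagMask has exactly one bit set.
inline bool hasProperty(uint64_t descFlags, uint64_t flagMask) {
    unsigned bit = 0;
    while (((flagMask >> bit) & 1) == 0)
        ++bit; // locate the queried bit position
    return (descFlags >> bit) & 1;
}
```

For the NVIDIA flag, the query is hasProperty(flags, 0x1000000000), i.e. a test of bit 36.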
Internal Data Structures
InstrEmitter Object Layout
The InstrEmitter instance carries three hash tables for tracking the SDNode-to-MachineInstr mapping:
| Offset | Name | Entry Size | Purpose |
|---|---|---|---|
| +0x410 | VReg Map (Table A) | 16 bytes | SDNode result to virtual register |
| +0x460 | MI Map (Table B) | 40 bytes | Glue chain to MachineInstr mapping |
| +0x4D0 | Result Map (Table C) | 32 bytes | SDNode to result number |
| +0x4E0 | forceEmit flag | 1 byte | When set, emit even dead nodes |
All three use LLVM's DenseMap implementation with open addressing and linear probing. The hash function is key * 37 (LLVM's DenseMapInfo<unsigned>::getHashValue). Empty sentinel: 0xFFFFFFFF. Tombstone: 0xFFFFFFFE. Table C uses an extended sentinel 0xFFFFFFFFFFFFF000. Rehash triggers at 3/4 load factor: entry_count * 4 >= capacity * 3. Growth is handled by sub_2E29BA0 which doubles capacity and rehashes.
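The table mechanics -- key * 37 hash masked by a power-of-two capacity, linear probing, 0xFFFFFFFF empty sentinel, rehash at 3/4 load with capacity doubling -- can be modeled in a compact sketch. This is a toy reimplementation for illustration (tombstones and the extended Table C sentinel are omitted), not the binary's DenseMap.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct VRegMap {
    static constexpr uint32_t kEmpty = 0xFFFFFFFF; // empty sentinel
    std::vector<uint32_t> keys, vals;
    std::size_t count = 0;

    // Capacity must stay a power of two so `& (size - 1)` works as a mask.
    explicit VRegMap(std::size_t cap = 16) : keys(cap, kEmpty), vals(cap, 0) {}

    std::size_t slot(uint32_t key) const {
        std::size_t i = (key * 37u) & (keys.size() - 1); // key * 37 hash
        while (keys[i] != kEmpty && keys[i] != key)
            i = (i + 1) & (keys.size() - 1);             // linear probe
        return i;
    }

    void insert(uint32_t key, uint32_t val) {
        if ((count + 1) * 4 >= keys.size() * 3)          // 3/4 load factor
            grow();
        std::size_t i = slot(key);
        if (keys[i] == kEmpty) ++count;
        keys[i] = key;
        vals[i] = val;
    }

    bool find(uint32_t key, uint32_t &out) const {
        std::size_t i = slot(key);
        if (keys[i] != key) return false;
        out = vals[i];
        return true;
    }

    void grow() { // double capacity and rehash, as sub_2E29BA0 does
        VRegMap bigger(keys.size() * 2);
        for (std::size_t i = 0; i < keys.size(); ++i)
            if (keys[i] != kEmpty) bigger.insert(keys[i], vals[i]);
        *this = bigger;
    }
};
```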
SDOperand Output Record
Each emitted result is recorded in a 40-byte (0x28) structure:
struct EmitResultRecord { // 40 bytes
SDNode *producer; // +0x00: SDNode that produced this result
int32_t src_vreg; // +0x08: source virtual register (-1 if physical)
int32_t dst_vreg; // +0x0C: destination virtual register (-1 if unassigned)
TargetRegisterClass *RC; // +0x10: register class pointer (or NULL)
unsigned sub_reg_idx; // +0x18: sub-register index (or 0)
uint32_t flags; // +0x20: tied, early_clobber, implicit bits
};
SDNode Field Offsets
Confirmed SDNode field layout from the binary (matches LLVM 20.0.0 base with minor NVIDIA extensions):
| Offset | Type | Field |
|---|---|---|
| +0x00 | tagged ptr | Chain/glue link (low 3 bits = type tag) |
| +0x08 | uint32 | Use count / reference count |
| +0x20 | ptr | Operand array pointer |
| +0x28 | uint32 | Operand count (low 24 bits) |
| +0x2C | uint8 | Property flags (bit 2 = isTied, bit 3 = hasGlue) |
| +0x30 | tagged ptr | First predecessor link |
| +0x38 | tagged ptr | Glue result chain |
| +0x44 | uint16 | Opcode |
| +0x78 | uint32 | Reference count (dead node detection) |
Tagged pointers are stripped throughout with AND 0xFFFFFFFFFFFFFFF8 (clear low 3 bits). Physical registers are encoded with bit 31 set (negative int32); extraction uses AND 0x7FFFFFFF followed by a shift-left by 4 to index the register descriptor table.
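These bit manipulations translate directly into helpers. A minimal sketch: the function names are invented, and the 16-byte descriptor stride is inferred from the shift-left-by-4 described above.

```cpp
#include <cstdint>

// Clear the low 3 tag bits of a chain/glue link.
inline uint64_t stripTag(uint64_t taggedPtr) {
    return taggedPtr & 0xFFFFFFFFFFFFFFF8ULL;
}

// Physical registers are encoded with bit 31 set (negative int32).
inline bool isPhysReg(int32_t reg) {
    return reg < 0;
}

// Extract the register index and scale it to a byte offset into the
// register descriptor table (assumed 16-byte entries).
inline uint64_t physRegDescOffset(int32_t reg) {
    uint32_t idx = static_cast<uint32_t>(reg) & 0x7FFFFFFF; // drop bit 31
    return static_cast<uint64_t>(idx) << 4;                 // * 16
}
```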
Dead Copy Elimination
After the main emission loop completes, a dedicated cleanup pass (Phase 12 in the binary, offset 0x2EE0816--0x2EE09AC) scans all emitted result records and eliminates redundant COPY instructions. This is notably aggressive compared to upstream LLVM, which defers dead copy removal to a separate DeadMachineInstrElimination pass later in the pipeline. CICC performs it inline because NVPTX's SelectionDAG generates massive numbers of redundant copies when lowering kernel parameter loads -- each parameter maps to a fixed physical register (%r1--%r255 corresponding to PTX parameter registers), and the DAG legalizer inserts CopyFromReg nodes for every parameter access.
Dead Copy Elimination Algorithm
The algorithm walks the emitted result record array (0x28-byte stride, accumulated during Phases 4--11) and classifies each record for deletion or preservation.
DeadCopyElimination(InstrEmitter *self, ResultRecord *records, int count):
// records is at [rbp-0x250], count at [rbp-0x248]
// stride = 0x28 (40 bytes per record)
end = records + count * 0x28
cursor = records
while cursor < end:
MI = cursor->producer // [rbx+0x00]: the MachineInstr*
TII = self->TargetInstrInfo // [r14+0x08]
// Step 1: Classify by opcode
if MI->opcode == 0x14: // CopyFromReg
// CopyFromReg-specific path: virtual dispatch to target
vtable = TII->vtable
result = vtable[0xF0]( // ~30th virtual method
MI, // the CopyFromReg MI
&cursor[0x08], // source vreg slot
/* additional args */
)
// This checks whether the target considers the copy
// sinkable or rematerializable -- NVPTX overrides this
// for parameter register copies that are trivially dead
else:
// Generic MI path: check via vtable[0x350]
result = TII->vtable[0x350](MI, cursor, ...)
// Step 2: Check source register kill flags
src_reg = cursor->src_vreg // [rbx+0x08]
if src_reg < 0: // physical register (sign bit set)
clearKillFlags(self->MRI, src_reg) // sub_2EBF120
// Step 3: Check dest register kill flags
dst_reg = cursor->dst_vreg // [rbx+0x0C]
if dst_reg < 0: // physical register
clearKillFlags(self->MRI, dst_reg) // sub_2EBF120
// Step 4: Determine if MI is dead
// Check opcode: if (MI->opcode - 1) <= 1 (opcode 1 or 2)
// then check MI->operand[0] byte [+0x40] bit 4 (0x10)
// which indicates "result consumed by inline fold"
opc = MI->opcode
if (opc == 1 || opc == 2): // COPY or REG_SEQUENCE
if MI->operands[0].flags & 0x10: // inline folded
goto mark_dead
// Step 5: Property gate
flags_2c = MI->flags_2c // [rdi+2Ch]
if !(flags_2c & 0x04): // bit 2 not set
// Check TSFlags bit 20 via descriptor
desc = MI->MCInstrDesc // [rdi+10h]
tsflags = desc->TSFlags // [desc+18h]
is_simple = (tsflags >> 20) & 1
if !is_simple:
goto emit_and_advance // not a candidate
// (falls through only when bit 2 set OR TSFlags bit 20 set)
// Step 6: Check hasProperty(0x100000, 1) -- isMachineReg
has_prop = hasProperty(MI, 0x100000, 1) // sub_2E88A90
if !has_prop:
// MI is deletable: call eraseFromParent
eraseFromParent(MI) // sub_2E88E20
advance cursor by 0x28
continue
mark_dead:
// Step 7: Liveness check via isUnusedReg
unused = isUnusedReg(MI) // sub_2E8B100
if unused:
// Still has a def -- erase immediately
eraseFromParent(MI) // sub_2E88E20
else:
// Defer: add to dead list for bulk deletion
addToDeadList(self->deadList, MI) // sub_2ED56A0
// deadList is at InstrEmitter+0x4A0
advance cursor by 0x28
Glue Chain Walk in Dead Copy Context
After the per-record loop, the emitter performs a secondary traversal for CopyFromReg records that survived deletion. For each surviving copy whose SDNode has a glue result ([r13+38h] != 0):
- Walk the glue chain backward via [r13+0] & 0xFFFFFFFFFFFFFFF8 (strip tag bits).
- For each predecessor in the chain, check [rax+2Ch] & 4 -- if the predecessor has been scheduled (bit 2 set), continue walking.
- If the predecessor has an unresolved glue reference ([r13+38h] non-null) and the predecessor's MI has zero uses after copy elimination, mark it for deferred deletion too.
This secondary walk catches cascading dead copies: when a CopyFromReg is deleted, its glued predecessor may also become dead.
Deferred Deletion via Dead List
MIs added to InstrEmitter+0x4A0 via sub_2ED56A0 are not deleted immediately. Instead, they are accumulated and deleted in bulk during Phase 14 (final cleanup at 0x2EE0C0B). The dead list is a SmallVector<MachineInstr*> with 8 inline entries (64 bytes inline buffer), growing via sub_C8D5F0 if needed. Bulk deletion avoids iterator invalidation during the emission loop and is more cache-friendly for large basic blocks.
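The dead list's small-vector behavior -- 8 inline slots, heap growth only on overflow -- is easy to model. This is a toy stand-in, not LLVM's SmallVector; the DeadList name and layout are invented for illustration.

```cpp
#include <cstddef>
#include <vector>

struct DeadList {
    void *inlineBuf[8];       // 64-byte inline buffer (8 pointers)
    std::size_t size = 0;
    std::vector<void *> heap; // used only after the inline buffer fills

    void push(void *mi) {
        if (size < 8)
            inlineBuf[size] = mi; // common case: no allocation
        else
            heap.push_back(mi);   // spill to heap on the 9th entry
        ++size;
    }

    void *at(std::size_t i) const {
        return i < 8 ? inlineBuf[i] : heap[i - 8];
    }

    bool spilledToHeap() const { return size > 8; }
};
```

For typical basic blocks the dead list never leaves its inline buffer, so deferring deletions costs no allocations.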
Why NVPTX Needs Aggressive Dead Copy Elimination
NVPTX kernel signatures routinely have 20--60 parameters, each lowered through a CopyFromReg from a fixed physical register. The SelectionDAG legalizer creates CopyFromReg SDNodes for each parameter load, but many parameters are only used in a subset of the kernel's basic blocks. Without immediate dead copy elimination, a kernel with 50 parameters would carry 50 COPY MachineInstrs at function entry, most of which are dead in any given block. The standard LLVM DeadMachineInstrElimination pass would eventually clean these up, but doing so immediately during emission:
- Reduces the MachineBasicBlock size that subsequent passes (register allocation, scheduling) must process.
- Avoids creating unnecessary VReg-to-PhysReg interference entries in the register allocator.
- Prevents false register pressure signals from dead copies during the MRPA (Machine Register Pressure Analysis) pass that NVIDIA uses for scheduling decisions.
NVIDIA-Specific Emission Patterns
Parameter Cascade Emission
NVPTX kernel entry functions map each parameter to a physical register via a cascade of CopyFromReg SDNodes. During emission, this produces a dense block of COPY MachineInstrs at the top of the entry MachineBasicBlock. The emitter handles this pattern specially:
- When EmitSchedule processes the first SUnit, it detects a sequence of CopyFromReg nodes whose source registers are consecutive physical parameter registers (%r1, %r2, ...).
- Each CopyFromReg is processed through the Phase 5 path (at 0x2EDF423). The register class resolution at 0x2EDF4C2 uses the compressed bit-vector test to verify the destination belongs to the Int32Regs or Int64Regs class.
- Dead copy elimination (Phase 12) immediately removes copies whose destinations have no users, reducing the entry block size before subsequent passes see it.
Texture/Surface Glue Bundle Emission
Texture and surface operations are emitted as glue bundles: a chain of SDNodes connected by glue edges that must produce a contiguous sequence of MachineInstrs. The emitter walks the glue chain backward from the final node and emits predecessors first. The bit 36 flag is critical here: when a texture fetch produces both a data result and a predicate condition, the predicate-producing node carries bit 36 on its data operand, telling the emitter that the preceding glued instruction already materialized the value and no separate COPY is needed.
The triple vtable dispatch at the end of emission (Phase 5 in the algorithm) handles the expansion of texture pseudo-instructions: EmitInstrWithCustomInserter (vtable 0xB8) replaces the texture pseudo-MI with the actual address setup, sampler configuration, and fetch instruction sequence.
Multi-Result SDNode Self-Recursion
When an SDNode produces multiple results (e.g., a div+rem pair or a load-with-predicate), the emitter calls itself recursively at sub_2EDDF20 to emit MIs for each additional result. The self-recursive call shares the same InstrEmitter instance and hash tables. This is a CICC-specific pattern; upstream LLVM handles multi-result nodes in a loop within EmitMachineNode rather than via recursion. The recursive approach simplifies the handling of multi-result nodes that themselves have glue chains (e.g., a texture fetch that returns 4 components).
Opcode-1/Opcode-2 Inline Fold Detection
During the dead copy scan (Phase 12, offset 0x2EE08A0--0x2EE08BA), the emitter checks if the MI's opcode is 1 or 2 (COPY or REG_SEQUENCE). For these opcodes, it reads the first operand's byte at [operand_array + 0x40] and tests bit 4 (0x10). This bit indicates the result was consumed via an inline fold -- the consumer instruction selected a pattern that folds the copy directly into its own operand. When this bit is set, the COPY MI is marked dead regardless of its use count, because the consuming instruction no longer references it.
0x2EE08A0: movzx eax, word ptr [rdi+44h] ; MI->opcode
0x2EE08A4: sub eax, 1 ; opcode - 1
0x2EE08A7: cmp eax, 1 ; is it 1 (COPY) or 2 (REG_SEQUENCE)?
0x2EE08AA: ja not_copy ; no -> skip
0x2EE08AC: mov rax, [rdi+20h] ; MI->operands array
0x2EE08B0: test byte ptr [rax+40h], 0x10 ; bit 4 = inline fold consumed
0x2EE08B4: jnz mark_dead ; if folded -> dead
NVIDIA Modifications vs Stock LLVM
| Area | Upstream LLVM | CICC v13.0 |
|---|---|---|
| EmitNode dispatch | Two separate functions: EmitMachineNode + EmitSpecialNode | Single merged function sub_2EDDF20 with bit-table dispatch |
| CopyToReg | Inline in EmitSpecialNode | Factored into dedicated sub_2ED95B0 |
| Custom inserter check | Single vtable call to EmitInstrWithCustomInserter | Triple vtable dispatch (0xB8, 0x348, 0x160) |
| Extended MI flags | Standard LLVM flag set (32 bits) | Bit 36 (0x1000000000) for NVPTX-specific semantics |
| Dead copy elimination | Post-emission pass in ScheduleDAGSDNodes | Inlined aggressive cleanup within EmitNode |
| Stack frame | ~300--400 bytes typical | 872 bytes (multiple inline SmallVectors and hash tables) |
| Self-recursion | Not self-recursive | Self-recursive for multi-result SDNode chains |
| Inline fold detection | Not present at this stage | Opcode-1/2 fold bit check during dead copy scan |
| Glue chain secondary walk | Not present | Cascading dead copy detection through glue predecessors |
Complexity
- Main emission loop: O(N) in the number of scheduled SDNodes.
- Hash table lookups: O(1) amortized with rehashing at 3/4 load.
- Dead copy elimination: O(C * U) where C = copies emitted, U = average uses per register.
- Glue chain traversal: O(G) per node where G = glue chain length (typically 1--5).
- Memory: O(N) for the three hash tables + O(R) for result records.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| InstrEmitter::EmitNode | sub_2EDDF20 | -- | Main entry, 11,722 bytes |
| ScheduleDAGSDNodes::EmitSchedule | sub_2EE0CF0 | -- | Top-level driver, 59KB |
| EmitCopyToReg | sub_2ED95B0 | -- | Dedicated CopyToReg handler |
| getRegForValue | sub_2E8B400 | -- | SDValue to VReg mapping |
| isUnusedReg | sub_2E8B100 | -- | Dead register predicate |
| isDeadNode | sub_2DADC00 | -- | Dead SDNode predicate |
| eraseFromParent | sub_2E88E20 | -- | MachineInstr deletion |
| hasProperty | sub_2E88A90 | -- | Register/operand flag query |
| getVRegDef | sub_2EBEE10 | -- | Virtual register definition lookup |
| isPhysReg | sub_2EBEF70 | -- | Physical vs virtual register check |
| replaceRegWith | sub_2EBECB0 | -- | Virtual register substitution |
| clearKillFlags | sub_2EBF120 | -- | Remove kill annotations |
| Sub-register resolution | sub_2ED7930 | -- | SUBREG_TO_REG handling |
| EmitSubregNode | sub_2EDB7A0 | -- | Sub-register copy emission |
| EmitCopyToRegClassOp | sub_2EDD7E0 | -- | Class-constrained copy |
| ProcessOperands | sub_2ED3660 | -- | EmitMachineNode core |
| isAllocatableInClass | sub_2E6D360 | -- | Register class membership |
| DenseMap::find | sub_2E5E6D0 | -- | SDNode-to-MI lookup |
| addToDeadList | sub_2ED56A0 | -- | Queue MI for deletion |
| DenseMap::grow | sub_2E29BA0 | -- | Hash table resize |
| NVPTXInstrInfo default | sub_2ED11C0 | -- | EmitInstrWithCustomInserter stub |
| NVPTXInstrInfo default | sub_2ED11E0 | -- | getInsertSubreg stub |
| NVPTXInstrInfo default | sub_2ED11F0 | -- | expandPostRAPseudo stub |
| operand comparison | sub_2ED1840 | -- | Operand equality helper |
| MI builder | sub_2ED19B0 | -- | Additional MachineInstr construction |
| register mapping | sub_2ED41E0 | -- | Register mapping utility |
| register info query | sub_2ED4900 | -- | Register info accessor |
| MI property query | sub_2ED5D10 | -- | MachineInstr property reader |
| emission utility | sub_2EDA920 | -- | Additional emission helper |
| setDesc | sub_2EAB0C0 | -- | Sets MI operand descriptors during emission |
| addOperand | sub_2E31210 | -- | Appends operand to MachineInstr |
| MI manipulation | sub_2E31DD0 | -- | Additional MI manipulation utility |
| TRI utility | sub_2E4EE60 | -- | TargetRegisterInfo helper |
| NVPTXRegisterInfo | sub_2E4F5F0 | -- | Register class query vtable method |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| EmitNode structure | Separate EmitNode and EmitSpecialNode dispatchers | Merged into single monolithic function (sub_2EDDF20, 11,722 bytes) with bit-table opcode classification |
| CopyToReg handling | Inline within EmitSpecialNode | Factored out to dedicated handler (sub_2ED95B0) for NVPTX's physical-register-heavy .param ABI |
| MachineInstr flags | Standard flag bits (up to bit ~20) | Extended flag at bit 36 (0x1000000000) not present in stock LLVM; marks NVIDIA-specific instruction properties |
| Pseudo-expansion | Single vtable dispatch for target pseudo-instructions | Triple vtable dispatch pattern gating custom expansion for GPU-specific pseudo-instructions |
| Dead node predicate | Standard isDeadNode check | Custom sub_2DADC00 predicate with NVPTX-specific liveness criteria |
| VReg hash table | Standard DenseMap for value-to-VReg mapping | Custom hash with key * 37 and 3/4 load factor rehash policy |
Cross-References
- SelectionDAG & Instruction Selection -- the DAG construction and pattern-matching phase that produces the SDNodes consumed by InstrEmitter
- Instruction Scheduling -- ScheduleDAGSDNodes::EmitSchedule calls InstrEmitter after linearizing the scheduled sequence
- Register Allocation -- the VRegs created by InstrEmitter flow into the register allocator
- Register Coalescing -- coalesces the COPY instructions emitted here
- AsmPrinter & PTX Body Emission -- the final consumer of the MachineInstrs produced by InstrEmitter
TwoAddressInstruction
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Structurally identical to LLVM 20.0.0 TwoAddressInstructionPass.cpp. NVIDIA extensions are limited to deeper EXTRACT_SUBREG handling for multi-register results (texture/tensor/warp ops), extended LiveVariables maintenance, OptimizationRemarkEmitter integration, and the standard optnone/fast-compile gate.
The TwoAddressInstruction pass converts three-address MachineInstrs into two-address form by inserting COPY pseudo-instructions so that tied operand constraints are satisfied before register allocation. In upstream LLVM, many CPU targets have instructions where one source operand must be the same physical register as the destination (x86 addl %esi, %edi means %edi = %edi + %esi); the pass rewrites A = B op C into A = COPY B; A op= C. On NVPTX this pass is largely a formality -- PTX instructions are three-address and the virtual register file has no physical-register constraints -- but it still performs essential bookkeeping: eliminating REG_SEQUENCE and INSERT_SUBREG pseudo-instructions, building copy-equivalence maps for downstream coalescing, and handling the tied operands that arise from multi-result NVPTX intrinsics (texture loads, tensor core operations, warp-level collectives). CICC's binary is structurally identical to stock LLVM, with extended EXTRACT_SUBREG handling for multi-register results, deeper LiveVariables maintenance, OptimizationRemarkEmitter integration, and the standard NVIDIA optnone/fast-compile gate.
| Pass name | "Two-Address instruction pass" |
| Pass ID | "twoaddressinstruction" |
| Pipeline slot | "two-address-instruction" (MachineFunction pass #521) |
| runOnMachineFunction | sub_1F53550 (79KB, 2,470 lines) |
| tryInstructionTransform | sub_1F4EF20 (28KB, 1,127 lines) |
| processTiedPairs | sub_1F50270 (63KB, 2,209 lines) |
| Cluster address range | 0x1F4D000 -- 0x1F56000 |
| libNVVM twin | sub_F4EA80 (2,455 lines, structurally identical) |
| Verification string | "After two-address instruction pass" |
| Ordering | After PHI elimination, before RegisterCoalescer |
Why This Pass Exists on NVPTX
PTX is a three-address virtual ISA -- every arithmetic instruction takes separate dst, src0, src1 operands, and the hardware register allocator inside ptxas handles physical assignment. On a CPU target like x86, the TwoAddress pass is critical because most ALU instructions destroy one source register. On NVPTX, the pass fires primarily for three categories:
- Pseudo-instruction lowering. REG_SEQUENCE, INSERT_SUBREG, and EXTRACT_SUBREG are LLVM-internal pseudo-opcodes that must be eliminated before register allocation regardless of target. The TwoAddress pass rewrites INSERT_SUBREG into COPY and expands REG_SEQUENCE into per-subreg copies.
- Multi-result intrinsics. NVPTX texture/surface loads return v4f32 or v2f64 as multi-register results. Warp-level operations (wmma, mma) produce multi-register outputs. These get lowered into chains of EXTRACT_SUBREG pseudo-instructions that the pass must decompose into individual COPYs, one per extracted component.
- Inline assembly tied operands. CUDA inline asm blocks with "+r" (read-write) constraints produce tied operands where the output register must match the input. The pass inserts a COPY from the input virtual register to the output register to satisfy the constraint.
For most ordinary NVPTX arithmetic instructions, collectTiedOperands finds nothing and the pass skips the instruction after updating the distance map and processing any copy-equivalence information. The pass is not a no-op, but the heavy transformation paths (commutation, 3-address conversion, load unfolding) almost never fire for GPU code.
Algorithm
The pass iterates over every MachineBasicBlock and every MachineInstr within it, maintaining per-block data structures that are cleared at block boundaries.
for each MBB in MF:
clear DistanceMap, SrcRegMap, DstRegMap, SrcEqClassMap, DstEqClassMap, Processed
dist = 0
for each MI in MBB:
skip bundle internals
skip COPY (opcode 12) and SUBREG_TO_REG (opcode 13)
skip if MI is in the "reprocess" set
if MI is EXTRACT_SUBREG (opcode 14):
// NVPTX extended path -- multi-result decomposition
// See detailed algorithm below
decomposeExtractSubreg(MI)
continue
if MI is REG_SEQUENCE (opcode 15):
// Standard LLVM: expand into per-subreg COPYs
eliminateRegSequence(MI)
continue
DistanceMap[MI] = ++dist
// Build copy-equivalence classes for downstream coalescing
processCopy(MI) // tracks COPY, REG_SEQUENCE, INSERT_SUBREG chains
// Collect (srcIdx, dstIdx) pairs for all tied operands
if not collectTiedOperands(MI, TiedOperandMap):
continue
// Single-pair fast path: attempt commutation / 3-addr conversion
if TiedOperandMap has exactly 1 register with 1 pair:
if tryInstructionTransform(MI, srcIdx, dstIdx, dist):
continue // constraint eliminated without COPY
// General path: insert COPYs for all remaining tied pairs
for each (reg, pairs) in TiedOperandMap:
processTiedPairs(MI, pairs, dist)
// Rewrite INSERT_SUBREG to COPY after tied constraints satisfied
if MI is INSERT_SUBREG:
remove operands 3 and 1
rewrite descriptor to COPY
tryInstructionTransform (sub_1F4EF20)
This is the optimization core. When OptLevel != None, it attempts to satisfy a tied constraint without inserting a COPY, in priority order:
- Commutation. If swapping operands makes src match dst, commute the instruction via TII->commuteInstruction(). On NVPTX, most arithmetic instructions are commutative, so this is the most frequent success path. Upstream uses isProfitableToCommute(), which walks up to MaxDataFlowEdge (default 3) dataflow edges to evaluate benefit.
- 3-address conversion. Call TII->convertToThreeAddress() to produce a true three-operand form. On NVPTX this is essentially dead code -- PTX instructions are already three-address -- but the infrastructure exists because the pass is shared LLVM code.
- Rescheduling. When twoaddr-reschedule is enabled (default true), attempt to move the kill of the source register closer to the current instruction (rescheduleMIBelowKill) or move the current instruction below the kill (rescheduleKillAboveMI). This can eliminate the need for a copy by making the source register die at the tied use.
- Load unfolding. For instructions with folded loads where the source is not killed, unfold the load into a separate MOV + arithmetic pair. Not applicable on NVPTX (no load folding).
- COPY insertion. If all optimization attempts fail, fall through to processTiedPairs, which inserts an explicit COPY.
The function calls itself recursively (22 cross-references including a recursive self-call at sub_1F4EF20) for transitive constraint resolution -- when unfolding creates a new instruction that itself has tied operands, the resolution recurses.
EXTRACT_SUBREG Multi-Result Decomposition Algorithm
This is the most substantial NVIDIA extension to the upstream pass. The code lives at lines 821--994 of sub_1F53550 (decompilation line numbers from the 2,470-line function body). Standard LLVM handles single-result EXTRACT_SUBREG; the NVPTX version handles multi-result instructions where the InstrEmitter has produced a single EXTRACT_SUBREG pseudo with multiple operand pairs representing all extracted components.
Why Multi-Result EXTRACT_SUBREG Exists
When InstrEmitter::EmitNode (sub_2EDDF20, 872-byte stack frame, self-recursive for multi-result SDNode chains) lowers a multi-result NVPTX intrinsic, it produces a single MachineInstr with opcode 14 (EXTRACT_SUBREG) carrying N operand pairs -- one per result component. Each pair contains a def register (the extracted component destination) and a use register (the source super-register) plus a subreg index encoding which component to extract. The TwoAddress pass must decompose this single multi-operand pseudo into N separate COPY instructions.
The major producer categories:
| Producer | Handler | ID range | Typical result width |
|---|---|---|---|
| Texture/surface loads | sub_33A4350 | 50 IDs (0x5D--0x8D) | v4f32, v2f64, v4i32 |
| WMMA / MMA operations | sub_33A64B0 | 95 IDs (0xA4--0xA8, 0x194--0x1EC) | 2--8 register fragments |
| Multi-element surface ops | case 0xA2 | single | loop over elements |
| MMA sm90+ (wgmma) | sub_33AC8F0 | 0x183--0x191 | 8--16 register fragments |
| TMA operations | sub_33AD3D0 | 0x179--0x17C | varies |
| Async copy | sub_33ADA20 | 0x17F--0x182 | 2 results (data + token) |
The DAG-level builders that produce multi-result nodes are sub_3411BE0 (multi-result DAG node), sub_33FC220 (multi-result variadic node), and sub_33F7800 (multi-result alternate form). The type list is built by sub_1D25C30 (SelectionDAG::getVTList for multi-result).
Operand Memory Layout
Each MachineOperand occupies 40 bytes in memory (stride 40 per operand in the operand array):
| Offset within operand | Size | Field |
|---|---|---|
| +0 | byte | Flags byte 0: bit 0 = isDef |
| +2 | word | Flags word: bits 4--11 = subreg class index, bits 8--19 = subreg index |
| +3 | byte | Flags byte 3: bit 4 = isTied, bit 6 = earlyTied |
| +4 | byte | Flags byte 4: bit 0 = isTied flag (secondary) |
| +8 | int64 | Register number (virtual reg > 0, physical reg < 0) |
The subreg index is extracted by the formula:
subregIdx = (*(uint32_t*)(operand + 0) >> 8) & 0xFFF
This 12-bit field encodes which sub-register of the source to extract: sub0, sub1, sub2, sub3, etc. For a v4f32 texture result, the values are typically 1 through 4.
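The bit-field extraction can be written as a one-line helper. This is a standalone sketch of the recovered formula, not code from the binary:

```python
def subreg_index(flags: int) -> int:
    """Extract the 12-bit subreg index from bits 8-19 of an operand's
    first flags dword, mirroring (flags >> 8) & 0xFFF."""
    return (flags >> 8) & 0xFFF
```

For a v4f32 result, the four def operands would carry field values 1 through 4.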
Decomposition Pseudocode
// sub_1F53550 lines 821-994: EXTRACT_SUBREG handler (opcode == 14)
decomposeExtractSubreg(MI):
numOps = MI.getNumOperands() // v405
pairIdx = 0 // v286, stride-2 counter
while pairIdx < numOps:
defOp = MI.getOperand(pairIdx) // base + pairIdx * 40
useOp = MI.getOperand(pairIdx + 1) // base + (pairIdx+1) * 40
dstReg = defOp.getReg() // *(int64*)(defOp + 8)
srcReg = useOp.getReg() // *(int64*)(useOp + 8)
// Extract subreg index from def operand flags (bits 8-19)
subregIdx = (defOp.flags >> 8) & 0xFFF
// Check if this operand is already tied (bit 0 of byte +4)
alreadyTied = (defOp.flagsByte4 & 1) != 0
// === CREATE COPY INSTRUCTION ===
// sub_1E0B640(MBB, insertPoint, MI.getDebugLoc(), 0)
// This is BuildMI -- allocates a new MachineInstr with opcode COPY
newCOPY = BuildMI(MBB, MI, MI.getDebugLoc(), TII.get(TargetOpcode::COPY))
// Insert into block's instruction list
if MI.isBundledWithSucc():
sub_1DD6E10(MBB, MI, newCOPY) // insertBefore (bundled variant)
else:
sub_1DD5BA0(MBB, MI, newCOPY) // standard list insert
// Add def operand: destination register with subreg class encoding
// sub_1E1A9C0(newCOPY, dstReg, flags_with_subregclass)
newCOPY.addOperand(MachineOperand::CreateReg(dstReg, /*isDef=*/true))
// Add use operand: source register
// sub_1E1A9C0(newCOPY, srcReg, flags_use)
newCOPY.addOperand(MachineOperand::CreateReg(srcReg, /*isDef=*/false))
// === EARLY TIED OPTIMIZATION ===
// When this is NOT the first pair (pairIdx > 0) and the instruction
// has tied constraints, check if a later pair shares the same dest
// register. If so, mark the first operand of this COPY with isTied,
// allowing the register coalescer to merge them without an extra COPY.
if pairIdx > 0:
earlyTiedCheck = (defOp.flagsByte3 >> 6) & 1 // bit 6
isTiedCheck = (defOp.flagsByte3 >> 4) & 1 // bit 4
if earlyTiedCheck AND isTiedCheck:
newCOPY.getOperand(0).setTied() // set bit 0 of byte +4
// === OPTIMIZATION REMARK ===
if ORE != null: // pass object offset +272
sub_1DCCCA0(ORE, dstReg, MI, newCOPY) // emit copy remark
remarkData = sub_1DCC790(ORE, dstReg) // lookup remark data
sub_1F4C640(remarkData) // filter/emit remark
sub_1DCBB50(ORE) // push to output
if newCOPY.isInsideBundle():
walk to bundle head via successor chain
if sub_1E1AFE0(bundleHead): // hasProperty check
sub_1DCC370(ORE, remarkNode) // append to list
// === LIVEVARIABLES UPDATE ===
if LV != null: // pass object offset +280
sub_1DBF6C0(LV, MBB, MI, newCOPY, ...)
// This calls the full update chain:
// sub_1DBA290: createNewVarInfo for newCOPY's def register
// sub_1DBB110: initVarInfo (initialize kill/def lists)
// sub_1DB3C70: findKill (locate kill point in block)
// sub_1DB4410: addKill (update kill tracking for srcReg)
// sub_1DB8610: addNewBlock (update block-level liveness)
pairIdx += 2 // v286 += 2 (stride-2)
// === CLEANUP ===
// Remove all operands from original MI, then erase it
sub_1E16240(MI) // RemoveOperand (bulk)
MI.eraseFromParent()
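The stride-2 walk above can be condensed into a runnable sketch. Operands are modeled here as (register, flags) tuples; the names are illustrative, not recovered from the binary:

```python
def decompose_extract_subreg(operands):
    """Walk def/use operand pairs with stride 2, emitting one COPY record
    (dst, src, subreg_idx) per pair, as the EXTRACT_SUBREG handler does."""
    copies = []
    for pair_idx in range(0, len(operands), 2):
        dst_reg, def_flags = operands[pair_idx]        # def operand
        src_reg, _use_flags = operands[pair_idx + 1]   # use operand
        subreg_idx = (def_flags >> 8) & 0xFFF          # bits 8-19 of flags
        copies.append((dst_reg, src_reg, subreg_idx))
    return copies
```

A v4f32 texture load (4 def/use pairs against one super-register) yields 4 COPY records, one per extracted component.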
earlyTied Optimization Detail
The earlyTied optimization is a critical performance path. Consider a v4f32 texture load producing 4 results. Without earlyTied, the decomposition creates 4 independent COPY instructions. The register coalescer must then discover independently that some of these COPYs can be coalesced.
The earlyTied flag (bit 6 of operand flags byte +3) is set during instruction emission when the emitter knows that consecutive extract results target adjacent sub-registers of a contiguous super-register. When detected, the pass marks the COPY's def operand with the isTied bit, creating a chain of tied constraints:
// Without earlyTied (4 independent COPYs, coalescer must work harder):
%dst0 = COPY %src.sub0
%dst1 = COPY %src.sub1
%dst2 = COPY %src.sub2
%dst3 = COPY %src.sub3
// With earlyTied (COPYs carry tie hints, coalescer has direct information):
%dst0 = COPY %src.sub0 // first pair: no tie
%dst1 = COPY %src.sub1 [tied to %dst0.succ] // isTied bit set
%dst2 = COPY %src.sub2 [tied to %dst1.succ] // isTied bit set
%dst3 = COPY %src.sub3 [tied to %dst2.succ] // isTied bit set
The condition is: (flagsByte3 >> 6) & 1 (earlyTied set) AND (flagsByte3 >> 4) & 1 (isTied set) AND pairIdx > 0 (not the first pair). This triple-guard prevents false positives on single-result extracts and on the first component which has no predecessor to tie to.
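The triple guard reduces to a small predicate. A minimal sketch, assuming the flag positions recovered above:

```python
def should_set_tied(flags_byte3: int, pair_idx: int) -> bool:
    """Triple guard for the earlyTied optimization: bit 6 (earlyTied) and
    bit 4 (isTied) of flags byte +3 must both be set, and the pair must not
    be the first one (the first component has no predecessor to tie to)."""
    early_tied = (flags_byte3 >> 6) & 1
    is_tied = (flags_byte3 >> 4) & 1
    return bool(early_tied and is_tied) and pair_idx > 0
```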
LiveVariables Update Chain
Every COPY produced by the decomposition triggers a six-function update sequence. This is deeper than upstream LLVM's TwoAddress LiveVariables handling and suggests NVIDIA's downstream register allocator (the greedy RA at sub_1E5B110) is particularly sensitive to stale liveness:
| Step | Function | Purpose |
|---|---|---|
| 1 | sub_1DBF6C0 | Entry: transfer liveness from old MI to new COPY |
| 2 | sub_1DBA290 | createNewVarInfo: allocate VarInfo for the COPY's def register |
| 3 | sub_1DBB110 | initVarInfo: initialize the VarInfo's kill list, def list, and alive-block bitvector |
| 4 | sub_1DB3C70 | findKill: scan the current block to locate where srcReg is killed |
| 5 | sub_1DB4410 | addKill / removeKill: move the kill point from the original MI to the new COPY (srcReg now dies at the COPY, not at the original EXTRACT_SUBREG) |
| 6 | sub_1DB8610 | addNewBlock: update block-level liveness bitvectors if srcReg is live-in to this block from a predecessor |
For a v4f32 decomposition, this executes 24 function calls (6 per component times 4 components). For a wmma.mma producing 8 fragments, it is 48 calls. The cost is quadratic in the worst case because findKill scans from the block start, but in practice the kill is always close to the insertion point.
Multi-Result Producers on NVPTX
The EXTRACT_SUBREG decomposition path fires for all NVPTX operations that produce more than one register result. These originate in the intrinsic lowering pass (sub_33A64B0 and friends in the 0x33A cluster) and flow through SelectionDAG ISel and InstrEmitter before reaching TwoAddress.
Texture and Surface Loads
The texture bulk handler sub_33A4350 covers 50 intrinsic IDs (0x5D through 0x8D). A tex.1d.v4.f32 intrinsic produces an SDNode with value type list {f32, f32, f32, f32, chain} via sub_1D25C30 (getVTList). InstrEmitter converts this into a single MachineInstr with 8 operands (4 def/use pairs), which TwoAddress decomposes into 4 COPYs.
Surface read/write handlers at sub_33A3180 (IDs 0x8E--0x90) and the scatter/gather handler at case 0xA2 follow the same pattern with variable result widths.
WMMA and MMA Operations
The mega-handler sub_33A64B0 services 95 intrinsic IDs covering all wmma/mma variants across sm70+. A wmma.mma.sync on sm70 with fp16 accumulation produces 8 f16x2 fragments; on sm80 with tf32 it produces 4 f32 fragments. The sm90+ wgmma handler at sub_33AC8F0 (IDs 0x183--0x191) can produce up to 16 register fragments for large matrix shapes.
Each fragment becomes one operand pair in the EXTRACT_SUBREG pseudo. The TwoAddress pass decomposes a 16-fragment wgmma result into 16 individual COPYs, each with full LiveVariables update. This is the most expensive decomposition path in the entire pass.
TMA and Async Copy
TMA bulk operations (sub_33AD3D0, IDs 0x179--0x17C) and async copy operations (sub_33ADA20, IDs 0x17F--0x182) produce 2-result nodes (data + completion token). These are simpler decompositions with only 2 COPY instructions.
Inline Assembly Tied Operands
CUDA inline assembly with "+r" read-write constraints is the third category that exercises the TwoAddress pass on NVPTX. The tied operand pipeline spans three compilation stages:
Stage 1: EDG Constraint Construction (sub_1286D80 path)
The EDG frontend's inline asm codegen (analyzed in p2-B07-inline-asm-codegen.txt) detects tied operands when the operand descriptor byte at offset +24 equals 3. It constructs the constraint string by:
- Emitting the input value via sub_1286D80
- Appending * for indirect operands
- Appending the tied operand index as a decimal number to the constraint string
If the type size is a power of 2 and 64 bits or less, it may insert a bitcast to a matching integer type. GCC-style matching-digit constraints in input position are explicitly rejected with "tied input/output operands not supported!".
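Under that description, the constraint-string construction reduces to the following sketch (illustrative only; the real EDG code emits into its own IL buffers):

```python
def tied_constraint(tied_index: int, indirect: bool = False) -> str:
    """Build a matching constraint for a tied input operand: an optional
    '*' prefix for indirect operands, then the tied output's index
    rendered as a decimal number."""
    return ("*" if indirect else "") + str(tied_index)
```

For a "+r" output at position 0, the generated input-side constraint would be "0"; an indirect operand tied to slot 2 would become "*2".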
Stage 2: DAG-Level Tied Resolution (sub_2079C70)
SelectionDAGBuilder::visitInlineAsm (sub_2079C70, 83KB) uses:
- sub_20B4290: hasTiedOperand() -- checks if the tied index is not -1
- sub_20B42B0: getTiedOperand() -- returns the tied index
- sub_2045250: resolveTiedOperand() -- creates the DAG-level constraint
The error string "inline asm not supported yet: don't know how to handle tied indirect register inputs" guards against the unsupported case of tied operands on memory-indirect inline asm operands.
Stage 3: TwoAddress COPY Insertion
After ISel, the tied operand from inline asm appears as a regular tied constraint in the MachineInstr operand list. The TwoAddress pass processes it through the standard collectTiedOperands / processTiedPairs path. For "+r" constraints this typically produces a single COPY before the INLINEASM instruction.
processTiedPairs Detail (sub_1F50270)
This 63KB / 2,209-line function is the heavyweight tied-operand resolver. It is called from the main loop whenever collectTiedOperands finds constraints that the fast path (tryInstructionTransform) could not resolve.
processTiedPairs(MI, tiedPairs, distance):
for each (srcIdx, dstIdx) in tiedPairs:
srcReg = MI.getOperand(srcIdx).getReg()
dstReg = MI.getOperand(dstIdx).getReg()
if srcReg == dstReg:
continue // constraint already satisfied
// === ATTEMPT COMMUTATION (OptLevel != None) ===
if canCommute(MI):
// isProfitableToCommute walks up to MaxDataFlowEdge (default 3)
// dataflow edges from srcReg and dstReg, comparing distances
// in DistanceMap to determine if commuting reduces copies
if isProfitableToCommute(MI, srcIdx, dstIdx, distance):
TII->commuteInstruction(MI)
if MI.getOperand(srcIdx).getReg() == MI.getOperand(dstIdx).getReg():
continue // resolved by commutation
// === ATTEMPT RESCHEDULING (twoaddr-reschedule = true) ===
if twoAddrReschedule:
// Try to move MI below the kill of srcReg
if rescheduleMIBelowKill(MI, srcIdx, dstIdx, distance):
continue // resolved by rescheduling
// Try to move the kill of srcReg above MI
if rescheduleKillAboveMI(MI, srcIdx, dstIdx, distance):
continue // resolved by rescheduling
// === ATTEMPT 3-ADDRESS CONVERSION ===
// On NVPTX, convertToThreeAddress always returns null (dead code)
if TII->convertToThreeAddress(MI, LIS):
continue // resolved by conversion (never happens on NVPTX)
// === INSERT COPY (last resort) ===
newCOPY = BuildMI(MBB, MI, DL, TII.get(COPY), dstReg).addReg(srcReg)
// Extract subreg index from original operand
subregIdx = (MI.getOperand(srcIdx).flags >> 8) & 0xFFF
if subregIdx != 0:
newCOPY.getOperand(1).setSubReg(subregIdx)
// Insert into DistanceMap with incremented counter
// Walk predecessor chain to find scheduling unit
DistanceMap[newCOPY] = ++distance
DistanceMap[MI] = ++distance
// Rewrite srcReg to dstReg in original MI
MI.getOperand(srcIdx).setReg(dstReg) // sub_1E310D0
// Update SrcEqClassMap: map srcReg -> dstReg
SrcEqClassMap.insert(srcReg, dstReg) // sub_1F4E3A0
// === LIVEVARIABLES UPDATE ===
if LV:
varInfo = LV.getVarInfo(dstReg) // sub_1DC1550
if varInfo not found:
varInfo = LV.createNewVarInfo(dstReg) // sub_1DBA290
LV.initVarInfo(varInfo) // sub_1DBB110
// Transfer kill info: srcReg kill moves from MI to newCOPY
killInfo = varInfo.findKill(MBB) // sub_1DB3C70
varInfo.addKill(newCOPY, flags) // sub_1DB4410
// Update block-level liveness
varInfo.addNewBlock(MBB, position) // sub_1DB8610
// === OPTIMIZATION REMARK ===
if ORE and commutationWasAttempted: // v384 flag
sub_1DCC790(ORE, srcReg) // lookup remark data
sub_1F4C640(remarkData) // filter remark
sub_1DCBB50(ORE) // push
if newCOPY.isInsideBundle():
walk to bundle head
if sub_1E1AFE0(bundleHead):
sub_1DCC370(ORE, remarkNode) // append to list
// === REGISTER CLASS TIGHTENING ===
// sub_1E69410(SubtargetInfo, dstReg, regClass, 0)
// constrainRegClass on the destination register to the intersection
// of the current class and the class required by the tied operand
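The resolution order in the pseudocode above is a strict priority chain. A hedged sketch of just the control flow, with the resolvers passed in as callables (names are illustrative):

```python
def resolve_tied_pair(mi, resolvers):
    """Try each resolver (commutation, rescheduling, 3-address conversion)
    in priority order; only if all fail does the pass fall through to
    explicit COPY insertion."""
    for name, attempt in resolvers:
        if attempt(mi):
            return name          # constraint eliminated without a COPY
    return "insert_copy"         # last resort
```

On NVPTX the "convert" resolver always fails, so in practice the chain is commute, reschedule, then COPY.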
INSERT_SUBREG Rewrite (lines 2386--2396)
After all tied pairs are processed for an INSERT_SUBREG instruction (opcode 8), the pass converts it into a plain COPY:
if MI.getOpcode() == INSERT_SUBREG:
// Propagate subreg encoding from operand[3] into operand[0]
subregBits = MI.getOperand(3).getSubRegIdx()
MI.getOperand(0).setSubReg(subregBits)
// Copy tie flag from operand[1] into operand[0]
MI.getOperand(0).setTied(MI.getOperand(1).isTied())
// Remove operands 3 and 1 (in reverse order to preserve indices)
MI.RemoveOperand(3) // sub_1E16C90(MI, 3)
MI.RemoveOperand(1) // sub_1E16C90(MI, 1)
// Rewrite opcode descriptor to COPY
MI.setDesc(TII.get(COPY)) // descriptor at TII + 960
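The removal order matters: deleting operand 3 before operand 1 keeps the not-yet-removed index stable. A generic sketch of that pattern:

```python
def remove_operands(operands, indices):
    """Delete operand slots highest-index-first so earlier deletions do not
    shift the positions of slots still to be removed (the pass removes
    operand 3, then operand 1)."""
    for i in sorted(indices, reverse=True):
        del operands[i]
    return operands
```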
Copy-Equivalence Classes
The pass builds two maps (SrcEqClassMap at offset +552, DstEqClassMap at +584) that track transitive copy chains. When it encounters COPY, REG_SEQUENCE, or INSERT_SUBREG instructions, it records the source-to-destination register mapping. The helper collectRegCopies (sub_1F4E620, 357 lines) walks use-def chains to build transitivity: if A -> B -> C via COPYs, then A maps directly to C. These maps are consumed by the downstream RegisterCoalescer to improve copy elimination.
The collectRegCopies algorithm:
collectRegCopies(startReg):
chain = SmallVector()
reg = startReg
while true:
if not MRI.hasOneUse(reg): // sub_1E69E00
break
defMI = MRI.getVRegDef(reg)
if defMI.getOpcode() not in {COPY, REG_SEQUENCE, INSERT_SUBREG}:
break
nextReg = defMI.getOperand(1).getReg()
chain.push(reg)
reg = nextReg
// Process chain in reverse: build transitivity
for i in reverse(range(len(chain) - 1)):
SrcEqClassMap.insert(chain[i], chain[i+1]) // sub_1F4E3A0
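The transitivity this builds can be sketched as a chain walk. A minimal, cycle-guarded version (standalone illustration, not the binary's data structures):

```python
def copy_chain_end(copy_link, start):
    """Follow copy links (one register -> the register it is copied to/from)
    until the chain ends, so A -> B -> C collapses to a direct A -> C
    relation for the downstream coalescer."""
    reg, seen = start, {start}
    while reg in copy_link and copy_link[reg] not in seen:
        reg = copy_link[reg]
        seen.add(reg)
    return reg
```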
Data Structures
TiedOperandMap (stack-allocated SmallDenseMap<unsigned, SmallVector<pair<unsigned,unsigned>, 4>> with 4 inline entries):
| Offset in entry | Type | Field |
|---|---|---|
| +0 | int32 | Key (virtual register number; -1 = empty, -2 = tombstone) |
| +8 | ptr | Pair list pointer (points to +24 for inline storage) |
| +16 | int32 | Pair list size |
| +20 | int32 | Pair list capacity |
| +24 | int64[4] | Inline pair storage (each qword packs srcIdx and dstIdx) |
Entry stride: 56 bytes. Hash function: 37 * key, linear probing, load factor 3/4. Total inline size: 224 bytes on stack.
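The probing scheme can be illustrated with a few lines of standalone code, assuming the recovered parameters (hash = 37 * key, power-of-2 table, linear probing, sentinels -1 and -2):

```python
EMPTY, TOMBSTONE = -1, -2

def find_slot(keys, key):
    """Open-addressing probe in the recovered TiedOperandMap style:
    start at (37 * key) masked to the power-of-2 table size, then probe
    linearly until the key or a reusable sentinel slot is found."""
    mask = len(keys) - 1
    idx = (37 * key) & mask
    while keys[idx] not in (key, EMPTY, TOMBSTONE):
        idx = (idx + 1) & mask
    return idx
```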
DistanceMap (DenseMap<MachineInstr*, unsigned> at pass object offsets +312..+336): maps each MI to its sequential position within the current block. Hash: (ptr >> 4) ^ (ptr >> 9). Used by tryInstructionTransform and processTiedPairs for rescheduling decisions and commutation profitability evaluation.
Pass Object Layout (selected fields):
| Offset | Type | Field |
|---|---|---|
| +232 | MachineFunction* | Current function |
| +240 | MachineRegisterInfo* | MRI |
| +248 | TargetInstrInfo* | TII |
| +256 | TargetRegisterInfo* | TRI |
| +264 | ptr | InstrItineraryData* or TargetSubtargetInfo* |
| +272 | OptimizationRemarkEmitter* | ORE (NVIDIA addition) |
| +280 | LiveVariables* | LV |
| +288 | LiveIntervals* | LIS (via SlotIndexes at +160) |
| +296 | int | Effective optimization level |
| +304 | MachineBasicBlock* | Current MBB |
| +312..+336 | DenseMap | DistanceMap |
| +344..+376 | SmallPtrSet | Processed set |
| +448..+476 | SmallPtrSet | Second set (reprocessing) |
| +552..+576 | DenseMap | SrcEqClassMap |
| +584..+608 | DenseMap | DstEqClassMap |
Tied Operand Scanning (Lines 1183--1413)
The collectTiedOperands logic iterates all operands of an instruction checking for tied constraints. The inner loop (at STEP 7 in the raw analysis) contains a special-case direct resolution path:
for opIdx in 0..numOps-1:
// Skip defs, already-tied, and operands with no subreg class
if operand.isDef(): continue // byte +0 != 0
if operand.isTied(): continue // bit 4 of byte +3
if operand.subregClass == 0: continue // bits 4-11 of word +2
tiedIdx = MI.findTiedOperandIdx(opIdx) // sub_1E16AB0
srcReg = operand[opIdx].getReg()
dstReg = operand[tiedIdx].getReg()
if srcReg == dstReg:
continue // already satisfied
// SPECIAL CASE: direct resolution without COPY
if operand.isTied(secondary) AND def.subregClass == 0:
if dstReg < 0: // physical register
regClass = sub_1F3AD60(MRI, instrDesc, opIdx, TII, MF)
if regClass:
MRI.constrainRegClass(dstReg, regClass) // sub_1E69410
operand.setReg(dstReg) // sub_1E310D0
operand.clearSubregBits() // *operand &= 0xFFF000FF
// Constraint resolved: use now points to same reg as def
continue
// NORMAL: add to TiedOperandMap
TiedOperandMap[srcReg].push({opIdx, tiedIdx}) // packed as qword
The special-case path at the isTied(secondary) check (bit 0 of byte +4) handles the case where the operand carries a secondary tie flag from instruction emission and the def side has no subreg class constraint. In this case the pass can directly rewrite the use register to match the def without inserting a COPY, and clears the subreg bits with the mask 0xFFF000FF.
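The mask arithmetic is worth spelling out: 0xFFF000FF keeps bits 0-7 and 20-31 while zeroing bits 8-19, the 12-bit subreg index field. A standalone sketch:

```python
SUBREG_CLEAR_MASK = 0xFFF000FF  # keeps bits 0-7 and 20-31, zeroes bits 8-19

def clear_subreg_bits(flags: int) -> int:
    """Apply the direct-resolution path's mask: the subreg index field
    (bits 8-19) is cleared, every other flag bit is preserved."""
    return flags & SUBREG_CLEAR_MASK
```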
NVIDIA Modifications
The pass is structurally stock LLVM -- the libNVVM build at sub_F4EA80 is byte-for-byte identical in structure, confirming shared source. The NVIDIA delta consists of four additions:
- Extended EXTRACT_SUBREG handling (lines 821--994 of the decompilation). Standard LLVM handles single EXTRACT_SUBREG; the NVPTX version handles multi-result instructions with multiple extract chains via stride-2 operand iteration. This is required for texture/surface loads returning v4f32, wmma/mma producing multi-register fragments, and similar multi-result NVPTX intrinsics. The earlyTied optimization (checking bits 4 and 6 of operand flags byte +3) is unique to this extension and provides direct coalescing hints for contiguous sub-register sequences.
- Deeper LiveVariables maintenance (lines 1791--2064). When a COPY is inserted, the pass creates new VarInfo entries (sub_1DBA290), initializes them (sub_1DBB110), updates kill info (sub_1DB3C70/sub_1DB4410), and maintains block-level liveness (sub_1DB8610). This six-function chain executes per COPY, not per instruction. For a 16-fragment wgmma result, this produces 96 function calls for liveness maintenance alone.
- OptimizationRemarkEmitter integration (lines 2207--2258). The pass reports cases where tied-operand constraints forced extra COPY insertions, providing performance diagnostic information. This is absent in upstream LLVM's TwoAddress pass. The ORE pointer is stored at pass object offset +272 and acquired via analysis lookup of unk_4FC4534. The five-function chain (sub_1DCCCA0 through sub_1DCC370) handles remark creation, filtering, and bundle-aware emission.
- optnone/fast-compile gate (sub_1636880). When the function has optnone or when NVIDIA's fast-compile mode is active, the effective optimization level is forced to 0. This disables commutation, 3-address conversion, and rescheduling attempts in tryInstructionTransform (which returns false immediately when OptLevel == None), making the pass a pure COPY-insertion pass with no optimization.
Knobs
| Knob | Default | Effect |
|---|---|---|
| twoaddr-reschedule | true | Enable/disable instruction rescheduling to coalesce copies. When true, the pass attempts to move instructions up or down within the block to avoid needing a COPY. |
| dataflow-edge-limit | 3 | Maximum number of dataflow edges to traverse when evaluating the profitability of commuting operands in isProfitableToCommute(). Higher values allow deeper analysis at compile-time cost. |
Both knobs are registered in constructor ctor_337 (found in the sweep at 0x4F0000--0x51FFFF). They are standard upstream LLVM options with no NVIDIA-specific modifications to their defaults.
The optnone/fast-compile gate is not a knob per se but has the effect of disabling all optimization paths in the pass, equivalent to setting both knobs to their most conservative values.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Pass registration (name + ID) | sub_1F4D900 | small | Sets "Two-Address instruction pass" and "twoaddressinstruction" |
| Constructor | sub_1F4D9F0 | small | |
| Helper: rescheduleMIBelowKill support | sub_1F4CC10 | -- | Called by sub_1F4EF20 |
| Helper: rescheduleKillAboveMI support | sub_1F4D060 | -- | Called by sub_1F4EF20 |
| SmallPtrSet::contains(MI*) | sub_1F4DD40 | 67 lines | Processed set membership check |
| SmallDenseMap::clear() | sub_1F4DE20 | 180 lines | TiedOperandMap cleanup, frees heap-allocated pair lists |
| DenseMap<int,int>::insert | sub_1F4E3A0 | 166 lines | EqClassMap insertion, hash = 37 * key |
| collectRegCopies | sub_1F4E620 | 357 lines | Walks COPY chains to build transitive equivalence classes |
| DenseMap<ptr,int>::insert | sub_1F4EC70 | 164 lines | DistanceMap insertion, hash = (ptr>>4) ^ (ptr>>9) |
| tryInstructionTransform | sub_1F4EF20 | 28KB / 1,127 lines | Core tied-operand rewriter: commutation, 3-addr, COPY. Recursive (22 xrefs). |
| processTiedPairs | sub_1F50270 | 63KB / 2,209 lines | Full pipeline: commute, convert, COPY insertion, LV/LI update |
| SmallDenseMap::grow | sub_1F53020 | 312 lines | TiedOperandMap rehash, 56-byte entry stride |
| runOnMachineFunction | sub_1F53550 | 79KB / 2,470 lines | Pass entry point |
| Helper: find matching superclass | sub_1F3AD60 | -- | Finds register class for tied physical reg constraints |
| Helper: implicit tied operands | sub_1F4C460 | -- | Checks if MI has implicit tied operand pairs |
| Helper: filter/emit remark | sub_1F4C640 | -- | ORE filtering for copy-insertion diagnostics |
| LiveVariables::createNewVarInfo | sub_1DBA290 | -- | Allocates VarInfo for new register |
| LiveVariables::initVarInfo | sub_1DBB110 | -- | Initializes kill/def lists and alive bitvector |
| VarInfo::findKill | sub_1DB3C70 | -- | Scans block for register kill point |
| VarInfo::addKill / removeKill | sub_1DB4410 | -- | Updates kill tracking |
| VarInfo::addNewBlock | sub_1DB8610 | -- | Updates block-level liveness bitvectors |
| LiveVariables::HandlePhysRegDef | sub_1DBF6C0 | -- | Transfer liveness from old MI to new COPY |
| ORE::emit (copy remark) | sub_1DCCCA0 | -- | Emits optimization remark for COPY insertion |
| ORE::lookup | sub_1DCC790 | -- | Looks up remark data for register |
| ORE::push | sub_1DCBB50 | -- | Pushes remark to output |
| ORE::appendToList | sub_1DCC370 | -- | Appends remark (bundle-aware) |
| MachineFunction::verify | sub_1E926D0 | -- | Called with "After two-address instruction pass" |
| isOptNone / fast-compile check | sub_1636880 | -- | Forces OptLevel = 0 when active |
Binary Size Note
The 79KB runOnMachineFunction plus 63KB processTiedPairs plus 28KB tryInstructionTransform total approximately 170KB of machine code. Upstream LLVM source for the entire pass is approximately 2,000 lines of C++. The binary bloat is almost entirely explained by aggressive inlining: every DenseMap::insert, DenseMap::find, DenseMap::clear, SmallPtrSet::insert, and SmallPtrSet::find operation is fully expanded inline with all template specialization, sentinel initialization, grow/rehash, and power-of-2 computation logic. This accounts for roughly 40% of the binary. The remaining expansion comes from the COPY-creation path (operand setup, flag manipulation, list splicing) being duplicated for each opcode-specific branch rather than factored into a shared helper.
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Primary purpose | Convert 3-address to 2-address form for physical register constraints (x86 tied operands) | Largely a formality on NVPTX (PTX is 3-address); primary role is eliminating REG_SEQUENCE/INSERT_SUBREG and building copy-equivalence maps |
| EXTRACT_SUBREG handling | Standard sub-register extraction for CPU multi-result instructions | Extended decomposition for multi-register NVPTX results: texture loads, tensor core operations (WMMA/MMA), and warp-level collectives |
| LiveVariables maintenance | Standard liveness tracking | Deeper LiveVariables maintenance with explicit VarInfo allocation/init (sub_1DBA290/sub_1DBB110) for new registers created during decomposition |
| ORE integration | Basic or absent remark emission for copies | Full OptimizationRemarkEmitter integration for COPY insertion diagnostics (sub_1DCCCA0/sub_1DCC790/sub_1DCBB50) |
| Binary size | ~2,000 lines of C++ source | 170 KB of machine code (79 KB runOnMachineFunction + 63 KB processTiedPairs + 28 KB tryInstructionTransform); bloat from aggressive DenseMap inlining |
| optnone/fast-compile gate | Standard OptLevel check | NVIDIA optnone / fast-compile check (sub_1636880) forces OptLevel = 0 for fast-compile kernels |
Cross-References
- Register Coalescing -- runs immediately after TwoAddress; consumes the SrcEqClassMap/DstEqClassMap built here
- Register Allocation -- the downstream consumer that requires tied operands to be resolved
- SelectionDAG -- produces the EXTRACT_SUBREG/INSERT_SUBREG/REG_SEQUENCE pseudo-instructions that this pass eliminates
- Instruction Emitter -- sub_2EDDF20 creates multi-result EXTRACT_SUBREG chains from SDNode output
- MMA Code Generation -- WMMA/MMA intrinsics producing multi-register results that require decomposition
- ISel Patterns -- instruction selection creates the tied operand constraints
- Instruction Scheduling -- runs before TwoAddress in the pre-RA scheduling slot
- Pipeline & Ordering -- full pass ordering context
- CLI Flags -- optnone and fast-compile mode
- LLVM Knobs -- twoaddr-reschedule, dataflow-edge-limit
- Hash Infrastructure -- DenseMap and SmallDenseMap internals used throughout
- Diagnostics -- OptimizationRemarkEmitter system
Instruction Scheduling
Prerequisites: Familiarity with Register Allocation, NVPTX register classes, and the codegen pipeline. Understanding of the GPU execution model (warp scheduling, latency hiding) is essential.
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/CodeGen/MachineScheduler.cpp (ScheduleDAGMILive), llvm/lib/CodeGen/MachinePipeliner.cpp (Swing Modulo Scheduler) (LLVM 20.0.0). The MRPA incremental pressure tracker and Texture Group Merge pass are NVIDIA-only with no upstream equivalent.
CICC v13.0 implements three distinct scheduling subsystems: MRPA (Machine Register Pressure Analysis) for incremental pressure tracking during MCSE, a Swing Modulo Scheduling pipeliner for loop bodies, and ScheduleDAGMILive for post-RA instruction ordering. All three maintain per-register-class pressure arrays but differ in granularity and update frequency. A texture group merge pass (sub_2DDE8C0) acts as a scheduling-adjacent optimization that groups texture load instructions for hardware coalescing.
| MRPA incremental tracker | sub_2E5A4E0 (primary), sub_1E00370 (backend variant) |
| MachinePipeliner (SMS) | sub_3563190 |
| ScheduleDAGMILive | sub_355F610 |
| Instruction selection heuristic | sub_3557A10 |
| Texture group merge | sub_2DDE8C0 |
| Scheduling mode switch | sub_21668D0 (post-RA), sub_2165850 (pre-RA) |
MRPA: Incremental Register Pressure Tracking
MRPA (Machine Register Pressure Analysis) provides incremental register pressure tracking for the Machine Common Subexpression Elimination (MCSE) pass. Rather than recomputing pressure from scratch after each instruction move or elimination, MRPA applies delta updates to maintain a running pressure state.
The primary implementation lives at sub_2E5A4E0 (48KB), with a backend variant at sub_1E00370 (78KB). Both use DenseMap hash tables for per-instruction pressure data with the hash function (ptr >> 9) ^ (ptr >> 4), empty sentinel -8, tombstone sentinel -16, minimum 64 buckets, and power-of-two sizing. The sub_1E00370 backend variant calls the pressure computation core at sub_1DF7390 (8 call sites) and sub_1DFB9D0 (6 call sites), plus pressure set queries via sub_1E1C690 / sub_1E15D60.
The MRPA pressure cluster spans the address range 0x1DF0000--0x1E0FFFF:
| Function | Role |
|---|---|
| sub_1DF3D00 | Scheduler support (lowest address) |
| sub_1DF4120 | Scheduler support |
| sub_1DF4FB0 | Scheduler support |
| sub_1DF5810 | Machine function pass (pressure-aware scheduling) |
| sub_1DF7390 | Pressure computation core (called 8x from sub_1E00370) |
| sub_1DF76E0 | Register liveness query |
| sub_1DF7A80 | Code motion feasibility check |
| sub_1DF81C0 | Pressure computation core |
| sub_1DF9E90 | Schedule optimization pass |
| sub_1DFB810 | DenseMap (64-bit value variant) |
| sub_1DFB9D0 | DenseMap (32-bit value variant, called 6x) |
| sub_1E00370 | MRPA entry -- backend variant |
Incremental Update Flow
The incremental update is the core algorithm. Rather than performing a full O(n) pressure recomputation after every MCSE transform, it maintains a running pressure state through delta operations. The pseudocode below is reconstructed from sub_2E5A4E0:
function mrpa_incremental_update(context, basicBlock):
// Phase 1: Build worklist via DFS
visited = DenseSet() // v292--v295
worklist = []
dfs_push(worklist, basicBlock, visited) // standard DFS seed
while worklist is not empty:
bb = worklist.pop()
// Phase 2: Create instruction tracking entries
tracking = context.densemap[+80..+104] // DenseMap at context offsets
for each instr in bb.instructions:
tracking.insert(instr, PressureEntry{})
// Phase 3: Filter schedulable instructions
if not sub_2E501D0(instr): // schedulability predicate
continue
// Phase 4: Scan operands (40-byte entries)
for i in range(instr.num_operands): // iterated at v69/v70
operand = instr.operand[i] // 40-byte stride
if not operand.isVirtualRegister():
continue
// Phase 5: Virtual register operand processing
old_reg = sub_2EBEF70(operand) // find existing rename mapping
reg_info = sub_2EBEE10(operand) // query register class, constraints
new_reg = sub_2EBE820(operand) // attempt rename if profitable
if new_reg != old_reg:
sub_2EBF120(old_reg) // free old register
// Phase 6: Register class constraint validation
sub_reg_list = sub_E922F0(reg_info) // sub-register list for class
for each sub_reg in sub_reg_list:
validate_class_constraint(sub_reg, context.class_limits)
// Phase 7: Pressure feasibility check
bb_pressure = context.per_bb_data[bb] // at v279[36]
if not sub_2E4F9C0(bb_pressure): // exceeds class limits?
// Rename was unprofitable -- roll back
context.rename_count_fail++ // *((_DWORD*)v254 + 17)
sub_2E88E20(instr) // erase unprofitable instruction
else:
context.rename_count_success++ // *((_DWORD*)v254 + 16)
The key insight is that steps 5--7 form a speculative rename-then-validate loop: MRPA tentatively renames a virtual register, checks whether the rename reduces pressure below the class limit, and rolls back if it does not. The rename counts at *((_DWORD*)v254 + 16) (success) and *((_DWORD*)v254 + 17) (failure) provide a diagnostic ratio of how often speculative renames succeed.
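The rename-then-validate loop condenses into a small runnable sketch. This illustrates the control flow only -- the field names (pressure_delta, class_limit) are invented for the example, and the real implementation tracks pressure per register class rather than as a single scalar:

```python
def speculative_rename(instrs, pressure, class_limit):
    """Tentatively apply each candidate rename and keep it only if the
    resulting pressure stays under the register-class limit. Mirrors the
    success/failure counters kept at *((_DWORD*)v254 + 16/17)."""
    stats = {"success": 0, "fail": 0}
    for instr in instrs:
        trial = pressure + instr["pressure_delta"]  # speculative state
        if trial > class_limit:
            stats["fail"] += 1       # roll back: pressure unchanged
        else:
            pressure = trial         # commit the rename
            stats["success"] += 1
    return pressure, stats
```

The success/fail ratio in stats is the same diagnostic signal the binary exposes: a low success rate means speculative renames rarely pay off for the current pressure limits.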
Register Liveness Queries
Register liveness (sub_1DF76E0) checks whether a register is live in an instruction range [a3, a4] using _bittest on register class bitmaps. A compressed alias table at context offset +240 stores sub-register overlap information in 24-byte entries containing alias counts and alias data offsets.
The alias table structure:
| Offset | Size | Content |
|---|---|---|
| +0 | 8 | Sub-table pointer |
| +8 | 56 | Alias data block |
| +8..+10 (per entry) | 2 | Alias count (uint16) |
| +10.. | variable | Alias register IDs (2 bytes each) |
Sub-register overlap is resolved through an incremental alias walk: for each register in the query range, the alias table is consulted to expand the register into its physical sub-registers, and each sub-register is tested against the liveness bitmap.
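A minimal sketch of the alias-walk query, with a plain dict (register ID to its aliasing physical sub-registers) standing in for the packed 24-byte alias-table entries:

```python
def is_live(reg, live_bitmap, alias_table):
    """Expand reg via the alias table, then test each physical
    sub-register against the liveness bitmap (cf. the _bittest
    loop in sub_1DF76E0). Registers absent from the table are
    treated as aliasing only themselves."""
    for sub in alias_table.get(reg, [reg]):
        if (live_bitmap >> sub) & 1:
            return True
    return False
```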
Code Motion Feasibility
Code motion feasibility (sub_1DF7A80) validates whether an instruction can be moved between basic blocks:
- Check the single-predecessor relationship between source and destination BBs.
- Validate against the allocation bitmask at allocator offset +38.
- Walk an instruction window bounded by offset +296 (configurable window size).
- Count conflicting operands within the window.
- Track affected registers in an rb-tree set (offsets 56--88) with node structure [left(16), right(24), value(32)].
An instruction is movable only if the conflicting operand count within the window is zero and the allocation bitmask permits the move.
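The feasibility test reduces to two checks: bitmask permission and a zero conflict count in the window. A simplified sketch, with Python sets standing in for the rb-tree register tracking and an integer for the allocation bitmask:

```python
def can_move(instr_regs, window_regs, alloc_mask):
    """Movable only if the allocation bitmask permits every register of
    the instruction AND no instruction in the bounded window touches one
    of those registers (cf. sub_1DF7A80)."""
    if any(not (alloc_mask >> r) & 1 for r in instr_regs):
        return False                       # bitmask forbids the move
    conflicts = sum(1 for regs in window_regs if regs & instr_regs)
    return conflicts == 0                  # zero-conflict requirement
```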
MRPA Verification
A debug-only verification path checks incremental update correctness against full recomputation. The trigger path in sub_2E5A4E0 (decompiled lines 1702--1708):
if ( *(_BYTE *)(v7 + 40) // [1] context enable flag -- always ON during MCSE
&& (_BYTE)qword_501F8A8 // [2] verify-update-mcse -- user must enable
&& (_BYTE)qword_501F988 // [3] incremental-update-mcse -- default ON
&& !sub_2E59B70( // [4] full recomputation DISAGREES
*(_QWORD*)(v7+48),
qword_501F7C8, ...) )
{
sub_C64ED0("Incorrect RP info from incremental MRPA update\n", 1u);
}
All four conditions must hold simultaneously:
- Context enable flag (v7 + 40) is set -- always true during MCSE.
- verify-update-mcse is ON -- user must explicitly enable this debug knob.
- incremental-update-mcse is ON -- default is ON.
- sub_2E59B70 returns false -- full recomputation disagrees with the incremental state.
When all conditions hold, the error "Incorrect RP info from incremental MRPA update" fires via sub_C64ED0 (LLVM's report_fatal_error). The print-verify knob controls whether detailed per-register-class mismatch data is printed.
The backend variant (sub_1E00370, decompiled lines 2416--2420) uses byte_4FC6020 as its guard flag, calls sub_1DFF720 for verification, and falls back to byte_4FC62C0 (a cached result) if verification is disabled.
| Knob | Default | Description |
|---|---|---|
| incremental-update-mcse | true | Incrementally update register pressure analysis |
| verify-update-mcse | false | Verify incremental update by full RP analysis |
| print-verify | false | Print problematic RP info if verification failed |
To trigger verification: cicc -Xcuda -verify-update-mcse input.cu. NVIDIA keeps this check off by default since the full rescan is O(n) and expensive.
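The four-condition gate amounts to one guarded comparison. In this sketch, tracker is the incremental pressure state and recompute_full is the O(n) rescan; the parameter names mirror the decompiled flags, but the function shape is invented for illustration:

```python
def verify_incremental(tracker, recompute_full, ctx_enabled,
                       verify_update_mcse, incremental_update_mcse):
    """Fire only when the context is enabled, both knobs are on, and a
    full recomputation disagrees with the incremental state -- the same
    short-circuit chain as the decompiled check in sub_2E5A4E0."""
    if (ctx_enabled and verify_update_mcse and incremental_update_mcse
            and recompute_full() != tracker):
        raise RuntimeError("Incorrect RP info from incremental MRPA update")
```

Because the conditions short-circuit, the expensive recompute_full() never runs unless verify-update-mcse is explicitly enabled -- which is why the knob is safe to leave registered in release builds.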
MachinePipeliner: Swing Modulo Scheduling
Complexity. Let N = number of instructions in the loop body and E = number of dependency edges in the DDG.
- DDG construction: O(N + E).
- RecMII (computeRecMII): finds the maximum cycle ratio via enumeration of elementary circuits in the DDG -- worst-case exponential, but bounded in practice by small loop sizes (N < 100) and sparse dependency graphs.
- ResMII: O(N) (sum of resource vectors).
- ASAP/ALAP: O(N + E) each (topological traversals).
- II search: probes at most pipeliner-ii-search-range (default 10) candidate IIs. For each II, node placement is O(N * II) -- each of N nodes probes up to II cycle slots.
The total scheduling cost is O((N + E) + R * N * II_max) where R = search range. The pipeliner-max-stages (default 3) and pipeliner-max-mii (default 27) knobs provide additional constant-factor bounds. For MRPA, the incremental pressure update is O(1) per instruction move (delta update), compared to O(N) for a full recomputation -- this is the key efficiency gain over a naive approach.
The MachinePipeliner (sub_3563190, ~2030 decompiled lines, ~58KB) implements Swing Modulo Scheduling (SMS) for software pipelining of loop bodies. It overlaps iterations of a loop body to improve throughput on pipelined hardware by interleaving instructions from different iterations. The upstream LLVM equivalent is SwingSchedulerDAG::schedule().
Pass discovery: the pipeliner walks an analysis array at this+3456 (offset 3456) looking for vtable unk_4F86530 (the MachinePipeliner analysis pass), then extracts the SwingSchedulerDAG context at offset +176.
Phase 1: Initialization and DDG Construction
The setup chain builds the data dependence graph and computes MII lower bounds:
| Step | Function | Description |
|---|---|---|
| 1 | sub_2F97F60 | initializeDAG -- build data dependence graph (DDG) over the single-BB loop body |
| 2 | sub_3559990 | computeNodeLatencies -- fill latency fields per SUnit from the target scheduling model |
| 3 | sub_3542B20 | addDependencies -- add register/memory/order dependency edges to the DDG |
| 4 | sub_2F90200 | updateRegPressure -- compute initial register pressure state for the loop body |
| 5 | sub_354CBB0 | computeRecMII -- find the maximum cycle length of any recurrence in the DDG |
| 6 | sub_35449F0 | computeResMII -- compute ceil(total_resource_usage / functional_unit_count) |
The context object SwingSchedulerDAG occupies approximately 4100 bytes:
| Offset | Field |
|---|---|
| +32 | MachineFunction* |
| +48..56 | BB range (iterated at 256-byte stride) |
| +944 | DenseMap: pre-existing ordering constraints |
| +3456 | Analysis pass vector |
| +3472 | MII (int32) |
| +3480 | schedulingSucceeded (bool) |
| +3488 | DiagnosticsEngine / remark context |
| +3520 | TargetSubtargetInfo* |
| +3944..3952 | DDG node storage (vector) |
| +4016..4072 | Recurrence DenseMap (24-byte entries) |
Phase 2: MII Computation and II Search
MII computation combines two lower bounds:
- RecMII (Recurrence MII): the longest cycle in the DDG, computed by sub_354CBB0. Each recurrence (loop-carried dependency cycle) constrains the minimum II because the cycle must fit within one iteration interval. If pipeliner-ignore-recmii is set, RecMII is forced to zero so only resource constraints matter.
- ResMII (Resource MII): ceil(sum of resource usage across all instructions / number of available functional units), computed by sub_35449F0. This reflects the throughput bottleneck of the hardware.
function compute_MII():
recMII = sub_354CBB0() // max recurrence length
resMII = sub_35449F0() // resource throughput limit
if pipeliner-ignore-recmii: // qword_503E888
recMII = 0
MII = sub_3542AB0(resMII, recMII) // max(resMII, recMII)
sub_3542AE0() // store MII at this+3472
return MII
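The same combination can be written as a runnable sketch. This uses the textbook SMS formulation, in which each recurrence contributes ceil(total latency / total iteration distance) -- the binary's computeRecMII may simplify this when distances are 1. The inputs (cycles, total_resource_usage, num_units) are illustrative stand-ins, not the binary's actual data structures:

```python
import math

def compute_mii(cycles, total_resource_usage, num_units, ignore_recmii=False):
    """MII = max(ResMII, RecMII). cycles is a list of
    (edge_latencies, iteration_distances) pairs, one per recurrence."""
    rec_mii = 0
    if not ignore_recmii:                       # pipeliner-ignore-recmii
        for latencies, distances in cycles:
            rec_mii = max(rec_mii, math.ceil(sum(latencies) / sum(distances)))
    res_mii = math.ceil(total_resource_usage / num_units)  # throughput bound
    return max(res_mii, rec_mii)
```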
The II search algorithm starts at MII and probes upward:
function ii_search(MII):
max_ii = MII + pipeliner-ii-search-range // default: MII + 10
if pipeliner-force-ii != 0: // qword_503EB80
return try_schedule(pipeliner-force-ii) // skip search entirely
for II = MII to max_ii:
// 1. Compute ASAP/ALAP at this II
asap = compute_ASAP(DDG, II) // sub_354BFF0 -> v369
alap = compute_ALAP(DDG, II) // sub_354BFF0 -> v373
// 2. Place all nodes into II-wide modulo reservation table
success = place_nodes(asap, alap, II) // sub_354C3A0
if not success:
continue // try next II
// 3. Compute stage count
numStages = (lastCycle - firstCycle) / II // (v84 - v80) / v88
// 4. Validate stage count
if numStages > pipeliner-max-stages: // default 3
continue
// 5. Register pressure check (if enabled)
if pipeliner-register-pressure: // qword_503E2C0
if not verify_pressure(II, pipeliner-register-pressure-margin):
continue // sub_355C7C0
return (II, schedule)
return FAILURE // "Unable to find schedule"
The pipeliner-force-ii knob (default 0) bypasses the search entirely and forces a specific II value. This is useful for testing or when the compiler team knows the optimal II for a specific loop shape.
Phase 3: ASAP/ALAP Computation
ASAP (As Soon As Possible) and ALAP (As Late As Possible) define the scheduling window for each instruction at a given II:
ASAP computation (sub_354BFF0, first invocation producing v369): traverses the DDG in topological order. For each node, ASAP = max over all predecessors of (predecessor.ASAP + edge.latency). The root nodes (no predecessors) have ASAP = 0. This gives the earliest cycle each instruction can execute without violating data dependencies.
ALAP computation (sub_354BFF0, second invocation producing v373): traverses the DDG in reverse topological order. For each node, ALAP = min over all successors of (successor.ALAP - edge.latency). Leaf nodes (no successors) have ALAP = II - 1 (or the schedule length bound). This gives the latest cycle an instruction can execute.
The scheduling window for instruction i is [ASAP(i), ALAP(i)]. Instructions with narrow windows (ASAP close to ALAP) are more constrained and are typically scheduled first by the node ordering heuristic.
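Both traversals reduce to one relaxation pass each over the dependency edges. A compact sketch, assuming the edge list is already in topological order (which the DDG walk guarantees):

```python
def compute_asap_alap(nodes, edges, sched_len):
    """edges: list of (pred, succ, latency) in topological order.
    ASAP relaxes forward from the roots; ALAP relaxes backward from
    the leaves. The window for node n is [asap[n], alap[n]]."""
    asap = {n: 0 for n in nodes}                  # roots start at cycle 0
    for p, s, lat in edges:                       # forward pass
        asap[s] = max(asap[s], asap[p] + lat)
    alap = {n: sched_len - 1 for n in nodes}      # leaves end at the bound
    for p, s, lat in reversed(edges):             # backward pass
        alap[p] = min(alap[p], alap[s] - lat)
    return asap, alap
```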
Phase 4: Node Placement
Node placement (sub_354C3A0) attempts to assign each instruction to a specific cycle in the modulo reservation table (MRT). The MRT has II columns (one per cycle in the initiation interval) and tracks resource usage per cycle.
The placement algorithm follows the Swing Modulo Scheduling strategy:
- Node ordering (sub_35630A0): nodes are prioritized by a combination of critical-path depth, recurrence membership, and scheduling freedom (ALAP - ASAP). Nodes in tight recurrences and on the critical path are placed first.
- Direction selection: for each node, the scheduler decides whether to place it "forward" (from ASAP toward ALAP) or "backward" (from ALAP toward ASAP) based on its dependency relationships. The "swing" refers to alternating direction between predecessor-constrained and successor-constrained nodes.
- Cycle probing: starting from the preferred direction, the scheduler tries each cycle in the node's [ASAP, ALAP] window. At each candidate cycle, it checks resource availability in the MRT (the cycle modulo II must have sufficient functional unit capacity) and verifies that all dependency constraints remain satisfied.
- Conflict resolution: if no cycle in the window is feasible, the placement fails for this II and the search continues with II+1.
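Cycle probing against the MRT reduces to a modulo-indexed capacity check. A one-resource-class sketch (the real MRT tracks a vector of functional-unit counts per slot, and probing direction alternates per the swing heuristic):

```python
def place_in_mrt(window, ii, mrt, capacity):
    """Probe each cycle in [asap, alap]; the slot cycle % II must have a
    free functional unit. Returns the chosen cycle, or None on placement
    failure (which triggers a retry at II+1)."""
    asap, alap = window
    for cycle in range(asap, alap + 1):
        slot = cycle % ii                  # modulo reservation table index
        if mrt[slot] < capacity:
            mrt[slot] += 1                 # reserve the functional unit
            return cycle
    return None
```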
Phase 5: Kernel Generation
After a valid schedule is found, the pipeliner builds the kernel, prolog, and epilog. The numStages value ((lastCycle - firstCycle) / II) determines how many iterations overlap.
function build_kernel(schedule, II, numStages):
// Build instruction-to-stage and instruction-to-cycle DenseMaps
instrToStage = DenseMap<SUnit*, int>() // v317/v318/v319
instrToCycle = DenseMap<SUnit*, int>() // v320/v321/v322
// DenseMap config: hash=(key>>9)^(key>>4), empty=-4096, tombstone=-8192
for stage in range(numStages):
for each SUnit in schedule.stage_bundle(stage):
instrToStage[SUnit] = stage
instrToCycle[SUnit] = SUnit.assigned_cycle
// Cross-reference recurrence edges with stage assignments
if this+4064 (recurrence count) != 0:
for each recurrence_edge in this+4056:
edge.stage = instrToStage[edge.instruction]
// Build per-recurrence analysis DenseMap (24-byte entries)
// Select codegen backend (priority order):
if pipeliner-annotate-for-testing: // testing mode: annotate only
sub_359AD80(schedule)
return
if pipeliner-experimental-cg: // peeling code generator
if numStages == 0:
sub_35A5710() // trivial kernel (no overlap)
else:
sub_35A93B0() // experimental peeling CG
sub_3598EB0() // finalize prolog/epilog
return
if pipeliner-mve-cg: // MVE code generator (DEFAULT)
if numStages == 0 and target_supports_mve():
sub_35A7730() // MVE compatibility check
sub_35A76E0() // MVE code generator
return
// else fall through to experimental CG
// Default fallthrough: experimental CG path
The codegen backend priority is: (1) pipeliner-annotate-for-testing for test infrastructure, (2) pipeliner-experimental-cg for peeling-based generation, (3) pipeliner-mve-cg (default enabled) for the MVE (Modulo Variable Expansion) code generator. The MVE path is gated on numStages == 0 and a target callback at **(this+3520)+72 returning non-default (i.e., not sub_2FDC510).
The SBO (Small Buffer Optimization) pattern is used for nodeInfo arrays: v416 = v418 (inline buffer of 704 bytes = 8 nodes x 88 bytes). When the loop body exceeds 8 instructions, sub_35498F0 sorts and possibly heap-allocates.
Error Conditions
| Condition | Diagnostic | Severity |
|---|---|---|
| MII == 0 | "Invalid Minimal Initiation Interval: 0" | 0x15 (missed) |
| MII > pipeliner-max-mii | "Minimal Initiation Interval too large: MII > SwpMaxMii" | 0x15 (missed) |
| Scheduling failure | "Unable to find schedule" | 0x15 (missed) |
| numStages == 0 | "No need to pipeline - no overlapped iterations in schedule." | 0x15 (missed) |
| numStages > pipeliner-max-stages | "Too many stages in schedule: numStages > SwpMaxStages" | 0x15 (missed) |
| Success | "Pipelined succesfully!" [sic] | 0x13 (passed) |
The typo "succesfully" (single 's') is preserved from upstream LLVM.
Pipeliner Knobs
| Knob | Global | Default | Description |
|---|---|---|---|
| enable-pipeliner | unk_503EE20 | true | Master switch for SMS |
| enable-pipeliner-opt-size | qword_503ED40 | false | Enable SWP at -Os |
| pipeliner-max-mii | qword_503ECE8 | 27 | Maximum allowed MII |
| pipeliner-force-ii | qword_503EB80 | 0 | Force specific II (0 = auto) |
| pipeliner-max-stages | qword_503EB28 | 3 | Maximum pipeline stages |
| pipeliner-prune-deps | qword_503E9C0 | true | Prune deps between unrelated Phi nodes |
| pipeliner-prune-loop-carried | qword_503E8E0 | true | Prune loop-carried order deps |
| pipeliner-ignore-recmii | qword_503E888 | false | Ignore RecMII (hidden knob) |
| pipeliner-show-mask | qword_503E720 | false | Debug: show scheduling mask |
| pipeliner-dbg-res | qword_503E640 | false | Debug: resource usage |
| pipeliner-annotate-for-testing | qword_503E5E8 | false | Annotate instead of codegen |
| pipeliner-experimental-cg | qword_503E508 | false | Use peeling code generator |
| pipeliner-ii-search-range | qword_503E3A0 | 10 | Range to search for II |
| pipeliner-register-pressure | qword_503E2C0 | false | Consider register pressure |
| pipeliner-register-pressure-margin | qword_503E1E0 | 5 | Margin % for reg pressure limit |
| pipeliner-mve-cg | unk_503E100 | true | Use MVE code generator |
| pipeliner-enable-copytophi | qword_503E020 | true | Enable CopyToPhi DAG Mutation |
| pipeliner-force-issue-width | qword_503DF40 | 0 | Force issue width (0 = auto) |
All registered in ctor_676_0_0x5a3430.c.
MachinePipeliner Function Map
| Function | Identity |
|---|---|
| sub_3563190 | Top-level SMS orchestrator (SwingSchedulerDAG::schedule) |
| sub_2F97F60 | initializeDAG -- build DDG |
| sub_3559990 | computeNodeLatencies |
| sub_3542B20 | addDependencies -- register/memory/order edges |
| sub_2F90200 | updateRegPressure |
| sub_354CBB0 | computeRecMII |
| sub_35449F0 | computeResMII |
| sub_3542AB0 | setMII = max(ResMII, RecMII) |
| sub_3542AE0 | validateMII / store at +3472 |
| sub_3556270 | collectNodeInfo -- gather 88-byte per-node records |
| sub_35476E0 | initNodeOrder -- compute scheduling order |
| sub_35523F0 | computeSchedule -- build SUnit ordering |
| sub_35546F0 | orderDependences -- topological sort |
| sub_3543340 | computeStart -- ASAP/ALAP times |
| sub_35630A0 | normalizeSchedule -- adjust cycle numbering |
| sub_35568E0 | scheduleNodes -- core SMS placement |
| sub_35433F0 | adjustSchedule -- post-adjustment |
| sub_3557A10 | computeFinalSchedule -- finalize stage/cycle |
| sub_354A760 | buildStageMap -- iteration-to-stage mapping |
| sub_355F610 | schedule() -- II search loop (2351 lines) |
| sub_354BE50 | getScheduleForStage |
| sub_35498F0 | sortNodeInfo (for >8 nodes) |
| sub_359AD80 | annotateForTesting |
| sub_35A5710 | generateTrivialKernel |
| sub_35A93B0 | experimentalPeelingCG |
| sub_3598EB0 | finalizeExperimentalKernel |
| sub_35A76E0 | mveCG -- MVE code generator |
| sub_35A7730 | mveCompatCheck |
ScheduleDAGMILive: Post-RA Instruction Ordering
ScheduleDAGMILive (sub_355F610, 64KB) is the post-RA machine instruction scheduler. It takes the pipeliner's output (or standalone scheduling regions) and determines the final instruction order while respecting register pressure limits.
Data structures:
- SUnit (Scheduling Unit): 88 bytes per instruction, consistent across both the pipeliner and ScheduleDAGMILive.
- Instruction-to-node hash map: 632-byte entries per instruction. The unusually large entry size suggests extensive caching of per-instruction metadata (RP deltas, latency info, dependency edges) to avoid recomputation.
- RP tracking structure: 112 bytes, with per-register-class pressure arrays at offsets 32--48 (current) and 56--72 (limits).
The scheduling flow:
- Initialize RP tracking via sub_3551AB0 (if pipeliner-register-pressure is set).
- Set per-class pressure defaults via sub_2F60A40.
- Walk the BB instruction list and build the instruction-to-node hash map.
- Compute ASAP (earliest cycle) via sub_354BFF0 -> v369.
- Compute ALAP (latest cycle) via sub_354BFF0 -> v373.
- Place instructions via sub_354C3A0 (returns success/failure).
- Calculate stage count: (lastCycle - firstCycle) / II = (v84 - v80) / v88.
- Verify placement via sub_355C7C0.
- Build stage descriptors via sub_355D7E0 (80 bytes per stage, 10 QWORDs each).
Instruction Selection Heuristic
The instruction selection heuristic (sub_3557A10, 47KB) determines which instruction to schedule next from the ready set. It implements a multi-level priority scheme operating on 88-byte SUnit entries:
Level 1 -- Latency/Depth priority (SUnit offset +240): instructions deeper in the dependency graph are scheduled first. Depth is measured as the longest path from the instruction to a sink node in the DDG. This ensures that critical-path instructions are placed early, preventing them from becoming bottlenecks. Latency recomputation occurs via sub_2F8F5D0 during priority comparison to account for any scheduling decisions already made.
Level 2 -- Target priority table (context a1+3944): a table of 16-byte entries, each containing:
| Offset | Size | Field |
|---|---|---|
| +0 | 4 | start -- first cycle of priority window |
| +4 | 4 | end -- last cycle of priority window |
| +8 | 4 | priority -- target-assigned priority value |
| +12 | 4 | window_width -- scheduling window size |
The target (NVPTX backend) populates this table to express hardware-specific ordering preferences -- for example, prioritizing memory operations that can be overlapped with computation, or ensuring that warp-synchronous instructions are scheduled in specific relative positions. Instructions that fall within a priority window with a higher priority value are selected first.
Level 3 -- Schedule window width: when levels 1 and 2 are tied, the instruction with the narrower scheduling window (ALAP - ASAP) is preferred. Narrower windows mean fewer legal placement options, so these instructions should be placed before more flexible ones to avoid creating conflicts.
The ready queue is managed by sub_3553D90. Pattern matching on ready instructions proceeds through sub_35540D0 (applicability check) and sub_35543E0 (pattern application), with validation via sub_3546B80. A hash table at a1+3976 maps instructions to schedule nodes for O(1) lookup during priority comparison.
function select_next_instruction(ready_set):
best = null
for each candidate in ready_set:
if best is null:
best = candidate
continue
// Level 1: depth comparison
if candidate.depth > best.depth: // offset +240
best = candidate
continue
if candidate.depth < best.depth:
continue
// Level 2: target priority table
cand_prio = lookup_target_priority(candidate, priority_table)
best_prio = lookup_target_priority(best, priority_table)
if cand_prio > best_prio:
best = candidate
continue
if cand_prio < best_prio:
continue
// Level 3: prefer narrower window
cand_width = candidate.ALAP - candidate.ASAP
best_width = best.ALAP - best.ASAP
if cand_width < best_width:
best = candidate
return best
Texture Group Merge
The Texture Group Merge pass (sub_2DDE8C0, 74KB, 2382 decompiled lines) groups texture load instructions that access related memory locations, enabling the hardware texture unit to coalesce them into fewer requests. This is an NVIDIA-specific pass not present in upstream LLVM.
Fibonacci Hashing
The pass uses Fibonacci hashing for candidate bucketing:
hash = (ptr * 0xBF58476D1CE4E5B9) >> shift
The constant 0xBF58476D1CE4E5B9 is a 64-bit multiplicative hash constant, best known as the first mixing multiplier of the splitmix64 finalizer. (The classic Fibonacci multiplier, floor(2^64 / phi) where phi = (1 + sqrt(5)) / 2 is the golden ratio, is 0x9E3779B97F4A7C15; CICC's constant plays the same role here.) Multiplicative hashing gives near-uniform distribution for pointer-based keys because the multiplication diffuses the low, alignment-biased pointer bits across the whole word before the shift retains only the high bits. Related multiplicative constants appear in the Linux kernel's hash_64() (GOLDEN_RATIO_64 in include/linux/hash.h) and in LLVM's FoldingSet and DenseMap internals. Within CICC, the same 0xBF58476D1CE4E5B9 constant also appears in:
- SCEV expression uniquing (sub_DC2B70)
- the OpenMP SPMD region hash table
The shift parameter controls how many high bits are retained, effectively determining the number of hash buckets as 2^(64 - shift). For a 1024-bucket table, shift would be 54.
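The bucketing step is small enough to demonstrate directly; this sketch reproduces the multiply-shift with Python integers masked to 64 bits:

```python
FIB_MULT = 0xBF58476D1CE4E5B9  # CICC's multiplicative hash constant

def fib_bucket(ptr, shift):
    """Multiply modulo 2**64 and keep the high bits; the bucket count
    is 2**(64 - shift), so shift=54 yields a 1024-bucket table."""
    return ((ptr * FIB_MULT) & 0xFFFFFFFFFFFFFFFF) >> shift
```

Because the multiplier is odd, the multiplication is a bijection on 64-bit values, so aligned pointers (which share their low bits) still scatter across buckets.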
Algorithm Detail
- Walk the BB instruction list.
- For each instruction, call sub_2DDC600 (candidate identification) to determine if it is a texture load eligible for merging.
- Hash the candidate's key pointer using Fibonacci hashing to assign it to a bucket.
- Insert the candidate into the group table.
Group table entries are 56 bytes (7 QWORDs):
| Offset | Size | Content |
|---|---|---|
| +0 | 8 | Key pointer (texture base address or descriptor) |
| +8 | 8 | Data pointer (to member array) |
| +16 | 4 | Member count |
| +20 | 4 | Member capacity |
| +24 | 32 | Reserved / padding |
Group members are 32 bytes each:
| Offset | Size | Content |
|---|---|---|
| +0 | 8 | MachineInstr* -- the texture load instruction |
| +8 | 8 | Symbol -- the texture symbol reference |
| +16 | 8 | Debug info -- source location |
| +24 | 8 | Scope info -- DWARF scope |
Generated group names carry a .Tgm (Texture Group Merge) suffix via sub_2241490. This suffix appears in debug output and internal symbol tables.
4-Callback Framework
The pass operates through a general instruction grouper framework (sub_3147BA0) that supports multiple types of instruction grouping through a common callback interface. Four callbacks are registered for texture group merge:
| # | Callback | Function | Purpose |
|---|---|---|---|
| 1 | Candidate identification | sub_2DDC600 | Examines each MachineInstr and returns true if it is a texture load eligible for grouping. Checks opcode, address space (texture memory), and operand constraints. |
| 2 | Group formation | sub_2DDBF40 | After candidates are identified and hashed into buckets, this callback decides which candidates within a bucket should form a group. It checks address proximity, common base registers, and compatible access patterns. |
| 3 | Merge execution | sub_2DDB3F0 | Applies the actual merge transformation. Replaces individual texture loads with a single grouped load instruction, rewrites operands, and updates dependency edges. |
| 4 | Cleanup | sub_2DDB400 | Frees temporary data structures (group tables, member arrays, hash buckets) after merging is complete. |
Additional helper functions in the texture group merge:
| Function | Role |
|---|---|
| sub_2DDD850 | Node insertion into group table |
| sub_2DDDD70 | Resize/grow scheduling data |
| sub_2DDD530 | Scheduling iteration over groups |
| sub_2DDDAB0 | Node analysis (profitability check) |
| sub_2DDB710 | Data dependency edge creation |
| sub_2DDE490 | Grouping operation (merge groups) |
| sub_2DDBC50 | Constraint application |
| sub_2DDBBA0 | Constraint application (secondary) |
| sub_2DDBA80 | Finalize group (seal and emit) |
The grouper framework is designed to be reusable: by registering different callback tuples, the same framework can group surface loads, shared memory accesses, or other coalescing-friendly instruction patterns.
Scheduling Mode: The usedessa Knob
The usedessa knob (dword_4FD26A0, default 2) controls the scheduling pass pipeline configuration despite its name suggesting deSSA (de-Static Single Assignment) method selection. Pre-RA scheduling dispatches through sub_2165850; post-RA through sub_21668D0.
Mode 1 (simple): Pre-RA scheduling is skipped entirely. Post-RA runs only unk_4FCE24C (the post-RA scheduler). This minimal configuration is useful for debugging or when scheduling is harmful to performance.
Mode 2 (full, default): Pre-RA scheduling runs unk_4FC8A0C. Post-RA scheduling runs three passes sequentially:
1. unk_4FC8A0C -- pre-RA pass (disabled/noop in post-RA context).
2. unk_4FCE24C -- post-RA scheduler.
3. unk_4FC9D8C -- extra scheduling pass.
After scheduling completes, the framework prints "After Machine Scheduling", optionally runs sub_21F9D90, then runs unk_4FCAC8C and prints "After StackSlotColoring".
The "disabled" passes in mode 2 are registered but gated internally, allowing the framework to maintain a uniform pass list while selectively activating passes based on the current compilation phase.
Cross-Cutting Observations
Register pressure tracking appears in three distinct places within the scheduling infrastructure, each serving a different consumer:
| Tracker | Consumer | Update Frequency |
|---|---|---|
| MRPA incremental (sub_2E5A4E0) | MCSE decisions | Per instruction move/elimination |
| ScheduleDAGMILive (sub_355F610) | Scheduling decisions | Per scheduling region |
| MachinePipeliner stage tracking | II feasibility | Per pipeline stage |
All three maintain per-register-class pressure arrays but with different granularities. The MRPA tracker uses incremental delta updates for efficiency; the scheduler computes ASAP/ALAP bounds per region; the pipeliner tracks pressure per modulo stage.
The DenseMap hash function (ptr >> 9) ^ (ptr >> 4) is shared across both the 32-bit value variant (sub_1DFB9D0) and 64-bit value variant (sub_1DFB810), indicating a common template instantiation pattern consistent with LLVM's DenseMap<K, V> template.
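The recovered hash is easy to reproduce. The sketch below applies it to a power-of-two bucket mask in the way DenseMap does; the hashed address and the table capacity are illustrative values, not constants recovered from the binary:

```c
#include <assert.h>
#include <stdint.h>

/* Pointer hash shared by sub_1DFB9D0 (32-bit value) and sub_1DFB810
   (64-bit value): mixes mid-order bits and discards the low alignment
   bits that are identical for all heap pointers. */
static uint64_t ptr_hash(uint64_t ptr) {
    return (ptr >> 9) ^ (ptr >> 4);
}

/* Bucket index for a power-of-two capacity table (DenseMap convention). */
static uint64_t bucket_for(uint64_t ptr, uint64_t capacity) {
    return ptr_hash(ptr) & (capacity - 1);
}
```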
Contrast with ptxas scheduling: ptxas has its own instruction scheduling subsystem with 195 knobs (including scoreboard-aware scheduling via the AdvancedSB* family, SchedDisableAll, SchedForceReverseOrder, and the GemmPipeliner* family of 8 knobs for matrix multiply detection and pipelining). CICC's scheduling operates at the MachineInstr level before PTX emission; ptxas re-schedules at the SASS level after PTX assembly. The two scheduling layers are independent but complementary.
What Upstream LLVM Gets Wrong for GPU
Upstream LLVM's instruction scheduling framework was designed for CPU cores with out-of-order execution, branch prediction, and deep reorder buffers. On a GPU SM, these hardware features do not exist:
- Upstream assumes out-of-order hardware will hide scheduling mistakes. Modern CPUs have 200+ entry reorder buffers that dynamically reorder instructions, making compiler scheduling a second-order optimization. GPU SMs execute instructions in-order within each warp -- every scheduling decision is final. A poorly ordered instruction stream on GPU means stalls that no hardware can recover from.
- Upstream optimizes for pipeline hazards and port pressure. CPU schedulers model execution port contention (e.g., port 0 vs. port 1 on Intel), dispatch group rules, and pipeline bubble avoidance. GPU scheduling targets register pressure minimization (nvptx-sched4reg) because the SM's warp scheduler handles instruction-level parallelism through warp interleaving, not through instruction reordering within a single thread.
- Upstream assumes a single scheduling pass produces the final order. On CPU, LLVM's ScheduleDAGMILive emits the final instruction sequence. On NVPTX, cicc's scheduling is the first of two layers -- ptxas re-schedules the entire program at the SASS level with its own 195-knob subsystem (including scoreboard-aware scheduling via the AdvancedSB* family). CICC's scheduler optimizes for ptxas consumption, not for direct hardware execution.
- Upstream has no concept of texture instruction grouping. CPU scheduling never considers grouping memory operations for hardware coalescing units. NVIDIA adds a dedicated Texture Group Merge pass (sub_2DDE8C0, 74KB) that groups texture load instructions by base address for the hardware texture unit -- an entirely GPU-specific optimization absent from upstream.
- Upstream does not track register pressure incrementally during CSE. Upstream LLVM recomputes register pressure from scratch after each Machine CSE transform. NVIDIA's MRPA subsystem (sub_2E5A4E0, 48KB) maintains running pressure state through delta updates, because on GPU the pressure-to-occupancy relationship makes every CSE decision a potential occupancy cliff crossing that must be evaluated cheaply.
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Scheduling subsystems | ScheduleDAGMILive + optional MachinePipeliner; no incremental pressure tracker | Three distinct subsystems: MRPA incremental tracker, Swing Modulo Scheduler, ScheduleDAGMILive; plus texture group merge pass |
| MRPA (incremental pressure) | Not present; pressure recomputed from scratch after each CSE transform | sub_2E5A4E0 (48 KB) + backend variant sub_1E00370 (78 KB) maintain running pressure state through delta operations during MCSE |
| Texture group merge | No concept of texture instruction grouping | Dedicated pass (sub_2DDE8C0) groups texture load instructions for hardware coalescing; scheduling-adjacent optimization absent from upstream |
| Scheduling target | Optimize for hardware pipeline hazards and port pressure | Optimize the MachineInstr stream for ptxas consumption; focus on register pressure reduction (nvptx-sched4reg) rather than hardware pipeline timing |
| Two-level scheduling | Single scheduling pass produces final instruction order | CICC scheduling is first layer; ptxas re-schedules at SASS level with its own 195-knob subsystem |
| Register pressure model | Per-register-class pressure sets from TRI | Same model but with GPU occupancy awareness; pressure arrays used to detect occupancy cliff crossings |
| Scheduling mode switch | Configured at pipeline construction time | Runtime mode switch between pre-RA (sub_2165850) and post-RA (sub_21668D0) with different heuristic weights |
ptxas Interaction
cicc's instruction scheduling operates at the MachineInstr level and produces a PTX instruction order that is not final. ptxas re-schedules the entire program at the SASS level using its own 195-knob scheduling subsystem, including scoreboard-aware scheduling (AdvancedSB* family), the GemmPipeliner* family for matrix multiply detection and software pipelining, and SchedForceReverseOrder for debugging. cicc's scheduler therefore optimizes for ptxas consumption rather than direct hardware execution: its primary goal is minimizing register pressure (nvptx-sched4reg) so that ptxas starts from a low-pressure baseline. The two scheduling layers are independent but complementary -- cicc controls the virtual register count visible to ptxas, and ptxas maps the resulting instruction stream onto the SM's hardware pipeline with full knowledge of scoreboard latencies and functional unit availability.
LiveRangeCalc
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Based on LLVM 17.x LiveRangeCalc.cpp (the page's own diff table cites LLVM 17.x as baseline). NVIDIA adds dual-bitvector GP/predicate tracking, a small-function bypass (instruction count <= 15), an enlarged 296-byte segment structure with inlined SmallVectors, and a 4/5 active-block fraction not present in any upstream version.
LiveRangeCalc is the low-level engine inside LLVM's CodeGen that turns def/use information into live intervals -- contiguous [SlotIndex, SlotIndex) segments describing when each virtual register holds a value. It sits between the SlotIndexes numbering pass and the LiveIntervals analysis, performing the actual iterative dataflow computation that propagates liveness backward through the CFG and inserts PHI-def value numbers at merge points. In CICC v13.0 the implementation at sub_2FC4FC0 is structurally based on upstream LLVM's LiveRangeCalc::extend / calculateValues but carries several NVIDIA-specific modifications: a dual-bitvector tracking scheme that separates general-purpose and predicate register liveness, a small-function bypass that skips the full dataflow for trivial kernels, and an enlarged per-segment structure (296 bytes) that inlines four separate SmallVector buffers to avoid heap allocations on the hot path.
| Main entry | sub_2FC4FC0 (12,900 bytes, 78KB decompiled) |
| Stack frame | 504 bytes (0x1F8) |
| Callers | sub_2FC8470 (LiveIntervals::computeRegUnitRange), sub_2FC8230 (createDeadDef/addSegment), self-recursive |
| SlotIndexes pass | sub_1F10BF0 (11KB), registered as "slotindexes" / "Slot index numbering" |
| LiveIntervals analysis | pipeline entry "live-intervals" (analysis ID unk_4F96DB4) |
| Address range | 0x2FBF390 -- 0x2FC8470 (full LiveRangeCalc cluster) |
| Returns | bool -- whether any live range was extended |
SlotIndex Infrastructure
Before LiveRangeCalc can operate, every MachineInstr must have a SlotIndex -- a monotonically increasing integer that encodes both the instruction's position and a sub-slot discriminator (early-clobber, register, dead, etc.). The SlotIndexes pass at sub_1F10BF0 walks the MachineFunction and assigns these numbers. CICC's implementation matches upstream LLVM: each MachineBasicBlock owns a contiguous range [StartIdx, EndIdx), and the mapping from SlotIndex back to MachineBasicBlock* is maintained in a sorted array that supports binary search.
The sentinel values found in the binary confirm standard LLVM DenseMap usage:
| Sentinel | Value | Meaning |
|---|---|---|
| Empty key | 0xFFFFFFFFFFFFF000 | Slot has never been occupied |
| Tombstone | 0xFFFFFFFFFFFFE000 | Slot was occupied, then erased |
These appear throughout the segment hash table, the pending-def table, and the VNInfo chain, always as DenseMap<SlotIndex, ...> sentinels.
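A minimal sketch of how these two sentinels drive an open-addressed probe: a lookup terminates at an empty slot (definite miss) but must skip over tombstones, otherwise entries inserted after an erasure would become unreachable. The table layout below is illustrative, not the 296-byte segment entry:

```c
#include <assert.h>
#include <stdint.h>

#define EMPTY_KEY     0xFFFFFFFFFFFFF000ULL  /* slot never occupied */
#define TOMBSTONE_KEY 0xFFFFFFFFFFFFE000ULL  /* slot occupied, then erased */

/* Linear probe over a power-of-two table of keys. Returns the slot
   index on hit, -1 on miss. Tombstoned slots are probed past. */
static int lookup(const uint64_t *keys, int capacity, uint64_t key) {
    int idx = (int)(key & (uint64_t)(capacity - 1));
    for (int probes = 0; probes < capacity; probes++) {
        if (keys[idx] == key)       return idx;   /* hit */
        if (keys[idx] == EMPTY_KEY) return -1;    /* definite miss */
        idx = (idx + 1) & (capacity - 1);         /* skip tombstones/mismatches */
    }
    return -1;
}

/* Demo table: key 13 hashes to slot 1, probes past slot 2's tombstone,
   and stops at the empty slot 0 after checking slot 3. */
static uint64_t demo_table[4] = { EMPTY_KEY, 5, TOMBSTONE_KEY, 7 };
```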
Segment Structure Layout
Each live range segment in CICC is 296 bytes (0x128), substantially larger than upstream's LiveRange::Segment (which is 24 bytes). The inflation comes from four inlined SmallVector buffers that avoid separate heap allocations for the common case:
Segment (296 bytes / 0x128):
+0x00 u64 status / SlotIndex start (sentinel if free)
+0x08 ptr endpoint buffer (or inline at +0x18)
+0x18 [16] inline endpoint buffer
+0x28 additional metadata (segment flags, subrange info)
+0x50 ptr register mask buffer (or inline at +0x60)
+0x60 [56] inline register mask buffer
+0x98 ptr kill-set buffer (or inline at +0xA8)
+0xA8 [48] inline kill-set buffer
+0xD8 u32 kill count
+0xE0 ptr use-def chain buffer (or inline at +0xF0)
+0xF0 [48] inline use-def chain buffer
+0x120 u32 total instruction count covered
Each pointer field follows the LLVM SmallVector convention: if the pointer equals the address of the inline buffer immediately following it, the data lives inline; otherwise it points to a heap allocation. During cleanup (Phase 1 of the algorithm), each segment's four buffers are freed individually before the segment is marked with the empty sentinel.
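The inline-or-heap test described above reduces to a single pointer comparison. This is a sketch of the convention with invented field names mirroring the +0x08/+0x18 endpoint pair, not the recovered structure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* LLVM SmallVector convention: the data pointer either aims at the
   inline buffer that immediately follows it (no heap allocation) or
   at a separately malloc'd block. */
typedef struct {
    uint64_t *data;          /* analogous to the +0x08 pointer field */
    uint64_t  inline_buf[2]; /* analogous to the +0x18 inline buffer */
} SmallBuf;

static bool is_inline(const SmallBuf *b) {
    return b->data == b->inline_buf;
}

/* Cleanup mirrors Phase 1: free only if the data spilled to the heap,
   then reset to the inline buffer. */
static void small_buf_free(SmallBuf *b) {
    if (!is_inline(b))
        free(b->data);
    b->data = b->inline_buf;
}

/* Exercise both cases: start inline, spill to heap, clean up. */
static bool demo_inline_then_heap(void) {
    SmallBuf b;
    b.data = b.inline_buf;                       /* inline: nothing to free */
    bool was_inline = is_inline(&b);
    b.data = malloc(4 * sizeof(uint64_t));       /* spilled to heap */
    bool spilled = !is_inline(&b);
    small_buf_free(&b);                          /* frees heap, resets inline */
    return was_inline && spilled && is_inline(&b);
}
```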
VNInfo Structure
Value numbers are tracked via 120-byte (0x78) VNInfo nodes, allocated from a bump-pointer allocator at [this+0x4A0]:
VNInfo (120 bytes / 0x78):
+0x00 ptr endpoint buffer (inline at +0x10)
+0x08 u64 capacity (initial: 0x200000000 = inline cap 2)
+0x10 [48] inline endpoint buffer
+0x40 ptr kill-set buffer (inline at +0x50)
+0x48 u64 capacity for kill-set
+0x60 ptr sub-chain pointer (phi resolution)
+0x68 ptr sub-chain pointer 2
+0x70 u32 block number
+0x74 u32 value number (initially unassigned)
The allocator is a classic bump allocator: a cursor at [this+0x4A0] advances by 0x10 per allocation, checked against capacity at [this+0x448]. When the arena fills, a slow-path reallocation grows the backing store. Deallocation chains through sub_2FBF390, which walks sub-chains and calls free with size 0x38 (56 bytes) per intermediate node and 0x78 (120 bytes) for the VNInfo itself.
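The fast path of such a bump allocator is a cursor increment plus a capacity check. The sketch below illustrates that fast path with invented names; the slow-path regrow is stubbed out, and this is not the recovered implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bump allocator sketch: cursor (cf. [this+0x4A0]) advances by a fixed
   0x10 per allocation and is checked against capacity (cf. [this+0x448]). */
typedef struct {
    uint8_t *arena;
    size_t   cursor;
    size_t   capacity;
} BumpAlloc;

static void *bump_alloc(BumpAlloc *a) {
    if (a->cursor + 0x10 > a->capacity)
        return NULL;            /* the real code regrows the arena here */
    void *p = a->arena + a->cursor;
    a->cursor += 0x10;          /* advance by 0x10 per allocation */
    return p;
}

/* Two consecutive allocations land exactly 0x10 bytes apart. */
static ptrdiff_t demo_two_allocs(void) {
    static uint8_t mem[0x40];
    BumpAlloc a = { mem, 0, sizeof mem };
    uint8_t *p1 = bump_alloc(&a);
    uint8_t *p2 = bump_alloc(&a);
    return p2 - p1;
}
```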
Algorithm
The computation in sub_2FC4FC0 proceeds in eight phases. It is self-recursive: when iterative refinement discovers new work, the function calls itself to converge.
Phase 1 -- Initialization and Cleanup (0x2FC4FC0 -- 0x2FC50C2)
Links the SlotIndex base ([rdi] = [rsi+0x30]), increments the iteration counter at [this+0x10], and walks the existing segment table (stride 0x128) freeing stale entries. Segments marked with the empty sentinel (0xFFFFFFFFFFFFF000) are skipped; tombstoned entries (0xFFFFFFFFFFFFE000) and live entries both have their four internal buffers freed and are then marked empty.
The cleanup loop at 0x2FC5040--0x2FC50AE iterates with stride 0x128 over the segment array beginning at rbx. For each entry it checks [rbx+0x00] against both sentinels. If the entry is live or tombstoned, it frees four inlined SmallVector buffers in reverse allocation order:
- [rbx+0xE0] -- use-def chain buffer (freed if pointer differs from inline region at rbx+0xF0).
- [rbx+0x98] -- kill-set buffer (freed if pointer differs from inline region at rbx+0xA8).
- [rbx+0x50] -- register mask buffer (freed if pointer differs from inline region at rbx+0x60).
- [rbx+0x08] -- segment endpoint buffer (freed if pointer differs from inline region at rbx+0x18).
After freeing, the entry is stamped with the empty sentinel: mov qword [rbx], 0xFFFFFFFFFFFFF000. The old segment count stored at [rdi+0x20] is loaded into r15d at entry and used to bound the cleanup iteration.
Phase 2 -- Auxiliary Table Cleanup (0x2FC50C2 -- 0x2FC52A3)
Resets the old segment count, increments the auxiliary sequence counter, and walks three secondary tables:
- Pending-def table at [this+0x40] (16-byte stride): cleared with empty sentinels.
- VNInfo chain at [this+0xA0]: walked back-to-front, freeing each node through sub_2E0AFD0 (getRegInfo) and sub_2FBF390. The walk reads the count from [r13+0xA8], loads each entry at [r12-8], and decrements r12. For each VNInfo: frees sub-chains via sub_2FBF390 (size 0x38 = 56 bytes per intermediate node), then frees the VNInfo itself (size 0x78 = 120 bytes) via j_j___libc_free_0.
- Auxiliary tables at offsets 0x130 (48-byte stride) and 0x480 (16-byte stride): freed/resized via sub_C7D6A0 (realloc).
- Checks [r13+0x458] for additional pending work from a previous iteration.
Phase 3 -- Block Count and Threshold Check (0x2FC52A3 -- 0x2FC53F4)
Computes the active block count from the MBB array: active = (total_blocks * 4/5) - dead_block_count. The * 4/5 fraction is computed via the classic imul 0xCCCCCCCD trick for unsigned division by 5 on x86. If the result is zero, the function returns immediately.
The precise x86 idiom:
mov rax, [rdx+10h]
sub rax, [rdx+8] ; pointer diff on MBB array
sar rax, 3 ; divide by sizeof(pointer) = 8
imul eax, 0xCCCCCCCD ; unsigned multiply by magic constant
shr eax, 2 ; result = total_blocks * 4 / 5 (rounded down)
sub eax, [rdx+20h] ; subtract dead_block_count
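The decompiled listing abbreviates the full 64-bit product, but the underlying reciprocal identity can be checked directly: 0xCCCCCCCD = ceil(2^34 / 5), so the high part of the 64-bit product implements exact unsigned division by 5 for any 32-bit input. Function names below are ours:

```c
#include <assert.h>
#include <stdint.h>

/* Exact n / 5 for any 32-bit n: multiply by the rounded-up fixed-point
   reciprocal ceil(2^34 / 5) = 0xCCCCCCCD and shift the product right 34. */
static uint32_t div5(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDULL) >> 34);
}

/* The Phase 3 quantity total_blocks * 4 / 5, using the same reciprocal:
   since 4/5 = 4 * (1/5), shifting by 32 instead of 34 folds in the *4. */
static uint32_t four_fifths(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDULL) >> 32);
}
```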
Two bitvectors are allocated on the stack for the live-in set. Initial inline capacity is 8 words (512 registers); if the block count exceeds 8, SmallVector::grow at sub_C8D5F0 expands them. The pre-allocated capacity at [r13+0xAC] is also checked; if insufficient, sub_2FC1040 (grow per-block segment table) is called.
Small-function bypass: If the total instruction count is 15 or fewer, OR the block count is 1 or fewer, OR the global flag qword_5025F68 is set (-Ofast-compile mode [LOW confidence] -- the flag triggers a compile-time shortcut consistent with a fast-compile option, but no string or CLI mapping for this global has been recovered; it could also be a debug-only override or an internal tuning knob), the function skips the full dataflow and returns early. This is an NVIDIA addition not present in upstream LLVM -- it avoids the quadratic cost of bitvector dataflow on trivial kernel bodies where liveness is obvious from local analysis alone.
Phase 4 -- Per-Block Segment Allocation (0x2FC538D -- 0x2FC55E7)
Calls sub_2FC1A70 (ensureCapacity) to prepare per-block storage, then loops over all non-dead blocks summing instruction counts. For each block:
- Allocates a 120-byte VNInfo via the bump allocator (sub_22077B0). If allocation fails, jumps to the error path at 0x2FC7E1C.
- Initializes inline buffers with capacity markers (0x200000000 -- encodes inline capacity 2 in the high 32 bits with size 0 in the low 32 bits, the standard LLVM SmallVector representation).
- Sets [vn+0x00] = pointer to inline endpoint buffer (rax+0x10), [vn+0x40] = pointer to inline kill-set buffer (rax+0x50).
- Clears sub-chain pointers: [vn+0x60] = 0, [vn+0x68] = 0.
- Records the block number at [vn+0x70] = ebx and clears the value number [vn+0x74] = 0.
- Advances the bump-pointer allocator at [r14+0x4A0] by 0x10 to allocate a "pending use" object. The allocator checks against capacity at [r14+0x448] and falls back to a slow-path reallocation when the arena fills.
- Inserts the VNInfo into the [this+0xA0] vector (grows if needed via sub_C7D6A0).
- Registers the block number in the [this+0xC0] map (grows if needed).
- Frees the old VNInfo if it was a placeholder from a previous iteration.
Phase 5 -- Liveness Propagation via Bitvector Dataflow (0x2FC5656 -- 0x2FC5CC6)
This is the core computation -- a standard backward-dataflow fixed-point iteration, operating on 64-bit word bitvectors. It implements the classic liveness equation:
LiveIn(B) = (LiveOut(B) \ Kill(B)) | Def(B)
LiveOut(B) = Union over all successors S of LiveIn(S)
The iteration continues until no bitvector word changes across a complete pass over all pending blocks. The changed flag (var_1B0 on the stack) is cleared at the top of each outer iteration and set whenever any bitvector word is modified.
Detailed dataflow pseudocode
// Phase 5 reconstructed from sub_2FC4FC0 at 0x2FC5656--0x2FC5CC6
//
// State:
// segment_table[] -- hash table, stride 0x128, keyed by block ID
// .gp_bv (+0x98) -- general-purpose register bitvector (live set)
// .pred_bv (+0xE0) -- predicate register bitvector (live set)
// .kill_set(+0xA8) -- inline kill-set buffer
// .kill_cnt(+0xD8) -- number of killed registers
// .def_bv (+0x08) -- def-set bitvector
// worklist -- pending blocks at [r13+0x50]
// bv_words -- number of 64-bit words = ceil(num_regs / 64)
// changed -- var_1B0 on stack
fn liveness_propagation(this: &mut LiveRangeCalc) -> bool {
let bv_words: usize = (this.num_regs + 63) / 64;
loop {
let mut changed: bool = false;
for block in this.worklist.iter() {
// --- Step 1: Hash lookup for block's segment entry ---
// Hash function: h = ((block.id >> 4) ^ (block.id >> 9))
// & (capacity - 1)
// Linear probing until key match or empty sentinel
let entry = this.segment_table.lookup(block.id);
// --- Step 2: Accumulate kill bitvector from kill set ---
// The kill set at entry.kill_set contains register IDs
// that are killed (last-use) within this block.
// For each killed register, look up its own segment entry
// and OR its kill bitvector into a local accumulator.
let mut kill_accum: [u64; bv_words] = [0; bv_words];
for i in 0..entry.kill_cnt {
let killed_reg = entry.kill_set[i];
let kill_entry = this.segment_table.lookup(killed_reg);
// x86: OR [kill_accum + rdx*8], [kill_entry.kill_bv + rdx*8]
for w in 0..bv_words {
kill_accum[w] |= kill_entry.gp_bv[w];
}
}
// --- Step 3: Compute live_in for general-purpose registers ---
// Standard backward dataflow: live_in = (live_out & ~kills) | defs
// live_out is the current content of entry.gp_bv (propagated
// from successors in previous iterations or initialization)
let mut src: [u64; bv_words];
for w in 0..bv_words {
// x86: rax = NOT [kill_accum + w*8]
// rax = AND rax, [entry.gp_bv + w*8] -- live_out & ~kills
// rax = OR rax, [entry.def_bv + w*8] -- | defs
src[w] = (entry.gp_bv[w] & !kill_accum[w]) | entry.def_bv[w];
}
// Boundary mask: clear unused high bits in last word
// x86: ecx = num_regs & 63
// shl rdx, cl; not rdx; and [src + (bv_words-1)*8], rdx
if this.num_regs % 64 != 0 {
let tail_bits = this.num_regs % 64;
let mask = (1u64 << tail_bits) - 1;
src[bv_words - 1] &= mask;
}
// --- Step 4: Interference check against allocated set ---
// Compares computed live_in against the segment's "allocated"
// bitvector at +0x98. Any bit set in src but NOT in allocated
// indicates a new live register that extends the range.
// x86 at 0x2FC5B86:
// rax = NOT [entry.gp_bv + rdx*8] -- ~allocated
// rax = AND rax, [src + rdx*8] -- new bits
// test rax, rax / jnz -> extend
for w in 0..bv_words {
let new_bits = src[w] & !entry.gp_bv[w];
if new_bits != 0 {
entry.gp_bv[w] |= src[w]; // extend coverage
changed = true;
}
}
// --- Step 5: Repeat identically for predicate register bv ---
// The predicate bitvector at entry offset +0xE0 is processed
// with exactly the same kill-accumulate / dataflow / interference
// sequence. Predicate registers (%p0, %p1, ...) occupy a
// physically separate register file in NVPTX hardware, so they
// get their own independent bitvector to avoid inflating the
// interference graph of the main register namespace.
// [identical loop over pred_bv words omitted for brevity]
} // end for each block
if !changed {
break; // Fixed point reached
}
// Otherwise: var_1B0 was set to 1, loop back to top
}
}
Convergence criteria
The fixed-point iteration terminates when a complete pass over all pending blocks produces no change to any bitvector word. Formally, convergence is guaranteed because:
- Monotonicity. Each bitvector word can only gain bits (the |= operation in the interference-check step is monotone). Bits are never cleared during the iteration.
- Finite lattice. The bitvector domain is a finite lattice of height num_regs. Each word can change at most 64 times (once per bit), so the total number of changes across all words and all blocks is bounded by N * W * 64 where N = block count and W = bitvector width in words.
- Worst-case iterations. In practice, the iteration converges in O(D) passes where D = maximum loop nesting depth of the CFG. Each pass propagates liveness information one level deeper through nested loops. The theoretical worst case is N iterations for a pathological CFG with a chain of N blocks each feeding into the next, but CUDA kernels rarely exhibit such structure.
The changed flag (var_1B0) is a single byte on the stack. It is zeroed with mov byte [rbp+var_1B0], 0 at the top of each outer iteration and set with mov byte [rbp+var_1B0], 1 whenever the interference check finds new bits. The outer do { ... } while (changed) loop tests this byte at 0x2FC5CC0 with cmp byte [rbp+var_1B0], 0; jne back to the loop head at 0x2FC5656.
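The monotone fixed point can be exercised on a toy CFG. This miniature is our own construction (a 3-block chain with invented def/kill sets, one bitvector word per block), applying the same live_in = (live_out & ~kill) | def equation and changed-flag loop structure:

```c
#include <assert.h>
#include <stdint.h>

enum { NBLOCKS = 3 };   /* straight-line chain B0 -> B1 -> B2 */

/* Backward dataflow to a fixed point, one 64-bit word per block:
   defs seed liveness backward, kills terminate it. */
static void solve(const uint64_t *def, const uint64_t *kill,
                  uint64_t *live_in, uint64_t *live_out) {
    int changed = 1;
    while (changed) {                /* cf. the var_1B0 outer loop */
        changed = 0;
        for (int b = NBLOCKS - 1; b >= 0; b--) {
            /* live_out(B) = union of successors' live_in */
            uint64_t out = (b + 1 < NBLOCKS) ? live_in[b + 1] : 0;
            uint64_t in  = (out & ~kill[b]) | def[b];
            if (in != live_in[b] || out != live_out[b]) {
                live_in[b]  = in;    /* monotone: bits only gained */
                live_out[b] = out;
                changed = 1;
            }
        }
    }
}

/* r0 = bit 0, r1 = bit 1. B2 generates r0; B1 kills r0 and generates
   r1; B0 kills r1. So only r1 (bit 1) is live into B1. */
static uint64_t demo_live_in_of_b1(void) {
    uint64_t def[NBLOCKS]  = { 0, 2, 1 };
    uint64_t kill[NBLOCKS] = { 2, 1, 0 };
    uint64_t in[NBLOCKS] = {0}, out[NBLOCKS] = {0};
    solve(def, kill, in, out);
    return in[1];
}
```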
Kill and Def computation
The kill and def sets are not computed inside sub_2FC4FC0 itself. They are pre-populated by callers before invoking the dataflow engine:
- Kill set (+0xA8 inline buffer, count at +0xD8): Populated by sub_2FC8470 (LiveIntervals::computeRegUnitRange), which walks each MachineBasicBlock's instruction list. A register is added to the kill set when an instruction has a use operand that is the last use before the next def (or end of block). The kill set is stored as a flat array of register IDs, not a bitvector -- the dataflow loop then expands it into a bitvector accumulator by looking up each killed register in the hash table.
- Def set (+0x08 endpoint buffer): Populated by the same caller. A register is added when a MachineInstr defines it (operand flag isDef). For NVPTX, since all registers are virtual, every def creates a fresh value number. The def set is stored as a bitvector where bit i is set if virtual register i is defined in the block.
- Initial live-out (+0x98 for GP, +0xE0 for predicate): Initialized to the empty set for all blocks. The dataflow iteration propagates liveness backward: when a use is found in a successor block with no preceding def, the register becomes live-out in the current block. The first iteration seeds liveness from the use/def information; subsequent iterations propagate it through the CFG.
This separation means the hash table must be fully populated with per-block kill and def information before sub_2FC4FC0 enters Phase 5. The hash table at sub_2FC0880 supports insert, lookup, and resize operations with open addressing.
Bitvector word-at-a-time implementation
All bitvector operations operate on 64-bit words with standard x86-64 bitwise instructions:
| Operation | x86 pattern | Semantics |
|---|---|---|
| Union (OR) | or [rdx+rax*8], rcx | bv[w] \|= src[w] |
| Difference (AND-NOT) | mov rax, [rsi+rdx*8]; not rax; and rax, [rdi+rdx*8] | new = src[w] & ~allocated[w] |
| Boundary mask | mov ecx, count_mod_64; mov rdx, -1; shl rdx, cl; not rdx; and [ptr+last_word], rdx | Clear unused high bits |
| Zero test | test rax, rax; jnz target | Any bit set? |
The boundary mask is critical for correctness: without it, garbage bits in the padding region of the last word would create phantom interference. The mask is computed once per iteration entry and applied after every live-in computation. The instruction sequence shl rdx, cl; not rdx creates a mask with count % 64 low bits set and the rest cleared.
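In C the shl/not idiom is a one-liner; the sketch below reproduces it (valid, as in the binary, only for count % 64 in 1..63 -- the pseudocode guards the count % 64 == 0 case separately):

```c
#include <assert.h>
#include <stdint.h>

/* x86 idiom: rdx = -1; shl rdx, cl; not rdx
   -> mask with the low (count % 64) bits set, used to clear garbage
   bits in the padding region of the last bitvector word. */
static uint64_t tail_mask(unsigned count_mod_64) {
    /* caller guarantees 1 <= count_mod_64 <= 63 (shift by 64 is UB) */
    return ~(~0ULL << count_mod_64);
}
```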
Hash table for segment lookup
The segment hash table (sub_2FC0880) uses the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192) and an entry stride of 0x128 (296 bytes), matching the full segment structure size. See Hash Table and Collection Infrastructure for the hash function, probing, and growth policy.
During the dataflow iteration, each block requires two hash lookups per killed register (one for the block entry, one for each killed register's entry), so the total hash table traffic per iteration is O(N * K_max) where K_max is the maximum kill-set size across all blocks. Since NVPTX virtual register counts are typically in the hundreds (and the post-allocation register count is bounded by -maxreg, default 70), the hash table remains small and cache-friendly.
Phase 6 -- PHI Value Resolution (0x2FC5ED8 -- 0x2FC5F95)
After the dataflow converges, resolves PHI-def values at block boundaries. For each block, walks the predecessor chain at [block+0x30] and calls sub_2FBF8B0 (resolvePhiValue / findReachingDef) with four arguments: the LiveRangeCalc*, predecessor MBB, current bitvector, and a stack-allocated phi resolution buffer. This is the same algorithm as upstream LiveRangeCalc::updateSSA -- it propagates live-out values down the dominator tree and inserts PHI-def VNInfo nodes where multiple values reach a merge point.
The var_181 byte is initialized to 0 before each block as a "phi_resolved" flag. If sub_2FBF8B0 returns true, control jumps to 0x2FC710C for phi merge handling -- this path allocates a new VNInfo, links it into the sub-chain at [vn+0x60]/[vn+0x68], and updates the block's value number at [vn+0x74]. The temporary phi resolution buffer is freed after each block regardless of the outcome.
Phase 7 -- Segment Endpoint Fixup (0x2FC5FA8 -- 0x2FC6021)
For each word in the destination bitvector that has bits set (masked with 0xFFFFFFFFFFFFFFF8 to skip low tag bits), looks up the block's SlotIndex via [r14+0x18] shifted and indexed into the SlotIndex table at [rcx+0x98], retrieves the segment's use-def chain at [rdi+0x40], and calls sub_2E0F080 (addSegment / extendInBlock) to materialize the [start, end) segment in the LiveRange object. After processing all pending blocks, advances to the next MBB in the linked list via [r14+8], continuing until hitting the sentinel at [rbp+var_1F0].
Phase 8 -- Finalization and Return (0x2FC5974 -- 0x2FC59E6)
If no interference was found across all iterations, frees pending blocks from the [this+0x4A8] array (via sub_2E88E20), sets the pending count to zero ([r13+0x4B0] = 0), frees any dynamically-allocated bitvectors, and returns bool indicating whether any live range was extended. The return value is derived from var_1F0 = (count != 0).
Dual Bitvector Tracking
The most significant NVIDIA-specific modification is maintaining two independent bitvectors per segment:
| Offset | Register class | Purpose |
|---|---|---|
| +0x98 | General-purpose registers | %r, %rd, %f, %fd, %h, %fh liveness |
| +0xE0 | Predicate registers | %p liveness |
Both bitvectors are processed by identical code paths in Phase 5, but independently -- kills in one class do not affect the other. This separation reflects NVPTX's hardware architecture where predicate registers occupy a physically separate register file from data registers. Upstream LLVM's LiveRangeCalc handles all register classes through a single unified mechanism; CICC's split avoids interference-graph inflation by keeping the small predicate namespace out of the main bitvector.
The two bitvectors are processed sequentially within the same iteration body (not in separate passes). For each pending block, the general-purpose bitvector at +0x98 is processed first, then the predicate bitvector at +0xE0 is processed with structurally identical code. The changed flag is shared between both -- a change in either bitvector triggers another iteration of the outer loop. This means the predicate register dataflow rides for free on the same convergence pass, and the two bitvectors converge simultaneously.
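The shared-flag scheme means neither class converges "early" on its own: a change to either vector re-runs the whole body. A single-word sketch of that structure (names ours, the real code loops over W words per class):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* GP (+0x98) and predicate (+0xE0) live sets, one word each here. */
typedef struct {
    uint64_t gp;
    uint64_t pred;
} LiveSets;

/* OR new bits into one class's vector; report whether anything grew
   (the interference check: src & ~allocated). */
static bool extend(uint64_t *bv, uint64_t src) {
    uint64_t new_bits = src & ~*bv;
    *bv |= src;
    return new_bits != 0;
}

/* Both classes update inside the same iteration body; one shared
   changed flag drives the outer loop. Returns the pass count. */
static int converge(LiveSets *s, uint64_t gp_src, uint64_t pred_src) {
    int passes = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        changed |= extend(&s->gp, gp_src);
        changed |= extend(&s->pred, pred_src);
        passes++;
    }
    return passes;
}

/* One pass to absorb the new bits, one quiet pass to confirm. */
static int demo_passes(void) {
    LiveSets s = { 0, 0 };
    return converge(&s, 0x5, 0x1);
}
```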
The register coalescer at sub_34A46B0 also maintains a bitvector-per-block structure (a 12,336-byte stack buffer v90[12336] at offset 0x270 used as a bitmap for tracking live-through blocks during range rebuild after coalescing). That coalescer bitvector feeds updated information back into the LiveRangeCalc segment table when live intervals are modified by register coalescing.
Differences from Upstream LLVM
CICC v13.0's LiveRangeCalc diverges from upstream LLVM LiveRangeCalc (as of LLVM 17.x) in these specific ways:
- Dual bitvector tracking. Upstream uses a single mechanism for all register classes. CICC splits GP and predicate into independent bitvectors to exploit the physical separation in NVPTX hardware.
- Small-function bypass. The instruction-count threshold of 15 and the block-count threshold of 1 are NVIDIA additions. Upstream always runs the full dataflow. This optimization is significant because CUDA kernels frequently contain tiny __device__ helper functions that are inlined by the optimizer.
- Global fast-compile flag. The qword_5025F68 check that bypasses the entire dataflow loop has no upstream equivalent. It is likely tied to the -Ofast-compile or -O0 optimization level in cicc.
- Enlarged segment structure. Upstream's LiveRange::Segment is 24 bytes (start SlotIndex, end SlotIndex, VNInfo pointer). CICC's segment is 296 bytes (0x128), inlining four SmallVector buffers to avoid heap allocations on the hot path. This is a performance optimization for the common case where segments have small kill sets and few endpoints.
- Active-block fraction. The * 4/5 computation in Phase 3 (via imul 0xCCCCCCCD) to determine the active block count is not present in upstream. Upstream counts all blocks equally. CICC discounts approximately 20% of blocks, likely accounting for unreachable or dead blocks that StructurizeCFG may have created but not yet eliminated.
- PhysReg parameter always zero. Upstream's findReachingDefs takes a Register PhysReg parameter for physical register interference. Since NVPTX has no physical registers (all registers are virtual and hardware-mapped at launch time), this parameter is always Register() (zero). The binary confirms: sub_2E0FDD0 (isAllocatable) is called but its return value never gates segment creation.
GPU-Specific Considerations
Virtual-only register file. NVPTX has no physical registers in the LLVM sense -- all registers are virtual (%r0, %f0, %p0, ...) and the hardware thread scheduler maps them at launch time. This means LiveRangeCalc never needs to handle physical register liveness, live-in lists for calling conventions, or register unit interference. The PhysReg parameter in upstream's findReachingDefs is always Register() (zero). The binary confirms this: sub_2E0FDD0 (isAllocatable / reserved register check) is called but its return value is never used to gate segment creation.
Pressure-driven analysis. The live intervals produced by LiveRangeCalc feed directly into the greedy register allocator's interference cache (at selectOrSplit offset +648). Since NVPTX allocation is pressure-driven rather than assignment-driven, the intervals primarily serve to detect which virtual registers are simultaneously live, not to assign physical registers. The total count of simultaneously-live intervals at any program point determines the register pressure, which the allocator compares against the -maxreg limit (default 70).
Small-kernel bypass. The threshold check in Phase 3 (instruction count <= 15 OR block count <= 1) is absent from upstream LLVM. CUDA kernels frequently contain tiny helper device functions that are inlined into the caller; computing full dataflow liveness for a 10-instruction single-block function is pure overhead. The bypass returns immediately, letting the register allocator fall back to local analysis.
Configuration
| Knob | Default | Effect |
|---|---|---|
| early-live-intervals | false | Runs LiveIntervals analysis earlier in the pipeline, before the standard scheduling pass |
| join-liveintervals | true | Master enable for register coalescing over live intervals |
| qword_5025F68 (global flag) | 0 | When nonzero (likely -Ofast-compile), skips the full dataflow loop entirely |
The instruction-count threshold of 15 and the block-count threshold of 1 are hardcoded constants, not configurable via LLVM cl::opt flags.
LiveRangeCalc Object Layout
The LiveRangeCalc object (this pointer passed in rdi) is reconstructed from register offsets observed throughout sub_2FC4FC0:
LiveRangeCalc (approx 0x4C0 bytes):
+0x00 ptr SlotIndex base (set from [rsi+0x30] in Phase 1)
+0x08 ptr VNInfo* / MBB* parameter (set from rsi in Phase 1)
+0x10 u32 iteration counter (incremented each call)
+0x14 u32 (padding / alignment)
+0x20 u32 old segment count (r15d loaded in Phase 1)
+0x30 u32 auxiliary sequence counter (incremented in Phase 2)
+0x40 ptr pending-def table (16-byte stride)
+0x50 ptr worklist (pending blocks array)
+0xA0 ptr VNInfo chain (vector of VNInfo*)
+0xA8 u64 VNInfo chain count
+0xAC u32 pre-allocated capacity for per-block segment table
+0xC0 ptr block-number-to-VNInfo map
+0x130 ptr auxiliary table (48-byte stride)
+0x440 ptr bump allocator arena base
+0x448 u64 bump allocator capacity
+0x458 ptr additional pending work (checked in Phase 2)
+0x480 ptr secondary auxiliary table (16-byte stride)
+0x4A0 ptr bump allocator cursor (advances by 0x10 per allocation)
+0x4A8 ptr pending-blocks array (freed in Phase 8)
+0x4B0 u64 pending block count (zeroed in Phase 8)
Complexity
- Per iteration: O(N * W) where N = number of basic blocks and W = bitvector width in words (ceil(num_regs / 64)). Both GP and predicate bitvectors are processed per iteration, so the actual cost is O(N * (W_gp + W_pred)), but since predicate register counts are small (typically < 64, fitting in a single word), the predicate contribution is O(N).
- Kill-set expansion per iteration: O(N * K_max * W) where K_max = maximum kill-set size per block. For each of the N blocks, up to K_max hash lookups and W-word OR operations are performed.
- Convergence: Typically O(D) iterations where D = maximum loop nesting depth. The monotonicity of the OR-based bitvector union guarantees termination. Worst case is O(N) iterations for a pathological single-predecessor chain, but CUDA kernels (especially after StructurizeCFG) have bounded nesting depth.
- Total: O(N * W * D) for the core liveness computation, plus O(N * K_max * W * D) for kill-set expansion.
- Hash table operations: O(1) amortized per lookup. Load factor is maintained below 75% by the DenseMap rehash policy.
- Memory: O(N * W) for bitvectors + O(S * 296) for the segment table where S = number of live segments + O(V * 120) for VNInfo nodes where V = number of value numbers.
- Phase 1 cleanup: O(S_old) where S_old = segment count from previous iteration. Each segment requires checking four buffer pointers and potentially freeing four allocations.
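The O(N * W * D) bound above falls out of the standard OR-based backward dataflow shape. As a minimal sketch (not CICC code -- block ids and register sets here are invented for the example), the fixpoint over 64-bit-word bitvectors looks like:

```python
W = 64  # bits per bitvector word

def liveness(blocks, succs, use, defs, nregs):
    """blocks: list of block ids; succs: id -> successor ids;
    use/defs: id -> set of register numbers. Returns live-in bitvectors,
    one list of (nregs+63)//64 words per block."""
    words = (nregs + W - 1) // W

    def to_bv(s):
        bv = [0] * words
        for r in s:
            bv[r // W] |= 1 << (r % W)
        return bv

    live_in = {b: [0] * words for b in blocks}
    use_bv = {b: to_bv(use[b]) for b in blocks}
    def_bv = {b: to_bv(defs[b]) for b in blocks}
    changed = True
    while changed:                     # typically O(D) iterations
        changed = False
        for b in reversed(blocks):     # reverse order speeds convergence
            live_out = [0] * words
            for s in succs[b]:
                for i in range(words):
                    live_out[i] |= live_in[s][i]   # monotone OR union
            new_in = [use_bv[b][i] | (live_out[i] & ~def_bv[b][i])
                      for i in range(words)]
            if new_in != live_in[b]:
                live_in[b] = new_in
                changed = True
    return live_in
```

The OR union only ever sets bits, which is the monotonicity argument for termination cited above: each bitvector can change at most nregs times.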
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| LiveRangeCalc::extend / calculateValues -- main entry, self-recursive (12,900 bytes, 78KB decompiled) | sub_2FC4FC0 | -- | -- |
| LiveIntervals::computeRegUnitRange (caller, populates kill/def sets) | sub_2FC8470 | -- | -- |
| LiveIntervals::createDeadDef / addSegment (caller) | sub_2FC8230 | -- | -- |
| ensureCapacity / resetLiveRanges (per-block storage preparation) | sub_2FC1A70 | -- | -- |
| grow per-block segment table (called when [r13+0xAC] insufficient) | sub_2FC1040 | -- | -- |
| interval building helper (called from sub_2FC1040) | sub_2FC1190 | -- | -- |
| hash table operations: insert/lookup/resize with open addressing | sub_2FC0880 | -- | -- |
| segment creation / initialization (296-byte struct setup) | sub_2FC0040 | -- | -- |
| resolvePhiValue / findReachingDef (PHI resolution, 4 args) | sub_2FBF8B0 | -- | -- |
| free VNInfo chain (frees 0x38-byte intermediate nodes, 0x78-byte VNInfo) | sub_2FBF390 | -- | -- |
| segment merge / extend (interference update) | sub_2FBFCC0 | -- | -- |
| live range query | sub_2FC3C20 | -- | -- |
| live range intersection test | sub_2FC3A50 | -- | -- |
| getRegInfo / MachineRegisterInfo query | sub_2E0AFD0 | -- | -- |
| isAllocatable / reserved register check (return value unused in NVPTX) | sub_2E0FDD0 | -- | -- |
| addSegment / extendInBlock (materializes [start, end) segments) | sub_2E0F080 | -- | -- |
| MachineFunction helper | sub_2E76F70 | -- | -- |
| eraseFromParent (MachineInstr deletion, used in Phase 8 cleanup) | sub_2E88E20 | -- | -- |
| register property check (called with flags 0x80000, 0x100000) | sub_2E88A90 | -- | -- |
| operator new (VNInfo allocation, 120 bytes) | sub_22077B0 | -- | -- |
| SlotIndexes::runOnMachineFunction (11KB) | sub_1F10BF0 | -- | -- |
| SlotIndexes pass registration ("slotindexes" / "Slot index numbering") | sub_1F10320 | -- | -- |
| SlotIndexes insertion / repair (13KB) | sub_1F112A0 | -- | -- |
| SlotIndex validity check (string: "invalid") | sub_1F10810 | -- | -- |
| computeLiveIntervals (RA integration, called from greedy RA init) | sub_2F54D60 | -- | -- |
| SmallVector::grow (bitvector expansion when block count > 8) | sub_C8D5F0 | -- | -- |
| realloc (SmallVector resize / auxiliary table resize) | sub_C7D6A0 | -- | -- |
| malloc (new allocation) | sub_C7D670 | -- | -- |
Cross-References
- Register Allocation -- consumes live intervals to drive the pressure-based greedy allocator
- Register Coalescing -- merges live ranges of copy-connected virtual registers; runs before RA, feeds updated intervals back through LiveRangeCalc
- Instruction Scheduling -- the SlotIndexes numbering assigned here is consumed during post-RA scheduling for latency-aware reordering
- SelectionDAG -- produces the initial MachineInstr stream that SlotIndexes numbers
Register Coalescing
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Register coalescing in CICC v13.0 eliminates redundant copy instructions by merging the live ranges of their source and destination virtual registers. NVPTX's unlimited virtual register model (PTX has no fixed physical register file) changes the purpose of coalescing compared to CPU targets: rather than reducing physical register pressure to avoid spills, the goal is strictly copy elimination -- fewer mov instructions in the emitted PTX, which in turn gives ptxas a cleaner input with fewer live-range constraints to resolve during its own physical allocation. CICC runs two coalescing passes in sequence: the standard LLVM RegisterCoalescer at sub_2F71140 (which handles generic COPY pseudo-instructions) and a separate NVPTX-specific coalescer rooted at sub_34AF4A0 (which handles NVPTX copy instruction families in the opcode 440--503 range that the generic pass does not recognize). This page documents both, with emphasis on the NVPTX-specific pass where the bulk of the proprietary logic resides.
| Standard LLVM RegisterCoalescer | sub_2F71140 (80KB, 2,190 lines) |
| NVPTX coalescing driver | sub_34AF4A0 (67KB, 2,373 lines) |
| Per-instruction coalesce attempt | sub_34AE060 (28KB) |
| Interference check | sub_34AA450 (11.5KB) |
| Block-level coalescing | sub_34BAAF0 (31.7KB) |
| Live-out / weight computation | sub_34B7280 (22KB) |
| Interval tree (red-black BST) | sub_34A0610 (14.7KB) |
| Range rebuild after merge | sub_34A46B0 (13KB) |
| Opcode -> copy-type mapping | sub_3494EA0 (12.7KB) |
| Operand type classification table | byte_444C4A0 (16-byte entries) |
| Address range | 0x3494EA0 -- 0x34BF740 |
| Pass parameters | (pass_obj*, func_info*, MF*, copy_limit, coalesce_limit) |
| Pass ordering | After TwoAddressInstruction, before greedy RA |
Why Coalescing Matters on a Virtual-Register Target
On CPU targets, coalescing reduces register pressure by allowing two virtual registers to share one physical register, potentially preventing a spill. On NVPTX the motivation is different. PTX is a virtual ISA with typed, unlimited registers (%r0, %r1, ... for 32-bit integers; %f0, %f1, ... for 32-bit floats). The "physical" allocation is deferred entirely to ptxas, which maps virtual registers to the hardware register file at kernel launch time based on occupancy targets. CICC's coalescing therefore serves three purposes:
- Copy elimination. Every mov instruction that survives into emitted PTX is dead weight -- it costs an issue slot and extends the live range of both source and destination. Coalescing removes these by unifying src and dst into a single virtual register.
- Reduced register name count. Even though PTX registers are virtual, ptxas must solve a graph-coloring problem on them. Fewer distinct register names (after coalescing merges equivalents) give ptxas a smaller interference graph and faster compilation.
- Cleaner SSA destruction. PHI elimination during the transition from SSA form to machine code inserts copies at every PHI edge. Many of these are immediately coalesceable because the PHI operand's live range does not extend past the copy point. The coalescer cleans up the mechanical output of PHI lowering.
Copies that the coalescer processes arise from three sources: PHI elimination copies, ABI/calling-convention .param register copies for kernel call boundaries, and sub-register operations (EXTRACT_SUBREG, INSERT_SUBREG).
Standard LLVM RegisterCoalescer (sub_2F71140)
CICC includes the stock LLVM RegisterCoalescer at sub_2F71140, registered as pass "register-coalescer" with debug output markers "Before register coalescing" / "After register coalescing". This pass handles the generic COPY pseudo-instruction (LLVM's TargetOpcode::COPY) using the standard worklist-driven algorithm from upstream.
The key LLVM knobs that apply to this instance:
| Knob | Default | Effect |
|---|---|---|
| join-liveintervals | true | Master enable for copy coalescing |
| join-splitedges | subtarget | Coalesce copies on split critical edges |
| join-globalcopies | subtarget | Coalesce copies that span basic blocks |
| terminal-rule | true | Apply the terminal rule (copies at block ends) |
| verify-coalescing | false | Verify MachineInstrs before and after coalescing |
| late-remat-update-threshold | 100 | Batch live-interval updates when a def has many copy uses |
| large-interval-size-threshold | 100 | Intervals with more valnos than this are "large" |
| large-interval-freq-threshold | 256 | Stop coalescing a large interval after this many joins |
The standard pass operates on COPY pseudo-instructions only. It does not understand NVPTX-specific move instruction families (opcodes 440--503), which is why the NVPTX-specific pass exists.
NVPTX-Specific Coalescer (sub_34AF4A0)
The proprietary coalescer at sub_34AF4A0 runs after the standard RegisterCoalescer and targets NVPTX copy instruction families that the generic pass skips. It operates on the MachineFunction representation and accepts two limit parameters beyond the standard pass/MF arguments: copy_limit (maximum number of copy instructions to consider) and coalesce_limit (maximum number of successful merges before bailing out). These are compile-time budget controls that prevent quadratic behavior on large functions.
Opcode Classification
The function sub_3494EA0 contains a giant switch statement mapping NVPTX instruction opcodes (range 1--0x12) to copy families in the 440--503 opcode range. Each family represents a distinct copy semantic:
- Opcodes 440--443: Type-preserving moves within a single register class (i32-to-i32, f32-to-f32, etc.). These map from internal opcodes 12, 13, 15 in the operand-type classification table.
- Opcodes 444--503: Cross-class moves, paired/wide register moves (128-bit pairs for tensor core paths), and ABI-related .param copies.
The return value is an __m128i pair encoding both the copy semantics and the register class constraints, which subsequent stages use to decide whether coalescing is legal.
Operand-type classification happens via sub_34961A0, which reads operands and classifies them through a lookup table at byte_444C4A0. Each entry in this table is 16 bytes:
struct OperandTypeEntry {
uint8_t type_code; // +0: 12=i32, 13=i64, 15=f32, etc.
uint8_t size_class; // +1: size in register-width units
uint8_t register_bank; // +2: bank identifier
uint8_t constraint_flags; // +3: bit 0x10 = participates in coalescing
uint8_t reserved[12]; // +4: padding/future use
};
The constraint flag at offset +3 (mask 0x10) gates whether the operand participates in coalescing at all. Operands with this bit cleared are excluded from the worklist.
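Decoding one 16-byte entry and testing the participation bit can be sketched as follows (the raw bytes below are fabricated for illustration; only the field layout and the 0x10 mask come from the recovered struct):

```python
import struct

COALESCE_FLAG = 0x10  # bit in constraint_flags (+3) gating worklist entry

def parse_entry(raw16):
    """Decode a 16-byte OperandTypeEntry from byte_444C4A0."""
    type_code, size_class, bank, flags = struct.unpack_from("<4B", raw16, 0)
    return {
        "type_code": type_code,          # 12=i32, 13=i64, 15=f32, ...
        "size_class": size_class,        # size in register-width units
        "register_bank": bank,
        "coalesceable": bool(flags & COALESCE_FLAG),
    }

# Fabricated example entry: i32, one register wide, bank 0, flag set
entry = parse_entry(bytes([12, 1, 0, 0x10]) + bytes(12))
```

An operand whose entry lacks the 0x10 bit would simply never be pushed onto the coalescing worklist.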
Register Class Constraints
Coalescing is constrained to same-class merges. The NVPTX register classes are completely disjoint -- an Int32Regs (%r) register cannot coalesce with a Float32Regs (%f) register even though both are 32 bits wide. This is a consequence of PTX's typed register model: .reg .b32 %r0 and .reg .f32 %f0 are distinct storage locations from ptxas's perspective. The complete register class table and coalescing constraint flags are in Register Classes. All eight primary classes are same-class-only; Int128Regs is excluded from the coalescing worklist entirely (constraint flag cleared).
Cross-class copies (e.g., bitcasting an i32 to f32) use distinct cross-class copy opcodes (see the copy opcode table) and are never eliminated by the coalescer -- they must survive as explicit instructions in PTX.
Sub-Register Handling
NVPTX has a flat register file with no sub-register structure in the CPU sense. There are no %eax/%ax/%al hierarchies. The exception is wide register pairs: 128-bit values used by tensor core operations are represented as pairs of 64-bit registers. sub_3497B40 handles paired-register decomposition, and when coalescing the low half of a pair, the high half inherits corresponding constraints. The coalesce candidate record (248 bytes) stores sub-operand arrays at offset +16 (4 entries of 32 bytes each, inline SBO) specifically for tracking these pair relationships.
Coalescing Algorithm
The NVPTX coalescer follows the standard LLVM pattern of worklist-driven interval joining but uses proprietary data structures throughout.
Phase 1: Initialization (lines 494--617)
Load TargetInstrInfo, TargetRegisterInfo, and TargetSubtargetInfo from the MachineFunction vtables. Initialize approximately 15 open-addressing hash maps, 2 min-heaps, 3 interval trees (red-black BSTs), and 2 linked lists. The stack frame is approximately 4.5KB. Walk all basic blocks, filter virtual-register operands via sub_2DADC00 (the isVirtualRegister check), and collect copy instructions into the worklist hash.
Phase 2: Block-Level Scanning (lines 618--857)
For each basic block, walk instructions and identify NVPTX copy instructions (opcode field at instruction offset +68 equals 14 or 15). For each copy:
- Validate source type via sub_B10CD0 (extract register class).
- Check physical register constraints (vestigial on NVPTX but present in the code).
- Build a coalesce pair via sub_34A70E0, creating a 248-byte candidate record.
Track live-through registers per block using bitvectors.
Phase 3: Interference Graph Construction (lines 858--998)
Build the interval tree via sub_2DACB60 and sub_C8CD80. Cross-compare forward and backward interval lists via sub_2E564A0. Flatten into indexed format via sub_2E507D0. The result is a set of live intervals indexed by register number, stored in a red-black BST where each node is 448 bytes (0x1C0).
Phase 4: Worklist-Driven Coalescing (lines 1040--2092)
This is the core loop. Candidates are extracted from a min-heap ordered by register number (lowest first -- a standard LLVM heuristic that processes defs before uses in reverse postorder).
function CoalesceWorklistDriven(heap, intervals, hash_map):
while heap is not empty:
candidate = heap.extract_min()
src_interval = lookup(hash_map, candidate.src_key)
dst_interval = lookup(hash_map, candidate.dst_key)
// Same-class check
if register_class(src_interval) != register_class(dst_interval):
continue
// Interference check
if CheckInterference(src_interval, dst_interval) != 0:
push candidate to secondary_heap
continue
// Pre-coalesce validation
if not ValidateCopy(candidate):
push candidate to secondary_heap
continue
// Execute the merge
merged = MergeIntervals(src_interval, dst_interval)
RewriteOperands(candidate.copy_instr, merged)
UpdateHashMap(hash_map, merged)
// Verify and rebuild
VerifyMergedInterval(merged)
RebuildRanges(merged)
// Double-buffer swap: retry with secondary heap
swap(heap, secondary_heap)
if secondary_heap was non-empty:
repeat from top
The double-buffer swap (lines 2073--2093) alternates between two heaps (v373 and v376). After exhausting one worklist, the pass swaps and retries -- implementing the LLVM-style "iterate until convergence" pattern where an earlier merge may resolve interference that blocked a later merge.
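The retry structure can be modeled in a few lines (a sketch, not CICC code -- the `try_merge` callback stands in for the interference check, validation, and merge steps):

```python
import heapq

def coalesce_until_convergence(candidates, try_merge):
    """candidates: list of (reg_number, pair) tuples; try_merge(pair) -> bool.
    Blocked candidates go to a secondary heap; after one heap drains, the
    two are swapped and the pass retries until a full sweep merges nothing."""
    heap, secondary = list(candidates), []
    heapq.heapify(heap)
    merged = []
    while heap:
        progress = False
        while heap:
            key, pair = heapq.heappop(heap)   # lowest register number first
            if try_merge(pair):
                merged.append(pair)
                progress = True
            else:
                heapq.heappush(secondary, (key, pair))
        heap, secondary = secondary, heap     # double-buffer swap
        if not progress:                      # converged: nothing unblocked
            break
    return merged

# Example: merging ("a", "b") resolves the interference blocking ("b", "c")
_done = set()
def _try(pair):
    if pair == ("b", "c") and ("a", "b") not in _done:
        return False
    _done.add(pair)
    return True

merged_order = coalesce_until_convergence(
    [(2, ("b", "c")), (5, ("a", "b"))], _try)
```

The `progress` flag is what bounds the loop: each full sweep either merges at least one candidate (strictly shrinking the remaining work) or terminates.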
Phase 5: Code Patching (lines 2095--2144)
For each coalesced pair, rewrite instruction operands:
- sub_349D6E0 -- look up the merged interval's representative register.
- sub_349FA50 -- find the instruction position.
- sub_2E31040 -- patch the operand's register field.
- Fix linked-list pointers using the ptr & 0xFFFFFFFFFFFFFFF8 mask (the low 3 bits encode tags on MachineOperand pointers: 0 = normal, 3 = tied operand, 4 = implicit operand).
Phase 6: Cleanup (lines 2145--2371)
Destroy interval trees (sub_349E8A0), perform final range rebuild (sub_34A46B0), finalize coalescing metadata (sub_34A2530), commit merged intervals (sub_34AA090), and deallocate all hash maps, heaps, and trees (16+ free calls).
Interference Check (sub_34AA450)
The interference check is the critical decision point. Given two intervals (identified by their register keys), it determines whether merging them would create a conflict -- that is, whether both registers are simultaneously live at any program point.
function CheckInterference(interval_A, interval_B) -> {0 = safe, 1 = interfering}:
for each instruction I in interval_A.instruction_vector:
if I is in the "already-coalesced" set:
continue
reg_class = extract_register_class(I)
dst_interval = lookup(reg_to_interval_hash, I.dst_reg)
if dst_interval overlaps with interval_B:
return 1 // interfering
return 0 // safe to coalesce
The "already-coalesced" set is an open-addressing hash map (pointer keys, hash (key >> 9) ^ (key >> 4), sentinels -4096/-8192). The sentinel check at a3+8 (a flag byte) determines whether the set uses inline or heap storage (small-buffer optimization for sets under approximately 8 entries).
Since NVPTX has no physical register file, "interference" here means purely that two virtual register live ranges overlap at a program point. On CPU targets this would also involve physical register conflict checks, but on NVPTX that dimension is absent.
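With interference reduced to live-range overlap, the core test is whether two sorted lists of half-open [start, end) segments intersect anywhere. A linear merge-walk (a sketch under that representation, not recovered code) decides this in O(len(a) + len(b)):

```python
def intervals_overlap(a, b):
    """a, b: sorted lists of half-open (start, end) segments.
    Returns True iff any segment of a intersects any segment of b."""
    i = j = 0
    while i < len(a) and j < len(b):
        s = max(a[i][0], b[j][0])
        e = min(a[i][1], b[j][1])
        if s < e:                  # nonempty intersection => interference
            return True
        if a[i][1] <= b[j][1]:     # advance whichever segment ends first
            i += 1
        else:
            j += 1
    return False
```

Note that half-open segments make back-to-back ranges (one ending exactly where the other starts) non-interfering, which is precisely the case a coalesced copy produces.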
Priority and Weight System
The coalescing priority determines the order in which candidates are processed when the min-heap's register-number ordering produces ties.
Weight computation (sub_34B7280):
weight = instruction_count + spill_weight[offset+240] + use_count[offset+252]
The flag at offset+254 & 1 guards weight computation: if set, the interval was pre-weighted by an earlier pass and the coalescer uses the existing weight rather than recomputing.
Higher weight means higher coalescing priority. The overall ordering is:
- Primary key: register number (min-heap, lowest first).
- Secondary key: weight (higher breaks ties in favor of more-used registers).
Block frequency integration: The pass reads a boolean from TargetPassConfig (via sub_35DDE70 at *(_QWORD*)(pass[4]+256)+856) that controls whether block frequency data influences priority. When enabled, copies in hot blocks receive higher priority, biasing the coalescer toward eliminating copies on the critical execution path.
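The two-level ordering can be expressed as a composite sort key (a sketch mirroring the prose; the field names and sample numbers are invented, and the weight formula is the one given above):

```python
def weight(instruction_count, spill_weight, use_count):
    # weight = instruction_count + spill_weight[+240] + use_count[+252]
    return instruction_count + spill_weight + use_count

def heap_key(reg_number, w):
    # Min-heap semantics: lowest register number first; negate the weight
    # so that higher weight wins ties between equal register numbers.
    return (reg_number, -w)

cands = [
    (7, weight(4, 1, 2)),    # reg 7, weight 7
    (7, weight(10, 3, 5)),   # reg 7, weight 18 -- should beat the above
    (3, weight(1, 0, 0)),    # reg 3 -- lowest register number, goes first
]
order = sorted(cands, key=lambda c: heap_key(*c))
```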
Data Structures
Hash Maps
All hash maps use the standard DenseMap open-addressing infrastructure described in Hash Table and Collection Infrastructure. Two sentinel variants appear in this pass:
| Variant | Key Type | Sentinel pair |
|---|---|---|
| Integer-key | int32_t | -1 / -2 (hash: key * 37) |
| Pointer-key | int64_t | -4096 / -8192 (hash: (key >> 9) ^ (key >> 4)) |
Growth policy: next_power_of_2(2 * old_capacity - 1), minimum 64 entries.
Allocator: sub_C7D670(size, alignment=8) / sub_C7D6A0(ptr, size, alignment=8) -- CICC's aligned malloc/free wrappers.
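A toy map following the documented policy (sentinels, multiplicative hash, 3/4 load trigger, power-of-two growth) behaves like this -- a sketch for illustration; deletion and tombstone reuse are omitted:

```python
EMPTY, TOMBSTONE = -1, -2   # integer-key sentinel pair from the table above

def next_pow2(n):
    return 1 << (n - 1).bit_length()

class DenseIntMap:
    def __init__(self, capacity=64):
        self.cap = max(64, next_pow2(capacity))   # minimum 64 entries
        self.keys = [EMPTY] * self.cap
        self.vals = [None] * self.cap
        self.count = 0

    def _slot(self, key):
        i = (key * 37) % self.cap                 # integer-key hash
        while self.keys[i] not in (EMPTY, key):
            i = (i + 1) % self.cap                # linear probe
        return i

    def insert(self, key, val):
        if 4 * (self.count + 1) >= 3 * self.cap:  # 75% load trigger
            self._grow()
        i = self._slot(key)
        if self.keys[i] == EMPTY:
            self.count += 1
        self.keys[i], self.vals[i] = key, val

    def lookup(self, key):
        i = self._slot(key)
        return self.vals[i] if self.keys[i] == key else None

    def _grow(self):
        old = [(k, v) for k, v in zip(self.keys, self.vals) if k >= 0]
        self.cap = next_pow2(2 * self.cap - 1)    # documented growth policy
        self.keys = [EMPTY] * self.cap
        self.vals = [None] * self.cap
        self.count = len(old)
        for k, v in old:
            i = self._slot(k)
            self.keys[i], self.vals[i] = k, v
```

Starting at capacity 64, inserting 100 keys grows the table twice (64 -> 128 -> 256): the 48th insert trips 4*48 >= 192 and the 96th trips 4*96 >= 384.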
Interval Tree (Red-Black BST)
Managed by sub_34A0610. Each node is 448 bytes (0x1C0):
| Offset | Size | Field |
|---|---|---|
| +0 | 24 | Tree links (left, right, parent pointers) |
| +32 | 8 | Interval key (register/slot encoding) |
| +64 | 8 | Instruction vector pointer |
| +72 | 4 | Instruction count |
| +192 | 16 | Debug name (SBO: inline if len <= 15) |
| +200 | 4 | Sub-operand count |
| +224 | 4 | Instruction opcode |
| +240 | 2 | Priority/weight (uint16) |
Comparator: sub_34A0190 (compares interval start positions). Rebalancing: sub_34A0330. The tree maintains count (a2[5]) and cached min/max (a2[3]/a2[4]).
Coalesce Candidate Record (248 bytes)
Built by sub_349AB40 for each potential coalescing opportunity:
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Source interval key |
| +8 | 8 | Destination interval key |
| +16 | 128 | Sub-operand array (SBO, 4 entries x 32 bytes) |
| +64 | 112 | Type-constraint array (SBO, 2 entries x 56 bytes) |
| +192 | 32 | Debug name (SBO string) |
| +224 | 4 | Opcode classification (1--6: copy, subreg, extract, ...) |
| +232 | 4 | Copy source register |
| +240 | 2 | Priority (default: 1) |
MachineOperand Pointer Encoding
Throughout the coalescing code, MachineOperand pointers use low-bit tagging (8-byte alignment guarantees 3 unused low bits):
| Tag (ptr & 7) | Meaning |
|---|---|
| 0 | Normal operand |
| 3 | Tied operand (requires special coalescing -- both operands must map to same register) |
| 4 | Implicit operand (flag bit at operand offset +44, bit 3) |
The code consistently masks with & 0xFFFFFFFFFFFFFFF8 before dereferencing and checks (ptr & 7) == 3 or (ptr & 4) != 0 for branching decisions.
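The tag/untag arithmetic is straightforward to model (addresses below are arbitrary example values; only the masks and tag meanings come from the recovered code):

```python
TAG_MASK = 0x7
PTR_MASK = 0xFFFFFFFFFFFFFFF8   # the mask used before every dereference

def tag(ptr, t):
    """Pack a 3-bit tag into an 8-byte-aligned pointer's low bits."""
    assert ptr % 8 == 0 and 0 <= t <= 7
    return ptr | t

def untag(tagged):
    return tagged & PTR_MASK

def is_tied(tagged):            # (ptr & 7) == 3 branch in the binary
    return (tagged & TAG_MASK) == 3

def is_implicit(tagged):        # (ptr & 4) != 0 branch in the binary
    return (tagged & 0x4) != 0
```

Because malloc'd MachineOperand storage is at least 8-byte aligned, the low 3 bits are guaranteed zero in any real pointer, so tagging is lossless.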
CSSA Coalescing (PHI-Specific)
Separate from the two coalescing passes above, CICC includes a CSSA (Conventional SSA) coalescing stage controlled by the cssa-coalesce knob (constructor at ctor_705, address 0x5BD430). This pass operates at the SSA level rather than the machine level, coalescing PHI operands before PHI elimination to reduce the number of copies that PHI lowering generates. Associated knobs:
| Knob | Effect |
|---|---|
| cssa-coalesce | Enable/disable PHI operand coalescing |
| cssa-verbosity | Verbosity level for CSSA debug output |
| dump-before-cssa | Dump IR before CSSA coalescing |
| usedessa | Select deSSA method (alternative to CSSA) |
Knobs and Thresholds Summary
| Knob | Source | Default | Effect |
|---|---|---|---|
| join-liveintervals | LLVM | true | Master enable for standard RegisterCoalescer |
| join-splitedges | LLVM | subtarget | Coalesce on split critical edges |
| join-globalcopies | LLVM | subtarget | Coalesce cross-block copies |
| terminal-rule | LLVM | true | Terminal rule for block-end copies |
| verify-coalescing | LLVM | false | Pre/post verification |
| late-remat-update-threshold | LLVM | 100 | Batch remat update threshold |
| large-interval-size-threshold | LLVM | 100 | Large interval valno threshold |
| large-interval-freq-threshold | LLVM | 256 | Large interval coalesce limit |
| twoaddr-reschedule | LLVM | -- | Coalesce copies by rescheduling in TwoAddress |
| copy_limit | NVPTX | runtime | Max copies to consider in NVPTX pass |
| coalesce_limit | NVPTX | runtime | Max merges before bailout in NVPTX pass |
| cssa-coalesce | NVPTX | -- | PHI operand coalescing |
| cssa-verbosity | NVPTX | -- | CSSA debug verbosity |
| block frequency flag | NVPTX | config | Weight copies by block hotness |
The copy_limit and coalesce_limit parameters are passed into sub_34AF4A0 at call time (not static cl::opt knobs). Their values come from the pass pipeline configuration and serve as compile-time budget caps to avoid quadratic worst-case behavior on functions with thousands of copies.
Impact on ptxas
The quality of CICC's coalescing directly affects ptxas's register allocation phase:
- Fewer virtual registers means a smaller interference graph for ptxas to color, reducing its compilation time.
- Eliminated copies reduce instruction count, giving ptxas's scheduler more freedom and fewer false dependencies.
- Preserved type invariants (no cross-class coalescing) ensure ptxas never encounters type-inconsistent register usage, which would require additional conversion instructions.
- Wide register pair tracking ensures tensor core instruction patterns remain intact -- ptxas expects specific register pair relationships for mma and wmma instructions.
A pathological case is over-aggressive coalescing that creates very long live ranges spanning many basic blocks. On NVPTX this does not cause spills (there is no physical register file to spill from), but it can increase ptxas's reported register usage, reducing occupancy. The coalesce_limit parameter and the large-interval frequency threshold exist partly to avoid this scenario.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Main NVPTX coalescing driver | sub_34AF4A0 | 67KB | -- |
| Per-instruction coalesce attempt | sub_34AE060 | 28KB | -- |
| Pre-coalesce validation (opcode 14/15 check) | sub_34AB5C0 | 16KB | -- |
| Post-coalesce update (rewrite def-use chains) | sub_34AC810 | 19KB | -- |
| Constrained-copy validation variant | sub_34AD8B0 | 8.5KB | -- |
| Interference check | sub_34AA450 | 11.5KB | -- |
| Range rebuild (bitvector v90[12336]) | sub_34A46B0 | 13KB | -- |
| Interval equivalence verify | sub_34A2770 | 7.3KB | -- |
| Interval tree insert/rebalance (RB-tree) | sub_34A0610 | 14.7KB | -- |
| Register-to-interval hash lookup | sub_34A3910 | 2.7KB | -- |
| Build worklist from BB operand scan | sub_34A3D10 | 5KB | -- |
| Build worklist from instruction iteration | sub_34A41A0 | 4.8KB | -- |
| Block-level coalescing driver | sub_34BAAF0 | 31.7KB | -- |
| Live-out analysis + weight computation | sub_34B7280 | 22KB | -- |
| Per-register interference build | sub_34B6620 | 17.7KB | -- |
| Operand-type classification | sub_34961A0 | 26.6KB | -- |
| Register-pair decomposition | sub_3497B40 | 16.5KB | -- |
| Opcode -> copy-type mapping (switch) | sub_3494EA0 | 12.7KB | -- |
| Build coalesce candidate list | sub_349AB40 | 24.5KB | -- |
| Merged-interval representative lookup | sub_349D6E0 | -- | -- |
| Instruction position lookup/creation | sub_349FA50 | 7.1KB | -- |
| Interval tree destructor (variant A) | sub_349E330 | 4KB | -- |
| Interval tree destructor (variant B) | sub_349E500 | 4KB | -- |
| Interval tree destructor (variant C) | sub_349E6D0 | 4KB | -- |
| Interval tree destructor (variant D) | sub_349E8A0 | 4KB | -- |
| Interval info populate from instruction | sub_349F140 | 4.7KB | -- |
| Interval structure reset | sub_349F740 | 4KB | -- |
| Generic map cleanup (callback sub_349D600) | sub_34A2010 | -- | -- |
| Finalize coalescing metadata | sub_34A2530 | -- | -- |
| Commit merged intervals | sub_34AA090 | -- | -- |
| Secondary coalesce commit | sub_34A9A60 | -- | -- |
| Register info initializer | sub_35065A0 | -- | -- |
| Standard LLVM RegisterCoalescer | sub_2F71140 | 80KB | -- |
| RegisterCoalescer::getPassName | sub_2F60C50 | -- | -- |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Number of passes | Single RegisterCoalescer pass handling COPY pseudo-instructions | Two passes in sequence: stock LLVM RegisterCoalescer (sub_2F71140) + NVPTX-specific coalescer (sub_34AF4A0) |
| Opcode coverage | Handles only TargetOpcode::COPY (generic copy pseudo) | NVPTX pass handles NVPTX copy instruction families in opcode range 440--503 that the generic pass does not recognize |
| Coalescing goal | Reduce physical register pressure to prevent spills | Strictly copy elimination (PTX has unlimited virtual registers); goal is fewer mov instructions in emitted PTX and smaller interference graphs for ptxas |
| Interference check | Standard LiveIntervals query | Custom interference check (sub_34AA450, 11.5 KB) with interval tree (red-black BST at sub_34A0610) for NVPTX register classes |
| Block-level coalescing | Part of the unified worklist | Separate block-level coalescing pass (sub_34BAAF0, 31.7 KB) processes copies within each block before cross-block coalescing |
| Operand classification | Generic operand handling | Custom operand type classification table (byte_444C4A0, 16-byte entries) maps NVPTX opcode families to copy semantics |
| Pass parameters | Standard runOnMachineFunction with no limits | Parameterized with explicit (copy_limit, coalesce_limit) bounds for compile-time control on large kernels |
Cross-References
- Register Allocation -- the greedy allocator that runs after coalescing; shares the register class table and interference hash pattern.
- Instruction Scheduling -- scheduling runs after RA and benefits from reduced copy count; MRPA pressure tracking is affected by coalescing decisions.
- LLVM Knobs -- full knob inventory including all coalescing-related flags.
- Code Generation -- pipeline ordering showing where coalescing fits relative to other machine passes.
Register Allocation
Prerequisites: Familiarity with NVPTX register classes, the GPU execution model (especially occupancy and register pressure), and Live Range Calculation. Understanding of Register Coalescing is helpful.
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/CodeGen/RegAllocGreedy.cpp, llvm/lib/CodeGen/SplitKit.cpp, llvm/lib/CodeGen/RegisterCoalescer.cpp, llvm/lib/CodeGen/LiveRangeEdit.cpp (LLVM 20.0.0). NVPTX register class definitions: llvm/lib/Target/NVPTX/NVPTXRegisterInfo.td.
LLVM version note: CICC v13.0 ships two complete copies of RAGreedy (legacy PM at 0x1EC0400, new PM at 0x2F4C2E0). The new PM variant matches the LLVM 20 RAGreedyPass interface. The PriorityAdvisor/EvictionAdvisor infrastructure matches LLVM 15+ patterns. All NVPTX-specific behavior (pressure-driven allocation, -maxreg ceiling, occupancy-aware rematerialization) is layered on top of stock RAGreedy via TTI hooks and custom knobs.
NVPTX register allocation in CICC v13.0 operates under a fundamentally different model from CPU targets. PTX has no fixed physical register file -- registers are virtual (%r0, %r1, %f0, ...) and the hardware scheduler maps them to physical resources at launch time. The "physical register" concept in LLVM's greedy allocator maps to register pressure constraints rather than actual hardware registers, making the allocator pressure-driven rather than assignment-driven. The primary constraint is the -maxreg limit (default 70), which bounds total live registers across all classes to control occupancy on the SM.
| Greedy RA driver | sub_2F5A640 (466 lines) |
| selectOrSplit core | sub_2F49070 (82KB, 2,314 lines) |
| Live range splitting | sub_2F2D9F0 (93KB, 2,339 lines) |
| Register coalescing | sub_2F71140 (80KB, 2,190 lines) |
| Register info init (new) | sub_30590F0 |
| Register info init (old) | sub_2163AB0 |
| Allocation failure handler | sub_2F418E0 |
Dual Greedy RA Instances
CICC contains two complete copies of the Greedy Register Allocator infrastructure, corresponding to the legacy and new LLVM pass managers:
- Instance A (legacy, 0x1EC0400 region): registered through the old pass manager pipeline.
- Instance B (new, 0x2F4C2E0 region): registered through sub_2F504C0 as the factory function.
Both are registered under the pass name "Greedy Register Allocator" via RAGreedyPass (sub_2342890). The selectOrSplit entry point at sub_2F4BAF0 is a thin wrapper that redirects to sub_2F49070(this + 200, ...). A separate entry at sub_2F4BB00 handles the spill-or-split path with SplitEditor integration.
NVPTX Register Classes
CICC defines nine register classes plus one internal-only class. The complete register class table -- vtable addresses, PTX type suffixes, prefixes, encoded IDs, copy opcodes, and coalescing constraints -- is in Register Classes.
The classes are completely disjoint -- there is no cross-class interference. Each type lives in its own namespace: integer 32-bit values occupy %r registers, 32-bit floats occupy %f registers, and so on. Copy instructions are class-specific, with both same-class and cross-class opcodes dispatched by sub_2162350 (see the copy opcode table).
Greedy selectOrSplit -- Detailed Algorithm
Complexity. Let V = number of virtual registers, R = number of register units, and I = total MachineInstr count. The main allocation loop processes V virtual registers in priority order. For each VReg, selectOrSplit performs: (1) operand scanning in O(operands) with 40-byte stride, (2) interference scanning (scanInterference) in O(R) via the RegAllocMatrix, (3) assignment or eviction attempts in O(R) per candidate. The tryLastChanceRecoloring path is bounded by lcr-max-depth (default 5) and lcr-max-interf (default 8), giving O(8^5) = O(32768) per VReg in the absolute worst case -- though this path is rarely taken. Live range splitting (splitAroundRegion, 93KB) iterates segments in O(S) where S = number of live range segments, with interference analysis per segment in O(R). Overall: O(V * R) for the common case, O(V * R + V * 8^D) when last-chance recoloring is exercised at depth D. The interference cache's open-addressing hash map with 37 * reg provides O(1) amortized lookups. Spill cost computation (setupSpillCosts) is O(V * I_avg) where I_avg is average instructions per VReg's live range. On NVPTX, the completely disjoint register classes mean cross-class interference is zero, reducing the effective R to the per-class register count.
The core allocation algorithm (sub_2F49070, 82KB, 2,314 decompiled lines) follows LLVM's standard RAGreedy::selectOrSplit structure with NVPTX-specific adaptations for pressure-driven allocation. The following pseudocode is reconstructed from the decompiled binary and covers the key phases visible in the new-pass-manager instance.
Initialization (lines 381--484)
fn selectOrSplit(this: &mut RAGreedyState, VirtReg: &LiveInterval) -> PhysReg {
let TRI = this.TargetRegisterInfo;
let NumRegs = TRI[+44]; // total reg unit count
// --- RegUnitStates: per-register-unit state array ---
// Stored at this+1112, 4 bytes per unit.
// Values: 0 = free, 1 = interfering, 2 = reserved
this.RegUnitStates = alloc_zeroed(NumRegs * 4); // at this+1112
// --- Live-through bitvector ---
// Stored at this+736, one bit per register unit,
// packed into 64-bit words. Set bits mark units
// live across the entire interval.
let bv_words = (NumRegs + 63) / 64;
this.LiveThrough = alloc_zeroed(bv_words * 8); // at this+736
// --- Interference cache ---
// Open-addressing hash map at this+648/656/664.
// Key = register number (unsigned 32-bit)
// Hash = 37 * reg (mod table_capacity)
// Empty = 0xFFFFFFFF (-1), Tombstone = 0xFFFFFFFE (-2)
// Growth: when 4*(count+1) >= 3*capacity, double & rehash.
this.IntfCache.buckets = alloc_sentinel(initial_cap); // this+648
this.IntfCache.count = 0; // this+656
this.IntfCache.capacity = initial_cap; // this+664
...
The RegUnitStates array is the central per-unit bookkeeping structure for the entire allocation of a single virtual register. Each 4-byte slot tracks whether that register unit is free, already interfering with the current live range, or reserved by the target. The array is zeroed at the start of every selectOrSplit invocation and released at cleanup (lines 2192--2313).
The interference cache at this+648 is distinct from LLVM's standard InterferenceCache (allocated at 0x2C0 bytes via sub_2FB0E40 during driver setup). This per-invocation cache is a lightweight open-addressing map used to deduplicate interference queries within a single selectOrSplit call. The hash function 37 * reg is a small Knuth-style multiplicative hash chosen for speed over distribution quality -- adequate because register numbers are small consecutive integers.
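The cache's behavior can be modeled in a few lines. The sketch below assumes the recovered constants (the 37 * reg hash, 0xFFFFFFFF empty and 0xFFFFFFFE tombstone sentinels, growth when 4*(count+1) >= 3*capacity) plus linear probing, which the decompilation suggests but does not prove:

```python
# Hypothetical model of the per-invocation interference cache at this+648.
EMPTY, TOMBSTONE = 0xFFFFFFFF, 0xFFFFFFFE

class IntfCache:
    def __init__(self, capacity=16):
        self.buckets = [EMPTY] * capacity
        self.count = 0

    def _probe(self, reg):
        cap = len(self.buckets)
        i = (37 * reg) % cap          # Knuth-style multiplicative hash
        first_tomb = None
        while True:
            slot = self.buckets[i]
            if slot == reg:
                return i, True        # already present
            if slot == EMPTY:
                return (first_tomb if first_tomb is not None else i), False
            if slot == TOMBSTONE and first_tomb is None:
                first_tomb = i        # reuse first tombstone on insert
            i = (i + 1) % cap         # linear probing (assumed)

    def insert(self, reg):
        # grow & rehash when the load factor would reach 75%
        if 4 * (self.count + 1) >= 3 * len(self.buckets):
            live = [r for r in self.buckets if r < TOMBSTONE]
            self.buckets = [EMPTY] * (2 * len(self.buckets))
            self.count = 0
            for r in live:
                self.insert(r)
        i, present = self._probe(reg)
        if not present:
            self.buckets[i] = reg
            self.count += 1

    def contains(self, reg):
        return self._probe(reg)[1]
```

Because register numbers are small consecutive integers, the weak 37 * reg hash scatters them adequately even under linear probing.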
Operand Scanning (lines 690--1468)
The function walks every MachineOperand attached to the live range's segment list. Operands are stored in a flat array with a 40-byte stride per entry. The type byte at offset +0 of each operand classifies it:
| Type Byte | Meaning | Action |
|---|---|---|
| 0 | Virtual register | Check copyable/tied flags; record in VReg worklist |
| 12 | Register mask (call clobber) | Store pointer in regmask list at this+1176/1184 |
| other | Physical register | Mark in reserved bitvector; update RegUnitStates |
For each operand:
for op in VirtReg.operands(stride=40):
match op.type_byte:
0 => // virtual register
if op.reg & 0x80000000: // negative = virtual
check_copyable(op);
check_tied(op);
update needsRecoloringFlag; // v321
else:
mark_reserved(op.reg, RegUnitStates);
update hasPhysicalAssignment; // v323
12 => // regmask
append(this.regmask_list, op);
_ =>
mark_reserved(op.reg, RegUnitStates);
The 40-byte operand stride is wider than upstream LLVM's MachineOperand (typically 32 bytes) because CICC embeds an additional 8-byte field for NVPTX-specific metadata (likely the register class tag and a flags word). The scanning loop at line 690 uses v321 (needsRecoloringFlag) and v323 (hasPhysicalAssignment) as accumulator flags that gate later phases: if no virtual registers need work, the function returns early.
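A minimal model of this dispatch over a flat byte buffer, using the 40-byte stride and type byte at +0 from the recovered layout; the 32-bit register field at +4 is a hypothetical offset for illustration, not a recovered fact:

```python
import struct

STRIDE = 40  # recovered operand stride (upstream LLVM uses 32)

def scan_operands(buf: bytes):
    """Classify each operand entry by its type byte, as in the scanning loop."""
    vregs, regmask_offsets, phys = [], [], []
    for off in range(0, len(buf), STRIDE):
        type_byte = buf[off]
        reg = struct.unpack_from("<I", buf, off + 4)[0]  # assumed offset
        if type_byte == 0 and reg & 0x80000000:
            vregs.append(reg & 0x7FFFFFFF)      # virtual: high bit set
        elif type_byte == 12:
            regmask_offsets.append(off)         # call-clobber regmask
        else:
            phys.append(reg)                    # physical: mark reserved
    return vregs, regmask_offsets, phys
```

The two accumulator flags (v321/v323) correspond here to `vregs` and `phys` being non-empty; an early return fires when `vregs` is empty.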
Interference Processing via sub_2F43DC0 (lines 714--955)
After operand scanning, the allocator calls sub_2F43DC0 (scanInterference) to populate the interference cache:
scanInterference(this, VirtReg, &IntfCache);
// IntfCache now contains register units that conflict.
// Iterate the conflict list at this+1128/1136:
for conflict in IntfCache.entries():
if conflict.is_constrained:
// Tied operand or early-clobber -- must try eviction
result = tryEviction(conflict); // sub_2F48CE0
else:
// Normal overlap -- try simple direct assignment first
result = tryAssign(conflict); // sub_2F47B00
if result.success:
record_assignment(result.phys_reg);
break;
// else: continue to next candidate
sub_2F43DC0 is the interference scanner. It walks the RegAllocMatrix (set up by sub_3501A90 during driver init) to find live range overlaps. For each physical register unit that overlaps the current virtual register's live range, it inserts an entry into the interference cache using the 37 * reg hash. The scanner distinguishes between two conflict types:
- Constrained conflicts (tied operands, early-clobber, regmask kills) -- these route to sub_2F48CE0 (tryEviction), which attempts to evict the conflicting virtual register from its current assignment if the eviction cost is lower than the current candidate's spill weight.
- Normal conflicts -- these route to sub_2F47B00 (tryAssign), which attempts a simple recoloring without eviction.
Additional helper functions participate in this phase:
| Function | Role |
|---|---|
| sub_2F47200 | processConstrainedCopies -- handles operands where a COPY forced a specific register |
| sub_2F46530 | tryLastChanceRecoloring -- last-resort recoloring bounded by lcr-max-depth (default 5) and lcr-max-interf (default 8) |
| sub_2F46EE0 | rehashInterferenceTable -- grows/rehashes when load factor exceeds 75% |
| sub_2F424E0 | updateInterferenceCache -- inserts a newly discovered conflict |
| sub_2F42840 | markRegReserved -- marks a physical register as reserved in RegUnitStates |
The tryLastChanceRecoloring path (sub_2F46530) is the most expensive fallback. It recursively attempts to reassign conflicting registers, up to lcr-max-depth levels deep and considering at most lcr-max-interf conflicting live ranges at each level. The exhaustive-register-search flag bypasses both cutoffs, trading compile time for allocation quality.
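The bounded recursion can be sketched as follows. Only the two cutoffs mirror recovered defaults; `candidates` and `interferers` are hypothetical callbacks standing in for the allocation-order and RegAllocMatrix queries, and the rollback-on-failure structure is the generic last-chance-recoloring shape, not a line-by-line reconstruction:

```python
LCR_MAX_DEPTH = 5    # lcr-max-depth default
LCR_MAX_INTERF = 8   # lcr-max-interf default

def try_recolor(vreg, assignment, candidates, interferers, depth=0):
    """Recursively reassign conflicting vregs, bounded by depth/interference."""
    if depth >= LCR_MAX_DEPTH:
        return False
    for phys in candidates(vreg):
        conflicts = interferers(vreg, phys, assignment)
        if len(conflicts) > LCR_MAX_INTERF:
            continue                     # too many interferences: skip candidate
        saved = dict(assignment)
        assignment[vreg] = phys
        # every conflicting vreg must itself be recolorable one level deeper
        if all(try_recolor(c, assignment, candidates, interferers, depth + 1)
               for c in conflicts):
            return True
        assignment.clear()
        assignment.update(saved)         # roll back the failed attempt
    return False
```

The exhaustive-register-search flag would correspond to lifting both cutoffs, which makes the worst case exponential in the full conflict count rather than 8^5.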
Copy Coalescing Hints -- Kinds 20 and 21 (lines 1060--1163)
During operand scanning, the allocator identifies COPY-like instructions by checking the operand kind field. Two kind values trigger coalescing hint recording:
for op in VirtReg.operands(stride=40):
match op.kind:
20 => // direct COPY hint
record_hint(op.source_reg, op.dest_reg);
21 => // parent-chain COPY hint
if op.flags[+44] & (1 << 4): // "has parent" flag
// Walk up the parent live range chain
let parent = op.parent_LR;
while parent != null:
recordCoalescingHint(parent); // sub_2F41240
parent = parent.parent;
// Coalescing opportunities tracked at this+832/840
Kind 20 represents a simple register-to-register COPY where the source and destination should ideally receive the same physical register. Kind 21 is more complex: it indicates a COPY from a split sub-range that has a parent live range. The has parent flag at byte +44 bit 4 triggers a chain walk via sub_2F41240 (recordCoalescingHint), which records each parent in a coalescing hint list at this+832/840. The hint list is later consumed by sub_2F434D0 (collectHintInfo) during the allocation priority computation, biasing the allocator toward assigning the same physical register to the entire chain.
This is standard LLVM coalescing hint infrastructure, but on NVPTX it interacts with the complete class separation: hints only apply within a single register class, since cross-class coalescing is impossible.
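The kind-21 parent-chain walk is a plain linked-list traversal. A minimal sketch, where the `parent` link is a hypothetical attribute standing in for the recovered parent-live-range pointer:

```python
class LiveRange:
    """Toy live range node; only the parent link matters here."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def record_chain_hints(lr, hints):
    """Walk up the parent chain, recording one hint per ancestor
    (mirrors the sub_2F41240 loop for kind-21 operands)."""
    node = lr.parent
    while node is not None:
        hints.append(node.name)
        node = node.parent
    return hints
```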
Virtual Register Assignment (lines 1005--1368)
After interference processing and copy hint collection, the function enters the main assignment loop:
for vreg in unassigned_vregs:
// Check the live-through bitvector at this+736
if is_live_through(vreg, this.LiveThrough):
// This vreg is live across the entire region -- expensive
result = tryLastChanceRecoloring(vreg); // sub_2F46530
else:
result = tryAssignFromHints(vreg);
if result.success:
recordAssignment(result); // sub_2F42240
refresh_operand_list(); // re-scan
else:
// Allocation failed for this vreg -- proceed to splitting
add_to_split_worklist(vreg);
The live-through bitvector at this+736 is the key data structure for this phase. A set bit indicates that the register unit is live from the beginning to the end of the current region, making it the hardest case for the allocator because there is no gap in which to insert a split point. These live-through ranges go directly to last-chance recoloring.
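The bitvector layout described above -- one bit per register unit, packed into 64-bit words, (NumRegs + 63) / 64 words total -- reduces to shift-and-mask arithmetic:

```python
class LiveThroughBV:
    """Sketch of the live-through bitvector at this+736:
    one bit per register unit, packed into 64-bit words."""
    def __init__(self, num_units):
        self.words = [0] * ((num_units + 63) // 64)

    def set(self, unit):
        self.words[unit >> 6] |= 1 << (unit & 63)

    def test(self, unit):
        return (self.words[unit >> 6] >> (unit & 63)) & 1 == 1
```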
Cleanup (lines 2192--2313)
The function releases the RegUnitStates array, clears the interference cache, frees the live-through bitvector, and returns 1 on success (physical register assigned) or 0 on failure (must spill).
Live Range Splitting -- Detailed Algorithm
The splitting engine (sub_2F2D9F0, 93KB, 2,339 lines) implements RAGreedy::splitAroundRegion with SplitAnalysis and SplitEditor integration. This is the largest single function in the register allocation cluster.
Segment Enumeration (40-byte stride, gap/sub-range flags)
The splitting engine iterates the live range's segment linked list using the same 40-byte stride as the operand scanner. Two flag bits in the segment header control splitting decisions:
| Flag | Location | Meaning |
|---|---|---|
| Gap flag | bit 2 of byte[0] | Segment has a gap before it (potential split point) |
| Sub-range flag | bit 3 of byte[44] | Segment is a sub-range of a larger interval |
fn splitAroundRegion(this: &mut SplitEditor, MF: &MachineFunction) {
let SubTarget = MF.vtable[+128];
let TRI = SubTarget.vtable[+200];
// Per-region loop -- worklist at this+320
for region in this.worklist:
// (a) Hash table init -- 16-byte entries per tracked register
clear_and_resize(this.region_hash, initial_cap=16);
// (b) Segment enumeration
let seg = region.first_segment;
while seg != null:
let is_gap = (seg[0] >> 2) & 1; // bit 2 of byte[0]
let is_subrange = (seg[44] >> 3) & 1; // bit 3 of byte[44]
if is_gap:
// Potential split point -- record in visit set
record_gap(seg, this.visit_set); // sub_C8CC70
if is_subrange:
// Chain through sub-ranges
process_subranges(seg);
seg = seg.next; // stride = 40 bytes
The gap flag is the primary signal for split point selection. When the allocator detects a gap between two live segments, it can insert a split there without introducing a new spill -- the value is simply not live during the gap, so the split editor can create two separate live ranges that each get a different physical register. The sub-range flag indicates that the segment belongs to a sub-register lane (e.g., the low half of an Int64Regs value), which requires special handling to avoid breaking the lane structure.
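The two flag tests reduce to single-bit extractions at the recovered offsets:

```python
def segment_flags(seg: bytes):
    """Extract the split-control flags from one segment entry:
    gap flag = bit 2 of byte[0], sub-range flag = bit 3 of byte[44]."""
    is_gap = bool((seg[0] >> 2) & 1)
    is_subrange = bool((seg[44] >> 3) & 1)
    return is_gap, is_subrange
```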
Copy Hint Detection and Local Splitting
For COPY instructions (kind values 68 and 0), the splitter extracts register pairs and builds a conflict set:
// (c) Copy hint detection
for inst in region.instructions:
if inst.kind == 68 || inst.kind == 0: // COPY variants
let (src, dst) = extract_reg_pair(inst); // operands at +32, stride 40, reg at +8
conflict_set.insert(src);
conflict_set.insert(dst);
// Try local split first
if tryLocalSplit(conflict_set): // sub_2F2A2A0
// Success -- materialize the new segments
materializeSplitSegment(); // sub_2FDF330
continue;
sub_2F2A2A0 (tryLocalSplit) attempts a low-cost split within a single basic block. On success, sub_2FDF330 inserts the new split segments into the live interval data structure. The result entries from a local split use a 24-byte stride, where byte +16 is a quality flag and dwords at +8/+12 are the start/end positions of the split segment.
Interference Analysis for Non-COPY Segments
For non-COPY segments, the splitting engine performs interference analysis using regmasks:
// (d) Interference analysis (lines 785-914)
for seg in region.non_copy_segments:
for op in seg.operands:
if op.is_def && op.flags[+3] & (1 << 4):
check_def_interference(op); // sub_2F28E80
// Regmask check -- type 12 operands
if op.type_byte == 12:
for entry in region_hash:
if bittest(op.mask_data[+24], entry.reg):
// Register killed by mask -- tombstone it
tombstone(entry); // set to -2
The _bittest operation on regmask data at offset +24 identifies which registers are killed by call clobber masks. Killed entries are tombstoned in the tracking hash table (sentinel value -2), removing them from further consideration.
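A sketch of the kill-and-tombstone step, taking the recovered logic at face value (bit set in the mask data means the register is killed) and eliding the +24 offset by passing the mask words directly; the tracked-entry map is a stand-in for the 16-byte-entry region hash:

```python
TOMBSTONE = 0xFFFFFFFE  # -2 sentinel in the tracking hash table

def apply_regmask(mask_words, tracked):
    """Tombstone every tracked register whose bit is set in the
    call-clobber mask (_bittest on 32-bit words)."""
    killed = []
    for reg in list(tracked):
        word, bit = reg >> 5, reg & 31
        if (mask_words[word] >> bit) & 1:
            tracked[reg] = TOMBSTONE   # removed from further consideration
            killed.append(reg)
    return killed
```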
Coalescing and Reassignment Dispatch
The splitting engine dispatches through vtable offsets for coalescing:
// (e) Coalescing / reassignment (lines 917-999)
if vtable[1064](this, region): // tryReassign
markRegUsed(result_reg); // sub_2E88E20
goto DONE;
if vtable[1072](this, region): // canRecolorVirtReg
markRegUsed(result_reg); // sub_2E88E20
goto DONE;
// Also try alternative local split via vtable[480]
vtable[480](this, region, &SmallVectorArgs);
The vtable-indirect calls at offsets [1064] and [1072] correspond to tryReassign and canRecolorVirtReg in upstream LLVM. The offset [480] call is a fallback local split strategy. On success, sub_2E88E20 (markRegUsed) updates the allocation state.
Register Pressure and the -maxreg Constraint
The real allocation constraint on NVPTX is not register scarcity but register pressure -- higher per-thread register usage reduces occupancy, directly impacting throughput through fewer warps available for latency hiding. The -maxreg CLI flag (parsed at sub_900130, stored at compilation context offset +1192) caps the total live register count. Duplicate -maxreg definitions produce the error: "libnvvm : error: -maxreg defined more than once" (sub_9624D0).
Concrete Occupancy Examples
The occupancy formula and cliff table are documented in the GPU Execution Model. Here the relevant values are shown for the -maxreg settings that the allocator targets:
| -maxreg | Regs/Warp | Warps (SM 8.0) | Occupancy | Warps (SM 9.0) | Occupancy |
|---|---|---|---|---|---|
| 32 | 1,024 | 64 | 100% | 48 | 100% |
| 64 | 2,048 | 32 | 50% | 32 | 67% |
| 96 | 3,072 | 21 | 33% | 21 | 44% |
| 128 | 4,096 | 16 | 25% | 16 | 33% |
| 192 | 6,144 | 10 | 16% | 10 | 21% |
| 255 | 8,160 | 8 | 13% | 8 | 17% |
The -maxreg flag sets the ceiling, and the remat infrastructure aggressively reduces pressure below the nearest cliff to avoid losing an entire warp slot.
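The table's arithmetic reduces to a register-file division. The sketch below reproduces both columns under the parameters the table appears to assume (65,536 registers per SM; warp ceilings of 64 for SM 8.0 and 48 for SM 9.0); these hardware constants are illustrative assumptions, not values recovered from the binary:

```python
def occupancy(maxreg, regfile=65536, max_warps=64, warp_size=32):
    """Warps resident per SM and occupancy % for a given -maxreg ceiling."""
    regs_per_warp = maxreg * warp_size
    warps = min(max_warps, regfile // regs_per_warp)
    pct = int(100 * warps / max_warps + 0.5)   # round half up, as the table does
    return warps, pct
```

The cliff structure is visible directly: occupancy drops in integer warp steps, so a one-register reduction only pays off when it crosses a regs_per_warp boundary.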
The remat-for-occ knob (default 120) encodes an occupancy target. When set, the IR-level rematerialization pass (sub_1CE7DD0) calls sub_1C01730 to compute an occupancy-based register target. The heuristic applies a scale factor: if the computed occupancy level exceeds 4, it multiplies the target by 3/2 (effectively allowing more registers when occupancy is already high). If the result still exceeds the ceiling, it applies target = 2*target/3 as a tighter bound.
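That two-step adjustment can be sketched as below. How the base target and occupancy level are derived inside sub_1C01730 is not fully recovered, so both are plain parameters here; integer division mirrors typical compiled arithmetic:

```python
def adjust_remat_target(base_target, occ_level, ceiling):
    """Occupancy-based register target heuristic, per the recovered logic:
    scale by 3/2 above occupancy level 4, then tighten with 2/3 if the
    scaled target overshoots the ceiling."""
    target = base_target
    if occ_level > 4:
        target = target * 3 // 2   # allow more registers at high occupancy
    if target > ceiling:
        target = 2 * target // 3   # tighter bound back toward the ceiling
    return target
```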
ptxas Register Allocation Knobs
In addition to cicc's LLVM-side allocator, ptxas has its own register allocation stage with 72+ dedicated knobs. These are independent of the LLVM greedy allocator and operate on the ptxas-internal IR after PTX parsing:
| ptxas Knob | Description |
|---|---|
| RegAllocRematEnable | Enable ptxas-level rematerialization |
| RegAllocEnableOptimizedRemat | Use optimized remat algorithm |
| RegAllocSpillForceXBlockHoistRefill | Force cross-block spill hoist/refill |
| RegAllocSpillValidateDebug | Validate spill code in debug builds |
| RegAllocDebugConflictDetails | Print conflict details during allocation |
| RegAllocPrintDetails | Print allocation decisions |
| RegAllocPerfDiffBackoff | Back off allocation when perf difference is small |
| RegAllocPerfDiffBackoffBegin/End | Range for perf backoff |
| CTAReconfigMaxRegAlloc | Max registers for CTA reconfiguration |
| MaxRegsForMaxWarp | Register ceiling for maximum warp occupancy |
| RegTgtSelHigherWarpCntHeur | Heuristic favoring higher warp count |
| RegTgtSelLowerWarpCntHeur | Heuristic favoring lower warp count |
| CommonCrossBlockRegLimit | Cross-block register usage limit |
| DisableHMMARegAllocWar | Disable HMMA register allocation workaround |
These ptxas knobs are accessed via nvcc -Xptxas "--knob KnobName=Value". The MaxRegsForMaxWarp and RegTgtSel* knobs directly implement the occupancy-aware allocation strategy at the ptxas level, complementing cicc's -maxreg ceiling.
NVIDIA Rematerialization Knobs (cicc)
NVIDIA provides an extensive set of custom rematerialization knobs to reduce pressure below the target threshold:
| Knob | Default | Description |
|---|---|---|
| nv-remat-default-max-reg | 70 | Default maximum register target |
| nv-remat-max-times | 10 | Max rematerialization iterations |
| nv-remat-block-single-cost | 10 | Single live pull-in cost limit |
| nv-remat-block-max-cost | 100 | Max clone cost for reducing one live |
| nv-remat-block-loop-cost-factor | 20 | Loop body cost scaling factor |
| nv-remat-block-liveout-min-percentage | 70 | Minimum live-out percentage for block remat |
| nv-remat-block-map-size-limit | 6 | Map size limit for block-level remat |
| nv-remat-block-load-cost | 10 | Load cost in Remat Machine Block |
| nv-remat-threshold-for-spec-reg | 20 | Threshold for special register remat |
| load-remat | (flag) | Enable load rematerialization |
| no-mi-remat | (flag) | Disable MI remat for specific functions |
The greedy allocator itself has additional tuning knobs:
| Knob | Default | Description |
|---|---|---|
| split-spill-mode | 1 | 0=default, 1=size, 2=speed |
| lcr-max-depth | 5 | Last chance recoloring max depth |
| lcr-max-interf | 8 | Last chance recoloring max interferences |
| exhaustive-register-search | (flag) | Bypass LCR depth/interference cutoffs |
| enable-deferred-spilling | (flag) | Defer spill code to end of allocation |
| grow-region-complexity-budget | 10000 | growRegion() edge budget |
| split-threshold-for-reg-with-hint | 75 | Split threshold percentage |
Additional rematerialization knobs registered separately include do-remat (default 3), remat-maxreg-ceiling (default 0), remat-single-cost-limit (default 6000), remat-loop-trip (default 20), and remat-for-occ (default 120, targeting higher occupancy).
Spill Cost Computation
Spill costs are computed during driver initialization by sub_2FAD5E0 (step 5 of the driver sequence), which calculates VirtRegAuxInfo spill weights for every virtual register before the main allocation loop begins. The spill weight determines priority in the allocation queue and eviction decisions.
On NVPTX, "spilling" is a misnomer because PTX has no stack spill in the traditional CPU sense -- a spilled value either gets rematerialized (re-computed from inputs) or written to local memory (per-thread DRAM-backed memory, orders of magnitude slower than registers). The cost model therefore heavily penalizes local memory spills and strongly favors rematerialization.
The PriorityAdvisor (looked up via global dword_5023AC8) determines the order in which virtual registers enter the allocation queue. The EvictionAdvisor (looked up via dword_5023BA8) determines when to evict a lower-priority register to make room for a higher-priority one. Both advisors are initialized via vtable [24] calls during driver setup and can be customized via the regalloc-evict and regalloc-priority analysis passes registered in the pipeline parser.
Allocation Failure Handler (sub_2F418E0) -- Three Error Paths
When physical register assignment fails (sub_2F418E0), three error paths exist:
Path 1: Empty Allocation Order
"no registers from class available to allocate"
The register class has zero allocatable registers. This can happen for the internal-only class (off_4A026E0) if the target configuration excludes all environment registers. Diagnostic emitted via sub_B6EB20 (DiagnosticHandler).
Path 2: All Registers Occupied
"ran out of registers during register allocation"
The allocation order exists but all registers are occupied/interfering. This fires when the eviction/split pipeline exhausts all options -- the sequence is: tryAssign -> tryEviction -> tryLastChanceRecoloring -> trySplit -> fail. Uses sub_B2BE50 for source location, sub_B157E0 for DebugLoc, and sub_B158E0 for diagnostic formatting.
Path 3: Inline Assembly Overflow
"inline assembly requires more registers than available"
Special handling for inline asm operands (kind values 1--2 at offset +68). Inline assembly can specify explicit register constraints that consume all available registers in a class, leaving nothing for surrounding code.
FailedRegAlloc Flag
All three paths set the FailedRegAlloc flag (bit 10 in MachineFunction properties, sub_2E78A80). This flag allows downstream passes to handle the failure gracefully rather than crashing. Passes that check this flag can skip optimization or emit degraded but correct code.
The RAGreedy Driver
The top-level driver (sub_2F5A640) orchestrates the full allocation pass:
1. Store MachineFunction at a1[96], retrieve SubTarget (vtable +128).
2. Optional debug dump: "Before greedy register allocator".
3. sub_35B4B20 -- calculate register class info.
4. sub_2F55040 -- check if any virtual registers need allocation.
5. sub_2FAD5E0 -- set up spill costs.
6. sub_2F54D60 -- compute live intervals.
7. Query vtable +328 for getRegPressureSetLimit (stored at a1[3633]).
8. Look up EvictionAdvisor (dword_5023BA8) and PriorityAdvisor (dword_5023AC8) via std::map lookups.
9. Initialize advisors via vtable [24].
10. Allocate InterferenceCache (0x2C0 bytes, sub_2FB0E40).
11. Allocate SplitAnalysis (0x738 bytes, sub_2FB1ED0).
12. sub_3501A90 -- set up RegAllocMatrix.
13. Initialize PhysRegEntries array (32 entries, 144-byte stride).
14. sub_2F55730 -- reset priority queue.
15. sub_35B5380 -- seed queue from virtual registers.
16. sub_2F58C00 -- main allocation loop.
17. Optional debug dump: "Before post optimization".
18. Post-allocation optimization via vtable [24].
19. sub_2F5A580, sub_2F50510 -- finalize.
Differences from Upstream LLVM
The following table summarizes where CICC's register allocator diverges from upstream LLVM 20.0.0 RAGreedy:
| Aspect | Upstream LLVM 20 | CICC v13.0 |
|---|---|---|
| Primary constraint | Fixed physical register set (CPU ISA-defined) | Pressure ceiling via -maxreg; no fixed physical registers |
| Register classes | Often overlapping (e.g., GR32 is a subset of GR64 on x86) | 9 completely disjoint classes; no cross-class interference |
| Spill destination | Stack frame (cheap, L1/L2 latency) | Local memory (DRAM-backed, 100x+ latency) or rematerialization |
| Rematerialization | LLVM built-in MachineInstr::isRematerializable() | Massive custom infrastructure: 11+ nv-remat-* knobs, separate IR-level remat pass (sub_1CE7DD0), iterative pressure reduction loop |
| Occupancy awareness | None -- CPU has no occupancy concept | remat-for-occ (default 120) drives occupancy-targeted register reduction; MaxRegsForMaxWarp ptxas knob |
| Interference cache hash | Standard LLVM DenseMap with (ptr >> 4) ^ (ptr >> 9) | Custom open-addressing map with 37 * reg hash, -1/-2 sentinels |
| Operand stride | 32 bytes (MachineOperand size) | 40 bytes (8-byte NVPTX extension for class tag + flags) |
| Dual pass manager | Single implementation used by both old and new PM | Two complete copies: Instance A at 0x1EC0400, Instance B at 0x2F4C2E0 |
| Register encoding | LLVM MCRegister (16-bit class + index) | 32-bit: 4-bit class tag in [31:28], 28-bit index in [27:0] |
| Spill weight formula | length / (spill_cost * block_freq) | Same formula, but cost model penalizes local memory heavily; rematerialization candidates get near-zero weight |
| Last-chance recoloring | Same knobs, but rarely critical | Frequently exercised due to tight -maxreg ceilings; exhaustive-register-search flag more relevant |
| Post-RA remat | Minimal | ptxas performs a second register allocation with its own 72+ knobs (RegAllocRematEnable, etc.) |
| Splitting strategy | Region-based splitting (splitAroundRegion) | Same algorithm, but gap flag (bit 2) and sub-range flag (bit 3) in 40-byte segment entries use NVPTX-specific encoding |
| Callee-saved registers | CSR-first-time-cost matters for ABI compliance | NVPTX has no callee-saved convention; regalloc-csr-first-time-cost is effectively dead code |
| Debug strings | "Before greedy register allocator" | Same string, but emitted conditionally on unk_503FCFD (a debug flag at a fixed BSS address) |
What Upstream LLVM Gets Wrong for GPU
Upstream LLVM's register allocation framework was designed for CPU targets where the register file is a fixed, small, physically-interfering resource. Every core assumption breaks on NVPTX:
- Upstream assumes spills are cheap (L1/L2 latency). On x86/AArch64, a spill is a store to the stack frame backed by L1 cache (3-5 cycles). On GPU, a "spill" writes to local memory backed by device DRAM at 200-800 cycle latency. This 40-160x penalty makes rematerialization nearly always preferable to spilling, which is why NVIDIA ships 11+ custom nv-remat-* knobs and an iterative remat loop that has no upstream equivalent.
- Upstream assumes a fixed physical register set with cross-class interference. CPU ISAs have a static register file (e.g., 16 GPRs on x86-64) where GR32 is a sub-register of GR64 and allocating one constrains the other. NVPTX has no fixed register count and its nine register classes are completely disjoint -- allocating %r5 (Int32Regs) never conflicts with %f5 (Float32Regs). The entire interference-graph framework is solving the wrong problem.
- Upstream has no concept of occupancy. CPU register allocation never reduces parallelism -- a function uses N registers and that is the end of the story. On GPU, every additional register per thread can cross an occupancy cliff, losing an entire warp group and halving throughput. The allocator must minimize pressure to a target, not just avoid running out of registers.
- Upstream assumes one allocation pass produces the final assignment. On CPU, LLVM's greedy RA emits final machine code. On NVPTX, cicc's allocator emits PTX with virtual registers bounded by -maxreg, and then ptxas performs an entirely separate second allocation pass with its own 72+ knobs to map virtual PTX registers to hardware resources. The LLVM allocator is half the pipeline, not the whole thing.
- Upstream's callee-saved register convention is irrelevant. CPU ABIs define callee-saved sets (e.g., rbx, rbp on SysV x86-64) that the allocator must respect. NVPTX has no callee-saved convention at all -- there is no hardware call stack for registers. The regalloc-csr-first-time-cost knob is dead code on this target.
Common Pitfalls
These are mistakes a reimplementor is likely to make when building a register allocator for an NVPTX-like GPU target.
1. Treating register allocation as an assignment problem instead of a pressure problem. On CPU targets, the allocator must map N virtual registers to K physical registers, and the problem is coloring a fixed interference graph. On NVPTX, there is no fixed physical register file -- PTX registers are virtual and unlimited. The real constraint is the -maxreg ceiling, which controls occupancy. A reimplementation that tries to assign physical registers will produce correct but meaningless output; the correct approach is to minimize peak live register count below the -maxreg threshold, and let ptxas handle the final hardware mapping.
2. Ignoring occupancy cliffs when setting the register target. Going from 64 to 65 registers per thread crosses an occupancy cliff that drops the number of active warps on SM 8.0 from 32 (50% occupancy) to 21 (33%). A reimplementation that treats the register ceiling as a hard binary constraint (under = good, over = bad) will miss the fact that reducing from 65 to 64 registers is worth enormous effort (roughly a 50% gain in available warps), while reducing from 63 to 62 is nearly worthless. The remat-for-occ knob (default 120) exists specifically to drive rematerialization toward the nearest cliff boundary, not just toward the ceiling.
3. Using CPU-calibrated spill costs. On x86, a spill is a store to L1-cached stack memory at 3-5 cycle latency. On GPU, a "spill" writes to per-thread local memory backed by device DRAM at 200-800 cycle latency -- a 40-160x penalty. A reimplementation that uses upstream LLVM's default spill cost formula without recalibrating for GPU memory latency will spill aggressively when it should rematerialize. NVIDIA's 11+ nv-remat-* knobs and the iterative rematerialization loop exist because rematerialization is almost always cheaper than spilling on GPU.
4. Assuming cross-class register interference exists. NVPTX's nine register classes are completely disjoint: Int32Regs (%r) never conflicts with Float32Regs (%f), Int64Regs (%rd) never conflicts with Float64Regs (%fd), and so on. A reimplementation that builds a global interference graph spanning all classes will waste significant compile time computing interference relationships that are always empty. The correct approach is per-class allocation with independent pressure tracking.
5. Forgetting that cicc's allocation is only half the pipeline. The LLVM greedy allocator in cicc emits PTX with virtual registers bounded by -maxreg. Then ptxas performs an entirely separate second allocation pass with its own 72+ knobs to map virtual PTX registers to hardware resources. A reimplementation that tries to produce final hardware register assignments at the LLVM level is solving the wrong problem -- the output should be well-pressure-managed virtual registers, not hardware assignments.
Diagnostic Strings
Diagnostic strings recovered from the register allocation binary region (p2c.5-01-register-alloc.txt) and the rematerialization passes (p2b.2-01-remat-ir.txt, p2b.2-02-remat-machine.txt).
Allocation Failure Diagnostics
| String | Source | Category | Trigger |
|---|---|---|---|
"no registers from class available to allocate" | sub_2F418E0 path 1 | Error | Register class has zero allocatable registers; emitted via sub_B6EB20 (DiagnosticHandler) |
"ran out of registers during register allocation" | sub_2F418E0 path 2 | Error | All registers occupied/interfering after tryAssign -> tryEviction -> tryLastChanceRecoloring -> trySplit exhausted |
"inline assembly requires more registers than available" | sub_2F418E0 path 3 | Error | Inline asm explicit register constraints consume all available registers in a class |
"libnvvm : error: -maxreg defined more than once" | sub_9624D0 | Error | Duplicate -maxreg CLI flag definitions |
Debug/Trace Diagnostics
| String | Source | Category | Trigger |
|---|---|---|---|
"Before greedy register allocator" | sub_2F5A640 step 2 | Debug | Conditional on unk_503FCFD debug flag |
"Before post optimization" | sub_2F5A640 step 17 | Debug | Post-allocation debug dump |
"Before register coalescing" | sub_2F60C50 | Debug | Register coalescer debug dump |
"After register coalescing" | sub_2F60C50 | Debug | Register coalescer debug dump |
Rematerialization Diagnostics (nv-remat-block)
| String | Source | Category | Trigger |
|---|---|---|---|
"Skip machine-instruction rematerialization on <name>" | sub_1CE7DD0 region | Debug | Function name matches no-mi-remat skip list |
"Max-Live-Function(<num_blocks>) = <max_live>" | remat-block step 10 | Debug | Reports maximum live register count across all blocks |
"live-out = <count>" | remat-block step 7 | Debug | Per-block live-out register count |
"Pullable: <count>" | remat-block step 5 | Debug | Number of pullable (rematerializable) instructions |
"Total Pullable before considering cost: <count>" | remat-block step 8 | Debug | Total pullable candidates before cost filtering |
"Really Final Pull-in: <count> (<total_cost>)" | remat-block step 11 | Debug | Final rematerialization candidate count and total cost |
"After pre-check, <N> good candidates, <M> given second-chance" | remat two-phase selection | Debug | Two-phase candidate selection with second-chance |
"ADD <N> candidates from second-chance" | remat two-phase selection | Debug | Candidates recovered from second-chance pass |
"\treplaced" | remat code emission | Debug | Rematerialized instruction replacement confirmation |
Pass Registration Strings
| String | Source |
|---|---|
"Greedy Register Allocator" | Pass name for both Instance A (0x1EC0400) and Instance B (0x2F4C2E0) |
"Register Coalescer" | sub_2F60C50 pass registration |
"nv-remat-block" | ctor_361_0 at 0x5108E0 -- machine-level remat pass registration |
"Legacy IR Remat" | sub_1CE7DD0 region -- IR-level remat pass display name |
"nvvmrematerialize" | IR-level remat pass pipeline ID |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| RAGreedy::runOnMachineFunction | sub_2F5A640 | -- | Top-level driver (466 lines) |
| RAGreedy::selectOrSplit | sub_2F49070 | -- | Core allocator (82KB, 2,314 lines) |
| selectOrSplit thunk | sub_2F4BAF0 | -- | Redirects to sub_2F49070(this+200) |
| selectOrSplit + SplitEditor | sub_2F4BB00 | -- | Spill-or-split path |
| SplitEditor::splitAroundRegion | sub_2F2D9F0 | -- | Live range splitting (93KB) |
| tryLocalSplit | sub_2F2A2A0 | -- | Local split within single BB |
| materializeSplitSegment | sub_2FDF330 | -- | Insert split segments |
| scanInterference | sub_2F43DC0 | -- | Populate interference cache |
| tryAssign | sub_2F47B00 | -- | Simple assignment path |
| tryEviction | sub_2F48CE0 | -- | Evict conflicting VReg |
| tryLastChanceRecoloring | sub_2F46530 | -- | Recursive recoloring fallback |
| processConstrainedCopies | sub_2F47200 | -- | Handle tied-operand COPYs |
| rehashInterferenceTable | sub_2F46EE0 | -- | Interference cache rehash |
| rehashCoalescingTable | sub_2F46A90 | -- | Coalescing hint table rehash |
| markRegReserved | sub_2F42840 | -- | Mark unit as reserved |
| recordAssignment | sub_2F42240 | -- | Record successful assignment |
| updateInterferenceCache | sub_2F424E0 | -- | Insert conflict entry |
| recordCoalescingHint | sub_2F41240 | -- | Record parent-chain hint |
| collectHintInfo | sub_2F434D0 | -- | Gather all hints for priority |
| assignRegFromClass | sub_2F418E0 | -- | Allocation failure handler |
| hasVRegsToAllocate | sub_2F55040 | -- | Pre-flight check |
| computeLiveIntervals | sub_2F54D60 | -- | Build live interval data |
| resetPriorityQueue | sub_2F55730 | -- | Clear and re-init queue |
| mainAllocationLoop | sub_2F58C00 | -- | Per-VReg dispatch loop |
| finalize | sub_2F50510 | -- | Post-allocation cleanup |
| setupSpillCosts | sub_2FAD5E0 | -- | Compute VirtRegAuxInfo weights |
| InterferenceCache::init | sub_2FB0E40 | -- | Allocate 0x2C0-byte cache |
| SplitAnalysis::init | sub_2FB1ED0 | -- | Allocate 0x738-byte analysis |
| setupRegAllocMatrix | sub_3501A90 | -- | Build the global interference matrix |
| calculateRegClassInfo | sub_35B4B20 | -- | Pre-compute class sizes/orders |
| seedQueueFromVRegs | sub_35B5380 | -- | Initial queue population |
| RegisterCoalescer::runOnMachineFunction | sub_2F71140 | -- | Register coalescing (80KB) |
| printMachineProperties | sub_2E78A80 | -- | Includes FailedRegAlloc flag |
| encodeVirtualReg | sub_21583D0 | -- | `CLASS_BITS \| ...` VReg encoding |
| emitCopyInstruction | sub_2162350 | -- | Class-specific copy opcodes |
Reimplementation Checklist
- Pressure-driven allocation model. Replace the standard assignment-to-physical-registers model with a pressure-tracking model: PTX registers are virtual, so the allocator must track and bound total live register count per class against the `-maxreg` ceiling (default 70) rather than assigning to a finite physical register set.
- Nine disjoint register classes. Define the nine NVPTX register classes (Int1Regs, Int16Regs, Int32Regs, Int64Regs, Float32Regs, Float64Regs, Int16HalfRegs, Int32HalfRegs, Int128Regs) with complete cross-class disjointness -- no interference between classes, class-specific copy opcodes, and per-class pressure tracking.
- Greedy selectOrSplit with NVPTX adaptations. Implement the core allocation loop: per-unit RegUnitStates array (free/interfering/reserved), interference cache with `37 * reghash` hashing, 40-byte-stride operand scanning, copy coalescing hints (kinds 20/21), and live-through bitvector for detecting worst-case live ranges.
- Live range splitting with SplitKit. Implement `splitAroundRegion` (93KB equivalent): identify split points at block boundaries and within blocks, create sub-ranges with new virtual registers, insert copies at split points, and update the interference cache.
- Eviction and last-chance recoloring. Implement `tryEviction` (compare spill weights to decide whether evicting a conflicting VReg is cheaper) and `tryLastChanceRecoloring` (recursive reassignment bounded by `lcr-max-depth=5` and `lcr-max-interf=8`).
- Occupancy-aware spill cost computation. Weight spill costs by occupancy impact: spills to local memory (device DRAM, 200--800 cycle latency) must account for the GPU-specific penalty, and the register ceiling must respect occupancy cliff boundaries.
- Dual pass manager instances. Register the allocator for both legacy and new pass managers, ensuring both instances share the same NVPTX-specific hooks (custom rematerialization interaction, pressure-driven priority queues, `maxreg` enforcement).
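To make the first checklist item concrete, here is a minimal sketch of a per-class pressure tracker bounded by a `-maxreg`-style ceiling. The class names and the default of 70 come from this page; the `PressureTracker` type and its methods are hypothetical, not recovered from the binary.

```cpp
#include <array>
#include <cstdint>

// Hypothetical sketch: bound live-register pressure per class against a
// -maxreg-style ceiling instead of assigning to physical registers.
enum RegClass { Int1, Int16, Int32, Int64, Float32, Float64,
                Int16Half, Int32Half, Int128, NumClasses };

struct PressureTracker {
    std::array<std::uint32_t, NumClasses> live{}; // current live count per class
    std::uint32_t maxreg = 70;                    // -maxreg default per this page

    // Returns false when one more live register of this class would exceed
    // the ceiling -- the caller must then spill, split, or rematerialize.
    bool tryAddLive(RegClass rc) {
        if (live[rc] + 1 > maxreg) return false;
        ++live[rc];
        return true;
    }
    void endLive(RegClass rc) { if (live[rc]) --live[rc]; }
};
```

Because the nine classes are fully disjoint, hitting the ceiling in `Int32Regs` says nothing about `Float32Regs` -- each class carries its own counter.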
Architectural Uniqueness
NVPTX's register allocation differs from all other LLVM targets in several fundamental ways:
- Unlimited virtual registers: PTX has no fixed register count. The allocator manages pressure, not assignment to a finite set of physical registers.
- Complete class separation: The nine register classes are fully disjoint. An `Int32Regs` allocation never conflicts with a `Float32Regs` allocation.
- Pressure as the primary constraint: The `-maxreg` ceiling and NVIDIA's custom rematerialization infrastructure (`nv-remat-*` knobs) exist specifically to control occupancy, which has no equivalent in CPU register allocation.
- Two-stage allocation: cicc performs LLVM greedy RA to emit PTX with virtual registers bounded by `-maxreg`, then ptxas performs a second allocation pass with its own 72+ knobs to map virtual PTX registers to hardware resources.
- Dual implementation: Two complete RA copies exist (old at `0x1E*`--`0x1F*`, new at `0x2F*`--`0x35*`), one per pass manager generation.
ptxas Interaction
Register allocation in cicc is the first of two allocation stages. cicc's greedy RA assigns virtual PTX registers (%r0, %f3, etc.) bounded by the -maxreg ceiling to control occupancy, but these are not hardware registers -- they are symbolic names in the PTX text. ptxas then performs its own complete register allocation pass, mapping cicc's virtual registers onto the SM's physical register file (e.g., 255 32-bit registers per thread on SM 80+). ptxas has 72+ RA-related knobs (RegAllocScheme, DynamicRegAlloc, RegUsageOpt, etc.) and may split, coalesce, or spill registers differently than cicc anticipated. The -maxreg value cicc enforces serves as a hint to ptxas about the desired occupancy target, but ptxas makes the final hardware binding decision.
PrologEpilogInserter & Frame Layout
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
NVIDIA GPUs have no hardware stack pointer. There is no push, no pop, no %rsp — the entire concept of a "stack frame" is a compiler fiction. When a CUDA kernel needs local storage (spill slots, alloca, local arrays), cicc allocates a byte array called __local_depot in PTX .local address space and computes all offsets at compile time. The PrologEpilogInserter (PEI) pass is responsible for this: it takes abstract MachineFrameInfo frame indices produced by register allocation and earlier lowering, assigns concrete byte offsets within the depot, emits the two-instruction prologue that sets up the %SP/%SPL pseudo-registers, and rewrites every frame-index operand in the MachineFunction to [%SP + offset] form. At 68 KB and ~2,400 decompiled lines, cicc's PEI is a heavily modified monolith — the upstream open-source NVPTX backend replaces LLVM's standard PEI with a stripped-down 280-line NVPTXPrologEpilogPass that handles only offset calculation and frame-index elimination. cicc restores and extends nearly all of the standard PEI's functionality: callee-saved register handling, register scavenging, bitmap-based frame packing, categorized layout ordering, and a stack-size diagnostic system.
| Property | Value |
|---|---|
| Binary address | sub_35B1110 (0x35B1110) |
| Binary size | 68,332 bytes (~2,388 decompiled lines) |
| Pass identity | PrologEpilogInserter::runOnMachineFunction |
| Pass position | Post-register-allocation, before NVPTXPeephole |
| Stack frame | 0x490 bytes of local state (~400 variables) |
| Upstream equivalent | NVPTXPrologEpilogPass (280 lines) + NVPTXFrameLowering (101 lines) |
| Key strings | "warn-stack-size", "stack frame size" |
| Knobs | warn-stack-size (function attribute), nvptx-short-ptr, nv-disable-mem2reg |
The GPU "Stack" Model
__local_depot: The Frame Array
Every PTX function that needs local storage declares a .local byte array:
.local .align 16 .b8 __local_depot0[256];
This is the entire "stack frame." The alignment value is the maximum alignment of any object in the frame. The size is the total frame size computed by PEI. The suffix number (0, 1, ...) is the function index within the module.
There is no call stack in the CPU sense. GPU threads have a fixed local memory allocation (typically 512 KB per thread on modern architectures). The .local directive reserves a region within this per-thread memory. Recursive functions and dynamic allocations are legal in PTX but the driver/ptxas resolves their addresses — cicc only needs to produce a statically-sized depot for each function's fixed-size locals.
%SP and %SPL: The Two Frame Pseudo-Registers
PTX declares two pseudo-register pairs for frame access:
.reg .b64 %SP; // generic address space pointer to the frame
.reg .b64 %SPL; // local address space (AS 5) pointer to the frame
In 32-bit mode these are .reg .b32. The distinction exists because NVIDIA GPUs use address space qualification:
- `%SPL` (Stack Pointer Local) — points directly into the `.local` address space (PTX address space 5). Loads/stores using `%SPL` emit `ld.local`/`st.local` instructions, which ptxas can optimize for the L1 cache local-memory path. This is the efficient pointer.
- `%SP` (Stack Pointer) — a generic address space pointer obtained by converting `%SPL` via `cvta.local`. Loads/stores using `%SP` go through generic address resolution, which adds a TLB lookup to determine the address space at runtime. This is slower but required when the address escapes to code that expects generic pointers (e.g., passing a local variable's address to a called function).
The prologue sequence is:
mov.u64 %SPL, __local_depot0; // MOV_DEPOT_ADDR_64
cvta.local.u64 %SP, %SPL; // cvta_local_64
The cvta.local (Convert Address) instruction is the key: it takes a .local pointer and produces the equivalent generic-space pointer. When nvptx-short-ptr is enabled, %SPL is 32 bits (sufficient for the per-thread local memory window, always < 4 GB) while %SP may still be 64 bits on 64-bit targets.
Upstream's NVPTXFrameLowering::emitPrologue implements this directly. It checks MachineRegisterInfo::use_empty for each register — if %SP has no uses, it skips the cvta.local; if %SPL has no uses, it skips the mov.depot. The NVPTXPeephole pass runs immediately after PEI and rewrites LEA_ADDRi64 %VRFrame64, offset followed by cvta_to_local_64 into LEA_ADDRi64 %VRFrameLocal64, offset, eliminating the generic-to-local conversion when the address stays in local space.
Frame Index Resolution
During instruction selection and register allocation, local memory references use abstract frame indices: %stack.0, %stack.1, etc. Each maps to a MachineFrameInfo frame object with a size, alignment, and (after PEI) a byte offset.
Frame-index elimination in upstream is simple — NVPTXRegisterInfo::eliminateFrameIndex replaces the frame-index operand with VRFrame (which prints as %SP) and sets the immediate offset:
MI.getOperand(FIOperandNum).ChangeToRegister(getFrameRegister(MF), false);
MI.getOperand(FIOperandNum + 1).ChangeToImmediate(Offset);
The VRDepot physical register (prints as %Depot internally) serves as the canonical frame base in getFrameIndexReference. For debug info, %Depot is remapped to %SP since cuda-gdb resolves stack frames via the generic pointer.
Frame Layout Algorithm
cicc's PEI executes in ten sequential phases within a single monolithic function. The algorithm is significantly more sophisticated than upstream's linear scan.
Phase 1–2: Setup and Callee-Saved Registers (lines 443–566)
Retrieves the TargetFrameLowering and TargetRegisterInfo from the MachineFunction's subtarget. If callee-saved registers exist (determined by vtable(FrameLowering, +480)), allocates a 0xA8-byte callee-save info structure at PEI state offset +200 containing two inline SmallVectors for register indices.
On GPU targets, callee-saved registers are unusual — PTX functions use a fully virtual register file, so there is no hardware register saving in the CPU sense. However, cicc models device-function calling conventions that may require preserving certain virtual registers across calls, and this mechanism handles that.
Phase 3: Fixed Object Collection (lines 567–730)
Initializes a chunk table (deque-like structure) with -4096 sentinel values. Collects prolog/epilog insertion points from the PEI state arrays at offsets +216 (prolog points, count at +224) and +264 (epilog points, count at +272).
When callee-saves exist and optimization level is not 20 (a special threshold), manually inserts save/restore instructions:
- Simple saves: `storeRegToStackSlot(MBB, MI, reg, kill=1, FI, RC, TRI)`
- Compound saves: handles sub-register decomposition via `sub_2F26260` when `byte+9 == 1` in the callee-save info.
Phase 4: Offset Assignment — The Core Layout Engine (lines 733–1070)
This is the heart of PEI. It assigns byte offsets within __local_depot to every frame object.
MachineFrameInfo layout:
StackDirection: 1 = grows-negative (toward lower addresses)
0 = grows-positive (toward higher addresses)
LocalFrameSize: initial offset base
NumFixedObjects: count of pre-positioned objects
MaxAlignment: tracks largest alignment seen
Fixed objects are laid out first. Each frame object is a 40-byte record:
| Offset | Type | Field |
|---|---|---|
| +0 | i64 | Byte offset (written by PEI) |
| +8 | i64 | Object size in bytes |
| +16 | u8 | Alignment (log2) |
| +20 | u8 | isDead flag |
| +32 | u8 | isSpillSlot flag |
| +36 | u8 | Category (0–3) |
The alignment formula appears ~20 times throughout the pass:
// Round up 'value' to next multiple of (1 << align_log2):
aligned = -(1 << align_log2) & (value + (1 << align_log2) - 1);
// Equivalent to: aligned = (value + mask) & ~mask where mask = (1<<n) - 1
For grows-negative direction, offsets are stored as negative values; for grows-positive, they accumulate upward.
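The two spellings of the round-up formula are bit-for-bit identical; a small self-contained check (illustrative only, function names are not from the binary):

```cpp
#include <cstdint>

// Round 'value' up to the next multiple of (1 << align_log2).
// Negation form, as it appears ~20 times in the decompiled pass:
std::uint64_t alignUpNeg(std::uint64_t value, unsigned align_log2) {
    std::uint64_t a = 1ull << align_log2;
    return (-a) & (value + a - 1);     // -(1<<n) & (v + (1<<n) - 1)
}

// Conventional mask form, the textbook equivalent:
std::uint64_t alignUpMask(std::uint64_t value, unsigned align_log2) {
    std::uint64_t mask = (1ull << align_log2) - 1;
    return (value + mask) & ~mask;     // (v + mask) & ~mask
}
```

For example, aligning 13 to a 16-byte boundary (`align_log2 = 4`) yields 16 under both forms, and values already aligned are returned unchanged.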
Callee-saved region is laid out next, iterating frame indices in range [PEI+208 .. PEI+212]. Each CSR object gets an aligned offset using the same formula.
Separate stack area: if MachineFrameInfo+665 flag is set, NVIDIA supports a physically separate stack region with its own alignment at +664 and total size at +656. This likely corresponds to a distinct .local segment for shared-memory scratch or ABI-reserved zones.
Phase 5: Categorized Local Variable Layout (lines 1060–1600)
This is cicc's most significant divergence from upstream PEI. Objects are classified into three priority buckets by a category byte at frame-object offset +36:
| Category | Bucket | Typical contents | Layout order |
|---|---|---|---|
| 3 | v427 | Vector/tensor spills (high alignment) | First |
| 2 | v419 | Medium-aligned objects | Second |
| 1 | v412 | General locals | Third |
| 0 | — | Skip (already placed or dead) | — |
Each bucket is processed by sub_35B0830 which assigns aligned offsets. The ordering minimizes alignment waste: laying out large-alignment objects first avoids padding gaps.
Objects are skipped if:
- They are spill slots in a separate stack area
- They fall within the callee-saved index range
- Their size is -1 (sentinel for dynamic-size objects)
- They are the frame-pointer object
- They are dead
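The category-ordered placement described above can be sketched as follows. This is a simplified model under stated assumptions: the bucket order (3 → 2 → 1) and the 40-byte record's size/alignment/category fields come from this page, while the `FrameObj` struct and `layoutCategorized` function are illustrative names, not recovered symbols.

```cpp
#include <cstdint>
#include <vector>

// Illustrative 3-bucket categorized layout: objects are placed in
// category order 3 -> 2 -> 1, each aligned to its own requirement,
// so high-alignment objects land first and padding waste shrinks.
struct FrameObj {
    std::uint64_t size;
    unsigned align_log2;
    int category;                  // 3 = vector spills, 2 = medium, 1 = general
    std::uint64_t offset = 0;      // written by layout, like record field +0
};

std::uint64_t layoutCategorized(std::vector<FrameObj>& objs) {
    std::uint64_t cur = 0;
    for (int cat = 3; cat >= 1; --cat) {        // bucket order per the table
        for (auto& o : objs) {
            if (o.category != cat) continue;
            std::uint64_t a = 1ull << o.align_log2;
            cur = (cur + a - 1) & ~(a - 1);      // align up
            o.offset = cur;
            cur += o.size;
        }
    }
    return cur;                                  // frame size before final rounding
}
```

Laying out a 16-byte-aligned spill, then an 8-byte-aligned object, then a 4-byte local in that order produces a gap-free 28-byte frame, whereas the reverse order would leave alignment padding between objects.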
Bitmap-Based Packing — When register count is nonzero and canUseStackBitmap returns true (frame size <= 0x7FFFFFFF), cicc builds a bitset representing every byte of the frame:
// Bitmap size in qwords:
bitmap_size = (frame_size + 63) >> 6;
// Mark all bytes as free (bits set to 1)
// Then clear bits for fixed objects and CSR objects
for each placed_object:
clear bits [offset .. offset + size)
For each unassigned general object, the algorithm scans the bitmap using tzcnt (trailing zero count) to find contiguous runs of set bits that match the object's size and alignment:
for each unassigned_obj in v412:
candidate = tzcnt_scan(bitmap, obj.size);
if (candidate != NOT_FOUND):
// Verify alignment
if aligned(candidate, obj.alignment):
// Verify all bits available (inner loop)
if all_bits_set(bitmap, candidate, candidate + obj.size):
assign_offset(obj, candidate);
clear_bits(bitmap, candidate, candidate + obj.size);
continue;
// Fallback: linear allocation at end of frame
offset = align(running_offset);
assign_offset(obj, offset);
running_offset += obj.size;
This is substantially more aggressive than both upstream LLVM PEI (which does a single linear pass) and the upstream NVPTX PrologEpilogPass (which has no packing at all). It enables reuse of "holes" left by fixed objects, callee-saves, and dead objects.
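A runnable rendering of the bitmap pseudocode above, under stated assumptions: one bit per frame byte, bit set = free, and a linear aligned scan standing in for the binary's tzcnt-driven candidate search. All type and function names here are illustrative.

```cpp
#include <cstdint>
#include <vector>

// Sketch of bitmap hole-finding: the frame is modeled as a bitset with
// one bit per byte (set = free). findHole looks for a contiguous run of
// free bytes of the requested size at the requested alignment; -1 means
// no hole, triggering the linear fallback at the end of the frame.
struct FrameBitmap {
    std::vector<std::uint64_t> bits;
    std::uint64_t size;
    explicit FrameBitmap(std::uint64_t frameSize)
        : bits((frameSize + 63) >> 6, ~0ull), size(frameSize) {}
    bool isFree(std::uint64_t i) const {
        return (bits[i >> 6] >> (i & 63)) & 1;
    }
    void clearRange(std::uint64_t b, std::uint64_t e) {   // mark bytes used
        for (std::uint64_t i = b; i < e; ++i)
            bits[i >> 6] &= ~(1ull << (i & 63));
    }
    std::int64_t findHole(std::uint64_t objSize, std::uint64_t align) const {
        for (std::uint64_t c = 0; c + objSize <= size; c += align) {
            bool ok = true;
            for (std::uint64_t i = c; i < c + objSize; ++i)
                if (!isFree(i)) { ok = false; break; }     // verify all bits set
            if (ok) return (std::int64_t)c;
        }
        return -1;
    }
};
```

Clearing the ranges occupied by fixed objects and callee-saves, then calling `findHole` per unassigned object, reproduces the hole-reuse behavior: a dead object's freed bytes become a legal home for a later, smaller object.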
Phase 6: Final Alignment and Frame Size (lines 1688–1795)
After all objects are laid out:
- If `targetHandlesStackFrameRounding` returns true, skip to finalization.
- Add `MaxCallFrameSize` to the running offset if the function adjusts the stack.
- Choose alignment: `StackAlign` (from `TFI.getStackAlign()`) for functions with calls or alloca, or `TransientStackAlign` for leaf functions. The subtarget stores these at `FrameLowering[12]` and `[13]` respectively.
- Round up: `final = align(running_offset, max(StackAlign, MaxAlignment))`.
- If alignment changed the total and direction is grows-negative, shift all callee-save offsets by the delta to maintain correct relative positions.
- Write `FrameInfo.StackSize = final_offset - initial_offset`.
This value becomes the SIZE in .local .align ALIGN .b8 __local_depotN[SIZE].
Phase 7: Prologue/Epilogue Insertion (lines 1803–1872)
Executed when optimization level is not at threshold 20. For each prolog insertion point, calls emitPrologue(MF, MBB) via RegisterInfo vtable at +96. For each epilog point, calls emitEpilogue(MF, MBB) at +104.
Post-fixup via sub_35AC7B0, then a second pass over prolog points for insertPrologueSaveCode (vtable +152, if not a null stub).
Architecture-specific extension: checks (*(Module+2) >> 4) & 0x3FF == 0xB (SM arch code 11). When matched, calls an additional prolog handler at vtable +176. This likely targets an early or internal SM variant.
Phase 8–9: Frame Index Elimination (lines 1873–2268)
Two strategies selected by vtable(FrameLowering, +616):
Forward elimination (Path A): walks each MBB's instruction list forward. For each instruction, checks the opcode against FRAME_SETUP and FRAME_DESTROY pseudos — these adjust the SP offset tracker. For other instructions, scans operands for type-5 (FrameIndex), then calls sub_35ABF20 to attempt elimination or falls back to the target-specific handler.
Backward elimination (Path B): same logic but iterates instructions in reverse order. Handles FRAME_SETUP/FRAME_DESTROY with different SP adjustment accumulation.
This dual-path approach is unique to cicc — upstream NVPTX PrologEpilogPass only does a single backward walk. The forward path may be needed for instructions where the SP adjustment at a given point depends on preceding pseudo-ops.
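The forward path's bookkeeping can be sketched as a running SP-delta tracker. This is a hypothetical model: the `Op` tags, `Instr` shape, and `resolveForward` function are illustrative, standing in for the FRAME_SETUP/FRAME_DESTROY pseudo handling and the `sub_35ABF20` elimination call described above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the forward-elimination walk: FRAME_SETUP and
// FRAME_DESTROY pseudos adjust a running SP delta, and each frame-index
// use resolves to (object offset + current delta).
enum Op { FRAME_SETUP, FRAME_DESTROY, USE_FI, OTHER };
struct Instr { Op op; std::int64_t imm; };  // imm: adjustment or object offset

std::vector<std::int64_t> resolveForward(const std::vector<Instr>& mbb) {
    std::vector<std::int64_t> resolved;
    std::int64_t spDelta = 0;
    for (const auto& mi : mbb) {
        switch (mi.op) {
        case FRAME_SETUP:   spDelta += mi.imm; break;              // stack grows
        case FRAME_DESTROY: spDelta -= mi.imm; break;              // stack shrinks
        case USE_FI:        resolved.push_back(mi.imm + spDelta); break;
        default:            break;
        }
    }
    return resolved;
}
```

The backward path would walk the same list in reverse with the opposite accumulation sense; a frame-index use between a setup/destroy pair resolves to a different effective offset than one outside it, which is exactly why the point-in-time delta matters.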
Phase 10: Diagnostics and Cleanup (lines 2270–2388)
Stack size warning: default threshold is 0xFFFFFFFF (4 GB, effectively disabled). If the function has a "warn-stack-size" attribute, it parses the value via strtoul(str, &end, 10). When the total frame size (plus optional regspill area at MF+86*wordsize if opt-level flag 55 is set) exceeds the threshold, emits a "stack frame size" diagnostic.
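The threshold check reduces to a few lines; a hedged sketch (the `strtoul(str, &end, 10)` call and the 0xFFFFFFFF default are from the decompilation, the wrapper function itself is illustrative):

```cpp
#include <cstdint>
#include <cstdlib>

// Sketch of the Phase 10 check: parse the "warn-stack-size" attribute
// value with strtoul and compare total frame size against it. The
// default threshold 0xFFFFFFFF effectively disables the warning.
bool exceedsStackWarnThreshold(std::uint64_t frameSize, const char* attrValue) {
    unsigned long threshold = 0xFFFFFFFFul;          // default: disabled
    if (attrValue) {
        char* end = nullptr;
        unsigned long v = std::strtoul(attrValue, &end, 10);
        if (end && end != attrValue) threshold = v;  // attribute parsed OK
    }
    return frameSize > threshold;  // true => emit "stack frame size" diagnostic
}
```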
Stack annotation: if annotation output is enabled (checked via sub_B6EA50/sub_B6F970), formats and writes stack-size metadata to the analysis output for the NVVM container.
Cleanup frees the 0xA8 callee-save info structure, resets prolog/epilog point counts, resets frame metadata, and walks the chunk table to free non-inline instruction arrays.
Dynamic Stack Allocation (alloca)
PTX supports alloca semantics at the LLVM IR level — the alloca instruction lowers to a local memory reservation. However, truly dynamic-sized allocations (variable-length arrays, runtime alloca(N)) are constrained:
- `MachineFrameInfo.hasVarSizedObjects` (flag at +36) tracks whether the function contains VLA-style allocations.
- When present, PEI selects `StackAlign` (the full stack alignment) rather than `TransientStackAlign` for final frame rounding.
- ptxas ultimately resolves dynamic allocations at JIT time, not cicc. cicc's role is to set up the frame pointer correctly so that dynamic objects can be addressed relative to it.
- The `FramePointerIndex` (at `MachineFrameInfo+68`) is laid out last among general objects, ensuring the frame pointer anchors the top of the fixed frame with dynamic objects growing beyond it.
For fixed-size allocas, SROA (Scalar Replacement of Aggregates) typically promotes them to SSA registers before PEI ever runs. When SROA succeeds for all allocas, MachineFrameInfo has no stack objects and PEI emits no __local_depot at all — the function runs entirely in registers.
Spill Slots
Register spills are the primary consumer of __local_depot space. When the register allocator cannot fit a virtual register's live range into the available physical registers, it creates a spill slot — a frame object marked with isSpillSlot = 1 (byte at frame-object +32).
Spill-slot frame objects are created during register allocation. PEI does not create them; it only assigns their offsets. In cicc, spill slots interact with the categorized layout:
- Spill slots in a separate stack area (when `hasSeparateStackArea` is set) are excluded from the general layout and handled in Phase 4's separate-area processing.
- Remaining spill slots are classified into categories 1–3 based on their alignment requirements and register class — vector register spills (e.g., 128-bit `%rq` registers) end up in category 3, scalar spills in category 1.
After PEI assigns offsets, the spill loads/stores reference [%SP + offset] or [%SPL + offset] directly. The post-PEI NVPTXPeephole pass optimizes these: when a LEA_ADDRi64 %VRFrame64, offset feeds directly into cvta_to_local_64, the peephole collapses this to LEA_ADDRi64 %VRFrameLocal64, offset, saving the generic address conversion.
Interaction with SROA
SROA runs early in the optimization pipeline (see SROA) and aggressively promotes alloca instructions to SSA values. For many GPU kernels — especially those that avoid taking addresses of locals — SROA eliminates all allocas, resulting in an empty MachineFrameInfo. In this case:
- PEI's frame size computes to 0.
- The PTX emitter (`sub_2158E80`) checks `FrameInfo.StackSize`; if zero, it emits no `.local` directive and no `%SP`/`%SPL` declarations.
- The function runs entirely in the virtual register file — the ideal case for GPU performance.
When SROA cannot promote (address-taken locals, aggregates too large for SROA's threshold controlled by sroa-size-limit, or when sroa-skip-mem2reg is set), PEI becomes essential. Additionally, cicc has a custom MI Mem2Reg pass (nv-disable-mem2reg controls it) that runs post-register-allocation and promotes MachineIR local-memory accesses back to registers — effectively a second chance at eliminating __local_depot usage after regalloc.
Comparison with Upstream
| Aspect | Upstream NVPTXPrologEpilogPass | cicc sub_35B1110 |
|---|---|---|
| Size | 280 lines | ~2,400 lines |
| Callee-saved regs | Not handled | Full save/restore infrastructure |
| Register scavenging | Not used | Both forward and backward paths |
| Layout algorithm | Single linear pass over all objects | Categorized 3-bucket layout + bitmap packing |
| Frame packing | None — objects placed sequentially | tzcnt-accelerated bitmap hole-finding |
| Stack direction | Supports both, simple | Supports both, with per-direction callee-save adjustment |
| Diagnostics | None | warn-stack-size attribute + annotation output |
| Separate stack area | Not supported | Full support (flag at MFI+665) |
| Arch-specific prolog | None | SM arch code 0xB extension |
| Optimization gating | None | opt-level 20 skips prolog/epilog emission |
| Frame-index elimination | Single backward walk | Dual forward/backward strategies |
The upstream pass explicitly disables LLVM's standard PrologEpilogCodeInserterID and replaces it. cicc's version is closer to the full standard LLVM PEI but with GPU-specific extensions — it re-enables callee-saved handling, register scavenging, and the frame-rounding logic that upstream strips out.
Configuration
| Knob | Type | Default | Effect |
|---|---|---|---|
| `warn-stack-size` | Function attribute (string→int) | 0xFFFFFFFF (disabled) | Emit diagnostic when frame size exceeds threshold |
| `nvptx-short-ptr` | cl::opt<bool> | false | Use 32-bit pointers for local/const/shared address spaces; affects %SPL width |
| `nv-disable-mem2reg` | cl::opt<bool> | false | Disable post-regalloc MI Mem2Reg pass (more objects remain for PEI to lay out) |
| `sroa-size-limit` | cl::opt<int> | (varies) | Max aggregate size SROA will promote; larger values reduce PEI workload |
| Opt-level flag 20 | Internal | — | Skips prolog/epilog instruction emission and callee-save handling |
| Opt-level flag 55 | Internal | — | Includes regspill area in stack-size diagnostic total |
| `FrameLowering[12]` | Subtarget | arch-dependent | Stack alignment for functions with calls/alloca |
| `FrameLowering[13]` | Subtarget | arch-dependent | Stack alignment for leaf functions (TransientStackAlign) |
Key Data Structures
MachineFrameInfo (at MachineFunction+48)
Offset Type Field
+8 ptr Objects array base pointer (40-byte records)
+16 ptr Objects array end pointer
+32 i32 NumFixedObjects
+36 u8 hasVarSizedObjects
+48 i64 StackSize ← WRITTEN by PEI
+64 u8 MaxAlignment (log2)
+65 u8 hasCalls / needsStackAlignment
+68 i32 FramePointerIndex (-1 if none)
+80 i64 MaxCallFrameSize (-1 if unknown)
+96 ptr Separate-area array base
+104 ptr Separate-area array end
+120 u8 hasCalleeSaves ← SET by PEI
+128 ptr Extra-area array pointer
+136 i64 Extra-area count
+656 i64 Separate area total size
+664 u8 Separate area alignment
+665 u8 hasSeparateStackArea flag
PEI State (pass object, offset from a1)
Offset Type Field
+8 ptr Analysis list (tagged analysis pointers)
+200 ptr Callee-save info (0xA8-byte struct, or null)
+208 u32 First CSR frame index
+212 u32 Last CSR frame index
+216 ptr Prolog insertion points array
+224 u32 Prolog point count
+264 ptr Epilog insertion points array
+272 u32 Epilog point count
+312 u8 hasReservedCallFrame flag
+313 u8 requiresRegisterScavenging flag
+320 ptr Stack-size annotation analysis pointer
Frame Object Record (40 bytes each)
Offset Type Field
+0 i64 Byte offset in __local_depot (assigned by PEI)
+8 i64 Object size in bytes
+16 u8 Alignment (log2 encoding)
+20 u8 isDead flag
+32 u8 isSpillSlot flag
+36 u8 Category: 0=skip, 1=general, 2=medium, 3=large
Diagnostic Strings
| String | When emitted |
|---|---|
"warn-stack-size" | Function attribute name — read and parsed as an integer threshold |
"stack frame size" | Diagnostic message when total frame size exceeds the warn-stack-size threshold |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| PrologEpilogInserter::runOnMachineFunction — main entry (68 KB) | sub_35B1110 | -- | -- |
| PEI pre-setup: initialize frame object tracking | sub_35AC440 | -- | -- |
| Record frame object into chunk table | sub_35AFAD0 | -- | -- |
| Determine CSR frame index range (writes PEI+208, +212) | sub_35AEEB0 | -- | -- |
| Post-save fixup | sub_35AE230 | -- | -- |
| Insert restore instructions at epilog points | sub_35ADBC0 | -- | -- |
| Assign offsets to categorized frame object bucket | sub_35B0830 | -- | -- |
| Push frame object index into categorized bucket | sub_35B0B10 | -- | -- |
| Post-prolog/epilog fixup | sub_35AC7B0 | -- | -- |
| Try to eliminate a single frame index operand | sub_35ABF20 | -- | -- |
| Initialize register scavenger for a MBB | sub_35C5BD0 | -- | -- |
| Advance register scavenger | sub_35C5C00 | -- | -- |
| Post-scavenging callee-save cleanup | sub_35C6D20 | -- | -- |
| Format stack-size annotation | sub_35AE7D0 | -- | -- |
| PTX emitter: emitFunctionFrameSetup() (__local_depot + %SP/%SPL) | sub_2158E80 | -- | -- |
| Local depot helper | sub_214C040 | -- | -- |
| Local depot helper | sub_2154370 | -- | -- |
| Collect callee-saved registers | sub_2E77EA0 | -- | -- |
| Get register class for physical register | sub_2FF6500 | -- | -- |
| Build sub-register decomposition list | sub_2F26260 | -- | -- |
| Insert compound save instruction | sub_2E8EAD0 | -- | -- |
| Check optimization level flag | sub_B2D610 | -- | -- |
| Check function attribute existence | sub_B2D620 | -- | -- |
| Get function attribute value | sub_B2D7E0 | -- | -- |
| Build stack-size diagnostic message | sub_B15960 | -- | -- |
Differences from Upstream LLVM
| Aspect | Upstream LLVM (NVPTX open-source) | CICC v13.0 |
|---|---|---|
| Implementation | Stripped-down NVPTXPrologEpilogPass (~280 lines); handles only offset calculation and frame-index elimination | Full 68 KB PEI monolith with callee-saved register handling, register scavenging, bitmap-based frame packing, categorized layout ordering |
| Stack concept | No hardware stack; minimal __local_depot offset assignment | Same __local_depot model but with full-featured offset assignment: categorized frame objects, alignment-based bucketing, dead frame object elimination |
| Callee-saved registers | Skipped entirely (no function calls in typical kernels) | Restored: full callee-saved register scan, compound save/restore instruction insertion for non-inlined device function calls |
| Register scavenging | Absent | Included: sub_35C5BD0/sub_35C5C00 initialize and advance a register scavenger per MBB for emergency spill resolution |
| Frame packing | Sequential offset assignment | Bitmap-based packing with categorized buckets; objects sorted by alignment to minimize padding waste |
| Stack-size diagnostics | No diagnostic system | Annotation system (sub_35AE7D0) formats stack-size remarks; integrates with -Rpass-analysis for occupancy tuning |
| Prologue emission | Two-instruction %SP/%SPL setup | Same two-instruction prologue (sub_2158E80) but with additional __local_depot sizing logic for complex frame layouts |
Cross-References
- Register Allocation — creates spill slots that PEI lays out; the number and alignment of spills directly determines frame size.
- Register Coalescing — reduces register pressure, which reduces spills, which reduces frame size.
- SROA — SROA eliminates allocas before they reach MachineIR; when fully successful, PEI has nothing to do.
- AsmPrinter & PTX Body Emission — `sub_2158E80` emits the `.local` directive and `%SP`/`%SPL` declarations that PEI computed.
- Instruction Scheduling — runs before PEI; scheduling decisions affect register pressure and thus spill count.
- Pipeline & Ordering — PEI runs post-regalloc, followed immediately by NVPTXPeephole for `%VRFrame` to `%VRFrameLocal` optimization.
BranchFolding & TailMerge
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Based on LLVM 20.0.0 `BranchFolding.cpp`. The critical divergence is that cicc removes the `requiresStructuredCFG()` gate that upstream uses to disable tail merging for GPU targets, and compensates with a reserved-register merge safety check not present in any upstream version.
BranchFolding is LLVM's post-register-allocation CFG optimizer. It runs after block placement and performs three transformations in a fixed-point loop: tail merging (extracting identical instruction tails from multiple blocks into a shared block), branch optimization (eliminating redundant or unreachable branches, merging single-predecessor blocks into predecessors), and common-code hoisting (lifting identical instructions from successors into a shared predecessor). In cicc v13.0 the pass lives at sub_2F336B0 (the OptimizeBlock / TailMergeBlocks core, 11,347 bytes) with pass entry at sub_2F36310. The NVPTX version carries one critical divergence from upstream LLVM: tail merging is not disabled by requiresStructuredCFG(). Instead, cicc keeps tail merging enabled but gates individual merge decisions on a reserved-register check that prevents merging when NVPTX special registers (%tid.x, %ntid.x, etc.) cross the merge boundary.
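The core of tail merging is finding the longest identical instruction suffix shared by candidate blocks and merging only when it meets the `-tail-merge-size` minimum (default 3, per the Key Facts table). A minimal sketch, with instructions modeled as strings for illustration (the function names here are hypothetical, not cicc symbols):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Measure the longest identical instruction suffix of two blocks --
// the candidate "shared tail" that tail merging would extract into a
// common successor block.
std::size_t commonTailLen(const std::vector<std::string>& a,
                          const std::vector<std::string>& b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() &&
           a[a.size() - 1 - n] == b[b.size() - 1 - n])
        ++n;
    return n;
}

// Merge only when the shared tail meets the -tail-merge-size minimum
// (default 3 instructions per this page).
bool shouldTailMerge(const std::vector<std::string>& a,
                     const std::vector<std::string>& b,
                     std::size_t minTail = 3) {
    return commonTailLen(a, b) >= minTail;
}
```

In cicc, this length check is only the first gate; the NVPTX-specific reserved-register check described below can still veto a merge whose tail is long enough.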
Key Facts
| Property | Value |
|---|---|
| Core function | sub_2F336B0 (OptimizeBlock / TailMergeBlocks) |
| Function size | 11,347 bytes (792-byte stack frame) |
| Pass entry point | sub_2F36310 (iterates all MBBs) |
| Pass ID (upstream) | "branch-folder" / BranchFolderPassID |
| Pipeline position | After register allocation, after block placement |
| Disable knob | -disable-branch-fold (global at qword_5022CC8) |
| Tail-merge gate | enable-tail-merge (tri-state: unset/true/false) |
| Tail-merge threshold | -tail-merge-threshold (default 150) |
| Minimum tail length | -tail-merge-size (default 3 instructions) |
| Knob constructor | ctor_346 |
| Required property | NoPHIs -- SSA phi nodes must already be eliminated |
Upstream vs. NVPTX Behavior
In stock LLVM, BranchFolderPass::run checks requiresStructuredCFG() on the TargetMachine and, if true, disables tail merging entirely:
bool EnableTailMerge = !MF.getTarget().requiresStructuredCFG()
&& PassConfig->getEnableTailMerge();
NVPTX returns true from requiresStructuredCFG(), so upstream LLVM would completely suppress tail merging for GPU targets. cicc removes this gate. The binary evidence is the vtable check at 0x2F337A3 (cmp rax, offset sub_2DAC790), which verifies that the NVPTXInstrInfo vtable supports analyzeBranch -- if it does, tail merging proceeds. The structured-CFG check is absent. This makes sense: StructurizeCFG has already run by this point and guaranteed reducible control flow; tail merging two blocks that share a common successor preserves reducibility because it only introduces a new unconditional branch to the merged tail, which does not create irreducible cycles.
However, cicc compensates with three safety mechanisms that upstream does not need:
- Reserved-register check. At 0x2F3427B, the pass calls sub_2E88A90 with flag 0x200 (isReservedReg) on every register live across the proposed merge boundary. NVPTX special registers (%tid.x, %ntid.x, %ctaid.x, etc.) are reserved and cannot be live-in to a newly created shared tail block because their values are implicitly defined by the hardware. If any reserved register is detected, the merge is rejected. See the Reserved-Register Safety Mechanism section below for full detail.
- Priority ordering for conditional branches. The pattern or ecx, 2 at 0x2F33B1C assigns priority >= 2 to conditional branch terminators and lower priority to unconditional branches. This ensures unconditional-branch tails are merged first, because those merges never alter branch conditions and are always safe within structured CFG. Conditional tail merges are attempted only after unconditional ones are exhausted.
- NVPTXInstrInfo vtable validation. The vtable check at 0x2F337A3 (cmp rax, offset sub_2DAC790) verifies that the TargetInstrInfo object supports analyzeBranch before any merge is attempted. This is a guard against running the pass on a MachineFunction whose InstrInfo does not implement branch analysis -- a scenario that cannot occur in the normal NVPTX pipeline but could if the pass were invoked from an unexpected context. The check loads the vtable pointer from [TII], compares against the known NVPTXInstrInfo vtable base, and short-circuits to "no merge" if the match fails.
Algorithm
The pass entry sub_2F36310 calls OptimizeFunction, which runs a fixed-point loop:
OptimizeFunction(MF):
repeat:
changed = TailMergeBlocks(MF)
changed |= OptimizeBranches(MF)
changed |= HoistCommonCode(MF)
until !changed
// clean up dead jump tables
TailMergeBlocks
TailMergeBlocks operates in two phases.
Phase A -- return/exit blocks. Collect all blocks with no successors (return blocks, noreturn calls) into MergePotentials, capped at tail-merge-threshold (150). Hash each block's tail via sub_2F26260 (HashEndOfMBB), which computes HashMachineInstr on the last non-debug instruction. If two or more candidates share a hash, call TryTailMergeBlocks to attempt the merge.
Phase B -- multi-predecessor blocks. For each block IBB with >= 2 predecessors, collect the predecessors into MergePotentials. For each predecessor PBB:
- Skip self-loops (PBB == IBB), EH-pad successors, and inline-asm-br blocks.
- Call AnalyzeBranch (sub_2E09D00) on PBB. If PBB conditionally branches to IBB, reverse the condition so the unconditional fall-through to IBB is removed, leaving only the conditional branch to the "other" target. This normalization enables tail comparison.
- Hash the tail of the normalized PBB and push it into MergePotentials.
Then call TryTailMergeBlocks(IBB, PredBB, MinCommonTailLength):
TryTailMergeBlocks(SuccBB, PredBB, MinTail):
sort MergePotentials by hash
for each group of candidates sharing a hash:
for each pair (MBB1, MBB2) in the group:
tail_len = ComputeCommonTailLength(MBB1, MBB2)
if tail_len >= MinTail:
// check reserved-register constraint (NVPTX addition)
for each reg live across merge point:
if hasProperty(reg, 0x200): // isReservedReg
reject merge; continue
// perform the merge
create new MBB "CommonTail"
splice tail instructions from MBB1 into CommonTail
ReplaceTailWithBranchTo(MBB2, CommonTail)
UpdateTerminator on both blocks
update live-ins for CommonTail
merged = true
return merged
ComputeCommonTailLength walks backwards from both block ends, comparing instructions via isIdenticalTo. It skips debug and CFI instructions. Inline asm is never merged (hard-coded rejection in upstream). The cicc binary performs this comparison at 0x2F33B0F--0x2F33BDD, extracting opcode from [ptr+18h] and comparing sub-fields via sar/and arithmetic on the instruction encoding.
HashEndOfMBB -- sub_2F26260
The hash function at sub_2F26260 computes a 32-bit hash of a block's tail for fast merge-candidate matching. The algorithm:
HashEndOfMBB(MBB):
iter = MBB.rbegin() // last instruction
// skip debug instructions
while iter != MBB.rend() && iter.isDebugInstr():
iter++
if iter == MBB.rend():
return 0 // empty block (or all-debug)
// skip terminator branches -- hash the last non-branch
while iter != MBB.rend() && iter.isTerminator():
iter++
if iter == MBB.rend():
return 0 // block contains only terminators
return HashMachineInstr(*iter)
HashMachineInstr (at sub_2E89C70) hashes the instruction's opcode, number of operands, and the first two operands' register/immediate values. It does not hash memory operands or metadata -- this is intentional, because the hash is only used to bucket candidates for pairwise comparison. False collisions are resolved by the subsequent ComputeCommonTailLength call. The hash uses a simple multiply-and-add scheme:
HashMachineInstr(MI):
h = MI.getOpcode()
h = h * 37 + MI.getNumOperands()
if MI.getNumOperands() >= 1:
h = h * 37 + hashOperand(MI.getOperand(0))
if MI.getNumOperands() >= 2:
h = h * 37 + hashOperand(MI.getOperand(1))
return h
The * 37 constant is standard LLVM hashing (the same multiplier used in DenseMapInfo). The hash is deliberately coarse -- it accepts false positives (two different instructions hashing to the same value) but never produces false negatives (two identical instructions hashing differently), which is the correct tradeoff for a merge-candidate filter.
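The bucketing behavior described above can be simulated in a few lines of Python. This is a sketch, not the binary's exact operand hashing: instructions are modeled as tuples, and hash_operand is a simplified stand-in for the real per-operand hash.

```python
# Sketch of the coarse tail-hash bucketing described above.
# Instructions: (opcode, operands, is_debug, is_terminator) tuples, last = block end.

def hash_operand(op):
    # Simplified stand-in: operands modeled as small ints (regs / immediates).
    return op

def hash_machine_instr(opcode, operands):
    h = opcode
    h = h * 37 + len(operands)
    if len(operands) >= 1:
        h = h * 37 + hash_operand(operands[0])
    if len(operands) >= 2:
        h = h * 37 + hash_operand(operands[1])
    return h

def hash_end_of_mbb(block):
    # Walk backwards, skipping debug instructions and terminators,
    # and hash the last "real" instruction (0 if none exists).
    for opcode, operands, is_debug, is_term in reversed(block):
        if is_debug or is_term:
            continue
        return hash_machine_instr(opcode, operands)
    return 0

# Blocks whose last non-debug, non-terminator instruction is identical land in
# the same bucket; pairwise comparison later resolves any false positives.
a = [(10, [1, 2], False, False), (99, [], False, True)]   # add r1,r2 ; bra
b = [(10, [1, 2], False, False), (99, [], False, True)]
c = [(11, [1, 2], False, False), (99, [], False, True)]   # different opcode
print(hash_end_of_mbb(a) == hash_end_of_mbb(b))  # True  -> merge candidates
print(hash_end_of_mbb(a) == hash_end_of_mbb(c))  # False -> different buckets
```

Note how the hash never produces a false negative: identical tails always hash equal, which is the only property the candidate filter needs.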
ComputeCommonTailLength -- Detailed Binary Walkthrough
The comparison loop at 0x2F33B0F proceeds as follows:
ComputeCommonTailLength(MBB1, MBB2):
iter1 = MBB1.rbegin() // walk backwards from end
iter2 = MBB2.rbegin()
count = 0
// skip debug instructions at tails
skip_debug(iter1, MBB1)
skip_debug(iter2, MBB2)
while iter1 != MBB1.rend() && iter2 != MBB2.rend():
MI1 = *iter1
MI2 = *iter2
// extract opcode from [MI + 0x18]
opc1 = *(uint32_t*)(MI1 + 0x18)
opc2 = *(uint32_t*)(MI2 + 0x18)
// reject if either is inline asm (opcode check)
if is_inline_asm(opc1) || is_inline_asm(opc2):
break
// reject if either is a CFI pseudo-instruction
if is_cfi(opc1) || is_cfi(opc2):
skip to next non-CFI; continue
// full comparison: opcode, operand count, each operand
if !isIdenticalTo(MI1, MI2):
break
count++
iter1++; iter2++
skip_debug(iter1, MBB1)
skip_debug(iter2, MBB2)
return count
The isIdenticalTo comparison at the binary level extracts fields from the MachineInstr layout:
- [MI + 0x18]: opcode (32-bit)
- [MI + 0x08]: operand list pointer
- [MI + 0x10]: operand count (16-bit at +0x10, flags at +0x12)
- Each operand at stride 40 bytes: [operand + 0x00] = type tag, [operand + 0x08] = register/immediate value
Two instructions are identical if and only if: same opcode, same number of operands, and for each operand pair: same type tag and same value. Memory operands (MachineMemOperand) are not compared -- two loads from different memory locations with the same register operands will compare as identical. This is correct for tail merging because if the instructions are in the tail of two blocks that reach the same successor, their memory operands must be equivalent by construction (they access the same state at the same program point).
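The backward common-tail scan can be sketched in Python. This is a simplified model under stated assumptions: instructions are (opcode, operands) tuples, tuple equality stands in for isIdenticalTo, the INLINE_ASM sentinel is hypothetical, and the debug/CFI skipping described above is omitted for brevity.

```python
# Sketch of the backward common-tail scan described above.

INLINE_ASM = -1  # hypothetical sentinel opcode standing in for inline asm

def is_identical(mi1, mi2):
    # Tuple equality models: same opcode, same operand count, and matching
    # (type tag, value) per operand. Memory operands are deliberately not
    # part of the comparison (see the discussion above).
    return mi1 == mi2

def compute_common_tail_length(block1, block2):
    i, j, count = len(block1) - 1, len(block2) - 1, 0
    while i >= 0 and j >= 0:
        mi1, mi2 = block1[i], block2[j]
        if mi1[0] == INLINE_ASM or mi2[0] == INLINE_ASM:
            break                      # inline asm is never merged
        if not is_identical(mi1, mi2):
            break                      # tails diverge here
        count += 1
        i -= 1
        j -= 1
    return count

b1 = [(1, (5,)), (7, (2, 3)), (8, (4,))]
b2 = [(2, (6,)), (7, (2, 3)), (8, (4,))]
print(compute_common_tail_length(b1, b2))  # 2 -- shared 2-instruction tail
```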
Merge Candidate Ordering and the Priority System
The or ecx, 2 pattern at 0x2F33B1C implements a priority-based ordering within hash groups. When building the MergePotentials list, each entry is annotated with a priority value:
| Priority | Condition | Meaning |
|---|---|---|
| 0 | Block ends with unconditional branch only | Safest merge -- no condition changes needed |
| 1 | Block ends with fallthrough (no explicit branch) | Safe -- may need a branch inserted |
| 2+ | Block ends with conditional branch | Riskier -- merge may require condition reversal |
The sort at TryTailMergeBlocks sorts first by hash (grouping candidates), then within each hash group by priority (ascending). This ensures that the O(K^2) pairwise comparison within each hash group tries unconditional-only pairs first. If a merge succeeds for a low-priority (safe) pair, the modified block may no longer be a candidate for a higher-priority (conditional) pair, reducing the number of conditional merges attempted.
On NVPTX, this ordering is particularly important because conditional branch reversal (sub_2E09D00 AnalyzeBranch + condition inversion) can alter the fall-through layout. In a structured CFG, the fall-through direction often corresponds to the "then" path of an if-then-else, and reversing the condition flips which path falls through. While this does not change correctness, it can change the reconvergence point's distance from the branch, affecting I-cache locality. By preferring unconditional-only merges, the pass minimizes layout disruption.
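The two-key sort can be illustrated with a small Python sketch. The block names, hash values, and priority assignments here are invented examples; only the ordering rule (hash first, then ascending priority within a hash group) comes from the analysis above.

```python
# Sketch of the (hash, priority) candidate ordering described above.
# Priorities follow the table: 0 = unconditional-only, 1 = fallthrough,
# 2 = conditional terminator.

candidates = [
    ("bb3", 0xAB, 2),   # (name, tail hash, priority) -- illustrative values
    ("bb1", 0xAB, 0),
    ("bb4", 0xCD, 1),
    ("bb2", 0xAB, 1),
]

# Sort by hash first (grouping candidates), then by ascending priority within
# each group, so the O(K^2) pairwise scan tries the safest pairs first.
candidates.sort(key=lambda c: (c[1], c[2]))
print([name for name, _, _ in candidates])  # ['bb1', 'bb2', 'bb3', 'bb4']
```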
OptimizeBranches
OptimizeBranches (sub_2F36310 inner loop) walks every MBB and calls OptimizeBlock to perform local branch simplifications:
- Empty-block elimination. If MBB contains only debug instructions, redirect all predecessors to the fallthrough successor.
- Unconditional-to-same-target folding. If the previous block's conditional and unconditional branches both target the same block, replace with a single unconditional branch (or fallthrough).
- Single-predecessor merge. If MBB has exactly one predecessor and that predecessor falls through unconditionally, splice MBB's instructions into the predecessor and remove MBB.
- Redundant branch removal. If the previous block branches only to MBB (the natural fallthrough), remove the branch entirely.
- Condition reversal. If the previous block conditionally branches to MBB on true and somewhere else on false, reverse the condition to create a fallthrough.
- Tail-block relocation. If MBB has no successors (return/noreturn) and the predecessor could fall through to the next block instead, move MBB to the end of the function and reverse the predecessor's condition.
Each transformation triggers goto ReoptimizeBlock to re-analyze the modified block. Dead blocks (no predecessors after optimization) are removed via sub_2E790D0 (RemoveBlock).
HoistCommonCode
For each block with exactly two successors, if both successors begin with identical instructions, hoist those instructions into the predecessor. This is the inverse of tail merging -- it reduces code size when two divergent paths start with the same setup sequence. The EnableHoistCommonCode flag (always true in cicc) controls this phase.
Reserved-Register Safety Mechanism
This section documents the NVIDIA-specific reserved-register check that gates tail merging in cicc. This mechanism has no equivalent in upstream LLVM because upstream disables tail merging entirely for structured-CFG targets.
Why Reserved Registers Cannot Cross Merge Boundaries
NVPTX "special registers" (%tid.x, %ntid.x, %ctaid.x, %nctaid.x, %laneid, %warpid, and the SM 90+ cluster registers) are not stored in the virtual register file. They are hardware-defined, read-only values whose definitions are implicit -- there is no MachineInstr that defines %tid.x. Instead, these registers appear as implicit uses on instructions that read thread/block/grid coordinates.
When tail merging creates a new shared tail block CommonTail, LLVM's infrastructure computes the live-in set for CommonTail from the union of live-outs of the merged predecessors. For a normal virtual register, this is safe: the register has a concrete definition (a MachineInstr somewhere in the function), and the live-in annotation tells the downstream passes that the value is available at block entry.
For a reserved register, there is no concrete definition. The value is implicitly available at every point in the function -- it is defined by the hardware thread context, not by any instruction. Creating a new block with a reserved register in its live-in set is semantically meaningless but causes three concrete problems:
- LiveIntervals confusion. The LiveIntervals analysis (already computed by this point) has no interval for reserved registers. Adding a live-in for a reserved register to CommonTail would require creating a new LiveInterval that spans from CommonTail's entry to the last use within CommonTail. But reserved registers do not participate in LiveIntervals -- they are excluded during interval construction at sub_2F5A640. The resulting inconsistency triggers assertions in debug builds and can silently corrupt the interference matrix in release builds.
- Register pressure miscounting. The greedy register allocator tracks pressure per register class. Reserved registers belong to the internal-only class at off_4A026E0 (the "!Special!" class documented in Register Classes). This class has no encoded ID, no PTX declaration, and is excluded from pressure accounting. If a reserved register appeared as a live-in, the pressure tracker would attempt to look up its class and fail -- or worse, miscount it against one of the nine real classes.
- Emission failure. During PTX emission, sub_21583D0 (the register encoding function) maps each register to its 4-bit class tag via vtable comparison. Reserved registers use the off_4A026E0 vtable, which triggers the fatal "Bad register class" error. A reserved register in a live-in set could propagate to a point where the emitter attempts to declare it, causing an unconditional abort.
The hasProperty Check -- sub_2E88A90
sub_2E88A90 is a multi-purpose property query function used across several subsystems in cicc:
| Call site | Flag | Meaning |
|---|---|---|
| BranchFolding (0x2F3427B) | 0x200 | isReservedReg -- register is a hardware-defined special register |
| StructurizeCFG (sub_2E88A90 in structurize) | 0x80000 / 0x100000 | Uniformity/divergence classification |
| InstrEmitter (sub_2E88A90 in emitter) | 0x1000000000 (bit 36) | NVPTX-specific implicit-use flag |
The function takes three arguments:
sub_2E88A90(context_ptr, register_or_operand, flag_mask) -> bool
For the BranchFolding call at 0x2F3427B, the calling convention is:
; rdi = TargetRegisterInfo* (from MachineFunction->getSubtarget().getRegisterInfo())
; esi = register ID (physical register number from live-in set)
; edx = 0x200 (isReservedReg flag)
; returns: al = 1 if reserved, 0 if not
The function internally indexes into a per-register property table at [TRI + 0x58]. This table is initialized during NVPTXRegisterInfo construction (sub_2163AB0 for legacy PM, sub_30590F0 for new PM) and contains one entry per physical register. Each entry is a 64-bit bitmask of properties. The 0x200 bit (bit 9) is set for every register in the NVPTX special/environment register set.
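The table-indexed bitmask query can be modeled in Python. This is a sketch under stated assumptions: the register numbering and table contents are illustrative, not the binary's actual layout; only the "one 64-bit bitmask per physical register, test the requested bit" mechanism comes from the analysis above.

```python
# Sketch of the per-register property-table query (sub_2E88A90 model).

IS_RESERVED = 0x200     # bit 9, as documented above
IS_UNIFORM  = 0x80000   # bit 19, used by the StructurizeCFG call site

# Hypothetical numbering: reg 0 models %tid.x (reserved), reg 1 a normal reg.
property_table = {
    0: IS_RESERVED,
    1: 0,
}

def has_property(reg, flag_mask):
    """Index the per-register bitmask table and test the requested bit(s)."""
    return bool(property_table.get(reg, 0) & flag_mask)

print(has_property(0, IS_RESERVED))  # True  -> merge would be rejected
print(has_property(1, IS_RESERVED))  # False -> merge may proceed
```

The single-function, multi-flag design matches how one property table serves BranchFolding, StructurizeCFG, and the InstrEmitter with different mask arguments.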
Which Registers Are Marked Reserved (flag 0x200)
The following registers have bit 9 (0x200) set in the property table and will cause a merge rejection if live across the merge boundary:
| Register Group | PTX Names | Emission Function | Count |
|---|---|---|---|
| Thread ID | %tid.x, %tid.y, %tid.z | sub_21E86B0 (opcodes 0x26--0x28) | 3 |
| Block dimensions | %ntid.x, %ntid.y, %ntid.z | sub_21E86B0 (opcodes 0x29--0x2B) | 3 |
| Block ID | %ctaid.x, %ctaid.y, %ctaid.z | sub_21E86B0 (opcodes 0x2C--0x2E) | 3 |
| Grid dimensions | %nctaid.x, %nctaid.y, %nctaid.z | sub_21E86B0 (opcodes 0x2F--0x31) | 3 |
| Warp/lane ID | %warpid, %laneid | sub_21E86B0 (opcodes 0x5E--0x5F, via sub_3958DA0) | 2 |
| Cluster registers (SM 90+) | %cluster_ctarank, %cluster_nctarank, %cluster_ctaid.{x,y,z}, %cluster_nctaid.{x,y,z}, %clusterid.{x,y,z}, %nclusterid.{x,y,z}, %is_explicit_cluster | sub_21E9060 (values 0--14) | 15 |
| Stack pointer | %SP, %SPL | inline in frame setup | 2 |
| Environment regs | ENVREG0--ENVREG31 | internal (not emitted to PTX) | 32 |
Total: 63 reserved registers. These correspond to the physical register set in NVPTX -- recall that NVPTX has no general-purpose physical registers, so the only physical registers are the special hardware-defined ones plus the stack pointer pair.
The environment registers (ENVREG0--ENVREG31) are used internally by the CUDA runtime to pass kernel arguments and configuration data. They are read-only from the kernel's perspective and never appear explicitly in emitted PTX. Their presence in the reserved set is a safety measure against internal IR manipulations that might introduce them as explicit operands.
The Check in Context: Full Merge Decision Sequence
The reserved-register check is the third of four gates in the merge decision path. The complete sequence at 0x2F33B0F--0x2F34300 is:
MergeDecision(MBB1, MBB2, MinTail):
// Gate 1: Instruction comparison
tail_len = ComputeCommonTailLength(MBB1, MBB2)
if tail_len < MinTail:
return REJECT
// Gate 2: Branch analysis feasibility
ok = AnalyzeBranch(MBB1, ...)
if !ok:
return REJECT // unanalyzable terminator (inline asm, etc.)
// Gate 3: Reserved-register check (NVPTX-specific)
for each reg in LiveIns(MBB1[split_point:]) ∪ LiveIns(MBB2[split_point:]):
if sub_2E88A90(TRI, reg, 0x200):
return REJECT // reserved register crosses merge boundary
// Gate 4: Profitability (code size)
overhead = 1 // one branch instruction to CommonTail
if MBB1 needs UpdateTerminator:
overhead += 1
if tail_len <= overhead:
return REJECT // no net code-size reduction
return ACCEPT
Gate 3 iterates every register that would be live-in to the proposed CommonTail block. The live-in set is computed by walking the tail instructions backwards and collecting register uses that have no definition within the tail. If any register in this set has the 0x200 property, the entire merge is rejected -- there is no fallback or partial merge.
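The four-gate sequence can be condensed into a Python sketch. The assumptions are simplified: tail_len and the live-across set arrive as precomputed inputs rather than being derived from real MachineIR, and analyze_branch_ok stands in for the AnalyzeBranch call.

```python
# Sketch of the four-gate merge decision described above.

def merge_decision(tail_len, min_tail, analyze_branch_ok,
                   live_across_boundary, is_reserved, needs_terminator_update):
    if tail_len < min_tail:
        return "REJECT: tail too short"            # Gate 1
    if not analyze_branch_ok:
        return "REJECT: unanalyzable terminator"   # Gate 2
    for reg in live_across_boundary:
        if is_reserved(reg):
            return "REJECT: reserved register"     # Gate 3 (NVPTX-specific)
    overhead = 1 + (1 if needs_terminator_update else 0)
    if tail_len <= overhead:
        return "REJECT: no code-size win"          # Gate 4
    return "ACCEPT"

reserved = {"%tid.x", "%ntid.x", "%ctaid.x"}.__contains__
print(merge_decision(4, 3, True, ["%r1"], reserved, False))     # ACCEPT
print(merge_decision(4, 3, True, ["%tid.x"], reserved, False))  # REJECT: reserved register
```

Note that Gate 3 is all-or-nothing, exactly as described: a single reserved register rejects the whole merge with no partial fallback.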
Interaction with computeLiveIns -- sub_2E16F10
After a merge is accepted and the CommonTail block is created, sub_2E16F10 (computeLiveIns) populates the new block's live-in set. This function must agree with the pre-merge reserved-register check: if the check passed (no reserved registers), then computeLiveIns will produce a live-in set containing only virtual registers and non-reserved physical registers. The function at sub_2E16F10 performs its own filtering:
computeLiveIns(CommonTail):
for each reg in upward_exposed_uses(CommonTail):
if isReserved(reg):
continue // redundant safety -- already filtered by Gate 3
addLiveIn(CommonTail, reg)
The double-check (once in the merge decision, once in computeLiveIns) is a defense-in-depth pattern. The merge decision check prevents the merge from happening at all; the computeLiveIns filter prevents a reserved register from entering the live-in set even if the merge decision check were somehow bypassed (e.g., by a future code change that added a new merge path).
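The redundant filter can be sketched as follows. The register names and the precomputed upward-exposed-use set are illustrative assumptions; the mechanism (skip reserved registers while building the new block's live-ins) follows the pseudocode above.

```python
# Sketch of computeLiveIns with the defense-in-depth reserved-register filter.

RESERVED = {"%tid.x", "%ntid.x", "%ctaid.x", "%SP", "%SPL"}  # illustrative subset

def compute_live_ins(upward_exposed_uses):
    live_ins = []
    for reg in upward_exposed_uses:
        if reg in RESERVED:
            continue  # redundant safety -- Gate 3 should already have rejected this
        live_ins.append(reg)
    return live_ins

# Even if a reserved register somehow slipped past the merge decision,
# it never reaches the new block's live-in set.
print(compute_live_ins(["%r1", "%tid.x", "%r2"]))  # ['%r1', '%r2']
```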
GPU-Specific Considerations
Tail Merging and Warp Divergence
Tail merging on GPU does not interact with warp divergence in the way that branch duplication does. When two blocks A and B both end with the same instruction sequence and share a common successor C, merging the tails into a shared CommonTail block that falls through to C does not change which warps execute which instructions. Every warp that previously executed the tail of A now executes the same instructions in CommonTail; similarly for B. The branch from A (or B) to CommonTail is unconditional and therefore non-divergent by definition.
However, there is one subtle interaction: if A and B are the two sides of a divergent branch, and the tail merge creates CommonTail between them and their common successor C, the reconvergence point may shift. Previously, warps reconverged at C's entry. After the merge, warps reconverge at CommonTail's entry -- which is equivalent but changes the block numbering. StructurizeCFG has already inserted any necessary reconvergence tokens before BranchFolding runs, and those tokens are block-relative. The UpdateTerminator call at sub_2FAD510 and the ReplaceUsesOfBlockWith call at sub_2E0E0B0 update all references, so the reconvergence semantics are preserved.
Code Size vs. Instruction Cache
On GPU, the primary motivation for tail merging is code size reduction, which translates directly to reduced instruction cache pressure. NVIDIA GPUs have small instruction caches per SM partition (32--128 KB depending on architecture generation). Tail merging reduces the number of unique instructions the I-cache must hold.
The tail-merge-size default of 3 reflects the GPU's branch cost: one bra instruction to redirect flow to CommonTail, plus one additional instruction if the predecessor's terminator needs rewriting. With a minimum tail length of 3, the merge always saves at least one instruction's worth of I-cache footprint. On a GPU where each instruction occupies 8--16 bytes (PTX instructions vary in encoding width, but ptxas expands them to fixed-width SASS), a 3-instruction merge saves 24--48 bytes of I-cache per merge site.
The tail-merge-threshold of 150 is generous compared to upstream LLVM's default (also 150 in upstream, but upstream disables the entire mechanism for GPU targets). In practice, GPU kernels rarely have blocks with 150+ predecessors -- the threshold exists primarily to prevent pathological compile times on machine-generated code with massive switch tables.
Structured CFG Preservation Proof
The claim that tail merging preserves structured (reducible) control flow deserves a rigorous argument, since this is the justification for NVIDIA removing the requiresStructuredCFG() gate.
Claim: If the input CFG is reducible, then the CFG after tail merging is also reducible.
Proof sketch: Tail merging performs one operation: it takes two blocks A and B that share a common tail instruction sequence, creates a new block T containing the tail, and replaces the tail portions of A and B with unconditional branches to T. The successors of A and B in the tail (which were the same for both, by construction) become successors of T instead.
Consider the back-edge structure. In a reducible CFG, every cycle has a single entry point (the loop header). Tail merging cannot create a new cycle because:
- T is a new block with no incoming edges except from A and B.
- T's outgoing edges are a subset of the original outgoing edges of A and B's tails.
- No edge into T can form a back-edge of an existing cycle unless A or B was already a back-edge target, in which case the cycle's entry point was A or B, not T.
- The only new edges are A->T and B->T (unconditional). These cannot create a new cycle because T does not dominate A or B (it was just created).
Therefore, no new irreducible cycle is introduced. The disable-nvptx-require-structured-cfg knob (at qword_5022CC8 in NVPTXTargetMachine) provides a backdoor to disable the structured-CFG requirement entirely, but it is false by default and should never be set in production.
Interaction with EH and Cleanup Pads
NVPTX does not support C++ exceptions in the traditional sense -- there is no stack unwinding on GPU. However, cicc does handle cleanup semantics for CUDA cooperative groups and destructor calls. The branch folding pass skips blocks that are EH landing pads (isEHPad() check at the start of OptimizeBlock). On NVPTX, this check is typically a no-op because no blocks are marked as EH pads, but the check remains active because the same binary serves both CUDA and non-CUDA compilation paths.
Interaction with Convergence Control Tokens
On SM 90+ (Hopper and later), cicc emits convergence control pseudo-instructions (bra.convergent, .pragma "convergent") that are consumed by ptxas to guide reconvergence behavior. These pseudo-instructions are MachineInstrs with specific opcodes that BranchFolding must not merge or reorder. The isIdenticalTo comparison in ComputeCommonTailLength considers opcode, operands, and flags, so two convergence control instructions with different target blocks will not compare as identical and will naturally terminate the common-tail scan. This prevents the tail merger from accidentally merging convergence annotations that belong to different reconvergence points.
Data Structures
The MBBInfo structure passed via rdi to sub_2F336B0:
| Offset | Type | Field |
|---|---|---|
| +0x00 | MachineFunction* | Parent function / block list head |
| +0x08 | MachineBasicBlock* | Fallthrough candidate block |
| +0x10 | BranchAnalysisResult* | Cached result from AnalyzeBranch |
| +0x28 | DenseMap<uint, list> | Hash-to-candidate-list merge table |
The pass allocates a 792-byte stack frame holding:
| Stack variable | Purpose |
|---|---|
| var_2E0 | merge_count (number of merges performed) |
| var_309 | modified flag |
| var_30A | should_try_fold flag (initialized to 1) |
| var_224 | Hash table allocated flag |
| var_1E4 | Operand table allocated flag |
Configuration
| Knob | Type | Default | Effect |
|---|---|---|---|
| disable-branch-fold | bool | false | Skips the entire pass |
| enable-tail-merge | tri-state | unset (uses target default) | Force-enable or disable tail merging |
| tail-merge-threshold | unsigned | 150 | Max predecessors considered per merge round; caps MergePotentials size |
| tail-merge-size | unsigned | 3 | Minimum common tail length (in instructions) to justify a merge |
| branch-fold-placement | bool | true | Enables branch folding within MachineBlockPlacement (separate invocation) |
| ifcvt-branch-fold | bool | true | Enables branch folding within the if-converter pass |
The tail-merge-threshold of 150 exists purely as a compile-time throttle. For a block with N predecessors, the pass performs O(N^2) pairwise comparisons within each hash group. Setting the threshold to 0 effectively disables tail merging entirely while keeping branch optimization active.
The tail-merge-size of 3 is the break-even point: creating a new shared block plus a branch instruction costs roughly 2 instructions of overhead, so merging fewer than 3 common instructions produces no net code-size reduction.
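The break-even arithmetic can be made concrete with a small sketch. The cost model here is an assumption consistent with the text (one redirect branch per merged block, plus optional terminator rewrites), not a recovered formula from the binary.

```python
# Sketch of the tail-merge break-even arithmetic described above.

def net_instructions_saved(tail_len, merged_blocks=2, rewrites=0):
    saved = tail_len * (merged_blocks - 1)   # duplicate tail copies removed
    cost = merged_blocks + rewrites          # one branch into CommonTail each
    return saved - cost

print(net_instructions_saved(2))  # 0  -- below tail-merge-size: no net win
print(net_instructions_saved(3))  # 1  -- default minimum: first net saving
```

With more than two merged predecessors the savings grow linearly while the per-block branch cost stays at one, which is why the pass bothers collecting up to 150 candidates per round.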
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| BranchFolder::OptimizeFunction | sub_2F36310 | -- | Pass entry; fixed-point loop over TailMerge + OptimizeBranches + HoistCommonCode |
| BranchFolder::OptimizeBlock / inner logic | sub_2F336B0 | 11,347B | Per-block optimization + tail merge core (792-byte stack frame) |
| HashEndOfMBB | sub_2F26260 | -- | Tail hash computation; hashes last non-debug non-terminator instruction |
| isBranchFoldable | sub_2F31250 | -- | Checks if operand represents a foldable branch target |
| Merge candidate map lookup | sub_2F33020 | -- | Hash table lookup in MergePotentials DenseMap |
| TryTailMergeBlocks | sub_2E2B9F0 | -- | Attempts merge across candidate set; calls Gates 1--4 |
| AnalyzeBranch | sub_2E09D00 | -- | NVPTXInstrInfo branch analysis: type, targets, conditions |
| RemoveBranch | sub_2E0C3B0 | -- | Removes terminator branch instructions from MBB |
| InsertBranch | sub_2E0F080 | -- | Inserts new branch instruction to redirect flow |
| ReplaceTailWithBranchTo | sub_2E0A600 | -- | Splices tail into shared block, inserts unconditional redirect |
| ReplaceUsesOfBlockWith | sub_2E0E0B0 | -- | Updates phi nodes and predecessor lists after merge |
| getBlockNumbered | sub_2E192D0 | -- | MBB number to pointer lookup |
| UpdateTerminator | sub_2FAD510 | -- | Fixes terminators after CFG modification |
| RemoveBlock | sub_2E790D0 | -- | Removes dead MBB from function; updates predecessor/successor lists |
| computeLiveIns | sub_2E16F10 | -- | Updates live-in register sets for merged block; filters reserved registers |
| getVRegDef | sub_2EBEE10 | -- | Virtual register definition lookup |
| hasProperty(flag) | sub_2E88A90 | -- | Multi-purpose register/operand property query (flag 0x200 = reserved, 0x80000 = uniform, 0x100000 = divergent) |
| HashMachineInstr | sub_2E89C70 | -- | Instruction hash for merge candidate bucketing (* 37 multiply-and-add scheme) |
| SpliceBlock | sub_2E31080 | -- | Unlinks MBB from doubly-linked list |
| NVPTXInstrInfo vtable | sub_2DAC790 | -- | Vtable base checked at 0x2F337A3 to validate InstrInfo supports analyzeBranch |
| Dynamic special register resolver | sub_3958DA0 | -- | Resolves opcodes 0x5E/0x5F to %warpid/%laneid |
| Special register emission | sub_21E86B0 | -- | Emits %tid, %ctaid, %ntid, %nctaid (opcodes 0x26--0x31) |
| Cluster register emission (SM 90+) | sub_21E9060 | -- | Emits 15 cluster registers (%cluster_ctarank, %clusterid, etc.) |
Interaction with StructurizeCFG
StructurizeCFG runs during the IR-level pipeline (before SelectionDAG), while BranchFolding runs after register allocation at the machine level. By the time BranchFolding executes, all control flow is already structured and reducible. The key interaction:
- StructurizeCFG may insert "Flow" blocks that serve as reconvergence points. These are often empty or contain only an unconditional branch. BranchFolding's empty-block elimination (step 1 of OptimizeBranches) can remove these if they have become redundant after code generation.
- Tail merging never introduces irreducible control flow because it only adds unconditional branches to a new shared tail block. The new block post-dominates the merged tails, preserving reducibility.
- The branch-fold-placement knob controls a separate invocation of branch folding logic embedded within MachineBlockPlacement. That invocation runs before the standalone BranchFolding pass and performs a limited subset of the same transformations during layout decisions.
Complexity
The hash-based matching makes the typical case efficient. For N blocks and average predecessor count M, the overall complexity is O(N * M) for hash computation, plus O(K^2 * T) for pairwise comparison within hash groups, where K is the number of blocks sharing a hash and T is the common tail length. The tail-merge-threshold caps K at 150. The recursive self-call pattern (the pass re-invokes itself when a merge creates new opportunities) means worst-case is O(N^2) iterations, but this is rare in practice -- most functions converge in 2-3 iterations.
Differences from Upstream LLVM
| Aspect | Upstream LLVM 20 | cicc v13.0 |
|---|---|---|
| Tail merge for structured-CFG targets | Disabled (requiresStructuredCFG() returns true -> tail merge off) | Enabled -- structured-CFG gate removed |
| Reserved-register merge gate | Not present (unnecessary -- tail merge disabled for GPU) | Gate 3: sub_2E88A90 with flag 0x200 rejects merges when special registers are live across the boundary |
| Priority ordering | Candidates sorted by hash only | Additional priority sort within hash groups: unconditional branches first (priority 0), then conditional (priority 2+) |
| NVPTXInstrInfo vtable check | Not present | cmp rax, offset sub_2DAC790 at 0x2F337A3 validates InstrInfo before merge attempts |
| computeLiveIns filtering | No reserved-register filter | Double-filters reserved registers (once at merge decision, once at live-in computation) |
| Convergence control awareness | Not present (no convergence tokens in upstream) | isIdenticalTo naturally prevents merging convergence pseudo-instructions with different targets |
| MachineInstr stride | 32-byte operand stride | 40-byte operand stride (extra 8 bytes for NVPTX-specific metadata) |
| Upstream source | llvm/lib/CodeGen/BranchFolding.cpp | Binary at 0x2F336B0--0x2F36310 range |
Cross-References
- Block Placement -- runs before BranchFolding; its branch-fold-placement knob triggers inline branch folding during layout.
- StructurizeCFG -- guarantees structured control flow before BranchFolding runs; inserts Flow blocks that BranchFolding may later eliminate. Uses the same sub_2E88A90 for divergence queries.
- Register Allocation -- BranchFolding requires the NoPHIs property, meaning it runs post-regalloc in the NVPTX pipeline. The greedy RA at sub_2F5A640 excludes reserved registers from pressure tracking.
- Instruction Scheduling -- scheduling runs after BranchFolding; the final CFG shape from branch folding determines scheduling regions.
- Register Classes -- documents the internal-only off_4A026E0 class ("!Special!") that holds reserved/environment registers. The register encoding function sub_21583D0 fatally aborts on this class.
- PTX Emission -- special register emission functions sub_21E86B0 and sub_21E9060 that handle the 63 reserved registers.
- NVPTX Target Infrastructure -- the disable-nvptx-require-structured-cfg knob that controls the structured-CFG requirement.
- Machine-Level Passes -- pipeline context showing BranchFolding's position after register allocation and before instruction scheduling.
- InstrEmitter -- another consumer of sub_2E88A90 that uses flag bit 36 for NVPTX-specific implicit-use detection.
MachineBlockPlacement for GPU
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
MachineBlockPlacement decides the physical ordering of basic blocks in a MachineFunction. On CPU, it is primarily an I-cache optimization. On GPU, block ordering has deeper consequences: PTX is a structured ISA where every taken branch stalls the SM instruction fetch pipeline, warp divergence must reconverge at post-dominators, and instruction cache capacity is measured in tens of kilobytes per SM partition. cicc carries two separate instances of this pass -- a stock LLVM copy for internal use and an NVPTX-pipeline copy at sub_3521FF0 that participates in GPU-specific analysis. The NVPTX instance queries a divergence flag on the MachineFunction to decide whether tail duplication is profitable, and adds an alternative layout proposal path (sub_34BEDF0 / sub_34C7080) that is absent from upstream LLVM.
Key Facts
| Property | Value |
|---|---|
| Entry point | sub_3521FF0 (82 KB decompiled, 2435 lines) |
| Pass name | "Branch Probability Basic Block Placement" |
| Pass ID | "block-placement" |
| Registration (NVPTX) | sub_350FE30 (pass), sub_350FEE0 (stats) |
| Registration (generic) | sub_1DE8060 (pass), sub_1DE8500 (stats) |
| Stats pass ID | "block-placement-stats", callback sub_3517680 |
| Knob constructor | ctor_671_0 at 0x5A0470 |
| Required analyses | MachineBlockFrequencyInfo, MachineBranchProbabilityInfo, MachinePostDominatorTree, MachineLoopInfo, TargetPassConfig |
Why Block Placement Matters on GPU
Three properties of GPU execution make block ordering non-trivial.
Instruction fetch pipeline. GPU SMs fetch instructions sequentially. A taken branch introduces a fetch bubble -- the warp scheduler cannot issue from the new target until the instruction cache services the request. Every fall-through edge is free; every taken branch costs at least one cycle of fetch latency. The misfetch-cost (default 1) and jump-inst-cost (default 1) knobs model this cost. Maximizing fall-through sequences directly reduces warp stall cycles at branch points.
Instruction cache pressure. GPU instruction caches are small (typically 32-128 KB per SM partition). Code duplication through tail-dup increases I-cache working set. The tail-dup-placement-penalty (default 2%) penalizes code copies that improve fall-through at the expense of I-cache pressure. The ext-TSP model, when enabled, explicitly optimizes for I-cache utilization by modeling forward/backward reference distances.
Warp divergence. When a branch is divergent (different lanes take different paths), all paths must execute serially, and the warp reconverges at the post-dominator. Block ordering cannot eliminate the divergence cost, but it determines which side of the branch falls through vs. takes a jump. The divergence flag at MF+8+688 bit 0 gates whether tail duplication is even attempted: duplicating a tail block that sits below a divergent branch wastes code size because divergent warps execute both paths regardless of which one falls through.
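The fetch-cost model described above can be sketched as a simple layout scoring function. This is a hedged illustration of the misfetch-cost / jump-inst-cost accounting, not recovered code; block names and frequencies are invented:

```python
# Sketch: every CFG edge that is not a fall-through to the next block in
# layout order pays misfetch-cost + jump-inst-cost (both default 1).
MISFETCH_COST = 1
JUMP_INST_COST = 1

def layout_cost(order, edges, freq):
    """order: block ids in layout order; edges: (src, dst); freq: edge -> count."""
    next_in_layout = {order[i]: order[i + 1] for i in range(len(order) - 1)}
    cost = 0
    for (src, dst) in edges:
        if next_in_layout.get(src) != dst:  # taken branch: fetch bubble
            cost += freq[(src, dst)] * (MISFETCH_COST + JUMP_INST_COST)
    return cost

# Diamond CFG: A -> B (hot), A -> C (cold), both rejoin at D
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
freq = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
print(layout_cost(["A", "B", "D", "C"], edges, freq))  # 40: hot path falls through
print(layout_cost(["A", "C", "D", "B"], edges, freq))  # 360: hot path takes branches
```

The 9x cost difference between the two orderings is exactly what maximizing fall-through sequences buys.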
Pass Object Layout
The pass object at a1 is populated during runOnMachineFunction:
| Offset | Type | Content |
|---|---|---|
| +488 | ptr | Loop chain working data (cleared by sub_35142F0) |
| +520 | MachineFunction* | Current function being processed |
| +528 | ptr | MachineBlockFrequencyInfo* (adjusted +169 from raw analysis pointer) |
| +536 | ptr | MachineBranchProbabilityInfo* (40-byte struct at +200) |
| +544 | ptr | MachinePostDominatorTree* (+200) |
| +552 | u64 | Working state (cleared to 0) |
| +560 | ptr | TargetInstrInfo* (nullptr if default vtable) |
| +568 | ptr | TargetRegisterInfo* (nullptr if default vtable) |
| +576 | ptr | TailDuplicator* (from unk_50209DC analysis, +200) |
| +584 | ptr | MachineLoopInfo* |
| +592 | ptr | TargetPassConfig* |
| +600 | inline | Chain-builder state (initialized by sub_2FD5DC0) |
| +776 | u64 | Profile-derived hot threshold |
| +784 | i32 | Tail-dup threshold (2 or 4) |
| +788 | bool | Profile count was explicitly provided |
| +792 | ptr | Bump allocator base (for chain node allocation) |
| +800 | u64 | Bump allocator capacity |
| +872 | u64 | Bump allocator total allocation counter |
| +888 | struct | Chain-map (BB-to-chain DenseMap, queried via sub_3515040) |
Chain nodes are 64 bytes each, allocated from the bump allocator:
struct ChainNode { // 64 bytes
MachineBasicBlock** bb_array; // +0: pointer to BB array (initially +16)
uint32_t count; // +8: number of BBs in chain
uint32_t capacity; // +12: capacity (initial: 1)
MachineBasicBlock* inline_bb; // +16: inline storage for single-BB chain
uint8_t padding[24]; // +24: space for up to 3 more inline BBs
void* chain_map; // +48: pointer to parent chain-map
uint64_t flags; // +56: chain flags
};
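The inline-then-heap growth pattern of the chain node's BB array can be modeled as below. This is a sketch of the behavior described (inline capacity 1, migration to a separate allocation on the second append); the class and method names are invented:

```python
# Sketch: ChainNode's bb_array initially points at the node's own inline
# slot (+16); appending a second BB triggers a heap allocation (the role
# sub_C7D6A0 plays in the binary) and the array migrates off the node.
class ChainNode:
    def __init__(self, bb):
        self.inline = [bb]        # models the +16 inline storage
        self.bbs = self.inline    # bb_array initially -> inline storage
        self.capacity = 1

    def append(self, bb):
        if len(self.bbs) == self.capacity:
            self.capacity *= 2
            if self.bbs is self.inline:       # first growth: copy out of inline
                self.bbs = list(self.inline)  # models the heap allocation
        self.bbs.append(bb)

node = ChainNode("bb0")
node.append("bb1")
print(node.bbs, node.bbs is node.inline)  # ['bb0', 'bb1'] False
```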
Algorithm Overview
The entry point sub_3521FF0 dispatches to one of two layout algorithms: the standard chain-based placement, or the ext-TSP layout when explicitly enabled. The overall flow:
runOnMachineFunction(MF):
if MF.empty(): return 0
// Fetch analyses
MBFI = getAnalysis<MachineBlockFrequencyInfo>()
MBPI = getAnalysis<MachineBranchProbabilityInfo>()
MPDT = getAnalysis<MachinePostDominatorTree>()
MLI = getAnalysis<MachineLoopInfo>()
TPC = getAnalysis<TargetPassConfig>()
TII = MF.getSubtarget().getInstrInfo()
TRI = MF.getSubtarget().getRegisterInfo()
// Compute tail-dup threshold
threshold = computeTailDupThreshold(optLevel, TII)
// Decide layout algorithm
if enable-ext-tsp-block-placement AND MF.size() fits:
applyExtTsp(MF)
else:
buildChains(MF) // sub_3521900
tailDupPlacement(MF) // sub_35185B0 (if enabled + not divergent)
tryAlternativeLayout(MF) // sub_34BEDF0 + sub_34C7080 (NVIDIA addition)
// Post-placement
optimizeBranches() // flip branches for fall-through
alignBlocks() // sub_3516980
cleanup()
return 1
Chain-Based Placement (Standard Path)
sub_3521900 (buildChains) is the workhorse. It operates in four steps.
Step 1 -- Initial Chain Construction
For every BB in the MachineFunction (iterated via the doubly-linked intrusive list from MF+328 to sentinel MF+320), the builder:
- Allocates a 64-byte chain node from the bump allocator at pass+792. The node is initialized with count=1, capacity=1, the inline BB pointer set to the current BB, and the chain-map pointer set to pass+888.
- Inserts the BB-to-chain mapping into the chain-map via sub_3515040 (DenseMap insert with pointer hash ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1)).
- Attempts to extend the chain forward: calls TII->analyzeBranch() (vtable+344) on the current BB. If analyzable and a fall-through successor exists, calls sub_2E32580 to verify the successor is valid for chaining (not already claimed by a different chain, not a landing pad, not the function entry if it would create a cycle). If valid, the successor is appended to the chain's BB array (growing from inline storage to heap allocation via sub_C7D6A0 when needed), and the walk continues from the successor.
The result is a set of maximal fall-through chains -- each chain represents a sequence of BBs where every transition is a fall-through edge according to analyzeBranch.
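Step 1 can be condensed into the following sketch, where `fallthrough(bb)` stands in for the combined analyzeBranch + sub_2E32580 validity check and returns the valid fall-through successor or None:

```python
# Sketch of Step 1: partition blocks into maximal fall-through chains.
# A successor already claimed by another chain terminates the walk.
def build_chains(blocks, fallthrough):
    claimed, chains = set(), []
    for bb in blocks:
        if bb in claimed:
            continue
        chain = [bb]
        claimed.add(bb)
        succ = fallthrough(bb)
        while succ is not None and succ not in claimed:
            chain.append(succ)
            claimed.add(succ)
            succ = fallthrough(succ)
        chains.append(chain)
    return chains

# D's fall-through (B) is already claimed, so D forms a singleton chain
ft = {"A": "B", "B": "C", "C": None, "D": "B"}
print(build_chains(["A", "B", "C", "D"], ft.get))  # [['A', 'B', 'C'], ['D']]
```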
Step 2 -- Loop Chain Merging
Read the MachineLoopInfo structure at pass+584. Iterate loops from innermost outward. For each loop, call sub_351EBB0 (buildLoopChains), which:
- Identifies all chains that contain BBs belonging to the loop.
- Merges these chains into a single loop chain, ordering them to maximize fall-through within the loop body.
- Applies loop rotation via sub_351C710 (rotateLoop) to place the exiting block at the bottom, making the back-edge a fall-through and the exit a taken branch (or the reverse, whichever minimizes cost according to the profile data).
Cold blocks within the loop (where loop_freq / block_freq > loop-to-cold-block-ratio) are ejected from the loop chain and will be placed at the function's end during the commit step.
Step 3 -- Global Successor Ordering
Call sub_35157A0 (selectBestSuccessor) for each BB to find the globally best successor chain ordering. The selection considers:
- Edge probability from sub_2E441D0 (getEdgeProbability)
- Whether the successor is already the fall-through (free) or would require a taken branch (cost = misfetch-cost + jump-inst-cost)
- Whether chaining the successor would break an existing profitable chain connection
Then sub_351D700 (buildChainForBlock) performs a greedy walk from the function entry, building the top-level chain by repeatedly selecting the best unchained successor and appending it.
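A hedged sketch of the selection criteria: edge probability, minus a penalty when the candidate is not already the fall-through. The 0.1 penalty constant is invented for illustration (standing in for misfetch-cost + jump-inst-cost scaled into probability space), and the "breaks an existing profitable chain" check is omitted:

```python
# Sketch of selectBestSuccessor-style scoring; names and the penalty
# constant are assumptions, not recovered values.
def select_best_successor(bb, succs, prob, fallthrough, chained):
    best, best_score = None, float("-inf")
    for s in succs[bb]:
        if s in chained:
            continue  # already placed in another chain
        score = prob[(bb, s)]
        if fallthrough.get(bb) != s:
            score -= 0.1  # hypothetical stand-in for the taken-branch cost
        if score > best_score:
            best, best_score = s, score
    return best

succs = {"A": ["B", "C"]}
prob = {("A", "B"): 0.6, ("A", "C"): 0.4}
print(select_best_successor("A", succs, prob, {"A": "C"}, set()))   # B
print(select_best_successor("A", succs, prob, {"A": "C"}, {"B"}))   # C
```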
Step 4 -- Commit
Walk the final chain's BB array and splice each BB into position using intrusive-list pointer swaps on the MachineFunction's BB list (pointer updates at BB+0 and BB+8 -- the prev/next pointers of the doubly-linked list).
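The commit step's intrusive-list surgery can be sketched as plain doubly-linked pointer swaps (modeling the prev/next pointers at BB+0 and BB+8; node names are invented):

```python
# Sketch: splicing a BB into its final position on a sentinel-terminated
# circular doubly-linked list, as the commit step does on the MF's BB list.
class Node:
    def __init__(self, name):
        self.name, self.prev, self.next = name, None, None

def unlink(n):
    n.prev.next, n.next.prev = n.next, n.prev

def insert_after(pos, n):
    n.prev, n.next = pos, pos.next
    pos.next.prev = n
    pos.next = n

# Sentinel-terminated list: S <-> A <-> B <-> C <-> back to S
s, a, b, c = Node("S"), Node("A"), Node("B"), Node("C")
seq = [s, a, b, c]
for i, n in enumerate(seq):
    n.next, n.prev = seq[(i + 1) % 4], seq[(i - 1) % 4]

unlink(c)            # chosen layout moves C ahead of B
insert_after(a, c)

cur, out = s.next, []
while cur is not s:
    out.append(cur.name)
    cur = cur.next
print(out)  # ['A', 'C', 'B']
```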
Ext-TSP Layout (Optional Path)
When enable-ext-tsp-block-placement is true (default: false), the pass uses the Extended Travelling Salesman Problem formulation from LLVM's CodeLayout.h. This is a profile-guided model that explicitly optimizes I-cache utilization by penalizing backward references and rewarding fall-through edges.
The ext-TSP path builds a BB index hash-map using LLVM's DenseMap pattern (hash: (ptr >> 9) ^ (ptr >> 4), 75% load factor), computes block frequencies and edge weights, then runs three solver functions:
| Function | Role |
|---|---|
| sub_29BAF70 | calcExtTspScore() -- score the original layout |
| sub_29BAC40 | calcExtTspScore() -- score the alternative layout |
| sub_29BB2B0 | computeExtTspLayout() -- reorder chains by ext-TSP objective |
The pass compares original vs. reordered cost and commits the better ordering via sub_3519A10 (applyBlockOrder). Additional ext-TSP tuning knobs (registered in ctor_492 at 0x5545a0):
| Knob | Description |
|---|---|
| ext-tsp-forward-weight-cond / uncond | Weight for conditional/unconditional forward jumps |
| ext-tsp-backward-weight-cond / uncond | Weight for conditional/unconditional backward jumps |
| ext-tsp-fallthrough-weight-cond / uncond | Weight for fall-through edges |
| ext-tsp-forward-distance / backward-distance | Distance thresholds for cache modeling |
| ext-tsp-max-chain-size | Maximum chain size for ext-TSP merging |
| ext-tsp-chain-split-threshold | Threshold for splitting chains |
| ext-tsp-max-merge-density-ratio | Density ratio cap for chain merges |
| ext-tsp-apply-without-profile | Run ext-TSP even without PGO data |
| cdsort-cache-entries / cache-size | CDSort cache model parameters |
| cdsort-max-chain-size | CDSort chain size limit |
| cdsort-distance-power / frequency-scale | CDSort cost model tuning |
NVIDIA-Specific Modifications
Divergence-Gated Tail Duplication
The most significant GPU-specific behavior is the divergence check before tail duplication. Before invoking sub_35185B0 (tailDupPlacement), the pass reads MF+8+688 bit 0 -- a flag set by earlier divergence analysis passes indicating the function contains warp-divergent branches. When this bit is set, tail duplication is skipped entirely.
The rationale: tail duplication creates an additional copy of a basic block to convert a diamond-shaped CFG into a straight-line fall-through. On CPU, this eliminates a taken branch on the hot path. On GPU with divergent branches, both sides of the diamond execute regardless (the warp mask simply toggles), so duplicating the tail block doubles code size for zero fall-through benefit. The divergence flag is a conservative gate -- it disables tail-dup for the entire function, not per-branch.
Alternative Layout Proposal Algorithm
When the standard chain-based path is selected (not ext-TSP), and the function has more than 3 basic blocks with profile data and is not marked divergent, the pass runs a complete alternative layout evaluation through a pipeline absent from upstream LLVM. This is one of cicc's most significant code-layout additions.
Activation Gate
if (byte_503C568 is set AND MF.size() > 3):
evaluator = sub_34BEDF0(state, profile_flag, MBFI, TII, MBPI)
changed = sub_34C7080(evaluator, MF, chain_data, ...)
if changed:
commit(evaluator_layout)
The gate variable byte_503C568 corresponds to the branch-fold-placement knob (default true). When branch-fold-placement is active and the function has enough basic blocks to justify the extra analysis cost, the alternative path fires.
State Object Initialization -- sub_34BEDF0 (321 bytes)
sub_34BEDF0 is a constructor that initializes a 0x100-byte evaluator state object. It takes six arguments: (rdi=state, rsi=profile_available, rdx=?, rcx=MBFI*, r8=TII*, r9=MBPI*). The initialization zeroes the majority of the structure and sets up internal storage pointers:
struct LayoutEvaluatorState { // 0x100 bytes, initialized by sub_34BEDF0
void* bb_array_ptr; // +0x00: BB ordering array (initially null)
uint64_t bb_array_size; // +0x08: count
uint64_t bb_array_cap; // +0x10: capacity
uint64_t iteration_count; // +0x18: cleared to 0
void* inline_storage_ptr; // +0x20: points to +0x38 (inline array)
uint64_t initial_capacity; // +0x28: set to 2
uint32_t current_count; // +0x30: set to 0
uint8_t is_fresh; // +0x34: set to 1 (first-run flag)
uint8_t padding[3]; // +0x35
uint8_t inline_array[72]; // +0x38: inline storage for small chains
uint8_t profile_available; // +0x80: bit 0 = profile flag
uint8_t force_mode; // +0x81: set from qword_503AD08 knob
uint8_t divergence_aware; // +0x82: set from dl argument
uint8_t needs_reconverge; // +0x83: cleared to 0, set during evaluation
uint32_t bb_limit; // +0x84: from stack argument (BB count cap)
// +0x88..+0xA8: five qword slots, all zeroed
void* bb_ptr_array; // +0xB0: points to +0xC8 (inline)
uint64_t bb_ptr_array_pad; // +0xB8: cleared
uint64_t bb_ptr_array_cap; // +0xC0: set to 8
uint8_t bb_ptr_inline[24]; // +0xC8: inline BB pointer storage
uint64_t total_cost; // +0xD8: cleared
uint32_t cost_flags; // +0xE0: cleared
void* mbfi_ptr; // +0xE8: MachineBlockFrequencyInfo*
void* tii_ptr; // +0xF0: TargetInstrInfo*
void* mbpi_ptr; // +0xF8: MachineBranchProbabilityInfo*
};
The force_mode field at offset +0x81 is set based on the global qword_503AD08. When this global equals 0, the force mode takes the profile_available argument. When it equals 1, force mode is unconditionally set to 1 (always evaluate). Any other value causes a straight return (skip evaluation). This provides a three-way override: 0=auto, 1=always, other=never.
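The three-way override reduces to the following sketch (function name invented; the semantics are as described above):

```python
# Sketch of the qword_503AD08 override: 0 = auto (follow the profile flag),
# 1 = always evaluate, any other value = skip evaluation (straight return).
def resolve_force_mode(global_knob, profile_available):
    if global_knob == 0:
        return profile_available   # auto
    if global_knob == 1:
        return True                # always evaluate
    return None                    # never: constructor returns early

print(resolve_force_mode(0, False))  # False
print(resolve_force_mode(1, False))  # True
print(resolve_force_mode(2, True))   # None
```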
Dispatch Wrapper -- sub_34C7080 (17 bytes)
sub_34C7080 is a thin guard:
// sub_34C7080(rdi=evaluator, rsi=MF, rdx=chain_data, rcx=..., r8=..., r9=changed_flag)
if (rdx == NULL) return 0; // no chain data -> nothing to evaluate
return sub_34C6AF0(rdi, rsi, rdx, rcx, r8, (bool)r9);
The NULL check on rdx (the chain-data pointer) provides a fast exit when the chain builder produced no intermediate state worth re-evaluating.
Core Layout Evaluator -- sub_34C6AF0 (1419 bytes)
sub_34C6AF0 is the real body of the alternative layout evaluator. It operates on the evaluator state object (from sub_34BEDF0) and the MachineFunction, performing a complete re-evaluation of the chain-based layout against a different cost model. The algorithm proceeds in six steps:
Step 1 -- Iteration counter and hash table reset.
Increment the iteration count at state+0x18. If the hash table at state+0x20 is not fresh (byte at state+0x34 is 0), compute a minimum table size as max(32, 4 * (capacity - count)), and if the current table is undersized, fill it with 0xFF sentinels via memset. This hash table tracks which BBs have been visited during the current evaluation pass.
Step 2 -- State initialization from MachineFunction.
Clear the running cost accumulator at state+0x2C..+0x30. Read the first BB from the MachineFunction's chain data. Store the chain data pointer, the iteration limit from state+0x84, and the analysis pointers (MBPI at state+0x98, TII at state+0xA0) into the evaluator's working slots.
Read a subtarget field at MF->getSubtarget() + 0x220 and subtract 0x2A (decimal 42). This produces an SM-generation index (sm_70=0, sm_75=1, sm_80=2, ..., sm_90=6, sm_100=16 under this encoding). The index selects which cost table row is used for the fetch-penalty model.
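The index computation is trivial, but the raw field values it implies are worth spelling out. The raw bytes below are back-derived from the recovered mapping (an RE inference, not directly observed):

```python
# Sketch: SM-generation index = raw subtarget field - 0x2A.
def sm_index(raw_subtarget_field):
    return raw_subtarget_field - 0x2A

# Raw values implied by the recovered mapping (assumed, for illustration):
assert sm_index(0x2A) == 0    # sm_70
assert sm_index(0x2B) == 1    # sm_75
assert sm_index(0x2C) == 2    # sm_80
assert sm_index(0x30) == 6    # sm_90
assert sm_index(0x3A) == 16   # sm_100
```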
Step 3 -- Divergence-aware block scanning.
For the first BB in the chain, check bit 2 of the flags at BB[0]+0x158. If set, dispatch to TII->vtable+0x210 (which is compared against sub_2FF52D0 -- the default stub). If the target overrides this vtable slot, call the override with the MachineFunction to determine if the block needs special handling. When the default is in use, set state+0x83 (needs_reconverge) to 1 unconditionally. This appears to be an NVPTX check for whether the block is in a reconvergence region where layout ordering has correctness implications, not just performance.
Step 4 -- Main evaluation loop.
Call sub_34BA1B0 to snapshot the current chain state into a temporary structure on the stack. Then enter the main loop:
while (true):
status = sub_34C4890(state, MF) // advance to next BB in evaluation order
changed_bit = (state->profile_available XOR 1) OR status
if changed_bit == 0:
// Reached the evaluation boundary without changes
if (sm_index <= 1): // sm_70 or sm_75
check qword_503AA68 knob // additional gate for older archs
if set: call sub_34C0690(state, loop) for each loop in MF
if state->divergence_aware:
call sub_34C56D0(state, loop) for each loop in MF
break if no further changes
// A change was proposed
call sub_34C2D70(state, MF) // apply the proposed reordering step
accumulate changed flags
if (sm_index <= 1): // sm_70/sm_75
check qword_503AA68 knob
if set: call sub_34C0690 for each loop
if state->divergence_aware:
call sub_34C56D0 for each loop
if changed_this_iteration:
continue loop
sub_34C4890 advances through the MachineFunction's basic blocks in frequency-priority order, proposing a reordering when a higher-frequency successor is not the current fall-through. sub_34C2D70 performs the actual chain manipulation to implement the proposed swap.
Step 5 -- Loop-level re-evaluation.
The calls to sub_34C56D0 (5137 bytes, called from sub_34C6AF0 via the loop-iteration path at 0x34C6E90) perform loop-level cost re-evaluation. This function:
- Walks the MachineFunction's loop tree (from MF+0x148, the MachineLoopInfo block list)
- For each loop, evaluates whether the proposed layout improves or degrades the loop body's fall-through density
- Calls sub_34C0EE0 for block-level cost queries
- Calls sub_34BE7F0 for chain adjacency analysis
- Queries sub_2E88AF0 (divergence analysis) and sub_2E88FE0 for convergence properties
- Uses sub_2FDC710 / sub_2FDC700 for target-specific cost overrides via the TII vtable
- Calls sub_3509790 for reconvergence point identification
sub_34C0690 (called on the sm_70/sm_75 path gated by qword_503AA68) is a lighter variant that omits the divergence-aware sub-evaluations, appropriate for older SM architectures where divergence reconvergence is handled differently.
Step 6 -- Final cost comparison and bitvector scan.
After the evaluation loop terminates, build a bitvector tracking which BBs changed position. The bitvector uses 64-bit words with word index = bb_index >> 6 and bit position = bb_index & 63. Walk the MachineFunction's loop tree blocks (MF+0x148 linked list):
- For each block in the loop, walk the instruction list starting at BB+0x20
- For each instruction, mask the opcode with 0xFFFFFF and compute opcode * 5 as a stride
- If the instruction byte at offset 0 is 0x08 (a branch instruction), set the corresponding bit in the bitvector

Then scan the bitvector against the evaluator's proposed ordering to detect any BB that would need to move. If at least one BB is displaced, set the return flag.
On the final cost-comparison path (at 0x34C6FD3), the evaluator reads TII->vtable+0x5D8 and compares against sub_2FDC810. If the target overrides this slot, the override is called to provide a final accept/reject decision. Otherwise, a default threshold of 3 is used: the proposed layout is accepted only if the cost reduction exceeds the acceptance threshold. The stat-based knobs at dword_503AAC8 and qword_503AB48 provide tuning for the threshold lookup via the sub_C52410/sub_C959E0 statistics infrastructure.
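The bitvector indexing scheme above is the standard 64-bit-word layout; a minimal sketch:

```python
# Sketch of the Step 6 bitvector: word index = bb_index >> 6,
# bit position = bb_index & 63.
def bv_set(words, idx):
    words[idx >> 6] |= 1 << (idx & 63)

def bv_test(words, idx):
    return (words[idx >> 6] >> (idx & 63)) & 1

words = [0, 0]  # covers 128 block indices
for i in (3, 63, 64, 100):
    bv_set(words, i)
print(bv_test(words, 63), bv_test(words, 64), bv_test(words, 65))  # 1 1 0
```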
Shared Infrastructure with Register Allocation
A surprising discovery: sub_34BEDF0 and sub_34C7080/sub_34C6AF0 are also called from sub_34ED530 (RegAllocGreedy, 91KB) via sub_34F1190. The register allocator uses the same layout evaluator to assess whether a spill-induced block split would degrade code layout quality. This sharing means the cost model is consistent between register allocation decisions and post-RA block placement, preventing the two passes from working at cross purposes. The evaluator state is separate per invocation (stack-allocated), so there is no state leakage between the two callers.
SM-Generation-Dependent Behavior
The SM index computation ((MF->getSubtarget()+0x220) - 0x2A) creates generation-dependent behavior:
| SM Generation | Index | Loop Evaluator | Divergence Sub-Eval |
|---|---|---|---|
| sm_70 (Volta) | 0 | sub_34C0690 if qword_503AA68 | Only if divergence flag |
| sm_75 (Turing) | 1 | sub_34C0690 if qword_503AA68 | Only if divergence flag |
| sm_80+ (Ampere+) | 2+ | Skipped (only sub_34C56D0) | Always if divergence flag |
This split reflects the hardware difference: Volta and Turing use a stack-based reconvergence mechanism that benefits from the lighter sub_34C0690 analysis, while Ampere and later use the uniform warp scheduler where the more thorough sub_34C56D0 evaluation is worthwhile.
Dual Pass Registration
The binary contains two complete instances of MachineBlockPlacement:
| Instance | Registration | Purpose |
|---|---|---|
| sub_350FE30 (NVPTX) | NVPTX backend pipeline | GPU-specific analysis results, divergence-aware |
| sub_1DE8060 (generic) | Default LLVM pipeline | Standard pass for any non-GPU path |
Having a separate NVPTX instance allows NVIDIA to control pass ordering independently. The NVPTX version is inserted at a specific point in the backend pipeline where divergence analysis results are available.
Target Tail-Dup Threshold Override
The tail-dup threshold (how many instructions a tail block can have before duplication is rejected) is determined by a multi-level decision:
default_threshold = 2 // tail-dup-placement-threshold
aggressive_threshold = 4 // tail-dup-placement-aggressive-threshold
if TII->getTailDupThreshold(optLevel) overrides: // vtable+1488
threshold = TII_override // NVPTX can take full control
elif optLevel > 2 (-O3):
threshold = aggressive_threshold // 4
else:
threshold = default_threshold // 2
The default stub at sub_2FDC800 returns 2 * ((optLevel > 2) + 1), i.e., 2 at -O2 and 4 at -O3. If NVPTX's TargetInstrInfo overrides this (the pass explicitly checks whether the vtable slot points to sub_2FDC800), the override takes full control. This allows the NVPTX backend to set a different tail-dup aggressiveness based on SM generation or kernel properties.
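The default stub's arithmetic is compact enough to reproduce directly:

```python
# Sketch of sub_2FDC800: 2 * ((optLevel > 2) + 1), i.e. 2 below -O3, 4 at -O3.
def default_tail_dup_threshold(opt_level):
    return 2 * ((opt_level > 2) + 1)

print(default_tail_dup_threshold(2))  # 2  (-O2)
print(default_tail_dup_threshold(3))  # 4  (-O3)
```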
Loop Rotation and Header Placement
Loop rotation (sub_351C710, called from buildLoopChains) determines whether the loop header is placed at the top or bottom of the loop chain. The goal is to place the exiting block at the bottom so the back-edge is a fall-through and the exit is a taken branch (or vice versa, whichever is more profitable).
Two rotation strategies exist:
Basic rotation (default): Place the exiting block last. Skip rotation if the header already has a viable fall-through from outside the loop, unless the exit edge frequency exceeds the fall-through frequency. This avoids introducing an unnecessary branch at loop entry.
Profile-guided rotation (precise-rotation-cost): Enumerate all possible rotations, compute fall-through cost for each (missed fall-through from loop entry, missed fall-throughs at exit points, missed back-edge fall-through), and select the rotation with minimum total cost. Controlled by two knobs:
- precise-rotation-cost (default false): enable the profile-guided rotation cost model
- force-precise-rotation-cost (default false): force it even without good profile data
For GPU kernels where loops are the dominant compute pattern, correct loop rotation determines whether the loop body executes as a straight fall-through sequence or requires a taken back-edge branch every iteration. Since the misfetch-cost is low (default 1), the benefit is modest per iteration but accumulates over millions of iterations typical in GPU compute.
Hot/Cold Splitting
cicc does not perform function-level hot/cold splitting. This is expected: GPU kernels are designed for all threads in a warp to execute the same path. There is no equivalent of a CPU "cold" exception handler that should be placed far from hot code. The loop-to-cold-block-ratio knob (default 5) does enable outlining individual cold blocks from loop chains -- moving them to the end of the function -- but this is intra-function block reordering, not function splitting.
The knob force-loop-cold-block (default false) forces cold block outlining from loops regardless of the frequency ratio. When loop_freq / block_freq > loop-to-cold-block-ratio, the block is moved out of the loop chain to reduce the loop body's I-cache footprint.
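The outlining decision combines the ratio test with the force knob; a minimal sketch under the defaults stated above:

```python
# Sketch: a loop block is outlined when loop_freq / block_freq exceeds
# loop-to-cold-block-ratio (default 5), or always when force-loop-cold-block.
def should_outline(loop_freq, block_freq, ratio=5, force=False):
    return force or loop_freq / block_freq > ratio

print(should_outline(1000, 300))              # False: block runs often enough
print(should_outline(1000, 100))              # True: 10x colder than the loop
print(should_outline(1000, 900, force=True))  # True: forced by the knob
```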
Post-Placement Passes
After layout is committed, two post-processing steps run:
Branch optimization. Walk the final BB ordering. For each analyzable branch with profile info, check whether reversing the branch direction would improve fall-through. Call TII->reverseBranchCondition() (vtable+880) to flip the condition, then update the branch targets via vtable+360/368. This is controlled by sub_2EE6AD0 which checks profitability by comparing edge costs with sub_2E441D0 (getEdgeProbability).
Block alignment (sub_3516980). Walk each BB and set alignment based on block frequency, loop depth, and whether the block is a fall-through target. Controlled by:
- align-all-blocks (default 0): force log2 alignment on every block
- align-all-nofallthru-blocks (default 0): force alignment on blocks without fall-through predecessors
- max-bytes-for-alignment (default 0): cap padding bytes
On GPU, block alignment is generally not useful -- PTX does not expose alignment constraints on basic blocks, and the hardware instruction fetch unit does not benefit from aligned block boundaries the way a CPU I-cache line does.
Configuration Knobs
All knobs are LLVM-standard with stock defaults. The NVIDIA delta is behavioral, not configurational.
| Knob | Type | Default | Effect |
|---|---|---|---|
| disable-block-placement | bool | false | Disable the pass entirely |
| enable-block-placement-stats | bool | false | Collect placement statistics |
| tail-dup-placement | bool | true | Enable tail duplication during placement |
| tail-dup-placement-threshold | int | 2 | Max instructions for tail-dup candidate |
| tail-dup-placement-aggressive-threshold | int | 4 | Aggressive threshold at -O3 |
| tail-dup-placement-penalty | int | 2 | I-cache pressure penalty (percent) |
| tail-dup-profile-percent-threshold | int | 50 | Min hot-count percentage for profile-guided tail-dup |
| triangle-chain-count | int | 2 | Consecutive triangles before triangle heuristic activates |
| branch-fold-placement | bool | true | Fold branches during placement |
| misfetch-cost | int | 1 | Taken-branch fetch penalty |
| jump-inst-cost | int | 1 | Cost of a jump instruction |
| block-placement-exit-block-bias | int | 0 | Frequency percentage for loop exit replacement |
| loop-to-cold-block-ratio | int | 5 | Ratio threshold for cold block outlining |
| force-loop-cold-block | bool | false | Force outlining cold blocks from loops |
| precise-rotation-cost | bool | false | Profile-guided loop rotation cost |
| force-precise-rotation-cost | bool | false | Force precise rotation cost |
| align-all-blocks | int | 0 | Force block alignment (log2) |
| align-all-nofallthru-blocks | int | 0 | Force alignment on non-fall-through blocks |
| max-bytes-for-alignment | int | 0 | Max padding for alignment |
| enable-ext-tsp-block-placement | bool | false | Enable ext-TSP layout algorithm |
| ext-tsp-block-placement-max-blocks | int | -1 | Max BB count for ext-TSP (unlimited) |
| apply-ext-tsp-for-size | bool | false | Use ext-TSP for code size optimization |
| renumber-blocks-before-view | bool | false | Renumber BBs before dot-graph output |
DenseMap Implementation Pattern
The pass uses LLVM's DenseMap for BB-to-chain and BB-to-index lookups. The open-addressing hash-map pattern appears 20+ times in the decompiled code:
// Hash function for pointer keys
size_t hash = ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1);
// Probing: linear with increment counter
// Empty sentinel: 0xFFFFFFFFFFFFF000 (-4096)
// Deleted sentinel: 0xFFFFFFFFFFFFE000 (-8192)
// Rehash trigger: 4 * (count + 1) >= 3 * bucket_count (75% load)
// Rehash function: sub_2E3E470(map, new_capacity)
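The pattern above can be turned into a small executable model. This is a sketch of the described behavior (hash, sentinels, probing with an increment counter, 75% load-factor rehash), not the decompiled code; deletion handling is simplified since the placement maps never erase entries mid-pass:

```python
# Model of the DenseMap pattern: open addressing over pointer keys.
EMPTY = -4096 & 0xFFFFFFFFFFFFFFFF    # 0xFFFFFFFFFFFFF000
DELETED = -8192 & 0xFFFFFFFFFFFFFFFF  # 0xFFFFFFFFFFFFE000

def ptr_hash(ptr, bucket_count):
    return ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1)

class DenseMapModel:
    def __init__(self, bucket_count=32):
        self.keys = [EMPTY] * bucket_count
        self.vals = [None] * bucket_count
        self.count = 0

    def insert(self, key, val):
        if 4 * (self.count + 1) >= 3 * len(self.keys):  # 75% load: rehash
            self._grow()
        i, step = ptr_hash(key, len(self.keys)), 1
        while self.keys[i] not in (EMPTY, DELETED, key):
            i = (i + step) & (len(self.keys) - 1)  # probe with increment counter
            step += 1
        if self.keys[i] != key:
            self.count += 1
        self.keys[i], self.vals[i] = key, val

    def get(self, key):
        i, step = ptr_hash(key, len(self.keys)), 1
        while self.keys[i] != EMPTY:
            if self.keys[i] == key:
                return self.vals[i]
            i = (i + step) & (len(self.keys) - 1)
            step += 1
        return None

    def _grow(self):
        old = [(k, v) for k, v in zip(self.keys, self.vals)
               if k not in (EMPTY, DELETED)]
        self.keys = [EMPTY] * (len(self.keys) * 2)
        self.vals = [None] * len(self.keys)
        self.count = 0
        for k, v in old:
            self.insert(k, v)

m = DenseMapModel()
m.insert(0x7F0000001000, "chain0")
m.insert(0x7F0000002000, "chain1")
print(m.get(0x7F0000001000), m.get(0x7F0000003000))  # chain0 None
```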
GPU-Specific Placement Considerations
Why This Pass Matters More on GPU Than on CPU
On CPU, MachineBlockPlacement is primarily an I-cache optimization -- placing hot blocks contiguously reduces cache misses. On GPU, the stakes are higher for three reasons:
- No branch prediction. GPU SMs do not speculate. Every taken branch is a guaranteed fetch stall. The ratio of taken branches to fall-throughs directly translates to warp scheduler utilization. Optimal block placement can eliminate 10-30% of fetch bubbles in branch-heavy kernels.
- Instruction cache is tiny and shared. A single SM partition has 32-128 KB of instruction cache shared across all active warps. Code duplication (tail-dup, loop unrolling) competes with warp occupancy for this shared resource. The tail-dup-placement-penalty (2%) is conservative -- on kernels with high warp counts, even small code size increases can cause I-cache thrashing.
- Reconvergence is layout-sensitive. On architectures before Ampere (sm_70, sm_75), the stack-based reconvergence mechanism depends on the post-dominator being reachable from both sides of a divergent branch. Block placement that separates a post-dominator from its divergent predecessors can increase the live warp state, consuming scarce convergence stack entries. The alternative layout evaluator's sub_34C0690 path specifically addresses this by evaluating reconvergence distance.
Structured Control Flow Constraint
Unlike CPU backends where block placement has complete freedom, the NVPTX backend runs StructurizeCFG before MachineBlockPlacement. This means:
- All irreducible control flow has already been eliminated
- Structured regions (loops, if-then-else diamonds) are contiguous in the CFG
- Block placement cannot violate structured region boundaries without re-structurizing
This constraint actually simplifies placement in some cases (fewer valid orderings to consider) but eliminates certain profitable reorderings that would be legal on CPU (e.g., outlining a cold exception handler to a distant location that breaks region contiguity).
Interaction with PTX Emission
The final block ordering directly determines which branches in the PTX output are bra instructions (taken) vs. fall-throughs (implicit). The AsmPrinter (see AsmPrinter) emits bra only for non-fall-through edges. Since ptxas performs its own block scheduling on the PTX input, the cicc block ordering serves as a strong hint rather than a final answer -- but ptxas generally respects the input ordering for blocks within the same structured region.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| runOnMachineFunction | sub_3521FF0 | 82 KB | Entry point |
| buildChains | sub_3521900 | -- | Initial chain construction |
| tailDupPlacement | sub_35185B0 | -- | Tail-dup-aware chain merging |
| applyBlockOrder | sub_3519A10 | -- | Commit final BB ordering to MF |
| alignBlocks | sub_3516980 | -- | Post-placement alignment |
| buildLoopChains | sub_351EBB0 | -- | Loop-aware chain merging |
| buildChainForBlock | sub_351D700 | -- | Greedy successor chain walk |
| selectBestSuccessor | sub_35157A0 | -- | Pick best fall-through successor |
| chainLookup | sub_3515040 | -- | DenseMap BB-to-chain lookup |
| rotateLoop | sub_351C710 | -- | Loop rotation heuristic |
| mergeTails | sub_351A710 | -- | Chain tail merge logic |
| lowerChain | sub_35161F0 | -- | Final lowering of chain to BB list |
| (helper) | sub_3515CB0 | -- | Chain cost model evaluation |
| (helper) | sub_3515280 | -- | Chain building iteration |
| (helper) | sub_3516000 | -- | Chain length query |
| (NVIDIA addition) | sub_34BEDF0 | -- | Layout evaluator state constructor (321 bytes) |
| (NVIDIA addition) | sub_34C7080 | -- | Layout evaluator dispatch wrapper (17 bytes, guards sub_34C6AF0) |
| (NVIDIA addition) | sub_34C6AF0 | -- | Core layout evaluator body (1419 bytes, SM-aware) |
| (NVIDIA addition) | sub_34C4890 | -- | Frequency-priority BB advancement |
| (NVIDIA addition) | sub_34C2D70 | -- | Chain swap application |
| (NVIDIA addition) | sub_34C56D0 | -- | Loop-level cost re-evaluation (5137 bytes, divergence-aware) |
| (NVIDIA addition) | sub_34C0690 | -- | Lightweight loop evaluator (sm_70/sm_75 path) |
| (NVIDIA addition) | sub_34BA1B0 | -- | Chain state snapshot |
| (NVIDIA addition) | sub_34C0EE0 | -- | Block-level cost query |
| (NVIDIA addition) | sub_34BE7F0 | -- | Chain adjacency analysis |
| (NVPTX) | sub_350FE30 | -- | Pass registration |
| (NVPTX) | sub_350FEE0 | -- | Stats pass registration |
| (generic) | sub_1DE8060 | -- | Generic LLVM pass registration |
| (generic) | sub_1DE8500 | -- | Generic LLVM stats registration |
| cleanup | sub_3511770 | -- | Chain-map teardown |
| cleanup | sub_35142F0 | -- | Loop chain data teardown |
| cleanup | sub_3510940 | -- | Bump allocator teardown |
| calcExtTspScore | sub_29BAF70 | -- | Ext-TSP score (original layout) |
| calcExtTspScore | sub_29BAC40 | -- | Ext-TSP score (alternative layout) |
| computeExtTspLayout | sub_29BB2B0 | -- | Ext-TSP chain reordering solver |
| (helper) | sub_2EE6520 | -- | Ext-TSP enable decision |
| (helper) | sub_2EE6AD0 | -- | Branch redirect profitability check |
| getEdgeProbability | sub_2E441D0 | -- | Edge probability query |
| (default stub) | sub_2FDC800 | -- | Default getTailDupThreshold implementation |
| (default stub) | sub_2FF52D0 | -- | Default reconvergence-region query |
| (default stub) | sub_2FDC810 | -- | Default layout-accept threshold query |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Pass instances | Single MachineBlockPlacement per pipeline | Two instances: stock LLVM copy + NVPTX-pipeline copy at sub_3521FF0 |
| Divergence awareness | No divergence concept; layout optimizes for I-cache locality | Queries warp divergence flag on MachineFunction; divergent branches affect tail duplication profitability |
| Alternative layout proposal | Absent; single layout path only | Additional proposal path (sub_34BEDF0 / sub_34C7080) evaluates alternative orderings with SM-aware cost |
| Tail duplication threshold | TailDupPlacementThreshold (default 2) | GPU-specific threshold via vtable query (sub_2FDC810); controlled by reconvergence-region analysis |
| Loop cost evaluation | Frequency-weighted chain cost | Divergence-aware loop cost re-evaluation (sub_34C56D0, 5137 bytes) considers warp reconvergence overhead |
| Ext-TSP scoring | Standard profile-guided layout scoring | Same Ext-TSP solver but gated by NVPTX-specific enable decision (sub_2EE6520) |
| Structured CFG constraint | No structured CFG requirement (targets like x86 have arbitrary CFG) | Must preserve structured regions from StructurizeCFG; contiguous structured blocks cannot be interleaved |
Cross-References
- StructurizeCFG -- runs before block placement; produces the structured CFG that constrains which block orderings are legal. Structured regions must remain contiguous.
- BranchFolding -- runs after placement; performs tail merging and branch folding on the committed layout. See sub_2F336B0.
- Instruction Scheduling -- block ordering affects scheduling windows. Post-placement scheduling operates within the committed layout.
- Register Allocation -- register pressure is affected by block ordering through live range extent.
- AsmPrinter -- emits PTX from the final block ordering, generating bra instructions for taken branches and fall-through for sequential blocks.
MachineOutliner for GPU
The MachineOutliner in CICC v13.0 is the stock LLVM MachineOutliner pass, compiled into the binary at two address ranges: a candidate-finder at sub_3539E80 and a core outlining engine at sub_3537010, totaling approximately 136KB of combined code. A second instance at sub_1E3D600 (62KB) appears in the MIR infrastructure region (0x1E20000--0x1E3FFFF) containing the same diagnostic strings ("NotOutliningCheaper", "OutliningBenefit", etc.) and [MEDIUM confidence] likely represents the runOnModule entry point that delegates to the two primary functions. The runOnModule identification is based on the function's address being in the MIR infrastructure region and its diagnostic string overlap with the primary outliner; it could alternatively be a separate pass-manager wrapper or a legacy code path. The pass extracts repeated MachineInstr sequences across all functions in a module, factors them into shared OUTLINED_FUNCTION_* stubs, and replaces the original sequences with calls. On GPU targets this is significant because code size directly affects the L1 instruction cache (L0/L1i) footprint per SM, and every instruction that survives into PTX also contributes to ptxas compilation time and register pressure during its own allocation pass.
CICC ships the pass as part of its standard LLVM codegen infrastructure, controlled by the enable-machine-outliner TargetPassConfig knob (tri-state: disable, enable, guaranteed beneficial). The binary does not override the upstream default -- meaning the outliner's activation depends on whether the NVPTX backend's TargetPassConfig::addMachineOutliner() enables it. The presence of full outliner infrastructure (pass registration at sub_35320A0, ~136KB of outliner code, the benefit-threshold knob, and the "nooutline" function-attribute check) confirms the pass is callable. The critical question is whether NVIDIA's default pipeline activates it. The evidence is ambiguous but leans toward conditionally enabled: the TargetPassConfig enum includes "guaranteed beneficial" mode, and the NVPTX-specific calling convention 95 (assigned to outlined functions when no special CC is required) would serve no purpose if the pass were dead code.
| Pass name | "Machine Function Outliner" / "machine-outliner" |
| Registration | sub_35320A0 -- stores pass ID at unk_503D78C |
| Core outlining engine | sub_3537010 (77KB, 2,185 decompiled lines) |
| Candidate finder | sub_3539E80 (59KB) |
| Second instance (MIR region) | sub_1E3D600 (62KB, 0x1E3D600) |
| Pass factory | sub_3534A50 |
| Benefit threshold knob | qword_503DAC8 = outliner-benefit-threshold (default: 1) |
| Cost mode flag | qword_503DC88 (loaded into pass state at offset +184) |
| Debug flag | qword_503D828 (verbose outliner output) |
| Options constructor | ctor_675 at 0x5A2820 (10,602 bytes) |
| NVPTX outlined-function CC | Calling convention 95 (PTX .func linkage) |
| Outlined function naming | OUTLINED_FUNCTION_{round}_{index} |
| Function attributes applied | nounwind (47), minsize (18), internal linkage |
Suffix Tree Algorithm
The outliner's core algorithm is Ukkonen's suffix tree construction, applied to a flattened sequence of MachineInstr encodings from every eligible basic block in the module. The process proceeds in three stages.
Stage 1: Instruction Mapping
sub_3508720 (buildInstrLegalityMapping) walks each MachineBasicBlock and encodes every instruction as a uint16 alphabet symbol. The encoding incorporates both the opcode and a structurally significant operand pattern, so that two instruction sequences with different register names but identical structure map to the same suffix-tree substring. The helper sub_35082F0 initializes from the MBB's scheduling info (offset +32), and sub_35085F0 populates the actual mapping.
Register-class resolution happens in a second pass via sub_3508F10 (buildRegClassMapping): sub_3508B80 builds register-class bitmask information, and sub_3508890 computes the final mapping. This two-layer encoding is critical because NVPTX has typed register classes (i32, i64, f32, f64, pred, etc.) and an outlined sequence must be valid across all call sites regardless of which specific virtual register names appear.
Instructions that cannot participate in outlining receive a special encoding: unique negative integers starting at -3 (matching upstream's IllegalInstrNumber). Each illegal instruction gets a distinct value so it acts as a suffix-tree terminator, preventing matches from spanning across them. The sentinel value 0xFFFFFFFF (-1 as uint32) in the cost array explicitly marks these.
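The encoding scheme above can be sketched in a few lines of C. This is a toy model: the real mapping hashes the opcode together with an operand pattern and resolves register classes in a second pass; here the raw opcode stands in for the legal symbol. Only the "unique negatives from -3 down" terminator scheme is taken directly from the analysis.

```c
#include <stdint.h>

/* Matches upstream's IllegalInstrNumber: illegal encodings count down from -3. */
enum { ILLEGAL_INSTR_START = -3 };

/* Encode an instruction sequence into suffix-tree alphabet symbols.
 * Legal instructions get a stable symbol (toy encoding: opcode truncated
 * to the uint16 range); each illegal instruction gets a DISTINCT negative
 * value, so no repeated substring can ever span across it. */
static void encode_sequence(const unsigned *opcodes, const int *is_legal,
                            int n, int *out) {
    int next_illegal = ILLEGAL_INSTR_START;
    for (int i = 0; i < n; ++i)
        out[i] = is_legal[i] ? (int)(opcodes[i] & 0xFFFF)
                             : next_illegal--;
}
```

Because every illegal instruction receives a fresh value, two illegal instructions never compare equal to each other, which is exactly what makes them act as suffix-tree terminators.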
Stage 2: Suffix Tree Construction and Candidate Extraction
sub_35364E0 (insertIntoSuffixTree) inserts each MBB's encoded instruction sequence into the suffix tree working set. The suffix tree identifies all repeated substrings of length >= 2. For each repeated substring with at least 2 occurrences, the pass creates a candidate group.
Function filtering happens before insertion. sub_3539E80 iterates all MachineFunctions in the module's linked list and applies three gates:
- nooutline attribute check -- sub_B2D620 tests whether the function has the "nooutline" string attribute. If present, all MBBs in that function are skipped.
- shouldOutlineFrom -- vtable dispatch at offset +1440 on the TargetInstrInfo. The NVPTX backend's implementation of this hook determines whether a given function is eligible based on target constraints.
- isFunctionSafeToOutlineFrom -- vtable dispatch at offset +1432, receiving the outliner cost mode byte from qword_503DC88. This is where target-specific safety checks (e.g., functions with special register constraints or inline assembly) can reject outlining.
Additional per-block filters: a block must contain more than one instruction, must not already be marked as outlined (byte at MBB offset +217), and must have no special flag (qword at MBB offset +224 must be zero).
Stage 3: Sorting and Pruning
After suffix-tree extraction, the candidate list is sorted using a hybrid merge sort:
- sub_3534120 -- parallel merge sort for large arrays (recursive, splits at midpoint)
- sub_3533600 -- in-place merge sort for small arrays (fallback when size < 14 pointers = 112 bytes)
- sub_3533450 -- insertion sort for very small partitions (<= 14 elements)
The sorted suffix array is then scanned by sub_3532120 (findIllegalInRange), which performs a 4-way unrolled linear scan searching for the sentinel value 0xFFFFFFFF in the integer cost array. Any candidate whose instruction range contains an illegal sentinel is pruned. The compaction loop copies valid entries forward in place and frees discarded entries' internal string buffers via _libc_free.
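The sentinel scan can be modeled as a 4-way unrolled loop with a scalar tail. This is a sketch of the recovered shape of sub_3532120, not its exact code:

```c
#include <stdint.h>

/* Scan cost[lo..hi) for the illegal-instruction sentinel 0xFFFFFFFF,
 * four elements per iteration plus a scalar tail. Returns the index of
 * the first sentinel, or -1 if the candidate range is clean. */
static int find_illegal_in_range(const uint32_t *cost, int lo, int hi) {
    int i = lo;
    for (; i + 4 <= hi; i += 4) {
        if (cost[i]     == 0xFFFFFFFFu) return i;
        if (cost[i + 1] == 0xFFFFFFFFu) return i + 1;
        if (cost[i + 2] == 0xFFFFFFFFu) return i + 2;
        if (cost[i + 3] == 0xFFFFFFFFu) return i + 3;
    }
    for (; i < hi; ++i)
        if (cost[i] == 0xFFFFFFFFu) return i;
    return -1;
}
```

Any candidate whose range returns a non-negative index here is pruned before the benefit model ever sees it.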
Benefit/Cost Model
The outliner accepts a candidate only if the net benefit exceeds the threshold. The formula:
Benefit = NumOccurrences * PerOccurrenceCost - FrameOverheadCost
Where:
- NumOccurrences = number of identical sequences found (vtable dispatch at slot 0 on the candidate)
- PerOccurrenceCost = bytes saved per replacement (effectively the cost of the call instruction that replaces the inlined sequence, dispatched via vtable slot 0, multiplied by the repeat_count at candidate offset +40)
- FrameOverheadCost = cost of the outlined function itself: the function entry/exit, the return instruction, and any callee-saved register saves (vtable dispatch at slot 8)
The decision rule:
int benefit = num_occurrences * per_call_cost - frame_overhead;
if (benefit < 0) benefit = 0;
if (benefit < outliner_benefit_threshold) continue; // skip candidate
The threshold qword_503DAC8 defaults to 1, meaning any candidate that saves at least one byte is accepted. This is identical to upstream LLVM's default and is intentionally aggressive -- the outliner relies on the cost model's accuracy rather than a conservative threshold to filter bad candidates.
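Putting the formula and decision rule together as one self-contained function (the parameter names are descriptive, not the recovered ones):

```c
/* Accept/reject rule from the benefit model: compute net byte savings,
 * saturate at zero, then compare against the outliner-benefit-threshold
 * (qword_503DAC8, default 1). Returns nonzero if the candidate is kept. */
static int candidate_accepted(int num_occurrences, int per_call_cost,
                              int frame_overhead, int threshold) {
    int benefit = num_occurrences * per_call_cost - frame_overhead;
    if (benefit < 0)
        benefit = 0;                 /* saturate: never a negative benefit */
    return benefit >= threshold;     /* benefit < threshold => skip */
}
```

With the default threshold of 1, a candidate with 3 occurrences, 4 bytes saved per call, and 10 bytes of frame overhead is accepted (net 2), while the same sequence at 2 occurrences is rejected (net saturates to 0).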
NVPTX Cost Model Considerations
The cost model is dispatched through the TargetInstrInfo vtable, meaning the NVPTX backend supplies its own getOutliningCandidateInfo, buildOutlinedFrame, and insertOutlinedCall implementations. Several factors make the GPU cost model structurally different from CPU targets:
Call overhead in PTX is expensive. A PTX .func call requires .param space declaration, parameter marshaling (each argument is copied to .param memory), the call instruction itself, and result retrieval from .param space. On CPU targets, a call instruction is a single opcode plus a return address push. On NVPTX, the overhead is proportional to the number of live values that must be passed to the outlined function. This means the FrameOverheadCost for NVPTX candidates is significantly higher than on CPU, and only sequences with many occurrences or substantial length achieve positive benefit.
No hardware call stack. PTX function calls are lowered by ptxas into something closer to inlined code with register renaming. The actual "call" may or may not involve a hardware subroutine mechanism depending on the SM architecture and ptxas optimization level. This makes the cost model somewhat speculative from CICC's perspective -- the outlined function may be re-inlined by ptxas.
Calling convention 95. When no candidate entry in a group requires a special calling convention, the outlined function is assigned CC 95 -- an NVPTX-specific calling convention not present in upstream LLVM. CC 95 maps to PTX .func linkage with internal visibility, meaning the function is private to the compilation unit and ptxas has full freedom to inline or optimize it. See Calling Convention 95 below for the complete assignment algorithm and CC comparison table.
Outlined Function Creation
When a candidate group passes the benefit threshold, sub_3537010 creates the outlined function through these steps:
Name generation. The name follows the pattern OUTLINED_FUNCTION_{round}_{index}. The round number (pass counter at state offset +188) is omitted in round 0, producing OUTLINED_FUNCTION_0, OUTLINED_FUNCTION_1, etc. for the first pass and OUTLINED_FUNCTION_2_0, OUTLINED_FUNCTION_2_1, etc. for subsequent reruns. The integer-to-string conversion uses a standard two-digit lookup table ("00010203...9899") for fast decimal formatting.
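The paired-digit formatting trick can be reproduced in a few lines of C. This is a generic implementation of the technique (two digits per table lookup instead of one per divide), not the decompiled routine:

```c
#include <stdint.h>

/* The "00010203...9899" table: entry 2*r holds the tens digit of r,
 * entry 2*r+1 the ones digit. */
static const char kDigits[] =
    "00010203040506070809101112131415161718192021222324"
    "25262728293031323334353637383940414243444546474849"
    "50515253545556575859606162636465666768697071727374"
    "75767778798081828384858687888990919293949596979899";

/* Format v into buf (decimal, NUL-terminated); returns the length. */
static int format_u32(uint32_t v, char *buf) {
    char tmp[10];
    int n = 0;
    while (v >= 100) {               /* peel two digits per division */
        uint32_t r = v % 100;
        v /= 100;
        tmp[n++] = kDigits[2 * r + 1];
        tmp[n++] = kDigits[2 * r];
    }
    if (v >= 10) {                   /* final two digits from the table */
        tmp[n++] = kDigits[2 * v + 1];
        tmp[n++] = kDigits[2 * v];
    } else {                         /* final single digit */
        tmp[n++] = (char)('0' + v);
    }
    for (int i = 0; i < n; ++i)      /* digits were produced in reverse */
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}
```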
LLVM Function creation. sub_BCB120 (getOrInsertFunction) creates or retrieves the Function in the LLVM Module. sub_BCF640 creates the function type (void return, no arguments by default). sub_B2C660 creates the corresponding MachineFunction.
Function flags. The flag word at function offset +32 is set to (existing & 0xBC00) | 0x4087. The bit pattern 0x4087 encodes internal linkage, norecurse, and nounwind. The mask 0xBC00 preserves target-dependent alignment and visibility bits. Two explicit attributes are added: nounwind (attribute ID 47) and minsize (attribute ID 18).
Register liveness. A calloc-allocated byte array (one byte per physical register, count from TargetRegisterInfo::getNumRegs() at TRI offset +16) tracks which registers are live-through versus defined-inside the outlined region. sub_35095B0 (populateOutlinedFunctionBody) walks the outlined MBB's instruction stream, checking the TargetRegisterInfo live-in bitmap (offset +48 in the subtarget). Registers not in the live-in set are inserted as phantom definitions. Super-register chains are walked via delta tables at TRI offset +56, following standard LLVM MCRegisterInfo encoding.
Outlined body. The TargetInstrInfo hook buildOutlinedFrame (vtable offset +1408) constructs the actual machine instructions in the outlined function by copying from the candidate entries. The isOutlined flag is set at MachineFunction offset +582.
Call-Site Rewriting
After creating the outlined function, the pass rewrites each call site:
- For each candidate entry, insertOutlinedCall (vtable offset +1416) is invoked with the caller's MBB, an insertion point, the outlined Function, and the candidate metadata. This returns the new call MachineInstr.
- If the outlined function has callee-saved register information (flag at candidate offset +344), the pass builds live-in/live-out register sets using red-black trees (sub_3536E40 for classification). Registers are classified as defs (implicit-def, flag 0x30000000), uses (implicit-use, flag 0x20000000), or implicitly defined. These operands are attached to the call instruction via sub_2E8F270.
- The original instruction range in the cost array is memset to 0xFF, marking it with illegal sentinels. This prevents future outlining passes (reruns) from attempting to re-outline already-outlined code.
Candidate Entry Structure
Each candidate is a 224-byte structure (56 x uint32 stride):
| Offset | Size | Field |
|---|---|---|
| +0x00 | 4 | start_index -- index into module instruction array |
| +0x04 | 4 | length -- number of instructions in sequence |
| +0x08 | 8 | call_info_ptr -- pointer to MBB or instruction range |
| +0x10 | 8 | metadata_0 |
| +0x18 | 8 | metadata_1 |
| +0x20 | 4 | num_occurrences_field |
| +0x28 | 4 | cost_field |
| +0x2C | 48 | SSO string data (via sub_3532560) |
| +0x70 | 4 | benefit_or_flags |
| +0x78 | 40 | Second SSO string field |
| +0xA0 | 1 | flag_byte_0 |
| +0xA1 | 1 | flag_byte_1 |
| +0xA8 | 4 | field_A8 |
| +0xAC | 4 | field_AC |
| +0xB0 | 4 | field_B0 |
| +0xB4 | 4 | field_B4 |
The two string fields use LLVM's small-string optimization (SSO): strings shorter than the inline buffer are stored directly in the struct; longer strings allocate on the heap. The copy function sub_3532560 handles both cases.
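A toy model of the SSO behavior sub_3532560 has to handle. The field sizes here are illustrative, not the recovered 48- and 40-byte layouts; only the inline-vs-heap dispatch on string length reflects the description above.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified SSO string: short strings live in the inline buffer,
 * longer ones on the heap. */
typedef struct {
    char  *data;          /* points at inline_buf or a heap allocation */
    size_t len;
    char   inline_buf[16];
} SSOString;

static void sso_init(SSOString *s) {
    s->data = s->inline_buf;
    s->inline_buf[0] = '\0';
    s->len = 0;
}

/* Deep copy in the style described for sub_3532560: pick inline vs heap
 * storage based on the incoming length, releasing any previous heap
 * allocation first. */
static void sso_assign(SSOString *s, const char *src) {
    size_t n = strlen(src);
    if (s->data != s->inline_buf)
        free(s->data);
    s->data = (n < sizeof s->inline_buf) ? s->inline_buf
                                         : (char *)malloc(n + 1);
    memcpy(s->data, src, n + 1);
    s->len = n;
}
```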
Calling Convention 95: The NVPTX Outlined-Function CC
CICC defines calling convention 95 (0x5F) as an NVPTX-specific calling convention that does not exist in upstream LLVM. It is assigned exclusively to outlined functions and signals to both the AsmPrinter and ptxas that the function is a module-internal device helper with PTX .func linkage.
CC Assignment Algorithm
The CC assignment happens in Phase 5 of sub_3537010 (lines 838--877 of the decompilation), after the outlined MachineFunction is created and before its body is populated. The algorithm:
fn assign_outlined_cc(candidate_group, outlined_fn):
max_cc = 0
for entry in candidate_group:
cc = sub_A746B0(entry) // extract caller's CC from candidate
max_cc = max(max_cc, cc)
if max_cc > 0:
// At least one call site has a non-default CC.
// Inherit the highest CC and create a callee-saved register mask.
sub_B2BE50(outlined_fn, max_cc) // setCallingConv
sub_A77AA0(outlined_fn, max_cc) // create callee-saved mask
else:
// All call sites have default CC (0) -- typical case for
// device functions compiled from __device__ code.
// Assign the NVPTX-specific outlined-function CC.
outlined_fn.setCallingConv(95)
sub_A746B0 extracts the calling convention from each candidate entry's source MachineFunction. The "max" selection rule means that if candidates come from functions with different CCs, the outlined function inherits the most restrictive one. In practice, since the outliner only groups structurally identical MachineInstr sequences, all entries in a group typically come from functions with the same CC.
CC 95 vs Other NVPTX Calling Conventions
| CC | Decimal | PTX Linkage | Meaning |
|---|---|---|---|
| 0 | 0 | .func | Default C calling convention (non-kernel device function) |
| 42 | 0x2A | .entry | PTX kernel entry (one of two kernel CCs; used in SCEV budget bypass) |
| 43 | 0x2B | .entry | PTX kernel entry (variant; also bypasses SCEV budget) |
| 71 | 0x47 | .entry | Primary CUDA kernel CC (isKernel returns true when linkage == 0x47) |
| 95 | 0x5F | .func | NVPTX outlined-function CC -- internal, never a kernel |
CC 95 functions are emitted as .func by the AsmPrinter (sub_215A3C0). The .entry vs .func branch at lines 30--33 of the PTX header emission calls sub_1C2F070 (isKernelFunction), which checks whether the CC is one of the kernel CCs (42, 43, 71) or whether the nvvm.kernel metadata flag is set. CC 95 fails all kernel tests, so the function is always emitted as .func.
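The kernel test described for sub_1C2F070 reduces to a CC comparison plus a metadata flag. A hedged C model, with the nvvm.kernel metadata lookup collapsed into a boolean parameter:

```c
#include <stdbool.h>

enum {
    CC_KERNEL_A = 42,   /* 0x2A, .entry variant */
    CC_KERNEL_B = 43,   /* 0x2B, .entry variant */
    CC_KERNEL   = 71,   /* 0x47, primary CUDA kernel CC */
    CC_OUTLINED = 95,   /* 0x5F, outlined-function CC, always .func */
};

/* Emit-as-.entry decision: a function is a kernel if its CC is one of the
 * kernel CCs, or it carries the nvvm.kernel metadata flag. CC 95 fails
 * both tests, so outlined functions always come out as .func. */
static bool is_kernel_function(int cc, bool has_nvvm_kernel_md) {
    if (cc == CC_KERNEL_A || cc == CC_KERNEL_B || cc == CC_KERNEL)
        return true;
    return has_nvvm_kernel_md;
}
```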
What CC 95 Communicates
The CC carries three semantic signals:
- Internal linkage. CC 95 functions are never externally visible. The flag word 0x4087 applied at function offset +32 encodes internal linkage. Combined with the nounwind (47) and minsize (18) attributes, this tells the backend and ptxas that the function is private to the compilation unit.
- No .param-space calling convention overhead. Unlike CC 0 device functions, which must declare .param space for every argument and marshal values through st.param/ld.param sequences (the full sub_3040BF0 LowerCall path with DeclareParam/DeclareScalarParam nodes), CC 95 functions use a simplified call interface. The outlined function takes no explicit arguments -- live values are passed implicitly through the register state, and the TargetInstrInfo::insertOutlinedCall hook (vtable +1416) handles the call-site ABI.
- ptxas is free to inline. Because CC 95 functions are internal .func with no special ABI constraints, ptxas can and frequently does inline them back at the call site during its own optimization passes. This makes the outlining decision partially speculative from CICC's perspective -- the code size reduction measured by the benefit model may be undone by ptxas.
Callee-Saved Register Mask Interaction
When max_cc > 0 (the non-default path), sub_A77AA0 creates a callee-saved register mask for the outlined function. This mask determines which registers the outlined function must preserve across its body. For CC 95 (the max_cc == 0 path), no callee-saved mask is created. Instead, the call-site rewriting logic at Phase 11 of sub_3537010 (lines 1469--1968) builds explicit implicit-def (flag 0x30000000) and implicit-use (flag 0x20000000) operands on the call instruction using the RB-tree-based register classifier at sub_3536E40. This makes the register interface fully explicit rather than relying on a convention-defined preserved set.
launch_bounds Interaction and Cross-Kernel Outlining
The MachineOutliner operates at module scope -- it considers all functions in the module simultaneously. On NVPTX, this raises the question of whether sequences can be outlined across functions with different __launch_bounds__ annotations.
How launch_bounds Metadata Flows
The __launch_bounds__ attribute on a __global__ function flows through CICC as follows:
- EDG frontend (sub_826060): Validates __launch_bounds__ arguments. Rejects __launch_bounds__ on non-__global__ functions. Detects conflicts with __maxnreg__.
- Post-parse fixup (sub_5D0FF0): Converts __launch_bounds__ values into structured metadata.
- Kernel metadata emission (sub_B05_kernel_metadata): Stores as LLVM named metadata under nvvm.annotations:
  - nvvm.maxntid -- max threads per block (from first __launch_bounds__ argument)
  - nvvm.minctasm -- minimum CTAs per SM (from second argument, if present)
  - nvvm.maxnreg -- max registers per thread (from __maxnreg__ or third argument)
- PTX emission (sub_214DA90): Reads the metadata back and emits .maxntid, .minnctapersm, .maxnreg directives. These are emitted only for .entry functions -- the guard at step (g) of sub_215A3C0 ensures .func functions never receive these directives.
The Outlined Function Inherits Nothing
Because outlined functions are created with internal linkage, void return type, and CC 95 (.func), they are device functions -- never kernels. The function creation code in Phase 5 of sub_3537010 does not copy any metadata from source functions. Specifically:
- No nvvm.kernel flag is set.
- No nvvm.maxntid metadata is attached.
- No nvvm.maxnreg metadata is attached.
- No nvvm.minctasm metadata is attached.
- No nvvm.cluster_dim or nvvm.maxclusterrank metadata is attached.
- The isKernel check (sub_CE9220) returns false: the CC is not 0x47, there is no nvvm.kernel metadata, and there is no "kernel" entry in nvvm.annotations.
The only function-level metadata the outlined function receives is the isOutlined flag at MachineFunction offset +582 and the two attributes nounwind (47) and minsize (18).
Function Eligibility Gating
The candidate finder (sub_3539E80) applies three gates before considering a function's basic blocks for outlining:
fn is_eligible(func, cost_mode):
// Gate 1: explicit opt-out
if sub_B2D620(func, "nooutline"): // has "nooutline" attribute?
return false
// Gate 2: target hook -- "should we outline FROM this function?"
tii = get_target_instr_info(func)
if !tii.vtable[1440](func): // shouldOutlineFrom
return false
// Gate 3: target hook -- "is it SAFE to outline from this function?"
if !tii.vtable[1432](func, cost_mode): // isFunctionSafeToOutlineFrom
return false
return true
The NVPTX backend's implementation of shouldOutlineFrom (vtable +1440) and isFunctionSafeToOutlineFrom (vtable +1432) determines whether kernel functions and launch_bounds-constrained functions participate. The evidence does not contain the NVPTX-specific implementation of these hooks, so we cannot state definitively whether kernels with nvvm.maxnreg are rejected. However, the architectural implications are clear:
If the hooks permit outlining from constrained kernels, the outliner may extract a sequence shared between a maxnreg=32 kernel and a maxnreg=64 kernel into a single CC 95 .func. That .func has no register budget. When ptxas processes the maxnreg=32 kernel's call to this .func, it must either:
- Inline the call -- absorbing the outlined function's register usage into the kernel's allocation. If the outlined body fits within 32 registers, this is transparent.
- Keep the call -- allocating the outlined function's registers within the kernel's 32-register budget. If the outlined function needs more registers than available after the kernel's own allocation, ptxas will spill to local memory.
Both outcomes preserve correctness. The performance risk is that spilling may occur in a kernel that would not have spilled without outlining, because the CICC-side cost model has no visibility into ptxas's register allocation decisions.
If the hooks reject constrained kernels, the outliner only operates on unconstrained device functions (CC 0) and kernels without __launch_bounds__. This is the conservative and likely behavior, given that NVIDIA is aware of the register-pressure implications.
Per-Block Eligibility
Even within an eligible function, individual basic blocks are filtered:
| Condition | Check | Effect |
|---|---|---|
| Block has <= 1 instruction | MBB.size() <= 1 | Skipped -- too small to outline |
| Block already outlined | byte at MBB offset +217 | Skipped -- prevents re-outlining |
| Block has special flag | qword at MBB offset +224 != 0 | Skipped -- target-specific block exclusion |
The "already outlined" flag at MBB offset +217 is set by the call-site rewriting phase (Phase 11) after replacing a sequence with a call to the outlined function. Combined with the cost-array sentinel memset (0xFF fill), this provides a two-layer defense against re-outlining.
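Both defense layers can be sketched in a few lines of C. This is a model, not the recovered code: the block flag is represented as a struct field rather than the raw byte at MBB offset +217, and only the 0xFF fill and the flag check come from the analysis.

```c
#include <stdint.h>
#include <string.h>

/* Layer 1: after rewriting a call site, fill the replaced range of the
 * per-instruction cost array with 0xFF bytes. Every uint32 entry becomes
 * the illegal sentinel 0xFFFFFFFF, so rerun passes treat the region as
 * unmatchable. */
static void mark_range_outlined(uint32_t *cost, size_t start, size_t len) {
    memset(&cost[start], 0xFF, len * sizeof cost[0]);
}

/* Layer 2: the per-block "already outlined" flag, consulted by the
 * candidate finder before a block's instructions are even encoded. */
typedef struct { int already_outlined; } BlockState;

static int block_eligible(const BlockState *b) {
    return !b->already_outlined;
}
```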
Outlining vs. Inlining Tension
The MachineOutliner and the LLVM inliner operate in opposite directions: the inliner copies callee bodies into call sites (increasing code size, reducing call overhead), while the outliner extracts common sequences out of function bodies (decreasing code size, adding call overhead). In CICC, the two passes do not directly coordinate -- the inliner runs during the IR optimization pipeline (CGSCC pass manager), while the MachineOutliner runs late in the machine codegen pipeline after register allocation and scheduling.
The tension manifests in two ways:
- The inliner may create outlining opportunities. Aggressive inlining of small device functions can produce multiple copies of the same instruction sequence in different callers, which the outliner then detects and re-extracts. This round-trip (inline then outline) is wasteful but not incorrect. The net result depends on whether the outliner's shared function is more cache-friendly than the inlined copies.
- The outliner may undo inlining benefits. If the inliner carefully decided that inlining a hot function improves performance by eliminating call overhead and enabling cross-function optimization, the outliner may later extract the inlined sequence back out if it appears in multiple callers. The minsize attribute on outlined functions does not prevent this -- it only signals that the outlined function should be optimized for size rather than speed.
The enable-machine-outliner knob's "guaranteed beneficial" mode addresses this partially by only outlining sequences where the cost model is confident the savings are worthwhile, but it cannot reason about the inliner's original intent.
Configuration Knobs
All knobs are LLVM cl::opt command-line options, passable via -Xllc in CICC:
| Knob | Type | Default | Effect |
|---|---|---|---|
| outliner-benefit-threshold | unsigned | 1 | Minimum net byte savings for a candidate to be accepted. Higher values make outlining more conservative. |
| enable-machine-outliner | enum | target-dependent | Tri-state: disable, enable, guaranteed beneficial. Controls whether the pass runs at all. |
| enable-linkonceodr-outlining | bool | false | Whether to outline from linkonce_odr functions. Off by default because the linker can deduplicate these. Should be enabled under LTO. |
| machine-outliner-reruns | unsigned | 0 | Number of additional outliner passes after the initial run. Each rerun can find new candidates from code modified by previous outlining. |
| outliner-leaf-descendants | bool | true | Consider all leaf descendants of internal suffix-tree nodes as candidates (not just direct leaf children). |
| disable-global-outlining | bool | false | Disable global (cross-module) outlining, ignoring codegen data generation/use. |
The options constructor at ctor_675 (0x5A2820, 10,602 bytes) registers the outliner-specific options including the linkonce-odr and rerun knobs. The benefit threshold is registered separately in the same constructor.
Diagnostic Strings
The outliner emits LLVM optimization remarks under the "machine-outliner" pass name:
| Remark key | Meaning |
|---|---|
"OutlinedFunction" | A new outlined function was created |
"NotOutliningCheaper" | Candidate rejected because outlining would not save bytes |
"Did not outline" | Candidate rejected for other reasons (illegal instructions, safety checks) |
"OutliningBenefit" | Named integer: net byte savings |
"OutliningCost" | Named integer: cost of the outlined call sequence |
"NotOutliningCost" | Named integer: cost of keeping the sequence inline |
"NumOccurrences" | Named integer: how many times the sequence was found |
"Length" | Named integer: number of instructions in the sequence |
"StartLoc" / "OtherStartLoc" | Source locations of the outlined regions |
The remark message format: "Saved {N} bytes by outlining {M} instructions from {K} locations. (Found at: {loc1}, {loc2}, ...)".
Function Map
| Function | Address | Size |
|---|---|---|
| Pass registration (name, ID, factory) | sub_35320A0 | -- |
| Pass factory function | sub_3534A50 | -- |
| Core outlining engine (outline + rewrite) | sub_3537010 | 77KB |
| Candidate finder / suffix-tree builder | sub_3539E80 | 59KB |
| MachineOutliner runOnModule entry (MIR region) | sub_1E3D600 | 62KB |
| insertIntoSuffixTree -- adds MBB instruction hashes | sub_35364E0 | -- |
| SuffixArray::allocateWorkBuffer | sub_3535DB0 | -- |
| SuffixArray::parallelMergeSort | sub_3534120 | -- |
| SuffixArray::inPlaceMergeSort (fallback for small arrays) | sub_3533600 | -- |
| Insertion sort for <= 14 elements | sub_3533450 | -- |
| findIllegalInRange (4-way unrolled sentinel scan) | sub_3532120 | -- |
| buildInstrLegalityMapping -- MBB to suffix alphabet | sub_3508720 | -- |
| buildRegClassMapping -- register-class constraint resolution | sub_3508F10 | -- |
| populateOutlinedFunctionBody -- instruction insertion | sub_35095B0 | -- |
| classifyOperandRegisters -- RB-tree register tracking | sub_3536E40 | -- |
| RBTree::destroyAll -- recursive tree deallocation | sub_3532B90 | -- |
| std::string constructor (for name generation) | sub_35323D0 | -- |
| SmallString SSO-aware deep copy | sub_3532560 | -- |
| RemarkBuilder::appendField | sub_3534BB0 | -- |
| RemarkBuilder::emitOutlinedFunctionRemark | sub_35341F0 | -- |
| Extract calling convention from candidate entry's source function | sub_A746B0 | -- |
| Create callee-saved register mask for non-default CC | sub_A77AA0 | -- |
| hasAttribute("nooutline") -- function attribute check | sub_B2D620 | -- |
| isKernel(func) -- returns true for CC 0x47 or nvvm.kernel metadata | sub_CE9220 | -- |
| isKernelFunction -- .entry vs .func emission branch | sub_1C2F070 | -- |
| Kernel attribute emission (.maxntid, .maxnreg, .minnctapersm) | sub_214DA90 | -- |
| PTX function header orchestrator (.entry / .func branch + params) | sub_215A3C0 | -- |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Activation | Default off for most targets; explicit -enable-machine-outliner required | Conditionally enabled via TargetPassConfig::addMachineOutliner(); evidence of "guaranteed beneficial" mode for NVPTX |
| Calling convention | Uses target default CC for outlined functions | Assigns CC 95 to outlined functions -- a dedicated NVPTX convention that bypasses .param-space ABI overhead |
| Kernel interaction | No kernel concept; all functions treated equally | isKernel(func) check (sub_CE9220) for CC 0x47 / nvvm.kernel metadata; kernel attributes (.maxntid, .maxnreg, .minnctapersm) may constrain outlining profitability |
| nooutline attribute | Standard function attribute check | Same check (sub_B2D620 / hasAttribute("nooutline")); kernels with tight __launch_bounds__ may implicitly disable outlining |
| Code size motivation | Reduce instruction cache footprint and binary size | Primary motivation is L0/L1i instruction cache pressure per SM partition; every surviving PTX instruction also costs ptxas compilation time |
| Suffix tree/array | Standard suffix array construction | Same algorithm; parallel merge sort (sub_3534120) with fallback insertion sort for <= 14 elements |
Cross-References
- Inliner Cost Model -- the opposing force: inlining decisions that the outliner may partially reverse
- AsmPrinter & PTX Body Emission -- how outlined .func functions are emitted as PTX
- Register Allocation -- the outliner runs after RA; outlined functions affect register pressure
- Register Coalescing -- coalescing happens before outlining; the outliner operates on already-coalesced code
- Block Placement -- block layout interacts with code size; the outliner reduces the instruction footprint that placement must arrange
- Pipeline & Ordering -- where the outliner sits in the overall pass sequence
- NVPTX Call ABI -- the .param-space calling convention that CC 0 device functions use; CC 95 outlined functions bypass this
- SCEV Analysis -- SCEV budget bypass for CC 42/43 kernel functions; illustrates CC-based dispatch in CICC
Tensor Core / MMA Code Generation
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
CICC v13.0 contains a complete tensor core code generation pipeline spanning five SM generations (Volta through Blackwell), three distinct MMA instruction families (HMMA/IMMA/BMMA), the SM 90 Warp Group MMA (WGMMA) system, and the SM 100 Tensor Core Generation 5 (tcgen05) engine. The pipeline transforms NVVM intrinsic calls through two parallel lowering paths -- one in the NVVM IR lowering layer (sub_955A70) and one in the SelectionDAG backend (sub_33B0210) -- before reaching a common PTX instruction emission layer that constructs MMA instructions from packed 64-bit descriptors encoding shape, type, layout, rounding, and saturation.
This page documents the code generation mechanics: how MMA operations flow from source-level __hmma_* / __wmma_* / __wgmma_* builtins through LLVM intrinsic selection, SelectionDAG lowering, and PTX string emission. For the builtin-to-intrinsic mapping and per-ID reference, see Tensor / MMA Builtins. For the SelectionDAG infrastructure that hosts this lowering, see SelectionDAG.
| NVVM builtin dispatch | sub_955A70 (105KB) -- main NVVM builtin lowering dispatcher |
| SelectionDAG intrinsic switch | sub_33B0210 (343KB, 9,518 lines) -- intrinsic lowering mega-switch, CAT-17 |
| SelectionDAG MMA handler | sub_33A64B0 -- WMMA/MMA DAG node construction (95 intrinsic IDs) |
| WMMA load handler | sub_94CAB0 / sub_94DCB0 -- fragment load codegen |
| WMMA MMA handler | sub_94E0D0 -- matrix multiply-accumulate codegen |
| MMA PTX string builder | sub_21E74C0 (AsmPrinter) / sub_35F3E90 (backend) |
| tcgen05.mma lowering | sub_304E6C0 (SelectionDAG) / sub_36E9630 (instruction emission) |
| tcgen05 infrastructure | sub_30462A0 -- fence/wait/alloc/dealloc/cp/commit |
| Address range | 0x21D0000--0x21F0000 (AsmPrinter MMA), 0x304xxxx--0x36Fxxxx (backend) |
| Upstream | lib/Target/NVPTX/NVPTXISelLowering.cpp (no upstream MMA; entirely NVIDIA-proprietary) |
Pipeline Overview
MMA code generation follows a three-stage pipeline. The first two stages exist in parallel copies; the third is shared.
CUDA source: __hmma_m16n16k16_mma_f32f32(d, a, b, c, 0)
│
┌───────────┴───────────┐
│ NVVM builtin lowering │ SelectionDAG intrinsic lowering
│ (sub_955A70) │ (sub_33B0210, CAT-17)
│ │
│ 3-table lookup: │ sub_33A64B0 -> SDNode construction
│ dword_3F14840/7E0/7A0 │ 95 case labels (0xA4-0xA8, 0x194-0x1EC)
│ │
│ sub_94E0D0 (MMA) │
│ sub_94CAB0 (load) │
│ sub_9493D0 (store) │
└───────────┬───────────┘
│
┌───────────┴───────────┐
│ PTX Instruction Emit │
│ sub_21E74C0 (printer) │
│ sub_1D23DE0 (emitter) │
└───────────────────────┘
The NVVM builtin lowering path handles builtins that arrive as direct function calls from the EDG frontend. The SelectionDAG path handles the same operations when they arrive as LLVM intrinsic calls (the normal path when CUDA C++ compiles through Clang-style IR generation). Both paths converge at the PTX string builder, which reads a packed 64-bit descriptor word and emits text like mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32.
Packed MMA Descriptor
All MMA operations are encoded as a single 64-bit descriptor word stored at *(QWORD*)(*(QWORD*)(a1+16) + 16*a2 + 8). The PTX string builder (sub_21E74C0) queries this descriptor through a string-keyed interface. The caller passes a query string (e.g., "shape", "ety", "mid"), and the builder extracts the relevant bits and emits the corresponding PTX text.
Bit Layout
| Bits | Field | Query key | Values |
|---|---|---|---|
| [0] | rowcol | "rowcol" | 0=row, 1=col |
| [2:1] | mid | "mid" | 0=a, 1=b, 2=c, 3=d |
| [7:4] | opc | "opc" | 0=default, 1=.and.popc, 2=.xor.popc |
| [2:0] | rnd | "rnd" | 0=none, 1=.rn, 2=.rm, 3=.rp, 4=.rz |
| [15:8] | aty | "aty" | A element type enum (see below) |
| [23:16] | bty | "bty" | B element type enum |
| [25:24] | al | "al" | A layout: 0=row, nonzero=col |
| [27:26] | bl | "bl" | B layout: 0=row, nonzero=col |
| [28] | satf | "satf" | 0=off, 1=.satfinite |
| [39:32] | shape | "shape" | Shape enum (see below) |
The "ety" query reads the result/accumulator element type from bits [27:24], sharing bit positions with al/bl in a context-dependent manner -- the builder dispatches on the query string to select the correct extraction mask.
Type Enum
| Value | Type | Bits | PTX string |
|---|---|---|---|
| 1 | b1 | 1 | "b1" |
| 2 | s4 | 4 | "s4" |
| 3 | u4 | 4 | "u4" |
| 4 | s8 | 8 | "s8" |
| 5 | u8 | 8 | "u8" |
| 6 | f16 | 16 | "f16" |
| 7 | bf16 | 16 | "bf16" |
| 8 | tf32 | 19 | "tf32" |
| 9 | f64 | 64 | "f64" |
| 10 | f32 | 32 | "f32" |
| 11 | s32 | 32 | "s32" |
Any other value triggers the fatal error "Wrong MMA element type".
Shape Enum
| Value | Shape | PTX string | Notes |
|---|---|---|---|
| 0x01 | m8n8k4 | "m8n8k4" | Original Volta HMMA |
| 0x02 | m8n8k16 | "m8n8k16" | Integer MMA (s8/u8) |
| 0x03 | m8n8k32 | "m8n8k32" | Sub-byte (s4/u4) |
| 0x04 | m8n8k64 | "m8n8k64" | Extended sub-byte |
| 0x05 | m8n8k128 | "m8n8k128" | Binary MMA (b1) |
| 0x06 | m8n32k16 | "m8n32k16" | Appears unused in standard paths |
| 0x10 | m16n8k4 | "m16n8k4" | Turing HMMA, Ampere f64 |
| 0x11 | m16n8k8 | "m16n8k8" | Turing/Ampere HMMA |
| 0x12 | m16n8k16 | "m16n8k16" | Ampere (bf16, tf32) |
| 0x13 | m16n8k32 | "m16n8k32" | Ampere integer |
| 0x14 | m16n8k64 | "m16n8k64" | Sub-byte integer |
| 0x15 | m16n8k128 | "m16n8k128" | Extended sub-byte |
| 0x16 | m16n8k256 | "m16n8k256" | Largest shape (binary/sub-byte) |
| 0x17 | m16n16k16 | "m16n16k16" | Square shape (Hopper+) |
| 0x18 | m32n8k16 | "m32n8k16" | Tall shape |
| 0x19 | m16n16k8 | "m16n16k8" | WMMA f16 path |
Unrecognized shape values hit the default branch and trigger BUG() abort.
PTX String Emission
The string builder uses an optimized emission pattern: short constant strings are stored as integer literals for single-store writes. For example, "m16n8k16" is emitted as:
*(QWORD*)ptr = 0x36316B386E36316DLL; // "m16n8k16" in little-endian
When the output buffer has sufficient remaining capacity, the builder writes directly via DWORD/WORD/BYTE stores. On buffer overflow, it falls back to sub_16E7EE0 (slow-path string append).
HMMA / IMMA / BMMA Lowering (SM 70--89)
The pre-Hopper MMA families share a common architecture: a three-table builtin-to-intrinsic lookup, per-family handler functions for load/store/MMA, and a consistent operand processing pattern.
Three-Table Intrinsic Lookup
| Table | Entries | Builtin IDs | Description |
|---|---|---|---|
| dword_3F14840 | 0--29 | 678--707 | HMMA (FP16, first-gen) |
| dword_3F147E0 | 0--23 | 708--731 | IMMA (INT8) |
| dword_3F147A0 | 0--12 | 732--744 | BMMA (binary) / INT4 |
Each table maps (builtin_id - base) to an LLVM intrinsic ID. The first table additionally sets a v43=1 flag indicating "first generation WMMA", which affects fragment size determination.
HMMA Handler Family (SM >= 70)
Four functions implement half-precision MMA operations. All share a common pattern:
- Architecture gate: *(target_info + 252) > 0x45 (SM >= 70)
- Fetch debug location
- Validate rowcol operand is constant (opcode 10 or 32 check)
- Resolve address space via sub_21DEF90
- Build operands via sub_1D38BB0 calls
- Emit instruction via sub_1D23DE0
| Function | Address | Operation | Operand Count |
|---|---|---|---|
| sub_21E0360 | 0x21E0360 | hmmaldab (load A/B) | 6 |
| sub_21E0630 | 0x21E0630 | hmmaldc (load C) | 5 |
| sub_21DFBF0 | 0x21DFBF0 | hmmastc (store C/D) | 9 or 13 (shape-dependent) |
| sub_21E0870 | 0x21E0870 | hmmamma (MMA) | 19 or 23 + 1 metadata |
For hmmastc, the operand count depends on the accumulator width: 9 operands for narrow accumulators, 13 for wide (when the a2 shape flag is set).
For hmmamma, the handler loads A fragments (v100 iterations), B fragments (v95 iterations), C fragments (v101 iterations), emits the MMA call via sub_921880, then scatters results through v103 iterations of element-wise stores.
IMMA Handler Family (SM >= 72)
Integer MMA follows the same pattern but with additional SM 72 (Xavier) restrictions:
| Function | Address | Operation | SM Gate |
|---|---|---|---|
| sub_21E1280 | 0x21E1280 | immaldab (load A/B) | SM > 0x47 (>= 72) |
| sub_21E15D0 | 0x21E15D0 | immaldc (load C) | SM > 0x47 |
| sub_21E1830 | 0x21E1830 | immastc (store C) | SM > 0x47 |
| sub_21E1D20 | 0x21E1D20 | immamma (MMA + saturation) | SM > 0x47 |
SM 72 special case. Xavier's tensor cores support only basic IMMA shapes (variant 0 or 1). The gate check is:
if (sm_version <= 0x47 || (sm_version == 72 && shape_variant > 1))
fatal_error("not supported on this architecture");
For immaldc at SM 72, certain intrinsic opcodes (610, 611, 179, 180) are explicitly blocked:
if (sm_version <= 0x47 || ((opcode-610 <= 1 || opcode-179 <= 1) && sm_version == 72))
fatal_error(...);
The immamma handler includes an explicit satf (saturation-to-finite) constant extraction. The .satfinite modifier is appended to the PTX instruction when bit 28 of the descriptor is set. This clamps infinities and NaNs to the largest representable finite value.
IMMA operand counts vary by opcode:
| Opcode | Fragment Count | Shape |
|---|---|---|
| 584 | 12 | Large integer shape |
| 609 | 4 | Compact integer shape |
| other | 13 | Default |
BMMA Handler (SM >= 73/75)
Binary MMA (sub_21E2280, 0x21E2280) handles b1 operations with XOR-POPC and AND-POPC modes. Gate: SM > 0x48 (>= 73, in practice SM 75). The handler takes 8+ operands.
Fragment Size Determination
Fragment size (the number of register-width elements per warp fragment) is computed differently per family:
WMMA (first-gen, v43=1):
| Condition | Fragment Count |
|---|---|
| BF16, store operation (a6==1 && !a5) | 4 |
| Default first-gen | 8 |
| Intrinsic 8914 or 8280 | 2 |
IMMA (v43=0):
| Intrinsic IDs | Fragment Count |
|---|---|
| 0x22B3--0x22B6, 0x22CF | 2 |
| 0x22BB--0x22BC, 0x22C5--0x22C6 | 4 |
| 0x22BD--0x22BE, 0x22C3--0x22C4, 0x22CB--0x22CE | 1 |
| 0x22B7, 0x22BF, 0x22C7 | 8 |
BMMA: Always 2 fragments, with v101=2, v95=1, v100=1.
MMA Codegen (sub_94E0D0)
The WMMA multiply-accumulate handler takes a destination pointer and five input operands:
- v102 -- destination fragment pointer (output)
- v7 -- A matrix fragment pointer
- v93 -- B matrix fragment pointer
- v92 -- C accumulator fragment pointer
- v8 -- rowcol operand (validated range: 0--3 for MMA)
- v9 -- satf flag (validated: 0 or 1; skipped for intrinsic 8279)
Fragment counts for the MMA operation itself:
| Family | v95 (A frags) | v100 (B frags) | v101 (C frags) | v103 (D frags) |
|---|---|---|---|---|
| BMMA | 1 | 1 | 2 | 2 |
| IMMA 0x22C0--0x22C1 | 1 | 4 | 8 | 8 |
| IMMA 0x22B8--0x22B9 | 2 | 2 | 8 | 8 |
| IMMA 0x22C8--0x22C9 | 4 | 1 | 8 | 8 |
| WMMA (default) | 8 | 8 | varies | 4 or 8 |
For first-gen WMMA, v103 (D fragment count) is determined by a bit test:
if ((0x300C003 >> (intrinsic_id + 127)) & 1)
v103 = 4;
else
v103 = 8;
The code generation sequence is:
1. LOAD A fragments: v100 iterations of sub_94B510 (extract from ptr v7)
2. LOAD B fragments: v95 iterations (extract from ptr v93)
3. LOAD C fragments: v101 iterations (extract from ptr v92)
4. EMIT MMA call: sub_90A810(tables, intrinsic_id, 0, 0) -> sub_921880
5. STORE D fragments: v103 iterations of sub_94B940 (scatter to ptr v102)
Address Space Resolution (sub_21DEF90)
MMA load/store operations resolve the target memory address space through sub_21DEF90, which checks the instruction opcode at offset +24:
| Opcode Range | Condition | Address Space |
|---|---|---|
| 185--237 | Bit test against 0x3FFFFD00000003 | varies |
| 44--45 | Bit 1 of byte at offset +26 | varies |
| >= 659 | unconditional | accepted |
| default | -- | generic (0) |
Return values: 0=generic, 1=global, 2=shared, 3=local, 4=constant, 5=special, 404=special (from value 101).
SelectionDAG Path (sub_33B0210 / sub_33A64B0)
In the SelectionDAG intrinsic lowering mega-switch (sub_33B0210), 95 consecutive case labels (IDs 0xA4--0xA8 and 0x194--0x1EC, corresponding to LLVM intrinsic IDs 164--168 and 404--492) all dispatch to a single helper: sub_33A64B0.
This function handles every WMMA/MMA SelectionDAG intrinsic for SM 70--89:
- wmma.load.a / wmma.load.b / wmma.load.c
- wmma.store.d
- wmma.mma for all shape/type combinations
- mma.sync (SM 70+), mma.sp (SM 80+, structured sparsity), mma.f64 (SM 80+)
The SelectionDAG path constructs NVPTXISD target-specific DAG nodes that are later matched by the instruction selection tables. The intrinsic IDs from the mega-switch are distinct from the builtin IDs used in the NVVM path -- the mega-switch IDs are LLVM intrinsic table indices, not CUDA builtin numbers.
WGMMA -- Warp Group MMA (SM 90 Hopper)
WGMMA operates on a warp group (4 warps, 128 threads) instead of a single warp. Four builtin IDs (765--768) expand to over 150 LLVM intrinsic variants through compile-time dimension and type dispatch.
Builtin-to-Intrinsic Expansion
| Builtin ID | Builtin | Variants |
|---|---|---|
| 765 (0x2FD) | __wgmma_mma_async_f16 | Full 6-operand set (a, b, c, scale, negate, sparsity) |
| 766 (0x2FE) | __wgmma_mma_async_bf16 | 2-operand (no scale/negate) |
| 767 (0x2FF) | __wgmma_mma_async_tf32 | Reduced operand set |
| 768 (0x300) | __wgmma_mma_async_f8 | Minimal (2 scale operands only) |
The lowering handler (in sub_955A70, cases 0x2FD--0x300, ~800 lines) extracts 7 levels of chained operands:
v263 -- M dimension (constant)
v512 -- accumulator fragments
v528 -- A descriptor
v524 -- B descriptor
v519 -- scale factors
v264 -- layout params
v540 -- element type info
Dimension-to-Intrinsic Mapping
The N dimension (extracted via sub_620FD0 as a constant integer) maps to one of 144 LLVM intrinsic IDs spanning 10654--10779. The mapping forms a dense table with stride 4 per N step:
| N | Integer-type Intrinsic | Float-type Intrinsic |
|---|---|---|
| 8 | 10774 | 10775 |
| 16 | 10690 | 10691 |
| 32 | 10742 | 10743 |
| 64 | 10758 | 10759 |
| 128 | 10666 | 10667 |
| 256 | 10738 | 10739 |
For intermediate N values (multiples of 8 from 8 to 256), the mapping continues at stride +4 per N increment. Even intrinsic IDs encode integer-element variants; odd IDs encode float-element variants. The element type is determined by checking whether the LLVM type is an integer with width 10 (i.e., tf32 or bf16 packed as i10 -- a quirk of the NVVM type system).
If constant extraction overflows, the compiler emits:
"unexpected constant overflow in __wgmma_mma_async operand"
If N is not a power of two: (N & (N - 1)) != 0 triggers:
"N only supported for powers of two"
WGMMA 5-Dimensional Intrinsic Grid
The full WGMMA intrinsic table (sub_12B2E10) uses a 144-entry grid spanning IDs 5304--5447:
| Dimension | Values | Count |
|---|---|---|
| N | 16, 32, 64, 128 | 4 |
| B_shared | false, true | 2 |
| is_s64 | false, true | 2 |
| A_scale/negate | combo | varies |
| case variant | 0x2FD--0x300 | 4 |
Each WGMMA call packs mode bits into a single integer:
bit 0: accumulate flag (from operand v433)
bit 1: transpose flag (from operand v445)
bit 2: negate-C flag (from operand v433)
bit 3: reserved
bit 4: negate-A flag (from operand v427)
Combined: v79 = bit0 | (bit1 << 1) | (bit2 << 2) | (bit4 << 4).
WGMMA Parameter Lookup (sub_953BA0)
On first call, sub_953BA0 lazily initializes a red-black tree at ctx+560 with 7 entries encoding per-ID shape, transpose, register count, and type information:
| ID | trans_a | shape | a_nregs | b_nregs | a_type | b_type | c_type |
|---|---|---|---|---|---|---|---|
| 745 | 0 | 1 | 1 | 1 | i64 | i64 | -- |
| 746 | 1 | 0 | 9 | 9 | i32 | i32 | i32x2 |
| 747 | 0 | 0 | 8 | 8 | i16x2 | i16x2 | -- |
| 748 | 0 | 0 | 7 | 7 | i32x4 | i32x4 | i32x8 |
| 749 | 0 | 0 | 7 | 7 | i32x4 | i32x4 | i32x8 |
| 750 | 0 | 0 | 7 | 7 | i64 | i32x2 | i32x8 |
The output is packed into a 64-bit value:
bits[3:0] = trans_a
bits[7:4] = shape << 4
bits[15:8] = a_nregs << 8
bits[27:16] = b_nregs << 16
bits[31:28] = padding << 28
bits[63:32] = trans_b << 32
bit[25] = ((rowcol & 2)==0) ? 0x2000000 : 0x1000000
bits[27:26] = ((rowcol & 1)+1) << 26
WGMMA MMA Async Load (sub_9547E0)
A second red-black tree at ctx+656 holds 12 entries for MMA async load parameters:
| ID | Shape | NRegs | Variant | Fragment Type |
|---|---|---|---|---|
| 753 | 1 | 9 | 0 | -- |
| 754 | 1 | 9 | 1 | -- |
| 755 | 1 | 9 | 2 | i16x2 |
| 756 | 25 | 8 | 0 | -- |
| 757 | 25 | 8 | 1 | -- |
| 758 | 25 | 10 | 2 | i32x8 |
| 759 | 23 | 7 | 0 | i32x4 |
| 760 | 23 | 7 | 1 | i32x4 |
| 761 | 24 | 7 | 0 | i32x4 |
| 762 | 24 | 7 | 1 | i32x4 |
| 763 | 6 | 7 | 0 | i32x2/i64 |
| 764 | 6 | 7 | 1 | i32x2/i64 |
WGMMA Fence/Store Dispatch
| IDs | Operation | Intrinsic | Handler |
|---|---|---|---|
| 745--750 | fence_aligned | 9062 (3 type overloads) | sub_953BA0 -> sub_94B510 x3 -> sub_94B940 |
| 751--752 | store | 9145 (2 type overloads) | sub_954350 |
| 753--764 | mma_async load | 9067 (2 type overloads) | sub_9547E0 |
The fence operations pack A/B/C fragment operands via sub_94B510 and scatter results via sub_94B940 with name hint "mmafrag".
tcgen05 -- Tensor Core Generation 5 (SM 100 Blackwell)
SM 100 introduces tcgen05, a completely new tensor core instruction family with support for MX floating-point formats (MXF4, MXF8F6F4), structured sparsity, weight stationary mode, block scaling, and scaled input accumulators. The tcgen05 system includes both computation (tcgen05.mma) and lifecycle management (alloc, dealloc, fence, wait, commit, cp, relinquish) instructions.
Architecture Gate
All tcgen05 operations require SM >= 100. The gate check reads two architecture fields:
v1 = *(int*)(arch_struct + 340); // arch_value: 1000=sm100, 1030=sm103, 1200=sm120
v2 = *(int*)(arch_struct + 336); // ptx_version
// Family-conditional: ptx >= 86
// Arch-conditional: ptx >= 88
if (v1 <= 0x3E8 && v1 <= 0x408) // neither sm_100 nor sm_103
fatal_error("tcgen05.mma supported only on arch-conditional "
"or family-conditional variants from SM100 onwards.");
tcgen05 Infrastructure Operations
All handled by sub_30462A0:
| Operation | Intrinsic Opcode | ISD Opcode | Operands |
|---|---|---|---|
| tcgen05.alloc | 10080 | 4765 | basic allocation |
| tcgen05.alloc (multicast) | 10083 | 4770/4771 | 32-bit flag variant |
| tcgen05.dealloc | 10140 | 4827 | 4 operands |
| tcgen05.commit | 10090 | 4772--4777 | multicast mask variants |
| tcgen05.fence | 10143 | 4830 | 2 operands |
| tcgen05.wait | 10351 | 5020 | 2 operands |
| tcgen05.relinquish.alloc | 10311 | 4941 | 2 operands |
| tcgen05.cp.* | 10101 | 4790 | 4 operands |
Commit operations validate multicast mask size -- only 16-bit and 32-bit masks are supported:
"tcgen05.commit.* supports only 16-bit and 32-bit multicast mask size."
tcgen05.mma Data Types
The "kind" field occupies bits [8:6] of the packed operand word:
| Value | Kind | Description |
|---|---|---|
| 0 | mxf4nvf4 | MX FP4 with NV FP4 |
| 1 | f8f6f4 | FP8/FP6/FP4 standard |
| 2 | mxf8f6f4 | MX variant of f8f6f4 |
| 3 | f16 | Half precision |
| 4 | i8 | 8-bit integer (arch-conditional only) |
| 5 | tf32 | TensorFloat-32 |
| 7 | mxf4 | MX FP4 |
tcgen05.mma Modifiers
Scale vector size (bits [3:2]):
| Value | Modifier | Constraints |
|---|---|---|
| 0/1 | .scale_vec::1X | Cannot use for mxf4nvf4 type |
| 2 | .scale_vec::2X | Cannot use for mxf8f6f4 type |
| 3 | .scale_vec::4X | Cannot use for mxf8f6f4 or mxf4 type |
Block scale alias (bits [10:9]):
| Value | Modifier | Constraint |
|---|---|---|
| 0 | .block16 | Not supported for f16, tf32, f8f6f4, i8 |
| 1 | .block32 | Same constraint |
Weight stationary (bit 0): .ws flag. Not compatible with cta_group::2, mxf8f6f4, or fp4 types.
CTA group (bits [1:0]): .cta_group::1 (bit 1 clear) or .cta_group::2 (bit 1 set).
Sparsity (bit 5): Adds one extra operand. Restricted for MXF4 and MXF4NVF4 types to arch-conditional variants only.
Scale input accumulator (bit 4): Only usable with f16 and tf32 types. Not supported on sm_100a (v=1001) or sm_103a (v=1033), but supported on sm_100 (v=1000), sm_103 (v=1030), and sm_120+ (v>=1101).
Collector modes (emitted by sub_35F38B0):
| Value | PTX modifier |
|---|---|
| 1 | .collector::a::lastuse |
| 2 | .collector::a::fill |
| 3 | .collector::a::use |
Cannot use collector::a::use or collector::a::fill with ashift.
tcgen05.mma ISD Opcode Selection (sub_36E9630)
The intrinsic lowering handler (sub_304E6C0) maps 10 shape cases (intrinsic opcodes 10299--10308) to ISD opcodes 4905--4940:
| Case | Shape Class | Base ISD | +scaleD | +sparsity | +ws | +scaleInputAccum |
|---|---|---|---|---|---|---|
| 10299 | Small | 4906 | -- | 4907 | -- | -- |
| 10300 | Small v2 | 4908 | -- | 4909 | -- | -- |
| 10301 | Medium | 4905 | 4910 | 4911/4912 | 4937/4938 | yes |
| 10302 | Medium v2 | 4913 | 4914 | 4915/4916 | -- | yes |
| 10303 | Large | 4917 | 4918 | 4919/4920 | -- | yes |
| 10304 | Block-scale small | 4922 | -- | 4923 | -- | -- |
| 10305 | Block-scale small v2 | 4924 | -- | 4925 | -- | -- |
| 10306 | Block-scale medium | 4921 | 4926 | 4927/4928 | 4939/4940 | yes |
| 10307 | Block-scale medium v2 | 4929 | 4930 | 4931/4932 | -- | -- |
| 10308 | Block-scale large | 4933 | 4934 | 4935/4936 | -- | -- |
Operand count varies by variant: small shapes take 5--6 base operands plus optional sparsity operand; medium shapes take 6 base plus optional scale factor; large shapes iterate over additional operands spanning offsets 440--600 (or 440--760 on sm_103 extended variants).
tcgen05.mma Validation Errors
The full set of compile-time validation errors (emitted via sub_C64ED0):
| Error Message | Condition |
|---|---|
"INT8 type is supported only on arch-conditional variants." | kind==i8 on family-conditional SM100 |
"MXF4 and MXF4NVF4 types with Sparsity are supported only on arch-conditional variants." | (type+7)%8 > 5 AND sparsity set, on family-conditional |
"Explicit scale vector size is supported only on arch-conditional variants." | scale_vec_size 1--3 on family-conditional |
"Scale input accumulator can only be used with f16 and tf32 types" | bit 4 set but kind not f16 or tf32 |
"Scale input accumulator is not supported on this architecture." | scaleInputAccum on sm_100a or sm_103a |
"Block scale is not supported for f16, tf32, f8f6f4 and i8 types" | block_scale with incompatible type |
"ashift is not supported with tcgen05.mma.block_scale variants" | ashift + block_scale |
"cta_group::2 is not supported with weight stationary" | cta_group::2 + .ws |
"Cannot use weight stationary with mxf8f6f4 and fp4 types" | .ws + mxf8f6f4 or fp4 |
"Cannot use collector::a::use or colletor::a::fill with ashift" | [sic] collector + ashift |
"Cannot use 2X or 4X as scale vector size for mxf8f6f4 type" | scale_vec >= 2X + mxf8f6f4 |
"Cannot use 1X as scale vector size for mxf4nvf4 type" | scale_vec 1X + mxf4nvf4 |
"Cannot use 1X or 4X as scale vector size for mxf4 type" | scale_vec 1X or 4X + mxf4 |
Note the typo "colletor" (missing 'c') in the binary -- this is a genuine NVIDIA binary string, not a transcription error.
tcgen05 Scaled MMA Operand Builder
Two identical copies exist for the tcgen05 scaled MMA descriptor:
| Copy | Address | Layer |
|---|---|---|
| sub_21E8CD0 | 0x21E8CD0 | AsmPrinter / PTX emission |
| sub_35F3E90 | 0x35F3E90 | NVPTX backend / SelectionDAG |
The packed descriptor encodes Blackwell-specific modifiers:
| Bit | Query | Set Value | Clear Value | Semantics |
|---|---|---|---|---|
| 0 | "scaleD" | "1" | "0" | Scale output accumulator |
| 1 | "negA" | "-1" | "1" | Negate A matrix |
| 2 | "negB" | "-1" | "1" | Negate B matrix |
| 3 | "transA" | "1" | "0" | Transpose A |
| 4 | "transB" | "1" | "0" | Transpose B |
scaleD and transA/transB emit boolean "0"/"1" strings. negA and negB emit sign multiplier strings "-1"/"1" because PTX applies negation as a multiplication factor.
tcgen05.cp Copy Operations
Shape variants (bits [3:1]):
| Value | PTX shape |
|---|---|
| 0 | .128x256b |
| 1 | .4x256b |
| 2 | .128x128b |
| 3 | .64x128b |
| 4 | .32x128b |
Destination format variants:
| Condition | PTX format |
|---|---|
| default | .b8x16 |
| bit 7 = 0 | .b6x16_p32 |
| bit 7 = 1 | .b4x16_p64 |
| bit 8 set | error: "Unsupported tcgen05.cp destination format" |
Multicast modes:
| Type | PTX modifier |
|---|---|
| type 1, shape 3 | .warpx2::02_13 |
| type 2, shape 3 | .warpx2::01_23 |
| type 3, shape 4 | .warpx4 |
Duplicate Backend Copies
Several MMA functions exist as near-identical pairs -- one in the AsmPrinter emission layer (0x21Dxxxx--0x21Exxxx) and one in the NVPTX backend layer (0x36Exxxx). The difference is limited to error reporting and reference counting functions:
| AsmPrinter Copy | Backend Copy | Operation |
|---|---|---|
| sub_21DFBF0 | sub_36E91F0 | hmmastc |
| sub_21E0360 | sub_36E72A0 | hmmaldab |
| sub_21E0630 | sub_36E7580 | hmmaldc |
| sub_21E0870 | sub_36E77C0 | hmmamma |
| sub_21E1280 | sub_36E7B50 | immaldab |
| sub_21E15D0 | sub_36E7EA0 | immaldc |
| sub_21E1830 | sub_36E8110 | immastc |
| sub_21E1D20 | sub_36E8630 | immamma |
| sub_21E2280 | sub_36E8BD0 | bmmamma |
| sub_21E8CD0 | sub_35F3E90 | tcgen05 scaled MMA |
AsmPrinter copies use sub_16BD130 for errors; backend copies use sub_C64ED0. AsmPrinter copies use sub_1623A60/sub_161E7C0 for refcounting; backend copies use sub_B96E90/sub_B91220.
Shape x Type x Architecture Matrix
| Shape | A/B Types | Accumulator | Min SM | Notes |
|---|---|---|---|---|
| m8n8k4 | f16 | f16, f32 | SM 70 | Original Volta |
| m16n8k4 | f64 | f64 | SM 80 | Ampere double precision |
| m16n8k8 | f16 | f16, f32 | SM 75 | Turing+ |
| m16n8k16 | f16, bf16, tf32 | f16, f32 | SM 80 | Ampere+ |
| m16n16k8 | f16 | f16, f32 | SM 70 | WMMA path |
| m16n16k16 | f16, bf16 | f16, f32 | SM 90 | Hopper+ |
| m32n8k16 | f16, bf16 | f16, f32 | SM 80 | Tall shape |
| m8n8k16 | s8, u8 | s32 | SM 72 | Integer MMA |
| m16n8k16 | s8, u8 | s32 | SM 75 | Turing+ integer |
| m16n8k32 | s8, u8 | s32 | SM 75 | Turing+ integer |
| m8n8k32 | s4, u4 | s32 | SM 75 | Sub-byte |
| m16n8k64 | s4, u4 | s32 | SM 75 | Sub-byte |
| m8n8k64 | s4, u4 | s32 | SM 75 | Extended sub-byte |
| m16n8k128 | s4, u4 | s32 | SM 75 | Extended sub-byte |
| m8n8k128 | b1 | s32 | SM 75 | Binary (.and.popc / .xor.popc) |
| m16n8k256 | b1 | s32 | SM 75 | Binary extended |
| tcgen05 (10 variants) | mxf4nvf4, f8f6f4, mxf8f6f4, f16, tf32, i8, mxf4 | varies | SM 100 | +block_scale, +sparsity, +ws |
LLVM Intrinsic ID Reference
Key intrinsic IDs used in the MMA code generation pipeline:
| Intrinsic ID | Symbol | Usage |
|---|---|---|
| 8181 | llvm.nvvm.wmma.store (complex) | WMMA complex store |
| 8210 | llvm.nvvm.wmma.store | WMMA store |
| 8279 | (special) | IMMA MMA without satf |
| 8280 | (special) | Fragment count = 2 trigger |
| 8914 | (special) | Fragment count = 2 trigger |
| 9062 | llvm.nvvm.wgmma.fence.aligned | WGMMA fence (3 type overloads) |
| 9067 | llvm.nvvm.wgmma.mma.async | WGMMA MMA async (2 type overloads) |
| 9145 | llvm.nvvm.wgmma.store | WGMMA store |
| 10654--10779 | llvm.nvvm.wgmma.mma.async.* | Per-dimension WGMMA variants (144 entries) |
| 5304--5447 | (WGMMA grid) | 5-dimensional intrinsic grid for WGMMA |
Error Handling
Two error-reporting functions serve the two layers:
| Function | Address | Layer | Behavior |
|---|---|---|---|
| sub_16BD130 | 0x16BD130 | AsmPrinter / PTX emission | Fatal (severity=1 -> abort) |
| sub_C64ED0 | 0xC64ED0 | NVPTX backend / SelectionDAG | Fatal (severity=1 -> abort) |
Error categories:
- Architecture not supported: "X is not supported on this architecture" -- SM gate failure
- Constant validation: "rowcol not constant", "satf not constant" -- non-constant operand
- Type restrictions: "Wrong MMA element type" -- invalid type enum
- Feature combination: "ashift is not supported with tcgen05.mma.block_scale" -- conflicting modifiers
- Scale restrictions: "Cannot use N as scale vector size for X type" -- type/scale mismatch
Differences from Upstream LLVM
Upstream LLVM's NVPTX backend has no MMA code generation. The entire MMA pipeline -- builtin tables, three-table lookup, fragment size computation, WGMMA dimension dispatch, tcgen05 lowering, packed descriptor encoding, and all shape/type validation -- is NVIDIA-proprietary code with no upstream equivalent.
Upstream LLVM handles MMA operations at the PTX level only: the upstream NVPTXAsmPrinter can print PTX mma.sync instructions, but the instruction selection, intrinsic lowering, and code generation logic that produces them exists only in NVIDIA's cicc binary. An open-source reimplementation would need to build the entire pipeline from the WMMA/MMA intrinsic definitions through SelectionDAG lowering and PTX emission.
Cross-References
- Tensor / MMA Builtins -- per-builtin-ID reference table and validation rules
- SelectionDAG & ISel -- DAG infrastructure hosting MMA lowering
- ISel Pattern Matching -- downstream pattern matcher consuming MMA DAG nodes
- SM 90 -- Hopper -- WGMMA feature gate details
- SM 100 -- Blackwell -- tcgen05 feature gate details
- SM 120 -- Blackwell consumer variant features
- NVPTX Machine Opcodes -- ISD opcode reference
- Register Classes -- fragment register allocation
- PTX Emission -- downstream PTX text generation
NVVM Builtin Table Structure
770 builtins mapped to integer IDs (1--770) in a wyhash open-addressing hash table. Dual tables exist: pre-optimization (sub_90AEE0) and post-optimization (sub_126A910), identical in content but built and stored at separate addresses.
| Pre-opt table builder | sub_90AEE0 (109 KB, populates all 770 entries) |
| Pre-opt dispatcher | sub_913450 (name -> ID lookup) |
| Post-opt table builder | sub_126A910 (123 KB) |
| Post-opt dispatcher | sub_12731E0 (name -> ID lookup) |
| Hash function | sub_CBF760 (wyhash v4 family) |
| Hash table insert | sub_90ADD0 -> sub_C92610 -> sub_C92740 |
| Hash table find | sub_C92860 (find-only, quadratic probing) |
| Rehash | sub_C929D0 (75% load factor trigger) |
| Total builtins | 770 (IDs 1--770) |
| Storage | Open-addressing at context+480 (20-byte header) |
Architecture
sub_913450 (public API: name -> builtin ID)
|
+-- Guard: context+492 == 0?
| +-- sub_90AEE0 (lazy init: populate all 770 entries, once)
|
+-- strlen(name)
+-- sub_C92610(name, len) -> compute wyhash
+-- sub_C92860(context+480, ...) -> quadratic probe find
|
+-- return *(uint32*)(entry + 8) -> the builtin ID
Hash Table Infrastructure
The builtin name table uses a specialized 20-byte hash table header at context+480 with a parallel hash cache array and wyhash-v4 string hashing. The table employs quadratic probing with triangular-number increments and grows at 75% load factor. For 770 entries the capacity sequence is 16 -> 32 -> 64 -> 128 -> 256 -> 512 -> 1024.
Full structural details -- table layout, bucket format, string entry format, wyhash length-dispatch table with pseudocode, probing algorithm, triple-gated comparison guard, rehash procedure, and sentinel values -- are documented in Hash Table and Collection Infrastructure. The "wyhash v4 String Hasher" and "Probing Strategy" sections on that page are the canonical references.
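The triangular-increment probe interacts with the power-of-two capacity sequence in a checkable way: cumulative offsets 0, 1, 3, 6, 10, ... form a permutation of the buckets whenever the capacity is a power of two, so a probe chain can never cycle before covering the whole table. A minimal C sketch of that property (function name ours, not a recovered symbol):

```c
#include <stdbool.h>
#include <stdint.h>

/* Quadratic probing with triangular-number increments, as used by the
 * builtin name table: probe i lands at (h + i*(i+1)/2) mod capacity.
 * For power-of-two capacities this sequence is a permutation of the
 * buckets, so a full chain visits every slot exactly once. */
static bool visits_all_buckets(uint32_t h, uint32_t cap /* power of two, <= 1024 */) {
    bool seen[1024] = { false };
    uint32_t idx = h & (cap - 1);
    for (uint32_t i = 0, step = 1; i < cap; i++, step++) {
        if (seen[idx])
            return false;               /* revisited a bucket too early */
        seen[idx] = true;
        idx = (idx + step) & (cap - 1); /* offsets accumulate 1, 2, 3, ... */
    }
    return true;
}
```

Growing only to power-of-two capacities (16 through 1024 here) is what keeps this permutation property intact across rehashes.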
Complete Builtin ID Inventory
Synchronization & Compiler Intrinsics (IDs 1–7)
| ID | Name |
|---|---|
| 1 | __syncthreads |
| 2 | __nvvm_bar0 |
| 3 | __nvvm_membar_cta |
| 4 | __nvvm_membar_gl |
| 5 | __nvvm_membar_sys |
| 6 | __builtin_is_constant_evaluated |
| 7 | __builtin_unreachable |
Cluster Operations — SM 90+ (IDs 8–14)
| ID | Name |
|---|---|
| 8 | __nv_clusterDimIsSpecifed_impl |
| 9 | __nv_clusterRelativeBlockRank_impl |
| 10 | __nv_clusterSizeInBlocks_impl |
| 11 | __nv_cluster_barrier_arrive_impl |
| 12 | __nv_cluster_barrier_wait_impl |
| 13 | __nv_cluster_barrier_arrive_relaxed_impl |
| 14 | __nv_threadfence_cluster_impl |
Barrier Extensions (IDs 15–20)
| ID | Name |
|---|---|
| 15–17 | __nvvm_bar0_{popc,and,or} |
| 18–20 | __nvvm_bar{_sync_all,rier_sync,_warp_sync} |
Bit Manipulation (IDs 21–26)
__nvvm_clz_{i,ll}, __nvvm_popc_{i,ll}, __nvvm_brev{32,64}
Math — Rounding/Abs/Saturate (IDs 27–56)
__nvvm_{floor,ceil,abs,fabs,round,trunc,saturate}_{ftz_f,f,d}, __nvvm_{ex2,lg2,sin,cos}_approx_{ftz_f,f,d}
Reciprocal / Sqrt / Rsqrt (IDs 57–87)
__nvvm_rcp_{rn,rz,rm,rp}_{ftz_f,f,d}, __nvvm_sqrt_{f,rn,rz,rm,rp}_{ftz_f,f,d}, __nvvm_rsqrt_approx_{ftz_f,f,d}
Type Conversions (IDs 88–184)
97 entries covering all float↔int, double↔int, float↔half, bitcast combinations with all four rounding modes and FTZ variants.
Address Space & Memory Queries (IDs 185–204)
| ID | Name |
|---|---|
| 185 | __nv_isGlobal_impl |
| 186–188 | __nv_bswap{16,32,64}_impl |
| 189–192 | __nv_is{Shared,Constant,Local,GridConstant}_impl |
| 193–200 | __nv_cvta_{generic_to,to_generic}_{global,shared,constant,local}_impl |
| 201 | __builtin_assume |
| 202 | __nv_isClusterShared_impl |
| 203 | __nv_cluster_query_shared_rank_impl |
| 204 | __nv_associate_access_property_impl |
Atomic Operations — Legacy NVVM (IDs 207–275)
69 entries: __nvvm_atom_{,cta_,sys_}{add,xchg,min,max,inc,dec,and,or,xor}_gen_{i,ll,f,d,ui,ull,128}
FP Arithmetic (IDs 276–349)
__nvvm_{min,max}_{i,ui,ll,ull}, __nvvm_f{min,max}_{f,ftz_f,d}, __nvvm_mulhi_{i,ui,ll,ull}, __nvvm_mul_{rn,rz,rm,rp}_{ftz_f,f,d}, __nvvm_div_*, __nvvm_add_*
Vote Operations (IDs 351–358)
__nvvm_vote_{all,any,uni,ballot} + _sync variants
Match Operations (IDs 361–364)
__match{32,64}_{any,all}_sync
FMA (IDs 383–403)
__nvvm_fma_{rn,rz,rm,rp}_{ftz_f,f,d,ftz_f2,f2}
C++11 Atomics (IDs 417–473)
Sized variants: __nv_atomic_{load,store,fetch_add,fetch_sub,fetch_and,fetch_or,fetch_xor,fetch_max,fetch_min,exchange,compare_exchange}_{1,2,4,8,16}_{u,s,f}
Surface Stores — sust (IDs 474–638)
165 entries covering __nvvm_sust_b_{1d,1d_array,2d,2d_array,3d}_{i8,...,v4i32}_{clamp,trap,zero}.
Pattern: sust_b_<dim>_<type>_<oob_mode> across 5 dimensions × 11 types × 3 OOB modes.
CUDA Varargs (IDs 639–642)
__cu_va_{start,end,arg,copy}
Tex/Surf Handler (ID 647)
__nv_tex_surf_handler — generic dispatch for texture/surface reads (surface stores use the dedicated sust builtins above).
C++ ABI (IDs 648–677)
__cxa_vec_{ctor,cctor,dtor,new2,new,new3,delete2,delete,delete3}, __gen_nvvm_mem{cpy,set}_*, _Znw{j,m,y}, _Zna{j,m,y}, _ZdlPv{,m,y}, _ZdaPv{,m,y}
WMMA Tensor Core — SM 70+ (IDs 678–707)
30 entries: __hmma_m{16n16k16,32n8k16,8n32k16}_{ld_a,ld_b,ld_c_f16,ld_c_f32,st_c_f16,st_c_f32,mma_f16f16,mma_f32f16,mma_f16f32,mma_f32f32}
Integer/Binary Tensor Core — SM 75+ (IDs 708–745)
38 entries: __imma_m{16n16k16,32n8k16,8n32k16}_{ld_a,ld_b,ld_c,st_c,mma}_{s8,u8}, __imma_m8n8k32_{s4,u4}, __bmma_m8n8k128_{b1}
Extended Tensor Core — SM 80+ (IDs 746–764)
__dmma_m8n8k4_mma_f64, __mma_tf32_m16n16k8_mma_f32, __mma_bf16_m*_mma_f32 + load/store variants
WGMMA — SM 90+ (IDs 765–768)
__wgmma_mma_async_{f16,bf16,tf32,f8}
Alloca (IDs 769–770)
_alloca, __builtin_alloca
Category Summary
| Category | ID Range | Count |
|---|---|---|
| Sync/barriers/cluster | 1–20 | 20 |
| Bit manipulation | 21–26 | 6 |
| Math (floor/ceil/abs/round/etc) | 27–56 | 30 |
| Reciprocal/sqrt/rsqrt | 57–87 | 31 |
| Type conversions | 88–184 | 97 |
| Address space queries/cvta | 185–204 | 20 |
| Atomic ops (NVVM legacy) | 207–275 | 69 |
| FP min/max, mulhi, arithmetic | 276–349 | 74 |
| Vote + match operations | 351–364 | 12 |
| Compare-and-swap | 370–379 | 10 |
| FMA | 383–403 | 21 |
| Shuffle + misc | 404–416 | 13 |
| C++11 atomics (sized) | 417–473 | 57 |
| Surface stores (sust) | 474–638 | 165 |
| CUDA varargs + math shim | 639–646 | 8 |
| Tex/surf handler | 647 | 1 |
| C++ ABI + memgen + new/delete | 648–677 | 30 |
| WMMA tensor core (f16) | 678–707 | 30 |
| IMMA/BMMA tensor core | 708–745 | 38 |
| Extended tensor (dmma/tf32/bf16) | 746–764 | 19 |
| WGMMA (SM 90+ warpgroup) | 765–768 | 4 |
| Alloca | 769–770 | 2 |
| TOTAL | 1–770 | 770 |
SM Generation Coverage
| Generation | Features Enabled |
|---|---|
| SM 70 (Volta) | WMMA (half-precision tensor core) |
| SM 75 (Turing) | IMMA (integer), BMMA (binary) |
| SM 80 (Ampere) | DMMA (double), TF32, BF16 |
| SM 90 (Hopper) | WGMMA (warpgroup), cluster ops, f8 |
All 770 builtins are registered regardless of target SM. Architecture gating happens in the lowering layer that consumes the builtin IDs.
Key Observations
- Lazy initialization: The entire table is built on first lookup. Guard: context+492 != 0.
- No texture reads (suld): Only surface store builtins are registered. Texture/surface reads go through __nv_tex_surf_handler (ID 647).
- Write-once table: Tombstone mechanics exist but deletions never occur for the builtin table.
- Duplicate prefix optimization: IDA shows SSE xmmword constant loads for long common prefixes (__nvvm_sust_b_2d_array_*) — this is compiler optimization of string literal loads, not a different code path.
Atomic Operations Builtins
Atomic builtins constitute the largest and most complex category in the NVVM builtin system, spanning over 130 IDs across two distinct subsystems: the legacy NVVM intrinsic atomics (IDs 207--275, 370--379) and the C++11-model atomics (IDs 366, 417--473). Both families converge in the lowering layer at sub_12AE930 (EDG) / sub_9502D0 (NVVM), a 1495-line handler that generates inline PTX assembly with explicit memory ordering and scope annotations.
Two Atomic Subsystems
The compiler maintains two parallel atomic APIs that reflect CUDA's historical evolution. The legacy NVVM atomics (__nvvm_atom_*) predate the C++ memory model and encode scope directly in the builtin name (e.g., __nvvm_atom_cta_add_gen_i for block-scoped integer add). The C++11 atomics (__nv_atomic_*) accept ordering and scope as runtime parameters, matching the cuda::atomic_ref interface.
Both subsystems lower to identical PTX instructions. The distinction matters only during the EDG frontend phase, where sub_6BBC40 generates the mangled __nv_atomic_* names from C++ source, and the NVVM lowering layer sub_12B3FD0 dispatches them by ID.
Legacy NVVM Atomics (IDs 207--275)
These 69 builtins encode the operation, scope, and type directly in the name. The lowering dispatches through sub_12AA9B0 for exchange-style operations and sub_12ADE80 for load/store/fetch operations. Each operation exists in three scope variants: default (device), _cta_ (block), and _sys_ (system).
| ID Range | Operation | Builtin Pattern | PTX Mnemonic |
|---|---|---|---|
| 207--218 | Add | __nvvm_atom_{,cta_,sys_}add_gen_{i,ll,f,d} | atom.add |
| 219--227 | Exchange | __nvvm_atom_{,cta_,sys_}xchg_gen_{i,ll,128} | atom.exch |
| 228--251 | Min/Max | __nvvm_atom_{,cta_,sys_}{min,max}_gen_{i,ll,ui,ull} | atom.min / atom.max |
| 252--257 | Inc/Dec | __nvvm_atom_{,cta_,sys_}{inc,dec}_gen_ui | atom.inc / atom.dec |
| 258--275 | Bitwise | __nvvm_atom_{,cta_,sys_}{and,or,xor}_gen_{i,ll} | atom.and / atom.or / atom.xor |
Legacy CAS (IDs 370--379)
Compare-and-swap builtins include 128-bit variants for SM 70+ targets. The handler sub_12AA280 builds an AtomicCmpXchg IR node with acquire ordering on both success and failure paths and weak exchange semantics.
| ID Range | Operation | Builtin Pattern |
|---|---|---|
| 370--379 | CAS | __nvvm_atom_{,cta_,sys_}cas_gen_{i,ll,us,128} |
Half-Precision Atomics (IDs 459--468)
Added for SM 90+ (Hopper), these support f16x2 and f16x4 packed atomic adds:
| ID Range | Operation | Builtin Pattern | SM Gate |
|---|---|---|---|
| 459--461 | f16x2 add | __nvvm_atom_{,cta_,sys_}add_gen_f2 | SM 90+ |
| 466--468 | f16x4 add | __nvvm_atom_{,cta_,sys_}add_gen_f4 | SM 100+ (Blackwell) |
C++11 Atomics (IDs 366, 417--473)
These 57 builtins implement the CUDA C++ atomic model with explicit memory ordering and scope parameters. The EDG frontend generator at sub_6BBC40 constructs the mangled names using a __nv_atomic_fetch_{op}_{width}_{type} pattern, where width is the byte count (1, 2, 4, 8, or 16) and the type suffix is _u (unsigned), _s (signed), or _f (float).
Thread Fence (ID 366)
__nv_atomic_thread_fence emits either a volatile fence (SM <= 69) or an explicit fence.{ordering}.{scope}; PTX instruction (SM 70+). Ordering and scope are extracted from constant operand parameters at compile time.
Load/Store (IDs 417--428)
| ID | Builtin | Width | PTX |
|---|---|---|---|
| 417 | __nv_atomic_load | generic | ld.{ordering}.{scope}.{type} |
| 418--422 | __nv_atomic_load_{1,2,4,8,16} | 1--16 bytes | same |
| 423 | __nv_atomic_store | generic | st.{ordering}.{scope}.{type} |
| 424--428 | __nv_atomic_store_{1,2,4,8,16} | 1--16 bytes | same |
Fetch-Op (IDs 429--458)
Arithmetic and bitwise fetch operations are registered with width and type suffixes. Bitwise operations (and, or, xor) omit the type suffix since signedness is irrelevant for bitwise logic.
| ID Range | Operation | Builtin Pattern |
|---|---|---|
| 429--434 | fetch_add | __nv_atomic_fetch_add_{4,8}_{u,s,f} |
| 435--440 | fetch_sub | __nv_atomic_fetch_sub_{4,8}_{u,s,f} |
| 441--446 | fetch_and/or/xor | __nv_atomic_fetch_{and,or,xor}_{4,8} |
| 447--452 | fetch_max | __nv_atomic_fetch_max_{4,8}_{u,s,f} |
| 453--458 | fetch_min | __nv_atomic_fetch_min_{4,8}_{u,s,f} |
For fetch_sub with floating-point types (IDs 437, 440), the lowering negates the operand and emits atom.add rather than a dedicated subtraction instruction.
Exchange and CAS (IDs 462--473)
| ID Range | Operation | Builtin Pattern |
|---|---|---|
| 462--465 | Exchange | __nv_atomic_exchange{,_4,_8,_16} |
| 469--473 | CAS | __nv_atomic_compare_exchange{,_2,_4,_8,_16} |
PTX Inline Assembly Generation
The atomic codegen handler at sub_12AE930 (address 0x12AE930, 41KB) generates PTX inline assembly strings at compile time. The generated instruction format depends on the target SM:
Pre-SM 70 (volatile mode, unk_4D045E8 <= 0x45):
ld.volatile.b32 $0, [$1];
atom.add.volatile.u32 $0, [$1], $2;
SM 70+ (explicit memory model):
ld.acquire.gpu.b32 $0, [$1];
st.release.sys.b32 [$0], $1;
atom.add.acq_rel.cta.u32 $0, [$1], $2;
atom.cas.relaxed.gpu.b64 $0, [$1], $2, $3;
The sub_12AE930 / sub_9502D0 Algorithm in Detail
Both the EDG-side handler (sub_12AE930, 0x12AE930) and its NVVM-side twin (sub_9502D0, 0x9502D0) follow identical logic. They accept five parameters: (result, codegen_state, builtin_id, call_arg_list, type_info). The algorithm proceeds in six phases.
Phase 1: SM Version Check and Path Selection
v186 = (unk_4D045E8 <= 0x45) // SM <= 69 -> volatile mode
When v186 is true, the handler enters the pre-SM 70 "volatile" path. All atomic operations receive a .volatile qualifier instead of explicit memory ordering and scope qualifiers. The 128-bit atomics emit diagnostic 0xEB6 (3766) and are rejected entirely.
When v186 is false (SM 70+), the handler enters the memory model path, which constructs the full {mnemonic}.{ordering}.{scope}.{type} format.
Phase 2: Operand Extraction and Builtin ID Dispatch
The handler extracts between 2 and 5 operands from the call argument list (pointer, value, compare-value for CAS, plus the ordering and scope parameters encoded as compile-time constants). The builtin ID selects the PTX mnemonic via a switch:
switch (builtin_id) {
case 417..422: mnemonic = "ld"; // atomic load
case 423..428: mnemonic = "st"; // atomic store
case 429..434: mnemonic = "atom.add"; // fetch-add (unsigned, signed, float)
case 435..440: mnemonic = "atom.add"; // fetch-sub (negated; see below)
case 441..442: mnemonic = "atom.and"; // fetch-and
case 443..444: mnemonic = "atom.or"; // fetch-or
case 445..446: mnemonic = "atom.xor"; // fetch-xor
case 447..452: mnemonic = "atom.max"; // fetch-max
case 453..458: mnemonic = "atom.min"; // fetch-min
case 462..465: mnemonic = "atom.exch"; // exchange
case 469..473: mnemonic = "atom.cas"; // compare-and-swap
default: fatal("unexpected atomic builtin function");
}
For IDs 435--440 (fetch_sub), the handler does not emit atom.sub (which does not exist in PTX). Instead, for integer types it negates the operand and emits atom.add; for float types it negates via fneg and emits atom.add.f.
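The negate-and-reuse-add lowering is exact for integers, because two's-complement negation followed by addition matches subtraction bit-for-bit. A minimal C11 sketch (helper name ours, not a recovered symbol):

```c
#include <stdatomic.h>

/* fetch_sub lowered the way the handler does it: PTX has no atom.sub,
 * so the operand is negated and the fetch-add primitive is reused. */
static int fetch_sub_via_add(_Atomic int *p, int v) {
    return atomic_fetch_add(p, -v);   /* returns the pre-op value */
}
```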
For thread fence (ID 366), the handler branches to sub_12AE0E0 (volatile fence, pre-SM 70) or sub_12AE4B0 (explicit fence, SM 70+) and returns immediately, bypassing the rest of the atomic pipeline.
Phase 3: Memory Ordering Resolution
The ordering parameter is extracted from the first constant operand of the C++11 atomic call via sub_620EE0. The value (0--5) maps to a PTX qualifier string:
| Value | C++ Ordering | PTX Qualifier | Applies To |
|---|---|---|---|
| 0 | relaxed / monotonic | relaxed | All operations |
| 1 | consume (treated as acquire) | acquire | Loads, RMW |
| 2 | acquire | acquire | Loads, RMW |
| 3 | release | release | Stores, RMW |
| 4 | acq_rel | acq_rel | RMW operations |
| 5 | seq_cst | acquire (loads), release (stores) | All |
Sequential consistency (value 5) is downgraded: loads get acquire, stores get release, and RMW operations get acq_rel. True seq_cst semantics are achieved by inserting explicit fences around the operation (see "Fence Insertion for Seq_Cst" below).
Store-specific validation. For store builtins (IDs 423--428), only ordering values 0, 3, and 5 are legal. Any other value triggers fatal("unexpected memory order."). Value 5 is treated as relaxed for the store instruction itself, with the seq_cst fence handling the ordering guarantee externally.
Load-specific validation. For load builtins (IDs 417--422), values 3 (release) and 4 (acq_rel) are illegal and trigger the same fatal error.
Phase 4: Scope Resolution
The scope parameter is extracted from the second constant operand via sub_620EE0. The value (0--4) maps to a PTX scope qualifier:
switch (scope_value) {
case 0: // fall through
case 1: scope_str = "cta"; break; // thread block
case 2:
if (unk_4D045E8 > 0x59) // SM > 89
scope_str = "cluster"; // SM 90+ (Hopper)
else
scope_str = "gpu"; // SM <= 89: fallback
break;
case 3: scope_str = "gpu"; break; // device
case 4: scope_str = "sys"; break; // system
default: fatal("unexpected atomic operation scope.");
}
The cluster scope fallback is the critical SM gate at line 255 / 424 of sub_12AE930 / sub_9502D0: when the SM version is 89 or below, scope value 2 ("cluster") silently degrades to gpu. No diagnostic is emitted; the scope is simply rewritten. On SM 90+ (Hopper and later), cluster passes through to the PTX output.
Phase 5: Type Suffix Construction
The type suffix is built from two components: a type-class letter and a byte-width number. The type-class lookup uses a 4-entry table stored in local variable v196:
v196[0] = 'b' // bitwise (for exch, and, or, xor, cas)
v196[1] = 'u' // unsigned (for add, inc, dec, max, min on unsigned)
v196[2] = 's' // signed (for max, min on signed)
v196[3] = 'f' // float (for add on float/double)
The type-class index is derived from the LLVM type of the atomic operand:
- Integer type with unsigned semantics: index 1 (u)
- Integer type with signed semantics: index 2 (s)
- Floating-point type: index 3 (f)
- All other cases (exchange, CAS, bitwise): index 0 (b)
The byte-width is the size of the atomic operand in bytes. Valid sizes are validated against the bitmask 0x10116:
valid = ((1LL << byte_size) & 0x10116) != 0
This bitmask has bits set at positions 1, 2, 4, 8, and 16, accepting exactly the byte widths {1, 2, 4, 8, 16}. Any other size triggers fatal("unexpected size1").
The resulting suffix is the letter concatenated with the bit width (byte_size * 8): .u32, .s64, .f32, .b128, etc.
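A hedged C reconstruction of the width check and suffix formatting (function name ours):

```c
#include <stdio.h>
#include <string.h>

/* Phase 5 sketch: validate the operand byte width against the 0x10116
 * bitmask (bits 1, 2, 4, 8, 16 set) and format the ".u32"/".b128"-style
 * suffix from the type-class letter and the bit width. */
static int build_type_suffix(char *out, size_t n,
                             char type_letter, unsigned byte_size) {
    if (((1ULL << byte_size) & 0x10116) == 0)
        return -1;                     /* the fatal("unexpected size1") path */
    return snprintf(out, n, ".%c%u", type_letter, byte_size * 8);
}
```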
Phase 6: Inline ASM String Assembly and Emission
The handler assembles the final PTX string by concatenating the components. Two string buffers are maintained throughout: v190 (ordering string) and v193 (scope string), set during phases 3 and 4.
For SM 70+ (memory model mode):
// Loads:
sprintf(buf, "ld.%s.%s.%c%d $0, [$1];", v190, v193, type_letter, bit_width);
// Stores:
sprintf(buf, "st.%s.%s.%c%d [$0], $1;", v190, v193, type_letter, bit_width);
// RMW atomics:
sprintf(buf, "%s.%s.%s.%c%d $0, [$1], $2;", mnemonic, v190, v193, type_letter, bit_width);
// CAS:
sprintf(buf, "%s.%s.%s.%c%d $0, [$1], $2, $3;", mnemonic, v190, v193, type_letter, bit_width);
For pre-SM 70 (volatile mode):
// Loads:
sprintf(buf, "ld.volatile.%c%d $0, [$1];", type_letter, bit_width);
// RMW atomics:
sprintf(buf, "%s.volatile.%c%d $0, [$1], $2;", mnemonic, type_letter, bit_width);
Constraint string construction. The LLVM inline ASM constraint string is built dynamically to match the operand pattern:
| Pattern | Constraint String | Meaning |
|---|---|---|
| Load (ld) | "=r,l,~{memory}" or "=l,l,~{memory}" | result in reg, address in 64-bit reg, memory clobber |
| Store (st) | "l,r,~{memory}" or "l,l,~{memory}" | address in 64-bit reg, value in reg, memory clobber |
| RMW (atom.*) | "=r,l,r,~{memory}" | result, address, operand, memory clobber |
| CAS (atom.cas) | "=r,l,r,r,~{memory}" | result, address, compare, swap, memory clobber |
The register class for result and value operands is r for 32-bit types and l for 64-bit types. 128-bit types use l with pair operands.
The assembled PTX string and constraint string are passed to sub_B41A60 (NVVM side) or the equivalent EDG-side helper, which creates an LLVM InlineAsm node. The node is then emitted via sub_921880 / sub_1285290.
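Putting phases 3 through 6 together for a single RMW atomic, a sketch of the PTX/constraint string pair handed to the InlineAsm constructor (function name ours; buffer handling simplified):

```c
#include <stdio.h>
#include <string.h>

/* Assemble the inline-ASM string pair for one RMW atomic: the PTX
 * template and its matching LLVM constraint string.  Register class is
 * 'r' for 32-bit operands, 'l' for 64-bit, per the table above. */
static void emit_rmw(char *ptx, size_t pn, char *cons, size_t cn,
                     const char *mnemonic, const char *ordering,
                     const char *scope, char type_letter, unsigned bits) {
    snprintf(ptx, pn, "%s.%s.%s.%c%u $0, [$1], $2;",
             mnemonic, ordering, scope, type_letter, bits);
    char rc = (bits == 64) ? 'l' : 'r';
    snprintf(cons, cn, "=%c,l,%c,~{memory}", rc, rc);
}
```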
Fence Insertion for Seq_Cst
When the memory ordering is sequential consistency (value 5) and the SM version supports explicit fences (SM 70+), the handler does not simply emit atom.sc.{scope}. Instead, it implements seq_cst through a fence-bracketed pattern:
- Pre-fence: If the operation is a store or RMW and ordering >= release, the handler calls sub_94F9E0 (membar) or sub_94FDF0 (fence) to emit a leading fence: sub_94F9E0 emits membar.{scope}; as inline PTX, while sub_94FDF0 emits fence.sc.{scope}; or fence.acq_rel.{scope};.
- The atomic operation: Emitted with downgraded ordering (acquire for loads, release for stores, acq_rel for RMW).
- Post-fence: If the operation is a load or RMW and ordering >= acquire, a trailing fence is emitted.
The fence scope matches the atomic operation's scope. The decision to emit membar vs fence depends on the SM version and the specific ordering level: membar is used for the pre-SM 70 path (though that path should not reach this code), and fence.sc / fence.acq_rel for SM 70+.
The pre/post-fence logic is gated by two conditions in the NVVM-side handler:
PRE-FENCE: if (v186 && (v187 - 3) <= 2) // v187 is ordering; range [3,5] = release, acq_rel, seq_cst
POST-FENCE: if (!v175 && v169 == 5) // v175 = is_store; v169 = ordering = seq_cst
The Volatile Fence Handler (sub_12AE0E0)
For thread fence on SM <= 69, sub_12AE0E0 emits a volatile memory barrier. The function takes an ASM buffer and fence configuration parameters. It produces:
membar.{scope};
where the scope is derived from the fence's scope parameter (cta / gl / sys). This is the pre-memory-model equivalent of the explicit fence path.
The Explicit Fence Handler (sub_12AE4B0)
For thread fence on SM 70+, sub_12AE4B0 constructs an explicit fence.{ordering}.{scope}; instruction. The ordering for fences is a restricted set compared to atomics:
| Ordering Value | Fence Qualifier |
|---|---|
| 3 | sc (sequentially consistent) |
| 4 | acq_rel |
| 5 | sc (same as 3) |
| Other | fatal("unexpected memory order.") |
The scope string follows the same rules as atomics. The assembled string is emitted as LLVM inline ASM with a ~{memory} clobber.
Memory Ordering Encoding
The ordering parameter (values 0--5) maps to PTX qualifiers:
| Value | Ordering | Used For |
|---|---|---|
| 0 | relaxed | Default / monotonic |
| 1, 2 | acquire | Loads, RMW |
| 3 | release | Stores |
| 4 | acq_rel | RMW operations |
| 5 | acquire | Sequential consistency (downgraded) |
Scope Encoding
The scope parameter (values 0--4) maps to PTX scope qualifiers:
| Value | Scope | PTX | SM Requirement |
|---|---|---|---|
| 0, 1 | Block | .cta | All |
| 2 | Cluster | .cluster | SM 90+ (Hopper); falls back to .gpu on SM <= 89 |
| 3 | Device | .gpu | All |
| 4 | System | .sys | All |
Type Suffix Construction
The type suffix is built from a 4-entry table: b (bitwise), u (unsigned), s (signed), f (float). Combined with the byte size, this produces suffixes like .u32, .f64, .b128. Valid sizes are validated against the bitmask 0x10116 (bits for 1, 2, 4, 8, and 16 bytes).
The 13 Atomic Operations at PTX Emission
The PTX emission layer at sub_21E5E70 (base) and sub_21E6420 (L2-hinted) implements the final encoding from the NVPTX MachineInstr opcode to the PTX text. The instruction operand word at this stage encodes both scope and operation:
bits[7:4] — scope: 0 = gpu (default), 1 = cta, 2 = sys
bits[23:16] — atomic operation opcode (BYTE2)
The 13-entry dispatch table:
| Opcode | PTX Suffix | L2-Hinted Suffix | Description |
|---|---|---|---|
| 0x00 | .exch.b | .exch.L2::cache_hint.b | Bitwise exchange |
| 0x01 | .add.u | .add.L2::cache_hint.u | Unsigned add |
| 0x02 | (missing) | (missing) | No .add.s in PTX ISA |
| 0x03 | .and.b | .and.L2::cache_hint.b | Bitwise AND |
| 0x04 | (missing) | (missing) | Unused slot |
| 0x05 | .or.b | .or.L2::cache_hint.b | Bitwise OR |
| 0x06 | .xor.b | .xor.L2::cache_hint.b | Bitwise XOR |
| 0x07 | .max.s | .max.L2::cache_hint.s | Signed max |
| 0x08 | .min.s | .min.L2::cache_hint.s | Signed min |
| 0x09 | .max.u | .max.L2::cache_hint.u | Unsigned max |
| 0x0A | .min.u | .min.L2::cache_hint.u | Unsigned min |
| 0x0B | .add.f | .add.L2::cache_hint.f | Float add |
| 0x0C | .inc.u | .inc.L2::cache_hint.u | Unsigned increment |
| 0x0D | .dec.u | .dec.L2::cache_hint.u | Unsigned decrement |
| 0x0E | .cas.b | .cas.L2::cache_hint.b | Compare-and-swap |
Opcodes 0x02 and 0x04 are unoccupied. There is no signed atomic add in PTX (signed add uses .add.u since two's-complement wrapping is identical). Slot 0x04 is simply skipped.
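The equivalence that makes .add.s unnecessary is easy to demonstrate in C (the cast back from unsigned is implementation-defined in the standard, but two's-complement on every platform cicc targets):

```c
#include <stdint.h>

/* Why slot 0x02 (.add.s) is absent: in two's complement, signed and
 * unsigned addition produce identical bit patterns, so atom.add.u
 * covers both.  The cast round-trip models the wrapping hardware add. */
static int32_t add_as_unsigned(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a + (uint32_t)b);  /* wraps like atom.add.u32 */
}
```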
The scope prefix is emitted before the operation suffix:
bits[7:4] & 0xF:
0 -> (nothing; implicit .gpu scope)
1 -> ".cta"
2 -> ".sys"
Full PTX emission format:
atom[.scope].{op}.{type}{size}
Example: atom.cta.add.u32, atom.sys.cas.b64, atom.exch.b32.
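A sketch of the operand-word decode that feeds this emission (function names ours; the suffix table abbreviates the 15 slots listed above, with NULL in the two unoccupied positions):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* bits[7:4] of the operand word select the scope prefix. */
static const char *scope_str(uint32_t word) {
    switch ((word >> 4) & 0xF) {
        case 0:  return "";        /* implicit .gpu scope */
        case 1:  return ".cta";
        case 2:  return ".sys";
        default: return NULL;
    }
}

/* bits[23:16] (BYTE2) index the 13-entry operation dispatch table. */
static const char *op_str(uint32_t word) {
    static const char *ops[15] = {
        ".exch.b", ".add.u", NULL, ".and.b", NULL, ".or.b", ".xor.b",
        ".max.s", ".min.s", ".max.u", ".min.u", ".add.f", ".inc.u",
        ".dec.u", ".cas.b"
    };
    unsigned op = (word >> 16) & 0xFF;
    return op < 15 ? ops[op] : NULL;
}
```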
L2 Cache Hint System (SM 80+ / Ampere)
sub_21E6420 (address 0x21E6420) is a parallel version of the base atomic emitter sub_21E5E70. It inserts .L2::cache_hint between the operation and type suffix for all 13 atomic operations:
atom[.scope].{op}.L2::cache_hint.{type}{size}
The L2 cache hint instructs the GPU's L2 cache to retain (or evict) the atomic target data after the operation completes. This is a PTX 7.3+ feature introduced with Ampere (SM 80+).
The L2-hinted path is selected when bit 0x400 is set in the instruction's encoding flags. The hint is applied at the MachineInstr level during instruction selection, not during the inline ASM generation phase of sub_12AE930. Both paths produce identical scope and type encoding; the L2 path adds exactly the .L2::cache_hint substring.
String emission uses SSE (xmm) register loads from precomputed constant data at addresses xmmword_435F590 through xmmword_435F620 to fast-copy the 16-byte prefix of each operation string, then patches the remaining bytes. This avoids branch-heavy string concatenation for the 13 cases.
AtomicExpandPass: IR-Level Expansion (sub_20C9140)
Before sub_12AE930 handles the C++11 atomics, and separately from the legacy builtin lowering, an LLVM FunctionPass named "Expand Atomic instructions" (pass ID "atomic-expand", registered at sub_20CA900) runs on LLVM IR to decide which atomic operations the NVPTX target can handle natively and which must be expanded into CAS loops.
Expansion Decision Tree
For each atomic instruction in the function:
- shouldExpandAtomicCmpXchgInIR (vtable +0x258): Default expands all cmpxchg to LL/SC or CAS-based loops. The NVPTX override may keep native i32/i64 cmpxchg on SM 70+.
- shouldExpandAtomicRMWInIR (vtable +0x280):
  - i32 xchg/add/min/max: kept native on all SM.
  - i64 xchg/add: kept native on SM 70+.
  - i32/i64 sub/nand: always expanded to CAS loop (no native PTX instruction).
  - i8/i16 (any operation): always expanded via partword masking.
  - Float atomicAdd: native on SM 70+ (fp32), SM 80+ (fp16/bf16).
- shouldExpandAtomicLoadInIR (vtable +0x270): Native for aligned i32/i64. Expanded for i8/i16 (widen to i32 load + extract) and i128+ (decompose to multiple loads).
- shouldExpandAtomicStoreInIR (vtable +0x278): Native for aligned i32/i64. Expanded for sub-word and >64-bit types.
Sub-Word Atomic Expansion (sub_20CB200)
No NVIDIA GPU architecture through SM 120 supports native sub-word (i8/i16) atomics. The pass generates mask-and-shift wrappers around word-sized CAS loops. The mask generation function sub_20CB200 (2896 bytes) produces a 6-field output struct:
| Field | Name | Purpose |
|---|---|---|
| +0x00 | AlignedAddr | Pointer masked to word boundary: ptr & ~(word_size - 1) |
| +0x08 | AlignedType | Always i32 |
| +0x10 | PtrLSB | Low address bits: ptr & (word_size - 1) |
| +0x18 | ShiftAmt | Bit position within the word: PtrLSB * 8 (little-endian) |
| +0x20 | Inv_Mask | Inverted mask: ~(((1 << (type_size * 8)) - 1) << ShiftAmt) |
| +0x28 | Mask | Mask: (1 << (type_size * 8)) - 1 |
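The field computations can be restated as a small C helper (struct and function names ours; assumes a 4-byte container word and little-endian layout, per the table):

```c
#include <stdint.h>

/* The six fields sub_20CB200 computes for a sub-word atomic at address
 * p, reconstructed for an i32 container word. */
typedef struct {
    uintptr_t aligned_addr;  /* p & ~3: start of the containing word   */
    unsigned  ptr_lsb;       /* p & 3: byte offset within the word     */
    unsigned  shift_amt;     /* bit position = PtrLSB * 8 (LE layout)  */
    uint32_t  mask;          /* (1 << bits) - 1, unshifted             */
    uint32_t  inv_mask;      /* ~(mask << shift_amt)                   */
} PartwordInfo;

static PartwordInfo partword_info(uintptr_t p, unsigned type_bytes) {
    PartwordInfo f;
    f.aligned_addr = p & ~(uintptr_t)3;
    f.ptr_lsb      = (unsigned)(p & 3);
    f.shift_amt    = f.ptr_lsb * 8;
    f.mask         = (1u << (type_bytes * 8)) - 1;
    f.inv_mask     = ~(f.mask << f.shift_amt);
    return f;
}
```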
The CAS loop (sub_20CBD50, 1646 bytes) then:
- Shifts the new value into position: ValOperand_Shifted = new_val << ShiftAmt.
- Loops: loads the word, applies the RMW operation on the masked sub-word, attempts CAS on the full word.
- On success: extracts the sub-word result via shift + mask.
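The same loop shape can be emulated with C11 atomics for an i8 fetch_add inside an i32 word (helper name ours, not a recovered symbol):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Word-sized CAS loop for a sub-word atomic: extract the byte at
 * `shift`, apply the RMW op, splice it back, and CAS the full word.
 * Returns the pre-op value of the byte, like atomicrmw. */
static uint8_t fetch_add_u8(_Atomic uint32_t *word, unsigned shift, uint8_t v) {
    uint32_t old = atomic_load(word), desired;
    do {
        uint8_t  cur      = (uint8_t)(old >> shift);   /* masked sub-word  */
        uint32_t inv_mask = ~(0xFFu << shift);
        desired = (old & inv_mask) | ((uint32_t)(uint8_t)(cur + v) << shift);
        /* on failure, `old` is reloaded by the CAS and we retry */
    } while (!atomic_compare_exchange_weak(word, &old, desired));
    return (uint8_t)(old >> shift);
}
```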
CAS Loop Generation (sub_20C96A0)
For operations that cannot be handled natively, the pass builds a compare-and-swap loop with three basic blocks:
entry -> "atomicrmw.start" -> (CAS failure) -> "atomicrmw.start" (retry)
-> (CAS success) -> "atomicrmw.end"
Steps:
- Load current value from pointer.
- Compute new value using the RMW operation (dispatched through an 11-case switch at sub_20CC690: Xchg, Add, Sub, And, Or, Xor [implied], Nand, Max, Min, UMax, UMin, FMin, FMax).
- Emit cmpxchg with packed success+failure orderings.
- Branch back to start on failure, fall through to end on success.
Ordering-to-Fence Table (address 0x428C1E0)
The pass uses a 7-entry fence decision table indexed by LLVM AtomicOrdering enum:
| Ordering | Index | Release Fence Before? | Acquire Fence After? |
|---|---|---|---|
| NotAtomic | 0 | No | No |
| Unordered | 1 | No | No |
| Monotonic | 2 | No | No |
| Acquire | 3 | No | Yes |
| Release | 4 | Yes | No |
| AcquireRelease | 5 | Yes | Yes |
| SequentiallyConsistent | 6 | Yes | Yes (+ barrier) |
Fence emission calls sub_15F9C80 which creates an LLVM fence instruction with the specified ordering and sync scope.
Memory Barrier and Fence Emission
The PTX emission layer has two dedicated handlers for barriers and fences, separate from the atomic operation emitters.
Memory Barrier (sub_21E94F0)
Emits membar instructions based on a 4-bit operand encoding:
| Value | Instruction | Scope |
|---|---|---|
| 0 | membar.gpu | Device |
| 1 | membar.cta | Thread block |
| 2 | membar.sys | System |
| 3 | fatal("Bad membar op") | Invalid |
| 4 | fence.sc.cluster | Cluster (SM 90+) |
NVVM-Side Membar (sub_94F9E0)
At the NVVM lowering level, sub_94F9E0 handles membar emission with a different scope encoding:
| Scope Value | Scope String | PTX Instruction |
|---|---|---|
| 0, 1 | cta | membar.cta; |
| 2, 3 | gl | membar.gl; |
| 4 | sys | membar.sys; |
| Other | (none) | fatal("unexpected atomic operation scope.") |
NVVM-Side Fence (sub_94FDF0)
Constructs fence.{ordering}.{scope}; from a state array. The ordering mapping is:
| Value | Ordering String |
|---|---|
| 3 | sc |
| 4 | acq_rel |
| 5 | sc |
| Other | fatal("unexpected memory order.") |
Both membar and fence are emitted as inline PTX assembly (not LLVM IR fence instructions) because PTX-level memory ordering semantics have no direct LLVM IR equivalent at the precision NVIDIA requires.
Architecture Gates
| SM Threshold | Effect |
|---|---|
| SM <= 59 | Diagnostic 0xEB6 warning for certain atomic patterns |
| SM 60--69 | Diagnostic 0xEB2 (3762) for specific atomic patterns |
| SM <= 69 | Volatile mode; 128-bit atomics not supported (diagnostic 0xEB4) |
| SM 70+ | Explicit ordering/scope in PTX output |
| SM <= 89 | Scope value 2 silently falls back from cluster to gpu |
| SM <= 89 | Half-precision (2-byte FP) atomics not supported |
| SM 90+ (Hopper) | Cluster scope (.cluster) becomes available |
| SM 90+ | f16x2 packed atomic add (IDs 459--461) |
| SM 90+ | fence.sc.cluster becomes available |
| SM 100+ (Blackwell datacenter) | f16x4 packed atomic add (IDs 466--468) |
EDG Frontend Name Construction
The EDG atomic builtin generator sub_6BBC40 (address 0x6BBC40, 1251 lines) constructs internal function names from C++ cuda::atomic_ref calls. The algorithm uses a dispatch key v165 = *(uint16_t*)(type_node + 176), the EDG "builtin kind" tag, to select the operation:
| v165 (hex) | v165 (dec) | Operation |
|---|---|---|
| 0x6241, 0x6242 | 25153, 25154 | compare_exchange |
| 0x6248, 0x6249 | 25160, 25161 | exchange |
| 0x624F, 0x6250 | 25167, 25168 | fetch_add |
| 0x6257, 0x6258 | 25175, 25176 | fetch_sub |
| 0x625F, 0x6260 | 25183, 25184 | fetch_and |
| 0x6263, 0x6264 | 25187, 25188 | fetch_xor |
| 0x6267, 0x6268 | 25191, 25192 | fetch_or |
| 0x626B, 0x626C | 25195, 25196 | fetch_max |
| 0x6273, 0x6274 | 25203, 25204 | fetch_min |
| 0x627B, 0x627C | 25211, 25212 | load |
| 0x6280, 0x6281 | 25216, 25217 | store |
| 0x6286 | 25222 | thread_fence |
Within each pair, the odd ID is the "generic" overload that enters the renaming path; the even ID has its base name string set explicitly via strcpy.
Name Construction Algorithm (lines 877--996 of sub_6BBC40)
Step 1 -- Base name. Copy the EDG source name, then overwrite with the canonical base for the seven fetch-op builtins:
| v165 | Base name |
|---|---|
| 0x6250 | "__nv_atomic_fetch_add" |
| 0x6258 | "__nv_atomic_fetch_sub" |
| 0x6260 | "__nv_atomic_fetch_and" |
| 0x6264 | "__nv_atomic_fetch_xor" |
| 0x6268 | "__nv_atomic_fetch_or" |
| 0x626C | "__nv_atomic_fetch_max" |
| 0x6274 | "__nv_atomic_fetch_min" |
Step 2 -- Width suffix. Append "_%u" formatted with the type size in bytes from *(uint32_t*)(type_node + 128). For fetch-op builtins, the size is validated with the unsigned comparison (type_size - 4) <= 4, which admits sizes 4 through 8; since type sizes are powers of two, only 4 and 8 pass in practice.
Step 3 -- Type suffix (only for add/sub/max/min; lines 960--996). Reads type_kind = *(uint8_t*)(type_node + 140):
| type_kind | Meaning | Suffix | Condition |
|---|---|---|---|
| 2 | integer | _s | byte_4B6DF90[signedness_byte] != 0 (signed) |
| 2 | integer | _u | byte_4B6DF90[signedness_byte] == 0 (unsigned) |
| 3 | float | _f | Always |
| 6 | unsigned explicit | _u | Always |
byte_4B6DF90 is a 256-entry lookup table that maps the EDG "integer kind" sub-tag (at type_node + 160) to a boolean: 1 = signed, 0 = unsigned.
Bitwise operations (and/or/xor) omit the type suffix entirely.
Naming Pattern Summary
__nv_atomic_fetch_{op}_{width}[_{type}]
{op} = add | sub | and | xor | or | max | min
{width} = 4 | 8 (bytes)
{type} = _s (signed), _u (unsigned), _f (float), or omitted (bitwise)
For load/store/exchange/compare_exchange, only the width suffix is appended; no type suffix.
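The three steps can be condensed into a sketch; the dispatch keys and suffix rules mirror the recovered tables, but the function shape and parameter names are illustrative:

```python
# Sketch of the sub_6BBC40 name-construction algorithm. The dict keys are
# the recovered v165 builtin-kind tags; is_signed stands in for the
# byte_4B6DF90 signedness lookup.
BASE = {
    0x6250: "__nv_atomic_fetch_add", 0x6258: "__nv_atomic_fetch_sub",
    0x6260: "__nv_atomic_fetch_and", 0x6264: "__nv_atomic_fetch_xor",
    0x6268: "__nv_atomic_fetch_or",  0x626C: "__nv_atomic_fetch_max",
    0x6274: "__nv_atomic_fetch_min",
}
TYPED = {0x6250, 0x6258, 0x626C, 0x6274}   # add/sub/max/min take a type suffix

def build_name(builtin_kind, type_size, type_kind, is_signed):
    name = BASE[builtin_kind]
    # Step 2: width suffix. (size - 4) <= 4 is an unsigned-wrap range
    # check; with power-of-two sizes only 4 and 8 actually pass.
    if (type_size - 4) & 0xFFFFFFFF > 4:
        raise ValueError("fetch_op type size must be 4 or 8 bytes")  # diag 0xEA4
    name += f"_{type_size}"
    # Step 3: type suffix, skipped for the bitwise ops (and/or/xor)
    if builtin_kind in TYPED:
        if type_kind == 3:           # float
            name += "_f"
        elif type_kind == 6:         # explicit unsigned
            name += "_u"
        elif type_kind == 2:         # integer: signedness sub-tag lookup
            name += "_s" if is_signed else "_u"
    return name
```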
Validation Diagnostics
| Diagnostic | Hex | Condition |
|---|---|---|
| 852 | 0x354 | Unsupported atomic operation for target |
| 1645 | 0x66D | Wrong return type for builtin |
| 1646 | 0x66E | Unsupported type size (not in {1,2,4,8,16}) |
| 3745 | 0xEA1 | Atomic not supported for given type |
| 3746 | 0xEA2 | First param scope exceeds range (>5) |
| 3747 | 0xEA3 | Return param scope exceeds range (>4) |
| 3748 | 0xEA4 | fetch_op type size not 4 or 8 bytes |
| 3749 | 0xEA5 | Store with type_size <= 1 (too small) |
| 3750 | 0xEA6 | Load with type_size > 3 (too large) |
| 3756 | 0xEAC | CAS parameter type mismatch |
| 3757 | 0xEAD | Exchange parameter type mismatch |
| 3759 | 0xEAF | Float return not supported below SM 90 |
| 3762 | 0xEB2 | SM 60--69 atomic variant diagnostic |
| 3763 | 0xEB3 | Return type on store (SM <= 89) |
| 3764 | 0xEB4 | 128-bit store/load not supported on this SM |
| 3765 | 0xEB5 | 16-bit store not supported on SM <= 69 |
| 3766 | 0xEB6 | Generic warning for SM <= 59 |
| 3767 | 0xEB7 | Type size not in {1,2,4,8,16} bitmask |
| 3769 | 0xEB9 | Null argument list error |
EDG Type Node Field Map
| Offset | Size | Field |
|---|---|---|
| +128 | 8 | type_size (byte count: 1, 2, 4, 8, 16) |
| +140 | 1 | type_kind (0=void, 2=integer, 3=float, 6=unsigned, 8=pointer, 12=typedef) |
| +160 | varies | For type_kind 12 (typedef): pointer to underlying type. For type_kind 2 (integer): uint8_t signedness sub-tag indexed into byte_4B6DF90. |
| +168 | 8 | Pointer chain (for struct/compound types) |
| +176 | 2 | builtin_kind (the v165 dispatch tag, uint16_t) |
NVPTX MachineInstr Atomic Opcodes
At the SelectionDAG / MachineInstr level, atomic operations map to NVPTX-specific opcodes distinct from the inline ASM emission:
| MachineInstr Opcode | PTX Operation |
|---|---|
| 149 | ATOMIC_LOAD |
| 294--297 | atom.add (f32 / f64 / i32 / i64) |
| 302--305 | atom.min (s32 / s64 / u32 / u64) |
| 314--317 | atom.max (s32 / s64 / u32 / u64) |
| 462 | atom.cas (generic) |
These opcodes are emitted by the SelectionDAG lowering for native atomic operations that survive the AtomicExpandPass without expansion.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| sub_6BBC40 | 0x6BBC40 | ~1251 lines | EDG atomic builtin name generator |
| sub_12AA280 | 0x12AA280 | -- | Legacy CAS IR node builder |
| sub_12AA9B0 | 0x12AA9B0 | -- | Legacy atomic exchange handler |
| sub_12ADE80 | 0x12ADE80 | -- | Scoped atomic load/store/fetch handler |
| sub_12AE010 | 0x12AE010 | -- | Fence acquire/release emitter (EDG only; BUG on NVVM) |
| sub_12AE0E0 | 0x12AE0E0 | -- | Volatile fence emitter (pre-SM 70) |
| sub_12AE4B0 | 0x12AE4B0 | -- | Explicit fence emitter (SM 70+) |
| sub_12AE930 | 0x12AE930 | 41KB | PTX inline ASM atomic codegen (EDG side) |
| sub_12B3FD0 | 0x12B3FD0 | 103KB | Main builtin lowering mega-switch |
| sub_20C7CE0 | 0x20C7CE0 | 1399 | AtomicExpandPass: recursive type walker |
| sub_20C84C0 | 0x20C84C0 | 1656 | AtomicExpandPass: address space checker |
| sub_20C9140 | 0x20C9140 | 1204 | AtomicExpandPass: runOnFunction |
| sub_20C96A0 | 0x20C96A0 | 1814 | AtomicExpandPass: CAS loop generation |
| sub_20CA900 | 0x20CA900 | 218 | AtomicExpandPass: registration |
| sub_20CB200 | 0x20CB200 | 2896 | AtomicExpandPass: sub-word mask generation |
| sub_20CBD50 | 0x20CBD50 | 1646 | AtomicExpandPass: partword RMW expansion |
| sub_20CC690 | 0x20CC690 | 43 | AtomicExpandPass: 11-case operation dispatch |
| sub_20CD3E0 | 0x20CD3E0 | 6030 | AtomicExpandPass: partword CmpXchg expansion |
| sub_20CEB70 | 0x20CEB70 | 10640 | AtomicExpandPass: full CmpXchg LL/SC expansion |
| sub_21E5E70 | 0x21E5E70 | -- | PTX emission: base atomic opcode emitter |
| sub_21E6420 | 0x21E6420 | -- | PTX emission: L2-hinted atomic opcode emitter |
| sub_21E8EA0 | 0x21E8EA0 | -- | PTX emission: cluster barrier emitter |
| sub_21E94F0 | 0x21E94F0 | -- | PTX emission: membar/fence emitter |
| sub_9502D0 | 0x9502D0 | 55KB | PTX inline ASM atomic codegen (NVVM side) |
| sub_94F9E0 | 0x94F9E0 | -- | NVVM membar emitter |
| sub_94FDF0 | 0x94FDF0 | -- | NVVM fence emitter |
Cross-References
- Builtin System Overview -- hash table infrastructure and ID dispatch
- SM 70--89 Feature Gates -- unk_4D045E8 thresholds
- SM 90 Hopper Features -- cluster scope, fence.sc.cluster
- SM 100 Blackwell Features -- f16x4 atomics
- PTX Emission -- instruction printer subsystem
- NVPTX Opcodes Reference -- MachineInstr opcode table
- Inline Assembly Codegen -- general inline ASM infrastructure at sub_1292420
Math Function Builtins
Math builtins cover floating-point rounding, transcendental approximations, reciprocal/square-root operations, type conversions, and precise arithmetic with explicit rounding modes. They span IDs 21--184 and 276--403, totaling over 230 entries. Unlike most other builtin categories, many math builtins fall through the dispatch switch entirely and resolve via the generic LLVM intrinsic path.
Bit Manipulation (IDs 21--26)
These integer utility operations map directly to hardware instructions available on all SM targets.
| ID | Builtin | Operation |
|---|---|---|
| 21--22 | __nvvm_clz_{i,ll} | Count leading zeros (32/64-bit) |
| 23--24 | __nvvm_popc_{i,ll} | Population count (32/64-bit) |
| 25--26 | __nvvm_brev_{i,ll} | Bit reverse (32/64-bit) |
Rounding and Absolute Value (IDs 27--46)
Float rounding and absolute value operations exist in three type variants: flush-to-zero single (ftz_f), IEEE single (f), and double (d).
| ID Range | Operation | Variants |
|---|---|---|
| 27--29 | __nvvm_floor_{ftz_f,f,d} | Floor |
| 30--32 | __nvvm_ceil_{ftz_f,f,d} | Ceiling |
| 33--35 | __nvvm_abs_{ftz_f,f,d} | Absolute value (integer-style) |
| 36--38 | __nvvm_fabs_{ftz_f,f,d} | Absolute value (float) |
| 39--41 | __nvvm_round_{ftz_f,f,d} | Round to nearest |
| 42--44 | __nvvm_trunc_{ftz_f,f,d} | Truncate toward zero |
| 45--46 | __nvvm_saturate_{ftz_f,f} | Clamp to [0.0, 1.0] |
Transcendental Approximations (IDs 47--56)
Hardware-accelerated approximations for transcendental functions. These use the GPU's special function units (SFU) and are not IEEE-compliant.
| ID Range | Operation | Variants |
|---|---|---|
| 47--49 | __nvvm_ex2_approx_{ftz_f,f,d} | Base-2 exponential |
| 50--52 | __nvvm_lg2_approx_{ftz_f,f,d} | Base-2 logarithm |
| 53--55 | __nvvm_sin_approx_{ftz_f,f,d} | Sine |
| 56 | __nvvm_cos_approx_ftz_f | Cosine (FTZ only registered) |
Reciprocal (IDs 57--69)
Full-precision reciprocal with all four IEEE rounding modes and three type variants.
| ID Range | Operation | Rounding Modes |
|---|---|---|
| 57--69 | __nvvm_rcp_{rn,rz,rm,rp}_{ftz_f,f,d} | RN (nearest), RZ (zero), RM (minus), RP (plus) |
The range holds 13 entries: 4 rounding modes x 3 type variants account for 12 names, with one additional registered entry.
Square Root and Reciprocal Square Root (IDs 70--87)
| ID Range | Operation | Description |
|---|---|---|
| 70--84 | __nvvm_sqrt_{f,rn,rz,rm,rp}_{ftz_f,f,d} | Square root (5 modes x 3 types) |
| 85--87 | __nvvm_rsqrt_approx_{ftz_f,f,d} | Reciprocal square root (SFU approximation) |
The sqrt_f variant (without rounding qualifier) uses the default hardware rounding. The rsqrt_approx variants use the SFU fast path.
Type Conversions (IDs 88--184)
The largest math subcategory with 97 entries, covering every combination of source type, destination type, rounding mode, and FTZ flag.
Double-to-Float (IDs 88--95)
__nvvm_d2f_{rn,rz,rm,rp}_{ftz,} -- 4 rounding modes x 2 FTZ variants.
Integer/Float Cross-Conversions (IDs 96--177)
82 entries covering permutations of:
- Source types: d (double), f (float), i (int32), ui (uint32), ll (int64), ull (uint64)
- Destination types: same set
- Rounding modes: rn, rz, rm, rp
Pattern: __nvvm_{src}2{dst}_{rounding} (e.g., __nvvm_d2i_rn, __nvvm_f2ull_rz).
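A sketch of this name space: the full cross product yields 120 candidate names, of which 82 are registered (the exact excluded subset is not recovered here), so the generator below enumerates the full grid:

```python
# Sketch enumerating the cross-conversion builtin name grid described
# above. Generation order is illustrative; the real ID assignment is
# table-driven inside the binary.
SRC_DST = ["d", "f", "i", "ui", "ll", "ull"]
ROUND = ["rn", "rz", "rm", "rp"]

def conversion_names():
    names = []
    for src in SRC_DST:
        for dst in SRC_DST:
            if src == dst:
                continue            # no same-type conversion builtins
            for rnd in ROUND:
                names.append(f"__nvvm_{src}2{dst}_{rnd}")
    return names
```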
Half Precision (IDs 178--180)
| ID | Builtin | Description |
|---|---|---|
| 178 | __nvvm_f2h_rn_ftz | Float to half (FTZ, round nearest) |
| 179 | __nvvm_f2h_rn | Float to half (round nearest) |
| 180 | __nvvm_h2f | Half to float |
Bitcast (IDs 181--184)
Reinterpret-cast between integer and float types without value conversion. Lowered via sub_12A7DA0 which emits opcode 0x31 (49, bitcast).
| ID | Builtin | Direction |
|---|---|---|
| 181 | __nvvm_bitcast_f2i | float -> int32 |
| 182 | __nvvm_bitcast_i2f | int32 -> float |
| 183 | __nvvm_bitcast_ll2d | int64 -> double |
| 184 | __nvvm_bitcast_d2ll | double -> int64 |
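The operation itself is an ordinary bit reinterpretation, sketched here with struct pack/unpack (illustrative host-side equivalent, not CICC code):

```python
import struct

# The four bitcast builtins reinterpret bits without value conversion --
# the same semantics as an LLVM bitcast, modeled via pack/unpack.
def bitcast_f2i(x: float) -> int:
    return struct.unpack("<i", struct.pack("<f", x))[0]

def bitcast_i2f(x: int) -> float:
    return struct.unpack("<f", struct.pack("<i", x))[0]

def bitcast_d2ll(x: float) -> int:
    return struct.unpack("<q", struct.pack("<d", x))[0]
```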
Integer Min/Max and Multiply-High (IDs 276--293)
| ID Range | Operation | Types |
|---|---|---|
| 276--279 | __nvvm_{min,max}_{i,ui} | 32-bit signed/unsigned |
| 280--283 | __nvvm_{min,max}_{ll,ull} | 64-bit signed/unsigned |
| 284--289 | __nvvm_f{min,max}_{f,ftz_f,d} | Float min/max (with FTZ) |
| 290--293 | __nvvm_mulhi_{i,ui,ll,ull} | Upper half of multiplication |
Precise Float Arithmetic (IDs 294--349)
These builtins provide IEEE-compliant arithmetic with explicit rounding mode control. Each operation exists in all four rounding modes and up to five type variants (ftz_f, f, ftz_f2, f2, d).
| ID Range | Operation | Entries |
|---|---|---|
| 294--313 | __nvvm_mul_{rn,rz,rm,rp}_{ftz_f,f,ftz_f2,f2,d} | 20 |
| 314--333 | __nvvm_add_{rn,rz,rm,rp}_{ftz_f,f,ftz_f2,f2,d} | 20 |
| 334--349 | __nvvm_div_{rn,rz,rm,rp}_{ftz_f,f,d} | 16 |
FMA (IDs 383--402)
Fused multiply-add with all rounding/type combinations:
| ID Range | Operation | Entries |
|---|---|---|
| 383--402 | __nvvm_fma_{rn,rz,rm,rp}_{ftz_f,f,d,ftz_f2,f2} | 20 |
Miscellaneous (IDs 350, 380--382, 403)
| ID | Builtin | Description |
|---|---|---|
| 350 | __nvvm_lohi_i2d | Compose double from two 32-bit halves |
| 380 | __nvvm_prmt | Byte permute (PRMT instruction) |
| 381--382 | __nvvm_sad_{i,ui} | Sum of absolute differences |
| 403 | __nvvm_fns | Find Nth set bit |
Table-Based Lowering for Precise Arithmetic
The precise arithmetic builtins (mul, add, div, fma with rounding modes) are lowered through sub_12B3540 (address 0x12B3540, 10KB), which uses two lazily-initialized red-black trees (std::map<int, triple>) to map builtin IDs to IR opcode triples.
Tree 1 serves three-operand builtins (FMA): maps ID ranges to opcode 0xF59 with variant codes encoding the rounding mode and type.
Tree 2 serves two-operand builtins (mul, add, div): maps to opcodes 0xE3A, 0xE3B, 0x105E, 0x1061 depending on the operation.
The lookup procedure:
- Extract up to 4 operand arguments from the call expression
- Find the builtin ID in the appropriate tree to obtain (opcode, variant)
- Look up the IR function via sub_126A190
- Emit the call instruction via sub_1285290
- Generate the inline asm fragment via sub_12A8F50
LLVM Intrinsic Fallback Path
Many standard math builtins (floor, ceil, sin, cos, sqrt, fma, exp, log) are not handled by the switch cases at all. When the builtin table lookup returns ID 0 (name not found), the dispatcher falls through to the generic LLVM intrinsic path at LABEL_4 in sub_955A70. This path:
- Checks if the name starts with "llvm." (prefix constant 0x6D766C6C)
- Looks up the intrinsic via sub_B6ACB0 (LLVM intrinsic name-to-ID)
- Lowers all arguments with type-cast insertion where needed
- Emits a standard LLVM call via sub_921880
This means functions like llvm.floor.f32, llvm.cos.f64, and llvm.fma.f32 bypass the builtin ID system entirely and map directly to LLVM's intrinsic infrastructure.
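The prefix constant is simply the string "llvm" read as a little-endian 32-bit integer, as a quick sketch confirms (illustrative, not binary code):

```python
# 0x6D766C6C is the bytes b"llvm" as a little-endian uint32: the
# dispatcher compares the first four bytes of the name against it
# before consulting the intrinsic table.
def has_llvm_prefix(name: str) -> bool:
    raw = name.encode()[:4]
    return int.from_bytes(raw, "little") == 0x6D766C6C
```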
Float Compatibility Wrappers (IDs 643--646)
Four C runtime float functions are registered as builtins for compatibility:
| ID | Builtin | Maps To |
|---|---|---|
| 643 | __ceilf | __nvvm_ceil_f equivalent |
| 644 | __floorf | __nvvm_floor_f equivalent |
| 645 | __roundf | __nvvm_round_f equivalent |
| 646 | __truncf | __nvvm_trunc_f equivalent |
Tensor Core / MMA Builtins
Tensor core builtins implement the Warp Matrix Multiply-Accumulate (WMMA) and Warp Group MMA (WGMMA) interfaces, spanning IDs 678--770 across four SM generations. Each generation added new data types and matrix shapes, resulting in 91 registered builtins that cover half-precision, integer, binary, double-precision, TF32, BF16, and FP8 matrix operations. SM 100 (Blackwell) adds a fifth generation -- tcgen05 -- documented in Tensor / MMA Codegen.
Key Facts
| Property | Value |
|---|---|
| Builtin IDs | 678--770 (93 entries) |
| WGMMA handler (IDs 753--768) | ~800 lines in sub_12B3FD0 / sub_955A70 |
| LLVM intrinsic range (WGMMA) | 5304--5447 (144-entry 5-D grid) plus 10654--10779 (N-dimension table) |
| NVVM lowering | sub_955A70 (105KB), sub_12B3FD0 (103KB) |
| Backend emission | sub_21E74C0 (PTX builder), sub_36E9630 (tcgen05 ISD selection) |
| SM gates | SM 70+ HMMA, SM 72+ IMMA, SM 75+ BMMA, SM 80+ DMMA/TF32/BF16, SM 90+ WGMMA |
WMMA Architecture Evolution
| SM Generation | Feature | ID Range | Count |
|---|---|---|---|
| SM 70 (Volta) | HMMA: FP16 tensor core | 678--707 | 30 |
| SM 75 (Turing) | IMMA: INT8/INT4, BMMA: binary | 708--745 | 38 |
| SM 80 (Ampere) | DMMA: FP64, TF32, BF16 | 746--764 | 19 |
| SM 90 (Hopper) | WGMMA: warp-group MMA, FP8 | 765--768 | 4 |
| SM 100 (Blackwell) | tcgen05: MX formats, block-scale, sparsity | (intrinsic path) | -- |
HMMA -- Half-Precision (IDs 678--707, SM 70+)
The original tensor core builtins provide 16-bit floating-point matrix multiply for three tile shapes. Each shape has 10 operations: load A, load B, load C (f16 and f32 accumulators), store C (f16 and f32), and four MMA variants for input/output precision combinations.
| ID Range | Shape | Builtin Prefix |
|---|---|---|
| 678--687 | 16x16x16 | __hmma_m16n16k16_* |
| 688--697 | 32x8x16 | __hmma_m32n8k16_* |
| 698--707 | 8x32x16 | __hmma_m8n32k16_* |
Per-shape operations (10 each):
| Suffix | Operation | Description |
|---|---|---|
| ld_a | Load A fragment | Load matrix A tile from memory |
| ld_b | Load B fragment | Load matrix B tile from memory |
| ld_c_f16 | Load C (f16) | Load accumulator as half-precision |
| ld_c_f32 | Load C (f32) | Load accumulator as single-precision |
| st_c_f16 | Store C (f16) | Store result as half-precision |
| st_c_f32 | Store C (f32) | Store result as single-precision |
| mma_f16f16 | MMA f16->f16 | FP16 input, FP16 accumulator |
| mma_f32f16 | MMA f16->f32 | FP16 input, FP32 accumulator |
| mma_f16f32 | MMA f32->f16 | FP32 accumulator, FP16 output |
| mma_f32f32 | MMA f32->f32 | FP32 input and accumulator |
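A sketch of the resulting ID-to-name mapping, assuming the straightforward shape-major ordering implied by the ID ranges above:

```python
# Sketch of the HMMA name space: 3 shapes x 10 operations = the 30
# builtins at IDs 678-707. Suffix order follows the table above;
# the exact per-ID ordering within a shape is an assumption.
SHAPES = ["m16n16k16", "m32n8k16", "m8n32k16"]
OPS = ["ld_a", "ld_b", "ld_c_f16", "ld_c_f32", "st_c_f16", "st_c_f32",
       "mma_f16f16", "mma_f32f16", "mma_f16f32", "mma_f32f32"]

def hmma_builtins(first_id=678):
    return {first_id + 10 * i + j: f"__hmma_{shape}_{op}"
            for i, shape in enumerate(SHAPES)
            for j, op in enumerate(OPS)}
```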
IMMA -- Integer MMA (IDs 708--739, SM 75+)
Integer tensor core operations for INT8 and INT4 data types.
INT8 (IDs 708--731)
Three shapes (16x16x16, 32x8x16, 8x32x16), each with 8 operations:
| Suffix | Description |
|---|---|
| ld_a_s8 / ld_a_u8 | Load A fragment (signed/unsigned INT8) |
| ld_b_s8 / ld_b_u8 | Load B fragment (signed/unsigned INT8) |
| ld_c | Load accumulator (INT32) |
| st_c_i32 | Store result (INT32) |
| mma_s8 / mma_u8 | INT8 MMA (signed/unsigned) |
INT4 (IDs 732--739)
Single shape (8x8x32) with the same operation set but _s4 / _u4 type suffixes.
BMMA -- Binary MMA (IDs 740--745, SM 75+)
Binary (1-bit) matrix multiply with XOR-POPC and AND-POPC accumulation modes. Single shape: 8x8x128.
| ID | Builtin | Description |
|---|---|---|
| 740 | __bmma_m8n8k128_ld_a_b1 | Load A fragment (binary) |
| 741 | __bmma_m8n8k128_ld_b_b1 | Load B fragment (binary) |
| 742 | __bmma_m8n8k128_ld_c | Load accumulator |
| 743 | __bmma_m8n8k128_st_c_i32 | Store result |
| 744 | __bmma_m8n8k128_mma_xor_popc_b1 | Binary MMA (XOR + popcount) |
| 745 | __bmma_m8n8k128_mma_and_popc_b1 | Binary MMA (AND + popcount) |
Extended Tensor Core (IDs 746--764, SM 80+)
SM 80 (Ampere) added double-precision, TF32, and BF16 tensor operations.
DMMA -- Double Precision (IDs 746, 751--754)
| ID | Builtin | Description |
|---|---|---|
| 746 | __dmma_m8n8k4_mma_f64 | FP64 MMA |
| 751 | __dmma_m8n8k4_st_c_f64 | Store FP64 result |
| 752--754 | __dmma_m8n8k4_{ld_a,ld_b,ld_c} | Load fragments |
TF32 (IDs 747, 755--757)
| ID | Builtin | Description |
|---|---|---|
| 747 | __mma_tf32_m16n16k8_mma_f32 | TF32 MMA producing FP32 |
| 755--757 | __mma_tf32_m16n16k8_{ld_a,ld_b,ld_c} | Load fragments |
BF16 (IDs 748--750, 758--764)
| ID | Builtin | Description |
|---|---|---|
| 748 | __mma_bf16_m16n16k16_mma_f32 | BF16 16x16x16 MMA |
| 749 | __mma_bf16_m32n8k16_mma_f32 | BF16 32x8x16 MMA |
| 750 | __mma_bf16_m8n32k16_mma_f32 | BF16 8x32x16 MMA |
| 758--764 | __mma_bf16_m*_{ld_a,ld_b} | Load fragments for each shape |
WMMA Lowering Details
Three-Table Lookup
WMMA builtins use a three-table structure for mapping builtin IDs to LLVM intrinsic IDs:
| Table (NVVM) | Entries | Builtin ID Range | Description |
|---|---|---|---|
| dword_3F14840 | 0--29 | 678--707 | HMMA (first-generation, FP16) |
| dword_3F147E0 | 0--23 | 708--731 | IMMA (INT8) |
| dword_3F147A0 | 0--12 | 732--744 | BMMA (binary) / INT4 |
The EDG-side parallel tables live at dword_42810C0 (678--709), dword_4281060 (708--731), dword_4281020 (732--744), addressed from sub_12AC1A0.
Fragment Size Determination
The number of register-level fragments varies by operation and data type:
| Condition | Fragment Count | Example |
|---|---|---|
| First-gen WMMA, BF16, store | 4 | BF16 store_c |
| First-gen WMMA, default | 8 | FP16 mma |
| IMMA, intrinsic 8914/8280 | 2 | INT8 ld_a compact |
| BMMA | 2 | Binary operations |
| IMMA intrinsic 0x22BB/0x22BC/0x22C5/0x22C6 | 4 | INT4 load A/B |
| IMMA intrinsic 0x22BD/0x22BE/0x22C3/0x22C4/0x22CB--0x22CE | 1 | Sub-byte single-element |
| IMMA intrinsic 0x22B7/0x22BF/0x22C7 | 8 | INT8 full-width |
MMA Codegen Flow
The MMA handler (sub_94E0D0 / sub_12AC5F0) processes 5 input operands:
- dest_ptr -- Pointer to output fragment storage
- A_fragment -- Matrix A input (loaded v100 times)
- B_fragment -- Matrix B input (loaded v95 times)
- C_fragment -- Accumulator input (loaded v101 times)
- rowcol -- Layout operand (validated 0--3 for MMA)
An optional satf flag (saturation, validated 0--1) is consumed for most intrinsics except ID 8279.
The handler emits the MMA call via sub_921880 and scatters results back to the destination fragment through v103 iterations of element-wise stores.
Fragment iteration counts per family (NVVM path, sub_94E0D0):
| Family | v95 (load B) | v100 (load A) | v101 (load C) | v103 (store D) |
|---|---|---|---|---|
| BMMA (b1) | 1 | 1 | 2 | 2 |
| IMMA (0x22C0-0x22C1) | 1 | 4 | 8 | 8 |
| IMMA (0x22B8-0x22B9 = 8888-8889) | 2 | 2 | 8 | 8 |
| IMMA (0x22C8-0x22C9 = 8904-8905) | 4 | 1 | 8 | 8 |
| HMMA (default, first-gen) | 8 | 8 | varies | varies (4 or 8) |
The output fragment count is determined by bit-test: (0x300C003 >> (intrinsic_id + 127)) & 1 selects 4 vs 8 fragments.
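Interpreting that constant (and assuming intrinsic_id + 127 is already reduced to a small table offset, as the decompiled expression implies) gives the offsets that select the 4-fragment case:

```python
# The 4-vs-8 output-fragment selection is a bit test against 0x300C003.
# Enumerating the set bits of the 26-bit constant yields the table
# offsets that select 4 output fragments; all others select 8.
MASK = 0x300C003
four_fragment_offsets = [b for b in range(26) if (MASK >> b) & 1]
```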
Architecture Gating -- Exact Thresholds
The architecture version is stored at *(target_info + 252) as a DWORD.
| Function | Gate Expression | Minimum SM | Notes |
|---|---|---|---|
| sub_21DFBF0 hmmastc | v8 > 0x45 | SM 70 | FP16 store |
| sub_21E0360 hmmaldab | v8 > 0x45 | SM 70 | FP16 load A/B |
| sub_21E0870 hmmamma | v8 > 0x45 | SM 70 | FP16 MMA |
| sub_21E1280 immaldab | v8 > 0x47 | SM 72 | INT load; v8==72 && variant>1 rejected |
| sub_21E1D20 immamma | v8 > 0x47 | SM 72 | INT MMA; variant>1 && v8==72 rejected |
| sub_21E2280 bmmamma | v8 > 0x48 | SM 73/75 | Binary MMA |
| sub_36E9630 tcgen05 | arch >= 0x3E8 | SM 100 | Blackwell only |
SM 72 (Xavier) has a unique partial IMMA implementation: only variant 0/1 shapes are supported, with explicit gating that blocks higher variants. This matches hardware reality where Xavier had limited INT8 tensor cores.
WGMMA -- Warp Group MMA (SM 90+ Hopper)
WGMMA operates on an entire warp group (4 warps, 128 threads) rather than a single warp. The system is split across four builtin IDs, 20 auxiliary IDs for fence/store/load operations, and two massive handler blocks totaling ~800 lines of lowering logic.
Builtin Registration
Four builtins are registered in sub_90AEE0 (NVVM) and sub_126A910 (EDG):
| ID | Builtin | Data Type | Lowering Case |
|---|---|---|---|
| 765 (0x2FD) | __wgmma_mma_async_f16 | FP16 | Full operand set (6 chained: A, B, C, scale, negate, sparsity) |
| 766 (0x2FE) | __wgmma_mma_async_bf16 | BF16 | 2-operand (no scale/negate) |
| 767 (0x2FF) | __wgmma_mma_async_tf32 | TF32 | Reduced operand set |
| 768 (0x300) | __wgmma_mma_async_f8 | FP8 (SM 90a+) | Minimal (2 scale operands only) |
WGMMA ID Space Overview
The full WGMMA ID range spans 745--770, subdivided into four functional groups:
| ID Range | Function | Handler |
|---|---|---|
| 745--750 (0x2E9--0x2EE) | Fence / commit / wait | sub_12B1C20 / sub_953BA0 |
| 751--752 (0x2EF--0x2F0) | Store | sub_12B27B0 / sub_954350 |
| 753--764 (0x2F1--0x2FC) | MMA async load (12 variants) | inline / sub_9547E0 |
| 765--768 (0x2FD--0x300) | MMA async compute (4 type builtins) | inline ~800 lines / sub_12B2E10 |
| 769--770 (0x301--0x302) | Warp-group barrier | inline IR via sub_127FC40 |
WGMMA Fence / Commit / Wait (IDs 745--750)
sub_953BA0 (NVVM) / sub_12B1C20 (EDG) builds a red-black tree on first call with 7 entries keyed by builtin ID. Each entry packs:
struct wgmma_fence_entry {
uint32_t id; // builtin ID (745--751)
uint32_t trans_a; // transpose A flag
uint32_t shape; // shape code (0 or 1)
uint32_t trans_b; // transpose B flag
uint32_t a_nregs; // register count for A fragment
uint32_t b_nregs; // register count for B fragment
uint32_t padding; // unused alignment
llvm_type *a_type; // LLVM type for A (i64, i32, i16x2, i32x4)
llvm_type *b_type; // LLVM type for B
llvm_type *c_type; // LLVM type for C (i32x2, i32x8)
};
Decoded entries from local variables v47--v106:
| ID | trans_a | shape | trans_b | a_nregs | b_nregs | A type | B type | C type |
|---|---|---|---|---|---|---|---|---|
| 745 | 0 | 1 | 5 | 1 | 1 | i64 | i64 | -- |
| 746 | 1 | 0 | 1 | 9 | 9 | i32 | i32 | i32x2 |
| 747 | 0 | 0 | 25 | 8 | 8 | i16x2 | i16x2 | -- |
| 748 | 0 | 0 | 23 | 7 | 7 | i32x4 | i32x4 | i32x8 |
| 749 | 0 | 0 | 24 | 7 | 7 | i32x4 | i32x4 | i32x8 |
| 750 | 0 | 0 | 6 | 7 | 7 | i64 | i32x2 | i32x8 |
Output packed encoding (*a4, 64-bit):
| Bits | Field | Source |
|---|---|---|
| [3:0] | trans_a | *(entry+40) |
| [7:4] | shape | *(entry+48) << 4 |
| [15:8] | a_nregs | *(entry+64) << 8 |
| [27:16] | b_nregs | *(entry+72) << 16 |
| [31:28] | padding | *(entry+80) << 28 |
| [63:32] | trans_b | *(entry+56) << 32 |
| [25] | rowcol bit 1 | (rowcol & 2) == 0 ? 0x2000000 : 0x1000000 |
| [27:26] | rowcol bit 0 | ((rowcol & 1) + 1) << 26 |
The fence dispatch validates the rowcol operand (must be 0--3) and emits a 4-argument call to intrinsic 9062 (llvm.nvvm.wgmma.fence.aligned) with 3 type overloads. Fragment operands are prepared via sub_94B510.
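A sketch of the packed encoding per the bit layout above; the rowcol bits, which overlap the same word, are merged separately in the dispatch path and omitted here:

```python
# Sketch of the 64-bit packed fence encoding. Field names follow the
# wgmma_fence_entry struct; shift amounts come from the bit table above.
def pack_fence(trans_a, shape, a_nregs, b_nregs, padding, trans_b):
    return (trans_a             # bits [3:0]
            | (shape   << 4)    # bits [7:4]
            | (a_nregs << 8)    # bits [15:8]
            | (b_nregs << 16)   # bits [27:16]
            | (padding << 28)   # bits [31:28]
            | (trans_b << 32))  # bits [63:32]
```

For example, the decoded entry for ID 746 (trans_a=1, shape=0, a_nregs=9, b_nregs=9, trans_b=1) packs to 0x100090901.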
WGMMA Store (IDs 751--752)
sub_954350 / sub_12B27B0 builds a separate parameter lookup tree. Store operations validate rowcol (0 or 1) and emit a 5-argument call using intrinsic 9145 (llvm.nvvm.wgmma.store) with 2 type overloads. Operands: {constant, B_fragment, descriptor, rowcol, zero}.
WGMMA MMA Async Load (IDs 753--764)
sub_9547E0 (NVVM) / sub_12B2E10 (EDG) builds a 12-entry red-black tree at ctx+656:
| ID | Shape | nregs | Variant | Fragment Type |
|---|---|---|---|---|
| 753 | 1 | 9 | 0 | -- |
| 754 | 1 | 9 | 1 | -- |
| 755 | 1 | 9 | 2 | i16x2 |
| 756 | 25 | 8 | 0 | -- |
| 757 | 25 | 8 | 1 | -- |
| 758 | 25 | 10 | 2 | i32x8 |
| 759 | 23 | 7 | 0 | i32x4 |
| 760 | 23 | 7 | 1 | i32x4 |
| 761 | 24 | 7 | 0 | i32x4 |
| 762 | 24 | 7 | 1 | i32x4 |
| 763 | 6 | 7 | 0 | i32x2/i64 |
| 764 | 6 | 7 | 1 | i32x2/i64 |
Output packed encoding (*a4, 64-bit):
| Bits | Field |
|---|---|
| [63:32] | *(entry+40) << 32 |
| [31:4] | (*(entry+48) << 4) OR rowcol |
| [1] | *(entry+56) << 1 |
Emits intrinsic 9067 (llvm.nvvm.wgmma.mma.async) with 2 type overloads. Arguments: {constant, B_fragment, rowcol_value, zero_constant}. Results scattered via sub_94B940.
WGMMA MMA Async Compute -- The 800-Line Handler (IDs 765--768)
This is the primary WGMMA lowering path. It lives inline in the mega-switch of sub_955A70 (NVVM, lines ~2850--3138) and sub_12B3FD0 (EDG, lines ~2270--3138). The handler implements two completely different intrinsic selection strategies depending on which builtin ID triggered entry.
Argument Extraction
The handler walks the argument chain 7 levels deep from the call expression:
v263 = M dimension (first constant argument)
v512 = accumulator fragments (pointer to fragment array)
v528 = A descriptor (64-bit matrix descriptor or register fragments)
v524 = B descriptor (64-bit matrix descriptor)
v519 = scale factors (A and D scale constants)
v264 = layout params (rowcol encoding)
v516, v265 = shape params (additional dimension info)
v540 = element type info (integer type tag from AST)
Each constant argument is validated through sub_620FD0 (shared between the EDG and NVVM paths), which extracts the integer value and sets an overflow flag. On overflow:
"unexpected constant overflow in __wgmma_mma_async operand"
This check is applied 5 times: once for N dimension, once for each scale factor, and once for each negate/saturation bit.
Per-Builtin Argument Layouts
| ID | Builtin | Operand Chain |
|---|---|---|
| 765 (0x2FD) | _f16 | 6 chained: A, B, C, scaleA, scaleD, negate/saturation |
| 766 (0x2FE) | _bf16 | Separate branch (LABEL_56 path), 2-operand (no scale/negate) |
| 767 (0x2FF) | _tf32 | Rearranged arguments, fewer config bits |
| 768 (0x300) | _f8 | Simplest form, 2 matrix descriptors + config |
Strategy 1: N-Dimension Dispatch (IDs 765--768, inner path)
When the element type is checked and the first argument yields an N dimension, the handler enters a 33-entry switch mapping N values to LLVM intrinsic IDs in the range 10654--10779:
| N | Integer-type Intrinsic | Float-type Intrinsic |
|---|---|---|
| 8 | 10774 | 10775 |
| 16 | 10690 | 10691 |
| 24 | 10734 | 10735 |
| 32 | 10742 | 10743 |
| 40 | 10746 | 10747 |
| 48 | 10750 | 10751 |
| 56 | 10754 | 10755 |
| 64 | 10758 | 10759 |
| 72 | 10762 | 10763 |
| 80 | 10766 | 10767 |
| 88 | 10770 | 10771 |
| 96 | 10778 | 10779 |
| 104 | 10654 | 10655 |
| 112 | 10658 | 10659 |
| 120 | 10662 | 10663 |
| 128 | 10666 | 10667 |
| 136 | 10670 | 10671 |
| 144 | 10674 | 10675 |
| 152 | 10678 | 10679 |
| 160 | 10682 | 10683 |
| 168 | 10686 | 10687 |
| 176 | 10694 | 10695 |
| 184 | 10698 | 10699 |
| 192 | 10702 | 10703 |
| 200 | 10706 | 10707 |
| 208 | 10710 | 10711 |
| 216 | 10714 | 10715 |
| 224 | 10718 | 10719 |
| 232 | 10722 | 10723 |
| 240 | 10726 | 10727 |
| 248 | 10730 | 10731 |
| 256 | 10738 | 10739 |
The even/odd intrinsic ID pairing encodes the distinction between integer-element and float-element variants. Type discrimination uses the AST element type: if the element type is integer with width 10 (i.e., a 10-bit integer signaling bf16/tf32 internal encoding), the even (integer) intrinsic is selected; otherwise the odd (float) intrinsic.
N dimension validation:
if ((N & (N - 1)) != 0)
error("N only supported for powers of two");
This is applied when the N value does not match any case in the 33-entry switch. The N values 8, 16, 32, 64, 128, 256 are powers of two; the intermediate values (24, 40, 48, ..., 248) are non-power-of-two multiples of 8 that are still valid WGMMA dimensions.
Strategy 2: 5-Dimensional Intrinsic Grid (IDs 753--764 path, shared)
For the full WGMMA async variants (handled through sub_12B2E10), the handler selects from a 144-entry intrinsic table spanning IDs 5304--5447, organized as a 5-dimensional grid:
| Dimension | Values | Description |
|---|---|---|
| 1. N | {16, 32, 64, 128} | Output column dimension |
| 2. B_shared | {false, true} | Is B operand from shared memory? (sub_12A71A0 != 0) |
| 3. is_s64 | {false, true} | Is accumulator type s64/int? (type tag 2, subtype 10) |
| 4. scale/negate | varies | A scale nonzero? D scale nonzero? |
| 5. variant | {0x2FD, 0x2FE, 0x2FF, 0x300} | Which builtin triggered entry |
Base addresses and stride:
| N | Base ID | Stride per N |
|---|---|---|
| 128 | 5304 | 24 variants |
| 64 | ~5328 | 24 |
| 32 | ~5352 | 24 |
| 16 | ~5376 | 24 |
| overflow | ~5400--5447 | remaining |
Size-based opcode selection (for f16, ID 765):
| Accumulator Size | Opcode (integer) | Opcode (float) |
|---|---|---|
| 16 | 5332 | 5333 |
| 32 | 5380 | 5381 |
| 64 | 5404 | 5405 |
| 128 | 5308 | 5309 |
| other | 5356/5428 | 5357/5429 |
The mapping formula: base + N_offset + shared_offset + type_offset + variant_offset. The accumulator size is extracted by sub_12A71A0(expr) from the expression type chain.
WGMMA Config Bit Packing
Multiple boolean arguments are packed into a single configuration word passed to the final intrinsic call:
| Bit | Field | Source | Value Semantics |
|---|---|---|---|
| 0 | Accumulate / saturation flag | Final constant operand (v433) | 1 = accumulate into D, 0 = overwrite |
| 1 | ScaleD / transpose flag | v445 constant | 1 = transpose B descriptor |
| 2 | Negate-C / layout flag | v81 / v433 constant | 1 = negate accumulator input |
| 3 | Sign bit for B | v427 constant (if present) | Reserved / sign extension |
| 4 | Negate-A / additional mode | v80 / v427 constant (if present) | 1 = negate A operand |
Combined via: v79 = bit0 | (bit1 << 1) | (bit2 << 2) | (bit4 << 4).
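As a sketch of that packing formula, with the decompiler's v-variables replaced by the field meanings from the table (bit 3 is reserved in the recovered expression):

```python
# Sketch of the WGMMA config-word packing:
# v79 = bit0 | (bit1 << 1) | (bit2 << 2) | (bit4 << 4).
def pack_wgmma_config(accumulate, scale_d, negate_c, negate_a):
    return (accumulate          # bit 0: 1 = accumulate into D
            | (scale_d  << 1)   # bit 1: ScaleD / transpose flag
            | (negate_c << 2)   # bit 2: negate accumulator input
            | (negate_a << 4))  # bit 4: negate A operand
```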
After intrinsic selection, the handler:
- Converts the accumulator pointer to a vector pointer (.asvecptr tag)
- Extracts bitfields from constant operands for mode flags
- Calls sub_1285290 / sub_921880 with name hint "mmafrag"
- Scatters results via sub_94B940 / sub_1280F50 (size 4 = float elements)
WGMMA Validation Summary
All constant arguments pass through sub_620FD0, which extracts the integer value and sets an overflow flag.
| Check | Error Message | Condition |
|---|---|---|
| Constant overflow | "unexpected constant overflow in __wgmma_mma_async operand" | Any integer operand overflows extraction (5 occurrences) |
| N power-of-two | "N only supported for powers of two" | (N & (N - 1)) != 0 and N not in the 33-entry switch |
| rowcol range (fence) | "'rowcol' operand can be 0 or 1 only" | rowcol > 1 for load/store |
| rowcol range (MMA) | (implicit -- validated 0--3) | rowcol > 3 for MMA operations |
WGMMA Support Functions
| Function | Address | EDG Parallel | Purpose |
|---|---|---|---|
| sub_953BA0 | 0x953BA0 | sub_12B1C20 | Fence/commit/wait parameter lookup, builds packed 64-bit encoding |
| sub_9547E0 | 0x9547E0 | sub_12B2E10 | MMA async load parameter lookup, 12-entry red-black tree |
| sub_954350 | 0x954350 | sub_12B27B0 | Store variant parameter lookup |
| sub_94B510 | 0x94B510 | -- | Prepare fragment operand for WGMMA call |
| sub_94B940 | 0x94B940 | sub_1280F50 | Scatter MMA results back to fragment outputs |
| sub_94B2B0 | 0x94B2B0 | -- | Extract fragment element at index (WMMA shared) |
| sub_12A71A0 | 0x12A71A0 | -- | Extract size/dimension from expression type (EDG-only) |
| sub_12A6F10 | 0x12A6F10 | -- | Validate constant integer in range (EDG-only) |
| sub_620FD0 | 0x620FD0 | -- | Extract constant integer with overflow detection (shared) |
Packed MMA Descriptor Word
The MMA PTX string builder at sub_21E74C0 (AsmPrinter) / sub_35F_range (NVPTX backend) reads a packed 64-bit descriptor for all MMA instruction emission. The descriptor is stored at:
v22 = *(QWORD *)(*(QWORD *)(a1 + 16) + 16 * a2 + 8)
| Bits | Field | Query Key | Values |
|---|---|---|---|
| [0] | Row/col layout | "rowcol" | 0=row, 1=col |
| [2:1] | Matrix ID | "mid" | 0=a, 1=b, 2=c, 3=d |
| [7:4] | Binary opcode | "opc" | 0=default, 1=.and.popc, 2=.xor.popc |
| [2:0] | Rounding mode | "rnd" | 0=none, 1=.rn, 2=.rm, 3=.rp, 4=.rz |
| [15:8] | A element type | "aty" | Type enum 1--11 |
| [23:16] | B element type | "bty" | Type enum 1--11 |
| [25:24] | A layout | "al" | 0=row, nonzero=col |
| [27:26] | B layout | "bl" | 0=row, nonzero=col |
| [28] | Saturation | "satf" | 1=.satfinite |
| [39:32] | Shape enum | "shape" | 0x01--0x19, 18 entries |
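As a worked illustration of the layout, a small helper can pull each field out of a descriptor word. This is a sketch built only from the table above; note that the low bits overlap ([0] rowcol, [2:1] mid, [2:0] rnd) because the builder queries fields by key, so which low-bit decoding applies depends on the query context.

```python
# Sketch: field extraction from the packed 64-bit MMA descriptor.
# The low bits overlap ([0] rowcol, [2:1] mid, [2:0] rnd) -- the builder
# queries them by key, so the applicable decoding is context-dependent.
def mma_field(desc, lo, hi):
    """Extract bits [hi:lo] of the descriptor word."""
    return (desc >> lo) & ((1 << (hi - lo + 1)) - 1)

# Hypothetical descriptor: shape=0x12 (m16n8k16), satf set, bl=col,
# bty=7 (bf16), aty=6 (f16).
desc = (0x12 << 32) | (1 << 28) | (1 << 26) | (7 << 16) | (6 << 8)
assert mma_field(desc, 32, 39) == 0x12   # shape: m16n8k16
assert mma_field(desc, 8, 15) == 6       # aty: f16
assert mma_field(desc, 16, 23) == 7      # bty: bf16
assert mma_field(desc, 28, 28) == 1      # satf: .satfinite
```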
Shape Enum
| Enum | Shape | PTX String | Min SM | Notes |
|---|---|---|---|---|
| 0x01 | m8n8k4 | "m8n8k4" | SM 70 | Original Volta HMMA |
| 0x02 | m8n8k16 | "m8n8k16" | SM 72 | Integer MMA (s8/u8) |
| 0x03 | m8n8k32 | "m8n8k32" | SM 75 | Sub-byte (s4/u4) |
| 0x04 | m8n8k64 | "m8n8k64" | SM 75 | Extended sub-byte |
| 0x05 | m8n8k128 | "m8n8k128" | SM 75 | Binary MMA (b1) |
| 0x06 | m8n32k16 | "m8n32k16" | -- | Appears unused in standard paths |
| 0x10 | m16n8k4 | "m16n8k4" | SM 75 | Turing HMMA, f64 on Ampere |
| 0x11 | m16n8k8 | "m16n8k8" | SM 75 | Turing/Ampere HMMA |
| 0x12 | m16n8k16 | "m16n8k16" | SM 80 | Ampere HMMA (bf16, tf32) |
| 0x13 | m16n8k32 | "m16n8k32" | SM 75 | Ampere integer |
| 0x14 | m16n8k64 | "m16n8k64" | SM 75 | Sub-byte integer |
| 0x15 | m16n8k128 | "m16n8k128" | SM 75 | Extended sub-byte |
| 0x16 | m16n8k256 | "m16n8k256" | SM 75 | Binary/sub-byte (largest) |
| 0x17 | m16n16k16 | "m16n16k16" | SM 90 | Square shape, Hopper+ |
| 0x18 | m32n8k16 | "m32n8k16" | SM 80 | Tall shape |
| 0x19 | m16n16k8 | "m16n16k8" | SM 70 | WMMA f16 path |
Unknown shape codes hit the default branch and abort via BUG(). String emission uses fast-path integer stores: *(QWORD *)ptr = 0x36316B386E36316DLL emits "m16n8k16" as a single 8-byte write.
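The fast-path constant can be verified directly: interpreting the QWORD little-endian reproduces the shape string byte for byte.

```python
import struct

# The 8-byte constant 0x36316B386E36316D is ASCII "m16n8k16" stored
# little-endian, so one QWORD store emits the whole shape string.
assert struct.pack("<Q", 0x36316B386E36316D) == b"m16n8k16"
```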
Type Enum
| Enum | Type | Bits | PTX String |
|---|---|---|---|
| 1 | b1 | 1 | "b1" |
| 2 | s4 | 4 | "s4" |
| 3 | u4 | 4 | "u4" |
| 4 | s8 | 8 | "s8" |
| 5 | u8 | 8 | "u8" |
| 6 | f16 | 16 | "f16" |
| 7 | bf16 | 16 | "bf16" |
| 8 | tf32 | 19 | "tf32" |
| 9 | f64 | 64 | "f64" |
| 10 | f32 | 32 | "f32" |
| 11 | s32 | 32 | "s32" |
Any other type code produces fatal error: "Wrong MMA element type".
Shape x Type x Architecture Summary
| Shape | A/B Types | Acc Types | Min SM | Notes |
|---|---|---|---|---|
| m8n8k4 | f16 | f16, f32 | SM 70 | Original Volta |
| m16n8k4 | f64 | f64 | SM 80 | Ampere f64 |
| m16n8k8 | f16 | f16, f32 | SM 75 | Turing+ |
| m16n8k16 | f16, bf16, tf32 | f16, f32 | SM 80 | Ampere+ |
| m16n16k8 | f16 | f16, f32 | SM 70 | WMMA path |
| m16n16k16 | f16, bf16 | f16, f32 | SM 90 | Hopper+ |
| m32n8k16 | f16, bf16 | f16, f32 | SM 80 | Tall shape |
| m8n8k16 | s8, u8 | s32 | SM 72 | Integer MMA |
| m16n8k16 | s8, u8 | s32 | SM 75 | Turing+ |
| m16n8k32 | s8, u8 | s32 | SM 75 | Turing+ |
| m8n8k32 | s4, u4 | s32 | SM 75 | Sub-byte |
| m16n8k64 | s4, u4 | s32 | SM 75 | Sub-byte |
| m8n8k64 | s4, u4 | s32 | SM 75 | Extended sub-byte |
| m16n8k128 | s4, u4 | s32 | SM 75 | Extended sub-byte |
| m8n8k128 | b1 | s32 | SM 75 | Binary (.and.popc, .xor.popc) |
| m16n8k256 | b1 | s32 | SM 75 | Binary extended |
| WGMMA (N=8..256) | f16, bf16, tf32, f8 | f16, f32 | SM 90 | Warp-group, 33 N values |
| tcgen05 (10 variants) | mxf8f6f4, mxf4, mxf4nvf4, f16, bf16, tf32, i8, fp4 | varies | SM 100 | See mma-codegen |
tcgen05 Blackwell Overview (SM 100+)
Full tcgen05 documentation lives in Tensor / MMA Codegen. Key points summarized here for cross-reference:
Data type kinds (bits [8:6] of the tcgen05 operand, emitted by sub_35F3330):
| Value | Kind | Notes |
|---|---|---|
| 0 | mxf4nvf4 | MX FP4 with NV FP4 |
| 1 | f8f6f4 | FP8/FP6/FP4 standard |
| 2 | mxf8f6f4 | MX variant of f8f6f4 |
| 3 | f16 | Half precision |
| 4 | i8 | 8-bit integer (arch-conditional only) |
| 5 | tf32 | TensorFloat-32 |
| 7 | mxf4 | MX FP4 |
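A sketch of the kind decoding, assuming the bits [8:6] placement stated above (value 6 does not appear in the recovered table and is left unmapped):

```python
# Sketch: decode the tcgen05 data-type kind from bits [8:6] of the operand.
# Value 6 is absent from the recovered table and raises KeyError here.
KINDS = {0: "mxf4nvf4", 1: "f8f6f4", 2: "mxf8f6f4", 3: "f16",
         4: "i8", 5: "tf32", 7: "mxf4"}

def tcgen05_kind(operand):
    return KINDS[(operand >> 6) & 0b111]

assert tcgen05_kind(5 << 6) == "tf32"
assert tcgen05_kind(0) == "mxf4nvf4"
```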
Modifier fields:
| Modifier | Bits | Description |
|---|---|---|
| Weight stationary (.ws) | bit 0 | NOT compatible with cta_group::2, mxf8f6f4, fp4 |
| CTA group | bit 1 | cta_group::1 (clear) or cta_group::2 (set) |
| Scale vector size | [3:2] | .scale_vec::1X/2X/4X with per-type constraints |
| Scale input accumulator | bit 4 | f16/tf32 only; NOT on sm_100a/sm_103a |
| Sparsity | bit 5 | MXF4/MXF4NVF4 restricted to arch-conditional |
| Block scale alias | [10:9] | .block16 (0) or .block32 (1) |
Collector modes (emitted by sub_35F38B0):
| Value | Modifier | Constraint |
|---|---|---|
| 1 | .collector::a::lastuse | -- |
| 2 | .collector::a::fill | Cannot combine with .ashift |
| 3 | .collector::a::use | Cannot combine with .ashift |
tcgen05 scaled MMA operand builder (sub_21E8CD0 / sub_35F3E90):
| Bit | Query | Clear | Set |
|---|---|---|---|
| 0 | "scaleD" | "0" | "1" |
| 1 | "negA" | "1" (no negate) | "-1" (negate) |
| 2 | "negB" | "1" | "-1" |
| 3 | "transA" | "0" | "1" |
| 4 | "transB" | "0" | "1" |
Note the asymmetry: scaleD/transA/transB emit boolean "0"/"1" strings, while negA/negB emit sign multiplier "1"/"-1" strings. This reflects the PTX encoding where negation is a multiplication factor and transpose is a boolean flag.
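A minimal sketch of the emission rule, derived only from the bit table above (the bit-to-field assignment and string values are taken from it; everything else is illustrative):

```python
# Sketch: per-bit operand-string emission for the tcgen05 scaled MMA
# builder. scaleD/transA/transB emit boolean "0"/"1"; negA/negB emit
# sign multipliers "1"/"-1", matching the PTX encoding.
FIELDS = [("scaleD", "0", "1"), ("negA", "1", "-1"), ("negB", "1", "-1"),
          ("transA", "0", "1"), ("transB", "0", "1")]

def tcgen05_operand_strings(word):
    return {name: on if (word >> bit) & 1 else off
            for bit, (name, off, on) in enumerate(FIELDS)}

ops = tcgen05_operand_strings(0b00010)  # only bit 1 (negA) set
assert ops["negA"] == "-1"
assert ops["scaleD"] == "0"
assert ops["transA"] == "0"
```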
LLVM Intrinsic Reference
| Intrinsic ID | Name | Usage |
|---|---|---|
| 9062 | llvm.nvvm.wgmma.fence.aligned | WGMMA fence (3 type overloads) |
| 9067 | llvm.nvvm.wgmma.mma.async | WGMMA MMA async load (2 type overloads) |
| 9145 | llvm.nvvm.wgmma.store | WGMMA store (2 type overloads) |
| 10654--10779 | llvm.nvvm.wgmma.mma.async.* | Per-N-dimension variants (126 entries, even=int, odd=float) |
| 5304--5447 | (WGMMA 5-D grid) | Per-N x shared x type x scale x variant (144 entries) |
| 4905--4940 | (tcgen05 ISD opcodes) | tcgen05.mma variants (36 opcodes via 10-way shape switch) |
NVPTX Backend Duplicate Functions
All MMA emission functions exist in two structurally identical copies:
| AsmPrinter (0x21Dxxxx) | NVPTX Backend (0x36Exxxx) | Function |
|---|---|---|
| sub_21DFBF0 | sub_36E91F0 | hmmastc (HMMA store C) |
| sub_21E0360 | sub_36E72A0 | hmmaldab (HMMA load A/B) |
| sub_21E0630 | sub_36E7580 | hmmaldc (HMMA load C) |
| sub_21E0870 | sub_36E77C0 | hmmamma (HMMA MMA) |
| sub_21E1280 | sub_36E7B50 | immaldab (IMMA load A/B) |
| sub_21E15D0 | sub_36E7EA0 | immaldc (IMMA load C) |
| sub_21E1830 | sub_36E8110 | immastc (IMMA store C) |
| sub_21E1D20 | sub_36E8630 | immamma (IMMA MMA) |
| sub_21E2280 | sub_36E8BD0 | bmmamma (Binary MMA) |
| sub_21E8CD0 | sub_35F3E90 | tcgen05 scaled MMA operands |
The pairs differ only in error reporting (sub_16BD130 vs sub_C64ED0) and reference counting functions (sub_1623A60/sub_161E7C0 vs sub_B96E90/sub_B91220).
Cross-References
- Tensor / MMA Codegen -- backend PTX emission, tcgen05 full detail
- NVPTX Opcodes -- ISD opcode numbers
- SM 90 (Hopper) -- WGMMA architecture context, TMA, cluster
- SM 100 (Blackwell) -- tcgen05 architecture context
- Builtin System -- hash table, registration, dispatch architecture
Surface and Texture Builtins
Surface and texture builtins form the largest contiguous block in the builtin table, with 165 surface store entries (IDs 474--638) plus a generic texture/surface handler (ID 647). CUDA separates texture reads (which go through a unified handler) from surface writes (which have dedicated per-format builtins). This asymmetry reflects the hardware: texture reads use a programmable texture pipeline, while surface stores map directly to typed sust (surface store) instructions.
Surface Store Builtins (IDs 474--638)
The 165 sust (surface store) builtins encode the dimensionality, data type, and out-of-bounds behavior directly in the builtin name. They follow the pattern:
__nvvm_sust_b_{dim}_{type}_{oob_mode}
Dimensions (5 variants)
| Dimension | Description |
|---|---|
| 1d | One-dimensional surface |
| 2d | Two-dimensional surface |
| 3d | Three-dimensional surface |
| 1d_array | Array of 1D surfaces |
| 2d_array | Array of 2D surfaces |
Data Types (11 variants)
| Type Suffix | Element Size | Vector |
|---|---|---|
| i8 | 8-bit integer | Scalar |
| i16 | 16-bit integer | Scalar |
| i32 | 32-bit integer | Scalar |
| i64 | 64-bit integer | Scalar |
| v2i8 | 8-bit integer | 2-element vector |
| v2i16 | 16-bit integer | 2-element vector |
| v2i32 | 32-bit integer | 2-element vector |
| v2i64 | 64-bit integer | 2-element vector |
| v4i8 | 8-bit integer | 4-element vector |
| v4i16 | 16-bit integer | 4-element vector |
| v4i32 | 32-bit integer | 4-element vector |
Out-of-Bounds Modes (3 variants)
| Mode | ID Range | Behavior |
|---|---|---|
| clamp | 474--528 | Clamp coordinates to valid range |
| trap | 529--583 | Trigger hardware trap on OOB access |
| zero | 584--638 | Write zero for OOB coordinates |
The total 5 x 11 x 3 = 165 entries are registered as a contiguous block. IDA shows SSE xmmword constant loads for the long common prefix strings (__nvvm_sust_b_2d_array_*), which is the compiler's optimization of string literal initialization during registration.
Surface Store ID Layout
Within each OOB-mode block of 55 entries, the ordering is dimension-major, type-minor:
base + 0..10: 1d x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
base + 11..21: 1d_array x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
base + 22..32: 2d x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
base + 33..43: 2d_array x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
base + 44..54: 3d x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
Given a surface store builtin ID, the decomposition is:
mode_offset = (id - 474)
oob_block = mode_offset / 55 // 0=clamp, 1=trap, 2=zero
within_block = mode_offset % 55
dim_index = within_block / 11 // 0=1d, 1=1d_array, 2=2d, 3=2d_array, 4=3d
type_index = within_block % 11 // 0=i8 .. 10=v4i32
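The decomposition above can be written out directly; this sketch just inverts the dimension-major, type-minor layout using the tables in this section.

```python
# Invert the sust builtin ID layout: 3 OOB blocks of 55 entries,
# each dimension-major (5 dims) and type-minor (11 types).
DIMS = ["1d", "1d_array", "2d", "2d_array", "3d"]
TYPES = ["i8", "i16", "i32", "i64", "v2i8", "v2i16", "v2i32", "v2i64",
         "v4i8", "v4i16", "v4i32"]
OOB = ["clamp", "trap", "zero"]

def decompose_sust_id(builtin_id):
    assert 474 <= builtin_id <= 638
    off = builtin_id - 474
    return OOB[off // 55], DIMS[off % 55 // 11], TYPES[off % 11]

assert decompose_sust_id(474) == ("clamp", "1d", "i8")
assert decompose_sust_id(529) == ("trap", "1d", "i8")
assert decompose_sust_id(638) == ("zero", "3d", "v4i32")
```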
Texture/Surface Read Handler (ID 647)
All texture reads and surface reads are funneled through a single generic handler:
| ID | Builtin | Description |
|---|---|---|
| 647 | __nv_tex_surf_handler | Dispatch for all texture/surface read operations |
Unlike the surface stores, which have 165 dedicated builtins, texture reads use a string-based dispatch mechanism. The handler is a single builtin that receives the texture/surface operation name as a string operand, then dynamically constructs the appropriate LLVM intrinsic name and emits the call.
Handler Dispatch Algorithm (case 0x287 in sub_955A70)
The NVVM-side lowering for __nv_tex_surf_handler (builtin ID 647, hex 0x287) is the most complex string-based builtin dispatch in cicc. It performs five steps:
Step 1 -- String extraction. Walks the AST operand tree from the call expression to locate the constant string naming the texture/surface operation. Validates that byte 173 of the operand node equals 2 (the constant-string-type marker in the EDG AST). The string is the NVVM intrinsic base name, for example __tex_fetch or __surf_read.
Step 2 -- Element type determination. Decodes the return element type from the AST type node attached to the call. The type switch maps to suffix strings:
| AST Type | Suffix String | LLVM Type |
|---|---|---|
| void | "void" | void |
| char (as signed) | "char_as_schar" | i8 |
| char (as unsigned) | "char_as_uchar" | i8 |
| signed char | "schar" | i8 |
| unsigned char | "uchar" | i8 |
| short | "short" | i16 |
| unsigned short | "ushort" | i16 |
| int | "int" | i32 |
| unsigned int | "uint" | i32 |
| long | "long" | i32/i64 |
| unsigned long | "ulong" | i32/i64 |
| long long | "longlong" | i64 |
| unsigned long long | "ulonglong" | i64 |
| float | "float" | float |
The long/ulong width follows the host ABI convention (32-bit on NVPTX).
Step 3 -- Intrinsic name construction. Concatenates the operation base name with the element type suffix using underscore separation:
intrinsic_name = "{operation_string}_{element_type_suffix}"
For example, __tex_fetch_v4 + float yields __tex_fetch_v4_float.
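Steps 2 and 3 amount to a table lookup plus string concatenation. A minimal sketch (the suffix map is a subset of the table above; the real handler derives the key from the EDG AST type node, not a string):

```python
# Sketch of steps 2-3: map the element type to its suffix string and
# concatenate onto the operation base name with an underscore.
SUFFIX = {"signed char": "schar", "unsigned char": "uchar",
          "short": "short", "unsigned short": "ushort",
          "int": "int", "unsigned int": "uint",
          "long long": "longlong", "unsigned long long": "ulonglong",
          "float": "float", "void": "void"}

def build_intrinsic_name(op_base, element_type):
    return f"{op_base}_{SUFFIX[element_type]}"

assert build_intrinsic_name("__tex_fetch_v4", "float") == "__tex_fetch_v4_float"
```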
Step 4 -- Intrinsic lookup. Resolves the constructed name string via sub_BA8CA0 (NVVM intrinsic table lookup) to obtain the corresponding LLVM intrinsic function declaration. The EDG-side parallel path uses sub_1632190. If the intrinsic is not found, this is a fatal error.
Step 5 -- Call emission. Collects all arguments from the call expression, builds the LLVM function type signature from the argument types, and emits the intrinsic call via sub_921880. Returns a dummy i32 value via sub_AD6530.
This design allows the compiler to support an arbitrary number of texture/surface read variants without enumerating them in the builtin table. The single ID 647 entry is a trampoline that dispatches to hundreds of different NVVM intrinsics at runtime.
__nv_tex_surf_handle_t Built-in Type
The EDG parser recognizes __nv_tex_surf_handle_t as a built-in type (keyword index 277 in the keyword table at sub_72BA30). This opaque type is the C++-level representation of a texture or surface reference handle. When the type appears as a function parameter, the PTX emitter (sub_21502D0, 22KB) produces one of:
| Parameter ABI | PTX Syntax |
|---|---|
| By-value .texref | .param .texref NAME |
| By-value .surfref | .param .surfref NAME |
| By-value .samplerref | .param .samplerref NAME |
| Pointer to .texref | .param .u64 .ptr .texref NAME |
| Pointer to .surfref | .param .u64 .ptr .surfref NAME |
| Pointer to .samplerref | .param .u64 .ptr .samplerref NAME |
The selection between .texref / .surfref / .samplerref is determined by the NVVM metadata attached to the GlobalVariable that the handle references. The NVPTXReplaceImageHandles pass (sub_21DBEA0) performs the final substitution of IR-level image handles into PTX-level texture/surface references during machine-level code emission.
Texture/Surface Map Initialization
The NVVM-side handler sub_954F10 maintains two lazily-initialized red-black tree maps for resolving texture and surface operations. These maps are built once (guarded by flag bytes byte_4F6D3B0 and byte_4F6D378) and cleaned up via __cxa_atexit.
Surface Operation Map (unk_4F6D3C0)
Used when the handler's v8 flag is nonzero (surface path). Contains entries mapping builtin IDs to LLVM intrinsic IDs for surface read operations. Each entry is a 12-byte packed triple:
| Intrinsic ID | Description |
|---|---|
| 0x21CA (8650) | Surface read (primary) |
The map contains 4 entries covering surface read and write variants with address space 4 (constant memory surface descriptors).
Texture Operation Map (unk_4F6D380)
Contains entries for texture fetch operations. The map has 12 entries covering the full matrix of texture modes:
| Intrinsic ID | Mapped Builtin Base | Description |
|---|---|---|
0x1FC6 (8134) | ID 338 | Texture fetch (sync variant) |
0x23C5 (9157) | ID 302 | Texture fetch (base variant) |
0x23C8 (9160) | ID 303 | Texture fetch (alternate) |
These 12 entries span the following texture fetch modes:
| Mode | Behavior |
|---|---|
| Unfiltered fetch | Direct texel access at integer coordinates |
| Filtered fetch | Hardware-interpolated fetch at float coordinates |
| LOD fetch | Explicit level-of-detail selection |
| Gradient fetch | Gradient-based LOD computation |
Map Lookup and Dispatch (sub_954F10)
function TexSurfSampleHandler(retval, ctx, builtin_id, arglist):
// Determine surface vs texture path
is_surface = (v8 flag != 0)
if is_surface:
map = unk_4F6D3C0 // surface map
if not initialized:
populate 4 entries into red-black tree
byte_4F6D3B0 = 1
else:
map = unk_4F6D380 // texture map
if not initialized:
populate 12 entries into red-black tree
byte_4F6D378 = 1
// Tree lookup
entry = rbtree_find(map, builtin_id)
if found:
intrinsic_id = entry.intrinsic_id // e.g. 0x1FC6
else:
intrinsic_id = 0
default_mode = 1
// Create type constant from element type
type_const = sub_BCB2D0(sub_ACD640(...))
// Process 4 standard operands
for operand in [sampler, coordinate, lod, bias]:
if operand != null:
lowered = type_cast(operand, expected_llvm_type)
emit_store(lowered) // sub_B4D190 or sub_B4D3C0
// Build and emit intrinsic call
fn_decl = sub_90A810(intrinsic_tables, intrinsic_id, ...)
sub_921880(fn_decl, args) // emit call
sub_B4D3C0(result) // store result
Operand Processing
For each of the 4 standard texture operands (sampler, coordinate, LOD, bias), the handler:
- Checks if the operand is non-null
- Type-casts to match the expected LLVM type
- Creates a store instruction via sub_B4D190 (loads) or sub_B4D3C0 (stores)
- Builds the LLVM call via sub_90A810 with the resolved intrinsic ID
SelectionDAG Lowering Layer
After NVVM builtin lowering produces LLVM IR intrinsic calls, the SelectionDAG layer translates these into NVPTX-specific DAG nodes. Three subsystems handle different aspects.
Intrinsic Lowering Dispatch (sub_33B0210, 343KB)
The central intrinsic lowering function dispatches on LLVM intrinsic IDs via a giant switch covering ~440 case labels. Texture and surface operations occupy three distinct ID ranges:
| Intrinsic ID Range | Handler | Category |
|---|---|---|
| 0x5D--0x8D (93--141) | sub_33A4350 | Texture fetch bulk handler (50 IDs) |
| 0x8E--0x90 (142--144) | sub_33A3180 | Surface read/write handler (3 IDs) |
| 0x91 (145) | Inline | Complex texture sample with LOD/bias |
| 0x92--0x98 (146--152) | Various | Surface store variants |
| 0x9C--0x9D (156--157) | sub_33AEC60 | Surface atomics |
| 0x9E--0x9F (158--159) | sub_33AFBA0 / sub_340EC60 | Surface special ops |
| 0xA0--0xA2 (160--162) | Various | Surface/texture helpers |
| 0x2952 (10578) | Inline | nvvm_texsurf_handle binding |
| 0x254D+ (9549+) | sub_34B8FD0 | Unified texture sample core |
Texture Fetch Bulk Handler: sub_33A4350
The 50 consecutive intrinsic IDs 0x5D through 0x8D all delegate to a single helper sub_33A4350(state, dag_node). This function maps the intrinsic ID to an NVPTXISD opcode for one of the tex.1d, tex.2d, tex.3d, or tex.a1d/tex.a2d (array) variants.
The intrinsic-to-opcode mapping encodes:
dimension: 1d / 2d / 3d / 1d_array / 2d_array / cubemap
data_type: u32 / s32 / f32 / f32f32 (filtered)
return_width: scalar / v2 / v4
access_mode: level / grad / unified
Each opcode corresponds to a PTX texture instruction pattern that the instruction emitter will later produce.
Complex Texture Sample (Intrinsic ID 0x91)
The most complex texture lowering path. Handles hardware-filtered texture sampling with programmable LOD computation:
- sub_3281100 -- Determines element count for the return type
- sub_3281590 -- Computes alignment for the result buffer
- sub_327FD70 -- Resolves the return MVT (machine value type)
- sub_33CC4A0 -- SM-specific path selection (some SM levels use different instruction encodings)
- sub_3406EB0 (opcode=57) -- Creates the core sample DAG node
- sub_33FAF80 (opcode=213) -- LOD computation DAG node
- sub_3406EB0 (opcode=186) -- Merge result node
- sub_33FAF80 (opcode=389) -- Final type fixup
- Fallback via sub_33A1E80 if the target architecture does not support this texture mode
Surface Read/Write Handler: sub_33A3180
Intrinsic IDs 0x8E (surf1Dread), 0x8F (surf2Dread), 0x90 (surf3Dread) delegate to sub_33A3180(state, dag_node, intrinsic_id). The intrinsic_id parameter selects the dimensionality. This handler produces NVPTXISD suld (surface load) DAG nodes.
Texture/Surface Handle Binding (Intrinsic 0x2952)
The nvvm_texsurf_handle intrinsic (ID 10578) is the mechanism for binding a GlobalVariable to a texture or surface reference. The DAG lowering:
- Validates that operand 0 is metadata wrapping a GlobalVariable -- errors with "nvvm_texsurf_handle op0 must be metadata wrapping a GlobalVariable" otherwise
- Creates a DAG constant node for the handle via sub_3400BD0 (opcode=10579)
- Binds the handle via sub_3406EB0 (opcode=46)
The NVPTXReplaceImageHandles pass (sub_21DBEA0) later resolves these abstract handles into concrete PTX .texref / .surfref globals during machine-level emission.
Unified Texture Sample Core (Intrinsic IDs 0x254D+)
For SM 30+ unified texture mode, a more complex sampling path handles the full matrix of texture configurations:
- sub_34B8FD0 -- Unpacks the parameter block encoding dimension, filtering, coordinate type
- Vtable dispatch at *src+88 -- Selects the sampling mode (point, linear, etc.)
- sub_3409320 -- Creates the sampler state DAG node
- sub_33EB1C0 (opcode=47) -- Creates the core tex/surf sample DAG node with memory semantics
- sub_33FC220 (opcode=2) -- Merges vector result components
- sub_33E5830 + sub_3411630 (opcode=55) -- Packages the final result
- sub_B91FC0 -- Attaches debug info
Two modes exist: v2637=true (unified texture) and v2637=false (legacy separate-handle texture). The unified path is the modern default.
Texture/Surface Binding Lowering (Intrinsic IDs 0x44, 0x45, 0x47)
These intrinsics handle the compile-time binding of texture and surface references. The lowering checks the a1+120 flag to determine whether the reference is a .texref or .surfref:
- sub_3382030 -- Initial binding setup
- sub_3382930 -- Variant analysis via sub_3380DB0 and sub_B58DC0
- sub_3386E40 -- Final binding emission
Intrinsic 0x48 (opcode 332) handles global texture handles, while 0x162 (opcode 331) handles sampler handles. Intrinsic 0x169 dispatches to sub_3400BD0 + sub_3406EB0(opcode=333) for indirect texture access.
Instruction Selection: sub_306A930 (52KB)
The NVPTX instruction selection pass contains a 52KB handler (sub_306A930) dedicated to matching texture/surface DAG nodes to machine instructions. It calls five helper functions:
| Helper | Address | Role |
|---|---|---|
| sub_2FE5F00 | 0x2FE5F00 | Texture instruction type selection |
| sub_2FE5F30 | 0x2FE5F30 | Surface instruction type selection |
| sub_2FE5F60 | 0x2FE5F60 | Image type validation |
| sub_2FE69A0 | 0x2FE69A0 | Coordinate mode encoding |
| sub_2FE6CC0 | 0x2FE6CC0 | Return type dispatch |
The ISel handler selects among tex, suld, sust machine instruction patterns, with address space awareness for the different texture/surface memory regions.
Image Type Validation: sub_21DD1A0 (16KB)
A dedicated 16KB validation function (sub_21DD1A0) checks that the image type encoding is legal for the instruction class. Four error messages cover the instruction categories:
| Error String | Instruction Class |
|---|---|
"Invalid image type in .tex" | Texture fetch |
"Invalid image type in .suld" | Surface load |
"Invalid image type in suq." | Surface query |
"Invalid image type in .sust" | Surface store |
This validation occurs during instruction emission, catching type mismatches that survived earlier lowering.
Surface Store Lowering Details
Surface store builtins in the 474--638 range are handled by the main dispatch switch with a block of consecutive cases. Each case:
- Extracts the surface handle, coordinate(s), and data value(s) from the argument list
- The number of coordinate arguments varies by dimensionality (1D: 1, 2D: 2, 3D: 3, arrays: +1 for layer index)
- The number of data arguments varies by vector width (scalar: 1, v2: 2, v4: 4)
- Emits a call to the corresponding llvm.nvvm.sust.b.* intrinsic
The out-of-bounds mode is encoded in the intrinsic name itself, not as a parameter, which is why each mode requires a separate builtin ID.
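The argument-count arithmetic implied by these rules can be sketched as follows (the +1 handle slot reflects the handle/coordinate/data argument order described above; the exact calling convention of the recovered handler is not confirmed):

```python
# Sketch: argument counts for a sust builtin, per the rules above
# (coords: 1 per dimension, +1 layer index for arrays; data: 1/2/4
# by vector width; plus the surface handle itself).
def sust_arg_counts(dim, type_suffix):
    coords = {"1d": 1, "2d": 2, "3d": 3, "1d_array": 2, "2d_array": 3}[dim]
    if type_suffix.startswith("v4"):
        data = 4
    elif type_suffix.startswith("v2"):
        data = 2
    else:
        data = 1
    return 1 + coords + data  # handle + coordinates + data values

assert sust_arg_counts("1d", "i32") == 3
assert sust_arg_counts("2d_array", "v4i16") == 8
```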
PTX Emission: Sampler State Initializers
The PTX emitter sub_2156420 (20KB) handles module-level emission of texture, surface, and sampler global variables. Sampler references receive structured initializers:
.global .samplerref my_sampler = {
addr_mode_0 = wrap, // or clamp_to_border, clamp_to_edge, mirror
addr_mode_1 = clamp_to_edge,
addr_mode_2 = clamp_to_edge,
filter_mode = linear, // or nearest
force_unnormalized_coords = 1
};
The addressing mode and filter mode values are extracted from NVVM metadata attached to the sampler GlobalVariable. The emitter recognizes these sampler reference types via sub_1C2E890 and generates the structured PTX initializer. Texture and surface references use the simpler forms:
.global .texref my_texture;
.global .surfref my_surface;
End-to-End Pipeline
The complete texture/surface compilation pipeline spans five compiler phases:
| Phase | Function(s) | What Happens |
|---|---|---|
| EDG Frontend | sub_72BA30 | Parses __nv_tex_surf_handle_t as built-in type; keyword 277 |
| NVVM Builtin Lowering | sub_955A70 case 0x287 / sub_954F10 | String-based dispatch constructs LLVM intrinsic names; red-black tree maps resolve builtin IDs to intrinsic IDs |
| SelectionDAG Lowering | sub_33B0210 / sub_33A4350 / sub_33A3180 | 50+ texture intrinsic IDs become NVPTXISD DAG nodes; handle binding validated against GlobalVariable metadata |
| Instruction Selection | sub_306A930 (52KB) | DAG nodes matched to tex.* / suld.* / sust.* machine instructions |
| PTX Emission | sub_2156420 / sub_21DD1A0 | .texref/.surfref/.samplerref globals emitted; image type validated; NVPTXReplaceImageHandles substitutes abstract handles |
Architecture Considerations
Surface and texture operations are available on all SM architectures. However, the texture pipeline has evolved significantly:
- All SM: Basic texture fetch, surface read/write with clamp/trap/zero modes
- SM 30+: Unified texture mode via __nv_tex_surf_handler generic dispatch; v2637=true path in DAG lowering
- SM 90+ (Hopper): Tensor memory accelerator (TMA) operations provide an alternative high-throughput path for bulk data movement, partially overlapping with texture/surface functionality but handled through separate builtins (IDs 411--412)
The 165 surface store builtins are registered unconditionally regardless of target SM. Architecture gating occurs at the PTX emission layer, not during builtin registration or lowering. The complex texture sample path (intrinsic 0x91) has an explicit SM feature gate via sub_33CC4A0 that selects alternate instruction encodings for older architectures, with sub_33A1E80 as the fallback for unsupported targets.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVVM builtin lowering dispatch | sub_955A70 | -- | Main switch; case 0x287 handles __nv_tex_surf_handler |
| Texture/surface sample handler | sub_954F10 | -- | Red-black tree dispatch for IDs 302--309, 338--345, 395--402 |
| EDG keyword handler | sub_72BA30 | -- | Parses __nv_tex_surf_handle_t built-in type (keyword 277) |
| NVPTX intrinsic lowering | sub_33B0210 | -- | 343KB central dispatch; tex IDs 0x5D--0x8D, surf IDs 0x8E--0x90 |
| Texture fetch bulk handler | sub_33A4350 | -- | 50 consecutive intrinsic IDs for all tex1D/2D/3D/array variants |
| Surface read/write handler | sub_33A3180 | -- | 3 intrinsic IDs for surf1D/2D/3D read |
| Tex/surf sample DAG node builder | sub_33EB1C0 | -- | Creates memory-typed NVPTXISD sample nodes (opcode 47) |
| Sampler state DAG node builder | sub_3409320 | -- | Creates sampler state binding nodes |
| Surface atomics handler | sub_33AEC60 | -- | Intrinsic IDs 0x9C--0x9D |
| Surface special handler | sub_33AFBA0 | -- | Intrinsic ID 0x9E |
| Texture/surface ISel | sub_306A930 | -- | 52KB instruction selection for tex/suld/sust patterns |
| Image type validator | sub_21DD1A0 | -- | 16KB; validates .tex/.suld/.sust/suq. image types |
| NVPTXReplaceImageHandles | sub_21DBEA0 | -- | Replaces IR image handles with PTX .texref/.surfref |
| Global variable emitter | sub_2156420 | -- | 20KB; emits .texref/.surfref/.samplerref with initializers |
| Parameter list emitter | sub_21502D0 | -- | 22KB; emits .param .texref/.surfref/.samplerref in function signatures |
| visitNVVMTexSurf | sub_2077400 | -- | 20KB SelectionDAGBuilder extension for tex/surf handle lowering |
| NVVM intrinsic lookup | sub_BA8CA0 | -- | Resolves constructed intrinsic name string to LLVM function declaration |
| Intrinsic table lookup | sub_90A810 | -- | Resolves intrinsic ID to function declaration with type overloads |
Cross-References
- Builtin System Overview -- Hash table infrastructure and ID assignment
- Atomics Builtins -- PTX inline asm generation pattern shared by surface stores
- NVPTX Instruction Selection -- ISel pattern matching context
- SelectionDAG Lowering -- DAG node construction infrastructure
- PTX Emission -- Final instruction text generation
- Address Spaces -- Memory space qualifiers for tex/surf
Barrier and Synchronization Builtins
Barrier builtins handle thread synchronization, memory fencing, and cluster-level coordination. They span IDs 1--5 (core barriers), 8--20 (cluster and barrier extensions), and several scattered IDs for memory barriers and fences. The lowering layer emits either LLVM intrinsic calls or inline PTX assembly, depending on whether the operation has a direct LLVM IR equivalent.
Core Barriers (IDs 1--5)
The most fundamental synchronization primitives in CUDA map to the lowest builtin IDs.
| ID | Builtin | PTX Equivalent | Description |
|---|---|---|---|
| 1 | __syncthreads | bar.sync 0 | Block-wide barrier |
| 2 | __nvvm_bar0 | bar.sync 0 | Alias for __syncthreads |
| 3 | __nvvm_membar_cta | membar.cta | CTA-scope memory fence |
| 4 | __nvvm_membar_gl | membar.gl | Device-scope memory fence |
| 5 | __nvvm_membar_sys | membar.sys | System-scope memory fence |
The core __syncthreads (ID 1) lowers to the LLVM intrinsic llvm.nvvm.barrier0 (intrinsic ID 8259). Memory barriers at IDs 3--5 are lowered via inline IR generation: the handler builds a barrier store node through sub_128B420 / sub_92C9E0 and inserts it into the current basic block.
Barrier Extensions (IDs 15--20)
These builtins extend the basic barrier with predicate reduction and explicit warp/block synchronization.
| ID | Builtin | Intrinsic | Description |
|---|---|---|---|
| 15 | __nvvm_bar0_popc | llvm.nvvm.barrier0.popc | Barrier + population count of predicate |
| 16 | __nvvm_bar0_and | llvm.nvvm.barrier0.and | Barrier + AND reduction of predicate |
| 17 | __nvvm_bar0_or | llvm.nvvm.barrier0.or | Barrier + OR reduction of predicate |
| 18 | __nvvm_bar_sync_all | llvm.nvvm.barrier.sync (8925) | Named barrier sync (all threads) |
| 19 | __nvvm_barrier_sync | llvm.nvvm.barrier.sync.cnt (9296) | Named barrier sync with count |
| 20 | __nvvm_bar_warp_sync | llvm.nvvm.bar.warp.sync (8258) | Warp-level barrier |
The reduction barriers (IDs 15--17) are dispatched through sub_12AB550 / sub_94C360. The handler looks up intrinsic 3767 (EDG) or the corresponding entry from dword_3F14778[] (NVVM) and emits a function call via sub_1285290 / sub_921880. ID 16 sets flag=1 (AND) and ID 17 sets flag=16|0 (OR); the population count variant uses the default flag.
Barriers with explicit count (IDs 205--206, __nvvm_bar_sync_all_cnt and __nvvm_barrier_sync_cnt) follow the same pattern with additional count arguments.
Cluster Operations (IDs 8--14, SM 90+)
Thread block cluster operations were introduced with SM 90 (Hopper). These builtins query cluster geometry and perform inter-block synchronization within a cluster.
Cluster Geometry Queries (IDs 8--10, 405--408)
| ID | Builtin | Handler | Description |
|---|---|---|---|
| 8 | __nv_clusterDimIsSpecified_impl | sub_12AB0E0(ctx, 0) | Whether cluster dimensions are explicit |
| 9 | __nv_clusterRelativeBlockRank_impl | sub_12AB0E0(ctx, 1) | Block rank within cluster |
| 10 | __nv_clusterSizeInBlocks_impl | sub_12AB0E0(ctx, 2) | Number of blocks in cluster |
| 405 | __nv_clusterDim_impl | -- | Cluster dimension |
| 406 | __nv_clusterRelativeBlockIdx_impl | -- | Block index within cluster |
| 407 | __nv_clusterGridDimInClusters_impl | -- | Grid dimension in cluster units |
| 408 | __nv_clusterIdx_impl | -- | Cluster index |
Cluster Barriers (IDs 11--14)
| ID | Builtin | Intrinsic ID | Description |
|---|---|---|---|
| 11 | __nv_cluster_barrier_arrive_impl | 3767 | Signal arrival at cluster barrier |
| 12 | __nv_cluster_barrier_wait_impl | 3767 | Wait at cluster barrier |
| 13 | __nv_cluster_barrier_arrive_relaxed_impl | 3767 | Relaxed arrival (no ordering guarantee) |
| 14 | __nv_threadfence_cluster_impl | 4159 / 9052 | Cluster-scope memory fence |
The cluster fence at ID 14 emits intrinsic llvm.nvvm.cp.async.commit.group (EDG intrinsic 4159, NVVM intrinsic 9052) with a flag constant of 4, encoding the thread-fence semantic.
Cluster Shared Memory (IDs 202--203, 365)
| ID | Builtin | Description |
|---|---|---|
| 202 | __nv_isClusterShared_impl | Query if address is in cluster shared memory |
| 203 | __nv_cluster_query_shared_rank_impl | Get rank of block that owns shared address |
| 365 | __nv_cluster_map_shared_rank_impl | Map address to another block's shared memory |
ID 203 has an SM-dependent lowering path: on SM <= 63, the handler returns an inline constant (passthrough); on SM 64+, it emits intrinsic 3769 (EDG) / 8825 (NVVM). The same pattern applies to ID 365, which gates on intrinsic 3770 / 9005.
Memory Fence Lowering
Memory fences are emitted as inline PTX assembly because they have no direct LLVM IR equivalent. Two handlers exist:
sub_94F9E0 -- membar (CTA/Device/System)
Generates membar.{scope}; where scope is determined by the scope parameter:
| Scope Value | PTX Output |
|---|---|
| 0, 1 | membar.cta; |
| 2, 3 | membar.gl; |
| 4 | membar.sys; |
The constraint string is ~{memory} to ensure the compiler treats the fence as a full memory clobber. The emitted node receives two memory attributes: inaccessiblemem (attribute 41) and a readonly fence marker (attribute 6).
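The scope-to-mnemonic mapping above is simple enough to restate as a behavioral model. The sketch below is illustrative only: `membar_ptx` is a hypothetical name, and the real handler emits a full inline-asm node with the `~{memory}` clobber rather than a bare string.

```python
# Behavioral model of the membar scope dispatch described above
# (function name hypothetical; reconstructed from the scope table).
def membar_ptx(scope: int) -> str:
    """Return the PTX mnemonic the membar handler selects for a scope value."""
    if scope in (0, 1):
        return "membar.cta;"
    if scope in (2, 3):
        return "membar.gl;"
    if scope == 4:
        return "membar.sys;"
    raise ValueError(f"unknown membar scope {scope}")

print(membar_ptx(0))  # membar.cta;
print(membar_ptx(4))  # membar.sys;
```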
sub_94FDF0 -- fence (with explicit ordering)
Generates fence.{ordering}.{scope}; for SM 70+ targets:
| Ordering Value | PTX Qualifier |
|---|---|
| 3 | sc (sequentially consistent) |
| 4 | acq_rel |
| 5 | sc (same as 3) |
Both fence handlers use sub_B41A60 to create the inline assembly call and sub_921880 to emit it into the instruction stream.
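The ordering table plus the SM 70+ gate reduce to a small mapping, sketched below under the stated assumptions (the helper name and scope-string parameter are hypothetical; the decompiled handler works on IR nodes, not strings).

```python
# Sketch of sub_94FDF0's ordering -> PTX qualifier mapping with the SM 70+ gate
# (illustrative model, not the decompiled code).
_ORDERING = {3: "sc", 4: "acq_rel", 5: "sc"}  # ordering 5 behaves the same as 3

def fence_ptx(ordering: int, scope: str, sm: int) -> str:
    if sm < 70:
        raise ValueError("explicit fence ordering requires sm_70+")
    return f"fence.{_ORDERING[ordering]}.{scope};"

print(fence_ptx(3, "cta", 75))  # fence.sc.cta;
print(fence_ptx(4, "sys", 90))  # fence.acq_rel.sys;
```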
Async Memory Copy Barriers (IDs 367--369)
The cp.async instructions for asynchronous global-to-shared memory copies include implicit barrier semantics:
| ID | Builtin | Size | Description |
|---|---|---|---|
| 367 | __nv_memcpy_async_shared_global_4_impl | 4 bytes | Async copy with barrier |
| 368 | __nv_memcpy_async_shared_global_8_impl | 8 bytes | Async copy with barrier |
| 369 | __nv_memcpy_async_shared_global_16_impl | 16 bytes | Async copy with barrier |
These are lowered through sub_12AB730 / sub_94C5F0, which builds the cp.async PTX instruction with the specified transfer size.
Architecture Gates
| SM Threshold | Barrier Feature |
|---|---|
| All SM | __syncthreads, membar.{cta,gl,sys}, barrier reductions |
| SM 70+ | Explicit fence ordering (fence.{ordering}.{scope}) |
| SM 70+ | cp.async asynchronous memory copy with barrier |
| SM 90+ (Hopper) | Cluster barriers, cluster fence, cluster shared memory queries |
Lowering Strategy Summary
Barrier builtins use three distinct lowering strategies:
- LLVM intrinsic call -- __syncthreads, barrier reductions, cluster barriers. These map to well-known LLVM/NVVM intrinsic IDs (8259, 8925, 9296, etc.) and emit via sub_1285290.
- Inline IR generation -- Memory barriers (__nvvm_membar_*). The handler directly constructs barrier store IR nodes without going through an intrinsic lookup.
- Inline PTX assembly -- Memory fences (membar.*, fence.*). These have no LLVM IR equivalent and are emitted as inline asm strings with ~{memory} clobber constraints.
Warp-Level Operation Builtins
Warp-level builtins provide lane-to-lane communication within a 32-thread warp. They cover four major categories: shuffle (data exchange between lanes), vote (predicate aggregation), match (value matching across lanes), and redux (warp-wide reductions). The shuffle operations also serve as the lowering target for the WMMA fragment load/store operations described in the tensor core page.
Shuffle Operations (IDs 413--416)
The __shfl_sync family enables direct register-to-register communication between warp lanes. Four shuffle modes exist, each registered as a _sync variant:
| ID | Builtin | Mode | Description |
|---|---|---|---|
| 413 | __nvvm_shfl_up_sync | Up | Lane reads from lane - delta |
| 414 | __nvvm_shfl_down_sync | Down | Lane reads from lane + delta |
| 415 | __nvvm_shfl_bfly_sync | Butterfly | Lane reads from lane XOR delta |
| 416 | __nvvm_shfl_idx_sync | Index | Lane reads from arbitrary srcLane |
Shuffle Dispatch via Table Lookup
All shuffle builtins route through sub_12B3540 (EDG) / sub_954F10 (NVVM), the table-based lowering handler. Three groups of 8 IDs each cover the complete shuffle interface:
| ID Range | Group | Description |
|---|---|---|
| 302--309 | Legacy __shfl | Non-sync variants (4 modes x 2 types: i32/f32) |
| 338--345 | __shfl_sync | Sync variants with mask (4 modes x 2 types) |
| 395--402 | __shfl_*_sync | Newer SM interface (4 modes x 2 types) |
Within each group of 8, the layout is:
| Offset | Mode | i32 Variant | f32 Variant |
|---|---|---|---|
| +0, +1 | shfl_up | offset +0 | offset +1 |
| +2, +3 | shfl_down | offset +2 | offset +3 |
| +4, +5 | shfl_xor | offset +4 | offset +5 |
| +6, +7 | shfl_idx | offset +6 | offset +7 |
The handler builds the argument list (mask, value, delta/lane, width), looks up the target intrinsic by shuffle mode and data type from its red-black tree map, and emits a function call.
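Because the group layout is purely positional, a builtin ID can be decoded into (group, mode, element type) arithmetically. A minimal sketch, with group bases and slot order taken from the tables above (the helper name is hypothetical):

```python
# Decode a shuffle builtin ID per the 8-slot group layout:
# slots +0/+1 = up, +2/+3 = down, +4/+5 = xor, +6/+7 = idx;
# even slot = i32 variant, odd slot = f32 variant.
GROUP_BASES = {302: "legacy __shfl", 338: "__shfl_sync", 395: "__shfl_*_sync"}
MODES = ("up", "down", "xor", "idx")

def decode_shuffle(builtin_id: int):
    for base, group in GROUP_BASES.items():
        off = builtin_id - base
        if 0 <= off < 8:
            return group, MODES[off // 2], "f32" if off % 2 else "i32"
    raise KeyError(f"not a shuffle builtin: {builtin_id}")

print(decode_shuffle(341))  # ('__shfl_sync', 'down', 'f32')
print(decode_shuffle(302))  # ('legacy __shfl', 'up', 'i32')
```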
Vote Operations (IDs 351--358)
Warp vote builtins aggregate a boolean predicate across all participating lanes. Both legacy (non-sync) and sync variants are registered.
| ID | Builtin | Operation | Sync |
|---|---|---|---|
| 351 | __nvvm_vote_all | All predicates true? | No |
| 352 | __nvvm_vote_any | Any predicate true? | No |
| 353 | __nvvm_vote_uni | All predicates equal? | No |
| 354 | __nvvm_vote_ballot | Bitmask of predicates | No |
| 355 | __nvvm_vote_all_sync | All predicates true? | Yes |
| 356 | __nvvm_vote_any_sync | Any predicate true? | Yes |
| 357 | __nvvm_vote_uni_sync | All predicates equal? | Yes |
| 358 | __nvvm_vote_ballot_sync | Bitmask of predicates | Yes |
Vote Lowering
The handler sub_12ABB90 (EDG) / sub_94D570 (NVVM) takes parameters:
(result, ctx, vote_op, args, is_ballot, is_sync)
The vote_op encoding: 0 = all, 1 = any, 2 = uni, 3 = ballot.
When is_sync=1, an extra mask argument is consumed from the call arguments. For non-sync variants, the handler looks up intrinsic 5301 (llvm.nvvm.vote). For sync variants, it generates an inline predicate pattern. The ballot variant (vote_op=3) sets is_ballot=1, which changes the return type from i1 (predicate) to i32 (bitmask).
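The parameter encoding can be restated as a decode from the builtin ID. The sketch below models only the ID-to-parameter mapping described above (function name hypothetical, not the handler itself):

```python
# Map vote builtin IDs 351-358 to (vote_op name, is_ballot, is_sync),
# per the encoding 0 = all, 1 = any, 2 = uni, 3 = ballot; IDs 355-358
# are the _sync variants that consume an extra mask argument.
VOTE_OPS = ("all", "any", "uni", "ballot")

def lower_vote(builtin_id: int):
    off = builtin_id - 351
    if not 0 <= off <= 7:
        raise KeyError(f"not a vote builtin: {builtin_id}")
    vote_op = off % 4
    return VOTE_OPS[vote_op], vote_op == 3, off >= 4

print(lower_vote(354))  # ('ballot', True, False)
print(lower_vote(358))  # ('ballot', True, True)
```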
Match Operations (IDs 361--364)
Match builtins find lanes with equal values and return a bitmask of matching lanes. Available in 32-bit and 64-bit variants with two matching modes.
| ID | Builtin | Width | Mode | Intrinsic |
|---|---|---|---|---|
| 361 | __match32_any_sync | 32-bit | Any match | 0x1011 |
| 362 | __match64_any_sync | 64-bit | Any match | 0x1011 |
| 363 | __match32_all_sync | 32-bit | All match | 0x100F |
| 364 | __match64_all_sync | 64-bit | All match | 0x100F |
The handler sub_12AD230 (EDG) dispatches on two opcodes: 0x1011 for any-match and 0x100F for all-match. The NVVM-side handler sub_94F430 uses intrinsic pairs 0x2017 / 0x2018 with mode variants 0, 1, 2 to encode the width and match type.
Warp Redux (via sub_12ADD20)
Warp-wide reduction operations perform arithmetic reductions across all active lanes in a single instruction. These are dispatched through sub_12ADD20 (EDG) / sub_94F250 (NVVM).
| Operation | NVVM Intrinsic | Reduction | Description |
|---|---|---|---|
| redux.sync.add | 0x24F5 (9461) | Sum reduction | Sum of values across warp |
| redux.sync.min | 0x24ED (9453) | Minimum reduction | Minimum value across warp |
| redux.sync.max | 0x24E9 (9449) | Maximum reduction | Maximum value across warp |
| redux.sync.or | 0x24F1 (9457) | Bitwise OR reduction | OR of values across warp |
The EDG side uses intrinsic codes 0x2332 and 0x2330 for the two redux variant families.
Activemask and Lanemask
The active mask and per-lane mask builtins are handled through sub_12ADB00 (EDG) / sub_94CF30 (NVVM):
These builtins return the set of currently active lanes (__activemask()) or per-lane positional masks (__lanemask_lt(), __lanemask_le(), __lanemask_eq(), __lanemask_ge(), __lanemask_gt()). They compile to PTX special register reads (%lanemask_*).
Predicate-Register Conversion (IDs 411--412)
Two builtins convert between predicate registers and general-purpose registers:
| ID | Builtin | Direction | Description |
|---|---|---|---|
| 411 | __nv_p2r | Predicate -> Register | Pack predicates into a 32-bit register |
| 412 | __nv_r2p | Register -> Predicate | Unpack a 32-bit register into predicates |
The handler generates element-wise operations: sub_9483E0 iterates over vector elements using sub_39FAC40 to compute the element count, then builds per-element extractelement + store (for p2r) or load + insertelement (for r2p) chains.
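The net effect of those chains can be modeled in scalar Python. This is a semantic sketch of what the packed result looks like, not the element-wise IR the handler actually emits:

```python
# __nv_p2r semantics: pack up to 32 predicates into one 32-bit register,
# bit i holding predicate i.
def p2r(predicates):
    reg = 0
    for i, p in enumerate(predicates[:32]):
        if p:
            reg |= 1 << i
    return reg

# __nv_r2p semantics: unpack a 32-bit register into per-bit predicates.
def r2p(reg, n=32):
    return [bool((reg >> i) & 1) for i in range(n)]

packed = p2r([True, False, True, True])
print(packed)          # 13
print(r2p(packed, 4))  # [True, False, True, True]
```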
Nanosleep and CP.Async
Warp-adjacent utility builtins handled through sub_12AD230 / sub_94ED50:
| ID Range | Operation | Description |
|---|---|---|
| 367--369 | __nv_memcpy_async_shared_global_{4,8,16}_impl | Asynchronous copy (cp.async) |
These builtins combine data movement with implicit synchronization and are lowered through sub_12AB730 / sub_94C5F0, which builds the cp.async PTX instruction with the specified transfer size (4, 8, or 16 bytes).
Architecture Requirements
| Feature | Minimum SM | Notes |
|---|---|---|
| __shfl (legacy, non-sync) | SM 30+ | Deprecated; requires full warp convergence |
| __shfl_sync | SM 70+ (Volta) | Explicit mask; independent thread scheduling |
| Vote (non-sync) | SM 30+ | Deprecated |
| Vote (_sync) | SM 70+ | Explicit mask required |
| Match (_sync) | SM 70+ | Warp-level value matching |
| Redux (redux.sync.*) | SM 80+ (Ampere) | Hardware-accelerated warp reduction |
| Elect sync | SM 90+ (Hopper) | Single-lane election from active mask |
| cp.async | SM 80+ | Asynchronous shared memory copy |
GPU Target Architecture
45 SM variants across 6 generations. Processor table at qword_502A920 (stride-2 layout: name + PTX version). Architecture gating throughout the binary controls feature availability.
| SM table | qword_502A920 (45 entries, ctor_605 at 0x584510) |
| Arch detection | sub_95EB40 (38KB, CLI -> 3-column mapping) |
| NVVM arch enum | sub_CD09E0 (14.5KB, NVVM_ARCH_* strings) |
| EDG arch gates | sub_60E7C0 (~60 feature flags based on SM version) |
| Backend subtarget | NVPTXSubtarget (feature offsets at +2498, +2584, +2843) |
| Target triples | nvptx64-nvidia-cuda, nvsass-nvidia-directx, nvsass-nvidia-spirv |
Per-SM Deep Dives:
- SM 70-89 (Volta through Ada Lovelace) -- Feature configuration call order, complete sub_60E7C0 flag table, atomic lowering, cumulative flag profiles
- SM 90 -- Hopper -- Thread block clusters, TMA descriptor format and lowering, WGMMA, setmaxnreg, distributed shared memory
- SM 100 -- Blackwell Datacenter -- tcgen05 tensor core ISA, arch-conditional vs. family-conditional gating, cvt_packfloat FP4/FP6/MX formats
- SM 120 -- Blackwell Consumer -- No tcgen05, .offset.bindless texture intrinsics, f16 texture support, mma.sync.block_scale (future)
Complete SM Table
| SM | __CUDA_ARCH | PTX Ver | Generation | Suffix | Status | Deep Dive |
|---|---|---|---|---|---|---|
| sm_75 | 750 | 5 | Turing | -- | Production | sm70-89 |
| sm_80 | 800 | 5 | Ampere | -- | Production | sm70-89 |
| sm_82 | 820 | 5 | Ampere | -- | Undocumented | sm70-89 |
| sm_86 | 860 | 5 | Ampere | -- | Production | sm70-89 |
| sm_87 | 870 | 5 | Ampere | -- | Production | sm70-89 |
| sm_88 | 880 | 5 | Ada | -- | Undocumented | sm70-89 |
| sm_89 | 890 | 5 | Ada | -- | Production | sm70-89 |
| sm_90 | 900 | 5 | Hopper | -- | Production | sm90 |
| sm_90a | 900 | 6 | Hopper | a | Production | sm90 |
| sm_100 | 1000 | 6 | Blackwell | -- | Production | sm100 |
| sm_100a | 1000 | 7 | Blackwell | a | Production | sm100 |
| sm_100f | 1000 | 7 | Blackwell | f | Production | sm100 |
| sm_101 | 1010 | 6 | Jetson Thor (pre-rename) | -- | Undocumented | sm100 |
| sm_101a | 1010 | 7 | Jetson Thor (pre-rename) | a | Undocumented | sm100 |
| sm_101f | 1010 | 7 | Jetson Thor (pre-rename) | f | Undocumented | sm100 |
| sm_102 | 1020 | 6 | Blackwell | -- | Undocumented | sm100 |
| sm_102a | 1020 | 7 | Blackwell | a | Undocumented | sm100 |
| sm_102f | 1020 | 7 | Blackwell | f | Undocumented | sm100 |
| sm_103 | 1030 | 6 | Blackwell | -- | Production | sm100 |
| sm_103a | 1030 | 7 | Blackwell | a | Production | sm100 |
| sm_103f | 1030 | 7 | Blackwell | f | Production | sm100 |
| sm_110 | 1100 | 6 | Jetson Thor | -- | Production | sm120 |
| sm_110a | 1100 | 7 | Jetson Thor | a | Production | sm120 |
| sm_110f | 1100 | 7 | Jetson Thor | f | Production | sm120 |
| sm_120 | 1200 | 6 | Blackwell (sm120) | -- | Production | sm120 |
| sm_120a | 1200 | 7 | Blackwell (sm120) | a | Production | sm120 |
| sm_120f | 1200 | 7 | Blackwell (sm120) | f | Production | sm120 |
| sm_121 | 1210 | 6 | Blackwell (sm120) | -- | Production | sm120 |
| sm_121a | 1210 | 7 | Blackwell (sm120) | a | Production | sm120 |
| sm_121f | 1210 | 7 | Blackwell (sm120) | f | Production | sm120 |
Legacy architectures also present in the table but not in the CLI mapping: sm_20, sm_21, sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_73.
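The __CUDA_ARCH column follows a simple rule visible in the table: the SM number times ten, with a/f suffixed variants sharing the base value. A sketch (helper name hypothetical, derived from the table rather than any decompiled function):

```python
# Compute the __CUDA_ARCH__ value implied by the SM table above:
# strip the sm_ prefix and any a/f suffix, then multiply by ten.
def cuda_arch(sm: str) -> int:
    digits = sm.removeprefix("sm_").rstrip("af")
    return int(digits) * 10

print(cuda_arch("sm_75"))    # 750
print(cuda_arch("sm_100a"))  # 1000
print(cuda_arch("sm_121f"))  # 1210
```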
Suffix Meanings
| Suffix | Meaning | PTX Version | Detail |
|---|---|---|---|
| (none) | Base feature set | 5 (legacy) or 6 (sm_100+) | All architectures; sm70-89 has no suffix-gated logic |
| a | Accelerated / advanced features | 6 (sm_90a) or 7 (sm_100a+) | sm_90a enables one EDG gate (sm90); sm_100a+ enables tcgen05 arch-conditional path (sm100) |
| f | Forward-compatible feature set | 7 | Implies a; never read by cicc logic (sm120); reserved for ptxas |
PTX Version Mapping
| PTX Version | SM Range | Notes |
|---|---|---|
| 5 | sm_20 through sm_90 (legacy/base) | All pre-Blackwell base variants |
| 6 | sm_90a, sm_100/101/102/103/110/120/121 (base) | sm_90a is the sole pre-Blackwell PTX 6 target (sm90) |
| 7 | sm_100a/f through sm_121a/f (extended features) | Required for tcgen05 arch-conditional intrinsics (sm100) |
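The three tables above condense into one selection rule. The following is a reconstruction from those tables (function name hypothetical; not a decompiled routine):

```python
# PTX version selection per the mapping above: base Blackwell targets (sm_100+)
# get 6 and their a/f sub-variants get 7; sm_90a is the lone pre-Blackwell
# PTX 6 target; everything else stays on PTX 5.
def ptx_version(sm: int, suffix: str = "") -> int:
    if sm >= 100:
        return 7 if suffix in ("a", "f") else 6
    if sm == 90 and suffix == "a":
        return 6
    return 5

print(ptx_version(75))        # 5
print(ptx_version(90, "a"))   # 6
print(ptx_version(100, "f"))  # 7
```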
Architecture Gating
Four subsystems cooperate to configure feature flags from the SM version. The master configurator sub_60E7C0 runs last and has the highest non-CLI priority. For the complete flag table per tier, see SM 70-89 Complete sub_60E7C0 Flag Table.
Feature Configuration Pipeline
1. CLI parser (sub_617BD0) -- sets byte_4CF8* override flags
2. sub_60DFC0 (secondary) -- sets unk_4D041B8 at sm_80+ (__VA_OPT__)
3. sub_60D650 (optimization level) -- ~109 flags from -O level
4. sub_60E7C0 (master SM configurator) -- ~60 flags via SM threshold comparisons; calls sub_60E530 (tertiary) for supplementary progressive unlocks
5. sub_982C80 (NVPTX subtarget) -- 224-byte bitfield for LLVM backend
Override priority: CLI flag > SM version > Optimization level > C++ standard version > CUDA mode > Virtual arch flag. See CLI Flag Inventory for the complete CLI flag-to-pipeline routing and Optimization Levels for per-level flag differences.
EDG-Level Gates -- sub_60E7C0
Sets ~60 unk_4D04* feature flags based on SM version thresholds. Each flag is gated by a byte_4CF8* user-override check.
| Threshold | SM Boundary | Features Enabled | Detail |
|---|---|---|---|
| > 30399 | sm_75 (Turing) | Base CUDA features, dynamic parallelism | sm70-89 Turing |
| > 40000 | sm_80 (Ampere) | C++20 __VA_OPT__, L2 cache hints, extended atomics | sm70-89 Ampere |
| > 89999 | sm_90 (Hopper) | Cluster ops, TMA, setmaxnreg, WGMMA fence | sm90 Feature Flags |
| > 109999 | sm_100 (Blackwell) | tcgen05, match instruction, dword_4D041AC | sm100 Feature Flags |
| > 119999 | sm_120 | unk_4D047BC disabled, unk_4D0428C | sm120 Feature Flags |
Backend Subtarget Feature Offsets (NVPTXSubtarget)
| Offset | Purpose | Stride | Detail |
|---|---|---|---|
| +2498 | Type legality flags (per MVT) | 259 bytes | See Type Legalization |
| +2584 | Float legality flags (per MVT) | 259 bytes | See Type Legalization |
| +2843 | Integer type support flag | 1 byte | -- |
| +2870 | Branch distance flag | 1 byte | See Block Placement |
| +2871 | Jump table eligibility flag | 1 byte | See BranchFolding |
For the complete NVPTXSubtarget analysis, see NVPTX Target Infrastructure.
Intrinsic Verifier Architecture Gates -- sub_2C7B6A0
The NVVMIntrinsicVerifier (143KB) gates intrinsics by SM version. For the complete three-layer verification architecture, see NVVM IR Verifier.
| SM Gate | Intrinsics | Detail |
|---|---|---|
| sm_72 (Volta) | Convergent branch intrinsics, some atomic ops | sm70-89 Volta |
| sm_75 (Turing) | Conversion type intrinsics | sm70-89 Turing |
| sm_89 (Ada) | Specific intrinsics | sm70-89 Ada |
| sm_90 (Hopper) | Cluster dimensions, TMA, WGMMA | sm90 TMA, sm90 WGMMA |
| sm_100+ (Blackwell) | .offset.bindless intrinsics, tcgen05 | sm100 tcgen05, sm120 .offset.bindless |
Feature Gate Matrix
This matrix shows which major compiler features are available at each SM tier. Each cell links to the detailed discussion in the per-SM deep-dive page.
Tensor Core / MMA Instructions
| Feature | sm_70-75 | sm_80-89 | sm_90/90a | sm_100/103 | sm_110 | sm_120/121 |
|---|---|---|---|---|---|---|
| HMMA m16n16k16 (f16) | Yes | Yes | Yes | Yes | Yes | Yes |
| IMMA int8/int4, BMMA | sm_75+ | Yes | Yes | Yes | Yes | Yes |
| DMMA fp64, TF32, BF16 | -- | sm_80+ | Yes | Yes | Yes | Yes |
| WGMMA async (f16/bf16/tf32/f8) | -- | -- | Yes | Yes | Yes | -- |
| tcgen05.mma (MX formats) | -- | -- | -- | a/f only | a/f only | No |
| mma.sync.block_scale | -- | -- | -- | -- | -- | Future |
See Tensor / MMA Builtins for the per-builtin ID reference and Tensor / MMA Codegen for the code generation pipeline.
Memory and Synchronization
| Feature | sm_70-75 | sm_80-89 | sm_90/90a | sm_100/103 | sm_110 | sm_120/121 |
|---|---|---|---|---|---|---|
| Full atomic memory ordering | sm_70+ | Yes | Yes | Yes | Yes | Yes |
| 128-bit atomics | sm_70+ | Yes | Yes | Yes | Yes | Yes |
| L2 cache hint atomics | -- | sm_80+ | Yes | Yes | Yes | Yes |
| Cluster scope atomics | -- | -- | Yes | Yes | Yes | Yes |
| cp.async | -- | sm_80+ | Yes | Yes | Yes | Yes |
| TMA (tensor memory access) | -- | -- | Yes | Yes | Yes | Yes |
| TMA 2CTA mode, Im2Col_W | -- | -- | -- | sm_100+ | sm_100+ | sm_100+ |
| setmaxnreg | -- | -- | Yes | Yes | Yes | Yes |
| fence.sc.cluster | -- | -- | Yes | Yes | Yes | Yes |
See Atomics Builtins for atomic PTX generation detail and Barriers & Sync for barrier builtins.
Thread Block Clusters
| Feature | sm_70-89 | sm_90/90a | sm_100+ |
|---|---|---|---|
| __cluster_dims__ attribute | Diagnostic 3687 | Yes | Yes |
| __launch_bounds__ 3rd param | Diagnostic 3704 | Yes | Yes |
| __block_size__ 5th arg | Diagnostic 3790 | Yes | Yes |
| Cluster special registers (15) | -- | Yes | Yes |
| barrier.cluster.arrive/wait | -- | Yes | Yes |
| Cluster query builtins (9) | -- | Yes | Yes |
| Distributed shared memory | -- | Yes | Yes |
| .blocksareclusters directive | -- | Yes | Yes |
Numeric Formats
| Format | First Available | Gate Location | Detail |
|---|---|---|---|
| f16, f32, f64 | All | -- | Standard types |
| bf16 (bfloat16) | sm_80+ | Ampere tensor core | Tensor core and cvt |
| tf32 (TensorFloat-32) | sm_80+ | Ampere tensor core | Tensor core only |
| fp8 e4m3, e5m2 | sm_90+ | WGMMA | cvt_packfloat cases 2-3 |
| fp6 e2m3, e3m2 | sm_100+ | cvt_packfloat | Arch-conditional only |
| fp4 e2m1 | sm_100+ | cvt_packfloat | Arch-conditional only |
| ue8m0 (scale factor) | sm_100+ | cvt_packfloat | Both arch and family-conditional |
| MX formats (mxf4, mxf8f6f4, mxf4nvf4) | sm_100+ | tcgen05.mma | tcgen05 a/f sub-variants only |
Texture and Surface
| Feature | sm_70-89 | sm_90 | sm_100/103 | sm_120/121 |
|---|---|---|---|---|
| Standard texture intrinsics | Yes | Yes | Yes | Yes |
| .offset.bindless intrinsics (68 variants) | -- | -- | -- | sm_120+ |
| f16 texture element types | Limited (builtin 3811 only) | Limited | Limited | Full support |
See Surface & Texture Builtins for the tex_surf_handler dispatch algorithm.
EDG Frontend Feature Flags
| Feature | Threshold | Flag | Detail |
|---|---|---|---|
| C++17 feature gates (EDG) | sm_70+ | unk_4D041DC, unk_4D04858, unk_4D041EC | sm70-89 Flag Table |
| C++20 __VA_OPT__ | sm_80+ | unk_4D041B8 | sm70-89 sub_60DFC0 |
| C++23 extended float suffixes | sm_70+ | unk_4D0428C | sm70-89 Tertiary Cascade |
| C++20 feature gates | sm_90+ | unk_4D043D0, unk_4D041B0, unk_4D04814 | sm90 Feature Flags |
| Blackwell extended features | sm_100+ | unk_4D04184, dword_4D041AC | sm100 Feature Flags |
See EDG 6.6 Frontend for the 737-define configuration system.
tcgen05 Sub-Variant Access Table
The tcgen05 instruction family uses a two-tier gating system unique to Blackwell. Base variants (sm_100, sm_103, sm_110) are excluded; only a and f sub-variants pass the bitmask check.
| SmVersion | Target | tcgen05 | Detail |
|---|---|---|---|
| 1001 | sm_100a | Allowed | sm100 Arch-Conditional Gate |
| 1002 | sm_100f | Allowed | sm100 Arch-Conditional Gate |
| 1031 | sm_103a | Allowed | sm100 Arch-Conditional Gate |
| 1032 | sm_103f | Allowed | sm100 Arch-Conditional Gate |
| 1101 | sm_110a | Allowed | sm120: Jetson Thor |
| 1102 | sm_110f | Allowed | sm120: Jetson Thor |
| 1000, 1030, 1100 | base variants | Blocked | Bitmask 0xC0000C03 rejects; see sm100 |
| 1200-1212 | all sm_120/121 | Blocked | v-1101 > 1; see sm120 No tcgen05 |
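The access table reduces to a membership test on the internal SmVersion value, which encodes the suffix in the final digit (0 = base, 1 = a, 2 = f). The sketch below is reconstructed from the table rows above, not from the actual bitmask logic:

```python
# tcgen05 availability per the table: only the a/f sub-variants of sm_100,
# sm_103, and sm_110 pass; base variants and all sm_120/121 are blocked.
TCGEN05_OK = {1001, 1002, 1031, 1032, 1101, 1102}

def tcgen05_allowed(sm_version: int) -> bool:
    return sm_version in TCGEN05_OK

print(tcgen05_allowed(1001))  # True  (sm_100a)
print(tcgen05_allowed(1000))  # False (sm_100 base variant)
print(tcgen05_allowed(1201))  # False (all sm_120/121)
```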
Generation-Specific Features
Turing (sm_75)
sm_75 is the default architecture for cicc v13.0, hardcoded as "compute_75" in sub_900130 and sub_125FB30.
- Base tensor core (HMMA m16n16k16) -- see Tensor / MMA Builtins
- Conversion intrinsics
- Baseline for cicc v13.0 (default architecture) -- see CLI Flag Inventory
Full detail: SM 70-89 (Volta through Ada)
Ampere (sm_80-sm_89)
- L2 cache hint atomic operations (sub_21E6420) -- see Atomics Builtins
- Extended tensor core shapes (tf32, bf16) -- see Tensor / MMA Builtins
- Async copy (cp.async) -- see SM 70-89: Ampere
- C++20 __VA_OPT__ support -- the sole differentiator between sm_75 and sm_80+ in sub_60E7C0 / sub_60DFC0
Full detail: SM 70-89 (Volta through Ada)
Hopper (sm_90/90a)
- Cluster operations: barrier.cluster.arrive/wait, fence.sc.cluster -- see Cluster Barriers
- Cluster registers: %cluster_ctarank, %clusterid.x/y/z, %is_explicit_cluster -- see Cluster Special Registers
- Kernel attributes: .blocksareclusters, .maxclusterrank, .reqnctapercluster, .cluster_dim -- see PTX Directives
- setmaxnreg: Dynamic register allocation limit (sub_21EA5F0) -- see setmaxnreg
- TMA: Tensor Memory Access with Im2Col, dimension validation, 2CTA mode -- see TMA
- WGMMA: Warpgroup MMA async (f16, bf16, tf32, f8) -- see WGMMA
- Distributed shared memory: .shared::cluster qualifier for cross-CTA access -- see DSMEM
- Mbarrier extensions: DMA fence/arrive/wait for TMA coordination -- see Mbarrier
Full detail: SM 90 -- Hopper
Blackwell Datacenter (sm_100-sm_103)
- tcgen05: Next-gen tensor core instruction set (scaleD, transA, negA, negB at sub_21E8CD0) -- see tcgen05
- Arch-conditional vs. family-conditional gating: Two-tier feature system for tcgen05 sub-instructions -- see Gating
- match instruction: Architecture-gated ("match instruction not supported on this architecture!") -- see sm100
- Extended MMA shapes: m16n8k256 with MX format support
- .offset.bindless intrinsics -- gated at sm_120+, NOT sm_100 (see sm120 .offset.bindless)
- cvt_packfloat extended types: FP4, FP6, MX formats -- see cvt_packfloat
Full detail: SM 100 -- Blackwell Datacenter
Jetson Thor (sm_110)
sm_110 is architecturally a datacenter Blackwell derivative (originally sm_101 before rename). It retains full tcgen05/TMEM hardware on a/f sub-variants. The sm_110 section is documented on the sm_120 page because the two are often compared.
Full detail: SM 120 -- Jetson Thor section
Blackwell Consumer (sm_120, sm_121)
- No tcgen05: The entire tcgen05 ISA is rejected by cicc for all sm_120/121 variants -- see No tcgen05
- .offset.bindless texture intrinsics (68 variants) -- see .offset.bindless
- 16-bit texture element types -- see f16 Texture
- mma.sync.block_scale: Present in upstream LLVM 22 but NOT emitted by cicc v13.0 -- see block_scale
- Tensor core falls back to the HMMA/IMMA path inherited from sm_70-sm_90
Full detail: SM 120 -- Blackwell Consumer
NVVM Container Architecture Enum -- sub_CD09E0
The NVVM container format uses an architecture enumeration. See NVVM Container for the complete tag inventory.
| Enum String | Implied SM | Detail |
|---|---|---|
| NVVM_ARCH_BLACKWELL_10_0 | sm_100 | sm100 |
| NVVM_ARCH_BLACKWELL_10_1 | sm_101 | Undocumented |
| NVVM_ARCH_BLACKWELL_10_3 | sm_103 | sm100 |
| NVVM_ARCH_BLACKWELL_11_0 | sm_110 | sm120: Jetson Thor |
| NVVM_ARCH_BLACKWELL_12_0 | sm_120 | sm120 |
| NVVM_ARCH_BLACKWELL_12_1 | sm_121 | sm120 |
| NVVM_ARCH_HOPPER_9_0 | sm_90 | sm90 |
| NVVM_ARCH_ADA_8_9 | sm_89 | sm70-89 |
| NVVM_ARCH_AMPERE_8_0 through 8_8 | sm_80-sm_88 | sm70-89 |
| NVVM_ARCH_HW_SM_5_0 through 10_4 | sm_50-sm_104 | Hardware SM enum |
Notable: NVVM_ARCH_HW_SM_10_4 (sm_104) and NVVM_ARCH_BLACKWELL_11_0 are not publicly documented. NVIDIA's internal naming uses "BLACKWELL" for all sm_100-sm_121 variants, even though sm_110 is marketed as Jetson Thor and sm_120/121 are a distinct consumer microarchitecture (RTX 50xx). See SM 120: Architecture Identity for the "SM 10.4" internal designation.
Target Triples
| Triple | Purpose | Detail |
|---|---|---|
| nvptx64-nvidia-cuda | Standard 64-bit CUDA compilation | Default; see NVPTX Target Infrastructure |
| nvptx-nvidia-cuda | 32-bit CUDA compilation | Legacy |
| nvptx64-nvidia-nvcl | OpenCL target | -- |
| nvsass-nvidia-cuda | SASS backend (native assembly) | -- |
| nvsass-nvidia-directx | DirectX SASS backend | Discovered in sub_2C80C90; see NVVM IR Verifier |
| nvsass-nvidia-spirv | SPIR-V SASS backend | Discovered in sub_2C80C90 |
The nvsass-nvidia-directx and nvsass-nvidia-spirv triples (discovered in sub_2C80C90) reveal that NVIDIA's SASS-level backend supports DirectX and SPIR-V targets alongside traditional CUDA and OpenCL.
Data Layout Strings
| Mode | Layout | Notes |
|---|---|---|
| 64-bit + shared | e-p:64:64:64-p3:32:32:32-i1:8:8-...-n16:32:64 | p3:32:32:32 = 32-bit shared mem pointers |
| 64-bit | e-p:64:64:64-i1:8:8-...-n16:32:64 | No shared memory specialization |
| 32-bit | e-p:32:32:32-i1:8:8-...-n16:32:64 | 32-bit mode |
Address space 3 (shared memory) uses 32-bit pointers even in 64-bit mode, controlled by nvptx-short-ptr and nvptx-32-bit-smem flags. See Address Spaces for the complete address space reference.
SM Version Encoding
Two parallel version tracking systems coexist in the binary:
- qword_4F077A8 -- Encodes SM_MAJOR * 10000 + SM_MINOR * 100. Used in approximately 309 decompiled files, primarily in the NVVM frontend and optimizer. Boundary thresholds use the XX99 pattern (e.g., 69999 for pre-Volta, 89999 for pre-Hopper). See SM 70-89: SM Version Encoding for full detail.
- unk_4D045E8 -- Stores the raw SM number as a decimal (e.g., 75 for sm_75, 89 for sm_89). Used in approximately 12 decompiled files, primarily in the builtin checker and atomic lowering logic. See SM 70-89: unk_4D045E8 Frontend Gates for the complete gate table.
Cross-References
- NVPTX Target Infrastructure -- NVPTXTargetMachine, NVPTXSubtarget, TTI hooks
- Tensor / MMA Builtins -- Per-builtin-ID reference for all MMA generations
- Tensor / MMA Codegen -- Code generation pipeline for tensor core operations
- Atomics Builtins -- Atomic PTX generation and scope validation
- Surface & Texture Builtins -- Texture intrinsic dispatch algorithm
- NVVM IR Verifier -- SM-gated intrinsic verification
- NVVM Container -- Architecture enum and tag inventory
- CLI Flag Inventory -- -arch=compute_XX parsing and flag routing
- Optimization Levels -- Per-level flag differences that interact with SM gates
- EDG 6.6 Frontend -- 737-define configuration, CUDA keyword handling
- Address Spaces -- Address space 3 shared memory and data layout strings
- GPU Execution Model -- CTA, warp, and cluster execution model context
Volta through Ada Lovelace (sm_70 – sm_89)
The sm_70 through sm_89 range spans four GPU generations — Volta, Turing, Ampere, and Ada Lovelace — and represents the most mature feature tier in cicc v13.0. Turing (sm_75) serves as the compiler's default architecture. Volta (sm_70/72) is no longer directly targetable: no compute_70 or compute_72 entry exists in the CLI parser, though the sm_70 feature boundary is still checked at 23 locations throughout the binary.
Supported Compute Capabilities
The architecture registration table at sub_95EB40 maps CLI strings to internal flags. Only the following are accepted for this generation range:
| Compute Capability | Internal Target | __CUDA_ARCH | PTX Version | Generation |
|---|---|---|---|---|
| compute_75 | sm_75 | 750 | 5 | Turing |
| compute_80 | sm_80 | 800 | 5 | Ampere |
| compute_86 | sm_86 | 860 | 5 | Ampere |
| compute_87 | sm_87 | 870 | 5 | Ampere (Jetson Orin) |
| compute_88 | sm_88 | 880 | 5 | Ada Lovelace |
| compute_89 | sm_89 | 890 | 5 | Ada Lovelace |
There is no compute_70, compute_72, compute_73, or compute_82. The sm_73 and sm_82 targets exist only as internal processor table entries, and sm_88 (though accepted via compute_88) has no publicly documented differentiation and no unique feature gates in the compiler.
SM Version Encoding
Two parallel version tracking systems coexist in the binary:
- qword_4F077A8 -- Encodes SM_MAJOR * 10000 + SM_MINOR * 100. Used in approximately 309 decompiled files, primarily in the NVVM frontend and optimizer. Boundary thresholds use the XX99 pattern (e.g., 69999 for pre-Volta, 79999 for pre-Ampere, 89999 for pre-Hopper).
- unk_4D045E8 -- Stores the raw SM number as a decimal (e.g., 75 for sm_75, 89 for sm_89). Used in approximately 12 decompiled files, primarily in the builtin checker and atomic lowering logic.
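The qword_4F077A8 encoding and its XX99 boundary idiom can be made concrete with a couple of lines (illustrative helpers only; the function names are hypothetical, and the 89999 constant is the pre-Hopper boundary quoted above):

```python
# SM version encoding used by qword_4F077A8: SM_MAJOR * 10000 + SM_MINOR * 100.
def encode_sm(major: int, minor: int) -> int:
    return major * 10000 + minor * 100

# Threshold checks compare against XX99 boundaries, e.g. 89999 for pre-Hopper.
def is_pre_hopper(encoded: int) -> bool:
    return encoded <= 89999

print(encode_sm(7, 5))                 # 70500 (sm_75)
print(is_pre_hopper(encode_sm(8, 9)))  # True  (sm_89)
print(is_pre_hopper(encode_sm(9, 0)))  # False (sm_90)
```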
Feature Configuration Call Order
The compiler configures feature flags through a strict four-function call sequence. Each subsequent function can override or augment the previous one's settings:
1. CLI parser -- Sets byte_4CF8* override flags from user-specified options. These prevent any subsequent auto-configuration from touching the guarded flag.
2. sub_60DFC0 -- Basic initialization. Sets unk_4D041B8 for sm_80+ (C++20 __VA_OPT__ support).
3. sub_60D650 (opt_level) -- Optimization-level-based flag configuration. Sets approximately 109 flags based on the -O level. Many of the same unk_4D04* flags set by SM gates are also set here under C++17/C++20 language-version conditions.
4. sub_60E7C0 -- Master SM architecture feature configurator. Reads qword_4F077A8 and sets approximately 60 backend flags through threshold comparisons. Also calls sub_60E530 (tertiary cascade) for supplementary flags.
5. sub_982C80 -- NVPTX subtarget feature table initialization (224-byte bitfield for the LLVM backend). This is a separate path from the EDG flags above.
Override priority: CLI flag > SM version > Optimization level > C++ standard version > CUDA mode > Virtual arch flag.
Feature Gates by Generation
Volta (sm_70+) — Threshold qword_4F077A8 > 69999
Volta introduced the first tensor core generation and independent thread scheduling. Although not directly targetable in this compiler version, the sm_70 boundary enables:
- HMMA tensor core intrinsics — Builtin IDs 678–707 registered in sub_90AEE0. Three shape variants (m16n16k16, m32n8k16, m8n32k16) with load, store, and MMA operations across f16/f32 accumulator combinations.
- Convergent branch intrinsic — llvm.nvvm.branch.if.all.convergent (builtin 3755/8282) requires sm_70+. Error: "not supported on pre-Volta Architectures" (checked in sub_1C36530 and sub_2C7B6A0).
- Proper atomic memory ordering — At sm_70+, atomics use acquire/release/relaxed semantics instead of falling back to volatile qualification. The gate is unk_4D045E8 > 69.
- 128-bit atomic operations — Enabled at sm_70+. Below this threshold, diagnostic 3758 is emitted: "16-byte atomics only supported on sm_70+".
- Optimizer feature flags — unk_4D041DC, unk_4D04858, and unk_4D041EC are set by sub_60E7C0. The tertiary cascade sub_60E530 additionally sets unk_4D0428C (extended float suffix support for C++23 std::float*_t / std::bfloat16_t). Multiple SelectionDAG patterns in sub_706250 activate for sm_70+ codegen.
- Variant-flag-gated features — When dword_4F077BC (the SM variant flag, the a/f suffix) is set and sm_70+ is active, unk_4D043C4 is enabled. When compiling for a virtual architecture with effective SM > 69999, unk_4D04740 is set for multi-arch optimization.
- WMMA memory space optimization — The wmma-memory-space-opt pass (registered at ctor_267, ctor_531) optimizes memory access patterns for tensor core operations.
Turing (sm_75) — Default Architecture
sm_75 is the baseline for cicc v13.0. The default is hardcoded in sub_900130 and sub_125FB30 via strcpy("compute_75"), and in sub_95EB40 as "-arch=compute_75".
No explicit sm_75-specific feature gates exist beyond the sm_70 tier. All Volta-era features are available. The key behavioral distinction is that sm_75 passes all pre-Volta gates cleanly — no diagnostic 3703 (sub_5C68F0), no volatile atomic fallback, no 128-bit atomic restrictions.
Ampere (sm_80+) — Threshold qword_4F077A8 > 79999
- C++20 __VA_OPT__ support — unk_4D041B8 is set at sub_60DFC0 lines 132–133. This is the only flag set exclusively by sub_60DFC0 at the sm_80 threshold. It enables __VA_OPT__ recognition in the EDG macro expander (sub_A03 line 1010), variadic trailing argument elision (line 1584), and diagnostic 2939 for misuse.
- Additional convergent branch — llvm.nvvm.branch.if.convergent (builtin 3754/8283) requires sm_80+. Error: "not supported on pre-Ampere Architectures". Note the distinction: branch.if.all.convergent requires only sm_70+, while branch.if.convergent requires sm_80+.
- L2 cache hint atomics — The L2::cache_hint suffix on atomic operations, emitted from sub_21E6DD0 when bit 0x400 is set in the instruction encoding flags. Supported operations: exch, add, and, or, xor, max, min, cas, and floating-point add. These are PTX 7.3+ features. Emission logic lives in sub_21E6420.
- cp.async.bulk patterns — String matching for the cp.async.bulk.tensor.g2s. and cp.async.bulk. prefixes in inline assembly validation at sub_A8E250.
Important correction: The master SM feature configurator sub_60E7C0 does NOT set any new flags at the sm_80 boundary (> 79999). The Ampere-specific unk_4D041B8 is set by the secondary configurator sub_60DFC0. The next threshold in sub_60E7C0 after sm_70+ (> 69999) is sm_90+ (> 89999). This means sm_80 through sm_89 share the same sub_60E7C0 flag profile as sm_75.
Ada Lovelace and Ampere Variants (sm_86 – sm_89)
All of sm_86, sm_87, sm_88, and sm_89 share identical feature gates within cicc. They occupy unk_4D045E8 values 86–89 and qword_4F077A8 values 80600–80900 (under the MAJOR * 10000 + MINOR * 100 encoding), all below the 89999 Hopper boundary.
The primary gate at this tier is unk_4D045E8 <= 89, which delineates pre-Hopper from Hopper+:
| Location | Feature | Behavior at sm_89 and below |
|---|---|---|
sub_5D1A60 | __block_size__ attribute | Diagnostic 3790; only 4 args parsed (5th cluster arg is sm_90+) |
sub_5D1FE0 | __cluster_dims__ attribute | Diagnostic 3687 emitted (cluster dimensions are Hopper-only) |
sub_5D2430 | __launch_bounds__ 3rd param | Diagnostic 3704 emitted (cluster launch bounds) |
sub_6BBC40 | Atomic scope "cluster" | Falls through to "gpu" scope; diagnostic 3763/3759 |
sub_6BBC40 | 16-byte extended atomics | Diagnostic 3764 for certain scope+type combinations |
sub_9502D0 / sub_12AE930 | Atomic scope emission | "gpu" used instead of "cluster" |
sub_214DA90 | Cluster PTX directives | Skipped entirely (arch_id <= 89) |
No code path differentiates sm_89 from sm_86/87/88. Hardware differences between these sub-architectures (e.g., Ada Lovelace RTX 4090 at sm_89 vs. Jetson Orin at sm_87) are resolved at the ptxas assembler level, not in cicc.
Atomic Lowering Detail
The atomic builtin lowering (sub_12AE930 / sub_9502D0) follows two paths split at the sm_70 boundary:
Pre-sm_70 path (unk_4D045E8 <= 69): Atomics are emitted with a volatile qualifier instead of memory ordering. Scope (cta/gpu/sys) is parsed but ordering is forced to volatile. 128-bit atomics emit diagnostic 3758.
sm_70+ path (unk_4D045E8 > 69): Full memory ordering support — relaxed, acquire, release, acq_rel. Scope resolution: cta (scope 0–1), gpu (scope 3), sys (scope 4). Cluster scope (scope 2) is only available at sm_90+; on sm_70–89, scope 2 falls through to "gpu".
Operations: ld, st, atom.add, atom.and, atom.or, atom.xor, atom.max, atom.min, atom.exch, atom.cas. Type suffixes via lookup table: b (bitwise), u (unsigned), s (signed), f (float).
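The two-path split can be summarized as a small decision function. This is an illustrative model of the logic described above, not decompiled code; the function names are hypothetical:

```python
# Model of the atomic lowering split in sub_12AE930 / sub_9502D0.

def resolve_scope(scope: int, sm: int) -> str:
    """Map an NVVM atomic scope operand to a PTX scope string."""
    if scope in (0, 1):
        return "cta"
    if scope == 2:                      # cluster scope: sm_90+ only
        return "cluster" if sm >= 90 else "gpu"
    if scope == 3:
        return "gpu"
    if scope == 4:
        return "sys"
    raise ValueError(f"unknown scope {scope}")

def ordering_for(sm: int, requested: str) -> str:
    """Pre-sm_70 atomics fall back to volatile qualification."""
    return requested if sm > 69 else "volatile"
```

For example, a scope-2 atomic compiled for sm_89 silently widens to gpu scope, while the same IR compiled for sm_90 produces a cluster-scoped operation.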
Hopper-Gated Intrinsics Rejected on sm_70–89
Multiple intrinsics emit "this intrinsic is only supported for Hopper+" when the SM version field is non-zero and <= 899:
| Builtin ID | Description |
|---|---|
| 0x10B3 (4275) | Hopper+ intrinsic requiring i1 or i32 return |
| 0xFD5 (4053) | Hopper+ intrinsic |
| 0xEB7 (3767) | Memory ordering/fence intrinsic with operation modes |
| 0xEB9–0xEBA (3769–3770) | Pointer-size-dependent intrinsics (>= 64-bit) |
Complete sub_60E7C0 Flag Table
The master feature configurator sub_60E7C0 (address 0x60E7C0, 12,466 bytes, 56 qword_4F077A8 comparisons) is the primary SM-architecture-to-feature-flag mapper. Every flag assignment follows a guarded pattern: if the corresponding byte_4CF8* override byte is nonzero (set by a CLI flag), the auto-configuration is skipped and the user's explicit value is preserved.
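The guarded pattern amounts to a check-then-set on each flag. A minimal sketch, assuming the guard-byte semantics described above (the dictionary-based state and helper name are illustrative):

```python
# Sketch of the byte_4CF8* guarded-assignment pattern in sub_60E7C0.

guards = {}   # byte_4CF8* override bytes: nonzero => user set it via CLI
flags  = {}   # unk_4D04* / dword_4D04* feature flags

def guarded_set(guard_name: str, flag_name: str, value: int) -> None:
    if guards.get(guard_name, 0):
        return                       # CLI override wins; keep user's value
    flags[flag_name] = value

# Auto-configuration path: guard clear, so the default is applied.
guarded_set("byte_4CF807B", "dword_4D048B8", 1)

# User-override path: the guard byte blocks the auto value.
guards["byte_4CF810C"] = 1
flags["dword_4D04824"] = 0           # user's explicit CLI value
guarded_set("byte_4CF810C", "dword_4D04824", 1)
```

This is why the override priority table places CLI flags above every other configuration source: a nonzero guard byte short-circuits all four configurator functions.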
Unconditional Assignments
These flags are set regardless of SM version with no user override check:
| Flag | Value | Notes |
|---|---|---|
unk_4D047C0 | 1 | Always enabled |
unk_4D047B4 | 1 | Always enabled |
unk_4F07584 | 0 | Always cleared |
unk_4D0423C | 1 | Always enabled |
unk_4D04208 | 0 | Always cleared |
unk_4D04218 | 1 | Always enabled |
unk_4D04214 | 1 | Always enabled |
unk_4F06970 | 0 | Always cleared |
unk_4F06964 | 0 | Always cleared |
unk_4F06904 | 0 | Always cleared |
SM-Dependent Unconditional Flags
These depend on SM version but have no user override check:
| Flag | Condition | Value | Notes |
|---|---|---|---|
unk_4D047BC | SM <= 119999 | 1 | Disabled only for sm_120+ |
unk_4D04758 | SM <= 30300 | 1 | sm_32 and below only |
unk_4D04764 | SM <= 30399 | 1 | sm_32 and below only |
unk_4D044B8 | SM <= 40299 | 1 | Pre-Maxwell only |
Guarded Flags (byte_4CF8* Override Bypass)
Each flag is set only when its guard byte is zero (user has not overridden via CLI):
| Guard | Flag | Default Value (guard=0) |
|---|---|---|
byte_4CF807B | dword_4D048B8 | = 1 |
byte_4CF810C | dword_4D04824 | = 1 |
byte_4CF80F0 | unk_4D04388 | = 1 |
byte_4CF8108 | unk_4D04338 | = (dword_4F077BC && !dword_4F077B4 && SM <= 30399) ? 1 : 0 |
byte_4CF8123 | unk_4D047C8 | = 1 (only if SM > 30399, sm_35+) |
byte_4CF8125 | dword_4D047B0 | = 1 (only if SM > 30399, sm_35+) |
byte_4CF8139 | unk_4D04314 | = (SM <= 40299), pre-Maxwell |
byte_4CF814D | unk_4D047C4 | = 0 |
byte_4CF8119 | unk_4D047D0 | = (SM <= 40000) |
byte_4CF8119 | unk_4D047CC | = (SM <= 40099) |
byte_4CF810F | unk_4D047EC | = 1 |
byte_4CF8107 | unk_4D04340 | = 0 |
byte_4CF8116 | unk_4D047E0 | = 1 |
byte_4CF815F | unk_4F0771C | = 1 |
byte_4CF813E | unk_4D044B0 | = (SM > 40299), Maxwell+ |
byte_4CF8149 | unk_4D04470 | Complex Maxwell+ gate |
byte_4CF8172 | dword_4D041AC | = 0 (when SM <= 109999) |
byte_4CF8159 | dword_4D048B0 | = (dword_4D048B8 && dword_4D048B4 && SM > 40799) |
byte_4CF811C | unk_4D04790 | = 0 (when virtual arch flag set) |
byte_4CF813C | dword_4D047AC | sm_35+ feature gate |
byte_4CF8156 | unk_4D04408 | CUDA C++ feature gate |
byte_4CF815D | unk_4D048A0 | = 1 (when SM > 40699) |
Total: 21 override bytes controlling approximately 25 feature flags.
Feature Escalation by SM Version
The cumulative flag-setting cascade. Each tier inherits all flags from lower tiers. Only the tiers relevant to sm_70–89 plus their immediate predecessors and successors are shown.
SM > 59999 (sm_60+, Pascal+):
| Flag | Identified Meaning |
|---|---|
unk_4D043CC | EDG C++17 feature gate (also set by C++17 language block) |
unk_4D04404 | EDG extended feature gate (also set by C++17 language block) |
unk_4D043D8 | EDG C++17 feature gate (also set by C++17 language block) |
unk_4D043D4 | EDG feature gate (also set via virtual arch > 30599) |
dword_4F07760 | PTX generation mode flag |
unk_4D04870 | EDG C++20 feature gate (also set by C++20 language block) |
SM > 69999 (sm_70+, Volta+):
| Flag | Identified Meaning |
|---|---|
unk_4D041DC | EDG C++17 feature gate (also set by C++17 language block) |
unk_4D04858 | EDG C++17 feature gate (also set by C++17 language block) |
unk_4D041EC | EDG C++17/Pascal virtual arch feature gate |
SM > 89999 (sm_90+, Hopper+) — NOT active for sm_70–89:
| Flag | Identified Meaning |
|---|---|
unk_4D043D0 | EDG C++20 feature gate (also set by C++20 language block) |
unk_4D041B0 | EDG C++20 feature gate (also set by C++20 language block) |
unk_4D04814 | EDG C++20 feature gate (also set by C++20 language block) |
unk_4D0486C | (with additional C++ version check) |
sub_60E530 Tertiary Cascade
This supplementary function provides additional progressive unlocks. For the sm_70–89 range:
| Threshold | Hex | Flags Set |
|---|---|---|
| > 40599 | 0x9E97 | unk_4F07764 |
| > 40699 | 0x9EFB | unk_4D043F0, unk_4D043F4 |
| > 40899 | 0x9FC3 | unk_4D04220, unk_4D044D0 |
| > 59999 | 0xEA5F | unk_4D043CC (duplicates sub_60E7C0) |
| > 69999 | 0x1116F | unk_4D0428C (extended float suffixes: C++23 std::float*_t / std::bfloat16_t) |
| > 89999 | 0x15F8F | dword_4F07760 (duplicates sub_60E7C0) |
| > 99999 | 0x1869F | dword_4D043F8, dword_4D041E8 |
Note: unk_4D0428C is set at > 69999 (sm_70+) by the cascade but at > 119999 (sm_120+) by sub_60E7C0. The cascade runs as part of sub_60E7C0, so the sm_70+ activation wins for all practical SM versions. This flag gates C++23 extended float suffixes (std::float16_t, std::float32_t, std::float64_t, std::bfloat16_t) in the EDG numeric parser at sub_A02 line 1612.
sub_60DFC0 SM-Gated Flags
The secondary configurator adds one flag at the sm_80 boundary:
| Threshold | Flag | Identified Meaning |
|---|---|---|
| > 79999 (sm_80+) | unk_4D041B8 | C++20 __VA_OPT__ support in EDG macro expander. Enables __VA_OPT__ recognition, variadic trailing argument elision, and diagnostic 2939. |
Virtual Architecture Downgrade Path
When compiling for a virtual architecture (dword_4F077B4 = 1), sub_60E7C0 uses unk_4F077A0 (the effective/real SM) for a secondary tier of feature decisions:
| Effective SM > | Flags Set |
|---|---|
| 29999 | unk_4D043E4 |
| 30099 | unk_4D044D0 |
| 30199 | unk_4D043F0 |
| 30299 | unk_4D04220 |
| 30599 | unk_4D043D4 |
| 59999 | unk_4D041EC, unk_4D043D8, unk_4D04404 |
| 69999 | unk_4D04740 |
| 79999 | unk_4D043D0 |
| 89999 | unk_4D043D0 (redundant — already set at > 79999) |
| 129999 | unk_4D04184 |
Note: In the virtual arch path, unk_4D043D0 is set at > 79999 (sm_80+), while in the primary path it requires > 89999 (sm_90+). Virtual arch compilation is more conservative, enabling features the real target supports even if the virtual arch normally gates them.
unk_4D045E8 Frontend Gates
These gates use the raw SM number and control frontend semantic checks rather than backend flags:
| Gate | Locations | Effect |
|---|---|---|
| <= 69 | sub_12AE930 ln 241, sub_9502D0 ln 294 | Atomic volatile fallback |
| <= 69 | sub_6BBC40 ln 763 | 128-bit atomic error 3758 |
| <= 69 | sub_5C68F0 | Diagnostic 3703 |
| <= 51 | sub_691790 ln 126 | Surface builtin warning |
| <= 59 | sub_6BBC40 ln 639 | Atomic scope restriction |
| 60–69 | sub_6BBC40 ln 814 | Diagnostic 3762 |
| <= 79 | sub_5C6950 ln 15 | Diagnostic 3660 |
| <= 89 | sub_5D1A60 ln 35 | __block_size__ 5th arg blocked |
| <= 89 | sub_5D1FE0 ln 19 | __cluster_dims__ diagnostic 3687 |
| <= 89 | sub_5D2430 ln 33 | __launch_bounds__ 3rd param diagnostic 3704 |
| <= 89 | sub_6BBC40 ln 684 | Atomic scope diagnostic 3763/3759 |
| <= 89 | sub_6BBC40 ln 805, 827 | 16-byte atomic diagnostic 3764 |
| <= 89 | sub_9502D0 ln 424, sub_12AE930 ln 255 | Cluster scope falls through to "gpu" |
| <= 89 | sub_214DA90 ln 66 | Cluster PTX directives skipped |
Cumulative Flag Profile per SM Version
This table shows the net flag state for each SM version in the range, combining all three configurators (sub_60E7C0 + sub_60E530 + sub_60DFC0). Only flags that differ across the sm_70–89 range are shown.
| Flag | sm_75 | sm_80 | sm_86–89 | Set By | Identified Role |
|---|---|---|---|---|---|
unk_4D041DC | 1 | 1 | 1 | sub_60E7C0 > 69999 | EDG C++17 feature gate |
unk_4D04858 | 1 | 1 | 1 | sub_60E7C0 > 69999 | EDG C++17 feature gate |
unk_4D041EC | 1 | 1 | 1 | sub_60E7C0 > 69999 | EDG C++17 / virtual arch feature gate |
unk_4D0428C | 1 | 1 | 1 | sub_60E530 > 69999 | Extended float suffixes (C++23) |
unk_4D041B8 | 0 | 1 | 1 | sub_60DFC0 > 79999 | C++20 __VA_OPT__ support |
unk_4D043D0 | 0 | 0 | 0 | sub_60E7C0 > 89999 | (sm_90+ only) |
unk_4D041B0 | 0 | 0 | 0 | sub_60E7C0 > 89999 | (sm_90+ only) |
unk_4D04814 | 0 | 0 | 0 | sub_60E7C0 > 89999 | (sm_90+ only) |
unk_4D0486C | 0 | 0 | 0 | sub_60E7C0 > 89999 | (sm_90+ only) |
The sole differentiator between sm_75 and sm_80+ within sub_60E7C0/sub_60DFC0 is unk_4D041B8. All flags set at > 69999 are shared by all sm_70–89 targets. All flags set at > 89999 are absent from all sm_70–89 targets. There is no per-flag difference between sm_86, sm_87, sm_88, and sm_89.
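The cumulative profile can be expressed as a single function over the qword_4F077A8 encoding. This is a sketch reproducing only the table above (flag names from the table; not an exhaustive model of all three configurators):

```python
# Net flag profile for the sm_70-89 range, per the table above.

def flag_profile(sm_encoded: int) -> set:
    on = set()
    if sm_encoded > 69999:           # sub_60E7C0 + sub_60E530 Volta tier
        on |= {"unk_4D041DC", "unk_4D04858", "unk_4D041EC", "unk_4D0428C"}
    if sm_encoded > 79999:           # sub_60DFC0 Ampere tier
        on.add("unk_4D041B8")
    if sm_encoded > 89999:           # Hopper tier: unreachable for sm_70-89
        on |= {"unk_4D043D0", "unk_4D041B0", "unk_4D04814", "unk_4D0486C"}
    return on
```

Evaluating it at 70500 (sm_75), 80000 (sm_80), and 80600–80900 (sm_86–89) reproduces the three columns of the table, including the identical profiles for all the Ada-tier variants.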
Identified Flag Semantics
Where flag consumers have been positively identified in the decompiled binary:
| Flag | Set At | Consumer | Meaning |
|---|---|---|---|
unk_4D041B8 | sm_80+ (sub_60DFC0) | EDG macro expander (sub_A03 ln 1010) | C++20 __VA_OPT__ support: recognition, variadic trailing argument elision, diagnostic 2939 |
unk_4D0428C | sm_70+ (sub_60E530), sm_120+ (sub_60E7C0) | EDG numeric parser (sub_A02 ln 1612) | Extended float suffixes: C++23 std::float16_t, std::float32_t, std::float64_t, std::bfloat16_t |
dword_4F07760 | sm_60+ (sub_60E7C0, sub_60E530) | PTX generation path | PTX emission mode flag |
unk_4D047C8 | sm_35+ (sub_60E7C0) | Backend | Dynamic parallelism optimization |
dword_4D047B0 | sm_35+ (sub_60E7C0) | Backend | Dynamic parallelism support |
unk_4D04780 | always | EDG macro expander | GNU ##__VA_ARGS__ comma-deletion extension |
The remaining approximately 50 flags feed into the EDG frontend and NVVM IR generation pipeline. Based on the pattern that sub_60D650 (optimization level) and sub_60E7C0 (SM version) set the same flags with overlapping conditions, most are language feature gates (C++17/20/23 features that are also SM-gated) or optimization pass enables that depend on target capability.
Key Binary Locations
| Function | Address | Role |
|---|---|---|
sub_60E7C0 | 0x60E7C0 | Master SM feature flag initialization (12,466 bytes, 56 comparisons) |
sub_60DFC0 | 0x60DFC0 | Secondary feature flag initialization (unk_4D041B8 at sm_80+) |
sub_60E530 | 0x60E530 | Tertiary feature cascade (unk_4D0428C at sm_70+) |
sub_60D650 | 0x60D650 | Optimization-level flag configurator (~109 flags) |
sub_982C80 | 0x982C80 | NVPTX subtarget 224-byte feature bitfield |
sub_617BD0 | 0x617BD0 | CLI parser; sets unk_4D045E8 per compute_XX |
sub_12AE930 | 0x12AE930 | Atomic builtin lowering (volatile vs. ordering) |
sub_9502D0 | 0x9502D0 | Duplicate atomic lowering (standalone pipeline) |
sub_6BBC40 | 0x6BBC40 | Builtin semantic checker (atomics, scope validation) |
sub_90AEE0 | 0x90AEE0 | Builtin registration table (HMMA builtins 678–707) |
sub_95EB40 | 0x95EB40 | Architecture registration (compute_XX to sm_XX) |
sub_1C36530 | 0x1C36530 | NVVM verifier (convergent intrinsic SM gates) |
sub_2C7B6A0 | 0x2C7B6A0 | NVVM lowering (convergent intrinsic SM gates) |
sub_21E6DD0 | 0x21E6DD0 | PTX emission (volatile / L2::cache_hint / .unified) |
sub_21E6420 | 0x21E6420 | Atomic L2 cache hint PTX emission |
sub_214DA90 | 0x214DA90 | Kernel attribute PTX emitter (cluster directives gated at arch_id > 89) |
sub_5D1A60 | 0x5D1A60 | __block_size__ attribute (cluster dims at sm_90+) |
sub_5D1FE0 | 0x5D1FE0 | __cluster_dims__ attribute (sm_90+ feature) |
sub_5D2430 | 0x5D2430 | __launch_bounds__ 3rd param (sm_90+ cluster) |
sub_5C68F0 | 0x5C68F0 | Pre-sm_70 diagnostic 3703 |
sub_5C6950 | 0x5C6950 | Pre-sm_80 diagnostic 3660 |
Hopper (sm_90, sm_90a)
Hopper represents the largest single-generation feature expansion in cicc v13.0. The sm_90 gate at qword_4F077A8 > 89999 unlocks thread block clusters, distributed shared memory, Tensor Memory Access (TMA), Warpgroup Matrix Multiply-Accumulate (WGMMA), dynamic register count control, and a new fence instruction. The sm_90a "accelerated" sub-variant shares __CUDA_ARCH=900 with sm_90 but uses a higher PTX version and enables one additional feature gate in the EDG frontend.
Architecture Identity
The NVVM container format registers Hopper as NVVM_ARCH_HOPPER_9_0 with numeric value 900, assigned in sub_CD09E0 (line 255) and sub_1C1B150 (line 270) via the pattern v62(a1, "NVVM_ARCH_HOPPER_9_0", v64) => *a2 = 900.
| Variant | Subtarget Enum | __CUDA_ARCH | PTX Version | -opt-arch | -mcpu |
|---|---|---|---|---|---|
sm_90 | 38 | 900 | 5 | sm_90 | sm_90 |
sm_90a | 39 | 900 | 6 | sm_90a | sm_90a |
Both variants share __CUDA_ARCH=900. The distinction lies in the -opt-arch and -mcpu flags passed through the internal pipeline (sub_95EB40 lines 461–469, sub_12C8DD0 lines 435–457). The sm_90a variant is the only pre-Blackwell SM that uses PTX version 6; all sm_20 through sm_90 base variants use PTX version 5.
The a flag is stored in unk_4D045E4 and read in exactly one location: sub_6C4D80 line 167, where the check unk_4D045E8 != 90 || !unk_4D045E4 gates a specific sm_90a-only feature (error code 0xE90 = 3728).
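Since the decompiled condition reads as a rejection guard, the accept logic inverts it. A sketch of that single gate, assuming the two globals hold the raw SM number and the variant flag as described (the function name is hypothetical):

```python
# Model of the sm_90a-only gate at sub_6C4D80 line 167.
# Decompiled rejection: unk_4D045E8 != 90 || !unk_4D045E4  => error 3728

def sm90a_feature_allowed(unk_4D045E8: int, unk_4D045E4: int) -> bool:
    if unk_4D045E8 != 90 or not unk_4D045E4:
        return False                 # diagnostic 0xE90 (3728)
    return True
```

Note the strict equality: the gate admits only sm_90a itself, not later architectures, which is consistent with the "accelerated" variants being arch-exact feature sets.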
Thread Block Cluster Infrastructure
Clusters are the headline Hopper feature. The compiler gates all cluster functionality at arch_id >= 90 (unk_4D045E8 > 89).
Frontend Attributes
The EDG frontend recognizes three cluster-related kernel attributes:
__cluster_dims__ — Attribute code k in sub_5C79F0. Processing in sub_5D1FE0 validates three integer arguments (x, y, z) and stores them at offsets +20, +24, +28 of the kernel metadata structure. Error codes 3685/3686 on invalid values. On sm_89 and below, diagnostic 3687 is emitted as a warning.
__launch_bounds__ 3rd parameter — The cluster dimension extension to __launch_bounds__ is processed in sub_5D2430. On sm_89 and below, diagnostic 3704 is emitted.
__block_size__ attribute — Handled in sub_5D1A60. At sm_90+, five block dimension arguments are parsed (including the cluster dimension). At sm_89 and below, diagnostic 3790 is emitted and only four arguments are accepted.
NVVM Metadata
Cluster configuration propagates through NVVM IR via several metadata keys:
| Metadata Key | Writers | Readers |
|---|---|---|
nvvm.cluster_dim | sub_93AE30, sub_129A750 | sub_A84F90, sub_CE8EA0 |
cluster_dim_x/y/z | sub_913C80, sub_1273830 | sub_CE8C00/40/80 |
cluster_max_blocks | sub_913C80, sub_1273830 | (kernel metadata) |
nvvm.blocksareclusters | sub_93AE30, sub_129A750 | sub_214DA90 |
nvvm.maxclusterrank | (external) | sub_A84F90, sub_CE9030 |
The blocksareclusters metadata requires reqntid to be set — error message: "blocksareclusters requires reqntid" (sub_214DA90 line 111).
PTX Directives
The kernel attribute emitter at sub_214DA90 gates cluster directives at arch_id >= 90. When the gate passes, four directives may be emitted:
- .blocksareclusters — Declares that thread blocks form clusters.
- .explicitcluster — Emitted when all three cluster dimensions are present.
- .reqnctapercluster X, Y, Z — Required CTA count per cluster.
- .maxclusterrank N — Maximum cluster rank.
Cluster Special Registers
The PTX emitter at sub_21E9060 handles 15 cluster special registers via a switch statement:
| Case | Register | Description |
|---|---|---|
| 0 | %is_explicit_cluster | Boolean: was cluster explicitly set |
| 1 | %cluster_ctarank | CTA rank within the cluster |
| 2 | %cluster_nctarank | Number of CTAs in cluster |
| 3–5 | %cluster_nctaid.{x,y,z} | Cluster grid dimensions |
| 6–8 | %cluster_ctaid.{x,y,z} | CTA position within cluster |
| 9–11 | %nclusterid.{x,y,z} | Cluster grid count |
| 12–14 | %clusterid.{x,y,z} | Cluster ID |
Cluster Barrier Operations
The barrier.cluster instruction is emitted from sub_21E8EA0 with two operation modes and two memory ordering modes:
| Opcode (bits 0–3) | Operation | Memory Mode (bits 4–7) | Qualifier |
|---|---|---|---|
| 0 | arrive | 0 | (default acquire/release) |
| 1 | wait | 1 | .relaxed |
Error strings: "bad cluster barrier op" for invalid opcode, "bad cluster barrier mem mode" for invalid memory mode.
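The operand encoding above can be checked with a small decoder. This is an illustrative model of the bit layout and error strings, not the decompiled emitter itself:

```python
# Decoder sketch for the barrier.cluster operand encoding in sub_21E8EA0:
# opcode in bits 0-3, memory ordering mode in bits 4-7.

OPS   = {0: "arrive", 1: "wait"}
MODES = {0: "", 1: ".relaxed"}       # 0 = default acquire/release ordering

def print_cluster_barrier(operand: int) -> str:
    op   = operand & 0xF
    mode = (operand >> 4) & 0xF
    if op not in OPS:
        raise ValueError("bad cluster barrier op")
    if mode not in MODES:
        raise ValueError("bad cluster barrier mem mode")
    return f"barrier.cluster.{OPS[op]}{MODES[mode]}"
```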
Three corresponding builtins are registered in sub_90AEE0:
| Builtin | ID |
|---|---|
__nv_cluster_barrier_arrive_impl | 11 |
__nv_cluster_barrier_wait_impl | 12 |
__nv_cluster_barrier_arrive_relaxed_impl | 13 |
Cluster Query Builtins
Nine cluster information builtins are registered in sub_90AEE0:
| Builtin | ID | Purpose |
|---|---|---|
__nv_clusterDimIsSpecifed_impl | 8 | Check if cluster dims are set |
__nv_clusterRelativeBlockRank_impl | 9 | Block rank within cluster |
__nv_clusterSizeInBlocks_impl | 10 | Total blocks in cluster |
__nv_cluster_query_shared_rank_impl | 203 | Query shared memory rank |
__nv_cluster_map_shared_rank_impl | 365 | Map to shared memory rank |
__nv_clusterDim_impl | 405 | Get cluster dimensions |
__nv_clusterRelativeBlockIdx_impl | 406 | Relative block index |
__nv_clusterGridDimInClusters_impl | 407 | Grid dimension in clusters |
__nv_clusterIdx_impl | 408 | Cluster index |
fence.sc.cluster Instruction
A new fence instruction is emitted from sub_21E94F0, the membar/fence printer. The opcode encoding uses the low 4 bits of the operand:
| Value | Instruction | Generation |
|---|---|---|
| 0 | membar.gpu | All |
| 1 | membar.cta | All |
| 2 | membar.sys | All |
| 4 | fence.sc.cluster | Hopper+ |
A duplicate implementation exists in the NVPTX backend at sub_35F18E0.
Atomic Cluster Scope
At sm_90+, the atomic lowering paths (sub_12AE930 line 255, sub_9502D0 line 424) add cluster scope support. Scope value 2 now resolves to "cluster" instead of falling through to "gpu" as it does on sm_70–89. This enables atom.*.cluster operations for intra-cluster synchronization.
setmaxnreg — Dynamic Register Count
Hopper introduces dynamic register count adjustment via setmaxnreg.{inc,dec}.sync.aligned.u32.
NVVM IR validation (sub_BFC6A0 lines 1732–1754): Builtin IDs 9431–9432 correspond to nvvm.setmaxnreg.inc and nvvm.setmaxnreg.dec. Validation rules enforce that the register count must be a multiple of 8 and within the range [24, 256].
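The validation rule reduces to two arithmetic checks. A minimal sketch of the constraint as stated above (the function name is illustrative):

```python
# setmaxnreg operand validation per sub_BFC6A0 lines 1732-1754:
# the register count must be a multiple of 8 within [24, 256].

def validate_setmaxnreg(count: int) -> bool:
    return count % 8 == 0 and 24 <= count <= 256
```

So 24, 32, ..., 256 are the only legal immediates; anything else is rejected during NVVM IR validation rather than deferred to ptxas.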
Inline assembly recognition (sub_FCDCB0, sub_21EA5F0): The compiler scans inline asm for setmaxnreg. followed by .sync.aligned.u32, extracting the immediate operand from either a $0 placeholder or a literal integer. Backend duplicates exist at sub_307BA30 and sub_3953170.
WGMMA — Warpgroup Matrix Multiply-Accumulate
WGMMA is Hopper's primary tensor core interface, superseding HMMA for large matrix operations.
Registered Builtins
Four type variants are registered in sub_90AEE0 (lines 2941–2944) with a duplicate table in sub_126A910:
| Builtin | ID | Accumulator Type |
|---|---|---|
__wgmma_mma_async_f16 | 765 | FP16 |
__wgmma_mma_async_bf16 | 766 | BF16 |
__wgmma_mma_async_tf32 | 767 | TF32 |
__wgmma_mma_async_f8 | 768 | FP8 |
Shape Selection
The WGMMA lowering at sub_955A70 (lines 2850–2910+) uses a switch on the M dimension (output rows) to select MachineInstr opcodes:
| M Dimension | Opcode |
|---|---|
| 8 | 10774 |
| 16 | 10690 |
| 24 | 10734 |
| 32 | 10742 |
| 40–88 (stride 8) | 10746–10770 |
Error on invalid M: "unexpected constant overflow in __wgmma_mma_async operand".
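The switch can be modeled from the table. This sketch assumes the 40–88 rows advance linearly (stride 8 in M, +4 in opcode), which matches the two endpoints given; the fixed cases are taken verbatim:

```python
# Model of the M-dimension -> MachineInstr opcode switch in sub_955A70.

def wgmma_opcode(m: int) -> int:
    fixed = {8: 10774, 16: 10690, 24: 10734, 32: 10742}
    if m in fixed:
        return fixed[m]
    if 40 <= m <= 88 and m % 8 == 0:
        # Assumed linear stride: 10746 at M=40 up to 10770 at M=88.
        return 10746 + ((m - 40) // 8) * 4
    raise ValueError(
        "unexpected constant overflow in __wgmma_mma_async operand")
```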
Operand Modifiers
The NVPTX printer at sub_35F3330 emits WGMMA operand modifiers encoded in bitfields:
- kind (bits 6–8): mxf4nvf4 (0), f8f6f4 (1), mxf8f6f4 (2), f16 (3), i8 (4), tf32 (5), mxf4 (7)
- cta_group (bit 1): cta_group::1 (clear) or cta_group::2 (set)
- scale (bits 2–3): Additional scaling modifier
TMA — Tensor Memory Access
TMA provides hardware-accelerated bulk data movement between global and shared memory, driven by a tensor map descriptor that encodes the multi-dimensional layout. Three independent subsystems in cicc cooperate to implement TMA: the intrinsic name parser (sub_A8E250), the SelectionDAG lowering handler (sub_33AD3D0), and the NVPTX ISel pattern matcher for CpAsyncBulkTensor (sub_36EC510).
TMA Descriptor Format (NVVM Container Tag 401)
The host-side tensor map descriptor is embedded in the NVVM container under tag 401. The tag is conditional on ExtOpt.Field344 (tag 301) having value 1, which identifies the Hopper TMA path. (Blackwell uses tag 402 for TCGen05Config instead, gated by Field344==4; the two are mutually exclusive.)
| Component | Size | Description |
|---|---|---|
| Fixed header | 44 bytes | Tensor map metadata (dimensions, strides, element type, interleave, swizzle, fill, OOB policy) |
| Per-descriptor entry | 16 bytes each | One entry per cp.async.bulk.tensor call site in the kernel |
| Total struct at offset 408 | 44 + 16*N bytes | N = number of distinct TMA operations |
The compiler serializes this into the NVVM container (sub_CDD2D0) so ptxas can validate shared memory allocation sizes and descriptor compatibility at link time.
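The struct size follows directly from the layout table. A one-line sketch of the computation (the function name is illustrative):

```python
# Size of the tag-401 TMA descriptor struct: a 44-byte fixed header plus
# one 16-byte entry per cp.async.bulk.tensor call site in the kernel.

def tma_descriptor_size(num_ops: int) -> int:
    HEADER, ENTRY = 44, 16
    return HEADER + ENTRY * num_ops
```

A kernel with three distinct TMA operations therefore serializes a 92-byte struct at container offset 408.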
TMA Descriptor ABI in Kernel Parameters
The EDG frontend detects TMA descriptor parameters during kernel registration stub generation. The detection function sub_8D4C10 (edg::get_tma_descriptor_flags) checks:
if (unk_4F068E0
    && arch > 0x9EFB                          // 0x9EFB = 40699
    && type_is_struct_or_class(type)
    && (*(type+140) & ~4) == 8
    && (get_tma_descriptor_flags(type) & 4)):
    insert copy_node(sub_7E7ED0, calling_convention=7)
    byte_at(node+88) |= 4                     // TMA descriptor flag
This gives TMA descriptors a distinct ABI: calling convention 7 with flag bit 4, separate from normal struct-by-value passing. The copy node ensures the descriptor is materialized at the correct address space boundary before kernel launch.
TMA Intrinsic Name Parsing (sub_A8E250)
The intrinsic dispatcher sub_A8E250 (52 KB) matches TMA intrinsic names via string comparison and assigns internal opcode IDs. Two families exist:
Tensor-structured copies (require a tensor map descriptor):
| Intrinsic Pattern | Dimensions | Opcode |
|---|---|---|
cp.async.bulk.tensor.g2s.tile.1d | 1D | 9222 |
cp.async.bulk.tensor.g2s.tile.2d | 2D | 9223 |
cp.async.bulk.tensor.g2s.tile.3d | 3D | 9224 |
cp.async.bulk.tensor.g2s.tile.4d | 4D | 9225 |
cp.async.bulk.tensor.g2s.tile.5d | 5D | 9226 |
cp.async.bulk.tensor.g2s.im2col.3d | 3D | 9213 |
cp.async.bulk.tensor.g2s.im2col.4d | 4D | 9214 |
cp.async.bulk.tensor.g2s.im2col.5d | 5D | 9215 |
cp.async.bulk.tensor.gmem.to.smem.1d | 1D | 8324 |
cp.async.bulk.tensor.gmem.to.smem.2d | 2D | 8325 |
cp.async.bulk.tensor.gmem.to.smem.3d | 3D | 8326 |
cp.async.bulk.tensor.gmem.to.smem.4d | 4D | 8327 |
cp.async.bulk.tensor.gmem.to.smem.5d | 5D | 8328 |
cp.async.bulk.tensor.gmem.to.smem.im2col.w.3d | 3D | 8329 |
cp.async.bulk.tensor.gmem.to.smem.im2col.w.4d | 4D | 8330 |
cp.async.bulk.tensor.gmem.to.smem.im2col.w.5d | 5D | 8331 |
Unstructured bulk copies (byte-level, no tensor map descriptor):
| Intrinsic Pattern | Opcode |
|---|---|
cp.async.bulk.global.to.shared.cluster | 8315 |
cp.async.bulk.gmem.to.dsmem | 8316 |
Fragment-indexed TMA (from builtin IDs 411/412 via sub_9483E0):
| LLVM Intrinsic | Base Opcode | Index Range |
|---|---|---|
llvm.nvvm.tma.load | 9233 | 9227–9232 (6 entries, indexed by fragment count) |
llvm.nvvm.tma.store | 9257 | (corresponding store entries) |
TMA SelectionDAG Lowering (sub_33AD3D0)
The unified TMA handler sub_33AD3D0 receives a mode argument from the main intrinsic lowering switch in sub_33B0210:
| Case | Mode | Operation | Memory Direction |
|---|---|---|---|
0x179 | 2 | TMA load | global -> shared |
0x17A | 3 | TMA store | shared -> global |
0x17B | 5 | TMA prefetch | global (read-only) |
0x17C | 7 | TMA multicast load | global -> N shared (across cluster) |
Related cp.async handlers in the same dispatch table:
| Case | Handler | Operation |
|---|---|---|
0x175 | sub_33AC2B0 | cp.async (non-TMA async copy) |
0x176 | sub_33AC130 | cp.async.wait |
0x177 | sub_33AB690 | cp.async.bulk (non-tensor bulk copy) |
0x178 | goto LABEL_32 | No-op — commit/barrier (scheduling fence only) |
The 0x178 no-op is significant: it represents the cp.async.bulk commit/barrier intrinsic that exists purely for scheduling purposes. The compiler preserves it as a DAG ordering constraint even though it produces no data-flow SDNode.
CpAsyncBulkTensor G2S Lowering (sub_36EC510)
The 27 KB function sub_36EC510 (1185 lines) implements the complete cp.async.bulk.tensor global-to-shared lowering with full architecture gating and mode validation.
Architecture gates (read from offset+340 of the subtarget object):
| SM Value | Hex | Features Unlocked |
|---|---|---|
| >= 1000 | 0x3E8 | SM 90: tile mode (1D–5D), Im2Col mode (3D–5D) |
| >= 1032 | 0x408 | SM 100: adds 2CTA mode, Im2Col_W, Im2Col_W128 |
Mode bit decoding from operand v11:
| Bits | Mask | Meaning |
|---|---|---|
| 2–4 | v11 & 0x1C | Im2Col variant: Im2Col, Im2Col_W, Im2Col_W128 |
| 3–4 | v11 & 0x18 | 2CTA mode flag |
Validation error strings (emitted as fatal diagnostics):
- "NumDims should be at least 3 for Im2Col or Im2Col_W or Im2Col_W128 mode" — Im2Col requires >= 3D tensors
- "Im2Col_W and Im2Col_W128 modes are not supported on this architecture." — SM 90 does not support Im2Col_W/W128; requires SM 100+
- "2CTA Mode for CpAsyncBulkTensorG2S not supported on this architecture" — 2CTA mode requires SM 100+
TMA Builtin Codegen (EDG -> LLVM IR)
The EDG-to-LLVM builtin lowering handles TMA as builtin IDs 411 and 412 (hex 0x19B / 0x19C).
ID 411 (scatter/store path) — sub_12A7070 extracts TMA descriptor info, then an iterative loop builds a vector of per-element store nodes. The intrinsic table 0x107A–0x107F (4218–4223) selects among 6 entries indexed by element count. Approximately 300 lines of handler code (lines 1256–1501 of sub_12A71A0).
ID 412 (gather/load path) — Similar structure but for the load direction. Uses intrinsic table 0x1094–0x109A (4244–4250). Includes bitcast insertion (opcode 47) for type mismatches between the descriptor element type and the destination register type. Approximately 450 lines (lines 1503–1713).
Both paths use:
- sub_12AA280 — TMA descriptor builder (constructs the multi-operand struct from the builtin arguments)
- sub_12A9E60 — extractvalue emission (decomposes aggregate returns into individual registers)
- sub_39FAC40 — Fragment count computation (determines how many load/store fragments the TMA operation expands into)
TMA Scheduling Constraints
TMA operations impose specific scheduling constraints visible in cicc's SelectionDAG construction:
1. Chain dependencies by mode. Every TMA operation produces a memory chain in the SelectionDAG. The mode parameter determines the chain direction:

| Mode | Reads | Writes | Chain Effect |
|---|---|---|---|
| 2 (load) | global | shared | Load chain |
| 3 (store) | shared | global | Store chain |
| 5 (prefetch) | global | (none) | Load chain |
| 7 (multicast) | global | N x shared | Load chain |

2. Commit-as-fence. Intrinsic ID 0x178 lowers to a no-op (goto LABEL_32), functioning as a pure scheduling barrier. This prevents the DAG scheduler from reordering TMA operations past their commit point.

3. Async qualifier hierarchy. The memory space qualifiers emitted by sub_35F4B50 form an ordered fence hierarchy:

| Qualifier | Scope | Strength |
|---|---|---|
| .async | Unscoped | Weakest |
| .async.global | Global memory domain | |
| .async.shared::cta | CTA-local shared memory | |
| .async.shared::cluster | Cluster shared memory (DSMEM) | Strongest |
Distributed Shared Memory
Hopper's cluster architecture enables distributed shared memory (DSMEM) across CTAs in a cluster. The NVPTX backend emits memory space qualifiers from two functions:
sub_35F4B50 — Async memory space qualifier emission (switch on operand):
| Line | Qualifier | Semantic |
|---|---|---|
| 20 | .async | Base async qualifier (unscoped) |
| 32 | .async.global | Async from global memory |
| 45 | .async.shared::cta | Async to CTA-local shared memory |
| 59 | .async.shared::cluster | Async to cluster distributed shared memory |
| 73 | .alias | Aliased access modifier (permits overlapping accesses) |
sub_35F4E30 — Commit modifier emission (switch on operand):
| Line | Qualifier | Semantic |
|---|---|---|
| 28 | .cta_group::1 | CTA group 1 selection |
| 38 | .cta_group::2 | CTA group 2 selection |
| 51 | .mbarrier::arrive::one | Single-thread mbarrier arrive |
| 67 | .shared::cluster | Cluster shared memory scope |
| 80 | .multicast::cluster | Multicast to all CTAs in cluster |
sub_35F4080 — Secondary .shared::cluster emission (line 68), used in non-commit contexts.
These qualifiers attach to cp.async.bulk and mbarrier instructions to specify the scope and direction of asynchronous data movement within the cluster.
Mbarrier Extensions — DMA Fence/Arrive/Wait
Hopper extends the async barrier (mbarrier) mechanism to coordinate TMA data movement. The TMA DMA pipeline follows a three-phase synchronization protocol:
Phase 1: Initialization
.mbarrier_init (emitted from sub_35F4AD0) initializes the async barrier with the expected transaction byte count. The arrive_expect_tx variant sets both the expected arrival count and the transaction byte count atomically.
Phase 2: Arrive (Producer Signals Completion)
When a TMA operation completes, it signals the mbarrier:
- .mbarrier::arrive::one (sub_35F4E30 line 51) — single-thread arrive notification. The TMA hardware auto-arrives with the transferred byte count.
- .cta_group::1 / .cta_group::2 (sub_35F4E30 lines 28/38) — selects which CTA group the arrive targets, enabling pipelined producer-consumer patterns where two groups alternate roles.
Phase 3: Wait (Consumer Blocks)
The consumer thread issues mbarrier.try_wait with a phase bit. The phase alternates each time the barrier completes a full cycle, enabling pipelined double-buffered access patterns. No additional cicc emission function is needed; the standard mbarrier wait path handles this.
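The three-phase protocol can be modeled with a toy phase-bit barrier. This is our illustrative sketch, not hardware-accurate and not recovered code; it only demonstrates how the phase bit alternates per completed cycle, which is what enables double buffering.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (ours) of the mbarrier phase-bit protocol: the barrier
   flips its phase each time the expected arrival count is reached. */
typedef struct {
    uint32_t expected;   /* arrivals per phase */
    uint32_t pending;    /* arrivals still outstanding */
    uint32_t phase;      /* alternates 0/1 each completed cycle */
} mbarrier_t;

static void mbarrier_init(mbarrier_t *b, uint32_t expected) {
    b->expected = expected; b->pending = expected; b->phase = 0;
}
static void mbarrier_arrive(mbarrier_t *b) {
    if (--b->pending == 0) { b->phase ^= 1; b->pending = b->expected; }
}
/* try_wait succeeds once the barrier has moved past the given parity */
static int mbarrier_try_wait(const mbarrier_t *b, uint32_t parity) {
    return b->phase != parity;
}
```

A consumer polling with parity 0 unblocks exactly when the producer's arrivals complete the cycle and the phase flips to 1; it then waits on parity 1 for the next buffer, mirroring the pipelined double-buffered pattern described above.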
WGMMA Fence/Commit/Wait (Distinct Pipeline)
WGMMA has its own synchronization cycle, separate from TMA mbarriers:
| Builtin | IDs | Handler | LLVM Intrinsic |
|---|---|---|---|
| __wgmma_fence | 745–750 | sub_12B1C20 | 9062 (wgmma.fence.aligned, 3 type overloads) |
| __wgmma_commit_group | (same range) | sub_12B1C20 | (same dispatch) |
| __wgmma_wait_group | (same range) | sub_12B1C20 | (same dispatch) |
WGMMA fences synchronize the tensor core accumulator pipeline; TMA mbarriers synchronize the DMA engine. A typical Hopper kernel pipelines both: TMA loads data into shared memory (mbarrier-synchronized), then WGMMA consumes the data from shared memory (fence-synchronized). The two synchronization domains must not be confused in a reimplementation.
Feature Flag Configuration
The master feature configurator sub_60E7C0 sets the following flags at the sm_90+ threshold (qword_4F077A8 > 89999):
| Flag | Source |
|---|---|
| unk_4D043D0 | sub_60E7C0 |
| unk_4D041B0 | sub_60E7C0 |
| unk_4D04814 | sub_60E7C0 |
| unk_4D0486C | sub_60E7C0 (with C++ version check) |
| dword_4F07760 | sub_60E530 |
| dword_4D043F8 | sub_60E530 (at > 99999) |
| dword_4D041E8 | sub_60E530 (at > 99999) |
Key Binary Locations
| Function | Address | Role |
|---|---|---|
| sub_CD09E0 | 0xCD09E0 | NVVM arch enum (NVVM_ARCH_HOPPER_9_0) |
| ctor_356 | 0x50C890 | Subtarget registration (sm_90 enum 38, sm_90a enum 39) |
| sub_214DA90 | 0x214DA90 | Kernel attribute emitter (cluster PTX directives) |
| sub_21E9060 | 0x21E9060 | Cluster special register PTX emission |
| sub_21E8EA0 | 0x21E8EA0 | Cluster barrier instruction emission |
| sub_21E94F0 | 0x21E94F0 | Membar/fence printer (fence.sc.cluster) |
| sub_BFC6A0 | 0xBFC6A0 | setmaxnreg NVVM IR validation |
| sub_FCDCB0 | 0xFCDCB0 | setmaxnreg inline asm pattern matching |
| sub_955A70 | 0x955A70 | WGMMA lowering (M-dimension switch) |
| sub_90AEE0 | 0x90AEE0 | Builtin registration (WGMMA, cluster barriers/queries) |
| sub_A8E250 | 0xA8E250 | TMA intrinsic name parsing (52 KB) |
| sub_33AD3D0 | 0x33AD3D0 | TMA SelectionDAG lowering handler (modes 2/3/5/7) |
| sub_33AB690 | 0x33AB690 | cp.async.bulk non-tensor handler |
| sub_33AC2B0 | 0x33AC2B0 | cp.async handler |
| sub_33AC130 | 0x33AC130 | cp.async.wait handler |
| sub_36EC510 | 0x36EC510 | CpAsyncBulkTensor G2S lowering (27 KB, 1185 lines) |
| sub_9483E0 | 0x9483E0 | TMA descriptor extraction |
| sub_12AA280 | 0x12AA280 | TMA descriptor builder (EDG -> LLVM IR) |
| sub_12A7070 | 0x12A7070 | TMA scatter/store builtin handler |
| sub_8D4C10 | 0x8D4C10 | edg::get_tma_descriptor_flags |
| sub_35F4B50 | 0x35F4B50 | DSMEM qualifier emission |
| sub_35F4E30 | 0x35F4E30 | Commit modifier emission (mbarrier, multicast) |
| sub_35F4AD0 | 0x35F4AD0 | .mbarrier_init emission |
| sub_35F4080 | 0x35F4080 | Secondary .shared::cluster emission |
Blackwell Datacenter (sm_100, sm_100a, sm_103, sm_103a)
The Blackwell datacenter family introduces the fifth-generation tensor core instruction set (tcgen05), new floating-point formats (FP4, FP6, MX formats), and a sophisticated arch-conditional versus family-conditional feature gating system. sm_100/sm_100a targets the NVIDIA B200, while sm_103/sm_103a targets Blackwell Ultra (GB300 system). Both share the tcgen05 ISA but differ in __CUDA_ARCH values and minor tensor core configuration.
Architecture Identity
Six Blackwell arch constants are defined in sub_CD09E0:
| NVVM Enum | Numeric Value | Implied SM |
|---|---|---|
| NVVM_ARCH_BLACKWELL_10_0 | 1000 | sm_100 |
| NVVM_ARCH_BLACKWELL_10_1 | 1010 | sm_101 |
| NVVM_ARCH_BLACKWELL_10_3 | 1030 | sm_103 |
| NVVM_ARCH_BLACKWELL_11_0 | 1100 | sm_110 (Jetson Thor) |
| NVVM_ARCH_BLACKWELL_12_0 | 1200 | sm_120 |
| NVVM_ARCH_BLACKWELL_12_1 | 1210 | sm_121 |
Notable: sm_110 (Jetson Thor) was originally designated sm_101 before being renumbered to its own 11.x line. Despite the rename, both remain in the Blackwell family (NVVM_ARCH_BLACKWELL_*). The numeric encoding follows the standard major*100 + minor*10 formula: 11*100 + 0*10 = 1100.
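The encoding formula is trivial but worth making explicit, since the enum values above all follow it:

```c
#include <assert.h>

/* SM numeric encoding used by the NVVM arch enums: major*100 + minor*10.
   e.g. sm_110 -> 11*100 + 0*10 = 1100, sm_103 -> 10*100 + 3*10 = 1030. */
static int sm_encode(int major, int minor) { return major * 100 + minor * 10; }
```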
SM Variant Table
Each Blackwell datacenter target has base, accelerated (a), and forward-compatible (f) sub-variants:
| Variant | __CUDA_ARCH | PTX Version | Product |
|---|---|---|---|
| sm_100 | 1000 | 6 | B200 base |
| sm_100a | 1000 | 7 | B200 accelerated |
| sm_100f | 1000 | 7 | B200 forward-compatible |
| sm_103 | 1030 | 6 | Blackwell Ultra / GB300 base |
| sm_103a | 1030 | 7 | Blackwell Ultra / GB300 accelerated |
| sm_103f | 1030 | 7 | Blackwell Ultra / GB300 forward-compatible |
The undocumented sm_101 and sm_102 targets also exist in the processor table (ctor_605) with their own a/f variants. sm_101 maps to __CUDA_ARCH=1010 and sm_102 to __CUDA_ARCH=1020. No unique feature gates differentiate them from sm_100 in cicc.
Suffix Semantics
The sub-variant flags are stored in EDG frontend globals:
- unk_4D045E8 — Major SM number (100, 103)
- unk_4D045E4 — Accelerated flag; set for both a and f variants
- unk_4D045E0 — Forward-compatible flag; set only for f variants
The f suffix implies a — whenever the forward-compatible flag is set, the accelerated flag is also set. In cicc v13.0, the f flag is set during CLI parsing and reset in sub_615CB0 but is never read by any compiler logic. It exists for future-proofing and potential ptxas-level differentiation.
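The "f implies a" relation can be captured in a small sketch. The struct and function names are ours; only the flag semantics come from the recovered globals above.

```c
#include <assert.h>

/* Illustrative model of the EDG suffix flags: 'f' always sets the
   accelerated flag as well, matching the unk_4D045E4/unk_4D045E0 behavior. */
typedef struct { int accelerated; int forward_compat; } sm_suffix;

static sm_suffix parse_suffix(char s) {
    sm_suffix r = {0, 0};
    if (s == 'a') r.accelerated = 1;
    if (s == 'f') { r.accelerated = 1; r.forward_compat = 1; }
    return r;
}
```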
Arch-Conditional vs. Family-Conditional Gating
Blackwell introduces a two-tier feature gating system that distinguishes between "arch-conditional" and "family-conditional" access to instructions. This pattern repeats across every tcgen05 handler.
The gate check at sub_30462A0, sub_304E6C0, and sub_36E9630 uses a complex encoding:
```
v = arch_version                  // offset +340 of the arch struct
if (v > 0x408) {                  // 0x408 = 1032 = sm_103f
    if (v - 1101 > 1)             // allows {1101, 1102} — sm_110a/sm_110f (Jetson Thor)
        goto ERROR;
} else if (v <= 0x3E8 || ((1LL << ((v & 0xFF) + 23)) & 0xC0000C03) == 0) {
    goto ERROR;                   // 0x3E8 = 1000 = sm_100 base
}
```
The bitmask 0xC0000C03 is ANDed against 1 shifted left by (v & 0xFF) + 23 — an x86 shift, so the count is taken modulo 64 — selecting specific a/f sub-variants. PTX version gates further refine access: family-conditional features require PTX >= 86, while arch-conditional features require PTX >= 88.
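Putting the pieces together, the gate can be re-expressed as a single predicate. This is our reconstruction, not recovered code; the explicit `& 63` models the x86 shift's modulo-64 behavior that the decompiled expression implicitly relies on (e.g. for v = 1001 the nominal shift count is 256, which wraps to bit 0 — matching the acceptance table later in this page).

```c
#include <assert.h>
#include <stdint.h>

/* Reconstruction (ours) of the tcgen05 arch gate from sub_30462A0 /
   sub_304E6C0 / sub_36E9630. Returns 1 if tcgen05 is allowed for v. */
static int tcgen05_allowed(uint32_t v) {
    if (v > 0x408)                    /* above sm_103f (1032) */
        return v - 1101 <= 1;         /* only sm_110a (1101) / sm_110f (1102) */
    if (v <= 0x3E8)                   /* sm_100 base (1000) and below */
        return 0;
    /* x86 64-bit shifts take the count modulo 64; model that explicitly */
    unsigned shift = ((v & 0xFF) + 23) & 63;
    return ((1ULL << shift) & 0xC0000C03ULL) != 0;
}
```

Under this reconstruction the a/f sub-variants of sm_100/101/103 pass via the bitmask, sm_110a/f pass via the range check, and all base variants plus the entire sm_120 family are rejected.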
Features gated by both arch-conditional and family-conditional (broader access): tcgen05.fence, tcgen05.wait, tcgen05.relinquish.alloc, tcgen05.cp, tcgen05.commit, tcgen05.alloc, tcgen05.mma, and the ue8m0x2 type in cvt_packfloat.
Features gated by arch-conditional only (stricter): {fp6/fp4}x2 types in cvt_packfloat, INT8 type in tcgen05.mma, MXF4/MXF4NVF4 with sparsity, and explicit scale vector size.
tcgen05 — Tensor Core Generation 5
The tcgen05 instruction family is the primary new ISA extension for Blackwell datacenter. All tcgen05 instructions are handled in sub_30462A0 and sub_304E6C0.
Lifecycle Instructions
| Instruction | Opcode | ISD | Operands | Purpose |
|---|---|---|---|---|
| tcgen05.alloc | 10080 | 4765 | Basic allocation | Allocate tensor core accumulator memory |
| tcgen05.alloc (multicast) | 10083 | 4770/4771 | 32-bit flag variant | Multicast allocation |
| tcgen05.dealloc | 10140 | 4827 | 4 operands | Deallocate tensor core memory |
| tcgen05.commit | 10090/10091 | 4772–4777 | Mask variants | Commit pending operations |
| tcgen05.fence | 10143 | 4830 | 2 operands | Memory fence for tensor ops |
| tcgen05.wait | 10351 | 5020 | 2 operands | Wait for tensor ops to complete |
| tcgen05.relinquish.alloc | 10311 | 4941 | 2 operands | Relinquish allocated tensor memory |
| tcgen05.cp.* | 10101 | 4790 | 4 operands | Copy operations for tensor data |
The commit instruction has multiple variants based on multicast mask size. Only 16-bit and 32-bit masks are valid; other sizes produce an error.
tcgen05.mma — Matrix Multiply-Accumulate
The main MMA instruction is handled in sub_304E6C0 (opcodes 10299–10309) and validated in sub_36E9630. The operand encoding packs configuration into bitfields:
Data types (bits 8–6 of operand):
| Value | Kind | Notes |
|---|---|---|
| 0 | kind::mxf4nvf4 | MX FP4 with NV FP4 |
| 1 | kind::f8f6f4 | Standard FP8/FP6/FP4 |
| 2 | kind::mxf8f6f4 | MX variant of f8f6f4 |
| 3 | kind::f16 | Half precision |
| 4 | kind::i8 | 8-bit integer (arch-conditional only) |
| 5 | kind::tf32 | TensorFloat-32 |
| 7 | kind::mxf4 | MX FP4 |
Scale vector sizes (bits 3–2):
| Value | Modifier | Constraints |
|---|---|---|
| default | .scale_vec::1X | Not for mxf4nvf4 or mxf4 |
| 2 | .scale_vec::2X | Not for mxf8f6f4 |
| 3 | .scale_vec::4X | Not for mxf8f6f4 or mxf4 |
Block scale (bits 10–9): .block16 (16-element block scaling) or .block32 (32-element block scaling). Not supported for f16, tf32, f8f6f4, or i8.
Weight stationary (bit 0): .ws flag. Incompatible with cta_group::2, mxf8f6f4, and FP4 types.
Sparsity (bit 5): Restricted for MXF4 and MXF4NVF4 types on arch-conditional variants only.
Scale input accumulator (bit 4): Scales the accumulator input. Only usable with f16 and tf32 types. Notably, this is NOT supported on the a sub-variants (sm_100a at v=1001, sm_103a at v=1033) but IS supported on base variants (sm_100 at v=1000, sm_103 at v=1030) and sm_120+.
CTA group (bit 1): cta_group::1 (clear) or cta_group::2 (set).
Collector modes (from sub_35F38B0): .collector::a::fill, .collector::a::use, .collector::a::lastuse, and .collector::b with ::ws sub-variants. Constraint: cannot use collector::a::use or collector::a::fill with the ashift modifier.
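As a sketch, the bitfield positions above can be unpacked like this. The struct and field names are ours; only the bit positions come from the analysis, and the encoding is as-recovered, not an official format.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative unpacking of the tcgen05.mma operand bitfields described
   above. Field names are ours; bit positions follow the recovered layout. */
typedef struct {
    unsigned ws;               /* bit 0: weight stationary */
    unsigned cta_group2;       /* bit 1: cta_group::2 when set */
    unsigned scale_vec;        /* bits 3-2: scale vector size selector */
    unsigned scale_input_acc;  /* bit 4: scale input accumulator */
    unsigned sparse;           /* bit 5: sparsity */
    unsigned kind;             /* bits 8-6: data type kind */
    unsigned block_scale;      /* bits 10-9: block16/block32 */
} mma_cfg;

static mma_cfg decode_mma(uint32_t op) {
    mma_cfg c;
    c.ws              =  op        & 1;
    c.cta_group2      = (op >> 1)  & 1;
    c.scale_vec       = (op >> 2)  & 3;
    c.scale_input_acc = (op >> 4)  & 1;
    c.sparse          = (op >> 5)  & 1;
    c.kind            = (op >> 6)  & 7;
    c.block_scale     = (op >> 9)  & 3;
    return c;
}
```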
tcgen05.cp Copy Shapes
The copy instruction shape emission at sub_35F5090 supports:
| Shape | Bits 3–1 Value |
|---|---|
| .128x256b | 0 |
| .4x256b | 1 |
| .128x128b | 2 |
| .64x128b | 3 |
| .32x128b | 4 |
Destination format modifiers: .b8x16 (base), .b6x16_p32 (6-bit with 32-bit padding), .b4x16_p64 (4-bit with 64-bit padding).
Multicast modes: .warpx2::02_13 (warp pairs 0,2 and 1,3), .warpx2::01_23 (warp pairs 0,1 and 2,3), .warpx4 (all 4 warps).
cvt_packfloat — Extended Numeric Formats
The cvt_packfloat intrinsic (sub_304FBD0 for validation, sub_35ED820 for emission) has a base requirement of SM >= 90 and PTX >= 78. Blackwell adds four new types:
| Case | Type | Generation |
|---|---|---|
| 0 | .f32 | sm_90+ |
| 1 | .f16x2 | sm_90+ |
| 2 | .e4m3x2 (FP8 E4M3) | sm_90+ |
| 3 | .e5m2x2 (FP8 E5M2) | sm_90+ |
| 4 | .bf16x2 (BFloat16) | sm_90+ |
| 5 | .e2m1x2 (FP4 E2M1) | sm_100+ |
| 6 | .e2m3x2 (FP6 E2M3) | sm_100+ |
| 7 | .e3m2x2 (FP6 E3M2) | sm_100+ |
| 8 | .ue8m0x2 (UE8M0 scale) | sm_100+ |
The ue8m0x2 type is gated by both arch-conditional and family-conditional paths, while {fp6/fp4}x2 types (e2m1x2, e2m3x2, e3m2x2) are arch-conditional only.
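For reference, the new FP4 e2m1x2 element packs two E2M1 nibbles; a single nibble decodes as follows. This is our decoder based on the standard E2M1 definition (1 sign, 2 exponent, 1 mantissa bit, exponent bias 1, no inf/NaN encodings), not code recovered from cicc.

```c
#include <assert.h>

/* Decode one FP4 E2M1 nibble. Representable magnitudes:
   0, 0.5, 1, 1.5, 2, 3, 4, 6 (exponent 0 is subnormal). */
static float e2m1_to_float(unsigned nib) {
    float sign = (nib & 8) ? -1.0f : 1.0f;
    unsigned e = (nib >> 1) & 3, m = nib & 1;
    float mag = e ? (1.0f + 0.5f * m) * (float)(1 << (e - 1))
                  : 0.5f * m;
    return sign * mag;
}
```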
tcgen05 Commit with Mbarrier
The commit modifier emission at sub_35F4E30 combines tensor core commit with mbarrier synchronization:
- .cta_group::1 / .cta_group::2 — Group selection
- .mbarrier::arrive::one — Mbarrier arrive modifier
- .shared::cluster — Shared memory cluster scope
- .multicast::cluster — Multicast cluster scope
sm_100 vs. sm_103 Differences
Both families share the full tcgen05 ISA. Observable differences in cicc:
- __CUDA_ARCH: 1000 vs. 1030
- Tensor core operand range: sm_103 may handle wider operand loops (offset 760 vs. 600 for simpler variants in cases 10303/10308)
- Scale input accumulator: Not available on a sub-variants of either family
No sm_103-specific feature gates exist beyond the __CUDA_ARCH value. Hardware differences between B200 and GB300 are resolved at the ptxas level.
Feature Flag Configuration
At the sm_100+ threshold (qword_4F077A8 > 109999), the master configurator sub_60E7C0 enables:
| Flag | Condition |
|---|---|
| unk_4D04184 | Unconditional |
| unk_4D04800 | Requires CUDA mode + C++20 |
| dword_4D041AC | Guarded by byte_4CF8172 |
Key Binary Locations
| Function | Address | Role |
|---|---|---|
| sub_CD09E0 | 0xCD09E0 | NVVM arch enum (all Blackwell constants) |
| sub_1C1B150 | 0x1C1B150 | Second arch enum copy (LLVM module metadata) |
| sub_30462A0 | 0x30462A0 | tcgen05 intrinsic handler (alloc/dealloc/commit/fence/wait/cp) |
| sub_304E6C0 | 0x304E6C0 | tcgen05.mma intrinsic handler + SelectionDAG lowering |
| sub_36E9630 | 0x36E9630 | tcgen05.mma validation + ISD opcode selection |
| sub_304FBD0 | 0x304FBD0 | cvt_packfloat intrinsic handler |
| sub_35ED820 | 0x35ED820 | cvt_packfloat type string emission |
| sub_35F3330 | 0x35F3330 | tcgen05.mma modifier emission (kind, scale, cta_group) |
| sub_35F38B0 | 0x35F38B0 | tcgen05.mma modifier emission (ashift, collector) |
| sub_35F4E30 | 0x35F4E30 | tcgen05 commit modifier emission |
| sub_35F5090 | 0x35F5090 | tcgen05.cp shape/format emission |
| sub_95EB40 | 0x95EB40 | CLI arch string mapping |
| sub_617BD0 | 0x617BD0 | compute_NNN string parsing |
| ctor_605 | 0x584510 | Processor variant string table |
| ctor_356 | 0x50C890 | LLVM processor description table |
Blackwell (sm120) — Consumer and Enterprise (sm_120, sm_121)
The sm_120 family targets the consumer RTX 50-series and enterprise RTX Blackwell Pro GPUs. Despite sharing the "Blackwell" marketing name with sm_100, the sm_120 microarchitecture is a distinct design — a chimera of Hopper and Ada Lovelace silicon, with fundamentally different tensor core hardware. sm_121 targets DGX Spark.
Critical architectural difference: sm_120 does NOT have tcgen05 tensor core instructions. The tcgen05 arch-conditional gate in cicc (sub_30462A0, sub_304E6C0, sub_36E9630) reads SmVersion at offset +0x154 and performs:
```
if (SmVersion > 1032):            // above sm_103f
    if (SmVersion - 1101 > 1):    // only 1101 (sm_110a) and 1102 (sm_110f) pass
        → ERROR "tcgen05 supported only on arch-conditional..."
```
sm_120's SmVersion is 1200 → 1200 - 1101 = 99 > 1 → rejected by cicc itself, not by ptxas. The values 1101/1102 correspond to sm_110a/sm_110f (Jetson Thor), confirming that Jetson Thor retains tcgen05/TMEM hardware while consumer Blackwell does not.
The upstream LLVM 22 NVPTX backend (NVPTXSubtarget.h) independently confirms this: hasTcgen05InstSupport() lists only {100, 110}, and hasMMABlockScale() lists only {120}.
The complete tcgen05 acceptance list from cicc's binary (all three gate functions use identical logic):
| SmVersion | Target | tcgen05 |
|---|---|---|
| 1001 | sm_100a | Allowed (bitmask bit 0) |
| 1002 | sm_100f | Allowed (bitmask bit 1) |
| 1011 | sm_101a | Allowed (bitmask bit 10) |
| 1012 | sm_101f | Allowed (bitmask bit 11) |
| 1031 | sm_103a | Allowed (bitmask bit 30) |
| 1032 | sm_103f | Allowed (bitmask bit 31) |
| 1101 | sm_110a | Allowed ((v-1101) <= 1) |
| 1102 | sm_110f | Allowed ((v-1101) <= 1) |
| 1000, 1010, 1030, 1100 | base variants | Blocked (no suffix) |
| 1200–1212 | all sm_120/121 | Blocked (v-1101 > 1) |
From the user-visible feature perspective in cicc v13.0, sm_120 adds exactly two compiler-visible features beyond the shared Blackwell base: .offset.bindless texture intrinsics and 16-bit texture element type support.
Architecture Identity
NVIDIA's internal naming places sm_120/sm_121 squarely in the Blackwell family:
| NVVM Enum | Numeric Value | __CUDA_ARCH | Product |
|---|---|---|---|
| NVVM_ARCH_BLACKWELL_12_0 | 1200 | 1200 | RTX 50xx / RTX Blackwell Pro |
| NVVM_ARCH_BLACKWELL_12_1 | 1210 | 1210 | DGX Spark |
The hardware SM enum NVVM_ARCH_HW_SM_10_4 maps to value 1200, revealing that NVIDIA internally considers sm_120 as "SM 10.4" — a continuation of the Blackwell 10.x line rather than a distinct generation.
SM Variant Table
| Variant | __CUDA_ARCH | PTX Version | a flag | f flag |
|---|---|---|---|---|
| sm_120 | 1200 | 6 | 0 | 0 |
| sm_120a | 1200 | 7 | 1 | 0 |
| sm_120f | 1200 | 7 | 1 | 1 |
| sm_121 | 1210 | 6 | 0 | 0 |
| sm_121a | 1210 | 7 | 1 | 0 |
| sm_121f | 1210 | 7 | 1 | 1 |
The PTX version pattern is identical to sm_100: base variants use PTX 6, accelerated and forward-compatible variants use PTX 7. sm_120 does not require a higher PTX version than sm_100.
Suffix Behavior
For the sm_120 family, the a and f suffixes have no behavioral impact on compiler internals in cicc v13.0:
- unk_4D045E4 (accelerated flag): Read in exactly one location (sub_6C4D80 line 167), but only for unk_4D045E8 == 90 — the sm_90a gate. The flag is never checked for sm_120.
- unk_4D045E0 (forward-compatible flag): Set during CLI parsing, reset in sub_615CB0, but never read anywhere in the compiler logic.
The suffixes exist for forward-proofing, __CUDA_ARCH macro consistency (all sub-variants share the same value), and potential ptxas-level differentiation not visible in cicc.
SM 120 Exclusive Feature Gates
The entire cicc codebase contains exactly two locations gated on sm_120. Both check __CUDA_ARCH >= 1200 (i.e., the arch value field at offset +8 must exceed 1199).
Feature 1: .offset.bindless Texture Intrinsics
Frontend gate: sub_1C36530 line 2724
Backend gate: sub_2C7B6A0 line 2160
When *(int*)(a1 + 8) <= 1199, the compiler emits: ".offset.bindless intrinsics are not supported on pre-Blackwell architectures". The error message is misleading — sm_100 IS Blackwell, yet .offset.bindless requires sm_120+. The message likely reflects an earlier internal naming convention or considers sm_120 the "true" consumer Blackwell.
The .offset.bindless intrinsics provide texture and surface operations using bindless handles with an additional offset parameter. This enables runtime-flexible texture resource indexing, indirect texture access via descriptor heaps, and offset-based resource aliasing within a descriptor pool.
68 intrinsic variants are classified by two functions:
- Frontend: sub_1C303A0 — checks three ID ranges:
  - Range 1: IDs 4419–4469 (26 IDs, odd numbers only)
  - Range 2: IDs 4722, 4725, 4726, 4731, 4734, 4736, 4739 (7 IDs)
  - Range 3: IDs 5085–5153 (35 IDs, odd numbers only)
- Backend: sub_CEA320 — checks corresponding backend intrinsic IDs
These 68 intrinsics cover the full matrix of texture dimensions (1D, 2D, 3D, cube, array variants), data types (i32, f32, and others), and operation types (sample, fetch, gather). The sm_120 gate means these intrinsics physically require sm_120 hardware — the texture unit changes needed for offset-based bindless addressing are not present on sm_100 silicon.
Feature 2: 16-bit Texture Element Types
Frontend gate: sub_1C36530 line 3381
Backend gate: sub_2C7B6A0 line 2386
When *(int*)(a1 + 8) > 1199, 16-bit (f16) element types become legal for most texture intrinsics. The legalization logic at frontend line 3397:
```
type_legal = (elem_is_i8_or_i16_raw) || is_32bit(type) ||
             (is_16bit(type) && tex16_allowed_flag)
```
The tex16_allowed_flag differs by architecture:
- sm < 120: True only for builtin ID 3811 (checked by sub_1C30390)
- sm >= 120: True for all texture intrinsics except IDs 5116–5131 (checked by sub_1C30470 on frontend, sub_CEA3F0 for backend IDs 10462–10477)
This change reduces memory bandwidth requirements for texture operations on sm_120 by enabling native f16 texture reads without promotion to 32-bit.
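The legalization predicate and the per-architecture flag can be sketched together. All function and parameter names below are ours; only the boolean structure, the ID 3811 exception, and the 5116–5131 exclusion range come from the decompilation.

```c
#include <assert.h>

/* Sketch (ours) of the f16 texture legalization from frontend line 3397. */
static int tex16_allowed(int sm, int builtin_id) {
    if (sm < 120)
        return builtin_id == 3811;                      /* sole pre-sm_120 case */
    return !(builtin_id >= 5116 && builtin_id <= 5131); /* exclusion list */
}

static int type_legal(int is_raw_i8_i16, int is_32bit, int is_16bit,
                      int sm, int builtin_id) {
    return is_raw_i8_i16 || is_32bit ||
           (is_16bit && tex16_allowed(sm, builtin_id));
}
```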
sm_120 vs. sm_121
Both variants pass the same > 1199 gate. In cicc v13.0, there is no code path that differentiates sm_121 from sm_120. The only distinction is the __CUDA_ARCH macro value (1200 vs. 1210), which affects user-level #ifdef checks in CUDA source code.
sm_121 is a minor revision of sm_120, analogous to how sm_103 relates to sm_100 — both have different __CUDA_ARCH values but no compiler-internal behavioral difference beyond the macro.
Relationship to sm_100
What sm_120 Inherits from sm_100
sm_120 shares the Blackwell family identity and inherits most non-tensor-core features: Hopper cluster operations, TMA bulk copy, setmaxnreg, narrow FP conversion support (e2m3/e3m2/e2m1/ue8m0), tensormap.replace, and Blackwell ldstmatrix instructions.
What sm_120 Does NOT Have
sm_120 lacks the entire tcgen05 instruction family and its prerequisite Tensor Memory (TMEM) hardware:
- No tcgen05.alloc / tcgen05.dealloc (no TMEM to allocate)
- No tcgen05.mma (the async TMEM-based tensor core path)
- No tcgen05.cp / tcgen05.commit / tcgen05.fence / tcgen05.wait
- No tcgen05.relinquish.alloc
What sm_120 Has Instead
The sm_120 hardware extends the existing mma.sync instruction family (which has been the standard tensor core interface since Volta/sm_70) with new block_scale qualifiers and MX-format data types:
```
mma.sync.aligned.kind::mxf8f6f4.block_scale.scale_vec::1X.m16n8k32.row.col.f32.e4m3.e4m3.f32.ue8m0
```
This adds per-block MX-format scaling to the synchronous register-based MMA, supporting FP8 (e4m3, e5m2), FP6 (e3m2, e2m3), and FP4 (e2m1) operand types with ue8m0 scale factors. The tile shape is m16n8k32. Upstream LLVM 22 confirms this with hasMMABlockScale() returning true only for {120} and hasMMASparseBlockScaleF4() for {120, 121}.
The block_scale variant is restricted to TN layout (.row.col is hardcoded as a string literal in LLVM's tablegen — not parameterized, no NN/NT/TT variants exist). This is consistent with the broader mma.sync family where all post-Volta shapes are effectively TN-only (only the original m8n8k4 f16 from Volta supports all four layout combinations). By contrast, tcgen05.mma on sm_100/103/110 has no layout qualifier at all — data layout is implicit in the tensor memory descriptor (idesc).
cicc v13.0 does not yet emit mma.sync.block_scale for sm_120. The binary contains the string "nvvm.mma.blockscale currently supports non-sync aligned variants only!", confirming that block-scaled MMA is only available through the tcgen05 (async) path in this release — which sm_120 doesn't have access to. The mma.sync.block_scale support for sm_120 is present in upstream LLVM 22 and presumably coming in a future CUDA release (13.1+).
In cicc v13.0, sm_120 falls back to the standard HMMA/IMMA tensor core codegen inherited from sm_70–sm_90. The new Blackwell-generation tensor features (tcgen05 async path OR block_scale sync path) are both unavailable for sm_120 in this compiler version.
Tensor Core Instruction Timeline
| Generation | SM | Instruction | Memory Model |
|---|---|---|---|
| Volta/Turing | sm_70/75 | mma.sync (HMMA) | Register-to-register, synchronous |
| Ampere | sm_80 | mma.sync (extended shapes) | Register-to-register, synchronous |
| Hopper | sm_90 | wgmma.mma_async | Shared memory → registers, async warpgroup |
| Blackwell datacenter | sm_100/103/110 | tcgen05.mma | Tensor Memory (TMEM), fully async |
| Blackwell consumer | sm_120/121 | mma.sync.block_scale (LLVM 22+) | Register-to-register, synchronous + MX scaling |
sm_110 — Jetson Thor
sm_110 (Jetson Thor, for automotive and robotics SoCs) sits between sm_100 and sm_120 in the architecture numbering. Despite the higher SM number, sm_110 is architecturally a datacenter Blackwell derivative (originally sm_101 before rename) and retains tcgen05/TMEM support — the tcgen05 gate explicitly allows sm_110a (SmVersion 1101) and sm_110f (1102). It lacks sm_120's .offset.bindless and f16 texture features but has full tensor core parity with sm_100/sm_103.
| Variant | __CUDA_ARCH | PTX Version |
|---|---|---|
| sm_110 | 1100 | 6 |
| sm_110a | 1100 | 7 |
| sm_110f | 1100 | 7 |
Feature Flag Configuration
At the sm_120+ threshold (qword_4F077A8 > 119999), the master configurator sub_60E7C0 enables:
| Flag | Purpose |
|---|---|
| unk_4D047BC | Disabled (set to 0) for sm_120+; enabled for all lower architectures |
| unk_4D0428C | Enabled at sm_120+ |
The unk_4D047BC flag is unconditionally assigned based on SM <= 119999, making it the only flag that is actively disabled at sm_120+. This likely controls a legacy optimization or codegen path that is incompatible with sm_120 hardware.
Key Binary Locations
| Function | Address | Role |
|---|---|---|
| sub_CD09E0 | 0xCD09E0 | NVVM arch enum (NVVM_ARCH_BLACKWELL_12_0/12_1) |
| sub_95EB40 | 0x95EB40 | CLI arch string mapping |
| sub_617BD0 | 0x617BD0 | compute_NNN string parsing |
| ctor_605 | 0x584510 | Processor variant table (PTX versions) |
| ctor_356 | 0x50C890 | LLVM processor description table |
| sub_1C36530 | 0x1C36530 | Frontend verifier (.offset.bindless + f16 texture gates) |
| sub_2C7B6A0 | 0x2C7B6A0 | Backend verifier (.offset.bindless + f16 texture gates) |
| sub_1C303A0 | 0x1C303A0 | .offset.bindless intrinsic classifier (frontend) |
| sub_CEA320 | 0xCEA320 | .offset.bindless intrinsic classifier (backend) |
| sub_1C30470 | 0x1C30470 | f16 texture exclusion list (frontend) |
| sub_CEA3F0 | 0xCEA3F0 | f16 texture exclusion list (backend) |
| sub_6C4D80 | 0x6C4D80 | Accelerated flag reader (sm_90a only, not sm_120) |
| sub_615CB0 | 0x615CB0 | Forward-compatible flag reset |
NVVM IR Node Layout
The NVVM frontend in cicc v13.0 uses a custom intermediate representation distinct from LLVM's native IR. Each IR node is a variable-length structure allocated from a bump allocator, with operands stored backward from the node header pointer. The node uniquing infrastructure lives in sub_162D4F0 (49KB), which routes each opcode to a dedicated DenseMap inside the NVVM context object.
Node Header Layout
The pointer a1 returned from allocation points to the start of the fixed header. Operands are at negative offsets behind it.
| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
| +0 | 1B | uint8_t | opcode | Switch key in sub_162D4F0; values 0x04..0x22+ |
| +2 | 2B | uint16_t | subopcode | Intrinsic ID; read for opcodes 0x1C, 0x1D, 0x1E |
| +4 | 4B | -- | (padding) | Not accessed directly |
| +8 | 4B | uint32_t | num_operands | Controls operand access range |
| +16 | 8B | tagged_ptr | context_ptr | Low 3 bits are tag; mask with & ~7 for pointer |
| +24 | 8B | varies | extra_A | DWORD for opcodes 0x1A/0x1B; pointer for 0x10/0x22 |
| +28 | 4B | uint32_t | extra_B | Present for opcode 0x1B |
| +32 | 8B | varies | extra_C | Present for opcode 0x10 |
| +40 | 1B | uint8_t | extra_flag | Present for opcode 0x10 |
Minimum header size is 24 bytes. Total node allocation: 24 + 8 * num_operands bytes minimum, though opcode-specific extra fields extend the header region for certain node types.
Operand Storage
Operands are stored as 8-byte QWORD pointers at negative offsets from the header. The stride is exactly 8 bytes per operand. Access follows this pattern (decompiled from sub_162D4F0):
```
operand[k] = *(_QWORD *)(a1 + 8 * (k - num_ops))
```
For a node with num_operands = 3:
- operand[0] is at a1 - 24
- operand[1] is at a1 - 16
- operand[2] is at a1 - 8
A 2-operand node occupies 40 bytes total (16 operand bytes + 24 header bytes). A node with opcode 0x1B and 5 operands requires approximately 88 bytes (40 operand bytes + ~48 header bytes including extra fields).
Tagged Pointer Semantics
The context_ptr at offset +16 uses low-bit tagging to encode indirection:
- Bit [2] = 0: pointer is a direct reference to the context object.
- Bit [2] = 1: pointer is an indirect reference (pointer-to-pointer); one extra dereference recovers the context object.

Bits [1:0] are masked off together with bit 2, but the decompiled paths only ever test bit 2.
The decompiled dereferencing pattern:
v = *(a1 + 16) & 0xFFFFFFFFFFFFFFF8; // mask off tag bits
if (*(a1 + 16) & 4) // bit 2 set = indirect
v = *v; // one extra dereference
This technique saves a field by encoding the indirection flag inside the pointer itself, relying on 8-byte alignment guarantees.
Opcode Dispatch Table
The uniquing function sub_162D4F0 performs a byte-level switch on *(_BYTE *)a1. Each case extracts the tagged context pointer, dereferences it, then probes an opcode-specific DenseMap for a structurally identical node.
Uniquing Opcode Dispatch (sub_162D4F0, 49KB)
The opcodes fall into two categories: "simple" opcodes that use sub-function tables at fixed stride, and "complex" opcodes that use dedicated DenseMap instances at individually-known offsets.
Simple opcodes (0x04--0x15) -- These 18 opcodes (all but 0x10, which is special-cased to a DenseMap) share a uniform dispatch pattern. Each routes to a sub-function table at a fixed byte offset within the context object, spaced 32 bytes apart:
| Opcode | Context Byte Offset | Semantic Category |
|---|---|---|
| 0x04 | +496 | Type / value constant |
| 0x05 | +528 | Binary operation |
| 0x06 | +560 | (simple node) |
| 0x07 | +592 | (simple node) |
| 0x08 | +624 | (simple node) |
| 0x09 | +656 | Undef / poison |
| 0x0A | +688 | (simple node) |
| 0x0B | +720 | (simple node) |
| 0x0C | +752 | (simple node) |
| 0x0D | +784 | Integer constant |
| 0x0E | +816 | FP constant |
| 0x0F | +848 | Constant expression |
| 0x10 | -- | Special: uses DenseMap at qw[178] |
| 0x11 | +912 | (simple node) |
| 0x12 | +944 | (simple node) |
| 0x13 | +976 | Struct / aggregate type |
| 0x14 | +1008 | (simple node) |
| 0x15 | +1072 | (simple node) |
Each sub-function table entry at these offsets is a 32-byte structure containing the callback address and metadata for hash-table probing.
Complex opcodes (0x16--0x22) -- These opcodes each own a full DenseMap within the context object. Each DenseMap occupies 4 qwords at the indicated base, plus associated dword counters:
| Opcode | QWord Base | Byte Offset | DenseMap Dwords | Identified Semantic |
|---|---|---|---|---|
| 0x16 | qw[130] | +1040 | dw[264..266] | Metadata node |
| 0x17 | -- | +1104 | -- | (simple-table path at +1104) |
| 0x18 | -- | +1136 | -- | Alloca (bitcode 0x18/0x58) |
| 0x19 | -- | -- | -- | Load |
| 0x1A | qw[146] | +1168 | dw[296..298] | Branch (br) |
| 0x1B | qw[150] | +1200 | dw[304..306] | Switch |
| 0x1C | qw[154] | +1232 | dw[312..314] | Invoke (reads subopcode) |
| 0x1D | qw[158] | +1264 | dw[320..322] | Unreachable / resume (reads subopcode) |
| 0x1E | qw[162] | +1296 | dw[328..330] | LandingPad (reads subopcode) |
| 0x1F | qw[166] | +1328 | dw[336..338] | Call instruction |
| 0x20 | -- | -- | -- | PHI node |
| 0x21 | -- | -- | -- | IndirectBr |
| 0x22 | qw[178] | +1424 | dw[360..362] | Special (extra_A = ptr) |
Opcodes 0x1C, 0x1D, and 0x1E read the subopcode field at *(unsigned __int16 *)(a1 + 2) as part of the hash key, because these node types require the intrinsic ID to distinguish structurally identical nodes with different semantic meaning.
Hash Function
Every DenseMap in the uniquing tables uses the same hash:
hash(ptr) = (ptr >> 9) ^ (ptr >> 4)
Hash computation for multi-operand nodes (sub_15B3480) extends this by combining the hash of each operand pointer with a mixing step. The hash seed is the opcode byte, then each operand is folded in:
seed ^= hash(operand[i]) + 0x9E3779B9 + (seed << 6) + (seed >> 2);
Sentinel values: empty = -8 (0xFFFFFFFFFFFFFFF8), tombstone = -16 (0xFFFFFFFFFFFFFFF0).
Node Erasure (sub_1621740, 14KB)
The mirror of insertion. Dispatches by the same opcode byte, finds the node in the corresponding DenseMap, overwrites the bucket with the tombstone sentinel (-16), and decrements NumItems while incrementing NumTombstones. When tombstone count exceeds NumBuckets >> 3, a rehash at the same capacity is triggered to reclaim tombstone slots.
Bitcode Instruction Opcode Table
NVIDIA uses LLVM's standard instruction opcode numbering with minor adjustments. The bitcode reader sub_166A310 / sub_151B070 (parseFunctionBody, 60KB/123KB) dispatches on a contiguous range. The NVVM verifier sub_2C80C90 confirms the mapping via its per-opcode validation switch:
| Opcode | Hex | LLVM Instruction | Verifier Checks |
|---|---|---|---|
| 0x0B | 11 | ret | -- |
| 0x0E | 14 | br | -- |
| 0x0F | 15 | switch | -- |
| 0x15 | 21 | invoke | "invoke" unsupported via sub_2C76F10 |
| 0x18 | 24 | alloca | Alignment <= 2^23; AS must be Generic |
| 0x19 | 25 | load | -- |
| 0x1A | 26 | br (cond) | Validates "Branch condition is not 'i1' type!" |
| 0x1B | 27 | switch (extended) | -- |
| 0x1C | 28 | invoke (extended) | -- |
| 0x1D | 29 | unreachable | -- |
| 0x1E | 30 | resume | -- |
| 0x1F | 31 | call | Pragma metadata validation |
| 0x20 | 32 | phi | -- |
| 0x21 | 33 | indirectbr | "indirectbr" unsupported |
| 0x22 | 34 | call (variant) | Validates callee type signature |
| 0x23 | 35 | resume (verifier) | "resume" unsupported |
| 0x23--0x34 | 35--52 | Binary ops (add/sub/mul/div/rem/shift/logic) | -- |
| 0x35--0x38 | 53--56 | Casts (trunc/zext/sext/fpcast) | -- |
| 0x3C | 60 | alloca | Alignment and address-space checks |
| 0x3D | 61 | load | Atomic loads rejected; tensor memory AS rejected |
| 0x3E | 62 | store | Atomic stores rejected; tensor memory AS rejected |
| 0x40 | 64 | fence | Only acq_rel/seq_cst in UnifiedNVVMIR mode |
| 0x41 | 65 | cmpxchg | Only i32/i64/i128; must be generic/global/shared AS |
| 0x42 | 66 | atomicrmw | Address space validation |
| 0x4F | 79 | addrspacecast | "Cannot cast non-generic to different non-generic" |
| 0x55 | 85 | Intrinsic call | Routes to sub_2C7B6A0 (143KB verifier) |
| 0x58 | 88 | alloca (inalloca) | Same as 0x18 |
| 0x5F | 95 | landingpad | "landingpad" unsupported |
The binary opcodes in the 0x23--0x34 range follow LLVM's BinaryOperator numbering:
| Opcode | Hex | Operation | IRBuilder Helper |
|---|---|---|---|
| 0x23 | 35 | add | -- |
| 0x24 | 36 | fadd | -- |
| 0x25 | 37 | sub | -- |
| 0x26 | 38 | fsub | -- |
| 0x27 | 39 | mul | -- |
| 0x28 | 40 | fmul | -- |
| 0x29 | 41 | udiv | -- |
| 0x2A | 42 | sdiv | -- |
| 0x2B | 43 | fdiv | -- |
| 0x2C | 44 | urem | -- |
| 0x2D | 45 | srem | -- |
| 0x2E | 46 | frem | -- |
| 0x2F | 47 | shl | -- |
| 0x30 | 48 | lshr | -- |
| 0x31 | 49 | ashr | -- |
| 0x32 | 50 | and | -- |
| 0x33 | 51 | or | -- |
| 0x34 | 52 | xor | -- |
InstCombine Internal Opcode Table
The InstCombine mega-visitor sub_10EE7A0 (405KB, the single largest function in cicc) uses a different opcode numbering -- the full LLVM Instruction::getOpcode() values rather than the bitcode record codes. These are accessed via sub_987FE0 (getOpcode equivalent). Key ranges observed:
| Opcode Range | LLVM Instructions |
|---|---|
| 0x0B | Ret |
| 0x0E | Br |
| 0x0F | Switch |
| 0x15 | Invoke |
| 0x1A | Unreachable |
| 0x3F | FNeg |
| 0x41--0x43 | Add, FAdd, Sub |
| 0x99 | GetElementPtr |
| 0xAA | Trunc |
| 0xAC--0xAE | ZExt, SExt, FPToUI |
| 0xB4--0xB5 | PtrToInt, IntToPtr |
| 0xCF--0xD2 | ICmp, FCmp, PHI, Call |
| 0xE3--0xEB | VAArg, ExtractElement, InsertElement, ShuffleVector, ExtractValue, InsertValue |
| 0x11A | Fence |
| 0x11D | AtomicCmpXchg |
| 0x125 | AtomicRMW |
| 0x134--0x174 | FPTrunc, FPExt, UIToFP, Alloca, Load, Store, FMul, UDiv, SDiv, ... |
| 0x17D--0x192 | BitCast, Freeze, LandingPad, CatchSwitch, CatchRet, CallBr, ... |
| 0x2551, 0x255F, 0x254D | NVIDIA custom intrinsic operations |
The NVIDIA custom opcodes (0x2551, 0x255F, 0x254D) are in a range far above standard LLVM and handle CUDA-specific operations (texture, surface, or warp-level ops encoded as custom IR nodes) that have no upstream LLVM equivalent.
NVVM Context Object
The context object referenced by context_ptr is a large structure (~3,656 bytes, a size confirmed by the 97KB destructor sub_B76CB0) containing uniquing tables for every NVVM opcode, plus type caches, metadata interning tables, and allocator state.
Context Layout Overview
| Byte Offset | Size | Field | Description |
|---|---|---|---|
| +0..+200 | 200B | Core state | Module pointer, allocator, flags |
| +200 | 8B | vtable_0 | Points to unk_49ED3E0 |
| +224 | 8B | vtable_1 | Points to unk_49ED440 |
| +248 | 8B | vtable_2 | Points to unk_49ED4A0 |
| +272..+792 | 520B | Hash table array | 16 DenseMaps freed at stride 32 by destructor |
| +496..+1136 | 640B | Simple opcode tables | 18 sub-function tables, 32B each (opcodes 0x04..0x15) |
| +1040..+1424 | 384B | Complex opcode DenseMaps | Dedicated DenseMaps for opcodes 0x16..0x22 |
| +1424..+2800 | ~1376B | Extended tables | Additional hash tables, type caches, metadata maps |
| +2800..+3656 | ~856B | Allocator state | Bump allocator slabs, counters, statistics |
Simple Opcode Table Region (+496..+1136)
The 18 entries for opcodes 0x04 through 0x15 (plus a few extras) are 32-byte structures at fixed offsets:
struct SimpleOpcodeTable {
void *buckets; // +0: heap-allocated bucket array
int32 num_items; // +8: live entry count
int32 num_tombstones; // +12: tombstone count
int32 num_buckets; // +16: always power-of-2
int32 reserved; // +20: padding
void *callback; // +24: hash-insert function pointer (or NULL)
};
Byte offsets increase monotonically: +496, +528, +560, +592, +624, +656, +688, +720, +752, +784, +816, +848, +880, +912, +944, +976, +1008, +1072, +1104, +1136.
Complex Opcode DenseMap Region (+1040..+1424)
Each DenseMap for a complex opcode occupies 4 qwords plus associated dword counters:
struct OpcodeUniqueMap {
int64 num_entries; // qw[N]: includes tombstones
void *buckets; // qw[N+1]: heap-allocated bucket array
int32 num_items; // dw[2*N + offset]: live entries
int32 num_tombstones; // dw[2*N + offset + 1]: tombstone count
int32 num_buckets; // dw[2*N + offset + 2]: capacity (power-of-2)
};
Complete mapping:
| Opcode | qw Base | Byte Offset (qw) | dw Counters | Byte Offset (dw) |
|---|---|---|---|---|
| 0x16 | qw[130] | +1040 | dw[264..266] | +2112..+2120 |
| 0x1A | qw[146] | +1168 | dw[296..298] | +2368..+2376 |
| 0x1B | qw[150] | +1200 | dw[304..306] | +2432..+2440 |
| 0x1C | qw[154] | +1232 | dw[312..314] | +2496..+2504 |
| 0x1D | qw[158] | +1264 | dw[320..322] | +2560..+2568 |
| 0x1E | qw[162] | +1296 | dw[328..330] | +2624..+2632 |
| 0x1F | qw[166] | +1328 | dw[336..338] | +2688..+2696 |
| 0x10 | qw[178] | +1424 | dw[360..362] | +2880..+2888 |
Destructor (sub_1608300, 90KB)
The context destructor confirms the layout by freeing resources in order:
- Calls j___libc_free_0 on bucket pointers at offsets +272 through +792 (stride 32) -- frees all 16 simple opcode hash tables.
- Destroys sub-objects via sub_16BD9D0, sub_1605960, and sub_16060D0 -- these tear down the complex DenseMap instances and any heap-allocated overflow chains.
- Releases vtable-referenced objects at offsets +200, +224, +248.
The separate LLVMContext destructor (sub_B76CB0, 97KB) frees 28+ hash tables from the full ~3,656-byte context structure, confirming that the uniquing tables are only part of the overall context.
Type Tag System
The context object's hash tables also serve as uniquing tables for type nodes. The byte at offset +16 in each IR node encodes the type tag (distinct from the opcode byte at +0):
| Type Tag | Meaning | Notes |
|---|---|---|
| 5 | Instruction / expression | Binary ops, comparisons |
| 8 | Constant aggregate | ConstantArray, ConstantStruct |
| 9 | Undef / poison | UndefValue |
| 13 | Integer constant | APInt at +24, bitwidth at +32 |
| 14 | FP constant | APFloat storage |
| 15 | Constant expression | ConstantExpr (GEP, cast, etc.) |
| 16 | Struct / aggregate type | Element list at +32 |
| 17 | MDTuple / metadata node | Metadata tuple |
| 37 | Comparison instruction | ICmp / FCmp predicate |
The type tag at +16 is used by InstCombine (sub_1743DA0) and many other passes to quickly classify nodes without reading the full opcode. The observed range is 5--75, considerably denser than standard LLVM's Value subclass IDs.
Instruction Creation Helpers
NVIDIA's LLVM fork provides a set of instruction creation functions that allocate nodes from the bump allocator, insert them into the appropriate uniquing table, and update use-lists. These are the core IR mutation API:
Primary Instruction Factories
| Address | Size | Signature | LLVM Equivalent |
|---|---|---|---|
| sub_B504D0 | -- | (opcode, op0, op1, state, 0, 0) | BinaryOperator::Create / IRBuilder::CreateBinOp |
| sub_B50640 | -- | (val, state, 0, 0) | Result-typed instruction / CreateNeg wrapper |
| sub_B51BF0 | -- | (inst, src, destTy, state, 0, 0) | IRBuilder::CreateZExtOrBitCast |
| sub_B51D30 | -- | (opcode, src, destTy, state, 0, 0) | CmpInst::Create / IRBuilder::CreateCast |
| sub_B52190 | -- | (...) | BitCastInst::Create |
| sub_B52260 | -- | (...) | GetElementPtrInst::Create (single-index) |
| sub_B52500 | -- | (...) | CastInst::Create with predicate |
| sub_B33D10 | -- | (ctx, intrinsicID, args, numArgs, ...) | IRBuilder::CreateIntrinsicCall |
| sub_BD2DA0 | -- | (80) | Instruction::Create (allocates 80-byte IR node) |
| sub_BD2C40 | -- | (72, N) | Instruction::Create (72-byte base, N operands) |
Opcode Constants for Creation
These numeric opcode values are passed as the first argument to sub_B504D0:
| Value | Operation | Example |
|---|---|---|
| 13 | Sub | sub_B504D0(13, a, b, ...) |
| 15 | FNeg / FSub variant | sub_B504D0(15, ...) |
| 18 | SDiv | sub_B504D0(18, ...) |
| 21 | FMul | sub_B504D0(21, ...) |
| 25 | Or | sub_B504D0(25, ...) |
| 26 | And | sub_B504D0(26, a, mask, ...) |
| 28 | Xor | sub_B504D0(28, ...) |
| 29 | Add | sub_B504D0(29, ...) |
| 30 | Sub | sub_B504D0(30, zero, operand) |
| 32 | Shl | sub_B504D0(32, ...) |
| 33 | AShr | sub_B504D0(33, ...) |
| 38 | And (FP context) | sub_B504D0(38, ...) |
| 40 | ZExt (via sub_B51D30) | sub_B51D30(40, source, resultType) |
| 49 | CastInst | sub_B51D30(49, src, destTy, ...) |
Node Builder / Cloner (sub_16275A0, 21KB)
The IR builder at sub_16275A0 creates new nodes by cloning operand lists from a source node, using the tagged pointer Use-list encoding described above. It dispatches to three specialized constructors:
| Address | Role |
|---|---|
| sub_1627350 | Multi-operand node create (MDTuple::get equivalent). Takes (ctx, operand_array, count, flag0, flag1). Called 463+ times from func-attrs and metadata passes. |
| sub_15B9E00 | Binary node create. Fixed 2-operand layout, minimal header. |
| sub_15C4420 | Variadic node create. Variable operand count, allocates backward operand storage. |
All three ultimately route through the uniquing function sub_162D4F0 to deduplicate structurally-identical nodes.
Infrastructure Functions
| Address | Call Count | Role |
|---|---|---|
| sub_1623A60 | 349x | IRBuilder::CreateBinOp or SCEV type extension |
| sub_1623210 | 337x | IRBuilder::CreateUnaryOp or SCEV use registration |
| sub_15FB440 | 276x | Create node with 5 args: (opcode, type, op1, op2, flags) |
| sub_161E7C0 | 463x | Node accessor / property query (most-called IR function) |
| sub_164B780 | 336x | Use-chain linked list manipulation |
| sub_1648A60 | 406x | Memory allocator: (size, alignment) |
Allocation
NVVM IR nodes are allocated from a slab-based bump allocator:
- Slab growth: 4096 << (slab_index >> 7) -- exponential, capped at 4 TB.
- Alignment: 8 bytes (pointer aligned via (ptr + 7) & ~7).
- Deallocation: no individual free; entire slabs are released at once.
- Overflow: triggers a new slab via malloc().
This is the standard LLVM BumpPtrAllocator pattern, consistent with how upstream LLVM manages IR node lifetimes. The lack of per-node deallocation means the NVVM frontend cannot reclaim memory for dead nodes until the entire context is destroyed.
Cross-References
- DenseMap / Hash Infrastructure -- universal hash function and DenseMap layout
- DAG Node -- SelectionDAG-level node layout (104-byte SDNode)
- NVVM Container -- the NVVMPassOptions/container that wraps the context
- Bitcode I/O -- bitcode opcode encoding and parseFunctionBody
- InstCombine -- the 405KB mega-visitor that consumes these nodes
- NVVM Verifier -- per-opcode validation rules
Function Map
| Role | Address | Size | Notes |
|---|---|---|---|
| Node uniquing: lookup-or-insert, opcode dispatch | sub_162D4F0 | 49KB | -- |
| Node erase from uniquing tables (tombstone writer) | sub_1621740 | 14KB | -- |
| IR builder / node cloner | sub_16275A0 | 21KB | -- |
| Multi-operand node create (MDTuple::get) | sub_1627350 | -- | -- |
| Binary node create | sub_15B9E00 | -- | -- |
| Variadic node create | sub_15C4420 | -- | -- |
| Hash computation for multi-operand nodes | sub_15B3480 | -- | -- |
| Context destructor (frees 20+ hash tables) | sub_1608300 | 90KB | -- |
| LLVMContext destructor (~3,656-byte object) | sub_B76CB0 | 97KB | -- |
| BinaryOperator::Create / IRBuilder::CreateBinOp | sub_B504D0 | -- | -- |
| Result-typed instruction create / CreateNeg | sub_B50640 | -- | -- |
| IRBuilder::CreateZExtOrBitCast | sub_B51BF0 | -- | -- |
| CmpInst::Create / IRBuilder::CreateCast | sub_B51D30 | -- | -- |
| BitCastInst::Create | sub_B52190 | -- | -- |
| GetElementPtrInst::Create (single-index) | sub_B52260 | -- | -- |
| CastInst::Create with predicate | sub_B52500 | -- | -- |
| IRBuilder::CreateIntrinsicCall | sub_B33D10 | -- | -- |
| Instruction::Create (80-byte allocation) | sub_BD2DA0 | -- | -- |
| Instruction::Create (variable-size) | sub_BD2C40 | -- | -- |
| create_empty_ir_node (204 callers, EDG front-end) | sub_72C9A0 | -- | -- |
| IR builder / node constructor (349x calls) | sub_1623A60 | -- | -- |
| IR builder / node constructor variant (337x calls) | sub_1623210 | -- | -- |
| Create node with 5 args (276x calls) | sub_15FB440 | -- | -- |
| Node accessor / property query (463x calls) | sub_161E7C0 | -- | -- |
| BitcodeReader::parseFunctionBody (stock LLVM) | sub_166A310 | 60KB | -- |
| parseFunctionBody (two-phase compilation path) | sub_151B070 | 123KB | -- |
| parseFunctionBody (standalone libNVVM path) | sub_9F2A40 | 185KB | -- |
| InstCombinerImpl::visitInstruction (full opcode switch) | sub_10EE7A0 | 405KB | -- |
| InstCombine master visit dispatcher | sub_F2CFA0 | -- | -- |
| NVVMModuleVerifier (per-opcode validation) | sub_2C80C90 | 51KB | -- |
Instruction Constraint Table (Pattern Database)
The instruction selection backend in cicc v13.0 uses a global constraint table to map target opcodes to their operand requirements. This table drives the sub_B612D0 constraint emission function (104KB), which consults a packed 16-bit word array to determine register classes and constraint patterns for each machine instruction. The constraint table is the single authoritative source of truth for every NVPTX MachineInstr's register requirements -- any reimplementation of the backend codegen must reproduce it exactly.
Global Table: word_3F3E6C0
The constraint table is a statically allocated array of 16-bit words in the .data section at address 0x3F3E6C0, indexed by (opcode - 1). Each entry packs two pieces of information into a single 16-bit word:
| Bits | Field | Meaning |
|---|---|---|
| Low byte (bits 0..7) | constraint_class | Index into the constraint switch (0x00..0xB2) |
| High byte (bits 8..15) | register_class_id | Target register class for the result |
The access pattern from sub_B612D0:
// sub_B612D0(a1, a2) where a2 = MachineInstr opcode
v4 = HIBYTE(word_3F3E6C0[a2 - 1]); // register class for output
switch (LOBYTE(word_3F3E6C0[a2 - 1])) // constraint class -> switch case
There are exactly 179 distinct constraint classes (0x00 through 0xB2), each encoding a specific operand pattern for a category of instructions. Multiple opcodes can share the same constraint class if they have identical operand signatures.
Constraint Descriptor Layout
Each constraint descriptor is a stack-allocated array of 16-byte entries built within sub_B612D0's frame. The frame is approximately 0x160 bytes deep. Stack slots span [rsp-0x158] through [rsp-0x20]:
| Offset | Size | Field |
|---|---|---|
| +0 | 4B | constraint_kind (int32) |
| +4 | 4B | (padding / alignment) |
| +8 | 8B | value (int64: register class ID or operand reference) |
Entry stride: 16 bytes (8-byte aligned pairs of {int32 kind, int32 pad, int64 value}).
The constraint_kind values determine the role of each entry in the descriptor array:
| Kind | Meaning |
|---|---|
| -1 | Output/result operand (always the last entry in the array) |
| 0 | Input operand at position 0 |
| 1 | Input operand at position 1 |
| 2 | Input operand at position 2 |
| 3..N | Input operands at higher positions |
The output entry (kind = -1) carries the result register class. Input entries carry the register class constraint for each source operand. The maximum observed operand count is 17 (constraint class 0xB0, corresponding to opcode 176 in the table), requiring 18 descriptor entries = 288 bytes of stack space.
Register Class IDs
The register_class_id in the high byte maps to NVIDIA GPU register files. Values recovered from sub_A778C0 (register class constraint creator), sub_B5BA00 (register class set builder, 111 cases), and sub_2163730 (PTX emission naming):
These IDs are specific to the pattern database constraint system and differ from the 4-bit class tags used in register encoding (see Register Classes for vtable addresses, PTX types, prefixes, and encoded IDs).
| ID | Register Class | Width |
|---|---|---|
| 14 | Int32Regs (%r) | 32 bits |
| 22 | Int16Regs (%rs) | 16 bits |
| 24 | Int16HalfRegs (%h) | 16 bits (f16/bf16) |
| 27 | Int32HalfRegs (%hh) | 32 bits (v2f16/v2bf16) |
| 29 | (unidentified) | -- |
| 32 | (unidentified) | -- |
| 36 | (unidentified) | -- |
| 39 | (unidentified) | -- |
| 40 | Float32Regs (%f) | 32 bits |
| 41 | (unidentified) | -- |
| 43 | Float16Regs (%h, alias of Int16HalfRegs) | 16 bits |
| 50 | Int64Regs (%rd) | 64 bits |
| 51 | Float64Regs (%fd) | 64 bits |
| 52 | Int128Regs (%rq) | 128 bits |
| 67 | (unidentified) | -- |
| 72 | (unidentified) | -- |
| 76 | (unidentified) | -- |
| 78 | Int1Regs (%p) | 1 bit |
| 86 | SpecialRegs (internal-only, off_4A026E0) | varies |
IDs 29, 32, 36, 39, 41, 67, 72, 76 appear in the sub_B612D0 table but have not been definitively mapped to named register classes. They likely correspond to sub-register classes, tied-operand classes, or WMMA accumulator classes that cicc defines beyond the 9 primary classes documented in reference/register-classes.md.
Constraint Type Classification
A secondary classification table at byte_3F252E0 categorizes constraint entries into four families (recovered from sub_A7A6D0 constraint merge/intersection logic at 0xA78000):
| Classification Byte | Family | Applies To |
|---|---|---|
| 0x00 | Simple/scalar | Single-register operands; the vast majority of ALU constraints |
| 0x08 | Ordered | Operands with fixed positional requirements (tied operands) |
| 0x10 | Sized/ranged | Operands with explicit bit-width requirements (sub-register extracts) |
| 0x18 | Compound | Multi-register operands; types 86-97 in the classification table |
The merge function sub_A7A6D0 (7KB) performs set intersection across constraint families when two constraint sets must be unified (e.g., during register coalescing or inline asm constraint resolution). The "compound" family (0x18) covers instructions that require register pairs or wider groupings -- tensor core MMA instructions fall into this category.
Key Sub-Functions
The constraint emission pipeline involves these collaborating functions:
| Address | Size | Function | Purpose |
|---|---|---|---|
| sub_A778C0 | -- | createRegClassConstraint(a1, regclass, flags) | Build a register-class constraint entry; stores class ID in value field |
| sub_A77AD0 | -- | createAnyRegConstraint(a1, flags) | Build an "any register" constraint (unconstrained operand) |
| sub_A79C90 | -- | composeConstraints(a1, &desc, N) | Compose N descriptor entries into a single constraint record |
| sub_A7A6D0 | 7KB | mergeConstraints(a1, a2) | Merge/intersect two constraint sets using byte_3F252E0 classification |
| sub_B5BA00 | 21KB | createOutputConstraint(a1, regclass_id) | Build the output register constraint; 111-case switch on class ID |
| sub_A78010 | -- | emitConstraint(a1, &desc_array, N) | Emit the final constraint with N entries to the instruction descriptor |
| sub_B612D0 | 104KB | emitInstrConstraint(a1, opcode) | Top-level: lookup word_3F3E6C0, dispatch on constraint class, build and emit |
The sub_B5BA00 function (21KB) is itself a 111-case switch that translates register class IDs into the internal constraint representation. It produces the value field for output constraint entries. Its size suggests that it handles not just the 9 primary register classes but also sub-register classes, paired classes, and special accumulator classes for tensor operations.
Constraint Switch Structure
The 179-case switch in sub_B612D0 is the heart of the pattern database. Each case constructs a fixed sequence of constraint descriptors on the stack, then calls sub_A78010 to emit them. The cases can be organized into major families based on operand count and register class patterns.
Family 1: Unary Instructions (1 input, 1 output)
These are the simplest constraints: one input operand and one result. Two descriptor entries (32 bytes on stack). Representative constraint classes:
// Constraint class 0x01 — Unary ALU, same type in/out
// Example: MOV, NEG, NOT, ABS for Int32Regs
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x0E01 (class=0x01, regclass=14=Int32)
case 0x01:
desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) } // input[0]: same class as output
desc[1] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: regclass from high byte
sub_A78010(a1, desc, 2)
Constraint classes in this family include 0x01 through approximately 0x08, covering unary operations across all scalar register classes. The register class v4 (from the high byte) determines whether the instruction operates on Int32, Int64, Float32, Float64, Pred, or another class. The same constraint class is reused for multiple opcodes that share the same operand signature.
Family 2: Binary ALU Instructions (2 inputs, 1 output)
The most common family. Three descriptor entries (48 bytes on stack). Covers all two-operand arithmetic and logic instructions:
// Constraint class 0x09 — Binary ALU, all same type
// Example: ADD, SUB, MUL, AND, OR, XOR for Int32Regs
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x0E09 (class=0x09, regclass=14=Int32)
case 0x09:
desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) } // input[0]: Int32
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: Int32
desc[2] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: Int32
sub_A78010(a1, desc, 3)
Variants within this family differ in whether inputs are constrained to the same class as the output or to a different class. For instance, shift instructions constrain the shift amount (input[1]) to Int32 regardless of the data type of input[0]:
// Constraint class 0x0C — Binary with mixed types (shift-like)
// Example: SHL.b64, SHR.b64 (data=Int64, shift_amount=Int32)
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x320C (class=0x0C, regclass=50=Int64)
case 0x0C:
desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) } // input[0]: Int64 (data)
desc[1] = { kind=1, value=sub_A778C0(a1, 14, 0) } // input[1]: Int32 (shift amount)
desc[2] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: Int64
sub_A78010(a1, desc, 3)
Family 3: Comparison / Predicate-Producing Instructions (2 inputs, predicate output)
Comparison instructions produce a predicate register result regardless of the input type. Three descriptor entries:
// Constraint class 0x10 — Compare, predicate output
// Example: SETP.EQ.s32, SETP.LT.f32
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x4E10 (class=0x10, regclass=78=Pred)
case 0x10:
desc[0] = { kind=0, value=sub_A778C0(a1, <input_class>, 0) } // input[0]: operand type
desc[1] = { kind=1, value=sub_A778C0(a1, <input_class>, 0) } // input[1]: operand type
desc[2] = { kind=-1, value=sub_B5BA00(a1, 78) } // output: Pred (%p)
sub_A78010(a1, desc, 3)
The input register class is determined by the instruction variant (integer comparison vs. float comparison), while the output is always predicate register class 78.
Family 4: Ternary / FMA Instructions (3 inputs, 1 output)
Fused multiply-add and select instructions require four descriptor entries (64 bytes on stack):
// Constraint class 0x18 — Ternary FMA, all same float type
// Example: FMA.RN.f32 (a * b + c)
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x2818 (class=0x18, regclass=40=Float32)
case 0x18:
desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) } // input[0]: Float32 (a)
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: Float32 (b)
desc[2] = { kind=2, value=sub_A778C0(a1, v4, 0) } // input[2]: Float32 (c)
desc[3] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: Float32 (result)
sub_A78010(a1, desc, 4)
Select/conditional-move instructions also fall here, with one predicate input and two data inputs:
// Constraint class 0x1A — Select (pred, trueval, falseval)
// Example: SELP.b32 (predicated select)
case 0x1A:
desc[0] = { kind=0, value=sub_A778C0(a1, 78, 0) } // input[0]: Pred (condition)
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: data (true value)
desc[2] = { kind=2, value=sub_A778C0(a1, v4, 0) } // input[2]: data (false value)
desc[3] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: data (selected)
sub_A78010(a1, desc, 4)
Family 5: Memory Instructions (load/store with address operands)
Load instructions produce a data result from an address operand. Store instructions consume both data and address. These constraint classes handle the different address space qualifiers and vector widths:
// Constraint class 0x20 — Scalar load from address
// Example: LD.GLOBAL.b32 (global memory load)
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x0E20 (class=0x20, regclass=14=Int32)
case 0x20:
desc[0] = { kind=0, value=sub_A778C0(a1, 50, 0) } // input[0]: Int64 (address pointer)
desc[1] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: Int32 (loaded data)
sub_A78010(a1, desc, 2)
Vector load variants (LoadV2, LoadV4) use additional output entries for each vector lane:
// Constraint class 0x22 — Vector load V2 (two-element)
// Example: LD.GLOBAL.V2.b32 (load 2x Int32)
case 0x22:
desc[0] = { kind=0, value=sub_A778C0(a1, 50, 0) } // input[0]: Int64 (address)
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: (offset/predicate)
desc[2] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: data element 0
// Second output encoded separately via sub_A79C90 composition
sub_A78010(a1, desc, 3)
Store instructions have no result output (kind = -1 carries a sentinel value or void class):
// Constraint class 0x28 — Scalar store
// Example: ST.GLOBAL.b32 (global memory store)
case 0x28:
desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) } // input[0]: data to store
desc[1] = { kind=1, value=sub_A778C0(a1, 50, 0) } // input[1]: Int64 (address)
desc[2] = { kind=-1, value=sub_B5BA00(a1, 86) } // output: SpecialRegs (chain/token)
sub_A78010(a1, desc, 3)
Family 6: Type Conversion Instructions (input and output differ)
Conversion instructions have an input class that differs from the output class. The constraint class encodes the specific pair:
// Constraint class 0x30 — CVT from Int32 to Float32
// Example: CVT.RN.f32.s32
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x2830 (class=0x30, regclass=40=Float32)
case 0x30:
desc[0] = { kind=0, value=sub_A778C0(a1, 14, 0) } // input[0]: Int32 (source)
desc[1] = { kind=-1, value=sub_B5BA00(a1, 40) } // output: Float32 (result)
sub_A78010(a1, desc, 2)
// Constraint class 0x32 — CVT from Float64 to Int64
// Example: CVT.RTZ.s64.f64
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x3232 (class=0x32, regclass=50=Int64)
case 0x32:
desc[0] = { kind=0, value=sub_A778C0(a1, 51, 0) } // input[0]: Float64 (source)
desc[1] = { kind=-1, value=sub_B5BA00(a1, 50) } // output: Int64 (result)
sub_A78010(a1, desc, 2)
Widening/narrowing conversions between integer sizes and float-to-half conversions each have their own constraint class.
Family 7: Copy / Move Instructions (register transfer)
The copy family (opcodes 440-503) maps to constraint classes that encode same-class and cross-class register transfers:
// Constraint class 0x40 — Same-class copy
// Example: MOV.b32 (Int32 -> Int32)
// Used by opcodes 440-443 (type-preserving moves)
case 0x40:
desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) } // input[0]: same class
desc[1] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: same class
sub_A78010(a1, desc, 2)
// Constraint class 0x42 — Cross-class copy (Int32 <-> Float32)
// Example: MOV from Int32Regs to Float32Regs (bitcast-level move)
// Used by opcodes 444+ (cross-class moves)
case 0x42:
desc[0] = { kind=0, value=sub_A778C0(a1, <source_class>, 0) } // input: source class
desc[1] = { kind=-1, value=sub_B5BA00(a1, <dest_class>) } // output: dest class
sub_A78010(a1, desc, 2)
Cross-class copies are never coalesced by the register coalescer (they remain as explicit mov instructions in PTX output). The constraint table enforces this by assigning distinct source and destination classes.
Family 8: Call ABI Instructions (parameter declaration and passing)
The NVPTX calling convention uses special opcodes for .param space management. These have unique constraint classes with no data register operands:
// Constraint class 0x50 — DeclareParam (opcode 505)
// Declares a .param space allocation for function argument passing
case 0x50:
desc[0] = { kind=0, value=sub_A77AD0(a1, 0) } // input[0]: "any" (chain token)
desc[1] = { kind=-1, value=sub_B5BA00(a1, 86) } // output: SpecialRegs (chain)
sub_A78010(a1, desc, 2)
Call sequence opcodes (315=CallSeqBegin, 514=CallStart, 517=CallSeqEnd, 518=CallProto) all use constraint classes that operate on chain tokens rather than data registers. Their inputs and outputs are in the SpecialRegs class (ID 86).
Family 9: Atomic Instructions (address + data + result)
Atomic operations require an address, a data operand, and produce a result of the same data type:
// Constraint class 0x60 — Atomic RMW (read-modify-write)
// Example: ATOM.ADD.s32 (atomic add on Int32)
// Opcodes 294-297 (atom.add family)
case 0x60:
desc[0] = { kind=0, value=sub_A778C0(a1, 50, 0) } // input[0]: Int64 (address)
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: data (value to add)
desc[2] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: data (old value)
sub_A78010(a1, desc, 3)
Atomic compare-and-swap (opcode 462 = atom.cas) requires four operands (address, expected, desired, result):
// Constraint class 0x62 — Atomic CAS
// Example: ATOM.CAS.b32 (compare-and-swap)
case 0x62:
desc[0] = { kind=0, value=sub_A778C0(a1, 50, 0) } // input[0]: Int64 (address)
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: data (expected)
desc[2] = { kind=2, value=sub_A778C0(a1, v4, 0) } // input[2]: data (desired)
desc[3] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: data (old value)
sub_A78010(a1, desc, 4)
Family 10: Tensor Core / MMA Instructions (many inputs, many outputs)
The most complex constraint classes handle tensor core matrix operations. These instructions consume multiple register-pair or register-quad operands and produce multiple results. Constraint class 0xB0 is the extreme case with 17 input operands:
// Constraint class 0xB0 — Complex MMA (17 inputs, 1+ outputs)
// Example: tcgen05.mma variants (Blackwell, opcodes 4905-4940)
// This is the maximum-operand constraint class.
case 0xB0:
for (i = 0; i < 17; i++) {
desc[i] = { kind=i, value=sub_A778C0(a1, <operand_class[i]>, 0) }
}
desc[17] = { kind=-1, value=sub_B5BA00(a1, v4) }
sub_A78010(a1, desc, 18)
HMMA/IMMA/BMMA instructions (the SM70+ tensor core families at sub_21E0360-sub_21E2280) use constraint classes in the 0x90-0xAF range, typically with 4-8 register inputs (accumulator fragments) and 4-8 register outputs. The operand classes include Int32HalfRegs (ID 27) for packed f16 pairs and Int128Regs (ID 52) for wide accumulator state.
Family 11: Predicated Instructions (extra predicate input)
Many NVPTX instructions support predication, where execution is conditional on a predicate register. Predicated variants append an extra Pred-class input:
// Constraint class 0x70 — Predicated binary ALU
// Example: @%p0 ADD.s32 %r1, %r2, %r3 (conditional add)
case 0x70:
desc[0] = { kind=0, value=sub_A778C0(a1, 78, 0) } // input[0]: Pred (guard)
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: data (src0)
desc[2] = { kind=2, value=sub_A778C0(a1, v4, 0) } // input[2]: data (src1)
desc[3] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: data (result)
sub_A78010(a1, desc, 4)
Family 12: Special / Barrier Instructions (chain-only)
Barrier and synchronization instructions have no data operands. They operate purely on the chain token for ordering:
// Constraint class 0x80 — Barrier/Fence (chain-only)
// Example: BAR.SYNC (opcodes 287-290)
case 0x80:
desc[0] = { kind=0, value=sub_A77AD0(a1, 0) } // input[0]: "any" (chain in)
desc[1] = { kind=-1, value=sub_B5BA00(a1, 86) } // output: SpecialRegs (chain out)
sub_A78010(a1, desc, 2)
Pattern Matching Dispatch
The constraint table is consumed during instruction selection by the three-level dispatch hierarchy:
- Driver (sub_3090F90, 91KB): Builds a cost table for function arguments via hash(key*37), uses a min-heap priority queue for topological-order traversal, iterates with budget = 4 * numInstructions * maxBlockSize.
- Matcher (sub_308FEE0): Called per-SDNode from the driver. Dispatches to the hand-written selector or the TableGen-generated selector.
- Hand-written selector (sub_347A8D0, 309KB): Giant switch on ISD/NVPTXISD opcodes. Calls sub_969240 (SDNode accessor) 263 times. Recursive with 42 self-calls. Handles tex/surf, wmma, atomics, barriers.
- TableGen-generated selector (sub_348D3E0, 256KB): Auto-generated from NVPTX .td instruction pattern definitions. Calls sub_969240 45 times, sub_32889F0 38 times.
- Complex addressing mode selector (sub_33D4EF0, 114KB): Handles NVPTX load/store addressing with address space qualifiers. Calls sub_969240 399 times -- the single function with the most SDNode accesses in the entire binary.
After pattern matching selects a MachineInstr opcode, the constraint table is queried via sub_B612D0 to determine register requirements. The selected opcode is the index into word_3F3E6C0.
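The 16-bit table entry packs both dispatch keys, as the worked examples above show (0x0E20, 0x2830, 0x3232). A minimal sketch of the decoding, with illustrative names (the binary has no recovered symbols for these fields):

```python
# Model of a word_3F3E6C0 entry: low byte selects the constraint-class
# switch case, high byte is the register class ID (held in v4).
def decode_constraint_entry(entry: int) -> tuple[int, int]:
    constraint_class = entry & 0xFF          # switch selector
    register_class_id = (entry >> 8) & 0xFF  # stored as v4
    return constraint_class, register_class_id

# 0x0E20 -> class 0x20 (scalar load), regclass 14 (Int32)
assert decode_constraint_entry(0x0E20) == (0x20, 14)
# 0x2830 -> class 0x30 (CVT s32->f32), regclass 40 (Float32)
assert decode_constraint_entry(0x2830) == (0x30, 40)
```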
Operand Binding
When the constraint emission function sub_B612D0 builds the descriptor array, operand binding follows this protocol:
1. Lookup: Read word_3F3E6C0[opcode - 1]. Extract constraint_class (low byte) and register_class_id (high byte, stored as v4).
2. Switch dispatch: Branch to the case for constraint_class.
3. Input construction: For each input operand position i:
   - Call sub_A778C0(a1, class_id, flags) to create a register-class constraint entry.
   - The class_id is either v4 (same class as output) or a hardcoded value (different class for mixed-type instructions).
   - The flags parameter encodes operand modifiers (tied, early-clobber, etc.).
   - Store the result in desc[i] with kind = i.
4. Output construction: Call sub_B5BA00(a1, v4) to create the output constraint. sub_B5BA00 is a 21KB function with 111 switch cases that translates the register class ID into the internal output representation. Store in desc[N] with kind = -1.
5. Emission: Call sub_A78010(a1, desc, N+1) to finalize. This function walks the descriptor array, validates constraint consistency, and writes the constraint record into the instruction's operand descriptor table.
For instructions that use sub_A77AD0 ("any register" constraint), the operand accepts any register class. This is used for chain tokens, inline asm operands with unconstrained registers, and certain special-purpose slots.
For composition of multi-output instructions, sub_A79C90 merges multiple descriptor sub-arrays into a single compound constraint. This is needed for vector loads (LoadV2, LoadV4) and MMA instructions that produce multiple result registers.
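The binding protocol above can be modeled in a few lines. This is an illustrative sketch, not recovered code: build_descriptors stands in for the sub_A778C0/sub_B5BA00 calls, and the dict entries mirror the {kind, value} pairs in the case bodies:

```python
# Inputs get kind = 0..N-1, the output entry gets kind = -1, and emission
# (sub_A78010 in the binary) receives N+1 descriptors.
def build_descriptors(input_classes: list[int], output_class: int) -> list[dict]:
    desc = [{"kind": i, "cls": c} for i, c in enumerate(input_classes)]
    desc.append({"kind": -1, "cls": output_class})
    return desc

# Constraint class 0x60 (atomic RMW): Int64 address (50), data v4, result v4
v4 = 14  # Int32
desc = build_descriptors([50, v4], v4)
assert len(desc) == 3 and desc[-1]["kind"] == -1
```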
Allocation
The global table word_3F3E6C0 is in the .data section, allocated at link time. It is read-only after cicc process startup. Constraint descriptors are purely stack-allocated within sub_B612D0's frame (approximately 0x160 bytes deep). No heap allocation occurs during constraint emission. This makes the constraint emission path allocation-free and safe for use in concurrent compilation (the function is reentrant as long as each thread has its own stack frame).
Cross-References
- reference/register-classes.md -- Authoritative register class table with encoding scheme
- reference/nvptx-opcodes.md -- NVPTX MachineInstr opcode inventory (consumers of this constraint table)
- llvm/isel-patterns.md -- ISel pattern matching that feeds opcodes into this table
- llvm/selectiondag.md -- SelectionDAG construction that precedes constraint emission
- llvm/register-allocation.md -- Greedy RA that consumes the emitted constraints
- llvm/register-coalescing.md -- Copy family opcodes 440-503 and coalescing constraints
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
createRegClassConstraint | sub_A778C0 | -- | Build register-class input constraint entry |
createAnyRegConstraint | sub_A77AD0 | -- | Build unconstrained ("any") input constraint |
composeConstraints | sub_A79C90 | -- | Merge N descriptor entries into compound constraint |
mergeConstraints | sub_A7A6D0 | 7KB | Set-intersection of constraints using byte_3F252E0 |
emitConstraint | sub_A78010 | -- | Finalize and emit constraint record |
createOutputConstraint | sub_B5BA00 | 21KB | 111-case switch: class ID to output representation |
emitInstrConstraint | sub_B612D0 | 104KB | Top-level: 179-case constraint class dispatch |
decodeOperandType | sub_B6B200 | 44KB | 101-case operand type decoder from bytecode stream |
SelectionDAG Node Structure
The SelectionDAG (SDNode) is the central data structure in cicc's code generation backend. Nodes represent operations in the target-independent DAG before instruction selection lowers them to machine instructions. The DAG builder (sub_2081F00, 267KB) converts LLVM IR into an initial DAG by visiting each IR instruction through a dispatch chain rooted at sub_2065D30. Nodes are deduplicated via a CSE hash table (sub_F4CEE0, 41KB) and allocated from a bump allocator embedded in the builder context object. The complete SelectionDAG pipeline then runs type legalization, operation legalization, DAG combining, and instruction selection over this graph before emitting PTX machine instructions.
SDNode Layout (104 Bytes, Two Views)
Every SDNode is allocated as exactly 104 bytes, hardcoded in sub_163D530. After allocation, all fields are zeroed. Two complementary views of the layout have been recovered: the "allocator view" from the zeroing pattern in sub_163D530, and the "accessor view" from field access patterns across the combiner (sub_F20C20), legalization (sub_1FFB890), and known-bits engine (sub_33D4EF0).
Allocator View (from sub_163D530)
The raw 104 bytes are zeroed via a combination of qword and dword stores:
qw[0..5] = 0, dw[6] = 0, qw[8..10] = 0, dw[11] = 0, byte[96] = 0
The statistics counter at context offset +96 is incremented by 104 for every allocation: *(_QWORD *)(v4 + 96) += 104LL.
Accessor View (Composite from Combiner, Legalizer, KnownBits)
The following table reconciles field accesses across sub_F20C20 (DAG combiner visitor), sub_1FFB890 (LegalizeOp), sub_33D4EF0 (computeKnownBits, 114KB), and sub_1FCE100 (LegalizeOp dispatcher):
| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| +0 | 8B | SDNode* | chain_next / first operand value | D03: *(qword*)(N+0) used as first operand in single-operand patterns |
| +4 | 4B | uint32_t | NumOperands_packed | D03: *(dword*)(N+4) & 0x7FFFFFF = NumOperands (low 27 bits); bits 27--30 = flags; bit 30 (0x40 in byte +7) = hasChainOps |
| +7 | 1B | uint8_t | node_flags_byte | D03: bit 4 = hasDebugLoc; bit 6 = hasChainPtr (operand list at N-8) |
| +8 | 8B | SDVTList* | VTList / ValueType pointer | D03: *(qword*)(N+8) = result value type descriptor; D05: read for MVT extraction |
| +16 | 8B | SDUse* | UseList | D03: head of use-def chain (doubly-linked list) |
| +24 | 4B | uint16_t | opcode | D02: *(uint16_t*)(node+24) = SDNode::getOpcode(); D05: *(a3+24) switched upon |
| +28 | 4B | uint32_t | opcode_flags | D05: *(a3+28) = sub-flags (nsw/nuw/exact bits) |
| +32 | 8B | SDUse* | operand_list | D02: *(node+32) = pointer to first operand SDUse; operand stride = 40 bytes |
| +33 | 1B | uint8_t | extension_mode | D05: *(a3+33) bits[2:3] = load extension mode (0=none, 1=anyext, 2=sext, 3=zext, matching LLVM's ISD::LoadExtType ordering) |
| +40 | 8B | ptr | value_list / operand[0] type | D02: *(node+40) = SDValue type info; D01: result type descriptor |
| +48 | 8B | EVT | result_VT | D05: *(a3+48) = result VT list, 16-byte entries {u16 MVT, pad, u64 ext} |
| +60 | 4B | uint32_t | num_values | D02: number of result values |
| +64 | 4B | uint32_t | flags / num_operands_alt | D05: *(a3+64) = operand count (alternate access path in KnownBits) |
| +72 | 8B | SDValue | chain_operand / result EVT | D03: *(qword*)(N+72) = result value type; D01: chain operand for memory ops |
| +80 | 8B | ptr | metadata / mem operand | D01: *(node+80) = predicate for CAS; extra metadata |
| +88 | 4B | uint32_t | address_space / ordering | D01: *(node+88) = memory operand / address-space descriptor |
| +96 | 8B | uint64_t | immediate_value | D05: *(a3+96) = constant value for ConstantSDNode (width <= 64) |
| +104 | 8B | ptr | extended_data | D05: *(a3+104) = second immediate, type info for wide constants |
| +112 | 8B | ptr | mem_chain / alignment | D05: *(a3+112) = MemSDNode chain / alignment info |
Note on dual access patterns. The combiner accesses opcodes at N+24 as a 4-byte field with flags, while the legalizer reads *(uint16_t*)(node+24) for a clean 16-bit opcode. The KnownBits engine (sub_33D4EF0) accesses fields at offsets up to +112, confirming that ConstantSDNode and MemSDNode subclasses extend beyond the base 104-byte allocation. These extended nodes are allocated via sub_BD2DA0 (80 bytes for lightweight variants) or sub_22077B0 (128 bytes for MemSDNode), while the base SDNode remains 104 bytes.
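The packed dword at node+4 can be modeled directly from the accessor table. A sketch under the recovered bit assignments (field names are illustrative):

```python
# Packed operand-count dword at node+4: low 27 bits = NumOperands,
# bits 27-30 = flags, bit 30 = hasChainOps. Bit 30 of this dword is the
# same bit seen as (byte at node+7) & 0x40 in the flag-byte access path.
def unpack_numops(dw: int) -> tuple[int, int, bool]:
    num_operands = dw & 0x7FFFFFF
    flag_bits = (dw >> 27) & 0xF
    has_chain_ops = bool(dw & (1 << 30))
    return num_operands, flag_bits, has_chain_ops

dw = (1 << 30) | 3                    # 3 operands, chain flag set
assert unpack_numops(dw) == (3, 8, True)
assert ((dw >> 24) & 0xFF) & 0x40     # visible as byte[+7] & 0x40
```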
Operand Storage
Operands are stored in a contiguous array of SDUse structures. Two storage modes exist:
Mode A -- backward inline (common for small operand counts). Operands are stored before the node in memory, growing toward lower addresses:
operand[i] = *(qword*)(N + 32*(i - NumOps))
// or equivalently: N - 32*NumOps = first operand address
This 32-byte operand stride is confirmed across sub_F3D570, sub_F20C20, and sub_F5A610.
Mode B -- indirect pointer (when node_flags_byte bit 6 is set). An 8-byte pointer at N-8 points to a separately allocated operand array:
if (*(byte*)(N+7) & 0x40):
operand_base = *(qword*)(N - 8)
The SDUse structure (each operand slot) has a 40-byte stride in the legalizer view (sub_1FFB890) and a 32-byte stride in the combiner view. The 40-byte stride includes use-chain forward/backward pointers:
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8B | Val | Pointer to the SDNode this use points to |
| +8 | 4B | ResNo | Result number within the pointed-to node |
| +16 | 8B | Next | Next SDUse in the use-list of the defining node |
| +24 | 8B | Prev | Previous SDUse (for doubly-linked list) |
| +32 | 8B | User | Back-pointer to the node that owns this operand |
Use-list traversal functions: sub_B43C20 (add to use list), sub_B43D60 (remove from use list).
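The two storage modes reduce to a small address computation (shown here with the 32-byte combiner-view stride). A sketch with a fake memory read standing in for dereferencing the decompiled image:

```python
# Mode A: operands stored backward inline before the node (N - 32*NumOps).
# Mode B: flag bit 0x40 at node+7 means an 8-byte pointer at N-8 holds
# the base of a separately allocated operand array.
def operand_addr(node: int, flags_byte7: int, num_ops: int, i: int,
                 read_qword) -> int:
    if flags_byte7 & 0x40:                    # Mode B: indirect
        return read_qword(node - 8) + 32 * i
    return node + 32 * (i - num_ops)          # Mode A: backward inline

mem = {0x1000 - 8: 0x9000}                    # fake image: pointer at N-8
assert operand_addr(0x1000, 0x40, 2, 1, mem.__getitem__) == 0x9000 + 32
assert operand_addr(0x1000, 0x00, 2, 0, mem.__getitem__) == 0x1000 - 64
```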
SDValue
An SDValue is a lightweight {SDNode*, unsigned ResNo} pair identifying a specific result of a specific DAG node. In the decompiled code, SDValues appear as 16-byte pairs at various points:
struct SDValue {
SDNode *Node; // +0: pointer to the defining node
uint32_t ResNo; // +8: which result of that node (0-based)
};
SDValues are passed by value in registers (packed into __m128i in many decompiled signatures) and stored in operand arrays. The SDUse structure wraps an SDValue with use-chain linkage for the def-use graph.
SelectionDAG Builder Context
The builder context is the a1/v4 parameter to sub_163D530. It holds the function being compiled, target information, the bump allocator state, and several DenseMaps for node deduplication.
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8B | func_ptr | The LLVM function being compiled (a2) |
| +8 | 8B | target_ptr | Target machine info (a4) |
| +16 | 8B | alloc_cursor | Bump allocator current position |
| +24 | 8B | alloc_end | Bump allocator end boundary |
| +32 | 8B | slab_array | Pointer to array of slab pointers |
| +40 | 4B | slab_index | Current slab number (dword) |
| +44 | 4B | slab_capacity | Max slabs in array (dword) |
| +48 | var | inline_slab | Start of first allocation region |
| +80 | 8B | bb_list_head | Basic block list sentinel (points to +96) |
| +88 | 8B | bb_list_count | Number of basic blocks (init 0) |
Embedded DenseMaps
Three DenseMap/DenseSet instances are embedded inline in the context for node deduplication and worklist tracking. All use the standard DenseMap infrastructure with NVVM-layer sentinels (-8 / -16); see Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth policy.
Map A (CSE node mapping) at offsets +120..+148:
| Offset | Size | Field |
|---|---|---|
| +120 | 8B | NumEntries |
| +128 | 8B | Buckets pointer |
| +136 | 4B | NumItems |
| +140 | 4B | NumTombstones |
| +144 | 4B | NumBuckets |
Map B (secondary set) at offsets +152..+176, same layout.
Set C (worklist) at offsets +184..+208, same layout.
Total minimum context size: 212 bytes.
Map A uses 16-byte bucket stride (key + value pairs), confirmed by the decompiled access pattern:
v30 = (_QWORD *)(v28 + 16LL * v29); // 16-byte stride
*v30 = v11; // key
v30[1] = v19; // value
DAG Builder Algorithm (SelectionDAGBuilder)
The SelectionDAGBuilder converts LLVM IR to an initial SelectionDAG. The main entry is sub_2081F00 (267KB, ~9,000 lines), with the visit dispatcher at sub_2065D30 (25KB). The builder processes one basic block at a time, walking the IR instruction list and emitting corresponding SDNode subgraphs.
Entry and Dispatch
sub_2081F00(SelectionDAGBuilder *this, BasicBlock *BB):
// this+552 = SelectionDAG pointer
// this+560 = DataLayout pointer
// Walk BB instruction list via linked list at BB+40/+48
for each instruction I in BB:
sub_2065D30(this, I) // main visit dispatch
The visit dispatcher (sub_2065D30) contains a DenseMap for node deduplication (hash function: (key >> 9) ^ (key >> 4)). It switches on the IR opcode and delegates to per-instruction visitors:
| IR Instruction | Visitor Function | Size | Notes |
|---|---|---|---|
| Binary ops | sub_206E5B0--sub_206F0D0 | 2.3KB each | 8 identical template instantiations for different ISD opcodes |
| Call | sub_208CF60 | 56KB | Calls sub_20C7CE0 (NVPTX ComputeCalleeInfo) |
| Load | sub_209B000 | 15KB | Chains via sub_2051C20 |
| Store | sub_2090780 | 14KB | Alignment, volatile, chain tokens |
| Switch/Br | sub_20912B0 | 18KB | Jump tables, range checks |
| PHI | sub_20920A0 | 13KB | Block ordering, vreg setup |
| GEP | sub_209FCA0 | 13KB | Recursive address building |
| Intrinsic | sub_208C8A0 | 9KB | Dispatches to intrinsic handlers |
| Debug | sub_208C270 | 7KB | Debug value/location handling |
| Inline Asm | sub_2079C70 | 83KB | Full constraint parsing |
| NVVM Tex/Surf | sub_2077400 | 20KB | "nvvm_texsurf_handle" metadata, NVIDIA custom |
| NVVM Args | sub_2072590 | 38KB | CUDA argument coercion, NVIDIA custom |
Chain Management
Every memory-touching SDNode carries a chain operand (token type) that enforces memory ordering. The chain is a linked sequence of token-typed SDValues threading through all memory operations in program order.
Chain creation. The builder maintains a "current chain" (PendingChain) that is updated after every memory operation. When a load or store is emitted, the current chain becomes its chain input, and the node's token result becomes the new current chain.
TokenFactor merging. When multiple independent memory operations can be reordered (e.g., independent loads), the builder creates a TokenFactor (opcode 2/55 depending on context) node that merges multiple chains into one:
// sub_F429C0: merge node creation
TokenFactor = getNode(ISD::TokenFactor, dl, MVT::Other, chains[])
Chain handling utilities in the builder:
- sub_20993A0 (11KB) -- chain/token helper for load/store sequences
- sub_2098400 -- chain token node creator
- sub_20989A0 -- memory scheduling chain builder
- sub_F6C1B0 (16KB) -- chain management in combining, uses sub_B46970 (isTokenFactor)
Glue (flag) chains. Certain node pairs must be scheduled adjacently (e.g., CopyToReg + CALL). These use a "glue" value type (MVT::Glue) as an additional operand/result. The call lowering in sub_3040BF0 threads glue through the entire call sequence: CallSeqBegin -> DeclareParam* -> Store* -> CallProto -> CallStart -> LoadRetParam* -> CallSeqEnd.
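The PendingChain discipline described above can be sketched with plain tuples standing in for SDNodes (names and shapes are illustrative, not the recovered representation):

```python
# Each memory op takes the current chain as input; its token result
# becomes the new current chain, enforcing program order.
def emit_memop(kind: str, pending_chain):
    node = (kind, pending_chain)   # chain input = current chain
    return node, node              # token result = new current chain

def token_factor(chains):
    # Merge independent chains, as the builder does via ISD::TokenFactor.
    return ("TokenFactor", tuple(chains))

entry = ("EntryToken",)
ld, chain = emit_memop("load", entry)
st, chain = emit_memop("store", chain)
assert st[1] is ld                 # store is ordered after the load
merged = token_factor([ld, st])
assert merged[0] == "TokenFactor"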
Per-Node Analysis Structure
During DAG construction, sub_163D530 creates per-node analysis objects (accessed via v381) with the following layout:
| Offset | Size | Field |
|---|---|---|
| +8 | 8B | array_ptr |
| +16 | 4B | array_count |
| +24 | 4B | array_capacity |
| +72 | 8B | set.Buckets |
| +80 | 4B | set.NumItems |
| +84 | 4B | set.NumTombstones |
| +88 | 4B | set.NumBuckets |
Operations: sub_163BE40(v381, ptr) inserts into the +8 array; sub_163BBF0(context, key) looks up the analysis structure for a node in the context's DenseMap.
CSE (Common Subexpression Elimination) Hash Table
The getNode() family of functions deduplicates SDNodes via a CSE hash table. The primary implementation is sub_F4CEE0 (41KB):
sub_F4CEE0(SelectionDAG *DAG, unsigned Opcode, SDVTList VTs, SDValue *Ops, unsigned NumOps):
// 1. Compute profile hash via sub_F4B360 (SDNode::Profile)
// Hash combines: opcode, VTs, all operand node pointers
// 2. Lookup in CSE hash table:
// hash = ((profile >> 4) ^ (profile >> 9)) & (capacity - 1)
// Quadratic probing: step 1, 2, 3, ...
// Sentinels: -4096 (empty), -8192 (tombstone)
// 3. If found: return existing node
// 4. If not found:
// Allocate via sub_BD2C40 (bump allocator)
// Initialize via sub_B44260 (SDNode constructor)
// Insert into hash table
// Add to AllNodes list (global sentinel: qword_4F81430)
// Return new node
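The probe sequence above (profile hash, quadratic probing, -4096/-8192 sentinels) can be sketched as follows. Integers stand in for SDNode pointers, and the tombstone handling is simplified to "keep probing":

```python
# Sentinel values observed in the binary's CSE table.
EMPTY, TOMBSTONE = -4096, -8192

def cse_find_or_insert(table: list[int], profile: int) -> tuple[int, bool]:
    cap = len(table)                                   # power of two
    idx = ((profile >> 4) ^ (profile >> 9)) & (cap - 1)
    step = 1
    while True:
        slot = table[idx]
        if slot == profile:                            # hit: reuse node
            return idx, False
        if slot == EMPTY:                              # miss: insert
            table[idx] = profile
            return idx, True
        idx = (idx + step) & (cap - 1)                 # quadratic: 1, 2, 3...
        step += 1

table = [EMPTY] * 8
i1, inserted = cse_find_or_insert(table, 0x12345)
assert inserted
i2, inserted = cse_find_or_insert(table, 0x12345)
assert not inserted and i1 == i2                       # deduplicated
```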
Node builder variants handle different operand counts:
- sub_F49030 (38KB) -- complex node construction with operand/result type setup
- sub_F429C0 (34KB) -- merge/TokenFactor/indexed node creation
- sub_F44160 (22KB) -- CSE rebuild after modification
- sub_F40FD0 (16KB) -- node construction with chain initialization
The AllNodes list (qword_4F81430) is a doubly-linked intrusive list of all SDNodes in the current DAG, used for iteration during combining and legalization passes.
NVPTX-Specific Node Types (NVPTXISD)
NVPTX target-specific ISD opcodes begin at ISD::BUILTIN_OP_END = 0x1DC9 (confirmed by sub_2095B00 delegation threshold for getTargetNodeName()). In the decompiled code, target opcodes are referenced by small integers (the NVPTXISD enum value minus BUILTIN_OP_END). The following table consolidates all NVPTXISD opcodes discovered across sub_3040BF0, sub_32E3060, sub_33B0210, and the legalization infrastructure:
Call ABI Nodes
| Opcode | Name | Operands | Description |
|---|---|---|---|
| 315 | CallSeqBegin | chain, seqId, frameSize | Mark start of call frame |
| 316 | CallSeqEnd_Outer | chain, ... | Outer call-sequence-end wrapper |
| 505 | DeclareParam | chain, align, idx, size | Declare .param (byval/aggregate) |
| 506 | DeclareScalarParam | chain, align, idx, size | Declare .param (scalar, widened) |
| 507 | DeclareRetParam | chain, ... | Declare .param for return (byval callee) |
| 508 | DeclareRetScalarParam | chain, ... | Declare .param for return (scalar callee) |
| 510 | CallDirect | chain, callee, ... | Direct call (callee not extern) |
| 511 | CallDirectNoProto | chain, callee, ... | Direct call without prototype |
| 512 | CallIndirect | chain, ptr, ... | Indirect call via function pointer |
| 513 | CallIndirectNoProto | chain, ptr, ... | Indirect call without prototype |
| 514 | CallStart | chain, ... | Actual call instruction emission |
| 515 | LoadRetParam | chain, offset | Load return value from .param (not last) |
| 516 | LoadRetParamLast | chain, offset | Load last return value from .param |
| 517 | CallSeqEnd | chain, seqId, ... | End of call sequence (inner chain) |
| 518 | CallProto | chain, paramCount | Declare call prototype (.callprototype) |
| 521 | DeclareRetParam_Ext | chain, ... | Declare .param for return (extended path) |
| 527 | StoreCalleeRetAddr | chain, ... | Store callee return address in .param |
| 528 | StoreRetValToParam | chain, ... | Store return value to .param (return path) |
Memory / Vector Nodes
| Opcode | Name | Operands | Description |
|---|---|---|---|
| 568 | LoadV1 | chain, ptr, offset | Load 1-element from .param (scalar return) |
| 569 | LoadV2 | chain, ptr, offset | Load 2-element vector from .param |
| 570 | LoadV4 | chain, ptr, offset | Load 4-element vector from .param |
| 571 | StoreV1 | chain, val, ptr, offset | Store 1-element to .param (st.param) |
| 572 | StoreV2 | chain, val, ptr, offset | Store 2-element vector to .param |
| 573 | StoreV4 | chain, val, ptr, offset | Store 4-element vector to .param |
Math / Rounding-Mode Nodes
| Opcode | Name | Description |
|---|---|---|
| 245 | ADD_RM | Add, round toward -inf |
| 246 | SQRT_RP | Sqrt, round toward +inf |
| 248 | SQRT_RZ | Sqrt, round toward zero |
| 249 | ADD_RZ | Add, round toward zero |
| 250 | DIV_RZ | Div, round toward zero |
| 251 | MUL_RN | Mul, round to nearest |
| 252 | ADD_RN | Add, round to nearest |
| 253 | FMA_RN | FMA, round to nearest |
| 254 | SQRT_RM | Sqrt, round toward -inf |
| 255 | MUL_RZ | Mul, round toward zero |
| 256 | DIV_RM | Div, round toward -inf |
| 267 | FMA_RZ | FMA, round toward zero |
| 268 | DIV_RN | Div, round to nearest |
| 269 | DIV_RP | Div, round toward +inf |
| 270 | ADD_RP | Add, round toward +inf |
| 271 | FMA_RM | FMA, round toward -inf |
| 272 | MUL_RP | Mul, round toward +inf |
| 273 | FMA_RP | FMA, round toward +inf |
| 274 | MUL_RM | Mul, round toward -inf |
Address Space / Miscellaneous Nodes
| Opcode | Name | Description |
|---|---|---|
| 22 | TargetAddr | Target address computation |
| 24 | Wrapper | Global address wrapping |
| 149 | ATOMIC_LOAD | Atomic load with scope |
| 152 | SELECT_CC | Ternary select on condition code |
| 154 | SQRT_RN | Sqrt, round to nearest |
| 189 | MoveParam | Read thread index / special register |
| 193--196 | MIN/MAX | Integer min/max variants |
| 197 | CTPOP | Population count |
| 198--204 | ConstPool* | Constant pool variants by size |
| 208 | CMPXCHG | Compare-and-exchange atomic |
| 230 | DeclareLocal | Declare local .param / address of param |
| 233--234 | AddrSpaceCast | Bidirectional address space cast pair |
| 287--290 | Barrier/Fence | Memory barrier/fence variants |
| 310 | Annotation | Annotation metadata node |
| 321 | StackRestore | Restore stack pointer |
| 322 | StackAlloc | Dynamic stack allocation |
| 330 | FunctionAddr | Function address |
| 335 | BinaryArith | Generic binary arithmetic |
| 371 | DynAreaOffset | Dynamic alloca offset |
| 499 | ConditionalBranch | Conditional branch with chain |
Atomic Opcodes (from sub_20BED60)
| Opcode Range | Operation | Widths |
|---|---|---|
| 294--297 | atom.add | f32/f64/i32/i64 |
| 302--305 | atom.min | s32/s64/u32/u64 |
| 314--317 | atom.max | s32/s64/u32/u64 |
| 462 | atom.cas | generic |
DAG Legalization Flow
After the initial DAG is built, three legalization phases transform it into a form the NVPTX backend can select:
Phase 1: Type Legalization (sub_20019C0, 348KB)
The DAGTypeLegalizer iterates to fixpoint. For each node, it reads the result/operand types and checks the legality table at TLI + 259 * VT + opcode + 2422. If illegal, it applies one of: promote, expand, soften, scalarize, or split-vector. The worklist iterates until no node has an illegal type.
NVPTX legal vector types are extremely limited (only v2f16, v2bf16, v2i16, v4i8 -- all packing into 32-bit registers via Int32HalfRegs). This means virtually all LLVM-IR vector operations pass through the split/scalarize paths.
Type legalization workers:
- sub_201E5F0 (81KB) -- promote/expand secondary dispatch (441 case labels, 6 switches)
- sub_201BB90 (75KB) -- ExpandIntegerResult (632 case labels)
- sub_2029C10 -- SplitVectorResult dispatcher (reads opcode at node+24)
- sub_202E5A0 -- SplitVectorOperand dispatcher
- sub_2036110 -- ScalarizeVectorResult
- sub_2035F80 -- ScalarizeVectorOperand
Phase 2: Operation Legalization (sub_1FFB890, 169KB)
After types are legal, the operation legalizer checks whether each operation at its now-legal type is supported. The action lookup:
action = *(uint8_t*)(TLI + 259*VT + opcode + 2422)
Actions dispatch through a five-way switch:
| Action | Code | Behavior |
|---|---|---|
| Legal | 0 | Return immediately |
| Custom | 1 | Call TLI->LowerOperation() via vtable slot #164 (offset 1312) |
| Expand | 2 | Try sub_20019C0 (LegalizeTypes), then sub_1FF6F70 (ExpandNode) |
| LibCall | 3 | Call sub_1FF6F70 directly |
| Promote | 4 | Find next legal type, rebuild at promoted type |
Custom lowering invokes NVPTXTargetLowering::LowerOperation() (sub_32E3060, 111KB) through the vtable. This is where all NVPTX-specific operation lowering happens: BUILD_VECTOR splat detection, VECTOR_SHUFFLE three-level lowering, EXTRACT_VECTOR_ELT three-path dispatch, and the .param-space calling convention.
Additional action tables:
- Second table at TLI + opcode + 2681 -- for BSWAP/CTLZ/CTTZ/BITREVERSE (opcodes 43--45, 199)
- Third table at TLI + opcode + 3976 -- for FSINCOS (opcode 211)
- Fourth table at TLI + 18112 -- packed nibble format for FP_TO_SINT/FP_TO_UINT/SELECT_CC, indexed by (VT_id >> 3) + 15 * condcode_type
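The primary action lookup and five-way dispatch reduce to a flat byte-table index. A sketch under the recovered formula (the table here is a synthetic stand-in for the TLI object's embedded array):

```python
# Action codes from the five-way switch described above.
LEGAL, CUSTOM, EXPAND, LIBCALL, PROMOTE = range(5)

def get_action(tli: bytes, vt: int, opcode: int) -> int:
    # Primary table: action = *(uint8_t*)(TLI + 259*VT + opcode + 2422)
    return tli[259 * vt + opcode + 2422]

tli = bytearray(32768)                    # synthetic TLI image, all Legal
vt, opcode = 7, 100
tli[259 * vt + opcode + 2422] = CUSTOM
assert get_action(tli, vt, opcode) == CUSTOM   # would call LowerOperation()
assert get_action(tli, vt, opcode + 1) == LEGAL
```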
Phase 3: DAG Combining (Three Passes)
DAG combining runs after each legalization phase. The orchestrator (sub_F681E0, 65KB) manages a worklist of SDNodes and calls the per-node visitor (sub_F20C20, 64KB) for each. The visitor implements a six-phase combine algorithm:
1. Opcode-specific combine via sub_100E380 -- target-independent pattern matching
2. Known-bits narrowing -- for constants, calls sub_11A3F30 (computeKnownBits/SimplifyDemandedBits) and narrows if fewer bits demanded
3. Operand type-narrowing loop -- walks all operands, promotes/truncates to legal types, creates SIGN_EXTEND/TRUNCATE casts
4. All-constant-operand fold -- 4x-unrolled check via sub_1028510 (ConstantFold)
5. Division-by-constant strength reduction -- shift+mask replacement for power-of-2 divisors
6. Vector stride / reassociation -- sub_F15770 (shift-fold), sub_F17ED0 (stride patterns)
NVPTX-specific combines run as a post-legalize pass:
- sub_33C0CA0 (62KB) -- PerformDAGCombine, the NVPTX target hook
- sub_32EC4F0 (92KB) -- post-legalize combine
- sub_3425710 (142KB) -- the NVIDIA DAGCombiner with internal "COVERED"/"INCLUDED" debug tracing strings (not present in upstream LLVM)
The worklist uses the same DenseMap infrastructure as the builder context, with the hash at DAG+2072 (capacity at DAG+2088, count at DAG+2080). Node replacement goes through sub_F162A0 (CombineTo/ReplaceAllUsesWith), which walks the use-list, hashes each user into the worklist map, then calls sub_BD84D0 for the actual use-chain splice.
Bump Allocator
The builder context uses a slab-based bump allocator identical to the one used for NVVM IR nodes:
- Slab growth: 4096 << (slab_index >> 7) -- exponential, capped at 4TB.
- Alignment: 8 bytes.
- No per-node free: entire slabs are released when the DAG is destroyed.
- Overflow: allocates a new slab via malloc().
Since every base SDNode is exactly 104 bytes (13 qwords), a single 4096-byte initial slab holds approximately 39 nodes before overflow triggers slab growth. Extended node types (ConstantSDNode, MemSDNode) may be larger and are allocated via separate paths:
- `sub_BD2C40` -- standard SDNode allocation (bump allocator)
- `sub_BD2DA0` -- SDNode allocation variant (80 bytes, for lightweight nodes)
- `sub_22077B0` -- `operator new[]` (128 bytes, for MemSDNode with chain/alignment fields)
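The slab-growth and bump-allocation rules above can be sketched directly; the structure names here are illustrative, not recovered symbols:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Slab N is 4096 << (N >> 7) bytes: slabs double every 128 slabs,
 * capped at 4 TB. */
static size_t slab_size(unsigned slab_index) {
    size_t sz = (size_t)4096 << (slab_index >> 7);
    const size_t cap = (size_t)1 << 42; /* 4 TB */
    return sz > cap ? cap : sz;
}

typedef struct {
    uint8_t *base;
    size_t used, cap;
} Slab;

/* Bump-allocate n bytes at 8-byte alignment; NULL means the caller
 * must grow into a new slab. */
static void *bump_alloc(Slab *s, size_t n) {
    size_t aligned = (n + 7) & ~(size_t)7;
    if (s->used + aligned > s->cap) return NULL;
    void *p = s->base + s->used;
    s->used += aligned;
    return p;
}
```

With 104-byte base SDNodes, a fresh 4096-byte slab yields exactly 39 allocations before the first overflow, matching the estimate above.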
Basic Block Iteration
The builder iterates over the function's basic blocks via a linked list rooted at a2 + 72 (the function parameter). Each list node embeds the data pointer at offset -24 from the node:
bb_data = node_ptr - 24
Within each basic block, instructions are iterated via an inner list:
- Inner list sentinel at `bb_data + 40`
- Inner list head at `bb_data + 48`
This matches the LLVM ilist intrusive linked list pattern where the list hook is embedded at a fixed offset within the contained object.
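The `bb_data = node_ptr - 24` arithmetic is the standard container-of idiom for intrusive lists, sketched below with illustrative types (the real 24-byte prefix holds unrelated fields):

```c
#include <assert.h>
#include <stddef.h>

/* The list hook is embedded 24 bytes into the containing object, so the
 * object is recovered by subtracting the hook's offset. */
typedef struct ListHook {
    struct ListHook *next, *prev;
} ListHook;

typedef struct MiniBB {
    long pad[3];    /* 24 bytes of fields preceding the hook */
    ListHook hook;  /* embedded at offset 24 */
    int id;
} MiniBB;

static MiniBB *bb_from_hook(ListHook *h) {
    return (MiniBB *)((char *)h - offsetof(MiniBB, hook));
}
```

This is why the decompiled iteration reads a list node pointer and immediately applies a fixed negative displacement.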
Differences from Upstream LLVM
| Area | NVIDIA (cicc v13.0) | Upstream LLVM 20.0 |
|---|---|---|
| Type legalizer structure | Single 348KB monolithic function (sub_20019C0) | Split across 4 files (LegalizeIntegerTypes.cpp, etc.) |
| NVIDIA DAGCombiner | 142KB sub_3425710 with "COVERED"/"INCLUDED" internal tracing | No equivalent; target combines via PerformDAGCombine hook only |
| computeKnownBits | 114KB sub_33D4EF0, covers 112+ ISD opcodes including NVPTX target nodes | ~30 opcodes in generic computeKnownBits, target extends via hook |
| Inline asm | 162KB total (sub_2079C70 + sub_338BA40) | ~200 lines per target |
| Intrinsic lowering | 343KB switch covering 200+ intrinsic IDs up to 14196 | ~300 standard intrinsic IDs |
| Address spaces | AS 101 (param alt), AS 7 (.param), CTA/GPU/SYS scope atomics | No AS 101; no scope atomics |
| Libcall metadata | "nvptx-libcall-callee" metadata for custom libcall routing | Not present |
| Legal vector types | Only v2f16, v2bf16, v2i16, v4i8 (packed into 32-bit registers) | Varies by target; typically much wider vectors |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| SelectionDAG builder context init | sub_163D530 | 73KB | Allocator, DenseMaps, BB iteration |
| SelectionDAGBuilder::visit | sub_2081F00 | 267KB | IR-to-DAG main lowering |
| SelectionDAGBuilder visit dispatch | sub_2065D30 | 25KB | Per-instruction routing |
| visitCall | sub_208CF60 | 56KB | Call lowering into DAG |
| visitLoad | sub_209B000 | 15KB | Load chain emission |
| visitStore | sub_2090780 | 14KB | Store alignment/chain |
| visitSwitch/Br | sub_20912B0 | 18KB | Control flow lowering |
| visitPHI | sub_20920A0 | 13KB | PHI node handling |
| visitGEP | sub_209FCA0 | 13KB | Address computation |
| visitInlineAsm | sub_2079C70 | 83KB | Inline asm constraint parsing |
| visitNVVMTexSurf | sub_2077400 | 20KB | NVIDIA tex/surf handle lowering |
| NVPTX argument coercion | sub_2072590 | 38KB | CUDA kernel argument lowering |
| getNode / CSE hash table | sub_F4CEE0 | 41KB | Node deduplication |
| SelectionDAG node builder | sub_F49030 | 38KB | Complex node construction |
| Merge/TokenFactor creation | sub_F429C0 | 34KB | Chain merging, indexed nodes |
| DAG combiner orchestrator | sub_F681E0 | 65KB | Worklist management |
| DAG combiner visitor | sub_F20C20 | 64KB | Per-node combine algorithm |
| combine() opcode dispatch | sub_100E380 | -- | Target-independent combines |
| CombineTo / RAUW | sub_F162A0 | -- | Use-chain replacement + worklist push |
| SDNode allocation | sub_BD2C40 | -- | Bump allocator |
| SDNode constructor | sub_B44260 | -- | Initialization |
| SDUse add to use list | sub_B43C20 | -- | Use-chain linkage |
| SDUse remove from use list | sub_B43D60 | -- | Use-chain unlinkage |
| ReplaceAllUsesWith | sub_BD84D0 | -- | Raw use-chain splice |
| transferDbgValues | sub_BD6B90 | -- | Debug info transfer |
| setOperand | sub_B91C10 | -- | Operand mutation |
| replaceOperand | sub_B99FD0 | -- | Single operand swap |
| DAGTypeLegalizer::run | sub_20019C0 | 348KB | Type legalization master dispatch |
| LegalizeOp | sub_1FFB890 | 169KB | Operation legalization |
| ExpandNode | sub_1FF6F70 | -- | Full node expansion fallback |
| NVPTXTargetLowering::LowerOperation | sub_32E3060 | 111KB | NVPTX custom operation lowering |
| NVPTXTargetLowering::LowerCall | sub_3040BF0 | 88KB | .param calling convention |
| Intrinsic lowering switch | sub_33B0210 | 343KB | 200+ CUDA intrinsic IDs |
| PerformDAGCombine (NVPTX) | sub_33C0CA0 | 62KB | Post-legalize NVPTX combines |
| NVIDIA DAGCombiner | sub_3425710 | 142KB | NVIDIA-specific combine engine |
| computeKnownBits (NVPTX) | sub_33D4EF0 | 114KB | 112-opcode known-bits transfer |
| ISel::Select driver | sub_3090F90 | 91KB | Pattern matching entry |
| getOperationName | sub_2095B00 | 35KB | ISD opcode -> string mapping |
Cross-References
- SelectionDAG & Instruction Selection -- pipeline overview, NVPTX lowering, combine detail
- Type Legalization -- 348KB type legalizer deep-dive
- ISel Patterns -- instruction selection pattern database
- Register Classes -- NVPTX register class constraints
- Address Spaces -- address space encoding
- Hash Infrastructure -- universal DenseMap documentation
- IR Node Structure -- NVVM IR node layout (pre-SelectionDAG)
- Pattern Database -- ISel pattern constraint classes
DenseMap, Symbol Table, and EDG Frontend Structures
The EDG 6.6 frontend, layered on LLVM's DenseMap infrastructure, maintains its own declaration nodes, type nodes, and scope stack for C/C++/CUDA semantic analysis. This page documents the EDG-level structures that ride on top of the DenseMap. For the DenseMap implementation itself -- layout, hash function, probing, sentinel values, and growth policy -- see Hash Table and Collection Infrastructure.
The EDG symbol tables in this subsystem use the NVVM-layer sentinel pair (-8 / -16) and the pointer hash (ptr >> 9) ^ (ptr >> 4). See the sentinel reference table for other subsystems.
EDG Declaration Node Layout
The EDG 6.6 frontend represents every C/C++ declaration as a variable-length structure. The canonical declaration node layout was recovered from the top-level declarator parser sub_662DE0 and the declaration-specifier resolver sub_7C0F00.
Declaration Node (a_decl_node) -- 456+ bytes
| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| +0 | 8B | ptr | decl_id / entity pointer | *v31 in sub_662DE0 |
| +8 | 8B | uint64_t | decl_flags bitfield (see below) | v31[1] |
| +16 | 8B | uint64_t | decl_extra_flags | v31[2] |
| +24 | 16B | -- | name / identifier info | v31[3..4] |
| +40 | 8B | -- | name string for "main" check | strcmp target |
| +72 | 4B | uint32_t | saved_specifier_word1 | v239 in sub_662DE0 |
| +76 | 2B | uint16_t | saved_specifier_word2 | v240 |
| +80 | 1B | uint8_t | entity_kind (for scope dispatch) | checked in sub_860B80 |
| +120 | 1B | uint8_t | accessibility (bits 0-6, bit 7 reserved) | v241 = *(a1+120) & 0x7F |
| +124 | 1B | uint8_t | context_flags_124 | bit 5=explicit_spec, bit 6=class_member |
| +125 | 1B | uint8_t | context_flags_125 | bit 5=was_friend, bit 6=in_class_body, bit 7=template_decl_head |
| +126 | 1B | uint8_t | state_flags (see below) | mask tests throughout sub_662DE0 |
| +127 | 1B | uint8_t | extra_state | bit 0=class_scope_pushed, bit 1=needs_deferred_parse |
| +128 | 8B | ptr | entity_ptr / scope pointer | compared early in sub_739430 |
| +130 | 1B | uint8_t | modifier_flags | bit 5=deferred_parse, bit 6=virtual_specifier |
| +131 | 1B | uint8_t | inline/constexpr flag | bit 4 |
| +132 | 1B | uint8_t | needs_semicolon_check | bit 1 |
| +140 | 1B | uint8_t | type_kind (for type_def nodes) | switch discriminant in sub_766570 case 6 |
| +160 | 8B | ptr | underlying_type (for typedef) | typedef unwrap chain |
| +168 | 8B | ptr | flags_ptr | bit 3 checked for fn-pointer |
| +173 | 1B | uint8_t | elaborate_kind | primary switch in sub_739430 |
| +176 | var | -- | elaborate_sub_kind / secondary | sub-switch in case 12 |
| +184 | 8B | ptr | parm_list | v31[23] via sub_5CC190(1) |
| +224 | 4B | uint32_t | init_kind | bit 0 = brace-init |
| +256 | 4B | uint32_t | additional_flags | |
| +268 | 1B | uint8_t | decl_kind_enum | 0=variable, 4=function, 6=namespace |
| +269 | 1B | uint8_t | storage_class_kind | 0=none, 1=extern, 2=static |
| +272 | 8B | ptr | decl_type | v31[34] |
| +280 | 8B | ptr | result_type | v31[35] |
| +288 | 8B | ptr | entity_type / return_type | v31[36] |
| +304 | 8B | ptr | template_info | |
| +352 | 8B | ptr | body_ptr | v31[44] |
| +360 | 8B | ptr | scope_or_context | v31[45] |
| +368 | 8B | ptr | forward_decl_chain | linked list |
| +416 | 8B | ptr | pending_list | v31[52] |
| +456 | 8B | ptr | extra_entity | v31[57] |
decl_flags (+8) Bit Definitions
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | is_definition / linkage related |
| 1 | 0x2 | has_initializer / needs init check |
| 4 | 0x10 | is_typedef |
| 5 | 0x20 | is_template_decl / friend declaration |
| 6 | 0x40 | is_inline |
| 7 | 0x80 | is_extern |
| 14 | 0x4000 | structured_binding / decomposition decl |
state_flags (+126) Bit Definitions
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | has_saved_tokens |
| 1 | 0x2 | abstract_declarator_mode |
| 2 | 0x4 | has_leading_attributes |
| 3 | 0x8 | no_declarator_needed (typedef etc.) |
| 4 | 0x10 | suppress_error_recovery |
| 5 | 0x20 | in_declarator_parsing (set on entry) |
| 6 | 0x40 | in_multi_declarator_loop |
| 7 | 0x80 | scope_pushed |
entity_kind (+80) Dispatch Values
Used by sub_860B80 and sub_7C0F00 phase 3:
| Value | Entity Kind |
|---|---|
| 3 | class |
| 4 | enum (variant A) |
| 5 | enum (variant B) |
| 6 | namespace |
| 10 | function |
| 11 | variable |
| 16 | typedef |
| 17 | template |
| 19 | class template |
| 22 | dependent name |
| 23 | using-declaration |
| 24 | injected-class-name |
Declaration Node Allocation
sub_84DCB0 allocates 152-byte declaration entries from a free-list at qword_4D03C68, with fallback to the global allocator sub_823970(152). The full node size table at qword_4B6D500 provides per-tag sizes for all 87 IL node types; the declaration tag (6) indexes into this table for memcpy during template instantiation.
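A minimal sketch of that free-list-with-fallback pattern (names here are illustrative; the real free-list head lives at `qword_4D03C68` and the fallback is `sub_823970(152)`):

```c
#include <assert.h>
#include <stdlib.h>

#define DECL_NODE_SIZE 152

/* A freed entry's first 8 bytes are reused as the free-list link. */
typedef struct FreeEntry { struct FreeEntry *next; } FreeEntry;

static FreeEntry *g_decl_free_list = NULL;

static void *alloc_decl_node(void) {
    if (g_decl_free_list) {
        FreeEntry *e = g_decl_free_list;
        g_decl_free_list = e->next;   /* pop the head */
        return e;
    }
    return malloc(DECL_NODE_SIZE);    /* fallback to the general allocator */
}

static void free_decl_node(void *p) {
    FreeEntry *e = (FreeEntry *)p;    /* recycle: push onto the free list */
    e->next = g_decl_free_list;
    g_decl_free_list = e;
}
```

The pattern makes allocation of hot, fixed-size parser nodes O(1) with no header overhead, at the cost of never returning memory to the system until process exit.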
EDG Type Node Layout
Type nodes are the central representation for C/C++ types throughout the EDG frontend. Two distinct layouts exist: the IL-level type node used by the tree walker (sub_7506E0) and the semantic type node used by the type comparison engine (sub_7386E0). The type translation system (sub_91AED0) bridges between these and LLVM types.
IL-Level Type Node (from sub_7506E0 tree walker)
The IL tree walker addresses fields as a1[N] (8-byte indexed), with byte-level sub-kind tags at specific offsets:
| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| -16 | 8B | ptr | parent_ptr / owner | shared-node check path |
| -8 | 1B | uint8_t | flags_byte | bit 0=shared, bit 2=visit-mark |
| +0..+N*8 | var | ptr[] | child pointers (typed per kind) | a1[0]..a1[N] |
| +24 | 1B | uint8_t | expression sub-kind (case 13) | switch discriminant |
| +28 | 1B | uint8_t | scope sub-kind (case 23) | 18 sub-kinds |
| +40 | 1B | uint8_t | declaration sub-kind (case 21) | 25 sub-kinds |
| +48 | 1B | uint8_t | template_arg sub-kind (case 30) | 9 sub-kinds |
| +140 | 1B | uint8_t | type_def_sub_kind (case 6) | 17 sub-kinds |
| +161 | 1B | uint8_t | type_def_flags | |
| +168-177 | var | -- | type sub-kind / sub-sub-kind | |
| +173 | 1B | uint8_t | type_main_kind (case 2) | 14 sub-kinds by +173 |
| +176 | 1B | uint8_t | type_sub_sub_kind | case 6 elaborated |
Semantic Type Node (from sub_7386E0 comparison engine)
| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| +0 | 8B | ptr | associated_decl | *v10 == *v14 comparison |
| +24 | 1B | uint8_t | type_kind (0..37) | primary switch discriminant |
| +25 | 1B | uint8_t | cv_qualifiers | bits 0-1 = const/volatile, bit 6 = restrict |
| +26 | 1B | uint8_t | type_flags_1 | bit 2 compared |
| +27 | 1B | uint8_t | type_flags_2 | bit 1 compared (case 1) |
| +56 | 8B | -- | type_payload / sub_kind | case 1: char at +56 = base_type_kind |
| +58 | 1B | uint8_t | type_extra_flags | case 1: bits 0x3A compared |
| +64 | 8B | -- | varies per kind | case 30: word at +64 |
| +72 | 8B | ptr | type_child / pointer | case 1 integer path |
| +80 | 8B | ptr | linkage_chain | case 33: namespace list +80 = next |
EDG-to-LLVM Type Translation Node (from sub_918E50)
The type translation system reads a third view of the type node with offsets optimized for LLVM type construction:
| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| -72 | 8B | ptr | grandparent_type | nested lookups |
| -48 | 8B | ptr | parent_type_A | |
| -24 | 8B | ptr | parent_type_B / first child | |
| -8 | 8B | ptr | indirect_child_array | if flag 0x40 at +23 |
| +0 | 8B | ptr | llvm_type_descriptor | *node -> LLVM type info |
| +8 | 8B | ptr | member_chain_head | linked list of class members |
| +16 | 1B | uint8_t | type_kind | see kind table below |
| +18 | 2B | uint16_t | qualifier_word | bits 0-14: qualifier ID, bit 15: negation |
| +20 | 4B | uint32_t | child_count | low 28 bits masked & 0xFFFFFFF |
| +23 | 1B | uint8_t | flags | bit 6 (0x40) = indirect children |
| +24 | 8B | -- | type-specific data | varies by kind |
| +32 | 8B | -- | bitwidth | enum/integer types |
| +33 | 1B | uint8_t | additional_flags | bit 5 (0x20) = special treatment |
| +36 | 4B | uint32_t | sub_kind_discriminator | nested types |
| +40 | 8B | ptr | scope_linkage_ptr | |
| +48 | 8B | ptr | member_list_head | linked list |
type_kind Enumeration (semantic type comparison)
The full enumeration recovered from sub_7386E0:
| Value | Name | Comparison Strategy |
|---|---|---|
| 0 | tk_none / void | trivially equal |
| 1 | tk_fundamental | sub_kind + base type + class scope |
| 2 | tk_pointer | delegate to sub_739430 on pointee |
| 3 | tk_class | scope identity, unique_id, template args |
| 4 | tk_enum | scope identity, unique_id |
| 5 | tk_function | sub_73A280 pair compare |
| 6 | tk_bitfield | width + base compare |
| 7 | tk_member_pointer | multi-field descriptor |
| 8 | tk_reference | referent descriptor |
| 10 | tk_array | element type recursion |
| 11 | tk_qualified | child + qualifier bit |
| 12 | tk_elaborated | sub_kind switch (typedef/class/enum) |
| 13 | tk_pack_expansion | sub_kind switch |
| 14 | tk_typeof_expr | sub_kind switch |
| 15 | tk_decltype | sub_kind switch |
| 16 | tk_nullptr | trivially equal |
| 17 | tk_auto | identity on entity |
| 18 | tk_function_alt | sub_73A280 |
| 20 | tk_dependent_name | scope identity + unique_id |
| 22 | tk_unresolved | sub_8D97D0 decl compare |
| 23 | tk_attributed | attribute kind + child |
| 24 | tk_decltype_auto | identity on entity |
| 25 | tk_paren | child list compare |
| 26 | tk_adjusted | child type recursion |
| 27 | tk_typeof_decl | resolve decl -> type, recurse |
| 30 | tk_complex | element type(s) recursion |
| 32 | tk_template_template_param | identity + template args |
| 33 | tk_using_decl | child list + base class hash table |
| 34 | tk_atomic | child + qualifier bit |
| 35 | tk_vla | element type recursion |
| 37 | tk_concept_constraint | identity on entity |
EDG-to-LLVM Type Kind Encoding (byte at node+16)
| Value | Hex | Kind |
|---|---|---|
| 0-16 | 0x00-0x10 | Primitive / scalar types |
| 17 | 0x11 | Void (special) |
| 5 | 0x05 | Qualified type (const/volatile/restrict) |
| 13 | 0x0D | Enum type |
| 14 | 0x0E | Function type |
| 26 | 0x1A | Array type (subscript form) |
| 27 | 0x1B | Compound type (struct/union/class) |
| 50 | 0x32 | Union variant A |
| 51 | 0x33 | Union variant B |
| 54 | 0x36 | Typedef / using declaration |
| 55 | 0x37 | Using declaration variant |
| 75 | 0x4B | Pointer type |
| 76 | 0x4C | Reference type (lvalue or rvalue) |
| 77 | 0x4D | Member pointer type |
| 78 | 0x4E | Dependent / nested type |
Qualifier Word Values (node+18 & 0x7FFF)
| Value | CUDA Memory Space |
|---|---|
| 1 | Address space 1 (global memory) |
| 9 | Address space 9 (generic, gated by sub_5F3280) |
| 14 | Function / method qualifier |
| 26 | Array subscript context A |
| 27 | Array subscript context B |
| 32 | Address space 32 (shared memory) |
| 33 | Address space 33 (constant memory) |
Type Canonicalization -- sub_72EC50
Before any type comparison, both sides are canonicalized by stripping non-template typedef aliases:
fn edg_canonicalize_type(type) -> type:
while type.type_kind == 2: // tk_elaborated
scope = type.payload_at_56
if scope.elaborate_kind != 12: // not typedef_name
break
if scope.elaborate_sub_kind != 1: // not single-member typedef
break
if scope.class_flags & 0x10: // has template specialization
break
type = sub_72E9A0(type) // unwrap one layer
return type
This peels through chains like typedef int MyInt; typedef MyInt YourInt; down to the fundamental type. Template specialization aliases are never unwrapped.
EDG Scope Stack
The scope stack is a global array of 776-byte entries, indexed by a scope depth counter. It represents the C++ scope nesting at parse time (file scope -> namespace -> class -> function -> block).
Global State
| Address | Type | Name | Purpose |
|---|---|---|---|
| qword_4F04C68 | ptr | Scope stack base | heap-allocated array of 776B entries |
| dword_4F04C64 | int32_t | Current scope index | top of the scope stack |
| dword_4F04C5C | int32_t | Previous scope index | saved parent index |
| dword_4F04C44 | int32_t | Namespace scope index | deepest enclosing namespace |
| dword_4F04C34 | int32_t | Class scope index | deepest enclosing class |
| dword_4F04C40 | int32_t | Another scope index | auxiliary scope tracking |
| dword_4F04C3C | int32_t | Module linkage flag | C++20 module scope state |
| unk_4F04C48 | int32_t | Parent scope check | used by using-declaration handler |
Scope Stack Entry Layout (776 bytes)
Each entry at qword_4F04C68[0] + 776 * index:
| Offset | Size | Type | Field |
|---|---|---|---|
| +0 | 4B | uint32_t | scope_id |
| +4 | 2B | uint16_t | scope_kind (see table below) |
| +6 | 1B | uint8_t | flags_a |
| +7 | 1B | uint8_t | flags_b |
| +8 | 1B | uint8_t | flags_c |
| +9 | 1B | uint8_t | flags_d |
| +10 | 1B | uint8_t | flags_e |
| +24 | 8B | ptr | name_list_head |
| +32 | 8B | ptr | name_list_tail |
| +208 | 8B | ptr | class_type_ptr |
| +232 | 8B | ptr | deferred_list |
| +328 | 8B | ptr | template_info |
| +552 | 4B | int32_t | parent_scope_index |
| +624 | 8B | ptr | declaration_ptr |
| +680 | 8B | -- | field used by sub_7C0F00 |
| +688 | 4B | uint32_t | entity_number_counter (for mangling) |
| +696 | 4B | uint32_t | entity_number_counter_2 |
scope_kind Values
| Value | Scope Kind |
|---|---|
| 5 | namespace |
| 6 | class |
| 7 | function |
| 8 | block (compound statement) |
| 9 | enum |
| 12 | template parameter |
Push / Pop Operations
sub_854590(0) // push_scope -- increments dword_4F04C64, initializes new entry
sub_854430() // pop_scope -- decrements dword_4F04C64, restores parent
sub_854AB0(...) // pop_declarator_scope (context-specific cleanup)
sub_854B40() // push_declarator_scope (declarator-specific init)
The scope depth counter at qword_4F061C8 + 64 is bumped independently for declarator nesting depth tracking. Class scope depth lives at qword_4F061C8 + 81.
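The push/pop discipline and the parent-index chain can be sketched together; field names and sizes below are illustrative stand-ins for the real 776-byte entries, and `find_enclosing` models the outward walk that `sub_868D90` performs via the `parent_scope_index` field at +552:

```c
#include <assert.h>

#define MAX_SCOPES 64

typedef struct {
    int scope_id;
    int scope_kind;          /* 5=namespace, 6=class, 7=function, 8=block */
    int parent_scope_index;  /* mirrors the field at entry+552 */
} ScopeEntry;

static ScopeEntry g_scopes[MAX_SCOPES];
static int g_current = -1;   /* mirrors dword_4F04C64 */

static int push_scope(int kind) {
    int parent = g_current;
    g_current++;
    g_scopes[g_current].scope_id = g_current;
    g_scopes[g_current].scope_kind = kind;
    g_scopes[g_current].parent_scope_index = parent;
    return g_current;
}

static void pop_scope(void) {
    g_current = g_scopes[g_current].parent_scope_index;
}

/* Walk outward until a scope of the requested kind is found, -1 if none. */
static int find_enclosing(int kind) {
    for (int i = g_current; i >= 0; i = g_scopes[i].parent_scope_index)
        if (g_scopes[i].scope_kind == kind) return i;
    return -1;
}
```

The cached indices (`dword_4F04C44` for the deepest namespace, `dword_4F04C34` for the deepest class) exist precisely so hot paths can skip this linear walk.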
Scope Chain Traversal Algorithm
The declaration-specifier resolver sub_7C0F00 performs scope chain traversal to resolve qualified names. The algorithm was recovered from Phase 4 (lines 1197-1600) of that function.
Unqualified Name Lookup
fn lookup_unqualified(name, scope_index) -> entity:
// Phase 2 of sub_7C0F00
// Try each lookup strategy in priority order:
result = sub_7D5DD0(name) // unqualified lookup in current scope
if result:
return result
result = sub_7D2AC0(name, flags) // lookup with specific flags
if result:
return result
result = sub_7ACA80(name) // ADL / ambiguity resolution
return result
Qualified Name Lookup (A::B::C)
The scope iteration loop at LABEL_282/283/285/288 walks the scope chain:
fn lookup_qualified(base_entity, remaining_name) -> entity:
current = base_entity
while true:
// Check if "::" follows the current entity
if current_token != TK_SCOPE_RESOLUTION: // token 37
return current
consume_token() // sub_7B8B50
// Classify the current entity
kind = current.entity_kind // byte at +80
switch kind:
case 6: // namespace
result = sub_7D4A40(current, remaining_name) // namespace lookup
case 3: // class
case 19: // class template
result = sub_7D2AC0(current, remaining_name, MEMBER_FLAG)
case 17: // template
result = sub_830940(current, remaining_name) // class template lookup
default:
result = sub_7D4600(current, remaining_name) // generic qualified lookup
if !result:
// Error: member not found in scope
sub_6851C0(error_code, context)
return null
current = result
// Check member access, visibility, redeclaration
sub_8841F0(current, scope_entry) // access check for C++ members
Self-Recursive Qualified Resolution
When the declaration-specifier resolver encounters a :: after resolving a name, it recurses into itself at sub_7C0F00(20, a2) where flags=20 decodes as:
- bit 2 (0x04) = nested declarator sub-parse context
- bit 4 (0x10) = restrict parse to type-specifiers only
This handles arbitrarily deep qualified names like A::B::C::D. Recursion depth is bounded by the nesting depth of the qualified name.
Scope Chain Walking for Declaration Resolution
sub_868D90 (ADL / instantiation lookup) walks the scope chain upward:
fn walk_scope_chain(start_index) -> entity:
index = start_index
while index >= 0:
entry = scope_table_base + 776 * index
// Check if this scope contains the target declaration
// ... name lookup within the scope's name list ...
// Move to parent scope
index = entry.parent_scope_index // at offset +552
Type Comparison Engine
sub_7386E0 implements structural type comparison for the EDG frontend. It performs a parallel tree walk over two type nodes, comparing them field-by-field with mode-dependent strictness.
Calling Convention
sub_7386E0(packed_pair: __int128, flags: int) -> bool
packed_pair.low = type_A pointer
packed_pair.high = type_B pointer
flags bits:
0-1: cv_compare_mode (0=strict, 1=relaxed, 2=overload)
2: template_matching_mode
5: anonymous_class_structural_compare
Comparison Algorithm
fn compare_types(type_A, type_B, flags) -> bool:
// 1. Null handling
if both null: return true
if either null: return false
// 2. Canonicalize (strip non-template typedefs)
type_A = sub_72EC50(type_A)
type_B = sub_72EC50(type_B)
// 3. Quick-reject on header bytes
if type_A.type_kind != type_B.type_kind: return false
if (type_A.cv_quals ^ type_B.cv_quals) & 0x43: return false // const/volatile/restrict
if (type_A.flags_1 ^ type_B.flags_1) & 0x04: return false
// 4. Type-specific structural comparison
switch type_A.type_kind:
case 3 (class):
if type_A.scope == type_B.scope: return true // identity shortcut
if unique_id_enabled:
if scope_A.unique_id == scope_B.unique_id: return true
if template_mode:
return sub_89BAF0(...) // template arg list compare
if anonymous_mode && both_anonymous:
return sub_739430(member_list_A, member_list_B)
case 7 (member_pointer):
// Compare: flags, class ptr, scope ptr, return type,
// params, exception spec -- 6 sub-comparisons
case 33 (using_decl) in overload mode:
// Hash table lookup at qword_4D03BF8 for base class lists
// Element-by-element comparison of 24-byte triples
// ... 35 other cases ...
// 5. Post-switch: declaration pointer compare
if type_A.decl != type_B.decl:
if !sub_8D97D0(type_A.decl, type_B.decl): return false
return true
Helper Functions
| Address | Name | Purpose |
|---|---|---|
| sub_7386E0 | edg_compare_type_nodes | Top-level structural compare |
| sub_739370 | edg_compare_type_lists | Linked-list comparator (next at +16) |
| sub_739430 | edg_compare_decl_types | Declaration-level comparator (661 lines) |
| sub_73A280 | edg_compare_type_pair_triv | Trivial wrapper: null=equal |
| sub_72EC50 | edg_canonicalize_type | Strip typedef / elaborated aliases |
| sub_8D97D0 | edg_compare_decl_identity | Name/entity identity comparison |
| sub_8C7520 | edg_class_same_template | Same primary class template check |
| sub_89AB40 | edg_compare_template_args | Template argument list comparison |
| sub_89BAF0 | edg_compare_template_arg_lists_full | Full template context compare |
Key Global: dword_4F07588 -- unique_id optimization
When set, enables O(1) identity comparison via the unique_id field at scope+32. This avoids recursive structural comparison for named classes and enums. The field is compared as a non-null integer; matching non-null values prove the two types refer to the same entity.
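The shortcut can be sketched as a guard in front of the structural walk; all names are illustrative, and the stand-in deep comparison here only exists to show when the fallback fires:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned long long unique_id; /* 0 = not assigned; real field at scope+32 */
} MiniScope;

static bool deep_compare_called;

/* Stand-in for the recursive structural walk. */
static bool deep_structural_compare(const MiniScope *a, const MiniScope *b) {
    deep_compare_called = true;
    return a == b;
}

static bool scopes_equal(const MiniScope *a, const MiniScope *b,
                         bool unique_id_enabled) {
    if (a == b) return true;
    if (unique_id_enabled && a->unique_id && b->unique_id)
        return a->unique_id == b->unique_id; /* O(1) identity path */
    return deep_structural_compare(a, b);    /* fallback */
}
```

Note the double non-null check: a zero id on either side forces the structural comparison, matching the "compared as a non-null integer" behavior above.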
IL Tree Walker and Copier
Tree Walker -- sub_7506E0 (190KB, 7283 lines)
The generic IL tree walker visits every node in the EDG intermediate representation. It dispatches on 83 node kinds (1-86 with gaps at 24-26) using a massive switch statement.
Callback table at .bss 0x4F08014..0x4F08040:
| Address | Type | Callback | Call Sites |
|---|---|---|---|
| dword_4F08014 | bool | skip_shared_nodes | flag |
| dword_4F08018 | bool | clear_back_pointers | 49 sites |
| qword_4F08020 | fn(node, kind) -> node | list_node_rewrite_fn | 206 sites |
| qword_4F08028 | fn(node, kind) -> node | child_rewrite_fn | 926 sites |
| qword_4F08030 | fn(node, kind) -> bool | pre_visit_fn | 2 sites |
| qword_4F08038 | fn(str, kind, len) | string_visitor_fn | 80 sites |
| qword_4F08040 | fn(node, kind) | post_visit_fn | 14 sites |
Visit-mark protocol: Each node has a flag byte at node[-8]. Bit 2 tracks "visited in current pass" with polarity toggled per walk pass via dword_4D03B64. This avoids clearing visited marks between walks.
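The polarity trick can be sketched as follows: instead of clearing bit 2 on every node between walks, the walker flips which value of the bit means "visited" (the role `dword_4D03B64` plays above). Names below are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define VISIT_BIT 0x4  /* bit 2 of the flag byte at node[-8] */

static unsigned g_pass_polarity = 0; /* 0 or VISIT_BIT */

/* Flipping the polarity instantly turns every existing mark into
 * "unvisited" -- no O(n) clearing pass between walks. */
static void begin_new_pass(void) {
    g_pass_polarity ^= VISIT_BIT;
}

/* Returns true (and marks the node) on first visit this pass. */
static bool try_visit(unsigned char *flags) {
    if ((*flags & VISIT_BIT) == g_pass_polarity)
        return false;                          /* already seen this pass */
    *flags = (unsigned char)((*flags & ~VISIT_BIT) | g_pass_polarity);
    return true;
}
```

The cost is one global toggle per pass instead of a full sweep over 60+ node lists.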
Linked-list traversal pattern (60+ lists walked):
for cursor = node.field; cursor; cursor = cursor.next:
if list_node_rewrite_fn:
cursor = list_node_rewrite_fn(cursor, child_kind)
if cursor:
walk_il_node(cursor, child_kind)
cursor = node.field // re-read (rewrite may have changed it)
Next-pointer stride varies by node kind: +0, +16, +24, +32, +56, +112, +120 bytes.
Tree Copier -- sub_766570 (148KB, 5187 lines)
The copier is driven by template instantiation (sub_8C5CD0 -> sub_8C4EC0 -> sub_8C2C50 -> sub_766570). It uses the walker's callback infrastructure:
- `sub_8C38E0` = copy_ref callback: resolves pending copy destinations
- `sub_8C3810` = copy_scope callback: resolves scope-level copies
- Node sizes from `qword_4B6D500[tag]` (87+ entries, one per IL node type)
Copy protocol using flag bits at node[-8]:
| Bits | Meaning |
|---|---|
| 0x1 | needs copy, not yet started |
| 0x2 | copy in progress |
| 0x3 | pending copy (both bits) |
| 0x4 | copy destination allocated |
Copy destination stored at *(node - 24). When both bits 0 and 1 are set, sub_8C3650 forces the copy by allocating qword_4B6D500[tag] bytes and performing memcpy followed by pointer rewriting.
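The forced-copy path can be sketched as a single predicate plus allocate-and-memcpy; the size-table values below are invented for illustration (the real table is `qword_4B6D500`), and pointer rewriting is elided:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum {
    COPY_NEEDED      = 0x1, /* needs copy, not yet started */
    COPY_IN_PROGRESS = 0x2, /* copy in progress */
    COPY_PENDING     = 0x3, /* both bits set: pending copy */
};

/* Illustrative per-tag size table; real entries come from qword_4B6D500. */
static const size_t node_size_by_tag[] = { 0, 64, 88, 104, 120, 152, 456 };

/* Models sub_8C3650: when the flag byte shows the pending state,
 * allocate the per-tag size and memcpy the original node. */
static void *force_copy_if_pending(const void *node, unsigned char flags,
                                   unsigned tag) {
    if ((flags & COPY_PENDING) != COPY_PENDING)
        return NULL; /* not in the pending state */
    size_t sz = node_size_by_tag[tag];
    void *dest = malloc(sz);
    memcpy(dest, node, sz); /* pointer rewriting would follow here */
    return dest;
}
```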
EDG-to-LLVM Type Translation System
Entry: sub_91AED0 -> sub_91AB30. Uses a worklist-driven fixed-point iteration.
Translation Context Object (at a1+160)
| Offset | Size | Field |
|---|---|---|
| +0x000 | 8B | debug_logger |
| +0x008 | 8B | pass_list_ptr |
| +0x038 | 8B | edg_node_map (DenseMap: EDG -> LLVM values) |
| +0x058 | 8B | visited_set (DenseSet for dedup) |
| +0x060 | 4B | visited_count |
| +0x064 | 4B | visited_capacity |
| +0x068 | 4B | bucket_count |
| +0x090 | 8B | type_cache (DenseMap: EDG type -> LLVM Type*) |
| +0x168 | 4B | threshold |
| +0x2A0 | 8B | pending_replacements |
| +0x2A8 | 4B | pending_count |
Fixed-Point Algorithm
fn translate_all_types(ctx, module):
// Phase 1: iterate module members
for member in module.member_list:
sub_AA3700(member) // gather initial flags
// Phase 2: fixed-point iteration
do:
ordering = sub_919CD0(module) // topological sort (10-level BFS)
for type in ordering.reverse():
sub_913880(ctx, type) // invalidate stale cache entries
for type in ordering.reverse():
changed |= sub_9197C0(ctx, type) // process single declaration
while changed
// Phase 3: optional late fixup (byte_3C35480-gated)
if optimization_enabled:
do:
changed = sub_917E30(ctx)
while changed
// Phase 4: cleanup
sub_909590(ctx)
Bitmask for Scope-Tracking Types
The expression 0x100000100003FF >> (kind - 25) selects which type kinds in the range [25..78] require scope tracking during translation. This covers compound types, pointer types, and dependent types that carry CUDA address-space qualifiers.
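Evaluating the constant directly shows which kind values the mask selects: its set bits are 0-9, 28, and 52, i.e. kinds 25-34, 53, and 77 after the `kind - 25` shift. A minimal checker:

```c
#include <assert.h>
#include <stdint.h>

/* A kind in [25..78] needs scope tracking when its bit survives the
 * (kind - 25) shift of the recovered constant. */
static int needs_scope_tracking(int kind) {
    if (kind < 25 || kind > 78) return 0;
    return (int)((0x100000100003FFULL >> (kind - 25)) & 1);
}
```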
Usage Across the Compiler
DenseMap instances appear at these known locations:
- NVVM context object: 8+ tables for IR node uniquing (opcodes 0x10..0x1F), plus sub-function tables for opcodes 0x04..0x15.
- SelectionDAG builder context: Map A (+120), Map B (+152), Set C (+184) for node deduplication and worklist.
- Per-node analysis: embedded DenseSet at +72 inside analysis structures created during DAG construction.
- Instruction constraint table: the global `word_3F3E6C0` array is a flat table rather than a DenseMap, but the constraint emission functions use DenseMaps for lookup caching.
- EDG type translation: 5 distinct caches -- visited set, type cache, type-value map, scope table, and type index table.
- Base class comparison: `qword_4D03BF8` hash table for overload-resolution base class triple lookup.
The consistency of the hash function, sentinel values, and growth policy across all instances is documented in Hash Table and Collection Infrastructure.
Cross-References
- IR Node Layout -- NVVM IR node structure and operand access
- DAG Node -- SelectionDAG builder that consumes DenseMap instances
- Pattern Database -- instruction selection patterns indexed by DenseMap
- Address Spaces -- CUDA memory space qualifier values
- Hash Infrastructure -- comprehensive DenseMap documentation
- EDG Frontend -- EDG tokenizer and keyword dispatch
- IRGen Types -- EDG-to-LLVM type translation detail
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| edg_parse_declarator | sub_662DE0 | -- | Top-level declarator parser |
| edg_parse_decl_specifiers_core | sub_672A20 | -- | While/switch token dispatcher |
| edg_resolve_decl_specifiers | sub_7C0F00 | -- | Scope chain + qualified name resolver |
| edg_compare_type_nodes | sub_7386E0 | -- | Structural type tree comparison |
| edg_compare_type_lists | sub_739370 | -- | Linked-list type comparator |
| edg_compare_decl_types | sub_739430 | -- | Declaration-level type comparator |
| edg_canonicalize_type | sub_72EC50 | -- | Typedef / elaborated alias stripper |
| edg_type_to_string | sub_74A390 | -- | Type-to-string for diagnostics |
| edg_walk_il_node | sub_7506E0 | -- | 190KB IL tree walker (297 recursive calls) |
| edg_copy_il_node | sub_766570 | -- | 148KB IL tree copier |
| edg_push_scope | sub_854590 | -- | Push scope stack entry |
| edg_pop_scope | sub_854430 | -- | Pop scope stack entry |
| edg_emit_scope_chain | sub_82BDA0 | -- | Scope chain emission |
| edg_unqualified_lookup | sub_7D5DD0 | -- | Unqualified name lookup |
| edg_qualified_lookup | sub_7D4600 | -- | Qualified name lookup (after ::) |
| edg_lookup_with_flags | sub_7D2AC0 | -- | Lookup with specific mode flags |
| edg_namespace_lookup | sub_7D4A40 | -- | Lookup in namespace scope |
| edg_compare_decl_identity | sub_8D97D0 | -- | Entity identity comparison |
| edg_type_translation_entry | sub_91AED0 | -- | Top-level EDG-to-LLVM type translation |
| edg_type_translation_driver | sub_91AB30 | -- | Fixed-point iteration driver |
| edg_type_kind_dispatch | sub_918E50 | -- | Type-kind dispatch for translation |
| edg_type_pair_compare | sub_911D10 | -- | Core type-pair comparison + replacement |
| edg_alloc_decl_node | sub_84DCB0 | -- | 152-byte declaration node allocator |
NVVM Container Binary Format
The NVVM container is a proprietary binary envelope that wraps LLVM bitcode with compiler metadata for transport between pipeline stages in cicc v13.0. It carries target architecture, optimization options, fast-math flags, memory window configurations, per-kernel resource tables, and the IR payload itself -- all in a single serializable blob. Two serialization paths exist: a compact binary wire format used in production (nvcc / ptxas pipelines) and an XML-based format used for debugging and interchange. This page specifies the binary format in sufficient detail to write a conformant parser and serializer.
The format is implemented across 26 functions in the 0xCCBB10--0xCDD2D0 address range (Cluster C in the binary layout). The six top-level entry points:
| Function | Address | Size | Role |
|---|---|---|---|
| NvvmContainer_serialize | 0xCDD2D0 | 47,540 B | Binary + XML serializer |
| NvvmContainer_deserialize_options | 0xCD1D80 | 51,859 B | Binary tag/value decoder |
| NvvmContainer_parse_header | 0xCDCA30 | 10,206 B | XML path header parser |
| NvvmContainer_check_versions | 0xCD41B0 | 16,708 B | Version compatibility gate |
| NvvmContainer_validate_versions | 0xCCD5F0 | 8,987 B | Standalone version validator |
| NvvmContainer_init_options_struct | 0xCCBB10 | small | Zero-init 248-byte container struct |
Supporting parsers called from NvvmOptions_parse_compile_options (0xCDB4D0, 26,643 bytes):
| Function | Address | Size | Role |
|---|---|---|---|
| NvvmOptions_parse_arch_enum | 0xCD09E0 | 14,516 B | ArchVariant enum string-to-int |
| NvvmOptions_parse_fast_math | 0xCCF590 | 12,771 B | FastMathOptions sub-structure |
| NvvmOptions_parse_multi_view | 0xCD6D20 | 12,188 B | MultiViewOptions sub-structure |
| NvvmOptions_parse_cb_reserved_area | 0xCCE780 | 9,802 B | CB reserved area config |
| NvvmOptions_parse_reg_targets | 0xCD7CE0 | 9,542 B | Register target config |
| NvvmOptions_parse_serialize_helper | 0xCD58A0 | 9,579 B | Option serialization helper |
| NvvmOptions_parse_shader_const_iface | 0xCCEEA0 | 8,355 B | ShaderConstIface (DCI) |
| NvvmOptions_parse_align_entries | 0xCD8610 | 6,739 B | Alignment entry config |
| NvvmOptions_parse_pgo_section | 0xCD02C0 | 5,482 B | PGO configuration |
| NvvmOptions_parse_section | 0xCD5510 | 5,166 B | Nested YAML section parser |
| NvvmOptions_parse_memory_windows | 0xCCE100 | 5,042 B | Memory window config |
| NvvmOptions_parse_cbank_config | 0xCCE4B0 | 4,173 B | Constant bank config |
| NvvmOptions_parse_bool_or_int | 0xCCC4A0 | small | Boolean/int option parser |
| NvvmOptions_parse_tristate | 0xCCCFB0 | small | Tri-state option parser |
| NvvmOptions_parse_string | 0xCD5150 | small | String option parser |
The finalizer knobs parser (0xCD9990, 31,702 bytes) is called separately to ingest the full set of NVIDIA-specific backend knobs (see NVVMPassOptions).
Binary-level helpers:
| Function | Address | Role |
|---|---|---|
| NvvmContainer_write_tag_value | 0xCD17A0 | Write one tag/value pair (called 121 times from serializer) |
| NvvmContainer_write_blob | 0xCD1AB0 | Write blob data + tag reference |
| NvvmContainer_compute_crc | 0xCCD2B0 | CRC with seeds 0x8DF5D74C, 0xBAA56A96 |
Global state: qword_4F87148 holds the NVVM options global state pointer, checked by many downstream consumers.
Binary Header
Every binary container begins with a fixed 24-byte header. The header is self-describing: HeaderSize at offset 0x0E stores its own length (always 24), and two size fields partition the remainder into a scalar tag region and a blob data region.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Magic (0x7F4E5C7D) | 0x00
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Ver.Major | Ver.Minor | NvvmIR.Major | NvvmIR.Minor | 0x04
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NvvmDbg.Major | NvvmDbg.Minor | Llvm.Major | Llvm.Minor | 0x08
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| IRLevel (u16) | HeaderSize (u16) | 0x0C
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ScalarFieldsEnd (u32) | 0x10
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| BlobDataEnd (u32) | 0x14
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
struct NvvmContainerBinaryHeader {
uint32_t magic; /* 0x00: must be 0x7F4E5C7D */
uint8_t version_major; /* 0x04: container format major (1) */
uint8_t version_minor; /* 0x05: container format minor (<=0x41) */
uint8_t nvvm_ir_major; /* 0x06: NVVM IR version major (2) */
uint8_t nvvm_ir_minor; /* 0x07: NVVM IR version minor (<=0x62) */
uint8_t nvvm_debug_major; /* 0x08: debug info version major (3) */
uint8_t nvvm_debug_minor; /* 0x09: debug info version minor (<=2) */
uint8_t llvm_major; /* 0x0A: LLVM version (see encoding) */
uint8_t llvm_minor; /* 0x0B: LLVM version (see encoding) */
uint16_t ir_level; /* 0x0C: IRLevel enum */
uint16_t header_size; /* 0x0E: always 24 (0x0018) */
uint32_t scalar_fields_end; /* 0x10: byte offset past scalar region */
uint32_t blob_data_end; /* 0x14: byte offset past blob region */
};
The three data regions in order:
[0 .. 24) -- Header (fixed)
[24 .. scalar_fields_end) -- Scalar tag/value pairs
[scalar_fields_end .. blob_data_end) -- Blob data region
The total container size is blob_data_end bytes. After the blob data region, the IR payload (LLVM bitcode, optionally compressed) follows immediately.
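A conformant header reader can be sketched directly from the layout above. This is an illustrative parser written for this page (`nvvm_parse_header` and `NvvmHeader` are not symbols recovered from the binary); it checks only the invariants stated here: the magic, the self-described 24-byte size, and region nesting.

```c
#include <stdint.h>
#include <stddef.h>

#define NVVM_CONTAINER_MAGIC 0x7F4E5C7Du
#define NVVM_HEADER_SIZE     24u

typedef struct {
    uint32_t magic;
    uint8_t  version_major, version_minor;
    uint8_t  nvvm_ir_major, nvvm_ir_minor;
    uint8_t  nvvm_debug_major, nvvm_debug_minor;
    uint8_t  llvm_major, llvm_minor;
    uint16_t ir_level;
    uint16_t header_size;
    uint32_t scalar_fields_end;
    uint32_t blob_data_end;
} NvvmHeader;

/* Parse the fixed 24-byte header from a little-endian byte buffer.
   Returns 0 on success, -1 on malformed input. */
static int nvvm_parse_header(const uint8_t *buf, size_t len, NvvmHeader *h)
{
    if (len < NVVM_HEADER_SIZE) return -1;
    h->magic = (uint32_t)buf[0] | (uint32_t)buf[1] << 8 |
               (uint32_t)buf[2] << 16 | (uint32_t)buf[3] << 24;
    if (h->magic != NVVM_CONTAINER_MAGIC) return -1;
    h->version_major    = buf[4];  h->version_minor    = buf[5];
    h->nvvm_ir_major    = buf[6];  h->nvvm_ir_minor    = buf[7];
    h->nvvm_debug_major = buf[8];  h->nvvm_debug_minor = buf[9];
    h->llvm_major = buf[10];       h->llvm_minor = buf[11];
    h->ir_level    = (uint16_t)(buf[12] | buf[13] << 8);
    h->header_size = (uint16_t)(buf[14] | buf[15] << 8);
    if (h->header_size != NVVM_HEADER_SIZE) return -1;   /* self-describing */
    h->scalar_fields_end = (uint32_t)buf[16] | (uint32_t)buf[17] << 8 |
                           (uint32_t)buf[18] << 16 | (uint32_t)buf[19] << 24;
    h->blob_data_end = (uint32_t)buf[20] | (uint32_t)buf[21] << 8 |
                       (uint32_t)buf[22] << 16 | (uint32_t)buf[23] << 24;
    /* The regions must nest: header <= scalar end <= blob end. */
    if (h->scalar_fields_end < NVVM_HEADER_SIZE ||
        h->blob_data_end < h->scalar_fields_end) return -1;
    return 0;
}
```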
LLVM Version Encoding
The llvm_major and llvm_minor bytes encode the LLVM version as a combined integer: llvm_major * 100 + llvm_minor. For cicc v13.0 (LLVM 20), this yields 20 * 100 + 0 = 2000. The version check compares the combined value, not the individual bytes.
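The combined-value check can be expressed as a one-liner (the helper name is this page's own):

```c
#include <stdint.h>

/* Combine llvm_major/llvm_minor as stored in the header into the value
   the compatibility check compares: major * 100 + minor. */
static uint32_t nvvm_llvm_version(uint8_t llvm_major, uint8_t llvm_minor)
{
    return (uint32_t)llvm_major * 100u + llvm_minor;
}
```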
IRLevel Enum
| Value | Name | Meaning |
|---|---|---|
| 0 | NVVM_IR_LEVEL_UNIFIED_AFTER_DCI | Default: IR after Device-Code-Interface unification |
| 1 | NVVM_IR_LEVEL_LTO | Link-Time Optimization IR (partially optimized) |
| 2 | NVVM_IR_LEVEL_OPTIX | OptiX pipeline IR |
Scalar Tag/Value Encoding
Immediately after the 24-byte header, a sequence of (tag, value) pairs encodes every container field that differs from its default value. The encoding is a variable-length scheme optimized for small values:
Case 1 -- value fits in 16 bits (0x0000..0xFFFE):
[tag : int16] [value : int16] -- 4 bytes total
Case 2 -- value needs 32 bits:
[tag : int16] [0xFFFF : int16] [value : int32] -- 8 bytes total
Terminator:
[0x0000 : int16] -- tag 0 ends the sequence
All multi-byte fields are little-endian. The sentinel value 0xFFFF in the value slot signals that a full 32-bit value follows. This means the maximum encodable 16-bit value is 0xFFFE (65534); values of exactly 0xFFFF or larger require the extended form.
The serializer (sub_CD17A0, called 121 times from NvvmContainer_serialize) writes each tag/value pair using this scheme. The deserializer enters a switch loop over tags 1--402, decoding each value and writing it to the appropriate offset in the deserialized container struct.
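The wire scheme round-trips as follows. These helpers are illustrative stand-ins for sub_CD17A0 and the deserializer's read loop, not recovered code:

```c
#include <stdint.h>
#include <stddef.h>

/* Append one (tag, value) pair in the variable-length scheme.
   Returns the number of bytes written (4 or 8). */
static size_t nvvm_write_tag_value(uint8_t *out, uint16_t tag, uint32_t value)
{
    out[0] = (uint8_t)tag; out[1] = (uint8_t)(tag >> 8);
    if (value < 0xFFFFu) {                     /* Case 1: fits in 16 bits */
        out[2] = (uint8_t)value; out[3] = (uint8_t)(value >> 8);
        return 4;
    }
    out[2] = 0xFF; out[3] = 0xFF;              /* Case 2: sentinel + int32 */
    out[4] = (uint8_t)value;         out[5] = (uint8_t)(value >> 8);
    out[6] = (uint8_t)(value >> 16); out[7] = (uint8_t)(value >> 24);
    return 8;
}

/* Decode one pair; returns bytes consumed, or 0 on the tag-0 terminator. */
static size_t nvvm_read_tag_value(const uint8_t *in, uint16_t *tag,
                                  uint32_t *value)
{
    *tag = (uint16_t)(in[0] | in[1] << 8);
    if (*tag == 0) return 0;                   /* end of scalar region */
    uint16_t v16 = (uint16_t)(in[2] | in[3] << 8);
    if (v16 != 0xFFFF) { *value = v16; return 4; }
    *value = (uint32_t)in[4] | (uint32_t)in[5] << 8 |
             (uint32_t)in[6] << 16 | (uint32_t)in[7] << 24;
    return 8;
}
```

Note that a value of exactly 0xFFFF must take the 8-byte form, since 0xFFFF in the value slot is the extension sentinel.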
Delta Encoding Strategy
The serializer allocates a default-initialized 440-byte Options struct and compares each field in the current Options against the corresponding default. Only fields that differ from the default are written as tag/value pairs. This makes typical containers very compact -- a standard compilation targeting SM 89 with -O2 might emit fewer than 20 tag/value pairs, covering just SmMajor, SmMinor, CompileMode, and a handful of target-specific flags.
The deserializer reverses this: it allocates a default Options struct first, then overwrites individual fields as tags are encountered. Unknown tags are silently skipped, which is the mechanism that provides forward compatibility -- a newer serializer can emit tags that an older deserializer simply ignores.
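The delta strategy can be sketched as a table-driven compare-and-emit loop. `MiniOptions` below is a hypothetical three-field stand-in for the real 440-byte struct, and the tag numbers are the real tags 1--3:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of the Options struct, covering tags 1-3. */
typedef struct { uint32_t sm_major, sm_minor, num_regs; } MiniOptions;

typedef struct { uint16_t tag; size_t offset; } FieldDesc;

static const FieldDesc kFields[] = {
    { 1, offsetof(MiniOptions, sm_major) },
    { 2, offsetof(MiniOptions, sm_minor) },
    { 3, offsetof(MiniOptions, num_regs) },
};

/* Emit a (tag, value) pair only where the current value differs from the
   default. Returns the number of pairs emitted; out[] holds them as plain
   uint32 pairs, ignoring the wire encoding for brevity. */
static int delta_serialize(const MiniOptions *cur, const MiniOptions *def,
                           uint32_t out[][2])
{
    int n = 0;
    for (size_t i = 0; i < sizeof kFields / sizeof kFields[0]; i++) {
        uint32_t c, d;
        memcpy(&c, (const uint8_t *)cur + kFields[i].offset, 4);
        memcpy(&d, (const uint8_t *)def + kFields[i].offset, 4);
        if (c != d) { out[n][0] = kFields[i].tag; out[n][1] = c; n++; }
    }
    return n;
}
```

An SM 89 compile with a default register budget would emit only the SmMajor and SmMinor pairs.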
Blob Data Region
Tags in the 200+ and 400+ ranges reference variable-length data stored in the blob region. The scalar value for a blob tag is the byte offset into the blob region where the data begins. The blob region starts at scalar_fields_end bytes from the container start.
To resolve a blob reference: blob_ptr = container_base + scalar_fields_end + offset_value.
Blob entries do not carry explicit length fields in the tag/value stream. The deserializer knows each blob type's expected size from the tag ID (e.g., tag 201 is always 24 bytes, tag 203 is always 40 bytes). Variable-length blobs like strings (tags 209, 210, 213, 216, 217) are null-terminated. Length-prefixed blobs (tag 218) carry a 4-byte length prefix.
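Blob resolution, with and without a bounds check against the blob region end, can be sketched as (helper names are this page's own):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Resolve a blob reference: the scalar value of a blob tag is an offset
   into the blob region, which begins at scalar_fields_end. */
static const uint8_t *nvvm_blob_ptr(const uint8_t *container_base,
                                    uint32_t scalar_fields_end,
                                    uint32_t offset_value)
{
    return container_base + scalar_fields_end + offset_value;
}

/* Bounds-checked variant for fixed-size blobs (size known from the tag ID,
   e.g. 24 for tag 201): NULL if the blob would run past the blob region. */
static const uint8_t *nvvm_blob_checked(const uint8_t *base,
                                        uint32_t scalar_end, uint32_t blob_end,
                                        uint32_t offset, uint32_t size)
{
    if (scalar_end + offset + size > blob_end) return NULL;
    return base + scalar_end + offset;
}
```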
Complete Tag Table
144 distinct tag IDs organized into six ranges, plus the standalone compression tag 99 (documented below). The "Offset" column refers to the byte position within the deserialized 440-byte Options struct.
Range 1--39: Core Scalar Options
| Tag | Type | Name | Options Offset | Notes |
|---|---|---|---|---|
| 1 | int32 | SmMajor | +0 (ArchVariant) | SM major version (e.g., 8 for SM 89) |
| 2 | int32 | SmMinor | +0 (ArchVariant) | SM minor version (e.g., 9 for SM 89) |
| 3 | int32 | NumRegs | +216 | Register count hint |
| 4 | int32 | NumBarriers | +220 | Barrier count |
| 5 | int32 | SharedMemorySize | +224 | Shared memory size in bytes |
| 6 | int32 | VertexMode | +72 | See VertexMode enum |
| 7 | bit | ReserveLocalAddressZero | +20 bit 0 | Reserve address 0 in local memory |
| 8 | bit | FastMath.IgnoreInf | +200 bit 0 | Treat infinities as NaN |
| 9 | bit | FastMath.IgnoreNaN | +200 bit 1 | Assume no NaN values present |
| 10 | bit | FastMath.IgnoreSignedZero | +200 bit 2 | Ignore sign of zero |
| 11 | bit | FastMath.ReorderFloat | +200 bit 3 | Allow float reordering |
| 12 | bit | FastMath.ReorderHalf | +200 bit 4 | Allow half-precision reordering |
| 13 | bit | FastMath.Ftz | +200 bit 5 | Flush denormals to zero |
| 14 | bit | FastMath.FastSqrt | +200 bit 6 | Use fast sqrt approximation |
| 15 | bit | FastMath.Fmad | +200 bit 7 | Allow fused multiply-add |
| 16 | bit | FastMath.AllowRcpRsqToSqrt | +201 bit 0 | Allow rcp(rsqrt(x)) to sqrt(x) |
| 17 | bit | FastMath.CanReorderFloatDistribute | +201 bit 1 | Allow distributive reordering |
| 18 | int32 | FastMath.Reserved | +204 | Reserved fast-math field |
| 19 | int32 | MaxRRegsAllowed | +216 | Maximum registers per thread (primary) |
| 20 | int32 | SchedRegTarget | +220 | Scheduling register pressure target |
| 21 | int32 | UnrollControl | +224 | Unroll factor control |
| 22 | bool | AcceleratedArch | +232 | True for sm_XXa variants |
| 23 | bool | StdELF | +233 | Use standard ELF output format |
| 24 | int32 | MaxRRegsAllowed2 | +216 | Secondary max-regs (override) |
| 25 | int32 | SchedRegTarget2 | +220 | Secondary sched target |
| 26 | bit | FastMath.ReassociateFloatAddOverMad | +201 bit 2 | Float add reassociation over MAD |
| 27 | bit | ForceImmediateConstants | +20 bit 1 | Force immediate constant loading |
| 28 | bit | HideFunctions | +20 bit 2 | Hide internal functions from output |
| 29 | bit | UseDX10AddressInRange | +20 bit 3 | DX10 address range mode |
| 30 | int32 | UnrollControl2 | +224 | Secondary unroll control |
| 31 | bit | FastMath.NoFloatMAD | +201 bit 3 | Disable float MAD formation |
| 32 | bool | AcceleratedArch2 | +232 | Secondary accelerated-arch flag |
| 33 | bit | FastMath.LaxFP16ApproximateDivision | +201 bit 4 | Lax FP16 approximate division |
| 34 | bool | StdELF2 | +233 | Secondary StdELF |
| 35 | int32 | ShaderCodegenSelMask | +236 | Shader codegen selection bitmask |
| 36 | bool | OmegaPtxErrorHandling | +240 | Enable Omega-style PTX error handling |
| 37 | int32 | FDLInsertMode | +244 | See FDLInsertMode enum |
| 38 | bit | IsPIC | +20 bit 4 | Position-independent code flag |
| 39 | bit | NoSpillsConstraint | +20 bit 5 | Hard constraint: no register spills |
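The deserializer's tag switch maps each decoded pair onto the offsets in this table. An illustrative subset (the function name is this page's own; `opts` stands in for the real 440-byte struct, and the offsets are taken from the rows above):

```c
#include <stdint.h>
#include <string.h>

/* Apply one decoded (tag, value) pair to the Options struct bytes. */
static void apply_scalar_tag(uint8_t *opts, uint16_t tag, uint32_t value)
{
    switch (tag) {
    case 3:  memcpy(opts + 216, &value, 4); break;   /* NumRegs -> +216 */
    case 6:  memcpy(opts + 72,  &value, 4); break;   /* VertexMode -> +72 */
    case 8:  /* FastMath.IgnoreInf -> +200 bit 0 */
        opts[200] = (uint8_t)((opts[200] & ~1u) | (value & 1));
        break;
    case 15: /* FastMath.Fmad -> +200 bit 7 */
        opts[200] = (uint8_t)((opts[200] & ~0x80u) | ((value & 1) << 7));
        break;
    default: break;                                  /* unknown tags: skip */
    }
}
```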
Tag 99: Compression Metadata
| Tag | Type | Name | Notes |
|---|---|---|---|
| 99 | int32 | CompressAlgoId | Compression algorithm selector for IR payload |
When present, the IR payload following the blob region is compressed. The value selects a codec via sub_16886D0(algo_id). If the value is 0, the runtime substitutes the default algorithm ID 0x75D49913. The codec is a pluggable compression/encryption layer accessed through four function pointers:
/* Compression codec API (addresses in the 0x1688xxx range) */
void *codec_acquire(uint32_t algo_id);                        /* sub_16886D0 */
int   codec_compress(void *codec, void *data, size_t size);   /* sub_1688730 */
int   codec_decompress(void *codec, void *data, size_t size); /* sub_16887A0 */
void  codec_release(void *codec);                             /* sub_1688720 */
The write path in NvvmContainer_serialize (0xCDD2D0) compresses the LLVM bitcode payload via sub_C8D290, then computes a CRC hash via NvvmContainer_compute_crc (0xCCD2B0) with the two seed values 0x8DF5D74C and 0xBAA56A96. The CRC value is stored as the tag 99 (CompressAlgoId) value, which doubles as an integrity-check token: the deserializer uses the same CRC seeds to verify the payload before decompression.
The compression subsystem lives outside the main container cluster at addresses 0x16886D0--0x16887A0, in the utility library region of the binary.
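The zero-substitution behavior described above reduces to a single guard (the helper name is this page's own):

```c
#include <stdint.h>

#define NVVM_DEFAULT_COMPRESS_ALGO 0x75D49913u

/* The runtime substitutes the default algorithm ID when tag 99 carries 0. */
static uint32_t nvvm_effective_algo_id(uint32_t compress_algo_id)
{
    return compress_algo_id ? compress_algo_id : NVVM_DEFAULT_COMPRESS_ALGO;
}
```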
Range 101--173: Extended Target Options
These tags configure per-kernel and target-specific hardware parameters. Most map into a sub-structure accessed through the Options struct. The "Location" column gives either a byte offset within the target options sub-structure or, for bit-typed tags, a packed bitfield position (byte, bit).
| Tag | Type | Name | Location | Notes |
|---|---|---|---|---|
| 101 | bool | HasTextureOps | offset 0 | Target supports texture operations |
| 102 | bool | HasSurfaceOps | offset 0 | Target supports surface operations |
| 103 | bool | HasAtomics | offset 0 | Target supports atomic operations |
| 104 | bool | HasVote | offset 0 | Target supports warp vote intrinsics |
| 105 | int32 | MaxThreadsPerBlock | offset 4 | Maximum CTA thread count |
| 106 | byte | PreferL1SizeFlag | offset 8 | L1 cache vs shared memory preference |
| 107 | bool | HasWarpShuffle | offset 0 | Target supports warp shuffle |
| 108 | bool | HasFunnelShift | offset 0 | Target supports funnel shift |
| 109 | int32 | CBankOfstLow | offset 12 | Constant bank offset lower bound |
| 110 | int32 | CBankOfstHi | offset 16 | Constant bank offset upper bound |
| 111 | int32 | CBankSize | offset 20 | Constant bank size in bytes |
| 112 | bit | Bit0_68 | byte 68, bit 0 | Target capability flag |
| 113 | bit | Bit1_68 | byte 68, bit 1 | Target capability flag |
| 114 | bit | Bit2_68 | byte 68, bit 2 | Target capability flag |
| 115 | bit | Bit3_68 | byte 68, bit 3 | Target capability flag |
| 116 | bit | Bit4_68 | byte 68, bit 4 | Target capability flag |
| 117 | bit | Bit5_68 | byte 68, bit 5 | Target capability flag |
| 118 | bit | Bit7_68 | byte 68, bit 7 | Target capability flag (bit 6 skipped) |
| 119 | bit | EnableCoalesce | byte 69, bit 0 | Enable memory coalescing optimization |
| 120 | bit | EnableVectorize | byte 69, bit 2 | Enable auto-vectorization |
| 121 | 2-bit | CompactionMode | byte 69, bits 3--4 | Thread compaction strategy (0--3) |
| 122 | int32 | StackFrameSize | offset 96 | Stack frame size in bytes |
| 123 | int32 | StackAlignment | offset 100 | Stack alignment requirement |
| 124 | int32 | ParamSpaceSize | offset 104 | Parameter space size |
| 125 | int32 | ParamAlignment | offset 108 | Parameter space alignment |
| 126 | int32 | LocalMemSize | offset 116 | Local memory size per thread |
| 127 | int32 | SharedBankConfig | offset 156 | Shared memory bank configuration |
| 128 | int32 | MinGridSize | offset 248 | Minimum grid size for occupancy |
| 129 | int32 | MaxGridDimX | offset 252 | Maximum X-dimension grid size |
| 130 | int32 | SharedMemPerBlock | offset 264 | Shared memory per block |
| 131 | 2-bit | WarpScheduleMode | byte 70, bits 0--1 | Warp scheduling strategy |
| 132 | bit | EnablePrefetch | byte 70, bit 2 | Enable memory prefetch instructions |
| 133 | bit | Bit4_70 | byte 70, bit 4 | Target capability flag |
| 134 | bit | Bit5_70 | byte 70, bit 5 | Target capability flag |
| 135 | bit | Bit6_70 | byte 70, bit 6 | Target capability flag |
| 136 | bit | Bit7_70 | byte 70, bit 7 | Target capability flag |
| 137 | int32 | MaxDynShared | offset 268 | Maximum dynamic shared memory |
| 138 | bool | HasLDG | offset 5 | Target supports LDG instruction |
| 139 | bit | Bit1_71 | byte 71, bit 1 | Target capability flag |
| 140 | bit | Bit2_71 | byte 71, bit 2 | Target capability flag |
| 141 | bool | HasBarrierReduce | offset 40 | Target supports barrier-reduce |
| 142 | int32 | CacheConfig | offset 280 | Cache configuration selector |
| 143 | bit | Bit6_68 | byte 68, bit 6 | Target capability flag |
| 144 | bit | Bit3_71 | byte 71, bit 3 | Target capability flag |
| 145 | bit | Bit0_71 | byte 71, bit 0 | Target capability flag |
| 146 | int32 | ConstBankSize | offset 256 | Constant bank total size |
| 147 | int32 | ShMemBankStride | offset 152 | Shared memory bank stride |
| 148 | 2-bit | ScheduleMode2 | byte 71, bits 4--5 | Secondary scheduling mode |
| 149 | bit | Bit6_71 | byte 71, bit 6 | Target capability flag |
| 150 | bit | Bit7_71 | byte 71, bit 7 | Target capability flag |
| 151 | int32 | LocalMemAlignment | offset 112 | Local memory alignment |
| 152 | bit | EnableBarrierOpt | byte 69, bit 5 | Enable barrier optimization |
| 153 | bit | Bit0_72 | byte 72, bit 0 | Target capability flag |
| 154 | bit | Bit6_69 | byte 69, bit 6 | Target capability flag |
| 155 | bit | Bit7_69 | byte 69, bit 7 | Target capability flag |
| 156 | bit | Bit1_72 | byte 72, bit 1 | Target capability flag |
| 157 | bool | HasDP4A | offset 1 | Target supports DP4A dot-product |
| 158 | bit | Bit3_72 | byte 72, bit 3 | Target capability flag |
| 159 | int32 | ConstBankSize2 | offset 260 | Secondary constant bank size |
| 160 | int32 | MaxRegsPerThread | offset 284 | Hard limit on registers per thread |
| 161 | int32 | ClusterSize | offset 276 | Thread block cluster size (SM 90+) |
| 162 | bit | Bit4_72 | byte 72, bit 4 | Target capability flag |
| 163 | bit | Bit5_72 | byte 72, bit 5 | Target capability flag |
| 164 | bit | Bit6_72 | byte 72, bit 6 | Target capability flag |
| 165 | bit | Bit7_72 | byte 72, bit 7 | Target capability flag |
| 166 | int32 | MaxCTAPerSM | offset 160 | Maximum CTAs per SM |
| 167 | int32 | TexIndirectLimit | offset 272 | Texture indirect access limit |
| 168 | bit | Bit0_432 | byte 432, bit 0 | Extended capability flag |
| 169 | bit | Bit1_432 | byte 432, bit 1 | Extended capability flag |
| 170 | bit | Bit2_432 | byte 432, bit 2 | Extended capability flag |
| 171 | bool | HasTMAOps | offset 289 | Target supports TMA operations (SM 90+) |
| 172 | bit | Bit3_70 | byte 70, bit 3 | Target capability flag |
| 173 | bool | HasTCGen05 | offset 290 | Target supports TCGen05 (SM 100+) |
Range 201--218: Blob Data Tags
| Tag | Size | Name | Description |
|---|---|---|---|
| 201 | 24 B | MemoryWindowCBank | 3 memory window entries for constant bank (see below) |
| 202 | 24 B | MemoryWindowLocal | 3 memory window entries for local memory |
| 203 | 40 B | MemoryWindowShared | 10 x uint32_t for shared memory windows + flags |
| 204 | 48 B | MultiViewOptions | Multi-view rendering header + typed arrays |
| 205 | var | TargetResourceTable | 24-byte header + 36 bytes per entry |
| 206 | var | PerKernelCBankOffsets | 4-byte count + 4 bytes per kernel |
| 207 | var | PerKernelStackSizes | 4-byte count + 4 bytes per kernel |
| 208 | var | PerKernelSMEMSizes | 8-byte count + 8 bytes per kernel |
| 209 | var | TargetFuncName | Null-terminated string |
| 210 | var | TargetEntryName | Null-terminated string |
| 211 | 8 B | PerKernelQWORD | 8-byte per-kernel datum |
| 212 | 12 B | ExtraMemParams | 8 + 4 bytes of memory parameters |
| 213 | var | AuxString1 | Null-terminated auxiliary string |
| 214 | var | PerKernelRegisters | 4-byte count + 4 bytes per kernel |
| 215 | var | PerKernelBarriers | 4-byte count + 4 bytes per kernel |
| 216 | var | AuxString2 | Null-terminated auxiliary string |
| 217 | var | AuxString3 | Null-terminated auxiliary string |
| 218 | var | AuxByteArray | 4-byte length prefix + raw bytes |
Range 301--309: Extended Int32 Fields
| Tag | Type | Name | Options Offset | Notes |
|---|---|---|---|---|
| 301 | int32 | ExtOpt.Field344 | +344 | Cluster/group configuration selector |
| 302 | int32 | ExtOpt.Field348 | +348 | Extended option |
| 303 | int32 | ExtOpt.Field352 | +352 | Extended option |
| 304 | int32 | ExtOpt.Field356 | +356 | Extended option |
| 305 | int32 | ExtOpt.Field360 | +360 | Extended option |
| 306 | int32 | ExtOpt.Field400 | +400 | Extended option |
| 307 | int32 | ExtOpt.Field364 | +364 | Extended option |
| 308 | int32 | ExtOpt.Field368 | +368 | Extended option |
| 309 | int32 | ExtOpt.Field372 | +372 | Extended option |
Range 351--353: Extended Int64 Blob References
| Tag | Size | Name | Options Offset |
|---|---|---|---|
| 351 | 8 B | ExtOpt.QWord376 | +376 |
| 352 | 8 B | ExtOpt.QWord384 | +384 |
| 353 | 8 B | ExtOpt.QWord392 | +392 |
Range 401--402: Structured Blob Data
These tags are conditionally parsed based on the value of tag 301 (ExtOpt.Field344):
| Tag | Condition | Size | Name | Notes |
|---|---|---|---|---|
| 401 | Field344 == 1 | 56+ B | TMADescriptor | SM 90 Hopper TMA bulk-copy descriptors. 44-byte fixed header + 16 bytes per entry. |
| 402 | Field344 == 4 | 40+ B | TCGen05Config | SM 100 Blackwell TCGen05 tensor configurations. 32-byte fixed header + 12 bytes per entry. |
The conditional parsing means a single container cannot carry both TMA and TCGen05 data -- the Field344 value selects which hardware generation's tensor memory interface is active.
TMADescriptor Layout (Tag 401, Field344 == 1)
TMA (Tensor Memory Access) descriptors configure cp.async.bulk operations on SM 90 Hopper. The TMA descriptor extraction is performed by sub_9483E0 during intrinsic lowering. The blob layout:
struct TMADescriptor {
/* +0 */ uint32_t num_entries; /* Number of TMA descriptors */
/* +4 */ uint32_t dimensionality; /* 1d..5d tensor rank */
/* +8 */ uint32_t element_size; /* Bytes per element */
/* +12 */ uint32_t interleave_layout; /* Memory interleave pattern */
/* +16 */ uint32_t swizzle_mode; /* Swizzle mode selector */
/* +20 */ uint32_t fill_mode; /* Out-of-bounds fill behavior */
/* +24 */ uint32_t global_dims[5]; /* Global tensor dimensions */
/* +44 */ /* --- 16 bytes per entry --- */
/* uint32_t box_dim; Per-entry box dimension */
/* uint32_t stride; Per-entry stride */
/* uint32_t elem_stride; Per-entry element stride */
/* uint32_t reserved; Reserved/padding */
};
See SM 90 Hopper for the TMA instruction format and the cp.async.bulk.tensor.g2s.tile.{1d,2d,3d,4d,5d} intrinsic family.
TCGen05Config Layout (Tag 402, Field344 == 4)
TCGen05 (Tensor Core Generation 5) configurations describe Blackwell SM 100 tensor memory operations. The TCGen05 instruction set includes tcgen05.alloc, tcgen05.dealloc, tcgen05.commit, tcgen05.fence, tcgen05.wait, and tcgen05.relinquish.alloc -- all gated by the SM 100 arch-conditional check at sub_30462A0. The blob layout:
struct TCGen05Config {
/* +0 */ uint32_t num_entries; /* Number of TCGen05 configs */
/* +4 */ uint32_t accumulator_size; /* Accumulator memory size */
/* +8 */ uint32_t commit_mode; /* Commit mode (multicast flags) */
/* +12 */ uint32_t fence_mode; /* Fence mode selector */
/* +16 */ uint32_t reserved[4]; /* Reserved fields */
/* +32 */ /* --- 12 bytes per entry --- */
/* uint32_t config_id; TCGen05 config identifier */
/* uint32_t fragment_count; Number of fragments */
/* uint32_t flags; Per-config flags */
};
See SM 100 Blackwell for the TCGen05 instruction set and the tcgen05.* intrinsic family.
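Given the fixed-header-plus-entries layouts above, the expected blob size for either structured tag follows directly (the helper name is this page's own):

```c
#include <stdint.h>

/* Expected blob sizes for the conditionally parsed structured tags:
   TMADescriptor  (tag 401): 44-byte fixed header + 16 bytes per entry.
   TCGen05Config  (tag 402): 32-byte fixed header + 12 bytes per entry. */
static uint32_t nvvm_structured_blob_size(uint16_t tag, uint32_t num_entries)
{
    switch (tag) {
    case 401: return 44 + 16 * num_entries;
    case 402: return 32 + 12 * num_entries;
    default:  return 0;   /* not a structured blob tag */
    }
}
```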
Deserialized Container Struct
After parsing, the container is represented as a 248-byte in-memory structure allocated by NvvmContainer_init_options_struct (0xCCBB10). This struct holds the container metadata plus a pointer to the full 440-byte Options struct.
struct NvvmContainerHeader { /* 248 bytes total */
/* 0x00 */ uint32_t sm_major; /* Tag 1: SM major version */
/* 0x04 */ uint32_t sm_minor; /* Tag 2: SM minor version */
/* 0x08 */ uint32_t num_regs; /* Tag 3 */
/* 0x0C */ uint32_t num_barriers; /* Tag 4 */
/* 0x10 */ uint32_t shared_mem_size; /* Tag 5 */
/* 0x14 */ uint8_t flags_14; /* Packed bits: tags 7,27,28,29,38,39*/
/* bit 0: ReserveLocalAddressZero (tag 7) */
/* bit 1: ForceImmediateConstants (tag 27) */
/* bit 2: HideFunctions (tag 28) */
/* bit 3: UseDX10AddressInRange (tag 29) */
/* bit 4: IsPIC (tag 38) */
/* bit 5: NoSpillsConstraint (tag 39) */
/* 0x15 */ uint8_t _pad15[3];
/* 0x18 */ uint8_t multi_view_options[48]; /* Tag 204 blob */
/* 0x48 */ uint32_t vertex_mode; /* Tag 6 */
/* 0x4C */ uint8_t _pad4c[4];
/* 0x50 */ uint32_t max_rregs; /* Tag 19 */
/* 0x54 */ uint32_t sched_reg_target; /* Tag 20 */
/* 0x58 */ uint32_t unroll_control; /* Tag 21 */
/* 0x5C */ uint8_t _pad5c[4];
/* 0x60 */ uint8_t mem_win_cbank[24]; /* Tag 201 blob */
/* 0x78 */ uint8_t mem_win_local[24]; /* Tag 202 blob */
/* 0x90 */ uint8_t mem_win_shared[40]; /* Tag 203 blob */
/* 0xB8 */ uint8_t _padb8[12];
/* 0xC4 */ uint8_t accelerated_arch; /* Tag 22 */
/* 0xC5 */ uint8_t std_elf; /* Tag 23 */
/* 0xC6 */ uint8_t _padc6[2];
/* 0xC8 */ uint8_t fast_math[8]; /* Tags 8-17,26,31,33 bitfields */
/* 0xD0 */ uint8_t _padd0[8];
/* 0xD8 */ uint32_t max_rregs_2; /* Tag 24 */
/* 0xDC */ uint32_t sched_reg_2; /* Tag 25 */
/* 0xE0 */ uint32_t unroll_ctl_2; /* Tag 30 */
/* 0xE4 */ uint32_t compress_algo_id; /* Tag 99 */
/* 0xE8 */ uint8_t omega_ptx_err; /* Tag 32 */
/* 0xE9 */ uint8_t std_elf_2; /* Tag 34 */
/* 0xEA */ uint8_t _padea[2];
/* 0xEC */ uint32_t shader_cg_sel; /* Tag 35 */
/* 0xF0 */ uint8_t fdl_bit; /* Tag 36 */
/* 0xF1 */ uint8_t _padf1[3];
/* 0xF4 */ uint32_t fdl_insert_mode; /* Tag 37 */
};
/* sizeof(NvvmContainerHeader) == 248 (0xF8) */
The Options pointer is stored at offset 208 (0xD0) of the container header during deserialization -- the container header acts as both a data holder and an index into the full Options struct.
Options Struct (440 bytes)
The full compiler options structure is allocated separately and linked from the container header. It is parsed by NvvmOptions_parse_compile_options (0xCDB4D0, 26,643 bytes) in the XML path, or populated field-by-field from tags in the binary path.
struct NvvmOptions { /* 440 bytes total */
/* +0 */ uint32_t arch_variant; /* ArchVariant enum */
/* +4 */ uint32_t compile_mode; /* CompileMode enum */
/* +8 */ uint32_t opt_level; /* OptLevel enum */
/* +12 */ uint32_t debug_info; /* DebugInfo enum */
/* +16 */ uint32_t client_version;
/* +20 */ uint8_t flags_20; /* Packed booleans: 6 bits */
/* bit 0: ReserveLocalAddressZero */
/* bit 1: ForceImmediateConstants */
/* bit 2: HideFunctions */
/* bit 3: UseDX10AddressInRange */
/* bit 4: IsPIC */
/* bit 5: NoSpillsConstraint */
/* +21 */ uint8_t _pad21[3];
/* +24 */ uint8_t multi_view[48]; /* MultiViewOptions sub-structure */
/* +72 */ uint32_t vertex_mode; /* VertexMode enum */
/* +76 */ uint8_t _pad76[4];
/* +80 */ uint8_t dci_info[120]; /* DCIInfo sub-structure */
/* +200 */ uint8_t fast_math_byte0; /* FastMath bits 0-7 */
/* +201 */ uint8_t fast_math_byte1; /* FastMath bits 8-12 */
/* +202 */ uint8_t _pad202[2];
/* +204 */ uint32_t fast_math_reserved;
/* +208 */ uint8_t _pad208[8];
/* +216 */ uint32_t max_rregs_allowed;
/* +220 */ uint32_t sched_reg_target;
/* +224 */ uint32_t unroll_control;
/* +228 */ uint32_t okey; /* CompressAlgoId / OKey */
/* +232 */ uint8_t accelerated_arch;
/* +233 */ uint8_t std_elf;
/* +234 */ uint8_t _pad234[2];
/* +236 */ uint32_t shader_codegen_sel_mask;
/* +240 */ uint8_t omega_ptx_error_handling;
/* +241 */ uint8_t _pad241[3];
/* +244 */ uint32_t fdl_insert_mode;
/* +248 */ uint8_t target_opts[192]; /* Extended target options (tags 101-173) */
};
/* sizeof(NvvmOptions) == 440 (0x1B8) */
DCIInfo Sub-Structure (Options +80, 120 bytes)
The Device-Code-Interface sub-structure at offset +80 contains the shader constant interface and constant bank reserved area configurations. Parsed by NvvmOptions_parse_shader_const_iface (0xCCEEA0, 8,355 bytes) and NvvmOptions_parse_cb_reserved_area (0xCCE780, 9,802 bytes).
ShaderConstIface XML fields (from sub_CCEEA0):
| Field | Type | Description |
|---|---|---|
| OptimizerConstBank | int32 | Constant bank index used by the optimizer |
| DriverConstBank | int32 | Constant bank index used by the driver |
| BindlessTextureBank | int32 | Constant bank for bindless texture handles |
| LocalMemoryWindow | struct | Memory window config for local memory |
| SharedMemoryWindow | struct | Memory window config for shared memory |
| VectorizeAndRemapTLD | bool | Enable vectorization and TLD remapping |
| ELFControlsDCI | bool | ELF controls DCI interface layout |
| DiscardDefaultValueOutputs | bool | Discard outputs that match default values |
CBReservedArea XML fields (from sub_CCE780):
| Field | Type | Description |
|---|---|---|
| ByteOffsetToEndOfReservedArea | int32 | End-of-reserved-area offset in constant bank |
| CbAddressBitsInReservedVABase | int32 | Address bits for reserved virtual address base |
| CbBankToReservedVABase | int32 | Constant bank index for reserved VA base |
| ForceHighLatencyConstExpr | bool | Force high-latency constant expression evaluation |
| ReservedCbReadBank | int32 | Reserved constant bank read bank index |
MultiViewOptions Sub-Structure (Options +24, 48 bytes)
The multi-view rendering options sub-structure at offset +24 carries graphics pipeline multi-view configuration. Parsed by NvvmOptions_parse_multi_view (0xCD6D20, 12,188 bytes). Serialized as blob tag 204.
| Field | Type | Description |
|---|---|---|
| NumViews | int32 | Number of rendering views |
| NominalViewIDs | int32[] | Array of nominal view identifiers |
| PerViewRTIndexConstants | int32[] | Per-view render target index constants |
| EnableViewInstanceMask | bool | Enable per-view instance masking |
| ComputePerPatchAttribsForViewZero | bool | Compute per-patch attributes for view 0 |
| IsImplicit | bool | Implicit multi-view mode |
CompileMode Enum
| Value | Name | Meaning |
|---|---|---|
| 0 | NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI | Whole-program with ABI compliance |
| 1 | NVVM_COMPILE_MODE_WHOLE_PROGRAM_NOABI | Whole-program without ABI (internal) |
| 2 | NVVM_COMPILE_MODE_SEPARATE_ABI | Separate compilation (relocatable, --device-c) |
| 3 | NVVM_COMPILE_MODE_EXTENSIBLE_WHOLE_PROGRAM_ABI | Extensible whole-program with ABI |
OptLevel Enum
| Value | Name |
|---|---|
| 0 | NVVM_OPT_LEVEL_NONE |
| 1 | NVVM_OPT_LEVEL_1 |
| 2 | NVVM_OPT_LEVEL_2 (default) |
| 3 | NVVM_OPT_LEVEL_3 |
DebugInfo Enum
| Value | Name |
|---|---|
| 0 | NVVM_DEBUG_INFO_NONE (default) |
| 1 | NVVM_DEBUG_INFO_LINE_INFO |
| 2 | NVVM_DEBUG_INFO_DWARF |
VertexMode Enum
| Value | Name |
|---|---|
| 0 | NVVM_VERTEX_MODE_SINGLE |
| 1 | NVVM_VERTEX_MODE_A |
| 2 | NVVM_VERTEX_MODE_B |
| 3 | NVVM_VERTEX_MODE_AB |
FDLInsertMode Enum
| Value | Name |
|---|---|
| 0 | NVVM_FDL_MODE_NONE |
| 1 | NVVM_FDL_MODE_ALL |
| 2 | NVVM_FDL_MODE_APP |
ArchVariant Enum
The architecture enum uses a numeric encoding where the value equals major * 10 + minor (e.g., 75 for SM 7.5; Blackwell's two-digit majors yield three-digit values such as 100 and 120). There are two parallel enum spaces: "virtual" architecture variants (used for compute_XX targets) and "HW" variants (used for sm_XX real silicon targets). The virtual variants are serialized by name in the XML format via NvvmOptions_parse_arch_enum (0xCD09E0, 14,516 bytes).
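The virtual encoding round-trips trivially; a sketch (helper names are this page's own):

```c
#include <stdint.h>

/* Virtual ArchVariant numeric encoding: value = major * 10 + minor. */
static uint32_t arch_variant_value(uint32_t major, uint32_t minor)
{
    return major * 10 + minor;
}

/* Inverse: split a variant value back into (major, minor). */
static void arch_variant_split(uint32_t value, uint32_t *major, uint32_t *minor)
{
    *major = value / 10;
    *minor = value % 10;
}
```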
Virtual Architecture Variants
| Enum Name | Numeric Value | Generation | SM |
|---|---|---|---|
| NVVM_ARCH_KEPLER_3_0 | 30 | Kepler | 3.0 |
| NVVM_ARCH_KEPLER_3_2 | 32 | Kepler | 3.2 |
| NVVM_ARCH_KEPLER_3_5 | 35 | Kepler | 3.5 |
| NVVM_ARCH_KEPLER_3_7 | 37 | Kepler | 3.7 |
| NVVM_ARCH_MAXWELL_5_0 | 50 | Maxwell | 5.0 |
| NVVM_ARCH_MAXWELL_5_2 | 52 | Maxwell | 5.2 |
| NVVM_ARCH_MAXWELL_5_3 | 53 | Maxwell | 5.3 |
| NVVM_ARCH_PASCAL_6_0 | 60 | Pascal | 6.0 |
| NVVM_ARCH_PASCAL_6_1 | 61 | Pascal | 6.1 |
| NVVM_ARCH_PASCAL_6_2 | 62 | Pascal | 6.2 |
| NVVM_ARCH_VOLTA_7_0 | 70 | Volta | 7.0 |
| NVVM_ARCH_VOLTA_7_2 | 72 | Volta | 7.2 |
| NVVM_ARCH_TURING_7_3 | 73 | Turing | 7.3 |
| NVVM_ARCH_TURING_7_5 | 75 | Turing | 7.5 |
| NVVM_ARCH_AMPERE_8_0 | 80 | Ampere | 8.0 |
| NVVM_ARCH_AMPERE_8_2 | 82 | Ampere | 8.2 |
| NVVM_ARCH_AMPERE_8_6 | 86 | Ampere | 8.6 |
| NVVM_ARCH_AMPERE_8_7 | 87 | Ampere | 8.7 |
| NVVM_ARCH_AMPERE_8_8 | 88 | Ampere | 8.8 |
| NVVM_ARCH_ADA_8_9 | 89 | Ada Lovelace | 8.9 |
| NVVM_ARCH_HOPPER_9_0 | 90 | Hopper | 9.0 |
| NVVM_ARCH_BLACKWELL_10_0 | 100 | Blackwell | 10.0 |
| NVVM_ARCH_BLACKWELL_10_1 | 101 | Blackwell | 10.1 |
| NVVM_ARCH_BLACKWELL_10_3 | 103 | Blackwell | 10.3 |
| NVVM_ARCH_BLACKWELL_11_0 | 110 | Blackwell (Jetson Thor) | 11.0 |
| NVVM_ARCH_BLACKWELL_12_0 | 120 | Blackwell (RTX 50xx / Pro) | 12.0 |
| NVVM_ARCH_BLACKWELL_12_1 | 121 | Blackwell (DGX Spark) | 12.1 |
Note: NVVM_ARCH_BLACKWELL_10_1 maps to __CUDA_ARCH__ 1010, while NVVM_ARCH_BLACKWELL_11_0 maps to __CUDA_ARCH__ 1100. Despite both being in the BLACKWELL family, they are distinct architectures with separate entries in the processor table. sm_110 (Jetson Thor) was originally designated sm_101 before being renumbered to its own 11.x line.
HW Architecture Variants
The HW variants use a major * 100 + minor * 10 encoding for their internal numeric values (5.0 → 500, 10.1 → 1010). These map to real silicon rather than virtual compute capabilities:
| Enum Name | Internal Value | Notes |
|---|---|---|
NVVM_ARCH_HW_SM_5_0 | 500 | Maxwell HW baseline |
| ... | ... | One entry per supported HW SM through 9.0 |
NVVM_ARCH_HW_SM_10_0 | 1000 | Blackwell datacenter |
NVVM_ARCH_HW_SM_10_1 | 1010 | Blackwell Ultra (GB300) |
NVVM_ARCH_HW_SM_10_3 | 1030 | Blackwell variant |
NVVM_ARCH_HW_SM_10_4 | 1200 | Maps to SM 120 value -- not publicly documented |
The HW_SM_10_4 = 1200 mapping is notable: SM 10.4 in the HW enum space corresponds to the SM 120 consumer architecture. This reveals that "SM 120" is internally considered a Blackwell 10.4 die variant, not a separate generation.
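The two encodings can be sketched directly from the tables above. This is a minimal illustration; the function names are descriptive, not recovered symbols, and the HW_SM_10_4 exception is hand-mapped rather than computed.

```python
# Sketch of the two ArchVariant numeric encodings implied by the tables above.
# Helper names here are illustrative, not recovered symbols.

def virtual_arch_value(major: int, minor: int) -> int:
    """Virtual (compute_XX) encoding: major * 10 + minor."""
    return major * 10 + minor

def hw_arch_value(major: int, minor: int) -> int:
    """HW (sm_XX silicon) encoding: major * 100 + minor * 10."""
    return major * 100 + minor * 10

assert virtual_arch_value(8, 9) == 89       # NVVM_ARCH_ADA_8_9
assert virtual_arch_value(12, 1) == 121     # NVVM_ARCH_BLACKWELL_12_1
assert hw_arch_value(5, 0) == 500           # NVVM_ARCH_HW_SM_5_0
assert hw_arch_value(10, 1) == 1010         # NVVM_ARCH_HW_SM_10_1
# Exception: NVVM_ARCH_HW_SM_10_4 is hand-mapped to 1200 (the SM 120 value),
# not hw_arch_value(10, 4) == 1040.
```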
FastMathOptions Bitfields
The fast-math configuration occupies two bytes at Options offset +200 and +201, with an additional int32 at +204. Each bit independently controls one floating-point relaxation.
Byte +200 (tags 8--15)
Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
+-------+-------+-------+-------+-------+-------+-------+-------+
| Fmad | Fast | Ftz |Reorder|Reorder|Ignore | Ignore|Ignore |
| | Sqrt | | Half | Float | Sign0 | NaN | Inf |
+-------+-------+-------+-------+-------+-------+-------+-------+
tag 15 tag 14 tag 13 tag 12 tag 11 tag 10 tag 9 tag 8
Byte +201 (tags 16--17, 26, 31, 33)
Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | Lax | No |Reassoc|CanReor| Allow |
| | | | FP16 | Float | Float |derDist| Rcp |
| | | | Div | MAD |AddMAD |ribute | Rsq |
+-------+-------+-------+-------+-------+-------+-------+-------+
tag 33 tag 31 tag 26 tag 17 tag 16
FastMath Divide Sub-Enum
The Divide field within FastMathOptions is a nested enum serialized by name in the XML path:
| Value | Name | Meaning |
|---|---|---|
| 0 | NVVM_FAST_MATH_DIVIDE_PRECISE_NO_FTZ | IEEE-compliant division, no flush-to-zero |
| 1 | NVVM_FAST_MATH_DIVIDE_PRECISE_ALLOW_FTZ | IEEE division with FTZ permitted |
| 2 | NVVM_FAST_MATH_DIVIDE_FULL_RANGE_APPROX | Full-range approximation |
| 3 | NVVM_FAST_MATH_DIVIDE_FAST_APPROX | Fast approximation (least precise) |
These correspond to the nvcc flags -prec-div=1 (precise) and -prec-div=0 (fast), with FTZ interaction determined by -ftz.
Complete FastMath XML Field Inventory
The full set of XML field names parsed by NvvmOptions_parse_fast_math (0xCCF590, 12,771 bytes):
| XML Field Name | Binary Tag | Type | Description |
|---|---|---|---|
IgnoreInf | 8 | bit | Treat infinities as NaN |
IgnoreNaN | 9 | bit | Assume no NaN values present |
IgnoreSignedZero | 10 | bit | Ignore sign of zero |
ReorderFloat | 11 | bit | Allow float reordering |
ReorderHalf | 12 | bit | Allow half-precision reordering |
Ftz | 13 | bit | Flush denormals to zero |
FastSqrt | 14 | bit | Use fast sqrt approximation |
Fmad | 15 | bit | Allow fused multiply-add |
AllowRcpRsqToSqrt | 16 | bit | Allow rcp(rsqrt(x)) to sqrt(x) |
CanReorderFloatDistribute | 17 | bit | Allow distributive reordering |
ReassociateFloatAddOverMad | 26 | bit | Float add reassociation over MAD |
NoFloatMAD | 31 | bit | Disable float MAD formation |
LaxFP16ApproximateDivision | 33 | bit | Lax FP16 approximate division |
Divide | -- | enum | Division precision sub-enum (above) |
The Divide field is serialized as a nested enum element in XML; in the binary format it is encoded as part of the fast-math reserved int32 at Options +204 (tag 18).
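The byte +200 layout diagrammed above (tag N occupying bit N - 8) can be decoded with a small lookup. This is a sketch against the documented bit positions only; the table and function names are illustrative.

```python
# Minimal decoder for the FastMathOptions byte at Options +200, following the
# bit layout diagram above: tags 8..15 occupy bits 0..7. Illustrative only.

FASTMATH_BYTE_200 = {   # bit position -> XML field name (binary tag = bit + 8)
    0: "IgnoreInf", 1: "IgnoreNaN", 2: "IgnoreSignedZero",
    3: "ReorderFloat", 4: "ReorderHalf", 5: "Ftz",
    6: "FastSqrt", 7: "Fmad",
}

def decode_fastmath_byte(value: int) -> dict:
    return {name: bool(value >> bit & 1) for bit, name in FASTMATH_BYTE_200.items()}

# Ftz (bit 5) and Fmad (bit 7) set, everything else clear -- the combination
# shown in the annotated hex dump later in this page:
flags = decode_fastmath_byte(0b10100000)
assert flags["Ftz"] and flags["Fmad"] and not flags["IgnoreInf"]
```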
Memory Window Configuration
Memory windows define how the compiler maps address spaces to hardware memory banks. Three window types are serialized as blobs via tags 201--203, parsed by NvvmOptions_parse_cbank_config (0xCCE4B0) and NvvmOptions_parse_memory_windows (0xCCE100).
MemoryWindow Type Enum
| Value | Name | Meaning |
|---|---|---|
| 0 | NVVM_MEMORY_WINDOW_SPECIAL_REGISTER | Accessed via special registers |
| 1 | NVVM_MEMORY_WINDOW_CBANK | Constant bank window |
| 2 | NVVM_MEMORY_WINDOW_IMMEDIATE | Immediate offset addressing |
Window Entry Layout (8 bytes)
struct MemoryWindowEntry {
uint32_t window_type; /* MemoryWindow type enum */
uint32_t cbank; /* Constant bank index */
/* The following are part of the containing blob: */
/* uint32_t cbank_ofst_low; -- lower bound of offset range */
/* uint32_t cbank_ofst_hi; -- upper bound of offset range */
};
- Tag 201 (`MemoryWindowCBank`): 24 bytes = 3 entries of `{window_type, cbank, low, hi}` truncated to fit, or 3 x 8 bytes depending on sub-field packing.
- Tag 202 (`MemoryWindowLocal`): 24 bytes, same structure.
- Tag 203 (`MemoryWindowShared`): 40 bytes = 10 x `uint32_t` values encoding shared memory bank strides, offsets, and configuration flags.
Version Compatibility Logic
Version checking is the first operation performed on a container buffer, implemented in NvvmContainer_check_versions (0xCD41B0). The logic is conservative on major versions and lenient on minor versions:
1. Verify magic == 0x7F4E5C7D
Fail: return NULL (not a container)
2. Version.Major must == 1
Fail: "NvvmContainer major version N not compatible" → return NULL
3. Version.Minor compared to 0x41 (65)
If container minor > tool minor:
Warning: "Linked container's NvvmContainer minor version N newer than tool"
Parse continues regardless.
4. NvvmIRVersion.Major must == 2
Fail: "NvvmIR major version N not compatible" → return NULL
5. NvvmIRVersion.Minor compared to 0x62 (98)
If container minor > tool minor: warning, parse continues.
6. NvvmDebugVersion.Major must == 3
Fail: "NvvmDebug major version N not compatible" → return NULL
7. NvvmDebugVersion.Minor compared to 2
If container minor > tool minor: warning, parse continues.
8. LlvmVersion (major*100 + minor) must be <= 2000
Fail: "LLVM version N not compatible" → return NULL
A separate standalone validator (0xCCD5F0) adds a mode-dependent check: in binary dump mode (a5=0), the LLVM version must be exactly 20; in normal mode (a5=1), it must be <= 20.
The philosophy is clear: major version bumps signal breaking format changes and are hard failures. Minor version bumps add new tags but never change existing tag semantics -- the delta encoding and unknown-tag-skipping design ensures forward compatibility.
Current Version Constants (cicc v13.0)
| Field | Major | Minor |
|---|---|---|
| Version (container format) | 1 | 0x41 (65) |
| NvvmIRVersion | 2 | 0x62 (98) |
| NvvmDebugVersion | 3 | 2 |
| LlvmVersion | 20 | 0 |
XML Serialization Format
The XML path (NvvmContainer_parse_header at 0xCDCA30) uses NVIDIA's YAML-based serialization framework with virtual dispatch. The top-level XML document contains these elements:
<NvvmContainer>
<Version major="1" minor="65"/>
<NvvmIRVersion major="2" minor="98"/>
<NvvmDebugVersion major="3" minor="2"/>
<LlvmVersion major="20" minor="0"/>
<IRLevel>NVVM_IR_LEVEL_UNIFIED_AFTER_DCI</IRLevel>
<Options>
<ArchVariant>NVVM_ARCH_ADA_8_9</ArchVariant>
<CompileMode>NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI</CompileMode>
<OptLevel>NVVM_OPT_LEVEL_2</OptLevel>
<DebugInfo>NVVM_DEBUG_INFO_NONE</DebugInfo>
<FastMathOptions>
<Ftz>1</Ftz>
<Fmad>1</Fmad>
<Divide>NVVM_FAST_MATH_DIVIDE_FAST_APPROX</Divide>
...
</FastMathOptions>
<MaxRRegsAllowed>255</MaxRRegsAllowed>
...
</Options>
<IsBinary>1</IsBinary>
<Module>... base64-encoded LLVM bitcode ...</Module>
</NvvmContainer>
All enum values are serialized by their full string names (e.g., NVVM_COMPILE_MODE_SEPARATE_ABI), not by numeric value. The XML format does not use delta encoding -- every field is written regardless of whether it matches the default, making XML containers significantly larger but human-readable.
Serialization Flow
The serializer (0xCDD2D0) has two modes controlled by parameter a3: binary (a3=1) and XML (a3=0).
Binary Serialization (a3=1)
1. Compute version fields (use defaults if not set):
Version = {1, 0x41}
NvvmIRVersion = {2, 0x62}
NvvmDebugVersion = {3, 2}
LlvmVersion = {20, 0}
2. Allocate 248-byte NvvmContainerHeader (zeroed)
3. Allocate 440-byte default Options struct
4. Allocate two growable arrays:
scalar_tags[] -- int32 entries for tag/value pairs
blob_data[] -- byte entries for blob payloads
5. For each field in current Options vs. default Options:
If field differs:
Scalar → sub_CD17A0(scalar_tags, tag_id, value)
Blob → sub_CD1AB0(blob_data, scalar_tags, tag_id, ptr, size)
6. Optional IR compression:
If a4 flag set:
Compress LLVM bitcode via sub_C8D290
Compute CRC via sub_CCD2B0 → store as tag 99
Compress via sub_1688730(codec, data, size)
7. Append terminator: tag 0 to scalar_tags
8. Write 24-byte header (with computed ScalarFieldsEnd, BlobDataEnd)
9. Write scalar_tags array
10. Write blob_data array
11. Write compressed or raw IR payload
Deserialization (0xCD1D80)
1. Verify magic == 0x7F4E5C7D
2. Allocate 248-byte NvvmContainerHeader
3. Allocate 440-byte Options struct with defaults
4. Store Options pointer at container header offset 208
5. Compute tag_ptr = buffer + header_size (from offset 0x0E)
6. Compute blob_base = buffer + scalar_fields_end (from offset 0x10)
7. Enter switch loop:
Read tag (int16), decode value (int16 or sentinel + int32)
Switch on tag (103 unique case labels):
Tags 1-39: → write scalar to Options field
Tag 99: → store compression algo ID
Tags 101-173: → write to extended target options
Tags 201-218: → resolve blob offset, copy blob data
Tags 301-309: → write to extended int32 fields
Tags 351-353: → copy 8-byte blob to extended fields
Tags 401-402: → conditionally parse structured blob
Tag 0 → exit loop
8. If tag 99 present: decompress IR payload
9. Return container pointer
Annotated Hex Dump
A minimal container targeting SM 89 (Ada Lovelace) with default options (only SmMajor and SmMinor differ from defaults):
Offset Hex Decoded
------ ----------------------------------------- ---------------------------------
0x0000 7D 5C 4E 7F Magic: 0x7F4E5C7D
0x0004 01 41 Version: 1.65
0x0006 02 62 NvvmIRVersion: 2.98
0x0008 03 02 NvvmDebugVersion: 3.2
0x000A 14 00 LlvmVersion: 20.0
0x000C 00 00 IRLevel: 0 (UNIFIED_AFTER_DCI)
0x000E 18 00 HeaderSize: 24
0x0010 2C 00 00 00 ScalarFieldsEnd: 44
0x0014 2C 00 00 00 BlobDataEnd: 44 (no blobs)
--- Scalar tag/value region ---
0x0018 01 00 08 00 Tag 1 (SmMajor) = 8
0x001C 02 00 09 00 Tag 2 (SmMinor) = 9
0x0020 0D 00 01 00 Tag 13 (Ftz) = 1
0x0024 0F 00 01 00 Tag 15 (Fmad) = 1
0x0028 00 00 Terminator (tag 0)
0x002A 00 00 Padding to alignment
--- Blob data region ---
(empty -- ScalarFieldsEnd == BlobDataEnd)
--- IR payload follows at offset 0x002C ---
0x002C DE C0 17 0B ... LLVM bitcode (0xDEC0170B magic)
This example shows the efficiency of delta encoding: only 4 tag/value pairs (16 bytes of tags) plus the 24-byte header produce a fully-specified container. All other fields (CompileMode, OptLevel, DebugInfo, all target options) inherit their defaults during deserialization.
A container with a 32-bit value would look like:
0x00XX 13 00 FF FF 00 04 00 00 Tag 19 (MaxRRegsAllowed) = 1024
(0xFFFF sentinel, then 0x0400 LE)
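The annotated dump above can be reconstructed and re-parsed byte for byte. The following sketch builds the minimal SM 89 container and walks its scalar tag region, including the 0xFFFF escape for 32-bit values; `read_scalar_tags` is illustrative, not a recovered symbol.

```python
import struct

# Rebuild the minimal SM 89 container from the annotated hex dump above.
header = struct.pack("<I", 0x7F4E5C7D)           # magic
header += bytes([1, 0x41, 2, 0x62, 3, 2])        # Version, NvvmIR, NvvmDebug
header += bytes([20, 0])                         # LlvmVersion 20.0
header += struct.pack("<HHII", 0, 24, 44, 44)    # IRLevel, HeaderSize,
assert len(header) == 24                         # ScalarFieldsEnd, BlobDataEnd

tags = struct.pack("<8H", 1, 8, 2, 9, 13, 1, 15, 1)  # SmMajor=8, SmMinor=9,
tags += struct.pack("<HH", 0, 0)                     # Ftz=1, Fmad=1, terminator+pad
container = header + tags

def read_scalar_tags(buf: bytes) -> dict:
    """Walk the tag/value region: int16 tag, int16 value; a 0xFFFF value is
    the sentinel promoting the payload to a trailing int32."""
    hdr_size, = struct.unpack_from("<H", buf, 0x0E)
    pos, out = hdr_size, {}
    while True:
        tag, val = struct.unpack_from("<HH", buf, pos)
        pos += 4
        if tag == 0:                             # terminator
            return out
        if val == 0xFFFF:                        # 32-bit escape
            val, = struct.unpack_from("<I", buf, pos)
            pos += 4
        out[tag] = val

assert read_scalar_tags(container) == {1: 8, 2: 9, 13: 1, 15: 1}

# Wide-value path, matching the MaxRRegsAllowed example above:
wide = header + struct.pack("<HH", 19, 0xFFFF) + struct.pack("<I", 1024) \
       + struct.pack("<HH", 0, 0)
assert read_scalar_tags(wide)[19] == 1024
```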
Pipeline Integration
The container serves as the inter-stage transport format within the cicc compilation pipeline. Two entry paths exist:
| Path | Entry Function | Address | Pipeline |
|---|---|---|---|
| Path A (LibNVVM) | nvvmCompileProgram dispatcher | 0x9047E0 | 3-phase: LNK -> OPT -> LLC |
| Path B (standalone) | cicc_main orchestrator | 0x12642A0 | 4-stage: LNK -> OPT -> OPTIXIR -> LLC |
Both paths deserialize the container at phase 1, then translate Options into per-stage compiler flags:
- `SmMajor`/`SmMinor` from tags 1--2 become `-mcpu=sm_XX`
- `FastMath.Ftz` from tag 13 becomes `-nvptx-f32ftz`
- `FastMath.Fmad` from tag 15 becomes the IEEE mode flag
- `OptLevel` becomes `-nvptx-opt-level=N`
- `CompileMode == 2` (SEPARATE_ABI) adds `--device-c`
- `IRLevel == 1` (LTO) enters the LTO pipeline with partially-optimized bitcode
- `IRLevel == 2` (OPTIX) activates the OptiX IR stage (bit 6 of pipeline bitmask) and disables LICM and IP-MSP
The container format is the single source of truth for all compilation parameters. When cicc is invoked by nvcc, the driver serializes its accumulated flags into a container, passes the container as input, and cicc deserializes it back into compiler options. This round-trip through binary serialization ensures that all pipeline stages see exactly the same configuration, eliminating the flag-parsing divergence that would otherwise arise from each stage having its own CLI parser.
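The Options-to-flags translation can be sketched as a pure function over the deserialized fields. This is illustrative only: keys use the XML field names because the text does not give tag IDs for every option, and the function itself is not a recovered symbol.

```python
# Illustrative sketch of the per-stage flag translation listed above.
# Flag spellings follow the text; the function is not a recovered symbol.

def options_to_flags(opts: dict) -> list:
    flags = [f"-mcpu=sm_{opts['SmMajor']}{opts['SmMinor']}"]
    if opts.get("Ftz"):                      # FastMath.Ftz, tag 13
        flags.append("-nvptx-f32ftz")
    flags.append(f"-nvptx-opt-level={opts.get('OptLevel', 2)}")
    if opts.get("CompileMode") == 2:         # SEPARATE_ABI
        flags.append("--device-c")
    return flags

assert options_to_flags({"SmMajor": 8, "SmMinor": 9, "Ftz": 1}) == \
    ["-mcpu=sm_89", "-nvptx-f32ftz", "-nvptx-opt-level=2"]
```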
YAML Serialization Framework
The XML/YAML path uses a generic serialization framework built on a bundled YAML parser/emitter library (Cluster A: 0xCB0000--0xCBFA60). The library provides:
| Function | Address | Role |
|---|---|---|
yaml_parser_main | 0xCB9640 | Top-level YAML parser (25,873 bytes) |
yaml_emitter_main_loop | 0xCBDA10 | Main YAML emitter loop (23,583 bytes) |
yaml_scanner_scan_tokens | 0xCB7E40 | Token scanner (17,924 bytes) |
yaml_parser_parse_flow | 0xCB8C00 | Flow-style parsing (15,188 bytes) |
yaml_parser_load_document | 0xCBA570 | Document loader/resolver (9,695 bytes) |
The serialization framework uses virtual dispatch: each serializable type registers a serialize/deserialize function pair, and the framework dispatches based on the YAML node type (scalar=1, sequence, mapping). All enum values are serialized by their full string names (NVVM_COMPILE_MODE_SEPARATE_ABI, NVVM_ARCH_ADA_8_9, etc.), not by numeric value.
Finalizer Knobs Integration
The container Options struct also feeds into the NVIDIA finalizer knobs system through NvvmOptions_parse_finalizer_knobs (0xCD9990, 31,702 bytes -- the 7th largest function in the binary). This parser ingests the complete set of NVIDIA-specific backend configuration knobs:
- Shader pipeline controls: `PromoteHalf`, `PromoteFixed`, `USePIXBAR`, `VSIsVREnabled`, `VSIsLastVTGStage`
- Codegen controls: `DisablePredication`, `DisableXBlockSched`, `EnableJumpTable`, `ScheduleKils`
- Memory controls: `DoMMACoalescing`, `AssumeConvertMemoryToRegProfitable`
- Barrier controls: `DisableERRBARAfterMEMBAR`, `GenConvBranchForWarpSync`
- PGO controls: `PGOEpoch`, `PGOBatchSize`, `PGOCounterMemBaseVAIndex`
- Per-CTA controls: `CTASizeX`, `CTASizeY`, `CTASizeZ`, `SharedMemorySize`, `SMemScratchBase`
- Register controls: `MaxActiveWarpsPerSM`, `NumReservedUReg`, `NumScratchURegs`
These knobs are distinct from the NVVMPassOptions system (see NVVMPassOptions) -- the finalizer knobs configure the backend code generator, while NVVMPassOptions configure the optimization pipeline.
Tag Summary Statistics
| Range | Count | Description |
|---|---|---|
| 1--39 | 38 | Core scalar options (SM version, fast-math, unroll, flags) |
| 99 | 1 | Compression metadata |
| 101--173 | 73 | Extended target options (hardware capabilities, memory config) |
| 201--218 | 18 | Blob data (memory windows, resource tables, strings) |
| 301--309 | 9 | Extended int32 fields (cluster config, extended options) |
| 351--353 | 3 | Extended int64 blob references |
| 401--402 | 2 | Structured conditional blobs (TMA / TCGen05) |
| Total | 144 | Distinct tag IDs across 6 ranges |
The deserializer switch statement has 103 unique case labels -- the remaining 41 tags share code paths with other tags (e.g., all single-bit tags in a byte share a case that reads the bit position from a secondary table).
Cross-References
- NVVMPassOptions -- 222-slot optimization pipeline configuration
- Pipeline Entry -- LibNVVM API and CLI entry points
- OptiX IR -- IRLevel=2 OptiX pipeline
- LTO Pipeline -- IRLevel=1 link-time optimization
- SM 90 Hopper -- TMA descriptor usage (tag 401)
- SM 100 Blackwell -- TCGen05 config usage (tag 402)
- Bitcode I/O -- LLVM bitcode reader/writer wrapping the IR payload
- nvcc Interface -- Driver-to-cicc container passing
NVPTX Target Infrastructure
The NVPTXTargetMachine, NVPTXSubtarget, and NVPTXTargetTransformInfo form the target description layer that the entire LLVM backend consults for every decision from type legality through instruction cost to vectorization factor selection. In upstream LLVM, these are three separate source files totaling roughly 1,500 lines; in cicc v13.0 they are spread across the 0xDF0000-0xE00000 address range (TTI hooks), the 0x330-0x35B range (NVPTXTargetLowering), the type legalization tables embedded in NVPTXSubtarget, and the pipeline assembler at 0x12EA000-0x12F0000 (TargetMachine construction). The NVIDIA delta relative to upstream is moderate -- the TTI hooks return GPU-specific constants rather than CPU ones, the SubtargetFeatures carry NVIDIA-proprietary math precision flags, and the TargetMachine creation path has a dual-path design that handles both the cicc standalone pipeline and the LibNVVM API pipeline.
Key Facts
| Property | Value |
|---|---|
| SM processor table | qword_502A920 (45 entries, stride-2, ctor_605 at 0x584510) |
| Target lookup | sub_12EA530 (4KB, calls sub_16D3AC0 = TargetRegistry::lookupTarget) |
| TargetMachine creation | sub_12F4060 (16KB, NVIDIA options) / sub_12E54A0 (50KB, pipeline path) |
| TTI wrapper pass | sub_1BFB520 (208-byte alloc, wraps sub_1BFB9A0) |
| Register bit width (Vector) | sub_DFE640 -- returns 32 (fixed) |
| Scalable vectors | sub_DFE610 -- returns false |
| Max interleave factor | sub_DFB120 (at TTI+448), sub_DFB730 (vectorized variant) |
| SubtargetFeatures | Offsets +2498, +2584, +2843, +2870, +2871 |
| Target triples | nvptx64-nvidia-cuda, nvptx-nvidia-cuda, nvsass-nvidia-* (6 total) |
NVPTXTargetMachine
Dual-Path Target Initialization
cicc constructs the TargetMachine through two independent code paths depending on whether compilation enters through the standalone cicc CLI or through the LibNVVM API. Both converge on TargetRegistry::lookupTarget (sub_16D3AC0) but assemble the target triple, feature string, and TargetOptions differently.
Path 1 -- cicc standalone (sub_12F7D90 -> sub_12F4060):
sub_12F7D90 — CLI parser:
parse "-arch=compute_XX" → SM version (multiplied by 10)
parse "-opt=N" → optimization level
parse "-ftz=N" → flush-to-zero mode
parse "-fma=N" → FMA contraction level
parse "-prec-div=N" → float division precision
parse "-prec-sqrt=N" → sqrt precision
parse "--device-c" → device compilation flag
sub_12F4060 — TargetMachine creation (16KB):
triple = (pointerWidth == 64) ? "nvptx64" : "nvptx"
features = ""
if (sharedmem32bit):
features += "+sharedmem32bitptr"
features += ",+fma-level=N,+prec-divf32=N,+prec-sqrtf32=N"
opts = TargetOptions {
flags: 0,
reloc: PIC (1),
codeModel: 8,
optLevel: from_cli,
threadModel: 1
}
TM = TargetRegistry::lookupTarget(triple, cpu_string)
if (!TM):
error "Error: Cannot specify multiple -llcO#\n"
return TM->createTargetMachine(triple, cpu, features, opts)
Path 2 -- pipeline assembler (sub_12E54A0):
The master pipeline assembly function (50KB, called from both Phase I and Phase II) constructs the target independently:
sub_12E54A0:
ptrSize = Module::getDataLayout().getPointerSizeInBits(0)
if (ptrSize == 64):
triple = "nvptx64" // 7 chars
else:
triple = "nvptx" // 5 chars
target = sub_16D3AC0(&triple, &cpu_string) // TargetRegistry::lookupTarget
if (!target):
error "Failed to locate nvptx target\n" // sub_1C3EFD0
// TargetOptions setup:
opts[0] = 0 // no flags
opts[1] = 1 // PIC relocation
opts[2] = 8 // code model
opts[3] = 1 // opt level indicator
opts[4] = 1 // thread model
opts[5] = 0 // reserved
sub_167F890(subtargetInfo) // initialize SubtargetInfo
TLI = sub_14A04B0(targetLibInfo, moduleName) // TargetLibraryInfo
sub_149CBC0(TLI) // finalize TLI
TTI = sub_1BFB9A0(DataLayout, a2, a3, v269) // TargetTransformInfo
optLevel = read qword_4FBB430 // cl::opt<int> value
PassManagerBuilder = sub_1611EE0(PM)
The pipeline assembler path also checks for an extension hook: if the target has a createExtendedTargetMachine vtable entry at offset +88, it calls that instead, enabling custom target backends. The returned TargetMachine pointer feeds into the 150+ pass registrations that follow.
TargetOptions
The TargetOptions struct passed to both paths uses LLVM's standard layout. The key NVIDIA-specific values:
| Field | Value | Meaning |
|---|---|---|
| Relocation model | 1 (PIC) | Position-independent code, always |
| Code model | 8 | Large code model (matches PTX's flat addressing) |
| Thread model | 1 | POSIX-style threading assumed |
| Optimization level | From CLI | Stored in qword_4FBB430, default from qword_4FBB430[2] |
NVIDIA-Specific Target Features
The feature string passed to createTargetMachine encodes math precision and shared memory configuration as subtarget features. These are not upstream LLVM features -- they are NVIDIA extensions:
| Feature | CLI Source | Subtarget Effect |
|---|---|---|
+sharedmem32bitptr | nvptx-short-ptr / nvptx-32-bit-smem | Enables 32-bit pointers for address space 3 (shared memory); adds p3:32:32:32 to data layout |
+fma-level=N | -fma=N | 0=off, 1=on, 2=aggressive FMA contraction |
+prec-divf32=N | -prec-div=N | 0=approx, 1=full, 2=IEEE+ftz, 3=IEEE compliant |
+prec-sqrtf32=N | -prec-sqrt=N | 0=approx (rsqrt.approx), 1=rn (sqrt.rn) |
Registered in ctor_607 (0x584B60, 14KB):
| Knob | Type | Default | Description |
|---|---|---|---|
nvptx-sched4reg | bool | -- | Schedule for register pressure |
nvptx-fma-level | int | -- | FMA contraction level |
nvptx-prec-divf32 | int | -- | F32 division precision |
nvptx-prec-sqrtf32 | int | -- | Sqrt precision |
nvptx-approx-log2f32 | bool | -- | Use lg2.approx for log2 |
nvptx-force-min-byval-param-align | bool | -- | Force 4-byte byval alignment |
nvptx-normalize-select | bool | -- | Override shouldNormalizeToSelectSequence |
enable-bfi64 | bool | -- | Enable 64-bit BFI instructions |
NVPTXSubtarget Feature Flags
The NVPTXSubtarget object carries the type legalization tables and architecture-specific feature flags that the SelectionDAG, register allocator, and type legalizer consult at every step. These are populated during target construction and indexed by the SM processor table.
Feature Flag Offsets
| Offset | Size | Purpose | Stride |
|---|---|---|---|
| +120 | ptr | Register class array (8-byte stride entries) | -- |
| +2498 | 259 | Type legality flags (indexed per MVT) | 259 bytes per type action |
| +2584 | 259 | Float legality flags (indexed per MVT) | 259 bytes per type action |
| +2843 | 1 | Integer type support flag | -- |
| +2870 | 1 | Branch distance flag | -- |
| +2871 | 1 | Jump table eligibility flag | -- |
The type legality arrays at +2498 and +2584 are the backbone of SelectionDAG's getTypeAction() and isTypeLegal() queries. Each entry covers one MVT (Machine Value Type) and stores the action: Legal, Promote, Expand, Scalarize, or SplitVector. For NVPTX, i32 and f32 are always Legal; i64 and f64 are Legal on all supported SM versions but with expanded arithmetic costs; vectors wider than 128 bits are always Split or Scalarized.
The function sub_201BB90 reads these offsets during type legalization to determine expansion strategy. The branch distance flags at +2870/+2871 control sub_20650A0, which decides jump table eligibility beyond the standard no-jump-tables flag.
Initialization Flow
The SubtargetFeatures initialization follows this path:
1. `ctor_605` (0x584510, 2.6KB) populates `qword_502A920` with the 45-entry SM processor table at static init time.
2. `sub_167F890` initializes the SubtargetInfo during pipeline setup.
3. `sub_982C80` initializes the 224-byte NVPTX feature flag table based on SM version and OS/ABI info.
4. `sub_97DEE0` performs initial population of the feature bitfield.
5. `sub_982B20` applies SM-version-specific refinements from the global table at `qword_4F7FCC8`.
The 224-byte feature table (sub_982C80) initializes bytes 0-127 to all-1s (0xFF), then selectively clears bits based on the target configuration. This "default-enabled, selectively-disabled" pattern means that features are assumed present unless explicitly turned off for a given target.
NVPTXTargetTransformInfo Hook Table
The TTI is the interface through which all LLVM optimization passes query target-specific costs and capabilities. For NVPTX, every hook returns a value calibrated for a scalar-register GPU architecture rather than a SIMD-register CPU.
| TTI Hook | Address | Return Value | Upstream Equivalent |
|---|---|---|---|
getRegisterBitWidth(Vector) | sub_DFE640 | TypeSize::getFixed(32) | AVX2 returns 256, AVX-512 returns 512 |
supportsScalableVectors() | sub_DFE610 | false | AArch64 SVE returns true |
getMaxInterleaveFactor() | sub_DFB120 | Register-pressure-bounded | CPU returns 2-4 based on uarch |
getMaxInterleaveFactor(vectorized) | sub_DFB730 | Separate limit for vectorized loops | -- |
getRegisterBitWidth(Scalar) | sub_DFB1B0 | 32 | Matches PTX 32-bit register file |
getInstructionCost() | sub_20E14F0 (32KB) | Per-opcode latency from sched model | -- |
hasAttribute(30) | sub_B2D610 | Checks noimplicitfloat | Standard LLVM |
hasAttribute(47) | sub_B2D610 | Checks alwaysvectorize | Standard LLVM |
hasAttribute(18) | sub_B2D610 | Checks optnone | Standard LLVM |
Impact on Loop Vectorization
The 32-bit register width return from sub_DFE640 is the single most consequential TTI hook for GPU compilation. The standard LLVM VF formula is:
VF = registerBitWidth / elementBitWidth
With registerBitWidth = 32:
- `float` (32-bit): VF = 1 -- no vectorization from the register-width formula alone
- `half` (16-bit): VF = 2
- `i8` (8-bit): VF = 4
This means that profitable vectorization of 32-bit types (the dominant case in CUDA) must come entirely from the cost model determining that ld.v2.f32 or ld.v4.f32 is cheaper than multiple scalar loads, not from the register-width heuristic. The LoopVectorize pass (sub_2AF1970) has an explicit override: when the VF formula produces VF <= 1 and the byte_500D208 knob is set, it forces VF = 4 for outer loops.
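The VF formula and the outer-loop override can be written out in a few lines. This is a sketch of the described behavior under the stated knob, not the recovered LoopVectorize code; the parameter names are illustrative.

```python
# The register-width VF formula described above, plus the byte_500D208
# outer-loop override. A behavioral sketch, not recovered code.

def vf_from_register_width(register_bits: int, element_bits: int,
                           force_outer_vf: bool = False) -> int:
    vf = max(register_bits // element_bits, 1)
    if vf <= 1 and force_outer_vf:      # byte_500D208 knob: force VF = 4
        return 4
    return vf

assert vf_from_register_width(32, 32) == 1   # float: no vectorization
assert vf_from_register_width(32, 16) == 2   # half
assert vf_from_register_width(32, 8) == 4    # i8
assert vf_from_register_width(32, 32, force_outer_vf=True) == 4
```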
Impact on SLP Vectorization
The SLP vectorizer (sub_2BD1C50) receives the target vector register width as parameter a3 and uses it to determine maximum bundle width. With 32 bits, SLP bundles are limited to:
- 2x i16 (32 bits total)
- 4x i8 (32 bits total)
- 1x i32 or f32 (degenerate -- no SLP benefit)
In practice, the SLP vectorizer's profitability model can override this limit when paired loads/stores demonstrate memory coalescing benefit, but the register width serves as the initial upper bound.
Impact on Interleave Count
The getMaxInterleaveFactor hook (sub_DFB120, queried at TTI+448) caps the interleave count (IC) for loop unroll-and-jam. The interleave selection algorithm in sub_2AED330 reads this value and combines it with scheduling info at TTI+56:
maxIC = TTI.getMaxInterleaveFactor(VF)
issueWidth = *(TTI + 56 + 32) // scheduling model: issue width
latency = *(TTI + 56 + 36) // scheduling model: latency
IC = IC / max(issueWidth, latency) // cap by pipeline throughput
This models the SM's instruction issue pipeline: even if register pressure allows IC=8, the warp scheduler may saturate at lower IC values, making additional interleaving waste register budget without throughput gain.
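The interleave cap above can be sketched as follows; the integer division and the lower clamp to 1 are assumptions about details the decompilation does not spell out.

```python
# Sketch of the interleave cap in sub_2AED330 as described above. The
# floor division and clamp-to-1 are assumptions, not recovered behavior.

def cap_interleave_count(ic: int, max_ic: int,
                         issue_width: int, latency: int) -> int:
    ic = min(ic, max_ic)                          # bound by getMaxInterleaveFactor
    return max(ic // max(issue_width, latency), 1)  # cap by pipeline throughput

assert cap_interleave_count(8, 8, 2, 4) == 2   # latency-bound: 8 / 4
assert cap_interleave_count(8, 4, 1, 1) == 4   # register-pressure-bound
```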
Arithmetic Cost for i64
NVPTX GPUs have 32-bit ALUs. All 64-bit integer arithmetic is emulated through pairs of 32-bit operations with carry propagation. The TTI getArithmeticInstrCost hook reflects this by returning approximately 2x the base cost for i64 operations:
| Operation | i32 Cost | i64 Cost | Ratio |
|---|---|---|---|
| ADD/SUB | 1 | 2 | 2x (add.cc + addc) |
| MUL | 1 | ~4 | 4x (mul.lo + mul.hi + add chain) |
| DIV/REM | high | very high | Library call on both |
| Shift | 1 | 2-3 | funnel shift pair |
This cost differential causes LLVM optimization passes (InstCombine, SCEV-based transformations, IV widening) to prefer i32 operations, which NVIDIA's custom IV Demotion pass (sub_18B1DE0) further exploits by narrowing 64-bit induction variables to 32-bit where the trip count permits.
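The 2x ADD cost models the carry-chain emulation (`add.cc` + `addc`) directly. The following sketch mimics that two-instruction sequence with 32-bit halves to show why one 64-bit add costs two 32-bit operations.

```python
# Why 64-bit ADD costs 2x: a model of the add.cc/addc carry-chain emulation
# the cost table reflects, computed on 32-bit halves.

MASK32 = 0xFFFFFFFF

def add64_via_32(a: int, b: int) -> int:
    lo = (a & MASK32) + (b & MASK32)                 # add.cc: low halves, set carry
    carry = lo >> 32
    hi = ((a >> 32) + (b >> 32) + carry) & MASK32    # addc: high halves + carry-in
    return (hi << 32) | (lo & MASK32)

assert add64_via_32(0xFFFFFFFF, 1) == 0x1_00000000   # carry propagates across halves
assert add64_via_32(2**63, 2**63) == 0               # wraps mod 2^64
```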
SM Processor Table
The processor table at qword_502A920 is a flat array of 90 qwords (45 SM variants x 2 fields each) with stride-2 layout: even indices hold the SM name string pointer, odd indices hold the PTX version code.
Populated by ctor_605 at 0x584510 (2.6KB), called during static initialization before main. The table is read-only after construction.
qword_502A920[2*i + 0] = const char* sm_name // e.g., "sm_100"
qword_502A920[2*i + 1] = uint64_t ptx_version // 5, 6, or 7
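A lookup over the stride-2 layout can be modeled with a flat list. The entries below are a sample subset chosen to match the version codes documented in the next table; the full 45-entry contents are not reproduced here.

```python
# Model of the stride-2 layout of qword_502A920: even slots hold the SM name,
# odd slots the PTX version code. Entries here are an illustrative sample.

SM_TABLE = ["sm_75", 5, "sm_90", 5, "sm_90a", 6, "sm_100", 6, "sm_100a", 7]

def ptx_version_for(sm: str):
    for i in range(0, len(SM_TABLE), 2):   # walk name slots (even indices)
        if SM_TABLE[i] == sm:
            return SM_TABLE[i + 1]         # paired version code (odd index)
    return None

assert ptx_version_for("sm_75") == 5
assert ptx_version_for("sm_90a") == 6      # only pre-Blackwell SM at version 6
assert ptx_version_for("sm_100a") == 7
```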
PTX Version Codes
| Code | Meaning | SM Range |
|---|---|---|
| 5 | Legacy PTX | sm_20 through sm_90 (all base variants) |
| 6 | Modern PTX | sm_90a, sm_100-sm_121 (base variants only) |
| 7 | Extended PTX | sm_100a/f through sm_121a/f (accelerated/forward-compatible) |
Notable observations:
- `sm_90a` is the only pre-Blackwell SM with PTX version 6.
- The `f` (forward-compatible) suffix uses the same PTX version as `a` (accelerated).
- No entries exist for sm_84, sm_85 (Ada Lovelace numbering gap).
- `sm_73` (Volta sub-variant) and `sm_88` (Ada sub-variant) are present but not publicly documented.
- The table contains 15 legacy architectures (sm_20 through sm_75) that are no longer accessible through the CLI mapping but remain in the backend's processor table.
Data Layout String
The NVPTX data layout string follows LLVM's standard format with three variants selected based on pointer width and shared memory pointer mode:
64-bit with shared memory specialization (most common)
e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
64-bit without shared memory specialization
e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
32-bit mode
e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
Key fields
| Field | Meaning | NVIDIA Note |
|---|---|---|
e | Little-endian | All NVIDIA GPUs |
p:64:64:64 | Generic pointers: 64-bit, 64-bit aligned | Default for 64-bit compilation |
p3:32:32:32 | Address space 3 (shared memory): 32-bit pointers | Controlled by nvptx-short-ptr / nvptx-32-bit-smem / unk_4D0461C |
n16:32:64 | Native integer widths: 16, 32, 64 | Tells LLVM that i16/i32/i64 are all hardware-supported |
v16:16:16 / v32:32:32 | Vector alignment: natural | 16-bit and 32-bit vectors aligned to their width |
The p3:32:32:32 entry is the NVIDIA delta: shared memory lives in a 48KB-228KB on-chip SRAM per SM, addressable with 32-bit pointers even in 64-bit mode. Using 32-bit pointers for shared memory saves register pressure and instruction count for every shared memory access.
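The per-address-space pointer widths can be read straight out of the layout string. This sketch extracts only the `p`/`pN` specifications to show the shared memory (address space 3) delta; it is not how LLVM's own DataLayout parser is structured.

```python
# Extract per-address-space pointer widths from an NVPTX data layout string,
# highlighting the p3:32:32:32 shared-memory delta. Illustrative parser only.

def pointer_widths(layout: str) -> dict:
    widths = {}
    for spec in layout.split("-"):
        if spec.startswith("p"):                 # p:<size>:... or p<as>:<size>:...
            head, size, *_ = spec.split(":")
            addrspace = int(head[1:]) if len(head) > 1 else 0
            widths[addrspace] = int(size)
    return widths

layout = ("e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-"
          "i64:64:64-f32:32:32-f64:64:64-n16:32:64")
assert pointer_widths(layout) == {0: 64, 3: 32}  # generic 64-bit, shared 32-bit
```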
A separate data layout string e-i64:64-v16:16-v32:32-n16:32:64 appears in the IR linker (sub_106AB30) as a compatibility check during module linking. This shortened form is used to validate that two modules being linked share the same NVPTX target data layout.
Data layout validation is performed at multiple points:
- `sub_2C74F70` in the NVVM verifier checks the layout string on every module
- If empty: `"Empty target data layout, must exist"`
- If invalid: prints `"Example valid data layout:"` with reference 32-bit and 64-bit strings from `off_4C5D0A0` / `off_4C5D0A8`
Target Triple Construction
The target triple is constructed at module creation time by checking the pointer width:
if (unk_4F06A68 == 8) // 64-bit data model
triple = "nvptx64-nvidia-cuda" // 19 chars
else
triple = "nvptx-nvidia-cuda" // 17 chars
Eight triples are valid in UnifiedNVVMIR mode:
| Triple | Width | Runtime |
|---|---|---|
nvptx-nvidia-cuda | 32-bit | CUDA |
nvptx64-nvidia-cuda | 64-bit | CUDA |
nvptx-nvidia-nvcl | 32-bit | OpenCL |
nvptx64-nvidia-nvcl | 64-bit | OpenCL |
nvsass-nvidia-cuda | SASS | CUDA native assembly |
nvsass-nvidia-nvcl | SASS | OpenCL native assembly |
nvsass-nvidia-directx | SASS | DirectX backend |
nvsass-nvidia-spirv | SASS | SPIR-V backend |
In non-UnifiedNVVMIR mode, validation is looser: the triple must start with nvptx- or nvptx64- and contain -cuda. The nvsass-nvidia-directx and nvsass-nvidia-spirv triples (discovered in sub_2C80C90) are notable evidence that NVIDIA's SASS-level backend supports DirectX and SPIR-V shader compilation alongside traditional CUDA/OpenCL.
Configuration Knobs
Backend Options (ctor_609_0, 0x585D30, 37KB)
| Knob | Type | Default | Description |
|---|---|---|---|
| nvptx-short-ptr | bool | -- | 32-bit pointers for const/local/shared |
| nvptx-32-bit-smem | bool | -- | 32-bit shared memory pointers |
| nvptx-enable-machine-sink | bool | -- | Enable Machine Sinking |
| enable-new-nvvm-remat | bool | true | Enable new rematerialization |
| nv-disable-remat | bool | false | Disable all remat passes |
| nv-disable-mem2reg | bool | false | Disable MI Mem2Reg pass |
| nv-disable-scev-cgp | bool | false | Disable SCEV address mode opt |
| disable-nvptx-load-store-vectorizer | bool | false | Disable load/store vectorizer |
| disable-nvptx-require-structured-cfg | bool | false | Turn off structured CFG requirement |
| nvptx-exit-on-unreachable | bool | true | Lower unreachable as exit |
| nvptx-early-byval-copy | bool | -- | Copy byval args early |
| enable-nvvm-peephole | bool | true | Enable NVVM Peephole Optimizer |
| lower-func-args | bool | true | Lower large aggregate params |
| enable-sink | bool | true | Enable Sinking |
| disable-post-opt | bool | false | Disable LLVM IR opts post-opt |
| usedessa | int | 2 | Select deSSA method |
| ldg | bool | true | Load Global Constant Transform |
| print-isel-input | bool | false | Print LLVM IR input to isel |
| no-reg-target-nvptxremat | bool | false | Only old remat without reg targets |
| disable-set-array-alignment | bool | false | Disable alignment enhancements |
| nvptx-lower-global-ctor-dtor | bool | -- | Lower GPU ctor/dtors to globals |
Register Pressure & FCA Options (ctor_074, 0x49AAB0)
| Knob | Type | Default | Description |
|---|---|---|---|
| fca-size | int | 8 | Max size of first-class aggregates (bytes) |
| reg-target-adjust | int | 0 (range -10..+10) | Register pressure target adjustment |
| pred-target-adjust | int | 0 (range -10..+10) | Predicate register target adjustment |
| remat-load-param | bool | -- | Support remating const ld.param not in NVVM IR |
| cta-reconfig-aware-rpa | bool | -- | CTA reconfiguration-aware register pressure analysis |
Extension Options (ctor_610, 0x5888A0)
| Knob | Type | Default | Description |
|---|---|---|---|
| unroll-assumed-size | int | 4 | Assumed size for unknown local array types |
| enable-loop-peeling | bool | -- | Enable loop peeling |
| enable-256-bit-load-store | bool | -- | Enable 256-bit vector loads/stores |
| ias-param-always-point-to-global | bool | -- | Parameters always point to global memory |
| ias-strong-global-assumptions | bool | -- | Strong global memory assumptions |
| ias-wmma-memory-space-opt | bool | -- | Memory Space Optimization for WMMA |
TTI Cost Model Options (ctor_061, 0x494D20)
| Knob | Type | Default | Description |
|---|---|---|---|
| costmodel-reduxcost | bool | -- | Recognize reduction patterns |
| cache-line-size | int | -- | Cache line size for cost model |
| min-page-size | int | -- | Minimum page size |
| predictable-branch-threshold | float | -- | Threshold for predictable branch cost |
Differences from Upstream LLVM
- Dual-path TargetMachine construction. Upstream LLVM has a single target creation path through LLVMTargetMachine::createPassConfig. NVIDIA has two independent paths (CLI and pipeline assembler) that converge at TargetRegistry::lookupTarget.
- NVIDIA-proprietary target features. The +sharedmem32bitptr, +fma-level=N, +prec-divf32=N, and +prec-sqrtf32=N features do not exist in upstream NVPTX, which uses +ptx75 / +sm_90 style features. NVIDIA's math precision features are passed through the target feature string to avoid adding a new cl::opt for each.
- 224-byte feature table. The sub_982C80 feature table, with its "default all-1s then selectively clear" initialization pattern, is unique to cicc. Upstream NVPTXSubtarget uses a much simpler feature set derived from +sm_XX and +ptx_YY features.
- Scheduling info at TTI+56. The issue-width and latency values stored in the TTI sub-structure at offset +56 are used by the interleave count selection algorithm. Upstream LLVM's NVPTX backend does not populate these scheduling parameters; it relies on the default "no scheduling model" behavior.
- Extension hook at vtable+88. The pipeline assembler checks for a createExtendedTargetMachine entry, enabling loadable target backend extensions. This is not present in upstream LLVM.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVPTX Target Lookup and Creation | sub_12EA530 | 4 KB | -- |
| TargetMachine Creation with NVIDIA Options | sub_12F4060 | 16 KB | -- |
| Master Pipeline Assembly (includes TM setup) | sub_12E54A0 | 50 KB | -- |
| CICC CLI Argument Parser | sub_12F7D90 | 14 KB | -- |
| TargetRegistry::lookupTarget() | sub_16D3AC0 | -- | -- |
| SubtargetInfo initialization | sub_167F890 | -- | -- |
| TTIWrapperPass allocation (208 bytes) | sub_1BFB520 | -- | -- |
| TargetTransformInfo / DataLayout creation | sub_1BFB9A0 | -- | -- |
| TargetLibraryInfo creation | sub_14A04B0 | -- | -- |
| TargetLibraryInfo finalization | sub_149CBC0 | -- | -- |
| TTI::getRegisterBitWidth(Vector) -- returns 32 | sub_DFE640 | -- | -- |
| TTI::supportsScalableVectors() -- returns false | sub_DFE610 | -- | -- |
| TTI::getMaxInterleaveFactor() (at TTI+448) | sub_DFB120 | -- | -- |
| TTI::getMaxInterleaveFactor(vectorized) | sub_DFB730 | -- | -- |
| TTI::getRegisterBitWidth(Scalar) or cache-line query | sub_DFB1B0 | -- | -- |
| TTI::getInstructionCost() / scheduling cost model | sub_20E14F0 | 33 KB | -- |
| TTI::hasAttribute(N) -- function attribute query | sub_B2D610 | -- | -- |
| TTI::getInstructionCost() (IR-level variant) | sub_B91420 | -- | -- |
| NVPTX feature flag table initializer (224 bytes) | sub_982C80 | -- | -- |
| Feature bitfield initial population | sub_97DEE0 | -- | -- |
| SM-version-specific feature refinements | sub_982B20 | -- | -- |
| SubtargetFeature reads at +2843, +2584, +2498 | sub_201BB90 | -- | -- |
| Branch distance / jump table checks at +2870, +2871 | sub_20650A0 | -- | -- |
| EDG SM architecture feature gating (38KB, ~60 flags) | sub_60E7C0 | -- | -- |
| Module initialization with triple and data layout | sub_908850 | -- | -- |
| SM processor table population (0x584510, 2.6KB) | ctor_605 | -- | -- |
| NVPTX backend math options (0x584B60, 14KB) | ctor_607 | -- | -- |
| NVPTX backend options (0x585D30, 37KB) | ctor_609_0 | -- | -- |
Cross-References
- GPU Target Architecture -- Full SM table, architecture gating thresholds, NVVM container arch enum
- LoopVectorize & VPlan -- TTI hook usage in VF selection and interleave count
- SLP Vectorizer -- TTI register width as SLP bundle width limit
- SelectionDAG -- NVPTXTargetLowering, type legality from SubtargetFeatures
- Memory Space Optimization -- Address space numbering convention
- IV Demotion -- Exploits i64 cost differential reported by TTI
- Register Allocation -- Register pressure budgets bounded by TTI
- Instruction Scheduling -- Scheduling model data at TTI+56
- CLI Flags -- -arch, -ftz, -fma, -prec-div, -prec-sqrt routing
- Optimization Levels -- qword_4FBB430 optimization level storage
- Pipeline & Ordering -- Where TTI is registered in the pass pipeline
Alias Analysis & NVVM AA
cicc ships a custom alias analysis pass (NVVM AA, registered as nvptx-aa) that exploits GPU address space disjointness to prove pointer pairs cannot alias. On a GPU, each hardware memory partition -- global DRAM, shared scratchpad, local stack, constant cache, kernel parameter window -- occupies a physically separate address range. Pointers into different address spaces can never reference the same byte, a property that does not hold on any mainstream CPU ISA. NVVM AA encodes this hardware invariant into the LLVM AA pipeline, returning NoAlias for any cross-address-space pointer pair. This single fact unlocks aggressive dead-store elimination, load-store motion, GVN load forwarding, and MemorySSA precision that would be impossible on a flat-memory machine. The pass is stateless, trivially cheap, and runs first in the AA chain so that more expensive analyses (BasicAA, TBAA) can skip pairs that NVVM AA already resolved.
Beyond pure address-space disjointness, cicc augments the standard LLVM AA infrastructure in three further ways: (1) a process-restrict pass that propagates noalias attributes from __restrict__ kernel parameters, (2) !noalias.addrspace metadata (metadata kind 42) that tags pointers with the set of address spaces they provably do not alias with, and (3) NVIDIA-specific knobs controlling traversal depth, TBAA strictness, and fence relaxation.
Key Facts
| Property | Value |
|---|---|
| Pass name (legacy PM) | nvptx-aa |
| Pass name (new PM) | Registered via NVPTXTargetMachine::registerEarlyDefaultAliasAnalyses |
| Legacy wrapper | NVPTXAAWrapperPass (ImmutablePass, char ID) |
| External wrapper | NVPTXExternalAAWrapper (hooks into ExternalAAWrapperPass, RunEarly=true) |
| Result class | NVPTXAAResult : AAResultBase |
| State | Stateless -- invalidate() always returns false |
| AA chain position | First (before BasicAA) |
| Address traversal depth | Controlled by nvptx-traverse-address-aliasing-limit (default 6) |
| AA evaluator pass | aa-eval at sub_13549C0 (11,038 bytes) |
| AA query entry point | sub_134CB50 -- AAResults::alias(MemoryLocation, MemoryLocation) |
| ModRef query (call, loc) | sub_134F0E0 -- AAResults::getModRefInfo(CallBase, MemoryLocation) |
| ModRef query (call, call) | sub_134F530 -- AAResults::getModRefInfo(CallBase, CallBase) |
GPU Address Space Table
NVPTX defines six logically disjoint address spaces plus a generic (flat) umbrella. See Address Spaces for the complete master table with hardware mapping, pointer widths, latency numbers, and data layout strings.
The critical property exploited by NVVM AA: any (AS_x, AS_y) pair with x != y returns NoAlias, unless either space is 0 (generic), the pair is shared/shared-cluster (AS 3 vs AS 7), or the pair is global/param (AS 1 vs AS 101), since cvta.param on SM 70+ makes param addressable as global. See the Aliasing Rules section for the complete cross-space aliasing specification and the MemorySpaceOpt Internal Bitmask section for the dataflow bitmask encoding used during address space resolution.
The NVVM AA Algorithm
The core alias function follows upstream NVPTXAliasAnalysis.cpp in structure, enhanced with cicc-specific extensions. The pseudocode:
// NVPTXAAResult::alias -- the heart of NVVM AA
AliasResult alias(const MemoryLocation &Loc1,
const MemoryLocation &Loc2,
AAQueryInfo &AAQI) {
unsigned AS1 = getAddressSpace(Loc1.Ptr, TraverseLimit);
unsigned AS2 = getAddressSpace(Loc2.Ptr, TraverseLimit);
// If either pointer is in generic (flat) space, we cannot disambiguate.
// Generic pointers can point to any physical memory at runtime.
if (AS1 == ADDRESS_SPACE_GENERIC || AS2 == ADDRESS_SPACE_GENERIC)
return AliasResult::MayAlias;
// Distributed shared memory (AS 7) overlaps with regular shared (AS 3).
if ((AS1 == 3 && AS2 == 7) || (AS1 == 7 && AS2 == 3))
return AliasResult::MayAlias;
// Same address space: cannot determine from space alone.
// Fall through to BasicAA / TBAA for further analysis.
if (AS1 == AS2)
return AliasResult::MayAlias;
// Different non-generic, non-overlapping spaces: provably disjoint.
return AliasResult::NoAlias;
}
// getAddressSpace -- walk through casts to find the underlying space.
// Traverses up to MaxLookup levels of getUnderlyingObject().
unsigned getAddressSpace(const Value *V, unsigned MaxLookup) {
while (MaxLookup-- > 0) {
unsigned AS = V->getType()->getPointerAddressSpace();
if (AS != ADDRESS_SPACE_GENERIC)
return AS;
const Value *Next = getUnderlyingObject(V, /*MaxLookup=*/1);
if (Next == V)
break; // Reached a root (alloca, argument, global)
V = Next;
}
return V->getType()->getPointerAddressSpace();
}
The getAddressSpace helper is the key difference from a naive check. A pointer may be in generic address space (AS 0) at its use site but was produced by an addrspacecast from a specific space. The traversal walks backward through getUnderlyingObject (which strips GEPs, bitcasts, PHIs) to find the original non-generic space. The depth limit (nvptx-traverse-address-aliasing-limit, default 6) prevents exponential blowup on deeply nested pointer chains.
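The traversal can be modeled in a few lines. This is an assumption-level Python sketch of the walk described above, not cicc's code: each value carries its type's address space, and one getUnderlyingObject step is simulated by an `underlying` link.

```python
# Minimal model of the getAddressSpace traversal: walk backward through
# casts until a non-generic space is found or the depth limit runs out.
GENERIC = 0

class Val:
    def __init__(self, addrspace, underlying=None):
        self.addrspace = addrspace    # address space of this value's type
        self.underlying = underlying  # one getUnderlyingObject() step back

def get_address_space(v, max_lookup=6):  # default matches the traversal limit
    while max_lookup > 0:
        max_lookup -= 1
        if v.addrspace != GENERIC:
            return v.addrspace
        nxt = v.underlying or v       # roots (alloca, argument) return self
        if nxt is v:
            break
        v = nxt
    return v.addrspace

# A generic-space use produced by addrspacecast from shared (AS 3):
shared_root = Val(3)
generic_use = Val(GENERIC, underlying=shared_root)
assert get_address_space(generic_use) == 3
# A root that was always generic stays generic:
assert get_address_space(Val(GENERIC)) == GENERIC
```

A chain deeper than the lookup limit would stop early and report generic, which is exactly the conservative fallback the depth limit is designed to produce.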
The getModRefInfoMask method adds a further optimization: pointers into constant memory (AS 4) or parameter memory (AS 101) are read-only, so it returns NoModRef -- the pointer's memory is never modified. This allows DSE to skip analysis of stores that might alias with const/param loads, and lets LICM hoist loads from constant memory without checking for intervening stores.
The getMemoryEffects method handles inline assembly: PTX inline asm without side-effects or {memory} clobbers is treated as having no memory effects, which prevents it from blocking optimizations.
The Generic Address Space Problem
The generic (flat, AS 0) address space is the fundamental obstacle to alias precision on GPUs. When the frontend cannot determine which physical memory a pointer targets, it emits the pointer in AS 0. The hardware resolves generic addresses at runtime using address range checks -- a pointer into the shared memory window maps to shared, otherwise it maps to global.
For NVVM AA, a generic pointer forces MayAlias against every other pointer, destroying the disjointness guarantee. This is why MemorySpaceOpt is so critical: it runs before the main optimization pipeline and converts generic pointers to specific address spaces wherever possible, feeding precise AS information into NVVM AA.
Three mechanisms address the generic pointer problem:
1. MemorySpaceOpt (pre-optimization conversion). The two-phase interprocedural pass at sub_1C70910 resolves generic pointers by tracing them back to their allocation sites. If a generic pointer is always derived from a __shared__ variable, the pass inserts an addrspacecast to AS 3 and rewrites all uses. When different call sites pass different address spaces for the same argument, the pass clones the function into space-specialized versions. This is the most impactful optimization: every generic pointer that MemorySpaceOpt resolves gives NVVM AA an additional NoAlias edge.
2. Address space traversal in AA. Even without MemorySpaceOpt, the getAddressSpace helper in NVVM AA walks through addrspacecast chains. If a generic pointer %p was produced by addrspacecast i8 addrspace(3)* %s to i8*, the traversal discovers AS 3. The traversal depth limit (default 6) controls how far back the walk goes.
3. !noalias.addrspace metadata (kind 42). cicc attaches this metadata to instructions when address space information is known but the pointer itself remains generic. The AA evaluator (sub_13549C0) detects this metadata via opcode byte 0x4E ('N') and sets bit 2 in a pointer-tagged value (OR with 4), propagating the address-space disambiguation information through to AAResults::alias. This is a cicc-specific extension not found in upstream LLVM.
AA Pipeline Ordering
cicc configures the AA chain with NVVM AA running first, as confirmed by the NVPTXExternalAAWrapper which passes RunEarly=true to ExternalAAWrapperPass. The full chain:
NVVM AA --> BasicAA --> TBAA --> ScopedNoAliasAA --> GlobalsAA
| | | | |
| | | | +-- Module-level: which globals
| | | | escape? (enable-unsafe-
| | | | globalsmodref-alias-results)
| | | |
| | | +-- !noalias / !alias.scope metadata
| | | (enable-scoped-noalias, default true)
| | |
| | +-- Type-based: !tbaa metadata tree
| | (enable-tbaa, default true)
| |
| +-- Stateless: GEP decomposition, alloca vs argument,
| capture analysis (basic-aa-recphi, default true;
| basic-aa-separate-storage, default true)
|
+-- Address space disjointness (stateless, O(depth) per query)
The chain is queried through AAResults::alias() (sub_134CB50), which dispatches through the registered AA providers in order. Each provider returns NoAlias, MayAlias, PartialAlias, or MustAlias. If any provider returns NoAlias, the chain short-circuits -- subsequent providers are not consulted. This is why NVVM AA runs first: cross-address-space pairs are resolved in O(1) without invoking the more expensive BasicAA GEP decomposition.
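The short-circuit dispatch can be sketched as follows. This is an illustrative Python model (the provider functions and call log are hypothetical), showing why an early NoAlias from NVVM AA means BasicAA never runs for cross-space pairs.

```python
# Model of chained AA dispatch with first-NoAlias short-circuit.
calls = []  # records which providers were actually consulted

def nvvm_aa(as1, as2):
    calls.append("nvvm")
    # Cross-space disjointness, excluding generic and the shared/cluster pair.
    if as1 != 0 and as2 != 0 and as1 != as2 and {as1, as2} != {3, 7}:
        return "NoAlias"
    return "MayAlias"

def basic_aa(as1, as2):
    calls.append("basic")
    return "MayAlias"  # stand-in for the expensive GEP decomposition

def alias(as1, as2, providers=(nvvm_aa, basic_aa)):
    for provider in providers:
        if provider(as1, as2) == "NoAlias":
            return "NoAlias"   # short-circuit: later providers never run
    return "MayAlias"

assert alias(1, 3) == "NoAlias"  # global vs shared: resolved by NVVM AA
assert calls == ["nvvm"]         # BasicAA was never consulted
```

A same-space query (e.g. two global pointers) would fall through NVVM AA and reach BasicAA, which is where the real cost of the chain lives.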
The AAResults object consumed by MemorySSA, GVN, DSE, and LICM is the same chained result. All memory-aware passes benefit transparently from NVVM AA without any code changes.
Integration with Memory Optimization Passes
NVVM AA's impact flows through every pass that queries alias information:
MemorySSA (sub_1A6A260) builds its memory SSA graph using AAResults at [this+0xB8] (retrieved via tag unk_4F9D3C0). When NVVM AA proves that a store to shared memory and a load from global memory are NoAlias, MemorySSA does not create a dependency edge between them, resulting in a sparser -- and more precise -- memory graph. This precision propagates to every consumer of MemorySSA.
GVN (sub_1900BB0) uses AA for load elimination and store forwarding. With NVVM AA, a load from %p_global can be forwarded past a store to %q_shared because they provably do not alias. Without NVVM AA, GVN would conservatively assume they might alias and abandon the forwarding. The GVN implementation queries sub_134CB50 indirectly through MemoryDependenceResults, which itself consults AAResults.
DSE (sub_19DD1D0 and related functions) eliminates dead stores by proving that no subsequent load reads the stored value. DSE requires AAResults at unk_4F9D3C0. The DSE report confirms: "The alias analysis that DSE consumes already handles address-space separation. CUDA address spaces (shared=3, global=1, local=5, constant=4) are handled by the underlying NVVM alias analysis which knows that different address spaces cannot alias." DSE does NOT implement its own address-space checks -- it relies entirely on NVVM AA.
LICM uses AA to determine whether a load inside a loop can be hoisted out. If NVVM AA proves a loop-invariant load from constant memory (AS 4, getModRefInfoMask returns NoModRef) cannot be modified by any store in the loop, LICM hoists it. This is especially impactful for __constant__ kernel arguments accessed repeatedly in hot loops.
noalias Metadata and __restrict__ Handling
cicc provides two mechanisms for marking kernel pointer parameters as non-aliasing:
1. -restrict / --kernel-params-are-restrict (frontend flag, offset +1096). When the user passes -restrict to nvcc (or --kernel-params-are-restrict to cicc), it routes to the LLVM knob -nvptx-kernel-params-restrict via the llc argument vector. This causes cicc to add the noalias attribute to all pointer-typed kernel parameters, asserting that the programmer guarantees no two kernel pointer arguments alias. The process-restrict pass (ProcessRestrictPass, registered as a function pass in the new PM at position 419 in the pipeline parser, parameter parser at sub_233A330) then propagates this attribute through the call graph. The propagate-only mode restricts the pass to propagation without inserting new restrict annotations.
2. -allow-restrict-in-struct (flag at offset +1128). Extends __restrict__ handling to pointer fields inside struct arguments. When enabled, the process-restrict pass annotates struct-member pointers with noalias scope metadata, enabling AA to disambiguate pointers extracted from different struct fields. This flag routes to both the opt and llc argument vectors as -allow-restrict-in-struct.
Supporting knobs:
- apply-multi-level-restrict -- apply __restrict__ to all pointer indirection levels (not just the outermost pointer)
- dump-process-restrict -- debug dump during restrict processing
The noalias attribute interacts with the AA chain through ScopedNoAliasAA, which reads !noalias and !alias.scope metadata attached to instructions. cicc's frontend emits these metadata nodes when __restrict__ qualifiers are present in the CUDA source.
The !noalias.addrspace metadata (kind 42, registered in sub_B6EEA0) is a separate mechanism specific to address-space disambiguation. It is attached by MemorySpaceOpt or IR generation when a pointer is known to not alias with pointers in specific address spaces, even if the pointer itself remains in generic AS 0. The AA evaluator detects this metadata and tags the pointer with bit 2 (OR with 4) for disambiguation during alias queries.
The ProcessRestrict Propagation Algorithm
ProcessRestrictPass is NVIDIA's interprocedural restrict propagation pass, registered as pipeline entry 419 with class name ProcessRestrictPass. It runs as a function pass but has interprocedural effects: it reads the noalias attribute from kernel entry points and propagates equivalent information to callees by attaching !noalias and !alias.scope metadata to memory instructions. The knobs controlling its behavior are grouped in ctor_534 (address range 0x560000--0x5CFFFF), alongside allow-restrict-in-struct and apply-multi-level-restrict, and independently in ctor_270 (address range 0x4F0000--0x51FFFF) alongside process-restrict.
Activation and Flag Routing
The restrict pipeline activates through a chain of flag translations:
User: nvcc --restrict kernel.cu
|
nvcc: cicc -restrict (offset +1096 in flag struct)
|
cicc: llc -nvptx-kernel-params-restrict (routes to llc args only)
opt -allow-restrict-in-struct (if -allow-restrict-in-struct set)
opt -apply-multi-level-restrict (if set)
The critical distinction: -restrict routes exclusively to the llc argument vector (not opt), meaning the noalias attribute injection happens during code generation, not during the optimization pipeline. The process-restrict pass in the opt pipeline then reads these attributes and propagates their implications as metadata. The -allow-restrict-in-struct flag routes to both opt and llc, enabling struct-member restrict handling on both sides.
Propagation Algorithm
The pass operates in two modes controlled by the propagate-only parameter:
Full mode (default). The pass performs both annotation and propagation:
ProcessRestrictPass::run(Function &F):
// Phase 1: Identify restrict-qualified pointer arguments
for each Argument &A in F:
if A.hasNoAliasAttr() and A.getType()->isPointerTy():
RestrictArgs.push_back(&A)
// Phase 1b: Struct member extraction (if allow-restrict-in-struct)
if AllowRestrictInStruct:
for each Argument &A in F:
if A.getType() is StructType containing pointer fields:
for each pointer field P extracted via extractvalue/GEP:
RestrictArgs.push_back(P)
// Phase 1c: Multi-level restrict (if apply-multi-level-restrict)
if ApplyMultiLevelRestrict:
for each pointer in RestrictArgs:
if pointer points to pointer (T**):
add the inner pointer dereference to RestrictArgs
if RestrictArgs.empty():
return PreservedAnalyses::all()
// Phase 2: Create alias scope domain and per-argument scopes
MDNode *Domain = createAliasScopeDomain(F.getName())
for each pointer P in RestrictArgs:
MDNode *Scope = createAliasScope(Domain, P->getName())
ScopeMap[P] = Scope
// Phase 3: Attach !alias.scope and !noalias metadata to memory ops
for each Instruction &I in F:
if I is load, store, call, or memcpy/memmove/memset:
Value *Ptr = getPointerOperand(I)
Value *Underlying = getUnderlyingObject(Ptr)
// Which restrict argument does this pointer derive from?
if ScopeMap.count(Underlying):
MDNode *MyScope = ScopeMap[Underlying]
I.setMetadata(!alias.scope, MyScope)
// Build noalias set: all OTHER restrict arguments
SmallVector<Metadata*> NoAliasScopes
for each (P, S) in ScopeMap:
if P != Underlying:
NoAliasScopes.push_back(S)
I.setMetadata(!noalias, MDNode::get(NoAliasScopes))
// Phase 4: Debug dump (if dump-process-restrict)
if DumpProcessRestrict:
print annotated IR to dbgs()
Propagate-only mode. Skips Phase 1 annotation -- does not create new noalias attributes or scopes. Instead, it only reads existing !alias.scope and !noalias metadata from callers and propagates them through inlined call chains. This mode is used in later pipeline stages where new restrict annotations would be unsound (the interprocedural calling context has changed due to inlining).
How ScopedNoAliasAA Consumes the Metadata
The ScopedNoAliasAA provider (registered as scoped-noalias-aa in sub_233BD40, enabled by default via enable-scoped-noalias at ctor_060, global at 0x4B0000) processes the metadata as follows:
ScopedNoAliasAA::alias(LocA, LocB):
// Extract !noalias sets from the instructions that produced LocA and LocB
MDNode *NoAliasA = InstA->getMetadata(!noalias) // set of scopes A does NOT alias
MDNode *ScopeB = InstB->getMetadata(!alias.scope) // B's own scope
MDNode *NoAliasB = InstB->getMetadata(!noalias)
MDNode *ScopeA = InstA->getMetadata(!alias.scope)
// If A's noalias set contains B's scope, or vice versa: NoAlias
if NoAliasA contains any scope in ScopeB:
return NoAlias
if NoAliasB contains any scope in ScopeA:
return NoAlias
return MayAlias // fall through to next AA provider
This means that after ProcessRestrictPass annotates a load from __restrict__ float *a with !alias.scope !{!scope_a} and !noalias !{!scope_b, !scope_c}, any load from __restrict__ float *b (with !alias.scope !{!scope_b}) will be proven NoAlias by ScopedNoAliasAA because scope_b appears in the first instruction's !noalias set. This is the standard LLVM scoped-noalias mechanism; cicc's contribution is the ProcessRestrictPass that generates these metadata nodes from CUDA __restrict__ annotations.
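The scope check above reduces to set membership. Here is a small Python sketch of that reduction (Python sets stand in for MDNode scope lists; the scope names are hypothetical):

```python
# A and B are NoAlias if either instruction's !noalias set intersects
# the other's !alias.scope set; otherwise fall through to the next AA.
def scoped_noalias(scope_a, noalias_a, scope_b, noalias_b):
    if noalias_a & scope_b or noalias_b & scope_a:
        return "NoAlias"
    return "MayAlias"  # next provider in the chain decides

# After ProcessRestrictPass annotates loads from __restrict__ a and b:
load_a = {"scope": {"scope_a"}, "noalias": {"scope_b", "scope_c"}}
load_b = {"scope": {"scope_b"}, "noalias": {"scope_a", "scope_c"}}
assert scoped_noalias(load_a["scope"], load_a["noalias"],
                      load_b["scope"], load_b["noalias"]) == "NoAlias"

# A load with no restrict provenance carries empty sets and stays MayAlias:
assert scoped_noalias(set(), set(),
                      load_b["scope"], load_b["noalias"]) == "MayAlias"
```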
Restrict and Struct Members
When -allow-restrict-in-struct is active, the pass handles a common CUDA pattern where kernel parameters are passed through a struct:
struct Args {
float * __restrict__ a;
float * __restrict__ b;
int n;
};
__global__ void kernel(Args args) {
// Without allow-restrict-in-struct: a and b are NOT marked noalias
// because the struct argument itself is not __restrict__
// With allow-restrict-in-struct: process-restrict extracts the
// pointer fields and creates per-field alias scopes
args.a[i] = args.b[i] * 2.0f; // DSE/LICM can now prove no alias
}
The pass identifies pointer-typed fields within struct arguments by walking extractvalue and getelementptr chains from the struct argument. Each extracted pointer receives its own alias scope, identical to what a top-level __restrict__ parameter would receive.
Multi-Level Restrict
When -apply-multi-level-restrict is active, the pass handles pointer-to-pointer arguments:
__global__ void kernel(float ** __restrict__ ptrs) {
// Level 0: ptrs itself is restrict (different ptrs args don't alias)
// Level 1: *ptrs (the pointed-to pointer) is also restrict
// meaning ptrs[i] and ptrs[j] point to non-aliasing memory
float *a = ptrs[0];
float *b = ptrs[1];
a[x] = b[x]; // Proven NoAlias with multi-level restrict
}
Without this flag, only the outermost pointer level receives noalias treatment. With it, the pass follows dereference chains and creates scopes for each indirection level.
NVVM AA Query Logic -- Internal Detail
The AA chain in cicc is queried through AAResults::alias() at sub_134CB50. This function dispatches through the registered AA providers in registration order. The chain ordering observed in cicc v13.0 is:
NVVM AA -> BasicAA -> TBAA -> ScopedNoAliasAA -> GlobalsAA
This ordering is confirmed by sub_233BD40 (the AA chain builder, 4.8KB) which constructs the pipeline from names: globals-aa, basic-aa, objc-arc-aa, scev-aa, scoped-noalias-aa, tbaa. NVVM AA is injected at the front via NVPTXExternalAAWrapper with RunEarly=true, so it executes before all others.
The Query Dispatch Path
User pass (GVN, DSE, LICM, MemorySSA)
|
v
AAResults::alias(MemoryLocation &A, MemoryLocation &B) [sub_134CB50]
|
+-- (1) NVPTXAAResult::alias()
| Check address spaces: cross-space pairs -> NoAlias
| If NoAlias: short-circuit, return immediately
|
+-- (2) BasicAA
| GEP decomposition, alloca vs argument, capture analysis
| basic-aa-recphi (default true): recursive PHI analysis
| basic-aa-separate-storage (default true): separate underlying objects
|
+-- (3) TBAA (Type-Based Alias Analysis)
| !tbaa metadata tree comparison
| enable-tbaa (default true)
|
+-- (4) ScopedNoAliasAA
| !noalias / !alias.scope metadata (from ProcessRestrict or frontend)
| enable-scoped-noalias (default true, ctor_060 at ~0x494CC1)
|
+-- (5) GlobalsAA [sub_13C7380, 35.7KB]
| Module-level: which globals escape?
| enable-unsafe-globalsmodref-alias-results (default false)
|
v
Final AliasResult (NoAlias / MayAlias / PartialAlias / MustAlias)
Any provider returning NoAlias short-circuits the chain -- subsequent providers are never consulted. This is why NVVM AA runs first: cross-address-space pairs are resolved with zero overhead from BasicAA's GEP decomposition.
ModRef Queries
Two additional entry points handle call-site interactions:
sub_134F0E0 -- AAResults::getModRefInfo(CallBase, MemoryLocation). Returns a ModRefInfo encoding that combines Mod/Ref bits with MustAlias information (8 values, 0--7). This is used by DSE and LICM to determine whether a call can read or write a specific memory location.
sub_134F530 -- AAResults::getModRefInfo(CallBase, CallBase). Same encoding but for two call sites. Used by MemorySSA to build dependencies between calls.
The getModRefInfoMask method in NVVM AA adds a key optimization: pointers into constant memory (AS 4) or parameter memory (AS 101) return NoModRef because these memories are read-only from the kernel's perspective. This lets DSE skip alias analysis entirely for constant/param loads and lets LICM hoist them unconditionally.
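The read-only-space shortcut can be expressed as a two-line mask function. The sketch below is an assumption-level model (the bit layout NoModRef=0 / Ref=1 / Mod=2 / ModRef=3 follows upstream LLVM convention, not a recovered cicc encoding):

```python
# Model of the getModRefInfoMask shortcut: constant (AS 4) and parameter
# (AS 101) memory are read-only from the kernel, so nothing can modify them.
NO_MODREF, REF, MOD, MODREF = 0, 1, 2, 3

READ_ONLY_SPACES = {4, 101}  # constant and parameter memory

def mod_ref_mask(addrspace):
    # Read-only memories: no store can touch them, so DSE/LICM may skip
    # alias analysis entirely for loads from these spaces.
    return NO_MODREF if addrspace in READ_ONLY_SPACES else MODREF

assert mod_ref_mask(4) == NO_MODREF    # constant load: hoistable by LICM
assert mod_ref_mask(101) == NO_MODREF  # param load: same
assert mod_ref_mask(1) == MODREF       # global memory: full analysis needed
```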
getMemoryEffects for Inline Assembly
NVVM AA's getMemoryEffects method inspects PTX inline assembly blocks. An inline asm statement without the sideeffect flag and without a {memory} clobber constraint is classified as having no memory effects (MemoryEffects::none()). This prevents innocent inline asm (register manipulation, warp votes) from blocking load motion, store elimination, and CSE across the asm block.
Address-Space-Based NoAlias Rules -- Complete Matrix
The cross-address-space NoAlias decision is the cheapest and most impactful alias analysis in cicc. The full decision matrix for all pairs:
| | AS 0 (generic) | AS 1 (global) | AS 3 (shared) | AS 4 (const) | AS 5 (local) | AS 6 (tensor) | AS 7 (shmem cluster) | AS 101 (param) |
|---|---|---|---|---|---|---|---|---|
| AS 0 | MayAlias | MayAlias | MayAlias | MayAlias | MayAlias | MayAlias | MayAlias | MayAlias |
| AS 1 | MayAlias | MayAlias | NoAlias | NoAlias | NoAlias | NoAlias | NoAlias | MayAlias* |
| AS 3 | MayAlias | NoAlias | MayAlias | NoAlias | NoAlias | NoAlias | MayAlias | NoAlias |
| AS 4 | MayAlias | NoAlias | NoAlias | MayAlias | NoAlias | NoAlias | NoAlias | NoAlias |
| AS 5 | MayAlias | NoAlias | NoAlias | NoAlias | MayAlias | NoAlias | NoAlias | NoAlias |
| AS 6 | MayAlias | NoAlias | NoAlias | NoAlias | NoAlias | MayAlias | NoAlias | NoAlias |
| AS 7 | MayAlias | NoAlias | MayAlias | NoAlias | NoAlias | NoAlias | MayAlias | NoAlias |
| AS 101 | MayAlias | MayAlias* | NoAlias | NoAlias | NoAlias | NoAlias | NoAlias | MayAlias |
* AS 1 (global) vs AS 101 (param) returns MayAlias because cvta.param (SM 70+) converts parameter pointers to global-space addresses. A parameter-space pointer and a global-space pointer may reference the same physical byte after conversion. This is a conservative choice; upstream LLVM has a commented TODO noting that cvta.param support is not yet implemented, and cicc matches this conservatism.
The decision algorithm implemented in NVPTXAAResult::alias:
if AS1 == 0 or AS2 == 0: -> MayAlias (generic escapes all reasoning)
if AS1 == AS2: -> MayAlias (same space, need deeper AA)
if {AS1,AS2} == {3,7}: -> MayAlias (shared/cluster overlap)
if {AS1,AS2} == {1,101}: -> MayAlias (global/param overlap via cvta.param)
otherwise: -> NoAlias (hardware disjointness)
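The five-rule algorithm above is directly executable; this Python transcription reproduces the matrix cell-for-cell (the function and constant names are illustrative, not recovered symbols):

```python
# Executable transcription of the NVPTXAAResult::alias decision rules.
GENERIC, GLOBAL, SHARED, CONST = 0, 1, 3, 4
LOCAL, TENSOR, CLUSTER, PARAM = 5, 6, 7, 101

def as_alias(a, b):
    if a == GENERIC or b == GENERIC:
        return "MayAlias"        # generic escapes all reasoning
    if a == b:
        return "MayAlias"        # same space: needs deeper AA
    if {a, b} == {SHARED, CLUSTER}:
        return "MayAlias"        # shared / shared-cluster overlap
    if {a, b} == {GLOBAL, PARAM}:
        return "MayAlias"        # global / param overlap via cvta.param
    return "NoAlias"             # hardware disjointness

# Spot-check against the matrix:
assert as_alias(GLOBAL, SHARED) == "NoAlias"
assert as_alias(SHARED, CLUSTER) == "MayAlias"
assert as_alias(GLOBAL, PARAM) == "MayAlias"
assert as_alias(GENERIC, CONST) == "MayAlias"
assert as_alias(LOCAL, TENSOR) == "NoAlias"
```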
The !noalias.addrspace Metadata Mechanism
When MemorySpaceOpt or IR generation determines that a generic-space pointer provably does not alias with a specific address space, but cannot convert the pointer itself to that space (for example, because other uses require it to remain generic), cicc attaches !noalias.addrspace metadata (kind 42) to the instruction. This is registered in sub_B6EEA0 alongside the 41 standard LLVM metadata kinds (dbg=1, tbaa=2, prof=3, ..., noalias.addrspace=42).
The AA evaluator at sub_13549C0 detects this metadata during pointer collection (Phase 2 of the evaluator). When it encounters an instruction with opcode byte 0x4E (78, ASCII 'N'), it tags the pointer value with bit 2 set (OR with 4):
// At 0x1356170, 0x1356180, 0x1356190 in the AA evaluator:
if opcode_byte == 0x4E: // noalias.addrspace annotation
tagged_ptr = raw_ptr | 4 // set bit 2 as disambiguation flag
This tagged pointer propagates through to AAResults::alias() (sub_134CB50), AAResults::getModRefInfo(CallBase, MemoryLocation) (sub_134F0E0), and AAResults::getModRefInfo(CallBase, CallBase) (sub_134F530). The AA providers detect bit 2 and use the associated metadata to return NoAlias for the tagged pointer against pointers in the excluded address spaces.
Similarly, opcode byte 0x1D (29) identifies addrspacecast instructions. The evaluator captures the pre-cast value via cmovz, allowing the AA to trace back to the original non-generic address space even when the instruction itself operates on generic pointers.
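The bit-2 tagging trick relies on Value objects being at least 8-byte aligned, leaving the low three pointer bits free. A sketch of the mechanism (not the decompiled code; the alignment guarantee is an assumption about the allocator):

```cpp
#include <cassert>
#include <cstdint>

// Model of the aa-eval pointer tagging: bit 2 of an aligned pointer is
// free, so the evaluator ORs in 4 to mark "has !noalias.addrspace".
using TaggedPtr = std::uintptr_t;

TaggedPtr tagNoAliasAS(const void *v) {
    return reinterpret_cast<std::uintptr_t>(v) | 4;   // set bit 2
}
bool hasNoAliasAS(TaggedPtr p) {
    return (p & 4) != 0;                              // AA providers test this bit
}
const void *stripTag(TaggedPtr p) {
    return reinterpret_cast<const void *>(p & ~std::uintptr_t(7));
}
```

Downstream AA code that receives a tagged pointer strips the tag before dereferencing and uses the flag to consult the attached metadata.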
The three opcode values that trigger special handling in the AA evaluator:
| Opcode byte | Decimal | Meaning | AA evaluator action |
|---|---|---|---|
| 0x4E | 78 ('N') | !noalias.addrspace annotated | OR pointer with 4 (set bit 2) |
| 0x1D | 29 | addrspacecast | Capture pre-cast value for AS lookup |
| 0x36, 0x37 | 54, 55 | llvm.noalias.scope.decl intrinsic results | Insert into separate scope pointer sets |
Comparison with Upstream LLVM NVPTX
Upstream LLVM (as of LLVM 19/20) includes NVPTXAliasAnalysis.cpp in llvm/lib/Target/NVPTX/, which implements the same core address-space disjointness logic. cicc's version is functionally equivalent to upstream for the basic alias query but differs in several ways:
| Aspect | Upstream LLVM | cicc v13.0 |
|---|---|---|
| Core alias check | Same: cross-AS = NoAlias, generic = MayAlias | Same |
| Shared cluster handling | AS 3 vs AS 7 = MayAlias | Present (SM 90+ targets) |
| Param aliasing with global | Commented TODO: "cvta.param not yet supported" | Same conservative treatment |
| getModRefInfoMask | Const/param = NoModRef | Same |
| Inline asm analysis | Checks side-effects + {memory} clobber | Same |
| Traversal depth knob | nvptx-traverse-address-aliasing-limit (default 6) | Same knob present |
| !noalias.addrspace metadata | Not used upstream | cicc-specific extension (metadata kind 42) |
| strict-aliasing knob | Not in upstream NVPTX | cicc adds "Datatype based strict alias" |
| nvptxaa-relax-fences | Not in upstream | cicc-specific: ordering relaxation for fences |
| process-restrict pass | Not in upstream NVPTX backend | cicc-specific interprocedural restrict propagation |
| Integration with MemorySpaceOpt | No upstream equivalent | cicc's address space inference feeds NVVM AA |
The most significant delta is the ecosystem: upstream NVPTX has the AA pass but lacks the interprocedural MemorySpaceOpt pipeline that resolves generic pointers, the process-restrict pass that propagates noalias, and the !noalias.addrspace metadata that bridges partial address-space knowledge into the AA chain. These three components working together give cicc far more NoAlias results than upstream LLVM achieves on the same IR.
Configuration Knobs
NVVM AA Knobs
| Knob | Type | Default | Description |
|---|---|---|---|
| nvptx-traverse-address-aliasing-limit | unsigned | 6 | Maximum depth for getAddressSpace traversal through getUnderlyingObject |
| nvptxaa-relax-fences | bool | (unknown) | Enable ordering relaxation for fence instructions in AA |
| strict-aliasing | bool | (unknown) | "Datatype based strict alias" -- NVIDIA extension for type-based disambiguation |
| traverse-address-aliasing | bool | (unknown) | "Find address space through traversal" -- master enable for the traversal in getAddressSpace |
| assume-default-is-flat-addrspace | bool | false | Treat default address space (0) as flat/generic (testing knob) |
Standard LLVM AA Knobs (present in cicc)
| Knob | Type | Default | Description |
|---|---|---|---|
| disable-basic-aa / disable-basicaa | bool | false | Disable BasicAA entirely |
| basic-aa-recphi | bool | true | Enable recursive PHI analysis in BasicAA |
| basic-aa-separate-storage | bool | true | Enable separate-storage analysis in BasicAA |
| enable-tbaa | bool | true | Enable Type-Based Alias Analysis |
| enable-scoped-noalias | bool | true | Enable ScopedNoAlias AA (processes !noalias / !alias.scope) |
| enable-unsafe-globalsmodref-alias-results | bool | false | Enable GlobalsModRef (requires unsafe assumption about global escapes) |
| alias-set-saturation-threshold | int | (default) | Maximum pointers in an AliasSet before it saturates |
| aa-pipeline | string | (default) | Override the AA pipeline configuration |
Restrict Processing Knobs
| Knob | Type | Default | Description |
|---|---|---|---|
| nvptx-kernel-params-restrict | bool | false | Mark all kernel pointer params as noalias (activated by -restrict flag) |
| allow-restrict-in-struct | bool | false | Propagate __restrict__ into struct pointer members |
| apply-multi-level-restrict | bool | (unknown) | Apply __restrict__ through all pointer indirection levels |
| dump-process-restrict | bool | false | Debug dump during restrict processing |
AA Evaluator Debug Flags
The aa-eval diagnostic pass (sub_13549C0) uses 14 independent boolean flags for selective output:
| Address | Flag | Controls |
|---|---|---|
| byte_4F97AA0 | print-all-alias-modref-info | Master enable for all AA debug output |
| byte_4F979C0 | print-all-alias-no | Print NoAlias pointer pairs |
| byte_4F978E0 | print-all-alias-may | Print MayAlias pointer pairs |
| byte_4F97800 | print-all-alias-partial | Print PartialAlias pointer pairs |
| byte_4F97720 | print-all-alias-mustalias | Print MustAlias pointer pairs |
| byte_4F97640 | print-all-modref-none | Print NoModRef results |
| byte_4F97560 | print-all-modref-ref | Print JustRef results |
| byte_4F97480 | print-all-modref-mod | Print JustMod results |
| byte_4F973A0 | print-all-modref-both | Print BothModRef results |
| byte_4F96F40 | aa-eval-callsite-modref | Enable call-site ModRef evaluation (Phase 5) |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| AAResults::alias(MemoryLocation, MemoryLocation) -- main alias query entry | sub_134CB50 | -- | -- |
| AAResults::getModRefInfo(CallBase, MemoryLocation) | sub_134F0E0 | -- | -- |
| AAResults::getModRefInfo(CallBase, CallBase) | sub_134F530 | -- | -- |
| AAEvaluator::runOnFunction -- the aa-eval diagnostic pass | sub_13549C0 | 11,038 B | -- |
| SmallPtrSet::insert (pointer collection in aa-eval) | sub_13540B0 | -- | -- |
| Pointer-pair result printer (aa-eval) | sub_1352080 | -- | -- |
| Call-site pair result printer (aa-eval) | sub_1351E00 | -- | -- |
| Formatted alias result printer (aa-eval) | sub_13523B0 | -- | -- |
| GlobalsAA main analysis function | sub_13C7380 | 35.7 KB | -- |
| GlobalsAA helper (per-function analysis) | sub_13C5530 | 21 KB | -- |
| GlobalsAA call-site analysis | sub_13C4410 | 6.7 KB | -- |
| GlobalsAA alias query | sub_13C34D0 | 12.6 KB | -- |
| AA iteration / chaining logic | sub_FD1250 | 23.4 KB | -- |
| Dominator-tree-based AA query setup (used by MemorySSA) | sub_14A4050 | -- | -- |
| Metadata kind registration (including noalias.addrspace = kind 42) | sub_B6EEA0 | 9 KB | -- |
| MemorySpaceOpt pass entry (IP-MSP worklist driver) | sub_1C70910 | ~2,427 lines | -- |
| MemorySpaceOpt per-BB scanner + address-space bitmask builder | sub_1CA8CD0 | ~898 lines | -- |
Cross-References
- MemorySpaceOpt -- the interprocedural pass that resolves generic pointers to specific address spaces, directly feeding NVVM AA
- IP Memory Space Propagation -- the interprocedural wrapper around MemorySpaceOpt
- GVN -- consumes AA for load elimination and store forwarding
- DSE -- relies on AA for dead store detection; confirmed to have no internal address-space checks
- LICM -- uses AA to hoist/sink memory operations across loops
- Pipeline & Ordering -- where NVVM AA fits in the overall pass schedule
- LLVM Knobs -- complete knob inventory including AA-related knobs
- Optimization Levels -- how NVVMAliasAnalysis appears in the tier 2+ pipeline
MemorySSA Builder for GPU
MemorySSA constructs a sparse SSA form over memory operations, giving every instruction that reads or writes memory a position in a use-def chain that tracks the flow of memory state through a function. In upstream LLVM, MemorySSA already delivers significant speedups over the older MemoryDependenceResults analysis by avoiding per-query linear scans. In cicc v13.0, the payoff is amplified because the underlying alias analysis pipeline includes NVVM AA, which returns NoAlias for any cross-address-space pointer pair. A store to shared memory (addrspace(3)) and a load from global memory (addrspace(1)) will never produce a dependency edge in the MemorySSA graph, yielding a dramatically sparser representation than would be possible on a flat-memory architecture. Every pass that consumes MemorySSA -- LICM, EarlyCSE, DSE, GVN, SimpleLoopUnswitch -- benefits from this precision without containing any GPU-specific logic itself.
Key Facts
| Property | Value |
|---|---|
| Builder entry wrapper | sub_1A6CAD0 (48 bytes -- skipFunction guard + tail call) |
| Builder core function | sub_1A6A260 (10,344 bytes) |
| MemoryAccess allocator | sub_1A69110 (1,245 bytes) |
| Pass registration string | "memoryssa" (analysis #179 in pipeline parser) |
| Pipeline parser entry | "print<memoryssa>" -> MemorySSAPrinterPass |
| Required analyses | AliasAnalysis (tag unk_4F9D3C0), DominatorTree (tag unk_4F9E06C), LoopInfo (tag unk_4F9A488) |
| Stack frame size | 0x3F8 = 1,016 bytes |
| MemoryAccess node size | 0x40 = 64 bytes (bump-allocated) |
| Walker check limit | memssa-check-limit = 100 (max stores/phis to walk past) |
| Verification flag | verify-memoryssa (off by default, on under EXPENSIVE_CHECKS) |
| DOT graph output | dot-cfg-mssa (filename for CFG + MemorySSA visualization) |
MemorySSA Node Types
MemorySSA represents memory state with three node types, all stored in 64-byte heap-allocated objects:
MemoryDef (kind=2) -- Created for every instruction that may write memory: stores, calls with side effects, atomics, memcpy/memmove intrinsics. Each MemoryDef takes the previous memory state as its operand and produces a new version of memory state.
MemoryUse (kind=1) -- Created for every instruction that reads memory but does not modify it: loads, calls to readonly/readnone functions. A MemoryUse points to the MemoryDef (or MemoryPhi) that represents the most recent memory state it depends on.
MemoryPhi (kind=3) -- Inserted at control flow join points where predecessors have different reaching memory definitions, exactly like an SSA phi node for scalar values. A MemoryPhi merges the memory states from each predecessor into a single version.
All three types share a common layout:
| Offset | Size | Field |
|---|---|---|
| +0x00 | 8 | vtable / next pointer (intrusive list) |
| +0x08 | 8 | prev pointer (intrusive list) |
| +0x10 | 4 | kind (1=MemoryUse, 2=MemoryDef, 3=MemoryPhi) |
| +0x14 | 4 | operand_count (bits 0-27) |
| +0x17 | 1 | flags byte (bit 6 = 0x40 = "has inline operands") |
| +0x18 | 8 | defining instruction / accessed Value* |
| +0x20 | 8 | type/size descriptor (APInt or pointer to APInt) |
| +0x28 | 8 | operand/predecessor pointer |
| +0x30 | 8 | current reaching definition (MemoryAccess*) |
| +0x38 | 8 | associated BasicBlock* (or null) |
The sentinel value 1 stored in the reaching-definition field (+0x30) represents LiveOnEntry -- the implicit MemoryDef that dominates the entire function and represents the initial state of memory at function entry.
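The recovered layout can be expressed as a C++ struct with compile-time offset checks. Field names here are descriptive guesses; note that the flags byte at +0x17 is simply the top byte of the dword at +0x14, so the sketch models both as one 32-bit field:

```cpp
#include <cstddef>
#include <cstdint>

// 64-byte MemoryAccess node as recovered from the binary (names illustrative).
struct MemoryAccessNode {
    void     *next;          // +0x00 intrusive list next (vtable slot in the binary)
    void     *prev;          // +0x08 intrusive list prev
    uint32_t  kind;          // +0x10 1=MemoryUse, 2=MemoryDef, 3=MemoryPhi
    uint32_t  count_flags;   // +0x14 bits 0-27 operand count; byte +0x17 = flags
    void     *inst;          // +0x18 defining instruction / accessed Value*
    void     *size_desc;     // +0x20 APInt (or pointer to one) holding access size
    void     *operand;       // +0x28 operand / predecessor pointer
    void     *reaching_def;  // +0x30 current reaching def; sentinel 1 = LiveOnEntry
    void     *block;         // +0x38 associated BasicBlock* (or null)
};
static_assert(sizeof(MemoryAccessNode) == 0x40, "bump-allocated in 64-byte slots");
static_assert(offsetof(MemoryAccessNode, kind) == 0x10, "kind at +0x10");
static_assert(offsetof(MemoryAccessNode, reaching_def) == 0x30, "reaching def at +0x30");

// Flags byte at +0x17, bit 6 (0x40) = "has inline operands".
inline bool hasInlineOperands(const MemoryAccessNode &n) {
    return (n.count_flags >> 24) & 0x40;
}
inline uint32_t operandCount(const MemoryAccessNode &n) {
    return n.count_flags & 0x0FFFFFFF;
}
```

The static_asserts encode exactly the offsets in the table, so a reimplementation that drifts from the recovered layout fails to compile.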
Construction Algorithm
The builder at sub_1A6A260 follows the standard LLVM MemorySSA construction algorithm, implemented as a dominator-tree DFS rename pass. The implementation is split into eight phases.
Phase 1 -- Prerequisite Retrieval (0x1A6A260 - 0x1A6A3A0)
The builder queries the analysis manager for three required results via a vtable-tagged vector. Each registered analysis is identified by a unique tag pointer:
- unk_4F9D3C0 -> calls virtual method [rax+0x68] -> sub_14A4050 -- retrieves AAResults, stored at [this+0xB8]
- unk_4F9E06C -> retrieves DominatorTree result, stored at [this+0xA8] (offset +0xA0 within the wrapper)
- unk_4F9A488 -> retrieves LoopInfo, stored at [this+0xB0]
If any tag is not found in the registered analysis vector, control jumps to terminal handlers at 0x1A6CAAF-0x1A6CABE (assertion / unreachable).
Phase 2 -- Worklist Initialization (0x1A6A3A0 - 0x1A6A6B0)
The builder allocates a 1,016-byte stack frame and initializes four layers of SmallVector-based renaming stacks:
- Level 0: DFS traversal order over the dominator tree (computed by sub_13B8390)
- Level 1: Per-block instruction iterator
- Level 2: Per-block incoming MemoryPhi operand buffer (SmallVector at rbp-0x330, inline capacity 8)
- Level 3: Memory state stack (current reaching definition per DFS depth)
Each layer is initialized by sub_16CCEE0 (SmallVector move-assign). Temporary intermediate buffers are freed before the main walk begins.
Phase 3 -- Dominator Tree Walk (0x1A6A88C - 0x1A6B070)
The main loop visits each basic block in DFS order over the dominator tree. For every instruction, the builder reads the opcode byte at [instruction-8] and classifies it:
opcode_tag = *(uint8_t*)(instr - 8);
switch (opcode_tag) {
case 0x18..0x38: // Memory instructions (load/store range)
type_tag = *(uint8_t*)(*(instr - 0x18) + 8);
if (type_tag == 0x10) // PointerType result -> this is a Load
createMemoryUse(instr);
else
createMemoryDef(instr); // Store
break;
case 0x0B: // CallInst
classifyCall(instr); // -> sub_1A69C30
break;
case 0x27: // PHINode
if (predecessors_disagree_on_memory_state())
createMemoryPhi(block);
break;
}
Type-size computation. For each memory access, a three-level nested switch computes the byte-size of the accessed region. The switch handles all LLVM Type IDs:
| Type ID | Type | Size computation |
|---|---|---|
| 1 | HalfTy | 16 bits |
| 2 | FloatTy | 32 bits |
| 3 | DoubleTy | 64 bits |
| 4 | FP80 | 80 bits |
| 5 | FP128 | 128 bits |
| 6 | PPC_FP128 | 128 bits |
| 7 | PointerTy | getPointerSizeInBits() via sub_15A9520 |
| 11 | IntegerTy | [type+8] >> 8 (raw bit width) |
| 14 | StructTy | getStructLayout() via sub_15A9FE0 |
| 0, 8, 10, 12, 16 | Array/Vector | element_count * element_size |
When the computed access size differs from the store size ([rax+8] >> 8), the builder routes through sub_1A69690 to create a partial-store MemoryDef, capturing the precise overlap region.
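The scalar rows of the size table reduce to a small switch over Type IDs. A sketch (the pointer width is passed as a parameter here in place of the sub_15A9520 helper, and the integer/struct/vector cases, which need extra type data, are elided):

```cpp
// Bit-size of a scalar access keyed by the LLVM Type ID column above.
unsigned accessSizeInBits(unsigned typeID, unsigned pointerBits = 64) {
    switch (typeID) {
    case 1:  return 16;          // HalfTy
    case 2:  return 32;          // FloatTy
    case 3:  return 64;          // DoubleTy
    case 4:  return 80;          // FP80
    case 5:                      // FP128
    case 6:  return 128;         // PPC_FP128
    case 7:  return pointerBits; // PointerTy: getPointerSizeInBits()
    default: return 0;           // IntegerTy/StructTy/vector need extra type data
    }
}
```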
Phase 4 -- Call and Intrinsic Classification
Call instructions (opcode 0x0B) are dispatched through sub_1A69C30 (call-instruction MemoryDef handler), which classifies intrinsics by ID:
- ID 0x0F (lifetime.start) and ID 0x17 (lifetime.end) -- no memory effect, skipped
- ID 0x27 -- memcpy/memmove-like intrinsics, create MemoryDef
- ID 0x2F -- atomic intrinsics (checks [rdx-0x30] for ordering)
- ID 0x33 -- NVIDIA-specific intrinsics (surface/texture operations, NVVM builtins)
Phase 5 -- MemoryAccess Allocation (sub_1A69110)
The core allocator creates all three node types. Parameters:
| Register | Meaning |
|---|---|
| rdi | MemorySSA this |
| esi | kind: 1=MemoryUse, 2=MemoryDef, 3=MemoryPhi |
| rdx | defining value / access value |
| rcx | type descriptor (APInt holding access size) |
| r8 | instruction pointer |
| r9 | predecessor block (for MemoryPhi) |
Each allocation calls sub_22077B0 (BumpPtrAllocator::Allocate) for 0x40 bytes, populates all fields, inserts the node into the intrusive list via sub_2208C80, and increments the node counter at [this+0xD0].
For kind==1, sub_16A57B0 (countLeadingZeros) determines whether the access is a full or partial def. For kind==3 (MemoryPhi), the operand list is populated by iterating predecessor blocks through sub_146F1B0 (AA-driven reaching-definition lookup).
Phase 6 -- Trivial Phi Optimization (0x1A6B280 - 0x1A6B9BD)
After the DFS walk, the builder post-processes all MemoryPhi nodes. Any MemoryPhi whose operands all resolve to the same MemoryDef is trivial -- it can be replaced with that single reaching definition. The loop at 0x1A6B9DE iterates the result vector [this+0xD8..this+0xE0]:
for (auto *Phi : result_vector) {
unsigned count = Phi->operand_count & 0x0FFFFFFF;
if (all_operands_identical(Phi)) {
Phi->replaceAllUsesWith(single_reaching_def); // sub_164B780
Phi->eraseFromParent(); // sub_1AEB370
destroy(Phi); // sub_164BEC0
}
}
This cleanup is critical for GPU code. Because NVVM AA proves so many memory operations are independent, many join points that would require MemoryPhis on a flat-memory machine will have all predecessors carrying the same memory state. The trivial-phi elimination pass removes these, reducing the graph to only the essential dependencies.
GPU-Specific Precision Gains
The MemorySSA builder itself contains no explicit GPU logic. The GPU awareness comes entirely through the AA pipeline at [this+0xB8], which chains BasicAA -> TBAA -> ScopedNoAliasAA -> NVVM AA. The critical interaction points are:
Cross-address-space independence. When sub_146F1B0 queries the AA for a (store to addrspace(3), load from addrspace(1)) pair, NVVM AA returns NoAlias before BasicAA or TBAA are even consulted. The MemorySSA builder then skips creating a dependency edge. This means a MemoryUse for a global load will not depend on a MemoryDef for a shared store -- they exist in parallel chains.
Partial-alias precision. The builder at 0x1A6AFB3 creates MemoryDefs even for partial overlaps, then calls sub_1A69690 to register the precise overlap region. Standard LLVM would conservatively treat partial alias as MayAlias and create a full dependency. cicc's more aggressive approach uses the partial overlap information downstream for finer-grained DSE and LICM decisions.
Address-space check on volatile access. The call to sub_15FA300 at 0x1A6B88E performs what appears to be a volatile-access or address-space check specific to CUDA memory spaces. This gate prevents the builder from creating false dependencies between volatile shared memory operations (used for inter-warp communication) and non-volatile global operations.
NVIDIA custom intrinsic handling. Type ID 0x33 in sub_1A69990 is not a standard LLVM type ID. It appears to be cicc's custom type for CUDA-specific memory operations (surface/texture references, NVVM-specific typed pointers). These are classified as memory-clobbering conservatively unless the AA can prove otherwise.
Practical effect. Consider a kernel that loads from global memory, operates on shared memory, and stores back to global memory:
__global__ void kernel(float *out, float *in) {
__shared__ float smem[256];
smem[threadIdx.x] = in[threadIdx.x]; // global load + shared store
__syncthreads();
float val = smem[threadIdx.x] * 2.0f; // shared load
out[threadIdx.x] = val; // global store
}
On a flat-memory machine, the MemorySSA graph would have a single linear chain: every memory operation depends on the previous one. With NVVM AA feeding MemorySSA, the graph splits into two parallel chains -- one for shared memory and one for global memory -- connected only at the __syncthreads() barrier (which is modeled as a MemoryDef that clobbers all address spaces).
The MemorySSA Walker
Passes do not directly traverse the MemorySSA def-use chains. Instead, they query the CachingWalker, which answers the fundamental question: "What is the nearest MemoryDef that actually clobbers this memory location?"
The walker performs an optimized upward walk along the def chain, testing each MemoryDef against the query location using the full AA pipeline. The walk terminates when:
- A MemoryDef that clobbers the query location is found (instructionClobbersQuery returns true)
- LiveOnEntry is reached (the location was never written in this function)
- The walk budget (memssa-check-limit = 100 steps) is exhausted, in which case the current MemoryDef is returned conservatively as a clobber
When a MemoryPhi is encountered, the walker splits into multiple paths (one per predecessor) and tracks them using a DefPath worklist. Each path records a (MemoryLocation, First, Last, Previous) tuple, enabling the walker to reconstruct the full path from any clobber back to the query origin.
Caching. The CachingWalker memoizes results per (MemoryAccess, MemoryLocation) pair. Once a clobber query is resolved, subsequent queries for the same access return the cached result immediately. The SkipSelfWalker variant (used by DSE) additionally skips the MemoryDef that is the query origin itself, answering "what did this store overwrite?" rather than "what clobbers this store?"
On GPU, the walker's budget is rarely exhausted for shared-memory operations because NVVM AA prunes so many false dependencies that the def chain is short. For global memory operations in loops with many stores, the 100-step limit can be hit; increasing memssa-check-limit trades compilation time for precision in these cases.
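The walk-with-budget behavior can be modeled over a toy def chain. This is a sketch only: `clobbers` stands in for the AA-backed instructionClobbersQuery, and MemoryPhi path splitting is omitted:

```cpp
#include <functional>

// Toy MemoryDef chain; id == -1 plays the role of LiveOnEntry.
struct Def { const Def *prev; int id; };

// Walk upward past at most `limit` (memssa-check-limit) defs. Returns the
// first clobbering def, LiveOnEntry if the chain ends, or the current def
// conservatively (treated as a clobber) when the budget is exhausted.
const Def *findClobber(const Def *start, unsigned limit,
                       const std::function<bool(const Def &)> &clobbers) {
    const Def *cur = start;
    unsigned steps = 0;
    while (cur && cur->id != -1) {
        if (clobbers(*cur)) return cur;   // real clobber found
        if (++steps >= limit) return cur; // budget exhausted: conservative answer
        cur = cur->prev;
    }
    return cur; // LiveOnEntry: location never written in this function
}
```

With a small `limit` the walker gives up early and reports a non-clobbering def as a clobber, which is exactly the precision/compile-time trade controlled by memssa-check-limit.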
Consumer Passes
Five major passes consume MemorySSA in cicc:
| Pass | How it uses MemorySSA |
|---|---|
| LICM | Queries the walker to determine whether a load inside a loop is clobbered by any store in the loop body. If no clobber is found, the load is hoisted. NVVM AA makes shared-memory loads trivially hoistable past global stores. |
| EarlyCSE (early-cse-memssa variant, sub_27783D0) | Uses MemorySSA to find redundant loads -- two loads from the same location with no intervening clobber are CSE'd. The MemorySSA variant avoids the O(n^2) scanning of the non-MSSA EarlyCSE. |
| DSE | Walks the MemorySSA graph backwards from a store to find earlier stores to the same location with no intervening loads. Dead stores are eliminated. DSE has its own extensive set of MemorySSA walk limits (see knobs below). |
| GVN | Can optionally use MemorySSA instead of MemoryDependenceResults (controlled by enable-gvn-memoryssa). When enabled, GVN uses the walker for load-value forwarding and PRE. |
| SimpleLoopUnswitch | Queries MemorySSA to determine whether a condition inside a loop depends on memory modified in the loop. The simple-loop-unswitch-memoryssa-threshold knob controls the walk limit. |
Knobs and Thresholds
MemorySSA Core
| Knob | Default | Effect |
|---|---|---|
| memssa-check-limit | 100 | Maximum stores/phis the walker will walk past before giving up. Higher values improve precision at the cost of compilation time. |
| verify-memoryssa | false | Enables expensive verification of MemorySSA invariants after every modification. |
| dot-cfg-mssa | "" | If set, dumps the CFG annotated with MemorySSA information to the named DOT file. |
DSE MemorySSA Walk Limits
| Knob | Default | Effect |
|---|---|---|
| dse-memoryssa | true | Master switch enabling MemorySSA-based DSE. |
| dse-memoryssa-scanlimit | 150 | Max memory accesses DSE will scan for a redundant store. |
| dse-memoryssa-walklimit | 90 | Max MemorySSA walk steps per DSE query. |
| dse-memoryssa-partial-store-limit | 5 | Max partial stores DSE will try to merge. |
| dse-memoryssa-defs-per-block-limit | 5000 | Skip blocks with more defs than this limit. |
| dse-memoryssa-samebb-cost | 1 | Walk cost weight for same-block MemoryDefs. |
| dse-memoryssa-otherbb-cost | 5 | Walk cost weight for cross-block MemoryDefs. |
| dse-memoryssa-path-check-limit | 50 | Max paths DSE will check for nontrivial reachability. |
| dse-optimize-memoryssa | true | Enables DSE's own MemorySSA optimization (trivial phi removal during DSE). |
GVN / MemoryDependence
| Knob | Default | Effect |
|---|---|---|
| enable-gvn-memoryssa | varies | Switches GVN from MemDep to MemorySSA. |
| memdep-block-scan-limit | 100 (legacy) | Legacy MemDep per-block scan limit. |
| memdep-block-number-limit | 200 (legacy) / 1000 (NewPM) | Max blocks MemDep will search. Note: the NewPM variant defaults to 1,000, a 5x increase. |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Pass entry wrapper (skipFunction guard + tail call to builder) | sub_1A6CAD0 | 48 | -- |
| MemorySSA builder core (DFS rename walk) | sub_1A6A260 | 10,344 | -- |
| MemoryAccess node allocator (Def/Use/Phi) | sub_1A69110 | 1,245 | -- |
| MemoryDef creation dispatcher (routes to sub_1A69110) | sub_1A695F0 | -- | -- |
| Store-instruction MemoryDef handler (partial store support) | sub_1A69690 | 754 | -- |
| MemoryPhi operand insertion handler (bidirectional edge setup) | sub_1A69990 | 664 | -- |
| Call-instruction handler (intrinsic classification) | sub_1A69C30 | -- | -- |
| MemorySSA::getMemoryAccess or walker lookup | sub_1643330 | -- | -- |
| MemoryAccess::getDefiningAccess | sub_1643D30 | -- | -- |
| MemoryLocation::get or getForDest | sub_1644900 | -- | -- |
| Value::replaceAllUsesWith (def substitution during trivial phi removal) | sub_164B780 | -- | -- |
| MemoryAccess::~MemoryAccess (destructor) | sub_164BEC0 | -- | -- |
| MemoryAccess::eraseFromParent | sub_1AEB370 | -- | -- |
| BumpPtrAllocator::Allocate (64-byte node allocation) | sub_22077B0 | -- | -- |
| AA query: getModRefInfo / reaching-def resolution | sub_146F1B0 | -- | -- |
| AA query: may-alias check (two-pointer comparison) | sub_145CF80 | -- | -- |
| AA query: isNoAlias / clobber check | sub_1487400 | -- | -- |
| DominatorTree DFS order computation | sub_13B8390 | -- | -- |
| skipFunction guard (checks isDeclaration) | sub_1636880 | -- | -- |
Diagnostic Strings
Diagnostic strings recovered from p2-J04-memoryssa.txt and the pipeline parser (p2c.1-01-pipeline-parser.txt). MemorySSA itself emits no optimization remarks; its diagnostics are configuration knobs and the verification/dump infrastructure.
| String | Source | Category | Trigger |
|---|---|---|---|
| "memoryssa" | Pipeline parser analysis #179 | Registration | Analysis registration name in the pass pipeline |
| "print<memoryssa>" | Pipeline parser #406 | Registration | Printer pass registration; params: no-ensure-optimized-uses |
| "memssa-check-limit" | Knob (default 100) | Knob | Maximum stores/phis the CachingWalker will walk past before returning a conservative clobber |
| "verify-memoryssa" | Knob (default false) | Knob | Enables expensive verification of MemorySSA invariants after every modification; on under EXPENSIVE_CHECKS |
| "dot-cfg-mssa" | Knob (default "") | Knob | If set, dumps the CFG annotated with MemorySSA information to the named DOT file for visualization |
| "dse-memoryssa" | Knob (default true) | Knob | Master switch enabling MemorySSA-based DSE |
| "dse-memoryssa-scanlimit" | Knob (default 150) | Knob | Max memory accesses DSE will scan for a redundant store |
| "dse-memoryssa-walklimit" | Knob (default 90) | Knob | Max MemorySSA walk steps per DSE query |
| "dse-memoryssa-partial-store-limit" | Knob (default 5) | Knob | Max partial stores DSE will try to merge |
| "dse-memoryssa-defs-per-block-limit" | Knob (default 5000) | Knob | Skip blocks with more defs than this limit |
| "dse-memoryssa-samebb-cost" | Knob (default 1) | Knob | Walk cost weight for same-block MemoryDefs |
| "dse-memoryssa-otherbb-cost" | Knob (default 5) | Knob | Walk cost weight for cross-block MemoryDefs |
| "dse-memoryssa-path-check-limit" | Knob (default 50) | Knob | Max paths DSE will check for nontrivial reachability |
| "dse-optimize-memoryssa" | Knob (default true) | Knob | Enables DSE's own MemorySSA optimization (trivial phi removal during DSE) |
| "enable-gvn-memoryssa" | Knob (varies) | Knob | Switches GVN from MemDep to MemorySSA |
| "memdep-block-scan-limit" | Knob (default 100 legacy) | Knob | Legacy MemDep per-block scan limit |
| "memdep-block-number-limit" | Knob (default 200 legacy / 1000 NewPM) | Knob | Max blocks MemDep will search; NewPM variant defaults to 1,000 (5x increase) |
| "print<memoryssa-walker>" | Pipeline parser | Registration | MemorySSA walker printer pass |
| "early-cse-memssa" | Pipeline parser | Registration | EarlyCSE variant that uses MemorySSA |
Cross-References
- Alias Analysis & NVVM AA -- the AA pipeline that feeds MemorySSA with GPU-aware NoAlias results
- LICM -- primary consumer; NVVM AA-enhanced MemorySSA enables aggressive hoisting of shared-memory loads past global stores
- DSE -- walks MemorySSA backwards to find dead stores; extensive set of MemorySSA-specific knobs
- GVN -- optional MemorySSA backend via enable-gvn-memoryssa
- EarlyCSE -- EarlyCSE's memssa variant uses MemorySSA for redundant load elimination
LazyCallGraph & CGSCC Pass Manager
The LazyCallGraph (LCG) is the data structure that represents which functions call or reference which other functions, built on demand rather than up front. It drives the CGSCC (Call Graph Strongly Connected Components) pass manager, which walks the call graph in bottom-up order so that interprocedural passes -- the inliner, argument promotion, devirtualization, function attribute inference -- process callees before callers. This ordering is essential: the inliner must have finished optimizing a callee's body before it decides whether to inline that callee into a caller. cicc v13.0 uses LLVM's stock LazyCallGraph implementation without NVIDIA-specific modifications to the graph itself. The GPU-specific behavior comes entirely from how the pipeline configures the CGSCC framework: kernels serve as call graph roots, device functions are internal nodes, recursion is rare, and the inline cost model is radically different from any CPU target.
The LCG cluster occupies approximately 220KB of code at 0xD230A0--0xD2F8A0, containing the graph construction logic, Tarjan's SCC algorithm, incremental SCC mutation operations, and the DOT/text graph printers. A separate 69KB function at sub_2613930 implements the New PM CGSCC inliner that runs inside this framework.
Key Facts
| Property | Value |
|---|---|
| Binary cluster | 0xD230A0 -- 0xD2F8A0 (~220KB, ~25 functions) |
| LLVM source | llvm/lib/Analysis/LazyCallGraph.cpp |
| CGSCC pass manager | sub_1A62BF0 (the InlinerWrapper/standard pipeline factory) |
| CGSCC pipeline parser | sub_2377300 (103KB) |
| CGSCC-to-function adaptor | sub_2362FB0 (6.7KB) |
| New PM CGSCC inliner | sub_2613930 (69KB) |
| NVIDIA custom inliner | sub_1864060 (75KB, the old CGSCC SCC-walk inliner) |
| Inliner core loop | sub_186CA00 (61KB, Inliner::inlineCallsImpl) |
| DevirtSCCRepeatedPass | sub_2284BC0 (16KB, "Max devirtualization iterations reached") |
| SCC object size | 136 bytes (0x88) |
| Edge encoding | Pointer with tag bits: bit 2 = call edge, bit 2 clear = ref edge |
| DenseMap hash | hash(ptr) = (ptr >> 4) ^ (ptr >> 9), bucket size = 16 bytes |
| DenseMap sentinels | Empty = 0xFFFFFFFFFFFFF000, Tombstone = 0xFFFFFFFFFFFFE000 |
| CGSCC invocations per O1/O2/O3 | 4 passes of sub_1A62BF0(1,...), 1 iteration each |
| CGSCC invocations at tier 3 | sub_1A62BF0(5,...) -- 5 iterations |
| BumpPtrAllocator | [LCG+0x150] cursor, [LCG+0x158] slab end |
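The DenseMap details in the table (hash function and sentinel keys) can be reproduced directly. A sketch; bucket probing and rehashing are omitted:

```cpp
#include <cstdint>

// Pointer hash used by the LCG's DenseMaps, as recovered from the binary.
uint64_t lcgHash(uint64_t ptr) { return (ptr >> 4) ^ (ptr >> 9); }

// Reserved key values that can never be real (page-aligned-tail) pointers:
// all-ones with the low 12 bits cleared, and the same minus one page.
constexpr uint64_t kEmptyKey     = 0xFFFFFFFFFFFFF000ULL;
constexpr uint64_t kTombstoneKey = 0xFFFFFFFFFFFFE000ULL;

bool isSpecialKey(uint64_t k) { return k == kEmptyKey || k == kTombstoneKey; }
```

Shifting by 4 discards the alignment zeros of heap pointers before mixing, which is why the hash starts at bit 4 rather than bit 0.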
Lazy Call Graph Construction
The graph is not built all at once. When the CGSCC pass manager begins, the LCG starts with just the module's externally visible functions and kernel entry points as root nodes. Each node's edges are populated only when first visited by the SCC traversal -- the Node::populateSlow() method (sub_D23BF0 returns the edge iterator range) scans all instructions in the function, recording two kinds of edges:
Call edges (bit 2 set in pointer tag): direct CallBase instructions whose callee resolves to a defined function. These form the strong connectivity that defines SCCs.
Ref edges (bit 2 clear): any other reference to a defined function -- a function pointer stored in a global, passed as a callback argument, taken address of. These contribute to RefSCC grouping but do not create call-graph cycles.
Node layout (deduced from binary):
+0x00: Function* (LLVM IR function)
+0x08: Edge array pointer (populated lazily)
+0x10: Edge count / DFSNumber (int32, -1 = completed)
+0x14: LowLink (int32, repurposed as SCC index after Tarjan)
+0x18: Callee edge list (second array for call edges)
+0x20: Callee edge count
Edge encoding (single qword):
Bits 63..3: pointer to target Node
Bit 2: 1 = call edge, 0 = ref edge
Bits 1..0: reserved (alignment)
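The tagged-pointer edge encoding above can be written out as mask operations (a sketch of the scheme, assuming Node objects are at least 8-byte aligned so the low three bits are free):

```cpp
#include <cstdint>

// LazyCallGraph edge: target Node* in the high bits, bit 2 = call edge.
constexpr std::uintptr_t kCallBit = 0x4;

std::uintptr_t makeEdge(const void *node, bool isCall) {
    return reinterpret_cast<std::uintptr_t>(node) | (isCall ? kCallBit : 0);
}
bool isCallEdge(std::uintptr_t e) { return (e & kCallBit) != 0; }
const void *edgeTarget(std::uintptr_t e) {
    return reinterpret_cast<const void *>(e & ~std::uintptr_t(0x7));
}
```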
Population is the only lazy step. Once a node is populated, its edges are cached. Subsequent visits reuse the cached edge list at [node+0x08]. The scan checks [rsi] != 0 to skip unresolvable edges (declaration-only functions with no body).
For a reimplementation: scan every instruction in the function. For each CallBase, if the callee is a defined function, add a call edge. Then walk all non-call operands recursively through constants (including BlockAddress, GlobalAlias, ConstantExpr) collecting any additional function references as ref edges. This matches upstream populateSlow() exactly.
SCC and RefSCC: The Two-Level Hierarchy
The LCG maintains a two-level SCC decomposition:
- SCC (Call SCC): a maximal set of functions connected by call edges such that every function is reachable from every other through calls. This is the unit of work for the CGSCC pass manager.
- RefSCC (Reference SCC): a maximal set of SCCs connected by ref edges. A RefSCC contains one or more SCCs. SCCs within a RefSCC can reference each other (e.g., mutually store each other's function pointers) but do not necessarily call each other.
RefSCC layout (from [r15] in sub_D25FD0):
+0x00: LazyCallGraph* (parent graph)
+0x08: SCC array pointer (SmallVector data)
+0x10: SCC array size
+0x14: SCC array capacity
+0x38: DenseMap #1 (SCC* -> index)
+0x38: qword - bucket base pointer (or inline start)
+0x40: byte - flags (bit 0 = active map selector)
+0x44: dword - tombstone count / generation
+0x48: DenseMap #2 (alternate map for lazy rehashing)
+0x48: qword - bucket base pointer
+0x50: dword - bucket count
SCC layout (136 bytes = 0x88):
+0x00: qword - parent pointer / metadata
+0x08: qword - node member array pointer
+0x10: dword - member count
+0x14: dword - capacity
+0x18: Edge list / callee info
+0x38: DenseMap - node-to-index or similar
The bottom-up SCC ordering is computed using Tarjan's algorithm, implemented in sub_D2C610. The algorithm uses the standard DFS stack with 24-byte entries ({Node*, EdgeIter, EdgeEnd}) and the classic DFSNumber / LowLink fields at node offsets +0x10 and +0x14. When LowLink == DFSNumber, the node is an SCC root -- all nodes above it on the DFS result stack are popped into a new SCC, their DFSNumber set to -1 (completed), and the SCC index written into the LowLink field for reuse.
The Tarjan inner loop at 0xD2CD90--0xD2CEA4 and the SCC member popping at 0xD2CF61--0xD2CFD0 are both 4x unrolled, indicating these are hot paths in the CGSCC pipeline.
Tarjan's SCC Algorithm: Binary-Level Pseudocode
Complexity. Tarjan's SCC algorithm is O(V + E), where V = number of nodes (functions) and E = number of call edges among those nodes; the 4x-unrolled inner loop is a constant-factor optimization, not an algorithmic change. Per operation:
- The initial buildSCCs (sub_D2BEB0) runs Tarjan once over the entire call graph: O(V_total + E_total).
- The incremental switchInternalEdgeToRef runs Tarjan only over the affected SCC's members: O(V_scc + E_scc), typically O(1) since most GPU SCCs contain a single function.
- switchInternalEdgeToCall is O(V_scc + E_scc) for the same-SCC fast path (bit flip only), or O(M * V_scc) for the slow merge path, where M = number of SCCs being merged.
- switchOutgoingEdgeToCall/Ref (sub_D27A10, 29KB) is O(R * S), where R = number of RefSCCs involved and S = total SCCs in those RefSCCs.
- The DenseMap operations throughout use (ptr >> 4) ^ (ptr >> 9) hashing with O(1) amortized insert/lookup.
- Graph verification (sub_D29180) is O(V + E) for the entire graph.
- The CGSCC pass manager's outer loop processes each SCC once in post-order, re-visiting at most max_devirt_iterations times (default 1, tier 3: 5), giving O(max_iter * V) passes over the SCCs.
The Tarjan implementation lives inside sub_D2C610 (switchInternalEdgeToRef) at address range 0xD2CC66--0xD2D0BC. It recomputes SCCs within a single RefSCC after a call edge is demoted to a ref edge, which may split the original SCC into multiple smaller SCCs.
The following pseudocode is reconstructed directly from the binary. Every variable name corresponds to a register or stack slot; every offset corresponds to a binary address.
// Address: 0xD2CC66 -- 0xD2D0BC (inside sub_D2C610)
// Input: RefSCC containing one SCC whose internal call-edge structure changed
// Output: zero or more new SCCs replacing the original
struct StackEntry { // 24 bytes (0x18)
Node* node; // +0x00
Edge* edge_iter; // +0x08
Edge* edge_end; // +0x10
};
fn tarjan_recompute_scc(old_scc: &SCC, allocator: &BumpPtrAllocator) -> Vec<SCC> {
// --- Phase 0: Initialize ---
let mut dfs_counter: i32 = 1; // r13d, starts at 1
let mut worklist: SmallVector<StackEntry, 4>; // [rbp-0xA0], 24-byte entries
let mut result_stack: SmallVector<*Node, 8>; // [rbp-0x120]
let mut new_scc_count: i32 = 0; // r14d, incremented per SCC found
// --- Phase 1: Push all nodes of old_scc as unvisited roots ---
for node in old_scc.members() {
node.DFSNumber = 0; // [node+0x10] = 0 (unvisited marker)
node.LowLink = 0; // [node+0x14] = 0
}
// --- Phase 2: Outer loop -- pick next unvisited root (0xD2CCF7) ---
for root in old_scc.members() {
if root.DFSNumber != 0 { continue; } // already visited
// Assign DFS number and LowLink to root
root.DFSNumber = dfs_counter; // [rbx+0x10] = r12d
root.LowLink = dfs_counter; // [rbx+0x14] = r12d
dfs_counter += 1; // r13d++
// Lazy-populate edges if not yet done
let (edge_begin, edge_end) = sub_D23BF0(&root.edge_list); // 0xD2CD0E
worklist.push(StackEntry { node: root, edge_iter: edge_begin, edge_end });
// --- Phase 3: DFS inner loop (0xD2CD90 -- 0xD2CEA4, 4x unrolled) ---
while let Some(top) = worklist.last_mut() {
if top.edge_iter == top.edge_end {
// All edges of current node exhausted -- backtrack
let finished = top.node;
worklist.pop(); // 0xD2CE80
// LowLink propagation to parent
if let Some(parent) = worklist.last_mut() {
// 0xD2CDF5: min(parent.LowLink, finished.LowLink)
let child_low = finished.LowLink; // [rbx+0x14]
if child_low >= 0 && child_low < parent.node.LowLink {
parent.node.LowLink = child_low; // [r15+0x14] = edx
}
}
// --- Phase 4: SCC root detection (0xD2CF01) ---
if finished.DFSNumber == finished.LowLink {
// This node is an SCC root. Pop members from result_stack.
// (0xD2CF30 -- 0xD2CFD2, 4x unrolled)
let scc_dfs = finished.DFSNumber; // [r15+0x10]
loop {
// Unrolled: processes 4 nodes per iteration.
// Peek before popping: a node with a smaller DFSNumber belongs to
// an enclosing SCC and must remain on the result stack.
if result_stack.is_empty() { break; }
let member = result_stack.last();
if member.DFSNumber < scc_dfs { break; } // 0xD2CF61
result_stack.pop();
member.DFSNumber = -1; // 0xFFFFFFFF = completed
member.LowLink = new_scc_count; // assign SCC index
}
// The root itself
finished.DFSNumber = -1;
finished.LowLink = new_scc_count;
new_scc_count += 1; // r14d++
} else {
// Not a root -- push onto result stack for later popping
result_stack.push(finished);
}
continue;
}
// Advance to next edge
let edge_raw = *top.edge_iter; // load qword
top.edge_iter += 1; // advance by 8
let target_node = edge_raw & 0xFFFFFFFFFFFFFFF8; // mask off tag bits
let is_call = (edge_raw & 0x4) != 0; // bit 2 = call edge
// Only follow CALL edges for SCC computation (ref edges ignored)
if !is_call { continue; }
if target_node == 0 { continue; } // skip null targets
let target_dfs = target_node.DFSNumber; // [target+0x10]
if target_dfs == 0 {
// Unvisited: assign DFS number, push onto worklist
target_node.DFSNumber = dfs_counter; // 0xD2CD78
target_node.LowLink = dfs_counter;
dfs_counter += 1;
let (eb, ee) = sub_D23BF0(&target_node.edge_list);
worklist.push(StackEntry { node: target_node, edge_iter: eb, edge_end: ee });
} else if target_dfs == -1 {
// Already in a completed SCC -- skip entirely
continue;
} else {
// On the stack (tree/back edge): update LowLink
// 0xD2CDF5: min(current.LowLink, target.DFSNumber)
if target_dfs < top.node.LowLink {
top.node.LowLink = target_dfs;
}
}
}
}
}
Key binary details:
- The DFS counter is split between r12d and r13d, alternating roles. In practice r13d holds the next available DFS number, starting at 2 (the root gets 1 via the 0x100000001 packed initialization at 0xD2CD0E).
- The 4x-unrolled inner loop at 0xD2CD90 processes four edge entries per iteration before branching back, reducing loop overhead on this hot path.
- The SCC member popping at 0xD2CF61--0xD2CFD0 is likewise 4x unrolled: it pops at offsets -8, -0x10, -0x18, -0x20 relative to the result stack top, then subtracts 0x20 from the stack pointer per iteration.
- The completed marker -1 (0xFFFFFFFF) is written to [node+0x10] (DFSNumber), and the SCC identifier (the r14d counter) is written to [node+0x14] (LowLink). After Tarjan completes, the LowLink field holds the SCC index for every node -- the DFSNumber/LowLink fields are repurposed, not preserved.
- Only call edges (bit 2 set) are followed during Tarjan. Ref edges (bit 2 clear) are skipped. This is what makes the SCC decomposition "call-SCC" rather than "reference-SCC."
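The variant above is easy to model executably. The following Python sketch follows the binary's conventions -- DFSNumber 0 = unvisited, -1 = completed, LowLink repurposed as the SCC index, only call edges followed -- but the graph representation (dict of adjacency lists) is an illustrative stand-in for the tagged edge arrays:

```python
def tarjan_call_sccs(nodes, call_edges):
    """nodes: iterable of hashable ids; call_edges: dict id -> list of callee ids.
    Returns dict id -> SCC index. Indices are assigned as SCC roots complete,
    so callee SCCs get smaller indices than caller SCCs (bottom-up post-order)."""
    dfs_num = {n: 0 for n in nodes}   # models [node+0x10]; 0 = unvisited
    low_link = {n: 0 for n in nodes}  # models [node+0x14]
    counter = 1                       # r13d: next DFS number
    scc_index = {}                    # final (repurposed) LowLink contents
    next_scc = 0                      # r14d: SCC counter
    result_stack = []

    for root in nodes:
        if dfs_num[root] != 0:
            continue
        dfs_num[root] = low_link[root] = counter; counter += 1
        worklist = [(root, iter(call_edges.get(root, ())))]  # StackEntry pairs
        while worklist:
            node, edges = worklist[-1]
            target = next(edges, None)
            if target is None:
                worklist.pop()                            # edges exhausted: backtrack
                if worklist:                              # propagate LowLink upward
                    parent = worklist[-1][0]
                    low_link[parent] = min(low_link[parent], low_link[node])
                if dfs_num[node] == low_link[node]:       # SCC root detection
                    while result_stack and dfs_num[result_stack[-1]] >= dfs_num[node]:
                        m = result_stack.pop()
                        dfs_num[m] = -1                   # completed marker
                        scc_index[m] = next_scc
                    dfs_num[node] = -1
                    scc_index[node] = next_scc
                    next_scc += 1
                else:
                    result_stack.append(node)
                continue
            if dfs_num[target] == 0:                      # unvisited: descend
                dfs_num[target] = low_link[target] = counter; counter += 1
                worklist.append((target, iter(call_edges.get(target, ()))))
            elif dfs_num[target] != -1:                   # on stack: back edge
                low_link[node] = min(low_link[node], dfs_num[target])
    return scc_index
```

On a mutual-recursion pair plus one caller (a <-> b, c calls a), this assigns {a, b} to SCC 0 and c to SCC 1, matching the bottom-up ordering the pass manager consumes.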
Incremental SCC Mutation Operations
When a pass modifies the call graph, the SCC structure must be updated without recomputing the entire graph. The LCG provides six mutation operations, each handling a specific kind of edge change. The two most complex are switchInternalEdgeToCall and switchInternalEdgeToRef; the others handle cross-RefSCC edges and bulk operations.
switchInternalEdgeToCall -- sub_D25FD0 (5,526 bytes)
Called when a ref edge within the same RefSCC becomes a call edge (the inliner or devirtualization resolves an indirect call to a direct call). This may merge previously separate SCCs into one.
// Address: 0xD25FD0 -- 0xD27566
// Signature (deduced):
// RefSCC::switchInternalEdgeToCall(
// Node& SourceN, // rsi
// Node& TargetN, // rdx
// function_ref<void(ArrayRef<SCC*>)> MergeCB // rcx (nullable), r8 (data)
// ) -> bool
fn switchInternalEdgeToCall(source: &Node, target: &Node, merge_cb: Option<Fn>) -> bool {
let source_scc = sub_D23C40(lcg, source); // lookupSCC at 0xD26025
let target_scc = sub_D23C40(lcg, target); // lookupSCC at 0xD2604E
// FAST PATH 1: Same SCC -- edge type flip only, no structural change
if source_scc == target_scc { // 0xD26B5B
// Mark the edge as a call edge (flip bit 2) via sub_D23E00
return false; // no SCC change
}
// Look up SCC indices within the RefSCC's ordered list
let source_idx = sub_D25BD0(refscc.map, source_scc); // 0xD26055
let target_idx = sub_D25BD0(refscc.map, target_scc); // 0xD260A0
// FAST PATH 2: Source already appears after target in post-order
// (the new call edge doesn't create a cycle in the SCC DAG)
if source_idx > target_idx { // 0xD260B4
// Mark edge as call, no SCC restructuring needed
return false;
}
// SLOW PATH: The new call edge creates a cycle between SCCs.
// Must merge all SCCs in the range [target_idx .. source_idx].
// Phase A: DFS reachability within the RefSCC (0xD26C92 -- 0xD26DAB)
// Walk call edges from target, collecting all SCCs reachable
// back to source. Uses SmallVector worklist (cap 4) and
// DenseMap visited set at [r15+0x48].
let mut merge_set: SmallVector<SCC*, 4>;
let mut visited: DenseSet<SCC*>;
// ... DFS marks all SCCs on the cycle ...
// Phase B: Merge SCCs (0xD26335 -- 0xD263E1)
let merge_range = &refscc.scc_array[target_idx..=source_idx];
let merge_count = merge_range.len();
// Allocate temp buffer for std::rotate
let tmp = sub_2207800(merge_count * 8); // operator new
// sub_D23910 rotates the SCC array to consolidate merged entries
sub_D23910(refscc.scc_array, target_idx, source_idx);
// Move all nodes from secondary SCCs into the primary SCC
for scc in &merge_range[1..] {
primary_scc.members.extend(scc.members);
scc.members.clear();
}
// Update the SCC-to-index DenseMap with double-buffered rehashing
// Toggle flags byte at [RefSCC+0x40], tombstone old entries,
// insert new entries into the alternate map via sub_D24C50
// Phase C: Invoke merge callback (0xD26480)
if let Some(cb) = merge_cb {
cb(ArrayRef { ptr: merge_range.as_ptr(), len: merge_count });
}
// Phase D: Reindex remaining SCCs (0xD267A2)
for scc in &refscc.scc_array[target_idx + 1..] {
scc_index_map[scc] -= merge_count - 1; // "sub [rax], ebx" at 0xD267B9
}
// Notify the graph of structural change
sub_D23D60(lcg, 1); // notifyRefSCCChange
return true; // SCC structure changed
}
Allocation fallback: The temporary buffer allocation at 0xD27447 has a halving fallback (sar rbx, 1): if operator new fails for the full size, it retries with half the size. This handles the case where the merge set is unexpectedly large.
DenseMap double-buffering: The RefSCC maintains two DenseMaps at offsets +0x38 and +0x48. The flags byte at +0x40 (bit 0) selects which map is "current." When entries are migrated during SCC merging, old entries are tombstoned (0xFFFFFFFFFFFFE000) in the departing map and inserted fresh into the other map via sub_D24C50. This avoids a full rehash on every merge -- the tombstone count at +0x44 is incremented, and the map is only rehashed (via sub_D25CB0) when the tombstone ratio crosses a threshold.
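A minimal sketch of the open-addressing scheme described above, using the recovered hash function and sentinel constants. The probing strategy and class shape here are simplified assumptions for illustration, not recovered logic:

```python
EMPTY     = 0xFFFFFFFFFFFFF000  # empty bucket sentinel
TOMBSTONE = 0xFFFFFFFFFFFFE000  # deleted bucket sentinel

def ptr_hash(ptr: int) -> int:
    """Recovered hash: (ptr >> 4) ^ (ptr >> 9)."""
    return (ptr >> 4) ^ (ptr >> 9)

class PtrMap:
    """Open-addressing pointer-keyed map; power-of-two bucket count."""
    def __init__(self, nbuckets=16):
        self.keys = [EMPTY] * nbuckets
        self.vals = [None] * nbuckets

    def _probe(self, key):
        n = len(self.keys)
        i = ptr_hash(key) & (n - 1)
        while self.keys[i] not in (key, EMPTY):  # skips tombstones too
            i = (i + 1) & (n - 1)                # linear probing (assumed)
        return i

    def insert(self, key, val):
        i = self._probe(key)
        self.keys[i], self.vals[i] = key, val

    def lookup(self, key):
        i = self._probe(key)
        return self.vals[i] if self.keys[i] == key else None

    def erase(self, key):
        i = self._probe(key)
        if self.keys[i] == key:
            # Tombstone, not EMPTY, so later probes continue past this slot.
            self.keys[i] = TOMBSTONE
            self.vals[i] = None
```

The key point the sentinels encode: erasing must leave a tombstone rather than an empty bucket, or lookups for colliding keys would terminate early. The double-buffered variant in the binary amortizes tombstone cleanup by migrating live entries into the alternate map instead of rehashing in place.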
switchInternalEdgeToRef -- sub_D2C610 (5,236 bytes)
Called when a call edge within a RefSCC is demoted to a ref edge (a direct call is deleted or replaced with an indirect reference). This may split a single SCC into multiple smaller SCCs.
// Address: 0xD2C610 -- 0xD2DA84
// Signature (deduced):
// RefSCC::switchInternalEdgeToRef(
// RefSCC& Result, // rdi (output -- new RefSCC or self)
// ArrayRef<pair<Node*, Node*>> Pairs // rdx (edge mutations), rcx (byte count)
// ) -> RefSCC&
fn switchInternalEdgeToRef(pairs: &[(Node, Node)]) -> Vec<SCC> {
// Phase 0: Flip all edge types from call to ref (0xD2C6A2)
for (source, target) in pairs {
sub_D23E00(&source.edge_list, target); // clear bit 2 in edge pointer
}
// Phase 1: Check which pairs actually cross SCC boundaries (0xD2C6A2 -- 0xD2CA2B)
// Processes pairs 4 at a time (4x unrolled loop).
// For each pair: DenseMap lookup of source's SCC and target's SCC.
// If same SCC: the call-to-ref demotion might break the SCC.
// If different SCCs: no structural impact (they were already separated).
let mut needs_recompute = false;
for (source, target) in pairs { // 4x unrolled at 0xD2C6D0
let src_scc = densemap_lookup(source);
let tgt_scc = densemap_lookup(target);
if src_scc == tgt_scc {
needs_recompute = true;
}
}
if !needs_recompute { return vec![old_scc]; }
// Phase 2: Run Tarjan's algorithm on the affected SCC (0xD2CC66 -- 0xD2D0BC)
// (See "Tarjan's SCC Algorithm" section above for full pseudocode.)
let new_sccs = tarjan_recompute_scc(old_scc, &lcg.allocator);
if new_sccs.len() == 1 {
// The SCC survived intact -- no split occurred
return vec![old_scc];
}
// Phase 3: Allocate new SCC objects (0xD2D0BC -- 0xD2D12E)
for i in 1..new_sccs.len() {
// BumpPtrAllocator at [LCG+0x150]:
let cursor = lcg.alloc_cursor; // [r12+0x150]
let aligned = (cursor + 7) & !7; // align to 8
let new_end = aligned + 0x88; // 0x88 = 136 bytes per SCC
if new_end > lcg.alloc_slab_end { // [r12+0x158]
sub_9D1E70(allocator, 0x88, 8); // slow path: allocate new slab
}
lcg.alloc_cursor = new_end;
let scc = aligned as *mut SCC;
sub_D23F30(scc, lcg); // SCC constructor
}
// Phase 4: Distribute nodes among new SCCs (0xD2D1F2 -- 0xD2D309)
// Each node's LowLink field (set by Tarjan to its SCC index) determines
// which new SCC it belongs to.
for node in old_scc.members() {
let scc_idx = node.LowLink; // [node+0x14]
new_sccs[scc_idx].members.push(node);
}
// Phase 5: Update ownership maps (0xD2D168 -- 0xD2D1DC)
// Register new SCCs in the RefSCC's SCC list via sub_D248B0
for scc in &new_sccs[1..] {
sub_D248B0(lcg, refscc, scc); // insertRefSCC
}
// Update Node -> SCC DenseMap entries
// Update SCC -> RefSCC back-pointers via sub_D27750
// Phase 6: Clean up old SCC (0xD2D3D6 -- 0xD2D49A)
// Reset all DFS/LowLink fields to -1 (completed state)
// Zero out old SCC's member list
// Clear old SCC's internal DenseMap via sub_D24EE0
return new_sccs;
}
Batch processing optimization: The pair-processing loop at 0xD2C6A2 is 4x unrolled: it processes four (Node*, Node*) pairs per iteration, with explicit remainder handling (1, 2, or 3 leftover pairs) at 0xD2CA2B. Each pair occupies 16 bytes (0x10), so the loop advances by 64 bytes per iteration.
SCC object allocation: New SCC objects (136 bytes each) are allocated from the LCG's BumpPtrAllocator at [LCG+0x150]. The allocator maintains a cursor/end pair for the current slab. When the slab is exhausted, sub_9D1E70 allocates a new slab (the slow path). The alignment requirement is 8 bytes, enforced by the (cursor + 7) & ~7 round-up at 0xD2D0F0.
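The bump-allocation fast path reduces to two arithmetic operations plus a slab-end comparison. A sketch, with arbitrary slab size and fake addresses (the real allocator's slab policy in sub_9D1E70 is not fully recovered):

```python
class BumpPtrAllocator:
    """Models the cursor/end pair at [LCG+0x150]/[LCG+0x158]."""
    SLAB_SIZE = 4096  # illustrative; the real slab size is not recovered

    def __init__(self):
        self.cursor = 0x10000               # current slab cursor (fake address)
        self.slab_end = self.cursor + self.SLAB_SIZE

    def allocate(self, size: int, align: int = 8) -> int:
        aligned = (self.cursor + align - 1) & ~(align - 1)  # round up, cf. 0xD2D0F0
        new_end = aligned + size
        if new_end > self.slab_end:          # slow path: new slab (cf. sub_9D1E70)
            aligned = self.slab_end          # pretend the next slab starts here
            self.slab_end = aligned + max(self.SLAB_SIZE, size)
            new_end = aligned + size
        self.cursor = new_end                # bump: no per-object free, no headers
        return aligned

alloc = BumpPtrAllocator()
scc0 = alloc.allocate(0x88)  # one 136-byte SCC object
scc1 = alloc.allocate(0x88)  # packed immediately after the first
```

Because SCC objects are never freed individually, the allocator needs no per-object metadata: consecutive 0x88-byte allocations land back to back in the slab.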
switchOutgoingEdgeToCall / switchOutgoingEdgeToRef -- sub_D27A10 (29,179 bytes)
Handles edges that cross RefSCC boundaries. When a ref edge from one RefSCC to another becomes a call edge (or vice versa), the RefSCC structure may need updating. If the new call edge creates a cycle between previously separate RefSCCs, they merge into one. This is the RefSCC-level analog of switchInternalEdgeToCall. The function at sub_D27A10 is 29KB -- the largest single function in the LCG cluster -- because it must handle both directions (to-call and to-ref) and the full RefSCC merge/split logic.
insertInternalRefEdge -- sub_D2A080 (15,253 bytes)
Adds a new ref edge within a RefSCC. Called when optimization introduces a new reference between functions that are already in the same RefSCC (e.g., a new constant expression referencing a sibling function). This does not affect SCC structure (only call edges define SCCs), but it updates the RefSCC's internal edge tracking.
computeRefSCC -- sub_D2AD40 (12,495 bytes)
Computes the RefSCC decomposition from scratch for a set of nodes. Used during initial graph construction (sub_D2BEB0) and when incremental updates are insufficient (e.g., after bulk edge insertion). This runs a second level of Tarjan's algorithm over the ref-edge graph, grouping SCCs into RefSCCs.
mergeRefSCC -- sub_D2DA90 (17,930 bytes)
Merges two or more RefSCCs into one. Called when a new ref edge or promoted call edge connects previously separate RefSCCs that are now mutually reachable. This involves relocating all SCCs from the source RefSCC into the target, updating the graph's RefSCC list at [LCG+0x240], and fixing all back-pointers.
CGSCC Pass Manager: Bottom-Up Interprocedural Optimization
The CGSCC pass manager (sub_1A62BF0) wraps the LCG traversal and runs a pipeline of CGSCC passes over each SCC in bottom-up (post-order) order. The pass manager is invoked multiple times at different points in the optimization pipeline, controlled by a pipelineID parameter.
In the O1/O2/O3 pipeline, it is invoked four times, each with 1 devirtualization iteration:
sub_1A62BF0(1,0,0,1,0,0,1) -- pass #2 (inliner framework, early)
sub_1A62BF0(1,0,0,1,0,0,1) -- pass #17 (after DSE/GVN/MemCpyOpt)
sub_1A62BF0(1,0,0,1,0,0,1) -- pass #21 (after ADCE/JumpThreading)
sub_1A62BF0(1,0,0,1,0,0,1) -- pass #38 (late, after Sink)
At higher tier levels (tier 3 aggressive optimization), a 5-iteration variant appears: sub_1A62BF0(5,0,0,1,0,0,1). The first parameter controls the maximum number of SCC re-visitation iterations when the call graph is mutated during optimization.
The pipeline IDs observed across all optimization levels are 1, 2, 4, 5, 7, and 8, likely corresponding to LLVM's PassBuilder extension points:
| Pipeline ID | Extension Point | Notes |
|---|---|---|
| 1 | EP_EarlyAsPossible / basic cleanup | Most common, 4x per O2 |
| 2 | EP_LoopOptimizerEnd | |
| 4 | EP_ScalarOptimizerLate | Sometimes with optFlag=1 |
| 5 | EP_VectorizerStart | Used at tier 3 (5 iterations) |
| 7 | EP_OptimizerLast | |
| 8 | EP_CGSCCOptimizerLate | With optFlag=1 for inlining |
The CGSCC Pass Manager Run Loop
The pass manager's run loop implements the DevirtSCCRepeatedPass pattern. For each SCC in post-order:
fn run_cgscc_pipeline(module: &Module, lcg: &mut LazyCallGraph, max_devirt_iterations: u32) {
// Build initial SCC post-order via sub_D2BEB0 (buildSCCs)
let post_order = lcg.build_sccs(); // sub_D2BEB0, 10KB
for refscc in post_order.bottom_up() { // sub_D2F8A0 / sub_D30800
for scc in refscc.sccs() { // sub_D2E510, 7KB
let mut iteration = 0;
let mut changed = true;
while changed && iteration < max_devirt_iterations {
changed = false;
iteration += 1;
// Run each registered CGSCC pass on this SCC
for pass in &cgscc_pipeline {
let result = pass.run(scc, lcg);
if result.invalidated_call_graph {
// The pass mutated the call graph.
// Update SCC structure via switchInternal* operations.
// If SCCs were merged or split, re-queue affected SCCs.
changed = true;
}
// Run the CGSCC-to-function adaptor (sub_2362FB0)
// to apply function-level passes to newly modified functions
if result.invalidated_functions {
for func in scc.functions() {
run_function_pipeline(func);
}
}
}
}
if iteration >= max_devirt_iterations && changed {
// sub_2284BC0: "Max devirtualization iterations reached"
// Controlled by abort-on-max-devirt-iterations-reached knob
}
}
}
}
Iteration semantics: The max_devirt_iterations parameter (argument 1 to sub_1A62BF0) controls how many times the pass manager will re-run the CGSCC pipeline on an SCC after the call graph mutates. At O1/O2/O3, this is 1 (single pass, no re-visitation). At tier 3, this is 5 (up to 5 re-runs if devirtualization keeps revealing new direct calls). The devirt iteration check at sub_2284BC0 emits "Max devirtualization iterations reached" when the limit is hit and the graph is still changing.
CGSCC-to-Function Adaptor -- sub_2362FB0 (6,700 bytes)
The adaptor at sub_2362FB0 wraps a function-level pass for execution inside the CGSCC framework. When the inliner inlines a callee, the callee's body is absorbed into the caller. The caller must then be re-optimized with function-level passes (SimplifyCFG, InstCombine, etc.) before the next CGSCC pass runs. The adaptor handles this by running the function pipeline on each function in the current SCC after each CGSCC pass that reports a change.
The adaptor constructor at sub_230AC20 (5.4KB) creates the module-to-function or CGSCC-to-function wrappers. The adaptor itself stores the inner pass pipeline as a nested FunctionPassManager and forwards run() calls to each function in the SCC.
Registered CGSCC Passes
The registered CGSCC passes (from the pipeline parser at sub_2377300):
| Pass name | Address/factory | Purpose |
|---|---|---|
| inline | sub_2613930 | New PM CGSCC inliner (69KB) |
| argpromotion | sub_2500970 | Promote pointer args to by-value |
| attributor-cgscc | sub_2582AC0 | CGSCC attribute deduction (39KB) |
| attributor-light-cgscc | -- | Lightweight variant |
| function-attrs | sub_1841180 | Infer readonly, nounwind, etc. |
| openmp-opt-cgscc | -- | OpenMP kernel optimization |
| coro-annotation-elide | -- | Coroutine elision |
| coro-split | -- | Coroutine splitting |
| nv-early-inliner | via sub_2342850 | NVIDIA early inliner (wraps InlinerWrapper) |
CGSCC analyses (3 registered):
| Analysis name | Purpose |
|---|---|
| no-op-cgscc | No-op analysis (placeholder) |
| fam-proxy | FunctionAnalysisManagerCGSCCProxy -- bridges function-level analyses into CGSCC |
| pass-instrumentation | Pass instrumentation callbacks (via sub_2342830) |
How the CGSCC Inliner Uses the Call Graph
The inliner is the most important consumer of the LazyCallGraph. The New PM inliner at sub_2613930 (69KB) and the NVIDIA custom inliner at sub_1864060 (75KB) both interact with the LCG through a specific protocol.
The core inlining loop (implemented at sub_186CA00, 61KB, Inliner::inlineCallsImpl) runs within the CGSCC framework:
fn inline_calls_in_scc(scc: &mut SCC, lcg: &mut LazyCallGraph) {
// Collect all call sites in the SCC
let mut worklist: Vec<CallSite> = collect_call_sites(scc);
for callsite in &worklist {
let callee = callsite.callee();
let caller = callsite.caller();
// Compute inline cost
let cost = compute_inline_cost(callee, caller); // sub_1864060
// Decision: inline if cost < threshold
// (emits optimization remarks: "Inlined", "NotInlined", "AlwaysInline",
// "NeverInline", "TooBig", etc.)
if should_inline(cost) {
// Perform inlining transformation
inline_function(callsite);
// CRITICAL: Update the call graph after inlining.
// The callee's body is now in the caller. New call edges
// may have appeared (callee's callees are now caller's callees).
// Old edges may have disappeared (the call to callee is gone).
// For each new direct call discovered in the inlined body:
// lcg.switchInternalEdgeToCall(caller_node, new_callee_node)
// -> may merge SCCs, triggering re-visitation
// For the removed call edge (caller -> callee):
// lcg.switchInternalEdgeToRef(caller_node, callee_node)
// -> may split SCCs, triggering re-visitation
// (or removeEdge entirely if callee has no other references)
// Run function-level cleanup on the caller
// via CGSCC-to-function adaptor (sub_2362FB0)
}
}
}
Call graph update protocol: After each inline transformation, the inliner must report all edge changes to the LazyCallGraph. The CGSCC pass manager provides an UpdateResult structure that the inliner fills in:
- New call edges: The inlined function body may contain direct calls that the caller did not previously have. Each creates a switchInternalEdgeToCall if the target is in the same RefSCC, or switchOutgoingEdgeToCall (sub_D27A10) if the target is in a different RefSCC.
- Removed call edges: The direct call from caller to callee is replaced by the inlined body. If the caller no longer references the callee at all, the edge is removed. If it still references the callee (e.g., another call site remains), the edge type may change.
- SCC merging: If the inlined body creates a new call cycle (e.g., A calls B, B's body contains a call to A), the affected SCCs merge. The merge callback re-queues the merged SCC for another pass of the CGSCC pipeline.
- SCC splitting: If removing the call edge from caller to callee breaks the only call-path cycle, the SCC splits. New SCCs are created and inserted into the post-order traversal at the correct position.
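The bookkeeping the inliner must report can be modeled as set arithmetic over the caller's direct-call targets before and after inlining. This is a model of the protocol, not the binary's data flow; the function name and parameters are illustrative:

```python
def edge_updates_after_inlining(caller_callees: set, callee: str,
                                callee_callees: set,
                                remaining_calls_to_callee: bool):
    """Compute which edge notifications the inliner must issue to the
    LazyCallGraph after inlining `callee` into a caller.

    caller_callees: direct-call targets of the caller before inlining
    callee_callees: direct-call targets of the inlined callee's body
    remaining_calls_to_callee: True if another call site to callee survives
    """
    after = set(caller_callees) | set(callee_callees)
    if not remaining_calls_to_callee:
        after.discard(callee)  # the inlined call site is gone
    # New targets -> switchInternalEdgeToCall / switchOutgoingEdgeToCall
    new_edges = after - caller_callees
    # Vanished targets -> switchInternalEdgeToRef (or removeEdge if no refs remain)
    removed_edges = caller_callees - after
    return new_edges, removed_edges
```

For example, inlining f (which calls g and h) into a caller whose only call was to f yields new edges to g and h and one removed edge to f, exactly the merge/split triggers described above.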
Initial Graph Construction: buildSCCs -- sub_D2BEB0 (9,782 bytes)
The initial call graph is built by sub_D2BEB0 when the CGSCC pass manager first runs. This function:
- Collects all module-level root functions (kernels, externally visible functions).
- For each root, lazily populates edges via sub_D23BF0.
- Runs Tarjan's algorithm to decompose the call graph into SCCs.
- Runs a second pass (sub_D2AD40, computeRefSCC) to group SCCs into RefSCCs based on ref edges.
- Stores the resulting post-order in the LCG's RefSCC list at [LCG+0x240].
The post-order traversal helpers (sub_D2F8A0 at 10KB, sub_D30800 at 8KB) implement the iterator that the CGSCC pass manager uses to walk RefSCCs and SCCs in bottom-up order. The SCC iteration logic at sub_D2E510 (7KB) handles advancing through SCCs within each RefSCC.
Graph Verification -- sub_D29180 (6,417 bytes)
The verifier at sub_D29180 checks the consistency of the entire LazyCallGraph after mutations. It validates:
- Every node's SCC assignment is correct (no node belongs to the wrong SCC).
- Every SCC's RefSCC assignment is correct.
- Call edges connect nodes that are reachable via calls (SCC invariant).
- Ref edges connect nodes within the same RefSCC.
- The post-order is valid: for every call edge A -> B, B's SCC appears before A's SCC in the traversal order.
- No dangling pointers (all edge targets are live nodes in the graph).
This verifier is expensive (O(V + E) for the whole graph) and is only enabled in debug builds or when explicitly requested.
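The post-order invariant among those checks is straightforward to state in code. A sketch over SCC indices (the graph representation is an illustrative stand-in for the binary's node/edge structures):

```python
def verify_postorder(scc_index, call_edges):
    """Check the traversal invariant: for every call edge a -> b that crosses
    SCC boundaries, b's SCC precedes a's SCC in the bottom-up order
    (callees get smaller indices than callers).

    scc_index: dict node -> SCC position in the post-order
    call_edges: dict node -> iterable of callee nodes
    """
    for a, callees in call_edges.items():
        for b in callees:
            if scc_index[a] != scc_index[b] and scc_index[b] > scc_index[a]:
                return False  # callee's SCC scheduled after its caller's SCC
    return True
```

Edges within one SCC are exempt (a cycle has no ordering), which is why the check skips the same-SCC case.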
LazyCallGraph Data Structure Layout
LazyCallGraph (pointed to by [RefSCC+0]):
+0x000: ...
+0x130: DenseMap<Node*, SCC*> (NodeToSCCMap)
+0x130: qword - bucket count tracking
+0x138: qword - bucket array pointer
+0x140: dword - num entries
+0x144: dword - num tombstones
+0x148: dword - num buckets
+0x150: BumpPtrAllocator
+0x150: qword - current slab cursor
+0x158: qword - current slab end
+0x1A0: qword - total allocated bytes
+0x1B0: SmallVector<SCC*> - SCC ownership list
+0x1B0: qword - data pointer
+0x1B8: dword - size
+0x1BC: dword - capacity
+0x240: SmallVector<RefSCC*> - RefSCC list (post-order)
GPU-Specific Call Graph Properties
The LCG implementation itself is GPU-agnostic, but the call graph shape on GPU differs fundamentally from CPU:
Kernels are roots. Functions annotated with nvvm.annotations kernel metadata are externally visible entry points. They are the roots of the call graph -- nothing calls a kernel (launches are host-side). In CGSCC ordering, kernels are processed last (they are the top of the bottom-up traversal).
Device functions are internal. Non-kernel __device__ functions are typically internal linkage. They appear in the call graph only as callees. This produces a characteristic tree-like (or DAG-like) call graph with very few cycles, meaning most SCCs contain a single function.
Recursion is rare. CUDA hardware historically did not support recursion (stack depth is bounded, and the compiler must statically allocate the call stack). Although modern architectures permit limited recursion, real-world CUDA code almost never uses it. This means SCC merging (switchInternalEdgeToCall) is rarely triggered -- most CGSCC processing is trivially single-function SCCs in a DAG.
Aggressive inlining collapses the graph. The NVIDIA inline budget (default 20,000, vs LLVM's 225) causes most device functions to be inlined into their callers. After the early inliner pass, the remaining call graph is typically flat: a handful of kernels with large bodies and very few un-inlined callees. Later CGSCC invocations mostly iterate over single-function SCCs.
ThinLTO Interaction
When ThinLTO imports functions from other modules, they appear in the call graph as available_externally definitions. The LCG treats them like any other defined function -- they get nodes, their edges are lazily populated, and they participate in SCC computation. The NVModuleSummary builder (sub_12E06D0) records call graph edges in the module summary, which the ThinLTO import pass uses to decide which cross-module functions to import. Once imported, those functions become candidates for inlining during the CGSCC traversal.
The function-inline-cost-multiplier knob (visible in sub_2613930's string table) penalizes recursive functions during ThinLTO inlining, since recursive inlining can explode code size without bound.
Knobs and Thresholds
| Knob | Default | Effect |
|---|---|---|
| inline-budget | 20,000 | Per-caller NVIDIA inline cost budget (89x LLVM default) |
| inline-threshold | 225 | LLVM default cost threshold (used by New PM inliner) |
| nv-inline-all | off | Bypass cost analysis, force-inline everything |
| -aggressive-inline | -- | CLI flag, routes to inline-budget=40000 |
| intra-scc-cost-multiplier | -- | Cost multiplier for inlining within the same SCC |
| function-inline-cost-multiplier | -- | Cost multiplier for recursive functions |
| abort-on-max-devirt-iterations-reached | false | Abort if devirt iteration limit is hit |
| cgscc-inline-replay | -- | Replay file for inline decisions (debugging) |
| cgscc-inline-replay-scope | Function | Replay scope: Function or Module |
| cgscc-inline-replay-fallback | Original | Fallback: Original, AlwaysInline, NeverInline |
| cgscc-inline-replay-format | Line | Replay format: Line, LineColumn, LineDiscriminator |
| CGSCC iteration count (arg 1 to sub_1A62BF0) | 1 (O1-O3), 5 (tier 3) | Max SCC re-visitation iterations after graph mutation |
Sentinel Values and Constants
| Value | Meaning |
|---|---|
| 0xFFFFFFFFFFFFF000 | DenseMap empty bucket sentinel |
| 0xFFFFFFFFFFFFE000 | DenseMap tombstone sentinel |
| 0x100000000 | Packed {size=0, cap=1} for SmallVector initialization |
| 0x100000001 | Packed {DFSNumber=1, LowLink=1} for Tarjan root init |
| 0x400000000 | Packed {size=0, cap=4} for SmallVector initialization |
| 0x800000000 | Packed {size=0, cap=8} for SmallVector initialization |
| 0x88 (136) | SCC object size in bytes |
| 0x18 (24) | Tarjan StackEntry size (Node*, EdgeIter, EdgeEnd) |
| 0x10 (16) | Edge mutation pair size (Node*, Node*) |
| 0xFFFFFFFF (-1) | DFSNumber value indicating "completed" / assigned to an SCC |
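The packed initializers are single 64-bit stores with the first field in the low 32 bits and the second in the high 32 bits. A quick decode confirms the table's {size, cap} / {DFSNumber, LowLink} reading:

```python
# Decode a packed 64-bit {low32, high32} initializer word.
def unpack_u64(word):
    return word & 0xFFFFFFFF, word >> 32   # (low field, high field)

assert unpack_u64(0x100000000) == (0, 1)   # SmallVector {size=0, cap=1}
assert unpack_u64(0x400000000) == (0, 4)   # SmallVector {size=0, cap=4}
assert unpack_u64(0x800000000) == (0, 8)   # SmallVector {size=0, cap=8}
assert unpack_u64(0x100000001) == (1, 1)   # Tarjan {DFSNumber=1, LowLink=1}
```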
Diagnostic Strings
The call graph printer at sub_D2B640 (12,287 bytes) emits these strings for debugging:
- "Printing the call graph for module:"
- "RefSCC with"
- "SCC with"
- "Edges in function:"
- "call SCCs:"
- "call"
- "ref"
- " -> "
The DOT dumper at sub_D29900 emits GraphViz format with "digraph", "[style=dashed" (for ref edges), and standard ";\n", "}\n" terminators.
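The DOT shape described above -- solid call edges, dashed ref edges, standard terminators -- can be sketched as follows. Node names and the graph label are invented; sub_D29900's real output differs in detail.

```python
# Minimal sketch of the GraphViz output shape attributed to sub_D29900:
# call edges solid, ref edges get [style=dashed].
def dump_callgraph_dot(edges):
    lines = ['digraph "callgraph" {']
    for src, dst, kind in edges:
        style = " [style=dashed]" if kind == "ref" else ""
        lines.append(f'  "{src}" -> "{dst}"{style};')
    lines.append("}")
    return "\n".join(lines) + "\n"

print(dump_callgraph_dot([("main", "foo", "call"), ("foo", "gv", "ref")]))
```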
The New PM inliner at sub_2613930 emits: "function-inline-cost-multiplier", "recursive", "recursive SCC split", "unavailable definition".
The devirtualization pass at sub_2284BC0 emits: "Max devirtualization iterations reached".
The old CGSCC inliner at sub_186CA00 emits: "inline", "NoDefinition", "NotInlined", "AlwaysInline", "Inlined", "Callee", "Caller", "cost=always", "cost=", "threshold=".
The call graph DOT writer cluster at 0x2280000--0x228A000 emits: "view-callgraph", "View call graph", "dot-callgraph", "Print call graph to 'dot' file", "Call graph: ", "external caller", "external callee", "external node", "Writing '", "error opening file for writing!".
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| LazyCallGraph cluster start | sub_D230A0 | -- | -- |
| std::rotate / SCC array reorder | sub_D23910 | -- | -- |
| SCC array splitting helper | sub_D23A60 | -- | -- |
| Node::populate() / edge iterator (lazy population point) | sub_D23BF0 | -- | -- |
| LazyCallGraph::lookupSCC(Node&) | sub_D23C40 | -- | -- |
| RefSCC::isAncestorOf() connectivity check | sub_D23CB0 | -- | -- |
| LazyCallGraph::notifyRefSCCChange() | sub_D23D60 | -- | -- |
| Edge::setKind() (flip call/ref tag bit) | sub_D23E00 | -- | -- |
| SCC constructor | sub_D23F30 | -- | -- |
| LazyCallGraph::insertRefSCC() | sub_D248B0 | -- | -- |
| Node edge list cleanup | sub_D24960 | -- | -- |
| DenseMap insert (Node-to-SCC) | sub_D24C50 | -- | -- |
| RefSCC::isPartOfRefSCC() check | sub_D24D10 | -- | -- |
| DenseMap clear (SCC internals) | sub_D24EE0 | -- | -- |
| RefSCC::find() / updateSCCIndex | sub_D25AF0 | -- | -- |
| RefSCC::SCCIndexMap::find() | sub_D25BD0 | -- | -- |
| DenseMap grow/rehash | sub_D25CB0 | -- | -- |
| switchInternalEdgeToCall() | sub_D25FD0 | 5,526 | -- |
| Node::setRefSCC() | sub_D27750 | -- | -- |
| switchOutgoingEdgeToCall/Ref() | sub_D27A10 | 29,179 | -- |
| Call graph verification | sub_D29180 | 6,417 | -- |
| DOT graph dumper | sub_D29900 | 8,235 | -- |
| insertInternalRefEdge() | sub_D2A080 | 15,253 | -- |
| computeRefSCC() | sub_D2AD40 | 12,495 | -- |
| Call graph text printer | sub_D2B640 | 12,287 | -- |
| buildSCCs() / initial construction | sub_D2BEB0 | 9,782 | -- |
| switchInternalEdgeToRef() | sub_D2C610 | 5,236 | -- |
| mergeRefSCC() | sub_D2DA90 | 17,930 | -- |
| SCC iteration logic | sub_D2E510 | 6,890 | -- |
| rebuildSCC() | sub_D2F240 | 6,141 | -- |
| Post-order SCC traversal helper | sub_D2F8A0 | 10,451 | -- |
| Post-order traversal | sub_D30800 | 7,796 | -- |
| Edge management helper | sub_D301A0 | 5,148 | -- |
| RefSCC-level operations | sub_D31270 | 7,696 | -- |
| CGSCC pass manager / InlinerWrapper factory | sub_1A62BF0 | -- | -- |
| NVIDIA custom inliner (old CGSCC) | sub_1864060 | 75,000 | -- |
| Inliner::inlineCallsImpl() (CGSCC core loop) | sub_186CA00 | 61,117 | -- |
| Call graph node visitor | sub_2280510 | 24,000 | -- |
| Call graph builder | sub_2282680 | 33,000 | -- |
| DevirtSCCRepeatedPass ("Max devirtualization iterations reached") | sub_2284BC0 | 16,000 | -- |
| InlinerWrapper factory (nv-early-inliner, inliner-wrapper) | sub_2342850 | -- | -- |
| CGSCC-to-function adaptor | sub_2362FB0 | 6,700 | -- |
| CGSCC pipeline text parser | sub_2377300 | 103,000 | -- |
| Attributor CGSCC pass | sub_2582AC0 | 39,000 | -- |
| New PM CGSCC inliner | sub_2613930 | 69,000 | -- |
Cross-References
- Inliner Cost Model -- the cost computation that the CGSCC inliner uses to decide whether to inline each call site
- ThinLTO Function Import -- how cross-module functions are imported into the call graph
- Pipeline & Ordering -- where the four CGSCC invocations sit in the overall pass sequence
- Optimization Levels -- how CGSCC iteration counts vary by O-level and tier
- Hash Infrastructure -- DenseMap internals, sentinel values, and probing strategy used throughout the LCG
AsmPrinter & PTX Body Emission
The NVPTXAsmPrinter is cicc's final code-generation stage: the component that converts the machine-level IR (MachineFunction, MachineBasicBlock, MachineInstr) into the textual PTX that ptxas consumes. Unlike a conventional LLVM AsmPrinter, which emits real machine assembly for a physical ISA, the NVPTX variant emits PTX -- a virtual ISA with its own declarative syntax for registers, parameters, address spaces, textures, and kernel launch metadata. The AsmPrinter is not merely "formatting instructions"; it is responsible for the entire PTX module structure: file header directives, global variable declarations with topological ordering, function signatures with .param space marshaling, register class declarations, the instruction body with debug annotations, and convergence-control pseudo-instructions required by the warp execution model. In cicc v13.0 the printer spans two address clusters -- the NVPTX-specific emission layer at 0x2140000-0x21FFFFF and the LLVM AsmPrinter override at 0x31E0000-0x3240000.
| Pass registration | sub_214ABE0 -- "NVPTX Assembly Printer" |
| emitFunctionBody | sub_31EC4F0 (12KB, 2565 asm lines) |
| Header emission (emitHeader) | sub_214F370 (7.2KB) |
| Function header orchestrator | sub_215A3C0 (10KB) |
| Kernel attribute emission | sub_214DA90 (8.7KB) |
| Parameter list emission | sub_21502D0 (22KB) |
| Stack frame + register decls | sub_2158E80 (17KB) |
| Global variable emission | sub_2156420 (20KB) |
| Call prototype emission | sub_21CF8D0 (29KB) |
| Inline asm handler | sub_31F26A0 / sub_397DF10 (30KB) |
| AsmPrinter::doFinalization | sub_3972F10 (24KB) |
PTX Output Structure
A complete PTX module emitted by cicc follows this exact structure. Every element in this layout corresponds to a specific emitter function:
// ← sub_214F370 (emitHeader)
// Generated by NVIDIA NVVM Compiler
// Compiler Build ID: ...
// Based on NVVM 7.0.1
//
.version 8.5 ← PTXVersion / 10 . PTXVersion % 10
.target sm_90, texmode_independent ← subtarget name + driver interface
.address_size 64 ← 64 or 32 from subtarget
// Start of file scope inline assembly ← sub_215ACD0 (doInitialization)
...inline asm...
// End of file scope inline assembly
.extern .func (.param .b32 _) _Z3foov ← sub_2151550 (forward declarations)
.global .texref my_tex; ← sub_2156420 (module-level globals)
.global .surfref my_surf;
.global .samplerref my_samp = { ... };
.global .align 4 .b8 data[1024];
.visible .entry _Z6kernelPf( ← sub_215A3C0 (function header)
.param .u64 _Z6kernelPf_param_0
)
.reqntid 256, 1, 1 ← sub_214DA90 (kernel attributes)
.maxnreg 32
{
.local .align 16 .b8 __local_depot0[64];← sub_2158E80 (frame + registers)
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .pred %p<5>;
.reg .b32 %r<47>;
.reg .b64 %rd<8>;
.reg .f32 %f<20>;
// .loc 1 42 0 ← sub_31D55F0 (per-instruction debug)
ld.param.u64 %rd1, [_Z6kernelPf_param_0];
mov.u32 %r1, %tid.x;
...
}
// -- End function
Header Directive Emission -- sub_214F370
The header is emitted once during doInitialization (sub_215ACD0). The function builds the output into a SmallString<128> buffer, then flushes it via OutStreamer.EmitRawText. The emission order is fixed:
- Comment block. "// Generated by NVIDIA NVVM Compiler", followed by "// Compiler Build ID: " with the build identifier string, then "// Based on NVVM 7.0.1" (the version string is read from llvm.ident metadata via sub_216F7F0).
- .version X.Y -- the PTX ISA version. Computed as PTXVersion / 10 for the major digit and PTXVersion % 10 for the minor. In cicc v13.0 targeting SM 90, this is typically .version 8.5.
- .target sm_XX[, texmode_independent][, debug] -- the SM target name from NVPTXSubtarget::getTargetName(). The texmode_independent modifier is appended when the driver interface is NVCL (OpenCL). If the driver interface is CUDA and the subtarget lacks double-precision support, map_f64_to_f32 is appended instead. The , debug suffix is added when MCAsmInfo::doesSupportDebugInformation() returns true.
- .address_size 64 (or 32) -- from NVPTXSubtarget::is64Bit(). All modern CUDA compilation uses 64-bit.
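A sketch of how these four header lines fit together, using the divide/mod version encoding and the flag rules stated above. This is a simplified model, not a transcription of sub_214F370.

```python
# Assemble the fixed PTX header lines. PTXVersion is stored scaled by 10
# (85 -> ".version 8.5"); target modifiers follow the rules in the list above.
def emit_header(ptx_version, sm_name, *, nvcl=False, has_fp64=True,
                debug=False, is64bit=True):
    lines = [f".version {ptx_version // 10}.{ptx_version % 10}"]
    target = f".target {sm_name}"
    if nvcl:                         # NVCL driver interface
        target += ", texmode_independent"
    elif not has_fp64:               # CUDA interface, no double support
        target += ", map_f64_to_f32"
    if debug:                        # MCAsmInfo supports debug info
        target += ", debug"
    lines.append(target)
    lines.append(f".address_size {64 if is64bit else 32}")
    return "\n".join(lines)

print(emit_header(85, "sm_90", nvcl=True))
```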
The doInitialization function (sub_215ACD0) also performs two critical rejection checks: it looks up llvm.global_ctors and llvm.global_dtors named metadata. If either is a non-empty array, it issues a fatal error: "Module has a nontrivial global ctor, which NVPTX does not support." GPU kernels have no program startup phase where global constructors could execute.
Function Declaration: .entry vs .func
The function header orchestrator (sub_215A3C0) emits the complete prologue for each function definition. The emission sequence is:
Step (a): Coroutine pragma. Checks a linked list at this+792 for metadata nodes with type byte 'N' (0x4E) matching the current function. If found, emits .pragma "coroutine";.
Step (b): Linkage directive. Calls sub_214CAD0 which emits .visible, .extern, or .common depending on the function's linkage. CUDA kernel compilation mode is gated by *(this+232)->field_952 == 1.
Step (c): Entry vs function. Calls sub_1C2F070 (isKernelFunction). If the function is a kernel: emit .entry. Otherwise: emit .func.
Step (d): Return type. For .func only. Calls sub_1C2FA50 to check whether the function returns a value. If so, calls sub_214C940 to emit the return type specification (e.g., (.param .b32 retval0)). Kernels have no return values in PTX.
Step (e): Function name. sub_214D1D0 emits the mangled C++ name.
Step (f): Parameter list. sub_21502D0 (22KB) emits the complete .param declaration list. This is the most complex part of the header -- see the next section.
Step (g): Kernel attributes. Only for .entry functions. sub_214DA90 emits launch-bound and cluster directives.
Step (h): Additional attributes. sub_214E300 emits .local_maxnreg if set.
Step (i): Noreturn. If the function has metadata attribute 29 (noreturn) and is not a kernel, emits .noreturn.
Step (j): Open body. Emits {\n.
Step (k): Frame and registers. sub_2158E80 emits the local depot, stack pointer registers, and all virtual register declarations.
.param Space Marshaling
PTX uses .param space for all function arguments. The parameter emission function sub_21502D0 handles the full taxonomy of NVPTX parameter types. The emitted parameter name follows the pattern FUNCNAME_param_N where N is a monotonic index starting at 0.
Scalar parameters are emitted as .param .TYPE _param_N where TYPE is the PTX scalar type (.b32, .b64, .f32, .f64, .pred). Scalars smaller than 32 bits are widened to 32 bits; this is the PTX rule that all .param scalars must be at least 4 bytes. The widening logic: if bit-width <= 32, widen to .b32; if 32 < bit-width < 64, widen to .b64; otherwise keep as-is.
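The widening rule can be written down directly as a sketch:

```python
# Scalar .param widening, as stated above: sub-32-bit scalars widen to
# .b32 (all .param scalars must be at least 4 bytes); widths strictly
# between 32 and 64 widen to .b64; everything else is kept as-is.
def widen_param_bits(bits):
    if bits <= 32:
        return 32
    if bits < 64:
        return 64
    return bits

assert widen_param_bits(8) == 32
assert widen_param_bits(33) == 64
assert widen_param_bits(64) == 64
```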
Aggregate / byval parameters are emitted as .param .align ALIGN .b8 _param_N[SIZE] -- a byte array with explicit alignment. The alignment comes from the function's DataLayout and the parameter attribute.
Texture / surface / sampler parameters get special treatment:
- .param .texref _param_N -- texture reference (direct binding)
- .param .surfref _param_N -- surface reference
- .param .samplerref _param_N -- sampler reference
- .param .u64 .ptr .texref _param_N -- pointer to texture (indirect)
- .param .u64 .ptr .surfref _param_N -- pointer to surface
- .param .u64 .ptr .samplerref _param_N -- pointer to sampler
The distinction between direct references and pointer-to-references reflects whether the texture/surface handle is passed by value or by indirection through a 64-bit pointer.
Call prototypes (sub_21CF8D0, 29KB) are emitted for indirect calls. When a function pointer call occurs, the AsmPrinter generates a .callprototype declaration: prototype_N : .callprototype (.param .b32 _) _ (.param .b64 _, .param .b32 _). The prototype index N is monotonically increasing.
Register Declarations
Inside the function body, sub_2158E80 emits register declarations for every virtual register class used. The nine register classes, their vtable addresses, PTX type suffixes, prefixes, and encoded IDs are documented in Register Classes. The encoding scheme, declaration emission format, and the internal-only tenth class are covered in Register Encoding Scheme and Register Declaration Emission.
The emitted text for each class follows the pattern:
.reg .pred %p<5>; ← 5 predicate registers needed
.reg .b16 %rs<12>; ← 12 short integer registers
.reg .b32 %r<47>; ← 47 general-purpose 32-bit
.reg .b64 %rd<8>; ← 8 double-width integer
.reg .f32 %f<20>; ← 20 single-precision float
.reg .f64 %fd<3>; ← 3 double-precision float
The count for each class is max_register_index + 1. The emitter iterates the function's virtual register map at this+800, deduplicates register classes using a hash table at this+808..832, and tracks the maximum index per class.
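A toy model of that max-index bookkeeping (the prefix-to-type mapping comes from the listing above; the real emitter walks the virtual register map at this+800 and deduplicates classes through a hash table):

```python
# Track the maximum virtual register index per class and declare
# max_register_index + 1 registers for each class used.
def reg_decls(vregs):  # vregs: iterable of (class_prefix, index) pairs
    ptx_type = {"p": "pred", "rs": "b16", "r": "b32",
                "rd": "b64", "f": "f32", "fd": "f64"}
    max_idx = {}
    for prefix, idx in vregs:
        max_idx[prefix] = max(max_idx.get(prefix, -1), idx)
    return [f".reg .{ptx_type[p]} %{p}<{n + 1}>;" for p, n in max_idx.items()]

print(reg_decls([("r", 46), ("r", 3), ("p", 4), ("rd", 7), ("f", 19)]))
```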
The stack frame is emitted before registers when the function has a non-zero local frame:
.local .align 16 .b8 __local_depot0[512]; ← ALIGN from frame info, N = function index
.reg .b64 %SP; ← stack pointer (64-bit mode)
.reg .b64 %SPL; ← stack pointer local
The __local_depot name is a fixed prefix (#define DEPOTNAME "__local_depot" in the source). %SP is the global stack pointer; %SPL points into the local depot. In 32-bit mode these are .reg .b32.
Global Variable & Texture Emission -- sub_2156420
Module-level global variables are emitted by sub_2156420 (20KB), called from emitGlobals during doInitialization. Globals must be emitted in topological order because ptxas does not support forward references. The ordering is computed by sub_2157D50 which performs a DFS over global variable use-def chains, detecting circular dependencies (fatal: "Circular dependency found in global variable set").
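The ordering requirement amounts to a DFS post-order over use-def edges with cycle detection. A minimal sketch mirroring the behavior ascribed to sub_2157D50 (graph representation invented):

```python
# Topologically order globals so each one is emitted after the globals it
# references; a back-edge during DFS means a circular dependency (fatal).
def order_globals(deps):  # deps: {name: [names it references]}
    order, state = [], {}  # state: 1 = in progress, 2 = done
    def visit(g):
        if state.get(g) == 2:
            return
        if state.get(g) == 1:
            raise RuntimeError("Circular dependency found in global variable set")
        state[g] = 1
        for d in deps.get(g, []):
            visit(d)
        state[g] = 2
        order.append(g)
    for g in deps:
        visit(g)
    return order

print(order_globals({"table": ["data"], "data": []}))
```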
Texture references: .global .texref NAME; -- emitted when sub_1C2E830 classifies the global as a texture. Surface references: .global .surfref NAME;. Sampler references get an optional initializer block:
.global .samplerref my_sampler = {
addr_mode_0 = clamp_to_edge,
addr_mode_1 = wrap,
filter_mode = linear,
force_unnormalized_coords = 1
};
Address mode values: wrap, clamp_to_border, clamp_to_edge, mirror. Filter mode values: nearest, linear. The force_unnormalized_coords field is boolean.
Data globals receive an address-space qualifier from sub_214FA80: .global (addrspace 1), .shared (addrspace 3), .const (addrspace 4), .local (addrspace 5). Managed-memory globals get .attribute(.managed). Unified addressing gets .attribute(.unified) or .attribute(.unified(N)).
Skipped globals: Variables whose names start with "llvm.metadata", "llvm.", or "nvvm." are silently skipped.
Demoted globals (shared memory demotion, addrspace 3) emit a comment: "// NAME has been demoted".
Instruction Emission -- sub_31EC4F0
The core emission loop emitFunctionBody at sub_31EC4F0 (12KB) overrides llvm::AsmPrinter::emitFunctionBody. It allocates a 0xF28-byte stack frame (holding SmallString buffers, a DenseMap for instruction-mix statistics, and tracking structures) and proceeds through three phases:
Phase 1: Per-MBB Outer Loop
Iterates the MachineFunction's MBB linked list. The iteration strips tagged-pointer bits (AND ~7) from the ilist node pointers. For each MBB:
- Calls emitBasicBlockStart(MBB) via vtable dispatch.
- Enters the instruction inner loop.
- Calls emitBasicBlockEnd(MBB).
- Collects instruction-mix statistics when debug counters are active.
Phase 2: Per-Instruction Inner Loop
For each MachineInstr, reads the opcode at MI+0x44 (uint16) and dispatches through a 46-case jump table:
Default path (real instructions): Calls emitInstruction(MI) via [vtable+0x128], which dispatches to the tablegen-generated printInstruction(). This function uses the NVPTXGenAsmWriter.inc tables to format each instruction: printInstruction() calls NVPTXInstPrinter::printOperand for each operand, producing text like mov.u32 %r0, %r1 or add.f32 %f2, %f0, %f1. After emission, the instruction counter is incremented and, if debug info is present, sub_31D55F0 emits a .loc directive.
Inline assembly (opcodes 1, 2): Routed to sub_31F26A0 / sub_397DF10 (30KB). The inline asm handler parses ${} operand references, handles .att_syntax / .intel_syntax mode switching, and emits // begin inline asm / // end inline asm comment markers. PTX inline assembly is passed through essentially verbatim, with operand substitution.
Meta-instructions (opcodes 3-7, 10-18): These include STACKMAP, PATCHPOINT, EH_LABEL, GC_LABEL, KILL, CFI_INSTRUCTION, DBG_VALUE, DBG_VALUE_LIST, and DBG_LABEL. Most emit labels or debug comments rather than PTX instructions. The KILL pseudo emits a "kill:" comment listing each killed register with sub_2FF6320 (printReg). DBG_LABEL emits "DEBUG_LABEL: <label>".
Convergence control (opcodes 24, 33): CONVERGENCECTRL_ENTRY calls sub_31DB9B0 to mark the entry point of a convergent region. CONVERGENCECTRL_LOOP calls sub_31DB950 to mark a loop-back convergence point. These pseudo-instructions are critical for the PTX assembler to correctly track warp divergence and reconvergence. See the dedicated Convergence Control Framework section below for the full lowering pipeline.
FAKE_USE (opcode 43): Debug-only. Emits "fake_use:" followed by register operands.
MEMBARRIER (opcode 44): Emits "MEMBARRIER" as a raw comment.
Pre- and post-instruction hooks: Before each instruction, the Handlers vector at this+0x240 is iterated, calling beginInstruction(MI) on each handler. After each instruction, endInstruction() is called. The AsmPrinter maintains two handler lists (at +0x240 and +0x228) supporting both debug-info handlers and exception/unwind handlers.
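The dispatch structure of Phase 2 can be compressed into a sketch. Opcode numbers follow the text; the emitted strings are stand-ins, and the real loop dispatches through a 46-case jump table with per-handler hooks.

```python
# Model of the per-instruction dispatch: inline asm and meta-instructions
# take side paths; convergence pseudos (24, 33) are silent (no counter
# bump, no .loc); only the default path counts as a real instruction.
META_OPCODES = {3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17, 18, 43, 44}

def dispatch(opcodes):
    emitted, counter = [], 0
    for op in opcodes:
        if op in (1, 2):                     # inline assembly handler
            emitted.append("// begin inline asm")
        elif op in (24, 33):                 # convergence pseudos: silent
            emitted.append(f"<convergence pseudo {op}>")
        elif op in META_OPCODES:             # labels / debug comments
            emitted.append("// meta")
        else:                                # real instruction path
            emitted.append("mov.u32 %r0, %r1;")
            counter += 1
    return emitted, counter

_, real = dispatch([24, 100, 33, 100, 44])
print(real)
```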
Phase 3: Post-Function Processing
After all MBBs are emitted:
- Zero-length function avoidance. If no real instructions were emitted (tracked by var_F30 and var_ED1), inserts a NOP via sub_31DCBB0 with the comment "avoids zero-length function".
- Function-end label. Creates a "func_end" temp symbol via sub_31DCC50 and emits it for DWARF range tracking.
- DWARF line table finalization. Creates CIE/FDE symbols, binds them via emitAssignment, and inserts a debug-loc entry for the function-end symbol.
- Handler finalization. Calls endFunction(MF) on all handlers in both lists.
- PGO / BBAddrMap emission. If enabled via dword_50360A8, emits BB address maps for profile-guided optimization. Missing labels trigger the diagnostic: "pgo-analysis-map is enabled for function... but it does not have labels".
- End comment. Emits "-- End function\n" as a raw comment.
Debug Info Emission
Debug information in PTX is emitted as .loc and .file directives embedded in the instruction stream, not as separate DWARF sections (the PTX assembler ptxas constructs the actual DWARF from these directives).
The debug emission is layered:
| Layer | Function | Behavior |
|---|---|---|
| Per-instruction .loc | sub_31D55F0 | Emits .loc FileIndex Line Col for instructions with attached DebugLoc |
| Source-line comments | sub_31D89B0 | Emits source location as comments when asm-printer debug counter is active |
| Function-name + inlined-at | emitInlinedAtInfo (NVIDIA) | Appends , function_name LAB, inlined_at FILE LINE COL to .loc |
| Per-MBB boundary | sub_31E6100 | Maintains file/line-to-MCSymbol mapping for MBB boundaries |
| .file directives | emitDwarfFileEntries | Maps source filenames to file indices during doFinalization |
| DWARF line section | sub_E81A00 | Binds CIE/FDE symbols for line table construction |
The NVIDIA extension to .loc is the function_name and inlined_at attributes. Upstream LLVM's .loc only has file line column. cicc appends inlining context so that ptxas can reconstruct the full inline call stack in DWARF. The InlinedAtLocs set tracks which inlined-at locations have already been emitted, preventing duplicates. A work list (SmallVector<DebugLoc, 8>) is built by walking the inlined-at chain, then emitted in reverse order so that outer locations appear before inner ones.
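The chain walk and reverse emission can be sketched as follows, with a DebugLoc modeled as a (site, inlined-at) pair; the location names are invented.

```python
# Walk an inlined-at chain innermost-first, then emit in reverse so outer
# frames print before inner ones, skipping already-emitted locations
# (mirroring the InlinedAtLocs dedup set described above).
def inlined_at_order(loc, emitted=None):  # loc: (site, inlined_at or None)
    emitted = set() if emitted is None else emitted
    chain = []
    while loc is not None:
        site, loc = loc
        chain.append(site)
    out = [s for s in reversed(chain) if s not in emitted]
    emitted.update(out)
    return out

# kernel -> helper -> leaf inline stack (hypothetical locations):
loc = ("leaf.cu:3", ("helper.cu:9", ("kernel.cu:42", None)))
print(inlined_at_order(loc))
```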
When InterleaveSrcInPtx is enabled, the AsmPrinter reads source file lines and emits them as comments interleaved with the PTX.
Module-Level Metadata Directives
Kernel launch-bound metadata directives are emitted by sub_214DA90 in this order:
| Directive | Metadata Source | Notes |
|---|---|---|
| .blocksareclusters | nvvm.blocksareclusters | Fatal error if .reqntid not also set |
| .reqntid X, Y, Z | nvvm.reqntid (comma-separated strtol) | Unspecified dims default to 1 |
| .maxntid X, Y, Z | Structured metadata readers | Unspecified dims default to 1 |
| .minnctapersm N | sub_1C2EF70 | Min CTAs per SM |
| .explicitcluster | nvvm.cluster_dim | SM 90+ only (field_1212 > 0x59) |
| .reqnctapercluster X, Y, Z | Cluster dim readers | SM 90+ only |
| .maxclusterrank N | sub_1C2EF50 | SM 90+ only |
| .maxnreg N | sub_1C2EF90 | Register limit per thread |
The .pragma "nounroll" directive is emitted at MBB level by sub_3970E40 when llvm.loop.unroll.disable metadata is detected on a loop header. This is an NVIDIA modification to the MBB printer.
The .abi_preserve family of directives is emitted by sub_3937240: .abi_preserve, .abi_preserve_after, .abi_preserve_uniform, .abi_preserve_control. These are NVIDIA-specific PTX directives for register ABI preservation across function calls.
Convergence Control Framework
CUDA's SIMT execution model requires the compiler to track which threads in a warp must execute the same instruction simultaneously. When a conditional branch causes warp divergence (some threads take one path, others take the other), the hardware needs to know where threads reconverge. The convergence control framework propagates this information from LLVM IR intrinsics through MachineInstr pseudo-instructions to the final PTX output, where ptxas uses it to emit correct convergence/reconvergence barriers in SASS.
Three-Layer Architecture
Convergence information flows through three representation layers during compilation:
LLVM IR MachineInstr AsmPrinter
───────────────────── ────────────────────── ──────────────────
llvm.experimental CONVERGENCECTRL_ENTRY sub_31DB9B0
.convergence.entry → (opcode 24) → (emitConvergenceEntry)
llvm.experimental CONVERGENCECTRL_LOOP sub_31DB950
.convergence.loop → (opcode 33) → (emitConvergenceLoop)
llvm.experimental CONVERGENCECTRL_ANCHOR (no AsmPrinter case --
.convergence.anchor → (opcode 34) dropped before emission)
"convergencectrl" (operand bundle tag (verified at IR level,
operand bundle → preserved through ISel) consumed by pseudo-instrs)
Layer 1: IR intrinsics. Three llvm.experimental.convergence.* intrinsics define convergent regions at the LLVM IR level. Each returns an abstract "convergence token" (type token) that is consumed by calls carrying the convergencectrl operand bundle. The bundle ties a call to a specific convergence scope -- the verifier at sub_29ED7A0 enforces "convergent call needs convergencectrl operand" for any call marked with the convergent attribute (attribute kind 0x34 = 52).
Layer 2: MachineInstr pseudo-opcodes. During instruction selection (SelectionDAG lowering), the convergence intrinsics are lowered to target-independent MachineInstr pseudo-opcodes. These survive register allocation and all machine-level optimization passes unchanged -- they carry no register operands and produce no real instructions. Their sole purpose is to mark positions in the MBB instruction stream for the AsmPrinter.
Layer 3: AsmPrinter emission. The emitFunctionBody loop at sub_31EC4F0 dispatches opcodes 24 and 33 to dedicated emitter functions that translate the pseudo-instructions into whatever PTX annotation ptxas requires for reconvergence tracking. The CONVERGENCECTRL_ANCHOR pseudo (opcode 34) does not appear in the AsmPrinter's 46-case jump table, indicating it is either dropped during ISel or consumed by an earlier machine pass.
Convergence Token Semantics
The convergence token model enforces a strict dominance and nesting discipline:
- convergence.entry produces a token that represents the function's entry convergence scope. All threads that enter the function are converged at this point. The token must dominate all its uses.
- convergence.loop produces a token scoped to a natural loop. The token marks the point where loop-back-edge threads reconverge before the next iteration. The loop header must dominate all blocks in the cycle.
- convergence.anchor produces a token at an arbitrary program point, used for structured convergence within non-loop regions (e.g., structured if/else regions where reconvergence is needed at the join point).
- convergencectrl operand bundle attaches a convergence token to a call site. This tells the compiler "this call must execute with the set of threads defined by this token's scope." For example:
%tok = call token @llvm.experimental.convergence.entry()
%result = call float @__shfl_sync(i32 %mask, float %val, i32 %lane)
[ "convergencectrl"(token %tok) ]
The LLVM verifier (sub_BFC6A0, 211KB) checks that convergent calls carry the bundle; the convergence verifier (sub_E35A10, 14KB) checks the structural invariants.
ConvergenceVerifier -- sub_E35A10
The standalone convergence verification pass at sub_E35A10 (14KB) enforces five invariants on convergence token usage:
| Invariant | Diagnostic String |
|---|---|
| Token dominance | "Convergence control token must dominate all its uses." |
| Region nesting | "Convergence region is not well-nested." |
| Cycle heart dominance | "Cycle heart must dominate all blocks in the cycle." |
| Single token per cycle | "Two static convergence token uses in a cycle..." |
| Loop token type | Checks llvm.experimental.convergence.loop usage in cycles |
The verifier calls sub_B19720 for domination checks, sub_E342D0 for cycle detection (using the generic cycle info infrastructure), sub_E45390 for diagnostic emission, and sub_E348A0 for error reporting. It runs as part of the IR verification pipeline, not as a separate pass -- the convergence invariants are checked alongside other LLVM IR well-formedness rules.
NVIDIA Convergent Branch Intrinsics
In addition to the upstream llvm.experimental.convergence.* intrinsics, cicc defines two NVIDIA-specific convergent branch intrinsics that interact with the convergence framework:
| Intrinsic | Builtin ID | Minimum SM | Error on Violation |
|---|---|---|---|
llvm.nvvm.branch.if.all.convergent | 3755 / 8282 | sm_70+ (Volta) | "not supported on pre-Volta Architectures" |
llvm.nvvm.branch.if.convergent | 3754 / 8283 | sm_80+ (Ampere) | "not supported on pre-Ampere Architectures" |
These intrinsics produce a boolean result that must be consumed by exactly one branch instruction (enforced by sub_2C7B6A0 with diagnostic: "result of llvm.nvvm.branch.if.convergent and llvm.nvvm.branch.if.all.convergent can only be used by exactly one branch instruction"). The .all variant tests whether all threads in the warp are converged (equivalent to a "uniform predicate" test); the non-.all variant tests whether the current execution context is convergent (the thread set matches the convergence token's scope).
SM version gating is checked in both the NVVM verifier (sub_1C36530) and the lowering pass (sub_2C7B6A0). The SM version is stored as SM * 10 internally (so sm_70 = 700, sm_80 = 800), compared against thresholds at unk_4D045E8.
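The SM*10 encoding makes the gating a plain integer comparison. A sketch (error wording abbreviated; the real checks live in sub_1C36530 and sub_2C7B6A0):

```python
# SM versions are stored scaled by 10 (sm_70 -> 700, sm_80 -> 800), so a
# minimum-architecture gate is a single compare against the threshold.
def check_min_sm(sm_encoded, min_encoded, feature):
    if sm_encoded < min_encoded:
        raise ValueError(f"{feature} not supported on this architecture")
    return True

# branch.if.all.convergent needs sm_70+; branch.if.convergent needs sm_80+:
check_min_sm(900, 700, "llvm.nvvm.branch.if.all.convergent")
check_min_sm(800, 800, "llvm.nvvm.branch.if.convergent")
```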
The convergent Function Attribute (Kind 0x34)
The convergent function/call attribute (attribute kind 52, bit 0x20 at byte offset +33 in the function attribute flags) marks operations that have warp-synchronous semantics. This attribute affects multiple compilation stages:
Constant folding gate (sub_2C7B430). The NVIDIA intrinsic fold function checks hasAttribute(callee, -1, 0x34) before attempting any constant fold. If the callee is convergent, folding is rejected unconditionally -- even if all arguments are compile-time constants. This prevents __syncthreads(), __ballot_sync(), __shfl_sync(), and warp-vote operations from being eliminated.
Inline asm convergence flag. During SelectionDAG lowering of inline assembly (sub_1560260), the convergent attribute is tested via operand bundle or function attribute. If set, bit 5 of the inline asm flags word is set (isConvergent), encoding into the DAG node as: flags = hasSideEffects | (isAlignStack << 1) | (dialect << 2) | (convergent << 5).
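The quoted packing formula, written out as a sketch:

```python
# Pack the inline asm flags word exactly as the formula above states:
# flags = hasSideEffects | (isAlignStack << 1) | (dialect << 2) | (convergent << 5)
def pack_asm_flags(has_side_effects, is_align_stack, dialect, convergent):
    return (int(has_side_effects)
            | (int(is_align_stack) << 1)
            | (dialect << 2)
            | (int(convergent) << 5))

assert pack_asm_flags(True, False, 0, True) == 0b100001   # bit 5 = isConvergent
assert pack_asm_flags(False, True, 1, False) == 0b000110
```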
Loop unrolling epilog forcing. When a loop body contains convergent calls (hasCallInLoop check), the unroller forces epilog remainder style rather than prolog, because epilog preserves the property that all threads participate in each full iteration of the unrolled body.
StructurizeCFG skip. Functions carrying the convergent attribute (attribute ID 56 in the attribute check at sub_B2D610) are skipped by the StructurizeCFG pass -- they are assumed to already have correct convergence structure.
Dead barrier elimination gate. The dead sync elimination engine (sub_2C83D20) identifies barrier intrinsics by checking bit 0x20 at byte +33 (the convergent attribute flag) on the callee, combined with opcode 85 (the internal barrier opcode) and a barrier intrinsic ID confirmation via sub_CEA1A0.
Operand Bundle Registration
The convergencectrl operand bundle tag is registered during LLVMContext initialization at sub_B6EEA0 (9KB), alongside the other standard bundle tags:
Operand bundle tags registered at context creation:
- "funclet" -- EH funclet scope
- "gc-transition" -- GC state transition
- "ptrauth" -- pointer authentication
- "kcfi" -- kernel control flow integrity
- "convergencectrl" -- convergence token attachment
These tags are interned as string IDs in the context's operand bundle tag table. When the bitcode reader parses a call instruction with operand bundles (sub_14FCE40, 107KB), the convergencectrl bundle is reconstructed from the bitcode record and attached to the CallInst/InvokeInst. The inliner at sub_29ED7A0 (96KB) checks "convergent call needs convergencectrl operand" to verify that convergent calls in the callee carry appropriate bundles after inlining.
Pseudo-Instruction Lowering in emitFunctionBody
The emitFunctionBody loop at sub_31EC4F0 handles the two convergence pseudo-instructions as part of its 46-case opcode switch:
Case 24 -- CONVERGENCECTRL_ENTRY. Calls sub_31DB9B0 (emitConvergenceEntry). This function is positioned at address 0x31DB9B0, immediately after sub_31DB950 in the binary layout (the two functions are adjacent, separated by only 0x60 bytes: 0x31DB950 to 0x31DB9B0). The entry pseudo marks the function entry convergence point. It does not emit visible PTX text -- instead it updates internal state that the OutStreamer uses for reconvergence tracking in the generated object.
Case 33 -- CONVERGENCECTRL_LOOP. Calls sub_31DB950 (emitConvergenceLoop). This marks loop-back convergence points. Like the entry pseudo, it produces no visible PTX output but influences ptxas's reconvergence analysis.
Both pseudo-instructions are "silent" -- they do not increment the instruction counter (var_F30), do not trigger .loc emission, and do not invoke the beginInstruction/endInstruction handler callbacks. They fall through the switch without reaching the default path's instruction-counting logic.
Post-Function Convergence Close-Out
After all MBBs in a function are emitted, the emitFunctionBody function performs convergence-related cleanup in Phase 3a (0x31ECFFD-0x31ED0FA):
Phase 3a: Convergence control close-out
if (var_ED1 == true): // any real instructions seen?
OutStreamer->emitAlignment(MF->getAlignment())
for sym in MF->globalSymbolTable[0x48..0x50]:
if (sym[-0x16] & 0x7FFF) != 0: // visibility flags
sub_31E1750(sym) // resolveBlockAddress
if block_was_removed:
emit diagnostic "Address of block that was removed by Co..."
OutStreamer->emitLabel(fallback_sym)
The var_ED1 flag tracks whether any non-meta instructions appeared in the function body. When set, the close-out phase emits function alignment, resolves block-address symbols in the global symbol table (checking visibility flags at sym[-0x16] & 0x7FFF), and handles the edge case where a basic block was removed by CodeGen after a block-address was taken -- this would produce a dangling convergence reference, so a diagnostic is emitted and a fallback label is created.
Convergence and the StructurizeCFG Pass
The StructurizeCFG pass (documented in StructurizeCFG) is the primary consumer of convergence information during the CFG transformation phase. PTX requires reducible control flow: every back-edge must target a loop header that dominates all blocks in the cycle, and every divergent branch must reconverge at a post-dominator.
The pass performs a domtree-guided reconvergence insertion that stores head/tail pointers into function metadata at *(func_obj+672) and *(func_obj+680). These pointers are read by subsequent PTX emission passes to emit correct convergence annotations. Functions with the convergent attribute (or optnone) are skipped entirely -- they are assumed to already have correct structure.
When non-uniform divergent regions are identified, the pass creates new "reconvergence" basic blocks, copies phi entries, and reroutes edges so that all divergent paths merge at a single post-dominator. The sub_35CB4A0 uniformity check and sub_35C9ED0 NCA (nearest common ancestor) computation in the dominator tree determine where reconvergence points are inserted.
NVIDIA Extensions Beyond Upstream
cicc's AsmPrinter diverges from upstream LLVM's NVPTXAsmPrinter in several important ways:
Convergence control pseudo-instructions. Upstream LLVM (as of the LLVM 20 base) has llvm.experimental.convergence.* intrinsics, but the AsmPrinter handling of CONVERGENCECTRL_ENTRY and CONVERGENCECTRL_LOOP as dedicated opcode cases (24 and 33 in the jump table) with calls to sub_31DB9B0 / sub_31DB950 is cicc-specific. These ensure correct warp-level synchronization semantics in the emitted PTX. Additionally, cicc adds two NVIDIA-specific convergent branch intrinsics (llvm.nvvm.branch.if.convergent for sm_80+ and llvm.nvvm.branch.if.all.convergent for sm_70+) that have no upstream equivalent. See the Convergence Control Framework section for the full pipeline.
Enhanced .loc with inlined-at. The function_name and inlined_at extensions to .loc directives are NVIDIA additions. Upstream LLVM's NVPTX backend emits only standard .loc file line col. cicc's version walks the full inlining chain to produce richer debug information.
Cluster directives (SM 90+). The entire cluster attribute family (.blocksareclusters, .explicitcluster, .reqnctapercluster, .maxclusterrank) and the 15 cluster special registers are NVIDIA extensions to PTX not present in upstream LLVM's NVPTX backend.
.abi_preserve directives. The register ABI preservation annotations emitted by sub_3937240 have no upstream equivalent.
.pragma "coroutine". The coroutine pragma emission in the function header orchestrator is NVIDIA-specific, supporting CUDA coroutine execution.
PGO/BBAddrMap integration. The BBAddrMap and PGO analysis info structures (0x80 and 0x98 bytes respectively, dynamically allocated when analysis passes are absent) are LLVM 16+ features that cicc integrates into the PTX emission path.
Instruction-mix statistics. The per-MBB instruction-mix collection ("INST_<name>: <count>" format) under the "asm-printer" statistic group is significantly more elaborate than upstream's simple instruction counter.
Dual handler lists. cicc maintains two separate AsmPrinterHandler lists (at this+0x240 and this+0x228), iterated independently for beginInstruction/endInstruction/endFunction. Upstream uses a single handler list.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVPTXAsmPrinter pass registration | sub_214ABE0 | -- | -- |
| Return type / .attribute(.unified) emission | sub_214C940 | 1.9KB | -- |
| Linkage directive emission (.visible/.extern/.common) | sub_214CAD0 | 2.4KB | -- |
| Kernel attribute emission (.reqntid, .maxnreg, cluster) | sub_214DA90 | 8.7KB | -- |
| .local_maxnreg emission | sub_214E300 | 1.3KB | -- |
| emitHeader (.version, .target, .address_size) | sub_214F370 | 7.2KB | -- |
| Address space qualifier emission | sub_214FA80 | 1.9KB | -- |
| emitFunctionParamList (.param declarations) | sub_21502D0 | 22KB | -- |
| Parameter name generation (_param_N) | sub_2150230 | -- | -- |
| Function forward declaration emission | sub_2151550 | 3.9KB | -- |
| emitFunctionEntryLabel (.entry/.func) | sub_2151D30 | 7.0KB | -- |
| Function alias emission (.alias) | sub_21518E0 | 5.0KB | -- |
| Static initializer expression emission | sub_2153350 | 5.3KB | -- |
| Byte-level constant data emission | sub_2153AE0 | 9.9KB | -- |
| printModuleLevelGV (texref/surfref/samplerref/data) | sub_2156420 | 20KB | -- |
| Global variable topological sort | sub_2157D50 | 5.9KB | -- |
| Register class -> encoded ID | sub_21583D0 | 4.6KB | -- |
| Stack frame + register declaration emission | sub_2158E80 | 17KB | -- |
| Function header orchestrator | sub_215A3C0 | 10KB | -- |
| Module-level emission entry (ctor/dtor check, DWARF) | sub_215ACD0 | 8.1KB | -- |
| GenericToNVVM pass registration | sub_215DC20 | -- | -- |
| Register class -> PTX type suffix | sub_2163730 | 1.7KB | -- |
| Register class -> PTX register prefix | sub_21638D0 | 1.6KB | -- |
| llvm.ident / "Based on NVVM 7.0.1" reader | sub_216F7F0 | 5.7KB | -- |
| emitCallPrototype (.callprototype for indirect calls) | sub_21CF8D0 | 29KB | -- |
| Atomic opcode emission (13 operations) | sub_21E5E70 | -- | -- |
| L2-hinted atomic emission (SM 80+) | sub_21E6420 | -- | -- |
| Address space conversion (cvta) + MMA helpers | sub_21E7FE0 | -- | -- |
| Standard special register emission (%tid, %ctaid, etc.) | sub_21E86B0 | -- | -- |
| Cluster barrier emission (SM 90+) | sub_21E8EA0 | -- | -- |
| Cluster special register emission (SM 90+) | sub_21E9060 | -- | -- |
| Memory barrier emission (membar/fence) | sub_21E94F0 | -- | -- |
| printReg (register number -> %rN string) | sub_2FF6320 | -- | -- |
| Per-instruction .loc DWARF directive | sub_31D55F0 | -- | -- |
| Instruction-level debug comment emission | sub_31D89B0 | -- | -- |
| emitConvergenceEntry (CONVERGENCECTRL_ENTRY pseudo, opcode 24) | sub_31DB9B0 | -- | -- |
| emitConvergenceLoop (CONVERGENCECTRL_LOOP pseudo, opcode 33) | sub_31DB950 | -- | -- |
| ConvergenceVerifier::verify (token dominance/nesting checks) | sub_E35A10 | 14KB | -- |
| Cycle detection for convergence verification | sub_E342D0 | -- | -- |
| Convergence verification error reporting | sub_E348A0 | -- | -- |
| Inliner/verifier core ("convergent call needs convergencectrl operand") | sub_29ED7A0 | 96KB | -- |
| NVVM convergent branch intrinsic SM-version gating | sub_1C36530 | -- | -- |
| Convergent branch lowering + single-use enforcement | sub_2C7B6A0 | -- | -- |
| Metadata kind + operand bundle tag registration (incl. convergencectrl) | sub_B6EEA0 | 9KB | -- |
| emitNops (zero-length function avoidance) | sub_31DCBB0 | -- | -- |
| createTempSymbol ("func_end", "Ltmp") | sub_31DCC50 | -- | -- |
| emitFunctionBody (main loop) | sub_31EC4F0 | 12KB | -- |
| emitInlineAsm | sub_31F26A0 | -- | -- |
| .abi_preserve directive emission | sub_3937240 | 14KB | -- |
| MBB printer + .pragma "nounroll" | sub_3970E40 | 18KB | -- |
| doFinalization | sub_3972F10 | 24KB | -- |
| emitInlineAsm (parser/streamer) | sub_397DF10 | 30KB | -- |
Cross-References
- PTX Emission -- hub page for the emission stage with additional detail on atomic/barrier/special-register emission
- Code Generation -- the MachineInstr-producing stage that feeds the AsmPrinter
- SelectionDAG -- instruction selection that creates the MachineInstrs
- NVPTX Call ABI -- .param space calling convention detail
- Register Allocation -- determines which virtual registers exist for the register declaration phase
- Inliner Cost Model -- inlining decisions that create the inlined-at debug chains the AsmPrinter must emit
- StructurizeCFG -- CFG restructuring pass that creates reconvergence basic blocks for divergent control flow
- Dead Sync Elimination -- dead barrier elimination engine that uses the convergent attribute to identify barrier intrinsics
- SM 70-89 Architecture -- SM version gating for convergent branch intrinsics
- GPU Execution Model -- SIMT warp divergence/reconvergence background
Debug Info Verification
cicc includes a custom debug info verification pass (sub_29C8000) that validates DWARF-like debug metadata after each optimization pass in the pipeline. This is not the upstream LLVM IR Verifier (llvm::Verifier::verify(Module)); it is an NVIDIA-specific implementation derived from LLVM's CheckDebugInfoPass (in Debugify.cpp) with two significant extensions: a structured JSON reporting mechanism that tracks exactly which optimization passes degrade debug info quality, and a configurable verbosity system that allows the verification overhead to be tuned from silent to exhaustive. The pass lives in a self-contained module of approximately 93 functions in the 0x29C0000--0x29FFFFF address range, alongside the Debugify synthetic debug info injector and general pass infrastructure utilities. Its purpose is to ensure that when a developer compiles with -g or -generate-line-info, the debug metadata that cuda-gdb and Nsight Compute rely on survives the aggressive optimization pipeline intact.
| Primary function | sub_29C8000 (12,480 bytes, 434 basic blocks) |
| Address range | 0x29C8000 -- 0x29CB0C0 |
| Per-instruction verifier | sub_29C3AB0 (5,592 bytes) |
| Debugify injector | sub_29C1CB0 |
| NewPM wrappers | sub_22702B0 (NewPMCheckDebugifyPass), sub_2270390 (NewPMDebugifyPass) |
| Pipeline parser names | "check-debugify" (pass #26), "debugify" (pass #35) |
| Verbose output flag | qword_5008FC8 (bool) |
| Depth threshold | qword_5008C88 (int32) |
| Stack frame | 0x4B8 bytes (eight tracking structures) |
| Upstream origin | llvm/lib/Transforms/Utils/Debugify.cpp -- CheckDebugInfoPass |
Three Verification Modes
cicc supports three independent verification protocols, each activated by a different set of knobs. Understanding which protocol is active determines what diagnostic output to expect and how much overhead the verification adds.
Mode 1: Post-Pass Debug Info Verification (verify-each)
The default verification mode, activated by the verify-each LLVM knob (or its alias verify-after-all). The pipeline runner brackets each optimization pass with sub_29C8000 -- snapshot before, verify after:
// Pseudocode for the pipeline runner's verification protocol
// (entry: 0x29C8000, stack: 0x4B8 bytes)
snapshot_debug_metadata(M);
run_optimization_pass(M, "instcombine");
sub_29C8000(M, errs(), dbgCU, hashMap, "instcombine", 11, file, fileLen, jsonOut);
The pass name argument identifies which optimization just ran, so the JSON report can attribute any debug info degradation to the specific pass responsible. The verifier checks the full metadata inventory: subprograms, scopes, variables, types, labels, imported entities, and retained nodes. It produces ERROR diagnostics for dropped subprograms and WARNING diagnostics for dropped debug variable intrinsics.
Activation: -Xcicc -verify-each or -Xcicc -verify-after-all
Overhead: One full metadata snapshot + eight hash table constructions + per-function variable scan per optimization pass. Substantial for large modules.
Mode 2: Debugify Synthetic Injection + Verification (debugify-each)
The full Debugify cycle injects synthetic debug metadata before each pass, runs the pass, then verifies the synthetic metadata survived. This mode is more aggressive than Mode 1 because it tests every pass even on code compiled without -g.
// Debugify cycle pseudocode
sub_29C1CB0(M, "llvm.debugify"); // inject synthetic debug info
run_optimization_pass(M, "instcombine");
sub_29C8000(M, errs(), dbgCU, hashMap, "instcombine", ...); // verify
strip_debugify_metadata(M, "llvm.debugify"); // cleanup
The injector (sub_29C1CB0) creates "llvm.debugify" / "llvm.mir.debugify" named metadata nodes that serve as watermarks. The checker looks for these watermarks to distinguish synthetic from genuine debug info.
Activation: -Xcicc -debugify-each
Sub-knobs: debugify-level (locations or location+variables), debugify-quiet, debugify-func-limit, debugify-export
Mode 3: Debug Info Preservation Checking (verify-debuginfo-preserve)
A lighter-weight mode that checks only whether existing debug info survives optimization, without injecting synthetic metadata. This mode is available through the New Pass Manager infrastructure and can export results via verify-di-preserve-export.
Activation: -Xcicc -verify-debuginfo-preserve
Sub-knobs: verify-each-debuginfo-preserve, verify-di-preserve-export
Mode Selection Matrix
| Knob | Scope | Injects synthetic? | Checks variables? | JSON output? |
|---|---|---|---|---|
| verify-each | All passes | No | Yes (if -g) | If jsonOutput != NULL |
| debugify-each | All passes | Yes | Configurable via debugify-level | Via debugify-export |
| verify-debuginfo-preserve | All passes | No | Yes | Via verify-di-preserve-export |
| (none, -g active) | -- | No | No per-pass check | No |
Pipeline Integration
The verifier operates as an interleaved "check" pass. The New Pass Manager registers it via two wrappers in the pipeline construction code at 0x2270000--0x227FFFF:
| Address | Registration string | Role |
|---|---|---|
| sub_22702B0 | "NewPMCheckDebugifyPass]" | Verification after each pass |
| sub_2270390 | "NewPMDebugifyPass]" | Synthetic injection before each pass |
| sub_2270470 | "VerifierPass]" | Standard IR verifier (separate) |
The pipeline text parser (sub_2272BE0, 14KB) recognizes these as named module passes:
| Slot | Pipeline name | Class | Level |
|---|---|---|---|
| #26 | "check-debugify" | NewPMCheckDebugifyPass | Module |
| #35 | "debugify" | NewPMDebugifyPass | Module |
When debugify-each is active, the pipeline builder (sub_2277440, 60KB -- buildDefaultPipeline() equivalent) wraps every optimization pass in a debugify/check-debugify pair. When verify-each is active, only the check-debugify wrapper is inserted.
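The interleaving behavior can be sketched as follows -- a minimal Python model of how the builder wraps passes, not the actual C++ pass manager (pass names are illustrative):

```python
def build_pipeline(passes, debugify_each=False, verify_each=False):
    """Model of how the pipeline builder interleaves debugify/check-debugify
    wrappers: debugify-each injects synthetic metadata before each pass and
    verifies after; verify-each inserts only the post-pass check."""
    pipeline = []
    for p in passes:
        if debugify_each:
            pipeline.append("debugify")        # synthetic injection before the pass
        pipeline.append(p)
        if debugify_each or verify_each:
            pipeline.append("check-debugify")  # verification after the pass
    return pipeline

print(build_pipeline(["instcombine", "sroa"], verify_each=True))
# ['instcombine', 'check-debugify', 'sroa', 'check-debugify']
```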
Verification Function Signature
The function signature reconstructed from the binary:
bool sub_29C8000(
Module* module, // rdi
raw_ostream& output, // rsi -- diagnostic stream
NamedMDNode* dbgCU, // rdx -- "llvm.dbg.cu" metadata
DenseMap* hashMap, // rcx -- metadata identity table
const char* passName, // r8
size_t passNameLen, // stack+0x00
const char* fileName, // stack+0x08
size_t fileNameLen, // stack+0x10
raw_ostream* jsonOutput, // stack+0x18 -- NULL if no JSON report
...
);
// Returns: true = all checks passed, false = any violation detected
Verification Algorithm
The pass proceeds through nine sequential phases within a single function call. The 0x4B8-byte stack frame holds eight separate tracking data structures.
Phase 1: Module-Level Guard (0x29C8000 -- 0x29C807A)
Looks up the "llvm.dbg.cu" named metadata node via sub_BA8DC0 (Module::getNamedMetadata). If absent or empty, prints ": Skipping module without debug info\n" and returns 0. This is the fast path for modules compiled without -g.
Phase 2: Pre-Pass Metadata Snapshot (0x29C8080 -- 0x29C8AE5)
Initializes eight SmallVector/DenseMap structures on the stack and walks the compile unit metadata tree:
| Stack offset | Purpose | Copy helper |
|---|---|---|
| var_1F0 | DISubprogram tracking set | sub_29C6AD0 |
| var_1D0 | Scope chain working set | sub_29C1190 |
| var_1A0 | DIVariable tracking | sub_29C1060 |
| var_170 | Scope-to-function mapping | -- |
| var_140 | DICompileUnit refs | -- |
| var_130 | Primary metadata node buffer | -- |
For each DICompileUnit operand, the pass walks the subprogram list and retained types, recording every metadata node in hash tables for O(1) identity comparison. The hash function is:
uint64_t hash = ((ptr >> 4) ^ (ptr >> 9)) & (bucket_count - 1);
This is the standard DenseMap pointer hash with LLVM-layer sentinels. See Hash Table and Collection Infrastructure for the complete specification.
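In Python, the bucket computation looks like this (bucket_count must be a power of two for the mask to behave as a modulo):

```python
def densemap_bucket(ptr: int, bucket_count: int) -> int:
    """Pointer hash used for the metadata identity tables: XOR two
    shifted copies of the pointer, then mask down to the table size."""
    assert bucket_count & (bucket_count - 1) == 0, "table size must be a power of two"
    # Heap pointers are typically 16-byte aligned, so the low 4 bits
    # carry no information -- the >>4 shift discards them before mixing.
    return ((ptr >> 4) ^ (ptr >> 9)) & (bucket_count - 1)

print(densemap_bucket(0x1000, 64))  # (0x100 ^ 0x8) & 63 = 8
```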
Phase 3: DISubprogram Iteration (0x29C82BE -- 0x29C84C8)
Walks the subprogram list attached to each compile unit via linked-list traversal ([node+8] = next pointer). For each subprogram, reads the metadata tag byte at [node-18h]:
| Tag byte | DWARF tag | Action |
|---|---|---|
| 0x54 ('T') | DW_TAG_template_parameter | Skip |
| 0x55 ('U') | Compile unit / subprogram variant | Special handling |
| 0x44 ('D') | DW_TAG_subprogram | Validate |
| 0x45 ('E') | DW_TAG_lexical_block | Validate scope chain |
| 0x46 ('F') | DW_TAG_lexical_block_file | Validate scope chain |
| 0x47 ('G') | DW_TAG_namespace | Validate scope chain |
The flag byte at [rdx+21h] & 0x20 tests the "definition" bit (only defined, non-declaration subprograms are tracked). Values outside 0x44--0x47 are flagged as invalid scope types.
Phase 4: Hash Table Construction (0x29C8508 -- 0x29C8AC2)
Allocates and populates eight sorted hash tables via sub_C7D670 (aligned_alloc, alignment=8), each holding 16-byte entries [pointer, secondary_key]:
| Object offset | Table contents | Purpose |
|---|---|---|
| +18h | DISubprogram | Function-level metadata |
| +28h | DIScope | Scope hierarchy |
| +48h | DIGlobalVariable | Module-level variables |
| +58h | DILocalVariable | Function-local variables |
| +78h | DIType | Type descriptions |
| +88h | DIImportedEntity | using declarations |
| +A8h | DILabel | Label metadata |
| +B8h | Retained nodes | Misc retained metadata |
The MDNode operand access pattern used during population:
// MDNode internal layout decoding (0x29C8508+)
byte flags = *(ptr - 0x10);
if (flags & 0x02) { // distinct metadata
operands = *(ptr - 0x20); // operand array is before the node
} else {
int count = (flags >> 2) & 0x0F;
operands = ptr - 0x10 - (count * 8); // inline operands
}
Phase 5: Per-Function Debug Variable Checking (0x29C8B3B -- 0x29C9060)
Iterates every function in the module. For each, looks up its DISubprogram in the hash table and cross-references dbg.value() / dbg.declare() intrinsics against the pre-snapshot. Two diagnostic levels:
ERROR (pass dropped a subprogram entirely):
ERROR: <pass> dropped DISubprogram of <function> from <file>
ERROR: <pass> did not generate DISubprogram for <function> from <file>
WARNING (pass dropped individual variable tracking):
WARNING: <pass> drops dbg.value()/dbg.declare() for <var> from function <func> (file <file>)
The distinction between "dropped" and "did not generate" is significant: "dropped" means metadata existed before the pass and was deleted; "not-generate" means the pass created new IR (e.g., from inlining or outlining) without attaching corresponding debug metadata. This taxonomy is important for GPU compilation because kernel outlining and device function inlining frequently create new IR nodes.
The variable name is resolved by:
- Getting the DISubprogram from the metadata ref
- Calling sub_AF34D0 (DIScope::getScope()) to walk the scope chain upward
- Getting the file via operand [10h] of the scope's file ref
- Calling sub_B91420 (MDString::getString()) to convert the MDString to a StringRef
Phase 6: Per-Instruction Location Verification (0x29C8D42 -- 0x29C8D85)
Delegated to sub_29C3AB0 (5,592 bytes), which performs detailed checks:
- Every instruction with a DebugLoc has a valid DILocation
- DILocation scope chains resolve to a valid DISubprogram
- No orphaned debug locations reference deleted subprograms
- BB-level consistency: all instructions in a basic block share compatible scopes
- Dropped location tracking: emits "dropped DILocation" diagnostics
The JSON output from this sub-pass uses structured field names: "DILocation", "bb-name", "fn-name", "action" (with values "drop" or "not-generate").
Phase 7: JSON Structured Output (0x29C90BC -- 0x29C94E2)
When a non-null JSON output stream is provided (the jsonOutput parameter), the pass serializes a structured report via sub_2241E40 (YAML/JSON serializer):
{"file":"kernel.cu", "pass":"instcombine", "bugs": [
{"metadata":"DISubprogram", "name":"_Z6kernelPf", "fn-name":"_Z6kernelPf", "action":"drop"},
{"metadata":"dbg-var-intrinsic", "name":"idx", "fn-name":"_Z6kernelPf", "action":"not-generate"}
]}
This JSON reporting mechanism is an NVIDIA extension with no upstream LLVM equivalent. It feeds into NVIDIA's internal CI infrastructure to track debug info quality regressions across compiler versions. The "no-name" string serves as fallback when the pass name pointer is NULL.
The serialization calls sub_CB7060 (YAML::IO constructor) and proceeds through sub_C6D380 (object emission), sub_C6C710 (array emission), and sub_C6B0E0 (key writer). After serialization, the stream is flushed via sub_CB7080 and freed via sub_CB5B00. If the file descriptor is valid (fd != -1), it is closed via sub_C837B0 (close(fd)).
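A downstream consumer of this report might aggregate regressions by metadata kind and action. A minimal sketch (the report shape follows the fields shown above; the aggregation itself is illustrative, not part of cicc):

```python
import json
from collections import Counter

def summarize_report(report_text: str) -> Counter:
    """Count debug info regressions by (metadata kind, action) pair."""
    report = json.loads(report_text)
    return Counter((bug["metadata"], bug["action"]) for bug in report["bugs"])

sample = '''{"file":"kernel.cu", "pass":"instcombine", "bugs": [
  {"metadata":"DISubprogram", "name":"_Z6kernelPf", "fn-name":"_Z6kernelPf", "action":"drop"},
  {"metadata":"dbg-var-intrinsic", "name":"idx", "fn-name":"_Z6kernelPf", "action":"not-generate"}
]}'''
print(summarize_report(sample))
```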
Phase 8: Result Reporting and Metadata Reconstruction (0x29C94E2 -- 0x29C9A27)
Prints the summary line ("<pass>: PASS\n" or "<pass>: FAIL\n"), then reconstructs the module's metadata tables from the verified versions -- reallocating subprogram, type, variable, label, and global variable arrays and copying verified metadata back into the compile unit structures.
The result is a 3-way outcome in bit flags (combined at 0x29C9073--0x29C9080 via AND):
- Bit 0: any verification failure (determines PASS/FAIL)
- Bit 1: JSON report was requested and successfully written
The final result is PASS only if all sub-checks passed AND the JSON report (if requested) was successfully written.
Cleanup frees all eight temporary hash tables (each via sub_C7D6A0 -- sized dealloc with alignment 8) and linked-list nodes via j_j___libc_free_0; SmallVector inline buffers are detected by pointer comparison (if ptr == stack_addr, the free is skipped).
Phase 9: Return (0x29C9A12 -- 0x29C9A27)
Returns var_420 (bool) in the al register. Standard epilog restores rbx, r12--r15, rbp.
Complete Diagnostic Code Table
Every diagnostic string emitted by the debug verification subsystem, with exact provenance and trigger conditions.
Verification Pass Diagnostics (sub_29C8000)
| # | Severity | Diagnostic string | Trigger condition | Address range |
|---|---|---|---|---|
| D01 | INFO | ": Skipping module without debug info\n" | "llvm.dbg.cu" absent or empty | 0x29C8000--0x29C807A |
| D02 | ERROR | "ERROR: <pass> dropped DISubprogram of <func> from <file>\n" | DISubprogram existed pre-pass, absent post-pass | 0x29C8C08--0x29C8D2E |
| D03 | ERROR | "ERROR: <pass> did not generate DISubprogram for <func> from <file>\n" | New function has no DISubprogram | 0x29C8C08--0x29C8D2E |
| D04 | WARNING | "WARNING: <pass> drops dbg.value()/dbg.declare() for <var> from function <func> (file <file>)\n" | Variable intrinsics lost for a tracked variable | 0x29C8E4E--0x29C9060 |
| D05 | SUMMARY | "<pass>: PASS\n" | All checks passed | 0x29C94E2+ |
| D06 | SUMMARY | "<pass>: FAIL\n" | Any check failed | 0x29C94E2+ |
| D07 | ERROR | "Could not open file: <path>\n" | JSON report file I/O failure | 0x29C90BC--0x29C94E2 |
Per-Instruction Verifier Diagnostics (sub_29C3AB0)
| # | Severity | Diagnostic string | Trigger condition |
|---|---|---|---|
| D08 | ERROR | "<pass> dropped DILocation" | Instruction had DILocation pre-pass, absent post-pass |
| D09 | ERROR | "<pass> did not generate DISubprogram" | DILocation references nonexistent subprogram |
| D10 | ERROR | (scope chain invalid) | DILocation scope chain does not resolve to a valid DISubprogram |
| D11 | WARNING | (BB inconsistency) | Instructions within a basic block reference incompatible scopes |
JSON Report Field Schema
| # | Field key | Type | Values | Context |
|---|---|---|---|---|
| J01 | "file" | string | Source filename | Top-level report |
| J02 | "pass" | string | Pass name, or "no-name" if NULL | Top-level report |
| J03 | "bugs" | array | Array of bug objects | Top-level report |
| J04 | "metadata" | string | "DISubprogram", "dbg-var-intrinsic", "DILocation" | Per-bug object |
| J05 | "name" | string | Entity name (function or variable) | Per-bug object |
| J06 | "fn-name" | string | Containing function name | Per-bug object |
| J07 | "bb-name" | string | Basic block name | Per-bug object (location bugs) |
| J08 | "action" | string | "drop" or "not-generate" | Per-bug object |
Action Value Taxonomy
| Action | Meaning | Common cause in GPU compilation |
|---|---|---|
| "drop" | Pass explicitly or inadvertently deleted existing debug metadata | Dead code elimination removing a function with debug info |
| "not-generate" | Pass created new IR without attaching corresponding debug metadata | Kernel outlining, device function inlining, or loop transformation creating new BBs |
String Encoding Details
Several diagnostic strings are constructed inline using immediate mov instructions rather than string table references:
| String | Encoding | Instruction |
|---|---|---|
| "ERRO" | 0x4F525245 | mov dword [rsp+X], 0x4F525245 |
| "R:" | 0x3A52 | mov word [rsp+X+4], 0x3A52 |
| "WARNING:" | 0x3A474E494E524157 | mov qword [rsp+X], 0x3A474E494E524157 |
These inline immediate constructions avoid string table lookups and are a common LLVM raw_ostream optimization for short fixed strings.
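The encodings can be checked by reinterpreting each immediate as the little-endian byte string it stores:

```python
def imm_to_str(value: int, width: int) -> str:
    """Decode an x86 immediate as the little-endian ASCII string it encodes."""
    return value.to_bytes(width, "little").decode("ascii")

print(imm_to_str(0x4F525245, 4))          # ERRO
print(imm_to_str(0x3A52, 2))              # R:
print(imm_to_str(0x3A474E494E524157, 8))  # WARNING:
```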
Compile Unit Descriptor Layout
The verification pass reads and reconstructs a per-CU descriptor object (referenced at [rbp+var_440]) with the following layout:
| Offset | Type | Contents | Copy helper |
|---|---|---|---|
+08h | void** | Subprogram array data pointer | -- |
+10h | void** | Subprogram array end pointer | -- |
+18h | size_t | Subprogram count | -- |
+20h | void* | Scope chain data | -- |
+28h | size_t | Scope chain count | -- |
+38h | void** | Global variable array data | -- |
+40h | void** | Global variable array end | -- |
+48h | size_t | Global variable count | -- |
+50h | void* | Local variable list head | -- |
+58h | size_t | Local variable count | -- |
+68h | void** | Type array data | -- |
+70h | void** | Type array end | -- |
+78h | size_t | Type count | -- |
+80h | void* | Imported entities list | sub_29C2230 (32-byte node deep copy) |
+88h | size_t | Imported entities count | -- |
+98h | void** | Label array data | -- |
+A0h | void** | Label array end | -- |
+A8h | size_t | Label count | -- |
+B0h | void* | Retained nodes list | sub_29C0F30 |
+B8h | size_t | Retained nodes count | -- |
DISubprogram Node Layout
Accessed during Phase 3 scope chain validation:
| Offset | Type | Contents |
|---|---|---|
| [node-38h] | void* | Pointer to compile unit / parent scope |
| [node-18h] | byte | Metadata tag byte (DWARF tag discriminator) |
| [node-14h] | uint32 | Flags field (lower 27 bits = operand index) |
| [node+08h] | void* | Next pointer in linked list |
| [node+18h] | void* | Linked list head for child scopes |
| [node+20h] | void* | Linked list tail for child scopes |
| [node+28h] | void* | Variable attachment (DIVariable list) |
| [node+38h] | void* | Additional metadata ref |
| [node+48h] | void* | Subprogram scope list head |
| [node+50h] | void* | Subprogram scope list tail |
Debugify Injector (sub_29C1CB0)
The Debugify injector creates synthetic debug metadata to test whether optimization passes preserve debug info correctly. It is the counterpart to the verifier -- the injector sets up the watermarks, and the verifier checks them.
Named metadata markers:
- "llvm.debugify" -- marks the module as containing synthetic debug info (standard Debugify)
- "llvm.mir.debugify" -- marks MIR-level synthetic debug info
Behavior controlled by debugify-level:
- locations -- inject only DILocation on every instruction (cheaper, tests location preservation)
- location+variables -- inject DILocation plus synthetic dbg.value()/dbg.declare() for every SSA value (full coverage, higher overhead)
The injector assigns monotonically increasing line numbers to every instruction and creates one DILocalVariable per SSA value that produces a result. The variable names follow the pattern "dbg_var_N" where N is the SSA value index. After injection, the module has guaranteed 100% debug coverage, making any coverage loss attributable to the subsequent optimization pass.
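The numbering scheme can be sketched on a toy instruction list -- a hypothetical model of the described behavior, not the actual sub_29C1CB0 code:

```python
def debugify_inject(instructions):
    """Assign a monotonically increasing synthetic line number to every
    instruction and a dbg_var_N variable to every value-producing one."""
    locations, variables = {}, {}
    var_idx = 0
    for line, (name, produces_value) in enumerate(instructions, start=1):
        locations[name] = line
        if produces_value:
            variables[name] = f"dbg_var_{var_idx}"
            var_idx += 1
    return locations, variables

locs, dbg_vars = debugify_inject([("load", True), ("store", False), ("add", True)])
print(locs)      # {'load': 1, 'store': 2, 'add': 3}
print(dbg_vars)  # {'load': 'dbg_var_0', 'add': 'dbg_var_1'}
```

After injection every instruction has a location and every result a variable, so any coverage loss found by the checker is attributable to the pass that ran in between.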
Verbosity Control
Two global flags provide fine-grained control over verification output:
qword_5008FC8 -- Verbose Diagnostic Output Enable
Boolean flag (byte). Controls the output stream selection:
- When 0: uses sub_CB72A0 (null/discard stream constructor) -- diagnostics silently discarded
- When non-zero: uses sub_CB7330 (stderr stream accessor) -- diagnostics printed to stderr
This flag gates the ERROR and WARNING messages. The JSON structured output is controlled separately by the jsonOutput parameter. Setting qword_5008FC8 = 0 suppresses text diagnostics while still producing JSON output.
qword_5008C88 -- Metadata Depth Threshold
Signed 32-bit integer, read at 0x29C8371. Controls how deep the scope chain walk goes:
- When <= 0: the deep scope chain walk is skipped for non-subprogram metadata; only top-level DISubprogram validation runs
- When > 0: full scope chain traversal validates every DILexicalBlock, DILexicalBlockFile, and DINamespace in the hierarchy
This allows production builds to run lightweight verification (subprogram-only) while development builds run exhaustive scope chain checking.
Debugify-Specific Knobs
| Knob | Type | Default | Registration | Effect |
|---|---|---|---|---|
| debugify-quiet | bool | off | ctor_493 at 0x556960 | Suppress all debugify text output |
| debugify-func-limit | int | unlimited | ctor_493 at 0x556960 | Max functions to inject synthetic debug info into |
| debugify-level | enum | location+variables | ctor_493 at 0x556960 | locations or location+variables |
| debugify-function | string | -- | ctor_493 at 0x556960 | Restrict debugify to a single named function |
| check-debugify-function | string | -- | ctor_493 at 0x556960 | Restrict check-debugify to a single named function |
| debugify-each | bool | off | ctor_377 at 0x516190 | Wrap every pass in debugify/check-debugify |
| debugify-export | string | -- | ctor_377 at 0x516190 | Export debugify results to file |
GPU Debug Info: What PTX Needs
DWARF for PTX differs fundamentally from DWARF for x86. PTX is a virtual ISA -- there are no physical registers, no real stack, and no fixed instruction encoding. The debug metadata cicc emits serves two consumers: cuda-gdb (which maps PTX locations back to source) and ptxas (which carries debug info forward into SASS/ELF for the hardware debugger).
The .loc Directive
The AsmPrinter (sub_31D55F0) emits DWARF .loc directives before each PTX instruction that has a valid DebugLoc:
.loc 1 42 0 // file 1, line 42, column 0
ld.param.u64 %rd1, [_Z6kernelPf_param_0];
.loc 1 43 5
mul.wide.u32 %rd2, %r1, 4;
The .file directives (sub_31E4280) establish the file table, and sub_31E6100 maintains a file/line-to-MCSymbol mapping for line table construction.
The dwarf-extended-loc knob (enum: Default/Enable/Disable, registered at 0x490000 range) controls whether extended flags appear in .loc directives. When disabled, cicc emits bare .loc file line column without the is_stmt, prologue_end, or discriminator extensions. This is relevant because older ptxas versions do not parse extended .loc flags.
The line-info-inlined-at Extension
The -line-info-inlined-at LLVM knob (registered at ctor_043 / 0x48D7F0, exposed as -no-lineinfo-inlined-at in the cicc CLI, which sets -line-info-inlined-at=0 on the backend) controls whether inlined-at chains are preserved in PTX line info. When enabled (the default), every .loc directive for inlined code carries the full inlining chain so cuda-gdb can reconstruct the call stack at any point in the inlined code. When disabled, only the immediate source location is emitted, losing the inlining context but producing smaller PTX.
The -show-src / nvptx-emit-src Feature
The -show-src CLI flag (stored at flag struct offset +808, routed to the backend as -nvptx-emit-src) enables source line interleaving in PTX output. When active, the AsmPrinter annotates each .loc directive with the corresponding source line as a PTX comment:
// kernel.cu:42 float val = input[idx];
.loc 1 42 0
ld.global.f32 %f1, [%rd2];
// kernel.cu:43 val = val * val;
.loc 1 43 0
mul.f32 %f2, %f1, %f1;
This is purely a readability feature for developers inspecting PTX output. It has no effect on cuda-gdb or debug quality -- the source text is embedded as comments that ptxas ignores.
NvvmDebugVersion
The NVVM container format includes a debug version field (NvvmDebugVersion, packed as {Major:uint16, Minor:uint16} at container offset 0x08--0x09). The current version is Major=3, Minor<=2. The reader (sub_CD41B0) validates that Major equals 3 and warns if Minor exceeds 2. If absent, the default {3, 2} is assumed. This version tracks the debug metadata schema independently of the NVVM IR version, allowing debug format evolution without breaking IR compatibility.
The standalone pipeline (sub_12BFF60) performs a consistency check: if the container declares debug_info_present (bit 4 of flags) AND the debug mode flag is set AND the debug version has not been validated, it returns error code 3 (incompatible).
DbgRecord Format (LLVM 20)
cicc v13.0 uses LLVM 20's DbgRecord format by default (write-experimental-debuginfo = true, registered at ctor_025). This replaces traditional dbg.value()/dbg.declare() intrinsics with non-intrinsic debug records attached directly to instructions. Related knobs:
| Knob | Default | Registration | Effect |
|---|---|---|---|
write-experimental-debuginfo | true | ctor_025 | Use DbgRecord format for new debug info |
write-experimental-debuginfo-iterators-to-bitcode | true | ctor_018 | Serialize DbgRecords to bitcode |
preserve-input-debuginfo-format | false | ctor_018 | When true, preserve whichever format the input uses |
The verifier handles both formats: it checks for dbg.value()/dbg.declare() intrinsics AND for DbgRecord attachments.
Debug Info Stripping Passes
cicc includes a family of debug info stripping passes registered in the pipeline parser (at sub_12C6910 and related):
| Pipeline name | Slot | LLVM pass | Effect |
|---|---|---|---|
"strip-dead-debug-info" | #110 | StripDeadDebugInfoPass | Remove debug info for dead functions/globals |
"strip-debug-declare" | #112 | StripDebugDeclarePass | Remove dbg.declare() intrinsics only |
"strip-nondebug" | #113 | StripNonDebugSymbolsPass | Remove non-debug symbols (keep debug) |
"strip-nonlinetable-debuginfo" | #114 | StripNonLineTableDebugInfoPass | Strip everything except line tables |
The strip-nonlinetable-debuginfo pass is the key one for the -generate-line-info mode: it strips all debug metadata except .loc / .file directives, producing line-number-only debug info without variable locations, type descriptions, or scope trees. This is what nvcc's --generate-line-info flag triggers -- enough for profiler source correlation but not enough for stepping through code in cuda-gdb.
The core debug info stripping implementation lives at 0xAE0000 (Zone 3 of the type system module); it calls stripDebugInfo() to remove all llvm.dbg.* intrinsics from the module.
Debug Compilation Modes
cicc supports three debug info levels, controlled by CLI flags that route through the flag dispatch table:
| CLI flag | Flag offset | Backend routing | Debug level |
|---|---|---|---|
-g | +296 | -debug-compile to both linker and optimizer | Full debug info (FullDebug emission kind) |
-generate-line-info | +328 | -generate-line-info to optimizer only | Line tables only (LineTablesOnly emission kind) |
| (neither) | -- | -- | No debug info (NoDebug) |
When -g is active, cicc emits DICompileUnit with full emission kind, preserves all DISubprogram, DILocalVariable, DIType, and scope metadata through the pipeline, and the backend emits complete DWARF sections. The verifier runs at full depth.
When -generate-line-info is active, the StripNonLineTableDebugInfoPass runs early in the pipeline, leaving only line table metadata. The verifier still runs but only checks DILocation / DISubprogram consistency (variable checks are skipped because the variable metadata was intentionally stripped).
Key routing difference: -g routes to BOTH the linker (-debug-compile) and optimizer (-debug-compile), because libdevice linking needs the debug flag to preserve user debug info during merging. -generate-line-info routes to the optimizer only.
The frontend uses two independent guard mechanisms for debug emission:
- dword_4D046B4 -- global flag checked at statement/parameter level by sub_9433F0 (per-param debug) and sub_943430 (per-global debug)
- [ctx+0x170] -- compile unit pointer checked at module finalization level by sub_915400
The NVVM container carries a dedicated DebugInfo enum (3 values: NONE, LINE_INFO, DWARF) at deserialized struct offset +12, separate from the module metadata.
Complete Knob Reference
| Knob | Type | Default | Registration | Effect |
|---|---|---|---|---|
-g / -debug-compile | bool | off | ctor_043 at 0x48D7F0 | Full debug compilation |
-generate-line-info | bool | off | ctor_043 at 0x48D7F0 | Line tables only |
-no-lineinfo-inlined-at | bool | off | CLI flag dispatch | Disable inlined-at tracking (sets -line-info-inlined-at=0) |
-show-src / -nvptx-emit-src | bool | off | Flag offset +808 | Interleave source in PTX comments |
dwarf-extended-loc | enum | Default | 0x490000 range | Default/Enable/Disable extended .loc flags |
dwarf-version | unsigned | (platform) | LLVM default | DWARF version for debug sections |
debugify-each | bool | off | ctor_377 at 0x516190 | Run Debugify+CheckDebugify around every pass |
debugify-level | enum | location+variables | ctor_493 at 0x556960 | locations or location+variables |
debugify-quiet | bool | off | ctor_493 at 0x556960 | Suppress debugify diagnostics |
debugify-func-limit | int | unlimited | ctor_493 at 0x556960 | Max functions to debugify |
debugify-function | string | -- | ctor_493 at 0x556960 | Restrict debugify to named function |
check-debugify-function | string | -- | ctor_493 at 0x556960 | Restrict check-debugify to named function |
debugify-export | string | -- | ctor_377 at 0x516190 | Export debugify results to file |
verify-each | bool | off | ctor_043 at 0x48D7F0 | Run IR verifier after every pass |
verify-after-all | alias | -- | ctor_043 at 0x48D7F0 | Alias for verify-each |
verify-debuginfo-preserve | bool | off | ctor_376 at 0x512DF0 | Enable debug info preservation checking |
verify-each-debuginfo-preserve | bool | off | ctor_377 at 0x516190 | Per-pass debug info preservation |
verify-di-preserve-export | string | -- | ctor_377 at 0x516190 | Export preservation results to file |
no-inline-line-tables | bool | off | sub_29E2B40 | Prevent inlining from merging line tables |
write-experimental-debuginfo | bool | true | ctor_025 | Use DbgRecord format |
preserve-input-debuginfo-format | bool/default | false | ctor_018 | Preserve input debug format |
qword_5008FC8 | bool | off | -- | Verbose diagnostic output enable |
qword_5008C88 | int32 | >0 | -- | Metadata depth threshold (<=0 skips deep scope walk) |
CAN_FINALIZE_DEBUG | env var | -- | sub_60F290 et al. | Debug finalization control |
NVVM_IR_VER_CHK | env var | enabled | sub_12BFF60 | Override debug version checking (set "0" to disable) |
DWARF Emission Backend
The actual DWARF section emission lives in a separate module at 0x3990000--0x39DF000:
| Address | Size | Function |
|---|---|---|
sub_399B1E0 | 29KB | DwarfDebug::beginModule() -- initializes from llvm.dbg.cu |
sub_3997B50 | 33KB | .debug_aranges emission |
sub_399D1D0 | 12KB | Range list emission (DW_RLE_*) |
sub_399EB70 | 12KB | Register location expressions |
sub_39BDF60 | 38KB | .debug_names accelerator table |
sub_39B6390 | 33KB | DWARF form size calculator |
sub_215ACD0 | 8.1KB | Module-level emission entry (NVPTX Debug Info Emission) |
The module-level entry sub_215ACD0 checks *(a1+240)->field_344 to determine if DWARF is enabled, then looks up the "NVPTX DWARF Debug Writer" / "NVPTX Debug Info Emission" pass info. The NVPTX backend does not emit physical register locations (GPUs have no DWARF register numbering scheme that maps to hardware); instead, it emits virtual register references that cuda-gdb resolves through ptxas's SASS-level debug info.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| "llvm.global_ctors" utility | sub_29C00F0 | -- | -- |
| errs() diagnostic output stream accessor | sub_29C0AE0 | -- | -- |
| PassManager / PassAdaptor infrastructure ("PassManager", "PassAdaptor") | sub_29C0DC0 | -- | -- |
| Copy retained-nodes list (SmallVector deep copy) | sub_29C0F30 | -- | -- |
| Copy local-variable list | sub_29C1060 | -- | -- |
| Copy scope-chain list | sub_29C1190 | -- | -- |
| Validate scope chain connectivity | sub_29C12C0 | -- | -- |
| Debugify synthetic debug info injector ("llvm.debugify", "llvm.mir.debugify") | sub_29C1CB0 | -- | -- |
| Merge/update tracking sets after verification | sub_29C1F00 | -- | -- |
| Serialize verification result to stream | sub_29C20D0 | -- | -- |
| Copy imported-entities list (32-byte node deep copy) | sub_29C2230 | -- | -- |
| Per-instruction DILocation verifier | sub_29C3AB0 | 5,592B | -- |
| DenseMap::FindAndConstruct for tracking map | sub_29C5270 | -- | -- |
| Set insert with metadata key normalization | sub_29C6AD0 | -- | -- |
| Set insert variant (different key extraction) | sub_29C6DE0 | -- | -- |
| Debug info verification pass (main entry) | sub_29C8000 | 12,480B | -- |
| no-inline-line-tables flag handler | sub_29E2B40 | -- | -- |
| NewPMCheckDebugifyPass wrapper | sub_22702B0 | -- | -- |
| NewPMDebugifyPass wrapper | sub_2270390 | -- | -- |
| VerifierPass wrapper (standard IR verifier) | sub_2270470 | -- | -- |
| Pass pipeline text parser | sub_2272BE0 | 14KB | -- |
| buildDefaultPipeline() equivalent | sub_2277440 | 60KB | -- |
| Flag filter (checks -debug-compile, -g, -generate-line-info) | sub_12C6910 | -- | -- |
| Emit per-instruction .loc DWARF directive | sub_31D55F0 | -- | -- |
| Emit .file/.loc directives (function scope) | sub_31E4280 | -- | -- |
| insertDebugLocEntry (file/line to symbol mapping) | sub_31E6100 | -- | -- |
| DwarfDebug::beginModule() | sub_399B1E0 | 29KB | -- |
| .debug_aranges emission | sub_3997B50 | 33KB | -- |
| Module-level emission entry / NVPTX Debug Info Emission | sub_215ACD0 | 8.1KB | -- |
| NVVM IR version + debug version validator | sub_12BFF60 | ~9KB | -- |
| NVVM container debug version check | sub_CD41B0 | -- | -- |
| Emit DILocalVariable for parameter (frontend) | sub_9433F0 | -- | -- |
| Emit debug info for GlobalVariable (frontend) | sub_943430 | -- | -- |
| Set DebugLoc from EDG source position (frontend) | sub_941230 | -- | -- |
| Finalize: "Debug Info Version" = 3 (frontend) | sub_915400 | -- | -- |
LLVM Infrastructure Functions Used
| Address | Identity | Called from |
|---|---|---|
sub_BA8DC0 | Module::getNamedMetadata(StringRef) | Phase 1 |
sub_B2FC80 | isa<DISubprogram> or similar MDNode type check | Phase 3 |
sub_B2FC00 | MDNode type check (different metadata kind) | Phase 3 |
sub_B92180 | MDNode::getContext() | Phase 4 |
sub_B91420 | MDString::getString() | Phase 5 |
sub_B91A10 | MDNode::getOperand(unsigned) | Phase 4 |
sub_B14240 | MDNode operand range iterator | Phase 4 |
sub_AF34D0 | DIScope::getScope() -- walk scope chain upward | Phase 5 |
sub_AF4500 | DISubprogram::describes(Function) | Phase 5 |
sub_B58DC0 | DenseSet::insert | Phase 2 |
sub_B96E90 | DenseMap::insert_or_assign | Phase 4 |
sub_B91220 | DenseMap::erase | Phase 8 |
sub_C7D670 | aligned_alloc(size, alignment=8) | Phase 4 |
sub_C7D6A0 | aligned_free_sized(ptr, size, alignment=8) | Phase 8 |
sub_CB7330 | errs() -- get stderr raw_ostream | Phase 5 |
sub_CB72A0 | nulls() -- get null/discard raw_ostream | Phase 5 (quiet mode) |
sub_CB6200 | raw_ostream::write(const char*, size_t) | Phase 5, 7 |
sub_CB5D20 | raw_ostream::write(char) | Phase 5 |
sub_CB5B00 | raw_ostream destructor / free | Phase 7 |
sub_CB7060 | YAML::IO output constructor | Phase 7 |
sub_CB7080 | raw_ostream::flush() | Phase 7 |
NVIDIA Modifications vs Stock LLVM
The key differences from upstream LLVM's CheckDebugInfoPass:
- JSON structured output -- Upstream only prints text diagnostics. NVIDIA added a YAML/JSON serializer (sub_2241E40, sub_CB7060) that produces machine-parseable bug reports with "file", "pass", "bugs" fields and per-bug "action" classification ("drop" vs "not-generate").
- Verbosity control -- Two global flags (qword_5008FC8 for output enable, qword_5008C88 for depth threshold) allow fine-grained control over verification overhead. Upstream has only the debugify-quiet knob.
- Eight-table metadata tracking -- Upstream CheckDebugInfoPass tracks DISubprograms and debug variable intrinsics. NVIDIA's version maintains eight separate hash tables covering subprograms, scopes, global variables, local variables, types, imported entities, labels, and retained nodes -- a much more comprehensive snapshot.
- Metadata reconstruction -- After verification, NVIDIA's pass reconstructs the module's metadata tables from the verified versions (Phase 8), which upstream does not do. This means the verifier can also serve as a "repair" pass that normalizes metadata after an optimization pass corrupts it.
- No kernel-specific handling -- The verifier treats __global__ and __device__ functions identically. CUDA-specific debug info (address space annotations, shared memory debug, warp-level location info) is validated elsewhere, likely during NVPTX backend emission.
- DbgRecord format support -- cicc v13.0 defaults to the LLVM 20 DbgRecord format (write-experimental-debuginfo = true), so the verifier handles both intrinsic-based and record-based debug info transparently.
Cross-References
- AsmPrinter & PTX Body Emission -- .loc / .file directive emission, per-instruction debug annotation
- PTX Emission -- module-level emission entry, DWARF debug writer lookup
- Debug Info Pipeline -- end-to-end debug info flow from frontend to backend
- CLI Flags -- -g, -generate-line-info, -show-src flag routing
- LLVM Knobs -- debugify-*, verify-each, dwarf-* knobs
- Pipeline & Ordering -- where debug verification fits in the pass pipeline
- Hash Infrastructure -- DenseMap/DenseSet implementation used by tracking tables
- Diagnostics -- broader diagnostic and remark system
- NVVM Container -- NvvmDebugVersion field
Bitcode Reader/Writer
CICC v13.0 contains the complete LLVM 20.0.0 bitcode serialization infrastructure -- reader, writer, metadata loader, module summary IO, and the full intrinsic upgrader -- spread across two address ranges. The 0x9F0000--0xA2FFFF range hosts a first copy of the bitcode reader/writer core used by the standalone libNVVM pipeline, while the 0x1500000--0x157FFFF range hosts the primary copy used by the two-phase compilation path. Both copies are structurally identical builds of LLVM's BitcodeReader.cpp and BitcodeWriter.cpp, linked at different addresses. The reader is stock upstream LLVM 20.0.0 with no NVIDIA modifications to the deserialization logic itself. The writer, however, contains a single critical NVIDIA change: it stamps "LLVM7.0.1" as the bitcode producer identification string rather than the true "LLVM20.0.0", preserving backward compatibility with the NVVM IR ecosystem.
The bitcode subsystem sits at the boundary between all pipeline stages. The standalone pipeline validates magic bytes on entry, the module linker reads bitcode from separate compilation objects, the two-phase orchestrator serializes per-function bitcode blobs between Phase I and Phase II, and the NVVM container wraps bitcode payloads in a proprietary envelope. Every bitcode load also runs the intrinsic upgrader -- a 700+ KB AutoUpgrade subsystem that includes roughly 240 KB of effectively-dead x86 intrinsic renaming tables.
Key Facts
| Property | Value |
|---|---|
| Reader (primary copy) | sub_151B070 (0x151B070, 123 KB) -- parseFunctionBody |
| Reader (standalone copy) | sub_9F2A40 (0x9F2A40, 185 KB) -- parseFunctionBody |
| Writer | sub_1538EC0 (0x1538EC0, 58 KB) -- writeModule |
| Metadata reader | sub_A09F80 (0xA09F80, 121 KB) -- MetadataLoader::parseOneMetadata |
| X86 AutoUpgrade (name) | sub_156E800 (0x156E800, 593 KB) -- UpgradeIntrinsicFunction |
| X86 AutoUpgrade (call) | sub_A939D0 (0xA939D0, 457 KB) -- UpgradeIntrinsicCall |
| NVVM version checker | sub_157E370 (0x157E370, 7 KB) |
| NVVM version checker (standalone) | sub_12BFF60 (0x12BFF60, 9 KB) |
| Producer init (ctor_036) | 0x48CC90 (544 bytes) -- reads LLVM_OVERRIDE_PRODUCER |
| Producer init (ctor_154) | 0x4CE640 (215 bytes) -- reads LLVM_OVERRIDE_PRODUCER |
| Address range (primary) | 0x1500000--0x157FFFF |
| Address range (standalone copy) | 0x9F0000--0xA2FFFF |
| Address range (AutoUpgrade) | 0xA80000--0xABFFFF |
| Hardcoded producer string | "LLVM7.0.1" (writer), "20.0.0" (internal fallback) |
| NVVM IR version gate | major == 3, minor <= 2 |
| Upstream source | lib/Bitcode/Reader/BitcodeReader.cpp, lib/Bitcode/Writer/BitcodeWriter.cpp, lib/IR/AutoUpgrade.cpp |
Bitcode Format Basics
LLVM bitcode uses two magic signatures. The pipeline validates both at module load time:
| Magic Bytes | Meaning | Where Checked |
|---|---|---|
0x42 0x43 0xC0 0xDE ('B' 'C' 0xC0 0xDE) | Raw LLVM bitcode stream | sub_12C06E0 (module linker)
0xDE 0xC0 0x17 0x0B (0x0B17C0DE little-endian) | Bitcode wrapper format (offset + size header around raw stream) | Same function
If neither signature matches, the pipeline sets *error_code = 9 ("invalid bitcode") and aborts. The wrapper format is more common in practice -- nvcc generates wrapper-format .bc files that embed the raw stream at an offset specified in the wrapper header. The wrapper header is 20 bytes:
struct BitcodeWrapperHeader {
uint32_t magic; // 0x0B17C0DE (bytes 0xDE 0xC0 0x17 0x0B on disk)
uint32_t version; // wrapper version (0)
uint32_t offset; // byte offset to raw bitcode within file
uint32_t size; // size of raw bitcode in bytes
uint32_t cpu_type; // target CPU type (0 for NVPTX)
};
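A minimal sniffing routine matching these two signatures, per upstream LLVM's magic definitions (sniffBitcode and BitcodeKind are illustrative names; the Invalid case corresponds to the *error_code = 9 path in sub_12C06E0):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class BitcodeKind { Raw, Wrapper, Invalid };

// Illustrative sketch of the magic-byte check performed at module load.
BitcodeKind sniffBitcode(const uint8_t *buf, std::size_t len) {
    if (len < 4) return BitcodeKind::Invalid;
    static const uint8_t rawMagic[4]  = {0x42, 0x43, 0xC0, 0xDE}; // 'B','C',0xC0,0xDE
    static const uint8_t wrapMagic[4] = {0xDE, 0xC0, 0x17, 0x0B}; // 0x0B17C0DE, little-endian
    if (std::memcmp(buf, rawMagic, 4) == 0)  return BitcodeKind::Raw;
    if (std::memcmp(buf, wrapMagic, 4) == 0) return BitcodeKind::Wrapper;
    return BitcodeKind::Invalid;                                  // -> "invalid bitcode"
}
```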
After magic validation, the bitstream enters the block-structured reader. LLVM bitcode is organized into nested blocks, each identified by a block ID. The reader uses abbreviation tables (defined in BLOCKINFO blocks) to decode records within each block efficiently using variable-bit-rate (VBR) encoding.
An epoch check runs after magic validation: "Incompatible epoch: Bitcode '<X>' vs current: '<Y>'". This ensures the bitcode was produced by a compatible LLVM generation.
Bitcode Reader
Module Parser (sub_1505110, 60 KB)
The top-level entry reads MODULE_BLOCK records from the bitcode stream. It processes:
- Global variable declarations and definitions
- Function declarations (bodies are deferred for lazy materialization)
- Calling conventions and comdat groups
- Module-level metadata, type tables, and value symbol tables
- Data layout and target triple strings
Error strings: "Invalid calling convention ID", "Invalid function comdat ID", "Invalid global variable comdat ID", "Invalid type for value".
parseFunctionBody (sub_151B070 / sub_9F2A40)
The function body parser is the largest single reader function. The standalone copy sub_9F2A40 is 185 KB (5,706 decompiled lines) with 174 error string references. The primary copy sub_151B070 is 123 KB. Both decode the same FUNCTION_BLOCK records:
- 57 FUNC_CODE instruction record types (switch cases 1--65), covering every LLVM IR opcode: INST_BINOP, INST_CAST, INST_GEP, INST_SELECT, INST_CMP, INST_RET, INST_BR, INST_SWITCH, INST_INVOKE, INST_CALL (opcode 85), INST_UNREACHABLE, INST_PHI, INST_ALLOCA, INST_LOAD, INST_STORE, INST_ATOMICRMW, INST_CMPXCHG, INST_FENCE, INST_EXTRACTVAL, INST_INSERTVAL, INST_LANDINGPAD, INST_RESUME, INST_CLEANUPPAD, INST_CATCHPAD, INST_CATCHSWITCH, INST_CALLBR, INST_FREEZE, and others.
- 4 nested sub-blocks: constants (0xB), metadata (0xE), use-list order (0x10), operand bundles (0x12).
- 53 unique error strings, including: "Alignment value is too large", "Invalid record", "Invalid record: Unsupported version of DISubrange", "METADATA_NAME not followed by METADATA_NAMED_NODE".
For each INST_CALL record (opcode 85), the reader calls into the AutoUpgrade machinery to rename deprecated intrinsics. This is the hook that triggers the 700+ KB x86 upgrader on every call instruction -- even though the upgrader's x86 branches are dead code for NVPTX targets.
Pseudocode for the top-level body parse loop:
Error parseFunctionBody(Function *F) {
SmallVector<uint64_t, 64> Record;
while (true) {
BitstreamEntry Entry = Stream.advance();
switch (Entry.Kind) {
case BitstreamEntry::Error:
return error("Malformed block");
case BitstreamEntry::EndBlock:
return resolveForwardRefs();
case BitstreamEntry::SubBlock:
switch (Entry.ID) {
case CONSTANTS_BLOCK_ID: // 0xB
parseConstants(); break;
case METADATA_BLOCK_ID: // 0xE
parseMetadataAttachment(); break;
case USELIST_BLOCK_ID: // 0x10
parseUseListBlock(); break;
case OPERAND_BUNDLE_TAGS_BLOCK_ID: // 0x12
parseOperandBundleTags(); break;
}
break;
case BitstreamEntry::Record:
unsigned Code = Stream.readRecord(Entry.ID, Record);
switch (Code) {
case FUNC_CODE_INST_BINOP: /* ... */ break;
case FUNC_CODE_INST_CAST: /* ... */ break;
// ... 55 more cases ...
case FUNC_CODE_INST_CALL:
// Parse callee, args, calling convention
// If callee is intrinsic:
// UpgradeIntrinsicFunction(callee, &newCallee);
// if (newCallee) UpgradeIntrinsicCall(CI, newCallee);
break;
}
}
}
}
Lazy Materialization (sub_1503DC0, 13 KB)
Function bodies are not parsed eagerly. The module parser records each function's byte offset in the bitcode stream, and materializeFunctions seeks to that position on demand. Error strings: "Could not find function in stream", "Expect function block", "Expect SubBlock", "Trying to materialize functions before seeing function blocks". The two-phase compilation exploits this by materializing individual functions for per-function Phase II optimization.
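The offset-record-and-seek idea can be sketched with a hypothetical structure (names invented; the real materializer works on a bitstream cursor, not a map):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model of deferred materialization: the module parser
// records a stream offset per function body and parses it on demand.
struct LazyModule {
    std::map<std::string, uint64_t> deferred; // function name -> byte offset
    int materialized = 0;

    bool materialize(const std::string &fn) {
        auto it = deferred.find(fn);
        if (it == deferred.end())
            return false; // cf. "Could not find function in stream"
        // Real reader: seek to it->second, then parseFunctionBody(...)
        deferred.erase(it);
        ++materialized;
        return true;
    }
};
```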
Bitstream Infrastructure
| Function | Address | Size | Role |
|---|---|---|---|
readBlockInfoBlock | 0x150F8E0 | 42 KB | Reads BLOCKINFO block (abbreviation definitions) |
readAbbreviatedField | 0x1510D70 | 38 KB | Expands abbreviated records (fixed, VBR, array, blob) |
readAbbrevRecord | 0x1513230 | 20 KB | Reads one abbreviation-defined record |
readRecord | 0x150E2B0 | 19 KB | Core BitstreamCursor::readRecord |
parseMetadataBlock | 0x1518180 | 29 KB | Parses METADATA_BLOCK for function-level metadata |
parseFunctionMetadata | 0x1520420 | 32 KB | Metadata/value-table builder during function parse |
parseMetadataStrings | 0x1522160 | 13 KB | Reads metadata string table |
parseTypeBlock / constants | 0x15083D0 | 26 KB | TYPE_BLOCK or CONSTANTS_BLOCK parser |
parseValueRecord | 0x1515740 | 9 KB | Value record decoder |
string table reader | 0x15140E0 | 13 KB | Bitcode string table entries |
readBlobRecord | 0x1514C40 | 9 KB | Blob-type record reader |
skipBlock | 0x15127D0 | 13 KB | Block skipping and cursor navigation |
parseModuleSummaryIndex | 0x150B5F0 | 63 KB | ThinLTO summary parser |
materializeFunctions | 0x1503DC0 | 13 KB | Lazy function body materialization |
parseModule | 0x1505110 | 60 KB | Top-level MODULE_BLOCK parser |
ThinLTO GUID lookup | 0x150A160 | 7 KB | GUID-based summary index lookup |
parseGlobalInits | 0x1504A60 | 8 KB | Global variable initializer parser |
Bitcode Writer
writeModule (sub_1538EC0, 58 KB)
The top-level writer serializes an entire Module to a bitcode stream. It orchestrates sub-writers in a fixed order:
1. Enumerate all values via ValueEnumerator (sub_15467B0, 23 KB)
2. Write identification block (with producer string -- see next section)
3. Write MODULE_BLOCK header
4. Write type table (sub_1530240, 12 KB)
5. Write attribute groups (sub_152F610, 8 KB)
6. Write global variables
7. Write function declarations
8. For each defined function: writeFunction (sub_1536CD0, 40 KB)
9. Write metadata (sub_1531F90, 27 KB) + metadata records (sub_15334D0, 8 KB)
10. Write value symbol table (sub_1533CF0, 16 KB)
11. Write named metadata / comdat records (sub_15311A0, 14 KB)
12. If ThinLTO: write module summary (sub_1535340, 26 KB)
writeFunction (sub_1536CD0, 40 KB)
Writes one FUNCTION_BLOCK containing all instructions, each encoded via writeInstruction (sub_1528720, 27 KB). Instructions are encoded as (opcode, operand_ids...) records where operand IDs are relative to the value table. The writer uses abbreviations for compact encoding of common instruction patterns.
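The relative-ID convention is standard LLVM bitcode practice and is assumed to apply unchanged here: an operand is written as the backward distance from the current instruction's value ID, so nearby definitions encode as small integers that VBR packs tightly.

```cpp
#include <cassert>
#include <cstdint>

// Relative operand references: nearby definitions -> small numbers.
uint64_t encodeRelOperand(uint64_t instValueID, uint64_t operandValueID) {
    return instValueID - operandValueID;
}
uint64_t decodeRelOperand(uint64_t instValueID, uint64_t rel) {
    return instValueID - rel;
}
```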
Value Enumeration
Before writing, the ValueEnumerator assigns a dense numeric ID to every value in the module. This is the reverse of what the reader does (mapping IDs back to Values).
| Function | Address | Size | Role |
|---|---|---|---|
enumerateModule | 0x15467B0 | 23 KB | Top-level module enumeration |
enumerateValues | 0x1542B00 | 26 KB | Assigns numeric IDs to all values |
optimizeConstants | 0x1548410 | 8 KB | Reorders constants for better compression |
TypeFinder helper | 0x153E1D0 | 7 KB | Recursive type discovery |
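The dense-numbering idea behind enumerateValues can be sketched as follows (hypothetical structure keyed by name for illustration, not the recovered implementation): each value gets the next sequential ID on first sight, and later uses resolve to the same ID.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical dense value numbering.
struct ValueNumbering {
    std::unordered_map<std::string, unsigned> ids;
    std::vector<std::string> order; // ID -> value: the reader's direction

    unsigned enumerate(const std::string &v) {
        auto it = ids.find(v);
        if (it != ids.end()) return it->second;   // already numbered
        unsigned id = static_cast<unsigned>(order.size());
        ids.emplace(v, id);
        order.push_back(v);
        return id;
    }
};
```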
Writer Function Map
| Function | Address | Size | Role |
|---|---|---|---|
writeModule | 0x1538EC0 | 58 KB | Top-level module serializer |
writeFunction | 0x1536CD0 | 40 KB | Per-function FUNCTION_BLOCK writer |
writeMetadata | 0x1531F90 | 27 KB | METADATA_BLOCK writer |
writeInstruction | 0x1528720 | 27 KB | Single instruction encoder |
writeModuleSummary | 0x1535340 | 26 KB | ThinLTO summary serializer |
writeValueSymbolTable | 0x1533CF0 | 16 KB | VALUE_SYMTAB_BLOCK writer |
writeNamedMetadata | 0x15311A0 | 14 KB | Named metadata / comdat writer |
writeType / globalVar | 0x1530240 | 12 KB | Type descriptors or global variable records |
emitAbbreviation | 0x152AB40 | 11 KB | Abbreviation definition writer |
emitRecord | 0x152A250 | 9 KB | Low-level record emission |
writeConstants helper | 0x1527BB0 | 9 KB | Constant value encoder |
writeMetadataRecords | 0x15334D0 | 8 KB | Dispatcher for 37 metadata node types |
writeAttributeGroup | 0x152F610 | 8 KB | ATTRIBUTE_GROUP_BLOCK writer |
emitVBR | 0x15271D0 | 7 KB | Variable bit-rate integer encoding |
emitCode | 0x15263C0 | 7 KB | Core abbreviated/unabbreviated record emission |
emitBlob | 0x1528330 | -- | Blob data emission |
Producer String Hack
This is the single most important NVIDIA deviation in the bitcode subsystem. Two global constructors cooperate to set the producer identification string:
ctor_036 at 0x48CC90 (544 bytes): Reads LLVM_OVERRIDE_PRODUCER from the environment. If unset, falls back to the string "20.0.0" (the true LLVM version). Stores the result in the global qword_4F837E0. Also registers disable-bitcode-version-upgrade (cl::opt<bool>).
ctor_154 at 0x4CE640 (215 bytes): Also reads LLVM_OVERRIDE_PRODUCER. Falls back to "7.0.1". Stores into a separate global.
When writeModule (sub_1538EC0) writes the IDENTIFICATION_BLOCK, it emits the string "LLVM7.0.1" as the producer. This is assembled from the prefix "LLVM" plus the version string "7.0.1" loaded from the ctor_154 global.
The consequence is that any tool reading CICC's output bitcode (including older libNVVM, nvdisasm, or third-party NVVM IR consumers) sees producer "LLVM7.0.1" and interprets the bitcode as LLVM 7.x-era IR. Internally, the IR is LLVM 20.0.0 -- all modern instruction opcodes, metadata formats, and type encodings are present. The producer string is purely a compatibility marker that tells downstream tools which NVVM IR version spec to apply, not the actual LLVM version.
Why 7.0.1 specifically: NVVM IR 2.0 was defined against LLVM 7.0.1. The NVVM toolchain ecosystem (libNVVM, nvcc's device compilation pipeline) standardized on this version string as the "NVVM IR format identifier." Upgrading the producer string would require coordinated changes across the entire CUDA toolkit and all consumers.
// Pseudocode for producer string initialization
static const char *producer_version;
void ctor_036() { // at 0x48CC90
const char *env = getenv("LLVM_OVERRIDE_PRODUCER");
if (!env) env = "20.0.0"; // true LLVM version
global_4F837E0 = env;
// Also registers: -disable-bitcode-version-upgrade (cl::opt<bool>)
}
void ctor_154() { // at 0x4CE640
const char *env = getenv("LLVM_OVERRIDE_PRODUCER");
if (!env) env = "7.0.1"; // NVVM IR compat marker
producer_version = env;
}
// In writeModule (sub_1538EC0):
void writeIdentificationBlock(BitstreamWriter &Stream) {
Stream.EnterSubblock(IDENTIFICATION_BLOCK_ID);
// Writes: "LLVM" + producer_version → "LLVM7.0.1"
Stream.EmitRecord(IDENTIFICATION_CODE_STRING, std::string("LLVM") + producer_version);
Stream.EmitRecord(IDENTIFICATION_CODE_EPOCH, CurrentEpoch);
Stream.ExitBlock();
}
Reimplementation note: A reimplementation must write "LLVM7.0.1" as the producer for compatibility with the existing NVVM ecosystem. Setting LLVM_OVERRIDE_PRODUCER to a different value will change the embedded string. The disable-bitcode-version-upgrade flag controls whether the reader's AutoUpgrade logic activates for version-mismatched bitcode.
X86 AutoUpgrade -- Why to Skip It
The intrinsic upgrader is the single largest code mass in the entire cicc binary. Two functions dominate:
| Function | Address | Size | Role |
|---|---|---|---|
UpgradeIntrinsicFunction | sub_156E800 | 593 KB | Name-based intrinsic rename lookup (271 string patterns) |
UpgradeIntrinsicCall | sub_A939D0 | 457 KB | Call instruction rewriter |
X86 intrinsic upgrade helper | sub_A8A170 | 195 KB | SSE/AVX/AVX-512 family tables |
UpgradeIntrinsicCall (2nd copy) | sub_15644B0 | 89 KB | Companion call upgrader |
NVVM upgrade dispatcher | sub_A8E250 | 52 KB | nvvm.atomic, nvvm.shfl, nvvm.cp.async, nvvm.tcgen05, nvvm.cluster, nvvm.ldg |
NVVM call rewriting | sub_A91130 | 28 KB | NVVM-specific call rewriter |
NVVM annotation metadata upgrade | sub_A84F90 | 14 KB | maxclusterrank, maxntid, etc. |
UpgradeModuleFlags | 0x156C720 | 10 KB | Module flag upgrader |
UpgradeLoopMetadata | 0x156A1F0 | 7 KB | llvm.loop.interleave.count, llvm.loop.vectorize.* |
Total intrinsic upgrader code: approximately 1.4 MB across all copies and helpers.
The x86 portion (roughly 1.0 MB) handles SSE/SSE2/SSE4.1/SSE4.2/SSSE3, AVX2, AVX-512 (mask operations, conversions, FMA variants), and ARM NEON patterns (^arm\.neon\.vld, ^arm\.neon\.vst). These branches are functionally dead for NVPTX -- no CUDA program will ever contain an @llvm.x86.sse2.padds.b intrinsic. However, the code is NOT unreachable in the CFG sense: the reader calls UpgradeIntrinsicFunction on every intrinsic name, the function does a string-prefix match, and falls through the x86/ARM branches without matching. The x86 code paths simply never activate.
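The fall-through behavior can be modeled with a simple prefix filter (illustrative only; the real dispatcher is a 593 KB string-pattern matcher, and this sketch ignores the generic llvm.* upgrades):

```cpp
#include <cassert>
#include <string>

// Illustrative prefix filter: x86/ARM families never match on an NVPTX
// target, so their upgrade tables are walked but never rewrite anything.
bool reachesLiveUpgradeTable(const std::string &name) {
    auto hasPrefix = [&](const char *p) { return name.rfind(p, 0) == 0; };
    if (hasPrefix("llvm.x86.") || hasPrefix("llvm.arm.neon."))
        return false;                    // dead branches for CUDA input
    return hasPrefix("llvm.nvvm.");      // only NVVM families can upgrade
}
```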
Reimplementation guidance: You can safely exclude the x86 and ARM AutoUpgrade tables (sub_A8A170, the x86 portions of sub_A939D0, and the ARM patterns in sub_15644B0). The NVVM-relevant upgraders must be preserved:
| Preserved | NVVM Intrinsic Families |
|---|---|
sub_A8E250 | nvvm.atomic.*, nvvm.shfl.*, nvvm.cp.async.*, nvvm.tcgen05.*, nvvm.cluster.*, nvvm.ldg.* |
sub_A91130 | NVVM-specific call rewrites |
sub_A84F90 | NVVM annotation metadata (maxclusterrank, maxntid, etc.) |
sub_156A1F0 | Loop vectorization metadata (llvm.loop.interleave.count) |
sub_156C720 | Module flags |
Stripping the x86 upgrader saves approximately 1.0 MB of binary size and significant reverse-engineering effort, with zero functional impact on GPU compilation.
Metadata Reader
MetadataLoader::parseOneMetadata (sub_A09F80, 121 KB)
The metadata reader handles 42 distinct metadata record types in a single switch statement. Each case constructs one metadata node:
- DI metadata nodes: DISubprogram, DIFile, DICompileUnit, DIVariable, DILocation, DIType, DIExpression, DISubrange, DIEnumerator, DIGlobalVariableExpression, DIModule, DINamespace, DITemplateTypeParameter, DITemplateValueParameter, DICompositeType, DIDerivedType, DIBasicType, DILexicalBlock, DILexicalBlockFile, DILabel, DIImportedEntity, DIMacro, DIMacroFile, DICommonBlock, DIGenericSubrange, DIStringType, DIArgList
- LLVM metadata nodes: MDTuple, MDString, named metadata
- NVVM annotations: nvvm.annotations (parsed as named metadata carrying per-kernel attributes)
The function is called from parseMetadataBlock (sub_1518180, 29 KB), which reads the block structure, and parseFunctionMetadata (sub_1520420, 32 KB), which processes function-level metadata attachments.
Value materialization (sub_A10370, 33 KB) handles forward references in metadata. When a metadata node references a value that hasn't been parsed yet, the materializer resolves it once the value becomes available.
Module Summary Serialization
Two pairs of functions handle ThinLTO module summary IO:
Summary Writer (sub_1535340, 26 KB)
Writes the MODULE_STRTAB_BLOCK and GLOBALVAL_SUMMARY_BLOCK into the bitcode stream. For each function/alias/global:
- Encodes the GUID hash (64-bit FNV-1a on the mangled name)
- Writes call graph edges with hotness annotations
- Writes reference edges (global value references)
- For ThinLTO: writes module path strings, type test GUIDs
Error string: "Unexpected anonymous function when writing summary".
The NVIDIA-extended summary fields (import priority, complexity budget, kernel bit, CUDA attributes) are written by the NVModuleSummary builder into the standard summary records via additional flag bits and extended record fields.
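The GUID hash named above is standard 64-bit FNV-1a with the usual constants, pinned down here for a reimplementation (assuming the identification of the hash in this analysis is correct):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Standard 64-bit FNV-1a over the mangled name.
uint64_t fnv1a64(const std::string &s) {
    uint64_t h = 0xcbf29ce484222325ull;      // FNV-1a 64-bit offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ull;               // FNV-1a 64-bit prime
    }
    return h;
}
```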
Summary Reader (sub_150B5F0, 63 KB)
Reads the summary index from bitcode. Handles GUID hashes, function/alias summaries, module paths. Error strings: "Alias expects aliasee summary", "Invalid hash length", "Invalid Summary Block: version expected", "Malformed block".
Summary Writer (standalone copy) (sub_A2D2B0, 48 KB)
A second copy of the summary/metadata writer exists at 0xA2D2B0 in the standalone pipeline's address range.
NVVM IR Version Validation
CICC gates bitcode acceptance on two version checks:
Module-Level Version Gate (sub_157E370, 7 KB)
After parsing the module, this function reads the "nvvmir.version" named metadata node. The metadata contains a pair of integers (major, minor). The check enforces:
major == 3 AND minor <= 2
If the check fails, the function calls sub_16BD130 which emits "Broken module found, compilation aborted!" and terminates compilation. If the module passes the version check, it proceeds to sub_166CBC0 (verifyModule [MEDIUM confidence] -- identification based on call position after bitcode parsing and before optimization, consistent with LLVM's standard verify-after-parse pattern, but no diagnostic string directly confirms the function name) for structural IR verification, then sub_15ACB40 for post-verification processing.
A second instance at sub_12BFF60 (9 KB) in the standalone pipeline performs the same check with additional llvm.dbg.cu debug info presence validation.
Environment Override (NVVM_IR_VER_CHK)
The NVVM_IR_VER_CHK environment variable controls whether version validation runs at all:
| Value | Effect |
|---|---|
| Unset or non-"0" | Version check enabled (default) |
| "0" | Version check bypassed, no version mismatch errors |
The check is: if (!env || strtol(env, NULL, 10) != 0) then enforce the version. Any value that strtol parses as nonzero keeps the check enabled; the literal string "0" (or any value strtol parses as zero, such as non-numeric text) disables it.
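Putting the two gates together, the decision logic reduces to a few lines of C (a sketch; the function names are illustrative, only the conditions come from the decompilation):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Returns true when the nvvmir.version gate should run, per the
 * decompiled condition: unset, or any value strtol parses as nonzero,
 * keeps the check enabled. */
bool version_check_enabled(const char *env /* getenv("NVVM_IR_VER_CHK") */) {
    return env == NULL || strtol(env, NULL, 10) != 0;
}

/* The gate itself: accept only NVVM IR major 3 with minor <= 2. */
bool version_ok(long major, long minor) {
    return major == 3 && minor <= 2;
}
```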
Two verifier instances exist:
- sub_12BFF60 at 0x12BFF60 (standalone pipeline)
- sub_2259720 at 0x2259720 (second instance, possibly duplicate link unit)
Configuration
Environment Variables
| Variable | Effect | Default |
|---|---|---|
| LLVM_OVERRIDE_PRODUCER | Overrides bitcode producer identification string | "7.0.1" (ctor_154) / "20.0.0" (ctor_036) |
| NVVM_IR_VER_CHK | Set to "0" to bypass NVVM IR version validation | Enabled |
cl::opt Flags
| Flag | Type | Default | Effect |
|---|---|---|---|
| disable-bitcode-version-upgrade | bool | false | Disable automatic bitcode upgrade for version mismatch |
| bitcode-mdindex-threshold | int | 25 | Number of metadata entries above which an index is emitted |
| disable-ondemand-mds-loading | bool | false | Disable lazy metadata loading |
| write-relbf-to-summary | bool | false | Write relative block frequency to ThinLTO function summary |
| print-summary-global-ids | bool | false | Print global IDs when reading module summary |
| import-full-type-definitions | bool | false | Import full type definitions in ThinLTO |
Differences from Upstream LLVM
| Aspect | Upstream LLVM 20.0.0 | CICC v13.0 |
|---|---|---|
| Producer string | "LLVM20.0.0" | "LLVM7.0.1" (hardcoded via ctor_154) |
| Producer override | LLVM_OVERRIDE_PRODUCER env var | Same mechanism, different default |
| Version upgrade disable | disable-bitcode-version-upgrade exists | Same, registered in ctor_036 |
| NVVM IR version gate | Does not exist | nvvmir.version metadata check (major==3, minor<=2) |
| NVVM IR version bypass | Does not exist | NVVM_IR_VER_CHK=0 environment variable |
| X86 AutoUpgrade | Active for x86 targets | Present but dead code (NVPTX only) |
| NVVM intrinsic upgrade | Does not exist | nvvm.atomic, nvvm.shfl, nvvm.cp.async, etc. upgraders added |
| NVVM annotation upgrade | Does not exist | maxclusterrank, maxntid metadata upgrader added |
| Module summary | Standard ModuleSummaryAnalysis | Extended with NVModuleSummary (import priority, kernel bit, complexity budget) |
| Binary copies | Single instance | Two copies (0x9F range, 0x150 range) at different link addresses |
Function Map
Reader (primary, 0x1500000--0x1522000)
| Address | Size | Function |
|---|---|---|
| 0x1503DC0 | 13 KB | materializeFunctions |
| 0x1504A60 | 8 KB | parseGlobalInits |
| 0x1505110 | 60 KB | parseModule |
| 0x15083D0 | 26 KB | parseTypeBlock / Constants |
| 0x150A160 | 7 KB | ThinLTO GUID lookup |
| 0x150B5F0 | 63 KB | parseModuleSummaryIndex |
| 0x150E2B0 | 19 KB | readRecord |
| 0x150F8E0 | 42 KB | readBlockInfoBlock |
| 0x1510D70 | 38 KB | readAbbreviatedField |
| 0x1513230 | 20 KB | readAbbrevRecord |
| 0x15127D0 | 13 KB | skipBlock |
| 0x15140E0 | 13 KB | string table reader |
| 0x1514C40 | 9 KB | readBlobRecord |
| 0x1515740 | 9 KB | parseValueRecord |
| 0x15177F0 | 7 KB | bitcode record helper |
| 0x1518180 | 29 KB | parseMetadataBlock |
| 0x1519820 | 7 KB | bitcode record helper |
| 0x1519BD0 | 7 KB | bitcode record helper |
| 0x151B070 | 123 KB | parseFunctionBody |
| 0x1520420 | 32 KB | parseFunctionMetadata |
| 0x1522160 | 13 KB | parseMetadataStrings |
Reader (standalone copy, 0x9F0000--0xA20000)
| Address | Size | Function |
|---|---|---|
| 0x9F2A40 | 185 KB | parseFunctionBody |
| 0xA09F80 | 121 KB | MetadataLoader::parseOneMetadata |
| 0xA10370 | 33 KB | value materialization |
| 0x9FF220 | 31 KB | writer helper |
| 0xA2D2B0 | 48 KB | module summary / metadata writer |
Writer (0x1525000--0x1549000)
| Address | Size | Function |
|---|---|---|
| 0x15263C0 | 7 KB | emitCode |
| 0x15271D0 | 7 KB | emitVBR |
| 0x1527BB0 | 9 KB | writeConstants helper |
| 0x1528720 | 27 KB | writeInstruction |
| 0x152A250 | 9 KB | emitRecord |
| 0x152AB40 | 11 KB | emitAbbreviation |
| 0x152F610 | 8 KB | writeAttributeGroup |
| 0x1530240 | 12 KB | writeType / GlobalVar |
| 0x15311A0 | 14 KB | writeNamedMetadata / comdat |
| 0x1531F90 | 27 KB | writeMetadata |
| 0x15334D0 | 8 KB | writeMetadataRecords (37 callees) |
| 0x1533CF0 | 16 KB | writeValueSymbolTable |
| 0x1535340 | 26 KB | writeModuleSummary (ThinLTO) |
| 0x1536CD0 | 40 KB | writeFunction |
| 0x1538EC0 | 58 KB | writeModule |
Intrinsic Upgrader (0xA80000--0xABFFFF + 0x1560000--0x1580000)
| Address | Size | Function |
|---|---|---|
| 0x156E800 | 593 KB | UpgradeIntrinsicFunction |
| 0xA939D0 | 457 KB | UpgradeIntrinsicCall |
| 0xA8A170 | 195 KB | X86 intrinsic upgrade helper |
| 0x15644B0 | 89 KB | UpgradeIntrinsicCall (2nd copy) |
| 0xA8E250 | 52 KB | NVVM upgrade dispatcher |
| 0xA91130 | 28 KB | NVVM call rewriting |
| 0xA84F90 | 14 KB | NVVM annotation metadata upgrade |
| 0xA7CD60 | 10 KB | UpgradeIntrinsicFunction (short, matches "nvvm.", "ftz.") |
| 0x156C720 | 10 KB | UpgradeModuleFlags |
| 0x156A1F0 | 7 KB | UpgradeLoopMetadata |
NVVM Version / Producer
| Address | Size | Function |
|---|---|---|
| 0x157E370 | 7 KB | NVVM version checker (primary) |
| 0x12BFF60 | 9 KB | NVVM version checker (standalone) |
| 0x2259720 | -- | NVVM version checker (duplicate instance) |
| 0x48CC90 | 544 B | ctor_036 -- producer init + disable-bitcode-version-upgrade |
| 0x4CE640 | 215 B | ctor_154 -- producer init ("7.0.1" default) |
Value Enumeration (0x1540000--0x1549000)
| Address | Size | Function |
|---|---|---|
| 0x1542B00 | 26 KB | enumerateValues |
| 0x15467B0 | 23 KB | enumerateModule |
| 0x1548410 | 8 KB | optimizeConstants |
| 0x15445A0 | 11 KB | metadata enumeration helper |
| 0x15450E0 | 9 KB | ValueEnumerator helper |
| 0x1547D80 | 9 KB | ValueEnumerator helper |
| 0x1543FA0 | 7 KB | ValueEnumerator helper |
| 0x1542750 | 7 KB | ValueEnumerator helper |
| 0x153E1D0 | 7 KB | TypeFinder helper |
Cross-References
- NVVM Container -- wraps bitcode in the proprietary transport format
- LTO & Module Optimization -- consumes bitcode from separate compilation objects
- NVModuleSummary Builder -- extends module summary with CUDA-specific fields; serialized by sub_1535340
- Two-Phase Compilation -- serializes/deserializes per-function bitcode between phases
- Pipeline Entry -- magic byte validation on bitcode input
- Environment Variables -- LLVM_OVERRIDE_PRODUCER, NVVM_IR_VER_CHK
- Binary Layout -- address range context for reader/writer clusters
Concurrent Compilation
CICC implements a two-phase concurrent compilation model that is entirely absent from upstream LLVM. The optimizer runs twice over the same module: Phase I performs whole-module analysis and early IR optimizations on a single thread, then Phase II runs per-function backend optimization in parallel across a thread pool. The design exploits the fact that most backend passes (instruction selection prep, register pressure reduction, peephole) are function-local and do not require cross-function information once Phase I has completed interprocedural analysis.
The two-phase protocol lives in sub_12E7E70 (9,405 bytes), which calls the same master pipeline function sub_12E54A0 twice, discriminated only by a TLS phase counter. The concurrency infrastructure spans the 0x12D4000--0x12EA000 address range and includes a GNU Make jobserver integration for build-system-aware parallelism throttling -- a feature that allows make -j8 to correctly limit total system load even when each cicc invocation itself wants to spawn threads.
| Phase I/II orchestrator | sub_12E7E70 (9,405 bytes) |
| Phase counter (TLS) | qword_4FBB3B0 -- values 1, 2, 3 |
| Concurrency eligibility | sub_12D4250 (626 bytes) |
| Function sorting | sub_12E0CA0 (23,422 bytes) |
| Concurrent entry | sub_12E1EF0 (51,325 bytes) |
| Worker entry | sub_12E7B90 (2,997 bytes) |
| Per-function callback | sub_12E8D50 |
| Per-function optimizer | sub_12E86C0 (7,687 bytes) |
| GNU jobserver init | sub_16832F0 |
| MAKEFLAGS parser | sub_1682BF0 |
| Thread pool create | sub_16D4AB0 |
| Thread pool enqueue | sub_16D5230 |
| Thread pool join | sub_16D4EC0 |
| Disable env var | LIBNVVM_DISABLE_CONCURRENT_API -- byte_4F92D70 |
| Pipeline function | sub_12E54A0 (49,800 bytes) -- called by both phases |
Two-Phase Architecture
Both phases call the same optimization pipeline function sub_12E54A0(context, input, output, opts, errCb). The only difference is the value stored in the TLS variable qword_4FBB3B0 before each call. Individual optimization passes read this TLS variable to decide whether to run: Phase I passes fire when the counter equals 1; Phase II passes fire when it equals 2. This avoids running codegen-oriented passes during analysis and vice versa.
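A Phase II-only pass gated on this counter would look roughly as follows (illustrative sketch; set_phase/current_phase stand in for the TLS accessors sub_16D40E0/sub_16D40F0, and _Thread_local stands in for the binary's pointer-based TLS slot):

```c
enum { PHASE_ANALYSIS = 1, PHASE_BACKEND = 2, PHASE_DONE = 3 };

/* Stand-in for the TLS slot qword_4FBB3B0: each thread sees its own
 * copy of the phase counter. */
static _Thread_local int g_phase = 0;

void set_phase(int p)    { g_phase = p; }   /* cf. sub_16D40E0 */
int  current_phase(void) { return g_phase; } /* cf. sub_16D40F0 */

/* A backend-only pass early-returns unless the counter reads 2,
 * matching the gating described above. Returns 1 if it ran. */
int run_backend_pass(void *function_ir) {
    (void)function_ir;
    if (current_phase() != PHASE_BACKEND)
        return 0;   /* wrong phase: skip without transforming */
    /* ... backend-only transformation would go here ... */
    return 1;
}
```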
Phase Counter Protocol
The phase counter qword_4FBB3B0 is a TLS variable accessed via sub_16D40E0 (set) and sub_16D40F0 (get). It stores a pointer to a heap-allocated 4-byte integer. Three values are defined:
| Value | Meaning | Set point |
|---|---|---|
| 1 | Phase I active -- analysis + early IR optimization | Before first sub_12E54A0 call |
| 2 | Phase II active -- backend optimization + codegen prep | Before second sub_12E54A0 call |
| 3 | Compilation complete for this module | After second sub_12E54A0 returns |
Sequential Path (sub_12E7E70)
When verbose logging is disabled and the module contains only one defined function, the orchestrator takes a fast path:
// Single-function fast path: no phase counter set at all
if (!verbose && num_defined_functions <= 1) {
sub_12E54A0(ctx, input, output, opts, errCb); // single un-phased call
return;
}
This means the optimizer runs both phases in a single invocation -- passes see no phase counter and run unconditionally. For multi-function modules or when verbose logging is active, the full two-phase protocol engages:
// Phase I
int *phase = malloc(4);
*phase = 1;
tls_set(qword_4FBB3B0, phase);
sub_12E54A0(ctx, input, output, opts, errCb);
if (error_reported(errCb))
return; // abort on Phase I error
// Concurrency decision
bool concurrent = sub_12D4250(ctx, opts);
// Diagnostic: "Concurrent=Yes" or "Concurrent=No"
// Phase II
*phase = 2;
tls_set(qword_4FBB3B0, phase);
sub_12E54A0(ctx, input, output, opts, errCb);
// Done
*phase = 3;
tls_set(qword_4FBB3B0, phase);
The diagnostic string construction between phases is notable: v46 = 3LL - (v41 == 0) computes the label length -- 3 for "Yes" when v41 is nonzero, 2 for "No" when it is zero -- then the orchestrator logs "Phase II" with the "Concurrent=Yes/No" annotation appended.
Concurrent Path (sub_12E7B90)
When the thread count exceeds 1, the orchestrator dispatches to sub_12E7B90 instead of running Phase II sequentially:
sub_12E7B90(ctx, module_ptr, thread_count, opts, ...)
|
|-- Phase I: *phase=1, sub_12E54A0(...) // whole-module, single thread
|-- sub_12D4250(ctx, opts) // eligibility check
|
+-- if eligible (>1 defined function):
| sub_12E1EF0(...) // concurrent Phase II
| *phase = 3
|
+-- else (single defined function):
*phase = 2, sub_12E54A0(...) // sequential Phase II
*phase = 3
Phase I always runs single-threaded on the whole module because interprocedural analyses (alias analysis, call graph construction, inlining decisions) require a consistent global view. Only after Phase I completes does the system split the module into per-function chunks for parallel Phase II processing.
Eligibility Check
sub_12D4250 (626 bytes) determines whether the module qualifies for concurrent compilation. The check is straightforward:
int sub_12D4250(Module *mod, Options *opts) {
int defined_count = 0;
for (Function &F : mod->functions()) {
if (!sub_15E4F60(&F)) // !isDeclaration()
defined_count++;
}
if (defined_count <= 1)
return 0; // not eligible: only 0 or 1 defined function
byte force = *(byte*)(opts + 4064); // NVVMPassOptions slot 201 (0xC9)
if (force != 0)
return force; // user-forced concurrency setting
return sub_12D3FC0(mod, opts); // auto-determine thread count
}
The key gate is defined_count > 1. A module with a single kernel and no device functions will always compile sequentially regardless of thread count settings. The opts + 4064 byte (NVVMPassOptions slot 201, type BOOL_COMPACT, default 0) allows the user to force concurrent mode on or off. When zero (default), sub_12D3FC0 auto-determines the thread count based on module characteristics.
Function Priority Sorting
Before distributing functions to worker threads, sub_12E0CA0 (23,422 bytes) sorts them by compilation priority. This step is critical for load balancing: larger or more complex functions should start compiling first so they don't become tail stragglers.
Sorting Algorithm
The sort uses a hybrid strategy consistent with libstdc++ std::sort:
| Input size | Algorithm | Function |
|---|---|---|
| Small N | Insertion sort | sub_12D48A0 |
| Large N | Introsort (quicksort + heapsort fallback) | sub_12D57D0 |
The threshold between insertion sort and introsort is 256 bytes of element data (consistent with the libstdc++ template instantiation pattern observed elsewhere in the binary).
Priority Source
Priority values come from function attributes extracted by sub_12D3D20 (585 bytes). The sorted output is a vector of (name_ptr, name_len, priority) tuples with 32-byte stride, used directly by the per-function dispatch loop to determine compilation order. Functions with higher priority (likely larger or more critical kernels) are submitted to the thread pool first.
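The 32-byte tuple and the descending-priority dispatch order can be modeled directly (a sketch using qsort in place of the binary's libstdc++ introsort; field names are inferred from the observed stride, not recovered symbols):

```c
#include <stdlib.h>

/* Hypothetical layout of one work-list entry, padded to the observed
 * 32-byte stride on LP64: (name_ptr, name_len, priority). */
typedef struct {
    const char *name_ptr;
    size_t      name_len;
    long        priority;
    long        _pad;        /* pad to 32 bytes */
} WorkItem;

/* Higher priority first, so expensive functions are dispatched to the
 * thread pool before cheap ones (tail-latency reduction). */
static int by_priority_desc(const void *a, const void *b) {
    long pa = ((const WorkItem *)a)->priority;
    long pb = ((const WorkItem *)b)->priority;
    return (pa < pb) - (pa > pb);   /* negative when a sorts first */
}

void sort_work_items(WorkItem *items, size_t n) {
    qsort(items, n, sizeof(WorkItem), by_priority_desc);
}
```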
Enumeration Phase
Before sorting, sub_12E0CA0 enumerates all functions and globals via an iterator callback table:
| Callback | Address | Purpose |
|---|---|---|
| Next function | sub_12D3C60 | Advance to next function in module |
| Iterator advance | sub_12D3C80 | Step iterator forward |
| End check | sub_12D3CA0 | Test if iterator reached end |
For each function, the enumeration:
- Checks the node type discriminator at *(byte*)(node + 16) -- type 0 = Function, type 1 = GlobalVariable
- For functions: calls sub_15E4F60 (isDeclaration check), sub_12D3D20 (priority), and sub_1649960 (name), then inserts into the v359 hash table (name to function) and the v362 hash table (name to linkage type)
- For global variables: walks the parent/linked GlobalValue chain via sub_164A820 and inserts callee references into the v365 hash table for split-module tracking
GNU Jobserver Integration
When cicc is invoked by GNU Make with -j, it can participate in the make jobserver protocol to avoid oversubscribing the system. The jobserver flag is passed from nvcc via the -jobserver CLI flag, which sets opts + 3288 (NVVMPassOptions slot 163, type BOOL_COMPACT, default 0).
Initialization (sub_16832F0)
The jobserver init function allocates a 296-byte state structure and calls sub_1682BF0 to parse the MAKEFLAGS environment variable:
int sub_16832F0(JobserverState *state, int reserved) {
memset(state, 0, 296);
state->flags[8] = 1; // initialized marker
int err = sub_1682BF0(state); // parse MAKEFLAGS
if (err) return err;
pipe(state->local_pipe); // local token pipe
// state+196 = read FD, state+200 = write FD
pthread_create(&state->thread, // state+208
NULL, token_manager, state);
reserve_vector(state, token_count);
return 0; // success
}
MAKEFLAGS Parsing (sub_1682BF0)
The parser searches the MAKEFLAGS environment variable for --jobserver-auth= and supports two formats:
| Format | Example | Mechanism |
|---|---|---|
| Pipe FDs | --jobserver-auth=3,4 | Read FD = 3, Write FD = 4 (classic POSIX pipe) |
| FIFO path | --jobserver-auth=fifo:/tmp/gmake-jobserver-12345 | Named FIFO (GNU Make 4.4+) |
The pipe format uses comma-separated read/write file descriptors inherited from the parent make process. The FIFO format uses a named pipe in the filesystem. In both cases, the jobserver protocol works the same way: a thread reads tokens from the pipe/FIFO before starting each per-function compilation, and writes tokens back when the function completes. This ensures cicc never runs more concurrent compilations than make's -j level permits.
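A minimal parser for the two --jobserver-auth= formats (a sketch of the formats described above; the real sub_1682BF0 also fills the 296-byte state structure and reports distinct error codes, which are omitted here):

```c
#include <stdio.h>
#include <string.h>

typedef struct {
    int  read_fd, write_fd;   /* pipe format */
    char fifo_path[256];      /* FIFO format (empty if unused) */
} JobserverAuth;

/* Parse "--jobserver-auth=R,W" or "--jobserver-auth=fifo:PATH" out of
 * a MAKEFLAGS string. Returns 1 on success, 0 if absent/unrecognized. */
int parse_jobserver_auth(const char *makeflags, JobserverAuth *out) {
    const char *p = strstr(makeflags, "--jobserver-auth=");
    if (!p) return 0;
    p += strlen("--jobserver-auth=");
    out->read_fd = out->write_fd = -1;
    out->fifo_path[0] = '\0';
    if (strncmp(p, "fifo:", 5) == 0) {
        /* GNU Make 4.4+: named FIFO in the filesystem */
        sscanf(p + 5, "%255[^ ]", out->fifo_path);
        return out->fifo_path[0] != '\0';
    }
    /* Classic format: comma-separated inherited pipe FDs */
    return sscanf(p, "%d,%d", &out->read_fd, &out->write_fd) == 2;
}
```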
Error Handling
if (jobserver_init_error) {
if (error_code == 5 || error_code == 6) {
// Warning: jobserver pipe not accessible (probably not in make context)
emit_warning(severity=1);
// Fall through: continue without jobserver
} else {
// Fatal: "GNU Jobserver support requested, but an error occurred"
sub_16BD130("GNU Jobserver support requested, but an error occurred", 1);
}
}
Error codes 5 and 6 are non-fatal (the jobserver pipe may not be available if cicc is invoked outside a make context). All other errors are fatal.
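The token discipline itself is simple: read one byte from the pipe/FIFO before starting a work item and write it back on completion (a sketch using POSIX read/write; the implicit token that make grants the process itself is not modeled):

```c
#include <unistd.h>

/* Acquire one jobserver token: blocks until a byte is available on the
 * token pipe, returning the token character (or -1 on error/EOF). */
int jobserver_acquire(int read_fd) {
    char tok;
    ssize_t n = read(read_fd, &tok, 1);
    return n == 1 ? (unsigned char)tok : -1;
}

/* Release a token by writing it back so other make children can run. */
int jobserver_release(int write_fd, int tok) {
    char c = (char)tok;
    return write(write_fd, &c, 1) == 1 ? 0 : -1;
}
```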
Thread Pool Management
Creation (sub_16D4AB0)
The thread pool is LLVM's standard ThreadPool (the binary contains "llvm-worker-{0}" thread naming at sub_23CE0C0). Creation occurs at line 799 of sub_12E1EF0:
int actual_threads = min(requested_threads, num_functions);
sub_16D4AB0(thread_pool, actual_threads);
The thread count is clamped to the number of functions -- there is no point spawning more threads than there are work items.
Thread Count Resolution
Thread count is resolved through a fallback chain in sub_12E7E70:
int thread_count = opts[1026]; // NVVMPassOptions slot 203 (offset 4104), default -1
if (thread_count < 0)
thread_count = opts[1036]; // NVVMPassOptions slot 205 (offset 4144), default -1
if (thread_count == 0)
thread_count = sub_22420F0(); // get_nprocs() -- number of online CPUs
| Source | Slot | Offset | Default | Meaning |
|---|---|---|---|---|
| Primary | 203 (0xCB) | 4104 | -1 (auto) | Explicit thread count |
| Fallback | 205 (0xCD) | 4144 | -1 (auto) | Secondary thread count |
| System | -- | -- | -- | get_nprocs() return value |
The -1 sentinel means "auto-detect." When both slots are -1, the system falls back to the number of online CPUs. In practice, this means a 16-core machine will spawn 16 threads for concurrent Phase II compilation unless the user or calling tool overrides the count.
Per-Function Context Packaging
Before submitting each function to the thread pool, sub_12E1EF0 builds a self-contained compilation context (4,632 bytes) containing everything the worker needs:
- Bitcode extraction: sub_1AB9F40(&buffer, *module_bc, &iterators, sub_12D4BD0, &filter_state) -- extracts the bitcode for a single function, using sub_12D4BD0 as a filter callback that checks the function name against the work hash table
- Bitcode parsing: sub_153BF40(module_data, &context, 0, 0, 0, 0) -- parses the extracted bitcode into an LLVM Module
- Context copy: copies all iterator state (v391--v395), context/options/error callbacks (v405--v409), the module name string, the full 4,480-byte NVVMPassOptions struct, and the module bitcode vectors
Each function gets its own independent copy of the options struct and module -- there is no shared mutable state between worker threads during Phase II.
Worker Entry and Completion
Work Submission
Each function is submitted to the thread pool with two callbacks:
v373 = sub_12E8D50; // completion callback (runs the optimizer)
v372 = sub_12D4D90; // destructor callback (cleanup)
sub_16D5230(work_item, thread_pool, context); // enqueue
Per-Function Callback (sub_12E8D50)
The completion callback extracts the module from the packaged context and calls the Phase II per-function optimizer:
void sub_12E8D50(Context *ctx) {
Module *mod = extract_module(ctx);
sub_12E86C0(ctx, function_index, opts, module_name);
}
Per-Function Phase II Optimizer (sub_12E86C0, 7,687 bytes)
This function sets the TLS phase counter to 2 and runs the pass pipeline on the individual function's module:
void sub_12E86C0(Context *ctx, int func_idx, Options *opts, StringRef name) {
int *phase = malloc(4);
*phase = 2;
tls_set(qword_4FBB3B0, phase);
// Run Phase II pass pipeline on this function's module
sub_12E54A0(ctx, ...);
}
Because qword_4FBB3B0 is TLS, each worker thread has its own phase counter. All worker threads see phase=2 concurrently without interference.
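The isolation property can be demonstrated with a small pthread program (illustrative; _Thread_local stands in for the binary's TLS slot, and the worker body stands in for sub_12E86C0):

```c
#include <pthread.h>

/* Each worker has its own copy of the phase counter, so setting it to
 * 2 in one thread never disturbs another -- the property that lets all
 * Phase II workers run concurrently. */
static _Thread_local int t_phase = 0;

static void *worker(void *arg) {
    t_phase = 2;                 /* Phase II in this thread only */
    *(int *)arg = t_phase;       /* report what this thread observed */
    return NULL;
}

/* Returns 1 when the main thread's counter is unaffected by workers. */
int tls_isolation_demo(void) {
    t_phase = 1;                 /* main thread: Phase I */
    int seen[2] = {0, 0};
    pthread_t th[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&th[i], NULL, worker, &seen[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(th[i], NULL);
    return t_phase == 1 && seen[0] == 2 && seen[1] == 2;
}
```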
Post-Compilation Merge
After all worker threads complete (sub_16D4EC0 joins the thread pool):
- Jobserver cleanup: sub_1682740 checks for jobserver errors and releases tokens
- Error check: if any per-function callback reported an error, the compilation fails
- Normal mode (opt_level >= 0): appends a null byte to the output buffer (bitcode stream terminator)
- Split-compile mode (opt_level < 0): re-reads each function's bitcode via sub_153BF40, links all per-function modules via sub_12F5610 (the LLVM module linker), and restores linkage attributes from the v362 hash table. Specifically:
  - Linkage values 7--8: set only the low 6 bits (external linkage types)
  - Other values: set the low 4 bits, then check (value & 0x30) != 0 for visibility bits
  - Sets byte+33 |= 0x40 (dso_local flag)
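The bit manipulation in the linkage-restoration step can be written out explicitly (a speculative sketch of the decompiled masks; the field semantics are inferred, not confirmed):

```c
#include <stdint.h>
#include <stdbool.h>

/* Restore a saved linkage value into a GlobalValue's packed byte,
 * following the masks described above: linkage codes 7-8 keep the full
 * low 6 bits; other codes keep only the low 4, with bits 0x30 treated
 * as visibility flags. */
uint8_t restore_linkage(uint8_t field, uint8_t saved, bool *has_visibility) {
    if (saved == 7 || saved == 8) {
        field = (field & ~0x3F) | (saved & 0x3F);   /* low 6 bits */
        *has_visibility = false;
    } else {
        field = (field & ~0x0F) | (saved & 0x0F);   /* low 4 bits */
        *has_visibility = (saved & 0x30) != 0;      /* visibility bits */
    }
    return field;
}

/* dso_local flag lives at bit 0x40 of the byte at offset +33. */
uint8_t set_dso_local(uint8_t byte33) { return byte33 | 0x40; }
```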
Configuration
Environment Variables
| Variable | Check | Effect |
|---|---|---|
| LIBNVVM_DISABLE_CONCURRENT_API | getenv() != NULL | Sets byte_4F92D70 = 1. Disables concurrent/thread-safe LibNVVM API usage entirely. Any non-NULL value triggers it. Checked in global constructor ctor_104 at 0x4A5810. |
| MAKEFLAGS | Parsed by sub_1682BF0 | Searched for --jobserver-auth= to enable GNU Make jobserver integration |
NVVMPassOptions Slots
| Slot | Offset | Type | Default | Purpose |
|---|---|---|---|---|
| 163 (0xA3) | 3288 | BOOL_COMPACT | 0 | Jobserver integration requested (set by -jobserver flag) |
| 201 (0xC9) | 4064 | BOOL_COMPACT | 0 | Force concurrency on/off (0 = auto) |
| 203 (0xCB) | 4104 | INTEGER | -1 | Primary thread count (-1 = auto) |
| 205 (0xCD) | 4144 | INTEGER | -1 | Fallback thread count (-1 = auto) |
CLI Flags
| Flag | Route | Effect |
|---|---|---|
| -jobserver | opt "-jobserver" | Enables GNU jobserver integration (sets slot 163) |
| -split-compile=<N> | opt "-split-compile=<N>" | Enables split-module compilation (opt_level set to -1) |
| -split-compile-extended=<N> | opt "-split-compile-extended=<N>" | Extended split-compile (also sets +1644 = 1) |
| --sw2837879 | Internal | Concurrent ptxStaticLib workaround flag |
Phase State Machine
START
|
v
[phase=1] --> sub_12E54A0 (Phase I: whole-module analysis)
|
v
error? --yes--> RETURN (abort)
|no
v
count_defined_functions()
|
+--(1 func)--> [phase=2] --> sub_12E54A0 (Phase II sequential)
| |
| v
| [phase=3] --> DONE
|
+--(N funcs, threads>1)--> sub_12E1EF0 (concurrent)
| |
| +-- sort functions by priority
| +-- create thread pool
| +-- init jobserver (if requested)
| +-- for each function:
| | extract per-function bitcode
| | parse into independent Module
| | [phase=2] per-function (TLS)
| | submit to thread pool
| +-- join all threads
| +-- link split modules (if split-compile)
| +-- [phase=3] --> DONE
|
+--(N funcs, threads<=1)--> [phase=2] --> sub_12E54A0 (sequential)
|
v
[phase=3] --> DONE
Differences from Upstream LLVM
Upstream LLVM has no two-phase compilation model. The standard LLVM pipeline runs all passes in a single invocation with no phase discrimination. CICC's approach is entirely custom:
- Phase counter TLS variable: Upstream LLVM passes have no concept of reading a global phase counter to decide whether to run. Every pass in CICC must check qword_4FBB3B0 and early-return if it belongs to the wrong phase.
- Per-function module splitting: Upstream LLVM's splitModule() (in llvm/Transforms/Utils/SplitModule.h) exists for ThinLTO and GPU offloading, but CICC's splitting at sub_1AB9F40 with the sub_12D4BD0 filter callback is a custom implementation integrated with the NVVMPassOptions system.
- GNU jobserver integration: No upstream LLVM tool participates in the GNU Make jobserver protocol. This is entirely NVIDIA-specific, implemented to play nicely with make -j in CUDA build systems.
- Function priority sorting: Upstream LLVM processes functions in module iteration order. CICC's priority-based sorting via sub_12E0CA0 ensures that expensive functions start compiling first, reducing tail latency in the thread pool.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Function iterator: next | sub_12D3C60 | ~200 | -- |
| Function iterator: advance | sub_12D3C80 | ~230 | -- |
| Function iterator: end check | sub_12D3CA0 | ~260 | -- |
| Function attribute/priority query | sub_12D3D20 | 585 | -- |
| Auto thread count determination | sub_12D3FC0 | 3,600 | -- |
| Concurrency eligibility check | sub_12D4250 | 626 | -- |
| Insertion sort (small N) | sub_12D48A0 | -- | -- |
| Per-function bitcode filter callback | sub_12D4BD0 | 2,384 | -- |
| Work item destructor callback | sub_12D4D90 | 2,742 | -- |
| Introsort (large N) | sub_12D57D0 | -- | -- |
| Function sorting and enumeration | sub_12E0CA0 | 23,422 | -- |
| Concurrent compilation top-level entry | sub_12E1EF0 | 51,325 | -- |
| Master pipeline assembly (both phases) | sub_12E54A0 | 49,800 | -- |
| Concurrent worker entry | sub_12E7B90 | 2,997 | -- |
| Phase I/II orchestrator | sub_12E7E70 | 9,405 | -- |
| Per-function Phase II optimizer | sub_12E86C0 | 7,687 | -- |
| Per-function completion callback | sub_12E8D50 | -- | -- |
| LLVM module linker (post-merge) | sub_12F5610 | 7,339 | -- |
| Bitcode reader/verifier | sub_153BF40 | -- | -- |
| isDeclaration() check | sub_15E4F60 | -- | -- |
| Get function name | sub_1649960 | -- | -- |
| Walk to parent GlobalValue | sub_164A820 | -- | -- |
| Jobserver error check/cleanup | sub_1682740 | -- | -- |
| MAKEFLAGS --jobserver-auth= parser | sub_1682BF0 | -- | -- |
| GNU jobserver init (296-byte state) | sub_16832F0 | -- | -- |
| TLS set (qword_4FBB3B0) | sub_16D40E0 | -- | -- |
| TLS get (qword_4FBB3B0) | sub_16D40F0 | -- | -- |
| Thread pool create | sub_16D4AB0 | -- | -- |
| Thread pool join | sub_16D4EC0 | -- | -- |
| Thread pool enqueue work item | sub_16D5230 | -- | -- |
| Per-function bitcode extraction | sub_1AB9F40 | -- | -- |
| get_nprocs() wrapper | sub_22420F0 | -- | -- |
Cross-References
- Entry Point & CLI -- pipeline dispatch that leads to the optimizer, including -jobserver flag routing
- Optimizer Pipeline -- sub_12E54A0, the pipeline function called by both phases
- NVVMPassOptions -- the 222-slot options table including thread count and jobserver slots
- Environment Variables -- LIBNVVM_DISABLE_CONCURRENT_API and MAKEFLAGS
- CLI Flags -- -jobserver, -split-compile, -split-compile-extended
- Bitcode I/O -- sub_153BF40, the bitcode reader used for per-function module extraction
Diagnostics & Optimization Remarks
CICC v13.0 contains three independent diagnostic systems that operate at different phases of compilation and serve different audiences. The EDG frontend diagnostic engine handles C++/CUDA language-level errors and warnings with rich terminal formatting or SARIF JSON output. The LLVM optimization remark infrastructure reports pass-level decisions (what was optimized, what was missed, and why) through the standard DiagnosticInfo hierarchy. NVIDIA's custom "profuse" framework provides verbose per-pass diagnostic output that is entirely separate from both EDG diagnostics and LLVM remarks, controlled by dedicated knobs like profuseinline and profusegvn.
Understanding these three layers is essential for reimplementation because they share no code. EDG diagnostics live in the 0x670000-0x6FFFFF address range and operate on EDG's internal diagnostic record format. LLVM remarks use the stock OptimizationRemarkEmitter analysis pass and the DiagnosticInfoOptimizationBase class hierarchy. The profuse framework is a pure NVIDIA invention that writes directly to stderr through cl::opt<bool> guards with no connection to either of the other two systems.
| EDG terminal emitter | sub_681D20 (37KB, 1,342 lines) at 0x681D20 |
| EDG dispatch/SARIF emitter | sub_6837D0 (20KB) at 0x6837D0 |
| Diagnostic format selector | unk_4D04198: 0 = text, 1 = SARIF |
| Format CLI flag | --diagnostics_format=text|sarif (case 0x125 in sub_617BD0) |
| EDG output mode CLI | --output_mode text|sarif (case 293 in lgenfe_main) |
| LLVM remark registration | ctor_152 at 0x4CE3F0 (3 regex cl::opts) |
| LLVM remark YAML serializer | sub_15CAD70 (13KB) at 0x15CAD70 |
| LLVM remark bitstream serializer | sub_F01350 (23KB) at 0xF01350 |
| Profuse inlining knob | profuseinline at 0x4DBEC0 (ctor_186_0), default off |
| Profuse GVN knob | profusegvn at 0x4FAE7E0 (ctor_201), default true |
| Diagnostic output stream | qword_4F07510 (FILE*, typically stderr) |
| Terminal width | dword_4D039D0 (columns, for word-wrapping) |
| ANSI color enable | dword_4F073CC[0] (nonzero = enabled) |
| Upstream LLVM equivalent | llvm/include/llvm/IR/DiagnosticInfo.h, llvm/lib/Analysis/OptimizationRemarkEmitter.cpp |
EDG Frontend Diagnostics
Dispatch Architecture
Every EDG frontend diagnostic passes through sub_6837D0, which acts as the single dispatch point. This function performs filtering (severity threshold, duplicate suppression, pragma-based suppression), increments error/warning counters, and then routes to one of two renderers based on the global unk_4D04198:
sub_6837D0(diag_record)
|
+-- severity < byte_4F07481[0]? --> suppress (return)
+-- duplicate? (byte_4CFFE80[4*errnum+2] bit flags) --> count only
+-- pragma disabled? (sub_67D520) --> suppress
+-- error limit reached? (unk_4F074B0 + unk_4F074B8 >= unk_4F07478) --> error 1508, abort
|
+-- unk_4D04198 == 0 --> sub_681D20(diag) [terminal text renderer]
+-- unk_4D04198 == 1 --> inline SARIF JSON [JSON renderer within sub_6837D0]
The format is selected by the --diagnostics_format flag (case 0x125 in sub_617BD0), which is surfaced as --output_mode text|sarif in the lgenfe CLI.
Diagnostic Record Layout
EDG diagnostic records are approximately 192-byte structures organized as a tree. Each record can have child diagnostics, notes, context diagnostics (include-stack annotations), and an extra child list, all stored as linked lists.
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 4 | type | 0 = top-level, 1 = unknown, 2 = child-with-parent, 3 = continuation |
| +8 | 8 | next_sibling | Linked list next pointer |
| +16 | 8 | parent_diag | Pointer to parent diagnostic node |
| +24 | 8 | child_list | Linked list of child diagnostics |
| +40 | 8 | extra_child_list | Secondary child list (always emitted) |
| +56 | 8 | note_list | Linked list of attached notes |
| +72 | 8 | context_list | Context diagnostics (include-stack annotations) |
| +96 | 4 | has_source_location | Nonzero if source info is present |
| +100 | 2 | column_number | Column in source line (unsigned short) |
| +120 | 8 | source_file_info | Passed to sub_723260 to get filename string |
| +128 | 4 | line_number | Source line number (unsigned int) |
| +136 | 4 | file_id | File table index (0 = no file) |
| +140 | 2 | column_end | End column for underlining range |
| +144 | 4 | is_command_line | Nonzero means "command line" prefix |
| +152 | 8 | source_entity | If nonzero, use sub_723640 for decorated location |
| +160 | 8 | display_name_ptr | Filename string pointer |
| +168 | 4 | display_line | Line number for display |
| +172 | 4 | tab_stop_width | Tab stop setting for source display |
| +176 | 4 | diagnostic_number | Numeric ID for -W flags, becomes SARIF ruleId |
| +180 | 1 | severity | Severity code (see severity enum below) |
Terminal Text Renderer (sub_681D20)
The 37KB terminal renderer is the larger and more complex of the two backends. It handles ANSI color output, word-wrapping to terminal width, source context display with caret underlining, and recursive child diagnostic emission.
Location prefix. The source location is formatted before the severity label. For file-based diagnostics, sub_722FC0 or sub_723640 produces the filename, followed by (line_number) in parentheses, wrapped in ANSI color code 5 (file path color). Command-line diagnostics use string ID 1490 ("command line"). Diagnostics with no file have no location prefix.
Severity label. The label string is looked up via sub_67C860(string_id) from a localized string table. The string table base v57 is offset by 0 for normal diagnostics, 1 for command-line diagnostics. When diagnostic numbering is enabled (unk_4D04728 set) and severity is 5 or below with a nonzero diagnostic number at +176, the renderer appends #<number> after the severity label, converted by sub_67D2D0.
ANSI color system. CICC does not emit standard ANSI escape sequences directly. Instead, it uses an internal 2-byte marker system where byte 0 is 0x1B (ESC) and byte 1 is a color code from 1 to 5. These internal markers are translated to real terminal escapes by the output layer.
| Internal Code | Semantic | Typical Terminal Mapping |
|---|---|---|
| 1 | Reset/default | \033[0m |
| 2 | Error | Red |
| 3 | Caution/severe-warning | Yellow/magenta |
| 4 | Location highlight | Bold/cyan |
| 5 | File path / remark | Dim/blue |
Color output is gated by dword_4F073CC[0] (nonzero = enabled) and dword_4F073C8 (nonzero = "rich" escape mode; zero = "simple" mode that skips escape bytes entirely).
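The marker translation described above can be sketched as follows. This is a minimal illustration, not the binary's code: the concrete escape sequences per color code are assumptions (the table above only gives "typical" mappings), and the two gating flags stand in for dword_4F073CC[0] and dword_4F073C8.

```c
#include <string.h>

/* Hypothetical mapping of the internal 1..5 color codes to real escapes.
 * The exact sequences are assumptions based on the table above. */
static const char *color_escape(int code) {
    switch (code) {
    case 1: return "\033[0m";   /* reset/default */
    case 2: return "\033[31m";  /* error: red */
    case 3: return "\033[33m";  /* caution: yellow */
    case 4: return "\033[1m";   /* location: bold */
    case 5: return "\033[34m";  /* file path: blue */
    default: return "";
    }
}

/* Translate a buffer containing internal 2-byte markers (0x1B, code)
 * into `out`. When color output is disabled or simple mode is active,
 * the markers are stripped rather than expanded. */
void render_markers(const char *in, char *out, size_t outsz,
                    int color_enabled, int rich_mode) {
    size_t o = 0;
    for (size_t i = 0; in[i] != '\0'; i++) {
        if (in[i] == '\x1b' && in[i + 1] >= 1 && in[i + 1] <= 5) {
            if (color_enabled && rich_mode) {
                const char *esc = color_escape(in[i + 1]);
                size_t n = strlen(esc);
                if (o + n < outsz) { memcpy(out + o, esc, n); o += n; }
            }
            i++;                /* consume the 2-byte marker either way */
        } else if (o + 1 < outsz) {
            out[o++] = in[i];
        }
    }
    out[o] = '\0';
}
```

In "simple" mode the marker pair is consumed but nothing is emitted, which matches the description of dword_4F073C8 == 0 skipping escape bytes entirely.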
Word-wrapping. Two code paths exist depending on whether ANSI colors are active.
Without colors (Path A), the algorithm is straightforward: compute available width as dword_4D039D0 - left_margin, scan for the last space within that width, break there, and emit newline plus indent. The left margin and continuation indent depend on the diagnostic type:
| Type (+0) | Left Margin | Continuation Indent |
|---|---|---|
| 0 (top-level) | 0 | 10 |
| 1 | 12 | 22 |
| 2 (child) | 10 or 12 | 20 or 22 |
| 3 (continuation) | 1 | 11 |
For type 2, the margin is +2 if the current diagnostic is not the first child of its parent.
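A minimal sketch of the Path A (colorless) algorithm, writing into a caller-supplied buffer rather than a stream for easy inspection. The break-at-last-space logic follows the description above; the handling of runs of spaces at the break point is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Wrap `msg` at the last space that fits in (terminal_width - margin)
 * columns, continuing subsequent lines at `cont_indent` columns, per
 * the margin/indent table above. Falls back to a forced break when no
 * space fits. */
void wrap_plain(const char *msg, int terminal_width,
                int left_margin, int cont_indent, char *out) {
    int avail = terminal_width - left_margin;
    out += sprintf(out, "%*s", left_margin, "");
    while ((int)strlen(msg) > avail) {
        int brk = -1;
        for (int i = 0; i < avail; i++)
            if (msg[i] == ' ') brk = i;          /* last space that fits */
        if (brk < 0) brk = avail;                /* forced break */
        out += sprintf(out, "%.*s\n%*s", brk, msg, cont_indent, "");
        msg += brk;
        while (*msg == ' ') msg++;               /* skip the break space */
        avail = terminal_width - cont_indent;
    }
    sprintf(out, "%s", msg);
}
```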
With colors (Path B), the algorithm tracks character-by-character with color state (v40 = current color, v41 = at-start-of-line flag, v152 = remaining columns). On encountering an ESC marker, it consumes the 2-byte pair and updates color state via sub_67BBF0. When the column limit is hit, the algorithm attempts to break at the last recorded space position (with buffer rewind to v147), falling back to a forced break at the current position.
The global qword_4F07468 controls wrap behavior: the low 32 bits disable wrapping entirely when nonzero, and the high 32 bits suppress source context display when nonzero.
Source context display. After the message text, the renderer displays the source line with caret underlining. sub_729B10(file_id, ...) retrieves source line data. Each source position entry is a linked list node with a 24+ byte layout: +0 next pointer, +8 source text pointer, +16 entry type (0 = normal char, 1 = same-position, 2 = 2-byte char, 3 = tab), +24 replacement character. The display renders two lines: the source text and a caret/tilde underline line, where ^ marks the error column and ~ extends the range to column_end. Multi-byte character handling uses sub_721AB0 to determine byte counts.
Recursive emission. After the main diagnostic and source context, child diagnostics are emitted recursively in this order: child_list (+24), note_list (+56, skipped for severity 2 remarks), context_list (+72, with parent pointer set before recursion), extra_child_list (+40). After all children, a blank line separator is emitted (unless compact mode is active), the output buffer is null-terminated, and the result is written via fputs to qword_4F07510 followed by fflush.
Machine-readable log. When qword_4D04908 (log FILE*) is set and the diagnostic type is not 3 (continuation), the renderer writes a single-line record:
<severity-char> "<filename>" <line> <col> <message>\n
The severity character is indexed from the string "rwweeccccCli" by (severity - 4). For child diagnostics, the character is lowercased.
| Index | Character | Meaning |
|---|---|---|
| 0 (sev 4) | r | remark |
| 1 (sev 5) | w | warning |
| 2 (sev 6) | w | caution (displayed as warning) |
| 3 (sev 7) | e | error |
| 4 (sev 8) | e | error (promoted) |
| 5 (sev 9) | c | catastrophe |
| 6 (sev 10) | c | catastrophe |
| 7 (sev 11) | C | catastrophe (alternate) |
| 8 | l | unknown |
| 9 | i | internal error |
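The record format and severity-character lookup can be sketched directly from the string literal. This is illustrative code, not the binary's: only the "rwweeccccCli" literal, the (severity - 4) index, and the lowercase-for-children rule are taken from the analysis above.

```c
#include <ctype.h>
#include <stdio.h>

/* Severity character for the machine-readable log line, indexed from
 * the literal "rwweeccccCli" by (severity - 4). Child diagnostics are
 * lowercased (a no-op for the already-lowercase entries). */
static char severity_char(int severity, int is_child) {
    static const char chars[] = "rwweeccccCli";
    char c = chars[severity - 4];
    return is_child ? (char)tolower((unsigned char)c) : c;
}

/* Emit one single-line record:  <sev-char> "<filename>" <line> <col> <message> */
void log_record(FILE *log, int severity, int is_child,
                const char *file, int line, int col, const char *msg) {
    fprintf(log, "%c \"%s\" %d %d %s\n",
            severity_char(severity, is_child), file, line, col, msg);
}
```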
SARIF JSON Renderer
The SARIF backend is implemented inline within sub_6837D0. It does not emit a complete SARIF document (there is no $schema key and no runs[] envelope); instead, it writes one JSON object per diagnostic as a comma-separated stream to qword_4F07510. The caller or a post-processing tool is expected to wrap the stream.
Each diagnostic object has this structure:
{
"ruleId": "EC<number>",
"level": "error"|"warning"|"remark"|"catastrophe"|"internal_error",
"message": {"text": "<JSON-escaped message>"},
"locations": [
{
"physicalLocation": {
"artifactLocation": {"uri": "file://<path>"},
"region": {"startLine": N, "startColumn": N}
}
}
],
"relatedLocations": [
{
"message": {"text": "..."},
"physicalLocation": { ... }
}
]
}
The ruleId is constructed by sprintf("%lu", *(uint32*)(diag+176)) -- the decimal diagnostic number prefixed with "EC". The level string is mapped from the severity byte at +180 via a switch statement. The message.text is produced by sub_683690, which renders the diagnostic text into qword_4D039E8 via sub_681B50 and then copies it character-by-character into qword_4D039D8 with JSON escaping of " and \ characters. The locations array is present only when *(diag+136) != 0 (valid file ID). The physicalLocation is built by sub_67C120, which calls sub_729E00 to decompose the packed source position and sub_722DF0 to resolve the file ID to a path string. The relatedLocations array carries note sub-diagnostics from the linked list at diag+72.
Multiple diagnostics are comma-separated: a comma is prepended before { when unk_4F074B0 + unk_4F074B8 > 1 (more than one diagnostic emitted so far).
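The character-by-character escaping step attributed to sub_683690 above can be sketched as follows. This is a simplified stand-in: per the description, only " and \ are escaped, and the growable-buffer plumbing is replaced by a plain output buffer.

```c
#include <stddef.h>

/* Copy `in` to `out`, backslash-escaping only '"' and '\\' as the
 * SARIF message renderer is described to do. Returns the escaped
 * length; output is truncated to fit outsz with a terminating NUL. */
size_t json_escape(const char *in, char *out, size_t outsz) {
    size_t o = 0;
    for (; *in && o + 2 < outsz; in++) {
        if (*in == '"' || *in == '\\')
            out[o++] = '\\';    /* escape prefix */
        out[o++] = *in;
    }
    out[o] = '\0';
    return o;
}
```

Note that a full SARIF producer would also need to escape control characters (newlines, tabs) to emit valid JSON; the binary's handling of those is not described here.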
Include-stack annotations. When include depth (dword_4F04C64) is greater than zero, sub_6837D0 walks the include stack (776-byte records at qword_4F04C68) calling sub_67B7E0 to build #include context annotations. These are linked as children at diag+40/+48. Error 453 gives "in file included from ..." context, error 1150 gives ellipsis "..." when too many include levels exist, and errors 1063/1064 give file-reference footers.
Warning-as-error promotion. When a warning (severity 5) has been emitted and unk_4D04728 is set, the function creates a synthetic "warnings treated as errors" diagnostic via sub_67D610(0xE7D, ..., 4) with severity 4 (remark), then recursively calls sub_6837D0 on it.
Diagnostic Filtering and Suppression
Filtering happens in sub_6837D0 before either renderer is invoked:
- Severity threshold: `byte_4F07481[0]` stores the minimum severity. Diagnostics below this level are silently suppressed.
- Duplicate detection: `byte_4CFFE80[4*errnum + 2]` bit flags track "already seen" diagnostics. Bit 0 marks first occurrence, bit 1 marks already emitted. On second hit, the diagnostic is counted but not emitted.
- Pragma suppression: `sub_67D520` checks whether the diagnostic is disabled via `#pragma diag_suppress` or similar EDG pragmas. `sub_67D470` records the suppression.
- Error limit: When `unk_4F074B0 + unk_4F074B8 >= unk_4F07478`, error 1508 ("error limit reached") is emitted and `sub_7235F0(9)` aborts compilation.
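The duplicate-detection step can be sketched as a small bit-flag check. The 4-byte stride and the +2 byte offset follow the description of byte_4CFFE80 above; the bit assignments (bit 0 = seen, bit 1 = emitted) are taken from the list, and the function shape is an illustration, not the binary's.

```c
#include <stdint.h>

enum { DIAG_SEEN = 1, DIAG_EMITTED = 2 };

/* Returns 1 if the diagnostic should be rendered, 0 if it is a
 * duplicate and should only be counted. `flags` models the
 * byte_4CFFE80 table: one 4-byte record per diagnostic number, with
 * the flag byte at offset +2. */
int should_emit_once(uint8_t *flags, unsigned errnum) {
    uint8_t *f = &flags[4 * errnum + 2];
    if (*f & DIAG_EMITTED)
        return 0;                       /* second hit: count only */
    *f |= DIAG_SEEN | DIAG_EMITTED;     /* first occurrence */
    return 1;
}
```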
Diagnostic Severity Enum
The severity byte at diag+180 encodes the following levels, used by both the terminal and SARIF renderers:
| Value | Name | Terminal Color | SARIF Level | Log Char | Label |
|---|---|---|---|---|---|
| 2 | remark | ESC 5 (blue) | "remark" | R | R |
| 4 | warning | ESC 5 (blue) | "warning" | r | W |
| 5 | caution | ESC 3 (yellow) | "warning" | w | W (lowercase) |
| 6 | severe-warning | ESC 3 (yellow) | (falls through to error) | w | E (lowercase) |
| 7 | error | ESC 2 (red) | "error" | e | E |
| 8 | error (promoted) | ESC 2 (red) | "error" | e | E |
| 9 | catastrophe | ESC 2 (red) | "catastrophe" | c | C |
| 10 | catastrophe | ESC 2 (red) | "catastrophe" | c | C |
| 11 | internal-error | ESC 2 (red) | "internal_error" | i | special |
Severity values 9, 10, and 11 are fatal: after emission, sub_7AFBD0 and sub_7235F0(severity) terminate compilation. sub_7AFBD0 is tentatively identified as longjmp-style error propagation [LOW confidence]: it is called on fatal error paths and does not return to its caller, consistent with longjmp or exit, but it could also be a custom abort-style handler; no setjmp/longjmp string evidence was found. Internal errors (11) additionally prepend "(internal error) " to the log output and use the prefix for error 3709.
Note: severity 2 (remark) is distinct from LLVM optimization remarks -- it is an EDG frontend remark (e.g., template instantiation notes). Remarks at severity 2 suppress their note_list children during recursive emission.
LLVM Optimization Remarks
Registration and CLI Surface
Three cl::opt<std::string> knobs are registered at ctor_152 (0x4CE3F0), each taking a regex pattern:
| Knob | Description | Filters |
|---|---|---|
pass-remarks | Enable optimization remarks from passes whose name matches the pattern | Passed (successful) optimizations |
pass-remarks-missed | Enable missed optimization remarks | Optimizations that were considered but not applied |
pass-remarks-analysis | Enable analysis remarks | Intermediate analysis results and explanations |
These are stock LLVM cl::opt registrations. CICC exposes them through the flag catalog (sub_9624D0) via the -inline-info convenience flag, which routes to the opt phase as:
-Xopt -pass-remarks=inline
-Xopt -pass-remarks-missed=inline
-Xopt -pass-remarks-analysis=inline
Additional remark-related knobs registered at ctor_376_0 (0x512DF0):
| Knob | Purpose |
|---|---|
pass-remarks-with-hotness | Include PGO hotness information in remarks |
pass-remarks-hotness-threshold | Minimum hotness for remark emission |
pass-remarks-output | File path for remark output (YAML or bitstream) |
pass-remarks-filter | Additional filter for remark pass names |
pass-remarks-format | Format: yaml or bitstream |
The -w flag (suppress warnings) routes to both opt and llc as -w. The -Werror flag routes to both as -Werror, promoting warnings to errors.
Remark Emission Protocol
LLVM passes emit remarks through a three-step protocol observed consistently across all analyzed passes:
Step 1: Construct the remark. The pass creates a DiagnosticInfoOptimizationBase subclass object via one of these constructors:
| Constructor | Address | Creates |
|---|---|---|
sub_B17560 | 0xB17560 | OptimizationRemark (pass succeeded) |
sub_15CA330 | 0x15CA330 | OptimizationRemark (alternative constructor) |
sub_15CA540 | 0x15CA540 | OptimizationRemarkMissed (pass failed/skipped) |
sub_B178C0 | 0xB178C0 | Warning-level DiagnosticInfo (non-remark warning) |
The constructor takes a pass name string (e.g., "coro-split", "wholeprogramdevirt", "loop-distribute") and a remark ID string (e.g., "Devirtualized", "Distribute", "CoroSplit").
Step 2: Build the message. The message is assembled through a builder pattern:
| Builder Function | Address | Purpose |
|---|---|---|
sub_B18290 | 0xB18290 | Append raw string to remark message |
sub_B16430 | 0xB16430 | Create named string attribute (e.g., "FunctionName") |
sub_B16B10 | 0xB16B10 | Create named integer attribute (e.g., "frame_size") |
sub_B16530 | 0xB16530 | Append named value (used in analysis remarks) |
sub_B180C0 | 0xB180C0 | Finalize and prepare remark for emission |
A typical emission sequence (from CoroSplit at 0x24F05D1):
call sub_B17560("coro-split", "CoroSplit") // create remark
call sub_B18290("Split '") // append prefix
call sub_B16430("function", fn_name) // named attribute
call sub_B18290("' (frame_size=") // literal text
call sub_B16B10("frame_size", N) // integer attribute
call sub_B18290(", align=") // literal text
call sub_B16B10("align", M) // integer attribute
call sub_B18290(")") // closing paren
Resulting remark text: Split '<function_name>' (frame_size=N, align=M)
Step 3: Publish. sub_1049740 publishes the remark to the diagnostic handler registered on the LLVMContext. The handler consults the pass-remarks / pass-remarks-missed / pass-remarks-analysis regex filters to decide whether to emit or suppress the remark.
After emission, remark objects are cleaned up: vtable-based destructors free the remark structure, and SSO string cleanup checks whether each temporary string pointer differs from its inline buffer address (indicating heap allocation that needs free).
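The SSO cleanup check described above (free only when the string pointer no longer aims at the inline buffer) is the standard small-string-optimization teardown. A minimal sketch, with an illustrative struct layout that is not the binary's exact one:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative SSO string: `data` points either at `inline_buf`
 * (short strings) or at a heap block (long strings). */
struct sso_string {
    char *data;
    size_t len;
    char inline_buf[16];
};

void sso_init(struct sso_string *s, const char *src) {
    s->len = strlen(src);
    if (s->len < sizeof s->inline_buf) {
        memcpy(s->inline_buf, src, s->len + 1);
        s->data = s->inline_buf;        /* short: stay inline */
    } else {
        s->data = malloc(s->len + 1);   /* long: heap-allocate */
        memcpy(s->data, src, s->len + 1);
    }
}

/* The cleanup check from the text: free only when the pointer differs
 * from the inline buffer address, i.e. heap allocation occurred. */
void sso_destroy(struct sso_string *s) {
    if (s->data != s->inline_buf)
        free(s->data);
}
```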
Remark Categories
Standard LLVM categories:
| Category | YAML Tag | Meaning |
|---|---|---|
| Passed | !Passed | Optimization was successfully applied |
| Missed | !Missed | Optimization was considered but not applied |
| Analysis | !Analysis | Intermediate analysis information |
| Failure | !Failure | Internal failure during optimization |
NVIDIA-specific categories added to the remark framework:
| Category | YAML Tag | Purpose |
|---|---|---|
| AnalysisFPCommute | !AnalysisFPCommute | GPU floating-point commutativity analysis feedback |
| AnalysisAliasing | !AnalysisAliasing | GPU memory aliasing analysis feedback |
These NVIDIA-specific categories are registered in the YAML serializer at sub_15CAD70 and the YAML parser at sub_C30A00.
Serialization Backends
YAML serializer (sub_15CAD70, 13KB at 0x15CAD70): Emits structured YAML with fields Pass, Name, DebugLoc, and the remark type tag. Uses a vtable-based streaming API at offsets +96 (writeKey), +120 (beginMapping), +128 (endMapping).
Bitstream serializer (sub_F01350, 23KB at 0xF01350): Emits remarks in LLVM's binary bitstream format (used for -fsave-optimization-record). Record types include "Remark", "Remark header", "Remark debug location", "Remark hotness", "Argument with debug location", and "Argument". Uses sub_EFD2C0 for VBR-encoded record emission and sub_EFCCF0 for abbreviation definitions.
Remark serializer factory (sub_C2E790, 6KB at 0xC2E790): llvm::remarks::createRemarkSerializer dispatches to YAML or bitstream format based on configuration. Returns an error for unknown formats: "Unknown remark serializer format.".
OptimizationRemarkEmitter Analysis
Two analysis passes provide remark emission capability to function-level and machine-function-level passes:
| Pass | Pipeline Name | Level |
|---|---|---|
OptimizationRemarkEmitterAnalysis | "opt-remark-emit" (pipeline ID 181) | Function analysis |
MachineOptimizationRemarkEmitterAnalysis | "machine-opt-remark-emitter" (pipeline ID 467) | MachineFunction analysis |
Passes that emit remarks must request the appropriate analysis and store the resulting OptimizationRemarkEmitter*. For example, the TwoAddressInstruction pass stores it at this+272, obtained via analysis lookup unk_4FC4534.
Passes Known to Emit Remarks
This is a non-exhaustive list of passes observed emitting optimization remarks in the binary:
| Pass | Remark Name | Remark Examples |
|---|---|---|
| CoroSplit | "coro-split" | Split '<fn>' (frame_size=N, align=M) |
| WholeProgramDevirt | "wholeprogramdevirt" | Devirtualized '<fn>' |
| LoopDistribute | "loop-distribute" | Distribute, NoUnsafeDeps, TooManySCEVRuntimeChecks |
| LoopVectorize | "loop-vectorize" | Vectorization success/failure details |
| LoopUnroll | "loop-unroll" | Unroll factor and failure reasons |
| LoopInterchange | "loop-interchange" | Cannot interchange loops... |
| LICM | "licm" | Hoist success/failure reasons |
| SLPVectorizer | "slp-vectorizer" | SLP vectorization decisions |
| MachinePipeliner | "pipeliner" | Pipelined succesfully! [sic] |
| MachineOutliner | "machine-outliner" | Outlining decisions |
| OpenMP SPMD Transform | "openmp-opt" | OMP120 (remark), OMP121 (warning) |
| InstCombine | "instcombine" | Visit decisions (via instcombine-visit filter) |
| FastISel | "fastisel" | FastISel failure reports |
| IRCE | "irce" | Range check elimination decisions |
| TwoAddressInstruction | "twoaddressinstruction" | Two-address conversion decisions |
NVIDIA Profuse Framework
Design and Purpose
The "profuse" diagnostic framework is an NVIDIA-specific verbose output system that has no connection to the LLVM OptimizationRemark infrastructure. It predates LLVM's remark system and serves a different purpose: providing NVIDIA compiler engineers with extremely detailed, unstructured diagnostic output from specific optimization passes.
The name "profuse" is unfortunately overloaded in the cicc binary. Two completely unrelated systems use the word:
- PGO profuse: The `profuse` knob registered at `ctor_375` (0x512720) is a boolean that enables profile-guided optimization data consumption. It is set via `-profile-instr-use <file>`, which routes to `-Xopt -profuse=true -Xopt -proffile=<file>`. This is a PGO control flag, not a diagnostic system.
- Diagnostic profuse: The `profuseinline` and `profusegvn` knobs are NVIDIA diagnostic toggles that control verbose output from specific optimization passes. These are the "profuse framework" discussed here.
profuseinline
Registered at ctor_186_0 (0x4DBEC0) as a cl::opt<bool> with default value off (false).
When enabled, the NVIDIA custom inliner (sub_1864060, the shouldInline / inline cost computation) emits verbose diagnostic output for every inlining decision. This includes the computed cost, threshold comparison, argument type-size coercion details, and the final accept/reject decision.
The profuse inlining output goes directly to stderr through fprintf-style calls within the inliner code. It is not routed through OptimizationRemarkEmitter and does not appear in remark YAML/bitstream output. This is distinct from the LLVM inline-remark-attribute knob which annotates the IR with remark metadata.
The -inline-info CLI flag does not enable profuseinline. Instead, -inline-info routes to the three standard pass-remarks knobs filtered for "inline". To enable profuse output, one must pass -Xopt -profuseinline=true (or -Xcicc -opt -profuseinline=true through nvcc).
Comparison of the two diagnostic channels for inlining:
| Feature | profuseinline | -inline-info (pass-remarks) |
|---|---|---|
| Output format | Unstructured stderr text | Structured LLVM remark |
| Controlled by | cl::opt<bool> | Regex filter on pass name |
| Default | Off | Off |
| YAML/bitstream output | No | Yes (if -pass-remarks-output set) |
| Cost model details | Yes (full cost breakdown) | No (accept/reject only) |
| NVIDIA-specific metrics | Yes (GPU opcode bonus, struct analysis) | No |
profusegvn
Registered at ctor_201 (0x4E0990) as a cl::opt<bool> with default value true (enabled). Global address: 0x4FAE7E0. Description: "profuse for GVN".
When the knob is active (which it is by default), the GVN pass (sub_1900BB0, 83KB) emits verbose diagnostic output at the following decision points:
- Value replacement decisions (when a leader is found in the value numbering table)
- Store/load expression hash table matches
- PRE (Partial Redundancy Elimination) insertion decisions
The output is written directly to stderr, bypassing the LLVM remark system entirely. The profuse GVN output is not captured by -pass-remarks-output and does not appear in remark YAML or bitstream files.
To disable the verbose output, pass -Xopt -profusegvn=false. The fact that this defaults to true (unlike profuseinline which defaults to false) suggests it may be gated by an additional runtime check (possibly wizard mode or an optimization level gate) to prevent user-visible noise in release builds.
Profuse vs. LLVM Remarks Summary
| Aspect | Profuse Framework | LLVM Optimization Remarks |
|---|---|---|
| Origin | NVIDIA custom | Upstream LLVM |
| Passes | Inliner, GVN only (observed) | Most optimization passes |
| Output | Raw stderr fprintf | Structured DiagnosticInfo |
| Format | Unstructured text | YAML, bitstream, or terminal |
| Filtering | Per-knob boolean | Regex on pass name |
| Serialization | None | YAML and bitstream serializers |
| IDE integration | None | SARIF (with post-processing) |
| Default | Off (inline) / On (GVN) | Off (requires -pass-remarks) |
Filtering and Configuration
CLI Flags for Diagnostic Control
EDG frontend diagnostics (Phase I):
| Flag | Route | Effect |
|---|---|---|
--diagnostics_format=sarif | EDG direct | Switch output to SARIF JSON |
--output_mode text|sarif | EDG direct (case 293) | Same as above, alternative spelling |
-w | opt -w, llc -w | Suppress all warnings |
-Werror | opt -Werror, llc -Werror | Promote warnings to errors |
--error_limit N | EDG direct | Maximum errors before abort (unk_4F07478) |
#pragma diag_suppress N | EDG source | Suppress specific diagnostic by number |
LLVM optimization remarks (Phase II / opt):
| Flag | Route | Effect |
|---|---|---|
-inline-info | opt: -pass-remarks=inline, -pass-remarks-missed=inline, -pass-remarks-analysis=inline | Enable inline-specific remarks |
-Xopt -pass-remarks=<regex> | opt direct | Enable passed remarks matching pattern |
-Xopt -pass-remarks-missed=<regex> | opt direct | Enable missed remarks matching pattern |
-Xopt -pass-remarks-analysis=<regex> | opt direct | Enable analysis remarks matching pattern |
-Xopt -pass-remarks-output=<file> | opt direct | Write remarks to file (YAML or bitstream) |
-Xopt -pass-remarks-format=yaml|bitstream | opt direct | Select output format |
-Xopt -pass-remarks-with-hotness | opt direct | Include PGO hotness in remarks |
-Xopt -pass-remarks-hotness-threshold=N | opt direct | Minimum hotness for emission |
-Xopt -pass-remarks-filter=<regex> | opt direct | Additional pass name filter |
NVIDIA profuse diagnostics:
| Flag | Route | Effect |
|---|---|---|
-Xopt -profuseinline=true | opt direct | Enable verbose inlining diagnostics |
-Xopt -profusegvn=false | opt direct | Disable verbose GVN diagnostics (on by default) |
Debug and verbose output:
| Flag | Route | Effect |
|---|---|---|
-enable-verbose-asm | llc -asm-verbose | Verbose assembly comments |
-show-src | llc -nvptx-emit-src | Embed source in PTX output |
-time-passes | special (must be only flag) | Time each LLVM pass |
Global Variables Controlling Diagnostic Behavior
| Address | Type | Name | Purpose |
|---|---|---|---|
unk_4D04198 | int | diagnostic_format | 0 = text, 1 = SARIF |
byte_4F07481[0] | byte | min_severity_threshold | Minimum severity for emission |
unk_4F074B0 | uint | error_count | Running error counter |
unk_4F074B8 | uint | warning_count | Running warning/non-error counter |
unk_4F07478 | uint | error_limit | Maximum errors before abort |
unk_4F07490 | flag | print_counters | Whether to print summary counters |
unk_4D04728 | byte | diag_numbering | Diagnostic numbering enabled |
unk_4D042B0 | byte | command_line_mode | Command-line diagnostic prefix |
unk_4D042B8 | flag | werror_flag | Promote severity to 7 for warnings |
dword_4D039D0 | int | terminal_width | Columns for word-wrapping |
dword_4F073CC[0] | int | ansi_color_enabled | ANSI color output flag |
dword_4F073C8 | int | rich_escape_mode | Rich (2-byte ESC) vs simple mode |
qword_4F07468 | int64 | wrap_control | Low32: disable wrap. High32: suppress context |
qword_4F07510 | FILE* | diag_output_stream | Output stream (stderr) |
qword_4D04908 | FILE* | diag_log_file | Machine-readable log file |
byte_4CFFE80 | array | diag_seen_flags | Per-diagnostic duplicate tracking |
Growable String Buffer Infrastructure
All three diagnostic systems share the same growable string buffer used for message formatting. The buffer structure appears at qword_4D039D8 (output buffer), qword_4D039E0 (prefix buffer), and qword_4D039E8 (header/message buffer):
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | (tag/type) | Unused or type discriminator |
| +8 | 8 | capacity | Maximum bytes before realloc |
| +16 | 8 | length | Current write position |
| +24 | 8 | (unused) | Padding |
| +32 | 8 | data | char* pointer to the actual buffer |
| Helper | Address | Operation |
|---|---|---|
sub_823800 | 0x823800 | Reset/clear buffer (set length to 0) |
sub_823810 | 0x823810 | Grow buffer capacity (realloc) |
sub_8238B0 | 0x8238B0 | Append data: memcpy(buf->data + buf->length, str, len) |
sub_8237A0 | 0x8237A0 | Allocate new buffer (initial capacity = 1024) |
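A C rendering of the buffer structure and its append/grow helpers, using the field offsets from the table above. The 1024-byte initial capacity matches sub_8237A0's description; the doubling growth policy and the NUL termination on append are assumptions for illustration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Field offsets mirror the table: +0 tag, +8 capacity, +16 length,
 * +24 padding, +32 data pointer. */
struct growbuf {
    uint64_t tag;        /* +0: unused/type discriminator */
    uint64_t capacity;   /* +8 */
    uint64_t length;     /* +16 */
    uint64_t pad;        /* +24 */
    char *data;          /* +32 */
};

/* Allocate a new buffer (role of sub_8237A0). */
struct growbuf *gb_new(void) {
    struct growbuf *b = calloc(1, sizeof *b);
    b->capacity = 1024;                  /* initial capacity per the table */
    b->data = malloc(b->capacity);
    b->data[0] = '\0';
    return b;
}

/* Append data (role of sub_8238B0), growing via realloc (sub_823810).
 * The doubling policy is an assumption. */
void gb_append(struct growbuf *b, const char *s, size_t n) {
    while (b->length + n + 1 > b->capacity) {
        b->capacity *= 2;
        b->data = realloc(b->data, b->capacity);
    }
    memcpy(b->data + b->length, s, n);   /* matches the memcpy in sub_8238B0 */
    b->length += n;
    b->data[b->length] = '\0';
}
```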
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
sub_67B780 | 0x67B780 | -- | EDG: Increment error/warning counters |
sub_67B7E0 | 0x67B7E0 | -- | EDG: Build include-stack annotation |
sub_67B9F0 | 0x67B9F0 | -- | EDG: Diagnostic record pool allocator |
sub_67BB20 | 0x67BB20 | -- | EDG: Argument node allocator |
sub_67BBF0 | 0x67BBF0 | -- | EDG: Set ANSI color state for output |
sub_67BD40 | 0x67BD40 | -- | EDG: Emit newline/flush for source context |
sub_67BDC0 | 0x67BDC0 | -- | EDG: Load file metadata and tab stop width |
sub_67C120 | 0x67C120 | -- | EDG/SARIF: Emit physicalLocation JSON |
sub_67C860 | 0x67C860 | -- | EDG: Localized string lookup by ID |
sub_67D2D0 | 0x67D2D0 | -- | EDG: Convert internal diag ID to user-visible number |
sub_67D470 | 0x67D470 | -- | EDG: Record pragma-based suppression |
sub_67D520 | 0x67D520 | -- | EDG: Check pragma-based suppression |
sub_67D610 | 0x67D610 | -- | EDG: Create synthetic diagnostic (warnings-as-errors) |
sub_681B50 | 0x681B50 | -- | EDG: Populate message text into header buffer |
sub_681D20 | 0x681D20 | 37KB | EDG: Terminal text diagnostic renderer |
sub_683690 | 0x683690 | -- | EDG/SARIF: Emit JSON-escaped message object |
sub_6837D0 | 0x6837D0 | 20KB | EDG: Diagnostic dispatch and SARIF renderer |
sub_721AB0 | 0x721AB0 | -- | EDG: Multi-byte character byte count |
sub_722DF0 | 0x722DF0 | -- | EDG/SARIF: Resolve file-id to path string |
sub_722FC0 | 0x722FC0 | -- | EDG: Format filename into buffer |
sub_723260 | 0x723260 | -- | EDG: Get filename string from file info |
sub_723640 | 0x723640 | -- | EDG: Get decorated source location string |
sub_729B10 | 0x729B10 | -- | EDG: Retrieve file/line data for source context |
sub_729E00 | 0x729E00 | -- | EDG/SARIF: Decompose packed source position |
sub_729F80 | 0x729F80 | -- | EDG: Promote severity (hard error) |
sub_7235F0 | 0x7235F0 | -- | EDG: Fatal exit with severity code |
sub_7AF1D0 | 0x7AF1D0 | -- | EDG: Newline character mapping lookup |
sub_823800 | 0x823800 | -- | Shared: Reset/clear growable string buffer |
sub_823810 | 0x823810 | -- | Shared: Grow/realloc string buffer |
sub_8237A0 | 0x8237A0 | -- | Shared: Allocate new growable buffer |
sub_8238B0 | 0x8238B0 | -- | Shared: Append to string buffer |
sub_B16430 | 0xB16430 | -- | LLVM Remark: Create named string attribute |
sub_B16530 | 0xB16530 | -- | LLVM Remark: Append named value |
sub_B16B10 | 0xB16B10 | -- | LLVM Remark: Create named integer attribute |
sub_B157E0 | 0xB157E0 | -- | LLVM Remark: Get DebugLoc for remark source location |
sub_B17560 | 0xB17560 | -- | LLVM Remark: Construct OptimizationRemark (passed) |
sub_B178C0 | 0xB178C0 | -- | LLVM Remark: Construct warning-level DiagnosticInfo |
sub_B180C0 | 0xB180C0 | -- | LLVM Remark: Finalize and prepare remark for emission |
sub_B18290 | 0xB18290 | -- | LLVM Remark: Append raw string to remark message |
sub_B2BE50 | 0xB2BE50 | -- | LLVM Remark: getRemarkStreamer |
sub_B6EA50 | 0xB6EA50 | -- | LLVM Remark: isEnabled check |
sub_B6F970 | 0xB6F970 | -- | LLVM Remark: getRemarkFilter |
sub_B91220 | 0xB91220 | -- | LLVM Remark: Free remark string |
sub_C2E790 | 0xC2E790 | 6KB | LLVM Remark: createRemarkSerializer factory |
sub_C302C0 | 0xC302C0 | 4KB | LLVM Remark: YAML remark serializer emit |
sub_C30A00 | 0xC30A00 | 6KB | LLVM Remark: YAML remark parser (6 type tags) |
sub_C31010 | 0xC31010 | 8KB | LLVM Remark: YAML remark field parser |
sub_EFCCF0 | 0xEFCCF0 | 9KB | LLVM Remark: Bitstream abbreviation emitter |
sub_EFD2C0 | 0xEFD2C0 | 18KB | LLVM Remark: Bitstream record writer |
sub_EFE900 | 0xEFE900 | 30KB | LLVM Remark: Bitstream remark parser |
sub_F01350 | 0xF01350 | 23KB | LLVM Remark: Bitstream remark serializer |
sub_1049740 | 0x1049740 | -- | LLVM Remark: Publish remark to diagnostic handler |
sub_15CA330 | 0x15CA330 | -- | LLVM Remark: OptimizationRemark constructor |
sub_15CA540 | 0x15CA540 | -- | LLVM Remark: OptimizationRemarkMissed constructor |
sub_15CAB20 | 0x15CAB20 | -- | LLVM Remark: OptimizationRemark::operator<<(StringRef) |
sub_15CAD70 | 0x15CAD70 | 13KB | LLVM Remark: YAML remark serializer (NVIDIA-extended) |
sub_1DCCCA0 | 0x1DCCCA0 | -- | LLVM Remark: OptimizationRemarkEmitter::emit |
Cross-References
- Entry Point & CLI -- flag routing for `-w`, `-Werror`, `-inline-info`, `-Xopt` pass-through
- GVN -- `profusegvn` knob and GVN diagnostic output
- Inliner Cost Model -- `profuseinline` knob and inline cost diagnostics
- LLVM Pass Pipeline -- `opt-remark-emit` and `machine-opt-remark-emitter` analysis pass registration
- EDG Frontend -- EDG option registration including `--diagnostics_format`
- CLI Flags -- complete flag-to-pipeline routing table
- Knobs -- `profuseinline`, `profusegvn`, and remark-related knobs
- AsmPrinter -- remark emission during code generation
Hash Table and Collection Infrastructure
Every associative container in cicc v13.0 is built from the same handful of primitives: a pointer-hash DenseMap/DenseSet with quadratic probing, a wyhash-v4-family string hasher, and a SmallVector with inline buffer optimization. Before this page existed, the same hash table description was duplicated across 30+ wiki pages. This is the single source of truth. If you are reimplementing cicc's data structures, start here.
There are no NVIDIA-specific modifications to the DenseMap hashing or probing logic -- cicc links the LLVM 20.0.0 implementation unmodified. The only NVIDIA-original hash infrastructure is the wyhash-v4 string hasher used for the builtin name table.
DenseMap Layout
Two variants exist, distinguished by bucket stride. Both share the same 28-byte inline header, the same hash function, the same probing sequence, the same sentinel values, and the same growth policy. The header is always embedded directly inside a larger structure (context object, analysis result, pass state) -- never heap-allocated on its own.
Variant A -- DenseSet (8 bytes/bucket)
| Offset | Size | Type | Field |
|---|---|---|---|
| +0 | 8 | uint64_t | NumEntries |
| +8 | 8 | ptr | Buckets (heap-allocated array) |
| +16 | 4 | uint32_t | NumItems (live entries) |
| +20 | 4 | uint32_t | NumTombstones |
| +24 | 4 | uint32_t | NumBuckets (always power of 2) |
Bucket array size: NumBuckets * 8 bytes. Each bucket holds either a valid pointer, an empty sentinel, or a tombstone sentinel.
Variant B -- DenseMap (16 bytes/bucket)
Same 28-byte header. Each bucket holds a key-value pair at a 16-byte stride:
v30 = (_QWORD *)(buckets + 16LL * slot); // sub_163D530 line 561
*v30 = key; // +0: key
v30[1] = value; // +8: value
Variant B is used by the SelectionDAG builder (context offsets +120 and +152), the NVVM IR node uniquing tables, and any subsystem that maps pointers to pointers.
Where the Variants Appear
| Subsystem | Variant | Context offset | Purpose |
|---|---|---|---|
| NVVM IR uniquing (sub_162D4F0) | B (16B) | context qw[130..178] | Node deduplication per opcode |
| SelectionDAG builder (sub_163D530) | B (16B) | +120, +152 | Node mapping |
| SelectionDAG builder (sub_163D530) | A (8B) | +184 | Worklist set |
| Per-node analysis structures | A (8B) | +72 inside v381 | Visited set |
| CSSA PHI map (sub_3720740) | B (16B) | r15+0x60 | PHI-to-ID mapping |
| Coroutine spill tracking | B (16B) | +0x18 inline | Spill/reload tracking |
| Builtin name table | custom (12B stride) | context+480 | Name-to-ID with hash cache |
Pointer Hash Function
Every DenseMap/DenseSet instance in cicc that uses pointer keys employs the same hash:
hash(ptr) = (ptr >> 9) ^ (ptr >> 4)
This is LLVM's DenseMapInfo<void*>::getHashValue, unchanged. The right-shift by 4 discards the low bits that are always zero due to 8- or 16-byte alignment. The right-shift by 9 mixes in higher-order address bits to break up the stride patterns that arise from slab allocation (where consecutive objects are separated by a fixed power-of-two). The XOR combines these two views of the pointer into a single hash value that distributes well for both heap-allocated and slab-allocated objects.
Representative decompiled evidence (appears identically in dozens of functions):
v9 = (v12 - 1) & (((unsigned int)v11 >> 9) ^ ((unsigned int)v11 >> 4));
Integer-Key Hash Variant
A separate hash function is used for DenseMap<unsigned, T> instances (integer keys rather than pointers):
hash(key) = key * 37
This is LLVM's DenseMapInfo<unsigned>::getHashValue. It appears in the instruction emitter (sub_2E29BA0), the two-address pass (sub_1F4E3A0), the vector legalization tables, and the SelectionDAG instruction selection cost table (sub_3090F90). Integer-key maps use a different sentinel pair: 0xFFFFFFFF (empty) and 0xFFFFFFFE (tombstone).
wyhash v4 String Hasher -- sub_CBF760
The NVVM builtin name table uses a separate, NVIDIA-original hash function for string keys. sub_C92610 is a thin wrapper that tail-calls sub_CBF760. The function dispatches on input length into six code paths, each using different constant sets and mixing strategies:
Length Dispatch Table
| Length | Strategy | Constants |
|---|---|---|
| 0 | Return constant | 0x2D06800538D394C2 |
| 1--3 | 3-byte read + XOR + multiply | seed 0x87275A9B, mul 0xC2B2AE3D27D4EB4F, avalanche 0x165667B19E3779F9 |
| 4--8 | 2x uint32 + combine + rotate | XOR 0xC73AB174C5ECD5A2, mul 0x9FB21C651E98DF25 |
| 9--16 | 2x uint64 + 128-bit multiply | XOR 0x6782737BEA4239B9 / 0xAF56BC3B0996523A, avalanche 0x165667919E3779F9 |
| 17--128 | Paired 16B reads from both ends | Per-pair constants, 128-bit multiplies, length mixed with 0x61C8864E7A143579 |
| 129--240 | Extended mixing | Delegates to sub_CBF370 |
| > 240 | Bulk processing | Delegates to sub_CBF100 |
Pseudocode (length 1--3, the most common case for short builtins)
fn wyhash_short(data: &[u8], len: usize) -> u32 {
let a = data[0] as u64;
let b = data[len / 2] as u64;
let c = data[len - 1] as u64;
let combined = a | (b << 8) | (c << 16) | (len as u64) << 24;
let mixed = combined ^ 0x87275A9B;
let wide = mixed.wrapping_mul(0xC2B2AE3D27D4EB4F);
let folded = wide ^ (wide >> 32);
let result = folded.wrapping_mul(0x165667B19E3779F9);
(result ^ (result >> 32)) as u32
}
Pseudocode (length 17--128, covering most __nvvm_* names)
fn wyhash_medium(data: &[u8], len: usize) -> u32 {
let pairs = [
(0x1CAD21F72C81017C, 0xBE4BA423396CFEB8), // pair 0
(0x1F67B3B7A4A44072, 0xDB979083E96DD4DE), // pair 1
(0x2172FFCC7DD05A82, 0x78E5C0CC4EE679CB), // pair 2
// ... additional pairs for 64/96/128 thresholds
];
let (mut v8, mut v10) = (0u64, 0u64);
// read 16 bytes from front, 16 from back, mix with pair constants
for i in 0..((len + 15) / 32) {
let front = read_u128(&data[i * 16..]);
let back = read_u128(&data[len - (i + 1) * 16..]);
(v8, v10) = mix_128(v8, v10, front, back, pairs[i]);
}
let combined = v8 ^ v10 ^ (len as u64 ^ 0x61C8864E7A143579);
let result = 0x165667919E3779F9u64.wrapping_mul(combined ^ (combined >> 37));
(result ^ (result >> 32)) as u32
}
The final return value is always a uint32 -- the high dword of the 64-bit result XORed with the low dword. Most NVVM builtin names are 8--35 bytes, hitting the fast 4--8, 9--16, and 17--128 paths.
Probing Strategy
All DenseMap instances use quadratic probing with triangular-number increments:
slot = hash & (capacity - 1) // initial probe
step = 1
loop:
if bucket[slot] == key -> found
if bucket[slot] == EMPTY -> not found (insert here)
if bucket[slot] == TOMBSTONE -> record for reuse
slot = (slot + step) & (capacity - 1)
step++
The probe sequence for initial position h visits:
h, h+1, h+3, h+6, h+10, h+15, h+21, ...
h + T(k) where T(k) = k*(k+1)/2 (triangular numbers)
This guarantees that for a power-of-2 table size n, all n slots are visited before any index repeats: the triangular numbers T(0) .. T(n-1) are pairwise distinct modulo n whenever n is a power of 2, so the first n probe positions form a permutation of the table.
Comparison Guard (Builtin Table)
The builtin name hash table (sub_C92740, sub_C92860) adds a triple comparison guard before performing the expensive memcmp:
- Cached hash equality: hash_cache[slot] == search_hash
- Length equality: entry->length == search_length
- Content equality: memcmp(search_data, entry->string_data, length) == 0
The hash cache is stored in a separate array immediately after the bucket array and the end-of-table sentinel. This layout avoids polluting bucket cache lines with hash values that are only needed on collision.
Probing Label: "Linear" vs "Quadratic"
Some analysis reports describe the probing as "linear" because the step variable increments by 1 each iteration. The actual probe position advances quadratically (by accumulating triangular numbers). Both descriptions refer to the same code. This page uses the technically precise term: quadratic probing with triangular numbers.
Growth Policy
Load Factor Threshold -- 75%
After every successful insertion, the map checks whether to grow:
if (4 * (NumItems + 1) >= 3 * NumBuckets)
// load factor > 75% -> double capacity
new_capacity = 2 * NumBuckets
Tombstone Compaction -- 12.5%
If the load factor is acceptable but tombstones have accumulated:
elif (NumBuckets - NumTombstones - NumItems <= NumBuckets >> 3)
// fewer than 12.5% of slots are truly empty
// rehash at same capacity to clear tombstones
new_capacity = NumBuckets
Rehash Procedure -- sub_C929D0
1. calloc(new_capacity + 1, bucket_stride) for the new array.
2. Write the end-of-table sentinel at position new_capacity.
3. For each live (non-empty, non-tombstone) entry in the old table, reinsert into the new table using quadratic probing.
4. Copy the cached hash (if the table has a hash cache).
5. Track the new position of a "current slot" pointer so the caller can continue using the entry it just inserted.
6. Free the old array.
7. Reset NumTombstones to 0.
8. Update NumBuckets to new_capacity.
9. Return the new position of the tracked slot.
Capacity Constraints
- Power of 2: always. Enforced by the bit-smearing pattern x |= x>>1; x |= x>>2; x |= x>>4; ...; x += 1.
- Minimum: 64 buckets for standard DenseMap instances. The builtin name table starts at 16 and grows through 16 -> 32 -> 64 -> 128 -> 256 -> 512 -> 1024 as its 770 entries are inserted.
- Allocation: sub_22077B0 (operator new[]), freed via j___libc_free_0.
Sentinel Values
Two sentinel families exist, distinguished by magnitude. Both are chosen to be impossible values for aligned pointers.
NVVM-Layer Sentinels (small magnitude)
Used by the NVVM IR uniquing tables, the SelectionDAG builder maps, and the builtin name table:
| Role | Value | Hex | Why safe |
|---|---|---|---|
| Empty | -8 | 0xFFFFFFFFFFFFFFF8 | Low 3 bits = 0b000 after masking, but no 8-byte-aligned pointer is this close to (uint64_t)-1 |
| Tombstone | -16 | 0xFFFFFFFFFFFFFFF0 | Same reasoning, distinct from -8 |
The builtin name table also uses a value of 2 as an end-of-table sentinel placed at bucket_array[capacity].
LLVM-Layer Sentinels (large magnitude)
Used by the majority of LLVM pass infrastructure -- SCEV, register coalescing, block placement, SLP vectorizer, StructurizeCFG, machine pipeliner, prolog-epilog, and others:
| Role | Value | Hex |
|---|---|---|
| Empty | -4096 | 0xFFFFFFFFFFFFF000 |
| Tombstone | -8192 | 0xFFFFFFFFFFFFE000 |
Integer-Key Sentinels
Used by DenseMap<unsigned, T> instances (instruction emitter, two-address pass):
| Role | Value | Hex |
|---|---|---|
| Empty | 0xFFFFFFFF | 32-bit all-ones |
| Tombstone | 0xFFFFFFFE | 32-bit all-ones minus 1 |
Which Sentinel Set to Expect
| Subsystem | Sentinel pair |
|---|---|
| NVVM IR uniquing, SelectionDAG builder | -8 / -16 |
| Builtin name table | -8 (tombstone), 0 (empty), 2 (end marker) |
| SCEV, block placement, SLP vectorizer | -4096 / -8192 |
| Register coalescing, machine pipeliner | -4096 / -8192 |
| StructurizeCFG, prolog-epilog | -4096 / -8192 |
| Instruction emitter, two-address | 0xFFFFFFFF / 0xFFFFFFFE |
| Coroutine spill tracking | -4096 / -8192 |
| CSSA PHI map | -4096 / -8192 |
| Debug verify | -4096 / -8192 |
| LazyCallGraph | -4096 / -8192 |
The -8/-16 pair appears exclusively in NVVM-layer (NVIDIA-original) code. The -4096/-8192 pair is the standard LLVM DenseMapInfo<void*> sentinel set. The difference is cosmetic -- both pairs are safe for the same reasons -- but it reveals code provenance: if you see -8/-16, the code was written or heavily modified by NVIDIA; if you see -4096/-8192, it is stock LLVM.
SmallVector Pattern
SmallVector is the universal dynamic array throughout cicc, with two growth implementations:
Layout
[BeginPtr | Size | Capacity | InlineData...]
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | data_ptr (points to inline buffer initially, heap after growth) |
| +8 | 4 | size (live element count) |
| +12 | 4 | capacity (allocated slots) |
| +16 | N | Inline buffer (N = InlineCapacity * element_size) |
When size == capacity on insertion, the vector grows.
Growth Functions
| Function | Address | Description |
|---|---|---|
| SmallVector::grow | sub_C8D5F0 | Generic growth -- copies elements, used for non-POD types |
| SmallVectorBase::grow_pod | sub_C8D7D0 | POD-optimized growth -- uses realloc when buffer is heap-allocated |
| SmallVector::grow (MIR) | sub_16CD150 | Second copy in the MachineIR address range, identical logic |
| SmallVector::grow (extended) | sub_C8E1E0 | Larger variant (11KB), handles edge cases |
Growth Policy
The standard LLVM SmallVector growth: double the current capacity, or jump straight to the required capacity if doubling is not enough. If the current buffer is the inline buffer, malloc a new heap buffer and memcpy the contents. If the buffer is already on the heap, realloc it (for POD types) or malloc + copy + free (for non-POD types).
new_capacity = max(2 * old_capacity, required_capacity)
if (data_ptr == &inline_buffer)
heap_buf = malloc(new_capacity * elem_size)
memcpy(heap_buf, inline_buffer, size * elem_size)
else
// POD: heap_buf = realloc(data_ptr, new_capacity * elem_size)
// non-POD: heap_buf = malloc(...); copy; free(old)
data_ptr = heap_buf
capacity = new_capacity
Common Inline Capacities
Observed across the codebase:
| Inline capacity | Element size | Total inline bytes | Typical use |
|---|---|---|---|
| 2 | 8 | 16 | SCEV delinearization terms |
| 4 | 8 | 32 | LazyCallGraph SCC lists, basic block worklists |
| 8 | 8 | 64 | NVVMReflect call collection, PHI operand lists |
| 16 | 8 | 128 | AA evaluation pointer sets |
| 22 | 8 | 176 | Printf argument arrays (stack-allocated) |
| 8 | 56 | 448 | SROA slice descriptors |
Builtin Name Table -- Specialized Hash Table
The builtin name table at context+480 is a specialized variant that does not use the standard DenseMap layout. It stores string entries rather than pointers, includes a parallel hash cache, and uses the wyhash function instead of the pointer hash.
Table Structure (20 bytes)
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | bucket_array_ptr |
| +8 | 4 | capacity (power of 2) |
| +12 | 4 | count (live entries) |
| +16 | 4 | tombstone_count |
Memory Layout
[0 .. 8*cap-1] bucket_array: cap QWORD pointers
[8*cap .. 8*cap+7] sentinel: value 2 (end-of-table)
[8*cap+8 .. 8*cap+8+4*cap-1] hash_cache: uint32 per slot
String Entry (heap-allocated via sub_C7D670)
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | string_length |
| +8 | 4 | builtin_id (set after insertion) |
| +16 | N+1 | Null-terminated string data |
Total allocation: length + 17 bytes, 8-byte aligned. The string data offset (16) is stored at hashtable+20 for use during comparison.
See Builtins for the complete 770-entry builtin ID inventory.
Usage Across the Compiler
Subsystems Using DenseMap (pointer hash, -8/-16 sentinels)
- NVVM IR uniquing (sub_162D4F0): 8+ DenseMap instances in the NVVM context object, one per opcode range (0x04--0x1F). Tables at fixed qword-indexed offsets, spaced 32 bytes apart.
- SelectionDAG builder (sub_163D530): Three maps at context offsets +120, +152, +184. Maps A and B are 16-byte-stride (key-value); Set C is 8-byte-stride (keys only).
- Per-node analysis structures: Embedded DenseSet at +72 within analysis objects created during DAG construction.
- Memory space optimization (sub_1C6A6C0): DenseMap-style tables for address space tracking.
Subsystems Using DenseMap (pointer hash, -4096/-8192 sentinels)
- SCEV (sub_F03CD0 and family): Expression caching, range computation, back-edge taken count.
- Register coalescing (sub_1F2F8F0): Already-coalesced set, equivalence class map.
- Block placement (sub_2E3B720): Chain membership, tail-merge candidates.
- SLP vectorizer (sub_1ACCE50): AllOps and Scalars hash tables (32-byte entries).
- StructurizeCFG (sub_1B66CF0): Flow-block mapping, region membership.
- Machine pipeliner (sub_20C40D0): Schedule stage tracking.
- CSSA (sub_3720740): PHI-to-ID mapping.
- Debug/verify (sub_265D050): Instruction validation tables.
- LazyCallGraph (sub_D1A040): Edge membership, SCC identity.
Subsystems Using DenseMap (integer hash key * 37)
- Instruction emitter (sub_2E1F350): Opcode-to-constraint mapping. Sentinels: 0xFFFFFFFF / 0xFFFFFFFE.
- Two-address pass (sub_1F4BFE0): TiedOperandMap (56-byte entries, 4 inline). EqClassMap.
- Vector legalization (sub_3302A00): Type-split record mapping.
- SelectionDAG isel (sub_3090F90): Argument cost table.
Subsystems Using wyhash (string keys)
- Builtin name table (sub_90AEE0): 770 NVVM/CUDA builtin names. Uses the specialized 20-byte table header with hash cache.
- This is the only known use of sub_CBF760 in cicc.
Key Functions
| Function | Address | Size | Role |
|---|---|---|---|
| DenseMap pointer hash | inline | -- | (ptr >> 9) ^ (ptr >> 4) -- always inlined |
| DenseMap integer hash | inline | -- | key * 37 -- always inlined |
| wyhash v4 | sub_CBF760 | ~4 KB | String hash, length-dispatched |
| wyhash wrapper | sub_C92610 | tiny | Tail-calls sub_CBF760 |
| Builtin insert-or-find | sub_C92740 | ~2 KB | Quadratic probe with hash cache |
| Builtin find-only | sub_C92860 | ~1 KB | Read-only variant of sub_C92740 |
| Builtin rehash | sub_C929D0 | ~1 KB | 75% load factor, tombstone compaction |
| Builtin table init | sub_C92620 | tiny | Creates 16-bucket initial table |
| SmallVector::grow | sub_C8D5F0 | ~2 KB | Generic element growth |
| SmallVectorBase::grow_pod | sub_C8D7D0 | ~5 KB | POD-optimized realloc growth |
| SmallVector::grow (MIR) | sub_16CD150 | ~2 KB | Duplicate in MachineIR range |
| SmallPtrSet::insertOrFind | sub_C9A3C0 | ~16 KB | Small pointer set with growth |
| DenseMap grow (LLVM passes) | varies per pass | -- | Each pass has its own inlined or outlined rehash |
Cross-References
- Builtins -- Hash Table and ID Inventory -- complete 770-entry builtin table with wyhash usage
- DenseMap and Symbol Table Structures -- original page (now a subset of this one, kept for EDG node layout)
- NVVM IR Node -- NVVM context object with DenseMap uniquing tables
- CSSA -- PHI hash map with -4096/-8192 sentinels
- Register Coalescing -- integer-key and pointer-key hash map variants
- SLP Vectorizer -- 32-byte-entry DenseMap with -4096/-8192 sentinels
- SCEV -- SCEV expression caching with -4096/-8192 sentinels
- Instruction Emitter -- integer-key hash with key * 37
CoroSplit & CoroFrame: Coroutine Lowering on GPU
cicc v13.0 carries the complete LLVM coroutine lowering pipeline -- CoroEarly, CoroSplit, CoroElide, CoroAnnotationElide, and CoroCleanup -- largely unchanged from upstream LLVM 19. The pass infrastructure processes C++20 co_await/co_yield/co_return coroutines emitted by the EDG 6.6 frontend, splitting a single coroutine function into separate resume, destroy, and cleanup functions while computing a coroutine frame struct to carry live state across suspend points. NVIDIA adds one proprietary intrinsic (llvm.nvvm.coro.create.suspend) and emits a .pragma "coroutine" annotation in PTX, but the core splitting and frame layout algorithms are stock LLVM. The practical constraint is that coroutine frame allocation on GPU defaults to malloc in device heap -- extremely expensive on current architectures -- making CoroElide (which replaces heap allocation with a caller-stack alloca) the pass that determines whether GPU coroutines are viable or pathological.
Key Facts
| Property | Value |
|---|---|
| CoroSplit pass entry | sub_24EF980 (71 KB, address range 0x24EF980--0x24F2300) |
| CoroFrame layout computation | sub_24F6730 (11,249 bytes, stack frame 5,624 bytes) |
| Core frame layout workhorse | sub_24F5860 (called from CoroFrame) |
| createResumeFunction | sub_2284030 |
| createDestroyFunction | sub_2284040 |
| CoroEarly pass | sub_24DCD10 (41 KB) |
| CoroElide pass | sub_24DF350 (80 KB) |
| CoroAnnotationElide pass | sub_24E2340 (33 KB) |
| CoroSplit Cloner/Driver | sub_25CA370 (55 KB) |
| CoroFrame Materializer | sub_25C5C80 (49 KB, heap-to-stack frame layout) |
| CoroFrame Spill Analysis | sub_25C1030 (37 KB) |
| Pass name / debug type | "CoroSplit" / "coro-split" (at 0x4388A37 / 0x4387AC3) |
| Coroutine metadata table | unk_4F8FAE8 |
| Pipeline parser ID | #156 (CGSCC pass, param: reuse-storage) |
| CoroElide pipeline ID | #220 (Function pass) |
| CoroAnnotationElide pipeline ID | #155 (CGSCC pass) |
| CoroEarly pipeline ID | #29 (Module pass) |
| CoroCleanup pipeline ID | #28 (Module pass) |
| NVIDIA intrinsic | llvm.nvvm.coro.create.suspend (single constant integer argument) |
| PTX annotation | .pragma "coroutine"; |
The Coroutine Lowering Pipeline
Five passes run in a fixed sequence across the optimizer pipeline. The first and last are module-level bookends; the middle three do the real work inside the CGSCC (Call Graph SCC) pipeline where inlining decisions interact with coroutine splitting.
CoroEarly (module) Lowers coroutine setup intrinsics.
Materializes the NoopCoro.Frame global.
Replaces llvm.coro.resume, llvm.coro.destroy,
llvm.coro.promise, llvm.coro.free with
concrete operations on the frame pointer.
|
v
CoroSplit (CGSCC) Identifies coroutine functions by scanning for
llvm.coro.suspend / llvm.coro.end intrinsics.
Invokes CoroFrame to compute the frame layout.
Clones the function into resume + destroy variants.
Builds the state machine dispatch switch.
|
v
CoroAnnotationElide (CGSCC) Annotation-driven elision: when the callee is
marked "elide_safe_attr" and the call site has
".noalloc", converts heap alloc to alloca in the
caller's frame. New in LLVM 19 / cicc v13.0.
|
v
CoroElide (function) Classic elision: proves the coroutine frame
lifetime is bounded by the caller, replaces
coro.alloc with alloca. Emits optimization
remarks "'<name>' elided in '<caller>'" or
"'<name>' not elided in '<caller>'".
|
v
CoroCleanup (module) Removes remaining coroutine intrinsic stubs
that survived lowering (e.g., coro.subfn.addr).
Final cleanup pass -- no coroutine intrinsics
survive past this point.
The coro-cond module analysis (registered in the pipeline parser at sub_2337E30) gates whether the coroutine passes activate at all. If no function in the module contains llvm.coro.id, the entire pipeline is skipped. This zero-cost guard is important because the vast majority of CUDA kernels contain no coroutines.
CoroSplit as a CGSCC Pass
CoroSplit is registered as CGSCC pass #156 with an optional reuse-storage parameter. When reuse-storage is active, the pass attempts to reuse the storage of coroutine frames that are provably dead -- relevant for generators where the frame is allocated once and resumed many times. In the CGSCC context, CoroSplit runs alongside the inliner (inline) and function-attrs, allowing newly split resume/destroy functions to be immediately considered for inlining into callers within the same SCC.
CoroSplit: Suspend Point Detection and Function Splitting
Detection Phase
sub_24EF980 iterates over every function in the module. For each function, it scans all instructions using a bitmask-based opcode test to identify coroutine suspension intrinsics:
// Suspend point detection (at 0x24F00E6)
// Stack frame: 0x860+ bytes, callee-saved: r15, r14, r13, r12, rbx
// Key locals:
// [rbp-0x7F8] = outer iteration end pointer
// [rbp-0x7E8] = current coroutine info
// [rbp-0x7E0] = suspend point instruction
// [rbp-0x740] = original coroutine function
// [rbp-0x750] = resume function pointer
// [rbp-0x748] = destroy function pointer
uint8_t opcode = inst->getOpcode();
unsigned normalized = opcode - 0x22;
if (normalized > 51) continue; // not in range [0x22, 0x55]
uint64_t mask = 0x8000000000041ULL;
if (!((mask >> normalized) & 1)) continue; // bit not set
The bitmask 0x8000000000041 encodes three intrinsic opcodes:
| Bit position | Opcode | Intrinsic |
|---|---|---|
| 0 | 0x22 | llvm.coro.suspend -- normal suspend point |
| 6 | 0x28 | llvm.coro.suspend.retcon -- returned-continuation suspend |
| 51 | 0x55 | llvm.coro.end -- coroutine termination |
This single 64-bit bt (bit-test) instruction replaces what would otherwise be a three-way comparison or switch -- the same range-check-plus-bitmask pattern upstream LLVM uses for Intrinsic::ID membership tests.
Validation
After finding a suspend point, CoroSplit validates the coroutine structure (at 0x24F010E):
// Coroutine validation pseudocode (0x24F010E-0x24F0179)
Value *coro_id_inst = ...;
if (coro_id_inst->getOpcode() != 0x55) // must be 'U' = coro.id
goto skip;
Function *parent = coro_id_inst->getParent(); // [rax-20h]
if (!parent || parent->getOpcode() != 0) // entry block check
goto skip;
Value *promise = coro_id_inst->getOperand(4); // [rcx+50h]
if (parent->getContext() != promise) // [rax+18h] == promise
goto skip;
if (!(parent->getFlags() & 0x20)) // "has personality" bit 5 of +0x21
goto skip;
if (parent->getIntrinsicID() != 59) // 0x3B = coro.id
goto skip;
This is a thorough validation ensuring:
- The instruction is indeed llvm.coro.id (opcode 0x55 = 'U', intrinsic ID 59 = 0x3B)
- It belongs to a valid function (parent pointer non-null, starts with opcode 0)
- The promise alloca matches between coro.id and function context
- The function has the correct personality (bit 5 of byte at offset +0x21)
- The intrinsic ID equals 59 (cmp dword [rax+24h], 0x3B)
Nested coroutines receive additional validation (at 0x24F017F): the pass checks that coro.begin (opcode range 0x1E--0x28, ID 57 = 0x39) references the correct parent function, preventing cross-coroutine confusion when one coroutine is nested inside another.
// Nested coroutine check (0x24F017F-0x24F01D6)
unsigned operand_count = inst->getNumOperands() & 0x7FFFFFF; // mask out type bits
Value *parent_ref = inst->getOperand(-operand_count); // computed offset
if (parent_ref != current_function)
goto skip; // different coroutine -- do not cross wires
uint8_t begin_opcode = begin_inst->getOpcode();
if (begin_opcode - 0x1E > 0x0A) // must be in [0x1E, 0x28]
goto skip; // not a coro.begin-related instruction
Value *frame_ptr = begin_inst->getOperand(2); // [rdx+28h]
Suspend Point Collection
Validated suspend points are collected into a deduplicated array. The dedup check at 0x24F02F9 scans existing entries, following def-use chains ([rbx+10h]) to avoid processing the same suspend point twice when multiple CFG paths reach it. For each suspend point, the pass extracts the value operand at instruction offset +0x28.
// Suspend point collection with dedup (0x24F02F9-0x24F040A)
unsigned count = suspend_array_size;
for (unsigned i = 0; i < count; i++) {
if (suspend_array[i] == new_suspend)
goto already_collected; // follow chain: [rbx+10h]
}
// Extract value operand:
Value *value_operand = suspend_inst->getOperand(2); // [rdx+28h]
suspend_array[count++] = new_suspend;
The Split Algorithm
After collecting all suspend points, the split proceeds in three phases:
Phase 1: Frame layout computation. CoroSplit invokes sub_24F6730 (CoroFrame) to determine which SSA values are live across suspend points and must be stored in the frame struct (see the CoroFrame section below).
Phase 2: Function cloning and specialization. The split mode field at [rbp-0x3F8] controls which function variants are created:
// Function splitting dispatch (at 0x24F0540)
int split_mode = frame_state->split_mode; // [rbp-0x3F8]
if (split_mode == 0) {
// Returned-continuation style: destroy function only
Function *destroy = createDestroyFunction(state, orig_fn, suspends, ...);
} else if (split_mode >= 1 && split_mode <= 3) {
// Standard C++20 coroutine: both resume and destroy
Function *resume = sub_2284030(state, orig_fn, suspends, coro_info,
destroy_data, resume_data);
Function *destroy = sub_2284040(state, orig_fn, suspends, coro_info,
destroy_data, resume_data);
}
sub_2284030 (createResumeFunction) and sub_2284040 (createDestroyFunction) each:
- Clone the original coroutine function via sub_D2E510 (function cloner)
- Replace the coroutine frame parameter with a typed pointer to the frame struct
- Insert a switch statement at the entry block dispatching on the suspend index stored in the frame (__coro_index)
- Replace each llvm.coro.suspend with a return instruction
- Wire function pointers (__resume_fn, __destroy_fn) into the frame header at offsets +0x00 and +0x08
Phase 3: Metadata and remark emission. After splitting, the pass registers the new functions in the coroutine metadata table at unk_4F8FAE8 via sub_BC1CD0, then emits an optimization remark:
// Remark emission (0x24F05D1-0x24F06E8)
sub_B17560(remark, "CoroSplit", "coro-split"); // create remark
sub_B18290(remark, "Split '"); // prefix
sub_BD5D20(orig_fn, name_buf); // get function name
sub_B16430(remark, "function", name_buf); // named attribute
sub_B18290(remark, "' (frame_size=");
sub_B16B10(remark, "frame_size", frame_size); // integer attribute
sub_B18290(remark, ", align=");
unsigned align = 1u << alignment_log2;
sub_B16B10(remark, "align", align);
sub_B18290(remark, ")");
sub_1049740(remark); // publish to diagnostic handler
The format is: Split '<function_name>' (frame_size=N, align=M) where N is the computed frame size in bytes and M is 1 << alignment_log2.
The .corodispatch Trampoline
The CoroSplit dispatcher at sub_3160A60 (48 KB, second code cluster) generates a .corodispatch function -- a lightweight trampoline that:
- Loads __coro_index from the coroutine frame at offset +0x10
- Switches on the index value to select the correct resume point
- Uses musttail call semantics to jump to the target without growing the stack
The string "MustTailCall.Before.CoroEnd" confirms it enforces musttail on the final resume-to-end transition. Additional strings in this function include ".from." (used to construct the dispatch label name), "CoroEnd", "CoroSave", and "CoroSuspend" (marking the IR structures being dispatched through).
For GPU targets, the musttail semantics are critical: stack space is per-thread local memory, and growing it across coroutine bounces would rapidly exhaust the limited local memory budget.
CoroFrame: Frame Layout Computation
sub_24F6730 is the largest and most complex function in the coroutine pipeline, with a 5,624-byte stack frame (0x15F8) -- one of the largest in the entire cicc binary. Its job: determine which SSA values are live across suspend points and must be "spilled" into the coroutine frame struct.
Algorithm Overview
The algorithm is a BFS-based cross-suspend-point liveness analysis:
1. Initialize tracking structures. Two hash tables with 16-byte entries, sentinel 0xFFFFFFFFFFFFF000, hash function (val >> 4) ^ (val >> 9). Initial capacity 8 entries each.
2. Iterate all instructions. Walk every basic block and instruction. A visitor callback ([visitor+18h], virtual call) classifies each instruction as relevant or not to the frame computation.
3. BFS traversal. A deque with 512-byte blocks (64 pointer-sized entries per block) drives BFS over the CFG. The core computation at sub_24F5860 determines which values cross which suspend points.
4. Spill set computation. Values that are defined before a suspend point and used after it must be stored in the frame. The result is a set of (value, suspend_point) pairs.
5. Frame layout. The frame type builder (at sub_3169200 in the second code cluster) arranges spill slots into a struct.
Frame Struct Layout
The coroutine frame is a flat C struct with a fixed header followed by computed spill slots:
struct __coro_frame { // type name: ".coro_frame_ty"
void (*__resume_fn)(struct __coro_frame *); // +0x00 resume function pointer
void (*__destroy_fn)(struct __coro_frame *); // +0x08 destroy function pointer
uint32_t __coro_index; // +0x10 suspend point state variable
// --- header ends, spill slots begin ---
// padding for alignment (computed per-coroutine)
// spill slots ordered by descending alignment requirement
// promise storage (if promise_type is non-trivial)
// alloca copies (stack variables that survive suspend)
};
The frame variable is named "__coro_frame" and the type is ".coro_frame_ty". The suspend point index field "__coro_index" is the state variable for the resume switch dispatch: value 0 means "initial entry", value N means "resumed at suspend point N", and a poison/unreachable value means "coroutine has returned".
The frame type builder at sub_3169200 (46 KB) constructs the StructType using these rules:
- The two function pointers (__resume_fn, __destroy_fn) always occupy the first 16 bytes
- __coro_index occupies bytes 16--19 (i32)
- Remaining spill slots are sorted by alignment (largest first) to minimize padding
- The promise alloca (if present) is placed at a known offset so llvm.coro.promise can compute it
- Total frame size and alignment are recorded for the split remark
Spill/Reload Code Generation
The spill/reload generator at sub_31650D0 (47 KB) creates the actual load/store instructions that move values between SSA registers and the coroutine frame:
- A basic block named "AllocaSpillBB" is inserted at the function entry. All alloca instructions that need to survive across suspend points are moved here and replaced with GEP+store into the frame.
- A basic block named "PostSpill" follows, branching to the original entry logic.
- At each suspend point, ".spill.addr" store instructions write live SSA values into their frame slots.
- After each resume point, ".reload" load instructions fetch values back from frame slots into fresh SSA values.
The naming convention (.spill.addr, .reload) is important for debugging: these instructions appear in -print-after-all dumps and identify coroutine frame traffic distinctly from normal loads/stores.
Detailed BFS Liveness Algorithm
// Pseudocode for sub_24F5860 core frame computation
void computeFrameLayout(Function *F, SmallVector<SuspendPoint> &suspends) {
// Step 1: Build definition map
  DenseMap<Value*, uint32_t> def_map;   // sentinel 0xFFFFFFFFFFFFF000
  DenseMap<Value*, uint32_t> cross_map; // sentinel 0xFFFFFFFFFFFFF000
// Step 2: Walk all basic blocks, identify definitions
for (BasicBlock &BB : *F) {
for (Instruction &I : BB) {
if (visitor->isRelevant(&I)) // virtual call [visitor+18h]
def_map.insert(&I, generation++);
}
}
// Step 3: For each suspend point, BFS forward to find uses
Deque<BasicBlock*> worklist; // 512-byte blocks, 64 entries each
for (SuspendPoint &SP : suspends) {
worklist.clear();
worklist.push_back(SP.getParent());
while (!worklist.empty()) {
BasicBlock *BB = worklist.pop_front();
for (Instruction &I : *BB) {
for (Value *Op : I.operands()) {
if (def_map.count(Op) && def_before_suspend(Op, SP)) {
// This value is defined before SP and used after it
cross_map.insert({Op, SP.getIndex()});
spill_set.add(Op);
}
}
}
for (BasicBlock *Succ : successors(BB))
worklist.push_back(Succ);
}
}
// Step 4: Build frame struct from spill set
// Sort spill slots by alignment (descending) then by size
// Compute offsets, padding, total frame size
}
The liveness phase is O(instructions × suspend_points) per coroutine; each BFS is O(V + E), where V is the number of basic blocks and E the number of CFG edges.
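The Step 4 slot-packing can be simulated in a few lines. This is a hedged Python sketch, not cicc's code: the 16-byte header modeling the resume/destroy function pointers that lead a switch-lowered coroutine frame is an assumption about the prologue, and the sort key follows the "alignment descending, then size" rule stated in the pseudocode.

```python
def align_up(n, a):
    """Round n up to the next multiple of a (a must be a power of two)."""
    return (n + a - 1) & ~(a - 1)

def layout_frame(slots, header_size=16, header_align=8):
    """slots: list of (name, size, align) spill entries.
    Returns (offsets, total_size, frame_align). The header models the
    resume/destroy function pointers assumed to lead the frame."""
    # Sort by alignment (descending), then by size (descending)
    ordered = sorted(slots, key=lambda s: (-s[2], -s[1]))
    offsets = {}
    off = header_size
    max_align = header_align
    for name, size, align in ordered:
        off = align_up(off, align)     # insert padding as needed
        offsets[name] = off
        off += size
        max_align = max(max_align, align)
    return offsets, align_up(off, max_align), max_align
```

Packing the highest-alignment slots first minimizes padding, which is why the frame computation sorts before assigning offsets.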
Data Structures
Frame info (0x138 = 312 bytes, allocated via sub_22077B0):
| Offset | Size | Description |
|---|---|---|
| +0x00 | 8 | Spill array pointer |
| +0x08 | 8 | Reserved (initially 0) |
| +0x10 | 8 | Reference count (initially 1) |
| +0x18--+0x98 | 128 | Embedded hash table for spill tracking (16-byte stride, sentinel-filled) |
| +0x98 | 8 | Pointer to inner table (self-referential) |
| +0xA0 | 8 | Capacity encoding (0x800000000) |
| +0x128 | 8 | Back-reference to visitor context |
| +0x130 | 8 | Back-reference to suspend point array |
Spill entry (0x48 = 72 bytes):
| Offset | Size | Description |
|---|---|---|
| +0x00 | 8 | Coroutine function pointer |
| +0x08 | 8 | Buffer pointer (inline or heap) |
| +0x10 | 8 | Capacity encoding (6 entries inline) |
| +0x18--+0x48 | 48 | Inline buffer for small spill sets |
The inline buffer holds up to 6 spill entries without heap allocation. When exceeded, the buffer externalizes to the heap; cleanup at 0x24F6CB0 checks [entry+0x08] against [entry+0x18] to determine whether free() is needed.
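The externalization behavior is a classic small-buffer optimization. A minimal Python sketch of the semantics (the real structure stores raw pointers; `needs_free()` mirrors the pointer comparison at 0x24F6CB0, and the class name is invented for illustration):

```python
class SpillSet:
    """Models the 0x48-byte spill entry: up to 6 entries live in the
    inline buffer; a 7th forces externalization to heap storage."""
    INLINE_CAPACITY = 6

    def __init__(self):
        self._inline = []   # models the 48-byte inline buffer at +0x18
        self._heap = None   # models a heap-allocated buffer

    def add(self, value):
        if self._heap is not None:
            self._heap.append(value)
        elif len(self._inline) < self.INLINE_CAPACITY:
            self._inline.append(value)
        else:
            # Externalize: copy inline entries out, buffer pointer now
            # diverges from the inline storage address
            self._heap = self._inline + [value]
            self._inline = []

    def needs_free(self):
        # Analogous to: [entry+0x08] != [entry+0x18]
        return self._heap is not None

    def __len__(self):
        return len(self._heap) if self._heap is not None else len(self._inline)
```

The design avoids a heap allocation for the common case of small spill sets, which matters because frame computation runs once per suspend point.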
BFS deque:
| Parameter | Value |
|---|---|
| Block map allocation | 0x40 bytes (8 pointers) |
| Data block allocation | 0x200 bytes (512 bytes, 64 pointer entries) |
| Block pointers | [rbp-0x340]=front, [rbp-0x338]=count(8), [rbp-0x330]=begin |
Hash Table Policy
Both hash tables in CoroFrame share identical parameters (see hash-infrastructure.md for the universal pattern):
- Hash function: (val >> 4) ^ (val >> 9) -- the same hash used throughout cicc
- Entry size: 16 bytes (8-byte key + 8-byte metadata)
- Empty sentinel: 0xFFFFFFFFF000
- Load factor threshold: 75% (triggers growth when count * 4 >= capacity * 3)
- Tombstone cleanup: 12.5% (rehash when tombstones > capacity >> 3)
- Growth factor: 2x (capacity doubles on each growth)
- Collision resolution: linear probing
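The policy above is compact enough to model end to end. This is a self-contained Python sketch, not the binary's implementation: Python sentinels stand in for 0xFFFFFFFFF000, and keys are plain integers rather than 16-byte key/metadata pairs.

```python
EMPTY, TOMBSTONE = object(), object()

class ProbeTable:
    """Open-addressing table with the documented policy: (v>>4)^(v>>9) hash,
    linear probing, 2x growth at 75% load, rehash at 12.5% tombstones."""

    def __init__(self, capacity=16):
        self.cap = capacity                 # power of two
        self.slots = [EMPTY] * capacity
        self.count = 0
        self.tombstones = 0

    @staticmethod
    def _hash(val):
        return (val >> 4) ^ (val >> 9)

    def _find(self, key):
        """Linear probe; returns (slot_index, found)."""
        i = self._hash(key) & (self.cap - 1)
        first_dead = None
        while True:
            s = self.slots[i]
            if s is EMPTY:
                return (first_dead if first_dead is not None else i), False
            if s is TOMBSTONE:
                if first_dead is None:
                    first_dead = i
            elif s == key:
                return i, True
            i = (i + 1) & (self.cap - 1)

    def insert(self, key):
        idx, found = self._find(key)
        if found:
            return
        if self.slots[idx] is TOMBSTONE:
            self.tombstones -= 1
        self.slots[idx] = key
        self.count += 1
        if self.count * 4 >= self.cap * 3:      # 75% load -> double
            self._rehash(self.cap * 2)
        elif self.tombstones > (self.cap >> 3):  # 12.5% tombstones
            self._rehash(self.cap)

    def remove(self, key):
        idx, found = self._find(key)
        if found:
            self.slots[idx] = TOMBSTONE
            self.count -= 1
            self.tombstones += 1

    def _rehash(self, new_cap):
        live = [s for s in self.slots if s is not EMPTY and s is not TOMBSTONE]
        self.cap, self.slots = new_cap, [EMPTY] * new_cap
        self.count = self.tombstones = 0
        for k in live:
            self.insert(k)

    def __contains__(self, key):
        return self._find(key)[1]
```

Note that the right-shift hash clusters keys that share high bits into adjacent buckets; linear probing tolerates this at a 75% cap because probe chains stay short.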
GPU-Specific Constraints: The Heap Allocation Problem
Why Device Malloc Is Pathological
Standard LLVM coroutines allocate the frame on the heap via operator new (with get_return_object_on_allocation_failure as the fallback when a nothrow allocation returns null). On GPU, this calls into device-side malloc, which has severe limitations:
Fixed-size heap. The device heap is controlled by cudaLimitMallocHeapSize (default 8 MB across the entire GPU). A kernel launching 65,536 threads, each with a 256-byte coroutine frame, requires 16 MB of heap -- already exceeding the default. Increasing the limit helps, but the heap must be pre-allocated before kernel launch, wasting memory for non-coroutine workloads.
Serialized allocation. Device malloc implementation uses a global free list protected by atomics. Within a warp, threads attempting simultaneous allocation serialize on this atomic. Across warps on the same SM, L2 cache line bouncing on the free-list head pointer creates further contention. Under heavy allocation pressure (hundreds of concurrent warps), the effective throughput of device malloc can drop to single-digit allocations per microsecond -- three orders of magnitude slower than a register read.
Fragmentation under concurrency. Thousands of threads allocating and freeing small frames (64--512 bytes) rapidly fragment the device heap. The device allocator does not perform compaction. Once fragmented, even a heap with sufficient total free space may fail individual allocations, causing malloc to return nullptr and triggering coroutine allocation failure paths (if the user provided get_return_object_on_allocation_failure) or program termination.
Memory latency hierarchy. The cost difference between frame locations is dramatic:
| Location | Latency | Bandwidth per SM | Notes |
|---|---|---|---|
| Registers | 0 cycles | N/A (direct) | Best case -- values that don't cross suspends |
| Local memory (L1 hit) | ~28 cycles | ~12 TB/s | Stack alloca destination after CoroElide |
| Local memory (L1 miss, L2 hit) | ~200 cycles | ~3 TB/s | Large frames that spill L1 |
| Global memory (device heap) | ~400-800 cycles | ~1 TB/s | Default without CoroElide |
| Device malloc overhead | ~2000+ cycles | N/A | Free-list atomic contention |
The combined overhead of malloc latency + global memory access latency makes un-elided coroutines 50--100x slower than elided ones on GPU. This is the fundamental reason CoroElide is the most performance-critical coroutine optimization for GPU targets.
CoroElide: The GPU Escape Analysis
sub_24DF350 (80 KB -- the largest coroutine pass) implements the classic heap allocation elision. It runs as a function-level pass (#220 in the pipeline parser), meaning it analyzes each caller individually after CoroSplit has already split the coroutine.
Elision Preconditions
For each llvm.coro.id call site in the caller, CoroElide attempts to prove that:
- No handle escape. The coroutine handle (pointer to __coro_frame) does not escape the caller's scope. Specifically, the handle is not stored to memory visible to other threads, not passed to functions that might store it, and not returned from the caller. On GPU, the "visible to other threads" criterion is complicated by shared memory (addrspace(3)) and generic address space (addrspace(0)) casts -- a handle stored through a generic pointer could be visible to any thread.
- No external aliases. No alias of the handle is created that could outlive the caller. This includes GEPs into the frame, bitcasts, and pointer arithmetic. The alias analysis at this stage uses the results from the function-level AA pipeline.
- Full consumption. All suspend/resume/destroy calls on this coroutine handle are within the caller function. If the handle is passed to a helper function that calls coroutine_handle::resume(), the coroutine is not fully consumed from CoroElide's perspective (unless that helper was inlined first by the CGSCC inliner running in the same SCC iteration).
- Callee identity known. The coroutine callee must be identifiable (not an indirect call through a function pointer). CoroElide needs to read the callee's frame size and alignment from the split remark metadata to size the alloca correctly.
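The four preconditions can be phrased as a predicate over the handle's uses. This toy Python model is purely illustrative: the use-kind tags are invented names, whereas cicc's escape analysis walks actual IR use chains.

```python
# Invented use-kind tags for illustration; cicc analyzes IR uses directly.
ESCAPING = {
    "store_to_memory",            # handle written somewhere visible
    "pass_to_unknown_fn",         # callee might stash the handle
    "return_from_caller",         # handle outlives the caller
    "cast_to_generic_addrspace",  # GPU-specific: could alias shared/global
    "store_to_shared_memory",     # GPU-specific: visible to the whole CTA
}

def can_elide(uses, callee_known=True):
    """uses: list of (kind, detail) pairs describing every use of the
    coroutine handle in the caller. True only if all preconditions hold."""
    if not callee_known:          # precondition 4: need frame size/align
        return False
    destroyed_locally = False
    for kind, _detail in uses:
        if kind in ESCAPING:      # preconditions 1-2: no escape, no alias
            return False
        if kind == "destroy":
            destroyed_locally = True
    return destroyed_locally      # precondition 3: fully consumed here
```

The model makes the GPU-specific failure visible: a single generic-address-space cast flips the result to False even when every resume/destroy is local.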
The Elision Transformation
When all preconditions are satisfied, CoroElide performs this rewrite:
```llvm
; BEFORE elision (caller code):
%id = call token @llvm.coro.id(i32 0, ptr null, ptr null, ptr null)
%need = call i1 @llvm.coro.alloc(token %id)
br i1 %need, label %alloc, label %begin

alloc:
  %mem = call ptr @operator_new(i64 FRAME_SIZE)   ; <-- heap allocation
  br label %begin

begin:
  %phi = phi ptr [ %mem, %alloc ], [ null, %entry ]
  %hdl = call ptr @llvm.coro.begin(token %id, ptr %phi)
  ; ... use coroutine ...
  call void @llvm.coro.resume(ptr %hdl)
  call void @llvm.coro.destroy(ptr %hdl)
```

```llvm
; AFTER elision:
%frame = alloca [FRAME_SIZE x i8], align FRAME_ALIGN   ; <-- stack allocation
%hdl = call ptr @llvm.coro.begin(token %id, ptr %frame)
; ... use coroutine ...
call void @llvm.coro.resume(ptr %hdl)
; destroy is elided (frame on stack, automatically freed)
```
The key changes:
- llvm.coro.alloc is replaced with false (allocation not needed)
- The operator new call is deleted
- An alloca of the correct size and alignment is inserted in the caller's entry block
- The coro.begin now points at the stack alloca
- llvm.coro.free is replaced with a no-op (stack memory does not need explicit deallocation)
- The destroy function call may be simplified since stack deallocation is automatic
On NVPTX, the alloca maps to per-thread local memory (address space 5). Local memory accesses go through the L1 cache and are dramatically faster than device malloc followed by global memory access.
Elision Failure Modes on GPU
Several GPU-specific patterns defeat CoroElide:
- Generic address space cast. If the coroutine handle is cast to addrspace(0) (generic), the compiler cannot prove it stays in local memory. Generic pointers are indistinguishable from shared or global pointers at the IR level, so the escape analysis conservatively assumes the handle escapes.
- Coroutine handle in shared memory. Storing the handle to addrspace(3) (shared memory) makes it visible to all threads in the CTA. Even if the programmer knows only one thread uses it, CoroElide cannot prove this.
- Cross-function resume. A common pattern where the coroutine is created in one device function and resumed in another (e.g., a scheduler loop calling resume on handles from a queue). The handle passed as a function argument escapes the creator.
- Opaque allocator. If the coroutine uses a custom allocator (via promise_type::operator new), CoroElide may not recognize the allocation/deallocation pattern.
Diagnostic Output
CoroElide emits remarks through the standard optimization remark infrastructure:
- Success: "'<coroutine_name>' elided in '<caller_name>'" (via -Rpass=coro-elide)
- Failure: "'<coroutine_name>' not elided in '<caller_name>'" (via -Rpass-missed=coro-elide)
For GPU developers, the failure remark is the most important diagnostic. An un-elided coroutine on GPU is a performance disaster. The recommended debugging workflow:
nvcc -Xptxas -v --compiler-options="-Rpass-missed=coro-elide" foo.cu
CoroAnnotationElide: Developer-Asserted Elision
sub_24E2340 (33 KB) is the newer annotation-driven elision from LLVM 19. It looks for the "elide_safe_attr" function attribute and ".noalloc" suffix on coroutine function names. When both are present, elision proceeds without the full escape analysis -- the developer has asserted safety.
This is particularly useful for GPU code where the developer knows the coroutine is single-thread-scoped but the compiler cannot prove it due to pointer-to-generic-address-space casts. The "caller_presplit" attribute marks the caller as needing coroutine lowering, enabling the annotation elide pass to fire during the CGSCC iteration before the caller itself is split.
CoroAnnotationElide runs as CGSCC pass #155, meaning it fires before CoroSplit (#156) in the same CGSCC iteration. This ordering allows the annotation-based elision to rewrite allocation sites before CoroSplit performs the split, avoiding the need for a second pass.
The llvm.nvvm.coro.create.suspend Intrinsic
This is the sole NVIDIA-proprietary coroutine intrinsic. The NVVM verifier enforces:
"llvm.nvvm.coro.create.suspend must have exactly one argument, which must be a constant integer"
The constant integer argument likely encodes a suspend-point identifier or mode. This intrinsic appears in the NVVM intrinsic table alongside llvm.nvvm.stacksave and llvm.nvvm.stackrestore, suggesting it interacts with the local memory stack for frame placement. Its exact lowering is handled by the NVVM-specific intrinsic lowering pass rather than the standard CoroSplit pipeline.
PTX .pragma "coroutine"
The AsmPrinter (documented in asmprinter.md) optionally emits .pragma "coroutine"; in the function header. This is triggered by metadata nodes with type byte 'N' (0x4E) linked to the current function via the list at this+792. The pragma is the first thing emitted in the function prologue (step (a) in the PTX header emission sequence at sub_215A3C0), before even the .entry/.func keyword.
The pragma signals to ptxas that the function uses coroutine semantics, potentially affecting register allocation and scheduling decisions in the assembler. The exact ptxas behavior triggered by this pragma is not documented publicly, but it likely increases the local memory budget and adjusts the register allocation heuristics for the state-machine dispatch pattern.
Warp Divergence at Suspend Points
A fundamental tension exists between SIMT execution and coroutine suspend. When one thread in a warp suspends while others do not, the warp diverges. The resume dispatch switch (the __coro_index-based state machine) creates a divergence point: threads may be at different suspend indices, requiring the hardware to serialize execution paths. This is identical to how any data-dependent branch causes divergence, but the impact is amplified because coroutine state machines typically have many switch cases (one per suspend point).
The StructurizeCFG pass (see structurizecfg.md) runs after coroutine lowering and will structurize the resume switch, potentially introducing additional control flow to manage reconvergence. On SM 70+ architectures with Independent Thread Scheduling, diverged threads can reconverge at any point, but the switch still introduces warp-level serialization proportional to the number of distinct __coro_index values active within the warp.
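The "serialization proportional to distinct index values" claim lends itself to a toy cost model. This sketch assumes exactly one serialized pass through the switch per distinct __coro_index active in the warp, which is a simplifying assumption, not measured hardware behavior:

```python
def resume_switch_passes(coro_indices):
    """coro_indices: the __coro_index value of each active thread in a warp.
    Under the one-pass-per-distinct-index assumption, the warp serializes
    into this many execution passes through the resume dispatch switch."""
    return len(set(coro_indices))
```

A fully converged warp (all 32 threads at the same suspend point) pays one pass; a warp whose threads are spread across 32 different suspend points pays the worst case of 32.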
The Second Code Cluster (0x3150000 Region)
The binary contains a second, independent cluster of coroutine functions, likely from a different compilation unit or LTO merge:
| Function | Address | Size |
|---|---|---|
| CoroFrame layout computation | 0x3171DA0 | 55 KB |
| CoroSplit splitting logic | 0x316D160 | 49 KB |
| CoroSplit dispatcher (.corodispatch, MustTailCall.Before.CoroEnd) | 0x3160A60 | 48 KB |
| Spill/reload generation (AllocaSpillBB, PostSpill, .reload, .spill.addr) | 0x31650D0 | 47 KB |
| Frame type builder (__coro_frame, .coro_frame_ty, __coro_index) | 0x3169200 | 46 KB |
| CoroElide heap allocation elision | 0x315A7B0 | 41 KB |
| Attributor analysis helper | 0x3150D70 | 43 KB |
| Attributor analysis helper | 0x314DBB0 | 40 KB |
These functions reference the same string literals and implement the same algorithms as the primary cluster. The primary cluster at 0x24D--0x25C and this cluster at 0x314--0x317 are structurally identical -- they differ only in binary address due to compilation unit or LTO merge ordering.
Additionally, three helper functions in the primary cluster's vicinity handle specialized aspects:
| Function | Address | Size |
|---|---|---|
| CoroSplit Cloner/Driver (calls CoroFrame helpers) | sub_25CA370 | 55 KB |
| CoroFrame Materializer (heap-to-stack frame layout) | sub_25C5C80 | 49 KB |
| CoroFrame Spill Analysis helper | sub_25C1030 | 37 KB |
sub_25C5C80 (CoroFrame Materializer) is particularly relevant: this is the function that actually rewrites the IR to replace heap allocation with stack-based frame placement after CoroElide has proven safety. It materializes the frame struct type, inserts the alloca, and rewires all frame access GEPs.
Error Conditions in the Second Cluster
The CoroSplit implementation at 0x316D160 emits two diagnostic errors:
- "Coroutines cannot handle non static allocas yet" -- triggered when a coroutine body contains a VLA (variable-length array) or alloca() with a dynamic size. The frame layout computation requires compile-time-known sizes for all frame slots. Dynamic allocas would require a separate heap allocation per suspend-resume cycle.
- "alignment requirement of frame variables" -- triggered when a spill slot requires alignment exceeding the frame's maximum supported alignment. This can occur with over-aligned types (e.g., alignas(256) variables that must survive across suspends).
The CoroFrame at 0x3171DA0 emits:
- "token definition separated from use by suspend point" -- a fatal error when an LLVM token value (which cannot be stored to memory) crosses a suspend boundary. Tokens are used for exception handling state and musttail call tracking; they are inherently non-materializable.
- "Unable to handle alias with unknown offset before CoroBegin" -- triggered when a GEP with a non-constant offset operates on a value computed before coro.begin. The frame layout computation needs constant offsets to compute spill slot positions.
EDG Frontend Support
The EDG 6.6 frontend fully implements C++20 coroutine semantics in two key functions:
- sub_87AFA0 (14 KB) -- Coroutine body processor. Resolves promise_type methods: initial_suspend, final_suspend, unhandled_exception, get_return_object, get_return_object_on_allocation_failure. Generates the coroutine body scaffolding including the implicit try-catch around user code.
- sub_87BD00 (6 KB) -- Coroutine trait resolver. Looks up std::coroutine_traits<R, Args...>::promise_type, std::coroutine_handle, return_value, return_void. The EDG IL walker maps these as IL node type 64 (il_coroutine), with expression sub-type 0x21 (coroutine_expr). The IL copier handles coroutine handles as entity type 72 (coroutine_handle).
The frontend does not restrict coroutines to host-side code. The EDG configuration sets COROUTINE_ENABLING_POSSIBLE = 1 globally, meaning __device__ functions can be coroutines. The full coroutine IR (with llvm.coro.id, llvm.coro.begin, llvm.coro.suspend, etc.) flows into the NVVM optimizer pipeline regardless of the function's execution space.
Diagnostic Strings
| String | Location | Meaning |
|---|---|---|
"Split '<name>' (frame_size=N, align=M)" | CoroSplit remark | Successful coroutine split |
"' elided in '" | CoroElide | Frame allocation replaced with alloca |
"' not elided in '" | CoroElide | Elision failed, heap allocation remains |
"Coroutines cannot handle non static allocas yet" | 0x316D160 | VLA or dynamic alloca inside coroutine body |
"alignment requirement of frame variables" | 0x316D160 | Frame alignment constraint exceeded |
"token definition separated from use by suspend point" | 0x3171DA0 | Token value crosses suspend boundary (error) |
"Unable to handle alias with unknown offset before CoroBegin" | 0x3171DA0 | GEP with non-constant offset on pre-begin alias |
"llvm.nvvm.coro.create.suspend must have exactly one argument, which must be a constant integer" | NVVM verifier | Malformed NVIDIA coroutine intrinsic |
"AllocaSpillBB" | 0x31650D0 | Entry block for spill alloca instructions |
"PostSpill" | 0x31650D0 | Block following spill setup |
".spill.addr" | 0x31650D0 | Store to coroutine frame slot |
".reload" | 0x31650D0 | Load from coroutine frame slot after resume |
".corodispatch" | 0x3160A60 | Dispatch trampoline function name |
"MustTailCall.Before.CoroEnd" | 0x3160A60 | Musttail semantics on final transition |
".from." | 0x3160A60 | Dispatch label name construction |
"NoopCoro.Frame" | 0x24DCD10 | Global no-op coroutine frame (CoroEarly) |
"caller_presplit" | 0x24E2340 | Attribute marking pre-split caller |
"elide_safe_attr" | 0x24E2340 | Attribute asserting elision safety |
".noalloc" | 0x24E2340 | Function name suffix for annotation elide |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| CoroEarly pass entry | sub_24DCD10 | 41 KB | -- |
| CoroElide pass entry | sub_24DF350 | 80 KB | -- |
| CoroAnnotationElide pass entry | sub_24E2340 | 33 KB | -- |
| CoroSplit pass entry | sub_24EF980 | 71 KB | -- |
| Core frame layout computation | sub_24F5860 | -- | -- |
| CoroFrame layout entry | sub_24F6730 | 11 KB | -- |
| CoroFrame Spill Analysis helper | sub_25C1030 | 37 KB | -- |
| CoroFrame Materializer (heap-to-stack) | sub_25C5C80 | 49 KB | -- |
| CoroSplit Cloner/Driver | sub_25CA370 | 55 KB | -- |
| createResumeFunction | sub_2284030 | -- | -- |
| createDestroyFunction | sub_2284040 | -- | -- |
| Function cloner (used for resume/destroy) | sub_D2E510 | -- | -- |
| Frame-already-computed check | sub_B2D610 | -- | -- |
| Get function name string | sub_BD5D20 | -- | -- |
| Register in coroutine metadata table | sub_BC1CD0 | -- | -- |
| Create optimization remark | sub_B17560 | -- | -- |
| Publish remark to diagnostic handler | sub_1049740 | -- | -- |
| Allocator (frame info, spill entries, BFS deque) | sub_22077B0 | -- | -- |
| coro-cond module analysis checker | sub_2337E30 | 15 KB | -- |
| Attributor helper (coroutine attributes) | sub_314DBB0 | 40 KB | -- |
| Attributor helper (coroutine attributes) | sub_3150D70 | 43 KB | -- |
| CoroElide (second cluster) | sub_315A7B0 | 41 KB | -- |
| CoroSplit dispatcher (.corodispatch) | sub_3160A60 | 48 KB | -- |
| Spill/reload generation | sub_31650D0 | 47 KB | -- |
| Frame type builder | sub_3169200 | 46 KB | -- |
| CoroSplit splitting logic (second cluster) | sub_316D160 | 49 KB | -- |
| CoroFrame layout (second cluster) | sub_3171DA0 | 55 KB | -- |
| EDG coroutine body processor | sub_87AFA0 | 14 KB | -- |
| EDG coroutine trait resolver | sub_87BD00 | 6 KB | -- |
Cross-References
- Pipeline & Ordering -- where coroutine passes sit in the optimization sequence
- SROA -- SROA interacts with coroutine frame allocas; decomposes aggregate allocas into scalar SSA values
- AsmPrinter & PTX Body Emission -- .pragma "coroutine" emission
- Inliner Cost Model -- inlining decisions for split resume/destroy functions
- StructurizeCFG -- structurizes the resume dispatch switch
- Hash Infrastructure -- universal DenseMap pattern used by CoroFrame
- Diagnostics & Optimization Remarks -- remark emission protocol
- Address Spaces -- local (5), shared (3), generic (0) spaces relevant to elision
OpenMP Runtime Declaration Table
cicc embeds a 194-entry table of OpenMP runtime function declarations at sub_312CF50 (0x312CF50, 117 KB decompiled). This single function is the authoritative source for every __kmpc_*, omp_*, and __tgt_* device-runtime call the compiler can emit into NVPTX IR. It defines the complete ABI contract between compiler-generated GPU code and the OpenMP device runtime library (libomptarget / libomp). The function takes an integer case index (0--193), constructs the corresponding FunctionType, checks whether the symbol already exists in the module via Module::getNamedValue, and if absent, creates a Function::Create with ExternalLinkage. The result is registered into a context-local array so that any later codegen pass can reference a runtime function by its numeric index without reconstructing the type.
Upstream LLVM defines the same runtime function set declaratively in llvm/include/llvm/Frontend/OpenMP/OMPKinds.def using the __OMP_RTL macro, which the OMPIRBuilder expands at construction time. cicc's table is a procedural equivalent: a giant switch(a3) with 194 cases that does exactly what OMPKinds.def + OMPIRBuilder::initialize() do, but compiled into the binary rather than generated from a .def file. The ordering of cases 0--193 matches the upstream OMPRTL_ enum one-to-one, confirming that cicc v13.0 tracks LLVM 18.x's OpenMP runtime interface.
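The get-or-create protocol of this table is simple enough to model. In this hedged Python sketch, Module and Function are stand-ins for the LLVM classes, only four of the 194 entries are reproduced (names and signatures taken from the tables on this page), and "attr26" labels the special post-creation attribute:

```python
# Subset of the 194-entry table: index -> (name, signature, is_varargs)
RUNTIME_TABLE = {
    0:  ("__kmpc_barrier",           "void(ident_t*, i32)",                  False),
    5:  ("__kmpc_global_thread_num", "i32(ident_t*)",                        False),
    7:  ("__kmpc_fork_call",         "void(ident_t*, i32, kmpc_micro, ...)", True),
    16: ("__kmpc_get_warp_size",     "i32()",                                False),
}

class Function:
    """Stand-in for llvm::Function with ExternalLinkage."""
    def __init__(self, name, sig, varargs):
        self.name, self.sig, self.varargs = name, sig, varargs
        self.attrs = set()

class Module:
    """Stand-in for llvm::Module; symbols models the named-value table."""
    def __init__(self):
        self.symbols = {}
    def get_named_value(self, name):   # Module::getNamedValue equivalent
        return self.symbols.get(name)

def get_or_create_runtime_fn(module, cache, index):
    """Mirrors the sub_312CF50 flow: cached slot -> symbol lookup ->
    Function::Create -> register into the context-local array."""
    if index in cache:
        return cache[index]
    name, sig, varargs = RUNTIME_TABLE[index]
    fn = module.get_named_value(name)
    if fn is None:
        fn = Function(name, sig, varargs)
        module.symbols[name] = fn
        if varargs:                    # fork_call / fork_teams special-case
            fn.attrs.add("attr26")
    cache[index] = fn
    return fn
```

Later codegen passes can then reference any runtime function by its numeric index without reconstructing its type, exactly the property the binary's context-local array provides.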
Key Facts
| Property | Value |
|---|---|
| Entry point | sub_312CF50 @ 0x312CF50 |
| Decompiled size | 117 KB |
| Total entries | 194 (indices 0--193) |
| Sentinel | index 193 = __last (void function, marks table end) |
| Varargs entries | 2: index 7 (__kmpc_fork_call), index 118 (__kmpc_fork_teams) |
| Linkage for all entries | ExternalLinkage (encoded as 0x103 = 259) |
| Special attribute | Attribute #26 applied to indices 7 and 118 post-creation |
| Registration helper | sub_3122A50(context, index, funcDecl) |
| Type construction | sub_BCF480 = FunctionType::get |
| Symbol lookup | sub_BA8CB0 = Module::getNamedValue |
| Function creation | sub_B2C660 = Function::Create |
| Upstream equivalent | OMPKinds.def __OMP_RTL entries + OMPIRBuilder::initialize() |
Context Object Type Cache
The first parameter a1 points to the OpenMP runtime context object. Starting at offset +2600, it contains a pre-allocated cache of LLVM types used to construct function signatures, avoiding redundant Type::get* calls:
| Offset | Type | LLVM equivalent |
|---|---|---|
| +2600 | void | Type::getVoidTy |
| +2608 | i1 | Type::getInt1Ty |
| +2616 | i8 | Type::getInt8Ty |
| +2624 | i16 | Type::getInt16Ty |
| +2632 | i32 | Type::getInt32Ty |
| +2640 | i64 | Type::getInt64Ty |
| +2648 | i8* | PointerType::get(i8, 0) |
| +2664 | i32* | PointerType::get(i32, 0) |
| +2672 | i64* | PointerType::get(i64, 0) |
| +2680 | double | Type::getDoubleTy |
| +2688 | i64 / size_t | DataLayout::getIntPtrType |
| +2704 | i8* (generic ptr) | PointerType::get(i8, 0) |
| +2712 | i8** | PointerType::get(i8*, 0) |
| +2720 | i8*** | PointerType::get(i8**, 0) |
| +2752 | kmp_critical_name* | [8 x i32]* |
| +2784 | ident_t* | {i32, i32, i32, i32, i8*}* |
| +2800 | __tgt_kernel_arguments* | 13-field struct pointer |
| +2816 | __tgt_async_info* | {i8*}* |
| +2896 | KernelEnvironmentTy* | {ConfigEnv, ident_t*, DynEnv*}* |
| +2912 | KernelLaunchEnvironmentTy* | {i32, i32}* |
| +2928 | kmpc_micro | void(i32*, i32*, ...)* (varargs microtask) |
| +2944 | kmp_reduce_func | void(i8*, i8*)* |
| +2960 | kmp_copy_func | void(i8*, i8*)* |
| +3008 | kmpc_ctor | i8*(i8*)* |
| +3024 | kmp_routine_entry_t | i32(i32, i8*)* |
| +3040 | kmp_ShuffleReductFctPtr | void(i8*, i16, i16, i16)* |
| +3056 | kmp_InterWarpCopyFctPtr | void(i8*, i32)* |
| +3072 | kmp_ListGlobalFctPtr | void(i8*, i32, i8*)* |
This layout mirrors the OMP_TYPE, OMP_STRUCT_TYPE, and OMP_FUNCTION_TYPE sections of upstream OMPKinds.def. The struct type definitions for ident_t, KernelEnvironmentTy, and __tgt_kernel_arguments match the upstream __OMP_STRUCT_TYPE declarations exactly.
Execution Modes: SPMD vs Generic
GPU OpenMP kernels operate in one of two execution modes, and the choice fundamentally determines which runtime functions the compiler emits:
| Mode | Value | Description | Worker threads |
|---|---|---|---|
| Generic | 1 | Master-worker state machine. Only thread 0 runs serial code; workers spin in a polling loop (__kmpc_barrier_simple_generic). Parallel regions are dispatched via __kmpc_kernel_prepare_parallel / __kmpc_kernel_parallel. | Idle until parallel region |
| SPMD | 2 | All threads execute the same code from kernel entry. Serial sections between parallel regions are guarded by tid == 0 checks with shared-memory output promotion and __kmpc_barrier_simple_spmd barriers. | Active from first instruction |
| Generic-SPMD | 3 | Transient state during the Generic-to-SPMD transformation. Never observed at runtime. | N/A |
The execution mode is encoded in a bit-vector attached to the kernel function's metadata. The runtime function __kmpc_target_init (index 155) reads the KernelEnvironmentTy struct which embeds the ConfigurationEnvironmentTy -- the first byte of that inner struct encodes the execution mode. __kmpc_is_spmd_exec_mode (index 186) queries it at runtime.
The SPMD-vs-Generic distinction affects which runtime calls appear in the generated IR:
- Generic mode kernels call __kmpc_kernel_prepare_parallel, __kmpc_kernel_parallel, __kmpc_kernel_end_parallel, __kmpc_barrier_simple_generic, and the full __kmpc_fork_call microtask dispatch.
- SPMD mode kernels call __kmpc_parallel_51 (index 158) for nested parallelism, __kmpc_barrier_simple_spmd for synchronization, and __kmpc_alloc_shared / __kmpc_free_shared for shared-memory output promotion between guarded and parallel sections.
- Both modes call __kmpc_target_init / __kmpc_target_deinit for kernel lifecycle management.
Call Generation Infrastructure
When any codegen pass needs a runtime function, it calls sub_312CF50(omp_context + 400, existing_value, case_index). The omp_context object (typically at a2+208 in the pass state) contains both the type cache (+2600..+3072) and the runtime function array. If Module::getNamedValue finds the symbol already declared, it is returned immediately; otherwise a new declaration is created and registered.
Once a declaration is obtained, sub_921880 (create runtime library call instruction) builds the CallInst node with the argument list from current SSA values, attaches debug/source location metadata, and inserts it at the specified basic block position.
Primary Consumers
| Pass | Address | Size | Runtime Entries Used |
|---|---|---|---|
| Generic-to-SPMD transform | sub_26968A0 | 61 KB | 6 (thread ID), 180 (alloc_shared), 181 (free_shared), 187 (barrier_simple_spmd) |
| State machine generation | sub_2678420 | 41 KB | 155 (target_init), 156 (target_deinit), 171 (kernel_parallel), 172 (kernel_end_parallel), 188 (barrier_simple_generic) |
| Parallel region outliner | sub_313D1B0 | 47 KB | 7 (fork_call), 158 (parallel_51) |
| Parallel region merging | sub_2680940 | 52 KB | 180 (alloc_shared), 181 (free_shared), 187 (barrier_simple_spmd) |
| Attributor OpenMP driver | sub_269F530 | 63 KB | All -- identifies/folds known runtime calls by index |
Complete Runtime Function Table
All 194 entries, organized by functional category. The "Index" column is the switch case in sub_312CF50 and the slot in the context's runtime function array. Signatures use LLVM IR type syntax. The "Call Generation" column describes how and when cicc emits each call.
Standard OpenMP Runtime (0--13)
| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 0 | __kmpc_barrier | void(ident_t*, i32) | Explicit barrier | Emitted for #pragma omp barrier. On GPU compiles to __syncthreads(). OpenMPOpt may replace with index 187 (SPMD barrier) |
| 1 | __kmpc_cancel | i32(ident_t*, i32, i32) | Cancel construct | Third param: cancel kind (1=parallel, 2=sections, 3=for, 4=taskgroup). Returns nonzero if cancellation pending |
| 2 | __kmpc_cancel_barrier | void(ident_t*, i32) | Implicit barrier + cancel check | Generated at end of worksharing constructs when cancel is possible |
| 3 | __kmpc_error | void(ident_t*, i32, i8*) | Runtime error | Second param: severity (1=warning, 2=fatal). Third: message string pointer |
| 4 | __kmpc_flush | void(ident_t*) | Memory fence | #pragma omp flush. On GPU: __threadfence() or scope-specific fence |
| 5 | __kmpc_global_thread_num | i32(ident_t*) | Get global thread ID | On GPU: blockIdx*blockDim+threadIdx. Emitted at start of every region needing a thread identifier |
| 6 | __kmpc_get_hardware_thread_id_in_block | i32() | threadIdx.x equivalent | Direct PTX %tid.x wrapper. Used by SPMD transform (sub_26968A0) to build tid==0 guards. Lookup: sub_312CF50(..., 6) |
| 7 | __kmpc_fork_call | void(ident_t*, i32, kmpc_micro, ...) | Fork parallel region (varargs) | Second param: shared variable count. Third: outlined microtask pointer. Remaining: shared variables. On GPU Generic mode triggers worker state machine dispatch. Attribute #26 applied post-create |
| 8 | __kmpc_fork_call_if | void(ident_t*, i32, i32, i8*, i32) | Conditional fork | Third param: if-clause condition. If false, region executes serially |
| 9 | __kmpc_omp_taskwait | void(ident_t*, i32) | Taskwait | #pragma omp taskwait |
| 10 | __kmpc_omp_taskyield | i32(ident_t*, i32, i32) | Task yield point | Third param: end-of-task flag |
| 11 | __kmpc_push_num_threads | void(ident_t*, i32, i32) | Set thread count | num_threads(N) clause. Pushes count for next parallel region |
| 12 | __kmpc_push_proc_bind | void(ident_t*, i32, i32) | Set affinity | proc_bind(spread/close/master). Third param encodes binding policy |
| 13 | __kmpc_omp_reg_task_with_affinity | i32(ident_t*, i32, i8*, i32, i8*) | Register task with affinity info | OMP 5.0 affinity clause |
Index 7 (__kmpc_fork_call) and index 118 (__kmpc_fork_teams) are the only two varargs entries. Both receive special post-processing: sub_B994D0 sets function attribute #26 (likely the convergent attribute or a varargs-related marker), checked via sub_B91C10. This prevents the optimizer from incorrectly splitting, duplicating, or removing these calls.
Hardware Query (14--16)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 14 | __kmpc_get_hardware_num_blocks | i32() | gridDim.x equivalent |
| 15 | __kmpc_get_hardware_num_threads_in_block | i32() | blockDim.x equivalent |
| 16 | __kmpc_get_warp_size | i32() | Warp size (32 on NVIDIA) |
These three functions have no parameters -- they are direct wrappers around PTX special registers (%nctaid.x, %ntid.x, and a compile-time constant 32).
OMP Standard Library API (17--45)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 17 | omp_get_thread_num | i32() | Thread ID within team |
| 18 | omp_get_num_threads | i32() | Threads in current team |
| 19 | omp_get_max_threads | i32() | Max threads available |
| 20 | omp_in_parallel | i32() | Inside parallel region? |
| 21 | omp_get_dynamic | i32() | Dynamic adjustment enabled? |
| 22 | omp_get_cancellation | i32() | Cancellation enabled? |
| 23 | omp_get_nested | i32() | Nested parallelism enabled? |
| 24 | omp_get_schedule | void(i32*, i32*) | Query loop schedule |
| 25 | omp_get_thread_limit | i32() | Max total threads |
| 26 | omp_get_supported_active_levels | i32() | Max supported nesting |
| 27 | omp_get_max_active_levels | i32() | Current max nesting |
| 28 | omp_get_level | i32() | Current nesting depth |
| 29 | omp_get_ancestor_thread_num | i32(i32) | Ancestor thread ID |
| 30 | omp_get_team_size | i32(i32) | Team size at nesting level |
| 31 | omp_get_active_level | i32() | Active parallel nesting |
| 32 | omp_in_final | i32() | Inside final task? |
| 33 | omp_get_proc_bind | i32() | Current binding policy |
| 34 | omp_get_num_places | i32() | Number of places |
| 35 | omp_get_num_procs | i32() | Available processors |
| 36 | omp_get_place_proc_ids | void(i32, i32*) | Processor IDs in place |
| 37 | omp_get_place_num | i32() | Current place number |
| 38 | omp_get_partition_num_places | i32() | Places in partition |
| 39 | omp_get_partition_place_nums | void(i32*) | Place numbers in partition |
| 40 | omp_get_wtime | double() | Wall clock time |
| 41 | omp_set_num_threads | void(i32) | Set thread count |
| 42 | omp_set_dynamic | void(i32) | Enable/disable dynamic |
| 43 | omp_set_nested | void(i32) | Enable/disable nesting |
| 44 | omp_set_schedule | void(i32, i32) | Set loop schedule |
| 45 | omp_set_max_active_levels | void(i32) | Set max nesting |
These are the user-facing OpenMP API functions. On GPU, most return compile-time constants or trivial register reads. The Attributor-based OpenMP driver (sub_269F530) can fold many of these to constants when the execution mode and team configuration are statically known -- for example, omp_get_num_threads folds to the blockDim.x launch parameter.
Begin/End (53--54)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 53 | __kmpc_begin | void(ident_t*, i32) | Library initialization (rarely used on GPU) |
| 54 | __kmpc_end | void(ident_t*) | Library shutdown |
Master/Masked Constructs (46--49)
| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 46 | __kmpc_master | i32(ident_t*, i32) | Enter master region | Returns 1 for master thread (thread 0), 0 for all others. IRGen wraps user code in if(__kmpc_master(..)) {...} |
| 47 | __kmpc_end_master | void(ident_t*, i32) | Exit master region | Called at end of master block |
| 48 | __kmpc_masked | i32(ident_t*, i32, i32) | Enter masked region (OMP 5.1) | Third param is the filter ID (which specific thread executes). Replaces master in OMP 5.1 |
| 49 | __kmpc_end_masked | void(ident_t*, i32) | Exit masked region | Called at end of masked block |
Critical Sections (50--52)
| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 50 | __kmpc_critical | void(ident_t*, i32, kmp_critical*) | Enter critical section | On GPU: atomic spin-lock acquire on the 32-byte lock variable |
| 51 | __kmpc_critical_with_hint | void(ident_t*, i32, i32, kmp_critical*) | Enter with lock hint | Hint encodes contention strategy (uncontended, contended, speculative, non-speculative) |
| 52 | __kmpc_end_critical | void(ident_t*, i32, kmp_critical*) | Exit critical section | Atomic release on lock variable |
On GPU, critical sections use atomic operations on global memory. The kmp_critical_name type is [8 x i32] (32 bytes), used as an atomic lock variable. The _with_hint variant accepts a contention hint that the GPU runtime maps to different atomic strategies.
Reduction (55--58)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 55 | __kmpc_reduce | i32(ident_t*, i32, i32, i64, i8*, kmp_reduce_func, kmp_critical*) | Begin reduction (blocking) |
| 56 | __kmpc_reduce_nowait | i32(ident_t*, i32, i32, i64, i8*, kmp_reduce_func, kmp_critical*) | Begin reduction (non-blocking) |
| 57 | __kmpc_end_reduce | void(ident_t*, i32, kmp_critical*) | End reduction (blocking) |
| 58 | __kmpc_end_reduce_nowait | void(ident_t*, i32, kmp_critical*) | End reduction (non-blocking) |
These are the standard reduction protocol entries. On GPU, the compiler typically prefers the NVIDIA-specific shuffle-based reductions (indices 176--178) which are significantly faster.
Static Loop Scheduling (61--70)
| Index | Function | Signature |
|---|---|---|
| 61--64 | __kmpc_for_static_init_{4,4u,8,8u} | void(ident_t*, i32, i32, i32*, {i32,i64}*, {i32,i64}*, {i32,i64}*, {i32,i64}*, {i32,i64}, {i32,i64}) |
| 65 | __kmpc_for_static_fini | void(ident_t*, i32) |
| 66--69 | __kmpc_distribute_static_init_{4,4u,8,8u} | Same 9-param shape as 61--64 |
| 70 | __kmpc_distribute_static_fini | void(ident_t*, i32) |
The _4 / _4u / _8 / _8u suffixes indicate signed-32, unsigned-32, signed-64, and unsigned-64 loop variable types respectively. All static_init functions take 9 parameters: location, thread ID, schedule type, a pointer to the is-last flag, pointers to the lower/upper/stride bounds, and the increment and chunk size passed by value.
Dynamic Dispatch (71--87)
Indices 71--74 handle distribute + dynamic dispatch initialization. Indices 75--82 handle standard dispatch_init and dispatch_next for the four integer widths. Indices 83--87 are dispatch finalization. Total: 17 entries covering the full dynamic loop scheduling interface.
Team Static & Combined Distribute-For (88--95)
Indices 88--91 (__kmpc_team_static_init_{4,4u,8,8u}) handle team-level static work distribution. Indices 92--95 (__kmpc_dist_for_static_init_{4,4u,8,8u}) are the combined distribute parallel for static init, taking 10 parameters (the extra parameter is the distribute upper bound pointer).
Tasking (98--116)
19 entries covering the full OpenMP tasking interface:
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 98 | __kmpc_omp_task_alloc | i8*(ident_t*, i32, i32, i64, i64, kmp_routine_entry_t) | Allocate task descriptor (6 params). Returns kmp_task_t*. Params: flags, sizeof_task, sizeof_shareds, task_entry |
| 99 | __kmpc_omp_task | i32(ident_t*, i32, i8*) | Submit allocated task for execution. Third param is the kmp_task_t* from task_alloc |
| 100 | __kmpc_end_taskgroup | void(ident_t*, i32) | End #pragma omp taskgroup |
| 101 | __kmpc_taskgroup | void(ident_t*, i32) | Begin taskgroup |
| 102 | __kmpc_omp_task_begin_if0 | void(ident_t*, i32, i8*) | Begin immediate task (when if clause evaluates to false) |
| 103 | __kmpc_omp_task_complete_if0 | void(ident_t*, i32, i8*) | Complete immediate task |
| 104 | __kmpc_omp_task_with_deps | i32(ident_t*, i32, i8*, i32, i8*, i32, i8*) | Task with dependency list (7 params). Params: task, ndeps, dep_list, ndeps_noalias, noalias_list |
| 105 | __kmpc_taskloop | void(ident_t*, i32, i8*, i32, i64*, i64*, i64, i32, i32, i64, i8*) | #pragma omp taskloop (11 params). Params: task, if_val, lb_p, ub_p, st, nogroup, sched, grainsize, task_dup |
| 106 | __kmpc_taskloop_5 | void(ident_t*, i32, i8*, i32, i64*, i64*, i64, i32, i32, i64, i8*, i32) | OMP 5.1 taskloop (12 params). Extra param: modifier |
| 107 | __kmpc_omp_target_task_alloc | i8*(ident_t*, i32, i32, i64, i64, kmp_routine_entry_t, i64) | Target-offload task allocation (7 params). Extra i64: device_id |
| 108 | __kmpc_taskred_modifier_init | i8*(ident_t*, i32, i32, i32, i8*) | Init task reduction with modifier (5 params). Params: is_ws, num, data |
| 109 | __kmpc_taskred_init | i8*(i32, i32, i8*) | Init task reduction (basic) |
| 110 | __kmpc_task_reduction_modifier_fini | void(ident_t*, i32, i32) | Finalize task reduction |
| 111 | __kmpc_task_reduction_get_th_data | i8*(i32, i8*, i8*) | Get thread-local reduction data |
| 112 | __kmpc_task_reduction_init | i8*(i32, i32, i8*) | Init task reduction (alternate path) |
| 113 | __kmpc_task_reduction_modifier_init | i8*(i8*, i32, i32, i32, i8*) | Init with full modifier (5 params) |
| 114 | __kmpc_proxy_task_completed_ooo | void(i8*) | Out-of-order proxy task completion. Used for detached tasks |
| 115 | __kmpc_omp_wait_deps | void(ident_t*, i32, i32, i8*, i32, i8*) | Wait on task dependencies (6 params) |
| 116 | __kmpc_omp_taskwait_deps_51 | void(ident_t*, i32, i32, i8*, i32, i8*, i32) | OMP 5.1 dependency wait (7 params). Extra param: nowait modifier |
Index 106 (__kmpc_taskloop_5) and index 116 (__kmpc_omp_taskwait_deps_51) are OMP 5.1 additions with an extra modifier parameter compared to their predecessors.
Teams and Cancellation (117--121)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 117 | __kmpc_cancellationpoint | i32(ident_t*, i32, i32) | Cancellation point check |
| 118 | __kmpc_fork_teams | void(ident_t*, i32, kmpc_micro, ...) | Fork teams region (varargs) |
| 119 | __kmpc_push_num_teams | void(ident_t*, i32, i32, i32) | Set team count |
| 120 | __kmpc_push_num_teams_51 | void(ident_t*, i32, i32, i32, i32) | Set team count (OMP 5.1, 5 params) |
| 121 | __kmpc_set_thread_limit | void(ident_t*, i32, i32) | Set per-team thread limit |
Copyprivate and Threadprivate (122--124)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 122 | __kmpc_copyprivate | void(ident_t*, i32, i64, i8*, kmp_copy_func, i32) | #pragma omp copyprivate. Broadcasts private data from single thread to all others. 6 params |
| 123 | __kmpc_threadprivate_cached | i8*(ident_t*, i32, i8*, i64, i8***) | Get/allocate threadprivate variable data. 5 params |
| 124 | __kmpc_threadprivate_register | void(ident_t*, i8*, kmpc_ctor, void*, void*) | Register threadprivate with ctor, copy-ctor, dtor callbacks |
Doacross Synchronization (125--128)
Cross-iteration dependencies for #pragma omp ordered depend(source/sink).
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 125 | __kmpc_doacross_init | void(ident_t*, i32, i32, i8*) | Init doacross tracking. Params: num_dims, dims_info |
| 126 | __kmpc_doacross_post | void(ident_t*, i32, i64*) | Post (source): signal iteration completion |
| 127 | __kmpc_doacross_wait | void(ident_t*, i32, i64*) | Wait (sink): wait for iteration to complete |
| 128 | __kmpc_doacross_fini | void(ident_t*, i32) | Finalize doacross tracking |
Memory Allocators (129--136)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 129 | __kmpc_alloc | i8*(i32, i64, i8*) | OpenMP allocator alloc. Params: gtid, size, allocator |
| 130 | __kmpc_aligned_alloc | i8*(i32, i64, i64, i8*) | Aligned allocation. Params: gtid, align, size, allocator |
| 131 | __kmpc_free | void(i32, i8*, i8*) | Free allocated memory. Params: gtid, ptr, allocator |
| 132 | __tgt_interop_init | void(ident_t*, i32, i8**, i32, i32, i32, i8*, i32) | OMP 5.1 foreign runtime interop init (8 params) |
| 133 | __tgt_interop_destroy | void(ident_t*, i32, i8**, i32, i32, i32, i8*) | Destroy interop object (7 params) |
| 134 | __tgt_interop_use | void(ident_t*, i32, i8**, i32, i32, i32, i8*) | Use interop object (7 params) |
| 135 | __kmpc_init_allocator | i8*(i32, i32, i8*, i8*) | Init OpenMP allocator. Params: gtid, memspace, num_traits, traits |
| 136 | __kmpc_destroy_allocator | void(i32, i8*) | Destroy allocator |
Target Offloading (137--153)
17 entries implementing the host-side target offloading protocol. These are primarily used when cicc compiles host code that launches GPU kernels, not within device code itself:
| Index | Function | Signature | Params | Purpose |
|---|---|---|---|---|
| 137 | __kmpc_push_target_tripcount_mapper | void(ident_t*, i64, i64) | 3 | Set iteration count for target region. Params: device_id, trip_count |
| 138 | __tgt_target_mapper | i32(ident_t*, i64, i8*, i32, i8**, i8**, i64*, i64*, i8**, i8**) | 10 | Launch target region with data mapping |
| 139 | __tgt_target_nowait_mapper | (14 params) | 14 | Async target launch. Adds depobj count/list, noalias count/list |
| 140 | __tgt_target_teams_mapper | (12 params) | 12 | Target teams launch. Adds num_teams, thread_limit, mappers |
| 141 | __tgt_target_teams_nowait_mapper | (16 params) | 16 | Async target teams. Most complex host-side offload call |
| 142 | __tgt_target_kernel | i32(ident_t*, i64, i32, i32, i8*, __tgt_kernel_args*) | 6 | New-style kernel launch (takes __tgt_kernel_arguments*) |
| 143 | __tgt_target_kernel_nowait | (10 params) | 10 | Async new-style launch. Adds depobj info |
| 144 | __tgt_target_data_begin_mapper | (9 params) | 9 | Map data to device |
| 145 | __tgt_target_data_begin_nowait_mapper | (13 params) | 13 | Async map-to |
| 146 | __tgt_target_data_begin_mapper_issue | (10 params) | 10 | Split-phase issue for async map-to |
| 147 | __tgt_target_data_begin_mapper_wait | void(i64, __tgt_async_info*) | 2 | Split-phase wait for async map-to |
| 148 | __tgt_target_data_end_mapper | (9 params) | 9 | Map data from device |
| 149 | __tgt_target_data_end_nowait_mapper | (13 params) | 13 | Async map-from |
| 150 | __tgt_target_data_update_mapper | (9 params) | 9 | Data update (host-to-device or device-to-host) |
| 151 | __tgt_target_data_update_nowait_mapper | (13 params) | 13 | Async data update |
| 152 | __tgt_mapper_num_components | i64(i8*) | 1 | Query user-defined mapper component count |
| 153 | __tgt_push_mapper_component | void(i8*, i8*, i8*, i64, i64, i8*) | 6 | Register mapper component. Params: handle, base, begin, size, type, name |
Task Completion Event (154)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 154 | __kmpc_task_allow_completion_event | i8*(ident_t*, i32, i8*) | Allow completion event for detached tasks (OMP 5.0) |
GPU Kernel Lifecycle (155--158)
These are the most important entries for device-side GPU OpenMP code.
| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 155 | __kmpc_target_init | i32(KernelEnvironmentTy*, KernelLaunchEnvironmentTy*) | Kernel entry | First call in every GPU OpenMP kernel. State machine generator (sub_2678420) emits this at entry. KernelEnvironmentTy carries ConfigurationEnvironmentTy (first byte = execution mode) |
| 156 | __kmpc_target_deinit | void() | Kernel exit | Last call in every GPU OpenMP kernel. Emitted by state machine generator |
| 157 | __kmpc_kernel_prepare_parallel | void(i8*) | Generic: signal workers | Master thread writes outlined function pointer to shared memory, then signals workers to execute it. Replaced by __kmpc_parallel_51 after SPMD conversion |
| 158 | __kmpc_parallel_51 | void(ident_t*, i32, i32, i32, i32, i8*, i8*, i8**, i64) | OMP 5.1 GPU parallel dispatch | 9 params: if_expr, num_threads, proc_bind, fn, wrapper_fn, shared_args, num_shared_args. Used by parallel region outliner (sub_313D1B0) on SPMD kernels. Replaces fork_call for GPU |
__kmpc_target_init is the first runtime call in every GPU OpenMP kernel. In Generic mode, it returns -1 for worker threads (which should enter the polling loop) and 0 for the master thread. In SPMD mode, it returns 0 for all threads. The KernelEnvironmentTy struct carries the ConfigurationEnvironmentTy which encodes the execution mode, team sizes, and runtime configuration.
New-Style Static Loops, OMP 5.1+ (159--170)
12 entries implementing the callback-based loop interface introduced in OpenMP 5.1:
| Index | Function | Signature |
|---|---|---|
| 159--162 | __kmpc_for_static_loop_{4,4u,8,8u} | void(ident_t*, i8*, i8*, {i32,i64}, {i32,i64}, {i32,i64}) |
| 163--166 | __kmpc_distribute_static_loop_{4,4u,8,8u} | void(ident_t*, i8*, i8*, {i32,i64}, {i32,i64}) |
| 167--170 | __kmpc_distribute_for_static_loop_{4,4u,8,8u} | void(ident_t*, i8*, i8*, {i32,i64}, {i32,i64}, {i32,i64}, {i32,i64}) |
Unlike the old-style _init/_fini pairs, these new-style loops take function pointer callbacks (i8* for the loop body and data pointer) and handle initialization + execution + finalization in a single call.
Legacy Kernel-Mode Parallel (171--174)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 171 | __kmpc_kernel_parallel | i1(i8**) | Generic mode: worker checks if parallel work available |
| 172 | __kmpc_kernel_end_parallel | void() | Generic mode: worker signals completion |
| 173 | __kmpc_serialized_parallel | void(ident_t*, i32) | Execute parallel region serially (if(0) parallel) |
| 174 | __kmpc_end_serialized_parallel | void(ident_t*, i32) | End serialized parallel |
These are the Generic-mode worker-side functions. __kmpc_kernel_parallel returns true when the master thread has dispatched work via __kmpc_kernel_prepare_parallel, writing the outlined function pointer into the output parameter.
Warp-Level Primitives (175, 179, 189--190)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 175 | __kmpc_shuffle_int32 | i32(i32, i16, i16) | Warp shuffle for 32-bit value |
| 179 | __kmpc_shuffle_int64 | i64(i64, i16, i16) | Warp shuffle for 64-bit value |
| 189 | __kmpc_warp_active_thread_mask | i64() | Active lane mask (PTX activemask) |
| 190 | __kmpc_syncwarp | void(i64) | Warp-level barrier with mask |
The shuffle functions take (value, lane_offset, warp_size) and implement butterfly-pattern data exchange for intra-warp reductions. These compile down to PTX shfl.sync instructions.
NVIDIA Device Reduction (176--178)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 176 | __kmpc_nvptx_parallel_reduce_nowait_v2 | i32(ident_t*, i64, i8*, ShuffleReductFctPtr, InterWarpCopyFctPtr) | Intra-CTA parallel reduction |
| 177 | __kmpc_nvptx_teams_reduce_nowait_v2 | i32(ident_t*, i32, i8*, i64, i8*, ShuffleReductFctPtr, InterWarpCopyFctPtr, ListGlobalFctPtr, ListGlobalFctPtr, ListGlobalFctPtr, ListGlobalFctPtr) | Cross-CTA team reduction (11 params) |
| 178 | __kmpc_reduction_get_fixed_buffer | i8*() | Get global reduction scratch buffer |
These are the GPU-specific reduction entries -- the single most important performance-critical runtime calls for OpenMP on NVIDIA GPUs. The parallel reduction (index 176) uses a two-phase approach: (1) intra-warp reduction via shuffle, then (2) inter-warp reduction via shared memory copy. The compiler generates the ShuffleReductFctPtr and InterWarpCopyFctPtr callback functions as outlined helpers that the runtime calls during the reduction tree.
The teams reduction (index 177) adds four ListGlobalFctPtr callbacks for managing global memory buffers across CTAs, plus an extra size parameter. This is the most complex runtime call in the entire table, with 11 parameters.
Shared Memory Management (180--184)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 180 | __kmpc_alloc_shared | i8*(i64) | Dynamic shared memory allocation |
| 181 | __kmpc_free_shared | void(i8*, i64) | Free shared memory |
| 182 | __kmpc_begin_sharing_variables | void(i8***, i64) | Begin variable sharing protocol |
| 183 | __kmpc_end_sharing_variables | void() | End sharing protocol |
| 184 | __kmpc_get_shared_variables | i8**() | Get shared variable array |
__kmpc_alloc_shared / __kmpc_free_shared are heavily used in the SPMD transformation's guarded output mechanism: values computed by the master thread that are needed by all threads are stored into dynamically-allocated shared memory, synchronized via barrier, then loaded by all threads.
SPMD Mode Detection (185--188)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 185 | __kmpc_parallel_level | i16(ident_t*, i32) | Current parallel nesting depth |
| 186 | __kmpc_is_spmd_exec_mode | i8() | Returns 1 if SPMD, 0 if Generic |
| 187 | __kmpc_barrier_simple_spmd | void(ident_t*, i32) | Lightweight barrier for SPMD mode (bar.sync) |
| 188 | __kmpc_barrier_simple_generic | void(ident_t*, i32) | State-machine barrier for Generic mode |
The two barrier variants reflect the fundamental mode difference. __kmpc_barrier_simple_spmd compiles to a single bar.sync instruction. __kmpc_barrier_simple_generic involves polling a shared-memory flag because workers are in a state-machine loop that must check for new work after each barrier.
Profiling (191--192) and Sentinel (193)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 191 | __llvm_profile_register_function | void(i8*) | PGO: register function for profiling |
| 192 | __llvm_profile_register_names_function | void(i8*, i64) | PGO: register name table |
| 193 | __last | void() | Sentinel marking table end |
The two __llvm_profile_* entries support profile-guided optimization instrumentation on GPU. The __last sentinel at index 193 is a void-to-void function that marks the end of the table; it is never called at runtime.
Declaration Construction Protocol
For each runtime function, sub_312CF50 follows an identical protocol:
// Pseudocode for a typical case (e.g., case 0: __kmpc_barrier)
case 0: {
// 1. Build parameter type array from cached types
Type *params[] = { ctx->ident_t_ptr, ctx->i32_ty }; // a1+2784, a1+2632
// 2. Construct FunctionType
FunctionType *fty = FunctionType::get(
ctx->void_ty, // return type (a1+2600)
params, 2, // param array + count
/*isVarArg=*/false
);
// 3. Check if symbol already exists in module
    Value *existing = module->getNamedValue("__kmpc_barrier");
if (existing == a2) // a2 is the existing-check value
return existing;
// 4. Create new function declaration
Function *decl = Function::Create(
fty,
259, // linkage = ExternalLinkage (0x103)
"__kmpc_barrier",
module
);
// 5. Register in context table
registerRuntimeFunction(a1, /*index=*/0, decl); // sub_3122A50
return decl;
}
The linkage value 259 (0x103) decodes as ExternalLinkage with the DLLImport storage class flag set. This is consistent across all 194 entries.
For the two varargs entries (indices 7 and 118), the FunctionType::get call passes isVarArg=true, and after Function::Create, the code calls sub_B994D0 to add attribute #26 and sub_B91C10 to verify it was applied. Attribute #26 likely corresponds to a convergent-or-varargs marker that prevents the optimizer from incorrectly transforming these calls.
Comparison with Upstream LLVM OMPKinds.def
cicc's table maps one-to-one with the __OMP_RTL entries in LLVM 18.x's OMPKinds.def. The ordering is identical: the enum OMPRTL___kmpc_barrier = 0 corresponds to cicc's case 0, and so on through OMPRTL___last = 193 at case 193.
Key differences from upstream:
- **Procedural vs declarative.** Upstream uses X-macros (`__OMP_RTL`) expanded by `OMPIRBuilder::initialize()` to lazily create declarations on first use. cicc's `sub_312CF50` is a compiled switch statement that eagerly creates declarations when requested by case index.
- **Type representation.** Upstream uses opaque pointer types (`PointerType::get(Ctx, 0)`) throughout. cicc preserves typed pointers (`i8*`, `i32*`, `i64*`, struct pointers) in its type cache, consistent with LLVM's pre-opaque-pointer era. This is because cicc's internal IR (NVVM IR) still uses typed pointers even though upstream LLVM has migrated to opaque pointers.
- **Missing entries.** cicc lacks `__kmpc_push_num_threads_strict` (present in latest upstream) and uses `__kmpc_parallel_51` where upstream LLVM 18.x defines `__kmpc_parallel_60` with a slightly different signature. The `_51` name indicates cicc v13.0 targets the OMP 5.1 runtime ABI, not the OMP 6.0 draft.
- **Attribute handling.** Upstream `OMPKinds.def` includes extensive attribute sets (`GetterAttrs`, `SetterAttrs`, etc.) that annotate runtime functions with `nounwind`, `nosync`, `nofree`, `willreturn`, and memory-effect attributes for optimization. cicc applies only attribute #26 to the two varargs functions and otherwise relies on the OpenMPOpt pass to infer attributes.
- **Interop ABI divergence.** The `__tgt_interop_*` entries (indices 132--134) in cicc take a slightly different parameter list than upstream: cicc includes an extra `i32` parameter at the end that upstream encodes differently, reflecting a minor ABI divergence in the interop interface.
Configuration Knobs
All LLVM cl::opt knobs related to OpenMP optimization, as found in the cicc binary:
| Knob | Type | Default | Effect |
|---|---|---|---|
| openmp-opt-disable | bool | false | Disable all OpenMP optimizations |
| openmp-opt-enable-merging | bool | false | Enable parallel region merging |
| openmp-opt-disable-internalization | bool | false | Skip function internalization |
| openmp-opt-disable-deglobalization | bool | false | Skip global-to-local promotion |
| openmp-opt-disable-spmdization | bool | false | Skip Generic-to-SPMD transformation |
| openmp-opt-disable-folding | bool | false | Skip ICV folding |
| openmp-opt-disable-state-machine-rewrite | bool | false | Skip state machine optimization |
| openmp-opt-disable-barrier-elimination | bool | false | Skip redundant barrier removal |
| openmp-opt-inline-device | bool | varies | Inline device runtime calls |
| openmp-opt-verbose-remarks | bool | false | Emit detailed optimization remarks |
| openmp-opt-max-iterations | int | varies | Fixed-point iteration limit for analysis |
| openmp-opt-shared-limit | int | varies | Max shared memory for SPMD output promotion |
| openmp-opt-print-module-after | bool | false | Dump module IR after OpenMP optimization |
| openmp-opt-print-module-before | bool | false | Dump module IR before OpenMP optimization |
| openmp-deduce-icv-values | bool | varies | Deduce Internal Control Variable values |
| openmp-print-icv-values | bool | false | Print deduced ICV values |
| openmp-print-gpu-kernels | bool | false | Print identified GPU kernels |
| openmp-hide-memory-transfer-latency | bool | false | Overlap data transfers with computation |
The openmp-opt-shared-limit knob is particularly relevant for the SPMD transformation: it caps the total amount of shared memory allocated for guarded output promotion. If the serial sections between parallel regions produce too many live-out values, the SPMD transformation may be abandoned when the shared memory budget is exceeded.
Diagnostic Strings
The OpenMP subsystem emits two diagnostics during SPMD transformation:
| Code | Severity | Message |
|---|---|---|
| OMP120 | Remark | "Transformed generic-mode kernel to SPMD-mode." |
| OMP121 | Warning | "Value has potential side effects preventing SPMD-mode execution. Add [[omp::assume(\"ompx_spmd_amenable\")]] to the called function to override" |
OMP120 is emitted by sub_26968A0 on successful Generic-to-SPMD conversion. OMP121 is emitted for each call instruction that references a function not in the SPMD-amenable set, explaining why the transformation failed and providing the user with the override attribute.
Pipeline Integration
The OpenMP passes are registered in the pipeline under three names:
| Pipeline ID | Pass Name | Level | Description |
|---|---|---|---|
| 75 | openmp-opt | Module | Pre-link OpenMP optimization |
| 76 | openmp-opt-postlink | Module | Post-link OpenMP optimization |
| 154 | openmp-opt-cgscc | CGSCC | Call-graph-level OpenMP optimization |
The runtime declaration table (sub_312CF50) is invoked lazily from any of these passes when they need to emit a runtime call. The SPMD transformation is part of the module-level openmp-opt pass.
Execution Mode Call Patterns
The execution mode fundamentally determines which runtime functions appear in generated IR. These pseudocode patterns show the exact call sequences emitted by the state machine generator (sub_2678420) and the SPMD transformation (sub_26968A0).
Generic Mode Kernel (mode byte = 1)
entry:
ret = __kmpc_target_init(KernelEnv, LaunchEnv) // [155]
if (ret == -1) goto worker_loop // worker threads
// master thread: user code
__kmpc_kernel_prepare_parallel(outlined_fn_ptr) // [157]
__kmpc_barrier_simple_generic(loc, gtid) // [188]
// ... more serial + parallel sections ...
__kmpc_target_deinit() // [156]
worker_loop:
while (true) {
__kmpc_barrier_simple_generic(loc, gtid) // [188]
    if (__kmpc_kernel_parallel(&fn)) {             // [171]
      fn(args);
      __kmpc_kernel_end_parallel()                 // [172]
    }
__kmpc_barrier_simple_generic(loc, gtid) // [188]
}
SPMD Mode Kernel -- Simple (mode byte = 2, single parallel region)
After successful Generic-to-SPMD transformation:
entry:
__kmpc_target_init(KernelEnv, LaunchEnv) // [155], returns 0 for all
tid = __kmpc_get_hardware_thread_id_in_block() // [6]
is_main = (tid == 0)
br is_main, user_code, exit.threads
user_code:
// all threads: user code
__kmpc_parallel_51(loc, gtid, ...) // [158], for nested
__kmpc_barrier_simple_spmd(loc, gtid) // [187]
exit.threads:
__kmpc_target_deinit() // [156]
SPMD Mode Kernel -- Complex (guarded regions, multiple parallel regions)
entry:
__kmpc_target_init(...) // [155]
region.check.tid:
tid = __kmpc_get_hardware_thread_id_in_block() // [6]
cmp = icmp eq tid, 0
br cmp, region.guarded, region.barrier
region.guarded:
... master-only serial code ...
shared_ptr = __kmpc_alloc_shared(sizeof(result)) // [180]
store result -> shared_ptr
region.guarded.end:
br region.barrier
region.barrier:
__kmpc_barrier_simple_spmd(loc, gtid) // [187]
result = load from shared_ptr
__kmpc_barrier_simple_spmd(loc, gtid) // [187], post-load
__kmpc_free_shared(shared_ptr, size) // [181]
... all threads continue with result ...
exit:
__kmpc_target_deinit() // [156]
The SPMD transformation eliminates the worker state machine entirely. Workers no longer idle-spin in a polling loop; they participate in computation from the kernel's first instruction. Serial sections between parallel regions are wrapped in tid==0 guards with shared-memory output promotion and barriers.
SPMD-Amenable Function Table
The SPMD transformation maintains a hash set of functions that are safe to call from all threads simultaneously, located at *(omp_context + 208) + 34952 (base pointer), +34968 (capacity).
| Property | Value |
|---|---|
| Hash function | Open-addressing with linear probing |
| Slot computation | ((addr >> 9) ^ (addr >> 4)) & (capacity - 1) |
| Sentinel | -4096 (empty slot marker) |
| Contents | Functions pre-analyzed or annotated with [[omp::assume("ompx_spmd_amenable")]] |
When a call instruction references a function not in this set, the SPMD transformation fails for that kernel and emits OMP121: "Value has potential side effects preventing SPMD-mode execution. Add [[omp::assume(\"ompx_spmd_amenable\")]] to the called function to override".
Functional Category Summary
| Category | Count | Indices |
|---|---|---|
| Thread hierarchy and hardware query | 39 | 0--6, 14--16, 17--45 |
| Work sharing / loop scheduling | 47 | 61--95, 159--170 |
| Tasking | 20 | 98--116, 154 |
| Synchronization | 13 | 0, 2, 4, 50--52, 59--60, 96--97, 187--188, 190 |
| Target offloading / data mapping | 17 | 137--153 |
| GPU execution mode | 10 | 155--158, 171--174, 185--186 |
| Warp primitives | 4 | 175, 179, 189--190 |
| NVIDIA device reduction | 3 | 176--178 |
| Shared memory management | 5 | 180--184 |
| Memory allocators | 8 | 129--136 |
| Copyprivate / threadprivate | 3 | 122--124 |
| Doacross synchronization | 4 | 125--128 |
| Teams / cancellation | 5 | 117--121 |
| Master / masked | 4 | 46--49 |
| Reduction (standard) | 4 | 55--58 |
| Begin / end | 2 | 53--54 |
| Profiling | 2 | 191--192 |
| Sentinel | 1 | 193 |
| Total | 194 | 0--193 |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| sub_312CF50 | 0x312CF50 | -- | OpenMP runtime declaration factory (194-case switch) |
| sub_3122A50 | 0x3122A50 | -- | registerRuntimeFunction(context, index, funcDecl) |
| sub_2686D90 | 0x2686D90 | 215 KB | OpenMP runtime declaration table (outer wrapper) |
| sub_26968A0 | 0x26968A0 | 61 KB | Generic-to-SPMD transformation |
| sub_2680940 | 0x2680940 | 52 KB | Parallel region merging |
| sub_2678420 | 0x2678420 | 41 KB | State machine generation for Generic mode |
| sub_269F530 | 0x269F530 | 63 KB | Attributor-based OpenMP optimization driver |
| sub_313D1B0 | 0x313D1B0 | 47 KB | Parallel region outliner |
| sub_BCF480 | 0xBCF480 | -- | FunctionType::get(retTy, paramTys, count, isVarArg) |
| sub_BA8CB0 | 0xBA8CB0 | -- | Module::getNamedValue(name) |
| sub_B2C660 | 0xB2C660 | -- | Function::Create(funcTy, linkage, name, module) |
| sub_B994D0 | 0xB994D0 | -- | addAttribute(26, value) -- set function attribute |
| sub_B91C10 | 0xB91C10 | -- | hasAttribute(26) -- check function attribute |
| sub_B9C770 | 0xB9C770 | -- | Attribute construction (varargs attribute) |
| sub_B8C960 | 0xB8C960 | -- | Attribute kind construction |
| sub_B2BE50 | 0xB2BE50 | -- | Function::getContext() |
| sub_921880 | 0x921880 | -- | Create runtime library call instruction |
| sub_5FB5C0 | 0x5FB5C0 | -- | OpenMP variant processing (%s$$OMP_VARIANT%06d) |
OpenMP Variant Processing
cicc also supports OpenMP variant dispatch during EDG front-end processing. The function sub_5FB5C0 at 0x5FB5C0 handles mangled names with the format %s$$OMP_VARIANT%06d, which the front-end generates for #pragma omp declare variant constructs. This is separate from the runtime declaration table and operates at the source-level AST rather than at the LLVM IR level.
Cross-References
- Generic-to-SPMD Transformation -- the primary consumer of the runtime table, performing mode conversion using entries 6, 155, 156, 180, 181, 187, 188
- Pipeline & Ordering -- where openmp-opt (ID 75), openmp-opt-postlink (ID 76), and openmp-opt-cgscc (ID 154) sit in the pass pipeline
- CLI Flags -- compiler flags that control OpenMP code generation
- LLVM Knobs -- the openmp-opt-* knobs listed above
- Kernel Metadata -- how KernelEnvironmentTy and execution mode are set during IR generation
- Hash Infrastructure -- the open-addressing hash table pattern used by the SPMD-amenable function set
- GPU Execution Model -- broader context on SPMD vs Generic execution
Generic-to-SPMD Transformation
The Generic-to-SPMD transformation (sub_26968A0, 61 KB, ~1807 lines) is cicc's most impactful OpenMP target optimization. It converts GPU kernels from Generic execution mode -- where thread 0 acts as a master running serial code through a state machine while all other threads idle at a barrier -- into SPMD mode, where every thread in the block executes the same code from the first instruction. The transformation eliminates the worker state machine loop entirely, removes warp divergence at kernel entry, replaces heavyweight generic barriers with lightweight SPMD barriers (__syncthreads), and enables the hardware scheduler to fill warps from the very first cycle. On real workloads this routinely yields 2-4x speedups for simple target parallel for regions. The pass emits diagnostic OMP120 on success and OMP121 when a callee's side effects prevent conversion.
Key Facts
| Property | Value |
|---|---|
| Function address | sub_26968A0 |
| Decompiled size | 61 KB (~1807 lines) |
| Pass registration | openmp-opt (pipeline slot 75, Module pass) |
| Post-link variant | openmp-opt-postlink (slot 76) |
| CGSCC variant | openmp-opt-cgscc (slot 154) |
| Parameters | a1 = PassState, a2 = ModuleContext, a3 = OutputFlag |
| Eligibility flag | *(a1+241) -- boolean, set by prior analysis |
| Parallel region array | *(a1+280) base, *(a1+288) count |
| Diagnostic handler | *(a2+4392) |
| Success diagnostic | OMP120: "Transformed generic-mode kernel to SPMD-mode." |
| Failure diagnostic | OMP121: "Value has potential side effects preventing SPMD-mode execution" |
Generic vs SPMD Execution Model
Understanding the two execution modes is essential before examining the transformation.
| Aspect | Generic Mode | SPMD Mode |
|---|---|---|
| Thread roles | Thread 0 = master; threads 1..N-1 = workers | All threads execute same code |
| Kernel entry | __kmpc_target_init returns tid for master, -1 for workers | __kmpc_target_init returns tid for all |
| Serial code | Master executes directly | Wrapped in if (tid == 0) guard |
| Parallel region | Master signals workers via parallel_level; workers wake, execute outlined fn, re-barrier | All threads already executing; outlined fn body inlined |
| Barrier type | __kmpc_barrier_simple_generic (poll-based state machine) | __kmpc_barrier_simple_spmd (maps to bar.sync / __syncthreads) |
| Worker idle loop | while(true) { barrier(); if(parallel_level) { exec(); barrier(); } } | No idle loop -- eliminated entirely |
| Warp divergence | Warps containing thread 0 diverge at entry gate | No divergence at entry |
| Occupancy | Lower -- workers consume registers/shared mem while idle | Higher -- all resources used productively |
| Execution mode constant | 1 (OMP_TGT_EXEC_MODE_GENERIC) | 2 (OMP_TGT_EXEC_MODE_SPMD) |
| Transition marker | -- | 3 (OMP_TGT_EXEC_MODE_GENERIC_SPMD, intermediate during transform) |
In Generic mode the runtime creates a CTA (Cooperative Thread Array) where only thread 0 enters user code. The remaining N-1 threads enter a polling loop: they call __kmpc_barrier_simple_generic, check the parallel_level variable, and if a parallel region has been entered by the master, they wake up, execute the outlined parallel function, then return to polling. This "state machine" pattern is the primary performance bottleneck -- it wastes cycles on barrier polling, causes massive warp divergence on the first warp (which contains both the master and worker lanes), and prevents the scheduler from issuing useful work for idle threads.
SPMD mode eliminates all of this. Every thread begins executing user code at kernel entry. Serial code sections that cannot be parallelized are protected by lightweight tid == 0 guards, with results broadcast to all threads through shared memory and bar.sync barriers.
Legality Analysis
The transformation is gated by a boolean eligibility flag at *(a1+241), which is computed by a prior analysis pass (not sub_26968A0 itself). The analysis determines eligibility based on three conditions:
Condition 1: Kernel is Currently in Generic Mode
The execution mode bit-vector's low byte must equal 1 (Generic). This is checked at line 429 of the decompiled output:
// sub_2674090/sub_2674040 read the execution mode attribute
mode_bv = get_exec_mode(a1 + 304);
if (mode_bv.size <= 64)
mode_val = mode_bv.inline_data;
else
mode_val = *mode_bv.data_ptr;
if ((uint8_t)mode_val != 1) // Not Generic mode
return;
Condition 2: All Callees are SPMD-Amenable
Every call instruction reachable from the kernel's parallel regions must reference a function in the SPMD-amenable function set. This set lives at *(a2+208) + 34952 (base pointer) with capacity at offset +34968.
// SPMD-amenable lookup (open-addressing hash set)
bool is_spmd_amenable(void *func_ptr, void **table_base, uint64_t capacity) {
uint64_t hash = ((uintptr_t)func_ptr >> 9) ^ ((uintptr_t)func_ptr >> 4);
uint64_t slot = hash & (capacity - 1);
while (true) {
void *entry = table_base[slot];
if (entry == func_ptr) return true;
if (entry == (void*)-4096) return false; // empty sentinel
slot = (slot + 1) & (capacity - 1); // linear probe
}
}
Functions are pre-populated in this set if they have been analyzed as side-effect free (from the caller's perspective in SPMD context), or if the programmer annotated them with [[omp::assume("ompx_spmd_amenable")]]. When a callee fails this check, the pass takes Path A (non-SPMD candidate path, lines 1692-1806) and emits OMP121 for each offending call:
warning: Value has potential side effects preventing SPMD-mode execution.
Add `[[omp::assume("ompx_spmd_amenable")]]` to the called function
to override [OMP121]
The diagnostic is constructed via sub_B178C0 (warning constructor), message appended via sub_B18290, and emitted through sub_1049740 to the handler at *(a2+4392).
Condition 3: No Unresolvable Side Effects
The kernel must not contain operations that are inherently unsafe when executed by multiple threads simultaneously -- for example, I/O operations with ordering requirements, or accesses to thread-local storage that assumes single-thread access.
Legality Pseudocode
function is_spmd_eligible(kernel, module_ctx):
// Check current execution mode
mode = read_exec_mode(kernel.attributes)
if mode != GENERIC:
return false
// Scan all parallel regions
for region in kernel.parallel_regions:
for inst in region.instructions:
if is_call_like(inst): // opcode 34, 40, or 85
callee = get_callee(inst)
if callee.is_declaration:
if callee not in module_ctx.spmd_amenable_set:
emit_diagnostic(OMP121, inst.location,
"Value has potential side effects...")
return false
return true
The call-like instruction detection uses a bitmask test: (opcode - 34) <= 0x33 followed by bittest(0x8000000000041, opcode - 34). The mask sets bits 0, 6, and 51, so the test matches opcodes 34, 40, and 85 -- the three LLVM call-family instructions (call, callbr, and invoke; opcode 85 for invoke also appears in the Phase 2 exception-handler check below).
Transformation Algorithm
Once eligibility is confirmed, sub_26968A0 takes Path B (lines 407-1691). The path splits based on kernel complexity:
Simple Case: Single Parallel Region
When *(a1+160) == 0 and *(a1+224) == 0, the kernel has a single parallel region with no intervening serial code. This is the fast path (lines 432-672).
function transform_simple_spmd(kernel, module_ctx):
entry_bb = get_entry_block(kernel)
func_scope = get_function_scope(kernel)
thread_config = get_thread_configuration(kernel, module_ctx)
// 1. Create new basic blocks
user_code_bb = create_region("main.thread.user_code")
exit_bb = create_exit_block("exit.threads")
register_in_worklist(user_code_bb)
register_in_worklist(exit_bb)
// 2. Insert thread-id check at entry
tid = call __kmpc_get_hardware_thread_id_in_block() // runtime call ID 6
is_main = icmp eq tid, 0
br is_main, user_code_bb, exit_bb
// 3. Move original parallel body into user_code_bb
// (all threads execute this -- the parallel outlined fn
// is effectively inlined into the kernel)
// 4. Update execution mode: Generic(1) -> SPMD(2)
// Intermediate: set mode 3 (GENERIC_SPMD) then overwrite to 2
bv_entry = create_bitvector_entry(*(kernel+304+8), 3, 0)
current = read_attribute(*(kernel+304))
*(kernel+304) = insert_attribute(current, bv_entry, key=0, value=1)
// 5. Emit success diagnostic
if diagnostic_handler_registered(module_ctx+4392):
emit_remark(OMP120, "Transformed generic-mode kernel to SPMD-mode.")
The resulting CFG is straightforward:
entry:
%tid = call i32 @__kmpc_get_hardware_thread_id_in_block()
%is_main = icmp eq i32 %tid, 0
br i1 %is_main, label %user_code, label %exit.threads
user_code: ; all threads execute
... original parallel body ...
br label %exit.threads
exit.threads:
ret void
Complex Case: Multiple Parallel Regions
When the kernel contains multiple parallel regions with serial code between them, the pass executes a four-phase transformation (lines 720-1676).
Phase 1: Deduplicate Parallel Regions (lines 720-760)
Multiple parallel regions may call the same outlined function. The pass deduplicates by function pointer using an inline hash set:
function dedup_regions(parallel_regions):
seen = HashSet() // inline small-buffer optimization
unique = []
for region in parallel_regions:
fn_ptr = region.outlined_function // offset+40
if fn_ptr not in seen:
seen.insert(fn_ptr)
unique.append(region)
return unique
Phase 2: Identify Non-SPMD-Safe Instructions (lines 768-873)
For each parallel region, the pass walks the CFG successor chain and identifies instructions with side effects that are not SPMD-compatible:
function find_guarded_ranges(region, module_ctx):
ranges = []
first_unsafe = null
last_unsafe = null
for inst in walk_cfg_successors(region):
if is_side_effecting_call(inst):
// Skip known-safe calls (global dtors at module_ctx+208+32432)
if inst.callee == module_ctx.global_dtor_fn:
continue
// For invoke instructions: check if exception handler count is 0
if inst.opcode == 85: // invoke
if get_eh_handler_count(inst) == 0:
continue // can be simplified
if first_unsafe == null:
first_unsafe = inst
last_unsafe = inst
else:
if first_unsafe != null:
ranges.append((first_unsafe, last_unsafe))
first_unsafe = null
last_unsafe = null
if first_unsafe != null:
ranges.append((first_unsafe, last_unsafe))
return ranges
The pass then calls sub_B444E0 to insert guard instructions at each range boundary.
Phase 3: Build Guarded Region Descriptors (lines 876-1059)
Each parallel region is looked up in the function-to-region-tracker hash map at *(a2+144). This map uses a splitmix64-variant hash:
uint64_t hash_function_key(uint64_t name_hash, uint64_t addr_hash) {
uint64_t raw = name_hash ^ (16 * addr_hash);
uint64_t h = raw * 0xBF58476D1CE4E5B9ULL;
h = (h >> 31) ^ (h * 0x1CE4E5B9ULL);
return h;
}
The map stores 24-byte keys (module pointer, name pointer, auxiliary pointer) with a sentinel key of (-4096, qword_4FEE4D0, qword_4FEE4D8). Each entry's value (at +24) points to a guarded region tracker structure:
| Offset | Type | Description |
|---|---|---|
| +472 | i32 | Work counter |
| +480 | ptr | Block pointer array base |
| +488 | i64 | Capacity |
| +492 | i32 | Current size |
| +500 | i8 | Initialized flag |
Phase 4: Split and Rewire CFG (lines 1060-1670)
For each (first_instr, last_instr) pair identified in Phase 2, the pass creates five new basic blocks and rewires the CFG:
function create_guarded_region(first_instr, last_instr, module_ctx):
parent_bb = first_instr.parent
// 1. Split into 5 blocks
guarded_end_bb = split_block(parent_bb, after=last_instr, name="region.guarded.end")
barrier_bb = split_block(guarded_end_bb, at_start, name="region.barrier")
exit_bb = split_block(barrier_bb, at_start, name="region.exit")
guarded_bb = split_block(parent_bb, at=first_instr, name="region.guarded")
check_tid_bb = split_block(parent_bb, at=terminator, name="region.check.tid")
// 2. Register all blocks in worklist
for bb in [guarded_end_bb, barrier_bb, exit_bb, guarded_bb, check_tid_bb]:
register_in_worklist(bb)
// 3. Handle escaping values (shared memory promotion)
has_broadcast = false
for inst in guarded_bb:
outside_uses = [u for u in inst.uses if u.parent != guarded_bb]
if outside_uses:
has_broadcast = true
// Allocate shared memory for output
alloc = create_alloca(
type = inst.type,
address_space = 7, // shared memory
name = sanitize(inst.name) + ".guarded.output.alloc"
)
// Store result from master thread (inside guarded block)
create_store(inst, alloc, insert_in=guarded_bb)
// Load from all threads (after barrier)
load = create_load(
type = inst.type,
ptr = alloc,
name = sanitize(inst.name) + ".guarded.output.load",
insert_in = barrier_successor
)
// Rewrite all outside uses
replace_all_uses_outside(inst, load, guarded_bb)
// 4. Insert thread-id check
tid = call __kmpc_get_hardware_thread_id_in_block() // call ID 6
cmp = icmp eq tid, 0
br cmp, guarded_bb, barrier_bb
// 5. Insert SPMD barrier
call __kmpc_barrier_simple_spmd(ident, tid) // call ID 187
// 6. If broadcast values exist, insert second barrier after loads
if has_broadcast:
call __kmpc_barrier_simple_spmd(ident, tid) // ensures loads complete
The resulting CFG for a complex kernel with serial code between two parallel regions:
entry:
...
region.check.tid:
%tid = call i32 @__kmpc_get_hardware_thread_id_in_block()
%cmp = icmp eq i32 %tid, 0
br i1 %cmp, label %region.guarded, label %region.barrier
region.guarded: ; master thread only
... serial code ...
store %result, %shared_mem ; broadcast output
br label %region.guarded.end
region.guarded.end:
br label %region.barrier
region.barrier:
call void @__kmpc_barrier_simple_spmd(%ident, %tid)
%result = load %shared_mem ; all threads read
call void @__kmpc_barrier_simple_spmd(%ident, %tid) ; if broadcast
br label %region.exit
region.exit:
... next parallel region (all threads) ...
Name Sanitization
Output variable names are sanitized for use as global symbol names. Non-alphanumeric, non-underscore characters are replaced with .:
// Identical logic in both cicc and upstream LLVM
char sanitize_char(char c) {
if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
(c >= '0' && c <= '9') || c == '_')
return c;
return '.';
}
Shared Memory Output Promotion
When a value computed inside a guarded region (master-only code) is needed by all threads after the barrier, the pass promotes it through shared memory. This is the cicc implementation of what upstream LLVM calls "broadcast values." The sequence is:
- Allocate: sub_B30000 creates an address-space-7 (shared/local) allocation with suffix .guarded.output.alloc. The allocation node is 80 bytes, subtype 7.
- Store: sub_B4D460 emits a store from the master thread's computed value into shared memory. Placed inside the guarded block, before the branch to region.guarded.end.
- First barrier: __kmpc_barrier_simple_spmd (runtime call ID 187) ensures the store is globally visible to all threads in the CTA.
- Load: sub_B4D230 emits a load from shared memory with suffix .guarded.output.load. Placed in the barrier successor block so all threads read the broadcast value.
- Second barrier: If broadcast values exist, a second __kmpc_barrier_simple_spmd call ensures all threads have completed their loads before the shared memory is potentially reused.
- Use rewriting: sub_256E5A0 replaces every use of the original value outside the guarded block with the loaded value.
State Machine Elimination
The state machine elimination is the core performance win of the SPMD transformation. Understanding the state machine that gets eliminated -- and its fallback generator -- is essential for reimplementation.
Generic-Mode Worker State Machine (What Gets Eliminated)
In Generic mode, __kmpc_target_init (runtime call ID 155) returns -1 for all threads except thread 0 (the master). The kernel entry code branches on this return value: thread 0 falls through to user code, while threads 1..N-1 jump to the worker state machine loop. This loop is the performance bottleneck that the SPMD transformation eliminates.
The complete Generic-mode kernel structure, as generated by the runtime and optionally customized by sub_2678420:
// Generic mode kernel entry (before SPMD transformation)
void __omp_offloading_kernel(KernelEnvironmentTy *env, KernelLaunchEnvironmentTy *launch_env) {
int ret = __kmpc_target_init(env, launch_env); // [155]
if (ret == -1)
goto worker_state_machine;
// === MASTER THREAD (thread 0) ===
// User code: serial sections + parallel dispatch
...
__kmpc_kernel_prepare_parallel(outlined_fn_ptr); // [157] signal workers
__kmpc_barrier_simple_generic(loc, gtid); // [188] wake workers
// ... workers execute outlined_fn ...
__kmpc_barrier_simple_generic(loc, gtid); // [188] wait for workers
// ... more serial code ...
__kmpc_target_deinit(); // [156]
return;
worker_state_machine:
// === WORKER THREADS (threads 1..N-1) ===
// sub_2678420 generates this structure with these exact labels:
worker_state_machine.begin:
__kmpc_barrier_simple_generic(loc, gtid); // [188] poll barrier
.is_active.check:
bool active = __kmpc_kernel_parallel(&fn); // [171] check for work
if (!active)
goto .done.barrier;
.parallel_region.check:
if (fn == known_outlined_fn_1)
goto .parallel_region.execute;
// ... more checks for known outlined functions ...
goto .fallback.execute;
.parallel_region.execute:
known_outlined_fn_1(args); // direct call (devirtualized)
goto .done.barrier;
.fallback.execute:
fn(args); // indirect call (generic)
.done.barrier:
__kmpc_kernel_end_parallel(); // [172] signal completion
__kmpc_barrier_simple_generic(loc, gtid); // [188] sync barrier
goto worker_state_machine.begin;
.finished:
return;
}
The state machine consumes five runtime calls per parallel-region invocation per worker thread: two __kmpc_barrier_simple_generic (ID 188) for poll/sync barriers, one __kmpc_kernel_parallel (ID 171) to check for dispatched work, one indirect or direct call to the outlined function, and one __kmpc_kernel_end_parallel (ID 172) to signal completion. Each __kmpc_barrier_simple_generic call compiles to a poll loop on a shared-memory flag -- not a hardware bar.sync -- because the generic barrier must handle the asymmetric wakeup protocol where the master thread signals workers through __kmpc_kernel_prepare_parallel.
Worker State Machine Generator: sub_2678420 (41 KB)
When the SPMD transformation fails (eligibility flag *(a1+241) == 0), cicc falls back to sub_2678420, which builds a customized state machine that is more efficient than the default runtime state machine. The customization replaces the indirect fn(args) call in .fallback.execute with a direct-call dispatch table when the set of outlined parallel functions is statically known.
| Property | Value |
|---|---|
| Function address | sub_2678420 |
| Decompiled size | 41 KB |
| Basic block labels | worker_state_machine.begin, .is_active.check, .parallel_region.check, .parallel_region.execute, .fallback.execute, .done.barrier, .finished |
| Diagnostics | OMP130, OMP131, OMP132, OMP133 |
The generator has two modes:
Mode 1: Remove unused state machine (OMP130). When the kernel has zero parallel regions (e.g., a #pragma omp target with no nested parallel), the state machine is dead code. sub_2678420 removes the entire worker loop and emits: "Removing unused state machine from generic-mode kernel." (OMP130).
Mode 2: Rewrite with customized dispatch (OMP131). When the kernel has N known parallel regions, the generator builds a switch/cascade of direct-call comparisons in .parallel_region.check and .parallel_region.execute, avoiding the overhead of indirect calls through __kmpc_kernel_parallel's function pointer. It emits: "Rewriting generic-mode kernel with a customized state machine." (OMP131).
// Customized state machine pseudocode (sub_2678420 output)
function build_custom_state_machine(kernel, parallel_regions):
// Create the 7 basic blocks with the labels above
begin_bb = create_block("worker_state_machine.begin")
active_bb = create_block(".is_active.check")
check_bb = create_block(".parallel_region.check")
exec_bb = create_block(".parallel_region.execute")
fallback_bb = create_block(".fallback.execute")
barrier_bb = create_block(".done.barrier")
finished_bb = create_block(".finished")
// Entry: poll barrier
in begin_bb:
call __kmpc_barrier_simple_generic(loc, gtid) // [188]
br .is_active.check
// Check if master dispatched work
in active_bb:
%active = call i1 @__kmpc_kernel_parallel(&fn) // [171]
br %active, .parallel_region.check, .done.barrier
// Devirtualized dispatch: compare fn pointer against known functions
in check_bb:
for i, region in enumerate(parallel_regions):
%cmp = icmp eq fn, @outlined_fn_i
br %cmp, .parallel_region.execute.i, next_check
br .fallback.execute // no match -- use indirect call
// Direct call to known function (avoids indirect branch penalty)
in exec_bb:
for each matched region:
call @outlined_fn_i(args)
br .done.barrier
// Fallback: indirect call (should be unreachable if analysis is complete)
in fallback_bb:
call fn(args) // indirect
br .done.barrier
// End parallel + sync barrier
in barrier_bb:
call __kmpc_kernel_end_parallel() // [172]
call __kmpc_barrier_simple_generic(loc, gtid) // [188]
br worker_state_machine.begin
// Optional: exit (reached via __kmpc_target_deinit signaling)
in finished_bb:
ret void
The runtime calls consumed by sub_2678420:
| Call ID | Function | Role in State Machine |
|---|---|---|
| 155 | __kmpc_target_init | Kernel entry; returns -1 for workers |
| 156 | __kmpc_target_deinit | Kernel exit cleanup |
| 157 | __kmpc_kernel_prepare_parallel | Master signals workers with outlined fn pointer |
| 171 | __kmpc_kernel_parallel | Worker checks if work is dispatched; returns fn ptr |
| 172 | __kmpc_kernel_end_parallel | Worker signals completion of parallel region |
| 188 | __kmpc_barrier_simple_generic | Poll-based barrier (shared-memory flag loop) |
SPMD Amenability Analysis Pipeline
The eligibility flag at *(a1+241) -- which gates whether sub_26968A0 attempts the SPMD transformation -- is computed by the Attributor-based OpenMP optimization driver at sub_269F530 (63 KB). This driver orchestrates interprocedural fixed-point analysis using the standard LLVM Attributor framework.
The analysis pipeline:
sub_269F530 (OpenMP Attributor Driver, 63 KB)
|
+-- sub_251BBC0 (AbstractAttribute infrastructure)
| Creates abstract attributes for each kernel, including
| the SPMD-compatibility tracker that will become a1+241.
|
+-- sub_251CD10 (Attributor::runTillFixpoint, 53 KB)
| Iterates up to openmp-opt-max-iterations (default: 256)
| times, updating abstract attribute states until convergence.
|
+-- sub_26747F0 (OpenMP kernel info collector)
Populates the PassState structure (a1) with:
a1+72: function handle
a1+160: serial-code-present flag
a1+224: multiple-region flag
a1+241: SPMD-eligible boolean <-- the gate
a1+280: parallel region array base
a1+288: parallel region count
a1+304: execution mode attribute map
The fixed-point analysis in sub_251CD10 converges by iterating over all abstract attributes until none change state. For SPMD eligibility, the key attribute tracks three conditions that must all hold:
- Execution mode is Generic (mode byte == 1). Read via sub_2674090/sub_2674040 from the kernel's attribute map at *(a1+304). If the kernel is already SPMD or Bare, no transformation is needed.
- All reachable callees are SPMD-amenable. The analysis walks every call/invoke/callbr instruction in every parallel region of the kernel. Each callee is looked up in the SPMD-amenable function set at *(a2+208)+34952. This set is populated by two sources:
  - Automatic population: When sub_312CF50 (the 194-case runtime declaration factory) creates a runtime function declaration, that function is automatically added to the set if it is known to be thread-safe (most __kmpc_* functions, all omp_* query functions).
  - User annotation: Functions declared with [[omp::assume("ompx_spmd_amenable")]] are inserted into the set by the attribute parser.
  The set uses the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192); see Hash Table and Collection Infrastructure. If any callee fails the lookup, the analysis sets *(a1+241) = 0 and the transformation will emit OMP121 diagnostics instead.
- No unresolvable side effects. Operations that are inherently unsafe when executed by all threads simultaneously -- such as I/O with ordering requirements, thread-local storage accesses assuming single-thread semantics, or calls to external functions with unknown side-effect profiles -- prevent SPMDization.
The Attributor driver at sub_269F530 also feeds into sub_2678420 (state machine generator) for kernels that fail SPMD eligibility, and into sub_2680940 (parallel region merging) for kernels that pass. The decision tree:
sub_269F530 analysis complete
|
+-- a1+241 == 1 (SPMD-eligible)
| |
| +-- a1+160 == 0 && a1+224 == 0 --> sub_26968A0 simple path
| +-- otherwise --> sub_26968A0 complex path
|
+-- a1+241 == 0 (not SPMD-eligible)
|
+-- has parallel regions --> sub_2678420 (custom state machine)
+-- no parallel regions --> sub_2678420 (remove dead state machine)
How the SPMD Transform Eliminates the State Machine
The actual elimination happens in sub_26968A0 and proceeds differently for simple vs. complex kernels, but the core mechanism is the same: replace the asymmetric master/worker execution model with symmetric all-thread execution.
Step 1: Remove the __kmpc_target_init return-value gate. In Generic mode, __kmpc_target_init returns -1 for workers and the kernel branches workers to the state machine loop. In SPMD mode, the return value is not used as a gate -- all threads fall through to user code. The transformation does not literally delete the __kmpc_target_init call (it is still needed for runtime initialization), but changes the execution mode attribute so the runtime initializes all threads as active.
Step 2: Eliminate the worker loop entirely. The basic blocks worker_state_machine.begin, .is_active.check, .parallel_region.check, .parallel_region.execute, .fallback.execute, .done.barrier, and .finished become dead code once the execution mode flips to SPMD. They are not explicitly deleted by sub_26968A0; instead, setting mode=2 in the KernelEnvironmentTy means the runtime never creates the worker branch, so the dead blocks are eliminated by subsequent DCE passes.
Step 3: Replace barrier primitives. Every __kmpc_barrier_simple_generic (ID 188) in the kernel is replaced with __kmpc_barrier_simple_spmd (ID 187). The difference:
- Generic barrier (ID 188): poll-based. Workers spin-check a shared-memory flag. The master writes the flag, then workers read it. This involves memory fences, cache-line bouncing, and potential bank conflicts. Compiles to a ld.volatile.shared + branch loop.
- SPMD barrier (ID 187): hardware-based. Maps directly to PTX bar.sync / CUDA __syncthreads(). Single instruction, handled by the warp scheduler with zero polling overhead.
Step 4: Guard serial code. For the simple case (single parallel region), this is just:
%tid = call i32 @__kmpc_get_hardware_thread_id_in_block() ; [6]
%is_main = icmp eq i32 %tid, 0
br i1 %is_main, label %user_code, label %exit.threads
For the complex case (multiple parallel regions with serial gaps), the 5-block guarded region structure is created for each serial section, with shared-memory output promotion and double-barrier synchronization as described in Phase 4 above.
Step 5: Update execution mode. The kernel attribute is rewritten from Generic (1) to SPMD (2) via the intermediate GENERIC_SPMD (3) marker. This is the final, irreversible step. Once the mode is set, __kmpc_target_init at runtime will launch all threads into user code instead of routing N-1 threads to a state machine.
Performance Impact of Elimination
The state machine elimination saves:
| Source of overhead | Generic mode | SPMD mode | Savings |
|---|---|---|---|
| Worker idle polling | N-1 threads spin in __kmpc_barrier_simple_generic | No idle threads | 100% of idle cycles |
| Barrier latency | Poll-based shared-memory loop (10s-100s of cycles) | Hardware bar.sync (single cycle dispatch) | ~10-100x per barrier |
| Warp divergence at entry | Warp 0 diverges (thread 0 = master, threads 1-31 = workers) | No divergence | 1 warp fully utilized |
| Indirect calls | __kmpc_kernel_parallel returns fn ptr for indirect dispatch | No indirect calls -- outlined fn body inlined/direct | Branch predictor pressure eliminated |
| Register pressure | Workers hold state machine registers while idle | No state machine registers | Improved occupancy |
| Shared memory | Generic barriers use shared-memory flags | Only guarded-output allocations use shared memory | Reduced shared memory pressure |
On a typical #pragma omp target parallel for kernel, the SPMD transformation eliminates 5 runtime calls per parallel-region per worker-thread per iteration of the state machine loop. For a 256-thread CTA with one parallel region, that is 255 threads x 5 calls = 1,275 eliminated runtime calls per kernel invocation.
Execution Mode Update
When the transformation succeeds, the kernel's execution mode attribute is updated from Generic (1) to SPMD (2). The update goes through an intermediate GENERIC_SPMD (3) state:
// At LABEL_227 (shared success path)
bv_entry = sub_ACD640(*(a1+304+8), /*mode=*/3, /*aux=*/0); // create mode-3 entry
current = sub_2673FD0(*(a1+304)); // read current attrs
*(a1+304) = sub_AAAE30(current, bv_entry, {key=0}, 1); // write SPMD mode
The execution mode encoding matches upstream LLVM's OMPTgtExecModeFlags:
| Value | Name | Meaning |
|---|---|---|
| 0 | OMP_TGT_EXEC_MODE_BARE | Bare mode (no runtime) |
| 1 | OMP_TGT_EXEC_MODE_GENERIC | Generic (state machine) |
| 2 | OMP_TGT_EXEC_MODE_SPMD | SPMD (all threads active) |
| 3 | OMP_TGT_EXEC_MODE_GENERIC_SPMD | Generic-mode kernel being converted to SPMD (intermediate marker) |
The mode is stored in the KernelEnvironmentTy global variable that __kmpc_target_init reads at kernel launch. Setting it to SPMD tells the runtime to skip the state machine setup and launch all threads directly into user code.
Limitations: What Prevents SPMDization
The following constructs cause the pass to emit OMP121 and fall back to Generic mode:
- Calls to non-SPMD-amenable functions: Any callee not in the SPMD-amenable set blocks transformation. The user override is [[omp::assume("ompx_spmd_amenable")]].
- Nested parallelism: Kernels with nested #pragma omp parallel regions inside a target region cannot be SPMDized because the worker threads are already participating.
- Tasking constructs: #pragma omp task, taskloop, and taskgroup create runtime-managed work units incompatible with the SPMD execution model.
- Critical sections and ordered regions: These constructs require specific thread-identity semantics that conflict with SPMD guards.
- Unresolvable side effects: Calls to external functions whose side-effect profile is unknown (no declaration with convergent or spmd_amenable annotations).
- Exception handling with unresolvable handlers: Invoke instructions with non-zero exception handler counts that cannot be simplified block the transformation (checked via sub_BD2BC0).
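The amenability check in the first and fifth bullets reduces to a set-membership test. A minimal sketch, assuming the set semantics described above (the real binary walks the open-addressing table at *(a2+208)+34952; names and containers here are stand-ins):

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative predicate: a call site blocks SPMDization unless the callee
// is in the SPMD-amenable set or carries the user override
// [[omp::assume("ompx_spmd_amenable")]]. A true result triggers OMP121.
bool callBlocksSPMD(const std::string &callee,
                    const std::set<std::string> &amenableSet,
                    bool calleeHasSpmdAmenableAssume) {
  if (calleeHasSpmdAmenableAssume)
    return false;  // user override wins
  return amenableSet.count(callee) == 0;  // unknown side effects
}
```
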
Comparison with Upstream LLVM OpenMPOpt
The cicc SPMD transformation in sub_26968A0 is a proprietary reimplementation that predates upstream LLVM's SPMDization and differs in several significant ways:
| Aspect | Upstream LLVM OpenMPOpt | cicc sub_26968A0 |
|---|---|---|
| Framework | Attributor-based (AAKernelInfo) | Standalone pass, direct IR mutation |
| Analysis approach | Fixed-point iteration via SPMDCompatibilityTracker | Pre-computed boolean flag at a1+241 |
| Guarded regions | insertInstructionGuardsHelper using SplitBlock | Custom 5-block split with explicit worklist registration |
| Broadcast mechanism | GlobalVariable in shared memory (internal linkage, UndefValue init) | alloca in address space 7 (shared) via sub_B30000 |
| Barrier | __kmpc_barrier_simple_spmd | Same: __kmpc_barrier_simple_spmd (call ID 187) |
| Hash tables | LLVM DenseSet / SmallPtrSet | Custom open-addressing with -4096 sentinel (details) |
| Region merging | Separate openmp-opt-enable-merging flag (disabled by default) | Integrated into the complex path; always runs when needed |
| State machine fallback | buildCustomStateMachine in same AAKernelInfo::manifest | Separate function sub_2678420 (41 KB) |
| Diagnostic IDs | OMP120, OMP121 (identical) | OMP120, OMP121 (identical) |
| ompx_spmd_amenable override | Same attribute name | Same attribute name |
The key architectural difference is that upstream LLVM uses the Attributor framework's fixed-point iteration to converge on SPMD compatibility, while cicc separates the analysis (which sets a1+241) from the transformation (which is sub_26968A0). This separation allows cicc to make a single pass over the IR for the transformation rather than iterating to a fixpoint, at the cost of less flexibility in handling interdependent kernels.
Upstream's region merging is behind openmp-opt-enable-merging and disabled by default. cicc's complex path (Phase 3a-3d) performs region merging unconditionally when a kernel has multiple parallel regions with serial gaps, suggesting NVIDIA found merging beneficial enough for GPU targets to enable it by default.
Configuration Knobs
All knobs are standard LLVM cl::opt registrations present in the cicc binary. These match upstream LLVM options:
| Knob | Type | Default | Effect |
|---|---|---|---|
| openmp-opt-disable | bool | false | Disables all OpenMP optimizations |
| openmp-opt-disable-spmdization | bool | false | Disables SPMD transformation specifically |
| openmp-opt-disable-deglobalization | bool | false | Disables device memory deglobalization |
| openmp-opt-disable-folding | bool | false | Disables OpenMP folding optimizations |
| openmp-opt-disable-state-machine-rewrite | bool | false | Disables custom state machine generation |
| openmp-opt-disable-barrier-elimination | bool | false | Disables barrier elimination optimizations |
| openmp-opt-disable-internalization | bool | false | Disables function internalization |
| openmp-opt-enable-merging | bool | false | Enables parallel region merging (disabled upstream by default; cicc's complex path merges regardless) |
| openmp-opt-inline-device | bool | false | Inlines all applicable device functions |
| openmp-opt-verbose-remarks | bool | false | Enables more verbose optimization remarks |
| openmp-opt-max-iterations | unsigned | 256 | Maximum attributor fixpoint iterations |
| openmp-opt-shared-limit | unsigned | UINT_MAX | Maximum shared memory usage for broadcast values |
| openmp-opt-print-module-before | bool | false | Dumps IR before OpenMP optimizations |
| openmp-opt-print-module-after | bool | false | Dumps IR after OpenMP optimizations |
Note: The openmp-opt-shared-limit knob controls how much shared memory can be consumed by broadcast value allocations in guarded regions. If the limit is exceeded, the transformation will not proceed for additional guarded outputs. The default of UINT_MAX effectively means no limit.
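The budget semantics of the note above can be modeled as a running allocation counter. This is a sketch under the assumption that the pass accounts per guarded output; the real accounting granularity is not confirmed from the binary:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Illustrative model of the openmp-opt-shared-limit budget: each broadcast
// allocation draws from a running total, and a guarded output whose size
// would exceed the limit is simply not promoted to shared memory.
struct SharedBudget {
  uint64_t limit = std::numeric_limits<unsigned>::max();  // UINT_MAX default
  uint64_t used = 0;

  // Returns true if the allocation fits; false means the transformation
  // skips this guarded output.
  bool tryAllocate(uint64_t bytes) {
    if (used + bytes > limit)
      return false;
    used += bytes;
    return true;
  }
};
```
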
Diagnostic Strings
| Code | Severity | Message | Trigger |
|---|---|---|---|
| OMP120 | Remark | "Transformed generic-mode kernel to SPMD-mode." | Successful transformation (both simple and complex paths) |
| OMP121 | Warning | "Value has potential side effects preventing SPMD-mode execution. Add [[omp::assume(\"ompx_spmd_amenable\")]] to the called function to override" | Callee not in SPMD-amenable set |
| OMP130-OMP133 | Various | State machine diagnostics | sub_2678420 (fallback, not this pass) |
| OMP150 | Remark | Parallel region merging | sub_2697xxx (separate merging diagnostics) |
Diagnostics are emitted only when a handler is registered at *(a2+4392) and the handler's isEnabled virtual method (vtable offset +48) returns true. The construction follows the pattern: sub_B174A0 (remark) or sub_B178C0 (warning) builds a DiagnosticInfo, sub_B18290 appends the message text, and sub_1049740 emits to the handler.
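The two-step gate (handler present, handler enabled) can be sketched as follows. Only the check order is taken from the decompilation; the class shape, names, and the CountingHandler are invented for illustration:

```cpp
#include <cassert>
#include <string>

// Sketch of the diagnostic gating: a handler pointer lives at *(a2+4392)
// and its isEnabled virtual (vtable offset +48 in the binary) gates emission.
struct DiagnosticHandler {
  virtual ~DiagnosticHandler() = default;
  virtual bool isEnabled() const = 0;
  virtual void handle(const std::string &msg) = 0;
};

// sub_B174A0/sub_B178C0 build the DiagnosticInfo, sub_B18290 appends the
// message text; this models the final emission step (sub_1049740).
bool emitDiagnostic(DiagnosticHandler *handler, const std::string &msg) {
  if (handler == nullptr || !handler->isEnabled())
    return false;  // silently dropped: no handler, or handler disabled
  handler->handle(msg);
  return true;
}

// Minimal concrete handler for demonstration.
struct CountingHandler final : DiagnosticHandler {
  bool enabled = true;
  int count = 0;
  bool isEnabled() const override { return enabled; }
  void handle(const std::string &) override { ++count; }
};
```
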
Runtime Call Dependencies
The transformation uses these runtime functions from the OpenMP runtime declaration table:
| Call ID | Function | Signature | Usage |
|---|---|---|---|
| 6 | __kmpc_get_hardware_thread_id_in_block | i32() | Thread identification for tid == 0 guards |
| 180 | __kmpc_alloc_shared | i8*(i64) | Allocate shared memory for guarded output promotion (complex path) |
| 181 | __kmpc_free_shared | void(i8*, i64) | Free shared memory allocations at kernel exit (complex path) |
| 187 | __kmpc_barrier_simple_spmd | void(ident_t*, i32) | Lightweight SPMD barrier (maps to PTX bar.sync) |
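The runtime calls above combine into the guard-and-broadcast shape: the hardware thread ID feeds a tid == 0 guard, the single result lands in shared storage (the addrspace(7) alloca), and the SPMD barrier separates producer from consumers. A sequential miniature, with the loop over tids standing in for the GPU's parallel threads:

```cpp
#include <cassert>
#include <vector>

// Sequential model of one guarded statement: only "thread" 0 executes it,
// the result is broadcast through shared storage, and every thread reads
// it back after the barrier point.
std::vector<int> guardedBroadcast(int numThreads) {
  int sharedSlot = 0;  // models the shared-memory (addrspace 7) alloca
  for (int tid = 0; tid < numThreads; ++tid)
    if (tid == 0)       // guarded region: one thread executes
      sharedSlot = 42;  // the side-effecting statement
  // --- __kmpc_barrier_simple_spmd would synchronize here ---
  std::vector<int> perThread(numThreads);
  for (int tid = 0; tid < numThreads; ++tid)
    perThread[tid] = sharedSlot;  // every thread observes the broadcast
  return perThread;
}
```
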
The state machine fallback (sub_2678420) uses a different set of runtime calls, all of which become dead code after successful SPMD transformation:
| Call ID | Function | Signature | Eliminated by SPMD |
|---|---|---|---|
| 155 | __kmpc_target_init | i32(KernelEnvironmentTy*, KernelLaunchEnvironmentTy*) | Return value no longer gates workers |
| 156 | __kmpc_target_deinit | void() | Retained (still needed for cleanup) |
| 157 | __kmpc_kernel_prepare_parallel | void(i8*) | Eliminated -- no worker dispatch needed |
| 171 | __kmpc_kernel_parallel | i1(i8**) | Eliminated -- no worker polling loop |
| 172 | __kmpc_kernel_end_parallel | void() | Eliminated -- no worker completion signal |
| 188 | __kmpc_barrier_simple_generic | void(ident_t*, i32) | Replaced with ID 187 (SPMD barrier) |
Additionally, the SPMD-amenable function set at *(a2+208)+34952 is populated by the runtime table builder (sub_312CF50) during module initialization. Functions declared via sub_312CF50 cases 0-193 are automatically considered, along with user-annotated functions.
Function Map
| Role | Function | Address | Size |
|---|---|---|---|
| Generic-to-SPMD transformation pass (this function, 61 KB) | sub_26968A0 | -- | -- |
| Worker state machine generation (Generic fallback, 41 KB) | sub_2678420 | -- | -- |
| Attributor-based OpenMP optimization driver (63 KB, sets a1+241) | sub_269F530 | -- | -- |
| Parallel region merging (52 KB) | sub_2680940 | -- | -- |
| AbstractAttribute infrastructure (Attributor framework) | sub_251BBC0 | -- | -- |
| Attributor::runTillFixpoint (53 KB, fixed-point iteration engine) | sub_251CD10 | -- | -- |
| OpenMP kernel info collector (populates PassState) | sub_26747F0 | -- | -- |
| Attributor Module Pass entry point (51 KB) | sub_2591C20 | -- | -- |
| Read execution mode from attribute map | sub_2674090 | -- | -- |
| Read execution mode (alternate entry) | sub_2674040 | -- | -- |
| Get parallel region thread configuration | sub_250CBE0 | -- | -- |
| Read attribute from kernel attribute map | sub_2673FD0 | -- | -- |
| Create secondary barrier call | sub_2673A60 | -- | -- |
| OpenMP runtime call table lookup by ID (194-case switch, 117 KB) | sub_312CF50 | -- | -- |
| registerRuntimeFunction (registers declaration in table) | sub_3122A50 | -- | -- |
| Parallel region outliner (47 KB, creates .omp_par functions) | sub_313D1B0 | -- | -- |
| Get function entry basic block | sub_25096F0 | -- | -- |
| Get function scope / debug info | sub_BD5C60 | -- | -- |
| Build CFG region (start/end blocks) | sub_AA8550 | -- | -- |
| Build exit/cleanup block | sub_AA4D50 | -- | -- |
| Split basic block | sub_F36960 | -- | -- |
| Allocate IR instruction node | sub_BD2C40 | -- | -- |
| Fill instruction as runtime-call value load | sub_B4A410 | -- | -- |
| Create integer constant (zero for tid check) | sub_AD64C0 | -- | -- |
| Create integer constant (alternate entry, used in complex path) | sub_AD6530 | -- | -- |
| Create icmp instruction | sub_B52500 | -- | -- |
| Create branch instruction (opcode 3) | sub_B4C9A0 | -- | -- |
| Create shared-memory alloca (addr space 7) | sub_B30000 | -- | -- |
| Create store instruction | sub_B4D460 | -- | -- |
| Create load instruction | sub_B4D230 | -- | -- |
| Replace all uses of a value | sub_256E5A0 | -- | -- |
| Create runtime library call instruction | sub_921880 | -- | -- |
| Create bit-vector entry | sub_ACD640 | -- | -- |
| Insert into attribute map | sub_AAAE30 | -- | -- |
| Register block in pass manager worklist | sub_D695C0 | -- | -- |
| Construct remark DiagnosticInfo | sub_B174A0 | -- | -- |
| Construct warning DiagnosticInfo | sub_B178C0 | -- | -- |
| Append string to diagnostic message | sub_B18290 | -- | -- |
| Emit diagnostic to handler | sub_1049740 | -- | -- |
| Check if instruction is a call | sub_B46970 | -- | -- |
| Check if instruction is an invoke | sub_B46420 | -- | -- |
| Get invoke exception handler count | sub_BD2BC0 | -- | -- |
| Insert guard instructions at range boundary | sub_B444E0 | -- | -- |
| Fast-path comparison instruction creation | sub_AAB310 | -- | -- |
| Full comparison instruction creation | sub_B523C0 | -- | -- |
| Build name from debug info + suffix | sub_CA0F50 | -- | -- |
| Ref-count increment on metadata/debug-info | sub_B96E90 | -- | -- |
| Ref-count decrement on metadata/debug-info | sub_B91220 | -- | -- |
| Transfer metadata ownership between blocks | sub_B976B0 | -- | -- |
| Get terminator's successor block pointer | sub_986580 | -- | -- |
| Add operand bundle to instruction | sub_B99FD0 | -- | -- |
| Duplicate metadata reference | sub_266EF50 | -- | -- |
| Process entry block terminator successor | sub_B491C0 | -- | -- |
| Get instruction value type | sub_ACA8A0 | -- | -- |
| Get IR node name | sub_BD5D20 | -- | -- |
| Vector push_back (dynamic arrays) | sub_C8CC70 | -- | -- |
| Vector reserve/grow | sub_C8D5F0 | -- | -- |
Cross-References
- OpenMP Runtime Declaration Table -- complete runtime function table (sub_312CF50), including __kmpc_barrier_simple_spmd (ID 187) and __kmpc_get_hardware_thread_id_in_block (ID 6)
- Entry Point & CLI -- how OpenMP target offloading flags reach the optimizer
- LLVM Optimizer -- pipeline slots 75/76/154 where openmp-opt runs
- CLI Flags -- openmp-opt-* knob documentation
LTO & Module Optimization
CICC v13.0 implements Link-Time Optimization as a five-pass pipeline that exploits the GPU's closed-world compilation model for optimization opportunities unavailable to CPU compilers. In CPU LTO, the linker merges partially-optimized object files and runs a second round of optimization on the combined module. The fundamental constraint is that shared libraries, dynamic loading, and symbol interposition limit what the optimizer can assume about the complete program. On GPU, none of these constraints exist. Every __device__ function that can execute on the hardware must be statically visible at compile time -- there is no device-side dlopen, no .so files, no PLT/GOT, no symbol preemption. This closed-world guarantee means the LTO pipeline can inline aggressively across translation units, devirtualize every virtual call site against a complete class hierarchy, and promote or split global variables with full knowledge that no external observer will access the original symbols.
The LTO pipeline runs after the main LLVM optimizer (tier 0-3 passes) has performed per-module optimization. It is triggered when cicc processes bitcode from separate compilation (nvcc --device-c / -dc mode), where each .cu file compiles to a relocatable device object containing LLVM bitcode in the NVVM container. The device linker (nvlink) merges these objects and reinvokes cicc in LTO mode, passing the combined bitcode through the LTO pipeline before final PTX emission. In whole-program compilation (the default), the pipeline is still partially active -- GlobalOpt and the inliner run regardless, but the summary-based import machinery is skipped because there is only one module.
| LTO pipeline entry | sub_12F5F30 (0x12F5F30, 37.8 KB) |
| NVModuleSummary driver | sub_D81040 (0xD81040, 56 KB) |
| Summary builder | sub_D7D4E0 (0xD7D4E0, 74 KB) |
| Address range (summary cluster) | 0xD60000--0xD82000 |
| Address range (import/inline cluster) | 0x1850000--0x186CA00 |
| NVVM container IRLevel for LTO | NVVM_IR_LEVEL_LTO (value 1) |
| Compile mode for separate compilation | NVVM_COMPILE_MODE_SEPARATE_ABI (value 2) |
| Module flags read | EnableSplitLTOUnit, UnifiedLTO, ThinLTO |
Why LTO Matters for GPU
Three properties of GPU execution make LTO dramatically more valuable than on CPU:
Function calls are expensive. Every GPU function call marshals arguments through the .param calling convention via st.param / ld.param instruction sequences. A function with 8 struct arguments can generate hundreds of cycles of marshaling overhead that inlining eliminates entirely. Cross-module inlining -- which requires LTO -- is the primary mechanism for removing this cost for functions defined in separate translation units. See the inliner cost model for the full cost analysis.
Register pressure determines performance. Occupancy is bounded by per-thread register usage, with discrete cliff boundaries. Call boundaries force the backend to save and restore registers across the call site, often spilling to local memory (device DRAM, 200-800 cycle latency). LTO enables cross-module inlining, which in turn enables cross-function register allocation -- the single most impactful optimization for GPU code.
Indirect calls are catastrophic. An indirect call in PTX (call.uni through a register) prevents backend inlining, forces full register spills, destroys instruction scheduling freedom, and creates warp-divergence hazards. Whole-program devirtualization, which requires LTO-level visibility of the complete type hierarchy, converts indirect calls to direct calls and enables all downstream optimizations.
Regular LTO vs ThinLTO
CICC supports both regular (monolithic) LTO and ThinLTO. The LTO driver at sub_D81040 reads three module flags via sub_BA91D0 to determine which mode is active:
| Module Flag | Effect |
|---|---|
| EnableSplitLTOUnit | Enables the split LTO unit mechanism for type metadata |
| UnifiedLTO | Enables LLVM's unified LTO pipeline (combined thin+regular) |
| ThinLTO | Activates summary-based import and the two-phase declaration merge in sub_D7D4E0 |
Regular LTO merges all translation units into a single LLVM module, then runs the full optimization pipeline on the merged result. This gives the optimizer complete visibility but has O(n) memory cost in the total program size and serializes compilation. For GPU programs this is often acceptable because device code is typically smaller than host code.
ThinLTO builds per-module summaries (via NVModuleSummary), uses the summaries to make import decisions without loading full bitcode, then imports selected functions and optimizes each module independently. The builder's a8 parameter (thinlto_mode flag) activates Phase 2 of the summary builder, which performs a second walk over declarations to merge forward-declared and defined symbol tables. This mode enables parallel per-module optimization at the cost of less global visibility.
In practice, NVIDIA's toolchain (nvcc + nvlink) uses regular LTO as the default for device code, because the closed-world model and relatively small code size (compared to CPU programs) make the memory and compile-time cost acceptable. ThinLTO is available for large CUDA programs where compile time is a concern, activated by passing -dlto to nvcc (device LTO) or -flto=thin through the driver.
LTO Pipeline
The LTO pipeline executes five major passes in a fixed order. Each pass consumes the output of its predecessor:
┌────────────────────────────────────────────────────────────────────────┐
│ NVVM Container (IRLevel=1) │
│ LLVM Bitcode + Module Flags │
└────────────────────┬───────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 1. NVModuleSummary Builder (sub_D7D4E0, 74 KB) │
│ Build per-function summaries with 4-level import priority, │
│ complexity budget, CUDA attribute flags, call graph edges │
└────────────────────┬──────────────────────────────────────────-┘
│ ModuleSummaryIndex
▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. ThinLTO Function Import (sub_1854A20, 4.3 KB) │
│ Summary-guided cross-module import with floating-point │
│ threshold computation, priority-class multipliers, │
│ global import budget cap │
└────────────────────┬──────────────────────────────────────────-┘
│ Materialized functions + thinlto_src_module metadata
▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. Inliner (sub_1864060 + sub_2613930 + sub_38576C0) │
│ Four parallel cost models: NVIDIA custom (20K budget), │
│ LLVM standard (225), New PM CGSCC + ML, NVPTX target │
└────────────────────┬──────────────────────────────────────────-┘
│ Inlined module
▼
┌─────────────────────────────────────────────────────────────────┐
│ 4. GlobalOpt (sub_18612A0, 65 KB) │
│ Small-constant promotion (≤2047 bits), SRA for structs │
│ (≤16 fields), malloc/free elimination, address-space-aware │
└────────────────────┬──────────────────────────────────────────-┘
│ Optimized globals
▼
┌─────────────────────────────────────────────────────────────────┐
│ 5. WholeProgramDevirtualization (sub_2703170, 13 KB) │
│ Type-test metadata → vtable resolution → direct calls │
│ Red-black tree for type info lookup, 0x90-byte records │
└────────────────────┬──────────────────────────────────────────-┘
│
▼
Dead Kernel Elimination + GlobalDCE
→ Standard optimizer pipeline (tier 0-3)
→ Code generation + PTX emission
The LTO pipeline entry at sub_12F5F30 (37.8 KB) orchestrates this sequence and also runs dead kernel elimination -- removing __global__ functions that are never referenced by host-side kernel launches. This is a GPU-specific optimization: on CPU, the linker preserves all externally-visible entry points, but in GPU LTO the compiler knows the complete set of kernel launch sites from the host code.
LTO Pipeline Entry -- sub_12F5F30 Algorithm
sub_12F5F30 (0x12F5F30, 37,797 bytes) is the top-level LTO orchestrator. It is called after the CLI parser (sub_12F7D90) has resolved the compilation mode bitmask and the LTO argument vector has been populated from the -Xlto forwarding meta-flag. The function operates in three distinct modes determined by the mode bitmask in a13:
| Mode | Bitmask | CLI Flag | Behavior |
|---|---|---|---|
| gen-lto | 0x21 | -gen-lto | Emit partially-optimized bitcode for later linking. No dead-kernel pass. |
| full LTO | 0x23 | -lto | Full merge + optimize + dead-kernel elimination + emit PTX. |
| link-lto | 0x26 | -link-lto | Link pre-existing LTO bitcode modules, run full pipeline. |
The function's argument list is reconstructed from the LTO output vector v330 (the fourth CLI routing vector, populated by -Xlto and the six -host-ref-* flags). It receives the merged LLVM module, the host reference tables, and the compilation options struct.
Pseudocode: sub_12F5F30 Top-Level
function sub_12F5F30(module, lto_args, options, error_cb):
# ---- Phase A: Parse LTO-specific arguments ----
mode = NONE
trace_enabled = false
optimize_unused_vars = false
host_refs = HostRefTable{} # 6-field table: ek, ik, ec, ic, eg, ig
force_device_c = false
for arg in lto_args:
switch arg:
case "-gen-lto": mode = GEN_LTO
case "-link-lto": mode = LINK_LTO
case "-olto": lto_opt_level = next_arg()
case "--device-c": device_c = true
case "--force-device-c": force_device_c = true
case "--trace": trace_enabled = true
case "-optimize-unused-variables": optimize_unused_vars = true
case "-has-global-host-info": has_host_info = true
case "-host-ref-ek=*": host_refs.ek = parse_symbol_list(value)
case "-host-ref-ik=*": host_refs.ik = parse_symbol_list(value)
case "-host-ref-ec=*": host_refs.ec = parse_symbol_list(value)
case "-host-ref-ic=*": host_refs.ic = parse_symbol_list(value)
case "-host-ref-eg=*": host_refs.eg = parse_symbol_list(value)
case "-host-ref-ig=*": host_refs.ig = parse_symbol_list(value)
# ---- Phase B: Build preserved-symbol sets ----
# Collect symbols from llvm.used and llvm.metadata named metadata
used_set = collect_named_metadata(module, "llvm.used")
metadata_set = collect_named_metadata(module, "llvm.metadata")
# Merge host reference tables into a unified "referenced from host" set.
# The 6 host-ref flags encode three entity types x two reference modes:
# e = explicit reference (symbol name appears in host launch site)
# i = implicit reference (symbol address taken on host side)
# k = kernel (__global__), c = constant (__constant__), g = global (__device__)
host_referenced_kernels = host_refs.ek UNION host_refs.ik
host_referenced_constants = host_refs.ec UNION host_refs.ic
host_referenced_globals = host_refs.eg UNION host_refs.ig
# ---- Phase C: Decide what to preserve ----
preserved = used_set UNION metadata_set UNION host_referenced_kernels
if NOT optimize_unused_vars:
preserved = preserved UNION host_referenced_constants
UNION host_referenced_globals
# ---- Phase D: Dead kernel/variable elimination ----
if mode == GEN_LTO:
# gen-lto: emit bitcode only, skip elimination
return emit_lto_bitcode(module)
if has_host_info:
dead_kernel_elimination(module, preserved, trace_enabled)
if optimize_unused_vars:
dead_variable_elimination(module, preserved,
host_referenced_constants,
host_referenced_globals,
trace_enabled)
# ---- Phase E: Run the 5-pass LTO pipeline ----
if mode == LINK_LTO or mode == FULL_LTO:
run_module_summary_builder(module) # sub_D7D4E0 via sub_D81040
run_thinlto_import(module) # sub_1854A20 (if ThinLTO)
run_inliner(module) # sub_1864060 + sub_2613930
run_globalopt(module) # sub_18612A0
run_whole_program_devirt(module) # sub_2703170
run_global_dce(module) # final GlobalDCE sweep
# ---- Phase F: Hand off to optimizer pipeline ----
return module # returned to sub_12E7E70 for tier 0-3 passes
Host Reference Flag Encoding
The six -host-ref-* flags are the mechanism by which nvlink communicates host-side symbol usage to cicc's LTO pass. nvlink inspects the host-side relocatable objects and emits a semicolon-separated list of device symbol names for each flag. The two-letter suffix encodes:
| Suffix | Entity Type | Reference Kind |
|---|---|---|
| -host-ref-ek | __global__ kernel | Explicit (launch site in host code) |
| -host-ref-ik | __global__ kernel | Implicit (address taken, e.g. &myKernel) |
| -host-ref-ec | __constant__ variable | Explicit (cudaMemcpyToSymbol target) |
| -host-ref-ic | __constant__ variable | Implicit (address taken) |
| -host-ref-eg | __device__ global variable | Explicit (cudaMemcpyToSymbol target) |
| -host-ref-ig | __device__ global variable | Implicit (address taken) |
The -has-global-host-info flag signals that nvlink has provided complete host reference information. When this flag is absent, sub_12F5F30 conservatively preserves all externally-visible symbols -- the dead kernel/variable elimination pass is skipped entirely.
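The parse_symbol_list helper from the Phase A pseudocode can be made concrete: nvlink passes each -host-ref-* value as a semicolon-separated symbol list. The delimiter follows the pseudocode; the binary's actual container type is unknown, so a std::set stands in:

```cpp
#include <cassert>
#include <set>
#include <string>

// Splits the semicolon-separated value of a -host-ref-* flag into a symbol
// set, skipping empty segments (e.g. from a trailing ";").
std::set<std::string> parseSymbolList(const std::string &value) {
  std::set<std::string> symbols;
  std::string::size_type start = 0;
  while (start < value.size()) {
    auto end = value.find(';', start);
    if (end == std::string::npos)
      end = value.size();
    if (end > start)
      symbols.insert(value.substr(start, end - start));
    start = end + 1;
  }
  return symbols;
}
```
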
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| sub_12F5F30 | 0x12F5F30 | 37.8 KB | LTO pipeline entry and dead-symbol orchestrator |
| sub_12F5610 | 0x12F5610 | 7.3 KB | LLVM module linker wrapper (Linker::linkModules) |
| sub_12F7D90 | 0x12F7D90 | 14.3 KB | CLI argument parser (architecture, opt level, flags) |
| sub_12F4060 | 0x12F4060 | 15.7 KB | TargetMachine creation with NVIDIA options |
| sub_1C13840 | 0x1C13840 | -- | Global/function iterator used for dead-code sweep |
| sub_12F1650 | 0x12F1650 | 5.2 KB | Bitcode reader variant A |
| sub_12F11C0 | 0x12F11C0 | 5.2 KB | Bitcode reader variant B |
Dead Kernel Elimination Algorithm
Dead kernel elimination is the most impactful GPU-specific optimization in the LTO pipeline. It exploits the closed-world model: every __global__ function that will ever execute must have a corresponding <<<>>> launch site (or cudaLaunchKernel call) in the host code that nvlink has already seen. Any kernel not in the host reference set is dead.
This pass cannot exist on CPU. A CPU linker must preserve all non-hidden external functions because shared libraries loaded at runtime via dlopen could call them. On GPU there is no dlopen, no dynamic symbol resolution, no PLT. The set of reachable kernels is completely determined at link time.
Pseudocode: dead_kernel_elimination
function dead_kernel_elimination(module, preserved_set, trace):
# Walk all functions in the module via sub_1C13840 iterator
worklist = []
for func in module.functions():
if func.isDeclaration():
continue
cc = func.getCallingConv()
# PTX calling convention 71 = __global__ (kernel entry point)
# PTX calling convention 72 = __device__ (device function)
# PTX calling convention 95 = CUDA internal (managed init)
if cc != 71:
continue # only eliminate kernels, not device functions
name = func.getName()
if name in preserved_set:
continue # referenced from host, or in llvm.used -- keep it
# This kernel has no host launch site.
if trace:
emit_diagnostic("no reference to kernel " + name)
worklist.append(func)
# ---- Remove dead kernels ----
for func in worklist:
# Before erasing, check if any device-side indirect references exist.
# On GPU, device-side function pointers (callback patterns) can reference
# kernels via address-of. Check use_empty():
if NOT func.use_empty():
# Has device-side users -- cannot safely remove.
# (This is rare: kernels are almost never called from device code.)
continue
func.replaceAllUsesWith(UndefValue)
func.eraseFromParent()
return len(worklist)
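The sweep above can be condensed into a runnable miniature: only definitions with the __global__ calling convention (71), no entry in the preserved set, and no device-side uses are erased. The Fn record is a stand-in for llvm::Function, not the binary's actual layout:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Stand-in for llvm::Function with just the fields the sweep consults.
struct Fn {
  std::string name;
  unsigned callingConv;  // 71 = __global__, 72 = __device__
  unsigned numUses;      // device-side references (the use_empty() check)
  bool isDeclaration;
};

// Removes dead kernels in place; returns how many were erased.
std::size_t deadKernelElimination(std::vector<Fn> &module,
                                  const std::set<std::string> &preserved) {
  std::size_t removed = 0;
  for (auto it = module.begin(); it != module.end();) {
    bool dead = !it->isDeclaration && it->callingConv == 71 &&
                preserved.count(it->name) == 0 && it->numUses == 0;
    if (dead) {
      it = module.erase(it);
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}
```
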
Pseudocode: dead_variable_elimination
When -optimize-unused-variables is enabled, the same logic extends to __device__ and __constant__ global variables:
function dead_variable_elimination(module, preserved_set,
host_constants, host_globals, trace):
worklist = []
for gv in module.globals():
if gv.isDeclaration():
continue
name = gv.getName()
if name in preserved_set:
continue
as = gv.getAddressSpace()
# Address space 1 = global, address space 4 = constant
if as == 4 and name NOT in host_constants:
if trace:
emit_diagnostic("no reference to variable " + name)
worklist.append(gv)
elif as == 1 and name NOT in host_globals:
if trace:
emit_diagnostic("no reference to variable " + name)
worklist.append(gv)
for gv in worklist:
if NOT gv.use_empty():
continue # still referenced from device code
gv.eraseFromParent()
return len(worklist)
The --trace-lto CLI flag (which maps to --trace in the LTO argument vector via the flag catalog at line 2394) enables the diagnostic messages. When active, cicc prints one line per eliminated symbol to stderr, enabling build-system integration and debugging of unexpected kernel removal.
Module Merge Process
Before sub_12F5F30 can perform dead-kernel elimination or any LTO optimization, the separate-compilation bitcode modules must be merged into a single LLVM module. This merge happens in two layers: the NVIDIA module linker wrapper sub_12F5610 (7.3 KB) and the underlying LLVM IRLinker at sub_16786A0 (61 KB).
Two-Level Linking Architecture
nvlink extracts .nv_fatbin bitcode sections
|
v
┌─────────────────────────────────────────────────────────────┐
│ NVIDIA Module Loader (sub_12C06E0, 63 KB) │
│ - Validates bitcode magic (0x0B17C0DE wrapper/0xDEC04342)   │
│ - Checks IR version via sub_12BFF60 │
│ - Validates target triple (must be "nvptx64-*") │
│ - Single-module fast path: return directly if N=1 │
│ - Multi-module: normalize triples, set matching DataLayout │
└─────────────────────┬───────────────────────────────────────┘
│ N validated modules
v
┌─────────────────────────────────────────────────────────────┐
│ NVIDIA Module Linker Wrapper (sub_12F5610, 7.3 KB) │
│ - Selects primary module (typically the largest) │
│ - For each secondary module: │
│ Copy triple from primary → secondary │
│ Call IRLinker to merge secondary into primary │
│ - Post-link: restore linkage attributes from hash table │
│ Values 7-8: external linkage (low 6 bits) │
│ Other: set low 4 bits + visibility from bits 4-5 │
│ Set dso_local flag (byte+33 |= 0x40) │
└─────────────────────┬───────────────────────────────────────┘
│
v
┌─────────────────────────────────────────────────────────────┐
│ LLVM IRLinker::run (sub_16786A0, 61 KB) │
│ - Allocates 0x2000-byte DenseMap for symbol resolution │
│ - Hash function: (addr >> 9) ^ (addr >> 4) │
│ - Resolves COMDAT groups (sub_167DAB0, 39 KB) │
│ - Links global value prototypes (sub_1675980, 37 KB) │
│ - Links function bodies (sub_143B970, 14 KB) │
│ - Merges named metadata (llvm.dbg.cu, llvm.used, etc.) │
│ - Resolves llvm.global_ctors / llvm.global_dtors ordering │
│ - Maps values across modules via DenseMap<Value*, Value*> │
│ - Tombstone sentinels: empty=-8, deleted=-16 │
└─────────────────────┬───────────────────────────────────────┘
│ single merged module
v
sub_12F5F30 (LTO pipeline entry)
Pseudocode: Module Merge (sub_12F5610 + sub_12C06E0)
function module_merge(module_list, llvm_ctx, options):
# ---- Step 1: Load and validate all modules (sub_12C06E0) ----
modules = []
for entry in module_list:
buf = open_buffer(entry.data, entry.length, entry.name)
# Validate bitcode magic
magic = read_u32(buf, 0)
if magic != 0x0B17C0DE and magic != 0xDEC04342:
error("invalid bitcode: " + entry.name)
return NULL
module = parse_bitcode(buf, llvm_ctx) # sub_15099C0
# Check IR version compatibility (sub_12BFF60)
if ir_version_check(module_list, module, flags) != 0:
error(entry.name + ": error: incompatible IR detected. "
"Possible mix of compiler/IR from different releases.")
return NULL
# Validate target triple
triple = module.getTargetTriple()
if NOT triple.startswith("nvptx64-"):
error("Module does not contain a triple, "
"should be 'nvptx64-'")
return NULL
modules.append(module)
# ---- Step 2: Single-module fast path ----
if len(modules) == 1:
return modules[0]
# ---- Step 3: Multi-module linking (sub_12F5610) ----
# Save linkage attributes before linking (they get modified)
linkage_map = DenseMap<StringRef, u8>{}
for module in modules:
for func in module.functions():
linkage_map[func.getName()] = func.getLinkage()
for gv in module.globals():
linkage_map[gv.getName()] = gv.getLinkage()
# Select primary module and link secondaries into it
primary = modules[0]
for i in range(1, len(modules)):
secondary = modules[i]
# Normalize: copy DataLayout from primary to secondary
secondary.setDataLayout(primary.getDataLayout())
secondary.setTargetTriple(primary.getTargetTriple())
# IRLinker::run (sub_16786A0)
# Resolves COMDATs, links globals, maps values, merges metadata
err = Linker::linkModules(primary, secondary)
if err:
error("<module_name>: link error: <details>")
return NULL
# ---- Step 4: Restore linkage attributes ----
# During linking, LLVM may promote linkage (e.g., internal -> external)
# to resolve cross-module references. Restore the original linkage
# where possible, preserving the correct visibility for PTX emission.
for func in primary.functions():
name = func.getName()
if name in linkage_map:
original = linkage_map[name]
if original in [7, 8]: # external linkage variants
func.setLinkage(original & 0x3F)
else:
func.setLinkage(original & 0x0F)
if (original & 0x30) != 0:
func.setVisibility(original >> 4)
func.setDSOLocal(true) # byte+33 |= 0x40
for gv in primary.globals():
# same linkage restoration logic
...
return primary
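The linkage-byte decode in Step 4 above follows a small bit layout: values 7-8 (external-linkage variants) keep their low 6 bits, anything else keeps the low 4 bits plus a visibility field in bits 4-5, and dso_local is forced on (byte+33 |= 0x40 in the binary). A sketch with invented field names:

```cpp
#include <cassert>
#include <cstdint>

struct RestoredLinkage {
  unsigned linkage;
  unsigned visibility;  // 0 when bits 4-5 were clear
  bool dsoLocal;
};

// Decodes one saved linkage byte per the pseudocode's Step 4.
RestoredLinkage decodeLinkageByte(uint8_t saved) {
  RestoredLinkage r{0, 0, true};  // dso_local always set post-link
  if (saved == 7 || saved == 8) {
    r.linkage = saved & 0x3F;     // external-linkage variants keep 6 bits
  } else {
    r.linkage = saved & 0x0F;     // others keep the low 4 bits
    if ((saved & 0x30) != 0)
      r.visibility = saved >> 4;  // visibility from bits 4-5
  }
  return r;
}
```
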
Key Data Structures in the Merge
| Structure | Location | Details |
|---|---|---|
| Value map DenseMap | Allocated in sub_16786A0 | 0x2000 bytes (8192), hash: (addr >> 9) ^ (addr >> 4), quadratic probing |
| Linkage hash table | Stack-allocated in sub_12E1EF0 (v362) | Maps StringRef name to original linkage byte |
| Function-to-module map | Stack-allocated in sub_12E1EF0 (v359) | Maps StringRef name to function pointer for split-module dispatch |
| COMDAT group map | Internal to sub_167DAB0 | Tracks COMDAT selection kinds: any / exact-match / largest / no-dup / same-size |
| Named metadata merge list | Internal to sub_1671B40 | Special handling for llvm.dbg.cu, llvm.used, llvm.compiler.used, llvm.global_ctors, llvm.global_dtors, llvm.global.annotations |
| Module config flag | dword_4F99BC0 | Controls linker behavior variant |
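The value-map probe sequence can be illustrated with a small sketch. Only the hash expression `(addr >> 9) ^ (addr >> 4)` comes from the decompilation; the slot count (512, i.e. the 0x2000-byte buffer divided into 16-byte entries) and the triangular-number quadratic step are assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of the value-map hashing in sub_16786A0. SLOTS and the exact
 * quadratic step are assumptions; the hash mix is from the binary. */
enum { SLOTS = 512 };

static size_t vm_hash(const void *key) {
    uintptr_t a = (uintptr_t)key;
    return (size_t)((a >> 9) ^ (a >> 4)) & (SLOTS - 1);
}

/* Quadratic probing: home, home+1, home+3, home+6, ... (mod SLOTS). */
static size_t vm_probe(size_t home, unsigned attempt) {
    return (home + (size_t)attempt * (attempt + 1) / 2) & (SLOTS - 1);
}
```

Shifting by 9 and 4 discards pointer alignment bits before mixing, which is the usual reason for this style of pointer hash.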
Split-Module Compilation and Re-Linking
When concurrent compilation is active (thread count > 1 and multiple defined functions), the optimization pipeline uses a split-module strategy: each function is extracted into its own bitcode module, optimized independently in a thread pool, and then re-linked. The split/re-link cycle uses the same sub_12F5610 linker wrapper:
- Split (sub_1AB9F40): extracts per-function bitcode using a filter callback (sub_12D4BD0) that selects a single function by name from the function-to-module hash table.
- Optimize (thread pool via sub_16D5230): each worker runs sub_12E86C0 (Phase II optimizer) with qword_4FBB3B0 = 2.
- Re-link (sub_12F5610): merges all per-function bitcode modules back into a single module.
- Restore linkage (v362 hash table): the linkage attributes saved before the split are written back to prevent linkage promotion artifacts.
This cycle is orchestrated by sub_12E1EF0 (51 KB, the top-level concurrent compilation entry). The GNU Jobserver integration (sub_16832F0) throttles thread pool size to match the build system's -j level when cicc is invoked from make.
Separate Compilation and the NVVM Container
When nvcc --device-c compiles a .cu file, cicc produces an NVVM container with CompileMode = NVVM_COMPILE_MODE_SEPARATE_ABI (value 2) and IRLevel = NVVM_IR_LEVEL_LTO (value 1). This container wraps partially-optimized LLVM bitcode -- the per-module optimizer has run, but cross-module optimization has not. The bitcode is embedded in the ELF .nv_fatbin section of the relocatable object file.
At link time, nvlink extracts the bitcode sections from all input objects, concatenates them, and passes the result back to cicc in LTO mode. cicc deserializes each container, links the bitcode modules via LLVM's Linker::linkModules, and then runs the LTO pipeline described above on the merged module. The pipeline sees the complete device program for the first time at this point.
The IRLevel enum controls which optimizations have already been applied:
| IRLevel | Value | Meaning |
|---|---|---|
| NVVM_IR_LEVEL_UNIFIED_AFTER_DCI | 0 | Default: fully optimized, no LTO needed |
| NVVM_IR_LEVEL_LTO | 1 | Partially optimized, awaiting LTO pipeline |
| NVVM_IR_LEVEL_OPTIX | 2 | OptiX pipeline IR (separate optimization model) |
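A minimal C rendering of the two container fields, using only the values documented above (the enum member names follow the NVVM API convention used in the text; other members of these enums are not reproduced here):

```c
#include <assert.h>

/* Sketch of the container header fields described above. Only the
 * documented values are listed; the enums likely have more members. */
typedef enum {
    NVVM_COMPILE_MODE_SEPARATE_ABI = 2   /* produced by nvcc --device-c */
} NVVMCompileMode;

typedef enum {
    NVVM_IR_LEVEL_UNIFIED_AFTER_DCI = 0, /* fully optimized, no LTO needed */
    NVVM_IR_LEVEL_LTO               = 1, /* awaiting the LTO pipeline */
    NVVM_IR_LEVEL_OPTIX             = 2  /* OptiX pipeline IR */
} NVVMIRLevel;

/* A container carrying (SEPARATE_ABI, LTO) signals nvlink that the merged
 * module still needs the cross-module pipeline. */
static int needs_lto(NVVMIRLevel level) { return level == NVVM_IR_LEVEL_LTO; }
```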
Pass Inventory
| Pass | Entry Point | Size | Pipeline Slot | Type | Sub-page |
|---|---|---|---|---|---|
| NVModuleSummary Builder | sub_D7D4E0 | 74 KB | N/A (called from driver) | Analysis | module-summary.md |
| NVModuleSummary Driver | sub_D81040 | 56 KB | N/A (LTO entry) | Module | module-summary.md |
| ThinLTO Function Import | sub_1854A20 | 4.3 KB | Slot 43 ("function-import") | Module | thinlto-import.md |
| ThinLTO Threshold Engine | sub_1853180 | 5.1 KB | N/A (called from import driver) | Utility | thinlto-import.md |
| NVIDIA Custom Inliner | sub_1864060 | 75 KB | CGSCC pass | CGSCC | inliner-cost.md |
| LLVM Standard InlineCost | sub_30DC7E0 | 51 KB | N/A (library) | Analysis | inliner-cost.md |
| New PM CGSCC Inliner | sub_2613930 | 69 KB | CGSCC pass | CGSCC | inliner-cost.md |
| NVPTX Target Cost Modifier | sub_38576C0 | 58 KB | N/A (target hook) | Target | inliner-cost.md |
| GlobalOpt | sub_18612A0 | 65 KB | Slot 45 ("globalopt") | Module | globalopt.md |
| WholeProgramDevirt | sub_2703170 | 13 KB | Slot 121 ("wholeprogramdevirt") | Module | devirtualization.md |
Key Differences from CPU LTO
| Aspect | CPU LTO | CICC GPU LTO |
|---|---|---|
| Import threshold | 100 instructions (default) | Priority-class multipliers, global budget at dword_4FAB120 |
| Cold import | 0x multiplier (never import cold) | Imports cold functions if priority >= 2 |
| Inline budget | 225 (LLVM default) | 20,000 (NVIDIA custom), 89x larger |
| Devirt conservatism | Must handle DSOs, hidden visibility | Full type hierarchy always visible |
| Code size concern | Bloats .text, impacts cache/pages | No shared libs; size is secondary to register pressure |
| Address spaces | Trivial (flat memory model) | 5+ address spaces; GlobalOpt must preserve AS through splits |
| Dead symbol elimination | Linker GC sections | Dead kernel elimination in sub_12F5F30 |
| Threshold comparison | Integer instruction count | Floating-point threshold with hotness/linkage/priority multipliers |
| ML-guided inlining | Available upstream | Integrated via InlineAdvisor at sub_2609820 with model at sub_29B2CD0 |
LTO Knob Summary
NVModuleSummary Knobs
| Knob | Default | Effect |
|---|---|---|
| dword_4F87C60 (global override) | 0 | When nonzero, forces all symbols to importable; value 2 = conservative comdat handling |
ThinLTO Import Knobs
Registered in ctor_184_0 (0x4DA920) and ctor_029 (0x489C80):
| Knob | Type | Default | Effect |
|---|---|---|---|
| import-instr-limit | int | 100 | Base instruction count threshold for import |
| import-hot-multiplier | float | 10.0 | Multiplier applied to threshold for hot callsites |
| import-cold-multiplier | float | 0.0 | Multiplier for cold callsites (0 = never import cold on CPU) |
| dword_4FAB120 | int | -1 | Global import budget; negative = unlimited |
| dword_4FAA770 | int | 0 | Current import count (runtime accumulator) |
| summary-file | string | -- | Path to external summary file for ThinLTO |
| function-import | -- | -- | Pipeline registration string (slot 43) |
| disable-thinlto-funcattrs | bool | false | Disable ThinLTO function attribute propagation |
| thinlto-workload-def | string | -- | Workload definition file for priority-guided import |
Inliner Knobs
Registered in ctor_186_0 (0x4DBEC0):
| Knob | Type | Default | Effect |
|---|---|---|---|
| inline-budget | int | 20,000 | Per-caller inlining cost budget (NVIDIA custom model) |
| inline-total-budget | int | -- | Global total budget across all callers |
| inline-adj-budget1 | int | -- | Adjusted per-caller budget (secondary) |
| nv-inline-all | bool | off | Force inline every function call |
| profuseinline | bool | off | Verbose inlining diagnostic output |
| inline-switchctrl | int | -- | Heuristic tuning for switch statements |
| inline-threshold | int | 225 | LLVM standard model threshold (separate from NVIDIA's 20K) |
| function-inline-cost-multiplier | float | -- | New PM: penalty multiplier for recursive functions |
GlobalOpt Knobs
No dedicated cl::opt flags. All thresholds are hardcoded:
| Parameter | Value | Description |
|---|---|---|
| Max bits for promotion | 2,047 (0x7FF) | Globals exceeding this fall through to SRA |
| Max struct fields for SRA | 16 | Structs with >16 fields are not split |
| Hash table load factor | 75% | Triggers rehash of processed-globals table |
| Pipeline position | Step 30 (tier 2/3) | After GlobalDCE, before LoopVectorize |
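The hardcoded gates in the table above reduce to simple predicates; a sketch under the assumption that the load-factor check is the conventional used/capacity comparison (predicate names are illustrative):

```c
#include <assert.h>

/* Sketch of GlobalOpt's hardcoded thresholds. "Exceeding 2047 bits" and
 * ">16 fields" are inclusive/exclusive exactly as documented above. */
static int can_promote(unsigned bit_width)  { return bit_width <= 2047; } /* 0x7FF */
static int can_sra(unsigned field_count)    { return field_count <= 16; }

/* 75% load factor: rehash when used/capacity >= 3/4, in integer math. */
static int needs_rehash(unsigned used, unsigned capacity) {
    return used * 4 >= capacity * 3;
}
```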
Devirtualization Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
| wholeprogramdevirt | -- | -- | Pipeline registration string (slot 121) |
The pass has no NVIDIA-specific tuning knobs. It relies entirely on the completeness of type_test metadata produced by the NVModuleSummary builder.
Cross-References
- NVModuleSummary Builder -- 4-level import priority, complexity budget, CUDA attribute tracking
- ThinLTO Function Import -- threshold computation, priority-class multipliers, global budget
- Inliner Cost Model -- four parallel models, .param address space cost, ML advisory
- GlobalOpt for GPU -- address-space-aware SRA, small-constant promotion, malloc elimination
- Whole-Program Devirtualization -- closed-world virtual call resolution, type test metadata
- NVVM Container Format -- IRLevel enum, CompileMode, bitcode payload encoding
- LLVM Optimizer -- LTO pipeline entry at sub_12F5F30, tier system
- LazyCallGraph & CGSCC -- call graph infrastructure used by the CGSCC inliner
- Entry Point & CLI -- flag catalog routing to lto output vector, -dc mode
NVModuleSummary Builder
CICC replaces LLVM's ModuleSummaryAnalysis with a custom NVModuleSummary subsystem that extends the ModuleSummaryIndex with GPU-specific information. The builder at sub_D7D4E0 (74 KB, 2571 decompiled lines) walks every global value in a module, constructs per-function summaries with CUDA-aware call graph edges, assigns four-level import priorities using a custom priority table, tracks function complexity on a profile-guided budget, and records CUDA-specific attributes such as address-space linkage, kernel-vs-device classification, and device memory reference patterns. The summary is the data source for all downstream ThinLTO decisions -- the ThinLTO importer reads these summaries to decide which functions to pull across module boundaries, and the inliner cost model consumes the complexity budget to calibrate cross-module inline thresholds.
Upstream LLVM's computeFunctionSummary (in ModuleSummaryAnalysis.cpp) counts instructions, builds call graph edges from CallBase operands, collects reference edges by walking instruction operands, and records type test / devirtualization metadata. It produces a FunctionSummary with a flat instruction count and a call edge list annotated with CalleeInfo::HotnessType (Unknown/Cold/None/Hot). NVIDIA's replacement does all of this, then adds: a 4-level import priority classification per function, a 28-bit profile-scaled complexity budget, CUDA address-space tracking (filtering out device-memory-only declarations from import candidacy), kernel identification via first-instruction opcode probing, six separate CUDA-specific accumulator structures for device call context, and a two-phase declaration re-walk that merges forward-declared and defined symbol tables for ThinLTO.
| Builder entry | sub_D7D4E0 (0xD7D4E0, 74 KB) |
| LTO driver | sub_D81040 (0xD81040, 56 KB) |
| Per-function analyzer | sub_D741C0 (0xD741C0, 19 KB) |
| Call graph analyzer | sub_D6EA70 (0xD6EA70, 19 KB) |
| Summary packer | sub_D77220 (0xD77220) |
| Summary serializer | sub_1535340 (0x1535340, 26 KB) |
| Summary parser | sub_150B5F0 (0x150B5F0, 63 KB) |
| Address range | 0xD60000--0xD82000 (full NVModuleSummary cluster) |
| Stack frame | 1,552 bytes (0x610) |
Summary Fields Beyond Upstream
Upstream LLVM's FunctionSummary stores instruction count, call edges with hotness, reference edges, type test GUIDs, and a few flags (norecurse, returndoesnotalias, etc). NVIDIA extends this with the following per-function fields:
| Field | Encoding | Width | Description |
|---|---|---|---|
| Import priority | *entry & 0x7 | 3 bits | 4-level priority: 0 = not importable, 1 = low, 2 = standard, 3 = force-import |
| Address-taken flag | *entry & 0x8 | 1 bit | Set if sub_B49220(GV) returns true (function has its address taken) |
| Complexity budget | *entry >> 4 | 28 bits | Profile-scaled importance, max 0xFFFFFFF (268,435,455) |
| Kernel bit | flags & (1 << 9) | 1 bit | Set if first instruction opcode is 36 (kernel entry point) |
| Has-unwind-info | flags & (1 << 0) | 1 bit | sub_B2DCC0(func) -- has personality function |
| Not-inline | flags & (1 << 1) | 1 bit | Function marked noinline |
| Read-none | flags & (1 << 2) | 1 bit | Attribute #34 readnone |
| No-unwind | flags & (1 << 3) | 1 bit | Attribute #22 nounwind |
| Will-return | flags & (1 << 4) | 1 bit | Attribute #31 willreturn |
| No-return | flags & (1 << 5) | 1 bit | Attribute #3 noreturn |
| Must-progress | flags & (1 << 6) | 1 bit | Attribute #41 mustprogress |
| Has-visible-alias | flags & (1 << 7) | 1 bit | Accumulated alias visibility flag |
| Has-non-importable-refs | flags & (1 << 8) | 1 bit | References symbols that cannot be imported |
| Has-any-import | module flag bit 6 | 1 bit | OR of device-ref, has-typed-symbol, has-non-importable |
The per-entry summary record in the primary hash table is 16 bytes. The lower 32 bits pack the priority/address-taken/budget fields. The upper 64 bits hold a pointer to the full FunctionSummary record built by sub_D77220.
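The 32-bit packed word can be sketched directly from the bit layout in the table above (the helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the lower 32 bits of the 16-byte summary record:
 * bits 0-2 = import priority, bit 3 = address-taken, bits 4-31 = budget. */
#define BUDGET_MAX 0x0FFFFFFFu  /* 28-bit ceiling, 268,435,455 */

static uint32_t pack_entry(uint32_t priority, int addr_taken, uint32_t budget) {
    if (budget > BUDGET_MAX) budget = BUDGET_MAX;  /* clamp to 28 bits */
    return (priority & 0x7) | (addr_taken ? 0x8 : 0) | (budget << 4);
}

static uint32_t entry_priority(uint32_t e)  { return e & 0x7; }
static int      entry_addr_taken(uint32_t e){ return (e & 0x8) != 0; }
static uint32_t entry_budget(uint32_t e)    { return e >> 4; }
```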
Builder Algorithm
The builder executes in three phases within sub_D7D4E0. The LTO driver sub_D81040 calls the builder after reading module flags (EnableSplitLTOUnit, UnifiedLTO, ThinLTO) and iterating all functions via a callback iterator.
Phase 1: Global Value Walk (lines 559--1671)
The module's global value list is a linked list rooted at Module+72 (the GlobalList field). The sentinel node is at Module+72 itself; the first real element is at Module+80.
// Phase 1: iterate all GlobalValues in the module
GlobalValue *sentinel = (GlobalValue *)(module + 72);
GlobalValue *cur = *(GlobalValue **)(module + 80);
while (cur != sentinel) {
    uint8_t opcode = cur->ir_node[0];  // IR opcode byte
    switch (opcode) {
    case 61: /* '=' -- Function definition */
        process_function(cur);
        break;
    case 62: /* '>' -- GlobalVariable */
        process_global_variable(cur);
        break;
    case 34: /* '"' -- Alias (kind 1) */
    case 40: /* '(' -- Alias (kind 2) */
        process_alias(cur);
        break;
    case 85: /* 'U' -- Declaration/extern */
        process_declaration(cur);
        break;
    }
    cur = cur->next;
}
For each function (opcode 61), the builder performs:
1. Import priority assignment. Queries the ImportPriorityTable via sub_D84370(table, func, PSI, 0). If found and the table is non-null, the priority is determined:
entry = getImportKind(priority_table, func, PSI, 0);
if (entry.found) {
    if (isImported(priority_table, entry))                      // sub_D84440
        priority = 3;  // force-import
    else if (isImportCandidate(priority_table, entry, 3) == 0)  // sub_D84450
        priority = 2;  // standard importable
    else
        priority = 1;  // low priority
} else {
    priority = 0;  // not importable
}
2. Complexity budget computation. When ProfileSummaryInfo is available and the function was found in the priority table, the builder computes a profile-scaled importance value:
uint64_t profile_count = getProfileCount(PSI, func);  // sub_FDD860
uint64_t threshold = getHotThreshold(PSI);            // sub_FDC4B0
if (profile_count exists) {
    APInt importance = computeScaledImportance(profile_count, threshold);  // sub_F04200
    normalizeImportPriority(&importance, 8);  // sub_D78C90: right-shift by 8
    budget += importance.getZExtValue();
    budget = min(budget, 0xFFFFFFF);  // clamp to 28-bit max
}
// Pack into entry: lower 4 bits = priority | address_taken, upper 28 bits = budget
*entry_word = (budget << 4) | (*entry_word & 0xF);
The 28-bit budget is consumed downstream by ThinLTO to decide how much inlining budget to allocate for functions imported from other modules. A budget of 0 means the function has no profile data and gets the baseline threshold; a budget near the 268M ceiling means the function is extremely hot and will receive aggressive cross-module inlining.
3. Call graph edge construction. For functions with call graph info (bit 5 of byte 7: func->ir_node[7] & 0x20), the builder extracts two kinds of edges:
- Direct call edges from attribute group #35: the callee list. Each callee gets a GUID via sub_9E27D0, and edges are collected into a temporary vector (4-byte stride per GUID).
- Reference edges with type info from attribute group #34: operand bundles encoding reference edges with type metadata. Each reference carries a CalleeType byte and parameter type pairs extracted from MDNode operands. The MDNode decoding walks: operand -> parent (opcode 1 = MDString) -> offset 136 (opcode 17 = MDTuple) -> string data at offset 24.
Call graph edge records are 136 bytes each (stride 136 in the edge vector) and contain source name, target name, and edge attributes. Type-metadata edges are 72 bytes each.
4. CUDA address-space filtering. When the CUDA-mode flag (a6) is set and a declaration has address space 25 in its type chain, the function sets the device-reference flag (v327). Functions whose type resolves to address space 25 are excluded from import candidacy -- device-memory-only declarations cannot be cross-module imported in ThinLTO. The check:
if (cuda_mode && is_declaration(func)) {
    Type *ty = func->type_at_offset_minus_2;
    if (getAddressSpace(ty) == 25) {
        has_device_ref = true;
        goto skip_import;  // do not mark as importable
    }
}
Address space 25 appears to be an internal NVVM encoding for device-side linkage. This differs from the standard NVPTX address spaces (0 = generic, 1 = global, 3 = shared, 4 = constant, 5 = local). The summary records this flag so the importer can avoid attempting to import device-side-only symbols, which would fail at link time.
5. CUDA call context collection. For functions with the device attribute bit (func[33] & 0x20), the builder calls sub_D7CF70 to populate six parallel accumulator structures:
| Accumulator | Offset | Likely content |
|---|---|---|
| v408 | +0 | Direct device call targets |
| v415 | +1 | Shared memory references |
| v422 | +2 | Texture/surface references |
| v429 | +3 | Constant memory references |
| v436 | +4 | Kernel launch edges |
| a5 | +5 | Additional context (passed from caller) |
These six vectors capture the GPU-specific dependency information that upstream LLVM's summary has no concept of. The ThinLTO importer uses this to make GPU-aware import decisions -- for example, a function that references shared memory in another module must also import the shared memory declaration.
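The budget arithmetic from step 2 above reduces to a shift-and-clamp; a minimal sketch, treating computeScaledImportance as an opaque input (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the budget update: the scaled importance is right-shifted by 8
 * (normalizeImportPriority / sub_D78C90) and the running budget is clamped
 * to the 28-bit ceiling so it fits the packed summary word. */
static uint32_t fold_importance(uint32_t budget, uint64_t scaled_importance) {
    uint64_t sum = (uint64_t)budget + (scaled_importance >> 8);
    return sum > 0x0FFFFFFFu ? 0x0FFFFFFFu : (uint32_t)sum;
}
```

Functions without profile data never enter this path and keep a budget of 0, which is why a zero budget downstream means "baseline threshold" rather than "cold".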
Phase 2: ThinLTO Declaration Re-Walk (lines 1673--1911)
When thinlto_mode (parameter a8) is true, the builder performs a second pass over forward-declared symbols:
Step 1. Re-walk function declarations collected during Phase 1. For each, remove from the "seen" set and re-analyze via sub_D7B190 into a secondary hash table.
Step 2. Re-walk global variable declarations through a separate dedup mechanism using sub_C8CA60 for hash-based deduplication.
Step 3. Merge the secondary (forward-declared) and primary (defined) hash tables. On collision -- the same symbol appears as both declared and defined -- sub_D76140 removes the entry from the defined table and sub_D7AF10 re-inserts into the merged table with updated visibility. This merge ensures that the summary captures cross-module edges even for symbols that are only forward-declared in the current module.
The two-phase design is necessary because CUDA compilation units frequently contain forward declarations of device functions defined in other translation units. Without this re-walk, the summary would miss the cross-module edges for these declarations, and ThinLTO would fail to import them.
Phase 3: Finalize and Emit (lines 1912--2569)
Module-level flag assembly. After processing all globals, the builder computes two flag words:
// v134: module-level attribute summary (bits 0-10)
v134 = (linkage & 0xF)               // bits 0-3
     | ((visibility & 0x3) << 4)     // bits 4-5
     | (has_any_import << 6)         // bit 6: OR of v327|v316|v358
     | (has_comdat << 7)             // bit 7
     | (has_comdat_attr << 8)        // bit 8
     | (dll_storage_class << 9);     // bits 9-10

// v143: per-function flags (bits 0-9)
v143 = has_unwind_info                 // bit 0
     | (not_inline << 1)               // bit 1
     | (readnone << 2)                 // bit 2
     | (nounwind << 3)                 // bit 3
     | (willreturn << 4)               // bit 4
     | (noreturn << 5)                 // bit 5
     | (mustprogress << 6)             // bit 6
     | (has_visible_alias << 7)        // bit 7
     | (has_non_importable_refs << 8)  // bit 8
     | (is_kernel << 9);               // bit 9
The kernel detection walks to the function's first instruction via offset 24, verifies the opcode is in range 30--40 (basic block terminators), and checks specifically for opcode 36, which encodes a kernel entry point. This is how the summary distinguishes __global__ kernel functions from __device__ helper functions without relying on metadata -- it inspects the compiled IR structure directly.
Summary record packing. All collected data is packed into the final FunctionSummary via sub_D77220, which takes 14 arguments:
sub_D77220(
    &result,              // output FunctionSummary*
    module_flags,         // v134
    instruction_count,    // v324
    function_flags,       // v143 (includes kernel bit)
    &priority_slice,      // import priority table slice
    guid_ref_list,        // GUID reference list
    &typed_refs,          // type-checked reference list (72-byte entries)
    &typed_edges,         // typed call graph edges (136-byte entries)
    &simple_edges,        // simple call graph edges (GUID array)
    device_context,       // CUDA device context edges
    additional_edges,     // extra edge data
    &bundle_refs,         // operand bundle references
    &cross_module_calls,  // cross-module call records
    &param_types          // per-parameter type metadata
);
The result is stored via sub_D7A690(index, func, &result) which merges the summary into the module-level index.
Callback invocation. The a9 parameter is a callback object with vtable layout: a9+16 points to a shouldSkip() predicate; a9+24 points to a processFunction(a9, GlobalValue*) handler. When shouldSkip() returns null, the callback is invoked for each function. The callback result is processed by sub_D8D9B0 which extracts additional summary information (likely profile or LTO-specific metadata).
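A structural sketch of the callback object, assuming an x86-64 layout where the two recovered offsets are pointer-sized vtable-style slots (the struct and field names are illustrative; only the offsets come from the decompilation):

```c
#include <assert.h>
#include <stddef.h>

typedef struct GlobalValue GlobalValue;  /* opaque here */

/* Sketch of the a9 callback object: a predicate at +16 and a per-function
 * handler at +24; the first 16 bytes are an unknown header. */
typedef struct SummaryCallback {
    char pad[16];                                              /* unknown */
    int  (*shouldSkip)(struct SummaryCallback *);              /* +16 */
    void (*processFunction)(struct SummaryCallback *, GlobalValue *); /* +24 */
} SummaryCallback;

/* Per the text, the handler runs only when shouldSkip() returns null. */
static void invoke_for(SummaryCallback *cb, GlobalValue *gv) {
    if (cb->shouldSkip && cb->shouldSkip(cb))
        return;
    cb->processFunction(cb, gv);
}
```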
Serialization and the NVVM Container
The summary is serialized into bitcode by sub_1535340 (writeModuleSummary, 26 KB). This function writes a MODULE_STRTAB_BLOCK and GLOBALVAL_SUMMARY_BLOCK into the LLVM bitcode stream using standard bitcode encoding (VBR integers, abbreviation-driven records). The strings "ThinLTO" and "Unexpected anonymous function when writing summary" appear in this function.
On the reading side, sub_150B5F0 (parseModuleSummaryIndex, 63 KB) and sub_9EBD80 (parseGlobalSummaryBlock, 82 KB) deserialize the summary from bitcode back into the in-memory ModuleSummaryIndex. These parsers handle GUID hashes, function/alias/global summaries, and module paths.
The bitcode writer at sub_1538EC0 writes the producer string as "LLVM7.0.1" despite CICC being built on LLVM 20.0.0 internally -- this is the NVVM IR compatibility layer. The summary blocks are embedded in this bitcode stream alongside the IR, so the NVVM container format (see NVVM Container) carries both the IR and its summary in a single bitcode file.
Import Priority System
The 4-level priority system is the primary extension over upstream LLVM's binary importable/not-importable model. Upstream uses GlobalValueSummary::ImportKind which is essentially a boolean; NVIDIA introduces graduated priority levels that feed a floating-point threshold multiplier in the importer.
| Level | Value | Meaning | Importer behavior |
|---|---|---|---|
| 0 | 0b000 | Not importable | Never imported |
| 1 | 0b001 | Low priority | Threshold multiplied by cold multiplier (dword_4FAACC0) |
| 2 | 0b010 | Standard | Threshold multiplied by default multiplier (dword_4FAB040) |
| 3 | 0b011 | Force-import | Threshold multiplied by hot multiplier (dword_4FAAE80) |
The importer at sub_1853180 converts the integer base threshold to float, multiplies by the per-priority-level constant, converts back to integer, and compares against the function's cost from the summary (stored at offset 0x40 in the summary entry). A fourth multiplier (dword_4FAADA0) handles "critical" priority (priority class 4 in the importer's switch), though the summary builder only produces levels 0--3.
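The threshold test in sub_1853180 can be sketched as follows. The hot (10.0) and cold (0.0) multipliers mirror the documented knob defaults; the default and critical multiplier values are placeholders, since the binary loads them from globals whose runtime values vary:

```c
#include <assert.h>

/* Sketch of the importer's priority-scaled threshold comparison.
 * Multipliers for priority 2 and 4 are assumptions. */
static int should_import(int base_threshold, int priority, unsigned cost) {
    static const float multiplier[5] = {
        0.0f,   /* 0: not importable */
        0.0f,   /* 1: low -> cold multiplier (dword_4FAACC0 default) */
        1.0f,   /* 2: standard -> default multiplier (dword_4FAB040, assumed) */
        10.0f,  /* 3: force -> hot multiplier (dword_4FAAE80 default) */
        100.0f  /* 4: critical (dword_4FAADA0, value assumed) */
    };
    float limit = (float)base_threshold * multiplier[priority];
    return (float)cost <= limit;  /* cost from summary entry offset 0x40 */
}
```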
For comdat/linkonce symbols discovered during Phase 3, a special minimum priority applies:
min_priority = 3 * (dword_4F87C60 != 2) + 1;
// dword_4F87C60 == 2: min_priority = 1 (conservative)
// dword_4F87C60 != 2: min_priority = 4 (aggressive import)
Hash Table Infrastructure
The builder manages multiple open-addressing hash tables with different entry sizes. All use the standard DenseMap pointer hash and growth policy; see Hash Table and Collection Infrastructure for the common implementation.
| Table | Entry size | Probe strategy | Purpose |
|---|---|---|---|
| Primary (v384--v387) | 16 bytes | Linear probing | Main summary entries (ptr + metadata) |
| Secondary (v388--v393) | 8 bytes | Linear probing | Forward-declared symbol GUIDs |
| GUID dedup (v406--v407) | 8 bytes | Linear scan + memmove | Deduplication during merge |
| Seen set (v451--v455) | Variable | Flat array or hash | Tracks processed GlobalValues |
The "seen set" has two modes selected by v455: when v455 = 1, it uses a flat inline buffer at v456 with HIDWORD(v453) as the count; when v455 = 0, it switches to a hash table via sub_C8CA60. This dual-mode design optimizes for the common case of small modules (flat scan is faster when count is low) while scaling to large modules.
Rehash strategy: new_capacity = max(64, next_power_of_2(4 * current_count)). The power-of-2 is computed via _BitScanReverse. If the new capacity equals the old, the table is cleared in-place via memset to the empty sentinel (0xFF for 8-byte entries, 0xF8 for 16-byte entries). Otherwise the old buffer is freed and a new one allocated via sub_C7D670 (aligned_alloc(8, size)).
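The sizing rule reduces to a few lines; a sketch under the assumption that _BitScanReverse is used only to round up to a power of two (the loop below replaces the intrinsic for portability):

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for the _BitScanReverse-based round-up. */
static uint32_t next_pow2(uint32_t v) {
    uint32_t p = 1;
    while (p < v) p <<= 1;
    return p;
}

/* Rehash sizing: next power of two at or above 4 * count, floor 64. */
static uint32_t new_capacity(uint32_t count) {
    uint32_t want = next_pow2(4 * count);
    return want < 64 ? 64 : want;
}
```

Growing to 4x the live count keeps the post-rehash load factor at or below 25%, which is why the in-place clear path (new capacity == old) is only hit on shrinking workloads.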
Knobs and Global Variables
| Symbol | Type | Default | Effect |
|---|---|---|---|
| dword_4F87C60 | int | 0 | Import priority override: 0 = normal, 1 = force all importable, 2 = conservative mode |
| qword_4F878A8 | bool | false | When set in ThinLTO mode, forces re-analysis of all referenced-but-undefined symbols |
| byte_3F871B3 | byte | (varies) | Cross-module GUID namespace prefix, distinguishes same-named symbols across modules |
| dword_4FAB120 | int | -1 | Global import budget (-1 = unlimited) |
| dword_4FAA770 | int | 0 | Running count of imports performed |
| dword_4FAAE80 | float | (varies) | Hot function threshold multiplier |
| dword_4FAACC0 | float | (varies) | Cold function threshold multiplier |
| dword_4FAADA0 | float | (varies) | Critical section threshold multiplier |
| dword_4FAB040 | float | (varies) | Default threshold multiplier |
| byte_4FAAA20 | bool | false | Enable thinlto_src_module metadata annotation on imported functions |
The dword_4F87C60 override is the most impactful knob. Setting it to 1 makes every function importable regardless of its linkage or visibility, which is useful for whole-program optimization but can cause link-time explosions. Setting it to 2 enables conservative mode where comdat symbols get minimal priority (level 1 instead of 4), preventing aggressive cross-module import of weakly-linked symbols.
Comparison with Upstream ModuleSummaryAnalysis
| Aspect | Upstream LLVM | CICC NVModuleSummary |
|---|---|---|
| Entry point | computeFunctionSummary() | sub_D7D4E0 (2571 lines vs ~400) |
| Priority levels | Binary (importable or not) | 4 levels (0--3) with float multipliers |
| Complexity metric | Flat instruction count | 28-bit profile-scaled budget |
| Call edge annotation | CalleeInfo::HotnessType (4 values) | 136-byte records with full type metadata |
| Address space awareness | None | Filters device-only (AS 25) from import |
| Kernel detection | None | Opcode-36 probe for __global__ functions |
| Declaration re-walk | None | Two-phase merge of declared + defined |
| CUDA context | None | 6 accumulators for device call patterns |
| Hash table sizing | LLVM DenseMap | Custom open-addressing with dual-mode seen set |
| Profile integration | BFI-based hotness | ProfileSummaryInfo scaled budget |
| Serialization | Standard ModuleSummaryIndex bitcode | Same format, extended fields |
The most architecturally significant difference is the priority system. Upstream LLVM makes a binary import/no-import decision based on a single threshold comparison. NVIDIA's 4-level system allows the importer to process functions in priority order (primary/secondary/tertiary passes in sub_1854A20) with different threshold multipliers per level, enabling much finer control over cross-module optimization aggressiveness.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVModuleSummary::buildModuleSummary() -- main builder | 0xD7D4E0 | 74 KB | -- |
| NVModuleSummary::runOnModule() -- LTO driver | 0xD81040 | 56 KB | -- |
| NVModuleSummary::analyzeFunction() | 0xD741C0 | 19 KB | -- |
| NVModuleSummary::processGlobalRef() | 0xD6FF50 | 47 KB | -- |
| NVModuleSummary::collectGlobalInfo() | 0xD6A180 | 21 KB | -- |
| NVModuleSummary::analyzeCallGraph() | 0xD6EA70 | 19 KB | -- |
| NVModuleSummary::visitInstruction() | 0xD7B190 | 9 KB | -- |
| Alias processing helper | 0xD738B0 | 11 KB | -- |
| NVModuleSummary::computeImportCost() | 0xD72D40 | 9 KB | -- |
| NVModuleSummary::resolveReferences() | 0xD64DE0 | 16 KB | -- |
| NVModuleSummary::getTypeMetadata() | 0xD669C0 | 11 KB | -- |
| NVModuleSummary::processTypeId() | 0xD640E0 | 12 KB | -- |
| NVModuleSummary::computeVisibility() | 0xD63080 | 11 KB | -- |
| Summary serialization helper (recursive) | 0xD60CE0 | 15 KB | -- |
| Summary serialization helper | 0xD61E90 | 10 KB | -- |
| NVModuleSummary::packFunctionSummary() -- 14-arg final packer | 0xD77220 | -- | -- |
| NVModuleSummary::addInlineSummary() -- CUDA context collector | 0xD7CF70 | -- | -- |
| NVModuleSummary::addEdge() | 0xD76530 | -- | -- |
| NVModuleSummary::addRef() | 0xD768F0 | -- | -- |
| NVModuleSummary::addSpecialGlobal() (llvm.used etc.) | 0xD76CA0 | -- | -- |
| NVModuleSummary::addTypeRef() | 0xD76D40 | -- | -- |
| NVModuleSummary::computeNextPrime() -- hash table sizing | 0xD76FC0 | -- | -- |
| NVModuleSummary::getModuleHash() | 0xD771D0 | -- | -- |
| NVModuleSummary::destroyEdgeList() | 0xD77880 | -- | -- |
| NVModuleSummary::destroyRefList() | 0xD786F0 | -- | -- |
| NVModuleSummary::compareImportPriority() | 0xD788E0 | -- | -- |
| NVModuleSummary::computeSymbolHash() | 0xD789D0 | -- | -- |
| NVModuleSummary::resizeTable() | 0xD78B00 | -- | -- |
| NVModuleSummary::normalizeImportPriority() | 0xD78C90 | -- | -- |
| NVModuleSummary::addCallEdge() | 0xD793D0 | -- | -- |
| Rehash/resize (next power-of-2, min 64) | 0xD79200 | -- | -- |
| NVModuleSummary::copyTable() | 0xD7A410 | -- | -- |
| NVModuleSummary::mergeSymbols() | 0xD7A690 | -- | -- |
| NVModuleSummary::computeFinalOrder() | 0xD7AC80 | -- | -- |
| NVModuleSummary::getOrInsertSummary() | 0xD7BAA0 | -- | -- |
| NVModuleSummary::visitGlobalValue() | 0xD7BD50 | -- | -- |
| NVModuleSummary::getImportKind() | 0xD84370 | -- | -- |
| NVModuleSummary::isImported() | 0xD84440 | -- | -- |
| NVModuleSummary::isImportCandidate() | 0xD84450 | -- | -- |
| NVModuleSummary::processInliningDecisions() | 0xD8B020 | 21 KB | -- |
| NVModuleSummary::computeInlineBenefit() | 0xD8C2B0 | 8 KB | -- |
| NVModuleSummary::buildCalleeList() | 0xD8D9B0 | 9 KB | -- |
| NVModuleSummary::cloneModuleSummary() | 0xD8E7E0 | 32 KB | -- |
| GUID lookup/creation (namespace-aware) | 0x9CA390 | -- | -- |
| Get attribute group by kind from GlobalValue | 0xB91C10 | -- | -- |
| ProfileSummaryInfo::getProfileCount() | 0xFDD860 | -- | -- |
| ProfileSummaryInfo::getHotThreshold() | 0xFDC4B0 | -- | -- |
| writeModuleSummary() -- bitcode serializer | 0x1535340 | 26 KB | -- |
| parseModuleSummaryIndex() -- bitcode deserializer | 0x150B5F0 | 63 KB | -- |
Cross-References
- Inliner Cost Model -- consumes complexity budget for cross-module inline decisions
- ThinLTO Function Import -- reads summaries, applies threshold multipliers per priority level
- NVVM Container Format -- the bitcode container that carries serialized summaries
- GlobalOpt -- uses summary visibility information for global optimization
- WholeProgramDevirtualization -- consumes type test GUIDs from the summary
Inliner Cost Model
CICC v13.0 contains four parallel inliner cost models -- an architecturally unusual design that reflects both the historical evolution of NVIDIA's compiler and the fundamental differences between GPU and CPU inlining economics. The NVIDIA custom inliner at 0x1864060 (75 KB, 2135 decompiled lines) uses a 20,000-unit budget that is 89x the upstream LLVM default of 225. Roughly 60% of the custom inliner's code computes type-size comparisons for argument coercion cost, because on GPU the dominant cost of a function call is not instruction count but .param address-space marshaling. Alongside the custom model, CICC also links the standard LLVM InlineCostAnalysis at 0x30DC7E0 (51 KB), a New Pass Manager CGSCC inliner at 0x2613930 (69 KB) with ML-based advisory support, and an NVPTX target-specific cost modifier at 0x38576C0 (58 KB) that injects a +2000 bonus for GPU intrinsics.
| Model A: NVIDIA custom | sub_1864060 (0x1864060, 75 KB, CGSCC) |
| Model B: LLVM standard | sub_30DC7E0 (0x30DC7E0, 51 KB, InlineCostAnalysis) |
| Model C: New PM CGSCC | sub_2613930 (0x2613930, 69 KB, recursive SCC) |
| Model D: NVPTX target | sub_38576C0 (0x38576C0, 58 KB, opcode-based) |
| Knob constructor | ctor_186_0 (0x4DBEC0, 14 KB) |
| LLVM knob constructor | ctor_625_0 / ctor_715_0 (0x58FAD0, 27 KB) |
Why Four Inliner Models
The four models are not truly interchangeable alternatives -- they serve overlapping but distinct roles in the compilation pipeline:
Model A is the original NVIDIA inliner, predating the LLVM 14+ New Pass Manager. It operates on NVIDIA's internal NVVM IR node format (not LLVM IR), walks the callee body with bespoke type-size arithmetic, and is the only model that understands .param-space argument coercion costs. It runs inside the legacy CGSCC inliner framework via sub_186CA00 (Inliner::inlineCallsImpl). When CICC runs in its default optimization pipeline, this is the model that makes the bulk of inlining decisions.
Model B is upstream LLVM's InlineCostAnalysis::analyzeCall, compiled into CICC essentially unmodified. It uses LLVM's instruction-counting cost model with a 225-unit default threshold, the inline-threshold, inlinedefault-threshold, and PGO deferral knobs. It exists because CICC links the full LLVM codebase and certain LLVM passes (e.g., the always-inliner, sample-profile inliner) call into getInlineCost / analyzeCall directly.
Model C is the New Pass Manager's CGSCC inliner at 0x2613930. It handles recursive SCC splitting, carries the function-inline-cost-multiplier knob for penalizing recursive functions, and can delegate decisions to an InlineAdvisor (sub_2609820, 57 KB). The advisor supports three modes registered in the pipeline parser: default, development (training), and release (inference). The ML model inference path lives at sub_29B2CD0 / sub_29B4290. CICC registers the pipeline string "inliner-ml-advisor-release" for the release mode (parser slot 49).
Model D is an NVPTX target-specific cost modifier at 0x38576C0 that adjusts inline costs based on opcode analysis. Its primary contribution is a +2000 cost bonus for functions containing opcode tag 9 instructions (see Opcode Tag 9 Bonus below). This runs as a layer on top of whichever primary cost model is active, modifying the accumulated cost at offset+72 and comparing against the threshold at offset+76.
The historical layering is: NVIDIA built Model A first for their custom NVVM IR, then LLVM matured its own inliner (Model B), then the New PM arrived with ML advisory (Model C), and NVPTX target hooks added GPU-specific adjustments (Model D). Rather than consolidating, NVIDIA kept all four because each handles a different phase or code path in the pipeline.
The .param Address Space Problem
Understanding the NVIDIA inliner requires understanding why GPU function calls are so expensive compared to CPU calls. On x86, a function call requires moving arguments into registers or onto the stack, a CALL instruction, and a RET; the total overhead is typically 5-20 cycles.
On NVIDIA GPUs, there is no hardware call stack for registers. The PTX calling convention works through the .param address space:
- Caller declares `.param` variables via `DeclareParam` (opcode 505) or `DeclareScalarParam` (opcode 506) for each argument.
- Caller stores argument values into `.param` space via `st.param` instructions (opcodes 571-573 for StoreV1/V2/V4).
- Caller emits the `call` instruction referencing the `.param` declarations.
- Callee loads arguments from `.param` space via `ld.param` instructions.
- Return values come back through `.param` space via `ld.param` (opcodes 515-516, 568-570 for LoadRetParam / LoadV1/V2/V4).
- Byval arguments (structs passed by value) copy the entire struct to `.param` space field by field.
Each function call therefore generates O(n) st.param + O(n) ld.param instructions where n is the number of arguments, plus register save/restore if the callee needs more registers than are available (spills go to local memory, which is device DRAM -- hundreds of cycles). Additionally, call boundaries destroy instruction scheduling freedom, prevent cross-boundary register allocation, and create branch divergence hazards at the call/return sites.
This is why NVIDIA's default inline budget of 20,000 is not as aggressive as it sounds: inlining a function with 50 instructions but 8 struct arguments might save hundreds of cycles of .param marshaling overhead.
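The per-call marshaling traffic implied by this sequence can be reduced to a simple counting model. This is a hedged sketch: it assumes one scalar `st.param`/`ld.param` pair per argument and ignores the vectorized StoreV2/V4 and LoadV2/V4 forms, which would lower the counts.

```python
def param_marshaling_ops(num_args, num_returns=0):
    """Count .param memory operations for one call, per the sequence above:
    one st.param per argument (caller side) plus one ld.param per argument
    (callee side), plus an ld.param for each returned value. Worst-case
    scalar model; vectorized V2/V4 forms would reduce these counts."""
    arg_ops = 2 * num_args   # st.param (caller) + ld.param (callee)
    ret_ops = num_returns    # ld.param for each return value
    return arg_ops + ret_ops

# A call with 8 arguments generates at least 16 .param memory operations.
assert param_marshaling_ops(8) == 16
```

This is the O(n) per-argument traffic the text describes, before any register save/restore or divergence penalties are counted.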
Model A: NVIDIA Custom Inliner
Knob Inventory
All knobs are registered in ctor_186_0 at 0x4DBEC0:
| Knob | Type | Default | Purpose |
|---|---|---|---|
| `inline-budget` | int | 20,000 | Per-caller inlining cost budget |
| `inline-total-budget` | int | (none) | Global total budget across all callers in the module |
| `inline-adj-budget1` | int | (none) | Secondary per-caller budget, dynamically adjusted |
| `nv-inline-all` | bool | off | Force inline every function call unconditionally |
| `profuseinline` | bool | off | Verbose inlining diagnostics (NVIDIA profuse framework) |
| `inline-switchctrl` | int | (none) | Switch-statement inlining heuristic tuning |
| `inline-numswitchfunc` | int | (none) | Penalty based on number of switch statements in the callee |
| `inline-maxswitchcases` | int | (none) | Maximum switch cases before the cost penalty applies |
| `disable-inlined-alloca-merging` | bool | off | Disable post-inline alloca merging |
CLI surface mapping:
| User Flag | Routed To |
|---|---|
| `-aggressive-inline` | `-inline-budget=40000` (2x default) |
| `-disable-inlining` | `-disable-inlining` |
| `-inline-budget=N` | Sets the per-caller budget directly |
| `-inline-info` | Diagnostic flag for inline decisions |
Entry and Early Bail-Outs
The entry point sub_1864060 takes four arguments: a1 = function/callsite node, a2 = context, a3 = callback, a4 = data pointer. The function performs a series of eligibility checks before any cost computation:
Intrinsic name check. Calls sub_1649960(a1) to retrieve the function name. If the name starts with the 4-byte magic 0x6D6C6C6C (an LLVM intrinsic prefix) followed by '.', returns 0 immediately. LLVM intrinsics are never inlined through this path.
Pre-analysis walk. Initializes a 32-byte inline-analysis state struct via sub_1ACF5D0, then calls sub_1ACF600 which delegates to sub_1ACF0B0. This walks the callee body to collect basic metrics (instruction count, call count, basic block count). If the pre-analysis returns nonzero, the function is not analyzable.
Linkage check. Reads the byte at a1+32. The low nibble encodes linkage class: values 7 (linkonce_odr) and 8 (weak_odr) are eligible for inlining. Bits [7:6] encode visibility: 0x2 = hidden (OK), 0x1 = protected (bail). The function also requires byte at a1+16 == 3 (function definition, not declaration), bit 0 of byte at a1+80 == 0 (no noinline attribute), and sub_15E4F60(a1) returning false (no optnone).
function shouldInline(callsite):
name = getName(callsite.callee)
if name starts with LLVM_INTRINSIC_PREFIX:
return NEVER_INLINE
state = initAnalysisState()
if preAnalyze(callsite.callee, state) != 0:
return NEVER_INLINE
linkage = callsite.callee.linkage
if linkage not in {linkonce_odr, weak_odr}:
return NEVER_INLINE
if callsite.callee.isDeclaration:
return NEVER_INLINE
if callsite.callee.hasNoinline:
return NEVER_INLINE
if callsite.callee.hasOptnone:
return NEVER_INLINE
// ... proceed to cost computation
Callee Body Scan
After eligibility checks pass, the inliner walks the callee's operand/argument list (linked list at a1+8). Each argument node is classified by its type tag at byte offset +16 via sub_1648700:
| Tag Range | Meaning | Action |
|---|---|---|
| <= 0x17 | Basic types or call-like | If tag == 5 (phi): recurse into operands, check all > 0x17; otherwise bail |
| 0x36 (54) | Load-like instruction | Collect into loads vector |
| 0x37 (55) | Store-like instruction | Collect into stores vector |
| 0x47 (71, 'G') | Aggregate/GEP | Enter sub-operand scan |
The loads and stores are accumulated into two SmallVectors (v357, v360) with initial inline capacity of 4 elements each. These vectors are the input to the argument coercion cost check.
Load-Store Combinatorial Bail-Out
Before proceeding to the expensive type-size computation, the function checks:
if (num_loads * num_stores > 100):
return BAIL_OUT // Too expensive argument copy pattern
This prevents inlining functions where argument materialization would create a quadratic load-store explosion. Consider a function taking 4 struct-by-value arguments, each with 30 fields: that is 120 loads times 120 stores = 14,400 combinations, far above the 100 threshold. Without this guard, the type-size computation engine below would take unreasonable time.
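The guard can be expressed directly; the 100-combination limit is taken from the decompiled check, and the struct-argument scenario below reuses the worked example from the text.

```python
def passes_load_store_guard(num_loads, num_stores, limit=100):
    """Combinatorial bail-out from the decompiled check: reject callees
    whose argument materialization would pair every collected load with
    every collected store, i.e. a quadratic explosion."""
    return num_loads * num_stores <= limit

# 4 struct-by-value args with 30 fields each: 120 loads x 120 stores.
assert not passes_load_store_guard(120, 120)   # 14,400 > 100 -> bail out
assert passes_load_store_guard(10, 10)         # exactly 100 -> proceed
```

Note the comparison is strict (`> 100` bails), so a callee sitting exactly at 100 combinations still reaches the type-size engine.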
Type-Size Computation Engine
The bulk of sub_1864060 -- lines 1140 through 2100, approximately 60% of the function -- is a type-size computation engine. This is the single most distinctive feature of the NVIDIA inliner: where LLVM counts instructions, NVIDIA computes byte-level argument coercion costs.
The engine walks NVVM IR type nodes and computes byte sizes for each argument at both the callsite (actual argument) and the callee (formal parameter). The type tag dispatch is repeated 8+ times across different contexts:
| Type Tag | Type | Size Computation |
|---|---|---|
| 0x01 | half | 16 bits |
| 0x02 | float | 32 bits |
| 0x03 | double | 64 bits |
| 0x04 | fp80 | 80 bits |
| 0x05 | fp128 | 128 bits |
| 0x06 | ppc_fp128 | 128 bits |
| 0x07 | pointer | sub_15A9520(module, 0) for target pointer size |
| 0x08 | array | element_type_size * count (recursive) |
| 0x09 | x86_mmx | 64 bits |
| 0x0A | vector | element_type_size * count (recursive) |
| 0x0B | integer | (dword >> 8) bits |
| 0x0C | function | Recurse (unusual, but handled) |
| 0x0D | struct | sub_15A9930 for layout size |
| 0x0E | packed struct | Manual: 8 * count * align * ceil |
| 0x0F | named type | sub_15A9520(module, type_id) |
| 0x10 | opaque/token | element_type_size * count |
The byte-size formula applied uniformly is:
byte_size = (multiplier * bit_width + 7) >> 3
The core comparison at the heart of the cost model:
if callee_arg_size > callee_formal_size:
// Argument is being widened at the call boundary
// This costs extra st.param + ld.param instructions
// Proceed to next comparison level (accumulate cost)
else:
// Sizes match or shrink -- this argument pair is OK
Arguments are processed in groups of 4 (loop unrolled at line 2098: v142 += 4, --v306 where v306 = num_stores * 8 >> 5, i.e., groups of 4 store arguments). Remainder arguments (1-3 after the groups-of-4 loop) are handled by the type compatibility check function sub_185CCC0 which calls sub_15CCEE0 for type matching.
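A minimal sketch of the tag dispatch and rounding formula, covering a handful of the 16 tags. The dictionary node shapes and the 64-bit pointer width are assumptions standing in for the real NVVM IR type nodes and the target query `sub_15A9520`; the tag values and the `(bits + 7) >> 3` rounding follow the table above.

```python
POINTER_BITS = 64  # assumption: 64-bit generic pointers

def type_bits(node):
    """Bit width per the tag table above (subset of the 16 tags)."""
    tag = node["tag"]
    if tag == 0x01: return 16                  # half
    if tag == 0x02: return 32                  # float
    if tag == 0x03: return 64                  # double
    if tag == 0x07: return POINTER_BITS        # pointer (target-dependent)
    if tag in (0x08, 0x0A):                    # array / vector: recursive
        return type_bits(node["elem"]) * node["count"]
    if tag == 0x0B: return node["dword"] >> 8  # integer: width in bits [15:8]
    raise ValueError(f"unmodeled tag {tag:#x}")

def byte_size(node, multiplier=1):
    # The uniform rounding formula: byte_size = (multiplier * bit_width + 7) >> 3
    return (multiplier * type_bits(node) + 7) >> 3

i1 = {"tag": 0x0B, "dword": 1 << 8}                   # 1-bit integer
vec4f = {"tag": 0x0A, "count": 4, "elem": {"tag": 0x02}}
assert byte_size(i1) == 1      # 1 bit rounds up to 1 byte
assert byte_size(vec4f) == 16  # 4 x 32 bits = 16 bytes
```

The recovered engine repeats this dispatch 8+ times inline rather than factoring it into one helper; the sketch collapses that into a single function for readability.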
Struct Layout Walk
The helper sub_185B2A0 (3 KB) performs a stack-based DFS walk of struct type trees to count fields. It handles pointer types (tag 15), struct types (tag 13/14), and array types (tag 16). The walk has a hard depth limit of 20 levels, preventing runaway recursion on deeply nested struct definitions.
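A sketch of the field-counting walk with the depth-20 cutoff. Recursion stands in here for the binary's explicit stack, and the node shape is invented for illustration; only the depth limit and the struct/array/leaf handling come from the recovered helper.

```python
MAX_DEPTH = 20  # hard limit observed in sub_185B2A0

def count_fields(ty, depth=0):
    """Count leaf fields of a type tree, descending through structs and
    arrays, bailing out at depth 20. Returns None on depth overflow."""
    if depth >= MAX_DEPTH:
        return None
    kind = ty["kind"]
    if kind == "struct":
        total = 0
        for field in ty["fields"]:
            n = count_fields(field, depth + 1)
            if n is None:
                return None
            total += n
        return total
    if kind == "array":
        n = count_fields(ty["elem"], depth + 1)
        return None if n is None else n * ty["count"]
    return 1  # scalar or pointer leaf

pair = {"kind": "struct", "fields": [{"kind": "scalar"}, {"kind": "scalar"}]}
arr = {"kind": "array", "count": 3, "elem": pair}
assert count_fields(arr) == 6   # 3 elements x 2 fields
```

The depth limit is what keeps the walk linear even on pathological, deeply nested type definitions.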
Argument Coercion Check
The helper sub_185D7C0 (9 KB) classifies each callee operand and determines whether argument coercion is needed at the inline callsite. For each operand in the callee's argument linked list at a1+8, it:
- Reads the instruction tag via `sub_1648700`.
- Computes the formal parameter type size.
- Computes the actual argument type size at the callsite.
- If sizes differ, flags this argument as requiring coercion (extra cost).
- If the argument is a struct, invokes the struct layout walk to count individual field copies.
Callsite Transformation
When the callee qualifies for "alias inline" (replacing a call with direct body substitution), the function:
- Allocates a new 88-byte IR node via `sub_1648A60(88, 1)`.
- Builds a function reference node via `sub_15F8BC0`.
- Builds a call replacement node via `sub_15F9660`.
- Walks callee operands to collect phi nodes into a worklist.
- For each phi: copies via `sub_1596970`, updates operands via `sub_15F2120`, replaces references via `sub_1648780`.
- Deletes original phis via `sub_159D850`.
- Performs final callsite replacement via `sub_164D160` + `sub_15E55B0`.
Switch Statement Heuristics
Three dedicated knobs control inlining of switch-heavy functions. On GPU, large switch statements are particularly costly because:
- Branch divergence: Each thread in a warp may take a different case, serializing execution.
- No branch prediction hardware: Every divergent branch pays full penalty.
- Control flow reconvergence: The hardware must synchronize threads after the switch, wasting cycles.
The inline-switchctrl knob tunes the general heuristic sensitivity. inline-numswitchfunc penalizes functions containing many switch statements. inline-maxswitchcases sets a case-count ceiling beyond which a switch-heavy callee is considered too expensive to inline regardless of other factors.
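How the three knobs combine is not recovered from the binary, so the following is a purely hypothetical shape showing how a case-count ceiling and a per-switch penalty could interact; the knob names are real, but the default values and the combination logic are invented for illustration.

```python
def switch_penalty(num_switches, max_cases_seen,
                   numswitchfunc_penalty=500,   # invented scale factor
                   maxswitchcases=32):          # invented default ceiling
    """Hypothetical combination of the three switch knobs: a hard ceiling
    on case count (inline-maxswitchcases) and a per-switch-statement
    penalty (inline-numswitchfunc). Returns None when the callee is
    considered too switch-heavy to inline at all."""
    if max_cases_seen > maxswitchcases:
        return None                 # exceeds inline-maxswitchcases ceiling
    return num_switches * numswitchfunc_penalty

assert switch_penalty(0, 0) == 0          # no switches, no penalty
assert switch_penalty(2, 8) == 1000       # two switches, modest cases
assert switch_penalty(1, 64) is None      # one huge switch -> reject
```

Whatever the real arithmetic, the observable effect is the same: switch-heavy callees must clear a higher bar before the divergence and reconvergence costs are accepted.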
nv-inline-all: Force-All Mode
The nv-inline-all knob bypasses cost analysis entirely and forces inlining of every call. This is used for specific compilation modes where the call graph must be completely flattened:
- OptiX ray tracing: The hardware intersection pipeline requires a single monolithic function. All user-defined intersection, closest-hit, any-hit, and miss programs must be inlined into a single continuation function.
- Aggressive LTO: When doing whole-program optimization with small modules, flattening removes all call overhead.
Three-Level Budget System
NVIDIA uses a three-level budget to control inlining granularity:
- `inline-budget` (default 20,000): Per-caller limit. Caps how much code can be inlined into a single function, preventing any one function from becoming unreasonably large.
- `inline-total-budget`: Module-wide limit. Caps the total amount of inlining across all callers in the compilation unit.
- `inline-adj-budget1`: A secondary per-caller limit that may be dynamically adjusted based on context -- for example, kernel entry points (`__global__` functions) may receive a higher adjusted budget because they are the outermost scope and benefit most from aggressive inlining.
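The budget accounting can be sketched as follows. The knob names and the 20,000 default come from the table above; the charge/reject mechanics are an assumed model, not recovered logic.

```python
class InlineBudget:
    """Assumed accounting model for the per-caller and module-wide caps.
    Each accepted inline decision charges its cost against both levels;
    a decision is rejected if it would overrun either cap."""
    def __init__(self, per_caller=20_000, module_total=None):
        self.per_caller = per_caller        # inline-budget
        self.module_total = module_total    # inline-total-budget (optional)
        self.spent_module = 0
        self.spent_by_caller = {}

    def try_charge(self, caller, cost):
        spent = self.spent_by_caller.get(caller, 0)
        if spent + cost > self.per_caller:
            return False
        if self.module_total is not None and \
           self.spent_module + cost > self.module_total:
            return False
        self.spent_by_caller[caller] = spent + cost
        self.spent_module += cost
        return True

b = InlineBudget(per_caller=20_000, module_total=25_000)
assert b.try_charge("kernelA", 15_000)
assert not b.try_charge("kernelA", 6_000)   # would exceed the 20,000 per-caller cap
assert b.try_charge("kernelB", 10_000)      # module total reaches exactly 25,000
assert not b.try_charge("kernelB", 1)       # module-wide budget exhausted
```

The dynamically adjusted `inline-adj-budget1` would slot in as a per-caller override of `per_caller` for favored callers such as kernel entry points.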
The threshold adjustment helper at sub_1868880 (12 KB) modifies thresholds based on calling context through pure arithmetic on cost/threshold values (no string evidence, entirely numeric).
Alloca Merging
The disable-inlined-alloca-merging knob controls post-inline stack allocation merging. On GPU, "stack" means local memory, which is device DRAM (hundreds of cycles latency). Merging allocas from inlined callees with the caller's allocations reduces total local memory consumption. Lower local memory usage directly improves occupancy (more concurrent thread blocks per SM). The default is to enable merging.
Model B: LLVM Standard InlineCostAnalysis
The standard LLVM InlineCostAnalysis::analyzeCall at 0x30DC7E0 (51 KB) is compiled into CICC from upstream LLVM sources. Its knobs are registered in ctor_625_0 / ctor_715_0 at 0x58FAD0 (27 KB of option registration, an unusually large constructor due to the 40+ individual cost parameter registrations).
Key upstream LLVM knobs present in CICC:
| Knob | Default | Purpose |
|---|---|---|
| `inline-threshold` | 225 | Base inlining threshold |
| `inlinedefault-threshold` | 225 | Default when no hint/profile |
| `inlinehint-threshold` | 325 | Threshold for functions carrying the `inlinehint` attribute (the `inline` keyword hint) |
| `inline-cold-callsite-threshold` | 45 | Threshold for cold callsites |
| `inlinecold-threshold` | 45 | Threshold for functions with the `cold` attribute |
| `hot-callsite-threshold` | 3000 | Threshold for hot callsites (PGO) |
| `locally-hot-callsite-threshold` | 525 | Threshold for locally hot callsites |
| `inline-instr-cost` | 5 | Cost per instruction |
| `inline-call-penalty` | 25 | Penalty per callsite in callee |
| `inline-memaccess-cost` | 0 | Cost per load/store |
| `inline-savings-multiplier` | 8 | Multiplier for cycle savings |
| `inline-savings-profitable-multiplier` | 4 | Multiplier for profitability check |
| `inline-size-allowance` | 100 | Max callee size inlined without savings proof |
| `inline-cost-full` | false | Compute full cost even when over threshold |
| `inline-enable-cost-benefit-analysis` | false | Enable cost-benefit analysis |
| `inline-deferral` | (PGO) | Defer inlining in cold paths |
| `inline-remark-attribute` | (off) | Emit inline remarks |
The LLVM model fundamentally counts instructions (at inline-instr-cost = 5 units each) and subtracts savings from constant propagation, dead code elimination after argument specialization, and simplified control flow. This instruction-counting approach is appropriate for CPUs where call overhead is small and code size is the primary concern. It is inadequate for GPUs where argument marshaling dominates.
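A schematic rendering of the upstream model using the defaults from the table above. The real `analyzeCall` simplifies the callee instruction by instruction against the callsite's constant arguments; this sketch collapses all of that into a single savings term.

```python
# Defaults from the knob table: 5 units per instruction, 25 per nested
# callsite, 225 base threshold. The savings term stands in for constant
# propagation / dead-code simplification credited by the real analysis.
INSTR_COST, CALL_PENALTY, THRESHOLD = 5, 25, 225

def llvm_should_inline(num_instrs, num_calls, simplification_savings=0):
    cost = num_instrs * INSTR_COST + num_calls * CALL_PENALTY
    return cost - simplification_savings <= THRESHOLD

assert llvm_should_inline(40, 1)        # 200 + 25 = 225, exactly at threshold
assert not llvm_should_inline(50, 0)    # 250 > 225
assert llvm_should_inline(50, 0, 30)    # savings pull it back under
```

Note how the model is blind to argument shape: a callee with 8 struct arguments and a callee with 2 scalar arguments price identically if their instruction counts match, which is precisely the GPU-inadequacy the text describes.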
Model C: New PM CGSCC Inliner
The New Pass Manager inliner at 0x2613930 (69 KB) handles recursive SCC processing and integrates with LLVM's InlineAdvisor framework. Its key differentiation is the function-inline-cost-multiplier knob that penalizes recursive function inlining -- a scenario the NVIDIA custom inliner (Model A) does not handle.
The InlineAdvisor at sub_2609820 (57 KB) supports three modes:
| Mode | Pipeline String | Behavior |
|---|---|---|
| default | "inline-advisor" | Heuristic-based (uses Model B cost analysis) |
| development | (training path) | Feature extraction for ML model training |
| release | "inliner-ml-advisor-release" | ML model inference via sub_29B2CD0 / sub_29B4290 |
The ML inference path extracts features from the callsite and callee (instruction count, call depth, loop nesting, etc.) and feeds them through a model to produce an inline/no-inline decision. This is standard upstream LLVM ML inlining infrastructure compiled into CICC; there is no evidence of NVIDIA-custom ML model weights, though NVIDIA could supply custom weights via the enable-ml-inliner knob (registered as an enum: {default, development, release}).
NVPTX Opcode Tag 9 Bonus (+2000)
Model D at sub_38576C0 modifies inline costs based on NVPTX-specific opcode analysis. The key logic:
for each instruction in callee:
tag = getOpcodeTag(instruction)
if ((tag >> 4) & 0x3FF) == 9:
inline_cost += 2000
// ... accumulate other per-instruction costs
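The same check in runnable form, making the bit extraction explicit: `(tag >> 4) & 0x3FF` reads a 10-bit tag field from bits [13:4] of the opcode word. The opcode words below are synthetic values constructed to exercise the check, not real NVPTX encodings.

```python
TAG9_BONUS = 2000

def opcode_tag(opcode_word):
    """Extract the 10-bit tag field at bits [13:4] of the opcode word."""
    return (opcode_word >> 4) & 0x3FF

def nvptx_cost_bonus(opcode_words):
    """Sum the +2000 bonus over every tag-9 instruction in the callee."""
    return sum(TAG9_BONUS for w in opcode_words if opcode_tag(w) == 9)

# Synthetic body: two words carrying tag 9 in bits [13:4], one that doesn't.
body = [9 << 4, (9 << 4) | 0x3, 0x12345]
assert nvptx_cost_bonus(body) == 4000
```

In the binary this sum feeds the accumulated cost at offset +72 before the comparison against the threshold at offset +76.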
The state layout of the cost analyzer object:
| Offset | Field | Purpose |
|---|---|---|
| +72 | Accumulated cost | Running sum of per-instruction costs |
| +76 | Threshold | Budget for this callsite |
| +120 | Per-instruction cost (lo) | Cost array element (low) |
| +128 | Per-instruction cost (hi) | Cost array element (high) |
The +2000 bonus for tag 9 opcodes encourages inlining of functions containing specific GPU operations -- likely tensor core instructions, warp-level intrinsics, or other operations that benefit significantly from being visible to the register allocator and instruction scheduler within the caller's scope. The bonus is large enough (equivalent to inlining ~400 regular LLVM instructions at cost 5 each) to override most size-based objections.
NVIDIA vs. LLVM: Complete Comparison
| Feature | NVIDIA (Model A) | LLVM (Model B) |
|---|---|---|
| Default threshold | 20,000 | 225 |
| Aggressive threshold | 40,000 | Varies by -O level |
| Primary cost metric | Argument type-size coercion | Instruction count |
| Cost per instruction | N/A (not instruction-based) | 5 units |
| Struct handling | Deep field-by-field walk (depth limit 20) | Aggregate flat cost |
| GPU opcode bonus | +2000 for tag 9 | N/A |
| Load x store bail-out | > 100 combinations | N/A |
| Switch heuristics | 3 dedicated knobs | 1 (case-cluster-penalty) |
| Budget system | Per-caller + module total + adjusted | Per-callsite only |
| Diagnostic knob | profuseinline | inline-remark-attribute |
| Force-all mode | nv-inline-all | inline-all-viable-calls (hidden) |
| ML-based advisor | No (separate path via Model C) | Yes (InlineAdvisor) |
| Recursive cost multiplier | No | function-inline-cost-multiplier |
| Alloca merging control | disable-inlined-alloca-merging | N/A |
| Call penalty | Implicit (.param marshaling cost) | 25 units per callsite |
| PGO integration | No evidence | inline-deferral, hot-callsite-threshold |
Decision Flowchart
The complete inlining decision flow through Model A:
CallSite arrives at sub_186CA00
|
sub_186B510: check remarks
|
sub_1864060: shouldInline
|
+--------+--------+
| |
Name is LLVM Name is user
intrinsic? function
| |
NEVER INLINE Init analysis state
sub_1ACF5D0
|
Pre-analyze callee
sub_1ACF600
|
+-------+-------+
| |
Returns 0 Returns != 0
(analyzable) (cannot analyze)
| |
Check linkage NEVER INLINE
(7=linkonce_odr
8=weak_odr)
|
+-----------+-----------+
| |
Eligible Not eligible
| (wrong linkage,
Check noinline, declaration,
optnone attrs protected vis)
| |
+-----+-----+ NEVER INLINE
| |
Has attr No attr
| |
NEVER INLINE Walk callee body
collect loads/stores
|
loads * stores > 100?
+-----+-----+
| |
Yes No
| |
BAIL OUT Type-size computation
(60% of function)
|
Compute per-argument
coercion cost
|
Total cost < inline-budget?
+-----+-----+
| |
Yes No
| |
INLINE DO NOT INLINE
Transform callsite
sub_1648A60 / sub_15F8BC0
Call Graph
sub_186CA00 Inliner::inlineCallsImpl (CGSCC SCC walk)
+-> sub_186B510 Inline decision with remarks
+-> sub_1864060 shouldInline / cost computation (THIS)
+-> sub_1ACF5D0 Inline analysis state init
+-> sub_1ACF600 Pre-analysis callee walk
| +-> sub_1ACF0B0 Metric collection
+-> sub_185FD30 Argument materialization cost (5 KB)
+-> sub_185E850 Post-inline cleanup assessment (9 KB)
+-> sub_185B2A0 Struct layout walk, depth limit 20 (3 KB)
+-> sub_185D7C0 Argument matching / coercion (9 KB)
+-> sub_185B9F0 Recursive operand simplification (5 KB)
+-> sub_185CCC0 Type compatibility check (4 KB)
+-> sub_18612A0 GlobalOpt integration (65 KB, conditional)
+-> sub_1868880 Inline threshold adjustment (12 KB)
+-> sub_1866840 Post-inline callsite update (42 KB)
Why 89x the LLVM Budget
The 20,000 vs. 225 ratio sounds extreme, but the economics are different:
CPU call overhead is approximately 5-20 cycles (push/pop registers, branch prediction handles the rest). A function with 50 instructions that is not inlined costs perhaps 60-70 cycles total. Inlining saves ~15 cycles. The savings must justify the I-cache pressure increase.
GPU call overhead includes: (1) declaring .param variables for every argument, (2) st.param for each argument value, (3) ld.param in the callee for each argument, (4) register save/restore to local memory (device DRAM, 200-800 cycle latency) if the callee's register demand exceeds what is available, (5) loss of instruction scheduling across the call boundary, (6) branch divergence at call/return. For a function with 8 arguments, the .param overhead alone is 16+ memory operations. With register spilling, a single function call can cost 1000+ cycles.
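The comparison above can be made concrete. The 5-20 cycle CPU range and the 200-800 cycle DRAM latency are the document's figures; the 20-cycle cost per `.param` memory operation is an illustrative assumption, not a measurement.

```python
PARAM_OP_CYCLES = 20    # assumed average cost of one st.param/ld.param
CPU_CALL_CYCLES = 20    # upper end of the 5-20 cycle CPU overhead range

def gpu_call_cycles(num_args, spill_cycles=0):
    """Marshaling cost of one GPU call: an st.param + ld.param pair per
    argument, plus any register save/restore traffic to device DRAM."""
    marshaling = 2 * num_args * PARAM_OP_CYCLES
    return marshaling + spill_cycles

# 8 arguments: 16 .param ops -> 320 cycles of marshaling alone, and
# over 1000 cycles once a spill round-trip to device DRAM is added.
assert gpu_call_cycles(8) == 320
assert gpu_call_cycles(8, spill_cycles=800) == 1120
assert gpu_call_cycles(8) // CPU_CALL_CYCLES == 16   # ~16x the CPU overhead
```

Even under these rough assumptions the ratio between GPU and CPU call overhead lands in the same order of magnitude as the 89x budget ratio.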
Furthermore, GPU functions tend to be small (typically 10-100 instructions for device helper functions). The NVIDIA cost model does not count instructions at all -- it counts the argument marshaling cost. A function with 200 instructions but 2 scalar arguments is cheap to call; a function with 10 instructions but 8 struct arguments is expensive. The 20,000 budget reflects this: it is not 89x more aggressive in inlining large functions; it is calibrated for a cost model where the per-argument coercion cost dominates rather than instruction count.
With -aggressive-inline (budget 40,000, i.e., 178x the LLVM default), NVIDIA targets workloads like OptiX where complete flattening is desired but nv-inline-all is too blunt (it ignores all cost analysis).
What Upstream LLVM Gets Wrong for GPU
Upstream LLVM's inliner cost model was built for x86/AArch64 where function call overhead is small and code size is the primary inlining constraint. On GPU, every assumption is wrong:
- Upstream assumes a 225-instruction budget is sufficient. The default `inline-threshold` of 225 reflects CPU economics where a function call costs 5-20 cycles (register push/pop + branch). On GPU, a single function call with 8 struct arguments generates 16+ `.param`-space memory operations, potential register spills to device DRAM (200-800 cycle latency), loss of cross-boundary scheduling, and branch divergence hazards. NVIDIA's 20,000-unit budget (89x upstream) is calibrated for this reality, not because GPU code is more aggressive about inlining large functions.
- Upstream counts instructions as the primary cost metric. LLVM prices each instruction at 5 units and subtracts savings from constant propagation and dead code elimination. NVIDIA's custom inliner (Model A) does not count instructions at all -- 60% of its 75 KB body computes byte-level argument type-size coercion costs, because on GPU the dominant cost of a function call is `.param` address-space marshaling, not instruction count.
- Upstream has no concept of `.param`-space argument passing cost. CPU calling conventions pass arguments in registers (nearly free) or via L1-cached stack (3-5 cycles). On GPU, every argument requires explicit `DeclareParam` + `st.param` (caller) + `ld.param` (callee) sequences. A function with 10 instructions but 8 struct arguments is more expensive to call than one with 200 instructions and 2 scalar arguments. Upstream's model gets this exactly backwards.
- Upstream uses a single per-callsite budget. NVIDIA uses a three-level system: per-caller budget (`inline-budget`), module-wide total budget (`inline-total-budget`), and a dynamically adjusted secondary budget (`inline-adj-budget1`) that can give kernel entry points higher limits. This multi-level approach prevents any single caller from bloating while still allowing aggressive inlining where it matters most.
- Upstream has no GPU intrinsic awareness. NVIDIA's Model D applies a +2000 cost bonus for functions containing opcode tag 9 instructions (likely tensor core or warp-level intrinsics), because these operations benefit enormously from being visible to the register allocator and scheduler within the caller's scope. Upstream LLVM has no mechanism to express "this function contains operations that are disproportionately valuable to inline."
Key Addresses
| Address | Size | Function |
|---|---|---|
| 0x1864060 | 75 KB | shouldInline / inline cost computation |
| 0x186CA00 | 61 KB | Inliner::inlineCallsImpl (CGSCC core) |
| 0x186B510 | 20 KB | Inline decision with remarks |
| 0x1866840 | 42 KB | Post-inline callsite update |
| 0x1868880 | 12 KB | Inline threshold adjustment |
| 0x185FD30 | 5 KB | Argument materialization |
| 0x185E850 | 9 KB | Post-inline cleanup |
| 0x185B2A0 | 3 KB | Struct layout walk (depth 20) |
| 0x185D7C0 | 9 KB | Argument coercion check |
| 0x185B9F0 | 5 KB | Recursive operand simplification |
| 0x185CCC0 | 4 KB | Type compatibility check |
| 0x18612A0 | 65 KB | GlobalOpt integration |
| 0x1ACF5D0 | -- | Inline analysis state init |
| 0x1ACF600 | -- | Pre-analysis callee walk |
| 0x30DC7E0 | 51 KB | InlineCostAnalysis::analyzeCall (LLVM) |
| 0x2613930 | 69 KB | New PM CGSCC inliner |
| 0x2609820 | 57 KB | Inline advisor / ML inliner |
| 0x38576C0 | 58 KB | NVPTX target-specific cost modifier |
| 0x4DBEC0 | 14 KB | NVIDIA inliner knob registration |
| 0x58FAD0 | 27 KB | LLVM InlineCost option registration |
Reimplementation Checklist
- Type-size-based cost model (60% of the inliner). Implement the argument coercion cost engine that walks NVVM IR type nodes (16 type tags: half through opaque/token) to compute byte-level sizes for both callsite actuals and callee formals, using the formula `byte_size = (multiplier * bit_width + 7) >> 3`. Flag arguments where `callee_arg_size > callee_formal_size` as requiring `.param`-space widening.
- 20,000-unit budget system. Implement the three-level budget: per-caller `inline-budget` (default 20,000), module-wide `inline-total-budget`, and dynamically adjusted `inline-adj-budget1` (kernel entry points may receive higher limits). Include the `-aggressive-inline` mapping to budget 40,000 and the `nv-inline-all` force-all mode.
- Early bail-out chain. Implement the eligibility checks in order: LLVM intrinsic name prefix rejection, pre-analysis callee walk (instruction/call/block counts), linkage check (linkonce_odr/weak_odr only), visibility check, noinline/optnone attribute rejection, and the `loads * stores > 100` combinatorial bail-out.
- Struct layout walk (depth limit 20). Implement the stack-based DFS walk of struct type trees to count fields for coercion cost, handling pointer types (tag 15), struct types (tag 13/14), and array types (tag 16), with a hard depth limit of 20 levels.
- Switch statement heuristics. Implement the three GPU-specific switch knobs (`inline-switchctrl`, `inline-numswitchfunc`, `inline-maxswitchcases`) that penalize switch-heavy callees where branch divergence, absent branch prediction, and reconvergence overhead make inlining particularly costly.
- NVPTX opcode tag 9 bonus (+2000). Implement the target-specific cost modifier that scans callee instructions for opcode tag 9 (likely tensor core/warp intrinsics) and adds a +2000 bonus to encourage inlining functions containing GPU operations that benefit from cross-boundary register allocation and scheduling.
ThinLTO Function Import
CICC v13.0 implements LLVM's ThinLTO function import pipeline with GPU-specific modifications to the threshold computation, candidate filtering, and provenance tracking. The core of the system lives in two functions -- sub_1854A20 (the import driver, 4,326 bytes) and sub_1853180 (the threshold computation engine, 5,059 bytes) -- with an entry point at sub_1855B10 that parses the -summary-file / -function-import command line and orchestrates the whole-module import flow. The fundamental difference from CPU ThinLTO is that GPU compilation operates in a closed-world model: there are no shared libraries, no dynamic linking, and no PLT/GOT indirection. Every device function will be statically linked into the final PTX. This means CICC can afford far more aggressive import thresholds than CPU compilers, because the code size cost of importing is paid once per GPU binary rather than once per shared-object load.
The import subsystem reads NVModuleSummary data (built by sub_D7D4E0, see Module Summary) to make summary-guided decisions about which functions to pull from other translation units. Each candidate is evaluated against a floating-point threshold that incorporates callsite hotness, linkage type, and a per-priority-class multiplier. A global import budget caps the total number of imports to prevent compile-time explosion. After import, each materialized function receives thinlto_src_module metadata so downstream passes (particularly the inliner) know its origin module.
| Import driver | sub_1854A20 (0x1854A20, 4,326 B) |
| Threshold computation | sub_1853180 (0x1853180, 5,059 B) |
| Threshold comparison gate | sub_18518A0 (0x18518A0) |
| Import execution | sub_15E4B20 (0x15E4B20) |
| Import candidate evaluator | sub_1852CC0 (0x1852CC0) |
| Entry point | sub_1855B10 (0x1855B10, 10,503 B) |
| Whole-module processing | sub_1858B90 (0x1858B90, 31,344 B) |
| Type metadata propagation | sub_185E850 (0x185E850, 24,263 B) |
| Pipeline registration | "function-import" (slot 43, Module pass) |
| Knob constructor (primary) | ctor_184_0 (0x4DA920, 13,693 B) |
| Knob constructor (supplementary) | ctor_029 (0x489C80, 1,120 B) |
| Knob constructor (pass-level) | ctor_420_0 (0x532010, 11,787 B) |
Why GPU ThinLTO Differs from CPU ThinLTO
Upstream LLVM's ThinLTO was designed for CPU executables and shared libraries where import decisions must balance code size (impacts disk, cache, page faults) against optimization opportunity (cross-module inlining, constant propagation). The default import-instr-limit is 100 instructions, the cold multiplier is 0, and the hot multiplier is 10x. These conservative defaults reflect a world where over-importing bloats .text sections shared across address spaces.
GPU compilation inverts these tradeoffs:
- No shared libraries. Device code is statically linked into a fatbinary. There is no dynamic linker, no GOT, no PLT. Importing a function costs compile time but has zero runtime overhead beyond instruction cache pressure.
- Function calls are expensive. As documented in the inliner cost model, every GPU function call marshals arguments through `.param` address space via `st.param` / `ld.param` sequences. Inlining (which requires importing first) eliminates this overhead entirely.
- Closed-world optimization. The compiler sees all device code. There are no opaque DSOs. This means aggressive import cannot break ABI contracts that don't exist.
Register pressure is the real constraint. On GPU, the limiting factor is not code size but register count, which determines occupancy. Import + inline can actually reduce register pressure by enabling cross-function register allocation and eliminating
.param-space spills.
These factors push CICC toward much more aggressive import thresholds. The priority-class multiplier system (section below) allows CICC to tune import aggressiveness per-callsite rather than using a single global threshold.
What Gets Imported and What Does Not
The NVModuleSummary builder (sub_D7D4E0) assigns a 4-level import priority to every global value when building the module summary index:
| Priority | Meaning | Import behavior |
|---|---|---|
| 0 | Not importable | Local/hidden linkage, never imported |
| 1 | Importable, not preferred | Will import only if threshold is generous |
| 2 | Standard importable | Normal import candidate |
| 3 | Force-import | Highest priority, always imported if budget allows |
The priority is determined by querying the ImportPriorityTable (parameter a4 of sub_D7D4E0) via sub_D84370, sub_D84440 (force-import check), and sub_D84450 (importable check). A global override at dword_4F87C60 can force all symbols to priority 1 or higher.
Functions that are imported:
- __device__ functions with internal or linkonce_odr linkage (template instantiations, inline functions)
- Math library implementations (libdevice functions) called from device code
- Helper functions from header-only libraries (Thrust, CUB, cutlass templates)
- Constant global variables with initializers (import-constants-with-refs = true by default)
Functions that are NEVER imported:
- Kernels (__global__ functions). These are entry points. They are never candidates for cross-module import because they represent the root of execution; they are called from host code, not from other device functions. The summary builder marks them as non-importable.
- Host functions. Host code is handled by the host compiler (gcc/clang), not cicc. They never appear in the device module summary.
- Functions in address space 25. The summary builder at lines 1388-1395 explicitly skips functions whose type resolves to address space 25, with a goto LABEL_495 that bypasses the import-eligible path. The raw report notes: "device functions can't be cross-module imported in ThinLTO" -- this refers specifically to functions that are declarations only with device-memory address space linkage, meaning they reference device-side symbols without a definition in the current TU.
- Functions with the "not importable" flag. Bit 4 (0x10) of the linkage byte at offset +0x0C in the function summary entry. The import driver checks test byte [entry+0Ch], 0x10 and skips on set.
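The per-entry rejection tests and the linkage dispatch above can be condensed into a short sketch. This is a model of the documented bit tests, not a transcription of the binary; the function and constant names are ours:

```python
FUNCTION_SUMMARY = 2     # entry[+0x08]: entry-type code for a function summary
NOT_IMPORTABLE   = 0x10  # bit 4 of the linkage byte at entry[+0x0C]

def is_import_candidate(entry_type: int, linkage_byte: int) -> bool:
    """The two rejection tests applied before the linkage dispatch."""
    if entry_type != FUNCTION_SUMMARY:
        return False
    return not (linkage_byte & NOT_IMPORTABLE)

def dispatch_path(linkage_byte: int) -> str:
    """Which of the three paths the 11-case linkage switch selects."""
    linkage = linkage_byte & 0x0F
    if linkage in (0, 1, 3, 5, 6):   # External/AvailExt/Internal/ExtWeak/Common
        return "standard"
    if linkage in (7, 8):            # WeakAny/WeakODR: memcmp name check first
        return "weak-name-check"
    return "special"                 # Appending/Private/LinkOnceAny/LinkOnceODR

print(is_import_candidate(2, 0x10))   # False: not-importable flag set
print(dispatch_path(0x08))            # weak-name-check
```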
Import Algorithm: Complete Pseudocode
Complexity. Let C = number of import candidates across all modules and L = total number of name entries across all candidates.
- Stage 1 (threshold computation, sub_1853180) iterates every candidate once: O(C). The GUID dedup hash table (slot = GUID * 37 & (size - 1)) provides O(1) amortized lookup with linear probing. The name array scan is up to 4-level unrolled, giving O(L) total across all candidates. The 11-case linkage dispatch via jump table, the priority-class threshold adjustment (a single float multiply), and the global budget check are each O(1) per candidate. Overall Stage 1: O(C + L).
- Stage 2 (triple-pass driver, sub_1854A20) processes three priority-ordered linked lists, each in a single pass: O(C) total. Per-candidate import execution (sub_15E4B20) is O(I_f), where I_f = instructions in the imported function (bitcode materialization).
- Whole-module processing (sub_1858B90, 31 KB) is O(F * I_avg), where F = total functions and I_avg = average instruction count. The dedup hash table grows by doubling at a 75% load factor, maintaining O(1) amortized operations.
Total: O(C + L + sum(I_imported)).
The import process runs in two major stages. Stage 1 (sub_1853180) builds a prioritized list of qualifying candidates by evaluating each against a computed threshold. Stage 2 (sub_1854A20) materializes candidates via a triple-pass sweep over three priority-ordered linked lists, executing the actual cross-module function import.
Stage 1: Threshold Computation Engine (sub_1853180)
Address range: 0x1853180--0x1854543 (5,059 bytes). Six parameters, 0xB8-byte stack frame. Uses a jump table at dword_42BA140 for the 11-case linkage-type dispatch.
// sub_1853180 -- Threshold computation with GUID dedup and priority-class multipliers
//
// Evaluates every candidate in summary_ctx against base_threshold adjusted by
// priority class. Emits qualifying candidates to result_array as 24-byte
// entries {GUID, threshold, import_record_ptr}. Tracks already-evaluated
// GUIDs via guid_hash_table to prevent duplicate work.
//
// Binary: 0x1853180, 5059 bytes. Stack: 0xB8.
// Jump table: dword_42BA140 (11 entries, linkage dispatch).
//
// Globals read:
// dword_4FAAE80 hot_multiplier (float, default 10.0)
// dword_4FAACC0 cold_multiplier (float, default 0.0)
// dword_4FAADA0 critical_multiplier (float, default 100.0)
// dword_4FAB040 default_multiplier (float, default 1.0)
// dword_4FAB120 global_import_budget (int, default -1 = unlimited)
// dword_4FAA770 running_import_count (int, reset per module)
fn threshold_compute(
summary_ctx, // rdi -> [rbp-0x88]: candidate arrays and metadata
module_info, // rsi -> [rbp-0x58]: source module summary
base_threshold, // edx -> [rbp-0x7C]: integer base threshold (import-instr-limit)
guid_hash_table, // rcx -> [rbp-0x50]: DenseMap<uint64_t, metadata> for dedup
result_array, // r8 -> [rbp-0x60]: growable output array
visited_set, // r9 -> [rbp-0xA0]: tracks already-evaluated GUIDs
):
candidate_begin = summary_ctx[+0x28] // r12: start of candidate pointer array
candidate_end = summary_ctx[+0x30] // r14: one-past-end
// ---- Outer loop: iterate every candidate ----
while candidate_begin != candidate_end: // 0x18531C4
candidate_ptr = *candidate_begin
guid = candidate_ptr & ~0x7 // mask low 3 tag bits
// ---- GUID dedup via multiplicative-hash table ----
table_size = guid_hash_table[+0x18]
if table_size > 0: // 0x18531D0
table_data = guid_hash_table[+0x00]
raw_guid = candidate_ptr[+0x00] // 8-byte GUID
// Hash: slot = (GUID * 37) & (table_size - 1)
// Implemented as: lea edx,[rsi+rsi*8] -> edx=GUID*9
// lea edx,[rsi+rdx*4] -> edx=GUID+GUID*36=GUID*37
slot = (raw_guid * 37) & (table_size - 1) // 0x18531E8
// 16-byte slots: {GUID (8B), metadata (8B)}
probe_ptr = table_data + slot * 16
stored_guid = probe_ptr[+0x00]
if stored_guid == raw_guid:
goto next_candidate // already evaluated
// Linear probing on collision
probe_step = 1
while stored_guid != 0xFFFFFFFFFFFFFFFF: // -1 = empty sentinel
slot = (slot + probe_step) & (table_size - 1)
probe_step += 1
probe_ptr = table_data + slot * 16
stored_guid = probe_ptr[+0x00]
if stored_guid == raw_guid:
goto next_candidate // found: already seen
// GUID not in table -- fall through to evaluation
// ---- Name array scan ----
// When dedup table is absent, scan name components directly
name_begin = candidate_ptr[+0x18] // 0x1853250
name_end = candidate_ptr[+0x20]
// Up-to-4-level unrolled name comparison (0x1853670-0x18538BA):
// Level 1: entry = [name_ptr - 8]
// Level 2: entry = [name_ptr + 0]
// Level 3: entry = [name_ptr + 8]
// Level 4: entry = [name_ptr + 0x10]
// Each level checks:
// visibility flag at [r14+0xB0] -> if set: test byte [entry+0Ch], 0x20
// entry type: entry[+0x08] must == 2 (function summary)
// not-importable: test byte [entry+0Ch], 0x10 -> skip if set
// linkage: entry[+0x0C] & 0x0F -> 11-case switch
for each name_entry in name_begin..name_end:
entry = *name_entry
if entry[+0x08] != 2: // not a function summary
continue
linkage_byte = entry[+0x0C]
if linkage_byte & 0x10: // "not importable" flag
continue
linkage = linkage_byte & 0x0F // 0x185324E
// ---- Linkage-type dispatch (11 cases via jump table) ----
switch linkage: // dword_42BA140
case 0: // ExternalLinkage
case 1: // AvailableExternallyLinkage
case 3: // InternalLinkage
case 5: // ExternalWeakLinkage
case 6: // CommonLinkage
goto standard_threshold_path // loc_18536E8
case 7: // WeakAnyLinkage
case 8: // WeakODRLinkage
// Weak linkage requires name verification via memcmp
// to confirm the candidate matches the expected symbol
// before allowing import.
expected_name = resolve_name(candidate_ptr)
actual_name = resolve_name(entry)
if memcmp(expected_name, actual_name, name_len) != 0:
continue // 0x1853A71: name mismatch
goto standard_threshold_path
case 2: // AppendingLinkage
case 4: // PrivateLinkage
case 9: // LinkOnceAnyLinkage
case 10: // LinkOnceODRLinkage
goto special_handling_path // loc_1853928
// ---- Standard threshold path ----
standard_threshold_path:
// Dereference alias chain for external linkage
if entry.function_type == 0: // external
entry = entry[+0x40] // follow alias pointer
linkage = entry[+0x0C] & 0x0F // re-extract
// ---- Priority-class threshold adjustment ----
// 0x1853441: convert base_threshold to float
threshold_f = (float)base_threshold // cvtsi2ss xmm2, eax
priority_class = entry[+0x08] & 0x7 // 3-bit field, al=[r15+8]&7
switch priority_class:
case 3: // HOT callsite
threshold_f *= dword_4FAAE80 // hot_multiplier (10.0)
// mulss xmm0, cs:dword_4FAAE80
break
case 1: // COLD callsite
threshold_f *= dword_4FAACC0 // cold_multiplier (0.0)
// mulss xmm0, cs:dword_4FAACC0
break
case 4: // CRITICAL callsite
threshold_f *= dword_4FAADA0 // critical_multiplier (100.0)
// mulss xmm0, cs:dword_4FAADA0
break
default: // no priority match
threshold_f *= dword_4FAB040 // default_multiplier (1.0)
// mulss xmm0, cs:dword_4FAB040
adjusted_threshold = (int)threshold_f // cvttss2si rax, xmm0
// Stored to [rbp-0x78] and r11d for comparison
// ---- Cost comparison (0x1853AA8) ----
function_cost = entry[+0x40] // IR instruction count
if adjusted_threshold < function_cost: // cmp r11d, [rcx+40h]
continue // jb not_eligible
// ---- "Not importable" double-check ----
if entry[+0x0C] & 0x10: // test byte [rcx+0Ch], 0x10
continue
// ---- Max-threshold-wins for duplicates (0x18534C2) ----
if guid already in result_array:
existing_record = result_slot[+0x10]
if existing_record != NULL:
existing_threshold = result_slot[+0x08]
if (float)existing_threshold >= threshold_f:
continue // existing is better; skip
result_slot[+0x08] = adjusted_threshold // update to higher
goto next_candidate
// ---- Global budget check (0x185340A) ----
budget = dword_4FAB120 // global_import_budget
if budget >= 0: // test eax,eax; js proceed
if dword_4FAA770 >= budget: // cmp counter vs budget
continue // jge skip: budget exhausted
// ---- Allocate dedup hash table node (0x1853953) ----
node = malloc(16) // 0x22077B0: edi=0x10
if node != NULL:
node[+0x00] = 0 // clear forward pointer
node[+0x08] = guid
sub_1851560( // hash table insert
guid_hash_table[+0x08], // insert point
bucket_index, // slot
guid, // key
1 // insert_mode
)
// ---- Emit to result array (0x1853517) ----
count = result_array[+0x08] // current count
capacity = result_array[+0x0C]
if count >= capacity:
grow_result_array(result_array) // realloc path
// 24-byte entry: offset = count * 24
entry_ptr = result_array.base + count * 24 // lea rax,[rax+rax*2]; shl rax,3
entry_ptr[+0x00] = guid // 8 bytes: function GUID
entry_ptr[+0x08] = adjusted_threshold // 4 bytes: threshold value
entry_ptr[+0x10] = import_record_ptr // 8 bytes: import record
result_array[+0x08] = count + 1 // increment count
// ---- Increment global counter (0x1853510) ----
dword_4FAA770 += 1 // add cs:dword_4FAA770, 1
next_candidate:
candidate_begin += 8 // advance to next candidate
Threshold computation arithmetic in detail. The four multiplier constants live in .data as IEEE 754 single-precision floats. The SSE scalar path is:
; At 0x1853441 -- convert integer base threshold to float
pxor xmm2, xmm2
cvtsi2ss xmm2, rax ; xmm2 = (float)base_threshold
; Priority dispatch -- one of four paths selected:
; HOT (priority 3):
movss xmm0, cs:dword_4FAAE80 ; xmm0 = 10.0f
mulss xmm0, xmm2 ; xmm0 = 10.0 * base
; COLD (priority 1):
mulss xmm0, cs:dword_4FAACC0 ; xmm0 = 0.0 * base = 0.0
; CRITICAL (priority 4):
mulss xmm0, cs:dword_4FAADA0 ; xmm0 = 100.0 * base
; DEFAULT (all others):
mulss xmm0, cs:dword_4FAB040 ; xmm0 = 1.0 * base
; Convert back to integer for comparison
cvttss2si rax, xmm0 ; rax = (int)threshold_f (truncation)
The cvttss2si truncation means threshold values are floored, not rounded. For base_threshold=100 and hot_multiplier=10.0, the adjusted threshold is exactly 1000. The cold path with multiplier 0.0 always produces threshold 0, meaning cold functions are never imported unless the multiplier is overridden.
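A minimal Python model of this arithmetic, rounding through single precision via `struct` to mimic the SSE scalar path. The multiplier table and function names are ours; the multiplier values and priority codes are the documented defaults:

```python
import struct

def f32(x: float) -> float:
    """Round a Python float to IEEE 754 single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Priority classes from the summary entry's 3-bit field:
# 3 = hot, 1 = cold, 4 = critical; anything else takes the default path.
MULTIPLIERS = {3: 10.0, 1: 0.0, 4: 100.0}

def adjusted_threshold(base: int, priority_class: int) -> int:
    m = MULTIPLIERS.get(priority_class, 1.0)   # default_multiplier = 1.0
    # cvtsi2ss, mulss, then cvttss2si: truncation toward zero, not rounding
    return int(f32(f32(base) * f32(m)))

print(adjusted_threshold(100, 3))   # 1000 (hot)
print(adjusted_threshold(100, 1))   # 0    (cold: never imported by default)
print(adjusted_threshold(100, 4))   # 10000 (critical)
```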
Stage 2: Triple-Pass Import Driver (sub_1854A20)
Address range: 0x1854A20--0x1855B06 (4,326 bytes). Four parameters, 0x278-byte stack frame. Callee-saved: r15, r14, r13, r12, rbx.
The driver processes candidates across three priority-ordered linked lists embedded in the guid_import_map structure. Each list covers a different import priority class. The three passes guarantee that high-priority candidates are imported (and consume budget) before lower-priority ones get a chance.
// sub_1854A20 -- Triple-pass import driver
//
// Materializes cross-module function bodies for candidates that pass
// threshold evaluation. Processes three linked lists in priority order:
// Pass 1: primary list at [import_map + 0x00] (highest priority)
// Pass 2: secondary list at [import_map + 0x10] (medium priority)
// Pass 3: tertiary list at [import_map + 0x30] (lowest priority)
//
// For each candidate: check importable flag, evaluate threshold via
// sub_18518A0, execute import via sub_15E4B20, optionally attach
// thinlto_src_module metadata.
//
// Binary: 0x1854A20, 4326 bytes. Stack: 0x278.
//
// Globals read:
// byte_4FAAA20 enable_import_metadata (bool)
fn import_driver(
import_ctx, // rdi -> [rbp-0x258]: import state object
module_summary_idx, // rsi -> [rbp-0x260]: combined summary index
source_module_info, // rdx -> [rbp-0x278]: source module descriptor
guid_import_map, // rcx -> [rbp-0x268]: hash map of GUID -> import lists
// also saved to rbx
):
// ---- Initialize resolved-summary storage (0x1854A45) ----
sub_1674380(
&local_resolved_storage, // rdi = [rbp-0x290]
source_module_info // rsi = rdx
)
// ---- Check if import map is empty (0x1854A6C) ----
entry_count = guid_import_map[+0x08]
if entry_count == 0:
goto empty_import_path // 0x1854AB3
// ======================================================================
// PASS 1: PRIMARY CANDIDATE LIST (0x1854B99 -- 0x1854F3B)
// List head: [guid_import_map + 0x00]
// Importable flag: byte [node - 0x21] & 0x20
// Summary ptr: [node - 0x38]
// ======================================================================
primary_list = guid_import_map[+0x00] // rsi = [rbx]
// Scan to first valid entry (skip sentinels -8 and NULL)
cursor = primary_list[+0x00]
if cursor == 0xFFFFFFFFFFFFFFF8 || cursor == NULL:
scan forward through primary_list[+0x08], [+0x10], ...
// Inner scan: load qword, test for NULL, cmp against -8
// Stop at first non-null, non-sentinel entry
end_of_candidates = primary_list + entry_count * 8 // r12
while cursor != end_of_candidates: // 0x1854BF0
// ---- Load candidate descriptor ----
desc = *cursor // rax = [r14]
summary_data = desc[+0x00] // rdx = [rax]
cost_info = desc + 0x40 // threshold/cost at +0x40
// ---- Evaluate candidate (0x1854C02) ----
sub_1852CC0(&local_buf, guid_import_map) // import candidate evaluator
// ---- Advance to next valid entry ----
next = cursor[+0x08]
// Scan forward: skip NULL and sentinel -8 entries
while next == NULL || next == 0xFFFFFFFFFFFFFFF8:
next += 8
// ---- Per-node import decision loop (0x1854E39) ----
for each node in candidate.linked_nodes:
if node == NULL:
continue // test r15, r15
// Importable flag check
importable = node[-0x21] & 0x20 // test byte [r15-0x21], 0x20
if !importable:
continue // jz skip
// Extract function summary (stored 0x38 bytes before node)
func_summary = node[-0x38] // r13 = [r15-0x38]
// Resolve function name/info
sub_15E4EB0(cursor, func_summary) // 0x1854E61
// ---- Format import remark (diagnostic output) ----
resolved_threshold = [rbp-0x1D8]
resolved_info = [rbp-0x1E0]
sub_16C1840(guid_import_map, resolved_info, resolved_threshold)
// cost component remark
sub_16C1A90(guid_import_map, resolved_info, resolved_threshold)
// threshold component remark
sub_16C1AA0(guid_import_map, [rbp-0x210]) // finalize remark string
free([rbp-0x1E0]) // cleanup temp string
// ---- Threshold comparison gate (0x1854EE3) ----
cost = cursor[+0x10] // estimated function cost
hot_count = cursor[+0x08] // call frequency / hotness
qualifies = sub_18518A0(hot_count, cost) // THRESHOLD GATE
if !qualifies: // test rax,rax; jz skip
continue
// ---- Execute import (0x1854EF7) ----
sub_15E4B20(import_ctx, func_summary) // MATERIALIZE FUNCTION
// Check abort signal
status = [rbp-0xD0]
if status & 0xFFFFFFFFFFFFFFFE: // caller requested abort
goto early_return
// ---- Attach provenance metadata (0x1854F0D) ----
if byte_4FAAA20 != 0: // enable-import-metadata
source_name = sub_161FF10(func_summary) // resolve source module name
// Create optimization remark
sub_1627350(remark_ctx, 1) // edx=1: enabled
// Attach metadata string (0x1855261):
// lea rsi, "thinlto_src_module" ; 0x42BA2F8, length 0x12
sub_1627100(
func_summary, // target function
"thinlto_src_module", // metadata key (18 chars)
source_name // metadata value
)
// ======================================================================
// PASS 2: SECONDARY CANDIDATE LIST (0x1854F41 -- 0x1855074)
// List head: [guid_import_map + 0x10]
// Same importable-flag check: byte [node - 0x21] & 0x20
// Same summary extraction: [node - 0x38]
// ======================================================================
secondary_list = guid_import_map[+0x10] // r15 = [rcx+10h]
secondary_sentinel = guid_import_map[+0x08]
// Identical processing pattern:
// - Iterate linked-list nodes
// - Check importable flag: byte [r15-0x21] & 0x20
// - Extract summary: [r15-0x38]
// - sub_18518A0 threshold gate
// - sub_15E4B20 import execution
// - Conditional thinlto_src_module metadata attachment
for each node in secondary_list:
if node[-0x21] & 0x20 == 0:
continue
summary = node[-0x38]
if !sub_18518A0(node.hot_count, node.cost):
continue
sub_15E4B20(import_ctx, summary)
if byte_4FAAA20:
attach_provenance_metadata(summary)
// ======================================================================
// PASS 3: TERTIARY CANDIDATE LIST (0x1855074 -- 0x1855190)
// List head: [guid_import_map + 0x30]
// Different offsets:
// Summary extraction: [node - 0x30] (not -0x38)
// Importable flag: byte [node - 0x19] & 0x20 (not -0x21)
// ======================================================================
tertiary_list = guid_import_map[+0x30]
// Same processing pattern but with adjusted offsets:
for each node in tertiary_list:
if node[-0x19] & 0x20 == 0: // note: -0x19, not -0x21
continue
summary = node[-0x30] // note: -0x30, not -0x38
if !sub_18518A0(node.hot_count, node.cost):
continue
sub_15E4B20(import_ctx, summary)
if byte_4FAAA20:
attach_provenance_metadata(summary)
// ======================================================================
// POST-IMPORT: Result materialization (0x1854B3C -- 0x1854B97)
// ======================================================================
result_count = [rbp-0x100]
if result_count > 0:
import_source = sub_16704E0() // r13: source module handle
import_dest = sub_16704F0() // r14: destination module handle
result_base = [rbp-0x110]
result_end = result_base + result_count * 8
for each result_entry in result_base..result_end: // 0x1854B7D
func = *result_entry
// Skip if function already exists in source module
if sub_1670560(func, import_source): // test al,al; jnz next
continue
// Materialize into destination module
sub_1670560(func, import_dest)
// ======================================================================
// CLEANUP (0x1854AE7 -- 0x1854B22)
// ======================================================================
// Release import list entries (16-byte stride)
cleanup_base = [rbp-0xF0]
cleanup_count = eax
cleanup_end = cleanup_base + cleanup_count * 16
for each entry in cleanup_base..cleanup_end (stride=16):
value = entry[+0x00]
if value == 0xFFFFFFFFFFFFFFF8: // sentinel -8: empty
continue
if value == 0xFFFFFFFFFFFFFFFC: // sentinel -4: deleted
continue
sub_161E7C0(entry[+0x08]) // release associated data
free(cleanup_base) // j___libc_free_0
// ---- Empty-import finalization ----
empty_import_path: // 0x1854AB3
import_ctx.status = 0 // clear status byte
flags = import_ctx[+0x08]
flags = (flags & 0xFC) | 0x02 // set "import complete, no imports"
import_ctx[+0x08] = flags
sub_1851C60(&local_import_list) // finalize empty path cleanup
Why three passes with different offsets. The three linked lists represent three structural layers in the guid_import_map:
| Pass | List head offset | Summary offset | Importable-flag offset | Interpretation |
|---|---|---|---|---|
| 1 (primary) | [map+0x00] | node[-0x38] | node[-0x21] & 0x20 | Direct call targets from the current module -- highest priority because they are on the critical path |
| 2 (secondary) | [map+0x10] | node[-0x38] | node[-0x21] & 0x20 | Transitively-reachable functions (callees of callees) -- import enables deeper inlining chains |
| 3 (tertiary) | [map+0x30] | node[-0x30] | node[-0x19] & 0x20 | Speculative candidates (address-taken functions, indirect call targets inferred from devirtualization) -- lowest confidence |
The different offsets in pass 3 (-0x30 instead of -0x38, -0x19 instead of -0x21) indicate a different node layout for speculative candidates. These nodes carry less metadata (8 fewer bytes between the summary pointer and the node base, and the importable flag is 8 bytes closer to the node).
Threshold Comparison Gate (sub_18518A0)
The gate function takes two arguments -- hot_count (rdi) and cost (rsi) -- and returns nonzero if the candidate qualifies for import. The driver calls it at three points (once per pass). This function encapsulates the final accept/reject decision after the per-priority-class threshold adjustment has already been applied by sub_1853180.
// sub_18518A0 -- Threshold comparison gate
// Returns: nonzero if candidate should be imported, zero otherwise
//
// rdi = hot_count (call frequency from profile or summary)
// rsi = cost (adjusted threshold value from Stage 1)
fn threshold_gate(hot_count, cost) -> bool:
// The exact comparison logic depends on whether profile data
// is available. With profile data, hot_count is a raw call
// count; the gate compares the cost against a profile-weighted
// threshold. Without profile data, this degenerates to a
// direct comparison: cost <= threshold.
return hot_count > 0 || cost <= current_threshold
Threshold Multiplier Constants
The four floating-point multiplier constants are stored in the .data section and are set by the corresponding cl::opt registrations in ctor_184_0:
| Address | Knob | Default | Purpose |
|---|---|---|---|
| dword_4FAAE80 | import-hot-multiplier | 10.0 | Multiplier for hot callsites |
| dword_4FAACC0 | import-cold-multiplier | 0.0 | Multiplier for cold callsites |
| dword_4FAADA0 | import-critical-multiplier | 100.0 | Multiplier for critical callsites |
| dword_4FAB040 | (default path) | 1.0 | Multiplier when no priority class matches |
With the upstream default import-instr-limit of 100, a hot callsite gets threshold 1,000 instructions and a critical callsite gets threshold 10,000. The cold multiplier of 0.0 means cold functions are never imported by default -- the threshold evaluates to zero.
Effective threshold table (for import-instr-limit=100):
| Priority class | Multiplier | Effective threshold | Typical candidates |
|---|---|---|---|
| Critical (4) | 100.0x | 10,000 instructions | Manually annotated hot paths, PGO-identified critical edges |
| Hot (3) | 10.0x | 1,000 instructions | Profile-guided hot callsites, frequently-called templates |
| Default (0,2) | 1.0x | 100 instructions | Standard callsites without profile data |
| Cold (1) | 0.0x | 0 instructions | Provably cold paths -- never imported at default settings |
The evolution factors control how thresholds decay as imports cascade through the call graph:
| Knob | Default | Effect |
|---|---|---|
| import-instr-evolution-factor | 0.7 | Each transitive import level reduces the threshold to 70% of the previous |
| import-hot-evolution-factor | 1.0 | Hot callsite chains do not decay (threshold stays constant through transitive imports) |
The evolution factor is applied by the caller of sub_1853180 before passing base_threshold. For a chain A -> B -> C -> D where A is the root module:
- Import B into A: threshold = import-instr-limit (100)
- Import C into A (transitively via B): threshold = 100 * 0.7 = 70
- Import D into A (transitively via C via B): threshold = 100 * 0.7 * 0.7 = 49
For hot chains with import-hot-evolution-factor=1.0, the threshold remains 1,000 at every transitive level, enabling arbitrarily deep import chains for hot call paths.
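The decay schedule above can be reproduced in a few lines. Exact rational arithmetic (`Fraction`) is used here only so the documented values 100/70/49 fall out exactly; the binary does this math in floating point:

```python
from fractions import Fraction

def transitive_threshold(base: int, depth: int,
                         evolution: Fraction = Fraction(7, 10)) -> int:
    """Threshold for a candidate `depth` transitive levels below the root."""
    t = Fraction(base)
    for _ in range(depth):
        t *= evolution        # import-instr-evolution-factor per level
    return int(t)             # truncating conversion

print(transitive_threshold(100, 0))   # 100: import B into A
print(transitive_threshold(100, 1))   # 70:  import C via B
print(transitive_threshold(100, 2))   # 49:  import D via C via B
# Hot chains with import-hot-evolution-factor = 1.0 never decay:
print(transitive_threshold(1000, 5, Fraction(1)))   # 1000
```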
Global Import Budget
Two globals control the total import count:
| Address | Role | Default |
|---|---|---|
| dword_4FAB120 | Maximum allowed imports | -1 (unlimited) |
| dword_4FAA770 | Running import counter | 0 (reset per module) |
The budget check at 0x185340A:
mov eax, cs:dword_4FAB120 ; load budget
test eax, eax
js proceed ; negative = unlimited
cmp cs:dword_4FAA770, eax ; counter vs budget
jge skip ; at or over budget -> skip
When the budget is -1 (the import-cutoff default), the js (jump-if-sign) branch is taken unconditionally, bypassing the budget check. Setting -import-cutoff=N limits the total number of imported functions to N, useful for debugging import-related miscompilations via bisection.
The counter increment at 0x1853510:
add cs:dword_4FAA770, 1 ; increment after successful import
This is a non-atomic add -- safe because ThinLTO import runs single-threaded per module in CICC (unlike CPU LLVM where the thin link runs in parallel). The counter resets to 0 at the start of each module's import phase.
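The budget gate and counter can be sketched as a small state object. The class and field names are ours; the two fields model dword_4FAB120 and dword_4FAA770:

```python
class ImportBudget:
    """Model of the gate at 0x185340A and the counter bump at 0x1853510."""
    def __init__(self, cutoff: int = -1):   # -import-cutoff default: -1
        self.budget = cutoff                # dword_4FAB120
        self.count = 0                      # dword_4FAA770, reset per module

    def try_import(self) -> bool:
        # test eax,eax / js proceed: a negative budget means unlimited
        if self.budget >= 0 and self.count >= self.budget:
            return False                    # jge skip: budget exhausted
        self.count += 1                     # add cs:dword_4FAA770, 1
        return True

b = ImportBudget(cutoff=2)
print([b.try_import() for _ in range(4)])   # [True, True, False, False]
```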
Integration with the 20,000-Budget Inliner
The import + inline pipeline in CICC works as a two-phase system:
1. Import phase (this page): ThinLTO brings cross-module function bodies into the current module based on summary-guided threshold decisions. The imported functions are marked with thinlto_src_module metadata.
2. Inline phase (inliner cost model): The NVIDIA custom inliner at sub_1864060 runs with a 20,000-unit per-caller budget. Imported functions are prime inlining candidates precisely because they were imported for being called from this module.
The inliner-function-import-stats knob (registered in ctor_186_0 at 0x4DBEC0, values: basic or verbose) tracks how many imported functions were actually inlined. This provides feedback on whether the import thresholds are well-calibrated: if functions are imported but then not inlined (because they exceed the inline budget), the import was wasted compile time.
The typical flow for a template-heavy CUDA library like CUB or cutlass:
1. Each .cu file compiles to a ThinLTO bitcode module with a summary index
2. The thin link step reads all summaries and builds a combined index
3. For each module, sub_1853180 evaluates import candidates using the combined index
4. Hot template instantiations (e.g., cub::DeviceReduce::Sum<float>) get threshold base * 10.0 (hot) or base * 100.0 (critical)
5. The imported function bodies arrive in the module and are immediately available to the 20,000-budget inliner
6. The inliner folds the imported template bodies into their callers, eliminating .param marshaling
Entry Point: sub_1855B10
Address: 0x1855B10, 10,503 bytes. This is the runOnModule entry for the "function-import" pass (pipeline slot 43). It orchestrates the entire import flow:
fn function_import_pass_entry(module):
// Parse required options
if summary_file_path is empty:
error("error: -function-import requires -summary-file")
return
summary_index = load_summary_file(summary_file_path)
if summary_index is error:
error("Error loading file")
return
// Build GUID-to-import map from summary index
guid_import_map = build_import_map(module, summary_index)
// Stage 1: threshold computation
sub_1853180(summary_ctx, module_info, import_instr_limit,
guid_hash_table, result_array, visited_set)
// Stage 2: triple-pass import
sub_1854A20(import_ctx, summary_index, source_module, guid_import_map)
// Post-import: attribute propagation (if enabled)
if propagate_attrs:
propagate_summary_attributes(module, summary_index)
Knob Inventory
All knobs are registered across three constructors:
ctor_184_0 at 0x4DA920 (13,693 B -- ThinLTO Function Import options):
| Knob | Type | Default | Effect |
|---|---|---|---|
| import-instr-limit | unsigned | 100 | Base instruction count threshold |
| import-cutoff | int | -1 | Max total imports (-1 = unlimited) |
| import-instr-evolution-factor | float | 0.7 | Threshold decay per transitive level |
| import-hot-evolution-factor | float | 1.0 | Hot chain decay (1.0 = no decay) |
| import-hot-multiplier | float | 10.0 | Threshold multiplier for hot callsites |
| import-critical-multiplier | float | 100.0 | Threshold multiplier for critical callsites |
| import-cold-multiplier | float | 0.0 | Threshold multiplier for cold callsites |
| print-imports | bool | false | Print names of imported functions |
| print-import-failures | bool | false | Print rejected candidates with reasons |
| compute-dead | bool | true | Strip dead symbols from index |
| enable-import-metadata | bool | false | Attach thinlto_src_module / thinlto_src_file metadata |
| summary-file | string | (none) | Summary file path for -function-import |
| import-all-index | bool | false | Import every external function in the index |
ctor_420_0 at 0x532010 (11,787 B -- pass-level ThinLTO options):
| Knob | Type | Default | Effect |
|---|---|---|---|
| force-import-all | bool | false | Import even noinline functions |
| import-declaration | bool | false | Import function declarations as fallback |
| thinlto-workload-def | string | (none) | JSON file mapping root functions to import lists |
ctor_029 at 0x489C80 (1,120 B -- supplementary ThinLTO options):
| Knob | Type | Default | Effect |
|---|---|---|---|
| propagate-attrs | bool | true | Propagate attributes through the summary index |
| import-constants-with-refs | bool | true | Import constant globals that have references |
ctor_419 at 0x531850 (6,358 B -- FunctionAttrs inference):
| Knob | Type | Default | Effect |
|---|---|---|---|
| disable-thinlto-funcattrs | bool | false | Disable function attribute inference from ThinLTO summaries |
Data Structures
Import Candidate Linked List
Each of the three priority lists in the guid_import_map is a singly-linked list with 8-byte node entries:
| Offset | Content |
|---|---|
| [node+0x00] | Entry value (pointer to candidate descriptor, or GUID) |
| [node+0x08] | Next slot / next node pointer |
Sentinels: 0xFFFFFFFFFFFFFFF8 (-8) = empty slot, 0xFFFFFFFFFFFFFFFC (-4) = deleted slot. These sentinel values are standard open-addressing hash map markers repurposed for the linked-list traversal.
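The sentinel-skipping traversal the driver's forward scans perform can be sketched as follows (helper and constant names are ours):

```python
EMPTY   = 0xFFFFFFFFFFFFFFF8   # -8 as u64: empty slot
DELETED = 0xFFFFFFFFFFFFFFFC   # -4 as u64: deleted slot

def live_entries(slots):
    """Yield non-sentinel entries, as the driver's forward scans do."""
    for value in slots:
        if value in (EMPTY, DELETED):
            continue
        yield value

print(list(live_entries([EMPTY, 0x1000, DELETED, 0x2000])))   # [4096, 8192]
```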
GUID Import Map Layout
The guid_import_map structure (parameter rcx of sub_1854A20) contains the three priority lists:
| Offset | Size | Content |
|---|---|---|
| +0x00 | 8 | Primary list head (direct call targets) |
| +0x08 | 8 | Entry count / secondary sentinel |
| +0x10 | 8 | Secondary list head (transitive callees) |
| +0x18 | 8 | (reserved / alignment) |
| +0x20 | 8 | (reserved / alignment) |
| +0x28 | 8 | (reserved / alignment) |
| +0x30 | 8 | Tertiary list head (speculative candidates) |
GUID Dedup Hash Table
| Field | Size | Description |
|---|---|---|
| Slot size | 16 bytes | {GUID (8B), metadata (8B)} |
| Hash function | multiplicative | slot = (GUID * 37) & (table_size - 1) |
| Collision resolution | linear probing | Increment slot by 1, wrap at table_size |
| Empty sentinel | -1 | 0xFFFFFFFFFFFFFFFF |
| Size field | offset +0x18 | Number of slots in table (always power of 2) |
The multiplication constant 37 produces reasonable distribution for GUIDs that are typically MD5 hashes of mangled names. The linear probing is adequate because the table is sized to maintain a low load factor.
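The insert path can be sketched in C under the parameters above (power-of-two table, -1 empty sentinel, multiplier 37, linear probing); the type and function names are illustrative, not recovered symbols:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64                    /* must be a power of two */
#define SLOT_EMPTY 0xFFFFFFFFFFFFFFFFULL /* -1 empty sentinel */

typedef struct { uint64_t guid; uint64_t meta; } Slot;  /* 16-byte slot */

/* slot = (GUID * 37) & (table_size - 1); linear probe on collision,
   wrapping at table_size. Returns the slot index; duplicate GUIDs
   return the existing slot (dedup). */
static uint64_t insert_guid(Slot *table, uint64_t guid, uint64_t meta) {
    uint64_t slot = (guid * 37) & (TABLE_SIZE - 1);
    while (table[slot].guid != SLOT_EMPTY) {
        if (table[slot].guid == guid)
            return slot;                      /* dedup hit */
        slot = (slot + 1) & (TABLE_SIZE - 1); /* linear probe */
    }
    table[slot].guid = guid;
    table[slot].meta = meta;
    return slot;
}
```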
Result Array
Growable array with 24-byte entries:
| Offset | Size | Content |
|---|---|---|
| +0x00 | 8 | Function GUID |
| +0x08 | 4 | Adjusted threshold value |
| +0x10 | 8 | Import record pointer |
Header: [+0x08] = current count, [+0x0C] = capacity. Growth is handled by a realloc path when count >= capacity.
Per-Function Summary Entry (import-relevant fields)
| Offset | Size | Content |
|---|---|---|
| +0x08 | 4 | Entry type (2 = function summary) |
| +0x0C | 1 | Linkage byte: low 4 bits = linkage type, bit 4 = not-importable flag, bit 5 = importable flag |
| +0x40 | 4 | Function cost (IR instruction count, used for threshold comparison) |
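Decoding the linkage byte at +0x0C reduces to three bit extractions; this is a small sketch with illustrative helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Summary-entry linkage byte at +0x0C:
   low 4 bits = linkage type, bit 4 = not-importable, bit 5 = importable. */
static uint8_t linkage_type(uint8_t b)   { return b & 0x0F; }
static int     not_importable(uint8_t b) { return (b >> 4) & 1; }
static int     importable(uint8_t b)     { return (b >> 5) & 1; }
```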
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| ThinLTO import driver (triple-pass candidate processing) | sub_1854A20 | 4,326 B | -- |
| Threshold computation with GUID dedup and priority-class multipliers | sub_1853180 | 5,059 B | -- |
| Threshold comparison gate (returns nonzero if candidate qualifies) | sub_18518A0 | -- | -- |
| Import candidate evaluator (prepares candidate for threshold check) | sub_1852CC0 | -- | -- |
| Import list builder (called by sub_1853180) | sub_1852FB0 | -- | -- |
| Import list node allocator (called by sub_1853180) | sub_1852A30 | -- | -- |
| Import list initialization (called by sub_1853180) | sub_1851200 | -- | -- |
| Execute import decision (materialize function into destination) | sub_15E4B20 | -- | -- |
| Resolve function name/info from summary | sub_15E4EB0 | -- | -- |
| Entry point (parses -function-import / -summary-file) | sub_1855B10 | 10,503 B | -- |
| Whole-module ThinLTO processing | sub_1858B90 | 31,344 B | -- |
| Type metadata propagation during import | sub_185E850 | 24,263 B | -- |
| Attach named metadata (used for thinlto_src_module) | sub_1627100 | -- | -- |
| Create optimization remark (import diagnostic) | sub_1627350 | -- | -- |
| Resolve source module name string | sub_161FF10 | -- | -- |
| Check if function exists in a given module | sub_1670560 | -- | -- |
| Get "import source" module handle | sub_16704E0 | -- | -- |
| Get "import destination" module handle | sub_16704F0 | -- | -- |
| Format import remark (cost component) | sub_16C1840 | -- | -- |
| Format import remark (threshold component) | sub_16C1A90 | -- | -- |
| Finalize import remark string | sub_16C1AA0 | -- | -- |
| Hash table insert (GUID dedup table) | sub_1851560 | -- | -- |
| Initialize resolved function summary storage | sub_1674380 | -- | -- |
| Finalize empty-import path cleanup | sub_1851C60 | -- | -- |
| Release import list entry data | sub_161E7C0 | -- | -- |
| malloc wrapper (used for 16-byte dedup node allocation) | sub_22077B0 | -- | -- |
Cross-References
- Inliner Cost Model -- the downstream consumer of imported functions. Import brings bodies into the module; the 20,000-budget inliner decides whether to fold them into callers.
- Module Summary -- sub_D7D4E0 builds the NVModuleSummary that drives import decisions. The 4-level priority system, complexity budget, and CUDA-specific filtering all originate here.
- Pipeline & Ordering -- function-import is registered as pipeline slot 43, a Module-level pass.
- IP Memory Space Propagation -- after import, cross-module functions may carry address-space annotations that IPMSP must reconcile.
- Hash Infrastructure -- the GUID dedup table uses the same DenseMap pattern documented there.
GlobalOpt for GPU
CICC implements a custom GlobalOpt pass (sub_18612A0, 65 KB, 2179 decompiled lines) that replaces LLVM's stock GlobalOptPass with GPU-aware global variable transformations. The pass operates on NVIDIA's internal IR representation rather than LLVM IR directly, and adds address-space-aware logic that stock LLVM lacks entirely: it extracts the CUDA address space from the global's flags byte ((flags >> 2) & 7), preserves that address space through all generated replacement globals, and applies promotion thresholds calibrated for the GPU memory hierarchy. The pass runs at pipeline position 30 in the tier-2 and tier-3 optimization sequences (via wrapper sub_196A2B0), immediately after GlobalDCE / ConstantProp (sub_1968390) and before LoopVectorize; it runs at -O2 and above, and tier-1 does not include it. The inliner cost model also calls into sub_18612A0 as a subroutine when evaluating whether a callee's globals can be folded after inlining, creating a tight coupling between inlining decisions and global optimization.
The pass implements four transformation strategies with decreasing priority: small-constant promotion for globals under 2047 bits, scalar replacement of aggregates (SRA) for struct globals with up to 16 fields, malloc/free elimination for heap-allocated globals with single-unit access, and a hash-table-driven deduplication cleanup pass. Each strategy preserves the original global's NVPTX address space, which is critical -- a __device__ global in address space 1 must remain in AS 1 after splitting, not silently migrate to AS 0 (generic). The generated IR uses distinctive suffixes (.body, .init, .val, .notinit, .f0...f15, .isneg, .isnull) that survive through to PTX emission and are visible in cuobjdump output.
| Core transform | sub_18612A0 (0x18612A0, 65 KB, 2179 lines) |
| Pipeline wrapper | sub_196A2B0 (0x196A2B0) |
| Recursive re-application | sub_185B1D0 (0x185B1D0) |
| Pre-SRA setup | sub_185B7E0 (0x185B7E0) |
| Hash table rehash | sub_1860410 (0x1860410) |
| Per-user SRA rewrite | sub_1860BE0 (0x1860BE0) |
| Pipeline position | Step 30 (tier 2/3), after GlobalDCE, before LoopVectorize |
| Minimum opt level | -O2 (tier 2) |
| Pass registration | "globalopt" in pipeline parser at slot 45 |
| IR node allocation | 88 bytes per global, 64 bytes per basic block, 56 bytes per instruction |
Address Space Handling
Every transformation in this pass must respect CUDA address spaces. The global's address space is extracted at line 577 of the decompilation:
uint8_t addr_space = (*(uint8_t*)(global + 33) >> 2) & 7;
The NVPTX address spaces relevant here are 0 (generic), 1 (global/__device__), 3 (shared/__shared__), 4 (constant/__constant__), and 5 (local). See Address Spaces for the complete table with hardware mapping, pointer widths, and latency numbers.
When sub_18612A0 creates replacement globals via sub_15E51E0, it passes the extracted address space to the constructor. The created global inherits the same address space, linkage (always internal, linkage code 7), and metadata (copied via sub_15E6480). This is the key delta from stock LLVM: upstream GlobalOpt does not consider address space when splitting globals because host-side address spaces are trivial. On GPU, promoting a __shared__ struct global to per-field __shared__ globals preserves the 10x latency advantage over DRAM, while accidentally demoting to generic would force the hardware to resolve address space at runtime via the generic-to-specific address resolution unit.
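The extraction expression can be isolated as a one-line helper; the address-space values asserted below follow the NVPTX numbering listed above (the helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* CUDA address space lives in bits 2-4 of the flags byte at global+33:
   (flags >> 2) & 7. */
static uint8_t extract_addr_space(uint8_t flags) {
    return (flags >> 2) & 7;
}
```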
Entry Guard: Type Filtering
Before attempting any transformation, the pass filters on the global's type tag (byte at type + 8). The acceptance bitmask is 0x8A7E:
// Bits set: 1,2,3,4,5,6,9,11,15
uint16_t bitmask = 0x8A7E;
if ((1 << type_tag) & bitmask) {
// accepted: i16, i32, i64, x86_fp80, i128, fp128, double, custom-width iN, opaque ptr
}
Additionally, struct (tag 13), vector (tag 14), and array (tag 16) types are accepted if sub_16435F0(type, 0) returns true -- this is the isAnalyzableType predicate that recursively checks whether the type's leaf elements are all scalars or pointers.
After type filtering, the pass walks the global's use-list. Every user must be either a store (opcode tag 54) or a load (opcode tag 55). If any user is an arithmetic instruction (tag <= 23), a GEP used in a non-trivial way, or any other instruction kind, the global is rejected -- it cannot be optimized because its address escapes or is used in a way the pass cannot model.
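The mask test is easy to sanity-check in isolation; this sketch uses the tag values from the size-computation table in the next section (the helper name is illustrative). Note that struct (tag 13) is not in the direct mask -- it is only accepted via the sub_16435F0 predicate:

```c
#include <assert.h>
#include <stdint.h>

/* Entry-guard type filter: a type tag is directly accepted iff its
   bit is set in the 0x8A7E acceptance mask. */
static int tag_directly_accepted(uint8_t type_tag) {
    const uint16_t bitmask = 0x8A7E;
    return ((1u << type_tag) & bitmask) != 0;
}
```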
Path A: Small-Constant Promotion
When the global's initializer is a struct constant and its total bit-size (including alignment padding) fits within 2047 bits (0x7FF), the pass promotes it into a function-local value with a separate initializer function. This threshold is NVIDIA-specific -- upstream LLVM uses different heuristics based on TargetData layout considerations. The 2047-bit ceiling corresponds roughly to 64 32-bit registers, aligning with the per-thread register budget on most SM architectures where promoting beyond that limit would spill to local memory and negate the benefit.
Size Computation
The pass walks the type tree recursively to compute total bit-size. The implementation at lines 499-570 of the decompilation uses a switch on the type tag byte at type + 8:
| Type tag | Type | Bits |
|---|---|---|
| 0x1 | i16 / half | 16 |
| 0x2 | i32 / float | 32 |
| 0x3 | i64 | 64 |
| 0x4 | x86_fp80 | 80 |
| 0x5 | i128 | 128 |
| 0x6 | fp128 / ppc_fp128 | 128 |
| 0x7 | pointer | sub_15A9520(target, 0) * 8 |
| 0x9 | double | 64 |
| 0xB | iN (custom width) | from type word >> 8 |
| 0xD | struct | 8 * field_count (via sub_15A9930) |
| 0xE | vector | 8 * alignment * num_elements * padded_size |
| 0xF | opaque ptr | sub_15A9520(target, addr_space) * 8 |
| 0x0, 0x8, 0xA, 0xC, 0x10 | array variants | element_size * array_length (recursive) |
Note that opaque pointers (tag 0xF) use getPointerSizeInBits(target, addr_space) -- the pointer size varies by address space on NVPTX (64-bit for AS 0/1, potentially 32-bit for AS 3/5 on some targets). Tags 0x0, 0x8 (label/token), 0xA (metadata), and 0xC (bfloat) all fall into the array-multiplier path -- they extract an element count and recurse, which handles the case where these type wrappers contain inner array types.
The pseudocode for the size computation:
// sub_18612A0, lines 499-570
uint64_t compute_total_bits(Type *type, TargetInfo *target, uint8_t addr_space) {
uint8_t tag = *(uint8_t *)(type + 8);
switch (tag) {
case 0x1: return 16; // i16 / half
case 0x2: return 32; // i32 / float
case 0x3: return 64; // i64
case 0x4: return 80; // x86_fp80
case 0x5: return 128; // i128
case 0x6: return 128; // fp128 / ppc_fp128
case 0x7: return sub_15A9520(target, 0) * 8; // generic pointer
case 0x9: return 64; // double
case 0xB: return *(uint32_t *)(type + 8) >> 8; // iN custom-width
case 0xD: { // struct
uint64_t layout = sub_15A9930(target, type); // getStructLayout
return 8 * *(uint32_t *)(layout + 12); // 8 * element_count
}
case 0xE: { // vector
uint64_t align = sub_15A9FE0(target, type); // getAlignment
uint64_t n_elts = *(uint32_t *)(type + 12);
uint64_t elem_bits = compute_total_bits(
sub_16463B0(type, 0), target, addr_space); // getArrayElementType
return 8 * align * n_elts * ((elem_bits + align - 1) / align);
}
case 0xF: return sub_15A9520(target, addr_space) * 8; // opaque ptr (AS-aware)
default: { // 0x0,0x8,0xA,0xC,0x10: array
uint64_t n_elts = *(uint32_t *)(type + 12);
Type *elem = sub_16463B0(type, 0); // getArrayElementType
return n_elts * compute_total_bits(elem, target, addr_space);
}
}
}
The acceptance check at line 570:
if (total_elements * alignment * ceil_div(total_bits, alignment) > 0x7FF)
goto path_b; // too large, try SRA instead
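A simplified, self-contained model of the size walk and the 0x7FF ceiling can make the threshold concrete. The Type encoding below is a stand-in for illustration, covering only the scalar and array cases -- it omits the binary's struct/vector layout queries (sub_15A9930 and friends):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: scalar tags map to fixed widths; the array path
   multiplies element size by element count, recursively. */
typedef struct Ty { uint8_t tag; uint32_t n; struct Ty *elem; } Ty;

static uint64_t total_bits(const Ty *t) {
    switch (t->tag) {
    case 0x1: return 16;   /* i16 / half */
    case 0x2: return 32;   /* i32 / float */
    case 0x3: return 64;   /* i64 */
    case 0x9: return 64;   /* double */
    default:  return t->n * total_bits(t->elem); /* array path */
    }
}

/* Path A ceiling: anything above 0x7FF (2047) bits falls through to SRA. */
static int fits_small_constant(uint64_t bits) { return bits <= 0x7FF; }
```

For example, a [16 x i32] global (512 bits) qualifies for promotion, while [64 x i32] (2048 bits) is one bit over the ceiling and falls through to Path B.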
Generated IR Pattern
For a qualifying global, the pass generates three components:
; Original: @my_global = addrspace(1) global { i32, i32 } { i32 42, i32 7 }
; After promotion:
@my_global.body = internal addrspace(1) global { i32, i32 } { i32 42, i32 7 }
define internal void @my_global.init() {
store { i32, i32 } { i32 42, i32 7 }, ptr addrspace(1) @my_global.body
ret void
}
; All loads of @my_global replaced with: load ptr addrspace(1) @my_global.body
; ExtractValue users get ".val" accessors
; Uninitialized code paths get "notinit" sentinel via sub_15FB630
The .body global is created via sub_15E51E0 with the same address space and internal linkage (code 7). The .init function is created via sub_15E5070. The pass then walks all users of the original global: loads (tag 55) get redirected to the .body global, GEPs (tag 71) get RAUW'd via sub_164D160, and extractvalue instructions (tag 75) get specialized .val accessors. Sub-opcodes on the extractvalue determine further handling: codes 0x20/0x25/0x29 produce notinit sentinels, 0x24/0x28 extract terminal types via sub_159C540, and 0x21-0x23/0x26-0x27 pass through unchanged.
The full promotion pseudocode covering body creation, init creation, and use rewriting:
// sub_18612A0, lines 577-805 — Path A: small-constant promotion
void promote_small_constant(Global *global, Module *module, Value *init_val,
Type *type, TargetInfo *target) {
// --- Extract address space from global flags ---
uint8_t addr_space = (*(uint8_t *)(global + 33) >> 2) & 7;
// --- Create ".body" global in same address space ---
void *node = sub_1648A60(88, 1); // IRBuilder::create
Global *body_gv = sub_15E51E0(
get_scope(module), type, /*init=*/0, /*linkage=*/7,
concat_name(global, ".body"), addr_space); // createGlobalVar
sub_15E6480(global, body_gv); // copyMetadata
// --- Rewrite all users of original global ---
Use *use = *(Use **)(global + 8); // use-list head
while (use != NULL) {
Instruction *inst = sub_1648700(use); // getInstruction
uint8_t opcode = *(uint8_t *)(inst + 16);
if (opcode == 71) { // GEP
// If GEP references old global, RAUW to body
sub_164D160(inst, body_gv); // RAUW
sub_15F20C0(inst); // eraseFromParent
} else {
// Create local variable referencing body
Value *local = sub_15FD590(inst, get_scope(module),
"newgv", module); // createLocalVar
sub_1648780(use, local); // replaceUseWith
}
use = *(Use **)(use + 8); // next use
}
// --- Create ".init" function ---
Function *init_fn = sub_15E5070(
get_scope(module), type, /*linkage=*/7,
init_val, concat_name(global, ".init")); // createFunction
int init_user_count = 0;
// Walk users again for extractvalue and load rewriting
use = *(Use **)(body_gv + 8);
while (use != NULL) {
Instruction *inst = sub_1648700(use);
uint8_t opcode = *(uint8_t *)(inst + 16);
if (opcode == 55) { // load
sub_15F9480(init_val, init_fn); // createStoreInit
init_user_count++;
} else if (opcode == 75) { // extractvalue
Value *val_acc = sub_15F8F80(inst, type, init_fn,
concat_name(global, ".val")); // createExtractValue
uint8_t sub_opcode = *(uint8_t *)(inst + 24);
switch (sub_opcode) {
case 0x20: case 0x25: case 0x29:
// Uninitialized path: create "notinit" sentinel
sub_15FB630(val_acc, "notinit", inst); // createNotInit
break;
case 0x24: case 0x28:
// Terminal type extraction
sub_159C540(val_acc); // getTerminalType
break;
default: // 0x21-0x23, 0x26-0x27
break; // pass-through
}
sub_164D160(inst, val_acc); // RAUW
sub_15F20C0(inst); // eraseFromParent
init_user_count++;
}
use = *(Use **)(use + 8);
}
// --- Finalize ---
if (init_user_count > 0) {
sub_1631BE0(module_fn_list, init_fn); // insertIntoFnList
// Patch metadata chain at global+56
*(void **)(global + 56) = init_fn;
} else {
// Dead init function: destroy
sub_15E5530(init_fn); // destroyFunctionBody
sub_159D9E0(init_fn); // destroyFunction
sub_164BE60(init_fn); // dropAllReferences
sub_1648B90(init_fn); // markDead (flags |= 1)
}
sub_15E55B0(global); // erase original global
sub_15F20C0(module_entry); // erase module-level ref
// --- Recursive re-application to newly created .body ---
sub_185B1D0(body_gv, target); // recursiveGlobalOpt
}
After rewriting all uses, if the .init function has users, it is linked into the module's function list via sub_1631BE0. If it has zero users (the initializer was never needed), the function body is destroyed and marked dead. The original global is erased via sub_15E55B0. Finally, sub_185B1D0 recursively re-applies GlobalOpt to the newly created .body global, enabling cascaded optimizations.
Path B: Scalar Replacement of Aggregates (SRA)
When a global is too large for constant promotion, the pass attempts SRA -- exploding a struct global into per-field scalar globals. This path has stricter preconditions:
- The caller's flag parameter (a4) must be zero -- when set, SRA is disabled.
- The initializer must be the unique initializer for this global (verified via sub_15A0680).
- The type must be a struct (tag 13) with 1 to 16 fields: field_count - 1 <= 0xF.
- Every user must reference only this global -- no cross-global pointer arithmetic.
The 16-field limit is a hardcoded constant at line 822 of the decompilation. It prevents combinatorial explosion in the null-check and free chains that follow: each field generates one icmp eq (null check), one or, one conditional branch, one free_it block, and one next block. Beyond 16 fields the cost of the generated guard code would exceed the benefit of splitting.
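The 1-to-16 gate is a single unsigned comparison; because the subtraction wraps, it also rejects a zero-field struct. A sketch with an illustrative helper name:

```c
#include <assert.h>
#include <stdint.h>

/* SRA field-count gate: field_count - 1 <= 0xF, evaluated unsigned.
   Accepts 1..16; field_count == 0 wraps to UINT32_MAX and is rejected. */
static int sra_field_count_ok(uint32_t field_count) {
    return (field_count - 1u) <= 0xF;
}
```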
Use Analysis: Store Value Collection
Before field explosion, the pass collects all stored values into a hash set to determine which initializers are live. For each store (tag 54) user of the global, sub_185CAF0 inserts the stored value into a hash/set structure at v432. The scratch buffer starts with capacity 32 and grows via sub_16CC920 when full. This collection serves two purposes: it validates that all stores write analyzable values (no opaque function pointers or computed addresses), and it builds the value set used later to initialize the per-field globals.
// sub_18612A0, lines 823-868 — Store value collection for SRA
void collect_store_values(Global *global, Module *module,
HashSet *store_set, Buffer *scratch) {
Use *use = *(Use **)(global + 8);
int store_count = 0;
while (use != NULL) {
Instruction *inst = sub_1648700(use); // getInstruction
uint8_t opcode = *(uint8_t *)(inst + 16);
if (opcode == 54) { // store
sub_185CAF0(use, store_set, scratch); // collectStoredValue
store_count++;
// Grow scratch if full
if (scratch->size >= scratch->capacity) {
if (scratch->capacity < 64)
memset(scratch->data, 0xFF, scratch->capacity * 8);
else
sub_16CC920(scratch); // growScratchBuffer
}
}
use = *(Use **)(use + 8);
}
}
Global-Only-Use Validation
After collection, lines 878-1017 validate that every user of every collected global references only the target global -- no cross-global pointer arithmetic is allowed. The validation walks the use chain of each collected global. For each operand slot (24-byte stride, count from *(uint32_t *)(global + 20) & 0xFFFFFFF):
- If the operand is the module itself: accepted.
- If the opcode tag is <= 0x17 (arithmetic/comparison): rejected -- the global's address is used in computation.
- If the opcode is 77 (GEP): the pass calls sub_16CC9F0 (find in sorted set) to verify the GEP's base pointer is the same global being split.
- If the opcode is 54 (store): the pass checks that the store's parent basic block (at offset -24 from the operand) belongs to the global being analyzed.
If any operand fails validation, a flag v17 is set to zero and the entire SRA path is abandoned for this global.
Field Explosion
For each field index 0 through field_count - 1, the pass creates a replacement global variable in the same address space with internal linkage. The full pseudocode at lines 1084-1476:
// sub_18612A0, lines 1084-1476 — SRA field explosion
typedef struct {
Global **data;
uint64_t size;
uint64_t capacity;
} FieldVec;
void sra_explode_fields(Global *global, Module *module, Type *struct_type,
Value *init_val, TargetInfo *target, FieldVec *fields) {
uint8_t addr_space = (*(uint8_t *)(global + 33) >> 2) & 7;
const char *global_name = sub_1649960(global); // getName
uint32_t field_count = *(uint32_t *)(struct_type + 12);
uint64_t ptr_bits = sub_15A9520(target, addr_space); // getPointerSizeInBits
for (uint32_t i = 0; i < field_count; i++) {
// --- Extract field type and offset ---
Type *field_type = sub_1646BA0(struct_type, ptr_bits); // getStructFieldType
uint64_t field_offset = sub_15A06D0(struct_type, i); // computeFieldOffset
// --- Generate name: "my_global.f0", "my_global.f1", ... ---
char name[256];
snprintf(name, sizeof(name), "%s.f%d", global_name, i);
// --- Extract field initializer from parent init ---
Value *field_init = sub_15FEBE0(module, init_val, field_type); // createBitcast/GEP
// --- Create field global in same address space, internal linkage ---
Global *field_gv = sub_15E51E0(
get_scope(module), field_type, field_init,
/*linkage=*/7, name, addr_space); // createGlobalVar
// --- Copy metadata from parent to field global ---
sub_15E6480(global, field_gv); // copyMetadata
// --- Store into dynamically-grown field vector ---
if (fields->size >= fields->capacity) {
// Realloc growth: double capacity (lines 1161-1220)
uint64_t new_cap = fields->capacity * 2;
if (new_cap < 8) new_cap = 8;
fields->data = realloc(fields->data, new_cap * sizeof(Global *));
fields->capacity = new_cap;
}
fields->data[fields->size++] = field_gv;
// --- Compute field bit-size (same type switch as Path A) ---
uint64_t field_bits = compute_total_bits(field_type, target, addr_space);
uint64_t alignment;
if (*(uint8_t *)(field_type + 8) == 0xD) { // struct
uint64_t layout = sub_15A9930(target, field_type);
alignment = *(uint64_t *)(layout + 8);
} else {
alignment = sub_15A9FE0(target, field_type); // getAlignment
}
uint64_t padded = alignment * ((field_bits + alignment - 1) / alignment);
// --- Create GEP replacement and store initializer ---
Value *gep = sub_15FEBE0(module, field_gv, field_type); // createBitcast/GEP
sub_15F9660(field_offset, field_gv, global); // createFieldStore
}
}
The field globals are stored in a dynamically-grown std::vector with realloc growth strategy (lines 1161-1220 of the decompilation). The growth factor is 2x with a minimum initial capacity of 8 entries.
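The growth policy can be sketched directly from the pseudocode above (2x doubling, minimum capacity 8); this is an illustrative model, not the recovered code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { void **data; uint64_t size; uint64_t capacity; } FieldVec;

/* Growth policy from the decompiled SRA path: double the capacity on
   overflow, never dropping below 8 entries. */
static void fieldvec_push(FieldVec *v, void *p) {
    if (v->size >= v->capacity) {
        uint64_t new_cap = v->capacity * 2;
        if (new_cap < 8) new_cap = 8;
        v->data = realloc(v->data, new_cap * sizeof(void *));
        v->capacity = new_cap;
    }
    v->data[v->size++] = p;
}
```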
Null/Negative Guards
After field explosion, the pass generates safety checks for the original global's pointer value. This pattern handles the case where the global was heap-allocated via malloc -- the original pointer might be null or negative (indicating allocation failure on some platforms). The guard chain is constructed at lines 1478-1535:
// sub_18612A0, lines 1478-1535 — Null/negative guard chain generation
Value *build_guard_chain(Global *global, FieldVec *fields,
Module *module, TargetInfo *target) {
// --- Create %isneg = icmp slt <ptr>, 0 ---
// Opcode 51 = ICmp, predicate 40 = SLT (signed less than zero)
Value *isneg = sub_15FEC10(
/*dest=*/NULL, /*type_id=*/1, /*opcode=*/51, /*pred=*/40,
get_module_sym(module), /*offset=*/0,
concat_name(global, ".isneg"), get_current_bb(module)); // createICmp
Value *chain = isneg;
// --- For each field: %isnullI = icmp eq <field_ptr>, null ---
for (uint64_t i = 0; i < fields->size; i++) {
Global *field_gv = fields->data[i];
uint64_t field_offset = sub_15A06D0(
get_type(global), i); // computeFieldOffset
// Predicate 32 = EQ (equal to null)
Value *isnull = sub_15FEC10(
/*dest=*/NULL, /*type_id=*/1, /*opcode=*/51, /*pred=*/32,
field_gv, field_offset,
concat_name(global, ".isnull"), get_current_bb(module));
// Chain with OR: %tmpI = or i1 %chain, %isnullI
// Opcode 27 = OR
char tmp_name[16];
snprintf(tmp_name, sizeof(tmp_name), "tmp%lu", i);
chain = sub_15FB440(/*opcode=*/27, chain, isnull,
tmp_name, module); // createBinOp(OR)
}
return chain; // final chained predicate
}
The generated IR for a 3-field struct:
%isneg = icmp slt ptr @original_global, null ; predicate 40 = SLT
%isnull0 = icmp eq ptr @my_global.f0, null ; predicate 32 = EQ
%tmp0 = or i1 %isneg, %isnull0
%isnull1 = icmp eq ptr @my_global.f1, null
%tmp1 = or i1 %tmp0, %isnull1
%isnull2 = icmp eq ptr @my_global.f2, null
%tmp2 = or i1 %tmp1, %isnull2
br i1 %tmp2, label %malloc_ret_null, label %malloc_cont
The .isneg guard is created by sub_15FEC10 with opcode 51 (ICmp), predicate 40 (SLT with zero). Per-field .isnull guards use predicate 32 (EQ with null). The guards are chained with OR instructions (opcode 27) via sub_15FB440. The chain evaluation is linear in the number of fields -- for the maximum 16 fields, this produces 17 icmp instructions and 16 or instructions, plus one terminal conditional branch.
Malloc/Free Decomposition Algorithm
This is the core of NVIDIA's per-field malloc/free elimination, covering lines 1537-1640 of the decompilation. When the chained null check indicates a valid allocation, the pass generates a multi-block control flow that replaces the original single malloc/free pair with per-field conditional frees. This is the key divergence from upstream LLVM: stock tryToOptimizeStoreOfMallocToGlobal treats the malloc/free as an atomic pair, replacing it with a single static allocation. NVIDIA decomposes to per-field granularity, generating 2N+2 basic blocks for an N-field struct (one malloc_ret_null, one malloc_cont, and for each field one free_it plus one next block).
The complete pseudocode:
// sub_18612A0, lines 1537-1640 — Malloc/free decomposition
void decompose_malloc_free(Global *global, Module *module, Function *fn,
FieldVec *fields, Value *guard_chain,
TargetInfo *target) {
uint8_t addr_space = (*(uint8_t *)(global + 33) >> 2) & 7;
// === Step 1: Create control flow skeleton ===
// "malloc_cont" — continuation after successful allocation check
BasicBlock *malloc_cont_bb = sub_157FBF0(
fn, get_global_chain(module), "malloc_cont"); // createBB
// "malloc_ret_null" — failure path returning null
BasicBlock *ret_null_body = sub_157E9C0(fn); // createReturnBB
BasicBlock *malloc_ret_null_bb = sub_157FB60(
NULL, ret_null_body, "malloc_ret_null", NULL); // createBBWithPred
// === Step 2: Emit conditional branch on guard chain ===
// br i1 %guard_chain, label %malloc_ret_null, label %malloc_cont
sub_15F8650(
get_terminator(fn), // insertion point
malloc_ret_null_bb, // true target (fail)
malloc_cont_bb, // false target (success)
guard_chain, // condition (isneg|isnull)
fn); // createCondBr
// === Step 3: Per-field conditional free and reinitialization ===
BasicBlock *current_bb = malloc_cont_bb;
for (uint64_t i = 0; i < fields->size; i++) {
Global *field_gv = fields->data[i];
uint64_t field_offset = sub_15A06D0(get_type(global), i);
Type *field_type = sub_1646BA0(get_type(global),
sub_15A9520(target, addr_space));
// 3a. Create "tmp" alloca in current block
Value *tmp_alloca = sub_15F9330(
NULL, field_type, "tmp", current_bb); // createAlloca
// 3b. Create non-null check: %condI = icmp ne <field_ptr>, null
// Opcode 51 = ICmp, predicate 33 = NE (not equal to null)
char cond_name[64];
snprintf(cond_name, sizeof(cond_name), "%s.f%lu.nonnull",
sub_1649960(global), i);
Value *cond = sub_15FED60(
NULL, /*type_id=*/1, /*opcode=*/51, /*pred=*/33,
field_gv, field_offset, cond_name, current_bb); // createICmpNE
// 3c. Create "free_it" block — frees this field if non-null
char free_name[64];
snprintf(free_name, sizeof(free_name), "free_it%lu", i);
BasicBlock *free_it_bb = sub_157FB60(
NULL, NULL, free_name, NULL); // createBBWithPred
// 3d. Create "next" block — fallthrough after conditional free
char next_name[64];
snprintf(next_name, sizeof(next_name), "next%lu", i);
BasicBlock *next_bb = sub_157FB60(
NULL, NULL, next_name, NULL); // createBBWithPred
// 3e. Conditional branch: non-null → free, null → skip
// br i1 %condI, label %free_itI, label %nextI
sub_15F8650(
get_terminator_of(current_bb),
free_it_bb, // true: free
next_bb, // false: skip
cond, fn); // createCondBr
// 3f. In free_it block: wire field into use-def chain, then branch to next
sub_15FDB00(field_gv, get_use_chain(free_it_bb),
i, free_it_bb); // wireDef
// Unconditional branch: free_it → next
sub_15F8590(NULL, next_bb, free_it_bb); // createBr
// 3g. In next block: store field initializer into the new field global
sub_15F9850(field_offset, tmp_alloca, next_bb); // createStoreToField
current_bb = next_bb;
}
// === Step 4: Wire entry into malloc_cont, erase original ===
// Unconditional branch from entry into malloc_cont
sub_15F8590(NULL, malloc_cont_bb, get_entry_bb(fn)); // createBr
// Erase the original global
sub_15F20C0(get_module_entry(module)); // eraseFromParent
}
The generated CFG for a 2-field struct { i32, float }:
entry:
br i1 %tmp1, label %malloc_ret_null, label %malloc_cont
malloc_ret_null:
ret null
malloc_cont:
%cond0 = icmp ne ptr @g.f0, null
br i1 %cond0, label %free_it0, label %next0
free_it0:
; free(@g.f0) — conditional per-field deallocation
br label %next0
next0:
store i32 <init0>, ptr addrspace(1) @g.f0
%cond1 = icmp ne ptr @g.f1, null
br i1 %cond1, label %free_it1, label %next1
free_it1:
; free(@g.f1)
br label %next1
next1:
store float <init1>, ptr addrspace(1) @g.f1
; ... continuation
Each free_it block is conditionally entered only when the field pointer is non-null, preventing double-free on fields that were never successfully allocated. The next blocks store the field initializer after the conditional free, ensuring the field global is properly initialized regardless of whether freeing occurred. This per-field decomposition enables a critical optimization that upstream LLVM cannot perform: if a later pass (dead store elimination, constant propagation) determines that only some fields of the struct are actually used, the unused field globals and their associated free_it/next blocks become dead code and are trivially eliminated by GlobalDCE.
Address-Space-Aware Splitting
The address space preservation logic is woven throughout both the field explosion and the malloc/free decomposition. Every call to sub_15E51E0 (createGlobalVar) passes the extracted address space from the parent global. The extraction point is always the same: (*(uint8_t *)(global + 33) >> 2) & 7. This is critical for three reasons:
- Shared memory splitting: A __shared__ struct global (AS 3) split into per-field globals must keep each field in AS 3. If any field migrated to AS 0 (generic), the hardware would resolve the address at runtime via the generic-to-specific resolution unit, adding 10-20 cycles of latency per access and defeating the purpose of placing data in shared memory.
- Constant memory splitting: A __constant__ struct (AS 4) split into fields must remain in AS 4 to benefit from the constant cache's broadcast capability. A single warp reading the same constant field hits the cache once and broadcasts to all 32 threads. In AS 0 (generic), this broadcast would not occur.
- Pointer size consistency: On some NVPTX targets, pointers in AS 3 (shared) and AS 5 (local) are 32-bit, while AS 0 and AS 1 pointers are 64-bit. The size computation for opaque pointers (tag 0xF) calls sub_15A9520(target, addr_space) -- if the address space were lost during splitting, the pointer size calculation would be wrong, producing incorrect field offsets and corrupted stores.
The per-field null checks in the guard chain also respect address space: the icmp eq with null uses a null pointer of the correct address space width. A 32-bit null in AS 3 is not the same bit pattern as a 64-bit null in AS 1.
Hash Table for Processed Globals
After field explosion and malloc rewrite, the pass uses a custom hash table (open addressing, 32-byte entries) to track which globals and their transitive users have been processed. This is an instance of the NVIDIA-original hash table variant (sentinel pair -8/-16) as documented in the hash infrastructure page.
| Offset | Field | Description |
|---|---|---|
| +0 | key | Pointer to global (sentinel: -8 = empty, -16 = tombstone) |
| +8 | data | Pointer to field-global vector |
| +16 | size | Current vector size |
| +24 | cap | Vector capacity |
Hash function, quadratic probing with triangular numbers, and 75% load factor / 12.5% tombstone compaction thresholds all follow the standard DenseMap infrastructure; see Hash Table and Collection Infrastructure for details.
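Triangular-number probing has the property that, in a power-of-two table, the probe sequence visits every slot before repeating -- which is what makes the 75% load factor safe. A small self-contained demonstration (names illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NSLOTS 16   /* power of two, as DenseMap-style tables require */

/* Triangular-number probing: offsets 0, 1, 3, 6, 10, ... from the home
   slot. Counts how many distinct slots the sequence reaches. */
static uint64_t probe_sequence_coverage(uint64_t start) {
    uint8_t seen[NSLOTS];
    memset(seen, 0, sizeof seen);
    uint64_t slot = start & (NSLOTS - 1), covered = 0;
    for (uint64_t step = 1; covered < NSLOTS; step++) {
        if (!seen[slot]) { seen[slot] = 1; covered++; }
        slot = (slot + step) & (NSLOTS - 1); /* next triangular offset */
        if (step > 2 * NSLOTS) break;        /* safety bound */
    }
    return covered;
}
```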
The processing loop (lines 1710-1812) iterates remaining users of the original global and rewrites them to reference the new field globals:
// sub_18612A0, lines 1710-1812 — Post-SRA user rewriting via hash table
void rewrite_remaining_users(Global *global, FieldVec *fields,
HashTable *table, Module *module,
TargetInfo *target) {
Use *use = *(Use **)(global + 8);
while (use != NULL) {
Use *next_use = *(Use **)(use + 8);
Instruction *inst = sub_1648700(use);
uint8_t opcode = *(uint8_t *)(inst + 16);
if (opcode == 54) { // store
// Walk the store's own use-chain
Use *store_use = *(Use **)(inst + 8);
while (store_use != NULL) {
Use *next_su = *(Use **)(store_use + 8);
// Per-user SRA rewrite: replaces GEP+store/load sequences
// with direct accesses to the appropriate field global
sub_1860BE0(store_use, table, fields, target); // rewriteUserForSRA
store_use = next_su;
}
// If store has no remaining uses, erase it
if (*(Use **)(inst + 8) == NULL) {
sub_15F20C0(inst); // eraseFromParent
// Remove from hash table (mark as tombstone)
HashEntry *entry = sub_1860630(
inst, 0, table, NULL); // lookupInTable
if (entry != NULL)
entry->key = (void *)(-16); // tombstone
}
} else {
// For non-store users (loads, etc.): create direct stores
// to the appropriate field global
for (uint64_t i = 0; i < fields->size; i++) {
uint64_t offset = sub_15A06D0(
get_type(global), i); // computeFieldOffset
sub_15F9660(offset, fields->data[i], inst); // createFieldStore
}
}
use = next_use;
}
}
After all users are rewritten, cleanup proceeds in two phases: first, operand lists of dead GEP (tag 77) and store (tag 54) instructions are unlinked from the use chain (nulling out 24-byte-stride operand slots at lines 2004-2079); second, the dead instructions are erased via sub_15F20C0 at lines 2081-2117. Finally, the original global declaration is erased via sub_15E55B0, and all temporary data structures (hash table backing array, field vectors, scratch buffers) are freed at lines 2119-2161.
Top-Level Driver: sub_18612A0
The complete control flow of the core transform function, integrating all four strategies. This pseudocode corresponds to the entire 2179-line decompilation:
// sub_18612A0 — Core GlobalOpt transform for a single global variable
// Returns: 1 if transformed, 0 if no transformation applied
int globalopt_transform(Global *global, Module *module, Type *type,
int flag, TargetInfo *target, TargetInfo *target2) {
// === Phase 1: Type filter (lines 444-451) ===
uint8_t type_tag = *(uint8_t *)(type + 8);
uint16_t bitmask = 0x8A7E; // bits 1-6, 9, 11, 15 set (struct tag 13 handled below)
if (!((1 << type_tag) & bitmask)) {
// Additional acceptance for struct(13), vector(14), array(16)
if (type_tag == 13 || type_tag == 14 || type_tag == 16) {
if (!sub_16435F0(type, 0)) // isAnalyzableType
return 0;
} else {
return 0;
}
}
// === Phase 2: Use validation — all users must be store/load (lines 452-481) ===
Buffer scratch = { .data = alloca(8 * sizeof(void *)), .size = 0, .capacity = 8 };
Use *use = *(Use **)(global + 8);
while (use != NULL) {
Instruction *inst = sub_1648700(use); // getInstruction
uint8_t opcode = *(uint8_t *)(inst + 16);
if (opcode <= 0x17) return 0; // arithmetic: reject
if (opcode == 54) { // store
if (!sub_185C920(inst, &scratch)) // analyzeStore
return 0;
} else if (opcode != 55) { // not load either
return 0;
}
use = *(Use **)(use + 8);
}
// === Phase 3: Collect store values and evaluate initializer (lines 482-493) ===
Buffer store_buf = { .data = calloc(32, sizeof(void *)), .size = 0, .capacity = 32 };
sub_185C560(module, global, &store_buf); // collectStoreValues
Value *init_val = sub_140B2F0(module, target, global, 1); // evaluateInitializer
// === Phase 4: Try Path A — small-constant promotion (lines 494-805) ===
uint8_t init_tag = *(uint8_t *)(init_val + 16);
if (init_tag == 13) { // struct constant
uint8_t addr_space = (*(uint8_t *)(global + 33) >> 2) & 7;
uint64_t total_bits = compute_total_bits(type, target, addr_space);
uint64_t alignment = sub_15A9FE0(target, type);
uint64_t padded = alignment * ((total_bits + alignment - 1) / alignment);
if (padded <= 0x7FF) { // <= 2047 bits
promote_small_constant(global, module, init_val, type, target);
free(store_buf.data);
return 1;
}
}
// === Phase 5: Try Path B — SRA of struct globals (lines 807-2177) ===
if (flag != 0) { free(store_buf.data); return 0; } // SRA disabled by caller
// Verify unique initializer
if (init_val != sub_15A0680(get_module_sym(module), 1, 0)) {
free(store_buf.data); return 0;
}
// Check struct with 1-16 fields
if (type_tag == 14) type = unwrap_vector(type); // vector peeling
if (*(uint8_t *)(type + 8) != 13) { free(store_buf.data); return 0; }
uint32_t field_count = *(uint32_t *)(type + 12);
if (field_count - 1 > 0xF) { free(store_buf.data); return 0; } // 0 or > 16 fields (unsigned wrap)
// Collect stored values into hash set (lines 823-868)
HashSet store_set;
init_hashset(&store_set);
collect_store_values(global, module, &store_set, &scratch);
// Validate all users reference only this global (lines 878-1017)
if (!validate_global_only_uses(global, &store_set)) {
free(store_buf.data); return 0;
}
// Optional vector type peeling (lines 1026-1083)
if (*(uint8_t *)(type + 8) == 14) {
peel_vector_type(global, module, type, target);
}
// Field explosion (lines 1084-1476)
FieldVec fields = { .data = NULL, .size = 0, .capacity = 0 };
sra_explode_fields(global, module, type, init_val, target, &fields);
// Null/negative guard chain (lines 1478-1535)
Value *guard = build_guard_chain(global, &fields, module, target);
// Malloc/free decomposition (lines 1537-1640)
Function *fn = get_parent_function(global);
decompose_malloc_free(global, module, fn, &fields, guard, target);
// Hash-table-driven user rewriting (lines 1642-2161)
HashTable processed;
init_hashtable(&processed);
rewrite_remaining_users(global, &fields, &processed, module, target);
// Cleanup: unlink dead operands, erase dead instructions
cleanup_dead_instructions(&processed); // lines 2004-2117
// Erase original global and free temporaries
sub_15E55B0(global); // lines 2119-2161
free(fields.data);
free(store_buf.data);
destroy_hashtable(&processed);
destroy_hashset(&store_set);
return 1;
}
LTO Interaction
GlobalOpt benefits significantly from LTO's whole-program visibility. In single-compilation mode, a __device__ global with external linkage cannot be optimized because the compiler cannot prove it is unused by other translation units. With ThinLTO, the NVModuleSummary builder records per-global reference edges, and the ThinLTO importer pulls definitions across module boundaries. After import, GlobalOpt can see all users of a global across the entire program and make decisions that are impossible in per-module compilation:
- Internalization: A global referenced only within one module (after import) can be marked internal (linkage 7), enabling all four transformation paths.
- Dead global elimination: A global with zero users after import is trivially dead and erased. The NVModuleSummary builder's address-space tracking ensures that __device__ globals referenced by kernels are not prematurely killed -- a kernel's reference counts as a use even when no host-side code touches the global.
- Cross-module constant propagation: After import, if a __device__ global is stored exactly once (from a host-side cudaMemcpyToSymbol) and loaded many times across multiple device functions, the single store can be propagated as a constant, unlocking Path A's small-constant promotion.
The pass wrapper sub_196A2B0 is also called from the inliner cost model (the sub_18612A0 address is shared by both -- the inliner calls the GlobalOpt transform function to evaluate whether post-inline global folding would pay for the inline cost). This creates a feedback loop: inlining a caller that references a global may expose the global for optimization, which reduces code size, which makes further inlining cheaper.
Recursion
After completing either Path A or Path B, the pass recursively calls sub_185B1D0 on the newly created replacement globals. This handles cascading opportunities: splitting a struct global into fields may expose one of the field globals for further small-constant promotion (if a field is a small struct itself), or for dead elimination (if one field is never used). The recursion terminates when no further transformations apply -- each recursive call runs the same type filter and use validation, so it will return 0 for leaf scalars or globals with non-store/load users.
Knobs and Thresholds
| Threshold | Value | Source | Effect |
|---|---|---|---|
| Max bits for Path A | 2047 (0x7FF) | Hardcoded | Globals exceeding this fall through to SRA |
| Max struct fields for SRA | 16 | Hardcoded | Structs with >16 fields are not split |
| Hash table load factor | 75% (3/4) | Hardcoded | Triggers rehash of processed-globals table |
| Tombstone threshold | 12.5% (1/8) | Hardcoded | Triggers compacting rehash |
| Initial scratch buffer | 8 entries | Hardcoded | For use analysis; grows via sub_16CC920 |
| Store collection buffer | 32 entries | Hardcoded | For store value collection; grows dynamically |
| SRA disable flag (a4) | Caller-set | Runtime | When set, Path B is bypassed entirely |
| Pipeline gate | opts[1440] | Config array | When set, the sub_196A2B0 wrapper is skipped |
| Optimization tier | >= 2 | Pipeline config | GlobalOpt not run at tier 1 |
The pipeline parser registers "globalopt" at slot 45 in the pass name table, mapping to llvm::GlobalOptPass. The NVIDIA wrapper sub_196A2B0 is gated by the config array at offset 1440 -- when opts[1440] is set, the wrapper skips the pass entirely. At tier 2, GlobalOpt runs unconditionally at pipeline position 30. At tier 3, it runs with the same parameters but benefits from more aggressive SCCP and GlobalDCE having run upstream.
There are no user-facing CLI flags that directly control the 2047-bit threshold or the 16-field SRA limit. These are compile-time constants in the binary. The only external control is the tier-level gate and the opts[1440] kill switch.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
sub_18612A0 | 0x18612A0 | -- | Core transform: type filter, Path A, Path B |
sub_196A2B0 | 0x196A2B0 | -- | Pipeline wrapper (calls core after GlobalDCE) |
sub_185B1D0 | 0x185B1D0 | -- | Recursive re-application to split globals |
sub_185B7E0 | 0x185B7E0 | -- | Pre-SRA setup |
sub_1860410 | 0x1860410 | -- | Hash table rehash |
sub_1860630 | 0x1860630 | -- | Hash table lookup |
sub_1860BE0 | 0x1860BE0 | -- | Per-user SRA rewrite |
sub_185C560 | 0x185C560 | -- | Collect all store values for a global |
sub_185C920 | 0x185C920 | -- | Analyze single store for optimizability |
sub_185CAF0 | 0x185CAF0 | -- | Collect stored value into hash set |
sub_15E51E0 | 0x15E51E0 | -- | Create global variable (88 bytes, with AS) |
sub_15E5070 | 0x15E5070 | -- | Create init function |
sub_164D160 | 0x164D160 | -- | RAUW (Replace All Uses With) |
sub_15F20C0 | 0x15F20C0 | -- | Erase instruction from parent |
sub_15E55B0 | 0x15E55B0 | -- | Erase global declaration |
sub_15A9520 | 0x15A9520 | -- | getPointerSizeInBits(target, addr_space) |
sub_15A9930 | 0x15A9930 | -- | getStructLayout (field offsets) |
sub_15A06D0 | 0x15A06D0 | -- | computeFieldOffset |
sub_1646BA0 | 0x1646BA0 | -- | getStructFieldType |
sub_16435F0 | 0x16435F0 | -- | isAnalyzableType(type, depth) |
sub_140B2F0 | 0x140B2F0 | -- | evaluateInitializer(module, target, ..., 1) |
sub_15FB630 | 0x15FB630 | -- | Create notinit sentinel |
sub_15FB440 | 0x15FB440 | -- | Create binary OR (opcode 27) |
sub_15FEC10 | 0x15FEC10 | -- | Create ICmp instruction |
sub_15F8650 | 0x15F8650 | -- | Create conditional branch |
sub_15F8590 | 0x15F8590 | -- | Create unconditional branch |
sub_157FBF0 | 0x157FBF0 | -- | Create basic block |
sub_15FED60 | 0x15FED60 | -- | Create ICmp NE (opcode 51, predicate 33) |
sub_15F9330 | 0x15F9330 | -- | Create alloca ("tmp" variable in block) |
sub_15FDB00 | 0x15FDB00 | -- | Wire def into use-def chain |
sub_15F9850 | 0x15F9850 | -- | Create store-to-field-global |
sub_157E9C0 | 0x157E9C0 | -- | Create return basic block (null-return) |
sub_157FB60 | 0x157FB60 | -- | Create basic block with predecessor |
sub_15F55D0 | 0x15F55D0 | -- | Grow operand list |
sub_1648700 | 0x1648700 | -- | getInstruction(use) from use-chain |
sub_1649960 | 0x1649960 | -- | getName(global/fn) returns C string |
sub_1648A60 | 0x1648A60 | -- | IRBuilder::create(size, kind) allocates IR node |
sub_15E5530 | 0x15E5530 | -- | Destroy function body |
sub_159D9E0 | 0x159D9E0 | -- | Destroy function |
sub_164BE60 | 0x164BE60 | -- | Drop all references |
sub_1648B90 | 0x1648B90 | -- | Mark dead (flags or-equals 1) |
sub_1631BE0 | 0x1631BE0 | -- | Insert into function list |
sub_15A9FE0 | 0x15A9FE0 | -- | getAlignment(target, type) ABI alignment |
sub_15A0680 | 0x15A0680 | -- | lookupSymbol(module_sym, idx, flags) |
sub_16463B0 | 0x16463B0 | -- | getArrayElementType(ptr, idx) |
sub_159C540 | 0x159C540 | -- | getTerminalType(type) |
sub_1752100 | 0x1752100 | -- | Collect use-def chain |
sub_15E6480 | 0x15E6480 | -- | Copy metadata from global to global |
sub_15F8F80 | 0x15F8F80 | -- | Create extractvalue instruction |
sub_15F9480 | 0x15F9480 | -- | Create store-init (initializer store) |
sub_15F9660 | 0x15F9660 | -- | Create field store (offset + field global) |
sub_15FD590 | 0x15FD590 | -- | Create local variable ("newgv") |
sub_15FEBE0 | 0x15FEBE0 | -- | Create bitcast/GEP for field extraction |
sub_1648780 | 0x1648780 | -- | Replace use with value |
sub_16CC920 | 0x16CC920 | -- | Grow scratch buffer |
sub_16CC9F0 | 0x16CC9F0 | -- | Find in sorted set |
sub_1968390 | 0x1968390 | -- | GlobalDCE / ConstantProp (runs before GlobalOpt) |
Differences from Upstream LLVM GlobalOpt
Stock LLVM's GlobalOptPass (in lib/Transforms/IPO/GlobalOpt.cpp) performs similar high-level transformations: SRA of globals, shrink-to-bool, constant marking, dead global elimination, malloc/free removal, static constructor evaluation, calling convention optimization (fastcc), and alias resolution. The NVIDIA implementation diverges in these concrete ways:
- Internal IR, not LLVM IR. The pass operates on NVIDIA's custom IR node format with 88-byte global nodes, 24-byte operand stride, and type tags at offset +8/+16 of type/instruction nodes. A reimplementation targeting upstream LLVM would use GlobalVariable, StoreInst, LoadInst, and GetElementPtrInst directly.
- 2047-bit constant promotion threshold. LLVM does not have a single bit-count gate for constant promotion. NVIDIA's threshold likely targets the GPU register file: 2047 bits is approximately 64 32-bit registers, close to the per-thread register budget on many SM architectures.
- Per-field malloc decomposition. Stock LLVM's tryToOptimizeStoreOfMallocToGlobal handles malloc/free as a single pair. NVIDIA generates per-field null checks, conditional frees, and continuation blocks -- a more aggressive decomposition.
- Custom hash table. LLVM uses DenseMap/SmallPtrSet. NVIDIA uses a hand-rolled open-addressing hash table with 32-byte entries (see Hash Table and Collection Infrastructure for the hash function and sentinel values).
- Address-space preservation. Every created global explicitly receives the source global's address space. Stock LLVM does not special-case address spaces in GlobalOpt.
- Recursive re-application. After splitting, NVIDIA calls sub_185B1D0 to re-run GlobalOpt on the results. Upstream LLVM relies on the pass manager to schedule re-runs via its invalidation mechanism.
- Inliner integration. The inliner cost model at the same address range calls into GlobalOpt to evaluate post-inline global folding benefit. This tight coupling does not exist in upstream LLVM, where inlining and GlobalOpt are independent passes.
Cross-References
- NVModuleSummary Builder -- builds the global reference edges that determine which globals are live across modules
- Inliner Cost Model -- calls GlobalOpt's transform function to evaluate post-inline global optimization benefit
- ThinLTO Function Import -- imports functions across module boundaries, exposing globals for cross-module optimization
- Alias Analysis & NVVM AA -- address-space-aware alias analysis that informs which memory operations can alias globals in different address spaces
- MemorySpaceOpt -- resolves generic pointers to specific address spaces; runs before GlobalOpt and may expose globals that were previously behind generic pointers
- Pipeline & Ordering -- full pass ordering showing GlobalOpt's position at step 30
- Type Translation, Globals & Special Vars -- how EDG frontend assigns address spaces to global variables during IR generation
- Hash Infrastructure -- hash function, sentinel values, and probing strategy used by the processed-globals table
- Struct Splitting -- the NewPM lower-aggr-copies pass that handles similar aggregate decomposition at a different pipeline stage
- Address Spaces -- complete NVPTX address space reference including pointer sizes and latency characteristics
Whole-Program Devirtualization
CICC v13.0 includes LLVM's WholeProgramDevirtPass at sub_2703170 (13,077 bytes), which replaces indirect virtual calls with direct calls using whole-program type information. On GPU this optimization is far more consequential than on CPU: an indirect call in PTX compiles to a call.uni through a register, which prevents the backend from inlining the callee, forces all live registers across the call boundary into local memory spills, destroys instruction scheduling freedom, and creates a warp-divergence hazard if threads in the same warp resolve the function pointer to different targets. A single devirtualized call site in a hot kernel loop can therefore improve performance by an order of magnitude -- the direct call enables inlining by the inliner cost model, which in turn eliminates .param-space marshaling, enables cross-boundary register allocation, and restores the instruction scheduler's ability to interleave memory and arithmetic operations.
CICC's devirtualization operates in a privileged position: GPU compilation is inherently a closed-world model. Every function that can be called on the device must be visible at link time -- there is no dynamic loading, no shared libraries, and no dlopen on GPU. This means the set of possible implementations for any virtual function is fully known, making single-implementation devirtualization almost always profitable and branch funnels rare. The pass runs as a module-level pass (pipeline parser slot 121, registered as "wholeprogramdevirt") during the LTO phase, after the NVModuleSummary builder has computed type test metadata and before GlobalDCE eliminates dead virtual methods.
| Entry point | sub_2703170 (0x2703170, 13,077 bytes) |
| Address range | 0x2703170--0x2706485 |
| Stack frame | 856 bytes (0x358) |
| Pass name | "wholeprogramdevirt" (pipeline slot 121) |
| Pass type | Module pass |
| Callee-saved | r15, r14, r13, r12, rbx |
| Return value | 1 = module modified, 0 = no changes |
| Remark category | "wholeprogramdevirt" / "Devirtualized" |
| Helper range | sub_2700B00--sub_2708220 (branch funnel helpers, summary I/O) |
The Closed-World GPU Advantage
Upstream LLVM's WholeProgramDevirt is designed primarily for LTO pipelines where some modules may not be visible (ThinLTO import/export split, shared libraries with hidden visibility). The pass must therefore be conservative: it can only devirtualize when !type metadata proves that the vtable set is complete. On GPU, this conservatism is unnecessary. All device code is statically linked into a single fatbinary -- there are no device-side shared libraries, no runtime code loading (the driver JIT compiles PTX, but does not add new device functions), and __device__ virtual functions cannot escape to host code. The entire class hierarchy is visible.
CICC exploits this by running WPD in regular LTO mode (not ThinLTO export/import split), where the pass directly resolves virtual calls against the merged module. The NVModuleSummary builder records type_test metadata for all device vtables, and the pass consumes this metadata to build a complete picture of every virtual call site and every possible target. In practice, GPU programs rarely have deep polymorphic hierarchies in device code (the hardware penalties discourage it), so most virtual call sites resolve to a single implementation.
The Formal Closed-World Argument
The closed-world guarantee on GPU rests on five architectural invariants, each of which eliminates a source of conservatism that forces upstream LLVM to leave calls indirect:
| # | Invariant | What upstream LLVM must worry about | Why GPU is immune |
|---|---|---|---|
| 1 | No device-side shared libraries | A .so loaded at runtime could add a new vtable entry for a class. LTO must mark !vcall_visibility metadata linkage-unit to prove the vtable set is closed within the link unit. | The CUDA driver loads PTX/SASS as a monolithic blob. cuModuleLoad does not support incremental symbol addition. There is no dl_iterate_phdr on device. |
| 2 | No dlopen on device | Host-side dlopen can inject new implementations of virtual functions. Upstream must check !vcall_visibility for translation-unit scope. | Device code has no equivalent of dlopen. The only way to add device code is to recompile and reload the entire module. |
| 3 | No device-side RTTI | dynamic_cast and typeid on host can defeat devirtualization by requiring the vtable to contain RTTI pointers that reference external type_info objects. | CUDA explicitly prohibits dynamic_cast and typeid in __device__ functions. Device vtables contain no RTTI pointers. The NVVM IR verifier (sub_12DD660) rejects code that attempts dynamic_cast in device context. |
| 4 | No exceptions on device | Virtual destructors in exception-handling code create additional vtable entries and __cxa_throw unwinding paths that must be considered. | CUDA does not support exceptions in device code. Virtual destructors are simple (no EH cleanup), and the compiler can see every destructor call site. |
| 5 | Complete link-time visibility | ThinLTO's import/export split means some modules may not be available during WPD. The pass must use summary-based resolution with wholeprogramdevirt-summary-action=import/export. | CICC uses wholeprogramdevirt-summary-action=none (direct resolution on the merged module). All device functions, including those from separate compilation units, are linked by nvlink into a single merged module before the LTO pipeline runs. |
The practical consequence: CICC sets whole-program-visibility effectively to true for all device code. The !vcall_visibility metadata that upstream uses to distinguish "linkage-unit" from "translation-unit" scope becomes irrelevant -- every device vtable is within a single, complete, closed translation unit.
How NVModuleSummary Feeds WPD
The NVModuleSummary builder at sub_D7D4E0 (2,571 decompiled lines, 74KB) produces the type metadata that WPD consumes. The interaction is:
-
NVModuleSummary walks every
GlobalValuein the module (linked list atModule+72). For each function (opcode0x3D), it extracts attribute groups #34 (reference edges with type metadata) and #35 (direct call targets) viasub_B91C10. -
For reference edges with type info (attribute #34), the builder decodes MDNode operands (lines 1193-1228 of the decompilation): each parameter position >= 2 yields a type node (opcode range 5-36), walked to a parent MDTuple (opcode 17) containing the type name string at offset 24 (indirect through pointer if length > 64).
-
These type-metadata edges are packed into the
FunctionSummaryrecord bysub_D77220as thev378(type-checked references) argument. The resulting metadata lands in the module asllvm.type.test/type_test_assumenamed metadata nodes. -
WPD reads these nodes back via
sub_B6AC80(module, 0x166)at its entry point, completing the producer-consumer chain.
DevirtSCCRepeatedPass: The Outer Loop
WPD at the module level is one of two devirtualization mechanisms. The other operates at CGSCC granularity: DevirtSCCRepeatedPass at sub_2284BC0 (16KB) wraps the CGSCC pipeline in a fixed-point iteration loop that re-runs until no new devirtualization opportunities are discovered or a maximum iteration count is reached. On reaching the limit, the pass emits "Max devirtualization iterations reached". The abort-on-max-devirt-iterations-reached knob (registered at constructor 378) controls whether this is a fatal error or a warning. The iteration count at O1-O3 is 1; at tier 3 (maximum optimization) it is 5, giving the inliner and devirtualizer multiple rounds to discover indirect-to-direct call conversions that expose further inlining opportunities.
The two mechanisms are complementary: module-level WPD resolves virtual calls using global type hierarchy information (vtable metadata), while CGSCC-level devirtualization catches cases where inlining reveals new constant function pointers that can be resolved without type metadata.
Algorithm
The pass executes in seven phases:
Phase 1: Metadata Extraction (0x2703170--0x27031CA)
The entry point fetches four named metadata nodes from the module using sub_B6AC80 (getNamedMetadata):
| Enum ID | Metadata Node | Purpose |
|---|---|---|
0x166 (358) | llvm.type.test / type_test_assume | Records of @llvm.assume(@llvm.type.test(%ptr, %typeID)) intrinsic results |
0x164 (356) | llvm.type.checked.load | Call sites using type-checked vtable loads |
0x165 (357) | llvm.type.checked.load.relative | Relative vtable pointer variant (compact vtables) |
0x0B (11) | Module-level type metadata | Type summaries describing vtable layouts |
If neither type_test_assume nor module-level type metadata is present, the pass checks for type_checked_load and type_checked_load_relative as fallbacks. If none exist, the pass returns 0 immediately.
The assembly sequence at the entry point:
; 0x2703170: entry
mov esi, 0x166 ; enum ID = 358 (type_test_assume)
call sub_B6AC80 ; rbx = getNamedMetadata(module, 0x166)
mov esi, 0x164 ; enum ID = 356 (type_checked_load)
call sub_B6AC80 ; r13 = getNamedMetadata(module, 0x164)
mov esi, 0x165 ; enum ID = 357 (type_checked_load_relative)
call sub_B6AC80 ; [rbp-0x338] = result
mov esi, 0x0B ; enum ID = 11 (module-level type metadata)
call sub_B6AC80 ; r12 = result
Phase 2: Type Test Record Iteration (0x2703296--0x2703383)
Type test records are stored in an array at offset +0xA0 of the metadata state, with count at +0xA8. Each record is 144 bytes (0x90):
struct TypeTestRecord { // 0x90 = 144 bytes per record
uint8_t *type_value; // +0x00: pointer to type test value
// ... call site references, metadata links ...
};
// Iteration pattern at 0x2703296:
TypeTestRecord *base = state->records; // [state + 0xA0]
uint32_t count = state->record_count; // [state + 0xA8]
TypeTestRecord *end = base + count; // stride = 0x90
// Address computation in binary:
// lea rax, [rax+rax*8] ; count * 9
// shl rax, 4 ; count * 144 = count * 0x90
// add rax, rdx ; end pointer
for (TypeTestRecord *rec = base; rec != end; rec++) {
if (rec->type_value[0] != 0) continue; // skip already-processed
// ... look up type in hierarchy ...
}
For each record whose type byte is 0 (unprocessed), the pass computes a string hash of the type name via sub_B91420 (get type name) and sub_B2F650 (string hash), then looks up the type in a red-black tree rooted at offset +0xE0 of the module state.
Phase 3: Hash Table Construction (0x2703589--0x2703AE2)
Unique type test values are tracked in an open-addressed hash table with 56-byte entries. The hash function combines bit-shifted fields to reduce clustering:
uint32_t hash(uint32_t val, uint32_t mask) {
return ((val >> 4) ^ (val >> 9)) & mask;
}
The table uses power-of-2 sizing with LLVM-layer sentinels (empty = 0xFFFFFFFFE000, deleted = 0xFFFFFFFFF000). See Hash Table and Collection Infrastructure for the probing and growth policy.
Each 56-byte hash table entry stores:
| Offset | Size | Field |
|---|---|---|
+0x00 | 8 | Type test value (key) |
+0x08 | 8 | Flags / padding |
+0x10 | 8 | Type info pointer |
+0x18 | 8 | Associated data (resolution result) |
+0x20 | 8 | Red-black tree node (self-referential on init) |
+0x28 | 8 | Link pointer |
+0x30 | 8 | Count / size |
Slot addressing uses the identity slot_index * 7 * 8 = slot_index * 56:
; At 0x27035A0:
lea rdx, ds:0[rsi*8] ; rsi = slot index, rdx = slot*8
sub rdx, rsi ; rdx = slot*8 - slot = slot*7
mov rsi, [rdi+rdx*8] ; load from table base + slot*56
Table growth is handled by sub_2702540, which reallocates and rehashes all entries using the same (val >> 4) ^ (val >> 9) function against the new mask. Entry initialization at 0x2703A33:
; Insert new entry:
add [rbp-0x2D0], 1 ; increment unique type count
call sub_2702540 ; grow table if needed (returns new entry ptr in rax)
mov dword [rax+10h], 0 ; clear type info
mov qword [rax+18h], 0 ; clear data
mov [rax], rdx ; store type test value
lea rdx, [rax+10h]
mov [rax+20h], rdx ; self-referential link (RB tree node init)
mov [rax+28h], rdx ; self-referential link
mov qword [rax+30h], 0 ; zero count
Phase 4: Type Hierarchy Lookup via Red-Black Tree (0x27032F7--0x2703362, 0x2704183--0x2704267)
For each unique type, the pass searches a red-black tree keyed by hashed type name. The tree is rooted at offset +0xE0 of the module state, with the sentinel node at +0xD8. The search is a two-phase process with a three-field comparison:
Phase 4a: Compute Type Name Hash
// At 0x27032F7:
char *name = sub_B91420(type_value); // returns (name_ptr, name_len)
uint64_t hash = sub_B2F650(name, len); // string hash
// Tree root and sentinel:
RBTreeNode *root = module_state[+0xE0]; // root pointer
RBTreeNode *sentinel = module_state + 0xD8; // sentinel node address
sub_B2F650 (stringHash) is LLVM's standard xxHash-style string hasher. It produces a 64-bit hash that is stored at node[+0x20] for each type in the tree.
Phase 4b: Descend Tree by Hash
// At 0x270330C:
RBTreeNode *current = root;
RBTreeNode *best = sentinel; // rcx = sentinel initially
while (current != NULL) {
uint64_t node_hash = current[+0x20]; // hash stored in node
if (target_hash < node_hash) {
best = current; // track nearest greater
current = current[+0x10]; // left child
} else if (target_hash > node_hash) {
current = current[+0x18]; // right child
} else {
// hash matches -- proceed to Phase 4c
break;
}
}
if (current == NULL) goto not_found;
The binary encodes this as:
compare_node:
cmp rsi, [r15+20h] ; compare target hash vs node hash
ja go_right ; target > node -> right child
jnb hash_match ; target == node -> verify
mov rcx, r15 ; track best (left-leaning)
mov r15, [r15+10h] ; r15 = left child
test r15, r15
jnz compare_node
jmp not_found
go_right:
mov r15, [r15+18h] ; r15 = right child
test r15, r15
jnz compare_node
jmp not_found
Phase 4c: Verify Full Match (Hash Collision Resolution)
On hash match, the pass performs a two-step verification to handle collisions:
// At 0x2704200:
// Step 1: compare string lengths
if (current[+0x30] != target_length) {
// Length mismatch -- this is a hash collision, not a real match.
// Continue tree traversal to the next candidate.
goto next_candidate;
}
// Step 2: compare actual type name strings
char *node_name = (char *)current[+0x28]; // node's type name data
char *target_name = target_string; // from sub_B91420
int cmp = memcmp(node_name, target_name, target_length);
if (cmp != 0) goto next_candidate;
// Verified match -- read vtable data
The binary at 0x2704200--0x2704240:
cmp r12, [r15+30h] ; compare string length
jnz next_candidate ; length mismatch
mov rdi, [r15+28h] ; s1 = node's string data
mov rsi, [rbp-0x348] ; s2 = target string data
mov rdx, r12 ; n = length
call _memcmp
test eax, eax
jz found_match
Phase 4d: Extract Vtable Data
After verifying the type match, the pass reads the vtable descriptor from the type node:
// At 0x2704248:
void *vtable_start = current[+0x68]; // vtable start address
void *vtable_data = current[+0x70]; // vtable data pointer (function pointers)
if (vtable_data == NULL) goto skip_type; // no vtable -> nothing to devirtualize
The vtable_data pointer leads to an array of function pointers representing the virtual method implementations for this type. The pass iterates this array comparing each entry against call site signatures to identify devirtualization candidates.
Phase 5: Virtual Call Resolution (0x2703974--0x27039BA)
For each call site on a matched type, the pass calls sub_26FEE10 (resolveVirtualCall):
bool resolveVirtualCall(
void *module_state, // rdi: r15 (module/pass state)
void *target_candidates, // rsi: candidates vector from [rbp-0x230]
void *hash_entry, // rdx: r12 (pointer to hash table entry + 8)
uint32_t candidate_count, // rcx: from [rbp-0x228]
void *call_site_info // r8: r13 (call site from [r15+0x28])
);
// Returns: al = 1 if unique resolution found, 0 otherwise
The resolution algorithm within sub_26FEE10 works by comparing the vtable offset encoded in each call site's llvm.type.test / llvm.type.checked.load intrinsic against the vtable slot offsets of all candidate implementations. When exactly one candidate matches, the resolution succeeds with strategy 1 (direct call). When multiple candidates exist but all return the same constant or can be distinguished by a single offset, strategy 2 (unique member) is chosen. When multiple distinct targets exist, strategy 3 (branch funnel) is produced.
The resolution result is written to hash_entry[+0x28] as a strategy selector:
| Value | Strategy | Upstream LLVM counter |
|---|---|---|
| 1 | Direct call (single implementation) | NumSingleImpl |
| 2 | Unique member dispatch | NumUniformRetVal / NumUniqueRetVal |
| 3 | Branch funnel | NumBranchFunnel |
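The three-way outcome can be modeled as a small selector. This is a hypothetical reconstruction of the decision logic described above, not decompiled code; the candidate record and its field are invented for illustration:

```cpp
#include <cassert>
#include <vector>

enum Strategy { kIndirect = 0, kDirectCall = 1, kUniqueMember = 2, kBranchFunnel = 3 };

struct Candidate { int vtable_slot; };   // illustrative candidate record

// Hypothetical model of the selector value written to hash_entry[+0x28].
Strategy selectStrategy(const std::vector<Candidate> &cands,
                        unsigned funnel_threshold = 10) {
    if (cands.empty()) return kIndirect;
    if (cands.size() == 1) return kDirectCall;        // NumSingleImpl
    // all candidates distinguishable by a single offset -> unique member
    bool single_offset = true;
    for (const Candidate &c : cands)
        if (c.vtable_slot != cands[0].vtable_slot) single_offset = false;
    if (single_offset) return kUniqueMember;          // NumUniformRetVal-style
    if (cands.size() <= funnel_threshold) return kBranchFunnel;
    return kIndirect;                                 // left as an indirect call
}
```

The threshold default of 10 mirrors wholeprogramdevirt-branch-funnel-threshold (see the Knobs table).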
Before calling sub_26FEE10, the pass checks two preconditions:
// At 0x2703974:
void *call_site_list = module_state[+0x28]; // r13 = [r15+0x28]
if (call_site_list == NULL) goto skip;
if (type_value[0] != 0) goto skip; // byte check: direct type info only
void *existing = hash_entry[+0x28];
if (existing != 0) goto already_resolved; // skip if previously resolved
Phase 6: Strategy Application (0x2703BA3--0x27046F0)
Strategy 1 -- Direct Call Replacement (0x27044DA)
When only one class implements the virtual function (the common case on GPU), the indirect call is replaced with a direct call to the resolved function. This is handled by sub_26F9AB0 (rewriteCallToDirectCall):
// At 0x27044DA:
void rewriteCallToDirectCall(
void *type_entry, // rdi: r12
void *call_site, // rsi: [r15+0x38]
uint64_t vtable_data, // rdx: byte_3F871B3 (vtable offset data)
uint32_t flags, // ecx: 0
void *resolved_function // r8: [rbx+0x40]
);
This is the simplest and most common optimization: the call.reg becomes call.direct, enabling downstream inlining. On GPU this is by far the dominant strategy. Consider a CUDA kernel with a virtual method call inside a loop:
; Before devirtualization (PTX):
ld.global.u64 %rd1, [%rd0]; // load vtable ptr
ld.global.u64 %rd2, [%rd1+16]; // load function ptr at vtable slot 2
call.uni %rd2, (%args); // indirect call -- full scheduling barrier
; After devirtualization (PTX):
call.uni _ZN7Derived4workEv, (%args); // direct call -- inlinable
The direct call then becomes an inlining candidate with CICC's 20,000-unit budget (89x the upstream LLVM default of 225), and the inliner typically eliminates it entirely, producing fully-inlined code with no call overhead.
Strategy 2 -- Unique Member Dispatch (0x27045C9)
When multiple classes exist but the call can be dispatched through a unique member offset, the pass rewrites via sub_26F9080 (rewriteToUniqueMember), passing the diagnostic string "unique_member" (13 chars). The member offset is read from hash_entry[+0x60] and the base type from hash_entry[+0x00].
; At 0x27045D9:
mov r11, [r12] ; type info (base type)
mov rsi, [r12+60h] ; member offset
lea rax, "unique_member" ; diagnostic string (13 chars)
call sub_26F9080 ; rewriteToUniqueMember
; rdx = r14 (type test record)
; rcx = r13 (call site)
; r9 = rdi (vtable byte offset / 8)
; "unique_member" + length 0x0D pushed on stack
After the initial rewrite, sub_26FAF90 performs call-site-specific fixup, checking [rbx+0x40] to determine if additional adjustment is needed (e.g., adjusting this pointer offset for multiple inheritance).
Upstream LLVM's equivalent covers two sub-strategies: uniform return value optimization (all implementations return the same constant -- replace the call with that constant) and unique return value optimization (for i1 returns, compare the vptr against the one vtable that returns a different value). Both are folded under the "unique_member" label in CICC's implementation.
Strategy 3 -- Branch Funnel (0x27043B5)
When multiple possible targets exist and cannot be reduced to a single dispatch, the pass creates a branch funnel -- a compact conditional dispatch sequence that checks the vtable pointer and branches to the correct target. This is handled by three functions:
- sub_26F78E0 -- create branch funnel metadata (with diagnostic string "branch_funnel", 13 chars)
- sub_BCF480 -- build the conditional dispatch structure
- sub_BA8C10 -- emit the indirect branch sequence
; At 0x27043B5:
mov r12, [rbx] ; vtable pointer
mov rdi, [r12] ; function pointer from vtable
call sub_BCB120 ; get function declaration
; At 0x27043D3:
lea rax, "branch_funnel" ; 13 chars at 0x42BCB92
call sub_26F78E0 ; create branch funnel metadata
call sub_BCF480 ; build dispatch structure
call sub_BA8C10 ; emit indirect branch sequence
The branch funnel supports two dispatch granularities:
| Granularity | String | Function | Description |
|---|---|---|---|
| Byte | "byte" (4 chars, at 0x3F8C256) | sub_26F9120 | Check byte offset into vtable to select target |
| Bit | "bit" (3 chars, at 0x43ADFE0+0xE) | sub_26F9120 | Check bit offset for single-bit discrimination |
The emission sequence at 0x270450C--0x27045BF:
; Byte-granularity dispatch:
lea rcx, "byte" ; at 0x3F8C256
mov [rbp-0x318], 4 ; string length
call sub_26F9120 ; emit byte-offset check
; Bit-granularity dispatch:
lea rbx, "bit" ; at 0x43ADFE0 + 0xE
mov [rbp-0x328], 3 ; string length
call sub_26F9120 ; emit bit-offset check
; Finalize:
call sub_26FB610 ; r8=byte_result, r9=bit_result
; rdi=r12, rdx=byte_3F871B3
The finalization call sub_26FB610 receives both byte and bit results and produces the final dispatch sequence. On GPU, branch funnels are rare because device code hierarchies are typically shallow, but the infrastructure exists for cases like thrust/CUB polymorphic iterators.
Upstream LLVM gates branch funnels behind the wholeprogramdevirt-branch-funnel-threshold knob (default: 10 targets per call site). CICC inherits this threshold.
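Conceptually, the emitted funnel replaces one opaque indirect call with a chain of vtable-pointer comparisons feeding direct, inlinable calls. A hand-written C++ analogue (types and names invented for illustration, not the pass's actual output):

```cpp
#include <cassert>

int implA() { return 1; }          // e.g. A::work
int implB() { return 2; }          // e.g. B::work

using Fn = int (*)();
const Fn vtblA[1] = { implA };     // candidate vtables, known at compile time
const Fn vtblB[1] = { implB };

// Indirect form would be `vptr[0]()` -- a scheduling barrier, never inlinable.
// The funnel compares against each known vtable and branches to a direct call.
int funnelDispatch(const Fn *vptr) {
    if (vptr == vtblA) return implA();   // direct call, inlinable
    if (vptr == vtblB) return implB();   // direct call, inlinable
    return vptr[0]();                    // fallback: genuine indirect call
}
```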
Phase 7: Cleanup (0x2704144--0x270342E)
After processing all types, the pass performs four cleanup operations:
- Function attribute cleanup (0x2704144): iterates the module's function list (red-black tree at [rax+10h]), calling sub_B98000 with parameter 0x1C (attribute cleanup enum) on each function.
- Import list cleanup (0x270416C): processes entries at module[+0x110..+0x118], calling sub_B43D60 to release function metadata for imported declarations.
- Type hierarchy destruction: sub_26F92C0 releases all type hierarchy data structures.
- Hash table deallocation (0x27033C3): iterates all non-sentinel entries, calls sub_26F75B0 to release per-entry resolution data, then sub_C7D6A0 to free the table buffer. Type test result vectors (0x70-byte elements with sub-vectors at offsets +0x10, +0x28, +0x40, +0x58) are freed element by element.
Hash table cleanup detail:
// At 0x27033C3:
uint32_t count = hash_table_entry_count; // [rbp-0x2B8]
if (count == 0) goto skip_cleanup;
void *base = hash_table_base; // [rbp-0x2C8]
void *end = base + count * 56; // count * 7 * 8
for (void *entry = base; entry < end; entry += 56) {
uint64_t key = *(uint64_t *)entry;
if (key == 0xFFFFFFFFE000) continue; // empty sentinel
if (key == 0xFFFFFFFFF000) continue; // deleted sentinel
sub_26F75B0(entry[+0x18]); // release resolution data
}
sub_C7D6A0(base); // free table buffer
GPU-Specific Constraints
Virtual Functions in Device Code
CUDA allows __device__ virtual functions, but with restrictions that simplify devirtualization:
- No RTTI on device. There is no typeid or dynamic_cast on GPU. This means vtable layouts do not contain RTTI pointers, simplifying vtable reconstruction. The NVVM IR verifier rejects code that attempts dynamic_cast in device context.
- No exceptions on device. Virtual destructors do not need to handle __cxa_throw unwinding paths.
- Closed world. No device-side shared libraries, no dlopen, no runtime code generation. All virtual targets are known at compile time.
- No separate compilation for virtual dispatch. Device linking (nvlink) resolves all symbols before PTX emission, so the merged module always has complete type information.
- Simplified vtable layout. Without RTTI pointers and exception tables, device vtables are a flat array of function pointers at known offsets. This makes vtable slot arithmetic straightforward for the WPD pass.
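The common case these constraints enable looks like the following (shown as host C++ for brevity; in CUDA the methods would carry __device__). With a closed world and exactly one override visible in the merged module, WPD can resolve the call at compile time:

```cpp
#include <cassert>

struct Shape {
    virtual float area() const { return 0.0f; }
    virtual ~Shape() = default;
};

// The only concrete implementation visible in the merged module.
struct Square final : Shape {
    float s;
    explicit Square(float s) : s(s) {}
    float area() const override { return s * s; }
};

// With a single implementation, the indirect call through the vtable is
// replaced by a direct call to Square::area, which the inliner then absorbs.
float totalArea(const Shape *sh) { return sh->area(); }
```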
Cost of Unresolved Indirect Calls
If devirtualization fails, the PTX backend must emit a call.uni or call through a register. This has several penalties:
- No inlining. The callee is unknown, so the inliner cannot evaluate it.
- Full .param marshaling. Every argument must be written to .param space; no copy elision is possible. The call ABI (opcodes 510-513: CallDirect, CallDirectNoProto, CallIndirect, CallIndirectNoProto) forces .param-space round-tripping.
- Register pressure spike. All live registers across the call must be spilled to local memory (device DRAM, ~400 cycle latency on SM 70-90).
- Scheduling barrier. The call is a full fence for instruction scheduling -- no operations can be reordered across it.
- Divergence hazard. If different threads in a warp resolve the pointer to different functions, execution serializes both paths. In the worst case (32 different targets), this is a 32x slowdown.
- Occupancy reduction. The register spills increase per-thread local memory usage, reducing occupancy and thus hiding less memory latency.
This is why CICC's default inlining budget of 20,000 (89x the upstream LLVM default) makes sense in combination with aggressive devirtualization: the pass converts expensive indirect calls into direct calls, and the inliner then eliminates them entirely.
Relationship to LowerTypeTests
The LowerTypeTests pass (sub_188C730, 96,984 bytes at 0x188C730; also sub_2638ED0 at 70KB) is the other half of the type-test infrastructure. While WPD consumes type test metadata to resolve virtual calls, LowerTypeTests produces the runtime type-checking implementation. The interaction:
| Pass | Role | When |
|---|---|---|
| NVModuleSummary (sub_D7D4E0) | Produces type metadata in function summaries | During summary construction |
| WholeProgramDevirt (sub_2703170) | Consumes type metadata, resolves virtual calls | LTO phase, after summary, before GlobalDCE |
| LowerTypeTests (sub_188C730) | Lowers remaining @llvm.type.test intrinsics to runtime bit tests | After WPD, if CFI is active |
On GPU, LowerTypeTests is largely dead code -- CUDA does not use Control-Flow Integrity (CFI), and WPD resolves most type tests statically. The sweep at 0x1880000 confirms: "WPD/CFI/LowerTypeTests cluster is also upstream-only; CUDA does not use CFI or type-based devirtualization" in the sense of runtime CFI checks. The type metadata is consumed entirely by WPD's compile-time resolution.
LowerTypeTests validates its input with: "Second argument of llvm.type.test must be metadata" and "Second argument of llvm.type.test must be a metadata string". These error paths are unreachable in normal CUDA compilation but exist because CICC links the full upstream LLVM IPO library.
Optimization Remarks
When a call site is successfully devirtualized, the pass emits an optimization remark through the diagnostic handler. The remark is constructed at 0x2703EDA using three components:
| Component | String | Address |
|---|---|---|
| Remark name | "Devirtualized" (13 chars) | 0x42BCBEE |
| Pass name | "wholeprogramdevirt" (18 chars) | 0x42BC950 |
| Body prefix | "devirtualized " (14 chars) | 0x42BCBE2 |
| Attribute key | "FunctionName" (12 chars) | 0x42BC980 |
The remark construction sequence:
// At 0x2703EDA:
sub_B17560(&remark, "Devirtualized", 13, "wholeprogramdevirt", 18);
sub_B18290(&remark, "devirtualized ", 14); // append body
sub_B16430(&remark, "FunctionName", 12); // create named attribute
sub_26F69E0(&remark, resolved_function); // attach target name
sub_B180C0(&remark); // finalize
sub_1049740(diag_handler, &remark); // publish to handler
The remark is visible via -Rpass=wholeprogramdevirt and includes the name of the resolved target function (obtained from the function's name metadata or via sub_26F69E0 for unnamed functions).
After remark emission, extensive cleanup of small-string-optimized (SSO) std::string objects is performed -- each remark component checks if the string buffer was heap-allocated (compare pointer vs stack buffer address) and frees if necessary.
Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
| wholeprogramdevirt-branch-funnel-threshold | unsigned | 10 | Maximum number of call targets per call site for branch funnel emission. Beyond this threshold, the call site is left indirect. |
| whole-program-visibility | bool | false | Force enable whole-program visibility even without !vcall_visibility metadata. On GPU this is effectively always true. |
| disable-whole-program-visibility | bool | false | Force disable whole-program visibility for debugging. |
| wholeprogramdevirt-summary-action | enum | none | Controls summary interaction: none, import, export. CICC uses none (direct resolution on merged module). |
| wholeprogramdevirt-read-summary | string | empty | Read type resolutions from a bitcode/YAML file. |
| wholeprogramdevirt-write-summary | string | empty | Write type resolutions to a bitcode/YAML file. |
| wholeprogramdevirt-skip | string list | empty | Comma-separated list of function names to exclude from devirtualization. |
| wholeprogramdevirt-check | enum | none | Runtime checking mode: none, trap (abort on incorrect devirt), fallback (fall back to indirect call). |
| wholeprogramdevirt-keep-unreachable-function | bool | true | Keep unreachable functions as possible devirt targets (conservative default). |
| wholeprogramdevirt-print-index-based | bool | false | Print index-based devirtualization messages for debugging. |
| wholeprogramdevirt-cutoff | signed | -1 | Maximum number of devirtualization actions to perform. -1 = unlimited. Useful for bisecting devirtualization-induced miscompiles. |
| abort-on-max-devirt-iterations-reached | bool | false | When DevirtSCCRepeatedPass at sub_2284BC0 hits its iteration limit, abort instead of warning. Registered at constructor 378. |
Complexity
| Operation | Complexity | Notes |
|---|---|---|
| Hash table insert/lookup | O(1) amortized, O(n) worst case | Linear probing with sentinel-based open addressing |
| Type hierarchy lookup | O(log n) | Red-black tree keyed by type name hash, with memcmp verification |
| Per-type call resolution | O(call_sites * candidates) | For each type, check every call site against every candidate target |
| Branch funnel emission | O(vtable_entries) per site | Linear in number of possible targets |
| String hash (sub_B2F650) | O(name_length) | One-pass hash of the type name string |
| Total pass | O(T * S * C * log T) | T = types, S = call sites per type, C = candidates. Typically sparse on GPU. |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| WholeProgramDevirtPass::run | sub_2703170 | 13,077 | Pass entry point |
| buildTypeTestInfo | sub_2702830 | ~2,600 | Build type test records from metadata |
| growHashTable | sub_2702540 | ~740 | Grow and rehash the type test hash table |
| resolveVirtualCall | sub_26FEE10 | ~3,200 | Attempt single-target resolution for a call site |
| rewriteCallToDirectCall | sub_26F9AB0 | ~1,600 | Strategy 1: replace indirect call with direct call |
| rewriteToUniqueMember | sub_26F9080 | ~640 | Strategy 2: unique member dispatch rewrite |
| finalizeUniqueMember | sub_26FAF90 | ~1,700 | Strategy 2: call-site-specific fixup |
| createBranchFunnelMeta | sub_26F78E0 | ~1,100 | Strategy 3: create branch funnel metadata |
| buildBranchFunnel | sub_BCF480 | ~6,400 | Strategy 3: build conditional dispatch structure |
| emitIndirectBranch | sub_BA8C10 | ~8,200 | Strategy 3: emit indirect branch sequence |
| emitDispatchCheck | sub_26F9120 | ~500 | Branch funnel byte/bit offset check |
| finalizeBranchFunnel | sub_26FB610 | ~1,800 | Branch funnel finalization |
| destroyTypeHierarchy | sub_26F92C0 | ~400 | Release type hierarchy data structures |
| releaseResolutionData | sub_26F75B0 | ~300 | Free per-entry resolution data |
| attachFunctionName | sub_26F69E0 | ~240 | Attach function name to optimization remark |
| branchFunnelHelper | sub_2700B00 | ~9,800 | Branch funnel main helper (called from sub_2703170) |
| summaryIO | sub_2706490 | ~7,600 | WPD summary read/write (-wholeprogramdevirt-read-summary) |
| DevirtSCCRepeatedPass::run | sub_2284BC0 | 16,000 | CGSCC devirtualization iteration loop |
| getNamedMetadata | sub_B6AC80 | ~200 | Fetch named metadata node from module |
| getTypeInfoName | sub_B91420 | ~300 | Compute type info name string |
| stringHash | sub_B2F650 | ~180 | Hash a type name string (xxHash-style) |
| createRemarkHeader | sub_B17560 | ~250 | Create optimization remark header |
| appendRemarkBody | sub_B18290 | ~200 | Append body text to remark |
| createNamedAttribute | sub_B16430 | ~200 | Create named attribute for remark |
| publishRemark | sub_1049740 | ~100 | Publish remark to diagnostic handler |
Cross-References
- NVModuleSummary Builder -- produces the type_test metadata consumed by this pass; records devirtualization-relevant type GUIDs in per-function summaries via sub_D7D4E0.
- Inliner Cost Model -- devirtualized direct calls become inlining candidates with a 20,000-unit budget; the entire value of devirtualization on GPU depends on the inliner subsequently eliminating the call.
- ThinLTO Function Import -- in ThinLTO mode the pass would operate in export/import phases, but CICC primarily uses regular LTO for device code.
- Pipeline & Ordering -- WPD is registered at pipeline parser slot 121 as a module pass; it runs during the LTO phase after summary construction and before GlobalDCE.
- NVPTX Call ABI -- describes the .param-space calling convention that makes indirect calls so expensive (opcodes 510-513: CallDirect, CallDirectNoProto, CallIndirect, CallIndirectNoProto).
- LazyCallGraph & CGSCC -- devirtualization converts ref edges to call edges in the call graph, triggering SCC re-computation via switchInternalEdgeToCall. The DevirtSCCRepeatedPass at sub_2284BC0 wraps the CGSCC pipeline in a fixed-point loop.
- GPU Execution Model -- explains why indirect calls are so expensive on GPU (warp divergence, scheduling barriers, register spilling to local memory).
- Hash Infrastructure -- the type test hash table uses the same sentinel-based open-addressing pattern as CICC's universal DenseMap infrastructure.
GPU Execution Model
This page is the single authoritative reference for the GPU hardware properties that drive cicc's optimization decisions. Every other wiki page that mentions register pressure, occupancy cliffs, memory coalescing, warp divergence, or the .param calling convention should cross-reference this page rather than re-explaining the concepts inline. The page exists because these properties shape literally every pass in the compiler, from SROA (which exists to avoid .local memory) through register allocation (which trades register count for occupancy) to LTO inlining (which eliminates .param marshaling). Understanding the execution model is a prerequisite for understanding any cicc optimization decision that differs from upstream LLVM.
The material below describes the hardware model as cicc sees it -- the properties that are visible in the binary through TTI hooks, threshold constants, cost model comparisons, and diagnostic strings. Where specific numbers vary by SM generation, the sm_70+ (Volta through Blackwell) values are given unless otherwise noted.
SIMT Warp Execution
NVIDIA GPUs execute threads in groups of 32 called warps. All 32 threads in a warp share a single program counter under the SIMT (Single Instruction, Multiple Threads) model. The hardware issues one instruction per clock to all 32 threads simultaneously -- there is no per-thread instruction decode, fetch, or issue overhead. Each thread has its own register state and can execute a different data path, but they all advance through the program in lockstep.
This is not SIMD in the CPU sense. On a CPU with AVX-512, the programmer (or compiler) explicitly packs 16 floats into a vector register and issues a single vector instruction. On a GPU, the programmer writes scalar code for one thread, and the hardware transparently replicates it across 32 threads. The distinction matters for cicc because vectorization on GPU does not fill SIMD lanes -- it produces wide loads (ld.v2, ld.v4) within a single thread's scalar stream to improve memory transaction width and reduce instruction count. The VF returned by TTI::getRegisterBitWidth(Vector) is 32 bits (one scalar register), not 512 or 1024.
Divergence
When a branch condition evaluates differently across threads in a warp, the hardware serializes both paths. First the "taken" subset executes while the others are masked off; then the "not-taken" subset executes. The warp reconverges at a point determined by the hardware's reconvergence stack (pre-Volta) or independent thread scheduling (Volta+). Both paths execute regardless of how many threads take each side, so a divergent branch in a hot loop can halve throughput even if only one thread disagrees.
Divergence is the primary reason cicc includes the StructurizeCFG pass (which converts irreducible control flow to reducible form), the CSSA pass (which repairs SSA across divergent join points), the Loop Index Split pass (which eliminates index-dependent branches that cause per-iteration divergence), and the Branch Distribution pass (which separates uniform from divergent computation).
The constant warpSize = 32 is hardcoded in cicc's SCEV range analysis (intrinsic ID ~370, range [32, 33)) and is the architectural constant behind every power-of-two factor enforcement in the loop unroller and loop vectorizer.
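The serialization rule can be stated as a tiny cost model: each side of a branch executes if at least one of the 32 threads takes it. A simplified sketch that ignores reconvergence overhead (names and structure are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// taken_mask: bit i is set if thread i of the warp takes the "then" side.
// Both sides execute under a mask whenever the warp is not uniform.
unsigned warpBranchCycles(uint32_t taken_mask,
                          unsigned then_cycles, unsigned else_cycles) {
    unsigned cycles = 0;
    if (taken_mask != 0)           cycles += then_cycles;  // someone takes "then"
    if (taken_mask != 0xFFFFFFFFu) cycles += else_cycles;  // someone takes "else"
    return cycles;
}
```

A warp with even one disagreeing thread pays for both paths, which is the asymmetry the divergence-related passes are designed to avoid.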
Register Pressure and Occupancy
The register file is the single most constrained resource on an NVIDIA GPU and the single most important factor in cicc's optimization heuristics. Understanding the relationship between register count, occupancy, and performance is essential to understanding why cicc makes the decisions it does.
The Register Budget
Each Streaming Multiprocessor (SM) has a fixed 32-bit register file:
| SM Generation | Registers per SM | Max Registers per Thread |
|---|---|---|
| SM 70 (Volta) | 65,536 | 255 |
| SM 75 (Turing) | 65,536 | 255 |
| SM 80 (Ampere) | 65,536 | 255 |
| SM 86 (Ampere GA10x) | 65,536 | 255 |
| SM 89 (Ada) | 65,536 | 255 |
| SM 90 (Hopper) | 65,536 | 255 |
| SM 100 (Blackwell) | 65,536 | 255 |
These 65,536 registers are shared among all resident threads. The hardware partitions them at kernel launch time based on the per-thread register count reported by ptxas. The partition is coarse-grained -- registers are allocated in units of warp groups, not individual threads.
Occupancy Cliffs
The relationship between per-thread register count and achievable occupancy is a step function with sharp discontinuities:
Registers/thread Max warps/SM Max threads/SM Occupancy
32 64 2048 100%
33-40 48 1536 75%
41-48 32 1024 50% <-- cliff
49-64 32 1024 50%
65-80 24 768 37.5% <-- cliff
81-96 20 640 31.3%
97-128 16 512 25% <-- cliff
129-168 12 384 18.8%
169-255 8 256 12.5% <-- cliff
(Exact thresholds vary by SM generation and block size; these are representative for sm_70+ with standard block configurations.)
Adding a single register -- from 32 to 33 registers per thread -- drops maximum occupancy from 64 warps to 48 warps, a 25% reduction. These are the occupancy cliffs that cicc's heuristics are designed to avoid. The cost is asymmetric: the 33rd register provides trivial benefit (one fewer spill), but the occupancy loss costs 25% of the SM's latency-hiding capacity.
This is why:
- The loop unroller uses conservative thresholds that balance ILP against register growth
- The loop vectorizer limits VF to 2 or 4 even though wider vectors are legal
- LSR has an lsr-rp-limit knob that hard-rejects formulae exceeding a register pressure ceiling
- LICM runs twice -- once to hoist, once to sink back values whose extended live ranges hurt occupancy
- The rematerialization pass recomputes values rather than keeping them live across long ranges
- The register allocator uses -maxreg (default 70) as a pressure cap rather than a physical assignment constraint
The cicc binary contains no explicit occupancy table -- it delegates final register assignment and occupancy computation to ptxas. But the thresholds in the optimization passes (LSR's lsr-rp-limit, the unroller's PartialThreshold, the vectorizer's register-pressure-bounded interleave count) are all calibrated to stay below known cliff boundaries.
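A simplified model consistent with the representative step table above (the allocation granularities here are assumptions; ptxas's actual computation also accounts for block size, shared memory usage, and per-SM thread limits):

```cpp
#include <algorithm>
#include <cassert>

// Max resident warps per SM as limited by registers alone (sm_70-style).
unsigned maxWarpsPerSM(unsigned regs_per_thread) {
    const unsigned kRegFile  = 65536;  // 32-bit registers per SM
    const unsigned kWarpSize = 32;
    const unsigned kMaxWarps = 64;
    const unsigned kRegGran  = 8;      // assumed per-thread allocation granularity
    const unsigned kWarpGran = 4;      // assumed warp-count rounding
    unsigned alloc = ((regs_per_thread + kRegGran - 1) / kRegGran) * kRegGran;
    unsigned warps = kRegFile / (alloc * kWarpSize);
    warps = (warps / kWarpGran) * kWarpGran;   // round down to warp granularity
    return std::min(kMaxWarps, warps);
}
// maxWarpsPerSM(32) -> 64 (100%); maxWarpsPerSM(33) -> 48 (75%)
```

The discontinuity at 33 registers falls out of the rounding: one extra register rounds the per-thread allocation up to 40, and the resulting warp count rounds down past the cliff.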
PTX Virtual Registers
PTX has no fixed physical register file from the compiler's perspective. cicc emits virtual registers in nine typed classes (%p, %rs, %r, %rd, %f, %fd, %h, %hh, %rq -- see Register Classes). The ptxas assembler performs the actual register allocation from virtual to physical registers, using the SM's register file as the constraint. cicc's job is to minimize the number of simultaneously live virtual registers so that ptxas can produce a low register-count assignment.
The typed register model means that a 32-bit integer (%r) and a 32-bit float (%f) occupy separate register namespaces -- they never alias. A 64-bit value (%rd, %fd) occupies two 32-bit register slots. An Int128Regs value (%rq) occupies four. This is why the type legalization pass aggressively scalarizes vector types and the IV demotion pass narrows 64-bit induction variables to 32-bit: every bit of width reduction directly saves register pressure.
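Because every class is built from 32-bit physical slots, width reduction translates directly into register savings. A trivial model of the slot cost that IV demotion and vector scalarization are minimizing:

```cpp
#include <cassert>

// Physical 32-bit register slots consumed by one virtual register.
unsigned regSlots(unsigned bit_width) { return (bit_width + 31) / 32; }
// %p/%rs/%r/%f (<=32-bit) -> 1 slot, %rd/%fd (64-bit) -> 2, %rq (128-bit) -> 4
```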
Memory Hierarchy
GPU memory is organized into physically disjoint address spaces with radically different performance characteristics. On a CPU, the entire address space is a flat virtual memory with uniform-latency cache hierarchy. On a GPU, choosing the wrong address space for an access can cost 100x in latency. This section summarizes the performance-relevant properties; for complete address space encoding, aliasing rules, and data layout strings, see Address Spaces.
Latency Table
| Memory | LLVM AS | PTX Qualifier | Latency (cycles) | Scope | Capacity |
|---|---|---|---|---|---|
| Registers | -- | %r, %f, etc. | 0 | Per-thread | 255 per thread (SM 70+) |
| Shared | 3 | .shared | 20-30 | Per-CTA (block) | 48-228 KB per SM |
| Constant cache | 4 | .const | 4-8 (hit) | Read-only, device-wide | 64 KB per SM |
| Parameter | 101 | .param | 4-8 | Per-kernel launch | Mapped to constant bank |
| Local (L1 hit) | 5 | .local | ~30 | Per-thread stack | L1 partition |
| Local (L2 hit) | 5 | .local | ~200 | Per-thread stack | L2 partition |
| Global (L2 hit) | 1 | .global | 32-128 | Device-wide | L2 cache |
| Global (DRAM) | 1 | .global | 200-800 | Device-wide | Device DRAM |
| Generic | 0 | .generic | +4-8 over resolved | Virtual | Runtime-resolved |
| Shared cluster | 7 | .shared::cluster | 30-50 | Cross-CTA (SM 90+) | Cluster shared pool |
The 200-800 cycle range for global DRAM access is the defining constraint of GPU performance. It means that a single cache-missing load stalls the executing warp for hundreds of cycles. The hardware hides this latency through warp-level multithreading (see next section), but only if enough warps are resident -- which brings us back to register pressure and occupancy.
Why Each Memory Matters for cicc
Registers vs. .local: Every alloca that SROA fails to promote becomes a .local allocation backed by DRAM. A .local access that misses L1 costs 200-400 cycles versus zero for a register. This is why SROA runs twice in the pipeline and why cicc's inline budget (20,000 vs upstream 225) is so aggressive -- inlining eliminates allocas from byval parameter copies.
Shared memory (AS 3): On-chip SRAM with 20-30 cycle latency, shared across all threads in a CTA (thread block). Uses 32-bit pointers (when +sharedmem32bitptr is active), saving one register per pointer compared to 64-bit global pointers. This is why LSR has disable-lsr-for-sharedmem32-ptr -- strength-reducing a 32-bit shared pointer can produce 64-bit intermediates that defeat the optimization.
Constant memory (AS 4): Hardware-cached read-only memory with 4-8 cycle latency on cache hit. The NVVM AA marks AS 4 as NoModRef, enabling LICM to hoist constant loads without checking for intervening stores.
.param space (AS 101): Used for function argument passing (see the calling convention section below). Read-only from device code. Mapped to the constant cache path, so reads are 4-8 cycles.
Generic (AS 0): The performance killer. A generic pointer forces a runtime address-space lookup (+4-8 cycles per access) and destroys alias analysis precision (every generic pointer MayAlias with everything). This is why MemorySpaceOpt exists -- resolving generic pointers to specific address spaces is one of the highest-impact optimizations in cicc.
Memory Coalescing
The GPU memory subsystem services warp-wide requests in 128-byte transactions (or 32-byte sectors on some architectures). When 32 threads in a warp access 32 consecutive 4-byte values (128 bytes total), the hardware coalesces the 32 individual requests into a single transaction. This is the stride-1 access pattern -- the ideal case.
Thread 0 loads addr+0 ┐
Thread 1 loads addr+4 │
Thread 2 loads addr+8 │ One 128-byte transaction
... │
Thread 31 loads addr+124 ┘
When threads access non-consecutive addresses (stride > 1, scattered, or misaligned), the hardware must issue multiple transactions to satisfy the warp's requests. In the worst case (32 threads accessing 32 different cache lines), a single warp load generates 32 separate transactions, reducing effective bandwidth by 32x.
Coalescing is why the loop vectorizer targets VF=2 or VF=4 on GPU: vectorizing a per-thread loop with ld.v4.f32 loads four consecutive elements per thread in a single wide transaction, improving bytes-per-transaction. It is also why the loop unroller enforces power-of-two factors -- non-power-of-two unroll factors create asymmetric access patterns that interact poorly with the 128-byte transaction boundary.
The memory coalescing model also explains why cicc's SLP vectorizer pairs adjacent scalar loads into ld.v2 / ld.v4 instructions -- not for SIMD parallelism (there is none) but for transaction width optimization.
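The transaction math can be sketched directly: count the distinct 128-byte segments touched by one warp-wide 4-byte load at a given element stride. This is a simplified model that ignores alignment and 32-byte sector granularity:

```cpp
#include <cassert>
#include <cstdint>
#include <set>

unsigned transactionsForWarpLoad(unsigned stride_elems, uint64_t base = 0) {
    const unsigned kWarp = 32, kElemBytes = 4, kSegBytes = 128;
    std::set<uint64_t> segments;   // distinct 128-byte segments touched
    for (unsigned t = 0; t < kWarp; ++t)
        segments.insert((base + uint64_t(t) * stride_elems * kElemBytes) / kSegBytes);
    return (unsigned)segments.size();
}
// stride 1 -> 1 transaction (fully coalesced); stride 32 -> 32 (one per thread)
```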
No Out-of-Order Execution
GPU warps execute instructions strictly in program order. There is no out-of-order execution, no speculative execution, no branch prediction, and no reorder buffer. A warp that encounters a long-latency operation (global memory load, texture fetch) simply stalls until the result is available.
The sole latency-hiding mechanism is warp-level multithreading. Each SM maintains multiple warps in flight simultaneously. When one warp stalls on a memory access, the hardware switches to another ready warp in the same clock cycle (zero-cost context switch, because each warp has its own register state). This is why occupancy matters -- more resident warps means more opportunities to hide latency through interleaving.
The absence of OOO execution has profound implications for cicc:
ILP must be compiler-created. On a CPU, the hardware reorder buffer discovers and exploits instruction-level parallelism dynamically. On a GPU, the compiler (cicc + ptxas) must explicitly schedule independent instructions adjacent to each other so the hardware can overlap them. This is why loop unrolling is so valuable on GPU -- it creates independent instructions from different iterations that the scheduler can interleave -- and why the interleave count in the loop vectorizer exists (it replicates the vectorized body to expose more ILP).
Every stall is a stall. There is no store buffer to absorb write latency, no load queue to speculatively issue reads. The scheduling passes (instruction scheduling, block placement) must model this accurately.
Instruction issue width bounds throughput. Each SM has a fixed number of instruction schedulers (typically 4 per SM on sm_70+), each issuing one instruction per clock to one warp. The total instruction throughput of an SM is schedulers * clock_rate. The TTI scheduling info at TTI+56 (issue width at +32, latency at +36 within the sub-structure) encodes this model and feeds the vectorizer's interleave count cap.
The .param Calling Convention
Function calls on NVIDIA GPUs are expensive in a way that has no CPU equivalent. On x86, a function call passes arguments in registers or on the stack (a cached memory region), executes CALL, and the callee reads them back. Total overhead: 5-20 cycles. On GPU, there is no hardware call stack for registers. The PTX calling convention works through the .param address space:
Call Sequence
// Caller side:
.param .align 8 .b8 param0[16]; // DeclareParam
st.param.b64 [param0+0], %rd1; // Store arg 0, field 0
st.param.b64 [param0+8], %rd2; // Store arg 0, field 1
.param .b32 param1; // DeclareScalarParam
st.param.b32 [param1+0], %r5; // Store arg 1
call.uni (retval0), callee, (param0, param1); // The actual call
// Callee side:
ld.param.b64 %rd10, [param0+0]; // Load arg 0, field 0
ld.param.b64 %rd11, [param0+8]; // Load arg 0, field 1
ld.param.b32 %r20, [param1+0]; // Load arg 1
// ... function body ...
st.param.b32 [retval0+0], %r30; // Store return value
ret;
// Back in caller:
ld.param.b32 %r6, [retval0+0]; // Load return value
Each function call generates O(n) st.param + O(n) ld.param instructions where n is the total number of argument fields (not just argument count -- structs are marshaled field-by-field). A function with 8 struct arguments containing 4 fields each generates 32 stores + 32 loads + the call instruction itself. At shared/constant-cache latency (4-8 cycles per access), this is 256-512 cycles of pure marshaling overhead.
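That arithmetic can be sketched as a back-of-envelope cost function (the helper and its names are illustrative, not cicc's actual inliner cost model):

```c
/* Back-of-envelope .param marshaling cost per the text: each argument
 * field costs one st.param (caller) plus one ld.param (callee), at an
 * assumed 4-8 cycles per access. Illustrative, not cicc code. */
typedef struct { int lo, hi; } CycleRange;

CycleRange param_marshal_cycles(int num_args, int fields_per_arg) {
    int accesses = 2 * num_args * fields_per_arg;  /* stores + loads */
    CycleRange r = { accesses * 4, accesses * 8 }; /* 4-8 cycles each */
    return r;
}
```

For the example above, 8 arguments of 4 fields each give 64 accesses and a 256-512 cycle marshaling range, dwarfing the 5-20 cycle CPU call overhead.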
Additionally:
- Call boundaries destroy scheduling freedom. The hardware cannot overlap instructions across a call/return boundary.
- Call boundaries force register save/restore. If the callee needs more registers than are available in the caller's allocation, the hardware spills to .local memory (DRAM, 200-800 cycles).
- Indirect calls are catastrophic. An indirect call (call.uni through a register) prevents all of the above from being optimized statically. No inlining, no cross-function register allocation, no dead argument elimination.
This is why:
- cicc's custom inliner uses a 20,000-unit budget (89x upstream LLVM's 225) -- the .param marshaling cost for a typical function easily exceeds the 225-unit threshold
- LTO is dramatically more valuable on GPU than on CPU -- cross-module inlining eliminates .param overhead for functions in separate translation units
- Whole-program devirtualization is critical -- converting indirect calls to direct calls enables inlining and eliminates the worst-case register spill scenario
- 60% of the NVIDIA custom inliner's code computes type-size comparisons for argument coercion cost, because the .param marshaling cost dominates the inlining decision
The SelectionDAG Encoding
The SelectionDAG backend uses opcodes DeclareParam (505), DeclareScalarParam (506), StoreV1/V2/V4 (571-573), and LoadRetParam / LoadV1/V2/V4 (515-516, 568-570) for the param passing convention. The .param space is encoded as SelectionDAG code 5 in sub_33B0210. For complete opcode details, see NVPTX Machine Opcodes.
Address Space Semantics
GPU memory is partitioned into physically disjoint hardware regions. Pointers in different non-generic address spaces can never reference the same byte -- a property that NVVM AA exploits for O(1) NoAlias determination. The generic address space (AS 0) is a virtual overlay resolved at runtime by the hardware's address translation unit, which tests whether the address falls in the shared, local, or global window.
The following properties have direct optimization impact:
| Property | Global (AS 1) | Shared (AS 3) | Local (AS 5) | Constant (AS 4) |
|---|---|---|---|---|
| Pointer width | 64-bit | 32-bit* | 32-bit (effective) | 64-bit |
| Read-only | No | No | No | Yes |
| Cross-CTA visible | Yes | No | No | Yes |
| Hardware addressing modes | Base + offset | Base + offset, banked | Frame pointer + offset | Indexed constant cache |
| Coalescing | 128-byte transactions | 32 banks, 4-byte stride | Per-thread (no coalescing) | Broadcast to warp |
* 32-bit when +sharedmem32bitptr target feature is active (the default for sm_70+).
The 32-bit pointer optimization for shared memory saves one register per shared-memory pointer and reduces all address arithmetic from 64-bit to 32-bit operations. This is encoded in the NVPTX data layout string as p3:32:32:32 and is the reason the IV Demotion pass exists -- it narrows 64-bit induction variables to 32-bit when the loop operates entirely in shared memory.
For the complete address space reference -- including aliasing rules, the MemorySpaceOpt bitmask encoding, cvta intrinsic mapping, isspacep folding, and per-SM shared memory sizes -- see Address Spaces.
Compiler Implications Summary
Every major cicc optimization decision traces back to one or more of the properties above. The following table maps each hardware property to the compiler passes it shapes:
| Hardware Property | Compiler Impact | Key Passes |
|---|---|---|
| Warp divergence serializes both paths | Minimize control flow in hot loops | StructurizeCFG, CSSA, Loop Index Split, Branch Distribution |
| Register count determines occupancy | All transforms must minimize live values | Register Allocation, LSR, LICM, Rematerialization, IV Demotion |
| Occupancy cliffs are discrete | Threshold-driven heuristics with cliff awareness | Loop Unroll, Loop Vectorize, LSR lsr-rp-limit |
| No OOO execution | Compiler must create ILP | Loop Unroll (ILP via body replication), Scheduling, vectorizer interleave count |
| .local spill costs 200-800 cycles | Aggressively promote allocas | SROA (runs twice), Inliner (20K budget eliminates byval copies) |
| .param marshaling is O(n) per call | Aggressively inline | Inliner, LTO, Devirtualization |
| 128-byte coalescing transactions | Optimize memory access stride | Loop Vectorize (VF=2/4 for ld.v2/ld.v4), SLP Vectorizer |
| Address spaces are disjoint | NoAlias for cross-space pairs | NVVM AA, MemorySpaceOpt |
| Generic pointers destroy alias precision | Resolve to specific space | MemorySpaceOpt, IPMSP |
| Shared memory uses 32-bit pointers | Narrow IV and address width | IV Demotion, LSR disable-lsr-for-sharedmem32-ptr |
| Closed-world compilation model | Full-program visibility | LTO, Dead Kernel Elimination, Devirtualization |
| Constant cache is 4-8 cycles | Hoist constant loads freely | LICM, NVVM AA NoModRef for AS 4 |
What Upstream LLVM Gets Wrong
Upstream LLVM's NVPTX backend correctly implements the PTX virtual register model and the basic address space numbering. But the optimization passes assume CPU-like economics:
- Inline threshold of 225 assumes function calls cost 5-20 cycles. GPU calls cost hundreds of cycles due to .param marshaling. NVIDIA overrides to 20,000.
- LSR cost model compares formulae by counting registers and instructions with equal weight. On GPU, one extra register can cost 25% occupancy; one extra instruction costs nearly nothing. NVIDIA replaces the formula solver entirely.
- LICM assumes hoisting is always profitable. On CPU, moving an operation from loop body to preheader is strictly beneficial. On GPU, it extends the live range of the hoisted value across the entire loop, consuming a register for all iterations. NVIDIA runs LICM twice (hoist then sink) and relies on rematerialization to undo unprofitable hoists.
- Vectorization targets SIMD lane width. TTI::getRegisterBitWidth(Vector) returns 256 (AVX2) or 512 (AVX-512) on CPU. NVPTX returns 32 -- there are no SIMD lanes. Vectorization targets memory transaction width, not ALU parallelism.
- No occupancy model exists in upstream. CPU register allocation minimizes spill cost. GPU register allocation must minimize total register count to maximize occupancy. These are different objective functions.
- Address spaces are an afterthought. Upstream LLVM treats address spaces as metadata annotations. On GPU, they are physically disjoint hardware memory partitions with different pointer widths, latencies, and aliasing properties. Every pass that touches pointers must be address-space-aware.
Cross-References
- Address Spaces -- complete encoding, aliasing rules, MemorySpaceOpt bitmask, data layout strings
- Register Classes -- nine typed register classes, encoding scheme, coalescing rules
- Register Allocation -- greedy RA, -maxreg constraint, pressure tracking
- Loop Vectorize -- VF selection, memory coalescing motivation, register-pressure-bounded IC
- Loop Unroll -- ILP vs register pressure tradeoff, power-of-two enforcement
- LSR (NVIDIA Custom) -- occupancy-aware formula solver, register pressure gating
- LICM -- hoist/sink dual invocation, register pressure tension
- SROA -- .local elimination, dual-invocation pipeline position
- Inliner Cost Model -- 20K budget, .param marshaling cost, four parallel models
- LTO & Module Optimization -- closed-world model, dead kernel elimination
- MemorySpaceOpt -- generic-to-specific address space resolution
- StructurizeCFG -- divergence-safe control flow restructuring
- CSSA -- conventional SSA for SIMT divergence correctness
- Rematerialization -- register pressure reduction via recomputation
- IV Demotion -- 64-bit to 32-bit IV narrowing for shared memory
- Instruction Scheduling -- in-order scheduling, MRPA pressure tracking
- NVPTX Target Infrastructure -- TTI hooks, data layout, target features
Address Spaces
This page is the single source of truth for NVPTX address space numbering, hardware mapping, pointer widths, aliasing rules, and the internal bitmask encoding used by MemorySpaceOpt. It supersedes all inline address space tables elsewhere in the wiki -- those pages should cross-reference this one rather than maintaining their own copies.
NVPTX defines eight address spaces in cicc v13.0, six of which correspond to physically disjoint hardware memory partitions. The generic (flat) address space is a virtual overlay resolved at runtime by the GPU's address translation unit. The eighth, tensor memory (AS 6), is a Blackwell-era addition accessible only through TMA intrinsics. A ninth, AS 25, is used internally within NVVM IR for device-linkage annotations and never reaches PTX emission. A tenth, AS 53, appears in MemorySpaceOpt initialization as an internal annotation space for global variable tracking.
Master Address Space Table
| LLVM AS | Name | PTX Qualifier | Hardware | Pointer Width | Typical Latency | CUDA Qualifier |
|---|---|---|---|---|---|---|
| 0 | Generic (flat) | .generic | Virtual -- address translation unit maps to physical space at runtime | 64-bit | +4-8 cycles over resolved (translation overhead) | Default for unresolved pointers |
| 1 | Global | .global | Device DRAM, L2 cached, optionally L1 cached | 64-bit | 200-800 cycles (DRAM); 32-128 cycles (L2 hit) | __device__, cudaMalloc |
| 3 | Shared | .shared | Per-CTA on-chip scratchpad SRAM (48-228 KB per SM) | 32-bit (when p3:32:32:32 active) or 64-bit | 20-30 cycles (bank-conflict-free) | __shared__ |
| 4 | Constant | .const | Read-only constant cache (64 KB per SM) | 64-bit | 4-8 cycles (cache hit); DRAM latency on miss | __constant__ |
| 5 | Local | .local | Per-thread private stack in DRAM, L1 cached | 32-bit (effective) or 64-bit | Same as global (backed by DRAM) | Stack allocations (alloca) |
| 6 | Tensor Memory | N/A (TMA intrinsics only) | Blackwell tensor memory (SM 100+) | 64-bit | Varies (TMA pipeline) | N/A -- accessed via cp.async.bulk intrinsics |
| 7 | Shared Cluster | .shared::cluster | Distributed shared memory across CTAs in a cluster (SM 90+) | 32-bit or 64-bit | ~30-50 cycles (cross-CTA penalty over AS 3) | __shared__ with cluster scope |
| 25 | Internal device linkage | N/A | Not a physical memory -- NVVM IR annotation for __device__ linkage | N/A | N/A | Used internally by module summary for extern device resolution |
| 53 | Internal annotation | N/A | Not a physical memory -- used by MemorySpaceOpt for global tracking | N/A | N/A | Internal to cicc pipeline |
| 101 | Param | .param | Kernel parameter window (mapped into constant bank or global memory) | 64-bit | 4-8 cycles (constant cache path) | Kernel parameters (__global__ function args) |
Address space 2 is not used by NVPTX. The numbering gap between global (1) and shared (3) is inherited from upstream LLVM NVPTX conventions. The NVVM verifier's valid-AS check uses the formula ((AS + ~2) & 0xFFFFFF) > 2, which accepts AS values 0, 1, and 3 unconditionally; AS 2 is sometimes valid depending on context.
Aliasing Rules
The core property exploited by NVVM AA is hardware address space disjointness: pointers in different non-generic address spaces can never reference the same byte. NVVM AA (nvptx-aa) encodes this as a NoAlias rule for every cross-space pointer pair, with the following exceptions.
| Pointer A | Pointer B | Alias Result | Reason |
|---|---|---|---|
| AS 0 (generic) | Any | MayAlias | Generic can map to any physical space at runtime |
| AS X (same) | AS X (same) | MayAlias | Same space -- further analysis needed (BasicAA, TBAA) |
| AS 1 (global) | AS 101 (param) | MayAlias | cvta.param on SM 70+ makes param addressable as global |
| AS 3 (shared) | AS 7 (shared cluster) | MayAlias | Cluster shared memory overlaps with regular shared |
| AS X (non-generic) | AS Y (non-generic, X != Y) | NoAlias | Physically disjoint hardware memory partitions |
The NVVM AA algorithm (pseudocode from NVPTXAAResult::alias in cicc):
AliasResult alias(Loc1, Loc2):
AS1 = getAddressSpace(Loc1.Ptr, TraverseLimit) // walk through casts
AS2 = getAddressSpace(Loc2.Ptr, TraverseLimit)
if AS1 == 0 or AS2 == 0: return MayAlias // generic kills precision
if (AS1==3 and AS2==7) or (AS1==7 and AS2==3): return MayAlias
if (AS1==1 and AS2==101) or (AS1==101 and AS2==1): return MayAlias // cvta.param
if AS1 == AS2: return MayAlias // same space, need deeper AA
return NoAlias // different non-generic spaces
The getAddressSpace helper walks backward through getUnderlyingObject (stripping GEPs, bitcasts, PHIs) up to nvptx-traverse-address-aliasing-limit (default 6) levels deep, resolving generic pointers that were produced by addrspacecast from a specific space.
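For a quick sanity check of the rule table, the cross-space decision can be transcribed into a small C function. This is a sketch of the documented rules only, not cicc's NVPTXAAResult implementation, and it deliberately omits the getAddressSpace traversal:

```c
/* C transcription of the documented NVVM AA cross-space rules.
 * AS numbering: 0 = generic, 1 = global, 3 = shared, 4 = constant,
 * 5 = local, 7 = shared cluster, 101 = param. Sketch only. */
typedef enum { NoAlias = 0, MayAlias = 1 } AliasResult;

AliasResult nvvm_cross_space_alias(int as1, int as2) {
    if (as1 == 0 || as2 == 0)
        return MayAlias;                         /* generic kills precision */
    if ((as1 == 3 && as2 == 7) || (as1 == 7 && as2 == 3))
        return MayAlias;                         /* shared vs shared cluster */
    if ((as1 == 1 && as2 == 101) || (as1 == 101 && as2 == 1))
        return MayAlias;                         /* cvta.param on SM 70+ */
    if (as1 == as2)
        return MayAlias;                         /* same space: deeper AA */
    return NoAlias;                              /* disjoint partitions */
}
```

Running the table's rows through this function reproduces the documented results: shared vs. constant is NoAlias, anything involving generic is MayAlias, and the param/global and shared/cluster exceptions stay MayAlias.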
ModRef Rules
| Address Space | ModRef Mask | Meaning |
|---|---|---|
| AS 4 (constant) | NoModRef | Read-only -- never modified |
| AS 101 (param) | NoModRef | Kernel params are read-only from device code |
| All others | ModRef | May be both read and written |
These masks enable DSE to skip constant/param stores entirely, and LICM to hoist loads from constant memory without checking for intervening stores.
MemorySpaceOpt Internal Bitmask
MemorySpaceOpt (sub_1C70910) encodes address spaces as single-bit positions in a byte-wide bitmask for efficient dataflow computation. The mapping is performed in sub_1CA8CD0 via a switch on the LLVM address space ID:
| Bit | Value | LLVM AS | Name |
|---|---|---|---|
| 0 | 0x01 | 1 | Global |
| 1 | 0x02 | 3 | Shared |
| 2 | 0x04 | 4 | Constant |
| 3 | 0x08 | 5 | Local |
| 4 | 0x10 | 101 | Param |
| 0-3 | 0x0F | N/A | Unknown (union of global + shared + constant + local) |
// sub_1CA8CD0 — address space to bitmask
switch (addrspace) {
case 1: return 0x01; // global
case 3: return 0x02; // shared
case 4: return 0x04; // constant
case 5: return 0x08; // local
case 101: return 0x10; // param
default: return 0x0F; // unknown = union of all non-param
}
When multiple pointer sources contribute different address spaces (e.g., through PHI nodes or function arguments receiving pointers from different call sites), the bitmask is OR'd. A singleton bit (popcount == 1) means the space is fully resolved; multiple bits set means the pointer is ambiguous and requires either runtime isspacep or a conservative default to global.
Resolution Decision
Once the bitmask is computed for a pointer:
- Single bit set: Resolved. The pass inserts an addrspacecast from generic to the target space and replaces all uses.
- Multiple bits set, param bit included: If param-always-point-to-global is true (default), resolve to global. The rationale: kernel parameters always point into global device memory.
- Multiple bits set, no param: Ambiguous. Emit warning "Cannot tell what pointer points to, assuming global memory space" and default to global.
- Zero bits: Unreachable code or analysis error.
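The merge-and-decide logic can be sketched as follows. The resolve_space helper and its return convention are hypothetical; only the bit values mirror the documented sub_1CA8CD0 mapping:

```c
/* Bit values per the documented sub_1CA8CD0 mapping. */
enum { MS_GLOBAL = 0x01, MS_SHARED = 0x02, MS_CONST = 0x04,
       MS_LOCAL  = 0x08, MS_PARAM  = 0x10 };

static int popcount8(unsigned m) {
    int n = 0;
    for (; m; m >>= 1) n += m & 1;
    return n;
}

/* Hypothetical decision helper: returns the resolved LLVM AS for a
 * singleton mask, otherwise 1 (global) -- covering both the
 * param-always-point-to-global rule and the conservative default. */
int resolve_space(unsigned mask) {
    if (popcount8(mask) == 1) {
        switch (mask) {
        case MS_GLOBAL: return 1;
        case MS_SHARED: return 3;
        case MS_CONST:  return 4;
        case MS_LOCAL:  return 5;
        case MS_PARAM:  return 101;
        }
    }
    return 1;  /* ambiguous: default to global per the text */
}
```

Merging two pointer sources is a bitwise OR of their masks before calling resolve_space, so a PHI of a shared-derived and a local-derived pointer yields MS_SHARED | MS_LOCAL and falls into the ambiguous-to-global path.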
Relationship to EDG Frontend Encoding
The EDG frontend uses a separate encoding in the symbol table entry at offset +156/+157:
| EDG Bit | Value | Memory Space |
|---|---|---|
| +156 bit 0 | 0x01 | __device__ (any device placement) |
| +156 bit 1 | 0x02 | __shared__ |
| +156 bit 2 | 0x04 | __constant__ |
| +156 bit 4 | 0x10 | Read-only linkage flag |
| +157 bit 0 | 0x01 | __managed__ |
The EDG memory_space_code at offset +136 maps to LLVM address spaces during IR generation: code 1 (__device__) maps to AS 1, code 2 (__shared__) maps to AS 3, code 3 (__constant__) maps to AS 4.
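The code-to-AS mapping is small enough to state directly. This helper is a sketch of the documented mapping only; codes outside 1-3 are treated as unmapped here, which is an assumption for illustration:

```c
/* Sketch of the documented EDG memory_space_code -> LLVM AS mapping
 * applied during IR generation. Codes other than 1-3 return -1 here
 * (an assumption; the recovered mapping covers only these three). */
int edg_memspace_to_llvm_as(int code) {
    switch (code) {
    case 1:  return 1;   /* __device__   -> global   (AS 1) */
    case 2:  return 3;   /* __shared__   -> shared   (AS 3) */
    case 3:  return 4;   /* __constant__ -> constant (AS 4) */
    default: return -1;  /* not covered by the recovered mapping */
    }
}
```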
The Generic Address Space Problem
The generic (flat, AS 0) address space is the fundamental obstacle to alias precision on GPUs. When the EDG frontend or NVVM IR generator cannot determine which physical memory a pointer targets, it emits the pointer in AS 0. The hardware resolves generic addresses at runtime by checking whether the address falls within the shared memory window, the local memory window, or defaults to global -- a process that adds 4-8 cycles of latency per access.
For NVVM AA, a generic pointer forces MayAlias against every other pointer, destroying the disjointness guarantee and blocking optimizations in DSE, LICM, GVN, and MemorySSA. Three mechanisms address this:
1. MemorySpaceOpt (compile-time conversion). The two-phase inter-procedural pass resolves generic pointers by tracing them back to their allocation sites through use-def chains. When a generic pointer always derives from a __shared__ variable, the pass inserts addrspacecast to AS 3 and rewrites all uses. When different call sites disagree on the address space for the same argument, the pass clones the function into space-specialized versions. Every generic pointer resolved gives NVVM AA an additional NoAlias edge. Disabling this pass (-disable-MemorySpaceOptPass) causes 2-20x performance regressions.
2. AA address-space traversal. Even without MemorySpaceOpt, NVVM AA's getAddressSpace helper walks through addrspacecast chains. If %p was produced by addrspacecast i8 addrspace(3)* %s to i8*, the traversal discovers AS 3 despite %p being in AS 0 at the use site.
3. !noalias.addrspace metadata (kind 42). cicc attaches this metadata to instructions when address space information is known but the pointer itself remains generic. The AA evaluator detects this via opcode byte 0x4E ('N') and sets bit 2 in a pointer-tagged value (OR with 4), propagating disambiguation information through to AAResults::alias. This is a cicc-specific extension not found in upstream LLVM.
Data Layout Strings
The NVPTX data layout string encodes pointer widths and alignment for each address space. cicc produces three variants based on pointer width and shared memory pointer mode.
64-bit with shared memory specialization (most common production mode)
e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
64-bit without shared memory specialization
e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
32-bit mode
e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
Field-by-Field Breakdown
| Field | Meaning | NVIDIA Note |
|---|---|---|
| e | Little-endian | All NVIDIA GPUs |
| p:64:64:64 | Default pointer: 64-bit size, 64-bit ABI align, 64-bit preferred align | Applies to AS 0 (generic), AS 1 (global), AS 4 (constant), AS 101 (param) |
| p3:32:32:32 | AS 3 pointer: 32-bit size, 32-bit ABI align, 32-bit preferred align | Shared memory is on-chip, addressable with 32 bits even in 64-bit mode |
| i1:8:8 | Booleans stored as 8-bit | Standard |
| i128:128:128 | 128-bit integers: 128-bit aligned | Used by cmpxchg on global/shared |
| n16:32:64 | Native integer widths | PTX has 16-bit, 32-bit, and 64-bit register files |
| v16:16:16 / v32:32:32 | Vector alignment: natural | 16-bit vectors at 16-bit, 32-bit vectors at 32-bit |
Shared Memory 32-bit Pointer Optimization
The p3:32:32:32 entry is the most impactful NVIDIA delta in the data layout. Shared memory lives in 48-228 KB of on-chip SRAM per SM, addressable with 32-bit pointers even when the rest of the address space is 64-bit. Using 32-bit pointers for shared memory saves register pressure (one register instead of two for every shared pointer) and instruction count (32-bit arithmetic instead of 64-bit for every address calculation).
The optimization is controlled by three knobs that alias the same underlying global (unk_4D0461C):
| Knob | Source |
|---|---|
| nvptx-short-ptr | Backend option (ctor_609_0 at 0x585D30) |
| nvptx-32-bit-smem | Backend option (same constructor) |
| +sharedmem32bitptr | Target feature string (passed via -arch processing) |
When any of these is active, the data layout gains the p3:32:32:32 entry, and LLVM's type system treats all addrspace(3)* pointers as 32-bit. This is transparent to the rest of the compiler -- DataLayout queries like getPointerSizeInBits(3) return 32 automatically, and all pointer arithmetic in shared memory is lowered to 32-bit operations.
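The effect is visible by parsing the layout strings above. The pointer_size_bits helper below is a hypothetical illustration of the lookup order (per-AS entry, then default entry), not LLVM's DataLayout parser:

```c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Extract the pointer size in bits for an address space from an LLVM
 * data layout string: a "p<as>:" entry wins, else the default "p:"
 * entry, else 64. Hypothetical sketch, not LLVM's DataLayout class. */
int pointer_size_bits(const char *dl, int as) {
    char key[16];
    snprintf(key, sizeof key, "-p%d:", as);
    const char *hit = strstr(dl, key);
    if (hit) return atoi(hit + strlen(key));
    hit = strstr(dl, "-p:");       /* default pointer spec, e.g. "e-p:64..." */
    if (hit) return atoi(hit + 3);
    return 64;
}
```

With the shared-specialized string, pointer_size_bits(dl, 3) yields 32 while AS 1 falls through to the 64-bit default; without the p3:32:32:32 entry, shared pointers come out 64-bit as well.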
The same 32-bit treatment applies to local memory (AS 5) in practice: local stack addresses are within the per-thread frame and always fit in 32 bits. However, the data layout does not carry an explicit p5:32:32:32 entry -- the 32-bit treatment is enforced by the SelectionDAG lowering which uses AS 7 for stack operations.
Known-Bits Implications
The 32-bit address spaces have direct implications for the known-bits analysis (sub_BD5420):
| Address Space | Pointer Width | Known Bits Effect |
|---|---|---|
| AS 0 (generic) | 64-bit | Pointer alignment only |
| AS 1 (global) | 64-bit | Low 4 bits often known-zero (16-byte alignment typical) |
| AS 3 (shared) | 32-bit | Low 2 bits known-zero (4-byte minimum), bits [32,63] irrelevant |
| AS 4 (constant) | 64-bit | Low 2 bits known-zero (4-byte alignment) |
| AS 5 (local) | 32-bit effective | Low 2 bits known-zero (stack alignment), bits [32,63] irrelevant |
DemandedBits exploits the 32-bit address spaces to eliminate zero-extensions and truncations around shared/local address calculations, keeping all pointer arithmetic in 32-bit ALU operations. This interacts with IV Demotion (sub_18B1DE0), which narrows 64-bit induction variables to 32-bit where shared memory address calculations permit.
Data Layout Validation
The NVVM verifier (sub_2C80C90) validates the data layout string at multiple pipeline points:
- If empty: "Empty target data layout, must exist"
- If invalid: prints "Example valid data layout:" with reference strings from off_4C5D0A0 (32-bit) and off_4C5D0A8 (64-bit)
- A shortened compatibility form e-i64:64-v16:16-v32:32-n16:32:64 is used in the IR linker (sub_106AB30) to verify that two modules being linked share the same NVPTX target data layout.
Address Space Casts
NVPTX has strict rules for addrspacecast instructions, enforced by the NVVM verifier:
- At least one side must be generic (AS 0). Casting between two non-generic address spaces is prohibited: "Cannot cast non-generic pointer to different non-generic pointer". You must go through generic: addrspace(3) -> addrspace(0) -> addrspace(1).
- Source and target must be valid. The verifier rejects invalid address space IDs with "Invalid target address space" / "Invalid source address space".
- Alloca must be in generic. "Allocas are not supported on address spaces except Generic" -- alloca produces AS 0 pointers; MemorySpaceOpt later promotes them to AS 5.
- Tensor memory (AS 6) rejects load/store. "Tensor Memory loads/stores are not supported" -- AS 6 memory must be accessed through TMA intrinsics (cp.async.bulk.*), not regular load/store instructions.
- cmpxchg is restricted. "cmpxchg pointer operand must point to generic, global, or shared address space" -- atomic compare-exchange only supports AS 0, AS 1, and AS 3, with i32/i64/i128 operand types.
cvta Intrinsic Mapping
The PTX cvta (Convert Virtual Address) instructions are lowered through intrinsic IDs in the EDG frontend (sub_94A030):
| Intrinsic ID Range | Direction | Address Space |
|---|---|---|
| 0xC1 (193) | Generic -> Specific | Shared (AS 3) |
| 0xC2 (194) | Generic -> Specific | Constant (AS 4) |
| 0xC3 (195) | Generic -> Specific | Local (AS 5) |
| 0xC4 (196) | Generic -> Specific | Global (AS 1) |
| 0xC5 (197) | Specific -> Generic | Shared (AS 3) |
| 0xC6 (198) | Specific -> Generic | Constant (AS 4) |
| 0xC7 (199) | Specific -> Generic | Local (AS 5) |
| 0xC8 (200) | Specific -> Generic | Global (AS 1) |
The specific-to-generic direction emits addrspacecast (opcode 0x30). The generic-to-specific direction uses a store-to-temp followed by a load with the target address space annotation.
SelectionDAG Address Space Encoding
The SelectionDAG backend uses a secondary address space encoding for the .param passing convention. In sub_33B0210 (intrinsic lowering within the SelectionDAG), pointer arguments use this mapping:
| SelectionDAG Code | LLVM AS | PTX Space |
|---|---|---|
| 1 | 1 (global) | .global |
| 2 | 3 (shared) | .shared |
| 3 | 4 (constant) | .const |
| 4 | 5 (local) | .local |
| 5 | -- | .param (not a real AS, lowered to param window) |
| 7 | 7 (shared cluster) | .shared::cluster |
Stack operations (SelectionDAG opcode 16, StackAlloc) explicitly use AS 7 for the .param-like space when lowering stack frames via sub_33FF780(dag, ..., 7, 0, 1, 0).
Internal Address Spaces (Non-Physical)
AS 25 -- Device Linkage Annotation
Address space 25 is used by the module summary pass (sub_1C28690 in p2-H01-nvmodule-summary.txt) to tag functions and variables with __device__ linkage during inter-module resolution. When a function's type resolves to AS 25, it indicates the symbol has device-side linkage and requires device-side extern resolution. This address space never appears in emitted PTX -- it is consumed during linking and stripped before codegen.
AS 53 -- MemorySpaceOpt Global Annotation
During pass initialization (sub_1CAB590), MemorySpaceOpt filters module globals that carry address space 53 and registers them into internal tracking structures. This appears to be an annotation mechanism for marking globals that require special address space analysis. Like AS 25, this address space is internal and does not survive to PTX emission.
Shared Memory Specializations by SM Generation
| SM | Shared Memory Size | Cluster Support | AS 7 Available | Shared Memory Pointer |
|---|---|---|---|---|
| SM 70 (Volta) | 96 KB configurable with L1 | No | No | 32-bit (when +sharedmem32bitptr) |
| SM 80 (Ampere) | 164 KB configurable | No | No | 32-bit |
| SM 86 (Ampere GA10x) | 100 KB configurable | No | No | 32-bit |
| SM 89 (Ada) | 100 KB configurable | No | No | 32-bit |
| SM 90 (Hopper) | 228 KB configurable | Yes | Yes | 32-bit |
| SM 100 (Blackwell) | 228 KB configurable | Yes | Yes | 32-bit |
With SM 90+, __shared__ variables accessed with cluster scope use .shared::cluster (AS 7), which provides cross-CTA access within a cooperative thread array cluster. Regular intra-CTA shared access remains on AS 3 (.shared). The EarlyCSE pass (sub_2781BB6) detects AS 7 stores and applies conservative aliasing to prevent CSE across shared cluster barriers.
isspacep Intrinsics
The PTX isspacep instruction tests at runtime whether a generic pointer points to a specific address space. cicc represents these as intrinsics with builtin IDs 0xFD0-0xFD5:
| Builtin ID | PTX | Tests for |
|---|---|---|
| 0xFD0 | isspacep.global | Global (AS 1) |
| 0xFD1 | isspacep.shared | Shared (AS 3) |
| 0xFD2 | isspacep.local | Local (AS 5) |
| 0xFD3 | isspacep.const | Constant (AS 4) |
| 0xFD4 | isspacep.shared::cta | Shared CTA-local (AS 3, SM 90+) |
| 0xFD5 | isspacep.shared::cluster | Shared cluster (AS 7, SM 90+) |
MemorySpaceOpt's second-time resolver (sub_1CA9E90) folds these to compile-time constants when the pointer's address space is already known: isspacep.shared(%p) where %p is proven to be AS 3 folds to true. This eliminates runtime address space checks from conditional code patterns like:
if (__isShared(p))
atomicAdd_shared(p, val);
else
atomicAdd(p, val);
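The folding rule reduces to a three-way decision. The fold_isspacep helper below is a hypothetical transcription of the behavior described above, not the sub_1CA9E90 code:

```c
/* isspacep constant folding per the text: if the pointer's address
 * space is proven, the runtime test becomes a compile-time constant;
 * if it is still generic, the test must remain. Hypothetical sketch. */
typedef enum { FoldUnknown = -1, FoldFalse = 0, FoldTrue = 1 } FoldResult;

FoldResult fold_isspacep(int proven_as, int queried_as) {
    if (proven_as == 0)
        return FoldUnknown;   /* still generic: keep the runtime isspacep */
    return proven_as == queried_as ? FoldTrue : FoldFalse;
}
```

In the snippet above, once %p is proven to be AS 3, the __isShared guard folds to true, the else-arm becomes dead, and the branch disappears entirely.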
Configuration Knobs Affecting Address Spaces
| Knob | Default | Effect |
|---|---|---|
| nvptx-short-ptr | -- | Enable 32-bit pointers for shared/const/local |
| nvptx-32-bit-smem | -- | Same effect as above (alias) |
| param-always-point-to-global | true | Resolve ambiguous param pointers to global |
| mem-space-alg | 2 | Algorithm selection for MemorySpaceOpt (2 = default, others select alternate impl at sub_2CBBE90) |
| track-indir-load | true | Track pointers loaded from memory during address space analysis |
| track-int2ptr | true | Track inttoptr casts during analysis |
| nvptx-traverse-address-aliasing-limit | 6 | Max depth for NVVM AA getAddressSpace traversal |
| do-clone-for-ip-msp | -1 (unlimited) | Max function clones for inter-procedural specialization |
| process-alloca-always | true | Treat alloca as definite local (AS 5) |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| MemorySpaceOpt pass entry | sub_1C70910 | -- | Mode dispatch, IP-MSP worklist driver |
| Per-BB instruction scanner | sub_1CA8CD0 | -- | AS-to-bitmask mapping switch |
| Use-def chain walker | sub_1CA5350 | -- | Backward pointer origin tracking |
| First-time resolver | sub_1CA2920 | -- | Conservative address space resolution |
| Second-time resolver | sub_1CA9E90 | -- | Hash-table-based resolution, isspacep folding |
| MemorySpaceCloning engine | sub_2CBBE90 | -- | Inter-procedural function cloning (71KB) |
| IPMSP module pass variant | sub_1C6A6C0 | -- | LIBNVVM path (54KB) |
| EDG cvta lowering | sub_94A030 | -- | Address space cast intrinsic generation |
| EDG decl-side memspace processing | sub_6582F0 | -- | CUDA attribute to memory space code resolution |
| EDG def-side memspace processing | sub_65F400 | -- | Definition validation and initializer handling |
| NVVMModuleVerifier | sub_2C80C90 | -- | Data layout and address space validation |
| NVVMIntrinsicVerifier | sub_2C7B6A0 | -- | Per-intrinsic address space constraint checking |
| SelectionDAG intrinsic lowering | sub_33B0210 | -- | Backend AS mapping for param passing |
| getPointerAlignmentBits | sub_BD5420 | -- | Known-bits for address space pointer widths |
| NVIDIA intrinsic known-bits oracle | sub_F0C4B0 | -- | Special register ranges |
Cross-References
- Memory Space Optimization -- Two-phase address space resolver, bitmask dataflow, function cloning
- IPMSP -- Inter-procedural memory space propagation, worklist algorithm
- Alias Analysis & NVVM AA -- Address space disjointness, AA chain, !noalias.addrspace
- NVPTX Target Infrastructure -- Data layout strings, +sharedmem32bitptr feature, TTI hooks
- KnownBits & DemandedBits -- Address space pointer width in known-bits, DemandedBits narrowing
- NVVM Verifier -- addrspacecast rules, tensor memory restriction, cmpxchg constraints
- EDG Frontend -- CUDA memory space attributes (__shared__, __constant__, __device__)
- SelectionDAG -- Backend address space encoding for param passing
- IV Demotion -- Exploits 32-bit shared memory pointers for induction variable narrowing
- EarlyCSE -- Shared cluster (AS 7) store handling
NVPTX Register Classes
This page is the single authoritative reference for the nine NVPTX register classes used throughout cicc v13.0. Register class tables previously duplicated in Register Allocation, Register Coalescing, PTX Emission, and AsmPrinter are consolidated here. When those pages reference register classes, they should cross-reference this page rather than maintaining inline copies.
| Register encoding | sub_21583D0 (4.6KB) |
| PTX type suffix map | sub_2163730 (1.7KB) |
| PTX prefix map | sub_21638D0 (1.6KB) |
| Copy opcode dispatch | sub_2162350 (3.0KB) |
| Register info init (legacy) | sub_2163AB0 / sub_2149CD0 |
| Register info init (new PM) | sub_30590F0 / sub_301F0C0 |
| Register decl emission | sub_2158E80 (17KB) |
| Internal-only class vtable | off_4A026E0 |
The Nine Register Classes
NVPTX defines nine register classes that participate in PTX code generation. Each class is identified at runtime by its vtable pointer, which sub_2163730 and sub_21638D0 use as a switch key to produce the PTX type suffix and register prefix respectively. The encoding function sub_21583D0 maps each class to a 4-bit tag that occupies bits [31:28] of the 32-bit encoded register ID.
| Tag | Vtable | Class Name | PTX Type | Prefix | Encoded ID | Width | Description |
|---|---|---|---|---|---|---|---|
| 1 | off_4A027A0 | Int1Regs | .pred | %p | 0x10000000 | 1 | Predicate (boolean) |
| 2 | off_4A02720 | Int16Regs | .b16 | %rs | 0x20000000 | 16 | Short integer |
| 3 | off_4A025A0 | Int32Regs | .b32 | %r | 0x30000000 | 32 | General-purpose integer |
| 4 | off_4A024A0 | Int64Regs | .b64 | %rd | 0x40000000 | 64 | Double-width integer |
| 5 | off_4A02620 | Float32Regs | .f32 | %f | 0x50000000 | 32 | Single-precision float |
| 6 | off_4A02520 | Float64Regs | .f64 | %fd | 0x60000000 | 64 | Double-precision float |
| 7 | off_4A02760 | Int16HalfRegs | .b16 | %h | 0x70000000 | 16 | Half-precision float (f16, bf16) |
| 8 | off_4A026A0 | Int32HalfRegs | .b32 | %hh | 0x80000000 | 32 | Packed pair (v2f16, v2bf16, v2i16, v4i8) |
| 9 | off_4A02460 | Int128Regs | .b128 | %rq | 0x90000000 | 128 | 128-bit wide (tensor core) |
Naming Discrepancy
Two naming conventions exist in the codebase, depending on whether the name was recovered from the emission functions or from the register allocator context:
| Vtable | Emission name (sub_2163730/sub_21638D0) | RA-context name (sub_2162350) | Resolution |
|---|---|---|---|
| off_4A02760 | Int16HalfRegs | Float16Regs | Same class. The emission functions use the TableGen-derived name Int16HalfRegs; the RA raw report uses the semantic alias Float16Regs. Both refer to off_4A02760. |
| off_4A026A0 | Int32HalfRegs | Float16x2Regs | Same class. Int32HalfRegs is the TableGen name; Float16x2Regs is the semantic alias. Both refer to off_4A026A0. |
| off_4A02460 | Int128Regs | SpecialRegs | Different raw reports assigned different names to off_4A02460. The emission report identifies it as Int128Regs (based on the .b128 type and %rq prefix). The earlier RA sweep report labeled it SpecialRegs. The emission-derived name Int128Regs is more accurate: .b128 / %rq is used for 128-bit tensor-core values (i128 on SM 70+), not for special/environment registers. |
The tenth vtable off_4A026E0 is present in the binary but returns "!Special!" from both sub_2163730 and sub_21638D0. It is never assigned an encoded ID and never participates in register declaration emission. It is an internal-only sentinel class used within NVPTXRegisterInfo initialization (string "ENVREG10" at register info offset +72).
Throughout this wiki, the emission-derived names (Int16HalfRegs, Int32HalfRegs, Int128Regs) are canonical. Pages written before this consolidation may use the RA-context aliases.
Register Encoding Scheme -- sub_21583D0
Every virtual register in the NVPTX backend is encoded as a 32-bit value that packs the register class and a per-class index into a single integer. The encoding function at sub_21583D0 (4.6KB) implements this:
encoded_register = class_tag | (register_index & 0x0FFFFFFF)
The bit layout:
31 28 27 0
+------+-------------------------------+
| class| register index |
| tag | (28 bits) |
+------+-------------------------------+
- Bits [31:28] -- 4-bit class tag, values 0x1 through 0x9 as listed in the table above.
- Bits [27:0] -- 28-bit register index within that class, supporting up to 268 million registers per class.
The function operates in two modes:
- Physical register (register_id >= 0): Returns the raw index directly (low 28 bits). Physical registers on NVPTX are a vestigial concept -- the target has no fixed register file -- but LLVM's infrastructure requires them for reserved registers like %SP and %SPL.
- Virtual register (register_id < 0, i.e., bit 31 set in LLVM's internal convention): Looks up the register class from the MachineRegisterInfo register map, matches the class vtable against the nine known vtable addresses, and returns class_encoded_id | (register_index & 0x0FFFFFFF).
If the vtable does not match any of the nine known classes, the function triggers a fatal error:
"Bad register class"
This is a hard abort, not a recoverable diagnostic. It indicates that either a new register class was added without updating the encoding function, or memory corruption produced an invalid vtable pointer.
Why Bits [31:28] and Not Bits [31:29]
LLVM's standard convention uses bit 31 (0x80000000) to distinguish physical from virtual registers internally. The NVPTX encoding reclaims this bit as part of the class tag because after encoding, the distinction between physical and virtual is no longer meaningful -- all registers in emitted PTX are virtual. Tag value 0x8 (Int32HalfRegs) has bit 31 set, which would collide with LLVM's virtual-register marker. This works because the encoding is applied only during emission, after register allocation is complete and the physical/virtual distinction is irrelevant.
Complete Class Separation
The nine register classes are completely disjoint. There is no cross-class interference: an Int32Regs register (%r) never conflicts with a Float32Regs register (%f) even though both are 32 bits wide. This is a fundamental consequence of PTX's typed register model. In PTX, .reg .b32 %r0 and .reg .f32 %f0 are distinct storage locations from ptxas's perspective. Two implications follow:
- No cross-class coalescing. The register coalescer at sub_34AF4A0 enforces a same-class check on every coalescing candidate. Cross-class copies (e.g., a bitcast from i32 to f32) must survive as explicit mov instructions in the emitted PTX.
- Per-class pressure accounting. The greedy register allocator at sub_2F5A640 tracks register pressure per class independently. The -maxreg limit bounds total live registers across all classes combined, but interference within any single class never spills over to another.
This is unlike CPU targets (x86, AArch64) where integer and floating-point registers can alias through sub-register relationships, or where a single physical register appears in multiple register classes.
Copy Opcodes -- sub_2162350
The function sub_2162350 (3.0KB, "Copy one register into another with a different width") dispatches copy instruction emission based on the source and destination register classes. Each class has two opcodes: one for same-class copies (e.g., mov.b32 %r1, %r0) and one for cross-class copies (e.g., bitcasting between Int32Regs and Float32Regs):
| Class | Same-Class Opcode | Cross-Class Opcode | Notes |
|---|---|---|---|
| Int1Regs | 39424 | 39424 | No distinct cross-class path |
| Int16Regs | 39296 | 39296 | No distinct cross-class path |
| Int32Regs | 39552 | 10816 | Cross = mov.b32 bitcast to float |
| Int64Regs | 39680 | 11008 | Cross = mov.b64 bitcast to double |
| Float32Regs | 30656 | 10880 | Cross = mov.b32 bitcast to integer |
| Float64Regs | 30784 | 11072 | Cross = mov.b64 bitcast to integer |
| Int16HalfRegs | 30528 | 10688 | Cross = mov.b16 half-to-short |
| Int32HalfRegs | 39552 | 39552 | Uses same opcode as Int32Regs same-class |
| Int128Regs | 39168 | 39168 | No distinct cross-class path |
Classes where both opcodes are identical (Int1Regs, Int16Regs, Int32HalfRegs, Int128Regs) have no meaningful cross-class copy path. For predicates (Int1Regs), this is because there is no other 1-bit type. For 128-bit registers, tensor-core values have no peer class to bitcast into. The Int32HalfRegs class shares its same-class opcode (39552) with Int32Regs because both emit .b32 copies -- the packed v2f16 value is simply treated as a 32-bit bitpattern for copying.
The five classes with distinct cross-class opcodes (Int32Regs, Int64Regs, Float32Regs, Float64Regs, Int16HalfRegs) are exactly those that participate in bitcast operations between integer and floating-point interpretations of the same bit width.
Register Declaration Emission -- sub_2158E80
During function body emission, sub_2158E80 (17KB) emits .reg declarations for every register class used by the function. The process:
- Iterate the register map at this+800 in the AsmPrinter state.
- Deduplicate classes using a hash table at this+808..832.
- Track the maximum index per class across all virtual registers.
- Emit one declaration per class in the format:
.reg .pred %p<5>; // 5 predicate registers (indices 0..4)
.reg .b16 %rs<12>; // 12 short integer registers
.reg .b32 %r<47>; // 47 general-purpose 32-bit
.reg .b64 %rd<8>; // 8 double-width integer
.reg .f32 %f<20>; // 20 single-precision float
.reg .f64 %fd<3>; // 3 double-precision float
.reg .b16 %h<4>; // 4 half-precision float
.reg .b32 %hh<2>; // 2 packed-pair registers
.reg .b128 %rq<1>; // 1 tensor-core 128-bit register
The count for each class is max_register_index + 1. The PTX declaration syntax %prefix<N> declares registers %prefix0 through %prefix(N-1).
Note that Int16HalfRegs and Int16Regs share the same PTX type suffix (.b16) but have different prefixes (%h vs %rs). Similarly, Int32HalfRegs and Int32Regs share .b32 but use %hh vs %r. The PTX assembler ptxas treats these as completely separate register namespaces -- the prefix, not the type, determines the namespace.
Stack pointer registers (%SP, %SPL) are emitted before the class declarations when the function has a non-zero local frame. These use .b64 in 64-bit mode or .b32 in 32-bit mode.
Per-Class Detail
Int1Regs -- Predicates
| Property | Value |
|---|---|
| Vtable | off_4A027A0 |
| PTX type | .pred |
| Prefix | %p |
| Tag | 0x1 |
| Width | 1 bit |
| Legal MVTs | i1 |
| Same-class copy | 39424 |
Predicate registers hold boolean values used for conditional branches (@%p1 bra target), select instructions (selp), and set-predicate results (setp). They are the only 1-bit registers in PTX. There is no cross-class copy path because no other class holds 1-bit values. The coalescer excludes predicates from cross-class analysis entirely.
Int16Regs -- Short Integers
| Property | Value |
|---|---|
| Vtable | off_4A02720 |
| PTX type | .b16 |
| Prefix | %rs |
| Tag | 0x2 |
| Width | 16 bits |
| Legal MVTs | i16 |
| Same-class copy | 39296 |
Short integer registers hold 16-bit integer values. PTX .param space widens all scalars below 32 bits to .b32, so %rs registers appear primarily in computation, not in function signatures. The prefix %rs (register-short) distinguishes these from %h (Int16HalfRegs) even though both declare as .b16.
Int32Regs -- General-Purpose 32-bit
| Property | Value |
|---|---|
| Vtable | off_4A025A0 |
| PTX type | .b32 |
| Prefix | %r |
| Tag | 0x3 |
| Width | 32 bits |
| Legal MVTs | i32 |
| Same-class copy | 39552 |
| Cross-class copy | 10816 |
The workhorse register class. Holds 32-bit integers, addresses in 32-bit mode, loop indices, and general computation results. Cross-class copy opcode 10816 handles bitcast to Float32Regs (%f).
Int64Regs -- Double-Width Integer
| Property | Value |
|---|---|
| Vtable | off_4A024A0 |
| PTX type | .b64 |
| Prefix | %rd |
| Tag | 0x4 |
| Width | 64 bits |
| Legal MVTs | i64 |
| Same-class copy | 39680 |
| Cross-class copy | 11008 |
Holds 64-bit integers and device pointers in 64-bit mode (the common case). Cross-class copy opcode 11008 handles bitcast to Float64Regs (%fd).
Float32Regs -- Single-Precision Float
| Property | Value |
|---|---|
| Vtable | off_4A02620 |
| PTX type | .f32 |
| Prefix | %f |
| Tag | 0x5 |
| Width | 32 bits |
| Legal MVTs | f32 |
| Same-class copy | 30656 |
| Cross-class copy | 10880 |
Holds IEEE 754 single-precision floats. Note the .f32 type suffix rather than .b32 -- PTX distinguishes float from bitwise register types even at the same width. Cross-class copy opcode 10880 handles bitcast to Int32Regs (%r).
Float64Regs -- Double-Precision Float
| Property | Value |
|---|---|
| Vtable | off_4A02520 |
| PTX type | .f64 |
| Prefix | %fd |
| Tag | 0x6 |
| Width | 64 bits |
| Legal MVTs | f64 |
| Same-class copy | 30784 |
| Cross-class copy | 11072 |
Holds IEEE 754 double-precision floats. Cross-class copy opcode 11072 handles bitcast to Int64Regs (%rd).
Int16HalfRegs -- Half-Precision Float
| Property | Value |
|---|---|
| Vtable | off_4A02760 |
| PTX type | .b16 |
| Prefix | %h |
| Tag | 0x7 |
| Width | 16 bits |
| Legal MVTs | f16, bf16 |
| Same-class copy | 30528 |
| Cross-class copy | 10688 |
Despite the Int16 in the TableGen-derived name, this class holds half-precision floating-point values (f16 and bf16). The .b16 PTX type (bitwise 16-bit) is used rather than a hypothetical .f16 because PTX's type system uses .b16 for all 16-bit values that are not short integers. The %h prefix distinguishes these registers from %rs (Int16Regs). Cross-class copy opcode 10688 handles conversion to Int16Regs.
The semantic alias Float16Regs appears in some wiki pages and is equally valid.
Int32HalfRegs -- Packed Half-Precision Pairs
| Property | Value |
|---|---|
| Vtable | off_4A026A0 |
| PTX type | .b32 |
| Prefix | %hh |
| Tag | 0x8 |
| Width | 32 bits |
| Legal MVTs | v2f16, v2bf16, v2i16, v4i8 |
| Same-class copy | 39552 |
| Cross-class copy | 39552 |
This is the only register class for vector types on NVPTX. It holds exactly 32 bits of packed data: two f16 values, two bf16 values, two i16 values, or four i8 values. The %hh prefix distinguishes it from %r (Int32Regs). Both same-class and cross-class copy opcodes are 39552 (identical to Int32Regs same-class), because copies of packed values are simple 32-bit bitwise moves.
All vector types wider than 32 bits (v4f32, v2f64, v8i32, etc.) are illegal on NVPTX and must be split or scalarized during type legalization. See the vector legalization documentation for the split/scalarize dispatch.
The semantic alias Float16x2Regs appears in some wiki pages.
Int128Regs -- 128-bit Tensor Core Values
| Property | Value |
|---|---|
| Vtable | off_4A02460 |
| PTX type | .b128 |
| Prefix | %rq |
| Tag | 0x9 |
| Width | 128 bits |
| Legal MVTs | i128 (SM 70+) |
| Same-class copy | 39168 |
| Cross-class copy | 39168 |
The widest register class, introduced for tensor core operations on Volta (SM 70) and later architectures. Holds 128-bit values used as operands and accumulators in mma and wmma instructions. The %rq prefix stands for "register quad" (4x32 bits). There is no cross-class copy path because no other class holds 128-bit values.
During register coalescing, 128-bit values are tracked as wide register pairs (two 64-bit halves). The coalescer at sub_3497B40 handles paired-register decomposition: when coalescing the low half, the high half inherits corresponding constraints.
An earlier raw report (p2c.5-01-register-alloc.txt) labeled off_4A02460 as SpecialRegs. This was an error in that report's identification. The vtable off_4A02460 emits .b128 / %rq, which is the 128-bit class for tensor core values, not a class for special/environment registers.
The Internal-Only Class -- off_4A026E0
| Property | Value |
|---|---|
| Vtable | off_4A026E0 |
| PTX type | "!Special!" |
| Prefix | "!Special!" |
| Encoded ID | None |
A tenth vtable address appears in the register info initialization path (sub_2163AB0). Both sub_2163730 and sub_21638D0 return the sentinel string "!Special!" for this vtable. It has no encoded ID, no PTX declaration, and never produces emitted registers. The string "ENVREG10" at register info offset +72 (alongside "Int1Regs" at offset +80) suggests this class is associated with environment registers -- hardware-defined read-only registers like %tid, %ctaid, %ntid, etc. These are emitted by dedicated special-register emission functions (sub_21E86B0, sub_21E9060) rather than through the register class encoding path.
Register Info Initialization
NVPTXRegisterInfo objects are created by two factory functions corresponding to the two pass manager generations:
| Legacy PM | New PM | |
|---|---|---|
| Factory | sub_2149CD0 | sub_301F0C0 |
| Init | sub_2163AB0 | sub_30590F0 |
| Object size | 224 bytes | 248 bytes |
Both call sub_1F4A910 (TargetRegisterInfo::InitMCRegisterInfo) with the register descriptor table at off_49D26D0 and register unit data at unk_4327AF0. Key fields in the initialized structure:
| Offset | Content |
|---|---|
| +44 | NumRegs (total register count) |
| +72 | "ENVREG10" (environment register class name) |
| +80 | "Int1Regs" (first register class name) |
| +96 | numRegClasses (initially 1, expanded during init) |
Coalescing Constraints
The register coalescer imposes these constraints based on register class:
| Class | Coalesceable | Constraint Flag (offset +3, mask 0x10) |
|---|---|---|
| Int1Regs | Same class only | Set |
| Int16Regs | Same class only | Set |
| Int32Regs | Same class only | Set (type code 12) |
| Int64Regs | Same class only | Set (type code 13) |
| Float32Regs | Same class only | Set (type code 15) |
| Float64Regs | Same class only | Set |
| Int16HalfRegs | Same class only | Set |
| Int32HalfRegs | Same class only | Set |
| Int128Regs | Never coalesced | Cleared |
Int128Regs (the class at off_4A02460, previously mislabeled SpecialRegs in the coalescing page) has its constraint flag cleared, excluding it from the coalescing worklist entirely. This makes sense: tensor-core 128-bit values have specific register-pair relationships that the coalescer must not disturb.
Cross-class copies between Int32Regs/Float32Regs and between Int64Regs/Float64Regs are bitcasts that the coalescer never eliminates -- they must survive as explicit PTX mov instructions because the source and destination live in different register namespaces.
Differences from Upstream LLVM NVPTX
The upstream LLVM NVPTX backend (as of LLVM 20.0.0) defines these register classes in NVPTXRegisterInfo.td:
- Int1Regs, Int16Regs, Int32Regs, Int64Regs -- identical.
- Float16Regs, Float16x2Regs -- upstream names for cicc's Int16HalfRegs / Int32HalfRegs. The rename reflects NVIDIA's preference for the TableGen-derived integer-typed names.
- Float32Regs, Float64Regs -- identical.
- Int128Regs -- present in upstream, matches cicc.
- No SpecialRegs class in upstream. Special registers are handled through dedicated physical registers, not a register class.
- No off_4A026E0 internal-only class in upstream.
The encoding scheme (4-bit tag in [31:28], 28-bit index in [27:0]) and the fatal "Bad register class" error path are NVIDIA additions not present in upstream LLVM's NVPTX backend, which relies on standard MCRegisterInfo encoding.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Register class encoding (class tag OR index) | sub_21583D0 | 4.6KB | -- |
| Register class -> PTX type suffix (.pred, .b32, .f32, ...) | sub_2163730 | 1.7KB | -- |
| Register class -> PTX prefix (%p, %r, %f, ...) | sub_21638D0 | 1.6KB | -- |
| Copy opcode dispatch by register class | sub_2162350 | 3.0KB | -- |
| Stack frame + register declaration emission | sub_2158E80 | 17KB | -- |
| NVPTXRegisterInfo init (legacy PM) | sub_2163AB0 | 1.1KB | -- |
| NVPTXRegisterInfo factory (legacy PM) | sub_2149CD0 | -- | -- |
| NVPTXRegisterInfo init (new PM) | sub_30590F0 | -- | -- |
| NVPTXRegisterInfo factory (new PM) | sub_301F0C0 | -- | -- |
| TargetRegisterInfo::InitMCRegisterInfo | sub_1F4A910 | -- | -- |
| Special register emission (%tid, %ctaid, %ntid, %nctaid) | sub_21E86B0 | -- | -- |
| Cluster register emission (SM 90+) | sub_21E9060 | -- | -- |
Cross-References
- Register Allocation -- greedy RA that operates on these classes; pressure tracking and the -maxreg constraint
- Register Coalescing -- same-class-only coalescing policy, copy opcode classification
- PTX Emission -- function header orchestrator that calls the register declaration emitter
- AsmPrinter -- per-instruction emission that calls the encoding function
- Type Legalization -- vector type legalization driven by the Int32HalfRegs-only vector model
- NVPTX Target Infrastructure -- NVPTXTargetMachine that owns the register info objects
NVPTX Machine Opcode Reference
This page is the master reference for NVPTX MachineInstr opcodes as they exist in cicc v13.0. These are the target-specific opcode numbers assigned during instruction selection and consumed by register allocation, instruction scheduling, the AsmPrinter, and every other machine-level pass. They are distinct from both LLVM IR opcodes (which live in the Instruction hierarchy) and from ISD/NVPTXISD SelectionDAG node opcodes (which exist only during lowering and are erased by ISel). A MachineInstr's opcode field is the 16-bit value at MachineInstr offset +68, and it indexes into the MCInstrDesc table to obtain operand counts, constraint classes, implicit defs/uses, and scheduling information.
| Constraint table | word_3F3E6C0 (static .data array of 16-bit entries) |
| Constraint emitter | sub_B612D0 (104KB, 179-case switch) |
| Copy-type mapper | sub_3494EA0 (12.7KB, maps opcodes 1--0x12 to families 440--503) |
| Register class builder | sub_B5BA00 (21KB, 111 cases) |
| Operand type classifier | sub_34961A0 (26.6KB, reads byte_444C4A0) |
| ISel entry | sub_3090F90 (91KB, NVPTXDAGToDAGISel::Select) |
| Intrinsic lowering switch | sub_33B0210 (343KB, hundreds of NVVM intrinsics) |
Opcode Numbering Scheme
Opcodes 0--approximately 430 correspond to generic LLVM TargetOpcode values and standard LLVM machine pseudo-instructions (COPY, PHI, IMPLICIT_DEF, INLINEASM, etc.). These are identical to upstream LLVM 20.0.0. NVPTX target-specific opcodes begin around opcode 440 and extend into the thousands. The highest confirmed opcode numbers are in the 4900+ range (tcgen05 tensor core instructions for Blackwell).
The opcode numbering is generated by TableGen from the NVPTX .td instruction definitions and compiled into the MCInstrDesc table. Since cicc is a stripped binary, the symbolic names are lost. The identifications below come from behavioral analysis: matching the constraint table patterns, AsmPrinter string emission, and SelectionDAG lowering code against known PTX instruction semantics.
The Constraint Table: word_3F3E6C0
Every NVPTX machine opcode has an entry in the global constraint table at word_3F3E6C0. This is a flat array of 16-bit words, indexed by (opcode - 1). Each word packs two fields:
| Bits | Field | Purpose |
|---|---|---|
| [7:0] (low byte) | constraint_class | Index into the 179-case switch in sub_B612D0 |
| [15:8] (high byte) | register_class_id | Target register class for the instruction's primary result |
The access pattern, decompiled from sub_B612D0:
uint16_t entry = word_3F3E6C0[opcode - 1];
uint8_t constraint_class = entry & 0xFF; // low byte
uint8_t register_class = (entry >> 8) & 0xFF; // high byte
switch (constraint_class) {
case 0x00: ... // simple 2-input ALU
case 0x01: ... // 3-input FMA
...
case 0xB2: ... // maximum observed class
}
The constraint class determines how many operands the instruction has, what register class each operand belongs to, and which operands are tied. Each case in the switch constructs a stack-allocated array of 16-byte constraint descriptors (see Pattern Database for the full descriptor layout) and calls sub_A78010 to emit them.
179 Constraint Classes
The constraint classes range from 0x00 through 0xB2 (179 values). Each class represents a distinct operand signature. Representative patterns:
| Class Range | Pattern | Descriptor Count | Typical Instructions |
|---|---|---|---|
| 0x00--0x0F | Simple ALU (2 inputs, 1 output) | 3 | add, sub, mul, and, or, xor |
| 0x10--0x1F | Ternary (3 inputs, 1 output) | 4 | fma, madc, selp |
| 0x20--0x3F | Load/store variants | 2--5 | ld, st with address space and vector width |
| 0x40--0x5F | Conversion and move | 2--3 | cvt, mov, bitcast |
| 0x60--0x7F | Atomic and barrier | 3--6 | atom.*, membar, fence |
| 0x80--0x9F | Texture/surface | 4--12 | tex.*, sust.*, suld.* |
| 0xA0--0xAF | Tensor core (MMA) | 6--16 | hmma, imma, wmma, mma |
| 0xB0 | Maximum operand (17 inputs) | 18 | Complex intrinsic (opcode 176) |
| 0xB1--0xB2 | Miscellaneous high-operand-count | variable | Specialized instructions |
The maximum observed operand count is 17 (constraint class 0xB0, associated with opcode 176), requiring 18 descriptor entries (17 inputs + 1 output) and 288 bytes of stack space in the constraint emitter's frame.
Register Class IDs in the High Byte
The high byte of each word_3F3E6C0 entry identifies the register class for the instruction's result. These IDs map to NVPTX's typed virtual register files:
| ID | Register Class | PTX Type | PTX Prefix | Vtable Address |
|---|---|---|---|---|
| 14 | Int32Regs | .b32 | %r | off_4A025A0 |
| 22 | Int16Regs | .b16 | %rs | off_4A02720 |
| 40 | Float32Regs | .f32 | %f | off_4A02620 |
| 43 | Float16Regs | .b16 | %h | off_4A02760 |
| 50 | Int64Regs | .b64 | %rd | off_4A024A0 |
| 51 | Float64Regs | .f64 | %fd | off_4A02520 |
| 52 | Int128Regs | .b128 | %rq | off_4A02460 |
| 78 | PredRegs | .pred | %p | off_4A027A0 |
| 86 | SpecialRegs | (varies) | (varies) | off_4A026E0 |
Additional register class IDs observed in the constraint table (24, 27, 29, 32, 36, 39, 41, 67, 72, 76) likely correspond to sub-classes or aliased classes (e.g., Int32HalfRegs with ID related to 32 and prefix %hh), but their exact mappings have not been recovered. Instructions that produce no register result (stores, barriers, calls) have a zero or don't-care value in the high byte.
Identified Opcode Families
The following sections catalog every opcode range where the binary-to-PTX mapping has been confirmed. Opcodes are grouped by functional family. Where an opcode's identity is uncertain, it is marked with a question mark.
Copy and Move Family (440--503)
These are the NVPTX-specific copy instructions that the NVPTX register coalescer at sub_34AF4A0 processes. The standard LLVM RegisterCoalescer handles only the generic COPY pseudo (a generic TargetOpcode, not in this range); the NVPTX coalescer handles these target-specific copy families in a second pass.
The mapping function sub_3494EA0 contains a switch statement that classifies internal opcode IDs (1--0x12) into copy families:
| Opcode Range | Family | Description |
|---|---|---|
| 440--443 | Type-preserving moves | Same-class copies: i32-to-i32, i64-to-i64, f32-to-f32, f64-to-f64. These map from operand type codes 12, 13, 15 in the byte_444C4A0 classification table. |
| 444--470 (approx.) | Cross-class moves | Bitcasting copies between register classes (e.g., i32 to f32). These survive coalescing as explicit mov instructions in PTX because the source and destination register types differ. |
| 471--490 (approx.) | Paired/wide moves | 128-bit register pair copies for tensor core paths. The low and high halves are tracked jointly by sub_3497B40. |
| 491--503 (approx.) | ABI parameter copies | .param-related copies at call boundaries. These arise from the calling convention and are prime targets for coalescing. |
The byte_444C4A0 operand-type classification table (16-byte entries, indexed by MVT enum) feeds the coalescer's type check:
struct OperandTypeEntry { // 16 bytes at byte_444C4A0[16 * mvt - 16]
uint8_t type_code; // +0: 12=i32, 13=i64, 15=f32, etc.
uint8_t size_class; // +1: size in register-width units
uint8_t register_bank; // +2: bank identifier
uint8_t constraint_flags; // +3: bit 0x10 = participates in coalescing
uint8_t reserved[12]; // +4: padding
};
The constraint flag at offset +3 (mask 0x10) gates whether an operand participates in coalescing. Operands without this bit set (e.g., SpecialRegs) are excluded from the coalescer's worklist entirely.
Call ABI Family (505--573)
These opcodes implement the PTX .param-space calling convention. They are emitted by NVPTXTargetLowering::LowerCall (sub_3040BF0, 88KB) and form the backbone of every device-function call sequence.
| Opcode | Name | PTX Equivalent | Operands |
|---|---|---|---|
| 315 | CallSeqBegin | (pseudo) call frame setup | chain, seq_id, zero |
| 316 | CallSeqEnd_Outer | (pseudo) outer call frame teardown | chain, glue, callee_ref, callee_ref_hi |
| 505 | DeclareParam | .param .align A .b8 param[N] | chain, alignment, param_index, byte_size |
| 506 | DeclareScalarParam | .param .bW paramN | chain, alignment, param_index, widened_size |
| 507 | DeclareRetParam | .param .align A .b8 retval[N] | chain, alignment, byte_size, zero |
| 508 | DeclareRetScalarParam | .param .bW retval | chain, 1, widened_size, zero |
| 510 | CallDirect | call (retval), func, (params) | chain, callee, params... |
| 511 | CallDirectNoProto | call func, (params) (old-style) | chain, callee, params... |
| 512 | CallIndirect | call (retval), %rd, (params) | chain, func_ptr, params... |
| 513 | CallIndirectNoProto | call %rd, (params) | chain, func_ptr, params... |
| 514 | CallStart | (pseudo) actual call emission point | CallProto result |
| 515 | LoadRetParam | ld.param.bW retvalN | call_result, 1, element_index |
| 516 | LoadRetParamLast | ld.param.bW retvalN (last) | call_result, 1, element_index |
| 517 | CallSeqEnd | (pseudo) inner call frame teardown | last_load, chain, flag |
| 518 | CallProto | .callprototype | chain, callee, proto_string |
| 521 | DeclareRetParam_Ext | .param for return (ext path) | CallSeqEnd result, seq_id |
| 527 | StoreCalleeRetAddr | (pseudo) callee return addr | chain, proto_symbol |
| 528 | StoreRetValToParam | st.param.bW retvalN (return) | chain, value, offset |
The call sequence follows a strict emission order:
CallSeqBegin(315)
for each argument:
DeclareParam(505) or DeclareScalarParam(506)
StoreV1/V2/V4(571/572/573) — store argument values
DeclareRetParam(507) or DeclareRetScalarParam(508) [if callee returns]
CallProto(518)
CallStart(514) [actual call point]
for each return value:
LoadRetParam(515) or LoadRetParamLast(516)
CallSeqEnd(517)
DeclareRetParam_Ext(521) [if prototype present]
CallSeqEnd_Outer(316)
Vector Load/Store Family (568--573)
These opcodes handle vectorized .param-space data movement, emitted during argument passing and return value extraction:
| Opcode | Name | PTX Equivalent | Vector Width |
|---|---|---|---|
| 568 | LoadV1 | ld.param.b32 / ld.param.b64 | 1 element |
| 569 | LoadV2 | ld.param.v2.b32 / ld.param.v2.b64 | 2 elements |
| 570 | LoadV4 | ld.param.v4.b32 / ld.param.v4.b64 | 4 elements |
| 571 | StoreV1 | st.param.b32 / st.param.b64 | 1 element |
| 572 | StoreV2 | st.param.v2.b32 / st.param.v2.b64 | 2 elements |
| 573 | StoreV4 | st.param.v4.b32 / st.param.v4.b64 | 4 elements |
The vector width selection logic in LowerCall (sub_3040BF0, lines 1429--1440):
accumulated_operand_count == 3 -> StoreV1 (571), width=1
accumulated_operand_count == 4 -> StoreV2 (572), width=2
accumulated_operand_count == 6 -> StoreV4 (573), width=4
other -> fatal error (unreachable)
The same pattern applies to LoadV1/V2/V4 on the return path. These opcodes are also used for by-value struct argument decomposition, where the struct is stored element-by-element into .param space using 8-byte chunks via StoreV1(571).
Atomic Family (294--317, 462)
Atomic opcodes are emitted by sub_20BED60 during DAG legalization and emitted as PTX by sub_21E5E70 (base) and sub_21E6420 (L2-hinted variant for SM 80+):
| Opcode Range | PTX Instruction | Types |
|---|---|---|
| 294--297 | atom.add | f32, f64, i32, i64 |
| 302--305 | atom.min | s32, s64, u32, u64 |
| 314--317 | atom.max | s32, s64, u32, u64 |
| 462 | atom.cas | generic (compare-and-swap) |
Within the PTX emission layer, the atomic operation is encoded in a packed operand word:
| Bits | Field | Values |
|---|---|---|
| [7:4] | scope | 0=gpu (default), 1=cta, 2=sys |
| [23:16] (BYTE2) | operation | 0x00=exch, 0x01=add.u, 0x03=and, 0x05=or, 0x06=xor, 0x07=max.s, 0x08=min.s, 0x09=max.u, 0x0A=min.u, 0x0B=add.f, 0x0C=inc, 0x0D=dec, 0x0E=cas |
Note that operation codes 0x02 and 0x04 are unused -- there is no signed atomic add and no variant between and and or, matching the PTX ISA's set of atomic operations.
On Ampere (SM 80+), each atomic operation has an L2 cache-hinted variant emitted by sub_21E6420. The PTX format becomes atom[.scope].op.L2::cache_hint.type, instructing the GPU to retain or evict data in L2 after the atomic completes.
Barrier and Fence Family (287--290)
| Opcode | PTX Instruction | Scope |
|---|---|---|
| 287 | membar.gpu | GPU |
| 288 | membar.cta | CTA (thread block) |
| 289 | membar.sys | System |
| 290 | fence.sc.cluster | Cluster (SM 90+) |
The emission function sub_21E94F0 dispatches on the low 4 bits of the operand word. The fence.sc.cluster instruction requires SM 90 (Hopper) and provides sequentially-consistent fence semantics at cluster scope.
Cluster barrier instructions (SM 90+, emitted by sub_21E8EA0):
| Operand Encoding | PTX Instruction |
|---|---|
| bits[3:0]=0, bits[7:4]=0 | barrier.cluster.arrive |
| bits[3:0]=0, bits[7:4]=1 | barrier.cluster.arrive.relaxed |
| bits[3:0]=1, bits[7:4]=0 | barrier.cluster.wait |
| bits[3:0]=1, bits[7:4]=1 | barrier.cluster.wait.relaxed |
NVPTXISD Custom DAG Opcodes (22--499)
These are SelectionDAG-level opcodes used during lowering. After instruction selection, they are replaced by concrete MachineInstr opcodes. They are documented here because the DAG opcode numbers appear in the binary's lowering functions and serve as the conceptual identity of each instruction family:
| DAG Opcode | Identity | Notes |
|---|---|---|
| 22 | NVPTXISD::TargetAddr | Data pointer computation |
| 24 | NVPTXISD::Wrapper | Global address wrapping |
| 149 | NVPTXISD::ATOMIC_LOAD | Atomic load (lowered from IR atomic) |
| 152 | NVPTXISD::SELECT_CC | Conditional select (ternary) |
| 189 | NVPTXISD::MoveParam | Thread index and parameter moves |
| 193--196 | NVPTXISD::MIN/MAX | Min/max variants (2- and 3-source) |
| 197 | NVPTXISD::CTPOP | Population count |
| 198--204 | NVPTXISD::ConstantPool | Constant pool entry variants |
| 208 | NVPTXISD::CMPXCHG | Compare-and-exchange |
| 213--214 | NVPTXISD::STORE_SIGNED | Store with sign-extension flag |
| 215 | NVPTXISD::AddrSpaceCast | Address space conversion (within lowering) |
| 230 | NVPTXISD::DeclareLocal | Declare local variable / address of param |
| 233--234 | NVPTXISD::AddrSpaceCast pair | Two-step address space cast |
| 245--274 | NVPTXISD::MathOp_RN/RZ/RM/RP | Rounded math (add, mul, sqrt, div, fma) |
| 310 | NVPTXISD::Annotation | PTX .pragma annotation |
| 321 | NVPTXISD::StackRestore | Stack pointer restore |
| 322 | NVPTXISD::StackAlloc | Dynamic stack allocation |
| 330 | NVPTXISD::FunctionAddr | Function address (for indirect calls) |
| 335 | NVPTXISD::BinaryArith | Two-operand arithmetic |
| 371 | NVPTXISD::DynAreaOffset | Dynamic alloca offset |
| 499 | NVPTXISD::ConditionalBranch | Conditional branch with .param alloc |
The rounded math opcodes (245--274) follow a systematic pattern. The intrinsic lowering switch at sub_33B0210 maps NVVM intrinsic IDs to NVPTXISD opcodes:
| Intrinsic ID | NVPTXISD Opcode | PTX Operation |
|---|---|---|
| 63 | 249 | add.rz |
| 64 | 255 | mul.rz |
| 89 | 267 | fma.rz |
| 170 | 245 | add.rm |
| 172 | 274 | mul.rm |
| 250 | 271 | fma.rm |
| 308 | 270 | add.rp |
| 309 | 272 | mul.rp |
| 310 | 273 | fma.rp |
| 325 | 248 | sqrt.rz |
| 328 | 254 | sqrt.rm |
| 335 | 246 | sqrt.rp |
| 348 | 250 | div.rz |
| 349 | 256 | div.rm |
| 355 | 269 | div.rp |
MMA / Tensor Core Opcodes
Tensor core MachineInstr opcodes occupy a large range and are organized by generation. The central MMA instruction builder at sub_21E74C0 reads a packed 64-bit descriptor to determine the specific instruction variant.
Pre-Blackwell (SM 70--90) families:
| Function | Family | PTX Base | Min SM |
|---|---|---|---|
| sub_21E0360 | HMMA load A/B | wmma.load.a / wmma.load.b | 70 |
| sub_21E0630 | HMMA load C | wmma.load.c | 70 |
| sub_21DFBF0 | HMMA store C | wmma.store.c | 70 |
| sub_21E0870 | HMMA MMA | wmma.mma / mma | 70 |
| sub_21E1280 | IMMA load A/B | wmma.load.a (int) | 72 |
| sub_21E15D0 | IMMA load C | wmma.load.c (int) | 72 |
| sub_21E1830 | IMMA store C | wmma.store.c (int) | 72 |
| sub_21E1D20 | IMMA MMA | mma (integer, with saturation) | 72 |
| sub_21E2280 | BMMA MMA | mma (binary, b1.and.popc / b1.xor.popc) | 75 |
Each family exists in two copies: the AsmPrinter-side at 0x21Dxxxx--0x21Exxxx and the NVPTX backend-side at 0x36Exxxx.
Blackwell tcgen05 (SM 100+):
Opcodes 4905--4940 cover 10 shape variants of tcgen05.mma. The packed descriptor encodes:
| Bit | Field | Values |
|---|---|---|
| 0 | scaleD | 0 or 1 |
| 1 | negA | 0=positive, 1=negative |
| 2 | negB | 0=positive, 1=negative |
| 3 | transA | 0=normal, 1=transposed |
| 4 | transB | 0=normal, 1=transposed |
| 5 | sparsity | structured sparsity enable |
| [8:6] | type encoding | mxf4nvf4, i8, mxf8f6f4, f16, tf32, fp4, mxf4, bf16 |
Modifiers include block_scale, weight_stationary, and scaleInputAccumulator. The architecture gate is subtarget+340 >= 0x3E8 (decimal 1000, the internal encoding of SM 100).
MMA Shape and Type Encoding
The MMA instruction builder uses enumerated shape and type codes embedded in the packed descriptor:
Shape codes (bits [39:32]):
| Code | Shape | PTX String | Min SM |
|---|---|---|---|
| 0x01 | m8n8k4 | "m8n8k4" | 70 |
| 0x02 | m8n8k16 | "m8n8k16" | 72 |
| 0x03 | m8n8k32 | "m8n8k32" | 75 |
| 0x04 | m8n8k64 | "m8n8k64" | 75 |
| 0x05 | m8n8k128 | "m8n8k128" | 75 |
| 0x10 | m16n8k4 | "m16n8k4" | 80 |
| 0x11 | m16n8k8 | "m16n8k8" | 75 |
| 0x12 | m16n8k16 | "m16n8k16" | 80 |
| 0x13 | m16n8k32 | "m16n8k32" | 75 |
| 0x14 | m16n8k64 | "m16n8k64" | 75 |
| 0x15 | m16n8k128 | "m16n8k128" | 75 |
| 0x16 | m16n8k256 | "m16n8k256" | 75 |
| 0x17 | m16n16k16 | "m16n16k16" | 90 |
| 0x18 | m32n8k16 | "m32n8k16" | 90? |
| 0x19 | m16n16k8 | "m16n16k8" | 70 |
Data type codes (in aty/bty fields):
| Code | Type | Bits | PTX |
|---|---|---|---|
| 1 | b1 | 1 | "b1" |
| 2 | s4 | 4 | "s4" |
| 3 | u4 | 4 | "u4" |
| 4 | s8 | 8 | "s8" |
| 5 | u8 | 8 | "u8" |
| 6 | f16 | 16 | "f16" |
| 7 | bf16 | 16 | "bf16" |
| 8 | tf32 | 19 | "tf32" |
| 9 | f64 | 64 | "f64" |
| 10 | f32 | 32 | "f32" |
| 11 | s32 | 32 | "s32" |
Special Register Access
Special register read instructions map to PTX special registers. The AsmPrinter function sub_21E86B0 dispatches on a single-byte operand:
| Operand | Register | Description |
|---|---|---|
| 0x26 | %tid.x | Thread ID, X |
| 0x27 | %tid.y | Thread ID, Y |
| 0x28 | %tid.z | Thread ID, Z |
| 0x29 | %ntid.x | Block dimension, X |
| 0x2A | %ntid.y | Block dimension, Y |
| 0x2B | %ntid.z | Block dimension, Z |
| 0x2C | %ctaid.x | Block ID, X |
| 0x2D | %ctaid.y | Block ID, Y |
| 0x2E | %ctaid.z | Block ID, Z |
| 0x2F | %nctaid.x | Grid dimension, X |
| 0x30 | %nctaid.y | Grid dimension, Y |
| 0x31 | %nctaid.z | Grid dimension, Z |
| 0x5E | (dynamic) | %warpid / %laneid (via sub_3958DA0) |
| 0x5F | (dynamic) | %nwarpid or similar (via sub_3958DA0) |
Cluster special registers (SM 90+, sub_21E9060) add 15 registers: %is_explicit_cluster, %cluster_ctarank, %cluster_nctarank, %cluster_ctaid.{x,y,z}, %cluster_nctaid.{x,y,z}, %clusterid.{x,y,z}, %nclusterid.{x,y,z}.
Address Space Conversion
The cvta instruction family is emitted by sub_21E7FE0:
| Operand Value | Suffix | Full Instruction |
|---|---|---|
| 0 | (none) | cvta (generic) |
| 1 | .global | cvta.to.global / cvta.global |
| 3 | .shared | cvta.to.shared / cvta.shared |
| 4+ | .local | cvta.to.local / cvta.local |
Direction is determined by a separate operand: value 0 emits "a" (to-generic), value 1 emits "b" (to-specific).
Constraint Emission Pipeline
The full path from opcode to emitted constraint:
```
sub_B612D0(emitter_state, opcode):
    // Step 1: Table lookup
    entry = word_3F3E6C0[opcode - 1]
    reg_class = entry >> 8
    constraint_class = entry & 0xFF

    // Step 2: Build descriptor array on stack
    switch (constraint_class):
      case 0x00:
        // Simple 2-input ALU: {op0=RC, op1=RC, result=RC}
        desc[0] = {kind=0,  value=sub_A778C0(state, reg_class, flags)}
        desc[1] = {kind=1,  value=sub_A778C0(state, reg_class, flags)}
        desc[2] = {kind=-1, value=sub_B5BA00(state, reg_class)}
        sub_A78010(state, desc, 3)
      case 0x01:
        // Ternary FMA: {op0, op1, op2, result}
        desc[0..2] = three input constraints
        desc[3] = {kind=-1, value=sub_B5BA00(state, reg_class)}
        sub_A78010(state, desc, 4)
      ...
      case 0xB0:
        // 17-input complex: 17 input constraints + 1 output
        for i in 0..16:
            desc[i] = {kind=i, value=...}
        desc[17] = {kind=-1, value=sub_B5BA00(state, reg_class)}
        sub_A78010(state, desc, 18)
```
Key helper functions:
| Address | Function | Purpose |
|---|---|---|
| sub_A778C0 | createRegClassConstraint(state, regclass, flags) | Build input operand constraint for a specific register class |
| sub_A77AD0 | createAnyRegConstraint(state, flags) | Build an unconstrained ("any register") input constraint |
| sub_A79C90 | composeConstraints(state, desc, N) | Merge N descriptors into a single composite constraint |
| sub_B5BA00 | createOutputConstraint(state, regclass_id) | Build the output/result constraint |
| sub_A78010 | emitConstraint(state, desc_array, N) | Finalize and emit the constraint with N entries |
| sub_B612D0 | emitInstrConstraint(state, opcode) | Top-level entry: table lookup + switch + emit |
The constraint descriptors are purely stack-allocated within sub_B612D0's approximately 0x160-byte frame. No heap allocation occurs during constraint emission.
Complete Identified Opcode Summary
The following table consolidates every opcode where the binary-to-PTX mapping has been confirmed or strongly inferred. This represents a partial inventory -- the total opcode space extends to at least 4940, and many opcodes in the gaps (particularly in the load/store, texture, surface, and extended intrinsic ranges) remain unidentified.
| Opcode | Identity | Family | Evidence Source |
|---|---|---|---|
| 0--~430 | Generic LLVM TargetOpcode | LLVM standard | upstream LLVM 20.0.0 |
| 440--443 | Type-preserving moves | Copy | register coalescer (sub_3494EA0) |
| 444--503 | Cross-class / wide / ABI copies | Copy | register coalescer (sub_3494EA0) |
| 294--297 | atom.add (f32/f64/i32/i64) | Atomic | DAG legalization (sub_20BED60) |
| 302--305 | atom.min (s32/s64/u32/u64) | Atomic | DAG legalization (sub_20BED60) |
| 314--317 | atom.max (s32/s64/u32/u64) | Atomic | DAG legalization (sub_20BED60) |
| 315 | CallSeqBegin | Call ABI | LowerCall (sub_3040BF0) |
| 316 | CallSeqEnd_Outer | Call ABI | LowerCall |
| 462 | atom.cas | Atomic | DAG legalization |
| 499 | ConditionalBranch | Control | intrinsic lowering |
| 505 | DeclareParam | Call ABI | LowerCall |
| 506 | DeclareScalarParam | Call ABI | LowerCall |
| 507 | DeclareRetParam | Call ABI | LowerCall |
| 508 | DeclareRetScalarParam | Call ABI | LowerCall |
| 510 | CallDirect | Call ABI | LowerCall |
| 511 | CallDirectNoProto | Call ABI | LowerCall |
| 512 | CallIndirect | Call ABI | LowerCall |
| 513 | CallIndirectNoProto | Call ABI | LowerCall |
| 514 | CallStart | Call ABI | LowerCall |
| 515 | LoadRetParam | Call ABI | LowerCall |
| 516 | LoadRetParamLast | Call ABI | LowerCall |
| 517 | CallSeqEnd | Call ABI | LowerCall |
| 518 | CallProto | Call ABI | LowerCall |
| 521 | DeclareRetParam_Ext | Call ABI | LowerCall |
| 527 | StoreCalleeRetAddr | Call ABI | LowerCall |
| 528 | StoreRetValToParam | Call ABI | LowerCall |
| 568 | LoadV1 | Vector Param | LowerCall |
| 569 | LoadV2 | Vector Param | LowerCall |
| 570 | LoadV4 | Vector Param | LowerCall |
| 571 | StoreV1 | Vector Param | LowerCall |
| 572 | StoreV2 | Vector Param | LowerCall |
| 573 | StoreV4 | Vector Param | LowerCall |
| 4905--4940 | tcgen05.mma (10 shape variants) | Tensor Core | Blackwell emission (sub_21E8CD0) |
Gaps and Unknown Ranges
The following opcode ranges are known to contain NVPTX instructions but have not been fully mapped:
| Range | Likely Contents | Evidence |
|---|---|---|
| 430--439 | Transition zone (generic-to-target boundary) | Adjacent to copy family |
| 574--~800 | Global/shared/local loads and stores | Large gap between param-store and first identified general opcode |
| 800--~1500 | Texture and surface instructions | sub_33B0210 intrinsic switch references hundreds of tex/surf intrinsics |
| 1500--~3000 | Shuffle, vote, match, redux | Warp-level intrinsic families |
| 3000--~4000 | WGMMA, TMA, bulk operations | Hopper-era instruction families |
| 4000--4904 | Additional tensor/cluster instructions | Bridging pre-Blackwell and tcgen05 |
Recovering these ranges requires systematic analysis of the sub_33B0210 intrinsic lowering switch (343KB, the single largest function in the binary) and correlation with the AsmPrinter's printInstruction dispatch table.
Function Map
| Role | Function | Size |
|---|---|---|
| Constraint emission (179-case switch on word_3F3E6C0) | sub_B612D0 | 104KB |
| Register class set builder (111 cases) | sub_B5BA00 | 21KB |
| Operand type decoder (101 cases) | sub_B6B200 | 44KB |
| createRegClassConstraint(state, regclass, flags) | sub_A778C0 | -- |
| createAnyRegConstraint(state, flags) | sub_A77AD0 | -- |
| composeConstraints(state, desc, N) | sub_A79C90 | -- |
| emitConstraint(state, desc_array, N) | sub_A78010 | -- |
| Opcode-to-copy-type mapping (switch, families 440--503) | sub_3494EA0 | 12.7KB |
| Operand-type classification (reads byte_444C4A0) | sub_34961A0 | 26.6KB |
| Register-pair decomposition (wide/paired registers) | sub_3497B40 | 16.5KB |
| NVPTXTargetLowering::LowerCall (call ABI opcodes) | sub_3040BF0 | 88KB |
| Intrinsic lowering switch (NVVM intrinsic to opcode) | sub_33B0210 | 343KB |
| NVPTXDAGToDAGISel::Select (ISel entry) | sub_3090F90 | 91KB |
| MMA instruction builder (packed descriptor) | sub_21E74C0 | 17KB |
| Atomic operation PTX emission (base) | sub_21E5E70 | -- |
| L2 cache-hinted atomic PTX emission (SM 80+) | sub_21E6420 | -- |
| Memory barrier PTX emission | sub_21E94F0 | -- |
| Cluster barrier PTX emission (SM 90+) | sub_21E8EA0 | -- |
| Special register PTX emission | sub_21E86B0 | -- |
| Cluster special register PTX emission (SM 90+) | sub_21E9060 | -- |
| Address space conversion (cvta) PTX emission | sub_21E7FE0 | -- |
| tcgen05 Blackwell MMA emission (SM 100+) | sub_21E8CD0 | -- |
| Register class to encoded ID mapping | sub_21583D0 | -- |
| Register class to PTX type suffix | sub_2163730 | -- |
| Register class to PTX register prefix | sub_21638D0 | -- |
Global Data References
| Symbol | Address | Purpose |
|---|---|---|
| word_3F3E6C0 | 0x3F3E6C0 | Constraint table (16-bit entries, indexed by opcode-1) |
| byte_444C4A0 | 0x444C4A0 | MVT/operand type table (16-byte entries, indexed by MVT enum) |
| word_4456340 | 0x4456340 | MVT to vector element count (16-bit entries) |
| word_4456580 | 0x4456580 | MVT to scalarized MVT (16-bit entries) |
| byte_3F252E0 | 0x3F252E0 | Constraint type classification table |
| qword_502A920 | 0x502A920 | SM processor table (45 entries, stride-2) |
Cross-References
- Pattern Database -- detailed constraint descriptor layout and emission sub-functions
- Register Coalescing -- the NVPTX-specific coalescer that processes copy family opcodes 440--503
- Code Generation -- pipeline overview including ISel, RA, and machine-level passes
- InstrEmitter -- how SDNodes become MachineInstrs with these opcodes
- Register Allocation -- greedy RA that consumes constraint table data
- AsmPrinter -- the PTX emission layer that converts these opcodes to text
CLI Flag Inventory
cicc v13.0 accepts approximately 111 unique flag keys across five parsing sites, expanding to ~142 flag+value combinations when counting value variants, and ~169 when including all architecture triplets. Flags are parsed in sub_8F9C90 (real main), sub_900130 (LibNVVM path A), sub_12CC750/sub_9624D0 (LibNVVM option processors), and sub_12C8DD0 (flag catalog builder with 65 registered configurations).
The flag system is architecturally split into two layers: a hardcoded dispatch layer in the top-level parsers (sub_8F9C90, sub_900130, sub_12CC750/sub_9624D0) that handles mode selection, pass-through, LTO, and structural flags via strcmp/prefix-match chains; and a BST-backed catalog layer (sub_12C8DD0 + sub_95EB40/sub_12C8B40) that handles all flags whose effect is purely "store a value and forward strings to output vectors."
The Four Output Vectors
Every flag ultimately routes its effects into one or more of four output std::vector<std::string> buffers. These vectors are the sole interface between the CLI parser and the downstream pipeline stages:
| Vector | Seed | Output args | Downstream stage |
|---|---|---|---|
| v324 (lnk) | "lnk" | a5/a6 | Phase 1: Linker / IR-link (sub_906xxx) |
| v327 (opt) | "opt" | a7/a8 | Phase 2: Optimizer (LLVM opt / sub_12E54A0) |
| v330 (lto) | (none) | a9/a10 | Phase 3: LTO passes |
| v333 (llc) | "llc" | a11/a12 | Phase 4: LLC codegen |
Each vector element is a 32-byte std::string with SSO. At function exit (lines ~1462-1553 of sub_9624D0), each vector is serialized: count = (end - begin) >> 5, then malloc(8 * count) for the char** array, with each string individually malloc(len+1) + memcpy + null-terminated.
The lto vector receives no seed string and is only populated by explicit LTO flags (-Xlto, -olto, -gen-lto, -link-lto, --device-c, --force-device-c, host-ref flags) and the architecture string.
Mode Selection
The top-level entry point sub_8F9C90 sets a mode variable v263 that selects the compilation pipeline:
| Flag | Mode | Description |
|---|---|---|
-lgenfe | 1 | EDG C++ frontend (legacy genfe path) |
-libnvvm | 2 | LibNVVM API path |
-lnk | 3 | Linker path (forces keep=true) |
-opt | 4 | Optimizer-only path (forces keep=true) |
-llc | 6 | LLC backend-only path |
Within the LibNVVM option processors (sub_12CC750/sub_9624D0), the first argument is checked as a 4-byte or 8-byte integer for phase routing. Phase routing is stored at a1+240:
| argv[0] hex | String | Phase ID | a1+240 |
|---|---|---|---|
| 0x6B6E6C2D | -lnk | 1 | 1 |
| 0x74706F2D | -opt | 2 | 2 |
| 0x636C6C2D | -llc | 3 | 3 |
| 0x63766E2D | -nvc | 3 | 3 (alias) |
| 0x6D76766E62696C2D | -libnvvm | 4 | 4 |
When phase routing is active (a1+240 != 0), sub_95C880(phase_id, argc, argv, &count, &mode_flags) returns the allocated argv array for that single phase, stored directly into the corresponding output pair. When a1+240 == 0, mode flags default to 7 (all phases), and the full multi-phase option parsing loop runs.
The BST-Backed Flag Catalog
Catalog construction: sub_95EB40 / sub_12C8DD0
The function sub_95EB40(a1, cl_mode_flag) (standalone path) or sub_12C8DD0 (LibNVVM path) builds a std::map<std::string, OptionEntry> at a1+248. The underlying data structure is a C++ red-black tree (the standard library std::map implementation), with the tree root at a1+248, the sentinel/end node at a1+256, and the node count at a1+288.
Registration is performed by 65 calls to sub_95E8B0 + sub_95BF90 (standalone) or sub_12C8B40 (LibNVVM). Each call inserts one BST node.
BST node layout (168 bytes)
Each node in the red-black tree has the following layout:
| Offset | Size | Content |
|---|---|---|
| +0 | 24 | RB-tree metadata (color, parent, left, right pointers) |
| +32 | 32 | Key: flag name string (std::string with SSO) |
| +64 | 32 | lnk forwards: space-separated flags for lnk vector |
| +96 | 32 | opt forwards: space-separated flags for opt vector |
| +128 | 32 | llc forwards: space-separated flags for llc vector |
| +160 | 8 | Value pointer: points to the offset in the options structure where the flag's current value is stored |
BST lookup: sub_95D600 / sub_12C8530
When the main parsing loop encounters a flag string, it calls sub_95D600 (standalone) or sub_12C8530 (LibNVVM) to perform a standard std::map::lower_bound-style traversal of the red-black tree. The lookup compares the input flag string against registered key strings at node offset +32 using strcmp semantics. On match, the node's three forwarding strings (lnk/opt/llc) are split on spaces and appended to their respective output vectors.
Duplicate detection
Each BST node's value pointer points into the options structure. If the value storage already has a non-zero sentinel (the QWORD immediately following the 32-byte STR32 slot), the flag was already set. On duplicate:
"libnvvm : error: <flag> defined more than once"
Flags NOT in the catalog
The following flag categories are handled by hardcoded strcmp/prefix-match chains in the main parsing loop BEFORE the catalog lookup, and therefore bypass the BST entirely:
- Mode selection flags (-lnk, -opt, -llc, -nvc, -libnvvm)
- -Ofast-compile=<level> (parsed at lines ~690-833)
- Pass-through flags (-Xopt, -Xllc, -Xlnk, -Xlto)
- LTO flags (-lto, -gen-lto, -gen-lto-and-llc, -link-lto, -olto, -gen-opt-lto, --trace-lto)
- Device compilation flags (--device-c, --force-device-c, --partial-link)
- Host reference flags (-host-ref-{ec,eg,ek,ic,ig,ik})
- -maxreg=<N> (has its own duplicate-check logic at a1+1200)
- -split-compile=<N>, -split-compile-extended=<N> (at a1+1480 / a1+1488)
- -opt-passes=<pipeline> (at a1+1512 / a1+1520)
- -discard-value-names=<0|1> (complex multi-phase interaction)
- -time-passes (must be sole flag; unsupported in LibNVVM API path)
- -cl-mode (sets v278=1, affects routing for -prec-div, -fast-math, -prec-sqrt)
- -jump-table-density=<N> (forwarded directly to llc)
- -jobserver (forwarded to opt)
- --emit-optix-ir (disables ip-msp + licm, sets a13=0x43)
- --nvvm-64, --nvvm-32 (handled in sub_95C230)
If none of the hardcoded checks match and the BST lookup also fails, the flag falls through to the catchall entry at options structure offset +1256, which triggers:
"libnvvm : error: <flag> is an unsupported option"
Complete Flag-to-Pipeline Vector Routing Table
The table below documents every flag's routing from user input to the four output vectors. "Store" indicates the options structure offset where the value is recorded. Flags marked with [BST] are registered in the catalog; flags marked with [HC] are hardcoded in the parsing loop.
Architecture Flags [BST]
All 23 architecture entries share options structure offset +552 and follow the same 3-column pattern:
| User flag | lnk vector | opt vector | llc vector |
|---|---|---|---|
| -arch=compute_75 | -R __CUDA_ARCH=750 | -opt-arch=sm_75 | -mcpu=sm_75 |
| -arch=compute_80 | -R __CUDA_ARCH=800 | -opt-arch=sm_80 | -mcpu=sm_80 |
| -arch=compute_86 | -R __CUDA_ARCH=860 | -opt-arch=sm_86 | -mcpu=sm_86 |
| -arch=compute_87 | -R __CUDA_ARCH=870 | -opt-arch=sm_87 | -mcpu=sm_87 |
| -arch=compute_88 | -R __CUDA_ARCH=880 | -opt-arch=sm_88 | -mcpu=sm_88 |
| -arch=compute_89 | -R __CUDA_ARCH=890 | -opt-arch=sm_89 | -mcpu=sm_89 |
| -arch=compute_90 | -R __CUDA_ARCH=900 | -opt-arch=sm_90 | -mcpu=sm_90 |
| -arch=compute_90a | -R __CUDA_ARCH=900 | -opt-arch=sm_90a | -mcpu=sm_90a |
| -arch=compute_100 | -R __CUDA_ARCH=1000 | -opt-arch=sm_100 | -mcpu=sm_100 |
| -arch=compute_100a | -R __CUDA_ARCH=1000 | -opt-arch=sm_100a | -mcpu=sm_100a |
| -arch=compute_100f | -R __CUDA_ARCH=1000 | -opt-arch=sm_100f | -mcpu=sm_100f |
| -arch=compute_103 | -R __CUDA_ARCH=1030 | -opt-arch=sm_103 | -mcpu=sm_103 |
| -arch=compute_103a | -R __CUDA_ARCH=1030 | -opt-arch=sm_103a | -mcpu=sm_103a |
| -arch=compute_103f | -R __CUDA_ARCH=1030 | -opt-arch=sm_103f | -mcpu=sm_103f |
| -arch=compute_110 | -R __CUDA_ARCH=1100 | -opt-arch=sm_110 | -mcpu=sm_110 |
| -arch=compute_110a | -R __CUDA_ARCH=1100 | -opt-arch=sm_110a | -mcpu=sm_110a |
| -arch=compute_110f | -R __CUDA_ARCH=1100 | -opt-arch=sm_110f | -mcpu=sm_110f |
| -arch=compute_120 | -R __CUDA_ARCH=1200 | -opt-arch=sm_120 | -mcpu=sm_120 |
| -arch=compute_120a | -R __CUDA_ARCH=1200 | -opt-arch=sm_120a | -mcpu=sm_120a |
| -arch=compute_120f | -R __CUDA_ARCH=1200 | -opt-arch=sm_120f | -mcpu=sm_120f |
| -arch=compute_121 | -R __CUDA_ARCH=1210 | -opt-arch=sm_121 | -mcpu=sm_121 |
| -arch=compute_121a | -R __CUDA_ARCH=1210 | -opt-arch=sm_121a | -mcpu=sm_121a |
| -arch=compute_121f | -R __CUDA_ARCH=1210 | -opt-arch=sm_121f | -mcpu=sm_121f |
Note: the a and f sub-variants share the base SM number for __CUDA_ARCH (e.g., sm_100a and sm_100f both emit __CUDA_ARCH=1000) but get distinct -opt-arch= and -mcpu= strings. The architecture string is also stored into the lto vector via sub_95D700, preserving the full -arch=compute_XX string.
Architecture validation bitmask
Architecture is validated at a1+8 using bitmask 0x60081200F821:
```
offset = SM_number - 75
if (offset > 0x2E || !_bittest64(&mask, offset))   // mask = 0x60081200F821
    -> ERROR: "is an unsupported option"
```
Valid bit positions:
| Bit | SM | Generation |
|---|---|---|
| 0 | 75 | Turing |
| 5 | 80 | Ampere |
| 11 | 86 | Ampere |
| 12 | 87 | Jetson Orin |
| 13 | 88 | Ada |
| 14 | 89 | Ada Lovelace |
| 15 | 90 | Hopper |
| 25 | 100 | Blackwell |
| 28 | 103 | Blackwell+ |
| 35 | 110 | Post-Blackwell |
| 45 | 120 | Next-gen |
| 46 | 121 | Next-gen |
Maximum offset: 0x2E = 46 (SM 121). All pre-Turing architectures (SM 70 and below) are rejected.
Architecture specification forms
Architecture can be specified in many forms, all converging to a numeric SM value. Trailing a or f suffixes are stripped before numeric parsing. On parse failure: "Unparseable architecture: <val>".
| Form | Example | Source |
|---|---|---|
| -arch <val> | -arch sm_90 | sub_8F9C90 |
| -arch<val> | -archsm_90 | sub_8F9C90 (compact) |
| --nv_arch <val> | --nv_arch sm_100a | sub_8F9C90 |
| -mcpu=sm_<N> | -mcpu=sm_90 | LLVM-style |
| -opt-arch=sm_<N> | -opt-arch=sm_90 | Optimizer |
| -arch=compute_<N> | -arch=compute_100 | Compute capability |
| __CUDA_ARCH=<N> | __CUDA_ARCH=900 | Raw define |
Hex-encoded flag checks in sub_8F9C90:
- 0x6D733D7570636D2D = -mcpu=sm
- 0x6372612D74706F2D = -opt-arc
- 0x6F633D686372612D = -arch=co
- 0x6372615F766E2D2D = --nv_arc
Optimization Level Flags
| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -opt=0 | [BST] | +392 | -- | -- | -- | |
| -opt=1 | [BST] | +392 | -- | -- | -- | |
| -opt=2 | [BST] | +392 | -- | -- | -- | |
| -opt=3 | [BST] | +392 | -- | -- | -- | default |
| -Osize | [BST] | +488 | -- | -Osize | -Osize | off |
| -Om | [BST] | +520 | -- | -Om | -Om | off |
| -disable-allopts | [BST] | +424 | -lnk-disable-allopts | -opt-disable-allopts | -llc-disable-allopts | off |
| -disable-llc-opts | [BST] | +840 | -- | -- | -- | off |
The -opt=<N> flags do not directly emit to any vector at registration time. Instead, at the routing stage (lines 1444-1563 of sub_9624D0), the optimization level drives one of three code paths:
- Custom pipeline set (a1+1520 != 0): emits -passes=<pipeline_string> to opt vector
- Normal mode (a1+1520 == 0, a1+1640 == 0): emits -O<level> to opt vector
- Fast-compile mode (a1+1640 != 0): emits -optO<level> + -llcO2 to llc vector
Floating Point Control Flags
| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -ftz=0 | [BST] | +584 | -- | -- | -- | default |
| -ftz=1 | [BST] | +584 | -R __CUDA_FTZ=1 | -nvptx-f32ftz | -nvptx-f32ftz | |
| -prec-sqrt=0 | [BST] | +616 | -- | -- | -nvptx-prec-sqrtf32=0 | CL default |
| -prec-sqrt=1 | [BST] | +616 | -R __CUDA_PREC_SQRT=1 | -- | -nvptx-prec-sqrtf32=1 | CUDA default |
| -prec-div=0 (CL) | [BST] | +648 | -- | -opt-use-prec-div=false | -nvptx-prec-divf32=0 | |
| -prec-div=0 (CUDA) | [BST] | +648 | -- | -opt-use-prec-div=false | -nvptx-prec-divf32=1 | |
| -prec-div=1 (CL) | [BST] | +648 | -- | -opt-use-prec-div=true | -nvptx-prec-divf32=1 | |
| -prec-div=1 (CUDA) | [BST] | +648 | -R __CUDA_PREC_DIV=1 | -opt-use-prec-div=true | -nvptx-prec-divf32=2 | default |
| -prec-div=2 | [BST] | +648 | -- | -- | -nvptx-prec-divf32=3 | |
| -fma=0 | [BST] | +680 | -- | -- | -nvptx-fma-level=0 | |
| -fma=1 | [BST] | +680 | -- | -- | -nvptx-fma-level=1 | default |
| -enable-mad | [BST] | +712 | -- | -- | -nvptx-fma-level=1 | off |
| -opt-fdiv=0 | [BST] | +456 | -- | -opt-fdiv=0 | -- | default |
| -opt-fdiv=1 | [BST] | +456 | -- | -opt-fdiv=1 | -- | |
| -no-signed-zeros | [BST] | +1160 | -- | -opt-no-signed-zeros | -- | off |
Note on -prec-div: the CUDA vs CL distinction is controlled by the magic cookie a4 (0xABBA = CUDA, 0xDEED = OpenCL). CUDA -prec-div=1 maps to -nvptx-prec-divf32=2 (IEEE-correct division), while CL maps to level 1 (software approximation). When -prec-div=0 is set under CUDA, it still maps to -nvptx-prec-divf32=1 (not 0), because CUDA never drops below software approximation.
Fast Math Aggregate Flags
| User flag | Type | Store | lnk | opt | llc |
|---|---|---|---|---|---|
| -unsafe-math | [BST] | +744 | -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 | -opt-use-fast-math -nvptx-f32ftz | -nvptx-fma-level=1 -nvptx-f32ftz |
| -fast-math (CL) | [BST] | +776 | -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 | -opt-use-fast-math -nvptx-f32ftz | -nvptx-f32ftz |
| -fast-math (CUDA) | [BST] | +776 | -R __CUDA_USE_FAST_MATH=1 | -opt-use-fast-math | -- |
-unsafe-math always sets FTZ in the backend (-nvptx-f32ftz), while CUDA -fast-math does not touch the backend FTZ flag -- it only sets the preprocessor define and the optimizer flag.
Debug and Diagnostic Flags
| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -g | [BST] | +296 | -debug-compile | -debug-compile | -- | off |
| -generate-line-info | [BST] | +328 | -- | -generate-line-info | -- | off |
| -no-lineinfo-inlined-at | [BST] | +360 | -- | -- | -line-info-inlined-at=0 | off |
| -show-src | [BST] | +808 | -- | -- | -nvptx-emit-src | off |
| -enable-verbose-asm | [BST] | +1224 | -- | -- | -asm-verbose | off |
| -w | [BST] | +872 | -- | -w | -w | off |
| -Werror | [BST] | +904 | -- | -Werror | -Werror | off |
| -debug-compile | [BST] | +296 | -- | -debug-compile | -- | off |
| -line-info-inlined-at=0 | alias | -- | -- | -- | -line-info-inlined-at=0 | off |
| -inline-info | [HC] | -- | -- | -pass-remarks=inline -pass-remarks-missed=inline -pass-remarks-analysis=inline | -- | off |
Inlining and Function Flags
| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -disable-inlining | [BST] | +1064 | -- | -disable-inlining | -- | off |
| -aggressive-inline | [BST] | +1608 | -- | -inline-budget=40000 | -- | off |
| -restrict | [BST] | +1096 | -- | -- | -nvptx-kernel-params-restrict | off |
| -allow-restrict-in-struct | [BST] | +1128 | -- | -allow-restrict-in-struct | -allow-restrict-in-struct | off |
| -enable-opt-byval | [BST] | +1032 | -- | -enable-opt-byval | -- | off |
Optimization Control Flags
| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -opt-disable-allopts | derived | -- | -- | -opt-disable-allopts | -- | off |
| -lnk-disable-allopts | derived | -- | -lnk-disable-allopts | -- | -- | off |
| -llc-disable-allopts | derived | -- | -- | -- | -llc-disable-allopts | off |
These three are emitted by -disable-allopts (see above); they do not exist as independent user flags.
Rematerialization Flags
| User flag | Type | Store | lnk | opt | llc |
|---|---|---|---|---|---|
| -vasp-fix | [BST] | +1352 | -- | -- | -vasp-fix1=true -vasp-fix2=true |
| -new-nvvm-remat | [BST] | +1384 | -- | -- | -enable-new-nvvm-remat=true -nv-disable-remat=true -rp-aware-mcse=true |
| -disable-new-nvvm-remat | [BST] | +1416 | -- | -- | -enable-new-nvvm-remat=false -nv-disable-remat=false -rp-aware-mcse=false |
| -disable-nvvm-remat | [BST] | +1448 | -- | -- | -enable-new-nvvm-remat=false -nv-disable-remat=true -rp-aware-mcse=false |
These are multi-flag compound emissions. Note the subtle difference: -disable-nvvm-remat disables both the new and classic rematerializers (-nv-disable-remat=true), while -disable-new-nvvm-remat disables only the new rematerializer and leaves the classic one enabled (-nv-disable-remat=false); both additionally turn off register-pressure-aware MCSE (-rp-aware-mcse=false).
Analysis and Transform Control Flags
| User flag | Type | Store | lnk | opt | llc |
|---|---|---|---|---|---|
| -no-aggressive-positive-stride-analysis | [BST] | +1544 | -- | -aggressive-positive-stride-analysis=false | -- |
| disable-load-select-transform | [BST] | +1576 | -- | -disable-load-select-transform=true | -- |
Note: disable-load-select-transform is registered WITHOUT a leading - in the catalog.
Pass-Through (Forwarding) Flags [HC]
| Flag | Target vector | Special handling |
|---|---|---|
-Xopt <arg> | opt | If <arg> starts with -opt-discard-value-names=, extracts value; if "1", sets v276=false |
-Xllc <arg> | llc | None |
-Xlnk <arg> | lnk | If <arg> starts with -lnk-discard-value-names=, extracts value; if "1", sets v275=false |
-Xlto <arg> | lto | If <arg> starts with -lto-discard-value-names=, extracts value; if "1", sets v282=false |
Each consumes the next argument from argv.
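The consume-next-argument behavior can be restated as a minimal scanner. This is an illustrative sketch under our own naming, not decompiled code, and it omits the -discard-value-names special cases listed above:

```python
# Illustrative sketch of the -Xopt/-Xllc/-Xlnk/-Xlto pass-through handling.
# Each flag consumes the following argv entry and appends it verbatim
# to the matching phase vector.
def route_passthrough(argv):
    vectors = {"opt": [], "llc": [], "lnk": [], "lto": []}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in ("-Xopt", "-Xllc", "-Xlnk", "-Xlto") and i + 1 < len(argv):
            phase = arg[2:].lower()          # "opt", "llc", "lnk", or "lto"
            vectors[phase].append(argv[i + 1])  # consume the next argument
            i += 2
        else:
            i += 1
    return vectors
```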
LTO Flags [HC]
| User flag | a13 bitmask effect | lto vector | Notes |
|---|---|---|---|
-lto | (a13 & 0x300) \| 0x23 | -- | Full LTO mode |
-gen-lto | (a13 & 0x300) \| 0x21 | -gen-lto | Emit LTO bitcode |
-gen-lto-and-llc | a13 \|= 0x20 | -gen-lto | Emit LTO + run LLC |
-link-lto | (a13 & 0x300) \| 0x26 | -link-lto | Link LTO modules |
-olto | -- | -olto + argv[i+1] | Takes next arg as LTO opt level |
-gen-opt-lto | sets v280=1 | -- | Affects lowering at end of parsing |
--trace-lto | -- | --trace | LTO tracing |
Device Compilation Flags [HC]
| User flag | lto vector |
|---|---|
--device-c | --device-c |
--force-device-c | --force-device-c |
--partial-link | (no-op, consumed but not forwarded) |
Host Reference Flags [HC]
| User flag | lto vector |
|---|---|
-host-ref-ek=<val> | -host-ref-ek=<val> |
-host-ref-ik=<val> | -host-ref-ik=<val> |
-host-ref-ec=<val> | -host-ref-ec=<val> |
-host-ref-ic=<val> | -host-ref-ic=<val> |
-host-ref-eg=<val> | -host-ref-eg=<val> |
-host-ref-ig=<val> | -host-ref-ig=<val> |
-has-global-host-info | -has-global-host-info |
Pipeline Control Flags [HC]
| User flag | Store | Routing | Default |
|---|---|---|---|
-opt-passes=<pipeline> | +1512 | opt: -passes=<pipeline> (overrides -O<N>) | unset |
-passes=<pipeline> | -- | opt: -passes=<pipeline> (sub_9624D0 only) | unset |
-lsa-opt=0 | -- | opt: -lsa-opt=0 | generated by -Ofast-compile=max or CL-mode |
-memory-space-opt=0 | -- | opt: -memory-space-opt=0 | generated by -Ofast-compile=max |
-memory-space-opt=1 | -- | opt: -memory-space-opt=1 | generated when opt level allows |
-rox-opt=0 | -- | opt: -rox-opt=0 | generated when -prec-div=0 or -prec-sqrt=0 (non-CL) |
-do-ip-msp=<0\|1> | -- | opt: -do-ip-msp=<val> | |
-do-licm=<0\|1> | -- | opt: -do-licm=<val> | |
-optimize-unused-variables | -- | lto: -optimize-unused-variables | off |
Ofast-compile Levels [HC]
Stored at a1+1640. Only ONE -Ofast-compile= is allowed; a second triggers "libnvvm : error: -Ofast-compile specified more than once".
| Level string | a1+1640 | Description | Side effects |
|---|---|---|---|
"0" | 1 (then reset to 0) | Disabled | opt: fast-compile=off string |
"min" | 4 | Minimal speedup | opt: -fast-compile=min |
"mid" | 3 | Medium speedup | opt: -fast-compile=mid + second flag |
"max" | 2 | Maximum speedup | opt: -fast-compile=max; forces -lsa-opt=0, -memory-space-opt=0 |
When -Ofast-compile is active (level >= 1), the -passes=/-O routing is bypassed. Instead: -optO<level> and -llcO2 are emitted to the llc vector (lines 1453-1460).
Miscellaneous Flags [HC]
| User flag | Store | Routing | Notes |
|---|---|---|---|
-maxreg=<N> | +1192 | opt: -maxreg=<N>, llc: -maxreg=<N> | Error on duplicate |
-split-compile=<N> | +1480 | opt: -split-compile=<N> | Error on duplicate |
-split-compile-extended=<N> | +1480 | opt: -split-compile-extended=<N>, sets a1+1644=1 | Same storage as -split-compile |
-jump-table-density=<N> | -- | llc: -jump-table-density=<N> | |
-jobserver | -- | opt: -jobserver | |
-cl-mode | -- | No forwarding; sets v278=1 | Affects -prec-div, -prec-sqrt, -fast-math routing |
-time-passes | -- | Unsupported in LibNVVM API (error if a14 != NULL) | Must be sole flag |
--emit-optix-ir | -- | opt: -do-ip-msp=0, opt: -do-licm=0; a13 = (a13 & 0x300) \| 0x43 | |
--nvvm-64 | -- | a13 \|= 0x100 | 64-bit NVVM mode |
--nvvm-32 | -- | a13 \|= 0x200 | 32-bit NVVM mode |
Discard-Value-Names [HC]
This flag has the most complex interaction logic in the parser. Seven boolean tracking variables control its behavior:
| Variable | Meaning |
|---|---|
v275 | lnk-discard-value-names override (from -Xlnk) |
v276 | opt-discard-value-names override (from -Xopt) |
v277 | global discard-value-names flag was used |
v278 | CL-mode detected |
v279 | -Xlnk was used for discard-value-names |
v281 | -Xlto was used for discard-value-names |
v282 | lto-discard-value-names override (from -Xlto) |
v283 | -Xopt was used for discard-value-names |
When a4 == 0xABBA (CUDA) and no explicit -discard-value-names:
- Default: discard (a1+232 = 1)
- Emits: -lnk-discard-value-names=1 to lnk, -opt-discard-value-names=1 to opt, -lto-discard-value-names=1 to lto -- UNLESS overridden by per-phase -X flags
When a4 == 0xDEED (OpenCL): only applies if (a13 & 0x20) is set.
Error on conflicting definitions: "libnvvm : error: -discard-value-names defined more than once, or defined for both libnvvm and sub-phase".
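The CUDA-path defaulting described above can be sketched as follows. This is a hedged reconstruction: the function name, the `explicit` flag, and the `overridden` set are ours, loosely mirroring the v27x trackers rather than reproducing them:

```python
# Sketch of the CUDA-path (cookie 0xABBA) discard-value-names defaulting:
# when the user gave no explicit -discard-value-names, each phase receives
# "=1" unless a per-phase -Xlnk/-Xopt/-Xlto override was already seen.
def emit_discard_defaults(cookie, explicit, overridden):
    """overridden: set of phase names already covered by -X flags."""
    emitted = {}
    if cookie == 0xABBA and not explicit:
        for phase in ("lnk", "opt", "lto"):
            if phase not in overridden:
                emitted[phase] = "-%s-discard-value-names=1" % phase
    return emitted
```

The 0xDEED (OpenCL) path is intentionally omitted here; per the text it applies only when (a13 & 0x20) is set.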
I/O and General Flags
| Flag | Effect |
|---|---|
-o <file> | Output file (fatal if missing) |
-v | Verbose mode |
-dryrun | Do not execute compilation |
-keep | Keep intermediate files |
-irversion | Print IR version and exit |
-nvvmir-library <f> | NVVM IR library file (also = form) |
-m64 | 64-bit mode flag (sets *a8 = 1) |
Recognized input extensions: .bc, .ci, .i, .cup, .optixir, .ii. The .cup extension triggers --orig_src_path_name / --orig_src_file_name handling.
Options Structure Layout
The options structure passed as a1 to sub_9624D0/sub_12CC750 is roughly 1,648 bytes (its highest documented fields sit at offsets +1640 and +1644). Key offsets:
| Offset | Size | Content | Default |
|---|---|---|---|
| +8 | DWORD | SM architecture number | 75 |
| +232 | BYTE | discard-value-names master (0=keep, 1=discard) | 0 |
| +240 | DWORD | Phase routing mode (0=full, 1-4=single) | 0 |
| +248 | PTR | BST root (std::map red-black tree) | |
| +256 | PTR | BST sentinel/end node | |
| +288 | QWORD | BST node count | |
| +296 | STR32 | -g / -debug-compile value | |
| +328 | STR32 | -generate-line-info value | |
| +360 | STR32 | -no-lineinfo-inlined-at value | |
| +392 | STR32 | Optimization level (0/1/2/3) | "3" |
| +400 | QWORD | opt-level already-set sentinel | |
| +424 | STR32 | -disable-allopts value | |
| +456 | STR32 | -opt-fdiv value | "0" |
| +464 | QWORD | opt-fdiv already-set sentinel | |
| +488 | STR32 | -Osize value | |
| +520 | STR32 | -Om value | |
| +552 | STR32 | Architecture defines | compute_75 |
| +560 | QWORD | arch already-set sentinel | |
| +584 | STR32 | -ftz value | "0" |
| +592 | QWORD | ftz already-set sentinel | |
| +616 | STR32 | -prec-sqrt value | "1" (CUDA) / "0" (CL) |
| +624 | QWORD | prec-sqrt already-set sentinel | |
| +648 | STR32 | -prec-div value | "1" |
| +656 | QWORD | prec-div already-set sentinel | |
| +680 | STR32 | -fma value | "1" |
| +688 | QWORD | fma already-set sentinel | |
| +712 | STR32 | -enable-mad value | |
| +744 | STR32 | -unsafe-math value | |
| +776 | STR32 | -fast-math value | |
| +808 | STR32 | -show-src value | |
| +840 | STR32 | -disable-llc-opts value | |
| +872 | STR32 | -w value | |
| +904 | STR32 | -Werror value | |
| +1032 | STR32 | -enable-opt-byval value | |
| +1064 | STR32 | -disable-inlining value | |
| +1096 | STR32 | -restrict value | |
| +1128 | STR32 | -allow-restrict-in-struct value | |
| +1160 | STR32 | -no-signed-zeros value | |
| +1192 | STR32 | -maxreg value string | |
| +1200 | QWORD | maxreg already-set sentinel | |
| +1224 | STR32 | -enable-verbose-asm value | |
| +1256 | STR32 | Catchall (unrecognized flag) | |
| +1352 | STR32 | -vasp-fix value | |
| +1384 | STR32 | -new-nvvm-remat value | |
| +1416 | STR32 | -disable-new-nvvm-remat value | |
| +1448 | STR32 | -disable-nvvm-remat value | |
| +1480 | STR32 | -split-compile value | |
| +1488 | QWORD | split-compile already-set sentinel | |
| +1512 | STR32 | -opt-passes pipeline string | |
| +1520 | QWORD | opt-passes already-set sentinel | |
| +1544 | STR32 | -no-aggressive-positive-stride-analysis | |
| +1576 | STR32 | disable-load-select-transform | |
| +1608 | STR32 | -aggressive-inline value | |
| +1640 | DWORD | Ofast-compile level (0-4) | 0 |
| +1644 | BYTE | split-compile-extended flag | 0 |
Each STR32 is a 32-byte std::string with SSO (small string optimization). The QWORD "already-set sentinel" fields serve as duplicate-detection guards.
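The STR32-plus-sentinel pairing amounts to a simple duplicate guard. A minimal sketch, with our own class name and the error text taken from the catalog below:

```python
# Minimal sketch of the already-set sentinel pattern: a QWORD stored next
# to each STR32 field guards against a flag being defined twice.
class OptSlot:
    def __init__(self):
        self.value = ""      # the STR32 payload
        self.sentinel = 0    # 0 = never set; nonzero = already set

    def store(self, flag_name, value):
        if self.sentinel:
            raise ValueError(
                "libnvvm : error: %s defined more than once" % flag_name)
        self.value = value
        self.sentinel = 1
```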
Compilation Mode Bitmask (a13)
The a13 parameter is an in/out bitmask that controls which pipeline phases execute and what LTO mode is active:
| Bit/Mask | Meaning |
|---|---|
0x07 | Phase control (default = 7 = all phases) |
0x10 | Debug compile or line-info enabled |
0x20 | LTO generation enabled |
0x21 | gen-lto mode |
0x23 | Full LTO mode |
0x26 | link-lto mode |
0x43 | emit-optix-ir mode |
0x80 | gen-opt-lto lowering flag |
0x100 | --nvvm-64 (64-bit mode) |
0x200 | --nvvm-32 (32-bit mode) |
0x300 | Mask for 64/32-bit mode bits |
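A decoder for the individual bit fields might look like this; the constant and key names are ours, chosen to match the table:

```python
# Sketch decoder for the a13 in/out bitmask documented above.
PHASE_MASK  = 0x07   # default 7 = all phases
DEBUG_BIT   = 0x10   # debug compile or line-info
LTO_GEN_BIT = 0x20   # LTO generation enabled

def describe_a13(a13):
    return {
        "phases":  a13 & PHASE_MASK,
        "debug":   bool(a13 & DEBUG_BIT),
        "lto_gen": bool(a13 & LTO_GEN_BIT),
        "nvvm64":  bool(a13 & 0x100),   # --nvvm-64
        "nvvm32":  bool(a13 & 0x200),   # --nvvm-32
    }
```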
Magic Cookie Values (a4)
| Value | Meaning | Effects |
|---|---|---|
0xABBA (43962) | CUDA compilation | -prec-div routing uses CUDA levels; -fast-math uses CUDA defines; discard-value-names defaults to on |
0xDEED (57069) | OpenCL compilation | -prec-sqrt defaults to 0; -fast-math/-prec-div use CL routing; -cl-mode scanning active |
Default Values When Flags Are Absent
When a registered flag is not found in the user's arguments, sub_9624D0 checks whether the stored-value sentinel is zero and applies defaults:
| Flag | Sentinel | Default applied |
|---|---|---|
-opt= | a1+400 == 0 | -opt=3 (optimization level 3) |
-arch=compute_ | a1+560 == 0 | -arch=compute_75 (SM 75 Turing) |
-ftz= | a1+592 == 0 | -ftz=0 (no flush-to-zero) |
-prec-sqrt= | a1+624 == 0 | -prec-sqrt=1 (CUDA) or -prec-sqrt=0 (CL) |
-prec-div= | a1+656 == 0 | -prec-div=1 (precise division) |
-fma= | a1+688 == 0 | -fma=1 (FMA enabled) |
-opt-fdiv= | a1+464 == 0 | -opt-fdiv=0 |
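The defaulting sweep can be restated as a runnable sketch. The dict structure is ours; the values come from the table above (note the cookie-dependent -prec-sqrt default):

```python
# Sketch of the post-scan defaulting pass: any flag whose sentinel is
# still zero after argument parsing gets its documented default appended.
DEFAULTS = {
    "opt":      "3",           # optimization level 3
    "arch":     "compute_75",  # SM 75 Turing
    "ftz":      "0",
    "prec-div": "1",
    "fma":      "1",
    "opt-fdiv": "0",
}

def apply_defaults(sentinels, is_opencl=False):
    out = []
    for name, default in DEFAULTS.items():
        if not sentinels.get(name):
            out.append("-%s=%s" % (name, default))
    # -prec-sqrt defaults differ by compilation mode (CUDA vs OpenCL)
    if not sentinels.get("prec-sqrt"):
        out.append("-prec-sqrt=" + ("0" if is_opencl else "1"))
    return out
```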
Differences Between sub_12CC750 and sub_9624D0
The two option processors are near-identical. Key differences:
| Aspect | sub_12CC750 | sub_9624D0 |
|---|---|---|
| Binary size | 87KB decompiled | 75KB decompiled |
-memory-space-opt default | 0 | 1 |
-passes= flag | absent | present |
-disable-struct-lowering | present | absent |
-prec-sqrt CL default | 0 | 1 |
| Pipeline | LibNVVM entry path | Standalone/generic path |
| Companion builder | sub_12C8DD0 | sub_95EB40 |
| BST lookup | sub_12C8530 | sub_95D600 |
Error Handling
All error strings follow the pattern "libnvvm : error: <message>":
| Error | Trigger |
|---|---|
<flag> is an unsupported option | Flag not matched by hardcoded checks or BST lookup |
<flag> defined more than once | Duplicate -maxreg, or duplicate BST-registered flag |
-arch=compute_<N> is an unsupported option | Architecture fails bitmask validation |
-Ofast-compile specified more than once | Second -Ofast-compile= encountered |
-Ofast-compile called with unsupported level, only supports 0, min, mid, or max | Invalid level string |
split compilation defined more than once | Duplicate -split-compile or -split-compile-extended |
-discard-value-names defined more than once, or defined for both libnvvm and sub-phase | Conflicting discard-value-names |
<value> is an unsupported value for option: <flag> | From sub_95C230 extended parser |
Function Address Map
| Address | Function | Role |
|---|---|---|
0x8F9C90 | sub_8F9C90 | Real main entry point (argc/argv from OS) |
0x900130 | sub_900130 | LibNVVM Path A CLI parser |
0x9624D0 | sub_9624D0 | LibNVVM option processor (standalone variant) |
0x9685E0 | sub_9685E0 | Pipeline orchestrator (wraps sub_9624D0) |
0x967070 | sub_967070 | Post-option-parse pipeline setup |
0x95EB40 | sub_95EB40 | BST option map builder (standalone) |
0x95E8B0 | sub_95E8B0 | Flag template registration (standalone) |
0x95D600 | sub_95D600 | BST option map lookup (standalone) |
0x95CB50 | sub_95CB50 | Prefix-match string comparison |
0x95CA80 | sub_95CA80 | Value extraction after = |
0x95C880 | sub_95C880 | Single-phase delegator |
0x95C230 | sub_95C230 | Extended flag parser (--nvvm-64/--nvvm-32) |
0x95BF90 | sub_95BF90 | BST node insertion helper |
0x95BC80 | sub_95BC80 | String storage into options struct |
0x12CC750 | sub_12CC750 | LibNVVM option processor (LibNVVM variant) |
0x12C8DD0 | sub_12C8DD0 | BST option map builder (LibNVVM, 65 entries) |
0x12C8B40 | sub_12C8B40 | Individual flag registration (LibNVVM) |
0x12C8530 | sub_12C8530 | BST option map lookup (LibNVVM) |
0x12C7B30 | sub_12C7B30 | Pass name registration into pipeline ordering |
0x12C6E90 | sub_12C6E90 | Sub-argument splitter for mode flags |
0x12C6910 | sub_12C6910 | Flag filter (-debug-compile, -g, -generate-line-info) |
0x8FD0D0 | sub_8FD0D0 | Key-value parser (used by sub_900130) |
0x8FD6D0 | sub_8FD6D0 | String concatenation builder |
Cross-References
- Optimization Levels -- O-level pipeline builders and fast-compile tiers
- Configuration Knobs -- 1,496 cl::opt knobs set by the flags documented here
- NVVMPassOptions -- 222-slot struct that receives CLI-routed values
- Environment Variables -- environment-based configuration (parallel to CLI)
- Pipeline Overview -- how the four output vectors feed into pipeline stages
- nvcc Interface -- how nvcc constructs the argv passed to cicc
- Architecture Targets -- SM feature gating driven by -arch=compute_<N>
Optimization Levels
cicc v13.0 supports four standard optimization levels (O0 through O3) and three fast-compile tiers (Ofcmin, Ofcmid, Ofcmax). These are mutually exclusive with the custom --passes= interface. The pipeline name is selected in the new-PM driver sub_226C400 and assembled by sub_12E54A0. The full optimization pipeline builder is sub_12DE330, with tier-specific insertion handled by sub_12DE8F0.
Pipeline Name Selection
The new-PM driver at sub_226C400 selects a pipeline name string based on boolean flags in the config struct:
| Config Offset | Flag | Pipeline Name |
|---|---|---|
| byte[888] | O0 | nvopt<O0> |
| byte[928] | O1 | nvopt<O1> |
| byte[968] | O2 | nvopt<O2> |
| byte[1008] | O3 | nvopt<O3> |
| qw[131..132] | fc="max" | nvopt<Ofcmax> |
| qw[131..132] | fc="mid" | nvopt<Ofcmid> |
| qw[131..132] | fc="min" | nvopt<Ofcmin> |
Selection logic in sub_226C400 (lines 828--874):
if (O1_flag) -> "nvopt<O1>"
else if (O2_flag) -> "nvopt<O2>"
else if (O3_flag) -> "nvopt<O3>"
else if (fc_len == 3) {
if (fc == "max") -> "nvopt<Ofcmax>"
if (fc == "mid") -> "nvopt<Ofcmid>"
if (fc == "min") -> "nvopt<Ofcmin>"
}
else -> "nvopt<O0>"
Combining -O# with --passes= is an error:
"Cannot specify -O#/-Ofast-compile=<min,mid,max> and --passes=/--foo-pass, use -passes='default<O#>,other-pass'"
The pipeline name is passed to sub_2277440 (new-PM text parser), which constructs the actual PassManager. The nvopt prefix is registered as a pipeline element in sub_225D540 (new PM) and sub_12C35D0 (legacy PM), with vtables at 0x4A08350 / 0x49E6A58.
Fast-Compile Level Encoding
The fast-compile level is stored as an integer at offset 1640 (or 1648 in the clone) of the compilation context:
| Value | CLI Source | Behavior |
|---|---|---|
| 0 | (no flag, or -Ofast-compile=0) | Normal O-level pipeline |
| 1 | -Ofast-compile=0 | Forwarded then reset to 0 |
| 2 | -Ofast-compile=max / -Ofc=max | Minimal pipeline, fastest compile |
| 3 | -Ofast-compile=mid / -Ofc=mid | Medium pipeline |
| 4 | -Ofast-compile=min / -Ofc=min | Close to full optimization |
Any other value produces: "libnvvm : error: -Ofast-compile called with unsupported level".
When level=1, the flag is forwarded to the optimizer phase as a pass argument and then the level is reset to 0 at offset 1640 (so it becomes normal O-level optimization). When level=2 (max), the optimizer arg string -Ofast-compile=max is appended. When level=3 (mid), -Ofast-compile=mid is appended. When level=4 (min), -Ofast-compile=min is appended.
Tier Summary
| Pipeline | Approx Passes | LSA-Opt | MemSpaceOpt | Compile Speed |
|---|---|---|---|---|
nvopt<O0> | 5--8 | off | off | Fastest (no opt) |
nvopt<Ofcmax> | 12--15 | forced 0 | forced 0 | Fast |
nvopt<Ofcmid> | 25--30 | normal | enabled | Medium |
nvopt<Ofcmin> | 30--35 | normal | enabled | Slower |
nvopt<O1> | ~40 + tier-1 | normal | enabled | Normal |
nvopt<O2> | ~40 + tier-1/2 | normal | enabled | Normal |
nvopt<O3> | ~40 + tier-1/2/3 | normal | enabled | Slowest |
Pipeline Architecture: Tier 0 + Tiers 1/2/3
O1/O2/O3 share a common pipeline construction path. The key insight is that optimization happens in layers:
- Tier 0 (sub_12DE330): The full base pipeline of ~40 passes. Fires for ALL of O1, O2, and O3 when opts[4224] (optimization-enabled) is set.
- Tier 1 (sub_12DE8F0(PM, 1, opts)): Additional passes gated by opts[3528]. Fires for O1, O2, and O3.
- Tier 2 (sub_12DE8F0(PM, 2, opts)): Additional passes gated by opts[3568]. Fires for O2 and O3 only.
- Tier 3 (sub_12DE8F0(PM, 3, opts)): Additional passes gated by opts[3608]. Fires for O3 only.
The tier control fields live in the 4,512-byte NVVMPassOptions struct:
| Offset | Type | Meaning |
|---|---|---|
| 3528 | bool | Tier 1 enable (O1+) |
| 3532 | int | Tier 1 phase threshold |
| 3568 | bool | Tier 2 enable (O2+) |
| 3572 | int | Tier 2 phase threshold |
| 3608 | bool | Tier 3 enable (O3+) |
| 3612 | int | Tier 3 phase threshold |
| 4224 | bool | Tier 0 enable (any O-level) |
| 4228 | int | Tier 0 phase threshold |
The assembler loop in sub_12E54A0 (lines 481--553) iterates over the plugin/external pass list at opts[4488]. Each entry has a phase_id; when the phase_id exceeds a tier's threshold, that tier fires:
for each entry in opts[4488..4496]:
phase_id = entry[8..12]
if (opts[4224] && phase_id > opts[4228]):
sub_12DE330(PM, opts) // Tier 0
opts[4224] = 0 // one-shot
if (opts[3528] && phase_id > opts[3532]):
sub_12DE8F0(PM, 1, opts) // Tier 1
opts[3528] = 0
if (opts[3568] && phase_id > opts[3572]):
sub_12DE8F0(PM, 2, opts) // Tier 2
opts[3568] = 0
if (opts[3608] && phase_id > opts[3612]):
sub_12DE8F0(PM, 3, opts) // Tier 3
opts[3608] = 0
AddPass(PM, entry->createPass())
After the loop, any remaining unfired tiers fire unconditionally.
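The one-shot firing behavior can be made concrete with an executable sketch. Names and data layout are ours; each tier fires at most once, the first time a plugin pass's phase_id exceeds that tier's threshold, and any unfired tiers fire after the loop:

```python
# Executable sketch of the assembler loop's one-shot tier firing.
def assemble(entries, tiers):
    """entries: list of (phase_id, pass_name);
    tiers: {tier_number: [enabled, threshold]} (mutable, like opts[])."""
    fired = []
    for phase_id, name in entries:
        for tier in sorted(tiers):
            enabled, threshold = tiers[tier]
            if enabled and phase_id > threshold:
                fired.append("tier%d" % tier)
                tiers[tier][0] = False   # one-shot: clear the enable flag
        fired.append(name)               # AddPass(PM, entry->createPass())
    for tier in sorted(tiers):           # remaining tiers fire at the end
        if tiers[tier][0]:
            fired.append("tier%d" % tier)
            tiers[tier][0] = False
    return fired
```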
Tier 0: Full Base Pipeline (sub_12DE330)
sub_12DE330 at 0x12DE330 is called for all O1/O2/O3 compilations. It constructs the ~40-pass base pipeline:
| # | Factory | Pass | Guard | Notes |
|---|---|---|---|---|
| 1 | sub_1654860(1) | VerifierPass | always | |
| 2 | sub_1A62BF0(1,0,0,1,0,0,1) | CGSCC/Inliner | always | Pipeline EP 1, 1 iteration |
| 3 | sub_1B26330() | NVVMReflect | always | |
| 4 | sub_185D600() | SROA | always | |
| 5 | sub_1C6E800() | NVVMLowerArgs | always | |
| 6 | sub_1C6E560() | NVVMLowerAlloca | always | |
| 7 | sub_1857160() | SimplifyCFG | always | |
| 8 | sub_1842BC0() | InstCombine | always | |
| 9 | sub_17060B0(1,0) | GVN | opts[3160] | Debug-dump enabled |
| 10 | sub_12D4560() | NVVMVerify | always | |
| 11 | sub_18A3090() | LoopRotate | always | |
| 12 | sub_184CD60() | LICM | always | |
| 13 | sub_1869C50(1,0,1) | IndVarSimplify | !opts[1040] | |
| 14 | sub_1833EB0(3) | LoopUnroll | always | Factor = 3 |
| 15 | sub_17060B0(1,0) | GVN | always | |
| 16 | sub_1952F90(-1) | LoopIndexSplit/SCCP | always | Threshold = -1 (unlimited) |
| 17 | sub_1A62BF0(1,0,0,1,0,0,1) | CGSCC/Inliner | always | |
| 18 | sub_1A223D0() | DSE | always | |
| 19 | sub_17060B0(1,0) | GVN | always | |
| 20 | sub_1A7A9F0() | MemCpyOpt | always | |
| 21 | sub_1A62BF0(1,0,0,1,0,0,1) | CGSCC/Inliner | always | |
| 22 | sub_1A02540() | ADCE | always | |
| 23 | sub_198DF00(-1) | JumpThreading/CVP | always | Threshold = -1 |
| 24 | sub_1C76260() | NVVMDivergenceLowering | !opts[1320] | |
| 25 | sub_195E880(0) | Reassociate | opts[2880] | Default on (slot 143) |
| 26 | sub_19C1680(0,1) | SpeculativeExecution | !opts[1360] | |
| 27 | sub_17060B0(1,0) | GVN | opts[3160] | Debug-dump enabled |
| 28 | sub_19401A0() | SCCP | always | |
| 29 | sub_1968390() | GlobalDCE/ConstantProp | always | |
| 30 | sub_196A2B0() | GlobalOpt | always | |
| 31 | sub_19B73C0(2,-1,-1,-1,-1,-1,-1) | LoopVectorize/SLP | always | Width=2, thresholds=-1 |
| 32 | sub_17060B0(1,0) | GVN | always | |
| 33 | sub_190BB10(0,0) | EarlyCSE | always | |
| 34 | sub_1A13320() | TailCallElim | always | |
| 35 | sub_17060B0(1,1) | GVN (verified) | opts[3160] | Verify mode |
| 36 | sub_18F5480() | NewGVN | always | |
| 37 | sub_18DEFF0() | Sink | always | |
| 38 | sub_1A62BF0(1,0,0,1,0,0,1) | CGSCC/Inliner | always | |
| 39 | sub_18B1DE0() | Sinking2 | always | NVIDIA custom |
| 40 | sub_1841180() | LoopSimplify/LCSSA | always | |
After sub_12DE330 returns, opts[4224] is cleared (one-shot).
Tiers 1/2/3: Phase-Specific Sub-Pipeline (sub_12DE8F0)
sub_12DE8F0 at 0x12DE8F0 is a single function called with tier in {1, 2, 3}. The tier value is stored into qword_4FBB410 (phase tracker). When tier==3 and qword_4FBB370 byte4 is 0, the feature flags are set to 6 (enabling advanced barrier opt + memory space opt gates).
The following table lists every pass in sub_12DE8F0 with its tier-dependent guard condition. A pass runs only when ALL conditions in its Guard column are satisfied.
| # | Factory | Pass | Guard | O1 | O2 | O3 |
|---|---|---|---|---|---|---|
| 1 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 2 | sub_1A223D0() | NVVMIRVerification | !opts[2600] | Y | Y | Y |
| 3 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 4 | sub_18E4A00() | NVVMBarrierAnalysis | opts[3488] | Y | Y | Y |
| 5 | sub_1C98160(0) | NVVMLowerBarriers | opts[3488] | Y | Y | Y |
| 6 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 7 | sub_12D4560() | NVVMVerifier | !opts[600] | Y | Y | Y |
| 8 | sub_185D600() | IPConstPropagation | opts[3200] && !opts[920] | Y | Y | Y |
| 9 | sub_1857160() | NVVMReflect | opts[3200] && !opts[880] | Y | Y | Y |
| 10 | sub_18A3430() | NVVMPredicateOpt | opts[3200] && !opts[1120] | Y | Y | Y |
| 11 | sub_1842BC0() | SCCP | opts[3200] && !opts[720] | Y | Y | Y |
| 12 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] | Y | Y | Y |
| 13 | sub_12D4560() | NVVMVerifier | !opts[600] | Y | Y | Y |
| 14 | sub_18A3090() | NVVMPredicateOpt variant | opts[3200] && !opts[2160] | Y | Y | Y |
| 15 | sub_184CD60() | ConstantMerge | opts[3200] && !opts[1960] | Y | Y | Y |
| 16 | sub_190BB10(1,0) | SimplifyCFG | tier!=1 && !opts[1040] && !opts[1200] | - | Y | Y |
| 17 | sub_1952F90(-1) | LoopIndexSplit | (same as #16) && !opts[1160] | - | Y | Y |
| 18 | sub_12D4560() | NVVMVerifier | (same as #16) && !opts[600] | - | Y | Y |
| 19 | sub_17060B0(1,0) | PrintModulePass | (same as #16) && !opts[1080] | - | Y | Y |
| 20 | sub_195E880(0) | LICM | opts[3704] && opts[2880] && !opts[1240] | Y | Y | Y |
| 21 | sub_1C8A4D0(v12) | EarlyCSE | always; v12=1 if opts[3704] | Y | Y | Y |
| 22 | sub_1869C50(1,0,1) | Sink | tier!=1 && !opts[1040] | - | Y | Y |
| 23 | sub_1833EB0(3) | TailCallElim | tier==3 && !opts[320] | - | - | Y |
| 24 | sub_1CC3990() | NVVMUnreachableBlockElim | !opts[2360] | Y | Y | Y |
| 25 | sub_18EEA90() | CorrelatedValuePropagation | opts[3040] | Y | Y | Y |
| 26 | sub_12D4560() | NVVMVerifier | !opts[600] | Y | Y | Y |
| 27 | sub_1A223D0() | NVVMIRVerification | !opts[2600] | Y | Y | Y |
| 28 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 29 | sub_1C4B6F0() | Inliner | !opts[440] && !opts[480] | Y | Y | Y |
| 30 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 31 | sub_1A7A9F0() | InstructionSimplify | !opts[2720] | Y | Y | Y |
| 32 | sub_12D4560() | NVVMVerifier | !opts[600] | Y | Y | Y |
| 33 | sub_1A02540() | GenericToNVVM | !opts[2200] | Y | Y | Y |
| 34 | sub_198DF00(-1) | LoopSimplify | !opts[1520] | Y | Y | Y |
| 35 | sub_1C76260() | ADCE | !opts[1320] && !opts[1480] | Y | Y | Y |
| 36 | sub_17060B0(1,0) | PrintModulePass | (same as #35) && !opts[1080] | Y | Y | Y |
| 37 | sub_12D4560() | NVVMVerifier | (same as #35) && !opts[600] | Y | Y | Y |
| 38 | sub_195E880(0) | LICM | opts[2880] && !opts[1240] | Y | Y | Y |
| 39 | sub_1C98160(0/1) | NVVMLowerBarriers | opts[3488] | Y | Y | Y |
| 40 | sub_19C1680(0,1) | LoopUnroll | !opts[1360] | Y | Y | Y |
| 41 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] | Y | Y | Y |
| 42 | sub_19401A0() | InstCombine | !opts[1000] | Y | Y | Y |
| 43 | sub_196A2B0() | EarlyCSE | !opts[1440] | Y | Y | Y |
| 44 | sub_1968390() | SROA | !opts[1400] | Y | Y | Y |
| 45 | sub_19B73C0(tier,...) | LoopVectorize/SLP (1st) | tier!=1; params vary by SM | - | Y | Y |
| 46 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 47 | sub_19B73C0(tier,...) | LoopVectorize/SLP (2nd) | !opts[2760] | Y | Y | Y |
| 48 | sub_1A62BF0(1,...) | LLVM standard pipeline | !opts[600] | Y | Y | Y |
| 49 | sub_1A223D0() | NVVMIRVerification | !opts[2600] | Y | Y | Y |
| 50 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 51 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] | Y | Y | Y |
| 52 | sub_190BB10(0,0) | SimplifyCFG | !opts[960] | Y | Y | Y |
| 53 | sub_1922F90() | NVIDIA loop pass | opts[3080] | Y | Y | Y |
| 54 | sub_195E880(0) | LICM | opts[2880] && !opts[1240] | Y | Y | Y |
| 55 | sub_1A13320() | NVVMRematerialization | !opts[2320] | Y | Y | Y |
| 56 | sub_1968390() | SROA | !opts[1400] | Y | Y | Y |
| 57 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 58 | sub_18EEA90() | CorrelatedValuePropagation | opts[3040] | Y | Y | Y |
| 59 | sub_18F5480() | DSE | !opts[760] | Y | Y | Y |
| 60 | sub_18DEFF0() | DCE | !opts[280] | Y | Y | Y |
| 61 | sub_1A62BF0(1,...) | LLVM standard pipeline | !opts[600] | Y | Y | Y |
| 62 | sub_1AAC510() | NVIDIA-specific pass | !opts[520] && !opts[560] | Y | Y | Y |
| 63 | sub_1A223D0() | NVVMIRVerification | !opts[2600] | Y | Y | Y |
| 64 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 65 | sub_1C8E680() | MemorySpaceOpt | !opts[2680]; param from opts[3120] | Y | Y | Y |
| 66 | sub_1A223D0() | NVVMIRVerification | opts[3120] && !opts[2600] | Y | Y | Y |
| 67 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] | Y | Y | Y |
| 68 | sub_1CC71E0() | NVVMGenericAddrOpt | !opts[2560] | Y | Y | Y |
| 69 | sub_1C98270(1,opts[2920]) | NVVMLowerBarriers variant | opts[3488] | Y | Y | Y |
| 70 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 71 | sub_1C6FCA0() | ADCE | opts[2840] && !opts[1840] | Y | Y | Y |
| 72 | sub_18B1DE0() | LoopOpt/BarrierOpt | opts[3200] && !opts[2640] | Y | Y | Y |
| 73 | sub_1857160() | NVVMReflect (late) | opts[3200] && tier==3 && !opts[880] | - | - | Y |
| 74 | sub_1841180() | FunctionAttrs | opts[3200] && !opts[680] | Y | Y | Y |
| 75 | sub_1C46000() | NVVMLateOpt | tier==3 && !opts[360] | - | - | Y |
| 76 | sub_1841180() | FunctionAttrs (2nd) | opts[3200] && !opts[680] | Y | Y | Y |
| 77 | sub_1CBC480() | NVVMLowerAlloca | !opts[2240] && !opts[2280] | Y | Y | Y |
| 78 | sub_1CB73C0() | NVVMBranchDist | !opts[2080] && !opts[2120] | Y | Y | Y |
| 79 | sub_1C7F370(1) | NVVMWarpShuffle | opts[3328] && !opts[1640] | Y | Y | Y |
| 80 | sub_1CC5E00() | NVVMReduction | opts[3328] && !opts[2400] | Y | Y | Y |
| 81 | sub_1CC60B0() | NVVMSinking2 | opts[3328] && !opts[2440] | Y | Y | Y |
| 82 | sub_1CB73C0() | NVVMBranchDist (2nd) | opts[3328] && !opts[2080] && !opts[2120] | Y | Y | Y |
| 83 | sub_17060B0(1,0) | PrintModulePass | opts[3328] && !opts[1080] | Y | Y | Y |
| 84 | sub_1B7FDF0(3) | Reassociate | opts[3328] && !opts[1280] | Y | Y | Y |
| 85 | sub_17060B0(1,0) | PrintModulePass (final) | opts[3160] && !opts[1080] | Y | Y | Y |
O1 vs O2 vs O3: Complete Diff
The three O-levels differ through exactly five mechanisms. Every pass that is NOT listed here runs identically at all three levels.
1. Tier guard: tier!=1 (O2/O3 only)
These passes are present in sub_12DE8F0 but skip when tier==1 (O1):
| Pass | Factory | Effect of skipping at O1 |
|---|---|---|
| SimplifyCFG | sub_190BB10(1,0) | No inter-tier CFG cleanup |
| LoopIndexSplit | sub_1952F90(-1) | No inter-tier loop splitting |
| NVVMVerifier (post-split) | sub_12D4560() | No verification after split |
| Sink | sub_1869C50(1,0,1) | No inter-tier instruction sinking |
| LoopVectorize/SLP (1st call) | sub_19B73C0(tier,...) | No aggressive vectorization |
At O1, the base pipeline (Tier 0) already includes one instance of LoopVectorize with sub_19B73C0(2,-1,-1,-1,-1,-1,-1) -- width 2, all thresholds at -1 (unlimited). The tier!=1 guard blocks a SECOND, more aggressive vectorization pass with SM-dependent parameters.
2. Tier guard: tier==3 (O3 only)
These passes run exclusively at O3:
| Pass | Factory | Purpose |
|---|---|---|
| TailCallElim | sub_1833EB0(3) | Additional tail call optimization pass |
| NVVMReflect (late) | sub_1857160() | Second-round __nvvm_reflect resolution |
| NVVMLateOpt | sub_1C46000() | O3-exclusive NVIDIA custom late optimization |
sub_1C46000 (NVVMLateOpt) is the most significant O3-exclusive pass. It runs only when !opts[360] (not disabled) and only at tier==3. This is a dedicated NVIDIA optimization pass that performs additional transformations after the main pipeline is complete.
3. Feature flag qword_4FBB370 escalation
When tier==3 and qword_4FBB370 byte4 is 0, the function sets qword_4FBB370 = 6 (binary 110). This enables two feature gates:
- Advanced barrier optimization (bit 1)
- Memory space optimization extensions (bit 2)
These gates affect behavior in downstream passes that read qword_4FBB370, such as sub_12EC4F0 (the machine pass pipeline executor).
4. LoopVectorize/SLP parameter differences
sub_19B73C0 is called with different parameters depending on context:
| Call site | Parameters | Tier |
|---|---|---|
Tier 0 (sub_12DE330 #31) | (2, -1, -1, -1, -1, -1, -1) | All O1/O2/O3 |
| Tier 1/2/3, 1st call (#45) | (tier, ...) SM-dependent | O2/O3 only |
| Tier 1/2/3, 2nd call (#47) | (tier, ...) | All tiers |
| Ofcmid language path | (3, -1, -1, 0, 0, -1, 0) | Fast-compile |
The 7 parameters to sub_19B73C0 control:
- arg1: Vector width factor (2 at Tier 0, tier at higher tiers)
- arg2..arg7: Thresholds for cost model, trip count, and SLP width. Value -1 means unlimited/auto; value 0 means conservative/disabled.
At O2, sub_19B73C0(2, ...) provides moderate vectorization. At O3, sub_19B73C0(3, ...) increases the vector width factor, enabling wider SIMD exploration. The SM-architecture-dependent parameters are resolved at runtime based on the target GPU.
5. CGSCC iteration count
sub_1A62BF0 is the CGSCC (Call Graph SCC) pass manager factory. The first argument is the pipeline extension point / iteration count:
| Context | Call | Iterations |
|---|---|---|
| Tier 0 (all O-levels) | sub_1A62BF0(1,0,0,1,0,0,1) | 1 |
| Ofcmid path | sub_1A62BF0(5,0,0,1,0,0,1) | 5 |
| Language "mid" path | sub_1A62BF0(8,0,0,1,1,0,1) | 8, with extra opt flag |
O1/O2/O3 all use 1-iteration CGSCC in their shared Tier 0 pipeline. The iteration count differences appear in the fast-compile and language-specific paths, not between O-levels.
Complete O-Level Comparison Matrix
| Feature | O0 | O1 | O2 | O3 |
|---|---|---|---|---|
| Tier 0 base pipeline (~40 passes) | - | Y | Y | Y |
| Tier 1 sub-pipeline | - | Y | Y | Y |
| Tier 2 sub-pipeline | - | - | Y | Y |
| Tier 3 sub-pipeline | - | - | - | Y |
| LoopVectorize (base, width=2) | - | Y | Y | Y |
| LoopVectorize (tier, SM-dependent) | - | - | Y | Y |
| SimplifyCFG (inter-tier) | - | - | Y | Y |
| LoopIndexSplit (inter-tier) | - | - | Y | Y |
| Sink (inter-tier) | - | - | Y | Y |
| TailCallElim (extra) | - | - | - | Y |
| NVVMReflect (late round) | - | - | - | Y |
| NVVMLateOpt (sub_1C46000) | - | - | - | Y |
| Feature flags escalation (6) | - | - | - | Y |
| NVVMDivergenceLowering | - | Y | Y | Y |
| SpeculativeExecution | - | Y | Y | Y |
| MemorySpaceOpt | - | Y | Y | Y |
| NVVMWarpShuffle | - | Y | Y | Y |
| NVVMReduction | - | Y | Y | Y |
| NVVMRematerialization | - | Y | Y | Y |
| NVVMBranchDist | - | Y | Y | Y |
| LSA optimization | off | on | on | on |
O0 Pipeline (Minimal)
When no O-level flag is set and no fast-compile level is active, the assembler falls through to LABEL_159 which calls:
sub_1C8A4D0(0) -- NVVMFinalCleanup or similar minimal pass
Then the common tail at LABEL_84 adds:
- MemorySpaceOpt (conditional, skipped at O0 since opts[3488] is typically unset)
- sub_1CEBD10() -- NVVMFinal / cleanup
- sub_1654860(1) -- VerifierPass
- sub_12DFE00() -- Codegen pass setup
The O0 pipeline does NOT call sub_12DE330 or sub_12DE8F0. It runs only the infrastructure passes (TargetLibraryInfo, TargetTransformInfo, BasicAA, AssumptionCacheTracker, ProfileSummaryInfo) plus minimal canonicalization.
Ofcmax Pipeline (Fastest Compile)
Ofcmax bypasses the full pipeline entirely. It forces two optimizer flags:
- -lsa-opt=0 (disables LSA optimization)
- -memory-space-opt=0 (disables MemorySpaceOpt pass)
This forcing happens in BOTH sub_9624D0 (lines 1358--1361) and sub_12CC750 (lines 2025--2079). The condition is:
if (!compare(lsa_opt_flag, "0") || fc_level == 2):
append("-lsa-opt=0")
append("-memory-space-opt=0")
Additionally, when fc_level == 2 AND lsa_opt is NOT already "0", the libnvvm path also injects -lsa-opt=0, mem2reg, -memory-space-opt=0.
The minimal pass sequence:
| # | Factory | Pass |
|---|---|---|
| 1 | sub_18B3080(1) | Sinking2Pass (fast mode, flag=1) |
| 2 | sub_1857160() | SimplifyCFG |
| 3 | sub_19CE990() | LoopStrengthReduce (if applicable) |
| 4 | sub_1B26330() | NVVMReflect |
| 5 | sub_12D4560() | NVVMVerify |
| 6 | sub_184CD60() | LICM |
| 7 | sub_1C4B6F0() | LowerSwitch |
| 8 | sub_12D4560() | NVVMVerify |
Ofcmid Pipeline (Medium)
Ofcmid runs ~25--30 passes without forcing LSA or MemorySpaceOpt off. The pass sequence from sub_12E54A0 (lines 814--861):
| # | Factory | Pass | Guard |
|---|---|---|---|
| 1 | sub_184CD60() | LICM | !opts[1960] |
| 2 | sub_1CB4E40(0) | AnnotationCleanup | always |
| 3 | sub_1B26330() | NVVMReflect | always |
| 4 | sub_198E2A0() | CorrelatedValuePropagation | always |
| 5 | sub_1CEF8F0() | NVVMPeephole | always |
| 6 | sub_215D9D0() | NVVMPeephole2/TcgenAnnotation | always |
| 7 | sub_17060B0(1,0) | GVN | !opts[1080] |
| 8 | sub_198DF00(-1) | JumpThreading/CVP | always |
| 9 | sub_17060B0(1,0) | GVN | !opts[1080] |
| 10 | sub_1C6E800() | NVVMLowerArgs | always |
| 11 | sub_1832270(1) | LoopSimplify | always |
| 12 | sub_1A62BF0(5,0,0,1,0,0,1) | CGSCC (5 iterations) | always |
| 13 | sub_1CB4E40(0) | AnnotationCleanup | always |
| 14 | sub_18FD350(0) | DCE | always |
| 15 | sub_1841180() | LCSSA | always |
| 16 | sub_18DEFF0() | Sink | always |
| 17 | sub_17060B0(1,0) | GVN | always |
| 18 | sub_184CD60() | LICM | always |
| 19 | sub_195E880(0) | Reassociate | always |
| 20 | sub_190BB10(0,0) | EarlyCSE | always |
| 21 | sub_19B73C0(3,-1,-1,0,0,-1,0) | LoopVectorize (conservative) | always |
| 22 | sub_1A223D0() | DSE | always |
| 23 | sub_1C98160(0) | MemorySpaceOpt | always |
| 24 | sub_1C8E680(0) | MemorySpaceOpt2 | always |
| 25 | sub_1B7FDF0(3) | BranchFolding/CFGSimplify | always |
| 26 | sub_18B1DE0() | Sinking2 | always |
Key differences from the O1+ pipeline: Ofcmid uses 5-iteration CGSCC (vs 1 at O1+), includes NVVMPeephole/Peephole2 early, uses conservative LoopVectorize parameters (3,-1,-1,0,0,-1,0) with some thresholds zeroed, and skips NVVMDivergenceLowering, SpeculativeExecution, NVVMBranchDist, NVVMRematerialization, and the entire tier sub-pipeline.
Ofcmin Pipeline (Closest to Full Optimization)
Ofcmin takes the same path as Ofcmid through LABEL_297 in sub_12E54A0 but with the v238 flag set differently, enabling more aggressive settings. The pipeline is essentially the Ofcmid sequence with:
- More aggressive loop optimizer thresholds
- Additional CGSCC framework passes
- Closer parameter alignment to the O2 full pipeline
Ofcmin does NOT force -lsa-opt=0 or -memory-space-opt=0. Like Ofcmid, it still skips the tier 1/2/3 sub-pipeline entirely, keeping compile time lower than O1.
Post-Optimization Common Tail
Regardless of pipeline tier, sub_12E54A0 always appends at LABEL_84 (lines 640--653):
| # | Factory | Pass | Guard |
|---|---|---|---|
| 1 | sub_1C98160(opts[2920]!=0) | MemorySpaceOpt | !v244 && opts[3488] |
| 2 | sub_1CEBD10() | NVVMFinal / cleanup | always |
| 3 | sub_1654860(1) | VerifierPass | !opts[2800] && !opts[4464] |
| 4 | sub_12DFE00(PM, v253, opts) | Codegen pass dispatch | always |
sub_12DFE00 (codegen dispatch) reads the optimization level from opts[200] to determine codegen aggressiveness. When opts[200] > 1, full dependency tracking is enabled across all codegen passes.
Always-Added Analysis Passes
Before any optimization, the pipeline assembler inserts (lines 396--420):
| # | Factory | Pass |
|---|---|---|
| 1 | sub_149CCE0 (368 bytes alloc) | TargetLibraryInfoWrapperPass |
| 2 | sub_1BFB520 (208 bytes alloc) | TargetTransformInfoWrapperPass |
| 3 | sub_14A7550() | VerifierPass / BasicAliasAnalysis |
| 4 | sub_1361950() | AssumptionCacheTracker |
| 5 | sub_1CB0F50() | ProfileSummaryInfoWrapperPass |
These five passes run at ALL optimization levels including O0.
NVVMPassOptions Offset-to-Guard Map
The following passes are gated by NVVMPassOptions boolean flags (the opts struct is 4,512 bytes). Slot defaults come from sub_12D6300:
| Offset | Slot | Default | Controls | Used By |
|---|---|---|---|---|
| 280 | 15 | off | DCE disable | Tier 0 #37, Tier 1/2/3 #60 |
| 320 | 17 | off | TailCallElim disable | Tier 1/2/3 #23 (O3 only) |
| 360 | 19 | on | NVVMLateOpt disable | Tier 1/2/3 #75 (O3 only) |
| 440 | 23 | off | Inliner flag A disable | Tier 1/2/3 #29 |
| 480 | 25 | on | Inliner flag B disable | Tier 1/2/3 #29 |
| 600 | 31 | off | NVVMVerifier disable | Tier 1/2/3 #7,13,18,26,32,37 |
| 680 | 35 | off | FunctionAttrs disable | Tier 1/2/3 #74,76 |
| 720 | 37 | off | SCCP disable | Tier 1/2/3 #11 |
| 760 | 39 | off | DSE disable | Tier 1/2/3 #59 |
| 880 | 45 | off | NVVMReflect disable | Tier 1/2/3 #9,73 |
| 920 | 47 | off | IPConstPropagation disable | Tier 1/2/3 #8 |
| 960 | 49 | off | SimplifyCFG disable | Tier 1/2/3 #52 |
| 1000 | 51 | off | InstCombine disable | Tier 1/2/3 #42 |
| 1040 | 53 | off | Sink/SimplifyCFG disable | Tier 0 #13, Tier 1/2/3 #16,22 |
| 1080 | 55 | off | PrintModulePass disable | many |
| 1120 | 57 | off | NVVMPredicateOpt disable | Tier 1/2/3 #10 |
| 1160 | 59 | off | LoopIndexSplit disable | Tier 1/2/3 #17 |
| 1200 | 61 | off | SimplifyCFG tier guard | Tier 1/2/3 #16 |
| 1240 | 63 | off | LICM disable | Tier 1/2/3 #20,38,54 |
| 1280 | 65 | off | Reassociate disable | Tier 1/2/3 #84 |
| 1320 | 65 | off | NVVMDivergenceLow disable | Tier 0 #24, Tier 1/2/3 #35 |
| 1360 | 67 | off | LoopUnroll disable | Tier 0 #26, Tier 1/2/3 #40 |
| 1400 | 69 | off | SROA disable | Tier 1/2/3 #44,56 |
| 1440 | 71 | off | EarlyCSE disable | Tier 1/2/3 #43 |
| 1480 | 73 | off | ADCE extra guard | Tier 1/2/3 #35 |
| 1520 | 75 | off | LoopSimplify disable | Tier 1/2/3 #34 |
| 1640 | 81 | off | NVVMWarpShuffle disable | Tier 1/2/3 #79 |
| 1760 | 87 | off | MemorySpaceOpt disable | Common tail, language paths |
| 1840 | 91 | off | ADCE variant disable | Tier 1/2/3 #71 |
| 1960 | 97 | off | ConstantMerge disable | Tier 1/2/3 #15 |
| 2000 | 101 | off | NVVMIntrinsicLowering disable | Tier 1/2/3 #1,3,28,50,64 |
| 2080 | 103 | off | NVVMBranchDist disable A | Tier 1/2/3 #78,82 |
| 2120 | 105 | off | NVVMBranchDist disable B | Tier 1/2/3 #78,82 |
| 2200 | 109 | off | GenericToNVVM disable | Tier 1/2/3 #33 |
| 2240 | 111 | off | NVVMLowerAlloca A disable | Tier 1/2/3 #77 |
| 2280 | 113 | off | NVVMLowerAlloca B disable | Tier 1/2/3 #77 |
| 2320 | 115 | off | NVVMRematerialization disable | Tier 1/2/3 #55 |
| 2360 | 117 | on | NVVMUnreachableBlockElim disable | Tier 1/2/3 #24 |
| 2400 | 119 | off | NVVMReduction disable | Tier 1/2/3 #80 |
| 2440 | 121 | off | NVVMSinking2 disable | Tier 1/2/3 #81 |
| 2560 | 127 | off | NVVMGenericAddrOpt disable | Tier 1/2/3 #68 |
| 2600 | 129 | off | NVVMIRVerification disable | Tier 1/2/3 #2,27,49,63,66 |
| 2640 | 131 | off | LoopOpt/BarrierOpt disable | Tier 1/2/3 #72 |
| 2680 | 133 | off | MemorySpaceOpt (2nd) disable | Tier 1/2/3 #65 |
| 2720 | 135 | off | InstructionSimplify disable | Tier 1/2/3 #31 |
| 2760 | 137 | off | LoopVectorize 2nd disable | Tier 1/2/3 #47 |
| 2840 | 141 | on | ADCE enable (reversed) | Tier 1/2/3 #71 |
| 2880 | 143 | on | LICM enable (reversed) | Tier 0 #25, Tier 1/2/3 #20,38,54 |
| 2920 | 145 | off | LowerBarriers parameter | Common tail |
| 3000 | 151 | on | Early pass guard | Pre-opt phase |
| 3040 | 153 | off | CorrelatedValueProp enable | Tier 1/2/3 #25,58 |
| 3080 | 155 | on | NVIDIA loop pass enable | Tier 1/2/3 #53 |
| 3120 | 155 | on | MemorySpaceOpt(2nd) enable | Tier 1/2/3 #65,66 |
| 3160 | 157 | on | PrintModulePass enable | Tier 0 #9,27,35; Tier 1/2/3 many |
| 3200 | 159 | on | Advanced NVIDIA passes group | Tier 1/2/3 #8-11,14-15,72-76 |
| 3328 | 165 | on | SM-specific late passes block | Tier 1/2/3 #79-84 |
| 3488 | 173 | off | NVVMBarrierAnalysis enable | Tier 1/2/3 #4,5,39,69 |
| 3528 | 175 | off | Tier 1 enable | Pipeline assembler |
| 3568 | 177 | off | Tier 2 enable | Pipeline assembler |
| 3608 | 179 | off | Tier 3 enable | Pipeline assembler |
| 3648 | 181 | "" | Language/fc-level string ptr | Pipeline name selection |
| 3704 | 183 | off | Late optimization flag | Tier 1/2/3 #20,21; Pipeline B |
| 3904 | 192 | off | Debug/naming mode flag | BB naming loop |
| 4064 | 201 | off | Concurrent compilation flag | Thread count decision |
| 4104 | 203 | -1 | Thread count (integer) | sub_12E7E70 |
| 4224 | 209 | off | Tier 0 enable (opt active) | Pipeline assembler loop |
| 4304 | 213 | off | Device-code / additional opt | Pipeline B; fc dispatch |
| 4384 | 217 | off | Fast-compile bypass flag | Pipeline A vs B branch |
| 4464 | 221 | off | Late CFG cleanup guard | Common tail #3 |
Codegen Optimization Level Propagation
The -optO and -llcO flags propagate the optimization level to the backend code generator. In sub_12E54A0 (lines 1451--1460):
if (lsa_opt == "0" && some_flag == "1"):
append("-optO<level>")
append("-llcO2")
The codegen dispatch sub_12DFE00 reads opts[200] (the integer optimization level):
- opts[200] == 0: Minimal codegen (no dependency tracking)
- opts[200] >= 1: Standard codegen
- opts[200] >= 2: Full dependency tracking enabled (v121 = true)
Cross-References
- NVVMPassOptions System -- complete 222-slot struct layout
- Pipeline Pass Registration -- 526-pass registration table
- Optimizer Architecture -- two-phase model, AddPass mechanism
- CLI Flags -- -O#, -Ofc=, --passes= routing
- Knobs Reference -- all 1496 cl::opt knobs
- Concurrent Compilation -- Phase I/II threading model
NVVMPassOptions
NVVMPassOptions is NVIDIA's proprietary per-pass configuration system -- a 4,512-byte flat struct containing 221 option slots that controls every aspect of the NVVM optimization pipeline. It has no upstream LLVM equivalent. Where LLVM uses scattered cl::opt<T> globals that each pass reads independently, NVIDIA consolidates all pass configuration into a single contiguous struct that is allocated once and threaded through the entire pipeline assembler as a parameter. This design allows the pipeline to make pass-enable decisions through simple byte reads at known offsets rather than hash-table lookups, and it ensures that the complete configuration state can be copied between Phase I and Phase II of the two-phase compilation model.
The struct is populated by a single 125KB function (sub_12D6300) that reads from a PassOptionRegistry hash table and flattens the results into 221 typed slots. The pipeline assembler (sub_12E54A0) and its sub-pipeline builders (sub_12DE330, sub_12DE8F0) then read individual slots by offset to decide which passes to insert and how to configure them.
| Initializer | sub_12D6300 (125KB, 4,786 lines) |
| Struct size | 4,512 bytes (sub_22077B0(4512)) |
| Slot count | 221 (1-based index: 1--221) |
| Slot types | 5: STRING (24B), BOOL_COMPACT (16B), BOOL_INLINE (16B), INTEGER (16B), STRING_PTR (28B) |
| Type breakdown | 114 string + 83 bool compact + 17 bool inline + 6 integer + 1 string pointer |
| Registry lookup | sub_12D6170 (hash table at registry+120) |
| PassDef resolver | sub_1691920 (64-byte stride table) |
| Bool parser | sub_12D6240 (triple: lookup + lowercase + char test) |
| Callers | sub_12E7E70 (Phase orchestrator), sub_12F4060 (TargetMachine creation) |
| Consumers | sub_12E54A0, sub_12DE330, sub_12DE8F0, sub_12DFE00 |
Struct Layout
The struct is heap-allocated as a single 4,512-byte block. The first 16 bytes contain header fields, followed by 221 option slots packed contiguously, and a 32-byte zero trailer:
Offset Size Field
────── ──── ─────
0 4 int opt_level (copied from registry+112)
4 4 (padding)
8 8 qword ptr to PassOptionRegistry
16 ~4464 221 option slots (variable-size, packed)
4480 32 zero trailer (4 qwords, sentinel)
Slot offsets are deterministic -- they depend on the type sequence hard-coded into sub_12D6300. String slots consume 24 bytes, boolean and integer slots consume 16 bytes, and the unique string-pointer slot at index 181 consumes 28 bytes. The initializer writes each slot at a compile-time-constant offset; there is no dynamic layout calculation.
Slot Types
Type A: String Option (24 bytes) -- sub_12D6090
114 slots. Stores a string value (pass name or parametric value) along with flags, optimization level, and pass ID.
struct StringOption { // 24 bytes, written by sub_12D6090
char* value; // +0: pointer to string data
int32_t option_index; // +8: 1-based slot index
int32_t flags; // +12: from PassDef byte 40
int32_t opt_level; // +16: from header opt_level
int32_t pass_id; // +20: resolved via sub_1691920
};
Type B: Boolean Compact (16 bytes) -- sub_12D6100
83 slots. The most common boolean representation. The helper encapsulates the lookup-parse-resolve sequence.
struct BoolCompactOption { // 16 bytes, written by sub_12D6100
uint8_t value; // +0: 0 or 1
uint8_t pad[3]; // +1: padding
int32_t option_index; // +4: 1-based slot index
int32_t flags; // +8: from PassDef byte 40
int32_t pass_id; // +12: resolved via sub_1691920
};
Type C: Boolean Inline (16 bytes) -- direct write
17 slots. Identical layout to Type B, but written directly by sub_12D6300 rather than through the sub_12D6100 helper. These correspond to option pairs where the boolean resolution requires checking PassDef+36 (has_overrides byte) and resolving via sub_1691920 inline. The 17 inline boolean slots are: 7, 11, 13, 49, 53, 55, 59, 61, 95, 103, 119, 127, 151, 159, 169, 177, 211.
struct BoolInlineOption { // 16 bytes, same layout as Type B
uint8_t value; // +0: 0 or 1
uint8_t pad[3]; // +1
int32_t option_index; // +4: high 32 bits of sub_12D6240 return
int32_t opt_level; // +8: from header
int32_t pass_id; // +12: resolved inline
};
Type D: Integer (16 bytes) -- direct write via sub_16D2BB0
6 slots. The integer value is parsed from the registry string by sub_16D2BB0 (string-to-int64). Layout is identical to boolean compact but the first 4 bytes store a full int32_t rather than a single byte.
struct IntegerOption { // 16 bytes
int32_t value; // +0: parsed integer
int32_t option_index; // +4: 1-based slot index
int32_t opt_level; // +8
int32_t pass_id; // +12
};
Type E: String Pointer (28 bytes) -- slot 181 only
Unique. Stores a raw char* plus length rather than a managed string. Likely a file path or regex pattern that requires direct C-string access.
struct StringPtrOption { // 28 bytes, slot 181 only
char* data; // +0: raw char pointer
uint64_t length; // +8: string length
int32_t option_index; // +16: 1-based slot index
int32_t opt_level; // +20
int32_t pass_id; // +24
};
Pair Organization Pattern
The 221 slots follow a predominantly paired layout. Slots 1--6 are six standalone STRING options (likely the global compilation parameters: ftz, prec-div, prec-sqrt, fmad, opt-level, sm-arch). Starting at slot 7, slots are organized in (EVEN, ODD) pairs:
- Even slot N: STRING option -- the pass's parameter value or name
- Odd slot N+1: BOOLEAN or INTEGER option -- the enable/disable toggle
Each "pass knob" thus gets a string parameter slot and a boolean gate. The pipeline assembler reads the boolean to decide whether to insert the pass, and passes the string value as the pass's configuration parameter.
Exceptions to the pair pattern:
| Region | Anomaly |
|---|---|
| Slots 160--162 | Three consecutive STRING slots with a single boolean at 163 |
| Slots 191--193 | Slot 191 STRING, then two consecutive booleans at 192--193 |
| Slot 181 | STRING_PTR type instead of normal STRING |
| Slots 196--207 | Alternating STRING + INTEGER instead of STRING + BOOL |
Helper Functions
sub_12D6170 -- PassOptionRegistry::lookupOption
Looks up an option by its 1-based slot index in the hash table at registry+120. Returns a pointer to an OptionNode or 0 if the option was not set from the command line:
// Signature: int64 sub_12D6170(void* registry, int option_index)
// Returns: OptionNode* or 0
//
// OptionNode layout:
// +40 int16 flags
// +48 char** value_array_ptr (array of string values)
// +56 int value_count
The hash table uses open addressing. The lookup computes hash(option_index) and probes linearly. When an option is not present in the registry (meaning the user did not supply a CLI override), the caller falls back to the hard-coded default in sub_12D6300.
sub_12D6240 -- PassOptionRegistry::getBoolOption
Resolves a boolean option with a default value. This is the critical function for all 100 boolean slots -- it performs a three-step resolution:
sub_12D6240(registry, option_index, default_string):
1. Call sub_12D6170(registry, option_index)
2. If found AND has value:
lowercase the string via sub_16D2060
result = (first_char == '1' || first_char == 't') // "1" or "true"
3. If not found OR no value:
result = (default_string[0] == '1') // "0" -> false, "1" -> true
Return: packed(bool_value:8, flags:32) in low 40 bits
The packing convention is significant: the boolean value occupies the low 8 bits and the flags occupy bits 8--39. Callers unpack with (result & 0xFF) for the boolean and (result >> 8) for the flags.
sub_1691920 -- PassDefTable::getPassDef
Resolves a 1-based pass index to its PassDef entry in a table with 64-byte stride:
// sub_1691920(table_ptr, pass_index):
// return table_ptr[0] + (pass_index - 1) * 64
//
// PassDef layout (64 bytes):
// +32 int pass_id
// +36 byte has_overrides
// +40 int16 override_index
The pass_id field is written into every option slot and later used by the pipeline assembler to map configuration back to the pass factory that should receive it.
sub_16D2BB0 -- parseInt
Parses a string to a 64-bit integer. Used for the 6 integer-typed option slots (9, 197, 203, 205, 207, 215).
Default Values
Most boolean slots default to 0 (disabled). 14 slots default to 1 (enabled) -- these represent passes that run by default and must be explicitly disabled:
Confidence note: Pass associations marked [MEDIUM] are inferred from pipeline guard cross-references (a4[offset]). Associations marked [LOW] are based solely on offset proximity or default-value patterns.
| Slot | Offset | Likely Pass | Confidence |
|---|---|---|---|
| 19 | 400 | Inliner (AlwaysInliner gate) | MEDIUM |
| 25 | 520 | NVIDIA-specific pass A | LOW |
| 93 | 1880 | ConstantMerge | HIGH |
| 95 | 1920 | NVVMIntrinsicLowering | HIGH |
| 117 | 2360 | NVVMUnreachableBlockElim | HIGH |
| 141 | 2840 | ADCE | HIGH |
| 143 | 2880 | LICM | HIGH |
| 151 | 3040 | CorrelatedValuePropagation | MEDIUM |
| 155 | 3120 | MemorySpaceOpt (second pass) | MEDIUM |
| 157 | 3160 | PrintModulePass (dump mode) | HIGH |
| 159 | 3200 | Optimization-level gating | MEDIUM |
| 165 | 3328 | Late-pipeline enable block | LOW |
| 211 | 4264 | (inline bool, late pass) | LOW |
| 219 | 4424 | (compact bool, late pass) | LOW |
Integer slot defaults:
| Slot | Offset | Default | Likely Meaning |
|---|---|---|---|
| 9 | 200 | 1 | Optimization threshold / iteration count |
| 197 | 3984 | 20 | Limit/threshold (e.g., unroll count) |
| 203 | 4104 | -1 | Thread count (sentinel for auto-detect via get_nprocs()) |
| 205 | 4144 | -1 | Thread count fallback |
| 207 | 4184 | -1 | Sentinel for unlimited/auto |
| 215 | 4344 | 0 | Disabled counter |
CLI Flag Routing
The path from a user-visible flag to an NVVMPassOptions slot traverses four stages:
nvcc -Xcicc -opt "-do-licm=0" ← user invocation
│
▼
sub_9624D0 (flag catalog, 75KB) ← parses -opt flags into opt_argv vector
│ pushes "-do-licm=0" into v327 (opt vector)
▼
PassOptionRegistry (hash table) ← opt-phase parser populates registry
│ key = slot_index, value = "0"
▼
sub_12D6300 (125KB initializer) ← flattens registry into 4512-byte struct
│ sub_12D6240(registry, LICM_SLOT, "1") → returns 0 (overridden)
│ writes opts[2880] = 0
▼
sub_12E54A0 / sub_12DE8F0 ← pipeline assembler reads opts[2880]
if (opts[2880]) AddPass(LICM); ← skipped because opts[2880] == 0
The -opt flag prefix is critical: it routes the argument to the optimizer phase vector rather than to the linker, LTO, or codegen phases. The flag catalog (sub_9624D0) recognizes several shorthand patterns:
| User Flag | Routes To | Effect |
|---|---|---|
| --emit-optix-ir | opt "-do-ip-msp=0", opt "-do-licm=0" | Disables IPMSP and LICM for OptiX |
| -Ofast-compile=max | opt "-fast-compile=max", opt "-memory-space-opt=0" | Disables MemorySpaceOpt |
| -memory-space-opt=0 | opt "-memory-space-opt=0" | Direct pass disable |
| -Xopt "-do-remat=0" | opt "-do-remat=0" | Direct pass-through to opt phase |
Pipeline Consumer: How Passes Read NVVMPassOptions
The pipeline assembler and its sub-pipeline builders receive the NVVMPassOptions struct as parameter a4 (in sub_12E54A0) or opts (in sub_12DE330/sub_12DE8F0). They read individual boolean slots by dereferencing a byte at a known offset and branching:
// Pattern 1: simple disable guard
if (!*(uint8_t*)(opts + 1760)) // opts[1760] = MemorySpaceOpt disable
AddPass(PM, sub_1C8E680(0), 1, 0); // insert MemorySpaceOpt
// Pattern 2: enable guard (inverted logic)
if (*(uint8_t*)(opts + 2880)) // opts[2880] = LICM enabled (default=1)
AddPass(PM, sub_195E880(0), 1, 0); // insert LICM
// Pattern 3: combined guard with opt-level gating
if (*(uint8_t*)(opts + 3200) && // opts[3200] = opt-level sufficient
!*(uint8_t*)(opts + 880)) // opts[880] = NVVMReflect not disabled
AddPass(PM, sub_1857160(), 1, 0); // insert NVVMReflect
// Pattern 4: integer parameter read
v12 = *(int32_t*)(opts + 200); // opts[200] = opt threshold (default=1)
// used to configure codegen dispatch in sub_12DFE00
The key insight is that the pipeline assembler never performs string comparison or hash-table lookup at pass-insertion time -- it reads pre-resolved values from the flat struct. This makes the ~150 pass-insertion decisions in sub_12E54A0 essentially free in terms of runtime cost.
Offset-to-Pass Mapping
The following table maps struct offsets (as seen in pipeline assembler guards opts[OFFSET]) to the passes they control. Offsets are byte offsets from the struct base. "Guard sense" indicates whether the pass runs when the byte is 0 (!opts[X] -- most common, where the option is a disable flag) or when it is nonzero (opts[X] -- the option is an enable flag).
| Offset | Slot | Guard Sense | Controlled Pass | Factory |
|---|---|---|---|---|
| 200 | 9 | value | Optimization threshold (integer, read by sub_12DFE00) | -- |
| 280 | 14-15 | !opts | DCE (DeadCodeElimination) | sub_18DEFF0 |
| 320 | 16-17 | !opts | TailCallElim / JumpThreading | sub_1833EB0 |
| 360 | 18-19 | !opts | NVVMLateOpt | sub_1C46000 |
| 400 | 20-21 | !opts | AlwaysInliner gate A | sub_1C4B6F0 |
| 440 | 22-23 | !opts | AlwaysInliner gate B | sub_1C4B6F0 |
| 480 | 24-25 | !opts | Inliner gate C | sub_1C4B6F0 |
| 520 | 26-27 | !opts | NVIDIA-specific pass A | sub_1AAC510 |
| 560 | 28-29 | !opts | NVIDIA-specific pass B | sub_1AAC510 |
| 600 | 30-31 | !opts | NVVMVerifier | sub_12D4560 |
| 680 | 34-35 | !opts | FunctionAttrs | sub_1841180 |
| 720 | 36-37 | !opts | SCCP | sub_1842BC0 |
| 760 | 38-39 | !opts | DSE (DeadStoreElimination) | sub_18F5480 |
| 880 | 44-45 | !opts | NVVMReflect | sub_1857160 |
| 920 | 46-47 | !opts | IPConstantPropagation | sub_185D600 |
| 960 | 48-49 | !opts | SimplifyCFG | sub_190BB10 |
| 1000 | 50-51 | !opts | InstCombine | sub_19401A0 |
| 1040 | 52-53 | !opts | Sink / SimplifyCFG (early) | sub_1869C50 |
| 1080 | 54-55 | !opts | PrintModulePass (dump IR) | sub_17060B0 |
| 1120 | 56-57 | !opts | NVVMPredicateOpt | sub_18A3430 |
| 1160 | 58-59 | !opts | LoopIndexSplit | sub_1952F90 |
| 1200 | 60-61 | !opts | SimplifyCFG (tier guard) | sub_190BB10 |
| 1240 | 62-63 | !opts | LICM | sub_195E880 |
| 1280 | 64-65 | !opts | Reassociate / Sinking | sub_1B7FDF0 |
| 1320 | 66-67 | !opts | ADCE (AggressiveDeadCodeElimination) | sub_1C76260 |
| 1360 | 68-69 | !opts | LoopUnroll | sub_19C1680 |
| 1400 | 70-71 | !opts | SROA | sub_1968390 |
| 1440 | 72-73 | !opts | EarlyCSE | sub_196A2B0 |
| 1480 | 74-75 | !opts | ADCE extra guard | sub_1C76260 |
| 1520 | 76-77 | !opts | LoopSimplify | sub_198DF00 |
| 1640 | 82-83 | !opts | NVVMWarpShuffle | sub_1C7F370 |
| 1680 | 84-85 | !opts | NVIDIA pass (early) | sub_19CE990 |
| 1760 | 88-89 | !opts | MemorySpaceOpt (primary) | sub_1C8E680 |
| 1840 | 92-93 | !opts | ADCE variant | sub_1C6FCA0 |
| 1960 | 98-99 | !opts | ConstantMerge / GlobalDCE | sub_184CD60 |
| 2000 | 100-101 | !opts | NVVMIntrinsicLowering | sub_1CB4E40 |
| 2040 | 102-103 | !opts | MemCpyOpt | sub_1B26330 |
| 2080 | 104-105 | !opts | BranchDist gate A | sub_1CB73C0 |
| 2120 | 106-107 | !opts | BranchDist gate B | sub_1CB73C0 |
| 2160 | 108-109 | !opts | NVVMPredicateOpt variant | sub_18A3090 |
| 2200 | 110-111 | !opts | GenericToNVVM | sub_1A02540 |
| 2240 | 112-113 | !opts | NVVMLowerAlloca gate A | sub_1CBC480 |
| 2280 | 114-115 | !opts | NVVMLowerAlloca gate B | sub_1CBC480 |
| 2320 | 116-117 | !opts | NVVMRematerialization | sub_1A13320 |
| 2360 | 118-119 | !opts | NVVMUnreachableBlockElim | sub_1CC3990 |
| 2400 | 120-121 | !opts | NVVMReduction | sub_1CC5E00 |
| 2440 | 122-123 | !opts | NVVMSinking2 | sub_1CC60B0 |
| 2560 | 128-129 | !opts | NVVMGenericAddrOpt | sub_1CC71E0 |
| 2600 | 130-131 | !opts | NVVMIRVerification | sub_1A223D0 |
| 2640 | 132-133 | !opts | LoopOpt / BarrierOpt | sub_18B1DE0 |
| 2680 | 134-135 | !opts | MemorySpaceOpt (second invocation) | sub_1C8E680 |
| 2720 | 136-137 | !opts | InstructionSimplify | sub_1A7A9F0 |
| 2760 | 138-139 | !opts | LoopUnswitch variant | sub_19B73C0 |
| 2840 | 141 | opts | ADCE (enabled by default, default=1) | sub_1C6FCA0 |
| 2880 | 143 | opts | LICM (enabled by default, default=1) | sub_195E880 |
| 2920 | 145 | value | LowerBarriers parameter | sub_1C98270 |
| 3000 | 150-151 | opts | Early pass guard | sub_18FD350 |
| 3040 | 151 | opts | CorrelatedValuePropagation (default=1) | sub_18EEA90 |
| 3080 | 153 | opts | NVIDIA-specific loop pass | sub_1922F90 |
| 3120 | 155 | opts | MemorySpaceOpt second-pass enable (default=1) | sub_1C8E680 |
| 3160 | 157 | opts | PrintModulePass enable (default=1) | sub_17060B0 |
| 3200 | 159 | opts | Optimization-level gate (default=1) | -- |
| 3328 | 165 | opts | Late-pipeline enable block (default=1) | multiple |
| 3488 | 174-175 | opts | NVVMBarrierAnalysis + LowerBarriers enable | sub_18E4A00 |
| 3648 | 181 | string | Language string ("ptx"/"mid") | path dispatch |
| 3704 | 185 | opts | Late optimization flag | sub_1C8A4D0 |
| 3904 | 193 | opts | Debug / verification mode | sub_12D3E60 |
| 3944 | 195 | opts | Basic block naming ("F%d_B%d") | sprintf |
| 3984 | 197 | value | Integer limit (default=20) | -- |
| 4064 | 201 | value | Concurrent compilation override | sub_12D4250 |
| 4104 | 203 | value | Thread count (default=-1, auto-detect) | sub_12E7E70 |
| 4144 | 205 | value | Thread count fallback (default=-1) | sub_12E7E70 |
| 4184 | 207 | value | Integer parameter (default=-1) | -- |
| 4224 | 209 | opts | Optimization enabled flag | tier dispatch |
| 4304 | 213 | opts | Device-code flag | Pipeline B |
| 4344 | 215 | value | Integer counter (default=0) | -- |
| 4384 | 217 | opts | Fast-compile bypass flag | Pipeline B dispatch |
| 4464 | 221 | !opts | Late CFG cleanup guard | sub_1654860 |
Known Option Names
Option names are stored in the PassOptionRegistry hash table, not in sub_12D6300 itself. The following names are extracted from binary string references in global constructors and pass factories:
Boolean Toggles (do-X / no-X)
| Name | Likely Slot Region | Default |
|---|---|---|
| do-ip-msp | MemorySpaceOpt area | enabled |
| do-clone-for-ip-msp | MemorySpaceOpt variant | -- |
| do-licm | offset 2880 (slot 143) | 1 (enabled) |
| do-remat | offset 2320 (slot 117) | enabled |
| do-cssa | CSSA pass area | -- |
| do-scev-cgp | SCEV-CGP area | -- |
| do-function-scev-cgp | function-level SCEV-CGP | -- |
| do-scev-cgp-aggresively [sic] | aggressive SCEV-CGP mode | -- |
| do-base-address-strength-reduce | BaseAddrSR area | -- |
| do-base-address-strength-reduce-chain | BaseAddrSR chain variant | -- |
| do-comdat-renaming | COMDAT pass | -- |
| do-counter-promotion | PGO counter promotion | -- |
| do-lsr-64-bit | 64-bit loop strength reduction | -- |
| do-sign-ext-expand | sign extension expansion | -- |
| do-sign-ext-simplify | sign extension simplification | -- |
Dump/Debug Toggles
| Name | Purpose |
|---|---|
| dump-ip-msp | Dump IR around MemorySpaceOpt |
| dump-ir-before-memory-space-opt | IR dump pre-MSP |
| dump-ir-after-memory-space-opt | IR dump post-MSP |
| dump-memory-space-warnings | MSP diagnostic warnings |
| dump-remat / dump-remat-add / dump-remat-iv / dump-remat-load | Rematerialization diagnostics |
| dump-branch-dist | Branch distribution diagnostics |
| dump-scev-cgp | SCEV-CGP diagnostics |
| dump-base-address-strength-reduce | BaseAddrSR diagnostics |
| dump-sink2 | Sinking2 diagnostics |
| dump-before-cssa | CSSA input dump |
| dump-phi-remove | PHI removal diagnostics |
| dump-normalize-gep | GEP normalization dump |
| dump-simplify-live-out | Live-out simplification dump |
| dump-process-restrict | Process-restrict dump |
| dump-process-builtin-assume | Builtin assume processing dump |
| dump-conv-dot / dump-conv-func / dump-conv-text | Convergence analysis dumps |
| dump-nvvmir | NVVM IR dump |
| dump-va | Value analysis dump |
Parametric Knobs
| Name | Default | Purpose |
|---|---|---|
| remat-for-occ | 120 | Occupancy target for rematerialization |
| remat-gep-cost | 6000 | GEP rematerialization cost threshold |
| remat-lli-factor | 10 | Long-latency instruction factor |
| remat-max-live-limit | 10 | Maximum live range limit for remat |
| remat-single-cost-limit | -- | Single-instruction remat cost limit |
| remat-loop-trip | -- | Loop trip count for remat decisions |
| remat-use-limit | -- | Use count limit for remat candidates |
| remat-maxreg-ceiling | -- | Register ceiling for remat |
| remat-move | -- | Remat move control |
| remat-load-param | -- | Parameter load remat control |
| remat-ignore-single-cost | -- | Ignore single-cost heuristic |
| branch-dist-block-limit | -1 | Max blocks for branch distribution (-1 = unlimited) |
| branch-dist-func-limit | -1 | Max functions for branch distribution |
| branch-dist-norm | 0 | Branch distribution normalization mode |
| scev-cgp-control | -- | SCEV-CGP mode selector |
| scev-cgp-norm | -- | SCEV-CGP normalization |
| scev-cgp-check-latency | -- | Latency check threshold |
| scev-cgp-cross-block-limit | -- | Cross-block limit |
| scev-cgp-idom-level-limit | -- | Immediate dominator level limit |
| scev-cgp-inst-limit | -- | Instruction count limit |
| scev-cgp-old-base | -- | Old base address mode |
| scev-cgp-tid-max-value | -- | Thread ID max value |
| base-address-strength-reduce-iv-limit | -- | IV limit for base addr SR |
| base-address-strength-reduce-max-iv | -- | Max IV count |
| cssa-coalesce | -- | CSSA coalescing mode |
| cssa-verbosity | -- | CSSA diagnostic verbosity |
| memory-space-opt-pass | -- | MSP pass variant selector |
| peephole-opt | -- | Peephole optimizer control |
| loop-index-split | -- | Loop index split control |
| va-use-scdg | -- | Value analysis SCDG mode |
| nvvm-peephole-optimizer | -- | NVVM peephole enable |
| nvvm-intr-range | -- | Intrinsic range analysis control |
Differences from Upstream LLVM
Upstream LLVM has nothing resembling this system. The closest analogue is the cl::opt<T> flag mechanism, but that scatters configuration across hundreds of global variables that each pass reads independently. The differences are architectural:
| Aspect | Upstream LLVM | cicc NVVMPassOptions |
|---|---|---|
| Storage | ~1,689 scattered cl::opt globals in BSS | Single 4,512-byte contiguous struct |
| Initialization | Global constructors register each flag | One 125KB function flattens all 221 slots |
| Access pattern | Each pass reads its own globals | Pipeline assembler reads all slots centrally |
| Copyability | Not designed for copying | Struct is trivially memcpy-able for Phase I/II |
| Thread safety | Global cl::opt requires careful coordination | Each thread gets its own struct copy |
| Override mechanism | cl::opt command-line parser | PassOptionRegistry hash table with fallback defaults |
| Pass gating | Pass decides internally whether to run | Pipeline assembler decides before constructing pass |
The thread-safety property is crucial for the two-phase concurrent compilation model. When Phase II runs per-function compilation in parallel threads, each thread receives a copy of the NVVMPassOptions struct. If NVIDIA used upstream cl::opt globals for pass configuration, they would need global locks or TLS for every option read during pass execution -- an unacceptable overhead for a GPU compiler that may process hundreds of kernels in a single translation unit.
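The copy-per-thread pattern can be sketched as follows. This is a minimal reconstruction: the slot layout and the function name are assumptions; only the 4,512-byte size and the memcpy semantics come from the analysis above.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical model: 221 variable-width option slots packed into one
// 4,512-byte trivially copyable block.
struct NVVMPassOptions {
    unsigned char slots[4512];
};

// Each Phase II worker thread receives its own copy, so option reads
// during pass execution need no locks and no TLS indirection.
NVVMPassOptions *clone_options_for_thread(const NVVMPassOptions *master) {
    NVVMPassOptions *copy =
        static_cast<NVVMPassOptions *>(std::malloc(sizeof *copy));
    if (copy)
        std::memcpy(copy, master, sizeof *copy);  // trivially memcpy-able
    return copy;
}
```

Because the copy is private, a worker mutating its options (e.g. a per-function override) never races with sibling threads.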
Interaction with Two-Phase Compilation
The NVVMPassOptions struct is allocated and populated before Phase I begins, in the orchestrator sub_12E7E70:
// sub_12E7E70, line ~128
void* opts = malloc(4512); // allocate NVVMPassOptions
sub_12D6300(opts, registry); // populate from CLI-parsed registry
// ... pass opts to sub_12E54A0 for Phase I ...
// ... pass same opts to sub_12E54A0 for Phase II ...
Both phases receive the same opts pointer. Individual passes within the pipeline assembler check qword_4FBB3B0 (the TLS phase counter) to skip themselves in the wrong phase -- but the NVVMPassOptions struct itself does not change between phases. This means a pass cannot be enabled in Phase I but disabled in Phase II through NVVMPassOptions alone; phase selection is handled by the separate TLS mechanism.
The second caller, sub_12F4060 (TargetMachine creation in the standalone path), performs an identical allocation and initialization sequence, confirming that every compilation path goes through the same NVVMPassOptions infrastructure.
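The phase-gating mechanism can be sketched like this. The counter stands in for qword_4FBB3B0; the module-level/per-function split is an illustrative assumption about how passes classify themselves.

```cpp
#include <cassert>

// Stand-in for the TLS phase counter qword_4FBB3B0.
static thread_local long tls_phase_counter = 0;

enum { PHASE_I = 1, PHASE_II = 2 };

// Illustrative gate: suppose module-level passes run only in Phase I and
// per-function passes only in Phase II. NVVMPassOptions is identical in
// both phases; only this TLS check differs.
bool pass_should_run(bool is_module_level_pass) {
    long phase = tls_phase_counter;
    return is_module_level_pass ? (phase == PHASE_I) : (phase == PHASE_II);
}
```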
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
NVVMPassOptions::init | sub_12D6300 | 125KB | Populate 221 slots from registry |
PassOptionRegistry::lookupOption | sub_12D6170 | ~200B | Hash-table lookup by slot index |
PassOptionRegistry::getBoolOption | sub_12D6240 | ~300B | Boolean resolution with default |
writeStringOption | sub_12D6090 | ~150B | Write 24-byte string slot |
writeBoolOption | sub_12D6100 | ~120B | Write 16-byte boolean slot |
PassDefTable::getPassDef | sub_1691920 | ~80B | 64-byte stride table lookup |
parseInt | sub_16D2BB0 | ~100B | String-to-int64 parser |
toLowercase | sub_16D2060 | ~80B | String lowercasing for bool parse |
Cross-References
- LLVM Optimizer -- pipeline assembler that consumes NVVMPassOptions
- Configuration Knobs -- all three knob systems (cl::opt, NVVMPassOptions, codegen)
- CLI Flags -- flag catalog and routing to opt phase vector
- Optimization Levels -- O-level encoding and fast-compile modes
- Concurrent Compilation -- Phase I/II threading model
- Entry Point & CLI -- wizard mode and -opt flag dispatching
- OptiX IR -- forces do-ip-msp=0 and do-licm=0
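In upstream-LLVM source terms, each such sequence corresponds to one static cl::opt definition. A minimal model of the record and the registration order follows; the record layout, field names, and registry are assumptions, while the atomic counter bump, the ~224-byte footprint, and the set-name-then-finalize order come from the decompilation.

```cpp
#include <atomic>
#include <cassert>
#include <cstring>

static std::atomic<long> g_option_count{0};   // stands in for *sub_C523C0()

struct OptRecord {                // models one ~224-byte (0xE0) cl::opt BSS slot
    char name[64];
    bool registered;
    unsigned char payload[159];   // value storage, parser hooks, flags...
};

void register_option(OptRecord &opt, const char *name) {
    g_option_count.fetch_add(1);                        // atomic option counter
    std::strncpy(opt.name, name, sizeof opt.name - 1);  // sub_C53080: set name
    opt.name[sizeof opt.name - 1] = '\0';
    opt.registered = true;                              // sub_C53130: finalize
}
```

Note the model struct is exactly 224 bytes (64 + 1 + 159, alignment 1), matching the observed BSS stride.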
Configuration Knobs
Three independent knob systems control compiler behavior: LLVM cl::opt flags (~1,496 unique), NVVMPassOptions (222 slots), and NVIDIA codegen knobs (~70).
| LLVM cl::opt | 1,496 unique flags across 353 constructor files |
| NVVMPassOptions | 222 slots, initialized by sub_12D6300 (125KB) |
| Codegen knobs | ~70, parsed by sub_1C20170 / sub_CD9990 from NVVM container |
| BSS storage | 0x4F7FEA0–0x4FA5xxx (cl::opt), a1+0–a1+4464 (PassOptions) |
| Dual PM | Same options registered for both Legacy PM (sub_C53080) and New PM (sub_16B8280) |
| NVIDIA-specific | 172 of 1,496 cl::opt flags (11.5%) are NVIDIA-added |
Knob System 1: LLVM cl::opt
Registration Pattern
Every cl::opt follows this initialization sequence in a global constructor:
// Legacy PM path
InterlockedExchangeAdd64(sub_C523C0(), 1); // atomic option counter
sub_C53080(&option, "option-name", strlen); // set name
sub_C53130(&option); // finalize registration
__cxa_atexit(destructor, &option, &dso_handle);
// New PM path (parallel registration)
InterlockedExchangeAdd64(&unk_4FA0230, 1);
sub_16B8280(&option, "option-name", strlen);
sub_16B88A0(&option);
__cxa_atexit(destructor, &option, &dso_handle);
Each cl::opt<T> occupies ~224 bytes (0xE0) in BSS. Top constructors by option count: ctor_600 (30), ctor_433 (25), ctor_472 (24), ctor_609 (22), ctor_392 (22).
Category 1: Scalar Optimization (InstCombine + FP)
Constructor: ctor_165_0 at 0x4D0500 (11,731 bytes). Registers 12 NVIDIA-specific flags plus 4 standard LLVM flags.
| Flag | Type | Default | BSS Addr | Purpose |
|---|---|---|---|---|
split-gep-chain | bool | false | 0x4F901A8 | Split GEP chains to independent GEPs for better address mode selection |
Disable-Add-to-Or | bool | true | — | Disable add-to-or transformation (NVIDIA blocks this LLVM combine) |
opt-use-fast-math | bool | false | — | Enable aggressive FP simplification (set by -unsafe-math / -fast-math) |
opt-use-prec-div | bool | true | — | Use precise division (set by -prec-div=1; cleared by -prec-div=0) |
opt-no-signed-zeros | bool | false | — | Ignore signed zero distinction (set by -no-signed-zeros) |
disable-fp-cast-opt | bool | false | — | Disable FP-to-int and int-to-FP cast optimizations |
reorder-sext-before-cnst-add | bool | false | — | sext(add(a,CI)) to add(sext(a),CI) rewrite; hidden flag |
disable-sink | bool | false | — | Disable instruction sinking in InstCombine |
partial-sink | bool | false | — | Enable partial sinking of instructions |
nvptx-rsqrt-approx-opt | bool | false | — | Enable reciprocal sqrt approximation optimization |
disable-rsqrt-opt | bool | false | — | Disable reciprocal sqrt optimization entirely |
check-vn | bool | false | — | Verify value numbers after transformations (debug) |
Standard LLVM flags in same constructor: expensive-combines (bool), instcombine-maxarray-size (int, default 1024), instcombine-visit (int), instcombine-lower-dbg-declare (bool).
Category 2: Inliner Heuristics
Constructor: ctor_186_0 at 0x4DBEC0 (14,109 bytes). Nine NVIDIA-specific flags governing the custom CGSCC inliner at sub_1864060.
| Flag | Type | Default | Purpose |
|---|---|---|---|
profuseinline | bool | false | Verbose inlining diagnostics (NVIDIA profuse framework, not PGO profuse) |
inline-total-budget | int | none | Global total budget across all callers; unset = unlimited |
nv-inline-all | bool | false | Force inline ALL function calls (used by OptiX ray tracing) |
inline-budget | int | 20000 | Per-caller inlining cost budget; -aggressive-inline sets to 40000 |
inline-adj-budget1 | int | none | Secondary adjusted per-caller budget |
inline-switchctrl | int | none | Tune heuristic for switch-containing callees |
inline-numswitchfunc | int | none | Threshold for switch-heavy function penalty |
inline-maxswitchcases | int | none | Max switch cases before inlining penalty kicks in |
disable-inlined-alloca-merging | bool | false | Disable post-inline alloca merging into single frame slot |
"none" means the knob is unset by default and the heuristic falls back to internal logic.
Category 3: GVN (Global Value Numbering)
Constructor: ctor_201 at 0x4E0990. Eleven knobs (8 NVIDIA-specific + 3 upstream).
| Flag | Type | Default | BSS Addr | Purpose |
|---|---|---|---|---|
profusegvn | bool | true | 0x4FAE7E0 | Verbose GVN diagnostics (unusually, defaults on) |
gvn-dom-cache | bool | true | 0x4FAE700 | Cache dominator tree nodes; cache size = 32 |
max-recurse-depth | int | 1000 | 0x4FAE620 | Max recursion during value numbering (safety valve for template-heavy code) |
enable-phi-remove | int | 2 | 0x4FAEC40 | PHI removal aggressiveness: 0=off, 1=trivial only, 2=post-leader substitution |
dump-phi-remove | int | 0 | 0x4FAEB60 | Dump PHI removal decisions (debug) |
no-split-stores-below | int | -1 | 0x4FAEA80 | Min store width for splitting (bits); -1 = no limit |
no-split-stores-above | int | -1 | 0x4FAE9A0 | Max store width for splitting (bits); -1 = no limit |
split-stores | bool | true | 0x4FAE8C0 | Master enable for NVIDIA store-splitting in GVN |
enable-pre | bool | true | 0x4FAEEE0 | Enable Partial Redundancy Elimination (upstream LLVM) |
enable-load-pre | bool | true | 0x4FAEE00 | Enable load PRE across edges (upstream LLVM) |
enable-split-backedge-in-load-pre | bool | false | 0x4FAED20 | Allow backedge splitting during load PRE (upstream LLVM) |
Store splitting uses a custom NVIDIA registrar (sub_190BE40) that takes a default-value pointer. Both limit knobs default to -1 = all sizes eligible.
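The width window implied by the two limit knobs can be sketched as a simple eligibility test. The semantics here are assumed from the knob names and the -1 defaults; the actual pass decides per-store inside GVN.

```cpp
#include <cassert>

// A store of `bits` width is eligible for splitting when it lies inside
// the [below, above] window, with -1 disabling that bound (the shipped
// defaults, i.e. all widths eligible).
bool store_splittable(bool split_stores, int bits, int below, int above) {
    if (!split_stores) return false;            // split-stores master enable
    if (below != -1 && bits < below) return false;
    if (above != -1 && bits > above) return false;
    return true;
}
```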
Category 4: Loop Strength Reduction
Constructor: ctor_214_0 at 0x4E4B00. Eleven NVIDIA-specific LSR flags (69% NVIDIA customization rate).
| Flag | Type | Default | Purpose |
|---|---|---|---|
disable-unknown-trip-lsr | bool | false | Disable LSR for loops with unknown trip count |
lsr-check-rp | bool | true [MEDIUM] | Check register pressure before applying LSR |
lsr-rp-limit | int | ~32-64 [LOW] | Skip LSR entirely when RP exceeds this limit (occupancy cliff) |
filter-bad-formula | bool | true [MEDIUM] | Filter out poor-quality LSR formulae early |
do-lsr-64-bit | bool | arch-dependent | Enable 64-bit loop strength reduction (false on sm_3x-5x, true on sm_70+) |
count-sxt-opt-for-reg-pressure | bool | true [MEDIUM] | Factor sign-extension elimination savings into RP analysis |
lsr-sxtopt | bool | true [MEDIUM] | Perform sign-extension elimination within LSR |
lsr-loop-level | int | 0 | Apply LSR only at specific loop nesting level (0 = all levels) |
lsr-skip-outer-loop | bool | false | Ignore outer-loop induction variables in LSR |
disable-lsr-for-sharedmem32-ptr | bool | false | Disable LSR for 32-bit shared memory pointers (GPU-specific) |
disable-lsr-complexity-discount | bool | false | Disable complexity estimation discount heuristic |
Standard LLVM LSR flags in same constructor: enable-lsr-phielim, lsr-insns-cost, lsr-exp-narrow, lsr-filter-same-scaled-reg, lsr-fix-iv-inc.
Category 5: IndVarSimplify
Constructor: ctor_203_0 at 0x4E1CD0 (7,007 bytes).
| Flag | Type | Default | Purpose |
|---|---|---|---|
Disable-unknown-trip-iv | bool | false | Disable IV substitution for unknown-trip-count loops |
iv-loop-level | int | none | Control which loop nesting levels get IV substitution |
Category 6: SimplifyCFG
Constructor: ctor_243_0 at 0x4ED0C0.
| Flag | Type | Default | Purpose |
|---|---|---|---|
disable-jump-threading | bool | false | Disable jump threading (for OCG experiments) |
fold-with-var-cond | bool | false | Fold branches with variance-based conditions |
Category 7: NVPTX Backend Math/Scheduling
Constructor: ctor_607 at 0x584B60 (13,700 bytes). Core numeric precision and FMA controls. Defaults are set by the CLI flag routing in sub_9624D0, not by the cl::opt constructors.
| Flag | Type | CLI Default | Purpose |
|---|---|---|---|
nvptx-sched4reg | bool | false | Schedule for register pressure (key NVPTX strategy) |
nvptx-fma-level | int | 1 | FMA contraction: 0=off, 1=on, 2=aggressive. CLI -fma=1 is default |
nvptx-prec-divf32 | int | 1 | F32 div precision: 0=approx, 1=full, 2=IEEE rnd+ftz, 3=IEEE no-ftz |
nvptx-prec-sqrtf32 | int | 1 | Sqrt precision: 0=approx, 1=rn. CLI -prec-sqrt=1 is default |
nvptx-approx-log2f32 | bool | false | Use lg2.approx for log2 (only set by -unsafe-math) |
nvptx-force-min-byval-param-align | bool | false | Force 4-byte minimum alignment for byval parameters |
nvptx-normalize-select | bool | false | Override shouldNormalizeToSelectSequence in TLI |
enable-bfi64 | bool | false | Enable 64-bit BFI (bit-field insert) instructions |
Note: These cl::opt knobs have no explicit default in their constructor (they init to 0/false). The effective defaults come from the CLI flag catalog: -fma=1 routes -nvptx-fma-level=1, -prec-div=1 routes -nvptx-prec-divf32=1, -prec-sqrt=1 routes -nvptx-prec-sqrtf32=1.
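The routing can be sketched as a small translation step. This reconstructs only the three documented routes plus the -unsafe-math effect on lg2.approx; the real sub_9624D0 handles the full flag catalog.

```cpp
#include <cassert>

struct BackendMathOpts {
    int  fma_level      = 0;   // nvptx-fma-level
    int  prec_divf32    = 0;   // nvptx-prec-divf32
    int  prec_sqrtf32   = 0;   // nvptx-prec-sqrtf32
    bool approx_log2f32 = false;
};

BackendMathOpts route_math_flags(int fma, int prec_div, int prec_sqrt,
                                 bool unsafe_math) {
    BackendMathOpts o;
    o.fma_level      = fma;         // -fma=N       -> -nvptx-fma-level=N
    o.prec_divf32    = prec_div;    // -prec-div=N  -> -nvptx-prec-divf32=N
    o.prec_sqrtf32   = prec_sqrt;   // -prec-sqrt=N -> -nvptx-prec-sqrtf32=N
    o.approx_log2f32 = unsafe_math; // only -unsafe-math enables lg2.approx
    return o;
}
```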
Category 8: NVPTX Backend Passes/Features
Constructor: ctor_609_0 at 0x585D30 (22 options total, largest NVPTX constructor).
| Flag | Type | Default | Purpose |
|---|---|---|---|
disable-nvptx-load-store-vectorizer | bool | false | Disable load/store vectorizer |
disable-nvptx-require-structured-cfg | bool | false | Turn off structured CFG requirement (transitional) |
nvptx-short-ptr | bool | false | 32-bit pointers for const/local/shared address spaces |
nvptx-enable-machine-sink | bool | false | Enable machine-level instruction sinking |
enable-new-nvvm-remat | bool | true | Enable new NVVM rematerialization engine |
nv-disable-remat | bool | false | Disable all rematerialization passes |
nv-disable-mem2reg | bool | false | Disable machine-IR mem2reg promotion |
nv-disable-scev-cgp | bool | true | Disable SCEV-based address mode optimization (on = disabled) |
nvptx-32-bit-smem | bool | false | Use 32-bit pointers for shared address space |
nvptx-exit-on-unreachable | bool | true | Lower unreachable as PTX exit instruction |
nvptx-early-byval-copy | bool | false | Create copy of byval function args in entry block |
enable-nvvm-peephole | bool | true | Enable NVVM peephole optimizer |
no-reg-target-nvptxremat | bool | false | Only run old remat on kernels without register targets |
lower-func-args | bool | true | Lower large aggregate function parameters to copies |
enable-sink | bool | true | Enable LLVM sinking pass |
disable-post-opt | bool | false | Disable IR optimizations in post-opt phase |
usedessa | int | 2 | deSSA method: 0=off, 1=basic, 2=full |
ldg | bool | true | Load-via-texture (ld.global.nc) constant transform |
Category 9: NVPTX Backend Extended
Constructor: ctor_610 at 0x5888A0 (7,400 bytes).
| Flag | Type | Default | Purpose |
|---|---|---|---|
unroll-assumed-size | int | 4 | Assumed element count for unknown-size local arrays during unroll analysis |
enable-loop-peeling | bool | false | Enable loop peeling transformation |
enable-256-bit-load-store | bool | false | Enable 256-bit (32-byte) vector load/store generation |
ias-param-always-point-to-global | bool | false | Assume function parameter pointers always point to global memory |
ias-strong-global-assumptions | bool | false | Stronger assumption: constant-buffer pointers resolve to globals |
ias-wmma-memory-space-opt | bool | false | Enable MemorySpaceOpt specialization for WMMA/tensor operations |
Category 10: Memory Space Optimization
Scattered across ctor_264, ctor_267_0, ctor_528, ctor_531_0. See MemorySpaceOpt and IPMSP for the full algorithm.
| Flag | Type | Default | Purpose |
|---|---|---|---|
mem-space-alg | int | 2 | Switch between MSO algorithm variants |
dump-ir-before-memory-space-opt | bool | false | Dump IR before MSO |
dump-ir-after-memory-space-opt | bool | false | Dump IR after MSO |
track-indir-load | bool | true | Track indirect loads during MSO dataflow |
track-int2ptr | bool | true | Track IntToPtr casts in MSO |
param-always-point-to-global | bool | true | Kernel parameter pointers always point to global memory |
devicefn-param-always-local | bool | false | Treat parameter space as local in device functions |
ignore-address-space-check | bool | false | Ignore address-space checks during branch distribution |
sink-into-texture | int | 3 | Sink loads into texture blocks: 0=off, 1=cross-block, 2=+intra, 3=+outside-only. See also Category 14 |
ldg | bool | true | Load Global Constant Transform (ld.global.nc) |
do-clone-for-ip-msp | int | -1 | Function cloning limit for IP-MSP (-1 = unlimited, 0 = disable) |
dump-ip-msp | bool | false | Dump interprocedural MSP info |
lower-read-only-devicefn-byval | bool | false | Handle byval attribute of args to read-only device functions |
reuse-lmem-very-long-live-range | int | — | Threshold for very-long live range in local memory reuse |
hoist-load-param | bool | false | Generate all ld.param in entry basic block |
sink-ld-param | bool | false | Sink one-use ld.param to use point |
process-alloca-always | bool | true | Treat alloca as definite local (AS 5) regardless of context |
wmma-memory-space-opt | bool | true | Enable memory space optimization for WMMA operations |
strong-global-assumptions | bool | true | Assume const buffer pointers always point to globals |
process-builtin-assume | bool | — | Process __builtin_assume(__is*(p)) assertions for space deduction |
Category 11: Rematerialization
Scattered across ctor_609_0, ctor_362, ctor_277_0, ctor_361_0, and others. See Rematerialization for full algorithm detail.
IR-Level Knobs (ctor_277_0 at 0x4F7BE0)
| Flag | Type | Default | Global | Purpose |
|---|---|---|---|---|
do-remat | int | 3 | dword_4FC05C0 | Master control. 0=off, 1=conservative, 2=normal, 3=full |
no-remat | string | (empty) | qword_4FC0440 | Comma-separated function exclusion list |
remat-iv | int | 4 | dword_4FBFB40 | IV demotion level. 0=off, 4=full |
remat-load | int | 1 | dword_4FBFA60 | Load rematerialization. 0=off, 1=on |
remat-add | int | 0 | dword_4FBF980 | Add/GEP factoring. 0=off |
remat-single-cost-limit | int | 6000 | dword_4FC0080 | Max cost per single live-in reduction |
remat-loop-trip | int | 20 | dword_4FBFFA0 | Default assumed loop trip count |
remat-gep-cost | int | 6000 | dword_4FBFEC0 | Max cost for GEP rematerialization |
remat-use-limit | int | 10 | dword_4FBFDE0 | Max number of uses for a candidate |
remat-max-live-limit | int | 10 | dword_4FBFD00 | Max live-in limit for rematerialization |
remat-maxreg-ceiling | int | 0 | dword_4FBF600 | Register ceiling (0 = uncapped) |
remat-for-occ | int | 120 | dword_4FBF8A0 | Occupancy-driven rematerialization target |
remat-lli-factor | int | 10 | dword_4FC0320 | Long-latency instruction cost factor |
remat-ignore-single-cost | bool | false | byte_4FBFC20 | Bypass per-value cost filter |
remat-move | bool | false | byte_4FC0400 | Remat move instructions |
simplify-live-out | int | 2 | dword_4FBF520 | NLO level. 0=off, 2=full |
dump-remat | int | 0 | dword_4FC0240 | Debug dump level (0-4+) |
dump-remat-iv | int | 0 | dword_4FC0160 | IV remat debug dump |
dump-remat-load | int | 0 | dword_4FBF720 | Load remat debug dump |
dump-remat-add | int | 0 | dword_4FBF640 | Add remat debug dump |
dump-simplify-live-out | bool | false | byte_4FBF400 | NLO debug dump |
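A hedged sketch of a candidate filter driven by the knobs above: the predicate shape is inferred from the knob names and defaults, and the real pass interleaves these checks with live-range analysis.

```cpp
#include <cassert>

struct RematKnobs {
    int  do_remat           = 3;     // 0=off .. 3=full
    int  single_cost_limit  = 6000;  // remat-single-cost-limit
    int  use_limit          = 10;    // remat-use-limit
    int  max_live_limit     = 10;    // remat-max-live-limit
    int  lli_factor         = 10;    // remat-lli-factor
    bool ignore_single_cost = false; // remat-ignore-single-cost
};

bool remat_candidate_ok(const RematKnobs &k, int cost, int uses,
                        int live_ins, bool long_latency) {
    if (k.do_remat == 0) return false;          // master off
    if (long_latency) cost *= k.lli_factor;     // penalize long-latency defs
    if (!k.ignore_single_cost && cost > k.single_cost_limit) return false;
    if (uses > k.use_limit) return false;       // too many uses to clone
    return live_ins <= k.max_live_limit;        // live-in budget
}
```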
Machine-Level Knobs (ctor_361_0 at 0x5108E0)
| Flag | Type | Default | Global | Purpose |
|---|---|---|---|---|
nv-remat-block | int | 14 | dword_4FD3820 | Bitmask controlling remat modes (bits 0-3) |
nv-remat-max-times | int | 10 | dword_4FD3740 | Max outer loop iterations |
nv-remat-block-single-cost | int | 10 | dword_4FD3660 | Max cost per single live value pull-in |
nv-remat-block-map-size-limit | int | 6 | dword_4FD3580 | Map size limit for single pull-in |
nv-remat-block-max-cost | int | 100 | dword_4FD3040 | Max total clone cost per live value reduction |
nv-remat-block-liveout-min-percentage | int | 70 | dword_4FD3120 | Min liveout % for special consideration |
nv-remat-block-loop-cost-factor | int | 20 | unk_4FD3400 | Loop cost multiplier |
nv-remat-default-max-reg | int | 70 | unk_4FD3320 | Default max register pressure target |
nv-remat-block-load-cost | int | 10 | unk_4FD2EC0 | Cost assigned to load instructions |
nv-remat-threshold-for-spec-reg | int | 20 | unk_4FD3860 | Threshold for special register remat |
nv-dump-remat-block | bool | false | byte_4FD2E80 | Debug dump toggle |
nv-remat-check-internal-live | bool | false | byte_4FD2DA0 | Check internal liveness during MaxLive |
max-reg-kind | int | 0 | qword_4FD2C20 | Kind of max register pressure info |
no-mi-remat | string | (empty) | qword_4FD2BE0 | Skip machine-level remat for named functions |
load-remat | bool | true | word_4FD32F0 | Enable load rematerialization |
vasp-fix1 | bool | false | word_4FD3210 | VASP fix for volatile/addsp |
General Remat Knobs (ctor_609_0, ctor_362, and others)
| Flag | Type | Default | Purpose |
|---|---|---|---|
nv-disable-remat | bool | false | Disable all remat passes |
enable-new-nvvm-remat | bool | true | Enable new NVVM remat engine (disables old) |
no-reg-target-nvptxremat | bool | false | Only old remat for kernels without register targets |
fp-remat | bool | false | Allow rematerializing floating-point instructions |
high-cost-remat | bool | false | Allow rematerializing high-cost instructions |
cost-threshold-remat | int | — | Cost threshold per remat action |
block-freq-cap-remat | int | — | Maximum raw block frequency value |
block-freq-norm-range-remat | int | — | Normalization range for block frequency in remat cost |
collect-candidate-scale-remat | int | — | Scaling ratio for high-RP candidate collection |
incremental-update-remat | bool | false | Incrementally update RP analysis after each remat |
verify-update-remat | bool | false | Debug: verify incremental update vs full analysis |
print-verify-remat | bool | false | Debug: print problematic RP on verification failure |
rp-remat | int | — | Debug: set a target register pressure number |
late-remat-update-threshold | int | — | Threshold for copy with many other copy uses |
remat-load-param | bool | false | Support rematerializing constant ld.param not in NVVM IR |
Category 12: SCEV-CGP (Address Mode Optimization)
Eleven NVIDIA-specific knobs for SCEV-based CodeGenPrepare. See CodeGenPrepare.
| Flag | Type | Default | Purpose |
|---|---|---|---|
do-scev-cgp | bool | false | Enable SCEV-based CodeGenPrepare |
do-scev-cgp-aggresively | bool | false | Aggressive SCEV-CGP mode [sic] |
do-function-scev-cgp | bool | false | Function-level SCEV-CGP |
nv-disable-scev-cgp | bool | true | Disable SCEV address mode optimization (master kill switch, on by default) |
scev-cgp-control | int | — | Control max transformations applied |
scev-cgp-cross-block-limit | int | — | Max common-base expressions from a single block |
scev-cgp-idom-level-limit | int | — | Max dominator tree levels to walk |
scev-cgp-inst-limit | int | — | Max instructions for a single parameter |
scev-cgp-old-base | bool | false | Force SCEV-CGP to create new base (vs reusing old) |
scev-cgp-tid-max-value | int | — | Max value of thread ID in SCEV expressions |
print-after-scev-cgp | bool | false | Print function after SCEV-CGP phase |
Category 13: Branch Distribution
Seven NVIDIA-specific flags (six tabulated below; ignore-address-space-check, listed under Category 10, also applies during branch distribution).
| Flag | Type | Default | Purpose |
|---|---|---|---|
branch-dist-block-limit | int | — | Max blocks to apply branch distribution |
branch-dist-func-limit | int | — | Max functions to apply branch distribution |
branch-dist-norm | int | — | Normalization control |
no-branch-dist | string | — | Comma-separated list of functions to skip |
disable-complex-branch-dist | bool | false | Disable complex branch distribution |
dump-branch-dist | bool | false | Dump branch distribution info |
Category 14: Sinking / Code Motion
Thirteen knobs across multiple constructors (eleven tabulated below; disable-sink and partial-sink appear under Category 1). See Sinking2 for the NVIDIA-custom texture-aware sinking pass.
| Flag | Type | Default | Purpose |
|---|---|---|---|
sink-into-texture | int | 3 | Texture sinking aggressiveness: 0=off, 1=cross-block, 2=+intra, 3=+outside-only |
sink-limit | int | 20 | Max instructions to sink per Sinking2 invocation (complexity limiter) |
dump-sink2 | bool | false | Debug dump for Sinking2 pass |
sink-check-sched | bool | true | Check scheduling effects of sinking (stock Sink) |
sink-single-only | bool | true | Only sink single-use instructions (stock Sink) |
enable-andcmp-sinking | bool | false | Sink and/cmp sequences into branches |
aggressive-no-sink | bool | false | Sink all generated instructions |
max-uses-for-sinking | int | — | Don't sink instructions with too many uses |
rp-aware-sink | bool | false | Consider register pressure impact when sinking |
instcombine-code-sinking | bool | false | Enable code sinking within InstCombine |
hoist-const-stores | bool | false | Hoist loop-invariant stores |
Category 15: Register Pressure / Allocation
NVIDIA-specific knobs plus LLVM greedy allocator knobs. See Register Allocation for the full algorithm.
NVIDIA RP Knobs
| Flag | Type | Default | Purpose |
|---|---|---|---|
maxreg | int | none | Maximum register count (--maxrregcount equivalent) |
register-usage-level | int | — | Register usage level control |
cta-reconfig-aware-mrpa | bool | false | CTA reconfiguration-aware machine RP analysis |
cta-reconfig-aware-rpa | bool | false | CTA reconfiguration-aware RP analysis |
pred-aware-mcse | bool | false | Predicate-aware MachineCSE |
rp-aware-mcse | bool | false | Register-pressure-aware MachineCSE |
verify-update-mcse | bool | false | Debug: verify incremental RP update in MachineCSE |
incremental-update-mcse | bool | true | Incrementally update register pressure analysis in MachineCSE |
print-verify | bool | false | Print problematic RP info if MCSE verification fails |
pred-target-adjust | int | 0 | Predicate register target adjustment (-10 to +10) |
donot-insert-dup-copies | bool | false | Skip duplicate copies to predecessor basic block |
nv-disable-mem2reg | bool | false | Disable machine-level mem2reg |
LLVM Greedy Allocator Knobs
| Flag | Type | Default | Purpose |
|---|---|---|---|
split-spill-mode | int | 1 | Spill mode: 0=default, 1=size, 2=speed |
lcr-max-depth | int | 5 | Last chance recoloring max recursion depth |
lcr-max-interf | int | 8 | Last chance recoloring max interferences |
exhaustive-register-search | bool | false | Bypass LCR depth/interference cutoffs |
enable-deferred-spilling | bool | false | Defer spill code to end of allocation |
grow-region-complexity-budget | int | 10000 | growRegion() edge budget for live range splitting |
split-threshold-for-reg-with-hint | int | 75 | Split threshold percentage for hinted registers |
Category 16: Restrict / Aliasing
Five NVIDIA-specific flags. See Alias Analysis.
| Flag | Type | Default | Purpose |
|---|---|---|---|
process-restrict | bool | false | Process __restrict__ keyword for alias analysis |
allow-restrict-in-struct | bool | false | Allow __restrict__ inside struct members |
apply-multi-level-restrict | bool | false | Apply restrict to all pointer levels |
dump-process-restrict | bool | false | Debug dump during restrict processing |
strict-aliasing | bool | false | Datatype-based strict aliasing |
Category 17: CSSA / deSSA
Four knobs. See CSSA.
| Flag | Type | Default | Purpose |
|---|---|---|---|
cssa-coalesce | int | — | Control PHI operand coalescing strategy |
cssa-verbosity | int | 0 | Verbosity level |
dump-before-cssa | bool | false | Dump specific PHI operands being coalesced |
usedessa | int | 2 | deSSA method: 0=off, 1=basic, 2=full |
Category 18: Loop / Unrolling
Eight NVIDIA-specific knobs (beyond the 20+ standard LLVM loop-unrolling flags).
| Flag | Type | Default | Purpose |
|---|---|---|---|
nv-disable-loop-unrolling | bool | false | Disable loop unrolling in all passes |
aggressive-runtime-unrolling | bool | false | OCG-style unrolling heuristics |
aggressive-runtime-unrolling-fixed-factor | int | — | Force fixed unroll factor |
aggressive-runtime-unrolling-max-factor | int | — | Maximum unroll factor |
aggressive-runtime-unrolling-max-filler-instructions-per-batch | int | — | Max filler instructions |
unroll-runtime-nv-expensive | bool | false | NVIDIA heuristics for expensive loops |
unroll-runtime-convergent | bool | false | Allow unrolling with convergent instructions |
track-trip-count-more | bool | false | Track loop trip count more aggressively |
Category 19: GEP / Address Strength Reduction
Eight NVIDIA-specific knobs. See Base Address Strength Reduction.
| Flag | Type | Default | Purpose |
|---|---|---|---|
normalize-gep | bool | false | Normalize 64-bit GEP subscripts |
dump-normalize-gep | bool | false | Debug dump for GEP normalization |
do-base-address-strength-reduce | int | 0 | Two levels: 1=unconditional, 2=with conditions |
dump-base-address-strength-reduce | bool | false | Debug dump |
do-lsr-64-bit | bool | false | Loop strength reduction for 64-bit (shared with LSR) |
do-sign-ext-expand | bool | false | Expand sign-extension during SCEV build |
balance-dot-chain | bool | false | Balance chain of dot operations |
special-reassociate-for-threadid | bool | false | Don't move back expressions containing thread ID |
Category 20: Aggregate / Byval Lowering
Ten knobs.
| Flag | Type | Default | Purpose |
|---|---|---|---|
aggressive-max-aggr-lower-size | int | — | Threshold size for lowering aggregates |
aggressive-lsv | bool | false | Merge smaller dtypes in aggregate before vectorization |
vect-split-aggr | bool | false | Split aggregates before vectorization |
lower-aggr-unrolled-stores-limit | int | — | Limit stores in unrolled aggregate lowering |
large-aggr-store-limit | int | — | Create loops for aggregate store exceeding limit |
lower-func-args | bool | true | Lower large aggregate function parameters |
lsa-opt | bool | false | Optimize copying of struct args to local memory |
skiploweraggcopysafechk | bool | false | Skip safety check in loweraggcopy |
memdep-cache-byval-loads | bool | true | Preprocess byval loads to reduce compile time |
ldstmemcpy-glue-max | int | — | Limit for gluing ld/st of memcpy |
Category 21: Normalization / Canonicalization
Four knobs.
| Flag | Type | Default | Purpose |
|---|---|---|---|
norm-fold-all | bool | false | Fold all regular instructions |
norm-preserve-order | bool | false | Preserve original instruction order |
norm-rename-all | bool | false | Rename all instructions |
norm-reorder-operands | bool | false | Sort/reorder operands in commutative operations |
Category 22: NVVM Infrastructure
Five knobs.
| Flag | Type | Default | Purpose |
|---|---|---|---|
nvvm-lower-printf | bool | false | Enable printf lowering |
nvvm-reflect-enable | bool | true | NVVM reflection (reads __CUDA_FTZ, __CUDA_PREC_DIV, etc.) |
nvvm-verify-show-info | bool | false | Info messages during NVVM verification |
enable-nvvm-peephole | bool | true | NVVM peephole optimizer |
nv-ocl | bool | false | Deprecated OpenCL compatibility flag |
Category 23: Compilation Control
Constructor: ctor_043_0 at 0x48D7F0 + ctor_028_0 at 0x489160.
| Flag | Type | Default | Purpose |
|---|---|---|---|
debug-compile | bool | false | Compile for debugging (set by -g) |
generate-line-info | bool | false | Emit line info even without -G |
nvptx-f32ftz | bool | false | Flush f32 subnormals to zero; hidden |
w | bool | false | Disable warnings; hidden |
Werror | bool | false | Treat warnings as errors; hidden |
Osize | bool | false | Optimize for code size; hidden |
Om | bool | false | Maximum optimization mode; hidden |
maxreg | int | none | Maximum register count (no limit if unset) |
nvptx-nan | bool | false | NaN handling control; hidden |
jump-table-density | int | 10 | Minimum density (%) for jump table lowering |
pass-control | int | -1 | Disable all optional passes after pass N; -1 = no limit |
disable-passno | list | empty | Disable pass(es) by number (comma-separated) |
sep-comp | bool | false | Separate compilation mode |
proffile | string | — | Filename for PGO profile information |
R | string | — | Resource constraint: name=<int> format |
lnk-disable-allopts | bool | false | Disable all linker optimization passes |
disable-peephole | bool | false | Disable peephole optimizer |
disable-early-taildup | bool | false | Disable pre-regalloc tail duplication |
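The jump-table-density knob (default 10) suggests a density test like the following when lowering a switch. The formula is an assumption from the knob's description: accept a jump table only when the case labels fill at least the given fraction of the label range.

```cpp
#include <cassert>

bool use_jump_table(long num_cases, long min_label, long max_label,
                    int min_density_pct /* jump-table-density, default 10 */) {
    long range = max_label - min_label + 1;
    if (range <= 0 || num_cases < 2) return false;   // degenerate switch
    // density (%) = num_cases / range * 100, compared without division
    return num_cases * 100 >= (long)min_density_pct * range;
}
```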
Category 24: Divergence / GPU Execution
Three flags.
| Flag | Type | Default | Purpose |
|---|---|---|---|
spec-exec-only-if-divergent-target | bool | false | Speculative execution only when target is divergent |
prefer-predicated-reduction-select | bool | false | Prefer predicated reduction over after-loop select |
openmp-opt-disable-barrier-elimination | bool | false | Disable OpenMP barrier elimination |
Category 25: MachinePipeliner (Swing Modulo Scheduling)
Eighteen LLVM-origin knobs for software pipelining. See Scheduling for the full algorithm.
| Flag | Type | Default | Global | Purpose |
|---|---|---|---|---|
| enable-pipeliner | bool | true | unk_503EE20 | Master switch for SMS |
| enable-pipeliner-opt-size | bool | false | qword_503ED40 | Enable SWP at -Os |
| pipeliner-max-mii | int | 27 | qword_503ECE8 | Maximum allowed MII |
| pipeliner-force-ii | int | 0 | qword_503EB80 | Force specific II (0 = auto) |
| pipeliner-max-stages | int | 3 | qword_503EB28 | Maximum pipeline stages |
| pipeliner-prune-deps | bool | true | qword_503E9C0 | Prune deps between unrelated Phi nodes |
| pipeliner-prune-loop-carried | bool | true | qword_503E8E0 | Prune loop-carried order deps |
| pipeliner-ignore-recmii | bool | false | qword_503E888 | Ignore RecMII; hidden |
| pipeliner-show-mask | bool | false | qword_503E720 | Debug: show scheduling mask |
| pipeliner-dbg-res | bool | false | qword_503E640 | Debug: resource usage |
| pipeliner-annotate-for-testing | bool | false | qword_503E5E8 | Annotate instead of codegen |
| pipeliner-experimental-cg | bool | false | qword_503E508 | Use peeling code generator |
| pipeliner-ii-search-range | int | 10 | qword_503E3A0 | Range to search for II |
| pipeliner-register-pressure | bool | false | qword_503E2C0 | Consider register pressure |
| pipeliner-register-pressure-margin | int | 5 | qword_503E1E0 | Margin % for reg pressure limit |
| pipeliner-mve-cg | bool | true | unk_503E100 | Use MVE code generator |
| pipeliner-enable-copytophi | bool | true | qword_503E020 | Enable CopyToPhi DAG Mutation |
| pipeliner-force-issue-width | int | 0 | qword_503DF40 | Force issue width (0 = auto) |
Category 26: LLVM Standard Inliner (Model B)
Seventeen LLVM-origin knobs from ctor_625_0 / ctor_715_0 at 0x58FAD0. These control the upstream InlineCostAnalysis::analyzeCall path; see Inliner Cost Model for why the NVIDIA custom model (Category 2) dominates in practice.
| Flag | Type | Default | Purpose |
|---|---|---|---|
| inline-threshold | int | 225 | Base inlining threshold |
| inlinedefault-threshold | int | 225 | Default when no hint/profile |
| inlinehint-threshold | int | 325 | Threshold for functions with the inlinehint attribute |
| inline-cold-callsite-threshold | int | 45 | Threshold for cold callsites |
| inlinecold-threshold | int | 45 | Threshold for functions with cold attribute |
| hot-callsite-threshold | int | 3000 | Threshold for hot callsites (PGO) |
| locally-hot-callsite-threshold | int | 525 | Threshold for locally hot callsites |
| inline-instr-cost | int | 5 | Cost per instruction |
| inline-call-penalty | int | 25 | Penalty per callsite in callee |
| inline-memaccess-cost | int | 0 | Cost per load/store |
| inline-savings-multiplier | int | 8 | Multiplier for cycle savings |
| inline-savings-profitable-multiplier | int | 4 | Multiplier for profitability check |
| inline-size-allowance | int | 100 | Max callee size inlined without savings proof |
| inline-cost-full | bool | false | Compute full cost even when over threshold |
| inline-enable-cost-benefit-analysis | bool | false | Enable cost-benefit analysis |
| inline-deferral | bool | — | Defer inlining in cold paths (PGO) |
| inline-remark-attribute | bool | false | Emit inline remarks |
Category 27: New PM CGSCC Inliner (Model C)
Two knobs for the New Pass Manager CGSCC inliner at 0x2613930. See Inliner Cost Model.
| Flag | Type | Default | Purpose |
|---|---|---|---|
| function-inline-cost-multiplier | int | — | Penalize recursive function inlining |
| enable-ml-inliner | enum | default | ML advisory mode: default, development, release |
Knob System 2: NVVMPassOptions
222 pass option slots initialized by sub_12D6300 (125KB). Each slot is accessed by integer index (1--221) and stored in a ~4,480-byte struct.
Access Functions
| Function | Purpose |
|---|---|
| sub_12D6170(base+120, index) | Fetch pass option descriptor by index |
| sub_1691920(base+8, index) | Fetch pass option value from table |
| sub_12D6090(a1+offset, ...) | Store string-typed option |
| sub_12D6100(a1+offset, ...) | Store integer-typed option |
| sub_12D6240(a1, index, "0") | Get option with default value |
See NVVMPassOptions for the complete 222-slot inventory.
Knob System 3: NVIDIA Codegen Knobs
Parsed from the NVVM container format by sub_1C20170 and sub_CD9990. See NVIDIA Custom Passes for the complete inventory.
Hidden / Obfuscated Flags
Obfuscated Flag (in ctor_043_0, decryption at ~0x48EE80)
A 4-byte CLI flag name computed via XOR-based obfuscation from unk_3F6F7C7:
v40 = v37 ^ (-109 * ((offset + 97) ^ 0x811C9DC5));
Stored at qword_4F857C0 with flag bits 0x87 | 0x38 = hidden + really-hidden. NVIDIA deliberately hides this option from static analysis using FNV-1a-like constants.
Environment Variable Backdoors
| Variable | Purpose | Location |
|---|---|---|
| NVVMCCWIZ | Wizard mode (value 553282) -- unlocks -v, -keep, -dryrun, -lgenfe, -opt, -llc, -lnk, -libnvvm | sub_8F9C90 |
| NVVM_IR_VER_CHK | Override IR version check (set to "0" to disable) | sub_12BFF60 |
| LLVM_OVERRIDE_PRODUCER | Override bitcode producer string (default "7.0.1") | ctor_154 at 0x4CE640 |
| MALLOC_CONF | jemalloc allocator tuning | sub_12FCDB0 |
| LIBNVVM_DISABLE_CONCURRENT_API | Force single-threaded NVVM compilation | ctor_104 at 0x4A5810 |
CLI Defaults Set by Flag Routing
These are effective defaults applied by the flag catalog parser (sub_9624D0), not by cl::opt constructors. When no user flag is specified, the parser injects these:
| CLI flag | Default value | Routed cl::opt |
|---|---|---|
| -arch=compute_<N> | compute_75 (SM 75, Turing) | target architecture |
| -opt=<N> | 3 (O3) | optimization level |
| -ftz=<N> | 0 (no flush-to-zero) | nvptx-f32ftz |
| -prec-sqrt=<N> | 1 (precise) | nvptx-prec-sqrtf32=1 |
| -prec-div=<N> | 1 (precise) | nvptx-prec-divf32=1 (CUDA) / =0 (CL) |
| -fma=<N> | 1 (enabled) | nvptx-fma-level=1 |
| -opt-fdiv=<N> | 0 (off) | optimizer fast-div control |
| -Ofast-compile=<level> | 0 (off) | fast-compile pipeline |
NVIDIA Modification Density
| Subsystem | NVIDIA Knobs | LLVM Knobs | Customization Rate |
|---|---|---|---|
| LSR | 11 | 5 | 69% |
| InstCombine | 12 | 4 | 75% |
| Inliner (NVIDIA custom) | 9 | 0 | 100% |
| Inliner (LLVM standard) | 0 | 17 | 0% |
| GVN | 8 | 3 | 73% |
| NVPTX Backend | 30+ | 0 | 100% |
| SimplifyCFG | 2 | 8+ | 20% |
| Memory Space Opt | 20 | 0 | 100% |
| Rematerialization (IR) | 21 | 0 | 100% |
| Rematerialization (MI) | 16 | 0 | 100% |
| Rematerialization (General) | 15 | 0 | 100% |
| SCEV-CGP | 11 | 0 | 100% |
| Register Pressure | 12 | 7 | 63% |
| Sinking / Code Motion | 5 | 6 | 45% |
| MachinePipeliner | 0 | 18 | 0% |
| Vectorizer | 0 | 18+ | 0% |
| SCEV | 0 | 10+ | 0% |
Cross-References
- NVVMPassOptions -- 222-slot pass option system
- CLI Flags -- complete flag-to-pipeline routing
- Environment Variables -- all verified env vars
- Optimization Levels -- O0/O1/O2/O3 and fast-compile pipelines
- Rematerialization -- multi-pass remat engine
- Memory Space Optimization -- address space resolution
- Sinking2 -- texture-aware sinking
- Register Allocation -- greedy RA with NVIDIA extensions
- Scheduling -- SMS and MRPA
- IPMSP -- memory space optimization engine
- Alias Analysis -- restrict propagation
- CodeGenPrepare -- SCEV-CGP pass
- Inliner Cost Model -- four parallel inliner models
- GVN -- GPU-specific value numbering
- LSR -- GPU-aware loop strength reduction
Environment Variables
cicc v13.0 checks 24 distinct environment variables across 36 files containing getenv() calls. Six are NVIDIA-specific (two obfuscated), nine come from the LLVM infrastructure (including the four temp-directory variants), six from the EDG frontend, and the remainder from the build system, memory allocator, and shared ptxas/nvptxcompiler infrastructure. Two of the NVIDIA variables have their names encrypted in the .rodata section using an XOR+ROT13 cipher to prevent discovery through string scanning.
String Deobfuscation Engine
The deobfuscation function sub_8F98A0 at 0x8F98A0 decrypts variable names and option strings from .rodata ciphertext. The same engine is also used for hidden CLI option names (see CLI Flags).
Algorithm: sub_8F98A0
// Reconstructed pseudocode — sub_8F98A0 (0x8F98A0)
// Inputs:
//   ciphertext — pointer to encrypted bytes in .rodata
//   base       — base address used as key seed (a2)
//   length     — number of bytes to decrypt
//   out        — caller-provided buffer (the real code decrypts onto its own stack)
// Output:
//   null-terminated plaintext written to out
void deobfuscate(const uint8_t* ciphertext, uintptr_t base, size_t length, char* out) {
    for (size_t i = 0; i < length; i++) {
        uint8_t raw = ciphertext[i];
        // Phase 1: XOR with a key derived from the byte position
        uint32_t key = (uint32_t)-109 * (((uint32_t)(i - base) + 97) ^ 0xC5);
        char ch = (char)(raw ^ (key & 0xFF));
        // Phase 2: ROT13 on alphabetic characters
        if (ch >= 'A' && ch <= 'Z')
            ch = ((ch - 'A' + 13) % 26) + 'A';
        else if (ch >= 'a' && ch <= 'z')
            ch = ((ch - 'a' + 13) % 26) + 'a';
        out[i] = ch;
    }
    out[length] = '\0';
}
Key constant: The multiplier -109 (signed, i.e. 0xFFFFFF93) and the XOR mask 0xC5 together form a position-dependent key stream. The ROT13 phase is applied after the XOR, meaning the plaintext must survive two transformations. This is a weak cipher by design -- it only needs to defeat strings(1) scanning, not serious cryptanalysis.
Obfuscated String Table in .rodata
All obfuscated strings live in a contiguous region near 0x3C23A7B--0x3C23AD6. Each entry is referenced by its end byte address (the deobfuscator walks backward):
| End address | Length | Decrypted plaintext | Purpose |
|---|---|---|---|
| byte_3C23AD6 | 14 | (option prefix) | CLI option prefix for -nvvm-version matching |
| byte_3C23AC3 | 11 | "nvvm-latest" | Option suffix; sets v253 = 1 (Path A) |
| byte_3C23AB4 | 6 | "nvvm70" | Option suffix; sets v253 = 0 (Path B) |
| byte_3C23AAD | 13 | (option name) | Option name for error message display |
| byte_3C23A9F | 15 | "NV_NVVM_VERSION" | Environment variable name for getenv() |
| byte_3C23A82 | 6 | "nvvm70" | Env var value comparison string |
| byte_3C23A7B | 11 | "nvvm-latest" | Env var value comparison string |
Additional encrypted copies exist at 0x42812C0 and 0x42812F0 for the two env var names (NV_NVVM_VERSION and LIBNVVM_NVVM_VERSION) used by sub_12B9F70.
Obfuscated CLI Flag (ctor_043)
A separate obfuscation instance at ~0x48EE80 in ctor_043 (0x48D7F0) decrypts a 4-byte hidden cl::opt name from data at unk_3F6F7C7. The algorithm variant uses FNV-1a-like constants:
v40 = v37 ^ (-109 * ((offset + 97) ^ 0x811C9DC5));
The 0x811C9DC5 constant is the FNV-1a 32-bit offset basis. The resulting 4-character option is registered with flag bits 0x87 | 0x38 = hidden + really-hidden, making it invisible even to --help-hidden. This is stored at qword_4F857C0.
NVIDIA-Specific Variables
NVVMCCWIZ
| Property | Value |
|---|---|
| Checked in | sub_8F9C90 (real main) at 0x8F9C90, specifically 0x8F9D36 |
| Expected value | "553282" (magic number = 0x87142) |
| Effect | Sets byte_4F6D280 = 1 -- unlocks developer/wizard mode |
Mechanism: The value is parsed via strtol(v, 0, 10) and compared against the integer 553282. Any other value is silently ignored.
What wizard mode does: When byte_4F6D280 = 1:
- -v actually enables verbose output (v259 = byte_4F6D280 instead of 0)
- -keep actually preserves intermediate files (v262 = byte_4F6D280)
- -dryrun enables verbose output as a side effect (v259 = byte_4F6D280)
- -lnk and -opt modes set v262 = byte_4F6D280 (keep temps)
Without wizard mode, -v and -keep are no-ops -- the flags are parsed but have no effect because they set their variables to byte_4F6D280 which is 0.
NVVM_IR_VER_CHK
| Property | Value |
|---|---|
| Checked in | sub_12BFF60 at 0x12BFF60 (NVVM IR version verifier, instance 1); sub_2259720 at 0x2259720 (instance 2) |
| Expected value | "0" to disable version checking |
| Effect | Controls NVVM IR bitcode version metadata validation |
Detailed mechanism (from sub_12BFF60, 9KB):
- Reads getenv("NVVM_IR_VER_CHK").
- If NULL or strtol(env, 0, 10) != 0: version checking is enabled (default).
- If set to "0": version checking is disabled (bypass).

When enabled, the function:
- Looks up "nvvmir.version" named metadata via sub_1632310(module, &name).
- Also checks "llvm.dbg.cu" metadata (debug compile unit presence).
- Iterates metadata operands, deduplicating via an open-addressing hash table:
  - Hash function: (value >> 9) ^ (value >> 4) & mask
  - Tombstone: 0xFFFFFFFFFFFFFFF0 (-16)
  - Empty: 0xFFFFFFFFFFFFFFF8 (-8)
- For each unique 2-element version tuple (major, minor):
  - Calls sub_12BDA30(modules, major, minor) for the IR compatibility check.
  - Special case: major==2, minor==0 always passes (sentinel for libdevice).
- For 4-element tuples (major, minor, debug_major, debug_minor):
  - Calls sub_12BD890(modules, debug_major, debug_minor) for the debug version check.
  - Special case: debug_major==3, debug_minor<=2 always passes.
The env var is checked multiple times per invocation: before IR version validation, before debug IR version validation, and at each version tuple comparison. Return code 3 indicates incompatible version.
Current expected versions: nvvmir.version = {2, minor<=0x62}, debug version = {3, minor<=2}.
LIBNVVM_DISABLE_CONCURRENT_API
| Property | Value |
|---|---|
| Checked in | ctor_104 at 0x4A5810 (global constructor) |
| Expected value | Any non-NULL value |
| Effect | Sets byte_4F92D70 = 1 -- disables thread-safe libnvvm API usage |
Safety valve for environments where concurrent libnvvm compilation causes issues. Any non-NULL value triggers single-threaded API behavior. See Concurrent Compilation.
NV_NVVM_VERSION (Obfuscated)
| Property | Value |
|---|---|
| Checked in | sub_12B9F70 at 0x12B9F70, sub_12BB580 at 0x12BB580, sub_8F9C90 at 0x8F9C90 |
| Encrypted at | 0x3C23A90 and 0x42812C0 (two copies, same ciphertext) |
| Decryption | XOR with (-109 * ((byte_offset - base + 97) ^ 0xC5)) then ROT13 |
| Expected values | "nvvm70" (suppresses check), "nvvm-latest" (forces latest mode) |
How it controls compilation path selection in sub_8F9C90:
The dispatch variable v253 starts at 2 (default). When v253 is still 2 at post-parse time (lines 1590--1692):
1. sub_8F98A0 decrypts the env var name from byte_3C23A9F[-15..0].
2. Calls getenv(decrypted_name).
3. Compares the result against two decrypted reference strings:
   - "nvvm70" (from byte_3C23A82): sets v253 = 0 (Path B -- NVVM/bitcode pipeline via sub_1262860 or sub_1265970)
   - "nvvm-latest" (from byte_3C23A7B): sets v253 = 1 (Path A -- PTX pipeline via sub_902D10 or sub_905EE0)
4. If neither matches: uses (arch > 99) as the tiebreaker, with further modulation by the -nvc and -optixir flags.
For multi-stage modes (v263 >= 3), the resolved path also determines which pipeline flag string is appended:
- Path A (v253 == 1): xmmword_3C23BC0 + "vm-latest" (25 bytes total, decodes to "-nvvm-version=nvvm-latest")
- Path B (v253 == 0): xmmword_3C23BC0 + "vm70" (20 bytes total, decodes to "-nvvm-version=nvvm70")
The variable name is encrypted in the binary's .rodata section because NVIDIA intended to keep this escape hatch undiscoverable through casual strings(1) scanning. It controls a fundamental compilation mode choice.
LIBNVVM_NVVM_VERSION (Obfuscated)
| Property | Value |
|---|---|
| Checked in | sub_12B9F70 at 0x12B9F70 |
| Encrypted at | 0x42812F0 |
| Expected values | Same as NV_NVVM_VERSION |
Functionally identical to NV_NVVM_VERSION. Both names are checked by the same function sub_12B9F70; this provides an alternative name for the same feature. Likely exists so that libnvvm API users can set LIBNVVM_NVVM_VERSION while standalone cicc users set NV_NVVM_VERSION.
LLVM_OVERRIDE_PRODUCER
| Property | Value |
|---|---|
| Checked in | ctor_036 at 0x48CC90, ctor_154 at 0x4CE640 |
| Expected value | Any string |
| Effect | Overrides the producer identification string in output bitcode metadata |
Dual constructor behavior:
- ctor_036 (0x48CC90): Reads LLVM_OVERRIDE_PRODUCER, falls back to "20.0.0" (the true LLVM version). Stored in qword_4F837E0.
- ctor_154 (0x4CE640): Reads LLVM_OVERRIDE_PRODUCER, falls back to "7.0.1" (the NVVM IR compatibility marker). Stored separately.
The bitcode writer (sub_1538EC0, 58KB) uses the ctor_154 value, producing "LLVM7.0.1" in the IDENTIFICATION_BLOCK. This means the output bitcode claims to be LLVM 7.0.1 format, even though cicc is built on LLVM 20.0.0 internally. Setting LLVM_OVERRIDE_PRODUCER overrides both constructors' values.
CAN_FINALIZE_DEBUG
| Property | Value |
|---|---|
| Checked in | sub_60F290 at 0x60F290, sub_4709E0 at 0x4709E0, sub_470DA0 at 0x470DA0 |
| Expected value | Controls debug finalization behavior |
| Effect | Gates debug information finalization passes |
Shared with ptxas and nvptxcompiler (same codebase origin). Controls whether debug information finalization passes execute. When unset, the default behavior applies. Three call sites confirmed.
LLVM Infrastructure Variables
AS_SECURE_LOG_FILE
Checked in ctor_720 at 0x5C0D60. Sets the secure log file path for the integrated assembler, registered as LLVM cl::opt "as-secure-log-file-name". Expected: a file path.
TMPDIR / TMP / TEMP / TEMPDIR
Checked in sub_16C5C30, sub_C843A0, and sub_721330. These are probed in priority order: TMPDIR first, then TMP, TEMP, TEMPDIR. The EDG frontend (sub_721330) only checks TMPDIR and falls back to "/tmp".
PATH
Checked in sub_16C5290, sub_16C7620, sub_C86E60. Standard PATH for findProgramByName lookups.
HOME
Checked in sub_C83840. Used by sys::path::home_directory with getpwuid_r() as fallback.
PWD
Checked in sub_16C56A0, sub_C82800. Used for fast current-directory resolution (faster than getcwd).
TERM
Checked in sub_7216D0 (EDG) and sub_16C6A40/sub_C86300 (LLVM). If TERM=="dumb", terminal colors are disabled. Otherwise, specific terminal type strings (ansi, xterm, screen, linux, cygwin, etc.) are matched by integer comparison to determine color capability.
EDG Frontend Variables
NOCOLOR
Checked in sub_67C750. Respects the no-color.org convention: if set to any value, all diagnostic coloring is disabled.
EDG_COLORS
Checked in sub_67C750. Custom color specification string for EDG diagnostics. Example: "error=01;31:warning=01;35:note=01;36:locus=01:quote=01".
GCC_COLORS
Checked in sub_67C750. Fallback if EDG_COLORS is not set. Default: "error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32". Provides GCC-compatible diagnostic coloring.
USR_INCLUDE
Checked in sub_720A60. Overrides the system include path (default: "/usr/include") for the EDG frontend.
EDG_BASE
Checked in sub_7239A0. Sets the EDG base directory for predefined configuration files. Stored in qword_4F07578.
EDG_MODULES_PATH
Checked in sub_723900. Adds an additional search path for C++ modules in the EDG frontend.
Build System / Parallelism
MAKEFLAGS
| Property | Value |
|---|---|
| Checked in | sub_1682BF0 at 0x1682BF0 |
| Effect | GNU Make jobserver integration for parallel compilation limiting |
Parses for --jobserver-auth= with either:
- A fifo: prefix: FIFO-based jobserver (modern GNU Make)
- An N,M pair: pipe file descriptor pair (classic GNU Make)
When detected, cicc integrates with the jobserver to limit concurrent compilation passes.
Memory Allocator
MALLOC_CONF
Checked in sub_12FCDB0 (jemalloc initialization, 131,600 bytes -- the largest function in its range). One of five configuration sources for the bundled jemalloc allocator. Expected: jemalloc config string such as "narenas:2,dirty_decay_ms:0".
Dynamic / Generic Access
Two mechanisms allow runtime access to arbitrary environment variables:
- --trace-env=VARNAME CLI flag (in sub_125FB30 and sub_900130): reads the named variable and injects its value into the compilation trace. This is a pass-through mechanism for build system integration.
- sub_C86120 (LLVM sys::Process::GetEnv wrapper): generic getenv helper called with dynamic name parameters by LLVM's option processing infrastructure.
Complete Inventory
| # | Env Var Name | Origin | Obfuscated | Category | Key Function |
|---|---|---|---|---|---|
| 1 | NVVMCCWIZ | NVIDIA | no | Developer mode | sub_8F9C90 |
| 2 | NVVM_IR_VER_CHK | NVIDIA | no | Version check gate | sub_12BFF60, sub_2259720 |
| 3 | LIBNVVM_DISABLE_CONCURRENT_API | NVIDIA | no | Thread safety | ctor_104 |
| 4 | NV_NVVM_VERSION | NVIDIA | yes | Version compat / path select | sub_12B9F70, sub_12BB580 |
| 5 | LIBNVVM_NVVM_VERSION | NVIDIA | yes | Version compat (alias) | sub_12B9F70 |
| 6 | LLVM_OVERRIDE_PRODUCER | LLVM/NVIDIA | no | Bitcode metadata | ctor_036, ctor_154 |
| 7 | CAN_FINALIZE_DEBUG | NVIDIA | no | Debug finalization | sub_60F290, sub_4709E0, sub_470DA0 |
| 8 | AS_SECURE_LOG_FILE | LLVM | no | Assembler logging | ctor_720 |
| 9 | TMPDIR | LLVM/EDG | no | Temp directory | sub_16C5C30, sub_C843A0, sub_721330 |
| 10 | TMP | LLVM | no | Temp directory (fallback) | sub_16C5C30, sub_C843A0 |
| 11 | TEMP | LLVM | no | Temp directory (fallback) | sub_16C5C30, sub_C843A0 |
| 12 | TEMPDIR | LLVM | no | Temp directory (fallback) | sub_16C5C30, sub_C843A0 |
| 13 | PATH | LLVM | no | Executable lookup | sub_16C5290, sub_16C7620, sub_C86E60 |
| 14 | HOME | LLVM | no | Home directory | sub_C83840 |
| 15 | PWD | LLVM | no | Working directory | sub_16C56A0, sub_C82800 |
| 16 | TERM | LLVM/EDG | no | Terminal type | sub_7216D0, sub_16C6A40, sub_C86300 |
| 17 | NOCOLOR | EDG | no | Color disable | sub_67C750 |
| 18 | EDG_COLORS | EDG | no | Color scheme | sub_67C750 |
| 19 | GCC_COLORS | EDG | no | Color scheme (fallback) | sub_67C750 |
| 20 | USR_INCLUDE | EDG | no | Include path | sub_720A60 |
| 21 | EDG_BASE | EDG | no | EDG base dir | sub_7239A0 |
| 22 | EDG_MODULES_PATH | EDG | no | Module search path | sub_723900 |
| 23 | MAKEFLAGS | Build | no | Jobserver | sub_1682BF0 |
| 24 | MALLOC_CONF | jemalloc | no | Allocator config | sub_12FCDB0 |
Decompiler Artifacts
Several getenv("bar") calls appear in ctor_106, ctor_107, ctor_376, ctor_614. These are not real environment variable checks. The pattern getenv("bar") == (char*)-1 is jemalloc's initialization probe testing whether getenv is intercepted by a sanitizer. The string "bar" is a dummy.
"getenv" as a string in ctor_133 (qword_4F9B700[502]) is a function name in a libc symbol table used by the EDG frontend for tracking known standard library functions.
"fegetenv" in sub_E42970 is a math library function name in a builtin table.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| StringDeobfuscate | sub_8F98A0 | ~400B | XOR + ROT13 decryption engine |
| RealMain | sub_8F9C90 | 10,066B | Main entry; checks NVVMCCWIZ, dispatches via NV_NVVM_VERSION |
| NvvmVersionHelper | sub_12B9F70 | ~3KB | Reads NV_NVVM_VERSION / LIBNVVM_NVVM_VERSION; compares values |
| NvvmVersionHelper2 | sub_12BB580 | ~3KB | Second call site for NV_NVVM_VERSION |
| CheckIRVersion | sub_12BDA30 | ~1KB | IR major/minor compatibility check |
| CheckDebugVersion | sub_12BD890 | ~1KB | Debug IR major/minor compatibility check |
| NVVMIRVersionCheck | sub_12BFF60 | 9KB | Full NVVM IR version validator; reads NVVM_IR_VER_CHK |
| NVVMIRVersionCheck2 | sub_2259720 | 14KB | Second instance of version checker |
| JemallocInit | sub_12FCDB0 | 131,600B | jemalloc config parser; reads MALLOC_CONF |
| JobserverParser | sub_1682BF0 | ~2KB | MAKEFLAGS --jobserver-auth parser |
| GenericGetEnv | sub_C86120 | ~100B | LLVM sys::Process::GetEnv wrapper |
| EDGColorInit | sub_67C750 | ~2KB | NOCOLOR / EDG_COLORS / GCC_COLORS handler |
Cross-References
- CLI Flags -- flag-to-pipeline routing, -v/-keep wizard mode dependency
- Knobs -- internal configuration knobs (separate from env vars)
- Pipeline Entry -- sub_8F9C90 real main, v253 dispatch logic
- Concurrent Compilation -- LIBNVVM_DISABLE_CONCURRENT_API
- Libdevice Linking -- NVVM_IR_VER_CHK bypass for libdevice
- Debug Verify -- CAN_FINALIZE_DEBUG, NVVM_IR_VER_CHK interaction
- NVVM Container -- NvvmIRVersion/NvvmDebugVersion in container header